Yeah, but I need help with IP bans. I've just been banning the users and deleting them, but they keep coming back.
An Aussie car forum I frequent has gone through a lot of processes to cull bot invasions. Maybe some of this jargon can help you guys, as it's cleaned up that site very well. I'm just copy-pasting some intel the admin posted, in case it helps you, Pascuali, for I have no idea myself.
Here is the CPU utilisation (user) for all cores since midnight:
12:20:01 95.33
12:40:01 95.03
01:00:02 95.68
01:20:02 94.19
01:40:01 94.33
02:00:01 96.03
02:20:01 90.25
02:40:01 94.11
03:00:02 91.88
03:20:03 22.90
03:40:01 81.86
04:00:01 73.29
04:20:01 87.73
04:40:01 80.32
05:00:01 81.92
05:20:01 96.75
05:40:01 77.94
06:00:01 70.08
06:20:01 77.57
06:40:02 95.66
07:00:01 90.83
07:06:08 RESTART
07:20:01 86.88
07:40:02 88.08
08:00:01 84.57
08:20:01 48.71
08:40:01 91.88
09:00:01 94.26
All of which are just too high, and they translate to server load ratings 10-20 times what they should be. The CPU utilisation is split between the web and database servers, with the latter using the most, which is what I'd expect when handling large numbers of requests.
As a short-term measure, I've reduced the session timeout, which means those bots that are 'stopped' will time out more quickly; that has reduced the guest numbers from 35k to 18k. The downside of that approach is that real users will get asked to log in more frequently.
I really don't want to go down the Cloudflare route, as it has a couple of big gotchas from my PoV.
Instead, I have disabled guest access. This doesn't help a great deal in the short term: the AI bots are currently blocked from seeing anything anyway, and if you could see what they were doing, they would mostly be looking at an error message telling them they had no access. It probably doesn't even help in the long term, because no one cares whether they are actually scraping any content anyway.
It's hard to know what percentage of the guests are legitimate and which are AI bots without extracting a whole day of log entries (1.6 GB on a day like yesterday), sorting them by IP address, and then looking up the heavily used IP ranges to see where they come from. There are some well-known hosts (including AWS) for the AI bots, so it might be worth the exercise to see.
Somebody asked earlier about email notifications: emails (except a couple of types) will be deferred when CPU load is above a set threshold, and once the server drops below that, they get sent in a big bunch, albeit restricted to batches of 150 at a time.
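That defer-then-flush scheme can be sketched roughly as below. This is a minimal sketch, not the forum software's actual code: the load threshold, the queue, and the send() hand-off are my assumptions; only the batch size of 150 comes from the post.

```python
# Sketch of "defer email while load is high, then flush in batches of 150".
# LOAD_THRESHOLD and send() are placeholders; BATCH_SIZE is from the post.
import os
from collections import deque

LOAD_THRESHOLD = 4.0   # assumed 1-minute load average cutoff
BATCH_SIZE = 150       # per the post: batches of 150 at a time

deferred = deque()

def send(message):
    ...  # hand the message off to the MTA (stubbed here)

def queue_or_send(message):
    """Send immediately unless the server is busy; otherwise defer."""
    if os.getloadavg()[0] > LOAD_THRESHOLD:
        deferred.append(message)
    else:
        send(message)

def flush_deferred():
    """Called periodically; drains at most one batch per call."""
    if os.getloadavg()[0] > LOAD_THRESHOLD:
        return  # still busy, keep holding the queue
    for _ in range(min(BATCH_SIZE, len(deferred))):
        send(deferred.popleft())
```

Capping each flush at one batch is what stops the backlog itself from spiking the load the moment it clears.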
Credit also goes to another member for his help with a script that parses the access log file, sorts the IP addresses into Class B (/16) ranges, and summarises the access attempts for each. That identified that between 4 and 6 million access requests per day were not valid. From this information, I've been able to identify the IP address ranges that have been flooding us with requests. A lot of them were originally coming from Vietnam, so I blocked all of those; then they started coming from South American countries, and those are now blocked as well.
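For anyone curious, the general shape of such a script is simple to sketch. This is my guess at the approach, not the actual script, and it assumes the standard common/combined log format where the client IP is the first whitespace-separated field of each line.

```python
# Bucket access-log requests by Class B (/16) range and count them.
# Assumes common/combined log format: client IP is the first field.
from collections import Counter

def class_b(ip):
    """Collapse a dotted-quad IPv4 address into its Class B (/16) range."""
    a, b, _rest = ip.split(".", 2)
    return f"{a}.{b}.0.0/16"

def summarise(log_lines, top=20):
    """Count requests per /16 range and return the heaviest hitters."""
    counts = Counter()
    for line in log_lines:
        ip = line.split(" ", 1)[0]
        if ip.count(".") == 3:          # crude IPv4 sanity check
            counts[class_b(ip)] += 1
    return counts.most_common(top)
```

Feed it a day's log, e.g. `summarise(open("access.log"))`, and the top ranges are the ones worth looking up with whois to see whether they're residential users or bulk hosting.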
You never actually win for long, as they just use another range of addresses, but having this information is a big help. To show what the impact of the bot activity is: we would normally use about 25 GB of data and handle ~2M responses and a similar number of pages per day.
On the 7th the stats were 46.3 GB / 6.655M responses / 6.569M pages;
on the 8th, 36.9 GB / 5.288M responses / 5.2M pages;
on the 9th, 41.7 GB / 4.604M responses / 4.509M pages;
on the 10th, 35.8 GB / 4.53M responses / 4.468M pages;
on the 11th, 53.8 GB / 8.653M responses / 8.562M pages; and
yesterday it was down to 25.86 GB / 2.82M responses / 2.76M pages.
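The range-blocking step a few paragraphs up can be scripted straight off the log summary. A sketch, assuming iptables is the firewall in use (the post doesn't say which tool was actually used) and with a placeholder CIDR rather than any range actually blocked:

```python
# Sketch: drop all traffic from a given range at the firewall.
# iptables is an assumption; the admin didn't name the tool used.
import subprocess

def block_cmd(cidr):
    """Build an iptables rule that drops all traffic from the given range."""
    return ["iptables", "-I", "INPUT", "-s", cidr, "-j", "DROP"]

def block_range(cidr, dry_run=True):
    cmd = block_cmd(cidr)
    if dry_run:
        print(" ".join(cmd))              # show what would run
    else:
        subprocess.run(cmd, check=True)   # needs root

# Placeholder range (TEST-NET-3), not a real block-list entry:
block_range("203.0.113.0/24")
```

Since the post notes the bots just move to new ranges, keeping the rules scripted (rather than typed by hand) makes re-blocking the next wave a one-liner.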