A Simple .htaccess Rule to Punish Bad Bots Hitting Your Joomla Site

We manage quite a few large Joomla sites, and all of them get their fair share of bad traffic from bad bots. Most system administrators simply block these bad bots with an .htaccess rule. At itoctopus, we go the extra mile: we punish those bots.

How do we do that? Well, with an .htaccess rule of course:

RewriteCond %{HTTP_USER_AGENT} (BadBot1|BadBot2|BadBot3|BadBot4|BadBot5) [NC]
RewriteRule ^ http://127.0.0.1/ [R,L]

So how does the above rule punish the bad bots?

The above rule checks whether the user agent is BadBot1, BadBot2, BadBot3, BadBot4, or BadBot5, and if it is, it redirects the request to 127.0.0.1, the loopback address, which always points back to the machine that issued the request. This means that any attempt by the bad bot to index the website ends up redirecting the bot to the web server on its own machine. This will confuse the bot (not to mention increase the load on the originating server) and will typically require intervention by the bad bot’s system administrator, who will have to manually blacklist the website from being indexed. A simple, funny, and yet very powerful rule! Don’t you agree?
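
For context, here is a rough sketch of how such a rule might sit in a Joomla site’s .htaccess file. The bot names are placeholders, and the surrounding directives are the usual mod_rewrite boilerplate; Joomla’s default .htaccess already contains a RewriteEngine On line, so in practice you would only add the two bot lines below it:

# Punish bad bots by redirecting them to their own machine
<IfModule mod_rewrite.c>
RewriteEngine On
# Replace BadBot1, BadBot2, ... with the actual user agent strings you want to punish
RewriteCond %{HTTP_USER_AGENT} (BadBot1|BadBot2) [NC]
RewriteRule ^ http://127.0.0.1/ [R,L]
</IfModule>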

So, how do you know which bad bots are visiting your Joomla website?

Before answering the question, it is important to explain that the term “bad bot” is subjective: what is a bad bot for one website might be a good bot for another. For example, for most of our US clients, we block the “Baidu” and the “Yandex” bots, which belong to the top Chinese and the top Russian search engines, respectively. Sites located in China, on the other hand, would never think of blocking Baidu (doing so would practically kill their search traffic). So, again, the term “bad bot” is subjective.
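
As an illustration, here is what the punishing rule might look like for a site that considers Baidu and Yandex unwanted. Baidu’s crawler typically identifies itself with a user agent containing “Baiduspider” and Yandex’s with “YandexBot”, but you should verify the exact strings against your own logs before relying on them:

RewriteCond %{HTTP_USER_AGENT} (Baiduspider|YandexBot) [NC]
RewriteRule ^ http://127.0.0.1/ [R,L]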

Now that we have established that the term “bad bot” is subjective, it is up to you to decide which bots you want to block. In order to do that, you will first need a list of the top user agents visiting your website, which you can get with the following shell command (for an explanation of this command, check this post):

awk -F\" '{print $6}' /usr/local/apache/domlogs/yourjoomlawebsite.com | sort | uniq -c | sort -k1nr | awk '!($1="")' | sed -n 1,500p > user-agents.txt

The above command generates a list of the top 500 user agent strings visiting your Joomla website and stores it in the user-agents.txt file. Once you have that file (note that it might take a few minutes to generate), you can go through it and identify the bots (unfortunately, there is no reliable script for singling out bots, so this must be done manually), and then decide which of them you want to block.
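
If the list is long, a rough first pass that may help (it is only a heuristic and does not replace a manual review, since not every bot advertises itself) is to grep the file for words that commonly appear in crawler user agents:

grep -iE 'bot|crawl|spider|slurp' user-agents.txt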

Note that if your website uses HTTPS instead of HTTP, then you should use the below command instead (it is the same as the above, except that the log filename has -ssl_log appended to it):

awk -F\" '{print $6}' /usr/local/apache/domlogs/yourjoomlawebsite.com-ssl_log | sort | uniq -c | sort -k1nr | awk '!($1="")' | sed -n 1,500p > user-agents.txt

The above two commands apply to a WHM-based environment. If you use Plesk or Webmin (or something else), then a good idea would be to check with your host for the location of your domain logs. Of course, you can always contact us for help and we’ll take care of those bad bots for you quickly, professionally, and affordably!
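
One last note for readers who are not on a control panel at all: on a stock Apache installation the access log usually lives under /var/log/apache2/ on Debian/Ubuntu systems or /var/log/httpd/ on RHEL/CentOS systems, though your distribution or virtual host configuration may place it elsewhere. The same pipeline applies; only the path changes, for example:

awk -F\" '{print $6}' /var/log/apache2/access.log | sort | uniq -c | sort -k1nr | awk '!($1="")' | sed -n 1,500p > user-agents.txt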
