A Downloadable List of the Top 500 User Agent Strings on a High Traffic Joomla Website

This morning, we had a little time to do something fun, so we wrote a command to generate a list of all the user agent strings (or signatures) seen on a very high traffic Joomla website that we maintain. “Why?”, we hear you ask… Well, because 1) we were curious about which user agent string is the most common (at least among that website’s visitors), and 2) we had some security concerns after the 500 HTTP errors that we found on that particular website a few days ago, which were caused by a fake user agent signature.

So, what is the most common user agent string as of October 2016?

Well, if you really are interested, the most common user agent string (as of October 2016) is (drum roll, please):

Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko

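In case you’re wondering which browser the above user agent string belongs to, it’s Internet Explorer 11 on 64-bit Windows 7: Trident/7.0 is Internet Explorer’s rendering engine and rv:11.0 is the browser version, while Windows NT 6.1 and WOW64 identify the operating system. The “Mozilla/5.0” and “like Gecko” tokens are only there for compatibility and do not mean Firefox (which we love, although we have to say it’s becoming more memory hungry with each iteration).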

And what are the top 500 user agent strings?

A list containing the top 500 user agent strings can be downloaded here. The original raw list contained 53,530 unique user agent strings (yes, unique), and we generated it by issuing the following command in the Linux shell. Note that the command below returns only the top 500 user agent strings; if you want the whole list, delete the | sed -n 1,500p part from it:

awk -F\" '{print $6}' [ourclientjoomlawebsite].com-Oct-2016 | sort | uniq -c | sort -k1nr | awk '!($1="")' | sed -n 1,500p > user-agents.txt

We know that you’re curious about the above set of commands, so let us dissect them for you:

  • The first command…

    awk -F\" '{print $6}' [ourclientjoomlawebsite].com-Oct-2016

    …grabs the 6th field from the uncompressed Apache log file, using the double quote as the field separator. In the commonly used combined log format, that 6th quote-delimited field is the user agent string.

  • The second command…

    | sort

    …takes the output of the first command (a raw list of all the user agent strings) as input and sorts it by user agent. This is necessary for the subsequent command, because uniq only merges identical lines that are adjacent.

  • The third command…

    | uniq -c

    …creates a unique list of all the user agent strings (out of the result of the second command), where each line consists of a user agent string preceded by the number of times that user agent string occurred in the logs.

  • The fourth command…

    | sort -k1nr

    …sorts the list returned by the previous command numerically by the number of occurrences of each user agent string, in descending order.

  • The fifth command…

    | awk '!($1="")'

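    …blanks out the first column (which contains the number of occurrences of each user agent string) and prints the rest of the line. Note that each resulting line keeps a leading space where the count used to be.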

  • The sixth command…

    | sed -n 1,500p

    …returns the first 500 lines (i.e. the top 500 user agent strings, ordered by number of occurrences, descending).

  • The seventh and last part (a redirection rather than a command)…

    > user-agents.txt

    …writes the output to the user-agents.txt file, overwriting the file if it already exists.
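To see the steps above in action without a multi-gigabyte log, here is a minimal sketch that runs the same pipeline on a tiny made-up log. The file names (sample-access.log, ranked-with-counts.txt, user-agents-sample.txt), the IP addresses, and the AgentA/AgentB user agents are all hypothetical, and the sample lines assume the combined log format:

```shell
# Hypothetical sample data: four requests in Apache "combined" log format.
cat > sample-access.log <<'EOF'
203.0.113.5 - - [01/Oct/2016:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "AgentA/1.0"
203.0.113.6 - - [01/Oct/2016:10:00:01 +0000] "GET /a HTTP/1.1" 200 128 "-" "AgentB/2.0"
203.0.113.7 - - [01/Oct/2016:10:00:02 +0000] "GET /b HTTP/1.1" 404 64 "http://example.com/" "AgentA/1.0"
203.0.113.8 - - [01/Oct/2016:10:00:03 +0000] "GET /c HTTP/1.1" 200 256 "-" "AgentA/1.0"
EOF

# Steps 1 to 4: extract the user agent (6th quote-delimited field),
# then count the occurrences and rank them, most frequent first.
awk -F\" '{print $6}' sample-access.log | sort | uniq -c | sort -k1nr > ranked-with-counts.txt

# Steps 5 to 7: drop the count column, keep at most 500 lines, save the result.
awk '!($1="")' ranked-with-counts.txt | sed -n 1,500p > user-agents-sample.txt

# Show both stages side by side.
cat ranked-with-counts.txt
cat user-agents-sample.txt
```

Keeping the intermediate ranked-with-counts.txt file is optional, but it is handy when you want the counts as well as the ranking.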

As you can see, it’s not that hard once it is explained! Admittedly, it is still a bit tricky, but it clearly highlights the power of shell commands in Linux (if you’re a PHP developer, then imagine how much code would be needed to do all the above in PHP).

If you decide to run the above set of commands, then please keep in mind that it will take about 10-15 minutes to process a 10 GB log file even when using a relatively powerful server with an SSD drive.
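If the run time is a concern, one possible variation (not the command we used, and not something we have timed on the log above) is to count the user agents inside awk, so that only the unique strings are sorted instead of every log line. The sketch below assumes the unique user agent strings fit in memory; the file names and sample data are hypothetical:

```shell
# Hypothetical sample data in Apache "combined" log format.
cat > access-sample.log <<'EOF'
203.0.113.5 - - [01/Oct/2016:10:00:00 +0000] "GET / HTTP/1.1" 200 512 "-" "AgentA/1.0"
203.0.113.6 - - [01/Oct/2016:10:00:01 +0000] "GET /a HTTP/1.1" 200 128 "-" "AgentB/2.0"
203.0.113.7 - - [01/Oct/2016:10:00:02 +0000] "GET /b HTTP/1.1" 200 64 "-" "AgentA/1.0"
EOF

TAB="$(printf '\t')"

# Single pass over the log: count each user agent in an awk array,
# then print "count<TAB>user agent" for every unique string.
# LC_ALL=C asks for plain byte comparisons, which is usually cheaper.
LC_ALL=C awk -F\" '{count[$6]++} END {for (ua in count) print count[ua] "\t" ua}' access-sample.log \
  | LC_ALL=C sort -t "$TAB" -k1,1nr \
  | head -n 500 \
  | cut -f2- > user-agents-fast.txt

cat user-agents-fast.txt
```

Because cut removes the count column here, the resulting lines have no leading space. User agents with the same count may come out in a different order than with the original command.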

A couple of notes about the generated file…

The Apache log that we used to generate the list is that of a very professional website, whose main visitors work for professional companies and governmental agencies. Additionally, the website in question has the vast majority of the known spam bots blocked (which explains the absence of many mainstream spam bots from the list).

We hope that you found this post fun and useful. If you need help generating an even more interesting set of information from the Apache logs of your Joomla website, then please contact us. We are particularly fond of Linux (and awk), we love our job, we love our clients, and we won’t charge you much!
