An Awk Line to Generate a List of 404 Pages Sorted By Number of Hits

Note: This post assumes a WHM based environment. If you are using Plesk or anything else, then the log path may be different. Please take that into consideration.

With time, most websites will have an increasing number of 404 pages. This is natural and it is caused by changes in the link structure, wrong incoming links, deleted/moved files, etc… The majority of websites typically ignore 404s, but we know that this is not the best route to take, especially in Joomla: in a previous post, we have explained the dramatic load issues that 404 pages can cause on Joomla websites. Now while we did propose, in that post, a core solution to address the consequences of 404 pages (which is also critical to lessen the effects of DoS attacks), that doesn’t mean that we think that 404 pages should be left unattended. In fact, we think that, as a Joomla administrator, you should analyze your logs often (ideally once a month), and, at the very least, address the 404 pages that appear the most in your logs to ensure long term stability of your website and your server (not to mention the positive SEO implications of doing so).

To make that task easier, we have devised a line, just one simple line, which will list all the 404 pages that you have in your logs reversely ordered by the number of hits. This will allow you to address the 404s that matter most. The line consists of a shell command using awk, a very powerful Linux shell tool mainly used for searching and analyzing files. Without further delay, here is the line (note that you must ssh as root in order to run the below command):

awk '{ if ($9 == 404) print $7 }' /home/domlogs/cpanel-user/yourjoomlawebsite.com | sort | uniq -c | sort -nr > 404-urls.txt

(Note: cpanel-user is the cPanel account username of your domain. Yes, duh!)

The above command checks each line (in the log file) if its 9th field is 404, if it is, then it will extract the URL from that line (which is the 7th field), put it in temporary memory, and then get a list of all the unique URLs as well as the number of their occurrences in the original log file. Finally, it sorts the list of unique URLs based on the number of occurrences of each URL, and then outputs the result to the file 404-urls.txt.

But why not just use Google Search Console to get the list of 404 pages?

Google Search Console (previously known as Google Webmaster Tools) generates a list of 404 pages on any website. However, that list is often out of date and does not generate the accurate results that one gets when examining the Apache access logs. Nevertheless, it is still a good idea to address all the issues that Google Search Console finds on your website, including 404 pages.

We hope that you found this post useful. If it didn’t work for you, then it might be that your log paths are different on your server. If you can’t find them or if you need help with the implementation, then please contact us. Our work is top notch, our fees are affordable, and we are the most courteous and friendliest Joomla developers on planet Earth.

No comments yet.

Leave a comment