Finding popular images in Apache logs

I’m sure there are a million web pages out there that cover this already but here’s how I do it:

First, pull the interesting URIs out of the access log (the request path is field 7 in the common and combined log formats). Here I look for everything ending in .png or .jpg; a stricter filter would also drop all the WordPress stuff, all the thumbnails, all the requests that didn’t return 200, etc.

awk '{print $7}' logs/www_log | grep -E '\.(jpg|png)$'
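
A minimal sketch of the stricter filter hinted at above, done in a single awk call. It assumes the common or combined log format (request path in field 7, status code in field 9) and that thumbnails follow the WordPress naming pattern name-150x150.jpg; image_hits is a made-up helper name, so adjust to taste:

```shell
# Hypothetical helper: print image URIs from successful requests only.
# Assumes common/combined log format: $7 = request path, $9 = status code.
image_hits() {
    awk '
        $9 == 200 &&                               # keep only 200 responses
        $7 ~ /\.(jpg|png)$/ &&                     # path must end in .jpg or .png
        $7 !~ /-[0-9]+x[0-9]+\.(jpg|png)$/ {       # skip name-WIDTHxHEIGHT thumbnails
            print $7
        }
    ' "$@"
}
```

Call it as image_hits logs/www_log (or feed it a log on stdin) and pipe the result into the counting step.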

Then a quick awk script counts the number of times each image has been requested. Because this script keeps every image URI in an associative array, it probably won’t scale to millions of distinct images. (But it will scale to millions of hits on a few images.)

awk '{ linecount[$0]++ } END { for (elem in linecount) print elem" "linecount[elem] }' | sort -r -n -k 2
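
If the associative array ever does get too big, one alternative (not what I use above, just a sketch) is to let sort and uniq do the counting, since sort typically spills to temporary files rather than holding everything in memory. count_hits is a hypothetical name; the trailing awk only flips uniq's "count URI" output back into "URI count" order, and it assumes URIs contain no spaces:

```shell
# Hypothetical helper: read one URI per line on stdin,
# print "URI count" with the most requested URI first.
count_hits() {
    sort |                       # group identical URIs together
        uniq -c |                # collapse each group to "count URI"
        sort -rn |               # highest count first
        awk '{ print $2, $1 }'   # reorder to "URI count"
}
```

It drops in wherever the awk counting script would go, at the cost of sorting every hit instead of every distinct image.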

Pipe the two commands together for UNIXy goodness. This produces output like so (with a more aggressive filter regex in the first step):

/wp-content/uploads/2010/01/IMG_7203.jpg 1543
/wp-content/uploads/2010/01/IMG_5761.jpg 1466
/wp-content/uploads/2010/01/IMG_5765.jpg 857
/wp-content/uploads/2010/01/IMG_5764.jpg 732
/wp-content/uploads/2010/01/IMG_5763.jpg 678
/wp-content/uploads/2010/01/IMG_7197.jpg 409
/wp-content/uploads/2010/01/IMG_7198.jpg 406
/wp-content/uploads/2010/01/box-plans-6.jpg 102
/wp-content/uploads/2010/01/box-plans-5.png 87
/wp-content/uploads/2010/01/box-plans-4.png 34
/wp-content/uploads/2010/01/box-plans-2.png 33
/wp-content/uploads/2010/01/box-plans-1.png 26
/wp-content/uploads/2010/01/box-plans-3.png 22
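
For reference, here is roughly what the whole thing looks like folded into a single function. This is a sketch, not the exact command that produced the output above: popular_images is a hypothetical name, the filter is the simple .jpg/.png one, and it assumes the request path is field 7:

```shell
# Hypothetical wrapper: print "URI count" for image requests, most popular first.
# Usage: popular_images logs/www_log
popular_images() {
    awk '
        $7 ~ /\.(jpg|png)$/ { count[$7]++ }                  # tally each image URI
        END { for (uri in count) print uri, count[uri] }     # dump the tallies
    ' "$@" |
        sort -k2,2nr                                         # sort by count, descending
}
```

Append | head -n 20 if only the top of the list is interesting.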
