Analyze an Apache HTTP server access log with pandas

data-analysis pandas python

In the last post we saw how to read an Apache HTTP server access log with pandas. We ended with a dataframe structured like this:

ip       timestamp                  size  referer                  user_agent                                         resource
X.X.X.X  2015-11-23 18:17:40+00:00  5303  NaN                      Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/2...  /
X.X.X.X  2015-11-23 18:52:14+00:00  1550  https://duckduckgo.com   Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:42....  /
X.X.X.X  2015-11-23 19:16:48+00:00  1513  NaN                      Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/2...  /
X.X.X.X  2015-11-23 19:16:56+00:00  5303  NaN                      Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/2...  /
X.X.X.X  2015-11-23 19:24:38+00:00  2754  https://www.google.com/  Mozilla/5.0 (Windows NT 6.3) AppleWebKit/537.3...  /querying_hive
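
In case you don't have the previous post at hand, here is a minimal sketch of how such a dataframe could be rebuilt. The file name access.log and the combined log format are assumptions, not necessarily what the previous post used:

import re
import numpy as np
import pandas as pd

# Apache combined log format:
# ip - - [timestamp] "METHOD resource HTTP/x.x" status size "referer" "user_agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"\S+ (?P<resource>\S+) [^"]*" '
    r'\d+ (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"')

with open('access.log') as f:  # hypothetical file name
    matches = (LOG_PATTERN.match(line) for line in f)
    data = pd.DataFrame(m.groupdict() for m in matches if m)

data['timestamp'] = pd.to_datetime(data['timestamp'],
                                   format='%d/%b/%Y:%H:%M:%S %z')
data['size'] = pd.to_numeric(data['size'], errors='coerce')  # '-' becomes NaN
data['referer'] = data['referer'].replace('-', np.nan)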

Here we'll do some data wrangling and aggregation to display information about this website's visits. Let's start:

import pandas as pd
import matplotlib.pyplot as plt

Referer

Information about the pages that linked to resources on our site.

referers = data['referer'].dropna()

Referer domains

The two most common referer domains, as a fraction of all referers (normalized):

domains = referers.str.extract(r'^(https?://)?(www\.)?([^/]*)')[2].str.lower()
domains.value_counts()[:2].divide(domains.count())
mastortosa.com    0.280564
google.com        0.145877
Name: 2, dtype: float64

The most common referers are pages of this website.
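
The same extraction can be done without the regex, leaning on the standard library instead; a sketch:

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2

def referer_domain(url):
    # urlparse only fills netloc when a scheme is present, so add one if missing
    netloc = urlparse(url if '://' in url else 'http://' + url).netloc.lower()
    return netloc[4:] if netloc.startswith('www.') else netloc

domains = referers.apply(referer_domain)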

Google searches

Google search queries that linked to this website:

google_searches = referers[referers.str.contains(
    r'^(?:https?://)?(?:www\.)?google\.[^/]*/search\?')]
google_queries = google_searches.str.extract(r'[?&]q=([^&]*)')
google_queries = google_queries.str.replace(r'\+', ' ')
google_queries[:5]
3812             scikit image vs opencv
4143     pandas code datetime.timedelta
5276                 opencv get skimage
5277                 opencv get skimage
5974    comparison opencv scipy.ndimage
Name: referer, dtype: object
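
Note that replacing '+' only undoes part of the URL encoding; %-escapes would survive it. A more thorough decode of the q parameter, again with the standard library:

try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2

def search_query(url):
    # parse_qs decodes both '+' and %-escapes in the query string
    params = parse_qs(urlparse(url).query)
    return params['q'][0] if 'q' in params else None

google_queries = google_searches.apply(search_query).dropna()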

Time

Information about the visits over time.

Visits by week day

Normalized count of visits by weekday (Monday being the first day of the week):

data['weekday'] = data['timestamp'].apply(lambda x: x.isoweekday())
weekdays = data.groupby('weekday')['ip'].agg(len)
weekdays = weekdays.divide(weekdays.sum())
weekdays
weekday
1    0.145757
2    0.142374
3    0.134118
4    0.144539
5    0.139261
6    0.152389
7    0.141562
Name: ip, dtype: float64
weekdays.index = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun']
weekdays.plot(kind='barh')
plt.title('Visits over the week')
plt.xlabel('visits (normed)')
plt.show()

Pandas barh plot

No big differences between days, but Saturday leads in number of visits.
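
As a side note, if the timestamp column has a datetime64 dtype (an assumption; tz-aware timestamps may end up as plain objects on some pandas versions), the apply above can be replaced by a vectorized call:

# dayofweek is 0-based (Monday=0), so add 1 to match isoweekday's 1..7
data['weekday'] = data['timestamp'].dt.dayofweek + 1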

Total daily visits

Plot the daily visit counts since June 2015:

visits = data['resource'].copy()
visits.index = data['timestamp']
visits = visits.resample('D', how='count', kind='period')
visits.index.name = 'date'
visits['6/2015':].plot()
plt.title('Total visits')
plt.ylabel('visits')
plt.show()

Pandas timeseries plot
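
A note for readers on newer pandas versions: the how argument of resample was later deprecated, so the resampling line above becomes:

# pandas >= 0.18: the aggregation is a method call instead of `how=`
visits = visits.resample('D', kind='period').count()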

Content

Information about the content visited within the website.

Tag searches

Searches made on the site by tag:

visits = data['resource'].copy()
tags = visits[visits.str.match(r'/tags/')]
tags = tags.str.extract(r'/tags/(.*)')
tags.value_counts().plot(kind='pie', colors=list('rgbymc'))
plt.title('Tag searches')
plt.xlabel('')
plt.ylabel('')
plt.show()

Pandas pie plot
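
If you want the actual shares rather than the pie chart, the same divide trick used for the referer domains applies:

# Normalized share of each tag search
tags.value_counts().divide(tags.count())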

Entries visited

To get the visits per entry we need to clean the resource URI. In this case, we have to filter out visits to the home page (/) as well as the searches by tag (/tags/{tag}). Also, this website previously had two sections, blog and lab, so visits to those section roots have to be filtered out too, and the entries within them mapped to the current URL scheme: everything after /blog/ or /lab/ is treated as if it were appended directly to the root /:

visits = data['resource'].copy()
visits.index = data['timestamp']
entries = visits[visits.str.match(
    r'(?!.*/tags)(?!^/blog/$)(?!^/lab/$)/[^?]+$')]
entries = entries.str.replace(r'/blog/|/lab/', '/')
entries[:10]
timestamp
2015-02-13 08:25:06+00:00                  /simple-web-analytics-python-pandas
2015-02-13 08:25:17+00:00    /web-visits-data-analytics-javascript-python--...
2015-02-13 08:42:14+00:00    /web-visits-data-analytics-javascript-python--...
2015-02-13 08:42:24+00:00                  /simple-web-analytics-python-pandas
2015-02-13 08:58:19+00:00                  /simple-web-analytics-python-pandas
2015-02-13 08:58:26+00:00    /web-visits-data-analytics-javascript-python--...
2015-02-13 09:01:38+00:00    /web-visits-data-analytics-javascript-python--...
2015-02-13 09:02:04+00:00                  /simple-web-analytics-python-pandas
2015-02-13 10:18:36+00:00    /web-visits-data-analytics-javascript-python--...
2015-02-13 10:18:39+00:00                  /simple-web-analytics-python-pandas
Name: resource, dtype: object
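
Before trusting the regex, we can double-check what it keeps and what it drops against a few hand-made paths (the sample paths below are made up for illustration):

import re

pattern = re.compile(r'(?!.*/tags)(?!^/blog/$)(?!^/lab/$)/[^?]+$')
samples = ['/', '/tags/python', '/blog/', '/lab/',
           '/blog/querying_hive', '/optimization_scipy']
for path in samples:
    # str.match anchors at the start of the string, just like re.match
    print('%-25s %s' % (path, bool(pattern.match(path))))

Only the last two paths survive the filter.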

The set of entries:

for i in entries.unique():
    print(i)
/simple-web-analytics-python-pandas
/web-visits-data-analytics-javascript-python--geoip
/python-image-processing-libraries-performance-opencv-scipy-scikit-image
/freelance-invoices-manager
/hadoop_practical_introduction_mapreduce_python
/hadoop_streaming_practical_introduction_mapreduce_python
/data_analysis_apache_pig_practical_introduction
/data_analyisis_apache_hive_practical_introduction
/loading_data_hive
/querying_hive
/interpolation_scipy
/optimization_scipy
/least_squares_fitting_numpy_scipy
/read_apache_access_log_pandas

To get the number of visits to an entry within a period of time, like Optimization methods in Scipy during November, just:

(entries['11/2015'] == '/optimization_scipy').sum()
21
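
This check generalizes to a small helper (entry_visits is a hypothetical name, not something from the original post):

def entry_visits(entry, period):
    # Count the visits of a single entry within a time period
    return (entries[period] == entry).sum()

entry_visits('/optimization_scipy', '11/2015')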

Show the daily visits during November for three different entries, together with the total visits across all entries:

urls = [
    '/python-image-processing-libraries-performance-opencv-scipy-scikit-image',
    '/interpolation_scipy',
    '/optimization_scipy'
]
entries = entries['11/2015']
all_entries = entries.resample('D', how='count', kind='period')
for i, url in enumerate(urls):
    entry = entries[entries==url].resample('D', how='count', kind='period')
    entry.index.name = 'date'
    plt.subplot(3, 1, i + 1)
    plt.title(url)
    all_entries.plot(kind='area', color='k', alpha=.1)
    entry.plot(kind='area', color='g', alpha=.7)
    plt.legend(['all entries', 'entry'], prop={'size': 12})
    plt.xlabel('')
    plt.ylabel('visits')
plt.subplots_adjust(hspace=.5)
plt.show()

Pandas timeseries area plot

This month, the entry Python image processing libraries performance: OpenCV vs Scipy vs Scikit-Image is the most popular on this blog.
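
That claim can be checked directly; since entries was already sliced to November above:

# Most visited entry of the month
entries.value_counts().idxmax()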