Read Apache HTTP server access log with Pandas

pandas python

In this post we'll see how to read our Apache HTTP server access log into a Pandas dataframe. First of all, we should take a look at the logging documentation to see how the log lines are formatted. In this case, we are going to use the combined log format, which corresponds to:

%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"

Where

%h is the IP address of the client (remote host);

%l is the RFC 1413 identity of the client (usually not determined);

%u is the user id, determined by the HTTP authentication;

%t is the time, by default, formatted as [day/month/year:hour:minute:second zone];

\"%r\" is the request string, formatted as "method resource protocol"

%>s is the status code;

%b is the size of the response in bytes, excluding the HTTP headers;

\"%{Referer}i\" is the Referer HTTP request header and

\"%{User-Agent}i\" is the User-Agent HTTP request header.

Knowing this, let's define our field delimiter and some field parsers.

Although Pandas provides the quotechar and quoting arguments in its readers to handle a quote character (by default, "), allowing the delimiter character to appear within quoted text (we have three fields with quoted text), these don't work properly together with a more complex regular expression used to delimit the columns. So, in our case, we'll need to define a regular expression that splits each line into our initial nine fields. It needs to match the spaces that are neither surrounded by double quotes (") nor by square brackets ([]). It will be:

\s                           # Match space.
(?=(?:[^"]*"[^"]*")*[^"]*$)  # Not surrounded by ".
(?![^\[]*\])                 # Not surrounded by [].
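We can check that this expression does what we want by splitting a sample line before handing it to Pandas (a quick sketch, using the illustrative log line from above with a shortened User-Agent):

import re

line = ('66.249.79.73 - - [08/Nov/2015:08:09:02 +0000] '
        '"GET /blog/ HTTP/1.1" 200 2754 "-" "Mozilla/5.0"')
re.split(r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])', line)
# ['66.249.79.73', '-', '-', '[08/Nov/2015:08:09:02 +0000]',
#  '"GET /blog/ HTTP/1.1"', '200', '2754', '"-"', '"Mozilla/5.0"']

We get nine fields, with the spaces inside the quoted and bracketed fields left untouched.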

Since the time, request, Referer and User-Agent fields are surrounded by quotes or square brackets, we will need a function to remove them. Also, the time needs to be parsed and formatted as a datetime object:

from datetime import datetime
import pytz

def parse_str(x):
    """
    Returns the string delimited by two characters.

    Example:
        `>>> parse_str('[my string]')`
        `'my string'`
    """
    return x[1:-1]

def parse_datetime(x):
    '''
    Parses datetime with timezone formatted as:
        `[day/month/year:hour:minute:second zone]`

    Example:
        `>>> parse_datetime('[13/Nov/2015:11:45:42 +0000]')`
        `datetime.datetime(2015, 11, 13, 11, 45, 42, tzinfo=<UTC>)`

    Due to problems parsing the timezone (`%z`) with `datetime.strptime`, the
    timezone will be obtained using the `pytz` library.
    '''
    dt = datetime.strptime(x[1:-7], '%d/%b/%Y:%H:%M:%S')
    # Offset in minutes, applying the zone's sign to hours and minutes alike.
    sign = -1 if x[-6] == '-' else 1
    dt_tz = sign * (int(x[-5:-3]) * 60 + int(x[-3:-1]))
    return dt.replace(tzinfo=pytz.FixedOffset(dt_tz))
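As a side note, Python 3 (3.2 and later) does parse the `%z` directive, so the same value could be handled without `pytz` at all (a minimal sketch, assuming Python 3):

from datetime import datetime

datetime.strptime('08/Nov/2015:08:09:02 +0000', '%d/%b/%Y:%H:%M:%S %z')
# datetime.datetime(2015, 11, 8, 8, 9, 2, tzinfo=datetime.timezone.utc)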

Next, we can read our access log file. Since the separator is a regular expression, we need the Python parser engine (engine='python'). In my case, I'm not going to keep the second and third columns:

import re
import pandas as pd

data = pd.read_csv(
    'data/access.log',
    sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
    engine='python',
    na_values='-',
    header=None,
    usecols=[0, 3, 4, 5, 6, 7, 8],
    names=['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'],
    converters={'time': parse_datetime,
                'request': parse_str,
                'status': int,
                'size': int,
                'referer': parse_str,
                'user_agent': parse_str})
data.head()
ip time request status size referer user_agent
0 66.249.79.73 2015-11-08 08:09:02+00:00 GET /blog/python-image-processing-libraries-pe... 200 2754 NaN Mozilla/5.0 (compatible; Googlebot/2.1; +http:...
1 123.125.71.41 2015-11-08 08:09:51+00:00 GET / HTTP/1.1 200 1421 NaN Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/2...
2 123.125.71.25 2015-11-08 08:09:52+00:00 GET / HTTP/1.1 200 4747 NaN Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/2...
3 45.55.223.156 2015-11-08 08:10:02+00:00 GET / HTTP/1.0 200 4706 NaN Mozilla/5.0 (compatible; NetcraftSurveyAgent/1...
4 180.76.15.5 2015-11-08 08:13:28+00:00 GET / HTTP/1.1 200 4747 NaN Mozilla/5.0 (compatible; Baiduspider/2.0; +htt...
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50666 entries, 0 to 50665
Data columns (total 7 columns):
ip            50666 non-null object
time          50666 non-null object
request       50666 non-null object
status        50666 non-null int64
size          50666 non-null int64
referer       20784 non-null object
user_agent    50465 non-null object
dtypes: int64(2), object(5)
memory usage: 3.1+ MB
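Note that the time column has object dtype: it holds tz-aware Python datetime objects rather than a native Pandas datetime column. If a datetime64 column were preferred (say, to resample by time later), it could be converted, normalizing every offset to UTC (an optional sketch, not needed for what follows):

data['time'] = pd.to_datetime(data['time'], utc=True)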

We can continue creating new columns from the existing ones and filtering out some rows.

Get resource URI:

request = data.pop('request').str.split()
data['resource'] = request.str[1]
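Since each request string is formatted as "method resource protocol", splitting it leaves the resource in the second position. For a single value:

'GET /blog/ HTTP/1.1'.split()
# ['GET', '/blog/', 'HTTP/1.1']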

Filter out non-GET and non-200 requests:

data = data[(request.str[0] == 'GET') & (data.pop('status') == 200)]

Filter out undesired resources:

data = data[~data['resource'].str.match(
    r'^/media|^/static|^/admin|^/robots\.txt$|^/favicon\.ico$')]

Lastly, we can drop the web crawlers, such as the spiders from Google, Yahoo! and others. As a simple example, we can filter crawlers by User-Agent:

data = data[~data['user_agent'].str.match(
    r'.*?bot|.*?spider|.*?crawler|.*?slurp', flags=re.I).fillna(False)]
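For instance, the pattern catches the Googlebot entries we saw in the first rows above:

import re

bool(re.match(r'.*?bot|.*?spider|.*?crawler|.*?slurp',
              'Mozilla/5.0 (compatible; Googlebot/2.1; '
              '+http://www.google.com/bot.html)', flags=re.I))
# True

The fillna(False) keeps the rows with a missing User-Agent, for which str.match returns NaN, instead of dropping them.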

and by IP:

data = data[~data['ip'].str.startswith('123.125.71.')]  # Baidu IPs.
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8840 entries, 3 to 50665
Data columns (total 6 columns):
ip            8840 non-null object
time          8840 non-null object
size          8840 non-null int64
referer       2266 non-null object
user_agent    8808 non-null object
resource      8840 non-null object
dtypes: int64(1), object(5)
memory usage: 483.4+ KB

In the next post, we will do some analysis with this data.