Read Apache HTTP server access log with Pandas
pandas pythonIn this post we'll see how to read our Apache HTTP server access log into a Pandas dataframe. First of all, we should take a look to the logging documentation to see how the log lines are formatted. In this case, we are going to use the combined log format, which corresponds to:
%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"
Where
%h is the IP address of the client (remote host);
%l is RFC 1413 identity (not determined);
%u is the user id, determined by the HTTP authentication;
%t is the time, by default, formatted as [day/month/year:hour:minute:second zone];
\"%r\" is the request string, formatted as "method resource protocol"
%>s is the status code;
%O is the request size;
\"%{Referer}i\" is the Referer HTTP request header and
\"%{User-Agent}i\" is the User-Agent HTTP request header.
Knowing this, let's define our field delimiter and some field parsers.
Although Pandas provide the quotechar and quoting arguments in the readers to skip a quoted character (by default, "), allowing the delimiter character within a quoted text (we have three fields with quoted text), these doesn't work properly in a more complex regular expression to delimit the columns. So, in our case, we'll need to define a regular expression to split each line to get our initial nine fields. This needs to match the spaces not surrounded by double quotes (") nor squared brackets ([]). Then, it will be:
\s                           # Match space.
(?=(?:[^"]*"[^"]*")*[^"]*$)  # Not surrounded by ".
(?![^\[]*\])                 # Not surrounded by [].
Time, request, Referer and User-Agent are surrounded by quotes or squared brackets, we will need a function to remove them. Also, the time needs to be parsed and formated as a datetime object:
from datetime import datetime
import pytz
def parse_str(x):
    """
    Returns the string delimited by two characters.
    Example:
        `>>> parse_str('[my string]')`
        `'my string'`
    """
    return x[1:-1]
def parse_datetime(x):
    '''
    Parses datetime with timezone formatted as:
        `[day/month/year:hour:minute:second zone]`
    Example:
        `>>> parse_datetime('13/Nov/2015:11:45:42 +0000')`
        `datetime.datetime(2015, 11, 3, 11, 45, 4, tzinfo=<UTC>)`
    Due to problems parsing the timezone (`%z`) with `datetime.strptime`, the
    timezone will be obtained using the `pytz` library.
    '''
    dt = datetime.strptime(x[1:-7], '%d/%b/%Y:%H:%M:%S')
    dt_tz = int(x[-6:-3])*60+int(x[-3:-1])
    return dt.replace(tzinfo=pytz.FixedOffset(dt_tz))
Next, we can read our access log file. In my case I'm not going to save the second and third column:
import re
import pandas as pd
data = pd.read_csv(
    'data/access.log',
    sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
    engine='python',
    na_values='-',
    header=None,
    usecols=[0, 3, 4, 5, 6, 7, 8],
    names=['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'],
    converters={'time': parse_datetime,
                'request': parse_str,
                'status': int,
                'size': int,
                'referer': parse_str,
                'user_agent': parse_str})
data.head()
| ip | time | request | status | size | referer | user_agent | |
|---|---|---|---|---|---|---|---|
| 0 | 66.249.79.73 | [08/Nov/2015:08:09:02 +0000] | GET /blog/python-image-processing-libraries-pe... | 200 | 2754 | NaN | Mozilla/5.0 (compatible; Googlebot/2.1; +http:... | 
| 1 | 123.125.71.41 | [08/Nov/2015:08:09:51 +0000] | GET / HTTP/1.1 | 200 | 1421 | NaN | Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/2... | 
| 2 | 123.125.71.25 | [08/Nov/2015:08:09:52 +0000] | GET / HTTP/1.1 | 200 | 4747 | NaN | Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/2... | 
| 3 | 45.55.223.156 | [08/Nov/2015:08:10:02 +0000] | GET / HTTP/1.0 | 200 | 4706 | NaN | Mozilla/5.0 (compatible; NetcraftSurveyAgent/1... | 
| 4 | 180.76.15.5 | [08/Nov/2015:08:13:28 +0000] | GET / HTTP/1.1 | 200 | 4747 | NaN | Mozilla/5.0 (compatible; Baiduspider/2.0; +htt... | 
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50666 entries, 0 to 50665
Data columns (total 7 columns):
ip            50666 non-null object
time          50666 non-null object
request       50666 non-null object
status        50666 non-null int64
size          50666 non-null int64
referer       20784 non-null object
user_agent    50465 non-null object
dtypes: int64(2), object(5)
memory usage: 3.1+ MB
We can continue creating new columns from the existing ones and filtering out some rows.
Get resource URI:
request = data.pop('request').str.split()
data['resource'] = request.str[1]
Filter out non GET and non 200 requests:
data = data[(request.str[0] == 'GET') & (data.pop('status') == 200)]
Filter out undesired resources:
data = data[~data['resource'].str.match(
    r'^/media|^/static|^/admin|^/robots.txt$|^/favicon.ico$')]
Last, we can drop out the web crawlers, such as spiders from Google, Yahoo!... Just as a simple example we can filter crawlers by User-Agent:
data = data[~data['user_agent'].str.match(
    r'.*?bot|.*?spider|.*?crawler|.*?slurp', flags=re.I).fillna(False)]
and by IP:
data = data[~data['ip'].str.startswith('123.125.71.')]  # Baidu IPs.
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8840 entries, 3 to 50665
Data columns (total 6 columns):
ip            8840 non-null object
time          8840 non-null object
size          8840 non-null int64
referer       2266 non-null object
user_agent    8808 non-null object
resource      8840 non-null object
dtypes: int64(1), object(5)
memory usage: 483.4+ KB
In the next post, we will do some analysis with this data.
