Read Apache HTTP server access log with Pandas
In this post we'll see how to read our Apache HTTP server access log into a Pandas dataframe. First of all, we should take a look at the logging documentation to see how the log lines are formatted. In this case, we are going to use the combined log format, which corresponds to:
%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"
Where:

- `%h` is the IP address of the client (remote host);
- `%l` is the RFC 1413 identity of the client (usually not determined, so logged as `-`);
- `%u` is the user id, determined by HTTP authentication;
- `%t` is the time the request was received, by default formatted as `[day/month/year:hour:minute:second zone]`;
- `\"%r\"` is the request string, formatted as `"method resource protocol"`;
- `%>s` is the status code;
- `%b` is the size of the response in bytes, excluding the HTTP headers;
- `\"%{Referer}i\"` is the Referer HTTP request header; and
- `\"%{User-agent}i\"` is the User-Agent HTTP request header.
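Putting it together, a log line in this format looks like the following (a hypothetical but representative example):

66.249.79.73 - - [08/Nov/2015:08:09:02 +0000] "GET /blog/ HTTP/1.1" 200 2754 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"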
Knowing this, let's define our field delimiter and some field parsers.
Although Pandas provides the `quotechar` and `quoting` arguments in its readers to handle a quote character (by default, `"`), allowing the delimiter character within quoted text (we have three fields with quoted text), these don't work properly with a more complex regular expression to delimit the columns. So, in our case, we'll need to define a regular expression that splits each line into our initial nine fields. It needs to match the spaces that are not surrounded by double quotes (`"`) nor square brackets (`[]`). Then, it will be:
\s # Match space.
(?=(?:[^"]*"[^"]*")*[^"]*$) # Not surrounded by ".
(?![^\[]*\]) # Not surrounded by [].
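To check that this pattern splits a line into the expected nine fields, we can try it on the sample line above (a quick sketch; the line itself is hypothetical):

import re

line = ('66.249.79.73 - - [08/Nov/2015:08:09:02 +0000] '
        '"GET /blog/ HTTP/1.1" 200 2754 "-" '
        '"Mozilla/5.0 (compatible; Googlebot/2.1)"')

# Split on the spaces outside quotes and brackets.
fields = re.split(r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])', line)
print(len(fields))  # 9
print(fields[3])    # [08/Nov/2015:08:09:02 +0000]
print(fields[4])    # "GET /blog/ HTTP/1.1"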
Since the time, request, Referer and User-Agent fields are surrounded by quotes or square brackets, we will need a function to remove them. Also, the time needs to be parsed and formatted as a `datetime` object:
from datetime import datetime
import pytz


def parse_str(x):
    """
    Returns the string delimited by two characters.

    Example:
        >>> parse_str('[my string]')
        'my string'
    """
    return x[1:-1]
def parse_datetime(x):
    '''
    Parses a datetime with timezone formatted as:
        [day/month/year:hour:minute:second zone]

    Example:
        >>> parse_datetime('[13/Nov/2015:11:45:42 +0000]')
        datetime.datetime(2015, 11, 13, 11, 45, 42, tzinfo=<UTC>)

    Due to problems parsing the timezone (%z) with datetime.strptime, the
    timezone will be obtained using the pytz library.
    '''
    dt = datetime.strptime(x[1:-7], '%d/%b/%Y:%H:%M:%S')
    # Offset in minutes; the sign must apply to hours and minutes alike.
    sign = -1 if x[-6] == '-' else 1
    dt_tz = sign * (int(x[-5:-3]) * 60 + int(x[-3:-1]))
    return dt.replace(tzinfo=pytz.FixedOffset(dt_tz))
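A quick sanity check of both parsers on hypothetical values (the negative offset exercises the sign handling):

print(parse_str('"GET /blog/ HTTP/1.1"'))
# GET /blog/ HTTP/1.1
print(parse_datetime('[13/Nov/2015:11:45:42 -0530]'))
# 2015-11-13 11:45:42-05:30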
Next, we can read our access log file. In my case, I'm not going to keep the second and third columns (`%l` and `%u`):
import re
import pandas as pd
data = pd.read_csv(
    'data/access.log',
    sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
    engine='python',
    na_values='-',
    header=None,
    usecols=[0, 3, 4, 5, 6, 7, 8],
    names=['ip', 'time', 'request', 'status', 'size', 'referer', 'user_agent'],
    converters={'time': parse_datetime,
                'request': parse_str,
                'status': int,
                'size': int,
                'referer': parse_str,
                'user_agent': parse_str})
data.head()
|   | ip | time | request | status | size | referer | user_agent |
|---|----|------|---------|--------|------|---------|------------|
| 0 | 66.249.79.73 | 2015-11-08 08:09:02+00:00 | GET /blog/python-image-processing-libraries-pe... | 200 | 2754 | NaN | Mozilla/5.0 (compatible; Googlebot/2.1; +http:... |
| 1 | 123.125.71.41 | 2015-11-08 08:09:51+00:00 | GET / HTTP/1.1 | 200 | 1421 | NaN | Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/2... |
| 2 | 123.125.71.25 | 2015-11-08 08:09:52+00:00 | GET / HTTP/1.1 | 200 | 4747 | NaN | Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/2... |
| 3 | 45.55.223.156 | 2015-11-08 08:10:02+00:00 | GET / HTTP/1.0 | 200 | 4706 | NaN | Mozilla/5.0 (compatible; NetcraftSurveyAgent/1... |
| 4 | 180.76.15.5 | 2015-11-08 08:13:28+00:00 | GET / HTTP/1.1 | 200 | 4747 | NaN | Mozilla/5.0 (compatible; Baiduspider/2.0; +htt... |
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50666 entries, 0 to 50665
Data columns (total 7 columns):
ip 50666 non-null object
time 50666 non-null object
request 50666 non-null object
status 50666 non-null int64
size 50666 non-null int64
referer 20784 non-null object
user_agent 50465 non-null object
dtypes: int64(2), object(5)
memory usage: 3.1+ MB
We can continue creating new columns from the existing ones and filtering out some rows.
Get resource URI:
request = data.pop('request').str.split()
data['resource'] = request.str[1]
Filter out non-GET and non-200 requests:
data = data[(request.str[0] == 'GET') & (data.pop('status') == 200)]
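As a quick illustration of what `request.str.split()` and the integer accessor return (a sketch with a hypothetical request string):

import pandas as pd

request = pd.Series(['GET /blog/ HTTP/1.1']).str.split()
print(request.str[0].iloc[0])  # method: 'GET'
print(request.str[1].iloc[0])  # resource: '/blog/'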
Filter out undesired resources:
data = data[~data['resource'].str.match(
    r'^/media|^/static|^/admin|^/robots\.txt$|^/favicon\.ico$')]
Last, we can drop the requests made by web crawlers, such as the Google or Yahoo! spiders. Just as a simple example, we can filter crawlers by User-Agent:
data = data[~data['user_agent'].str.match(
    r'.*?bot|.*?spider|.*?crawler|.*?slurp', flags=re.I).fillna(False)]
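Note the `.fillna(False)`: `str.match` returns `NaN` for the rows with a missing User-Agent, and `NaN` is not a valid value in a boolean mask. A minimal illustration with hypothetical values:

import re
import pandas as pd

ua = pd.Series(['Googlebot/2.1', None, 'Mozilla/5.0'])
mask = ua.str.match(r'.*?bot|.*?spider', flags=re.I)
print(mask.tolist())                # [True, nan, False]
print(mask.fillna(False).tolist())  # [True, False, False]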
and by IP:
data = data[~data['ip'].str.startswith('123.125.71.')] # Baidu IPs.
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 8840 entries, 3 to 50665
Data columns (total 6 columns):
ip 8840 non-null object
time 8840 non-null object
size 8840 non-null int64
referer 2266 non-null object
user_agent 8808 non-null object
resource 8840 non-null object
dtypes: int64(1), object(5)
memory usage: 483.4+ KB
In the next post, we will do some analysis with this data.