Simple web analytics with Python and Pandas
data-analysis pandas python web-analytics
We are going to run some simple analytics on our web visit data. As a basic report, we will compute the unique and total visits by date and by other parameters such as browser, page visited, language, and operating system.
Requirements
% pip install python-dateutil pandas
Getting and filtering the data
Let's assume our data is structured like this:
| uuid | ip | city | country_code | country_name | language | browser | os | page | date | ... |
|---|---|---|---|---|---|---|---|---|---|---|
| ea2d3169-2b71-4beb-9665-108d302c3a67 | 78.146.232.107 | London | UK | United Kingdom | EN | Firefox | Linux | /foo | 2015-02-12 09:25:17.770175 | ... |
| bdb18e99-fc80-4d4b-b4a1-286e67ba374f | 95.142.167.120 | Paris | FR | France | FR | Safari | Mac | /bar | 2015-02-09 21:11:02.134322 | ... |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
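The Python code:
from datetime import datetime, timedelta, timezone
import pandas as pd

# Load the visits, using the uuid column as the index and parsing the dates.
source = pd.read_csv('data.csv', index_col='uuid', parse_dates=['date'])

# Keep only the last 30 days, using a naive UTC "now" as the upper bound.
date_1 = datetime.now(timezone.utc).replace(tzinfo=None)
date_0 = date_1 - timedelta(days=30)
# .copy() lets us add columns later without a SettingWithCopyWarning.
data = source[(source['date'] > date_0) & (source['date'] < date_1)].copy()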
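To try the date filter without a real `data.csv`, here is a minimal sketch that reads a few rows from an in-memory string. The sample rows, the shortened uuids, the `last_days` helper, and the fixed `now` value are all made up for illustration and are not part of the original script.

```python
from datetime import datetime, timedelta
from io import StringIO

import pandas as pd

# Hypothetical sample rows using a subset of the columns described above.
SAMPLE_CSV = """uuid,ip,city,os,page,date
a1,78.146.232.107,London,Linux,/foo,2015-02-12 09:25:17.770175
b2,95.142.167.120,Paris,Mac,/bar,2015-02-09 21:11:02.134322
c3,78.146.232.107,London,Linux,/bar,2014-12-01 10:00:00.000000
"""


def last_days(frame, now, days=30):
    """Return the rows of `frame` whose 'date' falls within `days` before `now`."""
    start = now - timedelta(days=days)
    mask = (frame['date'] > start) & (frame['date'] < now)
    return frame[mask].copy()


source = pd.read_csv(StringIO(SAMPLE_CSV), index_col='uuid', parse_dates=['date'])
# A fixed "now" keeps the example reproducible; real code would use the current time.
recent = last_days(source, now=datetime(2015, 2, 13))
print(recent.index.tolist())
```

Passing `now` as a parameter, instead of reading the clock inside the function, makes the filter easy to check against known data.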
Aggregating the data
We'll obtain data structured as follows, first grouped by date:
| | unique_visits | total_visits |
|---|---|---|
| 2015-01-13 | 90 | 140 |
| 2015-01-14 | 104 | 170 |
| 2015-01-15 | 80 | 193 |
| ... | ... | ... |

Grouped by another column, such as the operating system:
| | unique_visits | total_visits |
|---|---|---|
| Linux | 76 | 111 |
| Mac | 101 | 180 |
| Windows | 40 | 73 |
The Python code:
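def get_visits(groupby):
    """Return unique and total visits for each value of the `groupby` column."""
    ip_visits = data.groupby(groupby)['ip']
    return pd.DataFrame({
        # Unique visits: number of distinct IPs in each group.
        'unique_visits': ip_visits.nunique(),
        # Total visits: number of rows in each group.
        'total_visits': ip_visits.size(),
    })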
# Truncate each timestamp to midnight so that visits group by calendar day.
data['date_day'] = data['date'].dt.normalize()
visits_by_date = get_visits('date_day')
# Reindex over the full date range so that days without visits show up as 0.
visits_by_date = visits_by_date.reindex(
    pd.date_range(date_0, date_1, normalize=True), fill_value=0)
# Visits grouped by other columns.
visits_by_os = get_visits('os')
visits_by_city = get_visits('city')
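The aggregation can also be checked end to end on a tiny, hand-made data set. The sketch below is an illustration under stated assumptions: the sample visits are invented, and this variant of `get_visits` takes the DataFrame as an explicit argument (a small change from the version above) so it does not depend on a global `data` variable.

```python
import pandas as pd

# Hypothetical visits; only the columns needed for this example.
visits = pd.DataFrame({
    'ip': ['1.1.1.1', '1.1.1.1', '2.2.2.2', '3.3.3.3', '3.3.3.3'],
    'os': ['Linux', 'Linux', 'Mac', 'Mac', 'Mac'],
    'date': pd.to_datetime([
        '2015-02-09 08:00', '2015-02-09 21:11', '2015-02-09 22:00',
        '2015-02-11 09:25', '2015-02-11 10:00',
    ]),
})


def get_visits(frame, groupby):
    """Unique (distinct IPs) and total (row count) visits per `groupby` value."""
    ip_visits = frame.groupby(groupby)['ip']
    return pd.DataFrame({
        'unique_visits': ip_visits.nunique(),
        'total_visits': ip_visits.size(),
    })


# Visits per operating system.
by_os = get_visits(visits, 'os')

# Visits per day; 2015-02-10 has no rows, so reindexing fills it with 0.
visits['date_day'] = visits['date'].dt.normalize()
by_day = get_visits(visits, 'date_day').reindex(
    pd.date_range('2015-02-09', '2015-02-11'), fill_value=0)

print(by_os)
print(by_day)
```

The same call works for any other column in the data, such as `browser`, `language`, or `page`.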