Data analysis with Apache Hive. A practical introduction

data-analysis data-warehousing hadoop hive

Apache Hive is a framework for data warehousing for manage large datasets. Hive can be used for data analysis in a SQL-like language called HiveQL.

The following examples are done using the hourly precipitations dataset from NCDC for May of 1998 and the weather stations description file, both can be downloaded from http://www.ncdc.noaa.gov/orders/qclcd/.

Loading data

Load CSV data (hourly precipitations records):

CREATE TABLE records (wban STRING, day STRING, time STRING, hp FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES('skip.header.line.count'='1');

LOAD DATA INPATH 'input/199805hpd.txt'
OVERWRITE INTO TABLE records;

To load more complex data, in this case, the weather stations metadata with many columns splitted by a pipe and not all of them required, first we load the data as a simple table with a single string column:

CREATE TABLE stations_raw (value STRING)
TBLPROPERTIES('skip.header.line.count'='1');
LOAD DATA INPATH 'input/station.txt'
INTO TABLE stations_raw;

Create a custom script to obtain the desired columns:

#!/usr/bin/env python

import sys

for line in sys.stdin:
    cols = line.split('|')
    wban = cols[0]
    city, state = cols[6].split(',')
    print '\t'.join([wban, city.capitalize(), state.lstrip()])

Create the final table with the desired columns using the previous script to map them:

CREATE TABLE stations (wban STRING, city STRING, state STRING);

FROM stations_raw
INSERT OVERWRITE TABLE stations
MAP value
USING '/opt/data/ncdc/qclcd/parse_stations.py'
AS wban, city, state;

Querying data

Precipitations per station and date:

SELECT day, SUM(hp)
FROM records
WHERE wban='03024'
GROUP BY day;
19980501    0.029999999329447746
19980502    0.0
19980503    0.0800000000745058
19980504    0.0
19980505    0.0
[...]

Month precipitations by station with station data:

SELECT stations.*, records.sum_hp
FROM stations
JOIN (SELECT wban, SUM(hp) AS sum_hp FROM records GROUP BY wban) records
ON (records.wban = stations.wban);
03013   LAMAR   CO  1.2399999964982271
03016   RIFLE   CO  0.47999999299645424
03017   DENVER  CO  1.7299999985843897
03024   BORGER  TX  1.3900000154972076
03026   BURLINGTON  CO  2.8499999884516
[...]