Data analysis with Apache Hive. A practical introduction
data-analysis data-warehousing hadoop hiveApache Hive is a framework for data warehousing for manage large datasets. Hive can be used for data analysis in a SQL-like language called HiveQL.
The following examples are done using the hourly precipitations dataset from NCDC for May of 1998 and the weather stations description file, both can be downloaded from http://www.ncdc.noaa.gov/orders/qclcd/.
Loading data
Load CSV data (hourly precipitations records):
CREATE TABLE records (wban STRING, day STRING, time STRING, hp FLOAT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES('skip.header.line.count'='1');
LOAD DATA INPATH 'input/199805hpd.txt'
OVERWRITE INTO TABLE records;
To load more complex data, in this case, the weather stations metadata with many columns splitted by a pipe and not all of them required, first we load the data as a simple table with a single string column:
CREATE TABLE stations_raw (value STRING)
TBLPROPERTIES('skip.header.line.count'='1');
LOAD DATA INPATH 'input/station.txt'
INTO TABLE stations_raw;
Create a custom script to obtain the desired columns:
#!/usr/bin/env python
import sys
for line in sys.stdin:
cols = line.split('|')
wban = cols[0]
city, state = cols[6].split(',')
print '\t'.join([wban, city.capitalize(), state.lstrip()])
Create the final table with the desired columns using the previous script to map them:
CREATE TABLE stations (wban STRING, city STRING, state STRING);
FROM stations_raw
INSERT OVERWRITE TABLE stations
MAP value
USING '/opt/data/ncdc/qclcd/parse_stations.py'
AS wban, city, state;
Querying data
Precipitations per station and date:
SELECT day, SUM(hp)
FROM records
WHERE wban='03024'
GROUP BY day;
19980501 0.029999999329447746
19980502 0.0
19980503 0.0800000000745058
19980504 0.0
19980505 0.0
[...]
Month precipitations by station with station data:
SELECT stations.*, records.sum_hp
FROM stations
JOIN (SELECT wban, SUM(hp) AS sum_hp FROM records GROUP BY wban) records
ON (records.wban = stations.wban);
03013 LAMAR CO 1.2399999964982271
03016 RIFLE CO 0.47999999299645424
03017 DENVER CO 1.7299999985843897
03024 BORGER TX 1.3900000154972076
03026 BURLINGTON CO 2.8499999884516
[...]