Loading data into Hive

data-warehousing hadoop hive

Let's practise different ways to load data into Apache Hive, along with some of its optimization concepts.

Hive tables

Create a table with different data types, including the complex types ARRAY, STRUCT and MAP:

CREATE TABLE users (
    id STRING,
    name STRING,
    email ARRAY<STRING>,
    roles STRUCT<editor:BOOLEAN, sales:BOOLEAN, admin:BOOLEAN>,
    settings MAP<STRING, STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';
DESC users;
id                      string
name                    string
email                   array<string>
roles                   struct<editor:boolean,sales:boolean,admin:boolean>
settings                map<string,string>

The table will be stored in the Hive warehouse directory in HDFS, which is defined in ${HIVE_HOME}/conf/hive-site.xml (by default /user/hive/warehouse/). It is also possible to create an external table whose data lives elsewhere in HDFS; we will create one later.
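For reference, the warehouse location is controlled by the hive.metastore.warehouse.dir property in hive-site.xml. A minimal entry, shown here with its default value:

<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
</property>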

Load data

We are going to load some data into our table.

Take a look at the sample data:

% cat /opt/data/test/users.txt
eo90133cf9ql|john_doe|john@doe.com,j.doe@email.com|true,true,false|theme:dark,font-size:16
sh1243ihn93n|johnsmith|jsmith@mmail.com|false,true,false|theme:light
aa9871kjn3l1|bob|bob111@smail.com,me@bob111.com|false,false,true|font-size:12
hh2342o2nkj4|alice|al@thecompany.com|true,false,true|theme:solarized

Load data from the local filesystem:

LOAD DATA LOCAL INPATH '/opt/data/test/users.txt'
OVERWRITE INTO TABLE users;

Load data from HDFS:

% hadoop fs -put /opt/data/test/users.txt input/
LOAD DATA INPATH 'input/users.txt'
OVERWRITE INTO TABLE users;
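To verify the load and see how the complex types are addressed, we can select individual elements (array index, struct field, map key):

SELECT name, email[0], roles.admin, settings['theme']
FROM users;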

Hive partitions

To improve query performance, we can partition the data in Hive. In this case, we'll create a table partitioned by a day column: queries that filter on it will only read the required partitions.

First, create an external table over the raw data, so we can load the partitioned table with INSERT instead of LOAD:

CREATE EXTERNAL TABLE history_raw (
    user_id STRING,
    datetime TIMESTAMP,
    ip STRING,
    browser STRING,
    os STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/root/input/history_raw';

Put the text file with the raw data into that HDFS directory (LOCATION must point to a directory, not a file):

% cat /opt/data/test/history.txt
eo90133cf9ql,2015-10-01 00:03:20,123.456.77.88,firefox,linux,20151001
eo90133cf9ql,2015-10-01 01:08:56,123.456.77.88,firefox,linux,20151001
eo90133cf9ql,2015-10-02 02:30:45,123.456.77.88,firefox,linux,20151002
sh1243ihn93n,2015-10-02 11:21:50,121.956.23.88,safari,mac,20151002
eo90133cf9ql,2015-10-10 15:02:11,133.555.23.88,firefox,android,20151010
aa9871kjn3l1,2015-10-10 18:20:43,155.215.23.88,chrome,windows,20151010
eo90133cf9ql,2015-10-11 12:18:09,123.456.23.88,firefox,android,20151011
eo90133cf9ql,2015-10-12 12:34:34,123.456.23.88,firefox,android,20151012
hh2342o2nkj4,2015-10-15 15:02:11,133.555.23.88,safari,ios,20151015
sh1243ihn93n,2015-10-15 21:21:21,121.956.23.88,safari,mac,20151015
% hadoop fs -put /opt/data/test/history.txt input/history_raw/
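Note that the raw file carries a sixth field (the day, e.g. 20151001) that history_raw does not declare; the delimited SerDe simply ignores trailing extra fields. A quick sanity check that the external table sees the data:

SELECT user_id, datetime FROM history_raw LIMIT 3;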

Create a partitioned table:

CREATE TABLE history (
    user_id STRING,
    datetime TIMESTAMP,
    ip STRING,
    browser STRING,
    os STRING)
PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
DESC history;
user_id                 string
datetime                timestamp
ip                      string
browser                 string
os                      string
day                     string

# Partition Information
# col_name              data_type               comment

day                     string

Create a partition:

ALTER TABLE history
ADD PARTITION (day='20151015');
SHOW PARTITIONS history;
day=20151015

To load data into a partitioned table we can use either LOAD or INSERT, but INSERT from the raw table makes it easy to filter the rows so each one ends up in the proper partition.

INSERT OVERWRITE TABLE history PARTITION (day='20151015')
SELECT * FROM history_raw
WHERE substr(datetime, 0, 10) = '2015-10-15';
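Repeating a static INSERT per day gets tedious when there are many partitions. Alternatively, Hive supports dynamic partitioning, where the partition value is taken from the last column of the SELECT. A minimal sketch, assuming date_format is available (Hive 1.2+):

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

INSERT OVERWRITE TABLE history PARTITION (day)
SELECT user_id, datetime, ip, browser, os,
       date_format(datetime, 'yyyyMMdd') AS day
FROM history_raw;

This creates one partition per distinct day found in the raw data.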

A directory per partition will be created in the Hive warehouse:

% hadoop fs -ls /user/hive/warehouse/history
... /user/hive/warehouse/history/day=20151015
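Since day is a partition column, a query that filters on it only scans the matching directories:

SELECT os, count(*) FROM history
WHERE day = '20151015'
GROUP BY os;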

Hive buckets

Continuing with Hive optimizations, bucketing is a technique that splits the data into a fixed number of files (buckets) in HDFS, according to the hash of a column.

Create a clustered table:

CREATE TABLE history_buckets (
    user_id STRING,
    datetime TIMESTAMP,
    ip STRING,
    browser STRING,
    os STRING)
CLUSTERED BY (user_id) INTO 10 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

Set these parameters so that Hive enforces bucketing and the number of reducers matches the number of buckets:

set hive.enforce.bucketing = true;
set hive.exec.reducers.max = 10;

Since LOAD just copies files and doesn't hash the rows into buckets, we need to use INSERT to populate a bucketed table:

INSERT OVERWRITE TABLE history_buckets
SELECT * FROM history_raw;

In HDFS, one file will be created per bucket:

% hadoop fs -ls /user/hive/warehouse/history_buckets
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000000_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000001_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000002_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000003_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000004_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000005_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000006_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000007_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000008_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000009_0
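One payoff of bucketing is efficient sampling: Hive can read a single bucket file instead of scanning the whole table. The ON column should match the clustering column so the sample maps directly onto the bucket files:

SELECT * FROM history_buckets
TABLESAMPLE(BUCKET 1 OUT OF 10 ON user_id);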