Loading data into Hive
Let's practice different ways to load data into Apache Hive, along with some optimization concepts.
Hive tables
Create a table with different data types:
CREATE TABLE users (
id STRING,
name STRING,
email ARRAY<STRING>,
roles STRUCT<editor:BOOLEAN, sales:BOOLEAN, admin:BOOLEAN>,
settings MAP<STRING, STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
COLLECTION ITEMS TERMINATED BY ','
MAP KEYS TERMINATED BY ':';
DESC users;
id string
name string
email array<string>
roles struct<editor:boolean,sales:boolean,admin:boolean>
settings map<string,string>
The table will be stored in the Hive warehouse in HDFS (defined in ${HIVE_HOME}/conf/hive-site.xml, by default /user/hive/warehouse/). It is also possible to create an external table that reads data from an arbitrary HDFS location; we will create one later.
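To check where a managed table actually ends up, we can ask Hive directly (a quick sketch; the exact output depends on your configuration):
-- Print the configured warehouse directory
SET hive.metastore.warehouse.dir;
-- Show the table's storage details, including its Location in HDFS
DESCRIBE FORMATTED users;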
Load data
We are going to put some data into our table. Take a look at the sample data:
% cat /opt/data/test/users.txt
eo90133cf9ql|john_doe|john@doe.com,j.doe@email.com|true,true,false|theme:dark,font-size:16
sh1243ihn93n|johnsmith|jsmith@mmail.com|false,true,false|theme:light
aa9871kjn3l1|bob|bob111@smail.com,me@bob111.com|false,false,true|font-size:12
hh2342o2nkj4|alice|al@thecompany.com|true,false,true|theme:solarized
Load data from the local filesystem:
LOAD DATA LOCAL INPATH '/opt/data/test/users.txt'
OVERWRITE INTO TABLE users;
Load data from HDFS:
% hadoop fs -put /opt/data/test/users.txt input/
LOAD DATA INPATH 'input/users.txt'
OVERWRITE INTO TABLE users;
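To verify the load and see how the collection types are accessed, a quick query could look like this (a sketch based on the columns defined above):
-- Array elements are accessed by index, struct fields with dot notation, map values by key
SELECT name,
       email[0]          AS primary_email,
       roles.admin       AS is_admin,
       settings['theme'] AS theme
FROM users;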
Hive partitions
To improve performance, we can partition the data in Hive. In this case, we'll create a table partitioned by a day column, so only the required partitions are scanned when querying.
First, create an external table with the raw data, so we can load the partitioned table using INSERT instead of LOAD:
CREATE EXTERNAL TABLE history_raw (
user_id STRING,
datetime TIMESTAMP,
ip STRING,
browser STRING,
os STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/root/input/history_raw';
Put the text file with the raw data into HDFS:
% cat /opt/data/test/history.txt
eo90133cf9ql,2015-10-01 00:03:20,123.456.77.88,firefox,linux,20151001
eo90133cf9ql,2015-10-01 01:08:56,123.456.77.88,firefox,linux,20151001
eo90133cf9ql,2015-10-02 02:30:45,123.456.77.88,firefox,linux,20151002
sh1243ihn93n,2015-10-02 11:21:50,121.956.23.88,safari,mac,20151002
eo90133cf9ql,2015-10-10 15:02:11,133.555.23.88,firefox,android,20151010
aa9871kjn3l1,2015-10-10 18:20:43,155.215.23.88,chrome,windows,20151010
eo90133cf9ql,2015-10-11 12:18:09,123.456.23.88,firefox,android,20151011
eo90133cf9ql,2015-10-12 12:34:34,123.456.23.88,firefox,android,20151012
hh2342o2nkj4,2015-10-15 15:02:11,133.555.23.88,safari,ios,20151015
sh1243ihn93n,2015-10-15 21:21:21,121.956.23.88,safari,mac,20151015
% hadoop fs -put /opt/data/test/history.txt input/history_raw/
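A quick sanity check confirms that the external table sees the file (a sketch; the expected count comes from the sample data above):
SELECT COUNT(*) FROM history_raw;
-- expected: 10 (one row per line in history.txt)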
Create a partitioned table:
CREATE TABLE history (
user_id STRING,
datetime TIMESTAMP,
ip STRING,
browser STRING,
os STRING)
PARTITIONED BY (day STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
DESC history;
user_id string
datetime timestamp
ip string
browser string
os string
day string
# Partition Information
# col_name data_type comment
day string
Create a partition:
ALTER TABLE history
ADD PARTITION (day='20151015');
SHOW PARTITIONS history;
day=20151015
To load data into a partitioned table we can use LOAD or INSERT, but INSERT lets us easily filter the data from the raw table into the proper partition.
INSERT OVERWRITE TABLE history PARTITION (day='20151015')
SELECT * FROM history_raw
WHERE substr(datetime, 0, 10) = '2015-10-15';
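For reference, LOAD can also target a partition directly; it just moves the file without filtering, so the file must already contain only that day's rows (the path below is hypothetical):
LOAD DATA LOCAL INPATH '/opt/data/test/history_20151015.txt'
OVERWRITE INTO TABLE history PARTITION (day='20151015');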
A folder per partition will be created in the Hive warehouse:
% hadoop fs -ls /user/hive/warehouse/history
... /user/hive/warehouse/history/day=20151015
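As mentioned above, queries that filter on the partition column only read the matching folders. A minimal sketch:
-- Only the day=20151015 folder is scanned; the other partitions are pruned
SELECT browser, COUNT(*) AS visits
FROM history
WHERE day = '20151015'
GROUP BY browser;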
Hive buckets
Continuing with Hive optimization, bucketing is a technique that splits a table's data into a fixed number of files (buckets) in HDFS, hashed by a chosen column.
Create a clustered table:
CREATE TABLE history_buckets (
user_id STRING,
datetime TIMESTAMP,
ip STRING,
browser STRING,
os STRING)
CLUSTERED BY (user_id) INTO 10 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';
Set the parameters to enforce bucketing and limit the reducers to the number of buckets:
set hive.enforce.bucketing = true;
set hive.exec.reducers.max = 10;
Since LOAD only moves files and doesn't distribute the rows into buckets, we need to use INSERT for bucketing:
INSERT OVERWRITE TABLE history_buckets
SELECT * FROM history_raw;
In HDFS, a file will be created for each bucket:
% hadoop fs -ls /user/hive/warehouse/history_buckets
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000000_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000001_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000002_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000003_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000004_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000005_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000006_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000007_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000008_0
... 2015-10-15 12:14 /user/hive/warehouse/history_buckets/000009_0
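Bucketing also makes sampling efficient, since Hive can read just one bucket file instead of scanning the whole table. A minimal sketch:
-- Read only the first of the 10 buckets (roughly a 10% sample by user_id hash)
SELECT * FROM history_buckets
TABLESAMPLE (BUCKET 1 OUT OF 10 ON user_id);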