Querying in Hive

data-analysis data-warehousing hadoop hive

In the previous entry we created some Hive tables and put data into them, here we are going to see how to retrieve, aggregate and filter data from our tables by different examples.

The data used comes from World Health Organization and represents many fields about demographics, socioeconomisc, risks... by country. This edited version can be found in http://www.exploredata.net/Downloads/WHO-Data-Set. I've simplified the version and created a Hive table using the following mapper function and HiveQL commands:

#!/usr/bin/env python

import sys

for line in sys.stdin:
    if line.startswith('"'):
        index = line.find('"', 1)
        fields = [line[1:index]]+line[index+2:].split(',')
    else:
        fields = line.split(',')
    print '\t'.join([
        fields[1],  # Id.
        fields[0],  # Name.
        fields[2],  # Continent id.
        str(int(fields[8])*1000) if fields[8] else '',  # Population.
        str(float(fields[10])/100) if fields[10] else '',  # Urban population.
        str(float(fields[14])/100) if fields[14] else '',  # Population > 60.
        str(float(fields[15])/100) if fields[15] else '',  # Population < 15.
        fields[5],  # GNI per capita.
        fields[129]])  # Life expectancy.
CREATE TABLE who_raw (value STRING)
TBLPROPERTIES('skip.header.line.count'='1');

CREATE TABLE who (
    country_id INT,
    country_name STRING,
    continent_id INT,
    population INT,
    population_urban FLOAT,
    population_60 FLOAT,
    population_15 FLOAT,
    gni FLOAT,
    life_expectancy INT);

LOAD DATA LOCAL INPATH '/opt/data/who/WHO.csv'
INTO TABLE who_raw;

FROM who_raw
INSERT OVERWRITE TABLE who
MAP value
USING '/opt/data/who/mapper.py'
AS country_id, country_name, continent_id, population, population_urban,
population_60, population_15, gni, life_expectancy;

DROP TABLE who_raw;

Select

The basic way to query data in Hive is using SELECT:

SELECT * FROM who LIMIT 12;
1   Afghanistan 1   26088000    0.23    0.47    0.06    NULL    42
2   Albania 2   3172000 0.46    0.26    0.9 6000.0  71
3   Algeria 3   33351000    0.64    0.29    0.9 5940.0  71
4   Andorra 2   74000   0.93    0.14    0.9 NULL    82
5   Angola  3   16557000    0.54    0.46    0.29    3890.0  41
6   Antigua and Barbuda 4   84000   0.37    0.28    NULL    15130.0 73
7   Argentina   5   39134000    0.9 0.26    0.9 11670.0 75
8   Armenia 2   3010000 0.64    0.2 0.9 4950.0  69
9   Australia   6   20530000    0.88    0.19    0.9 33940.0 82
10  Austria 2   8327000 0.66    0.16    0.9 36040.0 80
11  Azerbaijan  2   8406000 0.52    0.24    0.9 5430.0  64
12  Bahamas 4   327000  0.91    0.27    NULL    NULL    74

Retrieve only country name and population:

SELECT country_name, population FROM who LIMIT 12;
Afghanistan 26088000
Albania 3172000
Algeria 33351000
Andorra 74000
Angola  16557000
Antigua and Barbuda 84000
Argentina   39134000
Armenia 3010000
Australia   20530000
Austria 8327000
Azerbaijan  8406000
Bahamas 327000

Retrieve unique values:

SELECT DISTINCT continent_id FROM who;
1
2
3
4
5
6
7

Filtering with WHERE:

SELECT country_name FROM who WHERE continent_id = 4;
Antigua and Barbuda
Bahamas
Barbados
Bermuda
Canada
Mexico
United States of America
SELECT country_name FROM who WHERE population > 100000000;
Bangladesh
Brazil
China
India
Indonesia
Japan
Mexico
Nigeria
Pakistan
Russia
United States of America

Select subqueries

Nest queries to create more complex queries:

WITH q AS (SELECT * FROM who WHERE continent_id = 1)
SELECT country_name, population
FROM q;

or

SELECT country_name, population
FROM (SELECT * FROM who WHERE continent_id = 1) q;
Afghanistan 26088000
Bahrain 739000
Cape Verde  519000
Djibouti    819000
Egypt   74166000
Iran (Islamic Republic of)  70270000
Iraq    28506000
Israel  6810000
Jordan  5729000
Kuwait  2779000
Libyan Arab Jamahiriya  6039000
Malta   405000
Morocco 30853000
Oman    2546000
Qatar   821000
Saudi Arabia    24175000
Syria   19408000
Tunisia 10215000
United Arab Emirates    4248000
West Bank and Gaza  NULL
Yemen   21732000

Joins

To do some examples to combine rows, first, create continents table to combine together with the "who" table:

CREATE TABLE continents (id INT, name STRING);

INSERT INTO TABLE continents
VALUES (1, 'Middle East'),
       (2, 'Europe'),
       (3, 'Africa'),
       (4, 'North America'),
       (5, 'South America'),
       (6, 'Oceania'),
       (7, 'Asia');

Display the names of the countries:

SELECT t1.country_name AS country, t2.name AS continent
FROM who t1
JOIN continents t2
ON t1.continent_id = t2.id
LIMIT 12;

or an equivalent with an implicit join:

SELECT t1.country_name AS country, t2.name AS continent
FROM who t1, continents t2
WHERE t1.continent_id = t2.id
LIMIT 12;
Afghanistan Middle East
Albania Europe
Algeria Africa
Andorra Europe
Angola  Africa
Antigua and Barbuda North America
Argentina   South America
Armenia Europe
Australia   Oceania
Austria Europe
Azerbaijan  Europe
Bahamas North America

In our case, an outer join (left or right) will make the same effect joining the tables "who" and "continents", since all the countries have defined a continent id found in the "continents" table and all the ids from the "continents" table can be found in the "who" table. Just take into account that JOIN will return the matches in both tables and LEFT JOIN will return all the rows in the left table, RIGHT JOIN all the rows of the right table and FULL JOIN all the rows in both tables. If there's no match in a table it will return NULL. For the CROSS JOIN, it will return all the row combinations of the left an right tables, which is the same as an inner join with no ON condition.

Unions

Unions merge data vertically. Let's create a table with the countries with a GNI greater than $40,000 to do some examples with unions:

CREATE TABLE countries_gni40000 (name STRING, gni INT);

INSERT OVERWRITE TABLE countries_gni40000
SELECT country_name, gni FROM who WHERE gni >= 40000;

SELECT * FROM countries_gni40000;
Brunei Darussalam   49900
Kuwait  48310
Luxembourg  60870
Norway  50070
Singapore   43300
Switzerland 40840
United States of America    44070

Display the north american countries and the countries with a GNI greater than $40,000:

SELECT t1.country_name AS name
FROM who t1
WHERE t1.continent_id = 4
UNION ALL
SELECT t2.name AS name
FROM countries_gni40000 t2;
Antigua and Barbuda
Bahamas
Barbados
Bermuda
Canada
Mexico
United States of America
Brunei Darussalam
Kuwait
Luxembourg
Norway
Singapore
Switzerland
United States of America

Same query but removing the duplicates:

SELECT DISTINCT name FROM (
    SELECT country_name AS name
    FROM who
    WHERE continent_id = 4
    UNION ALL
    SELECT name AS name
    FROM countries_gni40000
) q;
Antigua and Barbuda
Bahamas
Barbados
Bermuda
Brunei Darussalam
Canada
Kuwait
Luxembourg
Mexico
Norway
Singapore
Switzerland
United States of America

Order and sort

The difference between ORDER BY and SORT BY is that the first one the sorted order is applied after the reducer and the in the second one it is applied before.

Sort/order countries by name:

SELECT name FROM countries_gni40000 ORDER BY name DESC;

or:

SELECT name FROM countries_gni40000 SORT BY name DESC;
United States of America
Switzerland
Singapore
Norway
Luxembourg
Kuwait
Brunei Darussalam

Order by continent, country name and create a rank with reversed order by population, aggregated using PARTITION BY (do not mix up with partitions when creating tables; more about aggregations in the next section):

SELECT
    continent_id,
    country_name,
    rank() OVER (PARTITION BY continent_id ORDER BY population DESC) AS rank,
    population
FROM who
ORDER BY continent_id, country_name
LIMIT 30;
1   Afghanistan 5   26088000
1   Bahrain 18  739000
1   Cape Verde  19  519000
1   Djibouti    17  819000
1   Egypt   1   74166000
1   Iran (Islamic Republic of)  2   70270000
1   Iraq    4   28506000
1   Israel  10  6810000
1   Jordan  12  5729000
1   Kuwait  14  2779000
1   Libyan Arab Jamahiriya  11  6039000
1   Malta   20  405000
1   Morocco 3   30853000
1   Oman    15  2546000
1   Qatar   16  821000
1   Saudi Arabia    6   24175000
1   Syria   8   19408000
1   Tunisia 9   10215000
1   United Arab Emirates    13  4248000
1   West Bank and Gaza  21  NULL
1   Yemen   7   21732000
2   Albania 39  3172000
2   Andorra 49  74000
2   Armenia 40  3010000
2   Austria 23  8327000
2   Azerbaijan  22  8406000
2   Belarus 20  9742000
2   Belgium 16  10430000
2   Bosnia and Herzegovina  36  3926000
2   Bulgaria    24  7693000

Aggregation

Number of countries with a life expectancy bigger than 80 years:

SELECT count(*) AS total
FROM who
WHERE life_expectancy > 80;
13

Number of countries by continent:

SELECT continent_id, count(*) AS countries
FROM who
GROUP BY continent_id;
1   21
2   51
3   48
4   7
5   31
6   35
7   9

Multiple aggregate functions. Minimum and maximum population in a country by continent:

SELECT continent_id, min(population), max(population)
FROM who
GROUP BY continent_id;
1   405000  74166000
2   31000   143221000
3   86000   144720000
4   84000   302841000
5   50000   189323000
6   2000    228864000
7   300000  1328474000

Conditionals. Number of countries with largely urban population:

SELECT sum(CASE WHEN population_urban > 0.5 THEN 1 ELSE 0 END)
FROM who;

or:

SELECT sum(if(population_urban > 0.5, 1, 0))
FROM who;
112

Unique values. Count continents:

SELECT count(DISTINCT continent_id)
FROM who;

or, with better performance for big tables using multiple reducers:

SELECT count(*)
FROM (SELECT DISTINCT continent_id FROM who) q;
7

When using groups and unions, the next operation will be done in different stages:

SELECT continent_id, NULL AS life_expectancy, count(country_name)
FROM who
GROUP BY continent_id
UNION ALL
SELECT NULL AS continent_id, life_expectancy, count(country_name)
FROM who
GROUP BY life_expectancy
LIMIT 12;

for a better performance use GROUPING SETS, then all processes will be completed in one single stage job.

SELECT continent_id, life_expectancy, count(country_name)
FROM who
GROUP BY continent_id, life_expectancy
GROUPING SETS(continent_id, life_expectancy)
LIMIT 12;
1   NULL    1
1   42  1
1   56  2
1   61  1
1   68  1
1   70  2
1   71  2
1   72  4
1   74  1
1   75  1
1   77  1
1   78  2

For a more combinations of groups may be useful to use ROLLUP to combine n+1 aggregated columns:

SELECT continent_id, life_expectancy, count(country_name)
FROM who
GROUP BY continent_id, life_expectancy WITH ROLLUP
LIMIT 12;

is equivalent to:

SELECT continent_id, life_expectancy, count(country_name)
FROM who
GROUP BY continent_id, life_expectancy
GROUPING SETS((continent_id, life_expectancy), (continent_id), ())
LIMIT 12;
NULL    NULL    202
1   NULL    21
1   NULL    1
1   42  1
1   56  2
1   61  1
1   68  1
1   70  2
1   71  2
1   72  4
1   74  1
1   75  1

Following with the combinations of groups, CUBE will combine all the possible combinations (2^n):

SELECT continent_id, life_expectancy, count(country_name)
FROM who
GROUP BY continent_id, life_expectancy WITH CUBE
LIMIT 12;

is equivalent to:

SELECT continent_id, life_expectancy, count(country_name)
FROM who
GROUP BY continent_id, life_expectancy
GROUPING SETS((continent_id, life_expectancy), (continent_id), (life_expectancy), ())
LIMIT 12;
NULL    NULL    202
NULL    NULL    9
NULL    40  1
NULL    41  1
NULL    42  4
NULL    43  2
NULL    44  1
NULL    46  3
NULL    47  2
NULL    48  3
NULL    49  1
NULL    50  4

Filtered aggregation. Get the average life expectancy by continent when this is greated than 70. The next two operations are equivalent:

SELECT continent_id, mean_le
FROM (
    SELECT continent_id, avg(life_expectancy) AS mean_le
    FROM who
    GROUP BY continent_id
) q
WHERE mean_le > 70;
SELECT continent_id, avg(life_expectancy) AS mean_le
FROM who
GROUP BY continent_id
HAVING mean_le > 70;
2   74.88235294117646
4   75.83333333333333
5   71.65517241379311