Querying in Hive
data-analysis data-warehousing hadoop hiveIn the previous entry we created some Hive tables and put data into them, here we are going to see how to retrieve, aggregate and filter data from our tables by different examples.
The data used comes from World Health Organization and represents many fields about demographics, socioeconomisc, risks... by country. This edited version can be found in http://www.exploredata.net/Downloads/WHO-Data-Set. I've simplified the version and created a Hive table using the following mapper function and HiveQL commands:
#!/usr/bin/env python
import sys
for line in sys.stdin:
if line.startswith('"'):
index = line.find('"', 1)
fields = [line[1:index]]+line[index+2:].split(',')
else:
fields = line.split(',')
print '\t'.join([
fields[1], # Id.
fields[0], # Name.
fields[2], # Continent id.
str(int(fields[8])*1000) if fields[8] else '', # Population.
str(float(fields[10])/100) if fields[10] else '', # Urban population.
str(float(fields[14])/100) if fields[14] else '', # Population > 60.
str(float(fields[15])/100) if fields[15] else '', # Population < 15.
fields[5], # GNI per capita.
fields[129]]) # Life expectancy.
CREATE TABLE who_raw (value STRING)
TBLPROPERTIES('skip.header.line.count'='1');
CREATE TABLE who (
country_id INT,
country_name STRING,
continent_id INT,
population INT,
population_urban FLOAT,
population_60 FLOAT,
population_15 FLOAT,
gni FLOAT,
life_expectancy INT);
LOAD DATA LOCAL INPATH '/opt/data/who/WHO.csv'
INTO TABLE who_raw;
FROM who_raw
INSERT OVERWRITE TABLE who
MAP value
USING '/opt/data/who/mapper.py'
AS country_id, country_name, continent_id, population, population_urban,
population_60, population_15, gni, life_expectancy;
DROP TABLE who_raw;
Select
The basic way to query data in Hive is using SELECT
:
SELECT * FROM who LIMIT 12;
1 Afghanistan 1 26088000 0.23 0.47 0.06 NULL 42
2 Albania 2 3172000 0.46 0.26 0.9 6000.0 71
3 Algeria 3 33351000 0.64 0.29 0.9 5940.0 71
4 Andorra 2 74000 0.93 0.14 0.9 NULL 82
5 Angola 3 16557000 0.54 0.46 0.29 3890.0 41
6 Antigua and Barbuda 4 84000 0.37 0.28 NULL 15130.0 73
7 Argentina 5 39134000 0.9 0.26 0.9 11670.0 75
8 Armenia 2 3010000 0.64 0.2 0.9 4950.0 69
9 Australia 6 20530000 0.88 0.19 0.9 33940.0 82
10 Austria 2 8327000 0.66 0.16 0.9 36040.0 80
11 Azerbaijan 2 8406000 0.52 0.24 0.9 5430.0 64
12 Bahamas 4 327000 0.91 0.27 NULL NULL 74
Retrieve only country name and population:
SELECT country_name, population FROM who LIMIT 12;
Afghanistan 26088000
Albania 3172000
Algeria 33351000
Andorra 74000
Angola 16557000
Antigua and Barbuda 84000
Argentina 39134000
Armenia 3010000
Australia 20530000
Austria 8327000
Azerbaijan 8406000
Bahamas 327000
Retrieve unique values:
SELECT DISTINCT continent_id FROM who;
1
2
3
4
5
6
7
Filtering with WHERE
:
SELECT country_name FROM who WHERE continent_id = 4;
Antigua and Barbuda
Bahamas
Barbados
Bermuda
Canada
Mexico
United States of America
SELECT country_name FROM who WHERE population > 100000000;
Bangladesh
Brazil
China
India
Indonesia
Japan
Mexico
Nigeria
Pakistan
Russia
United States of America
Select subqueries
Nest queries to create more complex queries:
WITH q AS (SELECT * FROM who WHERE continent_id = 1)
SELECT country_name, population
FROM q;
or
SELECT country_name, population
FROM (SELECT * FROM who WHERE continent_id = 1) q;
Afghanistan 26088000
Bahrain 739000
Cape Verde 519000
Djibouti 819000
Egypt 74166000
Iran (Islamic Republic of) 70270000
Iraq 28506000
Israel 6810000
Jordan 5729000
Kuwait 2779000
Libyan Arab Jamahiriya 6039000
Malta 405000
Morocco 30853000
Oman 2546000
Qatar 821000
Saudi Arabia 24175000
Syria 19408000
Tunisia 10215000
United Arab Emirates 4248000
West Bank and Gaza NULL
Yemen 21732000
Joins
To do some examples to combine rows, first, create continents table to combine together with the "who" table:
CREATE TABLE continents (id INT, name STRING);
INSERT INTO TABLE continents
VALUES (1, 'Middle East'),
(2, 'Europe'),
(3, 'Africa'),
(4, 'North America'),
(5, 'South America'),
(6, 'Oceania'),
(7, 'Asia');
Display the names of the countries:
SELECT t1.country_name AS country, t2.name AS continent
FROM who t1
JOIN continents t2
ON t1.continent_id = t2.id
LIMIT 12;
or an equivalent with an implicit join:
SELECT t1.country_name AS country, t2.name AS continent
FROM who t1, continents t2
WHERE t1.continent_id = t2.id
LIMIT 12;
Afghanistan Middle East
Albania Europe
Algeria Africa
Andorra Europe
Angola Africa
Antigua and Barbuda North America
Argentina South America
Armenia Europe
Australia Oceania
Austria Europe
Azerbaijan Europe
Bahamas North America
In our case, an outer join (left or right) will make the same effect joining the tables "who" and "continents", since all the countries have defined a continent id found in the "continents" table and all the ids from the "continents" table can be found in the "who" table. Just take into account that JOIN
will return the matches in both tables and LEFT JOIN
will return all the rows in the left table, RIGHT JOIN
all the rows of the right table and FULL JOIN
all the rows in both tables. If there's no match in a table it will return NULL
. For the CROSS JOIN
, it will return all the row combinations of the left an right tables, which is the same as an inner join with no ON
condition.
Unions
Unions merge data vertically. Let's create a table with the countries with a GNI greater than $40,000 to do some examples with unions:
CREATE TABLE countries_gni40000 (name STRING, gni INT);
INSERT OVERWRITE TABLE countries_gni40000
SELECT country_name, gni FROM who WHERE gni >= 40000;
SELECT * FROM countries_gni40000;
Brunei Darussalam 49900
Kuwait 48310
Luxembourg 60870
Norway 50070
Singapore 43300
Switzerland 40840
United States of America 44070
Display the north american countries and the countries with a GNI greater than $40,000:
SELECT t1.country_name AS name
FROM who t1
WHERE t1.continent_id = 4
UNION ALL
SELECT t2.name AS name
FROM countries_gni40000 t2;
Antigua and Barbuda
Bahamas
Barbados
Bermuda
Canada
Mexico
United States of America
Brunei Darussalam
Kuwait
Luxembourg
Norway
Singapore
Switzerland
United States of America
Same query but removing the duplicates:
SELECT DISTINCT name FROM (
SELECT country_name AS name
FROM who
WHERE continent_id = 4
UNION ALL
SELECT name AS name
FROM countries_gni40000
) q;
Antigua and Barbuda
Bahamas
Barbados
Bermuda
Brunei Darussalam
Canada
Kuwait
Luxembourg
Mexico
Norway
Singapore
Switzerland
United States of America
Order and sort
The difference between ORDER BY
and SORT BY
is that the first one the sorted order is applied after the reducer and the in the second one it is applied before.
Sort/order countries by name:
SELECT name FROM countries_gni40000 ORDER BY name DESC;
or:
SELECT name FROM countries_gni40000 SORT BY name DESC;
United States of America
Switzerland
Singapore
Norway
Luxembourg
Kuwait
Brunei Darussalam
Order by continent, country name and create a rank with reversed order by population, aggregated using PARTITION BY
(do not mix up with partitions when creating tables; more about aggregations in the next section):
SELECT
continent_id,
country_name,
rank() OVER (PARTITION BY continent_id ORDER BY population DESC) AS rank,
population
FROM who
ORDER BY continent_id, country_name
LIMIT 30;
1 Afghanistan 5 26088000
1 Bahrain 18 739000
1 Cape Verde 19 519000
1 Djibouti 17 819000
1 Egypt 1 74166000
1 Iran (Islamic Republic of) 2 70270000
1 Iraq 4 28506000
1 Israel 10 6810000
1 Jordan 12 5729000
1 Kuwait 14 2779000
1 Libyan Arab Jamahiriya 11 6039000
1 Malta 20 405000
1 Morocco 3 30853000
1 Oman 15 2546000
1 Qatar 16 821000
1 Saudi Arabia 6 24175000
1 Syria 8 19408000
1 Tunisia 9 10215000
1 United Arab Emirates 13 4248000
1 West Bank and Gaza 21 NULL
1 Yemen 7 21732000
2 Albania 39 3172000
2 Andorra 49 74000
2 Armenia 40 3010000
2 Austria 23 8327000
2 Azerbaijan 22 8406000
2 Belarus 20 9742000
2 Belgium 16 10430000
2 Bosnia and Herzegovina 36 3926000
2 Bulgaria 24 7693000
Aggregation
Number of countries with a life expectancy bigger than 80 years:
SELECT count(*) AS total
FROM who
WHERE life_expectancy > 80;
13
Number of countries by continent:
SELECT continent_id, count(*) AS countries
FROM who
GROUP BY continent_id;
1 21
2 51
3 48
4 7
5 31
6 35
7 9
Multiple aggregate functions. Minimum and maximum population in a country by continent:
SELECT continent_id, min(population), max(population)
FROM who
GROUP BY continent_id;
1 405000 74166000
2 31000 143221000
3 86000 144720000
4 84000 302841000
5 50000 189323000
6 2000 228864000
7 300000 1328474000
Conditionals. Number of countries with largely urban population:
SELECT sum(CASE WHEN population_urban > 0.5 THEN 1 ELSE 0 END)
FROM who;
or:
SELECT sum(if(population_urban > 0.5, 1, 0))
FROM who;
112
Unique values. Count continents:
SELECT count(DISTINCT continent_id)
FROM who;
or, with better performance for big tables using multiple reducers:
SELECT count(*)
FROM (SELECT DISTINCT continent_id FROM who) q;
7
When using groups and unions, the next operation will be done in different stages:
SELECT continent_id, NULL AS life_expectancy, count(country_name)
FROM who
GROUP BY continent_id
UNION ALL
SELECT NULL AS continent_id, life_expectancy, count(country_name)
FROM who
GROUP BY life_expectancy
LIMIT 12;
for a better performance use GROUPING SETS
, then all processes will be completed in one single stage job.
SELECT continent_id, life_expectancy, count(country_name)
FROM who
GROUP BY continent_id, life_expectancy
GROUPING SETS(continent_id, life_expectancy)
LIMIT 12;
1 NULL 1
1 42 1
1 56 2
1 61 1
1 68 1
1 70 2
1 71 2
1 72 4
1 74 1
1 75 1
1 77 1
1 78 2
For a more combinations of groups may be useful to use ROLLUP
to combine n+1 aggregated columns:
SELECT continent_id, life_expectancy, count(country_name)
FROM who
GROUP BY continent_id, life_expectancy WITH ROLLUP
LIMIT 12;
is equivalent to:
SELECT continent_id, life_expectancy, count(country_name)
FROM who
GROUP BY continent_id, life_expectancy
GROUPING SETS((continent_id, life_expectancy), (continent_id), ())
LIMIT 12;
NULL NULL 202
1 NULL 21
1 NULL 1
1 42 1
1 56 2
1 61 1
1 68 1
1 70 2
1 71 2
1 72 4
1 74 1
1 75 1
Following with the combinations of groups, CUBE
will combine all the possible combinations (2^n):
SELECT continent_id, life_expectancy, count(country_name)
FROM who
GROUP BY continent_id, life_expectancy WITH CUBE
LIMIT 12;
is equivalent to:
SELECT continent_id, life_expectancy, count(country_name)
FROM who
GROUP BY continent_id, life_expectancy
GROUPING SETS((continent_id, life_expectancy), (continent_id), (life_expectancy), ())
LIMIT 12;
NULL NULL 202
NULL NULL 9
NULL 40 1
NULL 41 1
NULL 42 4
NULL 43 2
NULL 44 1
NULL 46 3
NULL 47 2
NULL 48 3
NULL 49 1
NULL 50 4
Filtered aggregation. Get the average life expectancy by continent when this is greated than 70. The next two operations are equivalent:
SELECT continent_id, mean_le
FROM (
SELECT continent_id, avg(life_expectancy) AS mean_le
FROM who
GROUP BY continent_id
) q
WHERE mean_le > 70;
SELECT continent_id, avg(life_expectancy) AS mean_le
FROM who
GROUP BY continent_id
HAVING mean_le > 70;
2 74.88235294117646
4 75.83333333333333
5 71.65517241379311