Apache Pig is a platform for analyzing large datasets. Pig offers a higher level of abstraction than raw MapReduce, so you can work with richer data structures, but generally at some cost in performance.
Pig can run in local mode, inside a single JVM against the local filesystem, or in MapReduce mode, compiling scripts into MapReduce jobs and running them on a Hadoop cluster. For the following examples it is irrelevant which mode you run Pig in.
Simple example with CSV records
Using the NCDC weather dataset with hourly precipitation recorded at American weather stations (http://www.ncdc.noaa.gov/orders/qclcd/), get the total precipitation per station and the total precipitation per date for a single station.
Read the data (use your own paths to the piggybank jar and the dataset):
-- use CSVExcelStorage from piggybank to read the CSV, skipping the header
REGISTER '/usr/local/pig/contrib/piggybank/java/piggybank.jar';
records = LOAD '/input/199804hpd.txt'
    USING org.apache.pig.piggybank.storage.CSVExcelStorage(',', 'NO_MULTILINE', 'UNIX', 'SKIP_INPUT_HEADER')
    AS (wban:chararray, date:chararray, time:chararray, hp:float);
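For readers less familiar with Pig, what CSVExcelStorage does here can be sketched in plain Python: skip the header row and parse each line into a typed tuple matching the declared schema. The sample data below is made up for illustration and only mimics the shape of the NCDC file.

```python
import csv
import io

# Hypothetical sample of the hourly-precipitation CSV; field order follows
# the Pig schema above: (wban, date, time, hp).
data = """Wban Number,YearMonthDay,Time,Hourly Precip
93808,19980401,0100,0.05
93808,19980401,0200,0.10
03013,19980401,0100,0.00
"""

reader = csv.reader(io.StringIO(data))
next(reader)  # skip the header line, as SKIP_INPUT_HEADER does
# cast the last field to float, as the hp:float schema entry does
records = [(wban, date, time, float(hp)) for wban, date, time, hp in reader]
print(records)
```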
Analyze the data:
-- precipitations registered in each weather station
grouped_records = GROUP records BY wban;
sum_records = FOREACH grouped_records GENERATE group, SUM(records.hp);
DESCRIBE sum_records;
DUMP sum_records;

-- daily precipitations registered in weather station 93808
filtered_records = FILTER records BY wban == '93808';
grouped_records = GROUP filtered_records BY date;
sum_records = FOREACH grouped_records GENERATE group, SUM(filtered_records.hp);
DESCRIBE sum_records;
DUMP sum_records;
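The two GROUP/SUM pipelines above are equivalent to grouped aggregations, which can be sketched in plain Python (toy records mirroring the (wban, date, time, hp) schema; the values are made up):

```python
from collections import defaultdict

records = [
    ('93808', '19980401', '0100', 0.05),
    ('93808', '19980401', '0200', 0.10),
    ('93808', '19980402', '0100', 0.20),
    ('03013', '19980401', '0100', 0.30),
]

# GROUP records BY wban; GENERATE group, SUM(records.hp)
per_station = defaultdict(float)
for wban, date, time, hp in records:
    per_station[wban] += hp

# FILTER ... BY wban == '93808'; GROUP ... BY date; SUM(...hp)
per_date = defaultdict(float)
for wban, date, time, hp in records:
    if wban == '93808':
        per_date[date] += hp

print(dict(per_station))  # total precipitation per station
print(dict(per_date))     # daily totals for station 93808
```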
More complex dataset structure
Read a fixed-width-column dataset with many columns of weather information and extract the desired columns:
-- read file without schema
records = LOAD '/input/199803dailyavg.txt';
-- skip first row (header) and define the entire row as the schema
records = STREAM records THROUGH `tail -n +2` AS (row:chararray);
-- get desired data splitting the row by index
records = FOREACH records GENERATE
    SUBSTRING(row, 0, 5) AS wban,
    SUBSTRING(row, 6, 12) AS year_month,
    (float)TRIM(SUBSTRING(row, 15, 20)) AS max_temp,
    (float)TRIM(SUBSTRING(row, 27, 31)) AS min_temp;
DESCRIBE records;
DUMP records;
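Pig's SUBSTRING takes an inclusive start index and an exclusive stop index, just like Python slicing, so the column extraction above can be sketched as follows. The rows below are invented and only shaped like the 199803dailyavg.txt file; the slice offsets match the SUBSTRING calls in the script.

```python
# Hypothetical fixed-width rows; first line is the header that `tail -n +2` drops.
rows = [
    "WBAN  YearMo       MaxT       MinT",
    "93808 199803    71.2       48.9",
]

records = []
for row in rows[1:]:  # tail -n +2: skip the header line
    records.append((
        row[0:5],                   # SUBSTRING(row, 0, 5)   -> wban
        row[6:12],                  # SUBSTRING(row, 6, 12)  -> year_month
        float(row[15:20].strip()),  # (float)TRIM(SUBSTRING(row, 15, 20)) -> max_temp
        float(row[27:31].strip()),  # (float)TRIM(SUBSTRING(row, 27, 31)) -> min_temp
    ))
print(records)
```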