Apache - Pig

1 - About

Pig is a platform for processing large data sets on Hadoop.

Pig provides:

  • Relational algebra over Hadoop through its language, Pig Latin, which Pig compiles into its own query execution plan

Pig does not use a pure relational data model: it is “schema-on-read” rather than “schema-on-write”.
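
As a minimal sketch of schema-on-read, the same file can be loaded with or without a schema; the schema is attached when the data is read, not when it is written. The file name 'users' is reused from the example on this page, and the declared types are assumptions made for illustration.

```pig
-- No schema declared: fields are addressed by position ($0, $1, ...)
-- and default to the bytearray type.
Raw = load 'users';
Names = foreach Raw generate $0;

-- Schema declared in the load statement: names and types (assumed here)
-- are applied while reading; the file itself is unchanged.
Typed = load 'users' as (name:chararray, age:int);
Adults = filter Typed by age >= 18;

-- Print the schema Pig has attached to the relation.
describe Typed;
```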

Improvements to Pig have often made it as efficient as hand-written MapReduce code.

You can work with data in its native format, in situ, without first loading it into a database.

Pipelines operate on collections of tuples.


3 - Example

The following Pig Latin script finds the five most-clicked URLs among users aged 18 to 25:

-- Keep only users aged 18 to 25.
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
-- Attach each user's page activity.
Pages = load 'Activity Data' as (user, url);
Jnd = join Fltrd by name, Pages by user;
-- Count clicks per URL, sort, and keep the top five.
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites';

4 - Data Model

  • Atom: a simple value such as an integer or a string
  • Tuple:
    • A sequence of fields
    • Each field can be of any type
  • Bag:
    • A collection of tuples
    • The tuples need not all have the same type
    • Duplicates are allowed
  • Map:
    • A dictionary (key:value) with string keys, each mapped to a value of any type

5 - Command

Because of lazy evaluation, no work is done until a STORE (or DUMP) statement is reached.

The execution plan is reduced to a minimum number of MapReduce jobs, because each job is expensive.
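
A short sketch of this behavior, reusing the 'users' input from the example section: the first statements only extend the plan, EXPLAIN prints the plan without running it, and execution starts at STORE. The output name 'young_users' is a hypothetical placeholder.

```pig
-- These statements are only added to the plan; nothing runs yet.
Users = load 'users' as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;

-- Show the logical, physical, and MapReduce plans without executing them.
explain Fltrd;

-- Execution is triggered here.
store Fltrd into 'young_users';
```

Reading the EXPLAIN output is one way to see how many MapReduce jobs a script will be compiled into before paying the cost of running them.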