Apache - Pig

About

Pig is:

An Apache open source project (From Yahoo, available open source)
An engine for executing programs on top of Hadoop

Pig provides:

Relational Algebra over Hadoop through its language, Pig Latin to create its own query execution plan

Not a pure relational data model. “Schema-on-Read” rather than “Schema-on-write”

Improvements on the Pig language have made it often just as efficient as writing the code in Map Reduce.

You can work with (native|in situ) data.

Pipeline are performed on collections of Tuples

Articles Related

Example

In Pig Latin

Users = load ‘users’ as (name, age);
Fltrd = filter Users by age >= 18 and age <= 25;
Pages = load ‘Activity Data’ as (user, url);
Jnd = join Fltrd by name, Pages by user;
Grpd = group Jnd by url;
Smmd = foreach Grpd generate group, COUNT(Jnd) as clicks;
Srtd = order Smmd by clicks desc;
Top5 = limit Srtd 5;
store Top5 into 'top5sites’;

Data Model

Atom: Integer, string, etc.
Tuple:
- Sequence of fields
- Each field of any type
Bag:
- A collection of tuples
- Not necessarily the same type
- Duplicates allowed
Map:
- String dictionary (key:value) (mapped to any type)

Command

No work is done until STORE is called because of lazy evaluation.

Reduce the plan to a minimum of Map Reduce jobs because they are expensive.