MapReduce - Map (Mapper)

Figure: the MapReduce pipeline

About

This page describes the Map (Mapper) implementation in a Hadoop application.

Mapper maps input key/value pairs to a set of intermediate key/value pairs.

Maps are the individual tasks that transform input records into intermediate records.

Implementation

Applications implement the map function (a sketch follows this list) and:

  • collect output pairs with calls to context.write(WritableComparable, Writable).
  • can override the cleanup(Context) method to perform any required cleanup.
  • can use a Counter to report statistics.
  • can control the grouping by specifying a Comparator via Job.setGroupingComparatorClass(Class).
  • can specify a combiner, via Job.setCombinerClass(Class), to perform local aggregation of the intermediate outputs, which helps cut down the amount of data transferred from the Mapper to the Reducer.
  • can control the compression of the intermediate outputs via the Configuration.
  • can control which keys (and hence records) go to which Reducer by implementing a custom Partitioner.
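
A minimal sketch of such a Mapper, based on the classic word-count example from the Hadoop tutorial (the counter group and name are illustrative):

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Word-count Mapper: input key/value types (Object, Text),
  // intermediate key/value types (Text, IntWritable)
  public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        // collect an intermediate output pair
        context.write(word, ONE);
        // report statistics via a Counter (illustrative group/name)
        context.getCounter("WordCount", "TOKENS").increment(1);
      }
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      // optional: release any resources held by this map task
    }
  }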

The Mapper outputs are:

  • sorted
  • stored in a simple (key-len, key, value-len, value) format.
  • passed to the Reducer.
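
The combiner, partitioner, and compression settings listed above are wired up in the job driver. A minimal sketch, assuming the TokenizerMapper above and an IntSumReducer class (the word-count reducer from the Hadoop tutorial, reused here as the combiner):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

  public class WordCountDriver {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // compress the sorted intermediate map outputs before the shuffle
      conf.setBoolean("mapreduce.map.output.compress", true);

      Job job = Job.getInstance(conf, "word count");
      job.setJarByClass(WordCountDriver.class);
      job.setMapperClass(TokenizerMapper.class);
      // local aggregation of the intermediate outputs
      job.setCombinerClass(IntSumReducer.class);
      job.setReducerClass(IntSumReducer.class);
      // controls which keys go to which Reducer (HashPartitioner is the default)
      job.setPartitionerClass(HashPartitioner.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }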

Management

Number

The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

Example: with 10 TB of input data and a block size of 128 MB:

<MATH> \text{number of maps} = \frac{10 \times 1024 \times 1024 \text{ MB}}{128 \text{ MB}} = 81\,920 </MATH>

The total number of maps can be set higher with Configuration.set(MRJobConfig.NUM_MAPS, int), but this provides just a hint to the framework.
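
A sketch of setting this hint (MRJobConfig.NUM_MAPS is the constant name of the mapreduce.job.maps property; the helper method is hypothetical):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.MRJobConfig;

  public class MapCountHint {
    public static Configuration withMapHint(int maps) {
      Configuration conf = new Configuration();
      // MRJobConfig.NUM_MAPS == "mapreduce.job.maps"; this is only a hint,
      // the InputFormat's split computation has the final say
      conf.setInt(MRJobConfig.NUM_MAPS, maps);
      return conf;
    }
  }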

  • The right level of parallelism for maps seems to be around 10-100 maps per node.
  • It is best if the maps take at least a minute to execute.
