Spark - (Map|flatMap)

1 - About

The implementation in Spark of the map step of MapReduce.

  • map(func) returns a new distributed dataset formed by passing each element of the source through the function func.
  • flatMap(func) is similar to map, but each input element can be mapped to zero or more output elements: func returns a sequence, and the sequences are flattened into a single result.

3 - Example

3.1 - Multiplication

rdd = sc.parallelize([1, 2, 3, 4])
rdd.map(lambda x: x * 2).collect()
[1,2,3,4] → [2,4,6,8]
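Since rdd.map applies the function element by element, the result can be sketched with Python's builtin map (a local analogy, no Spark needed; the list stands in for the RDD's contents):

```python
# Local analogy of rdd.map: apply the function to each element.
data = [1, 2, 3, 4]                        # stands in for the RDD's contents
doubled = list(map(lambda x: x * 2, data))
print(doubled)                             # [2, 4, 6, 8]
```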

3.2 - flatMap

3.2.1 - On 1 list

rdd = sc.parallelize([1, 2, 3])
rdd.flatMap(lambda x: [x, x + 5]).collect()
[1,2,3] → [1,6,2,7,3,8]
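flatMap's one-to-many behaviour can be mimicked locally with itertools.chain.from_iterable (a plain-Python sketch, not Spark code): each element produces a list, and the lists are concatenated in order.

```python
from itertools import chain

data = [1, 2, 3]                  # stands in for the RDD's contents
f = lambda x: [x, x + 5]          # each element yields two output elements
flat = list(chain.from_iterable(f(x) for x in data))
print(flat)                       # [1, 6, 2, 7, 3, 8]
```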

3.2.2 - On a list of list

sc.parallelize([[1,2],[3,4]]).flatMap(lambda x:x).collect()
[1, 2, 3, 4]
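Because the identity function lambda x: x returns each inner list unchanged, flatMap only performs the flattening here: one level of concatenation, which locally looks like this (again a sketch without Spark):

```python
from itertools import chain

nested = [[1, 2], [3, 4]]
flat = list(chain.from_iterable(nested))   # one level of flattening only
print(flat)                                # [1, 2, 3, 4]
```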

3.3 - Key Value

Key-value (tuple) transformations are supported:

from operator import add
wordCounts = sc.parallelize([('rat', 2), ('elephant', 1), ('cat', 2)])
totalCount = (wordCounts
              .map(lambda kv: kv[1])
              .reduce(add))
average = totalCount / float(wordCounts.count())
print(totalCount)
print(round(average, 2))
5
1.67
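The same computation on the (word, count) pairs can be checked without Spark: keep the value of each pair (the map step), sum the values (the reduce step), and divide by the number of pairs (a plain-Python sketch of the pipeline above):

```python
from operator import add
from functools import reduce

word_counts = [('rat', 2), ('elephant', 1), ('cat', 2)]
values = [count for _, count in word_counts]  # map step: keep the value
total = reduce(add, values)                   # reduce step: sum the values
average = total / len(word_counts)
print(total)                                  # 5
print(round(average, 2))                      # 1.67
```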
db/spark/rdd/map.txt · Last modified: 2018/06/05 11:13 by gerardnico