Spark - (RDD) Transformation

1 - About

Spark transformations create a new data set (RDD) from an existing one.

Spark remembers the sequence of transformations applied to a base data set (its lineage). Transformations are lazy: nothing is computed until an action requires a result, which lets Spark optimize the required calculations and automatically recover from failures and slow workers.
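A minimal sketch in the PySpark shell: the transformations below only record lineage, and nothing runs until the collect() action is called.

>>> rdd = sc.parallelize(range(10))               # base data set
>>> squares = rdd.map(lambda x: x * x)            # transformation, nothing is computed yet
>>> even = squares.filter(lambda x: x % 2 == 0)   # another transformation, still lazy
>>> even.collect()                                # action: the recorded lineage is executed
[0, 4, 16, 36, 64]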

2 - List

Transformations:

filter: returns a new data set formed by selecting those elements of the source on which a function returns true.
distinct([numTasks]): returns a new data set that contains the distinct elements of the source data set.
map / flatMap: return a new distributed data set formed by passing each element of the source through a function; flatMap can map each input element to zero or more output elements, which are flattened into the result.
zip (optionally zipWithIndex or zipWithUniqueId): returns key-value pairs that pair the i-th elements of the two RDDs: <math>\forall i \in \{0, \dots, N\}: (rdd1_i, rdd2_i)</math>
split: splits the data set into several RDDs.
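A minimal sketch of these transformations in the PySpark shell (distinct does not guarantee ordering, hence the sort; zip assumes both RDDs have the same partitioning and the same number of elements):

>>> rdd = sc.parallelize([1, 2, 2, 3, 4])
>>> rdd.filter(lambda x: x > 2).collect()
[3, 4]
>>> sorted(rdd.distinct().collect())
[1, 2, 3, 4]
>>> rdd.map(lambda x: [x, x]).collect()           # one output element per input
[[1, 1], [2, 2], [2, 2], [3, 3], [4, 4]]
>>> rdd.flatMap(lambda x: [x, x]).collect()       # outputs are flattened
[1, 1, 2, 2, 2, 2, 3, 3, 4, 4]
>>> sc.parallelize(['a', 'b']).zip(sc.parallelize([1, 2])).collect()
[('a', 1), ('b', 2)]
>>> sc.parallelize(['a', 'b', 'c']).zipWithIndex().collect()
[('a', 0), ('b', 1), ('c', 2)]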

2.1 - Pipe

pipe returns an RDD created by piping the elements of the source RDD to a forked external process.

pipe(command, env={})

Example

>>> sc.parallelize(['1', '2', '', '3']).pipe('cat').collect()
[u'1', u'2', u'', u'3']
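As a further sketch (assuming a Unix-like environment where tr is on the PATH): each element is written as one line to the external process's stdin, and each line of the process's stdout becomes an element of the resulting RDD.

>>> sc.parallelize(['spark', 'rdd']).pipe('tr a-z A-Z').collect()
[u'SPARK', u'RDD']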