Spark - (Reduce|Aggregate) function
1 - About
Spark lets you reduce a data set through:
- the reduce action (and the related key-value transformations reduceByKey and groupByKey)
- built-in aggregate functions such as count() or sum()
2 - Reduce
reduce is Spark's counterpart of the reduce step in the MapReduce framework.
reduce(func)
reduce is a Spark action that aggregates the elements of a data set (RDD) using a function.
That function takes two arguments and returns one.
The function must be commutative and associative so that it can be applied correctly in parallel across partitions.
reduce returns a single value, such as an int, not an RDD.
2.1 - Reduce a List
rdd = sc.parallelize([1, 2, 3])
rdd.reduce(lambda a, b: a * b)
Value: 6
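As a minimal sketch of why the function must be commutative and associative (assuming the same SparkContext sc), subtraction gives a partition-dependent result, because Spark reduces each partition independently before combining the partial results:
rdd = sc.parallelize([1, 2, 3, 4], 2)   # explicitly split across 2 partitions
rdd.reduce(lambda a, b: a - b)          # result depends on the partitioning
rdd.reduce(lambda a, b: a + b)          # always 10: addition is commutative and associative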
2.2 - Reduce a List of Tuples
2.2.1 - Numeric values
reduceByKey(func) returns a new distributed dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V, V) → V.
rdd = sc.parallelize([(1,2), (3,4), (3,6)])
rdd.reduceByKey(lambda a, b: a + b)
RDD: [(1,2), (3,4), (3,6)] → [(1,2), (3,10)]
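Note that reduceByKey is a transformation and is evaluated lazily; a minimal sketch (assuming the rdd defined above) that materializes the result with the collect() action:
rdd.reduceByKey(lambda a, b: a + b).collect()   # [(1, 2), (3, 10)]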
If the values are strings and you simply want to collect them per key rather than combine them, you can use groupByKey(). See below.
2.2.2 - String values
groupByKey() returns a new dataset of (K, Iterable<V>) pairs.
rdd2 = sc.parallelize([(1,'a'), (2,'c'), (1,'b')])
rdd2.groupByKey()
RDD: [(1,'a'), (2,'c'), (1,'b')] → [(1, ['a','b']), (2, ['c'])]
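In PySpark, groupByKey() yields an iterable (a ResultIterable) per key, so a common idiom, sketched here with the rdd2 defined above, is to turn the values into a plain list before collecting:
rdd2.groupByKey().mapValues(list).collect()   # [(1, ['a', 'b']), (2, ['c'])] (key order may vary)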
Be careful using groupByKey(), as it can cause a lot of data movement across the network and create large Iterables at the workers.
Imagine an RDD with one million pairs that all have the key 1: with groupByKey, all of those values have to fit on a single worker. So instead of groupByKey, consider using reduceByKey or a different key-value transformation.
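As an illustration (a minimal sketch, assuming the numeric pair RDD from the reduceByKey example above), both approaches below give the same per-key sums, but reduceByKey pre-aggregates the values on each partition before the shuffle, whereas groupByKey ships every individual value across the network first:
rdd = sc.parallelize([(1, 2), (3, 4), (3, 6)])
rdd.groupByKey().mapValues(sum).collect()       # [(1, 2), (3, 10)] - every value is shuffled
rdd.reduceByKey(lambda a, b: a + b).collect()   # [(1, 2), (3, 10)] - values are combined map-side first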
3 - Aggregate Functions
Spark also provides aggregate actions that compute a single value from the elements of an RDD:
- count()
- countApprox()
- countApproxDistinct()
- sum()
- max()
- mean()
- meanApprox()
- min()
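A minimal sketch of these actions on a numeric RDD (assuming a SparkContext named sc; the *Approx variants take a timeout in milliseconds and return an approximate answer):
nums = sc.parallelize([1, 2, 3, 4, 5])
nums.count()                 # 5
nums.sum()                   # 15
nums.max()                   # 5
nums.min()                   # 1
nums.mean()                  # 3.0
nums.countApprox(1000)       # approximate count, computed within ~1 second
nums.countApproxDistinct()   # approximate number of distinct elements
nums.meanApprox(1000)        # approximate mean, computed within ~1 second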