Spark - Distinct


1 - About

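distinct([numTasks]) returns a new data set (RDD) that contains the distinct elements of the source data set. It is a transformation, so it is evaluated lazily: nothing is computed until an action such as collect() is called. The optional numTasks argument sets the number of partitions of the resulting RDD.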
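
To make the role of numTasks more concrete, here is a minimal plain-Python sketch of the idea behind a partitioned distinct. It does not use Spark and is not Spark's implementation: distinct_sketch and num_tasks are hypothetical names chosen for illustration. It only shows why equal elements must be routed to the same partition before duplicates can be dropped.

```python
from collections import defaultdict


def distinct_sketch(elements, num_tasks=2):
    """Emulate a partitioned distinct on a plain Python list (illustration only)."""
    # Step 1 (the "shuffle"): route each element to a partition by its hash,
    # so that equal elements always land in the same partition.
    partitions = defaultdict(set)
    for element in elements:
        # Adding to a set drops duplicates within the partition.
        partitions[hash(element) % num_tasks].add(element)

    # Step 2: concatenate the de-duplicated partitions.
    result = []
    for index in sorted(partitions):
        result.extend(partitions[index])
    return result


result = distinct_sketch([1, 4, 2, 2, 3], num_tasks=2)
print(sorted(result))  # sorted only to get a stable display order
```

As in this sketch, the order of the elements returned by a real distinct() should not be relied upon; sort the result if a stable order is needed.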

2 - Example

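# sc is an existing SparkContext (for example, the one provided by the pyspark shell)
rdd2 = sc.parallelize([1,4,2,2,3])
# distinct() is lazy; collect() triggers the computation and returns a list
rdd2.distinct().collect()
[1,4,2,2,3] → [1,4,2,3] (the order of the elements in the result is not guaranteed)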
db/spark/rdd/distinct.txt · Last modified: 2018/06/05 11:13 by gerardnico