Spark - (Random) Split

About

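This page covers the random percentage split, a statistical resampling method, as implemented in Spark (see Statistics - Resampling through Random Percentage Split).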

Function

randomSplit

randomSplit randomly splits an RDD into several RDDs according to the provided weights.

randomSplit(weights, seed=None)

where:

  • weights – weights for splits, will be normalized if they don’t sum to 1
  • seed – random seed
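The normalization of weights can be pictured with a small pure-Python sketch. It is an illustration only, not Spark's source code: the helper name normalized_boundaries is hypothetical, and it assumes that normalizing means dividing each weight by the sum of all weights and accumulating the result into cut points between 0 and 1.

```python
from itertools import accumulate


def normalized_boundaries(weights):
    """Hypothetical helper: turn raw weights into cumulative cut points in [0, 1]."""
    total = float(sum(weights))
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative with a positive sum")
    # Divide by the total so the fractions sum to 1, whatever the input scale
    fractions = [w / total for w in weights]
    # Running sum of the fractions, prefixed with 0.0 as the first boundary
    return [0.0] + list(accumulate(fractions))


# Weights that do not sum to 1 describe the same split as their normalized form
print(normalized_boundaries([8, 1, 1]))
print(normalized_boundaries([.8, .1, .1]))
```

Under this assumption, [8, 1, 1] and [.8, .1, .1] give the same boundaries up to floating-point rounding, so both describe an 80/10/10 split.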

Example of percentage split

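weights = [.8, .1, .1]  # 80% training, 10% validation, 10% test
seed = 42  # fixed integer seed for the random number generator
# Use randomSplit with weights and seed (rawData is an existing RDD)
rawTrainData, rawValidationData, rawTestData = rawData.randomSplit(weights, seed)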

The weights set expected proportions, not exact counts: randomSplit() assigns each entry to a split at random, so the number of entries in each dataset varies slightly around 80%, 10%, and 10% of the total.
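Why the counts vary can be seen in a simplified pure-Python model of a weighted random split. This is a sketch under stated assumptions, not Spark's implementation: the function simulate_split is hypothetical, it runs on a plain list instead of an RDD, and it assumes each element draws one uniform random number and lands in the split whose cumulative weight range contains that number.

```python
import random
from bisect import bisect_right


def simulate_split(items, weights, seed):
    """Hypothetical model of a weighted random split over a plain list."""
    total = float(sum(weights))
    # Cumulative upper bounds of each split's range, e.g. [0.8, 0.9, 1.0]
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)  # seeded generator, independent of global state
    splits = [[] for _ in weights]
    for item in items:
        x = rng.random()  # uniform draw in [0, 1)
        # Index of the first range whose upper bound exceeds x; the clamp
        # guards against rounding leaving the last bound just under 1.0
        idx = min(bisect_right(bounds, x), len(weights) - 1)
        splits[idx].append(item)
    return splits


data = list(range(10_000))
for seed in (42, 43):
    sizes = [len(s) for s in simulate_split(data, [.8, .1, .1], seed)]
    print(seed, sizes)
```

In this model every element ends up in exactly one split, the sizes stay close to 8000, 1000, and 1000 without matching them exactly, and a different seed gives slightly different sizes.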




