Spark - Resilient Distributed Datasets (RDDs)

About

Resilient distributed datasets (RDDs) are one of the core data structures in Spark.

  • Write programs in terms of operations on distributed datasets
  • Partitioned collections of objects spread across a cluster, stored in memory or on disk
  • RDDs built and manipulated through a diverse set of parallel transformations (map, filter, join) and actions (count, collect, save)
  • RDDs automatically rebuilt on machine failure

RDDs cannot be changed once they are created - they are immutable. You create new RDDs by applying transformations to existing ones, and Spark automatically tracks how you create and manipulate RDDs (their lineage) so that it can reconstruct any data that is lost due to a slow or failed machine. Operations on RDDs are performed in parallel.

  • You cannot change it. (An RDD is immutable once created.)
  • You can transform it.
  • You can perform actions on it.

Spark tracks lineage information so that it can efficiently recompute any data that is lost when a machine fails or crashes.
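For example, you can inspect an RDD's lineage with toDebugString. A minimal sketch, assuming an existing JavaSparkContext named sc and a hypothetical input file:

// Build a small chain of transformations (nothing executes yet)
JavaRDD<String> lines = sc.textFile("data.txt");   // hypothetical file
JavaRDD<Integer> lengths = lines
        .map(String::length)                       // transformation
        .filter(len -> len > 0);                   // transformation

// toDebugString() prints the lineage: the chain of parent RDDs
// that Spark would replay to rebuild any lost partitions
System.out.println(lengths.toDebugString());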

RDDs can contain any type of data. See Spark RDD - (Creation|Construction|Initialization)

Spark enables operations on collections of elements in parallel.

You can parallelize existing collections, such as Python lists, by turning them into RDDs.

  • distributed because the data may be distributed across multiple executors in the cluster.
  • resilient because the lineage of the data is preserved and, therefore, the data can be re-created on a new node at any time. Lineage is the sequence of operations that was applied to the base data set.

Script

An RDD script
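A minimal sketch of such a script in Java (the file path and app name are illustrative):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class RddScript {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RddScript").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Creation: build an RDD from a text file (path is hypothetical)
        JavaRDD<String> lines = sc.textFile("data.txt");

        // Transformation: keep non-empty lines (lazy, nothing runs yet)
        JavaRDD<String> nonEmpty = lines.filter(line -> !line.isEmpty());

        // Action: count() triggers the actual computation
        System.out.println("Non-empty lines: " + nonEmpty.count());

        sc.stop();
    }
}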

Spark Pipeline

The action (collect) causes the preceding steps (parallelize, filter, and map) to be executed. An RDD follows a builder pattern where the last method either builds an object or simply performs an action.
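A sketch of such a pipeline (assuming an existing JavaSparkContext named sc and the usual java.util imports):

JavaRDD<Integer> numbers = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6));
List<Integer> result = numbers
        .filter(n -> n % 2 == 0)   // transformation: lazy
        .map(n -> n * 10)          // transformation: lazy
        .collect();                // action: triggers the whole pipeline
System.out.println(result);       // [20, 40, 60]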

When you perform transformations and actions that use functions, Spark automatically pushes a closure containing that function to the workers so that it can run there.
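For example, a lambda that references a local variable forms a closure; Spark serializes the function together with the captured variable and ships both to the executors (a sketch, sc as above):

final int threshold = 3;                          // captured by the closure below
JavaRDD<Integer> filtered = sc
        .parallelize(Arrays.asList(1, 2, 3, 4, 5))
        // the lambda and the captured 'threshold' are serialized
        // and sent to the workers, where the filter actually runs
        .filter(n -> n > threshold);
System.out.println(filtered.collect());           // [4, 5]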

Creation

In your driver program, you create an RDD from:

  • a file
  • or from a collection (parallelize a collection)

See Spark RDD - (Creation|Construction|Initialization)
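Both creation paths look like this in the Java API (paths and values are illustrative; sc is an existing JavaSparkContext):

// 1. From a file
JavaRDD<String> fromFile = sc.textFile("hdfs:///path/to/data.txt");

// 2. From an in-memory collection (parallelize)
JavaRDD<Integer> fromCollection = sc.parallelize(Arrays.asList(10, 20, 30));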

Transformation

You then specify transformations on that RDD. They lazily create new RDDs (without immediately applying the transformation).

Spark remembers the set of transformations that are applied to a base data set. It can then optimize the required calculations and automatically recover from failures and slow workers.

Spark transformations create new data sets from an existing one.
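A sketch of this laziness (sc and the file name as before):

JavaRDD<String> lines = sc.textFile("data.txt");                  // lazy: no file read yet
JavaRDD<String> errors = lines.filter(l -> l.contains("ERROR"));  // lazy: just remembered
// Only the action below forces Spark to read the file and apply the filter:
long numErrors = errors.count();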

Cache

We can cache some RDDs for reuse.
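For example (a sketch; cache() keeps the RDD in memory after the first action computes it, and persist() lets you choose another storage level instead):

JavaRDD<String> errors = sc.textFile("log.txt")                 // hypothetical log file
                           .filter(l -> l.contains("ERROR"));
errors.cache();                                                 // mark for in-memory reuse
long total = errors.count();                                    // computed once, then cached
long mysqlErrors = errors.filter(l -> l.contains("MySQL"))
                         .count();                              // reuses the cached data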

Action

We perform actions that are executed in parallel on the RDD.

Examples are collect and count.
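A sketch of both (sc as before):

JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3));
long n = rdd.count();               // action: returns 3
List<Integer> all = rdd.collect();  // action: brings all elements back to the driver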

Loop / Show

Example iterating over the rows of a JavaRDD<Row>:

// rowJavaRDD is assumed to be a JavaRDD<Row>; take(10) is an action that
// returns the first 10 rows to the driver as a java.util.List<Row>
rowJavaRDD.take(10).forEach(x -> {
    // print every column of the row, comma-separated
    for (int i = 0; i < x.size(); i++) {
        System.out.print(x.get(i) + ",");
    }
    System.out.println();
});

Property

Internally, each RDD is characterized by five main properties:

  • A list of partitions
  • A function for computing each split
  • A list of dependencies on other RDDs
  • Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
  • Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

Partition

The number of partitions is an RDD property.
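You can request a partition count at creation time and read it back (a sketch, sc as before):

// Ask for 4 partitions when parallelizing the collection
JavaRDD<Integer> rdd = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8), 4);
System.out.println(rdd.getNumPartitions());   // 4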

Operations

There are two types of operations you can perform on an RDD:

  • transformations, which lazily create a new data set from an existing one
  • actions, which trigger computation and return a result to the driver program

You can also persist, or cache, RDDs in memory or on disk.

An RDD follows a builder pattern where the last method performs an action, whereas the previous methods are transformations that merely describe the computation.
