Spark Engine - Data Structure (DataSet, DataFrame and RDD)


1 - About

Spark has several logical representations of a relation (table).

These data structures are all:

  • distributed
  • an abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas) via functional transformations (map, flatMap, filter, etc.)

A DataFrame is a wrapper around an RDD that adds a schema and a SQL interface.
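As a toy illustration of this idea (plain Python, not Spark code, with made-up sample data), a DataFrame is essentially the rows an RDD would hold plus a schema that lets you address fields by name instead of by position:

```python
# Toy model: a "DataFrame" pairs raw rows (as an RDD would hold them)
# with a schema, so columns can be addressed by name.
rows = [("alice", 30), ("bob", 25)]   # what the underlying RDD holds
schema = ("name", "age")              # what the DataFrame adds

def column(rows, schema, name):
    """Select one column by name, as df.select(name) would."""
    i = schema.index(name)
    return [row[i] for row in rows]

print(column(rows, schema, "age"))  # [30, 25]
```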


3 - Type

3.1 - The Dataset

The Dataset can be considered a combination of DataFrames and RDDs. It provides the typed interface available in RDDs along with many of the conveniences of DataFrames. It is the core abstraction going forward.

3.2 - The DataFrame

The DataFrame is a distributed collection of Row objects, similar in concept to DataFrames in Python pandas and R.

3.3 - The RDD (Resilient Distributed Dataset)

RDD or Resilient Distributed Dataset is the original data structure of Spark.

It is a sequence of data objects, possibly of mixed types, partitioned across the machines of a cluster.

New users should focus on Datasets as those will be supersets of the current RDD functionality.

4 - Data Structure

The scripts below provide the same functionality: each computes an average per key.

4.1 - RDD

# split must be applied per line with map; an RDD has no split method
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], (int(x[1]), 1))) \
   .reduceByKey(lambda x, y: (x[0] + y[0], x[1] + y[1])) \
   .map(lambda x: (x[0], x[1][0] / x[1][1])) \
   .collect()
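To see what each stage of the RDD pipeline produces, here is a plain-Python trace (not Spark) of the same per-key average, assuming hypothetical tab-separated (key, value) records:

```python
# Plain-Python trace of the RDD pipeline, on made-up sample input.
lines = ["a\t1", "b\t4", "a\t3"]

# map: line -> (key, (value, count))
pairs = [(x[0], (int(x[1]), 1)) for x in (l.split("\t") for l in lines)]

# reduceByKey: sum the values and the counts per key
acc = {}
for k, (v, c) in pairs:
    pv, pc = acc.get(k, (0, 0))
    acc[k] = (pv + v, pc + c)

# map: (key, (sum, count)) -> (key, average)
averages = {k: s / c for k, (s, c) in acc.items()}
print(averages)  # {'a': 2.0, 'b': 4.0}
```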

4.2 - DataFrame

  • Using DataFrames: write less code
from pyspark.sql.functions import avg

sqlCtx.table("people") \
   .groupBy("name") \
   .agg(avg("age")) \
   .collect()
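Since Spark DataFrames are modeled on pandas and R DataFrames, the same group-by average reads almost identically in pandas (sample data assumed):

```python
import pandas as pd

# Equivalent of groupBy("name").agg(avg("age")), on made-up data
people = pd.DataFrame({"name": ["alice", "bob", "alice"],
                       "age": [30, 25, 32]})
result = people.groupby("name", as_index=False).agg(avg_age=("age", "mean"))
print(result)
```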