Spark - Dataset

About

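Dataset is an interface to the Spark engine, added in Spark 1.6, that provides the benefits of RDDs (strong typing and the ability to use lambda functions) together with the benefits of Spark SQL's optimized execution engine.

When running a SQL query against the Spark Thrift Server, the Dataset interface is used in the background.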

A Dataset is a strongly typed collection of domain-specific objects.

Each Dataset also has an untyped view called a DataFrame, which is a Dataset of Row. A DataFrame is therefore just a Dataset.
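The relationship between the typed and the untyped view can be sketched as follows. This is a minimal illustration, not a reference implementation: the Person case class, the object name TypedAndUntypedViews, the local master URL and the sample values are all assumptions made for the example.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}

// Hypothetical domain class, used only for illustration.
case class Person(name: String, age: Int)

object TypedAndUntypedViews {
  // Builds a small typed Dataset from in-memory objects.
  def typedPeople(spark: SparkSession): Dataset[Person] = {
    import spark.implicits._ // brings the Encoder for Person into scope
    Seq(Person("Alice", 34), Person("Bob", 15)).toDS()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for the demonstration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("typed-and-untyped-views")
      .getOrCreate()
    import spark.implicits._

    val typed: Dataset[Person] = typedPeople(spark)
    val untyped: DataFrame = typed.toDF()                 // untyped view of the same data
    val asRows: Dataset[Row] = untyped                    // compiles: DataFrame is an alias for Dataset[Row]
    val typedAgain: Dataset[Person] = untyped.as[Person]  // back to the typed view

    asRows.printSchema()
    typedAgain.show()
    spark.stop()
  }
}
```

Going from the typed view to the untyped one with toDF() needs no extra information, whereas going back with as[Person] requires an Encoder for Person, which the implicits import supplies here.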

Benefit

  • access the field of a row by name
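As a rough sketch of what access by name means in practice, the example below reads the same field once through the untyped Row API and once through the typed API. The Person case class, the object and method names and the sample data are hypothetical and only serve the illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical domain class, used only for illustration.
case class Person(name: String, age: Int)

object FieldAccessByName {
  // Untyped access: the field name is a string that is resolved at runtime.
  def namesFromRows(df: DataFrame): Seq[String] =
    df.collect().toSeq.map(row => row.getAs[String]("name"))

  // Typed access: the field is a member of Person, checked at compile time.
  def namesFromObjects(spark: SparkSession, df: DataFrame): Seq[String] = {
    import spark.implicits._ // Encoders for Person and String
    df.as[Person].map(_.name).collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for the demonstration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("field-access-by-name")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("Alice", 34), ("Bob", 15)).toDF("name", "age")
    println(namesFromRows(df))
    println(namesFromObjects(spark, df))
    spark.stop()
  }
}
```

With the Row API a misspelled field name only fails when the job runs, while the typed variant is rejected by the compiler.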

Management

Creation

A Dataset can be constructed from JVM objects and then manipulated using functional transformations (map, flatMap, filter, etc.).

  • Scala
val people = spark.read.parquet("...").as[Person]
  • Java
Dataset<Person> people = spark.read().parquet("...").as(Encoders.bean(Person.class));
  • Python does not support the Dataset API. However, because of Python's dynamic nature, many of the benefits of the Dataset API are already available (i.e. you can access the field of a row by name naturally: row.columnName).
  • R is similar to Python in this respect.
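The construction from JVM objects and the functional transformations mentioned at the start of this section could look like the sketch below in Scala. It builds the Dataset from an in-memory sequence instead of a Parquet file so that it stays self-contained; the Person case class, the helper names and the sample values are assumptions made for the example.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical domain class, used only for illustration.
case class Person(name: String, age: Int)

object FunctionalTransformations {
  // filter keeps the element type, so no extra Encoder is needed.
  def adults(people: Dataset[Person]): Dataset[Person] =
    people.filter(_.age >= 18)

  // flatMap turns each person into zero or more name parts.
  def nameParts(spark: SparkSession, people: Dataset[Person]): Dataset[String] = {
    import spark.implicits._ // Encoder for String
    people.flatMap(_.name.split(" ").toSeq)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is good enough for the demonstration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("functional-transformations")
      .getOrCreate()
    import spark.implicits._

    // A Dataset constructed directly from JVM objects.
    val people: Dataset[Person] =
      Seq(Person("Alice Smith", 34), Person("Bob Jones", 15)).toDS()

    val names: Dataset[String] = people.map(_.name) // map changes the element type

    adults(people).show()
    names.show()
    nameParts(spark, people).show()
    spark.stop()
  }
}
```

The transformations are lazy: nothing is computed until an action such as show() or collect() is called.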

