Spark DataSet - Data Frame

About

The data frame is a dataset of rows (ie organized into named columns).

Technically, a data frame is an untyped view of a dataset.

A SparkDataFrame is a distributed collection of data organized into named columns.

It is conceptually equivalent to:

a table in a relational database
or a data frame in R

Articles Related

Management / Operations

Operations available on Datasets follow the spark pattern.

Creating

A dataframe a unified interface to reading/writing data in a variety of formats with Writer to JDBC, JSON, CSV, …

sources such as:

structured data files,
tables in Hive,
external databases,
or existing RDDs.

Spark - Scala API

// a DataFrame is represented by a Dataset of Rows. 
// a type alias of Dataset[Row]
Dataset[Row]

Java. DataSet of Row

// a DataFrame is represented by a Dataset of Rows. 
// represent a DataFrame in java
Dataset<Row>

// From a sqlContext: \
sqlContext.createDataFrame(RDD[Rows], Schema)

Python DataFrame. All Datasets in Python are Dataset[Row], and we call it DataFrame to be consistent with the data frame concept in Pandas and R

people = spark.read.parquet("...")
textFile = spark.read.text("README.md")

Reading

df = sqlContext.read   
  .format("json")   
  .option("samplingRatio", "0.1")   
  .load("/home/michael/data.json")

Writing

df.write   
  .format("parquet")   
  .mode("append")   
  .partitionBy("year")   
  .saveAsTable("fasterData")

where:

Spark DataSet - Parquet

Etl (Read and Write)

ETL Using Custom Data Sources

sqlContext.read 
  .format("com.databricks.spark.jira") 
  .option("url", "https://issues.apache.org/jira/rest/api/latest/search") 
  .option("user", "marmbrus") 
  .option("password", "*******") 
  .option("query", """ 
    |project = SPARK AND  
    |component = SQL AND  
    |(status = Open OR status = "In Progress" OR status = Reopened)""".stripMargin) 
  .load() 
  .repartition(1) 
  .write 
  .format("parquet") 
  .saveAsTable("sparkSqlJira")

where:

the load function creates a data frame
that is then saved

Operations

It has various domain-specific-language (DSL) functions defined in: DataFrame (this class), Column, and functions such as:

group by,
order,
plus,….

Example:

people.col("age").plus(10);  // in Java