Spark - Data Frame (SchemaRDD)

1 - About

  • A distributed collection of rows organized into named columns.
  • An abstraction for selecting, filtering, aggregating and plotting structured data (cf. data frames in R and Pandas).
  • Archaic: previously called SchemaRDD (Spark < 1.3).

Unified interface for reading and writing data in a variety of formats (e.g. JSON, Parquet, CSV, ORC, JDBC).

3 - Management

Like an RDD, a DataFrame follows a builder pattern: the last method in the chain performs an action, while the preceding methods only set parameters.
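The same lazy builder idea can be sketched in plain Python (a hypothetical Reader class, not the Spark API): format and option only record settings and return the builder, and only the final load() acts.

```python
class Reader:
    """Minimal builder sketch: setters record state, load() performs the action."""
    def __init__(self):
        self._format = None
        self._options = {}

    def format(self, fmt):
        # records the format, returns self so calls can be chained
        self._format = fmt
        return self

    def option(self, key, value):
        # records one key/value option, returns self
        self._options[key] = value
        return self

    def load(self, path):
        # terminal method: only here does any work happen
        return f"loading {path} as {self._format} with {self._options}"

result = Reader().format("json").option("samplingRatio", "0.1").load("/data.json")
print(result)
```

Until load() is called, nothing is executed; this is why intermediate builder calls are cheap.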

3.1 - Reading

df = (sqlContext.read
  .format("json")
  .option("samplingRatio", "0.1")
  .load("/home/michael/data.json"))
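The samplingRatio option controls what fraction of the input is scanned when inferring the schema. The idea (not Spark's actual internals) can be sketched in plain Python:

```python
import json
import random

# 100 JSON lines standing in for a data file
lines = [json.dumps({"id": i, "name": f"user{i}"}) for i in range(100)]

def infer_schema(json_lines, sampling_ratio=1.0, seed=42):
    """Infer field names by parsing only a random sample of the lines."""
    random.seed(seed)
    sample = [line for line in json_lines if random.random() < sampling_ratio]
    fields = set()
    for line in sample:
        fields.update(json.loads(line).keys())
    return sorted(fields)

# A 10% sample is enough to find both fields here
print(infer_schema(lines, sampling_ratio=0.1))
```

A lower ratio makes inference faster on large inputs, at the risk of missing fields that appear only in rare records.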

3.2 - Writing

(df.write
  .format("parquet")
  .mode("append")
  .partitionBy("year")
  .saveAsTable("fasterData"))
 

where:

  • format specifies the output data source (Parquet here)
  • mode("append") appends the rows to any existing data instead of overwriting it
  • partitionBy("year") lays the output out in one directory per distinct year value
  • saveAsTable registers the result as a table (fasterData) in the metastore
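What partitionBy("year") does to the on-disk layout can be sketched without Spark, using plain text files instead of Parquet (paths and data are illustrative):

```python
import os
import tempfile
from collections import defaultdict

rows = [
    {"year": 2016, "value": "a"},
    {"year": 2017, "value": "b"},
    {"year": 2017, "value": "c"},
]

out = tempfile.mkdtemp()

# Group rows by the partition column: one directory per distinct value
buckets = defaultdict(list)
for row in rows:
    buckets[row["year"]].append(row)

for year, group in buckets.items():
    part_dir = os.path.join(out, f"year={year}")  # Hive-style partition path
    os.makedirs(part_dir, exist_ok=True)
    with open(os.path.join(part_dir, "part-00000"), "w") as f:
        for row in group:
            # the partition column is encoded in the path, not repeated in the data
            f.write(f"{row['value']}\n")

print(sorted(os.listdir(out)))  # ['year=2016', 'year=2017']
```

Queries that filter on the partition column can then skip whole directories instead of scanning every file.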

3.3 - ETL (Read and Write)

ETL Using Custom Data Sources

sqlContext.read 
  .format("com.databricks.spark.jira") 
  .option("url", "https://issues.apache.org/jira/rest/api/latest/search") 
  .option("user", "marmbrus") 
  .option("password", "*******") 
  .option("query", """ 
    |project = SPARK AND  
    |component = SQL AND  
    |(status = Open OR status = "In Progress" OR status = Reopened)""".stripMargin) 
  .load() 
  .repartition(1) 
  .write 
  .format("parquet") 
  .saveAsTable("sparkSqlJira") 

where:

  • the load function creates a DataFrame from the custom source
  • write then saves it as a Parquet table (repartition(1) first collapses the result into a single partition, so one output file is written)
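The same extract-transform-load shape can be sketched in plain Python, with an in-memory stand-in for the JIRA source and a JSON file instead of a Parquet table (all names and data are hypothetical):

```python
import json
import os
import tempfile

# Extract: stand-in for the rows a custom data source would return
issues = [
    {"key": "SPARK-1", "component": "SQL", "status": "Open"},
    {"key": "SPARK-2", "component": "Core", "status": "Open"},
    {"key": "SPARK-3", "component": "SQL", "status": "Closed"},
]

# Transform: keep open SQL issues, mirroring the JIRA query above
open_sql = [i for i in issues
            if i["component"] == "SQL"
            and i["status"] in ("Open", "In Progress", "Reopened")]

# Load: persist as a single output file (cf. repartition(1) + saveAsTable)
path = os.path.join(tempfile.mkdtemp(), "sparkSqlJira.json")
with open(path, "w") as f:
    json.dump(open_sql, f)

print([i["key"] for i in open_sql])  # ['SPARK-1']
```

The point of the Spark version is that the same three steps run distributed and end in a queryable table rather than a local file.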

4 - Documentation / Reference

db/spark/dataframe.txt · Last modified: 2017/09/20 22:34 by gerardnico