Spark - SparkR API

1 - About

SparkR provides a distributed data frame implementation that supports operations such as selection, filtering and aggregation (similar to R data frames and dplyr), but on large datasets. SparkR also supports distributed machine learning using MLlib.
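As an illustration of the operations listed above, here is a minimal sketch that runs a selection, a filter and an aggregation on R's built-in faithful data set. It assumes a local Spark installation that SparkR can find; the master URL and the application name are placeholders chosen for this example.

```r
library(SparkR)

# Assumption: Spark is installed locally and discoverable (e.g. via SPARK_HOME).
sparkR.session(master = "local[*]", appName = "sparkr-about-sketch")

# Turn the local 'faithful' data frame into a distributed SparkDataFrame.
df <- as.DataFrame(faithful)

# Selection: keep a single column.
eruptions <- select(df, "eruptions")

# Filtering: keep rows with a waiting time under 50 minutes.
short_waits <- filter(df, df$waiting < 50)

# Aggregation: count the observations for each waiting time.
counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))

# head() and nrow() bring results back to the local R session.
first_eruptions <- head(eruptions, 3)
n_short_waits   <- nrow(short_waits)
top_counts      <- head(arrange(counts, desc(counts$count)), 3)

print(first_eruptions)
print(n_short_waits)
print(top_counts)

sparkR.session.stop()
```

The function names mirror their dplyr counterparts, but each call here builds a distributed query; only head() and nrow() return ordinary R values.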

3 - Management

3.1 - Shell

See also: Spark - Shell

./bin/sparkR

The shell automatically creates a SparkSession (the connection to the cluster).

3.2 - SparkDataFrame Creation

A SparkDataFrame can be created from:

  • structured data files,
  • tables in Hive,
  • external databases,
  • or existing local R data frames.
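The sources listed above can be sketched as follows. This is an illustrative example, not a complete reference: the sample data, the temporary JSON file, the view name people and the JDBC URL, table and credentials are all made up for the example, and a local Spark installation is assumed.

```r
library(SparkR)

# Assumption: Spark is installed locally and discoverable (e.g. via SPARK_HOME).
sparkR.session(master = "local[*]", appName = "sparkr-creation-sketch")

# 1. From an existing local R data frame.
local_df <- data.frame(name = c("Ann", "Bob"), age = c(31, 45))
from_r <- as.DataFrame(local_df)

# 2. From a structured data file (JSON Lines: one JSON object per line).
json_path <- tempfile(fileext = ".json")
writeLines(c('{"name":"Ann","age":31}', '{"name":"Bob","age":45}'), json_path)
from_file <- read.df(json_path, source = "json")

# 3. From a table: here a temporary view queried with SQL.
#    Hive tables are queried through sql() in the same way.
createOrReplaceTempView(from_file, "people")
from_sql <- sql("SELECT name FROM people WHERE age > 40")

# 4. From an external database over JDBC. Not run: the URL, table name and
#    credentials below are hypothetical placeholders.
# from_db <- read.jdbc("jdbc:postgresql://dbhost:5432/shop", "orders",
#                      user = "someuser", password = "secret")

# Collect small results back into local R values.
n_from_r    <- nrow(from_r)
n_from_file <- nrow(from_file)
older_names <- collect(from_sql)$name
print(older_names)

sparkR.session.stop()
```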

3.3 - Session

The SparkSession is the entry point into SparkR.

The SparkSession connects your R program to a Spark cluster.

To create it, call sparkR.session and pass in options such as the application name or any Spark packages the application depends on.

In the SparkR shell, the SparkSession has already been created for you.

sparkR.session()
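Outside the shell, the options are passed as named arguments. The sketch below assumes a local Spark installation; the master URL, application name and memory setting are example values, and the commented-out sparkPackages coordinate is a placeholder, not a real package.

```r
library(SparkR)

# Assumption: Spark is installed locally and discoverable (e.g. via SPARK_HOME).
sparkR.session(
  master      = "local[2]",                  # run locally with two threads
  appName     = "sparkr-session-sketch",     # name shown in the Spark UI
  sparkConfig = list(spark.driver.memory = "2g")
  # sparkPackages = "group:artifact:version" # placeholder Maven coordinate
)

# Read a setting back from the active session.
app_name <- sparkR.conf("spark.app.name")[[1]]
print(app_name)

# Stop the session when the program is done.
sparkR.session.stop()
```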

3.4 - Configuration

3.4.1 - Driver

SPARKR_DRIVER_R is the environment variable that defines the R binary executable to use for the SparkR shell (default is R). The configuration property spark.r.shell.command takes precedence over it if it is set.

3.5 - Installation

  • Installation of version 2.2.0 from GitHub with devtools:
if (!require('devtools')) install.packages('devtools')
devtools::install_github('apache/spark@v2.2.0', subdir='R/pkg')
Downloading GitHub repo apache/spark@v2.2.0
from URL https://api.github.com/repos/apache/spark/zipball/v2.2.0
Installing SparkR
"C:/R/R-3.5.0/bin/x64/R" --no-site-file --no-environ --no-save --no-restore  \
  --quiet CMD INSTALL  \
  "C:/Users/gerard/AppData/Local/Temp/RtmpYTWtef/devtools64e063bc40c5/apache-spark-a2c7b21/R/pkg"  \
  --library="C:/R/R-3.5.0/library" --install-tests

In R CMD INSTALL
* installing *source* package 'SparkR' ...
** R
** inst
** tests
** byte-compile and prepare package for lazy loading
Creating a new generic function for 'as.data.frame' in package 'SparkR'
Creating a new generic function for 'colnames' in package 'SparkR'
Creating a new generic function for 'colnames<-' in package 'SparkR'
Creating a new generic function for 'cov' in package 'SparkR'
Creating a new generic function for 'drop' in package 'SparkR'
Creating a new generic function for 'na.omit' in package 'SparkR'
Creating a new generic function for 'filter' in package 'SparkR'
Creating a new generic function for 'intersect' in package 'SparkR'
Creating a new generic function for 'sample' in package 'SparkR'
Creating a new generic function for 'transform' in package 'SparkR'
Creating a new generic function for 'subset' in package 'SparkR'
Creating a new generic function for 'summary' in package 'SparkR'
Creating a new generic function for 'union' in package 'SparkR'
Creating a new generic function for 'endsWith' in package 'SparkR'
Creating a new generic function for 'startsWith' in package 'SparkR'
Creating a new generic function for 'lag' in package 'SparkR'
Creating a new generic function for 'rank' in package 'SparkR'
Creating a new generic function for 'sd' in package 'SparkR'
Creating a new generic function for 'var' in package 'SparkR'
Creating a new generic function for 'window' in package 'SparkR'
Creating a new generic function for 'predict' in package 'SparkR'
Creating a new generic function for 'rbind' in package 'SparkR'
Creating a generic function for 'substr' from package 'base' in package 'SparkR'
Creating a generic function for '%in%' from package 'base' in package 'SparkR'
Creating a generic function for 'lapply' from package 'base' in package 'SparkR'
Creating a generic function for 'Filter' from package 'base' in package 'SparkR'
Creating a generic function for 'nrow' from package 'base' in package 'SparkR'
Creating a generic function for 'ncol' from package 'base' in package 'SparkR'
Creating a generic function for 'factorial' from package 'base' in package 'SparkR'
Creating a generic function for 'atan2' from package 'base' in package 'SparkR'
Creating a generic function for 'ifelse' from package 'base' in package 'SparkR'
** help
No man pages found in package  'SparkR'
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
*** arch - i386
*** arch - x64
* DONE (SparkR)
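Once the installation has finished, a quick check such as the following can confirm that the package loads and can reach Spark. This is a sketch under one assumption: a Spark distribution matching the package version is available locally (for example through SPARK_HOME). The application name is a placeholder.

```r
library(SparkR)

# Version of the installed R package.
pkg_version <- as.character(packageVersion("SparkR"))
print(pkg_version)

# Assumption: a matching Spark distribution is available locally.
sparkR.session(master = "local[*]", appName = "sparkr-install-check")

# Version of the Spark runtime the session is connected to.
spark_version <- sparkR.version()
print(spark_version)

sparkR.session.stop()
```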

4 - Documentation / Reference

db/spark/sparkr.txt · Last modified: 2018/06/11 20:49 by gerardnico