Spark - Local Installation


1 - About

A local installation is a Spark installation on a single machine (generally a development machine).

The local master connection URL starts a local Spark instance on your machine for you.

These steps were written for a Windows laptop.


3 - Connection URL

The master connection URL local starts Spark locally for you:

  • with one thread: local
  • with N threads: local[N]

Example with sparklyr:

sc <- sparklyr::spark_connect(master = "local")

where master is the Spark master connection URL (see Spark - Master (Connection URL)).
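As a small illustration of the two URL forms above, here is a minimal Python sketch that builds the local master URL from a thread count. The helper name local_master_url is hypothetical and is not part of Spark or sparklyr; it only formats the string.

```python
from typing import Optional


def local_master_url(threads: Optional[int] = None) -> str:
    """Return "local" for one thread, or "local[N]" for N threads.

    Hypothetical helper: it only formats the master URL string.
    """
    if threads is None:
        return "local"
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    return f"local[{threads}]"


print(local_master_url())   # URL for a single thread
print(local_master_url(4))  # URL for four threads
```

The resulting string is what you would pass as the master argument of your Spark client.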

4 - Installation Steps

This is a manual installation. You may also want to check the semi-automatic sparklyr installation.

These steps were written for a Windows laptop.

4.1 - Pre-built unarchive

Download the “Pre-built for Hadoop X.X and later” package of the latest release of Spark and simply unpack it.

The packages are located at https://d3kbcqa49mib13.cloudfront.net/. For example, to download the version spark-2.2.0-bin-hadoop2.7.tgz, you would use: https://d3kbcqa49mib13.cloudfront.net/spark-2.2.0-bin-hadoop2.7.tgz

Once it is unpacked, you should be able to run the spark-shell script from the package's bin directory.
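The download address follows a visible pattern, so it can be assembled from the Spark and Hadoop versions. The sketch below assumes that the pattern of the example URL above holds for other versions, which is not guaranteed. The function name spark_package_url is hypothetical.

```python
def spark_package_url(
    spark_version: str,
    hadoop_version: str,
    base_url: str = "https://d3kbcqa49mib13.cloudfront.net/",
) -> str:
    """Build the archive URL from the naming pattern of the example above.

    Assumption: other versions follow the same
    spark-<version>-bin-hadoop<version>.tgz naming scheme.
    """
    archive = f"spark-{spark_version}-bin-hadoop{hadoop_version}.tgz"
    return base_url.rstrip("/") + "/" + archive


print(spark_package_url("2.2.0", "2.7"))
```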


4.2 - Env

4.2.1 - SPARK_HOME

The SPARK_HOME environment variable gives the installation directory. It is used to locate:

  • winutils (on Windows) in the bin directory
  • the conf files, first in SPARK_HOME/conf, then in SPARK_HOME/hadoop/conf

SET SPARK_HOME=/pathToSpark

4.2.2 - HADOOP_HOME

Set the HADOOP_HOME environment variable. It is used to locate:

  • winutils (on Windows) in HADOOP_HOME/bin
  • the conf files in HADOOP_HOME/conf

SET HADOOP_HOME=%SPARK_HOME%\hadoop

4.3 - Classpath

The conf files are searched within the classpath in this order:

  • SPARK_HOME/conf
  • HADOOP_HOME/conf
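To make the search order concrete, the following sketch lists the conf directories in the order given above, based on the two environment variables. It is an illustration only: the function name conf_search_path is hypothetical, and the real classpath is assembled by the Spark launch scripts, not by this code.

```python
from pathlib import PureWindowsPath
from typing import List, Mapping


def conf_search_path(env: Mapping[str, str]) -> List[str]:
    """List the conf directories in the order described above.

    Hypothetical helper: variables that are not set are skipped.
    """
    directories = []
    for variable in ("SPARK_HOME", "HADOOP_HOME"):
        home = env.get(variable)
        if home:
            # PureWindowsPath keeps Windows separators on any platform.
            directories.append(str(PureWindowsPath(home) / "conf"))
    return directories


# Example values taken from the installation described in this page.
example_env = {
    "SPARK_HOME": r"C:\spark-2.2.0-bin-hadoop2.7",
    "HADOOP_HOME": r"C:\spark-2.2.0-bin-hadoop2.7\hadoop",
}
for directory in conf_search_path(example_env):
    print(directory)
```

On a real machine you could pass os.environ instead of the example dictionary.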

Example of the command line used when starting the Spark SQL CLI, where you can see that the classpath (cp) has two conf locations:

java 
-cp "C:\spark-2.2.0-bin-hadoop2.7\bin\..\conf\;C:\spark-2.2.0-bin-hadoop2.7\bin\..\jars\*;C:\spark-2.2.0-bin-hadoop2.7\hadoop\conf" 
-Xmx1g org.apache.spark.deploy.SparkSubmit --class org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver spark-internal

In your IDE, be sure to add these two directories to your classpath.



4.4 - Winutils

For Windows only:

  • Download winutils and put it under %HADOOP_HOME%\bin

4.5 - Hive

See also: Spark - Hive

The Hive configuration is located in %HADOOP_HOME%\conf\hive-site.xml.

Example of a configuration file for a test environment where the base directory for Hive is C:\spark-2.2.0-hive\:

hive-site.xml
<configuration>
    <property>
        <name>hive.exec.scratchdir</name>
        <value>C:\spark-2.2.0-hive\scratchdir</value>
        <description>Scratch space for Hive jobs</description>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>C:\spark-2.2.0-hive\spark-warehouse</value>
        <description>Spark Warehouse</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:derby:c:/spark-2.2.0-metastore/metastore_db;create=true</value>
        <description>JDBC connect string for a JDBC metastore</description>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>org.apache.derby.jdbc.EmbeddedDriver</value>
        <description>Driver class name for a JDBC metastore</description>
    </property>
</configuration>

4.5.1 - Directory

The Hive configuration has two important directories that must be writable:

  • the scratch directory: a temporary runtime directory (default value is c:\tmp\hive, common location %TEMP%\hive)
  • the warehouse: a directory where the internal Hive data is saved.

Steps:

  • Make the directories and make them writable. The paths must match the ones set in hive-site.xml.

REM Same paths as in hive-site.xml
set SPARK_SCRATCHDIR=C:\spark-2.2.0-hive\scratchdir
set SPARK_WAREHOUSE=C:\spark-2.2.0-hive\spark-warehouse
mkdir %SPARK_SCRATCHDIR%
mkdir %SPARK_WAREHOUSE%
REM winutils.exe was placed under %HADOOP_HOME%\bin in the winutils step
%HADOOP_HOME%\bin\winutils.exe chmod -R 777 %SPARK_SCRATCHDIR%
%HADOOP_HOME%\bin\winutils.exe chmod -R 777 %SPARK_WAREHOUSE%
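Because the directories must match the values in hive-site.xml, one way to avoid a mismatch is to read the paths from the configuration itself. The sketch below parses a hive-site.xml document with the Python standard library and creates the two directories. The function names are hypothetical, the demo uses a temporary directory instead of C:\spark-2.2.0-hive\, and the script does not change permissions, so the winutils chmod step is still needed on Windows.

```python
import os
import tempfile
import xml.etree.ElementTree as ET
from typing import Dict

# The two properties described above that point to writable directories.
DIRECTORY_PROPERTIES = ("hive.exec.scratchdir", "hive.metastore.warehouse.dir")


def read_properties(hive_site_xml: str) -> Dict[str, str]:
    """Map every <name> to its <value> in a hive-site.xml document."""
    root = ET.fromstring(hive_site_xml)
    return {
        prop.findtext("name"): prop.findtext("value")
        for prop in root.iter("property")
    }


def make_hive_directories(hive_site_xml: str) -> Dict[str, str]:
    """Create the scratch and warehouse directories named in the config."""
    properties = read_properties(hive_site_xml)
    created = {}
    for name in DIRECTORY_PROPERTIES:
        path = properties[name]
        os.makedirs(path, exist_ok=True)  # no error if it already exists
        created[name] = path
    return created


# Demo with a temporary base directory (an assumption for illustration).
base = tempfile.mkdtemp()
sample = f"""<configuration>
    <property>
        <name>hive.exec.scratchdir</name>
        <value>{os.path.join(base, "scratchdir")}</value>
    </property>
    <property>
        <name>hive.metastore.warehouse.dir</name>
        <value>{os.path.join(base, "spark-warehouse")}</value>
    </property>
</configuration>"""
created = make_hive_directories(sample)
for name, path in created.items():
    print(name, "->", path)
```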

4.5.2 - Metastore

The metastore is a local Derby metastore because the Derby jar is already located in SPARK_HOME/jars.

If, on startup, you see an error saying that a driver cannot be found, the cause is a faulty JDBC URL. Verify your URL:

<property>
	<name>javax.jdo.option.ConnectionURL</name>
	<value>jdbc:derby:c:/spark-2.2.0-metastore/metastore_db;create=true</value>
	<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
	<name>javax.jdo.option.ConnectionDriverName</name>
	<value>org.apache.derby.jdbc.EmbeddedDriver</value>
	<description>Driver class name for a JDBC metastore</description>
</property>
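A quick sanity check of the two properties above can catch the most obvious URL mistakes before starting Spark. The sketch below is a heuristic only: the function name check_derby_metastore is hypothetical, and it checks nothing more than the jdbc:derby: prefix and the presence of a database path.

```python
from typing import List

DERBY_PREFIX = "jdbc:derby:"
DERBY_EMBEDDED_DRIVER = "org.apache.derby.jdbc.EmbeddedDriver"


def check_derby_metastore(url: str, driver: str) -> List[str]:
    """Return a list of problems found in the URL/driver pair (heuristic)."""
    problems = []
    if driver == DERBY_EMBEDDED_DRIVER and not url.startswith(DERBY_PREFIX):
        problems.append("URL does not start with " + DERBY_PREFIX)
    if url.startswith(DERBY_PREFIX):
        # The database path is the part before the first ';' attribute.
        database = url[len(DERBY_PREFIX):].split(";")[0]
        if not database:
            problems.append("URL has no database path")
    return problems


good = "jdbc:derby:c:/spark-2.2.0-metastore/metastore_db;create=true"
bad = "jdbc:debry:c:/spark-2.2.0-metastore/metastore_db;create=true"
print(check_derby_metastore(good, DERBY_EMBEDDED_DRIVER))
print(check_derby_metastore(bad, DERBY_EMBEDDED_DRIVER))
```

An empty list means that no problem was detected by these two checks, not that the URL is guaranteed to work.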

If you want to access the metastore while Spark is running, you may install and configure a local SQL Server instead, because the default Derby installation allows only one connection to the database.

5 - Sparklyr

sparklyr has a function to install a local Spark instance.

  • Install a local Spark.

library(sparklyr)
# Check the available versions
spark_available_versions()
# Install the one that you want locally
spark_install(version = "1.6.2")
Installing Spark 1.6.2 for Hadoop 2.6 or later.
Downloading from:
- 'https://d3kbcqa49mib13.cloudfront.net/spark-1.6.2-bin-hadoop2.6.tgz'
Installing to:
- 'C:\Users\gerardn\AppData\Local\rstudio\spark\Cache/spark-1.6.2-bin-hadoop2.6'
trying URL 'https://d3kbcqa49mib13.cloudfront.net/spark-1.6.2-bin-hadoop2.6.tgz'
Content type 'application/x-tar' length 278057117 bytes (265.2 MB)
downloaded 265.2 MB

Installation complete.
  • Restart RStudio and verify that the HADOOP_HOME environment variable is set
Sys.getenv("HADOOP_HOME")
[1] "C:\\Users\\gerardn\\AppData\\Local\\rstudio\\spark\\Cache\\spark-1.6.2-bin-hadoop2.6\\tmp\\hadoop"
