Spark - Yarn


1 - About

Yarn is a cluster manager supported by Spark.


3 - Mode

The deployment mode sets where the driver will run. The driver will run:

Mode               | Client             | Cluster
-------------------|--------------------|--------------------------
Interactive coding | Yes                | No
Driver machine     | The client machine | The cluster
Process            | Synchronous        | Asynchronous (background)

Example:

./bin/spark-shell --master yarn --deploy-mode client
./bin/spark-submit  --deploy-mode cluster 

4 - Steps

4.1 - Configuration

The HADOOP_CONF_DIR or YARN_CONF_DIR environment variable points to the directory which contains the (client side) configuration files for the Hadoop cluster.

These files let Spark read the configuration:

  • of HDFS, in order to connect to HDFS (HADOOP_CONF_DIR)
  • of YARN, in order to connect to the YARN ResourceManager (YARN_CONF_DIR). The ResourceManager’s address is picked up from the Hadoop configuration.
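On Linux/macOS the same variables are exported rather than set; a minimal sketch, assuming /etc/hadoop/conf is where the client-side *-site.xml files live (adapt the path to your installation):

```shell
# Linux/macOS equivalent of the Windows "set" example below.
# /etc/hadoop/conf is an assumed location; point it at the directory
# that actually holds the client-side Hadoop configuration files.
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=$HADOOP_CONF_DIR
echo "YARN conf dir: $YARN_CONF_DIR"
```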

Example: Set the HADOOP_CONF_DIR or YARN_CONF_DIR

set YARN_CONF_DIR=C:\Users\gerardn\Downloads\YARN_CLIENT
  • Copy the yarn-site.xml file into the conf directory. If this is the default file, change at a minimum:
    • yarn.resourcemanager.hostname to the ResourceManager hostname
    • yarn.client.nodemanager-connect.max-wait-ms to 10000 (10 sec)
    • yarn.resourcemanager.connect.max-wait.ms to 10000 (10 sec)
    • yarn.resourcemanager.connect.retry-interval.ms to 10000 (10 sec, the total time to retry before failing)
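Putting the bullet points above together, a minimal client-side yarn-site.xml might look like the sketch below; the hostname value is a placeholder to replace with your ResourceManager host:

```xml
<!-- Minimal client-side yarn-site.xml sketch; hostname is a placeholder -->
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>resourcemanager.example.com</value>
  </property>
  <property>
    <name>yarn.client.nodemanager-connect.max-wait-ms</name>
    <value>10000</value>
  </property>
  <property>
    <name>yarn.resourcemanager.connect.max-wait.ms</name>
    <value>10000</value>
  </property>
  <property>
    <name>yarn.resourcemanager.connect.retry-interval.ms</name>
    <value>10000</value>
  </property>
</configuration>
```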

4.2 - Deployment mode

4.2.1 - Cluster

The master value is yarn, not a cluster URL: the ResourceManager’s address is picked up from the Hadoop configuration.

With spark-submit

./bin/spark-submit --class path.to.your.Class --master yarn --deploy-mode cluster [options] <app jar> [app options]
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn \
    --deploy-mode cluster \
    --driver-memory 4g \
    --executor-memory 2g \
    --executor-cores 1 \
    --queue thequeue \
    examples/jars/spark-examples*.jar \
    10

where --driver-memory, --executor-memory and --executor-cores size the driver and the executors, and --queue names the YARN queue to submit to.

4.2.2 - Client

Shell feedback

  • A YARN client program is started along with an Application Master (in the example above, the default one)
  • The client periodically polls the Application Master for status updates and displays them in the console.

Running a shell from a local Spark installation against Yarn:

  • Start a shell (the client machine must be on the same network and reachable from all nodes)
:: To locate winutils
set HADOOP_HOME=C:\spark-2.2.0-bin-hadoop2.7
REM suppress the HADOOP_HOME\conf files if you don't want them to be used

REM Then
set HADOOP_CONF_DIR=%HADOOP_HOME%\confAap
set YARN_CONF_DIR=%HADOOP_HOME%\confAap
set HADOOP_BIN=%HADOOP_HOME%\bin

REM the user
set HADOOP_USER_NAME=gnicolas
 
cd %HADOOP_BIN%
 
spark-shell.cmd --master yarn --deploy-mode client
REM or
pyspark.cmd --master yarn --deploy-mode client

5 - Note

5.1 - Azure Conf

  • Suppress the decryption properties in core-site.xml
<property>
      <name>fs.azure.account.keyprovider.basisinfrasharedrgp122.blob.core.windows.net</name>
      <value>org.apache.hadoop.fs.azure.ShellDecryptionKeyProvider</value>
</property>
<property>
      <name>fs.azure.shellkeyprovider.script</name>
      <value>/usr/lib/hdinsight-common/scripts/decrypt.sh</value>
</property>
  • Add the Azure JAR files for the storage

5.2 - FYI: Conf file sent to the cluster

FYI: Example of the conf file sent by Spark in client deploy mode, where 10.0.75.1 is the IP of the host machine (the client)

  • Sent by the Spark shell
__spark_conf__.properties
spark.yarn.cache.visibilities=PRIVATE
spark.yarn.cache.timestamps=1553518131341
spark.executor.id=driver
spark.driver.host=10.0.75.1
spark.yarn.cache.confArchive=file\:/C\:/Users/gerard/.sparkStaging/application_1553465137181_5816/__spark_conf__.zip
spark.yarn.cache.sizes=208833138
spark.jars=
spark.sql.catalogImplementation=hive
spark.home=C\:\\spark-2.2.0-bin-hadoop2.7\\bin\\..
spark.submit.deployMode=client
spark.yarn.queue=root.development
spark.master=yarn
spark.yarn.cache.filenames=file\:/C\:/Users/gerard/AppData/Local/Temp/spark-3a55ab80-2afe-4de2-be7b-0f5cc792c168/__spark_libs__9157723267130265104.zip\#__spark_libs__
spark.yarn.cache.types=ARCHIVE
spark.driver.appUIAddress=http\://10.0.75.1\:4040
spark.repl.class.outputDir=C\:\\Users\\gerard\\AppData\\Local\\Temp\\spark-3a55ab80-2afe-4de2-be7b-0f5cc792c168\\repl-66e09de6-41c3-47ab-9589-f8f95578432c
spark.app.name=Spark shell
spark.repl.class.uri=spark\://10.0.75.1\:10361/classes
spark.driver.port=10361
  • Sent by pySpark
__spark_conf__.properties
spark.executorEnv.PYTHONPATH=C\:\\spark-2.2.0-bin-hadoop2.7\\bin\\..\\python\\lib\\py4j-0.10.4-src.zip;C\:\\spark-2.2.0-bin-hadoop2.7\\bin\\..\\python;<CPS>{{PWD}}/pyspark.zip<CPS>{{PWD}}/py4j-0.10.4-src.zip
spark.yarn.cache.visibilities=PRIVATE,PRIVATE,PRIVATE
spark.yarn.cache.timestamps=1553513892305,1498864159000,1498864159000
spark.executor.id=driver
spark.driver.host=10.0.75.1
spark.yarn.cache.confArchive=file\:/C\:/Users/gerard/.sparkStaging/application_1553465137181_5377/__spark_conf__.zip
spark.yarn.isPython=true
spark.yarn.cache.sizes=208833138,480115,74096
spark.sql.catalogImplementation=hive
spark.submit.deployMode=client
spark.master=yarn
spark.yarn.cache.filenames=file\:/C\:/Users/gerard/AppData/Local/Temp/spark-c5350af3-fabd-469e-bfc3-565eb0f6ed4b/__spark_libs__2786045563156883095.zip\#__spark_libs__,file\:/C\:/spark-2.2.0-bin-hadoop2.7/python/lib/pyspark.zip\#pyspark.zip,file\:/C\:/spark-2.2.0-bin-hadoop2.7/python/lib/py4j-0.10.4-src.zip\#py4j-0.10.4-src.zip
spark.serializer.objectStreamReset=100
spark.yarn.cache.types=ARCHIVE,FILE,FILE
spark.driver.appUIAddress=http\://10.0.75.1\:4040
spark.rdd.compress=True
spark.app.name=PySparkShell
spark.driver.port=6067
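These staged files use the plain Java properties format (key=value), so standard text tools can inspect them. A self-contained sketch, recreating a two-line sample inline (values taken from the listing above; /tmp is an arbitrary scratch location):

```shell
# Write a two-line sample of a __spark_conf__.properties file,
# then look up the deploy mode with grep.
cat > /tmp/__spark_conf__.properties <<'EOF'
spark.master=yarn
spark.submit.deployMode=client
EOF
grep '^spark\.submit\.deployMode' /tmp/__spark_conf__.properties
```

The grep prints spark.submit.deployMode=client, which is how you can quickly confirm which deploy mode a staged application was launched with.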

6 - Documentation / Reference