Spark - Hive

> Database > Spark > Spark - Sql

1 - About


3 - Enable

  • The SparkSession must be instantiated with Hive support
String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
SparkSession spark = SparkSession
  .appName("Java Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport() // Hive support

4 - Default

Users who do not have an existing Hive deployment can still enable Hive support.

When not configured by the hive-site.xml, the context automatically:

  • creates metastore_db in the current directory
  • creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory that the Spark application is started.

5 - Management

5.1 - Configuration

5.1.1 - Dependency

Since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution.

On all of the worker nodes, the following must be installed on the classpath:


5.1.2 - File

Configuration of Hive is done by placing:


5.1.3 - Options

5.2 - Server

5.3 - Metastore

Example of configuration file for a local installation in a test environment.

    <description>Scratch space for Hive jobs</description>
    <description>Spark Warehouse</description>
  <description>JDBC connect string for a JDBC metastore</description>
  <description>Driver class name for a JDBC metastore</description>

5.3.1 - Database

The default metastore is created with Derby if not set.

5.3.2 - Warehouse

Hive - Warehouse

Search Algorithm:

  • If the setting value of spark.sql.warehouse.dir is not null, spark.sql.warehouse.dir
  • If the setting value of hive.metastore.warehouse.dir is not null, hive.metastore.warehouse.dir
  • otherwise workingDir/spark-warehouse/

The doc said that the hive.metastore.warehouse.dir property in hive-site.xml is deprecated since Spark 2.0.0.

Example of log output:

18/07/01 00:10:50 INFO SharedState: spark.sql.warehouse.dir is not set, but hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the value of hive.metastore.warehouse.dir ('C:\spark-2.2.0-metastore\spark-warehouse').
18/07/01 00:10:50 INFO SharedState: Warehouse path is 'C:\spark-2.2.0-metastore\spark-warehouse'.

Normally, it will be found in the hive-site.xml file but you can change it via code:

Path warehouseLocation = Paths.get("target","spark-warehouse");
SparkSession spark = SparkSession
      .appName("Java Spark Hive Example")
      .config("spark.sql.warehouse.dir", warehouseLocation.toAbsolutePath().toString())

5.4 - Table

From saving-to-persistent-tables

DataFrames can also be saved as persistent tables into Hive metastore using the saveAsTable command.

A DataFrame for a persistent table can be created by calling the table method on a SparkSession with the name of the table.

5.4.1 - Internal

Starting from Spark 2.1, persistent datasource tables have per-partition metadata stored in the Hive metastore.

Partition Hive DDLs are supported


5.4.2 - External

For an external table, you can specify the table path via the path option

df.write.option("path", "/some/path").saveAsTable("t")

The partition information is not gathered by default when creating external datasource tables (those with a path option). To sync the partition information in the metastore, you can invoke MSCK REPAIR TABLE.

6 - Documentation / Reference