Spark - Hive


1 - About

Spark SQL supports reading and writing data stored in Apache Hive. This page shows how to enable and configure Hive support in Spark.

3 - Enable

  • The SparkSession must be instantiated with Hive support:

import java.io.File;
import org.apache.spark.sql.SparkSession;

// Location of the warehouse directory for managed databases and tables
String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport() // Hive support
  .getOrCreate();
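
With Hive support enabled, HiveQL statements can be run directly through the session. A minimal sketch reusing the spark session created above; the table name src and its schema are illustrative, not part of this page:

// Create a Hive-managed table (the name `src` and its schema are hypothetical)
spark.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING) USING hive");
// List the tables known to the metastore
spark.sql("SHOW TABLES").show();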

4 - Default

Users who do not have an existing Hive deployment can still enable Hive support.

When Hive is not configured via hive-site.xml, the context automatically:

  • creates metastore_db in the current directory
  • creates the directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory where the Spark application is started (see the sketch below)
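
A minimal sketch of this default behaviour, assuming no hive-site.xml on the classpath and no explicit warehouse configuration; the resolved location can be read back from the session's runtime configuration:

import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession
  .builder()
  .appName("Hive Defaults") // hypothetical application name
  .enableHiveSupport()
  .getOrCreate();

// Prints something like file:/current/working/dir/spark-warehouse
System.out.println(spark.conf().get("spark.sql.warehouse.dir"));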

5 - Management

5.1 - Configuration

5.1.1 - Dependency

Since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution.

If the Hive dependencies can be found on the classpath, Spark will load them automatically. They must be present on all of the worker nodes, as the workers need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.

5.1.2 - File

Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) files in the conf/ directory.

5.1.3 - Options

5.2 - Server

5.3 - Metastore

Example of a configuration file for a local installation in a test environment (Windows paths):

hive-site.xml
<configuration>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>C:\spark-2.2.0-metastore\scratchdir</value>
    <description>Scratch space for Hive jobs</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>C:\spark-2.2.0-metastore\spark-warehouse</value>
    <description>Spark Warehouse</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:c:/spark-2.2.0-metastore/metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
</configuration>

5.3.1 - Database

If no metastore database is configured, Spark creates the metastore with an embedded Apache Derby database by default.

5.3.2 - Warehouse

Hive - Warehouse

Search algorithm for the warehouse location:

  • if spark.sql.warehouse.dir is set, use the value of spark.sql.warehouse.dir
  • otherwise, if hive.metastore.warehouse.dir is set, use the value of hive.metastore.warehouse.dir
  • otherwise, use workingDir/spark-warehouse/

Note that the hive.metastore.warehouse.dir property in hive-site.xml has been deprecated since Spark 2.0.0 in favor of spark.sql.warehouse.dir.

Example of log output:

18/07/01 00:10:50 INFO SharedState: spark.sql.warehouse.dir is not set, but hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the value of hive.metastore.warehouse.dir ('C:\spark-2.2.0-metastore\spark-warehouse').
18/07/01 00:10:50 INFO SharedState: Warehouse path is 'C:\spark-2.2.0-metastore\spark-warehouse'.

Normally, the warehouse location is read from the hive-site.xml file, but you can override it in code:

import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.spark.sql.SparkSession;

// Override the warehouse location before the session is created
Path warehouseLocation = Paths.get("target", "spark-warehouse");
SparkSession spark = SparkSession
      .builder()
      .appName("Java Spark Hive Example")
      .config("spark.sql.warehouse.dir", warehouseLocation.toAbsolutePath().toString())
      .enableHiveSupport()
      .getOrCreate();

5.4 - Table

From the Spark documentation section saving-to-persistent-tables:

DataFrames can also be saved as persistent tables into the Hive metastore using the saveAsTable command.

A DataFrame for a persistent table can be created by calling the table method on a SparkSession with the name of the table, as sketched below.
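
A minimal sketch of both operations, reusing a spark session with Hive support; the table name people and the sample data are hypothetical:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Persist a DataFrame as a managed table in the Hive metastore
Dataset<Row> df = spark.range(3).toDF("id");
df.write().saveAsTable("people"); // "people" is a hypothetical table name

// Re-create a DataFrame from the persistent table by name
Dataset<Row> people = spark.table("people");
people.show();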

5.4.1 - Internal

Starting from Spark 2.1, persistent datasource tables have per-partition metadata stored in the Hive metastore.

Partition-related Hive DDLs are now supported, for example:

ALTER TABLE PARTITION ... SET LOCATION 
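
A sketch of such a statement via spark.sql; the table name logs, the partition column day, and the path are hypothetical:

// Move one partition of a partitioned table to a new location
spark.sql("ALTER TABLE logs PARTITION (day='2019-05-29') "
    + "SET LOCATION '/new/path/for/that/partition'");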

5.4.2 - External

For an external table, you can specify the table path via the path option:

df.write().option("path", "/some/path").saveAsTable("t");

The partition information is not gathered by default when creating external datasource tables (those with a path option). To sync the partition information in the metastore, you can invoke MSCK REPAIR TABLE, as sketched below.
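
A minimal sketch, assuming the external table t created above and a spark session with Hive support:

// Sync partition directories added outside of Spark into the metastore
spark.sql("MSCK REPAIR TABLE t");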

6 - Documentation / Reference
