Spark - Hive

1 - About

Hive is the default Spark catalog.

Since Spark 2.0, Spark SQL supports built-in Hive features such as HiveQL queries, Hive SerDes, and Hive UDFs.

3 - Enable

  • The SparkSession must be instantiated with Hive support:
import java.io.File;
import org.apache.spark.sql.SparkSession;

String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport() // Hive support
  .getOrCreate();
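
A quick way to confirm that the session actually uses the Hive catalog is to list its databases. Minimal sketch, assuming the spark session created above:

// With Hive support enabled, catalog calls go through the Hive metastore;
// a fresh installation shows at least the "default" database.
spark.catalog().listDatabases().show();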

4 - Default

Users who do not have an existing Hive deployment can still enable Hive support.

When not configured by hive-site.xml, the context automatically:

  • creates metastore_db in the current directory
  • creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory where the Spark application is started.
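
To check which warehouse directory the session actually resolved, you can read it back from the session configuration. Minimal sketch, assuming a SparkSession named spark and that spark.sql.warehouse.dir is readable through the runtime configuration:

// Prints the effective warehouse location
// (./spark-warehouse when nothing else is configured).
System.out.println(spark.conf().get("spark.sql.warehouse.dir"));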

5 - Management

5.1 - Configuration

5.1.1 - Dependency

Since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution. If the Hive dependencies can be found on the classpath, Spark loads them automatically.

These Hive dependencies must also be present on all of the worker nodes, as the workers need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.

5.1.2 - File

Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) files in the conf/ directory of the Spark installation.

5.1.3 - Options

See the configuration reference in the Spark SQL documentation.

You can list all of the configuration options with the SET -v command:
  • Scala
sparkSession.sql("SET -v").show(numRows = 200, truncate = false)
  • Java
sparkSession.sql("SET -v").show(200, false);
  • Python
sparkSession.sql("SET -v").show(n=200, truncate=False)
  • R
sparkR.session()
properties <- sql("SET -v")
showDF(properties, numRows = 200, truncate = FALSE)
Hive-related configuration options (default value in parentheses):

  • spark.sql.hive.caseSensitiveInferenceMode (default: INFER_AND_SAVE): Sets the action to take when a case-sensitive schema cannot be read from a Hive table's properties. Although Spark SQL itself is not case-sensitive, Hive compatible file formats such as Parquet are. Spark SQL must use a case-preserving schema when querying any table backed by files containing case-sensitive field names or queries may not return accurate results. Valid options include INFER_AND_SAVE (the default mode: infer the case-sensitive schema from the underlying data files and write it back to the table properties), INFER_ONLY (infer the schema but don't attempt to write it to the table properties) and NEVER_INFER (fall back to using the case-insensitive metastore schema instead of inferring).
  • spark.sql.hive.convertMetastoreParquet (default: true): When set to false, Spark SQL will use the Hive SerDe for Parquet tables instead of the built-in support.
  • spark.sql.hive.convertMetastoreParquet.mergeSchema (default: false): When true, also tries to merge possibly different but compatible Parquet schemas in different Parquet data files. This configuration is only effective when spark.sql.hive.convertMetastoreParquet is true.
  • spark.sql.hive.filesourcePartitionFileCacheSize (default: 262144000): When nonzero, enables caching of partition file metadata in memory. All tables share a cache that can use up to the specified number of bytes for file metadata. This conf only has an effect when Hive filesource partition management is enabled.
  • spark.sql.hive.manageFilesourcePartitions (default: true): When true, enables metastore partition management for file source tables as well. This includes both datasource and converted Hive tables. When partition management is enabled, datasource tables store partitions in the Hive metastore, and use the metastore to prune partitions during query planning.
  • spark.sql.hive.metastore.barrierPrefixes (default: empty): A comma-separated list of class prefixes that should explicitly be reloaded for each version of Hive that Spark SQL is communicating with. For example, Hive UDFs that are declared in a prefix that typically would be shared (i.e. org.apache.spark.*).
  • spark.sql.hive.metastore.sharedPrefixes (default: com.mysql.jdbc, org.postgresql, com.microsoft.sqlserver, oracle.jdbc): A comma-separated list of class prefixes that should be loaded using the classloader that is shared between Spark SQL and a specific version of Hive. An example of classes that should be shared is JDBC drivers that are needed to talk to the metastore. Other classes that need to be shared are those that interact with classes that are already shared, for example, custom appenders that are used by log4j.
  • spark.sql.hive.metastore.version (default: 1.2.1): Version of the Hive metastore. Available options are 0.12.0 through 1.2.1.
  • spark.sql.hive.metastorePartitionPruning (default: true): When true, some predicates will be pushed down into the Hive metastore so that unmatching partitions can be eliminated earlier. This only affects Hive tables not converted to filesource relations (see HiveUtils.CONVERT_METASTORE_PARQUET and HiveUtils.CONVERT_METASTORE_ORC for more information).
  • spark.sql.hive.thriftServer.async (default: true): When set to true, the Hive Thrift server executes SQL queries in an asynchronous way.
  • spark.sql.hive.thriftServer.singleSession (default: false): When set to true, the Hive Thrift server runs in single session mode. All the JDBC/ODBC connections share the temporary views, function registries, SQL configuration and the current database.
  • spark.sql.hive.verifyPartitionPath (default: false): When true, check all the partition paths under the table's root directory when reading data stored in HDFS.
  • spark.sql.hive.version (default: 1.2.1): Version of Hive used internally by Spark SQL.
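
Any of these options can also be set when the session is built. Minimal sketch, using spark.sql.hive.metastorePartitionPruning as an example (the value shown is simply its default):

SparkSession spark = SparkSession
  .builder()
  .appName("Hive options example")
  .config("spark.sql.hive.metastorePartitionPruning", "true")
  .enableHiveSupport()
  .getOrCreate();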

5.2 - Server

5.3 - Metastore

Example of a configuration file for a local installation in a test environment:

hive-site.xml
<configuration>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>C:\spark-2.2.0-metastore\scratchdir</value>
    <description>Scratch space for Hive jobs</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>C:\spark-2.2.0-metastore\spark-warehouse</value>
    <description>Spark Warehouse</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:c:/spark-2.2.0-metastore/metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
</configuration>

5.3.1 - Database

If no metastore database is configured, the default metastore is created with an embedded Derby database (the metastore_db directory mentioned above).

5.3.2 - Warehouse

See also: Hive - Warehouse.

Search algorithm for the warehouse location:

  • If spark.sql.warehouse.dir is set (not null), its value is used
  • Otherwise, if hive.metastore.warehouse.dir is set (not null), its value is used
  • Otherwise, workingDir/spark-warehouse/ is used

The documentation notes that the hive.metastore.warehouse.dir property in hive-site.xml has been deprecated since Spark 2.0.0; spark.sql.warehouse.dir should be used instead to specify the default location of databases in the warehouse.

Example of log output:

18/07/01 00:10:50 INFO SharedState: spark.sql.warehouse.dir is not set, but hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the value of hive.metastore.warehouse.dir ('C:\spark-2.2.0-metastore\spark-warehouse').
18/07/01 00:10:50 INFO SharedState: Warehouse path is 'C:\spark-2.2.0-metastore\spark-warehouse'.

Normally, the warehouse location is taken from the hive-site.xml file, but you can also set it programmatically:

import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.spark.sql.SparkSession;

Path warehouseLocation = Paths.get("target", "spark-warehouse");
SparkSession spark = SparkSession
      .builder()
      .appName("Java Spark Hive Example")
      .config("spark.sql.warehouse.dir", warehouseLocation.toAbsolutePath().toString())
      .enableHiveSupport()
      .getOrCreate();

5.4 - Table

From the Spark SQL documentation (saving-to-persistent-tables):

DataFrames can also be saved as persistent tables into the Hive metastore using the saveAsTable command.

A DataFrame for a persistent table can be created by calling the table method on a SparkSession with the name of the table.
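
A minimal sketch of both directions, assuming an existing DataFrame df and a SparkSession named spark; the table name people is only an example:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Persist the DataFrame as a managed table in the Hive metastore.
df.write().saveAsTable("people");

// Recreate a DataFrame from the persistent table by name.
Dataset<Row> people = spark.table("people");
people.show();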

5.4.1 - Internal

Starting from Spark 2.1, persistent datasource tables have per-partition metadata stored in the Hive metastore.

Hive DDLs on partitions, such as the following, are supported:

ALTER TABLE PARTITION ... SET LOCATION 
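
A sketch of issuing such a statement through the session; the table name, partition column, and location below are hypothetical:

// Move one partition of a hypothetical partitioned table "logs" to a new location.
spark.sql("ALTER TABLE logs PARTITION (dt='2018-07-01') "
    + "SET LOCATION '/data/logs/2018-07-01'");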

5.4.2 - External

For an external table, you can specify the table path via the path option:

df.write.option("path", "/some/path").saveAsTable("t")

The partition information is not gathered by default when creating external datasource tables (those with a path option). To sync the partition information in the metastore, you can invoke MSCK REPAIR TABLE.
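
For example, to sync the partitions of the external table t created above (assuming a SparkSession named spark):

spark.sql("MSCK REPAIR TABLE t");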

6 - Documentation / Reference
