Spark - Hive

1 - About

Hive is the default Spark catalog.

Since Spark 2.0, Spark SQL supports built-in Hive features such as HiveQL queries, Hive SerDes, and Hive UDFs.

3 - Enable

  • The SparkSession must be instantiated with Hive support:
import java.io.File;
import org.apache.spark.sql.SparkSession;

String warehouseLocation = new File("spark-warehouse").getAbsolutePath();
SparkSession spark = SparkSession
  .builder()
  .appName("Java Spark Hive Example")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport() // Hive support
  .getOrCreate();
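
A quick way to confirm that the session actually uses the Hive catalog is to list its databases. Minimal sketch, assuming the spark session created above:

// With Hive support enabled, catalog calls go through the Hive metastore;
// a fresh installation shows at least the "default" database.
spark.catalog().listDatabases().show();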

4 - Default

Users who do not have an existing Hive deployment can still enable Hive support.

When not configured by hive-site.xml, the context automatically:

  • creates metastore_db in the current directory
  • creates a directory configured by spark.sql.warehouse.dir, which defaults to the directory spark-warehouse in the current directory where the Spark application is started.
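
To check which warehouse directory the session actually resolved, you can read it back from the session configuration. Minimal sketch, assuming a SparkSession named spark and that spark.sql.warehouse.dir is readable through the runtime configuration:

// Prints the effective warehouse location
// (./spark-warehouse when nothing else is configured).
System.out.println(spark.conf().get("spark.sql.warehouse.dir"));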

5 - Management

5.1 - Configuration

5.1.1 - Dependency

Since Hive has a large number of dependencies, these dependencies are not included in the default Spark distribution. If the Hive dependencies can be found on the classpath, Spark loads them automatically.

These Hive dependencies must also be present on all of the worker nodes, as the workers need access to the Hive serialization and deserialization libraries (SerDes) in order to access data stored in Hive.

5.1.2 - File

Configuration of Hive is done by placing your hive-site.xml, core-site.xml (for security configuration), and hdfs-site.xml (for HDFS configuration) files in the conf/ directory of the Spark installation.

5.1.3 - Options

See the configuration reference in the Spark SQL documentation.

You can list all of the configuration options with the SET -v command:
  • Scala
sparkSession.sql("SET -v").show(numRows = 200, truncate = false)
  • Java
sparkSession.sql("SET -v").show(200, false);
  • Python
sparkSession.sql("SET -v").show(n=200, truncate=False)
  • R
sparkR.session()
properties <- sql("SET -v")
showDF(properties, numRows = 200, truncate = FALSE)
Hive-related configuration options (default value in parentheses):

  • spark.sql.hive.caseSensitiveInferenceMode (default: INFER_AND_SAVE): Sets the action to take when a case-sensitive schema cannot be read from a Hive table's properties. Although Spark SQL itself is not case-sensitive, Hive compatible file formats such as Parquet are. Spark SQL must use a case-preserving schema when querying any table backed by files containing case-sensitive field names or queries may not return accurate results. Valid options include INFER_AND_SAVE (the default mode: infer the case-sensitive schema from the underlying data files and write it back to the table properties), INFER_ONLY (infer the schema but don't attempt to write it to the table properties) and NEVER_INFER (fall back to using the case-insensitive metastore schema instead of inferring).
  • spark.sql.hive.convertMetastoreParquet (default: true): When set to false, Spark SQL will use the Hive SerDe for Parquet tables instead of the built-in support.
  • spark.sql.hive.convertMetastoreParquet.mergeSchema (default: false): When true, also tries to merge possibly different but compatible Parquet schemas in different Parquet data files. This configuration is only effective when spark.sql.hive.convertMetastoreParquet is true.
  • spark.sql.hive.filesourcePartitionFileCacheSize (default: 262144000): When nonzero, enables caching of partition file metadata in memory. All tables share a cache that can use up to the specified number of bytes for file metadata. This conf only has an effect when Hive filesource partition management is enabled.
  • spark.sql.hive.manageFilesourcePartitions (default: true): When true, enables metastore partition management for file source tables as well. This includes both datasource and converted Hive tables. When partition management is enabled, datasource tables store partitions in the Hive metastore, and use the metastore to prune partitions during query planning.
  • spark.sql.hive.metastore.barrierPrefixes (default: empty): A comma-separated list of class prefixes that should explicitly be reloaded for each version of Hive that Spark SQL is communicating with. For example, Hive UDFs that are declared in a prefix that typically would be shared (i.e. org.apache.spark.*).
  • spark.sql.hive.metastore.sharedPrefixes (default: com.mysql.jdbc, org.postgresql, com.microsoft.sqlserver, oracle.jdbc): A comma-separated list of class prefixes that should be loaded using the classloader that is shared between Spark SQL and a specific version of Hive. An example of classes that should be shared is JDBC drivers that are needed to talk to the metastore. Other classes that need to be shared are those that interact with classes that are already shared, for example, custom appenders that are used by log4j.
  • spark.sql.hive.metastore.version (default: 1.2.1): Version of the Hive metastore. Available options are 0.12.0 through 1.2.1.
  • spark.sql.hive.metastorePartitionPruning (default: true): When true, some predicates will be pushed down into the Hive metastore so that unmatching partitions can be eliminated earlier. This only affects Hive tables not converted to filesource relations (see HiveUtils.CONVERT_METASTORE_PARQUET and HiveUtils.CONVERT_METASTORE_ORC for more information).
  • spark.sql.hive.thriftServer.async (default: true): When set to true, the Hive Thrift server executes SQL queries in an asynchronous way.
  • spark.sql.hive.thriftServer.singleSession (default: false): When set to true, the Hive Thrift server runs in single session mode. All the JDBC/ODBC connections share the temporary views, function registries, SQL configuration and the current database.
  • spark.sql.hive.verifyPartitionPath (default: false): When true, check all the partition paths under the table's root directory when reading data stored in HDFS.
  • spark.sql.hive.version (default: 1.2.1): Version of Hive used internally by Spark SQL.
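
Any of these options can also be set when the session is built. Minimal sketch, using spark.sql.hive.metastorePartitionPruning as an example (the value shown is simply its default):

SparkSession spark = SparkSession
  .builder()
  .appName("Hive options example")
  .config("spark.sql.hive.metastorePartitionPruning", "true")
  .enableHiveSupport()
  .getOrCreate();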

5.2 - Server

5.3 - Metastore

Example of a configuration file for a local installation in a test environment:

hive-site.xml
<configuration>
  <property>
    <name>hive.exec.scratchdir</name>
    <value>C:\spark-2.2.0-metastore\scratchdir</value>
    <description>Scratch space for Hive jobs</description>
  </property>
  <property>
    <name>hive.metastore.warehouse.dir</name>
    <value>C:\spark-2.2.0-metastore\spark-warehouse</value>
    <description>Spark Warehouse</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:derby:c:/spark-2.2.0-metastore/metastore_db;create=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>org.apache.derby.jdbc.EmbeddedDriver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
</configuration>

5.3.1 - Database

If no metastore database is configured, the default metastore is created with an embedded Derby database (the metastore_db directory mentioned above).

5.3.2 - Warehouse

See also: Hive - Warehouse.

Search algorithm for the warehouse location:

  • If spark.sql.warehouse.dir is set (not null), its value is used
  • Otherwise, if hive.metastore.warehouse.dir is set (not null), its value is used
  • Otherwise, workingDir/spark-warehouse/ is used

The documentation notes that the hive.metastore.warehouse.dir property in hive-site.xml has been deprecated since Spark 2.0.0; spark.sql.warehouse.dir should be used instead to specify the default location of databases in the warehouse.

Example of log output:

18/07/01 00:10:50 INFO SharedState: spark.sql.warehouse.dir is not set, but hive.metastore.warehouse.dir is set. Setting spark.sql.warehouse.dir to the value of hive.metastore.warehouse.dir ('C:\spark-2.2.0-metastore\spark-warehouse').
18/07/01 00:10:50 INFO SharedState: Warehouse path is 'C:\spark-2.2.0-metastore\spark-warehouse'.

Normally, the warehouse location is taken from the hive-site.xml file, but you can also set it programmatically:

import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.spark.sql.SparkSession;

Path warehouseLocation = Paths.get("target", "spark-warehouse");
SparkSession spark = SparkSession
      .builder()
      .appName("Java Spark Hive Example")
      .config("spark.sql.warehouse.dir", warehouseLocation.toAbsolutePath().toString())
      .enableHiveSupport()
      .getOrCreate();

5.4 - Table

From the Spark SQL documentation (saving-to-persistent-tables):

DataFrames can also be saved as persistent tables into the Hive metastore using the saveAsTable command.

A DataFrame for a persistent table can be created by calling the table method on a SparkSession with the name of the table.
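
A minimal sketch of both directions, assuming an existing DataFrame df and a SparkSession named spark; the table name people is only an example:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Persist the DataFrame as a managed table in the Hive metastore.
df.write().saveAsTable("people");

// Recreate a DataFrame from the persistent table by name.
Dataset<Row> people = spark.table("people");
people.show();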

5.4.1 - Internal

Starting from Spark 2.1, persistent datasource tables have per-partition metadata stored in the Hive metastore.

Hive DDLs on partitions, such as the following, are supported:

ALTER TABLE PARTITION ... SET LOCATION 
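
A sketch of issuing such a statement through the session; the table name, partition column, and location below are hypothetical:

// Move one partition of a hypothetical partitioned table "logs" to a new location.
spark.sql("ALTER TABLE logs PARTITION (dt='2018-07-01') "
    + "SET LOCATION '/data/logs/2018-07-01'");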

5.4.2 - External

For an external table, you can specify the table path via the path option:

df.write.option("path", "/some/path").saveAsTable("t")

The partition information is not gathered by default when creating external datasource tables (those with a path option). To sync the partition information in the metastore, you can invoke MSCK REPAIR TABLE.
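
For example, to sync the partitions of the external table t created above (assuming a SparkSession named spark):

spark.sql("MSCK REPAIR TABLE t");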

6 - Documentation / Reference
