Hive - Partition


About

Data Partitions (Clustering of data) in Hive

Each table can have one or more partitions.

Data in each partition may be further divided into buckets.

Column

Directory Hierarchy

The partition columns determine how the data is stored. A separate data directory is created for each distinct value combination in the partition columns. The partition column order determines the directory hierarchy.

Example:

  • For a table T with a date partition column ds, the data for a particular date is stored in the following directory:
<table directory location>/ds=<date>

  • partitioned by (ds string, dept int) implies a directory structure of ds=2009-02-26/dept=2.
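As an illustration (the warehouse location and the values below are hypothetical), the data of such a table would be laid out on the filesystem as:

/user/hive/warehouse/t
├── ds=2009-02-26
│   ├── dept=1
│   └── dept=2
└── ds=2009-02-27
    └── dept=2

Each leaf directory contains the data files of one partition.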

Data Type and Value

The partition column is a virtual column. Its value is not stored with the data; it is derived from the directory where the data is saved.

Their data type and value must be convertible to:

  • a string,
  • and thereafter to a valid directory name.

The value must therefore not contain characters that are reserved in filesystem path names (such as '/' or '..').
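Even though it is virtual, the partition column can be queried like any regular column. A minimal sketch against the page_view table used further below:

-- ds is not stored in the data files; its value comes from the directory name,
-- yet it can be selected, filtered and grouped like any other column
SELECT ds, count(*) FROM page_view GROUP BY ds;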

Type

Dynamic

Dynamic Partition (DP) columns: columns whose values are only known at EXECUTION TIME.

Dynamic partitioning means that you want Hive to create the partitions automatically for you. Once the partitioned table has been created (see the syntax section below) and the staging table is loaded, all you need to do is insert the data into the partitioned table:

-- Enable dynamic partitioning; nonstrict mode allows all partition
-- columns to be dynamic (strict mode requires at least one static one)
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
-- The dynamic partition column (L_SHIPDATE) must be the last column
-- of the SELECT list
INSERT INTO TABLE lineitem_part
PARTITION (L_SHIPDATE)
SELECT L_ORDERKEY as L_ORDERKEY, L_PARTKEY as L_PARTKEY,
     L_SUPPKEY as L_SUPPKEY, L_LINENUMBER as L_LINENUMBER,
     L_QUANTITY as L_QUANTITY, L_EXTENDEDPRICE as L_EXTENDEDPRICE,
     L_DISCOUNT as L_DISCOUNT, L_TAX as L_TAX, L_RETURNFLAG as L_RETURNFLAG,
     L_LINESTATUS as L_LINESTATUS, L_SHIPDATE as L_SHIPDATE_PS,
     L_COMMITDATE as L_COMMITDATE, L_RECEIPTDATE as L_RECEIPTDATE,
     L_SHIPINSTRUCT as L_SHIPINSTRUCT, L_SHIPMODE as L_SHIPMODE,
     L_COMMENT as L_COMMENT, L_SHIPDATE as L_SHIPDATE FROM lineitem;

Static

Static partitioning means that you have already sharded data in the appropriate directories. With static partitions, you add Hive partitions manually based on the directory location.

  • Add the partition:
ALTER TABLE lineitem_part ADD PARTITION (L_SHIPDATE = '5/23/1996 12:00:00 AM')
LOCATION '/data/partitions/5_23_1996/';
  • Insert data:
INSERT OVERWRITE TABLE lineitem_part
PARTITION (L_SHIPDATE = '5/23/1996 12:00:00 AM')
SELECT * FROM lineitem
WHERE lineitem.L_SHIPDATE = '5/23/1996 12:00:00 AM';

Partitioned Views

Partitioned Views (HIVE-1941)

Syntax

Partition

Partitioned tables can be created using the PARTITIONED BY clause.

CREATE TABLE lineitem_part
    (L_ORDERKEY INT, L_PARTKEY INT, L_SUPPKEY INT, L_LINENUMBER INT,
     L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE,
     L_TAX DOUBLE, L_RETURNFLAG STRING, L_LINESTATUS STRING,
     L_SHIPDATE_PS STRING, L_COMMITDATE STRING, L_RECEIPTDATE STRING, 
     L_SHIPINSTRUCT STRING, L_SHIPMODE STRING, L_COMMENT STRING)
PARTITIONED BY(L_SHIPDATE STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;

Bucket

Tables or partitions can be bucketed using CLUSTERED BY columns, and the data can be sorted within each bucket via SORTED BY columns.
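A minimal sketch (the table name lineitem_bucketed, the reduced column list and the bucket count are hypothetical):

CREATE TABLE lineitem_bucketed
    (L_ORDERKEY INT, L_PARTKEY INT, L_QUANTITY DOUBLE)
PARTITIONED BY (L_SHIPDATE STRING)
-- hash rows on L_ORDERKEY into 32 buckets, each kept sorted on L_ORDERKEY
CLUSTERED BY (L_ORDERKEY) SORTED BY (L_ORDERKEY ASC) INTO 32 BUCKETS
STORED AS TEXTFILE;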

Management

Describe

SHOW PARTITIONS page_view;
-- List the columns and all other properties of a partition. This prints a lot
-- of information, and not in a pretty format; usually used for debugging.
DESCRIBE EXTENDED page_view PARTITION (ds='2008-08-08');
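SHOW PARTITIONS prints one partition spec per line, mirroring the directory names. For the table above, the output would look like (values are illustrative):

ds=2008-08-08
ds=2008-08-09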

Rebuild

  • To rebuild all the partitions at once, i.e. register in the metastore every partition directory that exists on the filesystem, use MSCK REPAIR TABLE:
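A minimal example, assuming partition directories for lineitem_part were created directly on the filesystem (outside of Hive) and need to be registered in the metastore:

MSCK REPAIR TABLE lineitem_part;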

Drop

ALTER TABLE pv_users DROP PARTITION (ds='2008-08-08');

Note that the data of this partition will be dropped and may not be recoverable.

Load

External table

Exchange

https://cwiki.apache.org/confluence/display/Hive/Exchange+Partition
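Exchange moves the data of a partition from one table into another table that has the same schema. A minimal sketch (the table names t1 and t2 are hypothetical):

-- Moves the ds='2008-08-08' partition from t1 into t2;
-- both tables must share the same schema and partition columns,
-- and t2 must not already have that partition.
ALTER TABLE t2 EXCHANGE PARTITION (ds='2008-08-08') WITH TABLE t1;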

Insert overwrite

Since the partition column is derived from the directory, you can't load a partition column directly from a file. You have to do it in two steps:

  • Create an external staging table over the raw data files:
CREATE EXTERNAL TABLE my_table_staging (id INT, fname STRING, dt STRING, lname STRING)
LOCATION '/usr/hive/warehouse/my_table_staging';
  • Insert from the staging table into the partitioned table (dynamic partitioning; the partition column dt must come last in the SELECT list):
INSERT OVERWRITE TABLE my_table PARTITION (dt)
SELECT id, fname, lname, dt FROM my_table_staging;

Pruning

Data Partition - Partition Pruning (Elimination)

hive.partition.pruning: a strict value for this variable means that the compiler throws an error if no partition predicate is provided in a query against a partitioned table. This protects against a user inadvertently issuing a query against all the partitions of the table. Default: nonstrict.
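For illustration, with the page_view table used above, a query that includes a partition predicate reads only the matching directory:

-- Only the ds=2008-08-08 directory is scanned; all other partitions are pruned
SELECT * FROM page_view WHERE ds = '2008-08-08';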

Statistics

List:

  • Number of rows
  • Number of files
  • Size in Bytes

See Stats
