Hive - Bucket (Cluster)

About

Data in each partition may be divided into Buckets.

The bucket key is based on the hash of a column in the table.

Each bucket is stored as a file in the partition directory. Bucketing allows the system to efficiently evaluate queries that depend on a sample of data (these are queries that use the SAMPLE clause on the table).

Articles Related

Management

Ddl

Tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within that bucket via SORT BY columns. The sorting property allows internal operators to take advantage of the better-known data structure while evaluating queries.

Sampling are efficient on the clustered column.

Example:

the clustered column is userid
the data is clustered by a hash function of userid into 32 buckets.
within each bucket the data is sorted in increasing order of viewTime.

CREATE TABLE page_view(
    viewTime INT, 
    userid BIGINT,
    page_url STRING,
    referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User'
)
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '1'
        COLLECTION ITEMS TERMINATED BY '2'
        MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE;

Documentation / Reference

https://cwiki.apache.org/confluence/display/Hive/Design