Hive - Bucket (Cluster)


About

Data in each partition may be divided into Buckets.

The bucket key is based on the hash of a column in the table.

Each bucket is stored as a file in the partition directory. Bucketing allows the system to efficiently evaluate queries that depend on a sample of the data (that is, queries that use the TABLESAMPLE clause on the table).
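
As a rough illustration of how a row maps to a bucket, the query below approximates the bucket number for a table with 32 buckets on userid (the page_view table from the DDL section). It is only a sketch: it uses the built-in hash and pmod functions, and the hash that Hive applies internally when writing buckets may differ depending on the Hive version and the column type.

```sql
-- Illustrative only: approximate the bucket number of each userid for 32 buckets.
-- Assumption: the internal bucketing hash behaves like the hash() UDF here.
SELECT userid,
       pmod(hash(userid), 32) AS approx_bucket
FROM page_view
LIMIT 10;
```

Rows that share a userid always land in the same bucket, which is what makes sampling on that column cheap.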

Management

DDL

Tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within each bucket via SORTED BY columns. The sorting property allows internal operators to take advantage of the known data layout while evaluating queries.

Sampling is efficient on the clustered column.

Example:

  • the clustered column is userid
  • the data is clustered by a hash function of userid into 32 buckets.
  • within each bucket the data is sorted in increasing order of viewTime.
CREATE TABLE page_view(
    viewTime INT, 
    userid BIGINT,
    page_url STRING,
    referrer_url STRING,
    ip STRING COMMENT 'IP Address of the User'
)
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS
ROW FORMAT DELIMITED
        FIELDS TERMINATED BY '\001'
        COLLECTION ITEMS TERMINATED BY '\002'
        MAP KEYS TERMINATED BY '\003'
STORED AS SEQUENCEFILE;
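
Once the table exists, populating and sampling it could look like the sketch below. The staging table page_view_stg and the partition values are hypothetical and serve only as an illustration. The two SET statements apply to Hive releases before 2.0, where bucketing and sorting are not enforced by default.

```sql
-- Hive < 2.0 only: make inserts honor the CLUSTERED BY / SORTED BY definition.
SET hive.enforce.bucketing = true;
SET hive.enforce.sorting = true;

-- page_view_stg is a hypothetical unbucketed staging table
-- assumed to hold the same columns plus dt and country.
INSERT OVERWRITE TABLE page_view PARTITION (dt = '2008-06-08', country = 'US')
SELECT viewTime, userid, page_url, referrer_url, ip
FROM page_view_stg
WHERE dt = '2008-06-08' AND country = 'US';

-- Read only bucket 3 of the 32 buckets, i.e. roughly 1/32 of the partition.
SELECT s.userid, s.page_url
FROM page_view TABLESAMPLE (BUCKET 3 OUT OF 32 ON userid) s
WHERE s.dt = '2008-06-08' AND s.country = 'US';
```

Because the sample is taken on the clustered column, Hive can read the matching bucket file instead of scanning the whole partition.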
