Hive - Sample Clause

Card Puncher Data Processing

About

The sampling clause allows the users to write queries for samples of the data instead of the whole table. Currently the sampling is done on the clustered column. (ie columns specified in the CLUSTERED BY)

Syntax

The buckets are numbered starting from 0.

In general the TABLESAMPLE syntax looks like:

TABLESAMPLE(BUCKET x OUT OF y)

where:

  • y has to be a multiple or divisor of the number of buckets in that table as specified at the table creation time.
  • The buckets chosen are determined when the following formula is true:

<MATH> \text{BucketNumber } \text{module } y = x </MATH>

Example

The table pv_gender_sum has 32 bucket.

  • the following statement will select 3 bucket.
SELECT pv_gender_sum.* FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32);
  • the following statement will pick out the 3rd and 19th buckets
SELECT pv_gender_sum.* FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 16)
  • the following statement will pick out half of the 3rd bucket.
SELECT pv_gender_sum.* FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 64 ON userid)

Documentation / Reference





Discover More
Card Puncher Data Processing
Hive - Bucket (Cluster)

Data in each partition may be divided into Buckets. The bucket key is based on the hash of a column in the table. Each bucket is stored as a file in the partition directory. Bucketing allows the system...



Share this page:
Follow us:
Task Runner