Hive - Sample Clause

About

The sampling clause allows the users to write queries for samples of the data instead of the whole table. Currently the sampling is done on the clustered column. (ie columns specified in the CLUSTERED BY)

Articles Related

Syntax

The buckets are numbered starting from 0.

In general the TABLESAMPLE syntax looks like:

TABLESAMPLE(BUCKET x OUT OF y)

where:

y has to be a multiple or divisor of the number of buckets in that table as specified at the table creation time.
The buckets chosen are determined when the following formula is true:

<MATH> \text{BucketNumber } \text{module } y = x </MATH>

Example

The table pv_gender_sum has 32 bucket.

the following statement will select 3 bucket.

SELECT pv_gender_sum.* FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32);

the following statement will pick out the 3rd and 19th buckets

SELECT pv_gender_sum.* FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 16)

the following statement will pick out half of the 3rd bucket.

SELECT pv_gender_sum.* FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 64 ON userid)

Documentation / Reference

https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Sampling