Hive - Table

1 - About

Table implementation in Hive.

Hive also supports custom serializer/deserializers (SerDe) for complex or irregularly structured data.

3 - Name

3.1 - Qualified

The fully qualified name in Hive for a table is:

db_name.table_name

where:

4 - Type

The type is based on the location of data.

4.1 - External

Hive - External Table

Use external tables when:

  • The data is also used outside of Hive. For example, the data files are updated by another process (that does not lock the files.)
  • Data needs to remain in the underlying location, even after dropping the table.
  • You need a custom location, such as a non-default storage account.
  • A program other than hive manages the data format, location, etc.

4.2 - Internal

Hive - Internal table

Use internal tables when:

  • Data is temporary.
  • You want Hive to manage the lifecycle of the table and data.

5 - Management

5.1 - Create

5.1.1 - With transform

MovieLensUserRatings

  • Regex
CREATE TABLE apachelog (
  host STRING,
  IDENTITY STRING,
  USER STRING,
  TIME STRING,
  request STRING,
  STATUS STRING,
  SIZE STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;

5.1.2 - ROW FORMAT

The delimited row format specifies how the rows are stored in the hive table. In the case of the delimited format, this specifies:

  • how the fields are terminated,
  • how the items within collections (arrays or maps) are terminated,
  • and how the map keys are terminated.

5.1.3 - STORED AS

STORED AS:

  • SEQUENCEFILE (default) indicates that the data is stored in a binary format (using hadoop SequenceFiles) on hdfs.

5.2 - Show

5.2.1 - List

SHOW TABLES;
-- The pattern follows Java regular expression syntax (so the period is a wildcard).
SHOW TABLES 'Pattern';
-- Example
SHOW TABLES 'page.*';

5.2.2 - Ddl

SHOW CREATE TABLE tableName;

5.2.3 - Describe

DESCRIBE FORMATTED tableName
-- To list columns and column types of table.
DESCRIBE EXTENDED page_view;
# col_name              data_type               comment

clientid                string
querytime               string
market                  string
deviceplatform          string
devicemake              string
devicemodel             string
state                   string
country                 string
querydwelltime          double
sessionid               bigint
sessionpagevieworder    bigint

# Detailed Table Information
Database:               default
Owner:                  root
CreateTime:             Thu Jan 25 13:59:12 UTC 2018
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               wasb://[email protected]b.core.windows.net/hive/warehouse/hivesampletable
Table Type:             MANAGED_TABLE
Table Parameters:
        numFiles                1
        numRows                 0
        rawDataSize             0
        totalSize               4955715
        transient_lastDdlTime   1516888756

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        field.delim             \t
        serialization.format    \t
Time taken: 0.449 seconds, Fetched: 41 row(s)

where:

5.3 - Drop

DROP TABLE IF EXISTS tableName;

5.4 - Partition

Each Table can have one or more partition.

5.5 - Load

with beeline

LOAD DATA INPATH '/user/gerardn/myTable.csv' into TABLE myTable;
Table default.myTable stats: [numFiles=1, numRows=0, totalSize=266712, rawDataSize=0]
OK
Time taken: 3.502 seconds

where:

myTable.sql
CREATE EXTERNAL TABLE IF NOT EXISTS myTable
(
col1 string, 
col2   INT, 
col3       DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
TBLPROPERTIES("skip.header.line.count"="1");

5.6 - Storage

All the data of a table is stored in a directory in HDFS and is specified within the create statement.

format: conf hive.default.fileformat Default file format for CREATE TABLE statement. Options are TextFile, SequenceFile, RCFile, and Orc. Default: TextFile

See Hive - SerDe (Storage Format)

5.7 - Statistics

Table and partition statistics are stored in the Hive Metastore

  • Number of rows
  • Number of files
  • Size in Bytes
  • Number of partition

See Hive - Table-Level Statistics (Table/Partition/Column)

5.8 - Constraint

  • PRIMARY KEY
  • FOREIGN KEY
  • UNIQUE
  • NOT NULL
  • DEFAULT (to come)

5.9 - Dual / Dummy

6 - Example

db/hive/table.txt · Last modified: 2018/07/18 19:04 by gerardnico