About

Table implementation in Hive.

Hive also supports custom serializer/deserializers (SerDe) for complex or irregularly structured data.

Property

Qualified Name

The fully qualified name in Hive for a table is:

db_name.table_name

where:

Default Storage

By default, tables are assumed to be of:

Type

The type is based on the location of data.

External

Hive - External Table

Use external tables when:

  • The data is also used outside of Hive. For example, the data files are updated by another process (that does not lock the files.)
  • Data needs to remain in the underlying location, even after dropping the table.
  • You need a custom location, such as a non-default storage account.
  • A program other than hive manages the data format, location, etc.

Internal

Hive - Internal table

Use internal tables when:

  • Data is temporary.
  • You want Hive to manage the lifecycle of the table and data.

Management

Create

DDL Create

With transform

MovieLensUserRatings

  • Regex
CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;

ROW FORMAT

row format specifies how the rows are stored in the hive table.

You can use:

STORED AS

Stored as design the file format

Show

List

show tables;
-- The pattern follows Java regular expression syntax (so the period is a wildcard).
SHOW TABLES 'Pattern';
-- Example
SHOW TABLES 'page.*';

Ddl

SHOW CREATE TABLE tableName;

Describe

DESCRIBE FORMATTED tableName
-- To list columns and column types of table.
DESCRIBE EXTENDED page_view;
# col_name              data_type               comment

clientid                string
querytime               string
market                  string
deviceplatform          string
devicemake              string
devicemodel             string
state                   string
country                 string
querydwelltime          double
sessionid               bigint
sessionpagevieworder    bigint

# Detailed Table Information
Database:               default
Owner:                  root
CreateTime:             Thu Jan 25 13:59:12 UTC 2018
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               wasb://hi-clus-hadoop-01-2018-01-25t11-58-44-894z@hiinformaticasawe.blob.core.windows.net/hive/warehouse/hivesampletable
Table Type:             MANAGED_TABLE
Table Parameters:
        numFiles                1
        numRows                 0
        rawDataSize             0
        totalSize               4955715
        transient_lastDdlTime   1516888756

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        field.delim             \t
        serialization.format    \t
Time taken: 0.449 seconds, Fetched: 41 row(s)

where:

Drop

DROP TABLE IF EXISTS tableName;

Partition

Each Table can have one or more partition.

Load

with beeline

LOAD DATA INPATH '/user/gerardn/myTable.csv' into TABLE myTable;
Table default.myTable stats: [numFiles=1, numRows=0, totalSize=266712, rawDataSize=0]
OK
Time taken: 3.502 seconds

where:

CREATE EXTERNAL TABLE IF NOT EXISTS myTable
(
col1 string, 
col2   int, 
col3       double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
TBLPROPERTIES("skip.header.line.count"="1");

Storage

All the data of a table is stored in a directory in HDFS

A partitioned table will create sub-directory by partition columns. See Partitioned column

Statistics

Table and partition statistics are stored in the Hive Metastore

  • Number of rows
  • Number of files
  • Size in Bytes
  • Number of partition

See Hive - Table-Level Statistics (Table/Partition/Column)

Constraint

  • PRIMARY KEY
  • FOREIGN KEY
  • UNIQUE
  • NOT NULL
  • DEFAULT (to come)

Dual / Dummy

Not yet https://issues.apache.org/jira/browse/HIVE-1558

Example