Hive - Table

Card Puncher Data Processing

About

Table implementation in Hive.

Hive also supports custom serializer/deserializers (SerDe) for complex or irregularly structured data.

Property

Qualified Name

The fully qualified name in Hive for a table is:

db_name.table_name

where:

Default Storage

By default, tables are assumed to be of:

Type

The type is based on the location of data.

External

Hive - External Table

Use external tables when:

  • The data is also used outside of Hive. For example, the data files are updated by another process (that does not lock the files.)
  • Data needs to remain in the underlying location, even after dropping the table.
  • You need a custom location, such as a non-default storage account.
  • A program other than hive manages the data format, location, etc.

Internal

Hive - Internal table

Use internal tables when:

  • Data is temporary.
  • You want Hive to manage the lifecycle of the table and data.

Management

Create

DDL Create

With transform

MovieLensUserRatings

  • Regex
CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;

ROW FORMAT

row format specifies how the rows are stored in the hive table.

You can use:

STORED AS

Stored as design the file format

Show

List

show tables;
-- The pattern follows Java regular expression syntax (so the period is a wildcard).
SHOW TABLES 'Pattern';
-- Example
SHOW TABLES 'page.*';

Ddl

SHOW CREATE TABLE tableName;

Describe

DESCRIBE FORMATTED tableName
-- To list columns and column types of table.
DESCRIBE EXTENDED page_view;
# col_name              data_type               comment

clientid                string
querytime               string
market                  string
deviceplatform          string
devicemake              string
devicemodel             string
state                   string
country                 string
querydwelltime          double
sessionid               bigint
sessionpagevieworder    bigint

# Detailed Table Information
Database:               default
Owner:                  root
CreateTime:             Thu Jan 25 13:59:12 UTC 2018
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               wasb://hi-clus-hadoop-01-2018-01-25t11-58-44-894z@hiinformaticasawe.blob.core.windows.net/hive/warehouse/hivesampletable
Table Type:             MANAGED_TABLE
Table Parameters:
        numFiles                1
        numRows                 0
        rawDataSize             0
        totalSize               4955715
        transient_lastDdlTime   1516888756

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        field.delim             \t
        serialization.format    \t
Time taken: 0.449 seconds, Fetched: 41 row(s)

where:

Drop

DROP TABLE IF EXISTS tableName;

Partition

Each Table can have one or more partition.

Load

with beeline

LOAD DATA INPATH '/user/gerardn/myTable.csv' into TABLE myTable;
Table default.myTable stats: [numFiles=1, numRows=0, totalSize=266712, rawDataSize=0]
OK
Time taken: 3.502 seconds

where:

CREATE EXTERNAL TABLE IF NOT EXISTS myTable
(
col1 string, 
col2   int, 
col3       double
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
TBLPROPERTIES("skip.header.line.count"="1");

Storage

All the data of a table is stored in a directory in HDFS

A partitioned table will create sub-directory by partition columns. See Partitioned column

Statistics

Table and partition statistics are stored in the Hive Metastore

  • Number of rows
  • Number of files
  • Size in Bytes
  • Number of partition

See Hive - Table-Level Statistics (Table/Partition/Column)

Constraint

  • PRIMARY KEY
  • FOREIGN KEY
  • UNIQUE
  • NOT NULL
  • DEFAULT (to come)

Dual / Dummy

Not yet https://issues.apache.org/jira/browse/HIVE-1558

Example





Discover More
Card Puncher Data Processing
Hive - Bucket (Cluster)

Data in each partition may be divided into Buckets. The bucket key is based on the hash of a column in the table. Each bucket is stored as a file in the partition directory. Bucketing allows the system...
Card Puncher Data Processing
Hive - Data Model (Data Units)

Data in Hive is organized into: database: Namespaces function Tables - A relation table Partitions - Data in each table may be divided into Partitions. Buckets - Data in each partition may be...
Card Puncher Data Processing
Hive - Database

All table in hive are in a database. The database: has a namespaces function are used to enforce security for a user or group of users. Qualify the table names db_name.table_name Set it...
Card Puncher Data Processing
Hive - Delimited Row Format

delimited is a row format that specifies: how the fields are terminated, how the items within collections (arrays or maps) are terminated, and how the map keys are terminated.
Card Puncher Data Processing
Hive - External Table

external table are external because the data is stored outside the data warehouse. Use external tables when: The data is also used outside of Hive. For example, the data files are updated by another...
Card Puncher Data Processing
Hive - File Format (Storage format)

The file format is given at the creation of a table with the statement STORED AS. The default is given by the hive.default.fileformat...
Card Puncher Data Processing
Hive - HiveQL

HiveQL Tutorial is also a good reference Language Manual Reference Select...
Card Puncher Data Processing
Hive - Partition

in Hive Each Table can have one or more partition. Data in each partition may be furthermore divided into Buckets. The partition columns determine how the data is stored. A separate data directory...
Card Puncher Data Processing
Hive - Text File (TEXTFILE)

TEXTFILE is the default storage format of a table STORED AS TEXTFILE is normally the storage format and is then optional. The delimiters are assumed to be ^A(ctrl-a). Example with the customer...
Card Puncher Data Processing
Hive - Warehouse

Hive metastore warehouse is the root location of the internal storage of data for Hive. The data of tables: created without file system location (ie internal table) loaded are saved in this directory....



Share this page:
Follow us:
Task Runner