Hive - Table

1 - About

Table implementation in Hive.

Hive also supports custom serializer/deserializers (SerDe) for complex or irregularly structured data.

3 - Name

3.1 - Qualified

The fully qualified name in Hive for a table is:

db_name.table_name

where:

4 - Management

4.1 - Create

If the location is a directory, the query will be on all files in the directory

4.1.1 - External

Data is stored outside the data warehouse.

Structure:

  • a carriage return/line feed at the end of all rows including the last one.

Use external tables when:

  • The data is also used outside of Hive. For example, the data files are updated by another process (that does not lock the files.)
  • Data needs to remain in the underlying location, even after dropping the table.
  • You need a custom location, such as a non-default storage account.
  • A program other than hive manages the data format, location, etc.
CREATE EXTERNAL TABLE log4jLogs (t1 string, t2 string, t3 string, t4 string, t5 string, t6 string, t7 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE 
LOCATION '/example/data/';

where:

Example location:

  • Azure with a Blob storage: wasb://{container}@{storageaccount}.blob.core.windows.net/path/pth

4.1.2 - Internal

Data is stored in the Hive data warehouse.

Use internal tables when:

  • Data is temporary.
  • You want Hive to manage the lifecycle of the table and data.

The data warehouse is located at:

  • /hive/warehouse/ on the default storage for a azure cluster.
CREATE TABLE (word string) LOCATION '/data/nico';

4.1.3 - With transform

MovieLensUserRatings

  • Regex
CREATE TABLE apachelog (
  host STRING,
  IDENTITY STRING,
  USER STRING,
  TIME STRING,
  request STRING,
  STATUS STRING,
  SIZE STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;

4.2 - Show

show tables;

4.3 - Describe

DESCRIBE FORMATTED tableName
# col_name              data_type               comment

clientid                string
querytime               string
market                  string
deviceplatform          string
devicemake              string
devicemodel             string
state                   string
country                 string
querydwelltime          double
sessionid               bigint
sessionpagevieworder    bigint

# Detailed Table Information
Database:               default
Owner:                  root
CreateTime:             Thu Jan 25 13:59:12 UTC 2018
LastAccessTime:         UNKNOWN
Protect Mode:           None
Retention:              0
Location:               wasb://[email protected]b.core.windows.net/hive/warehouse/hivesampletable
Table Type:             MANAGED_TABLE
Table Parameters:
        numFiles                1
        numRows                 0
        rawDataSize             0
        totalSize               4955715
        transient_lastDdlTime   1516888756

# Storage Information
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Compressed:             No
Num Buckets:            -1
Bucket Columns:         []
Sort Columns:           []
Storage Desc Params:
        field.delim             \t
        serialization.format    \t
Time taken: 0.449 seconds, Fetched: 41 row(s)

4.4 - Drop

DROP TABLE IF EXISTS tableName;

4.5 - Storage

All the data of a table is stored in a directory in HDFS and is specified within the create statement.

4.6 - Partition

Each Table can have one or more partition.

4.7 - Load

with beeline

LOAD DATA INPATH '/user/gerardn/myTable.csv' into TABLE myTable;
Table default.myTable stats: [numFiles=1, numRows=0, totalSize=266712, rawDataSize=0]
OK
Time taken: 3.502 seconds

where:

myTable.sql
CREATE EXTERNAL TABLE IF NOT EXISTS myTable
(
col1 string, 
col2   INT, 
col3       DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
TBLPROPERTIES("skip.header.line.count"="1");

5 - Example

db/hive/table.txt · Last modified: 2018/05/15 18:19 by gerardnico