Hive - Open Csv Serde

> Database > Apache - Hive (HS|Hive Server)

1 - About

The Csv Serde is a serde that is applied above a text file. It's one way of reading a CSV / TSV format.

Advertising

3 - Architecture

  • The CSVSerde is available in Hive 0.14 and greater.
  • It was added to the Hive distribution in HIVE-7777.
  • The CSVSerde has been built and tested against Hive 0.14 and later, and uses Open-CSV 2.3 which is bundled with the Hive distribution.

4 - Limitations

This SerDe:

  • does not handle embedded newlines.
  • treats all columns to be of type String. Even if you create a table with non-string column types using this SerDe, the DESCRIBE TABLE output would show string column type. To convert columns to the desired type in a table, you can create a view over the table that does the CAST to the desired type.

5 - Syntax

The following example creates a TSV (Tab-separated) file.

CREATE TABLE my_table(a string, b string, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = "\t",
   "quoteChar"     = "'",
   "escapeChar"    = "\\"
)  
STORED AS TEXTFILE;

where:

  • the fully qualified class name org.apache.hadoop.hive.serde2.OpenCSVSerde must be specified.
  • the default properties value are
 
DEFAULT_ESCAPE_CHARACTER \
DEFAULT_QUOTE_CHARACTER  "
DEFAULT_SEPARATOR        ,
Advertising

6 - Example

6.1 - Step 1 - Create the staging external table

CREATE EXTERNAL  TABLE `customer_dat`(
  `c_customer_sk` INT, 
  `c_customer_id` CHAR(16), 
  `c_current_cdemo_sk` INT, 
  `c_current_hdemo_sk` INT, 
  `c_current_addr_sk` INT, 
  `c_first_shipto_date_sk` INT, 
  `c_first_sales_date_sk` INT, 
  `c_salutation` CHAR(10), 
  `c_first_name` CHAR(20), 
  `c_last_name` CHAR(30), 
  `c_preferred_cust_flag` CHAR(1), 
  `c_birth_day` INT, 
  `c_birth_month` INT, 
  `c_birth_year` INT, 
  `c_birth_country` VARCHAR(20), 
  `c_login` CHAR(13), 
  `c_email_address` CHAR(50), 
  `c_last_review_date_sk` INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   "separatorChar" = "|"
)
LOCATION 'hdfs://locationToADirectory';

where:

  • location is a directory where the tpcds customer data can be found

Et voila:

6.2 - Step 2 - Load the data into the target table with data type

FROM `customer_stg`
INSERT OVERWRITE TABLE `customer` 
SELECT *;

7 - Documentation / Reference

db/hive/csv_serde.txt · Last modified: 2019/05/15 17:35 by gerardnico