Apache Drill

> Database

1 - About

Drill is a SQL Engine aimed to be able to query data file stored in a file system structure.

At the core of Apache Drill is the 'Drillbit' service, which is responsible for accepting requests from the client, processing the queries, and returning results to the client.

Drill uses ZooKeeper to maintain cluster membership and health-check information.

Advertising

3 - Workpsace

https://drill.apache.org/docs/workspaces/

You can define one or more workspaces in a storage plugin configuration. The workspace defines the location of files in subdirectories of a local or distributed file system. Drill searches the workspace to locate data when you run a query. A hidden default workspace, dfs.default, points to the root of the file system.

4 - Storage Plugin and configuration

A storage plugin is a software module for connecting Drill to data sources.

A storage plugin :

  • optimizes execution of Drill queries,
  • provides the location of the data,
  • configures a workspace
  • configure file formats for reading data.

Doc:

4.1 - File System and Data Type

Local or Distributed (HDFS) with the following file types:

The extensions for these file types must match the configuration settings for your registered storage plugins. For example, PSV files may be defined with a .tbl extension, while CSV files are defined with a .csv extension.

Storage Supported: Data Sources and File Formats

Others Data Source:

ALTER SESSION SET `store.format`='parquet';
Advertising

4.2 - Storage Plugin Conf

Drill saves storage plugin configurations:

  • in a temporary directory (embedded mode)
  • or in ZooKeeper (distributed mode).

For example, on Mac OS X, Drill uses /tmp/drill/sys.storage_plugins to store storage plugin configurations. The temporary directory clears when you reboot. When you run drill in embedded mode, add the sys.store.provider.local.path option to the drill-override.conf file and the path for storing the plugin configurations.

drill.exec: {

  cluster-id: "drillbits1",
  zk.connect: "localhost:2181",
  sys.store.provider.local.path="/mypath"

}

5 - Schema

Schema in Drill are storage plugin.

The Drill installation registers the cp, dfs, hbase, hive, and mongo default storage plugin configurations.

  • cp: Points to JAR files in the Drill classpath, such as employee.json that you can query.
  • dfs: Points to the local file system, but you can configure this storage plugin to point to any distributed file system, such as a Hadoop or S3 file system.
  • hbase: Provides a connection to HBase.
  • hive: Integrates Drill with the Hive metadata abstraction of files, HBase, and libraries to read data and operate on SerDes and UDFs.
  • mongo: Provides a connection to MongoDB data.
Advertising

6 - Client Connection

Drill supports two kinds of client connections:

  • Direct Drillbit Connection
  • ZooKeeper Quorum Connection

Clients can communicate to a specific Drillbit directly or go through a ZooKeeper quorum to discover the available Drillbits before submitting queries.

The format of the JDBC URL differs slightly, depending on the way you want to connect to the Drillbit:

  • random
  • local,
  • or direct.

6.1 - JDBC URL Drill bit

A drillbit must be first started.

An URL to directly connect to a Drillbit.

jdbc:drill:drillbit=Host[:Port];Property1=Value;Property2=Value;...
  • default port: 31010
  • Properties are case-sensitive. Do not duplicate properties in the connection URL.

http://maprdocs.mapr.com/51/Drill/Using-JDBC-Driver-App.html

6.2 - JDBC URL Zk

This section covers using the URL for a random or local connection. If you want ZooKeeper to randomly choose a Drillbit in the cluster, or if you want to connect to the local Drillbit, the format of the driver URL is:

jdbc:drill:zk=<zk name>[:<port>][,<zk name2>[:<port>]...<directory>/<cluster ID>;[schema=<storage plugin>]

where

  • jdbc is the connection type. Required.
  • schema is the name of a storage plugin configuration to use as the default for queries. For example,schema=dfs or schema=hive. Optional.
  • zk name specifies one or more ZooKeeper host names, or IP addresses. Use local instead of a host name or IP address to connect to the local Drillbit. Required.
  • port is the ZooKeeper port number. Port 2181 is the default. On a MapR cluster, the default is 5181. Optional.
  • directory is the Drill directory in ZooKeeper, which by default is /Drill. Optional.
  • cluster ID is drillbits1 by default. If the default has changed, determine the cluster ID and use it. Optional.

7 - Starting Drill

7.1 - Embedded mode

Embedded mode is a quick way to try Drill without having to perform any configuration tasks. A ZooKeeper installation is not required. Installing Drill in embedded mode configures the local Drillbit service to start automatically when you launch the Drill shell.

Java URL that will start the embedded mode. A Jdbc call will the following URL will start a server and stop the main thread.

set JAVA_HOME=C:\Java\jdk1.7.0_79
sqlline.bat -u "jdbc:drill:zk=local"

You can then see the service in the web console:

Dependency:

<!-- Running Drill in embedded mode using Drill's jdbc-all JDBC driver Jar file alone is not supported -->
<dependency>
    <groupId>org.apache.drill.exec</groupId>
    <artifactId>drill-jdbc</artifactId>
    <version>1.10.0</version>
</dependency>

7.2 - Distributed Mode

  • Service script
drillbit.sh

Debug:

. ./drill-config.sh
./runbit debug

https://drill.apache.org/docs/starting-drill-in-distributed-mode/

8 - CSV

SELECT * FROM dfs.`/tmp/csv_with_header.csv2`;

Config:

"csv": {
  "type": "text",
  "extensions": [
    "csv2"
  ],
  "skipFirstLine": false,
  "extractHeader": true,
  "delimiter": ","
},
  • inline Configuration
SELECT a, b FROM TABLE(dfs.`path/to/data.csv`(TYPE => 'text', fieldDelimiter => ',', extractHeader => TRUE))

9 - Log

  • log.path
  • log.query.path

Example with Java:

 -Dlog.path="C:\apache-drill-1.1.0\log\sqlline.log" -Dlog.query.path="C:\apache-drill-1.1.0\log\sqlline_queries.log"

10 - Example

SELECT * FROM dfs.`C:/apache-drill-1.1.0/sample-data/region.parquet`;
  • PSV files (pipe-separated values): tbl extension
 SELECT COLUMNS[1] AS myAlias FROM dfs.`C:/myPath/ToPsvFiles.tbl`;

11 - Metadata

Information Schema

You can query the following INFORMATION_SCHEMA tables:

  • SCHEMATA
  • CATALOGS
  • TABLES
  • COLUMNS
  • VIEWS
SELECT CATALOG_NAME, SCHEMA_NAME AS all_my_data_sources FROM INFORMATION_SCHEMA.SCHEMATA ORDER BY SCHEMA_NAME;
-- Shortcut
-- SHOW DATABASES; -- to show the database
-- show tables; -- to show the tables
+---------------+----------------------+
| CATALOG_NAME  | all_my_data_sources  |
+---------------+----------------------+
| DRILL         | INFORMATION_SCHEMA   |
| DRILL         | cp.default           |
| DRILL         | dfs.default          |
| DRILL         | dfs.root             |
| DRILL         | dfs.tmp              |
| DRILL         | sys                  |
+---------------+----------------------+

12 - System Table

The System Tables are in the sys schema

  • List of tables
USE sys;
SHOW TABLES;
  • System, session, and boot options:
SELECT * FROM options WHERE TYPE='SYSTEM' LIMIT 10;
  • Boot, start parameters
SELECT * FROM boot LIMIT 10;
  • Threads
SELECT * FROM threads;
  • Memory
SELECT * FROM memory;

13 - Version

SELECT * FROM sys.version;

14 - Querying in Directory

Assuming that bob.logdata is a workspace that points to the logs directory, which contains multiple subdirectories: 2012, 2013, and 2014.

The following query constrains files inside the subdirectory named 2013. The variable:

  • dir0 refers to the first level down from logs,
  • dir1 to the next level,
  • and so on.
0: jdbc:drill:> USE bob.logdata;
+------------+-----------------------------------------+
|     ok     |              summary                    |
+------------+-----------------------------------------+
| true       | Default schema changed to 'bob.logdata' |
+------------+-----------------------------------------+
SELECT * FROM logs WHERE dir0='2013' LIMIT 10;

You can use query directory functions:

  • to restrict a query to one of a number of subdirectories
  • to prevent Drill from scanning all data in directories.

15 - Reserved Word

reserved word must be enclosed by back ticks

`YEAR`

16 - Interface/Client

  • Drill shell
  • Drill Web Console
  • ODBC*
  • JDBC
  • C++ API

17 - Support

17.1 - File Not Found

file not found means that your file extensions in not recognize by an extension. For instance, PSV files (pipe-separated values) have a tbl extension.

17.2 - Unexpected RuntimeException: java.lang.IndexOutOfBoundsException: Index: 0

Error during udf area creation [/C:/Users/gerard/drill/udf/registry] on file system https://issues.apache.org/jira/browse/DRILL-5101

mkdir %userprofile%\drill
mkdir %userprofile%\drill\udf
mkdir %userprofile%\drill\udf\registry
mkdir %userprofile%\drill\udf\tmp
mkdir %userprofile%\drill\udf\staging
takeown /R /F %userprofile%\drill