Apache Drill


About

Drill is a SQL engine designed to query data files stored in a file system structure.

At the core of Apache Drill is the 'Drillbit' service, which is responsible for accepting requests from the client, processing the queries, and returning results to the client.

Drill uses ZooKeeper to maintain cluster membership and health-check information.

Workspace

https://drill.apache.org/docs/workspaces/

You can define one or more workspaces in a storage plugin configuration. The workspace defines the location of files in subdirectories of a local or distributed file system. Drill searches the workspace to locate data when you run a query. A hidden default workspace, dfs.default, points to the root of the file system.
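For example, assuming the default dfs.tmp workspace (which points to the /tmp directory on most installations) and a hypothetical file sample.json inside it, a query references the workspace as plugin.workspace:

-- Query a file through the dfs plugin and its tmp workspace (the file name is an assumption)
SELECT * FROM dfs.tmp.`sample.json` LIMIT 5;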

Storage Plugin and configuration

A storage plugin is a software module for connecting Drill to data sources.

A storage plugin:

  • optimizes the execution of Drill queries,
  • provides the location of the data,
  • configures workspaces,
  • configures file formats for reading data.


File System and Data Type

Drill can query a local or distributed (HDFS) file system, with file types such as plain text (CSV, TSV, PSV), JSON, and Parquet.

The extensions for these file types must match the configuration settings for your registered storage plugins. For example, PSV files may be defined with a .tbl extension, while CSV files are defined with a .csv extension.

Supported Storage: Data Sources and File Formats

The output (writer) file format for CREATE TABLE AS (CTAS) statements is set per session with the store.format option:

alter session set `store.format`='parquet';
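For instance, with store.format set to parquet, a CTAS statement writes Parquet files into a writable workspace. A minimal sketch, assuming the default writable dfs.tmp workspace and the bundled employee.json sample (the target table name is an assumption):

-- Write the query result as Parquet files into the dfs.tmp workspace
ALTER SESSION SET `store.format` = 'parquet';
CREATE TABLE dfs.tmp.`employee_copy` AS SELECT * FROM cp.`employee.json`;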

Storage Plugin Conf

Drill saves storage plugin configurations:

  • in a temporary directory (embedded mode)
  • or in ZooKeeper (distributed mode).

For example, on Mac OS X, Drill uses /tmp/drill/sys.storage_plugins to store storage plugin configurations. This temporary directory is cleared when you reboot. When you run Drill in embedded mode, add the sys.store.provider.local.path option to the drill-override.conf file, specifying the path where the plugin configurations should be stored.

 drill.exec: {
    cluster-id: "drillbits1",
    zk.connect: "localhost:2181",
    sys.store.provider.local.path="/mypath"
 }

Schema / Connection

Schemas in Drill are storage plugins.

The Drill installation registers the cp, dfs, hbase, hive, and mongo default storage plugin configurations.
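The registered configurations (and their workspaces) appear as schemas and can be listed directly:

-- List all registered storage plugin configurations and workspaces
SHOW SCHEMAS;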

Classpath (cp)

The cp schema points to files packaged in JAR files on the Drill classpath, such as the sample file employee.json, that you can query.

It's implemented by the cp (classpath) storage plugin.

Example:

SELECT * FROM cp.`employee.json` LIMIT 3;

File System (dfs)

dfs points to the root of the local file system (relative paths do not seem to work). You can configure this storage plugin to point to any distributed file system, such as a Hadoop or S3 file system.

Example:

SELECT * FROM dfs.`/opt/apache-drill-1.20.0/sample-data/region.parquet`

Others

  • hbase: Provides a connection to HBase.
  • hive: Integrates Drill with the Hive metadata abstraction of files, HBase, and libraries to read data and operate on SerDes and UDFs.
  • mongo: Provides a connection to MongoDB data.
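A minimal query sketch for these plugins, assuming a Hive table named orders and a MongoDB database test with a collection users (all object names are assumptions):

-- Query a Hive table through the hive storage plugin
SELECT * FROM hive.`orders` LIMIT 10;
-- Query a MongoDB collection through the mongo storage plugin
SELECT * FROM mongo.test.`users` LIMIT 10;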

Client Connection

Drill supports two kinds of client connections:

  • Direct Drillbit Connection
  • ZooKeeper Quorum Connection

Clients can communicate with a specific Drillbit directly or go through a ZooKeeper quorum to discover the available Drillbits before submitting queries.

The format of the JDBC URL differs slightly, depending on the way you want to connect to the Drillbit:

  • random
  • local,
  • or direct.

JDBC URL: Drillbit

A Drillbit must first be started.

The URL format to connect directly to a Drillbit is:

jdbc:drill:drillbit=Host[:Port];Property1=Value;Property2=Value;...
  • default port: 31010
  • Properties are case-sensitive. Do not duplicate properties in the connection URL.

http://maprdocs.mapr.com/51/Drill/Using-JDBC-Driver-App.html

With DBeaver and an SSH tunnel

  • URL: jdbc:drill:drillbit=localhost:31010
  • SSH: User and password
  • Advanced Setting: Local Port 31010, Remote Port 31010

JDBC URL: ZooKeeper

This section covers using the URL for a random or local connection. If you want ZooKeeper to randomly choose a Drillbit in the cluster, or if you want to connect to the local Drillbit, the format of the driver URL is:

jdbc:drill:zk=<zk name>[:<port>][,<zk name2>[:<port>]...]<directory>/<cluster ID>;[schema=<storage plugin>]

where

  • jdbc is the connection type. Required.
  • schema is the name of a storage plugin configuration to use as the default for queries. For example, schema=dfs or schema=hive. Optional.
  • zk name specifies one or more ZooKeeper host names, or IP addresses. Use local instead of a host name or IP address to connect to the local Drillbit. Required.
  • port is the ZooKeeper port number. Port 2181 is the default. On a MapR cluster, the default is 5181. Optional.
  • directory is the Drill directory in ZooKeeper, which by default is /drill. Optional.
  • cluster ID is drillbits1 by default. If the default has changed, determine the cluster ID and use it. Optional.
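For example, a connection through a two-node ZooKeeper quorum with the defaults above might look like this (host names are assumptions):

jdbc:drill:zk=zkhost1:2181,zkhost2:2181/drill/drillbits1;schema=dfs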

Starting Drill

Embedded mode

Embedded mode is a quick way to try Drill without having to perform any configuration tasks. A ZooKeeper installation is not required. Installing Drill in embedded mode configures the local Drillbit service to start automatically when you launch the Drill shell.

A JDBC call with the following URL starts the embedded mode: it starts a local Drillbit server and blocks the main thread.

set JAVA_HOME=C:\Java\jdk1.7.0_79
sqlline.bat -u "jdbc:drill:zk=local"

You can then see the service in the web console:

Drill Web Console

Dependency:

<!-- Running Drill in embedded mode using Drill's jdbc-all JDBC driver Jar file alone is not supported -->
<dependency>
    <groupId>org.apache.drill.exec</groupId>
    <artifactId>drill-jdbc</artifactId>
    <version>1.10.0</version>
</dependency>

Distributed Mode

  • Service script
drillbit.sh

Debug:

. ./drill-config.sh
./runbit debug

https://drill.apache.org/docs/starting-drill-in-distributed-mode/

CSV

SELECT * FROM dfs.`/tmp/csv_with_header.csv2`;

Config:

"csv": {
  "type": "text",
  "extensions": [
    "csv2"
  ],
  "skipFirstLine": false,
  "extractHeader": true,
  "delimiter": ","
},
  • Inline configuration (table function):
select a, b from table(dfs.`path/to/data.csv`(type => 'text', fieldDelimiter => ',', extractHeader => true))
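With extractHeader set to true, fields can be referenced by their header names; without a header, the fields land in the columns array. A sketch, assuming a header containing the columns name and city, and a second file without a header (both the column names and the second file are assumptions):

-- Header extracted: reference the fields by name
SELECT `name`, `city` FROM dfs.`/tmp/csv_with_header.csv2` LIMIT 5;
-- No header: reference the fields through the columns array
SELECT columns[0], columns[1] FROM dfs.`/tmp/no_header.csv` LIMIT 5;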

Log

  • log.path
  • log.query.path

Example with Java:

 -Dlog.path="C:\apache-drill-1.1.0\log\sqlline.log" -Dlog.query.path="C:\apache-drill-1.1.0\log\sqlline_queries.log"

Example

SELECT * FROM dfs.`C:/apache-drill-1.1.0/sample-data/region.parquet`;
  • PSV files (pipe-separated values): tbl extension
 SELECT columns[1] as myAlias FROM dfs.`C:/myPath/ToPsvFiles.tbl`;

Metadata

Information Schema

You can query the following INFORMATION_SCHEMA tables:

  • SCHEMATA
  • CATALOGS
  • TABLES
  • COLUMNS
  • VIEWS
SELECT CATALOG_NAME, SCHEMA_NAME as all_my_data_sources FROM INFORMATION_SCHEMA.SCHEMATA ORDER BY SCHEMA_NAME;
-- Shortcut
-- SHOW DATABASES; -- to show the database
-- show tables; -- to show the tables
+---------------+----------------------+
| CATALOG_NAME  | all_my_data_sources  |
+---------------+----------------------+
| DRILL         | INFORMATION_SCHEMA   |
| DRILL         | cp.default           |
| DRILL         | dfs.default          |
| DRILL         | dfs.root             |
| DRILL         | dfs.tmp              |
| DRILL         | sys                  |
+---------------+----------------------+
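For example, the COLUMNS table describes the columns of the tables exposed by each schema:

SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.`COLUMNS`
WHERE TABLE_SCHEMA = 'INFORMATION_SCHEMA'
LIMIT 10;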

System Tables

The system tables are in the sys schema.

  • List of tables
use sys;
show tables;
  • System, session, and boot options:
select * from options where type='SYSTEM' limit 10;
  • Boot, start parameters
select * from boot limit 10;
  • Threads
select * from threads;
  • Memory
select * from memory;

Version

select * from sys.version;

Querying in Directory

Assume that bob.logdata is a workspace that points to a logs directory containing multiple subdirectories: 2012, 2013, and 2014.

The following query constrains the scan to files inside the subdirectory named 2013. The variables:

  • dir0 refers to the first level down from logs,
  • dir1 to the next level,
  • and so on.
0: jdbc:drill:> USE bob.logdata;
+------------+-----------------------------------------+
|     ok     |              summary                    |
+------------+-----------------------------------------+
| true       | Default schema changed to 'bob.logdata' |
+------------+-----------------------------------------+

SELECT * FROM logs WHERE dir0='2013' LIMIT 10;

You can use query directory functions:

  • to restrict a query to one of a number of subdirectories
  • to prevent Drill from scanning all data in directories.
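A sketch using the MAXDIR directory function, which returns the name of the greatest subdirectory so that only that directory is scanned (the dfs plugin and the /tmp/logs path are assumptions):

-- Scan only the most recent year directory under /tmp/logs
SELECT * FROM dfs.`/tmp/logs` WHERE dir0 = MAXDIR('dfs', '/tmp/logs') LIMIT 10;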

Reserved Word

Reserved words must be enclosed in backticks:

`YEAR`
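A hypothetical example, assuming a CSV file whose header contains a YEAR column:

SELECT `YEAR` FROM dfs.`/tmp/csv_with_header.csv2` LIMIT 5;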

Interface/Client

  • Drill shell
  • Drill Web Console
  • ODBC
  • JDBC
  • C++ API

Configuration Options

Example for the admin security:

SELECT name, val FROM sys.`options` where name like 'security.admin%' ORDER BY name;
|name                      |val                        |
|--------------------------|---------------------------|
|security.admin.user_groups|%drill_process_user_groups%|
|security.admin.users      |%drill_process_user%       |
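These options can be changed at the system or session level with ALTER SYSTEM or ALTER SESSION. A sketch for the admin user list (the user names are assumptions; the option only takes effect when authentication is enabled):

ALTER SYSTEM SET `security.admin.users` = 'alice,bob';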

Support

File Not Found

A file not found error means that the file extension is not recognized by any registered storage plugin format configuration. For instance, PSV files (pipe-separated values) have a .tbl extension.

Unexpected RuntimeException: java.lang.IndexOutOfBoundsException: Index: 0

Error during udf area creation [/C:/Users/gerard/drill/udf/registry] on file system. See https://issues.apache.org/jira/browse/DRILL-5101. A workaround on Windows:

mkdir %userprofile%\drill
mkdir %userprofile%\drill\udf
mkdir %userprofile%\drill\udf\registry
mkdir %userprofile%\drill\udf\tmp
mkdir %userprofile%\drill\udf\staging
takeown /R /F %userprofile%\drill 




