Database - Column (Storage|Store)

> (Data|State) Management and Processing > (Data Type|Data Structure) > (Relation|Table) - Tabular data > Table - Physical Data Structure

1 - About

Storing data in columns is functionally similar to having a built-in index for each column.

Columnar storage is a popular data structure in analytical workloads because of the vertical pruning, parallel processing and compression benefits.

The goal is to keep I/O to a minimum by reading from a disk only the data required for the query.

Advertising

3 - Concept

3.1 - Compression

compression is really more efficient as you are more likely to have more duplicate in a column than in a row.

Since all the values in a given column have the same type, generic compression tends to work better and type-specific compression can be applied.

3.2 - Parallel

In a column storage, data is already vertically partitioned. This means that operations on different columns can easily be processed in parallel. If multiple columns need to be searched or aggregated, each of these operations can be assigned to a different processor core.

3.3 - Index

Columnar storage, in many cases, eliminates the need for additional index structures. Storing data in columns is functionally similar to having a built-in index for each column (bit map or b-tree ?)

3.4 - High Locality

With columnar data, operations on single columns, such as searching or aggregations, can be implemented as loops over an array stored in contiguous memory locations. Such an operation has high spatial locality and can efficiently be executed in the CPU cache. With row-oriented storage, the same operation would be much slower because data of the same column is distributed across memory and the CPU is slowed down by cache misses.

3.5 - Filter Early From Disk

Since column values are stored consecutively, a query engine can skip loading columns whose values it doesn’t need to answer a query, and use vectorized operators on the values it does load.

3.6 - Vertical Partition Pruning

scan only a subset of the columns.

3.7 - Branching more predictable

Instruction branching can be made more predictable by choosing a better encodings that suits the modern processors’ pipeline.

4 - RDBMS

See Column-oriented_DBMS

Advertising

5 - Documentation / Reference