Stream - How to capture change of state - Change Data Capture (CDC)

About

This page is about how to capture the changes of state in order to create a stream. The stream processing will then perform an incremental computation

Capturing data changes is far from a trivial task, and this page tries to help you understand the pro and cons of every strategy.

Articles Related

Usage

DataWarehouse

Most data warehouse tables are so large that they cannot be refreshed during every load cycle. You must have a capability to transfer only the relevant changes to the source data since the last update. Isolating the latest source data is called change data capture (CDC). The idea behind change data capture is simple enough: just transfer the data that has been changed since the last load.

During the data warehouse's initial historic load, capturing source data content changes is not important since you are loading all data.

See Data Warehousing - Data Loading (Fast Refresh)

Replication

Data Processing - Replication

Incremental Processing

Incremental Computation

Characteristic

Model

data-based (state) - incremental column
event-based (changes) - log miner, trigger, …

Real-time

Real time
Schedule

Source

Log
App

If updates to source system tables occur outside the application itself, each external update should take into account the incremental functionalities (and update the incremental column accordingly. Because the application strategy has a weak consistency level, a data reconciliation strategy should be put in place in order to identify any difference culprit.

CDC mechanisms

Log Miner

Log miner app

Database Log Miner (Log-based Change Data Capture or Database Log Readers

Materialized View

SQL - Materialized View

Incremental column

Also known as the tracability column, in order to identify the change as modifications, insertion and deletion by date

Timestamp

A timestamp (Date and time stamps) that track the last time data was changed (ie the equivalent of the file modification time but for the row)

Might require modification in the program
The programs may fail to update this field (ie eventually consistency level)
Deletions are not included since the entire record including the time stamps are gone.

Id

A single column containing a ascendant unique ID where every new row has an new id bigger than the previous one

Use case: Table that are insert-only such as a fact table

Data Set Diff

The conventional approach to data integration involves extracting all data from the source system and then integrating the entire set. This phase uses possibly also an incremental strategy by detecting the changes between the target and the source via a data set comparison based on a hash function: OWB - How to implement a type 2 slowly changing dimension with a hash function ?

Not real time (impossible in the given timeframe)

Database Triggers

Database trigger

A trigger-based approaches support incremental iteration with only weak consistency guarantees

In concept, the code that the trigger launches can be written to write a record of the transaction to a database table and the ETL tool can then poll those tables on a periodic basis.

Pro:

Easy to setup: The trigger-based data capture system is easy to understand and widely supported by database systems.
No admin right needed: Table synchronization can be setup by users and application developers without requiring a database administrator to modify the server.

Cons:

The capture may lost change.. For example triggers can regularly be deleted/disabled then re-added/re-enabled in the normal course of a business operation.
The main database server process is used

CDC vs Full reload

Long running time
More storage (Flip Mechanism needed)
Problem in the workflow let data inconsistent - The mechanism is not resilient.