Stream - How to capture change of state - Change Data Capture (CDC)

About

This page is about how to capture changes of state in order to create a stream. The stream processing then performs an incremental computation.

Capturing data changes is far from a trivial task, and this page tries to help you understand the pros and cons of each strategy.

Usage

Data Warehouse

Most data warehouse tables are so large that they cannot be fully refreshed during every load cycle. You therefore need the capability to transfer only the relevant changes made to the source data since the last update. Isolating this latest source data is called change data capture (CDC). The idea behind change data capture is simple: transfer only the data that has changed since the last load.

During the data warehouse's initial historic load, capturing source data content changes is not important since you are loading all data.

See Data Warehousing - Data Loading (Fast Refresh)

Replication

Data Processing - Replication

Incremental Processing

Incremental Computation

Characteristic

Model

  • data-based (state) - incremental column
  • event-based (changes) - log miner, trigger, …

Real-time

  • Real-time
  • Scheduled

Source

  • Log
  • App

If updates to the source system tables occur outside the application itself, each external update should take the incremental functionality into account (and update the incremental column accordingly). Because this application-based strategy offers only a weak consistency level, a data reconciliation strategy should be put in place to identify the cause of any difference.
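
As an illustration only, a minimal reconciliation check could compare simple aggregates between the source and the target. In the sketch below, the table name, the amount column, and the connections are all hypothetical:

  def reconcile(source_conn, target_conn, table="orders"):
      """Compare simple aggregates between source and target to detect
      changes that the application-based capture missed."""
      query = f"SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM {table}"
      src = source_conn.execute(query).fetchone()
      tgt = target_conn.execute(query).fetchone()
      if src != tgt:
          # A difference means a change bypassed the incremental column;
          # a row-by-row comparison would then locate the culprit rows.
          print(f"Mismatch on {table}: source={src} target={tgt}")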

CDC mechanisms

Log Miner

Log miner app

Materialized View

Incremental column

Also known as the traceability column, it is used to identify changes (modifications, insertions, and deletions) by date.

Timestamp

A timestamp (date-and-time stamp) that tracks the last time the data was changed (i.e. the equivalent of the file modification time, but for a row).

  • May require modifications to the application programs
  • The programs may fail to update this field (i.e. only an eventual consistency level)
  • Deletions are not captured since the entire record, including the timestamp, is gone.
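
As a minimal sketch of this strategy (assuming a hypothetical orders table whose last_update column is maintained by the application), an extraction keeps a watermark and selects only the rows modified since:

  def extract_changes(conn, watermark):
      """Timestamp-based CDC: return the rows changed since the watermark."""
      rows = conn.execute(
          "SELECT id, amount, last_update FROM orders WHERE last_update > ?",
          (watermark,),
      ).fetchall()
      # Persist the highest timestamp seen so that the next run
      # only picks up later changes.
      new_watermark = max((r[2] for r in rows), default=watermark)
      return rows, new_watermark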

Id

A single column containing an ascending unique ID, where every new row gets an ID greater than the previous one.

Use case: tables that are insert-only, such as a fact table.
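
Under the same assumptions (here a hypothetical insert-only sales_fact table), the watermark becomes the highest ID already loaded:

  def extract_new_rows(conn, last_id):
      """ID-based CDC: fetch rows above the last loaded ID. Valid only for
      insert-only tables (updates and deletes are invisible here)."""
      rows = conn.execute(
          "SELECT id, amount FROM sales_fact WHERE id > ? ORDER BY id",
          (last_id,),
      ).fetchall()
      new_last_id = rows[-1][0] if rows else last_id
      return rows, new_last_id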

Data Set Diff

The conventional approach to data integration involves extracting all data from the source system and then integrating the entire set. Even this phase can use an incremental strategy by detecting the changes between the target and the source via a data set comparison based on a hash function: see OWB - How to implement a type 2 slowly changing dimension with a hash function?

  • Not real time (a full comparison is impossible in the given timeframe)
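
A sketch of the idea, assuming two in-memory data sets keyed by primary key (a real comparison would typically run set-based in SQL, as in the OWB link above):

  import hashlib

  def row_hash(row):
      """Hash the non-key columns so two data sets can be diffed cheaply."""
      return hashlib.md5("|".join(map(str, row)).encode()).hexdigest()

  def diff(source, target):
      """Return the keys to insert, update, and delete in the target."""
      inserts = [k for k in source if k not in target]
      deletes = [k for k in target if k not in source]
      updates = [k for k in source if k in target
                 and row_hash(source[k]) != row_hash(target[k])]
      return inserts, updates, deletes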

Database Triggers

Database trigger

Trigger-based approaches support incremental iteration with only weak consistency guarantees.

In concept, the code that the trigger launches can write a record of the transaction to a database table, and the ETL tool can then poll those tables on a periodic basis.
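
As a minimal runnable sketch (using SQLite and a hypothetical orders table), a trigger appends every insert to a change-log table that an ETL job can poll:

  import sqlite3

  conn = sqlite3.connect(":memory:")
  conn.executescript("""
      CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
      -- Change log that the ETL tool polls periodically
      CREATE TABLE orders_changes (
          change_id  INTEGER PRIMARY KEY AUTOINCREMENT,
          order_id   INTEGER,
          operation  TEXT,
          changed_at TEXT DEFAULT CURRENT_TIMESTAMP
      );
      -- Fired on every insert; similar triggers would cover UPDATE/DELETE
      CREATE TRIGGER orders_ins AFTER INSERT ON orders
      BEGIN
          INSERT INTO orders_changes (order_id, operation)
          VALUES (NEW.id, 'INSERT');
      END;
  """)
  conn.execute("INSERT INTO orders (amount) VALUES (42.0)")
  print(conn.execute("SELECT * FROM orders_changes").fetchall())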

Pros:

  • Easy to set up: the trigger-based data capture system is easy to understand and widely supported by database systems.
  • No admin rights needed: table synchronization can be set up by users and application developers without requiring a database administrator to modify the server.

Cons:

  • The capture may lose changes. For example, triggers can be deleted/disabled and then re-added/re-enabled in the normal course of business operations, missing any change made in between.
  • The main database server process is used (the triggers add load to the source database)

CDC vs Full reload

Compared to CDC, a full reload has the following drawbacks:

  • Long running time
  • More storage (a flip mechanism is needed)
  • A problem in the workflow leaves the data inconsistent (the mechanism is not resilient)
