Data Integration - Change Data Capture (Incremental Load, ...)


1 - About

During the data warehouse's initial historic load, capturing changes to the source data content is unimportant, since you are loading all data anyway.

However, most data warehouse tables are so large that they cannot be fully refreshed during every load cycle. You need a way to transfer only the relevant changes made to the source data since the last update. Isolating the latest source data changes is called change data capture (CDC). The idea behind change data capture is simple enough: transfer only the data that has changed since the last load.
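The "transfer only what changed" idea can be sketched with a high-watermark query. This is a minimal illustration, not any specific tool's implementation: the `orders` table, its `updated_at` column, and the persisted watermark value are all assumptions, and sqlite stands in for the source DBMS.

```python
import sqlite3

# Assumed source table: orders(id, amount, updated_at)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    INSERT INTO orders VALUES (1, 10.0, '2017-01-01'), (2, 20.0, '2017-02-01');
""")

# Watermark persisted by the previous load cycle (hypothetical value)
last_load = '2017-01-15'

# Transfer only the rows changed since the last load
changed = conn.execute(
    "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
    (last_load,),
).fetchall()
print(changed)  # only order 2 was modified after the watermark
```

After a successful load, the new watermark (the maximum `updated_at` seen) would be persisted for the next cycle.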

This is why CDC is generally implemented with a publish-subscribe model: the source publishes its changes, and downstream consumers subscribe to them.


3 - CDC mechanisms

  • Database Log Mining (log-based change data capture, or database log readers). The CDC tool reads the DBMS transaction (redo) log, which records every committed change, so it captures all changes with little overhead on the source system.
  • Database Triggers. This DBMS feature invokes a prewritten routine each time a specific set of conditions is met, such as the insertion or update of a record. In concept, the trigger code can write a record of each transaction to an audit table, and the ETL tool can then poll that table periodically. In practice, this approach is far from foolproof. For example, triggers can be deleted or disabled and later re-added or re-enabled in the normal course of business operations, silently losing changes in the interim. Triggers also place a relatively high overhead burden on the source database server.
  • Message Queues. An obvious disadvantage is that they are not a completely reliable source of change information: they only know about the data changes that applications send to them, not about updates made directly in the database.
  • Date and time stamps. Many ERP applications and other data sources maintain a field in each record indicating when it was last changed. The fatal flaw in this approach is that it relies on every program that changes data to unfailingly update this field. In addition, it loses track of deletions, since the entire record, time stamp included, is gone.
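The trigger mechanism above can be sketched as follows. This is a minimal illustration under assumptions: the `customer` table, the `customer_changes` audit table, and the trigger name are invented for the example, and sqlite stands in for the source DBMS (trigger syntax varies by vendor).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed source table and a change-audit table for the ETL tool to poll
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customer_changes (
        id INTEGER,
        op TEXT,
        changed_at TEXT DEFAULT CURRENT_TIMESTAMP
    );
    -- Trigger: record every update as a row in the audit table
    CREATE TRIGGER customer_upd AFTER UPDATE ON customer
    BEGIN
        INSERT INTO customer_changes (id, op) VALUES (NEW.id, 'U');
    END;
""")

conn.execute("INSERT INTO customer VALUES (1, 'Alice')")
conn.execute("UPDATE customer SET name = 'Alicia' WHERE id = 1")

# The ETL tool would poll this table periodically
rows = conn.execute("SELECT id, op FROM customer_changes").fetchall()
print(rows)  # the update on id 1 was captured
```

Note that if the trigger were disabled during the update, nothing would be recorded, which is exactly the failure mode described above.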

See also the change-detection modes of the Kafka Connect JDBC source connector (e.g. incrementing, timestamp, timestamp+incrementing).

3.1 - Custom Query

A custom query allows you to join data from multiple tables. The advantage is that you don't need to load each table separately into the target system.
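A custom query can combine the join with the timestamp-based change filter, so that only changed rows arrive pre-joined. The table and column names below are assumptions made for the sketch, with sqlite standing in for the source DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer_id INTEGER, amount REAL, updated_at TEXT);
    CREATE TABLE customer (id INTEGER, name TEXT);
    INSERT INTO orders VALUES (1, 1, 10.0, '2017-03-01');
    INSERT INTO customer VALUES (1, 'Alice');
""")

last_load = '2017-01-01'  # hypothetical watermark

# Custom query: join orders to customers, filtered to changes since the last load
rows = conn.execute("""
    SELECT o.id, c.name, o.amount
    FROM orders o
    JOIN customer c ON c.id = o.customer_id
    WHERE o.updated_at > ?
""", (last_load,)).fetchall()
print(rows)  # changed orders arrive already joined with customer data
```

The target then receives one denormalized stream instead of separate extracts of each source table.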

4 - Types of Integration

  • data-based
  • event-based
  • services-based

5 - Which CDC strategy to choose?

The CDC mechanism selected must be absolutely foolproof: all changed data must be identified. Finding the most comprehensive strategy can be elusive; updates to source system tables often occur outside the application itself. A mistake here results in inconsistent data that can't be easily explained, and it often takes significant data reconciliation to identify the culprit. Problems can be very costly in terms of rework, not to mention embarrassing. In short, capturing data changes is far from a trivial task, and you must clearly understand the source data systems. This knowledge will help the ETL team evaluate data sources, identify change data capture problems, and determine the most appropriate strategy.

6 - CDC vs Full reload

Compared with change data capture, a full table reload has several drawbacks:

  • Long running time
  • More storage (a flip mechanism is needed: load into a staging copy, then swap it in)
  • A failure in the middle of the workflow leaves the data inconsistent; the mechanism is not resilient
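The flip mechanism mentioned above can be sketched as a staging-table swap: the full extract is rebuilt in a staging copy while readers keep using the current table, and the names are then swapped. The table names are assumptions for the example, and sqlite stands in for the target DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Current live table, still served to readers during the reload
conn.executescript("""
    CREATE TABLE sales (id INTEGER, amount REAL);
    INSERT INTO sales VALUES (1, 5.0);
""")

# Full reload into a staging copy (fresh full extract, invented data)
conn.executescript("""
    CREATE TABLE sales_staging (id INTEGER, amount REAL);
    INSERT INTO sales_staging VALUES (1, 5.0), (2, 7.5);
""")

# The flip: swap the staging copy in, then drop the old table
conn.execute("ALTER TABLE sales RENAME TO sales_old")
conn.execute("ALTER TABLE sales_staging RENAME TO sales")
conn.execute("DROP TABLE sales_old")

count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(count)  # readers now see the freshly reloaded table
```

The cost is visible here: both copies exist simultaneously, which is why a full reload needs roughly double the storage of the live table.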

7 - Documentation / Reference

data/processing/stream/cdc.txt · Last modified: 2017/10/18 16:33 by gerardnico