Data property - Data Latency

1 - About

Data latency is a property of eventual consistency: it describes how quickly two sets of data become consistent.

In a DW/BI environment, data latency describes how quickly source system data must be delivered to the business users via the DW/BI system.

Be careful: asking business users if they want “real-time” delivery of data is an invitation for trouble. Most business users will request lower-latency data regardless of whether they understand the impact of the request. We recommend dividing the real-time challenge into three categories:

  • Daily: the data visible on the screen is valid as of a batch file download or reconciliation from the source system at the end of the previous working day.
  • Frequently: the data visible to the end user is updated many times per day but is not guaranteed to be the absolute current data as of this instant.
  • Instantaneous: the data visible on the end user's screen represents the true state of the source transaction system at every instant. When the source system status changes, the online screen must also respond instantly.
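The difference between the daily and frequent categories can be made concrete with a little arithmetic. The sketch below assumes a simple periodic refresh in which a change made just after an extract starts is only picked up by the next extract, so the worst-case staleness is the refresh interval plus the load duration. The function name and the example durations are illustrative assumptions, not measurements.

```python
from datetime import timedelta

def worst_case_latency(refresh_interval: timedelta, load_duration: timedelta) -> timedelta:
    """Upper bound on how stale the visible data can be under a periodic refresh.

    A change that lands just after an extract starts waits one full interval
    for the next extract, then becomes visible once that load has finished.
    """
    return refresh_interval + load_duration

# Assumed figures: a nightly batch that takes 2 hours to load,
# and a 15-minute micro-batch that takes 3 minutes to load.
daily = worst_case_latency(timedelta(hours=24), timedelta(hours=2))
frequent = worst_case_latency(timedelta(minutes=15), timedelta(minutes=3))

print(daily)     # 1 day, 2:00:00
print(frequent)  # 0:18:00
```

Under this model, shortening the interval only helps while the load duration stays well below it, which is one reason instantaneous delivery cannot be reached by simply scheduling batches more often.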


3 - Real Time

When you want to get the data in near-real-time,

  • you can't use a bulk copy process (like Oracle SQL*Loader, SQL Server bcp, …)
  • you need
    • a queueing process, a message broker,
    • or log-based replication, CDC, log mining (like Oracle GoldenGate)
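As a rough illustration of the queue-based approach, the sketch below uses Python's standard `queue` and `threading` modules as a stand-in for a message broker or CDC feed. The event format `(operation, key, row)` and the `replica` dictionary are hypothetical; a real broker or log-replication tool has its own message format and delivery guarantees.

```python
import queue
import threading

# Hypothetical change events, as a CDC tool or message broker might deliver them.
changes = queue.Queue()
replica = {}  # stand-in for the target table, keyed by primary key

def apply_changes():
    """Apply each change as soon as it arrives instead of waiting for a bulk load."""
    while True:
        event = changes.get()
        if event is None:  # sentinel: stop the consumer
            break
        op, key, row = event
        if op == "delete":
            replica.pop(key, None)
        else:  # "insert" and "update" are both treated as an upsert
            replica[key] = row

consumer = threading.Thread(target=apply_changes)
consumer.start()

# The source system publishes changes one at a time, in commit order.
changes.put(("insert", 1, {"status": "new"}))
changes.put(("update", 1, {"status": "shipped"}))
changes.put(("insert", 2, {"status": "new"}))
changes.put(("delete", 2, None))
changes.put(None)
consumer.join()

print(replica)  # {1: {'status': 'shipped'}}
```

The point of the design is that each change travels on its own, so the latency of the replica is bounded by the queue delay rather than by a load schedule.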

3.1 - Impact

Data latency has a huge effect on the cost and complexity of your ETL environment.

Clever processing algorithms, parallelization and potent hardware can speed up traditional batch-oriented data flows. But at some point, if the data latency requirement is urgent, the ETL system architecture must step up from batch mode to streaming orientation. This isn't a gradual or evolutionary change; it's a major paradigm shift in which almost every step of the data delivery pipeline must be reimplemented.
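To give a feel for why the step is a reimplementation rather than a tuning exercise, the sketch below computes the same customer totals twice: once batch-style, by rescanning a full extract, and once streaming-style, by keeping state and updating it per event. The names `batch_totals` and `StreamingTotals` and the sample orders are illustrative assumptions.

```python
def batch_totals(all_orders):
    """Batch style: rescan the full extract and rebuild every total."""
    totals = {}
    for customer, amount in all_orders:
        totals[customer] = totals.get(customer, 0) + amount
    return totals

class StreamingTotals:
    """Streaming style: keep state between events and update it one event at a time."""

    def __init__(self):
        self.totals = {}

    def on_order(self, customer, amount):
        self.totals[customer] = self.totals.get(customer, 0) + amount

orders = [("acme", 100), ("globex", 40), ("acme", 25)]

stream = StreamingTotals()
for customer, amount in orders:
    stream.on_order(customer, amount)  # totals are current after every event

print(batch_totals(orders))                   # {'acme': 125, 'globex': 40}
print(stream.totals == batch_totals(orders))  # True
```

Both versions give the same answer, but the streaming one has to own long-lived state, which is where concerns such as recovery, ordering and late-arriving data enter the design.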

ETL streams for most organizations typically require a data latency that matches the natural rhythm of the business. We find that most organizations require daily updates for most ETL streams and weekly or monthly updates for other ETL streams. However, in some circumstances, more frequent updates or even real-time updates suit the rhythm of the business. The key is to recognize that only a handful of business processes within any organization are appropriate for real-time updating. There's no compelling reason to convert all ETL processing to real time. The rhythm of most business processes simply doesn't demand that treatment.

3.2 - Real-time

Data warehousing exists because it was impossible, or at least impractical, to answer analytic questions with transactional data structures.

Real-time data warehousing is usually an attempt to do the opposite: answer transactional questions with analytic data structures.

The issue is how to properly cleanse the data, integrate it, align it across the enterprise, and keep track of changing attributes, all on a real-time basis. In most cases, you have to create separate, parallel structures that support real-time loading and querying, alongside the cleansed, aligned, detailed history that provides the analytic context.
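One possible shape for such parallel structures is sketched below with Python's built-in `sqlite3` module: a cleansed history table, a lightly processed real-time table for the current day, a view that reads both, and a nightly step that cleanses and merges. The table names, columns and the cleansing rule (trim and upper-case the customer name) are assumptions made for the example.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Cleansed, conformed history, loaded by the nightly batch (hypothetical schema).
    CREATE TABLE sales_history  (sale_id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    -- Raw rows for the current day, loaded continuously with minimal processing.
    CREATE TABLE sales_realtime (sale_id INTEGER PRIMARY KEY, customer TEXT, amount REAL);
    -- Queries that need up-to-the-minute data read both structures through one view.
    CREATE VIEW sales_all AS
        SELECT sale_id, customer, amount, 'history'  AS source FROM sales_history
        UNION ALL
        SELECT sale_id, customer, amount, 'realtime' AS source FROM sales_realtime;
""")
con.executemany("INSERT INTO sales_history VALUES (?, ?, ?)",
                [(1, "ACME", 100.0), (2, "GLOBEX", 40.0)])
# A real-time row arrives uncleansed: wrong case, trailing blank.
con.execute("INSERT INTO sales_realtime VALUES (?, ?, ?)", (3, "acme ", 25.0))

def nightly_merge(con):
    """Cleanse the day's real-time rows, move them into history, empty the real-time table."""
    con.execute("""INSERT INTO sales_history
                   SELECT sale_id, UPPER(TRIM(customer)), amount FROM sales_realtime""")
    con.execute("DELETE FROM sales_realtime")

total_before = con.execute("SELECT SUM(amount) FROM sales_all").fetchone()[0]
nightly_merge(con)
total_after = con.execute("SELECT SUM(amount) FROM sales_all").fetchone()[0]
acme = con.execute(
    "SELECT SUM(amount) FROM sales_history WHERE customer = 'ACME'").fetchone()[0]

print(total_before, total_after, acme)  # 165.0 165.0 125.0
```

The overall total is available immediately through the view, while the per-customer figure is only reliable after the cleansing step has aligned the real-time row with the history.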

When you deal with real-time information, you are dealing with individual transactions rather than batches of transactions. There is no time to stop and smell the roses (as it were): to run multiple aggregations, data quality checks, cleansing routines, or potentially even some lookups. The requirements for moving information change, and traditional batch techniques often do not apply to real-time information feeds.


data/property/latency.txt · Last modified: 2017/09/17 18:31 by gerardnico