Stream vs Batch

> (Data|State) Management and Processing > (Data Processing|Data Integration)

1 - About

Stream Processing vs Batch Processing

  • Do it once at night vs. do it every time for a query

Many organizations are building a hybrid model by combining the two approaches, and maintain a real-time layer and a batch layer.

  1. Data is first processed by a streaming data platform to extract real-time insights, and then persisted into a store
  2. Data can be transformed and loaded for a variety of batch processing use cases.

See Data Processing - Lambda Architecture (batch and stream processing)

3 - List

3.1 - Features

Feature Stream Batch Description
Scheduler Mode Publish / Subscriber Flow In a batch mode, a workflow scheduler runs the processes whereas in a stream mode it's a publisher / scheduler one
CPU Few Intensive The CPU is busy moving around old data most of the time.
IO Few A lot In a batch architecture, the CPU is busy moving a lot of data whereas on a stream platform only the delta is taken into account
Performance Quick Slow Stream means less data motion
System Properties
Coupling / No dependency Few A lot change is difficult, due to the shared data that introduce strong dependency. Dependency are external in the workflow vs dependency are internal in the process
Inconsistency Only latency Yes If an ETL step breaks, the whole data set ends being inconsistent. That's why caches are always fact based and not table based. Ex: full dimension loading must be succesvol before a fact loading
Fault Tolerance Yes No In an batch architecture, the workflow is dependent and fail when one of this step is failing. A stream orchestration can be resolved and restart over the day
Failure Management
Restart Yes No In case of failure, the restart point is not well known whereas in a stream model, each process can be restarted (It must still be Idempotence)
Resources Management
CPU Load Incremental Not predicable In an batch architecture, it's really difficult to forecast the future load as one big batch is enough to got the CPU high. Stream makes capacity planning (hardware prediction) easier. The weekly or monthly pick due to batch process (as financial invoices) are easily spotted.
Process must be prioritized Yes No When the load becomes to high, choice must be made on a batch platform whereas on a streaming platform, you will just adjust cap the number of running process
Scope Rolling time window
one record
dataset Queries or processing over data
Size Few Large Individual records or micro batches consisting of a few records vs Large batches of data.
Analytics Simple Complex Stream uses simple response functions, aggregates, and rolling metrics

3.2 - Access Pattern

A database and a message queue, both store data but have different data structure and therefore different access pattern which means:

  • different performance characteristics.
  • different implementations

3.3 - Typical Batch Orchestration CPU metrics

3.4 - ETL Tool and Real time

ETL tools were built to focus on connecting databases and the data warehouse in a batch fashion.

Early take on real-time ETL = Enterprise Application Integration (EAI)

EAI is a different class of data integration technology for connecting applications in real-time.

EAI employed Enterprise Service Buses and MQs weren’t scalable (Push method and index management done on server side)


4 - Documentation / Reference

data/processing/stream_vs_batch.txt · Last modified: 2019/04/12 16:42 by gerardnico