Data Processing - Data Flow (ETL | Workflow | Pipeline)

1 - About

A data flow is a workflow specialized for data processing.

Any system in which data moves between code units and triggers the execution of that code can be called a dataflow system.

This page is not about Dataflow_architecture, which is a computer architecture.

A data flow engine typically provides the following features:

  • retries
  • timeouts
  • parallel execution of operations
  • fault-tolerant execution

There is no program counter to keep track of what should be executed next: the arrival of data triggers the execution of the code. There is no need to worry about locks because the data is local and can only be accessed by the code it was sent to.

3 - Characteristics

A data flow program is a directed graph where:

  • nodes represent operators (also known as blocks, processes, actions or actors)
  • arcs show the data dependencies among operators
  • operands (data) are carried along these arcs
  • the arrival of data causes a node to activate (event based)
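The graph described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the operator names (add, double), the arc tuple layout and the run function are hypothetical and do not come from any particular engine.

```python
from collections import defaultdict

# Hypothetical two-operator graph: the result of "add" feeds "double".
operators = {
    "add": lambda a, b: a + b,
    "double": lambda x: x * 2,
}
arity = {"add": 2, "double": 1}   # number of input ports per node
# Arcs: (source node, destination node, destination input port).
arcs = [("add", "double", 0)]


def run(initial_tokens):
    """Move tokens along arcs; a node fires once all its input ports hold a value."""
    inbox = defaultdict(dict)        # node -> {port: value}
    results = {}
    pending = list(initial_tokens)   # tokens in flight: (node, port, value)
    while pending:
        node, port, value = pending.pop(0)
        inbox[node][port] = value
        if len(inbox[node]) == arity[node]:           # all operands have arrived
            args = [inbox[node][p] for p in range(arity[node])]
            out = operators[node](*args)
            results[node] = out
            inbox[node].clear()                       # operands are consumed
            for src, dst, dst_port in arcs:           # carry the result along the arcs
                if src == node:
                    pending.append((dst, dst_port, out))
    return results


results = run([("add", 0, 3), ("add", 1, 4)])
print(results)  # {'add': 7, 'double': 14}
```

Nothing in this sketch calls "double" explicitly: it runs only because a token arrives on its input port.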

The flow of data is explicit, often visually illustrated as a line or pipe.

3.1 - Actor

Applied to a data flow engine, the Actor model can be seen as follows:

  • an actor is a node (processing node)
  • the messages passed between actors are equivalent to the connections between nodes (communication channels)
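A minimal sketch of this mapping, assuming each actor is a thread with a private mailbox (the Actor class, the mailbox attribute and the None stop sentinel are illustrative choices, not part of any specific actor library):

```python
import queue
import threading


class Actor(threading.Thread):
    """A processing node: it reads messages from its mailbox and forwards results."""

    def __init__(self, func, downstream=None):
        super().__init__()
        self.mailbox = queue.Queue()   # incoming communication channel
        self.func = func
        self.downstream = downstream   # next node in the flow, if any
        self.outputs = []              # results kept by the last node

    def run(self):
        while True:
            message = self.mailbox.get()
            if message is None:        # hypothetical stop sentinel
                if self.downstream is not None:
                    self.downstream.mailbox.put(None)
                break
            result = self.func(message)
            if self.downstream is not None:
                self.downstream.mailbox.put(result)
            else:
                self.outputs.append(result)


sink = Actor(lambda x: x + 1)
source = Actor(lambda x: x * 10, downstream=sink)
sink.start()
source.start()
for value in [1, 2, 3]:
    source.mailbox.put(value)
source.mailbox.put(None)
source.join()
sink.join()
print(sink.outputs)  # [11, 21, 31]
```

Each actor only touches the messages in its own mailbox, so no lock is needed around the data itself.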

3.2 - Data-driven

In a data-driven execution model:

  • the availability of data drives the computation
  • data dependencies build the flow
  • an operator is enabled when all its input operands have arrived on its incoming arcs. It executes by consuming these values and produces results that are sent along its output arcs to other operators.
  • there is no need for a central unit to decide when an operation should be executed
  • the result of a computation must know the address of the subsequent computations that use it
  • the execution is sequenced automatically by the availability of intermediate results
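The points above can be illustrated with a small sketch in which each operator stores the addresses of its consumers and pushes its result to them directly, without a central scheduler. The Operator class and its connect and receive methods are hypothetical names chosen for this example.

```python
_EMPTY = object()  # marks an input slot that has not received its operand yet


class Operator:
    def __init__(self, func, n_inputs):
        self.func = func
        self.slots = [_EMPTY] * n_inputs
        self.destinations = []   # addresses of consumers: (operator, input port)
        self.fired = []          # results produced so far, kept for inspection

    def connect(self, target, port):
        self.destinations.append((target, port))

    def receive(self, port, value):
        self.slots[port] = value
        if all(slot is not _EMPTY for slot in self.slots):    # operator is enabled
            args = self.slots
            self.slots = [_EMPTY] * len(args)                 # operands are consumed
            result = self.func(*args)
            self.fired.append(result)
            for target, target_port in self.destinations:     # send to known addresses
                target.receive(target_port, result)


# (a + b) * (a - b) with a = 5 and b = 3
add = Operator(lambda a, b: a + b, 2)
sub = Operator(lambda a, b: a - b, 2)
mul = Operator(lambda x, y: x * y, 2)
add.connect(mul, 0)
sub.connect(mul, 1)

add.receive(0, 5)
sub.receive(0, 5)
add.receive(1, 3)   # add fires and sends 8 to mul
sub.receive(1, 3)   # sub fires and sends 2 to mul, which then fires
print(mul.fired)    # [16]
```

The order in which mul runs is never stated in the program; it follows from the moment its two operands become available.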

3.3 - Parallel

At the lowest level, dataflow is both a programming style and a way to manage parallelism.

Because an operation can run as soon as all of its inputs become valid, dataflow engines are inherently parallel and can work well in large, decentralized systems.

Since the operations are only concerned with the availability of their data inputs, they have no hidden state to track, and all the operations whose inputs are available are “ready” at the same time.
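As a sketch of this idea, the following example submits every task whose inputs are available to a thread pool, so independent tasks may run concurrently. The task table and the run_parallel function are hypothetical, and the graph is assumed to be acyclic.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

# Hypothetical task table: name -> (names of the input tasks, function).
tasks = {
    "load_a": ((), lambda: 2),
    "load_b": ((), lambda: 3),
    "sum": (("load_a", "load_b"), lambda a, b: a + b),
}


def run_parallel(tasks, max_workers=4):
    done = {}      # task name -> result
    running = {}   # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while len(done) < len(tasks):
            # Submit every task whose inputs are all available.
            for name, (deps, func) in tasks.items():
                ready = all(dep in done for dep in deps)
                if ready and name not in done and name not in running.values():
                    future = pool.submit(func, *(done[dep] for dep in deps))
                    running[future] = name
            if not running:
                raise RuntimeError("no task is ready; the graph may contain a cycle")
            # Block until at least one running task has produced its result.
            finished, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in finished:
                done[running.pop(future)] = future.result()
    return done


results = run_parallel(tasks)
print(results["sum"])  # 5
```

Here load_a and load_b have no inputs, so both are submitted in the first pass, while sum waits for their results.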

3.4 - Loop

Loop: to guarantee that a program executes correctly, it is essential that tokens from different iterations do not overtake one another.

Two implementations guarantee the correct execution of loops:

  • Static interpretation: a feedback signal inhibits the execution of an operator until all its input arcs have no more tokens.
  • Loop unraveling: a separate copy of the graph is created for each iteration of the loop, and the tokens of each iteration are passed to a separate instance.
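One way to picture the separate instance per iteration is to tag every token with its iteration number and to keep one set of operand slots per tag. This is an assumption made for illustration, not a description of a specific engine; the match_tagged function and the token layout are hypothetical.

```python
from collections import defaultdict


def match_tagged(tokens, n_inputs=2):
    """Pair up operands per iteration; tokens are (iteration, port, value) tuples."""
    waiting = defaultdict(dict)   # iteration -> {port: value}, one instance per iteration
    fired = []
    for iteration, port, value in tokens:
        waiting[iteration][port] = value
        if len(waiting[iteration]) == n_inputs:
            operands = waiting.pop(iteration)
            # The operator of this sketch is an addition of its two operands.
            fired.append((iteration, operands[0] + operands[1]))
    return fired


# The tokens of iteration 1 arrive before iteration 0 is complete.
tokens = [(0, 0, 1), (1, 0, 10), (1, 1, 20), (0, 1, 2)]
fired = match_tagged(tokens)
print(fired)  # [(1, 30), (0, 3)]
```

Even though the tokens arrive interleaved, the operands of iteration 0 are never combined with those of iteration 1.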

3.5 - Engine

The basic task sequence of a data flow engine (feedback interpreter) is:

  • matching of the operand tokens destined for the same instruction
  • fetching of the enabled instructions (Enable: determines which nodes can fire)
  • execution of the instructions (Execute: executes the nodes)
  • routing of the tokens (communication and moves)

A dataflow engine might be implemented as a hash table where:

  • the keys are the inputs (operands / data)
  • the values are pointers to the instructions (operators)
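A minimal sketch of such a table, assuming that each instruction is named after the operand it produces (the instructions and consumers tables and the run function are hypothetical):

```python
from collections import defaultdict

# Hypothetical instructions: output operand -> (input operands, function).
instructions = {
    "c": (("a", "b"), lambda a, b: a + b),
    "d": (("c",), lambda c: c * 2),
}

# Hash table: operand name -> names of the instructions that consume it.
consumers = defaultdict(list)
for name, (inputs, _) in instructions.items():
    for operand in inputs:
        consumers[operand].append(name)


def run(tokens):
    store = {}                                   # operands that have arrived
    pending = list(tokens.items())
    while pending:
        operand, value = pending.pop(0)
        store[operand] = value                   # matching: keep the operand
        for name in consumers.get(operand, []):  # look up the interested instructions
            inputs, func = instructions[name]
            if all(i in store for i in inputs):  # enable: all the inputs are present
                result = func(*(store[i] for i in inputs))   # execute
                pending.append((name, result))   # route the result as a new token
    return store


store = run({"a": 1, "b": 2})
print(store)  # {'a': 1, 'b': 2, 'c': 3, 'd': 6}
```

The lookup by operand name replaces a scan over all instructions: only the instructions that consume the arriving operand are checked.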

When any operation completes, the program scans down the list of operations until it finds the first operation where all inputs are currently valid, and runs it. When that operation finishes, it will typically output data, thereby making another operation become valid.

For parallel operation, only the list needs to be shared; it is the state of the entire program. Thus the task of maintaining state is removed from the programmer and given to the language's runtime.
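The scan described above can be sketched as a loop over a list of operations, where the shared state is the list itself plus the values produced so far. The tuple layout and the run_engine function are assumptions made for this example.

```python
# Hypothetical operations: (output name, input names, function).
operations = [
    ("e", ("c", "d"), lambda c, d: c * d),
    ("c", ("a", "b"), lambda a, b: a + b),
    ("d", ("a",), lambda a: a * 10),
]


def run_engine(operations, initial_values):
    values = dict(initial_values)   # data that is currently valid
    remaining = list(operations)    # the list of operations left to run
    order = []                      # execution order, kept for inspection
    while remaining:
        # Scan down the list until the first operation with all inputs valid.
        for operation in remaining:
            output, inputs, func = operation
            if all(name in values for name in inputs):
                values[output] = func(*(values[name] for name in inputs))
                order.append(output)
                remaining.remove(operation)
                break
        else:
            raise RuntimeError("no operation has all of its inputs available")
    return values, order


values, order = run_engine(operations, {"a": 2, "b": 3})
print(order)        # ['c', 'd', 'e']
print(values["e"])  # 100
```

The operation e is listed first but runs last, because its inputs only become valid once c and d have completed.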

4 - Library / Tool

5 - Visualization

Representing conditions or iterations as a set of nodes can easily result in a complex graph that is nontrivial to understand. Interpreting the visual representation can end up being harder than reading the textual source code.

6 - Documentation / Reference