Data Processing - Data Flow (ETL | Workflow | Pipeline)

About

A data flow is a workflow specialized for data processing.

Any system where data moves between code units and triggers the execution of that code can be called a dataflow system.

This page is not about wiki/Dataflow_architecture, which is a computer architecture.

A data flow engine has the following features:

  • retries
  • timeouts
  • parallel execution of operations
  • fault-tolerant execution

There is no program counter to keep track of what should be executed next; the arrival of data triggers the code to execute. There is no need to worry about locks because the data is local and can only be accessed by the code it was sent to.

Characteristics

A data flow program is a directed graph where:

  • nodes represent operators (also known as blocks, processes, actions, or actors)
  • arcs show the data dependencies among operators
  • operands (data) are carried along these arcs
  • the arrival of data causes a node to activate (event based)

The flow of data is explicit, often visually illustrated as a line or pipe.
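
As a minimal illustration, a dataflow program can be modeled as nodes holding an operator and their outgoing arcs, where the arrival of an operand activates a node. This is a sketch in Python with hypothetical names, not the API of any particular engine:

  # A node is an operator plus its outgoing arcs (data dependencies).
  class Node:
      def __init__(self, name, op):
          self.name = name
          self.op = op            # the operator: a plain callable
          self.outputs = []       # arcs to downstream nodes

      def connect(self, downstream):
          self.outputs.append(downstream)   # add a directed arc

  def fire(node, operand):
      # The arrival of data causes the node to activate (event based).
      result = node.op(operand)
      for downstream in node.outputs:
          fire(downstream, result)          # results travel along the arcs

  # Build the graph: parse -> transform -> load
  parse = Node("parse", lambda s: s.split(","))
  transform = Node("transform", lambda xs: [x.strip().upper() for x in xs])
  load = Node("load", print)
  parse.connect(transform)
  transform.connect(load)

  fire(parse, " a, b , c ")                 # prints ['A', 'B', 'C']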

Actor

An Actor model applied to a data flow engine can be seen as follows (a sketch follows the list):

  • an actor is a node (a processing node)
  • and the messages passed are equivalent to the connections between nodes (communication channels)
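
A minimal sketch of this actor view, assuming plain Python threads and queues rather than any actor framework:

  import threading
  import queue

  def actor(inbox, outbox, op):
      # A processing node: consume messages from the incoming channel,
      # apply the operator, send the result on the outgoing channel.
      while True:
          msg = inbox.get()
          if msg is None:          # sentinel: shut the actor down
              outbox.put(None)
              return
          outbox.put(op(msg))

  source_to_double = queue.Queue()   # channel: source -> doubler
  double_to_sink = queue.Queue()     # channel: doubler -> sink

  threading.Thread(target=actor,
                   args=(source_to_double, double_to_sink, lambda x: x * 2)).start()

  for item in [1, 2, 3, None]:
      source_to_double.put(item)     # messages passed between actors

  while (msg := double_to_sink.get()) is not None:
      print(msg)                     # 2, 4, 6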

Data-driven

Data-driven means:

  • the availability of data drives the computation
  • data dependencies build the flow
  • An operator is enabled when all of its input operands have arrived on its incoming arcs. It executes by consuming these values and produces results that are sent along its output arcs to other operators (see the sketch after this list).
  • There is no need for a central unit to decide when an operation should be executed.
  • The results of computations must know the addresses of the subsequent computations that use them.
  • The execution is sequenced automatically by the availability of intermediate results.
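
The firing rule above can be sketched as follows (an illustrative structure, not a real engine):

  class Operator:
      def __init__(self, n_inputs, fn):
          self.slots = [None] * n_inputs     # one slot per incoming arc
          self.arrived = [False] * n_inputs
          self.fn = fn

      def receive(self, port, value):
          # Called when an operand arrives on an incoming arc.
          self.slots[port] = value
          self.arrived[port] = True
          if all(self.arrived):              # enabled: all operands present
              self.arrived = [False] * len(self.arrived)  # consume the tokens
              return self.fn(*self.slots)    # result goes to the output arcs
          return None                        # not yet enabled

  add = Operator(2, lambda a, b: a + b)
  print(add.receive(0, 40))   # None (only one operand has arrived)
  print(add.receive(1, 2))    # 42  (both arrived: the operator fires)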

Parallel

At the lowest level, dataflow is both a programming style and a way to manage parallelism.

As an operation runs as soon as all of its inputs become valid, dataflow engines are inherently parallel and can work well in large, decentralized systems.

Since the operations are only concerned with the availability of data inputs, they have no hidden state to track, and are all “ready” at the same time.
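
A sketch of why this matters for parallelism (illustrative names only): every operation whose inputs are available is ready at once, so ready operations can be handed to a pool and run concurrently:

  from concurrent.futures import ThreadPoolExecutor

  ops = [
      {"name": "clean_a", "inputs": {"a"},        "fn": lambda: "a'"},
      {"name": "clean_b", "inputs": {"b"},        "fn": lambda: "b'"},
      {"name": "join",    "inputs": {"a'", "b'"}, "fn": lambda: "joined"},
  ]

  available = {"a", "b"}        # operands that have arrived
  pending = list(ops)
  with ThreadPoolExecutor() as pool:
      while pending:
          # All operations whose inputs are valid are ready at the same time.
          ready = [op for op in pending if op["inputs"] <= available]
          for op in ready:
              pending.remove(op)
          # clean_a and clean_b are independent, so they run in parallel;
          # their results then make "join" ready.
          for result in pool.map(lambda op: op["fn"](), ready):
              available.add(result)
  print(available)              # {'a', 'b', "a'", "b'", 'joined'}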

Loop

Loop: to guarantee that a program executes correctly, it is essential that tokens from different iterations do not overtake one another.

Two implementations guarantee the correct execution of loops:

  • Static interpretation: a feedback signal inhibits the execution of an operator until all of its input arcs are free of tokens.
  • Loop unraveling: a separate copy of the graph is created for each iteration of the loop; the tokens of each iteration are passed to its own instance (see the sketch after this list).
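
One common way to realize the second approach is to tag each token with the iteration it belongs to, so operands are matched per tag and iterations cannot be confused. A minimal, illustrative sketch:

  from collections import defaultdict

  pending = defaultdict(dict)    # iteration tag -> {port: operand}

  def receive(tag, port, value, n_inputs, fn):
      # Match tokens per iteration; fire only when one iteration is complete.
      pending[tag][port] = value
      if len(pending[tag]) == n_inputs:
          operands = pending.pop(tag)   # consume this iteration's tokens
          return fn(*(operands[p] for p in sorted(operands)))
      return None

  add = lambda a, b: a + b
  # A token from iteration 1 arrives before iteration 0 is complete:
  print(receive(0, 0, 10, 2, add))   # None
  print(receive(1, 0, 20, 2, add))   # None (kept separate by its tag)
  print(receive(0, 1, 1, 2, add))    # 11 (iteration 0 fires)
  print(receive(1, 1, 2, 2, add))    # 22 (iteration 1 fires)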

Engine

Data Flow basic task sequence (feedback interpreter):

  • matching of operand tokens destined for the same instruction
  • fetching of enabled instructions (Enable: determines which nodes can fire)
  • instruction execution (Execute: executes the nodes)
  • routing of tokens (communication and moves)

A dataflow engine might be implemented as a hash table where:

  • the keys are the inputs (operands / data)
  • the values are pointers to the instructions (operators)

When any operation completes, the program scans down the list of operations until it finds the first operation where all inputs are currently valid, and runs it. When that operation finishes, it typically outputs data, thereby making another operation valid.

For parallel operation, only the list needs to be shared; it is the state of the entire program. Thus the task of maintaining state is removed from the programmer and given to the language's runtime.
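
A minimal sketch of this scan-based engine, assuming a hypothetical program format of (output_name, input_names, fn) triples: the shared table maps currently valid operand names to their values, and the engine rescans after every execution because a new output may enable another operation:

  def run(program, values):
      # program: list of (output_name, input_names, fn)
      # values: the shared table of operands that are currently valid
      remaining = list(program)
      while remaining:
          for op in remaining:
              out, inputs, fn = op
              if all(name in values for name in inputs):   # all inputs valid?
                  values[out] = fn(*(values[n] for n in inputs))  # run it
                  remaining.remove(op)
                  break   # rescan: the new output may enable another op
          else:
              raise RuntimeError("no operation is enabled: the flow is stuck")
      return values

  program = [
      ("total",   ("a", "b"), lambda a, b: a + b),
      ("doubled", ("total",), lambda t: t * 2),
  ]
  print(run(program, {"a": 1, "b": 2}))
  # {'a': 1, 'b': 2, 'total': 3, 'doubled': 6}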

Visualization

Representing conditions or iterations as a set of nodes can easily result in a complex graph that is nontrivial to understand. The complexity of interpreting a visual representation can end up being higher than that of reading textual source code.
