Stream - (Stream|Event) Processing
Table of Contents
1 - About
Something happened (Event), subscribe to it (Streams)
Stream processing = transformations on stream data
Streaming Processing is also known as Incremental Processing.
It has a low memory and processing overhead.
This performance comes at a cost
- All content to read/write has to be processed in exact same order as input comes in (or output is to go out).
- No random access. For this, you need to build an auxiliary structure as a map or a tree.
As a result, Streaming is most commonly used to process a lot of data.
2 - Articles Related
3 - Problems
A messaging system is a fairly low-level piece of infrastructure—it stores messages and waits for consumers to consume them. When you start writing code that produces or consumes messages, you quickly find that there are a lot of tricky problems that have to be solved in the processing layer.
Consider the counting example, above (count page views and update a dashboard).
- What happens when the machine that your consumer is running on fails, and your current counter values are lost? How do you recover?
- Where should the processor be run when it restarts? What if the underlying messaging system sends you the same message twice, or loses a message? (Unless you are careful, your counts will be incorrect.)
- What if you want to count page views grouped by the page URL? How do you distribute the computation across multiple machines if it’s too much for a single machine to handle?
Stream processing is a higher level of abstraction on top of messaging systems, and it’s meant to address precisely this category of problems.
See Samza is a stream processing framework that aims to help with these problems.
4 - Implementation
- Multiple input processor
- Category variable
Processor selection through:
- Category variable
- Data Type
- Flow Content + Json Object for extra attributes (Ex: ResultSet + TestId)
Loop through an explicit processor (ie if and then go to step N) ?
- Lazy instantiation: The operations are performed with an execute() call. Whether the program is executed locally or on a cluster depends on the type of execution environment. The lazy evaluation lets construct execution plan that may differ from the lexical (code) definition.
It is our firm belief that in the future, the majority of data applications, including real-time analytics, continuous analytics, and historical data processing, will treat data as what it really is: unbounded streams of events. Kostas Tzoumas (data Artisans raises a Series A Posted on March 30, 2016)
5 - Software
- Apache Samza: LinkedIn stream processing framework that provides powerful, reliable tools for working with data in Kafka. (LinkedIn created Apache Kafka to be the data exchange backbone of its organisation.) See StreamTask
- Apache Storm team (Yahoo!)
- Apache Spark (Micro batch)
- Apache Flink
- Java Stream (Map reduce implementation, only one source then)
int sum = widgets.stream() .filter(b -> b.getColor() == RED) .mapToInt(b -> b.getWeight()) .sum();