Designing stream processing applications

A list of key points to consider when designing stream processing applications

In recent years, stream processing has gained a lot of traction.

Many organizations have adopted the lambda architecture. Even though there has been some criticism that the lambda architecture is not necessary and that streaming applications alone are sufficient (https://www.oreilly.com/radar/questioning-the-lambda-architecture/), in my opinion the lambda architecture has made stream processing much more popular, mainly because people could implement something alongside their existing batch systems.

We now have tools that enable us to write streaming applications without too much effort, but they usually come with big assumptions that are not always true, especially when it comes to processing big data.

In this article I want to cover the most important aspects of designing stream processing applications. Applications that process streaming data are usually quite complex if you want to make them reliable.

I am going to divide this topic into seven sections:

  • Ordering
  • Partitioning
  • Delivery guarantees
  • Throughput and latency
  • Handling failures and idempotency
  • Window operations
  • Joins

Ordering

It is enormously important to establish strong guarantees about the order of the data that your application will ingest. Ideally, every message consumed from a stream carries some kind of number that indicates a strict order; the timestamp of the event is usually used for this purpose. Without any guarantees about the order of the data it is usually impossible to make a stream processing application idempotent.
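To make this a bit more concrete, here is a minimal sketch in plain Python (no particular framework assumed, and the "event_time" field is just an illustrative name) of a consumer that flags messages whose event timestamps go backwards:

    def check_ordering(messages):
        # `messages` is assumed to be an iterable of dicts with an "event_time" field.
        last_seen = None
        for msg in messages:
            ts = msg["event_time"]
            if last_seen is not None and ts < last_seen:
                # Out-of-order message: without an upstream ordering guarantee we can
                # only flag it here (or buffer and re-sort within a bounded window).
                yield ("out_of_order", msg)
            else:
                last_seen = ts
                yield ("ok", msg)

In practice you would decide up front whether out-of-order messages are dropped, buffered and re-sorted, or routed to a separate stream for later reconciliation.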

Partitioning

Partitioning is literally the key to scaling any streaming application. If your application relies on the exact order in which your data arrives, it is usually impossible to scale it efficiently through partitions, as you would need to rejoin all of the partitions later on in your application, and even that is only possible if you can guarantee that the stream has some kind of unique, monotonic identifier. Once you have identified the sub-streams within your stream that can be processed with the same level of guarantees, you will need to select a key from the message that will allow you to partition the stream. However, if you do not care about the order of the data, you can partition it any way you like to increase the parallelism of your application.
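A minimal sketch of key-based partitioning, assuming the message exposes a field (here a hypothetical user id) that you can key on:

    import hashlib

    def partition_for(key: str, num_partitions: int) -> int:
        # Stable hash of the chosen message key; messages with the same key
        # always land in the same partition, which preserves per-key ordering.
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % num_partitions

    # e.g. route all events of one user to the same partition
    partition = partition_for("user-42", num_partitions=12)

The important design choice is the key itself: it has to spread the load evenly across partitions while keeping together all of the messages whose relative order you care about.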

Delivery guarantees

You will need to establish what the delivery guarantees in your system are. Is your application meant to do exactly once processing, or is at least once or at most once processing good enough? Make sure that you choose the guarantee that covers the cases you need to handle. If you aim for idempotency you will need to go for exactly once delivery guarantees.
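One common way to approximate exactly once processing on top of an at least once stream is to deduplicate on a unique message identifier; a rough sketch, assuming each message carries a "message_id" field and ignoring the question of where the set of seen ids is persisted:

    processed_ids = set()   # in practice this would live in a durable store,
                            # not in process memory

    def handle(msg, process):
        # With at least once delivery the same message can arrive more than once;
        # skipping ids we have already seen makes the processing effectively
        # idempotent, which is what "exactly once" usually means in practice.
        if msg["message_id"] in processed_ids:
            return
        process(msg)                          # the actual business logic
        processed_ids.add(msg["message_id"])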

Throughput and latency

When designing a stream processing application you need to decide whether you are building it for low latency or for high throughput. Unfortunately, you can rarely achieve both, but you can design for certain latency guarantees. At one extreme you can aim for the lowest possible latency with systems based on single UDP multicast packets; at the other you can send huge batches of data over TCP connections. You will also need to think about how you would like to handle late data. Usually, if you want to meet certain latency guarantees, you will need to implement some kind of windowed operations with timeouts. Defining a strict SLA at the beginning is crucial, as all later operations within your application will need to adhere to it, especially operations that join multiple streams. If you aim for high throughput you will need to consider how you want to approach idempotency, as exactly once processing can reduce throughput significantly if the machines involved are physically far apart.
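As a small illustration of this trade-off, here is a sketch of micro-batching with a flush timeout: batching buys throughput, while the timeout puts an upper bound on how long any single message waits (the parameter values are arbitrary):

    import time

    def batch_with_timeout(source, max_batch=500, max_wait_s=0.2):
        # Group messages into batches for throughput, but flush whenever the
        # batch is full or the deadline has passed so latency stays bounded.
        batch, deadline = [], time.monotonic() + max_wait_s
        for msg in source:
            batch.append(msg)
            if len(batch) >= max_batch or time.monotonic() >= deadline:
                yield batch
                batch, deadline = [], time.monotonic() + max_wait_s
        if batch:
            yield batch

A real implementation would also flush on an idle timer rather than only when a new message arrives, but the principle is the same: the batch size buys throughput, the timeout protects the latency SLA.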

Handling failures and idempotency

If you want to make your application reliable you will need to prepare procedures for how you want to handle failures. Restarts are inevitable in streaming applications, so you will have to work out how to recover the state of your application. If you want to rely only on streams, you need to be able to reconstruct your state from your input and output streams alone. Being able to restore the previous state is essential for the application to be idempotent. This is usually done by looking up the last message in the output stream and replaying everything from the input stream that fits in the window, up to the point where that last message would be produced again; from there you can switch back to emitting new messages. If your application has very relaxed delivery guarantees or does not care about order, you can either replay all of the messages from the input stream, or simply resume reading from wherever you left off without worrying about recreating the state. Sometimes your application needs to restore its state quickly, but it is difficult or costly to recompute it from the stream. The way to handle that is to implement some kind of checkpointing within your application: on restart it reverts to the last checkpoint, which holds the state as of that time, and recomputes everything from there.
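A very simplified sketch of checkpointing, assuming the state is small enough to serialize as JSON and using a hypothetical local file (a real system would use durable, replicated storage):

    import json, os

    CHECKPOINT_FILE = "checkpoint.json"   # hypothetical path, for illustration only

    def save_checkpoint(offset, state):
        # Atomically persist the input offset together with the derived state,
        # so a restart can resume from here instead of replaying the whole stream.
        tmp = CHECKPOINT_FILE + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"offset": offset, "state": state}, f)
        os.replace(tmp, CHECKPOINT_FILE)

    def load_checkpoint():
        if not os.path.exists(CHECKPOINT_FILE):
            return 0, {}                  # no checkpoint yet: start from the beginning
        with open(CHECKPOINT_FILE) as f:
            data = json.load(f)
        return data["offset"], data["state"]

On restart the application loads the last checkpoint, seeks the input stream to the stored offset and recomputes everything from there, which is usually much cheaper than replaying the stream from the very beginning.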

Window operations

There are primarily four types of window operations:

  • tumbling
  • sliding
  • session
  • global (ever increasing window)

The most common of these is the sliding window. A good example of how window operations work is described in the Apache Flink documentation.
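For illustration, here is a sketch of how events are assigned to tumbling and sliding windows based on their event time; times are in milliseconds and the function names are mine, not taken from any framework:

    def tumbling_window(event_time_ms, size_ms):
        # Each event belongs to exactly one window: [start, start + size).
        start = (event_time_ms // size_ms) * size_ms
        return (start, start + size_ms)

    def sliding_windows(event_time_ms, size_ms, slide_ms):
        # Each event can belong to several overlapping windows.
        last_start = (event_time_ms // slide_ms) * slide_ms
        starts = range(last_start, event_time_ms - size_ms, -slide_ms)
        return [(s, s + size_ms) for s in starts]

For example, with a one-minute window sliding every 30 seconds, an event at t = 125 s falls into the windows starting at 90 s and 120 s.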

Joins

The problem with joins in stream processing applications is that you cannot give good guarantees about the completeness of the data if you need to meet the SLAs you agreed to when discussing the overall latency of the system. If you want your data to be complete when joining two streams, you have no choice but to wait indefinitely for the data to arrive. That is just the nature of streaming algorithms. However, when designing stream processing applications you usually care more about latency. Joins are usually done in memory, so you will need to make sure that you can cache all of the messages that can fall within the window. Some frameworks leverage RocksDB, an embedded key-value store that serves recent data from memory quickly while still being able to spill state to disk when it grows too large.
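A rough in-memory sketch of a windowed stream-to-stream join, assuming the two streams have already been interleaved into one iterable of ("left" | "right", message) tuples and that each message carries "key" and "event_time" fields:

    from collections import defaultdict

    def windowed_join(merged_stream, window_ms):
        # Buffer both sides by key and emit pairs whose event times are within
        # window_ms of each other. A real implementation would also evict
        # entries older than the window to keep memory bounded.
        buffers = {"left": defaultdict(list), "right": defaultdict(list)}
        for side, msg in merged_stream:
            other = "right" if side == "left" else "left"
            key, ts = msg["key"], msg["event_time"]
            buffers[side][key].append(msg)
            for candidate in buffers[other][key]:
                if abs(candidate["event_time"] - ts) <= window_ms:
                    yield (msg, candidate) if side == "left" else (candidate, msg)

The size of the window directly determines how much memory (or RocksDB state) the join needs, which is why agreeing on latency SLAs up front matters so much.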

Summary

As you can see, there is much more to stream processing than just defining a simple function for how to process a single message. Even though newer frameworks try to abstract away as much as possible, I think it is crucial to understand how reliable stream processing applications are built and on what assumptions they are based. Anyone designing a stream processing application should be aware of all the caveats that come with it. In particular, before starting to design an application, they should think about all of the sections above and decide what guarantees they want to design for. There are some things around reliability that I have not covered in this overview of stream processing applications, particularly monitoring, alerting, documentation, replicas and redundancy, as I consider them to be a part of any reliable application. I hope you have found this little summary helpful and that it has brought you a bit closer to understanding how one can design a reliable stream processing application.