Apache Flink is an open-source framework for stream processing, developed by the Apache Software Foundation. The core of Apache Flink is a distributed streaming dataflow engine written in Java and Scala. Flink executes dataflow programs in a data-parallel and pipelined manner. Flink's pipelined runtime system enables the execution of both bulk/batch and stream processing programs. Furthermore, Flink natively supports iterative algorithms.

Flink is a streaming engine with high throughput and low latency. It offers support for event-time processing and state management. Flink applications are fault-tolerant in the event of machine failure and support exactly-once semantics. Programs can be written in Java, Scala, Python, and SQL, and they are automatically compiled and optimized into dataflow programs that are executed in a cluster or cloud environment.

Flink does not have its own data-storage system, but it provides connectors to data sources and sinks such as Amazon Kinesis, Apache Kafka, HDFS, Apache Cassandra, and Elasticsearch.

Overview of features

Apache Flink is based on a dataflow programming model and offers processing of both finite and infinite datasets. At the most elementary level, Flink programs are composed of streams and transformations. Conceptually, a stream is a flow of data records, and a transformation is an operation that takes one or more streams as input and produces one or more streams as output.

Apache Flink provides two main APIs: the DataStream API for bounded and unbounded data streams, and the DataSet API for bounded datasets. Flink also offers the Table API, an SQL-like expression language for relational stream and batch processing that can be easily embedded in the DataStream and DataSet APIs.
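To make the DataStream API concrete, the following Java sketch builds a small word-count pipeline. The class and method names (`WordCount`, `count`) are illustrative, and the code assumes a Flink 1.12+ dependency on the classpath; it is a minimal sketch, not a production job.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCount {
    // Runs a small word-count dataflow and returns the emitted counts.
    public static List<Tuple2<String, Integer>> count(String line) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1); // keep the output order deterministic for this sketch

        DataStream<Tuple2<String, Integer>> counts = env
            .fromElements(line)
            // Split each line into (word, 1) pairs.
            .flatMap((String l, Collector<Tuple2<String, Integer>> out) -> {
                for (String word : l.split("\\s+")) {
                    out.collect(Tuple2.of(word, 1));
                }
            })
            .returns(Types.TUPLE(Types.STRING, Types.INT)) // type hint needed for lambdas
            // Group by the word and keep a running sum per key.
            .keyBy(t -> t.f0)
            .sum(1);

        List<Tuple2<String, Integer>> result = new ArrayList<>();
        counts.executeAndCollect().forEachRemaining(result::add);
        return result;
    }

    public static void main(String[] args) throws Exception {
        count("to be or not to be").forEach(System.out::println);
    }
}
```

Because `sum` is a running aggregate over an unbounded-style stream, the pipeline emits an updated count for every incoming record rather than a single final value per word.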

Programming model and distributed runtime

Upon execution, Flink programs are mapped to streaming dataflows. Every Flink dataflow begins with one or more sources and ends with one or more sinks. An arbitrary number of transformations can be performed on a stream. Together these streams form a directed acyclic graph, which allows a program to branch and merge dataflows.
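The source-transformations-sink shape described above can be sketched in a few lines of Java. The class name `DataflowShape` is illustrative, and a Flink dependency is assumed; here the "sink" simply collects the results back to the client.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class DataflowShape {
    // Source -> transformations -> sink: the shape every Flink dataflow takes.
    public static List<Long> run() throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1); // keep ordering deterministic for this sketch

        List<Long> out = new ArrayList<>();
        env.fromSequence(1, 10)          // source: a bounded stream of longs
           .filter(n -> n % 2 == 0)      // transformation: keep even numbers
           .map(n -> n * n)              // transformation: square them
           .returns(Types.LONG)          // type hint needed for the lambda
           .executeAndCollect()          // sink: collect back to the client
           .forEachRemaining(out::add);
        return out;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(run());
    }
}
```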

Flink has built-in connectors for reading from data sources and writing outputs to Apache Kafka, Amazon Kinesis, HDFS, Apache Cassandra, and more.
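As an illustration of connector wiring, the sketch below configures a Kafka source using Flink's `KafkaSource` builder (from the separate `flink-connector-kafka` artifact). The broker address, topic name, and consumer-group id are placeholder values, and nothing actually connects to Kafka until the job is executed.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaSourceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Describe where and how to read; the connection is opened only when the job runs.
        KafkaSource<String> source = KafkaSource.<String>builder()
            .setBootstrapServers("localhost:9092")   // placeholder broker address
            .setTopics("input-topic")                // placeholder topic
            .setGroupId("flink-demo")                // placeholder consumer group
            .setStartingOffsets(OffsetsInitializer.earliest())
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

        DataStream<String> stream =
            env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");
        stream.print(); // sink: write records to stdout

        env.execute("kafka-source-sketch");
    }
}
```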

State: checkpoints, savepoints and fault tolerance

Apache Flink includes a fault-tolerance mechanism based on distributed checkpoints. A checkpoint is an automatic, asynchronous snapshot of the application's state and of its position in the source stream. In the case of a failure, a Flink program resumes from the last completed checkpoint during recovery, which allows Flink to preserve exactly-once state semantics within the application. The checkpointing mechanism can also include external systems.

There is also an option to create savepoints, which are manually triggered checkpoints.
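Checkpointing is enabled per job on the execution environment. The sketch below shows a plausible configuration using Flink's public `CheckpointConfig` API; the interval and pause values are arbitrary choices for illustration, and a Flink dependency is assumed.

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigSketch {
    public static StreamExecutionEnvironment configure() {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take an asynchronous snapshot of all operator state every 10 seconds,
        // with exactly-once guarantees.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // Allow at most one checkpoint in flight at a time.
        env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);

        // Require at least 5 seconds between the end of one checkpoint
        // and the start of the next.
        env.getCheckpointConfig().setMinPauseBetweenCheckpoints(5_000);

        return env;
    }
}
```

Savepoints, by contrast, are triggered manually, for example with the CLI command `bin/flink savepoint <jobID>`.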

Advantages of using Apache Flink

  • Reduced complexity. Compared to other frameworks intended for distributed data processing, Flink reduces complexity. It achieves this through integrated query optimization, concepts inspired by database systems, and efficient in-memory and out-of-core algorithms.
  • Memory management. Explicit memory management enables better memory usage than in Apache Spark.
  • Speed. Iterative processing on the same node enables maximum processing speed. Further performance improvements can be achieved by processing only the relevant part of the data.
  • Minimal configuration.

Disadvantages of using Apache Flink

  • Sharing data in memory. Apache Spark is a better solution for sharing data in memory, which makes it an ideal choice for iterative, interactive, and event-based stream processing.
  • Community. Apache Flink does not have as large a community as its competitor, Apache Spark.