EDNA is an end-to-end streaming toolkit for ingesting, processing, and emitting streaming data. It is based on my prior work with LITMUS, ASSED, and EventMapper. I started it as a framework to help build streaming applications during my research into disaster management, fake news detection, landslides detection, and multiple-source integration. There are several existing mature streaming systems, like Flink and Kafka. In this case, I neede something I could tune at any point in the streaming workflow, such as the message passing interface, or the code for the stream joins, etc. So, EDNA's initial use was a test-bed for studying these topics, and over time, it extended to drift detection in streaming data and web applicatins for providing disaster information. It has grown to a toolkit for stream analytics.
More importantly, I have used it for the past 3 years to teach students about stream analytics in Georgia Tech's Introduction to Enterprise Computing, and Real-Time Embedded Systems classes. To that end, I have continued to develop it, since it provides an excellent primer for working in streaming analytics, kubernetes, docker, containerization, and a host of technical skills expecially important for CS students.
An alpha version is hosted at EDNA's Github repository.
I'll provide some high-level details about EDNA's overall architecture, some challenges faced in its creation, as well as student projects we have deployed and presented with EDNA.
The central abstraction in EDNA is the ingest-process-emit loop, implemented in an EDNA job, shown below. Each component of the loop is an abstract primitive in EDNA that is extended to create powerful operators. Ingest primitives serve to consume streaming records. Process primitives implement common streaming transformations such as map and filter. Multiple process primitives can be chained in the same job. Emit primitives generate an output stream that can be sent to a storage sink, such as a SQL table, or to another EDNA job.
An EDNA application consists of several jobs in a DAG, shown below. We apply the EDNA job abstraction to the application as well, where an EDNA application consists of: - At least 1 Ingest Job to ingest a stream from external sources, such as the Twitter Streaming API. The Source Job should not do any processing to reduce backpressure and ensure the highest throughput for consuming an external stream - Process Jobs that process the stream. Each Internal Job runs its own ingest-process-emit¬ loop to transform the stream - At least 1 Emit Job to emit a stream to external sinks, such as a SQL table, S3 bucket, or distributed file system such as Hadoop
I'll cover a few applications below.
First, though, let's cover the EDNA stack.
The EDNA stack consists of four layers: deployment, runtime, APIs, and plugins.
EDNA can be deployed on a local machine for single jobs or on clusters managed by orchestrators like Kubernetes for multiple jobs in a streaming application. On a cluster deployment, EDNA uses Apache Kafka, a durable message broker with built-in stream playback to connect jobs, and Redis to share information between jobs. The EDNA runtime manages executes jobs on the applied deployment. EDNA jobs use the ingest, process, and emit APIs to implement the ingest-process-emit loop, with the appropriate plugin for complete the job.
Here is an example of an EDNA application. This is for processing Covid Twitter data.
It has the following jobs:
EDNA has been used in student projects for several years. Examples of student projects include: