A brief intro to Dataflow
A fully managed cloud service and programming model for batch and streaming big data processing, from Google. It is a unified programming model (recently open sourced as Apache Beam) and a managed service for creating ETL, streaming and batch jobs. It's also seamlessly integrated with other Google Cloud services like Cloud Storage, Pub/Sub, Datastore, Bigtable and BigQuery. The combination of automatic resource management, autoscaling and the integration with the other Google Cloud services makes Dataflow a natural fit for our data processing needs.
So how do we use it?
Here at Unacast we receive large amounts of data, through both files and our APIs, that we need to filter, convert and pass on to storage in e.g. BigQuery. So being able to create both batch (files) and stream (APIs) based data pipelines using one DSL, running on our existing Google Cloud infrastructure, was a big win. As Ken wrote in his post on GCP, we try to use it every time we need to process a non-trivial amount of data or just need a continuously running worker. Ken also mentioned in that post that we found the Dataflow Java SDK less than ideal for defining data pipelines in code. The piping of transformations (pure functions) felt like something better represented in a proper functional language. We had a brief look at the Scala-based SCIO by Spotify (which has also been donated to Apache Beam, btw). It looks promising, but we felt that its DSL diverged too much from the "native" Java/Beam one.
Next on our list was Datasplash, a thin wrapper around the Java SDK with a Clojuresque approach to pipeline definitions, using concepts such as ->> (threading), map and filter mixed with regular Clojure functions. What's not to like? So we went with Datasplash and have really enjoyed using it in several of our data pipeline projects. Since the Datasplash source is quite extensible and relatively easy to get a grasp of, we have even contributed a few enhancements and bugfixes to the project.
And in the blue corner, Datasplash!
It's time to see how Datasplash performs in the ring, and to showcase that I've chosen to reimplement the StreamingWordExtract example from the Dataflow documentation. A Dataflow-off, so to speak.
The example pipeline reads lines of text from a Pub/Sub topic, splits each line into individual words, capitalizes those words, and writes the output to a BigQuery table.
Here's how the code looks in its entirety, and I'll talk about some of the highlights, specifically the pipeline composition, below.
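The original embedded code isn't reproduced here, but a minimal sketch of the whole pipeline, assuming Datasplash's `datasplash.api`, `datasplash.pubsub` and `datasplash.bq` namespaces (the namespace name, topic, table and schema are illustrative, not verbatim from the original), could look like this:

```clojure
(ns wordextract.core
  (:require [datasplash.api :as ds]
            [datasplash.pubsub :as ps]
            [datasplash.bq :as bq]
            [clojure.string :as str]))

(defn extract-words [line]
  ;; re-seq returns nil when there is no match, which mapcat handles fine
  (re-seq #"[a-zA-Z']+" line))

(defn ->row [word]
  {:string_field word})

(defn apply-transforms-to-pipeline [p]
  (->> p
       (ps/read-from-pubsub "projects/my-project/topics/lines")
       (ds/mapcat extract-words)
       (ds/map str/upper-case)
       (ds/map ->row)
       (bq/write-bq-table "my-project:my_dataset.word_extract"
                          {:schema [{:name "string_field" :type "STRING"}]})))

(defn -main [& args]
  (let [p (ds/make-pipeline args)]
    (apply-transforms-to-pipeline p)
    (ds/run-pipeline p)))
```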
First we have to create a pipeline instance, which can in theory be used several times to create parallel pipelines.
Then we apply the different transformation functions with the pipeline as the first argument.
Notice that the pipeline has to be run in a separate step, passing the pipeline instance as an argument. This isn't very functional, but it stems from the underlying Java SDK.
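That two-step shape can be sketched like this (a hypothetical main namespace, using `make-pipeline` and `run-pipeline` from `datasplash.api`):

```clojure
(defn -main [& args]
  ;; Building the pipeline and running it are two explicit steps,
  ;; mirroring the underlying Java SDK's Pipeline#run().
  (let [p (ds/make-pipeline args)]   ; parse CLI args into pipeline options
    (apply-transforms-to-pipeline p) ; attach the transformations
    (ds/run-pipeline p)))            ; only now does anything execute
```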
Inside apply-transforms-to-pipeline we utilize the thread-last macro (->>) to start by passing the pipeline as the last argument to the read-from-pubsub transformation. The threading macro will then pass the result of that transformation as the last argument of the next one, and so on and so forth.
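Concretely, ->> just rewrites the nesting for you; a simplified illustration (transform names assumed from the Datasplash API):

```clojure
;; What you write:
(->> p
     (ps/read-from-pubsub topic)
     (ds/mapcat extract-words))

;; What it (conceptually) expands to — the result of each step
;; becomes the last argument of the next:
(ds/mapcat extract-words
           (ps/read-from-pubsub topic p))
```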
Here we see the actual processing of the data. For each message from Pub/Sub we extract words (and flatten those lists with mapcat), uppercase each word and add it to a simple JSON row object. Notice the different ways we pass functions to apply a simple, pure function.
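Those styles of passing a function — a var you defined, a library function, and an inline anonymous fn — might look like this (helper names are hypothetical):

```clojure
(defn extract-words [line]
  (re-seq #"[a-zA-Z']+" line))

(->> messages
     (ds/mapcat extract-words)                  ; a var reference; mapcat flattens the word lists
     (ds/map clojure.string/upper-case)         ; a plain library function
     (ds/map (fn [word] {:string_field word}))) ; an anonymous fn building the row map
```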
Last, but not least, we write the results as separate rows to the given BigQuery table.
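The sink step could be sketched like this, assuming Datasplash's `datasplash.bq` namespace (the table name and schema are illustrative):

```clojure
(require '[datasplash.bq :as bq])

(bq/write-bq-table "my-project:my_dataset.word_extract"
                   {:schema [{:name "string_field" :type "STRING"}]}
                   rows) ; rows is the PCollection of row maps from the previous step
```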
And that's it, really!
Here’s a quick look at the graphical representation of the pipeline in the Dataflow UI.
To summarize, I'll say that building Dataflow pipelines using Datasplash has been a pleasant and exciting experience. I would like to emphasize a couple of things I think have turned out to be extra valuable.
- The code is mostly well-known Clojure constructs, and the Datasplash-specific code, like ds/map and ds/filter, tries to use the same semantics.
- Having a REPL at hand in the editor to test small snippets and functions is very underrated; I've found myself using it all the time.
- Setting up aliases to run different pipelines (locally and in the ☁️ ) with different arguments via Leiningen has also been really handy when testing a pipeline during development.
- The conciseness and overall feeling of "correctness" when working in an immutable, functional Lisp is also something that I've come to love even more now that I've tried it in a full-fledged project.
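The Leiningen aliases mentioned above could be set up like this in project.clj (the namespace, runner names and GCP values are hypothetical, for the Dataflow SDK 1.x era):

```clojure
;; project.clj (excerpt)
:aliases {"pipeline-local" ["run" "-m" "wordextract.core"
                            "--runner=DirectPipelineRunner"]
          "pipeline-cloud" ["run" "-m" "wordextract.core"
                            "--runner=DataflowPipelineRunner"
                            "--project=my-gcp-project"
                            "--stagingLocation=gs://my-bucket/staging"]}
```

Then `lein pipeline-local` runs the pipeline on your machine, while `lein pipeline-cloud` submits it to Dataflow.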