How do you make sense of all those terabytes of data stuck in your BigQuery database?

Here’s what we tested

We here at Unacast sit on loads of data in several BigQuery databases and have tried several ways of visualizing that data to understand it better. These efforts have mostly been custom JavaScript code as part of our admin UI, but when we read about Re:dash we were eager to test what kind of advanced visualizations we could build with an “off the shelf” solution like that. We wanted both charts showing all kinds of numerical statistics retrieved from that data and maps showing us geographical patterns. Re:dash supports this right out of the box, so what were we waiting for?

Getting up and running

Since we run all our systems on Google Cloud, we were really happy to discover that Re:dash offers a pre-built image for Google Compute Engine, and they even have one with BigQuery capabilities preconfigured. This means that when we fire up Re:dash in one of our Google Cloud projects, the BigQuery databases in the same project are automatically available as data sources ready to be queried. Awesomeness!!

Apart from booting the GCE image itself, we had to open some firewall ports (80/443) using the gcloud compute firewall-rules create command, add a certificate to the nginx instance running inside the Re:dash image to enable HTTPS, and lastly add a DNS record for easy access.

The final touch was to add authentication using Google Apps so we could log in using our Unacast Google accounts. This also makes access and user control a breeze.

The power of queries

As the name implies, the power of BigQuery lies in queries on big datasets. To write these queries we can (luckily) just use our old friend SQL, so we don’t have to learn some new, weird query language. The documentation is nothing less than excellent. There’s a detailed section on Query Syntax, and there’s a really extensive list of Functions that spans from simple COUNT() and SUM() through string functions like REGEXP_EXTRACT() to all kinds of date manipulations like DATE_DIFF(). There’s also beta support for standard SQL syntax

which is compliant with the SQL 2011 standard and has extensions that support querying nested and repeated data

but that’s sadly not supported in Re:dash yet (at least not in the version included in the GCE image we use).

In Re:dash you can utilize all of BigQuery’s querying power, and you can (and should) save those queries with descriptive names to use later for visualizations in dashboards. Here’s a screenshot of the query editor, where the observant reader will notice that I’ve used Google’s public nyc-tlc:yellow dataset. It contains lots and lots of data about NYC Yellow Cab trips, and I’ll use it in my examples because it’s kind of similar to our beacon interaction data: both contain lat/long coordinates and timestamps for when the interaction occurred.

1000 cab trips

It’s worth noting, however, that you don’t get any autocomplete functionality in Re:dash, so if you want to explore the different functions of BigQuery using the tab key, you should use the “native” query editor instead. Just ⌘-C/⌘-V the finished query into Re:dash and start visualizing.

Visualize it

Every query view in Re:dash has a section at the bottom where you can create visualizations of the data returned by that specific query. We can choose between these visualization types: [Boxplot, Chart, Cohort, Counter, Map]. Here’s how 100 cab trips look on a map:

100 cab trips

When you get a handful of these charts and maps you might want to gather them in a dashboard to e.g. showcase them on a monitor in the office. Re:dash has a dashboard generator where you can choose to add widgets based on the visualizations you have made from your different queries. You can even rename and rearrange these widgets to create the most informative view. Here’s an example dashboard with the map we saw earlier and a graph showing the number of trips for each day in a month. The graph makes it easy to see that the traffic fluctuates throughout the week, with a peak on Fridays.

dashboard

So what’s the conclusion?

Re:dash has been a pleasant experience so far, and it has helped us get more insight into the vast amount of data we have. We discover new ways to query the data because it’s easier to picture a specific graph or map that we want to produce rather than just numbers in columns. We intend to use this as an internal tool to quickly generate visualizations and dashboards of specific datasets, to better understand how they relate to and differ from our other datasets.

There are some rough edges, however, that have been bothering us a bit. The prebuilt GCE images aren’t entirely up to date with the latest releases, unfortunately. The documentation mentions a way to upgrade to the newest release, but we haven’t gotten around to that yet. The lack of support for standard SQL syntax in BigQuery is also a little disappointing since that syntax has even better documentation and the feature set seems larger, but it’s not that big of a deal. The biggest problem we have been facing is that the UI sometimes freezes and crashes that tab in the browser. We haven’t pinpointed exactly what causes it yet, whether it’s the size of the result set or the size of the dataset we’re querying. It’s really annoying regardless of the cause because it’s hard to predict which queries will cause Re:dash to crash. Hopefully, this will be solved when we figure out how to upgrade to a newer version or the Re:dash team releases an updated image.

Introduction

Today, I’m writing about concurrency and concurrency patterns in Go. In this blog post I will outline why I think concurrency is important and how it can be implemented in Go using channels and goroutines.

Disclaimer: This post is heavily inspired by “Go Concurrency Patterns”, a talk by Rob Pike.

Why is concurrency important?

Web services today are largely dependent on I/O, whether from disk, a database or an external service. Running these operations sequentially and waiting for them to finish will result in a slow and underperforming system. Most modern web frameworks solve the basic issues for you; that is, without any setup they handle each HTTP request concurrently. But if you need to do something out of the ordinary, like calling a few external services and combining the results, you are mostly on your own.

The two most common models for concurrency I’ve used are the shared-memory model using threads, as in Java, and callbacks, as used in asynchronous languages like Node.js. I believe both approaches can be insanely powerful when done right. However, they’re also insanely hard to get right. The shared-memory model shares state/messages through memory guarded by locks, and is error-prone to say the least. And asynchronous programming is, at least in my experience, a hard programming paradigm to reason about and especially to master.

Concurrency in Go

Go solves concurrency in a different manner. It’s similar to threads, but instead of sharing messages through memory, it shares memory through messages. Go uses goroutines to achieve concurrency and channels for passing data between them. We will dig into these two concepts a bit further.

Goroutines

Goroutines are a simple abstraction for running things (functions) concurrently. This is achieved by prepending go to a function call, e.g.:
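
A minimal sketch of what that can look like (the say function and the sleep at the end are made up for illustration, not taken from the talk):

  package main

  import (
      "fmt"
      "time"
  )

  func say(s string) {
      fmt.Println(s)
  }

  func main() {
      // Prepending "go" runs say in a new goroutine, concurrently with main.
      go say("hello from a goroutine")

      say("hello from main")

      // Give the goroutine a chance to run before main exits.
      // (Illustration only; real code would synchronise with channels.)
      time.Sleep(100 * time.Millisecond)
  }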

A good example of the concept can be found here

Channels

Channels are the construct for passing data between goroutines in Go. A channel blocks on both the sending and the receiving side until both are ready, meaning a channel can be used both for synchronising goroutines and for passing data between them. Below we see a simple example of how to use channels in Go. The basic idea is that data flows in the same direction as the arrow.
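
A minimal sketch of the idea: an anonymous goroutine sends a string on a channel, and main blocks until it can receive it.

  package main

  import "fmt"

  func main() {
      ch := make(chan string) // an unbuffered channel of strings

      go func() {
          ch <- "ping" // send: data flows into the channel
      }()

      msg := <-ch // receive: blocks until the send above happens
      fmt.Println(msg)
  }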

In the example below we see how channels and goroutines can be used to create a function utilising concurrency that is easy to understand and reason about.

Example: using goroutines and channels

First, let’s assume we want to create a service that asks three external services for data and returns the combined results. Let’s call these three services Facebook, Twitter and Github. For simplicity, we fake the communication with each of these services, such that the result of each service can be found at https://localhost/{facebook, twitter, github}, respectively.

The behaviour of GetAllTheThings is to fetch data from all the defined services and combine the results into a list. Let’s start with a naive approach, sketched below.
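
A sketch of what that naive implementation could look like, as a small naive package (the Get helper and the service list follow the description above; the details are assumptions):

  package naive

  import (
      "io/ioutil"
      "net/http"
  )

  // The three fake services described above.
  var services = []string{"facebook", "twitter", "github"}

  // Get fetches a single resource and returns its body.
  func Get(path string) (string, error) {
      resp, err := http.Get("https://localhost/" + path)
      if err != nil {
          return "", err
      }
      defer resp.Body.Close()

      body, err := ioutil.ReadAll(resp.Body)
      if err != nil {
          return "", err
      }
      return string(body), nil
  }

  // GetAllTheThings queries each service in sequence and
  // combines the results into a list.
  func GetAllTheThings() ([]string, error) {
      results := make([]string, 0, len(services))
      for _, s := range services {
          body, err := Get(s)
          if err != nil {
              return nil, err
          }
          results = append(results, body)
      }
      return results, nil
  }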

Above we see an example implementation of the naive approach; in other words, we query each service sequentially. That means the call to the Github service has to wait for the Facebook service, and the Twitter service needs to wait on both Github and Facebook. Since these services are not dependent on each other, we can improve this by performing the requests concurrently. Enter channels and goroutines.

(PS: I’ve ignored error handling in the concurrent examples. Don’t do this at home. It’s purely for readability.)
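
Here’s a sketch of the concurrent version, reusing naive.Get from the sketch above (the import path is hypothetical):

  package concurrent

  import "example.com/blog/naive" // hypothetical import path

  var services = []string{"facebook", "twitter", "github"}

  // GetAllTheThings issues one goroutine per service and collects
  // the results over a channel. The signature is the same as the
  // naive version; errors are ignored for readability, as noted above.
  func GetAllTheThings() ([]string, error) {
      ch := make(chan string)

      for _, s := range services {
          // Pass s as an argument so each goroutine gets its own
          // copy of the loop variable.
          go func(path string) {
              body, _ := naive.Get(path)
              ch <- body
          }(s)
      }

      // Receive exactly one result per service; they may arrive
      // in any order.
      results := make([]string, 0, len(services))
      for range services {
          results = append(results, <-ch)
      }
      return results, nil
  }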

We’ve now modified the naive approach using channels and goroutines. We see that each call is issued inside a goroutine, and that the results are collected in the for-loop at the end. The code can be read sequentially and is therefore easy to reason about. It’s also explicitly concurrent, since we explicitly issue several goroutines. The only caveat is that the results may not be returned in the same order as the goroutines were issued.

Notice that we are still able to use the naive approach for fetching a single resource, naive.Get(path string), and that the signature of the function is exactly the same as before. That is powerful! But does it actually run faster?

In main.go we put everything together and measure execution time to see if it’s actually faster.
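
Something along these lines (a sketch; the import paths match the hypothetical packages above):

  package main

  import (
      "fmt"
      "time"

      "example.com/blog/concurrent" // hypothetical import paths
      "example.com/blog/naive"
  )

  func main() {
      // Time the sequential version.
      start := time.Now()
      naive.GetAllTheThings()
      fmt.Println("naive:     ", time.Since(start))

      // Time the concurrent version.
      start = time.Now()
      concurrent.GetAllTheThings()
      fmt.Println("concurrent:", time.Since(start))
  }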

The conclusion is yes, it runs faster. Actually, it runs an order of magnitude faster. If you want to run these experiments yourself, or are just curious about the implementation, the full example project can be found here.

Closing notes

We have shown that it’s easy to utilise concurrency in Go using channels and goroutines. However, this post has simplified a lot, and the caveats you may encounter using channels and goroutines are not fully addressed here. So use channels and goroutines with caution; they can cause a lot of headaches if overused. The general advice is to always start by building something naive before optimising.

I hope you have enjoyed reading this post. If I’ve done something unidiomatic, please tell me so in the comments below or on Twitter (@gronnbeck). I’m still learning and having fun with Go, and I’m always eager to learn from you as well.

The 12 factor app is a methodology that unifies the composition and interface of web applications. Additionally, the methodology addresses other aspects of web applications, such as scalability and continuous deployment.

The 12 factors

  1. Codebase
  2. Dependencies
  3. Config
  4. Backing services
  5. Build, release, run
  6. Processes
  7. Port binding
  8. Concurrency
  9. Disposability
  10. Dev/prod parity
  11. Logs
  12. Admin processes

Source: 12factor.net

A building block for microservices and container orchestration

There is a wide range of reasons why adoption of the microservices pattern is common today — continuous integration, independent service scalability and organizational scalability are some. Moreover, the 12 factor app is crucial to creating microservices, as it ensures the ability to perform continuous integration and to scale services independently.

Microservices require an infrastructure that offers simple service orchestration and deployment management. Therefore, container and orchestration products like Docker, Kubernetes and friends are absolutely pivotal to making microservices a sensible approach. Container technology creates portable containers of an application or data, whereas orchestration tools manage running clusters of such containers or other portable executables, providing clear interfaces for deployment, resilience and scalability.

To create valuable application containers, 12 factor apps again come in handy by exhibiting traits like the use of backing services, relying on port binding and disposability.

Further, container orchestration products depend deeply on the composition and interface of 12 factor apps. Commonly, these products expect, in addition to the container traits, that apps handle config through the environment, scale via stateless processes and treat logs as a stream on stdout.
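
As a concrete illustration, here’s a minimal sketch in Go of a few of those traits: config read from the environment, logs written to stdout and the service exported through port binding (the PORT variable name is a common convention, not something mandated by any particular orchestrator):

  package main

  import (
      "fmt"
      "log"
      "net/http"
      "os"
  )

  func main() {
      // Config comes from the environment (factor III).
      port := os.Getenv("PORT") // PORT is a conventional, assumed name
      if port == "" {
          port = "8080"
      }

      // Logs are an event stream on stdout (factor XI); the platform,
      // not the app, routes and stores them.
      log.SetOutput(os.Stdout)

      http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
          fmt.Fprintln(w, "hello")
      })

      // The app is self-contained and exports its service via
      // port binding (factor VII).
      log.Printf("listening on :%s", port)
      log.Fatal(http.ListenAndServe(":"+port, nil))
  }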

Looking at the most mature web application platforms of today, App Engine, Heroku and the like, they are bound to 12 factor apps in much the same way as most container orchestration products. In fact, the 12 factor methodology was introduced by Heroku themselves.

Developer experience in a polyglot world

The state of web application development is evolving at a heartening pace. As a result, many aspects are heavily fragmented, most notably the number of programming languages and frameworks. Fortunately, the increasing adoption of applications adhering to the 12 factor methodology helps keep the ecosystem as a whole sane. Without the common ground of the 12 factor app, creating general tools would likely be an excruciating task. Not to mention, this common ground eases the mental load for developers moving from one framework to another — something that matters especially in a microservices setting.

Up until recently Kubernetes clusters running on GCP have only supported one machine-type per cluster. It is easy to imagine situations where it would be beneficial to have different types of machines available to applications with different demands. This post will detail how to add a new node-pool and ensure that specific pods are deployed to the preferred nodes.

k8s logo

Why not a cluster with different machines?

There is at least one good reason to run a cluster with a homogeneous machine pool: it is the simplest thing. And up to a certain level, that is also the smartest thing to do. If all your applications running on k8s have roughly the same demands for e.g. CPU and memory, it is something you can keep doing for a long time.

What pushed us to explore heterogeneous clusters was mainly two things:

  1. We had some apps demanding a much higher amount of memory than others.
  2. We had some apps that needed to run on privileged GCP machines.

We could solve #2 by giving all machines the needed privileges, and we did for a while. But solving #1 by upgrading all machines in the cluster to high-memory instances would be very expensive.

Enter node-pools.

Node pools

Node pools are a fairly new, and very poorly documented, alpha feature on Google Container Engine that lets you run nodes with different machine types in the same cluster. Earlier you were stuck with the initial machine type, but with node pools you are also able to migrate your cluster from one machine type to another. This is a great feature, as migrating all your apps from one cluster to another is nothing I would recommend doing more than once.

All clusters come with a default pool, and all new pools need to have a minimum size of 3 nodes.

Creating a new node pool

Creating a node pool is pretty straightforward; use the following command:

  $> gcloud alpha container node-pools create <name-of-pool> \
  --machine-type=<machine-type> --cluster=<name-of-cluster>

Scheduling pods to specific types of nodes

To schedule a pod to a specific node or a set of nodes, one can use a nodeSelector in the pod spec. The nodeSelector needs to refer to a label on the node, and that’s pretty much it. An alpha feature in Kubernetes 1.2 is node affinity, but more on that in a later post.

There are a couple of ways to approach the selection of nodes. We could add custom labels to the nodes with the kubectl label node <node> <label>=<value> command and use this label as the nodeSelector in the pod spec. The disadvantage of this approach is that you will have to add the new labels as you resize the node pool. The other, simpler solution is just to refer to the node pool itself when scheduling the pods.

Let us imagine that we added a node-pool with high memory machines to our cluster, and we called the new node-pool highmem-pool. When creating node-pools on GKE, a label is automatically added. If we do a kubectl describe node <node-name> we can see that the node has the following label: cloud.google.com/gke-nodepool=highmem-pool.

To ensure that a pod is scheduled to the node pool, we need to add that label in the nodeSelector like this:

  apiVersion: v1
  kind: Pod
  metadata:
    name: nginx
  spec:
    containers:
    - name: nginx
      image: nginx
      imagePullPolicy: Always
    nodeSelector:
      cloud.google.com/gke-nodepool: highmem-pool

Summary

Node pools are a great new feature on GKE, one that makes Kubernetes much more flexible and lets you run different kinds of workloads with different requirements.

I need to add some synonyms for these tags.

How about a simple command-line tool?

What do we need synonym lookup for?

We here at Unacast have explored different solutions for finding synonyms for the tags registered with the different resources in our system, so that we can suggest tag improvements to our partners. These tags are mostly nouns, and our initial exploration involved downloading, massaging and interfacing a large dump of raw Wiktionary data. We found that solution a bit clunky and inelegant, and it didn’t always give us results as good as we had hoped. So when I came across Big Huge Thesaurus and discovered that it has an API, I decided it was time for a new experiment to see if we could improve the quality of our tag suggestions.

For this test, I wanted to get suggestions for the actual tags in our database and I wanted to store the suggestions so we could display them in our admin UI. So I had to write something capable of accessing our API and the Big Huge Thesaurus API in a manageable way, but at the same time, I didn’t want to build a full-blown app with a web UI and all the complexity that introduces.

How do you quickly create a client to explore data from an API you’re considering using?

The solution I came up with was inspired by a talk by Josh Suereth about snark, a simple Twitter command-line client he had made using a rather original technique. It’s based on the internal features powering the excellent interactive mode of the Scala Build Tool, sbt. This lets me build a rather advanced command-line tool with autocompletion, ANSI colors and dynamic content, using only my favorite language, Scala. I can also package it into any kind of executable using an sbt plugin called sbt-native-packager. That beats wrestling with a bash script trying to make API calls, parse JSON and at the same time present the results as suggestions using bash-completion.

This is how it looks in action:

As you can see, I get tab completion suggestions for the different parts of the command I’m trying to use. The sbt console also supports ANSI colors out of the box.

So how does this marvel work?

As I mentioned earlier, it’s powered by the command-line client engine in sbt, which in turn uses JLine to read input from the console and its own implementation of the parser combinator pattern to make sense of that input.

In functional programming, a parser combinator is a higher-order function that accepts several parsers as input and returns a new parser as its output. In this context, a parser is a function accepting strings as input and returning some structure as output, typically a parse tree or a set of indices representing locations in the string where parsing stopped successfully. Parser combinators enable a recursive descent parsing strategy that facilitates modular piecewise construction and testing. This parsing technique is called combinatory parsing.

This allows me to implement the commands I want the CLI to handle as separate parsers that can be combined in the order I want the autocompletion to present them. The resulting parsers from the different combinations encapsulate the actual command that gets executed to perform the task we want done.

Show me the code!

I’ve created a small sample implementation that shows the main components needed to make a command-line tool like this. Take a look at this Github project for a runnable example or see the gist below for code examples.

Wrapping up

Using sbt and its parser combinator abilities, it’s a breeze to create simple-to-use CLI tools with full tab completion, backed by the full force of Scala to implement the program logic. Combined with the sbt-native-packager plugin, we can also make them run natively on the major platforms, as runnable jars or even as Docker images. Clone my example project and try it for yourself!