First, let me be frank: we've been on the Google Cloud Platform (GCP) for little more than a year, and we haven't been working full time with GCP for that whole year either. Nonetheless, we feel ready to share some insights and thoughts on building microservices on GCP.

In this blog post we will focus on why we at Unacast chose GCP as our cloud provider, why it's still a good fit for us, and a few lessons learned from using the platform for about a year.

Why Google Cloud Platform

Today the natural choice of cloud is Amazon Web Services (AWS), and with good reason. AWS pioneered many of the great cloud services out there: S3, EC2, Lambda, etc. It has, as far as we know, the longest list of cloud components you can use to build your platform1. And it's battle tested at scale by Amazon, Netflix, Airbnb and many others. So why did we choose GCP instead?

Actually, we didn't. Unacast started out building its platform on a combination of Heroku and AWS. After some fumbling, sessions of banging our heads against the wall, and some help from a consultant2, we decided to try GCP. With some effort and a lot of luck it turned out to be the right platform for us. The reason for testing GCP was twofold: 1) GCP's big data capabilities, and 2) it helps us minimise time spent on operations.

It is no secret that Google knows how to handle large amounts of data, and many of the tools provided on GCP are designed for storing and processing big data. Tools like Dataflow, Pubsub, BigQuery, Datastore, and Bigtable are really powerful for data management. GCP also has great environments for running services, like App Engine, Container Engine and Dataflow, which help us maximise the time spent building business-critical features rather than spending developer time on keeping the lights on.

The Good Parts

Pubsub

Pubsub is a distributed publish/subscribe queue. It can be used to propagate messages between services, but at Unacast we've mostly used it to regulate back pressure. It works great in scenarios where you want to buffer traffic/load between front-line API endpoints and downstream services. We use this approach when designing write-centric APIs that have to handle large, unpredictable spikes of requests. NB! Pubsub doesn't provide any ordering guarantees, and it doesn't provide any retention unless a subscription is created for the topic.
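As a rough sketch of the pattern (not our production code; the project id and topic name are placeholders), a write-centric ingest endpoint that only publishes the request body to Pubsub and lets downstream workers do the heavy lifting can be as small as this, using the Go client library:

```go
package main

import (
	"context"
	"io/ioutil"
	"log"
	"net/http"

	"cloud.google.com/go/pubsub"
)

func main() {
	ctx := context.Background()
	// "my-project" and "ingest" are placeholders for a real project id and topic.
	client, err := pubsub.NewClient(ctx, "my-project")
	if err != nil {
		log.Fatal(err)
	}
	topic := client.Topic("ingest")

	http.HandleFunc("/events", func(w http.ResponseWriter, r *http.Request) {
		body, err := ioutil.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		// Publish is asynchronous; the client batches and retries for us,
		// and downstream workers consume from a subscription at their own pace.
		topic.Publish(ctx, &pubsub.Message{Data: body})
		w.WriteHeader(http.StatusAccepted)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The endpoint acknowledges with 202 Accepted as soon as the message is handed to Pubsub, which is what lets it absorb spikes that the downstream services could never handle directly.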

BigQuery

BigQuery is a great database for building analytics tools. Storing data is cheap, and you only pay for storage and the queries you run. BigQuery shines because of its out-of-the-box ability to query large amounts of data really fast; in our experience it queries 1GB of data just as fast as 100GB (and probably far more). One thing to remember when using BigQuery is that it's effectively an append-only database, meaning that you cannot delete single rows, only tables3. In other words, where Cassandra has row-level TTLs, BigQuery ships with table-level TTLs, so implementing data retention has to be done differently and may not be straightforward if you're coming from a standard SQL background.
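As an illustration of one common approach (not necessarily how every team, or even we, set it up): write data into day-sharded tables such as events_20170101, drop or expire old shards separately, and limit queries to the last N days with TABLE_DATE_RANGE. The dataset, table prefix and column names below are hypothetical, and the query uses legacy SQL:

```sql
SELECT user_id, COUNT(*) AS interactions
FROM TABLE_DATE_RANGE(
  [my_dataset.events_],
  DATE_ADD(CURRENT_TIMESTAMP(), -30, 'DAY'),
  CURRENT_TIMESTAMP())
GROUP BY user_id
```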

App Engine

App Engine is a scalable runtime environment for Java, designed to let you scale without having to worry about operations. App Engine is great if you need highly scalable APIs, but you can only use Java 7 and libraries whitelisted by Google. Because of these restrictions we've got mixed feelings about App Engine: getting scale without worrying about operations is great, but on the other hand the development process becomes a lot more complex. We would use App Engine where the API doesn't need much logic or many external dependencies, like an API gateway, but for more complex services we would use Container Engine instead.

Container Engine

Container Engine is GCP's answer for hosting Linux containers. It's powered by Kubernetes which is, as of writing, the de facto standard for scheduling and running Linux containers in production. On GCP we view Container Engine as the middle ground between Compute Engine and App Engine, where we believe you get the best tradeoff between operational overhead and flexibility. With Kubernetes you can do interesting things such as bundling databases or other services together to increase performance, which is impossible in App Engine. However, you have to worry about updating your Kubernetes cluster and keeping the nodes healthy and happy, which adds some operational complexity and time spent not adding features.

Dataflow

Dataflow all the things! Dataflow is GCP's next-generation MapReduce, with both streaming and batch capabilities. It is so good that we try to use it every time we need to process a non-trivial amount of data or need continuously running workers. As of writing, Dataflow only has an official SDK for Java, and Java isn't necessarily the natural language for defining and working with data pipelines. Needless to say, we started looking for non-official SDKs that could suit our needs and found Datasplash, a Dataflow wrapper written in Clojure. Clojure's syntax and functional approach work very well for defining data processing pipelines. We're currently pretty happy with Datasplash/Clojure; at the time of writing we're running Dataflow pipelines written in both Java and Clojure, and time will show if this is the right tool. A caveat with Dataflow on GCP is that it uses Google Compute Engine instances under the hood, which means the quota limits for virtual machines can be a show stopper. Make sure you always have large enough quotas while you're evolving your platform.

The not so good

Stackdriver

Stackdriver's monitoring sucks. Monitoring is hard from the get-go: it's hard to know what to monitor and to set up really good monitors. If a monitor is too verbose and sensitive, nobody cares when an alert is triggered; if it's not sensitive enough, errors in production will go unnoticed. In our opinion, setting up custom metrics in Stackdriver is a horrible experience, and that is why we use Datadog for monitoring services and setting up dashboards. To be fair, Stackdriver has some good components too, especially if you're using App Engine. Stackdriver's Trace functionality is awesome for tracking down what is slow in your application, and the logs module is easy to use and query for interesting info. Our experience is that these two modules work really great out of the box.

CloudSQL

Cloud SQL is a great service for running a SQL database with automatic backups, easy migration to better hardware, and simple setup and scaling of read-only replicas. But the SQL engine behind it is MySQL. We have much respect for what MySQL achieved back in the day, but those days are over. Because of the ease of use, infrastructure-wise, we'll probably still be using Cloud SQL in the near future. However, we think we should always consider using Postgres through compose.io, or even AWS Aurora, before settling for Cloud SQL.

Closing notes

We haven't been able to test all the features of GCP, and some of them look really promising. We're really excited about the machine learning module, and we hope they'll support Endpoints for services other than App Engine soon.

Choosing the right cloud platform isn't straightforward. It's hard to know whether the services a platform provides are the right ones for you. We at Unacast have learned from first-hand experience that more isn't necessarily better, and that your first choice and instinct might not always be correct. GCP was, and still is, the right choice for us. And after Spotify announced that they were moving their infrastructure to GCP, we're more sure than ever that we chose the right cloud platform.

Footnotes

1. Everything is a platform these days.
2. Not all consultants are evil.
3. Not entirely true; deletes are expensive, not impossible.

Introduction

This post is best read with some prior knowledge of Kubernetes. You should be familiar with concepts like pods, services, secrets and deployments. I'm also assuming you've worked with kubectl before. Enjoy!

At Unacast we spend a lot of time creating web services, usually in the form of JSON APIs, and we've spent a lot of time designing, experimenting and researching in order to design them well. We've shared what we've learned along the way. A lot of these posts have been theoretical, but in today's post we're getting our hands dirty: we're going to implement an API that scales when subjected to a massive amount of read requests, and we're doing it using Redis. All the examples will be run on Kubernetes.

We assume that the logic that keeps the data in Redis updated has been implemented somewhere else, and that the rate of adding or updating data (writes) is low. In other words, we expect multiple orders of magnitude more reads than writes. Thus, our tests will only contain read requests and no writes.

All code snippets included in this post can be found in its full form here.

Redis

Before we get started we need to talk about Redis. The Redis team describes it quite well on their homepage:

Redis is an open source (BSD licensed), in-memory data structure store, used as database, cache and message broker.

In my own words, Redis is a really fast and scalable in-memory database with a small footprint on both memory and CPU. What it lacks in features it makes up for in speed and ease of use. Redis isn't like a relational store where you use SQL to query; instead it ships with a set of commands for manipulating different types of data structures.

Redis is a really powerful tool and should be a part of every developer's toolkit. Even if Redis isn't the best fit for you, I'd still recommend investing time in learning how and when to use an in-memory database.

Architectures

Redis can be used in multiple ways, and each approach has different trade-offs and characteristics. We'll be looking at two different models and testing their scalability. The two models we're testing are:

  1. Central Redis: one Redis used by multiple instances of the API.

  2. Redis as a sidecar container: Run a read-only slave instance of Redis for every instance of the API.

The performance tests will be run against the same service using the same endpoint for both models. I’ve extracted the endpoint from main.go and included it below.
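The version below is a reconstruction of that endpoint rather than the original file; it assumes the Redis address is passed in a REDIS_ADDR environment variable and uses the redigo client:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"os"

	"github.com/garyburd/redigo/redis"
)

func main() {
	addr := os.Getenv("REDIS_ADDR") // e.g. "localhost:6379"
	pool := &redis.Pool{
		MaxIdle: 10,
		Dial:    func() (redis.Conn, error) { return redis.Dial("tcp", addr) },
	}

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		conn := pool.Get()
		defer conn.Close()
		// Fetch the value stored under the key "known-key" and return it as-is.
		value, err := redis.String(conn.Do("GET", "known-key"))
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		fmt.Fprint(w, value)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```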

The snippet does one simple thing: it asks Redis for a string stored under the key known-key. Using this simple endpoint we'll look at how Redis behaves under pressure and whether it scales, and we expect different behavior from the two architectural approaches. The example might seem contrived, but a similar real-world use case is verification of API tokens. I agree that this might not be the best way to do token verification, but it's a very simple design. For a more elegant solution you should consider JSON Web Tokens.

Central Redis

As mentioned above, a central Redis architecture means using one Redis instance for all API instances. In our case these API instances are replicas of the same API. This is not a restriction, but it is a recommended architectural principle not to share databases between different services.

At Unacast we believe in not hosting our own databases; we'd rather focus on building stuff for our core business and not worry about operations. Normally we use Google Cloud Platform (GCP) for hosting databases, but hosted Redis isn't publicly available on GCP, so we decided to use compose.io's Redis hosting.

Setting up the service with a single Redis is pretty straightforward using compose.io, and they have some great guides on how to get started with their Redis hosting as well. The kube manifest for running a Kubernetes deployment and service is added below.
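What follows is a simplified sketch rather than an exact production manifest: the image name, labels and the secret holding the compose.io connection details are placeholders.

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: read-api
spec:
  replicas: 12
  template:
    metadata:
      labels:
        app: read-api
    spec:
      containers:
      - name: api
        image: gcr.io/my-project/read-api:1.0   # placeholder image
        ports:
        - containerPort: 8080
        env:
        - name: REDIS_ADDR
          valueFrom:
            secretKeyRef:
              name: compose-redis               # secret holding the compose.io host:port
              key: addr
---
apiVersion: v1
kind: Service
metadata:
  name: read-api
spec:
  type: LoadBalancer
  selector:
    app: read-api
  ports:
  - port: 80
    targetPort: 8080
```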

Redis as a sidecar container

Before we describe how to set up Redis as a sidecar container, we have to give a short description of what a sidecar is. The sole responsibility of a sidecar container is to support another container; in this case the job of the Redis sidecar is to support the API. In Kubernetes we solve this by bundling the API and a Redis container inside one pod. For those of us who don't remember what a pod is, here is an excerpt from the Kubernetes documentation:

pods are the smallest deployable units of computing that can be created and managed in Kubernetes.

This means that if a Redis container is bundled with an API container, they'll always be deployed together on the same machine, sharing the same IP and port range. So don't try to bundle two services using the same ports; it simply won't work.

The following shows how to bundle the two containers together inside a Kubernetes pod.
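As with the previous manifest, this is a sketch with placeholder names; the Redis container is started as a slave of the compose.io master, which in practice means pointing it at the SSL tunnel described below.

```yaml
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: read-api-sidecar
spec:
  replicas: 3
  template:
    metadata:
      labels:
        app: read-api-sidecar
    spec:
      containers:
      - name: api
        image: gcr.io/my-project/read-api:1.0   # placeholder image
        ports:
        - containerPort: 8080
        env:
        - name: REDIS_ADDR
          value: "localhost:6379"               # the Redis sidecar in the same pod
      - name: redis
        image: redis:3.2
        # Placeholder master address; in our setup this points at the SSL
        # tunnel towards the compose.io master instance.
        command: ["redis-server", "--slaveof", "redis-master.example.com", "6379"]
        ports:
        - containerPort: 6379
```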

By deploying this we'll have a Redis instance for each pod replica; in this specific case, three Redis instances. That means we need some mechanism for keeping these instances in sync. Implementing sync functionality yourself is horrible [citation needed]. Luckily, Redis can be run in master-slave mode, and we have a stable Redis instance hosted by compose.io. By configuring every Redis sidecar instance as a slave of the master run by compose.io, we can just update the master and not worry about propagating the data to the slaves. Our unscientific tests showed that the Redis master propagates data to the slaves really fast.

NB! A caveat is that you have to set up an SSL tunnel to compose.io to be able to successfully pair the sidecar instances with compose.io's master instance.

We expect this architecture to scale better than the central Redis approach.

Results

All the tests were run on a Kubernetes cluster with:

  • 12 instances of g1-small virtual machines
  • 12 pod replicas

We used vegeta distributed on five n1-standard-4 virtual machines to run the performance tests.

The graphs below are the results from the performance tests. The results focus on success rate and response times.

Central Redis

Redis as a sidecar container

Conclusion

As expected, we see that the sidecar container approach scales better than the central approach. We observe that the central approach is able to scale to about 15 000 reads/second, while the other can handle over 60 000 reads/second without any problems. Remember that these tests were run on the same hardware, and that only a minor change in the API's architecture resulted in a major performance gain.

Closing Notes

One last thing: remember that using multiple read-only slaves of some other database would behave in much the same manner as the multiple read-only Redis slaves we used here. We simply prefer Redis because of its speed, small footprint and ease of use.

We haven't been running this in production for long, so we don't have any operational experience to share yet, but we intend to share it in the future.

Further work

This post didn't cover whether the Redis-as-a-sidecar-container approach scales linearly as more CPU is added; that is outside its scope. Our internal testing suggests it does, and you're welcome to test it yourself.

At Unacast we're obsessed with monitoring; one of our mantras is "monitoring over testing". Notice that we haven't added any monitoring for the Redis instances inside the pods. However, if you're using Datadog, as we do, it's fairly straightforward to add monitoring by bundling a dd-agent as another sidecar container inside the same pod, as sketched below.
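The snippet below shows roughly what that could look like as an extra entry in the pod's containers list. The image tag and the secret name are placeholders, and we're assuming the docker-dd-agent image reads its API key from an API_KEY environment variable.

```yaml
      - name: dd-agent
        image: datadog/docker-dd-agent:latest
        env:
        - name: API_KEY
          valueFrom:
            secretKeyRef:
              name: datadog      # placeholder secret holding the Datadog API key
              key: api-key
```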

Want more?

If you’re interested in reading more about API design I can recommend the following posts from our archive:

Could we change the paradigm of how we build HTTP REST APIs such that great API documentation is a consequence instead of a chore?

Throughout this text, API(s) refers to HTTP APIs.

Great API documentation makes integrating with an API a breeze. Not only because you can read it and implement a good client yourself, but more because great API documentation would let us generate awesome clients so that we don't have to spend any time on them at all. The downside is that API documentation is rarely great: it is often non-existent, missing crucial information or just blatantly wrong. So how could we fix this and at the same time make it an enjoyable experience?

As developers, documentation is often something we write after the fact. On the occasional sunny day we might actually write documentation at the same time as the code. Unfortunately, the next day we might change that piece of code again and of course forget to change the documentation along with it. In my experience this is just as true when writing application code as when implementing APIs. The issue with not writing API documentation until development is completed is that it becomes an extra barrier before releasing. Unsurprisingly, the documentation is likely to be hurried, both because it blocks the release and because the developer finds it tedious to write. Using annotation-based API documentation in the code itself, as opposed to defining the API documentation in a separate file, helps, but is in our experience not sufficient to mitigate the issue of API documentation being an artifact updated only after the technical implementation is completed.

The opposite approach, design-first, is an approach to building APIs where the documentation is written first, then the implementation is shaped after that piece of documentation. This approach has been covered by many thought leaders over the last couple of years: Programmable Web, API Evangelist and InfoQ. The upside of this approach is that you get a chance to vet your API design without writing a line of code. Additionally, it acts as a natural task specification for the developers implementing the design. The main downside is that technical challenges with the API design might only surface very late in the process, making it more costly to amend them gracefully. Furthermore, developers might feel bound by the specification and find the process too rigid, or in worse cases decide to change the design while neglecting the already existing API document.

Many of the popular frameworks used to build web applications and APIs today rely heavily on convention to increase developer speed. The frameworks themselves define, by default, which error codes are returned and which headers are used. Even though the conventions are convenient, they tend to make writing accurate API documentation challenging, because the person documenting has to remember to document all the less-than-visible default behaviors. In our experience, being privy to the 'complete' set of these behaviors for any framework is a daunting challenge.

We thus see two classes of issues: those related to making changes twice (code and documentation) and those resulting from having to include the conventions used by our frameworks in our API documentation. In our understanding, to address either or both of these issues, API documentation must be a first-class citizen in the frameworks or languages we write APIs in. By first-class citizen we mean that the API design cannot change without the documentation changing, and vice versa. Additionally, the framework should reflect all of its API design defaults in the API documentation. With this in place, the API documentation basically works as a contract.

An example of a framework that does treat API documentation as a first-class citizen is the Go framework goa. goa makes a clean separation between the API design and your business logic. The API design is defined using a DSL, and from that DSL the API documentation, data models and the classic controller code are generated. As long as you do not alter the generated code, which can easily be enforced on a build server, your API design and API documentation are always in sync. This characteristic makes uncovering breaks to the current API contract as simple as diffing the new and old API documentation documents.
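To give a feel for it, here is a small design sketch borrowed in spirit from goa's own examples (goa v1 DSL; the toy adder service is not one of our APIs):

```go
package design

import (
	. "github.com/goadesign/goa/design"
	. "github.com/goadesign/goa/design/apidsl"
)

// The design is the single source of truth: running goagen over it generates
// the controller scaffolding, the data types and the Swagger documentation.
var _ = API("adder", func() {
	Title("The adder API")
	Host("localhost:8080")
	Scheme("http")
})

var _ = Resource("operands", func() {
	Action("add", func() {
		Description("add returns the sum of the left and right parameters")
		Routing(GET("/add/:left/:right"))
		Params(func() {
			Param("left", Integer, "Left operand")
			Param("right", Integer, "Right operand")
		})
		Response(OK, "text/plain")
	})
})
```

Changing a route or a parameter here changes both the generated code and the generated documentation, which is exactly the contract property described above.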

Some ideas and thoughts on recruiting for tech positions, no technical background required.

Our take on the “final interview” for developers

How do you, at some point in the tech interview process, determine whether a candidate is the right fit for your team? He or she has passed initial screenings and interviews, and seems to possess the right skills to fill the needed role. If we consider this a stage-gate process, what should be the final gate?

At Unacast, we are experimenting with an interesting take. The question that we asked ourselves is basically - how do we know that the candidate will fit into our everyday life at the Unacastle and the way we work? This led to a rough analysis of what we actually do during a typical workday.

Being a startup in a highly exploratory space, we spend a lot of time on the tech side researching, testing and discussing different approaches to the problems at hand before implementing production code. This means that we often need to learn new languages, frameworks and paradigms.

Pair programming

A common approach in development is pair programming, which is basically what it sounds like: two developers working on the same problem on one computer. One could also call it pair problem-solving, and research has shown that working in pairs often yields better solutions faster. This is especially true when learning new things or attacking unknown problems.

We therefore decided to invite candidates that had passed all stages to a “night at the Unacastle”, where the candidate would be paired up with one of our developers to work for a couple of hours on a given problem.

An important twist is that the problem to work on is unknown both to our developer and to the candidate. That means both start pretty much from scratch and have to approach the problem together, stitching together a solution by googling, reading documentation and discussing. And usually a pretty solid dose of Stack Overflow.

In our experience, after a little while everybody seems to forget that this is actually an interview, and people let their guard down - let's get some shit done! It has been interesting to see how people react when they are questioned, or how they argue that one approach is better than another. In our opinion, the real goal of an interview is to let candidates show their true selves, and their skills. We have found that this interview structure makes that happen because it creates a relaxed environment. Considering that some developers have an analytical and introverted personality (and by the way, being introverted or extroverted is neither a positive nor a negative trait - they are just different), in addition to the natural nervousness connected to applying for a new job, an approach that creates a playing field that feels natural is a huge win.

Some guidelines and ethical perspectives

We value the candidate's time and effort, so it would be unethical to work on internal products or features, which basically would mean that candidates did free work for us. Our solution to this issue has been to just do something for fun that could be open sourced at some point. Obviously it can be stuff that is useful for us, but it should also have the potential to be useful to other people and not be tightly integrated with or related to our systems or codebase. Usually we build something in a new, exciting language or framework that we may have looked at in our spare time but haven't found the time to dig into. It's a win-win!

One could fear that the candidate may find that we are not at the level they expect, but that is OK. An interview process goes both ways; it is just as much about the candidate getting a real impression of our company, our people and how we do things.

Takeaways and learnings

On the practical side, you need at least one developer internally who is available. Ideally a few more, because then the team can hang around, listen casually in on the conversation and drop in if they feel like it. Looking at recruiting for non-tech positions in general, case assignments are a widely used method to assess a candidate's real-life abilities, although in a somewhat artificial setting. Coding like we do in "A night at the Unacastle" is in many ways easier to assess, since it is much closer to the real thing. We are a young company in continuous development in all aspects, including recruiting, but we feel that we have found something precious here, and we will definitely continue with this practice and keep refining it based on feedback from candidates and our own learnings.

Lastly, although not entirely related to the "Night at the Unacastle", one of the most important learnings is that sourcing candidates is really, really hard work! Considering that the type of candidates we are looking for (i.e. the A players) are usually happy where they are, it's all about using networks, seminars, social media and whatever means possible to get hold of the right people. We have also found that working with recruiters is a great help in the actual interview process, but there is no excuse for not making sourcing a team responsibility that everybody should feel committed to.

A last little secret that we want to share is that most developers get at least a couple of recruiter calls or emails per month that are easily dismissed. It has a lot more punch to start the conversation with “Hey, I work as a Platform Engineer at Unacast. I think you have done some really interesting stuff, wanna grab a coffee?”

If you feel that you are in the target group and that the last question speaks to you, don’t hesitate to reach out! We would love to spend a night at Unacastle with you.

How do you make sense of all those terabytes of data stuck in your BigQuery database?

Here’s what we tested

We here at Unacast sit on loads of data in several BigQuery databases and have tried several ways of visualizing that data to understand it better. These efforts have mostly been custom JavaScript code as part of our admin UI, but when we read about Re:dash we were eager to test how advanced visualizations could get with an "off the shelf" solution like that. We wanted both charts showing all kinds of numerical statistics derived from the data and maps showing us geographical patterns. Re:dash supports this right out of the box, so what were we waiting for?

Getting up and running

Since we run all our systems on Google Cloud we were really happy to discover that Re:dash offers a pre-built image for Google Compute Engine, and they even have one with BigQuery capabilities preconfigured. This means that when we fire up Re:dash in one of our Google Cloud projects, the BigQuery databases in the same project are automatically available as data sources ready to be queried. Awesomeness!!

Apart from booting the GCE image itself, we had to open some firewall ports (80/443) using the gcloud compute firewall-rules create command, add a certificate to the nginx instance running inside the Re:dash image to enable https, and lastly add a DNS record for easy access.

The final touch was to add authentication using Google Apps so we could log in using our Unacast Google accounts. This also makes access and user control a breeze.

The power of queries

As the name implies, the power of BigQuery lies in queries on big datasets. To write these queries we can (luckily) just use our old friend SQL, so we don't have to learn some new weird query language. The documentation is nothing less than excellent. There's a detailed section on Query Syntax and then there's a really extensive list of Functions that spans from simple COUNT() and SUM() via REGEXP_EXTRACT() on Strings to all kinds of Date manipulations like DATE_DIFF(). There's also beta support for standard SQL syntax

which is compliant with the SQL 2011 standard and has extensions that support querying nested and repeated data

but that’s sadly not supported in Re:dash yet (at least not in the version included in the GCE image we use).

In Re:dash you can utilize all of BigQuery's querying power, and you can (and should) save those queries with descriptive names to use later for visualizations in dashboards. Here's a screenshot of the query editor, and the observant reader will notice that I've used Google's public nyc-tlc:yellow dataset in this example. It's a dataset containing lots and lots of data about NYC Yellow Cab trips, and I'll use it in my examples because the trips are kind of similar to our beacon interaction data in that they contain lat/long coordinates and timestamps for when the interaction occurred.

1000 cab trips

It's worth noting, however, that you don't get any autocomplete functionality in Re:dash, so if you want to explore the different functions of BigQuery using the tab key you should use the "native" query editor instead. Just ⌘-C/⌘-V the finished query into Re:dash and start visualizing.
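As an example, a saved query behind the trips-per-day graph shown further down could look roughly like this (legacy SQL; the exact table and column names in the public dataset may differ):

```sql
SELECT
  DATE(pickup_datetime) AS day,
  COUNT(*) AS trips
FROM [nyc-tlc:yellow.trips]
WHERE YEAR(pickup_datetime) = 2015
  AND MONTH(pickup_datetime) = 1
GROUP BY day
ORDER BY day
```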

Visualize it

Every query view in Re:dash has a section at the bottom where you can create visualizations of the data returned by that specific query. We can choose between these visualization types: Boxplot, Chart, Cohort, Counter, and Map. And here's how 100 cab trips look on a map:

100 cab trips

When you get a handful of these charts and maps you might want to gather them in a dashboard to e.g. showcase them on a monitor in the office. Re:dash has a dashboard generator where you can choose to add widgets based on the visualizations you have made from your different queries. You can even rename and rearrange these widgets to create the most informative view. Here’s an example dashboard with the map we saw earlier and a graph showing the number of trips for each day in a month. The graph makes it easy to see that the traffic fluctuates throughout the week, with a peak on Fridays.

dashboard

So what’s the conclusion?

Re:dash has been a pleasant experience so far, and it has helped us get more insight into the vast amount of data we have. We discover new ways to query the data because it's easier to picture a specific graph or map that we want to produce than just numbers in columns. We intend to use this as an internal tool to quickly generate visualizations and dashboards of specific datasets to better understand how they relate to, and differ from, the other datasets we have.

There are some rough edges, however, that have been bothering us a bit. The prebuilt GCE images aren’t entirely up to date with the latest releases, unfortunately. The documentation mentions a way to upgrade to the newest release, but we haven’t gotten around to that yet. The lack of support for standard SQL syntax in BigQuery is also a little disappointing since that syntax has even better documentation and the feature set seems larger, but it’s not that big of a deal. The biggest problem we have been facing is that the UI sometimes freezes and crashes that tab in the browser. We haven’t pinpointed exactly what causes it yet, whether it’s the size of the result set or the size of the dataset we’re querying. It’s really annoying regardless of the cause because it’s hard to predict which queries will cause Re:dash to crash. Hopefully, this will be solved when we figure out how to upgrade to a newer version or the Re:dash team releases an updated image.