We have been running a handful of services on Kubernetes for the last 6 months. Here I will summarize some takeaways and patterns that have arisen.
Some words about our setup
We are running Kubernetes (K8S) on Google Container Engine (GKE). GKE is hosting the Kubernetes master so we don’t need to worry about it going down. We also run different clusters for different environments, to ensure that if we screw up in development, it does not affect production. It should be noted that this post is not to be considered an introduction to Kubernetes, and will not necessarily explain concepts in detail. Please refer to the excellent Kubernetes documentation for an introduction.
1. Environment detection
As we run one cluster per environment we cannot use Kubernetes namespaces to detect what environment we are running in. We have tried a few different approaches to this. The most intuitive approach was to use environment variables in the ReplicationControllers, but this meant that we would have to interpolate the correct environment variable at deploy time. To do this, we wrote a script that generated the correct ReplicationController at deploy-time.
Since almost all of our apps need some way to identify what environment they are running in, this logic would have to be replicated for each application. Because of this, we came up with a second solution: we create a Kubernetes secret in each environment. The secret contains a file with a single line of text, namely the environment the cluster is running in. We then mount that secret on all pods, and the pods are free to read the environment at startup.
This is usually done in an entrypoint script referenced from the Dockerfile, which runs when the container starts, like this:
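A minimal sketch of such an entrypoint (the mount path /etc/secrets/environment and the variable name are our own conventions, not anything Kubernetes mandates):

```shell
#!/bin/sh
# The secret is assumed to be mounted as a file containing a single line,
# e.g. "development", "staging" or "production".
read_environment() {
  cat "${1:-/etc/secrets/environment}"
}

# Export it so the application can read it at startup.
export APP_ENVIRONMENT="$(read_environment 2>/dev/null)"
exec "$@"
```

The ReplicationController then mounts the environment secret as a volume at that path, so every app in the cluster sees the same value.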
This is a bit more flexible and ensures that the environment is perceived as the same for all applications in the cluster.
2. Deployment
As we do all our deployments through Slack, we needed some way of automating deployment to K8S. This is still very much a work in progress, but we have ended up with a pretty stable default script. The goal is to be able to update the ReplicationController at each deploy so that we can mount volumes, open ports, update labels, etc.
Updating the image in the ReplicationController does not automatically update already running pods, so to actually deploy a new version we also need to do a rolling update. We also want the script to conditionally update the ReplicationController or create it if it does not exist in the given environment.
The following gist shows a simplified version of our deploy script. It assumes that you already are authenticated to the correct cluster and that the image is passed as a parameter. It also assumes that you have a script that creates a ReplicationController YAML file with the correct image.
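In outline, the script does something like this (names are illustrative; rc.yaml is assumed to be generated beforehand with the correct image):

```shell
#!/bin/bash
# Sketch of the deploy flow: update the running RC with a rolling update if it
# exists in this environment, otherwise create it from the generated manifest.
deploy() {
  local app="$1" image="$2" manifest="$3"
  if kubectl get rc "$app" >/dev/null 2>&1; then
    # RC exists: roll the pods so they pick up the new image.
    kubectl rolling-update "$app" --image="$image" --update-period=10s
  else
    # First deploy to this environment: create the RC from the manifest.
    kubectl create -f "$manifest"
  fi
}
```

Wiring this into the Slack bot then only requires passing in the image tag built by CI.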
This is by no means a fail-proof script, and it mostly proves the point that the Kubernetes authors really need to finish their Deployment API soon.
3. Monitoring
GKE comes with a monitoring solution from Google (Google Cloud Monitoring) that gives insight into, among other things, CPU and memory usage for your pods. We have found the bundled monitoring solution to lack some important aspects, so we opted to use DataDog for our monitoring needs.
DataDog provides a really good integration with Kubernetes that we have found very useful, so it is something we would definitely recommend looking into. A small caveat does exist if you run K8S on GKE since GKE does not support the Kubernetes Beta API in general, and more specifically does not have support for DaemonSets.
The DataDog agent depends on DaemonSets to run exactly one agent on each Kubernetes node, but a small hack works around this. The trick is to create a ReplicationController with replicas equal to the number of nodes and specify a hostPort in the template spec. The host port conflict prevents two DataDog agents from being scheduled on the same node.
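A sketch of such a ReplicationController (image and port values are illustrative; the key point is the hostPort, which the scheduler can only bind once per node):

```yaml
apiVersion: v1
kind: ReplicationController
metadata:
  name: dd-agent
spec:
  replicas: 3            # set equal to the number of nodes in the cluster
  template:
    metadata:
      labels:
        app: dd-agent
    spec:
      containers:
      - name: dd-agent
        image: datadog/docker-dd-agent
        ports:
        - containerPort: 8125
          hostPort: 8125   # only one pod per node can bind this host port
```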
4. Labels
To get the most out of your monitoring solution, it is important to use consistent labels on your pods. This enables you to group metrics across tiers and applications and get better insight into your metrics.
We have landed on a set of fairly simple, but flexible label conventions that gives us the insight we need.
Let’s imagine we have an app called “Awesome” which consists of two pods: one running a backend API and one frontend pod serving HTML. Those would then have the following labels:
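A hypothetical set of labels following this convention (the exact keys are our own choice; the point is consistency across apps):

```yaml
# backend pod
labels:
  app: awesome
  tier: backend
  language: jvm
---
# frontend pod
labels:
  app: awesome
  tier: frontend
  language: nodejs
```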
Further, let’s imagine we have 20 different apps that adhere to the same labeling conventions. Now we can collect metrics across apps and across tiers. If we also apply a “language”-label to our pods we can, for instance, graph the memory usage of all JVM-based apps, or all our NodeJS-apps.
Kubernetes is, in our experience, a solid platform to run microservices on, and it is under heavy development. There are also many exciting features in progress, among others the Deployment API. If you are already running Kubernetes or are considering doing so, I would recommend joining Kubernetes on Slack and the Google Cloud Platform Community.
Disclaimer: The title is ironic. Search Engine Optimization is in fact a huge task, and this post is just a small summary of my experiences of working hands on with it.
Most self-proclaimed "SEO experts" will just tell you that SEO is important and that you should use keywords and some tags. That is not wrong, but it will only take you part of the way toward increasing traffic to your site, which is what SEO is really about. In particular, the term SEO misses an equally important channel: social media.
This summer, we decided to build Proxbook - a crowdsourced directory of proximity solution providers, use cases, white papers and in-depth resources about the industry itself.
It has worked pretty well, in fact so well that we recently decided to take things to the next level.
So far so good. Proxbook is implemented as a Single Page Application
using an underlying API for populating the frontend with data. SPA architecture gives developers flexibility, and end users get more responsive web applications with far less overhead from unnecessary page loads.
“The next level” in these terms meant that we wanted the site to get more traffic and usage - especially in terms of how much time users spend on the page. Consider it a funnel - users enter the page, some leave (bounce) and some stay, or move on to other sub-pages, creating a flow chart of a user journey. In order to boost these numbers, we realized that we had to look at SEO.
Proxbook had pretty low scores on all parameters (using Google Pagespeed, Google Webmaster Tools and similar). So, what is SEO, and what are the important parameters for getting your site high up in the search engine indices?
SEO is a beast with many heads
Most technologists want to focus on building kickass stuff on bleeding-edge technology, but this often mixes badly with SEO (not by definition, but optimization is usually not the first thing that comes to mind when learning new stuff).
I myself thought SEO was mainly about adding some keywords in the page header, but my initial research revealed that SEO has actually become quite comprehensive. The list below describes what I’ve found. It is probably not exhaustive, but it definitely shows that SEO is something that must be grounded in all aspects of a website, from copywriters and authors to designers and developers:
Semantic HTML: use section, article, nav, h1 and h2; don’t put content-relevant images in CSS backgrounds; use image alt text, etc.
The elements themselves convey their importance and relationships.
One page should only have one h1 tag.
Use section tags to divide the different sections of the page instead of div.
Use a nav tag to contain all the links for navigating your site.
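A small illustration of those points (the content itself is made up):

```html
<!-- One h1 per page, section instead of bare div, nav for site links,
     and meaningful alt text on content images -->
<nav>
  <a href="/">Home</a>
  <a href="/reports">Reports</a>
</nav>
<h1>Proxbook</h1>
<section>
  <h2>Use cases</h2>
  <img src="/img/beacon.png" alt="A proximity beacon on a shop shelf">
</section>
```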
Using special markup (microdata) to display products, locations, companies, etc. nicely in Google search results.
Page loading time
Caching of assets - images, fonts etc (yes, Pagespeed will tell you this - essentially means that you should set long expiration times)
Using editing tools that allow you to set / override specific meta tags for subpages, sections and blog posts in an easy manner.
Links in / out and PageRank: the number of inbound links vs outbound links, and inbound links from other (preferably high-ranking) sites.
Generating a sitemap.
Social sharing of content
This is not SEO per se, but if we consider SEO (to simplify) to also include all factors that drive traffic to a site, it is extremely important.
Many sites get hardly any traffic through their main portals. Not knowing the exact ratio, I would assume that sites like TNW, TechCrunch and their likes (this also includes all viral sites with stupid links like “you would not believe what she did when…”) generate 80-90% of their traffic through Facebook, Twitter and LinkedIn.
That means that good shareability for all sub-pages of the site will greatly increase the probability of inbound traffic, which is the end goal no matter where users are coming from. The image below shows the Facebook link when sharing the last Proxbook report:
So, where to begin? This is simple, right? Well, both yes and no. Proxbook was conceived and created extremely fast, so some shortcuts were made. Adding semantic HTML / microdata, for instance, involved, to say the least, a lot of boilerplate work.
At the same time it provided good motivation for restructuring the page and getting code redundancy down to an acceptable level.
The next section will not necessarily deal with all these points, but focus on the ones we worked the most on.
Loading time - Assets / compression
According to Google, user experience is very important and something they factor into page rankings. This is also why SSL has recently been added as a bonus (or rather a penalty for not having it) in their ranking algorithms. The focus on loading time makes sense as more and more users are on mobile devices with limited battery capacity as well as limited data plans. Thus, optimizing pages with caching, compression and other strategies for minimizing bandwidth, as well as not running heavy scripts, is increasingly important.
For managing assets, I decided to use GZIP compression (gulp-gzip), long cache expiration and hosting on Amazon S3. The app itself is built on Angular.js with a gulp build pipeline, so I just added a step that uploads all assets to an S3 bucket after a successful run. This step also runs through all HTML / CSS and swaps references to assets with the S3 URL. The S3 part was done using gulp-cdnizer. To bust the cache, all generated assets are versioned using gulp-rev.
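As a rough sketch, the relevant part of the build configuration could look like this (task names, paths and the bucket URL are assumptions, and the gulp 3-style task dependencies match the era of the post):

```javascript
// gulpfile.js sketch
var gulp = require('gulp');
var gzip = require('gulp-gzip');
var rev = require('gulp-rev');
var cdnizer = require('gulp-cdnizer');

// Content-hash filenames (cache busting) and pre-compress for S3.
gulp.task('assets', function () {
  return gulp.src('src/assets/**/*')
    .pipe(rev())
    .pipe(gzip())
    .pipe(gulp.dest('dist/assets'));
});

// Rewrite HTML/CSS asset references to point at the S3 bucket.
gulp.task('cdnize', ['assets'], function () {
  return gulp.src('src/**/*.{html,css}')
    .pipe(cdnizer({
      defaultCDNBase: 'https://my-bucket.s3.amazonaws.com', // assumption
      files: ['**/*.{png,jpg,css,js,woff}']
    }))
    .pipe(gulp.dest('dist'));
});
```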
We also had another issue. As all user media, such as company logos, were uploaded to S3 without compression or resizing, we had a lot of images on the site that were unnecessarily large and caused Pagespeed to complain. To deal with that, I created a script in the backend that looped through all the
companies and used requests
to get the image, then Pillow to downsize all images to
an acceptable size and save them back to S3. The last part is already handled by Django / boto using S3 as primary file storage.
This script did most of the heavy lifting for updating the logos:
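The script itself is not reproduced here, so the following is a sketch of its shape, with model and field names as assumptions (a Django Company model whose logo file field is backed by S3 via boto):

```python
# Sketch: requests fetches each logo, Pillow resizes it, and saving through
# the Django file field pushes it back to S3.
from io import BytesIO

MAX_DIM = 400  # assumed target bound for the longest side, in pixels

def target_size(width, height, max_dim=MAX_DIM):
    """Scale (width, height) down so the longest side is max_dim."""
    longest = max(width, height)
    if longest <= max_dim:
        return width, height
    scale = max_dim / float(longest)
    return int(round(width * scale)), int(round(height * scale))

def shrink_all_logos(companies):
    import requests
    from PIL import Image
    from django.core.files.base import ContentFile
    for company in companies:
        resp = requests.get(company.logo.url)
        img = Image.open(BytesIO(resp.content))
        img = img.resize(target_size(*img.size), Image.ANTIALIAS)
        buf = BytesIO()
        img.save(buf, format='PNG')
        # Saving through the field writes the file back to S3 via boto.
        company.logo.save(company.logo.name, ContentFile(buf.getvalue()))
        company.save()
```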
Lastly, a manual inspection of background images, css and similar etc was done to optimize / resize and remove duplicate css.
These steps combined took the Pagespeed score from about 2% to 79%, a very decent increase considering the amount of time spent. From my experience, and also from running sites like Facebook through Pagespeed, 79% is a very good score. Getting the last 20% seems to involve a lot of obscure hacks and a lot of sweat. Having it around 80% is far better than most sites and conforms well with the 80-20 principle.
So, you got a single page app? And you want SEO? And you want social media to understand the content of your links?
Seamless indexing of single-page apps has long been a holy grail for frontend developers. Angular 2.0 will support server-side rendering, as does React. However, it does not seem to be completely straightforward, and will either way involve a case-specific server setup. Proxbook is built on Angular 1.x, so there was no help here, meaning we had to “roll our own”.
The articles I found describe the mechanics of getting crawlers to 1) understand that your page is a SPA and 2) serve an HTML snapshot of a requested page. It was a bit difficult to understand whether they applied to sites using hashbang navigation or pushState-based navigation, which Proxbook uses to get more visually appealing URLs. To some degree, they also seem to be a bit outdated.
To my surprise, the Google bot seemed to pick up all links and crawl the page, as well as display those links in search results, out of the box, even before we started diving into the SEO. The articles mentioned indicate that you must do special things to achieve this, but Google is notoriously secretive about how these things actually work, and from my experience they often put things in production and test them long before they add official documentation.
So, the site would now be crawled and indexed, but we still experienced problems with sharing content on social media (manual testing indicated that it didn’t work). The existing solutions I found basically used PhantomJS to render the HTML and send the generated snapshot to the requesting entity, based on the escaped_fragment semantics described in the articles. This was not a good fit for our case, so based on those I created a simple Express middleware that instead looks at the user agent requesting the page and pre-renders the page for the user agents we identify as bots from Twitter, Facebook and crawlers:
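A condensed sketch of that middleware (the user-agent list and the renderSnapshot helper are assumptions; renderSnapshot stands in for whatever drives PhantomJS):

```javascript
// Middleware is just a (req, res, next) function, so Express itself is not
// needed to exercise the logic.
var BOT_PATTERN = new RegExp(
  ['googlebot', 'bingbot', 'facebookexternalhit', 'twitterbot', 'linkedinbot']
    .join('|'),
  'i'
);

function isBot(userAgent) {
  return BOT_PATTERN.test(userAgent || '');
}

function prerenderMiddleware(renderSnapshot) {
  return function (req, res, next) {
    if (!isBot(req.headers['user-agent'])) {
      return next(); // humans get the normal SPA
    }
    renderSnapshot(req.url, function (err, html) {
      if (err) return next(err);
      res.send(html); // bots get the pre-rendered snapshot
    });
  };
}
```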
A bit hackish, but it seems to work, enabling us to share customized links on social networks and increasing the value of the content on Proxbook. We tested this on Facebook, Twitter and LinkedIn, which are the most important platforms for us.
How does it work? If we are dealing with a bot request, Phantom fires up and loads the original URL. When Angular starts up, it fires all API requests as usual. Then a callback on each HTTP request checks whether there are any pending requests. If there are none, it fires yet another callback to Phantom (the Phantom browser exposes a callPhantom function on the window object) that tells it the rendering is complete, with a safety interval of 500 ms:
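The idle check can be sketched like this (in the real app the pending count comes from Angular’s $http.pendingRequests and the function runs on each response; here it is factored out so it can run standalone):

```javascript
var SAFETY_INTERVAL_MS = 500;

// Signals Phantom once there are no pending requests, after a grace period.
// Returns false if the page is still loading (caller should try again later).
function signalWhenIdle(getPendingCount, win, delayMs, setTimer) {
  setTimer = setTimer || setTimeout;
  if (getPendingCount() !== 0) return false;
  setTimer(function () {
    if (typeof win.callPhantom === 'function') {
      win.callPhantom({ status: 'ready' });
    }
  }, delayMs || SAFETY_INTERVAL_MS);
  return true;
}
```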
It is not necessary to use a callback to achieve this; one can also set a timeout of a couple of seconds and just assume that everything is rendered by then. The best approach would in that case be to load your site a number of times, take the average total loading time, and add a safety interval. I tried that as well, and it worked. But since Angular has good support for inspecting the number of pending HTTP requests, the callback approach seemed the more “correct” way of doing it.
Going forward, we may turn off this rendering for Googlebot (and / or others), as it already seems to understand our page without up-front rendering. Considering the significant performance penalty of rendering pages this way, it is obvious that you only want to use it where and when strictly necessary. You may also notice some duplication in the bot list; that was just to be safe, based on reading around the web and Stack Overflow. The server logs will in time show which user agents the crawlers actually use.
We also generated the sitemap server-side by calling the API and populating an XML file with all the content we wanted in the sitemap, using xmlbuilder-js and running all API calls in parallel to speed up generation.
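We used xmlbuilder-js for this; a dependency-free sketch of the output format looks like this (URLs are illustrative):

```javascript
// Builds a minimal sitemaps.org-compliant document from a list of URLs,
// which in practice would come from the parallel API calls.
function buildSitemap(urls) {
  var entries = urls.map(function (u) {
    return '  <url>\n    <loc>' + u + '</loc>\n  </url>';
  });
  return '<?xml version="1.0" encoding="UTF-8"?>\n' +
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
    entries.join('\n') + '\n</urlset>';
}
```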
Wrapping it up
So, the search engines understand our content as well as social networks. Goal achieved.
It was at times a nitty-gritty project; some tasks were pretty boring, and some were challenging and fun to work on. It was also a humbling learning experience. At the end of the day, I’m glad I’m not in the SEO business myself. It’s really, really hard work.
There are probably many of you with a lot more experience with this; this just sums up my experiences from working on it for a short period of time. There are many things that could be done better, for instance using Redis, in-memory or disk caching of the HTML snapshots for a certain amount of time to increase snappiness and maybe avoid the (if applicable) penalty on SEO ranking. The image conversion could also have been done directly on S3 without going through Django and changing the file names. But you live and learn, and it works OK, so why bother?
There is a lot of material available on this topic online, but I didn’t manage to find anything that was really up to date and covered both SEO and social media optimization for SPAs. I hope this post can be helpful to any of you who are struggling with this. As we figured out in this post, my solution isn’t too many lines of code in itself, but the journey to get there was long. And it probably never ends, as web technology changes from day to day.
If you want to pay for these kinds of services, you can check out these:
Everybody should have their own personal theme song
In my last post “Welcome to Unacastle” I described different ways we could greet someone who enters our
humble Unacastle. None of them are particularly elegant, ubiquitous or fun, so I wanted to find some other way to showcase
proximity based greetings.
The solution revealed itself after we had a meeting with a couple of the guys at Writeup. They are a Norwegian proximity-focused startup with the tagline:
Create your own proximity message service with iBeacons
They gave us one of their beacons to play around with (the white one to the right), and when I discovered that Writeup allows you to specify a webhook that fires upon interaction, I got an :bulb:
What if I could make our dashboard TV in the hallway play a personalized theme song when we get into the
office in the morning!
Make it work
The first thing I had to do was create some kind of service to receive the webhook triggered by the Writeup app. As I believe in polyglotting when developing side projects like this, I deployed a small Clojure/Compojure app on Heroku. When it gets triggered by the webhook, it looks like this. (INFO: Account id 96 is me triggering the beacon.)
The dashboard TV is powered by a Raspberry Pi, and I figured out that it can play mp3s through the TV’s speakers. So all I had to do now was write a small node.js app that connects to the Clojure app via a websocket, so it gets notified when the latter receives a beacon-interaction webhook call.
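A sketch of what that node.js app might look like (the song mapping, the websocket URL and the mpg123 player are assumptions):

```javascript
// Map Writeup account ids to theme songs; unknown visitors get a default.
var SONGS = {
  '96': 'songs/imperial-march.mp3',
  '42': 'songs/every-day-im-hustlin.mp3'
};
var DEFAULT_SONG = 'songs/my-heart-will-go-on.mp3';

function songFor(accountId) {
  return SONGS[String(accountId)] || DEFAULT_SONG;
}

// Wire-up (not exercised here): subscribe to the Clojure app's websocket and
// play the mapped song whenever a beacon interaction arrives.
function startPlayer() {
  var WebSocket = require('ws');          // npm package assumed
  var spawn = require('child_process').spawn;
  var socket = new WebSocket('wss://our-clojure-app.example/ws'); // assumption
  socket.on('message', function (raw) {
    var event = JSON.parse(raw);
    spawn('mpg123', [songFor(event.accountId)]);
  });
}
```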
It goes a little something like this (the Raspberry is a bit slow, hence the long asciinema)
The last thing we needed to sort out was what songs to play and how to map them to our ids. We ended up with a quite interesting collection of tunes, e.g:
:notes: Imperial March :notes:
:notes: Every day I’m Hustlin’ :notes:
:notes: My Heart Will Go On :notes:
Here’s Kjartan entering the Unacastle from the hallway.
And this is what Andreas’ personal theme song sounds (and looks) like.
The story of when I made a typo and fixed it while drinking beer.
Last Monday I was experimenting with some new log metrics on our platform. I wanted to make sure that all our alerting rules for errors in the log were correctly reported and published to our ops channel in Slack.
Specifically I wanted to make sure that if the queue that connects our outward API with our processing engine was inaccessible, the bells would start to ring.
Our API is designed to use a local backup-store if our queue experiences problems, and there is a batch job that regularly checks the backup-store and tries to re-send the data to the queue for further processing. This introduces a significant delay in our processing time, and is of course something we prefer to avoid, hence the aggressive alerting when this happens.
To test that the alerts really were triggered I did the simplest thing that I could think of, and introduced a deliberate error in the API so that it would try to push data to a non-existing queue.
I then deployed this to our staging environment and made sure that the proper alerts showed up when I tried to post something to the API.
Feeling confident, I deployed the branch to production and left the offices, intentionally leaving my mac behind.
Kubernetes, beers and errors in production
This evening we expected to receive a lot of new data from a partner that had recently integrated with us, so while enjoying both the Kubernetes talk and a beer, I decided to log into our admin console from my phone and check how much data we had received.
To my surprise, we had not received any new data in the last few hours. I felt a bit uneasy and tried to think of possible explanations. I checked the production logs and was blown away by a mountain of errors.
I started cold-sweating and pinged the other engineers, but none responded. Seemed like I was on my own. I desperately searched the room to see if there was anyone I knew who had a computer with them. No luck. I downed the remaining glass of beer in pure panic. It did not help.
I could not understand why these errors had not been posted to our dedicated Slack channel, our API had been pumping out errors for several hours, and nothing had reached our data-store. This was starting to become pretty serious.
Upon inspecting the logs (from my phone) in more detail I managed to find the culprit. Seemed like our queue was not responding. I thought that was strange, I had just worked on that earlier, remember?
Then it hit me, like a ton of bricks, the API was trying to push to a queue that did not exist.
I had screwed up when I tested our alerts. I had screwed up badly.
Everything to the rescue
I logged in to GitHub and looked at the last commits, after a whole lot of scrolling and pinching I found the violating line of code. It certainly was my fault, we do not have any queues ending with “foobar”.
One of the great things about GitHub is that you can actually edit your source code from a browser. It’s pretty practical for editing READMEs and such, but for editing Java code it is not exactly a direct competitor to IntelliJ or Eclipse.
Anyway, after a bit of struggling I managed to correct the offending code and commit it in a new pull request through the web UI.
I opened the Slack app, and waited for CircleCI to report a green build. It did so after a few minutes, and I typed unabot deploy api/fix-stupid-error to production.
After a few moments of waiting for Heaven to do its work I checked the logs again, and was no longer met with errors. The bug had been squashed! Victory!
For me this incident, and how we were able to resolve it, proves the importance of automation in general, and ChatOps in particular. Being able to correct an error from my phone while on the move justifies the up-front cost of making it possible. It also makes it easier for us to experiment and try new stuff, when we know that the consequence of failure is mitigated by our tooling.
With a solid deployment and monitoring solution we are able to roll forward or backward at the whim of a few keystrokes in Slack.
This incident also taught us something very important: nothing is validated unless it is validated in production. All stages before production only increase our confidence by a small percentage. Full confidence only comes when the code is running with real data in the only environment that matters: production.
The real mistake I made was not committing faulty code with a non-existent queue name. My sin was that I did not test the alerting policy in production as well. As you might have understood by now, the alerting policy did not do what it was supposed to do. It was actually a very small mistake: the alerting policy was set up correctly, but one small (and very important) thing was missing - the actual integration with Slack. An alert was raised, but it was propagated to no one.
To conclude my learnings from this incident, a couple of things stand out.
Investing in automation pays off. The up-front cost can be high, but over time it is worth it.
Everything must be validated in production.
If you are interested in learning more about ChatOps, have a look at our previous blog post ChatOps @ Unacast.
So, it is yet again time for a new blog post. This time I will focus on a tool I built to showcase the power of beacons and how they can work with advertising to give users relevant communication as they go along.
I give you, the Unacast demo tool. The goal was really to build a flexible tool to showcase the essence of what Unacast is doing, in a user-friendly way that can be handled by our commercial team and also partners of Unacast that bundle our services.
This was a project that was mainly handled by the engineering team, after we experienced in sales meetings that having a demo or visual presentation would be valuable. There were very few strict requirements; it is always difficult to convey that you need something when you are not sure exactly what it is.
Carving out initial requirements
However, I chiseled out some of the requirements to fulfill, based on my own experiences in sales meetings as well as feedback from “the suits”:
Flexible setup and configuration to handle different showcases and pitches
Adding / removing beacons
Adding / removing categories (i.e. Food&Beverage, etc) and attaching these to beacons
Adding / removing ads and attaching these to categories
An application that is installed on the user device that “picks up” beacons
Creating and populating user profiles based on device id
Displaying an ad on the user device and changing it in real time, based on beacon interactions
Realtime view of user profile as the device interacts with beacons
A pleasant UI to visualize the data and provide easy setup
Creating a data model
Based on this, I started iterating on an MVP. The data model and flow was roughly like this:
An app is the domain, or the bucket to which data belongs
A user is identified by the IDFA of his / her device
A user interacts with beacons and the app sends info about this to the backend, which saves a new interaction with info about the app, the user and the beacon encountered
A beacon has one or more tags (Sports, Food, Travel, etc)
When the user interacts with a beacon, this interaction is saved (as stated above). Also, an advertisement is shown in the app. This advertisement has a tag of the same type as the beacon. So you guessed right, a “sports” beacon will make a “sports” advertisement appear in the user’s app.
Lastly, when a user opens the app out of range of a beacon, the app shows an ad matching the tag of the beacon type the user has interacted with the most. If no previous data is available for the user, a random advertisement is served.
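The flow above can be sketched in a few lines (class and field names are assumptions, not the actual implementation):

```python
import random
from collections import Counter

class Beacon:
    def __init__(self, uuid, tags):
        self.uuid = uuid
        self.tags = tags  # e.g. ["sports"]

class User:
    def __init__(self, idfa):
        self.idfa = idfa
        self.interactions = []  # beacons this user has encountered

    def interact(self, beacon):
        self.interactions.append(beacon)

    def top_tag(self):
        """The tag the user has interacted with the most, or None."""
        counts = Counter(tag for b in self.interactions for tag in b.tags)
        return counts.most_common(1)[0][0] if counts else None

def pick_ad(user, ads_by_tag):
    """Serve the ad for the user's dominant tag, or a random one."""
    tag = user.top_tag()
    if tag in ads_by_tag:
        return ads_by_tag[tag]
    return random.choice(list(ads_by_tag.values()))
```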
Getting the hands dirty / Taking a first stab
Much can be said about this stage, but pictures are much more telling than words, so I will just say that it ended up something like this:
First, a couple of beacons are needed. These are pretty cool. Stick a couple of USB-beacons in a couple of emergency chargers, and you have a nice setup for testing beacons in a controlled environment.
This screen lets the user edit basic information about the demo, such as name, description, beacon UUIDs to use, as well as the page that the app should embed (the app is very simple, just displaying the contents of a user-specified URL with an ad on top)
The user profile gives us a real-time view of an end user as he or she journeys through a world of beacons.
The box labeled “computed user profile” shows the user’s preferences in a visual manner that updates automatically.
Here, the user can create their own tags to use.
Here, the user can add entities (we just used that name instead of beacons to have a generic model for all kinds of proximity tech, such as NFC) and assign tags to them.
Lastly, this section enables the user to create advertisements that are connected to the tags, completing the circle and also fulfilling the
requirements that we outlined above.
This is a screenshot from the iOS application that was developed, and shows the wikipedia frontpage (as stated, you can embed any type of web content, even create your own custom page) and a Coke app on top.
Wrapping it up
Obviously, this is a very crude example, and some of our partners have advanced capabilities that go beyond this example. We do however, encourage Proximity Solution Providers and partners to think about this. It is a small price to pay to implement and it will increase the value of the data in the future. It is also important to keep in mind that it is possible to just store this data and apply analytics and scientific methods at a later stage. As Unacast is a data first company we cannot stress enough the importance of this.
PS: This also highlights an important part of our commercial and engineering culture and how they play together. Rather than placing an “order” with the tech team, the tech team was given great room to navigate and explore. This obviously only works if there is a high level of trust and transparency - essential components as Unacast scales up the organization across all professional disciplines.
PS#2: If you are wondering about the technology we used for this project, please leave a comment. If you think this was extremely interesting, we are also hiring, so feel free to reach out.