While we always work with scale in mind, our platform has been less reliable over the last few months than our customers have come to expect. balenaCloud has experienced an explosion of growth that has put the platform under significant pressure and exposed areas that require improvement.
We want to apologize to our customers for the impact these outages have had and explain our plan going forward.
Resolving this is our highest priority as a company. At the moment, we are focused on rolling out incremental changes that improve the reliability of the platform without introducing more instability. While we continue to make those improvements, we want to share the causes of these outages in more detail, as well as the actions we have taken so far and our near-term plans for further scaling and improvement.
What’s causing these outages?
*NOTE: The following information has been summarized here for ease of understanding. Once we fully resolve these issues, we plan on releasing a full post-mortem for public view and feedback. You can also view our incident history for more information.*
The instability we’ve been experiencing is due to rapid customer growth on our platform. While we had been preparing for this level of growth, it arrived differently from how we’d anticipated, and the huge influx of simultaneous requests to our API caused the set of failures that our customers have been experiencing.
How we stabilized the situation
So far, we’ve made the following improvements to stabilize the platform:
- Improved query efficiency - By analyzing our most costly queries, we identified several improvements to the models and to how they are queried. Additionally, removing excessive table locks improved throughput by reducing CPU waits for other queries (a sketch of this pattern appears after this list).
- Mitigated replication lag - A core contributor was heavy writes and reads hitting the same tables on the master node, which caused large replication lag in the Postgres cluster. We mitigated this by upgrading the master node's hardware and moving reads to it until we can restructure the models to avoid heavy writes and reads on the same tables. Replication lag is now gone, and the master node has safe headroom even with the added read load, thanks to the new caching layers described below.
- Increased caching - We added caching for OpenVPN auth queries, permission queries, image resolution, and device location resolution, and added an index to optimize device API key permission lookups (see the caching sketch after this list).
- Reduced and spaced demand - We reduced the frequency of device health metric reporting, added pre-checks for device state to avoid unnecessary updates, and introduced spacing of device requests, with additional spacing of delta requests under heavy load (see the request-spacing sketch after this list).
- Optimized the way device log connections are initialized - We removed redundant work and overhead during initialization.
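To make the table-locking change more concrete, here is a minimal sketch of the general pattern, not our actual code: the table and column names are hypothetical, and it assumes the node-postgres (`pg`) client. Instead of taking an exclusive lock on a whole table, a transaction locks only the rows it needs, so other queries against the same table are not forced to wait.

```typescript
import { Pool } from "pg"; // node-postgres

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Hypothetical example: marking a device's target state as applied.
// A coarse LOCK TABLE device IN EXCLUSIVE MODE would block every other query
// touching the table; SELECT ... FOR UPDATE locks only the row we need.
async function markStateApplied(deviceUuid: string): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    // Lock just the row we are about to update.
    await client.query("SELECT id FROM device WHERE uuid = $1 FOR UPDATE", [
      deviceUuid,
    ]);
    await client.query(
      "UPDATE device SET state_applied = NOW() WHERE uuid = $1",
      [deviceUuid],
    );
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```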
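The caching change can be pictured as a small time-to-live (TTL) cache in front of a permission lookup. This is only a sketch with assumed names and values (the lookup function and the 30-second TTL are illustrative, not our production configuration), but it shows the idea: repeated auth and permission queries for the same key are answered from memory instead of hitting the database every time.

```typescript
// Minimal TTL cache sketch; names and TTL are illustrative only.
type Entry<T> = { value: T; expiresAt: number };

class TtlCache<T> {
  private entries = new Map<string, Entry<T>>();
  constructor(private ttlMs: number) {}

  async getOrCompute(key: string, compute: () => Promise<T>): Promise<T> {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > Date.now()) {
      return hit.value; // served from memory, no database round trip
    }
    const value = await compute();
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    return value;
  }
}

// Stand-in for the real database query; illustrative only.
async function lookupPermissionsFromDb(apiKey: string): Promise<string[]> {
  // ...in production this would be a query against the API database
  return ["device.read", "device.write"];
}

const permissionCache = new TtlCache<string[]>(30_000); // 30s TTL (illustrative)

export function getPermissions(apiKey: string): Promise<string[]> {
  return permissionCache.getOrCompute(apiKey, () =>
    lookupPermissionsFromDb(apiKey),
  );
}
```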
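Spacing device requests amounts to adding randomized jitter, plus extra backoff under load, to each device's polling interval, so that thousands of devices do not hit the API at the same instant. The sketch below uses illustrative intervals that are not the device supervisor's real configuration.

```typescript
// Illustrative intervals only; not the actual device configuration.
const BASE_POLL_INTERVAL_MS = 10 * 60 * 1000; // e.g. check target state every 10 minutes
const JITTER_MS = 60 * 1000;                  // spread requests over an extra minute

// Each device waits a slightly different amount of time, so a large fleet's
// requests arrive spread out rather than as one synchronized burst.
function nextPollDelay(serverBusy: boolean): number {
  const jitter = Math.random() * JITTER_MS;
  const backoffFactor = serverBusy ? 2 : 1; // space out further under heavy load
  return BASE_POLL_INTERVAL_MS * backoffFactor + jitter;
}

async function pollLoop(
  fetchTargetState: () => Promise<{ serverBusy: boolean }>,
): Promise<void> {
  for (;;) {
    let serverBusy = false;
    try {
      serverBusy = (await fetchTargetState()).serverBusy;
    } catch {
      serverBusy = true; // back off further when the request itself fails
    }
    await new Promise((resolve) => setTimeout(resolve, nextPollDelay(serverBusy)));
  }
}
```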
Although the above improvements have increased the stability of the platform, it still occasionally runs into performance issues. To continue tackling this, the following improvements are underway.
Our team is planning a migration to AWS Aurora PostgreSQL to substantially scale API capacity. Currently, balenaCloud uses the PostgreSQL database engine within Amazon Relational Database Service (RDS). Among other improvements, Aurora does not use the same replication mechanism as standard PostgreSQL, so the replication lag we have been experiencing is effectively eliminated.
We have already upgraded to Aurora in our staging environment and will use what we learn there to inform the migration of our production environment.
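As a rough sketch of why Aurora helps here (the hostnames and pool sizes below are placeholders, not our real endpoints or settings): an Aurora cluster exposes a writer endpoint and a reader endpoint backed by shared storage, so read traffic can be pointed at replicas largely without the replication lag we hit with standard PostgreSQL replicas.

```typescript
import { Pool } from "pg";

// Placeholder endpoints and settings; an Aurora cluster provides a single
// writer endpoint and a reader endpoint that load-balances across replicas.
const writer = new Pool({
  host: "my-cluster.cluster-abc123.eu-west-1.rds.amazonaws.com",
  database: "api",
  max: 20,
});

const reader = new Pool({
  host: "my-cluster.cluster-ro-abc123.eu-west-1.rds.amazonaws.com",
  database: "api",
  max: 50,
});

// Writes go to the writer endpoint; read-heavy queries (device state,
// permission lookups, etc.) can be served by the reader endpoint, because
// Aurora replicas read from the same storage volume as the writer.
export const db = {
  write: (sql: string, params?: unknown[]) => writer.query(sql, params),
  read: (sql: string, params?: unknown[]) => reader.query(sql, params),
};
```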
NOTE: We will schedule downtime for this migration
This migration will require scheduled downtime for our platform, and we will make sure all customers are notified well in advance. As of the time of writing, we have not yet set a migration date, but we are working on it.
As always, keep an eye on our status page for the announced migration date. Our goal is to prepare our team for this migration and to choose a time when the downtime will have the least impact on our customers.
Additional upcoming plans
- Improve geocoding, particularly for device updates, reducing the impact of detecting and storing new IP addresses.
- Investigate additional API rate limiting where needed, without reducing functionality or affecting other users (see the rate-limiting sketch after this list).
- Evaluate device metric architecture to identify efficiency improvements.
- Implement webhooks, both as a customer-requested feature and as a mechanism for reducing API load by reducing the need for state polling (see the webhook sketch after this list).
- Improve the speed and quality of communication to customers about downtime of balena platform services - we acknowledge that the faster we notify users of issues, the sooner the consequences can be mitigated.
- Provide a written post-mortem for customer review once the migration is complete and the related issues are resolved.
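To make the rate-limiting item more concrete, the sketch below shows one common approach, a per-key token bucket. The limits and key names are illustrative only, not a committed design.

```typescript
// Per-API-key token bucket; limits are illustrative, not a committed design.
interface Bucket {
  tokens: number;
  lastRefill: number;
}

class RateLimiter {
  private buckets = new Map<string, Bucket>();
  constructor(private capacity: number, private refillPerSecond: number) {}

  // Returns true if the request may proceed, false if it should get HTTP 429.
  allow(apiKey: string): boolean {
    const now = Date.now();
    const bucket =
      this.buckets.get(apiKey) ?? { tokens: this.capacity, lastRefill: now };
    // Refill proportionally to the time elapsed since the last request.
    const elapsedSeconds = (now - bucket.lastRefill) / 1000;
    bucket.tokens = Math.min(
      this.capacity,
      bucket.tokens + elapsedSeconds * this.refillPerSecond,
    );
    bucket.lastRefill = now;
    if (bucket.tokens < 1) {
      this.buckets.set(apiKey, bucket);
      return false;
    }
    bucket.tokens -= 1;
    this.buckets.set(apiKey, bucket);
    return true;
  }
}

// Example: allow bursts of 100 requests, refilling at 10 requests/second per key.
const limiter = new RateLimiter(100, 10);
```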
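And for the webhooks item, the idea is that instead of an integration repeatedly polling our API for device state, the platform pushes an event when something changes. Below is a minimal receiver sketch using only Node built-ins; the event shape, signature header, and shared secret are hypothetical, not a finalized interface.

```typescript
import { createServer } from "node:http";
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical shared secret and signature header; not a finalized design.
const WEBHOOK_SECRET = process.env.WEBHOOK_SECRET ?? "example-secret";

createServer((req, res) => {
  const chunks: Buffer[] = [];
  req.on("data", (chunk) => chunks.push(chunk));
  req.on("end", () => {
    const body = Buffer.concat(chunks);
    // Verify the payload really came from the platform before trusting it.
    const expected = createHmac("sha256", WEBHOOK_SECRET).update(body).digest();
    const received = Buffer.from(String(req.headers["x-signature"] ?? ""), "hex");
    if (received.length !== expected.length || !timingSafeEqual(received, expected)) {
      res.writeHead(401).end();
      return;
    }
    const event = JSON.parse(body.toString()); // e.g. { deviceUuid, status }
    console.log("device state changed:", event);
    res.writeHead(204).end(); // acknowledge the push instead of polling for changes
  });
}).listen(8080);
```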
We thank you all for your patience so far. While we work toward full resolution of these incidents, we want to keep all of our users fully informed.
While we will continue to monitor our backend systems and alert our users to any incidents that can cause disruption, we encourage our users to raise new support tickets if they are experiencing any issues with the balena platform. The more data we have on how these issues are affecting our customers, the better we can mitigate them and improve our services for the future.
Thank you for your continued patience as we work to resolve these issues. Please reach out to your Customer Success Manager with any questions, concerns, or feedback, or leave a comment below.
You can find out what other features we have in progress, let us know what you'd like to see us develop, and upvote features on our public roadmap.