While we always work with scale in mind, our platform has been less reliable over the last few months than our customers have come to expect. balenaCloud has experienced an explosion of growth that has put the platform under significant pressure and exposed areas that require improvement.
We want to apologize to our customers for the impact these outages have had and explain our plan going forward.
Resolving this is our highest priority as a company. At the moment, we are focused on rolling out incremental changes that improve the reliability of the platform without introducing more instability. While we continue to make those improvements, we want to share the causes of these outages in more detail, as well as the actions we have taken so far and our near-term plans for further scaling and improvement.
What’s causing these outages?
*NOTE: The following information has been summarized here for ease of understanding. Once we fully resolve these issues, we plan on releasing a full post-mortem for public view and feedback. You can also view our incident history for more information.*
The instability we’ve been experiencing is due to rapid customer growth on our platform. While we had been preparing for this level of growth, it arrived differently from how we’d anticipated, and the huge influx of simultaneous requests to our API caused the set of failures that our customers have been experiencing.
How we stabilized the situation
So far, we’ve made the following improvements to stabilize the platform:
- Improved query efficiency - By analyzing our most costly queries, we identified several improvements to the models and to how they are queried. Additionally, removing excessive table locks improved throughput by reducing CPU waits for other queries (a sketch of this pattern appears after this list).
- Mitigated replication lag - A core contributor was heavy writes and reads hitting the same tables on the master node, which caused large replication lag in the Postgres cluster. We mitigated this by upgrading the master node's hardware and moving reads to it until we can restructure the models to avoid heavy writes and reads on the same tables. Replication lag is now gone, and the master node has safe headroom even with the added read load, thanks to the new caching layers described below.
- Increased caching - We added caching for OpenVPN auth queries, permission queries, image resolution, and device location resolution, and added an index to optimize device API key permission lookups (see the caching sketch after this list).
- Reduced and spaced demand - We reduced the frequency of device health metric reporting, added pre-checks for device state to avoid unnecessary updates, and introduced spacing of device requests, with additional spacing of delta requests under heavy load (see the request-spacing sketch after this list).
- Optimized the way device log connections are initialized - We removed redundant work and overhead during initialization.
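To make the table-locking change more concrete, here is a minimal sketch of the general pattern, not our actual code: the table and column names are hypothetical, and it assumes the node-postgres (`pg`) client. Instead of taking an exclusive lock on a whole table, a transaction locks only the rows it needs, so other queries against the same table are not forced to wait.

```typescript
import { Pool } from "pg"; // node-postgres

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Hypothetical example: marking a device's target state as applied.
// A coarse LOCK TABLE device IN EXCLUSIVE MODE would block every other query
// touching the table; SELECT ... FOR UPDATE locks only the row we need.
async function markStateApplied(deviceUuid: string): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");
    // Lock just the row we are about to update.
    await client.query("SELECT id FROM device WHERE uuid = $1 FOR UPDATE", [
      deviceUuid,
    ]);
    await client.query(
      "UPDATE device SET state_applied = NOW() WHERE uuid = $1",
      [deviceUuid],
    );
    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```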
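The caching change can be pictured as a small time-to-live (TTL) cache in front of a permission lookup. This is only a sketch with assumed names and values (the lookup function and the 30-second TTL are illustrative, not our production configuration), but it shows the idea: repeated auth and permission queries for the same key are answered from memory instead of hitting the database every time.

```typescript
// Minimal TTL cache sketch; names and TTL are illustrative only.
type Entry<T> = { value: T; expiresAt: number };

class TtlCache<T> {
  private entries = new Map<string, Entry<T>>();
  constructor(private ttlMs: number) {}

  async getOrCompute(key: string, compute: () => Promise<T>): Promise<T> {
    const hit = this.entries.get(key);
    if (hit && hit.expiresAt > Date.now()) {
      return hit.value; // served from memory, no database round trip
    }
    const value = await compute();
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    return value;
  }
}

// Stand-in for the real database query; illustrative only.
async function lookupPermissionsFromDb(apiKey: string): Promise<string[]> {
  // ...in production this would be a query against the API database
  return ["device.read", "device.write"];
}

const permissionCache = new TtlCache<string[]>(30_000); // 30s TTL (illustrative)

export function getPermissions(apiKey: string): Promise<string[]> {
  return permissionCache.getOrCompute(apiKey, () =>
    lookupPermissionsFromDb(apiKey),
  );
}
```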
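Spacing device requests amounts to adding randomized jitter, plus extra backoff under load, to each device's polling interval, so that thousands of devices do not hit the API at the same instant. The sketch below uses illustrative intervals that are not the device supervisor's real configuration.

```typescript
// Illustrative intervals only; not the actual device configuration.
const BASE_POLL_INTERVAL_MS = 10 * 60 * 1000; // e.g. check target state every 10 minutes
const JITTER_MS = 60 * 1000;                  // spread requests over an extra minute

// Each device waits a slightly different amount of time, so a large fleet's
// requests arrive spread out rather than as one synchronized burst.
function nextPollDelay(serverBusy: boolean): number {
  const jitter = Math.random() * JITTER_MS;
  const backoffFactor = serverBusy ? 2 : 1; // space out further under heavy load
  return BASE_POLL_INTERVAL_MS * backoffFactor + jitter;
}

async function pollLoop(
  fetchTargetState: () => Promise<{ serverBusy: boolean }>,
): Promise<void> {
  for (;;) {
    let serverBusy = false;
    try {
      serverBusy = (await fetchTargetState()).serverBusy;
    } catch {
      serverBusy = true; // back off further when the request itself fails
    }
    await new Promise((resolve) => setTimeout(resolve, nextPollDelay(serverBusy)));
  }
}
```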
Although the above improvements have increased the stability of the platform, it still occasionally runs into performance issues. To continue tackling this, the following improvements are underway.
Our team is planning a migration to AWS Aurora PostgreSQL to substantially scale API capacity. Currently, balenaCloud uses the PostgreSQL database engine within Amazon Relational Database Service (RDS). Among other improvements, Aurora does not use the same replication mechanism as standard PostgreSQL, so the replication lag we have been experiencing is effectively eliminated.
We have already upgraded to Aurora in our staging environment and will use what we learn there to inform the migration of our production environment.
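As a rough sketch of why Aurora helps here (the hostnames and pool sizes below are placeholders, not our real endpoints or settings): an Aurora cluster exposes a writer endpoint and a reader endpoint backed by shared storage, so read traffic can be pointed at replicas largely without the replication lag we hit with standard PostgreSQL replicas.

```typescript
import { Pool } from "pg";

// Placeholder endpoints and settings; an Aurora cluster provides a single
// writer endpoint and a reader endpoint that load-balances across replicas.
const writer = new Pool({
  host: "my-cluster.cluster-abc123.eu-west-1.rds.amazonaws.com",
  database: "api",
  max: 20,
});

const reader = new Pool({
  host: "my-cluster.cluster-ro-abc123.eu-west-1.rds.amazonaws.com",
  database: "api",
  max: 50,
});

// Writes go to the writer endpoint; read-heavy queries (device state,
// permission lookups, etc.) can be served by the reader endpoint, because
// Aurora replicas read from the same storage volume as the writer.
export const db = {
  write: (sql: string, params?: unknown[]) => writer.query(sql, params),
  read: (sql: string, params?: unknown[]) => reader.query(sql, params),
};
```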
NOTE: We will schedule downtime for this migration
This migration will require scheduled downtime for our platform, and we will make sure all customers are notified well in advance. As of the time of writing, we have not yet set a migration date, but we are working on it.
As always, keep an eye on our status page for the announced migration date. Our goal is to prepare our team for this migration and to choose a time when the downtime will have the least impact on our customers.
Additional upcoming plans
- Improve geocoding, particularly for device updates, reducing the impact of detecting and storing new IP addresses.
- Investigate additional API rate limiting where needed, without reducing functionality or affecting other users (see the rate-limiting sketch after this list).
- Evaluate device metric architecture to identify efficiency improvements.
- Implement webhooks, both as a customer-requested feature and as a mechanism for reducing API load by reducing the need for state polling (see the webhook sketch after this list).
- Improve the speed and quality of communication to customers about downtime of balena platform services - we acknowledge that the faster we notify users of issues, the sooner the consequences can be mitigated.
- Provide a written post-mortem for customer review once the migration is complete and the related issues are resolved.
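To make the rate-limiting item more concrete, the sketch below shows one common approach, a per-key token bucket. The limits and key names are illustrative only, not a committed design.

```typescript
// Per-API-key token bucket; limits are illustrative, not a committed design.
interface Bucket {
  tokens: number;
  lastRefill: number;
}

class RateLimiter {
  private buckets = new Map<string, Bucket>();
  constructor(private capacity: number, private refillPerSecond: number) {}

  // Returns true if the request may proceed, false if it should get HTTP 429.
  allow(apiKey: string): boolean {
    const now = Date.now();
    const bucket =
      this.buckets.get(apiKey) ?? { tokens: this.capacity, lastRefill: now };
    // Refill proportionally to the time elapsed since the last request.
    const elapsedSeconds = (now - bucket.lastRefill) / 1000;
    bucket.tokens = Math.min(
      this.capacity,
      bucket.tokens + elapsedSeconds * this.refillPerSecond,
    );
    bucket.lastRefill = now;
    if (bucket.tokens < 1) {
      this.buckets.set(apiKey, bucket);
      return false;
    }
    bucket.tokens -= 1;
    this.buckets.set(apiKey, bucket);
    return true;
  }
}

// Example: allow bursts of 100 requests, refilling at 10 requests/second per key.
const limiter = new RateLimiter(100, 10);
```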
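And for the webhooks item, the idea is that instead of an integration repeatedly polling our API for device state, the platform pushes an event when something changes. Below is a minimal receiver sketch using only Node built-ins; the event shape, signature header, and shared secret are hypothetical, not a finalized interface.

```typescript
import { createServer } from "node:http";
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical shared secret and signature header; not a finalized design.
const WEBHOOK_SECRET = process.env.WEBHOOK_SECRET ?? "example-secret";

createServer((req, res) => {
  const chunks: Buffer[] = [];
  req.on("data", (chunk) => chunks.push(chunk));
  req.on("end", () => {
    const body = Buffer.concat(chunks);
    // Verify the payload really came from the platform before trusting it.
    const expected = createHmac("sha256", WEBHOOK_SECRET).update(body).digest();
    const received = Buffer.from(String(req.headers["x-signature"] ?? ""), "hex");
    if (received.length !== expected.length || !timingSafeEqual(received, expected)) {
      res.writeHead(401).end();
      return;
    }
    const event = JSON.parse(body.toString()); // e.g. { deviceUuid, status }
    console.log("device state changed:", event);
    res.writeHead(204).end(); // acknowledge the push instead of polling for changes
  });
}).listen(8080);
```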
We thank you all for your patience so far. While we work toward full resolution of these incidents, we want to keep all of our users fully informed.
While we will continue to monitor our backend systems and alert our users to any incidents that can cause disruption, we encourage our users to raise new support tickets if they are experiencing any issues with the balena platform. The more data we have on how these issues are affecting our customers, the better we can mitigate them and improve our services for the future.
Thank you for your continued patience as we work to resolve these issues. Please reach out to your Customer Success Manager with any questions, concerns, or feedback, or leave a comment below.
You can find out what other features we have in progress, let us know what you'd like to see us develop, and upvote features on our public roadmap.