Over the past few months we’ve been hard at work on version 2.0 of Ondat (previously StorageOS), our distributed, persistent storage layer for Kubernetes. We started thinking about this release a while ago, based on conversations with our customers, and observations on the Kubernetes landscape in general. In particular, we see two important emerging trends.
Firstly, we see customers deploying much bigger clusters as they gain experience with Kubernetes and push more workloads to their clusters. Bigger clusters present several engineering challenges. As the number of nodes within the cluster scales, distributed consensus becomes a progressively harder problem to solve. At the same time, bigger clusters typically suffer from more transient failures, such as overloaded machines and network interruptions. These all require robust error handling to maintain a stable production environment.
Secondly, we see the desire to deploy multiple clusters in varying topologies and consume storage across cluster boundaries. One pattern we see commonly is the desire to run a centralised production storage cluster and mount storage from satellite development clusters. Another is to replicate data between datacentres, either synchronously or asynchronously.
We’ve built StorageOS 2.0 with these observations in mind. So what’s in the box?
Upgraded Control Plane
The control plane is the ‘brain’ of the product. It is responsible for orchestrating cluster operations such as volume placement, node failure detection, and so on. We’ve made several significant enhancements to the control plane.
Firstly, and most importantly, the way we model and schedule volume operations has changed. In StorageOS version 1 we coordinated all operations in the cluster through a node elected to act as a central scheduler. In version 2 we wanted to reduce reliance on that central role, both for reasons of scale and because a centralised role made the handling of network partitions significantly more error-prone (is the master partitioned from the scheduler, or from another replica, or is the scheduler partitioned from etcd?). In version 2, each volume group (defined as a master and zero or more replicas) can act autonomously from the rest of the cluster, reasoning about node failure and maintaining strong consistency via locks in etcd. This architecture is far more scalable and resilient to short network interruptions.
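To make the idea of per-volume locking concrete, here is a minimal sketch of a lease held in a key-value store. It is an illustration under stated assumptions, not the actual StorageOS implementation: `FakeKV` is an in-memory stand-in for etcd, and the key layout, the `Lease` type and `try_acquire_master` are hypothetical names invented for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Lease:
    holder: str        # node currently acting as master for the volume
    expires_at: float  # time after which the lease counts as stale


class FakeKV:
    """In-memory stand-in for etcd (assumption: the real store does this atomically)."""

    def __init__(self) -> None:
        self._data: Dict[str, Lease] = {}

    def get(self, key: str) -> Optional[Lease]:
        return self._data.get(key)

    def compare_and_swap(self, key: str, expected: Optional[Lease], new: Lease) -> bool:
        # Succeed only if the stored value is still the one the caller observed.
        if self._data.get(key) != expected:
            return False
        self._data[key] = new
        return True


def try_acquire_master(kv: FakeKV, volume: str, node: str, ttl: float, now: float) -> bool:
    """Try to become (or remain) master of one volume group by holding its lease."""
    key = f"volumes/{volume}/master"  # hypothetical key layout
    current = kv.get(key)
    if current is not None and current.holder != node and current.expires_at > now:
        return False  # another node holds a live lease, so do not promote
    return kv.compare_and_swap(key, current, Lease(node, now + ttl))
```

Each volume group only ever touches its own key, so a decision about one volume does not have to pass through a cluster-wide scheduler. A replica can promote itself only once the master's lease has lapsed, and the compare-and-swap ensures that two replicas racing for the same lease cannot both win.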
To augment our internal testing, we’ve also modelled our internal state machine in TLA+, a formal specification language. This gives us far more confidence in our behaviour under extreme and difficult-to-test circumstances such as very large clusters or odd network partitions.
We now protect StorageOS endpoints with mutually authenticated TLS. This brings two benefits. Firstly, all traffic on the wire is encrypted, preventing eavesdropping. Secondly, all endpoints are authenticated using X.509 certificates. This prevents all manner of malign activity, both intentional and unintentional. In today’s complex environments, we can no longer rely on firewalls to keep out bad actors, so having a storage layer that is secure by default brings defense-in-depth to our clusters.
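For readers less familiar with mutual TLS, the sketch below shows the general shape of a server-side configuration using Python's standard `ssl` module. It illustrates the technique only and says nothing about how StorageOS configures its own endpoints. The function name and its file-path parameters are assumptions made for this example.

```python
import ssl
from typing import Optional


def make_mutual_tls_server_context(
    ca_file: Optional[str] = None,
    cert_file: Optional[str] = None,
    key_file: Optional[str] = None,
) -> ssl.SSLContext:
    """Build a server context that refuses clients without a valid certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # CERT_REQUIRED is what makes the authentication mutual: the handshake
    # fails unless the client presents a certificate the server can verify.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_file:
        # Trust only certificates issued by this CA (for example a cluster CA).
        ctx.load_verify_locations(cafile=ca_file)
    if cert_file:
        # The server's own identity, presented to connecting clients.
        ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return ctx
```

In real use all three paths would be supplied. They are optional here only so that the sketch can be exercised without certificate files on disk.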
While we recognise that TLS is a good thing, managing a CA can be challenging. StorageOS 2.0 ships with an internal CA to manage all that complexity for you. It’s fully automatic by default. In later editions we’ll add the ability to plug in your own CA, for integration into larger environments.
Finally, we’ve made some smaller operational enhancements that are worth calling out.
We’ve significantly upgraded our logging in this release. All log messages are now decorated with rich context (such as volume name, node name, and so on), to make it easy to reason about cluster events, even as they occur across different nodes. We take observability seriously at Ondat, and spend a lot of time thinking about how to provide better visibility into our product’s operation. Many of us have run production systems ourselves in stressful environments such as the finance sector, and know the pain of the 3am wake-up call.
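As a rough illustration of context-decorated logging, the sketch below attaches a volume and node name to every message using Python's standard `logging.LoggerAdapter`. The logger name, field names and output format are assumptions chosen for this example and do not reflect the actual StorageOS log format.

```python
import io
import logging


def make_context_logger(stream, volume: str, node: str) -> logging.LoggerAdapter:
    """Return a logger that stamps every record with the given volume and node."""
    logger = logging.getLogger("storage-sketch")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()  # keep the sketch idempotent across calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s volume=%(volume)s node=%(node)s msg=%(message)s"))
    logger.addHandler(handler)
    # The adapter injects the context as `extra` on each logging call.
    return logging.LoggerAdapter(logger, {"volume": volume, "node": node})


buf = io.StringIO()
log = make_context_logger(buf, volume="pvc-1234", node="node-a")
log.info("replica caught up")
```

Because every line carries the same keys, events for one volume can be filtered and correlated even when they were emitted on different nodes.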
Our usage of etcd is vastly improved. We make fewer queries and store less data, resulting in more manageable etcd instances, and we are more tolerant of transient etcd outages.
These sorts of improvements are important in making the product easier to administer and less impactful on the surrounding environment.
Upgraded Data Plane
The data plane is the engine of the product, responsible for moving data between disk and your application. It is a point of pride for us that our storage engine is written from the ground up – giving us a lot of flexibility that we ultimately pass to you, our customer.
In StorageOS version 2 we’ve completely re-written our sync engine, used when seeding or catching up replicas that have been offline or partitioned. The new algorithm, which we call Delta Sync, uses a Hash List to determine which blocks within a volume differ between the master and a replica, and only syncs the changed blocks. This is somewhat similar to what rsync does, but we maintain the Hash Lists during normal runtime operation, so we don’t need to perform lots of expensive IO during sync operations.
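The hash-list idea can be sketched in a few lines. This is a simplified model built on assumptions, not the Delta Sync implementation itself: the block size, the choice of SHA-256, and the `Volume` and `delta_sync` names are all invented for illustration.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4  # tiny for illustration; a real block size would be far larger


def block_hash(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()


class Volume:
    """A toy block device that keeps its hash list current on every write."""

    def __init__(self, num_blocks: int) -> None:
        self.blocks: List[bytes] = [bytes(BLOCK_SIZE)] * num_blocks
        self.hashes: List[bytes] = [block_hash(b) for b in self.blocks]

    def write_block(self, index: int, data: bytes) -> None:
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        self.blocks[index] = data
        self.hashes[index] = block_hash(data)  # maintained at write time


def delta_sync(master: Volume, replica: Volume) -> List[int]:
    """Copy only the blocks whose hashes differ; return the indices sent."""
    changed = [
        i for i, (m, r) in enumerate(zip(master.hashes, replica.hashes)) if m != r
    ]
    for i in changed:
        replica.write_block(i, master.blocks[i])
    return changed


master, replica = Volume(8), Volume(8)
master.write_block(2, b"abcd")  # writes the replica missed while offline
master.write_block(5, b"wxyz")
sent = delta_sync(master, replica)
```

Because the hashes are updated as part of each write, the sync step only compares two lists already held in memory and never has to re-read and re-hash the whole volume.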
The benefits here are twofold. In the first instance, catching up failed replicas is much faster, so node reboots for maintenance are less impactful, and the effect of transient network partitions is minimised. In the second instance, because the amount of IO performed during sync operations is reduced, our impact on the cluster as a whole is much smaller than in previous versions. This gives you more available network and IO bandwidth during cluster failure conditions – just when you might need it the most.
Finally, StorageOS 2.0 is faster, a lot faster. By changing the size of thread pools dynamically based on load, we can react with lower latency under conditions of low contention. We benchmark Ondat internally using FIO, iterating on many parameters such as blocksize, queue depth, compressibility, and so on. While all of these scenarios show improvements, we’ve seen some of them improve by up to 135%.
As part of our development of StorageOS 2.0, we wanted to improve our testing capability. We built a new testing infrastructure based around the excellent Ignite VM manager from our friends at Weave. Our framework allows us to run a series of tests on each candidate release, each in a cleanly provisioned environment to prevent one test from polluting another. We layer our tests in increasing complexity, from ‘happy path’ tests which confirm functionality under normal operation to complex scenarios that use the Linux iptables and traffic shaping subsystems to test behaviour under different failure conditions such as network partitions or severe packet loss.
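To give a flavour of this kind of fault injection, the sketch below builds the sort of `iptables` and `tc` (netem) command lines commonly used to simulate a partition or packet loss. The exact rules our test framework applies are not described here, so the helper names, the peer address and the interface name are all assumptions. The functions only construct argument lists and do not execute anything.

```python
from typing import List


def partition_commands(peer_ip: str) -> List[List[str]]:
    """Commands that drop all traffic to and from one peer (a simulated partition)."""
    return [
        ["iptables", "-A", "INPUT", "-s", peer_ip, "-j", "DROP"],
        ["iptables", "-A", "OUTPUT", "-d", peer_ip, "-j", "DROP"],
    ]


def packet_loss_command(interface: str, loss_percent: int) -> List[str]:
    """Command that makes an interface randomly drop a percentage of packets."""
    if not 0 <= loss_percent <= 100:
        raise ValueError("loss_percent must be between 0 and 100")
    return ["tc", "qdisc", "add", "dev", interface, "root", "netem",
            "loss", f"{loss_percent}%"]


# Hypothetical values for illustration only.
partition = partition_commands("10.0.0.2")
lossy = packet_loss_command("eth0", 30)
```

A test harness would typically pass each list to `subprocess.run` as root inside a disposable VM, then remove the rules again to heal the fault.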
After these tests have passed, we put the product through an exhaustive set of stress tests, in which we run clusters for extended periods and periodically introduce a range of failure conditions. During stress test operation we report metrics such as failover times to Prometheus, in order to accurately compare different versions.
We’ll blog about this further in the future, but suffice it to say that StorageOS 2.0 is the best-tested release we’ve ever shipped.
StorageOS 2.0 is the most reliable, most scalable, most performant, and best tested release we’ve ever shipped. Perhaps more importantly, it gives us a platform on which to implement our aggressive roadmap for the remainder of the year. We have plenty of good things to come – watch this space!