At the beginning of 2019 Ondat (previously StorageOS) embarked on a significant development effort to build v2.0 of our product. The cloud native landscape had significantly changed since the inception and subsequent release of StorageOS v1.0, with Kubernetes emerging from the container orchestrator battle royale as the clear winner. We observed two key trends amongst our customers and the industry in general.
Firstly, customers wanted to run bigger clusters, in the tens or even hundreds of nodes. Secondly, customers wanted to run multiple clusters and for the clusters to interact in various patterns. We designed StorageOS 2.0 to address these trends.
Given the significant development effort to create StorageOS v2.0, we wanted to ensure the testing toolkit and workflow we used was up to the task of verifying our reliability and scalability. At Ondat, product testing is undertaken by the Product Reliability Engineering (PRE) team, and we tasked ourselves with engineering a new testing framework and suite of tests to do justice to the product the company was building.
Our previous testing framework (known internally as ‘vol-test’) was based around a set of libraries and scripting tools that included the ability to provision a variety of environments on a variety of platforms for the tests to run in. Vol-test also included functional, end to end, stress and benchmarking tests for the StorageOS API as well as Ondat running in Kubernetes.
While feature rich, we often found ourselves wanting a more expressive language and toolset. Additionally, the provisioning and testing functions were closely coupled – which made the system inconvenient to work with at times. We staged our own version of the perennial engineering ‘re-write or improve’ debate and settled on separating the testing and provisioning functions and focusing on a better way to write and execute tests. What we ended up with was three systems:
- T2Infra – a provisioning framework for creating reproducible environments quickly
- T2 – a testing framework and suite of tests to verify Ondat behaviour and API outside of Kubernetes
- Kubecover – a testing framework and suite of tests to run within Kubernetes
A major consideration was what language to create the new testing frameworks in. The two languages that we considered were Python and Golang; the former due to ease of use and our team’s familiarity with it and the latter because Kubernetes and surrounding projects are often written in Golang. Perhaps more importantly, much of Ondat is written in Golang too.
The use of a common language between our Reliability Engineers and our Software Engineering colleagues is a great advantage when testing, as it allows for closer collaboration and tighter development cycles. The static typing and easily accomplished concurrency that Golang offers have been extremely welcome and have helped to drive development of significantly more complex tests at a significantly reduced cost.
T2Infra – Fast environment provisioning using Ignite/Firecracker
T2Infra is a provisioning framework that lets us stand up testing environments quickly. We generally run our tests on Firecracker VMs (for more information about Firecracker check out this blog by Shuveb Hussain) by leveraging the Ignite project from our good friends at Weave. In essence, Ignite provisions Firecracker virtual machines that run Ondat containers. That seems like a lot of layers of abstraction, but this combination of tooling gives us highly reproducible, quick to provision, isolated environments for running tests.
These qualities are important to us, as a reproducible environment translates directly into more deterministic tests, while fast provisioning times allow us to spin up a fresh environment for every test. Using a new environment per test means that tests do not “cross-contaminate” one another, further increasing the reliability of our tests.
T2 – Testing outside of Kubernetes
T2 is a testing framework that was created to test and verify the behaviours of Ondat while driving Ondat via its API. The framework consists of libraries to access the StorageOS API, to cause various kinds of ‘chaos’, provide test orchestration functions, and of course, the tests themselves.
Because we are using lightweight VMs, we have familiar tools available to us for introducing controlled disruption into the test environment. For example, we leverage IPTables, IPSet and TC to cause network partitions, inject latency and restrict bandwidth between nodes. These tools are controlled via the T2 framework during test runtime so changes are entirely dynamic.
For additional insight into the test environment, it is important to collect both logs and traces from inside our VMs. Running additional containers inside the VMs, such as the Jaeger and Fluentd agents for collecting traces and logs, is simple to do with systemd manifests. As each test spins up its own environment, it is easy to collect and separate traces and logs from different tests. For more insight into how important tracing is for understanding a distributed system, Manish Rai Jain from Dgraph wrote an interesting blog about how he used tracing and Jaeger to squash a bug exposed by Jepsen.
Kubecover – Adding Kubernetes to the mix
While T2 is useful for testing Ondat in a dedicated, predictable environment, the final arbiter of Ondat as a product is of course how we behave within Kubernetes – since this is how our customers run our product.
Kubecover is a testing framework written in Golang that uses the Kubernetes Golang client directly as well as the StorageOS API Client to drive tests of Ondat within Kubernetes clusters. Our Kubecover framework leverages great work by darkowlzz to make Kubernetes in Docker (Kind) work with Ignite as a backend instead of Docker. The changes allow us to spin up a Kind cluster with each Kind “node” running on a separate Ignite provisioned Firecracker VM. The advantage of using Kind is that we can spin up local and test environments in a matter of seconds.
Given the many popular Kubernetes distributions that exist, we can also use Kubecover binaries in testing pipelines that run against other Kubernetes distributions. For example, we routinely run tests against managed Kubernetes services such as AKS and EKS, as well as various versions of upstream Kubernetes running on cloud infrastructure.
Why have two separate testing frameworks?
The reason for having two separate test frameworks is separation of concerns. T2 is for end to end testing of the Ondat container, driven via the StorageOS API, while Kubecover is for testing Ondat integration with, and behaviours in, Kubernetes, driven via CSI. Kubernetes adds layers of abstraction that are not always helpful in the end to end testing of Ondat, so testing Ondat behaviour is something we generally handle in T2. Testing of Kubernetes-specific behaviours then takes place in Kubecover. In general, our approach is to first verify the StorageOS API behaviour before adding tests of behaviours that arise when running Ondat in Kubernetes.
For a more specific example of how we make these distinctions, consider testing of volume resize. We first tested volume resize in T2 to verify that the StorageOS API is behaving correctly and the feature works as we expect it to. Once the correct behaviour was verified, we wrote tests in Kubecover to ensure that CSI resize requests were also being handled correctly. The resize requests in Kubernetes are generated by editing the size of a Persistent Volume Claim so requests come via the CSI API.
How do we spec our tests?
In order to have a consistent and clear description, each test gets its own Gherkin spec, so the test is described in a `GIVEN, WHEN, THEN` syntax. The test spec should give the reader a clear indication of what the test is going to do and provide context for the test code itself. Having a consistent approach to documenting our tests has made it easy for anyone in our team to triage a test failure and understand what the test is doing. This approach is particularly helpful when it comes to more complex test cases.
Typically we begin spec’ing out tests by following the “happy path”. For example, the “happy path” spec for resizing a volume looks like the below:
GIVEN an unattached volume of size x
WHEN the volume is resized to size y where y is greater than x
THEN the volume will display its new size
AND the block device size will reflect this size
AND the filesystem will reflect a new larger size
The happy path test is important as it proves that the feature works as expected. The next priority is an “unhappy path” test that proves the API is not just accepting and actioning any and all requests, whether valid or not.
GIVEN an unattached volume of size x
WHEN the volume is resized to size y where y is less than or equal to x
THEN the resize will fail with an easy to understand error
Once happy and unhappy paths have been set, we investigate cases where failure is assumed. For example, in cases of concurrent requests, it is important that we only action the first and return an error for the rest, as detailed in the following spec:
GIVEN an unattached volume of size x
WHEN two concurrent resize requests are made to size y and size z
THEN only one resize will be actioned
AND the other request will receive an error response
As the Ondat Control Plane guarantees linearisable volume updates, one of the concurrent resize requests will fail because each resize request will be using the same volume object version. The first request to reach the API causes a version bump, so the subsequent request fails. However, there are also linearisable operations that we verify work as intended, such as having replicas rejoin while a resize is occurring. Once the replicas have rejoined they should also reflect the new size of the volume.
How we run our tests
We run all of the tests in T2 against development containers throughout the day. This helps us to catch any breaking changes that have not been caught further down the testing stack. A full T2 run takes a few hours, so it’s not feasible to gate development merges on T2. Because the runs happen throughout the day, however, any new test failures in a T2 run are traceable to that day’s merges. As it stands we have T2 tests covering all Ondat features and we work closely with the control plane team to ensure that we develop tests in cadence with development of new features.
Another important part of the puzzle is the behaviour of Ondat during longer running multi-day stress tests. As a part of these tests we write data, fail nodes at random intervals and verify the integrity of the previously written data, while also monitoring resource usage. This is effectively a form of Chaos Engineering. A key component of running stress-tests is to instrument the system so that we can measure response times and error rates under conditions of duress. We log many different metrics (for example the completion times of various API calls) to Prometheus, and can therefore build a picture of what ‘good’ looks like, and alert where the system diverges from these norms. More on this in a future post!
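One simple way to verify previously written data is to record a checksum at write time and compare it after the disruption. The sketch below assumes SHA-256 checksums held in memory; the `checksums` type, its methods and the key naming are hypothetical, and the source does not say which integrity mechanism the stress tests actually use.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// checksums remembers the SHA-256 digest of every payload that was written.
type checksums map[string][sha256.Size]byte

// record stores the digest of data under key at write time.
func (c checksums) record(key string, data []byte) {
	c[key] = sha256.Sum256(data)
}

// intact reports whether data read back for key matches what was written.
func (c checksums) intact(key string, data []byte) bool {
	want, ok := c[key]
	return ok && want == sha256.Sum256(data)
}

func main() {
	written := checksums{}
	payload := []byte("block written before the node failure")
	written.record("vol-1/block-0", payload)

	// In a real stress run, nodes would be failed and recovered here.

	fmt.Println(written.intact("vol-1/block-0", payload))

	// Flip one byte to simulate corruption and confirm it is detected.
	corrupted := append([]byte(nil), payload...)
	corrupted[0] ^= 0xff
	fmt.Println(written.intact("vol-1/block-0", corrupted))
}
```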
As mentioned in our StorageOS v2.0 Release Blog, our 2.0 release is the best tested release we’ve ever shipped. Our testing regimes give us high confidence that our product is scalable, performant, and most importantly reliable, under difficult conditions. This gives our customers confidence that Ondat will always keep their data safe, even under the most demanding of scenarios.