JEPSEN

Analysis Work

Jepsen focuses on high-quality, detailed analyses of distributed systems safety. Working closely with the client, Jepsen reads documentation, designs a test suite, and measures system behavior, producing a written report detailing how the system behaves under various conditions. You can request an analysis by emailing aphyr@jepsen.io.

The Report

Jepsen reports explain the results of the analysis to the world. They describe the system under test: its conceptual design, API, and claimed consistency and availability properties. They describe the design of the test suite, and key findings from the analysis. Reports conclude with a summary of findings, recommendations for users and the system’s maintainers, and future work.

For example, see our 2022 analysis of Redpanda 21.10.1.

The Test Suite

As a part of every analysis, Jepsen builds an automated test suite for the system under test. This test uses the Jepsen testing library to install, interact with, induce faults in, and tear down the system, usually on a cluster of five to ten Debian nodes. Tests include a command-line runner which produces a pass-fail result, and usually encompass a variety of workloads, faults, and system tuning options.
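As a rough sketch, a test suite's entry point can look like the following, based on the conventions in the Jepsen library's own tutorial; the somedb name and test function here are hypothetical.

    (ns jepsen.somedb
      (:require [jepsen.cli :as cli]
                [jepsen.tests :as tests]
                [jepsen.os.debian :as debian]))

    (defn somedb-test
      "Builds a Jepsen test map from parsed command-line options."
      [opts]
      (merge tests/noop-test
             opts
             {:name "somedb"
              :os   debian/os}))

    (defn -main
      "Command-line runner: `lein run test ...` runs a single test and
      exits with a pass/fail status; `lein run serve` serves results."
      [& args]
      (cli/run! (merge (cli/single-test-cmd {:test-fn somedb-test})
                       (cli/serve-cmd))
                args))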

Jepsen often trains clients in running and extending the test suite themselves, and helps teams integrate the test suite into their CI system.

The test suite is licensed under the Eclipse Public License, an open-source license commonly used by JVM projects. By default, only the client and Jepsen have access to the test suite until the report is released.

Systems

Jepsen analyzes databases, queues, caches, coordination services, blockchains, task schedulers, and more. We’ve tested SQL, key-value, document, graph, and crypto databases. Our work includes analyses of single-node, replicated, and sharded systems; in-memory and disk-persistent ones. The Jepsen testing library provides a general framework for testing all kinds of concurrent systems against a wide variety of invariants.

We mainly test systems that can be installed on clusters of Debian Linux nodes. This allows us to perform sophisticated fault injection. We’ve also tested some hosted services—for instance, AWS’s RDS database service. However, these tests generally don’t include fault injection, and are generally slower to run.

Faults

Jepsen tests systems under a broad variety of conditions. We look at healthy clusters, of course, but also at what happens when systems fail.

The Jepsen library includes a suite of common faults, including simulated network partitions, network latency, process pauses and crashes, clock errors, power loss, and disk errors. In addition to these common faults, we often develop specific scenarios for the system under test: inducing garbage collection, adding and removing nodes, and so on.
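As a sketch, here are two of the library's built-in nemeses (the processes Jepsen uses to inject faults); the process name is hypothetical.

    (require '[jepsen.nemesis :as n])

    ;; Cuts the network into two randomly chosen halves on a :start
    ;; operation, and heals all partitions on :stop.
    (def partitioner (n/partition-random-halves))

    ;; Pauses a named process with SIGSTOP on :start, and resumes it
    ;; with SIGCONT on :stop. "somedb" is a hypothetical process name.
    (def pauser (n/hammer-time "somedb"))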

Properties

Jepsen tests can measure a wide variety of safety and liveness properties. Based on the system’s documentation, we build checkers that ensure (e.g.) that confirmed writes are not lost, that messages are delivered in order, that nodes eventually converge, that state advances monotonically, that writes are visible to reads within some time bound, and so on. Our transactional isolation checkers can verify Serializability, Snapshot Isolation, Repeatable Read, Read Committed, and so on, as well as realtime and session variants of those models.
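Transactional isolation is typically checked with a workload like list-append, whose checker (Elle) searches histories for dependency cycles that the requested isolation level forbids. A sketch, assuming the option names in recent Jepsen releases:

    (require '[jepsen.tests.cycle.append :as append])

    ;; Transactions append unique values to lists and read them back;
    ;; the checker verifies the history against Snapshot Isolation.
    (def workload
      (append/test {:consistency-models [:snapshot-isolation]
                    :max-txn-length     4}))

    ;; workload is a map with a :generator and a :checker, ready to
    ;; merge into a test.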

Our tests also provide quantitative and qualitative insight into availability. Graphs of throughput and latency show how system behavior changes when faults occur, and how long recovery takes. We can place estimates on replication lag in databases, or message delivery time in queues.
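These plots come from a performance checker which can be composed with safety checkers. A sketch, assuming the checker names in recent releases of the library:

    (require '[jepsen.checker :as checker]
             '[knossos.model :as model])

    ;; A composite checker: linearizability over a compare-and-set
    ;; register, plus latency and throughput plots written to the
    ;; test's on-disk store.
    {:checker (checker/compose
                {:linear (checker/linearizable
                           {:model     (model/cas-register)
                            :algorithm :linear})
                 :perf   (checker/perf)})}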

Over the last decade, Jepsen has built a suite of powerful, sophisticated checkers for various kinds of systems. Where existing checkers are insufficient, Jepsen often develops custom ones for the system under test.

Rates & Scheduling

Jepsen works with a single client, full time, on a week-to-week basis. Clients pay a flat rate for each week, and can keep going as long as they like. Once the engagement concludes, Jepsen moves on to the next client. Clients are queued on a first-signed, first-served basis.

Jepsen generally works year-round, Monday through Friday, from 08:00 to 18:00 US Central time. Whenever Jepsen cannot provide a full week of work (e.g. due to teaching, conferences, holidays, or illness), that week’s fee is prorated accordingly.

If a client is at the top of the queue, but not ready to begin, they can opt to defer their engagement. Jepsen moves on to the next client in queue, and returns to the deferred client afterwards. There is no penalty for deferring.

We often discover unexpected bugs during an analysis, and clients often realize they’d like to test additional builds, features, or faults during the engagement. A week-to-week model gives clients the flexibility to explore system behaviors as we discover new information. It also ensures that engagements finish promptly, so queued clients don’t have to wait too long.

Analyses have a four-week minimum term: this ensures we have enough time to install the system, design a basic test suite, explore some faults, and produce a publishable report. Most analyses reflect four to twelve weeks of work.

The Analysis Process

During the first few days, Jepsen chats with the client and performs a thorough review of the system’s documentation. We figure out what the system is supposed to do (e.g. “Strong Snapshot Isolation over transactions”), and develop a plan for testing it. We start with high-impact tests that can be built quickly, then, in coordination with the client, develop more complex tests which take longer to write and tune. Throughout the process, Jepsen shares findings with the client, builds up a written report, and asks for feedback.

In the first week, Jepsen lays down the scaffolding for a test suite, and builds automation for setting up and tearing down the system. Depending on system complexity, this usually takes one to three days.
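A sketch of that automation, with a hypothetical download URL and install path:

    (require '[clojure.tools.logging :refer [info]]
             '[jepsen.control :as c]
             '[jepsen.control.util :as cu]
             '[jepsen.db :as db])

    (defn db
      "Automation for installing and removing a hypothetical database
      at the given version on each node."
      [version]
      (reify db/DB
        (setup! [_ test node]
          (info node "installing somedb" version)
          (c/su
            ;; Download and unpack the system; URL and path are made up.
            (cu/install-archive!
              (str "https://example.com/somedb-" version ".tar.gz")
              "/opt/somedb")
            (c/exec "/opt/somedb/bin/somedb" :start)))

        (teardown! [_ test node]
          (info node "tearing down somedb")
          (c/su (c/exec :rm :-rf "/opt/somedb")))))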

We then build a minimal test workload which generates operations, uses a network client to submit them to the system under test, and checks that they satisfy critical properties. Depending on the system’s API and client support, this may take one to five days.
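A sketch of such a workload for a single read/write register; connect-to, read-register, write-register!, and disconnect! are stand-ins for the system's real client library.

    (require '[jepsen.client :as client])

    ;; Stand-ins for the system's actual client library.
    (declare connect-to read-register write-register! disconnect!)

    ;; Operation generators: each call produces an operation to invoke.
    (defn r [_ _] {:type :invoke, :f :read,  :value nil})
    (defn w [_ _] {:type :invoke, :f :write, :value (rand-int 5)})

    (defrecord RegisterClient [conn]
      client/Client
      (open! [this test node]
        (assoc this :conn (connect-to node)))

      (setup! [this test])

      (invoke! [this test op]
        (case (:f op)
          :read  (assoc op :type :ok, :value (read-register conn))
          :write (do (write-register! conn (:value op))
                     (assoc op :type :ok))))

      (teardown! [this test])

      (close! [this test]
        (disconnect! conn)))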

We then introduce faults. We start with Jepsen’s built-in faults: network partitions, process crashes, and so on. Basic fault injection usually takes a week to write and evaluate.
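A sketch of a generator that weaves this fault schedule into the workload, reusing the r and w operation functions from the client sketch above:

    (require '[jepsen.generator :as gen])

    ;; Mix reads and writes at a modest rate while the nemesis
    ;; repeatedly waits 15 seconds, starts a fault, lets it run for
    ;; 30 seconds, then heals it. Stop the test after five minutes.
    (def generator
      (->> (gen/mix [r w])
           (gen/stagger 1/50)
           (gen/nemesis
             (cycle [(gen/sleep 15)
                     {:type :info, :f :start}
                     (gen/sleep 30)
                     {:type :info, :f :stop}]))
           (gen/time-limit 300)))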

After the basic test suite is up and running, we begin expanding it: adding new workloads, expanding existing ones to stress additional database features, introducing new kinds of faults, running tests with higher request rates or larger data volumes, and so on. As we run the test suite, Jepsen shares results with the client, and asks for guidance on what they’d like to explore next. We often work with the client to troubleshoot the causes of bugs: tuning configuration parameters, enabling additional logging, collecting packet traces, and so on. We typically test candidate builds to confirm bugfixes. All of our findings are integrated into the report.

In the last week of the engagement, we polish the written report and ask for feedback from the client. We often go through a few rounds of editing before the report is finalized. Jepsen also cuts releases of any supporting libraries and the test suite itself, so the client has a stable artifact to run going forward.

Once the engagement concludes, an embargo period begins, and Jepsen moves on to work with another client. The client can use the embargo period to fix bugs, write documentation, cut releases, coordinate with their customers, and so on. The client picks a release date up to three months after the end of the engagement. On that date, the report and test suite become public.

A few days before the release of the report, Jepsen checks in with the client to make small, last-minute updates to the report: new features, documentation, bugfixes, and so on. This covers only a few hours of work; it does not include additional testing, review, etc.

Request an Analysis

Want to get started? Email aphyr@jepsen.io.