JEPSEN

Research Ethics

People rely on Jepsen to help them test their own systems, to make decisions about algorithms, and to understand how to use systems safely. To put this work in context, we’d like to talk about Jepsen’s limitations and biases–technical, social, and fiscal.

Technical Limitations

Jepsen analyses generally consist of running operations against a distributed system in a dedicated cluster, introducing faults into that cluster, and observing whether the results of those operations are consistent with some model. This introduces various sources for error: bugs, bounds on the search space, and the problem of induction. Jepsen’s design also limits its use as a performance benchmark.

Like all software, Jepsen’s test harnesses and libraries have bugs. We do our best to avoid them, but there have been bugs in Jepsen before and there will be again. These could lead to false negatives and false positives. Jepsen writes automated tests and double-checks anomalies by hand, but there can always be mistakes.

Jepsen operates on real software, not abstract models. This prevents us from seeing pathological message or thread schedules, or bugs that arise only on particular platforms. Because Jepsen performs real network requests, we can’t explore the system’s search space as quickly as a model checker–and our temporal resolution is limited. Memory and time complexity further limit the scope of a test. However, testing executables often reveals implementation errors or subsystem interactions which a model checker would miss! We can also test programs without having to modify or read their source.

Because Jepsen tests are experiments, they can only prove the existence of errors, not their absence. We cannot prove correctness, only suggest that a system is less likely to fail because we have not (so far) observed a problem.

Finally, Jepsen is a safety checker, not a benchmarking tool. It can emit performance analyses, but benchmark design is a complex, separate domain. Hardware, OS tuning, workload, concurrency, network behavior… all play significant roles in performance. Jepsen’s latency measurements are generally only comparable to the same software tested on the same platform. Its throughput measurements are not a performance benchmark; Jepsen controls its own request rate to reveal concurrency errors.

Conflicts of Interest

In addition to Jepsen’s technical limitations, there are social and financial motivations which influence this work. Nobody can be totally objective–all research is colored by the researchers’ worldview–but Jepsen still strives to publish high-quality results that people can trust.

First, motivations. Jepsen does this work because we believe it’s important, and because we want to help others build better systems. Jepsen believes that distributed systems should clearly communicate their requirements and guarantees, making tradeoffs explicit. We believe documentation should be complete and accurate, using formally analyzed concurrency models to describe a system’s invariants. We want every operator and engineer to have the tools and know-how to analyze the systems they rely on, and the systems they build.

Our researchers’ personal connections in the distributed systems world influence the work. However, we strive to be critical of the systems we’re affiliated with, such as Riak, Kafka, and Datomic. Conversely, harsh critiques can lead to productive relationships: some of our negative reports have been followed by friendly collaborations, as with Aerospike, Redis, and MongoDB.

There are also financial conflicts of interest when a vendor pays Jepsen to analyze their database. Few companies like seeing their product discussed in a negative light, which creates pressure to deliver positive results. At the same time, if we consistently fail to find problems, there’s no reason to hire Jepsen: we can’t tell you anything new, and we can’t help improve your product. Then there are second-order effects: companies often believe that a positive Jepsen post may suggest tampering, and encourage Jepsen to find problems. Database users have shown surprising goodwill towards vendors which are open to public criticism, and who commit to improvement.

Finally, when someone pays Jepsen to analyze any system–not just their own–it changes the depth and character of the investigation. During a paid engagement we can afford to devote more time to the analysis: a closer reading of the documentation, more aggressive test suites, and more time for back-and-forth collaboration with the vendor to identify the causes of a test failure and to come up with workarounds.

Promises

In an ideal world, Jepsen would crank out a comprehensive analysis of every release of every database on the planet and make it available to the public. We want users to understand the software they use. We want to keep vendors honest, and help them find and fix bugs. We want to teach everyone how to analyze their own systems, and for the industry as a whole to produce software which is resilient to common failure modes.

That said, analyses often take months of work, and our researchers have to eat. We have to balance the public good, what clients ask for, and constraints of time and money. Here’s what Jepsen promises:

Jepsen does its best to provide educational, clear, accurate, and timely analyses of distributed systems, in the interests of users, academics, and vendors alike.
Every analysis we release will include a clear statement of who funded the work.
Clients and Jepsen may collaborate on test design and the written analysis. We ask for client feedback as we work, and ask for review of drafts before publication.
That said, clients can never tamper with results. Clients may suggest possible causes, related bugs, workarounds, etc, and we’ll happily include those in the report, but if a database allows stale reads, or a queue drops data, we discuss those problems. Jepsen retains final editorial control over all public analyses.
Once we’ve prepared a written analysis, the client may defer publication by up to three months. This gives the client a chance to understand the faults we’ve identified, attempt to fix bugs, update documentation, and deploy fixes to users.
To prevent readers from assuming the worst in the event that a report is delayed, Jepsen will not disclose the existence of a contract without client consent.
Every analysis will include complete code for reproducing our findings. Arguments around formal models will be backed up by appropriate citations, and interpretation of intended behavior will be supported by documentation excerpts.
When we make a mistake in an analysis, we will update the post with a clear retraction or correction.
For environmental reasons, Jepsen will not analyze systems based on proof-of-work, proof-of-space, etc.

Most clients have asked that Jepsen operate with maximal freedom–they aren’t interested in influencing results. We think that’s great! Conversely, Jepsen declines requests for sponsored content and other fluff pieces; we don’t feel it’s consistent with Jepsen’s goals.

Finally, we’d like to emphasize that Jepsen analyses aren’t just database reviews. Each one is written as a case study in test design, and you should build on their techniques to evaluate your own systems. Everyone is free to use the Jepsen software to design their own tests for anything they please. It’s our sincere hope that we can collectively raise the bar for distributed systems safety.

Changelog

2024-07-17: Earlier versions promised that Jepsen could veto publication if Jepsen and a client could not agree on the content of an analysis. However, this veto has never been used. In fact, Jepsen’s contracts have given Jepsen final approval over the content of analyses since 2016. We replace the promise of a veto with a stronger promise of editorial control. In light of Jepsen’s multiple authors, we also shift to an organizational third person voice. Finally, we’ve streamlined some language.

2021-06-04: Jepsen no longer works on proof-of-work, proof-of-space, etc.