JEPSEN

Distributed Systems Safety Research

About Jepsen

Jepsen is an effort to improve the safety of distributed databases, queues, consensus systems, etc. We maintain an open source software library for systems testing, as well as blog posts and conference talks exploring particular systems' failure modes. In each analysis we explore whether the system lives up to its documentation's claims, file new bugs, and offer recommendations for operators.

Jepsen pushes vendors to make accurate claims and test their software rigorously, helps users choose databases and queues that fit their needs, and teaches engineers how to evaluate distributed systems correctness for themselves.

In addition to public analyses, Jepsen offers technical talks, training classes, and distributed systems consulting services.

Recent Work

  • Jepsen worked with Tendermint to evaluate their distributed, linearizable, Byzantine-fault-tolerant blockchain system. We were unable to find issues with their replication algorithm, but did discover single-node crashes and issues with crash recovery that could lead to unavailability or data loss.

  • We worked with Cockroach Labs to refine the Jepsen test suite they wrote for CockroachDB, and found multiple bugs leading to serializability violations, all of which are now fixed.

  • Jepsen helped MongoDB identify design flaws in their v0 replication protocol and implementation bugs in its v1 replacement, all of which could lead to the loss of majority-acknowledged operations. We also collaborated with MongoDB to integrate Jepsen into their CI system. MongoDB added support for linearizable reads in October 2016.

  • Research for Crate.io uncovered cases of dirty reads, replica divergence, and lost updates in Elasticsearch.

  • Jepsen found that document versions in Crate.io do not uniquely identify a particular version of a document, allowing lost updates.

  • We worked with VoltDB to discover and fix stale and dirty reads in their SQL database, and, in uncommon configurations, two bugs leading to the loss of acknowledged updates.
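
As a rough illustration of the Crate.io item above, the sketch below shows in general terms how a version number that does not uniquely identify a document's contents can let a conditional update silently discard another write. Everything here is hypothetical: `ToyReplica`, `Doc`, and `demo` are invented for this example and are not Crate.io's API or Jepsen's test code, and the way the two copies come to share a version (each replica accepting a write while partitioned) is an assumption made for the demonstration.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    value: int
    version: int


class ToyReplica:
    """Hypothetical in-memory replica holding one document (illustration only)."""

    def __init__(self, value: int, version: int) -> None:
        self.doc = Doc(value, version)

    def read(self) -> Doc:
        # Return a copy so callers cannot mutate replica state.
        return Doc(self.doc.value, self.doc.version)

    def write(self, value: int) -> None:
        # Unconditional write: store the value and bump the version.
        self.doc = Doc(value, self.doc.version + 1)

    def cas(self, expected_version: int, new_value: int) -> bool:
        # Conditional update: apply only if the caller saw the current version.
        if self.doc.version != expected_version:
            return False
        self.doc = Doc(new_value, self.doc.version + 1)
        return True


def demo():
    # Two replicas start in agreement.
    a = ToyReplica(value=0, version=1)
    b = ToyReplica(value=0, version=1)

    # Assumed scenario: while partitioned, each replica accepts a different
    # write. Both now report version 2, but with different contents.
    a.write(10)
    b.write(20)

    # A client reads from replica a and sees value 10 at version 2.
    seen = a.read()

    # Its conditional update is then served by replica b. The version check
    # passes even though b holds a value (20) the client never observed.
    ok = b.cas(seen.version, seen.value + 1)
    return ok, b.read()


if __name__ == "__main__":
    ok, final = demo()
    print(ok, final)  # True Doc(value=11, version=3)
```

In this toy run the write of 20 is overwritten without any conflict being reported, which is what a lost update looks like from the client's side. A version check only protects against this if each version corresponds to exactly one state of the document.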