Distributed Systems Safety Analysis
Jepsen is an effort to improve the safety of distributed databases, queues, consensus systems, etc. We maintain an open source software library for systems testing, as well as blog posts and conference talks exploring particular systems’ failure modes. In each analysis we explore whether the system lives up to its documentation’s claims, file new bugs, and make recommendations for operators.
Jepsen pushes vendors to make accurate claims and test their software rigorously, helps users choose databases and queues that fit their needs, and teaches engineers how to evaluate distributed systems correctness for themselves.
Research for Crate.io uncovered cases of dirty reads, replica divergence, and lost updates in Elasticsearch.
Jepsen found that document versions in Crate.io do not uniquely identify a particular version of a document, allowing lost updates.
We worked with VoltDB to discover and fix stale and dirty reads in their SQL database, and, in uncommon configurations, two bugs leading to the loss of acknowledged updates.
Jepsen helped RethinkDB identify and resolve a bug that led to stale reads, lost updates, and table failure during cluster reconfiguration.
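To illustrate how a version number that does not uniquely identify a document state can allow a lost update, as in the Crate.io finding above, here is a minimal toy sketch. It is not Crate.io's replication logic: the `Replica` class, its `cas` method, and the rule that one replica adopts the other's copy after a partition are all hypothetical, chosen only to show the failure pattern.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    value: frozenset
    version: int


class Replica:
    """Hypothetical single-document replica with version-checked writes."""

    def __init__(self):
        self.doc = Doc(frozenset(), 1)

    def cas(self, expected_version, new_value):
        # Accept the write only if the caller saw the current version number.
        if self.doc.version != expected_version:
            return False
        self.doc = Doc(frozenset(new_value), expected_version + 1)
        return True


def demo():
    a, b = Replica(), Replica()
    acked = []

    # During a partition, each replica independently accepts one write
    # against version 1.
    if a.cas(1, {1}):
        acked.append(1)
    if b.cas(1, {2}):
        acked.append(2)
    # Both copies are now "version 2", yet they hold different contents.

    seen = a.doc  # a client reads {1} at version 2 from replica a

    # The partition heals and replica a adopts replica b's copy
    # (an assumed convergence rule for this sketch).
    a.doc = b.doc

    # The client adds 3 to what it read, conditional on version 2.
    # The check passes even though the stored document is not the one it read.
    if a.cas(seen.version, seen.value | {3}):
        acked.append(3)
    return a.doc, acked


final, acked = demo()
print(sorted(final.value), acked)
```

In this sketch all three writes are acknowledged, but the final document contains only 1 and 3. The write of 2 is silently overwritten, because the version check could not distinguish the document the client read from the one actually stored.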