JEPSEN

Distributed Systems Safety Research

About Jepsen

Jepsen is an effort to improve the safety of distributed databases, queues, consensus systems, and more. We maintain an open source software library for systems testing, as well as blog posts and conference talks exploring particular systems’ failure modes. In each analysis we explore whether the system lives up to its documentation’s claims, file new bugs, and offer recommendations for operators.

Jepsen pushes vendors to make accurate claims and test their software rigorously, helps users choose databases and queues that fit their needs, and teaches engineers how to evaluate distributed systems correctness for themselves.
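At the heart of this kind of evaluation is checking a concurrent history of operations against a formal model, for instance asking whether a register's reads and writes are linearizable. As a rough illustration only (this is not Jepsen's actual checker; Jepsen is a Clojure library, and its linearizability checking is done by a dedicated tool), here is a brute-force sketch of such a check for a single register in Python:

```python
from itertools import permutations

# Toy linearizability check for a single register. Each operation is a
# tuple (process, kind, value, invoke_time, return_time). This exhaustive
# search is exponential, so it only suits tiny example histories.

def linearizable(history):
    """True if some total order of the ops respects real-time order and
    sequential register semantics (each read sees the latest write)."""
    n = len(history)
    for order in permutations(range(n)):
        # Real-time constraint: if op a returned before op b was invoked,
        # then a must precede b in the linearization.
        if any(history[order[j]][4] < history[order[i]][3]
               for i in range(n) for j in range(i + 1, n)):
            continue
        # Replay the candidate order against sequential register semantics.
        current, ok = None, True
        for idx in order:
            _, kind, value, _, _ = history[idx]
            if kind == "write":
                current = value
            elif value != current:
                ok = False
                break
        if ok:
            return True
    return False

# A write of 1 completes, then a read observes 1: linearizable.
ok_history = [("p1", "write", 1, 0, 2), ("p2", "read", 1, 3, 4)]
# A read observes a value nobody wrote: not linearizable.
bad_history = [("p1", "write", 1, 0, 2), ("p2", "read", 2, 3, 4)]
print(linearizable(ok_history))   # → True
print(linearizable(bad_history))  # → False
```

Production checkers use far more efficient search strategies than this permutation scan, but the underlying question is the same: does any legal sequential execution explain what clients actually observed?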

In addition to public analyses, Jepsen offers technical talks, training classes, and distributed systems consulting services.

Recent Work

  • We analyzed Dgraph, a distributed graph database, and identified numerous deadlocks, crashes, and consistency violations, including the loss and corruption of records, even in healthy clusters. Dgraph addressed many of these issues during our collaboration, and continues to work on remaining problems.

  • Together with Aerospike, we validated their next-generation consensus system, confirming two known data-loss scenarios caused by process pauses and crashes, and discovering a previously unknown bug in their internal RPC proxy mechanism that allowed clients to see successfully applied updates as definite failures. Aerospike fixed this bug, added an option requiring nodes to write to disk before acknowledging operations to clients, and plans to extend the maximum clock skew their consensus system can tolerate.

  • Jepsen demonstrated numerous problems with data loss in Hazelcast, an in-memory data grid: map updates could be lost, atomic references were not atomic, ID generators generated duplicate IDs, locks were not exclusive, and queues could lose acknowledged messages.

  • Jepsen worked with Tendermint to evaluate their distributed, linearizable, Byzantine-fault-tolerant blockchain system. We were unable to find issues with their replication algorithm, but did discover single-node crashes and issues with crash recovery that could lead to unavailability or data loss.