JEPSEN

Distributed Systems Safety Research

About Jepsen

Jepsen aims to improve the safety of distributed databases, queues, consensus systems, etc. We maintain an open source library for safety testing, and publish free, in-depth analyses of specific systems. In each analysis we explore whether the system lives up to its documentation’s claims, file new bugs, and offer recommendations for operators. In addition to paid analysis, Jepsen offers technical talks, training classes, and consulting services.

Jepsen pushes vendors to make accurate claims and test their software rigorously, helps users choose databases and queues that fit their needs, and teaches engineers how to evaluate distributed systems correctness for themselves.

News

Recent research, analyses, and announcements.

TigerBeetle is a distributed OLTP database for financial transactions. We worked with TigerBeetle to test versions 0.16.11 through 0.16.30, and found seven crashes, elevated latencies during single-node failures, and requests which were retried forever. We found only two safety issues: missing results for queries with multiple predicates, and incorrect timestamps in a debugging API. As of version 0.16.45, TigerBeetle had addressed every issue, except for indefinite retries.

Using a new experimental library for running Jepsen tests on Amazon RDS clusters, we report on a small issue in Amazon RDS for PostgreSQL. At the “Repeatable Read” isolation level, which in PostgreSQL normally means Snapshot Isolation, Amazon RDS for PostgreSQL clusters appear to exhibit Long Fork. We observed this behavior in healthy clusters, in versions ranging from 13.15 to 17.4. Amazon RDS for PostgreSQL may instead support Parallel Snapshot Isolation, a slightly weaker consistency model.
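For readers unfamiliar with Long Fork, here is a minimal sketch of the anomaly in Python. It is not Jepsen's checker and does not reflect how the analysis was performed. It assumes a simplified history in which each key is written exactly once, by its own transaction, and each reader records the value it saw for every key (None meaning the write was not yet visible). The names find_long_forks, visible_writes, reader_1, and reader_2 are hypothetical. Under Snapshot Isolation every snapshot is a prefix of one commit order, so the sets of writes seen by any two readers must be nested. Two readers whose sets are incomparable have observed the writes in incompatible orders, which Parallel Snapshot Isolation permits and Snapshot Isolation does not.

```python
from itertools import combinations


def visible_writes(snapshot):
    """Return the keys whose (single) write this reader observed."""
    return {key for key, value in snapshot.items() if value is not None}


def find_long_forks(snapshots):
    """Return pairs of reader names whose observations are incomparable.

    Assumes each key is written once by a distinct transaction, so a
    reader's snapshot is fully described by the set of keys it saw.
    """
    forks = []
    for (name_a, snap_a), (name_b, snap_b) in combinations(snapshots.items(), 2):
        seen_a = visible_writes(snap_a)
        seen_b = visible_writes(snap_b)
        # Under Snapshot Isolation, one set must contain the other.
        if not (seen_a <= seen_b or seen_b <= seen_a):
            forks.append((name_a, name_b))
    return forks


# Hypothetical history: one transaction writes x, another writes y.
snapshots = {
    "reader_1": {"x": 1, "y": None},  # saw the write to x, but not y
    "reader_2": {"x": None, "y": 1},  # saw the write to y, but not x
}

print(find_long_forks(snapshots))  # [('reader_1', 'reader_2')]
```

If reader_2 had instead seen both writes, the two sets would be nested and the function would return an empty list.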

Upcoming Events

2025-03-03

By popular demand, we’re offering another open session of the distributed systems fundamentals class: four half-days discussing the basics of distributed systems theory and practice. For the first time we’re also opening up the accompanying workshop, for up to five participants. Please join!

Jepsen also has two conference talks coming up: one at BugBash 2025 (April 3-4), in Washington, DC, and a second at Systems Distributed 2025 (June 19-20), in Amsterdam.

Antithesis, Buf, and Jepsen are running a joint webinar on December 5th, 2024. We’ll discuss a Kafka protocol safety issue, talk about the challenges of distributed systems testing, and show how Jepsen and Antithesis helped identify critical safety errors in Bufstream. Come watch Antithesis pause, rewind, and explore a running Bufstream cluster in an interactive debugging shell!

Bufstream 0.1.0

2024-10-24

Jepsen worked with Buf to analyze the safety of Bufstream, a Kafka-compatible streaming system. We found three safety and two liveness issues in Bufstream 0.1.0, including the loss of acknowledged writes in healthy clusters. These problems were resolved by version 0.1.3. We also discovered serious issues in Kafka’s transaction protocol, including write loss, aborted read, and torn transactions. These problems affect Kafka, Bufstream, and (presumably) other Kafka-compatible systems; they remain outstanding.
