JEPSEN

Distributed Systems Safety Research

About Jepsen

Jepsen aims to improve the safety of distributed databases, queues, consensus systems, etc. We maintain an open-source library for safety testing, and publish free, in-depth analyses of specific systems. In each analysis we explore whether the system lives up to its documentation’s claims, file new bugs, and offer recommendations for operators. In addition to paid analysis, Jepsen offers technical talks, training classes, and consulting services.

Jepsen pushes vendors to make accurate claims and test their software rigorously, helps users choose databases and queues that fit their needs, and teaches engineers how to evaluate distributed systems correctness for themselves.

News

Recent research, analyses, and announcements.

Kaivalya Apte interviewed Kyle Kingsbury for The GeekNarrator Podcast. We talk about common bugs in distributed systems, type I vs type II errors, ensuring correctness in Jepsen itself, LLMs, experimental techniques, formal verification, and more.

Jepsen’s 17th conference talk, “ACID Jazz”, is now available on YouTube. This talk was presented at Antithesis’ BugBash conference, in Washington, DC, and covers research on MySQL 8.0.34, Datomic Pro 1.0.7075, and Bufstream 0.1.0.

Antithesis and Jepsen are proud to present a glossary for distributed systems reliability. We hope this will be helpful for engineers, testers, and students!

TigerBeetle is a distributed OLTP database for financial transactions. We worked with TigerBeetle to test versions 0.16.11 through 0.16.30, and found seven crashes, elevated latencies during single-node failures, and requests which were retried forever. We found only two safety issues: missing results for queries with multiple predicates, and incorrect timestamps in a debugging API. As of version 0.16.45, TigerBeetle had addressed every issue, except for indefinite retries.

With a new experimental library for running Jepsen tests on Amazon RDS clusters, we report on a small issue in Amazon RDS for PostgreSQL. At the “Repeatable Read” isolation level, which in PostgreSQL normally means Snapshot Isolation, Amazon RDS for PostgreSQL clusters appear to exhibit Long Fork. We observed this behavior in healthy clusters, in versions ranging from 13.15 to 17.4. Amazon RDS for PostgreSQL may instead support Parallel Snapshot Isolation, a slightly weaker consistency model.
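To make the Long Fork anomaly concrete, here is a minimal Python sketch. It is not Jepsen's checker and does not use the Jepsen library; the function names and the history format are hypothetical, chosen only for illustration. It assumes each key is written at most once, and that each read is recorded as a map from key to the value it observed (`None` meaning the write was not yet visible). Under Snapshot Isolation, the sets of writes visible to any two reads should be ordered by inclusion. A Long Fork appears when two reads each see a write that the other one misses.

```python
from itertools import combinations


def visible_keys(read, keys):
    """Return the subset of `keys` whose write was visible to this read."""
    return {k for k in keys if read[k] is not None}


def find_long_forks(reads):
    """Return pairs of reads whose visible write sets are incomparable.

    Assumption (illustrative only): every key is written at most once,
    and a value of None means the read did not observe that write.
    """
    forks = []
    for a, b in combinations(reads, 2):
        # Compare only the keys that both reads actually looked at.
        common = a.keys() & b.keys()
        seen_a = visible_keys(a, common)
        seen_b = visible_keys(b, common)
        # Each read saw some write the other missed: no single commit
        # order can explain both observations.
        if seen_a - seen_b and seen_b - seen_a:
            forks.append((a, b))
    return forks


# Hypothetical history: two concurrent writers set x and y to 1.
history = [
    {"x": 1, "y": None},  # saw the write to x, but not the write to y
    {"x": None, "y": 1},  # saw the write to y, but not the write to x
    {"x": 1, "y": 1},     # saw both writes
]

forks = find_long_forks(history)
print(len(forks))  # prints 1: the first two reads disagree on the order
```

Parallel Snapshot Isolation permits the first two reads in this history, which is why observing such a pair suggests the weaker model rather than Snapshot Isolation.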
