JEPSEN

Request for a Lazy Filesystem

Thanks to everyone who wrote in! I’ve got several responses to sort through–I promise I’ll get back to everybody who got in touch about this job. –Kyle

Jepsen would like to contract someone to build a new FUSE filesystem (or extend an existing one) with an internal page cache, for testing databases which don’t call fsync as often as they should.

Problem Statement

I want to simulate what would happen if a node suffered a power failure and did not write un-fsynced data to disk. Right now we test node failures in Jepsen by killing a database process with kill -9, but this still allows the OS to politely flush data to disk. Ideally, I’d like to kill the DB, then run some command (e.g. breakfs simulate-power-failure) which tells the filesystem “Hey, forget whatever data was written to you that wasn’t fsynced.” Once that’s done, I’ll restart the DB process, and it can look at the filesystem and fail to read its unsynced writes. For applications which don’t fsync correctly, I’m hoping this leads to detectable application-level data loss.

My guess is that this looks like a FUSE filesystem, perhaps written in C, which maintains an in-memory page cache on top of an underlying filesystem, and flushes that cache only when memory pressure or fsyncs occur. On command, it can drop the page cache.
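To make that concrete, here’s a rough sketch of the bookkeeping I imagine at the core of such a filesystem. Everything here is illustrative: the names, the naive linked list, and the flush-whole-file granularity are assumptions, not a design I’m married to. Real FUSE write and fsync callbacks would sit on top of helpers like these (reads would need to consult the cache too, which the sketch skips):

    /* Illustrative dirty-page bookkeeping; a real implementation would
     * need locking, memory-pressure eviction, and a smarter index. */

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>

    #define PAGE_SIZE 4096

    /* One cached page of un-fsynced data for some backing file. */
    struct dirty_page {
        char               path[256];       /* backing file                */
        off_t              index;           /* page number within the file */
        char               data[PAGE_SIZE]; /* the un-flushed bytes        */
        struct dirty_page *next;
    };

    static struct dirty_page *dirty_head = NULL;

    /* write(2) path: stash the page in memory instead of hitting disk. */
    static void cache_write(const char *path, off_t index, const char *buf) {
        struct dirty_page *p;
        for (p = dirty_head; p; p = p->next)   /* overwrite if already cached */
            if (p->index == index && strcmp(p->path, path) == 0) {
                memcpy(p->data, buf, PAGE_SIZE);
                return;
            }
        p = calloc(1, sizeof *p);
        if (!p) abort();
        snprintf(p->path, sizeof p->path, "%s", path);
        p->index = index;
        memcpy(p->data, buf, PAGE_SIZE);
        p->next = dirty_head;
        dirty_head = p;
    }

    /* fsync(2) path: push this file's dirty pages down to the backing FS. */
    static void flush_file(const char *path) {
        struct dirty_page **pp = &dirty_head;
        while (*pp) {
            if (strcmp((*pp)->path, path) == 0) {
                struct dirty_page *p = *pp;
                /* a real implementation would pwrite() p->data to the
                 * underlying file here */
                printf("flush %s page %lld\n", path, (long long)p->index);
                *pp = p->next;
                free(p);
            } else {
                pp = &(*pp)->next;
            }
        }
    }

    /* The fault: a simulated power failure forgets every page that was
     * never fsynced. */
    static void drop_unsynced(void) {
        while (dirty_head) {
            struct dirty_page *p = dirty_head;
            dirty_head = p->next;
            free(p);
        }
    }

    int main(void) {
        char buf[PAGE_SIZE] = "committed";
        cache_write("/data/wal", 0, buf);   /* written and fsynced: survives */
        flush_file("/data/wal");
        cache_write("/data/wal", 1, buf);   /* written, never fsynced: gone  */
        drop_unsynced();                    /* simulated power failure       */
        return 0;
    }

The real thing obviously also needs read-through, partial pages, metadata, and spill-to-disk, but the drop_unsynced path is the whole point of the exercise.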

Requirements

  • The filesystem needs to be complete enough that existing commercial & OSS databases can run on it unmodified.

  • It needs some sort of client binary that can tell the filesystem to inject a fault, and wait until the fault is complete. Maybe it puts a message saying “please do this kind of fault” onto a Unix domain socket and waits for a response, or reads and writes to a virtual file; a rough client sketch follows this list. Latency is important here: this client needs to start, run, and exit quickly, so we can go on to do other work. ~100 ms would be good.

  • Right now I just need one fault: losing un-synced data. I’d like to add more fault types later, so the protocol should be designed with extension in mind.

  • This should be easy to get running on Debian x64 and arm64 machines. Jepsen’s going to be responsible for getting this filesystem installed on each DB node automatically by running shell commands, and we don’t necessarily control the DB node environment. I don’t want to spend six hours trying to get some weird version of Avro or some sprawling NPM-based build system installed. A small C tarball with a makefile would be fantastic. Simple, stable toolchains please. Things that won’t explode when the compiler version changes three years from now.

  • The filesystem needs to handle ~50 gigabytes of data. We’re not talking huge data volumes, but it’s going to exceed RAM. The filesystem will need to efficiently spill to disk, while still being able to discard all (or at least a good chunk of) un-fsynced writes on command.

  • Some of the databases Jepsen tests are fairly IO intensive, especially on startup, so this thing needs to be Reasonably Fast. For instance, a DB might want to write 10 GB of data immediately on startup before it starts answering requests, and if it can’t do that fast enough, it’ll mess up the cluster join process. I’d like to target within ~10x the underlying filesystem’s latency/throughput, if that’s at all reasonable. I think this steers us away from languages like Python, but I’m not sure.

  • Docs. One of the problems in this space is that basically none of the extant projects have good documentation; you’re often stuck reading the source or tests. I’d like you to write docs with examples for this project.

  • Tests. There should be some kind of automated test that shows this thing works like a real filesystem, a safety test which demonstrates that it can induce data loss on command, and some kind of basic performance test that shows it can handle Moderately Chonky data volumes / IOPS.

  • This thing should have some kind of reasonable, commonplace OSS license, so I can either distribute it as a part of Jepsen (which is EPL) or at least download and install it on DB nodes automatically without getting users in legal trouble.
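As a concrete (and entirely hypothetical) example of the client binary and protocol mentioned above: a tiny program that writes a one-line command to a Unix domain socket and blocks until the filesystem answers. The socket path, the verb, and the “ok” reply are all made up for illustration; new fault types would just be new verbs.

    /* Hypothetical fault-injection client: send one command, wait for an
     * acknowledgement, exit. Socket path and wire format are illustrative. */

    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>
    #include <unistd.h>

    #define SOCK_PATH "/var/run/breakfs.sock"   /* assumed control socket */

    int main(int argc, char **argv) {
        const char *cmd = argc > 1 ? argv[1] : "drop-unsynced";

        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        strncpy(addr.sun_path, SOCK_PATH, sizeof addr.sun_path - 1);

        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0) { perror("socket"); return 1; }
        if (connect(fd, (struct sockaddr *)&addr, sizeof addr) < 0) {
            perror("connect");
            return 1;
        }

        /* One command per line; the filesystem replies "ok\n" (or an error)
         * only once the fault has actually taken effect. */
        dprintf(fd, "%s\n", cmd);

        char reply[128] = {0};
        ssize_t n = read(fd, reply, sizeof reply - 1);
        if (n <= 0 || strncmp(reply, "ok", 2) != 0) {
            fprintf(stderr, "fault failed: %s\n", n > 0 ? reply : "no reply");
            return 1;
        }
        return 0;
    }

Jepsen would then kill -9 the database, run something like breakfs-ctl drop-unsynced (a name I’m making up) on the node, wait for it to exit 0, and restart the database. The safety test from the Tests requirement is essentially this loop, plus a check that some acknowledged-but-unsynced write has actually vanished.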

Stretch Goals

  • More faults! Out-of-space would be a good one–databases routinely hit this in prod, and don’t necessarily handle it well. So would flushing some—but not all—of the page cache on simulated crash.

  • If you know a lot about the POSIX filesystem API and its Weird Ordering Semantics, that’s fantastic. I’d love it if you could read some of the academic literature on FS testing for faults that might be relevant, or make up your own kinds of interesting, “plausible-but-terrifying” faults. Bonus points if you can prove this kind of fault has happened in the wild.

  • I don’t have a good sense of the POSIX rules around directory and metadata operations and fsync vs fdatasync. It’d be great if you did understand these, and came up with a reasonable way to simulate how they’d behave under a node power failure.
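For what it’s worth, the subtlety I’m gesturing at in that last bullet is things like: fsyncing a newly created file does not, by itself, make the file’s directory entry durable; applications are supposed to fsync the parent directory too, and fdatasync makes even weaker promises about metadata. A lazy filesystem could plausibly model a crash as keeping a file’s data (if it was fsynced) while forgetting its name (if the directory wasn’t). The pattern applications are supposed to follow looks roughly like this (paths are placeholders):

    /* "Create a file durably" requires fsyncing both the file and its
     * parent directory. If the second fsync never happens, a simulated
     * power failure could legitimately forget the new directory entry. */

    #define _POSIX_C_SOURCE 200809L

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/mnt/lazyfs/data/new.log", O_CREAT | O_WRONLY, 0644);
        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, "hello\n", 6) != 6) { perror("write"); return 1; }
        if (fsync(fd) < 0) { perror("fsync file"); return 1; }  /* data + inode */
        close(fd);

        /* Without this, the file's *name* may not survive a crash. */
        int dir = open("/mnt/lazyfs/data", O_RDONLY | O_DIRECTORY);
        if (dir < 0) { perror("open dir"); return 1; }
        if (fsync(dir) < 0) { perror("fsync dir"); return 1; }
        close(dir);
        return 0;
    }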

Prior Art

This isn’t a new problem exactly–there have been several projects in this space, but I haven’t gotten any of them to work yet. Extending and documenting one of them might be a valid approach!

  • Confluent’s Kibosh feels the closest to what I want, but as far as I can tell from its code, it doesn’t have a way to lose un-synced data. This project might turn out to be “adding a page cache and a lose-unsynced-data fault to Kibosh”.

  • CuttleFS does post-fsync failure, but I can’t tell if that’s just issuing faults in response to fsync calls, or actually involves rolling back unsynced writes. It’s not clear if it can dynamically schedule faults–there’s an undocumented HTTP server lurking in the code, which might be promising. It also has a frustrating build toolchain, at least for a Python novice: I spent two days getting pip, ninja, meson, and the particular versions of libfuse and fusepy it needs installed, landed in Python dependency hell, and never got it working.

  • CharybdeFS relies on a particular version of Thrift, which (last time I tried) pulled in a bunch of dependencies whose versions weren’t readily available in Debian. I burned multiple days on this too and gave up. Maybe it’s better now?

  • PetardFS seems to rely on pre-determined schedules in a config file, and the documentation was somewhat unclear about how that file worked.

  • UnreliableFS doesn’t seem to do un-fsynced data loss, and it’s not clear if you can request a single fault through the virtual filesystem, or if you can only change ongoing faults.

Business Stuff

We’ll use an independent contractor agreement for this project; either your standard contract, or one Jepsen’s lawyers put together.

We’ll discuss work via email, chat, and/or voice/video calls. I imagine we’d have some initial chats about goals and what you plan to do. You’d go off and build the thing, checking in every few days to let me know how it’s going. Once you’ve got something runnable, I’ll start plugging it into Jepsen and make sure that I can run some Real Databases on top of it; we’ll make adjustments or bugfixes as necessary from there. The final deliverable would be something like a Git repo with code and documentation.

I’m guessing this project will take roughly 2-4 weeks of full-time work, but let me know what you think based on your experience. If you’ve written FUSE filesystems before and have a good idea of how you’d build a page cache, that’s great. If you’re new to this and don’t quite know what you’re doing, but want to give it a shot, I’m willing to work with that too. People from under-represented or marginalized communities are especially encouraged to apply!

I work on a week-to-week basis as a contractor myself; I’m happy to arrange either fixed or weekly rates. I believe in friendly communication, mutual respect, and doing right by others. Jepsen pays well and on time.

If you want to work on this, please email aphyr@jepsen.io with a few sentences about your expertise and how you might tackle this problem. :-)