List of resources on testing distributed systems curated by Andrey Satarin (@asatarin). If you are interested in my other stuff, check out the talks page. For any questions or suggestions, you can reach out to me on Twitter (@asatarin) or LinkedIn.
Contents
- Overview of testing approaches
- Specific approaches in different distributed systems
- Amazon Web Services
- Netflix
- Cassandra
- ScyllaDB
- VoltDB
- MemSQL
- CockroachLabs (CockroachDB)
- PingCap (TiDB)
- MongoDB
- Cloudera
- FoundationDB
- Wallaroo Labs
- Microsoft
- Dropbox
- Atomix Copycat
- Onyx
- Druid.io
- Salesforce
- InfluxDB
- Shopify
- Confluent (Kafka)
- Elastic (Elasticsearch)
- YugabyteDB
- FaunaDB
- Hazelcast
- Basho (Riak)
- CoreOS (etcd)
- Red Planet Labs
- Coil (TigerBeetle)
- Single node systems
- Tools
- Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems — great overview of how even simple testing can help a lot, you just need the right focus
- What Bugs Live in the Cloud? A Study of 3000+ Issues in Cloud Systems — study of actual bugs in different popular distributed systems (Hadoop MapReduce, HDFS, HBase, Cassandra, ZooKeeper and Flume)
- TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems — comprehensive taxonomy of bugs in distributed systems (Cassandra, Hadoop MapReduce, HBase, ZooKeeper)
- An Empirical Study on Crash Recovery Bugs in Large-Scale Distributed Systems — based on the bug database from the "What Bugs Live in the Cloud?" paper, researchers focus specifically on crash recovery bugs in Hadoop MapReduce, HBase, Cassandra, ZooKeeper. There is a review of this paper by Murat Demirbas on his blog.
- Redundancy does not imply fault tolerance: analysis of distributed storage reactions to single errors and corruptions — study of several distributed systems (Redis, ZooKeeper, MongoDB, Cassandra, Kafka, RethinkDB) on how fault tolerant they are to data corruption and read/write errors
- An empirical study on the correctness of formally verified distributed systems — study of bugs in formally verified distributed systems. Analysis includes Microsoft's IronFleet distributed key-value store built from a formal model.
- The Case for Limping-Hardware Tolerant Clouds — research on the effect of limping hardware on the performance of distributed systems (aka limplock), see also a great blog post by Dan Luu on a similar topic: Distributed systems: when limping hardware is worse than dead hardware
- Early detection of configuration errors to reduce failure damage — why and how to test configuration files of your system
- Why Is Random Testing Effective for Partition Tolerance Bugs? — just what it says in the title: the authors try to explain why random testing (Jepsen) is effective and introduce notions of test coverage relating to network partitions, see also "The Morning Paper" review or a video from POPL 2018.
- FlyMC: Highly Scalable Testing of Complex Interleavings in Distributed Systems — novel approach of systematically exploring interleavings in distributed systems augmented with static analysis and prioritization. This approach is faster than previous techniques and found old and new bugs in several systems (Cassandra, Ethereum Blockchain, Hadoop, Kudu, Raft LogCabin, Spark, ZooKeeper).
- What bugs cause cloud production incidents? — research focused on bugs (and their resolution strategies) that actually cause production incidents in large-scale distributed services at Microsoft Azure.
- Torturing Databases for Fun and Profit — checking ACID guarantees of open source and commercial databases under power loss, additional material
- Toward a Generic Fault Tolerance Technique for Partial Network Partitioning — overview of network partition failures in various distributed systems (MongoDB, HBase, HDFS, Kafka, RabbitMQ, Elasticsearch, Mesos, etc.), common traits among them and strategies to mitigate those failures.
Colin Scott shares his viewpoint from academia on testing distributed systems, specifically regression testing for correctness and performance bugs.
- Technologies for Testing Distributed Systems, Part I
- See also post Distributed Systems Testing: The Lost World by Crista Lopes
Great overview of techniques for testing distributed systems from a practitioner. The video has aged well and is still an extremely good overview of the landscape. Additional materials can be found in this GitHub repo.
These materials are not directly related to testing distributed systems, but they greatly contribute to general understanding of such systems.
- Velocity NY 2013: Richard Cook, "Resilience In Complex Adaptive Systems"
- Velocity 2012: Richard Cook, "How Complex Systems Fail"
- How Complex Systems Fail
State of the art approach to testing stateful distributed systems.
- Jepsen Analyses — most recent Jepsen analyses of different distributed systems
- Jepsen Talks — talks by Kyle Kingsbury at various conferences
- Aphyr's Jepsen posts — older Jepsen analyses on Kyle Kingsbury's (Aphyr) personal site
- Jepsen Talks on Github — Jepsen talks slides before 2015 on Github
- Kyle Kingsbury on InfoQ
- Call me maybe: Jepsen and flaky networks — talk on Jepsen, not by Kyle
- Jepsen is used by Microsoft CosmosDB — founder of Azure CosmosDB confirms that they are using Jepsen
Elle transactional consistency checker for black-box databases:
- Elle source code
- Black-box Isolation Checking with Elle — talk Kyle gave at the CMU DB database seminar describing Elle and results obtained with it
- Elle: Inferring Isolation Anomalies from Experimental Observations — paper on Elle design by Kyle Kingsbury and Peter Alvaro
Some notable Jepsen analyses:
- Jepsen: CockroachDB beta-20160829
- Jepsen: VoltDB 6.3
- Jepsen: RethinkDB 2.2.3 reconfiguration
- Jepsen: RethinkDB 2.1.5
Jepsen is used by CockroachDB, VoltDB, Cassandra, ScyllaDB and others.
- The verification of a distributed system by Caitie McCaffrey, also a podcast and a talk on InfoQ.com, accompanying materials on GitHub, and a slide deck
- Designing Distributed Systems in TLA+ by Hillel Wayne, and talk Everything about distributed systems is terrible
- Comparisons of Alloy and Spin
- Verdi: Formally Verifying Distributed Systems
- Verdi — A framework for formally verifying distributed systems implementations in Coq
- Network Semantics for Verifying Distributed Systems
- Proving that Android’s, Java’s and Python’s sorting algorithm is broken (and showing how to fix it) — using formal verification to find a bug in TimSort sorting algorithm
- Proving JDK’s Dual Pivot Quicksort Correct — analyzing the quicksort implementation in Java
Companies using TLA+ to verify correctness of algorithms:
- Amazon Web Services
- PingCap for TiDB
- MongoDB
- Microsoft for services in Azure cloud
- Confluent for Apache Kafka
Netflix adopted lineage-driven fault injection techniques for testing microservices.
- Principles of Chaos Engineering
- Free Chaos Engineering book by Netflix engineers
- A curated list of awesome Chaos Engineering resources
Netflix pioneered chaos engineering discipline.
There are two flavors of fuzzing. First, randomized concurrency testing, where the ordering of messages is fuzzed:
And input fuzzing, where message contents or user inputs are fuzzed:
- DNS parser, meet Go fuzzer
- Fuzz Testing with afl-fuzz (American Fuzzy Lop)
- Randomized testing for Go and talk on this tool GopherCon 2015: Dmitry Vyukov — Go Dynamic Tools
- Simple guided fuzzing for libraries using LLVM's new libFuzzer
- LibFuzzer – a library for coverage-guided fuzz testing
- How Heartbleed could've been found — example of how fuzzing could have been used to find the famous Heartbleed vulnerability
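As a rough illustration of input fuzzing, here is a minimal Python sketch of a mutation-based fuzzer. The `parse_message` wire format, its deliberate bug, and all function names are hypothetical and made up for this example. Real tools such as afl-fuzz and libFuzzer are coverage-guided, which this sketch is not.

```python
import random
from typing import Optional


def parse_message(data: bytes) -> bytes:
    """Toy wire format (hypothetical): one length byte, then the payload."""
    if not data:
        raise ValueError("empty message")
    declared = data[0]
    # Deliberate bug for the fuzzer to find: a truncated payload is silently accepted.
    return data[1:1 + declared]


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Apply one random mutation: overwrite a byte, truncate, or append a byte."""
    buf = bytearray(data)
    choice = rng.randrange(3)
    if choice == 0 and buf:
        buf[rng.randrange(len(buf))] = rng.randrange(256)
    elif choice == 1 and buf:
        del buf[rng.randrange(len(buf)):]
    else:
        buf.append(rng.randrange(256))
    return bytes(buf)


def violates_invariant(data: bytes) -> bool:
    """Invariant: the parser rejects the input or returns exactly the declared number of bytes."""
    try:
        payload = parse_message(data)
    except ValueError:
        return False
    return len(payload) != data[0]


def fuzz(seed_input: bytes, iterations: int = 1000, seed: int = 0) -> Optional[bytes]:
    """Mutate a valid seed input until the invariant breaks; return the offending input."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    for _ in range(iterations):
        candidate = mutate(seed_input, rng)
        if violates_invariant(candidate):
            return candidate
    return None


if __name__ == "__main__":
    print(fuzz(b"\x05hello"))
```

The fuzzer starts from one well-formed message and checks a single invariant after every mutation, so any input it returns is a concrete reproducer for the bug.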
Amazing and comprehensive overview of different strategies to test systems built with microservices by Cindy Sridharan.
Series of blog posts specifically on testing in production — best practices, pitfalls, etc.:
- Your Load Generator Is Probably Lying To You
- Everything You Know About Latency Is Wrong — great overview of Gil Tene's "How NOT to Measure Latency" talk
- "How NOT to Measure Latency" by Gil Tene
- "Benchmarking: You're Doing It Wrong" by Aysylu Greenberg
- Performance Analysis Methodology — approaches developed by Brendan Gregg for analysing performance in systematic fashion
See also benchmarking tools.
- Minimizing Faulty Executions of Distributed Systems — reducing the size of buggy executions to make them easier to understand. 60 minute talk here
- Troubleshooting Blackbox SDN Control Software with Minimal Causal Sequences — similar to above, but requires less instrumentation.
- Concurrency Debugging with Differential Schedule Projections — find and minimize concurrency bugs using program analysis. Shared memory systems are equivalent to message passing systems, so you can apply the same techniques to distributed systems.
- "Simulation Testing" by Michael Nygard
- Testing Distributed Systems for Linearizability
- Metamorphic Testing — overview of what metamorphic testing is and where it can help. For more details see paper "Metamorphic Testing: A Review of Challenges and Opportunities".
- The Evolution of Testing Methodology at AWS: From Status Quo to Formal Methods with TLA+
- Use of Formal Methods at Amazon Web Services
- CACM Article "How Amazon Web Services Uses Formal Methods"
- Debugging Designs by Chris Newcombe; there is also a source bundle
- Millions of tiny databases — has a section on testing which describes several approaches: SimWorld simulation resembling the approach used at FoundationDB, use of Jepsen and formal methods, and game days.
- Using lightweight formal methods to validate a key-value storage node in Amazon S3 — paper on verifying correctness of a new key-value storage node implementation in S3. They are using property-based testing and stateless model checking extensively to balance trade-offs and follow pragmatic approach. I gave a talk "Formal Methods at Amazon S3" on this paper for a reading group.
See also formal methods section.
Automated failure injection (see also Lineage-driven Fault Injection):
- Monkeys in Lab Coats: Applying Failure Testing Research @Netflix
- “Monkeys in Labs Coats”: Applied Failure Testing Research at Netflix
- Automated Failure Testing
- Automating Failure Testing Research at Internet Scale by P. Alvaro et al.
Random/manual failure injection testing:
- Netflix Simian Army
- Failure Injection Testing
- From Chaos to Control — Testing the resiliency of Netflix’s Content Discovery Platform
- Breaking Bad at Netflix: Building Failure as a Service
- GTAC 2014: I Don't Test Often ... But When I Do, I Test in Production — different testing strategies at Netflix
See also Chaos Engineering.
- Testing Apache Cassandra with Jepsen
- Testing Cassandra Guarantees under Diverse Failure Modes with Jepsen
- Jepsen Cassandra Testing on Git
- Netflix A STATE OF XEN — CHAOS MONKEY & CASSANDRA from Cassandra Summit 2015
- Testing Apache Cassandra with Jepsen: How to Understand and Produce Safe Distributed Systems by Joel Knighton presented at Devoxx UK 2016
- Testing Apache Cassandra 4.0 — quick overview of approaches used to test next major version of Cassandra
- Fallout — tool to run distributed tests as a service. It is meant to easily orchestrate cluster creation and testing tools like Jepsen, performance testing tools and others, through extension and combining them in various ways with environmental conditions. It can run tests either locally or on large scale clusters.
- Cassandra Harry — Fuzz testing tool for Apache Cassandra. Aims to provide reproducible workloads to test correctness of Apache Cassandra.
- Fuzz Testing and Verification of Apache Cassandra with "Harry" — talk on Harry fuzz testing tool by Alex Petrov at ApacheCon 2021
- Harry, an Open Source Fuzz Testing and Verification Tool for Apache Cassandra by Alex Petrov — blog post about Harry fuzz testing tool for Apache Cassandra and how it helps to find bugs
They published a series of blog posts on testing ScyllaDB:
- Scylla testing part 1: Cassandra compatibility testing
- Scylla testing part 2: Extending Jepsen for testing Scylla
- CharybdeFS: a new fault-injecting filesystem for software testing
- Testing part 4: Distributed tests
- Testing part 5: Longevity testing
- Fault-injecting filesystem cookbook Video from Scylla Summit 2017 on testing
- How We Constantly Try to Bring Scylla to its Knees and slides — overview of different testing types at ScyllaDB
- Project Gemini: An Open Source Automated Random Testing Suite for Scylla and Cassandra Clusters — random test generator comparing results from a cluster with injected faults against a single node running without faults. Works on top of the CQL API and is suitable for testing any database implementing it. See also the talk on Project Gemini and the open source code
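The idea of comparing a faulted cluster against a fault-free single node can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ToyCluster`, its deliberate missing-repair bug, and the operation mix are hypothetical and are not taken from Project Gemini, which works through CQL against real clusters.

```python
import random
from typing import Dict, List, Optional, Tuple


class Oracle:
    """Stand-in for the single node running without faults: a plain dictionary."""

    def __init__(self) -> None:
        self.data: Dict[str, int] = {}

    def write(self, key: str, value: int) -> None:
        self.data[key] = value

    def read(self, key: str) -> Optional[int]:
        return self.data.get(key)


class ToyCluster:
    """Hypothetical two-replica store. Deliberate bug: a replica that misses
    writes while it is down is never repaired after it comes back."""

    def __init__(self) -> None:
        self.replicas: List[Dict[str, int]] = [{}, {}]
        self.down = [False, False]
        self.reads = 0

    def write(self, key: str, value: int) -> None:
        for replica, is_down in zip(self.replicas, self.down):
            if not is_down:
                replica[key] = value

    def read(self, key: str) -> Optional[int]:
        live = [i for i in range(2) if not self.down[i]]
        idx = live[self.reads % len(live)]  # round-robin over live replicas
        self.reads += 1
        return self.replicas[idx].get(key)


def run(seed: int, steps: int = 200, inject_faults: bool = True) -> Optional[Tuple[int, str]]:
    """Apply the same random workload to both systems; return the first (step, key)
    where a read diverges, or None if they always agree."""
    rng = random.Random(seed)
    oracle, cluster = Oracle(), ToyCluster()
    for step in range(steps):
        roll = rng.random()
        key = rng.choice(["a", "b", "c"])
        if roll < 0.1:
            if inject_faults:
                cluster.down[1] = True   # crash replica 1 (replica 0 stays up)
        elif roll < 0.2:
            if inject_faults:
                cluster.down[1] = False  # restart replica 1 with stale data
        elif roll < 0.6:
            value = rng.randrange(100)
            oracle.write(key, value)
            cluster.write(key, value)
        elif oracle.read(key) != cluster.read(key):
            return step, key
    return None


if __name__ == "__main__":
    print(run(seed=1))
```

Because the workload is driven by a seeded generator, any divergence can be replayed by rerunning with the same seed.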
Series of posts on testing at VoltDB:
- How We Test at VoltDB
- Testing at VoltDB: SQLCoverage — describes how they test SQL query functionality using 5 million queries generated from templates and comparing results against HSQLDB
- Testing VoltDB Against PostgreSQL
- VoltDB 6.4 Passes Official Jepsen Testing — VoltDB hired Kyle Kingsbury (Jepsen) to test their database, they share results in this post
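Template-based query generation with a reference comparison, as described for SQLCoverage above, might look roughly like the following Python sketch. It is not VoltDB's tooling: an in-memory SQLite database stands in for the system under test, a pure-Python evaluator stands in for the reference database (HSQLDB in the post), and the template, table, and function names are assumptions made for this example.

```python
import operator
import random
import sqlite3
from typing import List

# One query template; the placeholders are filled in randomly.
TEMPLATE = "SELECT {agg} FROM t WHERE x {op} {n}"

# Reference semantics for each aggregate. SQL returns NULL (None in Python)
# for SUM/MIN/MAX over an empty set, and 0 for COUNT(*).
AGGREGATES = {
    "COUNT(*)": len,
    "SUM(x)": lambda rows: sum(rows) if rows else None,
    "MIN(x)": lambda rows: min(rows) if rows else None,
    "MAX(x)": lambda rows: max(rows) if rows else None,
}
OPS = {
    "<": operator.lt,
    "<=": operator.le,
    "=": operator.eq,
    ">": operator.gt,
    ">=": operator.ge,
}


def check(seed: int, queries: int = 500) -> List[str]:
    """Run generated queries against SQLite and the Python reference;
    return the SQL text of every query whose results differ."""
    rng = random.Random(seed)
    rows = [rng.randrange(-50, 50) for _ in range(100)]

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (x INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(r,) for r in rows])

    mismatches = []
    for _ in range(queries):
        agg = rng.choice(sorted(AGGREGATES))
        op = rng.choice(sorted(OPS))
        n = rng.randrange(-60, 60)
        sql = TEMPLATE.format(agg=agg, op=op, n=n)
        actual = conn.execute(sql).fetchone()[0]
        expected = AGGREGATES[agg]([r for r in rows if OPS[op](r, n)])
        if actual != expected:
            mismatches.append(sql)
    conn.close()
    return mismatches


if __name__ == "__main__":
    print(check(seed=0))
```

Returning the failing SQL text rather than a boolean keeps each mismatch directly reproducible against both systems.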
Additional resources:
- "All In With Determinism for Performance and Testing in Distributed Systems" by John Hugg and a slide deck Hugg-DeterministicDistributedSystems.pdf
- SelfCheck workload
- TPC-C implementation
- Running MemSQL’s 107 Node Test Infrastructure on CoreOS
- Practical Techniques to Achieve Quality in Large Software Projects
- How to Make a Believable Benchmark
- Building an Infinitely Scalable Testing System — description of internal test system PsyDuck
- DIY Jepsen Testing CockroachDB — great read about using Jepsen at Cockroach Labs
- CockroachDB Beta Passes Jepsen Testing — CockroachDB tested by Kyle Kingsbury (Jepsen.io)
- Introducing Pebble: A RocksDB Inspired Key-Value Store Written in Go — introduces new storage engine and includes thorough discussion on what it takes to properly test storage engine
- Use Chaos to test the distributed system linearizability — describes Jepsen-like framework implemented in Go and used at PingCap to test TiDB
- A test framework for linearizability check with Go — Chaos is a Jepsen-like framework written in Go
- Testing Distributed Systems for Linearizability — linearizability testing library used by Chaos framework
- Chaos Tools and Techniques for Testing the TiDB Distributed NewSQL Database and the same post on company blog
- Official Jepsen report on TiDB 2.1.7 and companion blog post in company blog
- Safety First! Common Safety Pitfalls in Distributed Databases Found by Jepsen Tests — overview of Jepsen approach and tests with quick refresher on results for different databases to date
- https://github.com/pingcap/tla-plus — formal specification in TLA+ of Raft consensus protocol and implementation of distributed transactions in TiDB
- Testing Cloud-Native Databases with Chaos Mesh — talk on Chaos Mesh and how it is used for testing TiDB at PingCap. Blog post with introduction to Chaos Mesh and how it integrates with Kubernetes. See also Chaos Mesh source code and chaos engineering section.
See also formal methods section.
- MongoDB’s JavaScript Fuzzer: Creating Chaos (1/2)
- MongoDB’s JavaScript Fuzzer: Harnessing the Havoc (2/2)
- Fixing a MongoDB Replication Protocol Bug with TLA+ by William Schultz — how MongoDB uses formal verification with TLA+ to check correctness of their replication protocol. Describes how replication bugs could have been found with help of formal model.
- eXtreme Modelling in Practice - two attempts at MongoDB to check that code conforms to its formal model.
- Change Point Detection in Software Performance Testing — paper on how MongoDB team automatically detects performance degradations in the presence of noise in continuous integration runs. The paper was presented at ICPE 2020
See also formal methods section.
- Quality Assurance at Cloudera: Fault Injection and Elastic Partitioning — Cloudera describes their approach to fault injection testing
- Quality Assurance at Cloudera: Highly-Controlled Disk Injection
- "Testing Distributed Systems w/ Deterministic Simulation" by Will Wilson — talk on FoundationDB simulation testing. Their architecture was built from the ground up to support fully deterministic simulation testing
- Simulation and Testing — public overview of FoundationDB simulation testing framework
- FoundationDB or: How I Learned to Stop Worrying and Trust the Database by Markus Pilman from Snowflake — updated talk on testing FoundationDB with deterministic simulation. Markus goes into detail on what it takes to build deterministic simulation into a database. He mentions that it took two years to build a simulation framework before the FoundationDB team started working on a database.
- "Buggify — Testing Distributed Systems with Deterministic Simulation" — Alex Miller (https://twitter.com/oytyafln), one of the developers at FoundationDB, describes BUGGIFY macros, which help bias simulation tests towards doing dangerous and bug-finding things. This is a good example of cooperation between testing efforts and production code.
- "FoundationDB: A Distributed Unbundled Transactional Key Value Store" — SIGMOD 2021 paper on FoundationDB has a very detailed section on simulation testing at FoundationDB with discussions on determinism, test oracles, fault injection and limitations.
- Measuring Correctness of State in a Distributed System — describes the general idea and an implementation of how to test the safety of a distributed stream processing system
- Performance testing a low-latency stream processing system — high level overview of what to look at when testing performance of stream processing system
- How We Test the Stateful Autoscaling of Our Stream Processing System — advanced safety tests for autoscaling stateful stream processing
- All posts on testing from the Wallaroo engineering blog
There is also a talk by Sean T. Allen on testing the stream processing system at Wallaroo Labs (ex. Sendence)
- Materials on Sean's blog "CodeMeshIO: How Did I Get Here?"
- Video from QCon NY 2016 on InfoQ
- Video from CodeMeshIO on YouTube
- Presentation on Speakerdeck
- Efficient Exploratory Testing of Concurrent Systems — They don't mention it but looks like they describe testing of Google Omega
- Exploratory Testing Architecture (ETA)
- Paxos Made Live — An Engineering Perspective has a section on testing
- 10 Years of Crashing Google describes some war stories from Disaster Recovery Testing (DiRT) team at Google
- Testing for Reliability chapter from Google Site Reliability Engineering book
- Randomized Testing of Cloud Spanner — overview of randomized testing at Cloud Spanner, including how to scale it to large datasets and high concurrency
- Asynchronous programming, analysis and testing with state machines — Open source language for building distributed systems. Language is designed with tooling in mind, particularly, automatic exploration of message orderings in order to find bugs.
- Uncovering Bugs in Distributed Storage Systems during Testing (not in Production!)
- Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency describes "Pressure Point Testing" approach used for Azure Cloud Storage
- Inside Azure Search: Chaos Engineering
- TLA+ at Microsoft: 16 Years in Production by David Langworthy — how rejuvenation of TLA+ happened at Microsoft in 2016 and onwards
See also formal methods section.
- Mysteries of Dropbox Property-Based Testing of a Distributed Synchronization Service — example of how to use QuickCheck to test synchronisation in Dropbox and similar tools (Google Drive). John Hughes gave a talk on this. See also QuickCheck.
- Data Checking at Dropbox — If you have lots of data, you have to verify that it doesn't bit rot and protect it against rare bugs (e.g. race conditions) to guarantee long term durability. This talk explains the intricacies of building data consistency checker(s) at Dropbox scale.
- Dropbox's Exabyte Storage System (aka Magic Pocket) talk by James Cowling — describes a number of strategies to achieve extremely high durability.
This includes:
- guard against faulty disks,
- guard against software defects,
- guard against black swan events,
- operational safeguards to reduce blast radius,
- safeguards against deletes with multi stage soft-delete,
- comprehensive testing strategy in-depth with increased scale,
- redundancy across various axes in software and hardware stacks,
- continuous data integrity validation on many levels,
- etc
- Testing sync at Dropbox — comprehensive overview of two test frameworks at Dropbox for the new sync engine implementation. CanopyCheck — single threaded and fully deterministic randomized testing framework with minimization for the synchronization planner component of the engine. The other framework, Trinity, focuses on concurrency and a larger surface area of components. Great discussion on tradeoffs between determinism, strength of test oracles vs width of coverage and size of the system under test.
- Simoorg Failure inducer framework — Failure inducer implemented in Python
- A Deep Dive into Simoorg
- Dynamometer: Scale Testing HDFS on Minimal Hardware with Maximum Fidelity — testing scalability of large Hadoop clusters (namely NameNode) with just a fraction of the nodes
- Resiliency Testing with Toxiproxy
- Toxiproxy — A TCP proxy to simulate network and system conditions for chaos and resiliency testing
- Kafka Fault Injection framework
- TLA+ specification of the Kafka replication protocol and talk about using TLA+ for hardening Kafka replication protocol
See also formal methods section.
- Growing a protocol — applying lineage driven fault injection to test Elasticsearch replication protocol
- Using TLA+ for fun and profit in the development of Elasticsearch by Yannick Welsch — Elasticsearch uses TLA+ to verify correctness of their replication protocol
- Jepsen Testing on YugabyteDB — YugabyteDB describes how they use Jepsen
- YugabyteDB 1.1.9 analysis by Kyle Kingsbury — Kyle explores safety of YugabyteDB. Accompanying post in company blog "YugabyteDB 1.2 Passes Jepsen Testing" and "Wrapping Up: Jepsen Test Results for YugabyteDB 1.2 Webinar" post with webinar recording by Kyle and Karthik Ranganathan (Yugabyte CTO).
- YugabyteDB 1.3.1 — Jepsen analysis of YugabyteDB support for serializable SQL transactions. Companion blog post on the company website.
- Verifying Transactional Consistency with Jepsen — results of internal Jepsen testing at FaunaDB
- Jepsen: FaunaDB 2.5.4 — official Jepsen test for FaunaDB, write-up in Fauna blog
- Testing the CP Subsystem with Jepsen — overview of how Jepsen is used to test the Hazelcast in-memory data grid CP subsystem
- Testing Eventual Consistency in Riak — how to model an eventually consistent database in QuickCheck and find bugs in its implementation, video available on YouTube
- Modeling Eventual Consistency Databases with QuickCheck — another talk on testing Riak eventual consistency guarantees with QuickCheck
- Testing distributed systems in Go — overview of failure injection testing for etcd
- Where we’re going, we don’t need threads: Simulating Distributed Systems — following FoundationDB steps, Red Planet Labs uses deterministic simulation for testing. Their formula for success is "deterministic simulation = no parallelism + quantized execution + deterministic behavior".
- Simulation Tests in TigerBeetle — TigerBeetle is a distributed financial accounting database built in the Zig programming language that uses simulation tests inspired by Dropbox and FoundationDB.
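To make the "no parallelism + quantized execution + deterministic behavior" formula quoted above more concrete, here is a minimal Python sketch of a deterministic simulation. It is a toy under stated assumptions: the client/server counter protocol, the 20% loss rates, and the `dedup` switch are invented for illustration and do not come from FoundationDB, Red Planet Labs, or TigerBeetle.

```python
import random
from typing import List, Tuple


def simulate(seed: int, requests: int = 20, dedup: bool = True) -> Tuple[int, List[str]]:
    """Simulate a client sending `requests` increments to a server over a lossy,
    reordering network. Returns the final server counter and an event trace."""
    rng = random.Random(seed)          # the only source of nondeterminism
    in_flight = list(range(requests))  # request ids waiting to be delivered
    applied = set()                    # request ids the server has already applied
    counter = 0
    trace: List[str] = []
    tick = 0                           # logical clock, no wall-clock time

    # No parallelism: a single loop. Quantized execution: one message per tick.
    while in_flight:
        tick += 1
        # The seeded generator picks which message arrives next (reordering).
        req = in_flight.pop(rng.randrange(len(in_flight)))

        if rng.random() < 0.2:
            # Injected fault: the request is lost and the client retries it.
            trace.append(f"{tick}: request {req} dropped")
            in_flight.append(req)
            continue

        if not dedup or req not in applied:
            counter += 1
            applied.add(req)

        if rng.random() < 0.2:
            # Injected fault: the ack is lost, so the client resends the request.
            trace.append(f"{tick}: request {req} applied, ack lost")
            in_flight.append(req)
        else:
            trace.append(f"{tick}: request {req} applied, ack delivered")

    return counter, trace


if __name__ == "__main__":
    count, events = simulate(seed=42)
    print(count, len(events))
```

Since every scheduling and fault decision comes from one seeded generator, rerunning with the same seed replays exactly the same trace, which is what makes a failing run debuggable. Setting `dedup=False` shows the kind of bug such a simulation tends to surface: retried requests are applied twice and the counter overshoots.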
These examples are not about distributed systems, but they demonstrate testing concurrency and the level of sophistication required in distributed systems.
SQLite is not a distributed system by any stretch of the imagination, but provides good example of comprehensive testing of a database implementation.
- Finding bugs in SQLite, the easy way — how fuzzing is used in testing the SQLite database
- How SQLite Is Tested
- Sled simulation guide (jepsen-proof engineering) — guide on simulation testing (see FoundationDB) in Sled database
- Reliable Systems Series: Model-Based Testing
- Fuzzing ClickHouse — high level overview of query fuzzing at ClickHouse
- Comcast — Simulating shitty network connections so you can build better systems
- Muxy Simulating real-world distributed system failures
- Namazu — Programmable fuzzy scheduler for testing distributed systems
- Toxiproxy — A TCP proxy to simulate network and system conditions for chaos and resiliency testing
- Traffic Control
- Python API for Linux Traffic Control
- Slow tool
- Blockade is a utility for testing network failures and partitions in distributed applications
- DEMi: Distributed Execution Minimizer for Akka
- Chaos Mesh — chaos engineering platform for Kubernetes. See also PingCap, company behind Chaos Mesh.
- PolyConf 14: Testing the Hard Stuff and Staying Sane / John Hughes
- The Joy of Testing
- John Hughes on InfoQ
- Hansei: Property-based Development of Concurrent Systems
- QuickChecking Poolboy for Fun and Profit — from Basho
- Combining Fault-Injection with Property-Based Testing
- Testing Telecoms Software with Quviq QuickCheck
- Fuzz testing distributed systems with QuickCheck — using QuickCheck to test Raft protocol implementation in Haskell