Chaos Testing
The cross-crate chaos suite — fault injection, recovery, and FMEA.
The krishiv-chaos crate runs a cross-crate suite of fault-injection scenarios. It is excluded from clippy and publish and is the only place where the entire system is exercised end-to-end under deliberate failure.
Running the suite
cargo test -p krishiv-chaos
What it covers
| Scenario | Failure injected | Recovery property |
|---|---|---|
executor_crash_during_window_agg | Executor killed mid-checkpoint | Job restarts from last committed epoch. No duplicate output for sinks with exactly-once delivery. |
coordinator_restart_during_submit | Coordinator restarted while a job is being submitted | Submit retries idempotently; job either runs once or errors cleanly. |
network_partition_coordinator_executor | Partition longer than heartbeat timeout | Executor is marked lost; tasks reassigned; no double-commit thanks to fencing tokens. |
checkpoint_storage_5xx | Object store returns 5xx during checkpoint | Checkpoint is retried; on permanent failure, the job fails rather than commit a partial state. |
ivm_compute_under_partition | Coordinator splits an IVM step across executors during a network partition | Steps are serialized per-job via step_lock; no double-compute or lost output. |
shuffle_spill_disk_full | Local shuffle directory runs out of space | Job fails with a clear error; partial results discarded. |
kafka_consumer_rebalance | Consumer rebalance mid-batch | At-least-once delivery preserved; exactly-once sources commit offsets transactionally with the checkpoint. |
iceberg_commit_conflict | Two writers commit to the same Iceberg snapshot | Conflict is detected and retried with a fresh snapshot id; no data loss. |
Extending the suite
To add a new scenario, write a test in crates/krishiv-chaos/tests/ and pull in the subsystems you want to fail. The chaos feature on krishiv-common exposes shared fault-injection helpers:
chaos::kill_executor(executor_id)chaos::partition_between(a, b, duration)chaos::corrupt_responses_from(a, fraction)chaos::slow_io_to(a, latency)