ProductDocumentationExamplesBlogRoadmapGitHubGet Started
Available

Chaos Testing

The cross-crate chaos suite — fault injection, recovery, and FMEA.

The krishiv-chaos crate runs a cross-crate suite of fault-injection scenarios. It is excluded from clippy and publish and is the only place where the entire system is exercised end-to-end under deliberate failure.

Running the suite

cargo test -p krishiv-chaos

What it covers

ScenarioFailure injectedRecovery property
executor_crash_during_window_aggExecutor killed mid-checkpointJob restarts from last committed epoch. No duplicate output for sinks with exactly-once delivery.
coordinator_restart_during_submitCoordinator restarted while a job is being submittedSubmit retries idempotently; job either runs once or errors cleanly.
network_partition_coordinator_executorPartition longer than heartbeat timeoutExecutor is marked lost; tasks reassigned; no double-commit thanks to fencing tokens.
checkpoint_storage_5xxObject store returns 5xx during checkpointCheckpoint is retried; on permanent failure, the job fails rather than commit a partial state.
ivm_compute_under_partitionCoordinator splits an IVM step across executors during a network partitionSteps are serialized per-job via step_lock; no double-compute or lost output.
shuffle_spill_disk_fullLocal shuffle directory runs out of spaceJob fails with a clear error; partial results discarded.
kafka_consumer_rebalanceConsumer rebalance mid-batchAt-least-once delivery preserved; exactly-once sources commit offsets transactionally with the checkpoint.
iceberg_commit_conflictTwo writers commit to the same Iceberg snapshotConflict is detected and retried with a fresh snapshot id; no data loss.

Extending the suite

To add a new scenario, write a test in crates/krishiv-chaos/tests/ and pull in the subsystems you want to fail. The chaos feature on krishiv-common exposes shared fault-injection helpers:

  • chaos::kill_executor(executor_id)
  • chaos::partition_between(a, b, duration)
  • chaos::corrupt_responses_from(a, fraction)
  • chaos::slow_io_to(a, latency)

See also