Distributed Tracing
W3C trace context, OTLP exporters, and how traces are propagated across gRPC and Flight.
Krishiv emits W3C trace-context spans for every operation that crosses a meaningful boundary: SQL planning, task assignment, checkpoint barrier, shuffle fetch, and every gRPC call.
Span pipeline
The default pipeline (TracerExporter::Otlp) looks like:
Session::sql()
└─ DataFusion::plan
   └─ krishiv_sql::plan_sql
      └─ Coordinator::submit_job
         └─ Scheduler::schedule
            └─ Executor::run_task
               └─ Dataflow::execute_window
                  └─ StateBackend::get ← (span propagates through state I/O)
Every span is annotated with the standard fields: job_id, task_id, epoch, operator_id, plus the current traceparent and tracestate.
gRPC propagation
Every gRPC service in krishiv-proto (coordinator-to-executor, executor-to-executor, coordinator-management) injects and extracts traceparent via inject_trace_context / extract_trace_context in krishiv_metrics::grpc. As a result, a single trace covers both the coordinator and the executor that ran a task.
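The inject/extract round trip described above can be sketched with the standard library alone. This is an illustration of the pattern, not the krishiv-metrics implementation: the `Metadata` alias stands in for gRPC request metadata, and `TraceContext`, `inject`, and `extract` are hypothetical names whose signatures are assumptions.

```rust
use std::collections::HashMap;

/// Stand-in for gRPC request metadata (assumption: lowercase ASCII keys).
type Metadata = HashMap<String, String>;

/// Trace context carried between processes (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
struct TraceContext {
    traceparent: String,
    tracestate: Option<String>,
}

/// Client side: copy the caller's context into the outgoing metadata.
fn inject(ctx: &TraceContext, md: &mut Metadata) {
    md.insert("traceparent".to_string(), ctx.traceparent.clone());
    // tracestate is optional; only send it when the caller has one.
    if let Some(ts) = &ctx.tracestate {
        md.insert("tracestate".to_string(), ts.clone());
    }
}

/// Server side: recover the context from incoming metadata.
/// Returns None when no traceparent is present, in which case the
/// receiver would start a new trace.
fn extract(md: &Metadata) -> Option<TraceContext> {
    let traceparent = md.get("traceparent")?.clone();
    Some(TraceContext {
        traceparent,
        tracestate: md.get("tracestate").cloned(),
    })
}
```

Because the receiving side rebuilds the same context the sender held, spans created on the executor attach to the trace that the coordinator started.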
Configuration
| Exporter | When to use |
|---|---|
| TracerExporter::Otlp | Default. Pushes to OTEL_EXPORTER_OTLP_ENDPOINT over gRPC. |
| TracerExporter::Stdout | Local dev. Prints spans to stderr. |
| TracerExporter::InMemory(exporter) | Tests. Inspect spans in-process. |
API
use tracing::info_span;
let span = info_span!("checkpoint_commit", job_id = %job_id, epoch).entered();
// ... work that should be tracked
drop(span); // exits the span
Cross-process context:
use krishiv_metrics::{current_traceparent, current_tracestate};
let tp = current_traceparent(); // e.g. "00-aabbcc..-112233..-01"
let ts = current_tracestate(); // vendor-specific
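For readers who need to inspect the value returned by current_traceparent(), the following is a minimal sketch of a strict parser for the W3C version 00 layout (`version-traceid-parentid-flags`). `TraceParent` and `parse_traceparent` are hypothetical names introduced for illustration and are not part of krishiv_metrics.

```rust
/// Parsed fields of a W3C `traceparent` header (version 00 layout).
#[derive(Debug, PartialEq, Eq)]
pub struct TraceParent {
    pub trace_id: String,  // 32 lowercase hex characters
    pub parent_id: String, // 16 lowercase hex characters
    pub flags: u8,         // bit 0 is the "sampled" flag
}

/// True when `s` is exactly `len` lowercase hex characters.
fn is_lower_hex(s: &str, len: usize) -> bool {
    s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
}

/// Parses a version 00 traceparent; returns None for anything else.
pub fn parse_traceparent(header: &str) -> Option<TraceParent> {
    let mut parts = header.trim().split('-');
    let version = parts.next()?;
    let trace_id = parts.next()?;
    let parent_id = parts.next()?;
    let flags = parts.next()?;
    // This sketch only accepts version 00, which has exactly four fields.
    if version != "00" || parts.next().is_some() {
        return None;
    }
    if !is_lower_hex(trace_id, 32) || !is_lower_hex(parent_id, 16) || !is_lower_hex(flags, 2) {
        return None;
    }
    // All-zero trace and parent IDs are invalid per the specification.
    if trace_id.bytes().all(|b| b == b'0') || parent_id.bytes().all(|b| b == b'0') {
        return None;
    }
    Some(TraceParent {
        trace_id: trace_id.to_string(),
        parent_id: parent_id.to_string(),
        flags: u8::from_str_radix(flags, 16).ok()?,
    })
}

impl TraceParent {
    /// Whether the caller recorded this trace (sampled flag set).
    pub fn is_sampled(&self) -> bool {
        self.flags & 0x01 == 0x01
    }
}
```

The trace ID is shared by every span in the trace, while the parent ID changes at each hop, which is what lets a backend stitch coordinator and executor spans into one tree.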
gRPC duration histogram
The GrpcDurationLayer middleware records krishiv_grpc_call_duration_seconds for each gRPC method. To add it to an existing tower service, wrap the service in GrpcDurationService.
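To make the idea of a per-method duration histogram concrete, here is a standard-library sketch of what such a recorder could look like. It does not use tower and is not the GrpcDurationLayer implementation: `Histogram`, `GrpcDurations`, the bucket bounds, and the method names are all assumptions chosen for illustration.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Upper bounds, in seconds, of the cumulative buckets (hypothetical values).
const BOUNDS: [f64; 5] = [0.001, 0.01, 0.1, 1.0, 10.0];

/// Cumulative histogram: buckets[i] counts observations <= BOUNDS[i].
#[derive(Debug, Default, Clone)]
struct Histogram {
    buckets: [u64; 5],
    count: u64,
    sum_seconds: f64,
}

impl Histogram {
    fn observe(&mut self, d: Duration) {
        let secs = d.as_secs_f64();
        for (i, bound) in BOUNDS.iter().enumerate() {
            if secs <= *bound {
                self.buckets[i] += 1;
            }
        }
        self.count += 1;
        self.sum_seconds += secs;
    }
}

/// One histogram per gRPC method name.
#[derive(Default)]
struct GrpcDurations {
    per_method: HashMap<String, Histogram>,
}

impl GrpcDurations {
    /// Runs `call`, records how long it took under `method`, and
    /// returns the call's result unchanged.
    fn time<T>(&mut self, method: &str, call: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = call();
        self.record(method, start.elapsed());
        out
    }

    /// Records an already-measured duration under `method`.
    fn record(&mut self, method: &str, d: Duration) {
        self.per_method
            .entry(method.to_string())
            .or_default()
            .observe(d);
    }
}
```

A middleware layer plays the role of `time` here: it wraps the inner call, measures elapsed time around it, and passes the response through untouched, so handlers need no instrumentation of their own.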