Metrics
Process-wide KrishivMetrics singleton, Prometheus text, and the OTLP push path.
Metrics live in a process-wide KrishivMetrics struct: each process (coordinator, executor, CLI) holds exactly one instance. The struct carries about 120 counter, gauge, and histogram fields covering the scheduler, executors, sources, sinks, IVM, shuffle, and state.
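As a rough illustration of the process-wide singleton pattern, here is a minimal sketch using only the standard library. SketchMetrics, sketch_global_metrics, and record_submission are hypothetical names invented for this example; the real KrishivMetrics has far more fields and carries labels, which this sketch leaves out.

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for KrishivMetrics with two fields instead of ~120.
#[derive(Default)]
pub struct SketchMetrics {
    /// Counter: only ever incremented.
    pub tasks_submitted_total: AtomicU64,
    /// Gauge: can move up and down.
    pub job_queue_depth: AtomicI64,
}

/// Returns the one instance shared by every thread in the process.
/// The instance is created lazily on first access.
pub fn sketch_global_metrics() -> &'static SketchMetrics {
    static INSTANCE: OnceLock<SketchMetrics> = OnceLock::new();
    INSTANCE.get_or_init(SketchMetrics::default)
}

/// Example call site: record one submitted task that is now queued.
pub fn record_submission() {
    let metrics = sketch_global_metrics();
    metrics.tasks_submitted_total.fetch_add(1, Ordering::Relaxed);
    metrics.job_queue_depth.fetch_add(1, Ordering::Relaxed);
}
```

Because the instance is a static, every call site in the process observes the same counters without passing a handle around.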
Counters and gauges (selection)
| Metric | Type | Labels |
|---|---|---|
| krishiv_tasks_submitted_total | counter | job_id |
| krishiv_tasks_succeeded_total | counter | job_id, task_id |
| krishiv_tasks_failed_total | counter | job_id, reason |
| krishiv_task_attempt_total | counter | job_id, attempt |
| krishiv_checkpoint_epoch | gauge | job_id |
| krishiv_checkpoint_epochs_total | counter | job_id, status |
| krishiv_checkpoint_commit_duration_seconds | histogram | job_id |
| krishiv_watermark_ms | gauge | job_id, source_id |
| krishiv_source_offset_lag | gauge | job_id, source_id |
| krishiv_executor_slots_used | gauge | executor_id |
| krishiv_executor_lost_total | counter | reason |
| krishiv_state_key_count | gauge | job_id, operator_id |
| krishiv_state_bytes | gauge | job_id, operator_id |
| krishiv_shuffle_bytes_written_total | counter | job_id, stage_id |
| krishiv_shuffle_records_written | counter | job_id, stage_id |
| krishiv_shuffle_local_blocks_fetched | counter | job_id, stage_id |
| krishiv_shuffle_remote_blocks_fetched | counter | job_id, stage_id |
| krishiv_grpc_call_duration_seconds | histogram | method |
| krishiv_streaming_rows_emitted_total | counter | job_id, task_id |
| krishiv_job_queue_depth | gauge | namespace |
Full list: krishiv_metrics::KrishivMetrics.
Histograms
Histograms use fixed bucket boundaries chosen for typical value ranges. Latency histograms use 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 30, 60, 300, and 1800 seconds. Size histograms use powers of two from 256 B to 1 GiB.
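The bucket boundaries listed above can be written down as follows. This is a sketch, not the crate's actual code: LATENCY_BUCKETS_SECONDS, size_buckets_bytes, and bucket_index are hypothetical names, and the lookup assumes the usual Prometheus convention that a bucket's upper bound is inclusive.

```rust
/// Latency bucket upper bounds in seconds, copied from the list above.
pub const LATENCY_BUCKETS_SECONDS: [f64; 13] = [
    0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 30.0, 60.0, 300.0, 1800.0,
];

/// Size bucket upper bounds in bytes: powers of two from 256 B (2^8)
/// to 1 GiB (2^30), which gives 23 boundaries.
pub fn size_buckets_bytes() -> Vec<u64> {
    (8..=30).map(|exp| 1u64 << exp).collect()
}

/// Index of the first bucket whose upper bound is >= `value`.
/// Returns `None` when the value exceeds every bound, in which case it
/// is counted only by the implicit +Inf bucket.
pub fn bucket_index(bounds: &[f64], value: f64) -> Option<usize> {
    bounds.iter().position(|&upper| value <= upper)
}
```

For example, a 0.2 s observation lands in the 0.5 s bucket, while a 2000 s observation exceeds the largest bound and is counted only under +Inf.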
Prometheus text format
The coordinator, executors, and the operator UI all expose GET /metrics, which returns the Prometheus text exposition format. render_prometheus() emits one HELP line and one TYPE line per metric family, shared by all of that family's labelled variants. The output is suitable for direct scraping by Prometheus or VictoriaMetrics.
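To make the "one HELP and one TYPE per family" rule concrete, here is a minimal sketch of rendering a single family. It is not the body of render_prometheus(): the Family and Sample types, render_family, and the help text in the usage note are all assumptions made for illustration, and HELP-text escaping is omitted.

```rust
use std::fmt::Write;

/// One labelled variant of a metric (hypothetical type).
pub struct Sample {
    pub labels: Vec<(String, String)>,
    pub value: f64,
}

/// A metric family: one name and type, many labelled samples (hypothetical type).
pub struct Family {
    pub name: String,
    pub help: String,
    pub kind: String, // "counter", "gauge", ...
    pub samples: Vec<Sample>,
}

/// Escapes backslash, double quote, and newline inside a label value.
fn escape_label(value: &str) -> String {
    value
        .replace('\\', "\\\\")
        .replace('"', "\\\"")
        .replace('\n', "\\n")
}

/// Appends the text exposition of `family` to `out`.
pub fn render_family(out: &mut String, family: &Family) {
    // HELP and TYPE are written once per family, before any sample line.
    let _ = writeln!(out, "# HELP {} {}", family.name, family.help);
    let _ = writeln!(out, "# TYPE {} {}", family.name, family.kind);
    for sample in &family.samples {
        let labels: Vec<String> = sample
            .labels
            .iter()
            .map(|(k, v)| format!("{}=\"{}\"", k, escape_label(v)))
            .collect();
        if labels.is_empty() {
            let _ = writeln!(out, "{} {}", family.name, sample.value);
        } else {
            let _ = writeln!(out, "{}{{{}}} {}", family.name, labels.join(","), sample.value);
        }
    }
}
```

Rendering krishiv_tasks_failed_total with two samples (say job_id="j1", reason="oom" and job_id="j2", reason="timeout") would produce a single HELP line, a single TYPE line, and then one sample line per label combination.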
OTLP push
Set OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317 and Krishiv pushes traces over gRPC, plus metrics if your OTel SDK is configured to export them. Krishiv can therefore feed the same collector pipeline as the rest of your fleet.
Scheduler-specific metrics
Beyond the KrishivMetrics singleton, the scheduler exposes additional metrics through SchedulerMetrics::scheduler_metrics():
- krishiv_running_tasks
- krishiv_task_retries_total
- krishiv_failed_assignments_total
- krishiv_max_executor_heartbeat_age_ticks
These are rendered into the same /metrics response when the UI is co-located with the coordinator.
Per-job cleanup
When a job ends, call global_metrics().remove_job(job_id) to drop all labelled variants for that job. The CLI and the coordinator do this automatically on job completion and on coordinator restart.
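The sketch below shows one way such a cleanup could work, under the assumption that labelled series are stored in a map keyed by metric name and label pairs. LabelledSeries, set, and series_count are hypothetical names; only remove_job mirrors a name from the text above, and its real signature and return type may differ.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical registry: (metric name, label pairs) -> current value.
#[derive(Default)]
pub struct LabelledSeries {
    series: Mutex<HashMap<(String, Vec<(String, String)>), f64>>,
}

impl LabelledSeries {
    /// Creates or overwrites one labelled series.
    pub fn set(&self, metric: &str, labels: &[(&str, &str)], value: f64) {
        let key_labels: Vec<(String, String)> = labels
            .iter()
            .map(|(k, v)| (k.to_string(), v.to_string()))
            .collect();
        self.series
            .lock()
            .unwrap()
            .insert((metric.to_string(), key_labels), value);
    }

    /// Drops every series labelled with this job_id, across all metrics,
    /// and returns how many series were removed.
    pub fn remove_job(&self, job_id: &str) -> usize {
        let mut series = self.series.lock().unwrap();
        let before = series.len();
        series.retain(|(_, labels), _| {
            !labels.iter().any(|(k, v)| k == "job_id" && v == job_id)
        });
        before - series.len()
    }

    /// Number of series currently held.
    pub fn series_count(&self) -> usize {
        self.series.lock().unwrap().len()
    }
}
```

Without this step, every finished job would leave its job_id series behind, so the /metrics response would keep growing with each job the process has ever seen. Series without a job_id label, such as per-executor gauges, are untouched.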