Available

Metrics

Process-wide KrishivMetrics singleton, Prometheus text, and the OTLP push path.

Metrics live in a single process-wide KrishivMetrics struct (one per process — coordinator, executor, CLI). The struct carries ~120 counter, gauge, and histogram fields covering the scheduler, executors, sources, sinks, IVM, shuffle, and state.
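To make the "one struct per process" idea concrete, here is a minimal sketch of a process-wide metrics singleton. It is not the actual KrishivMetrics definition: `DemoMetrics`, `demo_global_metrics`, and `record_submit` are hypothetical names, and only two of the roughly 120 fields are modelled.

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};
use std::sync::OnceLock;

/// Hypothetical stand-in for KrishivMetrics with two illustrative fields.
#[derive(Default)]
pub struct DemoMetrics {
    /// Counter: only ever increases.
    pub tasks_submitted_total: AtomicU64,
    /// Gauge: may go up or down.
    pub job_queue_depth: AtomicI64,
}

/// Returns the single process-wide instance, creating it on first use.
pub fn demo_global_metrics() -> &'static DemoMetrics {
    static INSTANCE: OnceLock<DemoMetrics> = OnceLock::new();
    INSTANCE.get_or_init(DemoMetrics::default)
}

/// Example call site: any thread can update the shared instance lock-free.
pub fn record_submit() {
    let m = demo_global_metrics();
    m.tasks_submitted_total.fetch_add(1, Ordering::Relaxed);
    m.job_queue_depth.fetch_add(1, Ordering::Relaxed);
}
```

Because every caller receives the same `&'static` reference, updates made anywhere in the process land in the same fields.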

Counters and gauges (selection)

Metric                                      Type       Labels
krishiv_tasks_submitted_total               counter    job_id
krishiv_tasks_succeeded_total               counter    job_id, task_id
krishiv_tasks_failed_total                  counter    job_id, reason
krishiv_task_attempt_total                  counter    job_id, attempt
krishiv_checkpoint_epoch                    gauge      job_id
krishiv_checkpoint_epochs_total             counter    job_id, status
krishiv_checkpoint_commit_duration_seconds  histogram  job_id
krishiv_watermark_ms                        gauge      job_id, source_id
krishiv_source_offset_lag                   gauge      job_id, source_id
krishiv_executor_slots_used                 gauge      executor_id
krishiv_executor_lost_total                 counter    reason
krishiv_state_key_count                     gauge      job_id, operator_id
krishiv_state_bytes                         gauge      job_id, operator_id
krishiv_shuffle_bytes_written_total         counter    job_id, stage_id
krishiv_shuffle_records_written             counter    job_id, stage_id
krishiv_shuffle_local_blocks_fetched        counter    job_id, stage_id
krishiv_shuffle_remote_blocks_fetched       counter    job_id, stage_id
krishiv_grpc_call_duration_seconds          histogram  method
krishiv_streaming_rows_emitted_total        counter    job_id, task_id
krishiv_job_queue_depth                     gauge      namespace

Full list: krishiv_metrics::KrishivMetrics.

Histograms

Histogram buckets are preset to cover the typical value ranges. Latency histograms use upper bounds of 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10, 30, 60, 300, and 1800 seconds. Size histograms use powers of two from 256 B to 1 GiB.
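The bucket lists can be written down directly. The sketch below is illustrative rather than the crate's own code: `LATENCY_BUCKETS_SECS`, `size_buckets_bytes`, and `bucket_index` are hypothetical names, and it assumes the size buckets include every power of two from 2^8 to 2^30 inclusive. The lookup follows the usual Prometheus convention that a bucket's bound is an inclusive upper limit (`le`), with an implicit +Inf bucket at the end.

```rust
/// Latency bucket upper bounds in seconds, copied from the list above.
pub const LATENCY_BUCKETS_SECS: [f64; 13] = [
    0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0, 10.0, 30.0, 60.0, 300.0, 1800.0,
];

/// Size bucket upper bounds in bytes: 256 B (2^8) through 1 GiB (2^30).
/// Assumption: every power of two in that range is a bucket boundary.
pub fn size_buckets_bytes() -> Vec<u64> {
    (8..=30).map(|exp| 1u64 << exp).collect()
}

/// Index of the first bucket whose upper bound is >= `value`.
/// Returns `bounds.len()` when the value only fits the implicit +Inf bucket.
pub fn bucket_index(bounds: &[f64], value: f64) -> usize {
    bounds
        .iter()
        .position(|&bound| value <= bound)
        .unwrap_or(bounds.len())
}
```

Under these assumptions there are 23 size buckets, and a 0.2 s observation falls in the 0.5 s latency bucket. Anything slower than 1800 s is counted only in +Inf.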

Prometheus text format

Coordinator, executors, and the operator UI all expose GET /metrics returning Prometheus text exposition. render_prometheus() serialises with one HELP and one TYPE per metric family, including all labelled variants. Suitable for direct scrape by Prometheus or VictoriaMetrics.
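The "one HELP and one TYPE per metric family" rule is easiest to see in code. The following is a simplified sketch of that layout, not the real render_prometheus(): the `Family` and `Sample` types and the `render_family` function are hypothetical, and it only handles counters and gauges (histograms additionally need `_bucket`, `_sum`, and `_count` series).

```rust
use std::fmt::Write;

/// One labelled variant of a metric family (hypothetical type).
pub struct Sample {
    pub labels: Vec<(&'static str, String)>,
    pub value: f64,
}

/// A metric family: one name and type shared by many labelled samples.
pub struct Family {
    pub name: &'static str,
    pub help: &'static str,
    pub kind: &'static str, // "counter" or "gauge" in this sketch
    pub samples: Vec<Sample>,
}

/// Escapes backslash, double quote, and newline as the text format requires.
fn escape_label(v: &str) -> String {
    v.replace('\\', "\\\\").replace('"', "\\\"").replace('\n', "\\n")
}

/// Appends one family to `out` in Prometheus text exposition format.
pub fn render_family(out: &mut String, fam: &Family) {
    // HELP and TYPE are written once, before all samples of the family.
    let _ = writeln!(out, "# HELP {} {}", fam.name, fam.help);
    let _ = writeln!(out, "# TYPE {} {}", fam.name, fam.kind);
    for s in &fam.samples {
        let labels: Vec<String> = s
            .labels
            .iter()
            .map(|(k, v)| format!("{}=\"{}\"", k, escape_label(v)))
            .collect();
        if labels.is_empty() {
            let _ = writeln!(out, "{} {}", fam.name, s.value);
        } else {
            let _ = writeln!(out, "{}{{{}}} {}", fam.name, labels.join(","), s.value);
        }
    }
}
```

Rendering a family with two `job_id` values yields a single HELP line, a single TYPE line, and then one sample line per label set, which is the shape a Prometheus or VictoriaMetrics scraper expects.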

OTLP push

Set OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4317 and Krishiv will push traces over gRPC. Metrics are pushed as well if your OTel SDK is configured to export them. This lets Krishiv share the collector pipeline used by the rest of your fleet.

Scheduler-specific metrics

Beyond the KrishivMetrics singleton, the scheduler exposes additional metrics through SchedulerMetrics::scheduler_metrics():

  • krishiv_running_tasks
  • krishiv_task_retries_total
  • krishiv_failed_assignments_total
  • krishiv_max_executor_heartbeat_age_ticks

These are rendered into the same /metrics response when the UI is co-located with the coordinator.

Per-job cleanup

When a job ends, call global_metrics().remove_job(job_id) to drop all labelled variants. The CLI and the coordinator do this automatically on job completion and on coordinator restart.
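Cleanup matters because a labelled series keeps being exported until it is explicitly dropped, so finished jobs would otherwise accumulate in the /metrics output. The sketch below shows one way such a removal could work. It is an assumption about the storage layout, not the actual implementation: `LabelledCounter` and its methods are hypothetical, and each series is keyed by `job_id` plus one further label.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical labelled counter keyed by (job_id, second label value).
#[derive(Default)]
pub struct LabelledCounter {
    series: Mutex<HashMap<(String, String), u64>>,
}

impl LabelledCounter {
    /// Increments the series for this label pair, creating it if needed.
    pub fn inc(&self, job_id: &str, other: &str) {
        let mut series = self.series.lock().unwrap();
        *series
            .entry((job_id.to_string(), other.to_string()))
            .or_insert(0) += 1;
    }

    /// Drops every series whose job_id matches; returns how many were removed.
    pub fn remove_job(&self, job_id: &str) -> usize {
        let mut series = self.series.lock().unwrap();
        let before = series.len();
        series.retain(|(job, _), _| job.as_str() != job_id);
        before - series.len()
    }

    /// Number of live series, useful for checking that cleanup happened.
    pub fn series_count(&self) -> usize {
        self.series.lock().unwrap().len()
    }
}
```

Calling `remove_job` a second time for the same job is a no-op in this sketch, so it is safe to invoke from more than one completion path.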

See also