Available

Streaming Overview

How Krishiv models streams, event time, watermarks, and the unified batch+stream runtime.

Streaming in Krishiv is not a separate engine: it uses the same Session, the same operator runtime, and the same DataFrame API you use for batch. What changes is the boundedness of the input and the shape of the operators. Streaming sources are unbounded, and operators that hold per-key or per-window state keep running until the query is stopped.

Two plan types, one API

You write a query once. At planning time Krishiv tags the plan as bounded or unbounded:

| Plan type | Source | Operators allowed |
|---|---|---|
| Bounded | Parquet, CSV, JSON, Iceberg, Delta, Hudi, in-memory | All DataFrame operators. collect() returns all rows. |
| Unbounded | Kafka, Kinesis, Pulsar, registered streaming table, in-memory unbounded stream | StreamingDataFrame operators: event-time, key_by, windowed aggregation, joins, side output. |
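
The tagging step can be pictured with a small model. This is a sketch in plain Python, not Krishiv's planner: PlanNode, tag_plan, and the source labels are hypothetical names, and the rule that a single unbounded source makes the whole plan unbounded is an assumption, not something this page states.

```python
from dataclasses import dataclass
from enum import Enum


class Boundedness(Enum):
    BOUNDED = "bounded"
    UNBOUNDED = "unbounded"


# Source kinds follow the table above; the string labels are illustrative only.
BOUNDED_SOURCES = {"parquet", "csv", "json", "iceberg", "delta", "hudi", "in_memory"}
UNBOUNDED_SOURCES = {"kafka", "kinesis", "pulsar", "streaming_table", "in_memory_stream"}


@dataclass(frozen=True)
class PlanNode:
    """Hypothetical plan node: a source leaf, or an operator over child nodes."""
    name: str
    children: tuple = ()


def tag_plan(node: PlanNode) -> Boundedness:
    """Tag a plan as unbounded if any source beneath it is unbounded (assumed rule)."""
    if not node.children:
        if node.name in UNBOUNDED_SOURCES:
            return Boundedness.UNBOUNDED
        if node.name in BOUNDED_SOURCES:
            return Boundedness.BOUNDED
        raise ValueError(f"unknown source kind: {node.name}")
    child_tags = [tag_plan(child) for child in node.children]
    if Boundedness.UNBOUNDED in child_tags:
        return Boundedness.UNBOUNDED
    return Boundedness.BOUNDED


batch_plan = PlanNode("filter", (PlanNode("parquet"),))
mixed_plan = PlanNode("join", (PlanNode("kafka"), PlanNode("parquet")))
print(tag_plan(batch_plan).value, tag_plan(mixed_plan).value)
```

Under this assumed rule, a join between a Kafka stream and a Parquet table is tagged unbounded, so it would be restricted to the StreamingDataFrame operators.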

Event time, processing time, and watermarks

Each streaming record carries two timestamps: an event time, read from a timestamp column, and a processing time, assigned when the record arrives. The operator runtime advances a watermark based on the event-time column. Operators that need a notion of "now", such as window firing and late-event handling, read the watermark.
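
To make the watermark idea concrete, here is a minimal sketch of a fixed-lag watermark in plain Python. It is a model of the concept, not Krishiv's implementation: the class FixedLagWatermark and its methods are hypothetical, and two details are assumptions: the watermark is the largest event time seen minus the lag, and a record counts as late only when its event time is strictly below the watermark.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FixedLagWatermark:
    """Watermark that trails the largest event time seen so far by a fixed lag."""
    lag_ms: int
    max_event_time_ms: Optional[int] = None

    def current(self) -> Optional[int]:
        """Current watermark, or None before any record has been observed."""
        if self.max_event_time_ms is None:
            return None
        return self.max_event_time_ms - self.lag_ms

    def observe(self, event_time_ms: int) -> bool:
        """Record one event time; return True if the record is late."""
        watermark = self.current()
        late = watermark is not None and event_time_ms < watermark
        # The watermark only moves forward: late records never pull it back.
        if self.max_event_time_ms is None or event_time_ms > self.max_event_time_ms:
            self.max_event_time_ms = event_time_ms
        return late


wm = FixedLagWatermark(lag_ms=5_000)
for ts in (10_000, 12_000, 6_000, 8_000):
    print(ts, "late" if wm.observe(ts) else "on time", "watermark =", wm.current())
```

Note that processing time plays no role here: the watermark advances only when a record with a larger event time arrives.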

Three watermark policies, applied per stream or globally:

| Policy | When to use |
|---|---|
| WatermarkSpec::fixed_lag_ms(lag_ms) | Single source, fixed allowed lateness. This is the default. |
| MultiSourceWatermarkSpec | Multiple streaming sources joined together; effective watermark is the min across all sources (with optional idle timeout). |
| Processing-time timer | Wall-clock driven events (heartbeats, SLA timers). Not for windowing. |
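
The multi-source rule (minimum across sources, with an optional idle timeout) can be sketched as follows. This is a plain-Python model, not the MultiSourceWatermarkSpec API: the class name, method names, and parameters are hypothetical. It also assumes that an idle source is excluded from the minimum, and that the effective watermark never moves backwards when an idle source resumes.

```python
from typing import Dict, Iterable, Optional


class MultiSourceWatermark:
    """Combine per-source watermarks by taking the minimum over non-idle sources."""

    def __init__(self, sources: Iterable[str],
                 idle_timeout_ms: Optional[int] = None, start_ms: int = 0):
        self.idle_timeout_ms = idle_timeout_ms
        self.watermarks: Dict[str, Optional[int]] = {}
        self.last_seen_ms: Dict[str, int] = {}
        for source in sources:
            self.watermarks[source] = None
            self.last_seen_ms[source] = start_ms
        self._effective: Optional[int] = None

    def update(self, source: str, watermark_ms: int, now_ms: int) -> None:
        """Record a source's watermark; per-source watermarks never regress."""
        previous = self.watermarks[source]
        self.watermarks[source] = (
            watermark_ms if previous is None else max(previous, watermark_ms)
        )
        self.last_seen_ms[source] = now_ms

    def _is_idle(self, source: str, now_ms: int) -> bool:
        if self.idle_timeout_ms is None:
            return False
        return now_ms - self.last_seen_ms[source] >= self.idle_timeout_ms

    def effective(self, now_ms: int) -> Optional[int]:
        """Minimum watermark across non-idle sources, kept monotonic."""
        active = []
        for source, watermark in self.watermarks.items():
            if self._is_idle(source, now_ms):
                continue  # idle sources do not hold the watermark back
            if watermark is None:
                return self._effective  # an active source has not reported yet
            active.append(watermark)
        if active:
            candidate = min(active)
            if self._effective is None or candidate > self._effective:
                self._effective = candidate
        return self._effective


m = MultiSourceWatermark(["orders", "payments"], idle_timeout_ms=1_000)
m.update("orders", 5_000, now_ms=100)
m.update("payments", 3_000, now_ms=200)
print(m.effective(now_ms=200))    # slower source decides
m.update("orders", 9_000, now_ms=1_500)
print(m.effective(now_ms=1_500))  # payments has been silent past the timeout
```

Without the idle timeout, one silent source would pin the effective watermark and stall every window downstream of the join.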

Two streaming styles

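Krishiv supports both micro-batch queries (Spark Structured Streaming style) and true streaming (Flink style). You pick the style by choosing a trigger when you call write_stream(). A third trigger, AvailableNow, processes the data that is currently available and then stops: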

| Style | Trigger | When it fits |
|---|---|---|
| Micro-batch | ProcessingTime(n) or Once | Periodic aggregates, simple sinks, lower operational complexity. Default. |
| Continuous | Continuous(checkpoint_interval_ms) | Sub-second latency requirements. Higher coordinator load. |
| Available-now | AvailableNow | Process all currently-available data and stop. Used for backfills. |
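
The "process what is available, then stop" behavior of an available-now run can be modeled in a few lines. This is a sketch in plain Python, not Krishiv's scheduler: run_available_now and its parameters are hypothetical. It assumes the end offset is snapshotted when the run starts and that the backlog is drained in several bounded batches, as Spark's AvailableNow trigger does; this page does not say whether Krishiv works the same way.

```python
from typing import Callable, List, Sequence


def run_available_now(log: Sequence[int], start_offset: int, max_batch: int,
                      process: Callable[[List[int]], None]) -> int:
    """Drain records available at start time in bounded batches, then stop.

    Returns the offset to resume from on the next run.
    """
    if max_batch <= 0:
        raise ValueError("max_batch must be positive")
    end_offset = len(log)  # snapshot: records appended later wait for the next run
    offset = start_offset
    while offset < end_offset:
        upper = min(offset + max_batch, end_offset)
        process(list(log[offset:upper]))
        offset = upper
    return offset


batches: List[List[int]] = []
backlog = list(range(10))
next_offset = run_available_now(backlog, start_offset=2, max_batch=3,
                                process=batches.append)
print(batches, next_offset)
```

A periodic micro-batch trigger would wrap the same loop body in a timer and never return; the available-now variant differs only in fixing the end offset up front.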

Where to go next