Streaming Overview
How Krishiv models streams, event time, watermarks, and the unified batch+stream runtime.
Streaming in Krishiv is not a separate engine: it's the same Session, the same operator runtime, and the same DataFrame API you use for batch. What changes is the boundedness of the input and the operator shape. Streaming sources are unbounded, and operators that hold per-key or per-window state keep running until the query is stopped.
Two plan types, one API
You write a query once. At planning time Krishiv tags the plan as bounded or unbounded:
| Plan type | Source | Operators allowed |
|---|---|---|
| Bounded | Parquet, CSV, JSON, Iceberg, Delta, Hudi, in-memory | All DataFrame operators. collect() returns all rows. |
| Unbounded | Kafka, Kinesis, Pulsar, registered streaming table, in-memory unbounded stream | StreamingDataFrame operators: event-time, key_by, windowed aggregation, joins, side output. |
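The tagging step can be pictured with a small standalone model. This is only a sketch: `SourceKind`, `Boundedness`, and `tag_plan` are hypothetical names invented for illustration, not Krishiv's API, and the rule that a single unbounded source makes the whole plan unbounded is an assumption borrowed from how other unified engines behave.

```rust
/// Hypothetical source kinds, named after the table above (not Krishiv types).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SourceKind {
    Parquet,
    Csv,
    Iceberg,
    InMemory,
    Kafka,
    Kinesis,
    Pulsar,
    InMemoryUnbounded,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Boundedness {
    Bounded,
    Unbounded,
}

impl SourceKind {
    /// Classify a single source the way the table does.
    pub fn boundedness(self) -> Boundedness {
        match self {
            SourceKind::Kafka
            | SourceKind::Kinesis
            | SourceKind::Pulsar
            | SourceKind::InMemoryUnbounded => Boundedness::Unbounded,
            _ => Boundedness::Bounded,
        }
    }
}

/// Assumption: one unbounded source is enough to make the plan unbounded.
pub fn tag_plan(sources: &[SourceKind]) -> Boundedness {
    if sources
        .iter()
        .any(|s| s.boundedness() == Boundedness::Unbounded)
    {
        Boundedness::Unbounded
    } else {
        Boundedness::Bounded
    }
}
```

Under this model, a plan that reads only Parquet and CSV is bounded, while adding a Kafka source to the same plan makes it unbounded.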
Event time, processing time, and watermarks
Each streaming record carries a timestamp column (event time) and an arrival timestamp (processing time). The operator runtime advances a watermark based on the event-time column. Operators that need a notion of "now" (windows, late events) read the watermark.
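To make the relationship between event time and the watermark concrete, here is a minimal sketch of a fixed-lag watermark. It assumes the common definition (watermark = highest event time seen so far minus a fixed lag); `FixedLagWatermark` and its methods are hypothetical and do not describe Krishiv's internals.

```rust
/// Hypothetical tracker that models a fixed-lag watermark.
#[derive(Debug)]
pub struct FixedLagWatermark {
    lag_ms: i64,
    max_event_time_ms: Option<i64>,
}

impl FixedLagWatermark {
    pub fn new(lag_ms: i64) -> Self {
        Self { lag_ms, max_event_time_ms: None }
    }

    /// Current watermark, or None before any record has been observed.
    pub fn current(&self) -> Option<i64> {
        self.max_event_time_ms.map(|t| t - self.lag_ms)
    }

    /// Observe a record's event time and advance the watermark.
    /// Returns true if the record is late, meaning its event time is
    /// behind the watermark that held before the record arrived.
    pub fn observe(&mut self, event_time_ms: i64) -> bool {
        let late = matches!(self.current(), Some(w) if event_time_ms < w);
        // The maximum only grows, so the watermark never moves backwards.
        let max = self
            .max_event_time_ms
            .map_or(event_time_ms, |m| m.max(event_time_ms));
        self.max_event_time_ms = Some(max);
        late
    }
}
```

With a 5,000 ms lag, a record stamped 10,000 moves the watermark to 5,000. A later record stamped 7,000 is out of order but still on time, while one stamped 4,000 is late.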
Three timing policies, applied per stream or globally. The first two define how the watermark advances; the third is driven by the wall clock:
| Policy | When to use |
|---|---|
| WatermarkSpec::fixed_lag_ms(lag_ms) | Single source, fixed allowed lateness. This is the default. |
| MultiSourceWatermarkSpec | Multiple streaming sources joined together; effective watermark is the min across all sources (with optional idle timeout). |
| Processing-time timer | Wall-clock driven events (heartbeats, SLA timers). Not for windowing. |
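The multi-source rule in the table (minimum across sources, with an optional idle timeout) can be sketched as follows. `MultiSourceWatermark`, `update`, and `effective` are hypothetical names, and the idle semantics are an assumption: a source that has not reported within the timeout is simply left out of the minimum.

```rust
use std::collections::HashMap;

/// Hypothetical per-source state for the model below.
#[derive(Debug, Clone, Copy)]
struct SourceState {
    watermark_ms: i64,
    last_seen_ms: i64, // processing time of the most recent update
}

#[derive(Debug)]
pub struct MultiSourceWatermark {
    idle_timeout_ms: Option<i64>,
    sources: HashMap<String, SourceState>,
}

impl MultiSourceWatermark {
    pub fn new(idle_timeout_ms: Option<i64>) -> Self {
        Self { idle_timeout_ms, sources: HashMap::new() }
    }

    /// Record a source's watermark, stamped with the current processing time.
    pub fn update(&mut self, source: &str, watermark_ms: i64, now_ms: i64) {
        let entry = self
            .sources
            .entry(source.to_string())
            .or_insert(SourceState { watermark_ms, last_seen_ms: now_ms });
        // A single source's watermark is not allowed to go backwards.
        entry.watermark_ms = entry.watermark_ms.max(watermark_ms);
        entry.last_seen_ms = now_ms;
    }

    /// Minimum watermark over the sources that are not idle at `now_ms`.
    /// Returns None when no source is active.
    pub fn effective(&self, now_ms: i64) -> Option<i64> {
        self.sources
            .values()
            .filter(|s| match self.idle_timeout_ms {
                Some(timeout) => now_ms - s.last_seen_ms <= timeout,
                None => true,
            })
            .map(|s| s.watermark_ms)
            .min()
    }
}
```

Without a timeout, one quiet source holds the joined watermark back indefinitely. With one, the remaining sources can make progress. This sketch does not guard against the effective watermark regressing when an idle source resumes, which a real runtime would need to handle.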
Two streaming styles
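Krishiv supports both micro-batch continuous queries (Spark Structured Streaming style) and true streaming (Flink style). A third trigger, AvailableNow, processes whatever data is currently available and then stops. You pick the style through the trigger you set when you call write_stream():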
| Style | Trigger | When it fits |
|---|---|---|
| Micro-batch | ProcessingTime(n) or Once | Periodic aggregates, simple sinks, lower operational complexity. Default. |
| Continuous | Continuous(checkpoint_interval_ms) | Sub-second latency requirements. Higher coordinator load. |
| Available-now | AvailableNow | Process all currently-available data and stop. Used for backfills. |
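The latency trade-off between the styles follows from how a fixed trigger interval groups records. The sketch below models only a ProcessingTime(n)-style trigger. The function names are hypothetical, and the assumption that a record arriving in [k·n, (k+1)·n) is picked up by the tick at (k+1)·n is a simplification, not a description of Krishiv's scheduler.

```rust
/// Group record arrival times (processing time, in ms) into micro-batches
/// for a fixed trigger interval. Each entry is (tick time, arrivals in batch).
pub fn micro_batches(arrivals_ms: &[u64], interval_ms: u64) -> Vec<(u64, Vec<u64>)> {
    assert!(interval_ms > 0, "trigger interval must be positive");
    let mut sorted = arrivals_ms.to_vec();
    sorted.sort_unstable();

    let mut batches: Vec<(u64, Vec<u64>)> = Vec::new();
    for t in sorted {
        // First tick strictly after the arrival time.
        let tick = (t / interval_ms + 1) * interval_ms;
        match batches.last_mut() {
            Some(last) if last.0 == tick => last.1.push(t),
            _ => batches.push((tick, vec![t])),
        }
    }
    batches
}

/// Worst-case wait between a record's arrival and the tick that processes it.
pub fn max_wait_ms(batches: &[(u64, Vec<u64>)]) -> u64 {
    batches
        .iter()
        .flat_map(|(tick, rows)| rows.iter().map(move |t| tick - t))
        .max()
        .unwrap_or(0)
}
```

In this model a record can wait up to a full interval before it is processed, which is why a sub-second latency target points toward the continuous style rather than a shorter and shorter micro-batch interval.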
Where to go next
- Windows and Watermarks — tumbling, sliding, session, late events
- Streaming Joins — stream-table temporal, stream-stream interval, regular
- Stateful Process Functions — ProcessFunction, timers, ConnectedStreams, BroadcastState
- Queries and Lifecycle — DataStreamWriter, StreamingQuery, listeners, modes