Parquet → SQL aggregation
Read a Parquet file, run a SQL aggregation, write to a new Parquet file.
The shortest path from "I have a Parquet file" to "I have a result." Uses the embedded runtime — no cluster required.
Python
import krishiv as ks
session = ks.Session.embedded()
orders = session.read_parquet("data/orders.parquet")
session.register_parquet("orders", "data/orders.parquet")
result = session.sql("""
SELECT customer_id, SUM(amount) AS total
FROM orders
WHERE event_time >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY customer_id
ORDER BY total DESC
LIMIT 100
""")
result.write_parquet("out/top_customers.parquet")
result.show(10)
Rust
use krishiv_api::{col, lit, sum, Session};
#[tokio::main]
async fn main() -> krishiv_api::Result<()> {
let session = Session::embedded().await?;
session.register_parquet("orders", "data/orders.parquet").await?;
let df = session.sql("
SELECT customer_id, SUM(amount) AS total
FROM orders
WHERE event_time >= CURRENT_DATE - INTERVAL '30' DAY
GROUP BY customer_id
ORDER BY total DESC
LIMIT 100
").await?;
df.write_parquet("out/top_customers.parquet", None).await?;
df.show().await?;
Ok(())
}
Notes
- The internal data model is Apache Arrow RecordBatch; reading and writing both go through it.
- SQL parsing and execution are delegated to DataFusion; you can use any standard DataFusion SQL syntax.
- For larger files, prefer reading a directory of partitioned Parquet files instead of one giant file.
See the Python Session and Parquet & Object Store pages for full options.