Parquet → SQL aggregation

Read a Parquet file, run a SQL aggregation, and write the result to a new Parquet file.

This is the shortest path from "I have a Parquet file" to "I have a result." It uses the embedded runtime, so no cluster is required.

Python

import krishiv as ks

session = ks.Session.embedded()

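# Optional: load the file directly. The SQL below does not use this variable.
orders = session.read_parquet("data/orders.parquet")
# Register the same file under the table name "orders" so SQL can reference it.
session.register_parquet("orders", "data/orders.parquet")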

result = session.sql("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    WHERE event_time >= CURRENT_DATE - INTERVAL '30' DAY
    GROUP BY customer_id
    ORDER BY total DESC
    LIMIT 100
""")
result.write_parquet("out/top_customers.parquet")
result.show(10)
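To sanity-check the query on a small sample, the same aggregation can be written in plain Python. This is only a sketch of what the SQL computes and does not use krishiv. The `top_customers` helper and the row layout (dicts with `customer_id`, `amount`, and a date-valued `event_time`) are assumptions for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta


def top_customers(rows, today, days=30, limit=100):
    """Mirror the SQL: keep rows from the last `days` days, sum amount per
    customer, sort by total descending, and keep the first `limit` rows."""
    cutoff = today - timedelta(days=days)  # CURRENT_DATE - INTERVAL '30' DAY
    totals = defaultdict(float)
    for row in rows:
        if row["event_time"] >= cutoff:  # WHERE event_time >= cutoff
            totals[row["customer_id"]] += row["amount"]  # SUM(amount) per GROUP BY key
    # ORDER BY total DESC, then LIMIT
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:limit]


# Hypothetical sample rows; event_time is a plain date here for simplicity.
today = date(2024, 6, 30)
rows = [
    {"customer_id": "a", "amount": 10.0, "event_time": date(2024, 6, 29)},
    {"customer_id": "b", "amount": 25.0, "event_time": date(2024, 6, 15)},
    {"customer_id": "a", "amount": 5.0, "event_time": date(2024, 6, 1)},
    {"customer_id": "c", "amount": 99.0, "event_time": date(2024, 5, 1)},  # outside the window
]
print(top_customers(rows, today))
```

The sketch ignores details a SQL engine handles, such as NULL values and timestamp-versus-date comparison, so treat it as a cross-check for small inputs rather than a replacement for the query.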

Rust

use krishiv_api::{col, lit, sum, Session};

#[tokio::main]
async fn main() -> krishiv_api::Result<()> {
    let session = Session::embedded().await?;
    session.register_parquet("orders", "data/orders.parquet").await?;

    let df = session.sql("
        SELECT customer_id, SUM(amount) AS total
        FROM orders
        WHERE event_time >= CURRENT_DATE - INTERVAL '30' DAY
        GROUP BY customer_id
        ORDER BY total DESC
        LIMIT 100
    ").await?;
    df.write_parquet("out/top_customers.parquet", None).await?;
    df.show().await?;
    Ok(())
}

Notes

The internal data model is Apache Arrow RecordBatch; reading and writing both go through it.
SQL parsing and execution are delegated to DataFusion, so any SQL syntax that DataFusion supports works here.
For larger files, prefer reading a directory of partitioned Parquet files instead of one giant file.
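One common way to lay out a partitioned directory is Hive-style `key=value` subdirectories, which let a reader skip whole files based on their path. The sketch below only builds and filters paths with the standard library. The `event_date` partition column and both helper functions are assumptions for illustration, and this page does not state whether krishiv discovers or prunes partitions this way.

```python
from datetime import date
from pathlib import PurePosixPath


def partition_path(root, event_date, part=0):
    """Build a Hive-style path, e.g. <root>/event_date=2024-06-29/part-0.parquet."""
    return PurePosixPath(root) / f"event_date={event_date.isoformat()}" / f"part-{part}.parquet"


def partitions_since(paths, cutoff):
    """Keep only files whose event_date= directory is on or after `cutoff`."""
    kept = []
    for p in paths:
        # The partition value is encoded in the parent directory name.
        key, _, value = p.parent.name.partition("=")
        if key == "event_date" and date.fromisoformat(value) >= cutoff:
            kept.append(p)
    return kept


# Hypothetical layout: one file per day for three days in June 2024.
paths = [partition_path("data/orders", date(2024, 6, d)) for d in (1, 15, 29)]
recent = partitions_since(paths, date(2024, 6, 10))
print([str(p) for p in recent])
```

With a layout like this, a 30-day filter on the partition column only needs to touch the matching subdirectories instead of scanning one large file.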

See the Python Session and Parquet & Object Store pages for full options.