S3 / Object Store

Reading and writing Parquet on S3, ADLS, GCS, and MinIO.

S3, ADLS, GCS, and MinIO share the same object_store backend. They are referenced by URI in the same places as local paths.

URI form

s3://bucket/path/to/table/
s3://bucket/path/to/file.parquet
abfss://container@account.dfs.core.windows.net/path/
gs://bucket/path/
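
To make the URI shapes above concrete, here is a minimal Python sketch that splits an object-store URI into its scheme, bucket or container, and key. The helper name `split_object_uri` is hypothetical and uses only the standard library; it illustrates the layout of the URIs, not how the object_store backend itself resolves locations.

```python
from urllib.parse import urlsplit

# Schemes taken from the URI forms listed in this section.
SUPPORTED_SCHEMES = {"s3", "abfss", "gs"}


def split_object_uri(uri: str) -> tuple[str, str, str]:
    """Split an object-store URI into (scheme, bucket_or_container, key)."""
    parts = urlsplit(uri)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")
    # For abfss the authority is "container@account-host";
    # the container is the part before "@". For s3 and gs it is just the bucket.
    bucket = parts.netloc.split("@", 1)[0]
    # The key is the path without its leading slash; a trailing slash marks a prefix.
    key = parts.path.lstrip("/")
    return parts.scheme, bucket, key
```

For example, `split_object_uri("s3://bucket/path/to/file.parquet")` returns `("s3", "bucket", "path/to/file.parquet")`.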

Authentication

Provider	Default chain
AWS S3	Env vars, then IMDS / IRSA / instance profile.
MinIO (S3-compatible)	`AWS_ENDPOINT_URL` + static keys.
Azure ADLS Gen2	Workload identity / managed identity / service principal.
GCS	Application default credentials / service account file.
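
As a quick pre-flight check for the MinIO row above, the following Python sketch reports which environment variables are still unset. `AWS_ENDPOINT_URL` comes from the table; `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the standard AWS names for static keys and are assumed here to be the ones this engine reads. The helper `missing_minio_env` is hypothetical.

```python
import os
from typing import Mapping

# AWS_ENDPOINT_URL is named in the table above. The two key variables are the
# standard AWS names for static credentials (an assumption for this engine).
MINIO_ENV_VARS = ("AWS_ENDPOINT_URL", "AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_minio_env(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names of required variables that are unset or empty."""
    return [name for name in MINIO_ENV_VARS if not env.get(name)]
```

Calling `missing_minio_env()` before opening a session gives a clearer error than a failed request against the wrong endpoint.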

Reading

CREATE EXTERNAL TABLE orders
STORED AS PARQUET
LOCATION 's3://my-bucket/data/orders/';

Or directly:

let df = session.read_parquet("s3://my-bucket/data/orders/2024-q1/*.parquet").await?;

Writing

df.write_parquet("s3://my-bucket/out/orders/").await?;   // Rust
df.write_parquet("s3://my-bucket/out/orders/")           # Python

Object-store writes use the same ParquetSink as local writes and respect the settings passed to write_parquet_with_options (compression, row group size).

Credentials in code

You can also pass credentials explicitly via CREATE EXTERNAL TABLE options or object-store config:

CREATE EXTERNAL TABLE orders
STORED AS PARQUET
LOCATION 's3://my-bucket/data/orders/'
OPTIONS (
  'aws.region'            = 'us-east-1',
  'aws.access_key_id'     = '...',
  'aws.secret_access_key' = '...'
);

For production, prefer env-var / IAM-based auth and omit these options.
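
One way to keep secrets out of checked-in SQL is to generate the statement and skip any option that has no value, so the same code path works with explicit keys in development and with the default chain in production. The Python sketch below assumes exactly that fallback behavior; `sql_literal` and `external_table_ddl` are hypothetical helpers, and the option keys are the ones shown above.

```python
from typing import Optional


def sql_literal(value: str) -> str:
    """Quote a string as a SQL literal, doubling embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def external_table_ddl(
    name: str, location: str, options: Optional[dict[str, str]] = None
) -> str:
    """Render a CREATE EXTERNAL TABLE statement, skipping unset options.

    `name` is inserted verbatim, so it must come from trusted code.
    """
    lines = [
        f"CREATE EXTERNAL TABLE {name}",
        "STORED AS PARQUET",
        f"LOCATION {sql_literal(location)}",
    ]
    # Drop empty values; assumed to leave authentication to the default chain.
    set_options = {k: v for k, v in (options or {}).items() if v}
    if set_options:
        body = ",\n".join(
            f"  {sql_literal(k)} = {sql_literal(v)}" for k, v in set_options.items()
        )
        lines.append(f"OPTIONS (\n{body}\n)")
    return "\n".join(lines) + ";"
```

With no options, the helper emits the three-line statement from the Reading section; with only `aws.region` set, the OPTIONS clause carries the region and no keys.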

Performance

Reads use ParquetReadOptions::with_pushdown_filters(true), with_enable_page_index(true), with_enable_bloom_filter(true) by default in production profiles.
Writes coalesce small row groups up to max_row_group_size (default 1 048 576 rows, i.e. 2^20; the limit is a row count, not a byte size).
For directory-based sources, use partition pruning: s3://bucket/table/year=2024/month=01/.
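
The idea behind partition pruning can be sketched in a few lines of Python: read the `key=value` segments out of each object path and skip every path whose values do not match the filter, before any Parquet file is opened. The helpers `partition_values` and `prune` are hypothetical and only illustrate the technique; the engine's own pruning logic is not shown here.

```python
def partition_values(path: str) -> dict[str, str]:
    """Extract Hive-style key=value segments from an object path."""
    values = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key:
            values[key] = value
    return values


def prune(paths: list[str], **wanted: str) -> list[str]:
    """Keep only paths whose partition values match every requested key."""
    kept = []
    for path in paths:
        values = partition_values(path)
        if all(values.get(k) == v for k, v in wanted.items()):
            kept.append(path)
    return kept
```

Given files under `year=2024/month=01/` and `year=2024/month=02/`, `prune(paths, year="2024", month="01")` keeps only the first group, which is why a layout like the one above lets a filter on `year` and `month` avoid listing and reading unrelated objects.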

Preview: S3 / ADLS / GCS writes are feature-complete, but end-to-end certification against each individual object store is still in progress. See Connector Certification for the matrix.