S3 / Object Store
Reading and writing Parquet on S3, ADLS, GCS, and MinIO.
S3, ADLS, GCS, and MinIO all go through the same object_store backend. Reference them by URI anywhere a local path is accepted.
URI form
s3://bucket/path/to/table/
s3://bucket/path/to/file.parquet
abfss://container@account.dfs.core.windows.net/path/
gs://bucket/path/
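To make the anatomy of these URIs concrete, here is a minimal Python sketch that splits one into its scheme, bucket or container, and object key using only the standard library. The helper name `split_object_uri` and the `container` / `account` placeholders are illustrative assumptions, not part of the engine's API; the engine does its own parsing internally.

```python
from urllib.parse import urlsplit

SUPPORTED_SCHEMES = {"s3", "abfss", "gs"}


def split_object_uri(uri: str) -> dict:
    """Split an object-store URI into scheme, bucket/container, host, and key.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(uri)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported scheme: {parts.scheme!r}")

    if parts.scheme == "abfss":
        # ADLS Gen2 form: abfss://<container>@<account host>/<path>
        bucket, _, account_host = parts.netloc.partition("@")
    else:
        # S3 and GCS form: <scheme>://<bucket>/<path>
        bucket, account_host = parts.netloc, None

    key = parts.path.lstrip("/")
    return {
        "scheme": parts.scheme,
        "bucket": bucket,
        "account_host": account_host or None,
        "key": key,
        # A trailing slash (or an empty key) denotes a directory-style prefix.
        "is_prefix": key == "" or key.endswith("/"),
    }
```

For example, `split_object_uri("s3://bucket/path/to/file.parquet")` reports `bucket` as the bucket and `path/to/file.parquet` as the key, with `is_prefix` false because the path names a single object.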
Authentication
| Provider | Default chain |
|---|---|
| AWS S3 | Env vars, then IMDS / IRSA / instance profile. |
| MinIO (S3-compatible) | AWS_ENDPOINT_URL + static keys. |
| Azure ADLS Gen2 | Workload identity / managed identity / service principal. |
| GCS | Application default credentials / service account file. |
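For the MinIO row, a small pre-flight check can catch a missing variable before the first query fails. The sketch below is a standalone Python helper, not part of the engine. `AWS_ENDPOINT_URL` comes from the table above; `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the conventional AWS SDK names for static keys and are assumed here to be the ones the default chain reads.

```python
import os

# AWS_ENDPOINT_URL is named in the table above. The two key variables are the
# conventional AWS SDK names and are an assumption in this sketch.
MINIO_REQUIRED_VARS = (
    "AWS_ENDPOINT_URL",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
)


def missing_minio_vars(env=None) -> list:
    """Return the required variables that are unset or empty, in declaration order.

    Hypothetical helper for illustration only.
    """
    env = os.environ if env is None else env
    return [name for name in MINIO_REQUIRED_VARS if not env.get(name)]
```

Calling `missing_minio_vars()` with no argument inspects the current process environment; passing a plain dict makes the check easy to exercise in tests.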
Reading
CREATE EXTERNAL TABLE orders
STORED AS PARQUET
LOCATION 's3://my-bucket/data/orders/';
Or directly:
let df = session.read_parquet("s3://my-bucket/data/orders/2024-q1/*.parquet").await?;
Writing
df.write_parquet("s3://my-bucket/out/orders/").await?;  // Rust
df.write_parquet("s3://my-bucket/out/orders/")          # Python
Writes use the same ParquetSink as local writes and honor the settings passed through write_parquet_with_options, such as compression and row group size.
Credentials in code
You can also pass credentials explicitly via CREATE EXTERNAL TABLE options or object-store config:
CREATE EXTERNAL TABLE orders
STORED AS PARQUET
LOCATION 's3://my-bucket/data/orders/'
OPTIONS (
'aws.region' = 'us-east-1',
'aws.access_key_id' = '...',
'aws.secret_access_key' = '...'
);
In production, prefer environment-variable or IAM-based authentication and omit these options.
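One way to keep the same code path for development and production is to generate the statement and include only the options that are actually set. The Python sketch below is a hypothetical helper, not an engine API. It assumes the SQL dialect escapes a single quote inside a string literal by doubling it (the standard SQL convention) and that the table name is a trusted identifier.

```python
from typing import Optional


def sql_literal(value: str) -> str:
    """Quote a value as a SQL string literal, doubling embedded single quotes.

    Assumes standard SQL quoting; the engine's dialect may differ.
    """
    return "'" + value.replace("'", "''") + "'"


def create_external_table(
    name: str, location: str, options: Optional[dict] = None
) -> str:
    """Build a CREATE EXTERNAL TABLE statement for a Parquet location.

    Hypothetical helper for illustration only.
    """
    if not name.isidentifier():
        raise ValueError(f"table name is not a plain identifier: {name!r}")

    lines = [
        f"CREATE EXTERNAL TABLE {name}",
        "STORED AS PARQUET",
        f"LOCATION {sql_literal(location)}",
    ]

    # Drop unset or empty options so env-var / IAM auth is left untouched.
    set_options = {k: v for k, v in (options or {}).items() if v}
    if set_options:
        body = ",\n".join(
            f"{sql_literal(k)} = {sql_literal(v)}" for k, v in set_options.items()
        )
        lines.append(f"OPTIONS (\n{body}\n)")

    return "\n".join(lines) + ";"
```

With no options, or with every option empty, the helper emits no OPTIONS clause at all, so the default credential chain applies.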
Performance
- Reads use ParquetReadOptions::with_pushdown_filters(true), with_enable_page_index(true), with_enable_bloom_filter(true) by default in production profiles.
- Writes coalesce small row groups to max_row_group_size (default 1 MB / 1 048 576 rows).
- For directory-based sources, use partition pruning: s3://bucket/table/year=2024/month=01/.
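To illustrate why the year=2024/month=01 layout helps, here is a minimal Python sketch of partition pruning over a list of object keys: directory segments of the form name=value are parsed, and keys whose values do not match the filter are skipped before any file is opened. The function names and the part-0.parquet file names are hypothetical; the engine performs this step itself during planning.

```python
def partition_values(key: str) -> dict:
    """Extract Hive-style name=value directory segments from an object key.

    The final segment is treated as the file name (or is empty when the key
    is a directory prefix ending in "/"), so it is not parsed.
    """
    values = {}
    for segment in key.split("/")[:-1]:
        name, sep, value = segment.partition("=")
        if sep and name:
            values[name] = value
    return values


def prune(keys, **wanted):
    """Keep only keys whose partition values match every requested pair."""
    return [
        key
        for key in keys
        if all(partition_values(key).get(n) == v for n, v in wanted.items())
    ]


# Hypothetical listing of a partitioned table.
KEYS = [
    "table/year=2024/month=01/part-0.parquet",
    "table/year=2024/month=02/part-0.parquet",
    "table/year=2023/month=01/part-0.parquet",
]
```

Here `prune(KEYS, year="2024", month="01")` keeps only the first key, so the other two files are never read. Partition values are compared as strings, which is why "01" is written with its leading zero.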
Preview: S3 / ADLS / GCS writes are feature-complete, but end-to-end certification against each object store is still in progress. See Connector Certification for the matrix.