data_manipulation
This module benchmarks three data processing libraries.
It compares the performance of Pandas, Polars, and DuckDB on a common data aggregation task.
- Polars
Polars is a Rust-powered DataFrame library designed for speed that brings multi-threaded execution and query optimization to Python.
Key capabilities include:
- Speeds up operations by using all available CPU cores by default
- Builds a query plan first, then executes only what’s needed
- Streaming mode for processing datasets larger than RAM
- Expressive method chaining with a pandas-like API
- DuckDB
DuckDB is an embedded SQL database optimized for analytics that brings database-level query optimization to local files.
Key capabilities include:
- Native SQL syntax with full analytical query support
- Queries CSV, Parquet, and JSON files directly, without a separate import step
- Uses disk storage automatically when data exceeds available memory
- Zero-configuration embedded database requiring no server setup
- Benchmark Summary Table
At the end of the script, a comparison benchmark table summarizes the performance of Pandas, Polars, and DuckDB across various operations.
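
One way such a summary table could be assembled is sketched below using only the standard library. The helper names `best_time` and `format_table` are hypothetical, and the workloads are placeholders standing in for the real Pandas, Polars, and DuckDB calls, so the timings it prints say nothing about the actual libraries.

```python
import time


def best_time(fn, repeats=3):
    """Return the fastest wall-clock time in seconds over `repeats` calls."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def format_table(results):
    """Render {library: {operation: seconds}} as a plain-text table."""
    operations = sorted({op for timings in results.values() for op in timings})
    rows = [["library"] + operations]
    for library, timings in results.items():
        cells = [
            f"{timings[op]:.4f}" if op in timings else "n/a"
            for op in operations
        ]
        rows.append([library] + cells)
    # Pad every column to the width of its longest cell.
    widths = [max(len(row[i]) for row in rows) for i in range(len(rows[0]))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths))
        for row in rows
    )


# Placeholder workloads (assumptions): replace each lambda with the
# real aggregation call for that library.
workloads = {
    "pandas": lambda: sum(range(1000)),
    "polars": lambda: sum(range(1000)),
    "duckdb": lambda: sum(range(1000)),
}

results = {
    name: {"aggregation": best_time(fn)} for name, fn in workloads.items()
}
table = format_table(results)
print(table)
```

Taking the minimum over several repeats reduces noise from other processes; reporting the mean or median instead is an equally reasonable choice.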