data_manipulation

This module benchmarks several data processing libraries.

It compares the performance of Pandas, Polars, and DuckDB on a common data aggregation task.

  1. Polars

Polars is a Rust-powered DataFrame library designed for speed that brings multi-threaded execution and query optimization to Python.

Key capabilities include:

  • Speeds up operations by using all available CPU cores by default
  • In lazy mode, builds a query plan first, then executes only what’s needed
  • Streaming mode for processing datasets larger than RAM
  • Expressive method chaining with a pandas-like API

  2. DuckDB

DuckDB is an embedded SQL database optimized for analytics that brings database-level query optimization to local files.

Key capabilities include:

  • Native SQL syntax with full analytical query support
  • Queries CSV, Parquet, and JSON files directly, without a separate import step
  • Uses disk storage automatically when data exceeds available memory
  • Zero-configuration embedded database requiring no server setup

  3. Benchmark Summary Table

At the end of the script, a summary table compares the performance of Pandas, Polars, and DuckDB across the benchmarked operations.
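
One way such a table could be produced is sketched below using only the standard library. The helpers `best_time` and `format_table` are hypothetical names, and the workloads are placeholders standing in for the real Pandas, Polars, and DuckDB calls, so the printed timings say nothing about the actual libraries.

```python
import time


def best_time(fn, repeats=3):
    """Return the fastest wall-clock time in seconds over `repeats` calls to fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def format_table(results):
    """Render a {name: seconds} mapping as a plain-text table, fastest first."""
    ordered = sorted(results.items(), key=lambda item: item[1])
    fastest = ordered[0][1]
    lines = [f"{'Library':<10} {'Time (s)':>10} {'vs fastest':>11}"]
    for name, seconds in ordered:
        ratio = seconds / fastest if fastest > 0 else float("nan")
        lines.append(f"{name:<10} {seconds:>10.4f} {ratio:>10.2f}x")
    return "\n".join(lines)


# Placeholder workloads: replace each lambda with the real library call.
workloads = {
    "Pandas": lambda: sum(range(100_000)),
    "Polars": lambda: sum(range(100_000)),
    "DuckDB": lambda: sum(range(100_000)),
}

results = {name: best_time(fn) for name, fn in workloads.items()}
table = format_table(results)
print(table)
```

Taking the minimum over a few repeats is a common way to reduce noise from other processes; a mean or median would be an equally reasonable choice.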