How to Process Massive CSV Files With DuckDB and Python Fast
A 60GB CSV file lands in your AWS S3 bucket. Your data pipeline triggers, spins up a standard EC2 instance, and attempts to run pandas.read_csv(). Five minutes later, the process dies without so much as a traceback: memory usage spiked past 120GB, the Linux OOM (Out of Memory) killer stepped in, and your pipeline is dead in the water.
I have fought this exact battle more times than I care to admit. For years, the standard advice was to chunk your data, use Dask, or spin up a ridiculously expensive Spark cluster just to count a few rows. But the Python data ecosystem has fundamentally shifted. If you are trying to figure out exactly how to process large csv files with duckdb python, you are in the right place. You don’t need Spark. You don’t need a 256GB RAM instance. You just need a completely different execution model.
Today, I am going to show you how to rip through massive, larger-than-memory CSV files using DuckDB. We will cover environment setup, out-of-core processing, handling messy schemas, and integrating with modern tools like the Polars dataframe and the Ibis framework. Let’s dig in.
The Memory Wall: Why Pandas Fails and DuckDB Thrives
To understand why DuckDB feels like absolute magic, you have to understand why traditional tools fail. When you load a CSV into Pandas (prior to the recent PyArrow updates in Pandas 2.0), it parses the text and converts every single value into a CPython object. Because of CPython internals, a simple integer doesn’t just take up 8 bytes of memory; it carries the massive overhead of a Python object header, reference counts, and type information.
This means a 10GB CSV file on disk easily inflates to 30GB or 40GB of RAM once materialized in Pandas. While the Python community is making incredible strides with free-threaded (no-GIL) builds and the experimental JIT compiler, none of these structural language improvements solve the raw memory bloat of materializing millions of objects simultaneously.
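You can measure this bloat straight from the interpreter. A quick sketch (exact byte counts are CPython implementation details and vary by version and platform):

```python
import sys

# A raw int64 is 8 bytes; a CPython int object adds an object header,
# reference count, and type pointer on top of the payload.
python_int_size = sys.getsizeof(1)  # typically 28 bytes on 64-bit CPython
print(f"One Python int: {python_int_size} bytes vs 8 bytes raw")

# A list of a million ints pays that tax a million times, plus
# 8 bytes per element just for the list's internal pointer array.
numbers = list(range(1_000_000))
pointer_bytes = sys.getsizeof(numbers)  # counts only the pointers, not the ints
print(f"Pointer array alone: {pointer_bytes / 1e6:.1f} MB")
```

That pointer array is pure overhead before you even count the objects it points at; a columnar engine stores the same million values in one contiguous 8MB buffer.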
DuckDB bypasses this entirely. It is an in-process, columnar, vectorized SQL database engine. When you point DuckDB at a CSV, it doesn’t load the file into Python objects. It executes vectorized C++ operations directly on chunks of the underlying data, keeping the memory footprint incredibly small. If an operation requires more memory than your system has, DuckDB gracefully spills to disk, using temporary storage to complete the query. It is to heavy analytical workloads what SQLite is to transactional ones.
Setting Up the Modern Python Data Stack
I am highly opinionated about how to build data environments. Blindly running pip install globally is a recipe for dependency hell, especially when mixing C++ backed libraries like DuckDB, PyArrow, and Polars. Instead, I use the uv installer (written in Rust) or the Rye manager to keep my environments strictly isolated and incredibly fast.
Let’s bootstrap our project. Open your terminal and run the following commands to create a virtual environment and install our dependencies using uv:
# Initialize a new project
uv venv
source .venv/bin/activate
# Install DuckDB, PyArrow, and Polars
uv pip install duckdb pyarrow polars
For my development workflow, I have completely ditched Jupyter for Marimo notebooks. Marimo is reactive, entirely reproducible, and stores code as pure Python files. It integrates beautifully with DuckDB. We will also install a modern linter to keep our code clean.
uv pip install marimo ruff
How to Process Large CSV Files With DuckDB Python
Let’s get straight to the core technique. The absolute fastest way to query a massive CSV is to treat it as a database table directly from disk. You do not need to import the data into a DuckDB database file first; DuckDB can query the raw text file on the fly using its read_csv_auto function.
Here is the most basic example of how to process large csv files with duckdb python. Imagine we have a 40GB file named transactions.csv:
import duckdb
# Connect to an in-memory DuckDB instance
con = duckdb.connect()
# Query the CSV directly from disk
query = """
SELECT
merchant_id,
SUM(transaction_amount) as total_volume,
COUNT(*) as transaction_count
FROM read_csv_auto('transactions.csv')
WHERE transaction_date >= '2023-01-01'
GROUP BY merchant_id
ORDER BY total_volume DESC
LIMIT 10;
"""
# Execute the query and fetch the results
results = con.execute(query).fetchall()
for row in results:
    print(f"Merchant: {row[0]} | Volume: ${row[1]:,.2f} | Count: {row[2]}")
Notice what happens here. DuckDB streams the CSV file from your SSD into its execution engine in chunks. It filters out rows before 2023 immediately, meaning those rows never consume memory in the aggregation phase. It computes the sums and counts on the fly, and only returns the final 10 rows to Python. A 40GB file can be processed on a laptop with 8GB of RAM in seconds.
Handling Out-of-Core Processing (When Your RAM is Tiny)
If you are running complex aggregations (like an enormous GROUP BY with millions of unique keys) or a massive JOIN, DuckDB might still run out of memory. By default, it tries to do everything in RAM. To enable out-of-core processing, you must tell DuckDB where it is allowed to write temporary files to disk.

You can configure this using DuckDB pragmas. I always set these explicitly in my production scripts:
import duckdb
# Create a persistent connection instead of in-memory to allow disk spilling
con = duckdb.connect('analytical_pipeline.db')
# Restrict DuckDB to use only 4GB of RAM
con.execute("PRAGMA memory_limit='4GB'")
# Tell DuckDB where to write temporary swap files
con.execute("PRAGMA temp_directory='/tmp/duckdb_swap'")
# Use 8 threads for parallel, vectorized execution at the C++ level
con.execute("PRAGMA threads=8")
By enforcing a strict memory limit, DuckDB becomes virtually crash-proof. It will trade slightly longer execution times (due to SSD write speeds) for the guarantee that the Linux OOM killer will leave your process alone.
Advanced Ingestion: Surviving Messy Real-World CSVs
In a perfect world, CSV files are strictly typed, perfectly quoted, and beautifully formatted. In the real world, CSV files are an absolute nightmare. A single unescaped quote or a string appearing in an integer column halfway through a 100-million row file will crash your pipeline.
While read_csv_auto tries to sniff the data types by scanning a sample of rows, it often guesses wrong if the messy data is buried deep in the file. To build a resilient pipeline, you need to use the more explicit read_csv function and override the schema.
Here is how I structure bulletproof CSV ingestion:
import duckdb
con = duckdb.connect()
query = """
SELECT *
FROM read_csv(
'messy_data.csv',
header = true,
delim = ',',
quote = '"',
escape = '"',
null_padding = true,
ignore_errors = true,
types = {
'user_id': 'BIGINT',
'email': 'VARCHAR',
'account_balance': 'DOUBLE',
'created_at': 'TIMESTAMP'
}
)
WHERE user_id IS NOT NULL;
"""
# Fetch the cleaned data into a PyArrow table
clean_table = con.execute(query).arrow()
Let’s break down the critical arguments here:
- types: We explicitly enforce the schema. If DuckDB encounters the string “N/A” in the account_balance column, it knows to treat it as a parsing error rather than letting it derail the type sniffer.
- ignore_errors = true: This is a lifesaver. Instead of throwing an exception and halting the entire 50GB read, DuckDB will silently skip the malformed rows. If you are doing malware analysis on massive log dumps or algo trading where dropping a few corrupted ticks is acceptable, this parameter is mandatory.
- null_padding = true: Sometimes a CSV row is missing columns at the end (e.g., 4 commas instead of 5). This parameter pads the row with NULLs rather than throwing a column mismatch error.
Querying CSVs Directly From the Cloud (S3/GCS)
Downloading a 100GB CSV to your local disk just to filter it is a massive waste of network bandwidth and time. One of DuckDB’s greatest superpowers is the httpfs extension, which allows it to query files over HTTP or directly from AWS S3, Google Cloud Storage, or Azure Blob.
When combined with Parquet files, this extension issues byte-range requests so only the needed bytes ever cross the network. With CSVs, it still has to stream the whole file, but it avoids saving it to your local disk.
import duckdb
import os
con = duckdb.connect()
# Install and load the httpfs extension
con.execute("INSTALL httpfs;")
con.execute("LOAD httpfs;")
# Configure AWS credentials securely from environment variables
con.execute(f"""
CREATE SECRET (
TYPE S3,
KEY_ID '{os.getenv("AWS_ACCESS_KEY_ID")}',
SECRET '{os.getenv("AWS_SECRET_ACCESS_KEY")}',
REGION 'us-east-1'
);
""")
# Query the remote CSV directly
remote_query = """
SELECT category, COUNT(*) as item_count
FROM read_csv_auto('s3://my-data-lake-bucket/exports/massive_export.csv')
GROUP BY category;
"""
results = con.execute(remote_query).df()
print(results)
This approach completely revolutionized my Python automation scripts. I no longer write Boto3 download scripts. I just point DuckDB at the S3 URI and let the C++ engine handle the network streams.
Converting CSVs to Parquet: The Ultimate Fix
I have a harsh truth for you: if you are continuously querying massive CSV files, you are doing data engineering wrong. CSV is a terrible format for analytics. It lacks a schema, it isn’t compressed, and it forces full table scans because it lacks columnar metadata.
The best way to process large csv files with duckdb python is to process them exactly once—by converting them into Parquet files. Parquet is heavily compressed and columnar. If you query a single column from a Parquet file, DuckDB only reads the bytes for that specific column. This can speed up downstream queries by 100x.
DuckDB makes this conversion trivial. You can stream a CSV directly into a Parquet file without ever holding the full dataset in memory.
import duckdb
con = duckdb.connect()
# Stream the CSV into a ZSTD-compressed Parquet file
conversion_query = """
COPY (
SELECT * FROM read_csv_auto('raw_dump.csv')
) TO 'optimized_data.parquet' (FORMAT PARQUET, COMPRESSION ZSTD);
"""
print("Converting CSV to Parquet...")
con.execute(conversion_query)
print("Conversion complete!")
By using the ZSTD compression codec, a 50GB CSV will often shrink to a 5GB Parquet file. From that point on, you point your DuckDB queries at optimized_data.parquet, and your I/O bottlenecks will disappear.
Integrating with Polars, PyArrow, and Ibis
DuckDB is phenomenal for data ingestion and SQL processing, but you often need to pass the aggregated data into other tools for machine learning (like Scikit-learn or PyTorch models) or complex dataframe manipulation.

DuckDB implements the Apache Arrow C Data Interface. This means you can execute a DuckDB query and hand off the resulting data to a Polars dataframe or PyArrow table with zero-copy transport. The data never actually moves in memory; the Python libraries just pass a pointer to the C++ memory buffer.
import duckdb
import polars as pl
con = duckdb.connect()
# Query the massive CSV and filter down to a manageable size
query = """
SELECT timestamp, sensor_id, temperature
FROM read_csv_auto('iot_telemetry.csv')
WHERE temperature > 95.0
"""
# Fetch the result directly as a PyArrow table, then wrap it in Polars
# This is blazingly fast and involves zero memory copying
arrow_table = con.execute(query).arrow()
df = pl.from_arrow(arrow_table)
# Now use Polars' incredible expression API for downstream work
result = df.group_by("sensor_id").agg(
    pl.col("temperature").mean().alias("avg_critical_temp")
)
print(result)
If you hate writing raw SQL strings in Python, you need to look into the Ibis framework. Ibis provides a lazy, Pandas-like Python API that compiles down to SQL and executes on backends like DuckDB. It gives you the safety of Python type hints, MyPy compatibility, and IDE autocomplete, while leveraging DuckDB’s engine under the hood.
import ibis
from ibis import _
# Configure Ibis to use DuckDB
ibis.set_backend('duckdb')
# Connect to the CSV file (Ibis handles the DuckDB read_csv syntax)
table = ibis.read_csv('financial_data.csv')
# Build a lazy query using Python methods
query = (
    table
    .filter(_.status == 'COMPLETED')
    .group_by('region')
    .aggregate(total_revenue=_.amount.sum())
)
# Execute the query
df = query.execute()
This combination—DuckDB for the engine, Ibis for the query builder, and Polars for downstream manipulation—is arguably the most powerful data stack available in Python today.
Data Validation and Security in Your Pipelines
When dealing with massive ingested datasets, especially in fields like Python finance or algo trading, you cannot trust the input data. Malicious payloads or malformed strings can corrupt downstream databases. While Python security tooling and malware analysis often focus on the web layer (like FastAPI or Django applications), securing your data ingestion pipelines is just as critical.
I highly recommend wrapping your DuckDB ingestion logic in strict functions formatted by the Black formatter and linted via the Ruff linter. Ensure that your queries use parameterized inputs if you are injecting variables, though for raw file reads, explicit schema definitions via the types parameter in read_csv serve as your first line of defense against data poisoning.
Furthermore, consider writing tests for your data pipelines with Pytest. You can create a tiny, 100-row fixture CSV that contains known edge cases (nulls, massive integers, broken quotes) and assert that your DuckDB pipeline handles them gracefully without crashing the CI/CD runner.
Frequently Asked Questions
How much RAM do I need to process a 100GB CSV with DuckDB?
You can process a 100GB CSV with as little as 2GB to 4GB of RAM. By setting PRAGMA memory_limit and providing a temp_directory, DuckDB will stream the file in small chunks and spill any memory-intensive aggregations (like massive sorts or joins) to your disk.
Can DuckDB handle compressed CSV files directly?
Yes, DuckDB automatically detects and decompresses gzip (.csv.gz) and zstd (.csv.zst) files on the fly. You do not need to extract the files first; simply pass the compressed file path directly into the read_csv_auto() function.
What is the difference between DuckDB and Polars for large data?
While both are incredibly fast, vectorized tools that interoperate through Arrow, Polars is a DataFrame library designed primarily for in-memory processing (though it has an out-of-core streaming mode). DuckDB is a relational SQL database engine built from the ground up for out-of-core processing, making it generally more robust for datasets that vastly exceed your available RAM.
Does DuckDB support writing back to CSV?
Yes, DuckDB can export the results of any query directly to a CSV file using the COPY ... TO 'output.csv' (FORMAT CSV) syntax. However, for better performance and smaller file sizes, it is highly recommended to export your processed data to Parquet instead.
Final Takeaways
The days of struggling with Pandas memory crashes are over. If you need to know how to process large csv files with duckdb python, the strategy is simple: treat your CSVs as database tables, enforce strict memory limits, and let DuckDB’s vectorized C++ engine do the heavy lifting.
Stop loading data into Python objects unless you absolutely have to. By utilizing DuckDB for the initial ingestion, filtering, and aggregation, you can reduce a 50GB file into a 50MB aggregated PyArrow table in seconds. From there, you can safely pass the clean, reduced dataset into Polars, Pandas, or Scikit-learn for your final analysis. Update your stack, embrace Parquet conversions, and your data pipelines will never run out of memory again.
