Data Engineering
Posts about Data Engineering
Inside Polars’ Streaming Engine: How Spillable Sinks Handle Larger-Than-RAM Joins
If you search polars streaming engine spill today, the top result is a benchmark that OOM’d on Polars 1.6.0 in September 2024.
Inside Polars’ Lazy Query Engine: How Predicate Pushdown Beats Pandas
The pola.rs blog has a headline number that every Polars tutorial copies: the optimized lazy plan runs roughly 4x faster than the naive one on the NYC.
Step-by-Step Tutorial: Writing Python Extensions in Rust With PyO3
I hit a massive performance wall last Tuesday. I was tasked with parsing a 50GB dataset of nested JSON logs for a cybersecurity client doing malware.
Marimo vs Jupyter Notebook: Which Python Environment is Best?
I just spent three hours debugging a machine learning pipeline, only to realize I had executed cell 14 before cell 12.
How to Process Massive CSV Files With DuckDB and Python Fast
A 60GB CSV file lands in your AWS S3 bucket. Your data pipeline triggers, spins up a standard EC2 instance, and attempts to run pandas.read_csv() .
I Dropped Pandas for Polars. Here’s What Broke.
It was 11 PM on a Thursday last month. My data pipeline running on a t3.xlarge AWS instance crashed for the fourth time that week. The culprit?
Mastering Gil Removal: Advanced Techniques and Best Practices for Modern Developers
Introduction to Gil Removal In today’s rapidly evolving technological landscape, GIL removal has emerged as a critical skill for developers seeking to.
FastAPI on the Edge: Running Local LLMs on a Pi
I’m officially sick of renting $3/hour cloud GPUs just to parse text. For the last few weeks, I’ve been moving my background Learn about FastAPI news.
I Turned on the Python 3.14 JIT in Production (Well, Staging). Here’s the Truth.
Well, I have to admit, I was a bit skeptical about this whole Python JIT thing at first. In my experience, “free performance” usually comes …
TF 2.18 & Keras: Real-World Performance Review
I finally bit the bullet last week. After ignoring the notification icons for two months, I upgraded our main training pipeline to TensorFlow 2.18.
