Numpyro: Probabilistic Programming That Doesn’t Waste My Time

I remember the bad old days of probabilistic programming. You’d define a hierarchical model, hit “sample,” and then—I’m not joking—go watch an entire episode of something while the chains warmed up. If you made a typo? Too bad. You wouldn’t find out until you came back 45 minutes later.

It was brutal.

That slowness is mostly why I stayed away from Bayesian methods for production work. Great for papers, terrible for deadlines. But I’ve been using Numpyro heavily for the past few months, specifically on a customer churn prediction project, and honestly? It feels like cheating.

If you haven’t touched it yet, Numpyro is basically what happens when you take the Pyro syntax and smash it together with JAX. The result is a probabilistic programming library (PPL) that compiles your model to XLA and runs it on your GPU at speeds that make CPU-based sampling look like it’s running on a graphing calculator.

The JAX Difference (Or: Why It Screams)

The magic here isn’t really Numpyro itself—it’s JAX. Because Numpyro is built from the ground up on JAX, it gets Just-In-Time (JIT) compilation for free. When you run an MCMC sampler, it compiles the gradient computation (needed for the No-U-Turn Sampler, or NUTS) into highly optimized XLA code.
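
To make that concrete, here is the pattern in plain JAX. This is a toy standard-normal log-density for illustration, not Numpyro’s actual internals, but it’s exactly the kind of function NUTS needs gradients of:

import jax
import jax.numpy as jnp

# Toy log-density: an unnormalized standard normal
def log_density(theta):
    return -0.5 * jnp.sum(theta ** 2)

# Compose grad and jit: XLA compiles the whole gradient computation once
grad_log_density = jax.jit(jax.grad(log_density))

theta = jnp.array([0.5, -1.2, 3.0])
print(grad_log_density(theta))  # first call compiles; repeat calls are near-instant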

I ran a test on my MacBook Pro (M3 Max chip) yesterday. I took a fairly standard hierarchical regression model with about 5,000 data points and 50 groups.

  • Pure Python/NumPy implementation: I killed it after 10 minutes. Life is too short.
  • Legacy PPL (Theano-based backends): ~3 minutes.
  • Numpyro (JIT compiled): 14 seconds.

Fourteen seconds. That includes the compilation time. Once it’s compiled, subsequent runs are almost instant. This changes your workflow completely. You stop treating inference as a “batch job” and start treating it like an interactive session. You can iterate on priors, tweak likelihoods, and actually explore the model structure without the context-switching penalty.

What It Looks Like

If you’ve used Pyro or even PyMC, the syntax is familiar. You define a model function, use sample statements for random variables, and you’re off.

Here is a stripped-down version of a model I was debugging last Tuesday. It’s a simple linear regression, but pay attention to how clean the code is.

import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def model(x, y=None):
    # Priors
    # Note: We don't worry about shapes usually, JAX handles broadcasting
    alpha = numpyro.sample("alpha", dist.Normal(0.0, 1.0))
    beta = numpyro.sample("beta", dist.Normal(0.0, 1.0))
    sigma = numpyro.sample("sigma", dist.Exponential(1.0))
    
    # Expected value
    mu = alpha + beta * x
    
    # Likelihood
    numpyro.sample("obs", dist.Normal(mu, sigma), obs=y)

# Fake data
x_data = jnp.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_data = 2.0 * x_data + 1.0 + random.normal(random.PRNGKey(0), (5,)) * 0.1

# The part that actually does the work
# tested with JAX 0.5.0 and Numpyro 0.16.0
nuts_kernel = NUTS(model)
mcmc = MCMC(nuts_kernel, num_warmup=500, num_samples=1000)

# CRITICAL: You must pass a PRNGKey. JAX doesn't have global random state.
mcmc.run(random.PRNGKey(42), x=x_data, y=y_data)

mcmc.print_summary()

See that random.PRNGKey(42)? That’s the only part that trips people up.
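
Once mcmc.run finishes, pulling the posterior out takes two lines. The snippet below continues the example above; get_samples and Predictive are standard Numpyro API, while x_new and the printed mean are just illustration:

from numpyro.infer import Predictive

# Posterior draws come back as a plain dict of arrays
samples = mcmc.get_samples()
print(samples["beta"].mean())  # should sit near the true slope of 2.0

# Posterior predictive at new inputs: y=None in the model, so "obs" gets sampled
x_new = jnp.linspace(0.0, 6.0, 50)
predictive = Predictive(model, samples)
y_pred = predictive(random.PRNGKey(1), x=x_new)["obs"]  # shape (1000, 50)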

The One Thing That Will Drive You Crazy

Let’s be real for a second. It’s not all perfect. If you are coming from standard NumPy or PyTorch, JAX’s functional nature is going to hurt your brain for the first week. JAX (and by extension Numpyro) does not handle state globally. You have to thread that random key through everything.

I learned this the hard way: I spent two hours last week debugging a model that produced identical “random” results every single loop. Why? Because I was reusing the same PRNGKey without splitting it. In standard Python, random.seed(0) sets a global state. In JAX, the key is just a tensor. If you pass the same tensor to a function, you get the same result. Every. Single. Time.

You have to do this explicit splitting dance:

rng_key = random.PRNGKey(0)
rng_key, rng_subkey = random.split(rng_key)
# Use rng_subkey for the operation, keep rng_key for the next split

It’s annoying, but it makes your code deterministic and parallelizable, which is a big part of why Numpyro scales so cleanly across devices. Once you build a small helper to handle keys, you stop noticing it.
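
For what it’s worth, the helper doesn’t need to be fancy. Here is one possible version (key_stream is my own name, not a JAX API); it turns the splitting dance into a next() call:

from jax import random

def key_stream(seed):
    # Hypothetical helper: an endless supply of fresh subkeys
    key = random.PRNGKey(seed)
    while True:
        key, subkey = random.split(key)
        yield subkey

keys = key_stream(42)
draw_a = random.normal(next(keys), (3,))
draw_b = random.normal(next(keys), (3,))  # different subkey, different numbers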

Performance in the Real World

I wanted to see if the hype held up on something messy, not just a toy linear regression. I ported a Gaussian Process model we use for time-series forecasting. The original implementation was in a different probabilistic framework (naming no names, but it rhymes with “High MC”).

The dataset had about 2,000 time steps. Gaussian Processes are notoriously slow because they involve inverting big matrices—complexity is roughly cubic, O(N³).
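
The model I ported is internal, but the skeleton looks roughly like this: a generic squared-exponential GP with placeholder priors, a minimal sketch rather than our production code. The cubic cost hides in the multivariate normal likelihood:

import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist

def gp_model(t, y=None):
    # Kernel hyperparameters (these priors are illustrative)
    amp = numpyro.sample("amp", dist.LogNormal(0.0, 1.0))
    length = numpyro.sample("length", dist.LogNormal(0.0, 1.0))
    noise = numpyro.sample("noise", dist.LogNormal(-2.0, 1.0))

    # Squared-exponential covariance over all N time steps
    diff = t[:, None] - t[None, :]
    K = amp ** 2 * jnp.exp(-0.5 * (diff / length) ** 2)
    K = K + noise ** 2 * jnp.eye(t.shape[0])  # observation noise on the diagonal

    # The O(N^3) factorization happens inside this one line
    numpyro.sample("obs", dist.MultivariateNormal(jnp.zeros(t.shape[0]), covariance_matrix=K), obs=y)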

The Benchmark:

  • Old Framework: 12 minutes 40 seconds for 1000 samples.
  • Numpyro on CPU: 4 minutes 15 seconds.
  • Numpyro on GPU (RTX 4070 Ti): 28 seconds.

I literally double-checked the convergence diagnostics (R-hat and effective sample size) because I didn’t believe it actually finished that fast. But it did. The effective sample size was actually higher in Numpyro because the NUTS sampler implementation in JAX seems to mix more efficiently in this specific case.
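
For the record, nothing in the model changed between the CPU and GPU runs. You just point JAX at the device before anything executes (this assumes a CUDA-enabled jaxlib install; num_chains=4 is illustrative):

import numpyro
from numpyro.infer import MCMC, NUTS

numpyro.set_platform("gpu")  # must run before any JAX computation

# gp_model is the sketch from above. "vectorized" runs all chains
# as one batched computation on a single GPU.
mcmc = MCMC(NUTS(gp_model), num_warmup=500, num_samples=1000,
            num_chains=4, chain_method="vectorized")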

My Take

If you are doing Bayesian statistics in 2026 and you aren’t using a JAX-backed library, you are probably wasting your own time. It’s that simple.

Numpyro strikes this weirdly perfect balance. It’s lightweight enough that you can read the source code (I have, it’s surprisingly readable), but powerful enough to run massive models that used to require custom C++ code. The ecosystem around it is maturing too, with libraries like numpyro-ext filling in the gaps.

Is it harder to learn than the high-level Keras-style APIs? Maybe a little. The error messages from the XLA compiler can sometimes look like a cat walked across a keyboard. But when you can iterate on a complex model in seconds instead of hours, that learning curve pays for itself in a week.

Give it a shot on your next project. Just remember to split your random keys.
