Ray and PyTorch are finally under one roof. Good riddance to the anxiety.
I should clarify: I’ve been writing distributed training scripts for the better part of five years, and the anxiety of open-source licensing changes has been living rent-free in my head since the whole Terraform debacle. So when Ray officially joined the PyTorch Foundation late last year, I actually exhaled. Good call on the Ray team’s part.
It’s February 2026 now. The dust has settled on the Ray Summit announcements, and we can actually look at what this means without the hype goggles on. But if you missed it, Ray—the massive distributed compute framework we all use to scale Python—moved under the PyTorch Foundation, which itself sits inside the Linux Foundation.
And why does this matter? Well, for a minute there, with every major infrastructure tool pulling a “business source license” switcheroo, I was genuinely worried Ray might go the same route. But it didn’t. Instead, it doubled down on open governance. And thank god for that, because rewriting my entire MLOps stack to avoid vendor lock-in is not how I want to spend my spring.
The “Bus Factor” Just Improved
Here’s the thing — I love PyTorch. I use it daily. But scaling it has always been… a choice. You could use torch.distributed directly if you enjoy managing environment variables and manually handling rank assignments. Or you could use Ray, which abstracts the pain away but felt like a separate island.
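For context, here is roughly what the manual path involves. This is a sketch, assuming the usual torchrun-style launch where the launcher exports the rank and rendezvous variables; the function name is mine, not anything official.

# A sketch of the manual torch.distributed setup that Ray abstracts away.
# Assumes a torchrun-style launcher has exported RANK, WORLD_SIZE, LOCAL_RANK,
# MASTER_ADDR, and MASTER_PORT for every process on every node.
import os

import torch
import torch.distributed as dist

def manual_ddp_setup():
    rank = int(os.environ["RANK"])              # global rank of this process
    world_size = int(os.environ["WORLD_SIZE"])  # total processes across nodes
    local_rank = int(os.environ["LOCAL_RANK"])  # which GPU on this node

    # init_process_group picks up MASTER_ADDR / MASTER_PORT from the environment
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)
    return local_rank

None of that is hard, exactly; it’s just bookkeeping you get to redo for every cluster, and debug every time a node drops off the network.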
By moving Ray into the PyTorch Foundation, the commitment to keep it open is legally structural, not just a “trust us” promise from a startup. The Linux Foundation is already home to Kubernetes and vLLM. And having Ray sit right next to them makes sense. It’s the infrastructure layer that was missing from the foundation’s portfolio.
I was testing a new training pipeline on my cluster last week—running PyTorch 2.6 on a mix of A100s and some older V100s—and the friction between the tools is already starting to vanish. The APIs feel less like they’re fighting each other and more like they’re shaking hands.
What This Actually Looks Like in Code
If you haven’t touched Ray Train recently, you might remember the boilerplate nightmare it used to be. But it’s gotten better. Significantly better.
I stripped down a script I used for a client project recently to show how clean the integration is becoming. This isn’t pseudo-code; I ran a variant of this on a 3-node cluster just yesterday to fine-tune a Llama 3 variant.
import ray
from ray import train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
import torch
import torch.nn as nn
import torch.optim as optim

def train_func(config):
    # This is the magic part - Ray handles the device placement
    # No more manual cuda:0 vs cuda:1 headaches
    model = nn.Linear(1, 1)
    model = train.torch.prepare_model(model)
    optimizer = optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    device = train.torch.get_device()  # the GPU Ray assigned to this worker

    # Fake data loop just to prove the point
    for i in range(10):
        x = torch.randn(32, 1, device=device)
        y = torch.randn(32, 1, device=device)
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Reporting metrics back to the head node automatically
        train.report({"loss": loss.item()})

# This config is where the governance merge matters
# Standardizing how we define resources across the PyTorch ecosystem
scaling_config = ScalingConfig(
    num_workers=4,  # Total GPUs across the cluster (1 per worker with use_gpu=True)
    use_gpu=True,
    resources_per_worker={"CPU": 2},
)

trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    scaling_config=scaling_config,
)

result = trainer.fit()
print(f"Training finished. Last loss: {result.metrics['loss']}")
Notice what isn’t there? I didn’t have to set MASTER_ADDR or MASTER_PORT. I didn’t have to write a bash script to launch processes on different nodes. Ray handles the plumbing.
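Shipping the script to the cluster is similarly painless. Here’s a minimal sketch using the Ray Jobs API; the dashboard address and the train_script.py name are placeholders for your own setup.

# Minimal sketch: submitting the training script above as a Ray job.
# The dashboard address and script name are placeholders.
from ray.job_submission import JobSubmissionClient

client = JobSubmissionClient("http://127.0.0.1:8265")  # head node's dashboard
job_id = client.submit_job(
    entrypoint="python train_script.py",
    runtime_env={"working_dir": "."},  # ships the local working directory to the cluster
)
print(f"Submitted job: {job_id}")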
The vLLM Connection
This is the part that excites me the most, and it’s something people aren’t talking about enough. vLLM is also under the Linux Foundation now.
Right now, serving large models is a mess of fragmentation. You have TorchServe (which feels abandoned sometimes, let’s be honest), Ray Serve, vLLM, TGI… the list goes on. But with Ray and vLLM both under the same governance umbrella as PyTorch, I’m betting my lunch money that we see a unified serving stack emerge by the end of 2026.
Imagine a world where torch.compile optimizations flow directly into vLLM kernels, which are then orchestrated by Ray Serve, all without version conflicts or monkey-patching. That’s the dream. We aren’t there yet—I spent three hours last Tuesday fighting a dependency conflict between pydantic versions in Ray and vLLM—but the path is clear.
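To make that concrete: the pieces already compose today, just not as tightly as I’d like. Here is a rough sketch of vLLM running inside a Ray Serve deployment; the model name, GPU count, and sampling settings are placeholders, and this uses vLLM’s plain LLM class rather than anything production-tuned.

# Rough sketch: vLLM generating tokens inside a Ray Serve deployment.
# Model name, GPU count, and sampling settings are placeholders.
from ray import serve
from vllm import LLM, SamplingParams

@serve.deployment(ray_actor_options={"num_gpus": 1})
class LlamaService:
    def __init__(self):
        # vLLM owns the KV cache and the CUDA kernels
        self.llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
        self.params = SamplingParams(max_tokens=128)

    async def __call__(self, request):
        prompt = (await request.json())["prompt"]
        # Ray Serve owns routing, scaling, and placement of this replica
        outputs = self.llm.generate([prompt], self.params)
        return {"text": outputs[0].outputs[0].text}

app = LlamaService.bind()
# serve.run(app)  # exposes an HTTP endpoint once a cluster is up

The point isn’t that this is novel; it’s that shared governance is the best shot at keeping the dependency ranges of these three projects from drifting apart.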
Real Talk: It’s Still Not Perfect
Don’t get me wrong. Governance doesn’t fix bugs overnight.
I still run into weird serialization errors when passing complex PyTorch objects through the Ray object store. And just last month, I had a job crash silently because of the OOM killer on a worker node, which Ray reported as a “network timeout”. Debugging distributed systems is still a pain.
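My workaround for the serialization issue, for what it’s worth, is to stop passing live modules around and ship a CPU state_dict through the object store instead. A sketch, with a tiny stand-in model:

# Sketch: avoid serialization surprises by shipping a CPU state_dict,
# not the live (possibly GPU-backed, DDP-wrapped) module itself.
import ray
import torch
import torch.nn as nn

model = nn.Linear(1, 1)  # stand-in for the real model

cpu_state = {k: v.detach().cpu() for k, v in model.state_dict().items()}
state_ref = ray.put(cpu_state)  # plain CPU tensors serialize predictably

@ray.remote
def evaluate(state):
    m = nn.Linear(1, 1)
    m.load_state_dict(state)
    return m(torch.ones(1, 1)).item()

print(ray.get(evaluate.remote(state_ref)))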
But the alternative? The alternative was a fragmented ecosystem where the tool that scales your compute (Ray) and the tool that defines your model (PyTorch) were owned by different entities with potentially conflicting profit motives. Now, they are roommates.
My Prediction for Late 2026
Here is my hot take: By Q4 2026, I expect we’ll see a “PyTorch Native” scaling API that is just a thin wrapper around Ray Core. The distinction between “running PyTorch” and “running PyTorch on Ray” will start to blur until it’s just an implementation detail you configure in a YAML file.
If you’re building platform infrastructure today, you can finally stop worrying if Ray is going to pull a license rug. It’s safe. It’s open. Now we just have to fix the CUDA errors.
