
Training in Orbit

What happens when you try to train a neural network on a GPU that's getting hit by cosmic rays. Turns out the fix is surprisingly simple.

There's an idea floating around that we should put GPUs in space. Google published a paper about a TPU constellation. Starcloud launched the first H100 into orbit last November. The pitch is compelling: free solar power, free cooling via radiation into the void, no land use, no neighbors to annoy with your megawatt power draw.

The part nobody talks about is that space is trying to kill your hardware.

Specifically, cosmic rays and solar particles slam into silicon and flip bits. These are called Single Event Upsets, and they happen at a measurable rate. On the Chaohu-1 satellite, researchers measured an upset rate on the order of 10^-6 errors/bit/day in orbit. Scale that to a 94-million-parameter model in bfloat16 (about 1.5 billion total bits) and you get roughly 1,200 random bit flips per day.

For inference, this is manageable. A flipped bit produces one wrong answer, you move on. For training, it's catastrophic. A flipped bit corrupts a gradient, which gets baked into every weight, which corrupts every subsequent gradient. Errors don't just happen. They compound.

I wanted to know exactly how fragile training actually is. So I built a simulation.

The setup

I took a 94M parameter GPT-style transformer from the autoresearch project and added a bit-flip injector. At each training step, with probability calibrated to the measured LEO radiation rate, I flip a random bit in a random weight. The bit position is uniform across all 16 bits of the bfloat16 representation.

Then I swept the radiation rate from 1x realistic LEO up to 10,000x, measuring where training breaks.
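The post doesn't show the injector itself, so here is a minimal sketch of how one can be written in PyTorch (my reconstruction, not necessarily the radtrain code): reinterpret a bfloat16 weight's bits as int16 and XOR one random bit.

```python
import torch

@torch.no_grad()
def inject_bit_flip(model: torch.nn.Module) -> None:
    """Flip one uniformly random bit in one uniformly random bfloat16 weight."""
    params = [p for p in model.parameters() if p.dtype == torch.bfloat16]
    # weight the choice of tensor by element count so every weight is equally likely
    counts = torch.tensor([float(p.numel()) for p in params])
    p = params[torch.multinomial(counts, 1).item()]
    flat = p.view(-1)
    i = torch.randint(flat.numel(), (1,)).item()
    bit = torch.randint(16, (1,)).item()
    # int16 cannot hold +0x8000, so express the sign-bit mask as its negative
    mask = (1 << bit) if bit < 15 else -(1 << 15)
    flat[i] = (flat[i].view(torch.int16) ^ mask).view(torch.bfloat16)
```

At each training step, an injector like this would fire with a probability calibrated to the radiation rate being simulated.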

Six flips is all it takes

The first result was sobering. At 10x the realistic LEO rate, about 42 bit flips over a 5-minute training run, the model crashes. Not degrades. Crashes. Loss goes to NaN and training is dead.

The reason is bfloat16's format. The 16 bits break down as:

  • Bit 15: sign (1 bit)
  • Bits 7-14: exponent (8 bits)
  • Bits 0-6: mantissa (7 bits)

A single flip in the exponent field can multiply a weight by a factor as large as 2^128, producing a value on the order of 10^37. That extreme value propagates through the next forward pass, blows up the attention scores, and produces NaN gradients that instantly destroy every parameter in the model.
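A toy illustration of that cascade, using a stand-in attention-style computation rather than the actual model (all names and sizes here are invented for the demo):

```python
import torch

torch.manual_seed(0)
w = torch.nn.Parameter(torch.randn(16, 16) * 0.02)
with torch.no_grad():
    w[0, 0] = 1e37  # roughly the value a flipped exponent MSB can produce

x = torch.randn(4, 16)
q, k = x @ w, x @ w
scores = q @ k.t()                    # entries overflow float32 to inf
att = torch.softmax(scores, dim=-1)   # inf - inf inside softmax -> NaN
out = att @ x                         # NaN has now spread to the activations
```

One corrupted weight is enough: the attention scores overflow to infinity, and the softmax turns that infinity into NaN everywhere downstream.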

At realistic LEO, the model barely notices: about 4 flips in 5 minutes, and the degradation is negligible. But extrapolate to a 24-hour training run and you'd accumulate over a thousand flips. Enough to crash undefended training many times over.

The obvious defense doesn't work well enough

The first thing I tried was NaN repair: after each bit flip, check if the result is NaN or infinity, and zero it out. This helps. It extends the survivable radiation rate by roughly 100x. But the quality degrades badly, because most exponent flips don't produce NaN. They produce large-but-finite values that accumulate and gradually corrupt the model.
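A minimal version of that repair pass might look like this (a sketch; `repair_non_finite` is my name, not radtrain's):

```python
import torch

@torch.no_grad()
def repair_non_finite(model: torch.nn.Module) -> int:
    """Zero out any NaN/inf weights; returns the number of entries repaired."""
    repaired = 0
    for p in model.parameters():
        bad = ~torch.isfinite(p)
        if bad.any():
            p[bad] = 0.0
            repaired += int(bad.sum())
    return repaired
```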

At that rate with NaN repair, the model finishes training, but its val_bpb is far worse than a clean run's. That's barely functional.

The fix that actually works

Here's where it gets interesting. I added one line of logic to the training loop: before each forward pass, flip one random bit in one random weight. That's it. Train with bit flips on purpose.
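Sketched as a toy loop (a small float32 model instead of the post's bfloat16 transformer; all names here are mine):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Sequential(
    torch.nn.Linear(8, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

@torch.no_grad()
def flip_one_bit(model):
    # the one-line idea: a deliberate random bit flip before every forward pass
    params = list(model.parameters())
    p = params[torch.randint(len(params), (1,)).item()]
    flat = p.view(-1)
    i = torch.randint(flat.numel(), (1,)).item()
    bit = torch.randint(32, (1,)).item()
    mask = (1 << bit) if bit < 31 else -(1 << 31)  # int32 can't hold +2**31
    flat[i] = (flat[i].view(torch.int32) ^ mask).view(torch.float32)

@torch.no_grad()
def zero_non_finite(model):
    for p in model.parameters():
        p[~torch.isfinite(p)] = 0.0

for step in range(50):
    flip_one_bit(model)       # train with faults on purpose
    zero_non_finite(model)    # repair NaN/inf immediately
    x = torch.randn(32, 8)
    loss = (model(x) - x.sum(1, keepdim=True)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    zero_non_finite(model)    # repair anything a blown-up gradient produced
```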

I call this Fault-Aware Training, and the results surprised me.

The FAT-trained model's val_bpb with no radiation is nearly identical to a clean run's; the quality cost is small and fixed. Then I tested it under radiation. At realistic LEO: unchanged. At 10x and 100x: unchanged. At the rate where NaN repair alone crashes instantly: still unchanged. Even at 10,000x LEO, the score holds.

The model becomes radiation-invariant across four orders of magnitude:

Radiation Resilience by Defense Strategy

val_bpb across radiation rates — lower is better, null means crash

I should be precise about what's happening here. FAT doesn't teach the model to handle bit flips specifically. It pushes the optimizer into a flat region of the loss landscape where individual weight perturbations don't matter much. The mechanism is related to Sharpness-Aware Minimization and other flat-minima methods, but with an important difference: the perturbation is sparse and heavy-tailed, matching the actual physics of radiation, rather than smooth and Gaussian.

I tested this directly. Replacing bit-flip noise with Gaussian noise during training does not produce radiation resilience. The model crashes at 100x LEO just like an undefended model. The perturbation has to be sparse and occasionally extreme. A few random weights getting large shocks, not every weight getting a small nudge.

Only half the bits matter

I ran a separate set of experiments where I restricted bit flips to specific bit positions. The results were clean:

Mantissa flips (bits 0-6, 44% of all bits): 434 flips at 100x LEO caused zero measurable degradation. Literally zero.

Sign flips (bit 15, 6% of all bits): Also zero degradation. Flipping a weight's sign is a large perturbation, but the optimizer corrects for it within a few steps.

Exponent flips (bits 7-14, 50% of all bits): These account for 100% of the damage.

Vulnerability by Bit Position

434 flips at 100× LEO — percentage of total quality degradation caused
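You can see the asymmetry on a single weight by flipping each of the 16 bit positions in turn (a toy check on one representative value, not the experiment itself):

```python
import torch

w = torch.tensor(0.04, dtype=torch.bfloat16)
for bit in range(16):
    mask = (1 << bit) if bit < 15 else -(1 << 15)  # int16 can't hold +0x8000
    flipped = (w.view(torch.int16) ^ mask).view(torch.bfloat16)
    kind = "mantissa" if bit < 7 else ("exponent" if bit < 15 else "sign")
    print(f"bit {bit:2d} ({kind}): {w.item():.4f} -> {flipped.item():.3e}")
```

Mantissa flips leave the value in the same binade, exponent flips move it by up to dozens of orders of magnitude, and the sign flip just negates it.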

This has a practical implication. If you could protect just the exponent byte of each bfloat16 weight with ECC, you'd eliminate essentially all radiation vulnerability while protecting only half the bits. That's a much cheaper hardware solution than protecting everything.

The variance result

This might be the most practically significant finding. Without FAT, the outcome of training under radiation is a gamble. I ran three seeds at the same radiation rate:

[Table: val_bpb by random seed (42, 123, 7), No FAT vs. FAT]

That's a huge spread. Whether your model works depends on whether a flip happens to land on a critical exponent bit in a critical weight. You can't predict it in advance.

With FAT, all seeds converge to essentially the same val_bpb, and the across-seed standard deviation collapses to near zero.

Outcome Variance at 10× LEO

val_bpb across random seeds — FAT eliminates the gamble

For a satellite, this is the difference between "training might work" and "training will work." You can't retry a training run from orbit. The system has to be reliable on the first attempt. FAT turns radiation from a high-variance gamble into a predictable, fixed tax.

The other constraints

Radiation isn't the only problem with training in orbit. I simulated two more:

Eclipse power cycling. LEO satellites lose power for about 30 minutes every 90-minute orbit. I simulated this by resetting all optimizer state every 200 training steps. The cost was tiny: about 0.002-0.004 val_bpb per cycle. The optimizer momentum rebuilds in a few steps.
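That reset can be reproduced by clearing the optimizer's state mid-run (a sketch with a toy model; the 200-step interval matches the simulation):

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(8, 1)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)

for step in range(500):
    x = torch.randn(16, 8)
    loss = (model(x) - x.mean(1, keepdim=True)).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if (step + 1) % 200 == 0:
        opt.state.clear()  # eclipse: Adam's momentum/variance are lost, weights survive
```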

Progressive degradation. Cumulative radiation damage increases error rates over a satellite's lifetime. I simulated this by linearly ramping the flip rate from 1x to 10x over training. FAT handled it without any additional degradation.
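The ramp itself is a one-liner; a sketch of the schedule (the function name and baseline are mine):

```python
def flip_probability(step: int, total_steps: int, base: float) -> float:
    """Per-step flip probability, ramping linearly from 1x to 10x over the run."""
    return base * (1.0 + 9.0 * step / total_steps)
```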

The full combined simulation (FAT plus eclipse cycling plus progressive degradation plus 100x radiation) produced a model with a val_bpb only modestly worse than clean, trained entirely in software on a standard GPU, with no hardware modifications.

Full Space Simulation: Cumulative Cost

val_bpb with each space constraint added — FAT + eclipse + progressive TID + 100× radiation

What I learned

The big takeaway is that training in space is harder than inference in space, but not as hard as I expected. The vulnerability is real but narrow: it's entirely about exponent bits in floating-point weights. And the defense is simple: train with one bit flip per step, repair NaN values immediately, and the model becomes effectively immune to radiation across a wide range.

There are things I can't test in simulation. Vacuum cooling limits how many GPUs you can run. Inter-satellite bandwidth limits distributed training. Micrometeorite impacts are low-probability but catastrophic. These are hardware problems that require hardware solutions.

But the software side of the problem, making training robust to the bit flips that will inevitably happen, seems surprisingly tractable. A few lines of code, a small quality cost, and the model just works.

I don't know if we'll actually train large models in space anytime soon. The economics are still challenging, and there are easier ways to get more compute on the ground. But the technical barrier is lower than I assumed. If the power and cooling economics ever work out, the bit-flip problem won't be what stops us.

The code is at github.com/tylergibbs1/radtrain. Forty-five experiments, three saved model checkpoints, and a portable bit-flip injection library you can use with any PyTorch model.


References

Space-based AI infrastructure:

Radiation effects on hardware:

Radiation defenses (inference-focused, pre-dating this work):

Flat minima and heavy-tailed noise (theoretical grounding):

Orbital edge computing:

Training infrastructure (this work builds on):

  • autoresearch — Karpathy's autonomous research framework. The training script and model architecture used in all experiments.