Training in Orbit

What happens when you try to train a neural network on a GPU that's getting hit by cosmic rays. Turns out the fix is surprisingly simple.

There's an idea floating around that we should put GPUs in space. Google published a paper about a TPU constellation. Starcloud launched the first H100 into orbit last November. The pitch is compelling: free solar power, free cooling via radiation into the void, no land use, no neighbors to annoy with your megawatt power draw.

The part nobody talks about is that space is trying to kill your hardware.

Specifically, cosmic rays and solar particles slam into silicon and flip bits. These are called Single Event Upsets, and they happen at a measurable rate. On the Chaohu-1 satellite, researchers measured a rate of $4.76 \times 1 0^{- 7}$ errors/bit/day in orbit. Scale that to a 94-million-parameter model in bfloat16 — about $2.5 \times 1 0^{9}$ total bits — and you get roughly 1,200 random bit flips per day.

For inference, this is manageable. A flipped bit produces one wrong answer, you move on. For training, it's catastrophic. A flipped bit corrupts a gradient, which gets baked into every weight, which corrupts every subsequent gradient. Errors don't just happen. They compound.

I wanted to know exactly how fragile training actually is. So I built a simulation.

The setup

I took a 94M parameter GPT-style transformer from the autoresearch project and added a bit-flip injector. At each training step, with probability calibrated to the measured LEO radiation rate, I flip a random bit in a random weight. The bit position is uniform across all 16 bits of the bfloat16 representation.

Then I swept the radiation rate from 1x realistic LEO up to 10,000x, measuring where training breaks.

Six flips is all it takes

The first result was sobering. At 10x the realistic LEO rate, about 42 bit flips over a 5-minute training run, the model crashes. Not degrades. Crashes. Loss goes to NaN and training is dead.

The reason is bfloat16's format. The 16 bits break down as:

$[sign: 1 bit] [exponent: 8 bits] [mantissa: 7 bits]$

A single flip in the exponent field can multiply a weight by $2^{128}$ , producing a value around $1 0^{38}$ . That extreme value propagates through the next forward pass, blows up the attention scores, and produces NaN gradients that instantly destroy every parameter in the model.

At $1 \times$ realistic LEO, the model barely notices. About 4 flips in 5 minutes, and the degradation is less than $0.03%$ . But extrapolate to a 24-hour training run, and you'd accumulate $\sim 1, 200$ flips. Enough to crash undefended training many times over.

The obvious defense doesn't work well enough

The first thing I tried was NaN repair: after each bit flip, check if the result is NaN or infinity, and zero it out. This helps. It extends the survivable radiation rate by roughly 100x. But the quality degrades badly, because most exponent flips don't produce NaN. They produce large-but-finite values that accumulate and gradually corrupt the model.

At $100 \times$ LEO with NaN repair, the model finishes training but with a val_bpb of $2.11$ , compared to $0.98$ clean. That's barely functional.

The fix that actually works

Here's where it gets interesting. I added one line of logic to the training loop: before each forward pass, flip one random bit in one random weight. That's it. Train with bit flips on purpose.

I call this Fault-Aware Training, and the results surprised me.

The FAT-trained model achieves $1.011$ val_bpb with no radiation. That's a $3.2%$ quality cost from clean. But then I tested it at $10 \times$ LEO. Still $1.011$ . At $100 \times$ . Still $1.012$ . At $1, 000 \times$ , the rate where NaN repair alone crashes instantly. Still $1.012$ . At $10, 000 \times$ . $1.012$ .

The model becomes radiation-invariant across four orders of magnitude:

$val_bpb (r) \approx 1.011 \forall r \in [1 \times, 10, 000 \times LEO]$

Radiation Resilience by Defense Strategy

val_bpb across radiation rates: lower is better, null means crash

I should be precise about what's happening here. FAT doesn't teach the model to handle bit flips specifically. It pushes the optimizer into a flat region of the loss landscape where individual weight perturbations don't matter much. The mechanism is related to Sharpness-Aware Minimization and other flat-minima methods, but with an important difference: the perturbation is sparse and heavy-tailed, matching the actual physics of radiation, rather than smooth and Gaussian.

I tested this directly. Replacing bit-flip noise with Gaussian noise during training does not produce radiation resilience. The model crashes at 100x LEO just like an undefended model. The perturbation has to be sparse and occasionally extreme. A few random weights getting large shocks, not every weight getting a small nudge.

Only half the bits matter

I ran a separate set of experiments where I restricted bit flips to specific bit positions. The results were clean:

Mantissa flips (bits 0-6, 44% of all bits): 434 flips at 100x LEO caused zero measurable degradation. Literally zero.

Sign flips (bit 15, 6% of all bits): Also zero degradation. Flipping a weight's sign is a large perturbation, but the optimizer corrects for it within a few steps.

Exponent flips (bits 7-14, 50% of all bits): These account for 100% of the damage.

Vulnerability by Bit Position

434 flips at 100× LEO: percentage of total quality degradation caused

This has a practical implication. If you could protect just the exponent byte of each bfloat16 weight with ECC, you'd eliminate essentially all radiation vulnerability while protecting only half the bits. That's a much cheaper hardware solution than protecting everything.

The variance result

This might be the most practically significant finding. Without FAT, the outcome of training under radiation is a gamble. I ran three seeds at the same radiation rate:

	Seed 42	Seed 123	Seed 7
No FAT	$1.009$	$1.022$	$1.102$
FAT	$1.011$	—	$1.009$

That's a huge spread. Whether your model works depends on whether a flip happens to land on a critical exponent bit in a critical weight. You can't predict it in advance.

With FAT, all seeds converge to $\approx 1.010$ . The standard deviation drops from $σ \approx 0.05$ to $σ \approx 0.001$ .

Outcome Variance at 10× LEO

val_bpb across random seeds: FAT eliminates the gamble

For a satellite, this is the difference between "training might work" and "training will work." You can't retry a training run from orbit. The system has to be reliable on the first attempt. FAT turns radiation from a high-variance gamble into a predictable, fixed tax.

The other constraints

Radiation isn't the only problem with training in orbit. I simulated two more:

Eclipse power cycling. LEO satellites lose power for about 30 minutes every 90-minute orbit. I simulated this by resetting all optimizer state every 200 training steps. The cost was tiny: about 0.002-0.004 val_bpb per cycle. The optimizer momentum rebuilds in a few steps.

Progressive degradation. Cumulative radiation damage increases error rates over a satellite's lifetime. I simulated this by linearly ramping the flip rate from 1x to 10x over training. FAT handled it without any additional degradation.

The full combined simulation — FAT plus eclipse cycling plus progressive degradation plus $100 \times$ radiation — produced a model with val_bpb of $1.016$ . That's $3.7%$ worse than clean, trained entirely in software on a standard GPU, with no hardware modifications.

Full Space Simulation: Cumulative Cost

val_bpb with each space constraint added: FAT + eclipse + progressive TID + 100× radiation

What I learned

The big takeaway is that training in space is harder than inference in space, but not as hard as I expected. The vulnerability is real but narrow: it's entirely about exponent bits in floating-point weights. And the defense is simple: train with one bit flip per step, repair NaN values immediately, and the model becomes effectively immune to radiation across a wide range.

There are things I can't test in simulation. Vacuum cooling limits how many GPUs you can run. Inter-satellite bandwidth limits distributed training. Micrometeorite impacts are low-probability but catastrophic. These are hardware problems that require hardware solutions.

But the software side of the problem, making training robust to the bit flips that will inevitably happen, seems surprisingly tractable. A few lines of code, a $3.2%$ quality cost, and the model just works.

I don't know if we'll actually train large models in space anytime soon. The economics are still challenging, and there are easier ways to get more compute on the ground. But the technical barrier is lower than I assumed. If the power and cooling economics ever work out, the bit-flip problem won't be what stops us.

The code is at github.com/tylergibbs1/radtrain. Forty-five experiments, three saved model checkpoints, and a portable bit-flip injection library you can use with any PyTorch model.

References

Space-based AI infrastructure:

Exploring a Space-Based, Scalable AI Infrastructure System Design — Google's TPU constellation proposal. First published radiation test results for a cloud AI accelerator. Survived 15 krad TID with no hard failures.
Starcloud Trains First AI Model in Space — First H100 in orbit, November 2025. Successfully ran nanoGPT training on a 60 kg satellite.

Radiation effects on hardware:

RedNet: A Case for Application-Aware Space Radiation Tolerance — Measured LEO SEU rate of $4.76 \times 1 0^{- 7}$ errors/bit/day on Chaohu-1 satellite. The radiation rate used in all experiments in this post.
Understanding Silent Data Corruption in LLM Training — Meta's study showing 6 SDC events in a 54-day training run on terrestrial hardware. Training diverges silently.
A Single Bit-Flip Destroys Half of an LLM's Accuracy — Demonstrates that one bit flip in a quantized LLM can drop accuracy from 73.5% to 0%.
GPU Resilience at Scale: H100 vs A100 — H100s show 3.2x lower MTBE for memory errors vs A100. 5% overprovisioning needed.

Radiation defenses (inference-focused, pre-dating this work):

Fault-Aware Training for SEU Mitigation — Injects bit flips during training to improve inference resilience. Only validated in simulation, only for inference. Our work extends FAT to pretraining.
Rotated Robustness: Training-Free Defense Against Bit-Flip Attacks on LLMs — Uses orthogonal transforms to smooth activation outliers. Inference-only.
Enhancing DNN Robustness Through Saturated Activation Functions — Bounds weights via tanh during training. Related to our finding that logit softcap acts as a radiation defense.

Flat minima and heavy-tailed noise (theoretical grounding):

A Tail-Index Analysis of Stochastic Gradient Noise in Deep Networks — Simsekli et al. showed SGD gradient noise is heavy-tailed, not Gaussian. Heavier tails find flatter minima.
Hausdorff Dimension, Heavy Tails, and Generalization — Connects tail index of optimization noise to fractal dimension of minima. Foundational for understanding why bit-flip noise (heavy-tailed) produces different minima than Gaussian noise.

Orbital edge computing:

Bringing Federated Learning to Space — First systematic FL study adapted for orbital dynamics. 768 constellation configurations, 9x speedup.
A Comprehensive Survey on Orbital Edge Computing — Full stack survey of compute in orbit.
FPGA-Based Neural Network Accelerators for Space: A Survey — FPGAs dominate actual space deployments due to radiation tolerance through scrubbing.

Training infrastructure (this work builds on):

autoresearch — Karpathy's autonomous research framework. The training script and model architecture used in all experiments.