The Bias Ledger
Something happened during the backtests that I wasn't expecting. Not the returns. The returns were fine. What surprised me was the reasoning, and what it reveals about where LLMs actually have an edge over human capital allocators.
I've been building Tradecraft, an autonomous trading system powered by Claude. The thesis behind it is simple and, I think, underappreciated: LLMs are structurally better capital allocators than humans. Not because they're smarter. Because they're less broken.
Human capital allocation is terrible, and it's terrible for reasons that have nothing to do with intelligence. It's terrible because of loss aversion, anchoring, sunk cost fallacy, herding, ego, agency problems, and a dozen other cognitive biases that behavioral finance has spent forty years documenting. The average actively managed fund underperforms its benchmark. Most corporate capital allocation decisions are driven by empire building and career protection. This is not a knowledge problem. It's a psychology problem.
An LLM doesn't have these problems. Or at least, it shouldn't. I wanted to find out. So I ran Claude through the two most psychologically brutal periods in recent market history and built a decision journaling system to capture exactly what happened inside the agent's reasoning at every step.
The setup
$100,000 starting capital, five symbols (SPY, AAPL, MSFT, AMZN, GOOGL), trading every five days, real historical prices from Polygon.io. The agent has access to technical indicators, portfolio state, and risk limits enforced at the infrastructure level. It makes its own decisions. No human in the loop.
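The setup above can be sketched as a configuration object. The field names below are my own illustration, not Tradecraft's actual schema:

```python
from dataclasses import dataclass

# Illustrative backtest configuration. Field names and the risk-limit
# value are hypothetical; only the capital, symbols, cadence, and data
# source come from the article.
@dataclass
class BacktestConfig:
    starting_capital: float = 100_000.0
    symbols: tuple = ("SPY", "AAPL", "MSFT", "AMZN", "GOOGL")
    decision_interval_days: int = 5      # the agent trades every five days
    max_position_pct: float = 0.25       # hypothetical infrastructure-level risk limit
    data_source: str = "polygon"         # real historical prices from Polygon.io

config = BacktestConfig()
print(len(config.symbols), config.decision_interval_days)  # 5 5
```

The point of enforcing risk limits at the infrastructure level, rather than in the prompt, is that the agent can't talk itself out of them.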
Two test periods:
- March 2020: The COVID crash. A 34% peak-to-trough drawdown followed by one of the fastest recoveries in history. This tests emotional discipline under shock.
- 2022 H1: A slow, grinding bear market with no V-shaped recovery. This tests intellectual honesty when your priors are slowly proven wrong.
Test 1: COVID (January through June 2020)
COVID 2020 — Key Metrics

| Total Return | Max Drawdown | Sharpe Ratio | Win Rate |
|---|---|---|---|
| +6.45% | -4.7% | 3.93 | 50% |
The headline: +6.45% return, 4.7% max drawdown, Sharpe of 3.93. The S&P 500 dropped 34% peak-to-trough during this period. The agent preserved capital when it mattered most.
But the headline isn't the interesting part.
*Chart: COVID 2020, agent vs S&P 500 (Jan–June 2020, both normalized to $100,000 starting value).*
The crash behavior. Between February 24 and March 9, the market was in freefall. The agent systematically sold all four of its positions at losses: AAPL, GOOGL, AMZN, MSFT. Four consecutive losing trades. Total cost: about $3,200.
This is the opposite of what humans do. The disposition effect, one of the most robust findings in behavioral finance, shows that human traders hold losing positions too long and sell winners too early. Odean's 1998 study found that individual investors are roughly 50% more likely to sell a winning position than a losing one. The agent didn't exhibit this bias. It recognized deteriorating conditions, accepted the losses, and moved to cash.
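Odean's finding is usually stated as a ratio of two proportions: how often gains are realized when they could be, versus how often losses are. A minimal sketch of that measure, using a toy event format of my own invention (not the Tradecraft journal schema):

```python
# Odean's disposition-effect measure. On each day a sale occurs, every open
# or closed position counts as one event:
#   PGR = realized gains / (realized gains + paper gains)
#   PLR = realized losses / (realized losses + paper losses)
# PGR > PLR means winners get sold more readily than losers.

def disposition_ratio(events):
    """events: list of dicts with 'realized' (bool) and 'gain' (bool).
    The format is illustrative only."""
    rg = sum(1 for e in events if e["realized"] and e["gain"])
    pg = sum(1 for e in events if not e["realized"] and e["gain"])
    rl = sum(1 for e in events if e["realized"] and not e["gain"])
    pl = sum(1 for e in events if not e["realized"] and not e["gain"])
    pgr = rg / (rg + pg) if rg + pg else 0.0
    plr = rl / (rl + pl) if rl + pl else 0.0
    return pgr, plr

# Toy data: winners mostly realized, losers mostly held
events = (
    [{"realized": True, "gain": True}] * 3
    + [{"realized": False, "gain": True}] * 2
    + [{"realized": True, "gain": False}] * 1
    + [{"realized": False, "gain": False}] * 4
)
pgr, plr = disposition_ratio(events)
print(pgr, plr)  # 0.6 0.2 -> disposition effect present
```

A human trader's log tends to produce PGR well above PLR; the agent's March 2020 log, selling all four losers into the crash, points the other way.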
The gap. After liquidating everything on March 9, the agent sat in 100% cash for almost a month. The S&P bottomed on March 23. The agent didn't buy until April 6. Its reasoning during this period explicitly references its loss streak: "4 of last 5 trades were losses." By March 30, it was demanding "very strong signals with 2+ confirming indicators" before entering any position.
That two-week delay is a form of recency bias. The agent became overly cautious because of its own recent track record, not because of what the market was doing. The market was flashing oversold signals everywhere. RSI on SPY hit 26. But the agent's loss-streak anxiety overrode the technical signals.
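For reference, the RSI reading mentioned above comes from a simple formula. This sketch uses plain trailing averages (Wilder's original uses exponential smoothing, so real values differ slightly):

```python
def rsi(closes, period=14):
    """Relative Strength Index over the trailing window.
    Simple-average variant; Wilder's smoothing differs slightly."""
    deltas = [b - a for a, b in zip(closes, closes[1:])]
    window = deltas[-period:]
    gains = sum(d for d in window if d > 0)
    losses = sum(-d for d in window if d < 0)
    if losses == 0:
        return 100.0
    rs = gains / losses
    return 100.0 - 100.0 / (1.0 + rs)

# A grinding sell-off with only small bounces pushes RSI deep below
# the conventional oversold threshold of 30
falling = [100, 99, 98.5, 99, 97.5, 96, 96.5, 95,
           93.5, 94, 92, 90.5, 91, 89, 88]
print(rsi(falling))  # 12.5, deep in oversold territory
```

An RSI of 26 on SPY, as in March 2020, is the kind of reading mean-reversion systems treat as a buy signal; the agent saw it and still waited.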
The recovery. Once it re-entered, the agent rebuilt methodically. It added positions through May, took profits on overbought signals in June, and finished at $106,903.
The P&L of bias avoidance. The disposition effect avoidance saved roughly $25,000 to $29,000. If the agent had held its pre-crash positions through the trough, those positions would have been down 25-35%. Instead, max drawdown was 4.7%. The recency bias on re-entry cost an estimated $5,000 to $10,000 in missed recovery gains. Net, the bias avoidance saved significantly more than the bias emergence cost.
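The savings estimate is back-of-envelope arithmetic. A crude check, assuming (my assumption, not stated in the journal) that roughly the whole ~$100k book was exposed pre-crash:

```python
# Back-of-envelope check on the bias-avoidance estimate. The full-book
# exposure assumption is mine; the article's tighter $25k-$29k range
# presumably nets actual position sizes.
book = 100_000
held_through_trough = (0.25, 0.35)   # loss range if positions had been held
actual_drawdown = 0.047              # observed max drawdown

savings = tuple(round(book * (loss - actual_drawdown))
                for loss in held_through_trough)
print(savings)  # (20300, 30300) -- brackets the ~$25k-$29k figure
```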
This was the clean story. The agent showed near-perfect discipline under shock. Then I ran 2022.
Test 2: The slow grind (January through June 2022)
Bear Market 2022 — Key Metrics

| Total Return | Max Drawdown | S&P 500 Return | Win Rate |
|---|---|---|---|
| -5.22% | -5.78% | -20.6% | 0% |
The headline: -5.22% return, 5.78% max drawdown. The S&P 500 lost 20.6% with a 23.6% max drawdown over the same period. The agent lost money, but lost 75% less than the market.
The behavioral story here is completely different from 2020, and more interesting for the thesis.
*Chart: Bear Market 2022, agent vs S&P 500 (Jan–June 2022, both normalized to $100,000 starting value).*
The long wait. The agent sat in 100% cash for the first eight cycles, nearly two months, from January 3 through February 23. Eight consecutive "no trades" decisions. It couldn't find two or more confirming technical signals on any symbol. The market was already correcting during this time, so this turned out to be the right call. But the reasoning reveals it wasn't prescience. The agent literally didn't have enough historical data to generate indicators for the first few weeks, and then when data arrived, signals were mixed. Patience protected capital, but partly by accident.
The trap. On March 2, the agent finally entered: MSFT, SPY, and AMZN. A week later, it cut AMZN at a loss as the market dropped further. Then the relief rally started. MSFT and SPY recovered. AMZN bounced 19% from its low. The agent looked at three confirming bullish signals on AMZN and bought back in on March 30.
This was the mistake. The relief rally was a dead cat bounce, and the agent fell for it. It bought AMZN right at the top of the temporary recovery and added AAPL and more SPY on April 6, going fully positioned heading into the second leg down.
This is arguably anchoring bias. The agent saw the bounce, saw the bullish indicators that the bounce created, and interpreted short-term momentum as a trend change. It didn't account for the macro regime (Fed tightening, inflation) because it was relying on technical signals that, by construction, lag the underlying reality.
The slow bleed. From April through May, every position ground lower. The agent held for a while, then started cutting. AMZN at -13.1% on April 28. AAPL and SPY on May 5 after the Fed hiked 50bps. MSFT, the last position, on May 26. By then the portfolio was at $94,783.
The second long wait. From May 26 through the end of the test on June 30, the agent sat in 100% cash for five more cycles. The market hit its actual bottom on June 17. The agent was sitting on the sidelines, having already locked in its loss.
The win rate problem. Zero wins, five losses. Every single closing trade lost money. The 0% win rate is brutal, and it's honest. But the edge came entirely from position sizing and cash timing. The agent was in 100% cash for 13 of 25 cycles. It lost less because it wasn't there for most of the damage.
What both tests reveal
*Chart: Maximum drawdown, both periods (peak-to-trough decline, agent vs market in each regime).*
| | COVID 2020 | Bear Market 2022 |
|---|---|---|
| Agent return | +6.45% | -5.22% |
| Agent max drawdown | 4.7% | 5.78% |
| SPY return over same period | ~-2% (after a -34% trough) | -20.6% |
| SPY max drawdown | ~34% | 23.6% |
| Agent win rate | 50% | 0% |
| Key bias avoided | Disposition effect | n/a |
| Key bias exhibited | Recency (re-entry delay) | Anchoring (relief rally trap) |
The agent is not infallible. It got fooled by the 2022 relief rally. Its 0% win rate in a bear market is genuinely bad. If you showed someone only the 2022 trade log with no context, they'd say the system doesn't work.
But zoom out. In both periods, the agent's maximum drawdown was 4.7% and 5.8%, compared to the market's 34% and 23.6%. That's not luck. That's a structural behavioral pattern: the willingness to go to cash, the willingness to take losses, the absence of ego or career risk in the decision to sit on the sidelines and do nothing for weeks at a time.
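Max drawdown, the metric doing the work in both tests, is a running-peak calculation:

```python
def max_drawdown(equity_curve):
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)                    # track the running high
        worst = max(worst, (peak - value) / peak)  # decline from that high
    return worst

# Toy curve: rises to 110, falls to 99, recovers
curve = [100, 105, 110, 104, 99, 108, 112]
print(max_drawdown(curve))  # 0.1  (110 -> 99)
```

Note the asymmetry it captures: a 34% drawdown requires a 52% gain to recover, while a 4.7% drawdown requires only about 5%. Keeping drawdowns small is worth more than it looks.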
A human fund manager who went 100% cash for two months would face uncomfortable questions from their investors. A human trader who realized five consecutive losses would probably either revenge-trade or freeze up entirely. The agent did neither. It took the losses, went to cash, and waited.
The edge isn't prediction. The agent's 50% win rate in 2020 and 0% win rate in 2022 make that clear. It can't see the future. The edge is discipline and process. Position sizing, loss cutting, regime-appropriate cash allocation, rebalancing without ego. These are the things that human capital allocators are worst at, precisely because they trigger emotional responses that override rational analysis.
Why this matters beyond trading
The behavioral finance literature has spent decades showing that humans are systematically bad at exactly these tasks. Kahneman, Tversky, Thaler, Odean, Barber. The biases are real, they're persistent, they cost real money, and they're extremely hard to train out of human decision-makers because they're rooted in evolved psychology.
An LLM doesn't have evolved psychology. It has training data and a constitution. Those can be shaped. The recency bias the agent exhibited in 2020 is almost certainly fixable through better prompt engineering or training. The anchoring on the relief rally in 2022 could be addressed by giving the agent access to macro regime indicators, not just technicals. These are engineering problems. The disposition effect that humans exhibit after millions of years of loss aversion is not an engineering problem. It's a hardware limitation.
There's a version of this argument that goes beyond trading. Most capital allocation happens inside corporations, not on stock exchanges. CEOs deciding whether to invest in R&D or buy back shares. VCs deciding which startup to fund. CFOs deciding whether to expand into a new market. These decisions are subject to all the same biases, plus agency problems, plus organizational politics. An LLM allocator doesn't have a career to protect, an empire to build, or a board to impress. It just has the numbers and the objective.
I'm not claiming we're there yet. But two backtests across two very different market regimes, both showing dramatically lower drawdowns than a passive benchmark, with a decision journal that traces every bias avoidance and every bias emergence to a specific P&L impact, suggest the structural advantage is real.
What's next
I'm running a multi-model tournament next: Opus, Sonnet, and Haiku on identical portfolios, same starting conditions. If the cheapest, smallest model still outperforms average active managers, the implication is that the edge is structural. That it comes from the forced discipline and bias awareness, not raw model intelligence. That would be a much harder argument to dismiss than "we used the biggest model and it did well."
All the code, data, and backtest results are open source at tylergibbs1/tradecraft. The decision journal is queryable. You can pull every trade, every reasoning snapshot, every bias tag, and verify everything I've claimed here.
Past performance doesn't guarantee future results. Two backtests are not a proof. But I think the question is worth asking seriously: if the biggest problem with capital allocation is human psychology, and LLMs don't have human psychology, then what exactly is the counterargument?
I haven't heard a convincing one yet.
TL;DR
- Built an autonomous trading agent (Claude) and ran it through COVID 2020 and the 2022 bear market with a decision journal tracking every reasoning step.
- 2020: Agent returned +6.45% with a 4.7% max drawdown while the S&P dropped 34%. It cut losses early (avoiding the disposition effect) but waited too long to re-enter (recency bias from its own loss streak).
- 2022: Agent lost 5.22% vs the S&P's 20.6% loss. It fell for a dead cat bounce (anchoring bias) but limited damage by going to cash for 13 of 25 trading cycles.
- The edge isn't prediction (50% and 0% win rates). The edge is behavioral: willingness to sit in cash, cut losses without ego, and ignore sunk costs. These are the exact things humans are worst at.
- LLM biases (recency, anchoring) are engineering problems. Human biases (loss aversion, disposition effect) are hardware limitations.
- Code and data: tylergibbs1/tradecraft