What 11 years of backtests can and can't tell you

Anyone can produce a backtest that turns $10,000 into millions. You just try thousands of rules and keep the one that would have worked. It’s the trading equivalent of betting on a horse race that already finished. The result is real — and completely worthless, because the future is a different race.

So a backtest can tell you one genuinely useful thing: whether an idea was ever plausible. What it cannot tell you, on its own, is whether the idea will work tomorrow. Three habits narrow the gap between those two questions, and we hold to all three.

1. Hide the recent data from yourself

Before testing anything, we cut off the most recent stretch of history — for us, the last 18 months — and refuse to look at it while building or tuning. Only once an idea is finished do we let it run on that untouched window. If it worked on the years we studied but falls apart on the years we hid, it was memorizing the past, not finding a pattern. Most ideas die exactly here, and that’s the holdout doing its job.
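For readers who think in code, the holdout can be sketched in a few lines. This is a minimal illustration, not the pipeline behind this site: the function name `split_holdout`, the `(date, value)` row layout, and the calendar-month cutoff rule are all assumptions made for the example.

```python
from datetime import date


def split_holdout(rows, holdout_months=18):
    """Split date-sorted (date, value) rows into (development, holdout).

    Hypothetical helper: the name and the row layout are assumptions.
    """
    if not rows:
        return [], []
    last = rows[-1][0]
    # Step back `holdout_months` calendar months from the last date.
    months = last.year * 12 + (last.month - 1) - holdout_months
    cutoff_year, cutoff_month = divmod(months, 12)
    # Clamp the day to 28 so the cutoff date is valid in every month.
    cutoff = date(cutoff_year, cutoff_month + 1, min(last.day, 28))
    development = [r for r in rows if r[0] <= cutoff]
    holdout = [r for r in rows if r[0] > cutoff]
    return development, holdout
```

The point of the split is discipline rather than cleverness: everything in `holdout` stays unread until the idea is finished, and then it gets exactly one run.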

Walk-forward, in plain terms

We slice the 11 years into chunks, “train” on the early ones, and test on the later ones — then slide the window forward and do it again. A real edge keeps showing up in windows it never saw during tuning. A lucky one only looks good on the slice that birthed it. We require an idea to win across most windows, including the two most recent, before it counts.
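The sliding window can be sketched as below. Treat it as an illustration under stated assumptions: the window sizes, the function names, the reading of “most windows” as more than half, and “win” as a test score above zero are all choices made for the example, not a description of our exact rules.

```python
def walk_forward_windows(n_periods, train_size, test_size):
    """Yield (train, test) index ranges that slide forward by one test window."""
    start = 0
    while start + train_size + test_size <= n_periods:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size


def passes_walk_forward(test_scores, min_win_share=0.5, recent=2):
    """True if the idea wins in more than `min_win_share` of the windows
    and in every one of the `recent` most recent windows.

    Assumption: a window counts as a win when its test score is above zero.
    """
    wins = [score > 0 for score in test_scores]
    if len(wins) < recent:
        return False
    return sum(wins) / len(wins) > min_win_share and all(wins[-recent:])
```

With 11 yearly periods, `walk_forward_windows(11, 3, 1)` trains on three years, tests on the next one, then slides forward a year at a time. Each test window only ever contains data that came after its training window.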

2. Punish yourself for trying many things

Test 1,000 random rules against the data at the usual 5% significance threshold and roughly 50 will look “significant” by pure chance (1,000 × 0.05 = 50). That’s just how statistics works. If you go looking through thousands of patterns (we do), you are all but guaranteed to find impressive-looking flukes. So we apply a correction that raises the bar in proportion to how many things we tested. It’s the difference between “this beat the odds” and “this beat the odds after I rolled the dice ten thousand times.”

3. Count the events, not the years

This is the one that catches the most seductive mistakes. “Eleven years of data” sounds like a lot. But if a strategy only acts at major market bottoms, eleven years contains maybe three or four bottoms. A result built on four events isn’t a strategy — it’s a small story that happened to end well.

We learned this the expensive way. One idea — buying Bitcoin at deep capitulation — backtested at nearly double the bot’s return. We killed it, because it rested on four moments and one of them (the COVID crash) actually lost money. The full story is here; the lesson is that a giant number built on a tiny number of events is a mirage, no matter how many years it spans. Bitcoin has only lived through about three full market cycles. That, not the number of days, is the real limit on what we can know.
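Counting events rather than days is easy to do mechanically. Here is a minimal sketch, assuming signal days are given as integer day indices and that trigger days closer together than `min_gap_days` belong to the same episode. Both the function name and the 30-day gap are hypothetical choices for the example.

```python
def count_events(signal_days, min_gap_days=30):
    """Count distinct episodes in a list of trigger days.

    Consecutive trigger days less than `min_gap_days` apart are treated
    as one event (assumption: days are integer day indices).
    """
    events = 0
    previous = None
    for day in sorted(signal_days):
        if previous is None or day - previous >= min_gap_days:
            events += 1
        previous = day
    return events
```

A signal that fires on days 10, 11, 12, 400, 401 and 900 has six trigger days but only three events, and three is the sample size that matters.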

Why this site leads with the bad number

Most performance pages show you the big green return and bury the losses. We do the opposite: the worst drawdown is one of the first things on the scoreboard. That’s not modesty. The drawdown is the number that best tells you whether you could live with a strategy long enough to earn its returns. The upside is a story about the past. The downside is a promise about how bad the bad days get, and that promise tends to hold.
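For reference, the worst drawdown is the largest peak-to-trough fall in account value, measured as a fraction of the peak. A minimal sketch of that standard calculation follows; the function name and the list-of-balances input are assumptions for the example, and balances are assumed to be positive.

```python
def max_drawdown(equity):
    """Largest peak-to-trough decline as a fraction of the peak (0.25 means 25%)."""
    if not equity:
        raise ValueError("equity curve is empty")
    peak = equity[0]
    worst = 0.0
    for value in equity:
        # Track the highest balance seen so far.
        peak = max(peak, value)
        # Measure how far the current balance sits below that peak.
        worst = max(worst, (peak - value) / peak)
    return worst
```

An account that goes 100, 120, 90, 130, 117 has a worst drawdown of 25%: the fall from 120 to 90. The later dip from 130 to 117 is only 10%, so it doesn't change the headline number.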

The honest conclusion

We trust the process — hidden data, many-tries penalties, counting events — and we distrust any single month, including the good ones. A backtest can’t promise you the future. Done honestly, it can stop you from believing a lie about it. That’s the whole reason the bot on this site is so boring, and why we show you the dips first.