Machine learning explainer
How a computer learns the lake
Buoycast predicts the water temperature off the Evanston and Wilmette shoreline for the next seven days, with an honest range around every number. Here is how it actually works, one idea at a time. Everything below is live and runs in your browser.
Predict the next 168 hours, and say how sure you are
Every hour the Wilmette buoy reports the water temperature. The job is to forecast what it will read over the next seven days: not a single guess, but a shaded band that is narrow when the coming hours are knowable and wide when they are not.
A forecast without a band is a guess pretending to be a fact. The band is the honest part.
the forecast fan, drawing out from now
Ten years of the lake, hour by hour
The model has studied roughly 600,000 examples: every past hour paired with what the water actually did next, alongside the weather at the time. From these it learns the lake's habits. Calm warm days nudge the surface up; a hard north wind overturns the water and the temperature can fall several degrees overnight.
No rules are written by hand. The patterns are discovered from the record.
numbers behind the model
The model never sees a summer, only rows
A computer cannot study a season the way a person remembers one. What it studies is a long stack of rows. Each row is a single frozen moment described by 46 numbers (the recent water temperatures, the wind, the sun, the air temperature, the time of year, and how far ahead we are asking), set next to the one answer that moment eventually produced: what the water actually read that far in the future.
Ten years of hourly readings becomes roughly 600,000 of these question-and-answer rows. Learning, in this setting, means nothing more mysterious than finding rules that turn the 46 numbers on the left into the answer on the right, again and again, across every row at once.
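To make the row idea concrete, here is a minimal Python sketch that turns a short run of hourly readings into question-and-answer rows. It is an illustration, not Buoycast's code: the function name `make_rows`, the choice of three lagged readings, and the toy temperatures are all assumptions, and the real rows carry 46 inputs rather than the 4 used here.

```python
def make_rows(temps, leads, n_lags=3):
    """Turn an hourly series into (inputs, answer) rows, one per (hour, lead) pair."""
    rows = []
    for t in range(n_lags - 1, len(temps)):
        recent = temps[t - n_lags + 1 : t + 1]  # the last n_lags readings, oldest first
        for lead in leads:
            if t + lead < len(temps):  # only keep rows whose answer is already known
                inputs = list(recent) + [lead]  # "how far ahead we are asking" is an input too
                answer = temps[t + lead]        # what the water actually read that far ahead
                rows.append((inputs, answer))
    return rows


# Toy hourly readings in degrees F (made up for illustration).
temps = [60.0, 60.5, 61.0, 61.2, 61.1, 60.8]
rows = make_rows(temps, leads=[1, 2])
for inputs, answer in rows:
    print(inputs, "->", answer)
```

In this sketch each hour yields one row per lead time, so six readings produce five usable rows.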
one moment in, one answer out, repeated across the record
A single yes-or-no question, and two averages
Before stacking hundreds of trees, it helps to watch exactly one. A decision tree at its simplest asks a single question of a single input, something like “is the wind above 7 m/s?”, and then gives one flat average answer for everything on the yes side and another for everything on the no side. To pick the question, it sweeps every possible threshold, measures how much error each one leaves behind, and keeps the split that fits best.
One question is a blunt instrument, and that is exactly the point. A single tree of this kind can only ever draw one step. The power comes from stacking corrections: hundreds of these blunt trees in a row, each fixing what the last one got wrong. That is the next section.
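The threshold sweep can be sketched in a few lines of Python. This is a toy version under stated assumptions: `fit_stump` is a hypothetical name, the error measure is squared error, and the wind speeds and overnight temperature changes are invented to make the split easy to see.

```python
def fit_stump(xs, ys):
    """Sweep every threshold on one input and keep the split with the least squared error."""
    values = sorted(set(xs))
    if len(values) < 2:
        raise ValueError("need at least two distinct input values to split")
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2  # candidate question: "is x above threshold?"
        no_side = [y for x, y in zip(xs, ys) if x <= threshold]
        yes_side = [y for x, y in zip(xs, ys) if x > threshold]
        no_avg = sum(no_side) / len(no_side)
        yes_avg = sum(yes_side) / len(yes_side)
        # Error left behind when each side is replaced by its flat average.
        error = sum((y - no_avg) ** 2 for y in no_side)
        error += sum((y - yes_avg) ** 2 for y in yes_side)
        if best is None or error < best[0]:
            best = (error, threshold, no_avg, yes_avg)
    _, threshold, no_avg, yes_avg = best
    return threshold, no_avg, yes_avg


# Invented data: wind speed in m/s and the overnight temperature change in degrees.
wind_ms = [2, 3, 5, 6, 8, 9, 11, 12]
overnight_change = [0.4, 0.5, 0.3, 0.4, -2.0, -2.2, -1.9, -2.1]
threshold, calm_avg, windy_avg = fit_stump(wind_ms, overnight_change)
print(threshold, round(calm_avg, 2), round(windy_avg, 2))
```

On this toy data the sweep settles on the question "is the wind above 7 m/s?", with one average for the calm side and another for the windy side.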
sweeping the threshold to find the split that fits best
A thousand tiny corrections, stacked
The model is not one clever formula. It is a stack of hundreds of tiny decision trees, built one after another. The first makes a crude guess. The second learns only to fix the first one's mistakes. The third fixes what is still wrong, and so on.
Each tree is simple and almost useless alone. Stacked, they bend to the true shape of the data. This is gradient boosting: watch the staircase chase the curve.
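The staircase can be reproduced with one-question trees and nothing else. The sketch below is a bare-bones gradient boosting loop for squared error, written from scratch; the names `fit_stump` and `boost`, the learning rate of 0.3, and the sine-shaped target are assumptions chosen for illustration, not Buoycast's settings.

```python
import math


def fit_stump(xs, residuals):
    """Best single split on xs (sorted, distinct) for the current residuals."""
    best = None
    for i in range(1, len(xs)):
        threshold = (xs[i - 1] + xs[i]) / 2
        left, right = residuals[:i], residuals[i:]
        left_avg = sum(left) / len(left)
        right_avg = sum(right) / len(right)
        error = sum((r - left_avg) ** 2 for r in left)
        error += sum((r - right_avg) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_avg, right_avg)
    return best[1:]


def boost(xs, ys, n_trees=100, learning_rate=0.3):
    """Stack stumps, each fitted only to what the stack so far still gets wrong."""
    preds = [sum(ys) / len(ys)] * len(ys)  # tree zero: one crude guess, the average
    trees = []
    errors = [sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)]
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]  # the remaining mistakes
        threshold, left_avg, right_avg = fit_stump(xs, residuals)
        trees.append((threshold, left_avg, right_avg))
        # Add a fraction of the new tree's correction to the running prediction.
        preds = [
            p + learning_rate * (left_avg if x <= threshold else right_avg)
            for x, p in zip(xs, preds)
        ]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return trees, errors


xs = [i / 10 for i in range(30)]
ys = [math.sin(2 * x) for x in xs]  # a smooth curve for the staircase to chase
trees, errors = boost(xs, ys)
print("trees:", len(trees), "error before:", errors[0], "error after:", errors[-1])
```

Because each stump is a least-squares fit to the residuals and only a fraction of it is added, the training error in this sketch can never go up from one tree to the next.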
Four ways to learn the same lake
We do not take one algorithm's word for it. The same training data is handed to four different learners: ridge regression, which draws the straightest defensible line; Bayesian ridge, which keeps uncertainty about its own coefficients; a random forest, hundreds of decorrelated trees averaged; and gradient boosting, the staircase from the previous section. Each predicts the held-out month, and each gets a percentage weight.
The weights are learned on held-out data none of the four trained on, not assigned by us, which is why the two linear models landed at exactly 0 percent. A straight line cannot bend around the lake's nonlinear swings, so ridge scored 1.29 degrees of error on the unseen test against boosting's 1.12. With 600,000 training rows the Bayesian prior washes out entirely, which is why Bayesian ridge ties plain ridge to the fourth decimal. The optimizer handing both of them 0 percent is not a snub; it is the system being honest about which learners earned their place.
With the real numbers in hand, the tuned weights chose forest 45 percent and boosting 55 percent, yet that mix beat boosting alone by only 0.01 degrees, within noise, and a plain 50/50 average of forest and boosting actually looked better on that validation window. The tuned weights had memorized the validation month, the same trap as section 11, so we vetted the average the hard way: replayed across nine past seasons against rules written down in advance, it won only 4 of 9 and its early lead vanished. Nothing shipped. Boosting alone kept its crown, and this section is the receipt that we checked.
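For readers who want to see what "learning the weights" can look like, here is a minimal sketch that searches a coarse grid of blends and keeps the one with the lowest mean absolute error on a validation set. The source does not say which optimizer Buoycast uses, so the grid search, the name `tune_weights`, and the three toy learners are all assumptions.

```python
from itertools import product


def mean_abs_error(pred, truth):
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def tune_weights(preds, truth, steps=4):
    """Try every blend whose weights are multiples of 1/steps and sum to 1."""
    k = len(preds)
    best_weights, best_error = None, float("inf")
    for combo in product(range(steps + 1), repeat=k - 1):
        last = steps - sum(combo)
        if last < 0:
            continue  # weights would exceed 100 percent in total
        weights = [c / steps for c in combo] + [last / steps]
        blend = [
            sum(w * p[i] for w, p in zip(weights, preds))
            for i in range(len(truth))
        ]
        error = mean_abs_error(blend, truth)
        if error < best_error:
            best_weights, best_error = weights, error
    return best_weights, best_error


# Toy validation data: one learner runs warm, one runs cold, one is far off.
truth = [60.0, 61.0, 62.0, 63.0]
warm = [61.0, 62.0, 63.0, 64.0]
cold = [59.0, 60.0, 61.0, 62.0]
far_off = [65.0, 66.0, 67.0, 68.0]
weights, error = tune_weights([warm, cold, far_off], truth)
print(weights, error)
```

A learner that does not help ends up with a weight of zero without anyone deciding that by hand. The catch is the one this section describes: weights fitted to a single validation window can memorize that window, so a blend still has to prove itself on data the tuning never touched.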
four learners fit the same dots · learned weights · the composite on top
It predicts a range, not a number
Instead of one temperature, the model is trained five separate times to predict the 5th, 25th, 50th, 75th and 95th percentiles of what could happen. The middle one is the headline forecast; the outer ones are the edges of the band.
So the band is not decoration bolted on afterward. The model is asked, directly, “where is the cold tail, where is the warm tail?”
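One common way to ask a model for a percentile rather than an average is the pinball (quantile) loss, which charges different prices for guessing too low and too high. The source does not name the loss Buoycast trains with, so treat the choice here as an assumption; `pinball_loss` and the toy outcomes are hypothetical. The sketch shows that the single guess with the lowest pinball loss lands on the requested percentile.

```python
def pinball_loss(outcomes, guess, q):
    """Average pinball loss of one constant guess for quantile level q (0 < q < 1)."""
    total = 0.0
    for y in outcomes:
        miss = y - guess
        # Guessing too low costs q per degree; guessing too high costs (1 - q).
        total += q * miss if miss >= 0 else (q - 1) * miss
    return total / len(outcomes)


# Toy outcomes: the values 1.0, 2.0, ..., 99.0.
outcomes = [float(v) for v in range(1, 100)]
levels = [0.05, 0.25, 0.50, 0.75, 0.95]
best_guess = {
    q: min(outcomes, key=lambda g: pinball_loss(outcomes, g, q))
    for q in levels
}
print(best_guess)
```

Training one model per level with this kind of loss gives five curves, with the 50th percentile in the middle and the 5th and 95th at the edges.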
We do not take the band the model draws on faith. We collect 131,339 real out-of-sample misses from nine replayed seasons, sort them smallest to largest, and read off the fences that contain 90 percent of them, plus a small safety margin the math requires (statisticians call this conformal calibration). The result is a guarantee of at least 90 percent coverage rather than a habit we happened to observe, resting on one honest assumption: that future misses behave like nine seasons of past ones.
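The sort-and-read-off step can be sketched as split conformal calibration. This version is a simplification under stated assumptions: it uses one symmetric margin on the absolute miss, whereas the text speaks of fences in the plural, and the name `conformal_margin` and the 50 toy misses are invented. The `n + 1` in the rank is the small safety margin the math requires.

```python
import math


def conformal_margin(misses, coverage=0.90):
    """Margin that would have contained the requested share of past misses."""
    n = len(misses)
    rank = math.ceil((n + 1) * coverage)  # the +1 is the finite-sample safety margin
    if rank > n:
        raise ValueError("too few misses to support this coverage level")
    return sorted(abs(m) for m in misses)[rank - 1]


# Toy misses (forecast minus truth), alternating in sign: 0.1, -0.2, 0.3, ...
misses = [(i / 10) * (1 if i % 2 else -1) for i in range(1, 51)]
margin = conformal_margin(misses)
covered = sum(abs(m) <= margin for m in misses) / len(misses)
print(margin, covered)

# The band for a new forecast is then the forecast plus and minus the margin.
forecast = 64.0
band = (forecast - margin, forecast + margin)
```

The guarantee carries over to new forecasts only under the assumption the text names: that future misses behave like the past ones used for calibration.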
each dot is one outcome · the bands should catch them at the right rate
Always start from the reading on the buoy right now
A model that ignored the current temperature would be silly. So the whole trajectory is slid to match the live buoy, and that correction fades over the first day: the next hour is almost certain, next week far less so.
It is the difference between “the water is usually 64° this week” and “it is 65.2° right now, so start there.”
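A minimal sketch of the slide-and-fade idea follows. The text only says the correction fades over the first day, so the straight-line fade to zero at 24 hours is an assumption, as are the name `anchor_to_buoy` and the flat 64 degree raw forecast.

```python
def anchor_to_buoy(raw_forecast, live_reading, fade_hours=24):
    """Slide an hourly trajectory to start at the live reading, fading the shift to zero."""
    offset = live_reading - raw_forecast[0]  # raw_forecast[0] is the model's value for now
    anchored = []
    for hour, temp in enumerate(raw_forecast):
        weight = max(0.0, 1.0 - hour / fade_hours)  # 1 at hour 0, 0 from fade_hours onward
        anchored.append(temp + weight * offset)
    return anchored


raw = [64.0] * 48  # the model's unanchored view: 64 degrees throughout
anchored = anchor_to_buoy(raw, live_reading=65.2)
print(anchored[0], anchored[12], anchored[24])
```

The first hour inherits the full correction, hour 12 half of it, and from hour 24 on the forecast is the model's own.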
raw model (dashed) snapping to the live reading (dot)
Thirty-four futures vote on the band
The biggest unknown is not the lake, it is the weather. So the forecast is run 34 times: once on each of the 31 members of NOAA's GEFS ensemble, the same atmosphere model started from 31 slightly different versions of right now, because tiny differences today grow into different weather by Friday. The European, German and Canadian models join them for diversity. When the futures agree, the band is tight. When they split over a coming front, the band widens to match the genuine doubt.
Drag the slider: as the members disagree more, the shaded band grows.
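How the 34 runs are merged is not spelled out here, so the sketch below shows just one simple possibility: for a single forecast hour, take the lowest low edge, the median of the middles, and the highest high edge. The name `pool_futures`, the envelope rule, and the toy numbers are assumptions for illustration only.

```python
from statistics import median


def pool_futures(futures):
    """futures: one (low, mid, high) tuple per weather member, for a single hour."""
    lows, mids, highs = zip(*futures)
    return min(lows), median(mids), max(highs)


# All 34 futures agree: the pooled band is as tight as any single one.
agree = [(63.0, 64.0, 65.0)] * 34
# Half the futures bring a cold front through: the pooled band widens.
split = [(63.0, 64.0, 65.0)] * 17 + [(59.0, 60.0, 61.0)] * 17

for name, futures in [("agree", agree), ("split", split)]:
    low, mid, high = pool_futures(futures)
    print(name, low, mid, high, "width:", high - low)
```

Whatever the exact pooling rule, the behavior to look for is the same: agreement keeps the band tight, and a split over a coming front widens it.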
From raw readings to a banded forecast, end to end
Every hour this whole assembly runs once, top to bottom, with nothing hidden. What follows is the entire path from the raw instruments to the shaded band: four data sources, the real 46 inputs, five quantile models replayed across 34 weather futures, and the single banded forecast that falls out the far side.
A perfect score on the past can be worthless
Any model can look flawless on history by simply memorizing it, noise and all. The wiggly curve below threads exactly through every observation it was shown, hitting each one dead on, and for a moment it appears perfect. Then new readings arrive that it never saw, and it lurches wildly between them, because it learned the dots rather than the shape behind them.
The smooth curve does the opposite. It accepts small errors on the past in exchange for staying close to the truth on data it has not seen. Everything about how Buoycast is trained and tested (the held-out window, the gap, the nine replayed seasons) exists to catch memorization before it can masquerade as skill.
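The two curves can be reproduced on toy data. In this sketch the wiggly model is the polynomial that passes exactly through every training point (Lagrange interpolation) and the smooth model is a least-squares straight line; the function names, the straight-line truth, and the alternating plus and minus one degree of noise are all assumptions made for the demonstration.

```python
def lagrange_fit(xs, ys):
    """The polynomial that passes exactly through every (x, y) it was shown."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return predict


def line_fit(xs, ys):
    """Least-squares straight line: accepts small misses on the training points."""
    n = len(xs)
    x_avg, y_avg = sum(xs) / n, sum(ys) / n
    slope = sum((x - x_avg) * (y - y_avg) for x, y in zip(xs, ys))
    slope /= sum((x - x_avg) ** 2 for x in xs)
    intercept = y_avg - slope * x_avg
    return lambda x: intercept + slope * x


def mean_abs_error(model, xs, ys):
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)


def truth(x):
    return 2.0 * x  # the shape behind the dots


# Training dots: the truth plus alternating noise of +1 and -1.
train_x = [float(i) for i in range(10)]
train_y = [truth(x) + (1.0 if i % 2 == 0 else -1.0) for i, x in enumerate(train_x)]
# New readings the models never saw, halfway between the training dots.
test_x = [i + 0.5 for i in range(9)]
test_y = [truth(x) for x in test_x]

wiggly, smooth = lagrange_fit(train_x, train_y), line_fit(train_x, train_y)
wiggly_train = mean_abs_error(wiggly, train_x, train_y)
wiggly_test = mean_abs_error(wiggly, test_x, test_y)
smooth_train = mean_abs_error(smooth, train_x, train_y)
smooth_test = mean_abs_error(smooth, test_x, test_y)
print("wiggly:", wiggly_train, wiggly_test)
print("smooth:", smooth_train, smooth_test)
```

The wiggly model scores a perfect zero on the dots it memorized and misses badly between them, while the line misses every training dot slightly and stays close on the new readings.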
fits the past perfectly, fails on new data, versus learns the shape
Tested on seasons it never saw
It is easy to fit the past. The honest test is to train on the past and predict a stretch the model has never seen, then score it. Buoycast replays nine past seasons this way, 131,339 forecast-and-check pairs in all, always with a gap so no answer leaks into training.
It beats the naive “tomorrow equals today” baseline from about a day and a half out, and by day seven it is nearly twice as accurate.
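The train, gap, predict, slide loop can be sketched as a generator of index ranges. The details here are assumptions: the name `walk_forward_splits`, the choice to let the training stretch grow from the start of the record, and the toy sizes. What matters is that every test window begins a full gap after its training stretch ends.

```python
def walk_forward_splits(n_hours, first_train_hours, gap_hours, test_hours):
    """Yield (train, test) index ranges that slide forward, separated by a gap."""
    splits = []
    train_end = first_train_hours
    while train_end + gap_hours + test_hours <= n_hours:
        test_start = train_end + gap_hours  # skip the gap so no answer leaks into training
        splits.append((range(0, train_end), range(test_start, test_start + test_hours)))
        train_end += test_hours  # slide forward by one test window
    return splits


splits = walk_forward_splits(n_hours=100, first_train_hours=40, gap_hours=7, test_hours=10)
for train, test in splits:
    print(f"train 0..{train.stop - 1}, predict {test.start}..{test.stop - 1}")
```

Each pair is scored on its unseen window only, and the scores from all windows are pooled at the end.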
train on the filled stretch · predict the unseen window · score · slide forward
So how good is it, honestly?
Twelve sections explain how it works. This one says how well, on those 131,339 replayed forecasts. Inside a day, the model is no better than assuming the water stays exactly where it is. That is not failure, it is physics: the lake has enormous thermal inertia, and the forecast already starts pinned to the live buoy. There is simply little left to predict an hour ahead.
The value shows up from two days out, where knowing the coming weather starts to matter, and grows from there: by day seven the model is nearly twice as accurate as the do-nothing baseline. One more honest wrinkle: most replayed seasons are placid autumns where doing nothing is genuinely hard to beat, yet in the volatile spring 2026 window the model won even at one day ahead, 0.89 versus 1.30 degrees. It earns its keep exactly when the water is about to do something.
| lead | model error | "no change" error | edge (error reduction) |
|---|---|---|---|
| +1 hour | 0.09°F | 0.09°F | tie |
| +6 hours | 0.43°F | 0.41°F | tie |
| +1 day | 0.82°F | 0.79°F | tie |
| +2 days | 1.05°F | 1.21°F | +13% |
| +3 days | 1.20°F | 1.62°F | +26% |
| +5 days | 1.48°F | 2.36°F | +37% |
| +7 days | 1.62°F | 3.03°F | +47% |
mean absolute error, nine held-out seasons · fences conformal with a finite-sample margin, so 90% means at least 90%
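The edge column can be recomputed from the two error columns. The sketch below assumes the edge is the share of the "no change" error that the model removes, which reproduces the percentages in the table; `edge_percent` is a hypothetical helper name.

```python
# (lead, model error, "no change" error), copied from the table above, in degrees F.
table = [
    ("+1 hour", 0.09, 0.09),
    ("+6 hours", 0.43, 0.41),
    ("+1 day", 0.82, 0.79),
    ("+2 days", 1.05, 1.21),
    ("+3 days", 1.20, 1.62),
    ("+5 days", 1.48, 2.36),
    ("+7 days", 1.62, 3.03),
]


def edge_percent(model_error, baseline_error):
    """Share of the baseline's error that the model removes, as a whole percent."""
    return round(100 * (1 - model_error / baseline_error))


edges = {lead: edge_percent(model, baseline) for lead, model, baseline in table}
for lead, edge in edges.items():
    print(lead, f"{edge:+d}%")

# How many times smaller the model's error is at day seven.
day_seven_ratio = 3.03 / 1.62
```

The first three rows come out within a few percent of zero, which the table reports as a tie, and the day-seven ratio of about 1.9 is what "nearly twice as accurate" refers to.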
Where it stands
The standard is the same for everything here: prove it across nine seasons or it does not ship.