Machine learning explainer

How a computer learns the lake

Buoycast predicts the water temperature off the Evanston and Wilmette shoreline for the next seven days, with an honest range around every number. Here is how it actually works, one idea at a time. Everything below is live and runs in your browser.

01 · The question

Predict the next 168 hours, and say how sure you are

Every hour the Wilmette buoy reports the water temperature. The job is to forecast what it will read for each of the next seven days, and to give not just a single guess but a shaded band that is narrow when the next hours are knowable and wide when they are not.

A forecast without a band is a guess pretending to be a fact. The band is the honest part.

the forecast fan, drawing out from now

02 · The data

Ten years of the lake, hour by hour

The model has studied roughly 600,000 examples: every past hour paired with what the water actually did next, alongside the weather at the time. From these it learns the lake's habits. Calm warm days nudge the surface up; a hard north wind overturns the water and the temperature can fall several degrees overnight.

No rules are written by hand. The patterns are discovered from the record.

roughly 600,000 training examples
seasons spanning 2016 to 2026
46 inputs per example
131,339 out-of-sample checks

numbers behind the model

03 · Turning history into examples

The model never sees a summer, only rows

A computer cannot study a season the way a person remembers one. What it studies is a long stack of rows. Each row is a single frozen moment described by 46 numbers (the recent water temperatures, the wind, the sun, the air temperature, the time of year, and how far ahead we are asking), set next to the one answer that moment eventually produced: what the water actually read that far in the future.

Ten years of hourly readings become roughly 600,000 of these question-and-answer rows. Learning, in this setting, means nothing more mysterious than finding rules that turn the 46 numbers on the left into the answer on the right, again and again, across every row at once.
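As a rough illustration of the row-building idea, here is a minimal sketch that cuts a short hourly series into question-and-answer rows. The function name, the three lags, and the two horizons are assumptions chosen for brevity; the real rows carry 46 inputs, including weather, which this sketch leaves out.

```python
def make_rows(temps, lags=(1, 2, 3), horizons=(1, 24)):
    """Cut an hourly series into (inputs, answer) rows.

    temps: hourly water temperatures, oldest first.
    lags, horizons: illustrative choices, not buoycast's real 46 inputs.
    """
    rows = []
    for t in range(max(lags), len(temps)):
        for h in horizons:
            if t + h >= len(temps):
                continue  # the answer to this question is not in the record
            # Inputs: the reading now, a few earlier readings, and how far ahead we ask.
            inputs = [temps[t]] + [temps[t - k] for k in lags] + [h]
            rows.append((inputs, temps[t + h]))
    return rows


# Two days of made-up hourly readings.
temps = [60.0 + 0.1 * i for i in range(48)]
rows = make_rows(temps)
print(len(rows), "rows from", len(temps), "hourly readings")
```

In this sketch one moment yields one row per horizon, so the stack of rows grows faster than the stack of hours.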

one moment in, one answer out, repeated across the record

04 · What one tree does

A single yes-or-no question, and two averages

Before stacking hundreds of trees, it helps to watch exactly one. A decision tree at its simplest asks a single question of a single input, something like “is the wind above 7 m/s?”, and then gives one flat average answer for everything on the yes side and another for everything on the no side. To pick the question, it sweeps every possible threshold, measures how much error each one leaves behind, and keeps the split that fits best.

One question is a blunt instrument, and that is exactly the point. A single tree can only ever draw one step. The power comes from stacking corrections, hundreds of these blunt trees in a row, each fixing what the last one got wrong, which is the next section.
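The threshold sweep can be sketched in a few lines. This is a minimal one-split tree (often called a stump) under simple assumptions: one input, squared error as the measure of fit, and candidate thresholds halfway between neighboring observed values. The wind and temperature-change numbers are made up for illustration.

```python
def fit_stump(xs, ys):
    """Find the one threshold on x that leaves the least squared error.

    Needs at least two distinct x values. Returns (threshold, no_avg, yes_avg),
    where yes_avg is the flat answer for rows with x above the threshold.
    """
    best = None
    values = sorted(set(xs))
    # Candidate thresholds sit halfway between neighboring observed values.
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        no_side = [y for x, y in zip(xs, ys) if x <= thr]
        yes_side = [y for x, y in zip(xs, ys) if x > thr]
        no_avg = sum(no_side) / len(no_side)
        yes_avg = sum(yes_side) / len(yes_side)
        # Error left behind if each side is answered with its own average.
        sse = sum((y - no_avg) ** 2 for y in no_side)
        sse += sum((y - yes_avg) ** 2 for y in yes_side)
        if best is None or sse < best[0]:
            best = (sse, thr, no_avg, yes_avg)
    return best[1:]


# Made-up data: wind speed in m/s, overnight temperature change in degrees.
wind = [2, 3, 5, 6, 8, 9, 11, 12]
change = [0.4, 0.5, 0.3, 0.4, -1.8, -2.1, -2.0, -1.9]
threshold, calm_avg, windy_avg = fit_stump(wind, change)
print(threshold, round(calm_avg, 2), round(windy_avg, 2))  # 7.0 0.4 -1.95
```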

sweeping the threshold to find the split that fits best

05 · The engine

Hundreds of tiny corrections, stacked

The model is not one clever formula. It is a stack of hundreds of tiny decision trees, built one after another. The first makes a crude guess. The second learns only to fix the first one's mistakes. The third fixes what is still wrong, and so on.

Each tree is simple and almost useless alone. Stacked, they bend to the true shape of the data. This is gradient boosting: watch the staircase chase the curve.
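Here is a minimal sketch of that stacking on a toy curve, assuming squared error, one-split trees, and a learning rate of 0.3. None of these settings come from buoycast, and the sine curve stands in for real data.

```python
import math


def fit_stump(xs, ys):
    """One yes-or-no split on x that leaves the least squared error."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [y for x, y in zip(xs, ys) if x <= thr]
        right = [y for x, y in zip(xs, ys) if x > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((y - left_mean) ** 2 for y in left)
        sse += sum((y - right_mean) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_trees, learning_rate=0.3):
    """Start from the average, then let each stump fit what is still wrong."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    for _ in range(n_trees):
        # The next tree never sees the target, only the current mistakes.
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, left, right = fit_stump(xs, residuals)
        preds = [p + learning_rate * (left if x <= thr else right)
                 for x, p in zip(xs, preds)]
    return preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [i / 10 for i in range(60)]
ys = [math.sin(x) for x in xs]
for n in (0, 1, 10, 100):
    print(n, "trees, squared error", round(mse(ys, boost(xs, ys, n)), 4))
```

The error on the training curve shrinks as trees are added. Whether it also shrinks on unseen data is a separate question, which is why the held-out tests exist.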

06 · The bake-off

Four ways to learn the same lake

We do not take one algorithm's word for it. The same training data is handed to four different learners: ridge regression, which draws the straightest defensible line; Bayesian ridge, which keeps uncertainty about its own coefficients; a random forest, hundreds of decorrelated trees averaged; and gradient boosting, the staircase from the previous section. Each predicts the held-out month, and each gets a percentage weight tuned on data none of them trained on.

The weights are learned on held-out data none of the four trained on, not assigned by us, which is why the two linear models landed at exactly 0 percent. A straight line cannot bend around the lake's nonlinear swings, so ridge scored 1.29 degrees on the unseen test against boosting's 1.12, and with 600,000 training rows the Bayesian prior washes out entirely, which is why Bayesian ridge ties plain ridge to the fourth decimal. The optimizer handing both of them 0 percent is the system being honest about which learners earned their place, not a snub.

With the real numbers in hand, the tuned weights chose forest 45 percent and boosting 55 percent, yet that mix beat boosting alone by only 0.01 degrees, within noise, and a plain 50/50 average of forest and boosting actually looked better on that validation window. Tuning the weights had memorized the validation month, the same trap as section 11, so we vetted the average the hard way: replayed across nine past seasons against rules written down in advance, it won only 4 of 9 and its early lead vanished.

Nothing shipped. Boosting alone kept its crown, and this section is the receipt that we checked.
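To make "weights tuned on held-out data" concrete, here is a minimal sketch for two learners only. The grid search, the function names, and every number are assumptions for illustration; the source does not say which optimizer buoycast used.

```python
def mae(truth, preds):
    """Mean absolute error, in degrees."""
    return sum(abs(t - p) for t, p in zip(truth, preds)) / len(truth)


def blend(w, forest, boosting):
    """Weighted average: w on the forest, the rest on boosting."""
    return [w * f + (1 - w) * b for f, b in zip(forest, boosting)]


def tune_weight(truth, forest, boosting, steps=20):
    """Try weights 0.00, 0.05, ..., 1.00 and keep the one with the lowest error."""
    grid = [i / steps for i in range(steps + 1)]
    return min(grid, key=lambda w: mae(truth, blend(w, forest, boosting)))


# Made-up validation window: what happened, and what each learner predicted.
truth = [61.0, 62.5, 60.2, 58.9, 59.4, 63.1]
forest = [61.6, 62.0, 60.9, 59.5, 58.8, 62.4]
boosting = [60.7, 62.9, 59.8, 58.1, 59.9, 63.8]

w = tune_weight(truth, forest, boosting)
print("forest weight:", w)
print("tuned blend error:", round(mae(truth, blend(w, forest, boosting)), 3))
print("boosting alone:   ", round(mae(truth, boosting), 3))
```

By construction the tuned weight can never look worse than boosting alone on the window it was tuned on, so a small win there proves little. A fair verdict needs data the tuning never touched.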

four learners fit the same dots · learned weights · the composite on top

07 · The band

It predicts a range, not a number

Instead of one temperature, the model is trained five separate times to predict the 5th, 25th, 50th, 75th and 95th percentiles of what could happen. The middle one is the headline forecast; the outer ones are the edges of the band.

So the band is not decoration bolted on afterward. The model is asked, directly, “where is the cold tail, where is the warm tail?”
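One common way to ask a model for a percentile rather than an average is the pinball loss, which charges unequal prices for guessing too high and too low. The source does not say which loss buoycast trains with, so treat this as a sketch of the general idea. To keep it short, the "model" here is a single constant guess chosen from a list of candidates.

```python
def pinball_loss(actual, guess, q):
    """Penalty for one guess at quantile q (0 < q < 1).

    Guessing too low costs q per degree; guessing too high costs 1 - q per degree.
    """
    diff = actual - guess
    return q * diff if diff >= 0 else (q - 1) * diff


def best_constant(outcomes, q, candidates):
    """The single guess with the lowest total pinball loss."""
    return min(candidates,
               key=lambda c: sum(pinball_loss(y, c, q) for y in outcomes))


# Made-up outcomes: the whole numbers 1 through 99.
outcomes = list(range(1, 100))
levels = (0.05, 0.25, 0.50, 0.75, 0.95)
fences = [best_constant(outcomes, q, outcomes) for q in levels]
print(fences)  # [5, 25, 50, 75, 95]
```

Minimizing the same loss with five different values of q pulls the answer toward five different percentiles, which is how one recipe can yield both the headline and the edges of the band.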

We do not take the band the model draws on faith. We collect 131,339 real out-of-sample misses from nine replayed seasons, sort them smallest to largest, and read off the fences that contain 90 percent of them, plus a small safety margin the math requires (statisticians call this conformal calibration). The result is a guarantee of at least 90 percent coverage rather than a habit we happened to observe, resting on one honest assumption: that future misses behave like nine seasons of past ones.
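The sort-and-read-off step can be sketched as below. This is a minimal split-conformal calculation under one simplifying assumption, that misses are treated as absolute sizes and the fence is symmetric; buoycast's exact recipe may differ. The safety margin is the `n + 1` in the rank.

```python
import math


def conformal_fence(misses, coverage=0.90):
    """Return the miss size that bounds at least `coverage` of future misses,
    assuming future misses behave like the ones collected here."""
    n = len(misses)
    # Using n + 1 instead of n is the finite-sample safety margin.
    # round() guards against float noise such as 18.900000000000002.
    rank = math.ceil(round((n + 1) * coverage, 9))
    if rank > n:
        raise ValueError("too few misses to guarantee this coverage")
    return sorted(misses)[rank - 1]


# Made-up absolute misses, in degrees, deliberately out of order.
misses = [7, 3, 19, 12, 1, 16, 9, 20, 5, 14, 2, 18, 10, 6, 13, 4, 17, 8, 15, 11]
print(conformal_fence(misses))  # 19
```

With 20 misses a plain 90th percentile would stop at the 18th smallest; the margin pushes it to the 19th. With more than a hundred thousand misses the margin becomes vanishingly small, but it is what turns an observed habit into a guarantee.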

each dot is one outcome · the bands should catch them at the right rate

08 · Anchoring

Always start from the reading on the buoy right now

A model that ignored the current temperature would be silly. So the whole trajectory is slid to match the live buoy, and that correction fades over the first day: the next hour is almost certain, next week far less so.

It is the difference between “the water is usually 64° this week” and “it is 65.2° right now, so start there.”
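A minimal sketch of the slide-and-fade idea follows. The linear fade to zero at 24 hours is an assumption made here for simplicity; the source says only that the correction fades over the first day, not what shape the fade takes.

```python
def anchor_to_buoy(raw_forecast, live_reading, fade_hours=24):
    """Slide the forecast so hour 0 matches the buoy, fading the shift out.

    raw_forecast: model temperatures by hours ahead, index 0 meaning now.
    fade_hours: hypothetical setting; the shift is gone by this many hours ahead.
    """
    offset = live_reading - raw_forecast[0]
    anchored = []
    for hours_ahead, raw in enumerate(raw_forecast):
        # Full correction now, shrinking in a straight line to none.
        weight = max(0.0, 1.0 - hours_ahead / fade_hours)
        anchored.append(raw + offset * weight)
    return anchored


# The model says 64 degrees all week; the buoy reads 65.2 right now.
raw = [64.0] * 49
anchored = anchor_to_buoy(raw, live_reading=65.2)
print(round(anchored[0], 2), round(anchored[12], 2), round(anchored[24], 2))  # 65.2 64.6 64.0
```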

raw model (dashed) snapping to the live reading (dot)

09 · The ensemble

Thirty-four futures vote on the band

The biggest unknown is not the lake, it is the weather. So the forecast is run 34 times: once on each of the 31 members of NOAA's GEFS ensemble, the same atmosphere model started from 31 slightly different versions of right now, because tiny differences today grow into different weather by Friday. The European, German and Canadian models join them for diversity. When the futures agree, the band is tight. When they split over a coming front, the band widens to match the genuine doubt.
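One simple way disagreement among members can widen a band is sketched below: take the lowest low and the highest high across all members, with the median of the middles as the headline. This envelope rule is an assumption chosen because it is easy to read, not a description of how buoycast actually pools its 34 runs.

```python
from statistics import median


def combine_members(members):
    """Pool one forecast hour across weather futures.

    members: list of (low, mid, high) temperatures, one tuple per weather future.
    Returns (band_low, headline, band_high) using a simple envelope rule.
    """
    lows, mids, highs = zip(*members)
    return min(lows), median(mids), max(highs)


# Made-up members: 34 futures that agree, then 34 that split over a front.
agree = [(63.0, 64.0, 65.0)] * 34
split = [(63.0, 64.0, 65.0)] * 17 + [(60.0, 61.0, 62.0)] * 17

print(combine_members(agree))  # (63.0, 64.0, 65.0)
print(combine_members(split))  # (60.0, 62.5, 65.0)
```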

Drag the slider: as the members disagree more, the shaded band grows.

weather-model disagreement: low

10 · The whole machine

From raw readings to a banded forecast, end to end

Every hour this whole assembly runs once, top to bottom, with nothing hidden. What follows is the entire path from the raw instruments to the shaded band: four data sources, the real 46 inputs, five quantile models replayed across 34 weather futures, and the single banded forecast that falls out the far side.

reading the instruments

11 · Memorizing is not learning

A perfect score on the past can be worthless

Any model can look flawless on history by simply memorizing it, noise and all. The wiggly curve below threads exactly through every observation it was shown, hitting each one dead on, and for a moment it appears perfect. Then new readings arrive that it never saw, and it lurches wildly between them, because it learned the dots rather than the shape behind them.

The smooth curve does the opposite. It accepts small errors on the past in exchange for staying close to the truth on data it has not seen. Everything about how buoycast is trained and tested, the held-out window, the gap, the nine replayed seasons, exists to catch memorization before it can masquerade as skill.

fits the past perfectly, fails on new data, versus learns the shape

12 · The proof

Tested on seasons it never saw

It is easy to fit the past. The honest test is to train on the past and predict a stretch the model has never seen, then score it. Buoycast replays nine past seasons this way, 131,000 forecast-and-check pairs, always with a gap so no answer leaks into training.

It beats the naive “tomorrow equals today” baseline from about a day and a half out, and by day seven it is nearly twice as accurate.
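The train, gap, predict, slide loop can be sketched as an index generator. The window sizes, the 168-hour gap, and the choice to let the training stretch grow from the start of the record are all assumptions for illustration; the source confirms only that a gap always separates training from testing.

```python
def walk_forward_splits(n_hours, first_train_hours, gap_hours, test_hours):
    """Yield (train, test) index ranges that never touch.

    Training always starts at hour 0 and grows; the test window slides forward.
    The gap keeps answers near the boundary from leaking into training.
    """
    train_end = first_train_hours
    while train_end + gap_hours + test_hours <= n_hours:
        test_start = train_end + gap_hours
        yield range(0, train_end), range(test_start, test_start + test_hours)
        train_end += test_hours  # slide forward by one test window


# Hypothetical sizes: 1,000 hours of record, a 168-hour gap, 100-hour tests.
splits = list(walk_forward_splits(1000, 500, 168, 100))
for train, test in splits:
    print("train 0 to", train.stop - 1, "| test", test.start, "to", test.stop - 1)
```

A gap as long as the forecast horizon is a natural choice, because a seven-day-ahead answer that falls inside the test window would otherwise sit in a training row.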

train on the filled stretch · predict the unseen window · score · slide forward

13 · The scoreboard

So how good is it, honestly?

Twelve sections explain how it works. This one says how well, on those 131,339 replayed forecasts. Inside a day, the model is no better than assuming the water stays exactly where it is. That is not failure, it is physics: the lake has enormous thermal inertia, and the forecast already starts pinned to the live buoy. There is simply little left to predict an hour ahead.

The value shows up from two days out, where knowing the coming weather starts to matter, and grows from there: by day seven the model is nearly twice as accurate as the do-nothing baseline. One more honest wrinkle: most replayed seasons are placid autumns where doing nothing is genuinely hard to beat, yet in the volatile spring 2026 window the model won even at one day ahead, 0.89 versus 1.30 degrees. It earns its keep exactly when the water is about to do something.

lead        model error    "no change" error    edge
+1 hour     0.09°F         0.09°F               tie
+6 hours    0.43°F         0.41°F               tie
+1 day      0.82°F         0.79°F               tie
+2 days     1.05°F         1.21°F               +13%
+3 days     1.20°F         1.62°F               +26%
+5 days     1.48°F         2.36°F               +37%
+7 days     1.62°F         3.03°F               +47%

mean absolute error, nine held-out seasons · fences conformal with a finite-sample margin, so 90% means at least 90%
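The edge column can be reproduced from the two error columns. In this sketch the edge is taken to be the share of the baseline's error that the model removes, which matches the printed percentages. The first three leads are labeled a tie in the scoreboard; the source does not give the cutoff for that call, so the sketch covers only the four leads with a stated percentage.

```python
def edge_percent(model_error, baseline_error):
    """Share of the "no change" error that the model removes, in percent."""
    return 100 * (1 - model_error / baseline_error)


# (lead, model error, "no change" error), in degrees F, from the scoreboard.
scoreboard = [
    ("+2 days", 1.05, 1.21),
    ("+3 days", 1.20, 1.62),
    ("+5 days", 1.48, 2.36),
    ("+7 days", 1.62, 3.03),
]

for lead, model, baseline in scoreboard:
    print(lead, f"+{round(edge_percent(model, baseline))}%")

# How many times more accurate at day seven: baseline error over model error.
day_seven_ratio = 3.03 / 1.62
print(round(day_seven_ratio, 2))  # 1.87
```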

Where it stands

The standard is the same for everything here: prove it across nine seasons or it does not ship.

Shipped

What made it in

The 34-member weather ensemble that drives the band: 31 GEFS futures plus the European, German, and Canadian national models. And conformal calibration of the band itself, so 90 percent coverage is a finite-sample guarantee rather than a habit we hope holds.

live in the forecast today

Tested and rejected

What did not survive

Tuned and equal regression blends, offshore and onshore wind features, a neighbor buoy, season degree-days, wind-stress memory. Then the big three: NASA satellite temperature maps of the whole lake, NOAA's 3D lake physics model interrogated at this exact buoy, and Chicago's nearshore beach sensors. The satellite and physics features genuinely helped in violent seasons, including a tenth of a degree in the worst bust year on record here, but they consistently taxed the calm seasons, and a seed-stability check confirmed the tax was real. Rules written in advance said that fails. It failed.

honest negative results

On the bench

What waits for the same trial

A 14-day horizon: the weather feeds reach 16 days and a 51-member European ensemble is free, so the only question is whether the bands stay honest that far out. And a sharper idea from the rejections above: use the satellite and physics signals only when the lake is in a volatile regime, since that is precisely where they earned their keep.

promising, not yet proven