Evanston–Wilmette lakefront · Wilmette buoy 45174 · live
The next 168 hours
Lake Michigan surface temperature, forecast hour by hour to seven days. The dark line is the median of 34 simulated futures; the band around it is calibrated on nine seasons of out-of-sample misses, so it means what it says.
Hourly forecast · P5 / P25 / P50 / P75 / P95
Hover anywhere for the exact numbers at that hour. The faint plume is the forecast run across 34 weather futures (NOAA's 31-member GEFS ensemble plus the European, German, and Canadian models); the band, calibrated on nine seasons, widens only when those futures genuinely disagree. The inner band should hold the outcome half the time, the outer band 90 percent of the time.
In the water right now
--
Observed by the buoy, ten-minute cadence in season. Every forecast above starts from this reading: the model is pinned to reality at hour one and earns its keep from there.
Seven days at a glance
Skill by lead time
Left: median forecast error against "the water stays exactly as it is now." Right: how often the 90 percent (cyan) and 50 percent (yellow) bands actually contained the outcome on the held-out test; dashed lines mark perfect calibration. Bands run a touch narrow at two to three day leads and the 50 percent band degrades past day five: known, quantified, shown.
Median absolute error vs persistence (°F)
Band calibration: nominal vs observed coverage
Multi-season rolling backtest
Replayed over the test weeks
+24h forecasts replayed across the held-out window
Every hour of the held-out test window: what the +24h median (cyan) and 90 percent band said, against what the water actually did (dark line). The model never saw any of this during training.
Under the hood
The held-out test set examined three ways: where the +24h errors fall, which inputs the model leans on, and the full table of numbers behind every claim above.
Error structure at +24h
Residual distribution (°F)
Predicted vs actual (°F)
What the model leans on
Permutation importance, test set
Increase in error when a feature is shuffled (°F). The current water temperature and its recent lags anchor the level; forecast-window air temperature, wind, and solar move it. The fut_ features come from the weather forecast, the rest from the buoy itself.
Full validation table
All numbers in °F on the gap-separated test window. Bias is mean signed error. P90 |err| is the 90th percentile of absolute error, the bad-day number. Width is the average 90 percent band span.
Why it works: the driver study
Hover either panel. Weather during the forecast window (cyan) carries two to four times the signal of the buoy's own past (gray): air-water temperature gap, sustained wind (mixing), solar input, and the alongshore wind that drives upwelling. The longer the window, the stronger the correlation: it is sustained conditions, not snapshots, that move the water.
Signal lives in the forecast window (|corr| with next-24h change)
Sustained conditions beat snapshots (|corr| vs window length)
Anatomy of the uncertainty
Where this forecast's band comes from
The current band split into its two independent parts (90 percent half-widths, °F): the irreducible error the model carries even under perfect weather (gray, from the nine-season backtest) and the part from the four weather models disagreeing (cyan). They add in quadrature to the band on the fan chart above. Near term, almost all the uncertainty is irreducible; only further out does weather-model divergence start to matter. Hover for the split at any lead.
Methodology
Model
A horizon-conditioned gradient-boosted regressor spans +1 h to +168 h with lead time as a feature, trained on roughly 600k stacked (time, horizon) rows, then anchor-blended to the current observation with an 8-hour decay. At forecast time it is run 34 times: on all 31 members of NOAA's GEFS ensemble plus the ECMWF, ICON, and GEM deterministic runs. The member mean is the median; the band combines the model's pooled out-of-sample error with the spread of those 34 trajectories, so width tracks genuine weather uncertainty rather than a fixed inflation.
Data
NDBC 45174 hourly observations across ten seasons (2016 to present), ERA5 reanalysis as historical weather, GFS/HRRR blend as forecast weather, identical variables both sides. Features: water temp lags and trends, air-water gap, wind vectors and gusts, solar, seasonal and diurnal clocks, and forecast-window aggregates of each driver including dewpoint depression (evaporative cooling). An across-lake neighbor buoy, season degree-days, and wind-stress integrals were trialled and dropped: on the nine-season backtest they added variance without skill.
Validation
The last 35 days are an untouched test set; training stops 8 days earlier so no stacked row straddles the split. Persistence is reported at every lead. The published bands are conformally calibrated on the pooled residuals of a nine-season rolling backtest (100k+ genuinely out-of-sample pairs), so 90 percent coverage is a finite-sample guarantee rather than an observed habit. Caveat: training weather is reanalysis, so live skill at day five to seven inherits the weather forecast's own error.