Scoring and Logging

j.andries.j.steenkamp


What value does information have if it is never acted upon? In this post, we apply the theory built up so far to make a PolyMarket bet (yes, with skin in the game). In particular, we bet on the July Inflation – Monthly market, which concerns the month-on-month percentage change in the seasonally adjusted Consumer Price Index for All Urban Consumers (CPI-U) published by the Bureau of Labor Statistics (BLS). The bet is on how much US inflation will increase going from July to August. Cutting to the chase: I won the bet by blind luck. This is not a flex; it is a post-mortem on what goes into a consistent prediction system.

The post is more meta than usual, as the focus is not on a prediction nor even a prediction technique, but rather the framework in which predictions are made and judged.

Before pulling the trigger, you pick a target; the same holds for forecasting. I decided to focus on economic variables, as there is a flood of data on the topic and some theory explaining the mechanism behind it. In contradistinction, one-off cultural phenomena (e.g., Kanye West getting arrested in 2025?) are hard to place, and I see no way to get an edge over the market. To compound the difficulty, small probabilities (< 1%) are counterintuitive. Unfortunately, there are no clear canonical rules for choosing a domain of prediction, so go with what interests you and be ready to pivot if you fail too often.

Assuming you have a niche, the next question is whether your predictions are accurate. The answer only comes from following a system in which you control as many variables as you can and log in enough detail to replicate past decisions. To those not of the scientific denomination, the process may seem unnecessary and stifling. However, when the dust settles after applying an arsenal of ML and statistical techniques (or, heaven forbid, pure unaided gut feel) and reaching a prediction, you have to know what you did: in case you hit upon a winning strategy and wish to replicate the result, or in case you lost it all and want to stop stepping on the same rake.

Thus, logging choices, decisions, and consequences is essential: it is the antidote to hindsight bias and self-delusion. I spent the last four weeks building code infrastructure, including logging features, to force myself into tracking my bad choices. Granted, the dividends will only come if I keep logging predictions and finally find out what works.

Unless the events you predict are so numerous and homogeneous that you can calculate the outcome probabilities frequentist-style, it will take a while to gather enough data to tell whether you are winning out of skill or out of luck.

Given that you have scraped together enough observations, how do you score your predictions? The most basic approach is to count your wins, e.g., 5/10. The next level is how much you won: omitting the bookie’s vigorish, your winnings are inversely proportional to the implied probability of the outcome. Numbers aside, money is why we bet, but learning by losing is expensive, so let’s revisit the numbers with some more math.
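To make the inverse-proportionality concrete, here is a minimal sketch (my illustration, ignoring the vig and any particular exchange’s fee structure) of how a binary-market payout scales:

```python
def payout(stake: float, implied_prob: float, won: bool) -> float:
    """Gross payout on a binary bet, ignoring the bookie's vigorish.

    A share bought at price `implied_prob` pays out 1 unit if the
    outcome occurs, so winnings scale with 1 / implied_prob.
    """
    return stake / implied_prob if won else 0.0

# A $10 stake on a 25%-implied outcome returns $40 gross if it hits.
print(payout(10, 0.25, True))   # 40.0
print(payout(10, 0.25, False))  # 0.0
```

The further the market’s implied probability is below 100%, the more a correct bet pays, which is exactly why rare-event payouts look so tempting.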

The Brier score is a metric for prediction accuracy, introduced by Glenn W. Brier in 1950 and first used by meteorologists. It works as follows.

$$ BS := \frac{1}{N} \sum_{t=1}^N (f_t - o_t)^2 $$ where

  • $N$ is the number of events,
  • $f_t \in [0,1]$ is the forecasted probability of event $t$,
  • $o_t \in \{0, 1\}$ is the outcome of event $t$ (true=1, false=0).

The ideal forecaster would have a Brier score of zero, because they make every prediction correctly with 100% confidence. Conversely, the worst possible score is one, obtained when every prediction is wrong and made with absolute conviction. Playing the middle, going 50-50 on every bet, nets the forecaster a score of 0.25.
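The formula is simple enough to sketch directly; this is my own illustration, not part of any library, and it reproduces the three reference points above:

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    n = len(forecasts)
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / n

# A perfect forecaster scores 0, a maximally wrong one scores 1,
# and hedging at 50% on everything scores 0.25.
print(brier_score([1.0, 0.0], [1, 0]))  # 0.0
print(brier_score([1.0, 0.0], [0, 1]))  # 1.0
print(brier_score([0.5, 0.5], [1, 0]))  # 0.25
```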

Logarithmic score is another metric, defined as follows
$$
LS := \frac{1}{N} \sum_{t=1}^N o_t \cdot \log(f_t) + (1 - o_t) \cdot \log(1 - f_t)
$$
Here, the scores range from negative infinity (worst) to zero (best); a wrong prediction made with 100% confidence contributes $\log(0) = -\infty$, so the log score punishes overconfident wrong predictions very harshly. Where the Brier score has a geometric flavor, the log score has an information-theoretic flavor.

Regardless of which one you choose, both are “proper” scoring methods in the sense that they can’t be gamed: the only way to optimize your expected score is to report the probabilities you actually believe. Both extend straightforwardly to multiple outcomes and can be combined into a single aggregate score.
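“Proper” has a precise meaning: if the outcome is truly 1 with probability $q$, your expected score is optimized by reporting $q$ itself. A small sketch under that assumption (the true probability $q = 0.7$ is an arbitrary choice for illustration), grid-searching the report that minimizes the expected Brier loss:

```python
def expected_brier(reported: float, true_p: float) -> float:
    """Expected Brier score when the outcome is 1 with probability true_p
    and we report probability `reported`."""
    return true_p * (reported - 1) ** 2 + (1 - true_p) * reported ** 2

q = 0.7  # hypothetical true probability of the event
grid = [i / 100 for i in range(101)]
best = min(grid, key=lambda p: expected_brier(p, q))
print(best)  # 0.7 -- honesty is the optimal policy
```

Running the same search against a non-proper score (say, raw win-counting) would reward rounding every report to 0 or 1, which is exactly the gaming that proper scores rule out.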

Back to US inflation data: notice that we are taking the month-on-month change, which is a lot harder to predict precisely than overall inflation. Inflation as a whole, as we all know, tends upwards (because someone keeps printing money to buy things they can’t afford), but month to month the change looks largely random.

After some shallow digging, I read that univariate methods (like the ones I currently use) have little hope of picking out patterns in the month-on-month inflation change. Apparently, “nowcasting” methods that exploit good exogenous data show the most promise.

Humans in general (and I am no exception) are subject to hindsight bias, and thus prone to recast past decisions in light of information that was not available at the time of judgment. To combat this tendency, I use public publishing as a means of:

  1. social motivation and accountability, and
  2. a log of my process for making a prediction.

Even if no one reads this, I act differently knowing that this information is accessible to others.

As always, if you have a problem that you would like to have “mathed”, consider hiring the MathMerc.