This post is the first half of a two-parter in which we pay the wages of ignorance for our univariate predictive models by building a now-casting model supported by automatic feature engineering.

Motivation for a different technique
Previously, we attempted to predict the month-on-month inflation change, but we failed. Upon inspecting our techniques, it becomes clear that we were bound to fail: they were univariate in nature and relied on learning a long-run pattern. By long-run pattern, I mean something akin to periodicity, but not as regular; the sequence itself contains the data that allows for the inference of the rules by which it is generated. This can be thought of as the target variable $Y_t$ at time $t$ being related to its past values $Y_{t-1},~Y_{t-2},\dots,Y_{0}$ via some function $Y_t = f_t(Y_{t-1},~Y_{t-2},\dots,Y_{0})$. Assuming this relation, one can set about learning the function $f_t$ from the realized observations $y_T,~y_{T-1},\dots,~y_{0}$. Note that the function is also time-dependent, as the relationship between past and present can evolve.

We have elaborated on some of the methods used to learn $f_t$, e.g., SARIMA, exponential smoothing, and Prophet, in a previous post.

The monthly inflation change data does not appear to exhibit this kind of learnable temporal structure. This post is about another technique based on a different paradigm: now-casting.

Now-casting
The word “now-casting” indicates prediction of the present: it precedes forecasting (of the future) and succeeds backcasting, a.k.a. retrodiction (of the past). The concept is that the target variable $Y_t$ at time $t$ is related to other variables $X_{1,t},~X_{2,t},\dots,~X_{m,t}$ by means of some function $Y_t~=~g_t(X_{1,t},~X_{2,t},\dots,~X_{m,t})$. Via this relation we can predict $Y_t$, provided that we have $g_t,~X_{1,t},~X_{2,t},\dots,~X_{m,t}$. The crux is that the variables $X_{1,t},~X_{2,t},\dots,~X_{m,t}$ may be known in advance of $Y_t$ even if they are all realized at the same time. The estimation is hence lateral in time rather than forward in time. Again, $g_t$ can be learned from observations:
$$
\left(\begin{array}{@{}cccc@{}}
x_{1,1} & x_{1,2} & \dots & x_{1,T} \\
x_{2,1} & x_{2,2} & \dots & x_{2,T} \\
\vdots & \vdots & \ddots & \vdots \\
x_{m,1} & x_{m,2} & \dots & x_{m,T}
\end{array}\right)
~,~
\left(\begin{array}{@{}c@{}}
y_{1} \\
y_{2} \\
\vdots \\
y_{T}
\end{array}\right)
$$
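As a minimal illustration of this layout, the sketch below arranges synthetic observations into such a table and fits a stand-in for $g_t$; a plain linear model is used here purely for demonstration (the actual model comes later), and all numbers are made up.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative dimensions: m contemporaneous predictors observed at T time points.
m, T = 4, 120
rng = np.random.default_rng(0)

# scikit-learn expects rows = time points, i.e. the transpose of the matrix above.
X = rng.normal(size=(T, m))                              # columns are X_1, ..., X_m
y = X @ rng.normal(size=m) + 0.1 * rng.normal(size=T)    # synthetic target Y_t

# Learn an approximation of g_t from the realized observations.
g_hat = LinearRegression().fit(X, y)

# "Lateral" estimation: predict Y at the latest time point from the X values,
# which may already be published even though Y itself is not.
print(g_hat.predict(X[-1:]))
```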

Thus, we are dealing with a different kind of problem: we have traded a problem of univariate data with an unknown temporal relation for a problem of multivariate data with an unknown relation between variables. Preferring one model over the other is largely motivated by the assumed dependence of the target variable, be it temporal or lateral in nature. Of course, the ultimate arbiter of truth is experiment; to this end, we look at the predictive performance of many models. Note that poor model performance does not support the alternative hypothesis, as failure can stem as much from execution as from the underlying modeling assumptions. Recall the model evaluation metrics we discussed previously (see https://themathmerc.com/?p=847).

Feature engineering
Having committed to the now-casting approach, we take a deeper look at the variables $X_{1},~X_{2},\dots,~X_{m}$. We seldom have explicit relationships between variables; in fact, a large part of the scientific enterprise consists of defining variables, measuring their values, and postulating/validating their relationships. That said, we do not have to start from scratch, as we can build on the work of others and rely on our innate understanding of the world. For example, we intuitively suspect that school attendance in Peru is uncorrelated with rent prices in Beijing, but we expect the fuel price to have a significant effect on the cost of goods, as goods require transport, which requires fuel.

In the case of the monthly inflation change, we choose:
- unemployment_rate
- initial_claims
- avg_hourly_earnings
- consumer_sentiment
- 10-year_breakeven_inflation_rate
- 5-year_breakeven_inflation_rate
- 5-year,5-year_forward_inflation_expectation_rate
- university_of_michigan_inflation_expectations_(1-year)
- weekly_retail_gasoline_price_(all_grades)
- monthly_retail_gasoline_price_(all_grades)
- ppi:_finished_goods
- ppi:_manufacturing_industries
- import_price_index_(all_commodities)
- export_price_index_(all_commodities)
- import_price_index_(fuels_&_lubricants)
- cpi:_rent_of_primary_residence_(sa)

Assuming one has selected a basket of variables that plausibly have a causal influence on our target variable, we can begin modeling. We use XGBoost, which we will not describe here. However, our work is not done, as XGBoost (and models in general) is sensitive to the quality and nature of its input, because the models themselves are also built on assumed relations between data and target variables. The process of transforming data into a form that improves model performance is called “feature engineering”. There are many such transformations to choose from, so many that we do not dare attempt the selection by hand and would rather have an algorithm do it for us.
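For concreteness, here is a minimal sketch of what fitting XGBoost on a table of contemporaneous features could look like. The file name, column names, and hyperparameters are placeholders, not the values used in this project.

```python
import pandas as pd
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Assumed layout: one row per month, engineered feature columns plus the target column "y".
df = pd.read_csv("features_monthly.csv", index_col="date", parse_dates=True)
X, y = df.drop(columns="y"), df["y"]

# Hold out the most recent months for evaluation; no shuffling for time-indexed data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

model = XGBRegressor(
    n_estimators=500,       # placeholder hyperparameters, to be tuned in practice
    max_depth=4,
    learning_rate=0.05,
    subsample=0.8,
)
model.fit(X_train, y_train)
print("hold-out MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```
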
Ancestry of features
The engineering process can be as simple as scaling the variables to the interval $[0,1]$, or as complex as a wavelet transformation for signal processing. Some feature engineering techniques are deep branches of mathematics unto themselves. We omit this depth here and only describe our implementation, which is somewhat arbitrary. We build the features sequentially in a hierarchy, with each level acting as the input for the next.
- At level 0, we anonymize the raw data, distinguishing only between target and input variables, so that we are not bogged down by subject-specific names and so that feature ancestry can be stored.
- At level 1, we winsorize, detrend, and normalize/scale.
- At level 2, we compute lag and rolling statistical features. Lags are computed for predetermined intervals; for rolling statistics, one must decide on both the statistic and the window width.
- At level 3, we compute Fourier and wavelet transforms. Note that these transformations can be univariate or multivariate; we keep it simple with the univariate versions for now.
- At level 4, we compute cluster labels and embeddings on a random sub-selection of features (a code sketch follows this list) using:
  - K-means clustering
  - Principal Component Analysis (PCA)
  - t-distributed Stochastic Neighbor Embedding (t-SNE)
  - Uniform Manifold Approximation and Projection (UMAP)
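As a sketch of what level 4 might look like, the snippet below derives cluster labels and a PCA embedding from a random sub-selection of feature columns using scikit-learn; t-SNE and UMAP follow the same pattern and are omitted here. The function name, subset size, and component counts are illustrative choices rather than our production settings.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def level4_features(features: pd.DataFrame, n_cols: int = 10, seed: int = 0) -> pd.DataFrame:
    """Cluster labels and a 2D PCA embedding computed on a random sub-selection of features."""
    rng = np.random.default_rng(seed)
    cols = rng.choice(features.columns.to_numpy(),
                      size=min(n_cols, features.shape[1]), replace=False)
    sub = features[cols].ffill().fillna(0.0)   # crude NaN handling for the sketch

    out = pd.DataFrame(index=features.index)
    # In the real pipeline these names would also encode the parent columns (see below).
    out["l4__kmeans_label"] = KMeans(n_clusters=4, n_init=10, random_state=seed).fit_predict(sub)
    embedding = PCA(n_components=2, random_state=seed).fit_transform(sub)
    out["l4__pca_1"] = embedding[:, 0]
    out["l4__pca_2"] = embedding[:, 1]
    return out
```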
A key trick here is that we name the resulting features in such a way as to reflect their ancestry. This will be useful later when we prune. Here is but a small snippet of the great feature tree we are growing:
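To illustrate the naming convention (the names below are made up for demonstration rather than taken from the actual tree), a small helper can append each transform's label to its parent's name, so that every feature carries its full lineage:

```python
import pandas as pd

def derive(series: pd.Series, transform_name: str, func) -> pd.Series:
    """Apply a transform and append its name to the feature's ancestry-encoding name."""
    out = func(series)
    out.name = f"{series.name}__{transform_name}"
    return out

# Hypothetical anonymized level-0 input.
x01 = pd.Series(range(100), name="x01", dtype=float)

# Level 1: winsorize, then z-score.
w = derive(x01, "winsor", lambda s: s.clip(s.quantile(0.01), s.quantile(0.99)))
z = derive(w, "zscore", lambda s: (s - s.mean()) / s.std())

# Level 2: lag and rolling statistics on the level-1 output.
lag3 = derive(z, "lag3", lambda s: s.shift(3))
roll = derive(z, "rollmean12", lambda s: s.rolling(12).mean())

print([f.name for f in (w, z, lag3, roll)])
# ['x01__winsor', 'x01__winsor__zscore',
#  'x01__winsor__zscore__lag3', 'x01__winsor__zscore__rollmean12']
```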

At each level, the number of features grows geometrically. This flood of features is undesirable, as it slows computation and introduces complexity. To combat this, we prune the features based on three ideas (a code sketch follows the list):
- Remove low-variance features: they are largely constant while the target changes value, so they are reasoned to add no predictive power.
- Remove information redundancy by selecting only one feature from a group of features that all carry the same information.
  - One approach is to look at correlation and mutual information metrics.
  - The other is to look at ancestry.
- Remove features that have low predictive power, as measured by the change in performance of simpler/cheaper models when the features are included/excluded. The crude models in question are:
  - a random forest model
  - ridge/logistic regression with permutation importance
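Below is a rough sketch of these three pruning passes using scikit-learn. The function name and thresholds are placeholders one would tune, and permutation importance under a ridge regression stands in for the second reference model; a numeric, NaN-free feature table is assumed.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.inspection import permutation_importance

def prune(features: pd.DataFrame, target: pd.Series,
          var_tol: float = 1e-3, corr_tol: float = 0.95, imp_tol: float = 0.0) -> pd.DataFrame:
    # 1) Drop near-constant features.
    keep = features.columns[VarianceThreshold(var_tol).fit(features).get_support()]
    X = features[keep]

    # 2) Drop redundant features: from each highly correlated pair, keep only the first.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    X = X.drop(columns=[c for c in upper.columns if (upper[c] > corr_tol).any()])

    # 3) Drop features with negligible importance under cheap reference models.
    rf_imp = pd.Series(RandomForestRegressor(n_estimators=100, random_state=0)
                       .fit(X, target).feature_importances_, index=X.columns)
    perm = permutation_importance(Ridge().fit(X, target), X, target,
                                  n_repeats=5, random_state=0)
    perm_imp = pd.Series(perm.importances_mean, index=X.columns)
    return X.loc[:, (rf_imp > imp_tol) | (perm_imp > imp_tol)]
```
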
Feature grid plots
We can now inspect the fruits of our labors. Not all features are shown; we display only a chosen sub-selection for pedagogical purposes. The plots that follow are grid plots with rows and columns indexed by the column names. The target variable “y” always occupies the left-most column and top-most row. The plots on the diagonal are Kernel Density Estimate (KDE) plots. Above the diagonal are scatter plots, with the overall color of each panel determined by the Pearson correlation coefficient of the pair. Below the diagonal, we show 2D-histogram plots, which convey the same information as the scatter plots but at lower resolution, making regions of high density more apparent.
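Such grid plots can be reproduced with seaborn's PairGrid. The sketch below assumes a DataFrame `df` containing the selected features together with a target column named "y"; the marker size, bin count, and colormaps are arbitrary choices.

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

def corr_scatter(x, y, **kwargs):
    """Scatter panel whose overall color encodes the Pearson correlation of the pair."""
    r = np.corrcoef(x, y)[0, 1]
    plt.scatter(x, y, s=10, color=plt.cm.coolwarm((r + 1) / 2))  # map r in [-1, 1] onto the colormap

def hist2d_panel(x, y, **kwargs):
    """2D histogram of the pair; extra kwargs injected by PairGrid are ignored."""
    plt.hist2d(x, y, bins=20, cmap="viridis")

def feature_grid(df):
    # Put the target first so "y" occupies the left/top-most column/row of the grid.
    df = df[["y"] + [c for c in df.columns if c != "y"]].dropna()
    grid = sns.PairGrid(df)
    grid.map_diag(sns.kdeplot)      # KDE plots on the diagonal
    grid.map_upper(corr_scatter)    # correlation-colored scatter above the diagonal
    grid.map_lower(hist2d_panel)    # 2D histograms below the diagonal
    return grid

# Usage:
# feature_grid(df)
# plt.show()
```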



At this point, we are going to call it a day. We have constructed a set of variables that relate to our target variable in a quantifiable way. This was the analysis phase of the project, where the raw ingredients are dissolved, sifted, and purified. Next week, we embark on the synthesis phase, where these ingredients are reconstituted into a predictive now-casting model and applied to real data.
If you find yourself or your organization drowning in data, variables, and models with no clear line from information to action to profit, consider hiring the MathMerc.