
Feature engineering is the process of transforming data so that the result improves analysis or predictive estimation. This post is a field log entry of my attempt to engineer features from FRED data in the hope of improving the now-casting XGBoost regressor.
We cover the following:
- Some signal analysis-based features
- A notion of feature depth
- Some quality metrics for features
- XGBoost performance at various feature depths
Limitations of My Method
Due to model dependencies, there is a risk of missing good features because of poor modeling choices.
It is an inevitable fact of life that, as our methods grow in complexity, so too does the room for error. Thus, a feature’s power is always contextualized to a target variable via a model, and quantified by model performance with respect to an objective.

Signal-based Features
Information is sent by perturbing a medium over time, e.g., pressure waves in air (sound), electromagnetic waves (light), electric pulses in a circuit (electronics). Due to either device imperfections or physical limitations, transmissions are distorted. Received information is classified as signal or noise depending on the intention of the sender and receiver. Signal analysis is the study of encoding and extracting signals from perturbations, often represented as time-indexed amplitude data. Often, the signal exhibits a periodic nature. This could have something to do with us living on a planet that orbits a star (one orbit per year), which is also spinning on its own axis (one revolution per day).

Fourier Transform in Action
A core result of signal analysis is the Fourier transform. It expresses a signal as a unique weighted sum of sinusoids, thereby representing a time-varying intensity as a superposition of static waves. You can also think of it as a change of basis, from a measurement at each point in time, to a measurement of each frequency’s contribution. For our purposes, it is simply a tool for looking at the same data in a different way, with the intention that the transformed data reveals a pattern.
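
To make this concrete, here is a minimal sketch in NumPy (not the post’s actual code) of the two Fourier-based features that appear in the feature list further down, Fourier entropy and Fourier centroid, computed from a single window of data. The demeaning step and the small epsilon are illustrative choices.

```python
# Minimal sketch: spectral entropy and centroid of a 1-D window via the real FFT.
import numpy as np

def fourier_features(x: np.ndarray) -> dict:
    spectrum = np.abs(np.fft.rfft(x - np.mean(x)))  # drop the DC offset
    freqs = np.fft.rfftfreq(len(x))                 # frequencies in cycles per sample
    power = spectrum ** 2
    p = power / power.sum()                         # normalize to a distribution
    entropy = -np.sum(p * np.log(p + 1e-12))        # Shannon entropy of the spectrum
    centroid = np.sum(freqs * p)                    # power-weighted mean frequency
    return {"fourier_entropy": float(entropy), "fourier_centroid": float(centroid)}

# Example: a noisy yearly cycle sampled monthly
t = np.arange(120)
print(fourier_features(np.sin(2 * np.pi * t / 12) + 0.3 * np.random.randn(len(t))))
```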

Wavelet Transform in Action
The wavelet transform is the natural answer to the question of how to make a hybrid representation between the time and frequency domains. Unlike sine waves, wavelets are localized oscillations, and as such, encode time and frequency information. Furthermore, wavelets come in many shapes, similar to how there are many types of knives, each specialized for a task.

As with the Fourier transform, we sacrifice the rich mathematical understanding for practical expedience.
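
In the same spirit, here is a minimal sketch of the wavelet-based features, assuming the PyWavelets (pywt) package, a Daubechies-4 wavelet, and three decomposition levels; the wavelet family, the depth, and the definition of “centroid” (here, the energy-weighted mean decomposition level) are illustrative assumptions, not the post’s exact recipe.

```python
# Minimal sketch: wavelet (energy) entropy and an energy-weighted "centroid"
# over decomposition levels, using PyWavelets.
import numpy as np
import pywt

def wavelet_features(x: np.ndarray, wavelet: str = "db4", level: int = 3) -> dict:
    coeffs = pywt.wavedec(x, wavelet, level=level)         # [cA_n, cD_n, ..., cD_1]
    energies = np.array([np.sum(c ** 2) for c in coeffs])  # energy per level
    p = energies / energies.sum()                          # energy distribution over levels
    entropy = -np.sum(p * np.log(p + 1e-12))               # wavelet entropy
    centroid = np.sum(np.arange(len(p)) * p)               # energy-weighted mean level
    return {"wavelet_entropy": float(entropy), "wavelet_centroid": float(centroid)}

t = np.arange(256)
print(wavelet_features(np.sin(2 * np.pi * t / 32) + 0.3 * np.random.randn(len(t))))
```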

Feature Depth
Since some features are constructed from others, it makes sense to define a “feature depth” as the number of transformations that have been applied to raw data in order to obtain it. Depth gives us a crude way to order features in increasing complexity/cost. By analogy, some foods are more processed than others, but all are derived from some plant/animal (hopefully). However, even if two features have the same depth, they need not be equally complex mathematically nor equally expensive computationally. For example, histogram entropy is significantly more costly than just computing the arithmetic mean. It is our hope that the effort we sink into constructing deep features is reimbursed with improved model performance.
The levels for the features are as follows:
- Level 0:
  – Raw data is anonymized, imputed, and masked (to show where imputation occurred)
- Level 1:
  – Winsorization
  – Scaling or normalization
  – Detrending, linear or ARIMA
- Level 2:
  – Lagged features for various lags
  – Window aggregation for various windows and aggregation functions (see the sketch after this list):
    – Mean
    – Median
    – Standard deviation
    – Median absolute deviation
    – Histogram entropy
    – Fourier entropy
    – Fourier centroid
    – Wavelet entropy
    – Wavelet centroid
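
The sketch below illustrates the level-2 window aggregations with pandas; the lag set and window length are placeholders, and the Fourier and wavelet features from the earlier sketches can be attached to the same rolling windows in the same way.

```python
# Minimal sketch of level-2 features: lags plus rolling aggregations.
# Lags (1, 3, 12) and the 12-step window are illustrative, not the post's choices.
import numpy as np
import pandas as pd

def histogram_entropy(window: np.ndarray, bins: int = 10) -> float:
    counts, _ = np.histogram(window, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def level2_features(series: pd.Series, lags=(1, 3, 12), window: int = 12) -> pd.DataFrame:
    feats = {f"lag_{k}": series.shift(k) for k in lags}          # lagged copies
    roll = series.rolling(window)
    feats["roll_mean"] = roll.mean()
    feats["roll_median"] = roll.median()
    feats["roll_std"] = roll.std()
    feats["roll_mad"] = roll.apply(lambda w: np.median(np.abs(w - np.median(w))), raw=True)
    feats["roll_hist_entropy"] = roll.apply(histogram_entropy, raw=True)
    return pd.DataFrame(feats)

s = pd.Series(np.random.randn(200).cumsum())  # stand-in for one detrended FRED series
print(level2_features(s).tail())
```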
Quality Metrics for Features
Measuring the quality of a feature is not straightforward. Vaguely stated, a good feature is one that facilitates good estimation. However, the strength of facilitation possibly depends on the choice of model and the inclusion of other synergizing features.
For our approach, we used:
- Linear correlation with the target
- Mutual information with the target
- Good performance with shallow estimators
Linear correlation is the easiest of the three to understand; it is the degree to which the target can be modeled as a linear function of the feature. Mutual information is harder to grasp; it compares the joint distribution of the target and the feature with the product of their marginal distributions. If the feature and target are independent, the joint distribution carries no extra information beyond the respective marginals, and the mutual information is zero; this is a stronger statement than being merely uncorrelated. Shallow regressors may feel like putting the cart before the horse, as the ultimate goal was to improve regression, and here we are already regressing. The key is the shallowness, which allows us a low-resolution, low-cost preview of what to expect. Procedurally, we sub-select handfuls of features and see how a random forest or ridge regressor performs.
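
As an illustration, here is a minimal sketch of the three checks, assuming the engineered features sit in a pandas DataFrame (feature_frame) aligned with the target series, with rows made missing by lagging or rolling already dropped; the names and parameter values are mine, not the post’s.

```python
# Minimal sketch: correlation, mutual information, and a shallow-estimator preview.
import pandas as pd
from sklearn.feature_selection import mutual_info_regression
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

def feature_quality(feature_frame: pd.DataFrame, target: pd.Series) -> pd.DataFrame:
    corr = feature_frame.corrwith(target).abs()                        # linear correlation
    mi = mutual_info_regression(feature_frame.values, target.values)   # mutual information
    return pd.DataFrame({"abs_corr": corr, "mutual_info": mi}, index=feature_frame.columns)

def shallow_preview(feature_frame: pd.DataFrame, target: pd.Series, cols: list) -> dict:
    # Low-cost preview: cross-validated R^2 for a ridge and a small random forest
    X, y = feature_frame[cols].values, target.values
    return {
        "ridge_r2": cross_val_score(Ridge(), X, y, cv=5).mean(),
        "rf_r2": cross_val_score(
            RandomForestRegressor(n_estimators=50, max_depth=3), X, y, cv=5
        ).mean(),
    }
```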


Pruning Features
Dually, we seek to prune features as information redundancy may drown out the signal and waste computational resources. Determining the amount of shared information between two variables is clear when they have a formula linking them, e.g., given BMI (body mass index) and height of an individual, we know their weight. When the link is more implicit, like the height of a child and their reading ability, we expect taller/older children to read better, but the rule is not absolute. We will measure and prune the information redundancy in our features by two means:
- Variance
- Ancestry
If the target varies widely in value but the feature remains constant, then clearly the feature does not contain information about the target. For the ancestry of the feature, the idea is that if two features are derived from the same raw data, then they carry largely the same information. Clearly, this is not absolute, as the rolling entropy and the moving median of the same data could be magnifying different aspects of the same information.
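
A minimal sketch of both pruning rules follows, reusing the quality table from the previous sketch and a hypothetical ancestry map from each raw series to the feature names derived from it; the variance cutoff and the keep-per-parent count are illustrative.

```python
# Minimal sketch: prune near-constant features, then keep only the top-scoring
# features derived from each raw series.
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

def prune_by_variance(feature_frame: pd.DataFrame, threshold: float = 1e-8) -> pd.DataFrame:
    selector = VarianceThreshold(threshold=threshold)
    selector.fit(feature_frame.values)
    kept = feature_frame.columns[selector.get_support()]
    return feature_frame[kept]

def prune_by_ancestry(quality: pd.DataFrame, ancestry: dict, per_parent: int = 3) -> list:
    # ancestry maps each raw series name to the list of feature names derived from it
    kept = []
    for children in ancestry.values():
        scores = quality.loc[quality.index.intersection(children), "mutual_info"]
        kept.extend(scores.sort_values(ascending=False).head(per_parent).index)
    return kept
```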


Model Performance
Our estimator of choice is the XGBoost regressor, used through its scikit-learn-compatible interface. We follow the standard model-fitting hygiene of splitting the data into test (20%) and train (80%) sets. We then further split the training set into five splits, each with a 90% fit and a 10% validation portion, for parameter tuning. This is done for a feature selection at each of the levels described above. Each feature selection produces predictions for the test data. We can scatter-plot predictions against observations to see the quality of the regression. A good regression will have the points fall on the y = x line; a terrible regression would give a random scatter.
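
Here is a minimal sketch of this evaluation loop, assuming X and y hold the selected feature matrix and target; the hyperparameters are placeholders, and the shuffled 90/10 splits mirror the description above rather than a strict time-series split, which would be the more cautious choice for now-casting.

```python
# Minimal sketch: 80/20 test split, five 90/10 validation splits, fit, and
# a predicted-vs-observed scatter plot. X, y are placeholders for the
# engineered feature matrix and the target.
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, ShuffleSplit, cross_val_score
from xgboost import XGBRegressor

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Five 90/10 fit/validation splits inside the training set for parameter tuning
cv = ShuffleSplit(n_splits=5, test_size=0.1, random_state=0)
model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
print("validation R^2:", cross_val_score(model, X_train, y_train, cv=cv).mean())

# Fit on the full training set and scatter predictions against observations
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
plt.scatter(y_test, y_pred, s=10)
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()])  # the y = x line
plt.xlabel("observed")
plt.ylabel("predicted")
plt.show()
```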




Conclusion
In conclusion, it is a bit inconclusive. I was hoping to see the deeper-level features blow the low-level features out of the water in terms of predictive quality. There does seem to me to be a minor improvement, but this is likely wishful thinking on my part. That being said, the lack of predictive power could be due to many other factors: data suitability, model choice, operator error, etc.
As always, if you need a math mercenary who gives you the facts straight and without corporate embellishment, consider hiring the MathMerc.