Fine-tuning Feature Engineering¶
The Goal: Finding the Best Feature Engineering Strategy Automatically¶
So far, we have explored feature selection and feature extraction (PCA) as separate techniques. But in a real-world project, a critical question arises: "For my specific problem, which feature engineering strategy is best?"
Is it better to select the best original features, or is it better to extract entirely new features? And what is the optimal number of features to use?
The code below answers this question systematically. It builds a single, powerful `GridSearchCV` experiment that creates a "competition" between three different feature engineering methods (`PCA`, `NMF`, and `SelectKBest`) to find the single best combination of data reduction technique and model hyperparameters. This one experiment automates what would otherwise be a very tedious, manual process of building and comparing separate models.
Non-Negative Matrix Factorization (NMF): An Alternative to PCA
While PCA is a powerful and general-purpose tool for dimensionality reduction, it is not the only technique available. In our upcoming example, you will see another method called Non-Negative Matrix Factorization (NMF).
NMF is a dimensionality reduction algorithm that, like PCA, aims to find a new, more compact representation of the data. However, it operates under a significant constraint: it does not allow for any negative values in the resulting components. This makes it particularly useful for datasets where the features can only be added together meaningfully, such as pixel intensities in an image or word counts in a text document. Think of NMF as trying to explain the data as a sum of its parts.
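To make the non-negativity constraint concrete, here is a minimal sketch on synthetic data (the array `X` below is an assumption purely for illustration) that fits both methods and inspects the signs of their components:

```python
import numpy as np
from sklearn.decomposition import NMF, PCA

rng = np.random.RandomState(0)
X = rng.rand(100, 10)  # synthetic non-negative data, e.g. word counts or pixel intensities

nmf = NMF(n_components=3, max_iter=1_000, random_state=0).fit(X)
pca = PCA(n_components=3).fit(X)

print((nmf.components_ >= 0).all())  # True: NMF parts are strictly non-negative
print((pca.components_ < 0).any())   # True: PCA components mix signs
```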
PCA vs. NMF: A Brief Comparison¶
Feature | Principal Component Analysis (PCA) | Non-Negative Matrix Factorization (NMF) |
---|---|---|
Core Goal | Finds directions of maximum variance in the data. | Decomposes the data into a sum of non-negative parts. |
Component Values | Components can have both positive and negative values, which can sometimes be hard to interpret. | Components are strictly non-negative (zero or positive). |
Best For | General-purpose dimensionality reduction, especially for data centered around a mean. | Datasets where features are counts or represent parts of a whole (e.g., text analysis, image processing, audio signal processing). |
Interpretation | Components are ordered: the first captures the most variance, the second the next most, and so on. | Components are unordered; each represents a different additive part of the original data. |
Code Walkthrough: A Competition Pipeline¶
Let's dissect the code piece by piece.
1. The Pipeline Blueprint¶
First, we define a `Pipeline` that acts as a template for our experiment. Notice the crucial second step.
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

pipe = Pipeline(
    [
        ("scaling", MinMaxScaler()),
        # the reduce_dim stage is populated by the param_grid
        ("reduce_dim", "passthrough"),
        ("classify", LinearSVC(dual=False, max_iter=10000)),
    ]
)
```

- `("scaling", MinMaxScaler())`: Our standard first step to scale the data. Scaling to the [0, 1] range also guarantees non-negative values, which `NMF` requires as input.
- `("reduce_dim", "passthrough")`: This is the key. We are creating a placeholder step named `reduce_dim`. The `"passthrough"` value tells the pipeline to do nothing at this stage by default. This placeholder will be dynamically replaced by our feature engineering objects (`PCA`, `NMF`, `SelectKBest`) during the grid search, as the sketch below shows.
- `("classify", LinearSVC(...))`: Our final classification model.
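You can watch the placeholder swap happen outside of any grid search with `set_params`, which is exactly the mechanism `GridSearchCV` uses for each candidate. A minimal sketch, reusing the `pipe` defined above:

```python
from sklearn.decomposition import PCA

# GridSearchCV swaps a real transformer into the placeholder step like this:
pipe.set_params(reduce_dim=PCA(n_components=4))
print(pipe.named_steps["reduce_dim"])  # PCA(n_components=4)

# Restore the default no-op behaviour:
pipe.set_params(reduce_dim="passthrough")
```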
2. The Parameter Grid: Defining the Competition¶
This is the most complex part. Instead of a single dictionary, `param_grid` is a list of dictionaries. `GridSearchCV` will run a completely separate search for each dictionary in the list.
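You can verify the "separate search per dictionary" behaviour with `ParameterGrid`, the helper `GridSearchCV` uses internally to enumerate candidates (the toy parameter names `a` and `b` are assumptions for illustration):

```python
from sklearn.model_selection import ParameterGrid

# A list of dicts yields the union of each dict's grid, not one big cross product
grid_spec = [{"a": [1, 2]}, {"b": [3, 4, 5]}]
print(list(ParameterGrid(grid_spec)))
# [{'a': 1}, {'a': 2}, {'b': 3}, {'b': 4}, {'b': 5}]  -> 2 + 3 = 5 candidates
```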
**Competition Bracket 1: Feature Extraction (`PCA` vs. `NMF`)**

The first dictionary defines the competition between our two feature extraction methods.
```python
from sklearn.decomposition import NMF, PCA

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]

param_grid = [
    {
        "reduce_dim": [PCA(iterated_power=7), NMF(max_iter=1_000)],
        "reduce_dim__n_components": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
    # ...
]
```

- `"reduce_dim": [PCA(...), NMF(...)]`: This tells `GridSearchCV`: "For the `reduce_dim` step in the pipeline, first try using `PCA`, and then try using `NMF`."
- `"reduce_dim__n_components": [2, 4, 8]`: This uses the double-underscore (`__`) syntax to set a hyperparameter on the object inside the `reduce_dim` step. It says: "For whichever method you are trying (`PCA` or `NMF`), test it with 2, 4, and 8 components." The routing is demonstrated in the sketch after this list.
- `"classify__C": [1, 10, 100, 1000]`: This sets the `C` hyperparameter for our final `LinearSVC` classifier.
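The double-underscore routing is plain pipeline behaviour, so you can test it directly (a small sketch continuing from the `pipe` and `PCA` import used above):

```python
# "step__param" routes a value to a parameter of the object inside that step
pipe.set_params(reduce_dim=PCA(), reduce_dim__n_components=8)
print(pipe.named_steps["reduce_dim"].n_components)  # 8
```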
**Competition Bracket 2: Feature Selection (`SelectKBest`)**

The second dictionary defines the experiment for our feature selection method.
```python
from sklearn.feature_selection import SelectKBest, mutual_info_classif

param_grid = [
    # ... (the feature extraction bracket from above)
    {
        "reduce_dim": [SelectKBest(mutual_info_classif)],
        "reduce_dim__k": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
]
```

- `"reduce_dim": [SelectKBest(...)]`: This tells `GridSearchCV`: "Now, for the `reduce_dim` step, try using `SelectKBest`."
- `"reduce_dim__k": [2, 4, 8]`: This targets the `k` hyperparameter of `SelectKBest`, telling it to try selecting the top 2, 4, and 8 features.
- `"classify__C": [1, 10, 100, 1000]`: Again, this tunes the final classifier.
3. Running the Search¶
Finally, we put it all together. `GridSearchCV` will now exhaustively test every combination defined in `param_grid`: PCA with 2, 4, and 8 components; NMF with 2, 4, and 8 components; and SelectKBest with 2, 4, and 8 features, each paired with all four `C` values for the classifier. That is (3 + 3 + 3) × 4 = 36 candidate pipelines, each evaluated with cross-validation.
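Since this section shows the pieces only as fragments, here is a minimal end-to-end sketch that assembles them and runs the search. The digits dataset and `n_jobs=-1` are assumptions chosen purely for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF, PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC

X, y = load_digits(return_X_y=True)

pipe = Pipeline(
    [
        ("scaling", MinMaxScaler()),
        ("reduce_dim", "passthrough"),  # placeholder, filled in by param_grid
        ("classify", LinearSVC(dual=False, max_iter=10000)),
    ]
)

N_FEATURES_OPTIONS = [2, 4, 8]
C_OPTIONS = [1, 10, 100, 1000]

param_grid = [
    {  # bracket 1: feature extraction
        "reduce_dim": [PCA(iterated_power=7), NMF(max_iter=1_000)],
        "reduce_dim__n_components": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
    {  # bracket 2: feature selection
        "reduce_dim": [SelectKBest(mutual_info_classif)],
        "reduce_dim__k": N_FEATURES_OPTIONS,
        "classify__C": C_OPTIONS,
    },
]

grid = GridSearchCV(pipe, param_grid=param_grid, n_jobs=-1)
grid.fit(X, y)

print(grid.best_params_)  # the winning strategy and hyperparameters
print(grid.best_score_)   # its mean cross-validated accuracy
```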
After the search is complete, you can inspect `grid.best_estimator_` and `grid.best_params_` to see which combination of feature engineering strategy, number of features, and classifier setting emerged as the ultimate winner of this automated competition. This is the power and purpose of this complex-seeming setup: it provides a robust, automated framework for making optimal feature engineering decisions.
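For a strategy-by-strategy comparison rather than a single winner, `grid.cv_results_` holds every candidate's score. A small sketch, assuming the fitted `grid` from the example above and using pandas purely for convenience:

```python
import pandas as pd

results = pd.DataFrame(grid.cv_results_)

# Label each candidate by the class of the object tried in the reduce_dim step
results["strategy"] = results["param_reduce_dim"].apply(lambda obj: type(obj).__name__)

# Best mean cross-validated score reached by each feature engineering strategy
print(results.groupby("strategy")["mean_test_score"].max())
```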