
Automated Hyperparameter Tuning

This lab demonstrates the complete tuning workflow. We will use GridSearchCV to automatically search for the best hyperparameters for our pipeline, using cross-validation on our training data, and then perform a final evaluation on the held-out test set.

Setup: Imports and Data Splits

First, we'll import GridSearchCV (and mean_absolute_error for the final evaluation), and reuse the pipe pipeline along with the X_train, X_test, y_train, and y_test sets from our previous lab.

from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error

# We assume pipe, X_train, X_test, y_train, and y_test are already created
# from the previous lab's pipeline and train_test_split.
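
If you are running this lab standalone and no longer have those objects in memory, the sketch below recreates comparable ones. The California housing data, the SimpleImputer preprocessing step, and the random_state values are stand-in assumptions, not the previous lab's actual choices; prefer your existing X_train, X_test, y_train, y_test, and pipe if you have them.

# Optional stand-in setup for running this lab on its own.
from sklearn.datasets import fetch_california_housing
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeRegressor

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# make_pipeline names the tree step 'decisiontreeregressor' automatically
pipe = make_pipeline(SimpleImputer(), DecisionTreeRegressor(random_state=42))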

Step 1: Define the Parameter Grid

We need to tell GridSearchCV which hyperparameters to test. We define this in a dictionary where the keys are the names of the parameters and the values are lists of settings to try.

To specify a hyperparameter for a step in a pipeline, we use the step's name, followed by two underscores (__), and then the hyperparameter name. Because we built our pipeline with make_pipeline, the DecisionTreeRegressor step is automatically named decisiontreeregressor (the lowercased class name).
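
If you are unsure what a step is called, you can list every parameter key the pipeline exposes; the double-underscore names in the output are exactly the keys GridSearchCV accepts. This quick check assumes pipe is the pipeline from the previous lab.

# Print every tunable parameter name exposed by the pipeline;
# keys like 'decisiontreeregressor__max_depth' are valid grid keys.
for name in sorted(pipe.get_params().keys()):
    print(name)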

# Define the grid of hyperparameters to search
param_grid = {
    'decisiontreeregressor__max_depth': [3, 5, 7, 10, None],
    'decisiontreeregressor__min_samples_leaf': [1, 2, 4, 6],
    'decisiontreeregressor__max_features': [None, 'sqrt', 'log2']
}

Step 2: Set Up and Run GridSearchCV

Now, we instantiate GridSearchCV. We provide our pipeline (pipe), the param_grid, the number of cross-validation folds (cv=5), and the scoring metric. Since GridSearchCV tries to maximize a score, and we want to minimize error, we use 'neg_mean_absolute_error'.

Note: This step can take a few moments to run. The grid contains 60 hyperparameter combinations (5 max_depth × 4 min_samples_leaf × 3 max_features values), and each one is fit 5 times during cross-validation, for 300 total fits.
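
If you want to verify that count before launching a long search, ParameterGrid expands the same dictionary that GridSearchCV uses; the sketch below simply counts the combinations.

from sklearn.model_selection import ParameterGrid

# 5 max_depth values × 4 min_samples_leaf values × 3 max_features values = 60
n_combinations = len(ParameterGrid(param_grid))
print(f"{n_combinations} combinations × 5 folds = {n_combinations * 5} total fits")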

# Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=pipe,
    param_grid=param_grid,
    cv=5,
    scoring='neg_mean_absolute_error',
    verbose=1 # This will print progress updates
)

# Fit the grid search on the training data
grid_search.fit(X_train, y_train)

Step 3: Inspect the Results

GridSearchCV stores the best combination of parameters it found in the best_params_ attribute. The best_estimator_ attribute holds the pipeline that was refit on the entire training set using these optimal parameters.

# Print the best hyperparameters found
print("Best Hyperparameters:")
print(grid_search.best_params_)

# The best score is negative, so we multiply by -1 to get the MAE
best_mae = -grid_search.best_score_
print(f"\nBest Cross-Validated MAE: ${best_mae:,.2f}")

Step 4: Final Evaluation on the Test Set

Finally, we use the tuned model to make predictions on our held-out test set. Calling predict on the fitted GridSearchCV object automatically delegates to best_estimator_, so this provides our final, unbiased assessment of the tuned model's performance.

# Use the best estimator to make predictions on the test set
final_predictions = grid_search.predict(X_test)

# Calculate the final MAE on the test set
final_mae = mean_absolute_error(y_test, final_predictions)

print(f"\nFinal Model MAE on Held-Out Test Set: ${final_mae:,.2f}")

You can now compare this final_mae to the baseline MAE from our first lab to quantify the improvement gained from hyperparameter tuning.
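
As a minimal sketch of that comparison, assuming baseline_mae is a variable still in memory from the first lab (it is not defined in this lab):

# Assumes baseline_mae holds the untuned model's MAE from the first lab
improvement = baseline_mae - final_mae
print(f"Improvement over baseline: ${improvement:,.2f} "
      f"({improvement / baseline_mae:.1%} reduction in MAE)")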