This tutorial explains how to use feature importance from scikit-learn to perform backward stepwise feature selection. The feature importance used is the Gini importance from a tree-based model.
We will use it to prune the features used to model arrival delay for flights in and out of NYC in 2013.
import statsmodels.api as sm
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import category_encoders as ce
from sklearn.ensemble import GradientBoostingRegressor
Reading the Data
The data comes from the Rdatasets repository and is imported using the Python package statsmodels.
df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()
This should print a summary resembling the following:
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 336776 non-null int64
1 month 336776 non-null int64
2 day 336776 non-null int64
3 dep_time 328521 non-null float64
4 sched_dep_time 336776 non-null int64
5 dep_delay 328521 non-null float64
6 arr_time 328063 non-null float64
7 sched_arr_time 336776 non-null int64
8 arr_delay 327346 non-null float64
9 carrier 336776 non-null object
10 flight 336776 non-null int64
11 tailnum 334264 non-null object
12 origin 336776 non-null object
13 dest 336776 non-null object
14 air_time 327346 non-null float64
15 distance 336776 non-null int64
16 hour 336776 non-null int64
17 minute 336776 non-null int64
18 time_hour 336776 non-null object
dtypes: float64(5), int64(9), object(5)
memory usage: 48.8+ MB
Feature Engineering
Handle Null Values
As this model will predict arrival delay, the null values are caused by flights that were cancelled or diverted. These can be excluded from this analysis.
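A minimal way to do that is to drop the incomplete rows outright rather than imputing them:
# Cancelled or diverted flights have missing dep/arr values; drop those rows
df = df.dropna()
Note that dropna() removes rows with a null in any column, so the few flights with a missing tailnum are dropped as well.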
Convert the Times From Floats or Ints to Hours and Minutes
# Times are stored as HHMM numbers (e.g., 1430 means 14:30); split into hour and minute
df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))
df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))
df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))
df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))
df.rename(columns={'hour': 'dep_hour',
'minute': 'dep_minute'}, inplace=True)
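To sanity-check the conversion, you can peek at the original and derived columns (output omitted here):
df[['arr_time', 'arr_hour', 'arr_minute']].head()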
Feature Selection
Define Functions
Function to build the model.
def build_model(df, target):
    """
    Given the dataframe and target, build and return the model
    """
    # Set up target
    y = df[target]
    # Set up train-test split
    X = df.drop(columns=[target])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1066)
    # Encode the categorical variables
    encoder = ce.LeaveOneOutEncoder(return_df=True)
    X_train_loo = encoder.fit_transform(X_train, y_train)
    X_test_loo = encoder.transform(X_test)
    # Fit the model and calculate RMSE
    # NOTE: You really should do some hyperparameter tuning here.
    model = GradientBoostingRegressor(learning_rate=0.05, max_depth=5, n_estimators=500, min_samples_split=5, n_iter_no_change=10)
    model.fit(X_train_loo, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test_loo)))
    return rmse, model, X_train_loo.columns
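Before wiring this into the selection loop, a quick smoke test of the helper might look like this (a sketch; the exact RMSE will vary from run to run since the GBM is not seeded):
rmse, model, features = build_model(df, 'arr_delay')
print(f"baseline RMSE: {rmse:.3f} using {len(features)} features")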
Function to return the feature to be dropped
def get_dropped_feature(model, features):
    """
    Given a fitted model and its feature names, return the feature
    with the lowest Gini importance.
    """
    feature_importance = model.feature_importances_
    importance_df = pd.DataFrame({'features': features,
                                  'importance': feature_importance})
    importance_df.sort_values(by='importance', ascending=False, inplace=True)
    return importance_df['features'].iloc[-1]
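Paired with the model and feature list from the smoke test above, it names the next candidate for removal:
drop_candidate = get_dropped_feature(model, features)
print(f"least important feature: {drop_candidate}")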
Function that incrementally removes the feature with the lowest feature importance, as calculated by scikit-learn, until the RMSE stops improving.
def backward_selection(df, target, max_features=None):
    """
    This function uses Gini importance from a sklearn GBM model to incrementally
    remove features from the training set until the RMSE no longer improves.
    It returns the dataframe with the features that give the best RMSE,
    keeping at most max_features.
    """
    # Get baseline RMSE
    select_df = df.copy()
    total_features = df.shape[1]
    rmse, model, features = build_model(select_df, target)
    print(f"{rmse} with {select_df.shape[1]}")
    last_rmse = rmse
    # Drop the least important feature and recalculate model performance
    if max_features is None:
        max_features = total_features - 1
    for num_features in range(total_features - 1, 0, -1):
        # Trim the least important feature
        dropped_feature = get_dropped_feature(model, features)
        tmp_df = select_df.drop(columns=[dropped_feature])
        # Rerun modeling
        rmse, model, features = build_model(tmp_df, target)
        print(f"{rmse} with {tmp_df.shape[1]}")
        if (num_features < max_features) and (rmse > last_rmse):
            # RMSE increased; return the last dataframe
            return select_df
        else:
            # RMSE improved; continue dropping features
            last_rmse = rmse
            select_df = tmp_df
    return select_df
Run Stepwise Feature Selection
Call backward_selection on the modeling dataframe. reduced_df will contain the selected features and will be our reduced modeling dataset.
target = 'arr_delay'
reduced_df = backward_selection(df, target, max_features=20)
reduced_df.shape[1]
30.54691197243769 with 25
30.479507645969033 with 24
30.35605963373274 with 23
30.702390669186062 with 22
30.79658581770791 with 21
29.975162972788738 with 20
30.5938801236653 with 19
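Note that the RMSE upticks at 22 and 21 columns did not trigger an early return, because the column count had not yet dropped below max_features=20. The RMSE bottomed out at about 29.98 with 20 columns and rose again at 19, so backward_selection returned the 20-column dataframe (19 features plus the target), and reduced_df.shape[1] returns 20. A final model can then be refit on the reduced dataset with the same helper:
rmse, model, features = build_model(reduced_df, target)
print(f"final RMSE: {rmse:.3f} with {len(features)} features")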