This tutorial explains how to use Shapley importance from SHAP and a scikit-learn tree-based model to perform feature selection.
This notebook will work with an OpenML dataset to predict who pays for internet with 10108 observations and 69 columns.
Packages
This tutorial uses:
import statsmodels.api as sm
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
import shap
import category_encoders as ce
from sklearn.ensemble import GradientBoostingRegressor
Reading the Data
The data is from rdatasets imported using the Python package statsmodels.
df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()
This should return a table resembling something like this:
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 336776 non-null int64
1 month 336776 non-null int64
2 day 336776 non-null int64
3 dep_time 328521 non-null float64
4 sched_dep_time 336776 non-null int64
5 dep_delay 328521 non-null float64
6 arr_time 328063 non-null float64
7 sched_arr_time 336776 non-null int64
8 arr_delay 327346 non-null float64
9 carrier 336776 non-null object
10 flight 336776 non-null int64
11 tailnum 334264 non-null object
12 origin 336776 non-null object
13 dest 336776 non-null object
14 air_time 327346 non-null float64
15 distance 336776 non-null int64
16 hour 336776 non-null int64
17 minute 336776 non-null int64
18 time_hour 336776 non-null object
dtypes: float64(5), int64(9), object(5)
memory usage: 48.8+ MB
Feature Engineering
Handle Null Values
As this model will predict arrival delay, the Null values are caused by flights did were cancelled or diverted. These can be excluded from this analysis.
Convert the Times From Floats or Ints to Hour and Minutes
df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))
df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))
df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))
df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))
df.rename(columns={'hour': 'dep_hour',
'minute': 'dep_minute'}, inplace=True)
Prepare data for modeling
Set Up Train-Test Split
target = 'arr_delay'
y = df[target]
X = df.drop(columns=[target, 'flight', 'tailnum', 'time_hour', 'year', 'dep_time', 'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1066)
Encode Categorical Variables
We use a leave-one-out encoder as it creates a single column for each categorical variable instead of creating a column for each level of the categorical variable like one-hot-encoding. This makes interpreting the impact of categorical variables with feature impact easier.
encoder = ce.LeaveOneOutEncoder(return_df=True)
X_train_loo = encoder.fit_transform(X_train, y_train)
X_test_loo = encoder.transform(X_test)
Fit the Model
model = GradientBoostingRegressor(learning_rate=0.05, max_depth=5, n_estimators=500, min_samples_split=5, n_iter_no_change=10)
model.fit(X_train_loo, y_train)
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test_loo)))
rmse
Feature Selection Using Mean Absolute Value of the SHAP Values
Create a data frame to hold the SHAP values.
explainer = shap.Explainer(model)
shap_values = explainer(X_test_loo)
shap_importance = shap_values.abs.mean(0).values
importance_df = pd.DataFrame({'features': X_train_loo.columns,
'importance': shap_importance})
importance_df.sort_values(by='importance', ascending=False, inplace=True)
importance_df
features importance
9 arr_hour 2.331481
7 dep_hour 1.112755
3 origin 1.081143
13 sched_dep_hour 0.747311
11 sched_arr_hour 0.358844
6 distance 0.169818
2 carrier 0.115962
4 dest 0.086835
12 sched_arr_minute 0.078803
10 arr_minute 0.057935
5 air_time 0.027669
8 dep_minute 0.014226
1 day 0.012283
14 sched_dep_minute 0.007768
0 month 0.001376
Create a list of the features with Gini importance greater than 0.5 and use that list to retrain the model.
feature_list = importance_df[importance_df.importance > 0.5]['features'].tolist()
feature_list
['arr_hour', 'dep_hour', 'origin', 'sched_dep_hour']
Alternatively, to keep the top 5 features, use the following instead
feature_list = importance_df['features'].head(5).tolist()
feature_list
['arr_hour', 'dep_hour', 'origin', 'sched_dep_hour', 'sched_arr_hour']
X_train_loo_new = X_train_loo[feature_list]
X_test_loo_new = X_test_loo[feature_list]
reduced_model = GradientBoostingRegressor(learning_rate=0.05, max_depth=5, n_estimators=500, min_samples_split=5, n_iter_no_change=10)
reduced_model.fit(X_train_loo_new, y_train)
rmse = np.sqrt(mean_squared_error(y_test, reduced_model.predict(X_test_loo_new)))
rmse