This tutorial explains how to use tree-based (Gini) feature importance from a scikit-learn tree-based model to perform feature selection.
It works with the nycflights13 flights dataset, which has 336,776 observations and 19 columns, to predict flight arrival delay.
Packages
This tutorial uses:
import statsmodels.api as sm
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
import category_encoders as ce
from sklearn.ensemble import GradientBoostingRegressor
Reading the Data
The data is from rdatasets imported using the Python package statsmodels.
df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()
This should return output like the following:
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 336776 non-null int64
1 month 336776 non-null int64
2 day 336776 non-null int64
3 dep_time 328521 non-null float64
4 sched_dep_time 336776 non-null int64
5 dep_delay 328521 non-null float64
6 arr_time 328063 non-null float64
7 sched_arr_time 336776 non-null int64
8 arr_delay 327346 non-null float64
9 carrier 336776 non-null object
10 flight 336776 non-null int64
11 tailnum 334264 non-null object
12 origin 336776 non-null object
13 dest 336776 non-null object
14 air_time 327346 non-null float64
15 distance 336776 non-null int64
16 hour 336776 non-null int64
17 minute 336776 non-null int64
18 time_hour 336776 non-null object
dtypes: float64(5), int64(9), object(5)
memory usage: 48.8+ MB
Feature Engineering
Handle Null Values
As this model will predict arrival delay, the null values come from flights that were cancelled or diverted. These rows can be excluded from this analysis, as shown below.
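One simple option, sketched here under the assumption that dropping the few rows with a missing tailnum is also acceptable, is to remove every row containing a null:
# Remove cancelled/diverted flights (and the few rows missing a tailnum)
df = df.dropna()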
Convert the Times From Floats or Ints to Hours and Minutes
# Times are stored as HHMM numbers (e.g., 1540 means 15:40); split each into hour and minute
df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))
df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))
df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))
df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))
df.rename(columns={'hour': 'dep_hour',
'minute': 'dep_minute'}, inplace=True)
Prepare Data For Modeling
Set Up Train-Test Split
target = 'arr_delay'
y = df[target]
X = df.drop(columns=[target, 'flight', 'tailnum', 'time_hour', 'year', 'dep_time', 'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1066)
Encode Categorical Variables
We use a leave-one-out encoder because it creates a single column for each categorical variable instead of one column per level, as one-hot encoding does. This makes interpreting the feature importance of each categorical variable easier.
encoder = ce.LeaveOneOutEncoder(return_df=True)
X_train_loo = encoder.fit_transform(X_train, y_train)
X_test_loo = encoder.transform(X_test)
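To confirm the encoding worked, peek at the formerly categorical columns; each now holds a single numeric value derived from the target:
# Each categorical column is now one numeric column
X_train_loo[['carrier', 'origin', 'dest']].head()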
Fit The Model
model = GradientBoostingRegressor(learning_rate=0.05, max_depth=5, n_estimators=500, min_samples_split=5, n_iter_no_change=10)
model.fit(X_train_loo, y_train)
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test_loo)))
rmse
Feature Selection Using Tree-Based (or Gini) Importance
Using the Built-In SelectFromModel
SelectFromModel uses model.feature_importances_ to pick which features to keep. Since the model has already been fit, pass prefit=True to reuse it. Because origin has outsized Gini importance, set a low threshold with threshold='0.01*mean' so that multiple features are kept.
selector = SelectFromModel(model, prefit=True, threshold='0.01*mean')
X_train_loo_new = selector.transform(X_train_loo)
X_test_loo_new = selector.transform(X_test_loo)
print(X_train_loo.shape, X_train_loo_new.shape)
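selector.transform returns a NumPy array without column names, so to see which features survived the threshold, map the selector's support mask back onto the original columns:
# Boolean mask of the columns SelectFromModel kept
selected_features = X_train_loo.columns[selector.get_support()]
selected_features.tolist()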
reduced_model = GradientBoostingRegressor(learning_rate=0.05, max_depth=5, n_estimators=500, min_samples_split=5, n_iter_no_change=10)
reduced_model.fit(X_train_loo_new, y_train)
rmse = np.sqrt(mean_squared_error(y_test, reduced_model.predict(X_test_loo_new)))
rmse
Using the Feature Importance Values
Create a data frame to hold the importance scores.
feature_importance = model.feature_importances_
importance_df = pd.DataFrame({'features': X_train_loo.columns,
'importance': feature_importance})
importance_df.sort_values(by='importance', ascending=False, inplace=True)
importance_df
features importance
3 origin 0.945621
9 arr_hour 0.028541
7 dep_hour 0.012011
13 sched_dep_hour 0.007269
11 sched_arr_hour 0.005960
2 carrier 0.000235
12 sched_arr_minute 0.000140
6 distance 0.000068
0 month 0.000046
10 arr_minute 0.000030
5 air_time 0.000024
4 dest 0.000022
1 day 0.000020
14 sched_dep_minute 0.000008
8 dep_minute 0.000005
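To visualize these scores, a quick horizontal bar chart works well (this assumes matplotlib, which is not in the package list above):
import matplotlib.pyplot as plt

# Horizontal bar chart of Gini importance, most important feature at the top
importance_df.plot.barh(x='features', y='importance', legend=False)
plt.gca().invert_yaxis()
plt.xlabel('Gini importance')
plt.tight_layout()
plt.show()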
Create a list of the features with Gini importance greater than 0.005 and use that list to retrain the model.
feature_list = importance_df[importance_df.importance > 0.005]['features'].tolist()
feature_list
['origin', 'arr_hour', 'dep_hour', 'sched_dep_hour', 'sched_arr_hour']
Alternatively, to keep the top five features by importance, use the following instead:
feature_list = importance_df['features'].head(5).tolist()
feature_list
['origin', 'arr_hour', 'dep_hour', 'sched_dep_hour', 'sched_arr_hour']
X_train_loo_new = X_train_loo[feature_list]
X_test_loo_new = X_test_loo[feature_list]
reduced_model = GradientBoostingRegressor(learning_rate=0.05, max_depth=5, n_estimators=500, min_samples_split=5, n_iter_no_change=10)
reduced_model.fit(X_train_loo_new, y_train)
rmse = np.sqrt(mean_squared_error(y_test, reduced_model.predict(X_test_loo_new)))
rmse