This tutorial explains how to use tree-based (Gini) feature importance from a scikit-learn tree-based model to perform feature selection.

This tutorial works with the nycflights13 flights dataset, with 336,776 observations and 19 columns, to predict flight arrival delay.

Packages

This tutorial uses:


import statsmodels.api as sm
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import SelectFromModel
import category_encoders as ce

from sklearn.ensemble import GradientBoostingRegressor

Reading the Data

The data comes from rdatasets and is imported using the Python package statsmodels.


df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()

This should return output resembling the following:

RangeIndex: 336776 entries, 0 to 336775

Data columns (total 19 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   year            336776 non-null  int64  
 1   month           336776 non-null  int64  
 2   day             336776 non-null  int64  
 3   dep_time        328521 non-null  float64
 4   sched_dep_time  336776 non-null  int64  
 5   dep_delay       328521 non-null  float64
 6   arr_time        328063 non-null  float64
 7   sched_arr_time  336776 non-null  int64  
 8   arr_delay       327346 non-null  float64
 9   carrier         336776 non-null  object 
 10  flight          336776 non-null  int64  
 11  tailnum         334264 non-null  object 
 12  origin          336776 non-null  object 
 13  dest            336776 non-null  object 
 14  air_time        327346 non-null  float64
 15  distance        336776 non-null  int64  
 16  hour            336776 non-null  int64  
 17  minute          336776 non-null  int64  
 18  time_hour       336776 non-null  object 
dtypes: float64(5), int64(9), object(5)
memory usage: 48.8+ MB

Feature Engineering

Handle Null Values

As this model will predict arrival delay, the null values are caused by flights that were cancelled or diverted. These rows can be excluded from this analysis.
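
To confirm which columns contain nulls, a quick check before dropping them:


df.isna().sum()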


# Note: this drops every row containing any null value, including rows where
# only tailnum is missing
df.dropna(inplace=True)

Convert the Times From Floats or Ints to Hours and Minutes


# Times are stored as HHMM numbers (e.g. 1354 means 13:54); split each one
# into separate hour and minute columns
df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))
df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))
df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))
df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))
df.rename(columns={'hour': 'dep_hour',
                   'minute': 'dep_minute'}, inplace=True)
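
If you prefer vectorized arithmetic over apply, the same HHMM split can be written with integer division and modulo; a minimal equivalent sketch for the arrival time:


# HHMM -> (HH, MM): 1354 becomes hour 13, minute 54
arr = df.arr_time.astype(int)
df['arr_hour'] = arr // 100
df['arr_minute'] = arr % 100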


Prepare Data For Modeling

Set Up Train-Test Split


target = 'arr_delay'
y = df[target]
# Drop the target, identifier columns (flight, tailnum, time_hour), the
# constant year column, the raw HHMM time columns, and dep_delay
X = df.drop(columns=[target, 'flight', 'tailnum', 'time_hour', 'year', 'dep_time', 'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1066)

Encode Categorical Variables

We use a leave-one-out encoder because it creates a single column for each categorical variable, rather than one column per level as one-hot encoding does. This makes it easier to interpret each categorical variable's feature importance.


encoder = ce.LeaveOneOutEncoder(return_df=True)
X_train_loo = encoder.fit_transform(X_train, y_train)
X_test_loo = encoder.transform(X_test)
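
To see what the encoder does mechanically, here is a toy sketch on made-up data: each row's category level is replaced by the mean target of the other rows with that level, so a row never sees its own target value.


toy_X = pd.DataFrame({'origin': ['EWR', 'EWR', 'JFK', 'JFK']})
toy_y = pd.Series([10.0, 20.0, 30.0, 50.0])

# Each EWR row is encoded with the other EWR row's target (20, then 10);
# each JFK row with the other JFK row's target (50, then 30)
ce.LeaveOneOutEncoder().fit_transform(toy_X, toy_y)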

Fit The Model


# n_iter_no_change=10 enables early stopping based on an internal validation
# split (10% of the training data by default)
model = GradientBoostingRegressor(learning_rate=0.05, max_depth=5, n_estimators=500, min_samples_split=5, n_iter_no_change=10)
model.fit(X_train_loo, y_train)

rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test_loo)))
rmse

43.36704861580844
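
As a quick sanity check, compare this to a naive baseline that always predicts the mean training delay:


# RMSE of always predicting the training-set mean arrival delay
baseline = np.full(len(y_test), y_train.mean())
np.sqrt(mean_squared_error(y_test, baseline))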

Feature Selection Using Tree-Based (or Gini) Importance

Using the Built-In SelectFromModel

Use SelectFromModel, which reads model.feature_importances_, to pick which features to keep. Since the model has already been fit, pass prefit=True to use the prefit model. As origin has outsized Gini importance, set a low threshold with threshold='0.01*mean' to keep more than one feature.


selector = SelectFromModel(model, prefit=True, threshold='0.01*mean')
X_train_loo_new = selector.transform(X_train_loo)
X_test_loo_new = selector.transform(X_test_loo)
print(X_train_loo.shape, X_train_loo_new.shape)

(261876, 15) (261876, 5)
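
To see which columns survived the threshold, SelectFromModel exposes get_support(), which returns a boolean mask aligned with the input columns:


# Names of the retained features
X_train_loo.columns[selector.get_support()].tolist()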

reduced_model = GradientBoostingRegressor(learning_rate=0.05, max_depth=5, n_estimators=500, min_samples_split=5, n_iter_no_change=10)
reduced_model.fit(X_train_loo_new, y_train)

rmse = np.sqrt(mean_squared_error(y_test, reduced_model.predict(X_test_loo_new)))
rmse

43.524385900038425

Using the Feature Importance Values

Create a data frame to hold the importance scores.


feature_importance = model.feature_importances_
importance_df = pd.DataFrame({'features': X_train_loo.columns,
                              'importance': feature_importance})
importance_df.sort_values(by='importance', ascending=False, inplace=True)
importance_df

features	importance
3	origin	0.945621
9	arr_hour	0.028541
7	dep_hour	0.012011
13	sched_dep_hour	0.007269
11	sched_arr_hour	0.005960
2	carrier	0.000235
12	sched_arr_minute	0.000140
6	distance	0.000068
0	month	0.000046
10	arr_minute	0.000030
5	air_time	0.000024
4	dest	0.000022
1	day	0.000020
14	sched_dep_minute	0.000008
8	dep_minute	0.000005
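
A horizontal bar chart makes the gap between origin and the remaining features easy to see; a quick sketch with matplotlib (an extra dependency, not imported above):


import matplotlib.pyplot as plt

# Reverse the row order so the most important feature sits at the top
importance_df.iloc[::-1].plot.barh(x='features', y='importance', legend=False)
plt.xlabel('Gini importance')
plt.tight_layout()
plt.show()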

Create a list of the features with Gini importance greater than 0.005 and use that list to retrain the model.


feature_list = importance_df[importance_df.importance > 0.005]['features'].tolist()
feature_list

['origin', 'arr_hour', 'dep_hour', 'sched_dep_hour', 'sched_arr_hour']


Alternatively, to keep the top five features by importance, use the following instead:


feature_list = importance_df['features'].head(5).tolist()
feature_list

['origin', 'arr_hour', 'dep_hour', 'sched_dep_hour', 'sched_arr_hour']

X_train_loo_new = X_train_loo[feature_list]
X_test_loo_new = X_test_loo[feature_list]

reduced_model = GradientBoostingRegressor(learning_rate=0.05, max_depth=5, n_estimators=500, min_samples_split=5, n_iter_no_change=10)
reduced_model.fit(X_train_loo_new, y_train)

rmse = np.sqrt(mean_squared_error(y_test, reduced_model.predict(X_test_loo_new)))
rmse

43.28571715980914