This tutorial explains how to use feature importance plots from pyrasgo to perform feature selection. Feature importance is calculated from SHAP values produced by CatBoost.

This notebook calculates the SHAP feature importance when predicting arrival delay for flights that departed from NYC in 2013.
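For intuition on how per-row SHAP values become a single score per feature, here is a minimal sketch. It assumes the common convention of averaging the absolute SHAP value of each feature across rows; the tutorial does not state exactly how pyrasgo aggregates them, and the matrix below is made up for illustration.

```python
import numpy as np

# Hypothetical SHAP matrix: one row per flight, one column per feature.
# These numbers are invented for illustration, not taken from the dataset.
feature_names = ["arr_hour", "distance", "month"]
shap_values = np.array([
    [ 4.0, -0.5,  0.2],
    [-6.0,  0.3, -0.4],
    [ 8.0, -0.1,  0.0],
])

# Mean absolute SHAP value per feature. Taking the absolute value first
# keeps positive and negative contributions from cancelling out.
importance = np.abs(shap_values).mean(axis=0)

for name, score in zip(feature_names, importance):
    print(f"{name}: {score:.3f}")
```

A feature that pushes predictions strongly in either direction ends up with a high score, which is why the sign is dropped before averaging.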

Packages

This tutorial uses:


import statsmodels.api as sm
import pandas as pd
import numpy as np
import pyrasgo

Connect to Rasgo

Enter your email and password to create an account. This account gives you free access to the Rasgo API, which calculates dataframe profiles, generates feature importance scores, and produces feature explainability for your analysis. The account also lets you keep access to your analyses and share them with your colleagues.

Note: This only needs to be run the first time you use pyrasgo.

#pyrasgo.register(email='<your email>', password='<your password>')

Enter the email and password you used at registration to connect to Rasgo.

rasgo = pyrasgo.login(email='<your email>', password='<your password>')


Reading the Data

The data comes from Rdatasets and is loaded using the Python package statsmodels.


df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()

This should print a summary resembling this:


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   year            336776 non-null  int64
 1   month           336776 non-null  int64
 2   day             336776 non-null  int64
 3   dep_time        328521 non-null  float64
 4   sched_dep_time  336776 non-null  int64
 5   dep_delay       328521 non-null  float64
 6   arr_time        328063 non-null  float64
 7   sched_arr_time  336776 non-null  int64
 8   arr_delay       327346 non-null  float64
 9   carrier         336776 non-null  object
 10  flight          336776 non-null  int64
 11  tailnum         334264 non-null  object
 12  origin          336776 non-null  object
 13  dest            336776 non-null  object
 14  air_time        327346 non-null  float64
 15  distance        336776 non-null  int64
 16  hour            336776 non-null  int64
 17  minute          336776 non-null  int64
 18  time_hour       336776 non-null  object
dtypes: float64(5), int64(9), object(5)
memory usage: 48.8+ MB


Feature Engineering

Handle Null Values

As this model will predict arrival delay, the null values come from flights that were cancelled or diverted. These can be excluded from this analysis.


df.dropna(inplace=True)
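Before dropping rows, it can help to see how many values are missing in each column. The sketch below uses a tiny stand-in table with invented values rather than the real flights data; the name flights and its contents are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the flights table; the values are illustrative only.
flights = pd.DataFrame({
    "dep_delay": [2.0, np.nan, -1.0, 5.0],
    "arr_delay": [11.0, np.nan, np.nan, 3.0],
    "origin": ["EWR", "JFK", "LGA", "JFK"],
})

# Count missing values per column before dropping anything.
null_counts = flights.isna().sum()
print(null_counts)

# Drop every row with at least one missing value, as the tutorial does,
# but return a new frame so the row counts can be compared.
complete = flights.dropna()
print(f"kept {len(complete)} of {len(flights)} rows")
```

Returning a new frame instead of using inplace=True keeps the original available for this kind of comparison.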

Convert the Times From Floats or Ints to Hours and Minutes


df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))
df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))
df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))
df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))
df.rename(columns={'hour': 'dep_hour',
                   'minute': 'dep_minute'}, inplace=True)
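The same hour and minute split can be written without apply. This is a sketch of an equivalent vectorized form on a small made-up column, assuming the times are stored as HHMM numbers with no missing values (as is the case after dropna).

```python
import pandas as pd

# Illustrative HHMM values stored as floats, like arr_time after dropna.
times = pd.DataFrame({"arr_time": [830.0, 1004.0, 2359.0, 5.0]})

# Cast to int once, then use integer division and remainder:
# 1004 // 100 is the hour (10) and 1004 % 100 is the minute (4).
hhmm = times["arr_time"].astype(int)
times["arr_hour"] = hhmm // 100
times["arr_minute"] = hhmm % 100

print(times)
```

On a table with several hundred thousand rows, column-wise arithmetic like this typically runs much faster than a Python lambda applied row by row.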


Feature Selection

Create Feature Importance Scores

First, pyrasgo needs to generate the feature importance scores that will be used during feature selection. Remove variables that are not of interest to this analysis with the exclude_columns parameter.

The call opens another browser window with the feature importance plot and returns the raw values in response.


target = 'arr_delay'
response = rasgo.evaluate.feature_importance(df, target, exclude_columns=['flight', 'tailnum', 'time_hour', 'year', 'dep_time', 'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'])
response

{'url': 'https://app.rasgoml.com/dataframes/pC-VeymbgIYiA8dXrBNNZm74tx-v_M7Tx4RJW7-YhzA/importance',
 'targetfeature': 'arr_delay',
 'featureImportance': {'month': 0.41391112112245904,
  'day': 0.4117113100028012,
  'carrier': 0.47198646615400314,
  'origin': 0.6857863120481936,
  'dest': 0.47936829505796363,
  'air_time': 1.153266291068306,
  'distance': 0.4903008559394087,
  'dep_hour': 2.1234604116574376,
  'dep_minute': 0.0855805224269881,
  'arr_hour': 11.90745875046809,
  'arr_minute': 3.8448685944462557,
  'sched_arr_hour': 9.329975875959654,
  'sched_arr_minute': 3.767534411785733,
  'sched_dep_hour': 0.54279795251197,
  'sched_dep_minute': 0.7199247971796875}}
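The raw scores are easier to read once sorted. The sketch below copies the featureImportance values from the output above into a local dictionary so it runs on its own; in the notebook, response['featureImportance'] could be used instead.

```python
# Scores copied from the featureImportance output above.
feature_importance = {
    "month": 0.41391112112245904,
    "day": 0.4117113100028012,
    "carrier": 0.47198646615400314,
    "origin": 0.6857863120481936,
    "dest": 0.47936829505796363,
    "air_time": 1.153266291068306,
    "distance": 0.4903008559394087,
    "dep_hour": 2.1234604116574376,
    "dep_minute": 0.0855805224269881,
    "arr_hour": 11.90745875046809,
    "arr_minute": 3.8448685944462557,
    "sched_arr_hour": 9.329975875959654,
    "sched_arr_minute": 3.767534411785733,
    "sched_dep_hour": 0.54279795251197,
    "sched_dep_minute": 0.7199247971796875,
}

# Sort (name, score) pairs from the highest score to the lowest.
ranked = sorted(feature_importance.items(), key=lambda item: item[1], reverse=True)

# Show the five highest-scoring features.
for name, score in ranked[:5]:
    print(f"{name:<18}{score:8.3f}")
```

The four arrival-time features lead the ranking, followed by dep_hour.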


Prune Features

Select the top 5 features based on the feature importance score using the top_n parameter. Alternatively, the top_n_pct parameter keeps the top N percent of features, and the pct_of_top_feature parameter keeps every feature whose importance score is above N percent of the top feature's score.


p_df = rasgo.prune.features(df, target_column=target, top_n=5)
p_df.columns

Prune Method: Keeping top 5 features
Dropped features not in top 5: ['arr_time', 'carrier', 'dest', 'dep_time', 'time_hour', 'dep_hour', 'air_time', 'distance', 'month', 'origin', 'flight', 'day', 'dep_minute', 'sched_dep_hour', 'sched_dep_time', 'tailnum', 'year', 'sched_dep_minute']

Index(['dep_delay', 'sched_arr_time', 'arr_delay', 'arr_hour', 'arr_minute',
       'sched_arr_hour', 'sched_arr_minute'],
      dtype='object')
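The three pruning rules described above can be illustrated in plain Python. This is not pyrasgo's implementation: the helper names are hypothetical, percentages are assumed to be numbers between 0 and 100, and the rounding and tie handling are assumptions. The scores are the featureImportance values from earlier, rounded to two decimals.

```python
import math

# Importance scores rounded from the featureImportance output above.
scores = {
    "month": 0.41, "day": 0.41, "carrier": 0.47, "origin": 0.69,
    "dest": 0.48, "air_time": 1.15, "distance": 0.49, "dep_hour": 2.12,
    "dep_minute": 0.09, "arr_hour": 11.91, "arr_minute": 3.84,
    "sched_arr_hour": 9.33, "sched_arr_minute": 3.77,
    "sched_dep_hour": 0.54, "sched_dep_minute": 0.72,
}


def ranked_names(scores):
    """Feature names ordered from most to least important."""
    return [name for name, _ in
            sorted(scores.items(), key=lambda kv: kv[1], reverse=True)]


def keep_top_n(scores, n):
    """Keep the n highest-scoring features."""
    return ranked_names(scores)[:n]


def keep_top_n_pct(scores, pct):
    """Keep the top pct percent of features (assumption: round up, keep >= 1)."""
    count = max(1, math.ceil(len(scores) * pct / 100))
    return ranked_names(scores)[:count]


def keep_pct_of_top(scores, pct):
    """Keep features scoring at least pct percent of the top feature's score."""
    threshold = max(scores.values()) * pct / 100
    return [name for name in ranked_names(scores) if scores[name] >= threshold]


print(keep_top_n(scores, 5))
print(keep_top_n_pct(scores, 20))
print(keep_pct_of_top(scores, 30))
```

The first two rules fix how many features survive, while the third adapts to the shape of the score distribution: with a 30 percent threshold, only the four arrival-time features remain here.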