This tutorial explains how to generate feature importance plots with pyrasgo without needing to build machine learning models. The feature importance is calculated from SHAP values produced by CatBoost.
In this tutorial you will calculate the SHAP feature importance for predicting the arrival delay of flights that departed New York City airports in 2013.
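As background, a common way to turn SHAP values into one score per feature is to average their absolute values over all rows. The sketch below illustrates that convention with a small, made-up SHAP matrix. Whether Rasgo aggregates in exactly this way is an assumption, and the numbers and feature selection here are purely illustrative.

```python
import numpy as np

# Hypothetical SHAP matrix: one row per flight, one column per feature.
# The values are invented for illustration and do not come from Rasgo.
feature_names = ['arr_hour', 'dep_hour', 'dep_minute']
shap_values = np.array([
    [4.0, -1.0, 0.5],
    [-6.0, 2.0, 0.0],
    [2.0, -3.0, -0.5],
])

# Mean absolute SHAP value per feature (assumed aggregation).
importance = np.abs(shap_values).mean(axis=0)

for name, score in zip(feature_names, importance):
    print(f'{name}: {score:.3f}')
```

Taking the absolute value first means that features pushing predictions strongly up for some flights and strongly down for others still receive a high score.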
Packages
This tutorial uses statsmodels, pandas, numpy, and pyrasgo.
Open a new Jupyter notebook and import the following:
import statsmodels.api as sm
import pandas as pd
import numpy as np
import pyrasgo
Connect to Rasgo
If you haven't done so already, head over to https://docs.rasgoml.com/rasgo-docs/onboarding/initial-setup and follow the steps outlined there to create your free account. This account gives you free access to the Rasgo API, which will calculate dataframe profiles, generate feature importance scores, and produce feature explainability for your analysis. In addition, the account lets you keep access to your analyses and share them with your colleagues.
rasgo = pyrasgo.login(email='', password='')
Reading the data
The data comes from Rdatasets and is loaded with the Python package statsmodels.
df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   year            336776 non-null  int64
 1   month           336776 non-null  int64
 2   day             336776 non-null  int64
 3   dep_time        328521 non-null  float64
 4   sched_dep_time  336776 non-null  int64
 5   dep_delay       328521 non-null  float64
 6   arr_time        328063 non-null  float64
 7   sched_arr_time  336776 non-null  int64
 8   arr_delay       327346 non-null  float64
 9   carrier         336776 non-null  object
 10  flight          336776 non-null  int64
 11  tailnum         334264 non-null  object
 12  origin          336776 non-null  object
 13  dest            336776 non-null  object
 14  air_time        327346 non-null  float64
 15  distance        336776 non-null  int64
 16  hour            336776 non-null  int64
 17  minute          336776 non-null  int64
 18  time_hour       336776 non-null  object
dtypes: float64(5), int64(9), object(5)
memory usage: 48.8+ MB
Feature Engineering
Handle null values
df.isnull().sum()
year                 0
month                0
day                  0
dep_time          8255
sched_dep_time       0
dep_delay         8255
arr_time          8713
sched_arr_time       0
arr_delay         9430
carrier              0
flight               0
tailnum           2512
origin               0
dest                 0
air_time          9430
distance             0
hour                 0
minute               0
time_hour            0
dtype: int64
Because this model predicts arrival delay, the null values correspond to flights that were cancelled or diverted. These rows can be excluded from this analysis.
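The exclusion step itself is not shown in this tutorial. A minimal sketch of one way to do it with pandas follows, using a small stand-in dataframe with the same column names. The choice of `required` columns is an assumption; the values in `sample` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the flights dataframe (same column names, made-up rows).
sample = pd.DataFrame({
    'arr_time': [830.0, np.nan, 1004.0, 1215.0],
    'arr_delay': [11.0, np.nan, np.nan, -4.0],
    'air_time': [227.0, np.nan, np.nan, 160.0],
    'tailnum': ['N14228', None, 'N24211', None],
})

# Assumption: a flight is kept only if it has an arrival time, a recorded
# arrival delay, and an air time. A missing tail number alone does not drop it.
required = ['arr_time', 'arr_delay', 'air_time']
cleaned = sample.dropna(subset=required).reset_index(drop=True)

print(cleaned.isnull().sum())
```

On the tutorial's dataframe the equivalent call would be `df = df.dropna(subset=required)`. Removing the rows with a missing `arr_time` also matters for the next step, because `int()` raises an error when given NaN.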
Convert the times, which are stored as HHMM floats or ints, into separate hour and minute columns
df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))
df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))
df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))
df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))
df.rename(columns={'hour': 'dep_hour',
'minute': 'dep_minute'}, inplace=True)
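The conversion above treats each time as an HHMM number, so 1745 becomes hour 17 and minute 45. The same split can be written with integer division and modulo on whole columns, which avoids a Python-level lambda call per row. The sketch below shows this on a few made-up values; `split_hhmm` is a hypothetical helper, not part of pandas or pyrasgo.

```python
import pandas as pd


def split_hhmm(times: pd.Series) -> pd.DataFrame:
    """Split HHMM-encoded clock times (e.g. 1745) into hour and minute columns."""
    as_int = times.astype(int)  # requires that null values were removed first
    return pd.DataFrame({'hour': as_int // 100, 'minute': as_int % 100})


# Made-up example times: 05:17, 17:45, and 00:05.
parts = split_hhmm(pd.Series([517.0, 1745.0, 5.0]))
print(parts)
```

Because the result keeps the index of the input series, its columns can be assigned straight back to the dataframe, for example `df['arr_hour'] = split_hhmm(df.arr_time)['hour']`.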
Feature Importance
Remove variables that are not of interest to this analysis with the exclude_columns parameter.
Calling feature_importance opens another browser window with the feature importance plot and returns the raw values in response.
target = 'arr_delay'
response = rasgo.evaluate.feature_importance(
    df,
    target,
    exclude_columns=['flight', 'tailnum', 'time_hour', 'year', 'dep_time',
                     'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'],
)
response
{'url': 'https://app.rasgoml.com/dataframes/fHMOPgyr_YB3SeIGer-mfILzWBgGNF8Tz9wvoyf4MMs/importance',
'targetfeature': 'arr_delay',
'featureImportance': {'month': 0.41391112112245904,
'day': 0.4117113100028012,
'carrier': 0.47198646615400314,
'origin': 0.6857863120481936,
'dest': 0.47936829505796363,
'air_time': 1.153266291068306,
'distance': 0.4903008559394087,
'dep_hour': 2.1234604116574376,
'dep_minute': 0.0855805224269881,
'arr_hour': 11.90745875046809,
'arr_minute': 3.8448685944462557,
'sched_arr_hour': 9.329975875959654,
'sched_arr_minute': 3.767534411785733,
'sched_dep_hour': 0.54279795251197,
'sched_dep_minute': 0.7199247971796875}}
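The raw values are easier to read once they are sorted. The sketch below ranks a subset of the scores from the response shown above, rounded to three decimals, using a pandas Series. It is one possible way to inspect the result and not part of the Rasgo API.

```python
import pandas as pd

# Subset of the 'featureImportance' values from the response above,
# rounded to three decimals.
feature_importance = {
    'air_time': 1.153,
    'dep_hour': 2.123,
    'dep_minute': 0.086,
    'arr_hour': 11.907,
    'arr_minute': 3.845,
    'sched_arr_hour': 9.330,
    'sched_arr_minute': 3.768,
}

# Sort from most to least important.
ranked = pd.Series(feature_importance).sort_values(ascending=False)
print(ranked)
```

With the real response, the same ranking can be built from `pd.Series(response['featureImportance'])`. In this run the actual and scheduled arrival hours carry by far the largest scores.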