Feature Profiling with PyRasgo
This notebook explains how to use pyrasgo to create feature profiles of a pandas dataframe.
Packages
This tutorial uses:
Open up a Jupyter Notebook and enter the following:
import statsmodels.api as sm
import pandas as pd
import numpy as np
import pyrasgo
Connect to Rasgo
If you haven't done so already, head over to https://docs.rasgoml.com/rasgo-docs/onboarding/initial-setup and follow the steps outlined there to create your free account. This account gives you free access to the Rasgo API which will calculate dataframe profiles, generate feature importance score, and produce feature explainability for you analysis. In addition, this account allows you to maintain access to your analysis and share with your colleagues.
rasgo = pyrasgo.login(email='', password='')
Reading the data
The data is from rdatasets imported using the Python package statsmodels.
df = sm.datasets.get_rdataset('flights', 'nycflights13').data
Feature Engineering
Convert the times from floats or ints to hour and minutes
Convert some of the fields into more meaningful fields to better understand the time flights depart and arrive. Next the original fields are dropped as they are now redundant.
df.dropna(inplace=True)
df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))
df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))
df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))
df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))
df.rename(columns={'hour': 'dep_hour',
'minute': 'dep_minute'}, inplace=True)
df.drop(columns=['time_hour', 'dep_time', 'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'], inplace=True)
Profile Features
response = rasgo.evaluate.profile(df)
response