This tutorial explains how to use PyCaret to build multiple models and compare them to pick the best model for your data.
During this tutorial you will build and evaluate a model to predict arrival delay for flights that departed NYC in 2013.
Packages
This tutorial uses: statsmodels, pandas, numpy, and pycaret.
Open up a new Jupyter Notebook and import the following:
import statsmodels.api as sm
import pandas as pd
import numpy as np
import pycaret.regression as pycr
import pycaret.utils as pycu
Reading the data
The data is from rdatasets imported using the Python package statsmodels.
df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 336776 non-null int64
1 month 336776 non-null int64
2 day 336776 non-null int64
3 dep_time 328521 non-null float64
4 sched_dep_time 336776 non-null int64
5 dep_delay 328521 non-null float64
6 arr_time 328063 non-null float64
7 sched_arr_time 336776 non-null int64
8 arr_delay 327346 non-null float64
9 carrier 336776 non-null object
10 flight 336776 non-null int64
11 tailnum 334264 non-null object
12 origin 336776 non-null object
13 dest 336776 non-null object
14 air_time 327346 non-null float64
15 distance 336776 non-null int64
16 hour 336776 non-null int64
17 minute 336776 non-null int64
18 time_hour 336776 non-null object
dtypes: float64(5), int64(9), object(5)
memory usage: 48.8+ MB
Feature Engineering
Handle null values
df.isnull().sum()
year 0
month 0
day 0
dep_time 8255
sched_dep_time 0
dep_delay 8255
arr_time 8713
sched_arr_time 0
arr_delay 9430
carrier 0
flight 0
tailnum 2512
origin 0
dest 0
air_time 9430
distance 0
hour 0
minute 0
time_hour 0
dtype: int64
As this model will predict arrival delay, the null values come from flights that were cancelled or diverted. These can be excluded from this analysis.
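A minimal sketch of excluding those rows, using a toy frame rather than the full flights data (the column names match the dataset; `dropna` on the affected columns removes the cancelled/diverted flights):

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the flights columns; NaN marks a cancelled/diverted flight
toy = pd.DataFrame({
    'arr_time':  [830.0, np.nan, 1015.0],
    'arr_delay': [11.0,  np.nan, -5.0],
    'distance':  [1400, 1416, 1089],
})

# Drop rows where the target or its time columns are missing
toy = toy.dropna(subset=['arr_time', 'arr_delay'])
print(len(toy))  # 2
```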
Convert the times from floats or ints to hours and minutes
df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))
df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))
df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))
df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))
df.rename(columns={'hour': 'dep_hour',
'minute': 'dep_minute'}, inplace=True)
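The apply/lambda splits above can also be written as a single vectorized `divmod`, which is much faster on 300,000+ rows; a sketch on a small series of HHMM-encoded times:

```python
import pandas as pd

times = pd.Series([830, 1015, 2359])  # HHMM-encoded integer times

# Vectorized split: floor-divide gives the hour, the remainder gives the minute
hours, minutes = divmod(times, 100)
print(hours.tolist())    # [8, 10, 23]
print(minutes.tolist())  # [30, 15, 59]
```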
As PyCaret can use large amounts of memory, we will randomly select 100,000 rows for this comparison, reserving the remaining rows as a test set.
dftrain = df.sample(n=100000, random_state=1066)
dftest = df.drop(dftrain.index)
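The `sample(n=...)` plus `drop(train.index)` pattern partitions the rows into disjoint sets; a quick sanity check on a toy frame:

```python
import pandas as pd

toy = pd.DataFrame({'x': range(10)})
train = toy.sample(n=7, random_state=1066)
test = toy.drop(train.index)

# The two sets partition the original rows with no overlap
print(len(train), len(test))                       # 7 3
print(train.index.intersection(test.index).empty)  # True
```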
Fit the models
Set up the PyCaret environment. The session_id is equivalent to random_state in scikit-learn and allows the experiment to be reproduced.
pycaret_experiment = pycr.setup(data=dftrain, target="arr_delay", session_id=1066,
ignore_features=['flight', 'tailnum', 'time_hour', 'year', 'dep_time', 'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'])
Description Value
0 session_id 1066
1 Target arr_delay
2 Original Data (100000, 25)
3 Missing Values False
4 Numeric Features 9
5 Categorical Features 6
6 Ordinal Features False
7 High Cardinality Features False
8 High Cardinality Method None
9 Transformed Train Set (69999, 162)
10 Transformed Test Set (30001, 162)
11 Shuffle Train-Test True
12 Stratify Train-Test False
13 Fold Generator KFold
14 Fold Number 10
15 CPU Jobs -1
16 Use GPU False
17 Log Experiment False
18 Experiment Name reg-default-name
19 USI c72b
20 Imputation Type simple
21 Iterative Imputation Iteration None
22 Numeric Imputer mean
23 Iterative Imputation Numeric Model None
24 Categorical Imputer constant
25 Iterative Imputation Categorical Model None
26 Unknown Categoricals Handling least_frequent
27 Normalize False
28 Normalize Method None
29 Transformation False
30 Transformation Method None
31 PCA False
32 PCA Method None
33 PCA Components None
34 Ignore Low Variance False
35 Combine Rare Levels False
36 Rare Level Threshold None
37 Numeric Binning False
38 Remove Outliers False
39 Outliers Threshold None
40 Remove Multicollinearity False
41 Multicollinearity Threshold None
42 Clustering False
43 Clustering Iteration None
44 Polynomial Features False
45 Polynomial Degree None
46 Trignometry Features False
47 Polynomial Threshold None
48 Group Features False
49 Feature Selection False
50 Feature Selection Method classic
51 Features Selection Threshold None
52 Feature Interaction False
53 Feature Ratio False
54 Interaction Threshold None
55 Transform Target False
56 Transform Target Method box-cox
Calling compare_models will train about 20 models and report their MAE, MSE, RMSE, R^2, RMSLE, and MAPE. It will also highlight the best-performing model on each of those metrics.
best = pycr.compare_models(sort='RMSE')
Model MAE MSE RMSE R2 RMSLE MAPE TT (Sec)
catboost CatBoost Regressor 1.9788 36.7907 5.5109 0.9818 0.2145 0.1533 4.0700
rf Random Forest Regressor 1.0965 38.4911 5.6290 0.9809 0.1149 0.0655 25.0300
et Extra Trees Regressor 1.3357 42.8113 6.1431 0.9788 0.1510 0.0817 35.5990
xgboost Extreme Gradient Boosting 3.2131 40.0363 6.1862 0.9800 0.3423 0.2805 17.1310
dt Decision Tree Regressor 1.7987 54.8896 6.8635 0.9728 0.2031 0.1283 0.3770
lightgbm Light Gradient Boosting Machine 6.2720 165.4157 12.5736 0.9176 0.5248 0.4355 1.2810
gbr Gradient Boosting Regressor 14.3364 561.7050 23.6655 0.7183 1.1623 0.8105 5.7960
br Bayesian Ridge 25.1405 1707.8873 41.3148 0.1422 1.3191 1.8122 0.9350
ridge Ridge Regression 25.1445 1708.1058 41.3174 0.1421 1.3159 1.8226 0.0650
lr Linear Regression 25.1495 1708.5206 41.3225 0.1418 1.3159 1.8245 0.4370
knn K Neighbors Regressor 21.6627 1763.1355 41.9758 0.1145 1.1372 1.4348 0.8130
en Elastic Net 26.3483 1830.8023 42.7765 0.0804 1.4243 1.6788 1.6150
lasso Lasso Regression 26.3723 1831.8645 42.7890 0.0799 1.4279 1.6800 1.3760
omp Orthogonal Matching Pursuit 26.8935 1840.2330 42.8878 0.0755 1.3864 1.6620 0.0680
huber Huber Regressor 24.3238 1938.0253 44.0110 0.0265 1.5449 1.2221 5.4420
llar Lasso Least Angle Regression 27.7529 1990.9916 44.6098 -0.0001 1.2494 1.4905 0.3390
par Passive Aggressive Regressor 36.9246 3204.7370 54.3478 -0.6131 1.4722 3.3425 0.2210
ada AdaBoost Regressor 74.5467 7082.7512 82.5292 -2.5762 1.8155 9.5337 9.5170
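The headline metrics in the grid are simple functions of the residuals; a sketch computing MAE and RMSE by hand on toy values (MSE, RMSLE, and MAPE follow the same pattern):

```python
import numpy as np

y_true = np.array([10.0, -5.0, 30.0, 0.0])  # toy actual delays
y_pred = np.array([12.0, -4.0, 25.0, 1.0])  # toy predicted delays

err = y_true - y_pred
mae = np.abs(err).mean()           # mean absolute error
rmse = np.sqrt((err ** 2).mean())  # root mean squared error
print(round(mae, 2), round(rmse, 3))  # 2.25 2.784
```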
Calling create_model with one of the model IDs above will create that model, which can then be used like any other model.
catboost = pycr.create_model('catboost')
MAE MSE RMSE R2 RMSLE MAPE
0 2.0394 95.0439 9.7490 0.9536 0.2097 0.1461
1 1.9653 13.1927 3.6322 0.9936 0.2171 0.1557
2 2.0037 12.6137 3.5516 0.9933 0.2239 0.1662
3 2.0038 17.1477 4.1410 0.9914 0.2165 0.1585
4 1.8106 9.8252 3.1345 0.9948 0.2090 0.1422
5 1.9573 37.8786 6.1546 0.9814 0.2032 0.1530
6 2.0721 93.5713 9.6732 0.9533 0.2080 0.1536
7 1.9391 15.2617 3.9066 0.9926 0.2142 0.1474
8 2.0233 62.9052 7.9313 0.9699 0.2108 0.1473
9 1.9735 10.4671 3.2353 0.9942 0.2328 0.1628
Mean 1.9788 36.7907 5.5109 0.9818 0.2145 0.1533
SD 0.0679 32.7283 2.5338 0.0160 0.0082 0.0073
Evaluate this model on the hold-out sample
predict_result = pycr.predict_model(catboost)
Model MAE MSE RMSE R2 RMSLE MAPE
0 CatBoost Regressor 1.9158 30.7283 5.5433 0.9849 0.2069 0.1498
This model was built using the default hyperparameters. A model with tuned hyperparameters can be found using tune_model.
catboost = pycr.tune_model(catboost)
MAE MSE RMSE R2 RMSLE MAPE
0 2.3507 116.4995 10.7935 0.9431 0.2424 0.1784
1 2.5108 22.7701 4.7718 0.9890 0.2792 0.2172
2 2.2563 26.8256 5.1793 0.9858 0.2707 0.1974
3 2.4275 47.8128 6.9147 0.9761 0.2687 0.2114
4 2.2783 13.9257 3.7317 0.9927 0.2747 0.2036
5 2.3571 50.3552 7.0961 0.9752 0.2491 0.1948
6 2.5718 173.7631 13.1819 0.9133 0.2706 0.2126
7 2.4011 53.7416 7.3309 0.9740 0.2814 0.2061
8 2.6607 53.3262 7.3025 0.9744 0.2780 0.2101
9 2.2036 14.4401 3.8000 0.9920 0.2585 0.1939
Mean 2.4018 57.3460 7.0102 0.9716 0.2673 0.2025
SD 0.1374 48.0356 2.8640 0.0237 0.0125 0.0110
Evaluate this model on the hold-out sample
tuned_result = pycr.predict_model(catboost)
Model MAE MSE RMSE R2 RMSLE MAPE
0 CatBoost Regressor 2.3154 48.6610 6.9757 0.9761 0.2608 0.1978
Finalize the model for deployment by fitting it on all of the data, including the hold-out sample.
final_catboost = pycr.finalize_model(catboost)
Use this final model to predict on the observations not sampled above.
predictions = pycr.predict_model(final_catboost, data=dftest)
Check the R^2 for these predictions
pycu.check_metric(predictions.arr_delay, predictions.Label, 'R2')
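check_metric computes the standard coefficient of determination; the same R^2 can be calculated directly from any pair of actual and predicted arrays. A sketch on toy values (not the predictions above):

```python
import numpy as np

actual = np.array([3.0, -0.5, 2.0, 7.0])
predicted = np.array([2.5, 0.0, 2.0, 8.0])

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = ((actual - predicted) ** 2).sum()
ss_tot = ((actual - actual.mean()) ** 2).sum()
r2 = 1 - ss_res / ss_tot
print(round(r2, 4))  # 0.9486
```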