This tutorial explains how to use PyCaret to build multiple models and compare them to pick the best model for your data.

During this tutorial you will build and evaluate a model to predict arrival delay for flights in and out of NYC in 2013.

Packages

This tutorial uses statsmodels, pandas, NumPy, and PyCaret.

Open up a new Jupyter Notebook and import the following:


import statsmodels.api as sm
import pandas as pd
import numpy as np
import pycaret.regression as pycr
import pycaret.utils as pycu

Reading the data

The data comes from the Rdatasets collection and is loaded with the Python package statsmodels.


df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()


RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   year            336776 non-null  int64  
 1   month           336776 non-null  int64  
 2   day             336776 non-null  int64  
 3   dep_time        328521 non-null  float64
 4   sched_dep_time  336776 non-null  int64  
 5   dep_delay       328521 non-null  float64
 6   arr_time        328063 non-null  float64
 7   sched_arr_time  336776 non-null  int64  
 8   arr_delay       327346 non-null  float64
 9   carrier         336776 non-null  object 
 10  flight          336776 non-null  int64  
 11  tailnum         334264 non-null  object 
 12  origin          336776 non-null  object 
 13  dest            336776 non-null  object 
 14  air_time        327346 non-null  float64
 15  distance        336776 non-null  int64  
 16  hour            336776 non-null  int64  
 17  minute          336776 non-null  int64  
 18  time_hour       336776 non-null  object 
dtypes: float64(5), int64(9), object(5)
memory usage: 48.8+ MB

Feature Engineering

Handle null values


df.isnull().sum()

year                 0
month                0
day                  0
dep_time          8255
sched_dep_time       0
dep_delay         8255
arr_time          8713
sched_arr_time       0
arr_delay         9430
carrier              0
flight               0
tailnum           2512
origin               0
dest                 0
air_time          9430
distance             0
hour                 0
minute               0
time_hour            0
dtype: int64

As this model will predict arrival delay, rows with null values correspond to flights that were cancelled or diverted. These can be excluded from this analysis.


df.dropna(inplace=True)

Convert the times, which are stored as floats or ints in HHMM form, into separate hour and minute columns


df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))
df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))
df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))
df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))
df.rename(columns={'hour': 'dep_hour',
                   'minute': 'dep_minute'}, inplace=True)
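
The split above relies on each time being stored as an HHMM number (for example, 1545.0 means 15:45). As a minimal sketch of the same arithmetic in plain Python, divmod by 100 returns both parts at once. The helper name split_hhmm is made up for this illustration and is not part of the tutorial's code.

```python
def split_hhmm(value):
    """Split an HHMM-encoded time such as 1545.0 into (hour, minute)."""
    # Integer-divide by 100 for the hour; the remainder is the minute.
    hour, minute = divmod(int(value), 100)
    return hour, minute


print(split_hhmm(1545.0))  # (15, 45)
print(split_hhmm(5))       # (0, 5)
print(split_hhmm(2400))    # (24, 0)
```

A value of 2400 yields hour 24, so if the data encodes midnight that way it is worth checking before treating the hour as a 0-23 feature.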
                   


As PyCaret can use large amounts of memory, we will randomly select 100,000 rows for this comparison, reserving the remaining rows as a test set.


dftrain = df.sample(n=100000, random_state=1066)
dftest = df.drop(dftrain.index)
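
The sample-then-drop pattern above produces two disjoint sets that together cover every row. The following is a minimal sketch of the same idea on plain row indices, using only the standard library. The sizes are shrunk for illustration, and the helper split_indices is hypothetical rather than anything pandas or PyCaret provides.

```python
import random


def split_indices(n_rows, n_train, seed):
    """Return (train, test) index lists: a seeded random sample and its complement."""
    rng = random.Random(seed)                   # fixed seed makes the split repeatable
    train = rng.sample(range(n_rows), n_train)  # sampling without replacement
    train_set = set(train)
    # Everything that was not sampled becomes the test set.
    test = [i for i in range(n_rows) if i not in train_set]
    return train, test


train_idx, test_idx = split_indices(n_rows=1000, n_train=300, seed=1066)
print(len(train_idx), len(test_idx))  # 300 700
```

Because the test rows are never seen during model comparison or tuning, they give an independent check at the end of the tutorial.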

Fit the models

Set up the PyCaret environment. session_id is equivalent to random_state in scikit-learn and allows the experiment to be repeated.


pycaret_experiment = pycr.setup(
    data=dftrain,
    target="arr_delay",
    session_id=1066,
    ignore_features=['flight', 'tailnum', 'time_hour', 'year', 'dep_time',
                     'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'],
)
                                

	Description	Value
0	session_id	1066
1	Target	arr_delay
2	Original Data	(100000, 25)
3	Missing Values	False
4	Numeric Features	9
5	Categorical Features	6
6	Ordinal Features	False
7	High Cardinality Features	False
8	High Cardinality Method	None
9	Transformed Train Set	(69999, 162)
10	Transformed Test Set	(30001, 162)
11	Shuffle Train-Test	True
12	Stratify Train-Test	False
13	Fold Generator	KFold
14	Fold Number	10
15	CPU Jobs	-1
16	Use GPU	False
17	Log Experiment	False
18	Experiment Name	reg-default-name
19	USI	c72b
20	Imputation Type	simple
21	Iterative Imputation Iteration	None
22	Numeric Imputer	mean
23	Iterative Imputation Numeric Model	None
24	Categorical Imputer	constant
25	Iterative Imputation Categorical Model	None
26	Unknown Categoricals Handling	least_frequent
27	Normalize	False
28	Normalize Method	None
29	Transformation	False
30	Transformation Method	None
31	PCA	False
32	PCA Method	None
33	PCA Components	None
34	Ignore Low Variance	False
35	Combine Rare Levels	False
36	Rare Level Threshold	None
37	Numeric Binning	False
38	Remove Outliers	False
39	Outliers Threshold	None
40	Remove Multicollinearity	False
41	Multicollinearity Threshold	None
42	Clustering	False
43	Clustering Iteration	None
44	Polynomial Features	False
45	Polynomial Degree	None
46	Trignometry Features	False
47	Polynomial Threshold	None
48	Group Features	False
49	Feature Selection	False
50	Feature Selection Method	classic
51	Features Selection Threshold	None
52	Feature Interaction	False
53	Feature Ratio	False
54	Interaction Threshold	None
55	Transform Target	False
56	Transform Target Method	box-cox

Calling compare_models trains a range of regression models (18 in this run) with cross-validation and shows their MAE, MSE, RMSE, R^2, RMSLE, and MAPE. It also highlights the best-performing model on each of those metrics. Here the results are sorted by RMSE.


best = pycr.compare_models(sort='RMSE')

	Model	MAE	MSE	RMSE	R2	RMSLE	MAPE	TT (Sec)
catboost	CatBoost Regressor	1.9788	36.7907	5.5109	0.9818	0.2145	0.1533	4.0700
rf	Random Forest Regressor	1.0965	38.4911	5.6290	0.9809	0.1149	0.0655	25.0300
et	Extra Trees Regressor	1.3357	42.8113	6.1431	0.9788	0.1510	0.0817	35.5990
xgboost	Extreme Gradient Boosting	3.2131	40.0363	6.1862	0.9800	0.3423	0.2805	17.1310
dt	Decision Tree Regressor	1.7987	54.8896	6.8635	0.9728	0.2031	0.1283	0.3770
lightgbm	Light Gradient Boosting Machine	6.2720	165.4157	12.5736	0.9176	0.5248	0.4355	1.2810
gbr	Gradient Boosting Regressor	14.3364	561.7050	23.6655	0.7183	1.1623	0.8105	5.7960
br	Bayesian Ridge	25.1405	1707.8873	41.3148	0.1422	1.3191	1.8122	0.9350
ridge	Ridge Regression	25.1445	1708.1058	41.3174	0.1421	1.3159	1.8226	0.0650
lr	Linear Regression	25.1495	1708.5206	41.3225	0.1418	1.3159	1.8245	0.4370
knn	K Neighbors Regressor	21.6627	1763.1355	41.9758	0.1145	1.1372	1.4348	0.8130
en	Elastic Net	26.3483	1830.8023	42.7765	0.0804	1.4243	1.6788	1.6150
lasso	Lasso Regression	26.3723	1831.8645	42.7890	0.0799	1.4279	1.6800	1.3760
omp	Orthogonal Matching Pursuit	26.8935	1840.2330	42.8878	0.0755	1.3864	1.6620	0.0680
huber	Huber Regressor	24.3238	1938.0253	44.0110	0.0265	1.5449	1.2221	5.4420
llar	Lasso Least Angle Regression	27.7529	1990.9916	44.6098	-0.0001	1.2494	1.4905	0.3390
par	Passive Aggressive Regressor	36.9246	3204.7370	54.3478	-0.6131	1.4722	3.3425	0.2210
ada	AdaBoost Regressor	74.5467	7082.7512	82.5292	-2.5762	1.8155	9.5337	9.5170
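
To make the column headings concrete, here is a minimal sketch of the textbook definitions of MAE, MSE, RMSE, and R^2 in plain Python. The function name regression_metrics and the toy values are invented for illustration, and PyCaret's own implementations may differ in detail. RMSLE and MAPE are left out because they need special handling when the target is zero or negative.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MSE, RMSE and R^2 from two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n        # mean absolute error
    mse = sum(e * e for e in errors) / n         # mean squared error
    rmse = math.sqrt(mse)                        # same units as the target
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot                     # share of variance explained
    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "R2": r2}


# Toy delays in minutes (made up for this example).
metrics = regression_metrics([3.0, -1.0, 2.0, 7.0], [2.5, 0.0, 2.0, 8.0])
print(metrics["MAE"], metrics["MSE"], metrics["RMSE"])  # 0.625 0.5625 0.75
print(round(metrics["R2"], 4))                          # 0.9313
```

RMSE penalises large errors more heavily than MAE, which is one reason the two metrics can rank models differently, as rf and catboost show in the table above.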

Calling create_model with one of the model IDs above creates that model, which can then be used like any other model.


catboost = pycr.create_model('catboost')

	MAE	MSE	RMSE	R2	RMSLE	MAPE
0	2.0394	95.0439	9.7490	0.9536	0.2097	0.1461
1	1.9653	13.1927	3.6322	0.9936	0.2171	0.1557
2	2.0037	12.6137	3.5516	0.9933	0.2239	0.1662
3	2.0038	17.1477	4.1410	0.9914	0.2165	0.1585
4	1.8106	9.8252	3.1345	0.9948	0.2090	0.1422
5	1.9573	37.8786	6.1546	0.9814	0.2032	0.1530
6	2.0721	93.5713	9.6732	0.9533	0.2080	0.1536
7	1.9391	15.2617	3.9066	0.9926	0.2142	0.1474
8	2.0233	62.9052	7.9313	0.9699	0.2108	0.1473
9	1.9735	10.4671	3.2353	0.9942	0.2328	0.1628
Mean	1.9788	36.7907	5.5109	0.9818	0.2145	0.1533
SD	0.0679	32.7283	2.5338	0.0160	0.0082	0.0073
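
The Mean and SD rows summarise the ten fold scores. As a quick check, this sketch recomputes them for the RMSE column using the standard library. The fold values are copied from the table above. The conclusion about which standard deviation is used is an inference from these numbers, not a statement from PyCaret's documentation.

```python
import statistics

# RMSE for folds 0-9, copied from the create_model output above.
fold_rmse = [9.7490, 3.6322, 3.5516, 4.1410, 3.1345,
             6.1546, 9.6732, 3.9066, 7.9313, 3.2353]

mean_rmse = statistics.mean(fold_rmse)
sd_population = statistics.pstdev(fold_rmse)  # divides by n
sd_sample = statistics.stdev(fold_rmse)       # divides by n - 1

print(round(mean_rmse, 4))
print(round(sd_population, 4))
print(round(sd_sample, 4))
```

The mean reproduces the table's 5.5109, and the population standard deviation agrees with the reported 2.5338 to within rounding, while the sample standard deviation is noticeably larger (about 2.67). The fold-to-fold spread is large relative to the mean, so small differences between models in the comparison table should be read with some caution.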

Evaluate this model on the hold-out sample


predict_result = pycr.predict_model(catboost)

	Model	MAE	MSE	RMSE	R2	RMSLE	MAPE
0	CatBoost Regressor	1.9158	30.7283	5.5433	0.9849	0.2069	0.1498

This model is built using the default hyperparameters. tune_model searches for better hyperparameters and returns the tuned model.


catboost = pycr.tune_model(catboost)

	MAE	MSE	RMSE	R2	RMSLE	MAPE
0	2.3507	116.4995	10.7935	0.9431	0.2424	0.1784
1	2.5108	22.7701	4.7718	0.9890	0.2792	0.2172
2	2.2563	26.8256	5.1793	0.9858	0.2707	0.1974
3	2.4275	47.8128	6.9147	0.9761	0.2687	0.2114
4	2.2783	13.9257	3.7317	0.9927	0.2747	0.2036
5	2.3571	50.3552	7.0961	0.9752	0.2491	0.1948
6	2.5718	173.7631	13.1819	0.9133	0.2706	0.2126
7	2.4011	53.7416	7.3309	0.9740	0.2814	0.2061
8	2.6607	53.3262	7.3025	0.9744	0.2780	0.2101
9	2.2036	14.4401	3.8000	0.9920	0.2585	0.1939
Mean	2.4018	57.3460	7.0102	0.9716	0.2673	0.2025
SD	0.1374	48.0356	2.8640	0.0237	0.0125	0.0110

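Evaluate the tuned model on the hold-out sample. In this run the tuned model's cross-validated RMSE (7.0102) is worse than the default model's (5.5109), so tuning did not help here.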


tuned_result = pycr.predict_model(catboost)

	Model	MAE	MSE	RMSE	R2	RMSLE	MAPE
0	CatBoost Regressor	2.3154	48.6610	6.9757	0.9761	0.2608	0.1978

Finalize the model for deployment by fitting it on all of the data, including the hold-out sample.


final_catboost = pycr.finalize_model(catboost)

Use this final model to predict on the observations not sampled above


predictions = pycr.predict_model(final_catboost, data=dftest)

Check the R^2 for these predictions


pycu.check_metric(predictions.arr_delay, predictions.Label, 'R2')

0.9822