This tutorial explains how to create a train-test split with scikit-learn so that machine learning models can be validated on out-of-sample data.
You'll use hourly weather data for 2013 from the weather stations at the New York airports (the origin column), taken from the nycflights13 dataset.
Packages
This tutorial uses statsmodels, pandas and scikit-learn.
Open a new Jupyter notebook and import the following:
import statsmodels.api as sm
import pandas as pd
from sklearn.model_selection import train_test_split
Reading the data
The data comes from the Rdatasets collection and is loaded with the Python package statsmodels.
df = sm.datasets.get_rdataset('weather', 'nycflights13').data
df.info()
RangeIndex: 26115 entries, 0 to 26114
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 origin 26115 non-null object
1 year 26115 non-null int64
2 month 26115 non-null int64
3 day 26115 non-null int64
4 hour 26115 non-null int64
5 temp 26114 non-null float64
6 dewp 26114 non-null float64
7 humid 26114 non-null float64
8 wind_dir 25655 non-null float64
9 wind_speed 26111 non-null float64
10 wind_gust 5337 non-null float64
11 precip 26115 non-null float64
12 pressure 23386 non-null float64
13 visib 26115 non-null float64
14 time_hour 26115 non-null object
dtypes: float64(9), int64(4), object(2)
memory usage: 3.0+ MB
df.origin.unique()
array(['EWR', 'JFK', 'LGA'], dtype=object)
Fix dates
time_hour contains the hour of the observation as a string. Convert it to a datetime column named observation_time. The year, month, day and hour columns duplicate this information and can be dropped from the dataframe.
df['observation_time'] = pd.to_datetime(df.time_hour)
df.drop(columns=['year', 'month', 'day', 'hour', 'time_hour'], inplace=True)
df.head()
origin temp dewp humid wind_dir wind_speed wind_gust precip pressure visib observation_time
0 EWR 39.02 26.06 59.37 270.0 10.35702 NaN 0.0 1012.0 10.0 2013-01-01 01:00:00
1 EWR 39.02 26.96 61.63 250.0 8.05546 NaN 0.0 1012.3 10.0 2013-01-01 02:00:00
2 EWR 39.02 28.04 64.43 240.0 11.50780 NaN 0.0 1012.5 10.0 2013-01-01 03:00:00
3 EWR 39.92 28.04 62.21 250.0 12.65858 NaN 0.0 1012.2 10.0 2013-01-01 04:00:00
4 EWR 39.02 28.04 64.43 260.0 12.65858 NaN 0.0 1011.9 10.0 2013-01-01 05:00:00
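Before dropping the component columns, it can be worth confirming that they really carry the same information as time_hour. The following is a minimal sketch of such a check. It runs on a small synthetic frame (the name sample and its values are made up for illustration) rather than on the real dataset, so it says nothing about whether every row of the actual data matches.

```python
import pandas as pd

# Small synthetic frame that mimics the original columns (illustrative values only).
sample = pd.DataFrame({
    "year": [2013, 2013, 2013],
    "month": [1, 1, 12],
    "day": [1, 1, 30],
    "hour": [1, 2, 18],
    "time_hour": ["2013-01-01 01:00:00", "2013-01-01 02:00:00", "2013-12-30 18:00:00"],
})

# Rebuild a timestamp from the separate component columns...
rebuilt = pd.to_datetime(sample[["year", "month", "day", "hour"]])
# ...and compare it with the parsed string column.
parsed = pd.to_datetime(sample["time_hour"])

all_match = bool((rebuilt == parsed).all())
print(all_match)
```

Running the same comparison on the full dataframe, before the drop, would show whether any rows disagree.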
Train-test splitting
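Hold out 20% of the rows as a test set. By default, train_test_split shuffles the rows before splitting:
train_df, test_df = train_test_split(df, test_size=0.2)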
train_df
origin temp dewp humid wind_dir wind_speed wind_gust precip pressure visib observation_time
9030 JFK 53.06 33.98 48.16 340.0 9.20624 NaN 0.0 1021.6 10.0 2013-01-14 17:00:00
22499 LGA 75.92 64.94 68.78 180.0 10.35702 18.41248 0.0 1014.3 10.0 2013-08-01 09:00:00
6287 EWR 78.08 55.94 46.49 160.0 8.05546 NaN 0.0 1017.0 10.0 2013-09-20 13:00:00
15793 JFK 48.02 33.98 58.07 310.0 13.80936 NaN 0.0 1007.6 10.0 2013-10-23 22:00:00
11971 JFK 64.94 39.92 39.79 300.0 10.35702 21.86482 0.0 1018.3 10.0 2013-05-17 10:00:00
... ... ... ... ... ... ... ... ... ... ... ...
3598 EWR 73.94 64.94 73.49 220.0 6.90468 NaN 0.0 1019.0 10.0 2013-05-31 04:00:00
4973 EWR 80.96 62.96 54.35 170.0 11.50780 19.56326 0.0 1016.6 10.0 2013-07-27 13:00:00
6147 EWR 64.94 46.04 50.32 290.0 16.11092 21.86482 0.0 1017.3 10.0 2013-09-14 17:00:00
15586 JFK 53.06 51.08 92.96 40.0 3.45234 NaN 0.0 1023.8 10.0 2013-10-15 07:00:00
9050 JFK 39.02 24.98 56.77 350.0 9.20624 NaN 0.0 1025.3 10.0 2013-01-15 13:00:00
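Because the rows are shuffled at random, the split above will contain different rows on every run. If a repeatable split is needed, train_test_split accepts a random_state argument. The sketch below illustrates this on a synthetic stand-in dataframe (demo_df and its values are assumptions made for the example, not part of the weather data).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the weather dataframe (100 rows, illustrative values only).
demo_df = pd.DataFrame({
    "origin": ["EWR", "JFK", "LGA", "EWR"] * 25,
    "temp": range(100),
})

# random_state fixes the shuffle, so repeated calls return the same rows.
train_a, test_a = train_test_split(demo_df, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(demo_df, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))
print(train_a.index.equals(train_b.index))
```

With test_size=0.2, 20 of the 100 rows land in the test set and the remaining 80 in the training set, and both calls select the same rows.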
Print the stations and the time range covered by each set:
print("Train:", train_df.origin.unique())
print("Test:", test_df.origin.unique())
print("Train:", train_df.observation_time.min(), train_df.observation_time.max())
print("Test:", test_df.observation_time.min(), test_df.observation_time.max())
Train: ['JFK' 'LGA' 'EWR']
Test: ['LGA' 'JFK' 'EWR']
Train: 2013-01-01 01:00:00 2013-12-30 18:00:00
Test: 2013-01-01 02:00:00 2013-12-30 18:00:00
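Both sets span almost the whole year because the rows were shuffled before splitting. If the goal were instead to test on observations that come strictly after the training period, one possible approach is to sort by time and pass shuffle=False. The sketch below shows this on synthetic hourly data (hourly_df and its values are invented for the example); it is an alternative to the random split used in this tutorial, not a replacement for it.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic hourly observations (illustrative values only).
hourly_df = pd.DataFrame({
    "observation_time": pd.date_range(
        "2013-01-01 01:00", periods=100, freq=pd.Timedelta(hours=1)
    ),
    "temp": range(100),
})

# Sort by time, then split without shuffling:
# the first 80% of rows form the training set, the last 20% the test set.
hourly_df = hourly_df.sort_values("observation_time")
train_part, test_part = train_test_split(hourly_df, test_size=0.2, shuffle=False)

# Every training observation is earlier than every test observation.
print(train_part.observation_time.max() < test_part.observation_time.min())
```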