This tutorial explains how to generate a time series split with scikit-learn for out-of-time validation of machine learning models, why that approach may not be what you need, and how to create true time-based splits with pandas.

This tutorial uses hourly weather data for 2013 from the nycflights13 dataset. Each row is an observation from a weather station at one of the New York airports, identified by the origin column.

Packages

This tutorial uses:


import statsmodels.api as sm
import pandas as pd
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

Reading the data

The data comes from the Rdatasets collection and is loaded with the Python package statsmodels.


df = sm.datasets.get_rdataset('weather', 'nycflights13').data
df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26115 entries, 0 to 26114
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   origin      26115 non-null  object 
 1   year        26115 non-null  int64  
 2   month       26115 non-null  int64  
 3   day         26115 non-null  int64  
 4   hour        26115 non-null  int64  
 5   temp        26114 non-null  float64
 6   dewp        26114 non-null  float64
 7   humid       26114 non-null  float64
 8   wind_dir    25655 non-null  float64
 9   wind_speed  26111 non-null  float64
 10  wind_gust   5337 non-null   float64
 11  precip      26115 non-null  float64
 12  pressure    23386 non-null  float64
 13  visib       26115 non-null  float64
 14  time_hour   26115 non-null  object 
dtypes: float64(9), int64(4), object(2)
memory usage: 3.0+ MB

df.origin.unique()

array(['EWR', 'JFK', 'LGA'], dtype=object)

Fix dates

time_hour contains the hour of the observation as a string. Convert it to a datetime column named observation_time. The year, month, day and hour columns duplicate this information and can be dropped from the dataframe, along with time_hour itself.


df['observation_time'] = pd.to_datetime(df.time_hour)
df.drop(columns=['year', 'month', 'day', 'hour', 'time_hour'], inplace=True)
df.head()

	origin	temp	dewp	humid	wind_dir	wind_speed	wind_gust	precip	pressure	visib	observation_time
0	EWR	39.02	26.06	59.37	270.0	10.35702	NaN	0.0	1012.0	10.0	2013-01-01 01:00:00
1	EWR	39.02	26.96	61.63	250.0	8.05546	NaN	0.0	1012.3	10.0	2013-01-01 02:00:00
2	EWR	39.02	28.04	64.43	240.0	11.50780	NaN	0.0	1012.5	10.0	2013-01-01 03:00:00
3	EWR	39.92	28.04	62.21	250.0	12.65858	NaN	0.0	1012.2	10.0	2013-01-01 04:00:00
4	EWR	39.02	28.04	64.43	260.0	12.65858	NaN	0.0	1011.9	10.0	2013-01-01 05:00:00

Time-based splitting

Scikit-learn TimeSeriesSplit

TimeSeriesSplit doesn't implement a true time-based split. Instead, it assumes that the data contains a single series of evenly spaced observations ordered by timestamp. Given that data, each split puts the first n rows into the train set and the following test_size rows into the test set.

Note that this will not work here, because the weather data contains three weather stations (EWR, JFK and LGA) and the rows are grouped by station rather than ordered by time. The data could be re-sorted purely by timestamp, but TimeSeriesSplit would still split on a row count, not on a date or time boundary.
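To make the row-count behaviour concrete, here is a minimal sketch on a toy array of six rows. The array and the variable names are illustrative only and are not part of the weather data. Each fold trains on every row before the test block and tests on the next block of rows, whatever those rows happen to contain.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: six rows standing in for six ordered observations (illustrative only).
toy = np.arange(6).reshape(-1, 1)

tss = TimeSeriesSplit(n_splits=2)
# split() is a generator of (train_indices, test_indices) pairs, one pair per fold.
folds = [(train.tolist(), test.tolist()) for train, test in tss.split(toy)]

for fold_number, (train_idx, test_idx) in enumerate(folds):
    print(f"Fold {fold_number}: train rows {train_idx}, test rows {test_idx}")
```

With six rows and n_splits=2, the first fold trains on rows [0, 1] and tests on rows [2, 3]. The second fold trains on rows [0, 1, 2, 3] and tests on rows [4, 5].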


tss = TimeSeriesSplit(n_splits=2)
# split() yields one (train_indices, test_indices) pair per fold; use the last fold.
splits = list(tss.split(df))
train_split, test_split = splits[1]
print("Train Split:", train_split)
print("Test Split:", test_split)

Train Split: [    0     1     2 ... 17407 17408 17409]
Test Split: [17410 17411 17412 ... 26112 26113 26114]

train_df = df.iloc[train_split, :]
test_df = df.iloc[test_split, :]
print("Train:", train_df.origin.unique())
print("Test:", test_df.origin.unique())
print("Train:", train_df.observation_time.min(), train_df.observation_time.max())
print("Test:", test_df.observation_time.min(), test_df.observation_time.max())

Train: ['EWR' 'JFK' 'LGA']
Test: ['LGA']
Train: 2013-01-01 01:00:00 2013-12-30 18:00:00
Test: 2013-01-01 02:00:00 2013-12-30 18:00:00

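The train and test sets cover almost the same date range, and the test set contains only one station. The data is being split on row position, not on the time value, so the test set is not out of time and cannot be used to conduct this analysis correctly.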
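Sorting by timestamp first does not fully fix this. The sketch below uses a small synthetic frame (three stations with four hourly readings each, invented purely for illustration) to show that a row-count boundary can fall in the middle of a timestamp, so the same hour appears in both train and test.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Synthetic data for illustration: 3 stations x 4 hourly timestamps = 12 rows.
hours = pd.date_range("2013-01-01 01:00", periods=4, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame(
    [(station, hour) for station in ["EWR", "JFK", "LGA"] for hour in hours],
    columns=["origin", "observation_time"],
)

# Order the rows purely by timestamp (station name only breaks ties).
toy = toy.sort_values(["observation_time", "origin"]).reset_index(drop=True)

# Take the last fold of a two-fold TimeSeriesSplit.
train_idx, test_idx = list(TimeSeriesSplit(n_splits=2).split(toy))[-1]
train_times = set(toy["observation_time"].iloc[train_idx])
test_times = set(toy["observation_time"].iloc[test_idx])

# Timestamps that appear on both sides of the split.
shared = sorted(train_times & test_times)
print("Timestamps in both train and test:", shared)
```

In this toy case the boundary falls after row 8 of 12, which is partway through the third hour, so two stations' readings for that hour land in train and the third lands in test.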

Time-based splitting with pandas

Find the first and last observation times


min_date = df.observation_time.min()
max_date = df.observation_time.max()
print("Min:", min_date, "Max:", max_date)

Min: 2013-01-01 01:00:00 Max: 2013-12-30 18:00:00

Calculate the train-test cutoff date


train_percent = .75
time_between = max_date - min_date
train_cutoff = min_date + train_percent*time_between
train_cutoff

Timestamp('2013-09-30 19:45:00')
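The arithmetic behind this cutoff can be checked by hand. The sketch below hard-codes the minimum and maximum timestamps printed earlier rather than loading the data, so it only confirms the calculation.

```python
import pandas as pd

# Minimum and maximum observation times, copied from the printed output.
min_date = pd.Timestamp("2013-01-01 01:00:00")
max_date = pd.Timestamp("2013-12-30 18:00:00")

time_between = max_date - min_date   # 363 days 17:00:00
train_span = 0.75 * time_between     # 75% of the range: 272 days 18:45:00
train_cutoff = min_date + train_span

print(time_between, train_span, train_cutoff, sep="\n")
```

Because the cutoff is a fraction of elapsed time rather than of rows, it can land between observations, here at 19:45 when the data is hourly.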

Create the train and test dataframes


train_df = df[df.observation_time <= train_cutoff]
test_df = df[df.observation_time > train_cutoff]
print("Train:", train_df.origin.unique())
print("Test:", test_df.origin.unique())
print("Train:", train_df.observation_time.min(), train_df.observation_time.max())
print("Test:", test_df.observation_time.min(), test_df.observation_time.max())

Train: ['EWR' 'JFK' 'LGA']
Test: ['EWR' 'JFK' 'LGA']
Train: 2013-01-01 01:00:00 2013-09-30 19:00:00
Test: 2013-09-30 20:00:00 2013-12-30 18:00:00

The train and test datasets now contain all of the observation sites with no overlap in dates. These can now be used as the train and test sets in machine learning model training.
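Before training, it can be worth asserting the two properties just described. The helper below is a hypothetical convenience (check_time_split is not a pandas or scikit-learn function), and it is shown on a small synthetic frame rather than the weather data. The default column names are assumed to match this tutorial.

```python
import pandas as pd


def check_time_split(train_df, test_df, time_col="observation_time", group_col="origin"):
    """Hypothetical helper: fail unless test is strictly later than train
    and both sets contain the same groups."""
    assert train_df[time_col].max() < test_df[time_col].min(), "train and test overlap in time"
    assert set(train_df[group_col]) == set(test_df[group_col]), "a group is missing from one set"
    return True


# Synthetic example: 2 stations x 4 hourly readings, split after the second hour.
hours = pd.date_range("2013-01-01 01:00", periods=4, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame(
    [(station, hour) for station in ["EWR", "JFK"] for hour in hours],
    columns=["origin", "observation_time"],
)
cutoff = hours[1]
ok = check_time_split(toy[toy.observation_time <= cutoff], toy[toy.observation_time > cutoff])
print(ok)
```

The same call on a split that overlaps in time raises an AssertionError instead of returning True.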

Alternatively, specify the number of days in the training set


days_between = time_between / np.timedelta64(1, 'D')
days_between

363.7083333333333


train_days = 273
train_cutoff = min_date + pd.DateOffset(days=train_days)

train_df = df[df.observation_time <= train_cutoff]
test_df = df[df.observation_time > train_cutoff]
print("Train:", train_df.origin.unique())
print("Test:", test_df.origin.unique())
print("Train:", train_df.observation_time.min(), train_df.observation_time.max())
print("Test:", test_df.observation_time.min(), test_df.observation_time.max())

Train: ['EWR' 'JFK' 'LGA']
Test: ['EWR' 'JFK' 'LGA']
Train: 2013-01-01 01:00:00 2013-10-01 01:00:00
Test: 2013-10-01 02:00:00 2013-12-30 18:00:00
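Both variants follow the same pattern, so they could be wrapped in one function. The sketch below is one possible way to do that, not an established API: time_split and its parameters are hypothetical names, and it uses pd.Timedelta for the day-based cutoff, which should give the same result as pd.DateOffset for whole days. It is demonstrated on ten synthetic daily rows.

```python
import pandas as pd


def time_split(df, time_col, train_fraction=None, train_days=None):
    """Hypothetical helper: return (train, test, cutoff) for a cutoff on time_col.

    Pass exactly one of train_fraction (share of the total time range)
    or train_days (number of days after the earliest timestamp).
    """
    if (train_fraction is None) == (train_days is None):
        raise ValueError("pass exactly one of train_fraction or train_days")
    start, end = df[time_col].min(), df[time_col].max()
    if train_fraction is not None:
        cutoff = start + train_fraction * (end - start)
    else:
        cutoff = start + pd.Timedelta(days=train_days)
    # Rows at or before the cutoff go to train, later rows go to test.
    return df[df[time_col] <= cutoff], df[df[time_col] > cutoff], cutoff


# Synthetic data for illustration: one row per day for ten days.
toy = pd.DataFrame(
    {"observation_time": pd.date_range("2013-01-01", periods=10, freq=pd.Timedelta(days=1))}
)

train_a, test_a, cutoff_a = time_split(toy, "observation_time", train_fraction=0.75)
train_b, test_b, cutoff_b = time_split(toy, "observation_time", train_days=5)
print(cutoff_a, len(train_a), len(test_a))
print(cutoff_b, len(train_b), len(test_b))
```

Returning the cutoff alongside the two frames makes it easy to log or reuse the boundary, for example to apply the same split to another table keyed on the same timestamps.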