This tutorial explains how to use pyrasgo to split a time series into training and test sets, which allows out-of-time validation of machine learning models.

You'll use hourly weather data from 2013, recorded at several weather stations (the origin column) that serve flights from New York airports.

Packages

This tutorial uses statsmodels, pandas, and pyrasgo.

Open up a Jupyter notebook and import the following:


import statsmodels.api as sm
import pandas as pd
import pyrasgo

Connect to Rasgo

If you haven't done so already, head over to https://docs.rasgoml.com/rasgo-docs/onboarding/initial-setup and follow the steps outlined there to create your free account. This account gives you free access to the Rasgo API, which will calculate dataframe profiles, generate feature importance scores, and produce feature explainability for your analysis. In addition, the account lets you keep access to your analysis and share it with your colleagues.


rasgo = pyrasgo.login(email='', password='')

Reading the data

The data comes from the Rdatasets collection and is loaded with the Python package statsmodels.


df = sm.datasets.get_rdataset('weather', 'nycflights13').data
df.info()


RangeIndex: 26115 entries, 0 to 26114
Data columns (total 15 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   origin      26115 non-null  object 
 1   year        26115 non-null  int64  
 2   month       26115 non-null  int64  
 3   day         26115 non-null  int64  
 4   hour        26115 non-null  int64  
 5   temp        26114 non-null  float64
 6   dewp        26114 non-null  float64
 7   humid       26114 non-null  float64
 8   wind_dir    25655 non-null  float64
 9   wind_speed  26111 non-null  float64
 10  wind_gust   5337 non-null   float64
 11  precip      26115 non-null  float64
 12  pressure    23386 non-null  float64
 13  visib       26115 non-null  float64
 14  time_hour   26115 non-null  object 
dtypes: float64(9), int64(4), object(2)
memory usage: 3.0+ MB

df.origin.unique()

array(['EWR', 'JFK', 'LGA'], dtype=object)

Fix dates

time_hour stores the timestamp of each observation as a string. Convert it to a datetime column named observation_time. The year, month, day and hour columns repeat the same information, so they can be dropped from the dataframe along with time_hour.


df['observation_time'] = pd.to_datetime(df.time_hour)
df.drop(columns=['year', 'month', 'day', 'hour', 'time_hour'], inplace=True)
df.head()

	origin	temp	dewp	humid	wind_dir	wind_speed	wind_gust	precip	pressure	visib	observation_time
0	EWR	39.02	26.06	59.37	270.0	10.35702	NaN	0.0	1012.0	10.0	2013-01-01 01:00:00
1	EWR	39.02	26.96	61.63	250.0	8.05546	NaN	0.0	1012.3	10.0	2013-01-01 02:00:00
2	EWR	39.02	28.04	64.43	240.0	11.50780	NaN	0.0	1012.5	10.0	2013-01-01 03:00:00
3	EWR	39.92	28.04	62.21	250.0	12.65858	NaN	0.0	1012.2	10.0	2013-01-01 04:00:00
4	EWR	39.02	28.04	64.43	260.0	12.65858	NaN	0.0	1011.9	10.0	2013-01-01 05:00:00
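
The claim that year, month, day and hour duplicate time_hour can be checked before the columns are dropped. The following is a minimal sketch on a small hypothetical dataframe (the sample rows are made up to mimic the layout of the weather data, they are not read from it). It relies on the fact that pd.to_datetime can assemble timestamps from columns named year, month, day and hour.

```python
import pandas as pd

# Hypothetical rows that mimic the layout of the weather data.
sample = pd.DataFrame({
    "year": [2013, 2013],
    "month": [1, 1],
    "day": [1, 1],
    "hour": [1, 2],
    "time_hour": ["2013-01-01 01:00:00", "2013-01-01 02:00:00"],
})

# Assemble a timestamp from the separate date-part columns.
rebuilt = pd.to_datetime(sample[["year", "month", "day", "hour"]])

# Parse the string column, as done for observation_time above.
parsed = pd.to_datetime(sample["time_hour"])

# True only if every rebuilt timestamp equals the parsed one.
same = bool((rebuilt == parsed).all())
print(same)
```

Running the same comparison on the full dataframe, before the drop, would confirm whether the columns are truly redundant for this dataset.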

Time-based splitting

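The function evaluate.train_test_split splits a dataframe into a training dataframe and a test dataframe. When timeseries_index is given, the split is chronological: the earliest observations go to the training set and the latest to the test set.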


train_percent = .75
train_df, test_df = rasgo.evaluate.train_test_split(df, training_percentage=train_percent, timeseries_index='observation_time')

After the split, observation_time has become the datetime index of each dataframe, under the name datetimeIdx. For ease of use, we will reset the index and rename the resulting column back to observation_time.


train_df = train_df.reset_index().rename(columns={'datetimeIdx': 'observation_time'})
test_df = test_df.reset_index().rename(columns={'datetimeIdx': 'observation_time'})

print("Train:", train_df.origin.unique())
print("Test:", test_df.origin.unique())
print("Train:", train_df.observation_time.min(), train_df.observation_time.max())
print("Test:", test_df.observation_time.min(), test_df.observation_time.max())

Train: ['EWR' 'JFK' 'LGA']
Test: ['LGA' 'EWR' 'JFK']
Train: 2013-01-01 01:00:00 2013-09-30 13:00:00
Test: 2013-09-30 13:00:00 2013-12-30 18:00:00
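
For intuition, a chronological split of this kind can be sketched in plain pandas. This is only an illustration under stated assumptions, not pyrasgo's implementation: the helper chronological_split and the toy data are hypothetical, and the sketch assumes the cut is made by row count after sorting on time.

```python
import pandas as pd

def chronological_split(frame, time_col, train_fraction=0.75):
    """Sort by time, then cut at the row marking train_fraction of the data."""
    ordered = frame.sort_values(time_col, kind="stable").reset_index(drop=True)
    cut = int(len(ordered) * train_fraction)
    return ordered.iloc[:cut], ordered.iloc[cut:]

# Hypothetical toy data: two stations, four hourly observations each.
hours = pd.to_datetime([
    "2013-01-01 01:00", "2013-01-01 02:00",
    "2013-01-01 03:00", "2013-01-01 04:00",
])
toy = pd.DataFrame({
    "origin": ["EWR"] * 4 + ["JFK"] * 4,
    "observation_time": list(hours) * 2,
    "temp": [39.0, 39.0, 39.0, 39.9, 37.0, 37.4, 37.9, 38.1],
})

toy_train, toy_test = chronological_split(toy, "observation_time")

# Out-of-time check: no training row is later than any test row.
assert toy_train["observation_time"].max() <= toy_test["observation_time"].min()
print(len(toy_train), len(toy_test))
```

When several stations share a timestamp, a row-count cut can fall inside that timestamp, so the latest training time and the earliest test time coincide. The output above shows exactly this boundary: 2013-09-30 13:00:00 appears as both the training maximum and the test minimum. If strict separation matters, filtering on a cutoff timestamp instead of a row count is one alternative.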