This tutorial explains how to generate stratified K-folds for cross-validation with scikit-learn, so that machine learning models can be evaluated on out-of-sample data. With stratified sampling, the relative proportions of the classes in the overall dataset are maintained in each fold.

During this tutorial you will work with an OpenML dataset of 10108 observations and 69 columns to predict who pays for internet access.

Packages

This tutorial uses scikit-learn and pandas.

Open up a new Jupyter notebook and import the following:


from sklearn.datasets import fetch_openml
import pandas as pd
from sklearn.model_selection import StratifiedKFold

Reading the data

The data comes from OpenML and is downloaded with fetch_openml from the sklearn.datasets module.


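# as_frame=True returns pandas objects, so data.frame is a DataFrame
data = fetch_openml(name='kdd_internet_usage', as_frame=True)
df = data.frame
df.info()  # column names, dtypes, and non-null counts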

Split the data into target and features.

Drop the other Who_Pays_for_Access_* columns. They record the alternative payment options, so keeping them as features would leak the target.


target = 'Who_Pays_for_Access_Work'
y = df[target]
X = data.data.drop(columns=['Who_Pays_for_Access_Dont_Know',
       'Who_Pays_for_Access_Other', 'Who_Pays_for_Access_Parents',
       'Who_Pays_for_Access_School', 'Who_Pays_for_Access_Self'])
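
The leakage columns above all share a prefix with the target, so they could also be collected programmatically instead of being listed by hand. The following is a minimal sketch on a toy DataFrame; the values and the Age column are made up for illustration, and only the Who_Pays_for_Access_* column names mirror the tutorial.

```python
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the tutorial.
toy = pd.DataFrame({
    "Age": [25, 40, 31],
    "Who_Pays_for_Access_Work": [1, 0, 1],
    "Who_Pays_for_Access_Self": [0, 1, 0],
    "Who_Pays_for_Access_School": [0, 0, 0],
})

target_col = "Who_Pays_for_Access_Work"
# Every other column sharing the prefix describes an alternative payer.
leak_cols = [c for c in toy.columns
             if c.startswith("Who_Pays_for_Access_") and c != target_col]

toy_y = toy[target_col]
# Remove both the leakage columns and the target from the features.
toy_X = toy.drop(columns=leak_cols + [target_col])
print(list(toy_X.columns))
```

Selecting by prefix avoids missing a column if the list changes, but it assumes every column with that prefix is a leakage candidate, which should be checked against the real dataset.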
       

Cross-validation splitting

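Scikit-learn's StratifiedKFold splits the data into n_splits folds (default of 5) while preserving the class proportions of y in each fold. With shuffle=True, the samples of each class are shuffled before they are assigned to folds. Each fold serves once as the test set while the remaining folds form the training set. This tutorial uses n_splits=10 with a fixed random_state so that the split is reproducible.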
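
To see the stratification at work before applying it to the real data, here is a small sketch on made-up labels (90 samples of class 0 and 10 of class 1, an assumption chosen only so the numbers are easy to check). It measures the share of the minority class in every test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels (an assumption for illustration): 90 samples of class 0, 10 of class 1.
y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.zeros((len(y_toy), 1))  # feature values do not influence the split

skf_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_shares = []
for train_idx, test_idx in skf_toy.split(X_toy, y_toy):
    # Share of the minority class (label 1) in this test fold
    share = y_toy[test_idx].mean()
    fold_shares.append(share)
    print(f"test size={len(test_idx)}, minority share={share:.2f}")
```

Because both class counts divide evenly by the number of folds here, every test fold should hold the same 10% minority share as the full label vector. With counts that do not divide evenly, the shares are only approximately equal.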


skf = StratifiedKFold(n_splits=10, random_state=1066, shuffle=True)
for train_index, test_index in skf.split(X, y):
    print("Train:", train_index, "Test:", test_index)
    # split() yields positional indices, so select rows with .iloc
    X_train = X.iloc[train_index, :]
    y_train = y.iloc[train_index]
    X_test = X.iloc[test_index, :]
    y_test = y.iloc[test_index]

Train: [    0     1     2 ... 10105 10106 10107] Test: [    6    12    17 ... 10031 10066 10097]
Train: [    0     1     2 ... 10104 10106 10107] Test: [    8    34    35 ... 10090 10099 10105]
Train: [    0     2     3 ... 10105 10106 10107] Test: [    1    30    31 ... 10045 10057 10060]
Train: [    0     1     2 ... 10105 10106 10107] Test: [   15    22    23 ... 10080 10087 10092]
Train: [    1     2     3 ... 10105 10106 10107] Test: [    0     4     9 ... 10069 10076 10088]
Train: [    0     1     2 ... 10104 10105 10106] Test: [    5    11    14 ... 10089 10095 10107]
Train: [    0     1     2 ... 10105 10106 10107] Test: [   18    28    36 ... 10054 10094 10101]
Train: [    0     1     2 ... 10104 10105 10107] Test: [    3     7    19 ... 10096 10102 10106]
Train: [    0     1     2 ... 10105 10106 10107] Test: [   10    41    54 ... 10098 10100 10104]
Train: [    0     1     3 ... 10105 10106 10107] Test: [    2    46    57 ... 10067 10081 10103]
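
The index arrays printed above are usually not consumed by hand. A splitter such as StratifiedKFold can be passed to scikit-learn's cross_val_score through its cv argument, which then fits and scores a model once per fold. The sketch below assumes synthetic data from make_classification and a LogisticRegression model; both are placeholders for illustration and are not part of the tutorial's dataset or method.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced stand-in data (an assumption; not the OpenML dataset).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0
)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1066)
model = LogisticRegression(max_iter=1000)  # placeholder model for illustration

# One accuracy score per fold; the mean summarizes out-of-sample performance.
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
print(f"mean accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

For the internet usage data, the features would likely need encoding before a model like this could be fitted, so treat the snippet as a pattern for wiring the splitter into evaluation rather than a ready-made pipeline.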