How To Do Scikit-Learn Cross-Validation Splits

This tutorial explains how to generate K folds for cross-validation with scikit-learn, so that machine learning models can be evaluated on out-of-sample data.

You'll work with an OpenML dataset of 10108 observations and 69 columns to predict who pays for a respondent's internet access.

Packages

This tutorial uses scikit-learn and pandas.

Open up a new Jupyter notebook and import the following:


from sklearn.datasets import fetch_openml
import pandas as pd
from sklearn.model_selection import KFold

Reading the data

The data comes from OpenML and is downloaded with fetch_openml from scikit-learn's sklearn.datasets module.


data = fetch_openml(name='kdd_internet_usage', as_frame=True)
df = data.frame
df.info()



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10108 entries, 0 to 10107
Data columns (total 69 columns):
 #   Column                                    Non-Null Count  Dtype   
---  ------                                    --------------  -----   
 0   Actual_Time                               10108 non-null  category
 1   Age                                       10108 non-null  category
 2   Community_Building                        10108 non-null  category
 3   Community_Membership_Family               10108 non-null  category
 4   Community_Membership_Hobbies              10108 non-null  category
 5   Community_Membership_None                 10108 non-null  category
 6   Community_Membership_Other                10108 non-null  category
 7   Community_Membership_Political            10108 non-null  category
 8   Community_Membership_Professional         10108 non-null  category
 9   Community_Membership_Religious            10108 non-null  category
 10  Community_Membership_Support              10108 non-null  category
 11  Country                                   10108 non-null  category
 12  Disability_Cognitive                      10108 non-null  category
 13  Disability_Hearing                        10108 non-null  category
 14  Disability_Motor                          10108 non-null  category
 15  Disability_Not_Impaired                   10108 non-null  category
 16  Disability_Not_Say                        10108 non-null  category
 17  Disability_Vision                         10108 non-null  category
 18  Education_Attainment                      10108 non-null  category
 19  Falsification_of_Information              10108 non-null  category
 20  Gender                                    10108 non-null  category
 21  Household_Income                          10108 non-null  category
 22  How_You_Heard_About_Survey_Banner         10108 non-null  category
 23  How_You_Heard_About_Survey_Friend         10108 non-null  category
 24  How_You_Heard_About_Survey_Mailing_List   10108 non-null  category
 25  How_You_Heard_About_Survey_Others         10108 non-null  category
 26  How_You_Heard_About_Survey_Printed_Media  10108 non-null  category
 27  How_You_Heard_About_Survey_Remebered      10108 non-null  category
 28  How_You_Heard_About_Survey_Search_Engine  10108 non-null  category
 29  How_You_Heard_About_Survey_Usenet_News    10108 non-null  category
 30  How_You_Heard_About_Survey_WWW_Page       10108 non-null  category
 31  Major_Geographical_Location               10108 non-null  category
 32  Major_Occupation                          10108 non-null  category
 33  Marital_Status                            10108 non-null  category
 34  Most_Import_Issue_Facing_the_Internet     10108 non-null  category
 35  Opinions_on_Censorship                    10108 non-null  category
 36  Primary_Computing_Platform                7409 non-null   category
 37  Primary_Language                          10108 non-null  category
 38  Primary_Place_of_WWW_Access               10108 non-null  category
 39  Race                                      10108 non-null  category
 40  Not_Purchasing_Bad_experience             10108 non-null  category
 41  Not_Purchasing_Bad_press                  10108 non-null  category
 42  Not_Purchasing_Cant_find                  10108 non-null  category
 43  Not_Purchasing_Company_policy             10108 non-null  category
 44  Not_Purchasing_Easier_locally             10108 non-null  category
 45  Not_Purchasing_Enough_info                10108 non-null  category
 46  Not_Purchasing_Judge_quality              10108 non-null  category
 47  Not_Purchasing_Never_tried                10108 non-null  category
 48  Not_Purchasing_No_credit                  10108 non-null  category
 49  Not_Purchasing_Not_applicable             10108 non-null  category
 50  Not_Purchasing_Not_option                 10108 non-null  category
 51  Not_Purchasing_Other                      10108 non-null  category
 52  Not_Purchasing_Prefer_people              10108 non-null  category
 53  Not_Purchasing_Privacy                    10108 non-null  category
 54  Not_Purchasing_Receipt                    10108 non-null  category
 55  Not_Purchasing_Security                   10108 non-null  category
 56  Not_Purchasing_Too_complicated            10108 non-null  category
 57  Not_Purchasing_Uncomfortable              10108 non-null  category
 58  Not_Purchasing_Unfamiliar_vendor          10108 non-null  category
 59  Registered_to_Vote                        10108 non-null  category
 60  Sexual_Preference                         10108 non-null  category
 61  Web_Ordering                              10108 non-null  category
 62  Web_Page_Creation                         10108 non-null  category
 63  Who_Pays_for_Access_Dont_Know             10108 non-null  category
 64  Who_Pays_for_Access_Other                 10108 non-null  category
 65  Who_Pays_for_Access_Parents               10108 non-null  category
 66  Who_Pays_for_Access_School                10108 non-null  category
 67  Who_Pays_for_Access_Self                  10108 non-null  category
 68  Who_Pays_for_Access_Work                  10108 non-null  category
dtypes: category(69)
memory usage: 715.7 KB

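Split the data into the target and the features.

Drop the other Who_Pays_for_Access columns from the features. They are alternative answers to the same question, so keeping them would leak the target.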


target = 'Who_Pays_for_Access_Work'
y = df[target]
# Drop the target itself plus the other payer columns, which would leak it
X = df.drop(columns=[target, 'Who_Pays_for_Access_Dont_Know',
       'Who_Pays_for_Access_Other', 'Who_Pays_for_Access_Parents',
       'Who_Pays_for_Access_School', 'Who_Pays_for_Access_Self'])
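
A quick sanity check can confirm that no payer column survives in the features. The sketch below is only an illustration: it uses a made-up three-row frame with a handful of the tutorial's column names, and the names toy, payer_cols, X_check and y_check are hypothetical. It assumes every column starting with Who_Pays_for_Access_ is either the target or a leak.

```python
import pandas as pd

# Toy frame with a few of the tutorial's column names (values are made up).
toy = pd.DataFrame({
    "Age": ["25", "40", "31"],
    "Who_Pays_for_Access_Self": ["1", "0", "1"],
    "Who_Pays_for_Access_School": ["0", "0", "1"],
    "Who_Pays_for_Access_Work": ["0", "1", "0"],
})

target = "Who_Pays_for_Access_Work"
prefix = "Who_Pays_for_Access_"

# Assumption: every column sharing the target's prefix is the target or a leak.
payer_cols = [c for c in toy.columns if c.startswith(prefix)]

y_check = toy[target]
X_check = toy.drop(columns=payer_cols)

# No payer column should remain among the features.
leaks = [c for c in X_check.columns if c.startswith(prefix)]
assert not leaks, f"leaky columns still present: {leaks}"
print(list(X_check.columns))  # ['Age']
```

Selecting the columns by prefix rather than by name means a payer column cannot be missed by a typo in the list.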

Cross-validation splitting

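Scikit-learn's KFold splits the data into n_splits folds (5 by default) for cross-validation. By default the folds are consecutive blocks of rows. Pass shuffle=True to assign rows to folds at random, and set random_state to make the shuffle reproducible. Each fold serves as the test set once while the remaining folds form the training set.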
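
To make the effect of shuffle concrete, here is a minimal sketch on ten toy rows rather than the tutorial dataset. The names X_toy, ordered, shuffled and shuffled_again are illustrative, and random_state=0 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy rows stand in for a dataset (illustrative only).
X_toy = np.arange(10).reshape(-1, 1)

# Without shuffling, the test folds are consecutive blocks of rows.
ordered = [test.tolist() for _, test in KFold(n_splits=5).split(X_toy)]
print(ordered)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]

# With shuffle=True, rows are permuted before being cut into folds.
splitter = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled = [test.tolist() for _, test in splitter.split(X_toy)]
print(shuffled)

# An integer random_state reproduces the same folds on every run.
splitter_again = KFold(n_splits=5, shuffle=True, random_state=0)
shuffled_again = [test.tolist() for _, test in splitter_again.split(X_toy)]
print(shuffled == shuffled_again)  # True
```

Shuffling matters when the rows are stored in a meaningful order, for example sorted by date or by class, because consecutive blocks would then give unrepresentative folds.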


kf = KFold(n_splits=10, random_state=1066, shuffle=True)
for train_index, test_index in kf.split(X):
    print("Train:", train_index, "Test:", test_index)
    # kf.split yields positional indices, so select rows with .iloc
    X_train = X.iloc[train_index, :]
    y_train = y.iloc[train_index]
    X_test = X.iloc[test_index, :]
    y_test = y.iloc[test_index]


Train: [    0     1     2 ... 10105 10106 10107] Test: [    9    52    80 ... 10092 10102 10103]
Train: [    0     1     2 ... 10105 10106 10107] Test: [   16    20    21 ... 10069 10079 10101]
Train: [    0     1     2 ... 10105 10106 10107] Test: [    4    12    22 ... 10066 10074 10076]
Train: [    0     1     2 ... 10105 10106 10107] Test: [   13    25    34 ... 10073 10075 10100]
Train: [    0     1     2 ... 10105 10106 10107] Test: [    3     6     7 ... 10093 10095 10104]
Train: [    1     3     4 ... 10104 10105 10107] Test: [    0     2    18 ... 10045 10096 10106]
Train: [    0     1     2 ... 10105 10106 10107] Test: [    8    11    14 ... 10067 10084 10086]
Train: [    0     1     2 ... 10105 10106 10107] Test: [   10    30    31 ... 10083 10085 10098]
Train: [    0     2     3 ... 10103 10104 10106] Test: [    1     5    19 ... 10097 10105 10107]
Train: [    0     1     2 ... 10105 10106 10107] Test: [   15    32    39 ... 10081 10094 10099]
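
The folds become useful once a model is trained and scored on each of them. The following is a rough sketch of that step, not part of the original walkthrough. It runs on synthetic categorical data so that it works offline, and it assumes a one-hot encoder followed by logistic regression as the model. The names X_demo, y_demo, model and scores, and the platform and access columns, are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in for the survey data: two categorical features and a
# binary target that depends on one of them (all values are made up).
rng = np.random.default_rng(0)
n = 200
X_demo = pd.DataFrame({
    "platform": rng.choice(["win", "mac", "unix"], size=n),
    "access": rng.choice(["home", "work", "school"], size=n),
})
y_demo = (X_demo["access"] == "work").astype(int)

# One-hot encode the categories, then fit a logistic regression.
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=1000),
)

# Same splitter settings as the tutorial; each fold is held out once.
kf = KFold(n_splits=10, shuffle=True, random_state=1066)
scores = cross_val_score(model, X_demo, y_demo, cv=kf, scoring="accuracy")

print(len(scores))    # one score per fold
print(scores.mean())  # average out-of-sample accuracy
```

Passing the KFold object as cv makes cross_val_score use exactly those splits, and putting the encoder inside the pipeline means it is refit on each training fold, so nothing from the held-out rows leaks into preprocessing. For the real dataset, X and y from the earlier split would take the place of X_demo and y_demo.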