This tutorial explains how to generate K-folds for cross-validation with scikit-learn, so that machine learning models can be evaluated on out-of-sample data.
You'll work with an OpenML dataset of 10108 observations and 69 columns to predict who pays for internet access.
Packages
This tutorial uses:
Open up a new Jupyter notebook and import the following:
from sklearn.datasets import fetch_openml
import pandas as pd
from sklearn.model_selection import KFold
Reading the data
The data comes from OpenML and is downloaded with fetch_openml from the sklearn.datasets module.
data = fetch_openml(name='kdd_internet_usage', as_frame=True)
df = data.frame
df.info()
RangeIndex: 10108 entries, 0 to 10107
Data columns (total 69 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Actual_Time 10108 non-null category
1 Age 10108 non-null category
2 Community_Building 10108 non-null category
3 Community_Membership_Family 10108 non-null category
4 Community_Membership_Hobbies 10108 non-null category
5 Community_Membership_None 10108 non-null category
6 Community_Membership_Other 10108 non-null category
7 Community_Membership_Political 10108 non-null category
8 Community_Membership_Professional 10108 non-null category
9 Community_Membership_Religious 10108 non-null category
10 Community_Membership_Support 10108 non-null category
11 Country 10108 non-null category
12 Disability_Cognitive 10108 non-null category
13 Disability_Hearing 10108 non-null category
14 Disability_Motor 10108 non-null category
15 Disability_Not_Impaired 10108 non-null category
16 Disability_Not_Say 10108 non-null category
17 Disability_Vision 10108 non-null category
18 Education_Attainment 10108 non-null category
19 Falsification_of_Information 10108 non-null category
20 Gender 10108 non-null category
21 Household_Income 10108 non-null category
22 How_You_Heard_About_Survey_Banner 10108 non-null category
23 How_You_Heard_About_Survey_Friend 10108 non-null category
24 How_You_Heard_About_Survey_Mailing_List 10108 non-null category
25 How_You_Heard_About_Survey_Others 10108 non-null category
26 How_You_Heard_About_Survey_Printed_Media 10108 non-null category
27 How_You_Heard_About_Survey_Remebered 10108 non-null category
28 How_You_Heard_About_Survey_Search_Engine 10108 non-null category
29 How_You_Heard_About_Survey_Usenet_News 10108 non-null category
30 How_You_Heard_About_Survey_WWW_Page 10108 non-null category
31 Major_Geographical_Location 10108 non-null category
32 Major_Occupation 10108 non-null category
33 Marital_Status 10108 non-null category
34 Most_Import_Issue_Facing_the_Internet 10108 non-null category
35 Opinions_on_Censorship 10108 non-null category
36 Primary_Computing_Platform 7409 non-null category
37 Primary_Language 10108 non-null category
38 Primary_Place_of_WWW_Access 10108 non-null category
39 Race 10108 non-null category
40 Not_Purchasing_Bad_experience 10108 non-null category
41 Not_Purchasing_Bad_press 10108 non-null category
42 Not_Purchasing_Cant_find 10108 non-null category
43 Not_Purchasing_Company_policy 10108 non-null category
44 Not_Purchasing_Easier_locally 10108 non-null category
45 Not_Purchasing_Enough_info 10108 non-null category
46 Not_Purchasing_Judge_quality 10108 non-null category
47 Not_Purchasing_Never_tried 10108 non-null category
48 Not_Purchasing_No_credit 10108 non-null category
49 Not_Purchasing_Not_applicable 10108 non-null category
50 Not_Purchasing_Not_option 10108 non-null category
51 Not_Purchasing_Other 10108 non-null category
52 Not_Purchasing_Prefer_people 10108 non-null category
53 Not_Purchasing_Privacy 10108 non-null category
54 Not_Purchasing_Receipt 10108 non-null category
55 Not_Purchasing_Security 10108 non-null category
56 Not_Purchasing_Too_complicated 10108 non-null category
57 Not_Purchasing_Uncomfortable 10108 non-null category
58 Not_Purchasing_Unfamiliar_vendor 10108 non-null category
59 Registered_to_Vote 10108 non-null category
60 Sexual_Preference 10108 non-null category
61 Web_Ordering 10108 non-null category
62 Web_Page_Creation 10108 non-null category
63 Who_Pays_for_Access_Dont_Know 10108 non-null category
64 Who_Pays_for_Access_Other 10108 non-null category
65 Who_Pays_for_Access_Parents 10108 non-null category
66 Who_Pays_for_Access_School 10108 non-null category
67 Who_Pays_for_Access_Self 10108 non-null category
68 Who_Pays_for_Access_Work 10108 non-null category
dtypes: category(69)
memory usage: 715.7 KB
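Split the data into target and features.
The target is whether work pays for internet access. The other Who_Pays_for_Access columns are alternative answers to the same question, so they are dropped from the features to avoid target leakage.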
target = 'Who_Pays_for_Access_Work'
y = df[target]
X = data.data.drop(columns=['Who_Pays_for_Access_Dont_Know',
                            'Who_Pays_for_Access_Other',
                            'Who_Pays_for_Access_Parents',
                            'Who_Pays_for_Access_School',
                            'Who_Pays_for_Access_Self'])
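The five dropped columns share a prefix with the target, so the list could also be derived rather than typed by hand. The helper below is a hypothetical sketch (leakage_columns is not part of pandas or scikit-learn) that operates on plain column names; the short column list is an excerpt of the df.info() output above, used here only for illustration.

```python
def leakage_columns(columns, target, prefix):
    """Return the columns that share `prefix` with the target, excluding the target."""
    return [c for c in columns if c.startswith(prefix) and c != target]


# Excerpt of the column names shown by df.info(); illustrative only.
columns = [
    "Web_Ordering",
    "Web_Page_Creation",
    "Who_Pays_for_Access_Dont_Know",
    "Who_Pays_for_Access_Other",
    "Who_Pays_for_Access_Parents",
    "Who_Pays_for_Access_School",
    "Who_Pays_for_Access_Self",
    "Who_Pays_for_Access_Work",
]
target = "Who_Pays_for_Access_Work"

to_drop = leakage_columns(columns, target, prefix="Who_Pays_for_Access_")
print(to_drop)
```

With the real data, the same helper could be called on df.columns and its result passed to drop(columns=...).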
Cross-validation splitting
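Scikit-learn's KFold splits the data into n_splits folds (default of 5) that can be used for cross-validation when training machine learning models. Each fold serves once as the test set while the remaining folds form the training set. By default the folds are consecutive blocks of rows; with shuffle=True the rows are shuffled before splitting, and random_state makes the shuffle reproducible.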
kf = KFold(n_splits=10, random_state=1066, shuffle=True)
for train_index, test_index in kf.split(X):
    print("Train:", train_index, "Test:", test_index)
    # KFold yields row positions, so index positionally with iloc
    X_train = X.iloc[train_index, :]
    y_train = y.iloc[train_index]
    X_test = X.iloc[test_index, :]
    y_test = y.iloc[test_index]
Train: [ 0 1 2 ... 10105 10106 10107] Test: [ 9 52 80 ... 10092 10102 10103]
Train: [ 0 1 2 ... 10105 10106 10107] Test: [ 16 20 21 ... 10069 10079 10101]
Train: [ 0 1 2 ... 10105 10106 10107] Test: [ 4 12 22 ... 10066 10074 10076]
Train: [ 0 1 2 ... 10105 10106 10107] Test: [ 13 25 34 ... 10073 10075 10100]
Train: [ 0 1 2 ... 10105 10106 10107] Test: [ 3 6 7 ... 10093 10095 10104]
Train: [ 1 3 4 ... 10104 10105 10107] Test: [ 0 2 18 ... 10045 10096 10106]
Train: [ 0 1 2 ... 10105 10106 10107] Test: [ 8 11 14 ... 10067 10084 10086]
Train: [ 0 1 2 ... 10105 10106 10107] Test: [ 10 30 31 ... 10083 10085 10098]
Train: [ 0 2 3 ... 10103 10104 10106] Test: [ 1 5 19 ... 10097 10105 10107]
Train: [ 0 1 2 ... 10105 10106 10107] Test: [ 15 32 39 ... 10081 10094 10099]
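To make the mechanics of the split concrete, here is a minimal pure-Python sketch of shuffled k-fold index splitting. It is not scikit-learn's implementation: the function name kfold_indices is hypothetical, and the shuffle uses Python's random module, so the folds will not match the output above even with the same seed. It does follow the fold-size rule described in the KFold documentation, where the first n_samples % n_splits folds receive one extra sample.

```python
import random


def kfold_indices(n_samples, n_splits=5, shuffle=False, seed=None):
    """Yield (train, test) index lists. Illustrative only, not sklearn's code."""
    if not 2 <= n_splits <= n_samples:
        raise ValueError("n_splits must be between 2 and n_samples")
    indices = list(range(n_samples))
    if shuffle:
        # Seeded shuffle so the same seed reproduces the same folds
        random.Random(seed).shuffle(indices)
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        test = sorted(indices[start:start + size])
        test_set = set(test)
        train = [i for i in range(n_samples) if i not in test_set]
        yield train, test
        start += size


folds = list(kfold_indices(10108, n_splits=10, shuffle=True, seed=1066))
fold_sizes = [len(test) for _, test in folds]
print(fold_sizes)
```

With 10108 rows and 10 folds, eight test folds hold 1011 rows and two hold 1010, and every row lands in exactly one test fold.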