Feature Selection Using the F-Test in Scikit-learn

This tutorial explains how to use scikit-learn's univariate feature selection methods to select the top N features and the top P% features with the F-test statistic.

The example uses an OpenML dataset with 10108 observations and 69 columns to predict who pays for internet access.

Packages

This tutorial uses pandas, scikit-learn, and category_encoders:


import pandas as pd
from sklearn.datasets import fetch_openml
import category_encoders as ce

from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif

Reading the Data

The data comes from OpenML and is loaded with fetch_openml from sklearn.datasets.


# as_frame=True makes sure data.frame is a pandas DataFrame
data = fetch_openml(name='kdd_internet_usage', as_frame=True)
df = data.frame
df.info()

This should print a summary resembling the following:



RangeIndex: 10108 entries, 0 to 10107
Data columns (total 69 columns):
 #   Column                                    Non-Null Count  Dtype   
---  ------                                    --------------  -----   
 0   Actual_Time                               10108 non-null  category
 1   Age                                       10108 non-null  category
 2   Community_Building                        10108 non-null  category
 3   Community_Membership_Family               10108 non-null  category
 4   Community_Membership_Hobbies              10108 non-null  category
 5   Community_Membership_None                 10108 non-null  category
 6   Community_Membership_Other                10108 non-null  category
 7   Community_Membership_Political            10108 non-null  category
 8   Community_Membership_Professional         10108 non-null  category
 9   Community_Membership_Religious            10108 non-null  category
 10  Community_Membership_Support              10108 non-null  category
 11  Country                                   10108 non-null  category
 12  Disability_Cognitive                      10108 non-null  category
 13  Disability_Hearing                        10108 non-null  category
 14  Disability_Motor                          10108 non-null  category
 15  Disability_Not_Impaired                   10108 non-null  category
 16  Disability_Not_Say                        10108 non-null  category
 17  Disability_Vision                         10108 non-null  category
 18  Education_Attainment                      10108 non-null  category
 19  Falsification_of_Information              10108 non-null  category
 20  Gender                                    10108 non-null  category
 21  Household_Income                          10108 non-null  category
 22  How_You_Heard_About_Survey_Banner         10108 non-null  category
 23  How_You_Heard_About_Survey_Friend         10108 non-null  category
 24  How_You_Heard_About_Survey_Mailing_List   10108 non-null  category
 25  How_You_Heard_About_Survey_Others         10108 non-null  category
 26  How_You_Heard_About_Survey_Printed_Media  10108 non-null  category
 27  How_You_Heard_About_Survey_Remebered      10108 non-null  category
 28  How_You_Heard_About_Survey_Search_Engine  10108 non-null  category
 29  How_You_Heard_About_Survey_Usenet_News    10108 non-null  category
 30  How_You_Heard_About_Survey_WWW_Page       10108 non-null  category
 31  Major_Geographical_Location               10108 non-null  category
 32  Major_Occupation                          10108 non-null  category
 33  Marital_Status                            10108 non-null  category
 34  Most_Import_Issue_Facing_the_Internet     10108 non-null  category
 35  Opinions_on_Censorship                    10108 non-null  category
 36  Primary_Computing_Platform                7409 non-null   category
 37  Primary_Language                          10108 non-null  category
 38  Primary_Place_of_WWW_Access               10108 non-null  category
 39  Race                                      10108 non-null  category
 40  Not_Purchasing_Bad_experience             10108 non-null  category
 41  Not_Purchasing_Bad_press                  10108 non-null  category
 42  Not_Purchasing_Cant_find                  10108 non-null  category
 43  Not_Purchasing_Company_policy             10108 non-null  category
 44  Not_Purchasing_Easier_locally             10108 non-null  category
 45  Not_Purchasing_Enough_info                10108 non-null  category
 46  Not_Purchasing_Judge_quality              10108 non-null  category
 47  Not_Purchasing_Never_tried                10108 non-null  category
 48  Not_Purchasing_No_credit                  10108 non-null  category
 49  Not_Purchasing_Not_applicable             10108 non-null  category
 50  Not_Purchasing_Not_option                 10108 non-null  category
 51  Not_Purchasing_Other                      10108 non-null  category
 52  Not_Purchasing_Prefer_people              10108 non-null  category
 53  Not_Purchasing_Privacy                    10108 non-null  category
 54  Not_Purchasing_Receipt                    10108 non-null  category
 55  Not_Purchasing_Security                   10108 non-null  category
 56  Not_Purchasing_Too_complicated            10108 non-null  category
 57  Not_Purchasing_Uncomfortable              10108 non-null  category
 58  Not_Purchasing_Unfamiliar_vendor          10108 non-null  category
 59  Registered_to_Vote                        10108 non-null  category
 60  Sexual_Preference                         10108 non-null  category
 61  Web_Ordering                              10108 non-null  category
 62  Web_Page_Creation                         10108 non-null  category
 63  Who_Pays_for_Access_Dont_Know             10108 non-null  category
 64  Who_Pays_for_Access_Other                 10108 non-null  category
 65  Who_Pays_for_Access_Parents               10108 non-null  category
 66  Who_Pays_for_Access_School                10108 non-null  category
 67  Who_Pays_for_Access_Self                  10108 non-null  category
 68  Who_Pays_for_Access_Work                  10108 non-null  category
dtypes: category(69)
memory usage: 715.7 KB

Split the data into target and features.

Drop the other Who_Pays_for_Access columns, since these alternative payment options would leak the target.


target = 'Who_Pays_for_Access_Work'
y = df[target]
# data.data holds the features without the target; drop the leakage columns
X_cat = data.data.drop(columns=[
    'Who_Pays_for_Access_Dont_Know',
    'Who_Pays_for_Access_Other',
    'Who_Pays_for_Access_Parents',
    'Who_Pays_for_Access_School',
    'Who_Pays_for_Access_Self',
])

Encode the categorical variables prior to feature selection.


encoder = ce.LeaveOneOutEncoder(return_df=True)
X = encoder.fit_transform(X_cat, y)
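To make the encoding step more concrete, here is a minimal pandas sketch of the idea behind leave-one-out encoding: each category value is replaced by the mean of the target over the other rows in the same category. The toy column names and values are hypothetical, and this is only an illustration of the concept; the category_encoders implementation also deals with edge cases (such as categories that occur only once) that this sketch ignores.

```python
import pandas as pd

# Toy data (hypothetical): one categorical feature and a binary target.
toy = pd.DataFrame({
    "platform": ["mac", "mac", "mac", "pc", "pc"],
    "target":   [1, 0, 1, 0, 1],
})

# Per-row sum and count of the target within each category.
grouped = toy.groupby("platform")["target"]
category_sum = grouped.transform("sum")
category_count = grouped.transform("count")

# Leave the current row out: mean of the target over the other rows
# that share the same category.
toy["platform_loo"] = (category_sum - toy["target"]) / (category_count - 1)

print(toy)
```

Because a row's own target value is excluded from its encoding, the encoded feature leaks less of the target than a plain per-category mean would.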

Feature Selection

Select the top N

Start with 63 features.


X.shape


(10108, 63)

Select the top 20 features.

Note that f_classif is used because this is a classification problem. For a regression problem, use f_regression instead.


selector = SelectKBest(f_classif, k=20)
X_reduced = selector.fit_transform(X, y)
X_reduced.shape


(10108, 20)
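The score that f_classif assigns to each feature is the one-way ANOVA F-statistic: the variance of the feature between the classes divided by its variance within the classes. The following is a minimal pure-Python sketch of that calculation for a single feature. The function name anova_f and the toy values are made up for illustration; it is not scikit-learn's implementation.

```python
from statistics import mean

def anova_f(values, labels):
    """One-way ANOVA F-statistic for a single feature (illustrative sketch)."""
    # Group the feature values by class label.
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)

    n, k = len(values), len(groups)
    grand_mean = mean(values)

    # Variation of the class means around the overall mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups.values())
    # Variation of the values around their own class mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups.values() for v in g)

    return (ss_between / (k - 1)) / (ss_within / (n - k))

labels = [0, 0, 0, 1, 1, 1]
informative = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]  # classes are well separated
noisy = [1.0, 8.0, 3.0, 2.0, 9.0, 4.0]        # classes overlap heavily

print(anova_f(informative, labels))  # 54.0
print(anova_f(noisy, labels))        # much smaller, roughly 0.115
```

A feature whose values differ strongly between classes gets a high F-score and is more likely to be kept by the selector.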

The get_support method returns the indices of the features that were kept, which can be mapped back to column names.


# Integer positions of the selected features
cols = selector.get_support(indices=True)
# Map the positions back to column names
selected_columns = X.columns[cols].tolist()
selected_columns


['Actual_Time',
 'Age',
 'Community_Membership_Professional',
 'Country',
 'Education_Attainment',
 'Falsification_of_Information',
 'Household_Income',
 'How_You_Heard_About_Survey_Friend',
 'How_You_Heard_About_Survey_Mailing_List',
 'How_You_Heard_About_Survey_WWW_Page',
 'Major_Geographical_Location',
 'Major_Occupation',
 'Most_Import_Issue_Facing_the_Internet',
 'Primary_Computing_Platform',
 'Primary_Language',
 'Primary_Place_of_WWW_Access',
 'Not_Purchasing_Company_policy',
 'Not_Purchasing_Not_option',
 'Web_Ordering',
 'Web_Page_Creation']
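Beyond the list of names, it can help to see the scores behind the selection. A fitted selector exposes scores_ and pvalues_ attributes, which can be combined with get_support into a ranking table. The sketch below uses a small synthetic dataset from make_classification rather than the survey data, so the feature names and variable names (X_demo, y_demo, demo_selector) are placeholders.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in data: 6 features, 2 of which are informative.
X_arr, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

demo_selector = SelectKBest(f_classif, k=2).fit(X_demo, y_demo)

# One row per feature: F-score, p-value, and whether it was kept.
scores = pd.DataFrame({
    "feature": X_demo.columns,
    "f_score": demo_selector.scores_,
    "p_value": demo_selector.pvalues_,
    "selected": demo_selector.get_support(),
}).sort_values("f_score", ascending=False)

print(scores)
```

The same table can be built for the tutorial's selector by swapping in X, y, and selector.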

Select the top P%

Select the top 25% of features.


selector = SelectPercentile(f_classif, percentile=25)
X_reduced = selector.fit_transform(X, y)
X_reduced.shape


(10108, 16)

Again, use the get_support method to list the features that were kept.


# Integer positions of the selected features
cols = selector.get_support(indices=True)
# Map the positions back to column names
selected_columns = X.columns[cols].tolist()
selected_columns


['Actual_Time',
 'Age',
 'Community_Membership_Professional',
 'Country',
 'Education_Attainment',
 'Falsification_of_Information',
 'Household_Income',
 'How_You_Heard_About_Survey_Friend',
 'How_You_Heard_About_Survey_Mailing_List',
 'Major_Geographical_Location',
 'Major_Occupation',
 'Primary_Computing_Platform',
 'Primary_Place_of_WWW_Access',
 'Not_Purchasing_Company_policy',
 'Web_Ordering',
 'Web_Page_Creation']
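Note that 25% of 63 features is 15.75, and the selector kept 16. This appears to be because SelectPercentile keeps the features whose scores lie above a percentile cutoff of all scores, rather than computing a rounded feature count first. The sketch below, on synthetic data with placeholder variable names (X_demo, y_demo, kept), shows how the percentile setting translates into a number of retained features.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif

# Synthetic stand-in data with 8 features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

# Count how many features survive at each percentile setting.
kept = {}
for pct in (25, 50):
    demo_selector = SelectPercentile(f_classif, percentile=pct).fit(X_demo, y_demo)
    kept[pct] = int(demo_selector.get_support().sum())

print(kept)
```

When an exact number of features is required, SelectKBest is the more direct choice; SelectPercentile is convenient when the feature count changes between datasets.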