How To Handle Data Type Mismatch with PyRasgo

This tutorial explains how to use pyrasgo to identify values in a column that do not match the column's expected data type.

Packages

This tutorial uses pandas and pyrasgo.

Open a new Jupyter Notebook and import the following:


import pandas as pd
import pyrasgo

Connect to Rasgo

If you haven't done so already, head over to https://docs.rasgoml.com/rasgo-docs/onboarding/initial-setup and follow the steps outlined there to create your free account. This account gives you free access to the Rasgo API, which will calculate dataframe profiles, generate feature importance scores, and produce feature explainability for your analysis. In addition, this account allows you to maintain access to your analysis and share it with your colleagues.


rasgo = pyrasgo.login(email='', password='')

Creating the data

For this example, we will create a dataframe with four columns, each holding a single data type: text, numeric, Boolean, and datetime.


df = pd.DataFrame({'A': ['text']*20,
                   'B': [1, 2.2]*10,
                   'C': [True, False]*10,
                   'D': pd.to_datetime('2020-01-01')
                  })

Next, add some mistyped data to the dataframe.


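# Cast to object first so each column can hold mixed types;
# recent pandas versions warn (and may raise) when assigning an incompatible value.
df = df.astype(object)
df.iloc[0,0] = 1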
df.iloc[1,0] = -2
df.iloc[10,0] = pd.to_datetime('2021-01-01')
df.iloc[5,1] = '2.2'
df.iloc[7,1] = 'A+B'
df.iloc[4,2] = 1
df.iloc[5,2] = 'False'
df.iloc[9,2] = -12.6
df.iloc[12,2] = 'text'
df.iloc[2,3] = 12
df.iloc[12,3] = '2020-01-01'
df

Your dataframe should look something like:


	A	B	C	D
0	1	1.0	True	2020-01-01 00:00:00
1	-2	2.2	False	2020-01-01 00:00:00
2	text	1.0	True	12
3	text	2.2	False	2020-01-01 00:00:00
4	text	1.0	1	2020-01-01 00:00:00
5	text	2.2	False	2020-01-01 00:00:00
6	text	1.0	True	2020-01-01 00:00:00
7	text	A+B	False	2020-01-01 00:00:00
8	text	1.0	True	2020-01-01 00:00:00
9	text	2.2	-12.6	2020-01-01 00:00:00
10	2021-01-01 00:00:00	1.0	True	2020-01-01 00:00:00
11	text	2.2	False	2020-01-01 00:00:00
12	text	1.0	text	2020-01-01
13	text	2.2	False	2020-01-01 00:00:00
14	text	1.0	True	2020-01-01 00:00:00
15	text	2.2	False	2020-01-01 00:00:00
16	text	1.0	True	2020-01-01 00:00:00
17	text	2.2	False	2020-01-01 00:00:00
18	text	1.0	True	2020-01-01 00:00:00
19	text	2.2	False	2020-01-01 00:00:00
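
One quick way to confirm that a column holds mixed types is to count the Python type of each element. The following is a minimal sketch using only pandas; the helper name type_counts and the small sample frame are hypothetical and not part of pyrasgo. The sample is built with dtype=object so that each element keeps its original Python type.

```python
import pandas as pd

def type_counts(frame):
    """Count the Python type name of every element, per column (hypothetical helper)."""
    return {
        col: frame[col].map(lambda v: type(v).__name__).value_counts().to_dict()
        for col in frame.columns
    }

# Small stand-in for columns B and C of the tutorial dataframe.
sample = pd.DataFrame({
    'B': pd.Series([1.0, 2.2, '2.2', 'A+B'], dtype=object),
    'C': pd.Series([True, 1, 'False', -12.6], dtype=object),
})

counts = type_counts(sample)
print(counts)
```

A column with a single entry in its count is consistently typed; more than one entry suggests there are mistyped values to track down.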

Identify mistyped data

The function evaluate.type_mismatches casts the given column to data_type and returns a dataframe containing the recast column, with elements that were of the wrong type set to NaN.
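
This tutorial does not show how type_mismatches is implemented, but the behavior described resembles coercion in plain pandas. As a rough analogue (an assumption, not the pyrasgo implementation), pd.to_numeric with errors='coerce' turns values it cannot parse into NaN:

```python
import pandas as pd

# Small stand-in for column B; dtype=object keeps the mixed Python types.
col_b = pd.Series([1, 2.2, '2.2', 'A+B'], dtype=object, name='B')

# Values that cannot be parsed as numbers become NaN.
coerced = pd.to_numeric(col_b, errors='coerce')
print(coerced)
```

Note that pandas parses the numeric string '2.2' successfully, so only 'A+B' becomes NaN in this sketch. Whether type_mismatches treats numeric strings the same way is not stated here.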

Cast to numeric


new_column_df = rasgo.evaluate.type_mismatches(df, column='B', data_type='numeric')
new_column_df

Convert this to a Boolean series using the pandas function isnull, and use that series to return the rows containing non-numeric data.


df[new_column_df.isnull().iloc[:,0]]
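
If strings that merely look numeric, such as '2.2', should also count as mismatches, a stricter element-wise check is one option. This is a sketch in plain pandas, independent of pyrasgo; the helper is_number and the sample frame are hypothetical. Booleans are excluded explicitly because Python treats bool as a subclass of int.

```python
import numbers
import pandas as pd

def is_number(value):
    """True for int- or float-like values; False for bool, str, and anything else."""
    return isinstance(value, numbers.Number) and not isinstance(value, bool)

# Small stand-in for column B, with one Boolean added to show the exclusion.
sample = pd.DataFrame({'B': pd.Series([1, 2.2, '2.2', 'A+B', True], dtype=object)})

# Boolean mask of elements that are not numbers, then the offending rows.
not_numeric = ~sample['B'].map(is_number).astype(bool)
bad_rows = sample[not_numeric]
print(bad_rows)
```

The mask is used the same way as the isnull series above: indexing the dataframe with it keeps only the rows where the mask is True.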

Cast to datetime


new_column_df = rasgo.evaluate.type_mismatches(df, column='D', data_type='datetime')
new_column_df

Convert this to a Boolean series using the pandas function isnull, and use that series to return the rows whose value is not a datetime.


df[new_column_df.isnull().iloc[:,0]]
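
A strict element-wise check is another way to find values that are not stored as datetimes. The following is a minimal sketch in plain pandas rather than pyrasgo, using a hypothetical sample column: it keeps only the rows whose value is not a pandas Timestamp or Python datetime.

```python
import datetime
import pandas as pd

# Small stand-in for column D; dtype=object keeps the mixed Python types.
sample = pd.DataFrame({
    'D': pd.Series([pd.Timestamp('2020-01-01'), 12, '2020-01-01'], dtype=object)
})

# pd.Timestamp subclasses datetime.datetime, so one isinstance check covers both.
is_datetime = sample['D'].map(lambda v: isinstance(v, datetime.datetime)).astype(bool)
not_datetime_rows = sample[~is_datetime]
print(not_datetime_rows)
```

In this sketch the string '2020-01-01' is flagged even though it could be parsed as a date, because the check looks at how the value is stored and not at whether it could be converted.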