This tutorial explains how to use pyrasgo to identify values in a column that do not match the column's expected data type.
Packages
This tutorial uses pandas and pyrasgo.
Open a new Jupyter Notebook and import the following:
import pandas as pd
import pyrasgo
Connect to Rasgo
If you haven't done so already, head over to https://docs.rasgoml.com/rasgo-docs/onboarding/initial-setup and follow the steps outlined there to create your free account. This account gives you free access to the Rasgo API, which will calculate dataframe profiles, generate feature importance scores, and produce feature explainability for your analysis. In addition, this account allows you to maintain access to your analysis and share it with your colleagues.
rasgo = pyrasgo.login(email='', password='')
Creating the data
We will create a dataframe with four columns for this example, each holding a single data type: text (A), numeric (B), Boolean (C), and datetime (D).
df = pd.DataFrame({'A': ['text']*20,
                   'B': [1, 2.2]*10,
                   'C': [True, False]*10,
                   'D': pd.to_datetime('2020-01-01')
                   })
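Next, add some mistyped data to the dataframe. Casting every column to object first lets each cell hold an arbitrary Python value; newer pandas versions warn, and may raise, when a value of an incompatible type is assigned into a typed column.
df = df.astype(object)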
df.iloc[0,0] = 1
df.iloc[1,0] = -2
df.iloc[10,0] = pd.to_datetime('2021-01-01')
df.iloc[5,1] = '2.2'
df.iloc[7,1] = 'A+B'
df.iloc[4,2] = 1
df.iloc[5,2] = 'False'
df.iloc[9,2] = -12.6
df.iloc[12,2] = 'text'
df.iloc[2,3] = 12
df.iloc[12,3] = '2020-01-01'
df
Your dataframe should look something like:
                      A    B      C                    D
0                     1  1.0   True  2020-01-01 00:00:00
1                    -2  2.2  False  2020-01-01 00:00:00
2                  text  1.0   True                   12
3                  text  2.2  False  2020-01-01 00:00:00
4                  text  1.0      1  2020-01-01 00:00:00
5                  text  2.2  False  2020-01-01 00:00:00
6                  text  1.0   True  2020-01-01 00:00:00
7                  text  A+B  False  2020-01-01 00:00:00
8                  text  1.0   True  2020-01-01 00:00:00
9                  text  2.2  -12.6  2020-01-01 00:00:00
10  2021-01-01 00:00:00  1.0   True  2020-01-01 00:00:00
11                 text  2.2  False  2020-01-01 00:00:00
12                 text  1.0   text           2020-01-01
13                 text  2.2  False  2020-01-01 00:00:00
14                 text  1.0   True  2020-01-01 00:00:00
15                 text  2.2  False  2020-01-01 00:00:00
16                 text  1.0   True  2020-01-01 00:00:00
17                 text  2.2  False  2020-01-01 00:00:00
18                 text  1.0   True  2020-01-01 00:00:00
19                 text  2.2  False  2020-01-01 00:00:00
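Some of the mistyped values are hard to spot in the printed dataframe: the string '2.2' in row 5 of column B looks the same as the float 2.2. As a quick local check that does not involve Rasgo, the sketch below counts the Python type of each cell. The series col is a hypothetical stand-in for column B, built here only so the example runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for column 'B': floats with two strings mixed in.
col = pd.Series([1.0, 2.2, 1.0, '2.2', 'A+B'], dtype=object, name='B')

# Map every cell to the name of its Python type, then count each type.
type_counts = col.map(lambda v: type(v).__name__).value_counts()
print(type_counts)
```

A column that holds a single data type produces a single row in this count; more than one row means the column mixes types.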
Identify mistyped data
The function evaluate.type_mismatches casts the given column to data_type and returns a dataframe containing the recast column, with every element that was of the wrong type replaced by NaN.
Cast to numeric
new_column_df = rasgo.evaluate.type_mismatches(df, column='B', data_type='numeric')
new_column_df
Convert this to a Boolean series using the pandas function isnull, and use that series to return the non-numeric data.
df[new_column_df.isnull().iloc[:,0]]
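How Rasgo performs the cast internally is not covered in this tutorial. As a rough local analogue of the same cast-then-filter pattern, the sketch below uses pandas' own pd.to_numeric with errors='coerce', which turns values it cannot parse into NaN. The series col is a hypothetical stand-in for column B, and the results are an assumption about pandas behavior only, so they may not match what type_mismatches returns.

```python
import pandas as pd

# Hypothetical stand-in for column 'B' (not the tutorial's full dataframe).
col = pd.Series([1.0, 2.2, '2.2', 'A+B'], dtype=object, name='B')

# Cast to numeric; values that cannot be parsed become NaN.
coerced = pd.to_numeric(col, errors='coerce')

# Flag cells that became NaN but were not missing to begin with.
mismatch_mask = coerced.isnull() & col.notnull()

print(col[mismatch_mask])
```

Note that pandas parses the string '2.2' as a number, so this coercion-based check flags only 'A+B'. If numeric-looking strings should also count as mistyped, a per-cell type check is needed instead.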
Cast to datetime
new_column_df = rasgo.evaluate.type_mismatches(df, column='D', data_type='datetime')
new_column_df
Convert this to a Boolean series using the pandas function isnull and use that series to return the data that is not a datetime.
df[new_column_df.isnull().iloc[:,0]]
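For comparison, a stricter local check can test each cell's type directly instead of casting. This is a different technique from the cast performed by type_mismatches, and it is only a sketch: the series col is a hypothetical stand-in for column D. It can be useful because pandas' own pd.to_datetime accepts date-like strings and can interpret bare integers as offsets from the epoch, so a cast alone may not flag such values.

```python
import pandas as pd

ts = pd.Timestamp('2020-01-01')

# Hypothetical stand-in for column 'D': timestamps plus an integer and a string.
col = pd.Series([ts, ts, 12, ts, '2020-01-01'], dtype=object, name='D')

# True where the cell already holds a pandas Timestamp object.
is_timestamp = col.map(lambda v: isinstance(v, pd.Timestamp))

# Keep only the cells that are not Timestamp objects.
not_datetime = col[~is_timestamp]
print(not_datetime)
```

Here both the integer 12 and the string '2020-01-01' are reported, even though the string would parse cleanly as a date.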