This tutorial explains how to identify data in columns with the wrong data type with pandas.
Packages
This tutorial uses:
Open up a Jupyter notebook and import the following:
import pandas as pd
import datetime
import numpy as np
Creating the data
We will create a dataframe that contains multiple occurrences of duplication for this example.
df = pd.DataFrame({'A': ['text']*20,
'B': [1, 2.2]*10,
'C': [True, False]*10,
'D': pd.to_datetime('2020-01-01')
})
Next, add some mistyped data to the dataframe.
df.iloc[0,0] = 1
df.iloc[1,0] = -2
df.iloc[10,0] = pd.to_datetime('2021-01-01')
df.iloc[5,1] = '2.2'
df.iloc[7,1] = 'A+B'
df.iloc[4,2] = 1
df.iloc[5,2] = 'False'
df.iloc[9,2] = -12.6
df.iloc[12,2] = 'text'
df.iloc[2,3] = 12
df.iloc[12,3] = '2020-01-01'
df
Identify mistyped data
The function applymap and isinstance will return a Boolean dataframe with True when the data type matches and False when the data type does not match.
Check numeric
numeric = df.applymap(lambda x: isinstance(x, (int, float)))
numeric
Since only column B is supposed to be numeric, this can be made more specific by running applymap only on column B.
numeric = df.applymap(lambda x: isinstance(x, (int, float)))['B']
numeric
Your output should look something like:
0 True
1 True
2 True
3 True
4 True
5 False
6 True
7 False
8 True
9 True
10 True
11 True
12 True
13 True
14 True
15 True
16 True
17 True
18 True
19 True
Name: B, dtype: bool
Using this Boolean series to return the non-numeric data
Check datetime
dt = df.applymap(lambda x: isinstance(x, (datetime.datetime)))['D']
dt
Output:
0 True
1 True
2 False
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 True
11 True
12 False
13 True
14 True
15 True
16 True
17 True
18 True
19 True
Name: D, dtype: bool
Using this Boolean series to return the non-numeric data
Check string
strings = df.applymap(lambda x: isinstance(x, (str)))['A']
strings
Your output should look something like this:
0 False
1 False
2 True
3 True
4 True
5 True
6 True
7 True
8 True
9 True
10 False
11 True
12 True
13 True
14 True
15 True
16 True
17 True
18 True
19 True
Name: A, dtype: bool
Using this Boolean series to return the non-numeric data
Check Boolean
torf = df.applymap(lambda x: isinstance(x, (bool)))['C']
torf
You should see the following output:
0 True
1 True
2 True
3 True
4 False
5 False
6 True
7 True
8 True
9 False
10 True
11 True
12 False
13 True
14 True
15 True
16 True
17 True
18 True
19 True
Name: C, dtype: bool
Using this Boolean series to return the non-numeric data