This tutorial explains how to use pandas to identify values in a column that do not match the column's intended data type.

Packages

This tutorial uses pandas, numpy, and Python's built-in datetime module.

Open up a Jupyter notebook and import the following:


import pandas as pd
import datetime
import numpy as np

Creating the data

For this example, we will create a dataframe with four columns, each intended to hold a single data type: strings (A), numbers (B), Booleans (C), and datetimes (D).


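# Cast to object so the mixed-type values assigned below do not trigger
# dtype warnings or errors in newer pandas versions.
df = pd.DataFrame({'A': ['text']*20,
                   'B': [1, 2.2]*10,
                   'C': [True, False]*10,
                   'D': pd.to_datetime('2020-01-01')
                  }).astype(object)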

Next, add some mistyped data to the dataframe.


df.iloc[0,0] = 1
df.iloc[1,0] = -2
df.iloc[10,0] = pd.to_datetime('2021-01-01')
df.iloc[5,1] = '2.2'
df.iloc[7,1] = 'A+B'
df.iloc[4,2] = 1
df.iloc[5,2] = 'False'
df.iloc[9,2] = -12.6
df.iloc[12,2] = 'text'
df.iloc[2,3] = 12
df.iloc[12,3] = '2020-01-01'
df
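
Before checking individual values, it can help to see why a column's dtype alone does not reveal mistyped entries: one stray string turns the whole column into the generic object dtype. The sketch below uses two small hypothetical series, clean and dirty, which are not part of the tutorial's dataframe, and counts the Python types found in the mixed one.

```python
import pandas as pd

# A purely numeric series gets a numeric dtype.
clean = pd.Series([1, 2.2, 3])
print(clean.dtype)  # float64

# One string is enough to make the whole series dtype object.
dirty = pd.Series([1, 2.2, '3'])
print(dirty.dtype)  # object

# Count the Python type of each element to see what the column really holds.
type_counts = dirty.map(lambda x: type(x).__name__).value_counts()
print(type_counts)
```

An object dtype is therefore only a hint that a column may be mixed; the element-wise checks in the next section show which values are the problem.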

Identify mistyped data

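Combining the DataFrame method applymap with the built-in function isinstance returns a Boolean dataframe: True where a value's type matches and False where it does not. In pandas 2.1 and later, applymap is deprecated in favor of DataFrame.map, which can be used the same way in the examples below.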


Check numeric


numeric = df.applymap(lambda x: isinstance(x, (int, float)))
numeric
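
One detail worth knowing about the check above: in Python, bool is a subclass of int, so isinstance(True, (int, float)) is True and the Boolean values in column C also pass the numeric test. A minimal sketch of a stricter test follows; the helper name is_strict_number and the sample series are assumptions made for illustration, not part of the original tutorial.

```python
import pandas as pd

def is_strict_number(x):
    """Return True for int or float values, but False for bool."""
    return isinstance(x, (int, float)) and not isinstance(x, bool)

# Small stand-in series with a number, a float, a Boolean, and two strings.
s = pd.Series([1, 2.2, True, '2.2', 'A+B'], dtype=object)

loose = s.map(lambda x: isinstance(x, (int, float)))
strict = s.map(is_strict_number)

print(loose.tolist())   # [True, True, True, False, False]
print(strict.tolist())  # [True, True, False, False, False]
```

Whether Booleans should count as numbers depends on the data; the stricter version is only needed when a True or False in a numeric column should be flagged.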


Since only column B is supposed to be numeric, select column B from the result to get a Boolean series for that column alone.


numeric = df.applymap(lambda x: isinstance(x, (int, float)))['B']
numeric

Your output should look something like:


0      True
1      True
2      True
3      True
4      True
5     False
6      True
7     False
8      True
9      True
10     True
11     True
12     True
13     True
14     True
15     True
16     True
17     True
18     True
19     True
Name: B, dtype: bool

Use this Boolean series to return the rows where column B holds non-numeric data:


df[~numeric]
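
Some of the flagged values, such as the string '2.2', are numbers stored as text, while others, such as 'A+B', are not numbers at all. As a possible follow-up that goes slightly beyond the original walkthrough, pd.to_numeric with errors='coerce' can separate the two cases. The series col below is a small stand-in for column B, and the names fixable and unfixable are chosen here for illustration.

```python
import pandas as pd

# Stand-in for column B: two real numbers and two strings.
col = pd.Series([1, 2.2, '2.2', 'A+B'], dtype=object)

# Values that cannot be parsed as numbers become NaN.
converted = pd.to_numeric(col, errors='coerce')

# Same element-wise type check used in the tutorial.
is_number = col.map(lambda x: isinstance(x, (int, float)))

# Wrong type, but convertible to a number (e.g. '2.2').
fixable = ~is_number & converted.notna()

# Cannot be interpreted as a number at all (e.g. 'A+B').
unfixable = converted.isna()

print(col[fixable].tolist())    # ['2.2']
print(col[unfixable].tolist())  # ['A+B']
```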

Check datetime


dt = df.applymap(lambda x: isinstance(x, datetime.datetime))['D']
dt

Output:


0      True
1      True
2     False
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10     True
11     True
12    False
13     True
14     True
15     True
16     True
17     True
18     True
19     True
Name: D, dtype: bool

Use this Boolean series to return the rows where column D holds non-datetime data:


df[~dt]
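
The datetime check works on pandas timestamps because pd.Timestamp is a subclass of datetime.datetime. A plain datetime.date, a date-like string, or an integer does not pass. The short sketch below illustrates this with a hypothetical series named values, which is not part of the tutorial's dataframe.

```python
import datetime
import pandas as pd

# Mixed values that all "look like" dates to a human reader.
values = pd.Series(
    [
        pd.Timestamp('2020-01-01'),        # pandas timestamp
        datetime.datetime(2020, 1, 1),     # standard-library datetime
        datetime.date(2020, 1, 1),         # date only, not a datetime
        '2020-01-01',                      # string
        12,                                # integer
    ],
    dtype=object,
)

is_dt = values.map(lambda x: isinstance(x, datetime.datetime))
print(is_dt.tolist())  # [True, True, False, False, False]
```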

Check string


strings = df.applymap(lambda x: isinstance(x, str))['A']
strings

Your output should look something like this:


0     False
1     False
2      True
3      True
4      True
5      True
6      True
7      True
8      True
9      True
10    False
11     True
12     True
13     True
14     True
15     True
16     True
17     True
18     True
19     True
Name: A, dtype: bool

Use this Boolean series to return the rows where column A holds non-string data:


df[~strings]

Check Boolean


torf = df.applymap(lambda x: isinstance(x, bool))['C']
torf

You should see the following output:


0      True
1      True
2      True
3      True
4     False
5     False
6      True
7      True
8      True
9     False
10     True
11     True
12    False
13     True
14     True
15     True
16     True
17     True
18     True
19     True
Name: C, dtype: bool

Use this Boolean series to return the rows where column C holds non-Boolean data:


df[~torf]
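
The four checks above follow the same pattern, so they can be folded into one helper. The following is a minimal sketch under stated assumptions: the function name find_mistyped, the expected mapping, and the three-row sample dataframe are all hypothetical and chosen for illustration.

```python
import datetime
import pandas as pd

def find_mistyped(df, expected):
    """Map each column name to the index labels whose value has the wrong type.

    expected maps a column name to a type or tuple of types, as accepted
    by isinstance.
    """
    result = {}
    for col, types in expected.items():
        # Element-wise type check on a single column.
        ok = df[col].map(lambda x, t=types: isinstance(x, t))
        result[col] = df.index[(~ok).to_numpy()].tolist()
    return result

# Small sample with one or two mistyped values per column.
sample = pd.DataFrame(
    {
        'A': ['text', 1, 'text'],
        'B': [1, '2.2', 2.2],
        'D': [pd.Timestamp('2020-01-01'), 12, '2020-01-01'],
    },
    dtype=object,
)

expected = {'A': str, 'B': (int, float), 'D': datetime.datetime}
mistyped = find_mistyped(sample, expected)
print(mistyped)  # {'A': [1], 'B': [1], 'D': [1, 2]}
```

Returning index labels rather than Boolean series keeps the result compact; the labels can be passed to df.loc to inspect the offending rows.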