Packages
This tutorial uses:
import pandas as pd
import numpy as np
Creating the data
We will create a dataframe that contains multiple time series, one for each group.
np.random.seed(1066)
dates = pd.date_range(start='2010-01-01', end='2010-12-31', freq='D')
df = pd.DataFrame({'date': dates,
'group': 'A',
'value': np.random.randint(0, 100, size=len(dates))
}).append(pd.DataFrame({'date': dates,
'group': 'B',
'value': np.random.randint(0, 100, size=len(dates))
})).append(pd.DataFrame({'date': dates,
'group': 'C',
'value': np.random.randint(0, 100, size=len(dates))
})).reset_index(drop=True)
df
Your output should look something like this:
date group value
0 2010-01-01 A 57
1 2010-01-02 A 11
2 2010-01-03 A 83
3 2010-01-04 A 83
4 2010-01-05 A 93
... ... ... ...
1090 2010-12-27 C 50
1091 2010-12-28 C 59
1092 2010-12-29 C 85
1093 2010-12-30 C 32
1094 2010-12-31 C 3
Drop some rows randomly to create gaps in the data.
length = df.shape[0]
droplist = np.unique(np.sort(np.random.randint(0, length, size=100))).tolist()
df = df.drop(droplist).reset_index(drop=True)
df
output:
date group value
0 2010-01-01 A 57
1 2010-01-02 A 11
2 2010-01-03 A 83
3 2010-01-04 A 83
4 2010-01-05 A 93
... ... ... ...
992 2010-12-27 C 50
993 2010-12-28 C 59
994 2010-12-29 C 85
995 2010-12-30 C 32
996 2010-12-31 C 3
Identify Date Gaps
In a single series
The pandas function diff applied to the date field will generate the time between subsequent datas. Comparing that to 1 day can identify date gaps in the data.
singleseries = df[df.group == 'A']
singleseries['gap'] = singleseries['date'].sort_values().diff() > pd.to_timedelta('1 day')
singleseries[singleseries.gap]
date group value gap
39 2010-02-10 A 97 True
44 2010-02-17 A 93 True
45 2010-02-19 A 88 True
57 2010-03-04 A 92 True
77 2010-03-25 A 44 True
81 2010-03-30 A 94 True
86 2010-04-05 A 7 True
89 2010-04-10 A 65 True
92 2010-04-15 A 85 True
99 2010-04-23 A 7 True
115 2010-05-10 A 46 True
129 2010-05-25 A 50 True
132 2010-05-29 A 71 True
136 2010-06-03 A 42 True
164 2010-07-02 A 26 True
175 2010-07-14 A 50 True
198 2010-08-07 A 6 True
201 2010-08-11 A 84 True
209 2010-08-20 A 14 True
212 2010-08-24 A 60 True
238 2010-09-20 A 39 True
246 2010-09-29 A 34 True
259 2010-10-13 A 27 True
270 2010-10-25 A 19 True
271 2010-10-27 A 3 True
291 2010-11-17 A 1 True
297 2010-11-24 A 39 True
298 2010-11-26 A 28 True
302 2010-12-01 A 3 True
In multiple time series
The same process can be applied to data with multiple time series, we just need to group on the time series identifier (group) before applying the
df['gap'] = df[['group', 'date']].sort_values(by=['group', 'date']).groupby('group').diff() > pd.to_timedelta('1 day')
df[df.gap]
output:
date group value gap
39 2010-02-10 A 97 True
44 2010-02-17 A 93 True
45 2010-02-19 A 88 True
57 2010-03-04 A 92 True
77 2010-03-25 A 44 True
... ... ... ... ...
920 2010-10-11 C 98 True
950 2010-11-11 C 0 True
958 2010-11-20 C 72 True
974 2010-12-08 C 1 True
985 2010-12-20 C 39 True