This tutorial explains how to identify date gaps in time series data with pyrasgo.

Packages

This tutorial uses pandas, numpy, and pyrasgo.

Open a new Jupyter Notebook and import the following:


import pandas as pd
import numpy as np
import pyrasgo

Connect to Rasgo

If you haven't done so already, head over to https://docs.rasgoml.com/rasgo-docs/onboarding/initial-setup and follow the steps outlined there to create your free account. This account gives you free access to the Rasgo API, which will calculate dataframe profiles, generate feature importance scores, and produce feature explainability for your analysis. In addition, this account allows you to maintain access to your analysis and share it with your colleagues.


rasgo = pyrasgo.login(email='', password='')

Creating the data

We will create a dataframe that contains multiple time series, one for each group.


np.random.seed(1066)
dates = pd.date_range(start='2010-01-01', end='2010-12-31', freq='D')
# Build one daily series per group, then stack them into a single dataframe.
# pd.concat is used because DataFrame.append was removed in pandas 2.0.
df = pd.concat([pd.DataFrame({'date': dates,
                              'group': group,
                              'value': np.random.randint(0, 100, size=len(dates))})
                for group in ['A', 'B', 'C']]).reset_index(drop=True)
df

Your dataframe should look like:


	date	group	value
0	2010-01-01	A	57
1	2010-01-02	A	11
2	2010-01-03	A	83
3	2010-01-04	A	83
4	2010-01-05	A	93
...	...	...	...
1090	2010-12-27	C	50
1091	2010-12-28	C	59
1092	2010-12-29	C	85
1093	2010-12-30	C	32
1094	2010-12-31	C	3

Next, drop some rows randomly to create gaps in the data.


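length = df.shape[0]
# Draw 100 random row positions; np.unique sorts them and removes duplicates,
# so slightly fewer than 100 rows may actually be dropped.
droplist = np.unique(np.random.randint(0, length, size=100)).tolist()
df = df.drop(droplist).reset_index(drop=True)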
df
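To confirm that dropping rows really created gaps, it can help to list the dates that are now missing from each group. The following is a minimal pandas-only sketch of that check; it does not use pyrasgo. The helper name missing_dates and the small demo dataframe are made up for illustration, and the helper assumes a daily series whose first and last dates are still present (dates dropped at either end of a series would not be reported).

```python
import pandas as pd

def missing_dates(frame, date_col="date", group_col="group", freq="D"):
    """Map each group to the dates absent between its first and last observation."""
    result = {}
    for name, sub in frame.groupby(group_col):
        # Every date we would expect to see if the series had no gaps.
        expected = pd.date_range(sub[date_col].min(), sub[date_col].max(), freq=freq)
        # Dates that are expected but not observed.
        result[name] = expected.difference(pd.DatetimeIndex(sub[date_col]))
    return result

# Small illustrative dataframe (hypothetical, not the tutorial's data).
demo = pd.DataFrame({
    "date": pd.to_datetime(["2010-01-01", "2010-01-02", "2010-01-05",
                            "2010-01-01", "2010-01-03"]),
    "group": ["A", "A", "A", "B", "B"],
    "value": [1, 2, 3, 4, 5],
})

missing = missing_dates(demo)
for name, dates in missing.items():
    print(name, [d.strftime("%Y-%m-%d") for d in dates])
```

Calling missing_dates(df) on the tutorial dataframe should work the same way, since it has the same date and group columns.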

Identify Date Gaps

In a single series

The function evaluate.timeseries_gaps will identify date gaps in the data.


gaps = rasgo.evaluate.timeseries_gaps(df[df.group == 'A'], datetime_column='date', partition_columns=['group'])
gaps

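That should return the rows on either side of each gap, with TSGAPLastDate and TSGAPNextDate showing the previous and next dates present in the series. The first and last rows of the series also appear, since they have no previous or next date: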


	date	group	value	TSGAPLastDate	TSGAPNextDate
0	2010-01-01	A	57	NaT	2010-01-02
38	2010-02-08	A	58	2010-02-07	2010-02-10
39	2010-02-10	A	97	2010-02-08	2010-02-11
43	2010-02-14	A	54	2010-02-13	2010-02-17
44	2010-02-17	A	93	2010-02-14	2010-02-19
45	2010-02-19	A	88	2010-02-17	2010-02-20
56	2010-03-02	A	93	2010-03-01	2010-03-04
57	2010-03-04	A	92	2010-03-02	2010-03-05
76	2010-03-23	A	21	2010-03-22	2010-03-25
77	2010-03-25	A	44	2010-03-23	2010-03-26
80	2010-03-28	A	10	2010-03-27	2010-03-30
81	2010-03-30	A	94	2010-03-28	2010-03-31
85	2010-04-03	A	47	2010-04-02	2010-04-05
86	2010-04-05	A	7	2010-04-03	2010-04-06
88	2010-04-07	A	67	2010-04-06	2010-04-10
89	2010-04-10	A	65	2010-04-07	2010-04-11
91	2010-04-12	A	75	2010-04-11	2010-04-15
92	2010-04-15	A	85	2010-04-12	2010-04-16
98	2010-04-21	A	24	2010-04-20	2010-04-23
99	2010-04-23	A	7	2010-04-21	2010-04-24
114	2010-05-08	A	89	2010-05-07	2010-05-10
115	2010-05-10	A	46	2010-05-08	2010-05-11
128	2010-05-23	A	45	2010-05-22	2010-05-25
129	2010-05-25	A	50	2010-05-23	2010-05-26
131	2010-05-27	A	3	2010-05-26	2010-05-29
132	2010-05-29	A	71	2010-05-27	2010-05-30
135	2010-06-01	A	67	2010-05-31	2010-06-03
136	2010-06-03	A	42	2010-06-01	2010-06-04
163	2010-06-30	A	83	2010-06-29	2010-07-02
164	2010-07-02	A	26	2010-06-30	2010-07-03
174	2010-07-12	A	30	2010-07-11	2010-07-14
175	2010-07-14	A	50	2010-07-12	2010-07-15
197	2010-08-05	A	95	2010-08-04	2010-08-07
198	2010-08-07	A	6	2010-08-05	2010-08-08
200	2010-08-09	A	21	2010-08-08	2010-08-11
201	2010-08-11	A	84	2010-08-09	2010-08-12
208	2010-08-18	A	49	2010-08-17	2010-08-20
209	2010-08-20	A	14	2010-08-18	2010-08-21
211	2010-08-22	A	23	2010-08-21	2010-08-24
212	2010-08-24	A	60	2010-08-22	2010-08-25
237	2010-09-18	A	88	2010-09-17	2010-09-20
238	2010-09-20	A	39	2010-09-18	2010-09-21
245	2010-09-27	A	94	2010-09-26	2010-09-29
246	2010-09-29	A	34	2010-09-27	2010-09-30
258	2010-10-11	A	2	2010-10-10	2010-10-13
259	2010-10-13	A	27	2010-10-11	2010-10-14
269	2010-10-23	A	68	2010-10-22	2010-10-25
270	2010-10-25	A	19	2010-10-23	2010-10-27
271	2010-10-27	A	3	2010-10-25	2010-10-28
290	2010-11-15	A	69	2010-11-14	2010-11-17
291	2010-11-17	A	1	2010-11-15	2010-11-18
296	2010-11-22	A	12	2010-11-21	2010-11-24
297	2010-11-24	A	39	2010-11-22	2010-11-26
298	2010-11-26	A	28	2010-11-24	2010-11-27
301	2010-11-29	A	35	2010-11-28	2010-12-01
302	2010-12-01	A	3	2010-11-29	2010-12-02
332	2010-12-31	A	70	2010-12-30	NaT
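
To see the idea behind this kind of output without calling the Rasgo API, here is a minimal pandas-only sketch. It is not pyrasgo's implementation, only one way to get a similar result under the assumption of a daily series. The function name find_gaps and the column names prev_date and next_date are hypothetical.

```python
import pandas as pd

def find_gaps(series_df, date_col="date", freq="1D"):
    """Return rows whose previous or next date is not exactly one step away."""
    ordered = series_df.sort_values(date_col).copy()
    step = pd.Timedelta(freq)
    # Neighbouring dates within the series; NaT at the two ends.
    ordered["prev_date"] = ordered[date_col].shift(1)
    ordered["next_date"] = ordered[date_col].shift(-1)
    # A NaT difference never equals the step, so the first and last rows
    # are always kept, as in the output shown above.
    gap_before = (ordered[date_col] - ordered["prev_date"]) != step
    gap_after = (ordered["next_date"] - ordered[date_col]) != step
    return ordered[gap_before | gap_after]

# Small illustrative series with 2010-01-04 missing (hypothetical data).
demo = pd.DataFrame({
    "date": pd.to_datetime(["2010-01-01", "2010-01-02", "2010-01-03",
                            "2010-01-05", "2010-01-06", "2010-01-07"]),
    "value": [10, 20, 30, 40, 50, 60],
})

gaps_demo = find_gaps(demo)
print(gaps_demo)
```

In this demo the rows dated 2010-01-03 and 2010-01-05 are flagged because they sit on either side of the missing date, along with the first and last rows of the series.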

In multiple time series

Passing the series identifier (group in this case) into evaluate.timeseries_gaps using the partition_columns parameter checks for date gaps in each of the series independently.


gaps = rasgo.evaluate.timeseries_gaps(df, datetime_column='date', partition_columns=['group'])
gaps

Your dataframe should look like this:


	date	group	value	TSGAPLastDate	TSGAPNextDate
0	2010-01-01	A	57	NaT	2010-01-02
38	2010-02-08	A	58	2010-02-07	2010-02-10
39	2010-02-10	A	97	2010-02-08	2010-02-11
43	2010-02-14	A	54	2010-02-13	2010-02-17
44	2010-02-17	A	93	2010-02-14	2010-02-19
...	...	...	...	...	...
973	2010-12-05	C	51	2010-12-04	2010-12-08
974	2010-12-08	C	1	2010-12-05	2010-12-09
984	2010-12-18	C	71	2010-12-17	2010-12-20
985	2010-12-20	C	39	2010-12-18	2010-12-21
996	2010-12-31	C	3	2010-12-30	NaT
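
Checking each series independently matters because, without partitioning, the last date of one group would be compared with the first date of the next. A rough pandas-only sketch of the partitioned check follows. As before, this is an illustration and not how pyrasgo is implemented; find_gaps_by_group, prev_date, and next_date are hypothetical names, and a daily frequency is assumed.

```python
import pandas as pd

def find_gaps_by_group(frame, date_col="date", group_cols=("group",), freq="1D"):
    """Flag rows next to a date gap, evaluating each group separately."""
    keys = list(group_cols)
    ordered = frame.sort_values(keys + [date_col]).copy()
    step = pd.Timedelta(freq)
    # Shifting inside each group keeps neighbours from leaking across series.
    dates_by_group = ordered.groupby(keys)[date_col]
    ordered["prev_date"] = dates_by_group.shift(1)
    ordered["next_date"] = dates_by_group.shift(-1)
    gap_before = (ordered[date_col] - ordered["prev_date"]) != step
    gap_after = (ordered["next_date"] - ordered[date_col]) != step
    return ordered[gap_before | gap_after]

# Group A is missing 2010-01-03; group B is complete (hypothetical data).
demo = pd.DataFrame({
    "date": pd.to_datetime(["2010-01-01", "2010-01-02", "2010-01-04",
                            "2010-01-01", "2010-01-02", "2010-01-03"]),
    "group": ["A", "A", "A", "B", "B", "B"],
    "value": [1, 2, 3, 4, 5, 6],
})

gaps_by_group = find_gaps_by_group(demo)
print(gaps_by_group)
```

For the complete group B, only its first and last rows are returned, while group A also returns the rows around its missing date.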