Survival analysis is a common problem across many industries: patient survival in medical trials, customer attrition, and hardware failure, among others. Survival analysis using machine learning is a complex process and beyond the scope of a single blog. However, calculations common to survival analysis models are also useful in other time series applications. Unfortunately, time series data, especially for hardware failure and online customer attrition, can become quite large, making it difficult to work with in pandas on a single machine. In this blog, we will consider one of these calculations (moving averages) and discuss alternatives to pandas for creating these features.
This blog will use data from Backblaze that contains daily SMART stats for all of the hard drives in their data centers. In Q2 2021, Backblaze had almost 180,000 drives in their data centers. We'll consider an analysis to predict the expected life of a hard drive given its stats to date. In the most recent data, there are 124 columns of data. Of these, Backblaze actively monitors a handful as likely indicators of failure. These are:
- SMART 5: Reallocated Sectors Count
- SMART 187: Reported Uncorrectable Errors
- SMART 188: Command Timeout
- SMART 197: Current Pending Sector Count
- SMART 198: Uncorrectable Sector Count
In this analysis, we will assume all daily data is stored in a cloud data warehouse, and we will show three ways to create moving averages for all five indicators. In addition, because we are interested in the impact the drive's temperature has on failure, we will also consider SMART 194.
Many data scientists are most comfortable working in pandas on a single machine. To build the moving average with pandas, the data must first be extracted from the data warehouse.
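As a minimal sketch of that extraction, an in-memory SQLite database stands in for the cloud warehouse here so the example is self-contained; the table and column names (`drive_stats`, `serial_number`, `smart_194_raw`) are assumptions, not Backblaze's exact schema:

```python
import sqlite3
import pandas as pd

# Illustrative stand-in for the warehouse: in practice this would be a
# connection to the cloud data warehouse (e.g. via SQLAlchemy).
conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "serial_number": ["A", "A", "B"],
    "date": ["2021-04-01", "2021-04-02", "2021-04-01"],
    "smart_194_raw": [31.0, 33.0, 29.0],
}).to_sql("drive_stats", conn, index=False)

# Pull the daily stats down into a pandas dataframe.
df = pd.read_sql(
    "SELECT serial_number, date, smart_194_raw FROM drive_stats", conn
)
```

From here, all feature engineering happens locally in pandas, which is where the scaling problems begin.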
The 7-day moving average for each of these fields (in this case, the temperature) can be created with a grouped rolling window.
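A sketch of the single-column approach, with toy data in place of the real drive stats (the column names `serial_number` and `smart_194_raw` are assumptions):

```python
import pandas as pd

# Toy daily SMART data: one row per drive per day.
df = pd.DataFrame({
    "serial_number": ["A"] * 10 + ["B"] * 10,
    "date": list(pd.date_range("2021-04-01", periods=10)) * 2,
    "smart_194_raw": [float(i) for i in range(20)],
})

# 7-day moving average of temperature, computed per drive.
# min_periods=1 is a choice that avoids NaNs in a drive's first six
# days; drop it to require full 7-day windows instead.
df["smart_194_raw_7d_avg"] = (
    df.sort_values("date")
      .groupby("serial_number")["smart_194_raw"]
      .transform(lambda s: s.rolling(7, min_periods=1).mean())
)
```

Because `transform` returns values aligned to the original index, the result can be assigned straight back onto the dataframe.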
This could be repeated for each of the other features. Alternatively, a rolling grouping could be created for all of the features at once.
Next, agg could be applied to compute the averages.
In this new dataframe, the columns hold the rolling 7-day averages but still have their original names, so rename them to make clear what they contain.
Finally, merge these moving averages back into the original dataframe.
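Putting those steps together — the rolling grouping, agg, the rename, and the merge — a self-contained sketch with toy data (column names are assumptions) might look like this:

```python
import pandas as pd

# Toy daily stats for two drives; column names are illustrative.
df = pd.DataFrame({
    "serial_number": ["A"] * 10 + ["B"] * 10,
    "date": list(pd.date_range("2021-04-01", periods=10)) * 2,
    "smart_5_raw": [float(i) for i in range(20)],
    "smart_194_raw": [float(2 * i) for i in range(20)],
})
feature_cols = ["smart_5_raw", "smart_194_raw"]

# Rolling grouping: 7-day windows within each drive, then agg for the mean.
rolled = (
    df.sort_values("date")
      .groupby("serial_number")[feature_cols]
      .rolling(7, min_periods=1)
      .agg("mean")
)

# The aggregated frame keeps the original column names, so rename them.
rolled = rolled.rename(columns={c: f"{c}_7d_avg" for c in feature_cols})

# Drop the serial_number level so the index lines up with the original
# frame, then merge the moving averages back in.
rolled = rolled.reset_index(level=0, drop=True)
df = df.merge(rolled, left_index=True, right_index=True)
```

The `groupby(...).rolling(...)` result carries the group key as an extra index level, which is why it must be dropped before merging on the index.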
With multiple years' worth of data, this can be very time consuming on a single machine. However, we can take advantage of the fact that this data is stored in a cloud data warehouse: the 7-day moving average of a field can be calculated in SQL with a window function.
This means we can pull the final data directly into a pandas dataframe by modifying the SQL sent to pd.read_sql.
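A runnable sketch of pushing the calculation down to SQL — an in-memory SQLite database stands in for the cloud warehouse here so the example runs anywhere, and the table and column names are assumptions:

```python
import sqlite3
import pandas as pd

# Stand-in for the warehouse connection.
conn = sqlite3.connect(":memory:")
pd.DataFrame({
    "serial_number": ["A"] * 10,
    "date": [f"2021-04-{d:02d}" for d in range(1, 11)],
    "smart_194_raw": [float(i) for i in range(10)],
}).to_sql("drive_stats", conn, index=False)

# The moving average is computed by the database with a window function;
# ROWS BETWEEN 6 PRECEDING AND CURRENT ROW gives a 7-day window.
sql = """
SELECT
    serial_number,
    date,
    smart_194_raw,
    AVG(smart_194_raw) OVER (
        PARTITION BY serial_number
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS smart_194_raw_7d_avg
FROM drive_stats
"""
df = pd.read_sql(sql, conn)
```

Only the finished features cross the wire; the heavy lifting stays on the warehouse.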
This gives us the same data as we created in pandas.
Finally, because this data is located in a cloud data warehouse, if Rasgo is connected to that warehouse, you can use Rasgo to generate the code for these transformations, execute them on the warehouse from Python, and save the results back to the warehouse.
First, connect to Rasgo
Then, connect to the data source (the table that contains the hard drive statistics).
Next, define the transformation that builds the moving average (in this case, for an arbitrarily sized window). The transformation text is a SQL SELECT statement contained in a Jinja template.
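One possible shape for such a template — the variable names (`columns`, `windows`, `partition_col`, `date_col`, `source_table`) are illustrative assumptions, not Rasgo's actual template schema:

```sql
SELECT *
{%- for col in columns %}
{%- for window in windows %}
  , AVG({{ col }}) OVER (
      PARTITION BY {{ partition_col }}
      ORDER BY {{ date_col }}
      ROWS BETWEEN {{ window - 1 }} PRECEDING AND CURRENT ROW
    ) AS {{ col }}_{{ window }}d_avg
{%- endfor %}
{%- endfor %}
FROM {{ source_table }}
```

Putting the comma before each windowed expression sidesteps the trailing-comma problem in the nested Jinja loops.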
Next, this template can be applied to the data for a single field by calling
And the data can be created on the data warehouse and published on Rasgo by calling
These transformations can be chained together to create moving averages for all six features. Furthermore, other transformations built in a similar way can be combined to create additional features. However, because the transform was built to take lists of features and window sizes, the 7- and 14-day moving averages for all six features can be created simultaneously.
And this can be published as
This approach has several advantages. First, as mentioned, these features are created on the cloud data warehouse and immediately saved to a table on that warehouse. In addition to leveraging the warehouse's ability to run the calculations, it saves the I/O time to move the data to and from the computer running Python. Because the result is saved to the database, it is immediately available to all users with access to the warehouse, not just users on the same computer. Finally, because it is published in Rasgo, features can be created from this source.
This makes these features discoverable on the Rasgo platform and available to be combined with other features in Rasgo. In addition, Rasgo will generate a feature profile allowing a quick review of the features generated.
This feature profile includes:
- a histogram showing the distribution of values,
- the average value over time,
- common statistics for this feature,
- comparisons between this feature and other features over time,
- and a comparison of the distributions of this feature and another feature.
While data scientists are comfortable working in pandas, there are considerable advantages to leveraging the power of cloud data warehouses when generating features. In this blog, we've seen how a data scientist can go from working locally in pandas to creating, and even exploring, features for the rest of the team to use in Rasgo.