
Third Time's the Charm?

by Andrew Engel on 4/26/2022

Survival analysis is a common problem across many different industries. Examples include patient survival in medical trials, customer attrition, and hardware failure, among others. Survival analysis using machine learning is a complex process and beyond the scope of a single blog. However, calculations common to survival analysis models are also useful in other time series applications. Unfortunately, time series data, especially for hardware failure and online customer attrition, can become quite large, making it difficult to work with in pandas on a single machine. In this blog, we will consider one of these calculations (moving averages) and discuss alternatives to pandas for creating these features.

This blog will use data from Backblaze that contains daily SMART stats for all of the hard drives in their data centers. In Q2 2021, Backblaze had almost 180,000 drives in their data centers. We'll consider an analysis to predict the expected life of a hard drive given its stats to date. The most recent data contains 124 columns, of which Backblaze actively monitors a handful as likely indicators of failure. These are:

  • SMART 5: Reallocated Sectors Count
  • SMART 187: Reported Uncorrectable Errors
  • SMART 188: Command Timeout
  • SMART 197: Current Pending Sector Count
  • SMART 198: Uncorrectable Sector Count

In this analysis, we will assume all daily data is stored in a cloud data warehouse, and we will show three ways to create moving averages for all five indicators. In addition, because we are interested in the impact the drive's temperature has on failure, we will also consider SMART 194 (Temperature).
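For reference, the six raw SMART columns used throughout the examples below (the five failure indicators plus temperature) can be collected in a small Python list. The name SMART_COLUMNS is introduced here purely for convenience and is not part of the original workflow:

# Raw SMART columns used as features in the examples below
SMART_COLUMNS = [
    'smart_5_raw',    # Reallocated Sectors Count
    'smart_187_raw',  # Reported Uncorrectable Errors
    'smart_188_raw',  # Command Timeout
    'smart_194_raw',  # Temperature
    'smart_197_raw',  # Current Pending Sector Count
    'smart_198_raw',  # Uncorrectable Sector Count
]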

Many data scientists are most comfortable working in pandas on a single machine. To build the moving average with pandas, the data must first be extracted from the data warehouse.
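The pandas examples that follow assume pandas is already imported and that a SQLAlchemy engine pointing at the warehouse exists. A minimal setup sketch, with a placeholder connection string rather than a real endpoint, might look like this:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; substitute your warehouse's actual URL and credentials
engine = create_engine('postgresql://user:password@warehouse-host:5432/analytics')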

sqltext = """SELECT date, serial_number, model, failure, smart_5_raw, smart_187_raw, smart_188_raw, smart_194_raw, smart_197_raw, smart_198_raw FROM daily WHERE model = 'ST12000NM0008' AND date between '2020-10-01' and '2020-12-31' ORDER BY serial_number, date""" df = pd.read_sql(sqltext, engine, parse_dates=['date'])

The 7-day moving average for any one of these fields (the temperature, in this case) can be created by

df['mean_temp_7'] = df.groupby('serial_number')['smart_194_raw'] \
                      .rolling(7, min_periods=1).mean().reset_index(drop=True)

This could be repeated for each of the other features. Alternatively, a rolling grouping could be created

rolling_group = df.groupby('serial_number').rolling(7,min_periods=1)

Next, agg can be applied

tdf = rolling_group.agg({'smart_5_raw': 'mean',
                         'smart_187_raw': 'mean',
                         'smart_188_raw': 'mean',
                         'smart_194_raw': 'mean',
                         'smart_197_raw': 'mean',
                         'smart_198_raw': 'mean'})

In this new dataframe, the columns now hold the rolling 7-day averages but keep their original names, so rename them to make it clear what they contain.

moving_avg_df = tdf.rename(columns={'smart_5_raw': 'mean_smart_5_raw',
                                    'smart_187_raw': 'mean_smart_187_raw',
                                    'smart_188_raw': 'mean_smart_188_raw',
                                    'smart_194_raw': 'mean_smart_194_raw',
                                    'smart_197_raw': 'mean_smart_197_raw',
                                    'smart_198_raw': 'mean_smart_198_raw'})

Finally, merge these moving averages back into the original dataframe.

final_df = df.merge(moving_avg_df.reset_index(level='serial_number', drop=True),
                    left_index=True, right_index=True)

With multiple years' worth of data, this can be very time consuming on a single machine. However, it is possible to take advantage of the fact that this data is stored in a cloud data warehouse. We can calculate the 7-day moving average of a field in SQL as

AVG(<column>) OVER(PARTITION BY serial_number ORDER BY date
                   ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS mean_<column>

This means we can pull the final data directly into a pandas dataframe by modifying the SQL sent to pd.read_sql.

sqltext = "" "SELECT date, serial_number, model, failure, smart_5_raw, smart_187_raw, smart_188_raw, smart_194_raw, smart_197_raw, smart_198_raw , AVG(smart_5_raw) OVER(PARTITION BY serial_number ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS mean_smart_5_raw, AVG(smart_187_raw) OVER(PARTITION BY serial_number ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS mean_smart_187_raw, AVG(smart_188_raw) OVER(PARTITION BY serial_number ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS mean_smart_188_raw, AVG(smart_194_raw) OVER(PARTITION BY serial_number ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS mean_smart_194_raw, AVG(smart_197_raw) OVER(PARTITION BY serial_number ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS mean_smart_197_raw, AVG(smart_198_raw) OVER(PARTITION BY serial_number ORDER BY date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS mean_smart_198_raw FROM daily WHERE model = 'ST12000NM0008' AND date between '2020-10-01' and '2020-12-31' ORDER by serial_number, date""" final_df = pd.read_sql(sqltext, engine, parse_dates=['date'])

This gives us the same data as we created in pandas.
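As a quick sanity check (not part of the original workflow), the pandas-built result and the SQL-built result can be compared after sorting both the same way. The names final_df_pandas and final_df_sql below are hypothetical stand-ins for the two dataframes created above:

import numpy as np

# Hypothetical names for the pandas-built and SQL-built results created above
mean_cols = ['mean_smart_5_raw', 'mean_smart_187_raw', 'mean_smart_188_raw',
             'mean_smart_194_raw', 'mean_smart_197_raw', 'mean_smart_198_raw']

# Sort both results identically so rows line up, then compare the moving averages
pandas_vals = final_df_pandas.sort_values(['serial_number', 'date'])[mean_cols].to_numpy(dtype=float)
sql_vals = final_df_sql.sort_values(['serial_number', 'date'])[mean_cols].to_numpy(dtype=float)

# The two approaches should agree up to floating-point tolerance
assert np.allclose(pandas_vals, sql_vals, equal_nan=True)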

Finally, because this data is located in a cloud data warehouse, if Rasgo is connected to that warehouse you can use Rasgo to generate the code for these transformations, execute them on the warehouse from Python, and save the results back into the warehouse.

First, connect to Rasgo

import pyrasgo

rasgo = pyrasgo.connect('')

Connect to the data source (the table that contains the hard drive statistics).

datasource = rasgo.get.data_source(id=<Data Source ID>)

Create the transformation that builds the moving average (in this case, for an arbitrarily sized window). The transformation text is a SQL SELECT statement contained in a Jinja template.

sqltext = "" "SELECT *{%- for column in fields_to_average -%}     {%- for window in window_sizes -%}, avg({{column}}) OVER(PARTITION BY {{serial_dim}} ORDER BY {{date_dim}} ROWS BETWEEN {{window - 1}} PRECEDING AND CURRENT ROW) AS mean_{{column}}_{{window}}     {%- endfor -%}{%- endfor -%} FROM {{source_table}}"""new_transform = rasgo.create.transform(name="moving_average", source_code=sqltext)

Next, this template can be applied against this data for a single field by calling

newsource = datasource.transform(transform_name='moving_average',
                                 fields_to_average=['SMART_194_RAW'],
                                 serial_dim="SERIAL_NUMBER",
                                 date_dim="DATE",
                                 window_sizes=[7])

And the data can be created on the data warehouse and published on Rasgo by calling

transformed_source = newsource.to_source(new_table_name="")

These transformations can be chained together to create moving averages for all six features. Further, other transformations built in a similar way can be combined to create additional features. In this case, because the transform was built to take lists of features and window sizes, the 7- and 14-day moving averages for all six features can be created in a single call.

newsource = datasource.transform(transform_name='moving_average',
                                 fields_to_average=['SMART_5_RAW', 'SMART_187_RAW',
                                                    'SMART_188_RAW', 'SMART_194_RAW',
                                                    'SMART_197_RAW', 'SMART_198_RAW'],
                                 serial_dim="SERIAL_NUMBER",
                                 date_dim="DATE",
                                 window_sizes=[7, 14])

And this can be published as

newsource = newsource.to_source(new_table_name="")

This approach has several advantages. First, as mentioned, these features are created on the cloud data warehouse and immediately saved to a table on that warehouse. In addition to leveraging the warehouse's ability to run the calculations, this saves the I/O time required to move the data to and from the computer running Python. Because the result is saved to the database, it is immediately available to all users with access to the warehouse, not just users on the same computer. Finally, because it is published in Rasgo, features can be created from this source.

newsource = rasgo.publish.features_from_source(data_source_id=newsource.id,
                                               dimensions=['DATE', 'SERIAL_NUMBER', 'MODEL'],
                                               granularity=['day', 'serial_number', 'model'],
                                               features=['MEAN_SMART_5_RAW_7', 'MEAN_SMART_187_RAW_7',
                                                         'MEAN_SMART_188_RAW_7', 'MEAN_SMART_194_RAW_7',
                                                         'MEAN_SMART_197_RAW_7', 'MEAN_SMART_198_RAW_7'],
                                               tags=['drive_failure'],
                                               sandbox=False)

This makes these features discoverable on the Rasgo platform and available to be combined with other features in Rasgo. In addition, Rasgo will generate a feature profile allowing a quick review of the features generated.

This feature profile includes:

  • A histogram showing the distribution of values
  • The average value over time
  • Common statistics for the feature
  • Comparisons between this feature and other features over time
  • A comparison of the distribution of this feature and another feature

While many data scientists are comfortable working in pandas, there are considerable advantages to leveraging the power of cloud data warehouses when generating features. In this blog, we've seen how a data scientist can go from working locally in pandas to creating, and even exploring, features in Rasgo for the rest of the team to use.