Refined Data is the New Fuel part 2

by Andrew Engel on 4/26/2022

In part one, we used the quote “Data is the new oil” to examine the state of data science and the lack of investment in tools to improve the data slog. This lack of investment persists despite the fact that 50% to 80% of a data scientist’s time is spent on data preparation and feature engineering. Even Andrew Ng has begun talking about shifting priorities to this portion of the machine learning lifecycle: “If 80 percent of our work is data preparation, then ensuring data quality is the important work of a machine learning team.” More damning, despite massive investment in tools for the rest of the data science lifecycle, data science projects still commonly fail to produce significant ROI.

The data preparation process.

The data slog is not just the technical work of extracting data, profiling it, handling outliers, and creating new features from existing data. A large part of the job requires soft skills: meetings to find the necessary data, to get access to it, and to work with the business and data owners to understand what the data contains. Once profiling is done, the knowledge gained needs to be shared back with the business, both to validate the data scientist’s understanding and to provide value to the business from its existing data.
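
To ground the technical half of that work, here is a minimal sketch of profiling, outlier handling, and feature creation in pandas. The file name and columns (customer_id, date, amount) are hypothetical stand-ins for whatever the business data actually looks like.

```python
import pandas as pd

# Hypothetical extract; in practice this comes from a warehouse query.
txns = pd.read_csv("transactions.csv", parse_dates=["date"])

# Profile the data: types, distribution, and missingness.
print(txns.dtypes)
print(txns["amount"].describe())
print(txns.isna().mean())  # fraction of missing values per column

# Handle outliers, here by clipping amounts at the 1st/99th percentiles.
lo, hi = txns["amount"].quantile([0.01, 0.99])
txns["amount_clipped"] = txns["amount"].clip(lo, hi)

# Create a new feature from existing data: each customer's 30-day spend.
cutoff = txns["date"].max() - pd.Timedelta(days=30)
spend_30d = (
    txns[txns["date"] >= cutoff]
    .groupby("customer_id")["amount_clipped"]
    .sum()
    .rename("spend_30d")
)
```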

At the beginning of the process, the data scientist has an initial list of data sources on which to perform the above work. During this initial preparation, however, gaps in the data may be identified, leading to additional meetings and work for the data scientist.

Once all of the data has been collected and analyzed, the data scientist needs to work out how to merge it. While some of the information necessary for this merge will have already been captured, additional context is often needed. This means more meetings with the data owners before the data scientist can complete the technical work necessary to even begin creating models.

Why is data preparation a slog?

To make matters worse, any time something changes in the source data warehouse, the data scientist needs to identify the change and find a solution. This could mean working with the data owner to fix the data in the warehouse, writing new feature engineering code to transform the data back into its original form, or rebuilding the models to use the new data.
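
There is no standard fix for this today, but one hedged sketch of catching upstream changes is to compare the source table against a schema and null-rate snapshot saved when the model was built. The function names, fields, and tolerance below are illustrative assumptions, not an established API.

```python
import pandas as pd

def snapshot(df: pd.DataFrame) -> dict:
    """Record the schema and null rates of a table at model-build time."""
    return {
        "columns": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "null_rate": df.isna().mean().to_dict(),
    }

def detect_drift(df: pd.DataFrame, saved: dict, tol: float = 0.05) -> list:
    """Flag dropped columns, type changes, and jumps in missingness."""
    current = snapshot(df)
    issues = []
    for col, dtype in saved["columns"].items():
        if col not in current["columns"]:
            issues.append(f"column dropped: {col}")
        elif current["columns"][col] != dtype:
            issues.append(f"type changed: {col} ({dtype} -> {current['columns'][col]})")
        elif current["null_rate"][col] - saved["null_rate"][col] > tol:
            issues.append(f"null rate jumped: {col}")
    return issues
```

Any non-empty result is a prompt to go back to the data owner before the model silently degrades.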

This means that a lot of technical work must be done before models can be built and deployed. Often, however, the vast majority of data preparation time goes to meetings: finding the data, getting access to it, understanding it, and developing the domain expertise needed to even begin the preparation. While most data scientists have the technical skills to perform these tasks, most are trained to build models and believe that is the primary purpose of being a data scientist. Far fewer are interested in developing domain expertise or in the necessary interactions with the rest of the business.

Is it any wonder that data scientists complain about data preparation and would rather spend their time building models? As a result, the data science community places model building above feature engineering and data preparation in importance, even though, without quality features, models will not perform.

This data slog does not just impact the data scientist working on the initial project. Because the data preparation process is difficult and the domain knowledge behind it is rarely captured, a data scientist tasked with rebuilding or improving an existing model will need to repeat much of it. Even if the code exists, understanding what it does and modifying it to handle changes to the data sources can be laborious. More importantly, to do even that, the new data scientist must develop the domain knowledge to understand the data and why the preparation code works the way it does. That involves the same set of meetings and conversations with the business and data owners as the original project.

This repeated work doesn’t just affect the data scientist working on a model refresh; it can hurt the business in more serious ways. Because the knowledge of how to create a feature or metric is embedded in data preparation and feature engineering code, it is difficult for others to even know it exists. Two data scientists working on different projects but needing the same metric will therefore often create two slightly different versions of it, leaving different parts of the organization with different views of the business and multiple answers to the same question.

Data science needs to focus on the features, not the models.

To resolve this problem, we believe we need to turn the data preparation process on its side and focus on the features, not the models. It is the features that result from refining the data, and it is the features that contain the valuable signal. In the end, the output of a model is just another feature that can itself become an input into another model.

When we focus on the feature, we see all of data preparation, from data catalogs to feature engineering, as a process for extracting signals from the data. It turns out data preparation is not a menial task; it is a critical and valuable part of extracting signals from data. And finding and utilizing these signals is the point of data science: without them, data-driven decisions will be blind.

Finally, when we focus on the feature, we realize that data preparation and feature engineering tools need to treat each feature as a column of data that carries a signal about some behavior. Instead of focusing on data in tables or files, data science needs to work with features independent of the underlying storage mechanism. Tools should abstract away the technical details of where the data resides, how it can be joined to other data, and even how it can be aggregated to different levels of granularity.
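
As one illustration of what that abstraction could look like, the toy sketch below treats a feature as a named column at a declared granularity, with the storage details hidden behind a loader function. The Feature class and join_features helper are hypothetical, not an existing library’s API.

```python
from dataclasses import dataclass
from typing import Callable

import pandas as pd

@dataclass
class Feature:
    name: str
    entity: str                    # granularity, e.g. "customer_id"
    load: Callable[[], pd.Series]  # hides whether the data lives in a file,
                                   # a warehouse table, or a stream

def join_features(features: list, entity: str) -> pd.DataFrame:
    """Assemble a modeling table at the requested granularity."""
    columns = [f.load().rename(f.name) for f in features if f.entity == entity]
    return pd.concat(columns, axis=1)  # joins on the shared entity index
```

The data scientist asks for features by name and granularity; where each column physically lives never enters the modeling code.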

Beyond that, all features can be stored in one central repository along with metadata that documents where the data came from, who owns it, what processing has been performed on it, the domain knowledge needed to understand it, its statistical profile, its relationships to other features, and where it is being used. By making all of this metadata easily searchable, data scientists can find and understand existing features, both from prior versions of their own project and from unrelated projects. This, in turn, makes it easy to gather all relevant features for a machine learning model and to execute (and document) any additional feature engineering they apply.
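
A minimal sketch of the metadata such a repository might track per feature, mirroring the list above. The record fields and the toy in-memory search are assumptions for illustration, not a real catalog API.

```python
from dataclasses import dataclass

@dataclass
class FeatureRecord:
    name: str              # e.g. "spend_30d"
    source: str            # where the data came from
    owner: str             # who owns it
    transformations: list  # what processing has been performed
    notes: str             # domain knowledge needed to understand it
    profile: dict          # statistical profile (mean, null rate, ...)
    used_by: list          # models and projects where it is used

def search(repo: list, term: str) -> list:
    """Toy keyword search across feature names and domain notes."""
    term = term.lower()
    return [r for r in repo if term in r.name.lower() or term in r.notes.lower()]
```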

This lets data scientists focus on what really matters: the features. Tools that focus on the feature make it easy to pull only the features they want, aggregated at the level they want, into their models. Further, by focusing on the features, data scientists are equipped to find and capture the signal, develop better features, and easily share those features with the entire organization.

The feature is the future.

The feature is the future of data science. Although data preparation and feature engineering are where most of the time in the data science lifecycle is spent, the community has focused its efforts on optimizing the stages on either side of them. Unless we fix the data preparation and feature engineering experience, data science will continue to fall short of delivering significant returns on the investments being made.