Data Productionalization Checklist
Data science teams often struggle with the transition from modeling to production. MLOps tools aim to solve the short-term challenge of putting models into production, but there are additional longer-term challenges that are often overlooked.
First, models will eventually need to be refreshed, often by data scientists new to the problem. Unfortunately, much of the domain expertise and knowledge about the data gained during the project is lost and must be regained during the model refresh. This repeated work wastes time, delays the refresh, and limits the number of high-value projects that can be completed.
Second, many data science projects within an organization are related, using the same or similar data and similar feature engineering. If the work is not captured the first time it is completed, data science teams will keep duplicating it. Often the duplication is not exact, leading to slightly different (and incompatible) versions of the same feature. Even if the code is made available, a later team that lacks the context, domain, and limitations of the data must first rebuild that understanding before it can make use of the data. Again, this rework wastes time and limits the number of projects the team can undertake.
Following a simple checklist during the productionalization of a model can help the data science team capture the work that was done and prevent duplication of efforts.
Capture the features as they are created in production and store them alongside the features created during modeling, so that future modeling efforts can simply retrieve these features instead of recreating them.
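As a rough illustration of what storing features for later retrieval could look like, here is a minimal sketch that uses SQLite from the Python standard library. The table layout, the function names (`open_store`, `save_features`, `load_features`) and the `stage` labels are assumptions made for this example, not a reference to any particular feature store product.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a tiny feature store backed by SQLite."""
    conn = sqlite3.connect(path)
    # One row per entity, feature name and stage ('modeling' or 'production').
    conn.execute(
        """CREATE TABLE IF NOT EXISTS features (
               entity_id TEXT NOT NULL,
               name      TEXT NOT NULL,
               value     REAL,
               stage     TEXT NOT NULL,
               PRIMARY KEY (entity_id, name, stage)
           )"""
    )
    return conn

def save_features(conn, entity_id, values, stage):
    """Insert or overwrite the feature values for one entity."""
    rows = [(entity_id, name, value, stage) for name, value in values.items()]
    conn.executemany("INSERT OR REPLACE INTO features VALUES (?, ?, ?, ?)", rows)
    conn.commit()

def load_features(conn, entity_id, stage):
    """Return {feature name: value} for one entity and stage."""
    cur = conn.execute(
        "SELECT name, value FROM features WHERE entity_id = ? AND stage = ?",
        (entity_id, stage),
    )
    return dict(cur.fetchall())

# Hypothetical usage: the production pipeline writes, a later project reads.
store = open_store()
save_features(store, "customer_42", {"days_since_last_order": 12.0}, "production")
retrieved = load_features(store, "customer_42", "production")
```

Keeping modeling and production values in the same table, separated only by a label, is one simple way to let a later project compare or reuse both.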
Linked to the stored features, capture which system the input data for each feature comes from, who owns that data, what preprocessing has already happened, and what processing changes have occurred or are expected for this data.
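One possible way to keep this information next to a feature is a small structured record. The sketch below is only an illustration: the class name `FeatureLineage`, its field names, and every value in the example are hypothetical.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class FeatureLineage:
    """Source-data details linked to a single stored feature."""
    feature_name: str
    source_system: str
    data_owner: str
    upstream_preprocessing: list = field(default_factory=list)
    past_changes: list = field(default_factory=list)
    expected_changes: list = field(default_factory=list)

    def to_json(self):
        # Serialize so the record can be stored alongside the feature values.
        return json.dumps(asdict(self), indent=2, sort_keys=True)

# Hypothetical example record; none of these values come from a real system.
lineage = FeatureLineage(
    feature_name="days_since_last_order",
    source_system="order management database",
    data_owner="sales operations team",
    upstream_preprocessing=["cancelled orders removed before load"],
    past_changes=["order timestamps switched to UTC"],
    expected_changes=["planned migration to a new order system"],
)
lineage_json = lineage.to_json()
```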
Beyond the technical details of the original data, capture why the data was collected, what the business uses it for, and how the data relates to that business. This is meant to document the domain knowledge contained within the data.
Future modeling efforts will need to understand what processing was done to create the features, and why. What data cleaning techniques were used? Why were they chosen? Why were others skipped?
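One way to keep the reasoning attached to the processing itself is to store each cleaning step together with its rationale and the alternatives that were rejected. The following is a minimal sketch under that assumption; `CleaningStep`, `apply_steps`, the two cleaning functions, and the rationale text are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class CleaningStep:
    name: str
    func: Callable[[List[Optional[float]]], List[Optional[float]]]
    why: str            # why this technique was chosen
    rejected: str = ""  # alternatives considered and why they were skipped

def drop_missing(values):
    """Remove missing (None) entries."""
    return [v for v in values if v is not None]

def cap_at_100(values):
    """Cap every value at 100.0."""
    return [min(v, 100.0) for v in values]

# Hypothetical steps and rationale, for illustration only.
STEPS = [
    CleaningStep("drop_missing", drop_missing,
                 why="missing values indicate records that never completed",
                 rejected="mean imputation: would invent values for those records"),
    CleaningStep("cap_at_100", cap_at_100,
                 why="values above 100 were judged to be entry errors",
                 rejected="dropping outliers: would lose otherwise valid rows"),
]

def apply_steps(values, steps):
    """Run each step in order and return (cleaned values, audit log)."""
    log = []
    for step in steps:
        rows_in = len(values)
        values = step.func(values)
        log.append({"step": step.name, "why": step.why,
                    "rejected": step.rejected,
                    "rows_in": rows_in, "rows_out": len(values)})
    return values, log

cleaned, audit_log = apply_steps([5.0, None, 250.0], STEPS)
```

The audit log can then be saved with the features, so the questions above are answered by the pipeline's own output rather than by memory.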
Some features capture important business facts and domain knowledge. Link the explanation of these to the features so that the next team of data scientists can easily gain the domain knowledge and understand which features are likely to be useful or easily explained.
During the original data preparation process, exploratory data analysis (EDA) was performed on both the original data and the created features. Capture the results of this analysis and link them to the production features. This serves two purposes. First, it limits the need to repeat the analysis. Second, it documents the baseline that can be used to determine whether the data is changing.
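As a sketch of what a captured baseline might contain, the function below reduces one numeric feature to a handful of summary statistics that can be saved as JSON. The choice of statistics and the name `summarize` are assumptions for this example; a real analysis would likely record more, such as histograms or correlations.

```python
import json
import statistics

def summarize(values):
    """Summarize one numeric feature; None marks a missing value.

    Assumes at least two non-missing values, which the sample
    standard deviation requires.
    """
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing_rate": (len(values) - len(present)) / len(values),
        "mean": statistics.fmean(present),
        "stdev": statistics.stdev(present),
        "min": min(present),
        "max": max(present),
    }

# Hypothetical feature values observed during modeling.
baseline = summarize([1.0, 2.0, 3.0, None])
baseline_json = json.dumps(baseline, sort_keys=True)  # ready to store with the feature
```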
Don’t just wait for a model refresh or related project to check the production data. Use the results of the modeling EDA to build tests on the production features. This allows you to proactively find changes to the data before the end user notices something is wrong with the model. In addition, this saves subsequent data scientists from needing to recode the analysis.
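A minimal sketch of such a test follows. It compares a batch of production values for one feature against a stored baseline of summary statistics. The baseline keys, the function name `check_against_baseline`, and both thresholds are assumptions chosen for illustration and would need tuning for real data.

```python
def check_against_baseline(values, baseline,
                           max_missing_increase=0.05,
                           max_mean_shift_in_stdevs=3.0):
    """Return a list of human-readable problems; an empty list means no alert.

    `values` is a batch of production values (None marks a missing value).
    `baseline` holds 'missing_rate', 'mean', 'stdev', 'min' and 'max'
    recorded during modeling (hypothetical keys).
    """
    if not values:
        return ["no rows received"]

    problems = []
    present = [v for v in values if v is not None]

    missing_rate = (len(values) - len(present)) / len(values)
    if missing_rate > baseline["missing_rate"] + max_missing_increase:
        problems.append(f"missing rate rose to {missing_rate:.2f}")

    if present:
        mean = sum(present) / len(present)
        if baseline["stdev"] > 0:
            shift = abs(mean - baseline["mean"]) / baseline["stdev"]
            if shift > max_mean_shift_in_stdevs:
                problems.append(f"mean shifted by {shift:.1f} baseline stdevs")

        outside = [v for v in present
                   if v < baseline["min"] or v > baseline["max"]]
        if outside:
            problems.append(f"{len(outside)} value(s) outside the baseline range")

    return problems

# Hypothetical baseline and production batches.
example_baseline = {"missing_rate": 0.0, "mean": 10.0, "stdev": 2.0,
                    "min": 4.0, "max": 16.0}
healthy = check_against_baseline([10.0, 11.0, 9.0], example_baseline)
drifted = check_against_baseline([30.0, 31.0, None], example_baseline)
```

Run on a schedule against each new batch, a check like this surfaces a drifting feature as a list of specific problems rather than as an unexplained drop in model quality.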
A discussion of the results of simple analysis can be extremely valuable to both the business and the data science team. The data science team can validate their domain expertise and feature engineering approach, while the business can gain new insights into their business and understand potential data issues that may be impacting other processes within the business. Documenting this interaction will enhance a new data science team's ability to quickly gain domain knowledge.