Welcome to part 2 of our Data Science Lifecycle Investment Prioritization series. In part 1, we proposed making early investments in the steps that make up the Prepare phase of the lifecycle, and suggested prioritizing investments in the tools and processes that accelerate those data preparation steps.
In this article, we’ll discuss the remaining phases of the lifecycle and recommend how to prioritize investments in the specific steps in these phases.
Once again, let’s review Rasgo’s definition of the data science lifecycle that we first introduced in our whitepaper and in the article The Data Science Lifecycle and Why Projects Fail.
DEFINE
PREPARE
MODEL
PREDICT
Without further ado, let’s jump back into prioritizing investments in the Data Science Lifecycle.
Once the data science team has the data ready for modeling, they can use their training and existing tools to build and evaluate models. However, many data science projects then fail during the deployment of models into production. This happens for many reasons, but one of the most significant is that data scientists are not software engineers and do not do their modeling in an environment that closely resembles the actual production environment.
This leads to two problems. First, the data science team may create features or perform merges that are impossible, or at least very time consuming, to reproduce in a production environment. Second, even when the team's process can be implemented in production, gaining access to the software engineering resources to implement it is often challenging. The result is models that cannot be implemented in the first case, or are not implemented in a timely manner in the second.
Finally, even if the model is successfully placed into production, many teams do not plan to monitor the model and do not assign resources for this work. Model monitoring includes tracking the incoming data to detect any changes from the modeling data, tracking the predictions to determine whether their distribution is shifting, and tracking the model's performance against actual outcomes to determine whether it is degrading. As mentioned in the prior section, all models eventually need to be refreshed, and model monitoring is the key tool for helping the data science team and the business identify when that refresh needs to happen.
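To make this concrete, the sketch below shows one basic form of data drift monitoring: comparing the distribution of each numeric feature in the original training data against a recent batch of production data. The DataFrames, column names, and threshold are illustrative assumptions, not part of any specific monitoring tool.

```python
# Minimal data-drift check: flag features whose production distribution
# differs noticeably from the training distribution (Kolmogorov-Smirnov test).
# DataFrames, column names, and the threshold below are placeholders.
import pandas as pd
from scipy.stats import ks_2samp

def detect_drift(training_df, production_df, features, p_threshold=0.01):
    """Return the features whose distributions appear to have shifted."""
    drifted = {}
    for col in features:
        stat, p_value = ks_2samp(training_df[col].dropna(),
                                 production_df[col].dropna())
        if p_value < p_threshold:  # reject the "same distribution" hypothesis
            drifted[col] = {"ks_statistic": stat, "p_value": p_value}
    return drifted

# Hypothetical usage:
# alerts = detect_drift(train_df, latest_scoring_batch, ["age", "order_value"])
# if alerts:
#     notify_team(alerts)  # hypothetical helper: investigate or refresh the model
```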
Many tools exist to help deploy data scientists' models to a production system (for example, through a REST API) without the need for significant software engineering time. These same tools, or others, can also automatically monitor the model once it is in production. In addition, many data science platforms make it easy to deploy and monitor models built within the platform.
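As a rough illustration of what these tools automate, here is a minimal sketch of wrapping a trained model in a REST endpoint using Flask and a pickled scikit-learn model. The file name, route, and payload shape are assumptions for the example; real deployment tools add scaling, versioning, authentication, and the monitoring described above.

```python
# Minimal sketch of serving a trained model behind a REST API with Flask.
# "model.pkl", the route, and the payload format are illustrative placeholders.
import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

with open("model.pkl", "rb") as f:   # model trained and saved elsewhere
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()                      # e.g. {"features": [[1.2, 3.4]]}
    predictions = model.predict(payload["features"])
    return jsonify({"predictions": predictions.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```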
Unfortunately, even when a model is successfully put into production, it may fail to deliver value. Often this happens because neither the business nor the data science team considered who would use the model results, how the results would be used, or how the users would be trained.
First, teams need to understand who will consume the results and how. Often these results will be integrated into an existing solution (such as Salesforce or a transaction processing system) or displayed on a dashboard. These integrations need to be built and ready to go when the model is placed into production.
Often that is the easy part. Using the model will require changes to the users' workflow. Teams need to treat this as a change management problem and begin working with the impacted users early in the lifecycle to ensure they are comfortable with the model and will incorporate it into their decision process.
Once the prior challenges are addressed, the model training and evaluation phase itself can be improved. This needs to be done with the understanding that data scientists were trained to do this work and often take great pride in it.
There are two approaches an organization can take. The first is to choose tools or platforms that let data scientists build and evaluate models using their existing skills, while adding the ability to collaborate on a single platform, share code, and scale models beyond what fits on a laptop. The second is to invest in an AutoML tool. While this may clash with data scientists' preferences and skill sets, AutoML can be used successfully within data science teams to accelerate initial modeling and help them quickly refine their approach before settling on the data, features, and algorithms to develop more fully. In addition, because the demand for data science tends to exceed the resources of all but the largest organizations, AutoML can allow business users to quickly evaluate their ideas and develop better justifications and ROI estimates before asking the data science team to take on a project.
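For a sense of what AutoML automates, the sketch below uses plain scikit-learn to compare two candidate models with cross-validation, the kind of search an AutoML tool performs across far more algorithms, features, and hyperparameters. The dataset is a stand-in, and the candidates and scoring metric are assumptions for illustration.

```python
# Lightweight stand-in for AutoML: automatically compare candidate models with
# cross-validation and keep the best one as a starting point for deeper work.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)  # placeholder for project data

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

scores = {name: cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
print(scores)
print("Best starting point:", best_name)
```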
Up to this point, the focus of this blog has been on tools and technical activities. As data science teams scale from solving just a few problems to driving decisions across the business, investments in tools to manage the entire workflow also need to be considered, across four broad categories.
First, data science management needs visibility into the current status of every project the team is managing. At its core, this is classic project management: ensuring each project is adequately staffed, tracking progress against the schedule, and surfacing potential blockers. Second, a method to track use cases is essential, from estimating likely ROI for initial prioritization, through current phase, deployed model performance, and time to refresh, all the way to actual ROI achieved.
Complete documentation is the third category. Because most model refreshes are done by data scientists who did not work on the original model, the entire modeling process must be documented, and that documentation must be easily accessible to the whole team. Beyond supporting refresh work, this documentation is vital for sharing code and techniques so that every data scientist can benefit from the talents of their colleagues.
Data science projects are rarely the work of a single data scientist. Often multiple data scientists work together, either with one managing the project or with several dividing it into parts they can work on simultaneously. In either case, the overhead of managing this collaboration can become overwhelming. Data collaboration tools, the fourth category, make it easy for teams to work together on their projects.
In highly regulated businesses, data science teams must work within the compliance framework if their models are ever to reach production. But with recent changes in privacy regulation such as GDPR and CCPA, newsworthy data breaches, and a growing understanding of how data science tools can produce biased and unfair models, all data science teams should have management policies that address security, privacy, fairness, and bias, and that provide full auditability of the entire data science workflow.
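As one small example of what a fairness policy might check, the sketch below compares the model's positive-prediction rate across groups defined by a sensitive attribute, a demographic-parity style comparison. The DataFrame and column names are hypothetical; real fairness audits use multiple metrics and dedicated tooling.

```python
# Minimal fairness check: compare the share of positive predictions across
# groups defined by a sensitive attribute. Column names are illustrative.
import pandas as pd

def positive_rate_by_group(scored_df, group_col, prediction_col):
    """Positive-prediction rate within each group (predictions are 0/1)."""
    return scored_df.groupby(group_col)[prediction_col].mean()

# Hypothetical usage on a scored dataset:
# rates = positive_rate_by_group(scored_df, "customer_segment", "approved")
# disparity = rates.max() - rates.min()   # large gaps warrant review
```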
Data science has the ability to generate significant ROI for businesses. Unfortunately, report after report shows that this is simply not happening. The reasons are many, but smart investment in the tools and processes that support the data science team can help drive successful data science projects and transform the business.
Click here to download our white paper on Why Data Science Projects Fail to learn more.