Three years ago, I was in a conference room with the VP of data science at a Fortune 100 retailer. We were discussing two key issues that the data science team was struggling with, and we were stuck.
The first issue was bandwidth - specifically, the ability of the team’s data engineers to support all of the ad hoc data pipelines the data scientists were asking for. There was an eighteen-month roadmap for the data lake, but both sides of the table knew there was a better chance of Congress passing a long-term budget bill than of the team fully delivering on that roadmap (spoiler: they didn’t. They took on a platform migration six months later and the rest is history).
The second issue was the lack of collaboration on the team… reflected not just in the siloed, single-threaded projects each team member took on and started from scratch, but (even more detrimental) in the lack of standardization of key metric definitions across machine learning pipelines and basic reporting. Three ML models + two Tableau dashboards = five different answers to executive questions such as “what is our most profitable product?”
These two issues may sound familiar to you. I’ve heard them hundreds of times by now from various data science teams, often with a fairly depressing outcome: the data scientist has to take on all of their own data sourcing and preparation before they can ever get to feature engineering and model training, leading to frustration and a lack of business outcomes. Their island drifts further and further away from the mainland of their team, until they are spending more time on LinkedIn than in Jupyter. They eventually leave the company for greener pastures, where the data lake is “75% complete” and the data science teams are “building some really cutting-edge models”.
By now, being the brilliant data scientist that you are, you probably see where this is heading and don’t need Facebook Prophet to get there. These problems are finally solvable because of the intersection of two things: support for orchestrating scalable Python on public cloud platforms, and innovation in cloud data warehouses such as Snowflake, BigQuery, and Redshift.
Today we’re announcing the launch of Rasgo, the modeling preparation platform for data scientists, with the goal of accelerating the data scientist through data exploration, data preparation, and feature engineering… without forcing them to learn new tools or languages.
At Rasgo, we believe feature engineering is part art, part science, and part having enough RAM in your development environment. We know that users build the best features, so we are focused on storing and surfacing those features as reusable, shareable assets… not automatically generating them in a black box and expecting users to blindly trust them.
Like to build features in pandas and train in TensorFlow? Or do you prefer creating your features in SQL and then running them through DataRobot? Rasgo is built with open APIs and is 100% compatible with these and other data science tools and packages. All feature transforms are stored in a Git repository for versioning and sharing with team members via the Rasgo UI.
Need help writing the Python code for that fifth lag variable in your time series forecast, when you’ve still got ten more to do? Rasgo can write that code for you, and store the transformation in the feature code repository. Get your data ready for model training as soon as possible, with full transparency into the lineage of those features.
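For a flavor of what that kind of lag-feature transform looks like in plain pandas, here’s a minimal sketch. The DataFrame, column names, and grouping key below are hypothetical examples for illustration only, not Rasgo’s actual generated code or API.

```python
import pandas as pd

# Hypothetical sales data: one row per store per day.
df = pd.DataFrame({
    "store_id": [1, 1, 1, 2, 2, 2],
    "date": pd.to_datetime(["2021-01-01", "2021-01-02", "2021-01-03"] * 2),
    "sales": [100, 120, 90, 200, 210, 195],
})

# Sort so shifts move backward in time within each store,
# then add lag features for the previous 1-5 days of sales.
df = df.sort_values(["store_id", "date"])
for lag in range(1, 6):
    df[f"sales_lag_{lag}"] = df.groupby("store_id")["sales"].shift(lag)
```

Writing this by hand once is easy; writing and maintaining dozens of variations across projects is where the repetition adds up.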
These core principles have guided the development of Rasgo, and they will continue to inform our product and engineering roadmap. Rasgo is the bridge between your data engineers and data scientists: a platform that enables data scientists to prepare data for modeling quickly and self-sufficiently, but in a manner that doesn’t require expensive refactors of code when it’s time for production.
For more information on Rasgo, check out our product page or request a demo.