Data preparation (sometimes with feature engineering broken out as a separate step) is estimated to account for 50% to 80% of a data scientist’s time on a project. Much of that work is tedious and time-consuming, which led the New York Times to famously call it data janitorial work. Despite this, surprisingly few packages have substantially reduced the burden of data prep for data scientists.
At Rasgo, we are committed to changing this. We are data scientists on a mission to enable the global data science community to generate valuable and trusted insights from data in under 5 minutes. Over the past 12 months, we have been writing code and engaging with the data science community each and every day. We have been in the weeds with over 400 end users, constantly diving into their biggest pain points and obsessing over delivering solutions that save them time and heartache.
To this end, we are excited to announce the formal release of our Python package for feature engineering, PyRasgo. PyRasgo is designed to help you profile your data and find your most impactful features from within the Python tools you already know and love.
After pip install, it takes just two lines of code. The resulting visualizations make the most impactful features easy to see, and they come with an easy-to-share link so you can explain your results to teammates and other stakeholders.
A single line of code will generate a full profile of every feature in the dataframe. This includes a histogram, summary statistics, the most common values, and a count of missing or null values.
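To make the contents of such a profile concrete, here is a minimal plain-Python sketch of the four pieces for a single numeric column. It is not PyRasgo's API or implementation: the function name `profile_column`, its parameters, and the sample `ages` data are all hypothetical and exist only for illustration.

```python
import math
import statistics
from collections import Counter


def profile_column(values, bins=5, top_n=3):
    """Profile one numeric column; None and NaN are counted as missing."""
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    missing = len(values) - len(present)

    summary = {
        "count": len(present),
        "mean": statistics.mean(present),
        "std": statistics.stdev(present),
        "min": min(present),
        "median": statistics.median(present),
        "max": max(present),
    }

    # Equal-width histogram: count how many values fall into each bin.
    lo, hi = min(present), max(present)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant columns
    hist = [0] * bins
    for v in present:
        idx = min(int((v - lo) / width), bins - 1)  # the maximum goes in the last bin
        hist[idx] += 1

    return {
        "summary": summary,
        "histogram": hist,
        "most_common": Counter(present).most_common(top_n),
        "missing": missing,
    }


# Illustrative data only: ten observations, two of them missing.
ages = [34, 29, None, 41, 29, 52, float("nan"), 29, 41, 38]
profile = profile_column(ages)
for key, value in profile.items():
    print(key, value)
```

A real profile would repeat this for every column in the dataframe and render the histogram as a chart rather than a list of counts.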
Once you are satisfied with the data in the dataframe, another single line of code will calculate an importance score for each feature in the dataframe. PyRasgo does this by building a CatBoost model, calculating the SHAP values, and taking the mean absolute SHAP value of each feature as its importance.
In addition, for each feature, the underlying SHAP values for each observation are plotted to show the relationship between the feature value and the corresponding SHAP value.
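The aggregation described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not PyRasgo's code: a hypothetical linear model stands in for the CatBoost model, because for a linear model with features treated as independent the SHAP value has a simple closed form, weight × (feature value − feature mean). The weights, feature names, and rows are made up for the example.

```python
import statistics

# Hypothetical stand-in model: f(x) = bias + sum(weights[j] * x[j]).
# PyRasgo fits a CatBoost model instead; a linear model is used here only
# because its SHAP values can be written down directly.
weights = {"sqft": 0.2, "age": -1.5, "rooms": 4.0}

# Illustrative observations only.
rows = [
    {"sqft": 1200, "age": 30, "rooms": 3},
    {"sqft": 1500, "age": 10, "rooms": 4},
    {"sqft": 900, "age": 45, "rooms": 2},
    {"sqft": 2000, "age": 5, "rooms": 5},
]

# SHAP value of feature j for one row: weights[j] * (x[j] - mean(x[j])).
means = {name: statistics.mean(r[name] for r in rows) for name in weights}
shap_values = [
    {name: weights[name] * (r[name] - means[name]) for name in weights}
    for r in rows
]

# Feature importance = mean absolute SHAP value across all observations.
importance = {
    name: statistics.mean(abs(s[name]) for s in shap_values)
    for name in weights
}
ranked = sorted(importance.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.2f}")

# Per-observation (feature value, SHAP value) pairs for one feature:
# these are the points a feature-value-versus-SHAP plot would draw.
age_points = sorted((r["age"], s["age"]) for r, s in zip(rows, shap_values))
print(age_points)
```

Taking the absolute value before averaging matters: SHAP values for a feature are centered around zero, so a plain mean would cancel out and hide features that push predictions strongly in both directions.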
Give PyRasgo a try by installing it on your machine and working through the tutorial here.