Preprocessing…

Photo by Jukan Tateisi on Unsplash

Preprocessing might be one of the less appreciated steps in data science. Exactly how much time it takes is hard to pin down, but it is generally one of the lengthier steps in the data preparation process. Part of the ambiguity comes from what data preparation actually encompasses. From my experience, these are some of the key steps:

  • understanding the problem and deliverables
  • collecting data and understanding its usefulness
  • exploring data for relationships and univariate qualities
  • cleaning and validating outliers and missing values
  • restructuring data for the algorithm
  • splitting the dataset for training, testing, and validating (see the split sketch after this list)
  • selecting and scaling features
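
For the splitting step, here is a minimal sketch using scikit-learn. The 60/20/20 proportions and the placeholder `X` and `y` are assumptions for illustration, not a recommendation:

```python
# A minimal three-way split sketch; the 60/20/20 proportions and the
# placeholder X and y below are assumptions, not recommendations.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)  # placeholder feature matrix
y = np.arange(50) % 2              # placeholder binary target

# Carve out the test set first, then split the remainder into train/validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42, stratify=y_train)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10
```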

In some estimates, I have heard that between 50% and 80% of a data scientist's time is dedicated to preparing the data, although this number varies greatly based on how projects are executed and the composition of the team. Many people cite a version of the Pareto Principle, saying that 80% of their time is spent before training algorithms and 20% on analysis and reporting. On the opposite end of the spectrum, a 2021 survey by Anaconda indicated that only 22% of time is spent on data preparation, but that survey also counted deployment, training, and creating visualizations as separate tasks. The one point of agreement is that data preparation is typically considered a mundane process. When working from home, I try to do as much of this as possible in the early morning (6am to 9am), when it is quiet and I am very focused.

Some surveys have tried to estimate less ambiguous processes. For example, Algorithmia's 2021 'Enterprise Trends in Machine Learning' survey indicates that two-thirds of organizations take more than a month to develop a machine learning model and deploy it into production. With this metric, the time spent on individual steps, and which steps are included, does not matter.

As for me, here are some individual tasks that I often look at during the latter stages of preprocessing. I would first assess the quality and usability of the data, which requires quite a few plots and descriptive statistics. A few of the tasks below are sketched in code after the list.

  1. Imputing Missing Data
  2. Encoding Categorical Variables
  3. Transforming Numerical Variables
  4. Performing Variable Discretization
  5. Eliminating Outliers
  6. Extracting Date and Time Features
  7. Creating Features from Time Series
  8. Performing Feature Scaling
  9. Creating New Features
  10. Extracting Features from Relational Data
  11. Balancing Targets Using Oversampling, Undersampling, and Hybrid Methods
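
For tasks 1, 2, and 8, one common pattern is a scikit-learn `ColumnTransformer` that imputes, encodes, and scales in a single fit. This is a minimal sketch; the column names and the toy DataFrame are made up:

```python
# A sketch of tasks 1, 2, and 8; the column names and toy data are made up.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({
    "age": [25, np.nan, 47, 33],
    "income": [40000, 52000, np.nan, 61000],
    "city": ["NY", "LA", np.nan, "NY"],
})

preprocess = ColumnTransformer([
    # Numeric columns: median imputation (task 1), then standardization (task 8).
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["age", "income"]),
    # Categorical columns: mode imputation, then one-hot encoding (task 2).
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["city"]),
])

X = preprocess.fit_transform(df)
```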
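For tasks 3 and 4, here is a sketch of a log transform for skewed values plus equal-frequency binning with `KBinsDiscretizer`; the `amount` values are invented:

```python
# A sketch of tasks 3 and 4; the skewed "amount" values are invented.
import numpy as np
import pandas as pd
from sklearn.preprocessing import KBinsDiscretizer

amounts = pd.DataFrame({"amount": [1, 5, 12, 80, 450, 3200]})

# Task 3: log1p compresses the long right tail and is safe at zero.
amounts["amount_log"] = np.log1p(amounts["amount"])

# Task 4: quantile-based discretization into three roughly equal-sized bins.
binner = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy="quantile")
amounts["amount_bin"] = binner.fit_transform(amounts[["amount"]]).ravel()
```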
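For task 5, a sketch of the common 1.5 × IQR rule is below. This is only one convention among several (z-scores and isolation forests are alternatives), and the data is made up:

```python
# A sketch of task 5 using the 1.5 * IQR rule; the data is made up, and the
# rule is only one convention among several.
import pandas as pd

df = pd.DataFrame({"value": [10, 12, 11, 13, 12, 95]})

q1, q3 = df["value"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows inside the whiskers; 95 falls outside and is dropped.
df_clean = df[df["value"].between(lower, upper)]
```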
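For tasks 6 and 7, here is a pandas sketch that pulls calendar parts out of a timestamp and builds lag and rolling-window features; the daily sales series is hypothetical:

```python
# A sketch of tasks 6 and 7; the daily sales series is hypothetical.
import pandas as pd

ts = pd.DataFrame({
    "timestamp": pd.date_range("2022-01-01", periods=8, freq="D"),
    "sales": [5, 7, 6, 9, 12, 10, 14, 13],
})

# Task 6: calendar components extracted from the timestamp.
ts["dayofweek"] = ts["timestamp"].dt.dayofweek
ts["month"] = ts["timestamp"].dt.month
ts["is_weekend"] = ts["dayofweek"].isin([5, 6]).astype(int)

# Task 7: lag and rolling-window statistics as time-series features.
ts["sales_lag1"] = ts["sales"].shift(1)
ts["sales_roll3"] = ts["sales"].rolling(window=3).mean()
```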
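For tasks 9 and 10, a sketch that aggregates a child table up to its parent key and then derives a new ratio feature from the aggregates; the orders table and customer key are hypothetical:

```python
# A sketch of tasks 9 and 10; the orders table and customer key are hypothetical.
import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "order_total": [20.0, 35.0, 15.0, 22.0, 18.0, 50.0],
})

# Task 10: roll order-level rows up to one row per customer.
features = orders.groupby("customer_id")["order_total"].agg(
    order_count="count", order_sum="sum", order_mean="mean").reset_index()

# Task 9: derive a new feature by combining existing columns.
features["avg_share_of_spend"] = features["order_mean"] / features["order_sum"]
```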
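For task 11, a sketch of oversampling with SMOTE; note that imbalanced-learn is a third-party package, and the synthetic dataset here is an assumption. The same package also ships undersamplers and hybrid methods such as SMOTEENN.

```python
# A sketch of task 11 with SMOTE oversampling; imbalanced-learn is a
# third-party package (pip install imbalanced-learn), and the synthetic
# dataset is an assumption.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))

# Generate synthetic minority-class samples from nearest neighbors.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after:", Counter(y_res))
```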

Currently, my favorite textbook on this topic is The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd Edition) by Hastie, Tibshirani, and Friedman. The book is well written, and the key preprocessing steps are included as part of the model development process. Another book from some of the same authors is An Introduction to Statistical Learning with Applications in Python.

Two reference books that I am trying to find are Applied Predictive Modeling and Imbalanced Learning: Foundations, Algorithms, and Applications. I have read snippets of both and was impressed with their clarity.
