ADA Lab @ UCSD

Project SortingHat

Overview

Data preparation (prep) time still remains a major roadblock for real-world ML applications, since it is ususally handled fully manually by data scientists. The grunt work of data prep involves diverse tasks such as identifying features types for ML, standardizing and cleaning feature values, and feature transformations. This reduces data scientists’ productivity and raises costs. It is also a major impediment to the effectiveness of emerging industrial end-to-end AutoML platforms that build ML models on millions of datasets for enterprises, Web companies, and more.

To tackle the above challenge, we envision a new line of research to dramatically reduce the human effort needed in data prep for ML, as well as to accurately benchmark the automation of data prep in AutoML platforms: create benchmarks and labeled datasets and use ML to automate ML data prep. We abstract a series of specific and ubiquitous ML data prep tasks and formalize them as prediction tasks. We present the ML Data Prep Zoo, a community-led repository of our benchmark labeled datasets and pre-trained ML models for such ML data prep tasks.

So far, we have applied the above philosophy to a gateway step in ML data prep for tabular data: ML Schema Extraction. Datasets are typically exported from DBMSs into tools such as Python as CSV files. This creates a semantic gap between attribute types in a DB schema (such as strings or integers) and feature types in a ML schema (such as numeric or categorical). Hence, data scientists or AutoML platform users are forced to manually extract the ML schema, which is tedious, slow, and error-prone. We cast this task as a multi-class ML classification problem for the first time and allow users to quickly dispose of easy features, repriortize their effort towards features that need more human attention, and enable AutoML platforms to more accurately and robustly identify feature types, which in turn boosts their downstream model building.

All of our labeled datasets, pre-trained models, code, and documentation are available here: ML Data Prep Zoo

Recent long talk on vision and first task: Youtube video.

Overview Paper

The ML Data Prep Zoo: Towards Semi-Automatic Data Preparation for ML
Vraj Shah and Arun Kumar
ACM SIGMOD 2019 DEEM Workshop | Paper PDF and BibTeX | TechReport | Blog post | Data Prep Zoo Repository on GitHub

Category Deduplication

How do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses
Vraj Shah, Thomas Parashos, and Arun Kumar
VLDB 2024 | Paper PDF | TechReport | Code and Data coming soon

Categorical Data Deduplication
Soham Pachpande and Gehan Chopade
CSE 234 Project TechReport | Code, Data, and Pre-trained Models on GitHub

Feature Type Inference

Towards Benchmarking Feature Type Inference for AutoML Platforms
Vraj Shah, Jonathan Lacanlale, Premanand Kumar, Kevin Yang, and Arun Kumar
ACM SIGMOD 2021 | Paper PDF | TechReport | Talk Videos: Short Talk Long Talk | Data, Code, and Pre-trained Models on GitHub | Python library

Bringing ML-based Feature Type Inference to OpenML
Ryan Tran and Victor Zhu
CSE 234 Project TechReport | Code Release on GitHub | Package on PyPi

Improving Feature Type Inference Accuracy of TFDV with SortingHat
Vraj Shah, Kevin Yang, and Arun Kumar
TechReport

Towards Semi-Automatic Embedded Data Type Inference
Jonathan Lacanlale, Vraj Shah, and Arun Kumar
TechReport

Student Contact

Vraj Shah: vps002 [at] eng [dot] ucsd [dot] edu

Acknowledgments

This project is supported in part by a Faculty Research Award from Google Research, an Amazon Research Award, and the NSF Convergence Accelerator under award OIA-2040727.