Project SortingHat


Data preparation (prep) time remains a major roadblock for real-world ML applications, since it is usually handled fully manually by data scientists. The grunt work of data prep involves diverse tasks such as identifying feature types for ML, standardizing and cleaning feature values, and transforming features. This manual effort reduces data scientists’ productivity and raises costs. It is also a major impediment to the effectiveness of emerging industrial end-to-end AutoML platforms that build ML models on millions of datasets for enterprises, Web companies, and more.

To tackle the above challenge, we envision a new line of research to dramatically reduce the human effort needed in data prep for ML, as well as to accurately benchmark the automation of data prep in AutoML platforms: create benchmarks and labeled datasets and use ML to automate ML data prep. We abstract a series of specific and ubiquitous ML data prep tasks and formalize them as prediction tasks. We present the ML Data Prep Zoo, a community-led repository of our benchmark labeled datasets and pre-trained ML models for such ML data prep tasks.

So far, we have applied the above philosophy to a gateway step in ML data prep for tabular data: ML Schema Extraction. Datasets are typically exported from DBMSs into tools such as Python as CSV files. This creates a semantic gap between attribute types in a DB schema (such as strings or integers) and feature types in an ML schema (such as numeric or categorical). Hence, data scientists or AutoML platform users are forced to manually extract the ML schema, which is tedious, slow, and error-prone. We cast this task as a multi-class ML classification problem for the first time. This lets users quickly dispose of easy features and reprioritize their effort towards features that need more human attention, and it enables AutoML platforms to identify feature types more accurately and robustly, which in turn boosts their downstream model building.
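To make the idea of casting feature type inference as multi-class classification concrete, here is a minimal sketch in Python. It is an illustration only, not the project's actual feature set or model: the three descriptive statistics, the nearest-centroid classifier, the label names, and the toy training columns are all assumptions chosen to keep the example short and dependency-free.

```python
import math
from collections import defaultdict


def is_number(s):
    """Return True if the raw string can be parsed as a float."""
    try:
        float(s)
        return True
    except ValueError:
        return False


def featurize(values):
    """Summarize one raw column (a list of strings, as read from a CSV file).

    Hypothetical features: fraction of values that parse as numbers,
    fraction of distinct values, and mean string length (scaled, capped at 1).
    """
    non_empty = [v for v in values if v.strip()]
    n = max(len(non_empty), 1)
    frac_numeric = sum(is_number(v) for v in non_empty) / n
    frac_distinct = len(set(non_empty)) / n
    mean_len = sum(len(v) for v in non_empty) / n
    return [frac_numeric, frac_distinct, min(mean_len / 20.0, 1.0)]


def fit_centroids(labeled_columns):
    """Average the feature vectors of the training columns for each label."""
    grouped = defaultdict(list)
    for values, label in labeled_columns:
        grouped[label].append(featurize(values))
    return {
        label: [sum(dim) / len(vectors) for dim in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(centroids, values):
    """Assign the label whose centroid is closest in Euclidean distance."""
    x = featurize(values)
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))


# Toy hand-labeled columns (assumed for illustration, not benchmark data).
TRAIN = [
    (["1.5", "2.7", "3.1", "4.8", "5.0", "6.2"], "numeric"),
    (["10", "250", "31", "47", "5", "62"], "numeric"),
    (["red", "blue", "red", "green", "blue", "red"], "categorical"),
    # Zip codes are stored as integers but behave as categories.
    (["92093", "92093", "10001", "10001", "92093", "10001"], "categorical"),
    (["great product, works as described",
      "arrived late and the box was damaged",
      "would buy again for the price"], "sentence"),
]

centroids = fit_centroids(TRAIN)

# An integer-coded column: a DB schema would call it an integer attribute.
rating_code = ["3", "1", "3", "2", "1", "3", "2", "1"]
print(predict(centroids, rating_code))
```

The integer-coded column illustrates the semantic gap: every value parses as a number, yet the low ratio of distinct values pulls it towards the categorical class. A real system would need far more labeled columns, richer features, and a stronger model than this toy classifier.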

All of our labeled datasets, pre-trained models, code, and documentation are available here: ML Data Prep Zoo

Recent long talk on the vision and the first task: YouTube video.

Overview Paper

Category Deduplication

  • How do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses
    Vraj Shah, Thomas Parashos, and Arun Kumar
    VLDB 2024 | Paper PDF | TechReport | Code and Data coming soon

Feature Type Inference

  • Improving Feature Type Inference Accuracy of TFDV with SortingHat
    Vraj Shah, Kevin Yang, and Arun Kumar

  • Towards Semi-Automatic Embedded Data Type Inference
    Jonathan Lacanlale, Vraj Shah, and Arun Kumar

Student Contact

Vraj Shah: vps002 [at] eng [dot] ucsd [dot] edu


This project is supported in part by a Faculty Research Award from Google Research, an Amazon Research Award, and the NSF Convergence Accelerator under award OIA-2040727.