Project SortingHat


Data preparation remains a major roadblock for real-world ML applications, since it is usually handled fully manually by data scientists. The grunt work of data prep involves diverse tasks such as identifying feature types for ML, standardizing and cleaning feature values, and feature transformations. This reduces data scientists' productivity and raises costs. It is also an impediment to industrial-scale AutoML platforms that build ML models on millions of datasets.

To tackle this challenge, we envision a new line of research to dramatically reduce the human effort needed in data prep for ML: using ML to automate data prep for ML. We abstract specific ML data prep tasks and cast them as applied ML tasks. We present the ML Data Prep Zoo, a community-led repository of benchmark labeled datasets and pre-trained ML models for ML data prep tasks.

In this project, we apply the above philosophy to a ubiquitous data prep issue when applying ML over relational data: ML Schema Extraction. Datasets are typically exported from DBMSs into tools such as Python and R as CSV files. This creates a semantic gap between attribute types in a DB schema (such as strings or integers) and feature types in an ML schema (such as numeric or categorical). Hence, the data scientist is forced to manually extract the ML schema. SortingHat casts this task as an ML classification problem and allows users to quickly dispose of easy features and reprioritize their effort towards features that need more human attention.
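To illustrate the idea of casting feature type inference as a classification problem, here is a minimal sketch in Python. The column descriptors (fraction of values parseable as numbers, distinct-value ratio, mean string length), the toy labels, and the use of a random forest are illustrative assumptions, not the actual SortingHat featurization or models:

```python
# Sketch: treat "what ML feature type is this column?" as a
# supervised classification problem over hand-crafted column descriptors.
from sklearn.ensemble import RandomForestClassifier

def _is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        return False

def describe_column(values):
    """Map a raw column (list of strings) to simple descriptors:
    fraction of numeric-parseable values, distinct-value ratio,
    and mean value length."""
    n = len(values)
    numeric_frac = sum(1 for v in values if _is_number(v)) / n
    distinct_ratio = len(set(values)) / n
    mean_len = sum(len(v) for v in values) / n
    return [numeric_frac, distinct_ratio, mean_len]

# Toy training data: raw columns labeled with their ML feature type.
labeled_columns = [
    (["3.1", "2.7", "5.0", "4.2"], "numeric"),
    (["12", "7", "19", "3"], "numeric"),
    (["red", "blue", "red", "green"], "categorical"),
    (["CA", "NY", "CA", "TX"], "categorical"),
]
X = [describe_column(col) for col, _ in labeled_columns]
y = [label for _, label in labeled_columns]

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Predict the feature type of an unseen column.
pred = clf.predict([describe_column(["1.5", "2.5", "0.5", "9.9"])])
```

A real system trains on a large labeled corpus of columns and uses far richer signals (column names, sample values, aggregate statistics), but the shape of the problem is the same: column in, feature type label out.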

All of our labeled datasets, pre-trained models, code, and documentation are available here: ML Data Prep Zoo

Recent long talk on the vision and first task: YouTube video.

Downloads (Paper, Code, Data, etc.)

  • Improving Feature Type Inference Accuracy of TFDV with SortingHat
    Vraj Shah, Kevin Yang, and Arun Kumar

  • Towards Semi-Automatic Embedded Data Type Inference
    Jonathan Lacanlale, Vraj Shah, and Arun Kumar

Student Contact

Vraj Shah: vps002 [at] eng [dot] ucsd [dot] edu


This project is supported in part by a Faculty Research Award from Google Research, an Amazon Research Award, and the NSF Convergence Accelerator under award OIA-2040727.