ADA Lab @ UCSD

 

Project Triptych

Overview

Triptych is an end-to-end model selection management system (MSMS) that aims to simplify and accelerate the process of sourcing data and features and selecting ML models. Our guiding principles are to exploit the semantics of the data and the ML task as far as possible, both to reduce work for the data scientist and to reduce runtimes and costs. We apply these principles to remove or mitigate the bottlenecks in this end-to-end process, and we aim to eventually unify the resulting components into an integrated “operating system” for ML analytics tasks. Please refer to the ACM SIGMOD Record paper listed below for more details of this vision.

Active Component Projects

 

Cerebro
Efficient and reproducible model selection on deep learning systems.

 

Morpheus
Integrating linear algebra and relational algebra to simplify feature engineering for ML.

 

SortingHat
ML schema inference and automatic data preparation.

Publications

  • Distributed Deep Learning on Data Systems: A Comparative Analysis of Approaches
    Yuhao Zhang, Frank McQuillan, Nandish Jayaram, Nikhil Kak, Ekta Khanna, Orhan Kislal, Domino Valdano, and Arun Kumar
    VLDB 2021 | Paper PDF | TechReport | Talk video | Code release

  • Intermittent Human-in-the-Loop Model Selection using Cerebro: A Demonstration
    Liangde Li, Supun Nakandala, and Arun Kumar
    VLDB 2021 Demo | Paper PDF | TechReport | Video

  • Towards A Polyglot Framework for Factorized ML
    David Justo, Shaoqing Yi, Lukas Stadler, Nadia Polikarpova, and Arun Kumar
    VLDB 2021 (Industrial Track) | Paper PDF | TechReport | Talk video | Code coming soon

  • The CNN Hip Accelerometer Posture (CHAP) Method for Classifying Sitting Patterns from Hip Accelerometers: A Validation Study
    Mikael Anne Greenwood-Hickman, Supun Nakandala, Marta M. Jankowska, Fatima Tuz-Zahra, John Bellettiere, Jordan Carlson, Paul R. Hibbing, Jingjing Zou, Andrea Z. LaCroix, Arun Kumar, and Loki Natarajan
    Medicine and Science in Sports and Exercise Journal, 2021 | Paper PDF coming soon | Code

  • Application of Convolutional Neural Network Algorithms for Advancing Sedentary and Activity Bout Classification
    Supun Nakandala, Marta Jankowska, Fatima Tuz-Zahra, John Bellettiere, Jordan Carlson, Andrea LaCroix, Sheri Hartman, Dori Rosenberg, Jingjing Zou, Arun Kumar, and Loki Natarajan
    Journal for the Measurement of Physical Behaviour, 2021 | Paper PDF and BibTeX | Code

  • Cerebro: A Layered Data Platform for Scalable Deep Learning
    Arun Kumar, Supun Nakandala, Yuhao Zhang, Side Li, Advitya Gemawat, and Kabir Nagrecha
    CIDR 2021 (Vision paper) | Paper PDF and BibTeX | Talk video

  • Enabling and Optimizing Non-linear Feature Interactions in Factorized Linear Algebra
    Side Li, Lingjiao Chen, and Arun Kumar
    ACM SIGMOD 2019 | Paper PDF and BibTeX | Code and Data on GitHub

  • Tuple-oriented Compression for Large-scale Mini-batch Stochastic Gradient Descent
    Fengan Li, Lingjiao Chen, Yijing Zeng, Arun Kumar, Jeffrey Naughton, Jignesh Patel, and Xi Wu
    ACM SIGMOD 2019 | Paper PDF | TechReport | Code on GitHub

  • Model-based Pricing for Machine Learning in a Data Marketplace
    Lingjiao Chen, Paraschos Koutris, and Arun Kumar
    ACM SIGMOD 2019 | Paper PDF | TechReport | Code and Data coming soon

  • Cerebro: Efficient and Reproducible Model Selection on Deep Learning Systems
    Supun Nakandala, Yuhao Zhang, and Arun Kumar
    ACM SIGMOD 2019 DEEM Workshop | Paper PDF and BibTeX | TechReport | Blog post

  • Demonstration of Nimbus: Model-based Pricing for Machine Learning in a Data Marketplace
    Lingjiao Chen, Hongyi Wang, Leshang Chen, Paraschos Koutris, and Arun Kumar
    ACM SIGMOD 2019 Demo | Paper PDF | Video coming soon

  • A Comparative Evaluation of Systems for Scalable Linear Algebra-based Analytics
    Anthony Thomas and Arun Kumar
    VLDB 2018/2019 | Paper PDF | TechReport | Code and Data

  • Model-based Pricing: Do Not Pay for More than What You Learn!
    Lingjiao Chen, Paraschos Koutris, and Arun Kumar
    ACM SIGMOD 2017 DEEM Workshop | Paper PDF

  • Cerebro: A System to Manage Deep Learning for Relational Data Analytics
    Arun Kumar
    CIDR 2017 Abstract | Paper PDF

  • To Join or Not to Join? Thinking Twice about Joins before Feature Selection
    Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu
    ACM SIGMOD 2016 | Paper PDF and BibTeX | TechReport | Code and Data

  • Model Selection Management Systems: The Next Frontier of Advanced Analytics
    Arun Kumar, Robert McCann, Jeffrey Naughton, and Jignesh M. Patel
    ACM SIGMOD Record Dec 2015 Vision Track | Paper PDF

Technical Reports

  • An Empirical Study on the (Non-)Importance of Cleaning Categorical Duplicates before ML
    Vraj Shah, Thomas Parashos, and Arun Kumar
    Under submission | TechReport

  • Nautilus: An Optimized System for Deep Transfer Learning over Evolving Training Datasets
    Supun Nakandala and Arun Kumar
    Under submission | TechReport

  • SystemX: A Scalable and Optimized Data System for Large Multi-Model Deep Learning
    Kabir Nagrecha and Arun Kumar
    Under submission | TechReport

  • Improving Feature Type Inference Accuracy of TFDV with SortingHat
    Vraj Shah, Kevin Yang, and Arun Kumar
    TechReport

Past Projects

 

Hamlet
Exploiting database schema information to simplify data sourcing.

 

Nimbus
Enabling the first ML-aware cloud-based commodity market for the new black gold: training data.

 

SLAB
The first comprehensive benchmark comparison of scalable linear algebra systems.