ADA Lab @ UCSD
Overview
Artificial neural networks in the form of deep learning (DL) are revolutionizing many machine learning (ML) applications. Their success at major Web companies has created excitement among many enterprises and domain scientists to try DL for their applications. But training DL models is a notoriously painful empirical process, since accuracy is tied to the data representation, neural architecture, and hyper-parameter settings. The common practice for choosing these settings is to empirically compare as many training configurations as feasible for the application, a central process in ML called model selection. This process is inevitable because it is how one controls underfitting vs. overfitting. Model selection is a major bottleneck for the adoption of DL among enterprises and domain scientists due to both the time spent and the resource costs. In this project, we are building a first-of-its-kind model selection-first platform for scalable DL that raises model selection throughput without raising resource costs. Our target setting is small clusters (say, tens of nodes), which covers the vast majority (over 90%) of parallel ML workloads in practice. We have four key system desiderata: scalability, statistical convergence efficiency, reproducibility, and system generality. To satisfy all these desiderata, we have been building a suite of novel parallel execution approaches that enable resource-efficient scalability at all levels of the memory hierarchy (clusters, cloud, local disks, DRAM, and GPU memory) and for all axes of scalability in DL workloads: data sizes, model sizes, number of models and tasks, number of data subgroups, and more. Check out the CIDR 2021 paper below for an overview of our vision for Cerebro.

Our techniques are inspired by classical lessons in RDBMS design and implementation and in operations research, including multi-query optimization, materialized views, and hybrid parallelism. The technical papers cover our suite of new execution approaches, including model hopper parallelism (MOP), gradient accumulation parallelism (GAP), shard alternation parallelism (SHARP), model fusion, and DL feature materialization. All of our techniques are amenable to non-disruptive integration with existing DL frameworks such as TensorFlow and PyTorch, which makes practical adoption easier.

Cerebro is open sourced under Apache License v2.0. Code and detailed documentation are available here: Cerebro System. Most recent long talk on the overall system vision: YouTube video.
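To make the model selection workload and the scheduling idea behind MOP concrete, here is a minimal, hypothetical Python sketch (not Cerebro's actual API; all names and structures below are illustrative assumptions). Data shards stay pinned to workers, and each model configuration "hops" across workers so that it sees every shard exactly once per epoch, with no worker training more than one model at a time and no data replication.

```python
# Illustrative sketch of MOP-style scheduling; NOT the Cerebro API.
from dataclasses import dataclass

@dataclass
class Config:
    """One model selection configuration (hypothetical fields)."""
    name: str
    lr: float
    batch_size: int

def mop_epoch_schedule(num_workers: int, configs: list) -> list:
    """Return a round-robin schedule for one epoch.

    Each data shard is pinned to its worker; models hop between workers
    so every model visits every shard exactly once per epoch and each
    worker trains at most one model per sub-epoch.
    """
    schedule = []  # list of sub-epochs; each is {worker_id: Config}
    k = len(configs)
    for step in range(max(num_workers, k)):
        assignment = {}
        for w in range(num_workers):
            m = (step + w) % max(num_workers, k)
            if m < k:
                assignment[w] = configs[m]
        schedule.append(assignment)
    return schedule

if __name__ == "__main__":
    grid = [Config(f"cfg{i}", lr, bs)
            for i, (lr, bs) in enumerate([(1e-3, 32), (1e-4, 64), (1e-3, 128)])]
    for t, assignment in enumerate(mop_epoch_schedule(num_workers=3, configs=grid)):
        print(f"sub-epoch {t}: " +
              ", ".join(f"worker {w} -> {c.name}" for w, c in assignment.items()))
```

In this toy schedule, each (worker, configuration) pair occurs exactly once per epoch, which is the property that lets MOP combine task parallelism across configurations with partitioned data, without the data duplication of pure task parallelism or the communication overhead of fine-grained data parallelism.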
Overview Papers and Talks
Model Scalability: Saturn and Hydra
Data Scalability: MOP
Scalable Graph Neural Networks
Scalable Transfer Learning
New Backend Extensions (CSE 234 Fall 2021 Projects)
New Abstractions and Interfaces
Applications
Student Contact
Acknowledgments
This project has been supported in part by a Hellman Fellowship, the NIDDK of the NIH under award number R01DK114945, an NSF CAREER Award under award number 1942724, gifts from VMware, and a Meta PhD Fellowship.