ADA Lab @ UCSD

 

Project Cerebro

Overview

Artificial neural networks in the form of deep learning (DL) are revolutionizing many machine learning (ML) applications. Their success at major Web companies has created excitement among many enterprises and domain scientists to try DL for their applications. But training DL models is a notoriously painful empirical process, since accuracy is tied to the data representation, neural architecture, and hyper-parameter settings. The common practice for choosing these settings is to empirically compare as many training configurations as feasible for the application, a central process in ML called model selection. This process is unavoidable because it is how one controls underfitting vs. overfitting. Model selection is a major bottleneck for the adoption of DL among enterprises and domain scientists due to both the time it takes and the resources it costs.
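To give a rough sense of why comparing training configurations becomes expensive, the sketch below enumerates a small grid of settings. It is only an illustration and not part of Cerebro: the search space, the setting names, and the helper enumerate_configs are all hypothetical.

```python
from itertools import product

# Hypothetical search space; the names and values are illustrative only
# and do not come from the Cerebro project.
search_space = {
    "architecture": ["resnet50", "vgg16"],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "batch_size": [32, 128],
}


def enumerate_configs(space):
    """Return every combination of settings as a list of dicts."""
    keys = list(space)
    value_lists = [space[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


configs = enumerate_configs(search_space)

# The number of configurations is the product of the number of options
# per setting: 2 * 3 * 2 = 12 models to train and compare.
print(len(configs))
```

Even this tiny grid yields 12 models to train, and each extra setting multiplies the count, which is why the throughput of model selection matters.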

In this project, we are building a first-of-its-kind model selection-first platform for scalable DL that raises model selection throughput without raising resource costs. Our target setting is small clusters (say, 10s of nodes), which covers a vast majority (over 90%) of parallel ML workloads in practice. We have four key system desiderata: scalability, statistical convergence efficiency, reproducibility, and system generality. To satisfy all these desiderata, we have been building a suite of novel parallel execution approaches to enable resource-efficient scalability at all levels of the memory hierarchy (clusters, cloud, local disks, DRAM, and GPU memory) and for all axes of scalability in DL workloads: data sizes, model sizes, number of models and tasks, number of data subgroups, and more.

Check out the CIDR 2021 paper below for an overview of our vision for Cerebro. Our techniques are inspired by classical lessons in RDBMS design and implementation and operations research, including multi-query optimization, materialized views, and hybrid parallelism. The technical papers cover our suite of new execution approaches, including model hopper parallelism (MOP), gradient accumulation parallelism (GAP), shard alternation parallelism (SHARP), model fusion, and DL feature materialization. All of our techniques are amenable to non-disruptive integration with existing DL frameworks such as TensorFlow and PyTorch, which makes practical adoption easier.

Cerebro is open sourced under Apache License v2.0. Code and detailed documentation are available here: Cerebro System

Most recent long talk on the overall system vision: YouTube video.

Overview Papers and Talks

  • Cerebro: A Layered Data Platform for Scalable Deep Learning
    Arun Kumar, Supun Nakandala, Yuhao Zhang, Side Li, Advitya Gemawat, and Kabir Nagrecha
    CIDR 2021 (Vision paper) | Paper PDF and BibTeX | Talk video

  • Cerebro: Efficient and Reproducible Model Selection on Deep Learning Systems
    Supun Nakandala, Yuhao Zhang, and Arun Kumar
    ACM SIGMOD 2019 DEEM Workshop | Paper PDF and BibTeX | Blog post

Model Scalability

  • Hydra: A Data System for Large Multi-Model Deep Learning
    Kabir Nagrecha and Arun Kumar
    Under submission | TechReport

  • Model Parallel Model Selection for Deep Learning Systems
    Kabir Nagrecha
    SIGMOD 2021 (Student Research Competition Winner) | arXiv

Data Scalability

  • Distributed Deep Learning on Data Systems: A Comparative Analysis of Approaches
    Yuhao Zhang, Frank McQuillan, Nandish Jayaram, Nikhil Kak, Ekta Khanna, Orhan Kislal, Domino Valdano, and Arun Kumar
    VLDB 2021 | Paper PDF | TechReport | Talk video | Code release

Transfer Learning

  • Nautilus: An Optimized System for Deep Transfer Learning over Evolving Training Datasets
    Supun Nakandala and Arun Kumar
    Under submission | TechReport

New Abstractions and Interfaces

  • Intermittent Human-in-the-Loop Model Selection using Cerebro: A Demonstration
    Liangde Li, Supun Nakandala, and Arun Kumar
    VLDB 2021 Demo | Paper PDF | TechReport | Video

Applications

  • The CNN Hip Accelerometer Posture (CHAP) Method for Classifying Sitting Patterns from Hip Accelerometers: A Validation Study
    Mikael Anne Greenwood-Hickman, Supun Nakandala, Marta M. Jankowska, Fatima Tuz-Zahra, John Bellettiere, Jordan Carlson, Paul R. Hibbing, Jingjing Zou, Andrea Z. LaCroix, Arun Kumar, and Loki Natarajan
    Medicine and Science in Sports and Exercise Journal, 2021 | Paper PDF coming soon | Code

  • Application of Convolutional Neural Network Algorithms for Advancing Sedentary and Activity Bout Classification
    Supun Nakandala, Marta Jankowska, Fatima Tuz-Zahra, John Bellettiere, Jordan Carlson, Andrea LaCroix, Sheri Hartman, Dori Rosenberg, Jingjing Zou, Arun Kumar, and Loki Natarajan
    Journal for the Measurement of Physical Behaviour, 2021 | Paper PDF and BibTeX | Code

Student Contact

  • Supun Nakandala: snakanda [at] eng [dot] ucsd [dot] edu

  • Yuhao Zhang: yuz870 [at] eng [dot] ucsd [dot] edu

  • Kabir Nagrecha: knagrech [at] ucsd [dot] edu

Acknowledgments

This project was/is supported in part by a Hellman Fellowship, the NIDDK of the NIH under award number R01DK114945, an NSF CAREER Award under award number 1942724, and gifts from VMware.