ADA Lab @ UCSD

Project Orion

Overview

While almost all ML toolkits assume that the input data for ML is a single table, many relational datasets in real-world advanced analytics applications are multi-table, typically connected by key-foreign key dependencies (KFKDs). This forces data scientists to join the base tables and materialize a single table. This strategy of learning after joins introduces redundancy in the data, which could lead to poorer end-to-end performance and maintenance overheads due to data duplication.

In this project, we introduce the paradigm of learning over joins to avoid the need to materialize joins before ML. In particular, we introduce a novel technique we call “factorized learning” that decomposes ML computations and pushes them down through joins to the base tables, inspired by how RDBMSs push relational operations through joins. Focusing on a large class of ML models known as generalized linear models (GLMs) solved using gradient descent methods, we show that factorized learning works, including at scale. We also analyze the performance of alternative plans and devise RDBMS-style cost models. Extensive empirical analysis with different scale factors of data validates the performance and storage benefits of factorized learning and the accuracy of our cost models.
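To make the idea of pushing a gradient computation through a join concrete, here is a minimal sketch in plain Python. It is an illustration only, not Orion's code: the two-table schema (`S` with a foreign key into `R`), the toy data, the choice of logistic loss, and the function names `grad_materialized` and `grad_factorized` are all assumptions made for this example. It computes the same gradient twice, once over the joined rows and once directly over the base tables.

```python
import math

# Hypothetical toy schema: each S row is (label, S features, foreign key into R).
S = [
    (1.0, [1.0, 0.5], "a"),
    (-1.0, [0.2, 1.5], "b"),
    (1.0, [0.9, 0.1], "a"),
    (-1.0, [0.3, 1.1], "b"),
    (1.0, [1.2, 0.4], "b"),
]
# R maps a primary key to that tuple's features.
R = {"a": [2.0, 0.0, 1.0], "b": [0.5, 1.0, 0.0]}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def grad_materialized(w_s, w_r):
    """Logistic-loss gradient over the joined rows (learning after joins)."""
    g_s = [0.0] * len(w_s)
    g_r = [0.0] * len(w_r)
    for y, xs, fk in S:
        xr = R[fk]  # the join repeats R's features for every matching S row
        z = dot(w_s, xs) + dot(w_r, xr)
        c = -y / (1.0 + math.exp(y * z))  # derivative of log(1 + exp(-y*z)) w.r.t. z
        for j, v in enumerate(xs):
            g_s[j] += c * v
        for j, v in enumerate(xr):
            g_r[j] += c * v
    return g_s, g_r


def grad_factorized(w_s, w_r):
    """Same gradient, computed on the base tables without joining them."""
    # Pass 1 over R: one partial inner product per R tuple.
    partial = {rid: dot(w_r, xr) for rid, xr in R.items()}
    # Pass over S: reuse the partials and sum the per-row coefficients by key.
    g_s = [0.0] * len(w_s)
    coef_sum = {rid: 0.0 for rid in R}
    for y, xs, fk in S:
        z = dot(w_s, xs) + partial[fk]
        c = -y / (1.0 + math.exp(y * z))
        for j, v in enumerate(xs):
            g_s[j] += c * v
        coef_sum[fk] += c
    # Pass 2 over R: scale each R tuple's features once by its summed coefficient.
    g_r = [0.0] * len(w_r)
    for rid, xr in R.items():
        for j, v in enumerate(xr):
            g_r[j] += coef_sum[rid] * v
    return g_s, g_r
```

Both functions return the same gradient up to floating-point rounding. The factorized version touches each R tuple's features twice per iteration regardless of how many S rows reference it, which is where the savings come from when the join would repeat R's features many times.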

We have prototyped Orion using user-defined functions in the PostgreSQL and Hive/Hadoop environments. Orion has been generalized and superseded by Project Morpheus.

The ideas from this work have been prototyped and/or adopted for applications at LogicBlox and Microsoft.

Santoku (Factorized learning and scoring in R)

This toolkit extends Orion by applying the idea of factorized learning to other ML models with categorical features, such as Naive Bayes and tree-augmented Naive Bayes (TAN), and implements them in the popular R environment. We also extend the idea to ML inference with the “factorized scoring” technique and introduce capabilities for exploratory feature selection across different tables.
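For models over categorical features, the statistics that training needs are counts, and counts can also be pushed through a key-foreign key join. The sketch below illustrates this for the per-class feature counts used by Naive Bayes. It is written in Python for consistency with the other example on this page rather than in R, and it is not Santoku's implementation: the tables `S` and `R`, their values, and the function names `counts_materialized` and `counts_factorized` are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: each S row is (class label, S feature, foreign key into R).
S = [
    ("yes", "red", 1),
    ("no", "blue", 2),
    ("yes", "red", 2),
    ("no", "red", 3),
    ("yes", "blue", 1),
]
# R maps a primary key to a single categorical feature.
R = {1: "gold", 2: "basic", 3: "gold"}


def counts_materialized():
    """Count (R feature value, label) pairs by looking up R for every S row."""
    counts = Counter()
    for label, _, fk in S:
        counts[(R[fk], label)] += 1
    return counts


def counts_factorized():
    """Same counts: aggregate over S by (key, label) first, then map keys to R values."""
    per_key = Counter((fk, label) for label, _, fk in S)  # uses S alone
    counts = Counter()
    for (fk, label), n in per_key.items():
        counts[(R[fk], label)] += n  # one R lookup per distinct (key, label) pair
    return counts
```

The two functions produce identical counts. In the factorized version, the number of lookups into R depends on the number of distinct (key, label) pairs rather than on the number of rows in S.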

Downloads (Paper, Code, Data, etc.)

Orion

Learning Generalized Linear Models Over Normalized Data
Arun Kumar, Jeffrey Naughton, and Jignesh M. Patel
ACM SIGMOD 2015 | Paper PDF

Orion on GitHub (source code for data synthesizer, Orion on PostgreSQL, and Orion on Hive)

Santoku

Demonstration of Santoku: Optimizing Machine Learning over Normalized Data
Arun Kumar, Mona Jalal, Boqun Yan, Jeffrey Naughton, and Jignesh M. Patel
VLDB 2015 (Demo) | Paper PDF

Santoku on GitHub (source code, including R Shiny-based GUI, and real datasets)