ADA Lab @ UCSD
Overview
Applying ML to structured data often involves performing relational operations as part of feature and data engineering. For instance, joins before ML are ubiquitous, since many real-world datasets are multi-table, while almost all ML toolkits expect single-table inputs. This forces data scientists to join those tables and materialize a single table, which leads to data redundancy and wasted runtime. In recent work (Project Orion), we introduced the paradigm of “factorized” ML to mitigate this issue for a few specific ML algorithms by showing how to push ML through joins. But that approach requires manually rewriting each ML implementation. Such a piecemeal approach creates a massive development overhead when extending factorized ML to other ML algorithms.

In this project, we mitigate this overhead by leveraging a popular formal algebra that represents the computations of many ML algorithms: linear algebra (LA). We introduce a new logical data type to represent multi-table data and devise a framework of algebraic rewrite rules to convert a large set of LA operations over denormalized data into operations over the base tables. This enables us to automatically factorize several popular ML algorithms, thus unifying and generalizing prior work. Experiments with real-world multi-table datasets show that our approach also yields significant runtime speedups in multiple ML system environments.

We have prototyped Morpheus in the popular R environment. Versions in Python and TensorFlow, as well as Apache SystemML, are in the works. This project sets the stage for a holistic unification of relational algebra-based feature and data engineering with LA-based ML to help accelerate ML workloads over structured data. The ideas from this work have been prototyped and/or adopted for applications at LogicBlox and Microsoft.

Downloads (Paper, Code, Data, etc.)
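To make the idea of rewriting LA operations over denormalized data into operations over the base tables more concrete, here is a minimal NumPy sketch of one such rewrite, a matrix-vector product over a two-table join. It is an illustration under stated assumptions and not code from the Morpheus prototype (which is in R): the names `S`, `R`, `fk`, `materialized_matvec`, and `factorized_matvec` are hypothetical, and the schema is a toy one in which each row of an entity table `S` references one row of an attribute table `R` through a foreign-key column `fk`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy schema (an assumption for illustration):
# entity table S has n_s rows, attribute table R has n_r rows,
# and fk[i] is the row of R that row i of S joins with.
n_s, n_r, d_s, d_r = 1000, 10, 3, 4
S = rng.normal(size=(n_s, d_s))
R = rng.normal(size=(n_r, d_r))
fk = rng.integers(0, n_r, size=n_s)

# A weight vector over all d_s + d_r features of the joined table.
w = rng.normal(size=d_s + d_r)


def materialized_matvec(S, R, fk, w):
    """Join first: build the denormalized table T = [S, R[fk]], then compute T @ w."""
    T = np.hstack([S, R[fk]])  # each row of R is copied once per matching row of S
    return T @ w


def factorized_matvec(S, R, fk, w):
    """Compute the same product from the base tables, without building T."""
    d_s = S.shape[1]
    w_s, w_r = w[:d_s], w[d_s:]
    partial = R @ w_r              # one dot product per row of R (n_r of them)
    return S @ w_s + partial[fk]   # look up each row's partial result via the key


# Both strategies should agree up to floating-point error.
diff = np.max(np.abs(materialized_matvec(S, R, fk, w) - factorized_matvec(S, R, fk, w)))
print("max abs difference:", diff)
```

In this sketch the factorized version performs the `R`-side arithmetic once per row of `R` rather than once per row of the join output, which is where the savings come from when `R` is much smaller than `S`. A general framework would need analogous rules for many other LA operators; this example covers only one.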
Student Contacts
