ADA Lab @ UCSD

Project Hamlet

Overview

While almost all ML toolkits assume that the input data for ML is a single table, many relational datasets in real-world advanced analytics applications are multi-table, typically connected by key-foreign key dependencies (KFKDs). This forces data scientists to source the extra base tables in order to join with the main table and apply a feature selection method, either explicitly or implicitly, with the aim of improving accuracy. In this project, we study the fundamental effects of KFKDs (and the derivative functional dependencies) on the accuracy and efficiency of ML algorithms from a first-principles perspective.

We show that the features brought in by such joins can often be ignored without affecting ML accuracy significantly, i.e., we can “avoid joins safely.” This has at least two major implications. First, it could help reduce the runtimes of ML and feature selection methods. Second, it could reduce the burden of data sourcing, since data scientists can ignore certain tables altogether. Applying statistical learning theory, we explain how avoiding KFK joins might cause extra overfitting in some cases. Using a simulation study, we validate our analysis and measure the effects of various properties of KFKDs on the accuracy of linear classifiers. We apply our analysis to design easy-to-understand decision rules to predict when it is safe to avoid joins to help data scientists use our idea in practice. Experiments with multiple real-world datasets show that our rules are able to accurately predict when joins are safe to avoid, and in some cases, this led to significant reductions in the runtime of some popular feature selection methods.

Hamlet++ (Hamlet for High-Capacity Classifiers)

Linear models are nice but data scientists also love complex non-linear classifiers with high capacity, e.g., decision trees or RBF-SVMs, due to their potentially higher accuracy. So, in this follow-up work, we ask: what happens to the dichotomy observed on linear classifiers for avoiding joins safely with such high capacity classifiers? One might expect that these complex models would face even worse overfitting. Surprisingly, our results show the exact opposite! Using an extensive empirical and simulation-based analysis, we dissect the behavior of these models over KFK joins. We also take a step towards formally explaining their behavior and identify new questions for further ML theoretical research. Oh, and as for deep learning, neural networks also exhibit the same behavior.

The ideas from this work have been protoyped and/or adopted for applications at Facebook, LogicBlox, and MakeMyTrip.

Downloads (Paper, Code, Data, etc.)

Hamlet++

Are Key-Foreign Key Joins Safe to Avoid when Learning High-Capacity Classifiers?
Vraj Shah, Arun Kumar, and Xiaojin Zhu.
VLDB 2018 (To appear) | Paper PDF and BibTeX | TechReport

Source code on GitHub (Python code for simulations; TensorFlow scripts for neural networks and R scripts for other models with the real-world datasets)
Real-world datasets (slightly different from Hamlet's data; numeric features are not discretized)
Real-world datasets (same as above but categorical features are one-hot encoded)
Detailed errors of all folds for all models and datasets

Hamlet

To Join or Not to Join? Thinking Twice about Joins before Feature Selection
Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu
ACM SIGMOD 2016 | Paper PDF and BibTeX| TechReport

Source code on GitHub (Python code for simulations; R scripts for Naive Bayes and logistic regression with the real-world datasets)
Real-world datasets

Student Contact

Vraj Shah: vps002 [at] eng [dot] ucsd [dot] edu