Project Hamlet


While almost all ML toolkits assume that the input data for ML is a single table, many relational datasets in real-world advanced analytics applications are multi-table, typically connected by key-foreign key dependencies (KFKDs). This forces data scientists to source the extra base tables in order to join with the main table and apply a feature selection method, either explicitly or implicitly, with the aim of improving accuracy. In this project, we study the fundamental effects of KFKDs (and the derivative functional dependencies) on the accuracy and efficiency of ML algorithms from a first-principles perspective.

We show that the features brought in by such joins can often be ignored without affecting ML accuracy significantly, i.e., we can “avoid joins safely.” This has at least two major implications. First, it could help reduce the runtimes of ML and feature selection methods. Second, it could reduce the burden of data sourcing, since data scientists can ignore certain tables altogether. Applying statistical learning theory, we explain how avoiding KFK joins might cause extra overfitting in some cases. Using a simulation study, we validate our analysis and measure the effects of various properties of KFKDs on the accuracy of linear classifiers. We apply our analysis to design easy-to-understand decision rules to predict when it is safe to avoid joins to help data scientists use our idea in practice. Experiments with multiple real-world datasets show that our rules are able to accurately predict when joins are safe to avoid, and in some cases, this led to significant reductions in the runtime of some popular feature selection methods.

Hamlet++ (Hamlet for High-Capacity Classifiers)

Linear models are nice but data scientists also love complex non-linear classifiers with high capacity, e.g., decision trees or RBF-SVMs, due to their potentially higher accuracy. So, in this follow-up work, we ask: what happens to the dichotomy observed on linear classifiers for avoiding joins safely with such high capacity classifiers? One might expect that these complex models would face even worse overfitting. Surprisingly, our results show the exact opposite! Using an extensive empirical and simulation-based analysis, we dissect the behavior of these models over KFK joins. We also take a step towards formally explaining their behavior and identify new questions for further ML theoretical research. Oh, and as for deep learning, neural networks also exhibit the same behavior.

The ideas from this work have been protoyped and/or adopted for applications at Facebook, LogicBlox, and MakeMyTrip.

Downloads (Paper, Code, Data, etc.)


  • Are Key-Foreign Key Joins Safe to Avoid when Learning High-Capacity Classifiers?
    Vraj Shah, Arun Kumar, and Xiaojin Zhu.
    VLDB 2018 (To appear) | Paper PDF and BibTeX | TechReport


  • To Join or Not to Join? Thinking Twice about Joins before Feature Selection
    Arun Kumar, Jeffrey Naughton, Jignesh M. Patel, and Xiaojin Zhu
    ACM SIGMOD 2016 | Paper PDF and BibTeX| TechReport

Student Contact

Vraj Shah: vps002 [at] eng [dot] ucsd [dot] edu