Streaming weighted sampling over join queries
Published in International Conference on Extending Database Technology (EDBT) 2023, 2023
Regression Models (RMs) and Machine Learning models (ML) in general, aim to offer high prediction accuracy, even for unforeseen queries/datasets. This depends on their fundamental ability to generalize. However, overfitting a model, with respect to the current DB state, may be best suited to offer excellent accuracy. This overfit-generalize divide bears many practical implications faced by a data analyst. The paper will reveal, shed light, and quantify this divide using a large number of real- world datasets and a large number of RMs. It will show that different RMs occupy different positions in this divide, which results in different RMs being better suited to answer queries on different parts of the same dataset (as queries typically target specific data subspaces defined using selection operators on attributes). It will study in detail 8 real-life data sets and from the TPC-DS benchmark and experiment with various dimensionalities therein. It will employ new appropriate metrics that will reveal the performance differences of RMs and will substantiate the problem across a wide variety of popular RMs, ranging from simple linear models to advanced, state-of-the-art, ensembles (which enjoy excellent generalization performance). It will put forth and study a new, query-centric, model that addresses this problem, improving per-query accuracy, while also offering excellent overall accuracy. Finally, it will study the effects of scale on the problem and its solutions.
Recommended citation: Shekelyan, Michael, Graham Cormode, Qingzhi Ma, Ali Mohammadi Shanghooshabad, and Peter Triantafillou. "Streaming weighted sampling over join queries." In Proceedings of the 26th International Conference on Extending Database Technology (EDBT) 2023, March 2023. 2023. https://wrap.warwick.ac.uk/172076/1/WRAP-Streaming-weighted-sampling-join-queries-22.pdf