Approximate Query Processing (AQP)
Approximate query processing (or AQP) is an alternative technology to provide approximated answers to complex queries using statistical methods. It aims to provide accurate query estimations within a short time frame. The major difference between traditional query processing (TQP) and AQP is that AQP doesn’t need to wait for the query on the original large data, but can quickly estimate the query result based on statistical approximation methods based on the statistics collected from datasets.
An AQP Example
Given a TPC-H dataset with a volume of 100GB in Apache Hive database, we execute a selection query in the following, denoted by Q, which computes a SUM function on the LINEITEM table.
Q: SELECT SUM(L_QUANTITY) FROM TPCD100GZ05.LINEITEM;
Table 1 demonstrates the AQP accuracy and performance. The traditional query processing ran the query on the original database of 100GB for over two minutes to get the ground truth of the query result,
This relative error of
(A) Traditional (Orignal Data 100 GB) | (B) AQP (Synopsis 10 MB, sampling fraction 0.01%) |
---|---|
![]() |
![]() |
Ground truth=1.5302751993E10 | Estimation=1.533593538665E101 |
Time=143.647s | Time=2.468s |
Table 1: Traditional Query Processing vs AQP: Relative error 0.22%, Speedup 58.2x
Categories of AQP
Based on how synopses are collected, AQP research can be categorized into online AQP and offline AQP.
Online AQP
The online AQP by the name, starts collecting statistics only after the target query for approximation is given. The online AQP usually relies on auxiliary data structures, such as indices and hash tables, in order to collect statistics quickly. Another drawback of the online AQP is its collected statistics are not reusable and must be re-collected for each different query.
Offline AQP
The offline AQP on the other hand, collects statistics before a target query is submitted. It usually needs the knowledge of the whole database schema in order to create a holistic statistical synopsis.
One advantage of the offline AQP is it doesn’t require auxiliary data structures to collect statistics because the statistics collection happens before the query is given and doesn’t affect the system performance. Another advantage is the collected statistics by the offline AQP are usually reusable for future queries. This alleviates the data accessing costs.
- Traditional Query Processing
- Online AQP
- Offline AQP