avatar

Feng (George) Yu

Approximate Query Processing (AQP)

Approximate query processing (or AQP) is an alternative technology to provide approximated answers to complex queries using statistical methods. It aims to provide accurate query estimations within a short time frame. The major difference between traditional query processing (TQP) and AQP is that AQP doesn’t need to wait for the query on the original large data, but can quickly estimate the query result based on statistical approximation methods based on the statistics collected from datasets.

An AQP Example

Given a TPC-H dataset with a volume of 100GB in Apache Hive database, we execute a selection query in the following, denoted by Q, which computes a SUM function on the LINEITEM table.

Q: SELECT SUM(L_QUANTITY) FROM TPCD100GZ05.LINEITEM;

Table 1 demonstrates the AQP accuracy and performance. The traditional query processing ran the query on the original database of 100GB for over two minutes to get the ground truth of the query result, . The AQP method ran the query on a tiny simple random sample without replacement (or SRSWOR) of 10MB of the original database and completed within two seconds. The estimation of query Q is obtained by using the query result on the sample, , divided by the sample ratio . i.e.

This relative error of compared with the ground truth query result is less than 0.22%. Furthermore, if we check the query running times, the speedup factor of AQP over the original query is about 58.2x.

(A) Traditional (Orignal Data 100 GB) (B) AQP (Synopsis 10 MB, sampling fraction 0.01%)
Ground truth=1.5302751993E10 Estimation=1.533593538665E101
Time=143.647s Time=2.468s

Table 1: Traditional Query Processing vs AQP: Relative error 0.22%, Speedup 58.2x

Categories of AQP

Based on how synopses are collected, AQP research can be categorized into online AQP and offline AQP.

Online AQP

The online AQP by the name, starts collecting statistics only after the target query for approximation is given. The online AQP usually relies on auxiliary data structures, such as indices and hash tables, in order to collect statistics quickly. Another drawback of the online AQP is its collected statistics are not reusable and must be re-collected for each different query.

Offline AQP

The offline AQP on the other hand, collects statistics before a target query is submitted. It usually needs the knowledge of the whole database schema in order to create a holistic statistical synopsis.

One advantage of the offline AQP is it doesn’t require auxiliary data structures to collect statistics because the statistics collection happens before the query is given and doesn’t affect the system performance. Another advantage is the collected statistics by the offline AQP are usually reusable for future queries. This alleviates the data accessing costs.

traditional query processing
Traditional Query Processing
Online AQP
Online AQP
Offline AQP
Offline AQP