CN117216096A - Approximate query processing optimization method


Info

Publication number: CN117216096A (application CN202311080324.6A; granted as CN117216096B)
Authority: CN (China)
Legal status: Granted
Application number: CN202311080324.6A
Other languages: Chinese (zh)
Other versions: CN117216096B
Inventors: 刘雨蒙, 万梓航, 赵怡婧, 王潮, 徐帆江
Current Assignee: Institute of Software of CAS
Original Assignee: Institute of Software of CAS
Priority date: 2023-08-25
Filing date: 2023-08-25
Application filed by Institute of Software of CAS; priority to CN202311080324.6A
Publication of CN117216096A: 2023-12-12
Application granted; publication of CN117216096B: 2024-04-02
Legal status: Active


Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02D: Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses an approximate query processing optimization method, which comprises the following steps: 1) construct a predicate-aware generator and an offline query data set, train the predicate-aware generator with the offline query data set, and generate a sample data set; 2) construct a selectivity estimator and train it with the offline query data set to obtain the selectivity factor of each query condition in a query statement; 3) for a currently input query statement Q, if the aggregation function in Q is averaging, calculate the average of the representative samples in the sample data set according to Q; if the aggregation function is summation, input Q into the trained selectivity estimator to obtain the corresponding set of selectivity factors, then query the sample data set according to Q weighted with that set of selectivity factors; if the aggregation function is a count, use the trained selectivity estimator to estimate the number of representative samples in the sample data set that satisfy Q.

Description

Approximate query processing optimization method
Technical Field
The application belongs to the technical field of artificial intelligence, and relates to an approximate query processing optimization method.
Background
Approximate query processing provides fast and "sufficiently accurate" results for costly aggregate queries in database systems, improving the interactive experience of users on large data sets. In big data analysis, users often need to run complex aggregate queries on the data, such as averages, sums, and counts, to obtain the information they need.
Machine-learning-based approximate query processing techniques, such as DeepDB with its special data structure or the simple generative model VAE-AQP, achieve very low query latency because query execution involves only model inference, avoiding the complex query processing pipeline of a traditional database system.
DeepDB is an approximate query processing system based on deep learning whose core idea is to convert the query execution process into a neural network inference process; in addition, it employs a compact binary representation of the original data set, quantizing the data and storing it in a binary tensor to reduce storage and computation costs.
VAE-AQP converts a query into an operation in the latent space and uses a trained VAE model to infer and sample from the latent space, thereby generating an approximate query result; because the data representation in the latent space is low-dimensional and the inference and sampling of the VAE model are efficient, query results can be generated in a short time.
Since analysts need a large number of aggregate queries to filter information in the database, the above models may be limited in modeling the data when processing query statements containing many aggregate queries, causing the approximation errors of the DeepDB and VAE-AQP methods to increase significantly. Keeping the approximation error low is therefore important to prevent analysts from drawing misleading conclusions.
Disclosure of Invention
In view of the problems in the prior art, an object of the present application is to provide an approximate query processing optimization method aimed at answering analytical queries containing a large number of aggregate queries with low approximation error. The application performs approximate query processing with a predicate-aware generator and a selectivity estimator, and adopts a corresponding training method to capture the accurate conditional distribution of the data, so that accurate results can be obtained even for highly selective queries; a small number of representative samples is then generated at run time to provide sufficiently accurate answers. The proposed method addresses the problem of high approximation error when processing queries containing many filtering predicates, and remains effective for queries with very strict filtering conditions.
The selectivity estimator provided by the application uses a fully connected neural network; a more complex model structure, such as an attention mechanism or a convolutional neural network, may also be used to obtain a better training effect.
The model structure of the predicate-aware generator provided by the application uses a fully connected neural network; an attention mechanism, a generative adversarial network, a graph neural network, and the like may also be added.
The technical scheme of the application is as follows:
an approximate query processing optimization method, comprising the steps of:
1) Construct a predicate-aware generator and an offline query data set, train the predicate-aware generator with the offline query data set, and generate a sample data set. The predicate-aware generator includes: an encoder module, which encodes the real samples in the offline query data set and fits the posterior distribution corresponding to each real sample; a prior distribution calculation module, which applies a mask to the real samples to obtain mask samples, encodes the mask samples and fits the prior distribution of each mask sample, after which the latent distribution is fitted from the posterior distribution and the prior distribution; and a decoder module, which generates representative samples from latent vectors z sampled from the latent space. The representative samples make up the sample data set.
2) Construct a selectivity estimator and train it with the offline query data set to obtain the selectivity factor of each query condition in the query statement. The selectivity estimator builds a feature vector $Y = [y_1, y_2, \ldots, y_i, \ldots, y_n]$ from the n query conditions parsed from the input query statement, where $y_i$ is the i-th feature and corresponds to the i-th query condition; it then sets the selectivity factor set s corresponding to the features in Y, queries the database, and calculates the proportion of the data satisfying the query statement in the entire data set; a loss function measuring the error between each predicted selectivity factor $s_i$ and its label $s'_i$ is then used to optimize the selectivity estimator and determine the optimal selectivity factor set s corresponding to the query statement, where $s_i$ is the predicted selectivity factor corresponding to $y_i$ and $s'_i$ is the label of $y_i$.
3) For a currently input query statement Q: if the aggregation function in Q is averaging, calculate the average of the representative samples in the sample data set according to Q; if the aggregation function in Q is summation, input Q into the trained selectivity estimator to obtain the set of selectivity factors corresponding to Q, and then query the sample data set according to Q weighted with that set of selectivity factors; if the aggregation function in Q is a count, use the trained selectivity estimator to estimate the number of representative samples in the sample data set satisfying Q as $D_c = D_t s_c$, where $D_t$ is the number of all data in the sample data set and $s_c$ is the selectivity factor corresponding to the c-th query condition in Q.
Further, the parameters of the posterior distribution include the mean $\mu_\phi$ of the posterior distribution and the variance $\sigma_\phi$ of the posterior distribution.
Further, the method for obtaining the mask samples by applying the mask to the real samples is as follows: the sub-table $T'_s$ and the mask sequence $\bar{M}$ are input to the prior distribution calculation module, the field values in $T'_s$ being $V_{\bar{x}}$; the prior distribution calculation module applies complete masking to the numeric field values in $T'_s$ and hierarchical masking to the text field values to obtain the mask samples; the hierarchical mask adjusts a random mask according to the occurrence frequency of the samples of each text field name, where samples with low occurrence frequency receive a lower mask probability and samples with higher occurrence frequency receive a higher mask probability; the prior distribution is then calculated from the field values $V_{\bar{x}}$ and the sequence $\bar{M}$, its parameters including the mean $\mu_\psi$ and the variance $\sigma_\psi$ of the prior distribution.
Further, $\mu_\psi = \mathrm{fc}_6(h^{(5)})$ and $\sigma_\psi = \mathrm{fc}_7(h^{(5)})$, where $h^{(5)} = h^{(3)} \| h^{(4)}$, $h^{(3)} = \mathrm{Relu}(\mathrm{fc}_4(V_{\bar{x}}))$, $h^{(4)} = \mathrm{Relu}(\mathrm{fc}_5(\bar{M}))$; $\mathrm{fc}_4$ is a fully connected neural network that learns $V_{\bar{x}}$, $\mathrm{fc}_5$ is a fully connected neural network that learns $\bar{M}$, and Relu is a nonlinear activation function.
Further, the method for generating the representative samples is: first, the posterior distribution and the prior distribution are fused to obtain the latent distribution, whose mean is $\mu = \mu_\phi W_1 + \mu_\psi W_2$ and whose variance is $\sigma = \sigma_\phi W_3 + \sigma_\psi W_4$, where $W_1, W_2, W_3, W_4$ are learnable weight matrices; then a random vector v is sampled from the standard Gaussian distribution and the latent vector $z = \mu + \exp(0.5\sigma)\, v$ is computed, exp being an exponential function; the latent vector z is then input to the decoder module to generate a representative sample $V_{x'} = \mathrm{sigmoid}(\mathrm{fc}_9(z'))$, $z' = \mathrm{Relu}(\mathrm{fc}_8(z))$, where $\mathrm{fc}_8$ and $\mathrm{fc}_9$ are fully connected layers and sigmoid is a nonlinear activation function.
Further, in training the predicate-aware generator, the parameters of the predicate-aware generator are optimized by the loss function $\mathcal{L} = \mathrm{KL}\big(q_\delta(z_1 \mid V_x \| M) \,\big\|\, p_\psi(z_2 \mid V_{\bar{x}} \| \bar{M})\big) - \mathbb{E}\big[\log p_\psi(V_{x'} \mid z)\big]$ so that the generated representative samples approximate the distribution of the sub-table $T_s$; where KL is the KL divergence function, $-\mathbb{E}[\log p_\psi(V_{x'} \mid z)]$ is the reconstruction error term, log is the logarithmic function, $q_\delta(z_1 \mid V_x \| M)$ is the posterior distribution calculated by the encoder module, $z_1$ is a random variable of the posterior distribution, $\|$ is a concatenation operation, $p_\psi(z_2 \mid V_{\bar{x}} \| \bar{M})$ is the prior distribution calculated by the prior distribution calculation module, $z_2$ is a random variable of the prior distribution, and $p_\psi(V_{x'} \mid z)$ is the distribution of the generated sample $V_{x'}$ computed from the latent vector z.
Further, the sub-table $T_s$ and the mask-free sequence $M$ are input to the encoder module, which fits the posterior distribution of the data in $T_s$ from the field values $V_x$ of $T_s$ and their corresponding mask sequence $M$; the mean of the posterior distribution is $\mu_\phi = \mathrm{fc}_2(h^{(2)})$ and the variance of the posterior distribution is $\sigma_\phi = \mathrm{fc}_3(h^{(2)})$, where $h^{(2)} = h^{(0)} \| h^{(1)}$, $h^{(0)} = \mathrm{Relu}(\mathrm{fc}_0(V_x))$, $h^{(1)} = \mathrm{Relu}(\mathrm{fc}_1(M))$; $\mathrm{fc}_0$ is a fully connected neural network that learns $V_x$, $\mathrm{fc}_1$ is a fully connected neural network that learns $M$, $\mathrm{fc}_2$ is a fully connected neural network for calculating $\mu_\phi$, $\mathrm{fc}_3$ is a fully connected neural network for calculating $\sigma_\phi$, and Relu is a nonlinear activation function.
A server comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the above method.
A computer readable storage medium having stored thereon a computer program, characterized in that the computer program when executed by a processor performs the steps of the method described above.
The application has the following characteristics:
1. For a highly selective query, few samples in the data set satisfy the query conditions, and computing the aggregate query result over all samples is very time-consuming; the predicate-aware generator is therefore proposed to generate a small set of representative samples. Through model training, the generated samples represent the whole data set well, so that when computing the aggregate query result only the representative samples need to be processed, greatly reducing the computational complexity;
2. the predicate-aware generator fits the latent distribution with the prior distribution and the posterior distribution simultaneously, so that the generated representative samples reflect the real distribution more accurately; executing queries over these representative samples then reduces the approximation error of the query;
3. the application proposes a hierarchical masking technique that applies a high mask probability to sample data with high occurrence frequency and a low mask probability to sample data with low occurrence frequency, so that the model learns rare sample data better and thus learns a better latent conditional distribution, improving the accuracy and generalization of the sample data generated by the model.
The application has the following advantages:
1. The application adopts a predicate-aware generator which, in the training stage, captures the accurate conditional distribution of the real data in the input offline data set. Compared with the prior art, the sample data generated by the predicate-aware generator is closer to the real distribution of the whole data set and better simulates the characteristics and distribution of the real data.
2. Compared with the prior art, the method shows lower approximation error when answering query statements with a large number of aggregate queries; this means it can provide approximate answers closer to the true results under complex queries, so that the user obtains more accurate results when interacting with the database system.
3. The method obtains accurate results when processing highly selective queries. A highly selective query refers to a database with a large amount of data of which only a small amount satisfies a specific aggregate query condition; in this case, existing query processing methods incur larger computational errors because the data in the sample cannot accurately represent the whole data set. The predicate-aware generator, however, can handle strict filtering conditions, avoiding the risk of drawing false conclusions in complex query tasks.
Drawings
FIG. 1 is a schematic flow chart of the method.
FIG. 2 is a block diagram of the predicate-aware generator.
FIG. 3 is a flow chart of an approximate query processing system.
Detailed Description
The application will now be described in further detail with reference to the accompanying drawings, which are given by way of illustration only and are not intended to limit the scope of the application.
The application provides an approximate query processing optimization method that aims to provide fast and "sufficiently accurate" results for highly selective aggregate queries when a user interacts with a database system. As shown in FIG. 1, the offline query data set is first used to train the predicate-aware generator and the selectivity estimator. After training, the predicate-aware generator generates a small number of representative samples according to the query conditions parsed from the online query; the aggregate query is then executed over the representative samples and an approximate result is computed. The selectivity estimator generates selectivity factors whose purpose is to optimize the result of the aggregate query. The offline data set consists of partial data in the database and the corresponding query statements, i.e., it includes aggregate query statements over the database and their query results; each aggregate query statement, its corresponding data, and its corresponding query result serve as one sample.
The method answers users' aggregate query requests online and in real time, using query results obtained by executing aggregate queries over representative samples generated by the pre-trained predicate-aware generator and calibrated by the selectivity estimator, thereby improving the efficiency and performance of the system for highly selective queries. A highly selective query is one for which few samples in the data set satisfy the query conditions, so that computing over all samples consumes a great deal of time; generating representative samples addresses this scarcity and further reduces computation time.
S1, construct the predicate-aware generator and generate the sample data set: the interactive data browsing and analysis between the user and the database system is expressed by query statements of the following structure:
SELECT ψ, AGG(N) AS a FROM T WHERE θ [GROUP BY C] [ORDER BY β]
where the aggregate query AGG (including the average AVG, the sum SUM, etc.) is applied to the numeric field name N satisfying the selection predicate θ; GROUP BY (grouping clause) and ORDER BY (ordering clause) are applied to the text field names C and β; T is the table name of the query and ψ is the field name of the query. AGG, GROUP BY, and ORDER BY belong to the query conditions, which are the user's requirements and operations on the data and help the database system accurately retrieve and process the data so as to return query results meeting the user's needs. The query conditions include: predicates, aggregate functions, grouping clauses, ordering clauses, projection clauses, join clauses, sub-query clauses, range clauses, union clauses, and window functions.
For all field names A in a table T in a database D, $A = A_N \cup A_C$, where $A_N$ denotes the numeric field names and $A_C$ denotes the text field names. Let a be a row of data in table T, $a = [x_1, x_2, \ldots, x_K]$, a vector composed of K field values, where the field values include numeric field values and text field values. Let $m = [m_1, m_2, \ldots, m_K] \in \{0,1\}^K$ denote a binary mask sequence of dimension K, where $m_i = 1$ denotes an unobserved field value and $m_i = 0$ denotes an observed field value. Applying m to a yields $a_m$, the field values not observed in a, and $a_{1-m}$, the field values observed in a.
An innovative masking strategy is added to the predicate-aware generator to model the unobserved field values $a_m$: given the observed field values $a_{1-m}$ and the mask sequence m, the generator learns the conditional distribution of the complete a, i.e., $p(a \mid a_{1-m}, m)$, covering both $a_m$ and $a_{1-m}$. Learning only the distribution $p(a_m \mid a_{1-m}, m)$, which generates $a_m$ alone, can produce samples that do not match reality: for a given set of observed attributes $a_{1-m}$, a completed combination may not exist in the data and should not be generated (a false positive example). To avoid this, the application learns the distribution $p(a \mid a_{1-m}, m)$ and generates the complete a (including both $a_m$ and $a_{1-m}$) rather than just learning $p(a_m \mid a_{1-m}, m)$. By learning $p(a \mid a_{1-m}, m)$, the predicate-aware generator better understands the relationships and dependencies between all attributes and generates samples that conform more closely to the actual data distribution; this helps ensure that the generated samples contain no non-existent attribute combinations, avoids generating false positive examples, helps the generator capture complex relationships between attributes more accurately, and improves the quality and accuracy of the generated samples. A concrete illustration of this notation is sketched below.
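For illustration only, a minimal NumPy sketch of the masking notation above; the record and mask values are hypothetical, loosely following Table 1:

```python
import numpy as np

# One record a with K = 5 field values (mixed text / numeric), as in Table 1.
a = np.array(["Jan", 250, "Chrome", "Android", "Adult"], dtype=object)

# Binary mask sequence m: m_i = 1 marks an unobserved field value,
# m_i = 0 marks an observed one.
m = np.array([0, 1, 0, 1, 0])

a_m = a[m == 1]   # unobserved part a_m     -> [250, "Android"]
a_1m = a[m == 0]  # observed part a_{1-m}   -> ["Jan", "Chrome", "Adult"]
print(a_m, a_1m)
```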
As shown in FIG. 2, the predicate-aware generator consists of three modules: the encoder module encodes the real samples and fits the posterior distribution corresponding to each real sample; the prior distribution calculation module applies the mask to the real samples to obtain mask samples, then encodes the mask samples and fits the prior distribution of each mask sample; the decoder module generates representative samples from latent vectors z sampled from the latent space.
As shown in Table 1, a sub-table $T_s$ of table T and the mask-free sequence $M$ are input to the encoder module. The encoder module fits a posterior distribution approximating the data in $T_s$ from the field values $V_x$ of $T_s$ and their corresponding mask-free sequence $M$; since all field values seen by the encoder module are observable, the values of the mask-free sequence are all 0, i.e. $M = [0, \ldots, 0]$. $T_s$ contains the field names and the field values $V_x$. Through a neural network, the field values $V_x$ and the mask-free sequence $M$ fit the posterior distribution $q_\delta(z_1 \mid V_x \| M)$, where δ is the parameter to be estimated and $z_1$ is a random variable of the posterior distribution. The module outputs the parameters of the posterior distribution, the mean $\mu_\phi$ and the variance $\sigma_\phi$, which represent the posterior distribution and are computed as follows:

$h^{(0)} = \mathrm{Relu}(\mathrm{fc}_0(V_x))$
$h^{(1)} = \mathrm{Relu}(\mathrm{fc}_1(M))$

where $\mathrm{fc}_0$ is a fully connected neural network that learns $V_x$, $\mathrm{fc}_1$ is a fully connected neural network that learns $M$, and Relu is a nonlinear activation function. The embedded representations $h^{(0)}$ and $h^{(1)}$ of $V_x$ and $M$ are then concatenated:

$h^{(2)} = h^{(0)} \| h^{(1)}$

where $\|$ is the concatenation operation. The parameters of the posterior distribution are then calculated as:

$\mu_\phi = \mathrm{fc}_2(h^{(2)})$
$\sigma_\phi = \mathrm{fc}_3(h^{(2)})$

where $\mathrm{fc}_2$ is a fully connected neural network for calculating $\mu_\phi$ and $\mathrm{fc}_3$ is a fully connected neural network for calculating $\sigma_\phi$.
Table 1. Encoder module input information

Month  Value  Browser  OS       Age
Jan    250    Chrome   Android  Adult
Jan    500    Safari   Android  Adult
Jan    100    Safari   IOS      Senior
Feb    200    Chrome   IOS      Youth
Feb    600    Chrome   Android  Adult
Feb    350    Safari   IOS      Senior
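For illustration only, a minimal PyTorch sketch of the encoder module described above; it assumes the field values $V_x$ have already been numerically encoded into a flat vector, and all layer dimensions are assumptions rather than values fixed by the application:

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Fits the posterior q_delta(z1 | V_x || M): outputs mu_phi and sigma_phi."""
    def __init__(self, d_in: int, k: int, d_hidden: int = 64, d_latent: int = 16):
        super().__init__()
        self.fc0 = nn.Linear(d_in, d_hidden)           # learns the field values V_x
        self.fc1 = nn.Linear(k, d_hidden)              # learns the mask sequence M
        self.fc2 = nn.Linear(2 * d_hidden, d_latent)   # computes mu_phi
        self.fc3 = nn.Linear(2 * d_hidden, d_latent)   # computes sigma_phi

    def forward(self, v_x: torch.Tensor, mask: torch.Tensor):
        h0 = torch.relu(self.fc0(v_x))     # h(0) = Relu(fc0(V_x))
        h1 = torch.relu(self.fc1(mask))    # h(1) = Relu(fc1(M))
        h2 = torch.cat([h0, h1], dim=-1)   # h(2) = h(0) || h(1)
        return self.fc2(h2), self.fc3(h2)  # mu_phi, sigma_phi
```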
As shown in Table 2, the sub-table $T_s$ and the mask-free sequence are input to the prior distribution calculation module, which uses a brand-new masking strategy: on the one hand, the numeric field values in the input sub-table $T_s$ are completely masked, i.e., if a field value is numeric, its value in the mask sequence is 1; on the other hand, the text field values in $T_s$ are hierarchically masked. The hierarchical mask adjusts a random mask according to the occurrence frequency of the samples of each text field name, so that samples with low occurrence frequency have a lower mask probability and samples with high occurrence frequency have a higher mask probability; rare samples are thereby learned better, improving the latent distribution learning ability of the predicate-aware generator. Hierarchical masking sets each mask entry so that

$\bar{M}_i = 1$ with probability proportional to $r \cdot S_i[V_x]$, and $\bar{M}_i = 0$ otherwise,

where $S_i[V_x]$ is the frequency with which the text field value appears in the data, r is a mask factor, 0 denotes no mask, and 1 denotes a mask. The mask sequence $\bar{M}$ is thus computed and, through a neural network, the masked field values $V_{\bar{x}}$ and the mask sequence $\bar{M}$ fit the prior distribution $p_\psi(z_2 \mid V_{\bar{x}} \| \bar{M})$, where ψ is the parameter to be estimated and $z_2$ is a random variable of the prior distribution. The module outputs the parameters of the prior distribution, the mean $\mu_\psi$ and the variance $\sigma_\psi$, computed as follows:

$h^{(3)} = \mathrm{Relu}(\mathrm{fc}_4(V_{\bar{x}}))$
$h^{(4)} = \mathrm{Relu}(\mathrm{fc}_5(\bar{M}))$

where $\mathrm{fc}_4$ is a fully connected neural network that learns $V_{\bar{x}}$ and $\mathrm{fc}_5$ is a fully connected neural network that learns $\bar{M}$. The embedded representations $h^{(3)}$ and $h^{(4)}$ of $V_{\bar{x}}$ and $\bar{M}$ are then concatenated:

$h^{(5)} = h^{(3)} \| h^{(4)}$

Then the parameters of the prior distribution, the mean $\mu_\psi$ and the variance $\sigma_\psi$, are calculated:

$\mu_\psi = \mathrm{fc}_6(h^{(5)})$
$\sigma_\psi = \mathrm{fc}_7(h^{(5)})$

where $\mathrm{fc}_6$ is a fully connected neural network for calculating $\mu_\psi$ and $\mathrm{fc}_7$ is a fully connected neural network for calculating $\sigma_\psi$.
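For illustration only, a minimal PyTorch sketch of the masking rule and the prior distribution calculation module; the proportional form p = r · S_i[V_x] is one plausible reading of the frequency-based rule described above (the original formula is not preserved in this text), and layer dimensions are assumptions:

```python
import torch
import torch.nn as nn

def hierarchical_mask(text_values, frequencies, r=0.5):
    """Mask text field values with probability proportional to their frequency:
    frequent values are masked more often, rare values are kept more often.
    Numeric field values are always fully masked elsewhere, per the text.
    `frequencies` maps each value to S_i[V_x]; `r` is the mask factor."""
    mask = []
    for v in text_values:
        p = min(1.0, r * frequencies[v])               # assumed form of the rule
        mask.append(1 if torch.rand(1).item() < p else 0)
    return mask  # 1 = masked (unobserved), 0 = kept (observed)

class PriorNet(nn.Module):
    """Fits the prior p_psi(z2 | V_x_bar || M_bar): outputs mu_psi and sigma_psi."""
    def __init__(self, d_in: int, k: int, d_hidden: int = 64, d_latent: int = 16):
        super().__init__()
        self.fc4 = nn.Linear(d_in, d_hidden)           # learns masked field values
        self.fc5 = nn.Linear(k, d_hidden)              # learns the mask sequence
        self.fc6 = nn.Linear(2 * d_hidden, d_latent)   # computes mu_psi
        self.fc7 = nn.Linear(2 * d_hidden, d_latent)   # computes sigma_psi

    def forward(self, v_masked: torch.Tensor, mask: torch.Tensor):
        h3 = torch.relu(self.fc4(v_masked))  # h(3)
        h4 = torch.relu(self.fc5(mask))      # h(4)
        h5 = torch.cat([h3, h4], dim=-1)     # h(5) = h(3) || h(4)
        return self.fc6(h5), self.fc7(h5)
```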
The posterior distribution parameters and the prior distribution parameters calculated by the encoder module and the prior distribution calculation module are fused to obtain the latent distribution, and a latent variable z is sampled from it, as expressed by the following formulas:

$\mu = \mu_\phi W_1 + \mu_\psi W_2$
$\sigma = \sigma_\phi W_3 + \sigma_\psi W_4$

where μ and σ are the mean and variance of the latent distribution, used to represent it, and $W_1, W_2, W_3, W_4$ are learnable weight matrices.
Next, a random vector v is sampled from the standard Gaussian distribution (mean 0, variance 1), and the latent vector z is calculated as follows:

$z = \mu + \exp(0.5\sigma)\, v$

where exp(·) is the exponential function.
The latent variable z is input to the decoder module, which outputs the generated sample $V_{x'}$, as expressed by the following formulas:

$z' = \mathrm{Relu}(\mathrm{fc}_8(z))$
$V_{x'} = \mathrm{sigmoid}(\mathrm{fc}_9(z'))$

where $\mathrm{fc}_8$ and $\mathrm{fc}_9$ are fully connected layers and sigmoid is a nonlinear activation function.
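For illustration only, a minimal PyTorch sketch of the fusion, reparameterization, and decoder steps above; σ is treated as a log-variance so that exp(0.5σ) is a standard deviation, and all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class FusionDecoder(nn.Module):
    def __init__(self, d_latent: int = 16, d_hidden: int = 64, d_out: int = 32):
        super().__init__()
        # Learnable weight matrices W1..W4 fusing posterior and prior parameters.
        self.w1 = nn.Parameter(torch.randn(d_latent, d_latent) * 0.01)
        self.w2 = nn.Parameter(torch.randn(d_latent, d_latent) * 0.01)
        self.w3 = nn.Parameter(torch.randn(d_latent, d_latent) * 0.01)
        self.w4 = nn.Parameter(torch.randn(d_latent, d_latent) * 0.01)
        self.fc8 = nn.Linear(d_latent, d_hidden)
        self.fc9 = nn.Linear(d_hidden, d_out)

    def forward(self, mu_phi, sigma_phi, mu_psi, sigma_psi):
        mu = mu_phi @ self.w1 + mu_psi @ self.w2          # mu = mu_phi W1 + mu_psi W2
        sigma = sigma_phi @ self.w3 + sigma_psi @ self.w4  # sigma = sigma_phi W3 + sigma_psi W4
        v = torch.randn_like(mu)                           # v ~ N(0, I)
        z = mu + torch.exp(0.5 * sigma) * v                # reparameterization trick
        z_prime = torch.relu(self.fc8(z))                  # z' = Relu(fc8(z))
        return torch.sigmoid(self.fc9(z_prime))           # V_x' = sigmoid(fc9(z'))
```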
During training of the predicate-aware generator, its parameters are optimized by minimizing the KL divergence and the reconstruction error, so that the generated samples can approximate the original data $T_s$. The loss function therefore consists of two parts:
(1) the KL divergence between the posterior distribution and the conditional prior distribution of the latent space, i.e., the quality of the conditional latent space;
(2) the reconstruction error of the text field values, i.e., the quality of the actual sample generation. The training objective is expressed by the following formula:

$\mathcal{L} = \mathrm{KL}\big(q_\delta(z_1 \mid V_x \| M) \,\big\|\, p_\psi(z_2 \mid V_{\bar{x}} \| \bar{M})\big) - \mathbb{E}\big[\log p_\psi(V_{x'} \mid z)\big]$

where KL is the KL divergence; $q_\delta(z_1 \mid V_x \| M)$ is the posterior distribution calculated by the encoder module; $\|$ is the concatenation operation; $p_\psi(z_2 \mid V_{\bar{x}} \| \bar{M})$ is the prior distribution calculated by the prior distribution calculation module; $-\mathbb{E}[\log p_\psi(V_{x'} \mid z)]$ is the reconstruction error; log(·) is the logarithmic function; and $p_\psi(V_{x'} \mid z)$ is the distribution of the generated sample $V_{x'}$ computed from the latent vector z.
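For illustration only, a sketch of the two-part objective using the closed-form KL divergence between two diagonal Gaussians and binary cross-entropy as the reconstruction term; BCE is an assumed concrete choice suggested by the sigmoid decoder output, not a form fixed by the application:

```python
import torch
import torch.nn.functional as F

def generator_loss(v_target, v_generated, mu_phi, logvar_phi, mu_psi, logvar_psi):
    """KL(q_delta || p_psi) plus reconstruction error, with sigma as log-variance."""
    # Closed-form KL divergence between two diagonal Gaussians.
    kl = 0.5 * torch.sum(
        logvar_psi - logvar_phi
        + (logvar_phi.exp() + (mu_phi - mu_psi) ** 2) / logvar_psi.exp()
        - 1.0
    )
    # Reconstruction term -E[log p_psi(V_x' | z)] as binary cross-entropy.
    recon = F.binary_cross_entropy(v_generated, v_target, reduction="sum")
    return kl + recon
```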
The predicate-aware generator is trained iteratively; when the maximum number of iterations is reached, training stops and the weight parameters of the predicate-aware generator are recorded, completing the training process. During inference, the weight parameters recorded in the training stage are loaded, the query conditions and query content are input to the encoder module and the prior distribution calculation module, the input is mapped to latent vectors in the latent space, and sample data are obtained by sampling from the latent space.
Table 2. Prior distribution calculation module input information

Month  Value  Browser  OS       Age
Jan    ——     Chrome   Android  Adult
——     ——     ——       ——       ——
Jan    ——     ——       ——       ——
——     ——     Chrome   IOS      ——
Feb    ——     ——       ——       Adult
——     ——     Safari   IOS      Senior
S2, construct the selectivity estimator and calculate the selectivity factors: the selectivity estimator evaluates the effect of the query conditions on part of the data set in the database. Specifically, given a query condition, it obtains the data satisfying the query condition, generates the query result, and calculates the proportion of the query result in the entire data set. The main task of the selectivity estimator is to evaluate the selectivity of the query result from the query statement and to generate selectivity factors related to the query conditions, i.e., weights assigned to different query conditions; these weights are the selectivity factors. The query result is then optimized with the selectivity factors, improving query processing efficiency.
The input features of the selectivity estimator mainly comprise the query conditions, such as predicates, aggregate functions, and other relevant information describing the user's query intent and requirements. The labels are the selectivity factors of the query results, which represent the proportion, i.e., the selectivity, of the query result in the entire data set under a given query statement; using the selectivity factors to adjust the query result allows it to be processed more intelligently, improving the accuracy of the query result.
The input of the selectivity estimator is a feature vector $Y = [y_1, y_2, \ldots, y_i, \ldots, y_n]$ built from the query conditions parsed from a query, such as predicates, grouping clauses, and aggregate functions, where $y_i$ is the i-th feature (i.e., the i-th query condition). It outputs a predicted set of selectivity factors s (containing the selectivity factors corresponding to $y_1, y_2, \ldots, y_i, \ldots, y_n$), representing the proportion of data satisfying the query statement in the entire data set, as expressed by the following formulas:

$h^{(6)} = \mathrm{Relu}(\mathrm{fc}_{10}(Y))$
$s = \mathrm{fc}_{11}(\mathrm{dropout}(h^{(6)}))$

where $\mathrm{fc}_{10}$ is a fully connected neural network that learns Y, $\mathrm{fc}_{11}$ is a fully connected neural network for calculating $s_i$, and dropout is a neural network layer used to prevent the selectivity estimator from overfitting during training. The selectivity estimator predicts the selectivity of the query result from the input feature vector, i.e., the proportion of data satisfying the query statement in the entire data set under the given query statement.
The selectivity estimator is trained with the query conditions (such as predicates and aggregate functions) parsed from the known offline query data set (composed of partial data in the database and the query statements related to those data). For the i-th query condition there is a true selectivity value $s'_i$ in the data set, and the training loss measures the error between each predicted selectivity factor $s_i \in s$ and its label $s'_i$. Through iterative training, the network weights and bias parameters of the selectivity estimator are obtained, so that it can predict the selectivity of unseen query conditions. The inference stage of the selectivity estimator serves online approximate query processing: when a user submits a query statement, the selectivity estimator receives the feature vector of the query statement as input, outputs the selectivity factors of the query statement, and these factors are used to optimize query processing.
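For illustration only, a minimal PyTorch sketch of the selectivity estimator and one training step; the mean-squared-error loss between $s_i$ and $s'_i$ is an assumed concrete form of the loss described above, and all dimensions are placeholders:

```python
import torch
import torch.nn as nn

class SelectivityEstimator(nn.Module):
    """Maps a query feature vector Y to a set of selectivity factors s."""
    def __init__(self, n_features: int, d_hidden: int = 64, p_drop: float = 0.2):
        super().__init__()
        self.fc10 = nn.Linear(n_features, d_hidden)  # learns Y
        self.drop = nn.Dropout(p_drop)               # guards against overfitting
        self.fc11 = nn.Linear(d_hidden, n_features)  # one factor per query condition

    def forward(self, y: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.fc10(y))
        return self.fc11(self.drop(h))  # predicted selectivity factors s

# One training step on an offline (Y, s') batch; MSE is an assumption.
model = SelectivityEstimator(n_features=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
y_batch = torch.rand(32, 8)   # parsed query-condition features (placeholder data)
s_true = torch.rand(32, 8)    # true selectivities from the offline data set
loss = nn.functional.mse_loss(model(y_batch), s_true)
opt.zero_grad(); loss.backward(); opt.step()
```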
S3, for a currently input query statement Q, handle the different aggregate functions in Q over the sample data set; the aggregate function in Q averages, sums, or counts the data. As shown in FIG. 3, the workflow of the approximate query processing system is as follows: first, the system parses the query and extracts the predicates, grouping clauses, aggregate functions, and other information in it; then the predicate-aware generator generates the representative samples;
a) if the aggregate function is averaging, the approximate query processing system applies the query logic directly to the representative samples generated by the predicate-aware generator, i.e., the average of the representative samples stands in for the average of the whole data; this average is computed directly and returned to the user as the result;
b) if the aggregate function is summation, the query statement Q is input into the trained selectivity estimator, which generates the selectivity factor of each query condition in Q; the selectivity factors are assigned to Q as weights, improving the quality of the query result obtained by executing Q. The approximate query processing system calibrates the result with the appropriate selectivity factor: a larger selectivity factor indicates a more general predicate condition, i.e., more data satisfies the predicate, so the sample data is given more weight and matters more when computing the aggregate query result;
c) if the aggregate function is a count, the approximate query processing system calibrates the query result directly with the selectivity estimator. If the goal of the online query is the counting aggregate function, the selectivity estimator generates a selectivity factor $s_c$ representing the proportion of sample data satisfying the counting aggregate function in the sample data set. The query result is obtained by calibrating the sample data with the selectivity factor, as expressed by the following formula:
$D_c = D_t s_c$

where $D_c$ is the approximate result of the counting aggregate function, $D_t$ is the number of sample data in the sample data set, and $s_c$ is the selectivity factor of the counting aggregate function. For example, suppose the counting query condition is "Value > 10", where Value is a field name and 10 is a field value; the selectivity estimator outputs a selectivity factor of 0.1 for this query condition; the sample data set is a small part of the whole data set and contains 10 items, so the approximate result of the counting aggregate function is $10 \times 0.1 = 1$, meaning that about 1 item in the sample data set satisfies the query condition "Value > 10". By quickly obtaining this rough query result, a detailed scan of the whole sample data set is avoided, saving the user's interaction time.
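For illustration only, the three aggregation paths of step S3 summarized in a short dispatch sketch; the SUM calibration shown (scaling the sample aggregate by the selectivity factor) is one plausible reading of the weighting described above, and the COUNT path reproduces the worked example:

```python
def answer_query(agg: str, representative_samples: list[float],
                 selectivity_factor: float) -> float:
    """Approximate AVG / SUM / COUNT over a representative sample set."""
    d_t = len(representative_samples)
    if agg == "AVG":
        # The mean of the representative samples stands in for the whole data set.
        return sum(representative_samples) / d_t
    if agg == "SUM":
        # Weight the sample aggregate by the predicted selectivity factor
        # (assumed calibration; the text fixes only that the factor is a weight).
        return sum(representative_samples) * selectivity_factor
    if agg == "COUNT":
        # D_c = D_t * s_c, as in the worked example: 10 * 0.1 = 1.
        return d_t * selectivity_factor
    raise ValueError(f"unsupported aggregate: {agg}")

print(answer_query("COUNT", [0.0] * 10, 0.1))  # -> 1.0
```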
After the query computation is completed, the approximate query processing system returns the result to the user, completing the query interaction process.
Although specific embodiments of the application have been disclosed for illustrative purposes, it will be appreciated by those skilled in the art that various alternatives, variations, and modifications are possible without departing from the spirit and scope of the application and the appended claims. Therefore, the application is not limited to the particular embodiments disclosed, and the scope of the application is defined by the appended claims.

Claims (9)

1. An approximate query processing optimization method, comprising the steps of:
1) constructing a predicate-aware generator and an offline query data set, training the predicate-aware generator with the offline query data set, and generating a sample data set; the predicate-aware generator comprising: an encoder module for encoding the real samples in the offline query data set and fitting the posterior distribution corresponding to each real sample; a prior distribution calculation module for applying a mask to the real samples to obtain mask samples, encoding the mask samples and fitting the prior distribution of each mask sample, the latent distribution then being fitted from the posterior distribution and the prior distribution; and a decoder module for generating representative samples from latent vectors z sampled from the latent space; the representative samples making up the sample data set;
2) constructing a selectivity estimator and training it with the offline query data set to obtain the selectivity factor of each query condition in the query statement; the selectivity estimator building a feature vector $Y = [y_1, y_2, \ldots, y_i, \ldots, y_n]$ from the n query conditions parsed from the input query statement, where $y_i$ is the i-th feature, corresponding to the i-th query condition; then setting the selectivity factor set s corresponding to the features in the feature vector Y, querying the database, and calculating the proportion of data satisfying the query statement in the entire data set; a loss function measuring the error between each predicted selectivity factor $s_i$ and its label $s'_i$ then being used to optimize the selectivity estimator and determine the optimal selectivity factor set s corresponding to the query statement, where $s_i$ is the predicted selectivity factor corresponding to $y_i$ and $s'_i$ is the label of $y_i$;
3) for a currently input query statement Q: if the aggregation function in Q is averaging, calculating the average of the representative samples in the sample data set according to Q; if the aggregation function in Q is summation, inputting Q into the trained selectivity estimator to obtain the set of selectivity factors corresponding to Q, and then querying the sample data set according to Q weighted with that set of selectivity factors; if the aggregation function in Q is a count, estimating the number of representative samples in the sample data set satisfying Q with the trained selectivity estimator as $D_c = D_t s_c$, where $D_t$ is the number of all data in the sample data set and $s_c$ is the selectivity factor corresponding to the c-th query condition in Q.
2. The method of claim 1, wherein the parameters of the posterior distribution comprise the mean $\mu_\phi$ of the posterior distribution and the variance $\sigma_\phi$ of the posterior distribution.
3. The method according to claim 2, wherein the method for obtaining the mask samples by applying the mask to the real samples is: the sub-table $T'_s$ and the mask sequence $\bar{M}$ are input to the prior distribution calculation module, the field values in $T'_s$ being $V_{\bar{x}}$; the prior distribution calculation module applies complete masking to the numeric field values in $T'_s$ and hierarchical masking to the text field values to obtain the mask samples; the hierarchical mask adjusts a random mask according to the occurrence frequency of the samples of each text field name, samples with low occurrence frequency having a lower mask probability and samples with higher occurrence frequency having a higher mask probability; the prior distribution is then calculated from the field values $V_{\bar{x}}$ and the sequence $\bar{M}$, the parameters of the prior distribution comprising the mean $\mu_\psi$ and the variance $\sigma_\psi$ of the prior distribution.
4. The method according to claim 3, wherein $\mu_\psi = \mathrm{fc}_6(h^{(5)})$ and $\sigma_\psi = \mathrm{fc}_7(h^{(5)})$, where $h^{(5)} = h^{(3)} \| h^{(4)}$, $h^{(3)} = \mathrm{Relu}(\mathrm{fc}_4(V_{\bar{x}}))$, $h^{(4)} = \mathrm{Relu}(\mathrm{fc}_5(\bar{M}))$; $\mathrm{fc}_4$ is a fully connected neural network that learns $V_{\bar{x}}$, $\mathrm{fc}_5$ is a fully connected neural network that learns $\bar{M}$, and Relu is a nonlinear activation function.
5. The method of claim 3 or 4, wherein the method of generating the representative samples is: first, the posterior distribution and the prior distribution are fused to obtain the latent distribution, whose mean is $\mu = \mu_\phi W_1 + \mu_\psi W_2$ and whose variance is $\sigma = \sigma_\phi W_3 + \sigma_\psi W_4$, where $W_1, W_2, W_3, W_4$ are learnable weight matrices; then a random vector v is sampled from the standard Gaussian distribution and the latent vector $z = \mu + \exp(0.5\sigma)\, v$ is calculated, exp being an exponential function; the latent variable z is then input to the decoder module to generate a representative sample $V_{x'} = \mathrm{sigmoid}(\mathrm{fc}_9(z'))$, $z' = \mathrm{Relu}(\mathrm{fc}_8(z))$, where $\mathrm{fc}_8$ and $\mathrm{fc}_9$ are fully connected layers and sigmoid is a nonlinear activation function.
6. The method of claim 5, wherein, in training the predicate-aware generator, the parameters of the predicate-aware generator are optimized by the loss function

$\mathcal{L} = \mathrm{KL}\big(q_\delta(z_1 \mid V_x \| M) \,\big\|\, p_\psi(z_2 \mid V_{\bar{x}} \| \bar{M})\big) - \mathbb{E}\big[\log p_\psi(V_{x'} \mid z)\big]$

so that the generated representative samples approximate the distribution of the sub-table $T_s$; where KL is the KL divergence function, $-\mathbb{E}[\log p_\psi(V_{x'} \mid z)]$ is the reconstruction error term, log is the logarithmic function, $q_\delta(z_1 \mid V_x \| M)$ is the posterior distribution calculated by the encoder module, $z_1$ is a random variable of the posterior distribution, $\|$ is a concatenation operation, $p_\psi(z_2 \mid V_{\bar{x}} \| \bar{M})$ is the prior distribution calculated by the prior distribution calculation module, $z_2$ is a random variable of the prior distribution, and $p_\psi(V_{x'} \mid z)$ is the distribution of the generated sample $V_{x'}$ computed from the latent vector z.
7. The method of claim 2, wherein the sub-table $T_s$ and the mask-free sequence $M$ are input to the encoder module, which fits the posterior distribution of the data in $T_s$ from the field values $V_x$ of $T_s$ and their corresponding mask sequence $M$; the mean of the posterior distribution is $\mu_\phi = \mathrm{fc}_2(h^{(2)})$ and the variance of the posterior distribution is $\sigma_\phi = \mathrm{fc}_3(h^{(2)})$, where $h^{(2)} = h^{(0)} \| h^{(1)}$, $h^{(0)} = \mathrm{Relu}(\mathrm{fc}_0(V_x))$, $h^{(1)} = \mathrm{Relu}(\mathrm{fc}_1(M))$; $\mathrm{fc}_0$ is a fully connected neural network that learns $V_x$, $\mathrm{fc}_1$ is a fully connected neural network that learns $M$, $\mathrm{fc}_2$ is a fully connected neural network for calculating $\mu_\phi$, $\mathrm{fc}_3$ is a fully connected neural network for calculating $\sigma_\phi$, and Relu is a nonlinear activation function.
8. A server comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the steps of the method of any of claims 1 to 7.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.
CN202311080324.6A, filed 2023-08-25 (priority 2023-08-25): Approximate query processing optimization method. Status: Active. Granted as CN117216096B.

Priority Applications (1)

CN202311080324.6A (priority date 2023-08-25, filing date 2023-08-25): Approximate query processing optimization method


Publications (2)

CN117216096A (publication date 2023-12-12)
CN117216096B (publication date 2024-04-02)

Family

ID: 89047171


Country Status (1)

CN: CN117216096B


Citations (3)

* Cited by examiner, † Cited by third party

US10942923B1 * (priority 2018-12-14, published 2021-03-09), Teradata US, Inc.: Deep learning for optimizer cardinality estimation
CN114722088A * (priority 2022-05-03, published 2022-07-08), 南京大学: Online approximate query method based on machine learning model sample generation
CN114911844A * (priority 2022-05-11, published 2022-08-16), 复旦大学: Approximate query optimization system based on machine learning


Cited By (2)

* Cited by examiner, † Cited by third party

CN117435647A * (priority 2023-12-20, published 2024-01-23), 北京遥感设备研究所: Approximate query method, device and equipment based on incremental sampling
CN117435647B * (priority 2023-12-20, published 2024-03-29), 北京遥感设备研究所: Approximate query method, device and equipment based on incremental sampling

Also Published As

CN117216096B (publication date 2024-04-02)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant