CN114969023A

CN114969023A - Database learning type index construction method and system

Info

Publication number: CN114969023A
Application number: CN202210150431.0A
Authority: CN
Inventors: 杨仝; 陈春辉; 屠要峰; 杨洪章
Original assignee: Peking University; ZTE Corp
Current assignee: Peking University; ZTE Corp
Priority date: 2022-02-18
Filing date: 2022-02-18
Publication date: 2022-08-30

Abstract

The invention relates to a database learning type index construction method and system. The method comprises the following steps: constructing a cumulative distribution function according to the data keys and the data storage positions; fitting the cumulative distribution function with a machine learning model to obtain the relationship between the data keys and the data storage positions, thereby obtaining a learning type index; and quickly locating the position of the key to be queried according to the learning type index. The invention overcomes the problems of high tuning difficulty, poor adaptive capacity and large space occupation of the auxiliary data structure in the traditional B-tree database index algorithm, effectively reduces the space occupied by the auxiliary data structure, and improves the adaptive capacity of the database index.

Description

Database learning type index construction method and system

Technical Field

The invention belongs to the field of key value storage-based memory type databases, and particularly relates to a method and a system for constructing a database learning type index.

Background

With the continuous development of the ecology of the internet, the data required to be stored by large internet companies reaches the PB level, and the new business data generated by the large internet companies every day reaches the TB level. In order to meet such a large-scale requirement for incremental data storage and update, distributed key-value storage databases are becoming the first choice for large-scale data storage. Key-value storage databases typically store different key-value pairs in order of the size of the key values in memory, hard disks, and other storage devices. Therefore, how to quickly locate a specific storage location for a given key value and efficiently support insert, delete, etc. functions becomes a key issue in determining the performance of current key value storage databases.

Well-known key-value store databases, such as RocksDB, Redis, and PostgreSQL, basically use auxiliary data structures as indices to speed up the insertion, deletion, update and lookup operations of the database. It is common practice to group key-value pair data into blocks, to locate a particular key-value pair within a block by a binary search algorithm, and to build an index over the blocks by constructing a B-tree. This approach requires a lot of memory to maintain the index, and in the face of different data loads, tuning the B-tree and data block parameters to their optimum often requires a lot of manual effort, so it is difficult to provide a high-quality adaptive service.

Disclosure of Invention

In order to solve the problems of high adjusting difficulty, poor adaptive capacity and large occupation of memory space of an auxiliary data structure of the conventional B-tree database index algorithm, the invention provides a method using a learning index, which can effectively reduce the occupation of the memory of the auxiliary data structure and improve the adaptive adjusting capability of the database index.

The purpose of the invention is realized by the following technical scheme:

a database learning type index construction method comprises the following steps:

constructing a cumulative distribution function according to the data key words and the data storage positions;

fitting the cumulative distribution function by using a machine learning model to obtain the relevance between the data keywords and the data storage position to obtain a learning type index;

and quickly positioning the position of the key value to be inquired according to the learning index.

Further, on the premise that the data keys are stored in order, there is a monotonically increasing relationship between a data key (Key) and its storage location (Position), so that a Cumulative Distribution Function (CDF) can be constructed. The cumulative distribution function F(Key) is modeled as pos = F(Key) × N, where Key represents a data key, pos represents a data storage location, and N represents the total data scale. The cumulative distribution function itself contains two key pieces of information, namely, the data key and the data storage location. The basic idea of the learning type index is to use a machine learning model to fit the cumulative distribution function and thereby obtain the relationship between a storage position and a data key value, so that the data position can be located quickly.
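The relationship pos = F(Key) × N can be illustrated with a minimal Python sketch. It assumes that the exact empirical CDF of a small sorted array stands in for F, and that F counts the keys strictly smaller than the query so that pos is a zero-based array index; the function names are hypothetical and are not part of the invention.

```python
from bisect import bisect_left


def empirical_cdf(sorted_keys, key):
    """Fraction of stored keys that are strictly smaller than `key`."""
    return bisect_left(sorted_keys, key) / len(sorted_keys)


def position_from_cdf(sorted_keys, key):
    """pos = F(key) * N, with the exact empirical CDF used as F."""
    n = len(sorted_keys)
    return round(empirical_cdf(sorted_keys, key) * n)


keys = [3, 8, 15, 21, 42, 57]
pos = position_from_cdf(keys, 21)
print(pos, keys[pos])
```

A learning type index replaces the exact function used here with a compact fitted model, so the computed position becomes an estimate that a short local search then corrects.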

Further, the process of fitting the cumulative distribution function is to select, by machine learning, the parameters of a suitable function F so that the loss function is minimized. In practice, the function F may be a linear function, a feedforward neural network (FNN), or the like. In the present invention, the basic goal of a learning type index is to fit the cumulative distribution function with as small an average error as possible.

Further, so that the cumulative distribution function can be fitted with a relatively simple function, the invention groups the key-value data. The number of data points per group, segmentSize, is set in advance. Each time segmentSize data points have been read, they are formed into one group, a machine learning model is fitted to the group, the model parameters are recorded, and the maximum and minimum values of the data keys in the group are taken as demarcation points that define the group. This greatly increases the inference speed and the construction speed of the model: the inference time is reduced as far as possible without seriously affecting the prediction accuracy, which accelerates the query process, and because a simple model is quick to construct, the model reconstruction time is greatly reduced, which lays a good foundation for dynamic updating.
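The grouping step can be sketched as follows. This is a minimal illustration, assuming a least-squares linear model per group and integer keys held in a Python list; the names fit_line and build_segments and the dictionary layout are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y ~ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:  # a single point (or identical keys): constant model
        return 0.0, mean_y
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def build_segments(sorted_keys, segment_size):
    """Split the keys into groups of segment_size and fit one line per group."""
    segments = []
    for start in range(0, len(sorted_keys), segment_size):
        group = sorted_keys[start:start + segment_size]
        positions = range(start, start + len(group))
        slope, intercept = fit_line(group, positions)
        segments.append({
            "min_key": group[0],    # left demarcation point
            "max_key": group[-1],   # right demarcation point
            "slope": slope,
            "intercept": intercept,
        })
    return segments


segments = build_segments(list(range(0, 100, 2)), segment_size=16)
print(len(segments), segments[0])
```

The last group simply holds whatever keys remain, so it may be shorter than segment_size.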

Further, in order to locate the key to be queried in its target group accurately and quickly, an auxiliary data structure called the root lookup table is used. The root lookup table is based on a radix tree (Radix Tree) and maps data keys whose first k binary digits are identical to the same node, thereby realizing rapid group positioning of the data keys. Here k is the number of radix bits, i.e., the depth of the radix tree; for example, when k is 3, keys are mapped according to their first three bits to 2^3 = 8 leaf nodes. During model building, the demarcation points between adjacent groups are used as the nodes (knots) of the root lookup table. These demarcation points are stored in an array, called the node array (knots array), and each position of the node array stores a pointer to the corresponding group array (namely, the array whose right interval endpoint is that demarcation point). When a data key Key to be queried is input, the model first uses the radix tree to locate two nodes in the node array; the property of the radix tree guarantees that Key lies between these two nodes. Binary search between them then finds the index of the smallest node that is not less than Key, which is the index of the corresponding group, and this completes the group positioning process of the root lookup table.
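One possible realization of such a root lookup table is sketched below. It assumes fixed-width unsigned integer keys and flattens the radix tree into a table with one entry per k-bit prefix; the class name, its fields and the sample knots are hypothetical, and a real implementation may organize the tree differently.

```python
from bisect import bisect_left


class RootLookupTable:
    """Flat radix table over the knots array (the max key of each group)."""

    def __init__(self, knots, radix_bits, key_bits):
        self.knots = knots
        self.shift = key_bits - radix_bits
        prefixes = [k >> self.shift for k in knots]
        # table[p] = index of the first knot whose prefix is >= p
        self.table = [bisect_left(prefixes, p)
                      for p in range((1 << radix_bits) + 1)]

    def find_group(self, key):
        """Index of the smallest knot that is not less than key."""
        p = key >> self.shift
        lo, hi = self.table[p], self.table[p + 1]
        # The radix step narrows the range; binary search finishes the job.
        return bisect_left(self.knots, key, lo, hi)


knots = [31, 63, 95, 120, 152, 178, 191, 255]
table = RootLookupTable(knots, radix_bits=3, key_bits=8)
group = table.find_group(162)
print(group, knots[group])
```

If find_group returns len(knots), the key is larger than every stored key and belongs to no group.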

Furthermore, because a certain error occasionally exists between the predicted position and the real position of the learning-type index, after the predicted position in the target group is located, the method adopts an exponential search mode to carry out the last step of search.
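The exponential search used for this last step can be sketched as follows. The function name and the convention of returning the first index whose key is not less than the query are assumptions made for illustration.

```python
from bisect import bisect_left


def exponential_search(keys, key, pred):
    """Return the index of the first element >= key, starting at pred."""
    n = len(keys)
    if n == 0:
        return 0
    pred = min(max(pred, 0), n - 1)  # clamp an out-of-range prediction
    step = 1
    if keys[pred] < key:
        # Answer lies to the right: probe pred+1, pred+2, pred+4, ...
        lo, hi = pred + 1, pred + step
        while hi < n and keys[hi] < key:
            lo = hi + 1
            step *= 2
            hi = pred + step
        hi = min(hi, n)
    else:
        # Answer lies at or to the left: probe pred-1, pred-2, pred-4, ...
        hi, lo = pred, pred - step
        while lo >= 0 and keys[lo] >= key:
            hi = lo
            step *= 2
            lo = pred - step
        lo = max(lo + 1, 0)
    # Binary search inside the bracket found by the doubling steps.
    return bisect_left(keys, key, lo, hi)


keys = list(range(0, 200, 2))
print(exponential_search(keys, 64, pred=20))
```

The number of probes grows with the logarithm of the prediction error rather than with the size of the array, which suits a model whose predictions are usually close to the true position.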

Further, the data structure adopted by the present invention is mainly described above, and the following describes the complete process of index construction by taking grouping linear fitting as an example:

1) The index structure is constructed from bottom to top. First, the data keys are stored in the array in order. The data keys and their positions in the array are then read in order, and the training process is executed once each time segmentSize data keys have been read. The fitting may be performed with a linear function or a polynomial function of low degree; here a least-squares fit of a linear function is taken as the example, and the fitting coefficients, i.e., slope and intercept, are saved after the fit of the group is finished. The maximum value of the group of data keys is put into the node array (knots array), and the pointer to the corresponding group is saved. The root lookup table is then updated based on the first RadixBit bits of the binary representation of that maximum value. Data continue to be read in this way until the number of remaining data points is less than segmentSize; the remaining points form a separate group, on which the last index model is trained.

2) After the model construction is completed, each query operation first uses the key to be queried to find the corresponding group through the root lookup table. The stored model coefficients slope and intercept of that group are then obtained, the predicted position pred is computed by the formula pred = key × slope + intercept, and finally the exact position pos of the data in the array is found starting from pred by exponential search, which completes one query.
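The two steps above can be combined into a small end-to-end sketch. For brevity it assumes that a plain binary search over the knots array stands in for the root lookup table; the class GroupLinearIndex and the helper names are hypothetical.

```python
from bisect import bisect_left


def fit_line(xs, ys):
    """Least-squares fit of y ~ slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return 0.0, mean_y
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def exponential_search(keys, key, pred):
    """First index whose key is >= key, searching outward from pred."""
    n = len(keys)
    pred = min(max(pred, 0), n - 1)
    step = 1
    if keys[pred] < key:
        lo, hi = pred + 1, pred + step
        while hi < n and keys[hi] < key:
            lo, step = hi + 1, step * 2
            hi = pred + step
        hi = min(hi, n)
    else:
        hi, lo = pred, pred - step
        while lo >= 0 and keys[lo] >= key:
            hi, step = lo, step * 2
            lo = pred - step
        lo = max(lo + 1, 0)
    return bisect_left(keys, key, lo, hi)


class GroupLinearIndex:
    def __init__(self, sorted_keys, segment_size):
        self.keys = sorted_keys
        self.knots = []   # maximum key of each group
        self.models = []  # (slope, intercept) of each group
        for start in range(0, len(sorted_keys), segment_size):
            group = sorted_keys[start:start + segment_size]
            positions = range(start, start + len(group))
            self.models.append(fit_line(group, positions))
            self.knots.append(group[-1])

    def lookup(self, key):
        """Return the array position of key, or None if it is absent."""
        g = bisect_left(self.knots, key)  # stand-in for the root lookup table
        if g == len(self.knots):
            return None
        slope, intercept = self.models[g]
        pred = round(key * slope + intercept)
        pos = exponential_search(self.keys, key, pred)
        if pos < len(self.keys) and self.keys[pos] == key:
            return pos
        return None


index = GroupLinearIndex([i * i for i in range(1000)], segment_size=64)
print(index.lookup(49 * 49), index.lookup(2))
```

Because the final search corrects the prediction, the lookup stays exact even when the linear model fits a group poorly; a poor fit only costs extra search steps.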

Further, in order to obtain a better model fitting effect and save space as much as possible, the invention provides a dynamic adaptive array length segmentation method, which can dynamically segment the array. The dynamic self-adaptive array length segmentation method comprises the following steps:

1) firstly, on the basis of completing the model construction, a plurality of arrays under the same father node at the bottommost layer are combined tentatively, and one index model is used for replacing a plurality of original index models, so that the space consumption is saved.

2) Then, the merging effect is evaluated using the average prediction error of the merged index model as the metric: the merge is accepted if the error is lower than a set threshold, and is otherwise cancelled.

3) This merge operation is repeated from bottom to top until there are no arrays that satisfy the merge condition.
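The three steps above can be sketched as follows. The sketch assumes that "arrays under the same parent node" can be approximated by adjacent pairs of segments, that each segment is a half-open range (start, end) over the sorted key array, and that the error metric is the mean absolute error of a least-squares line; all names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y ~ slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return 0.0, mean_y
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def mean_abs_error(keys, start, end):
    """Average prediction error of one linear model over keys[start:end]."""
    xs, ys = keys[start:end], range(start, end)
    slope, intercept = fit_line(xs, ys)
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


def merge_segments(keys, bounds, threshold):
    """Repeatedly merge adjacent segment pairs while the error stays low."""
    merged_any = True
    while merged_any and len(bounds) > 1:
        merged_any = False
        next_level, i = [], 0
        while i < len(bounds):
            if i + 1 < len(bounds):
                start, end = bounds[i][0], bounds[i + 1][1]
                # Tentative merge: accept only if one model is good enough.
                if mean_abs_error(keys, start, end) < threshold:
                    next_level.append((start, end))
                    merged_any = True
                    i += 2
                    continue
            next_level.append(bounds[i])  # merge rejected or no sibling
            i += 1
        bounds = next_level
    return bounds


keys = list(range(64)) + [1000 + 100 * i for i in range(64)]
bounds = [(s, s + 16) for s in range(0, 128, 16)]
print(merge_segments(keys, bounds, threshold=0.5))
```

In this example the two halves of the key array follow different straight lines, so merging stops once each half is covered by a single model.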

Furthermore, in order to support update operations such as insertion and deletion, a chaining external array is introduced: inserted elements are placed in the external array attached to the position where they are to be inserted, and the external arrays are periodically integrated with the original array. For each insertion, a query operation is first performed to find the position corresponding to the key to be inserted (i.e., the position of the largest key in the array that is not greater than the current key, referred to as lower_bound). A fixed-size array is attached at this position and the key-value pair to be inserted is placed into it; this array is independent of the model. The query operation then needs one additional step: the original model first locates the lower_bound; if the lower_bound is the key to be queried, the query is complete, and if it is not, the query continues in the corresponding external array. When an external array is full, the external arrays and the original array are integrated and the model is retrained. For a deletion, the query is still executed first to find the position of the data, and the position is marked as empty; when the model is reconstructed, all positions marked as empty are cleaned up.
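A minimal sketch of this update scheme follows. It assumes that a binary search stands in for the model lookup of lower_bound, that keys smaller than every stored key go to an external array at position -1, and that deleting a key that still sits in an external array simply removes it; these choices, and the class BufferedIndex itself, are hypothetical.

```python
from bisect import bisect_right, insort


class BufferedIndex:
    def __init__(self, sorted_keys, bucket_capacity=4):
        self.keys = list(sorted_keys)
        self.capacity = bucket_capacity
        self.buckets = {}     # lower_bound position -> external array
        self.deleted = set()  # positions marked as empty
        self.rebuilds = 0

    def _lower_bound(self, key):
        """Position of the largest stored key <= key (-1 if none)."""
        return bisect_right(self.keys, key) - 1  # stand-in for the model

    def contains(self, key):
        pos = self._lower_bound(key)
        if pos >= 0 and self.keys[pos] == key:
            return pos not in self.deleted
        return key in self.buckets.get(pos, [])  # extra step: external array

    def insert(self, key):
        if self.contains(key):
            return
        pos = self._lower_bound(key)
        if pos >= 0 and self.keys[pos] == key:  # key was marked empty earlier
            self.deleted.discard(pos)
            return
        bucket = self.buckets.setdefault(pos, [])
        insort(bucket, key)
        if len(bucket) >= self.capacity:  # external array is full
            self._rebuild()

    def delete(self, key):
        pos = self._lower_bound(key)
        if pos >= 0 and self.keys[pos] == key:
            self.deleted.add(pos)  # mark the position as empty
        elif key in self.buckets.get(pos, []):
            self.buckets[pos].remove(key)

    def _rebuild(self):
        """Integrate external arrays, drop empty slots, then retrain."""
        live = [k for i, k in enumerate(self.keys) if i not in self.deleted]
        for bucket in self.buckets.values():
            live.extend(bucket)
        self.keys = sorted(live)
        self.buckets.clear()
        self.deleted.clear()
        self.rebuilds += 1  # a real index would refit its models here


index = BufferedIndex([10, 42, 69, 88, 95], bucket_capacity=3)
index.insert(79)
print(index.buckets, index.contains(79))
```

Keeping the external arrays small bounds the extra work per query, at the price of more frequent rebuilds.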

Based on the same inventive concept, the invention also provides a database learning type index construction system adopting the method, which comprises the following modules:

the cumulative distribution function building module is used for building a cumulative distribution function according to the data keywords and the data storage position;

the cumulative distribution function fitting module is used for fitting the cumulative distribution function by utilizing a machine learning model so as to obtain the relevance between the data keywords and the data storage position and obtain a learning type index;

and the query module is used for quickly positioning the position of the key value to be queried according to the learning index.

The invention has the following beneficial effects: the invention provides a database learning type index construction method, referred to as Group Spline, which obtains a tree of grouped splines (Spline) through dynamic adaptive segmentation of the array length and thereby constructs a learning type database index. The index can quickly locate the position of the key to be queried, guarantees that a query result is returned after accessing a limited number of levels, and improves the query speed compared with the traditional B-tree index scheme. In addition, it adjusts itself dynamically and flexibly in the face of different data loads and of insertion, deletion, update and query requirements, which greatly reduces the time and effort consumed by manual tuning.

Drawings

FIG. 1 is an example of a root lookup table and one accelerated lookup based thereon. The models 1-9 are abstracts of the function F.

Fig. 2 is an example of a final step search using an exponential search.

FIG. 3 is a partial example of an array length dynamic adaptive slicing training process. Wherein segment represents the segmented data segment, and trailing represents the training index model.

FIG. 4 is an example of performing a new key-value pair data insertion based on a chaining circumscribed array.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following examples in the accompanying drawings. It should be understood that the specific examples described herein are intended to be illustrative only and are not intended to be limiting.

FIG. 1 shows a root lookup table and a query process based on it. First, the root lookup table is constructed with a span of 32; each span contains one function model (such as a linear model), and the minimum and maximum values of the data keys in the group serve as its left and right demarcation points. For the key 162 to be queried, the eight-bit binary representation is 10100010. According to its first 3 radix bits, the radix tree locates it between the two nodes 152 and 191. Binary search in the node array then determines that the key lies between the two nodes 152 and 178, and the one-to-one correspondence between demarcation points and arrays locates the corresponding group, which completes one lookup in the root lookup table.
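The arithmetic behind the radix step for the key 162 can be checked with a short sketch, assuming 8-bit keys and 3 radix bits as in the example; the function name prefix_range is hypothetical.

```python
def prefix_range(key, key_bits=8, radix_bits=3):
    """Return the radix prefix of key and the key range sharing that prefix."""
    shift = key_bits - radix_bits
    prefix = key >> shift
    low = prefix << shift
    high = low + (1 << shift) - 1
    return prefix, low, high


print(format(162, "08b"))  # binary form of the query key
print(prefix_range(162))   # prefix 0b101 covers the keys 160..191
```

With these parameters each prefix covers 2^5 = 32 consecutive key values, and the upper end of the range for 162 coincides with the node 191 named in the example.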

FIG. 2 shows a sample process for the final search step using exponential search. For convenience, the position predicted by the model is denoted as 0, and the exact position of the queried key-value pair is offset from the predicted position by 12. First, the search boundary is determined by probing with exponentially increasing step lengths and comparing the key value at each probe point with the query key value. In the figure, the key value at offset 8 is smaller than the query key value and the key value at offset 16 is larger than the query key value, so the left and right boundaries are determined to be 8 and 16, respectively. After the left and right boundaries are determined, the exact storage position of the data is found by binary search.

FIG. 3 is a partial process of array length dynamic adaptive slicing training. In the process, after an index model is adopted for the topmost segment, the average prediction error of the model exceeds a set threshold value, so that the fitting effect is considered to be not good enough, and the segmentation is continued; then, in the second layer segment, an index model is trained on the left segment to achieve the expected effect, so that the method is accepted. And the index model of the right segment has the average prediction error still exceeding the set threshold, so that the segmentation is continued.
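The top-down view of FIG. 3 can be sketched as a recursive split. The sketch assumes that a segment is split at its midpoint, that the error metric is the mean absolute error of a least-squares line, and that very short segments are always accepted; the names and the min_size parameter are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y ~ slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        return 0.0, mean_y
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov / var_x
    return slope, mean_y - slope * mean_x


def mean_abs_error(keys, start, end):
    """Average prediction error of one linear model over keys[start:end]."""
    xs, ys = keys[start:end], range(start, end)
    slope, intercept = fit_line(xs, ys)
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


def split_adaptive(keys, start, end, threshold, min_size=2):
    """Keep a segment if its model is accurate enough, else split it in two."""
    if end - start <= min_size:
        return [(start, end)]
    if mean_abs_error(keys, start, end) <= threshold:
        return [(start, end)]  # fitting effect is good enough: accept
    mid = (start + end) // 2   # assumption: split at the midpoint
    return (split_adaptive(keys, start, mid, threshold, min_size)
            + split_adaptive(keys, mid, end, threshold, min_size))


keys = list(range(64)) + [1000 + 100 * i for i in range(64)]
print(split_adaptive(keys, 0, len(keys), threshold=0.5))
```

Here the topmost segment is rejected because one line cannot follow both halves of the data, while each half is accepted after a single split.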

FIG. 4 illustrates the process of inserting new key-value pair data based on a chaining external array. To insert the key-value pair whose key is 79, the corresponding lower_bound in the segment is first found to be 69. The chaining external array attached to 69 still has free space, so 79 is inserted into that external array, which completes one insertion.

The specific application scenarios of the invention are as follows: in the databases such as PostgreSQL, MySQL and the like, the method can be applied to replace the traditional B-tree index structure, so that the query efficiency is improved.

Table 1 is the experimental data using the method of the invention:

TABLE 1

Average query latency (ns)	norm	logn	amzn	osm	wiki
Group Spline	93	103	205	263	183
B-tree	473	473	471	474	483

Here norm is a synthetic data set generated from a normal distribution, logn is a synthetic data set generated from a lognormal distribution, amzn is a data set composed of shopping data from Amazon, osm is a data set composed of location coordinate data provided on the Open Street Map website, and wiki is a data set composed of access records for some Wikipedia entries. All of the above data sets contain one million data entries.

Based on the same inventive concept, another embodiment of the present invention provides a database learning type index construction system using the above method, including:

Based on the same inventive concept, another embodiment of the present invention provides an electronic device (computer, server, smartphone, etc.) comprising a memory storing a computer program configured to be executed by the processor, and a processor, the computer program comprising instructions for performing the steps of the inventive method.

Based on the same inventive concept, another embodiment of the present invention provides a computer-readable storage medium (e.g., ROM/RAM, magnetic disk, optical disk) storing a computer program, which when executed by a computer, performs the steps of the inventive method.

The particular embodiments of the present invention disclosed above are illustrative only and are not intended to be limiting, since various alternatives, modifications, and variations will be apparent to those skilled in the art without departing from the spirit and scope of the invention. The invention should not be limited to the disclosure of the embodiments in the present specification, but the scope of the invention is defined by the appended claims.

Claims

1. A database learning type index construction method is characterized by comprising the following steps:

fitting the cumulative distribution function by using a machine learning model to acquire the relevance of the data keywords and the data storage position to obtain a learning type index;

2. The method of claim 1, wherein the cumulative distribution function is modeled as pos = F(key) × N, where F(key) represents the cumulative distribution function, key represents a data key, pos represents a data storage location, and N represents a total data size.

3. The method of claim 2, wherein fitting the cumulative distribution function using the machine learning model is to select parameters of an appropriate cumulative distribution function (F function) by machine learning so that the loss function is minimized.

4. The method of claim 1, wherein fitting the cumulative distribution function using a machine learning model comprises grouping key-value data; the grouping processing of the key-value data includes: presetting the number segmentSize of data points in each group, forming the data points into a group each time segmentSize data points have been read, fitting a machine learning model, recording the model parameters, and taking the maximum value and the minimum value of the data keys of the group as demarcation points that define the group.

5. The method of claim 4, wherein the fast locating of the position of the key to be queried according to the learning type index comprises: using a root lookup table as an auxiliary data structure to accurately and quickly locate the key to be queried in its target group; the root lookup table, based on a Radix Tree, maps data keys whose first k binary bits are identical to the same node to realize rapid group positioning of the data keys, wherein k is the number of radix bits, namely the depth of the Radix Tree; using the demarcation points between adjacent groups as nodes of the root lookup table, and constructing an array from the demarcation points, the array being called a node array, a pointer being stored at each position of the node array; each time a data key Key to be queried is input, first locating two nodes in the node array according to the radix tree, and then finding, through binary search, the index of the smallest node in the node array that is not less than Key, namely the index of the corresponding group, thereby completing the group positioning process of the root lookup table.

6. The method according to claim 4 or 5, characterized in that the array is dynamically sliced using an array length dynamic adaptive slicing method, comprising the steps of:

tentatively combining a plurality of arrays under the same father node at the bottommost layer, and replacing a plurality of original index models with one index model to save space consumption;

evaluating the merging effect based on the index of the average prediction error of the index model, if the error is lower than a set threshold value, accepting merging, otherwise, cancelling the merging operation;

and repeating the merging operation from bottom to top until no array meeting the merging condition exists.

7. The method of claim 1, wherein the insertion operation and the deletion operation are supported by introducing an external array, namely, an element to be inserted is placed in the external array attached to the position to be inserted, and the external array and the original array are then periodically integrated; each time an insertion operation is carried out, a query operation is first executed to find the position corresponding to the key to be inserted, namely the position of the largest key in the array that is not greater than the current key, called lower_bound, an array of fixed size is attached at that position, and the key-value pair to be inserted is placed in that array; in the query operation, the lower_bound is located first, if the lower_bound is the key to be queried, one query operation is completed, and if the lower_bound is not the key to be queried, the query is performed in the corresponding external array; when the external array is full, the external array and the original array are integrated and the model is retrained; when a deletion operation is carried out, the query is still executed first to find the position of the data, the position is marked as empty, and all positions marked as empty are cleaned up when the model is reconstructed.

8. A database-learning index building system using the method of any one of claims 1 to 7, comprising:

the cumulative distribution function fitting module is used for fitting a cumulative distribution function by using a machine learning model so as to obtain the relevance of the data keywords and the data storage position and obtain a learning type index;

9. An electronic apparatus, comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1 to 7.