CN111126625A - Extensible learning index method and system - Google Patents

Extensible learning index method and system

Info

Publication number: CN111126625A
Authority: CN (China)
Prior art keywords: data, bucket, linear regression, model, learning
Legal status: Granted; Active
Application number: CN201911328057.3A
Other languages: Chinese (zh)
Other versions: CN111126625B
Inventors: 华宇, 李鹏飞
Current and Original Assignee: Huazhong University of Science and Technology
Application filed by Huazhong University of Science and Technology
Priority to CN201911328057.3A

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 — Machine learning
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 — Details of database functions independent of the retrieved data types
    • G06F 16/901 — Indexing; Data structures therefor; Storage structures


Abstract

The invention discloses an extensible learning index method and system in the field of computer data storage, comprising the following steps: sampling key-value pairs in a key-value storage system to obtain an ordered training data set; training a plurality of linear regression models on the training data set, where each linear regression model indexes the data in one data interval and the data intervals covered by the models do not overlap; storing each linear regression model in the form <key, model>, where key is the largest datum in the interval covered by the model and model comprises the model parameters; and processing newly inserted data with a hierarchical bucket structure. Each hierarchical bucket structure corresponds to one datum that participated in training and comprises a parent bucket whose data are ordered; each parent-bucket datum corresponds to a child bucket whose data are ordered and smaller than that parent-bucket datum. The invention effectively improves the scalability of the learning index.

Description

Extensible learning index method and system
Technical Field
The invention belongs to the field of computer data storage, and particularly relates to an extensible learning index method and system.
Background
In today's big-data era, how to store and access data efficiently has become an important concern across many fields. Computer systems typically use various index structures to store and access data efficiently as needed, among which tree index structures are important for serving range requests. Many existing methods, such as CSS-Tree, CSB+-Tree, and FAST, use memory, cache, or SIMD (Single Instruction Multiple Data) optimizations to let tree structures provide fast data access, but these structures usually occupy a large amount of memory space. Once they overflow the limited memory as the amount of data keeps growing, the efficiency of data access is severely reduced.
Existing learning index techniques use a machine learning algorithm to learn the distribution of the data; the resulting machine learning models reflect the data distribution well, so only a small amount of memory is needed to store the models, while the strong computing power of the computer is used to access the data. Compared with the traditional approach of locating a datum by comparing data, accessing data through computation is cheaper and faster and occupies very little memory, which makes it better suited to today's big-data era. However, existing learning index methods cannot be widely used because they face the following challenges:
(1) Poor scalability: existing learning index methods cannot insert data well. If new data are inserted directly into the original data in key order, the positions of some data change, so the data distribution no longer matches the machine learning models learned before, and some data can no longer be found through the existing models, i.e., they are lost. At that point, the changed data distribution must be relearned by retraining the learning index models to ensure that all data can be found. Moreover, because of the high degree of dependency between models, changing one model forces other models to change with it, making it difficult to add or remove a particular part of the models and data.
(2) Expensive overhead: to work around the insertion problem caused by the high dependency between models, some learning index methods use a buffer to store newly inserted data so that insertions do not disturb the original data distribution; but then every access must consult two structures (the original structure and the buffer), which greatly reduces the efficiency of data access. To store data elsewhere separately, existing learning index methods either separate the data covered by multiple models or build a data conversion table to separate and migrate data, but both approaches must be rebuilt during retraining and introduce large space and time overheads.
In general, existing learning index methods are poorly scalable.
Disclosure of Invention
Aiming at the defects and improvement requirements of the prior art, the invention provides an extensible learning index method and system, and aims to improve the extensibility of the learning index.
To achieve the above object, according to a first aspect of the present invention, there is provided an extensible learning index method, including:
sampling key value pairs in a key value storage system to obtain an ordered training data set record [ N ];
training by using a training data set record [ N ] to obtain a plurality of linear regression models, wherein each linear regression model is respectively used for indexing data in a data interval, and data areas covered by the linear regression models are not overlapped with each other;
and storing each linear regression model in the form <key, model>, where key is the largest datum in the data interval covered by the linear regression model and model comprises the model parameters of the linear regression model.
Each of the plurality of linear regression models adaptively learns the data distribution within one data interval. Linear regression models have a simple structure and few parameters, so even with many models the training speed remains fast. Because the linear regression models are mutually independent, when one model needs to change, the other models need not be retrained; a particular part of the models and data can therefore be conveniently added or removed, which effectively improves the scalability of the learning index.
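As an illustrative sketch (not code from the patent), the non-overlapping models can be kept as an array of <key, model> entries sorted by key, where key is the largest datum a model covers; the model responsible for a query key is then the first entry whose key is greater than or equal to it. All names and values below are hypothetical:

```python
import bisect
from typing import NamedTuple

class Model(NamedTuple):
    a: float  # weight of the linear model y = a*x + b
    b: float  # offset

# Each entry is (max_key_covered, model); entries are sorted by max_key_covered,
# and the intervals covered by the models do not overlap.
model_table = [
    (99,  Model(a=1.0, b=0.0)),
    (199, Model(a=0.5, b=50.0)),
    (999, Model(a=0.1, b=120.0)),
]

def find_model(key):
    """Return the model whose interval contains `key`: the first entry
    with max_key_covered >= key."""
    keys = [k for k, _ in model_table]
    i = bisect.bisect_left(keys, key)
    if i == len(model_table):
        raise KeyError(key)
    return model_table[i][1]

def predict_pos(key):
    """Predicted storage position of `key` under its interval's model."""
    m = find_model(key)
    return m.a * key + m.b
```

Because each interval is owned by exactly one model, adding or retraining one entry of `model_table` leaves all other entries untouched, which is the independence property described above.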
Further, the expandable learning index method provided by the first aspect of the present invention further includes processing newly inserted data by using a hierarchical bucket structure;
each hierarchical bucket structure corresponds to one datum that participated in training; each hierarchical bucket structure comprises a parent bucket, and the data in the parent bucket are ordered; each parent-bucket datum corresponds to a child bucket through a pointer to that child bucket; the data in each child bucket are ordered and are all smaller than the corresponding parent-bucket datum.
The invention uses the hierarchical bucket structure to process newly inserted data (i.e., data outside the training data set record[N]), so that insertions do not disturb the data distribution of the original training data set and the linear regression models trained on it need not be retrained; this guarantees data scalability while avoiding the drop in access efficiency caused by accessing a buffer.
Further, the extensible learning index method provided by the first aspect of the present invention further includes processing newly inserted data D_insert according to the following steps:
(T1) based on the newly inserted data D_insert, determine the hierarchical bucket structure Level-bin corresponding to the data interval to which D_insert belongs;
(T2) query the parent bucket bin_F of the hierarchical bucket structure Level-bin to determine the child bucket bin_S to be queried, whose corresponding parent-bucket datum is f_S;
(T3) if the child bucket bin_{S-1} preceding bin_S does not exist or is currently full, go to step (T5); otherwise, take the larger datum D_1 and the smaller datum D_2 of D_insert and the smallest datum in bin_S, insert D_1 into the child bucket bin_S in order, and go to step (T4);
(T4) because the data in child bucket bin_S are larger than the data in child bucket bin_{S-1} and the parent-bucket datum corresponding to bin_{S-1}, the datum D_2 is larger than the data in bin_{S-1} and the parent-bucket datum f_{S-1}; insert f_{S-1} into the child bucket bin_{S-1} in order, insert D_2 into the parent bucket bin_F as the parent-bucket datum corresponding to bin_{S-1}, and go to step (T8);
(T5) if the child bucket bin_S is currently full, go to step (T6); otherwise, insert the newly inserted data into bin_S in order and go to step (T8);
(T6) move the parent-bucket data after f_S, together with the corresponding pointers to child buckets, backward by one position each; create a new child bucket bin_new; insert f_S into the vacated position of the parent bucket bin_F as the parent-bucket datum corresponding to bin_new; and move the larger part of the data in bin_S into bin_new;
(T7) insert the largest of the data remaining in bin_S into the parent bucket bin_F as the parent-bucket datum corresponding to bin_S, and go to step (T2);
(T8) the insertion ends.
When newly inserted data are processed, even in the worst case only part of the parent-bucket data and a small amount of child-bucket data need to be migrated, which effectively reduces time overhead; moreover, when the preceding child bucket is full, new data can still be inserted into the following child bucket, which keeps the child buckets well utilized and effectively reduces space overhead.
Further, in step (T6), the data inserted into the child bucket bin_new are half of the total data in the child bucket bin_S.
By migrating half of the data in the current child bucket into the newly created child bucket, the invention avoids frequently creating child buckets and migrating data.
Further, in the extensible learning index method provided by the first aspect of the present invention, the training with the training data set record [ N ] to obtain a plurality of linear regression models includes:
(S1) initializing an empty data set S;
(S2) sequentially extract learning_step key-value pairs from the training data set record[N] and add them to the data set S;
(S3) train a linear regression model with the data set S and obtain its maximum error error_1; if error_1 < threshold and the training data set record[N] still contains key-value pairs that have not participated in model training, go to step (S2); if error_1 < threshold and record[N] contains no key-value pair that has not participated in model training, go to step (S6); if error_1 > threshold, take the last-added learning_step key-value pairs as the data to be shifted out, initialize the iteration round number n = 1, and go to step (S4); if error_1 = threshold, go to step (S6);
(S4) move learning_step * learning_rate^n key-value pairs out of the data set S, in order from back to front; then retrain a linear regression model with the data set S and obtain its maximum error error_2;
(S5) if error_2 > threshold and the number KV_left of key-value pairs remaining in the data to be shifted out satisfies KV_left > learning_step * learning_rate^n, go to step (S4); if error_2 > threshold and KV_left = learning_step * learning_rate^n, update the iteration round number as n = n + 1 and go to step (S4); if error_2 ≤ threshold, go to step (S6);
(S6) store the finally obtained linear regression model; obtain the position, in the training data set record[N], of the last datum in the data set S, and take the next position as the starting position for data extraction; empty the data set S; if record[N] still contains key-value pairs that have not participated in model training, go to step (S2); otherwise, end;
wherein threshold is a preset error threshold, learning_step is a preset learning step length, and learning_rate is a preset learning rate.
The method completes model training based on a greedy idea: while the model error does not exceed the threshold, the data interval corresponding to the model is expanded with a larger step until the error just exceeds the threshold; once the threshold is exceeded, the model backs off with a smaller step that shrinks round by round, gradually narrowing the data interval. In this way, while guaranteeing index precision, one model covers as much data following the same distribution as possible, which reduces the number of models used and makes the models independent of one another.
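A minimal sketch of the greedy training loop of steps (S1)-(S6), under stated assumptions: the fit is ordinary least squares (the patent does not prescribe a fitting method), the back-off simply refits on a shrinking prefix, and the threshold, learning_step, and learning_rate values are illustrative:

```python
def fit_linear(points):
    """Ordinary least-squares fit position = a*key + b; returns (a, b, max_err)."""
    n = len(points)
    mx = sum(k for k, _ in points) / n
    my = sum(p for _, p in points) / n
    den = sum((k - mx) ** 2 for k, _ in points)
    a = sum((k - mx) * (p - my) for k, p in points) / den if den else 0.0
    b = my - a * mx
    max_err = max(abs(a * k + b - p) for k, p in points)
    return a, b, max_err

def greedy_train(record, threshold=2.0, learning_step=4, learning_rate=0.5):
    """Greedily split `record` (key-sorted list of (key, position) pairs) into
    non-overlapping linear models stored as (max key, (a, b)) entries."""
    models, start = [], 0
    while start < len(record):
        end, a, b = start, 0.0, 0.0
        while True:
            # (S2)/(S3): extend the interval by learning_step pairs and refit
            new_end = min(end + learning_step, len(record))
            a1, b1, err = fit_linear(record[start:new_end])
            if err <= threshold:
                a, b, end = a1, b1, new_end
                if end == len(record):
                    break
                continue
            # (S4)/(S5): error too large -> back off with shrinking steps
            n = 1
            while True:
                step = max(1, int(learning_step * learning_rate ** n))
                new_end = max(start + 1, new_end - step)
                a, b, err = fit_linear(record[start:new_end])
                if err <= threshold or new_end == start + 1:
                    end = new_end
                    break
                n += 1
            break
        # (S6): store the model keyed by the largest datum it covers
        models.append((record[end - 1][0], (a, b)))
        start = end
    return models
```

On perfectly linear data one model covers everything; a kink in the key-position function forces the loop to close the current interval and start a new model.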
Further, in the extensible learning index method provided by the first aspect of the present invention, storing each linear regression model in a form of < key, model >, includes:
storing by adopting a two-layer structure;
in the two-layer structure, the second layer is constructed as follows: sort the <key, model> pairs of all linear regression models in increasing order of key; every M <key, model> pairs then form a page; for the last page, if the number of models is less than M, it is padded with the maximum value; all pages form the second layer;
in the two-layer structure, the first layer is constructed in the following manner: copying the maximum data in each page, and forming a first layer according to the sequence from small to large;
wherein M is less than or equal to 64.
The invention stores the linear regression model corresponding to the key, model by using a two-layer structure, which is beneficial to accelerating the searching speed of the linear regression model, thereby improving the indexing efficiency of data.
Further preferably, M = 64; when the number of models forming one page is no more than 64, the two-layer structure can effectively accelerate the search for a linear regression model, and setting M to exactly 64 further prevents the first layer of the two-layer structure from occupying too much storage space.
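The two-layer storage described above might be sketched as follows; M = 4 is used here only for readability (the patent prefers M = 64), the padding value and binary page search are simplifications, and the in-page search stands in for the SIMD-optimized interpolation search:

```python
import bisect

M = 4                    # page size; small here for readability, the patent prefers 64
PAD = float("inf")       # padding key for the last, partially filled page

def build_two_layer(entries):
    """entries: list of (key, model) sorted by key ascending.
    Returns (first_layer, pages): second-layer pages of M entries each
    (last page padded), first layer holding each page's largest real key."""
    pages, first_layer = [], []
    for i in range(0, len(entries), M):
        page = entries[i:i + M]
        first_layer.append(page[-1][0])                 # copy the page's maximum key
        page = page + [(PAD, None)] * (M - len(page))   # maximum-value padding
        pages.append(page)
    return first_layer, pages

def lookup(first_layer, pages, key):
    """Find the model responsible for `key` (assumed covered by some page):
    pick the page via the first layer, then search inside the page."""
    p = bisect.bisect_left(first_layer, key)
    page = pages[p]
    j = bisect.bisect_left([k for k, _ in page], key)
    return page[j][1]
```

The first layer stays tiny (one key per page), so it fits in cache and narrows every search to a single fixed-size page.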
Further, in the extensible learning index method provided by the first aspect of the present invention, when the linear regression model corresponding to a key-value pair is searched for, interpolation search within each page is optimized using SIMD.
On the basis of the two-layer structure, the invention uses SIMD (single instruction, multiple data) to optimize interpolation search within each page, further accelerating the search for a linear regression model and improving data indexing efficiency.
Further, in the extensible learning index method provided by the first aspect of the present invention, when child buckets are searched, interpolation search over all child buckets is optimized using SIMD.
Further, the extensible learning index method provided by the first aspect of the present invention further includes, for any linear regression model model_i, retraining it according to the following steps:
reorder the data within the interval covered by the linear regression model model_i to obtain a data set record_i;
retrain one or more linear regression models with the data set record_i to replace the original linear regression model model_i, and store the new linear regression model(s) in the form <key, model>.
Through the adaptive-learning training mode, the invention determines the data interval corresponding to each linear regression model; when a model is retrained, only the data within its data interval are reordered and the linear regression model is then retrained, so retraining can be completed quickly. For the other linear regression models that need no retraining, only their offsets are modified according to how the data interval of the retrained model changed.
According to a second aspect of the present invention, there is provided a system comprising a computer readable storage medium for storing an executable program and a processor;
the processor is used for reading an executable program stored in the computer readable storage medium and executing the expandable learning index method provided by the first aspect of the invention.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
(1) Each of the plurality of linear regression models adaptively learns the data distribution within one data interval. Linear regression models have a simple structure and few parameters, so even with many models the training speed remains fast. Because the models are mutually independent, changing one model does not require retraining the others, so a particular part of the models and data can be conveniently added or removed, which effectively improves the scalability of the learning index.
(2) The invention uses the hierarchical bucket structure to process newly inserted data, so that insertions do not disturb the data distribution of the original training data set and the linear regression models trained on it need not be retrained; this guarantees data scalability while avoiding the drop in access efficiency caused by accessing a buffer.
(3) When newly inserted data are processed, even in the worst case only part of the parent-bucket data and a small amount of child-bucket data need to be migrated, which effectively reduces time overhead; moreover, when the preceding child bucket is full, new data can still be inserted into the following child bucket, which keeps the child buckets well utilized and effectively reduces space overhead.
(4) The method completes model training based on a greedy idea, so that, while guaranteeing index precision, one model covers as much data following the same distribution as possible, which helps reduce the number of models used and makes the models independent of one another.
(5) The invention uses SIMD to optimize interpolation search over each page and each child bucket, further accelerating the search for linear regression models and child buckets and improving data indexing efficiency.
Drawings
Fig. 1 is a schematic diagram of a range query index model and a KPF (Key Position Function) model according to an embodiment of the present invention, where (a) shows the range query index model and (b) shows the KPF model;
FIG. 2 is a schematic diagram of an extensible learning index method according to an embodiment of the present invention;
fig. 3 is a schematic diagram of guaranteeing data integrity according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
In the present application, the terms "first," "second," and the like (if any) in the description and the drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The conventional tree index model shown in fig. 1 (a) can be regarded as a prediction model: the data storage location is predicted from the input datum, and a leaf node containing the datum being searched for is finally output. Similarly, this process can be viewed as a machine-learned regression model, and existing learned indexes have proven that it is feasible to supplement existing index structures with machine learning models. However, it is difficult for a machine learning model to predict the storage locations of all data exactly; that would require a complex and highly accurate model, which often consumes a large amount of storage space and is hard to train. In practice, though, a conventional tree index does not predict the exact positions of all data either: it outputs a leaf node containing the searched datum, and a leaf node covers a range. This observation greatly reduces the difficulty of the learning index model, i.e., the model likewise only needs to predict a range [pred + min_err, pred + max_err] and guarantee that this range contains the queried datum. The invention specifically sets an error threshold threshold > 0, whose value can be determined according to the actual application requirements; accordingly, in the prediction range of the learning index, min_err = -threshold and max_err = threshold.
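The bounded prediction range can be illustrated with a sketch (names are hypothetical): given a model's predicted position pred and the error threshold, the queried key is guaranteed to lie in [pred - threshold, pred + threshold], so the final lookup only needs to search that window of the sorted array:

```python
import bisect

def lookup_in_range(sorted_keys, key, pred, threshold):
    """Search for `key` inside the window [pred - threshold, pred + threshold]
    of the sorted array, which the trained model guarantees contains the key
    if it is present; returns the index of `key`, or -1."""
    lo = max(0, int(pred) - threshold)
    hi = min(len(sorted_keys), int(pred) + threshold + 1)
    i = lo + bisect.bisect_left(sorted_keys[lo:hi], key)
    if i < hi and sorted_keys[i] == key:
        return i
    return -1
```

The cost of the last-mile search is O(log threshold) regardless of the data set size, which is why a model only needs to be accurate to within the threshold rather than exact.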
in fig. 1 (b), the learning index regards the data distribution as a KPF model, arranges keys in descending order, takes the keys as input and the positions as output, and the final data distribution is a KPF model. Therefore, the distribution rule of the data can be well mastered by learning the KPF model by using the machine learning model, so that all data can be more quickly indexed by using the learned model.
It is easy to understand that, because the present invention processes the data (i.e., key-value pairs) in a key-value storage system, the key of a key-value pair uniquely identifies that pair, and wherever size comparison or ordering of data is involved, the relation between the keys represents the relation between the whole key-value pairs. For example, for any datum A and any datum B, A > B specifically means that the key of A is larger than the key of B, while A ≤ B specifically means that the key of A is smaller than or equal to the key of B. Likewise, saying that the data in a data set are ordered specifically means that they are ordered from small to large by key.
In order to improve the expandability of the learning index, the expandable learning index method provided by the invention comprises the following steps:
sampling key value pairs in a key value storage system to obtain an ordered training data set record [ N ];
training by using a training data set record [ N ] to obtain a plurality of linear regression models, wherein each linear regression model is respectively used for indexing data in a data interval, and data areas covered by the linear regression models are not overlapped with each other;
storing each linear regression model in the form <key, model>, where key is the largest datum (i.e., the datum with the largest key) in the data interval covered by the linear regression model, and model comprises the model parameters of the linear regression model; the linear regression model can be expressed as y = ax + b, where, in the present invention, x is the datum to be indexed and y is the position of x predicted by the model, and the model parameters specifically comprise the weight a and the offset b.
In an alternative embodiment, each linear regression model is stored in the form of < key, model >, including:
storing by adopting a two-layer structure;
in the two-layer structure, the second layer is constructed as follows: sort the <key, model> pairs of all linear regression models in increasing order of key; every M <key, model> pairs then form a page; for the last page, if the number of models is less than M, it is padded with the maximum value; all pages form the second layer; when the maximum-value padding is performed, the padded data can be the key of the largest datum in the page, or a preset maximum value;
in the two-layer structure, the first layer is constructed in the following manner: copying the maximum data in each page, and forming a first layer according to the sequence from small to large;
wherein M is less than or equal to 64, so as to ensure that the search for a linear regression model can be effectively accelerated; as a preferred implementation, this embodiment specifically sets M = 64, which prevents the first layer of the two-layer structure from occupying too much storage space;
the two-layer structure for storing the linear regression model established by the present invention is shown in fig. 2.
In order to reduce the time and space overhead of the learning index, the present embodiment processes newly inserted data using a hierarchical bucket structure;
as shown in fig. 2, each hierarchical bucket structure corresponds to one datum that participated in training; for example, in fig. 1, the data interval corresponding to the first linear regression model is (0, 99), and in the training data set this interval is further divided into the four subintervals (1, 4], (4, 65], (65, 90], and (90, 99), each of which corresponds to one hierarchical bucket structure;
each hierarchical bucket structure comprises a parent bucket, and the data in the parent bucket are ordered; each parent-bucket datum corresponds to a child bucket through a pointer to that child bucket; the data in each child bucket are ordered and are all smaller than the corresponding parent-bucket datum;
based on the above hierarchical bucket structure, in the present embodiment, the method further includes processing the newly inserted data D according to the following stepsinsert
(T1) based on the newly inserted data DinsertDetermining a corresponding Level-bin of a hierarchical bucket structure in the data interval to which the data belongs;
(T2) query parent bucket bin of Level-bin of hierarchical bucket structureFTo determine the sub-bucket bin to be queriedSThe corresponding parent bucket data is fS
(T3) if the barrel binSThe previous sub-barrel binS-1If the current time does not exist or is full, the step (T5) is carried out; otherwise, data D is obtainedinsertAnd barrel binSOf the smallest data, the larger data D1And smaller data D2Data D of1Sequentially insert into sub-barrel binSThen, go to step (T4);
(T4) because of sub-barrel binSThe data in is greater than the sub-barrel binS-1Inner data and sub-bucket binS-1Corresponding parent bucket data fS-1So data D2 is larger than sub-bucket binS-1Inner data and corresponding father bucket data fS-1(ii) a Will father bucket data fS-1Sequentially insert into sub-barrel binS-1And data D2Insert into parent barrel binFMiddle as sub-barrel binS-1Corresponding parent bucket data, and then proceeds to step (T8);
(T5) HuobinSIf the current is full, the step (T6) is carried out; otherwise, the newly inserted data is inserted into the sub-barrel bin in sequenceSThen, go to step (T8);
(T6) putting the parent bucket data fSThe subsequent parent bucket data and corresponding pointers to the child buckets are all moved backward by one position, and a new child bucket bin is creatednewWill father bucket data fSInsert into parent barrel binFEmpty position of center as a childBarrel binnewCorresponding parent bucket data, and combining child bucket binsSA larger part of data is inserted into the sub-barrel binnewPerforming the following steps;
(T7) sub-barrel binSInserting the largest data in the inner residual data into the parent bucket binFMiddle as sub-barrel binSCorresponding father bucket data, and the step (T2) is carried out, after the jump, the new data D is determined again according to the current structure of Level-bin of the hierarchical bucket structureinsertThe insertion position of (a);
(T8) the end of the insertion;
in a preferred embodiment, to avoid creating sub-buckets and migrating data frequently, in the step (T6) described above, the sub-bucket bin is followedSMiddle migration to sub-barrel binnewThe data in (1) is sub-bucket binSHalf of the total data.
To improve data indexing efficiency, in this embodiment, when searching for the linear regression model corresponding to a key-value pair, the interpolation search within each page is optimized using SIMD; likewise, when searching the sub-buckets, the interpolation search within all sub-buckets is optimized using SIMD.
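The SIMD detail above is hardware-specific, but the underlying interpolation search over a sorted page or sub-bucket can be shown in scalar form. The sketch below is an assumption about the search shape, not the patent's code; a vectorized version would probe several candidate slots per iteration instead of one.

```python
def interpolation_search(arr, target):
    """Interpolation search over a sorted array: estimate the target's
    slot by linearly interpolating between the end keys, then narrow.
    Returns the index of target, or -1 if absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi and arr[lo] <= target <= arr[hi]:
        if arr[hi] == arr[lo]:           # flat range: target found at lo
            break
        # position estimate from the linear interpolation of the keys
        pos = lo + (target - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
        if arr[pos] == target:
            return pos
        if arr[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return lo if lo < len(arr) and arr[lo] == target else -1
```

On near-uniform key distributions (which each linear model's page approximates by construction), this converges in far fewer probes than binary search.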
FIG. 3 illustrates how the present invention guarantees data integrity. As shown in the global view on the left of FIG. 3, all models of the extensible learning index together cover the entire data range, and every model is stored in the form <key, model>, where key is the largest data in the interval covered by the model and model holds the model parameters, i.e. weight a and offset b. In the enlarged partial view on the right of FIG. 3, the solid black line is the learned linear regression model, the black points are the data covered by the model, and the gray shaded band is the prediction range given by the model; every point can be found through the model. For example, the error x_a of point a satisfies x_a ≤ max_err, so the prediction interval given by the model, [pred + min_err, pred + max_err], always contains point a.
However, if new data smaller than a is inserted directly, a conventional learned index must shift every datum larger than the newly inserted data backward by one position to make room and keep the data sorted. After a moves to a', its error x_a' exceeds max_err, so the prediction interval given by the regression model no longer contains the datum and part of the data is lost. In the present invention, newly inserted data is handled by the hierarchical buckets, and this structure never moves existing data that participated in training, thereby guaranteeing that the data can always be found.
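The error-bounded lookup described above can be captured in a few lines. This is a minimal sketch under assumed names (the function, the integer error bounds, and the fallback binary search are illustrative): the linear model predicts a position, and correctness only requires searching the slice guaranteed by the training-time error bounds.

```python
import bisect

def model_lookup(keys, key, a, b, min_err, max_err):
    """Find `key` in the sorted array `keys` using a linear model
    pred = a*key + b whose training-time residuals all lie in
    [min_err, max_err]; only that window needs to be searched."""
    pred = int(a * key + b)
    lo = max(0, pred + min_err)
    hi = min(len(keys) - 1, pred + max_err)
    # narrow search restricted to the guaranteed interval
    i = bisect.bisect_left(keys, key, lo, hi + 1)
    return i if i <= hi and keys[i] == key else -1
```

The guarantee holds only while trained data stays at its trained position — exactly what shifting data for an insert (a moving to a') would break, and what the hierarchical buckets avoid.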
The present invention is further explained below using the specific insertion of data 7, data 23 and data 24 in FIG. 2 as an example.
A data insertion can be divided into three stages, namely Stage 1, Stage 2 and Stage 3 in FIG. 2:
Stage 1: using the stored <key, model> entries, find which model covers the new data — first locate the page the model belongs to by binary search, then search within the page using SIMD;
Stage 2: compute with the found model to predict the range into which the new data should be inserted, i.e. compute [pred + min_err, pred + max_err], and find the position within this range that is closest to, and not greater than, the new data;
Stage 3: if the model's prediction range covers the new data, the data is inserted there; otherwise the data is inserted into the hierarchical bucket of the invention.
When inserting data 7, i.e. "Insert 7" process in fig. 2, it is first determined that data 7 is represented by a first model f1(x) The data interval covered by the model is specifically (0, 99)](ii) a In the training data set record [ N ]]In (5), data 7 is specifically located in the data interval (4, 65)]Obtaining a hierarchical bucket corresponding to the data interval, inquiring data of a parent bucket to obtain that new data 7 is to be inserted into the 2 nd sub-bucket, wherein the state of the hierarchical bucket is as shown in the state I in figure 2, and according to the principle of preferentially inserting the previous bucket of the hierarchical bucket, because the previous bucket, namely the 1 st bucket, has a position, the new data 7 is to be inserted into the first bucket;
When inserting data 23, i.e. the "Insert 23" process in FIG. 2, Stages 1 and 2 determine that the hierarchical bucket corresponding to the data interval (4, 65] should be queried. Querying the parent-bucket data shows that the new data 23 should go into the 2nd sub-bucket; the state of the hierarchical bucket is state II in FIG. 2. Since the preceding bucket has an empty position, insertion into the 1st sub-bucket is preferred; by comparison, the smallest data involved, data 20, is smaller than data 23, so data 20 is placed in the 1st sub-bucket and data 23 is inserted into the 2nd sub-bucket.
When inserting data 24, i.e. the "Insert 24" process in FIG. 2, Stages 1 and 2 again determine that the hierarchical bucket corresponding to the data interval (4, 65] should be queried. Querying the parent-bucket data shows that the new data 24 should go into the 2nd sub-bucket; the state of the hierarchical bucket is state III in FIG. 2. Since both the 1st and the 2nd sub-buckets are full, the parent-bucket data corresponding to the 3rd and subsequent sub-buckets is shifted backward by one position, a new sub-bucket is created to take half of the data of the 2nd sub-bucket, and the new data is then inserted into the 2nd sub-bucket, which now has room. After data 24 is inserted, the state of the hierarchical bucket is state IV in FIG. 2.
To ensure independence between the linear regression models, in this embodiment a plurality of linear regression models are obtained by training with the training data set record[N], specifically:
(S1) initialize an empty data set S;
(S2) sequentially extract learning_step key-value pairs from the training data set record[N] and add them to the data set S;
(S3) train a linear regression model with the data set S and obtain its maximum error error1; if error1 < threshold and record[N] still contains key-value pairs that have not participated in model training, go to step (S2); if error1 < threshold and no such key-value pairs remain, go to step (S6); if error1 > threshold, mark the last added learning_step key-value pairs as the data to be shifted out, initialize the iteration round number n = 1, and go to step (S4); if error1 = threshold, go to step (S6);
(S4) remove learning_step * learning_rate^n key-value pairs from the data set S, in order from back to front, then retrain a linear regression model with S and obtain its maximum error error2;
(S5) if error2 > threshold and the number KV_left of key-value pairs remaining in the data to be shifted out satisfies KV_left > learning_step * learning_rate^n, go to step (S4); if error2 > threshold and KV_left = learning_step * learning_rate^n, update the iteration round number as n = n + 1 and then go to step (S4); if error2 ≤ threshold, go to step (S6);
(S6) store the resulting linear regression model, find the position in record[N] of the last data in S, take the next position as the starting position for data extraction, and empty the data set S; if record[N] still contains key-value pairs that have not participated in model training, go to step (S2); otherwise, end;
wherein threshold is a preset error threshold, learning_step is a preset learning step size, and learning_rate is a preset learning rate.
According to the above procedure, model training follows a greedy strategy: while the model error does not exceed the threshold, the data interval covered by the model is expanded with a large step size until the error first exceeds the threshold; once it does, the interval is backed off with a smaller step size that shrinks round by round, gradually narrowing the data interval. In this way, while index precision is guaranteed, a single model covers as much data obeying the same distribution as possible, which helps achieve independence between the models.
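The greedy grow-then-shrink loop of steps (S1)-(S6) can be sketched as below. This is a simplified reconstruction: the helper `fit` and all parameter defaults are assumptions, positions serve as regression targets, and the geometric back-off is applied only when the very first growth step overshoots (a faithful implementation would also trim the last overshooting step via the learning_rate^n schedule).

```python
def fit(keys, lo, hi):
    """Least-squares line mapping key -> position over keys[lo:hi];
    returns (weight a, offset b, max absolute error)."""
    xs, ys = keys[lo:hi], list(range(lo, hi))
    n = hi - lo
    if n == 1:
        a, b = 0.0, float(lo)
    else:
        mx, my = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        a = sxy / sxx if sxx else 0.0
        b = my - a * mx
    err = max(abs(a * x + b - y) for x, y in zip(xs, ys))
    return a, b, err

def train_models(keys, threshold=1.0, learning_step=8, learning_rate=0.5):
    """Greedy segmentation sketch of (S1)-(S6): grow each model's
    interval by learning_step keys while max error <= threshold; on an
    immediate overshoot, back off in geometrically shrinking steps."""
    models, start = [], 0
    while start < len(keys):
        end = start
        while end < len(keys):                        # (S2)/(S3) grow
            trial = min(end + learning_step, len(keys))
            if fit(keys, start, trial)[2] <= threshold:
                end = trial
            else:
                break
        if end == start:                              # (S4)/(S5) shrink
            end = min(start + learning_step, len(keys))
            step = max(1, int(learning_step * learning_rate))
            while end - start > 1 and fit(keys, start, end)[2] > threshold:
                end = max(start + 1, end - step)
                step = max(1, int(step * learning_rate))
        a, b, _ = fit(keys, start, end)
        models.append((keys[end - 1], a, b))          # stored as <key, model>
        start = end                                   # (S6) next interval
    return models
```

On uniformly spaced keys a single model absorbs everything; a distribution change inside the key stream forces a segment boundary, so each stored model covers one roughly linear run of data.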
The extensible learning index method above may further include training any linear regression model_i according to the following steps:
reorder the data within the interval covered by the linear regression model_i to obtain a data set record_i;
retrain one or more linear regression models with the data set record_i to replace the original model_i, and store the new linear regression model(s) in the form <key, model>.
In this embodiment, thanks to the adaptive model training, the data interval corresponding to each linear regression model is already determined. During retraining, only the data within the corresponding interval needs to be reordered before the linear regression model is retrained, so retraining completes quickly; for the other linear regression models that do not need retraining, only their offsets are adjusted according to how the data interval of the retrained model changed.
The invention also provides a system comprising a computer-readable storage medium and a processor. The computer-readable storage medium stores an executable program; the processor reads the executable program stored in the computer-readable storage medium and executes the extensible learning index method described above.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. An extensible learning index method, comprising:
sampling key-value pairs in a key-value storage system to obtain an ordered training data set record[N];
training with the training data set record[N] to obtain a plurality of linear regression models, wherein each linear regression model is used to index the data in one data interval, and the data intervals covered by the linear regression models do not overlap one another;
and storing each linear regression model in the form <key, model>, wherein key is the largest data in the data interval covered by the linear regression model, and model is the model parameters of the linear regression model.
2. The extensible learning index method of claim 1, further comprising processing newly inserted data using a hierarchical bucket structure;
wherein each hierarchical bucket structure corresponds to one datum participating in training; each hierarchical bucket structure comprises a parent bucket whose data is ordered; each parent-bucket datum points to a child bucket through a pointer, the data in the child bucket is ordered, and the data in the child bucket is smaller than the corresponding parent-bucket datum.
3. The extensible learning index method of claim 2, further comprising processing newly inserted data D_insert according to the following steps:
(T1) determining, based on the data interval to which the newly inserted data D_insert belongs, the corresponding hierarchical bucket structure Level-bin;
(T2) querying the parent bucket bin_F of the hierarchical bucket structure Level-bin to determine the sub-bucket bin_S to be queried, whose corresponding parent-bucket data is f_S;
(T3) if the previous sub-bucket bin_{S-1} of sub-bucket bin_S does not exist or is already full, going to step (T5); otherwise, taking data D_insert and the smallest data in sub-bucket bin_S, denoting the larger as D1 and the smaller as D2, inserting data D1 into sub-bucket bin_S in order, and going to step (T4);
(T4) inserting the parent-bucket data f_{S-1} corresponding to sub-bucket bin_{S-1} into sub-bucket bin_{S-1} in order, inserting data D2 into the parent bucket bin_F as the parent-bucket data corresponding to bin_{S-1}, and going to step (T8);
(T5) if sub-bucket bin_S is already full, going to step (T6); otherwise, inserting the newly inserted data into sub-bucket bin_S in order and going to step (T8);
(T6) moving the parent-bucket data after f_S, together with the corresponding pointers to the sub-buckets, backward by one position, creating a new sub-bucket bin_new, inserting f_S into the vacated position of the parent bucket bin_F as the parent-bucket data corresponding to bin_new, and moving the larger part of the data in sub-bucket bin_S into bin_new;
(T7) inserting the largest of the data remaining in sub-bucket bin_S into the parent bucket bin_F as the parent-bucket data corresponding to bin_S, and going to step (T2);
(T8) ending the insertion.
4. The extensible learning index method of claim 3, wherein in step (T6), the data inserted into sub-bucket bin_new is half of the total data in sub-bucket bin_S.
5. The extensible learning index method of any of claims 1-4, wherein training with the training data set record[N] to obtain a plurality of linear regression models comprises:
(S1) initializing an empty data set S;
(S2) sequentially extracting learning_step key-value pairs from the training data set record[N] and adding them to the data set S;
(S3) training a linear regression model with the data set S and obtaining its maximum error error1; if error1 < threshold and record[N] still contains key-value pairs that have not participated in model training, going to step (S2); if error1 < threshold and no such key-value pairs remain, going to step (S6); if error1 > threshold, marking the last added learning_step key-value pairs as the data to be shifted out, initializing the iteration round number n = 1, and going to step (S4); if error1 = threshold, going to step (S6);
(S4) removing learning_step * learning_rate^n key-value pairs from the data set S, in order from back to front, then retraining a linear regression model with S and obtaining its maximum error error2;
(S5) if error2 > threshold and the number KV_left of key-value pairs remaining in the data to be shifted out satisfies KV_left > learning_step * learning_rate^n, going to step (S4); if error2 > threshold and KV_left = learning_step * learning_rate^n, updating the iteration round number as n = n + 1 and then going to step (S4); if error2 ≤ threshold, going to step (S6);
(S6) storing the resulting linear regression model, finding the position in record[N] of the last data in S, taking the next position as the starting position for data extraction, and emptying the data set S; if record[N] still contains key-value pairs that have not participated in model training, going to step (S2); otherwise, ending;
wherein threshold is a preset error threshold, learning_step is a preset learning step size, and learning_rate is a preset learning rate.
6. The extensible learning index method of any of claims 1-4, wherein storing each linear regression model in the form <key, model> comprises:
storing with a two-layer structure;
in the two-layer structure, the second layer is constructed as follows: all <key, model> pairs corresponding to the linear regression models are sorted by key in ascending order, and every M <key, model> pairs form a page; if the last page holds fewer than M models, it is padded with the maximum value; all pages form the second layer;
in the two-layer structure, the first layer is constructed as follows: the largest data of each page is copied, and the copies are arranged in ascending order to form the first layer;
wherein M is less than or equal to 64.
7. The extensible learning index method of any of claims 1-4, wherein, when searching for the linear regression model corresponding to a key-value pair, the interpolation search within each page is optimized using SIMD.
8. The extensible learning index method of any of claims 1-4, wherein, when searching the sub-buckets, the interpolation search within all sub-buckets is optimized using SIMD.
9. The extensible learning index method of any of claims 1-4, further comprising training any linear regression model_i according to the following steps:
reordering the data within the interval covered by the linear regression model_i to obtain a data set record_i;
retraining one or more linear regression models with the data set record_i to replace the original model_i, and storing the new linear regression model(s) in the form <key, model>.
10. A system comprising a computer-readable storage medium and a processor, wherein,
the computer readable storage medium is used for storing an executable program;
the processor is configured to read an executable program stored in the computer-readable storage medium and execute the extensible learning index method of any one of claims 1-9.
CN201911328057.3A 2019-12-20 2019-12-20 Extensible learning index method and system Active CN111126625B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911328057.3A CN111126625B (en) 2019-12-20 2019-12-20 Extensible learning index method and system


Publications (2)

Publication Number Publication Date
CN111126625A true CN111126625A (en) 2020-05-08
CN111126625B CN111126625B (en) 2022-05-20

Family

ID=70500741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911328057.3A Active CN111126625B (en) 2019-12-20 2019-12-20 Extensible learning index method and system

Country Status (1)

Country Link
CN (1) CN111126625B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190303753A1 (en) * 2018-03-28 2019-10-03 Ca, Inc. Insertion tolerant learned index structure through associated caches
CN110442684A (en) * 2019-08-14 2019-11-12 山东大学 A kind of class case recommended method based on content of text
CN110460529A (en) * 2019-06-28 2019-11-15 天津大学 Content router FIB storage organization and its data processing method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PENGFEI LI ET AL.: "A Scalable Learned Index Scheme in Storage Systems", 《COMPUTER SCIENCE》 *
孟小峰 (MENG Xiaofeng): "机器学习化数据库***研究综述" [Survey of research on machine-learning-enhanced database ***], 《计算机研究与发展》 (Journal of Computer Research and Development) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113377883A (en) * 2020-06-15 2021-09-10 浙江大学 Multidimensional data query method based on learning index model
CN112364093A (en) * 2020-11-11 2021-02-12 天津大学 Learning type big data visualization method and system
CN112364093B (en) * 2020-11-11 2023-04-04 天津大学 Learning type big data visualization method and system
CN113268457A (en) * 2021-05-24 2021-08-17 华中科技大学 Self-adaptive learning index method and system supporting efficient writing
CN113268457B (en) * 2021-05-24 2022-07-08 华中科技大学 Self-adaptive learning index method and system supporting efficient writing
CN113722319A (en) * 2021-08-05 2021-11-30 平凯星辰(北京)科技有限公司 Data storage method based on learning index
CN113742350A (en) * 2021-09-09 2021-12-03 北京中安智能信息科技有限公司 Spatio-temporal index construction method and device based on machine learning model and query method

Also Published As

Publication number Publication date
CN111126625B (en) 2022-05-20

Similar Documents

Publication Publication Date Title
CN111126625B (en) Extensible learning index method and system
CN110083601B (en) Key value storage system-oriented index tree construction method and system
US7885967B2 (en) Management of large dynamic tables
JP5425541B2 (en) Method and apparatus for partitioning and sorting data sets on a multiprocessor system
CN105975587B (en) A kind of high performance memory database index organization and access method
WO2018129500A1 (en) Optimized navigable key-value store
CN110147204B (en) Metadata disk-dropping method, device and system and computer-readable storage medium
CN106599091B (en) RDF graph structure storage and index method based on key value storage
US7054994B2 (en) Multiple-RAM CAM device and method therefor
Hadian et al. Interpolation-friendly B-trees: Bridging the Gap Between Algorithmic and Learned Indexes.
CN113779154B (en) Construction method and application of distributed learning index model
CN115718819A (en) Index construction method, data reading method and index construction device
CN108717448B (en) Key value pair storage-oriented range query filtering method and key value pair storage system
CN112817530B Method for efficient, multi-thread-safe reading and writing of ordered data
CN112000845B (en) Hyperspatial hash indexing method based on GPU acceleration
CN100525133C (en) Sorting device
KR102006283B1 (en) Dataset loading method in m-tree using fastmap
US8805891B2 (en) B-tree ordinal approximation
CN112988064B (en) Concurrent multitask-oriented disk graph processing method
CN113468178B (en) Data partition loading method and device of association table
CN114547086B (en) Data processing method, device, equipment and computer readable storage medium
CN108460453B (en) Data processing method, device and system for CTC training
CN115563116A (en) Database table scanning method, device and equipment
CN113961568A (en) Block chain-based block fast searching method for chain data structure
CN109241098B (en) Query optimization method for distributed database

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant