CN111126625A - Extensible learning index method and system - Google Patents

Extensible learning index method and system

Info

Publication number: CN111126625A
Authority: CN (China)
Prior art keywords: data, bucket, linear regression, model, learning
Legal status: Granted; Active
Application number: CN201911328057.3A
Other languages: Chinese (zh)
Other versions: CN111126625B
Inventors: 华宇, 李鹏飞
Current and Original Assignee: Huazhong University of Science and Technology
Application filed by Huazhong University of Science and Technology
Priority to CN201911328057.3A

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 — Machine learning
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 — Details of database functions independent of the retrieved data types
    • G06F 16/901 — Indexing; Data structures therefor; Storage structures


Abstract

The invention discloses an extensible learning index method and system in the field of computer data storage, comprising the following steps: sampling key-value pairs in a key-value storage system to obtain an ordered training data set; training a plurality of linear regression models on the training data set, where each linear regression model indexes the data in one data interval and the data intervals covered by the models do not overlap; storing each linear regression model in the form <key, model>, where key is the largest datum in the interval covered by the model and model comprises the model parameters; and processing newly inserted data with a hierarchical bucket structure. Each hierarchical bucket structure corresponds to one datum that participated in training and comprises a parent bucket whose data are ordered; each parent-bucket datum corresponds to a child bucket whose data are ordered and smaller than that parent-bucket datum. The invention effectively improves the scalability of the learning index.

Description

Extensible learning index method and system
Technical Field
The invention belongs to the field of computer data storage, and particularly relates to an extensible learning index method and system.
Background
In today's big-data era, how to store and access data efficiently has become an important concern across many fields. Computer systems typically use various index structures to store and access data efficiently as needed, among which tree index structures are important for serving range requests. Many existing methods, such as CSS-Tree, CSB+-Tree, and FAST, use memory, cache, or SIMD (Single Instruction Multiple Data) optimizations to let tree structures provide fast data access, but these structures usually occupy a large amount of memory space. Once they overflow the limited memory as the amount of data keeps growing, the efficiency of data access is severely reduced.
Existing learning index techniques use a machine learning algorithm to learn the distribution of the data; the resulting machine learning models reflect the data distribution well, so only a small amount of memory is needed to store the models, while the strong computing power of the computer is used to access the data. Compared with the traditional approach of locating a datum by comparing data, accessing data through computation is cheaper and faster and occupies very little memory, which makes it better suited to today's big-data era. However, existing learning index methods cannot be widely used because they face the following challenges:
(1) Poor scalability: existing learning index methods cannot insert data well. If new data are inserted directly into the original data in key order, the positions of some data change, so the data distribution no longer matches the machine learning models learned before, and some data can no longer be found through the existing models, i.e., they are lost. At that point, the changed data distribution must be relearned by retraining the learning index models to ensure that all data can be found. Moreover, because of the high degree of dependency between models, changing one model forces other models to change with it, making it difficult to add or remove a particular part of the models and data.
(2) Expensive overhead: to work around the insertion problem caused by the high dependency between models, some learning index methods use a buffer to store newly inserted data so that insertions do not disturb the original data distribution; but then every access must consult two structures (the original structure and the buffer), which greatly reduces the efficiency of data access. To store data elsewhere separately, existing learning index methods either separate the data covered by multiple models or build a data conversion table to separate and migrate data, but both approaches must be rebuilt during retraining and introduce large space and time overheads.
In general, existing learning index methods are poorly scalable.
Disclosure of Invention
Aiming at the defects and improvement requirements of the prior art, the invention provides an extensible learning index method and system, and aims to improve the extensibility of the learning index.
To achieve the above object, according to a first aspect of the present invention, there is provided an extensible learning index method, including:
sampling key value pairs in a key value storage system to obtain an ordered training data set record [ N ];
training by using a training data set record [ N ] to obtain a plurality of linear regression models, wherein each linear regression model is respectively used for indexing data in a data interval, and data areas covered by the linear regression models are not overlapped with each other;
and storing each linear regression model in the form <key, model>, where key is the largest datum in the data interval covered by the linear regression model and model comprises the model parameters of the linear regression model.
Each of the plurality of linear regression models adaptively learns the data distribution within one data interval. Linear regression models have a simple structure and few parameters, so even with many models the training speed remains fast. Because the linear regression models are mutually independent, when one model needs to change, the other models need not be retrained; a particular part of the models and data can therefore be conveniently added or removed, which effectively improves the scalability of the learning index.
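As an illustrative sketch (not code from the patent), the non-overlapping models can be kept as an array of <key, model> entries sorted by key, where key is the largest datum a model covers; the model responsible for a query key is then the first entry whose key is greater than or equal to it. All names and values below are hypothetical:

```python
import bisect
from typing import NamedTuple

class Model(NamedTuple):
    a: float  # weight of the linear model y = a*x + b
    b: float  # offset

# Each entry is (max_key_covered, model); entries are sorted by max_key_covered,
# and the intervals covered by the models do not overlap.
model_table = [
    (99,  Model(a=1.0, b=0.0)),
    (199, Model(a=0.5, b=50.0)),
    (999, Model(a=0.1, b=120.0)),
]

def find_model(key):
    """Return the model whose interval contains `key`: the first entry
    with max_key_covered >= key."""
    keys = [k for k, _ in model_table]
    i = bisect.bisect_left(keys, key)
    if i == len(model_table):
        raise KeyError(key)
    return model_table[i][1]

def predict_pos(key):
    """Predicted storage position of `key` under its interval's model."""
    m = find_model(key)
    return m.a * key + m.b
```

Because each interval is owned by exactly one model, adding or retraining one entry of `model_table` leaves all other entries untouched, which is the independence property described above.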
Further, the expandable learning index method provided by the first aspect of the present invention further includes processing newly inserted data by using a hierarchical bucket structure;
each hierarchical bucket structure corresponds to one datum that participated in training; each hierarchical bucket structure comprises a parent bucket, and the data in the parent bucket are ordered; each parent-bucket datum corresponds to a child bucket through a pointer to that child bucket; the data in each child bucket are ordered and are all smaller than the corresponding parent-bucket datum.
The invention uses the hierarchical bucket structure to process newly inserted data (i.e., data outside the training data set record[N]), so that insertions do not disturb the data distribution of the original training data set and the linear regression models trained on it need not be retrained; this guarantees data scalability while avoiding the drop in access efficiency caused by accessing a buffer.
Further, the extensible learning index method provided by the first aspect of the present invention further includes processing newly inserted data D_insert according to the following steps:
(T1) based on the newly inserted data D_insert, determine the hierarchical bucket structure Level-bin corresponding to the data interval to which D_insert belongs;
(T2) query the parent bucket bin_F of the hierarchical bucket structure Level-bin to determine the child bucket bin_S to be queried, whose corresponding parent-bucket datum is f_S;
(T3) if the child bucket bin_{S-1} preceding bin_S does not exist or is currently full, go to step (T5); otherwise, take the larger datum D_1 and the smaller datum D_2 of D_insert and the smallest datum in bin_S, insert D_1 into the child bucket bin_S in order, and go to step (T4);
(T4) because the data in child bucket bin_S are larger than the data in child bucket bin_{S-1} and the parent-bucket datum corresponding to bin_{S-1}, the datum D_2 is larger than the data in bin_{S-1} and the parent-bucket datum f_{S-1}; insert f_{S-1} into the child bucket bin_{S-1} in order, insert D_2 into the parent bucket bin_F as the parent-bucket datum corresponding to bin_{S-1}, and go to step (T8);
(T5) if the child bucket bin_S is currently full, go to step (T6); otherwise, insert the newly inserted data into bin_S in order and go to step (T8);
(T6) move the parent-bucket data after f_S, together with the corresponding pointers to child buckets, backward by one position each; create a new child bucket bin_new; insert f_S into the vacated position of the parent bucket bin_F as the parent-bucket datum corresponding to bin_new; and move the larger part of the data in bin_S into bin_new;
(T7) insert the largest of the data remaining in bin_S into the parent bucket bin_F as the parent-bucket datum corresponding to bin_S, and go to step (T2);
(T8) the insertion ends.
When newly inserted data are processed, even in the worst case only part of the parent-bucket data and a small amount of child-bucket data need to be migrated, which effectively reduces time overhead; moreover, when the preceding child bucket is full, new data can still be inserted into the following child bucket, which keeps the child buckets well utilized and effectively reduces space overhead.
Further, in step (T6), the data inserted into the child bucket bin_new are half of the total data in the child bucket bin_S.
By migrating half of the data in the current child bucket into the newly created child bucket, the invention avoids frequently creating child buckets and migrating data.
Further, in the extensible learning index method provided by the first aspect of the present invention, the training with the training data set record [ N ] to obtain a plurality of linear regression models includes:
(S1) initializing an empty data set S;
(S2) sequentially extract learning_step key-value pairs from the training data set record[N] and add them to the data set S;
(S3) train a linear regression model with the data set S and obtain its maximum error error_1; if error_1 < threshold and the training data set record[N] still contains key-value pairs that have not participated in model training, go to step (S2); if error_1 < threshold and record[N] contains no key-value pair that has not participated in model training, go to step (S6); if error_1 > threshold, take the last-added learning_step key-value pairs as the data to be shifted out, initialize the iteration round number n = 1, and go to step (S4); if error_1 = threshold, go to step (S6);
(S4) move learning_step * learning_rate^n key-value pairs out of the data set S, in order from back to front; then retrain a linear regression model with the data set S and obtain its maximum error error_2;
(S5) if error_2 > threshold and the number KV_left of key-value pairs remaining in the data to be shifted out satisfies KV_left > learning_step * learning_rate^n, go to step (S4); if error_2 > threshold and KV_left = learning_step * learning_rate^n, update the iteration round number as n = n + 1 and go to step (S4); if error_2 ≤ threshold, go to step (S6);
(S6) store the finally obtained linear regression model; obtain the position, in the training data set record[N], of the last datum in the data set S, and take the next position as the starting position for data extraction; empty the data set S; if record[N] still contains key-value pairs that have not participated in model training, go to step (S2); otherwise, end;
wherein threshold is a preset error threshold, learning_step is a preset learning step length, and learning_rate is a preset learning rate.
The method completes model training based on a greedy idea: while the model error does not exceed the threshold, the data interval corresponding to the model is expanded with a larger step until the error just exceeds the threshold; once the threshold is exceeded, the model backs off with a smaller step that shrinks round by round, gradually narrowing the data interval. In this way, while guaranteeing index precision, one model covers as much data following the same distribution as possible, which reduces the number of models used and makes the models independent of one another.
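A minimal sketch of the greedy training loop of steps (S1)-(S6), under stated assumptions: the fit is ordinary least squares (the patent does not prescribe a fitting method), the back-off simply refits on a shrinking prefix, and the threshold, learning_step, and learning_rate values are illustrative:

```python
def fit_linear(points):
    """Ordinary least-squares fit position = a*key + b; returns (a, b, max_err)."""
    n = len(points)
    mx = sum(k for k, _ in points) / n
    my = sum(p for _, p in points) / n
    den = sum((k - mx) ** 2 for k, _ in points)
    a = sum((k - mx) * (p - my) for k, p in points) / den if den else 0.0
    b = my - a * mx
    max_err = max(abs(a * k + b - p) for k, p in points)
    return a, b, max_err

def greedy_train(record, threshold=2.0, learning_step=4, learning_rate=0.5):
    """Greedily split `record` (key-sorted list of (key, position) pairs) into
    non-overlapping linear models stored as (max key, (a, b)) entries."""
    models, start = [], 0
    while start < len(record):
        end, a, b = start, 0.0, 0.0
        while True:
            # (S2)/(S3): extend the interval by learning_step pairs and refit
            new_end = min(end + learning_step, len(record))
            a1, b1, err = fit_linear(record[start:new_end])
            if err <= threshold:
                a, b, end = a1, b1, new_end
                if end == len(record):
                    break
                continue
            # (S4)/(S5): error too large -> back off with shrinking steps
            n = 1
            while True:
                step = max(1, int(learning_step * learning_rate ** n))
                new_end = max(start + 1, new_end - step)
                a, b, err = fit_linear(record[start:new_end])
                if err <= threshold or new_end == start + 1:
                    end = new_end
                    break
                n += 1
            break
        # (S6): store the model keyed by the largest datum it covers
        models.append((record[end - 1][0], (a, b)))
        start = end
    return models
```

On perfectly linear data one model covers everything; a kink in the key-position function forces the loop to close the current interval and start a new model.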
Further, in the extensible learning index method provided by the first aspect of the present invention, storing each linear regression model in a form of < key, model >, includes:
storing by adopting a two-layer structure;
in the two-layer structure, the second layer is constructed as follows: sort the <key, model> pairs of all linear regression models in increasing order of key; every M <key, model> pairs then form a page; for the last page, if the number of models is less than M, it is padded with the maximum value; all pages form the second layer;
in the two-layer structure, the first layer is constructed in the following manner: copying the maximum data in each page, and forming a first layer according to the sequence from small to large;
wherein M is less than or equal to 64.
The invention stores the linear regression model corresponding to the key, model by using a two-layer structure, which is beneficial to accelerating the searching speed of the linear regression model, thereby improving the indexing efficiency of data.
Further preferably, M = 64; when the number of models forming one page is no more than 64, the two-layer structure can effectively accelerate the search for a linear regression model, and setting M to exactly 64 further prevents the first layer of the two-layer structure from occupying too much storage space.
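The two-layer storage described above might be sketched as follows; M = 4 is used here only for readability (the patent prefers M = 64), the padding value and binary page search are simplifications, and the in-page search stands in for the SIMD-optimized interpolation search:

```python
import bisect

M = 4                    # page size; small here for readability, the patent prefers 64
PAD = float("inf")       # padding key for the last, partially filled page

def build_two_layer(entries):
    """entries: list of (key, model) sorted by key ascending.
    Returns (first_layer, pages): second-layer pages of M entries each
    (last page padded), first layer holding each page's largest real key."""
    pages, first_layer = [], []
    for i in range(0, len(entries), M):
        page = entries[i:i + M]
        first_layer.append(page[-1][0])                 # copy the page's maximum key
        page = page + [(PAD, None)] * (M - len(page))   # maximum-value padding
        pages.append(page)
    return first_layer, pages

def lookup(first_layer, pages, key):
    """Find the model responsible for `key` (assumed covered by some page):
    pick the page via the first layer, then search inside the page."""
    p = bisect.bisect_left(first_layer, key)
    page = pages[p]
    j = bisect.bisect_left([k for k, _ in page], key)
    return page[j][1]
```

The first layer stays tiny (one key per page), so it fits in cache and narrows every search to a single fixed-size page.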
Further, in the extensible learning index method provided by the first aspect of the present invention, when the linear regression model corresponding to a key-value pair is searched for, interpolation search within each page is optimized using SIMD.
On the basis of the two-layer structure, the invention uses SIMD (single instruction, multiple data) to optimize interpolation search within each page, further accelerating the search for a linear regression model and improving data indexing efficiency.
Further, in the extensible learning index method provided by the first aspect of the present invention, when child buckets are searched, interpolation search over all child buckets is optimized using SIMD.
Further, the extensible learning index method provided by the first aspect of the present invention further includes, for any linear regression model model_i, retraining it according to the following steps:
reorder the data within the interval covered by the linear regression model model_i to obtain a data set record_i;
retrain one or more linear regression models with the data set record_i to replace the original linear regression model model_i, and store the new linear regression model(s) in the form <key, model>.
Through the adaptive-learning training mode, the invention determines the data interval corresponding to each linear regression model; when a model is retrained, only the data within its data interval are reordered and the linear regression model is then retrained, so retraining can be completed quickly. For the other linear regression models that need no retraining, only their offsets are modified according to how the data interval of the retrained model changed.
According to a second aspect of the present invention, there is provided a system comprising a computer readable storage medium for storing an executable program and a processor;
the processor is used for reading an executable program stored in the computer readable storage medium and executing the expandable learning index method provided by the first aspect of the invention.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
(1) Each of the plurality of linear regression models adaptively learns the data distribution within one data interval. Linear regression models have a simple structure and few parameters, so even with many models the training speed remains fast. Because the models are mutually independent, changing one model does not require retraining the others, so a particular part of the models and data can be conveniently added or removed, which effectively improves the scalability of the learning index.
(2) The invention uses the hierarchical bucket structure to process newly inserted data, so that insertions do not disturb the data distribution of the original training data set and the linear regression models trained on it need not be retrained; this guarantees data scalability while avoiding the drop in access efficiency caused by accessing a buffer.
(3) When newly inserted data are processed, even in the worst case only part of the parent-bucket data and a small amount of child-bucket data need to be migrated, which effectively reduces time overhead; moreover, when the preceding child bucket is full, new data can still be inserted into the following child bucket, which keeps the child buckets well utilized and effectively reduces space overhead.
(4) The method completes model training based on a greedy idea, so that, while guaranteeing index precision, one model covers as much data following the same distribution as possible, which helps reduce the number of models used and makes the models independent of one another.
(5) The invention uses SIMD to optimize interpolation search over each page and each child bucket, further accelerating the search for linear regression models and child buckets and improving data indexing efficiency.
Drawings
Fig. 1 is a schematic diagram of a range query index model and a KPF (Key Position Function) model according to an embodiment of the present invention, where (a) shows the range query index model and (b) shows the KPF model;
FIG. 2 is a schematic diagram of an extensible learning index method according to an embodiment of the present invention;
fig. 3 is a schematic diagram of guaranteeing data integrity according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
In the present application, the terms "first," "second," and the like (if any) in the description and the drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The conventional tree index model shown in fig. 1 (a) can be regarded as a prediction model: the data storage location is predicted from the input datum, and a leaf node containing the datum being searched for is finally output. Similarly, this process can be viewed as a machine-learned regression model, and existing learned indexes have proven that it is feasible to supplement existing index structures with machine learning models. However, it is difficult for a machine learning model to predict the storage locations of all data exactly; that would require a complex and highly accurate model, which often consumes a large amount of storage space and is hard to train. In practice, though, a conventional tree index does not predict the exact positions of all data either: it outputs a leaf node containing the searched datum, and a leaf node covers a range. This observation greatly reduces the difficulty of the learning index model, i.e., the model likewise only needs to predict a range [pred + min_err, pred + max_err] and guarantee that this range contains the queried datum. The invention specifically sets an error threshold threshold > 0, whose value can be determined according to the actual application requirements; accordingly, in the prediction range of the learning index, min_err = -threshold and max_err = threshold.
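The bounded prediction range can be illustrated with a sketch (names are hypothetical): given a model's predicted position pred and the error threshold, the queried key is guaranteed to lie in [pred - threshold, pred + threshold], so the final lookup only needs to search that window of the sorted array:

```python
import bisect

def lookup_in_range(sorted_keys, key, pred, threshold):
    """Search for `key` inside the window [pred - threshold, pred + threshold]
    of the sorted array, which the trained model guarantees contains the key
    if it is present; returns the index of `key`, or -1."""
    lo = max(0, int(pred) - threshold)
    hi = min(len(sorted_keys), int(pred) + threshold + 1)
    i = lo + bisect.bisect_left(sorted_keys[lo:hi], key)
    if i < hi and sorted_keys[i] == key:
        return i
    return -1
```

The cost of the last-mile search is O(log threshold) regardless of the data set size, which is why a model only needs to be accurate to within the threshold rather than exact.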
in fig. 1 (b), the learning index regards the data distribution as a KPF model, arranges keys in descending order, takes the keys as input and the positions as output, and the final data distribution is a KPF model. Therefore, the distribution rule of the data can be well mastered by learning the KPF model by using the machine learning model, so that all data can be more quickly indexed by using the learned model.
It is easy to understand that, because the present invention processes the data (i.e., key-value pairs) in a key-value storage system, the key of a key-value pair uniquely identifies that pair, and wherever size comparison or ordering of data is involved, the relation between the keys represents the relation between the whole key-value pairs. For example, for any datum A and any datum B, A > B specifically means that the key of A is larger than the key of B, while A ≤ B specifically means that the key of A is smaller than or equal to the key of B. Likewise, saying that the data in a data set are ordered specifically means that they are ordered from small to large by key.
In order to improve the expandability of the learning index, the expandable learning index method provided by the invention comprises the following steps:
sampling key value pairs in a key value storage system to obtain an ordered training data set record [ N ];
training by using a training data set record [ N ] to obtain a plurality of linear regression models, wherein each linear regression model is respectively used for indexing data in a data interval, and data areas covered by the linear regression models are not overlapped with each other;
storing each linear regression model in the form <key, model>, where key is the largest datum (i.e., the datum with the largest key) in the data interval covered by the linear regression model, and model comprises the model parameters of the linear regression model; the linear regression model can be expressed as y = ax + b, where, in the present invention, x is the datum to be indexed and y is the position of x predicted by the model, and the model parameters specifically comprise the weight a and the offset b.
In an alternative embodiment, each linear regression model is stored in the form of < key, model >, including:
storing by adopting a two-layer structure;
in the two-layer structure, the second layer is constructed as follows: sort the <key, model> pairs of all linear regression models in increasing order of key; every M <key, model> pairs then form a page; for the last page, if the number of models is less than M, it is padded with the maximum value; all pages form the second layer; when the maximum-value padding is performed, the padded data can be the key of the largest datum in the page, or a preset maximum value;
in the two-layer structure, the first layer is constructed in the following manner: copying the maximum data in each page, and forming a first layer according to the sequence from small to large;
wherein M is less than or equal to 64, so as to ensure that the search for a linear regression model can be effectively accelerated; as a preferred implementation, this embodiment specifically sets M = 64, which prevents the first layer of the two-layer structure from occupying too much storage space;
the two-layer structure for storing the linear regression model established by the present invention is shown in fig. 2.
In order to reduce the time and space overhead of the learning index, the present embodiment processes newly inserted data using a hierarchical bucket structure;
as shown in fig. 2, each hierarchical bucket structure corresponds to one datum that participated in training; for example, in fig. 1, the data interval corresponding to the first linear regression model is (0, 99), and in the training data set this interval is further divided into the four subintervals (1, 4], (4, 65], (65, 90], and (90, 99), each of which corresponds to one hierarchical bucket structure;
each hierarchical bucket structure comprises a parent bucket, and the data in the parent bucket are ordered; each parent-bucket datum corresponds to a child bucket through a pointer to that child bucket; the data in each child bucket are ordered and are all smaller than the corresponding parent-bucket datum;
based on the above hierarchical bucket structure, in the present embodiment, the method further includes processing the newly inserted data D according to the following stepsinsert
(T1) based on the newly inserted data DinsertDetermining a corresponding Level-bin of a hierarchical bucket structure in the data interval to which the data belongs;
(T2) query parent bucket bin of Level-bin of hierarchical bucket structureFTo determine the sub-bucket bin to be queriedSThe corresponding parent bucket data is fS
(T3) if the barrel binSThe previous sub-barrel binS-1If the current time does not exist or is full, the step (T5) is carried out; otherwise, data D is obtainedinsertAnd barrel binSOf the smallest data, the larger data D1And smaller data D2Data D of1Sequentially insert into sub-barrel binSThen, go to step (T4);
(T4) because of sub-barrel binSThe data in is greater than the sub-barrel binS-1Inner data and sub-bucket binS-1Corresponding parent bucket data fS-1So data D2 is larger than sub-bucket binS-1Inner data and corresponding father bucket data fS-1(ii) a Will father bucket data fS-1Sequentially insert into sub-barrel binS-1And data D2Insert into parent barrel binFMiddle as sub-barrel binS-1Corresponding parent bucket data, and then proceeds to step (T8);
(T5) HuobinSIf the current is full, the step (T6) is carried out; otherwise, the newly inserted data is inserted into the sub-barrel bin in sequenceSThen, go to step (T8);
(T6) putting the parent bucket data fSThe subsequent parent bucket data and corresponding pointers to the child buckets are all moved backward by one position, and a new child bucket bin is creatednewWill father bucket data fSInsert into parent barrel binFEmpty position of center as a childBarrel binnewCorresponding parent bucket data, and combining child bucket binsSA larger part of data is inserted into the sub-barrel binnewPerforming the following steps;
(T7) sub-barrel binSInserting the largest data in the inner residual data into the parent bucket binFMiddle as sub-barrel binSCorresponding father bucket data, and the step (T2) is carried out, after the jump, the new data D is determined again according to the current structure of Level-bin of the hierarchical bucket structureinsertThe insertion position of (a);
(T8) the end of the insertion;
in a preferred embodiment, to avoid creating sub-buckets and migrating data frequently, in the step (T6) described above, the sub-bucket bin is followedSMiddle migration to sub-barrel binnewThe data in (1) is sub-bucket binSHalf of the total data.
To improve data indexing efficiency, in this embodiment, when searching for the linear regression model corresponding to a key-value pair, the interpolation search within each page is optimized using SIMD; likewise, when searching the sub-buckets, the interpolation search within all sub-buckets is optimized using SIMD.
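The SIMD detail above is hardware-specific, but the underlying interpolation search over a sorted page or sub-bucket can be shown in scalar form. The sketch below is an assumption about the search shape, not the patent's code; a vectorized version would probe several candidate slots per iteration instead of one.

```python
def interpolation_search(arr, target):
    """Interpolation search over a sorted array: estimate the target's
    slot by linearly interpolating between the end keys, then narrow.
    Returns the index of target, or -1 if absent."""
    lo, hi = 0, len(arr) - 1
    while lo <= hi and arr[lo] <= target <= arr[hi]:
        if arr[hi] == arr[lo]:           # flat range: target found at lo
            break
        # position estimate from the linear interpolation of the keys
        pos = lo + (target - arr[lo]) * (hi - lo) // (arr[hi] - arr[lo])
        if arr[pos] == target:
            return pos
        if arr[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return lo if lo < len(arr) and arr[lo] == target else -1
```

On near-uniform key distributions (which each linear model's page approximates by construction), this converges in far fewer probes than binary search.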
FIG. 3 illustrates how the present invention guarantees data integrity. As shown in the global view on the left of FIG. 3, all models of the extensible learning index together cover the entire data range, and every model is stored in the form <key, model>, where key is the largest data in the interval covered by the model and model holds the model parameters, i.e. weight a and offset b. In the enlarged partial view on the right of FIG. 3, the solid black line is the learned linear regression model, the black points are the data covered by the model, and the gray shaded band is the prediction range given by the model; every point can be found through the model. For example, the error x_a of point a satisfies x_a ≤ max_err, so the prediction interval given by the model, [pred + min_err, pred + max_err], always contains point a.
However, if new data smaller than a is inserted directly, a conventional learned index must shift every datum larger than the newly inserted data backward by one position to make room and keep the data sorted. After a moves to a', its error x_a' exceeds max_err, so the prediction interval given by the regression model no longer contains the datum and part of the data is lost. In the present invention, newly inserted data is handled by the hierarchical buckets, and this structure never moves existing data that participated in training, thereby guaranteeing that the data can always be found.
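The error-bounded lookup described above can be captured in a few lines. This is a minimal sketch under assumed names (the function, the integer error bounds, and the fallback binary search are illustrative): the linear model predicts a position, and correctness only requires searching the slice guaranteed by the training-time error bounds.

```python
import bisect

def model_lookup(keys, key, a, b, min_err, max_err):
    """Find `key` in the sorted array `keys` using a linear model
    pred = a*key + b whose training-time residuals all lie in
    [min_err, max_err]; only that window needs to be searched."""
    pred = int(a * key + b)
    lo = max(0, pred + min_err)
    hi = min(len(keys) - 1, pred + max_err)
    # narrow search restricted to the guaranteed interval
    i = bisect.bisect_left(keys, key, lo, hi + 1)
    return i if i <= hi and keys[i] == key else -1
```

The guarantee holds only while trained data stays at its trained position — exactly what shifting data for an insert (a moving to a') would break, and what the hierarchical buckets avoid.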
The present invention is further explained below using the specific insertion of data 7, data 23 and data 24 in FIG. 2 as an example.
A data insertion can be divided into three stages, namely Stage 1, Stage 2 and Stage 3 in FIG. 2:
Stage 1: using the stored <key, model> entries, find which model covers the new data — first locate the page the model belongs to by binary search, then search within the page using SIMD;
Stage 2: compute with the found model to predict the range into which the new data should be inserted, i.e. compute [pred + min_err, pred + max_err], and find the position within this range that is closest to, and not greater than, the new data;
Stage 3: if the model's prediction range covers the new data, the data is inserted there; otherwise the data is inserted into the hierarchical bucket of the invention.
When inserting data 7, i.e. "Insert 7" process in fig. 2, it is first determined that data 7 is represented by a first model f1(x) The data interval covered by the model is specifically (0, 99)](ii) a In the training data set record [ N ]]In (5), data 7 is specifically located in the data interval (4, 65)]Obtaining a hierarchical bucket corresponding to the data interval, inquiring data of a parent bucket to obtain that new data 7 is to be inserted into the 2 nd sub-bucket, wherein the state of the hierarchical bucket is as shown in the state I in figure 2, and according to the principle of preferentially inserting the previous bucket of the hierarchical bucket, because the previous bucket, namely the 1 st bucket, has a position, the new data 7 is to be inserted into the first bucket;
When inserting data 23, i.e. the "Insert 23" process in FIG. 2, Stages 1 and 2 determine that the hierarchical bucket corresponding to the data interval (4, 65] should be queried. Querying the parent-bucket data shows that the new data 23 should go into the 2nd sub-bucket; the state of the hierarchical bucket is state II in FIG. 2. Since the preceding bucket has an empty position, insertion into the 1st sub-bucket is preferred; by comparison, the smallest data involved, data 20, is smaller than data 23, so data 20 is placed in the 1st sub-bucket and data 23 is inserted into the 2nd sub-bucket.
When inserting data 24, i.e. the "Insert 24" process in FIG. 2, Stages 1 and 2 again determine that the hierarchical bucket corresponding to the data interval (4, 65] should be queried. Querying the parent-bucket data shows that the new data 24 should go into the 2nd sub-bucket; the state of the hierarchical bucket is state III in FIG. 2. Since both the 1st and the 2nd sub-buckets are full, the parent-bucket data corresponding to the 3rd and subsequent sub-buckets is shifted backward by one position, a new sub-bucket is created to take half of the data of the 2nd sub-bucket, and the new data is then inserted into the 2nd sub-bucket, which now has room. After data 24 is inserted, the state of the hierarchical bucket is state IV in FIG. 2.
To ensure independence between the linear regression models, in this embodiment a plurality of linear regression models are obtained by training with the training data set record[N], specifically:
(S1) initialize an empty data set S;
(S2) sequentially extract learning_step key-value pairs from the training data set record[N] and add them to the data set S;
(S3) train a linear regression model with the data set S and obtain its maximum error error1; if error1 < threshold and record[N] still contains key-value pairs that have not participated in model training, go to step (S2); if error1 < threshold and no such key-value pairs remain, go to step (S6); if error1 > threshold, mark the last added learning_step key-value pairs as the data to be shifted out, initialize the iteration round number n = 1, and go to step (S4); if error1 = threshold, go to step (S6);
(S4) remove learning_step * learning_rate^n key-value pairs from the data set S, in order from back to front, then retrain a linear regression model with S and obtain its maximum error error2;
(S5) if error2 > threshold and the number KV_left of key-value pairs remaining in the data to be shifted out satisfies KV_left > learning_step * learning_rate^n, go to step (S4); if error2 > threshold and KV_left = learning_step * learning_rate^n, update the iteration round number as n = n + 1 and then go to step (S4); if error2 ≤ threshold, go to step (S6);
(S6) store the resulting linear regression model, find the position in record[N] of the last data in S, take the next position as the starting position for data extraction, and empty the data set S; if record[N] still contains key-value pairs that have not participated in model training, go to step (S2); otherwise, end;
wherein threshold is a preset error threshold, learning_step is a preset learning step size, and learning_rate is a preset learning rate.
According to the above procedure, model training follows a greedy strategy: while the model error does not exceed the threshold, the data interval covered by the model is expanded with a large step size until the error first exceeds the threshold; once it does, the interval is backed off with a smaller step size that shrinks round by round, gradually narrowing the data interval. In this way, while index precision is guaranteed, a single model covers as much data obeying the same distribution as possible, which helps achieve independence between the models.
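The greedy grow-then-shrink loop of steps (S1)-(S6) can be sketched as below. This is a simplified reconstruction: the helper `fit` and all parameter defaults are assumptions, positions serve as regression targets, and the geometric back-off is applied only when the very first growth step overshoots (a faithful implementation would also trim the last overshooting step via the learning_rate^n schedule).

```python
def fit(keys, lo, hi):
    """Least-squares line mapping key -> position over keys[lo:hi];
    returns (weight a, offset b, max absolute error)."""
    xs, ys = keys[lo:hi], list(range(lo, hi))
    n = hi - lo
    if n == 1:
        a, b = 0.0, float(lo)
    else:
        mx, my = sum(xs) / n, sum(ys) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        a = sxy / sxx if sxx else 0.0
        b = my - a * mx
    err = max(abs(a * x + b - y) for x, y in zip(xs, ys))
    return a, b, err

def train_models(keys, threshold=1.0, learning_step=8, learning_rate=0.5):
    """Greedy segmentation sketch of (S1)-(S6): grow each model's
    interval by learning_step keys while max error <= threshold; on an
    immediate overshoot, back off in geometrically shrinking steps."""
    models, start = [], 0
    while start < len(keys):
        end = start
        while end < len(keys):                        # (S2)/(S3) grow
            trial = min(end + learning_step, len(keys))
            if fit(keys, start, trial)[2] <= threshold:
                end = trial
            else:
                break
        if end == start:                              # (S4)/(S5) shrink
            end = min(start + learning_step, len(keys))
            step = max(1, int(learning_step * learning_rate))
            while end - start > 1 and fit(keys, start, end)[2] > threshold:
                end = max(start + 1, end - step)
                step = max(1, int(step * learning_rate))
        a, b, _ = fit(keys, start, end)
        models.append((keys[end - 1], a, b))          # stored as <key, model>
        start = end                                   # (S6) next interval
    return models
```

On uniformly spaced keys a single model absorbs everything; a distribution change inside the key stream forces a segment boundary, so each stored model covers one roughly linear run of data.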
The extensible learning index method above may further include training any linear regression model_i according to the following steps:
reorder the data within the interval covered by the linear regression model_i to obtain a data set record_i;
retrain one or more linear regression models with the data set record_i to replace the original model_i, and store the new linear regression model(s) in the form <key, model>.
In this embodiment, thanks to the adaptive model training, the data interval corresponding to each linear regression model is already determined. During retraining, only the data within the corresponding interval needs to be reordered before the linear regression model is retrained, so retraining completes quickly; for the other linear regression models that do not need retraining, only their offsets are adjusted according to how the data interval of the retrained model changed.
The invention also provides a system comprising a computer-readable storage medium and a processor. The computer-readable storage medium stores an executable program; the processor reads the executable program stored in the computer-readable storage medium and executes the extensible learning index method described above.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. An extensible learning index method, comprising:
sampling key-value pairs in a key-value storage system to obtain an ordered training data set record[N];
training with the training data set record[N] to obtain a plurality of linear regression models, wherein each linear regression model is used to index the data in one data interval, and the data intervals covered by the linear regression models do not overlap one another;
and storing each linear regression model in the form <key, model>, wherein key is the largest data in the data interval covered by the linear regression model, and model is the model parameters of the linear regression model.
2. The extensible learning index method of claim 1, further comprising processing newly inserted data using a hierarchical bucket structure;
wherein each hierarchical bucket structure corresponds to one datum participating in training; each hierarchical bucket structure comprises a parent bucket whose data is ordered; each parent-bucket datum points to a child bucket through a pointer, the data in the child bucket is ordered, and the data in the child bucket is smaller than the corresponding parent-bucket datum.
3. The extensible learning index method of claim 2, further comprising processing newly inserted data D_insert according to the following steps:
(T1) determining, based on the data interval to which the newly inserted data D_insert belongs, the corresponding hierarchical bucket structure Level-bin;
(T2) querying the parent bucket bin_F of the hierarchical bucket structure Level-bin to determine the sub-bucket bin_S to be queried, whose corresponding parent-bucket data is f_S;
(T3) if the previous sub-bucket bin_{S-1} of sub-bucket bin_S does not exist or is already full, going to step (T5); otherwise, taking data D_insert and the smallest data in sub-bucket bin_S, denoting the larger as D1 and the smaller as D2, inserting data D1 into sub-bucket bin_S in order, and going to step (T4);
(T4) inserting the parent-bucket data f_{S-1} corresponding to sub-bucket bin_{S-1} into sub-bucket bin_{S-1} in order, inserting data D2 into the parent bucket bin_F as the parent-bucket data corresponding to bin_{S-1}, and going to step (T8);
(T5) if sub-bucket bin_S is already full, going to step (T6); otherwise, inserting the newly inserted data into sub-bucket bin_S in order and going to step (T8);
(T6) moving the parent-bucket data after f_S, together with the corresponding pointers to the sub-buckets, backward by one position, creating a new sub-bucket bin_new, inserting f_S into the vacated position of the parent bucket bin_F as the parent-bucket data corresponding to bin_new, and moving the larger part of the data in sub-bucket bin_S into bin_new;
(T7) inserting the largest of the data remaining in sub-bucket bin_S into the parent bucket bin_F as the parent-bucket data corresponding to bin_S, and going to step (T2);
(T8) ending the insertion.
4. The extensible learning index method of claim 3, wherein in step (T6), the data inserted into sub-bucket bin_new is half of the total data in sub-bucket bin_S.
5. The extensible learning index method of any of claims 1-4, wherein training with the training data set record[N] to obtain a plurality of linear regression models comprises:
(S1) initializing an empty data set S;
(S2) sequentially extracting learning_step key-value pairs from the training data set record[N] and adding them to the data set S;
(S3) training a linear regression model with the data set S and obtaining its maximum error error1; if error1 < threshold and record[N] still contains key-value pairs that have not participated in model training, going to step (S2); if error1 < threshold and no such key-value pairs remain, going to step (S6); if error1 > threshold, marking the last added learning_step key-value pairs as the data to be shifted out, initializing the iteration round number n = 1, and going to step (S4); if error1 = threshold, going to step (S6);
(S4) removing learning_step * learning_rate^n key-value pairs from the data set S, in order from back to front, then retraining a linear regression model with S and obtaining its maximum error error2;
(S5) if error2 > threshold and the number KV_left of key-value pairs remaining in the data to be shifted out satisfies KV_left > learning_step * learning_rate^n, going to step (S4); if error2 > threshold and KV_left = learning_step * learning_rate^n, updating the iteration round number as n = n + 1 and then going to step (S4); if error2 ≤ threshold, going to step (S6);
(S6) storing the resulting linear regression model, finding the position in record[N] of the last data in S, taking the next position as the starting position for data extraction, and emptying the data set S; if record[N] still contains key-value pairs that have not participated in model training, going to step (S2); otherwise, ending;
wherein threshold is a preset error threshold, learning_step is a preset learning step size, and learning_rate is a preset learning rate.
6. The extensible learning index method of any of claims 1-4, wherein storing each linear regression model in the form <key, model> comprises:
storing with a two-layer structure;
in the two-layer structure, the second layer is constructed as follows: all <key, model> pairs corresponding to the linear regression models are sorted by key in ascending order, and every M <key, model> pairs form a page; if the last page holds fewer than M models, it is padded with the maximum value; all pages form the second layer;
in the two-layer structure, the first layer is constructed as follows: the largest data of each page is copied, and the copies are arranged in ascending order to form the first layer;
wherein M is less than or equal to 64.
7. The extensible learning index method of any of claims 1-4, wherein, when searching for the linear regression model corresponding to a key-value pair, the interpolation search within each page is optimized using SIMD.
8. The extensible learning index method of any of claims 1-4, wherein, when searching the sub-buckets, the interpolation search within all sub-buckets is optimized using SIMD.
9. The extensible learning index method of any of claims 1-4, further comprising training any linear regression model_i according to the following steps:
reordering the data within the interval covered by the linear regression model_i to obtain a data set record_i;
retraining one or more linear regression models with the data set record_i to replace the original model_i, and storing the new linear regression model(s) in the form <key, model>.
10. A system comprising a computer-readable storage medium and a processor, wherein,
the computer readable storage medium is used for storing an executable program;
the processor is configured to read an executable program stored in the computer-readable storage medium and execute the extensible learning index method of any one of claims 1-9.
CN201911328057.3A 2019-12-20 2019-12-20 Extensible learning index method and system Active CN111126625B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911328057.3A CN111126625B (en) 2019-12-20 2019-12-20 Extensible learning index method and system


Publications (2)

Publication Number Publication Date
CN111126625A true CN111126625A (en) 2020-05-08
CN111126625B CN111126625B (en) 2022-05-20

Family

ID=70500741

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911328057.3A Active CN111126625B (en) 2019-12-20 2019-12-20 Extensible learning index method and system

Country Status (1)

Country Link
CN (1) CN111126625B (en)


Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190303753A1 (en) * 2018-03-28 2019-10-03 Ca, Inc. Insertion tolerant learned index structure through associated caches
CN110442684A (en) * 2019-08-14 2019-11-12 山东大学 A kind of class case recommended method based on content of text
CN110460529A (en) * 2019-06-28 2019-11-15 天津大学 Content router FIB storage organization and its data processing method


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
PENGFEI LI ET AL.: "A Scalable Learned Index Scheme in Storage Systems", 《COMPUTER SCIENCE》 *
孟小峰 (MENG Xiaofeng): "机器学习化数据库***研究综述" [Survey of research on machine-learning-enhanced database ***], 《计算机研究与发展》 (Journal of Computer Research and Development) *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113377883A (en) * 2020-06-15 2021-09-10 浙江大学 Multidimensional data query method based on learning index model
CN112364093A (en) * 2020-11-11 2021-02-12 天津大学 Learning type big data visualization method and system
CN112364093B (en) * 2020-11-11 2023-04-04 天津大学 Learning type big data visualization method and system
CN113268457A (en) * 2021-05-24 2021-08-17 华中科技大学 Self-adaptive learning index method and system supporting efficient writing
CN113268457B (en) * 2021-05-24 2022-07-08 华中科技大学 Self-adaptive learning index method and system supporting efficient writing
CN113722319A (en) * 2021-08-05 2021-11-30 平凯星辰(北京)科技有限公司 Data storage method based on learning index
CN113742350A (en) * 2021-09-09 2021-12-03 北京中安智能信息科技有限公司 Spatio-temporal index construction method and device based on machine learning model and query method

Also Published As

Publication number Publication date
CN111126625B (en) 2022-05-20

Similar Documents

Publication Publication Date Title
CN111126625B (en) Extensible learning index method and system
CN110083601B (en) Key value storage system-oriented index tree construction method and system
US7885967B2 (en) Management of large dynamic tables
JP5425541B2 (en) Method and apparatus for partitioning and sorting data sets on a multiprocessor system
CN105975587B (en) A kind of high performance memory database index organization and access method
WO2018129500A1 (en) Optimized navigable key-value store
CN110147204B (en) Metadata disk-dropping method, device and system and computer-readable storage medium
CN106599091B (en) RDF graph structure storage and index method based on key value storage
US7054994B2 (en) Multiple-RAM CAM device and method therefor
Hadian et al. Interpolation-friendly B-trees: Bridging the Gap Between Algorithmic and Learned Indexes.
CN113779154B (en) Construction method and application of distributed learning index model
CN115718819A (en) Index construction method, data reading method and index construction device
CN108717448B (en) Key value pair storage-oriented range query filtering method and key value pair storage system
CN112817530B Method for efficient, multi-thread-safe reading and writing of ordered data
CN112000845B (en) Hyperspatial hash indexing method based on GPU acceleration
CN100525133C (en) Sorting device
KR102006283B1 (en) Dataset loading method in m-tree using fastmap
US8805891B2 (en) B-tree ordinal approximation
CN112988064B (en) Concurrent multitask-oriented disk graph processing method
CN113468178B (en) Data partition loading method and device of association table
CN114547086B (en) Data processing method, device, equipment and computer readable storage medium
CN108460453B (en) Data processing method, device and system for CTC training
CN115563116A (en) Database table scanning method, device and equipment
CN113961568A (en) Block chain-based block fast searching method for chain data structure
CN109241098B (en) Query optimization method for distributed database

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant