CN113297136A - LSM tree-oriented key-value storage method and storage system


Info

Publication number: CN113297136A
Authority: CN (China)
Prior art keywords: layer, data, task, key, file
Application number: CN202110573140.8A
Other languages: Chinese (zh)
Other versions: CN113297136B (granted publication)
Inventors: Wang Hongchao (王宏超), Ye Baoliu (叶保留), Tang Bin (唐斌), Lu Sanglu (陆桑璐)
Current and original assignee: Nanjing University
Application filed by Nanjing University
Priority to CN202110573140.8A
Priority to PCT/CN2021/103902 (WO2022246953A1)
Publication of CN113297136A
Application granted; publication of CN113297136B
Legal status: Granted, active

Classifications

    • G06F 16/13 - Physics; Computing; Electric digital data processing; Information retrieval, file systems; File access structures, e.g. distributed indices
    • G06F 16/172 - Physics; Computing; Electric digital data processing; Information retrieval, file systems; Details of further file system functions; Caching, prefetching or hoarding of files
    • Y02D 10/00 - Climate change mitigation technologies in information and communication technologies; Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides an LSM-tree-oriented key-value storage method and storage system. The method comprises the following steps: dividing each disk level at fine granularity and setting the compaction policy as follows: in a compaction task, all sub-levels of the upper level participate in the task while only one sub-level of the lower level participates, so as to reduce the ratio of the lower-level data to the total data taking part in the task; when a compaction task is executed, the task is split, which reduces the number of files participating in the compaction and increases the parallelism of the compaction. The invention further reduces the impact on read performance through a parallel read algorithm, and provides a method for selecting the parameters that minimize write amplification by modeling the write amplification of the LSM tree.

Description

LSM tree-oriented key value storage method and storage system
Technical Field
The invention relates to computer storage technology, and in particular to an LSM-tree-oriented key-value storage method and storage system.
Background
A key-value store keeps data as a set of <key, value> pairs, where the key is the unique identifier of the value. It does not support complex relational schemas like a relational database; instead, data is accessed through simple interfaces such as Put(k, v), Get(k), Update(k, v) and Delete(k). Owing to its high performance and high scalability, key-value storage plays an important role in today's web applications and distributed systems, and is widely used in graph databases, task queues, stream processing engines, application data caching, event tracking systems and other fields.
The LSM tree (Log-Structured Merge tree) is a storage engine widely used in key-value storage systems. When a user writes a key-value pair, the data is first written into an in-memory buffer, where it is kept sorted. When the buffer exceeds a preset size, its contents are written to disk in one batch. This effectively converts a large number of random writes into a small number of sequential writes. Because the sequential write performance of a hard disk is far higher than its random write performance, the write speed of the LSM tree is very high, making it suitable for write-heavy workloads. To avoid losing in-memory data when the system crashes, before data is written into the buffer it is appended to a WAL (Write-Ahead Log) located on disk. Since this is an append-only operation, it does not noticeably affect the write performance of the system.
Data on disk is stored in multiple levels (L1, L2, ..., Ln), where Ln denotes the lowest level and Li denotes the i-th level (1 ≤ i ≤ n). The data of each level is kept sorted by key and spread across multiple SSTables (Sorted String Tables), each of which stores the data of a certain key range in sorted order. Between two adjacent levels, the ratio of the amount of data the lower level can hold to the amount the upper level can hold is called the growth factor T, typically 10. For example, if the first level can store at most 10 MB of data, the second level can store at most 100 MB, and so on; with only 7 levels the tree can hold more than 10 TB of data in total. To keep the hierarchy stable and prevent any level from holding too much data, a background compaction process reorganizes the data on disk and writes part of one level's data into the next level.
Specifically, when the amount of data in a certain level exceeds the maximum it can hold, the compaction process selects one SSTable file of that level, then selects all SSTable files in the next level whose key ranges overlap with it, merge-sorts these files, writes the newly generated files into the next level, and deletes the old selected files.
Take L1 as an example and assume that the key range of the SSTable file selected in this level is [2, 8]. Then, if an SSTable file in L2 has a key range that overlaps [2, 8], that file must also be selected as an input of the compaction task. This guarantees that after the data is written into L2, the data in L2 remains sorted. Since the amount of data each level can hold grows exponentially, writing one SSTable file of a level into the next level usually requires several files of the next level to participate, and a compaction in one level increases the data stored in the level below, which may in turn trigger a compaction in that level. Accumulated across multiple levels, this causes the data on disk to be rewritten repeatedly. The ratio of the amount of data actually written to disk to the amount of data the user requested to write is called write amplification. Taking the key-value store LevelDB, which adopts the LSM tree structure, as an example, experimental results show that when a user requests to write 50 GB of data, the write amplification approaches 20, i.e., the actual amount written to disk approaches 1 TB. Excessive write amplification severely degrades the write performance of the LSM tree. Moreover, the LSM tree often runs on machines equipped with SSDs, and frequent disk reads and writes shorten the lifetime of the SSD. In summary, write amplification is a serious problem for the LSM tree structure. On the other hand, when the data volumes of the memory buffer and of disk level L1 both exceed their thresholds, the in-memory data cannot be flushed; the system must wait for L1 to finish a compaction and free space in that level before new write requests can be served, which results in write stalls, i.e., periodic large increases in write latency.
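For concreteness, the following minimal Python sketch (not taken from the patent) illustrates how a classic leveled compaction, as described above, selects its input files; the SSTable type and the round-robin victim choice are simplifying assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SSTable:
    min_key: int
    max_key: int
    entries: List[Tuple[int, bytes]]  # (key, value) pairs sorted by key

def overlaps(a: SSTable, b: SSTable) -> bool:
    # Two key ranges overlap unless one ends before the other starts.
    return not (a.max_key < b.min_key or b.max_key < a.min_key)

def pick_compaction_inputs(upper: List[SSTable],
                           lower: List[SSTable]) -> Tuple[SSTable, List[SSTable]]:
    """Classic leveled compaction: pick one file of the upper level and every
    lower-level file whose key range overlaps it."""
    victim = upper[0]  # real systems rotate through the level's files round-robin
    return victim, [t for t in lower if overlaps(victim, t)]
```

Because every overlapping lower-level file must be rewritten, roughly T lower-level files are merged for each upper-level file, which is the per-level source of the write amplification discussed above.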
Disclosure of Invention
In view of the problems in the background art, the invention aims to provide an LSM-tree-oriented key-value storage method that reduces write amplification by lowering the ratio of lower-level to upper-level data participating in a compaction task, characterizes the write amplification of the system through modeling so as to optimize the system parameters, and reduces the impact on read performance by means of a parallel read algorithm.
Another object of the present invention is to provide a key value storage system and device using the above key value storage method.
In order to achieve the purpose, the invention adopts the following technical scheme:
according to a first aspect of the present invention, there is provided an LSM-tree-oriented key-value storage method, comprising the following steps:
dividing each level of the LSM tree into a plurality of sub-levels, where the j-th sub-level of the i-th level is denoted Li.j and the SSTable files within a sub-level are arranged from left to right in increasing key order;
maintaining a compaction pointer at each level for selecting the first input file of a compaction task;
when the total data volume of the i-th level Li exceeds its rated size, triggering a compaction at that level, which writes part of the data of Li into Li+1 so as to reorganize the disk data, wherein, when the compaction task is executed, all sub-levels of Li participate in the task while only one sub-level of Li+1 participates.
A single compaction task of level Li comprises the following steps (a code sketch is given after these steps):
according to the compaction pointer of Li, selecting in the first sub-level Li.1 the SSTable file whose minimum key is greater than or equal to the pointer and closest to it as the initial input file of the task, adding it to the input file set of the compaction task, and taking the minimum key of that file as the left boundary of the task and its maximum key as the right boundary;
for the other sub-levels Li.2, Li.3, ..., Li.Si of level Li, selecting in turn the files lying partly or wholly within the left and right boundaries and adding them to the input file set, where Si denotes the number of sub-levels into which level Li is divided;
expanding the boundaries of the current task according to the minimum and maximum keys of the files in the input file set, so that the task contains more files lying completely within the boundaries;
in Li+1, selecting the sub-level Li+1.j that currently holds the least data; according to the task boundaries, selecting from Li+1.j the files lying within or overlapping the boundaries and adding them to a candidate file set; splitting the compaction task according to the files in the candidate file set, and, after the split, adding to the input file set only the files that actually need to participate in the task;
for the input file set, merge-sorting the data of Li lying within the task boundaries together with the data of Li+1, and writing the newly generated files into Li+1.j;
for the input file set, merge-sorting the data of Li lying outside the task boundaries; among the newly generated files, writing those with data smaller than the left boundary of the task back into Li, and placing those with data larger than the right boundary into the compaction cache of Li; recording in the log the minimum and maximum keys of each cached file together with the files of the input file set that overlap the cached files, and deleting the files of the input file set that are not recorded in the log;
replacing the compaction pointer of Li with the right boundary of the compaction task.
The specific method for splitting the task comprises the following steps (a code sketch is given after these steps):
for each file in the candidate file set, obtaining from the metadata in memory the minimum key kmin and the maximum key kmax it contains;
querying the files in the input file set according to kmin and kmax: if, for every file in the input file set, [kmin, kmax] does not overlap the file, or no other key-value pair of the file lies between its largest key smaller than kmin and its smallest key larger than kmax, then the candidate file is removed from the candidate file set and, according to kmin and kmax, the files of the input file set are cut into two parts, one containing the keys smaller than kmin and the other containing the keys larger than kmax; otherwise, the candidate file is moved out of the candidate file set and added to the input file set.
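A sketch of this split test, reusing the SSTable shape from the earlier sketches; can_skip_candidate and split_input_entries are hypothetical helper names.

```python
def can_skip_candidate(candidate, inputs):
    """Split test: a lower-level candidate file may stay out of the compaction
    if, for every input file, the candidate's range either does not overlap it
    or falls into a gap of it (no input key inside [kmin, kmax]), so that the
    input file can simply be cut in two around the candidate."""
    kmin, kmax = candidate.min_key, candidate.max_key
    for f in inputs:
        if f.max_key < kmin or f.min_key > kmax:
            continue  # no overlap with this input file
        if any(kmin <= k <= kmax for k, _ in f.entries):
            return False  # an input key falls inside the candidate's range
    return True

def split_input_entries(candidate, inputs):
    """Cut the input data around [kmin, kmax] of a skipped candidate, yielding
    the 'left of kmin' and 'right of kmax' halves of the compaction."""
    kmin, kmax = candidate.min_key, candidate.max_key
    left = [(k, v) for f in inputs for k, v in f.entries if k < kmin]
    right = [(k, v) for f in inputs for k, v in f.entries if k > kmax]
    return left, right
```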
In some embodiments of the first aspect of the invention, Li is one of the levels LD+2 to Ln, where n is the number of levels of the LSM tree, D is a preset level boundary parameter, and 1 ≤ D ≤ n; the method further comprises the following steps (a code sketch follows):
for L1 to LD, adopting a tiered compaction algorithm: sorting all data of the level at once and writing the newly generated files into the next level, where they form a new sub-level; during this process no lower-level data participates in the sort;
for LD+1, sorting all data of the level together with the data of one sub-level of the level below, and writing the newly generated data into the selected sub-level of the level below.
Under this hierarchical organization, the write operation comprises the following steps (a code sketch is given after these steps):
acquiring the global version number maintained for key-value pairs, incrementing it, and encoding it into the key;
writing the data into the WAL in append-only mode;
writing the data into the memory buffer and returning.
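A minimal sketch of this write path, assuming a hypothetical KVStore class with an append-only WAL file, an in-memory buffer and a monotonically increasing global version number; the record format and the size accounting are illustrative only.

```python
import os
import struct
import threading

class KVStore:
    """Write path only: version tagging, WAL append, then the memory buffer."""
    def __init__(self, wal_path="kv.wal", buf_limit=4 * 1024 * 1024):
        self.wal = open(wal_path, "ab")
        self.buffer = {}            # key -> (version, value)
        self.buffer_bytes = 0
        self.version = 0            # global version number
        self.buf_limit = buf_limit
        self.lock = threading.Lock()

    def put(self, key: bytes, value: bytes):
        with self.lock:
            # 1. bump the global version number and attach it to this write
            self.version += 1
            ver = self.version
            # 2. append the record to the WAL before touching the buffer
            record = struct.pack("<QII", ver, len(key), len(value)) + key + value
            self.wal.write(record)
            self.wal.flush()
            os.fsync(self.wal.fileno())
            # 3. insert into the in-memory buffer and return
            self.buffer[key] = (ver, value)
            self.buffer_bytes += len(key) + len(value)
            if self.buffer_bytes >= self.buf_limit:
                self._flush_to_l1()   # write a sorted run into the first disk level

    def _flush_to_l1(self):
        pass  # flushing and compaction are outside the scope of this sketch
```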
The lookup operation comprises the following steps (a code sketch is given after these steps):
querying the memory buffer and the cache; if the data is found there, returning it, otherwise proceeding to the next step;
searching the disk levels from L1 to Ln in order; for each level Lb (1 ≤ b ≤ n), a thread pool is maintained whose number of threads is max(S1, S2, ..., Sn); for Lb, Sb read tasks are submitted to the thread pool, and thread j performs a binary search on Lb.j, 1 ≤ j ≤ Sb;
gathering the results of the Sb threads: if any thread has read the data, returning the result with the largest version number and finishing the read; if no thread has read the data, continuing with Lb+1;
if all levels have been searched and the data has still not been found, returning that the data does not exist.
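A minimal sketch of the parallel point lookup using Python's concurrent.futures, assuming each sub-level is a sorted run of (key, version, value) tuples; the thread pool is sized to the maximum number of sub-levels, as described above.

```python
import bisect
from concurrent.futures import ThreadPoolExecutor

def search_run(run, key):
    """Binary search one sub-level, stored as a run sorted by key."""
    i = bisect.bisect_left(run, (key,))   # (key,) sorts before any (key, v, val)
    return run[i] if i < len(run) and run[i][0] == key else None

def parallel_get(levels, key, pool: ThreadPoolExecutor):
    """levels: list of levels (L1 first), each a list of sub-level runs."""
    for level in levels:
        hits = [r for r in pool.map(lambda run: search_run(run, key), level) if r]
        if hits:
            # several sub-levels may hold the key; the largest version wins
            return max(hits, key=lambda rec: rec[1])[2]
    return None  # every level searched without finding the key

# The pool is sized to the maximum number of sub-levels across all levels:
# pool = ThreadPoolExecutor(max_workers=max(len(level) for level in levels))
```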
The range query operation comprises the following (a code sketch follows):
using the Seek(k) interface to find the key-value pair of the smallest key greater than or equal to k: several query tasks are submitted to the thread pool, each thread being responsible for one sub-level or the memory buffer; each thread searches for the smallest key greater than or equal to k by binary search; if no thread reads any data, the result is that the data does not exist; otherwise, for the threads that did read data, an iterator is built over the data read by each, the results are ordered by version number, and the data with the newest version is taken out and returned;
using the Next() interface to find the key-value pair of the smallest key in the system larger than the key currently found: if Seek(k) found data, then when the user submits a Next() request, the iterator that returned the previous result advances with Next(), the data currently pointed to by each iterator is compared again, and the newest data is returned; old versions are skipped during this process.
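A minimal sketch of the Seek/Next range scan over several sorted runs, merging per-run iterators through a heap and skipping stale versions; it assumes the same (key, version, value) run layout as the lookup sketch.

```python
import bisect
import heapq

class RangeScanner:
    """Merge-scan several sorted (key, version, value) runs, newest version first."""
    def __init__(self, runs):
        self.runs = runs
        self.heap = []          # entries: (key, -version, run_index, position)
        self.last_key = None

    def seek(self, k):
        self.heap, self.last_key = [], None
        for idx, run in enumerate(self.runs):
            pos = bisect.bisect_left(run, (k,))      # smallest key >= k in this run
            if pos < len(run):
                key, ver, _ = run[pos]
                heapq.heappush(self.heap, (key, -ver, idx, pos))
        return self.next()

    def next(self):
        while self.heap:
            key, _neg_ver, idx, pos = heapq.heappop(self.heap)
            # advance this run's iterator before deciding what to return
            if pos + 1 < len(self.runs[idx]):
                nkey, nver, _ = self.runs[idx][pos + 1]
                heapq.heappush(self.heap, (nkey, -nver, idx, pos + 1))
            if key == self.last_key:
                continue        # an older version of a key already returned
            self.last_key = key
            return self.runs[idx][pos]               # newest version of this key
        return None             # scan exhausted
```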
In some embodiments of the first aspect of the present invention, the method further comprises: modeling write amplification, and selecting optimal parameters by minimizing write amplification, the steps comprising:
let the number of levels of the LSM tree be n, the number of sub-levels of level b be Sb, and the growth factor of level b be Tb, with 1 ≤ b ≤ n; let D be the boundary between the levels using different compaction algorithms; the write amplification of each level is calculated as follows:
for writing the WAL, the write amplification is 1;
for flushing the memory buffer to disk, the write amplification is buf / Unique^-1(buf), where buf is the maximum number of key-value pairs the buffer can hold, Unique^-1 is the inverse function of Unique(p), Unique(p) = sum over k in K of (1 - (1 - f_X(k))^p), N is the total number of distinct keys in the workload, K is the set of integers of the key space [0, N-1], and f_X(k) is the probability that key k appears in a single write request;
when 1 ≤ b ≤ D, the write amplification of Lb is WA_b = Write_{b+1} / Interval_b, where Write_{b+1} is the amount of data written into Lb+1 during one compaction interval of Lb, Interval_b = Interval_{b-1} * S_b, Interval_0 = Unique^-1(buf), Size_1 = buf * S_1, Size_{(b+1).j} = Write_{b+1}, and Size_{b+1} = Size_{(b+1).j} * S_{b+1};
for LD+1, the write amplification is WA_{D+1} = Write_{D+2} / Interval_{D+1}, where Interval_{D+1} = Interval_D * S_{D+1}, Size_{D+2} = Size_{D+1} * T_{D+2}, and Size_{(D+2).j} = Size_{D+2} / S_{D+2};
when D+2 ≤ b < n, the write amplification of Lb is WA_b = Write_{b+1} / Interval_b, where Interval_b = Interval_{b-1} + DInterval_b and DInterval_b is obtained by solving the expectation equation over the key distance d described in the detailed description, with Size_{b+1} = Size_b * T_{b+1} and Size_{(b+1).j} = Size_{b+1} / S_{b+1};
combining the write amplification of each disk level with the write amplification of the WAL and of the memory-buffer flush gives the write amplification of the whole LSM tree;
fixing the total number of sub-levels of the LSM tree, iteratively evaluating the write amplification under different parameters, and taking the Sb, Tb and D that minimize it.
According to a second aspect of the present invention, there is provided an LSM tree-oriented key-value storage system, comprising:
a first storage unit, which stores the first D levels of an LSM tree comprising n levels and executes compaction tasks using a tiered compaction algorithm that minimizes write amplification, where D denotes a preset level boundary parameter;
a second storage unit, which stores the (D+1)-th level of the LSM tree and executes compaction tasks by a compaction method comprising the steps of:
selecting all files of all sub-levels of the level and adding them to an input file set;
in LD+2, selecting the sub-level LD+2.j that currently holds the least data, and, according to the key range covered by the data of LD+1, selecting from LD+2.j all overlapping files and adding them to the input file set;
merge-sorting the data in the input file set and placing the newly generated files into the selected sub-level of the lower level;
a third storage unit, which stores levels LD+2 to Ln of the LSM tree and executes compaction tasks by a compaction method comprising the following steps:
denoting the j-th sub-level of level Li of the LSM tree as Li.j, the SSTable files within a sub-level being arranged from left to right in increasing key order;
maintaining a compaction pointer at each level for selecting the first input file of a compaction task;
when the total data volume of the i-th level Li exceeds its rated size, triggering a compaction at that level, which writes part of the data of Li into Li+1 so as to reorganize the disk data, wherein, when the compaction task is executed, all sub-levels of Li participate in the task while only one sub-level of Li+1 participates.
The steps the third storage unit executes for one compaction task are the same as the steps of one compaction task of level Li in the LSM-tree-oriented key-value storage method according to the first aspect of the invention.
According to a third aspect of the present invention, there is provided a key-value pair storage device, the device comprising:
one or more processors;
a memory; and
one or more computer programs, stored in the memory and configured for execution by the one or more processors, which when executed by the one or more processors, cause the one or more processors to perform steps comprising a LSM tree oriented key-value storage method according to the first aspect of the invention.
The invention achieves the following beneficial effects:
1. Each level of the LSM tree is finely divided. During a compaction, several sub-levels of the upper level participate, and because each sub-level covers the same key range, the amount of data selected from each sub-level is similar, while only one sub-level of the lower level participates. When data is selected at the lower level, the compaction task is split, which keeps the number of lower-level files participating in the compaction as small as possible and thus reduces the ratio of the lower-level to the upper-level data taking part in it; in other words, to bring a given amount of data into the next level, less lower-level data has to be re-sorted, so write amplification is reduced.
2. Different compaction algorithms are used for different levels; the upper levels use a tiered compaction algorithm that minimizes write amplification, which improves the efficiency of importing data into the lower levels and reduces the occurrence of write stalls.
3. Multi-threaded parallel reading reduces the impact on read performance. The write amplification is modeled and a method for selecting the optimal parameters is provided, maximizing the write performance of the system for a fixed read performance.
Drawings
FIG. 1 is a schematic diagram of an LSM tree according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the compaction algorithm according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of compaction task splitting according to an embodiment of the present invention;
FIG. 4 is a diagram of a parallel read algorithm according to an embodiment of the present invention.
Detailed Description
The technical solution of the present invention is further explained with reference to the drawings and the embodiments.
Fig. 1 is a schematic diagram of an LSM tree according to an embodiment of the invention. As shown in the figure, there is a buffer in memory, and the WAL is an on-disk write-ahead log configured to avoid losing buffered data when the program crashes; the buffer receives the user's write requests. The data on disk is divided into three levels (L1, L2, L3). Each level is divided into three sub-levels, and each sub-level contains multiple SSTable files. The data within a sub-level is sorted, while there is no ordering relationship between the data of different sub-levels. This is equivalent to relaxing the ordering requirement of the original LSM tree: in the original LSM tree the data of each level is strictly sorted, whereas in the LSM tree of the invention the data of each level is divided into several smaller sorted groups.
Fig. 2 is a schematic diagram of the compaction algorithm according to an embodiment of the invention. Each box in the figure represents an SSTable file. For ease of description, it is assumed that a file can hold at most two key-value pairs (in practice each file holds far more than 2), and the numbers in the boxes are the keys of the key-value pairs contained in the file; the corresponding values are not shown. The prime marks on a number indicate how new the value of that key is: taking key 5 as an example, the value of 5'' is newer than the value of 5', and the value of 5' is newer than the value of 5. In a concrete implementation, the order in which key-value pairs are written can be recorded by maintaining a global version number (e.g., a 64-bit integer). Each time a new key-value pair is written, the current version number is encoded into the key-value pair and the global version number is incremented by 1. For example, if the current version number is 1 and a key-value pair is inserted, version 1 is assigned to it and it is stored as <key1, 1, value1>; the version number then becomes 2. If another key-value pair is inserted next, version 2 is assigned to it and it is stored as <key2, 2, value2>; the version number then becomes 3. Thus, when several key-value pairs with the same key are read, their relative age can be determined by comparing their version numbers.
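A small sketch of this version tagging, assuming an internal key of the form (user_key, -version) so that tuple ordering places newer entries first among equal user keys; the representation is illustrative, not the patent's on-disk format.

```python
import itertools

_global_version = itertools.count(1)   # monotonically increasing version counter

def make_internal_key(user_key: bytes):
    """Tag the user key with the current global version; negating the version
    makes newer entries sort first among entries with the same user key."""
    return (user_key, -next(_global_version))

# Example: two successive writes of b"foo" produce (b"foo", -1) and (b"foo", -2);
# sorted ascending, (b"foo", -2) - the newer write - comes first.
```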
When the amount of data in L2 exceeds the maximum it can hold, a compaction task is triggered. An input file is first selected from L2.1. Since the compaction pointer of this level is 6, the file whose minimum key is greater than or equal to 6 and closest to 6, namely SSTable(6', 12'), is selected as the initial file and added to the input file set. The minimum key 6 of this file is recorded as the left boundary of the compaction task and its maximum key 12 as the right boundary. Then, from L2.2 to L2.3, the files lying within this boundary or overlapping it are selected from each sub-level according to the left and right boundaries and added to the input file set; here a total of 4 files are selected: SSTable(5', 8), SSTable(12, 13'), SSTable(5', 7') and SSTable(10', 14'). At this point, the file selection for L2 is complete.
For L3, to keep the figure simple, the keys contained in L3.1 and L3.3 are not shown. Since L3.2 holds the least data, this sub-level is selected to participate in the compaction. Also according to the left and right boundaries, the files SSTable(5, 6), SSTable(7, 9) and SSTable(10, 11) are selected in this sub-level and added to the input file set. Selecting the sub-level with the least data keeps the data volumes of the sub-levels as close as possible after the compaction finishes, but it cannot guarantee any old-new relationship between the data of different sub-levels. To maintain the version relationship between levels (for the same key, the value in the upper level is newer than the value in the lower level), all of L2's data within a certain range must be written to the next level. Therefore, the files selected in L2 must be cut according to the compaction boundary, and the data outside the boundary must be written back to this level; otherwise data in L3 could end up newer than data in L2.
Data outside the boundary is eventually written back to the level, which increases write amplification, so the boundary is expanded according to the minimum and maximum keys of the files selected in each sub-level. The boundary is updated only if the expansion reduces the number of files that have to be cut without introducing new files; that is, the boundary is only expanded around the initially selected files, ensuring that no additional files are added. Otherwise the boundary might keep growing until all files are added to the input set, making the compaction task too large and harming system stability. As shown in the figure, the boundary is initially [6, 12] and is expanded to [5, 12], so neither SSTable(5', 8) nor SSTable(5', 7') needs to be cut.
After the files are selected, they are divided into two parts: (1) the portions of the files selected from L2 that lie within the boundary, plus all files selected from L3; (2) the portions of the files selected from L2 that lie outside the boundary. The first part is merge-sorted to generate 4 new files, SSTable(5'', 6'), SSTable(7', 8), SSTable(9, 10') and SSTable(11, 12), which are placed into L3.2. The second part undergoes an intra-level compaction, i.e., it is also merge-sorted, and the newly generated file SSTable(13', 14') is put back into L2.3. Finally, the compaction pointer is replaced by the right boundary 12 of the task and the files in the input file set are deleted.
To further reduce the write amplification caused by writing files back into L2, a compaction cache is set up for each level to hold the files generated by intra-level compactions. Specifically, an intra-level compaction in L2 may generate two groups of files: the first lies to the left of the left boundary and the second lies to the right of the right boundary. The first group is written to disk, while the second group is stored in the compaction cache instead of being written to disk. The input files of compaction tasks are chosen in a round-robin manner, i.e., when the level performs its next compaction, the right boundary of the current task becomes the left boundary of the next task. The cached files can therefore be read directly from memory, saving one disk read and one disk write per file. Because the boundary expansion is performed and each compaction task has explicit boundaries, compactions (apart from those triggered by L1) do not generate files lying to the left of the left boundary, so the memory occupied by the cached files is small. A cached file is used only once: if a compaction task triggered by L1 includes a file that is in the cache, the file is moved out of the cache, and one disk read and write is still saved.
A computer crash could cause the compaction cache to be lost. To avoid losing data, the disk log records, together with the other compaction metadata (such as the files newly generated by the compaction, the compaction pointer and compaction task statistics), the sub-level to which each cached file belongs, the minimum and maximum keys of the cached file, and the source SSTable files of the cached file; and in the last step of the task the input files related to the cache are not deleted. In this way, when the computer crashes, the data in the compaction cache can be recovered from the input files using this metadata.
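A sketch of the kind of log record and recovery this relies on; the field names are derived from the list above, and the read_sstable callback is a hypothetical stand-in rather than the patent's actual log format.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class CompactionLogRecord:
    level: int                       # level whose compaction produced the record
    sub_level: int                   # sub-level the cached file belongs to
    cached_min_key: int              # smallest key held in the compaction cache
    cached_max_key: int              # largest key held in the compaction cache
    source_files: List[str]          # retained input SSTables the cached data came from
    new_files: List[str]             # files written to disk by this compaction
    compaction_pointer: int          # right boundary of the finished task
    stats: Dict[str, int] = field(default_factory=dict)  # task statistics

def recover_cache(records: List[CompactionLogRecord],
                  read_sstable: Callable[[str], List[Tuple[int, bytes]]]):
    """On restart, rebuild the lost compaction cache by re-reading the retained
    source files and keeping only the keys inside the cached range."""
    cache = []
    for rec in records:
        for path in rec.source_files:
            cache += [(k, v) for k, v in read_sstable(path)
                      if rec.cached_min_key <= k <= rec.cached_max_key]
    return cache
```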
Fig. 3 shows how splitting the compaction task reduces the amount of lower-level data participating in it. The figure shows one compaction task of L2. After the file selection in L2 is finished, SSTable(1, 2), SSTable(3, 6) and SSTable(7, 10) are first selected in L3.1 as candidate files according to the compaction boundary. Then, for each of the three candidate files, the files of every sub-level currently in the input file set are queried using the candidate's minimum and maximum keys, to decide whether the candidate can stay out of this compaction. A candidate file is not added to the task if every file fi in the input file set satisfies one of the following two conditions: (1) the minimum and maximum keys of the candidate file lie outside the range of fi; (2) the minimum and maximum keys of the candidate lie within the range of fi, but fi can be divided into two parts such that neither part overlaps the candidate.
In the figure, for candidate file SSTable(3, 6), its key range [3, 6] does not overlap the key range [7, 9] of SSTable(7', 9) in the input file set, nor the key range [1, 2] of SSTable(1', 2'). It does overlap the key range [2, 7] of SSTable(2', 7'), but if SSTable(2', 7') is split into SSTable(2') and SSTable(7'), neither part overlaps [3, 6]. Therefore the candidate SSTable(3, 6) does not participate in this compaction. The compaction is split into two subtasks: one is responsible for sorting the data in the range [1, 2] and the other for sorting the data in the range [7, 10], and the two subtasks can run in parallel. This both reduces the number of L3 files participating in the task, lowering write amplification, and increases the parallelism of the compaction, improving its speed.
Fig. 4 shows the read algorithm of the LSM tree of the invention. Because each level is further divided, the number of sub-levels to be read increases, which affects read performance. To improve read performance, the invention adopts a parallel read algorithm. A thread pool is maintained whose number of threads equals the maximum number of sub-levels of any level of the LSM tree. When a read request arrives and the corresponding data is not found in memory, the disk data must be queried. L1 is queried first: thread j is responsible for querying L1.j. When every sub-level has been queried, the results of the threads are gathered. If any thread found a result, the candidates are compared by version number, and the newest result is returned. If no thread found a result, the query proceeds to L2. This is repeated until a result is found and returned, or every level has been searched without finding the data, in which case the result is that the data does not exist.
The number of sub-levels Si of each level, the growth factor Ti of each level, and the boundary D between the levels using the tiered compaction algorithm and the levels using the fine-grained compaction algorithm of Fig. 2 all have a large influence on system performance. Therefore, a model is built to express the write amplification of the system under different parameters, and by minimizing the write amplification, the parameters that optimize the write performance of the system are obtained.
Assume the key space K of the workload is the range [0, N-1], where N is the total number of distinct keys in the workload. The keys follow some distribution X, such as a uniform distribution or a Zipf distribution, and the probability that key k appears in a single write request is f_X(k). For example, when the keys are uniformly distributed, f_X(k) = 1/N; when the keys follow a Zipf distribution, f_X(k) is proportional to 1/h(k)^s, where s represents the degree of data skew and h maps each key to an integer rank in the key space K. For p requests, the number of distinct keys that appear is Unique(p) = sum over k in K of (1 - (1 - f_X(k))^p). The inverse function of Unique(p) is Unique^-1; since Unique(p) is monotonic, Unique^-1 can be solved by extending its domain to the real numbers. Then, when k files of sizes u1, u2, ..., uk are compacted together, the total size of the newly generated files is Unique(Unique^-1(u1) + Unique^-1(u2) + ... + Unique^-1(uk)).
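Unique(p) and its inverse can be evaluated numerically; the sketch below illustrates one such computation for a Zipf key distribution, using bisection for the inverse (the patent does not prescribe a particular numerical method).

```python
def zipf_probs(n_keys: int, s: float):
    """f_X(k) for a Zipf distribution with skew s over keys 0..n_keys-1."""
    weights = [1.0 / (rank ** s) for rank in range(1, n_keys + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def unique(p: float, probs):
    """Expected number of distinct keys appearing in p write requests."""
    return sum(1.0 - (1.0 - f) ** p for f in probs)

def unique_inv(u: float, probs, hi: float = 1e12):
    """Numerical inverse of unique(): requests needed so that u distinct keys
    are expected, found by bisection on the monotonic function."""
    lo = 0.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if unique(mid, probs) < u:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```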
The write overhead is modeled by characterizing the write amplification of each level. The WAL is written before data enters the memory buffer, so its write amplification is WA_buf = 1.
Let buf be the number of key-value pairs the memory buffer can hold, and regard the buffer as L0, i.e., Size_0 = buf. When the buffer reaches its capacity threshold, its data is written to L1 in one batch. Since the buffer contains no duplicate keys, the number of write requests needed to fill the buffer from empty is Unique^-1(buf), which is also the interval Interval_0 at which all of this level's data is written to the next level. The amount of data written to disk per flush is buf, so the write amplification of flushing memory to disk is WA_{0->1} = buf / Unique^-1(buf). A sub-level L1.j of L1 has size Size_{1.j} = buf, and the total size of L1 is Size_1 = buf * S_1.
The write amplification caused by disk compactions is calculated from the amount of data written to the level below within a certain interval.
For Li (1 ≤ i ≤ D), the tiered compaction algorithm is used, and a compaction is triggered when the number of sub-levels of the level reaches Si. The time needed to add one sub-level to this level is the interval between two compactions of Li-1, so the interval between compactions of this level is Interval_i = Interval_{i-1} * S_i. Within this interval, the amount of data written to Li+1 is Write_{i+1} = Unique(S_i * Unique^-1(Size_{i.j})), i.e., the size of the new files obtained by merging the level's Si sub-levels. Thus the write amplification of Li is WA_{i->i+1} = Write_{i+1} / Interval_i, and the sub-level size of Li+1 is Size_{(i+1).j} = Write_{i+1}, with Size_{i+1} = Size_{(i+1).j} * S_{i+1}.
For LD+1, a compaction is triggered when the number of sub-levels of the level reaches SD+1. The time needed for each added sub-level is the interval between two compactions of LD, so the interval between compactions of this level is Interval_{D+1} = Interval_D * S_{D+1}. Within this interval, the amount of data written to LD+2 is Write_{D+2}, the size of the new files generated by merging the S_{D+1} sub-levels of LD+1 with the selected sub-level of LD+2, computed with the Unique function in the same way, where Size_{D+2} = Size_{D+1} * T_{D+2} and the size of the j-th sub-level is Size_{(D+2).j} = Size_{D+2} / S_{D+2}. Thus the write amplification of LD+1 is WA_{D+1->D+2} = Write_{D+2} / Interval_{D+1}.
For Li (D+2 ≤ i < n), since the range of data covered by each sub-level is the same, every compaction task selects essentially the same range of data from each sub-level of Li, so the analysis can be carried out on the first sub-level Li.1 of the level. Let DInterval_i be the number of requests between two compactions of Li.1 over the same key range, and let d (0 ≤ d ≤ N-1) be the one-way distance between a key of Li.1 and LastKey, the key at which the previous compaction to the next level ended. For a fixed d, if the level contains a key k1 at distance d from LastKey, then, because compaction proceeds through the key space in order, the sub-level has received DInterval_i * d / (N * S_i) new requests since that key range was last compacted. If key k1 appears among these new requests, then Li.1 contains k1; the probability of this is 1 - (1 - f_X(k1))^(DInterval_i * d / (N * S_i)). Assuming P(LastKey = k) = 1/N for any k in K and considering all keys, the probability that Li.1 contains a key at distance d from LastKey is (1/N) * sum over k in K of (1 - (1 - f_X(k))^(DInterval_i * d / (N * S_i))). Summing this over all d gives the expected amount of data in Li.1; setting it equal to the size of Li.1 yields an equation from which DInterval_i of the level is obtained. The compaction interval of this level is then Interval_i = Interval_{i-1} + DInterval_i, and within this interval the amount of data written to the level below is Write_{i+1}, where Size_{i+1} = Size_i * T_{i+1} and the size of the j-th sub-level is Size_{(i+1).j} = Size_{i+1} / S_{i+1}. Thus the write amplification of this level is WA_{i->i+1} = Write_{i+1} / Interval_i.
Adding up all the WA terms gives the total WA of the LSM tree. The read performance is mainly affected by the total number of sub-levels of the LSM tree: in general, the more sub-levels, the more IO operations a read needs. The total number of sub-levels of the LSM tree, S_1 + S_2 + ... + S_n, is therefore fixed to a set value; the total WA under different parameters is obtained through iteration, and the S_i, T_i and D that give the minimum WA are recorded.
The parameter optimization algorithm according to the embodiment of the invention is as follows:
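The pseudocode itself appears only as a figure in the original publication; the following Python sketch shows one way such an exhaustive search could be organized, treating the per-level model above as a black-box function total_wa(S, T, D), which is a hypothetical name.

```python
from itertools import product

def compositions(total, parts):
    """All ways of splitting `total` sub-levels into `parts` positive integers."""
    if parts == 1:
        yield (total,)
        return
    for first in range(1, total - parts + 2):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def optimize_parameters(total_sub_levels, n_levels, t_choices, total_wa):
    """Exhaustive search: with the total number of sub-levels fixed (this pins
    the read cost), try distributions S, growth factors T and boundaries D, and
    keep the combination whose modeled write amplification is smallest."""
    best_wa, best_cfg = float("inf"), None
    for D in range(1, n_levels):                         # level boundary D
        for S in compositions(total_sub_levels, n_levels):
            for T in product(t_choices, repeat=n_levels):
                wa = total_wa(S, T, D)                   # model described above
                if wa < best_wa:
                    best_wa, best_cfg = wa, (S, T, D)
    return best_wa, best_cfg
```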
according to another embodiment of the present invention, there is provided an LSM tree-oriented key-value storage system, including:
a first storage unit, which stores the first D levels of an LSM tree comprising n levels and executes compaction tasks using a tiered compaction algorithm that minimizes write amplification, where D denotes a preset level boundary parameter;
a second storage unit, which stores the (D+1)-th level of the LSM tree and executes compaction tasks by a compaction method comprising the steps of:
selecting all files of all sub-levels of the level and adding them to an input file set;
in LD+2, selecting the sub-level LD+2.j that currently holds the least data, and, according to the key range covered by the data of LD+1, selecting from LD+2.j all overlapping files and adding them to the input file set;
merge-sorting the data in the input file set and placing the newly generated files into the selected sub-level of the lower level;
a third storage unit, which stores levels LD+2 to Ln of the LSM tree and executes compaction tasks by a compaction method comprising the following steps:
denoting the j-th sub-level of level Li of the LSM tree as Li.j, the SSTable files within a sub-level being arranged from left to right in increasing key order;
maintaining a compaction pointer at each level for selecting the first input file of a compaction task;
when the total data volume of the i-th level Li exceeds its rated size, triggering a compaction at that level, which writes part of the data of Li into Li+1 so as to reorganize the disk data, wherein, when the compaction task is executed, all sub-levels of Li participate in the task while only one sub-level of Li+1 participates.
The steps the third storage unit executes for one compaction task are the same as those of one compaction task of level Li in the foregoing method embodiment, and are not described again here.
The key-value storage system maintains a global version number (e.g., a 64-bit integer); each time a new key-value pair is written, the current version number is encoded into the key-value pair and the global version number is incremented by 1. For example, if the current version number is 1 and a key-value pair is inserted, version 1 is assigned to it and it is stored as <key1, 1, value1>; the version number then becomes 2. If another key-value pair is inserted next, version 2 is assigned to it and it is stored as <key2, 2, value2>; the version number then becomes 3. Thus, when several key-value pairs are read, their relative age can be determined by comparing their version numbers.
In the hierarchical manner as described above, the write operation of the key-value storage system includes:
acquiring the global version number maintained for key-value pairs, incrementing it, and encoding it into the key;
writing the data into the WAL in append-only mode;
writing the data into a memory buffer, and returning;
lookup operations of a key-value store system include:
querying the memory buffer and the cache; if the data is found there, returning it, otherwise proceeding to the next step;
searching the disk levels from L1 to Ln in order; for each level Lb (1 ≤ b ≤ n), a thread pool is maintained whose number of threads is max(S1, S2, ..., Sn); for Lb, Sb read tasks are submitted to the thread pool, and thread j performs a binary search on Lb.j, 1 ≤ j ≤ Sb;
gathering the results of the Sb threads: if any thread has read the data, returning the result with the largest version number and finishing the read; if no thread has read the data, continuing with Lb+1;
if all levels have been searched and the data has still not been found, returning that the data does not exist.
The range query operation comprises:
using the Seek(k) interface to find the key-value pair of the smallest key greater than or equal to k: several query tasks are submitted to the thread pool, each thread being responsible for one sub-level or the memory buffer; each thread searches for the smallest key greater than or equal to k by binary search; if no thread reads any data, the result is that the data does not exist; otherwise, for the threads that did read data, an iterator is built over the data read by each, the results are ordered by version number, and the data with the newest version is taken out and returned;
using the Next() interface to find the key-value pair of the smallest key in the system larger than the key currently found: if Seek(k) found data, then when the user submits a Next() request, the iterator that returned the previous result advances with Next(), the data currently pointed to by each iterator is compared again, and the newest data is returned; old versions are skipped during this process.
A model can likewise be built to express the write amplification of the system under different parameters; by minimizing the write amplification, the parameters that optimize the write performance of the system are obtained. The specific modeling steps are the same as in the foregoing method embodiments and are not repeated here.
There is also provided, in accordance with another embodiment of the present invention, a key-value pair storage device, including:
one or more processors;
a memory; and
one or more computer programs stored in the memory and configured for execution by the one or more processors, which when executed by the one or more processors, cause the one or more processors to perform steps comprising a LSM tree oriented key-value storage method as described in the preceding method embodiment.
As will be appreciated by one of skill in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
It is apparent that those skilled in the art can make various modifications and variations to the embodiments of the present invention without departing from the spirit and scope of the embodiments of the present invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the embodiments of the present invention and their equivalents, the embodiments of the present invention are also intended to include such modifications and variations.

Claims (10)

1. An LSM-tree-oriented key-value storage method, the method comprising:
dividing each level of the LSM tree into a plurality of sub-levels, where the j-th sub-level of the i-th level is denoted Li.j and the SSTable files within a sub-level are arranged from left to right in increasing key order;
maintaining a compaction pointer at each level for selecting the first input file of a compaction task;
when the total data volume of the i-th level Li exceeds its rated size, triggering a compaction at that level, which writes part of the data of Li into Li+1 so as to reorganize the disk data, wherein, when the compaction task is executed, all sub-levels of Li participate in the task while only one sub-level of Li+1 participates.
2. The LSM-tree-oriented key-value storage method of claim 1, wherein a single compaction task of level Li comprises the steps of:
according to the compaction pointer of Li, selecting in the first sub-level Li.1 the SSTable file whose minimum key is greater than or equal to the pointer and closest to it as the initial input file of the task, adding it to the input file set of the compaction task, and taking the minimum key of that file as the left boundary of the task and its maximum key as the right boundary;
for the other sub-levels Li.2, Li.3, ..., Li.Si of level Li, selecting in turn the files lying partly or wholly within the left and right boundaries and adding them to the input file set, where Si denotes the number of sub-levels into which level Li is divided;
expanding the boundaries of the current task according to the minimum and maximum keys of the files in the input file set, so that the task contains more files lying completely within the boundaries;
in Li+1, selecting the sub-level Li+1.j that currently holds the least data; according to the task boundaries, selecting from Li+1.j the files lying within or overlapping the boundaries and adding them to a candidate file set; splitting the compaction task according to the files in the candidate file set, and, after the split, adding to the input file set only the files that actually need to participate in the task;
for the input file set, merge-sorting the data of Li lying within the task boundaries together with the data of Li+1, and writing the newly generated files into Li+1.j;
for the input file set, merge-sorting the data of Li lying outside the task boundaries; among the newly generated files, writing those with data smaller than the left boundary of the task back into Li, and placing those with data larger than the right boundary into the compaction cache of Li; recording in the log the minimum and maximum keys of each cached file together with the files of the input file set that overlap the cached files, and deleting the files of the input file set that are not recorded in the log;
replacing the compaction pointer of Li with the right boundary of the compaction task.
3. The LSM-tree-oriented key-value storage method of claim 2, wherein the specific step of splitting the task comprises:
for each file in the candidate file set, obtaining from the metadata in memory the minimum key kmin and the maximum key kmax it contains;
querying the files in the input file set according to kmin and kmax: if, for every file in the input file set, [kmin, kmax] does not overlap the file, or no other key-value pair of the file lies between its largest key smaller than kmin and its smallest key larger than kmax, then the candidate file is removed from the candidate file set and, according to kmin and kmax, the files of the input file set are cut into two parts, one containing the keys smaller than kmin and the other containing the keys larger than kmax; otherwise, the candidate file is moved out of the candidate file set and added to the input file set.
4. The LSM-tree-oriented key-value storage method of claim 1, wherein Li is one of the levels LD+2 to Ln, where n is the number of levels of the LSM tree, D is a preset level boundary parameter, and 1 ≤ D ≤ n; the method further comprises the following steps:
for L1 to LD, adopting a tiered compaction algorithm: sorting all data of the level at once and writing the newly generated files into the next level, where they form a new sub-level; during this process no lower-level data participates in the sort;
for LD+1, sorting all data of the level together with the data of one sub-level of the level below, and writing the newly generated data into the selected sub-level of the level below.
5. The LSM-tree-oriented key-value storage method of claim 4, wherein, under this hierarchical organization, the write operation comprises:
acquiring the global version number maintained for key-value pairs, incrementing it, and encoding it into the key;
writing the data into the WAL in append-only mode;
writing the data into the memory buffer and returning;
and the lookup operation comprises the following steps:
querying the memory buffer and the cache; if the data is found there, returning it, otherwise proceeding to the next step;
searching the disk levels from L1 to Ln in order; for each level Lb (1 ≤ b ≤ n), a thread pool is maintained whose number of threads is max(S1, S2, ..., Sn); for Lb, Sb read tasks are submitted to the thread pool, and thread j performs a binary search on Lb.j, 1 ≤ j ≤ Sb;
gathering the results of the Sb threads: if any thread has read the data, returning the result with the largest version number and finishing the read; if no thread has read the data, continuing with Lb+1;
if all levels have been searched and the data has still not been found, returning that the data does not exist.
6. The LSM tree oriented key-value storage method of claim 5, wherein the range query operation comprises:
searching a key value pair corresponding to the minimum key which is greater than or equal to k by utilizing a Seek (k) interface: submitting a plurality of query tasks to a thread pool, wherein each thread is responsible for querying a sub-layer or a memory buffer, each thread searches a minimum key which is greater than or equal to k through a dichotomy, and if each thread does not read data, the returned data does not exist; otherwise, for the thread reading the data, constructing an iterator from the read data, sequencing the read data according to the version number, and taking out the data with the latest version and returning;
and finding the key-value pair corresponding to the smallest key larger than the currently found key by means of a Next() interface: if Seek(k) has found data, then when the user submits a Next() request, the iterator that returned the previous result advances by Next(), the data currently pointed to by each iterator is compared again, and the latest data is returned, older versions of the same key being ignored during this process.
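The Seek(k)/Next() behaviour of claim 6 can be illustrated with one cursor per sub-layer, as below; the names RangeIter, seek_all and range_scan are assumptions, and details such as tombstones or a heap over the cursors are left out.

import bisect

class RangeIter:
    # One cursor per sub-layer (or memory buffer), positioned by Seek(k).
    def __init__(self, sublayer, k):
        self.data = sublayer                           # sorted (key, version, value)
        self.pos = bisect.bisect_left(sublayer, (k,))  # first key >= k

    def peek(self):
        return self.data[self.pos] if self.pos < len(self.data) else None

    def next(self):
        self.pos += 1

def seek_all(sublayers, k):
    # Seek(k): every sub-layer is probed (in the patent, by threads from a
    # pool); only cursors that found something are kept.
    its = [RangeIter(s, k) for s in sublayers]
    return [it for it in its if it.peek() is not None]

def smallest(iterators):
    # Pick the cursor holding the smallest key; on ties the largest version
    # wins, and stale versions of the same key are skipped afterwards.
    return min(iterators, key=lambda it: (it.peek()[0], -it.peek()[1]))

def range_scan(sublayers, k, count):
    its, out, last_key = seek_all(sublayers, k), [], None
    while len(out) < count and its:
        it = smallest(its)
        key, ver, val = it.peek()
        if key != last_key:          # older versions of last_key are ignored
            out.append((key, val))
            last_key = key
        it.next()
        its = [i for i in its if i.peek() is not None]
    return out

# Usage
subs = [[(1, 1, "a"), (4, 1, "d")], [(2, 2, "b"), (4, 3, "d-new")]]
print(range_scan(subs, 2, 3))   # -> [(2, 'b'), (4, 'd-new')]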
7. The LSM tree-oriented key-value storage method of any of claims 1-6, further comprising: modeling write amplification, and selecting optimal parameters by minimizing write amplification, the steps comprising:
let the number of LSM tree layers be n, the number of sub-layers in layer b be S_b, and the growth factor of layer b be T_b, where 1 ≤ b ≤ n; let the boundary between the layers adopting different compaction algorithms be D; calculating the write amplification of each layer as follows:
for writing the WAL, the write amplification is 1;
for flushing the memory buffer to disk, the write amplification is buf / Unique^{-1}(buf), where buf is the maximum number of key-value pairs the buffer can hold, Unique^{-1}(k) is the inverse function of Unique(p), Unique(p) = Σ_{k∈K} (1 − (1 − f_X(k))^p), N is the total number of distinct keys in the workload, K is the set of integers in the key space [0, N−1], and f_X(k) is the probability that key k occurs in a single write request;
when 1 ≤ b ≤ D, the write amplification of L_b is given by formulas that appear only as images in the original document and are not reproduced here, wherein Interval_b = Interval_{b-1} * S_b, Interval_0 = Unique^{-1}(buf), Size_1 = buf * S_1, and Size_{b+1} = Size_{(b+1).j} * S_{b+1};
for L_{D+1}, the write amplification is given by formulas that appear only as images in the original document, wherein Interval_{D+1} = Interval_D * S_{D+1}, Size_{D+2} = Size_{D+1} * T_{D+2}, and Size_{(D+2).j} = Size_{D+2} / S_{D+2};
when D+2 ≤ b < n, the write amplification of L_b is given by a formula that appears only as an image in the original document, wherein Interval_b = Interval_{b-1} + DInterval_b, DInterval_b is obtained by solving an equation that likewise appears only as an image, Size_{b+1} = Size_b * T_{b+1}, and Size_{(b+1).j} = Size_{b+1} / S_{b+1};
the write amplification of each disk layer, of writing the WAL, and of flushing the memory buffer are combined to give the write amplification of the entire LSM tree;
fixing the total number of sub-layers of the LSM tree, iteratively solving the write amplification under different parameters, and obtaining the S_b, T_b and D that minimize the write amplification.
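Of the model in claim 7, only Unique(p), its inverse and the buffer-flush term are spelled out in the text; the per-layer formulas exist only as images. The sketch below therefore implements the stated parts and leaves the per-layer term as a caller-supplied function, with a purely hypothetical stub in the usage lines.

import itertools

def unique(p, freqs):
    # Unique(p) = sum over the key space of 1 - (1 - f_X(k))**p, i.e. the
    # expected number of distinct keys among p writes; freqs[k] = f_X(k).
    return sum(1.0 - (1.0 - f) ** p for f in freqs)

def unique_inv(target, freqs, hi=1 << 40):
    # Numerical inverse of Unique(p): smallest p with Unique(p) >= target
    # (Unique is monotonically increasing in p, so binary search works).
    lo = 1
    while lo < hi:
        mid = (lo + hi) // 2
        if unique(mid, freqs) < target:
            lo = mid + 1
        else:
            hi = mid
    return lo

def flush_wa(buf, freqs):
    # Write amplification of flushing the memory buffer: buf / Unique^{-1}(buf).
    return buf / unique_inv(buf, freqs)

def total_wa(S, T, D, buf, freqs, layer_wa):
    # 1 (WAL) + buffer-flush WA + per-disk-layer WA.  `layer_wa(b, S, T, D)`
    # must supply the per-layer formulas, which are images in the original.
    return 1.0 + flush_wa(buf, freqs) + sum(
        layer_wa(b, S, T, D) for b in range(1, len(S) + 1))

def best_parameters(total_sublayers, n, buf, freqs, layer_wa, growth_choices=(5, 10)):
    # Fix the total sub-layer budget, enumerate S_b, T_b and D, and keep the
    # combination that minimizes the modeled write amplification.
    best = None
    for S in itertools.product(range(1, total_sublayers + 1), repeat=n):
        if sum(S) != total_sublayers:
            continue
        for T in itertools.product(growth_choices, repeat=n):
            for D in range(1, n + 1):
                wa = total_wa(S, T, D, buf, freqs, layer_wa)
                if best is None or wa < best[0]:
                    best = (wa, S, T, D)
    return best

# Usage with a deliberately trivial, hypothetical stand-in for the per-layer term:
freqs = [1.0 / 100] * 100                      # uniform workload over 100 keys
def stub(b, S, T, D): return T[b - 1] / S[b - 1]
print(best_parameters(total_sublayers=6, n=2, buf=32, freqs=freqs, layer_wa=stub))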
8. An LSM tree-oriented key-value storage system, comprising:
a first storage unit which stores the first D layers of an LSM tree comprising n layers and executes compaction tasks by adopting a tiered compaction algorithm so as to minimize write amplification, wherein D denotes a set layer boundary parameter;
a second storage unit which stores the (D+1)-th layer of the LSM tree and executes compaction tasks by a compaction method comprising the following steps:
selecting all files of all sub-layers of the layer and adding the files into an input file set;
in layer L_{D+2}, selecting the sub-layer L_{D+2,j} with the least current data volume; according to the key range covered by the L_{D+1} data, selecting from L_{D+2,j} all overlapping files and adding them into the input file set;
performing a multi-way merge sort on the data in the input file set, and placing the newly generated files into the selected sub-layer of the lower layer;
a third storage unit which stores layers L_{D+2} to L_n of the LSM tree and executes compaction tasks by adopting a compaction method comprising the following steps:
denoting the j-th sub-layer of the i-th layer L_i of the LSM tree as L_{i,j}, the SSTable files within a sub-layer being arranged from left to right in increasing order of key range;
maintaining a compaction pointer at each layer for selecting the first input file of a compaction task;
when the total data volume of the i-th layer L_i exceeds its rated size, triggering one compaction at that layer and writing part of the data of layer L_i into layer L_{i+1} so as to reorganize the disk data, wherein, when the compaction task is executed, all sub-layers of layer L_i participate in the task while only one sub-layer of layer L_{i+1} participates.
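The compaction input selection of claim 8 (all sub-layers of the upper layer plus the least-filled sub-layer of the lower layer) might look roughly like this; files are represented as plain dicts with kmin/kmax/size, which is an editorial simplification.

def pick_lower_sublayer(lower_sublayers):
    # Choose the sub-layer of the lower layer currently holding the least data.
    return min(range(len(lower_sublayers)),
               key=lambda j: sum(f["size"] for f in lower_sublayers[j]))

def build_input_set(upper_sublayers, lower_sublayers, key_range):
    # Every file of every sub-layer of the upper layer, plus the overlapping
    # files of one least-filled sub-layer of the lower layer.
    j = pick_lower_sublayer(lower_sublayers)
    lo, hi = key_range                    # key range covered by the upper data
    inputs = [f for sub in upper_sublayers for f in sub]
    inputs += [f for f in lower_sublayers[j]
               if not (f["kmax"] < lo or f["kmin"] > hi)]
    return j, inputs

# Usage
upper = [[{"kmin": 0, "kmax": 50, "size": 4}], [{"kmin": 10, "kmax": 90, "size": 6}]]
lower = [[{"kmin": 0, "kmax": 40, "size": 9}], [{"kmin": 60, "kmax": 99, "size": 2}]]
print(build_input_set(upper, lower, (0, 90)))   # picks sub-layer 1, three input files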
9. The LSM tree-oriented key-value storage system of claim 8, wherein said third storage unit executes a compaction task through the following steps:
according to the compaction pointer of L_i, selecting, in the first sub-layer L_{i,1} of L_i, the SSTable file whose minimum key is greater than or equal to the pointer and closest to it as the initial input file of the task, adding it to the input file set of the compaction task, and taking the minimum key of the file as the left boundary of the task and the maximum key of the file as the right boundary of the task;
for the other sub-layers L_{i,j} of this layer (the sub-layer expression appears only as an image in the original), sequentially selecting part or all of the files lying within the left and right boundaries and adding them into the input file set, where S_i denotes the number of sub-layers into which layer L_i is divided;
expanding the boundaries of the current task according to the minimum and maximum keys of the files in the input file set, so that the task includes more files that lie entirely within the boundaries;
in layer L_{i+1}, selecting the sub-layer L_{i+1,j} with the least current data volume; according to the task boundaries, selecting from L_{i+1,j} the files that lie within or overlap the boundaries and adding them into a candidate file set; partitioning the compaction task by means of the files in the candidate file set, and adding into the input file set only the files that still need to participate in the task after partitioning;
for the input file set, performing a multi-way merge sort on the data located in L_i within the task boundaries together with the data located in L_{i+1}, generating new files and writing them into L_{i+1,j};
for the input file set, performing a multi-way merge sort on the data located in L_i outside the task boundaries; writing the portion of the newly generated files that is smaller than the left boundary of the task back into L_i, and writing the portion larger than the right boundary of the task into the compaction cache of this layer; recording in the log the minimum and maximum keys of the cached file together with the files in the input file set that overlap with it, and deleting the files in the input file set that are not recorded in the log;
and updating the compaction pointer of the current layer to the right boundary of this compaction task.
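A sketch of how the third storage unit of claim 9 could seed a compaction task from the layer's compaction pointer; the helper names are invented, and the second pass that pulls in files fully covered by the widened boundaries, as well as the lower-layer candidate handling, is omitted.

def pick_initial_file(first_sublayer, compact_ptr):
    # In L_{i,1}, take the file whose minimum key is the smallest one that is
    # greater than or equal to the layer's compaction pointer.
    eligible = [f for f in first_sublayer if f["kmin"] >= compact_ptr]
    return min(eligible, key=lambda f: f["kmin"]) if eligible else None

def collect_within(sublayer, left, right):
    # Files of one sub-layer that fall fully or partly inside [left, right].
    return [f for f in sublayer if not (f["kmax"] < left or f["kmin"] > right)]

def expand_boundaries(inputs):
    # Stretch the task boundaries to the min/max keys of the selected files.
    return min(f["kmin"] for f in inputs), max(f["kmax"] for f in inputs)

def start_compaction(layer_sublayers, compact_ptr):
    first = pick_initial_file(layer_sublayers[0], compact_ptr)
    if first is None:
        return None
    left, right = first["kmin"], first["kmax"]
    inputs = [first]
    for sub in layer_sublayers[1:]:            # the other sub-layers of the layer
        inputs += collect_within(sub, left, right)
    left, right = expand_boundaries(inputs)    # widen once all files are in
    # After the task completes, the layer's pointer moves to the right boundary.
    return inputs, (left, right), right

# Usage: a pointer at key 15 picks the file covering [20, 35] to seed the task.
subs = [[{"kmin": 0, "kmax": 9}, {"kmin": 20, "kmax": 35}],
        [{"kmin": 25, "kmax": 60}]]
print(start_compaction(subs, 15))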
10. A key-value pair storage device, the device comprising:
one or more processors;
a memory; and
one or more computer programs stored in the memory and configured for execution by the one or more processors, which, when executed by the one or more processors, cause the one or more processors to perform the steps of the LSM tree-oriented key-value storage method of any of claims 1-7.
CN202110573140.8A 2021-05-25 2021-05-25 LSM tree-oriented key value storage method and storage system Active CN113297136B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110573140.8A CN113297136B (en) 2021-05-25 2021-05-25 LSM tree-oriented key value storage method and storage system
PCT/CN2021/103902 WO2022246953A1 (en) 2021-05-25 2021-07-01 Key-value storage method and storage system for lsm tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110573140.8A CN113297136B (en) 2021-05-25 2021-05-25 LSM tree-oriented key value storage method and storage system

Publications (2)

Publication Number Publication Date
CN113297136A true CN113297136A (en) 2021-08-24
CN113297136B CN113297136B (en) 2023-11-03

Family

ID=77325052

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110573140.8A Active CN113297136B (en) 2021-05-25 2021-05-25 LSM tree-oriented key value storage method and storage system

Country Status (2)

Country Link
CN (1) CN113297136B (en)
WO (1) WO2022246953A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117891395B (en) * 2023-12-26 2024-07-16 天津中科曙光存储科技有限公司 Data storage method, device, computer equipment and storage medium
CN117785890B (en) * 2024-02-27 2024-06-28 支付宝(杭州)信息技术有限公司 Data traversal query method based on LSM tree and related equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804019B (en) * 2017-04-27 2020-07-07 华为技术有限公司 Data storage method and device
CN111352908B (en) * 2020-02-28 2023-10-10 北京奇艺世纪科技有限公司 LSM-based data storage method and device, storage medium and computer equipment

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038206A (en) * 2017-01-17 2017-08-11 阿里巴巴集团控股有限公司 The method for building up of LSM trees, the method for reading data and server of LSM trees
CN107247624A (en) * 2017-06-05 2017-10-13 安徽大学 A kind of cooperative optimization method and system towards Key Value systems
CN107291541A (en) * 2017-06-23 2017-10-24 安徽大学 Towards the compaction coarseness process level parallel optimization method and system of Key Value systems
CN111226205A (en) * 2017-08-31 2020-06-02 美光科技公司 KVS tree database
US20200183906A1 (en) * 2018-12-07 2020-06-11 Vmware, Inc. Using an lsm tree file structure for the on-disk format of an object storage platform
US20200201821A1 (en) * 2018-12-21 2020-06-25 Vmware, Inc. Synchronization of index copies in an lsm tree file system
US20200201822A1 (en) * 2018-12-21 2020-06-25 Vmware, Inc. Lockless synchronization of lsm tree metadata in a distributed system
CN110347336A (en) * 2019-06-10 2019-10-18 华中科技大学 A kind of key assignments storage system based on NVM with SSD mixing storage organization

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YUNPENG CHAI et al.: "LDC: A Lower-Level Driven Compaction Method to Optimize SSD-Oriented Key-Value Stores", 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 722-733 *
ZHANG, WEITAO et al.: "Deduplication Triggered Compaction for LSM-tree Based Key-Value Store", Proceedings of 2018 IEEE 9th International Conference on Software Engineering and Service Science (ICSESS), pages 719-722 *
ZHANG Weitao: "Performance Optimization of KV Databases Based on LSM-tree", China Doctoral Dissertations Full-text Database, Information Science and Technology, pages 138-33 *
RAO Yulin: "Research on Optimization of the LSM-Tree-Based Persistent Cache Mechanism", China Master's Theses Full-text Database, Information Science and Technology, pages 138-227 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113721863A (en) * 2021-11-02 2021-11-30 支付宝(杭州)信息技术有限公司 Method and device for managing data
CN113721863B (en) * 2021-11-02 2021-12-31 支付宝(杭州)信息技术有限公司 Method and device for managing data
CN114237507A (en) * 2021-11-02 2022-03-25 支付宝(杭州)信息技术有限公司 Method and device for managing data
CN114237507B (en) * 2021-11-02 2024-04-12 支付宝(杭州)信息技术有限公司 Method and device for managing data
CN114817263A (en) * 2022-04-28 2022-07-29 北京达佳互联信息技术有限公司 Data processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113297136B (en) 2023-11-03
WO2022246953A1 (en) 2022-12-01

Similar Documents

Publication Publication Date Title
CN113297136A (en) LSM tree-oriented key value storage method and storage system
US11693830B2 (en) Metadata management method, system and medium
CN110825748B (en) High-performance and easily-expandable key value storage method by utilizing differentiated indexing mechanism
US9626422B2 (en) Systems and methods for reslicing data in a relational database
US9378232B2 (en) Framework for numa affinitized parallel query on in-memory objects within the RDBMS
US20170212680A1 (en) Adaptive prefix tree based order partitioned data storage system
Levandoski et al. LLAMA: A cache/storage subsystem for modern hardware
Bernstein et al. Optimizing optimistic concurrency control for tree-structured, log-structured databases
US7418544B2 (en) Method and system for log structured relational database objects
US8229968B2 (en) Data caching for distributed execution computing
US20160117354A1 (en) Method and system for dynamically partitioning very large database indices on write-once tables
US20200210399A1 (en) Signature-based cache optimization for data preparation
JPH02230373A (en) Data base processing system
US11714794B2 (en) Method and apparatus for reading data maintained in a tree data structure
US6745198B1 (en) Parallel spatial join index
CN113906406A (en) Database management system
JP6598997B2 (en) Cache optimization for data preparation
US7774304B2 (en) Method, apparatus and program storage device for managing buffers during online reorganization
CN116186085A (en) Key value storage system and method based on cache gradient cold and hot data layering mechanism
US20180011897A1 (en) Data processing method having structure of cache index specified to transaction in mobile environment dbms
WO2015129109A1 (en) Index management device
CN113204520B (en) Remote sensing data rapid concurrent read-write method based on distributed file system
US11625386B2 (en) Fast skip list purge
US20220335030A1 (en) Cache optimization for data preparation
US20230177034A1 (en) Method for grafting a scion onto an understock data structure in a multi-host environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant