CN113297136A - LSM tree-oriented key-value storage method and storage system


Info

Publication number: CN113297136A
Authority: CN (China)
Prior art keywords: layer, data, task, key, file
Application number: CN202110573140.8A
Other languages: Chinese (zh)
Other versions: CN113297136B (granted publication)
Inventors: Wang Hongchao (王宏超), Ye Baoliu (叶保留), Tang Bin (唐斌), Lu Sanglu (陆桑璐)
Current and original assignee: Nanjing University
Application filed by Nanjing University
Priority to CN202110573140.8A
Priority to PCT/CN2021/103902 (WO2022246953A1)
Publication of CN113297136A
Application granted; publication of CN113297136B
Legal status: Granted, active

Classifications

    • G06F 16/13 - Physics; Computing; Electric digital data processing; Information retrieval, file systems; File access structures, e.g. distributed indices
    • G06F 16/172 - Physics; Computing; Electric digital data processing; Information retrieval, file systems; Details of further file system functions; Caching, prefetching or hoarding of files
    • Y02D 10/00 - Climate change mitigation technologies in information and communication technologies; Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides an LSM-tree-oriented key-value storage method and storage system. The method comprises the following steps: dividing each disk level at fine granularity and setting the compaction policy as follows: in a compaction task, all sub-levels of the upper level participate in the task while only one sub-level of the lower level participates, so as to reduce the ratio of the lower-level data to the total data taking part in the task; when a compaction task is executed, the task is split, which reduces the number of files participating in the compaction and increases the parallelism of the compaction. The invention further reduces the impact on read performance through a parallel read algorithm, and provides a method for selecting the parameters that minimize write amplification by modeling the write amplification of the LSM tree.

Description

LSM tree-oriented key value storage method and storage system
Technical Field
The invention relates to computer storage technology, and in particular to an LSM-tree-oriented key-value storage method and storage system.
Background
A key-value store keeps data as a set of <key, value> pairs, where the key is the unique identifier of the value. It does not support complex relational schemas like a relational database; instead, data is accessed through simple interfaces such as Put(k, v), Get(k), Update(k, v) and Delete(k). Owing to its high performance and high scalability, key-value storage plays an important role in today's web applications and distributed systems, and is widely used in graph databases, task queues, stream processing engines, application data caching, event tracking systems and other fields.
The LSM tree (Log-Structured Merge tree) is a storage engine widely used in key-value storage systems. When a user writes a key-value pair, the data is first written into an in-memory buffer, where it is kept sorted. When the buffer exceeds a preset size, its contents are written to disk in one batch. This effectively converts a large number of random writes into a small number of sequential writes. Because the sequential write performance of a hard disk is far higher than its random write performance, the write speed of the LSM tree is very high, making it suitable for write-heavy workloads. To avoid losing in-memory data when the system crashes, before data is written into the buffer it is appended to a WAL (Write-Ahead Log) located on disk. Since this is an append-only operation, it does not noticeably affect the write performance of the system.
Data on disk is stored in multiple levels (L1, L2, ..., Ln), where Ln denotes the lowest level and Li denotes the i-th level (1 ≤ i ≤ n). The data of each level is kept sorted by key and spread across multiple SSTables (Sorted String Tables), each of which stores the data of a certain key range in sorted order. Between two adjacent levels, the ratio of the amount of data the lower level can hold to the amount the upper level can hold is called the growth factor T, typically 10. For example, if the first level can store at most 10 MB of data, the second level can store at most 100 MB, and so on; with only 7 levels the tree can hold more than 10 TB of data in total. To keep the hierarchy stable and prevent any level from holding too much data, a background compaction process reorganizes the data on disk and writes part of one level's data into the next level.
Specifically, when the amount of data in a certain level exceeds the maximum it can hold, the compaction process selects one SSTable file of that level, then selects all SSTable files in the next level whose key ranges overlap with it, merge-sorts these files, writes the newly generated files into the next level, and deletes the old selected files.
Take L1 as an example and assume that the key range of the SSTable file selected in this level is [2, 8]. Then, if an SSTable file in L2 has a key range that overlaps [2, 8], that file must also be selected as an input of the compaction task. This guarantees that after the data is written into L2, the data in L2 remains sorted. Since the amount of data each level can hold grows exponentially, writing one SSTable file of a level into the next level usually requires several files of the next level to participate, and a compaction in one level increases the data stored in the level below, which may in turn trigger a compaction in that level. Accumulated across multiple levels, this causes the data on disk to be rewritten repeatedly. The ratio of the amount of data actually written to disk to the amount of data the user requested to write is called write amplification. Taking the key-value store LevelDB, which adopts the LSM tree structure, as an example, experimental results show that when a user requests to write 50 GB of data, the write amplification approaches 20, i.e., the actual amount written to disk approaches 1 TB. Excessive write amplification severely degrades the write performance of the LSM tree. Moreover, the LSM tree often runs on machines equipped with SSDs, and frequent disk reads and writes shorten the lifetime of the SSD. In summary, write amplification is a serious problem for the LSM tree structure. On the other hand, when the data volumes of the memory buffer and of disk level L1 both exceed their thresholds, the in-memory data cannot be flushed; the system must wait for L1 to finish a compaction and free space in that level before new write requests can be served, which results in write stalls, i.e., periodic large increases in write latency.
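For concreteness, the following minimal Python sketch (not taken from the patent) illustrates how a classic leveled compaction, as described above, selects its input files; the SSTable type and the round-robin victim choice are simplifying assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SSTable:
    min_key: int
    max_key: int
    entries: List[Tuple[int, bytes]]  # (key, value) pairs sorted by key

def overlaps(a: SSTable, b: SSTable) -> bool:
    # Two key ranges overlap unless one ends before the other starts.
    return not (a.max_key < b.min_key or b.max_key < a.min_key)

def pick_compaction_inputs(upper: List[SSTable],
                           lower: List[SSTable]) -> Tuple[SSTable, List[SSTable]]:
    """Classic leveled compaction: pick one file of the upper level and every
    lower-level file whose key range overlaps it."""
    victim = upper[0]  # real systems rotate through the level's files round-robin
    return victim, [t for t in lower if overlaps(victim, t)]
```

Because every overlapping lower-level file must be rewritten, roughly T lower-level files are merged for each upper-level file, which is the per-level source of the write amplification discussed above.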
Disclosure of Invention
In view of the problems in the background art, the invention aims to provide an LSM-tree-oriented key-value storage method that reduces write amplification by lowering the ratio of lower-level to upper-level data participating in a compaction task, characterizes the write amplification of the system through modeling so as to optimize the system parameters, and reduces the impact on read performance by means of a parallel read algorithm.
Another object of the present invention is to provide a key value storage system and device using the above key value storage method.
In order to achieve the purpose, the invention adopts the following technical scheme:
according to a first aspect of the present invention, there is provided an LSM-tree-oriented key-value storage method, comprising the following steps:
dividing each level of the LSM tree into a plurality of sub-levels, where the j-th sub-level of the i-th level is denoted Li.j and the SSTable files within a sub-level are arranged from left to right in increasing key order;
maintaining a compaction pointer at each level for selecting the first input file of a compaction task;
when the total data volume of the i-th level Li exceeds its rated size, triggering a compaction at that level, which writes part of the data of Li into Li+1 so as to reorganize the disk data, wherein, when the compaction task is executed, all sub-levels of Li participate in the task while only one sub-level of Li+1 participates.
A single compaction task of level Li comprises the following steps (a code sketch is given after these steps):
according to the compaction pointer of Li, selecting in the first sub-level Li.1 the SSTable file whose minimum key is greater than or equal to the pointer and closest to it as the initial input file of the task, adding it to the input file set of the compaction task, and taking the minimum key of that file as the left boundary of the task and its maximum key as the right boundary;
for the other sub-levels Li.2, Li.3, ..., Li.Si of level Li, selecting in turn the files lying partly or wholly within the left and right boundaries and adding them to the input file set, where Si denotes the number of sub-levels into which level Li is divided;
expanding the boundaries of the current task according to the minimum and maximum keys of the files in the input file set, so that the task contains more files lying completely within the boundaries;
in Li+1, selecting the sub-level Li+1.j that currently holds the least data; according to the task boundaries, selecting from Li+1.j the files lying within or overlapping the boundaries and adding them to a candidate file set; splitting the compaction task according to the files in the candidate file set, and, after the split, adding to the input file set only the files that actually need to participate in the task;
for the input file set, merge-sorting the data of Li lying within the task boundaries together with the data of Li+1, and writing the newly generated files into Li+1.j;
for the input file set, merge-sorting the data of Li lying outside the task boundaries; among the newly generated files, writing those with data smaller than the left boundary of the task back into Li, and placing those with data larger than the right boundary into the compaction cache of Li; recording in the log the minimum and maximum keys of each cached file together with the files of the input file set that overlap the cached files, and deleting the files of the input file set that are not recorded in the log;
replacing the compaction pointer of Li with the right boundary of the compaction task.
The specific method for splitting the task comprises the following steps (a code sketch is given after these steps):
for each file in the candidate file set, obtaining from the metadata in memory the minimum key kmin and the maximum key kmax it contains;
querying the files in the input file set according to kmin and kmax: if, for every file in the input file set, [kmin, kmax] does not overlap the file, or no other key-value pair of the file lies between its largest key smaller than kmin and its smallest key larger than kmax, then the candidate file is removed from the candidate file set and, according to kmin and kmax, the files of the input file set are cut into two parts, one containing the keys smaller than kmin and the other containing the keys larger than kmax; otherwise, the candidate file is moved out of the candidate file set and added to the input file set.
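A sketch of this split test, reusing the SSTable shape from the earlier sketches; can_skip_candidate and split_input_entries are hypothetical helper names.

```python
def can_skip_candidate(candidate, inputs):
    """Split test: a lower-level candidate file may stay out of the compaction
    if, for every input file, the candidate's range either does not overlap it
    or falls into a gap of it (no input key inside [kmin, kmax]), so that the
    input file can simply be cut in two around the candidate."""
    kmin, kmax = candidate.min_key, candidate.max_key
    for f in inputs:
        if f.max_key < kmin or f.min_key > kmax:
            continue  # no overlap with this input file
        if any(kmin <= k <= kmax for k, _ in f.entries):
            return False  # an input key falls inside the candidate's range
    return True

def split_input_entries(candidate, inputs):
    """Cut the input data around [kmin, kmax] of a skipped candidate, yielding
    the 'left of kmin' and 'right of kmax' halves of the compaction."""
    kmin, kmax = candidate.min_key, candidate.max_key
    left = [(k, v) for f in inputs for k, v in f.entries if k < kmin]
    right = [(k, v) for f in inputs for k, v in f.entries if k > kmax]
    return left, right
```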
In some embodiments of the first aspect of the invention, Li is one of the levels LD+2 to Ln, where n is the number of levels of the LSM tree, D is a preset level boundary parameter, and 1 ≤ D ≤ n; the method further comprises the following steps (a code sketch follows):
for L1 to LD, adopting a tiered compaction algorithm: sorting all data of the level at once and writing the newly generated files into the next level, where they form a new sub-level; during this process no lower-level data participates in the sort;
for LD+1, sorting all data of the level together with the data of one sub-level of the level below, and writing the newly generated data into the selected sub-level of the level below.
Under this hierarchical organization, the write operation comprises the following steps (a code sketch is given after these steps):
acquiring the global version number maintained for key-value pairs, incrementing it, and encoding it into the key;
writing the data into the WAL in append-only mode;
writing the data into the memory buffer and returning.
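A minimal sketch of this write path, assuming a hypothetical KVStore class with an append-only WAL file, an in-memory buffer and a monotonically increasing global version number; the record format and the size accounting are illustrative only.

```python
import os
import struct
import threading

class KVStore:
    """Write path only: version tagging, WAL append, then the memory buffer."""
    def __init__(self, wal_path="kv.wal", buf_limit=4 * 1024 * 1024):
        self.wal = open(wal_path, "ab")
        self.buffer = {}            # key -> (version, value)
        self.buffer_bytes = 0
        self.version = 0            # global version number
        self.buf_limit = buf_limit
        self.lock = threading.Lock()

    def put(self, key: bytes, value: bytes):
        with self.lock:
            # 1. bump the global version number and attach it to this write
            self.version += 1
            ver = self.version
            # 2. append the record to the WAL before touching the buffer
            record = struct.pack("<QII", ver, len(key), len(value)) + key + value
            self.wal.write(record)
            self.wal.flush()
            os.fsync(self.wal.fileno())
            # 3. insert into the in-memory buffer and return
            self.buffer[key] = (ver, value)
            self.buffer_bytes += len(key) + len(value)
            if self.buffer_bytes >= self.buf_limit:
                self._flush_to_l1()   # write a sorted run into the first disk level

    def _flush_to_l1(self):
        pass  # flushing and compaction are outside the scope of this sketch
```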
The lookup operation comprises the following steps (a code sketch is given after these steps):
querying the memory buffer and the cache; if the data is found there, returning it, otherwise proceeding to the next step;
searching the disk levels from L1 to Ln in order; for each level Lb (1 ≤ b ≤ n), a thread pool is maintained whose number of threads is max(S1, S2, ..., Sn); for Lb, Sb read tasks are submitted to the thread pool, and thread j performs a binary search on Lb.j, 1 ≤ j ≤ Sb;
gathering the results of the Sb threads: if any thread has read the data, returning the result with the largest version number and finishing the read; if no thread has read the data, continuing with Lb+1;
if all levels have been searched and the data has still not been found, returning that the data does not exist.
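A minimal sketch of the parallel point lookup using Python's concurrent.futures, assuming each sub-level is a sorted run of (key, version, value) tuples; the thread pool is sized to the maximum number of sub-levels, as described above.

```python
import bisect
from concurrent.futures import ThreadPoolExecutor

def search_run(run, key):
    """Binary search one sub-level, stored as a run sorted by key."""
    i = bisect.bisect_left(run, (key,))   # (key,) sorts before any (key, v, val)
    return run[i] if i < len(run) and run[i][0] == key else None

def parallel_get(levels, key, pool: ThreadPoolExecutor):
    """levels: list of levels (L1 first), each a list of sub-level runs."""
    for level in levels:
        hits = [r for r in pool.map(lambda run: search_run(run, key), level) if r]
        if hits:
            # several sub-levels may hold the key; the largest version wins
            return max(hits, key=lambda rec: rec[1])[2]
    return None  # every level searched without finding the key

# The pool is sized to the maximum number of sub-levels across all levels:
# pool = ThreadPoolExecutor(max_workers=max(len(level) for level in levels))
```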
The range query operation comprises the following (a code sketch follows):
using the Seek(k) interface to find the key-value pair of the smallest key greater than or equal to k: several query tasks are submitted to the thread pool, each thread being responsible for one sub-level or the memory buffer; each thread searches for the smallest key greater than or equal to k by binary search; if no thread reads any data, the result is that the data does not exist; otherwise, for the threads that did read data, an iterator is built over the data read by each, the results are ordered by version number, and the data with the newest version is taken out and returned;
using the Next() interface to find the key-value pair of the smallest key in the system larger than the key currently found: if Seek(k) found data, then when the user submits a Next() request, the iterator that returned the previous result advances with Next(), the data currently pointed to by each iterator is compared again, and the newest data is returned; old versions are skipped during this process.
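A minimal sketch of the Seek/Next range scan over several sorted runs, merging per-run iterators through a heap and skipping stale versions; it assumes the same (key, version, value) run layout as the lookup sketch.

```python
import bisect
import heapq

class RangeScanner:
    """Merge-scan several sorted (key, version, value) runs, newest version first."""
    def __init__(self, runs):
        self.runs = runs
        self.heap = []          # entries: (key, -version, run_index, position)
        self.last_key = None

    def seek(self, k):
        self.heap, self.last_key = [], None
        for idx, run in enumerate(self.runs):
            pos = bisect.bisect_left(run, (k,))      # smallest key >= k in this run
            if pos < len(run):
                key, ver, _ = run[pos]
                heapq.heappush(self.heap, (key, -ver, idx, pos))
        return self.next()

    def next(self):
        while self.heap:
            key, _neg_ver, idx, pos = heapq.heappop(self.heap)
            # advance this run's iterator before deciding what to return
            if pos + 1 < len(self.runs[idx]):
                nkey, nver, _ = self.runs[idx][pos + 1]
                heapq.heappush(self.heap, (nkey, -nver, idx, pos + 1))
            if key == self.last_key:
                continue        # an older version of a key already returned
            self.last_key = key
            return self.runs[idx][pos]               # newest version of this key
        return None             # scan exhausted
```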
In some embodiments of the first aspect of the present invention, the method further comprises: modeling write amplification, and selecting optimal parameters by minimizing write amplification, the steps comprising:
let the number of levels of the LSM tree be n, the number of sub-levels of level b be Sb, and the growth factor of level b be Tb, with 1 ≤ b ≤ n; let D be the boundary between the levels using different compaction algorithms; the write amplification of each level is calculated as follows:
for writing the WAL, the write amplification is 1;
for flushing the memory buffer to disk, the write amplification is buf / Unique^-1(buf), where buf is the maximum number of key-value pairs the buffer can hold, Unique^-1 is the inverse function of Unique(p), Unique(p) = sum over k in K of (1 - (1 - f_X(k))^p), N is the total number of distinct keys in the workload, K is the set of integers of the key space [0, N-1], and f_X(k) is the probability that key k appears in a single write request;
when 1 ≤ b ≤ D, the write amplification of Lb is WA_b = Write_{b+1} / Interval_b, where Write_{b+1} is the amount of data written into Lb+1 during one compaction interval of Lb, Interval_b = Interval_{b-1} * S_b, Interval_0 = Unique^-1(buf), Size_1 = buf * S_1, Size_{(b+1).j} = Write_{b+1}, and Size_{b+1} = Size_{(b+1).j} * S_{b+1};
for LD+1, the write amplification is WA_{D+1} = Write_{D+2} / Interval_{D+1}, where Interval_{D+1} = Interval_D * S_{D+1}, Size_{D+2} = Size_{D+1} * T_{D+2}, and Size_{(D+2).j} = Size_{D+2} / S_{D+2};
when D+2 ≤ b < n, the write amplification of Lb is WA_b = Write_{b+1} / Interval_b, where Interval_b = Interval_{b-1} + DInterval_b and DInterval_b is obtained by solving the expectation equation over the key distance d described in the detailed description, with Size_{b+1} = Size_b * T_{b+1} and Size_{(b+1).j} = Size_{b+1} / S_{b+1};
combining the write amplification of each disk level with the write amplification of the WAL and of the memory-buffer flush gives the write amplification of the whole LSM tree;
fixing the total number of sub-levels of the LSM tree, iteratively evaluating the write amplification under different parameters, and taking the Sb, Tb and D that minimize it.
According to a second aspect of the present invention, there is provided an LSM tree-oriented key-value storage system, comprising:
a first storage unit, which stores the first D levels of an LSM tree comprising n levels and executes compaction tasks using a tiered compaction algorithm that minimizes write amplification, where D denotes a preset level boundary parameter;
a second storage unit, which stores the (D+1)-th level of the LSM tree and executes compaction tasks by a compaction method comprising the steps of:
selecting all files of all sub-levels of the level and adding them to an input file set;
in LD+2, selecting the sub-level LD+2.j that currently holds the least data, and, according to the key range covered by the data of LD+1, selecting from LD+2.j all overlapping files and adding them to the input file set;
merge-sorting the data in the input file set and placing the newly generated files into the selected sub-level of the lower level;
a third storage unit, which stores levels LD+2 to Ln of the LSM tree and executes compaction tasks by a compaction method comprising the following steps:
denoting the j-th sub-level of level Li of the LSM tree as Li.j, the SSTable files within a sub-level being arranged from left to right in increasing key order;
maintaining a compaction pointer at each level for selecting the first input file of a compaction task;
when the total data volume of the i-th level Li exceeds its rated size, triggering a compaction at that level, which writes part of the data of Li into Li+1 so as to reorganize the disk data, wherein, when the compaction task is executed, all sub-levels of Li participate in the task while only one sub-level of Li+1 participates.
The steps the third storage unit executes for one compaction task are the same as the steps of one compaction task of level Li in the LSM-tree-oriented key-value storage method according to the first aspect of the invention.
According to a third aspect of the present invention, there is provided a key-value pair storage device, the device comprising:
one or more processors;
a memory; and
one or more computer programs, stored in the memory and configured for execution by the one or more processors, which when executed by the one or more processors, cause the one or more processors to perform steps comprising a LSM tree oriented key-value storage method according to the first aspect of the invention.
The invention achieves the following beneficial effects:
1. Each level of the LSM tree is finely divided. During a compaction, several sub-levels of the upper level participate, and because each sub-level covers the same key range, the amount of data selected from each sub-level is similar, while only one sub-level of the lower level participates. When data is selected at the lower level, the compaction task is split, which keeps the number of lower-level files participating in the compaction as small as possible and thus reduces the ratio of the lower-level to the upper-level data taking part in it; in other words, to bring a given amount of data into the next level, less lower-level data has to be re-sorted, so write amplification is reduced.
2. Different compaction algorithms are used for different levels; the upper levels use a tiered compaction algorithm that minimizes write amplification, which improves the efficiency of importing data into the lower levels and reduces the occurrence of write stalls.
3. Multi-threaded parallel reading reduces the impact on read performance. The write amplification is modeled and a method for selecting the optimal parameters is provided, maximizing the write performance of the system for a fixed read performance.
Drawings
FIG. 1 is a schematic diagram of an LSM tree according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the compaction algorithm according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of compaction task splitting according to an embodiment of the present invention;
FIG. 4 is a diagram of a parallel read algorithm according to an embodiment of the present invention.
Detailed Description
The technical solution of the present invention is further explained with reference to the drawings and the embodiments.
Fig. 1 is a schematic diagram of an LSM tree according to an embodiment of the invention. As shown in the figure, there is a buffer in memory, and the WAL is an on-disk write-ahead log configured to avoid losing buffered data when the program crashes; the buffer receives the user's write requests. The data on disk is divided into three levels (L1, L2, L3). Each level is divided into three sub-levels, and each sub-level contains multiple SSTable files. The data within a sub-level is sorted, while there is no ordering relationship between the data of different sub-levels. This is equivalent to relaxing the ordering requirement of the original LSM tree: in the original LSM tree the data of each level is strictly sorted, whereas in the LSM tree of the invention the data of each level is divided into several smaller sorted groups.
Fig. 2 is a schematic diagram of the compaction algorithm according to an embodiment of the invention. Each box in the figure represents an SSTable file. For ease of description, it is assumed that a file can hold at most two key-value pairs (in practice each file holds far more than 2), and the numbers in the boxes are the keys of the key-value pairs contained in the file; the corresponding values are not shown. The prime marks on a number indicate how new the value of that key is: taking key 5 as an example, the value of 5'' is newer than the value of 5', and the value of 5' is newer than the value of 5. In a concrete implementation, the order in which key-value pairs are written can be recorded by maintaining a global version number (e.g., a 64-bit integer). Each time a new key-value pair is written, the current version number is encoded into the key-value pair and the global version number is incremented by 1. For example, if the current version number is 1 and a key-value pair is inserted, version 1 is assigned to it and it is stored as <key1, 1, value1>; the version number then becomes 2. If another key-value pair is inserted next, version 2 is assigned to it and it is stored as <key2, 2, value2>; the version number then becomes 3. Thus, when several key-value pairs with the same key are read, their relative age can be determined by comparing their version numbers.
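A small sketch of this version tagging, assuming an internal key of the form (user_key, -version) so that tuple ordering places newer entries first among equal user keys; the representation is illustrative, not the patent's on-disk format.

```python
import itertools

_global_version = itertools.count(1)   # monotonically increasing version counter

def make_internal_key(user_key: bytes):
    """Tag the user key with the current global version; negating the version
    makes newer entries sort first among entries with the same user key."""
    return (user_key, -next(_global_version))

# Example: two successive writes of b"foo" produce (b"foo", -1) and (b"foo", -2);
# sorted ascending, (b"foo", -2) - the newer write - comes first.
```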
When the amount of data in L2 exceeds the maximum it can hold, a compaction task is triggered. An input file is first selected from L2.1. Since the compaction pointer of this level is 6, the file whose minimum key is greater than or equal to 6 and closest to 6, namely SSTable(6', 12'), is selected as the initial file and added to the input file set. The minimum key 6 of this file is recorded as the left boundary of the compaction task and its maximum key 12 as the right boundary. Then, from L2.2 to L2.3, the files lying within this boundary or overlapping it are selected from each sub-level according to the left and right boundaries and added to the input file set; here a total of 4 files are selected: SSTable(5', 8), SSTable(12, 13'), SSTable(5', 7') and SSTable(10', 14'). At this point, the file selection for L2 is complete.
For L3, to keep the figure simple, the keys contained in L3.1 and L3.3 are not shown. Since L3.2 holds the least data, this sub-level is selected to participate in the compaction. Also according to the left and right boundaries, the files SSTable(5, 6), SSTable(7, 9) and SSTable(10, 11) are selected in this sub-level and added to the input file set. Selecting the sub-level with the least data keeps the data volumes of the sub-levels as close as possible after the compaction finishes, but it cannot guarantee any old-new relationship between the data of different sub-levels. To maintain the version relationship between levels (for the same key, the value in the upper level is newer than the value in the lower level), all of L2's data within a certain range must be written to the next level. Therefore, the files selected in L2 must be cut according to the compaction boundary, and the data outside the boundary must be written back to this level; otherwise data in L3 could end up newer than data in L2.
Data outside the boundary is eventually written back to the level, which increases write amplification, so the boundary is expanded according to the minimum and maximum keys of the files selected in each sub-level. The boundary is updated only if the expansion reduces the number of files that have to be cut without introducing new files; that is, the boundary is only expanded around the initially selected files, ensuring that no additional files are added. Otherwise the boundary might keep growing until all files are added to the input set, making the compaction task too large and harming system stability. As shown in the figure, the boundary is initially [6, 12] and is expanded to [5, 12], so neither SSTable(5', 8) nor SSTable(5', 7') needs to be cut.
After the files are selected, they are divided into two parts: (1) the portions of the files selected from L2 that lie within the boundary, plus all files selected from L3; (2) the portions of the files selected from L2 that lie outside the boundary. The first part is merge-sorted to generate 4 new files, SSTable(5'', 6'), SSTable(7', 8), SSTable(9, 10') and SSTable(11, 12), which are placed into L3.2. The second part undergoes an intra-level compaction, i.e., it is also merge-sorted, and the newly generated file SSTable(13', 14') is put back into L2.3. Finally, the compaction pointer is replaced by the right boundary 12 of the task and the files in the input file set are deleted.
To further reduce the write amplification caused by writing files back into L2, a compaction cache is set up for each level to hold the files generated by intra-level compactions. Specifically, an intra-level compaction in L2 may generate two groups of files: the first lies to the left of the left boundary and the second lies to the right of the right boundary. The first group is written to disk, while the second group is stored in the compaction cache instead of being written to disk. The input files of compaction tasks are chosen in a round-robin manner, i.e., when the level performs its next compaction, the right boundary of the current task becomes the left boundary of the next task. The cached files can therefore be read directly from memory, saving one disk read and one disk write per file. Because the boundary expansion is performed and each compaction task has explicit boundaries, compactions (apart from those triggered by L1) do not generate files lying to the left of the left boundary, so the memory occupied by the cached files is small. A cached file is used only once: if a compaction task triggered by L1 includes a file that is in the cache, the file is moved out of the cache, and one disk read and write is still saved.
A computer crash could cause the compaction cache to be lost. To avoid losing data, the disk log records, together with the other compaction metadata (such as the files newly generated by the compaction, the compaction pointer and compaction task statistics), the sub-level to which each cached file belongs, the minimum and maximum keys of the cached file, and the source SSTable files of the cached file; and in the last step of the task the input files related to the cache are not deleted. In this way, when the computer crashes, the data in the compaction cache can be recovered from the input files using this metadata.
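A sketch of the kind of log record and recovery this relies on; the field names are derived from the list above, and the read_sstable callback is a hypothetical stand-in rather than the patent's actual log format.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

@dataclass
class CompactionLogRecord:
    level: int                       # level whose compaction produced the record
    sub_level: int                   # sub-level the cached file belongs to
    cached_min_key: int              # smallest key held in the compaction cache
    cached_max_key: int              # largest key held in the compaction cache
    source_files: List[str]          # retained input SSTables the cached data came from
    new_files: List[str]             # files written to disk by this compaction
    compaction_pointer: int          # right boundary of the finished task
    stats: Dict[str, int] = field(default_factory=dict)  # task statistics

def recover_cache(records: List[CompactionLogRecord],
                  read_sstable: Callable[[str], List[Tuple[int, bytes]]]):
    """On restart, rebuild the lost compaction cache by re-reading the retained
    source files and keeping only the keys inside the cached range."""
    cache = []
    for rec in records:
        for path in rec.source_files:
            cache += [(k, v) for k, v in read_sstable(path)
                      if rec.cached_min_key <= k <= rec.cached_max_key]
    return cache
```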
Fig. 3 shows how splitting the compaction task reduces the amount of lower-level data participating in it. The figure shows one compaction task of L2. After the file selection in L2 is finished, SSTable(1, 2), SSTable(3, 6) and SSTable(7, 10) are first selected in L3.1 as candidate files according to the compaction boundary. Then, for each of the three candidate files, the files of every sub-level currently in the input file set are queried using the candidate's minimum and maximum keys, to decide whether the candidate can stay out of this compaction. A candidate file is not added to the task if every file fi in the input file set satisfies one of the following two conditions: (1) the minimum and maximum keys of the candidate file lie outside the range of fi; (2) the minimum and maximum keys of the candidate lie within the range of fi, but fi can be divided into two parts such that neither part overlaps the candidate.
In the figure, for candidate file SSTable(3, 6), its key range [3, 6] does not overlap the key range [7, 9] of SSTable(7', 9) in the input file set, nor the key range [1, 2] of SSTable(1', 2'). It does overlap the key range [2, 7] of SSTable(2', 7'), but if SSTable(2', 7') is split into SSTable(2') and SSTable(7'), neither part overlaps [3, 6]. Therefore the candidate SSTable(3, 6) does not participate in this compaction. The compaction is split into two subtasks: one is responsible for sorting the data in the range [1, 2] and the other for sorting the data in the range [7, 10], and the two subtasks can run in parallel. This both reduces the number of L3 files participating in the task, lowering write amplification, and increases the parallelism of the compaction, improving its speed.
Fig. 4 shows the read algorithm of the LSM tree of the invention. Because each level is further divided, the number of sub-levels to be read increases, which affects read performance. To improve read performance, the invention adopts a parallel read algorithm. A thread pool is maintained whose number of threads equals the maximum number of sub-levels of any level of the LSM tree. When a read request arrives and the corresponding data is not found in memory, the disk data must be queried. L1 is queried first: thread j is responsible for querying L1.j. When every sub-level has been queried, the results of the threads are gathered. If any thread found a result, the candidates are compared by version number, and the newest result is returned. If no thread found a result, the query proceeds to L2. This is repeated until a result is found and returned, or every level has been searched without finding the data, in which case the result is that the data does not exist.
The number of sub-levels Si of each level, the growth factor Ti of each level, and the boundary D between the levels using the tiered compaction algorithm and the levels using the fine-grained compaction algorithm of Fig. 2 all have a large influence on system performance. Therefore, a model is built to express the write amplification of the system under different parameters, and by minimizing the write amplification, the parameters that optimize the write performance of the system are obtained.
Assume the key space K of the workload is the range [0, N-1], where N is the total number of distinct keys in the workload. The keys follow some distribution X, such as a uniform distribution or a Zipf distribution, and the probability that key k appears in a single write request is f_X(k). For example, when the keys are uniformly distributed, f_X(k) = 1/N; when the keys follow a Zipf distribution, f_X(k) is proportional to 1/h(k)^s, where s represents the degree of data skew and h maps each key to an integer rank in the key space K. For p requests, the number of distinct keys that appear is Unique(p) = sum over k in K of (1 - (1 - f_X(k))^p). The inverse function of Unique(p) is Unique^-1; since Unique(p) is monotonic, Unique^-1 can be solved by extending its domain to the real numbers. Then, when k files of sizes u1, u2, ..., uk are compacted together, the total size of the newly generated files is Unique(Unique^-1(u1) + Unique^-1(u2) + ... + Unique^-1(uk)).
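Unique(p) and its inverse can be evaluated numerically; the sketch below illustrates one such computation for a Zipf key distribution, using bisection for the inverse (the patent does not prescribe a particular numerical method).

```python
def zipf_probs(n_keys: int, s: float):
    """f_X(k) for a Zipf distribution with skew s over keys 0..n_keys-1."""
    weights = [1.0 / (rank ** s) for rank in range(1, n_keys + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def unique(p: float, probs):
    """Expected number of distinct keys appearing in p write requests."""
    return sum(1.0 - (1.0 - f) ** p for f in probs)

def unique_inv(u: float, probs, hi: float = 1e12):
    """Numerical inverse of unique(): requests needed so that u distinct keys
    are expected, found by bisection on the monotonic function."""
    lo = 0.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if unique(mid, probs) < u:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```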
The write overhead is modeled by characterizing the write amplification of each level. The WAL is written before data enters the memory buffer, so its write amplification is WA_buf = 1.
Let buf be the number of key-value pairs the memory buffer can hold, and regard the buffer as L0, i.e., Size_0 = buf. When the buffer reaches its capacity threshold, its data is written to L1 in one batch. Since the buffer contains no duplicate keys, the number of write requests needed to fill the buffer from empty is Unique^-1(buf), which is also the interval Interval_0 at which all of this level's data is written to the next level. The amount of data written to disk per flush is buf, so the write amplification of flushing memory to disk is WA_{0->1} = buf / Unique^-1(buf). A sub-level L1.j of L1 has size Size_{1.j} = buf, and the total size of L1 is Size_1 = buf * S_1.
The write amplification caused by disk compactions is calculated from the amount of data written to the level below within a certain interval.
For Li (1 ≤ i ≤ D), the tiered compaction algorithm is used, and a compaction is triggered when the number of sub-levels of the level reaches Si. The time needed to add one sub-level to this level is the interval between two compactions of Li-1, so the interval between compactions of this level is Interval_i = Interval_{i-1} * S_i. Within this interval, the amount of data written to Li+1 is Write_{i+1} = Unique(S_i * Unique^-1(Size_{i.j})), i.e., the size of the new files obtained by merging the level's Si sub-levels. Thus the write amplification of Li is WA_{i->i+1} = Write_{i+1} / Interval_i, and the sub-level size of Li+1 is Size_{(i+1).j} = Write_{i+1}, with Size_{i+1} = Size_{(i+1).j} * S_{i+1}.
For LD+1, a compaction is triggered when the number of sub-levels of the level reaches SD+1. The time needed for each added sub-level is the interval between two compactions of LD, so the interval between compactions of this level is Interval_{D+1} = Interval_D * S_{D+1}. Within this interval, the amount of data written to LD+2 is Write_{D+2}, the size of the new files generated by merging the S_{D+1} sub-levels of LD+1 with the selected sub-level of LD+2, computed with the Unique function in the same way, where Size_{D+2} = Size_{D+1} * T_{D+2} and the size of the j-th sub-level is Size_{(D+2).j} = Size_{D+2} / S_{D+2}. Thus the write amplification of LD+1 is WA_{D+1->D+2} = Write_{D+2} / Interval_{D+1}.
For Li (D+2 ≤ i < n), since the range of data covered by each sub-level is the same, every compaction task selects essentially the same range of data from each sub-level of Li, so the analysis can be carried out on the first sub-level Li.1 of the level. Let DInterval_i be the number of requests between two compactions of Li.1 over the same key range, and let d (0 ≤ d ≤ N-1) be the one-way distance between a key of Li.1 and LastKey, the key at which the previous compaction to the next level ended. For a fixed d, if the level contains a key k1 at distance d from LastKey, then, because compaction proceeds through the key space in order, the sub-level has received DInterval_i * d / (N * S_i) new requests since that key range was last compacted. If key k1 appears among these new requests, then Li.1 contains k1; the probability of this is 1 - (1 - f_X(k1))^(DInterval_i * d / (N * S_i)). Assuming P(LastKey = k) = 1/N for any k in K and considering all keys, the probability that Li.1 contains a key at distance d from LastKey is (1/N) * sum over k in K of (1 - (1 - f_X(k))^(DInterval_i * d / (N * S_i))). Summing this over all d gives the expected amount of data in Li.1; setting it equal to the size of Li.1 yields an equation from which DInterval_i of the level is obtained. The compaction interval of this level is then Interval_i = Interval_{i-1} + DInterval_i, and within this interval the amount of data written to the level below is Write_{i+1}, where Size_{i+1} = Size_i * T_{i+1} and the size of the j-th sub-level is Size_{(i+1).j} = Size_{i+1} / S_{i+1}. Thus the write amplification of this level is WA_{i->i+1} = Write_{i+1} / Interval_i.
Adding up all the WA terms gives the total WA of the LSM tree. The read performance is mainly affected by the total number of sub-levels of the LSM tree: in general, the more sub-levels, the more IO operations a read needs. The total number of sub-levels of the LSM tree, S_1 + S_2 + ... + S_n, is therefore fixed to a set value; the total WA under different parameters is obtained through iteration, and the S_i, T_i and D that give the minimum WA are recorded.
The parameter optimization algorithm according to the embodiment of the invention is as follows:
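The pseudocode itself appears only as a figure in the original publication; the following Python sketch shows one way such an exhaustive search could be organized, treating the per-level model above as a black-box function total_wa(S, T, D), which is a hypothetical name.

```python
from itertools import product

def compositions(total, parts):
    """All ways of splitting `total` sub-levels into `parts` positive integers."""
    if parts == 1:
        yield (total,)
        return
    for first in range(1, total - parts + 2):
        for rest in compositions(total - first, parts - 1):
            yield (first,) + rest

def optimize_parameters(total_sub_levels, n_levels, t_choices, total_wa):
    """Exhaustive search: with the total number of sub-levels fixed (this pins
    the read cost), try distributions S, growth factors T and boundaries D, and
    keep the combination whose modeled write amplification is smallest."""
    best_wa, best_cfg = float("inf"), None
    for D in range(1, n_levels):                         # level boundary D
        for S in compositions(total_sub_levels, n_levels):
            for T in product(t_choices, repeat=n_levels):
                wa = total_wa(S, T, D)                   # model described above
                if wa < best_wa:
                    best_wa, best_cfg = wa, (S, T, D)
    return best_wa, best_cfg
```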
according to another embodiment of the present invention, there is provided an LSM tree-oriented key-value storage system, including:
a first storage unit, which stores the first D levels of an LSM tree comprising n levels and executes compaction tasks using a tiered compaction algorithm that minimizes write amplification, where D denotes a preset level boundary parameter;
a second storage unit, which stores the (D+1)-th level of the LSM tree and executes compaction tasks by a compaction method comprising the steps of:
selecting all files of all sub-levels of the level and adding them to an input file set;
in LD+2, selecting the sub-level LD+2.j that currently holds the least data, and, according to the key range covered by the data of LD+1, selecting from LD+2.j all overlapping files and adding them to the input file set;
merge-sorting the data in the input file set and placing the newly generated files into the selected sub-level of the lower level;
a third storage unit, which stores levels LD+2 to Ln of the LSM tree and executes compaction tasks by a compaction method comprising the following steps:
denoting the j-th sub-level of level Li of the LSM tree as Li.j, the SSTable files within a sub-level being arranged from left to right in increasing key order;
maintaining a compaction pointer at each level for selecting the first input file of a compaction task;
when the total data volume of the i-th level Li exceeds its rated size, triggering a compaction at that level, which writes part of the data of Li into Li+1 so as to reorganize the disk data, wherein, when the compaction task is executed, all sub-levels of Li participate in the task while only one sub-level of Li+1 participates.
The steps the third storage unit executes for one compaction task are the same as those of one compaction task of level Li in the foregoing method embodiment, and are not described again here.
The key-value storage system maintains a global version number (e.g., a 64-bit integer); each time a new key-value pair is written, the current version number is encoded into the key-value pair and the global version number is incremented by 1. For example, if the current version number is 1 and a key-value pair is inserted, version 1 is assigned to it and it is stored as <key1, 1, value1>; the version number then becomes 2. If another key-value pair is inserted next, version 2 is assigned to it and it is stored as <key2, 2, value2>; the version number then becomes 3. Thus, when several key-value pairs are read, their relative age can be determined by comparing their version numbers.
In the hierarchical manner as described above, the write operation of the key-value storage system includes:
acquiring the global version number maintained for key-value pairs, incrementing it, and encoding it into the key;
writing the data into the WAL in append-only mode;
writing the data into a memory buffer, and returning;
lookup operations of a key-value store system include:
querying the memory buffer and the cache; if the data is found there, returning it, otherwise proceeding to the next step;
searching the disk levels from L1 to Ln in order; for each level Lb (1 ≤ b ≤ n), a thread pool is maintained whose number of threads is max(S1, S2, ..., Sn); for Lb, Sb read tasks are submitted to the thread pool, and thread j performs a binary search on Lb.j, 1 ≤ j ≤ Sb;
gathering the results of the Sb threads: if any thread has read the data, returning the result with the largest version number and finishing the read; if no thread has read the data, continuing with Lb+1;
if all levels have been searched and the data has still not been found, returning that the data does not exist.
The range query operation comprises:
using the Seek(k) interface to find the key-value pair of the smallest key greater than or equal to k: several query tasks are submitted to the thread pool, each thread being responsible for one sub-level or the memory buffer; each thread searches for the smallest key greater than or equal to k by binary search; if no thread reads any data, the result is that the data does not exist; otherwise, for the threads that did read data, an iterator is built over the data read by each, the results are ordered by version number, and the data with the newest version is taken out and returned;
using the Next() interface to find the key-value pair of the smallest key in the system larger than the key currently found: if Seek(k) found data, then when the user submits a Next() request, the iterator that returned the previous result advances with Next(), the data currently pointed to by each iterator is compared again, and the newest data is returned; old versions are skipped during this process.
A model can likewise be built to express the write amplification of the system under different parameters; by minimizing the write amplification, the parameters that optimize the write performance of the system are obtained. The specific modeling steps are the same as in the foregoing method embodiments and are not repeated here.
There is also provided, in accordance with another embodiment of the present invention, a key-value pair storage device, including:
one or more processors;
a memory; and
one or more computer programs stored in the memory and configured for execution by the one or more processors, which when executed by the one or more processors, cause the one or more processors to perform steps comprising a LSM tree oriented key-value storage method as described in the preceding method embodiment.
As will be appreciated by one of skill in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
It is apparent that those skilled in the art can make various modifications and variations to the embodiments of the present invention without departing from the spirit and scope of the embodiments of the present invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the embodiments of the present invention and their equivalents, the embodiments of the present invention are also intended to include such modifications and variations.

Claims (10)

1. An LSM-tree-oriented key-value storage method, the method comprising:
dividing each level of the LSM tree into a plurality of sub-levels, where the j-th sub-level of the i-th level is denoted Li.j and the SSTable files within a sub-level are arranged from left to right in increasing key order;
maintaining a compaction pointer at each level for selecting the first input file of a compaction task;
when the total data volume of the i-th level Li exceeds its rated size, triggering a compaction at that level, which writes part of the data of Li into Li+1 so as to reorganize the disk data, wherein, when the compaction task is executed, all sub-levels of Li participate in the task while only one sub-level of Li+1 participates.
2. The LSM-tree-oriented key-value storage method of claim 1, wherein a single compaction task of level Li comprises the steps of:
according to the compaction pointer of Li, selecting in the first sub-level Li.1 the SSTable file whose minimum key is greater than or equal to the pointer and closest to it as the initial input file of the task, adding it to the input file set of the compaction task, and taking the minimum key of that file as the left boundary of the task and its maximum key as the right boundary;
for the other sub-levels Li.2, Li.3, ..., Li.Si of level Li, selecting in turn the files lying partly or wholly within the left and right boundaries and adding them to the input file set, where Si denotes the number of sub-levels into which level Li is divided;
expanding the boundaries of the current task according to the minimum and maximum keys of the files in the input file set, so that the task contains more files lying completely within the boundaries;
in Li+1, selecting the sub-level Li+1.j that currently holds the least data; according to the task boundaries, selecting from Li+1.j the files lying within or overlapping the boundaries and adding them to a candidate file set; splitting the compaction task according to the files in the candidate file set, and, after the split, adding to the input file set only the files that actually need to participate in the task;
for the input file set, merge-sorting the data of Li lying within the task boundaries together with the data of Li+1, and writing the newly generated files into Li+1.j;
for the input file set, merge-sorting the data of Li lying outside the task boundaries; among the newly generated files, writing those with data smaller than the left boundary of the task back into Li, and placing those with data larger than the right boundary into the compaction cache of Li; recording in the log the minimum and maximum keys of each cached file together with the files of the input file set that overlap the cached files, and deleting the files of the input file set that are not recorded in the log;
replacing the compaction pointer of Li with the right boundary of the compaction task.
3. The LSM-tree-oriented key-value storage method of claim 2, wherein the specific step of splitting the task comprises:
for each file in the candidate file set, obtaining from the metadata in memory the minimum key kmin and the maximum key kmax it contains;
querying the files in the input file set according to kmin and kmax: if, for every file in the input file set, [kmin, kmax] does not overlap the file, or no other key-value pair of the file lies between its largest key smaller than kmin and its smallest key larger than kmax, then the candidate file is removed from the candidate file set and, according to kmin and kmax, the files of the input file set are cut into two parts, one containing the keys smaller than kmin and the other containing the keys larger than kmax; otherwise, the candidate file is moved out of the candidate file set and added to the input file set.
4. The LSM-tree-oriented key-value storage method of claim 1, wherein Li is one of the levels LD+2 to Ln, where n is the number of levels of the LSM tree, D is a preset level boundary parameter, and 1 ≤ D ≤ n; the method further comprises the following steps:
for L1 to LD, adopting a tiered compaction algorithm: sorting all data of the level at once and writing the newly generated files into the next level, where they form a new sub-level; during this process no lower-level data participates in the sort;
for LD+1, sorting all data of the level together with the data of one sub-level of the level below, and writing the newly generated data into the selected sub-level of the level below.
5. The LSM-tree-oriented key-value storage method of claim 4, wherein, under this hierarchical organization, the write operation comprises:
acquiring the global version number maintained for key-value pairs, incrementing it, and encoding it into the key;
writing the data into the WAL in append-only mode;
writing the data into the memory buffer and returning;
and the lookup operation comprises the following steps:
querying the memory buffer and the cache; if the data is found there, returning it, otherwise proceeding to the next step;
searching the disk levels from L1 to Ln in order; for each level Lb (1 ≤ b ≤ n), a thread pool is maintained whose number of threads is max(S1, S2, ..., Sn); for Lb, Sb read tasks are submitted to the thread pool, and thread j performs a binary search on Lb.j, 1 ≤ j ≤ Sb;
gathering the results of the Sb threads: if any thread has read the data, returning the result with the largest version number and finishing the read; if no thread has read the data, continuing with Lb+1;
if all levels have been searched and the data has still not been found, returning that the data does not exist.
6. The LSM tree oriented key-value storage method of claim 5, wherein the range query operation comprises:
searching a key value pair corresponding to the minimum key which is greater than or equal to k by utilizing a Seek (k) interface: submitting a plurality of query tasks to a thread pool, wherein each thread is responsible for querying a sub-layer or a memory buffer, each thread searches a minimum key which is greater than or equal to k through a dichotomy, and if each thread does not read data, the returned data does not exist; otherwise, for the thread reading the data, constructing an iterator from the read data, sequencing the read data according to the version number, and taking out the data with the latest version and returning;
and finding the key-value pair corresponding to the smallest key larger than the currently found key by means of a Next() interface: if Seek(k) has found data, then when the user submits a Next() request, the iterator that returned the previous result advances by Next(), the data currently pointed to by each iterator is compared again, and the latest data is returned, older versions of the same key being ignored during this process.
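The Seek(k)/Next() behaviour of claim 6 can be illustrated with one cursor per sub-layer, as below; the names RangeIter, seek_all and range_scan are assumptions, and details such as tombstones or a heap over the cursors are left out.

import bisect

class RangeIter:
    # One cursor per sub-layer (or memory buffer), positioned by Seek(k).
    def __init__(self, sublayer, k):
        self.data = sublayer                           # sorted (key, version, value)
        self.pos = bisect.bisect_left(sublayer, (k,))  # first key >= k

    def peek(self):
        return self.data[self.pos] if self.pos < len(self.data) else None

    def next(self):
        self.pos += 1

def seek_all(sublayers, k):
    # Seek(k): every sub-layer is probed (in the patent, by threads from a
    # pool); only cursors that found something are kept.
    its = [RangeIter(s, k) for s in sublayers]
    return [it for it in its if it.peek() is not None]

def smallest(iterators):
    # Pick the cursor holding the smallest key; on ties the largest version
    # wins, and stale versions of the same key are skipped afterwards.
    return min(iterators, key=lambda it: (it.peek()[0], -it.peek()[1]))

def range_scan(sublayers, k, count):
    its, out, last_key = seek_all(sublayers, k), [], None
    while len(out) < count and its:
        it = smallest(its)
        key, ver, val = it.peek()
        if key != last_key:          # older versions of last_key are ignored
            out.append((key, val))
            last_key = key
        it.next()
        its = [i for i in its if i.peek() is not None]
    return out

# Usage
subs = [[(1, 1, "a"), (4, 1, "d")], [(2, 2, "b"), (4, 3, "d-new")]]
print(range_scan(subs, 2, 3))   # -> [(2, 'b'), (4, 'd-new')]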
7. The LSM tree-oriented key-value storage method of any of claims 1-6, further comprising: modeling write amplification, and selecting optimal parameters by minimizing write amplification, the steps comprising:
let the number of LSM tree layers be n, the number of sub-layers in layer b be S_b, and the growth factor of layer b be T_b, where 1 ≤ b ≤ n; let the boundary between the layers adopting different compaction algorithms be D; calculating the write amplification of each layer as follows:
for writing the WAL, the write amplification is 1;
for flushing the memory buffer to disk, the write amplification is buf / Unique^{-1}(buf), where buf is the maximum number of key-value pairs the buffer can hold, Unique^{-1}(k) is the inverse function of Unique(p), Unique(p) = Σ_{k∈K} (1 − (1 − f_X(k))^p), N is the total number of distinct keys in the workload, K is the set of integers in the key space [0, N−1], and f_X(k) is the probability that key k occurs in a single write request;
when 1 ≤ b ≤ D, the write amplification of L_b is given by formulas that appear only as images in the original document and are not reproduced here, wherein Interval_b = Interval_{b-1} * S_b, Interval_0 = Unique^{-1}(buf), Size_1 = buf * S_1, and Size_{b+1} = Size_{(b+1).j} * S_{b+1};
for L_{D+1}, the write amplification is given by formulas that appear only as images in the original document, wherein Interval_{D+1} = Interval_D * S_{D+1}, Size_{D+2} = Size_{D+1} * T_{D+2}, and Size_{(D+2).j} = Size_{D+2} / S_{D+2};
when D+2 ≤ b < n, the write amplification of L_b is given by a formula that appears only as an image in the original document, wherein Interval_b = Interval_{b-1} + DInterval_b, DInterval_b is obtained by solving an equation that likewise appears only as an image, Size_{b+1} = Size_b * T_{b+1}, and Size_{(b+1).j} = Size_{b+1} / S_{b+1};
the write amplification of each disk layer, of writing the WAL, and of flushing the memory buffer are combined to give the write amplification of the entire LSM tree;
fixing the total number of sub-layers of the LSM tree, iteratively solving the write amplification under different parameters, and obtaining the S_b, T_b and D that minimize the write amplification.
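Of the model in claim 7, only Unique(p), its inverse and the buffer-flush term are spelled out in the text; the per-layer formulas exist only as images. The sketch below therefore implements the stated parts and leaves the per-layer term as a caller-supplied function, with a purely hypothetical stub in the usage lines.

import itertools

def unique(p, freqs):
    # Unique(p) = sum over the key space of 1 - (1 - f_X(k))**p, i.e. the
    # expected number of distinct keys among p writes; freqs[k] = f_X(k).
    return sum(1.0 - (1.0 - f) ** p for f in freqs)

def unique_inv(target, freqs, hi=1 << 40):
    # Numerical inverse of Unique(p): smallest p with Unique(p) >= target
    # (Unique is monotonically increasing in p, so binary search works).
    lo = 1
    while lo < hi:
        mid = (lo + hi) // 2
        if unique(mid, freqs) < target:
            lo = mid + 1
        else:
            hi = mid
    return lo

def flush_wa(buf, freqs):
    # Write amplification of flushing the memory buffer: buf / Unique^{-1}(buf).
    return buf / unique_inv(buf, freqs)

def total_wa(S, T, D, buf, freqs, layer_wa):
    # 1 (WAL) + buffer-flush WA + per-disk-layer WA.  `layer_wa(b, S, T, D)`
    # must supply the per-layer formulas, which are images in the original.
    return 1.0 + flush_wa(buf, freqs) + sum(
        layer_wa(b, S, T, D) for b in range(1, len(S) + 1))

def best_parameters(total_sublayers, n, buf, freqs, layer_wa, growth_choices=(5, 10)):
    # Fix the total sub-layer budget, enumerate S_b, T_b and D, and keep the
    # combination that minimizes the modeled write amplification.
    best = None
    for S in itertools.product(range(1, total_sublayers + 1), repeat=n):
        if sum(S) != total_sublayers:
            continue
        for T in itertools.product(growth_choices, repeat=n):
            for D in range(1, n + 1):
                wa = total_wa(S, T, D, buf, freqs, layer_wa)
                if best is None or wa < best[0]:
                    best = (wa, S, T, D)
    return best

# Usage with a deliberately trivial, hypothetical stand-in for the per-layer term:
freqs = [1.0 / 100] * 100                      # uniform workload over 100 keys
def stub(b, S, T, D): return T[b - 1] / S[b - 1]
print(best_parameters(total_sublayers=6, n=2, buf=32, freqs=freqs, layer_wa=stub))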
8. An LSM tree-oriented key-value storage system, comprising:
a first storage unit which stores the first D layers of an LSM tree comprising n layers and executes compaction tasks by adopting a tiered compaction algorithm so as to minimize write amplification, wherein D denotes a set layer boundary parameter;
a second storage unit which stores the (D+1)-th layer of the LSM tree and executes compaction tasks by a compaction method comprising the following steps:
selecting all files of all sub-layers of the layer and adding the files into an input file set;
in layer L_{D+2}, selecting the sub-layer L_{D+2,j} with the least current data volume; according to the key range covered by the L_{D+1} data, selecting from L_{D+2,j} all overlapping files and adding them into the input file set;
performing a multi-way merge sort on the data in the input file set, and placing the newly generated files into the selected sub-layer of the lower layer;
a third storage unit which stores layers L_{D+2} to L_n of the LSM tree and executes compaction tasks by adopting a compaction method comprising the following steps:
denoting the j-th sub-layer of the i-th layer L_i of the LSM tree as L_{i,j}, the SSTable files within a sub-layer being arranged from left to right in increasing order of key range;
maintaining a compaction pointer at each layer for selecting the first input file of a compaction task;
when the total data volume of the i-th layer L_i exceeds its rated size, triggering one compaction at that layer and writing part of the data of layer L_i into layer L_{i+1} so as to reorganize the disk data, wherein, when the compaction task is executed, all sub-layers of layer L_i participate in the task while only one sub-layer of layer L_{i+1} participates.
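The compaction input selection of claim 8 (all sub-layers of the upper layer plus the least-filled sub-layer of the lower layer) might look roughly like this; files are represented as plain dicts with kmin/kmax/size, which is an editorial simplification.

def pick_lower_sublayer(lower_sublayers):
    # Choose the sub-layer of the lower layer currently holding the least data.
    return min(range(len(lower_sublayers)),
               key=lambda j: sum(f["size"] for f in lower_sublayers[j]))

def build_input_set(upper_sublayers, lower_sublayers, key_range):
    # Every file of every sub-layer of the upper layer, plus the overlapping
    # files of one least-filled sub-layer of the lower layer.
    j = pick_lower_sublayer(lower_sublayers)
    lo, hi = key_range                    # key range covered by the upper data
    inputs = [f for sub in upper_sublayers for f in sub]
    inputs += [f for f in lower_sublayers[j]
               if not (f["kmax"] < lo or f["kmin"] > hi)]
    return j, inputs

# Usage
upper = [[{"kmin": 0, "kmax": 50, "size": 4}], [{"kmin": 10, "kmax": 90, "size": 6}]]
lower = [[{"kmin": 0, "kmax": 40, "size": 9}], [{"kmin": 60, "kmax": 99, "size": 2}]]
print(build_input_set(upper, lower, (0, 90)))   # picks sub-layer 1, three input files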
9. The LSM tree-oriented key-value storage system of claim 8, wherein said third storage unit executes a compaction task through the following steps:
according to the compaction pointer of L_i, selecting, in the first sub-layer L_{i,1} of L_i, the SSTable file whose minimum key is greater than or equal to the pointer and closest to it as the initial input file of the task, adding it to the input file set of the compaction task, and taking the minimum key of the file as the left boundary of the task and the maximum key of the file as the right boundary of the task;
for the other sub-layers L_{i,j} of this layer (the sub-layer expression appears only as an image in the original), sequentially selecting part or all of the files lying within the left and right boundaries and adding them into the input file set, where S_i denotes the number of sub-layers into which layer L_i is divided;
expanding the boundaries of the current task according to the minimum and maximum keys of the files in the input file set, so that the task includes more files that lie entirely within the boundaries;
in layer L_{i+1}, selecting the sub-layer L_{i+1,j} with the least current data volume; according to the task boundaries, selecting from L_{i+1,j} the files that lie within or overlap the boundaries and adding them into a candidate file set; partitioning the compaction task by means of the files in the candidate file set, and adding into the input file set only the files that still need to participate in the task after partitioning;
for the input file set, performing a multi-way merge sort on the data located in L_i within the task boundaries together with the data located in L_{i+1}, generating new files and writing them into L_{i+1,j};
for the input file set, performing a multi-way merge sort on the data located in L_i outside the task boundaries; writing the portion of the newly generated files that is smaller than the left boundary of the task back into L_i, and writing the portion larger than the right boundary of the task into the compaction cache of this layer; recording in the log the minimum and maximum keys of the cached file together with the files in the input file set that overlap with it, and deleting the files in the input file set that are not recorded in the log;
and updating the compaction pointer of the current layer to the right boundary of this compaction task.
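A sketch of how the third storage unit of claim 9 could seed a compaction task from the layer's compaction pointer; the helper names are invented, and the second pass that pulls in files fully covered by the widened boundaries, as well as the lower-layer candidate handling, is omitted.

def pick_initial_file(first_sublayer, compact_ptr):
    # In L_{i,1}, take the file whose minimum key is the smallest one that is
    # greater than or equal to the layer's compaction pointer.
    eligible = [f for f in first_sublayer if f["kmin"] >= compact_ptr]
    return min(eligible, key=lambda f: f["kmin"]) if eligible else None

def collect_within(sublayer, left, right):
    # Files of one sub-layer that fall fully or partly inside [left, right].
    return [f for f in sublayer if not (f["kmax"] < left or f["kmin"] > right)]

def expand_boundaries(inputs):
    # Stretch the task boundaries to the min/max keys of the selected files.
    return min(f["kmin"] for f in inputs), max(f["kmax"] for f in inputs)

def start_compaction(layer_sublayers, compact_ptr):
    first = pick_initial_file(layer_sublayers[0], compact_ptr)
    if first is None:
        return None
    left, right = first["kmin"], first["kmax"]
    inputs = [first]
    for sub in layer_sublayers[1:]:            # the other sub-layers of the layer
        inputs += collect_within(sub, left, right)
    left, right = expand_boundaries(inputs)    # widen once all files are in
    # After the task completes, the layer's pointer moves to the right boundary.
    return inputs, (left, right), right

# Usage: a pointer at key 15 picks the file covering [20, 35] to seed the task.
subs = [[{"kmin": 0, "kmax": 9}, {"kmin": 20, "kmax": 35}],
        [{"kmin": 25, "kmax": 60}]]
print(start_compaction(subs, 15))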
10. A key-value pair storage device, the device comprising:
one or more processors;
a memory; and
one or more computer programs stored in the memory and configured for execution by the one or more processors, which, when executed by the one or more processors, cause the one or more processors to perform the steps of the LSM tree-oriented key-value storage method of any of claims 1-7.
CN202110573140.8A 2021-05-25 2021-05-25 LSM tree-oriented key value storage method and storage system Active CN113297136B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110573140.8A CN113297136B (en) 2021-05-25 2021-05-25 LSM tree-oriented key value storage method and storage system
PCT/CN2021/103902 WO2022246953A1 (en) 2021-05-25 2021-07-01 Key-value storage method and storage system for lsm tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110573140.8A CN113297136B (en) 2021-05-25 2021-05-25 LSM tree-oriented key value storage method and storage system

Publications (2)

Publication Number Publication Date
CN113297136A true CN113297136A (en) 2021-08-24
CN113297136B CN113297136B (en) 2023-11-03

Family

ID=77325052

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110573140.8A Active CN113297136B (en) 2021-05-25 2021-05-25 LSM tree-oriented key value storage method and storage system

Country Status (2)

Country Link
CN (1) CN113297136B (en)
WO (1) WO2022246953A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117891395B (en) * 2023-12-26 2024-07-16 天津中科曙光存储科技有限公司 Data storage method, device, computer equipment and storage medium
CN117785890B (en) * 2024-02-27 2024-06-28 支付宝(杭州)信息技术有限公司 Data traversal query method based on LSM tree and related equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804019B (en) * 2017-04-27 2020-07-07 华为技术有限公司 Data storage method and device
CN111352908B (en) * 2020-02-28 2023-10-10 北京奇艺世纪科技有限公司 LSM-based data storage method and device, storage medium and computer equipment

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038206A (en) * 2017-01-17 2017-08-11 阿里巴巴集团控股有限公司 The method for building up of LSM trees, the method for reading data and server of LSM trees
CN107247624A (en) * 2017-06-05 2017-10-13 安徽大学 A kind of cooperative optimization method and system towards Key Value systems
CN107291541A (en) * 2017-06-23 2017-10-24 安徽大学 Towards the compaction coarseness process level parallel optimization method and system of Key Value systems
CN111226205A (en) * 2017-08-31 2020-06-02 美光科技公司 KVS tree database
US20200183906A1 (en) * 2018-12-07 2020-06-11 Vmware, Inc. Using an lsm tree file structure for the on-disk format of an object storage platform
US20200201821A1 (en) * 2018-12-21 2020-06-25 Vmware, Inc. Synchronization of index copies in an lsm tree file system
US20200201822A1 (en) * 2018-12-21 2020-06-25 Vmware, Inc. Lockless synchronization of lsm tree metadata in a distributed system
CN110347336A (en) * 2019-06-10 2019-10-18 华中科技大学 A kind of key assignments storage system based on NVM with SSD mixing storage organization

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
YUNPENG CHAI et al.: "LDC: A Lower-Level Driven Compaction Method to Optimize SSD-Oriented Key-Value Stores", 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 722-733 *
ZHANG, WEITAO et al.: "Deduplication Triggered Compaction for LSM-tree Based Key-Value Store", Proceedings of 2018 IEEE 9th International Conference on Software Engineering and Service Science (ICSESS), pages 719-722 *
ZHANG Weitao: "Performance Optimization of KV Databases Based on LSM-tree", China Doctoral Dissertations Full-text Database, Information Science and Technology, pages 138-33 *
RAO Yulin: "Research on Optimization of the LSM-Tree-Based Persistent Cache Mechanism", China Master's Theses Full-text Database, Information Science and Technology, pages 138-227 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113721863A (en) * 2021-11-02 2021-11-30 支付宝(杭州)信息技术有限公司 Method and device for managing data
CN113721863B (en) * 2021-11-02 2021-12-31 支付宝(杭州)信息技术有限公司 Method and device for managing data
CN114237507A (en) * 2021-11-02 2022-03-25 支付宝(杭州)信息技术有限公司 Method and device for managing data
CN114237507B (en) * 2021-11-02 2024-04-12 支付宝(杭州)信息技术有限公司 Method and device for managing data
CN114817263A (en) * 2022-04-28 2022-07-29 北京达佳互联信息技术有限公司 Data processing method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN113297136B (en) 2023-11-03
WO2022246953A1 (en) 2022-12-01

Similar Documents

Publication Publication Date Title
CN113297136A (en) LSM tree-oriented key value storage method and storage system
US11693830B2 (en) Metadata management method, system and medium
CN110825748B (en) High-performance and easily-expandable key value storage method by utilizing differentiated indexing mechanism
US9626422B2 (en) Systems and methods for reslicing data in a relational database
US9378232B2 (en) Framework for numa affinitized parallel query on in-memory objects within the RDBMS
US20170212680A1 (en) Adaptive prefix tree based order partitioned data storage system
Levandoski et al. LLAMA: A cache/storage subsystem for modern hardware
Bernstein et al. Optimizing optimistic concurrency control for tree-structured, log-structured databases
US7418544B2 (en) Method and system for log structured relational database objects
US8229968B2 (en) Data caching for distributed execution computing
US20160117354A1 (en) Method and system for dynamically partitioning very large database indices on write-once tables
US20200210399A1 (en) Signature-based cache optimization for data preparation
JPH02230373A (en) Data base processing system
US11714794B2 (en) Method and apparatus for reading data maintained in a tree data structure
US6745198B1 (en) Parallel spatial join index
CN113906406A (en) Database management system
JP6598997B2 (en) Cache optimization for data preparation
US7774304B2 (en) Method, apparatus and program storage device for managing buffers during online reorganization
CN116186085A (en) Key value storage system and method based on cache gradient cold and hot data layering mechanism
US20180011897A1 (en) Data processing method having structure of cache index specified to transaction in mobile environment dbms
WO2015129109A1 (en) Index management device
CN113204520B (en) Remote sensing data rapid concurrent read-write method based on distributed file system
US11625386B2 (en) Fast skip list purge
US20220335030A1 (en) Cache optimization for data preparation
US20230177034A1 (en) Method for grafting a scion onto an understock data structure in a multi-host environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant