CN112395213B

CN112395213B - ACEH index structure and method based on memory hot spot data

Info

Publication number: CN112395213B
Application number: CN202011296272.2A
Authority: CN
Inventors: 何水兵; 朱彤; 曾令仿; 段雪豪; 银燕龙
Original assignee: Zhejiang University ZJU; Zhejiang Lab
Current assignee: Zhejiang University ZJU; Zhejiang Lab
Priority date: 2020-11-18
Filing date: 2020-11-18
Publication date: 2023-05-30
Anticipated expiration: 2040-11-18
Also published as: CN112395213A

Abstract

The invention discloses an ACEH index structure and method for in-memory hotspot data. The structure comprises directory entries, segments, and data buckets. In the method, directory entries index segments through the global depth G, and each segment corresponds to a group of data buckets. Segments index data buckets through the local depth L, where $L = G - \log_2 k$ and k represents the number of pointers to the data bucket. The data bucket index uses the Adjusted-Cuckoo algorithm to locate the data bucket into which a hash key is inserted. The Adjusted-Cuckoo algorithm includes two hash functions and generates two candidate data buckets, from which the data bucket with free space is selected for insertion. The algorithm determines the first data bucket, and the second is taken directly as the data bucket following the current one. The method comprises the following operations: step one, insertion; step two, refresh; step three, splitting; and step four, deletion.

Description

ACEH index structure and method based on memory hot spot data

Technical Field

The invention relates to a memory storage structure of a computer, in particular to an ACEH index structure and method based on memory hot spot oriented data.

Background

Key-value storage techniques are widely used by internet companies in actual production environments to improve the performance of data storage. Scholars have studied the hot spot problem in different scenarios and have in some scenarios presented an effective solution. However, hot spot problems in key-value store scenarios are ignored.

In a conventional extendible hash structure, when a key-value pair is inserted, the key is first matched to a directory entry. For example, a key that matches the first directory entry "00" is inserted into the data bucket (bucket) to which that entry points. Within the data bucket, conventional extendible hashing traverses the slots sequentially until it finds the first empty slot and inserts the key-value pair there. During a lookup, the directory entry is found from the corresponding bits of the key, the data bucket is located through the pointer of the directory entry, and the key-value pair is then found by sequential traversal. Deletion follows the same procedure as lookup. On a refresh, conventional extendible hashing simply performs another insert operation, which wastes a great deal of space.

When the structure is expanded, the number of directory entries doubles. When the directory is modified, the pointer of each directory entry is updated and, in addition, some key-value pairs in the data buckets are moved accordingly. For example, with 4 directory entries, a key-value pair is inserted into the data bucket of directory entry 00; after expansion, it moves to the data bucket corresponding to directory entry 001, because the first three bits of its key are 001.
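The directory lookup and doubling described above can be illustrated with a minimal sketch. It assumes an 8-bit hash key and indexing by the most significant bits; the names `KEY_BITS`, `dir_index`, and `double_directory` are hypothetical and do not come from the patent.

```python
KEY_BITS = 8  # assumed hash-key width, chosen only for this illustration


def dir_index(hash_key: int, global_depth: int) -> int:
    """Return the directory index: the top `global_depth` bits of the key."""
    return hash_key >> (KEY_BITS - global_depth)


def double_directory(directory: list) -> list:
    """Double an MSB-indexed directory.

    Old entry i becomes new entries 2i and 2i+1, which initially point to
    the same bucket; records move only when a bucket is later split.
    """
    return [bucket for bucket in directory for _ in range(2)]


key = 0b00100110
old_dir = ["B00", "B01", "B10", "B11"]  # 4 directory entries, depth 2
new_dir = double_directory(old_dir)     # 8 directory entries, depth 3

print(dir_index(key, 2))  # index under depth 2 (entry 00)
print(dir_index(key, 3))  # index under depth 3 (entry 001)
```

In this sketch the key that matched entry 00 before expansion matches entry 001 afterwards, which mirrors the example in the text.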

Disclosure of Invention

In order to solve the defects in the prior art and achieve the purposes of increasing the utilization rate of the memory and accelerating the searching performance, the invention adopts the following technical scheme:

An ACEH index structure for in-memory hotspot data comprises directory entries, segments, and data buckets, the segment being an intermediate structure placed between the directory entries and the data buckets. To solve the insertion and lookup problem in this three-layer structure, the global depth G is used as the segment index between directory entries and segments. One segment comprises a group of data buckets, and the local depth L is used as the data bucket index between segments and data buckets, where $L = G - \log_2 k$ and k denotes the number of pointers to the data bucket. L is used because, after the extendible hash is expanded, a directory entry (Directory) does not necessarily correspond to exactly one segment: the number of directory entries after expansion is greater than the number of original segments, and a new segment is created only when a key-value pair needs to be moved to a data bucket that does not yet exist. The data bucket index uses the Adjusted-Cuckoo algorithm to locate the data bucket into which a hash key is inserted. The Adjusted-Cuckoo algorithm contains two hash functions and generates two candidate data buckets, from which the data bucket with free space is selected for insertion. The algorithm determines the first data bucket, and the second is taken directly as the data bucket following the current one. This arrangement is cache-line friendly: compared with the traditional Cuckoo hash algorithm, it exploits spatial locality, accelerates lookups, and significantly improves memory utilization. The structure also gives a significant performance improvement for operations on NVM (non-volatile memory).

In an ACEH indexing method for in-memory hotspot data, directory entries index segments through the global depth G, and each segment corresponds to a group of data buckets. Segments index data buckets through the local depth L, where $L = G - \log_2 k$ and k represents the number of pointers to the data bucket. L is used because, after the extendible hash is expanded, a directory entry (Directory) does not necessarily correspond to exactly one segment: the number of directory entries after expansion is greater than the number of original segments, and a new segment is created only when a key-value pair needs to be moved to a data bucket that does not yet exist. The data bucket index uses the Adjusted-Cuckoo algorithm to locate the data bucket into which a hash key is inserted. The Adjusted-Cuckoo algorithm comprises two hash functions and generates two candidate data buckets, from which the data bucket with free space is selected for insertion. The algorithm determines the first data bucket, and the second is taken directly as the data bucket following the current one. This arrangement is cache-line friendly: compared with the traditional Cuckoo hash algorithm, it exploits spatial locality, accelerates lookups, significantly improves memory utilization, and improves performance on NVM (non-volatile memory). The method comprises the following operations:

step one, insertion operation;

step two, refresh operation;

step three, splitting operation;

and step four, deletion operation.

Further, the Adjusted-Cuckoo algorithm selects the multiple-shift function to determine a bucket.

Further, the insertion operation in step one locates the directory entry through the most significant bits of the hash key and finds the segment to which that directory entry points. It then finds the two candidate data buckets according to the Adjusted-Cuckoo algorithm and sequentially traverses the data in both. If a stored key is the same as the key being inserted, the value corresponding to that key is updated. Otherwise, if both data buckets have free positions, the hash key is inserted into one of them at random; if only one data bucket has a free position, the key is inserted into that bucket; and if neither data bucket has a free position, the insertion fails and the splitting operation is performed.

Further, the refresh operation in step two is performed in place: during probing, the key to be updated is compared directly with the key stored in each slot, and if the keys are the same, the value is updated. Conventional extendible hashing inserts the data directly, so duplicate key-value pairs exist, space is wasted, and incorrect data may be read. In-place refresh eliminates the duplicate key-value pairs and improves memory utilization.

Further, in the splitting operation in step three, a new segment is created, the extra directory entry pointer that pointed to the original segment is redirected to the new segment, and the valid key-value pairs corresponding to that directory entry pointer are transferred to the new segment. The transferred key-value pairs are then deleted from the original segment.

Further, the deletion operation in step four uses lazy deletion. After the directory entries are updated, queries for migrated records access the new segment, and queries for non-migrated records access the old segment. Because the split segment (i.e. the old segment) still contains all keys, a lookup always finds the record for the searched key, although the old segment holds some unnecessary duplicates. Expansion is likewise performed lazily: when some key-value pairs are migrated, their copies in the original data bucket are not deleted immediately; instead, when new data is inserted into the original data bucket, it directly replaces the key-value pairs that have already been migrated. This reduces the overhead of expanding the hash table and improves memory utilization.

Further, in the lazy deletion, when data x1 in data bucket Bucket0 is deleted, data x2 in data bucket Bucket1 is written directly over x1, and the position formerly holding x2 is marked invalid. In this way, data that could be inserted into either Bucket0 or Bucket1 is kept in Bucket0 as far as possible, which reduces the number of probes, shortens the average search length, and improves data access performance.

The invention has the advantages that:

ACEH improves the hash index for hotspot data sets on the basis of the extendible hash structure. Unlike a general extendible hash index, ACEH uses a modified cuckoo hash algorithm in the second-level indexing step, which increases the number of positions into which each item can be inserted and thereby raises memory utilization. ACEH also provides an in-place refresh operation, which reduces the memory occupied by duplicate key-value pairs. This operation also reduces the number of splits of the ACEH structure and improves insertion performance.

Drawings

FIG. 1 is a schematic diagram of an ACEH index structure of the present invention.

FIG. 2 is a schematic diagram of the process of the Adjusted-Cuckoo algorithm of the present invention.

FIG. 3 is a schematic diagram of the creation of a new segment in the present invention.

FIG. 4 is a diagram of directory entry expansion in accordance with the present invention.

Detailed Description

The following describes specific embodiments of the present invention in detail with reference to the drawings. It should be understood that the detailed description and specific examples, while indicating and illustrating the invention, are not intended to limit the invention.

The ACEH (Adjusted-Cuckoo Extendible Hashing) structure for hotspot data is a hash storage structure for data sets containing hotspots.

1. ACEH logic structure and algorithm:

As shown in FIG. 1, ACEH has one more structural layer than conventional extendible hashing: an intermediate structure called the Segment. One Segment consists of N buckets. To solve the insertion and lookup problem in the three-layer structure, ACEH uses G bits (G being the global depth) as the segment index and uses the Adjusted-Cuckoo algorithm to determine which bucket the hash key is inserted into.

Adjusted-Cuckoo algorithm: like the traditional Cuckoo hash algorithm, the Adjusted-Cuckoo algorithm comprises two hash functions, so a hash key passed through it yields two candidate buckets, and the one with free space is selected for insertion. Unlike the Cuckoo hash algorithm, which selects its hash functions randomly, the Adjusted-Cuckoo algorithm uses the multiple-shift function to determine the first bucket and takes the second directly as the bucket following the current one. This arrangement is cache-line friendly and, compared with the traditional Cuckoo hash algorithm, exploits spatial locality and speeds up lookups.
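A minimal sketch of this candidate-bucket selection follows. It assumes that the function named in the text behaves like a standard multiply-shift hash, that the number of buckets per segment is a power of two, and that the "next bucket" wraps around at the end of the segment; the multiplier constant and the names `multiply_shift` and `candidate_buckets` are illustrative choices, not taken from the patent.

```python
WORD_BITS = 64
MULTIPLIER = 0x9E3779B97F4A7C15  # arbitrary odd 64-bit constant (assumption)


def multiply_shift(key: int, out_bits: int) -> int:
    """Multiply-shift hash: keep the top `out_bits` bits of key * MULTIPLIER."""
    product = (key * MULTIPLIER) & (2**WORD_BITS - 1)
    return product >> (WORD_BITS - out_bits)


def candidate_buckets(hash_key: int, buckets_per_segment: int) -> tuple:
    """Return the two candidate bucket indexes inside one segment."""
    # The sketch only handles power-of-two segment sizes.
    assert buckets_per_segment & (buckets_per_segment - 1) == 0
    out_bits = buckets_per_segment.bit_length() - 1
    first = multiply_shift(hash_key, out_bits)
    # The second candidate is simply the neighbouring bucket, so both
    # candidates tend to sit close together in memory.
    second = (first + 1) % buckets_per_segment
    return first, second


first, second = candidate_buckets(0x2612345678, 256)
print(first, second)
```

Because the second candidate is adjacent to the first, a lookup that misses in the first bucket continues into contiguous memory rather than jumping to an unrelated location, which is the locality argument made in the text.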

As shown in FIG. 2, assume that x1, as calculated by the Adjusted-Cuckoo algorithm, can be inserted into bucket 0 or bucket 1 and, after sequential traversal, is placed in bucket 0. Assume that x2, as calculated by the Adjusted-Cuckoo algorithm, can likewise be inserted into bucket 0 or bucket 1 and, after sequential traversal, is placed in bucket 1. When x1 is deleted, x2 can be written directly over x1, and the position formerly holding x2 is then marked INVALID. In this way, data that could be inserted into either bucket 0 or bucket 1 is kept in bucket 0 as far as possible, which reduces the number of probes, shortens the average search length, and improves data access performance.
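The deletion behaviour in this example can be sketched as follows. The slot representation (a `(key, value)` tuple, `None` for an empty slot, and an `INVALID` marker), the `home` callback that returns a key's first candidate bucket, and the function name `delete` are all assumptions made for illustration.

```python
INVALID = object()  # marker for a slot whose record has been moved or removed


def is_record(slot) -> bool:
    """A live record is stored as a (key, value) tuple."""
    return isinstance(slot, tuple)


def delete(buckets, home, key) -> bool:
    """Delete `key`; if it sat in its first bucket, pull a neighbour record in.

    buckets: list of buckets, each a list of slots.
    home:    function mapping a key to its first candidate bucket index.
    """
    first = home(key)
    second = (first + 1) % len(buckets)
    for b in (first, second):
        for i, slot in enumerate(buckets[b]):
            if is_record(slot) and slot[0] == key:
                if b == first:
                    # Look in the neighbouring bucket for a record that
                    # shares the same first bucket and move it over.
                    for j, other in enumerate(buckets[second]):
                        if is_record(other) and home(other[0]) == first:
                            buckets[first][i] = other
                            buckets[second][j] = INVALID
                            return True
                buckets[b][i] = INVALID
                return True
    return False


# x1 lives in bucket 0 and x2 in bucket 1; both have bucket 0 as first bucket.
buckets = [[("x1", 1), None], [("x2", 2), None]]
delete(buckets, lambda k: 0, "x1")
print(buckets[0][0])  # ('x2', 2)
```

After the call, x2 occupies the slot that held x1 and its old slot is marked invalid, so a later lookup for x2 succeeds in the first bucket it probes.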

With the Adjusted-Cuckoo algorithm, ACEH also achieves significantly higher memory utilization than conventional extendible hashing. In a refresh operation, conventional extendible hashing inserts the data directly, so duplicate key-value pairs waste space and may cause incorrect data to be read. ACEH instead refreshes in place, which eliminates duplicate key-value pairs.

2. ACEH operation:

As shown in FIG. 3, assume that the given hash key is 00100110..11110 and that the global depth G (Global Depth) is 2. The two most significant bits are used as the Segment index, and the least significant byte is used as the Bucket index. L represents the local depth (Local Depth), with $L = G - \log_2 k$, where k represents the number of pointers to the data bucket. The local depth L is used because, after the extendible hash is expanded, a directory entry (Directory) does not necessarily correspond to exactly one segment: the number of directory entries after expansion is greater than the number of original segments, and a new segment is created only when a key-value pair needs to be moved to a data bucket that does not yet exist. FIG. 4 shows the directory doubling (Directory Doubling) operation: the directory is expanded according to the most significant bits (Most Significant Bits), where white indicates a directory entry that existed before expansion and gray indicates a directory entry newly added by the expansion.
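As a worked illustration of the segment index and the relation between local and global depth, the sketch below assumes an 8-bit hash key (the key in the text is elided, so only its leading bits 00100110 are used) and that k is a power of two; `segment_index` and `local_depth` are hypothetical helper names.

```python
def segment_index(hash_key: int, key_bits: int, global_depth: int) -> int:
    """Directory index: the `global_depth` most significant bits of the key."""
    return hash_key >> (key_bits - global_depth)


def local_depth(global_depth: int, pointers: int) -> int:
    """L = G - log2(k), where k is the number of pointers (a power of two)."""
    assert pointers > 0 and pointers & (pointers - 1) == 0
    return global_depth - (pointers.bit_length() - 1)


# With G = 2, the leading bits 00 select directory entry 0.
print(segment_index(0b00100110, 8, 2))

# Two directory entries share one segment: L = 2 - log2(2) = 1.
print(local_depth(2, 2))

# After a split each entry has its own segment: L = 2 - log2(1) = 2.
print(local_depth(2, 1))
```

A segment whose local depth is smaller than the global depth is shared by several directory entries, which is why it can be split without doubling the directory.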

Insert operation: when hash key 00100110..11111110 is inserted, the 00 directory entry is first located through the most significant bits (Most Significant Bits) index, and the Segment to which the 00 directory entry points is found. The two candidate buckets are then found according to the Adjusted-Cuckoo algorithm, and the data in both buckets is traversed sequentially. If a stored key is the same as the key being inserted, the value corresponding to that key is updated. Otherwise, if both buckets have free positions, the hash key is inserted into one of them at random; if only one bucket has a free position, the key is inserted into that bucket; and if neither bucket has a free position, the insertion fails and the Split operation is performed.
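The decision logic of the insert operation can be sketched as below. The sketch works on a single segment and takes the first candidate bucket as an argument instead of computing it; the return values `"updated"`, `"inserted"`, and `"split"`, the slot representation, and the wrap-around to the next bucket are assumptions made for illustration.

```python
import random


def insert(segment, first, key, value, rng=random):
    """Insert or update `key` in one segment.

    segment: list of buckets; each bucket is a fixed-length list of slots,
             where a slot is None (free) or a (key, value) tuple.
    first:   index of the first candidate bucket for `key`.
    """
    second = (first + 1) % len(segment)
    candidates = [segment[first], segment[second]]

    # 1. If the key is already stored, update its value in place.
    for bucket in candidates:
        for i, slot in enumerate(bucket):
            if slot is not None and slot[0] == key:
                bucket[i] = (key, value)
                return "updated"

    # 2. Otherwise pick among the candidate buckets that still have room.
    free = [bucket for bucket in candidates if None in bucket]
    if not free:
        return "split"  # both candidates full: the caller must split

    bucket = rng.choice(free)  # random choice when both have room
    bucket[bucket.index(None)] = (key, value)
    return "inserted"


segment = [[None, None] for _ in range(4)]  # 4 buckets with 2 slots each
print(insert(segment, 1, "a", 1))  # inserted
print(insert(segment, 1, "a", 2))  # updated
```

Checking for an existing key before looking for a free slot is what keeps duplicate key-value pairs out of the structure.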

Update operation: because ACEH uses linear probing during insertion, the key to be updated can be compared directly with the key stored in each slot during probing; if the keys are the same, the value is updated.

Split operation: assume that key 00100110..11110 is to be inserted and that the Adjusted-Cuckoo algorithm yields bucket 2 and bucket 254 as its candidate buckets, but neither bucket has free space for the data. ACEH then creates a new Segment4 and redirects the extra pointer to Segment3, i.e. the pointer of the 11 directory entry, to Segment4. The valid key-value pairs in the original Segment3 whose most significant bits are 11 are also transferred to Segment4.
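A minimal sketch of such a split is given below. It assumes an 8-bit hash key, stores each segment's records in a dictionary instead of buckets, and leaves the stale copies in the old segment; the `Segment` class and the `split` function are hypothetical names, and ordering concerns for crash consistency are not modelled.

```python
KEY_BITS = 8  # assumed hash-key width for this illustration


class Segment:
    def __init__(self, local_depth: int):
        self.local_depth = local_depth
        self.records = {}  # hash_key -> value (bucket layout omitted)


def split(directory, global_depth, old):
    """Split `old`; requires old.local_depth < global_depth."""
    # Directory entries that currently point to the old segment.
    entries = [i for i, seg in enumerate(directory) if seg is old]
    upper = set(entries[len(entries) // 2:])  # the "extra" pointers

    new = Segment(old.local_depth + 1)
    for i in upper:
        directory[i] = new

    # Copy the records that now belong to the new segment. The old copies
    # are left behind and can be overwritten by later inserts.
    shift = KEY_BITS - global_depth
    for hash_key, value in old.records.items():
        if (hash_key >> shift) in upper:
            new.records[hash_key] = value

    old.local_depth += 1
    return new


seg1, seg2, seg3 = Segment(2), Segment(2), Segment(1)
directory = [seg1, seg2, seg3, seg3]  # entries 10 and 11 share seg3
seg3.records = {0b10010010: "v1", 0b11110010: "v2"}
seg4 = split(directory, 2, seg3)
print(sorted(bin(k) for k in seg4.records))  # ['0b11110010']
```

After the call, directory entry 11 points to the new segment, entry 10 still points to the old one, and only the record whose leading bits are 11 has been copied across.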

Delete operation: lazy deletion is used, so migrated records are not deleted immediately. After the directory entries are updated, queries for migrated records access the new segment, while queries for non-migrated records access the old segment. Because the split segment (i.e., the old segment) still contains all keys, a lookup always finds the record for the searched key, although the old segment holds some unnecessary duplicates. During lazy expansion, when some key-value pairs are migrated, their copies in the original data bucket are not deleted; when new data is later inserted into the original data bucket, it directly replaces the migrated data, which reduces the overhead of expanding the hash table.

For example, suppose a record with hash key 1010..11111110 is inserted. The insert accesses Segment3, and the Adjusted-Cuckoo algorithm is assumed to yield bucket 2 and bucket 254 as the candidate buckets. Bucket 254 is found to be full. In bucket 2, the first record, with hash key 1001..0010, is valid, but the second record, with hash key 1111..0010, is invalid, so the insert operation replaces the second record with the new record. Since the validity of each record is determined by the local depth, the order in which the directory entries are updated must be preserved to maintain consistency and limit the scope of faults.
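The validity check in this example can be sketched as follows, assuming 8-bit hash keys and assuming that each segment is identified by the bit prefix of the keys it owns; `is_valid`, `insert_lazy`, and the `segment_prefix` parameter are hypothetical names introduced for illustration.

```python
KEY_BITS = 8  # assumed hash-key width for this illustration


def is_valid(hash_key: int, segment_prefix: int, local_depth: int) -> bool:
    """A record is valid if its top `local_depth` bits match the segment."""
    return (hash_key >> (KEY_BITS - local_depth)) == segment_prefix


def insert_lazy(bucket, hash_key, value, segment_prefix, local_depth):
    """Write into the first free slot or the first slot holding a stale record.

    Returns the slot index used, or None if every slot holds a valid record.
    """
    for i, slot in enumerate(bucket):
        if slot is None or not is_valid(slot[0], segment_prefix, local_depth):
            bucket[i] = (hash_key, value)
            return i
    return None


# The segment owns prefix 10 at local depth 2. The second record starts
# with 11, so it was migrated by an earlier split and is now stale.
bucket = [(0b10010010, "kept"), (0b11110010, "stale")]
slot = insert_lazy(bucket, 0b10111110, "new", 0b10, 2)
print(slot)  # 1
```

No separate cleanup pass is needed in this scheme: a stale record costs nothing until the slot is wanted, at which point it is simply overwritten.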

The above embodiments only illustrate the technical solution of the present invention and do not limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may be modified, and some or all of their technical features may be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the technical solutions of the embodiments of the present invention.

Claims

1. An indexing method for an extendible hash storage structure ACEH for in-memory hotspot data, characterized in that directory entries index segments through a global depth G, each segment corresponds to a group of data buckets, and segments index data buckets through a local depth L, where $L = G - \log_2 k$ and k represents the number of pointers to the data bucket; the data bucket index uses an Adjusted-Cuckoo algorithm to locate the data bucket into which a hash key is inserted; the Adjusted-Cuckoo algorithm includes two hash functions and generates two candidate data buckets, from which the data bucket with free space is selected for insertion; the Adjusted-Cuckoo algorithm determines the first data bucket, and the second data bucket is taken directly as the data bucket following the current one;

when data 1, as calculated by the algorithm, can be inserted into a first data bucket or a second data bucket and is placed in the first data bucket after sequential traversal, and data 2, as calculated by the algorithm, can be inserted into the first data bucket or the second data bucket and is placed in the second data bucket after sequential traversal, then upon deletion of data 1, data 2 is written directly over data 1, and the position formerly holding data 2 is marked invalid.

2. The indexing method for the extendible hash storage structure ACEH for in-memory hotspot data according to claim 1, wherein the Adjusted-Cuckoo algorithm determines a data bucket by a multiple-shift function.

3. The indexing method for the extendible hash storage structure ACEH for in-memory hotspot data according to claim 1 or 2, wherein the method comprises a key-value pair insertion operation: a directory entry is located through the most significant bits, the segment to which the directory entry points is found, the two candidate data buckets are found according to the Adjusted-Cuckoo algorithm, and the data in the two data buckets is traversed sequentially; if a stored key is the same as the inserted key, the value corresponding to that key is updated; otherwise, if both data buckets have free positions, the hash key is inserted into one of them at random; if only one data bucket has a free position, the key is inserted into that data bucket; and if neither data bucket has a free position, the insertion fails and the key-value splitting operation is performed.

4. The indexing method for the extendible hash storage structure ACEH for in-memory hotspot data according to claim 1 or 2, wherein the method comprises a key-value pair refresh operation performed in place: during probing, the key to be updated is compared directly with the key stored in each slot, and if the keys are the same, the value is updated.

5. The indexing method for the extendible hash storage structure ACEH for in-memory hotspot data according to claim 1 or 2, wherein the method comprises a key-value splitting operation: a new segment is created, the extra directory entry pointer that pointed to the original segment is redirected to the new segment, and the valid key-value pairs corresponding to that directory entry pointer are transferred to the new segment.