CN111274456A - Data indexing method and data processing system based on NVM (non-volatile memory) main memory - Google Patents


Info

Publication number
CN111274456A
Authority
CN
China
Prior art keywords
index
node
data
nvm
item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010064770.8A
Other languages
Chinese (zh)
Other versions
CN111274456B (en)
Inventor
陈世敏
刘霁航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Computing Technology of CAS
Original Assignee
Institute of Computing Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Computing Technology of CAS filed Critical Institute of Computing Technology of CAS
Priority to CN202010064770.8A priority Critical patent/CN111274456B/en
Publication of CN111274456A publication Critical patent/CN111274456A/en
Application granted granted Critical
Publication of CN111274456B publication Critical patent/CN111274456B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90 - Details of database functions independent of the retrieved data types
    • G06F 16/901 - Indexing; Data structures therefor; Storage structures
    • G06F 16/9027 - Trees
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data indexing method based on an NVM (non-volatile memory) main memory, comprising the following steps: placing the leaf nodes of a tree-shaped index structure in the NVM main memory; when newly added data is to be written into a leaf node, judging whether the leaf node has a free index entry; if so, performing the data write operation, otherwise first performing and completing a node splitting operation and then performing the data write operation. The data write operation comprises: if the first index row of the leaf node has a free index entry, writing the newly added data into that free index entry; otherwise, migrating the newly added data and the data stored in the first index row to free index entries of the middle index rows and/or the tail index row. The node splitting operation comprises: constructing a new leaf node and migrating part of the stored data of the leaf node to free index entries of the new leaf node.

Description

Data indexing method and data processing system based on NVM (non-volatile memory) main memory
Technical Field
The invention relates to the field of database systems and big data systems, and in particular to a tree-shaped index structure, indexing method and indexing system for non-volatile main memory.
Background
1.1 New Generation of non-volatile memory technology
A new generation of non-volatile memory (NVM) is computer memory intended as an alternative or complement to existing DRAM (dynamic random access memory) main-memory technology. Current integrated-circuit feature sizes have reached 7 nm, and continuing to scale DRAM technology down to smaller feature sizes faces significant challenges. The new generation of NVM technologies stores 0/1 by changing the resistance of the storage medium and can support smaller feature sizes, providing a viable solution to the above problem. These technologies include Phase Change Memory (PCM), spin-transfer torque magnetic random access memory (STT-MRAM), the Memristor, 3DXPoint, etc.
Compared with DRAM, NVM has the following characteristics: (1) NVM has read/write performance similar to DRAM, but slower (e.g., 3 times); (2) NVM write performance is worse than read performance, write power consumption is higher, and write endurance may be limited, i.e., a storage cell may be damaged once the number of writes to it exceeds a certain threshold; (3) data written to NVM survives power failure, while data in DRAM and the CPU Cache is lost on power failure; (4) to ensure that content in the CPU Cache is written back to NVM, cache-line flush instructions such as clwb/clflush and memory-ordering instructions such as sfence/mfence must be executed, and these special instructions cost more than ordinary writes (e.g., 10 times); (5) the basic unit of CPU access to NVM is a cache line (e.g., 64B); (6) the basic access unit inside the NVM module may be larger than a cache line (e.g., 256B inside Intel Optane DC Persistent Memory).
Compared with flash, NVM performance is at least 2 orders of magnitude higher, and NVM allows in-place writes without flash-style erase operations. NVM's characteristics are therefore closer to DRAM's, and it is regarded as an alternative or supplement to DRAM main-memory technology.
1.2 computer System including New Generation NVM host
Two configurations of a computer system incorporating the 3DXPoint-based Intel Optane DC Persistent Memory are shown in FIGS. 1A and 1B. In the first configuration (FIG. 1A), DRAM is treated as a buffer for 3DXPoint, the memory controller completes the buffering automatically, and the main-memory size visible to the system is that of 3DXPoint; DRAM is controlled entirely by hardware and is invisible to software. In the second configuration (FIG. 1B), both DRAM and 3DXPoint are visible to software, which can decide which data is placed in volatile DRAM and which in non-volatile 3DXPoint. The first configuration mainly suits running existing applications, which can use the large-capacity 3DXPoint main memory without modification, but it cannot persist data in the NVM; therefore new NVM-oriented applications use the second configuration.
1.3 B+-Tree indexing
B+-Tree is a tree-structured index. The leaf nodes store the index entries, are connected left to right in the same layer by a sibling linked list, and store the index entries in ascending order. Each internal node may have multiple pointers, each pointing to a child node. The keys in an internal node are ordered from small to large and separate the different child subtrees. Thus, the B+-Tree is an ordered index.
B+-Tree is widely used in databases and big data systems to support fast query and update of data. Because the index is invoked at high frequency, its performance affects overall system performance. Therefore, optimizing the B+-Tree index structure for non-volatile main memory has important theoretical and practical significance for databases and big data systems based on non-volatile main memory.
1.4 Prior art: NVM-optimized B+-Tree index structures
The existing NVM-optimized B+-Tree index structures mainly include wB+-Tree, FP-Tree, BzTree, etc. The main optimization ideas include the following aspects:
focusing on CPU Cache performance: the node size is an integral multiple of the cache-line size (i.e., 64B) and is significantly smaller than that of disk-based B+-Tree nodes (e.g., 4KB);
adopting unordered leaf nodes to reduce NVM writes;
placing internal nodes in DRAM and leaf nodes in NVM, improving internal-node access performance, while internal nodes can be rebuilt from the leaf nodes at recovery time, thereby maintaining persistence and crash consistency of the data structure;
using NVM atomic writes to avoid write-ahead logs or shadow copies, reducing the number of NVM writes and forced write-backs.
First, the prior art was designed from theoretical NVM characteristics: with no real NVM hardware available, research relied on simulation. After real 3DXPoint hardware appeared, the inventors found through experiments that the hardware has important new characteristics: (1) the granularity of internal data access is 256B, larger than the 64B CPU Cache line; (2) the cost of writing a 64B line to NVM does not vary with the written content, unlike assumptions in earlier studies.
Second, the B+-Tree node splitting operation is very costly in the prior art: logs are usually used to guarantee crash consistency of the split, and the logs incur extra NVM writes and forced write-backs. A forced write-back on 3DXPoint may cost more than 10 times a normal write, so reducing extra operations such as logging is a very important problem.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a data indexing method based on an NVM main memory, which constructs a tree-shaped index structure by exploiting non-volatile main-memory atomic writes and the characteristics of real 3DXPoint hardware, so as to reduce the overhead of data write operations in the NVM main memory.
Specifically, the data indexing method of the present invention includes: constructing a tree-shaped index structure, where the root node and the internal nodes of the tree-shaped index structure are placed in the DRAM main memory of the data processing system and the leaf nodes are placed in the NVM main memory; the leaf nodes are multiple, lie in the same layer, and are connected left to right by a sibling linked list; each leaf node comprises an index unit A, which in order comprises a first index row, middle index rows and a tail index row, each comprising a plurality of index entries; when newly added data is written into the current leaf node, judging whether the index unit A of the current leaf node has a free index entry; if so, performing the data write operation, otherwise first performing and completing a node splitting operation and then performing the data write operation; wherein the data write operation comprises: if the first index row has a free index entry, writing the newly added data into that free index entry; otherwise, migrating the newly added data and the data stored in the first index row to free index entries of the middle index rows and/or the tail index row; and the node splitting operation comprises: constructing a new leaf node, adding it to the sibling linked list, and migrating part of the stored data of the current leaf node to free index entries of the index unit of the new leaf node.
In the data indexing method, the index unit A (an NVMLine) comprises M index lines, each Line having N data-item storage positions. Line 1 is the first index line Lh: the 1st data-item position of Lh is the index header H, and the remaining positions of Lh are index entries. Lines 2 to M-1 are middle index lines Li, all of whose data-item positions are index entries. Line M is the tail index line Lt: the Nth data-item position of Lt is the sibling-linked-list pointer entry S, and the remaining positions of Lt are index entries; M and N are positive integers. The index header H comprises a write lock bit (lock), an alternation control bit (alt), an index occupancy bitmap, and index fingerprint bits F; S comprises a pointer S0 and a pointer S1. The lock bit sets the write state of the current leaf node; the alt bit is set by an NVM atomic write to make one of S0 and S1 the valid pointer and the other the invalid pointer; the bitmap records the occupancy state of each index entry; F is a fingerprint array recording the fingerprint of each index entry; and the valid pointer of S links the sibling linked list.
In the data indexing method of the invention, the node splitting operation specifically includes: when all index entries of leaf node Node_n are occupied, allocating a new leaf node Node_n' having the same structure as Node_n; copying part of the stored data of Node_n to Node_n' and modifying the index occupancy bitmap' of the index header H' in the first index line Lh of Node_n'; pointing the valid pointer of the sibling-linked-list pointer entry S' of Node_n' at the right sibling leaf node Node_n+1 of Node_n, and pointing the invalid pointer of the sibling-linked-list pointer entry S of Node_n at Node_n'; persisting Node_n' in the NVM main memory; with an NVM atomic write, setting the alt bit so that the pointer of S pointing to Node_n' becomes the valid pointer and the pointer of S pointing to Node_n+1 becomes the invalid pointer; and clearing the index entries of Node_n that held the migrated data, making them free index entries, by modifying the bitmap with an NVM atomic write.
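The split steps above can be modelled with a small executable sketch (a hypothetical Python simulation for illustration; the real structure is a 256B NVM node, and the `split`/`valid_sib` names and dict layout are assumptions). Because the alt bit and the occupancy bitmap share the first 8B of the header, the final step (flipping alt to validate the new sibling pointer and clearing the migrated entries' bitmap bits) can be committed by a single atomic 8B write:

```python
ALT = 1 << 62                 # alternation bit in the header word

def valid_sib(node):
    """Return the currently valid sibling pointer (S1 if alt set, else S0)."""
    return node['sib'][1 if node['word'] & ALT else 0]

def split(old):
    occupied = [i for i in range(14) if (old['word'] >> i) & 1]
    movers = occupied[len(occupied) // 2:]     # upper half migrates
    new = {'word': 0, 'kv': [None] * 14, 'sib': [None, None]}
    for j, i in enumerate(movers):             # copy entries to the new leaf
        new['kv'][j] = old['kv'][i]
        new['word'] |= 1 << j                  # new leaf's own bitmap
    new['sib'][0] = valid_sib(old)             # new leaf -> old right sibling
    idx = 0 if old['word'] & ALT else 1        # old leaf's *invalid* pointer
    old['sib'][idx] = new                      # point it at the new leaf
    # ... on real NVM, clwb/sfence would persist the new leaf here ...
    word = old['word']
    for i in movers:
        word &= ~(1 << i)                      # clear migrated bitmap bits
    word ^= ALT                                # flip alt: validate new sibling
    old['word'] = word                         # single atomic 8B commit
    return new
```

A crash before the final `old['word']` assignment leaves the old node's header untouched, so the tree remains consistent without any log record.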
In the data indexing method of the invention, the leaf node further comprises at least one index unit B (an NVMLine'); each NVMLine' comprises M Lines, each Line having N data-item storage positions. The 1st data-item position of Line 1 of the NVMLine' is the head index header H0, and the Nth data-item position of Line M of the NVMLine' is the tail index header H1. H0 comprises an index occupancy bitmap' and index fingerprint bits F'; H1 has the same structure as H0. The alt bit of H is set by an NVM atomic write to make one of H0 and H1 the valid index header and the other the invalid index header, where the bitmap' of the valid index header records the occupancy state of each index entry of the current NVMLine', and the F' of the valid index header is a fingerprint array recording the fingerprints of each index entry of the current NVMLine'.
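A minimal sketch of this alternating-header idea (a hypothetical Python model; the `update_unit`/`active` names and dict fields are illustrative assumptions, not the patent's actual layout): an update prepares the currently invalid header in place, persists it, and then a single atomic flip of the alt bit makes it the valid one.

```python
def active(unit, alt):
    """Header currently selected by the alt bit: H1 if set, else H0."""
    return unit['H1'] if alt else unit['H0']

def update_unit(unit, alt, new_bitmap, new_fps):
    shadow = unit['H0'] if alt else unit['H1']   # currently invalid header
    shadow['bitmap'] = new_bitmap                # safe: not yet visible
    shadow['fps'] = list(new_fps)
    # ... on real NVM, clwb/sfence would persist the shadow header here ...
    return alt ^ 1                               # atomic alt flip in H commits
```

If a crash happens before the alt flip, readers still see the old, fully consistent header.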
The data indexing method of the present invention further includes: performing the data write operation that exclusively owns the leaf node only after the write lock bit has been set to the locked state and the concurrency-control hardware transaction has exited.
In the data indexing method of the invention, during failure recovery, the root node and the internal nodes are rebuilt in the DRAM main memory from all the leaf nodes to recover the tree-shaped index structure; and if the write lock bit of an index unit NVMLine is in the locked state, the write lock bit is reset to the unlocked state and the stored data of that index unit NVMLine is recovered.
In the data indexing method of the invention, some or all index entries of an NVMLine are set free by modifying its bitmap, and some or all index entries of an NVMLine' are set free by modifying its bitmap'.
The present invention also provides a computer-readable storage medium storing executable instructions for performing the NVM-based data indexing method described above.
The invention also proposes a data processing system comprising: a processor; a main memory connected to the processor and comprising a DRAM main memory and an NVM main memory in parallel; and a computer-readable storage medium, where the processor retrieves and executes the executable instructions in the computer-readable storage medium for NVM-based data indexing.
The LB+-Tree provided by this patent optimizes index write performance for the characteristics of real hardware and realizes zero-log node splitting.
Drawings
FIGS. 1A and 1B are diagrams of prior-art computer systems including NVM.
FIG. 2 is a schematic diagram of the LB+-Tree structure with 256B leaf nodes of the present invention.
FIG. 3 is a schematic diagram of the LB+-Tree point query algorithm of the present invention.
FIG. 4 is a schematic diagram of first-row index entry migration of the present invention.
FIG. 5 is a schematic diagram of the LB+-Tree insertion algorithm for 256B leaf nodes of the present invention.
FIG. 6 is a schematic diagram of zero-log leaf node splitting of the present invention.
FIG. 7 is a schematic diagram of the zero-log splitting algorithm for 256B leaf nodes of the present invention.
FIG. 8 is a schematic diagram of the multi-256B leaf node structure of the present invention.
FIG. 9 is a graph comparing the index insertion performance of the present invention with the prior art.
FIG. 10 is a schematic diagram of a data processing system of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the NVM main-memory-based data indexing method and data processing system proposed by the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described here are merely illustrative of the invention and do not limit it.
The inventors studied the 3DXPoint hardware characteristics and re-examined the existing NVM-optimized B+-Tree index structures, finding that the prior art does not suit the new characteristics of real hardware and that existing node splitting methods are costly. The inventors therefore propose a new B+-Tree design that draws on the advantages of the prior art, solves its problems, and obtains a remarkable performance improvement.
First, the following definitions are made:
CacheLineSize: the size of a CPU Cache line, i.e., the granularity at which the CPU reads and writes main-memory data, usually 64B.
NVMLineSize: the granularity of data access inside the NVM, e.g., 256B in 3DXPoint; typically an integer multiple of CacheLineSize.
The invention provides LB+-Tree, a tree-shaped index structure for non-volatile main memory, i.e., a B+-Tree index structure optimized for real 3DXPoint hardware. The LB+-Tree supports persistent storage and crash consistency in non-volatile main memory, supports multithreaded concurrent operation, optimizes the node structure, reduces write overhead by exploiting non-volatile main-memory atomic writes and the characteristics of real 3DXPoint hardware, and realizes zero-log persistence support for the tree index structure.
Leaf nodes of the LB+-Tree are stored in the NVM; all leaf-node start addresses are aligned to NVMLineSize, and the leaf-node size is an integral multiple of NVMLineSize (i.e., 1 or more NVMLines), so the internal data-access bandwidth of real NVM hardware can be fully utilized.
A leaf node of the LB+-Tree consists of 1 or more index units NVMLine of size NVMLineSize. An NVMLine comprises an index header, index entries and sibling-linked-list pointers; the header controls which sibling pointer is currently valid. This leaf-node structure supports the first-row index migration technique and the zero-log node splitting technique.
For the leaf-node insertion algorithm, the invention proposes first-row index migration. First, whenever possible, an insertion modifies only a free index entry in the cache-line-sized row containing the header (called the first index row, Line0). Second, when the index entries of the first index row are full, the insertion must use free index entries of other index rows. In that case, as many index entries as possible are migrated from the first row to the index row receiving the insertion, freeing as many slots as possible in the first row so that subsequent insertions can again find free index entries there. This reduces insertion cost, reduces the number of NVM rows written, and improves insertion performance.
For the leaf-node split triggered by inserting into a full leaf node, the invention further proposes a zero-log splitting algorithm: the switch of leaf-node state (sibling pointer and header) is completed with an NVM atomic write, avoiding the cost of writing a log and improving node-splitting performance.
Compared with the prior art, the LB+-Tree optimizes the node structure, reduces write overhead by exploiting non-volatile main-memory atomic writes and the characteristics of real 3DXPoint hardware, and realizes zero-log persistence support for the tree-shaped index structure.
FIG. 2 is a schematic diagram of the LB+-Tree structure with 256B leaf nodes. As shown in FIG. 2, the proposed data index structure LB+-Tree comprises a root node and internal nodes (both non-leaf nodes) placed in DRAM main memory, and leaf nodes placed in NVM main memory (3DXPoint). The overall LB+-Tree structure adopts existing NVM-oriented B+-Tree design ideas, mainly: leaf nodes are placed in NVM to guarantee persistent storage; internal nodes are placed in DRAM, disappear on power failure, and are rebuilt from the leaf nodes during recovery; internal nodes use an ordered index-entry array to support binary search, while index entries in leaf nodes are unordered, reducing the NVM writes caused by index-entry movement.
Specifically, the invention provides a tree-shaped index structure LB+-Tree for the non-volatile main memory NVM. In the embodiment of the invention, CacheLineSize = 64B and NVMLineSize = 256B can be taken according to the implementation environment, but the invention is not limited thereto.
In this embodiment, the leaf-node size is an integer multiple of 256B, adapting to the 256B internal data-transfer size of 3DXPoint and fully utilizing 3DXPoint's internal bandwidth. There are two leaf-node designs, corresponding to a leaf-node size of 256B and a size of at least 512B, referred to below as the 256B leaf node and the multi-256B leaf node.
1. 256B leaf node structure
The 256B leaf node consists of one index unit NVMLine (a 256B unit). Assuming a single index entry of (8B key, 8B val), the index unit consists of four 64B index lines: the first index line (Line0), the middle index lines (Line1, Line2) and the tail index line (Line3), holding 14 index entries of 16B in total; each index entry stores a valid key and val.
The first index line Line0 has 4 data items. The 1st data item (16B in size) is the index header H, which in order comprises a 1-bit write lock bit, a 1-bit alternation control bit alt, a 14-bit index occupancy bitmap, and 14B of index fingerprint bits. The lock bit is used for concurrency control. The alt bit, via its 0/1 value, determines which of the pointers S0 and S1 in the tail index line is valid. The index occupancy bitmap records whether each index entry of the NVMLine is occupied: 1 means the corresponding index entry is occupied and stores a valid key and val, 0 means it is free. The index fingerprint bits F form a 14B fingerprint array holding fourteen 1B fingerprints, one per index entry of the NVMLine; each 1B fingerprint is obtained from the 8B key by a hash function. The 3 data items after the header are index entries: in the aligned 64B index line containing H, all remaining space holds index entries only, so as to maximize the number of index entries in H's line.
The middle index lines Line1 and Line2 each have 4 data items, 8 in total, all of which are index entries.
The tail index line Line3 has 4 data items, of which the first 3 are index entries and the 4th is the sibling-linked-list pointer entry containing two 8B pointers S0 and S1. One of the two pointers is the valid pointer and points to the next leaf node to the right in the LB+-Tree leaf layer (NULL if this is the rightmost leaf); the other pointer is idle. Which of S0 and S1 is valid is set by the alternation control bit alt in the header of the first index line Line0.
2. Concurrency control
The LB+-Tree of the invention can adopt several concurrency-control strategies, combining the hardware transactional memory (HTM) provided by the CPU with operations on the leaf-node lock bit. The basic idea comprises:
the access of internal nodes in a DRAM and the read operation of leaf nodes in an NVM are protected by an HTM, and the rollback of hardware transactions caused by cache line flush instructions such as clwb/clflush and the like is noticed, so that the write operation of the leaf nodes in the NVM cannot be protected by the HTM; HTM is supported in a variety of mainstream CPU architectures, including the Intel x86 architecture, the ARM architecture, the IBM Power8 architecture, and the like;
the write operation of the leaf node sets the lock bit and completes the hardware transaction, and the read operation of the leaf node checks whether the lock bit is set, so that the modification of the lock bit by the write operation and the read of the lock bit by the read operation conflict, the write operation is ensured to monopolize the leaf node, the leaf node is mutually exclusive with the read operation and other write operations, and the correctness of the concurrent control of the leaf node is ensured;
and the specific step of the write operation is carried out only when the leaf node is monopolized after the lock bit is set and the hardware transaction is exited in the write operation of the leaf node.
3. LB+-Tree point query algorithm for 256B leaf nodes
FIG. 3 is a schematic diagram of the LB+-Tree point query algorithm. As shown in FIG. 3, the algorithm first searches the internal nodes and then searches a leaf node; the basic flow is consistent with the usual B+-Tree approach. It has two main features. The first is the use of HTM for concurrency control, where _xbegin, _xend and _xabort are an Intel x86-based implementation. _xbegin starts a hardware transaction and returns _XBEGIN_STARTED on success. When a hardware transaction fails because of a data conflict or a call to _xabort and must roll back, the CPU discards all related data modifications in the CPU Cache and transfers control back to _xbegin; from the software's perspective, _xbegin returns an error code. On transaction failure, the point query algorithm retries (see line 3 of the algorithm).
The second feature is the search operation on leaf nodes. First, the algorithm computes the 1B fingerprint of the query key and compares it with the fourteen 1B values of the fingerprint array in the header using SIMD vector operations; only when the fingerprint matches can the index entry match. Second, for each index entry whose fingerprint matches, the bitmap is checked to determine whether the entry is valid, and then the full index key is compared. This technique, drawn from the existing FP-Tree, reduces the cost of comparing every index entry in turn.
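The two-step leaf search (fingerprint filter first, then bitmap and full-key check) can be sketched as follows (hypothetical Python; the hash and function names are assumptions, and the single SIMD compare of the 14 fingerprints is replaced by a plain loop):

```python
def fingerprint(key: int) -> int:
    # 1B fingerprint of the 8B key (the concrete hash is an assumption)
    return (key * 0x9E3779B97F4A7C15 >> 56) & 0xFF

def leaf_lookup(bitmap, fps, entries, key):
    """bitmap: 14-bit occupancy; fps: 14 fingerprints; entries: (key, val)."""
    fp = fingerprint(key)
    for i, f in enumerate(fps):              # one SIMD compare in hardware
        if f == fp and (bitmap >> i) & 1:    # fingerprint hit, slot occupied
            k, v = entries[i]
            if k == key:                     # full 8B key comparison
                return v
    return None
```

A fingerprint collision is harmless: the full key comparison rejects it, the filter merely skips most non-matching entries cheaply.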
4. Insertion algorithm of 256B leaf node
The data insertion operation of the leaf node of the invention comprises the following steps:
(1) First-row index entry migration. On real NVM hardware, the write performance of each 64B row of data does not change with the amount of data modified: when i bytes of a 64B row change and 64-i bytes do not, NVM write performance is the same for any i between 1 and 64. (Note that simulation studies before real NVM hardware appeared sometimes assumed that fewer modifications yield better performance.) Based on this finding, the invention minimizes the number of 64B lines modified in a leaf node; since the header is always modified on insertion, an insertion preferably modifies only the first index line Line0.
FIG. 4 is a schematic diagram of first-row index entry migration. As shown in FIG. 4, when inserting 6, Line0 has a free index entry, so writing 6 into this free entry and modifying the header both occur in Line0; the insertion causes a write of only 1 index line, which is the best case.
However, when the index entries of the first row are all occupied, the insertion must write free index entries in other rows in addition to the first row, whose header also needs modification, so 2 index lines will be written.
In this case, first-row index migration moves the occupied index entries of Line0 into the index row being modified, vacating Line0 as much as possible so that subsequent insertions can use the vacated first-row entries and again write only 1 index line. As shown in FIG. 4, when inserting 3, Line0 is full, so a free index entry in Line1 must be used. At this point 9, 6 and 4 in Line0 are all migrated to Line1, freeing 3 slots in Line0. When 7 is then inserted, a free index entry is again found in Line0, achieving the best case.
Note that when the index entries of Line0 are migrated, the destination line receives additional writes (the migrated index entries) and Line0 receives additional writes (the updated fingerprint array), but the number of NVM lines written does not change, so write performance is unaffected, and subsequent insert operations benefit from the migration.
Analysis shows that, in a stable LB+-Tree, first-line index entry migration reduces the number of NVM line writes by a factor of at least 1.35 (i.e., (NVM lines written by the conventional scheme) / (NVM lines written with first-line migration) ≥ 1.35).
FIG. 5 is a schematic diagram of the 256B-leaf-node LB+-Tree insertion algorithm of the present invention. As shown in FIG. 5, lbtreeLeafInsert is one implementation of the first-line index entry migration described above. Lines 15-20 handle the case where a free entry is found in the first line, corresponding to the insertions of 6 and 7 in FIG. 4; lines 21-32 handle the case where the first line is fully occupied and another line must be used, performing first-line index entry migration as part of the insertion.
In addition, the algorithm of FIG. 5 has several notable details. The first concerns modification of the header: the algorithm never modifies the header in place, but first copies it into a temporary variable dword, modifies dword, and then writes dword back to the header. Single-bit operations and single-byte modifications thus happen in the temporary variable, the write-back unit is 8B, and crash-consistency handling is simplified.
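As a hedged sketch of this "copy to dword, modify, write back" pattern, the following assumes an illustrative bit layout for the 8B header word; the patent fixes the header's fields (lock bit, alt bit, occupancy bitmap, fingerprints), not their positions.

```python
# Assumed layout of the 8B header word, for illustration only:
# bit 63 = lock bit, bit 62 = alt bit, bits 48..61 = 14-bit bitmap.
LOCK_BIT = 1 << 63
ALT_BIT = 1 << 62
BITMAP_SHIFT = 48
BITMAP_MASK = (1 << 14) - 1

def set_slot(dword, slot):
    """Mark bitmap bit `slot` occupied. All single-bit edits happen in
    the temporary copy; the caller then persists the whole 8B word
    with one atomic NVM write-back."""
    return dword | (1 << (BITMAP_SHIFT + slot))

def clear_slot(dword, slot):
    """Mark bitmap bit `slot` free in the temporary copy."""
    return dword & ~(1 << (BITMAP_SHIFT + slot))

def toggle_alt(dword):
    """Flip the alternating control bit in the temporary copy."""
    return dword ^ ALT_BIT
```

Because every modification is staged in the 8B temporary and written back as a unit, the header in NVM is only ever seen in its old or new state, which is what simplifies crash consistency.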
The second detail is that the algorithm always modifies free locations first, including the free index entry location and the free fingerprint location, and writes the first 8B of the header line last. Modifying free locations does not affect crash consistency; only when those modifications are complete does the write of the first 8B (which includes the bitmap modification) finally change the state of the leaf node and make the newly written index entry valid.
A final detail is the implementation of concurrency control. The algorithm sets the lock bit of the leaf node in LBTreeInsert and then commits the hardware transaction with _xend. This excludes other read and write operations, giving the current thread exclusive access to the leaf node. A failure such as a power outage may occur while the lock bit is set. During crash recovery the leaves must be scanned to rebuild the internal nodes, and any lock bits left set can be cleared uniformly at that time, ensuring the correctness of further processing.
(2) Zero-log leaf node splitting technique
Node splitting in existing NVM-oriented B+-Trees must be protected by a write-ahead log, which incurs substantial costs in NVM writes, cache line flushes, sfence/mfence instructions, and so on.
FIG. 6 is a schematic diagram of the zero-log leaf node split of the present invention. As shown in FIG. 6, the invention proposes zero-log leaf node splitting, which replaces the write-ahead log with a pair of alternating sibling pointers switched by a single NAW (NVM Atomic Write: an 8B write followed by a clwb/clflush instruction and an sfence/mfence instruction, where "/" denotes "or"). Specifically, as shown in FIG. 6(a), before the split the alt bit indicates the valid pointer S0. As shown in FIG. 6(b), the first step of the split allocates and writes the new node and sets the other (inactive) sibling pointer; note that neither the new node nor the inactive sibling pointer changes the visible state of the original leaf node, so crash consistency is unaffected. As shown in FIG. 6(c), the second step swaps the sibling pointers by writing the alt bit with one NAW while simultaneously writing the bitmap, marking as free the index entry positions moved to the new node. Thus, one NAW completes the entire switch of the leaf node state, and the cost of writing a log is avoided.
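The two-step split can be sketched as a small Python simulation, with dictionaries standing in for NVM nodes (the node layout and names are illustrative; the comment marks the point that corresponds to the single NAW):

```python
def make_leaf():
    # alt selects which sibling pointer (and, conceptually, which half
    # of the 8B header state) is currently valid.
    return {"entries": set(), "sibling": [None, None], "alt": 0}

def zero_log_split(node):
    """Two-step zero-log split. Step 1 touches only state that is
    invisible while alt is unchanged; step 2 models the single NAW
    that flips alt and clears the moved entries' bitmap bits, since
    both live in the same 8B header word."""
    keys = sorted(node["entries"])
    mid = len(keys) // 2
    # Step 1: allocate and write the new node, fill the inactive pointer.
    new = make_leaf()
    new["entries"] = set(keys[mid:])
    new["sibling"][0] = node["sibling"][node["alt"]]  # inherit right sibling
    inactive = 1 - node["alt"]
    node["sibling"][inactive] = new                   # not yet visible
    # Step 2: one atomic 8B write switches alt and drops the moved keys.
    node["alt"] = inactive
    node["entries"] = set(keys[:mid])
    return new
```

If a crash happens anywhere before step 2, the old node still reads exactly as before the split; if it happens after, the split is complete, so no log is needed.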
FIG. 7 is a schematic diagram of the 256B-leaf-node zero-log split algorithm of the present invention. FIG. 7 shows an implementation of zero-log splitting for 256B leaf nodes; the overall flow follows the example of FIG. 6. The main complexity lies in deciding whether the newly inserted index entry goes into the new node or the old node, while applying the first-line index entry migration optimization. In the new node, first-line index entries are left free as far as possible; in the old node, lbtreeLeafInsert is called to complete the insertion of the new index entry.
5. Structure of the multi-256B leaf node
FIG. 8 is a schematic diagram of the multi-256B leaf node structure of the present invention. As shown in FIG. 8, an LB+-Tree leaf node may also be a multi-256B node, composed of several 256B index units. The first index unit of a multi-256B node, multi-NVMLine, has the same structure and function as the index unit NVMLine of a 256B node, except that the alternating control bit alt bit in its index header H (header) simultaneously controls which of the two sibling linked-list pointers S0 and S1 in the first index unit is currently valid and which of H0 and H1 in each other index unit multi-NVMLine' is currently valid. In the aligned 64B index line containing H, all space other than H holds index entries, so that the number of index entries in the header line is as large as possible.
Each other index unit of the multi-256B node, multi-NVMLine', comprises in order a leading index header H0 (header0), several index entries, and a trailing index header H1 (header1), with no sibling linked-list pointer. H0 and H1 have the same structure, each containing a write lock bit lock bit', an alternating control bit alt bit', an index occupancy bitmap', and index fingerprint bits F'; however, the lock bit' and alt bit' of multi-NVMLine' merely reserve the corresponding data bits and are not used.
In the index occupancy bitmap', each bit records the occupancy state of one index entry of the current index unit multi-NVMLine' of the multi-256B leaf node. The index fingerprint bits F' form a fingerprint array recording the fingerprint of each index entry of the current index unit multi-NVMLine'; each fingerprint is computed by a hash function, so identical index keys have identical fingerprints.
The characteristics and significance of the multi-256B node structure include: (1) Distributed index headers. Existing B+-Tree nodes use a centralized index header, storing the meta-information of all index entries at the start of the node. As the node size grows, this meta-information grows, the space available for index entries in the first index line shrinks, and the first-line index entry migration technique loses effectiveness. The multi-256B node proposed by the invention instead uses distributed index headers: each 256B unit has its own index header storing the meta-information of the index entries within that 256B. This distributed design preserves as much index entry space as possible in each line containing an index header, so the first-line index entry migration technique remains fully effective. (2) H0 and H1. When a multi-256B leaf node splits, every 256B unit may have entries moved out, so the meta-information stored in each 256B unit's index header may need modification, which clearly cannot be done with a single 8B NVM atomic write. Yet zero-log splitting requires that a single NVM atomic write complete the update of the entire node state, including clearing the bitmap positions of moved-out index entries. The H0/H1 design achieves this: one of H0 and H1 is the valid index header and the other is invalid, controlled by the alt bit in H. During a split, the invalid index header is written with the post-split index entry metadata while the valid header still holds the pre-split metadata. A single NVM atomic write then flips the alt bit, making the invalid header valid and the valid header invalid, switching the node between its pre-split and post-split states.
6. Multi-256B leaf node LB+-Tree point query algorithm
The main difference from the 256B-leaf-node LB+-Tree query algorithm is the search within leaf nodes: each 256B unit of the multi-256B leaf node is searched in turn, applying the 256B leaf node search algorithm described above to each unit. For every 256B unit other than the first, which header is valid is determined by the alt bit in the first 256B unit.
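The per-unit search can be sketched as follows, with an illustrative 1-byte fingerprint function (the patent only requires that equal keys yield equal fingerprints; the names are not the patent's):

```python
def fingerprint(key):
    # Illustrative 1-byte fingerprint derived from a hash function.
    return hash(key) & 0xFF

def leaf_search(bitmap, fps, keys, key):
    """Check occupancy and the 1-byte fingerprint first; only a
    fingerprint hit triggers the full key comparison, which may
    touch another index line. Returns the slot index or -1."""
    fp = fingerprint(key)
    for i, k in enumerate(keys):
        if (bitmap >> i) & 1 and fps[i] == fp and k == key:
            return i
    return -1
```

Comparing the fingerprint array (which sits in the header line) before the keys keeps most misses inside a single 64B line.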
7. Multi-256B leaf node LB+-Tree insertion algorithm
The LB+-Tree insertion algorithm for multi-256B leaf nodes extends the insertion algorithm for 256B leaf nodes. Internal node search, concurrency control, and so on are identical; the main difference is the insertion into the leaf node.
Insertion into the leaf node visits each 256B unit in turn from front to back. The algorithm always inserts into the first free index entry. Within the 256B unit containing that first free entry, the first-line index entry migration technique is applied.
If the leaf node is full, node splitting is performed using an algorithm similar to the 256B-leaf-node zero-log split. The main difference is that every 256B unit other than the first has two headers (H0 and H1), one in use and one idle, as determined by the alt bit of the first 256B unit. In the first step of the split, in addition to the zero-log split operations, the idle header of each unit is written to reflect the post-split bitmap and fingerprint array contents. A single NAW in the second step can then change the state of the first 256B unit and all other 256B units simultaneously: for the first 256B unit, the NAW modifies both the alt bit (swapping the sibling pointer) and the bitmap (deleting the moved index entries); for the other 256B units, the same modification of the alt bit swaps their headers. The state switch of the whole leaf node is thus completed with one NAW, and crash consistency is guaranteed.
8. LB+-Tree deletion algorithm
The LB+-Tree deletion algorithm performs no node merging; it only modifies the bitmap (for a multi-256B leaf node, also the bitmap' of every 256B unit other than the first index unit multi-NVMLine) to mark the deleted position as a free index entry. Deletion therefore completes with a single NAW, similar to existing NVM-oriented B+-Trees.
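Because deletion only clears one bitmap bit in the 8B header word, it can be sketched in a few lines (the bit positions are assumed for illustration, as before):

```python
BITMAP_SHIFT = 48  # assumed position of the 14-bit occupancy bitmap

def leaf_delete(header, slot):
    """No node merging: clearing the slot's bitmap bit in the temporary
    copy and persisting the 8B header word with one NAW is the whole
    durable state change of a deletion."""
    return header & ~(1 << (BITMAP_SHIFT + slot))
```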
9. LB+-Tree range query algorithm
The LB+-Tree preserves the B+-Tree property that leaf nodes are ordered along the sibling linked list, so range queries are easily supported. Given a key range, the starting and ending leaf nodes are determined by search, and the leaf nodes are then visited in order along the sibling list from the starting leaf node to the ending leaf node. In the starting and ending leaf nodes, each index key must be compared in turn to find the index entries satisfying the range condition; in the intermediate leaf nodes, all index entries satisfy the range condition, so no comparison is needed.
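The scan along the sibling list can be sketched as follows (dictionaries stand in for leaf nodes; for simplicity this sketch compares keys in every leaf, whereas the real structure skips comparisons in the interior leaves):

```python
def range_query(start_leaf, lo, hi):
    """Walk the sibling linked list from the leaf containing lo,
    collecting keys in [lo, hi], and stop at the terminating leaf."""
    result, leaf = [], start_leaf
    while leaf is not None:
        keys = sorted(leaf["entries"])
        result.extend(k for k in keys if lo <= k <= hi)
        if keys and keys[-1] >= hi:  # reached the terminating leaf
            break
        leaf = leaf["next"]
    return result
```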
10. LB+-Tree fault recovery
When a fault such as a power outage, downtime, or system crash occurs, the leaf nodes of the LB+-Tree in NVM remain consistent. Recovery of the LB+-Tree is therefore accomplished by scanning the leaf nodes in NVM to rebuild the internal nodes. If a lock bit is found set to 1 during the scan, it is cleared and written back.
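Recovery can be sketched as a single pass over the NVM-resident leaves (illustrative Python; "routing" stands in for the material from which the DRAM internal-node layer is rebuilt):

```python
def recover(nvm_leaves):
    """Leaves in NVM are always consistent after a crash, so recovery
    just scans them: clear any lock bit left set by an interrupted
    writer (cleared and written back) and collect (min_key, leaf)
    routing pairs for rebuilding the DRAM-resident internal nodes."""
    routing = []
    for leaf in nvm_leaves:
        if leaf["lock"]:
            leaf["lock"] = 0  # uniform clearing during the scan
        routing.append((min(leaf["entries"]), leaf))
    return routing
```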
FIG. 9 compares the index insertion performance of the present invention with the prior art. As shown in FIG. 9, in the experiment the index is first loaded with 2 billion (8B key, 8B value) index entries so that each node is 70%-100% full, and then either random insertions or dense insertions into the rightmost leaf nodes are performed. In FIG. 9 the horizontal axis is the number of inserted index entries and the vertical axis is the time to complete all insertions, so a lower curve indicates better performance. The experiment compares the LB+-Tree of the invention with two existing NVM-oriented indexes, WB+-Tree and FP-Tree. As shown in FIG. 9(a), LB+-Tree has a clear advantage under random insertion. In FIG. 9(b), where the nodes start full, insertions trigger a large number of node split operations, and the zero-log split technique gives LB+-Tree a large advantage. In FIG. 9(c), dense insertion into the rightmost nodes fully exercises the first-line index entry migration technique, again showing a large advantage. Other experiments show that LB+-Tree performs similarly to existing NVM-oriented B+-Tree indexes under query and delete operations. The experiments thus demonstrate that the LB+-Tree structure outperforms existing NVM-oriented B+-Tree structures, with a particularly notable improvement in insertion performance.
FIG. 10 is a schematic diagram of a data processing system of the present invention. As shown in FIG. 10, an embodiment of the present invention also provides a computer-readable storage medium and a data processing system. The computer-readable storage medium stores executable instructions which, when executed by a processor of the data processing system, implement the above data indexing method for the main memory of the data processing system. The main memory comprises a DRAM main memory and an NVM main memory arranged in parallel; both are visible to software, and the software decides which data are placed in the volatile DRAM and which in the non-volatile NVM. Those skilled in the art will understand that all or part of the steps of the above method may be implemented by a program instructing the relevant hardware (e.g., processor, FPGA, ASIC), the program being stored in a readable storage medium such as a read-only memory, magnetic disk, or optical disc. All or some of the steps of the above embodiments may also be implemented using one or more integrated circuits. Accordingly, the modules in the above embodiments may be implemented in hardware, for example by an integrated circuit, or in software, for example by a processor executing programs/instructions stored in a memory. Embodiments of the invention are not limited to any specific combination of hardware and software.
The invention solves the problems that the prior art is ill-suited to real 3D XPoint hardware and that node splitting is costly, by providing a new non-volatile-main-memory-oriented tree index structure (LB+-Tree). Compared with the prior art, this structure optimizes the node layout while retaining efficient read operations, reduces write cost by exploiting the atomic write operations of non-volatile main memory and the characteristics of real 3D XPoint hardware, and achieves zero-log persistence support for the tree index structure.
The above embodiments are only for illustrating the invention and are not to be construed as limiting it. Those skilled in the art can make various changes and modifications without departing from the spirit and scope of the invention; therefore, all equivalent technical solutions also fall within the scope of the invention, which is defined by the claims.

Claims (10)

1. An NVM-based data indexing method, comprising:
constructing a tree index structure, wherein the root node and internal nodes of the tree index structure are placed in a DRAM main memory of a data processing system, and the leaf nodes of the tree index structure are placed in an NVM main memory of the data processing system; the leaf nodes are multiple, all in the same layer, and connected from left to right by a sibling linked list; each leaf node comprises an A-index unit, which comprises in order a first index line, middle index lines, and a tail index line, each comprising a plurality of index entries;
when newly added data is written to the current leaf node, judging whether the A-index unit of the current leaf node has a free index entry; if so, performing a data write operation; otherwise, performing and completing a node splitting operation and then performing the data write operation;
wherein the data write operation comprises: if the first index line has a free index entry, writing the newly added data into the free index entry of the first index line; otherwise, migrating the newly added data and the stored data held in the first index line to free index entries of the middle index lines and/or the tail index line;
the node splitting operation comprises: constructing a new leaf node, adding it to the sibling linked list, and migrating part of the stored data of the current leaf node to free index entries of the index unit of the new leaf node.
2. The data indexing method of claim 1, wherein the A-index unit NVMLine includes M index lines, each line having N data item storage locations; Line 1 is the first index line Lh, the 1st data item storage location of Lh is the index header H, and the remaining data item storage locations of Lh are index entries; the 2nd to (M-1)th lines are middle index lines Li, all of whose data item storage locations are index entries; the Mth line is the tail index line Lt, the Nth data item storage location of Lt is a sibling linked-list pointer item S, and the other data item storage locations of Lt are index entries; M and N are positive integers;
the index header H includes a write lock bit lock bit, an alternating control bit alt bit, an index occupancy bitmap, and index fingerprint bits F; S includes a pointer S0 and a pointer S1; the lock bit is used for setting the write state of the current leaf node; the alt bit is used for setting, by NVM atomic write, one of S0 and S1 as the valid pointer and the other as the invalid pointer; the bitmap records the occupancy state of each index entry; F is a fingerprint array recording the fingerprint of each index entry; the valid pointer of S connects the sibling linked list.
3. The data indexing method of claim 2, wherein the node splitting operation specifically comprises:
when all index entries of a leaf node Noden are occupied, allocating a new leaf node Noden', Noden' having the same structure as Noden;
copying the stored data of Noden to Noden', and modifying the index occupancy bitmap' of the index header H' of the first index line Lh' of Noden';
pointing the valid pointer of the sibling linked-list pointer item S' of Noden' to the right sibling leaf node Noden+1 of Noden, and pointing the invalid pointer of the sibling linked-list pointer item S of Noden to Noden';
persisting Noden' in the NVM main memory;
setting the alt bit by NVM atomic write so that the pointer of S pointing to Noden' becomes the valid pointer and the pointer of S pointing to Noden+1 becomes the invalid pointer; the index entries of Noden holding the part of the stored data are cleared to free index entries by modifying the bitmap with the NVM atomic write.
4. The data indexing method of claim 2, wherein the leaf node further includes at least one B-index unit NVMLine', each NVMLine' including M lines, each line having N data item storage locations;
the 1st data item storage location of the 1st line of NVMLine' is a leading index header H0, and the Nth data item storage location of the Mth line of NVMLine' is a trailing index header H1;
H0 includes an index occupancy bitmap' and index fingerprint bits F'; H1 has the same structure as H0; the alt bit of H is set by NVM atomic write so that one of H0 and H1 is the valid index header and the other is the invalid index header, wherein the bitmap' of the valid index header records the occupancy state of each index entry of the current NVMLine', and F' of the valid index header is a fingerprint array recording the fingerprint of each index entry of the current NVMLine'.
5. The data indexing method of claim 2, further comprising:
performing the data write operation with exclusive access to the leaf node only after the write lock bit is set to the locked state and the concurrency-control hardware transaction has exited.
6. The data indexing method of claim 2, wherein, when a fault recovery operation is performed, the root node and the internal nodes are rebuilt in the DRAM main memory from all the leaf nodes to restore the tree index structure;
and if the write lock bit of an NVMLine is in the locked state, the write lock bit is set to the unlocked state and the stored data of the NVMLine is recovered.
7. The data indexing method of claim 2, wherein part or all of the index entries of the NVMLine are set as free index entries by modifying the bitmap.
8. The data indexing method of claim 4, wherein part or all of the index entries of the NVMLine are set as free index entries by modifying the bitmap, and part or all of the index entries of the NVMLine' are set as free index entries by modifying the bitmap'.
9. A computer-readable storage medium storing executable instructions for performing the NVM-based data indexing method of any one of claims 1-8.
10. A data processing system comprising:
a processor;
a main memory connected with the processor and including a DRAM main memory and an NVM main memory connected in parallel;
the computer-readable storage medium of claim 9, wherein the processor retrieves and executes the executable instructions in the computer-readable storage medium to perform NVM-based data indexing.
CN202010064770.8A 2020-01-20 2020-01-20 Data indexing method and data processing system based on NVM (non-volatile memory) main memory Active CN111274456B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010064770.8A CN111274456B (en) 2020-01-20 2020-01-20 Data indexing method and data processing system based on NVM (non-volatile memory) main memory

Publications (2)

Publication Number Publication Date
CN111274456A true CN111274456A (en) 2020-06-12
CN111274456B CN111274456B (en) 2023-09-12

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117131050A (en) * 2023-08-28 2023-11-28 中国科学院软件研究所 Spatial index method based on magnetic disk and oriented to workload and query sensitivity

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7702640B1 (en) * 2005-12-29 2010-04-20 Amazon Technologies, Inc. Stratified unbalanced trees for indexing of data items within a computer system
CN102843396A (en) * 2011-06-22 2012-12-26 中兴通讯股份有限公司 Data writing and reading method and device in distributed caching system
US20140082316A1 (en) * 2012-09-14 2014-03-20 International Business Machines Corporation Selecting pages implementing leaf nodes and internal nodes of a data set index for reuse
CN105930280A (en) * 2016-05-27 2016-09-07 诸葛晴凤 Efficient page organization and management method facing NVM (Non-Volatile Memory)
CN107463447A (en) * 2017-08-21 2017-12-12 中国人民解放军国防科技大学 B + tree management method based on remote direct nonvolatile memory access
US20180373743A1 (en) * 2017-06-25 2018-12-27 Oriole Db Inc Durable multiversion b+-tree
CN109407978A (en) * 2018-09-27 2019-03-01 清华大学 The design and implementation methods of high concurrent index B+ linked list data structure
CN109407979A (en) * 2018-09-27 2019-03-01 清华大学 Multithreading persistence B+ data tree structure design and implementation methods
CN109857566A (en) * 2019-01-25 2019-06-07 天翼爱动漫文化传媒有限公司 A kind of resource lock algorithm of memory read-write process
CN109976947A (en) * 2019-03-11 2019-07-05 北京大学 A kind of method and system of the power loss recovery towards mixing memory
US20190235933A1 (en) * 2018-01-31 2019-08-01 Microsoft Technology Licensing, Llc Index Structure Using Atomic Multiword Update Operations

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
FEI XIA: "HiKV: A Hybrid Index Key-Value Store for DRAM-NVM Memory Systems" *
SHIMIN CHEN: "Persistent B+-Trees in Non-Volatile Main Memory" *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant