CN114201551A - Data storage method and data storage device - Google Patents

Data storage method and data storage device

Info

Publication number
CN114201551A
CN114201551A
Authority
CN
China
Prior art keywords
log
data
node
processed
tag
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111488099.0A
Other languages
Chinese (zh)
Inventor
徐宁
付钰
谢娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CCB Finetech Co Ltd
Original Assignee
CCB Finetech Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CCB Finetech Co Ltd filed Critical CCB Finetech Co Ltd
Priority to CN202111488099.0A priority Critical patent/CN114201551A/en
Publication of CN114201551A publication Critical patent/CN114201551A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1097Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]


Abstract

The application provides a data storage method and a data storage device, relates to the field of storage, and helps reduce the write latency of a distributed key-value storage system. The method comprises the following steps: a log node receives a first transaction processing request from a client node, where the request carries metadata and data to be processed of a target transaction; the log of the log node comprises at least one tag log and a global log, each tag log comprises the metadata and processed data of the storage node corresponding to that tag log, and the global log comprises an identifier of the tag log and a log sequence number; the log node determines a target storage node from the at least one storage node based on the first transaction request; the log node writes the metadata and the data to be processed of the target transaction into a first tag log corresponding to the target storage node and updates the first tag log; and the log node writes the identifier of the first tag log into the global log and updates the log sequence number of the global log.

Description

Data storage method and data storage device
Technical Field
The present invention relates to the field of storage, and more particularly, to a data storage method and a data storage apparatus.
Background
As core components of distributed storage systems and distributed databases, distributed metadata systems and distributed data systems have been key to the performance and stability of large-scale systems.
With the development of systems such as NewSQL and hybrid transaction/analytical processing (HTAP), new systems generally tend to use a transaction-supporting distributed key-value store with flat ordering as the underlying storage; but to support transactions, such distributed designs often suffer performance degradation in both write latency and storage capacity.
Disclosure of Invention
The application provides a data storage method and a data storage device, which help reduce the write latency of a distributed key-value storage system.
In a first aspect, a data storage method is provided, applied to a data storage system including a client node, a log node, and at least one storage node, where the at least one storage node includes a persistent memory (PMEM) in which application data is stored. The method comprises the following steps: the log node receives a first transaction processing request from the client node, where the request carries metadata and data to be processed of a target transaction; the log of the log node comprises at least one tag log and a global log, the at least one tag log is in one-to-one correspondence with the at least one storage node, each tag log comprises the metadata and processed data of its corresponding storage node, and the global log comprises an identifier of the tag log and a log sequence number (LSN). The log node modifies the application data of the persistent memory through remote direct memory access (RDMA) based on the metadata and the data to be processed of the target transaction. The log node determines a target storage node from the at least one storage node based on the first transaction request. The log node writes the metadata and the data to be processed of the target transaction into a first tag log corresponding to the target storage node and updates the first tag log. The log node writes the identifier of the first tag log into the global log and updates the log sequence number of the global log.
Combining RDMA and PMEM provides a high-performance, highly available distributed key-value storage system; in particular, applying the RDMA-and-PMEM combination at the critical log nodes balances cost against latency.
With reference to the first aspect, in certain implementations of the first aspect, the storage node further includes a volatile memory. The log node modifying the application data of the persistent memory through RDMA based on the metadata and the data to be processed of the target transaction includes: when the data to be processed includes a large amount of burst data, the log node writes the metadata of the target transaction and the data to be processed to the volatile memory.
With reference to the first aspect, in certain implementations of the first aspect, the log node moves data stored in the volatile memory during a first time period to the persistent memory if the remaining storage capacity of the volatile memory is less than a preset threshold.
With reference to the first aspect, in certain implementations of the first aspect, the at least one tag log and the global log are stored in the persistent memory.
In a second aspect, a data storage method is provided, applied to a data storage system including a client node, a log node, and at least one storage node, where the at least one storage node includes a persistent memory storing application data, and the log of the log node includes at least one tag log and a global log. The method comprises the following steps: a target storage node in the at least one storage node determines, from the at least one tag log, a first tag log corresponding to the target storage node, where the at least one tag log is in one-to-one correspondence with the at least one storage node, each tag log comprises the metadata and processed data of its corresponding storage node, and the global log comprises an identifier of the tag log and a log sequence number. The target storage node reads the metadata and the data to be processed of the target transaction in the first tag log through RDMA, and modifies the application data of the persistent memory based on that metadata and data. The target storage node updates the first tag log and the global log based on the modification to the application data.
With reference to the second aspect, in certain implementations of the second aspect, the at least one tag log and the global log are stored in the persistent memory.
With reference to the second aspect, in some implementations of the second aspect, the target storage node receives a second transaction request, where the second transaction request is used to request modification or query of the data to be processed stored in the target storage node. Based on the second transaction request, the target storage node determines, through the preconfigured mapping between storage nodes and tag logs together with the global log, to read the data to be processed from the first tag log, and modifies or queries that data.
In a third aspect, there is provided a data storage device arranged to perform the method of any one of the possible implementations of any one of the above aspects. In particular, the apparatus comprises means for performing the method of any one of the possible implementations of any one of the above aspects.
In a fourth aspect, there is provided a data processing apparatus comprising a processor coupled to a memory and operable to execute instructions in the memory to implement the method of any one of the possible implementations of any one of the aspects. Optionally, the apparatus further comprises a memory. Optionally, the apparatus further comprises a communication interface, the processor being coupled to the communication interface.
In a fifth aspect, a processor is provided, comprising: input circuit, output circuit and processing circuit. The processing circuit is configured to receive a signal via the input circuit and transmit a signal via the output circuit, such that the processor performs the method of any one of the possible implementations of any one of the above aspects.
In a specific implementation process, the processor may be a chip, the input circuit may be an input pin, the output circuit may be an output pin, and the processing circuit may be a transistor, a gate circuit, a flip-flop, various logic circuits, and the like. The input signal received by the input circuit may be received and input by, for example and without limitation, a receiver, the signal output by the output circuit may be output to and transmitted by a transmitter, for example and without limitation, and the input circuit and the output circuit may be the same circuit that functions as the input circuit and the output circuit, respectively, at different times. The specific implementation of the processor and various circuits are not limited in this application.
In a sixth aspect, a processing apparatus is provided that includes a processor and a memory. The processor is configured to read instructions stored in the memory and to receive signals via the receiver and transmit signals via the transmitter to perform the method of any one of the possible implementations of any one of the aspects.
Optionally, there are one or more processors and one or more memories.
Alternatively, the memory may be integrated with the processor, or provided separately from the processor.
In a specific implementation process, the memory may be a non-transitory (non-transitory) memory, such as a Read Only Memory (ROM), which may be integrated on the same chip as the processor, or may be separately disposed on different chips.
It will be appreciated that the associated data interaction process, for example sending indication information, may be a process of the processor outputting the indication information, and receiving capability information may be a process of the processor receiving the input capability information. In particular, data output by the processor may be output to a transmitter, and input data received by the processor may come from a receiver. The transmitter and receiver may be collectively referred to as a transceiver.
The processing device in the above sixth aspect may be a chip, the processor may be implemented by hardware or may be implemented by software, and when implemented by hardware, the processor may be a logic circuit, an integrated circuit, or the like; when implemented in software, the processor may be a general-purpose processor implemented by reading software code stored in a memory, which may be integrated with the processor, located external to the processor, or stand-alone.
In a seventh aspect, a computer program product is provided, the computer program product comprising: computer program (also called code, or instructions), which when executed, causes a computer to perform the method of any of the possible implementations of any of the above aspects.
In an eighth aspect, a computer-readable storage medium is provided, which stores a computer program (which may also be referred to as code, or instructions) that, when executed on a computer, causes the computer to perform the method of any of the possible implementations of any of the above aspects.
Drawings
FIG. 1 is a schematic diagram of a hierarchical storage structure of a computer system;
FIG. 2 is a schematic diagram of a data storage system provided by an embodiment of the present application;
FIG. 3 is a schematic flow chart diagram of a data storage method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of an FDB database architecture provided by an embodiment of the present application;
FIG. 5 is a schematic block diagram of a data storage device provided by an embodiment of the present application;
FIG. 6 is a schematic block diagram of another data storage device provided by an embodiment of the present application;
fig. 7 is a schematic block diagram of another data storage device provided in an embodiment of the present application.
Detailed Description
The technical solution in the present application will be described below with reference to the accompanying drawings.
For ease of understanding, first, terms referred to in the embodiments of the present application will be briefly described.
1. Non-volatile memory (NVM): refers to a memory in which stored data does not disappear when the current is turned off.
2. Persistent memory (PMEM): a kind of NVM; refers to storage hardware that supports byte addressing (byte-addressing), can be operated directly by central processing unit (CPU) instructions, and does not lose data after power-off.
3. Solid-state drive (SSD): a hard disk made of an array of solid-state electronic storage chips, comprising a control unit and a storage unit. The storage media of solid-state drives fall into two types: one uses flash memory as the storage medium, and the other uses dynamic random access memory.
4. Hard disk drive (HDD): a primary computer storage medium; the device in a personal computer that controls hard-disk addressing and data access, through which data is stored on the hard disk.
5. Remote direct memory access (RDMA): RDMA technology was created to address server-side data-processing delay in network transmission. RDMA can transfer data directly over the network into a computer's storage area, quickly moving data from one system into remote system memory without impact on the operating system and without consuming much of the computer's processing power.
6. NoSQL: compared with SQL, NoSQL has the capability of supporting mass data, but relaxes the transaction and SQL query capabilities.
7. NewSQL: compared with SQL and NoSQL, the system is a new generation SQL system which has the ultra-strong expansibility of NoSQL and also has the usability, the query capability and the transaction capability of the traditional SQL.
8. TiDB: a NewSQL database from PingCAP.
9. TiKV: a PingCAP product; a transactional key-value store that serves as the underlying storage supporting the implementation of TiDB.
10. FoundationDB (FDB): a transactional key-value store open-sourced by Apple, used as underlying storage and supporting Apple's record layer and CloudKit. The record layer is a structured, SQL-supporting layer implemented by Apple on top of FDB.
11. Write-ahead log (WAL): WAL is a mechanism commonly used in databases to ensure the atomicity and durability of data operations, where atomicity refers to the recoverability of data, and durability means that successfully committed data is persisted to a durable storage medium such as a disk. In a system using the WAL mechanism, all modifications are written to a log file prior to commit.
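The write-before-commit rule can be illustrated with a minimal sketch (all names here are hypothetical, not part of the patented system): the record is flushed to the log file first, and the in-memory state is modified only after the log write succeeds, so a crash can always be recovered by replaying the log.

```python
import json
import os
import tempfile

class SimpleWAL:
    """Toy write-ahead log: append durably, replay on recovery."""

    def __init__(self, path):
        self.path = path

    def append(self, record):
        # WAL rule: the modification is durably logged *before* commit.
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # force the record to stable storage

    def replay(self):
        # Recovery: re-read every logged record, in write order.
        if not os.path.exists(self.path):
            return []
        with open(self.path) as f:
            return [json.loads(line) for line in f]

fd, path = tempfile.mkstemp()
os.close(fd)

store = {}
wal = SimpleWAL(path)
wal.append({"op": "put", "key": "a", "value": 1})
store["a"] = 1  # apply to the data store only after the log write succeeded

# Simulated recovery: rebuild the store purely from the log.
recovered = {r["key"]: r["value"] for r in wal.replay() if r["op"] == "put"}
```

Replaying the log reproduces exactly the committed state, which is what makes the mechanism give both atomicity (recoverability) and durability.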
FIG. 1 is a schematic diagram of a hierarchical storage structure of a computer system. In the hierarchical storage structure shown in fig. 1, CPU registers, CPU caches, dynamic random access memory (DRAM), persistent memory (PMEM), solid-state drive (SSD), hard disk drive (HDD), and tape are arranged from top to bottom, where the CPU registers, CPU caches, and DRAM are volatile storage, and the PMEM, SSD, HDD, and tape are non-volatile storage. As can be seen from fig. 1, moving from top to bottom in the hierarchy, the latency of the storage media gradually increases, the cost gradually decreases, and the capacity gradually increases.
Enterprise grade SSDs can provide response times on the order of 10 microseconds, and DRAM response times on the order of approximately 100 nanoseconds. It is clear that there is a huge performance gap between DRAMs and SSDs. PMEM can provide delay on the order of sub-microsecond (less than 1 microsecond), and is located in a layer of storage between DRAM and SSD in terms of cost, latency, and capacity, and the appearance of PMEM can fill the performance gap between DRAM and SSD.
Existing distributed key-value stores are usually based on a single-machine key-value storage engine, such as RocksDB used by TiKV or the modified SQLite used by FDB. To ensure local or distributed data consistency, the single-machine key-value storage engine needs to add a WAL mechanism at the single-machine level or the distributed-module level, which may reduce local I/O performance, or increase module-interaction time at the distributed-system level, imposing an extra burden on the whole system and reducing overall performance.
In view of this, embodiments of the present application provide a data storage method and a data storage device, where the method may introduce PMEM and RDMA into a distributed key value storage system, design two logs, namely a tag log and a global log, and improve the performance of a WAL log based on advantages of byte addressing and sequential writing of PMEM, so that a newly designed system has both low cost and low latency performance.
Before describing the data storage method and the data storage device provided by the embodiments of the present application, the following description is made.
First, in the embodiments shown below, terms and english abbreviations such as application data, metadata, tag logs, global logs, etc. are exemplary examples given for convenience of description, and should not limit the present application in any way. This application is not intended to exclude the possibility that other terms may be defined in existing or future protocols to carry out the same or similar functions.
Second, the first, second and various numerical numbers in the embodiments shown below are merely for convenience of description and are not intended to limit the scope of the embodiments of the present application.
Third, "at least one" means one or more, "a plurality" means two or more. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone, wherein A and B can be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "at least one of the following" or similar expressions refer to any combination of these items, including any combination of the singular or plural items. For example, at least one (one) of a, b, and c, may represent: a, or b, or c, or a and b, or a and c, or b and c, or a, b and c, wherein a, b and c can be single or multiple.
Fig. 2 is a schematic diagram of a data storage system 200 according to an embodiment of the present application. The system 200 includes a client node 201, a logging node 202, and at least one storage node 203.
It should be understood that fig. 2 illustrates one log node and two storage nodes as an example, and the number of the log nodes and the storage nodes is not limited in the embodiment of the present application.
The log node 202 and the two storage nodes 203 may be deployed in the same server or in different servers, which is not limited in this embodiment of the application.
Illustratively, the client node 201 may be a terminal device used by a user, such as a mobile phone, a notebook computer, or a tablet computer, which is not limited in this embodiment.
Fig. 3 is a schematic flow chart of a data storage method 300 according to an embodiment of the present application. The method 300 may be applied to the data storage system 200 shown in fig. 2, but the embodiment of the present application is not limited thereto. Wherein at least one storage node comprises a persistent memory having application data stored therein, the method 300 comprising:
s301, a client node sends a first transaction processing request to a log node, wherein the transaction processing request carries metadata and data to be processed of a target transaction, the log of the log node comprises at least one label log and a global log, the at least one label log is in one-to-one correspondence with at least one storage node, the label log comprises the metadata and processed data of the storage node corresponding to the label log, and the global log comprises an identifier of the label log and a log serial number. Accordingly, the log node receives the first transaction request.
The log sequence number is the number of each record in the tag log and is the unique identification of that record; in the ordering of LSNs, if LSN2 is greater than LSN1, the change described by the log record identified by LSN2 occurred after the change described by the log record identified by LSN1. The LSNs in the global log and in each tag log are synchronized, recording the order in which global changes occur.
In the embodiment of the application, each modification of data is recorded in the tag log and is not recorded again in the global log; the global log records only the transaction's metadata and LSN by reference to the tag log, which reduces write operations.
Many types of operations can be recorded in the tag log, such as the beginning and end of each transaction, each modification of data (insert, update, or delete), and each allocated or freed extent and page.
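The tag-log/global-log arrangement described above can be sketched as follows; this is a simplified illustrative model, and every class and field name is an assumption rather than the patent's actual implementation. The point is that the data payload is written once, into the per-node tag log, while the global log stores only an (LSN, tag-log identifier) reference.

```python
from dataclasses import dataclass, field

@dataclass
class TagLog:
    # One tag log per storage node; holds the actual data records.
    node_id: str
    records: list = field(default_factory=list)  # (lsn, metadata, data)

@dataclass
class GlobalLog:
    # Records only the tag-log identifier and the LSN, never the data
    # itself, so each modification is written only once.
    entries: list = field(default_factory=list)  # (lsn, tag_log_id)
    lsn: int = 0

class LogNode:
    def __init__(self, node_ids):
        self.tag_logs = {n: TagLog(n) for n in node_ids}
        self.global_log = GlobalLog()

    def write(self, node_id, metadata, data):
        self.global_log.lsn += 1
        lsn = self.global_log.lsn
        # The payload goes into the target node's tag log...
        self.tag_logs[node_id].records.append((lsn, metadata, data))
        # ...while the global log gets only a reference.
        self.global_log.entries.append((lsn, node_id))
        return lsn

ln = LogNode(["node1", "node2"])
ln.write("node1", {"txn": "t1"}, b"payload-1")
ln.write("node2", {"txn": "t2"}, b"payload-2")
```

Because the LSN counter is shared, the global log's entries give the total order of changes across all storage nodes, while each storage node only needs to scan its own tag log.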
S302, the journal node modifies the application data of the persistent memory through RDMA based on the metadata of the target transaction and the data to be processed.
For example, the application data in the embodiment of the present application may include data to be processed and processed data.
S303, the log node determines a target storage node from the at least one storage node based on the first transaction processing request.
S304, the log node writes the metadata and the data to be processed of the target transaction into a first tag log corresponding to the target storage node, and updates the first tag log.
S305, the log node writes the identifier of the first tag log into the global log and updates the log sequence number of the global log.
In this step, writing the identifier of the first tag log into the global log enables a storage node to find its corresponding tag log through that identifier.
S306, the target storage node determines the first tag log corresponding to the target storage node from the at least one tag log.
S307, the target storage node reads the metadata and the data to be processed of the target transaction in the first tag log through RDMA.
S308, the target storage node modifies the application data of the persistent memory based on the metadata and the data to be processed of the target transaction.
S309, the target storage node updates the first label log and the global log based on the modification of the application data.
Mixing RDMA and PMEM in this way provides a high-performance, highly available distributed key-value storage system; in particular, combining RDMA and PMEM at the critical log node balances cost against latency.
The logs of the log node may include at least one tag log and a global log; each storage node corresponds to one tag log, from which it can read data and perform the corresponding operations on the data to be processed, such as deleting or adding data. After processing is finished, the target storage node updates the stored first tag log and the global log.
The method of the embodiment of the application does not require all storage nodes in the system to share one long log, which helps avoid the performance bottleneck caused by multiple storage nodes accessing the same log simultaneously. In addition, since the global log includes a record of each entry of the at least one tag log, the logical order of operations performed by the entire system can be queried in the global log, which helps achieve global system consistency and traceability.
As an alternative embodiment, the target storage node receives a second transaction request, where the second transaction request is used to request modification or query of the data to be processed stored in the target storage node. Based on the second transaction request, the target storage node determines, through the preconfigured mapping between storage nodes and tag logs together with the global log, to read the data to be processed from the first tag log, and modifies or queries that data.
In this embodiment, if a user wishes to query or modify previous data, the target storage node can find its corresponding tag log based on the tag-log identifier recorded in the global log, thereby enabling backtracking of global system transactions using the transaction metadata and data to be processed recorded in the corresponding tag log.
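Steps S306 to S309, together with the backtracking lookup just described, can be sketched as follows (a hypothetical model; the dictionaries and function below are illustrative assumptions, not the patent's code): the storage node scans the global log for entries bearing its own tag-log identifier, fetches the matching records from its tag log, applies them, and records its progress as an LSN.

```python
# Tag logs: one per storage node, holding (lsn, metadata, operation).
tag_logs = {
    "node1": [(1, {"txn": "t1"}, ("put", "k1", "v1"))],
    "node2": [(2, {"txn": "t2"}, ("put", "k2", "v2"))],
}
# Global log: only (lsn, tag-log identifier) references, in LSN order.
global_log = [(1, "node1"), (2, "node2")]

def apply_for(node_id, applied_lsn, app_data):
    """Apply all records for node_id with LSN > applied_lsn, in order."""
    for lsn, tag in global_log:              # S306: locate own tag log
        if tag != node_id or lsn <= applied_lsn:
            continue
        for rec_lsn, meta, (op, key, value) in tag_logs[tag]:
            if rec_lsn != lsn:               # S307: read the matching record
                continue
            if op == "put":                  # S308: modify application data
                app_data[key] = value
            applied_lsn = rec_lsn            # S309: record progress
    return applied_lsn

data = {}
new_lsn = apply_for("node1", 0, data)  # node1 applies only its own record
```

The same scan over the global log supports backtracking: any past change can be located by its LSN and tag-log identifier, then re-read from the corresponding tag log.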
The data storage method provided by the embodiment of the present application is described below with reference to fig. 4 by taking a transaction commit process of the FDB database as an example. It should be understood that the data storage method according to the embodiment of the present application may be applied to a distributed system with a WAL mechanism, and the embodiment of the present application is not limited thereto.
Fig. 4 is a schematic diagram of the architecture of an FDB database according to an embodiment of the present disclosure. As shown in fig. 4, the FDB database includes a control plane, responsible for managing the metadata of the cluster, and a data plane, which hosts data and interacts with clients to implement the transaction mechanism. In particular, the data plane may include a transaction system (TS), a log system (LS), and a storage system (SS). The TS is responsible for implementing distributed transactions at the serializable snapshot isolation level, and the LS is responsible for replicating logs, ensuring high availability of the system. The SS stores the actual data (the state machine), obtaining logs from the LS and applying them.
Further, the TS may include a proxy, a sequencer, and a resolver. The LS may include at least one log node, and the SS may include at least one storage node. Taking storage node 1, storage node 2, and storage node 3 in FIG. 4 as an example, each storage node includes PMEM. The number of storage nodes is not limited in the embodiment of the present application.
It should be understood that the number of the agents, the log nodes, and the parsers may also be multiple, and the embodiment of the present application does not limit this.
The basic flow of a transaction generally includes the following steps:
step 1: the client acquires the read version of the transaction from the sequencer through proxy and stores the read version in the SS.
Step 2: the SS reads the data snapshot according to the read version.
Step 3: write requests are cached locally at the client prior to submission.
Step 4: the client sends the read set and the write set to the proxy.
Step 5: the proxy obtains the commit version from the sequencer.
Step 6: the proxy sends the read set and the write set to the resolver for transaction conflict detection.
Step 7: if the transaction conflicts, the modification based on the read version is declared invalid. Otherwise, the proxy sends the write set to the LS for persistence, completes the commit of the transaction, and replies to the client that the transaction has been committed.
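The conflict detection in steps 6 and 7 is a form of optimistic concurrency control: a transaction may commit only if no key in its read set was written between its read version and its commit version. A hedged sketch follows; the function and data below are illustrative assumptions, not FDB's actual resolver code.

```python
def resolve(read_version, commit_version, read_set, recent_writes):
    """Return True if the transaction may commit.

    recent_writes maps key -> version at which that key was last
    committed. A conflict exists if a key read at read_version was
    overwritten by a transaction committing after read_version.
    """
    for key in read_set:
        last_write = recent_writes.get(key, 0)
        if read_version < last_write <= commit_version:
            return False  # step 7: read-version-based modification invalid
    return True

writes = {"x": 5, "y": 12}
# "x" was last written at version 5, before our read version 10: no conflict.
ok = resolve(read_version=10, commit_version=15,
             read_set={"x"}, recent_writes=writes)
# "y" was written at version 12, after our read version 10: conflict.
bad = resolve(read_version=10, commit_version=15,
              read_set={"y"}, recent_writes=writes)
```

Only transactions that pass this check have their write sets forwarded to the LS for persistence; the rest are aborted without touching the log.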
Currently, during persistence, the storage node periodically reads data from the log node and writes committed data to disk. In this way, data is written once at the log node, and then the storage node reads it from the log node and writes it to disk.
In the LS of FDB, the WAL is implemented using an on-disk ring buffer; meanwhile, to let storage nodes access committed data efficiently, the transaction log also replicates data in volatile memory. The problem is that the size of the log itself is limited, and the size of the log in volatile memory affects the commit speed of transactions, or how far ahead transactions can be processed. When a storage node of the system fails, the transaction log must buffer a larger amount of data until that storage node recovers and processing of the related log begins.
In a specific processing procedure, at the moment of spilling, if the data of a transaction is copied into SQLite (spill-by-value), system performance drops sharply. The spill-by-reference mode instead reserves a larger disk queue to hold the spilled data; the value inserted into SQLite is an index pointing into the disk queue, and the key is the identifier of the storage node at each spill together with the maximum version in the corresponding value batch, so that the number of B-tree writes is reduced from O(mutations) to O(tags).
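Spill-by-reference can be sketched as follows (an illustrative model built on assumptions, not FDB's code): the spilled payload goes into an append-only disk queue, and only a small (offset, length) reference is inserted into the index, keyed by storage tag and version, so the index write cost is independent of the payload size.

```python
# Append-only "disk queue" holding the spilled data itself.
disk_queue = bytearray()
# The "B-tree" index (a dict here): key = (storage tag, max version),
# value = (offset, length) reference into the disk queue.
index = {}

def spill(tag, version, payload):
    """Spill-by-reference: store the payload once, index a reference."""
    ref = (len(disk_queue), len(payload))
    disk_queue.extend(payload)       # bulk data goes to the disk queue
    index[(tag, version)] = ref      # one small index write per tag
    return ref

def read_back(tag, version):
    """Follow the reference to recover the spilled payload."""
    offset, length = index[(tag, version)]
    return bytes(disk_queue[offset:offset + length])

spill("node1", 42, b"batch-of-mutations")
```

Copying only a fixed-size reference into the index is what avoids the heavy spill-by-value copies described above.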
Meanwhile, to keep the log highly available and ensure the whole system has no single point of failure, the FDB log system writes multiple copies. Multi-copy writing, however, suffers from tail latency: the overall write latency is determined by the slowest of the log nodes being written. For example, if one log node is under intensive I/O use, its access latency and throughput cannot be guaranteed, which in turn affects all other servers using that log node. Under high load, multiple copies can also leave the system's processing speed unable to keep up with the rate of transaction log generation.
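The tail-latency effect above is simple to state numerically: with synchronous multi-copy writes, one append completes only when the slowest replica acknowledges, so a single I/O-saturated log node dominates the commit latency. A toy illustration (names assumed):

```python
# With synchronous replication, the effective write latency is the maximum
# of the per-replica latencies, not the average.
def replicated_write_latency(replica_latencies_ms):
    return max(replica_latencies_ms)

# Two fast replicas cannot compensate for one slow one:
# replicated_write_latency([1.2, 0.9, 35.0]) -> 35.0
```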
Therefore, RDMA and PMEM are used together: PMEM can be deployed in the storage node, and, exploiting the high network speed provided by RDMA and the large capacity of PMEM, the metadata and the data to be processed can be stored directly in PMEM when the log is written. This alleviates the performance degradation and write amplification caused by the WAL mechanism and improves the response speed and performance of the existing distributed key-value storage system.
The improved distributed key-value storage system provided by the embodiments of this application designs at least one tag log and a global log. The global log carries the identifier of each tag log and can locate each piece of data to be processed by reference, so the data does not need to be written again into the global log; each storage node can read and operate on data from its corresponding tag log, which helps reduce read and write latency.
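The tag-log/global-log write path described above can be sketched as follows. This is a minimal illustrative model under assumed names (`LogNode`, `append`, in-memory lists standing in for PMEM logs), not the patented implementation: data is appended once to the target node's tag log, and the global log records only a reference (tag-log identifier and position) plus the advanced log sequence number.

```python
# Hypothetical sketch: one tag log per storage node, plus a global log that
# holds references (not the data) and the log sequence number (LSN).
class LogNode:
    def __init__(self, storage_node_ids):
        self.tag_logs = {sid: [] for sid in storage_node_ids}
        self.global_log = []  # entries: (lsn, tag_log_id, position in tag log)
        self.lsn = 0

    def append(self, target_node, metadata, pending_data):
        tag_log = self.tag_logs[target_node]
        position = len(tag_log)
        tag_log.append((metadata, pending_data))  # data written once, here
        self.lsn += 1
        self.global_log.append((self.lsn, target_node, position))  # by reference
        return self.lsn

    def read(self, lsn):
        # A storage node resolves the reference and reads from its tag log.
        _, tag_log_id, position = self.global_log[lsn - 1]
        return self.tag_logs[tag_log_id][position]
```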
The throughput data of the PMem 200 series is shown in Table 1. The high sequential read/write throughput and large capacity of PMEM help resolve the system performance bottlenecks caused by the log-size limit and multiple copies.
Table 1

(The table is provided as images in the original publication: BDA0003397362250000111, BDA0003397362250000121.)
As an optional embodiment, the storage node further includes a volatile memory. That the log node modifies the application data of the persistent memory through remote direct memory access (RDMA) based on the metadata and the data to be processed of the target transaction includes: when the data to be processed includes a large burst of data, the log node writes the metadata of the target transaction and the data to be processed to the volatile memory.
In the embodiments of this application, considering extreme data-write scenarios, since PMEM writes are slower than volatile memory writes, the metadata of the target transaction and the data to be processed can be written into volatile memory to improve write speed. Likewise, the storage node can retrieve and process the data from the tag log written in volatile memory.
Optionally, since volatile memory may lose data on a system failure or after a power-off restart, in an extreme write scenario the log node may instead write the metadata of the target transaction and the data to be processed into PMEM, so as to ensure data consistency and safety.
As an optional embodiment, when the remaining storage capacity of the volatile memory is smaller than a preset threshold, the log node moves the data stored in the volatile memory during the first time period to the persistent memory.
In the embodiments of this application, consider a case where an extreme volume of log data is retained and performance requirements are very high: for example, the volatile memory's total capacity is 64 GB, 60 GB is occupied, the remaining available capacity is 4 GB, and the preset threshold is 10 GB. The log node may then write the data stored in volatile memory during the first time period into PMEM and store newly written data in volatile memory. The data stored during the first time period is, for example, the data written into volatile memory earliest, so part of the volatile memory's storage space can be freed, leaving more capacity for new writes. Because PMEM's capacity is large, the log node can keep moving data into PMEM in extreme scenarios, which reduces the system's write latency and meets its capacity requirements.
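The overflow rule above can be sketched with the example numbers (64 GB total, 10 GB threshold). This is an illustrative sketch under assumed names (`TieredLogBuffer`, `write`); the eviction policy (oldest data first, i.e. the "first time period") is taken from the description.

```python
from collections import deque

# Hypothetical sketch: when remaining volatile-memory capacity falls below
# the preset threshold, the oldest buffered entries are moved to PMEM.
class TieredLogBuffer:
    def __init__(self, dram_capacity_gb, threshold_gb):
        self.dram = deque()   # (entry, size_gb), oldest first
        self.pmem = []        # large-capacity persistent tier
        self.capacity = dram_capacity_gb
        self.used = 0.0
        self.threshold = threshold_gb

    def write(self, entry, size_gb):
        # Evict oldest entries to PMEM while remaining capacity < threshold.
        while self.capacity - self.used < self.threshold and self.dram:
            old_entry, old_size = self.dram.popleft()
            self.pmem.append(old_entry)
            self.used -= old_size
        self.dram.append((entry, size_gb))
        self.used += size_gb
```

With a 64 GB capacity and a 10 GB threshold, writes proceed in DRAM until the free space dips below 10 GB, at which point the earliest-written batch is moved to PMEM before the new batch is accepted.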
Because the log system is relatively small in scale and demands high performance, accessing PMEM over RDMA combined with a distributed consensus protocol can speed up the completion of highly available multi-copy writes and avoid the impact of tail latency. This also helps reduce the log system's CPU footprint on the server, keeping the log system logically separate from the storage system while allowing it to be physically co-deployed with the cluster to reduce overall cost.
It should be understood that the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
The data storage method according to the embodiment of the present application is described in detail above with reference to fig. 1 to 4, and the data storage device according to the embodiment of the present application is described in detail below with reference to fig. 5 to 7.
Fig. 5 shows a schematic block diagram of a data storage device 500 provided in an embodiment of the present application, where the device 500 includes a receiving module 510 and a processing module 520.
Wherein, the receiving module 510 is configured to: receiving a first transaction processing request from a client node, wherein the first transaction processing request carries metadata and data to be processed of a target transaction, a log of the device comprises at least one tag log and a global log, the at least one tag log is in one-to-one correspondence with at least one storage node, the tag log comprises the metadata and processed data of the storage node corresponding to the tag log, and the global log comprises an identifier of the tag log and a log sequence number. The processing module 520 is configured to: modifying application data of the persistent memory through RDMA based on the metadata and the data to be processed of the target transaction; writing the metadata and the data to be processed of the target transaction into a first tag log corresponding to the target storage node, and updating the first tag log; and writing the identifier of the first label log into a global log, and updating a log sequence number of the global log.
Optionally, the at least one storage node further comprises a volatile memory. The processing module 520 is configured to: and in the case that the data to be processed comprises a large amount of burst data, writing the metadata of the target transaction and the data to be processed into the volatile memory.
Optionally, the processing module 520 is configured to: and under the condition that the residual storage capacity of the volatile memory is smaller than a preset threshold value, moving the data stored in the volatile memory in the first time period to the persistent memory.
Optionally, the at least one tag log and the global log are stored in the persistent memory.
In an alternative example, it can be understood by those skilled in the art that the apparatus 500 may be embodied as a log node in the above-described embodiment, or the functions of the log node in the above-described embodiment may be integrated in the apparatus 500. The above functions may be implemented by hardware, or may be implemented by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above. The apparatus 500 may be configured to perform various processes and/or steps corresponding to the log node in the above method embodiments.
Fig. 6 shows a schematic block diagram of another data storage device 600 provided in the embodiment of the present application, where the device 600 includes a determination module 610 and a processing module 620.
Wherein the determining module 610 is configured to: determining a corresponding first label log from at least one label log, wherein the log of the log node comprises the at least one label log and a global log, the first label log corresponds to the device, the label log comprises metadata and processed data of the device corresponding to the label log, and the global log comprises an identification of the label log and a log sequence number. The processing module 620 is configured to: reading metadata and to-be-processed data of a target transaction in the first tag log through RDMA (remote direct memory access), and modifying application data of a persistent memory based on the metadata and to-be-processed data of the target transaction; and updating the first tag log and the global log based on the modification to the application data.
Optionally, the at least one tag log and the global log are stored in the persistent memory.
Optionally, the apparatus 600 further includes a receiving module 630, configured to receive a second transaction request, where the second transaction request is used to request modification or query of the stored pending data. The processing module 620 is configured to: and based on the second transaction request, determining to read the data to be processed from the first tag log through a mapping relation between a pre-configured storage node and the tag log and the global log, and modifying or inquiring the data to be processed.
In an alternative example, those skilled in the art can understand that the apparatus 600 may be embodied as the target storage node in the above embodiment, or the functions of the target storage node in the above embodiment may be integrated in the apparatus 600. The above functions may be implemented by hardware, or may be implemented by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above. The apparatus 600 may be configured to perform various processes and/or steps corresponding to the target storage node in the above method embodiments.
It should be appreciated that the apparatus 500 and the apparatus 600 herein are embodied in the form of functional modules. The term module herein may refer to an Application Specific Integrated Circuit (ASIC), an electronic circuit, a processor (e.g., a shared, dedicated, or group processor) and memory that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that support the described functionality. In embodiments of the present application, apparatus 500 and/or apparatus 600 may also be a chip or a system of chips, such as: system on chip (SoC).
Fig. 7 is a schematic block diagram of another data storage device 700 provided in an embodiment of the present application. The apparatus 700 includes a processor 710, a transceiver 720, and a memory 730. The processor 710, the transceiver 720 and the memory 730 are in communication with each other through an internal connection path, the memory 730 is used for storing instructions, and the processor 710 is used for executing the instructions stored in the memory 730 to control the transceiver 720 to transmit and/or receive signals.
It should be understood that the apparatus 700 may be embodied as the log node or the target storage node in the foregoing embodiments, or the functions of the log node or the target storage node in the foregoing embodiments may be integrated in the apparatus 700, and the apparatus 700 may be configured to perform each step and/or flow corresponding to the log node or the target storage node in the foregoing method embodiments. Optionally, the memory 730 may include both read-only memory and random access memory, and provides instructions and data to the processor. A portion of the memory may also include non-volatile random access memory; for example, the memory may also store device type information. The processor 710 may be configured to execute the instructions stored in the memory, and when the processor executes the instructions, it may perform the steps and/or processes corresponding to the log node or the target storage node in the above method embodiments.
It should be understood that, in the embodiment of the present application, the processor 710 may be a Central Processing Unit (CPU), and the processor may also be other general processors, Digital Signal Processors (DSP), Application Specific Integrated Circuits (ASIC), Field Programmable Gate Arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, and so on. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor. The software module may be located in ram, flash memory, rom, prom, or eprom, registers, etc. storage media as is well known in the art. The storage medium is located in a memory, and a processor executes instructions in the memory, in combination with hardware thereof, to perform the steps of the above-described method. To avoid repetition, it is not described in detail here.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (17)

1. A data storage method applied to a data storage system including a client node, a log node, and at least one storage node, wherein the at least one storage node includes a persistent memory in which application data is stored, the method comprising:
the method comprises the steps that a log node receives a first transaction processing request from a client node, the first transaction processing request carries metadata and data to be processed of a target transaction, the log of the log node comprises at least one label log and a global log, the at least one label log is in one-to-one correspondence with at least one storage node, the label log comprises the metadata and processed data of the storage node corresponding to the label log, and the global log comprises an identifier of the label log and a log serial number;
the journal node modifies the application data of the persistent memory through Remote Direct Memory Access (RDMA) based on the metadata of the target transaction and the data to be processed;
the log node determining a target storage node from the at least one storage node based on the first transaction request;
the log node writes the metadata of the target transaction and the data to be processed into a first tag log corresponding to the target storage node, and updates the first tag log;
and the log node writes the identifier of the first label log into the global log and updates the log serial number of the global log.
2. The method of claim 1, wherein the at least one storage node further comprises volatile memory;
the journal node modifies the application data of the persistent memory through Remote Direct Memory Access (RDMA) based on the metadata of the target transaction and the data to be processed, and the method comprises the following steps:
and under the condition that the data to be processed comprises a large amount of burst data, the log node writes the metadata of the target transaction and the data to be processed into the volatile memory.
3. The method of claim 2, further comprising:
and under the condition that the residual storage capacity of the volatile memory is smaller than a preset threshold value, the log node moves the data stored in the volatile memory in a first time period to the persistent memory.
4. The method of any of claims 1-3, wherein the at least one tag log and the global log are stored in the persistent memory.
5. A data storage method applied to a data storage system including a client node, a log node, and at least one storage node, wherein the at least one storage node includes a persistent memory in which application data is stored, and a log of the log node includes at least one tag log and a global log, the method comprising:
a target storage node in the at least one storage node determines a first label log corresponding to the target storage node from the at least one label log, wherein the at least one label log is in one-to-one correspondence with the at least one storage node, the label log comprises metadata and processed data of the storage node corresponding to the label log, and the global log comprises an identifier of the label log and a log sequence number;
the target storage node reads metadata and data to be processed of a target transaction in the first tag log through Remote Direct Memory Access (RDMA), and modifies application data of the persistent memory based on the metadata and the data to be processed of the target transaction;
the target storage node updates the first tag log and the global log based on the modification to the application data.
6. The method of claim 5, wherein the at least one tag log and the global log are stored in the persistent memory.
7. The method of claim 5 or 6, further comprising:
the target storage node receives a second transaction processing request, wherein the second transaction processing request is used for requesting to modify or inquire the to-be-processed data stored in the target storage node;
and the target storage node determines to read the data to be processed from the first label log according to the mapping relation between the pre-configured storage node and the label log and the global log based on the second transaction request, and modifies or queries the data to be processed.
8. A data storage device, comprising:
a receiving module, configured to receive a first transaction processing request from a client node, where the first transaction processing request carries metadata of a target transaction and data to be processed, a log of the apparatus includes at least one tag log and a global log, the at least one tag log and the at least one storage node are in one-to-one correspondence, the tag log includes metadata and processed data of the storage node corresponding to the tag log, and the global log includes an identifier of the tag log and a log sequence number;
the processing module is used for modifying the application data of the persistent memory through Remote Direct Memory Access (RDMA) based on the metadata of the target transaction and the data to be processed;
the processing module is further configured to: determining a target storage node from the at least one storage node based on the first transaction request;
the processing module is further configured to: writing the metadata and the data to be processed of the target transaction into a first tag log corresponding to the target storage node, and updating the first tag log;
the processing module is further configured to: writing the identification of the first label log into a global log, and updating a log sequence number of the global log.
9. The apparatus of claim 8, wherein the at least one storage node further comprises a volatile memory;
the processing module is used for: and writing the metadata of the target transaction and the data to be processed into the volatile memory under the condition that the data to be processed comprises a large amount of burst data.
10. The apparatus of claim 9, wherein the processing module is configured to:
and under the condition that the residual storage capacity of the volatile memory is smaller than a preset threshold value, moving the data stored in the volatile memory in the first time period to the persistent memory.
11. The apparatus of any of claims 8-10, wherein the at least one tag log and the global log are stored in the persistent memory.
12. A data storage device, comprising:
a determining module, configured to determine a corresponding first tag log from at least one tag log, where the log of the log node includes the at least one tag log and a global log, the first tag log corresponds to the device, the tag log includes metadata and processed data of the device corresponding to the tag log, and the global log includes an identifier of the tag log and a log sequence number;
the processing module is used for reading metadata and to-be-processed data of a target transaction in the first tag log through Remote Direct Memory Access (RDMA), and modifying application data of a persistent memory based on the metadata and to-be-processed data of the target transaction;
the processing module is used for: updating the first tag log and the global log based on the modification to the application data.
13. The apparatus of claim 12, wherein the at least one tag log and the global log are stored in the persistent memory.
14. The apparatus according to claim 12 or 13, wherein the apparatus further comprises a receiving module configured to:
receiving a second transaction processing request, wherein the second transaction processing request is used for requesting to modify or inquire stored to-be-processed data;
the processing module is further configured to: and determining to read the data to be processed from the first tag log according to the mapping relation between the pre-configured storage node and the tag log and the global log based on the second transaction request, and modifying or inquiring the data to be processed.
15. A data processing apparatus, comprising: a processor coupled with a memory for storing a computer program that, when invoked by the processor, causes the apparatus to perform the method of any of claims 1-4 or causes the apparatus to perform the method of any of claims 5-7.
16. A computer-readable storage medium for storing a computer program comprising instructions for implementing the method of any one of claims 1-4, or instructions for implementing the method of any one of claims 5-7.
17. A computer program product, characterized in that it comprises computer program code which, when executed, implements the method of any of claims 1-4, or implements the method of any of claims 5-7.
CN202111488099.0A 2021-12-07 2021-12-07 Data storage method and data storage device Pending CN114201551A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111488099.0A CN114201551A (en) 2021-12-07 2021-12-07 Data storage method and data storage device


Publications (1)

Publication Number Publication Date
CN114201551A true CN114201551A (en) 2022-03-18

Family

ID=80651218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111488099.0A Pending CN114201551A (en) 2021-12-07 2021-12-07 Data storage method and data storage device

Country Status (1)

Country Link
CN (1) CN114201551A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111801661A (en) * 2018-02-28 2020-10-20 国际商业机器公司 Transaction operations in a multi-host distributed data management system
CN112035410A (en) * 2020-08-18 2020-12-04 腾讯科技(深圳)有限公司 Log storage method and device, node equipment and storage medium
US20210182246A1 (en) * 2019-12-11 2021-06-17 Western Digital Technologies, Inc. Efficient transaction log and database processing
CN113220729A (en) * 2021-05-28 2021-08-06 网易(杭州)网络有限公司 Data storage method and device, electronic equipment and computer readable storage medium
CN113468338A (en) * 2021-06-16 2021-10-01 杨绍顺 Big data analysis method for digital cloud service and big data server


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116744168A (en) * 2022-09-01 2023-09-12 荣耀终端有限公司 Log storage method and related device
CN116744168B (en) * 2022-09-01 2024-05-14 荣耀终端有限公司 Log storage method and related device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination