CN113905054B - RDMA (remote direct memory access) -based Kudu cluster data synchronization method, device and system - Google Patents

Info

Publication number
CN113905054B
Authority
CN
China
Prior art keywords
node
data
log file
rdma
auxiliary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111007589.4A
Other languages
Chinese (zh)
Other versions
CN113905054A (en)
Inventor
胡德鹏
刘兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202111007589.4A priority Critical patent/CN113905054B/en
Publication of CN113905054A publication Critical patent/CN113905054A/en
Application granted granted Critical
Publication of CN113905054B publication Critical patent/CN113905054B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1095 Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0654 Management of faults, events, alarms or notifications using network fault recovery
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides an RDMA-based Kudu cluster data synchronization method, device and system. The method comprises the following steps: setting RDMA interfaces; reading the write-ahead log (WAL) file data of the leader node; when a network card supporting an RDMA protocol is configured, calling the RDMA interface of the WAL file management module to synchronize the WAL file data into the WAL files of the follower nodes; when a preset threshold number of secondary nodes have completed WAL file synchronization, setting the WAL file synchronization of the remaining secondary nodes to be completed by the Raft algorithm; reading the row set data of the leader node; calling the RDMA interface of the row set management module to synchronize the data in the row set to the row sets of the follower nodes; and when the preset threshold number of secondary nodes have completed row set data synchronization, setting the row set data synchronization of the remaining secondary nodes to be completed by the Raft algorithm. The data transmission and synchronization rate between the secondary nodes is thereby improved.

Description

RDMA (remote direct memory access) -based Kudu cluster data synchronization method, device and system
Technical Field
The invention relates to the technical field of cluster data synchronization, in particular to a Kudu cluster data synchronization method, device and system based on RDMA.
Background
Kudu is a distributed storage system based on Raft. Its columnar storage engine offers low-latency writes and high-performance analytics, and it provides the main functional characteristics required by current big data technology. Kudu has a typical master-slave architecture: one Kudu cluster consists of a master node (Master) and several slave nodes (Tablet Servers). The Master is responsible for managing the metadata of the cluster, and the Tablet Servers are responsible for data storage. The concept of a Table in Kudu is the same as in relational databases: a Table is the place where data is stored in Kudu, and it has a schema (table structure) and a globally ordered primary key. A Table is divided horizontally into a number of segments, each of which is called a Tablet. A Tablet is a contiguous segment of a Table, similar to a region in HBase or a partition in a relational database. Each Tablet stores the data (keys) of one contiguous range, and the ranges of different Tablets do not overlap. Together, the Tablets of a Table cover the entire key space of the Table. Each Tablet is stored redundantly on multiple Tablet Servers; at any given point in time, one of the secondary nodes holding the Tablet is considered the leader node (leader Tablet), and the rest are considered follower nodes (follower Tablets). The Tablet on every secondary node can serve read requests, but only the leader node handles write requests.
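The non-overlapping key ranges described above can be illustrated with a short sketch. It is not Kudu's implementation; the class `TabletMap`, its `split_keys` parameter and the half-open range convention are assumptions made only for illustration.

```python
from bisect import bisect_right


class TabletMap:
    """Hypothetical map from primary keys to tablets.

    k sorted split keys define k + 1 contiguous, non-overlapping ranges
    that together cover the whole key space of the table.
    """

    def __init__(self, split_keys):
        if list(split_keys) != sorted(split_keys):
            raise ValueError("split keys must be sorted")
        self.split_keys = list(split_keys)

    def tablet_for(self, key):
        # Tablet i covers the half-open range [split_keys[i-1], split_keys[i]).
        return bisect_right(self.split_keys, key)


tablets = TabletMap(["g", "n", "t"])
print(tablets.tablet_for("apple"))   # first range, below "g"
print(tablets.tablet_for("n"))       # a split key starts the next range
```

Because every key falls into exactly one range, a client can route a request to a single Tablet without consulting the others.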
In conventional TCP/IP networking, each data packet has to pass through the operating system and other software layers while it is processed, which occupies a large amount of server resources and memory bus bandwidth. The data is copied back and forth among system memory, the processor cache and the network controller cache, which places a heavy burden on the CPU and memory of the server. In particular, the serious mismatch among network bandwidth, processor speed and memory bandwidth aggravates network latency. RDMA is a direct memory access technology that allows a computer to directly access the memory of another computer without involving the remote processor.
At present, the data secondary nodes in a Kudu cluster communicate over a TCP/IP network, and data has to be copied from user space to kernel space during inter-node transmission, which limits the efficiency of data transmission and synchronization between nodes in the cluster.
Disclosure of Invention
To address the problem that the data secondary nodes in current Kudu clusters communicate over a TCP/IP network, so that data has to be copied from user space to kernel space during inter-node transmission and the efficiency of data transmission and synchronization between nodes in the cluster is limited, the invention provides an RDMA-based Kudu cluster data synchronization method, device and system.
The technical scheme of the invention is as follows:
In one aspect, the technical scheme of the invention provides an RDMA-based Kudu cluster data synchronization method, wherein the Kudu cluster comprises a master node and a plurality of secondary nodes, one of the secondary nodes is selected as the leader node, and the other secondary nodes are follower nodes; the method comprises the following steps:
setting RDMA interfaces in the write-ahead log (WAL) file management module and the row set management module of the secondary nodes, respectively;
reading the WAL file data of the leader node;
judging whether the cluster secondary node system is configured with a network card supporting an RDMA protocol;
if yes, calling the RDMA interface of the WAL file management module to synchronize the WAL file data into the WAL files of the follower nodes;
when a preset threshold number of secondary nodes have completed WAL file synchronization, setting the WAL file synchronization of the remaining secondary nodes to be completed by the Raft algorithm;
reading the row set data of the leader node;
calling the RDMA interface of the row set management module to synchronize the data in the row set to the row sets of the follower nodes;
when the preset threshold number of secondary nodes have completed row set data synchronization, setting the row set data synchronization of the remaining secondary nodes to be completed by the Raft algorithm;
if not, controlling the use of the TCP/IP network to complete the synchronization of the WAL files and data between the secondary nodes.
Further, the step of calling the RDMA interface of the WAL file management module to synchronize the WAL file data into the WAL files of the follower nodes includes:
calling the RDMA interface of the WAL file management module of the leader node to acquire the WAL file data, and transmitting the data to the follower node through communication between the network card of the leader node and the network card of the follower node;
calling the RDMA interface of the WAL file management module of the follower node to write the received data into the WAL file of the follower node.
Further, the step of calling the RDMA interface of the row set management module to synchronize the row set data to the row sets of the follower nodes includes:
calling the RDMA interface of the row set management module of the leader node to acquire the data in the row set, and transmitting the data to the follower node through communication between the network card of the leader node and the network card of the follower node;
calling the RDMA interface of the row set management module of the follower node to write the received data into the row set of the follower node.
Further, the step of calling the RDMA interface of the row set management module of the leader node to acquire the data in the row set and transmitting the data to the follower node through communication between the network card of the leader node and the network card of the follower node specifically comprises:
calling the RDMA interface of the row set management module of the leader node to acquire the newly added data in the row set, and transmitting that data to the follower node through communication between the network card of the leader node and the network card of the follower node.
Further, the method further comprises:
each secondary node periodically sends a heartbeat to the master node;
when the master node has not received the heartbeat of a secondary node within a set time interval, judging whether that secondary node is the leader node;
if yes, reselecting a secondary node as the leader node to take over the function of the original leader node;
if not, sending node fault information to the leader node, so that the leader node skips the faulty node when synchronizing data and files.
Further, the method further comprises:
after the faulty secondary node has been repaired, adding the repaired node to the cluster as a follower node.
Further, the step of setting the WAL file synchronization of the remaining secondary nodes to be completed by the Raft algorithm when the preset threshold number of secondary nodes have completed WAL file synchronization includes:
when more than half of the total number of secondary nodes have completed WAL file synchronization, the leader node sends write-success information to the master node;
after receiving the write-success information, the master node notifies the Kudu cluster that the WAL files of the remaining secondary nodes are to be synchronized among the nodes by the Raft algorithm.
Further, before the step of reading the WAL file data of the leader node, the method includes:
the leader node receives a write operation;
calling the RDMA interface of the row set management module of the leader node to write the updated data into the row set;
calling the RDMA interface of the WAL management module of the leader node to write the WAL record of the data change into the WAL file.
In another aspect, the technical scheme of the invention provides an RDMA-based Kudu cluster data synchronization device, which comprises an RDMA interface setting module, a preprocessing module, a system configuration information judging module, a control module and a residual synchronization processing module;
the RDMA interface setting module is used for setting RDMA interfaces in the write-ahead log (WAL) file management module and the row set management module of the secondary nodes, respectively;
the preprocessing module is used for reading the WAL file data of the leader node, and is also used for reading the row set data of the leader node;
the system configuration information judging module is used for judging whether the cluster secondary node system is configured with a network card supporting an RDMA protocol;
the control module is used for calling the RDMA interface of the WAL file management module to synchronize the WAL file data into the WAL files of the follower nodes; it is also used for calling the RDMA interface of the row set management module to synchronize the data in the row set to the row sets of the follower nodes; and it is also used for controlling the use of the TCP/IP network to complete the synchronization of the WAL files and data between the secondary nodes when the system is not configured with a network card supporting the RDMA protocol;
the residual synchronization processing module is used for setting the WAL file synchronization of the remaining secondary nodes to be completed by the Raft algorithm when a set threshold number of secondary nodes have completed WAL file synchronization; and it is also used for setting the row set data synchronization of the remaining secondary nodes to be completed by the Raft algorithm when the set threshold number of secondary nodes have completed row set data synchronization.
Further, the control module is specifically configured to: call the RDMA interface of the WAL file management module of the leader node to acquire the WAL file data, and transmit the data to the follower node through communication between the network card of the leader node and the network card of the follower node; call the RDMA interface of the WAL file management module of the follower node to write the received data into the WAL file of the follower node; call the RDMA interface of the row set management module of the leader node to acquire the newly added data in the row set, and transmit that data to the follower node through communication between the network card of the leader node and the network card of the follower node; and call the RDMA interface of the row set management module of the follower node to write the received data into the row set of the follower node.
Further, the device also comprises a monitoring module, a node judging module and a node processing module;
the monitoring module is used for monitoring the states of the master node and the secondary nodes, specifically the process by which the secondary nodes send heartbeats to the master node and the process by which the master node receives them;
the node judging module is used for judging whether a secondary node is the leader node when the monitoring module detects that the master node has not received the heartbeat of that secondary node within the set time interval;
the node processing module is used for reselecting a secondary node as the leader node to take over the function of the original leader node when the node judging module judges that the leader node has failed; and for sending node fault information to the leader node when the node judging module judges that a follower node has failed, so that the leader node skips the faulty node when synchronizing data and files. The monitoring module is also used for adding the repaired node to the cluster as a follower node after detecting that the faulty secondary node has been repaired.
Further, the device also comprises a request receiving module;
the request receiving module is used for receiving a write operation and sending information to the control module;
the control module is also used for, after receiving the information sent by the request receiving module, calling the RDMA interface of the row set management module of the leader node to write the updated data into the row set, and calling the RDMA interface of the WAL management module of the leader node to write the WAL record of the data change into the WAL file.
The device also comprises a write state sending module and a write state information receiving module;
the write state sending module is used for sending write-success information when more than half of the total number of secondary nodes have completed WAL file synchronization;
the write state information receiving module is used for sending Raft synchronization information to the residual synchronization processing module after receiving the write-success information;
the residual synchronization processing module is used for setting the WAL file synchronization and the row set data synchronization of the remaining secondary nodes to be completed by the Raft algorithm after receiving the Raft synchronization information.
In a third aspect, the present invention provides an RDMA-based Kudu cluster data synchronization system, including a master node and a plurality of secondary nodes in communication with the master node; the master node is used for managing cluster metadata; the secondary nodes are responsible for data storage; the secondary nodes comprise a leader node and follower nodes;
each secondary node is provided with a preprocessing module, a system configuration information judging module and a control module; the system also comprises an RDMA interface setting module and a residual synchronization processing module;
the RDMA interface setting module is used for setting RDMA interfaces in the write-ahead log (WAL) file management module and the row set management module of the secondary nodes, respectively;
the preprocessing module is used for reading the WAL file data of the leader node, and is also used for reading the row set data of the leader node;
the system configuration information judging module is used for judging whether the cluster secondary node system is configured with a network card supporting an RDMA protocol;
the control module is used for calling the RDMA interface of the WAL file management module to synchronize the WAL file data into the WAL files of the follower nodes; it is also used for calling the RDMA interface of the row set management module to synchronize the data in the row set to the row sets of the follower nodes; and it is also used for controlling the use of the TCP/IP network to complete the synchronization of the WAL files and data between the secondary nodes when the system is not configured with a network card supporting the RDMA protocol;
the residual synchronization processing module is used for setting the WAL file synchronization of the remaining secondary nodes to be completed by the Raft algorithm when a set threshold number of secondary nodes have completed WAL file synchronization; and it is also used for setting the row set data synchronization of the remaining secondary nodes to be completed by the Raft algorithm when the set threshold number of secondary nodes have completed row set data synchronization.
Further, the control module is specifically configured to: call the RDMA interface of the WAL file management module of the leader node to acquire the WAL file data, and transmit the data to the follower node through communication between the network card of the leader node and the network card of the follower node; call the RDMA interface of the WAL file management module of the follower node to write the received data into the WAL file of the follower node; call the RDMA interface of the row set management module of the leader node to acquire the newly added data in the row set, and transmit that data to the follower node through communication between the network card of the leader node and the network card of the follower node; and call the RDMA interface of the row set management module of the follower node to write the received data into the row set of the follower node.
Further, the system also comprises a monitoring module, a node judging module and a node processing module;
the monitoring module is used for monitoring the states of the master node and the secondary nodes, specifically the process by which the secondary nodes send heartbeats to the master node and the process by which the master node receives them;
the node judging module is used for judging whether a secondary node is the leader node when the monitoring module detects that the master node has not received the heartbeat of that secondary node within the set time interval;
the node processing module is used for reselecting a secondary node as the leader node to take over the function of the original leader node when the node judging module judges that the leader node has failed; and for sending node fault information to the leader node when the node judging module judges that a follower node has failed, so that the leader node skips the faulty node when synchronizing data and files. The monitoring module is also used for adding the repaired node to the cluster as a follower node after detecting that the faulty secondary node has been repaired.
Furthermore, the leader node is also provided with a request receiving module;
the request receiving module is used for receiving a write operation and sending information to the control module;
the control module is also used for, after receiving the information sent by the request receiving module, calling the RDMA interface of the row set management module of the leader node to write the updated data into the row set, and calling the RDMA interface of the WAL management module of the leader node to write the WAL record of the data change into the WAL file.
The leader node is also provided with a write state sending module, and the master node is provided with a write state information receiving module;
the write state sending module is used for sending write-success information to the master node when more than half of the total number of secondary nodes have completed WAL file synchronization;
the write state information receiving module is used for sending Raft synchronization information to the residual synchronization processing module after receiving the write-success information;
the residual synchronization processing module is used for setting the WAL file synchronization and the row set data synchronization of the remaining secondary nodes to be completed by the Raft algorithm after receiving the Raft synchronization information.
As can be seen from the above technical scheme, the invention has the following advantages. First, an RDMA interface is added to the write-ahead log (WAL) file management module of each secondary node, and a network card adaptation option is added to the WAL file synchronization process: if a network card supporting an RDMA protocol is configured, the RDMA interface is called for data transmission. Second, in the row set persistence function of Kudu, an RDMA interface is added to the row set management module and is called to synchronize the data of the leader node to the memory row sets of the follower nodes. The data transmission and synchronization rate between the secondary nodes is improved, and with it the performance of Kudu data analysis.
In addition, the invention has reliable design principle, simple structure and very wide application prospect.
It can be seen that the present invention has outstanding substantial features and significant advances over the prior art, as well as its practical advantages.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.
Fig. 1 is a schematic flow chart of a method of one embodiment of the invention.
Fig. 2 is a schematic block diagram of an apparatus of one embodiment of the invention.
Fig. 3 is a schematic block diagram of a system provided by an embodiment of the present invention.
Detailed Description
In order to make the technical solution of the present invention better understood by those skilled in the art, the technical solution of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
Raft is an abbreviation for Replicated And Fault-Tolerant; herein it denotes the Raft consensus algorithm (translated literally as the "capacity-recovering algorithm").
As shown in Fig. 1, an embodiment of the present invention provides an RDMA-based Kudu cluster data synchronization method, where the Kudu cluster includes a master node and a plurality of secondary nodes, one of the secondary nodes is selected as the leader node, and the other secondary nodes are follower nodes; the method comprises the following steps:
step 1: setting RDMA interfaces in the write-ahead log (WAL) file management module and the row set management module of the secondary nodes, respectively;
step 2: reading the WAL file data of the leader node;
step 3: judging whether the cluster secondary node system is configured with a network card supporting an RDMA protocol; if yes, executing step 4, otherwise executing step 9;
step 4: calling the RDMA interface of the WAL file management module to synchronize the WAL file data into the WAL files of the follower nodes;
step 5: when a preset threshold number of secondary nodes have completed WAL file synchronization, setting the WAL file synchronization of the remaining secondary nodes to be completed by the Raft algorithm;
step 6: reading the row set data of the leader node;
step 7: calling the RDMA interface of the row set management module to synchronize the data in the row set to the row sets of the follower nodes;
step 8: when the preset threshold number of secondary nodes have completed row set data synchronization, setting the row set data synchronization of the remaining secondary nodes to be completed by the Raft algorithm;
step 9: controlling the use of the TCP/IP network to complete the synchronization of the WAL files and data between the secondary nodes.
First, an RDMA-based interface is added to the WAL file synchronization function of Kudu, together with a network card adaptation option: if the server is configured with a network card supporting an RDMA protocol, the RDMA interface is used. Second, an RDMA data synchronization interface is added to the row set persistence function of Kudu: after the memory row set data has been written into the disk row set, the RDMA data synchronization interface is called to synchronize the leader's data into the memory row sets of the follower nodes.
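The transport choice in steps 3, 4 and 9 can be sketched as follows. This is an illustrative model only: the `Node` class, the `rdma_capable` flag and the rule that every participating node must have an RDMA-capable network card are assumptions, and the actual transfer is replaced by an in-memory copy rather than real RDMA or TCP calls.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    rdma_capable: bool                      # hypothetical: NIC supports an RDMA protocol
    wal: list = field(default_factory=list)


def choose_transport(nodes):
    """Step 3: pick RDMA only if all participating nodes can use it (assumed rule)."""
    return "rdma" if all(n.rdma_capable for n in nodes) else "tcp"


def sync_wal(leader, followers):
    """Steps 2, 4 and 9: copy the leader's WAL entries to every follower."""
    transport = choose_transport([leader, *followers])
    entries = list(leader.wal)              # step 2: read the leader's WAL data
    for follower in followers:
        # Stand-in for the transfer itself; a real system would post RDMA
        # writes or send over TCP here, depending on `transport`.
        follower.wal = list(entries)
    return transport


leader = Node("leader", True, ["put k1", "put k2"])
followers = [Node("f1", True), Node("f2", True)]
print(sync_wal(leader, followers))
```

Keeping the transport decision in one function means the synchronization logic is identical on both paths, which matches the fallback behaviour of step 9.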
In step 4, the step of calling the RDMA interface of the WAL file management module to synchronize the WAL file data into the WAL files of the follower nodes includes:
calling the RDMA interface of the WAL file management module of the leader node to acquire the WAL file data, and transmitting the data to the follower node through communication between the network card of the leader node and the network card of the follower node;
calling the RDMA interface of the WAL file management module of the follower node to write the received data into the WAL file of the follower node.
In step 7, the step of calling the RDMA interface of the row set management module to synchronize the row set data to the row sets of the follower nodes includes:
calling the RDMA interface of the row set management module of the leader node to acquire the data in the row set, and transmitting the data to the follower node through communication between the network card of the leader node and the network card of the follower node;
calling the RDMA interface of the row set management module of the follower node to write the received data into the row set of the follower node. Here, the RDMA interface of the row set management module of the leader node is called to acquire only the newly added data in the row set, and that data is transmitted to the follower node through communication between the network card of the leader node and the network card of the follower node.
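Transferring only the newly added data can be sketched with a per-follower offset. The classes `RowSetLeader` and `RowSetReplica` and the `synced_upto` bookkeeping are hypothetical, and the RDMA transfer is again replaced by an in-memory list copy.

```python
class RowSetReplica:
    """Hypothetical follower-side row set."""

    def __init__(self):
        self.rows = []


class RowSetLeader:
    """Hypothetical leader-side row set that remembers what each follower has."""

    def __init__(self):
        self.rows = []
        self.synced_upto = {}    # follower id -> number of rows already sent

    def append(self, row):
        self.rows.append(row)

    def sync_new_rows(self, follower_id, replica):
        start = self.synced_upto.get(follower_id, 0)
        new_rows = self.rows[start:]        # only rows added since the last sync
        replica.rows.extend(new_rows)       # stand-in for the RDMA transfer
        self.synced_upto[follower_id] = len(self.rows)
        return len(new_rows)


leader = RowSetLeader()
replica = RowSetReplica()
leader.append({"id": 1})
leader.append({"id": 2})
print(leader.sync_new_rows("f1", replica))   # both rows are new
leader.append({"id": 3})
print(leader.sync_new_rows("f1", replica))   # only the third row is sent
```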
The Raft algorithm periodically collects performance parameters of the cluster nodes, such as network latency, network partitions and packet loss. From the (N+1)/2 nodes in the cluster that have completed replication, one node with balanced performance parameters is selected as the leader node for replicating the remaining (N-1)/2 copies, and the other (N-1)/2 nodes act as follower nodes; the row set management module then calls the RDMA synchronization interface to synchronize the newly added data in the memory row set of that leader node to the memory row sets of the follower nodes.
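As a rough illustration of this selection, the sketch below picks, among the nodes that have finished replication, the one with the lowest combined score. The text does not define what "balanced" means, so the scoring formula, the metric tuple and both function names are assumptions.

```python
def quorum_size(n):
    """(N+1)/2 for an odd number N of secondary nodes."""
    if n % 2 == 0:
        raise ValueError("N is expected to be odd")
    return (n + 1) // 2


def pick_relay_leader(completed, metrics):
    """Pick the node that leads replication to the remaining nodes.

    metrics maps node -> (latency_ms, packet_loss_ratio). The weighting
    below is an assumed scoring rule, not taken from the source.
    """
    def score(node):
        latency_ms, loss = metrics[node]
        return latency_ms + 1000.0 * loss

    # Sorting first makes ties resolve deterministically by node name.
    return min(sorted(completed), key=score)


metrics = {"a": (2.0, 0.0), "b": (1.0, 0.01), "c": (1.5, 0.0)}
print(quorum_size(5))
print(pick_relay_leader(["a", "b", "c"], metrics))
```

With N = 5, three nodes form the completed group and the chosen node replicates to the remaining two.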
In some embodiments, the method further comprises:
step a: each secondary node sends a heartbeat to the master node at regular intervals;
step b: when the master node has not received a secondary node's heartbeat within the set time interval, it judges whether that secondary node is the leader node; if yes, step c is executed, and if not, step d is executed;
step c: a secondary node is re-selected and set as the leader node to replace the original leader node;
step d: node fault information is sent to the leader node, and the leader node skips the faulty node when synchronizing data and files;
step e: after the faulty secondary node is repaired, the repaired node is added to the cluster as a follower node.
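Steps a to e can be sketched as a small state tracker on the master node. The class `Master`, its method names, and the rule of promoting the alphabetically first healthy node are hypothetical; the text only states that a secondary node is re-selected, not how.

```python
class Master:
    """Hypothetical heartbeat tracker for the master node."""

    def __init__(self, nodes, leader, timeout_s):
        self.last_seen = {n: 0.0 for n in nodes}
        self.leader = leader
        self.timeout_s = timeout_s
        self.failed = set()

    def heartbeat(self, node, now):
        # Step a: record the heartbeat. Step e: a repaired node that
        # resumes heartbeats rejoins the cluster (as a follower).
        self.last_seen[node] = now
        self.failed.discard(node)

    def check(self, now):
        # Step b: mark every node whose heartbeat is overdue.
        for node, seen in self.last_seen.items():
            if now - seen > self.timeout_s:
                self.failed.add(node)
        # Step c: if the leader failed, promote a healthy secondary node
        # (assumed rule: first healthy node in sorted order).
        if self.leader in self.failed:
            healthy = sorted(n for n in self.last_seen if n not in self.failed)
            self.leader = healthy[0] if healthy else None
        return self.leader

    def sync_targets(self):
        # Step d: the leader skips faulty nodes when synchronizing.
        return sorted(
            n for n in self.last_seen
            if n != self.leader and n not in self.failed
        )
```

Time is passed in explicitly (`now`) so the behavior can be exercised without waiting for real timeouts.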
It should further be noted that steps 5 and 8 proceed in the same way and differ only in the data transmitted, so only the procedure of step 5 is described below:
when more than half of the secondary nodes have completed WAL file synchronization, the leader node sends write-success information to the master node;
after receiving the write-success information, the master node notifies the Kudu cluster that the WAL files of the remaining secondary nodes are to be synchronized among themselves by the Raft algorithm.
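The majority condition in step 5 amounts to counting acknowledgements until (N+1)/2 is reached. The sketch below is illustrative only: `WalSyncTracker` and `ack` are hypothetical names, and whether the leader's own write counts toward the quorum is left to the caller because the text does not settle it.

```python
class WalSyncTracker:
    """Hypothetical tracker the leader could use to detect the quorum."""

    def __init__(self, total_secondary_nodes: int):
        if total_secondary_nodes % 2 == 0:
            raise ValueError("N is assumed to be odd")
        self.quorum = (total_secondary_nodes + 1) // 2
        self.acked = set()
        self.reported = False

    def ack(self, node: str) -> bool:
        """Record that `node` finished syncing.

        Returns True exactly once, at the moment the quorum is reached,
        which is when write-success would be sent to the master node.
        """
        self.acked.add(node)
        if len(self.acked) >= self.quorum and not self.reported:
            self.reported = True
            return True
        return False
```

Duplicate acknowledgements from the same node are ignored, and later acknowledgements after the quorum do not trigger a second report.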
It should be noted that, before the step of reading the WAL file data of the leader node in step 2, the method includes:
step 021: the leader node receives a write operation;
step 022: the RDMA interface of the leader node's row set management module is called to write the updated data into the row set;
step 023: the RDMA interface of the leader node's WAL management module is called to write the WAL record of the data change into the WAL file.
The steps for inserting a new row into Kudu are as follows:
1. The client connects to the master node to obtain information about the table, including partition information.
2. The client locates the secondary node that maintains the tablet responsible for handling the read-write request. Kudu accepts the client's request and checks whether it meets the requirements (the table structure).
3. The Kudu cluster searches all row sets (in-memory row set MemRowSet, disk row set DiskRowSet) in the tablet for data with the same primary key as the data to be inserted; if such data exists, an error is returned, otherwise the operation continues.
4. The write operation is first committed to the write-ahead log (WAL) of the tablet's leader node, and the agreement of the follower nodes is obtained according to the Raft consistency algorithm.
5. After the leader node finishes writing the WAL data, it calls the RDMA interface to synchronously write the data to the tablets of the follower nodes; once (N+1)/2 tablet nodes have been written (N is the total number of secondary nodes and is odd), the leader node sends a message to the master node confirming that the cluster data was written successfully.
6. The row is then added to the tablet's memory, that is, the insert is added to the tablet's MemRowSet. To support multi-version concurrency control (MVCC) in the MemRowSet, update and delete operations on a recently inserted row (a new row that has not yet been flushed to disk) are appended to the original row in the MemRowSet, forming a list of REDO records.
7. Kudu writes the new row into the MemRowSet; when the MemRowSet reaches its threshold (1 GB or 120 s), it is flushed to disk, generating a DiskRowSet that persists the data, and a request is also generated for a MemRowSet to continue receiving new data.
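The local part of this insert path (duplicate-key check, WAL append, MemRowSet insert, flush to a DiskRowSet) can be sketched with plain dictionaries. This is a simplified illustration, not Kudu's implementation: the `Tablet` class is hypothetical, replication and MVCC are omitted, and a row-count threshold (`flush_rows`) stands in for the size and time thresholds mentioned in step 7.

```python
class Tablet:
    """Hypothetical, heavily simplified tablet for illustrating inserts."""

    def __init__(self, flush_rows: int):
        self.mem_rowset = {}      # in-memory row set: key -> row
        self.disk_rowsets = []    # flushed row sets, oldest first
        self.wal = []             # write-ahead log records
        self.flush_rows = flush_rows  # assumed stand-in for 1 GB / 120 s

    def insert(self, key, row):
        # Step 3: reject the insert if the primary key exists in any row set.
        if key in self.mem_rowset or any(key in d for d in self.disk_rowsets):
            raise KeyError(f"duplicate primary key: {key!r}")
        # Step 4: log the operation before applying it.
        self.wal.append(("INSERT", key, row))
        # Step 6: apply the insert to the in-memory row set.
        self.mem_rowset[key] = row
        # Step 7: flush once the threshold is reached, then start afresh.
        if len(self.mem_rowset) >= self.flush_rows:
            self.disk_rowsets.append(self.mem_rowset)
            self.mem_rowset = {}
```

Because the duplicate check runs before the log append, a rejected insert leaves neither a WAL record nor a row behind.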
In summary, the main steps for implementing data synchronization by calling the RDMA interface are as follows:
(1) A WAL file synchronization function based on the RDMA protocol is added to the WAL file management module in Kudu;
(2) Reading and checking of the local system's network card configuration is added to the WAL file synchronization function;
(3) The tablet metadata of the leader node is read;
(4) If the system is configured with an RDMA network card, the WAL function calls the RDMA synchronization interface and synchronizes the WAL file data into the WAL file of the follower node's tablet;
(5) Given the total number N of secondary nodes configured in the cluster, and in accordance with the Raft consistency algorithm, once WAL file writing has completed on (N+1)/2 secondary nodes, the leader node considers the cluster consistency requirement met and starts the RowSet write operation; WAL file synchronization of the remaining secondary nodes is completed by the Raft algorithm;
(6) An RDMA-based RowSet data synchronization function is added to the RowSet management module in Kudu;
(7) Reading and checking of the local system's network card configuration is added to the RowSet data synchronization function;
(8) The tablet metadata of the leader node is read;
(9) If the cluster is configured with an RDMA network card, the RowSet function calls the RDMA synchronization interface and synchronizes the newly added data in the MemRowSet to the MemRowSet of the follower node's tablet;
(10) When RowSet data is synchronized to multiple secondary nodes, in accordance with the Raft algorithm, once (N+1)/2 secondary nodes have completed the synchronous write of MemRowSet data, the leader node considers the cluster consistency requirement met and starts a new RowSet write operation; MemRowSet synchronization of the remaining (N-1)/2 secondary nodes is completed by the Raft algorithm;
(11) After RDMA synchronization is completed, success is returned to the leader node's tablet, and Kudu data synchronization is complete.
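The network card check in items (2), (4), (7), and (9) decides between the RDMA path and the TCP/IP fallback. The text does not say how the check is performed; the sketch below assumes a Linux host, where RDMA-capable devices are commonly listed under `/sys/class/infiniband`. The function names `has_rdma_nic` and `choose_transport` are hypothetical.

```python
import os


def has_rdma_nic(sysfs_dir: str = "/sys/class/infiniband") -> bool:
    """Return True if at least one RDMA device is listed (assumed Linux
    convention; the directory is a parameter so it can be overridden)."""
    try:
        return len(os.listdir(sysfs_dir)) > 0
    except OSError:
        # Directory missing or unreadable: treat as no RDMA support.
        return False


def choose_transport(sysfs_dir: str = "/sys/class/infiniband") -> str:
    """Select 'rdma' when an RDMA NIC is present, else fall back to 'tcp'."""
    return "rdma" if has_rdma_nic(sysfs_dir) else "tcp"
```

Treating a missing directory as "no RDMA" keeps the fallback to TCP/IP automatic on hosts without RDMA drivers.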
As shown in fig. 2, an embodiment of the present invention provides an RDMA-based Kudu cluster data synchronization device, which includes an RDMA interface setting module 11, a preprocessing module 22, a system configuration information judging module 33, a control module 44, and a remaining synchronization processing module 55;
the RDMA interface setting module 11 is configured to set an RDMA interface in each of the write-ahead log (WAL) file management module and the row set management module of the secondary nodes;
the preprocessing module 22 is configured to read the WAL file data of the leader node, and is also configured to read the row set data of the leader node;
the system configuration information judging module 33 is configured to judge whether the cluster's secondary node systems are configured with a network card supporting the RDMA protocol;
the control module 44 is configured to call the RDMA interface of the WAL file management module to synchronize the WAL file data into the WAL file of the follower node; to call the RDMA interface of the row set management module to synchronize the data in the row set into the row set of the follower node; and, when the system is not configured with a network card supporting the RDMA protocol, to use the TCP/IP network to complete synchronization of the WAL files and data between the secondary nodes;
the remaining synchronization processing module 55 is configured to set that, when a preset number of secondary nodes have completed WAL file synchronization, the WAL file synchronization of the remaining secondary nodes is completed by the Raft algorithm; and is also configured to set that, when a threshold number of secondary nodes have completed row set data synchronization, the row set data synchronization of the remaining secondary nodes is completed by the Raft algorithm.
It should be noted that the control module 44 is specifically configured to: call the RDMA interface of the leader node's WAL file management module to acquire the WAL file data, and transmit the data to the follower node through communication between the leader node's network card and the follower node's network card; call the RDMA interface of the follower node's WAL file management module to write the received data into the follower node's WAL file; call the RDMA interface of the leader node's row set management module to acquire the newly added data in the row set, and transmit the data to the follower node through communication between the leader node's network card and the follower node's network card; and call the RDMA interface of the follower node's row set management module to write the received data into the follower node's row set.
In some embodiments, the device further includes a monitoring module 66, a node judging module 77, and a node processing module 88;
the monitoring module 66 is configured to monitor the states of the master node and the secondary nodes, specifically the sending of heartbeats from the secondary nodes to the master node and the receipt of heartbeats by the master node;
the node judging module 77 is configured to judge whether a secondary node is the leader node when the monitoring module detects that the master node has not received that secondary node's heartbeat within the set time interval;
the node processing module 88 is configured to re-select a secondary node and set it as the leader node to replace the original leader node when the node judging module judges that the leader node has failed; to send node fault information to the leader node when the node judging module judges that a follower node has failed, so that the leader node skips the faulty node when synchronizing data and files; and to add the repaired node to the cluster as a follower node after the monitoring module detects that the faulty secondary node has been repaired.
In some embodiments, the apparatus further comprises a request receiving module 99;
the request receiving module 99 is configured to receive a write operation and send information to the control module;
the control module 44 is further configured to, after receiving the information sent by the request receiving module, call the RDMA interface of the leader node's row set management module to write the updated data into the row set, and call the RDMA interface of the leader node's WAL management module to write the WAL record of the data change into the WAL file.
In some embodiments, the device further includes a write status sending module 100 and a write status information receiving module 101;
the write status sending module 100 is configured to send write-success information when more than half of the secondary nodes have completed WAL file synchronization;
the write status information receiving module 101 is configured to send Raft information to the remaining synchronization processing module after receiving the write-success information;
the remaining synchronization processing module 55 is configured to set, after receiving the Raft information, that the WAL file synchronization and row set data synchronization of the remaining secondary nodes are completed by the Raft algorithm.
As shown in fig. 3, an embodiment of the present invention further provides an RDMA-based Kudu cluster data synchronization system, which includes a master node 1 and a plurality of secondary nodes in communication with the master node 1; the master node 1 manages cluster metadata; the secondary nodes are responsible for data storage; the secondary nodes comprise a leader node 10 and follower nodes 20;
each secondary node is provided with a preprocessing module 22, a system configuration information judging module 33, and a control module 44; the system also includes an RDMA interface setting module 11 and a remaining synchronization processing module 55;
the RDMA interface setting module 11 is configured to set an RDMA interface in each of the write-ahead log (WAL) file management module and the row set management module of the secondary nodes;
the preprocessing module 22 is configured to read the WAL file data of the leader node, and is also configured to read the row set data of the leader node;
the system configuration information judging module 33 is configured to judge whether the cluster's secondary node systems are configured with a network card supporting the RDMA protocol;
the control module 44 is configured to call the RDMA interface of the WAL file management module to synchronize the WAL file data into the WAL file of the follower node; to call the RDMA interface of the row set management module to synchronize the data in the row set into the row set of the follower node; and, when the system is not configured with a network card supporting the RDMA protocol, to use the TCP/IP network to complete synchronization of the WAL files and data between the secondary nodes;
the remaining synchronization processing module 55 is configured to set that, when a preset number of secondary nodes have completed WAL file synchronization, the WAL file synchronization of the remaining secondary nodes is completed by the Raft algorithm; and is also configured to set that, when a threshold number of secondary nodes have completed row set data synchronization, the row set data synchronization of the remaining secondary nodes is completed by the Raft algorithm.
It should be noted that the control module 44 is specifically configured to: call the RDMA interface of the leader node's WAL file management module to acquire the WAL file data, the WAL file being stored on disk, and transmit the data to the follower node through communication between the leader node's network card and the follower node's network card; call the RDMA interface of the follower node's WAL file management module to write the received data into the follower node's WAL file; call the RDMA interface of the leader node's row set management module to acquire the newly added data in the row set, the row set data being held in cache, and transmit the data to the follower node through communication between the leader node's network card and the follower node's network card; and call the RDMA interface of the follower node's row set management module to write the received data into the follower node's row set.
In some embodiments, the system further includes a monitoring module 66, a node judging module 77, and a node processing module 88;
the monitoring module 66 is configured to monitor the states of the master node and the secondary nodes, specifically the sending of heartbeats from the secondary nodes to the master node and the receipt of heartbeats by the master node;
the node judging module 77 is configured to judge whether a secondary node is the leader node when the monitoring module detects that the master node has not received that secondary node's heartbeat within the set time interval;
the node processing module 88 is configured to re-select a secondary node and set it as the leader node to replace the original leader node when the node judging module judges that the leader node has failed; to send node fault information to the leader node when the node judging module judges that a follower node has failed, so that the leader node skips the faulty node when synchronizing data and files; and to add the repaired node to the cluster as a follower node after the monitoring module detects that the faulty secondary node has been repaired.
In some embodiments, the leader node 10 is further provided with a request receiving module 99;
the request receiving module 99 is configured to receive a write operation and send information to the control module;
the control module 44 is further configured to, after receiving the information sent by the request receiving module, call the RDMA interface of the leader node's row set management module to write the updated data into the row set, and call the RDMA interface of the leader node's WAL management module to write the WAL record of the data change into the WAL file.
The leader node 10 is further provided with a write status sending module 100, and the master node 1 is provided with a write status information receiving module 101;
the write status sending module 100 is configured to send write-success information to the master node 1 when more than half of the secondary nodes have completed WAL file synchronization;
the write status information receiving module 101 is configured to send Raft information to the remaining synchronization processing module 55 after receiving the write-success information;
the remaining synchronization processing module 55 is configured to set, after receiving the Raft information, that the WAL file synchronization and row set data synchronization of the remaining secondary nodes are completed by the Raft algorithm.
It should be noted that the network card identified by the system configuration information judging module 33 may be an HCA network card or another network card supporting the RDMA protocol.
Although the present invention has been described in detail by way of preferred embodiments with reference to the accompanying drawings, the present invention is not limited thereto. Those skilled in the art may make various equivalent modifications and substitutions to the embodiments of the present invention without departing from its spirit and scope, and all such modifications and substitutions are intended to fall within the scope of the present invention as defined by the appended claims. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (7)

1. An RDMA-based Kudu cluster data synchronization method, characterized in that the Kudu cluster comprises a master node and a plurality of secondary nodes, one of the secondary nodes being selected as a leader node and the other secondary nodes being follower nodes; the method comprises the following steps:
setting an RDMA interface in each of the write-ahead log (WAL) file management module and the row set management module of the secondary nodes;
reading the WAL file data of the leader node;
judging whether the cluster's secondary nodes are configured with a network card supporting the RDMA protocol;
if yes, calling the RDMA interface of the WAL file management module to synchronize the WAL file data into the WAL file of the follower node;
when (N+1)/2 secondary nodes have completed WAL file synchronization, setting the WAL file synchronization of the remaining secondary nodes to be completed by the Raft algorithm;
reading the row set data of the leader node;
calling the RDMA interface of the row set management module to synchronize the row set data into the row set of the follower node;
when (N+1)/2 secondary nodes have completed row set data synchronization, setting the row set data synchronization of the remaining secondary nodes to be completed by the Raft algorithm, N being an odd number;
if not, using the TCP/IP network to complete synchronization of the WAL files and the row set data between the secondary nodes;
before the step of reading the WAL file data of the leader node, the method comprises:
the client connects to the master node to acquire information about the table, and locates the secondary node that maintains the tablet responsible for processing the read-write request;
after the Kudu cluster receives the client's request and checks that the request meets the requirements, it searches all row sets (RowSets) in the tablet for data with the same primary key as the data to be inserted; if such data exists, an error is returned, and if not, the next step is executed;
submitting the write operation to the WAL of the tablet's leader node, and obtaining the agreement of the follower nodes according to the Raft consistency algorithm;
completing the writing of the WAL file data at the leader node;
after the leader node finishes writing the WAL file data, completing the writing of the WAL file data on (N+1)/2 tablet secondary nodes; at the same time, writing row set data into the in-memory row set of the leader node's tablet, and after the row set data has been written on (N+1)/2 tablet secondary nodes, the leader node sends a message to the master node confirming that the Kudu cluster data was written successfully;
the Kudu cluster writes a row of new data into the in-memory row set, and when the data in the in-memory row set reaches a set size, the in-memory row set is flushed to disk, a disk row set is generated to persist the data, and a request is also generated for the in-memory row set to continue receiving new data.
2. The RDMA-based Kudu cluster data synchronization method of claim 1, wherein the step of calling the RDMA interface of the WAL file management module to synchronize the WAL file data into the WAL file of the follower node comprises:
calling the RDMA interface of the leader node's WAL file management module to acquire the WAL file data, and transmitting the WAL file data to the follower node through communication between the leader node's network card and the follower node's network card;
calling the RDMA interface of the follower node's WAL file management module to write the received WAL file data into the follower node's WAL file.
3. The RDMA-based Kudu cluster data synchronization method of claim 2, wherein the step of calling the RDMA interface of the row set management module to synchronize the row set data into the row set of the follower node comprises:
calling the RDMA interface of the leader node's row set management module to acquire the row set data, and transmitting the row set data to the follower node through communication between the leader node's network card and the follower node's network card;
calling the RDMA interface of the follower node's row set management module to write the received row set data into the follower node's row set.
4. The RDMA-based Kudu cluster data synchronization method of claim 3, further comprising:
each secondary node sends a heartbeat to the master node at regular intervals;
when the master node has not received a secondary node's heartbeat within the set time interval, judging whether that secondary node is the leader node;
if yes, re-selecting a secondary node and setting it as the leader node to take over the function of the original leader node;
if not, sending node fault information to the leader node.
5. The RDMA-based Kudu cluster data synchronization method of claim 4, further comprising:
after the faulty secondary node is repaired, adding the repaired secondary node to the cluster as a follower node.
6. An RDMA-based Kudu cluster data synchronization device, characterized in that the Kudu cluster comprises a master node and a plurality of secondary nodes, one of the secondary nodes being selected as a leader node and the other secondary nodes being follower nodes; the device comprises an RDMA interface setting module, a preprocessing module, a system configuration information judging module, a control module, and a remaining synchronization processing module;
the RDMA interface setting module is configured to set an RDMA interface in each of the write-ahead log (WAL) file management module and the row set management module of the secondary nodes;
the preprocessing module is configured to read the WAL file data of the leader node, and is also configured to read the row set data of the leader node;
the system configuration information judging module is configured to judge whether the cluster's secondary nodes are configured with a network card supporting the RDMA protocol;
the control module is configured to call the RDMA interface of the WAL file management module to synchronize the WAL file data into the WAL file of the follower node; to call the RDMA interface of the row set management module to synchronize the row set data into the row set of the follower node; and, when the secondary nodes are not configured with a network card supporting the RDMA protocol, to use the TCP/IP network to complete synchronization of the WAL files and the row set data between the secondary nodes;
the remaining synchronization processing module is configured to set that, when (N+1)/2 secondary nodes have completed WAL file synchronization, the WAL file synchronization of the remaining secondary nodes is completed by the Raft algorithm; and is also configured to set that, when (N+1)/2 secondary nodes have completed row set data synchronization, the row set data synchronization of the remaining secondary nodes is completed by the Raft algorithm;
before the preprocessing module reads the WAL file data of the leader node, the following steps are performed: the client connects to the master node to acquire information about the table, and locates the secondary node that maintains the tablet responsible for processing the read-write request; after the Kudu cluster receives the client's request and checks that the request meets the requirements, it searches all row sets (RowSets) in the tablet for data with the same primary key as the data to be inserted; if such data exists, an error is returned, and if not, the next step is executed; the write operation is submitted to the WAL of the tablet's leader node, and the agreement of the follower nodes is obtained according to the Raft consistency algorithm;
after the leader node finishes writing the WAL file data, the writing of the WAL file data is completed on (N+1)/2 tablet secondary nodes; at the same time, row set data is written into the in-memory row set of the leader node's tablet, and after the row set data has been written on (N+1)/2 tablet secondary nodes, the leader node sends a message to the master node confirming that the Kudu cluster data was written successfully.
7. An RDMA-based Kudu cluster data synchronization system, characterized by comprising a master node and a plurality of secondary nodes in communication with the master node; the master node manages cluster metadata; the secondary nodes are responsible for data storage; the secondary nodes comprise a leader node and follower nodes;
the system further comprising the apparatus of claim 6.
CN202111007589.4A 2021-08-30 2021-08-30 RDMA (remote direct memory access) -based Kudu cluster data synchronization method, device and system Active CN113905054B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111007589.4A CN113905054B (en) 2021-08-30 2021-08-30 RDMA (remote direct memory access) -based Kudu cluster data synchronization method, device and system

Publications (2)

Publication Number Publication Date
CN113905054A CN113905054A (en) 2022-01-07
CN113905054B true CN113905054B (en) 2023-08-08

Family

ID=79188369

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111007589.4A Active CN113905054B (en) 2021-08-30 2021-08-30 RDMA (remote direct memory access) -based Kudu cluster data synchronization method, device and system

Country Status (1)

Country Link
CN (1) CN113905054B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115550384B (en) * 2022-11-25 2023-03-10 苏州浪潮智能科技有限公司 Cluster data synchronization method, device and equipment and computer readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107528710A (en) * 2016-06-22 2017-12-29 华为技术有限公司 Switching method, equipment and the system of raft distributed system leader nodes
CN111274317A (en) * 2020-01-07 2020-06-12 书生星际(北京)科技有限公司 Method and device for synchronizing multi-node data and computer equipment
CN112597251A (en) * 2020-12-29 2021-04-02 天津南大通用数据技术股份有限公司 Database cluster log synchronization method and device, server and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110082832A1 (en) * 2009-10-05 2011-04-07 Ramkumar Vadali Parallelized backup and restore process and system

Also Published As

Publication number Publication date
CN113905054A (en) 2022-01-07

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant