CN113992681A - Method for ensuring strong consistency of data in distributed system - Google Patents

Method for ensuring strong consistency of data in distributed system

Info

Publication number
CN113992681A
CN113992681A (application CN202111110681.3A)
Authority
CN
China
Prior art keywords
nodes
node
cluster
service
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111110681.3A
Other languages
Chinese (zh)
Other versions
CN113992681B (en
Inventor
张超林
夏之春
朱洪斌
金健
胡旭东
Current Assignee
Shanghai Kingstar Fintech Co Ltd
Original Assignee
Shanghai Kingstar Fintech Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Kingstar Fintech Co Ltd filed Critical Shanghai Kingstar Fintech Co Ltd
Priority to CN202111110681.3A priority Critical patent/CN113992681B/en
Publication of CN113992681A publication Critical patent/CN113992681A/en
Application granted granted Critical
Publication of CN113992681B publication Critical patent/CN113992681B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/01: Protocols
    • H04L67/10: Protocols in which an application is distributed across nodes in the network
    • H04L67/1097: Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00: Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/04: Trading; Exchange, e.g. stocks, commodities, derivatives or currency exchange
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L12/00: Data switching networks
    • H04L12/02: Details
    • H04L12/16: Arrangements for providing special services to substations
    • H04L12/18: Arrangements for providing special services to substations for broadcast or conference, e.g. multicast
    • H04L12/1863: Arrangements for providing special services to substations for broadcast or conference, e.g. multicast, comprising mechanisms for improved reliability, e.g. status reports
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00: Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/16: Implementation or adaptation of Internet protocol [IP], of transmission control protocol [TCP] or of user datagram protocol [UDP]
    • H04L69/164: Adaptation or special uses of UDP protocol
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00: Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/22: Parsing or analysis of headers
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00: Reducing energy consumption in communication networks
    • Y02D30/50: Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Business, Economics & Management (AREA)
  • Computer Security & Cryptography (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Hardware Redundancy (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention uses a reliable multicast mechanism to distribute data from a source service cluster to a target service cluster. At any time, only one node of the target cluster (the master node) is responsible for re-distributing processed service data outwards; the other nodes are called standby nodes. All nodes in the target service cluster join the same multicast group and process the received service data simultaneously, but the standby nodes do not take part in external data distribution. Because all nodes process at the same pace, their internal state machines transition in the same order, which ensures strong consistency of data across all nodes. If the master node fails, the method elects a suitable standby node, promotes it to master, and continues processing data, achieving distributed data consistency: all data can be accessed from any node in the cluster. By replacing the original TCP point-to-point transmission with connectionless reliable multicast, the invention also significantly increases the number of nodes a service cluster can accommodate.

Description

Method for ensuring strong consistency of data in distributed system
Technical Field
The invention belongs to the technical field of security trading, and particularly relates to a method for ensuring strong consistency of data in a distributed system.
Background
The existing trading system has been successfully applied to various product lines such as securities, futures, and institutional trading. However, the rapid growth in the variety and volume of financial-market trading places ever higher demands on the performance and capacity of the trading system. The original trading system has many limitations that make improvement of the existing version very difficult, which in turn creates the need for a new-generation trading system.
For example, the original transaction system is not a distributed architecture in the full sense: the service cluster supports only a simple active/standby model, and the active/standby switchover mechanism is itself controlled by a single arbitration process.
This in itself implies a possible single point of failure. The active/standby model works effectively, but apart from local active/standby it cannot support horizontal extensions such as remote backup. To improve the trading system's capacity for horizontal expansion and its robustness, the invention innovates in the way service data is distributed and the way service clusters are synchronized, so as to meet core customer requirements for transaction speed, robustness, and scalability. The current system has two problems:
(1) A single service cluster consists of only three nodes: a master node, a backup node, and an arbitration node. The master node receives service data from the original message middleware, performs the service processing, and distributes the results outwards; the backup node also receives and processes the service data from the middleware but does not distribute it externally. The arbitration node maintains heartbeat connections to both the master and backup nodes; if a master failure is detected, the backup node is notified by the arbitration node and switches its identity from backup to master. After the switchover, the new master is responsible for distributing processed data outwards. The drawback of this scheme is that the arbitration node runs as a single process, so it is itself a single point of failure and in fact adds no reliability to the distributed system. When the arbitration node fails, master/backup switchover becomes unavailable; if the master then also fails to provide transaction service, the cluster as a whole has effectively failed, and service stability and reliability are poor.
(2) The service cluster has only a master node and a backup node, so its capacity for horizontal expansion is limited. If a customer requires additional transaction nodes that keep data consistent and can take over in a disaster at any time, this architecture clearly cannot meet such further demands on fault tolerance.
Disclosure of Invention
In the transaction system described in this invention, a service cluster always has one logical role called the master node, which is responsible for distributing sequenced messages to the standby nodes and sending responses outwards; the master node is therefore a potential single point of failure. To ensure the reliability and robustness of the transaction system, a method is needed that, in the event of master failure, quickly elects a new master node to continue message processing.
The technical scheme of the invention is as follows:
the invention provides a method for ensuring strong consistency of data in a distributed system, which comprises the following steps:
1) introducing UDP multicast, so that when the source service master node sends data to the target service cluster it sends to a predetermined multicast address, adding an application-layer frame number to the header of every UDP message;
2) the target service cluster's master-node information is a lease written into etcd; the lease expiration time is very short (less than 2 seconds), so the master must refresh it periodically to keep the lease alive in etcd, while all standby nodes wait on the lease;
3) when the master node fails, the lease expires and is deleted by etcd, and all standby nodes are thereby notified of the master failure; at this moment the service cluster does not provide service externally;
4) having learned that the target service cluster currently has no master, all the standby nodes start an election process;
5) one of the standby nodes is elected master, takes over the sequencing and processing of service messages, and the service cluster switches back to an available state.
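Steps 2) to 5) amount to lease-based leader election. As an illustration only (the patent uses a real etcd cluster; the class and method names below are hypothetical), the lease semantics can be sketched with an in-memory stand-in:

```python
class LeaseStore:
    """Minimal in-memory stand-in for the etcd lease mechanism (illustrative only)."""

    def __init__(self):
        self.leases = {}  # key -> (holder, expiry_timestamp)

    def acquire(self, key, holder, ttl, now):
        """Write the master lease iff no unexpired lease exists (compare-and-create)."""
        cur = self.leases.get(key)
        if cur is None or cur[1] <= now:
            self.leases[key] = (holder, now + ttl)
            return True
        return False

    def refresh(self, key, holder, ttl, now):
        """The master's periodic renewal that keeps the lease alive."""
        cur = self.leases.get(key)
        if cur and cur[0] == holder and cur[1] > now:
            self.leases[key] = (holder, now + ttl)
            return True
        return False

    def master(self, key, now):
        """Current master, or None once the lease has expired (cluster unavailable)."""
        cur = self.leases.get(key)
        return cur[0] if cur and cur[1] > now else None


# Master holds a short lease (< 2 s) and must keep refreshing it; when it
# fails, the lease expires and the standby nodes race to acquire it.
store = LeaseStore()
TTL = 1.5
assert store.acquire("/cluster/master", "node-A", TTL, now=0.0)      # A becomes master
assert not store.acquire("/cluster/master", "node-B", TTL, now=1.0)  # lease still live
# node-A crashes and stops refreshing; its lease expires at t = 1.5
assert store.master("/cluster/master", now=2.0) is None              # cluster unavailable
assert store.acquire("/cluster/master", "node-B", TTL, now=2.0)      # B wins election
assert store.master("/cluster/master", now=2.5) == "node-B"
```

In a real deployment the same pattern maps onto etcd's lease and compare-and-create primitives, with the master refreshing its lease well inside the sub-2-second TTL.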
Further, the number of nodes in the service cluster is X, where X = 2N + 1.
Further, the sum of the lease expiration time and the time from when a standby node receives the master-failure notification until master reselection completes is less than 5 seconds.
Further, in step 5), the elected node declares itself as the master node and waits for approval from the other nodes.
Further, once all other nodes approve, the node is formally promoted to master, and all nodes that approved it are converted to standby nodes.
Further, if a node, after starting, learns from the etcd cluster that a service cluster already exists, it switches to the joining process and must send a request to the master node asking to join the cluster. Such a request is accepted while the service cluster is still in the ready state; if a node asks to join after the cluster has begun processing service messages, it is rejected.
Furthermore, each service message carries a transport-layer sequence number in its header, and the sequence number is a monotonically increasing integer; after a message is generated, the master node in the source service cluster sends it to a predetermined multicast address; each service message stream is defined as unidirectional, and the receiving multicast address of each stream must be different.
The invention also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the above method.
Compared with the prior art, the method has the following beneficial effects:
1. Considering that a financial transaction system places extremely high demands on the robustness of its processing hosts, a scheme is designed that guarantees strong data consistency at all times. In terms of the CAP theorem the scheme sacrifices availability while satisfying consistency and partition tolerance: service is briefly unavailable during master/standby switchover, but this is compensated by the reliable multicast mechanism;
2. For horizontal scaling, etcd is introduced as the master-election tool, implementing a competitive master-election mechanism and greatly improving the robustness of the service cluster;
3. For service-data distribution speed, connectionless reliable multicast replaces the original TCP point-to-point transmission and intermediate forwarding nodes are eliminated, so data transmission speed increases sharply while the number of nodes a service cluster can accommodate also grows significantly.
Drawings
Fig. 1 is a flowchart of the operation provided in this embodiment.
Examples
The following description of specific embodiments of the present invention is provided in connection with examples to facilitate a better understanding of the present invention.
As shown in fig. 1, the present embodiment provides a method for ensuring strong consistency of data in a distributed system, which specifically includes the following steps:
Reliable multicast over UDP messages
The reliable multicast implementation is based on the transport-layer sequence number carried at the head of each UDP message; the sequence number of each service stream starts from 1. On receiving from the multicast group, the receiver parses the sequence number in each service-message header. Three cases arise:
1) The new sequence number equals the last received sequence number plus 1: the message is in order, is parsed normally, and the receiver updates the sequence number it holds for that service stream.
2) The new sequence number exceeds the last received sequence number by more than 1: UDP messages have been lost. The receiver then sends a control-plane request message to the known sender's multicast address containing all the missing sequence numbers, asking the sender to retransmit the lost messages. While waiting for the retransmission, the receiver buffers any subsequent service messages it receives; once the retransmitted messages arrive from the source service node, they are parsed and processed together with the buffered ones.
3) The new sequence number is less than the last received sequence number: the service message has already been received, and this copy is a duplicate improperly created by the network (e.g., through retransmission delay); it should be discarded.
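The three cases above can be sketched as a small receiver state machine. This is an illustrative sketch, not the patent's code; the class and method names are hypothetical:

```python
class Receiver:
    """Per-stream reliable-multicast receiver: deliver in order, buffer gaps, drop duplicates."""

    def __init__(self):
        self.last_seq = 0
        self.pending = {}  # buffered out-of-order messages awaiting gap fill

    def on_message(self, seq, payload):
        """Returns (payloads deliverable in order, missing seqs to request from the sender)."""
        if seq <= self.last_seq:
            return [], []                       # case 3: duplicate, discard
        if seq == self.last_seq + 1:
            delivered = [payload]               # case 1: in order
            self.last_seq = seq
            # drain any buffered messages that are now contiguous
            while self.last_seq + 1 in self.pending:
                self.last_seq += 1
                delivered.append(self.pending.pop(self.last_seq))
            return delivered, []
        # case 2: a gap; buffer this message and ask the sender's control
        # flow to retransmit the holes
        self.pending[seq] = payload
        missing = [s for s in range(self.last_seq + 1, seq) if s not in self.pending]
        return [], missing
```

For example, receiving sequence 1, then 3, then 2 delivers "1", requests a refill of 2, and finally delivers "2" and the buffered "3" together, matching the buffering behavior described in case 2.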
In practice, two kinds of flows should be defined for each pair of communicating parties using reliable multicast: a service flow, which may be unidirectional, and a control flow, which should be configured in pairs; each uses its own multicast group address.
The service flow carries a given type of service message together with that service's maximum-sequence-number notification messages:
1) Sender service flow: the multicast address is 224.0.1.0:23022; the receiver should join this multicast address. Other outbound service flows may likewise define their own receiver multicast addresses;
2) Receiver response flow: the multicast address is 224.0.2.0:23022; the sender should join this multicast address to receive the former's responses.
The sender control flow is a special service flow used to carry gap-fill (completion) messages. It has its own multicast group address, and when missing messages are retransmitted they are all carried on this control flow, regardless of service type. This design keeps gap-filling from interfering with the normally flowing service streams and so avoids artificially reordering messages.
1) Sender control flow: the multicast address is 224.0.1.0:23023; the receiver should join this address to receive the service messages it has asked to have filled in;
2) Receiver control flow: the multicast address is 224.0.2.0:23023; the sender should join this address to receive the former's completion requests and its acknowledgements of sequenced messages.
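The four flows and the multicast addresses listed in this embodiment can be summarized in a small table; the pairing helper below is a hypothetical illustration of which groups each side joins:

```python
# Flow layout from the embodiment (the pairing helper is an assumption for illustration).
FLOWS = {
    ("sender", "service"):    ("224.0.1.0", 23022),  # business messages + max-seq notices
    ("receiver", "response"): ("224.0.2.0", 23022),  # receiver responses back to the sender
    ("sender", "control"):    ("224.0.1.0", 23023),  # gap-fill (completion) messages
    ("receiver", "control"):  ("224.0.2.0", 23023),  # completion requests + ordered-msg acks
}


def join_groups(role):
    """Multicast groups a given role should join: everything its peer sends."""
    peer = "receiver" if role == "sender" else "sender"
    return [addr for (owner, _), addr in FLOWS.items() if owner == peer]
```

So a receiver joins 224.0.1.0:23022 and 224.0.1.0:23023 (the sender's service and control flows), while the sender joins the two 224.0.2.0 groups, exactly as items 1) and 2) above prescribe.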
In the transaction system of this embodiment, because the receiving side plays two roles (master and standby nodes), the handling of service-message loss by nodes in the reliable-multicast target cluster is divided into two situations:
after the nodes in the target service cluster are started, waiting for a period of time, wherein the purpose is to wait for other nodes to report self information to the etcd cluster, and if the time is up to the time that the cluster still has less than half (X-1)/2 pieces of information which are not reported (namely, the cluster does not have online service), the cluster cannot be started; if more than half of nodes are started in the time and no main node exists currently, entering an election process, declaring the nodes as main nodes and reporting the main nodes to the etcd cluster;
Once all other nodes approve, the node is formally promoted to master, and all nodes that approved it become standby nodes. If instead a node learns from the etcd cluster after starting that a service cluster already exists, it switches to the joining process and must send a request to the master node asking to join the cluster. Such a request can be accepted while the service cluster is still in the ready state. If a node asks to join after the cluster has begun processing service messages, it should be rejected; this avoids inter-node message synchronization in the production state, which would force the cluster to stop serving.
Each service message carries a transport-layer sequence number in its header, and the sequence number is a monotonically increasing integer. After a message is generated, the master node in the source service cluster sends it to the predetermined multicast address. Each service message stream is unidirectional, and the receiving multicast address of each stream must be different.
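The patent does not specify a wire format for this header, so the following sketch assumes a hypothetical 12-byte layout (stream ID as uint32 plus the transport-layer sequence number as uint64, big-endian) purely for illustration:

```python
import struct

# Hypothetical header: stream ID (uint32) + sequence number (uint64), network byte order.
HEADER = struct.Struct("!IQ")


def pack_message(stream_id: int, seq: int, payload: bytes) -> bytes:
    """Prepend the sequence-number header the source master adds to each UDP message."""
    return HEADER.pack(stream_id, seq) + payload


def parse_message(datagram: bytes):
    """Split a received datagram into (stream_id, seq, payload)."""
    stream_id, seq = HEADER.unpack_from(datagram)
    return stream_id, seq, datagram[HEADER.size:]
```

The receiver parses the sequence number off every datagram before deciding whether to deliver, buffer, or discard it, as described in the three cases earlier in this embodiment.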
All nodes (receivers) in the target cluster should join every multicast group they are interested in so as to receive the message streams of the source service cluster. Each should be configured in advance with the source-cluster information, i.e., its predefined service-message stream IDs and their multicast addresses. Conversely, if the source service cluster needs to receive responses from the target service cluster, it should join the latter's corresponding multicast groups.
After the source service cluster starts sending service messages, it stores every sent message in an internal send-data cache.
All nodes in the target cluster receive service messages from the designated multicast address, but only the master node stores them in its internal received-data cache and then queues and sequences them. A standby node stores received messages in an internal message cache without processing them and discards them after a fixed time; this time equals the master switchover time, so that messages arriving from the source service cluster during master election are preserved rather than lost. The master then sends all sequenced service messages to the cluster-internal multicast address and waits for acknowledgement responses from the other standby nodes. Collecting acknowledgements ensures that every node in the target cluster has received the messages and is ready to process them.
After a standby node receives a sequenced service message, it must send an acknowledgement to the master node. It also receives the acknowledgements sent by the other standby nodes. For the master, the number of expected acknowledgements equals the number of online standby nodes; for a standby node, it is the number of online standby nodes minus one.
When a node has received all acknowledgements, the current message is known to have been received by the other nodes in the target cluster, and master and standby nodes simultaneously submit all confirmed messages to the application for processing in one batch. Only the master may receive the application's responses to a message and send them outwards; standby nodes do not send responses. That is, external receivers (including the source service cluster) only ever receive message responses from the master node.
Distributing service messages through the master node guarantees global consistency of message reception and global uniqueness of processing results. Provided the internal processing order is identical on all online nodes, the application states remain strictly consistent: accessing the application on any node at any time yields the same result, realizing strong consistency in the distributed system.
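The sequence-then-acknowledge commit described above can be sketched as follows. This is an illustrative simulation with hypothetical names; in the real system these exchanges travel over the cluster's internal multicast addresses:

```python
class Master:
    """Sketch of the master's commit rule: a sequenced message is handed to the
    application only once every online standby has acknowledged it."""

    def __init__(self, standby_ids):
        self.standby_ids = set(standby_ids)
        self.acks = {}        # seq -> set of standby ids that acknowledged
        self.committed = []   # seqs submitted to the application, in order

    def broadcast(self, seq, msg):
        """Send a sequenced message to the cluster and start collecting acks."""
        self.acks[seq] = set()
        return msg  # real system: send to the cluster-internal multicast address

    def on_ack(self, seq, standby_id):
        """Record one standby's ack; commit when all standbys have answered."""
        self.acks[seq].add(standby_id)
        if self.acks[seq] == self.standby_ids:
            self.committed.append(seq)   # every node holds the message: safe to process
            return True
        return False
```

With standbys B and C, message 1 is committed only after both acks arrive, which is what guarantees that all nodes process the same messages in the same order.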
The master node detects loss of a service message:
The master sends a completion request to the source service cluster. On receiving the request, the source-cluster master looks up the required messages among those it has sent, by the requested sequence numbers, and sends them to the multicast address the two sides have dedicated to the completion-message flow; for simplicity of processing, the message format is not modified. The target-cluster master merges the retransmitted messages into its internal message cache, re-sequences them, and submits them to the application for processing. In production, the volume of financial-transaction messages is extremely large, so it is not feasible for a single node to keep all sent service messages in memory. However: 1) under most conditions UDP multicast within a LAN is stable and reliable, and message reordering or loss rarely occurs; and 2) a receiver initiates a completion request quickly after detecting a loss. Sent messages are therefore deleted after being kept for one minute (this time value is configurable), saving memory.
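The sender-side cache with its one-minute retention window can be sketched as follows (illustrative only; the retention value is configurable per the text, and the names are hypothetical):

```python
from collections import OrderedDict


class SendCache:
    """Sender-side cache of recently sent messages, kept for gap-fill requests
    and evicted after a retention window (one minute in the embodiment)."""

    def __init__(self, retention=60.0):
        self.retention = retention
        self.sent = OrderedDict()  # seq -> (send_time, payload), oldest first

    def record(self, seq, payload, now):
        """Store a just-sent message and evict anything older than the window."""
        self.sent[seq] = (now, payload)
        while self.sent:
            _, (t, _) = next(iter(self.sent.items()))
            if now - t > self.retention:
                self.sent.popitem(last=False)  # drop the oldest entry
            else:
                break

    def refill(self, missing_seqs):
        """Look up messages named in a completion request; None if already evicted."""
        return {s: (self.sent[s][1] if s in self.sent else None) for s in missing_seqs}
```

A message evicted before the completion request arrives comes back as None, the situation the text argues is rare because receivers detect loss and request refills quickly.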
A standby node detects loss of a service message:
The standby node can only send its completion request to the master node, and it stops acknowledging sequenced messages to the master until the required message arrives. Because acknowledgements have not yet arrived from all standby nodes, the master also stops sending newly sequenced service messages to the standby nodes, while still continuing to receive and sequence service messages from the source cluster. After the standby node receives the missing service message it sends its acknowledgement to the master, and the master resumes broadcasting sequenced service messages to the cluster.
In the transaction system of this invention, reliable multicast satisfies both the requirement that every node of the target cluster receive data consistently and the requirement that every node's application state machine process and transition simultaneously, providing the guarantee of strong data consistency for the distributed system.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (8)

1. A method for ensuring strong consistency of data in a distributed system, characterized by comprising the following steps:
1) introducing UDP multicast, so that when the source service master node sends data to the target service cluster it sends to a predetermined multicast address, adding an application-layer frame number to the header of every UDP message;
2) the target service cluster's master-node information being a lease written into etcd, the lease expiration time being very short (less than 2 seconds), so that the master must refresh it periodically to keep the lease alive in etcd, while all standby nodes wait on the lease;
3) when the master node fails, the lease expiring and being deleted by etcd, whereupon all standby nodes are notified of the master failure; at this moment the service cluster does not provide service externally;
4) having learned that the target service cluster currently has no master, all the standby nodes starting an election process;
5) one of the standby nodes being elected master and taking over the sequencing and processing of service messages, whereupon the target service cluster switches to an available state.
2. The method for ensuring strong consistency of data in a distributed system according to claim 1, wherein the number of nodes in the service cluster is X, and X = 2N + 1.
3. The method for ensuring strong consistency of data in a distributed system according to claim 1, wherein the sum of the lease expiration time and the time from when a standby node receives the master-failure notification until master reselection completes is less than 5 seconds.
4. The method of claim 1, wherein in step 5), the elected node declares itself as the master node and waits for approval from the other nodes.
5. The method for ensuring strong consistency of data in a distributed system according to claim 4, wherein once all other nodes approve, the node is formally promoted to master, and all nodes that approved it are converted to standby nodes.
6. The method for ensuring strong consistency of data in a distributed system according to claim 5, wherein if a node, after starting, learns from the etcd cluster that a service cluster already exists, it switches to the joining process and must send a request to the master node asking to join the cluster; such a request is accepted while the service cluster is still in the ready state; if a node asks to join after the cluster has begun processing service messages, it is rejected.
7. The method for ensuring strong consistency of data in a distributed system according to claim 1, wherein each service message carries a transport-layer sequence number in its header, the sequence number being a monotonically increasing integer; after a message is generated, the master node in the source service cluster sends it to a predetermined multicast address; each service message stream is defined as unidirectional, and the receiving multicast address of each stream must be different.
8. A computer storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
CN202111110681.3A 2021-09-18 2021-09-18 Method for guaranteeing strong consistency of data in distributed system Active CN113992681B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111110681.3A CN113992681B (en) 2021-09-18 2021-09-18 Method for guaranteeing strong consistency of data in distributed system

Publications (2)

Publication Number Publication Date
CN113992681A true CN113992681A (en) 2022-01-28
CN113992681B CN113992681B (en) 2024-07-02

Family

ID=79736287

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111110681.3A Active CN113992681B (en) 2021-09-18 2021-09-18 Method for guaranteeing strong consistency of data in distributed system

Country Status (1)

Country Link
CN (1) CN113992681B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115421971A (en) * 2022-08-16 2022-12-02 江苏安超云软件有限公司 ETCD disaster recovery backup fault recovery method and application

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000062502A2 (en) * 1999-04-12 2000-10-19 Rainfinity, Inc. Distributed server cluster for controlling network traffic
US6192417B1 (en) * 1999-03-30 2001-02-20 International Business Machines Corporation Multicast cluster servicer for communicating amongst a plurality of nodes without a dedicated local area network
CN107465765A (en) * 2017-09-21 2017-12-12 深圳市视维科技股份有限公司 A kind of intelligent use gateway realization method based on container cloud
CN110855737A (en) * 2019-09-24 2020-02-28 中国科学院软件研究所 Consistency level controllable self-adaptive data synchronization method and system
CN112162875A (en) * 2020-10-12 2021-01-01 上交所技术有限责任公司 High-reliability message transmission method in transaction system
CN113162797A (en) * 2021-03-03 2021-07-23 山东英信计算机技术有限公司 Method, system and medium for switching master node fault of distributed cluster



Also Published As

Publication number Publication date
CN113992681B (en) 2024-07-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 210 Liangjing Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201203

Applicant after: Shanghai Jinshida Software Technology Co.,Ltd.

Address before: 210 Liangjing Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201203

Applicant before: Shanghai Kingstar Software Technology Co.,Ltd.

Country or region before: China

GR01 Patent grant