WO2016150066A1 - Master node election method, apparatus and storage system - Google Patents

Master node election method, apparatus and storage system

Info

Publication number
WO2016150066A1
WO2016150066A1 (PCT/CN2015/086169)
Authority
WO
WIPO (PCT)
Prior art keywords
node
election
nodes
request message
replication group
Prior art date
Application number
PCT/CN2015/086169
Other languages
English (en)
French (fr)
Inventor
陈正华
郭斌
陈典强
韩银俊
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司 filed Critical 中兴通讯股份有限公司
Publication of WO2016150066A1 publication Critical patent/WO2016150066A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F15/00Digital computers in general; Data processing equipment in general
    • G06F15/16Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs

Definitions

  • the present invention relates to the field of distributed storage, and in particular, to a primary node election method, apparatus, and storage system.
  • cloud computing (Cloud Computing) is the product of the development and convergence of traditional computer and network technologies such as grid computing (Grid Computing), distributed computing (Distributed Computing), parallel computing (Parallel Computing), utility computing (Utility Computing), network storage (Network Storage Technologies), virtualization (Virtualization) and load balancing (Load Balance). It aims to integrate multiple relatively low-cost computing entities into a single system with powerful computing capability through the network.
  • Distributed storage is an area in the field of cloud computing. Its role is to provide distributed data storage services for massive data and high-speed read and write access.
  • Nodes in a distributed storage system are stateful, that is, the data stored on each node may be different, and nodes cannot be easily replaced with each other. Therefore, its disaster recovery processing is also more complicated.
  • in a master-slave distributed storage system, the master node usually provides read and write services and synchronizes data to the slave nodes in real time; when the master node fails, reads and writes are switched to a slave node, thereby achieving disaster recovery.
  • the handover process requires that two or more master nodes cannot be generated under any circumstances to avoid data consistency problems. At the same time, the handover should be completed as soon as possible to reduce the system failure time.
  • under traditional system scales and network environments, the above problem is relatively easy to solve.
  • typically, the system sits within a single switch, that is, the topology is a star network and each node has only one network exit. If the primary node experiences a network failure, neither the slave nodes nor the application server can connect to it. Therefore, the slave nodes only need to monitor the status of the master node and, when the master node is unavailable, automatically switch state and take over read and write requests.
  • the application server can also detect the master node failure and transfer to the slave node for reading and writing.
  • in a distributed storage system, the system scale is usually much larger: multiple writers may access the data storage system through different networks, the reliability of the network connections is greatly reduced, and network partitioning may occur.
  • when a slave node detects that the master node is unreachable, the master node may still be working normally. If the slave node automatically switches to the active state at this point, two or more master nodes will be generated; once the application server detects multiple master nodes it stops writing, and the related services are interrupted or the data consistency between nodes is broken.
  • the present invention provides a method, an apparatus, and a storage system for selecting a master node to solve at least the above problems.
  • a primary node election method is provided, including: a first node in a replication group establishes connections with the other nodes in the replication group; the first node determines whether a primary node exists among the other nodes; if not, the first node sends an election request message to the other nodes, where the election request message is used by the other nodes to reply with an election result according to an election policy; and the first node determines, according to the election result, whether to switch to the primary node.
  • the other nodes replying with an election result according to the election policy includes at least one of the following: if the election request message received by an other node is the first request message within a preset time period, the other node replies with a consent message, and the first node switches to the primary node if the number of consent messages reaches a preset threshold; the other node replies with a weight value according to the data information carried in the election request message and the election policy, and the first node switches to the primary node if the sum of its weight values is the largest among all nodes; if a primary node exists among the other nodes, the other node replies with a rejection message.
  • before the first node sends the election request message to the other nodes, the method includes: sending the election request message to the other nodes if the priority of the first node meets an election condition, where the priority is preset.
  • the first node in the replication group establishes a connection with other nodes in the replication group, and the number of nodes in the replication group that establish a connection with the first node reaches a preset threshold.
  • the replication groups are at least two and are all disposed on the same physical server.
  • the number of nodes in the replication group is at least three.
  • a primary node election apparatus is provided, which is arranged on the first node and includes: a connection module, configured to establish connections with the other nodes in the replication group; a query module, configured to determine whether a primary node exists among the other nodes; an election request module, configured to send an election request message to the other nodes if no primary node exists among them, where the election request message is used by the other nodes to reply with an election result according to the election policy; and a switching module, configured to determine, according to the election result, whether to switch the first node to the primary node.
  • the switching module includes at least one of the following: a first triggering unit, configured to switch the first node to the primary node if the number of consent messages reaches a preset threshold, where a consent message is replied by an other node when the election request message it receives is the first request message within a preset time period; and a second triggering unit, configured to switch the first node to the primary node if the sum of the weight values of the first node is the largest among all nodes, where the weight values are replied by the other nodes according to the data information carried in the election request message and the election policy; where, if a primary node exists among the other nodes, the other nodes reply with a rejection message.
  • the election request module includes: a priority unit, configured to trigger the election request module to send an election request message to the other nodes if the priority of the first node meets an election condition, where the priority is preset.
  • the first node in the replication group establishes a connection with other nodes in the replication group, and the number of nodes in the replication group that establish a connection with the first node reaches a preset threshold.
  • the replication groups are at least two and are all disposed on the same physical server.
  • the number of nodes in the replication group is at least three.
  • a storage system where the storage system includes at least one replication group, and the replication group includes:
  • the first node including:
  • connection module configured to establish a connection with other nodes in the replication group;
  • query module configured to determine whether the other node has a primary node;
  • an election request module configured to send an election request message to the other nodes if no primary node exists among them, where the election request message is used by the other nodes to reply with an election result according to the election policy; and a switching module configured to determine, according to the election result, whether to switch to the primary node;
  • the other node is configured to reply to the election result according to the election request message and the election policy.
  • the first node in the replication group is used to establish connections with the other nodes in the replication group; the first node determines whether a primary node exists among the other nodes; if not, the first node sends an election request message to the other nodes, where the election request message is used by the other nodes to reply with an election result according to the election policy; and the first node determines, according to the election result, whether to switch to the primary node.
  • in this way, the primary node in each replication group of the distributed storage system is always kept at exactly one, which avoids data consistency problems between nodes.
  • FIG. 1 is a schematic diagram of a network partition in a distributed system according to an embodiment of the present invention
  • FIG. 2 is a flowchart 1 of a method for electing a master node according to an embodiment of the present invention
  • FIG. 3 is a second flowchart of a method for electing a master node according to an embodiment of the present invention.
  • FIG. 4 is a structural block diagram 1 of a primary node election apparatus according to an embodiment of the present invention.
  • FIG. 5 is a second structural block diagram of a primary node election apparatus according to an embodiment of the present invention.
  • FIG. 6 is a schematic diagram of a storage system application according to an embodiment of the present invention.
  • FIG. 2 is a flowchart 1 of a primary node election method according to an embodiment of the present invention. As shown in FIG. 2, the method includes:
  • the first node in the replication group establishes connections with the other nodes in the replication group, where the nodes in the replication group are connected to one another and store the same data, and the replication group guarantees the availability and consistency of the stored data through node redundancy and related scheduling algorithms.
  • a distributed storage system can be composed of one or more such replication groups and provides data storage services to the application servers.
  • the first node determines whether a primary node exists in other nodes in the replication group, and may adopt a method of polling all node identities to find and determine whether a primary node exists.
  • the first node sends an election request message to other nodes in the replication group, and the other nodes reply to the election result according to the corresponding election policy after receiving the election request message.
  • the first node determines, according to the election result replied by the other node, whether it switches to the master node.
  • if a primary node is found to exist during the determination, the first node keeps its slave identity, may start the real-time data synchronization process while monitoring the status of the primary node, and re-executes step S204 when the primary node is found to have failed.
  • through this embodiment, the first node in the replication group establishes connections with the other nodes in the replication group, and the first node determines whether a primary node exists among the other nodes. If not, the first node sends an election request message to the other nodes; after receiving the election request message, the other nodes reply with an election result according to the election policy, and the first node determines, according to the election result, whether to switch to the primary node.
  • in this way, the primary node in each replication group of the distributed storage system is always kept at exactly one, which avoids data consistency problems between nodes.
  • after receiving the election request message, the other nodes may reply with an election result according to the election policy in one of the following ways:
  • if the election request message received by an other node is the first request message within a preset time period, the other node replies with a consent message; when the number of consent messages returned reaches a preset threshold, the first node switches to the master node;
  • the other node replies with a weight value according to the data information carried in the election request message and the election policy, where the data information contains the basis on which the other node computes the weight value, and this basis may be preset operation logic; when the sum of the weight values of the first node is the largest among all nodes, the first node switches to the master node.
  • FIG. 3 is a second flowchart of a method for electing a master node according to an embodiment of the present invention. As shown in FIG. 3, the method includes:
  • the slave node A can be regarded as the first node in the present invention, and slave nodes B and C correspond to the other nodes mentioned above;
  • since the slave node A cannot connect to an existing master node in the replication group, it broadcasts an election request within the group, requesting that slave nodes B and C agree to it becoming the new master node;
  • the slave node C receives the election request from node A first and, being unable to connect to a known master node itself, agrees to node A's election request;
  • the node B receives node A's election request and rejects it, because node B has already sent its own election request (i.e., it has agreed to itself becoming the new master node);
  • the node A receives node B's election request and rejects it, because node A has already sent its own election request (i.e., it has agreed to itself becoming the new master node);
  • the node A collects more than half of the election consent feedback, becomes the new master node, and completes the switch of its own service state;
  • the node B fails to collect more than half of the election consent feedback, so its election fails and it re-executes the master node search process;
  • for the election requests described in steps S302 and S304 there is a timeout period t: if the sender fails to receive a reply from a node within time t, this is equivalent to that node casting a rejection vote. Setting the timeout period t guards against requests that become unreachable or replies that are lost for network reasons.
  • as described in steps S306, S308, S310 and S312, when a slave node receives an election request sent by another node, it should reject the request if it is already connected to a master node; otherwise, it should guarantee that within the timeout period t it agrees to only one slave node's election request.
  • in practice, since the nodes send their requests in parallel, the specific timing may differ from the above steps, but the processing manner is the same.
  • if more than half of the nodes in the replication group can communicate with one another, one master node can always be elected after one or more rounds of election, completing the initialization process of the replication group.
  • to avoid conflicts when multiple nodes initiate elections simultaneously, a default election initiation priority may be configured for the slave nodes. For example, when the first node is pre-configured with the second priority and wants to initiate the election process, the election condition it must meet is: the slave node with the first priority has already initiated the election process or cannot be connected to.
  • since master and slave nodes carry different loads, the nodes in a replication group may be designed to be virtual, and two or more replication groups may be deployed on one physical server, with each physical server simultaneously acting as the master node in some of these replication groups and a slave node in others, so that the load is balanced.
  • the electoral process between different replication groups is independent of each other.
  • the number of nodes in the replication group is generally three or more.
  • FIG. 4 is a structural block diagram of a primary node election device according to an embodiment of the present invention.
  • the apparatus includes: a connection module 402, configured to establish connections with the other nodes in the replication group; a query module 404, configured to determine whether a primary node exists among the other nodes; an election request module 406, configured to send an election request message to the other nodes if no primary node exists among them, where the election request message is used by the other nodes to reply with an election result according to the election policy; and a switching module 408, configured to determine, according to the election result, whether to switch the first node to the primary node.
  • if the query module 404 finds during the determination that a primary node already exists, the first node keeps its slave node identity; it may start the real-time data synchronization process while monitoring the status of the primary node, and the query module 404 re-executes the query function when the primary node is found to have failed.
  • the connection module 402 in the first node establishes connections with the other nodes in the replication group, and the query module 404 determines whether a primary node exists among them. If not, the election request module 406 sends an election request message to the other nodes; after receiving the election request message, the other nodes reply with an election result according to the election policy, and the switching module 408 determines, according to the election result, whether to switch to the master node.
  • in this way, the primary node in each replication group of the distributed storage system is always kept at exactly one, which avoids data consistency problems between nodes.
  • FIG. 5 is a block diagram 2 of a main node election apparatus according to an embodiment of the present invention.
  • the switching module 508 includes:
  • the first triggering unit 5004 is configured to switch the first node to the primary node when the number of consent messages reaches the preset threshold, where a consent message is replied by an other node when the election request message it receives is the first request message within the preset time period;
  • the second triggering unit 5006 is configured to switch the first node to the primary node when the sum of the weight values of the first node is the largest among all nodes, where the weight values are replied by the other nodes according to the data information carried in the election request message and the election policy;
  • where, if a master node exists among the other nodes, the other nodes reply with a rejection message.
  • the election request module 506 includes:
  • the priority unit 5002 is configured to trigger the election request module 506 to send an election request message to other nodes when the priority of the first node meets the election condition, wherein the priority is preset.
  • FIG. 6 is a schematic diagram of a storage system application according to an embodiment of the present invention.
  • the storage system includes at least one replication group, each consisting of one primary node and two or more slave nodes; the nodes in a group store the same data, and the availability and consistency of the stored data are guaranteed through node redundancy and related scheduling algorithms.
  • the entire storage system consists of one or more replication groups that provide data storage services to the application server.
  • the first node in the embodiment of the present invention becomes a master node after being elected, and is a node that provides data read/write service in the replication group, and is responsible for processing the read and write request sent by the application server, and synchronizing the stored data to the slave node.
  • the other nodes in the embodiment of the present invention are equivalent to the slave node in FIG. 6, which is the backup node of the master node, provides the same data access capability as the master node, and synchronizes data from the master node to keep the data state consistent.
  • like the first node, when these slave nodes detect that the master node is unavailable, they may also initiate elections and participate in the selection process of a new master node, and, according to the election result, either switch to become the master node or start data synchronization with the new master node.
  • the application server in Figure 6 is a node that deploys a user-specific application that uses the data read and write services provided by the storage system.
  • the program reads and writes data through an interface library issued by the storage system, which is not itself part of the data storage system.
  • for simplicity, it is assumed that the interface library has the ability to distinguish between the master and slave nodes and to automatically select the master node for data reading and writing; in fact, this capability can also be implemented outside the system, i.e., by the user program.
  • the application server in the embodiment of the present invention is connected to all nodes in the replication group, and all nodes in the replication group are also connected to each other.
  • the connection module of the first node in the storage system establishes connections with the other nodes in the replication group, and the query module determines whether a primary node exists among them. If not, the election request module sends an election request message to the other nodes; after receiving the election request message, the other nodes reply with an election result according to the election policy, and the switching module determines, according to the election result, whether to switch to the master node.
  • in this way, the primary node in each replication group of the distributed storage system is always kept at exactly one, which avoids data consistency problems between nodes.
  • the modules or steps of the present invention described above can be implemented by a general-purpose computing device; they can be centralized on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented by program code executable by the computing device, so that they can be stored in a storage device and executed by the computing device; in some cases the steps shown or described may be performed in an order different from that described here, or they may be separately fabricated into individual integrated circuit modules, or a plurality of the modules or steps may be fabricated as a single integrated circuit module.
  • the invention is not limited to any specific combination of hardware and software.
  • the first node in the replication group is used to establish connections with the other nodes in the replication group; the first node determines whether a primary node exists among the other nodes; if not, the first node sends an election request message to the other nodes, where the election request message is used by the other nodes to reply with an election result according to the election policy; and the first node determines, according to the election result, whether to switch to the primary node.
  • in this way, the primary node in each replication group of the distributed storage system is always kept at exactly one, which avoids data consistency problems between nodes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a master node election method, apparatus, and storage system. The master node election method includes: a first node in a replication group establishes connections with the other nodes in the replication group; the first node determines whether a master node exists among the other nodes; if not, the first node sends an election request message to the other nodes; after receiving the election request message, the other nodes reply with an election result according to an election policy; and the first node determines, according to the election result, whether to switch to become the master node. In this way, there is always exactly one master node in each replication group of the distributed storage system, which avoids data consistency problems between nodes.

Description

Master node election method, apparatus and storage system
Technical Field
The present invention relates to the field of distributed storage, and in particular to a master node election method, apparatus, and storage system.
Background
Cloud computing (Cloud Computing) is the product of the development and convergence of traditional computer and network technologies such as grid computing (Grid Computing), distributed computing (Distributed Computing), parallel computing (Parallel Computing), utility computing (Utility Computing), network storage (Network Storage Technologies), virtualization (Virtualization) and load balancing (Load Balance). It aims to integrate multiple relatively low-cost computing entities into a single system with powerful computing capability through the network. Distributed storage is an area within cloud computing; its role is to provide distributed storage services for massive data together with high-speed read and write access.
Nodes in a distributed storage system are stateful, that is, the data stored on each node may differ and nodes cannot simply replace one another, so their disaster recovery handling is also more complicated. In a master-slave distributed storage system, the master node usually provides read and write services and synchronizes data to the slave nodes in real time; when the master node fails, reads and writes are switched to a slave node, thereby achieving disaster recovery. The handover process requires that two or more master nodes must never be generated under any circumstances, to avoid data consistency problems, and the handover should be completed as soon as possible to reduce system down time.
Under traditional system scales and network environments, the above problem is relatively easy to solve. Typically, the system sits within a single switch, that is, the topology is a star network and each node has only one network exit. If the master node experiences a network failure, neither the slave nodes nor the application server can connect to it, so the slave nodes only need to monitor the status of the master node and, when the master node is unavailable, automatically switch state and take over read and write requests. Correspondingly, the application server can also detect the master node failure and transfer its reads and writes to a slave node.
In a distributed storage system, the system scale is usually much larger: multiple writers may access the data storage system through different networks, the reliability of the network connections is greatly reduced, and network partitioning may occur. As shown in FIG. 1, when a slave node detects that the master node is unreachable, the master node may still be working normally. If the slave node automatically switches to the active state at this point, two or more master nodes will be generated; once the application server detects multiple master nodes it stops writing, and the related services are interrupted or the data consistency between nodes is broken.
Summary of the Invention
In order to solve the problem in the related art that data consistency between nodes is difficult to guarantee, the present invention provides a master node election method, apparatus, and storage system to solve at least the above problem.
According to one aspect of the embodiments of the present invention, a master node election method is provided, including: a first node in a replication group establishes connections with the other nodes in the replication group; the first node determines whether a master node exists among the other nodes; if not, the first node sends an election request message to the other nodes, where the election request message is used by the other nodes to reply with an election result according to an election policy; and the first node determines, according to the election result, whether to switch to become the master node.
Optionally, the other nodes replying with an election result according to the election policy includes at least one of the following: if the election request message received by an other node is the first request message within a preset time period, the other node replies with a consent message, and if the number of consent messages reaches a preset threshold, the first node switches to become the master node; the other node replies with a weight value according to data information carried in the election request message and the election policy, and if the sum of the weight values of the first node is the largest among all nodes, the first node switches to become the master node; if a master node exists among the other nodes, the other node replies with a rejection message.
Optionally, before the first node sends the election request message to the other nodes, the method includes: sending the election request message to the other nodes if the priority of the first node meets an election condition, where the priority is preset.
Optionally, the first node in the replication group establishing connections with the other nodes in the replication group includes: the number of nodes in the replication group that have established a connection with the first node reaches a preset threshold.
Optionally, there are at least two replication groups, and they are all arranged on the same physical server.
Optionally, the number of nodes in the replication group is at least three.
According to another aspect of the embodiments of the present invention, a master node election apparatus is provided, which is arranged on the first node and includes: a connection module, configured to establish connections with the other nodes in the replication group; a query module, configured to determine whether a master node exists among the other nodes; an election request module, configured to send an election request message to the other nodes if no master node exists among them, where the election request message is used by the other nodes to reply with an election result according to an election policy; and a switching module, configured to determine, according to the election result, whether to switch the first node to become the master node.
Optionally, the switching module includes at least one of the following: a first triggering unit, configured to switch the first node to become the master node if the number of consent messages reaches a preset threshold, where a consent message is replied by an other node when the election request message it receives is the first request message within a preset time period; and a second triggering unit, configured to switch the first node to become the master node if the sum of the weight values of the first node is the largest among all nodes, where the weight values are replied by the other nodes according to the data information carried in the election request message and the election policy; where, if a master node exists among the other nodes, the other nodes reply with a rejection message.
Optionally, the election request module includes: a priority unit, configured to trigger the election request module to send the election request message to the other nodes if the priority of the first node meets an election condition, where the priority is preset.
Optionally, the first node in the replication group establishing connections with the other nodes in the replication group includes: the number of nodes in the replication group that have established a connection with the first node reaches a preset threshold.
Optionally, there are at least two replication groups, and they are all arranged on the same physical server.
Optionally, the number of nodes in the replication group is at least three.
According to a further aspect of the embodiments of the present invention, a storage system is provided. The storage system includes at least one replication group, and the replication group includes:
a first node, including:
a connection module, configured to establish connections with the other nodes in the replication group; a query module, configured to determine whether a master node exists among the other nodes; an election request module, configured to send an election request message to the other nodes if no master node exists among them, where the election request message is used by the other nodes to reply with an election result according to an election policy; and a switching module, configured to determine, according to the election result, whether to switch to become the master node;
other nodes, configured to reply with an election result according to the election request message and the election policy.
Through the embodiments of the present invention, the first node in the replication group establishes connections with the other nodes in the replication group; the first node determines whether a master node exists among the other nodes; if not, the first node sends an election request message to the other nodes, where the election request message is used by the other nodes to reply with an election result according to an election policy; and the first node determines, according to the election result, whether to switch to become the master node. In this way, there is always exactly one master node in each replication group of the distributed storage system, which avoids data consistency problems between nodes.
Brief Description of the Drawings
The drawings described here are provided for a further understanding of the present invention and form a part of this application; the exemplary embodiments of the present invention and their description are used to explain the present invention and do not constitute an undue limitation of the present invention. In the drawings:
FIG. 1 is a schematic diagram of a network partition in a distributed system according to an embodiment of the present invention;
FIG. 2 is a first flowchart of a master node election method according to an embodiment of the present invention;
FIG. 3 is a second flowchart of a master node election method according to an embodiment of the present invention;
FIG. 4 is a first structural block diagram of a master node election apparatus according to an embodiment of the present invention;
FIG. 5 is a second structural block diagram of a master node election apparatus according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a storage system application according to an embodiment of the present invention.
Detailed Description of the Embodiments
The present invention is described in detail below with reference to the drawings and in combination with the embodiments. It should be noted that, as long as there is no conflict, the embodiments in this application and the features in the embodiments may be combined with one another.
An embodiment of the present invention provides a master node election method. FIG. 2 is a first flowchart of the master node election method according to an embodiment of the present invention. As shown in FIG. 2, the method includes:
S202: the first node in a replication group establishes connections with the other nodes in the replication group, where the nodes in the replication group are connected to one another and store the same data, and the replication group guarantees the availability and consistency of the stored data through node redundancy and related scheduling algorithms. A distributed storage system can be composed of one or more such replication groups and provides data storage services to application servers.
S204: the first node determines whether a master node exists among the other nodes in the replication group; one possible approach is to poll the identities of all nodes to find out and determine whether a master node exists.
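A minimal Python sketch of this polling step, assuming each peer exposes a role() query over an already established connection; the function and method names are illustrative and are not defined by the patent:
```python
def find_existing_master(peers):
    """Poll every peer and return the first one that reports itself as the
    master node, or None if no master can be found (step S204)."""
    for peer in peers:
        try:
            # role() is an assumed remote query returning "master" or "slave".
            if peer.role() == "master":
                return peer
        except ConnectionError:
            # An unreachable peer cannot serve as a usable master; keep polling.
            continue
    return None
```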
S206: if no master node exists, the first node sends an election request message to the other nodes in the replication group; after receiving the election request message, the other nodes reply with their election results according to the corresponding election policy.
S208: the first node determines, according to the election results replied by the other nodes, whether it switches to become the master node.
If it is found during the determination that a master node already exists, the first node keeps its slave node identity. It may then start the real-time data synchronization process while monitoring the status of the master node, and re-executes step S204 when the master node is found to have failed.
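Putting steps S202 to S208 together, the first node's outer loop can be sketched as follows; the callables passed in (find_master, request_votes, and so on) stand for the connection, query, election-request and switching behaviour and are assumptions of this sketch, not interfaces defined by the patent:
```python
import time

def election_driver(find_master, sync_from_master, request_votes,
                    evaluate_results, switch_to_master, poll_interval=1.0):
    """Outer loop of the first node: stay a slave while a master exists,
    otherwise run one election round and possibly switch roles."""
    while True:
        master = find_master()            # S204: poll the peers for an existing master
        if master is not None:
            sync_from_master(master)      # keep the slave identity and replicate data
            time.sleep(poll_interval)     # keep monitoring the master's status
            continue
        replies = request_votes()         # S206: broadcast the election request message
        if evaluate_results(replies):     # S208: apply the election policy to the replies
            switch_to_master()
            return
        # Election failed: fall through and re-execute the master search (S204).
```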
Through this embodiment of the present invention, the first node in the replication group establishes connections with the other nodes in the replication group; the first node determines whether a master node exists among the other nodes; if not, the first node sends an election request message to the other nodes; after receiving the election request message, the other nodes reply with an election result according to the election policy; and the first node determines, according to the election result, whether to switch to become the master node. In this way, there is always exactly one master node in each replication group of the distributed storage system, which avoids data consistency problems between nodes.
In one implementation of the present invention, after receiving the election request message, the other nodes replying with an election result according to the election policy may include one of the following ways:
If the election request message received by an other node is the first request message within a preset time period, that node replies with a consent message; when the number of consent messages replied reaches a preset threshold, the first node switches to become the master node.
The other nodes reply with weight values according to the data information carried in the election request message and the election policy, where the data information contains the basis on which an other node computes its weight value, and this basis may be preset operation logic. When the sum of all the weight values of the first node is the largest among all nodes, the first node switches to become the master node.
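Both reply policies reduce to simple checks on the candidate side. A hedged sketch follows; the consent threshold and the weighting formula are deployment choices that the description leaves open, so the concrete values below are purely illustrative:
```python
def wins_by_consent(replies, threshold):
    """First policy: the candidate switches to master once the number of
    consent messages it collected reaches the preset threshold."""
    return sum(1 for reply in replies if reply == "agree") >= threshold

def reply_weight(request_data):
    """Second policy, replier side: derive a weight from the data information
    carried in the election request.  The formula is only a stand-in for the
    preset operation logic mentioned in the description."""
    return request_data.get("data_version", 0) * 10 - request_data.get("load", 0)

def wins_by_weight(my_weight_sum, other_candidates_sums):
    """Second policy, candidate side: switch to master only if the sum of the
    collected weight values is the largest among all candidate nodes."""
    return all(my_weight_sum > other for other in other_candidates_sums)
```
In the FIG. 3 example the threshold is "more than half" of the three nodes; whether the candidate's own vote is counted toward that total is an implementation detail the description does not fix.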
Taking the first way above as an example: at initial startup or when the master node fails, a slave node in the replication group cannot connect to the master node, which triggers an election process that continues until the node itself becomes the new master node or a master node is found. After the election has produced a master node, the master node is responsible for processing the data read and write requests of the application servers. FIG. 3 is a second flowchart of the master node election method according to an embodiment of the present invention. As shown in FIG. 3, the method includes:
S302: slave node A can be regarded as the first node of the present invention, and slave nodes B and C correspond to the other nodes mentioned above. In this embodiment, since slave node A cannot connect to an existing master node, it broadcasts an election request within the replication group, requesting that slave nodes B and C agree to it becoming the new master node;
S304: slave node B performs a similar operation, sending election requests to slave nodes A and C and requesting to become the master node;
S306: within the preset time period T, slave node C receives slave node A's election request first and, being unable to connect to a known master node itself, agrees to slave node A's election request;
S308: slave node C receives slave node B's election request; because it has already agreed to slave node A's election request, slave node C replies with a rejection message and rejects slave node B's election request;
S310: slave node B receives node A's election request and rejects it, because node B has already sent its own election request (i.e., it has agreed to itself becoming the new master node);
S312: slave node A receives node B's election request and rejects it, because node A has already sent its own election request (i.e., it has agreed to itself becoming the new master node);
S314: assuming the preset threshold is more than half of the nodes, slave node A collects more than half of the election consent feedback, becomes the new master node, and completes the switch of its own service state;
S316: slave node B fails to collect more than half of the election consent feedback, so its election fails and it re-executes the master node search process;
S318: slave nodes B and C discover and connect to master node A, the initialization of the replication group is completed, and the election terminates.
For the election requests described in steps S302 and S304, there is a timeout period t: if the sender fails to receive a reply from a node within time t, this is equivalent to that node casting a rejection vote. Setting the timeout period t guards against requests that become unreachable or replies that are lost for network reasons.
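A sketch of how the sender can collect replies while treating silence within t as a rejection vote; reply_queue is an assumed in-process queue fed by the networking layer, which the patent does not specify:
```python
import queue
import time

def collect_replies(reply_queue, expected, timeout_t):
    """Gather election replies for at most timeout_t seconds; every peer that
    has not answered by the deadline is counted as a rejection vote."""
    replies = []
    deadline = time.monotonic() + timeout_t
    while len(replies) < expected:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            replies.append(reply_queue.get(timeout=remaining))
        except queue.Empty:
            break
    # Missing replies are equivalent to rejection votes.
    replies.extend("reject" for _ in range(expected - len(replies)))
    return replies
```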
As described in steps S306, S308, S310 and S312, when a slave node receives an election request sent by another node, it should reject the request if it is already connected to a master node; otherwise, it should guarantee that within the timeout period t it agrees to only one slave node's election request.
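The receiver-side rule can be captured in a small state holder; the field and method names below are assumptions of this sketch rather than terms used by the patent:
```python
class ElectionReceiver:
    """Receiver-side policy of steps S306-S312: reject when already attached
    to a master, otherwise agree to at most one candidate per timeout window."""

    def __init__(self, timeout_t):
        self.timeout_t = timeout_t
        self.agreed_to = None      # candidate agreed to in the current window
        self.window_end = 0.0

    def handle_request(self, candidate_id, has_master, now):
        if has_master:
            return "reject"                    # already connected to a master node
        if now >= self.window_end:             # the previous consent window expired
            self.agreed_to = None
        if self.agreed_to is None:
            self.agreed_to = candidate_id      # first request within this window
            self.window_end = now + self.timeout_t
            return "agree"
        return "reject"                        # the vote already went to someone else
```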
In practice, since the nodes send their requests in parallel, the specific timing may differ from the above steps, but the processing is the same. In the embodiment of the present invention, if more than half of the nodes in the replication group can communicate with one another, a master node can always be elected after one or more rounds of election, completing the initialization process of the replication group.
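One way to express the "more than half can communicate" precondition before starting a round; counting the local node itself toward the quorum is an assumption of this sketch, not something the description states explicitly:
```python
def quorum_reachable(connected_peers, group_size):
    """Return True when the local node plus its reachable peers form more
    than half of the replication group, the condition under which one or
    more election rounds are always able to produce a master."""
    return connected_peers + 1 > group_size // 2
```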
In one implementation of the present invention, to avoid the conflict of multiple nodes initiating elections simultaneously, a default election initiation priority may be configured for the slave nodes. For example, when the first node is pre-configured with the second priority, if the first node wants to initiate the election process, the election condition it must meet is: the slave node with the first priority has already initiated the election process or cannot be connected to.
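The priority gate described here is a local check made before broadcasting; the state names used below ("electing", "unreachable") are illustrative:
```python
def may_initiate_election(higher_priority_states):
    """Election condition for a lower-priority slave: every slave with a
    higher election-initiation priority has either already started an
    election or cannot be connected to."""
    return all(state in ("electing", "unreachable")
               for state in higher_priority_states.values())

# A second-priority node may proceed only when the first-priority slave is
# already electing or unreachable.
assert may_initiate_election({"slave-1": "unreachable"})
assert not may_initiate_election({"slave-1": "idle"})
```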
In one implementation of the present invention, since the master and slave nodes in a replication group carry different loads, the nodes in a replication group may be designed to be virtual, and two or more replication groups may be deployed on one physical server, with each physical server simultaneously acting as a master node in some of these replication groups and a slave node in others, thereby balancing the load. Of course, the election processes of different replication groups are independent of one another. The number of nodes in a replication group is generally three or more.
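A possible (purely illustrative) layout for three virtual replication groups spread over three physical servers; the group and server names are not taken from the patent:
```python
# Each physical server leads one group and follows in the others, so the
# heavier master-side load is spread evenly across the machines.
deployment = {
    "group-1": {"master": "server-A", "slaves": ["server-B", "server-C"]},
    "group-2": {"master": "server-B", "slaves": ["server-A", "server-C"]},
    "group-3": {"master": "server-C", "slaves": ["server-A", "server-B"]},
}

def masters_per_server(layout):
    """Count how many groups each physical server leads; a balanced layout
    keeps these counts roughly equal."""
    counts = {}
    for group in layout.values():
        counts[group["master"]] = counts.get(group["master"], 0) + 1
    return counts
```
Elections in the different groups run independently, so a server losing one mastership does not affect its role in the other groups.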
An embodiment of the present invention further provides a master node election apparatus arranged on the first node. FIG. 4 is a first structural block diagram of the master node election apparatus according to an embodiment of the present invention. As shown in FIG. 4, the apparatus includes: a connection module 402, configured to establish connections with the other nodes in the replication group; a query module 404, configured to determine whether a master node exists among the other nodes; an election request module 406, configured to send an election request message to the other nodes if no master node exists among them, where the election request message is used by the other nodes to reply with an election result according to an election policy; and a switching module 408, configured to determine, according to the election result, whether to switch the first node to become the master node.
If the query module 404 finds during the determination that a master node already exists, the first node keeps its slave node identity. It may then start the real-time data synchronization process while monitoring the status of the master node, and the query module 404 re-executes the query function when the master node is found to have failed.
Through this embodiment of the present invention, the connection module 402 in the first node establishes connections with the other nodes in the replication group, and the query module 404 determines whether a master node exists among the other nodes. If not, the election request module 406 sends an election request message to the other nodes; after receiving the election request message, the other nodes reply with an election result according to the election policy, and the switching module 408 determines, according to the election result, whether to switch to become the master node. In this way, there is always exactly one master node in each replication group of the distributed storage system, which avoids data consistency problems between nodes.
FIG. 5 is a second structural block diagram of the master node election apparatus according to an embodiment of the present invention. In one implementation of the present invention, as shown in FIG. 5, the switching module 508 includes:
a first triggering unit 5004, configured to switch the first node to become the master node when the number of the consent messages reaches the preset threshold, where a consent message is replied by an other node when the election request message it receives is the first request message within the preset time period;
a second triggering unit 5006, configured to switch the first node to become the master node when the sum of the weight values of the first node is the largest among all nodes, where the weight values are replied by the other nodes according to the data information carried in the election request message and the election policy;
where, if a master node exists among the other nodes, the other nodes reply with a rejection message.
In one implementation of the present invention, as shown in FIG. 5, the election request module 506 includes:
a priority unit 5002, configured to trigger the election request module 506 to send the election request message to the other nodes when the priority of the first node meets the election condition, where the priority is preset.
An embodiment of the present invention further provides a storage system. FIG. 6 is a schematic diagram of a storage system application according to an embodiment of the present invention. As shown in FIG. 6, the storage system includes at least one replication group, each consisting of one master node and two or more slave nodes; the nodes in a group store the same data, and the availability and consistency of the stored data are guaranteed through node redundancy and related scheduling algorithms. The whole storage system is composed of one or more replication groups and provides data storage services to application servers.
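The structure of such a system can be summarized with a small data model; the field names below are an assumption of this sketch, not identifiers from the patent:
```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ReplicationGroup:
    """One replication group: a single master plus two or more slaves that
    all hold the same data set."""
    group_id: str
    master: str
    slaves: List[str] = field(default_factory=list)

    def is_well_formed(self) -> bool:
        # At least three nodes in total and exactly one master.
        return len(self.slaves) >= 2 and self.master not in self.slaves

@dataclass
class StorageSystem:
    """The whole storage system is simply one or more replication groups."""
    groups: List[ReplicationGroup] = field(default_factory=list)
```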
The first node in the embodiment of the present invention becomes the master node after being elected. In the replication group it is the node that provides data read and write services; it is responsible for processing the read and write requests sent by the application servers and synchronizes the data it stores to the slave nodes.
The other nodes in the embodiment of the present invention correspond to the slave nodes in FIG. 6. They are backup nodes of the master node, provide the same data access capability as the master node, and synchronize data from the master node to keep the data state consistent. Like the first node, when these slave nodes detect that the master node is unavailable, they may also initiate elections and participate in the selection process of a new master node, and, according to the election result, either switch to become the master node or start data synchronization with the new master node.
The application server in FIG. 6 is a node on which a user-specific application program is deployed; this program uses the data read and write services provided by the storage system. Usually, the program reads and writes data through an interface library issued by the storage system, and the application server itself is not part of the data storage system. To simplify the description, the present invention assumes that the interface library can distinguish between the master and slave nodes and automatically select the master node for data reading and writing; in fact, this capability can also be implemented outside the system, i.e., by the user program.
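A hedged sketch of such an interface library on the client side, assuming each node connection exposes role(), read() and write() calls; these method names are illustrative only:
```python
class InterfaceLibrary:
    """Client-side helper: track which node of the replication group is the
    current master and route every read and write to it, refreshing the
    cached choice whenever the master is unknown or has changed."""

    def __init__(self, nodes):
        self.nodes = nodes       # connections to all nodes in the group
        self.master = None

    def _refresh_master(self):
        for node in self.nodes:
            if node.role() == "master":
                self.master = node
                return
        raise RuntimeError("no master currently elected in the replication group")

    def write(self, key, value):
        if self.master is None:
            self._refresh_master()
        return self.master.write(key, value)

    def read(self, key):
        if self.master is None:
            self._refresh_master()
        return self.master.read(key)
```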
The application server in the embodiment of the present invention is connected to all nodes in the replication group, and all nodes in the replication group are also connected to one another.
Through this embodiment of the present invention, the connection module of the first node in the storage system establishes connections with the other nodes in the replication group, and the query module determines whether a master node exists among the other nodes. If not, the election request module sends an election request message to the other nodes; after receiving the election request message, the other nodes reply with an election result according to the election policy, and the switching module determines, according to the election result, whether to switch to become the master node. In this way, there is always exactly one master node in each replication group of the distributed storage system, which avoids data consistency problems between nodes.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented by a general-purpose computing device; they can be centralized on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they may be implemented by program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; in some cases the steps shown or described may be performed in an order different from that described here, or they may be fabricated into individual integrated circuit modules, or multiple modules or steps among them may be fabricated into a single integrated circuit module. In this way, the present invention is not limited to any specific combination of hardware and software.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.
Industrial Applicability
Through the embodiments of the present invention, the first node in the replication group establishes connections with the other nodes in the replication group; the first node determines whether a master node exists among the other nodes; if not, the first node sends an election request message to the other nodes, where the election request message is used by the other nodes to reply with an election result according to an election policy; and the first node determines, according to the election result, whether to switch to become the master node. In this way, there is always exactly one master node in each replication group of the distributed storage system, which avoids data consistency problems between nodes.

Claims (13)

  1. A master node election method, comprising: a first node in a replication group establishes connections with the other nodes in the replication group;
    the first node determines whether a master node exists among the other nodes;
    if not, the first node sends an election request message to the other nodes, wherein the election request message is used by the other nodes to reply with an election result according to an election policy;
    the first node determines, according to the election result, whether to switch to become the master node.
  2. The method according to claim 1, wherein the other nodes replying with an election result according to the election policy comprises at least one of the following:
    if the election request message received by an other node is the first request message within a preset time period, the other node replies with a consent message, and if the number of consent messages reaches a preset threshold, the first node switches to become the master node;
    the other node replies with a weight value according to data information carried in the election request message and the election policy, and if the sum of the weight values of the first node is the largest among all nodes, the first node switches to become the master node;
    if a master node exists among the other nodes, the other node replies with a rejection message.
  3. The method according to claim 1, wherein before the first node sends the election request message to the other nodes, the method comprises:
    sending the election request message to the other nodes if the priority of the first node meets an election condition, wherein the priority is preset.
  4. The method according to any one of claims 1 to 3, wherein the first node in the replication group establishing connections with the other nodes in the replication group comprises:
    the number of nodes in the replication group that have established a connection with the first node reaches a preset threshold.
  5. The method according to claim 4, wherein there are at least two replication groups, and they are all arranged on the same physical server.
  6. The method according to claim 5, wherein the number of nodes in the replication group is at least three.
  7. A master node election apparatus, arranged on the first node, comprising:
    a connection module, configured to establish connections with the other nodes in the replication group;
    a query module, configured to determine whether a master node exists among the other nodes;
    an election request module, configured to send an election request message to the other nodes if no master node exists among the other nodes, wherein the election request message is used by the other nodes to reply with an election result according to an election policy;
    a switching module, configured to determine, according to the election result, whether to switch the first node to become the master node.
  8. The apparatus according to claim 7, wherein the switching module comprises at least one of the following:
    a first triggering unit, configured to switch the first node to become the master node if the number of consent messages reaches a preset threshold, wherein a consent message is replied by an other node when the election request message it receives is the first request message within a preset time period;
    a second triggering unit, configured to switch the first node to become the master node if the sum of the weight values of the first node is the largest among all nodes, wherein the weight values are replied by the other nodes according to the data information carried in the election request message and the election policy;
    wherein, if a master node exists among the other nodes, the other nodes reply with a rejection message.
  9. The apparatus according to claim 7, wherein the election request module comprises:
    a priority unit, configured to trigger the election request module to send the election request message to the other nodes if the priority of the first node meets an election condition, wherein the priority is preset.
  10. The apparatus according to any one of claims 7 to 9, wherein the first node in the replication group establishing connections with the other nodes in the replication group comprises:
    the number of nodes in the replication group that have established a connection with the first node reaches a preset threshold.
  11. The apparatus according to claim 10, wherein there are at least two replication groups, and they are all arranged on the same physical server.
  12. The apparatus according to claim 11, wherein the number of nodes in the replication group is at least three.
  13. A storage system, wherein the storage system comprises at least one replication group, the replication group comprising:
    a first node, comprising:
    a connection module, configured to establish connections with the other nodes in the replication group; a query module, configured to determine whether a master node exists among the other nodes; an election request module, configured to send an election request message to the other nodes if no master node exists among the other nodes, wherein the election request message is used by the other nodes to reply with an election result according to an election policy; and a switching module, configured to determine, according to the election result, whether to switch to become the master node;
    other nodes, configured to reply with an election result according to the election request message and the election policy.
PCT/CN2015/086169 2015-03-25 2015-08-05 Master node election method, apparatus and storage system WO2016150066A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510133374.5 2015-03-25
CN201510133374.5A CN106161495A (zh) 2015-03-25 Master node election method, apparatus and storage system

Publications (1)

Publication Number Publication Date
WO2016150066A1 true WO2016150066A1 (zh) 2016-09-29

Family

ID=56976948

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/086169 WO2016150066A1 (zh) 2015-03-25 2015-08-05 一种主节点选举方法、装置及存储***

Country Status (2)

Country Link
CN (1) CN106161495A (zh)
WO (1) WO2016150066A1 (zh)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106657117A (zh) * 2016-12-31 2017-05-10 广州佳都信息技术研发有限公司 一种地铁综合监控权限管理的方法及装置
CN112104727A (zh) * 2020-09-10 2020-12-18 华云数据控股集团有限公司 一种精简高可用Zookeeper集群部署方法及***
CN112533304A (zh) * 2020-11-24 2021-03-19 锐捷网络股份有限公司 自组网络管理方法、装置、***、电子设备以及存储介质
CN112835748A (zh) * 2019-11-22 2021-05-25 上海宝信软件股份有限公司 基于scada***的多中心冗余仲裁方法及***
CN113297236A (zh) * 2020-11-10 2021-08-24 阿里巴巴集团控股有限公司 分布式一致性***中主节点的选举方法、装置及***
CN113489601A (zh) * 2021-06-11 2021-10-08 海南视联通信技术有限公司 基于视联网自治云网络架构的抗毁方法和装置
CN114448769A (zh) * 2022-04-02 2022-05-06 支付宝(杭州)信息技术有限公司 一种基于共识***的节点竞选投票方法及装置
CN118200123A (zh) * 2024-05-14 2024-06-14 北京智芯微电子科技有限公司 一种多节点网络主节点选取方法、装置、设备及存储介质

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107391320B (zh) 2017-03-10 2020-07-10 创新先进技术有限公司 一种共识方法及装置
CN106911524B (zh) * 2017-04-27 2020-07-07 新华三信息技术有限公司 一种ha实现方法及装置
CN107832138B (zh) * 2017-09-21 2021-09-14 南京邮电大学 一种扁平化的高可用namenode模型的实现方法
CN107995029B (zh) * 2017-11-28 2019-12-13 新华三信息技术有限公司 选举控制方法及装置、选举方法及装置
CN110417842B (zh) * 2018-04-28 2022-04-12 北京京东尚科信息技术有限公司 用于网关服务器的故障处理方法和装置
CN108810100B (zh) * 2018-05-22 2021-06-29 郑州云海信息技术有限公司 一种主节点的选举方法、装置及设备
CN109040184B (zh) * 2018-06-28 2021-09-07 武汉船舶通信研究所(中国船舶重工集团公司第七二二研究所) 一种主节点的选举方法及服务器
CN110764690B (zh) * 2018-07-28 2023-04-14 阿里云计算有限公司 分布式存储***及其领导节点选举方法和装置
CN110784331B (zh) 2018-07-30 2022-05-13 华为技术有限公司 一种共识流程恢复方法及相关节点
CN109379238B (zh) * 2018-12-14 2022-06-17 郑州云海信息技术有限公司 一种分布式集群的ctdb主节点选举方法、装置及***
CN112398664B (zh) * 2019-08-13 2023-08-08 中兴通讯股份有限公司 主设备选择方法、设备管理方法、电子设备以及存储介质
CN111093249B (zh) * 2019-12-05 2022-06-21 合肥中感微电子有限公司 无线局域网通信方法、***及无线收发设备
CN112988882B (zh) * 2019-12-12 2024-01-23 阿里巴巴集团控股有限公司 数据的异地灾备***、方法及装置、计算设备
CN113742417B (zh) * 2020-05-29 2024-06-07 同方威视技术股份有限公司 多级分布式共识方法及***、电子设备及计算机可读介质
CN112000285A (zh) * 2020-08-12 2020-11-27 广州市百果园信息技术有限公司 强一致存储***、数据强一致存储方法、服务器及介质
CN113596093A (zh) * 2021-06-28 2021-11-02 青岛海尔科技有限公司 设备集合的控制方法和装置、存储介质及电子设备
CN113489149B (zh) * 2021-07-01 2023-07-28 广东电网有限责任公司 基于实时状态感知的电网监控***业务主节点选取方法
CN116107828A (zh) * 2021-11-11 2023-05-12 中兴通讯股份有限公司 主节点选择方法、分布式数据库及存储介质
CN116910158A (zh) * 2023-08-17 2023-10-20 深圳计算科学研究院 基于复制组的数据处理、查询方法、装置、设备及介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101043398A (zh) * 2006-06-28 2007-09-26 华为技术有限公司 一种动态共享连接的方法和***
US20080281938A1 (en) * 2007-05-09 2008-11-13 Oracle International Corporation Selecting a master node in a multi-node computer system
CN101661408A (zh) * 2009-09-14 2010-03-03 四川川大智胜软件股份有限公司 一种分布式实时数据复制同步方法
CN101702721A (zh) * 2009-10-26 2010-05-05 北京航空航天大学 一种多集群***的可重组方法
CN103118084A (zh) * 2013-01-21 2013-05-22 浪潮(北京)电子信息产业有限公司 一种主节点的选举方法及节点

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101217402B (zh) * 2008-01-15 2012-01-04 杭州华三通信技术有限公司 一种提高集群可靠性的方法和一种高可靠性通信节点
CN102843259A (zh) * 2012-08-21 2012-12-26 武汉达梦数据库有限公司 集群内中间件自管理热备方法及***
CN102904752B (zh) * 2012-09-25 2016-06-29 新浪网技术(中国)有限公司 一种节点选举方法、节点设备及***
CN103491168A (zh) * 2013-09-24 2014-01-01 浪潮电子信息产业股份有限公司 一种集群选举设计方法
CN103634375B (zh) * 2013-11-07 2017-01-11 华为技术有限公司 扩容集群节点的方法、装置及设备

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101043398A (zh) * 2006-06-28 2007-09-26 华为技术有限公司 一种动态共享连接的方法和***
US20080281938A1 (en) * 2007-05-09 2008-11-13 Oracle International Corporation Selecting a master node in a multi-node computer system
CN101661408A (zh) * 2009-09-14 2010-03-03 四川川大智胜软件股份有限公司 一种分布式实时数据复制同步方法
CN101702721A (zh) * 2009-10-26 2010-05-05 北京航空航天大学 一种多集群***的可重组方法
CN103118084A (zh) * 2013-01-21 2013-05-22 浪潮(北京)电子信息产业有限公司 一种主节点的选举方法及节点

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106657117A (zh) * 2016-12-31 2017-05-10 广州佳都信息技术研发有限公司 一种地铁综合监控权限管理的方法及装置
CN112835748A (zh) * 2019-11-22 2021-05-25 上海宝信软件股份有限公司 基于scada***的多中心冗余仲裁方法及***
CN112104727A (zh) * 2020-09-10 2020-12-18 华云数据控股集团有限公司 一种精简高可用Zookeeper集群部署方法及***
CN112104727B (zh) * 2020-09-10 2021-11-30 华云数据控股集团有限公司 一种精简高可用Zookeeper集群部署方法及***
CN113297236A (zh) * 2020-11-10 2021-08-24 阿里巴巴集团控股有限公司 分布式一致性***中主节点的选举方法、装置及***
CN112533304A (zh) * 2020-11-24 2021-03-19 锐捷网络股份有限公司 自组网络管理方法、装置、***、电子设备以及存储介质
CN112533304B (zh) * 2020-11-24 2023-10-20 锐捷网络股份有限公司 自组网络管理方法、装置、***、电子设备以及存储介质
CN113489601A (zh) * 2021-06-11 2021-10-08 海南视联通信技术有限公司 基于视联网自治云网络架构的抗毁方法和装置
CN113489601B (zh) * 2021-06-11 2024-05-14 海南视联通信技术有限公司 基于视联网自治云网络架构的抗毁方法和装置
CN114448769A (zh) * 2022-04-02 2022-05-06 支付宝(杭州)信息技术有限公司 一种基于共识***的节点竞选投票方法及装置
CN118200123A (zh) * 2024-05-14 2024-06-14 北京智芯微电子科技有限公司 一种多节点网络主节点选取方法、装置、设备及存储介质

Also Published As

Publication number Publication date
CN106161495A (zh) 2016-11-23

Similar Documents

Publication Publication Date Title
WO2016150066A1 (zh) Master node election method, apparatus and storage system
CN102404390B (zh) 高速实时数据库的智能化动态负载均衡方法
JP6382454B2 (ja) 分散ストレージ及びレプリケーションシステム、並びに方法
US7225356B2 (en) System for managing operational failure occurrences in processing devices
US11106556B2 (en) Data service failover in shared storage clusters
CN108551765A (zh) 输入/输出隔离优化
US10652100B2 (en) Computer system and method for dynamically adapting a software-defined network
EP3461065B1 (en) Cluster arbitration method and multi-cluster cooperation system
CN109802986B (zh) 设备管理方法、***、装置及服务器
EP3813335B1 (en) Service processing methods and systems based on a consortium blockchain network
CN109040184A (zh) 一种主节点的选举方法及服务器
CN114265753A (zh) 消息队列的管理方法、管理***和电子设备
CN114124650A (zh) 一种sptn网络控制器主从部署方法
CN113810216B (zh) 一种集群的故障切换方法、装置及电子设备
CN113794765A (zh) 基于文件传输的网闸负载均衡方法及装置
CN105323271B (zh) 一种云计算***以及云计算***的处理方法和装置
JP2010044553A (ja) データ処理方法、クラスタシステム、及びデータ処理プログラム
CN116346582A (zh) 一种实现主备双网冗余方法、装置、设备及存储介质
CN111510336B (zh) 一种网络设备状态管理方法及装置
US20210406141A1 (en) Computer cluster with adaptive quorum rules
US9798633B2 (en) Access point controller failover system
CN113923222A (zh) 数据处理方法及装置
CN110266795A (zh) 一种基于Openstack平台控制方法
CN108959170B (zh) 虚拟设备管理方法、装置、堆叠***及可读存储介质
CN114490158A (zh) 分布式的容灾***、服务器节点处理方法、装置及设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15885998

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15885998

Country of ref document: EP

Kind code of ref document: A1