WO2024131366A1 - Cluster repair method and device - Google Patents

Cluster repair method and device Download PDF

Info

Publication number
WO2024131366A1
WO2024131366A1 PCT/CN2023/130370 CN2023130370W
Authority
WO
WIPO (PCT)
Prior art keywords
node
cluster
normal
reorganization
nodes
Prior art date
Application number
PCT/CN2023/130370
Other languages
English (en)
French (fr)
Inventor
孟凡辉
张基峰
Original Assignee
中科信息安全共性技术国家工程研究中心有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中科信息安全共性技术国家工程研究中心有限公司 filed Critical 中科信息安全共性技术国家工程研究中心有限公司
Publication of WO2024131366A1 publication Critical patent/WO2024131366A1/zh

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16Error detection or correction of the data by redundancy in hardware
    • G06F11/18Error detection or correction of the data by redundancy in hardware using passive fault-masking of the redundant circuits
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/30Monitoring

Definitions

  • the embodiments of the present application relate to the field of cluster technology, and in particular to a cluster repair method and device.
  • a normal node in the cluster is designated to add the repaired node to the reorganized cluster.
  • the present application also provides a cluster repair device, the device comprising: a monitoring module, a deletion module, a reorganization module and an adding module; among them,
  • the monitoring module is used to monitor the operating status of each node in the cluster; and monitor whether the cluster can operate normally according to the operating status of each node; the cluster includes one or more nodes;
  • the deletion module is used to delete the faulty node from the cluster if the cluster cannot operate normally;
  • the reorganization module is used to designate a normal node in the cluster to reorganize the cluster and repair the faulty node;
  • the adding module is used to designate a normal node in the cluster to add the repaired node to the reorganized cluster.
  • the embodiment of the present application proposes a cluster repair method and device, which monitors the operating status of each node in the cluster; and monitors whether the cluster can operate normally according to the operating status of each node; if the cluster cannot operate normally, the faulty node is deleted from the cluster; then a normal node in the cluster is designated to reorganize the cluster and repair the faulty node; and then a normal node in the cluster is designated to add the repaired node to the reorganized cluster.
  • the operating status of each node in the cluster can be monitored in real time; when the cluster cannot operate normally, the node is first designated to complete the cluster reorganization to ensure that the cluster does not stop working, and then the faulty node is analyzed and added to the reorganized cluster after repair.
  • in the prior art, when the cluster cannot operate normally due to a fault during operation, especially when multiple nodes in the cluster fail or lose power, the cluster will stop working.
  • the cluster repair method and device proposed in the embodiment of the present application can automatically repair the cluster to avoid the impact on the business to the greatest extent; and the technical solution of the embodiment of the present application is simple and convenient to implement, easy to popularize, and has a wider range of application.
  • FIG1 is a schematic diagram of a first process flow of a cluster repair method provided in an embodiment of the present application
  • FIG2 is a schematic diagram of a second process of the cluster repair method provided in an embodiment of the present application.
  • FIG3 is a schematic diagram of a third process flow of the cluster repair method provided in an embodiment of the present application.
  • FIG4 is a schematic diagram of a cluster system architecture provided in an embodiment of the present application.
  • FIG5 is a schematic diagram of the structure of a cluster repair device provided in an embodiment of the present application.
  • FIG6 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
  • FIG1 is a first flow chart of a cluster repair method provided in an embodiment of the present application.
  • the method can be performed by a cluster repair device or electronic device.
  • the device or electronic device can be implemented by software and/or hardware.
  • the device or electronic device can be integrated into any smart device with network communication function.
  • the cluster repair method may include the following steps:
  • each node in the cluster can monitor its own operating status and the operating status of other nodes in the cluster; and monitor whether the cluster can operate normally according to the operating status of each node.
  • the operating status of each node in the cluster is monitored by a monitoring function module; wherein the monitoring function module is located on each node of the cluster, and is used to monitor the operating status of the node and other nodes in the cluster, analyze and repair faults, and back up the database.
  • the monitoring function module is not limited to the above functions in actual applications, and the module can also realize functions such as cluster reorganization and node addition.
  • Each node has a daemon process, which can detect the status of the node where it is located, and can also detect the status of other nodes in the cluster.
  • a cluster includes: Node A, Node B and Node C; suppose that when Node C fails, the daemon process of Node A and the daemon process of Node B can detect that Node C fails. At this time, the daemon process of Node A and the daemon process of Node B can delete Node C from the cluster; alternatively, a custom rule can be used to designate one of the nodes to complete the deletion of Node C.
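  • as a concrete illustration of this daemon, the sketch below shows one plausible shape of the per-node monitoring loop; the peer table, `ping_node` and `remove_from_cluster` are assumptions introduced for the example and are not specified in the application.

```python
import time

# Hypothetical peer table (node name -> IPv4 address); the application does not fix a format.
PEERS = {"node-a": "10.0.0.1", "node-b": "10.0.0.2", "node-c": "10.0.0.3"}

def ping_node(address, timeout=2.0):
    """Placeholder health probe; a real daemon might poll the DB port or a heartbeat RPC."""
    raise NotImplementedError

def remove_from_cluster(node_name):
    """Placeholder for the membership update carried out by the designated normal node."""
    raise NotImplementedError

def daemon_loop(self_name, interval=5.0):
    """Each node runs this loop: probe every peer and flag the ones that stop answering.
    Only one normal node (here, the one with the largest IP, per the custom rule mentioned
    in the application) actually performs the deletion, so the peers do not race each other."""
    while True:
        failed = [name for name, addr in PEERS.items()
                  if name != self_name and not ping_node(addr)]
        normal = [name for name in PEERS if name not in failed]
        executor = max(normal, key=lambda n: tuple(int(x) for x in PEERS[n].split(".")))
        if failed and executor == self_name:
            for name in failed:
                remove_from_cluster(name)
        time.sleep(interval)
```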
  • S103 Designate a normal node in the cluster to reorganize the cluster and repair the faulty node.
  • if the cluster has only one normal node, the cluster configuration is modified through that node and the cluster operation mode is first converted to stand-alone mode, to avoid the impact of the cluster stopping because of the fault.
  • if the cluster has more than one normal node, the nodes in the cluster can designate a node as the reorganization execution node through a custom rule; the reorganization execution node then sets a node in the cluster as the master node, and the cluster is reorganized through the master node.
  • the selected reorganization execution node can trigger the monitoring function module at the reorganization execution node to perform cluster reorganization operations, obtain information about other normal nodes in the current cluster, and determine the node with the latest database information through the information of other normal nodes; set the node with the latest database information as the master node, and complete the reorganization cluster information configuration and data synchronization through the master node, specifically including the master node modifying the cluster configuration information and the master node synchronizing the database information of this node to other nodes in the reorganized cluster.
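  • the "node with the latest database information" step can be pictured as follows; using a replication sequence number (for example a binlog position) as the freshness marker is an assumption, since the application only states that the freshest node is chosen, and the tie-break mirrors the custom-rule fallback described later.

```python
def choose_master(normal_nodes):
    """normal_nodes maps node name -> freshness marker (e.g. a replication sequence
    number queried from each surviving node). Returns the name of the new master."""
    best = max(normal_nodes.values())
    candidates = [name for name, seq in normal_nodes.items() if seq == best]
    # If several nodes are equally fresh, the application falls back to the custom rule
    # (e.g. largest IP); the node name is used here only as a stand-in tie-breaker.
    return max(candidates)

# Example: node-a holds the freshest data, so it becomes the master of the reorganized cluster.
print(choose_master({"node-a": 1042, "node-b": 1040}))  # -> node-a
```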
  • the designated reorganization execution node needs to complete the analysis and repair of the faulty node problems and handle different fault situations.
  • in this step, the repaired node is added to the reorganized cluster through the designated normal node in the cluster.
  • if the cluster has only one normal node, that node can be designated as the add execution node to add the repaired node to the cluster; if the cluster has more than one normal node, one of the normal nodes can be designated as the add execution node through a custom rule to add the repaired node to the cluster.
  • usually, the add execution node in this step is the reorganization execution node determined in S103.
  • the add execution node adds the repaired node to the reorganized cluster.
  • the specific operation flow is as follows: elect a node as the add execution node according to the IP election mechanism, or directly use the reorganization execution node of step S103 as the add execution node, and modify the configuration files of each node in the cluster through the add execution node; based on the modified configuration files of each node in the cluster, the repaired node is started, and the database information regularly backed up by the monitoring function module on the add execution node is sent to the repaired node to complete the database synchronization.
  • This operation does not affect the normal operation of the original cluster node, and there is no need to perform a locking operation.
  • the add execution node can also perform synchronization verification and integrity verification to ensure the integrity and consistency of the data.
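  • a sketch of that addition flow is given below; it assumes each node's configuration is a plain membership file and that the periodic backup is shipped as a dump, and `start_cluster_service`, `send_backup` and `verify_checksum` are illustrative placeholders rather than APIs named by the application.

```python
def push_config(node, members):
    """Write the new membership list into this node's (hypothetical) config file."""
    with open(node["config_path"], "w") as f:
        f.write("\n".join(members) + "\n")

def start_cluster_service(node):
    """Placeholder: start the cluster service on the node (software start, not hardware boot)."""
    raise NotImplementedError

def send_backup(node):
    """Placeholder: send the add execution node's periodically backed-up database to `node`."""
    raise NotImplementedError

def verify_checksum(node):
    """Placeholder: synchronization and integrity verification after the data is shipped."""
    raise NotImplementedError

def add_repaired_node(cluster, repaired_node):
    """cluster: list of current member dicts ({'address', 'config_path'}); repaired_node:
    the node being (re)introduced. The original members keep serving -- no lock is taken."""
    members = [n["address"] for n in cluster] + [repaired_node["address"]]
    for node in cluster + [repaired_node]:
        push_config(node, members)         # 1) rewrite every node's configuration file
    start_cluster_service(repaired_node)   # 2) start the repaired node
    send_backup(repaired_node)             # 3) ship the periodic backup for DB synchronization
    assert verify_checksum(repaired_node)  # 4) synchronization / integrity check
```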
  • the repaired node can be added to the reorganized cluster in at least the following two scenarios: 1) One or more nodes in the cluster fail, and after the failed node is repaired, the repaired node is added to the cluster; 2) The original cluster needs to be expanded, and the new node needs to be added to the original cluster.
  • a normal node can be selected for full data synchronization according to the Paxos algorithm combined with custom rules, such as IP operation rules.
  • the problem solved by the Paxos algorithm is how a distributed system can reach a consensus on a certain value or a certain resolution.
  • a typical scenario is that in a distributed database system, if the initial state of each node is consistent and each node executes the same sequence of operations, then they can finally get a consistent state.
  • a "consistency algorithm” needs to be executed on each instruction to ensure that the instructions seen by each node are consistent.
  • a general consistency algorithm can be applied in many scenarios and is an important issue in distributed computing.
  • the consistency algorithm can be implemented through shared memory or message passing, and the Paxos algorithm uses the latter.
  • scenarios where the Paxos algorithm is applicable include: multiple processes/threads in a machine reaching data consistency; multiple clients concurrently reading and writing data in a distributed file system or distributed database; and the consistency of multiple replicas in distributed storage when responding to read and write requests.
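  • the application invokes Paxos only at this level of generality; the sketch below is a compact single-decree round (one proposer, in-memory acceptors) meant to make the prepare/accept message passing concrete, and it illustrates the classical algorithm, not the application's own election rule.

```python
class Acceptor:
    def __init__(self):
        self.promised = -1          # highest proposal number promised so far
        self.accepted = (-1, None)  # (proposal number, value) most recently accepted

    def prepare(self, n):
        if n > self.promised:
            self.promised = n
            return True, self.accepted
        return False, None

    def accept(self, n, value):
        if n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return True
        return False

def propose(acceptors, n, value):
    """One single-decree Paxos round: phase 1 (prepare) then phase 2 (accept).
    Returns the chosen value, or None if no majority was reached."""
    majority = len(acceptors) // 2 + 1
    granted = [acc for ok, acc in (a.prepare(n) for a in acceptors) if ok]
    if len(granted) < majority:
        return None
    # If some acceptor already accepted a value, the proposer must adopt the value
    # carried by the highest-numbered acceptance instead of its own proposal.
    prior_n, prior_value = max(granted, key=lambda acc: acc[0])
    if prior_value is not None:
        value = prior_value
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, n=1, value="node-c-is-master"))  # -> node-c-is-master
```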
  • FIG2 is a second flow diagram of the cluster repair method provided in the embodiment of the present application. Based on the above technical solution, it is further optimized and expanded, and can be combined with the above optional implementation methods. As shown in FIG2, the cluster repair method may include the following steps:
  • S201 Monitor the operating status of each node in the cluster through a monitoring function module; wherein the monitoring function module is located on each node of the cluster and is used to monitor the operating status of the node and other nodes in the cluster, analyze and repair faults, and back up the database.
  • the operating status of each node in the cluster can be monitored by the monitoring function module; wherein the monitoring function module is located on each node of the cluster, and is used to monitor the operating status of the node and other nodes in the cluster, perform fault analysis and repair, and back up the database.
  • the monitoring function module in the embodiment of the present application can have the following functions: 1) monitor the operating status of the node database and the node status; 2) monitor the operating status of other node databases in the cluster and the node status; 3) use custom rules to specify a node to be responsible for the repair of the faulty node; 4) regularly back up the data information of this node, and be responsible for the full backup of the newly added nodes and the formation of a new cluster.
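  • read as an interface, the four functions above suggest a per-node component like the skeleton below; the method bodies are left as assumptions, and only the custom largest-IP rule is spelled out.

```python
class MonitoringModule:
    """One instance runs on every node of the cluster (cf. modules A/B/C in the FIG. 3 example)."""

    def __init__(self, self_node, peers):
        self.self_node = self_node  # e.g. {'name': 'node-a', 'ip': '192.168.1.11'}
        self.peers = peers          # descriptors of the other cluster members

    def check_local(self):
        """1) Monitor this node's database operating status and node status."""
        raise NotImplementedError

    def check_peers(self):
        """2) Monitor the database operating status and node status of the other members."""
        raise NotImplementedError

    def designate_repair_node(self, normal_nodes):
        """3) Custom rule: pick the node responsible for repairing the faulty node,
        here the normal node with the largest IP address."""
        return max(normal_nodes, key=lambda n: tuple(int(x) for x in n["ip"].split(".")))

    def periodic_backup(self):
        """4) Regularly back up this node's data; the backup later feeds the full
        synchronization of newly added nodes and the formation of a new cluster."""
        raise NotImplementedError
```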
  • the present application realizes automatic cluster reorganization through the monitoring function module.
  • the monitoring function module on each node monitors its own operating status and the operating status of other node servers in the cluster.
  • when the cluster is found to be unable to operate normally, a custom mechanism is started, which mainly designates one of the fault-free nodes to complete the automatic cluster reorganization.
  • in this step, if the cluster cannot operate normally, the faulty node is deleted from the cluster. If the cluster has only one normal node (fault-free node), the monitoring function module on that normal node can be used to modify the configuration of the cluster and convert the cluster's operating mode to stand-alone mode. Specifically, if there is only one normal node left in the cluster, the cluster stops working and the cluster database stops updating; when the monitoring function module of the remaining normal node detects this situation, it modifies the relevant configuration and automatically converts the cluster mode to stand-alone mode, so that normal reading and writing of the new cluster database can be achieved through this normal node.
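  • for the single-surviving-node case, the switch to stand-alone mode amounts to a configuration rewrite, sketched below; the JSON layout and key names are invented for illustration (a real MySQL/Galera-style cluster would use its own settings), so this is only a picture of the idea.

```python
import json

def to_standalone(config_path, self_address):
    """Rewrite the cluster configuration so the last healthy node serves alone: membership
    shrinks to this node and cluster mode is switched off, after which reads and writes
    of the new cluster database go through this node until the cluster is rebuilt."""
    with open(config_path) as f:
        cfg = json.load(f)
    cfg["members"] = [self_address]  # illustrative keys, not taken from the application
    cfg["mode"] = "standalone"
    with open(config_path, "w") as f:
        json.dump(cfg, f, indent=2)
```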
  • each node has a daemon process that can detect the status of the node where it is located, and can also detect the status of other nodes in the cluster.
  • a cluster includes: node A, node B, and node C; suppose that when node C fails, the daemon process of node A and the daemon process of node B can each detect the failure of node C. At this time, the daemon processes of node A and node B can use a custom rule to select a normal node to delete node C from the cluster.
  • if the cluster has more than one normal node, a node is designated as the reorganization execution node through a custom rule.
  • the custom rule in the embodiment of the present application may be: take the normal node with the largest or smallest IP address in the current cluster as the reorganization execution node.
  • the embodiment of the present application adopts a custom IP election mechanism, which defines the selection of the node with the largest IP address among all the current normal nodes in the cluster as the cluster reorganization executor.
  • for example, suppose the current cluster contains three normal nodes A, B and C, and node C has the largest IP address; then the monitoring process on node C is designated to trigger the cluster reorganization task, and node C is the cluster reorganization execution node.
  • node C completes the cluster reorganization, including designating the master node of the reorganized cluster, and the master node completes the startup of each node of the cluster to achieve normal operation of the reorganized cluster.
  • the startup in the embodiment of the present application refers to the startup of the cluster service, not the startup of the hardware device, which can be understood here as the startup of the software.
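  • the largest-IP rule itself is a one-liner; the sketch below compares addresses numerically (so that, for example, 192.168.1.13 beats 192.168.1.9), which is one reasonable reading of "largest IP address".

```python
import ipaddress

def elect_executor(normal_nodes):
    """normal_nodes maps node name -> IPv4 address string of a currently normal node.
    Returns the name of the reorganization (or add) execution node."""
    return max(normal_nodes, key=lambda name: ipaddress.ip_address(normal_nodes[name]))

# With A, B and C all healthy, C has the largest address and triggers the reorganization task.
print(elect_executor({"A": "192.168.1.11", "B": "192.168.1.12", "C": "192.168.1.13"}))  # -> C
```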
  • a normal node in the cluster is set as the master node through the reorganization execution node, the cluster is reorganized through the master node, and the faulty node is repaired.
  • the designated reorganization execution node can trigger the monitoring function module at the reorganization execution node to perform cluster reorganization operations, obtain information of other normal nodes, and determine the node with the latest database information through the information of other normal nodes; set the node with the latest database information as the master node, and complete the reorganization cluster information configuration and data synchronization through the master node.
  • suppose the current cluster contains three normal nodes A, B, and C, among which node C has the largest IP address.
  • the monitoring process on node C triggers the cluster reorganization task.
  • Node C is the cluster reorganization execution node.
  • C completes the cluster reorganization, including specifying the master node of the reorganized cluster.
  • the master node starts each node in the cluster to achieve normal operation of the reorganized cluster.
  • specifically, reorganization execution node C connects to node A and node B, sends the relevant commands to obtain their database sequence numbers together with its own, and uses these numbers to determine which node currently holds the latest database data. Suppose node A's data is the latest: the monitoring function module on node C sends the relevant instruction to node A, sets node A as the master node, modifies the relevant configuration, and starts node A first; it then sends the relevant instructions to node B and node C to complete their startup, or the monitoring function module adds the other normal nodes to the cluster one after another according to this rule to complete the cluster reorganization.
  • the startup in the embodiment of the present application refers to the startup of the cluster service, not the startup of the hardware device, which can be understood here as the startup of the software.
  • when a node fails, the monitoring service module located on a normal node can delete the faulty node in time and repair it; the repair is achieved by analyzing the faulty node and taking the corresponding repair measures.
  • typical faults of the faulty node include a MySQL crash, a deadlocked process, a network abnormality, or a node that recovers after a power failure but cannot start automatically.
  • in this case, if the cluster has only one normal node, the monitoring service module on that node is set by default to complete the removal and repair of the faulty node; if there is more than one normal node in the cluster at this time, the node designated by the custom rule can serve as the fault repair execution node to repair the faulty node.
  • usually, the designated fault repair execution node is the reorganization execution node of S203.
  • the causes requiring repair in the embodiment of the present application may include, but are not limited to: service damage, file loss, service crash, error codes, etc.
  • the specific repair method can be to restore the failed node through the original file.
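  • one plausible way to "handle different fault situations" is a dispatch table over the fault categories listed above; the repair actions themselves (restarting the service, restoring from the original files) are placeholders for whatever the repair step actually does.

```python
def restart_mysql(node):
    raise NotImplementedError  # placeholder: restart the crashed database service

def kill_and_restart(node):
    raise NotImplementedError  # placeholder: clear the deadlocked process and restart it

def reset_network(node):
    raise NotImplementedError  # placeholder: recover from the network abnormality

def restore_from_original_files(node):
    raise NotImplementedError  # placeholder: restore the node from its original files

def repair_faulty_node(node, fault):
    """Map the fault categories named in the application to illustrative repair actions."""
    actions = {
        "mysql_crash": restart_mysql,
        "process_deadlock": kill_and_restart,
        "network_abnormal": reset_network,
        "no_autostart_after_power_loss": restore_from_original_files,
    }
    handler = actions.get(fault)
    if handler is None:
        raise ValueError(f"unknown fault type: {fault}")
    return handler(node)
```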
  • after the faulty node is repaired, or when the cluster needs to add a new node to increase its processing capacity: in the existing technology, when a node is added, the add execution node in the cluster is locked and its database can only be read, not written; it can be unlocked only after data synchronization with the repaired node or another new node is completed, which affects the use of the add execution node and also causes certain data synchronization problems. After the repair is completed, the normal node in the cluster designated in the above steps is used to add the repaired node to the reorganized cluster.
  • if the cluster has only one normal node, that node can be designated as the add execution node and the repaired node is added to the cluster; if the cluster has more than one normal node, a node can be designated as the add execution node through a custom rule, and the repaired node is added to the cluster. If the custom rule adopted in this operation is the same as that adopted in the aforementioned step S203, the node designated here is the same node as the reorganization execution node determined in S203; otherwise they differ.
  • further, when the repaired node is added to the cluster, a node is selected as the add execution node according to the custom rule, and the configuration files of each node in the cluster are modified through the add execution node; the repaired node is started based on the modified configuration files of each node in the cluster, and the database information regularly backed up by the monitoring function module on the add execution node is sent to the repaired node to complete database synchronization.
  • the custom rule in the embodiment of the present application is: take the node with the largest or smallest IP address in the cluster as the reorganization execution node.
  • FIG3 is a third flow chart of the cluster repair method provided in the embodiment of the present application. Based on the above technical solution, it is further optimized and expanded, and can be combined with the above optional implementation methods. As shown in FIG3, the cluster repair method may include the following steps:
  • S301 Monitor the operating status of each node in the cluster through a monitoring function module; wherein the monitoring function module is located on each node of the cluster and is used to monitor the operating status of the node and other nodes in the cluster, analyze and repair faults, and back up the database.
  • the operating status of each node in the cluster is monitored through the monitoring function module; the monitoring function module is located on each node of the cluster and is used to monitor the operating status of the node and other nodes in the cluster, analyze and repair faults, and back up the database. For example, if there are three nodes in the cluster, namely node A, node B, and node C, then a monitoring function module A can be set on node A, a monitoring function module B on node B, and a monitoring function module C on node C; wherein,
  • monitoring function module A is used for monitoring the operating status of node A, node B and node C, fault analysis and repair, and database backup
  • monitoring function module B is used for monitoring the operating status of node B, node A and node C, fault analysis and repair, and database backup
  • monitoring function module C is used for monitoring the operating status of node C, node A and node B, fault analysis and repair, and database backup.
  • each node has a daemon process, which can detect the status of the node where it is located, and can also detect the status of other nodes in the cluster.
  • a cluster includes: node A, node B and node C; assuming that when node C fails, the daemon process of node A and the daemon process of node B can detect that node C fails.
  • the daemon process of node A and the daemon process of node B can delete node C from the cluster respectively, or use custom rules to specify one of the nodes to complete the deletion of the faulty node C.
  • if the cluster has more than one normal node, a node is designated as the reorganization execution node through a custom rule.
  • in this step, if the cluster has more than one normal node, a node can be designated as the reorganization execution node through a custom rule. For example, suppose there are three nodes in the cluster, namely node A, node B, and node C; suppose node C fails, then node C can be deleted from the cluster; and suppose node A is the node with the largest IP address, then node A can be used as the reorganization execution node.
  • S304 trigger the monitoring function module at the reorganization execution node to perform cluster reorganization operation, obtain information of other normal nodes, determine the node with the latest database information through the information of other normal nodes; set the node with the latest database information as the master node, and complete the reorganization cluster information configuration and data synchronization through the master node.
  • the faulty node is repaired through the reorganization execution node selected in S303.
  • the reorganization execution node can trigger the monitoring function module at the reorganization execution node to perform the cluster reorganization operation, obtain the information of the other normal nodes, and determine from that information the node with the latest database information; the node with the latest database information is set as the master node, and the reorganized cluster's information configuration and data synchronization are completed through the master node.
  • each node has a daemon process, which can detect the status of the node where it is located, and can also detect the status of other nodes in the cluster.
  • a cluster includes: node A, node B and node C; assuming that when node C fails, the daemon process of node A and the daemon process of node B can each detect that node C has failed. At this time, the daemon processes of node A and node B can each delete node C from the cluster, with a custom rule, such as the largest-IP or smallest-IP rule, designating one of the nodes to perform the deletion of node C.
  • suppose node A is the node with the largest IP address; then node A can be used as the reorganization execution node. At this time, the monitoring function module of node A can be triggered to perform the cluster reorganization operation, obtain the information of node B, and determine from it the node with the latest database information; the node with the latest database information is set as the master node, and the reorganized cluster's information configuration and data synchronization are completed through the master node. If there are multiple nodes with the latest database information, the custom rule is used again to determine one of those nodes as the master node to complete the reorganized cluster's information configuration and data synchronization.
  • if the cluster has only one normal node, you can designate that node as the add execution node and add the repaired node to the cluster; if the cluster has more than one normal node, you can designate a node as the add execution node through a custom rule and add the repaired node to the cluster.
  • FIG 4 is a schematic diagram of the cluster system architecture provided in an embodiment of the present application.
  • the system may include: a management unit, a storage unit, a scheduling unit and a computing unit; wherein the administrator sends an http request to the management unit through a browser to implement management operations on the cluster.
  • the management unit may include: system management services, business management services, system monitoring services and upgrade services.
  • the storage unit may include N units, namely storage unit 1, storage unit 2, ..., storage unit N; wherein N is a natural number greater than 1; the dispatcher may send an http request to the scheduling unit to implement scheduling operations on the cluster.
  • the computing unit may include: signature and encryption services.
  • FIG5 is a schematic diagram of the structure of a cluster repair device provided in an embodiment of the present application.
  • the cluster repair device includes: a monitoring module 501, a deletion module 502, a reorganization module 503 and an addition module 504;
  • the monitoring module 501 is used to monitor the operating status of each node in the cluster; and monitor whether the cluster can operate normally according to the operating status of each node; the cluster includes one or more nodes;
  • the deletion module 502 is used to delete the faulty node from the cluster if the cluster cannot operate normally;
  • the reorganization module 503 is used to designate a normal node in the cluster to reorganize the cluster and repair the faulty node;
  • the adding module 504 is used to designate a normal node in the cluster to add the repaired node to the reorganized cluster.
  • the above cluster repair device can execute the method provided by any embodiment of the present application, and has the corresponding functional modules and beneficial effects of the execution method.
  • for technical details not described in detail in this embodiment, reference may be made to the cluster repair method provided by any embodiment of the present application.
  • FIG6 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present application.
  • FIG6 shows a block diagram of an exemplary electronic device suitable for implementing an embodiment of the present application.
  • the electronic device can be any node in a cluster.
  • the electronic device 12 shown in FIG6 is only an example and should not bring any limitation to the functions and scope of use of the embodiments of the present application.
  • the electronic device 12 is in the form of a general purpose computing device.
  • the components of the electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting different system components (including the system memory 28 and the processing unit 16).
  • Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor or a local bus using any of a variety of bus architectures.
  • these architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MAC) bus, an Enhanced ISA bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.
  • the electronic device 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by the electronic device 12, including volatile and non-volatile media, removable and non-removable media.
  • the system memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32.
  • the electronic device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media.
  • the storage system 34 may be used to read and write non-removable, non-volatile magnetic media (not shown in FIG. 6 , commonly referred to as a “hard drive”).
  • although not shown in FIG6, a disk drive for reading and writing a removable non-volatile magnetic disk (such as a "floppy disk"), and an optical disk drive for reading and writing a removable non-volatile optical disk (such as a CD-ROM, DVD-ROM or other optical media), may be provided.
  • each drive may be connected to the bus 18 via one or more data medium interfaces.
  • the memory 28 may include at least one program product having a set (e.g., at least one) of program modules that are configured to perform the functions of the various embodiments of the present application.
  • a program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in the memory 28, such program modules 42 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which or some combination may include an implementation of a network environment.
  • the program modules 42 generally perform the functions and/or methods of the embodiments described herein.
  • the electronic device 12 may also communicate with one or more external devices 14 (e.g., keyboards, pointing devices, displays 24, etc.), may also communicate with one or more devices that enable a user to interact with the electronic device 12, and/or communicate with any device that enables the electronic device 12 to communicate with one or more other computing devices (e.g., network cards, modems, etc.). Such communication may be performed via an input/output (I/O) interface 22.
  • the electronic device 12 may also communicate with one or more networks (e.g., local area networks (LANs), wide area networks (WANs), and/or public networks, such as the Internet) via a network adapter 20. As shown, the network adapter 20 communicates with other modules of the electronic device 12 via a bus 18.
  • the processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, such as implementing the cluster repair method provided in the embodiment of the present application.
  • An embodiment of the present application provides a computer storage medium.
  • the computer-readable storage medium of the embodiments of the present application may adopt any combination of one or more computer-readable media.
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
  • the computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof.
  • more specific examples (a non-exhaustive list) of the computer readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • Computer-readable signal media may include data signals propagated in baseband or as part of a carrier wave, which carry computer-readable program code. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above. Computer-readable signal media may also be any computer-readable medium other than a computer-readable storage medium, which may send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for performing the operation of the present application can be written in one or more programming languages or a combination thereof, including object-oriented programming languages, such as Java, Smalltalk, C++, and conventional procedural programming languages, such as "C" language or similar programming languages.
  • the program code can be executed entirely on the user's computer, partially on the user's computer, as an independent software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server.
  • the remote computer can be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (e.g., using an Internet service provider to connect through the Internet).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Hardware Redundancy (AREA)

Abstract

A cluster repair method and device. The method includes: monitoring the operating status of each node in a cluster, and monitoring whether the cluster can operate normally according to the operating status of each node (101); if the cluster cannot operate normally, deleting the faulty node from the cluster (102); designating a normal node in the cluster to reorganize the cluster and repair the faulty node (103); and designating a normal node in the cluster to add the repaired node to the reorganized cluster (104). The cluster can be repaired automatically, so that the impact on the business is avoided to the greatest extent.

Description

Cluster repair method and device
Technical Field
The embodiments of the present application relate to the field of cluster technology, and in particular to a cluster repair method and device.
Background Art
In a traditional cluster mode, at least three servers are required for normal operation, which offers low flexibility and reliability. In the prior art, when a cluster cannot operate normally because of a fault during operation, especially when multiple nodes in the cluster fail or lose power, the cluster stops working, which affects business operation to a certain extent and therefore requires manual repair. On the other hand, when a new node needs to be added to the cluster because of business requirements, a full database backup for the newly added node and the related configuration must be completed first, and the database of the node from which the full backup is taken must be locked; as a result, the new node cannot be added to the cluster automatically, and the normal operation of the existing cluster is also hindered.
Summary of the Invention
The present application provides a cluster repair method and device, which can automatically repair the cluster and avoid the impact on the business to the greatest extent.
In a first aspect, an embodiment of the present application provides a cluster repair method, the method comprising:
monitoring the operating status of each node in a cluster, and monitoring whether the cluster can operate normally according to the operating status of each node;
if the cluster cannot operate normally, deleting the faulty node from the cluster;
designating a normal node in the cluster to reorganize the cluster and repair the faulty node; and
designating a normal node in the cluster to add the repaired node to the reorganized cluster.
In a second aspect, an embodiment of the present application further provides a cluster repair device, the device comprising: a monitoring module, a deletion module, a reorganization module and an adding module; wherein,
the monitoring module is configured to monitor the operating status of each node in a cluster, and to monitor whether the cluster can operate normally according to the operating status of each node, the cluster including one or more nodes;
the deletion module is configured to delete the faulty node from the cluster if the cluster cannot operate normally;
the reorganization module is configured to designate a normal node in the cluster to reorganize the cluster and repair the faulty node; and
the adding module is configured to designate a normal node in the cluster to add the repaired node to the reorganized cluster.
The embodiments of the present application propose a cluster repair method and device: the operating status of each node in the cluster is monitored, and whether the cluster can operate normally is monitored according to the operating status of each node; if the cluster cannot operate normally, the faulty node is deleted from the cluster; a normal node in the cluster is then designated to reorganize the cluster and repair the faulty node; and a normal node in the cluster is designated to add the repaired node to the reorganized cluster. That is, in the technical solution of the present application, the operating status of each node in the cluster can be monitored in real time; when the cluster cannot operate normally, a node is first designated to complete the cluster reorganization so that the cluster does not stop working, after which the faulty node is analyzed and, once repaired, added to the reorganized cluster. In the prior art, when the cluster cannot operate normally because of a fault during operation, especially when multiple nodes in the cluster fail or lose power, the cluster stops working. Therefore, compared with the prior art, the cluster repair method and device proposed in the embodiments of the present application can repair the cluster automatically and avoid the impact on the business to the greatest extent; moreover, the technical solution of the embodiments of the present application is simple and convenient to implement, easy to popularize, and has a wide range of application.
Brief Description of the Drawings
FIG. 1 is a first schematic flow chart of the cluster repair method provided in an embodiment of the present application;
FIG. 2 is a second schematic flow chart of the cluster repair method provided in an embodiment of the present application;
FIG. 3 is a third schematic flow chart of the cluster repair method provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of the cluster system architecture provided in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of the cluster repair device provided in an embodiment of the present application;
FIG. 6 is a schematic structural diagram of the electronic device provided in an embodiment of the present application.
Detailed Description of the Embodiments
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present application, not to limit it. It should also be noted that, for ease of description, only the parts related to the present application, rather than the entire structure, are shown in the drawings.
Embodiment 1
FIG. 1 is a first schematic flow chart of the cluster repair method provided in an embodiment of the present application. The method may be performed by a cluster repair device or an electronic device, which may be implemented by software and/or hardware and may be integrated into any smart device with a network communication function. As shown in FIG. 1, the cluster repair method may include the following steps:
S101: Monitor the operating status of each node in the cluster, and monitor whether the cluster can operate normally according to the operating status of each node.
In this step, each node in the cluster can monitor its own operating status and the operating status of the other nodes in the cluster, and whether the cluster can operate normally is monitored according to the operating status of each node. Specifically, the operating status of each node in the cluster is monitored through a monitoring function module, which is located on each node of the cluster and is used to monitor the operating status of that node and of the other nodes in the cluster, to analyze and repair faults, and to back up the database. It should be noted that the monitoring function module is not limited to these functions in practical applications; it can also implement functions such as cluster reorganization and node addition.
S102: If the cluster cannot operate normally, delete the faulty node from the cluster.
In this step, if the cluster cannot operate normally, the faulty node is deleted from the cluster. Specifically, each node has a daemon process that can detect the status of the node on which it runs as well as the status of the other nodes in the cluster. For example, a cluster includes node A, node B and node C; suppose that when node C fails, the daemon process of node A and the daemon process of node B can each detect that node C has failed. At this time, the daemon processes of node A and node B can delete node C from the cluster; alternatively, a custom rule can be used to designate one of the nodes to complete the deletion of node C.
S103: Designate a normal node in the cluster to reorganize the cluster and repair the faulty node.
If the cluster has only one normal node, the cluster configuration is modified through that node and the operating mode of the cluster is first converted to stand-alone mode, so as to avoid the impact of the cluster stopping because of the fault.
If the cluster has more than one normal node, the nodes in the cluster can designate one node as the reorganization execution node through a custom rule; the reorganization execution node then sets one node in the cluster as the master node, and the cluster is reorganized through the master node. Specifically, the selected reorganization execution node can trigger the monitoring function module on the reorganization execution node to perform the cluster reorganization operation, obtain the information of the other normal nodes in the current cluster, and determine from that information the node with the latest database information; the node with the latest database information is set as the master node, and the reorganized cluster's information configuration and data synchronization are completed through the master node, which specifically includes the master node modifying the cluster configuration information and synchronizing its own database information to the other nodes in the reorganized cluster.
At the same time, the designated reorganization execution node needs to complete the analysis and repair of the faulty node's problems and handle the different fault situations accordingly.
S104: Designate a normal node in the cluster to add the repaired node to the reorganized cluster.
In this step, the designated normal node in the cluster is used to add the repaired node to the reorganized cluster. Specifically, as in step S103, if the cluster has only one normal node, that node can be designated as the add execution node and the repaired node is added to the cluster; if the cluster has more than one normal node, one of the normal nodes is designated as the add execution node through a custom rule and the repaired node is added to the cluster. Usually, the add execution node in this step is the reorganization execution node determined in S103.
The add execution node adds the repaired node to the reorganized cluster. The specific operation flow is as follows: a node is elected as the add execution node according to the IP election mechanism, or the reorganization execution node of step S103 is used directly as the add execution node, and the configuration files of each node in the cluster are modified through the add execution node; the repaired node is started based on the modified configuration files of each node in the cluster, and the database information regularly backed up by the monitoring function module on the add execution node is sent to the repaired node to complete the database synchronization. This operation does not affect the normal operation of the original cluster nodes, and no locking operation is required. In addition, the add execution node can also perform synchronization verification and integrity verification to ensure the integrity and consistency of the data. The repaired node can be added to the reorganized cluster in at least the following two scenarios: 1) one or more nodes in the cluster fail, and after the failed node is repaired, the repaired node is added to the cluster; 2) the original cluster needs to be expanded, and a new node needs to be added to the original cluster. In the embodiment of the present application, a normal node can be elected for full data synchronization according to the Paxos algorithm combined with a custom rule, such as an IP operation rule. The problem solved by the Paxos algorithm is how a distributed system can reach a consensus on a certain value or resolution. A typical scenario is that, in a distributed database system, if the initial state of each node is consistent and each node executes the same sequence of operations, then they can finally reach a consistent state. To ensure that every node executes the same command sequence, a "consistency algorithm" needs to be executed on each instruction to ensure that the instructions seen by each node are consistent. A general consistency algorithm can be applied in many scenarios and is an important problem in distributed computing. A consistency algorithm can be implemented through shared memory or message passing; the Paxos algorithm uses the latter. Scenarios where the Paxos algorithm is applicable include: multiple processes/threads on one machine reaching data consistency; multiple clients concurrently reading and writing data in a distributed file system or distributed database; and the consistency of multiple replicas in distributed storage when responding to read and write requests.
Embodiment 2
FIG. 2 is a second schematic flow chart of the cluster repair method provided in an embodiment of the present application. It is further optimized and expanded on the basis of the above technical solution, and can be combined with each of the above optional implementations. As shown in FIG. 2, the cluster repair method may include the following steps:
S201: Monitor the operating status of each node in the cluster through a monitoring function module; the monitoring function module is located on each node of the cluster and is used to monitor the operating status of that node and of the other nodes in the cluster, to analyze and repair faults, and to back up the database.
In this step, the operating status of each node in the cluster can be monitored through the monitoring function module; the monitoring function module is located on each node of the cluster and is used to monitor the operating status of that node and of the other nodes in the cluster, to analyze and repair faults, and to back up the database. Specifically, the monitoring function module in the embodiment of the present application can have the following functions: 1) monitor the operating status of this node's database and the node status; 2) monitor the operating status of the databases and the node status of the other nodes in the cluster; 3) use a custom rule to designate a node responsible for repairing the faulty node; 4) regularly back up this node's data, and be responsible for the full backup for newly added nodes and for forming a new cluster. The present application realizes automatic cluster reorganization through the monitoring function module. The monitoring function module on each node monitors its own operating status and the operating status of the other node servers in the cluster; when it finds that the cluster cannot operate normally, a custom mechanism is started, which mainly designates one of the fault-free nodes to complete the automatic cluster reorganization.
S202: If the cluster cannot operate normally, delete the faulty node from the cluster.
In this step, if the cluster cannot operate normally, the faulty node is deleted from the cluster. If the cluster has only one normal (fault-free) node, the monitoring function module on that normal node can be used to modify the cluster configuration and convert the operating mode of the cluster to stand-alone mode. Specifically, if only one normal node remains in the cluster, the cluster stops working and the cluster database stops updating; when the monitoring function module of the remaining normal node detects this situation, it modifies the relevant configuration and automatically converts the cluster mode to stand-alone mode, so that normal reading and writing of the new cluster database are achieved through this healthy node. Specifically, each node has a daemon process that can detect the status of the node on which it runs as well as the status of the other nodes in the cluster. For example, a cluster includes node A, node B and node C; suppose that when node C fails, the daemon process of node A and the daemon process of node B can each detect that node C has failed. At this time, the daemon processes of node A and node B can use a custom rule to select a normal node to delete node C from the cluster.
S203: If the cluster has more than one normal node, designate a node as the reorganization execution node through a custom rule.
In this step, if the cluster has more than one normal node, a node is designated as the reorganization execution node through a custom rule. The custom rule in the embodiment of the present application may be: take the normal node with the largest or smallest IP address in the current cluster as the reorganization execution node. Specifically, the embodiment of the present application adopts a custom IP election mechanism, defined as selecting, among all the currently normal nodes in the cluster, the node with the largest IP address as the executor of the cluster reorganization.
Suppose the current cluster contains three normal nodes A, B and C, and node C has the largest IP address; then the monitoring process on node C is designated to trigger the cluster reorganization task, and node C is the cluster reorganization execution node. Node C completes the cluster reorganization, including designating the master node of the reorganized cluster, and the master node completes the startup of each node of the cluster so that the reorganized cluster operates normally. It is worth noting that the startup in the embodiment of the present application refers to the startup of the cluster service rather than the startup of the hardware device, and can be understood here as the startup of the software.
S204: Set one node in the cluster as the master node through the reorganization execution node, reorganize the cluster through the master node, and repair the faulty node.
In this step, a normal node in the cluster is set as the master node through the reorganization execution node, the cluster is reorganized through the master node, and the faulty node is repaired. Specifically, the designated reorganization execution node can trigger the monitoring function module on the reorganization execution node to perform the cluster reorganization operation, obtain the information of the other normal nodes, and determine from that information the node with the latest database information; the node with the latest database information is set as the master node, and the reorganized cluster's information configuration and data synchronization are completed through the master node.
Suppose the current cluster contains three normal nodes A, B and C, and node C has the largest IP address; then the monitoring process on node C triggers the cluster reorganization task, node C is the cluster reorganization execution node, and C completes the cluster reorganization, including designating the master node of the reorganized cluster, with the master node completing the startup of each node of the cluster so that the reorganized cluster operates normally.
Reorganization execution node C connects to node A and node B, sends the relevant commands to obtain their sequence numbers together with its own, and uses these sequence numbers to determine which node's database data is currently the latest. Suppose node A's data is the latest: the monitoring function module on node C sends the relevant instruction to node A, sets node A as the master node, modifies the relevant configuration, and starts node A first; it then sends the relevant instructions to node B and node C to complete their startup, or the monitoring function module adds the other normal nodes to the cluster one after another according to this rule to complete the cluster reorganization. The startup in the embodiment of the present application refers to the startup of the cluster service rather than the startup of the hardware device, and can be understood here as the startup of the software.
In special cases, if more than one node has the latest database information, the custom rule is used again to elect one of those nodes as the master node, and the reorganized cluster's information configuration and data synchronization are completed through that master node. When a node fails, the monitoring service module located on a normal node can delete the faulty node in time and repair it, the repair being achieved by analyzing the faulty node and taking the corresponding repair measures. Typical faults of the faulty node include a MySQL crash, a deadlocked process, a network abnormality, or a node that recovers after a power failure but cannot start automatically. In this case, if the cluster has only one normal node, the monitoring service module on that node is set by default to complete the removal and repair of the faulty node; if the cluster has more than one normal node at this time, the node designated by the custom rule can serve as the fault repair execution node to repair the faulty node, and usually the designated fault repair execution node is the reorganization execution node of S203. The causes requiring repair in the embodiment of the present application may include, but are not limited to, service damage, file loss, service crash, error codes, and the like. A specific repair method may be to restore the failed node from the original files.
S205: Designate a normal node in the cluster to add the repaired node to the reorganized cluster.
After the faulty node is repaired, or when the cluster needs to add a new node to increase its processing capacity: in the prior art, when a node is added, the add execution node in the cluster is locked and its database can only be read, not written; it can be unlocked only after data synchronization with the repaired node or another new node is completed, which affects the use of the add execution node and also causes certain data synchronization problems. After the repair is completed, the normal node in the cluster designated in the above steps is used to add the repaired node to the reorganized cluster.
If the cluster has only one normal node, that node can be designated as the add execution node and the repaired node is added to the cluster; if the cluster has more than one normal node, a node can be designated as the add execution node through a custom rule, and the repaired node is added to the cluster. If the custom rule adopted here is the same as that adopted in step S203, the node designated in this operation is the same node as the reorganization execution node determined in S203; otherwise they differ. Further, when the repaired node is added to the cluster, a node is elected as the add execution node according to the custom rule, and the configuration files of each node in the cluster are modified through the add execution node; the repaired node is started based on the modified configuration files of each node in the cluster, and the database information regularly backed up by the monitoring function module on the add execution node is sent to the repaired node to complete the database synchronization. The custom rule in the embodiment of the present application is: take the node with the largest or smallest IP address in the cluster as the reorganization execution node.
Embodiment 3
FIG. 3 is a third schematic flow chart of the cluster repair method provided in an embodiment of the present application. It is further optimized and expanded on the basis of the above technical solution, and can be combined with each of the above optional implementations. As shown in FIG. 3, the cluster repair method may include the following steps:
S301: Monitor the operating status of each node in the cluster through a monitoring function module; the monitoring function module is located on each node of the cluster and is used to monitor the operating status of that node and of the other nodes in the cluster, to analyze and repair faults, and to back up the database.
In this step, the operating status of each node in the cluster is monitored through the monitoring function module; the monitoring function module is located on each node of the cluster and is used to monitor the operating status of that node and of the other nodes in the cluster, to analyze and repair faults, and to back up the database. For example, suppose the cluster has three nodes, namely node A, node B and node C; then a monitoring function module A can be set on node A, a monitoring function module B on node B, and a monitoring function module C on node C, where monitoring function module A is used to monitor the operating status of node A as well as nodes B and C, to analyze and repair faults, and to back up the database; monitoring function module B is used to monitor the operating status of node B as well as nodes A and C, to analyze and repair faults, and to back up the database; and monitoring function module C is used to monitor the operating status of node C as well as nodes A and B, to analyze and repair faults, and to back up the database.
S302: If the cluster cannot operate normally, delete the faulty node from the cluster.
In this step, if the cluster cannot operate normally, the faulty node is deleted from the cluster. Specifically, each node has a daemon process that can detect the status of the node on which it runs as well as the status of the other nodes in the cluster. For example, a cluster includes node A, node B and node C; suppose that when node C fails, the daemon process of node A and the daemon process of node B can each detect that node C has failed. At this time, the daemon processes of node A and node B can each delete node C from the cluster, or a custom rule can be used to designate one of the nodes to complete the deletion of the faulty node C.
S303: If the cluster has more than one normal node, designate a node as the reorganization execution node through a custom rule.
In this step, if the cluster has more than one normal node, a node can be designated as the reorganization execution node through a custom rule. For example, suppose the cluster has three nodes, namely node A, node B and node C; suppose node C fails, so node C can be deleted from the cluster; and further suppose that node A is the node with the largest IP address, so node A can serve as the reorganization execution node.
S304: Trigger the monitoring function module on the reorganization execution node to perform the cluster reorganization operation, obtain the information of the other normal nodes, and determine from that information the node with the latest database information; set the node with the latest database information as the master node, and complete the reorganized cluster's information configuration and data synchronization through the master node. The reorganization execution node selected in S303 completes the repair of the faulty node.
In this step, the reorganization execution node can trigger the monitoring function module on the reorganization execution node to perform the cluster reorganization operation, obtain the information of the other normal nodes, and determine from that information the node with the latest database information; the node with the latest database information is set as the master node, and the reorganized cluster's information configuration and data synchronization are completed through the master node. For example, suppose the cluster has three nodes, namely node A, node B and node C; suppose node C fails, then node A and node B can delete node C from the cluster. Specifically, each node has a daemon process that can detect the status of the node on which it runs as well as the status of the other nodes in the cluster; when node C fails, the daemon process of node A and the daemon process of node B can each detect the failure and delete node C from the cluster, with a custom rule, such as the largest-IP or smallest-IP rule, designating one of the nodes to perform the deletion of node C. Now suppose that node A is the node with the largest IP address; then node A can serve as the reorganization execution node. At this time, the monitoring function module of node A can be triggered to perform the cluster reorganization operation, obtain the information of node B, and determine from it the node with the latest database information; the node with the latest database information is set as the master node, and the reorganized cluster's information configuration and data synchronization are completed through the master node. If more than one node has the latest database information, the custom rule is used again to determine one of those nodes as the master node to complete the reorganized cluster's information configuration and data synchronization.
S305: If the cluster has only one normal node, designate that node as the add execution node and add the repaired node to the cluster; if the cluster has more than one normal node, designate a node as the add execution node through a custom rule and add the repaired node to the cluster.
In this step, if the cluster has only one normal node, that node can be designated as the add execution node and the repaired node is added to the cluster; if the cluster has more than one normal node, a node can be designated as the add execution node through a custom rule, and the repaired node is added to the cluster. Specifically, when the repaired node is added to the cluster, a node can first be elected as the add execution node according to the custom rule, and the configuration files of each node in the cluster are modified through the add execution node; the repaired node is started based on the modified configuration files of each node in the cluster, and the database information regularly backed up by the monitoring function module on the add execution node is sent to the repaired node to complete the database synchronization.
FIG. 4 is a schematic diagram of the cluster system architecture provided in an embodiment of the present application. As shown in FIG. 4, the system may include a management unit, storage units, a scheduling unit and a computing unit. The administrator sends an http request to the management unit through a browser to perform management operations on the cluster. The management unit may include: a system management service, a business management service, a system monitoring service and an upgrade service. There may be N storage units, namely storage unit 1, storage unit 2, ..., storage unit N, where N is a natural number greater than 1. The dispatcher may send an http request to the scheduling unit to perform scheduling operations on the cluster. The computing unit may include signature and encryption services.
Embodiment 4
FIG. 5 is a schematic structural diagram of the cluster repair device provided in an embodiment of the present application. As shown in FIG. 5, the cluster repair device includes: a monitoring module 501, a deletion module 502, a reorganization module 503 and an adding module 504; wherein,
the monitoring module 501 is configured to monitor the operating status of each node in the cluster, and to monitor whether the cluster can operate normally according to the operating status of each node, the cluster including one or more nodes;
the deletion module 502 is configured to delete the faulty node from the cluster if the cluster cannot operate normally;
the reorganization module 503 is configured to designate a normal node in the cluster to reorganize the cluster and repair the faulty node;
the adding module 504 is configured to designate a normal node in the cluster to add the repaired node to the reorganized cluster.
The above cluster repair device can execute the method provided by any embodiment of the present application, and has the functional modules and beneficial effects corresponding to the executed method. For technical details not described in detail in this embodiment, reference may be made to the cluster repair method provided by any embodiment of the present application.
Embodiment 5
FIG. 6 is a schematic structural diagram of the electronic device provided in an embodiment of the present application. FIG. 6 shows a block diagram of an exemplary electronic device suitable for implementing an embodiment of the present application. The electronic device may be any node in the cluster. The electronic device 12 shown in FIG. 6 is only an example and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
As shown in FIG. 6, the electronic device 12 takes the form of a general-purpose computing device. The components of the electronic device 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the different system components (including the system memory 28 and the processing unit 16).
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MAC) bus, an Enhanced ISA bus, a Video Electronics Standards Association (VESA) local bus and a Peripheral Component Interconnect (PCI) bus.
The electronic device 12 typically includes a variety of computer-system-readable media. These media may be any available media that can be accessed by the electronic device 12, including volatile and non-volatile media, and removable and non-removable media.
The system memory 28 may include computer-system-readable media in the form of volatile memory, such as a random access memory (RAM) 30 and/or a cache memory 32. The electronic device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34 may be used to read and write non-removable, non-volatile magnetic media (not shown in FIG. 6, commonly referred to as a "hard drive"). Although not shown in FIG. 6, a disk drive for reading and writing a removable non-volatile magnetic disk (such as a "floppy disk"), and an optical disk drive for reading and writing a removable non-volatile optical disk (such as a CD-ROM, DVD-ROM or other optical media), may be provided. In these cases, each drive may be connected to the bus 18 via one or more data-medium interfaces. The memory 28 may include at least one program product having a set (for example, at least one) of program modules configured to perform the functions of the embodiments of the present application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in the memory 28; such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules and program data, each of which, or some combination of which, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described in the present application.
The electronic device 12 may also communicate with one or more external devices 14 (such as a keyboard, a pointing device, a display 24, etc.), may also communicate with one or more devices that enable a user to interact with the electronic device 12, and/or communicate with any device (such as a network card, a modem, etc.) that enables the electronic device 12 to communicate with one or more other computing devices. Such communication may take place via an input/output (I/O) interface 22. In addition, the electronic device 12 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) via a network adapter 20. As shown in the figure, the network adapter 20 communicates with the other modules of the electronic device 12 via the bus 18. It should be understood that, although not shown in FIG. 6, other hardware and/or software modules may be used in conjunction with the electronic device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems and the like.
The processing unit 16 executes various functional applications and data processing by running the programs stored in the system memory 28, for example implementing the cluster repair method provided in the embodiments of the present application.
Embodiment 6
An embodiment of the present application provides a computer storage medium.
The computer-readable storage medium of the embodiments of the present application may adopt any combination of one or more computer-readable media. The computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. The computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination thereof. More specific examples (a non-exhaustive list) of the computer-readable storage medium include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by, or in combination with, an instruction execution system, apparatus or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate or transmit a program for use by, or in combination with, an instruction execution system, apparatus or device.
The program code contained on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the above.
Computer program code for performing the operations of the present application may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as an independent software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above are only preferred embodiments of the present application and the technical principles applied. Those skilled in the art will understand that the present application is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the scope of protection of the present application. Therefore, although the present application has been described in considerable detail through the above embodiments, it is not limited to the above embodiments and may include more other equivalent embodiments without departing from the concept of the present application, the scope of which is determined by the scope of the appended claims.

Claims (10)

  1. A cluster repair method, characterized in that the method comprises:
    monitoring the operating status of each node in a cluster, and monitoring whether the cluster can operate normally according to the operating status of each node;
    if the cluster cannot operate normally, deleting the faulty node from the cluster;
    designating a normal node in the cluster to reorganize the cluster and repair the faulty node; and
    designating a normal node in the cluster to add the repaired node to the reorganized cluster.
  2. The method according to claim 1, characterized in that the operating status of each node in the cluster is monitored through a monitoring function module, wherein the monitoring function module is located on each node of the cluster and is used to monitor the operating status of that node and of the other nodes in the cluster, to analyze and repair faults, and to back up the database.
  3. The method according to claim 1, characterized in that, if the cluster has only one normal node, the configuration of the cluster is modified and the operating mode of the cluster is converted to stand-alone mode.
  4. The method according to claim 1, characterized in that designating a normal node in the cluster to reorganize the cluster comprises:
    if the cluster has more than one normal node, designating a node as a reorganization execution node through a custom rule; and
    setting, through the reorganization execution node, a node in the cluster as a master node, and reorganizing the cluster through the master node.
  5. The method according to claim 4, characterized in that setting, through the reorganization execution node, a node in the cluster as a master node and reorganizing the cluster through the master node comprises:
    triggering the monitoring function module at the reorganization execution node to perform a cluster reorganization operation, obtaining information of the other normal nodes, and determining, from the information of the other normal nodes, the node with the latest database information; and setting the node with the latest database information as the master node, and completing the reorganized cluster's information configuration and data synchronization through the master node.
  6. The method according to claim 5, characterized in that the method further comprises:
    if more than one node has the latest database information, using the custom rule again to elect one of those nodes as the master node, and completing the reorganized cluster's information configuration and data synchronization through the master node.
  7. The method according to claim 1, characterized in that designating a normal node in the cluster to add the repaired node to the reorganized cluster comprises:
    if the cluster has only one normal node, designating that node as an add execution node, and adding the repaired node to the cluster;
    if the cluster has more than one normal node, designating a node as the add execution node through a custom rule, and adding the repaired node to the cluster.
  8. The method according to claim 7, characterized in that adding the repaired node to the cluster comprises:
    determining a node as the add execution node according to the custom rule, and modifying, through the add execution node, the configuration files of each node in the cluster; starting the repaired node based on the modified configuration files of each node in the cluster, and sending the regularly backed-up database information to the repaired node.
  9. The method according to claim 4, 6 or 7, characterized in that the custom rule is: taking the node with the largest or smallest IP address in the cluster as the reorganization execution node.
  10. A cluster repair device, characterized in that the device comprises: a monitoring module, a deletion module, a reorganization module and an adding module; wherein,
    the monitoring module is configured to monitor the operating status of each node in a cluster, and to monitor whether the cluster can operate normally according to the operating status of each node, the cluster comprising one or more nodes;
    the deletion module is configured to delete the faulty node from the cluster if the cluster cannot operate normally;
    the reorganization module is configured to designate a normal node in the cluster to reorganize the cluster and repair the faulty node; and
    the adding module is configured to designate a normal node in the cluster to add the repaired node to the reorganized cluster.
PCT/CN2023/130370 2022-12-21 2023-11-08 Cluster repair method and device WO2024131366A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211651715.4A CN115904822A (zh) 2022-12-21 2022-12-21 Cluster repair method and device
CN202211651715.4 2022-12-21

Publications (1)

Publication Number Publication Date
WO2024131366A1 true WO2024131366A1 (zh) 2024-06-27

Family

ID=86493371

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/130370 WO2024131366A1 (zh) 2022-12-21 2023-11-08 一种集群修复方法及装置

Country Status (2)

Country Link
CN (1) CN115904822A (zh)
WO (1) WO2024131366A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115904822A (zh) * 2022-12-21 2023-04-04 长春吉大正元信息技术股份有限公司 一种集群修复方法及装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040059805A1 (en) * 2002-09-23 2004-03-25 Darpan Dinker System and method for reforming a distributed data system cluster after temporary node failures or restarts
CN106933693A (zh) * 2017-03-15 2017-07-07 郑州云海信息技术有限公司 一种数据库集群节点故障自动修复方法及***
CN111124755A (zh) * 2019-12-06 2020-05-08 中国联合网络通信集团有限公司 集群节点的故障恢复方法、装置、电子设备及存储介质
CN112650624A (zh) * 2020-12-25 2021-04-13 浪潮(北京)电子信息产业有限公司 一种集群升级方法、装置、设备及计算机可读存储介质
CN112866408A (zh) * 2021-02-09 2021-05-28 山东英信计算机技术有限公司 一种集群中业务切换方法、装置、设备及存储介质
CN115904822A (zh) * 2022-12-21 2023-04-04 长春吉大正元信息技术股份有限公司 一种集群修复方法及装置

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100264896B1 (ko) * 1998-07-27 2000-09-01 윤종용 다중 클러스터 시스템의 클러스터 노드 고장 감지 장치 및 방법
US7953860B2 (en) * 2003-08-14 2011-05-31 Oracle International Corporation Fast reorganization of connections in response to an event in a clustered computing system
CN105915405A (zh) * 2016-03-29 2016-08-31 深圳市中博科创信息技术有限公司 一种大型集群节点性能监控***
CN111901422B (zh) * 2020-07-28 2022-11-11 浪潮电子信息产业股份有限公司 一种集群中节点的管理方法、***及装置
CN113326100B (zh) * 2021-06-29 2024-04-09 深信服科技股份有限公司 一种集群管理方法、装置、设备及计算机存储介质
CN114301802A (zh) * 2021-12-27 2022-04-08 北京吉大正元信息技术有限公司 密评检测方法、装置和电子设备
CN114816820A (zh) * 2022-04-26 2022-07-29 平安普惠企业管理有限公司 chproxy集群故障修复方法、装置、设备及存储介质
CN115499447A (zh) * 2022-09-15 2022-12-20 北京天融信网络安全技术有限公司 一种集群主节点确认方法、装置、电子设备及存储介质


Also Published As

Publication number Publication date
CN115904822A (zh) 2023-04-04

Similar Documents

Publication Publication Date Title
KR102268355B1 (ko) 클라우드 배치 기반구조 검증 엔진
US20190073258A1 (en) Predicting, diagnosing, and recovering from application failures based on resource access patterns
US8255653B2 (en) System and method for adding a storage device to a cluster as a shared resource
US8966318B1 (en) Method to validate availability of applications within a backup image
US9678682B2 (en) Backup storage of vital debug information
CN109325016B (zh) 数据迁移方法、装置、介质及电子设备
US9189338B2 (en) Disaster recovery failback
US7624309B2 (en) Automated client recovery and service ticketing
US11144405B2 (en) Optimizing database migration in high availability and disaster recovery computing environments
WO2024131366A1 (zh) 一种集群修复方法及装置
US9436539B2 (en) Synchronized debug information generation
WO2012053085A1 (ja) ストレージ制御装置およびストレージ制御方法
JP4239989B2 (ja) 障害復旧システム、障害復旧装置、ルール作成方法、および障害復旧プログラム
CN111522703A (zh) 监控访问请求的方法、设备和计算机程序产品
CN108833164B (zh) 服务器控制方法、装置、电子设备及存储介质
CN111581021B (zh) 应用程序启动异常的修复方法、装置、设备及存储介质
CN112261114A (zh) 一种数据备份***及方法
US20140201566A1 (en) Automatic computer storage medium diagnostics
WO2015043155A1 (zh) 一种基于命令集的网元备份与恢复方法及装置
WO2020233001A1 (zh) 双控构架分布式存储***、数据读取方法、装置和存储介质
JP5352027B2 (ja) 計算機システムの管理方法及び管理装置
CN111381770B (zh) 一种数据存储切换方法、装置、设备及存储介质
CN110543385A (zh) 一种虚拟化备份方法和虚拟化备份还原方法
CN105760456A (zh) 一种保持数据一致性的方法和装置
US20220129446A1 (en) Distributed Ledger Management Method, Distributed Ledger System, And Node