CN103607297A - Fault processing method of computer cluster system - Google Patents

Fault processing method of computer cluster system

Info

Publication number
CN103607297A
Authority
CN
China
Prior art keywords
node
fault
computer cluster
service module
message
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310548737.2A
Other languages
Chinese (zh)
Other versions
CN103607297B (en)
Inventor
陈浩
赵亚萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Eisoo Software Co Ltd
Original Assignee
Shanghai Eisoo Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2013-11-07
Filing date
2013-11-07
Publication date
2014-02-26
Application filed by Shanghai Eisoo Software Co Ltd
Priority to CN201310548737.2A
Publication of CN103607297A
Application granted
Publication of CN103607297B
Legal status: Expired - Fee Related

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a fault processing method of a computer cluster system. The method comprises the following steps: (A) at least two nodes in the computer cluster system are selected and set as management nodes, which are responsible for fault processing and for managing the computer cluster system; one of the management nodes serves as the main node and the others serve as standby nodes; (B) a bottom monitoring service module on each node in the computer cluster system monitors the operation state and the software and hardware load of that node and judges whether a fault has occurred, and if so, the bottom monitoring service module notifies a message middleware service module to send a fault message to a management center service module of the main node; and (C) the management center service module of the main node carries out fault processing according to the fault message. With this technical scheme, faults of the computer cluster system can be processed automatically without human intervention.

Description

Fault processing method of computer cluster system
Technical field
The application relates to computer technology, in particular to computer cluster systems, and more specifically to a fault processing method of a computer cluster system.
Background art
As information technology advances, enterprises and other organizations depend more and more on computer systems. With the sharp growth of data volumes, a single computer can no longer meet their needs, while using a supercomputer greatly increases cost. Computer cluster technology arose in response.
A computer cluster system is a group of loosely integrated computers, connected by software or hardware, that cooperate closely to carry out computing work. The computers that make up the cluster can logically be regarded as a single computer. Each computer in the cluster is usually called a node. The nodes are typically connected through a local area network, although other connection methods are also supported. Computer cluster systems are commonly used to improve on the computing speed of a single computer and to balance the load of data flows. Because of their high computing speed and low price, they are widely favored and have spread rapidly.
The number of nodes in a computer cluster system ranges from a few to hundreds or even thousands. When one or more nodes fail, the computing speed of the cluster is usually affected, and in some cases none of the nodes can be used normally. For users, therefore, ensuring that the cluster as a whole remains available and that its computing speed is unaffected when any single node fails is the key to improving work efficiency and creating value.
The usual way to handle a fault in a computer cluster system is for maintenance personnel to enter the machine room, locate the faulty machine among the many nodes, determine the cause of the fault and then repair it. As the number of nodes grows, more maintenance personnel and more work may be needed, so this approach is costly and inefficient.
Summary of the invention
The application provides a fault processing method of a computer cluster system that can process faults of the computer cluster system automatically without manual intervention.
The fault processing method of a computer cluster system provided by the embodiments of the application comprises:
A. selecting at least two nodes in the computer cluster system and setting them as management nodes responsible for fault processing and for managing the computer cluster system, one of the management nodes serving as the main node and the others as standby nodes;
B. the bottom monitoring service module of each node in the computer cluster system monitoring the operation state and the software and hardware load of that node and judging whether a fault has occurred, and if so, the bottom monitoring service module notifying the message middleware service module to send a fault message to the management center service module of the main node;
C. the management center service module of the main node carrying out fault processing according to the fault message.
Preferably, the fault is that the memory, CPU or system disk utilization of a node exceeds a predetermined threshold;
step C is: the management center service module of the main node reports the fault content to maintenance personnel.
Preferably, the fault is a hardware fault;
step C is: the management center service module of the main node notifies the administrator of the identifier of the faulty hardware and removes the faulty device from the computer cluster system.
Preferably, the faulty node is an ordinary node and the fault is a software fault;
step C is: the management center service module of the main node marks the state of the node with a defined state value and notifies maintenance personnel of the specific fault information.
Preferably, the faulty node is the main node and the fault is a software fault;
step C is: a new main node is elected from the standby nodes to take over the work of the former main node.
Preferably, the method further comprises:
the computer cluster system detecting, through a heartbeat mechanism, that a node is offline; if the node is the main node, electing a new main node from the standby nodes to take over the work of the former main node and then putting the former main node into aging; if the node is an ordinary node, putting it directly into aging;
after the aging period expires, deleting all information of the node from the computer cluster system.
Preferably, every node of the computer cluster system sends its heartbeat messages to the message middleware module on the main node, and the main node and the standby nodes collect and manage the heartbeat messages; if the interval between the timestamp of the last heartbeat message received from a node and the current time exceeds a predefined threshold and no new heartbeat message has been received, the node that sent that heartbeat message is considered offline.
As can be seen from the above technical scheme, the message middleware and the per-node monitoring programs form a monitoring network that covers every node of the computer cluster system and monitors the service state and network state of each node in real time. When a node fault is found, the monitoring program on that node reports the fault information to the management center for unified processing. Faults of the computer cluster system are thus processed automatically without manual intervention, the cluster remains usable after a node fails, the workload of maintenance personnel is reduced and the fault tolerance of the cluster is improved.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the fault processing method of a computer cluster system provided by an embodiment of the application;
Fig. 2 is a schematic diagram of the deployment of the fault processing method of a computer cluster system provided by an embodiment of the application.
Detailed description of the embodiments
To address the problems in the prior art, the application provides a fault processing method of a computer cluster system that uses a message mechanism to report faults and has designated nodes process them. Faults of the computer cluster system are thus processed automatically without manual intervention, the cluster remains usable after a node fails, the workload of maintenance personnel is reduced and the fault tolerance of the cluster is improved.
The main design idea of the technical scheme is as follows: the message middleware and the per-node monitoring programs form a monitoring network that covers every node of the computer cluster system and monitors the service state and network state of each node in real time. When a node fault is found, the monitoring program on that node reports the fault information to the management center for unified processing. The monitoring programs and the fault messages follow standardized definitions, and the processing of each kind of fault follows a unified standard. The aim is to achieve high availability of the computer cluster system while saving cost, manpower and material resources, and to keep the cluster continuously available as long as no major accident occurs.
To make the technical principles, features and technical effects of the technical scheme clearer, the scheme is described in detail below with reference to specific embodiments.
The flow of the fault processing method of a computer cluster system provided by an embodiment of the application is shown in Fig. 1 and comprises:
Step 101: select at least two nodes in the computer cluster system and set them as management nodes responsible for fault processing and for managing the computer cluster system; one of the management nodes serves as the main node and the others serve as standby nodes;
Step 102: the bottom monitoring service module of each node in the computer cluster system monitors the operation state and the software and hardware load of that node and judges whether a fault has occurred; if so, the bottom monitoring service module notifies the message middleware service module to send a fault message to the management center service module of the main node;
Step 103: the management center service module of the main node carries out fault processing according to the fault message.
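The report-and-process flow of steps 102 and 103 can be sketched roughly as follows. This is only an illustration under stated assumptions: the patent does not name a middleware product or a message format, so the in-process queue standing in for the message middleware, the `FaultMessage` fields and all function names here are hypothetical.

```python
import queue
from dataclasses import dataclass


@dataclass
class FaultMessage:
    """Hypothetical fault message; the patent only says it carries node and time information."""
    node_id: str       # node on which the fault was detected
    fault_type: str    # e.g. "system", "hardware", "software"
    detail: str        # human-readable fault content
    timestamp: float   # time of the fault


class MessageMiddleware:
    """Stand-in for the message middleware service module (an in-process queue)."""

    def __init__(self):
        self._to_main = queue.Queue()

    def send_to_main(self, msg):
        self._to_main.put(msg)

    def receive(self):
        # Return the next pending message, or None when nothing is waiting.
        try:
            return self._to_main.get_nowait()
        except queue.Empty:
            return None


def monitor_node(node_id, check, middleware, now):
    """Step 102: run one check; `check` returns (fault_type, detail) or None."""
    result = check()
    if result is None:
        return False
    fault_type, detail = result
    middleware.send_to_main(FaultMessage(node_id, fault_type, detail, now))
    return True


def process_faults(middleware, handlers):
    """Step 103: the main node drains pending messages and dispatches them by fault type.

    `handlers` maps a fault type to a callable and must contain a "default" entry.
    """
    handled = []
    while (msg := middleware.receive()) is not None:
        handler = handlers.get(msg.fault_type, handlers["default"])
        handled.append(handler(msg))
    return handled
```

A node whose check reports `("hardware", "disk sdb failed")` would enqueue one message, and `process_faults` on the main node would route it to the `"hardware"` handler. In a real deployment the queue would be replaced by the actual middleware client.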
The scheme of the embodiment mainly uses message middleware: the bottom monitoring program monitors each node and reports any fault as soon as it is found, and designated nodes of the computer cluster system collect and process the fault messages in a unified way. The invention requires the message middleware to be installed, together with the purpose-built single-node monitoring service and management center service of the computer cluster system; the operating system used is Linux. The fault processing system of the embodiment involves four key parts: the message middleware service module, the bottom monitoring service module, the management center service module and the failover processing module.
The deployment of the fault processing method of a computer cluster system provided by an embodiment of the application is shown in Fig. 2 and comprises:
Step 201: install and start the Linux system.
On each node of the computer cluster system, correctly install the required Linux system, configure it and start it.
Step 202: install and start the message middleware service.
Correctly install and start the message middleware on each node of the computer cluster system, and make sure that it works properly and delivers messages accurately.
Step 203: start the other services of the computer cluster system.
Correctly start the management center service module and the bottom monitoring service module on all nodes of the computer cluster system. The bottom monitoring service module monitors the operation state and the software and hardware load of each node; the management center service module processes messages, analyzes the fault type and handles each fault according to its type.
Step 204: configure the main and standby nodes.
Through an application programming interface (API) or the web interface of the operation and maintenance software, select 2 or 3 nodes of the computer cluster system and set them as management nodes responsible for fault processing and for managing the cluster, which ensures that the cluster works normally and has a failover capability. One of the selected management nodes is the main node and the others are standby nodes. Correspondingly, the nodes other than the management nodes are called ordinary nodes.
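The role assignment of step 204 might look like the following sketch. The function name and the role labels are assumptions made for illustration; the patent does not define the API through which the roles are configured.

```python
def assign_roles(all_nodes, management_nodes, main_node):
    """Return a dict mapping each node to "main", "standby" or "ordinary".

    Hypothetical helper: enforces that at least two management nodes are
    chosen and that the main node is one of them.
    """
    management = set(management_nodes)
    if len(management) < 2:
        raise ValueError("at least two management nodes are required")
    if not management <= set(all_nodes):
        raise ValueError("management nodes must belong to the cluster")
    if main_node not in management:
        raise ValueError("the main node must be one of the management nodes")

    roles = {}
    for node in all_nodes:
        if node == main_node:
            roles[node] = "main"
        elif node in management:
            roles[node] = "standby"
        else:
            roles[node] = "ordinary"
    return roles
```

Rejecting a configuration with a single management node reflects the requirement that at least one standby node exists to take over from the main node.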
After the above flow, the computer cluster system is in its normal working state. If a fault occurs, the cluster can respond to and process it quickly and, where necessary, take over from the faulty node, which ensures high availability of the cluster.
Several common fault types and the corresponding processing methods are given below:
System fault
System faults include, but are not limited to, excessive memory, CPU or system disk utilization (the default threshold is 70% and can be configured according to actual conditions). When the bottom monitoring service module detects such a fault, it passes the fault information to the message middleware service module, which sends a fault message to the management center service module of the main node. The message contains information such as the faulty node and the time of the fault.
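The utilization check described above can be sketched as a small function. It assumes that utilization values have already been collected as fractions between 0 and 1; how they are sampled on Linux is outside the scope of this sketch, and the function and constant names are hypothetical.

```python
DEFAULT_THRESHOLD = 0.70  # the 70% default mentioned in the text; configurable


def detect_system_faults(usage, threshold=DEFAULT_THRESHOLD):
    """Return the names of resources whose utilization exceeds the threshold.

    `usage` maps a resource name ("memory", "cpu", "system_disk", ...) to a
    utilization fraction between 0 and 1.
    """
    return sorted(name for name, value in usage.items() if value > threshold)
```

For example, `detect_system_faults({"memory": 0.82, "cpu": 0.35, "system_disk": 0.71})` returns `["memory", "system_disk"]`; each name would then be reported in a fault message.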
Because such a fault does not affect the normal operation of the main node, the management center service module of the main node informs maintenance personnel of the fault content by mail or by other means, or the corresponding system indicators can be checked on the web page of the operation and maintenance software. The administrator does not need to enter the machine room to inspect the machine, which greatly simplifies the administrator's work.
Device hardware fault
Device hardware faults include, but are not limited to, disk faults, RAID faults and network card faults. When the bottom monitoring service module detects a fault of this type, it passes the fault information to the message middleware service module, which sends a fault message to the management center service module of the main node. The management center service module handles the fault; specifically, it notifies the administrator of the identifier of the faulty hardware and removes the faulty device.
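A minimal sketch of the hardware fault handling just described, assuming the management center keeps an in-memory inventory of the devices in service on each node. The inventory layout, the `notify` callback and the function name are hypothetical; the patent does not specify how devices are tracked or how the administrator is notified.

```python
def handle_hardware_fault(cluster_devices, node_id, device_id, notify):
    """Notify the administrator and remove the faulty device from service.

    `cluster_devices` maps a node id to the set of device ids in service on
    that node. Returns True if the device was in service and was removed.
    """
    notify(f"hardware fault on {node_id}: {device_id}")
    devices = cluster_devices.get(node_id, set())
    removed = device_id in devices
    devices.discard(device_id)
    return removed
```

The notification is sent even when the device is not found in the inventory, so that an inconsistent inventory does not hide a reported fault.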
Ordinary node software fault
Software faults are faults in the various software used by the computer cluster system, for example a message middleware fault, a management center service fault or a bottom monitoring service fault. This type mainly refers to a fault in one of the services that run on every node of the cluster to serve that single node. In this case the state of the node is marked with a defined state value and maintenance personnel are informed of the specific fault information by mail or by other means. A fault of this type requires human intervention: the fault on the faulty node is repaired manually.
Management center software fault
Software faults are faults in the various software used by the computer cluster system, for example a message middleware fault, a management center service fault or a bottom monitoring service fault. When the management center service module of the main node fails, the main node can no longer work, and a new main node must be elected from the standby nodes according to some principle (for example node load or the smallest-IP principle) to take over the work of the former main node, that is, to provide services externally and management internally. Likewise, when a standby node fails or goes offline, another standby node takes over. This process is called automatic switching of management nodes.
An example of how automatic switching of management nodes can be implemented is given below: a standby node learns through the message mechanism that the main node has failed or gone offline and starts the election mechanism; it learns from the database that it is the node with the smallest IP, takes over the work previously done by the main node, and becomes the new main node.
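The smallest-IP election mentioned in this example can be sketched with Python's standard `ipaddress` module. The function name is hypothetical, and the sketch assumes that every standby node can read the same list of standby addresses (for instance from the shared database), so that all of them reach the same result.

```python
import ipaddress


def elect_main_node(standby_ips, failed_ips=()):
    """Return the standby address with the numerically smallest IP, or None.

    Addresses in `failed_ips` are excluded from the election. All addresses
    are assumed to be of the same IP version.
    """
    failed = set(failed_ips)
    candidates = [ip for ip in standby_ips if ip not in failed]
    if not candidates:
        return None
    # Compare as addresses, not as strings: "10.0.0.9" is smaller than "10.0.0.12".
    return min(candidates, key=ipaddress.ip_address)
```

Comparing parsed addresses rather than raw strings matters here, because plain string ordering would rank `"10.0.0.12"` below `"10.0.0.9"`.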
When such a fault occurs, a failover must be performed to ensure high availability of the computer cluster system. The switching process requires no manual intervention and is monitored automatically throughout; the administrator can follow it on the web operation and maintenance page. The fault is discovered quickly, and the switching process is brief and does not affect normal use of the cluster.
Node offline
This type of fault mainly refers to situations such as a node losing power or its network connection. The heartbeat mechanism implemented by the message middleware detects that the node is offline. If it is the main node, automatic switching of the main node is carried out and the node then enters aging; if it is an ordinary node, it enters aging directly. After the aging period, all information of the node is deleted from the whole computer cluster system. That is, the node is no longer a node of the cluster and no longer undertakes any cluster work. The heartbeat mechanism in the embodiment is as follows: every node of the computer cluster system sends its heartbeat messages to the message middleware module on the main node, and the main node and the standby nodes collect and manage the heartbeat messages; if the interval between the timestamp of the last heartbeat message received from a node and the current time exceeds a predefined threshold and no new heartbeat message has been received, the node that sent that heartbeat message is considered offline.
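The offline detection and aging described above can be sketched as a small tracker. The class and parameter names are hypothetical, times are plain numbers in seconds, and one behavior is an assumption not stated in the patent: a fresh heartbeat from a node that is aging cancels the aging.

```python
class HeartbeatTracker:
    """Tracks the last heartbeat per node and ages out nodes that stay offline."""

    def __init__(self, offline_after, aging_period):
        self.offline_after = offline_after  # threshold before a node counts as offline
        self.aging_period = aging_period    # time an offline node is kept before deletion
        self.last_seen = {}                 # node id -> timestamp of last heartbeat
        self.aging_since = {}               # node id -> time the node was found offline

    def record(self, node_id, timestamp):
        """Store a heartbeat. Assumption: a fresh heartbeat cancels aging."""
        self.last_seen[node_id] = timestamp
        self.aging_since.pop(node_id, None)

    def sweep(self, now):
        """Return (newly_offline, deleted) node id lists for the current time."""
        newly_offline, deleted = [], []
        for node_id, seen in list(self.last_seen.items()):
            if now - seen <= self.offline_after:
                continue  # heartbeat is recent enough
            if node_id not in self.aging_since:
                self.aging_since[node_id] = now  # node enters aging
                newly_offline.append(node_id)
            elif now - self.aging_since[node_id] > self.aging_period:
                # Aging period has passed: delete all information about the node.
                del self.last_seen[node_id]
                del self.aging_since[node_id]
                deleted.append(node_id)
        return newly_offline, deleted
```

If the node reported as newly offline is the main node, the caller would run the main node switch before leaving the node to age, as the text describes.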
The invention achieves the following effects:
1. Because a message mechanism is used to implement fault processing, node faults in the computer cluster system are reported promptly and accurately and are processed according to fault type. Both hardware and software faults are responded to quickly, which greatly reduces the administrator's maintenance burden.
2. The nodes of the computer cluster system are managed in a unified way, and the main node uniformly carries out operations such as load balancing and data distribution, which greatly improves the efficiency of the cluster. The more nodes the cluster has, the more obvious this advantage becomes.
3. In most cases fault processing is executed automatically by programs, without manual intervention and without affecting the normal running of the cluster. No complicated configuration or additional tools are needed, so the scheme is easy to operate and maintain.
4. The invention is applicable not only to server platforms of different brands but also to various virtual machines, and therefore adapts well to different hardware platforms. Thanks to the message middleware, message delivery is highly reliable, which ensures the accuracy of switching; the switching time is short and does not affect normal use of the cluster; and the Linux system is highly stable, which reduces the impact on customer services while the cluster is being maintained.
The above is only a preferred embodiment of the application and is not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principle of the technical scheme of the application shall fall within the scope of protection of the application.

Claims (7)

1. A fault processing method of a computer cluster system, characterized by comprising:
A. selecting at least two nodes in the computer cluster system and setting them as management nodes responsible for fault processing and for managing the computer cluster system, one of the management nodes serving as the main node and the others as standby nodes;
B. the bottom monitoring service module of each node in the computer cluster system monitoring the operation state and the software and hardware load of that node and judging whether a fault has occurred, and if so, the bottom monitoring service module notifying the message middleware service module to send a fault message to the management center service module of the main node;
C. the management center service module of the main node carrying out fault processing according to the fault message.
2. The method according to claim 1, characterized in that the fault is that the memory, CPU or system disk utilization of a node exceeds a predetermined threshold;
step C is: the management center service module of the main node reports the fault content to maintenance personnel.
3. The method according to claim 1, characterized in that the fault is a hardware fault;
step C is: the management center service module of the main node notifies the administrator of the identifier of the faulty hardware and removes the faulty device from the computer cluster system.
4. The method according to claim 1, characterized in that the faulty node is an ordinary node and the fault is a software fault;
step C is: the management center service module of the main node marks the state of the node with a defined state value and notifies maintenance personnel of the specific fault information.
5. The method according to claim 1, characterized in that the faulty node is the main node and the fault is a software fault;
step C is: a new main node is elected from the standby nodes to take over the work of the former main node.
6. The method according to any one of claims 1 to 5, characterized in that the method further comprises:
the computer cluster system detecting, through a heartbeat mechanism, that a node is offline; if the node is the main node, electing a new main node from the standby nodes to take over the work of the former main node and then putting the former main node into aging; if the node is an ordinary node, putting it directly into aging;
after the aging period expires, deleting all information of the node from the computer cluster system.
7. The method according to claim 6, characterized in that the heartbeat mechanism is: every node of the computer cluster system sends its heartbeat messages to the message middleware module on the main node, and the main node and the standby nodes collect and manage the heartbeat messages; if the interval between the timestamp of the last heartbeat message received from a node and the current time exceeds a predefined threshold and no new heartbeat message has been received, the node that sent that heartbeat message is considered offline.
CN201310548737.2A 2013-11-07 2013-11-07 Fault processing method of computer cluster system Expired - Fee Related CN103607297B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310548737.2A CN103607297B (en) 2013-11-07 2013-11-07 Fault processing method of computer cluster system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310548737.2A CN103607297B (en) 2013-11-07 2013-11-07 Fault processing method of computer cluster system

Publications (2)

Publication Number Publication Date
CN103607297A true CN103607297A (en) 2014-02-26
CN103607297B CN103607297B (en) 2017-02-08

Family

ID=50125498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310548737.2A Expired - Fee Related CN103607297B (en) 2013-11-07 2013-11-07 Fault processing method of computer cluster system

Country Status (1)

Country Link
CN (1) CN103607297B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020091814A1 (en) * 1998-07-10 2002-07-11 International Business Machines Corp. Highly scalable and highly available cluster system management scheme
CN101183996A (en) * 2007-12-13 2008-05-21 浪潮电子信息产业股份有限公司 Cluster information monitoring method
CN101373447A (en) * 2008-08-20 2009-02-25 上海超级计算中心 System and method for detecting health degree of computer cluster
CN102231681A (en) * 2011-06-27 2011-11-02 中国建设银行股份有限公司 High availability cluster computer system and fault treatment method thereof

Cited By (62)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104267689B (en) * 2014-09-22 2017-01-18 中国科学院寒区旱区环境与工程研究所 Super computer room outage early warning and automatic power-on management method based on video image differentiation
CN104267689A (en) * 2014-09-22 2015-01-07 中国科学院寒区旱区环境与工程研究所 Super computer room outage early warning and automatic power-on management method based on video image differentiation
CN104270268A (en) * 2014-09-28 2015-01-07 曙光信息产业股份有限公司 Network performance analysis and fault diagnosis method of distributed system
CN104270268B (en) * 2014-09-28 2017-12-05 曙光信息产业股份有限公司 A kind of distributed system network performance evaluation and method for diagnosing faults
CN105681156A (en) * 2014-11-19 2016-06-15 阿里巴巴集团控股有限公司 Message release method, device and system
CN105681156B (en) * 2014-11-19 2019-06-11 阿里巴巴集团控股有限公司 Message issuance method, apparatus and system
CN104579791A (en) * 2015-01-26 2015-04-29 浪潮电子信息产业股份有限公司 Method for achieving automatic K-DB main and standby disaster recovery cluster switching
CN104767794A (en) * 2015-03-13 2015-07-08 青岛海信传媒网络技术有限公司 Node election method in distributed system and nodes in distributed system
CN104767794B (en) * 2015-03-13 2018-05-01 聚好看科技股份有限公司 Node electoral machinery and node in a kind of distributed system
CN104735069A (en) * 2015-03-26 2015-06-24 浪潮集团有限公司 High-availability computer cluster based on safety and reliability
CN105007193A (en) * 2015-08-19 2015-10-28 浪潮(北京)电子信息产业有限公司 Multi-layer information processing method, system thereof and cluster management node
CN105162632A (en) * 2015-09-15 2015-12-16 浪潮集团有限公司 Automatic processing system for server cluster failures
CN107203420A (en) * 2016-03-18 2017-09-26 北京京东尚科信息技术有限公司 The master-slave switching method and device of task scheduling example
CN106161090A (en) * 2016-07-12 2016-11-23 许继集团有限公司 The monitoring method of a kind of subregion group system and device
CN106452952A (en) * 2016-09-29 2017-02-22 华为技术有限公司 Method for detecting communication state of cluster system and gateway cluster
CN106452952B (en) * 2016-09-29 2019-11-22 华为技术有限公司 A kind of method and gateway cluster detecting group system communications status
CN106878077A (en) * 2017-02-21 2017-06-20 深圳实现创新科技有限公司 The method of controlling security and device of safety monitoring
CN106933693A (en) * 2017-03-15 2017-07-07 郑州云海信息技术有限公司 A kind of data-base cluster node failure self-repairing method and system
CN107247729A (en) * 2017-05-03 2017-10-13 ***股份有限公司 A kind of document handling method and device
CN107343032A (en) * 2017-06-21 2017-11-10 武汉慧联无限科技有限公司 The offline detection method and device of terminal node in remote low power consumption network
CN109218126A (en) * 2017-06-30 2019-01-15 中兴通讯股份有限公司 The method, apparatus and system of monitoring node existing state
CN109218126B (en) * 2017-06-30 2023-10-17 中兴通讯股份有限公司 Method, device and system for monitoring node survival state
CN107257298A (en) * 2017-07-27 2017-10-17 郑州云海信息技术有限公司 A kind of fault handling method and device
CN107342905A (en) * 2017-08-28 2017-11-10 郑州云海信息技术有限公司 A kind of node scheduling method and system of cluster storage system failure transfer
CN107704387A (en) * 2017-09-26 2018-02-16 恒生电子股份有限公司 For the method, apparatus of system early warning, electronic equipment and computer-readable medium
CN107704387B (en) * 2017-09-26 2021-03-16 恒生电子股份有限公司 Method, device, electronic equipment and computer readable medium for system early warning
CN107831452A (en) * 2017-10-31 2018-03-23 国网上海市电力公司 DC control and protection system hostdown diagnoses and life appraisal equipment
CN107948260A (en) * 2017-11-15 2018-04-20 郑州云海信息技术有限公司 Main monitoring node selecting method and device in a kind of distributed type assemblies
CN108134706A (en) * 2018-01-02 2018-06-08 中国工商银行股份有限公司 Block chain high-availability system mostly living, computer equipment and method
CN108134706B (en) * 2018-01-02 2020-08-18 中国工商银行股份有限公司 Block chain multi-activity high-availability system, computer equipment and method
CN108809729A (en) * 2018-06-25 2018-11-13 郑州云海信息技术有限公司 The fault handling method and device that CTDB is serviced in a kind of distributed system
CN108847982A (en) * 2018-06-26 2018-11-20 郑州云海信息技术有限公司 A kind of distributed storage cluster and its node failure switching method and apparatus
CN109117317A (en) * 2018-11-01 2019-01-01 郑州云海信息技术有限公司 A kind of clustering fault restoration methods and relevant apparatus
CN111158962B (en) * 2018-11-07 2023-10-13 中移信息技术有限公司 Remote disaster recovery method, device and system, electronic equipment and storage medium
CN111158962A (en) * 2018-11-07 2020-05-15 中移信息技术有限公司 Remote disaster recovery method, device, system, electronic equipment and storage medium
CN111258840B (en) * 2018-11-30 2023-10-10 杭州海康威视数字技术股份有限公司 Cluster node management method and device and cluster
CN111258840A (en) * 2018-11-30 2020-06-09 杭州海康威视数字技术股份有限公司 Cluster node management method and device and cluster
CN109634787A (en) * 2018-12-17 2019-04-16 浪潮电子信息产业股份有限公司 Distributed file system monitor switching method, device, equipment and storage medium
CN111338647A (en) * 2018-12-18 2020-06-26 杭州海康威视数字技术股份有限公司 Big data cluster management method and device
CN111338647B (en) * 2018-12-18 2023-09-12 杭州海康威视数字技术股份有限公司 Big data cluster management method and device
CN109714202A (en) * 2018-12-21 2019-05-03 郑州云海信息技术有限公司 A kind of client off-line reason method of discrimination and concentrating type safety management system
CN111355600A (en) * 2018-12-21 2020-06-30 杭州海康威视数字技术股份有限公司 Method and device for determining main node
CN111355600B (en) * 2018-12-21 2023-05-02 杭州海康威视数字技术股份有限公司 Main node determining method and device
CN109873719A (en) * 2019-02-03 2019-06-11 华为技术有限公司 A kind of fault detection method and device
CN111130920A (en) * 2019-11-26 2020-05-08 网宿科技股份有限公司 Hardware information acquisition method, device, server and storage medium
CN111143027A (en) * 2019-12-06 2020-05-12 北京浪潮数据技术有限公司 Cloud platform management method, system, equipment and computer readable storage medium
CN111865714A (en) * 2020-06-24 2020-10-30 上海上实龙创智能科技股份有限公司 Cluster management method based on multi-cloud environment
CN111865714B (en) * 2020-06-24 2022-08-02 上海上实龙创智能科技股份有限公司 Cluster management method based on multi-cloud environment
CN112131077A (en) * 2020-09-21 2020-12-25 中国建设银行股份有限公司 Fault node positioning method and device and database cluster system
CN112306747A (en) * 2020-09-29 2021-02-02 新华三技术有限公司合肥分公司 RAID card fault processing method and device
CN112306747B (en) * 2020-09-29 2023-04-11 新华三技术有限公司合肥分公司 RAID card fault processing method and device
CN112491633B (en) * 2020-12-17 2023-01-24 北京浪潮数据技术有限公司 Fault recovery method, system and related components of multi-node cluster
CN112491633A (en) * 2020-12-17 2021-03-12 北京浪潮数据技术有限公司 Fault recovery method, system and related components of multi-node cluster
CN112631718A (en) * 2020-12-21 2021-04-09 常州微亿智造科技有限公司 Method and system for realizing Controller and Worker service combination under industrial Internet of things
CN114978875A (en) * 2021-02-23 2022-08-30 广州汽车集团股份有限公司 Vehicle-mounted node management method and device and storage medium
WO2022247675A1 (en) * 2021-05-24 2022-12-01 中兴通讯股份有限公司 Device operation and maintenance method, network device, and storage medium
CN113282334A (en) * 2021-06-07 2021-08-20 深圳华锐金融技术股份有限公司 Method and device for recovering software defects, computer equipment and storage medium
CN114363156A (en) * 2022-01-25 2022-04-15 南瑞集团有限公司 Hydropower station computer monitoring system deployment method based on cluster technology
CN114826905A (en) * 2022-03-31 2022-07-29 西安超越申泰信息科技有限公司 Method, system, equipment and medium for switching management service of lower node
CN115242701A (en) * 2022-07-25 2022-10-25 中国民用航空总局第二研究所 Processing method, device and storage medium for airport data platform cluster consumption
CN115242701B (en) * 2022-07-25 2024-04-02 中国民用航空总局第二研究所 Airport data platform cluster consumption processing method, device and storage medium
CN116614348A (en) * 2023-07-19 2023-08-18 联想凌拓科技有限公司 System for remote copy service and method of operating the same

Also Published As

Publication number Publication date
CN103607297B (en) 2017-02-08

Similar Documents

Publication Publication Date Title
CN103607297A (en) Fault processing method of computer cluster system
TWI746512B (en) Physical machine fault classification processing method and device, and virtual machine recovery method and system
US10592330B2 (en) Systems and methods for automatic replacement and repair of communications network devices
CN107147540A (en) Fault handling method and troubleshooting cluster in highly available system
CN1947096B (en) Dynamic migration of virtual machine computer programs
CN105187249B (en) A kind of fault recovery method and device
CN105808394B (en) Server self-healing method and device
CN106789306B (en) Method and system for detecting, collecting and recovering software fault of communication equipment
CN105323113A (en) A visualization technology-based system fault emergency handling system and a system fault emergency handling method
US20140122140A1 (en) Advanced managed service customer edge router
CN109194514B (en) Dual-computer monitoring method and device, server and storage medium
CN111953566B (en) Distributed fault monitoring-based method and virtual machine high-availability system
CN106330523A (en) Cluster server disaster recovery system and method, and server node
CN100388218C (en) Method for realizing backup between servers
CN101996106A (en) Method for monitoring software running state
CN111897671A (en) Failure recovery method, computer device, and storage medium
WO2016188100A1 (en) Information system fault scenario information collection method and system
CN108347339B (en) Service recovery method and device
CN105095001A (en) Virtual machine exception recovery method under distributed environment
EP2518627A2 (en) Partial fault processing method in computer system
JP2013130901A (en) Monitoring server and network device recovery system using the same
CN103618634A (en) Method for automatically finding nodes in cluster
CN107453888B (en) High-availability virtual machine cluster management method and device
CN114020509A (en) Method, device and equipment for repairing work load cluster and readable storage medium
CN105025179A (en) Method and system for monitoring service agents of call center

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 201112 Shanghai, Minhang District, Lianhang Road 1188, Building 8, 2nd Floor, Unit A-1

Applicant after: SHANGHAI EISOO INFORMATION TECHNOLOGY CO., LTD.

Address before: 200072 room 3, building 840, No. 101 Middle Luochuan Road, Shanghai, Zhabei District

Applicant before: Shanghai Eisoo Software Co.,Ltd.

COR Change of bibliographic data
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170208

Termination date: 20191107
