CN103607297A - Fault processing method of computer cluster system

Info

Publication number: CN103607297A
Application number: CN201310548737.2A
Authority: CN
Inventors: 陈浩; 赵亚萍
Original assignee: Shanghai Eisoo Software Co Ltd
Current assignee: Shanghai Eisoo Software Co Ltd
Priority date: 2013-11-07
Filing date: 2013-11-07
Publication date: 2014-02-26
Anticipated expiration: 2033-11-07
Also published as: CN103607297B

Abstract

The invention discloses a fault processing method of a computer cluster system. The method comprises the following steps: (A) at least two nodes in the computer cluster system are selected and set as management nodes, which are responsible for fault processing and for managing the computer cluster system; one of the management nodes serves as the main node and the others serve as standby nodes; (B) the bottom monitoring service module of each node in the computer cluster system monitors the node's operating state and its software and hardware load and judges whether a fault has occurred; if so, the bottom monitoring service module notifies a message middleware service module to send a fault message to the management center service module of the main node; and (C) the management center service module of the main node carries out fault processing according to the fault message. With this technical scheme, faults in the computer cluster system can be processed automatically without human intervention.

Description

A fault handling method for a computer cluster system

Technical field

The application relates to computer technology, in particular to computer cluster systems, and more specifically to a fault handling method for a computer cluster system.

Background

As information technology advances, enterprises and other organizations depend more and more on computer systems. With data volumes expanding sharply, a single computer can no longer meet their needs, while a supercomputer would greatly increase cost. Computer cluster technology emerged in response.

A computer cluster system is a group of loosely integrated computers, connected by software or hardware, that cooperate closely to carry out computing work. The computers that make up a cluster can logically be regarded as a single computer. Each computer in the cluster is usually called a node. The nodes are typically connected through a local area network (LAN), although other connection methods are also supported. Computer cluster systems are commonly used to exceed the computing speed of a single computer and to balance the load of data flows. Because of their high computing speed and low price, they have been widely adopted and have spread rapidly.

The number of nodes in a computer cluster system ranges from a few to hundreds or even thousands. When one or more nodes fail, the computing speed of the cluster is usually affected, and in the worst case none of the nodes can be used normally. For users, ensuring that the cluster as a whole remains available, with no loss of computing speed, when any single node fails is therefore key to improving working efficiency and creating value.

The usual way to handle faults in a computer cluster system is for maintenance personnel to go into the machine room, locate the faulty machine among the cluster's many nodes, determine the cause of the failure, and then carry out the repair. As the number of nodes grows, more maintenance personnel and more work may be required, so this approach is both costly and inefficient.

Summary of the invention

The application provides a fault handling method for a computer cluster system that can process computer cluster system faults automatically, without manual intervention.

The fault handling method for a computer cluster system provided by the embodiments of the application comprises:

A. At least two nodes in the computer cluster system are selected and configured as management nodes responsible for fault handling and for managing the computer cluster system; one of the management nodes serves as the main node and the rest serve as standby nodes;

B. The bottom monitoring service module of each node in the computer cluster system monitors the running state and the software and hardware load of that node and judges whether a fault has occurred; if so, the bottom monitoring service module notifies the message middleware service module to send a fault message to the management center service module of the main node;

C. The management center service module of the main node handles the fault according to the fault message.

Preferably, the fault is that the memory, CPU, or system disk utilization of a node exceeds a predetermined threshold;

Step C is: the management center service module of the main node reports the fault content to maintenance personnel.

Preferably, the fault is a hardware fault;

Step C is: the management center service module of the main node notifies the administrator of the identifier of the failed hardware and removes the faulty device from the computer cluster system.

Preferably, the failed node is an ordinary node, and the fault is a software fault;

Step C is: the management center service module of the main node marks the state of the node with a defined state value and notifies maintenance personnel of the specific fault information.

Preferably, the failed node is the main node, and the fault is a software fault;

Step C is: a new main node is elected from the standby nodes to take over the work of the former main node.

Preferably, the method further comprises:

The computer cluster system detects through a heartbeat mechanism that a node is offline. If the node is the main node, a new main node is elected from the standby nodes to take over the work of the former main node, after which the former main node enters aging; if the node is an ordinary node, it enters aging directly;

After the aging period, all information about the node is deleted from the computer cluster system.

Preferably, every node of the computer cluster system sends its heartbeat messages to the message middleware module on the main node, and the main node and the standby nodes collect and manage the heartbeat messages. If the time elapsed between the timestamp in the last heartbeat message received from a node and the current time exceeds a predefined threshold, and no new heartbeat message has been received, the node that sent that heartbeat message is considered offline.

As the above technical scheme shows, the message middleware and the per-node monitoring programs form a monitoring network that covers every node of the computer cluster system and monitors the service state and network state of each node in real time. When a node fault is found, the monitoring program on that node reports the fault information to the management center, which handles it centrally. Computer cluster system faults are thus processed automatically without manual intervention, the cluster remains usable after a node fails, the workload of maintenance personnel is reduced, and the fault tolerance of the computer cluster system is improved.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the fault handling method for a computer cluster system provided by the embodiments of the application;

Fig. 2 is a schematic diagram of the deployment of the fault handling method for a computer cluster system provided by the embodiments of the application.

Detailed description

To address the problems in the prior art, the application provides a fault handling method for a computer cluster system. It uses a message mechanism to report computer cluster system faults and has designated nodes handle them, so that faults are processed automatically without manual intervention, the cluster remains usable after a node fails, the workload of maintenance personnel is reduced, and the fault tolerance of the computer cluster system is improved.

The main design idea of the technical scheme is as follows. The message middleware and the per-node monitoring programs form a monitoring network that covers every node of the computer cluster system and monitors the service state and network state of each node in real time. When a node fault is found, the monitoring program on that node reports the fault information to the management center, which handles it centrally. The node monitoring programs and the fault messages have standardized definitions, and the handling of each class of fault follows a unified standard. The aim is to achieve high availability of the computer cluster system while saving cost, manpower, and material resources, and to keep the cluster continuously available provided that no major accident occurs.

To make the technical principles, features, and technical effects of the scheme clearer, the scheme is described in detail below with reference to specific embodiments.

The flow of the fault handling method for a computer cluster system provided by the embodiments of the application is shown in Fig. 1 and comprises:

Step 101: At least two nodes in the computer cluster system are selected and configured as management nodes responsible for fault handling and for managing the computer cluster system; one of the management nodes serves as the main node and the rest serve as standby nodes;

Step 102: The bottom monitoring service module of each node in the computer cluster system monitors the running state and the software and hardware load of that node and judges whether a fault has occurred; if so, the bottom monitoring service module notifies the message middleware service module to send a fault message to the management center service module of the main node;

Step 103: The management center service module of the main node handles the fault according to the fault message.
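The report path of steps 102 and 103 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patent does not name a middleware product or a message format, so the in-process queue, the class names (`FaultMessage`, `MessageMiddleware`, `BottomMonitor`, `ManagementCenter`), and the `probe` callback are all hypothetical stand-ins.

```python
import queue
import time
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class FaultMessage:
    node_id: str      # node whose monitor detected the fault
    fault_type: str   # hypothetical labels, e.g. "system", "hardware", "software"
    detail: str
    fault_time: float = field(default_factory=time.time)


class MessageMiddleware:
    """In-process stand-in for the message middleware service module."""

    def __init__(self) -> None:
        self._inbox: "queue.Queue[FaultMessage]" = queue.Queue()

    def send_to_main_node(self, msg: FaultMessage) -> None:
        self._inbox.put(msg)

    def receive(self) -> Optional[FaultMessage]:
        try:
            return self._inbox.get_nowait()
        except queue.Empty:
            return None


class BottomMonitor:
    """Step 102: checks one node and reports a fault through the middleware."""

    def __init__(self, node_id: str, middleware: MessageMiddleware,
                 probe: Callable[[], Optional[Tuple[str, str]]]) -> None:
        self.node_id = node_id
        self.middleware = middleware
        self.probe = probe  # returns (fault_type, detail), or None if healthy

    def check_once(self) -> bool:
        fault = self.probe()
        if fault is None:
            return False
        fault_type, detail = fault
        self.middleware.send_to_main_node(
            FaultMessage(self.node_id, fault_type, detail))
        return True


class ManagementCenter:
    """Step 103: runs on the main node and handles each fault message."""

    def __init__(self, middleware: MessageMiddleware) -> None:
        self.middleware = middleware
        self.handled: List[FaultMessage] = []

    def process_pending(self) -> int:
        count = 0
        while True:
            msg = self.middleware.receive()
            if msg is None:
                return count
            self.handled.append(msg)  # real handling depends on the fault type
            count += 1
```

In a real deployment the queue would be replaced by a networked message broker, and `process_pending` would branch on the fault type rather than just recording the message.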

The scheme of the embodiments mainly relies on the message middleware: the bottom monitoring programs monitor the condition of each node and report any fault as soon as it is found, and designated nodes of the computer cluster system collect the fault messages and handle them centrally. The invention requires the message middleware to be installed, together with the per-node monitoring service and the cluster management center service that we define; the operating system used is Linux. The fault processing system of the embodiments involves four key parts: the message middleware service module, the bottom monitoring service module, the management center service module, and the failover processing module.

The deployment of the fault handling method for a computer cluster system provided by the embodiments of the application is shown in Fig. 2 and comprises:

Step 201: Install and start the Linux system.

Install the required Linux system correctly on each node of the computer cluster system, configure it, and start it.

Step 202: Install and start the message middleware service.

Install the message middleware correctly on each node of the computer cluster system, start it, and make sure that it works properly and delivers messages accurately.

Step 203: Start the other services of the computer cluster system.

Start the management center service module and the bottom monitoring service module correctly on all nodes of the computer cluster system. The bottom monitoring service module monitors the running state and the software and hardware load of each node. The management center service module processes messages, analyzes the fault type, and handles each fault according to its type.

Step 204: Configure the main and standby nodes.

Through the application programming interface (API) or the web interface of the operation and maintenance (O&M) software, select 2 or 3 nodes in the computer cluster system and configure them as management nodes responsible for fault handling and for managing the computer cluster system, so that the cluster works normally and supports failover. One of the selected management nodes is the main node and the rest are standby nodes. Correspondingly, the nodes other than the management nodes are called ordinary nodes.
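The role assignment of step 204 can be illustrated with a small helper. This is a sketch, not the O&M software's actual API: the function name `assign_roles`, the role labels, and the rule that the first listed management node becomes the main node are assumptions made for the example.

```python
from typing import Dict, List


def assign_roles(all_nodes: List[str], management_nodes: List[str]) -> Dict[str, str]:
    """Return a role ("main", "standby" or "ordinary") for every node.

    Assumption: the first entry of management_nodes becomes the main node.
    """
    if not 2 <= len(management_nodes) <= 3:
        raise ValueError("choose 2 or 3 management nodes")
    if len(set(management_nodes)) != len(management_nodes):
        raise ValueError("management nodes must be distinct")
    unknown = set(management_nodes) - set(all_nodes)
    if unknown:
        raise ValueError(f"not part of the cluster: {sorted(unknown)}")

    # Every node starts as an ordinary node; management nodes are then promoted.
    roles = {node: "ordinary" for node in all_nodes}
    roles[management_nodes[0]] = "main"
    for node in management_nodes[1:]:
        roles[node] = "standby"
    return roles
```

For example, `assign_roles(["n1", "n2", "n3", "n4"], ["n2", "n3"])` marks `n2` as the main node, `n3` as a standby node, and the other two as ordinary nodes.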

After the above steps, the computer cluster system is in its normal working state. If a fault occurs, the cluster can respond to it and handle it quickly, taking over the faulty node where necessary to keep the computer cluster system highly available.

Several common fault types and the corresponding handling methods are given below:

System fault

System faults include, but are not limited to, excessive memory, CPU, or system disk utilization (the default threshold is 70% and can be configured to suit actual conditions). When the bottom monitoring service module detects such a fault, it passes the fault information to the message middleware service module, which sends a fault message to the management center service module of the main node. The message includes information about the faulty node, the time of the fault, and so on.

Because such a fault does not affect the normal operation of the main node, the management center service module of the main node informs maintenance personnel of the fault content by e-mail or other means; alternatively, they can check the corresponding system indicators on the web page of the O&M software. The administrator does not need to enter the machine room to inspect the machine, which greatly simplifies the administrator's work.
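A minimal sketch of the threshold check follows. The 70% default comes from the text; the function names, the resource labels, and the dictionary layout of the fault message are hypothetical, since the patent only says that the message carries the faulty node's information and the fault time.

```python
from typing import Dict, List, Optional

DEFAULT_THRESHOLD = 70.0  # percent; the default named in the text, configurable


def find_overloaded(usage: Dict[str, float],
                    threshold: float = DEFAULT_THRESHOLD) -> List[str]:
    """Return the resources whose utilization exceeds the threshold."""
    return sorted(name for name, pct in usage.items() if pct > threshold)


def build_fault_message(node_id: str, usage: Dict[str, float], fault_time: float,
                        threshold: float = DEFAULT_THRESHOLD) -> Optional[dict]:
    """Build a system-fault message, or return None if the node is healthy."""
    overloaded = find_overloaded(usage, threshold)
    if not overloaded:
        return None
    return {
        "node": node_id,            # faulty node information
        "fault_type": "system",     # hypothetical label
        "resources": overloaded,
        "fault_time": fault_time,
    }
```

The utilization figures themselves would come from the operating system; how they are sampled is outside the scope of this sketch.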

Device hardware fault

Device hardware faults include, but are not limited to, disk faults, RAID faults, and network card faults. When the bottom monitoring service module detects such a fault, it passes the fault information to the message middleware service module, which sends a fault message to the management center service module of the main node. The management center service module handles the fault by notifying the administrator of the identifier of the failed hardware and removing the faulty device.
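The two actions in the hardware-fault case, notifying the administrator and removing the faulty device, could look like the following sketch. The device inventory (a dictionary from node to hardware identifiers), the `notify` callback, and the function name are assumptions; the patent does not say how devices are tracked or how notifications are delivered.

```python
from typing import Callable, Dict, Set


def handle_hardware_fault(devices: Dict[str, Set[str]], node_id: str,
                          hardware_id: str, notify: Callable[[str], None]) -> bool:
    """Notify the administrator and drop the faulty device from the inventory.

    Returns True if the device was known and has been removed.
    """
    # The administrator is told which piece of hardware failed.
    notify(f"Hardware fault on {node_id}: device {hardware_id}")
    node_devices = devices.get(node_id, set())
    if hardware_id not in node_devices:
        return False
    node_devices.remove(hardware_id)  # the cluster stops using this device
    return True
```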

Ordinary node software fault

Software faults are faults in any of the software used by the computer cluster system, for example a message middleware fault, a management center service fault, or a bottom monitoring service fault. This type of fault mainly refers to a failure of one of the per-node services that run on every node of the computer cluster system. In this case the state of the node is marked with a defined state value, and maintenance personnel are informed of the specific fault information by e-mail or other means. Faults of this type require a person to intervene on the faulty node and repair the fault manually.
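Marking a node with a defined state value can be sketched as follows. The patent does not list the actual state values, so the `NodeState` members, the function name, and the wording of the notification are hypothetical.

```python
from enum import Enum
from typing import Callable, Dict


class NodeState(Enum):
    # Hypothetical state values; the patent only says they are "defined".
    NORMAL = "normal"
    SOFTWARE_FAULT = "software_fault"


def handle_node_software_fault(states: Dict[str, NodeState], node_id: str,
                               failed_service: str,
                               notify: Callable[[str], None]) -> None:
    """Mark the node as faulty and tell maintenance personnel what failed."""
    states[node_id] = NodeState.SOFTWARE_FAULT
    notify(f"Software fault on {node_id}: {failed_service} needs manual repair")
```

The node stays in the fault state until someone repairs the service by hand, which matches the text's point that this fault type is not resolved automatically.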

Management center software fault

Software faults are faults in any of the software used by the computer cluster system, for example a message middleware fault, a management center service fault, or a bottom monitoring service fault. When the management center service module of the main node fails, the main node can no longer work, and a new main node must be elected from the standby nodes according to some principle (such as node load or the smallest-IP principle) to take over the work of the former main node, that is, providing services externally and management internally. Likewise, if a standby node fails or goes offline, another standby node takes over. This process is called automatic management node switchover.

An example of how automatic management node switchover can be implemented is as follows: a standby node learns through the message mechanism that the main node has failed or gone offline; the standby node starts the election mechanism, learns from the database that it is the node with the smallest IP, takes over the work previously performed by the main node, and becomes the new main node.
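The smallest-IP principle can be sketched with Python's standard `ipaddress` module. The function names are hypothetical, the list of live standby addresses is assumed to come from the cluster database mentioned in the text, and all addresses are assumed to be of the same IP version.

```python
import ipaddress
from typing import Iterable


def elect_main_node(standby_ips: Iterable[str]) -> str:
    """Pick the standby node with the numerically smallest IP address."""
    candidates = [ipaddress.ip_address(ip) for ip in standby_ips]
    if not candidates:
        raise ValueError("no standby node available for election")
    # Address objects compare numerically, unlike plain strings.
    return str(min(candidates))


def should_take_over(own_ip: str, live_standby_ips: Iterable[str]) -> bool:
    """A standby node takes over only if it wins the election itself."""
    return elect_main_node(live_standby_ips) == str(ipaddress.ip_address(own_ip))
```

Comparing parsed addresses rather than strings matters: as strings, `"10.0.0.12"` sorts before `"10.0.0.3"`, which would elect the wrong node.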

When such a fault occurs, failover must be carried out to keep the computer cluster system highly available. The switchover requires no manual intervention and is monitored automatically throughout; the administrator can follow it on the web O&M page. Faults are discovered quickly, and the switchover is brief and does not affect normal use of the computer cluster system.

Node offline

This type of fault mainly refers to situations such as a node losing power or its network connection. The heartbeat mechanism implemented by the message middleware detects that the node is offline. If it is the main node, the automatic main node switchover is carried out and the node then enters aging; if it is an ordinary node, it enters aging directly. Once the aging period has passed, all information about the node is deleted from the whole computer cluster system; that is, the node is no longer treated as a node of the cluster and no longer takes on any cluster work. The heartbeat mechanism in the embodiments is as follows: every node of the computer cluster system sends its heartbeat messages to the message middleware module on the main node, and the main node and the standby nodes collect and manage the heartbeat messages. If the time elapsed between the timestamp in the last heartbeat message received from a node and the current time exceeds a predefined threshold, and no new heartbeat message has been received, the node that sent that heartbeat message is considered offline.
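The heartbeat timeout and the aging period can be sketched as a small tracker. The class and parameter names are hypothetical, and two behaviors are assumptions rather than statements from the patent: a heartbeat received during aging cancels the aging, and aging is counted from the sweep that first finds the node offline. The main node switchover that precedes aging is left out.

```python
from typing import Dict, List


class HeartbeatTracker:
    def __init__(self, offline_after: float, aging_period: float) -> None:
        self.offline_after = offline_after      # predefined heartbeat threshold
        self.aging_period = aging_period
        self.last_seen: Dict[str, float] = {}   # timestamp of last heartbeat
        self.aging_since: Dict[str, float] = {}

    def record(self, node_id: str, timestamp: float) -> None:
        self.last_seen[node_id] = timestamp
        # Assumption: a fresh heartbeat cancels aging.
        self.aging_since.pop(node_id, None)

    def is_offline(self, node_id: str, now: float) -> bool:
        return now - self.last_seen[node_id] > self.offline_after

    def sweep(self, now: float) -> List[str]:
        """Start aging for offline nodes; return nodes deleted in this sweep."""
        deleted = []
        for node_id in list(self.last_seen):
            if not self.is_offline(node_id, now):
                continue
            started = self.aging_since.setdefault(node_id, now)
            if now - started >= self.aging_period:
                # Aging period is over: delete all information about the node.
                del self.last_seen[node_id]
                del self.aging_since[node_id]
                deleted.append(node_id)
        return deleted
```

With `offline_after=30` and `aging_period=600`, a node that stops sending heartbeats is flagged by the first sweep after 30 seconds of silence and removed by the first sweep at least 600 seconds after that.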

The invention achieves the following effects:

1. Because a message mechanism is used for fault handling, node faults in the computer cluster system are reported promptly and accurately and can be handled according to their type. Both hardware faults and software faults get a rapid response, which greatly reduces the administrator's maintenance burden.

2. The nodes of the computer cluster system are managed in a unified way, and the main node centrally performs operations such as load balancing and data distribution, which greatly improves the efficiency of the cluster. The more nodes the cluster has, the more pronounced this advantage becomes.

3. In most cases fault handling is carried out automatically by programs, without manual intervention and without affecting normal operation of the computer cluster system. No complicated configuration or additional tools are needed, so the scheme is easy to operate and easy to maintain.

4. The invention applies not only to server platforms of different brands but also to various virtual machines, so it adapts well to different hardware platforms. Thanks to the message middleware, message delivery is highly reliable, which ensures that switchover is accurate; the switchover time is short and does not affect normal use of the computer cluster system; and the Linux system is highly stable, which reduces the impact on customer services during maintenance of the computer cluster system.

The above are only preferred embodiments of the application and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the technical scheme shall fall within the scope of protection of the application.

Claims

1. A fault handling method for a computer cluster system, characterized in that it comprises:

2. The method according to claim 1, characterized in that the fault is that the memory, CPU, or system disk utilization of a node exceeds a predetermined threshold;

3. The method according to claim 1, characterized in that the fault is a hardware fault;

4. The method according to claim 1, characterized in that the failed node is an ordinary node and the fault is a software fault;

5. The method according to claim 1, characterized in that the failed node is the main node and the fault is a software fault;

6. The method according to any one of claims 1 to 5, characterized in that the method further comprises:

After the aging period, all information about the node is deleted from the computer cluster system.

7. The method according to claim 6, characterized in that the heartbeat mechanism is: every node of the computer cluster system sends its heartbeat messages to the message middleware module on the main node, and the main node and the standby nodes collect and manage the heartbeat messages; if the time elapsed between the timestamp in the last heartbeat message received from a node and the current time exceeds a predefined threshold, and no new heartbeat message has been received, the node that sent that heartbeat message is considered offline.