CN110290012A - Detection and recovery system and method for RabbitMQ cluster faults - Google Patents

Detection and recovery system and method for RabbitMQ cluster faults

Info

Publication number
CN110290012A
Authority
CN
China
Prior art keywords
rabbitmq
cluster
detection
node
service
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910593885.3A
Other languages
Chinese (zh)
Inventor
宋伟
蔡卫卫
谢涛涛
赖振
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Cloud Information Technology Co Ltd
Original Assignee
Inspur Cloud Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Cloud Information Technology Co Ltd
Priority to CN201910593885.3A
Publication of CN110290012A
Legal status: Pending


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06: Management of faults, events, alarms or notifications
    • H04L 41/0631: Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06: Management of faults, events, alarms or notifications
    • H04L 41/0654: Management of faults, events, alarms or notifications using network fault recovery
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00: Network arrangements or protocols for supporting network services or applications
    • H04L 67/01: Protocols
    • H04L 67/10: Protocols in which an application is distributed across nodes in the network
    • H04L 67/1097: Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a detection and recovery system and method for RabbitMQ cluster faults, belonging to the field of cloud computing. The technical problem to be solved is how to detect and quickly recover from the fault condition in which the RabbitMQ cluster state is normal but the cluster queue metadata is inconsistent. Structurally, the system comprises an information acquisition module, an abnormality detection module, a monitoring server, a fault analysis and handling module, a recovery detection module and a data storage module. The method includes: collecting health data; detecting and analyzing the health data, and checking the consistency of the RabbitMQ cluster state and the queue metadata, to obtain a detection and analysis result; generating alarm information when the detection and analysis result indicates an abnormality; generating a handling action for the RabbitMQ node according to the detection and analysis result; and, after the fault has been handled, verifying the availability of the RabbitMQ cluster.

Description

Detection and recovery system and method for RabbitMQ cluster faults
Technical field
The present invention relates to the field of cloud computing, and in particular to a detection and recovery system and method for RabbitMQ cluster faults.
Background art
AMQP (Advanced Message Queuing Protocol) is characterized mainly by message orientation, queuing, routing (including point-to-point and publish/subscribe), reliability and security. RabbitMQ is an open-source message-oriented middleware implementation of AMQP, mainly used for storing and forwarding messages in distributed systems. The RabbitMQ server is written in the Erlang language and supports a variety of clients, such as Java, Python and C.
The cluster modes provided by RabbitMQ are: normal cluster mode and mirrored cluster mode.
Normal cluster mode is the default cluster mode. Taking three nodes (rabbit01, rabbit02, rabbit03) as an example: for a queue, the message entities exist only on one of the nodes, for example rabbit01 (or rabbit02, or rabbit03); rabbit01, rabbit02 and rabbit03 hold only identical metadata, i.e. the structure of the queue. After a message enters the queue on rabbit01, when a consumer consumes from rabbit02, RabbitMQ temporarily transfers the message between rabbit01 and rabbit02: the message entity is taken out of rabbit01 and sent to the consumer through rabbit02. Consumers should therefore connect to each node as far as possible and take messages from each of them; that is, for the same logical queue, a physical queue has to be established on multiple nodes. Otherwise, no matter whether the consumer connects to rabbit01 or rabbit02, the output always comes from rabbit01, which creates a bottleneck. After the rabbit01 node fails, the rabbit02 node cannot obtain the message entities on rabbit01 that have not yet been consumed. If message persistence has been enabled, consumption can only continue after rabbit01 recovers; if persistence has not been enabled, the messages are lost.
Mirrored cluster mode: each node holds a copy of the messages in a queue, so when a single node fails the whole cluster can still provide service (although in practice it is found that a cluster whose state is normal but whose metadata is inconsistent cannot provide service normally either). However, because the data has to be replicated across multiple nodes, throughput declines while availability increases. In terms of the implementation mechanism, a mirrored queue internally implements an election algorithm: there is one master and several slaves, and the messages in the queue are based on the master. For publishing, any node can be connected; inside RabbitMQ, if that node is not the master, the message is forwarded to the master, the master sends the message to the other slave nodes, the message is then processed locally, and the replicated message is multicast to the other nodes for storage. For consuming, any node can be connected, and the consumption request is forwarded to the master. To guarantee message reliability, the consumer needs to send an ack confirmation; only after the master receives the ack is the message deleted, and the ack is synchronized (asynchronously by default) to the other nodes so that the slaves delete the message as well. If the master node fails, the mirrored queue automatically elects a node (the slave with the longest message queue) as the new master to serve as the reference for message consumption; in this case, because ack synchronization is asynchronous by default, the ack may not have been synchronized to all nodes. If a slave node fails, the state of the other nodes in the mirrored-queue cluster does not change.
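For reference only (the patent text does not prescribe it), mirroring of this kind is normally switched on by setting a policy on the vhost rather than per queue. Below is a minimal sketch that invokes the standard rabbitmqctl set_policy command from Python; the policy name, pattern and vhost used here are illustrative assumptions.

```python
import subprocess

def enable_mirroring(vhost: str = "/", policy_name: str = "ha-all") -> None:
    """Apply an ha-mode=all policy so every queue in the vhost is mirrored
    to all cluster nodes (illustrative helper, not part of the claimed system)."""
    subprocess.run(
        [
            "rabbitmqctl", "set_policy",
            "-p", vhost,
            policy_name,
            ".*",                  # pattern: match every queue name
            '{"ha-mode":"all"}',   # definition: mirror to all nodes
            "--apply-to", "queues",
        ],
        check=True,
    )

if __name__ == "__main__":
    enable_mirroring()
```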
Mnesia is a distributed database module that can automatically synchronize the database across multiple Erlang nodes. The RabbitMQ service stores information such as queues (including queue attribute information), messages, vhosts, users and exchanges (including exchange attribute information) in the Mnesia database.
A RabbitMQ cluster network partition is commonly called split brain. If one node in a RabbitMQ cluster cannot get in touch with another node within a certain period (determined by the net_ticktime setting, 60 s by default), Mnesia considers that the unreachable node has failed. If the two nodes regain contact but each has already considered the other to have failed, Mnesia concludes that a network partition has occurred. RabbitMQ provides three modes for handling network partitions automatically: pause-minority mode, pause-if-all-down mode and autoheal mode (the default is ignore mode, that is, manual handling is required).
In pause-minority mode, when RabbitMQ discovers that other nodes have gone down, it automatically pauses the nodes that consider themselves to be in the minority (for example, fewer than or equal to half of the total number of nodes). As soon as a network partition occurs, the "minority" nodes pause immediately and only resume after the partition is repaired. This guarantees that, when a network partition occurs, at most the nodes in one partition continue to run (a way of giving up availability to guarantee consistency).
In pause-if-all-down mode, RabbitMQ automatically pauses the nodes that cannot communicate with the specified nodes of the cluster. If the specified nodes end up in different partitions that cannot communicate with each other, no node is paused; that case is then handled further by the ignore or autoheal mode.
In autoheal mode, once a network partition has occurred, RabbitMQ automatically decides on a winning partition and then restarts all nodes that are not in the winning partition. The winning partition is the one with the most client connections (or, if the connections are the same, the partition with the most nodes).
In actual use, once a RabbitMQ network partition has occurred, the two (or more) sides of the RabbitMQ cluster exist independently, and each side considers the other to have crashed. Queues, bindings and exchanges can be created and deleted independently on each side. For mirrored queues, each side of the network partition can have its own master, with reads and writes performed independently; other unknown behavior may also occur. Even after the cluster network partition recovers, the data in the RabbitMQ cluster cannot automatically return to the state before the partition occurred, and some queue information in the cluster remains unavailable.
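As a point of reference, such a partition is visible through RabbitMQ's management HTTP API: every entry returned by GET /api/nodes carries a partitions list naming the peers that node cannot reach. Below is a minimal detection sketch, with host, port and credentials assumed for illustration.

```python
import requests

def report_partitions(host: str = "localhost",
                      user: str = "guest",
                      password: str = "guest") -> bool:
    """Return True if any cluster node reports a network partition."""
    resp = requests.get(f"http://{host}:15672/api/nodes",
                        auth=(user, password), timeout=10)
    resp.raise_for_status()
    partitioned = False
    for node in resp.json():
        peers = node.get("partitions", [])
        if peers:
            partitioned = True
            print(f"{node['name']} cannot reach: {peers}")
    return partitioned
```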
In summary, a RabbitMQ cluster cannot handle network partitions well. RabbitMQ stores information such as queues, exchanges and bindings in Erlang's distributed database Mnesia. Network anomalies, crashes of RabbitMQ service nodes, CPU soft lockups and the like can all cause a network partition of the RabbitMQ cluster. RabbitMQ's own automatic handling modes can restore the cluster state to normal, but existing queues may remain unavailable, so components and services that connect to RabbitMQ cannot run normally.
In view of the above problems, how to detect and quickly recover from the fault condition in which the RabbitMQ cluster state is normal but the cluster queue metadata is inconsistent is the technical problem that needs to be solved.
Summary of the invention
The technical task of the invention is, in view of the above deficiencies, to provide a detection and recovery system and method for RabbitMQ cluster faults, so as to solve the problem of how to detect and quickly recover from the fault condition in which the RabbitMQ cluster state is normal but the cluster queue metadata is inconsistent.
In a first aspect, the present invention provides a detection and recovery system for RabbitMQ cluster faults, comprising:
an information acquisition module, configured to obtain health data, the health data being data related to the RabbitMQ node health check;
an abnormality detection module, configured to detect and analyze the health data and to check the consistency of the RabbitMQ cluster state and the queue metadata, to obtain a detection and analysis result;
a monitoring server, configured to receive and store the health data and the detection and analysis result, and to generate alarm information when the detection and analysis result indicates an abnormality;
a fault analysis and handling module, configured to generate a handling action for the RabbitMQ node according to the detection and analysis result, the handling action including automatic restart of the RabbitMQ service and rebuilding of the RabbitMQ cluster;
a recovery detection module, configured to verify the availability of the RabbitMQ cluster after the fault has been handled and to send the verification result to the monitoring server, the availability verification including: creating a new test queue, and selecting an existing queue to carry out message transmission verification;
a data storage module, connected to the monitoring server and configured to store monitoring data, detection and analysis results and alarm information.
Preferably, the information acquisition module collects the health data from the RabbitMQ cluster; the health data includes, but is not limited to, the RabbitMQ service state, the cluster state, log data and operating-system performance indicators.
Preferably, the abnormality detection module performs detection and analysis as follows:
by analyzing the RabbitMQ service state, the cluster state and the log data, determining whether a network partition has occurred in the RabbitMQ cluster;
determining whether the RabbitMQ service is normal;
determining whether the RabbitMQ cluster state is normal;
determining whether the resource utilization of a RabbitMQ node exceeds a threshold;
determining whether the queue metadata of each node of the RabbitMQ cluster is consistent.
Preferably, the automatic restart of the RabbitMQ service includes restarting the RabbitMQ service and restarting the node running the RabbitMQ service, and the rebuilding of the RabbitMQ cluster includes rebuilding the RabbitMQ Mnesia database;
when the alarm information indicates that the RabbitMQ node service is abnormal, the RabbitMQ service is restarted;
when the alarm information indicates that the operating system of the RabbitMQ node is abnormal, the node running the RabbitMQ service is restarted;
when the alarm information indicates that the RabbitMQ cluster metadata is inconsistent, the RabbitMQ cluster is rebuilt.
More preferably, verifying the availability of the RabbitMQ cluster further includes checking the service state of the RabbitMQ cluster, the cluster state, the metadata consistency, simulated sending and receiving of messages on existing queues, and the creation of new queues, so as to confirm that the RabbitMQ cluster has recovered to a normal and available state.
In a second aspect, the present invention provides a detection and recovery method for RabbitMQ cluster faults, in which fault detection and recovery are carried out on a RabbitMQ cluster by the detection and recovery system for RabbitMQ cluster faults according to any one of the first aspect, the method comprising:
collecting health data, the health data being data related to the RabbitMQ node health check;
detecting and analyzing the health data, and checking the consistency of the RabbitMQ cluster state and the queue metadata, to obtain a detection and analysis result;
generating alarm information when the detection and analysis result indicates an abnormality;
generating a handling action for the RabbitMQ node according to the detection and analysis result, the handling action including automatic restart of the RabbitMQ service and rebuilding of the RabbitMQ cluster;
after the fault has been handled, verifying the availability of the RabbitMQ cluster, the availability verification including: creating a new test queue, and selecting an existing queue to carry out message transmission verification.
Preferably, the health data is collected from the RabbitMQ cluster by the information acquisition module; the health data includes, but is not limited to, the RabbitMQ service state, the cluster state, log data and operating-system performance indicators.
Preferably, detecting and analyzing the health data comprises:
by analyzing the RabbitMQ service state, the cluster state and the log data, determining whether a network partition has occurred in the RabbitMQ cluster;
determining whether the RabbitMQ service is normal;
determining whether the RabbitMQ cluster state is normal;
determining whether the resource utilization of a RabbitMQ node exceeds a threshold;
determining whether the queue metadata of each node of the RabbitMQ cluster is consistent.
Preferably, the automatic restart of the RabbitMQ service includes restarting the RabbitMQ service and restarting the node running the RabbitMQ service, and the rebuilding of the RabbitMQ cluster includes rebuilding the RabbitMQ Mnesia database;
when the alarm information indicates that the RabbitMQ node service is abnormal, the RabbitMQ service is restarted;
when the alarm information indicates that the operating system of the RabbitMQ node is abnormal, the node running the RabbitMQ service is restarted;
when the alarm information indicates that the RabbitMQ cluster metadata is inconsistent, the RabbitMQ cluster is rebuilt.
More preferably, verifying the availability of the RabbitMQ cluster further includes checking the service state of the RabbitMQ cluster, the cluster state, the metadata consistency, simulated sending and receiving of messages on existing queues, and the creation of new queues, so as to confirm that the RabbitMQ cluster has recovered to a normal and available state.
The detection and recovery system and method for RabbitMQ cluster faults of the invention have the following advantages:
1. The consistency of the RabbitMQ cluster queue metadata is checked automatically, so service anomalies caused by inconsistent queue metadata can be discovered in time, improving the timeliness of problem discovery;
2. Abnormalities are repaired and verified automatically and by category, improving platform service availability.
Brief description of the drawings
In order to describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required in the description of the embodiments are briefly introduced below. It should be apparent that the drawings described below are only some embodiments of the invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
The present invention is further described below with reference to the drawings.
Figure 1 is a structural block diagram of the detection and recovery system for RabbitMQ cluster faults of Embodiment 1;
Figure 2 is a logical block diagram of queue data storage in the normal cluster mode of RabbitMQ, in the detection and recovery system for RabbitMQ cluster faults of Embodiment 1.
Specific embodiments
The present invention is further described below with reference to the accompanying drawings and specific embodiments, so that those skilled in the art can better understand and practice the present invention; the illustrated embodiments are, however, not intended to limit the invention, and, where no conflict arises, the embodiments of the present invention and the technical features in the embodiments may be combined with each other.
The embodiments of the present invention provide a detection and recovery system and method for RabbitMQ cluster faults, used to solve the technical problem of how to detect and quickly recover from the fault condition in which the RabbitMQ cluster state is normal but the cluster queue metadata is inconsistent.
Embodiment 1:
The detection and recovery system for RabbitMQ cluster faults of the invention includes an information acquisition module, an abnormality detection module, a monitoring server, a fault analysis and handling module, a recovery detection module and a data storage module.
The information acquisition module is connected to the RabbitMQ cluster and is used for obtaining health data from the RabbitMQ cluster. The health data is data related to the RabbitMQ node health check, including the RabbitMQ service state, the cluster state, log data and operating-system performance indicators; the operating-system performance indicators include operating-system CPU, memory, disk, system load and the like.
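One possible realization of such an acquisition step, given only as a hedged sketch, is to poll the RabbitMQ management HTTP API for service and cluster state and to read operating-system indicators with the psutil library; the endpoint, credentials and the monitoring-server upload URL below are assumptions, not details taken from the patent.

```python
import requests
import psutil  # third-party library used here for OS indicators (assumed available)

MGMT_URL = "http://localhost:15672/api"   # assumed management API endpoint
AUTH = ("monitor", "monitor-password")    # assumed monitoring account

def collect_health_data() -> dict:
    """Gather RabbitMQ service/cluster state plus OS indicators for one node."""
    nodes = requests.get(f"{MGMT_URL}/nodes", auth=AUTH, timeout=10).json()
    overview = requests.get(f"{MGMT_URL}/overview", auth=AUTH, timeout=10).json()
    return {
        "cluster_name": overview.get("cluster_name"),
        "running_nodes": [n["name"] for n in nodes if n.get("running")],
        "partitions": {n["name"]: n.get("partitions", []) for n in nodes},
        "os": {
            "cpu_percent": psutil.cpu_percent(interval=1),
            "mem_percent": psutil.virtual_memory().percent,
            "disk_percent": psutil.disk_usage("/").percent,
            "load_avg": psutil.getloadavg(),
        },
    }

def upload(health: dict, monitor_url: str = "http://monitor:8080/health") -> None:
    """Push one sample to the monitoring server (URL is illustrative)."""
    requests.post(monitor_url, json=health, timeout=10)
```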
The monitoring server includes a data processing module and an alarm management module. The data processing module is connected to the information acquisition module, and the health data collected by the information acquisition module is uploaded to and stored in the data processing module.
The abnormality detection module is connected to the monitoring server; it obtains the health data from the data processing module and detects and analyzes it. At the same time, the abnormality detection module is connected to the RabbitMQ cluster, collects queue data from the RabbitMQ cluster, and checks the consistency of the RabbitMQ cluster state and the queue metadata. Through the above analysis, the abnormality detection module outputs a detection and analysis result and sends it to the alarm management module, which evaluates the detection and analysis result.
The abnormality detection module obtains the health data and detects and analyzes it; the detection and analysis includes (a code sketch of these checks follows the list):
by analyzing the RabbitMQ service state, the cluster state and the log data, determining whether a network partition has occurred in the RabbitMQ cluster;
determining whether the RabbitMQ service is normal;
determining whether the RabbitMQ cluster state is normal;
determining whether the resource utilization of a RabbitMQ node exceeds a threshold;
determining whether the queue metadata of each node of the RabbitMQ cluster is consistent.
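Below is a hedged sketch of how these checks could be expressed over the health data gathered as in the acquisition sketch above; the alarm names and the CPU threshold are assumptions for illustration, and the queue-metadata comparison itself is sketched separately in Embodiment 2.

```python
def analyze(health: dict, cpu_threshold: float = 90.0) -> list:
    """Derive a list of alarm tags from one round of collected health data."""
    alarms = []
    # RabbitMQ service / cluster state: at least the local node should be running
    if not health["running_nodes"]:
        alarms.append("rabbitmq_service_abnormal")
    # network partition: any non-empty partitions list means split brain
    if any(health["partitions"].values()):
        alarms.append("network_partition")
    # node resource utilization compared against a threshold
    if health["os"]["cpu_percent"] > cpu_threshold:
        alarms.append("node_resource_over_threshold")
    return alarms
```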
The fault analysis and handling module is connected to the abnormality detection module; it obtains the detection and analysis result from the abnormality detection module and generates a handling action for the RabbitMQ node according to the detection and analysis result. The handling action includes automatic restart of the RabbitMQ service and rebuilding of the RabbitMQ cluster. The automatic restart of the RabbitMQ service includes restarting the RabbitMQ service and restarting the node running the RabbitMQ service; the rebuilding of the RabbitMQ cluster includes rebuilding the RabbitMQ Mnesia database. When the alarm information indicates that the RabbitMQ node service is abnormal, the RabbitMQ service is restarted; when the alarm information indicates that the operating system of the RabbitMQ node is abnormal, the node running the RabbitMQ service is restarted; when the alarm information indicates that the RabbitMQ cluster metadata is inconsistent, the RabbitMQ cluster is rebuilt.
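By way of illustration only, this classification could drive handling actions roughly as follows; the alarm names, the seed node used when rejoining the cluster, and the use of systemctl for restarts are assumptions rather than details fixed by the patent, while the rabbitmqctl subcommands shown (stop_app, reset, join_cluster, start_app) are standard.

```python
import subprocess

def run(*cmd: str) -> None:
    subprocess.run(list(cmd), check=True)

def handle_alarm(alarm: str) -> None:
    if alarm == "rabbitmq_service_abnormal":
        # restart only the RabbitMQ service on the affected node
        run("systemctl", "restart", "rabbitmq-server")
    elif alarm == "node_os_abnormal":
        # restart the whole node that runs the RabbitMQ service
        run("systemctl", "reboot")
    elif alarm == "cluster_metadata_inconsistent":
        # rebuild the node's Mnesia data and rejoin it to the cluster
        run("rabbitmqctl", "stop_app")
        run("rabbitmqctl", "reset")                         # wipes the node's Mnesia database
        run("rabbitmqctl", "join_cluster", "rabbit@node1")  # seed node name is illustrative
        run("rabbitmqctl", "start_app")
```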
The recovery detection module is connected to the fault analysis and handling module and to the monitoring server, respectively. After the fault has been handled, it verifies the availability of the RabbitMQ cluster and sends the verification result to the monitoring server. The availability verification includes: creating a new test queue, and selecting an existing queue to carry out message transmission verification. It further includes checking the service state of the RabbitMQ cluster, the cluster state, the metadata consistency, simulated sending and receiving of messages on existing queues, and the creation of new queues, so as to confirm that the RabbitMQ cluster has recovered to a normal and available state.
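A minimal sketch of the message-transmission part of this verification, written with the pika client; the broker address, credentials and the name of the existing queue are assumptions, and the patent does not mandate a particular client library.

```python
import time
import uuid
import pika  # Python AMQP 0-9-1 client

def verify_availability(host: str = "localhost", existing_queue: str = "orders") -> bool:
    """Round-trip a message through a newly created test queue and publish a probe
    to an already existing queue; True means both operations succeeded."""
    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = connection.channel()
    try:
        # 1. newly created test queue: declare, publish, consume
        test_queue = f"ha-verify-{uuid.uuid4().hex}"
        channel.queue_declare(queue=test_queue, auto_delete=True)
        channel.basic_publish(exchange="", routing_key=test_queue, body=b"ping")
        method = body = None
        for _ in range(10):  # allow a moment for the broker to route the message
            method, _properties, body = channel.basic_get(queue=test_queue, auto_ack=True)
            if method is not None:
                break
            time.sleep(0.2)
        round_trip_ok = method is not None and body == b"ping"

        # 2. existing queue: passive declare raises if the queue is missing,
        #    then a probe message is published to it
        channel.queue_declare(queue=existing_queue, passive=True)
        channel.basic_publish(exchange="", routing_key=existing_queue, body=b"probe")
        return round_trip_ok
    finally:
        connection.close()
```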
The data storage module is connected to the monitoring server and is used for storing monitoring data, detection and analysis results and alarm information.
The detection and recovery system for RabbitMQ cluster faults of the invention can realize automatic detection of RabbitMQ cluster faults and fast recovery.
Embodiment 2:
The detection and recovery method for RabbitMQ cluster faults of the invention is implemented on the basis of the detection and recovery system for RabbitMQ cluster faults disclosed in Embodiment 1, and includes the following steps:
S100: collecting health data through the information acquisition module and uploading the health data to the monitoring server, the health data being data related to the RabbitMQ node health check;
S200: detecting and analyzing the health data through the abnormality detection module, checking the consistency of the RabbitMQ cluster state and the queue metadata to obtain a detection and analysis result, and uploading the detection and analysis result to the monitoring server;
S300: generating, by the monitoring server, alarm information when the detection and analysis result indicates an abnormality;
S400: generating, by the fault analysis and handling module, a handling action for the RabbitMQ node according to the detection and analysis result, the handling action including automatic restart of the RabbitMQ service and rebuilding of the RabbitMQ cluster;
S500: after the fault has been handled, verifying the availability of the RabbitMQ cluster through the recovery detection module, and uploading the verification result to the monitoring server; the availability verification includes: creating a new test queue, and selecting an existing queue to carry out message transmission verification.
In step S100, the health data is collected from the RabbitMQ cluster by the information acquisition module; the health data includes, but is not limited to, the RabbitMQ service state, the cluster state, log data and operating-system performance indicators.
In step S200, detecting and analyzing the health data comprises:
by analyzing the RabbitMQ service state, the cluster state and the log data, determining whether a network partition has occurred in the RabbitMQ cluster;
determining whether the RabbitMQ service is normal;
determining whether the RabbitMQ cluster state is normal;
determining whether the resource utilization of a RabbitMQ node exceeds a threshold;
determining whether the queue metadata of each node of the RabbitMQ cluster is consistent.
The storage location and content of queue information in the normal cluster mode of RabbitMQ are shown in Figure 2. The queue metadata mainly includes: the queue name, durability, auto-delete and owner node. The abnormality detection module compares and analyzes whether the queue metadata of each node of the RabbitMQ cluster is consistent, and thus determines whether an abnormality of inconsistent cluster queue metadata has occurred (a sketch of such a comparison follows).
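As an illustration of such a comparison, the attributes named above (queue name, durability, auto-delete, owner node) can be read through the management API of each node and compared; the node URLs and credentials below are assumptions.

```python
import requests

FIELDS = ("durable", "auto_delete", "node")   # metadata compared for every queue

def queue_view(api_base: str, auth=("guest", "guest")) -> dict:
    """Return {queue name: metadata tuple} as reported by one node's management API."""
    queues = requests.get(f"{api_base}/api/queues", auth=auth, timeout=10).json()
    return {q["name"]: tuple(q.get(field) for field in FIELDS) for q in queues}

def metadata_consistent(node_apis: list) -> bool:
    """node_apis, e.g. ["http://node1:15672", "http://node2:15672"], is illustrative."""
    views = [queue_view(api) for api in node_apis]
    return all(view == views[0] for view in views[1:])
```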
In step S400, the automatic restart of the RabbitMQ service includes restarting the RabbitMQ service and restarting the node running the RabbitMQ service; the rebuilding of the RabbitMQ cluster includes rebuilding the RabbitMQ Mnesia database. When the alarm information indicates that the RabbitMQ node service is abnormal, the RabbitMQ service is restarted; when the alarm information indicates that the operating system of the RabbitMQ node is abnormal, the node running the RabbitMQ service is restarted; when the alarm information indicates that the RabbitMQ cluster metadata is inconsistent, the RabbitMQ cluster is rebuilt.
In step S500, verifying the availability of the RabbitMQ cluster further includes checking the service state of the RabbitMQ cluster, the cluster state, the metadata consistency, simulated sending and receiving of messages on existing queues, and the creation of new queues, so as to confirm that the RabbitMQ cluster has recovered to a normal and available state.
The embodiments described above are only preferred embodiments given to fully illustrate the present invention, and the protection scope of the present invention is not limited thereto. Equivalent substitutions or transformations made by those skilled in the art on the basis of the present invention fall within the protection scope of the present invention. The protection scope of the present invention is defined by the claims.

Claims (10)

  1. A detection and recovery system for RabbitMQ cluster faults, characterized by comprising:
    an information acquisition module, configured to obtain health data, the health data being data related to the RabbitMQ node health check;
    an abnormality detection module, configured to detect and analyze the health data and to check the consistency of the RabbitMQ cluster state and the queue metadata, to obtain a detection and analysis result;
    a monitoring server, configured to receive and store the health data and the detection and analysis result, and to generate alarm information when the detection and analysis result indicates an abnormality;
    a fault analysis and handling module, configured to generate a handling action for the RabbitMQ node according to the detection and analysis result, the handling action including automatic restart of the RabbitMQ service and rebuilding of the RabbitMQ cluster;
    a recovery detection module, configured to verify the availability of the RabbitMQ cluster after the fault has been handled and to send the verification result to the monitoring server, the availability verification including: creating a new test queue, and selecting an existing queue to carry out message transmission verification;
    a data storage module, connected to the monitoring server and configured to store monitoring data, detection and analysis results and alarm information.
  2. The detection and recovery system for RabbitMQ cluster faults according to claim 1, characterized in that the information acquisition module collects the health data from the RabbitMQ cluster, and the health data includes, but is not limited to, the RabbitMQ service state, the cluster state, log data and operating-system performance indicators.
  3. The detection and recovery system for RabbitMQ cluster faults according to claim 1, characterized in that the abnormality detection module is configured to perform detection and analysis as follows:
    by analyzing the RabbitMQ service state, the cluster state and the log data, determining whether a network partition has occurred in the RabbitMQ cluster;
    determining whether the RabbitMQ service is normal;
    determining whether the RabbitMQ cluster state is normal;
    determining whether the resource utilization of a RabbitMQ node exceeds a threshold;
    determining whether the queue metadata of each node of the RabbitMQ cluster is consistent.
  4. The detection and recovery system for RabbitMQ cluster faults according to claim 1, characterized in that the automatic restart of the RabbitMQ service includes restarting the RabbitMQ service and restarting the node running the RabbitMQ service, and the rebuilding of the RabbitMQ cluster includes rebuilding the RabbitMQ Mnesia database;
    when the alarm information indicates that the RabbitMQ node service is abnormal, the RabbitMQ service is restarted;
    when the alarm information indicates that the operating system of the RabbitMQ node is abnormal, the node running the RabbitMQ service is restarted;
    when the alarm information indicates that the RabbitMQ cluster metadata is inconsistent, the RabbitMQ cluster is rebuilt.
  5. The detection and recovery system for RabbitMQ cluster faults according to claim 1, characterized in that verifying the availability of the RabbitMQ cluster further includes checking the service state of the RabbitMQ cluster, the cluster state, the metadata consistency, simulated sending and receiving of messages on existing queues, and the creation of new queues, so as to confirm that the RabbitMQ cluster has recovered to a normal and available state.
  6. A detection and recovery method for RabbitMQ cluster faults, characterized in that fault detection and recovery are carried out on a RabbitMQ cluster by the detection and recovery system for RabbitMQ cluster faults according to any one of claims 1-5, the method comprising:
    collecting health data, the health data being data related to the RabbitMQ node health check;
    detecting and analyzing the health data, and checking the consistency of the RabbitMQ cluster state and the queue metadata, to obtain a detection and analysis result;
    generating alarm information when the detection and analysis result indicates an abnormality;
    generating a handling action for the RabbitMQ node according to the detection and analysis result, the handling action including automatic restart of the RabbitMQ service and rebuilding of the RabbitMQ cluster;
    after the fault has been handled, verifying the availability of the RabbitMQ cluster, the availability verification including: creating a new test queue, and selecting an existing queue to carry out message transmission verification.
  7. The detection and recovery method for RabbitMQ cluster faults according to claim 6, characterized in that the health data is collected from the RabbitMQ cluster by the information acquisition module, and the health data includes, but is not limited to, the RabbitMQ service state, the cluster state, log data and operating-system performance indicators.
  8. The detection and recovery method for RabbitMQ cluster faults according to claim 6, characterized in that detecting and analyzing the health data comprises:
    by analyzing the RabbitMQ service state, the cluster state and the log data, determining whether a network partition has occurred in the RabbitMQ cluster;
    determining whether the RabbitMQ service is normal;
    determining whether the RabbitMQ cluster state is normal;
    determining whether the resource utilization of a RabbitMQ node exceeds a threshold;
    determining whether the queue metadata of each node of the RabbitMQ cluster is consistent.
  9. The detection and recovery method for RabbitMQ cluster faults according to claim 6, characterized in that the automatic restart of the RabbitMQ service includes restarting the RabbitMQ service and restarting the node running the RabbitMQ service, and the rebuilding of the RabbitMQ cluster includes rebuilding the RabbitMQ Mnesia database;
    when the alarm information indicates that the RabbitMQ node service is abnormal, the RabbitMQ service is restarted;
    when the alarm information indicates that the operating system of the RabbitMQ node is abnormal, the node running the RabbitMQ service is restarted;
    when the alarm information indicates that the RabbitMQ cluster metadata is inconsistent, the RabbitMQ cluster is rebuilt.
  10. The detection and recovery method for RabbitMQ cluster faults according to claim 6, characterized in that verifying the availability of the RabbitMQ cluster further includes checking the service state of the RabbitMQ cluster, the cluster state, the metadata consistency, simulated sending and receiving of messages on existing queues, and the creation of new queues, so as to confirm that the RabbitMQ cluster has recovered to a normal and available state.
CN201910593885.3A 2019-07-03 2019-07-03 Detection and recovery system and method for RabbitMQ cluster faults Pending CN110290012A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910593885.3A CN110290012A (en) 2019-07-03 2019-07-03 Detection and recovery system and method for RabbitMQ cluster faults

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910593885.3A CN110290012A (en) 2019-07-03 2019-07-03 Detection and recovery system and method for RabbitMQ cluster faults

Publications (1)

Publication Number Publication Date
CN110290012A true CN110290012A (en) 2019-09-27

Family

ID=68020472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910593885.3A Pending CN110290012A (en) 2019-07-03 2019-07-03 Detection and recovery system and method for RabbitMQ cluster faults

Country Status (1)

Country Link
CN (1) CN110290012A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061586A (en) * 2019-12-05 2020-04-24 深圳先进技术研究院 Container cloud platform anomaly detection method and system and electronic equipment
CN111597079A (en) * 2020-05-21 2020-08-28 山东汇贸电子口岸有限公司 Method and system for detecting and recovering MySQL Galera cluster fault
CN111694694A (en) * 2020-05-22 2020-09-22 北京三快在线科技有限公司 Database cluster processing method and device, storage medium and node
CN111865695A (en) * 2020-07-28 2020-10-30 浪潮云信息技术股份公司 Method and system for automatic fault handling in cloud environment
CN112003929A (en) * 2020-08-21 2020-11-27 苏州浪潮智能科技有限公司 RabbitMQ cluster-based thermal restoration method, system, device and medium
CN112115022A (en) * 2020-08-27 2020-12-22 北京航空航天大学 AADL-based IMA system health monitoring test method
CN112118282A (en) * 2020-07-29 2020-12-22 苏州浪潮智能科技有限公司 Service node elastic expansion method based on RabbitMQ cluster
CN112272113A (en) * 2020-10-23 2021-01-26 上海万向区块链股份公司 Method and system for monitoring and automatically switching based on various block chain nodes
CN112486776A (en) * 2020-12-07 2021-03-12 中国船舶重工集团公司第七一六研究所 Cluster member node availability monitoring equipment and method
CN112486761A (en) * 2020-11-19 2021-03-12 苏州浪潮智能科技有限公司 Cable-free cluster health state detection method
CN112714013A (en) * 2020-12-22 2021-04-27 浪潮云信息技术股份公司 Application fault positioning method in cloud environment
CN113438111A (en) * 2021-06-23 2021-09-24 华云数据控股集团有限公司 Method for restoring RabbitMQ network partition based on Raft distribution and application
CN114827145A (en) * 2022-04-24 2022-07-29 阿里巴巴(中国)有限公司 Server cluster system, and metadata access method and device
CN115037595A (en) * 2022-04-29 2022-09-09 北京华耀科技有限公司 Network recovery method, device, equipment and storage medium
CN117395263A (en) * 2023-12-12 2024-01-12 苏州元脑智能科技有限公司 Data synchronization method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106445754A (en) * 2016-09-13 2017-02-22 郑州云海信息技术有限公司 Method and system for inspecting cluster health status and cluster server
US20180048587A1 (en) * 2016-05-16 2018-02-15 Yang Bai Port switch service
CN109286529A (en) * 2018-10-31 2019-01-29 武汉烽火信息集成技术有限公司 A kind of method and system for restoring RabbitMQ network partition
CN109525456A (en) * 2018-11-07 2019-03-26 郑州云海信息技术有限公司 A kind of server monitoring method, device and system
CN109947730A (en) * 2017-07-25 2019-06-28 中兴通讯股份有限公司 Metadata restoration methods, device, distributed file system and readable storage medium storing program for executing

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180048587A1 (en) * 2016-05-16 2018-02-15 Yang Bai Port switch service
CN106445754A (en) * 2016-09-13 2017-02-22 郑州云海信息技术有限公司 Method and system for inspecting cluster health status and cluster server
CN109947730A (en) * 2017-07-25 2019-06-28 中兴通讯股份有限公司 Metadata restoration methods, device, distributed file system and readable storage medium storing program for executing
CN109286529A (en) * 2018-10-31 2019-01-29 武汉烽火信息集成技术有限公司 A kind of method and system for restoring RabbitMQ network partition
CN109525456A (en) * 2018-11-07 2019-03-26 郑州云海信息技术有限公司 A kind of server monitoring method, device and system

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061586A (en) * 2019-12-05 2020-04-24 深圳先进技术研究院 Container cloud platform anomaly detection method and system and electronic equipment
CN111061586B (en) * 2019-12-05 2023-09-19 深圳先进技术研究院 Container cloud platform anomaly detection method and system and electronic equipment
CN111597079A (en) * 2020-05-21 2020-08-28 山东汇贸电子口岸有限公司 Method and system for detecting and recovering MySQL Galera cluster fault
CN111597079B (en) * 2020-05-21 2023-12-05 山东汇贸电子口岸有限公司 Method and system for detecting and recovering MySQL Galera cluster faults
CN111694694A (en) * 2020-05-22 2020-09-22 北京三快在线科技有限公司 Database cluster processing method and device, storage medium and node
CN111865695A (en) * 2020-07-28 2020-10-30 浪潮云信息技术股份公司 Method and system for automatic fault handling in cloud environment
CN112118282A (en) * 2020-07-29 2020-12-22 苏州浪潮智能科技有限公司 Service node elastic expansion method based on RabbitMQ cluster
CN112118282B (en) * 2020-07-29 2022-05-13 苏州浪潮智能科技有限公司 Service node elastic expansion method based on RabbitMQ cluster
CN112003929A (en) * 2020-08-21 2020-11-27 苏州浪潮智能科技有限公司 RabbitMQ cluster-based thermal restoration method, system, device and medium
CN112003929B (en) * 2020-08-21 2022-05-13 苏州浪潮智能科技有限公司 RabbitMQ cluster-based thermal restoration method, system, device and medium
CN112115022A (en) * 2020-08-27 2020-12-22 北京航空航天大学 AADL-based IMA system health monitoring test method
CN112115022B (en) * 2020-08-27 2022-03-08 北京航空航天大学 AADL-based IMA system health monitoring test method
CN112272113B (en) * 2020-10-23 2021-10-22 上海万向区块链股份公司 Method and system for monitoring and automatically switching based on various block chain nodes
CN112272113A (en) * 2020-10-23 2021-01-26 上海万向区块链股份公司 Method and system for monitoring and automatically switching based on various block chain nodes
CN112486761A (en) * 2020-11-19 2021-03-12 苏州浪潮智能科技有限公司 Cable-free cluster health state detection method
CN112486776A (en) * 2020-12-07 2021-03-12 中国船舶重工集团公司第七一六研究所 Cluster member node availability monitoring equipment and method
CN112714013A (en) * 2020-12-22 2021-04-27 浪潮云信息技术股份公司 Application fault positioning method in cloud environment
CN112714013B (en) * 2020-12-22 2023-02-03 浪潮云信息技术股份公司 Application fault positioning method in cloud environment
CN113438111A (en) * 2021-06-23 2021-09-24 华云数据控股集团有限公司 Method for restoring RabbitMQ network partition based on Raft distribution and application
CN114827145A (en) * 2022-04-24 2022-07-29 阿里巴巴(中国)有限公司 Server cluster system, and metadata access method and device
CN114827145B (en) * 2022-04-24 2024-01-05 阿里巴巴(中国)有限公司 Server cluster system, metadata access method and device
CN115037595A (en) * 2022-04-29 2022-09-09 北京华耀科技有限公司 Network recovery method, device, equipment and storage medium
CN115037595B (en) * 2022-04-29 2024-04-23 北京华耀科技有限公司 Network recovery method, device, equipment and storage medium
CN117395263A (en) * 2023-12-12 2024-01-12 苏州元脑智能科技有限公司 Data synchronization method, device, equipment and storage medium
CN117395263B (en) * 2023-12-12 2024-03-12 苏州元脑智能科技有限公司 Data synchronization method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110290012A (en) Detection and recovery system and method for RabbitMQ cluster faults
CN105488610B (en) Fault real-time analysis and diagnosis method for power application system
CN107015872B (en) The processing method and processing device of monitoring data
TWI361595B (en) Pool-based network diagnostic systems and methods
US9104572B1 (en) Automated root cause analysis
JP4893828B2 (en) Network failure detection system
CN107453929B (en) Cluster system self-construction method and device and cluster system
CN110287081A (en) A kind of service monitoring system and method
CN112118174B (en) Software defined data gateway
JP4466615B2 (en) Operation management system, monitoring device, monitored device, operation management method and program
CN113242153B (en) Application-oriented monitoring analysis method based on network traffic monitoring
WO2007020118A1 (en) Cluster partition recovery using application state-based priority determination to award a quorum
CN112636942B (en) Method and device for monitoring service host node
CN109491975A (en) Distributed cache system
CN111124830B (en) Micro-service monitoring method and device
CN105827678B (en) Communication means and node under a kind of framework based on High Availabitity
CN104573428B (en) A kind of method and system for improving server cluster resource availability
CN106960060A (en) The management method and device of a kind of data-base cluster
CN112333249A (en) Business service system and method
CN107943657A (en) A kind of linux system problem automatic analysis method and system
CN109284294A (en) Method and device for collecting data, storage medium and processor
CN109586989A (en) A kind of state detection method, device and group system
CN114553747A (en) Method, device, terminal and storage medium for detecting abnormality of redis cluster
CN110545197B (en) Node state monitoring method and device
Sahoo et al. Providing persistent and consistent resources through event log analysis and predictions for large-scale computing systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190927)