CN110290012A - Detection and recovery system and method for RabbitMQ cluster faults - Google Patents

Detection and recovery system and method for RabbitMQ cluster faults

Info

Publication number
CN110290012A
Authority
CN
China
Prior art keywords
rabbitmq
cluster
detection
node
service
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910593885.3A
Other languages
Chinese (zh)
Inventor
宋伟
蔡卫卫
谢涛涛
赖振
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Cloud Information Technology Co Ltd
Original Assignee
Inspur Cloud Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Cloud Information Technology Co Ltd
Priority to CN201910593885.3A
Publication of CN110290012A
Legal status: Pending


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06: Management of faults, events, alarms or notifications
    • H04L 41/0631: Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L 41/06: Management of faults, events, alarms or notifications
    • H04L 41/0654: Management of faults, events, alarms or notifications using network fault recovery
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L 67/00: Network arrangements or protocols for supporting network services or applications
    • H04L 67/01: Protocols
    • H04L 67/10: Protocols in which an application is distributed across nodes in the network
    • H04L 67/1097: Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a detection and recovery system and method for RabbitMQ cluster faults, belonging to the field of cloud computing. The technical problem to be solved is how to detect and quickly recover from the fault condition in which the RabbitMQ cluster state is normal but the cluster queue metadata is inconsistent. Structurally, the system comprises an information acquisition module, an abnormality detection module, a monitoring server, a fault analysis and handling module, a recovery detection module and a data storage module. The method includes: collecting health data; detecting and analyzing the health data, and checking the consistency of the RabbitMQ cluster state and the queue metadata, to obtain a detection and analysis result; generating alarm information when the detection and analysis result indicates an abnormality; generating a handling action for the RabbitMQ node according to the detection and analysis result; and, after the fault has been handled, verifying the availability of the RabbitMQ cluster.

Description

Detection and recovery system and method for RabbitMQ cluster faults
Technical field
The present invention relates to the field of cloud computing, and in particular to a detection and recovery system and method for RabbitMQ cluster faults.
Background art
AMQP (Advanced Message Queuing Protocol) is characterized mainly by message orientation, queuing, routing (including point-to-point and publish/subscribe), reliability and security. RabbitMQ is an open-source message-oriented middleware implementation of AMQP, mainly used for storing and forwarding messages in distributed systems. The RabbitMQ server is written in the Erlang language and supports a variety of clients, such as Java, Python and C.
The cluster modes provided by RabbitMQ are: normal cluster mode and mirrored cluster mode.
Normal cluster mode is the default cluster mode. Taking three nodes (rabbit01, rabbit02, rabbit03) as an example: for a queue, the message entities exist only on one of the nodes, for example rabbit01 (or rabbit02, or rabbit03); rabbit01, rabbit02 and rabbit03 hold only identical metadata, i.e. the structure of the queue. After a message enters the queue on rabbit01, when a consumer consumes from rabbit02, RabbitMQ temporarily transfers the message between rabbit01 and rabbit02: the message entity is taken out of rabbit01 and sent to the consumer through rabbit02. Consumers should therefore connect to each node as far as possible and take messages from each of them; that is, for the same logical queue, a physical queue has to be established on multiple nodes. Otherwise, no matter whether the consumer connects to rabbit01 or rabbit02, the output always comes from rabbit01, which creates a bottleneck. After the rabbit01 node fails, the rabbit02 node cannot obtain the message entities on rabbit01 that have not yet been consumed. If message persistence has been enabled, consumption can only continue after rabbit01 recovers; if persistence has not been enabled, the messages are lost.
Mirrored cluster mode: each node holds a copy of the messages in a queue, so when a single node fails the whole cluster can still provide service (although in practice it is found that a cluster whose state is normal but whose metadata is inconsistent cannot provide service normally either). However, because the data has to be replicated across multiple nodes, throughput declines while availability increases. In terms of the implementation mechanism, a mirrored queue internally implements an election algorithm: there is one master and several slaves, and the messages in the queue are based on the master. For publishing, any node can be connected; inside RabbitMQ, if that node is not the master, the message is forwarded to the master, the master sends the message to the other slave nodes, the message is then processed locally, and the replicated message is multicast to the other nodes for storage. For consuming, any node can be connected, and the consumption request is forwarded to the master. To guarantee message reliability, the consumer needs to send an ack confirmation; only after the master receives the ack is the message deleted, and the ack is synchronized (asynchronously by default) to the other nodes so that the slaves delete the message as well. If the master node fails, the mirrored queue automatically elects a node (the slave with the longest message queue) as the new master to serve as the reference for message consumption; in this case, because ack synchronization is asynchronous by default, the ack may not have been synchronized to all nodes. If a slave node fails, the state of the other nodes in the mirrored-queue cluster does not change.
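For reference only (the patent text does not prescribe it), mirroring of this kind is normally switched on by setting a policy on the vhost rather than per queue. Below is a minimal sketch that invokes the standard rabbitmqctl set_policy command from Python; the policy name, pattern and vhost used here are illustrative assumptions.

```python
import subprocess

def enable_mirroring(vhost: str = "/", policy_name: str = "ha-all") -> None:
    """Apply an ha-mode=all policy so every queue in the vhost is mirrored
    to all cluster nodes (illustrative helper, not part of the claimed system)."""
    subprocess.run(
        [
            "rabbitmqctl", "set_policy",
            "-p", vhost,
            policy_name,
            ".*",                  # pattern: match every queue name
            '{"ha-mode":"all"}',   # definition: mirror to all nodes
            "--apply-to", "queues",
        ],
        check=True,
    )

if __name__ == "__main__":
    enable_mirroring()
```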
Mnesia is a distributed database module that can automatically synchronize the database across multiple Erlang nodes. The RabbitMQ service stores information such as queues (including queue attribute information), messages, vhosts, users and exchanges (including exchange attribute information) in the Mnesia database.
A RabbitMQ cluster network partition is commonly called split brain. If one node in a RabbitMQ cluster cannot get in touch with another node within a certain period (determined by the net_ticktime setting, 60 s by default), Mnesia considers that the unreachable node has failed. If the two nodes regain contact but each has already considered the other to have failed, Mnesia concludes that a network partition has occurred. RabbitMQ provides three modes for handling network partitions automatically: pause-minority mode, pause-if-all-down mode and autoheal mode (the default is ignore mode, that is, manual handling is required).
In pause-minority mode, when RabbitMQ discovers that other nodes have gone down, it automatically pauses the nodes that consider themselves to be in the minority (for example, fewer than or equal to half of the total number of nodes). As soon as a network partition occurs, the "minority" nodes pause immediately and only resume after the partition is repaired. This guarantees that, when a network partition occurs, at most the nodes in one partition continue to run (a way of giving up availability to guarantee consistency).
In pause-if-all-down mode, RabbitMQ automatically pauses the nodes that cannot communicate with the specified nodes of the cluster. If the specified nodes end up in different partitions that cannot communicate with each other, no node is paused; that case is then handled further by the ignore or autoheal mode.
In autoheal mode, once a network partition has occurred, RabbitMQ automatically decides on a winning partition and then restarts all nodes that are not in the winning partition. The winning partition is the one with the most client connections (or, if the connections are the same, the partition with the most nodes).
In actual use, once a RabbitMQ network partition has occurred, the two (or more) sides of the RabbitMQ cluster exist independently, and each side considers the other to have crashed. Queues, bindings and exchanges can be created and deleted independently on each side. For mirrored queues, each side of the network partition can have its own master, with reads and writes performed independently; other unknown behavior may also occur. Even after the cluster network partition recovers, the data in the RabbitMQ cluster cannot automatically return to the state before the partition occurred, and some queue information in the cluster remains unavailable.
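As a point of reference, such a partition is visible through RabbitMQ's management HTTP API: every entry returned by GET /api/nodes carries a partitions list naming the peers that node cannot reach. Below is a minimal detection sketch, with host, port and credentials assumed for illustration.

```python
import requests

def report_partitions(host: str = "localhost",
                      user: str = "guest",
                      password: str = "guest") -> bool:
    """Return True if any cluster node reports a network partition."""
    resp = requests.get(f"http://{host}:15672/api/nodes",
                        auth=(user, password), timeout=10)
    resp.raise_for_status()
    partitioned = False
    for node in resp.json():
        peers = node.get("partitions", [])
        if peers:
            partitioned = True
            print(f"{node['name']} cannot reach: {peers}")
    return partitioned
```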
In summary, a RabbitMQ cluster cannot handle network partitions well. RabbitMQ stores information such as queues, exchanges and bindings in Erlang's distributed database Mnesia. Network anomalies, crashes of RabbitMQ service nodes, CPU soft lockups and the like can all cause a network partition of the RabbitMQ cluster. RabbitMQ's own automatic handling modes can restore the cluster state to normal, but existing queues may remain unavailable, so components and services that connect to RabbitMQ cannot run normally.
In view of the above problems, how to detect and quickly recover from the fault condition in which the RabbitMQ cluster state is normal but the cluster queue metadata is inconsistent is the technical problem that needs to be solved.
Summary of the invention
The technical task of the invention is, in view of the above deficiencies, to provide a detection and recovery system and method for RabbitMQ cluster faults, so as to solve the problem of how to detect and quickly recover from the fault condition in which the RabbitMQ cluster state is normal but the cluster queue metadata is inconsistent.
In a first aspect, the present invention provides a detection and recovery system for RabbitMQ cluster faults, comprising:
an information acquisition module, configured to obtain health data, the health data being data related to the RabbitMQ node health check;
an abnormality detection module, configured to detect and analyze the health data and to check the consistency of the RabbitMQ cluster state and the queue metadata, to obtain a detection and analysis result;
a monitoring server, configured to receive and store the health data and the detection and analysis result, and to generate alarm information when the detection and analysis result indicates an abnormality;
a fault analysis and handling module, configured to generate a handling action for the RabbitMQ node according to the detection and analysis result, the handling action including automatic restart of the RabbitMQ service and rebuilding of the RabbitMQ cluster;
a recovery detection module, configured to verify the availability of the RabbitMQ cluster after the fault has been handled and to send the verification result to the monitoring server, the availability verification including: creating a new test queue, and selecting an existing queue to carry out message transmission verification;
a data storage module, connected to the monitoring server and configured to store monitoring data, detection and analysis results and alarm information.
Preferably, the information acquisition module collects the health data from the RabbitMQ cluster; the health data includes, but is not limited to, the RabbitMQ service state, the cluster state, log data and operating-system performance indicators.
Preferably, the abnormality detection module performs detection and analysis as follows:
by analyzing the RabbitMQ service state, the cluster state and the log data, determining whether a network partition has occurred in the RabbitMQ cluster;
determining whether the RabbitMQ service is normal;
determining whether the RabbitMQ cluster state is normal;
determining whether the resource utilization of a RabbitMQ node exceeds a threshold;
determining whether the queue metadata of each node of the RabbitMQ cluster is consistent.
Preferably, the automatic restart of the RabbitMQ service includes restarting the RabbitMQ service and restarting the node running the RabbitMQ service, and the rebuilding of the RabbitMQ cluster includes rebuilding the RabbitMQ Mnesia database;
when the alarm information indicates that the RabbitMQ node service is abnormal, the RabbitMQ service is restarted;
when the alarm information indicates that the operating system of the RabbitMQ node is abnormal, the node running the RabbitMQ service is restarted;
when the alarm information indicates that the RabbitMQ cluster metadata is inconsistent, the RabbitMQ cluster is rebuilt.
More preferably, verifying the availability of the RabbitMQ cluster further includes checking the service state of the RabbitMQ cluster, the cluster state, the metadata consistency, simulated sending and receiving of messages on existing queues, and the creation of new queues, so as to confirm that the RabbitMQ cluster has recovered to a normal and available state.
In a second aspect, the present invention provides a detection and recovery method for RabbitMQ cluster faults, in which fault detection and recovery are carried out on a RabbitMQ cluster by the detection and recovery system for RabbitMQ cluster faults according to any one of the first aspect, the method comprising:
collecting health data, the health data being data related to the RabbitMQ node health check;
detecting and analyzing the health data, and checking the consistency of the RabbitMQ cluster state and the queue metadata, to obtain a detection and analysis result;
generating alarm information when the detection and analysis result indicates an abnormality;
generating a handling action for the RabbitMQ node according to the detection and analysis result, the handling action including automatic restart of the RabbitMQ service and rebuilding of the RabbitMQ cluster;
after the fault has been handled, verifying the availability of the RabbitMQ cluster, the availability verification including: creating a new test queue, and selecting an existing queue to carry out message transmission verification.
Preferably, the health data is collected from the RabbitMQ cluster by the information acquisition module; the health data includes, but is not limited to, the RabbitMQ service state, the cluster state, log data and operating-system performance indicators.
Preferably, detecting and analyzing the health data comprises:
by analyzing the RabbitMQ service state, the cluster state and the log data, determining whether a network partition has occurred in the RabbitMQ cluster;
determining whether the RabbitMQ service is normal;
determining whether the RabbitMQ cluster state is normal;
determining whether the resource utilization of a RabbitMQ node exceeds a threshold;
determining whether the queue metadata of each node of the RabbitMQ cluster is consistent.
Preferably, the automatic restart of the RabbitMQ service includes restarting the RabbitMQ service and restarting the node running the RabbitMQ service, and the rebuilding of the RabbitMQ cluster includes rebuilding the RabbitMQ Mnesia database;
when the alarm information indicates that the RabbitMQ node service is abnormal, the RabbitMQ service is restarted;
when the alarm information indicates that the operating system of the RabbitMQ node is abnormal, the node running the RabbitMQ service is restarted;
when the alarm information indicates that the RabbitMQ cluster metadata is inconsistent, the RabbitMQ cluster is rebuilt.
More preferably, verifying the availability of the RabbitMQ cluster further includes checking the service state of the RabbitMQ cluster, the cluster state, the metadata consistency, simulated sending and receiving of messages on existing queues, and the creation of new queues, so as to confirm that the RabbitMQ cluster has recovered to a normal and available state.
The detection and recovery system and method for RabbitMQ cluster faults of the invention have the following advantages:
1. The consistency of the RabbitMQ cluster queue metadata is checked automatically, so service anomalies caused by inconsistent queue metadata can be discovered in time, improving the timeliness of problem discovery;
2. Abnormalities are repaired and verified automatically and by category, improving platform service availability.
Brief description of the drawings
In order to describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required in the description of the embodiments are briefly introduced below. It should be apparent that the drawings described below are only some embodiments of the invention; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
The present invention is further described below with reference to the drawings.
Figure 1 is a structural block diagram of the detection and recovery system for RabbitMQ cluster faults of Embodiment 1;
Figure 2 is a logical block diagram of queue data storage in the normal cluster mode of RabbitMQ, in the detection and recovery system for RabbitMQ cluster faults of Embodiment 1.
Specific embodiments
The present invention is further described below with reference to the accompanying drawings and specific embodiments, so that those skilled in the art can better understand and practice the present invention; the illustrated embodiments are, however, not intended to limit the invention, and, where no conflict arises, the embodiments of the present invention and the technical features in the embodiments may be combined with each other.
The embodiments of the present invention provide a detection and recovery system and method for RabbitMQ cluster faults, used to solve the technical problem of how to detect and quickly recover from the fault condition in which the RabbitMQ cluster state is normal but the cluster queue metadata is inconsistent.
Embodiment 1:
The detection and recovery system for RabbitMQ cluster faults of the invention includes an information acquisition module, an abnormality detection module, a monitoring server, a fault analysis and handling module, a recovery detection module and a data storage module.
The information acquisition module is connected to the RabbitMQ cluster and is used for obtaining health data from the RabbitMQ cluster. The health data is data related to the RabbitMQ node health check, including the RabbitMQ service state, the cluster state, log data and operating-system performance indicators; the operating-system performance indicators include operating-system CPU, memory, disk, system load and the like.
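One possible realization of such an acquisition step, given only as a hedged sketch, is to poll the RabbitMQ management HTTP API for service and cluster state and to read operating-system indicators with the psutil library; the endpoint, credentials and the monitoring-server upload URL below are assumptions, not details taken from the patent.

```python
import requests
import psutil  # third-party library used here for OS indicators (assumed available)

MGMT_URL = "http://localhost:15672/api"   # assumed management API endpoint
AUTH = ("monitor", "monitor-password")    # assumed monitoring account

def collect_health_data() -> dict:
    """Gather RabbitMQ service/cluster state plus OS indicators for one node."""
    nodes = requests.get(f"{MGMT_URL}/nodes", auth=AUTH, timeout=10).json()
    overview = requests.get(f"{MGMT_URL}/overview", auth=AUTH, timeout=10).json()
    return {
        "cluster_name": overview.get("cluster_name"),
        "running_nodes": [n["name"] for n in nodes if n.get("running")],
        "partitions": {n["name"]: n.get("partitions", []) for n in nodes},
        "os": {
            "cpu_percent": psutil.cpu_percent(interval=1),
            "mem_percent": psutil.virtual_memory().percent,
            "disk_percent": psutil.disk_usage("/").percent,
            "load_avg": psutil.getloadavg(),
        },
    }

def upload(health: dict, monitor_url: str = "http://monitor:8080/health") -> None:
    """Push one sample to the monitoring server (URL is illustrative)."""
    requests.post(monitor_url, json=health, timeout=10)
```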
The monitoring server includes a data processing module and an alarm management module. The data processing module is connected to the information acquisition module, and the health data collected by the information acquisition module is uploaded to and stored in the data processing module.
The abnormality detection module is connected to the monitoring server; it obtains the health data from the data processing module and detects and analyzes it. At the same time, the abnormality detection module is connected to the RabbitMQ cluster, collects queue data from the RabbitMQ cluster, and checks the consistency of the RabbitMQ cluster state and the queue metadata. Through the above analysis, the abnormality detection module outputs a detection and analysis result and sends it to the alarm management module, which evaluates the detection and analysis result.
The abnormality detection module obtains the health data and detects and analyzes it; the detection and analysis includes (a code sketch of these checks follows the list):
by analyzing the RabbitMQ service state, the cluster state and the log data, determining whether a network partition has occurred in the RabbitMQ cluster;
determining whether the RabbitMQ service is normal;
determining whether the RabbitMQ cluster state is normal;
determining whether the resource utilization of a RabbitMQ node exceeds a threshold;
determining whether the queue metadata of each node of the RabbitMQ cluster is consistent.
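Below is a hedged sketch of how these checks could be expressed over the health data gathered as in the acquisition sketch above; the alarm names and the CPU threshold are assumptions for illustration, and the queue-metadata comparison itself is sketched separately in Embodiment 2.

```python
def analyze(health: dict, cpu_threshold: float = 90.0) -> list:
    """Derive a list of alarm tags from one round of collected health data."""
    alarms = []
    # RabbitMQ service / cluster state: at least the local node should be running
    if not health["running_nodes"]:
        alarms.append("rabbitmq_service_abnormal")
    # network partition: any non-empty partitions list means split brain
    if any(health["partitions"].values()):
        alarms.append("network_partition")
    # node resource utilization compared against a threshold
    if health["os"]["cpu_percent"] > cpu_threshold:
        alarms.append("node_resource_over_threshold")
    return alarms
```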
The fault analysis and handling module is connected to the abnormality detection module; it obtains the detection and analysis result from the abnormality detection module and generates a handling action for the RabbitMQ node according to the detection and analysis result. The handling action includes automatic restart of the RabbitMQ service and rebuilding of the RabbitMQ cluster. The automatic restart of the RabbitMQ service includes restarting the RabbitMQ service and restarting the node running the RabbitMQ service; the rebuilding of the RabbitMQ cluster includes rebuilding the RabbitMQ Mnesia database. When the alarm information indicates that the RabbitMQ node service is abnormal, the RabbitMQ service is restarted; when the alarm information indicates that the operating system of the RabbitMQ node is abnormal, the node running the RabbitMQ service is restarted; when the alarm information indicates that the RabbitMQ cluster metadata is inconsistent, the RabbitMQ cluster is rebuilt.
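By way of illustration only, this classification could drive handling actions roughly as follows; the alarm names, the seed node used when rejoining the cluster, and the use of systemctl for restarts are assumptions rather than details fixed by the patent, while the rabbitmqctl subcommands shown (stop_app, reset, join_cluster, start_app) are standard.

```python
import subprocess

def run(*cmd: str) -> None:
    subprocess.run(list(cmd), check=True)

def handle_alarm(alarm: str) -> None:
    if alarm == "rabbitmq_service_abnormal":
        # restart only the RabbitMQ service on the affected node
        run("systemctl", "restart", "rabbitmq-server")
    elif alarm == "node_os_abnormal":
        # restart the whole node that runs the RabbitMQ service
        run("systemctl", "reboot")
    elif alarm == "cluster_metadata_inconsistent":
        # rebuild the node's Mnesia data and rejoin it to the cluster
        run("rabbitmqctl", "stop_app")
        run("rabbitmqctl", "reset")                         # wipes the node's Mnesia database
        run("rabbitmqctl", "join_cluster", "rabbit@node1")  # seed node name is illustrative
        run("rabbitmqctl", "start_app")
```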
The recovery detection module is connected to the fault analysis and handling module and to the monitoring server, respectively. After the fault has been handled, it verifies the availability of the RabbitMQ cluster and sends the verification result to the monitoring server. The availability verification includes: creating a new test queue, and selecting an existing queue to carry out message transmission verification. It further includes checking the service state of the RabbitMQ cluster, the cluster state, the metadata consistency, simulated sending and receiving of messages on existing queues, and the creation of new queues, so as to confirm that the RabbitMQ cluster has recovered to a normal and available state.
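A minimal sketch of the message-transmission part of this verification, written with the pika client; the broker address, credentials and the name of the existing queue are assumptions, and the patent does not mandate a particular client library.

```python
import time
import uuid
import pika  # Python AMQP 0-9-1 client

def verify_availability(host: str = "localhost", existing_queue: str = "orders") -> bool:
    """Round-trip a message through a newly created test queue and publish a probe
    to an already existing queue; True means both operations succeeded."""
    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    channel = connection.channel()
    try:
        # 1. newly created test queue: declare, publish, consume
        test_queue = f"ha-verify-{uuid.uuid4().hex}"
        channel.queue_declare(queue=test_queue, auto_delete=True)
        channel.basic_publish(exchange="", routing_key=test_queue, body=b"ping")
        method = body = None
        for _ in range(10):  # allow a moment for the broker to route the message
            method, _properties, body = channel.basic_get(queue=test_queue, auto_ack=True)
            if method is not None:
                break
            time.sleep(0.2)
        round_trip_ok = method is not None and body == b"ping"

        # 2. existing queue: passive declare raises if the queue is missing,
        #    then a probe message is published to it
        channel.queue_declare(queue=existing_queue, passive=True)
        channel.basic_publish(exchange="", routing_key=existing_queue, body=b"probe")
        return round_trip_ok
    finally:
        connection.close()
```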
The data storage module is connected to the monitoring server and is used for storing monitoring data, detection and analysis results and alarm information.
The detection and recovery system for RabbitMQ cluster faults of the invention can realize automatic detection of RabbitMQ cluster faults and fast recovery.
Embodiment 2:
The detection and recovery method for RabbitMQ cluster faults of the invention is implemented on the basis of the detection and recovery system for RabbitMQ cluster faults disclosed in Embodiment 1, and includes the following steps:
S100: collecting health data through the information acquisition module and uploading the health data to the monitoring server, the health data being data related to the RabbitMQ node health check;
S200: detecting and analyzing the health data through the abnormality detection module, checking the consistency of the RabbitMQ cluster state and the queue metadata to obtain a detection and analysis result, and uploading the detection and analysis result to the monitoring server;
S300: generating, by the monitoring server, alarm information when the detection and analysis result indicates an abnormality;
S400: generating, by the fault analysis and handling module, a handling action for the RabbitMQ node according to the detection and analysis result, the handling action including automatic restart of the RabbitMQ service and rebuilding of the RabbitMQ cluster;
S500: after the fault has been handled, verifying the availability of the RabbitMQ cluster through the recovery detection module, and uploading the verification result to the monitoring server; the availability verification includes: creating a new test queue, and selecting an existing queue to carry out message transmission verification.
In step S100, the health data is collected from the RabbitMQ cluster by the information acquisition module; the health data includes, but is not limited to, the RabbitMQ service state, the cluster state, log data and operating-system performance indicators.
In step S200, detecting and analyzing the health data comprises:
by analyzing the RabbitMQ service state, the cluster state and the log data, determining whether a network partition has occurred in the RabbitMQ cluster;
determining whether the RabbitMQ service is normal;
determining whether the RabbitMQ cluster state is normal;
determining whether the resource utilization of a RabbitMQ node exceeds a threshold;
determining whether the queue metadata of each node of the RabbitMQ cluster is consistent.
The storage location and content of queue information in the normal cluster mode of RabbitMQ are shown in Figure 2. The queue metadata mainly includes: the queue name, durability, auto-delete and owner node. The abnormality detection module compares and analyzes whether the queue metadata of each node of the RabbitMQ cluster is consistent, and thus determines whether an abnormality of inconsistent cluster queue metadata has occurred (a sketch of such a comparison follows).
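As an illustration of such a comparison, the attributes named above (queue name, durability, auto-delete, owner node) can be read through the management API of each node and compared; the node URLs and credentials below are assumptions.

```python
import requests

FIELDS = ("durable", "auto_delete", "node")   # metadata compared for every queue

def queue_view(api_base: str, auth=("guest", "guest")) -> dict:
    """Return {queue name: metadata tuple} as reported by one node's management API."""
    queues = requests.get(f"{api_base}/api/queues", auth=auth, timeout=10).json()
    return {q["name"]: tuple(q.get(field) for field in FIELDS) for q in queues}

def metadata_consistent(node_apis: list) -> bool:
    """node_apis, e.g. ["http://node1:15672", "http://node2:15672"], is illustrative."""
    views = [queue_view(api) for api in node_apis]
    return all(view == views[0] for view in views[1:])
```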
In step S400, the automatic restart of the RabbitMQ service includes restarting the RabbitMQ service and restarting the node running the RabbitMQ service; the rebuilding of the RabbitMQ cluster includes rebuilding the RabbitMQ Mnesia database. When the alarm information indicates that the RabbitMQ node service is abnormal, the RabbitMQ service is restarted; when the alarm information indicates that the operating system of the RabbitMQ node is abnormal, the node running the RabbitMQ service is restarted; when the alarm information indicates that the RabbitMQ cluster metadata is inconsistent, the RabbitMQ cluster is rebuilt.
In step S500, verifying the availability of the RabbitMQ cluster further includes checking the service state of the RabbitMQ cluster, the cluster state, the metadata consistency, simulated sending and receiving of messages on existing queues, and the creation of new queues, so as to confirm that the RabbitMQ cluster has recovered to a normal and available state.
The embodiments described above are only preferred embodiments given to fully illustrate the present invention, and the protection scope of the present invention is not limited thereto. Equivalent substitutions or transformations made by those skilled in the art on the basis of the present invention fall within the protection scope of the present invention. The protection scope of the present invention is defined by the claims.

Claims (10)

  1. A detection and recovery system for RabbitMQ cluster faults, characterized by comprising:
    an information acquisition module, configured to obtain health data, the health data being data related to the RabbitMQ node health check;
    an abnormality detection module, configured to detect and analyze the health data and to check the consistency of the RabbitMQ cluster state and the queue metadata, to obtain a detection and analysis result;
    a monitoring server, configured to receive and store the health data and the detection and analysis result, and to generate alarm information when the detection and analysis result indicates an abnormality;
    a fault analysis and handling module, configured to generate a handling action for the RabbitMQ node according to the detection and analysis result, the handling action including automatic restart of the RabbitMQ service and rebuilding of the RabbitMQ cluster;
    a recovery detection module, configured to verify the availability of the RabbitMQ cluster after the fault has been handled and to send the verification result to the monitoring server, the availability verification including: creating a new test queue, and selecting an existing queue to carry out message transmission verification;
    a data storage module, connected to the monitoring server and configured to store monitoring data, detection and analysis results and alarm information.
  2. The detection and recovery system for RabbitMQ cluster faults according to claim 1, characterized in that the information acquisition module collects the health data from the RabbitMQ cluster, and the health data includes, but is not limited to, the RabbitMQ service state, the cluster state, log data and operating-system performance indicators.
  3. The detection and recovery system for RabbitMQ cluster faults according to claim 1, characterized in that the abnormality detection module is configured to perform detection and analysis as follows:
    by analyzing the RabbitMQ service state, the cluster state and the log data, determining whether a network partition has occurred in the RabbitMQ cluster;
    determining whether the RabbitMQ service is normal;
    determining whether the RabbitMQ cluster state is normal;
    determining whether the resource utilization of a RabbitMQ node exceeds a threshold;
    determining whether the queue metadata of each node of the RabbitMQ cluster is consistent.
  4. The detection and recovery system for RabbitMQ cluster faults according to claim 1, characterized in that the automatic restart of the RabbitMQ service includes restarting the RabbitMQ service and restarting the node running the RabbitMQ service, and the rebuilding of the RabbitMQ cluster includes rebuilding the RabbitMQ Mnesia database;
    when the alarm information indicates that the RabbitMQ node service is abnormal, the RabbitMQ service is restarted;
    when the alarm information indicates that the operating system of the RabbitMQ node is abnormal, the node running the RabbitMQ service is restarted;
    when the alarm information indicates that the RabbitMQ cluster metadata is inconsistent, the RabbitMQ cluster is rebuilt.
  5. The detection and recovery system for RabbitMQ cluster faults according to claim 1, characterized in that verifying the availability of the RabbitMQ cluster further includes checking the service state of the RabbitMQ cluster, the cluster state, the metadata consistency, simulated sending and receiving of messages on existing queues, and the creation of new queues, so as to confirm that the RabbitMQ cluster has recovered to a normal and available state.
  6. A detection and recovery method for RabbitMQ cluster faults, characterized in that fault detection and recovery are carried out on a RabbitMQ cluster by the detection and recovery system for RabbitMQ cluster faults according to any one of claims 1-5, the method comprising:
    collecting health data, the health data being data related to the RabbitMQ node health check;
    detecting and analyzing the health data, and checking the consistency of the RabbitMQ cluster state and the queue metadata, to obtain a detection and analysis result;
    generating alarm information when the detection and analysis result indicates an abnormality;
    generating a handling action for the RabbitMQ node according to the detection and analysis result, the handling action including automatic restart of the RabbitMQ service and rebuilding of the RabbitMQ cluster;
    after the fault has been handled, verifying the availability of the RabbitMQ cluster, the availability verification including: creating a new test queue, and selecting an existing queue to carry out message transmission verification.
  7. The detection and recovery method for RabbitMQ cluster faults according to claim 6, characterized in that the health data is collected from the RabbitMQ cluster by the information acquisition module, and the health data includes, but is not limited to, the RabbitMQ service state, the cluster state, log data and operating-system performance indicators.
  8. The detection and recovery method for RabbitMQ cluster faults according to claim 6, characterized in that detecting and analyzing the health data comprises:
    by analyzing the RabbitMQ service state, the cluster state and the log data, determining whether a network partition has occurred in the RabbitMQ cluster;
    determining whether the RabbitMQ service is normal;
    determining whether the RabbitMQ cluster state is normal;
    determining whether the resource utilization of a RabbitMQ node exceeds a threshold;
    determining whether the queue metadata of each node of the RabbitMQ cluster is consistent.
  9. The detection and recovery method for RabbitMQ cluster faults according to claim 6, characterized in that the automatic restart of the RabbitMQ service includes restarting the RabbitMQ service and restarting the node running the RabbitMQ service, and the rebuilding of the RabbitMQ cluster includes rebuilding the RabbitMQ Mnesia database;
    when the alarm information indicates that the RabbitMQ node service is abnormal, the RabbitMQ service is restarted;
    when the alarm information indicates that the operating system of the RabbitMQ node is abnormal, the node running the RabbitMQ service is restarted;
    when the alarm information indicates that the RabbitMQ cluster metadata is inconsistent, the RabbitMQ cluster is rebuilt.
  10. The detection and recovery method for RabbitMQ cluster faults according to claim 6, characterized in that verifying the availability of the RabbitMQ cluster further includes checking the service state of the RabbitMQ cluster, the cluster state, the metadata consistency, simulated sending and receiving of messages on existing queues, and the creation of new queues, so as to confirm that the RabbitMQ cluster has recovered to a normal and available state.
CN201910593885.3A 2019-07-03 2019-07-03 Detection and recovery system and method for RabbitMQ cluster faults Pending CN110290012A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910593885.3A CN110290012A (en) 2019-07-03 2019-07-03 Detection and recovery system and method for RabbitMQ cluster faults

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910593885.3A CN110290012A (en) 2019-07-03 2019-07-03 Detection and recovery system and method for RabbitMQ cluster faults

Publications (1)

Publication Number Publication Date
CN110290012A true CN110290012A (en) 2019-09-27

Family

ID=68020472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910593885.3A Pending CN110290012A (en) 2019-07-03 2019-07-03 Detection and recovery system and method for RabbitMQ cluster faults

Country Status (1)

Country Link
CN (1) CN110290012A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061586A (en) * 2019-12-05 2020-04-24 深圳先进技术研究院 Container cloud platform anomaly detection method and system and electronic equipment
CN111597079A (en) * 2020-05-21 2020-08-28 山东汇贸电子口岸有限公司 Method and system for detecting and recovering MySQL Galera cluster fault
CN111694694A (en) * 2020-05-22 2020-09-22 北京三快在线科技有限公司 Database cluster processing method and device, storage medium and node
CN111865695A (en) * 2020-07-28 2020-10-30 浪潮云信息技术股份公司 Method and system for automatic fault handling in cloud environment
CN112003929A (en) * 2020-08-21 2020-11-27 苏州浪潮智能科技有限公司 RabbitMQ cluster-based thermal restoration method, system, device and medium
CN112115022A (en) * 2020-08-27 2020-12-22 北京航空航天大学 AADL-based IMA system health monitoring test method
CN112118282A (en) * 2020-07-29 2020-12-22 苏州浪潮智能科技有限公司 Service node elastic expansion method based on RabbitMQ cluster
CN112272113A (en) * 2020-10-23 2021-01-26 上海万向区块链股份公司 Method and system for monitoring and automatically switching based on various block chain nodes
CN112486776A (en) * 2020-12-07 2021-03-12 中国船舶重工集团公司第七一六研究所 Cluster member node availability monitoring equipment and method
CN112486761A (en) * 2020-11-19 2021-03-12 苏州浪潮智能科技有限公司 Cable-free cluster health state detection method
CN112714013A (en) * 2020-12-22 2021-04-27 浪潮云信息技术股份公司 Application fault positioning method in cloud environment
CN113438111A (en) * 2021-06-23 2021-09-24 华云数据控股集团有限公司 Method for restoring RabbitMQ network partition based on Raft distribution and application
CN114827145A (en) * 2022-04-24 2022-07-29 阿里巴巴(中国)有限公司 Server cluster system, and metadata access method and device
CN115037595A (en) * 2022-04-29 2022-09-09 北京华耀科技有限公司 Network recovery method, device, equipment and storage medium
CN117395263A (en) * 2023-12-12 2024-01-12 苏州元脑智能科技有限公司 Data synchronization method, device, equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106445754A (en) * 2016-09-13 2017-02-22 郑州云海信息技术有限公司 Method and system for inspecting cluster health status and cluster server
US20180048587A1 (en) * 2016-05-16 2018-02-15 Yang Bai Port switch service
CN109286529A (en) * 2018-10-31 2019-01-29 武汉烽火信息集成技术有限公司 A kind of method and system for restoring RabbitMQ network partition
CN109525456A (en) * 2018-11-07 2019-03-26 郑州云海信息技术有限公司 A kind of server monitoring method, device and system
CN109947730A (en) * 2017-07-25 2019-06-28 中兴通讯股份有限公司 Metadata restoration methods, device, distributed file system and readable storage medium storing program for executing

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180048587A1 (en) * 2016-05-16 2018-02-15 Yang Bai Port switch service
CN106445754A (en) * 2016-09-13 2017-02-22 郑州云海信息技术有限公司 Method and system for inspecting cluster health status and cluster server
CN109947730A (en) * 2017-07-25 2019-06-28 中兴通讯股份有限公司 Metadata restoration methods, device, distributed file system and readable storage medium storing program for executing
CN109286529A (en) * 2018-10-31 2019-01-29 武汉烽火信息集成技术有限公司 A kind of method and system for restoring RabbitMQ network partition
CN109525456A (en) * 2018-11-07 2019-03-26 郑州云海信息技术有限公司 A kind of server monitoring method, device and system

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061586A (en) * 2019-12-05 2020-04-24 深圳先进技术研究院 Container cloud platform anomaly detection method and system and electronic equipment
CN111061586B (en) * 2019-12-05 2023-09-19 深圳先进技术研究院 Container cloud platform anomaly detection method and system and electronic equipment
CN111597079A (en) * 2020-05-21 2020-08-28 山东汇贸电子口岸有限公司 Method and system for detecting and recovering MySQL Galera cluster fault
CN111597079B (en) * 2020-05-21 2023-12-05 山东汇贸电子口岸有限公司 Method and system for detecting and recovering MySQL Galera cluster faults
CN111694694A (en) * 2020-05-22 2020-09-22 北京三快在线科技有限公司 Database cluster processing method and device, storage medium and node
CN111865695A (en) * 2020-07-28 2020-10-30 浪潮云信息技术股份公司 Method and system for automatic fault handling in cloud environment
CN112118282A (en) * 2020-07-29 2020-12-22 苏州浪潮智能科技有限公司 Service node elastic expansion method based on RabbitMQ cluster
CN112118282B (en) * 2020-07-29 2022-05-13 苏州浪潮智能科技有限公司 Service node elastic expansion method based on RabbitMQ cluster
CN112003929A (en) * 2020-08-21 2020-11-27 苏州浪潮智能科技有限公司 RabbitMQ cluster-based thermal restoration method, system, device and medium
CN112003929B (en) * 2020-08-21 2022-05-13 苏州浪潮智能科技有限公司 RabbitMQ cluster-based thermal restoration method, system, device and medium
CN112115022A (en) * 2020-08-27 2020-12-22 北京航空航天大学 AADL-based IMA system health monitoring test method
CN112115022B (en) * 2020-08-27 2022-03-08 北京航空航天大学 AADL-based IMA system health monitoring test method
CN112272113B (en) * 2020-10-23 2021-10-22 上海万向区块链股份公司 Method and system for monitoring and automatically switching based on various block chain nodes
CN112272113A (en) * 2020-10-23 2021-01-26 上海万向区块链股份公司 Method and system for monitoring and automatically switching based on various block chain nodes
CN112486761A (en) * 2020-11-19 2021-03-12 苏州浪潮智能科技有限公司 Cable-free cluster health state detection method
CN112486776A (en) * 2020-12-07 2021-03-12 中国船舶重工集团公司第七一六研究所 Cluster member node availability monitoring equipment and method
CN112714013A (en) * 2020-12-22 2021-04-27 浪潮云信息技术股份公司 Application fault positioning method in cloud environment
CN112714013B (en) * 2020-12-22 2023-02-03 浪潮云信息技术股份公司 Application fault positioning method in cloud environment
CN113438111A (en) * 2021-06-23 2021-09-24 华云数据控股集团有限公司 Method for restoring RabbitMQ network partition based on Raft distribution and application
CN114827145A (en) * 2022-04-24 2022-07-29 阿里巴巴(中国)有限公司 Server cluster system, and metadata access method and device
CN114827145B (en) * 2022-04-24 2024-01-05 阿里巴巴(中国)有限公司 Server cluster system, metadata access method and device
CN115037595A (en) * 2022-04-29 2022-09-09 北京华耀科技有限公司 Network recovery method, device, equipment and storage medium
CN115037595B (en) * 2022-04-29 2024-04-23 北京华耀科技有限公司 Network recovery method, device, equipment and storage medium
CN117395263A (en) * 2023-12-12 2024-01-12 苏州元脑智能科技有限公司 Data synchronization method, device, equipment and storage medium
CN117395263B (en) * 2023-12-12 2024-03-12 苏州元脑智能科技有限公司 Data synchronization method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110290012A (en) Detection and recovery system and method for RabbitMQ cluster faults
CN105488610B (en) Fault real-time analysis and diagnosis method for power application system
CN107015872B (en) The processing method and processing device of monitoring data
TWI361595B (en) Pool-based network diagnostic systems and methods
US9104572B1 (en) Automated root cause analysis
JP4893828B2 (en) Network failure detection system
CN107453929B (en) Cluster system self-construction method and device and cluster system
CN110287081A (en) A kind of service monitoring system and method
CN112118174B (en) Software defined data gateway
JP4466615B2 (en) Operation management system, monitoring device, monitored device, operation management method and program
CN113242153B (en) Application-oriented monitoring analysis method based on network traffic monitoring
WO2007020118A1 (en) Cluster partition recovery using application state-based priority determination to award a quorum
CN112636942B (en) Method and device for monitoring service host node
CN109491975A (en) Distributed cache system
CN111124830B (en) Micro-service monitoring method and device
CN105827678B (en) Communication means and node under a kind of framework based on High Availabitity
CN104573428B (en) A kind of method and system for improving server cluster resource availability
CN106960060A (en) The management method and device of a kind of data-base cluster
CN112333249A (en) Business service system and method
CN107943657A (en) A kind of linux system problem automatic analysis method and system
CN109284294A (en) Method and device for collecting data, storage medium and processor
CN109586989A (en) A kind of state detection method, device and group system
CN114553747A (en) Method, device, terminal and storage medium for detecting abnormality of redis cluster
CN110545197B (en) Node state monitoring method and device
Sahoo et al. Providing persistent and consistent resources through event log analysis and predictions for large-scale computing systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190927)