CN110290012A - Detection and recovery system and method for RabbitMQ cluster faults - Google Patents
- Publication number
- CN110290012A CN110290012A CN201910593885.3A CN201910593885A CN110290012A CN 110290012 A CN110290012 A CN 110290012A CN 201910593885 A CN201910593885 A CN 201910593885A CN 110290012 A CN110290012 A CN 110290012A
- Authority
- CN
- China
- Prior art keywords
- rabbitmq
- cluster
- detection
- node
- service
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0654—Management of faults, events, alarms or notifications using network fault recovery
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1097—Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a detection and recovery system and method for RabbitMQ cluster faults, belonging to the field of cloud computing. The technical problem to be solved is how to identify and quickly recover from the fault condition in which the RabbitMQ cluster state is normal but the cluster queue metadata is inconsistent. The system comprises an information acquisition module, an anomaly detection module, a monitoring server, a fault analysis and handling module, a recovery detection module and a data storage module. The method includes: collecting health data; detecting and analyzing the health data, including the consistency of the RabbitMQ cluster state and the queue metadata, to obtain detection results; generating alarm information when the detection results show an anomaly; generating a handling action for the RabbitMQ nodes according to the detection results; and, after the fault is handled, verifying the availability of the RabbitMQ cluster.
Description
Technical field
The present invention relates to the field of cloud computing, and in particular to a detection and recovery system and method for RabbitMQ cluster faults.
Background technique
The main characteristics of AMQP (Advanced Message Queuing Protocol) are message orientation, queuing, routing (including point-to-point and publish/subscribe), reliability, and security. RabbitMQ is an open-source implementation of an AMQP message broker, mainly used to store and forward messages in distributed systems. The RabbitMQ server is written in Erlang and supports a variety of clients, such as Java, Python, and C.
RabbitMQ provides two cluster modes: the normal (default) cluster mode and the mirrored cluster mode.
The normal cluster mode is the default. Take a cluster of three nodes (rabbit01, rabbit02, rabbit03) as an example. For a given queue, the message entities exist only on one of the nodes (say rabbit01), while rabbit01, rabbit02 and rabbit03 all hold identical metadata, i.e. the structure of the queue. After messages enter the queue on rabbit01, if a consumer consumes from rabbit02, RabbitMQ transfers the messages temporarily between rabbit01 and rabbit02: the message entity is pulled from rabbit01 and delivered to the consumer through rabbit02. Consumers should therefore connect to every node as far as possible and take messages from each; that is, for the same logical queue, a physical queue should be established on multiple nodes. Otherwise, whether a consumer connects to rabbit01 or rabbit02, the messages are always exported from rabbit01, which creates a bottleneck. Moreover, if rabbit01 fails, rabbit02 cannot obtain the message entities on rabbit01 that have not yet been consumed: if message persistence is enabled, consumption can only resume after rabbit01 recovers; if not, the messages are lost.
Mirrored cluster mode: in this mode, every node holds a copy of each message in the queue, so the whole cluster can still provide service when a single node fails (although, as found in real environments, when the cluster state is normal but the cluster metadata is inconsistent, service still cannot be provided normally). However, because the data must be replicated across multiple nodes, availability is increased at the cost of reduced system throughput. Mechanically, a mirrored queue implements an election algorithm internally: there is one master and multiple slaves, and the messages in the queue follow the master. For publishing, any node can be chosen for the connection; if that node is not the master, RabbitMQ forwards the message to the master, which sends it to the other slave nodes, performs local message processing, and replicates the message by multicast to the other nodes for storage. For consuming, any node can likewise be chosen; the consumption request is forwarded to the master. To guarantee message reliability, the consumer must acknowledge (ack); only after the master receives the ack is the message deleted, and the ack is propagated to each of the other nodes (asynchronously by default) so that the slaves delete the message too. If the master node fails, the mirrored queue automatically elects a new master (the slave with the longest queue) as the reference for message consumption; in this case there may be acks that have not yet been synchronized to all nodes (since synchronization is asynchronous by default). If a slave node fails, the state of the other nodes in the mirrored-queue cluster does not change.
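The election rule just described (on master failure, promote the slave whose copy of the queue is longest) can be sketched in plain Python. This is an illustrative model, not RabbitMQ's internal Erlang code; `replicas` is a hypothetical mapping from node name to the number of messages each mirror still holds.

```python
def elect_master(replicas, failed_master):
    """Pick a new master for a mirrored queue.

    replicas: dict mapping node name -> number of messages held by that
    node's copy of the queue. The surviving slave with the longest queue
    (the most complete copy) wins, as described above.
    """
    candidates = {node: depth for node, depth in replicas.items()
                  if node != failed_master}
    if not candidates:
        raise RuntimeError("no surviving mirror to promote")
    # A (depth, name) key makes the tie-break deterministic.
    return max(candidates, key=lambda node: (candidates[node], node))

# Example: rabbit01 (the master) fails; rabbit03 holds the longest queue.
new_master = elect_master(
    {"rabbit01": 120, "rabbit02": 118, "rabbit03": 120}, "rabbit01")
```

A real broker must also handle the unsynchronized-ack window mentioned above; this sketch only captures the "longest queue wins" selection.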
Mnesia is a distributed database module that automatically synchronizes databases across multiple Erlang nodes. The RabbitMQ service uses the Mnesia database to store queues (including queue attributes), messages, vhosts, users, exchanges (including exchange attributes), and other information.
A RabbitMQ cluster network partition (commonly called split-brain) arises as follows: if one node in the RabbitMQ cluster cannot contact another node for a period of time (determined by the net_ticktime setting, 60 s by default), Mnesia considers the unreachable node to have failed. If the two nodes later restore contact, but each has believed the other to be down, Mnesia concludes that a network partition has occurred. RabbitMQ provides three modes for handling network partitions automatically: pause-minority, pause-if-all-down and autoheal (the default is ignore, i.e. manual handling is required).
In pause-minority mode, when RabbitMQ discovers that other nodes have gone down, it automatically pauses the nodes that consider themselves to be in the minority (e.g. no more than half of the total number of nodes). Once a network partition occurs, the "minority" nodes pause at once, and resume only after the partition is healed. This guarantees that when a network partition occurs, at most the nodes in one partition keep running (a way of sacrificing availability to guarantee consistency).
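The "minority" criterion above reduces to a one-line predicate; this is an illustrative sketch of the rule as stated, not RabbitMQ's internal code.

```python
def is_minority(reachable_nodes, total_nodes):
    """True if this side of a partition must pause under pause-minority.

    A side pauses when it can see no more than half of the cluster's
    nodes, so at most one partition (a strict majority) keeps running.
    """
    return reachable_nodes <= total_nodes / 2

# In a 3-node cluster, an isolated node pauses; the 2-node side runs on.
```

Note the even-cluster consequence of "less than or equal to half": in a 4-node cluster split 2/2, both sides are minorities and both pause, which is why odd-sized clusters are preferred with this mode.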
In pause-if-all-down mode, RabbitMQ automatically pauses the cluster nodes that cannot communicate with the given (listed) nodes. If the given nodes themselves lie in different partitions that cannot communicate with each other, no node is paused; the situation is then handled further by the ignore or autoheal mode.
In autoheal mode, once a network partition has occurred, RabbitMQ automatically determines a winning partition and then restarts all nodes that are not in the winning partition. The winning partition is the one with the most client connections (or, if the connection counts are equal, the one with the most nodes).
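The winner-selection rule above (most client connections, ties broken by node count) can be sketched as follows; the `partitions` structure is an illustrative assumption, not RabbitMQ's actual data model.

```python
def winning_partition(partitions):
    """Choose the partition that autoheal keeps running.

    partitions: list of dicts, each with 'nodes' (list of node names)
    and 'connections' (total client connections on that side). Most
    connections wins; ties are broken by node count, as described above.
    """
    return max(partitions,
               key=lambda p: (p["connections"], len(p["nodes"])))

parts = [
    {"nodes": ["rabbit01"], "connections": 40},
    {"nodes": ["rabbit02", "rabbit03"], "connections": 40},
]
# Equal connection counts, so the two-node partition wins and
# rabbit01 is restarted by autoheal.
```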
In actual use, when a RabbitMQ network partition has occurred, the two (or more) sides of the RabbitMQ cluster exist independently, and each side believes the other has crashed. Queues, bindings and exchanges can be created and deleted independently on each side. For mirrored queues, each side of the partition has its own master, and reads and writes proceed independently on each side; other undefined behaviors may also occur. Even after the cluster's network partition heals, the data in the RabbitMQ cluster cannot automatically be restored to its state before the partition occurred, and some queue information in the cluster remains unavailable.
In summary, a RabbitMQ cluster cannot handle network partitions well. RabbitMQ stores queues, exchanges, bindings and other information in Erlang's distributed database Mnesia. Network anomalies, RabbitMQ node crashes, CPU soft lockups and the like can all cause a network partition in the RabbitMQ cluster. RabbitMQ's own automatic handling can restore the cluster state to normal, but existing queues may remain unavailable, preventing components that connect to RabbitMQ from operating normally.
In view of the above problems, how to identify and quickly recover from the fault condition in which the RabbitMQ cluster state is normal but the cluster queue metadata is inconsistent is the technical problem to be solved.
Summary of the invention
The technical task of the present invention, in view of the above deficiencies, is to provide a detection and recovery system and method for RabbitMQ cluster faults, to solve the problem of how to identify and quickly recover from the fault condition in which the RabbitMQ cluster state is normal but the cluster queue metadata is inconsistent.
In a first aspect, the present invention provides a detection and recovery system for RabbitMQ cluster faults, comprising:
an information acquisition module, for obtaining health data, the health data being data relevant to checking the health status of RabbitMQ nodes;
an anomaly detection module, for detecting and analyzing the health data, and for detecting and analyzing the consistency of the RabbitMQ cluster state and the queue metadata, to obtain detection results;
a monitoring server, for receiving and storing the health data and the detection results, and for generating alarm information when the detection results show an anomaly;
a fault analysis and handling module, for generating, according to the detection results, a handling action for the RabbitMQ nodes, the handling actions including automatic restart of the RabbitMQ service and rebuilding of the RabbitMQ cluster;
a recovery detection module, for verifying the availability of the RabbitMQ cluster after the fault is handled and sending the verification result to the monitoring server, the availability verification including creating a test queue and choosing an existing queue to verify message transmission;
a data storage module, connected to the monitoring server, for storing the monitoring data, the detection results, the alarm information and the verification results.
Preferably, the information acquisition module collects the health data from the RabbitMQ cluster, the health data including but not limited to the RabbitMQ service state, the cluster state, log data, and operating system performance indicators.
Preferably, the anomaly detection module performs the following detection and analysis:
determining, by analyzing the RabbitMQ service state, the cluster state and the log data, whether a network partition has occurred in the RabbitMQ cluster;
determining whether the RabbitMQ service is normal;
determining whether the RabbitMQ cluster state is normal;
determining whether the resource utilization of a RabbitMQ node exceeds a threshold;
determining whether the queue metadata of each node of the RabbitMQ cluster is consistent.
Preferably, the automatic restart of the RabbitMQ service includes restarting the RabbitMQ service and restarting the node running the RabbitMQ service, and the rebuilding of the RabbitMQ cluster includes rebuilding the RabbitMQ Mnesia database:
when the alarm information indicates that a RabbitMQ node's service is abnormal, the RabbitMQ service is restarted;
when the alarm information indicates that a RabbitMQ node's operating system is abnormal, the node running the RabbitMQ service is restarted;
when the alarm information indicates that the RabbitMQ cluster metadata is inconsistent, the RabbitMQ cluster is rebuilt.
More preferably, verifying the availability of the RabbitMQ cluster further includes checking the service state, the cluster state and the metadata consistency of the RabbitMQ cluster, simulating the sending and receiving of messages on existing queues, and creating new queues, in order to confirm that the RabbitMQ cluster has recovered to a normal and available state.
In a second aspect, the present invention provides a detection and recovery method for RabbitMQ cluster faults, which performs fault detection and recovery on a RabbitMQ cluster through the detection and recovery system of any implementation of the first aspect, the method comprising:
collecting health data, the health data being data relevant to checking the health status of RabbitMQ nodes;
detecting and analyzing the health data, including the consistency of the RabbitMQ cluster state and the queue metadata, to obtain detection results;
generating alarm information when the detection results show an anomaly;
generating, according to the detection results, a handling action for the RabbitMQ nodes, the handling actions including automatic restart of the RabbitMQ service and rebuilding of the RabbitMQ cluster;
after the fault is handled, verifying the availability of the RabbitMQ cluster, the availability verification including creating a test queue and choosing an existing queue to verify message transmission.
Preferably, the health data is collected from the RabbitMQ cluster through the information acquisition module, the health data including but not limited to the RabbitMQ service state, the cluster state, log data, and operating system performance indicators.
Preferably, detecting and analyzing the health data comprises:
determining, by analyzing the RabbitMQ service state, the cluster state and the log data, whether a network partition has occurred in the RabbitMQ cluster;
determining whether the RabbitMQ service is normal;
determining whether the RabbitMQ cluster state is normal;
determining whether the resource utilization of a RabbitMQ node exceeds a threshold;
determining whether the queue metadata of each node of the RabbitMQ cluster is consistent.
Preferably, the automatic restart of the RabbitMQ service includes restarting the RabbitMQ service and restarting the node running the RabbitMQ service, and the rebuilding of the RabbitMQ cluster includes rebuilding the RabbitMQ Mnesia database:
when the alarm information indicates that a RabbitMQ node's service is abnormal, the RabbitMQ service is restarted;
when the alarm information indicates that a RabbitMQ node's operating system is abnormal, the node running the RabbitMQ service is restarted;
when the alarm information indicates that the RabbitMQ cluster metadata is inconsistent, the RabbitMQ cluster is rebuilt.
More preferably, verifying the availability of the RabbitMQ cluster further includes checking the service state, the cluster state and the metadata consistency of the RabbitMQ cluster, simulating the sending and receiving of messages on existing queues, and creating new queues, in order to confirm that the RabbitMQ cluster has recovered to a normal and available state.
The detection and recovery system and method for RabbitMQ cluster faults of the present invention have the following advantages:
1. The consistency of the RabbitMQ cluster queue metadata is checked automatically, so service anomalies caused by inconsistent queue metadata can be discovered in time, improving the timeliness of problem discovery;
2. Anomalies are classified, repaired and verified automatically, improving platform service availability.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the invention; those skilled in the art can obtain other drawings from them without creative effort.
The present invention is further described below with reference to the drawings.
Figure 1 is a structural block diagram of the detection and recovery system for RabbitMQ cluster faults of Embodiment 1;
Figure 2 is a logical diagram of queue data storage in RabbitMQ's normal cluster mode, in the detection and recovery system of Embodiment 1.
Detailed description
The present invention is further explained below with reference to the drawings and specific embodiments, so that those skilled in the art can better understand and practice it. The illustrated embodiments do not limit the invention, and, where no conflict arises, the technical features of the embodiments can be combined with each other.
The embodiments of the present invention provide a detection and recovery system and method for RabbitMQ cluster faults, used to solve the technical problem of how to identify and quickly recover from the fault condition in which the RabbitMQ cluster state is normal but the cluster queue metadata is inconsistent.
Embodiment 1:
The detection and recovery system for RabbitMQ cluster faults of the present invention includes an information acquisition module, an anomaly detection module, a monitoring server, a fault analysis and handling module, a recovery detection module and a data storage module.
The information acquisition module is connected to the RabbitMQ cluster and obtains health data from it. The health data are data relevant to checking the health status of RabbitMQ nodes, including the RabbitMQ service state, the cluster state, log data, and operating system performance indicators such as CPU, memory, disk and system load.
The monitoring server includes a data processing module and an alarm management module. The data processing module is connected to the information acquisition module; the health data collected by the information acquisition module is uploaded to and stored in the data processing module.
The anomaly detection module is connected to the monitoring server: it obtains the health data from the data processing module and detects and analyzes it. At the same time, the anomaly detection module is connected to the RabbitMQ cluster, from which it collects queue data, and detects and analyzes the consistency of the RabbitMQ cluster state and the queue metadata. Through the above analysis, the anomaly detection module outputs detection results and sends them to the alarm management module, which analyzes the detection results further.
The anomaly detection module obtains the health data and detects and analyzes it, the detection and analysis including:
determining, by analyzing the RabbitMQ service state, the cluster state and the log data, whether a network partition has occurred in the RabbitMQ cluster;
determining whether the RabbitMQ service is normal;
determining whether the RabbitMQ cluster state is normal;
determining whether the resource utilization of a RabbitMQ node exceeds a threshold;
determining whether the queue metadata of each node of the RabbitMQ cluster is consistent.
The fault analysis and handling module is connected to the anomaly detection module; it obtains the detection results from the anomaly detection module and generates, according to those results, a handling action for the RabbitMQ nodes. The handling actions include automatic restart of the RabbitMQ service and rebuilding of the RabbitMQ cluster: the automatic restart of the RabbitMQ service includes restarting the RabbitMQ service and restarting the node running the RabbitMQ service, and the rebuilding of the RabbitMQ cluster includes rebuilding the RabbitMQ Mnesia database. When the alarm information indicates that a RabbitMQ node's service is abnormal, the RabbitMQ service is restarted; when the alarm information indicates that a RabbitMQ node's operating system is abnormal, the node running the RabbitMQ service is restarted; when the alarm information indicates that the RabbitMQ cluster metadata is inconsistent, the RabbitMQ cluster is rebuilt.
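The three-way repair rule above amounts to a small dispatch table. The alarm-type strings and action names below are illustrative assumptions, not identifiers from the patent's implementation:

```python
# Hypothetical alarm types -> repair actions, mirroring the three rules
# stated above (service restart, node restart, cluster rebuild).
REPAIR_ACTIONS = {
    "service_abnormal": "restart_rabbitmq_service",
    "os_abnormal": "restart_node",
    "metadata_inconsistent": "rebuild_cluster",  # rebuild the Mnesia database
}

def choose_repair(alarm_type):
    """Map an alarm to a repair action; unknown alarms go to an operator."""
    return REPAIR_ACTIONS.get(alarm_type, "escalate_to_operator")
```

Keeping the mapping in data rather than branching logic makes it easy to add further alarm classes (e.g. the resource-threshold check) without touching the dispatch code.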
The recovery detection module is connected to the fault analysis and handling module and to the monitoring server. After the fault is handled, it verifies the availability of the RabbitMQ cluster and sends the verification result to the monitoring server. The availability verification includes creating a test queue and choosing an existing queue to verify message transmission; it further includes checking the service state, the cluster state and the metadata consistency of the RabbitMQ cluster, simulating the sending and receiving of messages on existing queues, and creating new queues, in order to confirm that the RabbitMQ cluster has recovered to a normal and available state.
The data storage module is connected to the monitoring server and stores the monitoring data, the detection results, the alarm information and the verification results.
The detection and recovery system for RabbitMQ cluster faults of the present invention achieves automatic detection of RabbitMQ cluster faults and fast recovery from them.
Embodiment 2:
The detection and recovery method for RabbitMQ cluster faults of the present invention is implemented on the basis of the detection and recovery system disclosed in Embodiment 1, and includes the following steps:
S100: collecting health data through the information acquisition module and uploading it to the monitoring server, the health data being data relevant to checking the health status of RabbitMQ nodes;
S200: detecting and analyzing the health data through the anomaly detection module, including the consistency of the RabbitMQ cluster state and the queue metadata, to obtain detection results, and uploading the detection results to the monitoring server;
S300: generating, through the monitoring server, alarm information when the detection results show an anomaly;
S400: generating, through the fault analysis and handling module and according to the detection results, a handling action for the RabbitMQ nodes, the handling actions including automatic restart of the RabbitMQ service and rebuilding of the RabbitMQ cluster;
S500: after the fault is handled, verifying the availability of the RabbitMQ cluster through the recovery detection module and uploading the verification result to the monitoring server, the availability verification including creating a test queue and choosing an existing queue to verify message transmission.
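The S100–S500 loop can be sketched as a single pass with each step injected as a callable, so the sketch stays broker-independent. All function names and the shape of the `findings` records are illustrative assumptions; a real system would wire these to the modules of Embodiment 1.

```python
def detect_and_recover(collect, analyze, alert, repair, verify):
    """One pass of the S100-S500 workflow described above.

    collect() -> health data (S100); analyze(health) -> list of finding
    dicts with an 'abnormal' flag (S200); alert(anomalies) raises alarms
    (S300); repair(finding) applies a classified repair (S400);
    verify() -> bool confirms cluster availability (S500).
    """
    health = collect()                       # S100: gather health data
    findings = analyze(health)               # S200: detect and analyze
    anomalies = [f for f in findings if f["abnormal"]]
    if not anomalies:
        return "healthy"
    alert(anomalies)                         # S300: raise alarms
    for finding in anomalies:                # S400: classified repair
        repair(finding)
    return "recovered" if verify() else "verification_failed"  # S500
```

In practice this pass would run periodically; injecting the steps also makes the control flow easy to exercise with stubs, as the structure separates the workflow from any particular RabbitMQ client library.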
In step S100, the health data is collected from the RabbitMQ cluster through the information acquisition module, and includes but is not limited to the RabbitMQ service state, the cluster state, log data, and operating system performance indicators.
Detecting and analyzing the health data in step S200 comprises:
determining, by analyzing the RabbitMQ service state, the cluster state and the log data, whether a network partition has occurred in the RabbitMQ cluster;
determining whether the RabbitMQ service is normal;
determining whether the RabbitMQ cluster state is normal;
determining whether the resource utilization of a RabbitMQ node exceeds a threshold;
determining whether the queue metadata of each node of the RabbitMQ cluster is consistent.
The storage location and content of queue information under RabbitMQ's normal cluster mode are shown in Figure 2. The queue metadata mainly includes: the queue name, durability, auto-delete, and owner node. The anomaly detection module compares the queue metadata of each node of the RabbitMQ cluster and analyzes whether a queue-metadata inconsistency anomaly has occurred in the cluster.
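The comparison just described can be sketched as follows. The per-node input structure and field names (`durable`, `owner`) are illustrative, echoing the metadata fields listed above rather than RabbitMQ's actual Mnesia records:

```python
def metadata_inconsistencies(per_node_metadata):
    """Return the queues whose metadata differs across cluster nodes.

    per_node_metadata: dict mapping node name -> {queue_name: metadata
    dict with e.g. name, durable, auto_delete, owner}. A queue is
    inconsistent if any node is missing it or holds a different record.
    """
    all_queues = set()
    for queues in per_node_metadata.values():
        all_queues.update(queues)
    inconsistent = []
    for q in sorted(all_queues):
        views = {node: queues.get(q)
                 for node, queues in per_node_metadata.items()}
        # More than one distinct view (including None for "missing")
        # means the cluster's copies of this queue's metadata disagree.
        if len({repr(v) for v in views.values()}) > 1:
            inconsistent.append(q)
    return inconsistent

nodes = {
    "rabbit01": {"q1": {"durable": True, "owner": "rabbit01"},
                 "q2": {"durable": True, "owner": "rabbit01"}},
    "rabbit02": {"q1": {"durable": True, "owner": "rabbit01"},
                 "q2": {"durable": True, "owner": "rabbit02"}},
}
# q2's owner-node records disagree, so q2 is flagged as inconsistent.
```

A finding from this check would drive the metadata-inconsistency alarm of step S300 and, in turn, the cluster rebuild of step S400.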
In step S400, the automatic restart of the RabbitMQ service includes restarting the RabbitMQ service and restarting the node running the RabbitMQ service, and the rebuilding of the RabbitMQ cluster includes rebuilding the RabbitMQ Mnesia database. When the alarm information indicates that a RabbitMQ node's service is abnormal, the RabbitMQ service is restarted; when the alarm information indicates that a RabbitMQ node's operating system is abnormal, the node running the RabbitMQ service is restarted; when the alarm information indicates that the RabbitMQ cluster metadata is inconsistent, the RabbitMQ cluster is rebuilt.
In step S500, verifying the availability of the RabbitMQ cluster further includes checking the service state, the cluster state and the metadata consistency of the RabbitMQ cluster, simulating the sending and receiving of messages on existing queues, and creating new queues, in order to confirm that the RabbitMQ cluster has recovered to a normal and available state.
The embodiments described above are only preferred embodiments cited to fully illustrate the present invention; the scope of protection of the invention is not limited to them. Equivalent substitutions or transformations made by those skilled in the art on the basis of the present invention fall within its scope of protection. The scope of protection of the present invention is defined by the claims.
Claims (10)
- 1. A detection and recovery system for RabbitMQ cluster faults, characterized by comprising: an information acquisition module, for obtaining health data, the health data being data relevant to checking the health status of RabbitMQ nodes; an anomaly detection module, for detecting and analyzing the health data and for detecting and analyzing the consistency of the RabbitMQ cluster state and the queue metadata, to obtain detection results; a monitoring server, for receiving and storing the health data and the detection results, and for generating alarm information when the detection results show an anomaly; a fault analysis and handling module, for generating, according to the detection results, a handling action for the RabbitMQ nodes, the handling actions including automatic restart of the RabbitMQ service and rebuilding of the RabbitMQ cluster; a recovery detection module, for verifying the availability of the RabbitMQ cluster after the fault is handled and sending the verification result to the monitoring server, the availability verification including creating a test queue and choosing an existing queue to verify message transmission; a data storage module, connected to the monitoring server, for storing the monitoring data, the detection results, the alarm information and the verification results.
- 2. The detection and recovery system for RabbitMQ cluster faults according to claim 1, characterized in that the information acquisition module collects the health data from the RabbitMQ cluster, the health data including but not limited to the RabbitMQ service state, the cluster state, log data, and operating system performance indicators.
- 3. The detection and recovery system for RabbitMQ cluster faults according to claim 1, characterized in that the anomaly detection module performs the following detection and analysis: determining, by analyzing the RabbitMQ service state, the cluster state and the log data, whether a network partition has occurred in the RabbitMQ cluster; determining whether the RabbitMQ service is normal; determining whether the RabbitMQ cluster state is normal; determining whether the resource utilization of a RabbitMQ node exceeds a threshold; determining whether the queue metadata of each node of the RabbitMQ cluster is consistent.
- 4. The detection and recovery system for RabbitMQ cluster faults according to claim 1, characterized in that the automatic restart of the RabbitMQ service includes restarting the RabbitMQ service and restarting the node running the RabbitMQ service, and the rebuilding of the RabbitMQ cluster includes rebuilding the RabbitMQ Mnesia database; when the alarm information indicates that a RabbitMQ node's service is abnormal, the RabbitMQ service is restarted; when the alarm information indicates that a RabbitMQ node's operating system is abnormal, the node running the RabbitMQ service is restarted; when the alarm information indicates that the RabbitMQ cluster metadata is inconsistent, the RabbitMQ cluster is rebuilt.
- 5. The detection and recovery system for RabbitMQ cluster faults according to claim 1, characterized in that verifying the availability of the RabbitMQ cluster further includes checking the service state, the cluster state and the metadata consistency of the RabbitMQ cluster, simulating the sending and receiving of messages on existing queues, and creating new queues, in order to confirm that the RabbitMQ cluster has recovered to a normal and available state.
- 6. A detection and recovery method for RabbitMQ cluster faults, characterized in that fault detection and recovery are performed on a RabbitMQ cluster through the detection and recovery system for RabbitMQ cluster faults according to any one of claims 1-5, the method comprising: collecting health data, the health data being data relevant to checking the health status of RabbitMQ nodes; detecting and analyzing the health data, including the consistency of the RabbitMQ cluster state and the queue metadata, to obtain detection results; generating alarm information when the detection results show an anomaly; generating, according to the detection results, a handling action for the RabbitMQ nodes, the handling actions including automatic restart of the RabbitMQ service and rebuilding of the RabbitMQ cluster; after the fault is handled, verifying the availability of the RabbitMQ cluster, the availability verification including creating a test queue and choosing an existing queue to verify message transmission.
- 7. The detection and recovery method for RabbitMQ cluster faults according to claim 6, characterized in that the health data is collected from the RabbitMQ cluster by an information collection module, and includes, but is not limited to, the RabbitMQ service state, cluster state, log data, and operating-system performance indicators.
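A collection step of this kind might shell out to `rabbitmqctl` (RabbitMQ's real administration CLI, whose `status` and `cluster_status` subcommands report the service and cluster state). The sketch below is an assumption about how such a module could be structured; the command runner is injected so the assembly logic can be tested without RabbitMQ installed.

```python
# Sketch of an information-collection module like claim 7's, gathering
# service and cluster state via rabbitmqctl. Log data and OS performance
# indicators would be read similarly (e.g. /var/log/rabbitmq, /proc) but
# are elided here. The runner is injectable for offline testing.

import subprocess

def run_cmd(args):
    """Run a command and return its stdout; raises CalledProcessError on failure."""
    return subprocess.run(args, capture_output=True, text=True, check=True).stdout

def collect_health_data(runner=run_cmd):
    """Gather the health-data categories listed in the claim into one record."""
    return {
        "service_status": runner(["rabbitmqctl", "status"]),
        "cluster_status": runner(["rabbitmqctl", "cluster_status"]),
    }
```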
- 8. The detection and recovery method for RabbitMQ cluster faults according to claim 6, characterized in that detecting and analyzing the health data comprises: analyzing the RabbitMQ service state, cluster state, and log data to determine whether a network partition has occurred in the RabbitMQ cluster; determining whether the RabbitMQ service is normal; determining whether the RabbitMQ cluster state is normal; determining whether the resource utilization of a RabbitMQ node exceeds a threshold; and determining whether the queue metadata on each node of the RabbitMQ cluster is consistent.
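The five checks enumerated in this claim could be condensed into one analysis function over a parsed health record. The input shape is an assumption (a dict of already-parsed fields rather than raw `rabbitmqctl` output), and the field names and threshold are illustrative.

```python
# Hedged sketch of the per-check analysis in claim 8. The record layout is an
# assumed, pre-parsed form of the collected health data; field names and the
# default CPU threshold are illustrative, not from the patent.

def analyze_health(record, cpu_threshold=0.9):
    """Return boolean findings for the checks named in the claim."""
    findings = {
        # network partition: cluster status reports a non-empty partition list
        "network_partition": bool(record.get("partitions")),
        "service_down": not record.get("service_running", False),
        # cluster degraded if some known node is not running
        "cluster_degraded": set(record.get("running_nodes", []))
                            != set(record.get("all_nodes", [])),
        "resource_over_threshold": record.get("cpu_utilization", 0.0) > cpu_threshold,
        # metadata consistent iff every node reports the same queue set
        "metadata_inconsistent": len({frozenset(q) for q in
                                      record.get("queues_per_node", [[]])}) > 1,
    }
    findings["abnormal"] = any(findings.values())
    return findings
```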
- 9. The detection and recovery method for RabbitMQ cluster faults according to claim 6, characterized in that automatic restart of the RabbitMQ service includes restarting the RabbitMQ service itself and restarting the node running the RabbitMQ service, and rebuilding the RabbitMQ cluster includes rebuilding the RabbitMQ Mnesia database; when the alarm information indicates a RabbitMQ node service exception, the RabbitMQ service is restarted; when the alarm information indicates a RabbitMQ node operating-system exception, the node running the RabbitMQ service is restarted; and when the alarm information indicates that the RabbitMQ cluster metadata is inconsistent, the RabbitMQ cluster is rebuilt.
- 10. The detection and recovery method for RabbitMQ cluster faults according to claim 6, characterized in that verifying the availability of the RabbitMQ cluster further includes checking the service state, cluster state, and metadata consistency of the RabbitMQ cluster, simulating the sending and receiving of messages on existing queues, and creating new queues, in order to determine that the RabbitMQ cluster has recovered to normal and is available.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910593885.3A CN110290012A (en) | 2019-07-03 | 2019-07-03 | The detection recovery system and method for RabbitMQ clustering fault |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910593885.3A CN110290012A (en) | 2019-07-03 | 2019-07-03 | The detection recovery system and method for RabbitMQ clustering fault |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110290012A true CN110290012A (en) | 2019-09-27 |
Family
ID=68020472
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910593885.3A Pending CN110290012A (en) | 2019-07-03 | 2019-07-03 | The detection recovery system and method for RabbitMQ clustering fault |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110290012A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106445754A (en) * | 2016-09-13 | 2017-02-22 | 郑州云海信息技术有限公司 | Method and system for inspecting cluster health status and cluster server |
US20180048587A1 (en) * | 2016-05-16 | 2018-02-15 | Yang Bai | Port switch service |
CN109286529A (en) * | 2018-10-31 | 2019-01-29 | 武汉烽火信息集成技术有限公司 | A kind of method and system for restoring RabbitMQ network partition |
CN109525456A (en) * | 2018-11-07 | 2019-03-26 | 郑州云海信息技术有限公司 | A kind of server monitoring method, device and system |
CN109947730A (en) * | 2017-07-25 | 2019-06-28 | 中兴通讯股份有限公司 | Metadata restoration methods, device, distributed file system and readable storage medium storing program for executing |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111061586A (en) * | 2019-12-05 | 2020-04-24 | 深圳先进技术研究院 | Container cloud platform anomaly detection method and system and electronic equipment |
CN111061586B (en) * | 2019-12-05 | 2023-09-19 | 深圳先进技术研究院 | Container cloud platform anomaly detection method and system and electronic equipment |
CN111597079A (en) * | 2020-05-21 | 2020-08-28 | 山东汇贸电子口岸有限公司 | Method and system for detecting and recovering MySQL Galera cluster fault |
CN111597079B (en) * | 2020-05-21 | 2023-12-05 | 山东汇贸电子口岸有限公司 | Method and system for detecting and recovering MySQL Galera cluster faults |
CN111694694A (en) * | 2020-05-22 | 2020-09-22 | 北京三快在线科技有限公司 | Database cluster processing method and device, storage medium and node |
CN111865695A (en) * | 2020-07-28 | 2020-10-30 | 浪潮云信息技术股份公司 | Method and system for automatic fault handling in cloud environment |
CN112118282A (en) * | 2020-07-29 | 2020-12-22 | 苏州浪潮智能科技有限公司 | Service node elastic expansion method based on RabbitMQ cluster |
CN112118282B (en) * | 2020-07-29 | 2022-05-13 | 苏州浪潮智能科技有限公司 | Service node elastic expansion method based on RabbitMQ cluster |
CN112003929A (en) * | 2020-08-21 | 2020-11-27 | 苏州浪潮智能科技有限公司 | RabbitMQ cluster-based thermal restoration method, system, device and medium |
CN112003929B (en) * | 2020-08-21 | 2022-05-13 | 苏州浪潮智能科技有限公司 | RabbitMQ cluster-based thermal restoration method, system, device and medium |
CN112115022A (en) * | 2020-08-27 | 2020-12-22 | 北京航空航天大学 | AADL-based IMA system health monitoring test method |
CN112115022B (en) * | 2020-08-27 | 2022-03-08 | 北京航空航天大学 | AADL-based IMA system health monitoring test method |
CN112272113B (en) * | 2020-10-23 | 2021-10-22 | 上海万向区块链股份公司 | Method and system for monitoring and automatically switching based on various block chain nodes |
CN112272113A (en) * | 2020-10-23 | 2021-01-26 | 上海万向区块链股份公司 | Method and system for monitoring and automatically switching based on various block chain nodes |
CN112486761A (en) * | 2020-11-19 | 2021-03-12 | 苏州浪潮智能科技有限公司 | Cable-free cluster health state detection method |
CN112486776A (en) * | 2020-12-07 | 2021-03-12 | 中国船舶重工集团公司第七一六研究所 | Cluster member node availability monitoring equipment and method |
CN112714013A (en) * | 2020-12-22 | 2021-04-27 | 浪潮云信息技术股份公司 | Application fault positioning method in cloud environment |
CN112714013B (en) * | 2020-12-22 | 2023-02-03 | 浪潮云信息技术股份公司 | Application fault positioning method in cloud environment |
CN113438111A (en) * | 2021-06-23 | 2021-09-24 | 华云数据控股集团有限公司 | Method for restoring RabbitMQ network partition based on Raft distribution and application |
CN114827145A (en) * | 2022-04-24 | 2022-07-29 | 阿里巴巴(中国)有限公司 | Server cluster system, and metadata access method and device |
CN114827145B (en) * | 2022-04-24 | 2024-01-05 | 阿里巴巴(中国)有限公司 | Server cluster system, metadata access method and device |
CN115037595A (en) * | 2022-04-29 | 2022-09-09 | 北京华耀科技有限公司 | Network recovery method, device, equipment and storage medium |
CN115037595B (en) * | 2022-04-29 | 2024-04-23 | 北京华耀科技有限公司 | Network recovery method, device, equipment and storage medium |
CN117395263A (en) * | 2023-12-12 | 2024-01-12 | 苏州元脑智能科技有限公司 | Data synchronization method, device, equipment and storage medium |
CN117395263B (en) * | 2023-12-12 | 2024-03-12 | 苏州元脑智能科技有限公司 | Data synchronization method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110290012A (en) | The detection recovery system and method for RabbitMQ clustering fault | |
CN105488610B (en) | Fault real-time analysis and diagnosis method for power application system | |
CN107015872B (en) | The processing method and processing device of monitoring data | |
TWI361595B (en) | Pool-based network diagnostic systems and methods | |
US9104572B1 (en) | Automated root cause analysis | |
JP4893828B2 (en) | Network failure detection system | |
CN107453929B (en) | Cluster system self-construction method and device and cluster system | |
CN110287081A (en) | A kind of service monitoring system and method | |
CN112118174B (en) | Software defined data gateway | |
JP4466615B2 (en) | Operation management system, monitoring device, monitored device, operation management method and program | |
CN113242153B (en) | Application-oriented monitoring analysis method based on network traffic monitoring | |
WO2007020118A1 (en) | Cluster partition recovery using application state-based priority determination to award a quorum | |
CN112636942B (en) | Method and device for monitoring service host node | |
CN109491975A (en) | Distributed cache system | |
CN111124830B (en) | Micro-service monitoring method and device | |
CN105827678B (en) | Communication means and node under a kind of framework based on High Availabitity | |
CN104573428B (en) | A kind of method and system for improving server cluster resource availability | |
CN106960060A (en) | The management method and device of a kind of data-base cluster | |
CN112333249A (en) | Business service system and method | |
CN107943657A (en) | A kind of linux system problem automatic analysis method and system | |
CN109284294A (en) | Method and device for collecting data, storage medium and processor | |
CN109586989A (en) | A kind of state detection method, device and group system | |
CN114553747A (en) | Method, device, terminal and storage medium for detecting abnormality of redis cluster | |
CN110545197B (en) | Node state monitoring method and device | |
Sahoo et al. | Providing persistent and consistent resources through event log analysis and predictions for large-scale computing systems |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20190927 |
|
RJ01 | Rejection of invention patent application after publication |