CN105933153A - Cluster failure monitoring method and device - Google Patents


Info

Publication number
CN105933153A
Authority
CN
China
Prior art keywords
server
test
list
residue
access
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610261291.9A
Other languages
Chinese (zh)
Inventor
侯志贞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
LeTV Holding Beijing Co Ltd
LeTV Information Technology Beijing Co Ltd
Original Assignee
LeTV Holding Beijing Co Ltd
LeTV Information Technology Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by LeTV Holding Beijing Co Ltd, LeTV Information Technology Beijing Co Ltd
Priority to CN201610261291.9A
Publication of CN105933153A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/10 Active monitoring, e.g. heartbeat, ping or trace-route
    • H04L43/106 Active monitoring, e.g. heartbeat, ping or trace-route using time related information in packets, e.g. by adding timestamps

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Cardiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

Embodiments of the invention provide a cluster failure monitoring method and device. The access path of every server in a cluster is obtained according to the structure of the server cluster. Under the same access switch, a first server and a second server are selected as a test pair in accordance with a preset strategy. A data transmit/receive test is carried out on the test pair according to the access paths of the first and second servers. The result of the data transmit/receive test is obtained, and the transmit/receive bandwidth between the first and second servers is derived from that result. When the transmit/receive bandwidth is judged to be greater than a preset bandwidth threshold, the first and second servers are determined to be failure-free. In this way, server cluster failures can be monitored in real time and found rapidly.

Description

Cluster fault monitoring method and device
Technical field
The embodiments of the present invention relate to the technical field of big data processing, and in particular to a cluster fault monitoring method and device.
Background
A server cluster is a group of servers brought together to provide the same service; to a client, the cluster appears to be a single server. A cluster can use multiple computers to compute in parallel and thereby achieve very high computing speed, and it can also use multiple computers as backups, so that the whole system keeps running normally if any one machine fails. Once the cluster service is installed and running on a server, that server can join the cluster. Cluster operation reduces the number of single points of failure and achieves high availability of cluster resources.
In a distributed server cluster, a large job is typically split into multiple tasks, and these tasks are distributed to multiple servers in the cluster for parallel processing, which enables efficient data processing. However, if one server in the cluster runs slowly, the impact on the whole cluster is considerable, because that server slows down the cluster's processing of the entire job. A server that has gone down is easy to detect, but slow running is a different kind of fault and is hard to detect directly.
Therefore, finding the faulty server is a crucial step, one that determines whether the whole server cluster can run normally.
Summary of the invention
The embodiments of the present invention provide a cluster fault monitoring method and device, to overcome the defect in the prior art that a faulty server in a cluster slows down the running state of the whole cluster, and to achieve efficient monitoring of cluster faults.
An embodiment of the present invention provides a cluster fault monitoring method, including:
obtaining the access path of each server in the cluster according to the structure of the server cluster;
under the same access switch, selecting a first server and a second server according to a preset strategy to form a test pair;
performing a data transmit/receive test on the test pair according to the access paths of the first server and the second server;
obtaining the result of the data transmit/receive test and deriving from it the transmit/receive bandwidth between the first server and the second server;
when the transmit/receive bandwidth is judged to be greater than a preset bandwidth threshold, determining that the first server and the second server are fault-free.
An embodiment of the present invention provides a cluster fault monitoring device, including:
an information acquisition module, configured to obtain the access path of each server in the cluster according to the structure of the server cluster;
a test module, configured to select, under the same access switch, a first server and a second server according to a preset strategy to form a test pair, and to perform a data transmit/receive test on the test pair according to the access paths of the first server and the second server;
an analysis module, configured to obtain the result of the data transmit/receive test and derive from it the transmit/receive bandwidth between the first server and the second server, and, when the transmit/receive bandwidth is judged to be greater than a preset bandwidth threshold, to determine that the first server and the second server are fault-free.
The cluster fault monitoring method and device provided by the embodiments of the present invention build test pairs according to the access path of each server in the server cluster and perform a data transmit/receive test on each test pair, thereby judging whether the bandwidth between two servers is abnormal and using this to diagnose server cluster faults. This achieves real-time monitoring and rapid discovery of server cluster faults.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a technical flowchart of Embodiment 1 of the present application;
Fig. 2 is a technical flowchart of Embodiment 2 of the present application;
Fig. 3 is a schematic structural diagram of the device of Embodiment 3 of the present application.
Detailed description of the embodiments
To make the purpose, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
Fig. 1 is a technical flowchart of Embodiment 1 of the present application. With reference to Fig. 1, a cluster fault monitoring method according to an embodiment of the present application can be implemented by the following steps:
Step S110: obtain the access path of each server in the cluster according to the structure of the server cluster;
Step S120: under the same access switch, select a first server and a second server according to a preset strategy to form a test pair;
Step S130: perform a data transmit/receive test on the test pair according to the access paths of the first server and the second server;
Step S140: obtain the result of the data transmit/receive test and derive from it the transmit/receive bandwidth between the first server and the second server;
Step S150: when the transmit/receive bandwidth is judged to be greater than a preset bandwidth threshold, determine that the first server and the second server are fault-free.
Specifically, in step S110, obtaining the access path of each server in the cluster means determining which access switch and which core switch each server is connected through. A typical big data distributed cluster is structured as follows: servers connect to access switches, and access switches connect to core switches. An access switch has 48 downlink ports and can connect at most 48 servers, with a bandwidth of 10 Gbit/s. An access switch has 2 uplink ports, connected to two core switches respectively, so that the failure of a single core switch does not make the whole cluster unavailable; each uplink port has a bandwidth of 40 Gbit/s, 80 Gbit/s in total. A core switch has 48 downlink ports. One set of core switches can therefore serve at most 48*48=2304 servers, and the maximum traffic through this set of two core switches is 3840 Gbit/s.
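The capacity figures in this paragraph can be checked with a few lines of arithmetic. The sketch below is only an illustration: the constant and function names are invented here, and it assumes the 3840 Gbit/s figure is the 80 Gbit/s uplink total of one access switch multiplied by the 48 access switches a core switch can serve.

```python
# Figures taken from the paragraph above; all names are illustrative.
DOWNLINK_PORTS_PER_ACCESS_SWITCH = 48   # servers per access switch
UPLINK_PORTS_PER_ACCESS_SWITCH = 2      # one uplink to each core switch
UPLINK_GBITS = 40                       # bandwidth of each uplink port
DOWNLINK_PORTS_PER_CORE_SWITCH = 48     # access switches per core switch


def cluster_capacity():
    """Return (max servers, uplink Gbit/s per access switch, max core Gbit/s)."""
    max_servers = DOWNLINK_PORTS_PER_CORE_SWITCH * DOWNLINK_PORTS_PER_ACCESS_SWITCH
    uplink_per_access = UPLINK_PORTS_PER_ACCESS_SWITCH * UPLINK_GBITS
    # Assumption: core traffic is bounded by the sum of all access uplinks.
    max_core_traffic = DOWNLINK_PORTS_PER_CORE_SWITCH * uplink_per_access
    return max_servers, uplink_per_access, max_core_traffic


print(cluster_capacity())  # (2304, 80, 3840)
```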
According to the cluster structure above, the access paths of the servers under access switch 1 of core switch 1 may be described as follows:
/machine room/core switch 1/access switch 1/1
/machine room/core switch 1/access switch 1/2
/machine room/core switch 1/access switch 1/3
..........
/machine room/core switch 1/access switch 1/48
The servers under access switch 2 of core switch 2 may have the following access paths:
/machine room/core switch 2/access switch 2/1
/machine room/core switch 2/access switch 2/2
/machine room/core switch 2/access switch 2/3
..........
/machine room/core switch 2/access switch 2/48
In this step, after the access path of each server is obtained, the IP address of each server needs to be matched with its access path; the IP address is used in the data transmit/receive test of the subsequent steps.
For example, writing the IP of each server into its node may give the following result:
/machine room/core switch 1/access switch 1/1/192.0.x.1
/machine room/core switch 1/access switch 1/2/192.0.x.2
/machine room/core switch 1/access switch 1/3/192.0.x.3
..........
/machine room/core switch n/access switch 1/server 48/192.0.x.48
It should be noted that, in the embodiments of the present application, zookeeper is used to write the path for each server. Since zookeeper is a mature and reliable coordination system for distributed systems, it is not described further in this embodiment.
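As a rough illustration of the path layout, the following sketch builds the node paths as plain strings. It does not talk to zookeeper at all; the function names are invented for this example, and the IP addresses are placeholder documentation addresses rather than values from the patent.

```python
def node_path(room, core, access, seq, ip):
    """Build one node path in the layout shown above (names are illustrative)."""
    return f"/{room}/core switch {core}/access switch {access}/{seq}/{ip}"


def register_servers(room, core, access, ips):
    """Number the servers under one access switch sequentially, starting at 1."""
    return [node_path(room, core, access, seq, ip)
            for seq, ip in enumerate(ips, start=1)]


# Hypothetical addresses, used only to show the resulting strings.
paths = register_servers("machine room", 1, 1, ["192.0.2.1", "192.0.2.2"])
for p in paths:
    print(p)
```

In a real deployment the same strings would presumably be created as sequential nodes in the coordination service, which assigns the sequence numbers itself.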
Specifically, in step S120, the preset strategy is a predefined strategy for selecting test pairs, and may include the following approaches:
Approach one: number each server under the same switch, and select two servers at a time in numerical order to form a test pair.
Under this strategy, suppose there are 48 sequentially numbered servers under the same access switch. Then, in order, server 1 and server 2 form a test pair, server 3 and server 4 form a test pair, and so on, until server 47 and server 48 form a test pair.
Approach two: list the odd-numbered servers to obtain an odd server list, and list the even-numbered servers to obtain an even server list;
then, following the order of the odd server list and the even server list, select one odd-numbered server and one even-numbered server at a time to form a test pair.
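The two selection strategies can be sketched as follows. This is a minimal illustration, assuming the servers are given as a list already sorted by number; the function names and the leftover return value are assumptions made for this example.

```python
def pair_sequential(servers):
    """Approach one: (1, 2), (3, 4), ... Returns (pairs, leftover or None)."""
    pairs = [(servers[i], servers[i + 1]) for i in range(0, len(servers) - 1, 2)]
    leftover = servers[-1] if len(servers) % 2 else None
    return pairs, leftover


def pair_odd_even(servers):
    """Approach two: split into odd and even lists, then pair them in order."""
    odd = servers[0::2]    # servers numbered 1, 3, 5, ...
    even = servers[1::2]   # servers numbered 2, 4, 6, ...
    pairs = list(zip(odd, even))
    leftover = odd[-1] if len(odd) > len(even) else None
    return pairs, leftover


numbers = [1, 2, 3, 4, 5]
print(pair_sequential(numbers))  # ([(1, 2), (3, 4)], 5)
print(pair_odd_even(numbers))    # ([(1, 2), (3, 4)], 5)
```

With sequential numbering both approaches yield the same pairs; the second simply makes the odd and even lists explicit.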
Specifically, in step S130, a data transmit/receive test is performed on the test pair. The test includes the following steps:
S131: the second server sends a data packet to a preset port using a preset protocol and records a first timestamp at the moment of sending, where the preset port is listened on by the first server;
S132: the first server receives the data packet through the preset port and sends reply data to the second server;
S133: the second server receives the reply data and records a second timestamp at the moment of receipt;
S134: the bandwidth between the first server and the second server is obtained from the first timestamp and the second timestamp.
In the above steps, because the access path and IP address of each server are known, the IP address of the destination server can be read directly while sending and receiving data, so the data can be sent quickly.
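Steps S131 to S134 can be sketched with plain TCP sockets. The patent does not name the preset protocol, so the framing used here (an 8-byte length header followed by the payload, answered by a 2-byte reply) is purely an assumption, as are all function names. The demo runs both roles on the loopback interface with a small payload.

```python
import socket
import threading
import time


def recv_exact(conn, n):
    """Read exactly n bytes from conn and return them."""
    buf = bytearray()
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        buf.extend(chunk)
    return bytes(buf)


def serve_once(listener):
    """First server (S132): receive one packet, then send reply data."""
    conn, _ = listener.accept()
    with conn:
        remaining = int.from_bytes(recv_exact(conn, 8), "big")
        while remaining > 0:                      # drain without storing
            chunk = conn.recv(min(65536, remaining))
            if not chunk:
                raise ConnectionError("peer closed the connection early")
            remaining -= len(chunk)
        conn.sendall(b"OK")


def measure_bandwidth(host, port, payload_size):
    """Second server (S131, S133, S134): return bytes per second."""
    payload = bytes(payload_size)
    with socket.create_connection((host, port)) as conn:
        t1 = time.perf_counter()                  # first timestamp
        conn.sendall(payload_size.to_bytes(8, "big"))
        conn.sendall(payload)
        recv_exact(conn, 2)                       # wait for the reply data
        t2 = time.perf_counter()                  # second timestamp
    return payload_size / (t2 - t1)


def loopback_demo(payload_size=1 << 20):
    """Run both roles on 127.0.0.1 using an OS-assigned port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        port = listener.getsockname()[1]
        server = threading.Thread(target=serve_once, args=(listener,))
        server.start()
        rate = measure_bandwidth("127.0.0.1", port, payload_size)
        server.join()
    return rate
```

Calling `loopback_demo()` returns the measured rate in bytes per second. On a real cluster the two roles would run on different machines, and the payload would need to be large enough that connection setup and reply latency are negligible.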
It should be noted that, in the embodiments of the present application, servers under the same access switch are preferred when selecting test pairs. If there is an odd number of servers under an access switch, so that one remaining server has no other server to pair with, the remaining server under that access switch is registered in the remaining-server list of the core switch, in the order in which the access switch is connected under the core switch.
For an ultra-large distributed cluster, the choice of paired servers is very important and cannot be arbitrary. If all 48 servers under one switch were paired with servers under other switches, the bandwidth between that switch and the core switch would become a bottleneck.
For example, if access switch 1 has 47 servers, one server is bound to be left unpaired when the test pairs are formed; this unpaired server is registered in the remaining-server list of the core switch to which access switch 1 belongs. Likewise, if access switch 4 also has one unpaired server, that server is also registered in the remaining-server list of the core switch.
For example, the registration result may be:
/machine room/core switch 1/1 192.0.x.1
/machine room/core switch 1/4 192.0.y.1
After the unpaired servers of all access switches under core switch 1 have been listed in the remaining-server list of the core switch, two servers at a time are selected according to the preset strategy to form test pairs. When the remaining-server list of the core switch contains an odd number of servers, one server is bound to be left unpaired. In that case, the unpaired server is registered in the remaining-server list of the machine room, in the order in which the core switch is connected under the machine room; in the remaining-server list of the machine room, two servers at a time are selected according to the preset strategy to form test pairs.
When the remaining-server list of the machine room contains an odd number of servers, the unpaired server forms a test pair with the reserved stake server.
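The escalation of unpaired servers from access switch to core switch to machine room to stake server can be sketched as below. The nested-dictionary layout, the server labels, and the function names are assumptions made for illustration; the pairing within each level uses the simple sequential strategy.

```python
def pair_up(servers):
    """Pair servers two at a time; return (pairs, leftover or None)."""
    pairs = [(servers[i], servers[i + 1]) for i in range(0, len(servers) - 1, 2)]
    leftover = servers[-1] if len(servers) % 2 else None
    return pairs, leftover


def build_test_pairs(room, stake):
    """room maps core switch -> access switch -> list of servers, in access order."""
    all_pairs, room_remaining = [], []
    for access_switches in room.values():
        core_remaining = []
        for servers in access_switches.values():
            pairs, leftover = pair_up(servers)        # same access switch first
            all_pairs += pairs
            if leftover is not None:
                core_remaining.append(leftover)       # core switch remaining list
        pairs, leftover = pair_up(core_remaining)     # pairs across access switches
        all_pairs += pairs
        if leftover is not None:
            room_remaining.append(leftover)           # machine room remaining list
    pairs, leftover = pair_up(room_remaining)         # pairs across core switches
    all_pairs += pairs
    if leftover is not None:
        all_pairs.append((leftover, stake))           # last resort: stake server
    return all_pairs


# Hypothetical topology with made-up server labels.
room = {
    "core switch 1": {"access switch 1": ["s1", "s2", "s3"], "access switch 2": ["s4"]},
    "core switch 2": {"access switch 3": ["s5"]},
}
print(build_test_pairs(room, "stake"))
# [('s1', 's2'), ('s3', 's4'), ('s5', 'stake')]
```

Pairing locally first keeps most test traffic inside a single access switch, which is the bottleneck concern raised in the text.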
Specifically, in step S140, the transmit/receive bandwidth between the first server and the second server is calculated from the difference between the first timestamp and the second timestamp and from the amount of data sent and received between the two servers. The amount of data sent between two servers is usually a preset fixed value, such as 10G, so that the calculated bandwidth is comparable with the preset bandwidth threshold. Dividing the amount of data transferred by the difference between the first timestamp and the second timestamp gives the bandwidth between the two servers.
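The calculation is a single division. The sketch below assumes the data amount is given in bytes and the timestamps in seconds, and reports the result in Gbit/s; the function name and the 10-second duration in the example are hypothetical.

```python
def bandwidth_gbits(data_bytes, t1, t2):
    """Bandwidth in Gbit/s from bytes transferred and two timestamps in seconds."""
    elapsed = t2 - t1
    if elapsed <= 0:
        raise ValueError("second timestamp must be later than the first")
    return data_bytes * 8 / elapsed / 1e9


# Hypothetical measurement: 10 GB (10**10 bytes) transferred in 10 seconds.
print(bandwidth_gbits(10**10, 0.0, 10.0))  # 8.0
```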
Specifically, in step S150, the preset bandwidth threshold is the theoretical speed at which data is sent and received between servers. For a large cluster there should be at least four bandwidth threshold values, denoted the first bandwidth threshold, the second bandwidth threshold, the third bandwidth threshold, and the fourth bandwidth threshold.
The first bandwidth threshold applies to servers under the same access switch. When a data transmit/receive test is performed under the same access switch, the data does not cross access switches and is transferred fastest, so the first bandwidth threshold should be the largest of the four. The second bandwidth threshold applies to servers in the remaining-server list of a core switch; these servers are paired across access switches and their data transfer is somewhat slower, so in theory the second bandwidth threshold should be lower than the first. The third bandwidth threshold applies to servers in the remaining-server list of the machine room; these servers are paired across core switches, so in theory the third bandwidth threshold is lower than the second. The fourth bandwidth threshold applies to the stake server and the remaining server in the remaining-server list of the machine room.
After the transmit/receive bandwidth between the first server and the second server is obtained, the access paths of the two servers are read to judge whether the pair crosses access switches and whether it crosses core switches, so that the appropriate bandwidth threshold is selected for the fault judgment. In general, if the transmit/receive bandwidth between two servers is greater than the theoretical value, that is, the transfer speed is greater than the theoretical speed, it can be directly concluded that data transfer between the two servers is normal, without delay, and fault-free.
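Selecting a threshold from two access paths might look like the following sketch. The path format follows the examples earlier in the description; the dictionary keys, the stake handling, and the numeric threshold values are all invented for illustration and are not taken from the patent.

```python
def parse(path):
    """Split '/machine room/core switch 1/access switch 1/3' into its first three levels."""
    parts = path.strip("/").split("/")
    return parts[0], parts[1], parts[2]


def select_threshold(path_a, path_b, thresholds, stake_path=None):
    """Pick the bandwidth threshold that matches how far apart two servers are."""
    if stake_path is not None and stake_path in (path_a, path_b):
        return thresholds["stake"]           # fourth threshold
    _, core_a, access_a = parse(path_a)
    _, core_b, access_b = parse(path_b)
    if core_a != core_b:
        return thresholds["cross_core"]      # third threshold
    if access_a != access_b:
        return thresholds["cross_access"]    # second threshold
    return thresholds["same_access"]         # first threshold


# Hypothetical values in Gbit/s; only their ordering follows the text.
THRESHOLDS = {"same_access": 9.0, "cross_access": 7.0, "cross_core": 5.0, "stake": 5.0}

a = "/machine room/core switch 1/access switch 1/1"
b = "/machine room/core switch 1/access switch 2/1"
print(select_threshold(a, b, THRESHOLDS))  # 7.0
```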
In addition, in the embodiments of the present application, when the transmit/receive bandwidth is judged to be less than or equal to the preset bandwidth threshold, it can be concluded that there is a fault between the two servers, but it is not known whether the first server or the second server is the faulty one. Preferably, therefore, the following step may also be performed after the above steps:
Step S160: select a third server and a fourth server to form two test pairs with the first server and the second server respectively, where the third server and the fourth server are a test pair whose transfer is fault-free.
In this step, it must first be ensured that the selected third and fourth servers are running normally. The first server and the third server can form one test pair and the second server and the fourth server another; alternatively, the first server and the fourth server can form one test pair and the second server and the third server another. Each of the two new test pairs thus contains one server that is known to run normally. When the data transmit/receive test is run again, in any test pair whose transmit/receive bandwidth is less than or equal to the bandwidth threshold, the server other than the normally running one must be the faulty server.
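The isolation step can be sketched as a small function that takes the measurement routine as a parameter. The function names, the fake measurement used in the example, and the use of a single threshold for both new pairs are simplifying assumptions; in practice the threshold would depend on where each new pair sits in the topology.

```python
def isolate_fault(first, second, third, fourth, measure, threshold):
    """Re-test each suspect against a known-good server; return the faulty ones.

    third and fourth must already have passed a test as a pair.
    measure(a, b) is any callable that returns the bandwidth between a and b.
    """
    faulty = []
    if measure(first, third) <= threshold:
        faulty.append(first)
    if measure(second, fourth) <= threshold:
        faulty.append(second)
    return faulty


# Hypothetical stand-in for a real measurement: server "B" is slow.
SLOW = {"B"}

def fake_measure(a, b):
    return 1.0 if a in SLOW or b in SLOW else 10.0


print(isolate_fault("A", "B", "C", "D", fake_measure, 5.0))  # ['B']
```

If both suspects are slow, both are returned, which covers the case where neither server of the original pair is healthy.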
In this embodiment, test pairs are built according to the access path of each server in the server cluster, and a data transmit/receive test is performed on each test pair to judge whether the bandwidth between two servers is abnormal. Server cluster faults are diagnosed on this basis, which achieves real-time monitoring and rapid discovery of server cluster faults.
Fig. 2 is a technical flowchart of Embodiment 2 of the present application. With reference to Fig. 2, a cluster fault monitoring method according to an embodiment of the present application can also be implemented by the following steps:
Step S201: obtain the access path of each server in the cluster according to the structure of the server cluster;
Step S202: under the same access switch, select a first server and a second server according to a preset strategy to form a test pair;
Step S203: perform a data transmit/receive test on the test pair according to the access paths of the first server and the second server;
Step S204: the second server sends a data packet to a preset port using a preset protocol and records a first timestamp at the moment of sending, where the preset port is listened on by the first server;
Step S205: the first server receives the data packet through the preset port and sends reply data to the second server;
Step S206: the second server receives the reply data and records a second timestamp at the moment of receipt;
Step S207: the bandwidth between the first server and the second server is obtained from the first timestamp and the second timestamp;
Step S208: when the transmit/receive bandwidth is judged to be less than or equal to the preset bandwidth threshold, determine that the first server or the second server is faulty;
Step S209: select a third server and a fourth server to form two test pairs with the first server and the second server respectively, where the third server and the fourth server are a test pair whose transfer is fault-free;
Step S210: when, of the two test pairs, the transmit/receive bandwidth of the test pair containing the first server is judged to be lower than the preset transmit/receive bandwidth, determine that the first server is faulty; or, when the transmit/receive bandwidth of the test pair containing the second server is judged to be lower than the preset transmit/receive bandwidth, determine that the second server is faulty.
In this embodiment, test pairs are built according to the access path of each server in the server cluster, and a data transmit/receive test is performed on each test pair to judge whether the bandwidth between two servers is abnormal. When the transfer between two servers is judged to be abnormal, servers whose transfer is fault-free are paired with the two abnormal servers and the data transmit/receive test is run again, which identifies the faulty server. This achieves real-time monitoring and rapid discovery of server cluster faults.
Fig. 3 is a schematic structural diagram of the device of Embodiment 3 of the present application. With reference to Fig. 3, a cluster fault monitoring device according to an embodiment of the present application includes the following modules:
an information acquisition module 310, configured to obtain the access path of each server in the cluster according to the structure of the server cluster;
a test module 320, configured to select, under the same access switch, a first server and a second server according to a preset strategy to form a test pair, and to perform a data transmit/receive test on the test pair according to the access paths of the first server and the second server;
an analysis module 330, configured to obtain the result of the data transmit/receive test and derive from it the transmit/receive bandwidth between the first server and the second server, and, when the transmit/receive bandwidth is judged to be greater than the preset bandwidth threshold, to determine that the first server and the second server are fault-free.
The preset strategy includes: numbering each server in the server cluster and selecting two servers at a time in numerical order to form a test pair; or,
obtaining the odd server list corresponding to the odd-numbered servers and the even server list corresponding to the even-numbered servers;
and selecting, in the order of the odd server list and the even server list, one odd-numbered server and one even-numbered server at a time to form a test pair.
The test module 320 is further configured to: when there is an odd number of servers under the access switch, register the remaining server under the access switch in the remaining-server list of the core switch, in the order in which the access switch is connected under the core switch; the remaining server is the one server, among the odd number of servers, that has no other server to form a test pair with.
The test module 320 is further configured to: in the remaining-server list of the core switch, select two servers at a time according to the preset strategy to form test pairs.
The test module 320 is further configured to: when the remaining-server list of the core switch contains an odd number of servers, register the remaining unpaired server in the remaining-server list of the machine room, in the order in which the core switch is connected under the machine room; and, in the remaining-server list of the machine room, select two servers at a time according to the preset strategy to form test pairs.
The test module 320 is further configured to: when the remaining-server list of the machine room contains an odd number of servers, form a test pair from the unpaired server and the reserved stake server.
The test module 320 is specifically configured so that: the second server sends a data packet to a preset port using a preset protocol and records a first timestamp at the moment of sending, where the preset port is listened on by the first server; the first server receives the data packet through the preset port and sends reply data to the second server; the second server receives the reply data and records a second timestamp at the moment of receipt; and the bandwidth between the first server and the second server is obtained from the first timestamp and the second timestamp.
The test module 320 is further configured to: when the transmit/receive bandwidth is judged to be less than or equal to the preset bandwidth threshold, select a third server and a fourth server to form two test pairs with the first server and the second server respectively, where the third server and the fourth server are a test pair whose transfer is fault-free;
when, of the two test pairs, the transmit/receive bandwidth of the test pair containing the first server is judged to be lower than the preset transmit/receive bandwidth, determine that the first server is faulty; or,
when, of the two test pairs, the transmit/receive bandwidth of the test pair containing the second server is judged to be lower than the preset transmit/receive bandwidth, determine that the second server is faulty.
The device shown in Fig. 3 can perform the methods of the embodiments shown in Fig. 1 and Fig. 2. For the implementation principle and technical effects, refer to the embodiments shown in Fig. 1 and Fig. 2; they are not repeated here.
Application example
The following part further explains the technical solution of the embodiments of the present application with a practical example in a concrete application scenario.
The server cluster system first reserves one server as a stake. It listens on port 54321 and is paired with a remaining server that cannot find a partner.
Step one: use zookeeper to write a path for each server. Each server under an access switch registers a sequential node; the paths may look like this:
/machine room/core switch 1/access switch 1/1
/machine room/core switch 1/access switch 1/2
/machine room/core switch 1/access switch 1/3
..........
/machine room/core switch 1/access switch 1/n
The IP of each server is written into its node:
/machine room/core switch 1/access switch 1/1/192.0.x.1
/machine room/core switch 1/access switch 1/2/192.0.x.2
/machine room/core switch 1/access switch 1/3/192.0.x.3
..........
/machine room/core switch n/access switch m/server n/192.0.x.n
If the last number is odd, step two is performed; otherwise, the path list of the even-numbered servers is compared with the path list of the odd-numbered servers and, in order, one odd-numbered server is paired with one even-numbered server to perform the test step.
For example, the server node registered under the name 1 sends data to the server node named 2. That is, the servers named "/machine room/core switch 1/access switch 1/1/192.0.x.1" and "/machine room/core switch 1/access switch 1/2/192.0.x.2" form a pair that can send data to each other.
When the synchronization service sends the synchronization signal for server node 1, the data transfer is performed, and the related nodes are then deleted.
The test step for two servers is as follows. Suppose the two servers under test are A and B; the purpose of the test is to calculate the bandwidth between servers A and B. First, one of the servers, for example server B, listens on a port (such as 54321). The other server, A, uses a predetermined protocol to send a block of data, such as 10GB, to server B's listening port. Server B receives the data and replies. Server A records the system time t1 before sending and records the system time t2 after receiving server B's reply. Subtracting the two times gives the time t spent sending the data, and dividing the amount of data sent by the time spent gives the bandwidth between the two servers.
Step 2: From step 1, at most one server under each access switch cannot take part in the test, because it has no partner under the same access switch. These servers are registered in sequence under the core switch, for example:
/machine room/core switch 1/1/192.168.1.87
/machine room/core switch 1/2/192.168.2.32
...
The path list of the odd-numbered servers is compared with the path list of the even-numbered servers and, in order, each odd-numbered server is paired with an even-numbered server to perform the test step. For example, the node registered as 1 sends data to the node registered as 2. If the last number is odd, the last server has no partner under this core switch and is handled in step 3.
Data transmission starts once the synchronization service signals that stage 2 is synchronized; afterwards, the associated nodes are deleted.
Step 3: From step 2, at most one server under each core switch cannot take part in the test. These servers can be registered under /machine room:
/machine room/1/192.168.1.123
/machine room/2/192.168.2.128
...
The path list of the odd-numbered servers is compared with the path list of the even-numbered servers and, in order, each odd-numbered server is paired with an even-numbered server to perform the test step. For example, the node registered as 1 sends data to the node registered as 2. If the last number is odd, the last server is paired with the reserved stub node.
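Steps 1 to 3 together amount to pairing servers at the lowest level possible and passing each leftover one level up. The sketch below illustrates that flow under assumptions: the nested-dictionary topology format, the names `plan_tests` and `pair_in_order`, and the stub identifier `reserved-stub` are all hypothetical.

```python
STUB = "reserved-stub"  # hypothetical identifier for the reserved stub node


def pair_in_order(servers):
    """Pair 1 with 2, 3 with 4, ...; return (pairs, leftover or None)."""
    pairs = [(servers[i], servers[i + 1]) for i in range(0, len(servers) - 1, 2)]
    leftover = servers[-1] if len(servers) % 2 else None
    return pairs, leftover


def plan_tests(topology):
    """Plan test pairs for {core switch: {access switch: [server, ...]}}.

    Returns a list of (level, server_a, server_b) tuples, where level is
    "access", "core" or "room" depending on where the pair was formed.
    """
    plan = []
    room_leftovers = []
    for accesses in topology.values():
        core_leftovers = []
        # Step 1: pair servers under the same access switch.
        for servers in accesses.values():
            pairs, left = pair_in_order(servers)
            plan += [("access", a, b) for a, b in pairs]
            if left is not None:
                core_leftovers.append(left)
        # Step 2: pair the leftovers registered under the core switch.
        pairs, left = pair_in_order(core_leftovers)
        plan += [("core", a, b) for a, b in pairs]
        if left is not None:
            room_leftovers.append(left)
    # Step 3: pair the leftovers registered under the machine room.
    pairs, left = pair_in_order(room_leftovers)
    plan += [("room", a, b) for a, b in pairs]
    if left is not None:
        plan.append(("room", left, STUB))  # last server tests against the stub
    return plan


if __name__ == "__main__":
    topology = {
        "core switch 1": {"access switch 1": ["s1", "s2", "s3"],
                          "access switch 2": ["s4"]},
        "core switch 2": {"access switch 3": ["s5"]},
    }
    for entry in plan_tests(topology):
        print(entry)
```

Pairing as low in the hierarchy as possible keeps most test traffic inside a single access switch, so only the few leftover servers generate traffic across core switches.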
In the test of each step, if the speed from server A to server B is considerably slower than the theoretical speed, it cannot be determined directly whether server A is slow, server B is slow, or both are slow. The determination method is as follows. Choose another pair with normal speed, server C and server D, for which the tested bandwidth from server C to server D is normal. Server A and server D, and server C and server B, then form two new pairs, and the data transmission test is run on each. If the transmission speed from server A to server D is slow, server A is faulty; likewise, if the transmission speed from server C to server B is slow, server B is faulty.
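The cross-pairing check can be sketched as below. The function `locate_fault`, the `measure` callback and the numeric threshold are assumptions for illustration; in practice `measure` would run the two-server bandwidth test described earlier.

```python
def locate_fault(measure, a, b, c, d, threshold):
    """Decide which of a slow pair (a, b) is faulty.

    `measure(src, dst)` returns the bandwidth between two servers.
    (c, d) is a pair already known to transfer at normal speed.
    a is re-tested against d and b against c; a server whose new pair
    is still at or below the threshold is reported as faulty.
    """
    faulty = []
    if measure(a, d) <= threshold:
        faulty.append(a)
    if measure(c, b) <= threshold:
        faulty.append(b)
    return faulty


if __name__ == "__main__":
    # Simulated measurements: any transfer involving "A" is slow.
    def fake_measure(src, dst):
        return 10.0 if "A" in (src, dst) else 100.0

    print(locate_fault(fake_measure, "A", "B", "C", "D", threshold=50.0))
```

Testing each suspect against a known-good partner separates the two candidates, since a healthy server paired with a healthy partner should return to normal speed.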
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.
From the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the essence of the above technical solutions, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods described in the embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (14)

1. A cluster failure monitoring method, characterized by comprising the following steps:
obtaining the access path of each server in the cluster according to the structure of the server cluster;
under the same access switch, selecting a first server and a second server according to a preset strategy to form a test pair;
performing a data transceiving test on the test pair according to the access paths of the first server and the second server;
obtaining the result of the data transceiving test and, from that result, the transceiving bandwidth between the first server and the second server;
when the transceiving bandwidth is determined to be greater than a preset bandwidth threshold, determining that the first server and the second server are fault-free.
2. The method according to claim 1, characterized in that the preset strategy comprises:
numbering each server under the same access switch, and selecting two servers in turn by number to form the test pair; or,
obtaining an odd-number server list corresponding to the odd-numbered servers and an even-number server list corresponding to the even-numbered servers;
selecting, in the order of the odd-number server list and the even-number server list, one odd-numbered server and one even-numbered server in turn to form the test pair.
3. The method according to claim 2, characterized in that selecting a first server and a second server according to a preset strategy to form a test pair further comprises:
when there is an odd number of servers under the access switch, registering the remaining server under the access switch, in the access order of the access switches under the core switch, in the remaining-server list of the core switch;
selecting two servers from the remaining-server list of the core switch according to the preset strategy to form the test pair.
4. The method according to claim 3, characterized in that the method further comprises:
when the remaining-server list of the core switch contains an odd number of servers, registering the remaining unpaired server, in the access order of the core switches under the machine room, in the remaining-server list of the machine room;
selecting two servers from the remaining-server list of the machine room according to the preset strategy to form the test pair.
5. The method according to claim 4, characterized in that the method further comprises:
when the remaining-server list of the machine room contains an odd number of servers, forming the test pair from the unpaired server and a reserved stub server.
6. The method according to claim 5, characterized in that performing the data transceiving test on the test pair specifically comprises:
the second server sending a data packet to a preset port via a preset protocol and recording a first timestamp of the sending time, wherein the preset port is listened on by the first server;
the first server receiving the data packet through the preset port and sending reply data to the second server;
the second server receiving the reply data and recording a second timestamp of the receiving time;
obtaining the bandwidth between the first server and the second server according to the first timestamp and the second timestamp.
7. The method according to claim 1, characterized in that the method further comprises:
when the transceiving bandwidth is determined to be less than or equal to the preset bandwidth threshold, selecting a third server and a fourth server to form two test pairs with the first server and the second server respectively, wherein the third server and the fourth server are a test pair whose transmission is fault-free;
when, of the two test pairs, the transceiving bandwidth of the test pair containing the first server is determined to be less than the preset transceiving bandwidth, determining that the first server is faulty; or,
when, of the two test pairs, the transceiving bandwidth of the test pair containing the second server is determined to be less than the preset transceiving bandwidth, determining that the second server is faulty.
8. A cluster failure monitoring device, characterized by comprising the following modules:
an information acquisition module, configured to obtain the access path of each server in the cluster according to the structure of the server cluster;
a test module, configured to select, under the same access switch, a first server and a second server according to a preset strategy to form a test pair, and to perform a data transceiving test on the test pair according to the access paths of the first server and the second server;
an analysis module, configured to obtain the result of the data transceiving test and, from that result, the transceiving bandwidth between the first server and the second server, and, when the transceiving bandwidth is determined to be greater than a preset bandwidth threshold, to determine that the first server and the second server are fault-free.
9. The device according to claim 7, characterized in that the preset strategy comprises:
numbering each server under the same access switch, and selecting two servers in turn by number to form the test pair; or,
obtaining an odd-number server list corresponding to the odd-numbered servers and an even-number server list corresponding to the even-numbered servers;
selecting, in the order of the odd-number server list and the even-number server list, one odd-numbered server and one even-numbered server in turn to form the test pair.
10. The device according to claim 9, characterized in that the test module is further configured to:
when there is an odd number of servers under the access switch, register the remaining server under the access switch, in the access order of the access switches under the core switch, in the remaining-server list of the core switch;
select two servers from the remaining-server list of the core switch according to the preset strategy to form the test pair.
11. The device according to claim 10, characterized in that the test module is further configured to:
when the remaining-server list of the core switch contains an odd number of servers, register the remaining unpaired server, in the access order of the core switches under the machine room, in the remaining-server list of the machine room;
select two servers from the remaining-server list of the machine room according to the preset strategy to form the test pair.
12. The device according to claim 11, characterized in that the test module is further configured to:
when the remaining-server list of the machine room contains an odd number of servers, form the test pair from the unpaired server and a reserved stub server.
13. The device according to claim 8, characterized in that the test module is specifically configured such that:
the second server sends a data packet to a preset port via a preset protocol and records a first timestamp of the sending time, wherein the preset port is listened on by the first server;
the first server receives the data packet through the preset port and sends reply data to the second server;
the second server receives the reply data and records a second timestamp of the receiving time;
the bandwidth between the first server and the second server is obtained according to the first timestamp and the second timestamp.
14. The device according to claim 8, characterized in that the test module is further configured to:
when the transceiving bandwidth is determined to be less than or equal to the preset bandwidth threshold, select a third server and a fourth server to form two test pairs with the first server and the second server respectively, wherein the third server and the fourth server are a test pair whose transmission is fault-free;
when, of the two test pairs, the transceiving bandwidth of the test pair containing the first server is determined to be less than the preset transceiving bandwidth, determine that the first server is faulty; or,
when, of the two test pairs, the transceiving bandwidth of the test pair containing the second server is determined to be less than the preset transceiving bandwidth, determine that the second server is faulty.
CN201610261291.9A 2016-04-25 2016-04-25 Cluster failure monitoring method and device Pending CN105933153A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610261291.9A CN105933153A (en) 2016-04-25 2016-04-25 Cluster failure monitoring method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610261291.9A CN105933153A (en) 2016-04-25 2016-04-25 Cluster failure monitoring method and device

Publications (1)

Publication Number Publication Date
CN105933153A true CN105933153A (en) 2016-09-07

Family

ID=56836072

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610261291.9A Pending CN105933153A (en) 2016-04-25 2016-04-25 Cluster failure monitoring method and device

Country Status (1)

Country Link
CN (1) CN105933153A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106656606A (en) * 2016-12-27 2017-05-10 北京奇虎科技有限公司 Data path testing method, data path testing server and data path testing system
CN111130917A (en) * 2018-10-31 2020-05-08 北京国双科技有限公司 Line testing method, device and system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140067859A1 (en) * 2012-09-04 2014-03-06 Salesforce.Com, Inc. Facilitating dynamically controlled fetching of data at client computing devices in an on-demand services environment
CN104202375A (en) * 2014-08-22 2014-12-10 广州华多网络科技有限公司 Method and system for synchronous data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
严代彪: "军车多区域无线集群通信故障检测方法研究", 《计算机仿真》 *
张毅: "多集群计算环境故障监控管理***", 《计算机工程与科学》 *
梁佼: "高性能服务器故障诊断方法的研究与设计", 《万方数据》 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20160907