CN105933153A - Cluster failure monitoring method and device - Google Patents
Cluster failure monitoring method and device
- Publication number
- CN105933153A CN201610261291.9A
- Authority
- CN
- China
- Prior art keywords
- server
- test
- list
- residue
- access
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0677—Localisation of faults
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/10—Active monitoring, e.g. heartbeat, ping or trace-route
- H04L43/106—Active monitoring, e.g. heartbeat, ping or trace-route using time related information in packets, e.g. by adding timestamps
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Health & Medical Sciences (AREA)
- Cardiology (AREA)
- General Health & Medical Sciences (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
Embodiments of the invention provide a cluster failure monitoring method and device. The access path of every server in the cluster is obtained from the structure of the server cluster. Under the same access switch, a first server and a second server are selected as a test pair according to a preset strategy. A data transmit/receive test is run on the test pair along the access paths of the first and second servers. The transmit/receive bandwidth between the first and second servers is derived from the result of that test. When the transmit/receive bandwidth is judged to be greater than a preset bandwidth threshold, the first and second servers are determined to be failure-free. In this way, server cluster failures can be monitored in real time and found rapidly.
Description
Technical field
Embodiments of the present invention relate to the technical field of big data processing, and in particular to a cluster failure monitoring method and device.
Background art
A server cluster gathers many servers together to provide the same service; to a client, the cluster appears to be a single server. A cluster can use multiple computers for parallel computation to achieve very high computing speed, and can also use multiple computers for backup, so that the whole system keeps running normally when any one machine breaks down. Once the cluster service is installed and running on a server, that server can join the cluster. Running as a cluster reduces the number of single points of failure and makes cluster resources highly available.
In a distributed server cluster, a large job is usually split into multiple tasks, and these tasks are distributed to multiple servers in the cluster for parallel processing, so that data is processed efficiently. However, if one server in the cluster runs slowly, the impact on the whole cluster is considerable: it drags down the speed at which the whole cluster processes the job. A server that has gone down is easy to detect, but a server that merely runs slowly is a different kind of fault and is hard to detect directly.
Finding the faulty server is therefore a crucial step, one that determines whether the whole server cluster can run normally.
Summary of the invention
Embodiments of the present invention provide a cluster failure monitoring method and device, to overcome the defect in the prior art that a failed server in a cluster drags down the running state of the whole cluster, and to achieve efficient monitoring of cluster failures.
An embodiment of the present invention provides a cluster failure monitoring method, including:
obtaining the access path of each server in the cluster according to the structure of the server cluster;
under the same access switch, selecting a first server and a second server according to a preset strategy to form a test pair;
performing a data transmit/receive test on the test pair according to the access paths of the first server and the second server;
obtaining the result of the data transmit/receive test and deriving from it the transmit/receive bandwidth between the first server and the second server;
when the transmit/receive bandwidth is judged to be greater than a preset bandwidth threshold, determining that the first server and the second server are fault-free.
An embodiment of the present invention provides a cluster failure monitoring device, including:
an information acquisition module, configured to obtain the access path of each server in the cluster according to the structure of the server cluster;
a test module, configured to select, under the same access switch, a first server and a second server according to a preset strategy to form a test pair, and to perform a data transmit/receive test on the test pair according to the access paths of the first server and the second server;
an analysis module, configured to obtain the result of the data transmit/receive test, to derive from it the transmit/receive bandwidth between the first server and the second server, and, when the transmit/receive bandwidth is judged to be greater than a preset bandwidth threshold, to determine that the first server and the second server are fault-free.
The cluster failure monitoring method and device provided by the embodiments of the present invention build test pairs from the access paths of the servers in the server cluster and run a data transmit/receive test on each test pair, thereby judging whether the bandwidth between two servers is abnormal. Server cluster failures are judged on that basis, which achieves real-time monitoring and rapid discovery of server cluster failures.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a technical flowchart of Embodiment 1 of the present application;
Fig. 2 is a technical flowchart of Embodiment 2 of the present application;
Fig. 3 is a schematic structural diagram of the device of Embodiment 3 of the present application.
Detailed description of the invention
To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Fig. 1 is a technical flowchart of Embodiment 1 of the present application. With reference to Fig. 1, the cluster failure monitoring method of Embodiment 1 can be implemented by the following steps:
Step S110: obtain the access path of each server in the cluster according to the structure of the server cluster;
Step S120: under the same access switch, select a first server and a second server according to a preset strategy to form a test pair;
Step S130: perform a data transmit/receive test on the test pair according to the access paths of the first server and the second server;
Step S140: obtain the result of the data transmit/receive test and derive from it the transmit/receive bandwidth between the first server and the second server;
Step S150: when the transmit/receive bandwidth is judged to be greater than a preset bandwidth threshold, determine that the first server and the second server are fault-free.
Specifically, in step S110, obtaining the access path of each server in the cluster means determining which access switch and which core switch each server is connected through. A typical big data distributed cluster is structured as follows: servers connect to access switches, and access switches connect to core switches. An access switch has 48 downlink ports and can therefore connect at most 48 servers, with a bandwidth of 10 Gbit/s. An access switch has 2 uplink ports, connected to two core switches respectively, so that the failure of a single core switch does not make the whole cluster unavailable. Each uplink port has a bandwidth of 40 Gbit/s, 80 Gbit/s in total. A core switch has 48 downlink ports, so one set of core switches supports at most 48 × 48 = 2304 servers. The maximum throughput of such a set of two core switches is 3840 Gbit/s.
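The capacity figures in this paragraph follow from simple multiplication. A minimal sketch, assuming the port counts and link speeds quoted in the text (the constant and function names are illustrative and do not come from the patent):

```python
# Port counts and link speeds as quoted in the text; names are hypothetical.
ACCESS_DOWNLINK_PORTS = 48   # servers per access switch
CORE_DOWNLINK_PORTS = 48     # access switches per set of core switches
UPLINKS_PER_ACCESS = 2       # one uplink to each of the two core switches
UPLINK_GBITS = 40            # bandwidth of one uplink port, in Gbit/s


def max_servers() -> int:
    """Largest number of servers under one set of core switches."""
    return CORE_DOWNLINK_PORTS * ACCESS_DOWNLINK_PORTS


def max_core_throughput_gbits() -> int:
    """Total uplink bandwidth arriving at the two core switches."""
    return CORE_DOWNLINK_PORTS * UPLINKS_PER_ACCESS * UPLINK_GBITS


print(max_servers())                # 2304
print(max_core_throughput_gbits())  # 3840
```

The second figure is 48 access switches times 80 Gbit/s of uplink each, which matches the 3840 Gbit/s stated above.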
Given this cluster structure, the access paths of the servers under access switch 1 of core switch 1 can be described as follows:
/machine room/core switch 1/access switch 1/1
/machine room/core switch 1/access switch 1/2
/machine room/core switch 1/access switch 1/3
..........
/machine room/core switch 1/access switch 1/48
The servers under access switch 2 of core switch 2 can have the following access paths:
/machine room/core switch 2/access switch 2/1
/machine room/core switch 2/access switch 2/2
/machine room/core switch 2/access switch 2/3
..........
/machine room/core switch 2/access switch 2/48
In this step, after the access path of each server is obtained, the IP address of each server must be matched with its access path; the IP address is used in the data transmit/receive test of the subsequent steps.
For example, writing each server's IP into its node gives the following result:
/machine room/core switch 1/access switch 1/1/192.0.x.1
/machine room/core switch 1/access switch 1/2/192.0.x.2
/machine room/core switch 1/access switch 1/3/192.0.x.3
..........
/machine room/core switch n/access switch 1/server 48/192.0.x.48
Note that in this embodiment of the application, zookeeper is used to write the path of each server. Because zookeeper is a mature and reliable coordination system for distributed systems, it is not described further in this embodiment.
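The path layout above can be illustrated without a coordination service. The following is a minimal sketch that keeps the registry in a plain dictionary instead of zookeeper; the function names and the 192.0.2.x addresses are assumptions made for illustration, not part of the patent:

```python
from typing import Dict, List


def access_path(core: int, access: int, seq: int, room: str = "machine room") -> str:
    """Build a path such as '/machine room/core switch 1/access switch 1/3'."""
    return f"/{room}/core switch {core}/access switch {access}/{seq}"


def build_registry(core: int, access: int, ips: List[str]) -> Dict[str, str]:
    """Map each server's access path to its IP, numbering servers from 1."""
    return {access_path(core, access, i): ip for i, ip in enumerate(ips, start=1)}


# Hypothetical addresses for three servers under access switch 1 of core switch 1.
registry = build_registry(1, 1, ["192.0.2.1", "192.0.2.2", "192.0.2.3"])
for path, ip in registry.items():
    print(f"{path}/{ip}")
```

Keeping the path and the IP together means a later step can look up where to send test data and, from the same key, which switches the traffic will cross.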
Specifically, in step S120, the preset strategy is a preconfigured strategy for selecting test pairs, and can include the following approaches:
First: number the servers under the same switch, and select two servers at a time in numbering order to form a test pair.
With this selection strategy, suppose there are 48 sequentially numbered servers under the same access switch. Then, in order, server 1 and server 2 form a test pair, server 3 and server 4 form a test pair, and so on, until server 47 and server 48 form a test pair.
Second: list the odd-numbered servers to obtain an odd server list, and list the even-numbered servers to obtain an even server list; then, following the order of the odd server list and the even server list, select one odd-numbered server and one even-numbered server at a time to form a test pair.
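Both selection strategies can be sketched in a few lines. This is an illustrative sketch only; the function names are hypothetical, and servers are represented by their numbers:

```python
from typing import List, Optional, Tuple

Pairs = List[Tuple[int, int]]


def pair_sequential(numbers: List[int]) -> Tuple[Pairs, Optional[int]]:
    """Strategy one: pair neighbours in numbering order: (1, 2), (3, 4), ..."""
    ordered = sorted(numbers)
    pairs = [(ordered[i], ordered[i + 1]) for i in range(0, len(ordered) - 1, 2)]
    leftover = ordered[-1] if len(ordered) % 2 else None
    return pairs, leftover


def pair_odd_even(numbers: List[int]) -> Tuple[Pairs, List[int]]:
    """Strategy two: walk the odd list and the even list in step."""
    odds = sorted(n for n in numbers if n % 2 == 1)
    evens = sorted(n for n in numbers if n % 2 == 0)
    pairs = list(zip(odds, evens))
    leftover = odds[len(pairs):] + evens[len(pairs):]
    return pairs, leftover


pairs, leftover = pair_sequential(list(range(1, 49)))
print(pairs[0], pairs[-1], leftover)
```

For consecutively numbered servers the two strategies produce the same pairs. With 47 servers, either strategy leaves server 47 without a partner.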
Specifically, in step S130, a data transmit/receive test is performed on the test pair. The test includes the following steps:
S131: the second server sends a data packet to a preset port using a preset protocol and records a first timestamp at the moment of sending, where the preset port is listened on by the first server;
S132: the first server receives the data packet through the preset port and sends reply data back to the second server;
S133: the second server receives the reply data and records a second timestamp at the moment of receipt;
S134: the bandwidth between the first server and the second server is obtained from the first timestamp and the second timestamp.
In these steps, the access path and IP address of every server are already known, so during sending and receiving the IP address of the destination server can be read directly and the data sent without delay.
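Steps S131 to S134 can be sketched with plain TCP sockets. The patent only says "preset protocol", so TCP, the two-byte reply and all function names here are assumptions; the demo runs both roles on the local machine with a small payload rather than between two real servers:

```python
import socket
import threading
import time


def serve_once(listener: socket.socket, expected: int) -> None:
    """First server: receive `expected` bytes on the preset port, then reply (S132)."""
    conn, _ = listener.accept()
    with conn:
        received = 0
        while received < expected:
            chunk = conn.recv(65536)
            if not chunk:
                break
            received += len(chunk)
        conn.sendall(b"ok")  # reply data


def measure_bandwidth(host: str, port: int, payload: bytes) -> float:
    """Second server: send the payload, wait for the reply, return bytes per second."""
    with socket.create_connection((host, port)) as sock:
        t1 = time.perf_counter()  # first timestamp (S131)
        sock.sendall(payload)
        sock.recv(16)             # wait for the reply data (S133)
        t2 = time.perf_counter()  # second timestamp
    return len(payload) / max(t2 - t1, 1e-9)  # S134


def loopback_demo(size: int = 1_000_000) -> float:
    """Run both roles locally; port 0 lets the operating system pick a free port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]
    worker = threading.Thread(target=serve_once, args=(listener, size), daemon=True)
    worker.start()
    try:
        return measure_bandwidth("127.0.0.1", port, b"\0" * size)
    finally:
        worker.join(timeout=5)
        listener.close()


if __name__ == "__main__":
    print(f"{loopback_demo() / 1e6:.1f} MB/s over loopback")
```

A loopback measurement says nothing about a real network link; the sketch only shows where the two timestamps are taken relative to the send and the reply.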
Note that in this embodiment of the application, when test pairs are selected, servers under the same access switch are paired preferentially. If there is an odd number of servers under an access switch, so that one remaining server has no partner, the remaining server under that access switch is registered in the remaining-server list of the core switch, in the order in which the access switches are connected under the core switch.
For an ultra-large-scale distributed cluster, the choice of paired servers is very important and cannot be arbitrary. If all 48 servers under one switch were paired with servers under other switches, the bandwidth between that switch and the core switch would become the bottleneck.
For example, if access switch 1 has 47 servers, one server is necessarily left unpaired when the test pairs are formed. This unpaired server is registered in the remaining-server list of the core switch to which access switch 1 belongs. Likewise, if access switch 4 also has an unpaired server, that server is registered in the same remaining-server list of the core switch.
For example, the registration result may be:
/machine room/core switch 1/1 192.0.x.1
/machine room/core switch 1/4 192.0.y.1
After the unpaired servers of all access switches under core switch 1 have been entered in the core switch's remaining-server list, two servers at a time are selected according to the preset strategy to form test pairs.
When the remaining-server list of the core switch contains an odd number of servers, one server is necessarily left unpaired. In that case, the unpaired server is registered in the remaining-server list of the machine room, in the order in which the core switches are connected under the machine room. In the machine room's remaining-server list, two servers at a time are selected according to the preset strategy to form test pairs.
When the remaining-server list of the machine room contains an odd number of servers, the unpaired server forms a test pair with a reserved stake server.
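The escalation from access switch to core switch to machine room to stake server can be sketched as nested loops. This is a minimal illustration under assumed data structures (a nested dictionary keyed by core switch and access switch, listed in access order); none of the names come from the patent:

```python
from typing import Dict, List, Tuple

Server = str
Pair = Tuple[Server, Server]


def pair_up(servers: List[Server]) -> Tuple[List[Pair], List[Server]]:
    """Pair neighbours in list order; return the pairs plus at most one leftover."""
    pairs = [(servers[i], servers[i + 1]) for i in range(0, len(servers) - 1, 2)]
    leftover = servers[-1:] if len(servers) % 2 else []
    return pairs, leftover


def build_test_pairs(room: Dict[str, Dict[str, List[Server]]], stake: Server) -> List[Pair]:
    """`room` maps core switch -> access switch -> servers, all in access order."""
    pairs: List[Pair] = []
    room_remaining: List[Server] = []
    for access_switches in room.values():
        core_remaining: List[Server] = []
        for servers in access_switches.values():
            found, rest = pair_up(servers)      # same access switch first
            pairs += found
            core_remaining += rest
        found, rest = pair_up(core_remaining)   # then across access switches
        pairs += found
        room_remaining += rest
    found, rest = pair_up(room_remaining)       # then across core switches
    pairs += found
    if rest:                                    # last resort: the stake server
        pairs.append((rest[0], stake))
    return pairs


room = {
    "core 1": {"access 1": ["a1", "a2", "a3"], "access 4": ["b1"]},
    "core 2": {"access 2": ["c1", "c2", "c3"]},
}
print(build_test_pairs(room, stake="stake"))
```

In this example `a3` and `b1` are paired at the core-switch level, and `c3`, the only server left at machine-room level, is paired with the stake server. Every server takes part in exactly one test, and cross-switch traffic is limited to at most one server per switch.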
Specifically, in step S140, the transmit/receive bandwidth between the first server and the second server is calculated from the difference between the first timestamp and the second timestamp and from the amount of data exchanged between the two servers. The data sent between two servers is usually a preset fixed amount, for example 10 GB, so that the calculated bandwidth is comparable with the preset bandwidth threshold. Dividing the amount of data transferred by the difference between the first timestamp and the second timestamp gives the bandwidth between the two servers.
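The calculation is a single division. A short sketch, assuming timestamps in seconds and a result in Gbit/s so that it can be compared with link speeds (the function name and the choice of units are illustrative):

```python
def bandwidth_gbits(data_bytes: int, t1: float, t2: float) -> float:
    """Bandwidth in Gbit/s from the bytes transferred and two timestamps in seconds."""
    elapsed = t2 - t1
    if elapsed <= 0:
        raise ValueError("the second timestamp must be later than the first")
    return data_bytes * 8 / elapsed / 1e9


# 10 GB, taken here as 10 * 10**9 bytes, transferred in 10 seconds.
print(bandwidth_gbits(10 * 10**9, 100.0, 110.0))  # 8.0
```

Because the payload size is fixed, a slow server shows up directly as a longer elapsed time and therefore a lower computed bandwidth.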
Specifically, in step S150, the preset bandwidth threshold is the theoretical speed at which data is sent and received between servers. For a large cluster, there should be at least four bandwidth threshold values, denoted the first, second, third and fourth bandwidth thresholds.
The first bandwidth threshold applies to servers under the same access switch. When the data transmit/receive test runs under the same access switch, the data does not cross access switches and transmission is fastest, so the first bandwidth threshold should be the largest of the four. The second bandwidth threshold applies to servers in the core switch's remaining-server list. These servers are paired across access switches, so data transmission is somewhat slower; in theory, the second bandwidth threshold should be smaller than the first. The third bandwidth threshold applies to servers in the machine room's remaining-server list. These servers are paired across core switches, and in theory the third bandwidth threshold is smaller than the second. The fourth bandwidth threshold applies to the pairing of the stake server with the remaining server in the machine room's remaining-server list.
After the transmit/receive bandwidth between the first server and the second server is obtained, the access paths of the two servers are read to determine whether the pair crosses access switches and whether it crosses core switches, so that the appropriate bandwidth threshold is selected for judging server failure. In general, if the transmit/receive bandwidth between two servers is greater than the theoretical value, i.e. the transmission speed exceeds the theoretical speed, data transmission between the two servers can be judged directly to be normal, without delay and fault-free.
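Choosing a threshold from two access paths amounts to comparing path components. A minimal sketch, assuming paths of the form shown earlier; the threshold values in `EXAMPLE_THRESHOLDS` are invented purely for illustration and the function names are hypothetical:

```python
from typing import Dict, Optional, Tuple

# Hypothetical values in Gbit/s; the patent gives no concrete thresholds.
EXAMPLE_THRESHOLDS: Dict[str, float] = {
    "same_access": 9.0,    # first threshold: same access switch
    "cross_access": 8.0,   # second threshold: across access switches
    "cross_core": 7.0,     # third threshold: across core switches
    "stake": 6.0,          # fourth threshold: paired with the stake server
}


def switches_of(path: str) -> Tuple[str, str]:
    """'/machine room/core switch 1/access switch 2/3' -> (core part, access part)."""
    parts = path.strip("/").split("/")
    return parts[1], parts[2]


def select_threshold(path_a: str, path_b: str, thresholds: Dict[str, float],
                     stake_path: Optional[str] = None) -> float:
    """Pick the threshold that matches how far apart the two servers are."""
    if stake_path is not None and stake_path in (path_a, path_b):
        return thresholds["stake"]
    core_a, access_a = switches_of(path_a)
    core_b, access_b = switches_of(path_b)
    if core_a != core_b:
        return thresholds["cross_core"]
    if access_a != access_b:
        return thresholds["cross_access"]
    return thresholds["same_access"]


def is_fault_free(bandwidth: float, threshold: float) -> bool:
    """A pair is fault-free only when its bandwidth exceeds the threshold."""
    return bandwidth > threshold


a = "/machine room/core switch 1/access switch 1/1"
b = "/machine room/core switch 1/access switch 4/1"
print(select_threshold(a, b, EXAMPLE_THRESHOLDS))
```

Checking the core switch before the access switch keeps the comparison correct even if access switch numbers repeat under different core switches.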
In addition, in this embodiment of the application, when the transmit/receive bandwidth is judged to be less than or equal to the preset bandwidth threshold, it can be determined that a fault exists between the two servers, but it is not known whether the first server or the second server is the faulty one. Preferably, therefore, the following step can follow the steps above:
Step S160: select a third server and a fourth server to form two test pairs with the first server and the second server respectively, where the third server and the fourth server are a test pair whose transmission has been found fault-free.
In this step, the third and fourth servers are first confirmed to be normally running servers. The first server can be paired with the third server and the second server with the fourth; alternatively, the first server can be paired with the fourth server and the second server with the third. Each of the two new test pairs thus contains one normally running server. When the data transmit/receive test is run again, in any test pair whose transmit/receive bandwidth is less than or equal to the bandwidth threshold, the server other than the normally running one must be the faulty server.
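The cross-pairing step can be sketched as two re-tests against known-good servers. This is an illustrative sketch: `measure` stands for whatever bandwidth test is in use, the simulated measurement and all names are hypothetical, and a real implementation would pick the threshold that matches each new pair's access paths:

```python
from typing import Callable, List


def locate_faulty(first: str, second: str, good_a: str, good_b: str,
                  measure: Callable[[str, str], float], threshold: float) -> List[str]:
    """Re-test each suspect with a known-good partner; return suspects that stay slow."""
    faulty = []
    if measure(first, good_a) <= threshold:
        faulty.append(first)
    if measure(second, good_b) <= threshold:
        faulty.append(second)
    return faulty


# Simulated measurement: any pair that includes a slow server reports 2.0 Gbit/s.
SLOW = {"B"}


def fake_measure(x: str, y: str) -> float:
    return 2.0 if (x in SLOW or y in SLOW) else 9.5


print(locate_faulty("A", "B", "C", "D", fake_measure, threshold=8.0))  # ['B']
```

Because each new pair contains one server already known to be healthy, a slow result can only be attributed to the suspect in that pair; if both re-tests are slow, both suspects are reported.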
In this embodiment, test pairs are built from the access paths of the servers in the server cluster, and a data transmit/receive test is run on each test pair to judge whether the bandwidth between two servers is abnormal. Server cluster failures are judged on that basis, which achieves real-time monitoring and rapid discovery of server cluster failures.
Fig. 2 is a technical flowchart of Embodiment 2 of the present application. With reference to Fig. 2, the cluster failure monitoring method of this embodiment of the application can also be implemented by the following steps:
Step S201: obtain the access path of each server in the cluster according to the structure of the server cluster;
Step S202: under the same access switch, select a first server and a second server according to a preset strategy to form a test pair;
Step S203: perform a data transmit/receive test on the test pair according to the access paths of the first server and the second server;
Step S204: the second server sends a data packet to a preset port using a preset protocol and records a first timestamp at the moment of sending, where the preset port is listened on by the first server;
Step S205: the first server receives the data packet through the preset port and sends reply data back to the second server;
Step S206: the second server receives the reply data and records a second timestamp at the moment of receipt;
Step S207: obtain the bandwidth between the first server and the second server from the first timestamp and the second timestamp;
Step S208: when the transmit/receive bandwidth is judged to be less than or equal to the preset bandwidth threshold, determine that the first server or the second server is faulty;
Step S209: select a third server and a fourth server to form two test pairs with the first server and the second server respectively, where the third server and the fourth server are a test pair whose transmission is fault-free;
Step S210: when, of the two test pairs, the transmit/receive bandwidth of the test pair containing the first server is judged to be less than the preset transmit/receive bandwidth, judge that the first server is faulty; or, when the transmit/receive bandwidth of the test pair containing the second server is judged to be less than the preset transmit/receive bandwidth, judge that the second server is faulty.
In this embodiment, test pairs are built from the access paths of the servers in the server cluster, and a data transmit/receive test is run on each test pair to judge whether the bandwidth between two servers is abnormal. When transmission between two servers is judged abnormal, servers known to transmit without fault are paired with the two abnormal servers and the data transmit/receive test is run again, which identifies the faulty server. This achieves real-time monitoring and rapid discovery of server cluster failures.
Fig. 3 is a schematic structural diagram of the device of Embodiment 3 of the present application. With reference to Fig. 3, a cluster failure monitoring device of this embodiment of the application includes the following modules:
an information acquisition module 310, configured to obtain the access path of each server in the cluster according to the structure of the server cluster;
a test module 320, configured to select, under the same access switch, a first server and a second server according to a preset strategy to form a test pair, and to perform a data transmit/receive test on the test pair according to the access paths of the first server and the second server;
an analysis module 330, configured to obtain the result of the data transmit/receive test, to derive from it the transmit/receive bandwidth between the first server and the second server, and, when the transmit/receive bandwidth is judged to be greater than a preset bandwidth threshold, to determine that the first server and the second server are fault-free.
The preset strategy includes: numbering the servers in the server cluster and selecting two servers at a time in numbering order to form a test pair; or,
obtaining the odd server list corresponding to the odd-numbered servers and the even server list corresponding to the even-numbered servers, and, following the order of the odd server list and the even server list, selecting one odd-numbered server and one even-numbered server at a time to form a test pair.
The test module 320 is further configured to: when there is an odd number of servers under an access switch, register the remaining server under that access switch in the remaining-server list of the core switch, in the order in which the access switches are connected under the core switch; where the remaining server is the one server, among the odd number of servers, that has no other server to form a test pair with.
The test module 320 is further configured to: in the remaining-server list of the core switch, select two servers at a time according to the preset strategy to form test pairs.
The test module 320 is further configured to: when the remaining-server list of the core switch contains an odd number of servers, register the remaining unpaired server in the remaining-server list of the machine room, in the order in which the core switches are connected under the machine room; and, in the remaining-server list of the machine room, select two servers at a time according to the preset strategy to form test pairs.
The test module 320 is further configured to: when the remaining-server list of the machine room contains an odd number of servers, form a test pair from the unpaired server and a reserved stake server.
The test module 320 is specifically configured so that: the second server sends a data packet to a preset port using a preset protocol and records a first timestamp at the moment of sending, where the preset port is listened on by the first server; the first server receives the data packet through the preset port and sends reply data back to the second server; the second server receives the reply data and records a second timestamp at the moment of receipt; and the bandwidth between the first server and the second server is obtained from the first timestamp and the second timestamp.
The test module 320 is further configured to: when the transmit/receive bandwidth is judged to be less than or equal to the preset bandwidth threshold, select a third server and a fourth server to form two test pairs with the first server and the second server respectively, where the third server and the fourth server are a test pair whose transmission is fault-free;
when, of the two test pairs, the transmit/receive bandwidth of the test pair containing the first server is judged to be less than the preset transmit/receive bandwidth, judge that the first server is faulty; or,
when, of the two test pairs, the transmit/receive bandwidth of the test pair containing the second server is judged to be less than the preset transmit/receive bandwidth, judge that the second server is faulty.
The device shown in Fig. 3 can perform the methods of the embodiments shown in Fig. 1 and Fig. 2. For its implementation principle and technical effects, refer to those embodiments; they are not repeated here.
Application example
The following part further explains the technical solution of the embodiments of the application with a practical example in a concrete application scenario.
The server cluster system first reserves one server as a stake. It listens on port 54321 and is paired with a remaining server that cannot find a partner.
Step one: use zookeeper to write a path for each server. Each server under an access switch registers a sequential node; the paths may look like this:
/machine room/core switch 1/access switch 1/1
/machine room/core switch 1/access switch 1/2
/machine room/core switch 1/access switch 1/3
..........
/machine room/core switch 1/access switch 1/n
Write the server's IP into the node:
/machine room/core switch 1/access switch 1/1/192.0.x.1
/machine room/core switch 1/access switch 1/2/192.0.x.2
/machine room/core switch 1/access switch 1/3/192.0.x.3
..........
/machine room/core switch n/access switch m/server n/192.0.x.n
If the last number is odd, the last server has no partner and is handled in step two. The other servers are paired by comparing the path list of the even-numbered servers with the path list of the odd-numbered servers: in order, one odd-numbered server is paired with one even-numbered server, and the test step is performed.
For example, the server node registered under the name 1 sends data to the server node named 2. That is, the servers named "/machine room/core switch 1/access switch 1/1 192.0.x.1" and "/machine room/core switch 1/access switch 1/2 192.0.x.2" form a pair that can send data to each other.
Server node 1 waits for the synchronization service to signal synchronization, then transmits the data; afterwards the related nodes are deleted.
The test step for two servers is as follows. Assume the two servers under test are A and B; the purpose of the test is to calculate the bandwidth between A and B. First, one of them, for example server B, listens on a port (such as 54321). The other server, A, uses a predetermined protocol to send a block of data, for example 10 GB, to server B's listening port. Server B receives the data and replies. Server A records the system time t1 before sending and records the system time t2 after receiving server B's reply. Subtracting the two times gives the time t spent sending the data, and dividing the amount of data sent by the time spent gives the bandwidth between the two servers.
Step two: from step one it follows that under each access switch at most one server cannot take part in the test, because it has no partner under the same access switch. These servers are registered sequentially under the core switch, for example:
/machine room/core switch 1/1 192.168.1.87
/machine room/core switch 1/2 192.168.2.32
.....
If the last number is odd, the last server has no partner and is handled in step three. The other servers are paired by comparing the path list of the even-numbered servers with the path list of the odd-numbered servers: in order, one odd-numbered server is paired with one even-numbered server, and the test step is performed. For example, the node registered under the name 1 sends data to the node named 2.
The sending node waits for the synchronization service to signal synchronization for stage 2, then transmits the data; afterwards the related nodes are deleted.
Step 3: From step 2 it follows that at most one server under each core switch cannot take part in the test. These servers can be registered under /machine room:
/ machine room/1 192.168.1.123
/ machine room/2 192.168.2.128
....
If the last sequence number is odd, the remaining server is paired with a reserved stub node. Otherwise, the path list of the even-numbered servers is compared with the path list of the odd-numbered servers and, in order, each odd-numbered server is paired with an even-numbered server to perform the test procedure. For example, the node registered under the name 1 sends data to the node named 2.
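Taken together, steps 1 to 3 escalate unpaired servers from the access switch to the core switch to the machine room, and finally to the reserved stub node. The sketch below illustrates that flow under stated assumptions: the nested-dictionary layout, the names pair_in_order and plan_tests, and the "stub-node" label are all hypothetical, and dictionary order stands in for registration order.

```python
def pair_in_order(servers):
    """Pair 1st with 2nd, 3rd with 4th, ...; return (pairs, leftover or None)."""
    pairs = list(zip(servers[0::2], servers[1::2]))
    leftover = servers[-1] if len(servers) % 2 else None
    return pairs, leftover


def plan_tests(room, stub="stub-node"):
    """Build test pairs for one machine room.

    room maps core switch -> access switch -> list of servers, all in
    registration order.
    """
    pairs, room_leftovers = [], []
    for access_switches in room.values():
        core_leftovers = []
        for servers in access_switches.values():
            found, left = pair_in_order(servers)      # step 1: same access switch
            pairs += found
            if left is not None:
                core_leftovers.append(left)
        found, left = pair_in_order(core_leftovers)   # step 2: same core switch
        pairs += found
        if left is not None:
            room_leftovers.append(left)
    found, left = pair_in_order(room_leftovers)       # step 3: same machine room
    pairs += found
    if left is not None:
        pairs.append((left, stub))                    # pair with the reserved stub
    return pairs
```

Pairing at the lowest common switch first keeps most test traffic off the upper layers; only the leftovers cross a core switch or the machine-room level.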
In the test of each step, if the speed from server A to server B is considerably slower than the theoretical speed, it cannot be determined whether server A is slow, server B is slow, or both are slow. The determination method is as follows. Choose another server pair whose speed is normal, server C and server D, for which the bandwidth for transmitting data from server C to server D has already been tested and found normal. Server A and server D, and server C and server B, then form two new pairs, and data transmission tests are carried out. If the transmission speed from server A to server D is slow, server A is faulty.
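The cross-pairing check can be sketched as below. The measure callback, the threshold, and the function name locate_slow_servers are hypothetical; the symmetric check for server B follows the claims rather than the paragraph above, which only spells out the case for server A.

```python
def locate_slow_servers(a, b, c, d, measure, threshold):
    """Cross-pair suspects a, b with a known-good pair c, d.

    measure(src, dst) returns the bandwidth from src to dst.
    Returns the list of suspects judged faulty.
    """
    faulty = []
    if measure(a, d) < threshold:   # A tested against known-good D
        faulty.append(a)
    if measure(c, b) < threshold:   # known-good C tested against B
        faulty.append(b)
    return faulty
```

An empty result means both suspects performed normally against healthy partners; the text does not say how that case is handled.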
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
From the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware. Based on this understanding, the above technical solution in essence, or the part of it that contributes to the prior art, can be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (14)
1. A cluster failure monitoring method, characterized by comprising the following steps:
obtaining the access path of each server in a cluster according to the structure of the server cluster;
under the same access switch, selecting a first server and a second server according to a preset strategy to form a test pair;
performing a data transmit-receive test on the test pair according to the access paths of the first server and the second server;
obtaining the result of the data transmit-receive test and obtaining the transmit-receive bandwidth between the first server and the second server according to that result;
when the transmit-receive bandwidth is judged to be greater than a preset bandwidth threshold, determining that the first server and the second server are fault-free.
2. The method according to claim 1, characterized in that the preset strategy comprises:
numbering each server under the same access switch and selecting two servers in turn by number to form the test pair; or,
obtaining an odd server list corresponding to the odd-numbered servers and an even server list corresponding to the even-numbered servers;
selecting in turn, according to the order of the odd server list and the even server list, one odd-numbered server and one even-numbered server to form the test pair.
3. The method according to claim 2, characterized in that selecting a first server and a second server according to a preset strategy to form a test pair further comprises:
when an odd number of servers exist under the access switch, registering the remaining server under the access switch in the remaining-server list of the core switch, in the order in which the access switches are connected under the core switch;
selecting two servers from the remaining-server list of the core switch according to the preset strategy to form the test pair.
4. The method according to claim 3, characterized in that the method further comprises:
when the remaining-server list of the core switch contains an odd number of servers, registering the remaining unpaired server in the remaining-server list of the machine room, in the order in which the core switches are connected under the machine room;
selecting two servers from the remaining-server list of the machine room according to the preset strategy to form the test pair.
5. The method according to claim 4, characterized in that the method further comprises:
when the remaining-server list of the machine room contains an odd number of servers, forming the test pair from the unpaired server and a reserved stub server.
6. The method according to claim 5, characterized in that performing the data transmit-receive test on the test pair specifically comprises:
the second server sending a data packet to a preset port over a preset protocol and recording a first timestamp of the current sending moment, wherein the preset port is listened on by the first server;
the first server receiving the data packet through the preset port and sending reply data to the second server;
the second server receiving the reply data and recording a second timestamp of the current receiving moment;
obtaining the bandwidth between the first server and the second server according to the first timestamp and the second timestamp.
7. The method according to claim 1, characterized in that the method further comprises:
when the transmit-receive bandwidth is judged to be less than or equal to the preset bandwidth threshold, selecting a third server and a fourth server to form two test pairs with the first server and the second server respectively, wherein the third server and the fourth server are a test pair whose transmission is fault-free;
when, of the two test pairs, the transmit-receive bandwidth of the test pair containing the first server is judged to be less than the preset transmit-receive bandwidth, judging that the first server is faulty; or,
when, of the two test pairs, the transmit-receive bandwidth of the test pair containing the second server is judged to be less than the preset transmit-receive bandwidth, judging that the second server is faulty.
8. A cluster failure monitoring device, characterized by comprising the following modules:
an information obtaining module, configured to obtain the access path of each server in a cluster according to the structure of the server cluster;
a test module, configured to select, under the same access switch, a first server and a second server according to a preset strategy to form a test pair, and to perform a data transmit-receive test on the test pair according to the access paths of the first server and the second server;
an analysis module, configured to obtain the result of the data transmit-receive test and obtain the transmit-receive bandwidth between the first server and the second server according to that result, and, when the transmit-receive bandwidth is judged to be greater than a preset bandwidth threshold, to determine that the first server and the second server are fault-free.
9. The device according to claim 7, characterized in that the preset strategy comprises:
numbering each server under the same access switch and selecting two servers in turn by number to form the test pair; or,
obtaining an odd server list corresponding to the odd-numbered servers and an even server list corresponding to the even-numbered servers;
selecting in turn, according to the order of the odd server list and the even server list, one odd-numbered server and one even-numbered server to form the test pair.
10. The device according to claim 9, characterized in that the test module is further configured to:
when an odd number of servers exist under the access switch, register the remaining server under the access switch in the remaining-server list of the core switch, in the order in which the access switches are connected under the core switch;
select two servers from the remaining-server list of the core switch according to the preset strategy to form the test pair.
11. The device according to claim 10, characterized in that the test module is further configured to:
when the remaining-server list of the core switch contains an odd number of servers, register the remaining unpaired server in the remaining-server list of the machine room, in the order in which the core switches are connected under the machine room;
select two servers from the remaining-server list of the machine room according to the preset strategy to form the test pair.
12. The device according to claim 11, characterized in that the test module is further configured to:
when the remaining-server list of the machine room contains an odd number of servers, form the test pair from the unpaired server and a reserved stub server.
13. The device according to claim 8, characterized in that the test module is specifically configured such that:
the second server sends a data packet to a preset port over a preset protocol and records a first timestamp of the current sending moment, wherein the preset port is listened on by the first server;
the first server receives the data packet through the preset port and sends reply data to the second server;
the second server receives the reply data and records a second timestamp of the current receiving moment;
the bandwidth between the first server and the second server is obtained according to the first timestamp and the second timestamp.
14. The device according to claim 8, characterized in that the test module is further configured to:
when the transmit-receive bandwidth is judged to be less than or equal to the preset bandwidth threshold, select a third server and a fourth server to form two test pairs with the first server and the second server respectively, wherein the third server and the fourth server are a test pair whose transmission is fault-free;
when, of the two test pairs, the transmit-receive bandwidth of the test pair containing the first server is judged to be less than the preset transmit-receive bandwidth, judge that the first server is faulty; or,
when, of the two test pairs, the transmit-receive bandwidth of the test pair containing the second server is judged to be less than the preset transmit-receive bandwidth, judge that the second server is faulty.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610261291.9A CN105933153A (en) | 2016-04-25 | 2016-04-25 | Cluster failure monitoring method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN105933153A true CN105933153A (en) | 2016-09-07 |
Family
ID=56836072
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610261291.9A Pending CN105933153A (en) | 2016-04-25 | 2016-04-25 | Cluster failure monitoring method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105933153A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106656606A (en) * | 2016-12-27 | 2017-05-10 | 北京奇虎科技有限公司 | Data path testing method, data path testing server and data path testing system |
CN111130917A (en) * | 2018-10-31 | 2020-05-08 | 北京国双科技有限公司 | Line testing method, device and system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140067859A1 (en) * | 2012-09-04 | 2014-03-06 | Salesforce.Com, Inc. | Facilitating dynamically controlled fetching of data at client computing devices in an on-demand services environment |
CN104202375A (en) * | 2014-08-22 | 2014-12-10 | 广州华多网络科技有限公司 | Method and system for synchronous data |
Non-Patent Citations (3)
Title |
---|
严代彪: "军车多区域无线集群通信故障检测方法研究", 《计算机仿真》 * |
张毅: "多集群计算环境故障监控管理***", 《计算机工程与科学》 * |
梁佼: "高性能服务器故障诊断方法的研究与设计", 《万方数据》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CA2407342C (en) | Analysis of network performance | |
CN105165054B (en) | Network service failure processing method, service management system and system management module | |
CN101035037B (en) | Method, system and related device for detecting the network communication quality | |
CN105721318B (en) | The method and apparatus of network topology are found in a kind of software defined network SDN | |
CN105897507B (en) | The condition detection method and device of node device | |
CN106302017B (en) | The small capaciated flow network velocity-measuring system of high concurrent and method | |
CN111800354B (en) | Message processing method and device, message processing equipment and storage medium | |
CN109428785A (en) | A kind of fault detection method and device | |
CN106982244B (en) | Method and device for realizing message mirroring of dynamic flow under cloud network environment | |
CN109714190A (en) | A kind of load balancing based on application level and failure transfer system and its method | |
CN109104335A (en) | A kind of industrial control equipment network attack test method and system | |
CN106301987B (en) | Message loss detection method, device and system | |
CN106411629A (en) | Method used for monitoring state of CDN node and equipment thereof | |
CN103684818A (en) | Method and device for detecting failures of network channel | |
EP3035596A1 (en) | Link performance test method and device, logical processor and network processor | |
CN107426051B (en) | The monitoring method of the working condition of distributed cluster system interior joint, apparatus and system | |
CN107294767A (en) | A kind of Living Network transmission fault monitoring method and system | |
CN110674096A (en) | Node troubleshooting method, device and equipment and computer readable storage medium | |
CN111200544B (en) | Network port flow testing method and device | |
KR101640476B1 (en) | Test analysis system of network and analysis method thereof | |
CN101252477B (en) | Determining method and analyzing apparatus of network fault root | |
CN103995901B (en) | A kind of method for determining back end failure | |
CN105933153A (en) | Cluster failure monitoring method and device | |
CN114401258A (en) | Short message sending method, device, electronic device and storage medium | |
CN106506265B (en) | Detection fpga chip hangs dead method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20160907 |