WO2021128977A1 - Fault diagnosis method and device - Google Patents

Fault diagnosis method and device

Info

Publication number
WO2021128977A1
Authority
WO
WIPO (PCT)
Prior art keywords
fault
network
information
service
location
Prior art date
Application number
PCT/CN2020/116002
Other languages
English (en)
French (fr)
Inventor
徐海兵
郭久明
Original Assignee
迈普通信技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 迈普通信技术股份有限公司
Publication of WO2021128977A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis

Definitions

  • The present disclosure relates to the field of data communication technology, and in particular to a fault diagnosis method and device.
  • The embodiments of the present disclosure provide a fault diagnosis method and device to address the above technical problems.
  • The embodiments of the present disclosure provide a fault diagnosis method applied to a central server. The method includes: sending first detection information to a probe client, where the first detection information includes the address of a network service running on a service server; receiving service metric information sent by the probe client, where the service metric information is generated by the probe client after it probes the network service; and determining the location of the fault according to the service metric information and preset rules.
  • In the above method, a probe client is deployed in the network. After a fault occurs, the central server instructs the probe client to perform fault detection by sending the first detection information, and then locates the fault in the network according to the service metric information returned by the probe client. The network does not need to be checked segment by segment, so fault diagnosis can be completed quickly and the impact of the fault on network services is kept as small as possible.
  • The location where the fault occurs includes the service server, a network device, or a network link.
  • These three types of location basically cover where a network fault can occur, so the method provided in the present disclosure can diagnose network faults fairly comprehensively.
  • Determining the location of the fault according to the service metric information and preset rules includes: if the service metric information satisfies a first preset rule, determining that the location of the fault is the service server, and otherwise determining that the location of the fault is the network device or the network link; or, if the service metric information satisfies the first preset rule, determining that the location of the fault is the service server, and otherwise, if the service metric information satisfies a second preset rule, determining that the location of the fault is the network device or the network link.
  • The above implementation covers two ways of locating the fault.
  • The first is a simple dichotomy: if the service metric information satisfies the first preset rule, the fault is attributed to the service server; otherwise it is attributed to a network device or network link.
  • The second sets two conditions (preferably mutually exclusive): if the service metric information satisfies the first preset rule, the fault is attributed to the service server; if it satisfies the second preset rule, the fault is attributed to a network device or network link.
  • Which method to use can be decided according to actual needs. For a network device or network link fault, subsequent steps can further determine whether it is the network device or the network link that is faulty.
  • The service metric information includes the network delay between the probe client and the service server and the processing time of the service server for the network service.
  • The first preset rule is: the network delay is less than a first threshold and the processing time is greater than a second threshold. The second preset rule is: the network delay is greater than a third threshold.
  • If the network delay is short (less than the first threshold) and the processing time is long (greater than the second threshold), service processing is the problem, so the service server can be presumed faulty; if the network delay is long (greater than the third threshold), network transmission of the data is the problem, so a network device or network link can be presumed faulty.
  • These rules are simple to set and give high judgment accuracy.
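  • As an illustrative formalization of the two rules (the symbols are notation assumed here, not taken from the disclosure), let $d$ be the measured network delay, $t$ the processing time, and $T_1$, $T_2$, $T_3$ the first, second, and third thresholds:

```latex
\begin{aligned}
R_1 &:\quad d < T_1 \;\land\; t > T_2 \;\Longrightarrow\; \text{fault at the service server},\\
R_2 &:\quad d > T_3 \;\Longrightarrow\; \text{fault at a network device or network link}.
\end{aligned}
```

  • Under this notation, choosing $T_3 \ge T_1$ makes the two rules mutually exclusive, because $d < T_1 \le T_3$ and $d > T_3$ cannot both hold.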
  • The method further includes: if the location of the fault is the service server, collecting first fault information from the service server, and determining the cause of the service server's fault according to the first fault information and a third preset rule.
  • After the fault is located, the first fault information can be collected from the faulty service server and the cause analyzed, so that network administrators can learn the fault status in time and resolve it quickly.
  • The first, second, and third preset rules are stored in a knowledge base of the central server.
  • The knowledge base can be regarded as a collection of rules related to network faults, which makes it easy to manage them in one place.
  • "The knowledge base of the central server" refers broadly to any knowledge base the central server can access: it can be deployed locally on the central server, but may also be deployed on another device that the central server can access.
  • The representation of rules in the knowledge base is not limited; for example, knowledge representation methods such as production rules, frames, or semantic networks can be used.
  • Determining the location of the fault according to the service metric information and preset rules further includes: if the location of the fault is the network device or the network link, sending second detection information to the probe client, where the second detection information includes the address of the service server; receiving fault location information sent by the probe client, where the fault location information is generated by the probe client after it probes the network between itself and the service server, and includes the address of a suspected faulty network device and the address of that device's next hop; and, according to the fault location information, collecting second fault information from the suspected faulty network device and its next hop, and determining, according to the second fault information and a fourth preset rule, that the location of the fault is the suspected faulty network device, its next hop, or the network link between the two.
  • The probe client has at least two types of detection function: probing a service and probing the network.
  • The former has been described above; the latter is the one used in this implementation.
  • After probing the network, the probe client returns the fault location information to the central server.
  • After the central server collects the second fault information from the network devices indicated in the fault location information, it can locate the fault precisely (down to a particular network device or network link) by matching the second fault information against the fourth preset rule.
  • While using the second fault information to locate the fault, the central server can also analyze the cause of the fault.
  • The embodiments of the present disclosure provide a fault diagnosis method applied to a probe client.
  • The method includes: receiving first detection information sent by a central server, where the first detection information includes the address of a network service running on a service server; probing the network service according to the first detection information to obtain service metric information; and sending the service metric information to the central server.
  • The method further includes: receiving second detection information sent by the central server, where the second detection information includes the address of the service server; probing the network between the probe client and the service server according to the second detection information to obtain fault location information, where the fault location information includes the address of a suspected faulty network device and the address of that device's next hop; and sending the fault location information to the central server.
  • The probe client is deployed on a network device close to the user side of the network.
  • The probe client can be deployed anywhere in the network, but in most cases a network fault is perceived directly by the user (for example, a user visits a website and finds it slow or completely inaccessible). Deploying the probe client on a network device close to the user side therefore better simulates a user terminal's access to the service server, and the information it obtains is closer to what the user actually experiences, which helps fault location and fault cause analysis.
  • The probe client can be deployed on an edge network device or an aggregation network device.
  • An embodiment of the present disclosure provides a fault diagnosis device configured in a central server. The device includes: a first information sending module, configured to send first detection information to a probe client, where the first detection information includes the address of a network service running on a service server; a first information receiving module, configured to receive service metric information sent by the probe client, where the service metric information is generated by the probe client after it probes the network service; and a fault diagnosis module, configured to determine the location of the fault according to the service metric information and preset rules.
  • An embodiment of the present disclosure provides a fault diagnosis device configured in a probe client.
  • The device includes: a second information receiving module, configured to receive first detection information sent by a central server, where the first detection information includes the address of a network service running on a service server; a detection module, configured to probe the network service according to the first detection information to obtain service metric information; and a second information sending module, configured to send the service metric information to the central server.
  • An embodiment of the present disclosure provides an electronic device including a memory and a processor.
  • The memory stores computer program instructions.
  • When the computer program instructions are read and run by the processor, the method provided by the first aspect, the second aspect, or any possible implementation of the two aspects is executed.
  • An embodiment of the present disclosure provides a computer-readable storage medium storing computer program instructions; when the computer program instructions are read and run by a processor, the method provided by the first aspect, the second aspect, or any possible implementation of the two aspects is executed.
  • Fig. 1 shows a topology diagram of a network to which the fault diagnosis method provided by an embodiment of the present disclosure can be applied;
  • Fig. 2 shows a flowchart of a fault diagnosis method provided by an embodiment of the present disclosure;
  • Fig. 3 shows a functional module diagram of a fault diagnosis device provided by an embodiment of the present disclosure;
  • Fig. 4 shows a functional module diagram of another fault diagnosis device provided by an embodiment of the present disclosure;
  • Fig. 5 shows a structural diagram of an electronic device provided by an embodiment of the present disclosure.
  • Typically, a network administrator locates a network fault and analyzes its cause by troubleshooting the network segment by segment.
  • Through long-term research, the inventor found that although this approach can locate the fault point after many attempts, the troubleshooting process is so inefficient that network services affected by the fault may remain unrecovered for a long time.
  • Fig. 1 shows a topology diagram of a network to which the fault diagnosis method provided by an embodiment of the present disclosure can be applied.
  • The network includes several entities involved in the method of the present disclosure: a central server 110, a probe client 120, network devices 130 (two are shown in Fig. 1, namely network device A and network device B), a network link 140, and a service server 150.
  • The connecting lines with arrows indicate possible data interactions between these entities. The number of entities and the topology among them are not limited to what is shown in Fig. 1, which is only a simple example.
  • The probe client 120 performs detection according to the instructions of the central server 110 and returns the detection results to the central server 110, assisting it in completing fault diagnosis.
  • The service server 150 runs network services, such as web services. A user can use a terminal device to access a network service on the service server 150, for example to browse a web page. When the user terminal accesses the network service, its messages may pass through the network devices 130 and network links 140 in the network.
  • The network device 130 here may be a router or a switch.
  • The central server 110 and the probe client 120 can each be deployed independently, or deployed on one of the network devices 130.
  • Although the probe client 120 can in theory be deployed anywhere in the network, in most cases a network fault is perceived directly by the user (for example, the user visits a website and finds it slow or completely inaccessible). If the probe client 120 is deployed on a network device 130 close to the user side, the probe client 120 and the user terminal can be considered to be in the same, or essentially the same, network environment. The detection behavior of the probe client 120 then better simulates the user terminal's actual access to the service server 150, and the information obtained is closer to what the user actually experiences, which helps fault location and fault cause analysis.
  • The probe client 120 can be deployed on an edge network device (located at the access layer) or an aggregation network device (located at the aggregation layer).
  • The probe client 120 can also be deployed independently, for example on a standalone server that is connected to the same network device as the user terminal.
  • The timing of deploying the probe client 120 is not limited: it can be deployed in advance and used only when fault diagnosis is required, or it can be deployed for fault diagnosis after a fault is found.
  • Fig. 2 shows a flowchart of a fault diagnosis method provided by an embodiment of the present disclosure. Referring to Fig. 2, the method includes:
  • Step S210: The central server sends the first detection information to the probe client.
  • Step S210 can start after symptoms of a network fault are discovered (for example, the user finds that the network service is unavailable or responds very slowly).
  • The first detection information instructs the probe client to probe the network service.
  • The first detection information includes at least the address of the network service running on the service server, and may also include content such as the detection frequency and detection mode.
  • The address of the network service can be a URL, such as one starting with http or https (corresponding to an HTTP service and an HTTPS service respectively); it can also be an address for another protocol such as SFTP or RSTP.
  • The detection frequency refers to the time interval between successive detections by the probe client.
  • The detection mode refers to how the probe client performs detection, for example permanent continuous detection, continuous detection for a period of time, or a single detection.
  • The content of the first detection information can be determined according to the user's diagnosis requirements; default values can also be used.
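  • A minimal sketch of how the first detection information could be modeled as a message type. The class name `FirstDetectionInfo`, the field names, the `DetectionMode` values, and the defaults are all assumptions for illustration; the disclosure only states that the address is mandatory and that frequency and mode are optional.

```python
from dataclasses import dataclass
from enum import Enum


class DetectionMode(Enum):
    """Hypothetical names for the three detection modes mentioned in the text."""
    CONTINUOUS = "continuous"  # permanent continuous detection
    TIMED = "timed"            # continuous detection for a period of time
    SINGLE = "single"          # a single detection


@dataclass(frozen=True)
class FirstDetectionInfo:
    service_address: str                        # mandatory: address of the network service
    interval_seconds: float = 5.0               # detection frequency (assumed default)
    mode: DetectionMode = DetectionMode.SINGLE  # detection mode (assumed default)

    def __post_init__(self) -> None:
        # Reject messages that lack the one field the text makes mandatory.
        if not self.service_address:
            raise ValueError("service_address is required")
        if self.interval_seconds <= 0:
            raise ValueError("interval_seconds must be positive")


# Only the address is supplied; frequency and mode fall back to defaults.
info = FirstDetectionInfo("https://service.example/")
```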
  • Step S220: The probe client probes the network service according to the first detection information to obtain service metric information.
  • Service metric information characterizes the quality of the network service as experienced by users. For example, it can include the network delay between the probe client and the service server (for example, TCP connection establishment time or SSL handshake time), the transmission delay between the probe client and the service server for the probed service (for example, page transmission time), or the processing time of the service server for the probed service (the time the service server takes to process a service request).
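  • As a rough sketch, one of these metrics, the TCP connection establishment time, could be measured with Python's standard `socket` module as follows. The function name `tcp_connect_time` and the timeout default are assumptions; the disclosure does not specify how the probe client takes the measurement.

```python
import socket
import time


def tcp_connect_time(host: str, port: int, timeout: float = 3.0) -> float:
    """Return the seconds taken to establish a TCP connection to (host, port).

    Raises OSError if the connection cannot be established within `timeout`.
    """
    start = time.perf_counter()
    # create_connection returns once the TCP handshake has completed.
    with socket.create_connection((host, port), timeout=timeout):
        elapsed = time.perf_counter() - start
    return elapsed
```

  • A probe client could call this with the host and port taken from the service address in the first detection information and report the result as the network delay.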
  • Step S230: The probe client sends the service metric information to the central server.
  • Step S240: The central server determines the location of the fault according to the service metric information and preset rules.
  • After receiving the service metric information, the central server can determine the location of the fault in the network from the content of the service metric information and the preset rules.
  • The location of the fault includes at least three possibilities: a service server, a network device, or a network link. These three basically cover where a network fault can occur, so the method provided in the present disclosure can locate network faults comprehensively.
  • The preset rules may include the first, second, third, and fourth preset rules mentioned later.
  • These preset rules related to network faults are stored in the knowledge base of the central server.
  • The knowledge base can be regarded as a collection of a large number of rules, which makes it easy to manage them in one place.
  • "The knowledge base of the central server" refers broadly to any knowledge base the central server can access: it can be deployed locally on the central server, but may also be deployed on another device that the central server can access.
  • The representation of rules in the knowledge base is not limited; for example, knowledge representation methods such as production rules, frames, or semantic networks can be used.
  • All the rules can be stored in one knowledge base, or split across multiple knowledge bases.
  • For example, the rules belonging to the third preset rule can form an independent knowledge base.
  • The preset rules can also be stored in a form other than a knowledge base.
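  • To illustrate the production-rule representation, the sketch below stores rules as plain data (IF conditions, THEN conclusion) so that they could live in a file or database and be managed in one place. The rule layout, the field names `delay` and `processing`, and the threshold values are assumptions made for this example only.

```python
import operator
from typing import Dict, List, Optional

# Comparison operators a rule condition may use.
OPS = {"<": operator.lt, ">": operator.gt}

# Each production reads: IF every (field, op, value) condition holds THEN conclusion.
# Threshold values (seconds) are illustrative, not taken from the disclosure.
RULES: List[dict] = [
    {"name": "first preset rule",
     "if": [("delay", "<", 0.1), ("processing", ">", 1.0)],
     "then": "service server"},
    {"name": "second preset rule",
     "if": [("delay", ">", 0.1)],
     "then": "network device or network link"},
]


def first_match(rules: List[dict], facts: Dict[str, float]) -> Optional[str]:
    """Return the conclusion of the first rule whose conditions all hold, else None."""
    for rule in rules:
        if all(OPS[op](facts[field], value) for field, op, value in rule["if"]):
            return rule["then"]
    return None
```

  • Because the rules are data rather than code, adding or adjusting a rule does not require changing the matching function.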
  • Step S240 can be understood as a general summary of fault location; its specific implementation may be more complicated. For example, steps S241a to S246 below show one possible implementation of step S240.
  • Step S240 is executed by the central server. Before using the service metric information for fault diagnosis, the central server may also need to preprocess it.
  • Preprocessing may include operations such as decryption, decoding, format conversion, or redundancy removal (redundant information being information irrelevant to fault diagnosis).
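  • A hedged sketch of such preprocessing, assuming (purely for illustration) that the probe client reports a Base64-encoded JSON object with delays in milliseconds. The wire format, the field names in `RELEVANT_KEYS`, and the function name `preprocess` are hypothetical; decryption is omitted.

```python
import base64
import json

# Assumed names of the fields that matter for diagnosis.
RELEVANT_KEYS = ("service_address", "network_delay", "processing_time")


def preprocess(raw: bytes) -> dict:
    """Decode a Base64-wrapped JSON report and keep only diagnosis-relevant fields."""
    # Decoding: Base64 -> UTF-8 text -> JSON object.
    decoded = json.loads(base64.b64decode(raw).decode("utf-8"))
    # Redundancy removal: drop fields irrelevant to fault diagnosis.
    cleaned = {k: decoded[k] for k in RELEVANT_KEYS if k in decoded}
    # Format conversion: normalise millisecond values to seconds.
    for key in ("network_delay", "processing_time"):
        if key in cleaned:
            cleaned[key] = cleaned[key] / 1000.0
    return cleaned
```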
  • Step S241a: The central server determines that the location of the fault is the service server.
  • Step S241b: The central server determines that the location of the fault is the network device or the network link.
  • As noted above, the location of the fault includes at least three possibilities: the service server, a network device, or a network link.
  • In step S241a the fault is located at the service server; in step S241b it is located at a network device or network link, but whether it is the network device or the network link needs to be determined in subsequent steps.
  • The above two steps can be implemented in at least the following two ways:
  • The first method is a simple dichotomy.
  • The only condition for judging the location of the fault is the first preset rule.
  • The first preset rule may be: the network delay between the probe client and the service server is less than the first threshold, and the processing time of the service server for the probed service is greater than the second threshold.
  • The logic of this rule is: if the network delay is short (less than the first threshold) and the processing time is long (greater than the second threshold), service processing is the problem, so the service server can be presumed faulty; otherwise the fault was not caused by service processing and should lie in a network device or network link.
  • The network delay may be the TCP connection establishment time between the probe client and the service server, or the SSL handshake time.
  • The two times can also be used together. For example, if the TCP connection establishment time is less than a preset value, the SSL handshake time is also less than a preset value, and the processing time is greater than the second threshold, the service server is considered faulty.
  • The second method uses two conditions when judging the location of the fault: the first preset rule and the second preset rule. The two conditions are best set to be mutually exclusive, so that the locations they yield cannot conflict.
  • The first preset rule may be: the network delay between the probe client and the service server is less than the first threshold, and the processing time of the service server for the probed service is greater than the second threshold. The second preset rule may be: the network delay between the probe client and the service server is greater than the third threshold.
  • The logic of these two rules is: if the network delay is short (less than the first threshold) and the processing time is long (greater than the second threshold), service processing is the problem, so the service server can be presumed faulty; otherwise, if the network delay is long (greater than the third threshold), network transmission of the data is the problem, so it can be inferred that a network device or network link is faulty.
  • The third threshold in the second preset rule may take a value not less than the first threshold, which keeps the two rules mutually exclusive.
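  • The two methods can be sketched side by side as follows. The function names, the `Location` enum, and the parameter names `t1`, `t2`, `t3` (the first, second, and third thresholds) are assumptions for illustration; delays and times are taken to be in seconds.

```python
from enum import Enum
from typing import Optional


class Location(Enum):
    SERVICE_SERVER = "service server"
    NETWORK = "network device or network link"


def first_rule(delay: float, processing: float, t1: float, t2: float) -> bool:
    # First preset rule: short network delay AND long processing time.
    return delay < t1 and processing > t2


def second_rule(delay: float, t3: float) -> bool:
    # Second preset rule: long network delay.
    return delay > t3


def locate_dichotomy(delay: float, processing: float,
                     t1: float, t2: float) -> Location:
    """Method 1: anything that fails the first rule is attributed to the network."""
    if first_rule(delay, processing, t1, t2):
        return Location.SERVICE_SERVER
    return Location.NETWORK


def locate_two_rules(delay: float, processing: float,
                     t1: float, t2: float, t3: float) -> Optional[Location]:
    """Method 2: each location needs its own rule; None means neither rule matched."""
    if first_rule(delay, processing, t1, t2):
        return Location.SERVICE_SERVER
    if second_rule(delay, t3):
        return Location.NETWORK
    return None
```

  • The difference shows up when both delay and processing time are low: the dichotomy still attributes the fault to the network, while the two-rule method reports no match.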
  • The rules are relatively simple to set, and a service server fault can be located quickly and accurately.
  • Fault location for a network device or network link can be performed in the subsequent steps.
  • It is not ruled out that, in certain application scenarios, only whether the service server is faulty needs to be determined and faults at other locations are of no concern; in that case there is no need to locate network device or network link faults.
  • The fault of the service server is marked as X1.
  • Step S242: The central server collects the first fault information from the service server, and determines the cause of the service server's fault according to the first fault information and a third preset rule.
  • The central server may further analyze the cause of the service server's fault.
  • Strictly speaking, step S242, which analyzes the cause of the fault, is not part of step S240, which locates the fault; it is described here together for simplicity.
  • The central server may send a request to the faulty service server, instructing it to collect the first fault information and return it to the central server.
  • The first fault information may include, but is not limited to, processor information, memory information, log information, network interface traffic information, or process information of the service server.
  • For example, one rule in the third preset rule is: if processor occupancy stays at a high level for a long time, the cause of the service server's fault is confirmed to be a server performance bottleneck. If the processor information in the first fault information received by the central server matches this rule, the central server can confirm that the cause of the fault is a performance bottleneck of the service server. Once the cause has been analyzed, network administrators can learn the fault status in time and take reasonable countermeasures to eliminate the fault quickly.
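  • One way to make "high for a long time" concrete is to require a run of consecutive processor-usage samples above a threshold. The sketch below does that; the 90% threshold, the run length of five samples, the `cpu_percent` field name, and both function names are assumptions, since the disclosure does not quantify the rule.

```python
from typing import Optional, Sequence


def sustained_high_cpu(samples: Sequence[float],
                       threshold: float = 90.0,
                       min_consecutive: int = 5) -> bool:
    """True if at least `min_consecutive` samples in a row are >= `threshold` percent."""
    run = 0
    for usage in samples:
        # Extend the current run on a high sample, reset it otherwise.
        run = run + 1 if usage >= threshold else 0
        if run >= min_consecutive:
            return True
    return False


def diagnose_server(first_fault_info: dict) -> Optional[str]:
    """Match the processor information against the (assumed) bottleneck rule."""
    if sustained_high_cpu(first_fault_info.get("cpu_percent", [])):
        return "server performance bottleneck"
    return None  # no rule matched
```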
  • Step S243: The central server sends the second detection information to the probe client.
  • When the fault has been attributed to a network device or network link, the central server may send the second detection information to the probe client and perform the subsequent steps to locate the network fault accurately.
  • The second detection information indicates how the probe client should probe the network status.
  • The second detection information includes the address of the service server, and may also include content such as the detection frequency or detection mode.
  • The address of the service server can be an IP address. As mentioned in the description of step S210, the central server can send the service URL to the probe client, and the probe client first uses DNS resolution to obtain the IP address of the service server before probing the service. When the probe client returns the service metric information to the central server, it can return the IP address as well, so that the central server can use it in step S243. It is also possible for the central server itself to use DNS resolution to obtain the IP address of the service server. The detection frequency and detection mode have been explained above and are not repeated here.
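  • A minimal sketch of obtaining the service server's IP address from a service URL with Python's standard library. The function name `resolve_service_ip` is hypothetical, and restricting the lookup to IPv4 is a simplifying assumption.

```python
import socket
from urllib.parse import urlsplit


def resolve_service_ip(service_url: str) -> str:
    """Return an IPv4 address for the host named in `service_url`."""
    host = urlsplit(service_url).hostname
    if host is None:
        raise ValueError(f"no host in {service_url!r}")
    # getaddrinfo performs the DNS lookup; an IP literal is returned unchanged.
    infos = socket.getaddrinfo(host, None,
                               family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for AF_INET, sockaddr is (address, port).
    return infos[0][4][0]
```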
  • Step S244: The probe client probes the network between itself and the service server according to the second detection information to obtain fault location information.
  • After receiving the second detection information, the probe client performs network detection according to the service server address, detection frequency, or detection mode specified in it, and obtains fault location information.
  • The fault location information describes the approximate location of the fault (not the final location).
  • The fault location information may include the address of the suspected faulty network device and the address of that device's next hop (omitted if there is no next hop); that is, the fault may lie in the suspected faulty network device, in its next hop, or in the network link between the two.
  • A suspected faulty network device is a device that exhibits certain fault characteristics. Such characteristics do not necessarily mean the device itself is faulty; they may be caused by the network environment around the device. The fault location information therefore also includes the address of the next-hop network device, which helps locate the true source of the fault.
  • To probe the network between itself and the service server, the probe client can call existing tools such as traceroute or ping. If network device A is detected as suspected faulty, the fault location information sent to the central server contains both the IP address of network device A and the IP address of its next hop, network device B.
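  • The disclosure does not say how a device comes to be "suspected"; the sketch below assumes one simple heuristic for illustration: given per-hop round-trip times along the path (such as those a traceroute-style probe reports), the first hop that timed out or exceeded a delay limit is the suspect, and the following hop on the path is its next hop. It also assumes the hop addresses are already known, for example from a path recorded earlier. All names and the 100 ms limit are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class FaultLocationInfo:
    suspect_address: str
    next_hop_address: Optional[str]  # None when the suspect is the last hop


def find_suspect(hops: List[Tuple[str, Optional[float]]],
                 max_rtt_ms: float = 100.0) -> Optional[FaultLocationInfo]:
    """Scan ordered (address, rtt_ms) hops; rtt_ms is None when the hop timed out."""
    for i, (address, rtt_ms) in enumerate(hops):
        if rtt_ms is None or rtt_ms > max_rtt_ms:
            # The next hop is simply the following entry on the path, if any.
            next_hop = hops[i + 1][0] if i + 1 < len(hops) else None
            return FaultLocationInfo(address, next_hop)
    return None  # every hop answered within the limit
```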
  • Comparing step S244 with step S220 shows that the probe client includes at least two types of detection function: probing a service (step S220) and probing the network (step S244).
  • Step S245: The probe client sends the fault location information to the central server.
  • Step S246: The central server collects second fault information from the suspected faulty network device and its next hop according to the fault location information, and determines the location of the fault according to the second fault information and a fourth preset rule.
  • The central server may also need to preprocess the fault location information first.
  • The possible preprocessing methods were introduced under step S240 and are not repeated here.
  • The central server may send a request to each of the suspected faulty network device and its next hop, instructing the two devices to collect the second fault information and return it to the central server.
  • The second fault information may include, but is not limited to, routing table information, device configuration information, or operating system information of the network device. The two network devices do not necessarily return the same type of information. For example, network device A may return routing table information and device configuration information while network device B returns operating system information. In short, the content of the returned second fault information can be combined as needed.
  • After obtaining the second fault information, the central server can match it against the fourth preset rule. If a rule in the fourth preset rule is matched, the corresponding fault location is obtained.
  • The fourth preset rule can also specify operations for determining the location of the fault, which are executed during rule matching.
  • The possible fault locations are as mentioned above: the suspected faulty network device, its next hop, or the network link between the two.
  • one of the rules in the fourth preset rule is: query the routing table of the network device with the suspected failure, determine whether the destination route from the network device to the service server exists, and if the destination route does not exist, confirm the network device with the suspected failure It is the location of the fault; if the destination route exists, a one-way loopback test will be performed between the network device with the suspected fault and its next hop. If the detection result is a failure, the network device with the suspected fault and its next hop will be confirmed The network link between is where the failure occurs.
  • the central server After the central server receives the second fault information, it can query the routing table of the suspected faulty network device, and then match the query result with the above rules. If it matches the rule that the destination route does not exist, confirm the suspected faulty network device It is the location of the fault, and at the same time, it can be confirmed that the cause of the fault is the missing routing table entry; if it matches the rule that the destination route exists, a single loopback test is performed, and then the test result is further matched with the above rule. If the test result is matched The rule of failure indicates that the network link between the detection source (suspected failure network device) and the detection destination (next hop device) is unavailable, thereby confirming the network link between the suspected failure network device and its next hop The path is the location where the fault occurs. Of course, the cause of the fault is the link failure.
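The example rule above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patent's implementation: the routing table is assumed to arrive as a list of CIDR prefixes, the loopback test is passed in as a callable, and the names `has_route`, `locate_network_fault` and `FaultLocation` are hypothetical. The text does not say what to conclude when the loopback test passes, so the sketch reports that case as undetermined.

```python
import ipaddress
from enum import Enum
from typing import Callable, Iterable, Tuple


class FaultLocation(Enum):
    SUSPECT_DEVICE = "suspected faulty network device"
    LINK_TO_NEXT_HOP = "link between the suspected device and its next hop"
    UNDETERMINED = "undetermined"


def has_route(prefixes: Iterable[str], server_ip: str) -> bool:
    """Return True if any routing-table prefix covers the service server address."""
    target = ipaddress.ip_address(server_ip)
    return any(target in ipaddress.ip_network(prefix) for prefix in prefixes)


def locate_network_fault(
    prefixes: Iterable[str],
    server_ip: str,
    loopback_test: Callable[[], bool],
) -> Tuple[FaultLocation, str]:
    """Routing-table lookup first; loopback test only if a route exists."""
    if not has_route(prefixes, server_ip):
        return FaultLocation.SUSPECT_DEVICE, "missing routing table entry"
    if not loopback_test():
        return FaultLocation.LINK_TO_NEXT_HOP, "link failure"
    # The example rule says nothing about a passing test, so no location is claimed.
    return FaultLocation.UNDETERMINED, "no rule matched"
```

Passing the loopback test as a callable keeps the expensive operation lazy: it runs only when the routing-table check has already found a route, which mirrors the order of the rule.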
  • as the above shows, because the second fault information may contain some description of the cause of the fault, the central server may determine the cause while using the second fault information to locate the fault; unlike the service server case, a separate cause analysis is not needed.
  • of course, the cause obtained above may only be preliminary.
  • for example, for a missing routing table entry, the central server can further analyze, based on the second fault information, what caused the entry to be missing. This analysis can also use rule matching and is not described in detail.
  • in summary, the fault diagnosis method provided by the embodiments of the present disclosure deploys a probe client in the network. After a fault occurs, the central server instructs the probe client to perform fault detection by sending the first detection information, and then performs subsequent operations according to the service metric information returned by the probe client. The location of the fault in the network can thus be determined in one pass, without troubleshooting the network segment by segment, so that fault diagnosis completes quickly and the impact of the fault on network services is kept as small as possible.
  • in some implementations of the method, the central server can further determine the cause of the fault through analysis, which helps clear the fault as soon as possible.
  • FIG. 3 shows a functional module diagram of a fault diagnosis device 300 provided by an embodiment of the present disclosure.
  • the device is configured on the central server and includes:
  • the first information sending module 310 is configured to send first detection information to the probe client, where the first detection information includes the address of the network service running on the service server;
  • the first information receiving module 320 is configured to receive service metric information sent by the probe client, where the service metric information is generated by the probe client after detecting the network service;
  • the fault diagnosis module 330 is configured to determine the location of the fault according to the service metric information and preset rules.
  • the location of the fault includes: the service server, a network device, or a network link.
  • the fault diagnosis module 330 determining the location of the fault according to the service metric information and preset rules includes: if the service metric information satisfies the first preset rule, determining that the location of the fault is the service server, and otherwise determining that the location of the fault is the network device or the network link; or, if the service metric information satisfies the first preset rule, determining that the location of the fault is the service server, and otherwise, if the service metric information satisfies the second preset rule, determining that the location of the fault is the network device or the network link.
  • the service metric information includes the network delay between the probe client and the service server and the processing time of the service server for the network service;
  • the first preset rule is: the network delay is less than a first threshold and the processing time is greater than a second threshold; the second preset rule is: the network delay is greater than a third threshold.
  • the fault diagnosis module 330 is further configured to: if the location of the fault is the service server, collect first fault information from the service server, and determine the cause of the service server fault according to the first fault information and the third preset rule.
  • the first preset rule, the second preset rule, and the third preset rule are stored in the knowledge base of the central server.
  • the fault diagnosis module 330 determining the location of the fault according to the service metric information and preset rules further includes: if the location of the fault is the network device or the network link, sending second detection information to the probe client, the second detection information including the address of the service server; receiving the fault location information sent by the probe client, the fault location information being generated by the probe client after detecting the network between itself and the service server, and including the address of the suspected faulty network device and the address of the next hop of that network device; and collecting second fault information from the suspected faulty network device and its next hop according to the fault location information, and determining, based on the second fault information and a fourth preset rule, that the location of the fault is the suspected faulty network device, the next hop of that network device, or the network link between the two.
  • FIG. 4 shows a functional module diagram of a fault diagnosis device 400 provided by an embodiment of the present disclosure.
  • the device is configured on the probe client and includes:
  • the second information receiving module 410 is configured to receive the first detection information sent by the central server, where the first detection information includes the address of the network service running on the service server;
  • the detection module 420 is configured to detect the network service according to the first detection information to obtain service metric information;
  • the second information sending module 430 is configured to send the service metric information to the central server.
  • the second information receiving module 410 is further configured to: receive second detection information sent by the central server, where the second detection information includes the address of the service server;
  • the detection module 420 is further configured to: detect the network between the probe client and the service server according to the second detection information to obtain fault location information, where the fault location information includes the address of a suspected faulty network device and the address of the next hop of that network device;
  • the second information sending module 430 is further configured to send the fault location information to the central server.
  • the probe client is deployed on a network device close to the user side in the network.
  • FIG. 5 shows a possible structure of an electronic device 500 provided by an embodiment of the present disclosure.
  • the electronic device 500 includes a processor 510, a memory 520, and a communication interface 530. These components are interconnected and communicate with each other through a communication bus 540 and/or other forms of connection mechanisms (not shown).
  • the memory 520 stores computer program instructions, and when these computer program instructions are read by the processor 510 and run, they execute the fault diagnosis method provided by the embodiments of the present disclosure and other desired functions.
  • the communication interface 530 is used for the electronic device 500 to communicate with other devices.
  • it can be understood that the structure shown in FIG. 5 is only illustrative; the electronic device 500 may include more or fewer components than those shown in FIG. 5, or have a configuration different from that shown in FIG. 5.
  • the components shown in FIG. 5 can be implemented by hardware, software, or a combination thereof.
  • both the central server 110 in FIG. 1 and the device on which the probe client 120 is deployed can be implemented by the electronic device 500.
  • the embodiments of the present disclosure also provide a computer-readable storage medium storing computer program instructions; when the computer program instructions are read and run by a processor, the steps of the fault diagnosis method provided by the embodiments of the present disclosure are executed.
  • the computer-readable storage medium may be, but is not limited to, the memory 520 of the electronic device 500 in FIG. 5.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

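The present disclosure relates to the field of data communication technology and provides a fault diagnosis method and device. The fault diagnosis method includes: a central server sends first detection information to a probe client, the first detection information including the address of a network service running on a service server; the probe client detects the network service according to the first detection information and obtains service metric information; the probe client sends the service metric information to the central server; and the central server determines the location of the fault according to the service metric information and preset rules. The method can locate the fault in the network in one pass, without troubleshooting the network segment by segment, so that fault diagnosis completes quickly and the impact of the fault on network services is kept as small as possible.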

Description

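Fault diagnosis method and device
Cross-reference to related applications
The present disclosure claims priority to Chinese patent application No. CN201911346437.X, entitled "Fault diagnosis method and device" and filed with the Chinese Patent Office on December 24, 2019, the entire contents of which are incorporated into the present disclosure by reference.
Technical field
The present disclosure relates to the field of data communication technology, and in particular to a fault diagnosis method and device.
Background
With the sharp increase in network devices and the explosive growth of network services, network faults have become routine. Once a network fault occurs, the consequences range from an abnormal node or link to a complete outage of network services, so locating the fault promptly and taking appropriate measures is especially important. In existing methods, network administrators usually troubleshoot the network segment by segment, which is inefficient and leaves network services seriously affected.
Summary
In view of this, embodiments of the present disclosure provide a fault diagnosis method and device to address the above technical problem.
To achieve the above objective, the present disclosure provides the following technical solutions: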
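In a first aspect, an embodiment of the present disclosure provides a fault diagnosis method applied to a central server, the method including: sending first detection information to a probe client, the first detection information including the address of a network service running on a service server; receiving service metric information sent by the probe client, the service metric information being generated by the probe client after detecting the network service; and determining the location of the fault according to the service metric information and preset rules.
The above method deploys a probe client in the network. After a fault occurs, the central server instructs the probe client to perform fault detection by sending the first detection information, and then performs subsequent operations according to the service metric information returned by the probe client. The location of the fault in the network can thus be determined in one pass, without troubleshooting the network segment by segment, so that fault diagnosis completes quickly and the impact of the fault on network services is kept as small as possible.
In one implementation of the first aspect, the location of the fault includes: the service server, a network device, or a network link.
These three locations largely cover where network faults can occur, so the method provided by the present disclosure can diagnose network faults fairly comprehensively.
In one implementation of the first aspect, determining the location of the fault according to the service metric information and preset rules includes: if the service metric information satisfies a first preset rule, determining that the location of the fault is the service server, and otherwise determining that the location of the fault is the network device or the network link; or, if the service metric information satisfies the first preset rule, determining that the location of the fault is the service server, and otherwise, if the service metric information satisfies a second preset rule, determining that the location of the fault is the network device or the network link.
This implementation covers two ways of locating the fault. The first is a simple dichotomy: if the service metric information satisfies the first preset rule, the service server is considered faulty; otherwise a network device or network link is considered faulty. The second sets two conditions (preferably mutually exclusive): if the service metric information satisfies the first preset rule, the service server is considered faulty; if it satisfies the second preset rule, a network device or network link is considered faulty. Which way to use can be decided according to actual needs. For a network device or network link fault, subsequent steps can further determine whether it is the device or the link that has failed.
In one implementation of the first aspect, the service metric information includes the network delay between the probe client and the service server and the processing time of the service server for the network service; the first preset rule is: the network delay is less than a first threshold and the processing time is greater than a second threshold; the second preset rule is: the network delay is greater than a third threshold.
If the network delay is short (less than the first threshold) while the processing time is long (greater than the second threshold), service processing has a problem, so a service server fault can be inferred. If the network delay is long (greater than the third threshold), network transmission of the data has a problem, so a network device or network link fault can be inferred. These rules are simple to set up and give accurate judgments.
In one implementation of the first aspect, the method further includes: if the location of the fault is the service server, collecting first fault information from the service server, and determining the cause of the service server fault according to the first fault information and a third preset rule.
After the fault is located at the service server, first fault information can be collected from the faulty service server and the cause analyzed, so that network administrators understand the fault promptly and can resolve it quickly.
In one implementation of the first aspect, the first preset rule, the second preset rule, and the third preset rule are stored in a knowledge base of the central server.
The knowledge base can be regarded as a collection of rules related to network faults, which makes it easy to manage these rules in one place. The knowledge base of the central server refers broadly to any knowledge base the central server can access: it may be deployed locally on the central server, but deploying it on another device that the central server can access is not excluded. The representation of rules in the knowledge base is not limited; for example, knowledge representation methods such as production rules, frames, or semantic networks can be used.
In one implementation of the first aspect, determining the location of the fault according to the service metric information and preset rules further includes: if the location of the fault is the network device or the network link, sending second detection information to the probe client, the second detection information including the address of the service server; receiving fault location information sent by the probe client, the fault location information being generated by the probe client after detecting the network between itself and the service server, and including the address of a suspected faulty network device and the address of the next hop of that network device; and collecting second fault information from the suspected faulty network device and its next hop according to the fault location information, and determining, according to the second fault information and a fourth preset rule, that the location of the fault is the suspected faulty network device, the next hop of that network device, or the network link between the two.
If the preceding steps determine that the fault lies in a network device or network link, it can be further determined which network device or which link segment has failed. The probe client can still be used for this precise location. That is, the probe client provides at least two types of detection functions: detecting the service and detecting the network. The former was mentioned above; the latter is the function used in this implementation.
After detecting the network, the probe client returns fault location information to the central server. Once the central server has collected second fault information from the network devices indicated in the fault location information, it can locate the fault precisely (to a particular network device or link segment) by matching the second fault information against the fourth preset rule. In addition, because the second fault information may also describe the cause of the fault, the central server may determine the cause at the same time as it locates the fault.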
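In a second aspect, an embodiment of the present disclosure provides a fault diagnosis method applied to a probe client, the method including: receiving first detection information sent by a central server, the first detection information including the address of a network service running on a service server; detecting the network service according to the first detection information to obtain service metric information; and sending the service metric information to the central server.
In one implementation of the second aspect, the method further includes: receiving second detection information sent by the central server, the second detection information including the address of the service server; detecting the network between the probe client and the service server according to the second detection information to obtain fault location information, the fault location information including the address of a suspected faulty network device and the address of the next hop of that network device; and sending the fault location information to the central server.
In one implementation of the second aspect, the probe client is deployed on a network device close to the user side in the network.
In theory the probe client can be deployed anywhere in the network. In most cases, however, network faults are perceived directly by users (for example, a user finds a website very slow or completely unreachable). Deploying the probe client on a network device close to the user side therefore better simulates a user terminal's access to the service server, and the information it obtains is more useful for fault location and cause analysis. For example, the probe client can be deployed on an edge network device or an aggregation network device.
In a third aspect, an embodiment of the present disclosure provides a fault diagnosis device configured on a central server, the device including: a first information sending module, configured to send first detection information to a probe client, the first detection information including the address of a network service running on a service server; a first information receiving module, configured to receive service metric information sent by the probe client, the service metric information being generated by the probe client after detecting the network service; and a fault diagnosis module, configured to determine the location of the fault according to the service metric information and preset rules.
In a fourth aspect, an embodiment of the present disclosure provides a fault diagnosis device configured on a probe client, the device including: a second information receiving module, configured to receive first detection information sent by a central server, the first detection information including the address of a network service running on a service server; a detection module, configured to detect the network service according to the first detection information to obtain service metric information; and a second information sending module, configured to send the service metric information to the central server.
In a fifth aspect, an embodiment of the present disclosure provides an electronic device including a memory and a processor, the memory storing computer program instructions which, when read and run by the processor, execute the method provided by the first aspect, the second aspect, or any possible implementation of either aspect.
In a sixth aspect, an embodiment of the present disclosure provides a computer-readable storage medium storing computer program instructions which, when read and run by a processor, execute the method provided by the first aspect, the second aspect, or any possible implementation of either aspect.
To make the above objectives, technical solutions, and beneficial effects of the present disclosure clearer, embodiments are described in detail below with reference to the accompanying drawings.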
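Brief description of the drawings
To explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings needed in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present disclosure and should not be regarded as limiting its scope; those of ordinary skill in the art can obtain other related drawings from them without creative effort.
FIG. 1 shows the topology of a network to which the fault diagnosis method provided by the embodiments of the present disclosure can be applied;
FIG. 2 shows a flowchart of a fault diagnosis method provided by an embodiment of the present disclosure;
FIG. 3 shows a functional module diagram of a fault diagnosis device provided by an embodiment of the present disclosure;
FIG. 4 shows a functional module diagram of another fault diagnosis device provided by an embodiment of the present disclosure;
FIG. 5 shows a structural diagram of an electronic device provided by an embodiment of the present disclosure.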
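Detailed description
As network environments grow more complex, network faults occur more and more often. In a comparative embodiment, network administrators locate network faults and analyze their causes by troubleshooting the network segment by segment. Through long-term research the inventors found that, although this approach can locate the fault point after many attempts, the troubleshooting process is too inefficient, so network services affected by the fault take a long time to recover.
The above defects of the comparative embodiment are the result of the inventors' practice and careful study. Therefore, both the discovery of the above problem and the solutions proposed below by the embodiments of the present disclosure should be regarded as contributions made by the inventors to the present invention in the course of the invention.
The technical solutions in the embodiments of the present disclosure are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present disclosure, not all of them.
Note that in the description of the present disclosure, terms such as "first" and "second" are used only to distinguish one entity or operation from another. They are not to be understood as indicating or implying relative importance, nor as requiring or implying any actual relationship or order between these entities or operations.
FIG. 1 shows the topology of a network to which the fault diagnosis method provided by the embodiments of the present disclosure can be applied. Referring to FIG. 1, the network includes several entities involved in the method of the present disclosure: a central server 110, a probe client 120, network devices 130 (two are shown in FIG. 1, network device A and network device B), a network link 140, and a service server 150. The arrowed connecting lines indicate possible data exchange between these entities. It will be understood that the number of these entities and their topology are not limited to what FIG. 1 shows; FIG. 1 is only a simple example.
The main steps of fault diagnosis (including fault location and cause analysis) run on the central server 110. The probe client 120 performs detection as instructed by the central server 110 and returns the results to the central server, assisting the central server 110 in completing fault diagnosis. The service server 150 runs network services, such as a web service. Users can use terminal devices to access the network services on the service server 150, for example to browse web pages. While a user terminal accesses a network service, packets may pass through the network devices 130 and network links 140 in the network; the network devices 130 here may be routers, switches, and the like.
The central server 110 and the probe client 120 can be deployed independently, or on one of the network devices 130. In particular, although in theory the probe client 120 can be deployed anywhere in the network, in most cases network faults are perceived directly by users (for example, a user finds a website very slow or completely unreachable). If the probe client 120 is deployed on a network device 130 close to the user side, the probe client 120 and the user terminal can be considered to be in the same, or essentially the same, network environment. The detection performed by the probe client 120 then better simulates the user terminal's actual access to the service server 150, and the information obtained is more useful for fault location and cause analysis.
For example, in a traditional three-layer network architecture (access layer, aggregation layer, and core layer), the probe client 120 can be deployed on an edge network device (in the access layer) or an aggregation network device (in the aggregation layer). Some networks today do not use the traditional three-layer architecture; in that case it is enough to deploy the probe client 120 on a network device 130 close to the user side. As mentioned above, the probe client 120 can also be deployed independently, for example on a standalone server that connects to the same network device as the user terminal.
The timing of deploying the probe client 120 is not limited. For example, it can be deployed in advance and used only when fault diagnosis is needed; alternatively, it can be deployed for fault diagnosis only after a fault is discovered.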
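FIG. 2 shows a flowchart of a fault diagnosis method provided by an embodiment of the present disclosure. Referring to FIG. 2, the method includes:
Step S210: The central server sends first detection information to the probe client.
Step S210 can start after the symptoms of a network fault are noticed (for example, a user finds that a network service is unusable or responds very slowly). The first detection information tells the probe client how to detect the network service. It includes at least the address of the network service running on the service server, and may also include the detection frequency, the detection mode, and similar content.
The address of the network service may be a URL, for example one beginning with http or https (corresponding to an http service and an https service respectively). It may also be an address for a protocol such as SFTP or RSTP, but for simplicity the following description still uses a URL as the example. The detection frequency refers to the interval between successive detections by the probe client. The detection mode refers to how the probe client performs detection, for example detecting continuously without end, detecting continuously for a period of time, or detecting once. The content of the first detection information can be set according to the user's diagnosis needs, or default values can be used.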
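One way to picture the first detection information is as a small message type. The sketch below is illustrative only: the field names, the default values, and the `probe_count` helper are assumptions made for this example, since the text specifies only that the service address is mandatory and that a detection frequency and a detection mode may accompany it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ProbeMode(Enum):
    CONTINUOUS = "continuous"   # keep detecting without end
    TIME_BOXED = "time_boxed"   # keep detecting for a limited period
    SINGLE = "single"           # detect once


@dataclass(frozen=True)
class FirstDetectionInfo:
    service_address: str                      # required, e.g. an http(s) URL
    interval_seconds: float = 60.0            # hypothetical default interval
    mode: ProbeMode = ProbeMode.SINGLE        # hypothetical default mode
    duration_seconds: Optional[float] = None  # only used by TIME_BOXED

    def __post_init__(self) -> None:
        if not self.service_address:
            raise ValueError("service_address is required")
        if self.interval_seconds <= 0:
            raise ValueError("interval_seconds must be positive")
        if self.mode is ProbeMode.TIME_BOXED and self.duration_seconds is None:
            raise ValueError("TIME_BOXED mode needs duration_seconds")


def probe_count(info: FirstDetectionInfo) -> Optional[int]:
    """Number of detections implied by the message; None means unbounded."""
    if info.mode is ProbeMode.SINGLE:
        return 1
    if info.mode is ProbeMode.CONTINUOUS:
        return None
    # One detection at time zero, then one per full interval within the window.
    return int(info.duration_seconds // info.interval_seconds) + 1
```

Validating the message when it is constructed means a malformed instruction is rejected on the central server before it is ever sent to a probe client.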
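Step S220: The probe client detects the network service according to the first detection information and obtains service metric information.
After receiving the first detection information, the probe client detects the service according to the network service address, detection frequency, detection mode, and so on specified in it, and obtains service metric information. The service metric information can characterize the quality of the network service as experienced by the user. For example, it can include the network delay between the probe client and the service server (such as the TCP connection establishment time or the SSL handshake time), the transmission delay between the probe client and the service server for the detected service (such as the page transfer time), or the processing time of the service server for the detected service (the time the service server takes to process a service request).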
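As a rough illustration of how the three metrics could be derived from one probe, the sketch below works from timestamps recorded during a single request. The timestamp names are hypothetical, and the way processing time is estimated (wait for the first response byte minus one network delay) is an assumption of this example; the text does not say how the probe client separates processing time from network time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeTimestamps:
    connect_start: float  # TCP connection attempt begins
    connect_done: float   # TCP connection established
    request_sent: float   # service request fully sent
    first_byte: float     # first byte of the response received
    last_byte: float      # last byte of the response received


@dataclass(frozen=True)
class ServiceMetrics:
    network_delay: float    # TCP connection establishment time
    processing_time: float  # estimated server-side processing time
    transfer_time: float    # time to transfer the response body


def derive_metrics(ts: ProbeTimestamps) -> ServiceMetrics:
    network_delay = ts.connect_done - ts.connect_start
    wait = ts.first_byte - ts.request_sent
    # Assumption: the wait contains roughly one network delay plus processing.
    processing_time = max(0.0, wait - network_delay)
    transfer_time = ts.last_byte - ts.first_byte
    return ServiceMetrics(network_delay, processing_time, transfer_time)
```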
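Step S230: The probe client sends the service metric information to the central server.
Step S240: The central server determines the location of the fault according to the service metric information and preset rules.
After receiving the service metric information, the central server can determine where in the network the fault has occurred by using the content of the service metric information and the preset rules. The location of the fault includes at least three possibilities: the service server, a network device, or a network link. These three locations largely cover where network faults can occur, so the method provided by the present disclosure can locate network faults comprehensively.
The preset rules can include the first preset rule, second preset rule, third preset rule, fourth preset rule, and so on described below. In one implementation, these fault-related preset rules are stored in a knowledge base of the central server. The knowledge base can be regarded as a large collection of rules, which makes it easy to manage them in one place. The knowledge base of the central server refers broadly to any knowledge base the central server can access: it may be deployed locally on the central server, but deploying it on another device that the central server can access is not excluded. The representation of rules in the knowledge base is not limited; for example, knowledge representation methods such as production rules, frames, or semantic networks can be used. Note also that all rules can be stored in one knowledge base, or several knowledge bases can be formed; for example, the rules making up the third preset rule can form a separate knowledge base. Of course, the preset rules can also be stored in a form other than a knowledge base.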
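A knowledge base of production rules can be sketched as a list of condition and conclusion pairs that is scanned in order. This is a minimal illustration, assuming the facts arrive as a mapping of metric names to numbers; the class and field names are hypothetical and other representations (frames, semantic networks) would look quite different.

```python
from dataclasses import dataclass
from typing import Callable, List, Mapping, Optional, Sequence

Facts = Mapping[str, float]


@dataclass(frozen=True)
class ProductionRule:
    name: str
    condition: Callable[[Facts], bool]  # the "if" part of the rule
    conclusion: str                     # the "then" part of the rule


class KnowledgeBase:
    """An ordered collection of production rules."""

    def __init__(self, rules: Sequence[ProductionRule]) -> None:
        self._rules: List[ProductionRule] = list(rules)

    def first_match(self, facts: Facts) -> Optional[str]:
        """Return the conclusion of the first rule whose condition holds."""
        for rule in self._rules:
            if rule.condition(facts):
                return rule.conclusion
        return None
```

Because the rules are plain data, they can be loaded from a local store or fetched from another device, which fits the remark that the knowledge base only needs to be reachable by the central server.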
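To determine the location of the fault, it may be enough to match the service metric information against a preset rule (as for a service server fault, see below). Depending on the fault type, however, more complex follow-up operations may be involved (as for a network device or network link fault, see below). These follow-up operations are also triggered by the service metric information the central server receives, and they also use preset rules for their decisions. Step S240 can therefore be understood as an overall summary of fault location, whose concrete implementation may be fairly complex; steps S241a to S246 below give one possible implementation of step S240. Note that although some of these steps are not actions of the central server (step S240 above is executed by the central server), they are all driven by the central server, and the final diagnosis result (including the location of the fault) is produced on the central server, so it is reasonable to show them in FIG. 2 as sub-steps of step S240.
In some implementations, before using the service metric information for fault diagnosis, the central server may first need to preprocess it. Preprocessing may include one or more of decryption, decoding, format conversion, or removal of redundancy (redundant information being information irrelevant to fault diagnosis).
Step S241a: The central server determines that the location of the fault is the service server.
Step S241b: The central server determines that the location of the fault is a network device or a network link.
The two steps above are explained together. As mentioned, the location of the fault includes at least three possibilities: the service server, a network device, or a network link. Step S241a locates the fault at the service server, while step S241b locates it at a network device or network link; whether it is the device or the link is determined in later steps. The two steps can be implemented in at least the following two ways:
Approach one: if the service metric information satisfies the first preset rule, the location of the fault is determined to be the service server; otherwise it is determined to be a network device or a network link.
Approach one is a simple dichotomy: the first preset rule is the only condition used to decide the location of the fault. As an option, the first preset rule can be: the network delay between the probe client and the service server is less than a first threshold, and the processing time of the service server for the detected service is greater than a second threshold. The logic of the rule is: if the network delay is short (less than the first threshold) while the processing time is long (greater than the second threshold), service processing has a problem, so a service server fault can be inferred; otherwise service processing did not cause the fault, and the fault should lie in a network device or a network link.
For example, for an http service, the network delay can be the TCP connection establishment time between the probe client and the service server. For an https service, the network delay can be the TCP connection establishment time or the SSL handshake time between the probe client and the service server, or both can be used together: for example, the service server is considered faulty only if the TCP connection establishment time is less than one preset value, the SSL handshake time is also less than a preset value, and the processing time is greater than the second threshold.
Approach two: if the service metric information satisfies the first preset rule, the location of the fault is determined to be the service server; otherwise, if the service metric information satisfies the second preset rule, the location of the fault is determined to be a network device or a network link.
Approach two uses two conditions to decide the location of the fault, the first preset rule and the second preset rule. The two conditions are best made mutually exclusive so that they cannot produce conflicting fault locations. The first preset rule can be: the network delay between the probe client and the service server is less than a first threshold, and the processing time of the service server for the detected service is greater than a second threshold. The second preset rule can be: the network delay between the probe client and the service server is greater than a third threshold. The logic of these two rules is: if the network delay is short (less than the first threshold) while the processing time is long (greater than the second threshold), service processing has a problem, so a service server fault can be inferred; otherwise, if the network delay is long (greater than the third threshold), network transmission of the data has a problem, so a network device or network link fault can be inferred. To make the conditions mutually exclusive as described, the third threshold in the second preset rule can be set to a value not less than the first threshold. How the network delay is obtained was explained under approach one and is not repeated.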
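The two approaches can be written as a pair of small functions. The sketch below is illustrative: the function names and the string labels are made up for this example, and the thresholds are plain parameters because the text does not give values. Approach two checks that the third threshold is not less than the first, which is the condition the text gives for keeping the two rules mutually exclusive.

```python
from typing import Optional

SERVER = "service server"
NETWORK = "network device or network link"


def rule1(network_delay: float, processing_time: float, t1: float, t2: float) -> bool:
    """First preset rule: short network delay and long processing time."""
    return network_delay < t1 and processing_time > t2


def rule2(network_delay: float, t3: float) -> bool:
    """Second preset rule: long network delay."""
    return network_delay > t3


def locate_approach_one(network_delay: float, processing_time: float,
                        t1: float, t2: float) -> str:
    """Dichotomy: anything that does not match rule 1 is a network fault."""
    return SERVER if rule1(network_delay, processing_time, t1, t2) else NETWORK


def locate_approach_two(network_delay: float, processing_time: float,
                        t1: float, t2: float, t3: float) -> Optional[str]:
    """Two conditions: returns None when neither rule matches."""
    if t3 < t1:
        raise ValueError("t3 must be >= t1 so the two rules stay mutually exclusive")
    if rule1(network_delay, processing_time, t1, t2):
        return SERVER
    if rule2(network_delay, t3):
        return NETWORK
    return None
```

The practical difference shows up for measurements that match neither rule: approach one still blames the network, while approach two leaves the location open.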
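In both approach one and approach two the rules are simple, so a service server fault can be located quickly and accurately; locating a network device or network link fault can be done in later steps. It is also possible that in some application scenarios only the service server needs to be checked for faults and faults elsewhere are of no concern, in which case network device or network link faults need not be located. In FIG. 1, the service server fault is marked X1.
Step S242: The central server collects first fault information from the service server, and determines the cause of the service server fault according to the first fault information and the third preset rule.
After step S241a locates the fault at the service server, the central server can further analyze why the service server failed. Strictly speaking, step S242, which analyzes the cause, is not part of step S240, which locates the fault, but for simplicity it is described here as well.
The central server can send a request to the faulty service server, instructing it to collect first fault information and return it to the central server. The first fault information can include, but is not limited to, processor information, memory information, log information, network interface traffic information, or process information of the service server. After obtaining the first fault information, the central server can match it against the third preset rule; if one of the third preset rules matches, the corresponding cause is obtained. For example, one of the third preset rules is: if processor usage stays at a high level for a long time, the cause of the service server fault is confirmed to be a server performance bottleneck. If the processor information in the first fault information received by the central server matches this rule, the central server can confirm that the cause is a service server performance bottleneck. Once the cause has been found, network administrators can understand the fault promptly and take appropriate measures to clear it quickly.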
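The processor-usage example of a third preset rule can be sketched as below. The text only says usage is "at a high level for a long time"; this sketch interprets that as a given fraction of the samples in the observation window being at or above a utilisation level. Both parameters, their default values, and the `cpu_samples` key are assumptions of this example.

```python
from typing import Mapping, Optional, Sequence


def cpu_bottleneck(samples: Sequence[float], high: float = 0.9,
                   min_fraction: float = 0.8) -> bool:
    """True if at least min_fraction of the utilisation samples are >= high.

    samples holds CPU utilisation values in [0, 1] over the observation window.
    """
    if not samples:
        return False
    busy = sum(1 for value in samples if value >= high)
    return busy / len(samples) >= min_fraction


def diagnose_server(first_fault_info: Mapping[str, Sequence[float]]) -> Optional[str]:
    """Match the first fault information against the single example rule."""
    if cpu_bottleneck(first_fault_info.get("cpu_samples", [])):
        return "server performance bottleneck"
    return None
```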
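Step S243: The central server sends second detection information to the probe client.
After step S241b locates the fault at a network device or network link, the central server can send second detection information to the probe client and carry out the subsequent steps in order to locate the network fault precisely. The second detection information tells the probe client how to detect the state of the network. It includes the address of the service server, and may also include the detection frequency, the detection mode, and similar content.
The address of the service server can be an IP address. As mentioned in the description of step S210, the central server can send the service URL to the probe client. Before detecting the service, the probe client first resolves the IP address of the service server through DNS; when it returns the service metric information to the central server, it can return this IP address as well, so that the central server can use it in step S243. An implementation in which the central server itself resolves the IP address of the service server through DNS is not excluded either. The detection frequency and detection mode have been described above and are not repeated.
Step S244: The probe client detects the network between the probe client and the service server according to the second detection information and obtains fault location information.
After receiving the second detection information, the probe client detects the network according to the service server address, detection frequency, detection mode, and so on specified in it, and obtains fault location information. The fault location information describes the approximate location of the fault (not yet the final location). In one implementation, the fault location information can include the address of a suspected faulty network device and the address of the next hop of that network device (omitted if there is no next hop); that is, the fault may lie in the suspected faulty network device, in its next hop, or in the network link between the two. A suspected faulty network device is a device that shows some symptoms of a fault. Showing such symptoms does not necessarily mean the device itself is faulty; they may also be caused by the network environment around the device. The fault location information therefore also includes the address of the next-hop network device, which helps locate the true source of the fault.
Taking FIG. 1 as an example, the probe client detects the network between itself and the service server, and can call existing tools such as traceroute or ping. If network device A is detected as suspected of failing, the fault location information sent to the central server must contain both the IP address of network device A and the IP address of its next hop, namely network device B.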
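The text does not say how the probe client decides which hop is suspect. The sketch below shows one possible heuristic, purely as an assumption: given per-hop round-trip times such as a traceroute-like tool might report, flag the first hop whose round-trip time jumps by more than a threshold over the previous hop, and attach the following hop as its next hop. The function name, the dictionary keys, and the 100 ms default are all hypothetical.

```python
from typing import Dict, Optional, Sequence, Tuple

Hop = Tuple[str, float]  # (hop address, round-trip time in milliseconds)


def build_fault_location_info(hops: Sequence[Hop],
                              jump_ms: float = 100.0) -> Optional[Dict[str, str]]:
    """Return the suspected device and its next hop, or None if no hop stands out."""
    previous_rtt = 0.0
    for index, (address, rtt) in enumerate(hops):
        if rtt - previous_rtt > jump_ms:
            info = {"suspect": address}
            # The next-hop entry is omitted when the suspect is the last hop.
            if index + 1 < len(hops):
                info["next_hop"] = hops[index + 1][0]
            return info
        previous_rtt = rtt
    return None
```

With a path of network device A, network device B and the service server, a large delay first appearing at A yields A as the suspect and B as the next hop, matching the FIG. 1 example.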
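Comparing step S244 with step S220, it is easy to see that the probe client provides at least two types of detection functions: one detects the service (step S220), and the other detects the network (step S244).
Step S245: The probe client sends the fault location information to the central server.
Step S246: The central server collects second fault information from the suspected faulty network device and the next hop of that network device according to the fault location information, and determines the location of the fault according to the second fault information and the fourth preset rule.
In some implementations, before using the fault location information for fault diagnosis, the central server may first need to preprocess it. The possible preprocessing methods were introduced at step S240 and are not repeated.
The central server can send a request to the suspected faulty network device and to its next hop respectively, instructing both devices to collect second fault information and return it to the central server. The second fault information can include, but is not limited to, routing table information, device configuration information, or operating system information of the network device. Note that the two network devices do not have to return the same types of information. For example, network device A can return routing table information and device configuration information while network device B returns operating system information. In short, the returned second fault information can be combined as needed.
After obtaining the second fault information, the central server can match it against the fourth preset rule; if one of the fourth preset rules matches, the corresponding fault location is obtained. The fourth preset rule can also specify operations for determining the location of the fault, which are executed during rule matching. The possible fault locations are, as described above, the suspected faulty network device, the next hop of that device, or the network link between the two.
For example, one of the fourth preset rules is: query the routing table of the suspected faulty network device and determine whether a destination route from that device to the service server exists. If the destination route does not exist, the suspected faulty network device is confirmed as the location of the fault. If the destination route exists, a one-way loopback test is triggered between the suspected faulty network device and its next hop; if the test fails, the network link between the suspected faulty network device and its next hop is confirmed as the location of the fault.
After receiving the second fault information, the central server can query the routing table of the suspected faulty network device contained in it and match the query result against the above rule. If the "destination route does not exist" branch matches, the suspected faulty network device is confirmed as the location of the fault, and the cause can be confirmed at the same time: a missing routing table entry. If the "destination route exists" branch matches, a one-way loopback test is performed and its result is matched against the rule again. If the "test failed" branch matches, the network link between the test source (the suspected faulty network device) and the test destination (the next-hop device) is down, which confirms the network link between the suspected faulty network device and its next hop as the location of the fault; the cause is then the link failure.
As the above shows, because the second fault information may contain some description of the cause of the fault, the central server may determine the cause while using the second fault information to locate the fault; unlike the service server case, a separate cause analysis is not needed. Of course, the cause obtained above may only be preliminary. For example, for a missing routing table entry, the central server can further analyze, based on the second fault information, what caused the entry to be missing. This analysis can also use rule matching and is not described in detail.
In FIG. 1, if the entry for the service server is missing from the routing table of network device A, the location of the fault is network device A, marked X2. If the entry is present but the one-way loopback test between network devices A and B fails, the location of the fault is the link between network devices A and B, marked X3.
In summary, the fault diagnosis method provided by the embodiments of the present disclosure deploys a probe client in the network. After a fault occurs, the central server instructs the probe client to perform fault detection by sending the first detection information, and then performs subsequent operations according to the service metric information returned by the probe client. The location of the fault in the network can thus be determined in one pass, without troubleshooting the network segment by segment, so that fault diagnosis completes quickly and the impact of the fault on network services is kept as small as possible. In some implementations of the method, the central server can further determine the cause of the fault through analysis, which helps clear the fault as soon as possible.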
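FIG. 3 shows a functional module diagram of a fault diagnosis device 300 provided by an embodiment of the present disclosure. The device is configured on the central server and includes:
a first information sending module 310, configured to send first detection information to the probe client, the first detection information including the address of a network service running on a service server;
a first information receiving module 320, configured to receive service metric information sent by the probe client, the service metric information being generated by the probe client after detecting the network service;
a fault diagnosis module 330, configured to determine the location of the fault according to the service metric information and preset rules.
In one implementation of the fault diagnosis device 300, the location of the fault includes: the service server, a network device, or a network link.
In one implementation of the fault diagnosis device 300, the fault diagnosis module 330 determining the location of the fault according to the service metric information and preset rules includes: if the service metric information satisfies the first preset rule, determining that the location of the fault is the service server, and otherwise determining that the location of the fault is the network device or the network link; or, if the service metric information satisfies the first preset rule, determining that the location of the fault is the service server, and otherwise, if the service metric information satisfies the second preset rule, determining that the location of the fault is the network device or the network link.
In one implementation of the fault diagnosis device 300, the service metric information includes the network delay between the probe client and the service server and the processing time of the service server for the network service; the first preset rule is: the network delay is less than a first threshold and the processing time is greater than a second threshold; the second preset rule is: the network delay is greater than a third threshold.
In one implementation of the fault diagnosis device 300, the fault diagnosis module 330 is further configured to: if the location of the fault is the service server, collect first fault information from the service server, and determine the cause of the service server fault according to the first fault information and the third preset rule.
In one implementation of the fault diagnosis device 300, the first preset rule, the second preset rule, and the third preset rule are stored in a knowledge base of the central server.
In one implementation of the fault diagnosis device 300, the fault diagnosis module 330 determining the location of the fault according to the service metric information and preset rules further includes: if the location of the fault is the network device or the network link, sending second detection information to the probe client, the second detection information including the address of the service server; receiving fault location information sent by the probe client, the fault location information being generated by the probe client after detecting the network between itself and the service server, and including the address of a suspected faulty network device and the address of the next hop of that network device; and collecting second fault information from the suspected faulty network device and its next hop according to the fault location information, and determining, according to the second fault information and the fourth preset rule, that the location of the fault is the suspected faulty network device, the next hop of that network device, or the network link between the two.
The implementation principle and technical effects of the fault diagnosis device 300 provided by the embodiments of the present disclosure have been described in the foregoing method embodiments. For brevity, for anything not mentioned in the device embodiments, refer to the corresponding content in the foregoing method embodiments.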
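FIG. 4 shows a functional module diagram of a fault diagnosis device 400 provided by an embodiment of the present disclosure. The device is configured on the probe client and includes:
a second information receiving module 410, configured to receive first detection information sent by the central server, the first detection information including the address of a network service running on a service server;
a detection module 420, configured to detect the network service according to the first detection information to obtain service metric information;
a second information sending module 430, configured to send the service metric information to the central server.
In one implementation of the fault diagnosis device 400, the second information receiving module 410 is further configured to: receive second detection information sent by the central server, the second detection information including the address of the service server;
the detection module 420 is further configured to: detect the network between the probe client and the service server according to the second detection information to obtain fault location information, the fault location information including the address of a suspected faulty network device and the address of the next hop of that network device;
the second information sending module 430 is further configured to: send the fault location information to the central server.
In one implementation of the fault diagnosis device 400, the probe client is deployed on a network device close to the user side in the network.
FIG. 5 shows a possible structure of an electronic device 500 provided by an embodiment of the present disclosure. Referring to FIG. 5, the electronic device 500 includes a processor 510, a memory 520, and a communication interface 530. These components are interconnected and communicate with one another through a communication bus 540 and/or other forms of connection mechanism (not shown).
The memory 520 stores computer program instructions which, when read and run by the processor 510, execute the fault diagnosis method provided by the embodiments of the present disclosure and other desired functions. The communication interface 530 is used by the electronic device 500 to communicate with other devices.
It can be understood that the structure shown in FIG. 5 is only illustrative; the electronic device 500 may include more or fewer components than shown in FIG. 5, or have a configuration different from that shown in FIG. 5. The components shown in FIG. 5 can be implemented in hardware, software, or a combination of the two. For example, both the central server 110 in FIG. 1 and the device on which the probe client 120 is deployed can be implemented by the electronic device 500.
The embodiments of the present disclosure also provide a computer-readable storage medium storing computer program instructions; when the computer program instructions are read and run by a processor, the steps of the fault diagnosis method provided by the embodiments of the present disclosure are executed. For example, the computer-readable storage medium may be, but is not limited to, the memory 520 of the electronic device 500 in FIG. 5.
The above are only specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed here shall fall within the protection scope of the present disclosure. The protection scope of the present disclosure shall therefore be that of the claims.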

Claims (12)

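  1. A fault diagnosis method, characterized in that it is applied to a central server, the method comprising:
    sending first detection information to a probe client, the first detection information including the address of a network service running on a service server;
    receiving service metric information sent by the probe client, the service metric information being generated by the probe client after detecting the network service;
    determining the location of the fault according to the service metric information and preset rules.
  2. The fault diagnosis method according to claim 1, characterized in that the location of the fault includes: the service server, a network device, or a network link.
  3. The fault diagnosis method according to claim 2, characterized in that determining the location of the fault according to the service metric information and preset rules comprises:
    if the service metric information satisfies a first preset rule, determining that the location of the fault is the service server, and otherwise determining that the location of the fault is the network device or the network link;
    or,
    if the service metric information satisfies the first preset rule, determining that the location of the fault is the service server, and otherwise, if the service metric information satisfies a second preset rule, determining that the location of the fault is the network device or the network link.
  4. The fault diagnosis method according to claim 3, characterized in that the service metric information includes the network delay between the probe client and the service server and the processing time of the service server for the network service;
    the first preset rule is: the network delay is less than a first threshold and the processing time is greater than a second threshold;
    the second preset rule is: the network delay is greater than a third threshold.
  5. The fault diagnosis method according to claim 3, characterized in that the method further comprises:
    if the location of the fault is the service server, collecting first fault information from the service server, and determining the cause of the service server fault according to the first fault information and a third preset rule.
  6. The fault diagnosis method according to claim 5, characterized in that the first preset rule, the second preset rule, and the third preset rule are stored in a knowledge base of the central server.
  7. The fault diagnosis method according to claim 3, characterized in that determining the location of the fault according to the service metric information and preset rules further comprises:
    if the location of the fault is the network device or the network link, sending second detection information to the probe client, the second detection information including the address of the service server;
    receiving fault location information sent by the probe client, the fault location information being generated by the probe client after detecting the network between itself and the service server, and including the address of a suspected faulty network device and the address of the next hop of that network device;
    collecting second fault information from the suspected faulty network device and the next hop of that network device according to the fault location information, and determining, according to the second fault information and a fourth preset rule, that the location of the fault is the suspected faulty network device, the next hop of that network device, or the network link between the two.
  8. A fault diagnosis method, characterized in that it is applied to a probe client, the method comprising:
    receiving first detection information sent by a central server, the first detection information including the address of a network service running on a service server;
    detecting the network service according to the first detection information to obtain service metric information;
    sending the service metric information to the central server.
  9. The fault diagnosis method according to claim 8, characterized in that the method further comprises:
    receiving second detection information sent by the central server, the second detection information including the address of the service server;
    detecting the network between the probe client and the service server according to the second detection information to obtain fault location information, the fault location information including the address of a suspected faulty network device and the address of the next hop of that network device;
    sending the fault location information to the central server.
  10. The fault diagnosis method according to claim 8 or 9, characterized in that the probe client is deployed on a network device close to the user side in the network.
  11. A fault diagnosis device, characterized in that it is configured on a central server, the device comprising:
    a first information sending module, configured to send first detection information to a probe client, the first detection information including the address of a network service running on a service server;
    a first information receiving module, configured to receive service metric information sent by the probe client, the service metric information being generated by the probe client after detecting the network service;
    a fault diagnosis module, configured to determine the location of the fault according to the service metric information and preset rules.
  12. A fault diagnosis device, characterized in that it is configured on a probe client, the device comprising:
    a second information receiving module, configured to receive first detection information sent by a central server, the first detection information including the address of a network service running on a service server;
    a detection module, configured to detect the network service according to the first detection information to obtain service metric information;
    a second information sending module, configured to send the service metric information to the central server.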
PCT/CN2020/116002 2019-12-24 2020-09-17 一种故障诊断方法及装置 WO2021128977A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911346437.XA CN111030873A (zh) 2019-12-24 2019-12-24 一种故障诊断方法及装置
CN201911346437.X 2019-12-24

Publications (1)

Publication Number Publication Date
WO2021128977A1 true WO2021128977A1 (zh) 2021-07-01

Family

ID=70212983

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/116002 WO2021128977A1 (zh) 2019-12-24 2020-09-17 一种故障诊断方法及装置

Country Status (2)

Country Link
CN (1) CN111030873A (zh)
WO (1) WO2021128977A1 (zh)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111030873A (zh) * 2019-12-24 2020-04-17 迈普通信技术股份有限公司 一种故障诊断方法及装置
CN111682960A (zh) * 2020-05-14 2020-09-18 深圳市有方科技股份有限公司 一种物联网网络及设备的故障诊断方法及装置
CN113727406B (zh) * 2020-05-21 2022-11-29 北京三快在线科技有限公司 通信控制方法、装置、设备及计算机可读存储介质
CN112019378B (zh) * 2020-08-04 2022-10-25 中国联合网络通信集团有限公司 一种故障排查方法及装置
CN112073234B (zh) * 2020-09-02 2024-06-28 腾讯科技(深圳)有限公司 一种故障检测方法、装置、系统、设备及存储介质
CN112838955A (zh) * 2021-01-28 2021-05-25 广东浩云长盛网络股份有限公司 基于evit的数据中心服务器故障诊断方法
CN116806035A (zh) * 2022-03-17 2023-09-26 华为技术有限公司 一种时延分析方法及装置
CN116708150B (zh) * 2022-12-29 2024-04-02 荣耀终端有限公司 网络诊断方法和电子设备

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090225663A1 (en) * 2008-03-05 2009-09-10 Fujitsu Limited Network management apparatus and method thereof
CN105577418A (zh) * 2014-11-05 2016-05-11 中兴通讯股份有限公司 电信网络故障信息采集方法和设备
CN109787827A (zh) * 2019-01-18 2019-05-21 网宿科技股份有限公司 一种cdn网络监控的方法及装置
CN111030873A (zh) * 2019-12-24 2020-04-17 迈普通信技术股份有限公司 一种故障诊断方法及装置

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101521593B (zh) * 2008-11-13 2011-03-16 ***通信集团广东有限公司 数据链路层故障定位的方法及装置
US9131396B2 (en) * 2012-10-16 2015-09-08 At&T Intellectual Property I, Lp Measurement of field reliability metrics
CN106155844B (zh) * 2016-07-29 2019-02-12 深圳创维数字技术有限公司 一种web服务器的自恢复方法和自恢复系统
CN110224883B (zh) * 2019-05-29 2020-11-27 中南大学 一种应用于电信承载网的灰色故障诊断方法

Also Published As

Publication number Publication date
CN111030873A (zh) 2020-04-17

Similar Documents

Publication Publication Date Title
WO2021128977A1 (zh) 一种故障诊断方法及装置
US11671342B2 (en) Link fault isolation using latencies
JP6419967B2 (ja) ネットワーク管理のためのシステムおよび方法
EP1999890B1 (en) Automated network congestion and trouble locator and corrector
US20090003241A1 (en) A Method and System For Obtaining Path Maximum Transfer Unit in Network
CN110224883B (zh) 一种应用于电信承载网的灰色故障诊断方法
CN112311580B (zh) 报文传输路径确定方法、装置及系统、计算机存储介质
CN111934936B (zh) 网络状态检测方法、装置、电子设备及存储介质
RO132010A2 (ro) Metode, sisteme şi suport citibil de calculator pentru diagnosticarea reţelei
WO2021032175A1 (zh) 故障注入方法及其装置、业务服务***
EP3232620B1 (en) Data center based fault analysis method and device
CN112260922B (zh) 网络环路问题快速定位方法与***
WO2007016830A1 (fr) Procédé et côté client destinés à l’implémentation de la détection des performances du service dhcp
CN111565133B (zh) 专线切换方法、装置、电子设备和计算机可读存储介质
US10382290B2 (en) Service analytics
Kim et al. DYSWIS: crowdsourcing a home network diagnosis
CN109120449B (zh) 一种链路故障的检测方法及装置
US8195977B2 (en) Network fault isolation
JP4464256B2 (ja) ネットワーク上位監視装置
CN113055291B (zh) 一种数据包发送方法、路由器、数据包传输***
US11765059B2 (en) Leveraging operation, administration and maintenance protocols (OAM) to add ethernet level intelligence to software-defined wide area network (SD-WAN) functionality
US10904123B2 (en) Trace routing in virtual networks
JP3722277B2 (ja) インターネットの障害推定方法
GB2566467A (en) Obtaining local area network diagnostic test results
WO2020179704A1 (ja) ネットワーク管理方法、ネットワークシステム、集約解析装置、端末装置、及びプログラム

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20905358

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20905358

Country of ref document: EP

Kind code of ref document: A1