CN115996408A - Fault processing method and device and electronic equipment - Google Patents

Fault processing method and device and electronic equipment

Info

Publication number
CN115996408A
Authority
CN
China
Prior art keywords
monitoring
data
monitoring index
fault
structure file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211450654.5A
Other languages
Chinese (zh)
Inventor
谢利明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd
Priority to CN202211450654.5A
Publication of CN115996408A

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The embodiment of the invention provides a fault processing method, a device and electronic equipment, wherein the method comprises the following steps: receiving a fault message sent when first equipment fails, wherein the first equipment is at least one of a plurality of monitored equipment managed by management equipment, and the fault message is used for indicating that the value of a first monitoring index when the first equipment transmits data is greater than or equal to a first monitoring threshold value corresponding to the first monitoring index; acquiring at least one configuration information associated with transmission data of a first device; respectively acquiring historical abnormal data of a first monitoring index associated with each configuration information; and determining whether to acquire address information of the attacked device causing the first device to fail according to the historical abnormal data.

Description

Fault processing method and device and electronic equipment
Technical Field
The present invention relates to the field of mobile communications technologies, and in particular, to a fault processing method, a fault processing device, and an electronic device.
Background
As digitalization deepens, the number of online devices at each organization has grown by roughly 10 to 100 times compared with ten years ago. Device operation and maintenance has evolved from a manual mode to tool-based and platform-based modes, yet it still cannot meet the monitoring requirements that current large-scale networking places on operation and maintenance.
For example, a large-scale network contains a large number of devices, the application relationships between the devices are often complex, and there are many deployment levels. Therefore, when a network device fails, how to perform fault tracing more reasonably is a key problem to be solved.
Disclosure of Invention
The embodiment of the invention provides a fault processing method, a fault processing device and electronic equipment, which aim to solve at least part of the problems.
In a first aspect, an embodiment of the present invention provides a fault handling method, including:
receiving a fault message sent when a first device fails, wherein the first device is at least one of a plurality of monitored devices managed by the management device, and the fault message is used for indicating that a value of a first monitoring index when the first device transmits data is greater than or equal to a first monitoring threshold corresponding to the first monitoring index;
acquiring at least one configuration information associated with the transmission data of the first equipment;
respectively acquiring historical abnormal data of the first monitoring index associated with each piece of configuration information;
and determining whether to acquire address information of the attacked device causing the first device failure according to the historical abnormal data.
In a second aspect, an embodiment of the present invention provides a fault handling method, including:
sending a fault message to the management device, wherein the fault message is used for indicating that the value of a first monitoring index when the monitored device transmits data is greater than or equal to a first monitoring threshold corresponding to the first monitoring index.
In a third aspect, an embodiment of the present invention provides a fault handling apparatus applied to a management device, the apparatus including:
the first receiving module is used for receiving a fault message sent when a first device fails, wherein the first device is at least one of a plurality of monitored devices managed by the management device, and the fault message is used for indicating that the value of a first monitoring index when the first device transmits data is greater than or equal to a first monitoring threshold corresponding to the first monitoring index;
the first acquisition module is used for acquiring at least one configuration information associated with the transmission data of the first equipment;
the second acquisition module is used for respectively acquiring historical abnormal data of the first monitoring index associated with each piece of configuration information;
and the determining module is used for determining whether to acquire address information of the attacked device which causes the first device to fail according to the historical abnormal data.
In a fourth aspect, an embodiment of the present invention provides a fault handling apparatus for application to a monitored device, the apparatus comprising:
the first sending module is used for sending a fault message to the management equipment, wherein the fault message is used for indicating that the value of a first monitoring index when the monitored equipment transmits data is larger than or equal to a first monitoring threshold value corresponding to the first monitoring index.
In a fifth aspect, embodiments of the present invention provide an electronic device comprising a memory, a transceiver, and a processor:
a memory for storing a computer program; a transceiver for transceiving data under control of the processor; a processor for reading the computer program in the memory and executing the fault handling method according to the first aspect or the fault handling method according to the second aspect.
In a sixth aspect, an embodiment of the present invention provides a processor-readable storage medium storing a computer program for causing the processor to execute the fault handling method described in the first aspect or the fault handling method described in the second aspect.
In the embodiment of the invention, the management device can receive a fault message sent when the first device (i.e., at least one of the plurality of monitored devices it manages) fails, where the fault message indicates that the value of a first monitoring index when the first device transmits data is greater than or equal to a first monitoring threshold corresponding to the first monitoring index. The management device then acquires at least one piece of configuration information associated with the data transmitted by the first device, acquires the historical abnormal data of the first monitoring index associated with each piece of configuration information, and determines, according to the historical abnormal data, whether to acquire address information of the attacked device causing the first device to fail.
Thus, in the embodiment of the present invention, after receiving the fault message sent by the first device, the management device does not immediately acquire the address information of the attacked device that causes the first device to fail, that is, it does not immediately perform fault tracing. Instead, it acquires at least one piece of configuration information associated with the data transmitted by the first device, obtains the historical abnormal data of the first monitoring index from the different dimensions represented by the different pieces of configuration information, and determines according to that data whether to perform fault tracing. Fault tracing is therefore performed at a more reasonable time rather than as soon as a fault occurs, which reduces the probability of unnecessary tracing and saves network resources.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a fault handling method according to an embodiment of the present invention;
FIG. 2 is a flowchart of another fault handling method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a network topology to which the fault handling method according to the embodiment of the present invention is applicable;
FIG. 4 is a block diagram of a fault handling apparatus according to an embodiment of the present invention;
FIG. 5 is a block diagram illustrating another fault handling apparatus according to an embodiment of the present invention;
FIG. 6 is a block diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In the embodiment of the invention, the term "and/or" describes an association relationship between associated objects and means that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist together, or B exists alone. The character "/" generally indicates that the objects before and after it are in an "or" relationship.
The term "plurality" in the embodiments of the present application means two or more, and other quantifiers are similar thereto.
The following description of the technical solutions in the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The embodiment of the application provides a service switching method and device, which are used for solving the problem in the prior art that, when a UPF network element carrying a service suffers a device-level fault, the service interruption time is long because the faulty UPF configuration can only be deleted manually.
The method and the device are based on the same inventive concept. Because the method and the device solve the problem on similar principles, their implementations may refer to each other, and repeated descriptions are omitted.
FIG. 1 shows a flow chart of a fault handling method according to an embodiment of the present invention. The method may be applied to a management device and, as shown in FIG. 1, may include the following steps 101 to 104:
Step 101: receiving a fault message sent when the first device fails.
The first device is at least one of a plurality of monitored devices managed by the management device. Here, the monitored devices may include at least one of a server and a non-server device (e.g., a terminal device).
It will be appreciated that the monitored devices may include both local monitored devices and off-site monitored devices.
In addition, the fault message is used for indicating that the value of a first monitoring index when the first device transmits data is greater than or equal to a first monitoring threshold corresponding to the first monitoring index. Here, the first monitoring index may be a CPU usage rate or a memory occupancy rate. For example, when the first device monitors that the CPU usage rate is greater than or equal to a monitoring threshold corresponding to the CPU usage rate, a fault message may be sent to the management device; or when the first device monitors that the memory occupancy rate is greater than or equal to the monitoring threshold value corresponding to the memory occupancy rate, the first device may send a fault message to the management device.
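The threshold check on the monitored device can be sketched as follows. This is a minimal illustration only: the `FaultMessage` fields and the `check_index` helper are hypothetical names that do not appear in the original text, and the real message format is not specified by the source.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FaultMessage:
    """Hypothetical fault message; the field names are assumptions."""
    device_id: str
    index_name: str   # e.g. "cpu_usage" or "memory_occupancy"
    value: float
    threshold: float


def check_index(device_id: str, index_name: str,
                value: float, threshold: float) -> Optional[FaultMessage]:
    """Return a fault message when the index value is greater than or
    equal to its monitoring threshold, otherwise return None."""
    if value >= threshold:
        return FaultMessage(device_id, index_name, value, threshold)
    return None


# Usage: CPU usage of 95% against a 90% threshold produces a message.
msg = check_index("device-1", "cpu_usage", 0.95, 0.90)
```

Note that the comparison is "greater than or equal to", so a value exactly at the threshold also triggers a fault message.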
Step 102: acquiring at least one piece of configuration information associated with the data transmitted by the first device.
The data transmitted by the first device corresponds to, or is associated with, at least one piece of configuration information. Optionally, the configuration information includes at least one of a source address, a destination address, a source port, a destination port, and the protocol used when transmitting data. Here, the source address is the source address of the data sent by the first device; the destination address is the address of the sending device of the data received by the first device; the source port is the source port from which the first device transmits data; and the destination port is the port of the sending device of the data received by the first device.
Step 103: respectively acquiring historical abnormal data of the first monitoring index associated with each piece of configuration information.
When transmitting data during its historical operation, the first device likewise had corresponding configuration information (such as a source address, a destination address, a source port, a destination port, and a protocol), and the values of the first monitoring index recorded for that transmitted data (i.e., the historical data) also contain abnormal data.
It should be noted that the first device may adopt different monitoring thresholds for the same monitoring index in different operation time periods. The historical abnormal data of the first monitoring index therefore consists of the values that, in each operation time period, are greater than or equal to the monitoring threshold adopted in that period.
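One possible reading of steps 102 and 103 is sketched below. It assumes, purely for illustration, that each historical record stores its operation period, the index value, and the configuration fields in effect, and that a record is "associated with" a piece of configuration information when it shares that field's current value. The record layout and all names are hypothetical.

```python
from collections import defaultdict


def collect_historical_anomalies(records, thresholds_by_period, current_config):
    """Group abnormal historical values by configuration field.

    records: list of dicts with keys "period", "value" and "config".
    thresholds_by_period: monitoring threshold used in each period.
    current_config: configuration of the transmission that triggered
        the fault message, e.g. {"src_addr": ..., "dst_port": ...}.
    """
    anomalies = defaultdict(list)
    for rec in records:
        # A value is abnormal relative to the threshold of its own period.
        if rec["value"] < thresholds_by_period[rec["period"]]:
            continue
        for field, current_value in current_config.items():
            if rec["config"].get(field) == current_value:
                anomalies[field].append(rec["value"])
    return dict(anomalies)


records = [
    {"period": "day", "value": 0.95,
     "config": {"src_addr": "10.0.0.1", "dst_port": 80}},
    {"period": "night", "value": 0.75,
     "config": {"src_addr": "10.0.0.1", "dst_port": 443}},
    {"period": "day", "value": 0.80,
     "config": {"src_addr": "10.0.0.1", "dst_port": 80}},
]
thresholds = {"day": 0.90, "night": 0.70}
current = {"src_addr": "10.0.0.1", "dst_port": 80}
result = collect_historical_anomalies(records, thresholds, current)
```

The second record counts as abnormal even though 0.75 is below the daytime threshold, because it is compared with the threshold of the night period in which it was recorded.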
Step 104: determining, according to the historical abnormal data, whether to acquire address information of the attacked device causing the first device to fail.
In step 104, determining whether to acquire the address information of the attacked device causing the first device to fail amounts to evaluating, on the basis of the historical abnormal data, whether to perform fault tracing.
As can be seen from the above steps 101 to 104, in the embodiment of the present invention, the management device can receive a fault message sent when the first device (i.e., at least one of the plurality of monitored devices it manages) fails, where the fault message indicates that the value of a first monitoring index when the first device transmits data is greater than or equal to a first monitoring threshold corresponding to the first monitoring index. The management device then acquires at least one piece of configuration information associated with the data transmitted by the first device, acquires the historical abnormal data of the first monitoring index associated with each piece of configuration information, and determines, according to the historical abnormal data, whether to acquire address information of the attacked device causing the first device to fail.
Thus, in the embodiment of the present invention, after receiving the fault message sent by the first device, the management device does not immediately acquire the address information of the attacked device that causes the first device to fail, that is, it does not immediately perform fault tracing. Instead, it acquires at least one piece of configuration information associated with the data transmitted by the first device, obtains the historical abnormal data of the first monitoring index from the different dimensions represented by the different pieces of configuration information, and determines according to that data whether to perform fault tracing. Fault tracing is therefore performed at a more reasonable time rather than as soon as a fault occurs, which reduces the probability of unnecessary tracing and saves network resources.
Optionally, before the step 101 of receiving the failure message sent when the first device fails, the method further includes:
converting a monitoring period, a monitoring index and a monitoring threshold corresponding to the monitoring index into a data structure file accepted by the Network Configuration Protocol (NETCONF);
transmitting the data structure file to the first device;
the first monitoring index is one of the monitoring indexes in the data structure file.
The monitoring index may include at least one of CPU utilization, memory occupancy, port information, device IP and board information, so that the monitoring indexes comprehensively and accurately reflect the running state of the monitored device. The board information corresponds to the CPU utilization, so it is possible to determine exactly which board has a problem.
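As an illustration of converting a monitoring period, monitoring indexes and thresholds into an XML data structure, the sketch below builds a small document with Python's standard library. The element names (`monitoring-config`, `period`, `index`, `name`, `threshold`) are hypothetical; the source does not specify the schema, and a real device would require the payload to follow its own data model.

```python
import xml.etree.ElementTree as ET


def build_monitoring_config(period_seconds, thresholds):
    """Serialize a monitoring period and per-index thresholds to XML.

    All element names here are assumptions made for illustration.
    """
    root = ET.Element("monitoring-config")
    ET.SubElement(root, "period").text = str(period_seconds)
    for name, threshold in thresholds.items():
        index = ET.SubElement(root, "index")
        ET.SubElement(index, "name").text = name
        ET.SubElement(index, "threshold").text = str(threshold)
    return ET.tostring(root, encoding="unicode")


config_xml = build_monitoring_config(
    1, {"cpu_usage": 0.9, "memory_occupancy": 0.8}
)
```

Such a fragment would typically be carried inside a NETCONF configuration request; how it is wrapped depends on the device and is outside the scope of this sketch.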
In addition, the remote procedure call (RPC) layer of NETCONF provides a simple, transport-protocol-independent mechanism for encoding RPCs. That is, the request and response data exchanged between the management device and the monitored device under the NETCONF protocol (i.e., the contents of the operations layer and the content layer) are encapsulated in <rpc> and <rpc-reply> elements. Normally, the <rpc-reply> element carries the configuration data required by the management device; when a request message from the management device contains an error or the server fails to process it, the server places an <rpc-error> element containing detailed error information inside the <rpc-reply> element and returns it to the management device.
Furthermore, the command set of the NETCONF protocol consists of a series of commands for reading and modifying device configuration data and for reading status data. Commands are delivered as RPCs and answered with RPC replies; that is, every RPC must be answered by a corresponding RPC reply. A configuration operation consists of a series of RPCs, each with its corresponding RPC reply.
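The request/reply pairing described above can be illustrated by parsing an <rpc-reply> on the management side. The sketch below uses only the Python standard library and the NETCONF base namespace from RFC 6241; the sample replies and the `parse_rpc_reply` helper are illustrative assumptions, not part of the original method.

```python
import xml.etree.ElementTree as ET

# NETCONF base namespace as defined in RFC 6241.
NS = {"nc": "urn:ietf:params:xml:ns:netconf:base:1.0"}


def parse_rpc_reply(xml_text):
    """Return (message_id, errors) for an <rpc-reply> document.

    errors is a list with one dict per <rpc-error> child; it is empty
    when the request succeeded.
    """
    root = ET.fromstring(xml_text)
    errors = [
        {
            "tag": err.findtext("nc:error-tag", default="", namespaces=NS),
            "message": err.findtext("nc:error-message", default="", namespaces=NS),
        }
        for err in root.findall("nc:rpc-error", NS)
    ]
    return root.get("message-id"), errors


REPLY_OK = (
    '<rpc-reply message-id="101" '
    'xmlns="urn:ietf:params:xml:ns:netconf:base:1.0"><ok/></rpc-reply>'
)
REPLY_ERR = (
    '<rpc-reply message-id="102" '
    'xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">'
    "<rpc-error><error-type>application</error-type>"
    "<error-tag>invalid-value</error-tag>"
    "<error-severity>error</error-severity>"
    "<error-message>threshold out of range</error-message>"
    "</rpc-error></rpc-reply>"
)
```

The `message-id` attribute of the reply matches that of the request, which is how each RPC is paired with its reply.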
It can be seen that, in the embodiment of the present invention, the NETCONF protocol can be used to manage the monitored devices in the network and to send and receive messages, with the set of NETCONF RPC message instructions programmed into software on devices that support the NETCONF protocol. In this way, the monitoring indexes, monitoring periods and monitoring thresholds are configured on the monitored devices through the NETCONF protocol, and the monitoring events of the monitored devices are subscribed to.
In addition, once a NETCONF session begins, the management device and the monitored device exchange a set of "capabilities". This set includes information such as the list of supported NETCONF protocol versions, whether a candidate datastore is present, and the ways in which the running datastore can be modified. Capabilities are defined in the NETCONF Request for Comments (RFC), and a developer can add further capabilities by following the format specified in the RFC. By adopting the NETCONF protocol, the monitoring indexes (for example, adding, removing or modifying a monitoring index), the monitoring period and the monitoring threshold can therefore be set more flexibly, so that a wider variety of data can be collected.
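In RFC 6241 the items exchanged at session start are called capabilities and are carried in a <hello> message. The sketch below extracts them with the Python standard library; the sample message and the `parse_capabilities` helper are illustrative, while the namespace and capability URNs are those defined in RFC 6241.

```python
import xml.etree.ElementTree as ET

NS = {"nc": "urn:ietf:params:xml:ns:netconf:base:1.0"}
# Standard capability identifiers from RFC 6241.
CANDIDATE = "urn:ietf:params:netconf:capability:candidate:1.0"
WRITABLE_RUNNING = "urn:ietf:params:netconf:capability:writable-running:1.0"


def parse_capabilities(hello_xml):
    """Return the set of capability URIs advertised in a <hello>."""
    root = ET.fromstring(hello_xml)
    return {
        el.text.strip()
        for el in root.findall("nc:capabilities/nc:capability", NS)
        if el.text
    }


SAMPLE_HELLO = """<hello xmlns="urn:ietf:params:xml:ns:netconf:base:1.0">
  <capabilities>
    <capability>urn:ietf:params:netconf:base:1.1</capability>
    <capability>urn:ietf:params:netconf:capability:candidate:1.0</capability>
  </capabilities>
  <session-id>4</session-id>
</hello>"""

caps = parse_capabilities(SAMPLE_HELLO)
has_candidate = CANDIDATE in caps
```

A management program could use such a check to decide, for example, whether configuration changes must go through a candidate datastore or can be written to the running datastore directly.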
In addition, subscribing to monitoring events through the NETCONF protocol enables monitoring at second-level granularity.
Optionally, after the foregoing step 101 "receiving the failure message sent when the first device fails", the method further includes:
retransmitting the data structure file to the first device.
After a monitoring event has been subscribed to through the NETCONF protocol, the subscription becomes invalid once the monitored device detects that the value of a monitoring index exceeds the corresponding monitoring threshold. Therefore, so that the monitored device can again send a fault message to the management device when a monitoring index exceeds its monitoring threshold, the management device re-subscribes to the monitoring event through the NETCONF protocol after receiving the fault message.
It will be appreciated that, after receiving the above-mentioned fault message, the management device may simply resend the data structure file to the first device, or may resend the data structure file to all the monitored devices that it manages.
Optionally, the step 102 "obtaining at least one configuration information associated with the first device transmission data" includes:
when the number of times the data structure file has been sent to the first device reaches a preset number, acquiring at least one piece of configuration information associated with the data transmitted by the first device if the value of the first monitoring index when the first device transmits data is still greater than or equal to the first monitoring threshold.
If, when the number of times the data structure file has been sent to the first device reaches the preset number, the value of the first monitoring index when the first device transmits data is still greater than or equal to the first monitoring threshold, the fault of the first device remains unresolved even though the monitoring event has been re-subscribed the preset number of times. This indicates that the fault is not caused by network jitter, so fault tracing needs to be performed; that is, the step of "acquiring at least one piece of configuration information associated with the data transmitted by the first device" needs to be executed.
If, when the number of times the data structure file has been sent to the first device reaches the preset number, the value of the first monitoring index when the first device transmits data is smaller than the first monitoring threshold, the fault of the first device has been resolved by the time the monitoring event has been re-subscribed the preset number of times. This indicates that the fault was caused by network jitter and has since cleared, so fault tracing is not needed and the configuration information associated with the data transmitted by the first device does not need to be acquired.
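The re-subscription check can be sketched as follows. The two callables stand in for "send the data structure file again" and "read the current value of the first monitoring index"; both, like the function name, are hypothetical placeholders for device-specific operations that the source does not detail.

```python
def needs_tracing(resend_config, read_index, threshold, preset_times):
    """Resend the data structure file preset_times times, then report
    whether the index is still at or above the threshold.

    True  -> the fault persisted, so it is not attributed to network
             jitter and the configuration information should be acquired.
    False -> the fault cleared, so no fault tracing is needed.
    """
    for _ in range(preset_times):
        resend_config()
    return read_index() >= threshold


# Usage with stub callables that record the number of resends.
sends = []
persistent = needs_tracing(lambda: sends.append(1), lambda: 0.95, 0.90, 3)
transient = needs_tracing(lambda: None, lambda: 0.40, 0.90, 3)
```

For simplicity the sketch checks the index once after all resends; an implementation could also stop early as soon as the value drops below the threshold.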
Thus, in the embodiment of the present invention, the program can repeatedly execute the NETCONF RPC message to re-subscribe to the alarm, which prevents alarm subscription failures caused by network jitter.
Optionally, the aforementioned step 104 "determining whether to acquire address information of the attacked device that causes the first device to malfunction according to the historical anomaly data" includes:
storing the historical anomaly data associated with the ith configuration information into an ith set, i being an integer from 1 to N, N representing the number of the at least one configuration information;
classifying the historical abnormal data in the ith set according to a target threshold value by adopting a k-nearest neighbor algorithm prediction model, and obtaining the ith number of the historical abnormal data which is larger than or equal to the target threshold value in the ith set;
determining the ith set as an abnormal set under the condition that the ratio of the ith number to the number of the historical abnormal data in the ith set is larger than or equal to a preset ratio;
determining to acquire address information of the attacked device causing the first device failure under the condition that the number of the abnormal sets is greater than or equal to a preset number;
And under the condition that the number of the abnormal sets is smaller than the preset number, determining to skip the step of acquiring the address information of the attacked device causing the first device to malfunction.
For example, when the configuration information includes a source address, a destination address, a source port, a destination port and the protocol used, five sets are obtained; if three of the five sets are abnormal sets (taking a preset number of three as an example), fault tracing is required.
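The set-level decision can be sketched as below. For clarity this sketch counts values by direct comparison with the target threshold rather than with a k-nearest neighbor model, so it illustrates only the ratio and count logic; the function names and the sample numbers are assumptions.

```python
def is_abnormal_set(values, target_threshold, preset_ratio):
    """A set is abnormal when the share of values at or above the
    target threshold reaches the preset ratio."""
    if not values:
        return False
    count = sum(1 for v in values if v >= target_threshold)
    return count / len(values) >= preset_ratio


def should_trace(sets, target_threshold, preset_ratio, preset_number):
    """Decide whether to acquire the address information, based on how
    many of the per-configuration sets are abnormal."""
    abnormal = sum(
        is_abnormal_set(values, target_threshold, preset_ratio)
        for values in sets.values()
    )
    return abnormal >= preset_number


sets = {
    "src_addr": [0.95, 0.97, 0.91],   # ratio 1.0   -> abnormal
    "dst_addr": [0.95, 0.50],         # ratio 0.5   -> abnormal
    "src_port": [0.20, 0.30],         # ratio 0.0
    "dst_port": [0.92, 0.93, 0.10],   # ratio 2/3   -> abnormal
    "protocol": [0.10],               # ratio 0.0
}
trace = should_trace(sets, target_threshold=0.9, preset_ratio=0.5, preset_number=3)
```

With three abnormal sets out of five and a preset number of three, tracing is triggered; raising the preset number to four would skip it.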
The k-nearest neighbor algorithm prediction model is a k-nearest neighbor classifier. Its basic idea is as follows: given a sample x of undetermined class, the sample space is searched for the k samples closest to x, and the class of x is determined by the class to which the majority of those k nearest neighbors belong.
The main problems in k-nearest neighbor classification are therefore choosing an appropriate sample set, distance function, combining function and value of k. For attributes of various types, the distance function may follow the sample-similarity metrics used in cluster analysis, and the combining function may use simple unweighted voting or weighted voting. In simple unweighted voting, each neighbor x_i is considered to have the same influence on the class of x: the classes to which the k neighbors of x belong are counted, and x is assigned to the class with the highest count.
That is,
$$C(x) = \arg\max_{C_j} \sum_{i=1}^{k} \eta(x_i \in C_j)$$
where $\eta$ denotes the counting function: if $x_i \in C_j$, then $\eta(x_i \in C_j) = 1$; otherwise $\eta(x_i \in C_j) = 0$. When several classes have the same count, one of them is selected at random for x.
In weighted voting, the vote of each neighbor is multiplied by a weight:
$$C(x) = \arg\max_{C_j} \sum_{i=1}^{k} w_i \, \eta(x_i \in C_j)$$
Here, the weight is generally defined as $w_i = 1/d(x, x_i)^2$, where $d(x, x_i)$ denotes the distance between sample $x$ and neighbor $x_i$.
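The two voting rules can be sketched in a few lines of Python. This is a brute-force illustration using Euclidean distance; the function name and sample data are assumptions, and ties are resolved by insertion order rather than at random.

```python
import math
from collections import defaultdict


def knn_classify(x, samples, k, weighted=False):
    """Classify point x from labelled samples [(point, label), ...].

    Unweighted: every one of the k nearest neighbors adds 1 to its class.
    Weighted:   each neighbor adds 1 / d(x, x_i)^2 to its class.
    """
    neighbors = sorted(samples, key=lambda s: math.dist(x, s[0]))[:k]
    scores = defaultdict(float)
    for point, label in neighbors:
        if weighted:
            d = math.dist(x, point)
            # A neighbor at distance 0 dominates the vote.
            scores[label] += 1.0 / (d * d) if d > 0 else math.inf
        else:
            scores[label] += 1.0
    return max(scores, key=scores.get)


samples = [
    ((0.0,), "a"), ((0.1,), "a"),
    ((1.0,), "b"), ((1.1,), "b"), ((1.2,), "b"),
]
x = (0.2,)
label_k3 = knn_classify(x, samples, k=3)
label_k5 = knn_classify(x, samples, k=5)
label_k5_weighted = knn_classify(x, samples, k=5, weighted=True)
```

With k = 5 the unweighted vote is won by the three distant "b" samples, whereas the weighted vote favors the two much closer "a" samples, which shows why the choice of k and of the combining function matters.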
In addition, the k-nearest neighbor classifier makes predictions from local data and is therefore relatively sensitive to noise. The choice of k depends on the data. An excessively large k reduces the influence of noise but enlarges the neighborhood of the sample to be classified, which may lead to classification errors. A k that is too small may cause the vote to fail or to be affected by noise. A better value of k can be obtained with various heuristic techniques.
Finding the nearest neighbors of a sample by brute force requires calculating the distances between all pairs of samples. To find nearest neighbors efficiently, a clustering algorithm may first be applied to the training sample set: if the centers of two clusters are relatively far apart, samples in those clusters are generally unlikely to be nearest neighbors of each other. The neighbors of a sample can then be found by calculating distances only to samples in adjacent clusters.
As can be seen from the foregoing, the step of "classifying the historical abnormal data in the ith set according to the target threshold by adopting a k-nearest neighbor algorithm prediction model, and obtaining the ith number of the historical abnormal data greater than or equal to the target threshold in the ith set" works as follows. If the target threshold is X and the data in the ith set is {Y1, ..., Y100}, classifying the historical abnormal data in the ith set based on the target threshold X with the k-nearest neighbor algorithm prediction model means:
among Y1 to Y100, the k samples closest to the target threshold X are found, where k is the ith number, i.e., the number of historical abnormal data in the ith set that are greater than or equal to the target threshold X.
Here, if a set contains a large amount of historical abnormal data, obtaining the data greater than or equal to the target threshold by traversing the set and comparing each item with the target threshold takes a long time and prolongs fault handling. In the embodiment of the invention, the k-nearest neighbor algorithm prediction model is used to obtain the data in the set that is greater than or equal to the target threshold, which can shorten this step and thus the overall fault handling time.
Optionally, the method further comprises:
and executing an attack tracing strategy to acquire the address information of the attacked device after determining to acquire the address information of the attacked device causing the first device to fail.
The process of executing the attack traceability policy here may include the following:
First, the dependency information related to the attack tracing strategy, namely the trace and position associated with the tracing strategy, is queried. The obtained trace and position information are then stored in variables e and d, respectively, and the attack tracing strategy is executed according to its process attribute together with e and d. Finally, the address information (e.g., the IP address) of the attacked device is obtained.
Illustratively, Neo4j may be used as the graph database to store the network security knowledge base (MDATA), the declarative query language Cypher may be used to query the data, and the attack tracing algorithm may be implemented in Python. As shown in FIG. 3, the scenario simulates an attacker who uses a remotely controlled computer (192.168.134.128) as the attack host, discovers that the server (10.2.1.35) has a Structured Query Language (SQL) injection vulnerability, and exploits the vulnerability to plant a reverse shell on the server, so that the server actively communicates with the control host (192.168.134.130) through the reverse-shell port.
After the attack event, the tracing personnel want to find out who attacked the server and who is controlling it. Assuming that Apache server software is installed on the server, the tracing personnel first query the MDATA network security knowledge base for the tracing strategies related to the Apache server, and then use the attack tracing algorithm to obtain the IP addresses of the attack host and the control host directly.
FIG. 2 shows a flow chart of another fault handling method according to an embodiment of the present invention. The method may be applied to a monitored device and, as shown in FIG. 2, may include the following steps 201 to 202:
Step 201: sending a fault message to the management device.
The fault message is used for indicating that the value of a first monitoring index when the monitored equipment transmits data is larger than or equal to a first monitoring threshold value corresponding to the first monitoring index.
Here, the monitored device may include at least one of a server and other non-server devices (e.g., terminal devices).
In addition, the first monitoring index may be a CPU usage rate or a memory occupancy rate. For example, when the monitored device monitors that the CPU usage rate is greater than or equal to a monitoring threshold corresponding to the CPU usage rate, a fault message can be sent to the management device; or when the monitored equipment monitors that the memory occupancy rate is greater than or equal to the monitoring threshold value corresponding to the memory occupancy rate, a fault message can be sent to the management equipment.
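The comparison that triggers the fault message can be sketched as follows. This is a minimal illustration, assuming a simple in-memory message object; the `FaultMessage` fields, the index names, and the numeric values are hypothetical and only serve to show the "greater than or equal to" condition.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FaultMessage:
    device_ip: str    # hypothetical field: address of the monitored device
    index_name: str   # e.g. "cpu_usage" or "memory_occupancy"
    value: float      # value of the monitoring index when the fault was seen
    threshold: float  # monitoring threshold configured for that index


def check_index(device_ip, index_name, value, threshold) -> Optional[FaultMessage]:
    """Return a fault message when value >= threshold, otherwise None."""
    if value >= threshold:
        return FaultMessage(device_ip, index_name, value, threshold)
    return None


msg = check_index("10.2.1.35", "cpu_usage", 93.0, 90.0)
no_msg = check_index("10.2.1.35", "memory_occupancy", 40.0, 80.0)
```

Here `msg` carries the CPU usage fault, while `no_msg` is `None` because the memory occupancy is below its threshold.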
In addition, after the management device receives the fault message, it may obtain at least one piece of configuration information associated with the data transmitted by the monitored device that sent the fault message, obtain the historical abnormal data of the first monitoring index associated with each piece of configuration information, and then determine, according to the historical abnormal data, whether to obtain the address information of the attacked device that caused the fault of that monitored device.
It can be seen that, in the embodiment of the present invention, after receiving the fault message sent by the first device, the management device does not immediately obtain the address information of the attacked device that caused the fault of the first device, that is, it does not immediately perform fault tracing. Instead, it obtains at least one piece of configuration information associated with the data transmitted by the first device, obtains the historical abnormal data of the first monitoring index along the different dimensions represented by the different pieces of configuration information, and determines whether to perform fault tracing according to that historical abnormal data. Fault tracing is therefore performed at a more reasonable time rather than as soon as a fault occurs, which reduces the probability of unnecessary tracing and saves network resources.
Optionally, before "send fault message to management device" in step 201, the method further includes:
receiving a data structure file sent by the management device, wherein the data structure file is a data structure file accepted by the network configuration (NETCONF) protocol and is obtained by converting a monitoring period, a monitoring index, and a monitoring threshold corresponding to the monitoring index;
analyzing the data structure file to obtain the monitoring period, the monitoring index and the monitoring threshold;
collecting the value of the monitoring index according to the monitoring period, and comparing the value of the monitoring index with the monitoring threshold corresponding to the monitoring index;
wherein the first monitoring index is one of the monitoring indexes in the data structure file.
The monitoring index may include at least one of CPU usage, memory occupancy, port information, device IP, and board information, so that the monitoring index comprehensively and accurately reflects the running state of the monitored device. The board information corresponds to the CPU usage, so that the board on which a problem has occurred can be accurately identified.
Therefore, in the embodiment of the present invention, the monitored devices in the network can be managed using the NETCONF protocol, through which the monitoring index, the monitoring period, and the monitoring threshold are configured on each monitored device; that is, the monitoring event of the monitored device is subscribed to.
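The conversion on the management device and the parsing on the monitored device can be sketched as a round trip. NETCONF carries XML payloads, but the element names used here (`monitor-subscription`, `period`, `index`, `threshold`) are hypothetical; the embodiment does not specify the schema of the data structure file, so this is only an assumed layout.

```python
import xml.etree.ElementTree as ET


def build_structure_file(period_seconds, thresholds):
    """Serialize a monitoring period and per-index thresholds as XML text."""
    root = ET.Element("monitor-subscription")  # hypothetical element names
    ET.SubElement(root, "period").text = str(period_seconds)
    for name, threshold in thresholds.items():
        index = ET.SubElement(root, "index", {"name": name})
        ET.SubElement(index, "threshold").text = str(threshold)
    return ET.tostring(root, encoding="unicode")


def parse_structure_file(xml_text):
    """Recover the monitoring period and thresholds on the monitored device."""
    root = ET.fromstring(xml_text)
    period = int(root.findtext("period"))
    thresholds = {
        index.get("name"): float(index.findtext("threshold"))
        for index in root.findall("index")
    }
    return period, thresholds


xml_text = build_structure_file(1, {"cpu_usage": 90.0, "memory_occupancy": 80.0})
period, thresholds = parse_structure_file(xml_text)
```

After parsing, the monitored device would sample each index every `period` seconds and compare the sample with the matching entry in `thresholds`.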
Optionally, after "send failure message to management device" in step 201, the method further includes:
and receiving the data structure file retransmitted by the management equipment.
After a monitoring event has been subscribed to through the NETCONF protocol, the subscription becomes invalid once the monitored device detects that the value of a monitoring index exceeds the monitoring threshold corresponding to that index. Therefore, so that the monitored device can again send a fault message to the management device when a monitoring index exceeds its monitoring threshold, the management device re-subscribes to the monitoring event through the NETCONF protocol after receiving the fault message.
It will be appreciated that, after receiving the above-mentioned fault message, the management device may resend the data structure file only to the first device, or may resend the data structure file to all monitored devices managed by the management device.
It can be seen that, in the embodiment of the present invention, the RPC message of the NETCONF protocol can be executed repeatedly by the program to re-subscribe to the alarm, which prevents alarm subscription failures caused by network jitter.
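The re-subscription behaviour on the management device can be sketched as a small counter. This is a minimal illustration: `send_file` is a hypothetical callable that stands in for the NETCONF RPC which resends the data structure file, and the preset number of 3 is an assumed value.

```python
class SubscriptionManager:
    """Tracks how often the data structure file was re-sent to each device."""

    def __init__(self, send_file, preset_times=3):
        self.send_file = send_file        # hypothetical transport: callable(device)
        self.preset_times = preset_times  # assumed preset number of re-sends
        self.resend_count = {}

    def on_fault_message(self, device):
        """Re-subscribe; return True once the preset number of re-sends is reached."""
        self.send_file(device)
        count = self.resend_count.get(device, 0) + 1
        self.resend_count[device] = count
        return count >= self.preset_times


sent = []
manager = SubscriptionManager(sent.append, preset_times=3)
results = [manager.on_fault_message("10.2.1.35") for _ in range(3)]
```

A `True` return value marks the point at which the management device would go on to collect the configuration information, provided the monitoring index is still at or above its threshold.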
Optionally, after "send failure message to management device" in step 201, the method further includes:
storing fault information into a historical database under the condition that the value of the first monitoring index is smaller than the first monitoring threshold value when the monitored equipment transmits data;
wherein the fault information comprises at least one of a value, occurrence time, recovery time and fault content of the first monitoring index. Here, the fault content indicates the network effect (e.g., network lag) produced when the value of the first monitoring index is greater than or equal to the first monitoring threshold.
Therefore, after the monitored equipment stores the fault information into the history database, if the related information of the fault needs to be queried, the related information in the history database can be called.
In addition, after the management device receives the fault message sent by the monitored device with the fault, at least one configuration information associated with the transmission data of the monitored device with the fault can be obtained, so that historical abnormal data of the first monitoring index associated with each configuration information is respectively obtained. Here, the management apparatus may extract, from the above-described history database, history abnormality data of the first monitoring index associated with each of the configuration information.
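Storing fault information on recovery and reading it back later can be sketched with an in-memory SQLite database. The table name, column names, timestamps, and sample values are hypothetical; the embodiment only states which items the fault information may contain.

```python
import sqlite3


def open_history_db():
    """Create an in-memory history database with one fault table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE fault_history ("
        "index_name TEXT, value REAL, occurred_at TEXT, "
        "recovered_at TEXT, content TEXT)"
    )
    return conn


def store_fault(conn, index_name, value, occurred_at, recovered_at, content):
    """Called once the index has dropped back below its threshold."""
    conn.execute(
        "INSERT INTO fault_history VALUES (?, ?, ?, ?, ?)",
        (index_name, value, occurred_at, recovered_at, content),
    )
    conn.commit()


def history_for(conn, index_name):
    """Return the recorded values of one monitoring index, oldest first."""
    rows = conn.execute(
        "SELECT value FROM fault_history WHERE index_name = ? ORDER BY occurred_at",
        (index_name,),
    )
    return [row[0] for row in rows]


conn = open_history_db()
store_fault(conn, "cpu_usage", 95.0, "2022-01-01T10:00:00", "2022-01-01T10:05:00", "network lag")
store_fault(conn, "cpu_usage", 91.5, "2022-01-02T09:00:00", "2022-01-02T09:01:00", "network lag")
values = history_for(conn, "cpu_usage")
```

A real history database would also need columns for the configuration information (for example the five-tuple fields) so that the management device can select records per configuration item.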
In summary, the specific implementation manner of the fault handling method according to the embodiment of the present invention may be as follows:
The management device converts the monitoring index, the monitoring period, and the monitoring threshold corresponding to the monitoring index into a data structure file accepted by the NETCONF protocol, and then sends the data structure file to the plurality of servers it manages, thereby subscribing to the monitoring event in a single operation. (Subscribing the monitoring event to a server may be done, for example, by executing a program command that sets a monitoring threshold in the ColumnCondition field and sets the interval parameter to 1 second, which completes the second-level inspection setting of the server.)
The server analyzes the data structure file to obtain a monitoring index, a monitoring period and a monitoring threshold value, and collects the value of the monitoring index according to the monitoring period so as to judge whether the monitoring index exceeds the corresponding monitoring threshold value; if the first monitoring index exceeds a first monitoring threshold corresponding to the first monitoring index, the server sends a fault message to the management equipment so as to indicate that the value of the first monitoring index exceeds the first monitoring threshold;
After the fault on the failed server has been handled and the server has recovered, the server stores fault information in the history database, where the fault information includes the value of the monitoring index that exceeded the monitoring threshold, the fault content, the occurrence time, and the recovery time.
In addition, after receiving the fault message, the management device re-subscribes to the monitoring event (i.e. re-sends the data structure file to the server);
when the management equipment detects that the number of times of re-subscribing the monitoring event reaches the preset number of times, acquiring configuration information such as quintuple information (source address, destination address, source port, destination port and adopted protocol) associated with transmission data of the monitored equipment with faults, thereby extracting historical abnormal data of a first monitoring index associated with each piece of quintuple information from a historical database, and respectively storing the historical abnormal data into five sets;
then, the management device uses a pre-established k-nearest neighbor algorithm prediction model to divide the data in each set into abnormal data and normal data according to the current threshold, wherein a set is an abnormal set when the proportion of abnormal data in it is larger than a preset ratio;
and when the number of the abnormal sets is greater than the preset number, executing an attack tracing strategy, and acquiring address information of the attacked device which causes the fault of the monitored device.
The process of executing the attack tracing policy is described above, and will not be described herein.
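The decision made by the management device in the steps above can be sketched as follows. This is an illustration under several assumptions: a one-dimensional k-nearest neighbor vote replaces whatever prediction model the embodiment actually builds, the labelled training samples and the five data sets are made-up values, and the preset ratio (0.5) and preset number (3) are hypothetical. The strict comparisons follow the wording of this summary.

```python
def knn_is_abnormal(value, training, k=3):
    """Classify one value by majority vote of its k nearest labelled samples."""
    nearest = sorted(training, key=lambda sample: abs(sample[0] - value))[:k]
    votes = sum(1 for _, abnormal in nearest if abnormal)
    return votes * 2 > len(nearest)


def should_trace(sets, training, preset_ratio=0.5, preset_number=3, k=3):
    """Decide whether to run attack tracing from the per-configuration sets."""
    abnormal_sets = 0
    for values in sets.values():
        if not values:
            continue  # an empty set cannot be an abnormal set
        abnormal = sum(1 for v in values if knn_is_abnormal(v, training, k))
        if abnormal / len(values) > preset_ratio:
            abnormal_sets += 1
    return abnormal_sets > preset_number


# Hypothetical labelled samples: (index value, is_abnormal).
training = [(20.0, False), (35.0, False), (50.0, False),
            (91.0, True), (95.0, True), (99.0, True)]
# Hypothetical history of the first monitoring index, one set per five-tuple field.
sets = {
    "source_address": [92.0, 97.0, 30.0],
    "destination_address": [96.0, 94.0, 93.0],
    "source_port": [25.0, 40.0, 98.0],
    "destination_port": [90.0, 99.0, 95.0],
    "protocol": [93.0, 96.0, 45.0],
}
decision = should_trace(sets, training, preset_ratio=0.5, preset_number=3)
```

With these sample values, four of the five sets are abnormal sets, so the sketch decides to execute the attack tracing strategy.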
In summary, to address the poor timeliness and low tracing efficiency encountered in practice when a fault is traced between a locally monitored device and its associated remote devices and the faulty local or remote device has to be located, the embodiment of the present invention adopts the following scheme:
1. The NETCONF management protocol is used to manage the local and remote servers (CPEs) in the network. A set of instructions for NETCONF RPC messages is written into a program on each CPE that supports the NETCONF protocol, so that the program receives and responds to messages, subscribes to monitoring events, and sets second-level monitoring.
2. When the management device has received the fault message, the number of repeated NETCONF subscriptions exceeds the preset number, and the fault has still not recovered, the trace and position left by the attack are sought by analyzing the network traffic five-tuple (namely the source address, destination address, source port, destination port, and protocol used), and the proportion of abnormal data in each data set is judged comprehensively by the constructed k-nearest neighbor algorithm prediction model in combination with the current monitoring threshold, thereby determining whether to perform fault tracing. When fault tracing needs to be performed, the MDATA network security knowledge base is searched, the tracing strategy corresponding to the attack threat is executed, and the attacked device is finally located.
Therefore, the fault handling method of the embodiment of the present invention introduces the more efficient NETCONF protocol, combined with second-level monitoring and an alarm re-subscription mechanism that guards against network jitter, so that fault tracing can be performed more intelligently and efficiently and the operation and maintenance assurance capability for managed network devices is improved.
Having described the fault handling method provided by the embodiment of the present invention, the fault handling apparatus provided by the embodiment of the present invention will be described with reference to the accompanying drawings.
Referring to fig. 4, the embodiment of the invention further provides a fault handling device, which is applied to the management device, and the fault handling device comprises the following modules:
a first receiving module 401, configured to receive a fault message sent when a first device fails, where the first device is at least one of a plurality of monitored devices managed by the management device, and the fault message is used to indicate that a value of a first monitoring indicator when the first device transmits data is greater than or equal to a first monitoring threshold corresponding to the first monitoring indicator;
a first obtaining module 402, configured to obtain at least one configuration information associated with the transmission data of the first device;
a second obtaining module 403, configured to obtain historical abnormal data of the first monitoring indicator associated with each configuration information respectively;
A determining module 404, configured to determine whether to obtain address information of an attacked device that causes the first device to malfunction according to the historical abnormal data.
Optionally, the apparatus further includes:
the conversion module is used for converting the monitoring period, the monitoring index and the monitoring threshold value corresponding to the monitoring index into a data structure file accepted by the network configuration NETCONF protocol;
the second sending module is used for sending the data structure file to the first equipment;
the first monitoring index is one of the monitoring indexes in the data structure file.
Optionally, the apparatus further includes:
and the third sending module is used for resending the data structure file to the first equipment.
Optionally, the first obtaining module 402 is specifically configured to:
and when the number of times of sending the data structure file to the first device reaches a preset number of times, acquiring at least one configuration information associated with the data transmission of the first device under the condition that the value of the first monitoring index when the first device transmits the data is still greater than or equal to the first monitoring threshold.
Optionally, the determining module 404 is specifically configured to:
Storing the historical anomaly data associated with the ith configuration information into an ith set, i being an integer from 1 to N, N representing the number of the at least one configuration information;
classifying the historical abnormal data in the ith set according to a target threshold value by adopting a k-nearest neighbor algorithm prediction model, and obtaining the ith number of the historical abnormal data which is larger than or equal to the target threshold value in the ith set;
determining the ith set as an abnormal set under the condition that the ratio of the ith number to the number of the historical abnormal data in the ith set is larger than or equal to a preset ratio;
determining to acquire address information of the attacked device causing the first device failure under the condition that the number of the abnormal sets is greater than or equal to a preset number;
and under the condition that the number of the abnormal sets is smaller than the preset number, determining to skip the step of acquiring the address information of the attacked device causing the first device to malfunction.
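The rule applied by the determining module can be written compactly as follows. The symbols are introduced here only for illustration and do not appear in the embodiment: S_i is the i-th set, T the target threshold, p the preset ratio, and M the preset number.

```latex
\begin{aligned}
c_i &= \bigl|\{\, x \in S_i : x \ge T \,\}\bigr|, \qquad i = 1, \dots, N, \\
a   &= \bigl|\{\, i \in \{1, \dots, N\} : c_i / |S_i| \ge p \,\}\bigr|, \\
\text{acquire the address information} &\iff a \ge M .
\end{aligned}
```

Here c_i is the i-th number, a is the number of abnormal sets, and the acquisition step is skipped when a < M.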
Optionally, the apparatus further includes:
and the third acquisition module is used for executing an attack tracing strategy to acquire the address information of the attacked device when the address information of the attacked device causing the first device to fail is determined to be acquired.
Optionally, the at least one configuration information includes: at least one of a source address, a destination address, a source port, a destination port, and a protocol employed in transmitting data.
Referring to fig. 5, the embodiment of the present invention further provides a fault handling apparatus, which is applied to a monitored device, and the fault handling apparatus may include the following modules:
the first sending module 501 is configured to send a fault message to a management device, where the fault message is used to indicate that a value of a first monitoring indicator when the monitored device transmits data is greater than or equal to a first monitoring threshold corresponding to the first monitoring indicator.
Optionally, the apparatus further includes:
the second receiving module is used for receiving the data structure file sent by the management device, wherein the data structure file is a data structure file accepted by the network configuration (NETCONF) protocol and is obtained by converting a monitoring period, a monitoring index, and a monitoring threshold corresponding to the monitoring index;
the analysis module is used for analyzing the data structure file to obtain the monitoring period, the monitoring index and the monitoring threshold;
the acquisition module is used for acquiring the value of the monitoring index according to the monitoring period and comparing the value of the monitoring index with the monitoring threshold corresponding to the monitoring index;
The first monitoring index is one of the monitoring indexes in the data structure file.
Optionally, the apparatus further includes:
and the third receiving module is used for receiving the data structure file retransmitted by the management equipment.
Optionally, the apparatus further includes:
the storage module is used for storing fault information into a historical database under the condition that the value of the first monitoring index is smaller than the first monitoring threshold value when the monitored equipment transmits data;
wherein the fault information comprises at least one of a value, occurrence time, recovery time and fault content of the first monitoring index.
As can be seen from the foregoing, in the embodiment of the present invention, the management device may receive a fault message sent when the first device (i.e., at least one of the plurality of monitored devices managed by the management device) fails, where the fault message is used to indicate that the value of a first monitoring index when the first device transmits data is greater than or equal to a first monitoring threshold corresponding to the first monitoring index. The management device then obtains at least one piece of configuration information associated with the data transmitted by the first device, obtains the historical abnormal data of the first monitoring index associated with each piece of configuration information, and determines, according to the historical abnormal data, whether to obtain the address information of the attacked device that caused the first device to fail.
It can be seen that, in the embodiment of the present invention, after receiving the fault message sent by the first device, the management device does not immediately obtain the address information of the attacked device that caused the fault of the first device, that is, it does not immediately perform fault tracing. Instead, it obtains at least one piece of configuration information associated with the data transmitted by the first device, obtains the historical abnormal data of the first monitoring index along the different dimensions represented by the different pieces of configuration information, and determines whether to perform fault tracing according to that historical abnormal data. Fault tracing is therefore performed at a more reasonable time rather than as soon as a fault occurs, which reduces the probability of unnecessary tracing and saves network resources.
It should be noted that, in the embodiment of the present application, the division of the units is schematic, which is merely a logic function division, and other division manners may be implemented in actual practice. In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a processor-readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution, in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
It should be noted that, the above device provided in the embodiment of the present invention can implement all the method steps implemented in the method embodiment and achieve the same technical effects, and detailed descriptions of the same parts and beneficial effects as those in the method embodiment in this embodiment are omitted.
An embodiment of the present invention also provides an electronic device, as shown in fig. 6, including a memory 620, a transceiver 610, and a processor 600;
a memory 620 for storing a computer program;
a transceiver 610 for receiving and transmitting data under the control of the processor 600;
the processor 600 is configured to read the computer program in the memory 620 and execute the fault handling method described in the foregoing first aspect or second aspect.
Wherein in fig. 6, a bus architecture may comprise any number of interconnected buses and bridges, and in particular one or more processors represented by processor 600 and various circuits of memory represented by memory 620, linked together. The bus architecture may also link together various other circuits such as peripheral devices, voltage regulators, power management circuits, etc., which are well known in the art and, therefore, will not be described further herein. The bus interface provides an interface. The transceiver 610 may be a number of elements, including a transmitter and a receiver, providing a means for communicating with various other apparatus over transmission media, including wireless channels, wired channels, optical cables, and the like. The processor 600 is responsible for managing the bus architecture and general processing, and the memory 620 may store data used by the processor 600 in performing operations.
The processor 600 may be a Central Processing Unit (CPU), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a Field programmable gate array (Field-Programmable Gate Array, FPGA) or a complex programmable logic device (Complex Programmable Logic Device, CPLD), and the processor 600 may also employ a multi-core architecture.
It should be noted that, the above device provided in the embodiment of the present invention can implement all the method steps implemented in the method embodiment and achieve the same technical effects, and detailed descriptions of the same parts and beneficial effects as those in the method embodiment in this embodiment are omitted.
Embodiments of the present invention also provide a processor-readable storage medium storing a computer program for causing the processor to execute the fault handling method of the first aspect or the second aspect.
The processor-readable storage medium may be any available medium or data storage device that can be accessed by a processor, including, but not limited to, magnetic storage (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical storage (e.g., CD, DVD, BD, HVD, etc.), semiconductor storage (e.g., ROM, EPROM, EEPROM, nonvolatile storage (NAND FLASH), solid state disk (SSD)), and the like.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, magnetic disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-executable instructions. These computer-executable instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These processor-executable instructions may also be stored in a processor-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the processor-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These processor-executable instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made in the present application without departing from the spirit or scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (15)

1. A fault handling method for use with a management device, the method comprising:
receiving a fault message sent when a first device fails, wherein the first device is at least one of a plurality of monitored devices managed by the management device, and the fault message is used for indicating that a value of a first monitoring index when the first device transmits data is greater than or equal to a first monitoring threshold corresponding to the first monitoring index;
acquiring at least one configuration information associated with the transmission data of the first equipment;
respectively acquiring historical abnormal data of the first monitoring index associated with each piece of configuration information;
and determining whether to acquire address information of the attacked device causing the first device failure according to the historical abnormal data.
2. The method of claim 1, wherein prior to receiving the failure message sent when the first device fails, the method further comprises:
converting a monitoring period, a monitoring index and a monitoring threshold corresponding to the monitoring index into a data structure file accepted by a network configuration NETCONF protocol;
transmitting the data structure file to the first device;
The first monitoring index is one of the monitoring indexes in the data structure file.
3. The method of claim 2, wherein after receiving the failure message sent when the first device fails, the method further comprises:
and retransmitting the data structure file to the first device.
4. The method of claim 3, wherein the obtaining at least one configuration information associated with the first device transmission data comprises:
and when the number of times of sending the data structure file to the first device reaches a preset number of times, acquiring at least one configuration information associated with the data transmission of the first device under the condition that the value of the first monitoring index when the first device transmits the data is still greater than or equal to the first monitoring threshold.
5. The method of claim 1, wherein determining whether to obtain address information of an attacked device causing the first device failure based on the historical anomaly data comprises:
storing the historical anomaly data associated with the ith configuration information into an ith set, i being an integer from 1 to N, N representing the number of the at least one configuration information;
Classifying the historical abnormal data in the ith set according to a target threshold value by adopting a k-nearest neighbor algorithm prediction model, and obtaining the ith number of the historical abnormal data which is larger than or equal to the target threshold value in the ith set;
determining the ith set as an abnormal set under the condition that the ratio of the ith number to the number of the historical abnormal data in the ith set is larger than or equal to a preset ratio;
determining to acquire address information of the attacked device causing the first device failure under the condition that the number of the abnormal sets is greater than or equal to a preset number;
and under the condition that the number of the abnormal sets is smaller than the preset number, determining to skip the step of acquiring the address information of the attacked device causing the first device to malfunction.
6. The method according to claim 1, wherein the method further comprises:
and executing an attack tracing strategy to acquire the address information of the attacked device after determining to acquire the address information of the attacked device causing the first device to fail.
7. The method according to any one of claims 1 to 6, wherein the at least one configuration information comprises: at least one of a source address, a destination address, a source port, a destination port, and a protocol employed in transmitting data.
8. A fault handling method for application to a monitored device, the method comprising:
and sending a fault message to the management equipment, wherein the fault message is used for indicating that the value of a first monitoring index when the monitored equipment transmits data is greater than or equal to a first monitoring threshold value corresponding to the first monitoring index.
9. The method of claim 8, wherein prior to the sending the fault message to the management device, the method further comprises:
receiving a data structure file sent by the management device, wherein the data structure file is a data structure file accepted by the network configuration (NETCONF) protocol and is obtained by converting a monitoring period, a monitoring index, and a monitoring threshold corresponding to the monitoring index;
analyzing the data structure file to obtain the monitoring period, the monitoring index and the monitoring threshold;
according to the monitoring period, collecting the value of the monitoring index, and comparing the value of the monitoring index with the monitoring threshold corresponding to the monitoring index;
the first monitoring index is one of the monitoring indexes in the data structure file.
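The parse-and-compare steps of claim 9 might look like the following sketch on the monitored device. NETCONF carries XML-encoded data, so the sketch assumes an XML file; the element names (`monitoring`, `period`, `index`, `name`, `threshold`) and the function names are hypothetical, since the claims do not define a schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

# Hypothetical data structure file; the real schema is not given in the claims.
SAMPLE_FILE = """<monitoring>
  <period>60</period>
  <index><name>cpu_usage</name><threshold>90</threshold></index>
  <index><name>packet_rate</name><threshold>5000</threshold></index>
</monitoring>"""


def parse_structure_file(xml_text: str) -> Tuple[int, Dict[str, float]]:
    """Extract the monitoring period and a map of index name to threshold."""
    root = ET.fromstring(xml_text)
    period = int(root.findtext("period"))
    thresholds = {
        item.findtext("name"): float(item.findtext("threshold"))
        for item in root.findall("index")
    }
    return period, thresholds


def check_indexes(values: Dict[str, float],
                  thresholds: Dict[str, float]) -> List[dict]:
    """Build one fault message per index whose value reaches its threshold."""
    return [
        {"index": name, "value": value, "threshold": thresholds[name]}
        for name, value in values.items()
        if name in thresholds and value >= thresholds[name]
    ]
```

In use, the device would collect values once per `period` seconds and send each message returned by `check_indexes` to the management device; the collection loop and transport are left out of the sketch.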
10. The method of claim 9, wherein after the sending of the fault message to the management device, the method further comprises:
receiving the data structure file retransmitted by the management device.
11. The method of claim 8, wherein after the sending of the fault message to the management device, the method further comprises:
storing fault information into a historical database in a case where the value of the first monitoring index when the monitored device transmits data is smaller than the first monitoring threshold;
wherein the fault information comprises at least one of a value, occurrence time, recovery time and fault content of the first monitoring index.
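The recovery-time bookkeeping of claim 11 can be sketched with an in-memory SQLite database. The table name `fault_history`, the column names, and the function names are assumptions for illustration; the claims state only which items the fault information may contain.

```python
import sqlite3


def open_history_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hypothetical historical database."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fault_history ("
        "index_name TEXT, value REAL, occurred_at TEXT, "
        "recovered_at TEXT, content TEXT)"
    )
    return conn


def record_recovery(conn: sqlite3.Connection, index_name: str,
                    current_value: float, threshold: float,
                    fault: dict, recovered_at: str) -> bool:
    """Store the fault once the index has dropped below its threshold.

    `fault` holds the value, occurrence time and content captured when
    the fault was raised. Returns False while the fault is still active.
    """
    if current_value >= threshold:
        return False  # Still at or above the threshold: nothing to store yet.
    conn.execute(
        "INSERT INTO fault_history VALUES (?, ?, ?, ?, ?)",
        (index_name, fault["value"], fault["occurred_at"],
         recovered_at, fault["content"]),
    )
    conn.commit()
    return True
```

Storing the record only after the value falls below the threshold is what allows a recovery time to be written alongside the occurrence time.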
12. A fault handling apparatus for use with a management device, the apparatus comprising:
the first receiving module is used for receiving a fault message sent when a first device fails, wherein the first device is at least one of a plurality of monitored devices managed by the management device, and the fault message is used for indicating that the value of a first monitoring index when the first device transmits data is greater than or equal to a first monitoring threshold corresponding to the first monitoring index;
the first acquisition module is used for acquiring at least one configuration information associated with the transmission data of the first device;
the second acquisition module is used for respectively acquiring historical abnormal data of the first monitoring index associated with each piece of configuration information;
and the determining module is used for determining whether to acquire address information of the attacked device which causes the first device to fail according to the historical abnormal data.
13. A fault handling apparatus for application to a monitored device, the apparatus comprising:
the first sending module is used for sending a fault message to a management device, wherein the fault message is used for indicating that the value of a first monitoring index when the monitored device transmits data is greater than or equal to a first monitoring threshold corresponding to the first monitoring index.
14. An electronic device comprising a memory, a transceiver, and a processor:
a memory for storing a computer program;
a transceiver for transceiving data under control of the processor;
a processor for reading the computer program in the memory and performing the fault handling method of any of claims 1 to 7 or performing the fault handling method of any of claims 8 to 11.
15. A processor-readable storage medium, characterized in that the processor-readable storage medium stores a computer program for causing the processor to execute the fault handling method of any one of claims 1 to 7 or to execute the fault handling method of any one of claims 8 to 11.
CN202211450654.5A 2022-11-18 2022-11-18 Fault processing method and device and electronic equipment Pending CN115996408A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211450654.5A CN115996408A (en) 2022-11-18 2022-11-18 Fault processing method and device and electronic equipment

Publications (1)

Publication Number Publication Date
CN115996408A (en) 2023-04-21

Family

ID=85991341

Family Applications (1)

Application Number Status Publication Priority Date Filing Date Title
CN202211450654.5A Pending CN115996408A (en) 2022-11-18 2022-11-18 Fault processing method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN115996408A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination