CN100456687C - Network failure real-time relativity analysing method and system - Google Patents

Network failure real-time relativity analysing method and system

Info

Publication number
CN100456687C
Authority
CN
China
Prior art keywords
network
incident
failure
analysis
real
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB031347290A
Other languages
Chinese (zh)
Other versions
CN1529455A (en
Inventor
谭俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Service Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CNB031347290A
Publication of CN1529455A
Application granted
Publication of CN100456687C
Anticipated expiration
Expired - Fee Related

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)
  • Computer And Data Communications (AREA)

Abstract

The present invention provides a method and system for real-time correlation analysis of network faults, belonging to the field of computer network communication. Fault event information from the various network devices and business objects is written to an original event list. An analysis control engine reads events from the original event list for correlation analysis, selecting them by event level and type. The dynamic analysis algorithm combines integrated domain information, including historical fault analysis scenarios, dynamic network performance parameters, dynamic topology information, and the time characteristics of events. The invention thereby overcomes shortcomings of existing fault correlation analysis methods, such as ignoring dynamic network state information, relying too heavily on preset rules in the reasoning process, and lacking automatic learning capability. It performs effective correlation analysis on the set of original events caused by a fault, and addresses the lack of real-time fault cause analysis and fault location during a network fault storm.

Description

Method and system for real-time correlation analysis of network faults
Technical field
The invention belongs to the field of computer network communication, and specifically relates to a method and system, used in network management, for real-time correlation analysis of network fault events based on integrated domain information.
Background
In computer and communication networks, equipment, services, and business applications are tightly coupled, so the failure of one device or service triggers a series of network events. The network management system responsible for monitoring the network discovers a large number of abnormal events, either through notifications sent by devices (SNMP Trap, Syslog, or Indication) or through its own polling, and displays them on the network administrator's management interface. The result is a "network fault storm". Because such a storm produces a large number of events in a very short time, the root fault event is buried and the administrator has difficulty finding the true cause of the fault. To resolve the fault, the root cause must be extracted from the storm; that is, the correlations among the events must be analysed to find the root event. The industry has developed several typical methods for event correlation analysis: rule-based reasoning (Rule-Based Reasoning), model-based reasoning (Model-Based Reasoning), analysis based on state transition graphs (State Transition Graph), codebook-based analysis (CodeBook), and case-based reasoning (Case-Based Reasoning). Each of these methods solves the fault correlation problem to some extent and has its own advantages, but none fully solves the following problems:
(1) They cannot take dynamic network topology connection information into account;
(2) They process all input events without selection, so efficiency is hard to improve and resource consumption is high;
(3) The reasoning process relies too heavily on preset rules, feature tables, or models; it lacks automatic learning capability and cannot adapt to or handle new situations outside the knowledge base;
(4) They observe the event sequence within a fixed time range and cannot dynamically change the time range of the correlation analysis;
(5) The analysis does not consider conditional probability or time factors;
(6) Analysis based on static information cannot incorporate network operating parameters obtained in real time.
Summary of the invention
The invention provides a method and system for real-time correlation analysis of network fault events based on integrated domain information. It overcomes shortcomings of existing fault correlation analysis methods, such as ignoring dynamic network state information, relying too heavily on preset rules in the reasoning process, and lacking automatic learning capability, and it can effectively identify the key event at the source of a fault and locate it in the network.
Technical content of the invention: a method for real-time correlation analysis of network faults, comprising:
(1) An event extraction interface collects the fault events produced in the network and writes them to the original event list;
(2) An event is read from the original event list and matched against the historical fault scenario information, and the operating parameters of the network devices and services are detected in real time;
(3) If no event matches, the network objects related to the event currently being processed are selected on the basis of the information model and the topology dependencies and are detected in real time, and the detection results are applied as conditions in the reasoning process;
(4) The original event list is searched again for events related to the event currently being processed, or consistent with the real-time detection results, and these events are added to the work list;
(5) When the original event list contains no further events that can be added to the work list, a new fault scenario is constructed from the events in the work list and added to the historical fault scenario information, and the work list is cleared;
(6) The next event that satisfies the selection strategy is read from the original event list and processing returns to step (2); if the list contains no events, the process suspends and waits for event input.
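The control flow of steps (1) to (6) can be sketched as a single loop. This is a minimal sketch under stated assumptions, not the patented implementation: `Scenario`, `analyse`, and the `is_related` callback are hypothetical names, scenario matching is reduced to comparing root events, and real-time detection is folded into the `is_related` callback rather than modelled separately.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Scenario:
    # Hypothetical record: a root event plus the events it explains.
    root: str
    effects: Set[str] = field(default_factory=set)


def analyse(events: List[str],
            scenarios: List[Scenario],
            is_related: Callable[[str, str], bool]) -> List[Scenario]:
    """Group events from the original event list into fault scenarios."""
    pending = list(events)            # original event list (output of step 1)
    results = []
    while pending:                    # step 6: continue until the list is empty
        root = pending.pop(0)         # step 2: read one event
        known = next((s for s in scenarios if s.root == root), None)
        if known is not None:         # step 2: a historical scenario matches
            work = [e for e in pending if e in known.effects]
        else:                         # steps 3-4: search for related events
            work = [e for e in pending if is_related(root, e)]
        pending = [e for e in pending if e not in work]
        scene = Scenario(root, set(work))   # step 5: build a scenario
        if known is None:
            scenarios.append(scene)   # step 5: remember it for later matches
        results.append(scene)
    return results
```

A real engine would suspend on an empty list instead of returning; here the function simply ends when no events remain.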
The information model comprises:
(1) Object-oriented abstraction of the various managed objects in the managed network;
(2) A hierarchical information model formed from the inheritance relationships among the abstracted managed classes;
(3) Association classes that define the relationships among managed classes in the information model.
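One way to picture such a model is a small class hierarchy plus explicit association records. The sketch below is illustrative only: the class names follow the managed classes named later in the description, while `Association`, `MODEL`, `Router`, and `relations_of` are hypothetical names introduced here.

```python
from dataclasses import dataclass


class ManagedObject:
    """Root of the hypothetical managed-class hierarchy."""


class OpenServiceSystem(ManagedObject): ...
class Router(OpenServiceSystem): ...
class Software(ManagedObject): ...
class Hardware(ManagedObject): ...
class Application(Software): ...
class OperatingSystem(Software): ...


@dataclass(frozen=True)
class Association:
    """Association class: a named relation between two managed classes."""
    kind: str        # e.g. "depends_on", "contains", "binds"
    source: type
    target: type


MODEL = [
    Association("contains", OpenServiceSystem, Software),
    Association("contains", OpenServiceSystem, Hardware),
    Association("depends_on", Application, OperatingSystem),
]


def relations_of(cls: type, kind: str):
    """Return the targets related to cls or to any of its ancestors."""
    return [a.target for a in MODEL
            if a.kind == kind and issubclass(cls, a.source)]
```

Because the lookup uses `issubclass`, a relation declared on a parent class (here `OpenServiceSystem`) is inherited by its subclasses (here `Router`), which is the point of organizing the model by inheritance.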
The topology dependency comprises:
(1) Keeping the topology dependencies consistent with the actual network topology while the network is running;
(2) Setting the network node on which the fault correlation analysis program runs as the reference point;
(3) Calculating, from the reference point, the reachability dependencies for reaching every other node;
(4) Using topology change notifications from devices to trigger the topology synchronization program, which recalculates the topology dependencies from the latest topology.
The reasoning process comprises:
(1) Assigning a confidence probability to each reasoning step and deriving the probability of the final analysis result from the per-step probabilities;
(2) Defining time constraint functions when a fault scenario is created, to describe the time characteristics of events and the time relationships between associated events;
(3) Representing and matching alarm content with a formal method.
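The three elements above can each be illustrated by a small helper. The source does not say how per-step probabilities are combined; the sketch assumes independent steps and multiplies them. The function names and the shape of the time constraint (a maximum delay between cause and effect) are likewise assumptions made for illustration.

```python
import math
import re


def chain_confidence(step_probs):
    """Confidence of a conclusion reached through independent steps.

    Assumption: steps are independent, so probabilities multiply.
    """
    return math.prod(step_probs)


def within_window(t_cause, t_effect, max_delay):
    """Time constraint: the effect must follow the cause within max_delay."""
    return 0 <= t_effect - t_cause <= max_delay


def matches(pattern, alarm_text):
    """Match alarm content against a formal pattern (a regular expression)."""
    return re.fullmatch(pattern, alarm_text) is not None
```

For example, two reasoning steps with confidence 0.9 and 0.8 give a final confidence of about 0.72 under the independence assumption.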
The historical fault scenario information is organized as a fault scenario table that supports fast lookup.
The collection of original fault events further comprises:
(1) Dynamically changing the length of the original event queue according to predefined rules when handling different event types;
(2) Deciding which event serves as the starting point of the correlation analysis according to event level and user-defined rules;
(3) Preprocessing the original events: an extensible event acquisition interface is provided for fault events of different protocols, and the events are converted to a unified internal format and filtered.
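Item (2) can be illustrated with a simple selection rule. The source does not define event levels or the user rules, so the sketch below assumes that each event carries a numeric level where a lower number is more severe, and that the user rule is a set of levels allowed to start an analysis. `pick_start` and the tuple layout are hypothetical.

```python
from typing import List, Optional, Tuple

# (level, timestamp, text); a lower level number means more severe (assumption)
Event = Tuple[int, float, str]


def pick_start(events: List[Event],
               allowed_levels=frozenset({0, 1, 2})) -> Optional[Event]:
    """Pick the most severe, then oldest, event whose level passes the rule."""
    candidates = [e for e in events if e[0] in allowed_levels]
    # Tuples compare field by field: level first, then timestamp.
    return min(candidates, default=None)
```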
Constructing a new fault scenario comprises:
(1) Extracting the fault feature parameters;
(2) Extracting the fault propagation path;
(3) Constructing a new fault resolution scenario from the fault feature parameters and the propagation path.
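A minimal sketch of scenario construction follows, assuming events are strings of the form `pattern:time`, that the feature parameter of an event is its pattern without the timestamp, and that the first event in the work list is the root. `FaultScenario`, `feature`, and `build_scenario` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FaultScenario:
    root: str                  # feature of the root event, e.g. "S.down"
    features: frozenset        # feature parameters of the affected events
    path: Tuple[str, ...]      # propagation path, root first


def feature(event: str) -> str:
    """Drop the trailing ':timestamp' so only the event pattern remains."""
    return event.split(":", 1)[0]


def build_scenario(work_list: List[str]) -> FaultScenario:
    """Treat the first event as the root and the rest as its effects."""
    feats = [feature(e) for e in work_list]
    return FaultScenario(feats[0], frozenset(feats[1:]), tuple(feats))
```

Stripping the timestamp is what lets a stored scenario match later occurrences of the same fault.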
A system for real-time correlation analysis of network faults, comprising:
Analysis control engine: calls the other modules and interfaces according to the analysis control engine algorithm to complete the fault correlation analysis;
Event extraction interface: receives the various network events sent by network devices, converts them to a unified format, and writes them to the original event list for the analysis control engine to use;
Real-time network parameter detection interface: detects real-time information such as the attributes, performance, and reachability of the devices and services in the network. It is called by the analysis control engine, accepts parameters from the engine that determine which network devices to detect, and returns the results to the engine;
Information model: describes a series of managed classes corresponding to network protocol objects and device objects, together with the interdependencies among them;
Information model query interface: queries the information model for managed classes, managed class attributes, and the relationships among managed classes, and supplies the analysis control engine with information from the model at run time;
Topology synchronization module: triggered by network topology change events, runs the topology dependency generation algorithm, generates topology dependencies that correctly reflect the current network connectivity, and stores them in the topology dependency library, which supplies the analysis control engine with the relevant information;
Fault scenario table generation module: builds a fault scenario from a group of events found to be correlated and stores it in the fault scenario table, against which subsequent events are matched.
The information model is stored as a hash table file, and the analysis control engine retrieves information from it through the model query interface during analysis.
The system further comprises a preprocessing module, which preprocesses received original events according to predetermined preprocessing rules.
Technical effect of the invention: the method makes full use of dynamic and static information, real-time information, and historical information in the network. When the network fails, it effectively identifies the key event at the source of the fault among the complex fault symptoms and the resulting event storm, and locates it in the network. Because the analysis uses topology dependencies synchronized with the actual network topology, together with network operating parameters obtained in real time, fault location is more accurate. Preprocessing the original input events (protocol format conversion, filtering, and selection) avoids starting a correlation analysis from every input event and improves processing efficiency. The historical fault scenario table gives the method the ability to learn from experience: events are matched quickly against the scenario table, and some are resolved directly from it, so not every event needs the full correlation analysis. Finally, because the analysis algorithm applies probabilistic logic, time constraint functions, and fuzzy matching with regular expressions, it handles complex relationships between events flexibly, which improves the adaptability of the correlation analysis.
Description of drawings
Fig. 1 is a structural diagram of the network fault real-time correlation analysis system of the invention;
Fig. 2 is a flow chart of the network fault real-time correlation analysis method of the invention;
Fig. 3 is a flow chart of the topology dependency generation algorithm of the method;
Fig. 4 is a network diagram of a specific embodiment of the method;
Fig. 5 is a schematic diagram of the information model in a specific embodiment of the method.
Embodiment
With reference to Fig. 1, the invention uses the analysis control engine as the control module. Real-time correlation analysis of network faults is carried out through its interaction with the information model query interface, the event extraction interface and preprocessing module, the real-time network parameter detection interface, the fault scenario table generation module, and the topology synchronization module. The specific steps are:
1. The event extraction interface extracts fault event information from the various network devices and business objects over different protocols (SNMP, Syslog, etc.) and converts it to a unified internal format. The event preprocessing module then compresses and filters these events (according to preset filters) and writes them to the original event list. Preprocessing the original events effectively improves processing efficiency;
2. The analysis control engine selects an event from the original event list according to event level and type and performs correlation analysis on it. The analysis combines the fault scenario table, information model information, real-time detection information, and topology information. During the analysis the engine may continue to read events from the original event list as needed to construct the event propagation path, until no further matching event can be found;
(1) The historical fault scenario information is organized as a fault scenario table that supports fast lookup, so events can be matched quickly against the table;
(2) Constructing an object-oriented hierarchical network information model: managed objects such as hardware, links, software, and network services are abstracted in an object-oriented way, and the resulting managed classes are organized by their inheritance relationships into a hierarchical information model. Association classes in the model define relationships among managed classes such as containment, dependency, and binding. The model is stored as a hash table (Hash) file and is accessed through the model object management interface; the hierarchy and interdependencies of the managed classes defined in the model can be used for inference. The information model describes a series of managed classes corresponding to network protocol objects and device objects, and the various relationships among them. The managed classes defined in the information model fall into three groups: the topology submodel, the open service submodel, and the network service submodel.
The open service system submodel is used below as an example to introduce managed class definitions. This submodel mainly describes each node device in the data communication network and its internal modules. It abstracts every network node that provides data transmission or data processing services as an open service system, in which software and hardware are combined, in an extensible and tailorable way, into different systems. Its managed classes are:
A. Open service system (Open Service System): represents every system that provides data services at any layer of the data communication network, including routers, switches, and servers;
B. Software (software): functional modules of the open service system implemented in software;
C. Hardware (hardware): functional modules of the open service system implemented in hardware and firmware;
D. Application (application): application programs, such as mail clients;
E. Operating system (os): real-time and time-sharing operating systems, such as VxWorks, Windows, Unix, and Linux;
F. Resource (resource): basic shared objects in the system, such as memory, disk, CPU, and interrupts;
G. Device (device): the modules that make up the hardware;
H. Service (service):
I. Protocol stack (protocol stack):
J. Kernel (kernel):
K. Driver (driver):
L. Memory (memory):
M. Hard disk (harddisk):
N. Central processing unit (cpu):
O. Bus (bus):
P. Adapter (adapter):
Q. Network adapter (network adapter):
U. Controller (controller):
In this information model there are various dependencies among managed classes, such as protocol dependencies and open service dependencies.
(3) Real-time detection: the reasoning process is combined with real-time detection of the operating parameters of network devices and services.
(4) Real-time calculation of topology dependencies from a specified reference point: the network node on which the fault correlation analysis program runs is set as the reference point, the reachability dependencies for reaching every other node are calculated from it, and these are kept synchronized with the network topology while the network runs. A topology dependency describes the physical connection between nodes and is the basis of protocol interconnection and service availability. The reference point is the node taken as the starting point when the reachability of a node in the topology graph is considered; in a real managed network it is usually the node hosting the network management platform or a network probe (software or hardware). With reference to Fig. 3, establishing the dependencies is a recursive algorithm. After every topology change the algorithm is triggered automatically and updates the dependencies, which keeps current fault location and correlation accurate and yields the set of possibly related network instance objects to be detected in the next step.
(5) The core logic of the correlation analysis is carried out inside the analysis control engine. With reference to Fig. 2:
a. Read an event Ei (i = 1 to n) from the list and match it against the scenario table to see whether any historical fault scenario is related to it (that is, whether the characteristic events of a scenario match this event). Each matching scenario is handled according to step (b);
b. Call the real-time detection module to check the real-time status of the related instances of the object classes involved in the scenario (also considering the nodes on which the node that produced the event topologically depends), and see whether the returned results fall within the feature range described by the scenario. Then search the original event list for subsequent events produced by the related instances and see whether they fit the features defined by the scenario. If these checks pass, mark the related events and call the output module to format and output the analysis result;
c. If the checks in (b) fail, call the model query interface to look up, in the network information model, the managed classes related to the object that produced the event. Also considering the nodes on which the producing node topologically depends, obtain the set of possibly related network instance objects to be detected next;
d. Call the real-time detection module to check whether the current state of these objects falls within the feature range described by the relationships defined in the model, then check whether the original event list contains related events sent by these objects. If it does, add these events to the work event list and go to step (e). If the detection fails, check whether the work event list is empty. If it is empty, go to step (e). If it is not empty, call the fault scenario construction module to construct a new fault scenario for these events and add it to the fault scenario table, clear the work event list, mark and remove these events, format and output the analysis result, and go to step (e);
e. Read the next event that satisfies the selection strategy from the original event list and go to step (a); if the list contains no events, suspend and wait for event input.
The reasoning involved in the matching and real-time status detection in the steps above includes: probability-based rule reasoning, in which each reasoning step is assigned a confidence probability and the probability of the final analysis result is derived from the per-step probabilities; handling of time constraints, in which time constraint functions defined when a fault scenario is created describe the time characteristics of events and the time relationships between associated events; and fuzzy matching of alarm content with regular expressions.
3. After the correlation analysis is complete (all events in the current event list have been scanned), a fault scenario is constructed from the events correlated in this pass and added to the fault scenario table; these events are then removed from the original event list and the analysis result is constructed and output;
4. While the analysis control engine does this work, the event acquisition module (comprising the event acquisition interface and the event preprocessing module) concurrently writes newly received events to the original event list, and the topology synchronization module monitors changes in the network topology and refreshes the topology dependency library as they occur. If the original event list contains no events, the analysis control engine suspends and waits for new events to be written. When the event preprocessing module writes a new event to the original event list and finds the engine suspended, it wakes the engine.
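The suspend-and-wake behaviour in step 4 is a standard producer and consumer arrangement. The sketch below shows one way to realise it with a condition variable; the class name `OriginalEventList` and its `write` and `read` methods are hypothetical, and the source does not state which synchronization mechanism is used.

```python
import threading
from collections import deque


class OriginalEventList:
    """Hypothetical shared queue between preprocessing and the engine."""

    def __init__(self):
        self._events = deque()
        self._cond = threading.Condition()

    def write(self, event):
        """Called by the preprocessing module for each new event."""
        with self._cond:
            self._events.append(event)
            self._cond.notify()      # wake the engine if it is suspended

    def read(self, timeout=None):
        """Called by the engine; suspends while the list is empty."""
        with self._cond:
            # wait_for returns the predicate's last value; an empty deque
            # is falsy, so a timeout yields None.
            if not self._cond.wait_for(lambda: self._events, timeout):
                return None
            return self._events.popleft()
```

With `timeout=None` the engine blocks indefinitely, which corresponds to the "suspend and wait" state described above.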
A local area network is used as a concrete example. With reference to Fig. 4, A, C, and D are hosts in the LAN running the Linux operating system, S is a layer-3 switch, and R is a router that connects the LAN to the Web server and also serves as the LAN gateway. A and C are connected directly to S, and D is connected directly to R. RP is a PC running Windows and is the reference point for the correlation analysis; the correlation analysis system runs on this host.
First, with reference to Fig. 5, this embodiment adopts a simplified information model. In this network, hosts A, C, D, and RP, router R, and switch S can each be regarded as an open service system. Each open service system contains a protocol stack, which handles communication between an application and its peer entities in other open service systems on the network. Data flows downward through the application, operating system, protocol, and interface into the physical network, reaches another open service system through layer-2 forwarding and layer-3 routing, and then flows upward through the interface, protocol, and operating system to the application at the other end.
1) Information model instantiation
In a real network environment, the model above produces instances corresponding to its entities. For example, the application on router R is named Application_R and the operating system on R is named OS_R.
By analogy we obtain the other instances: Protocols_R, Interface_R;
Similarly:
For host A, we obtain Application_A, Service_A, OS_A, Protocols_A, Interface_A;
For host C, we obtain Application_C, Service_C, OS_C, Protocols_C, Interface_C;
For host D, we obtain Application_D, Service_D, OS_D, Protocols_D, Interface_D;
The following dependencies exist:
Application->Service;
Service->OS;
OS->Protocols;
Protocols->Interface; (note: this is a simplified model)
Assume the model also contains the definition web_browse_in_url->DNS service;
X.interface.fail is equivalent to X.down;
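The instantiation and the dependency chain above can be written down directly. The following is a minimal sketch, assuming each layer depends on exactly one lower layer as in the simplified model; `DEPENDS_ON`, `instances`, and `affected_by` are hypothetical names.

```python
# Class-level dependencies from the simplified model: key depends on value.
DEPENDS_ON = {
    "Application": "Service",
    "Service": "OS",
    "OS": "Protocols",
    "Protocols": "Interface",
}


def instances(node: str):
    """Instantiate the model for one node, e.g. 'Application_A'."""
    layers = ["Application"] + list(DEPENDS_ON.values())
    return [f"{layer}_{node}" for layer in layers]


def affected_by(failed_layer: str):
    """Layers that fail, directly or transitively, when failed_layer fails."""
    hit = []
    current = failed_layer
    while True:
        above = [k for k, v in DEPENDS_ON.items() if v == current]
        if not above:
            return hit
        current = above[0]       # the chain is linear in this model
        hit.append(current)
```

Walking the chain upward from `Interface` reaches every other layer, which is why X.interface.fail can be treated as equivalent to X.down.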
2) Topology dependency generation
For the network shown in Fig. 4, the network management platform obtains the topology data through automatic discovery and then runs the topology dependency generation algorithm (with RP as the reference point) to obtain the following set of topology dependencies:
RD={A->S,C->S,S->R,D->R,Internet->R,R->RP}
where 'X->Y' means "to reach X, one must first pass through Y";
R->RP means that R is a network node directly connected to the reference point RP;
When the network topology or the reference point changes, the algorithm updates the dependencies automatically, so that they continue to reflect the actual state of the network.
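The set RD can be reproduced with a short graph traversal. The source describes the generation algorithm as recursive (Fig. 3) without giving its details; the sketch below substitutes a breadth-first search from the reference point, which yields the same result on this tree-shaped network. The link list is read off the description of Fig. 4, and `topology_dependencies` is a hypothetical name.

```python
from collections import deque


def topology_dependencies(links, reference):
    """Breadth-first search from the reference point.

    Returns {X: Y}, read as "to reach X you must first pass through Y".
    """
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)
    depends = {}
    seen = {reference}
    queue = deque([reference])
    while queue:
        node = queue.popleft()
        for nxt in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                depends[nxt] = node      # nxt is reached through node
                queue.append(nxt)
    return depends


LINKS = [("A", "S"), ("C", "S"), ("S", "R"), ("D", "R"),
         ("Internet", "R"), ("R", "RP")]
RD = topology_dependencies(LINKS, "RP")
```

Rerunning the function after a link or the reference point changes corresponds to the automatic update described above. In a meshed network a node may be reachable through several neighbours; this sketch records only the first one found.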
3) The event extraction interface begins to receive the events produced in the network.
Assume that a DNS service (which can be regarded as a Service) runs on host A, and that a program on host D continually visits the home page www.harbournetworks.com on the Web server. This program can be regarded as an Application, which we name web_browse_in_url.
Assume that at some moment the event extraction interface receives the following events from the SNMP agents of the hosts. After formatting, they are expressed as follows:
{
E0=RP.ping.S.fail:t0, at time t0 RP cannot ping switch S,
E1=RP.ping.C.fail:t1, at time t1 RP cannot ping host C,
E2=RP.ping.C.fail:t2, at time t2 RP cannot ping host C,
E3=D.web_browse_in_url.Web_Server.fail:t3, at time t3 host D cannot access the Web server,
E4=RP.ping.A.fail:t4, at time t4 RP cannot ping host A,
E5=RP.ping.A.fail:t5, at time t5 RP cannot ping host A,
E6=R.down:t6, at time t6 R fails,
E7=RP.web_browse_in_url.Web_Server.fail:t7, at time t7 host RP cannot access the Web server,
E8=R.up:t8, at time t8 R resumes operation,
}
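The unified internal format shown above is regular enough to parse mechanically. The sketch below assumes exactly two shapes, `node.state:time` and `source.action.target.state:time`, as they appear in this example; `Event` and `parse_event` are hypothetical names.

```python
from typing import NamedTuple, Optional


class Event(NamedTuple):
    source: str
    action: Optional[str]    # e.g. "ping"; None for node state events
    target: Optional[str]
    state: str               # "fail", "down", or "up"
    time: str


def parse_event(text: str) -> Event:
    """Split 'pattern:time' into its fields."""
    body, time = text.rsplit(":", 1)
    parts = body.split(".")
    if len(parts) == 2:                  # "R.down:t6"
        return Event(parts[0], None, None, parts[1], time)
    if len(parts) == 4:                  # "RP.ping.S.fail:t0"
        return Event(parts[0], parts[1], parts[2], parts[3], time)
    raise ValueError(f"unrecognised event: {text!r}")
```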
4) E0 ... E8 are then fed to the preprocessing module. The compressed original event set below is obtained; note that repeated events (E2, E5) and the paired fault and recovery events (E6, E8) have been filtered out:
{
E0=RP.ping.S.fail:t0, at time t0 RP cannot ping switch S,
E1=RP.ping.C.fail:t1, at time t1 RP cannot ping host C,
E3=D.web_browse_in_url.Web_Server.fail:t3, at time t3 host D cannot access the Web server,
E4=RP.ping.A.fail:t4, at time t4 RP cannot ping host A,
E7=RP.web_browse_in_url.Web_Server.fail:t7, at time t7 host RP cannot access the Web server,
}
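The two filtering rules used in this step can be sketched as follows. This is an illustration under assumptions, not the preset filters of the invention: two events count as repeats when their patterns are equal regardless of timestamp, and a `.up` event cancels every `.down` event of the same node without checking their order. `preprocess` is a hypothetical name.

```python
def preprocess(events):
    """Drop repeated events and fault/recovery pairs; keep original order.

    Each event is 'pattern:time'. Timestamps are ignored when comparing.
    """
    patterns = [e.rsplit(":", 1)[0] for e in events]
    # Nodes for which a recovery ('.up') event was received.
    recovered = {p[:-len(".up")] for p in patterns if p.endswith(".up")}
    kept, seen = [], set()
    for event, pattern in zip(events, patterns):
        node, _, state = pattern.rpartition(".")
        if state in ("down", "up") and node in recovered:
            continue                     # fault already cleared
        if pattern in seen:
            continue                     # repeat of an earlier event
        seen.add(pattern)
        kept.append(event)
    return kept
```

Applied to E0 through E8 in their received form, the function keeps E0, E1, E3, E4, and E7.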
5) Real-time correlation analysis of the fault events in the communication network using integrated domain information:
(a) The analysis control engine reads an event from the original event list: E0=RP.ping.S.fail:t0;
and parses from it:
Node objects: source node RP, destination node S;
Application: RP.ping (ping belongs to Applications);
Application state: fail;
E0 is marked and added to the work event list;
(b) The scenario table is opened and queried for scenarios related to RP, S, or ping. The table is found to be empty (the system has just been initialized and no scenario has been added yet), so it is closed;
(c) The information model query interface is called to query ping (an Application), which yields the relationships Applications->Services, Services->Protocols, Protocols->Interface. The topology dependency library is then queried, which yields R->RP and S->R;
(d) The real-time network state detection interface is called to check S.Interface. Its state is found to be fail, so the following can be inferred from the dependencies:
S.Interface.fail==S.down;
S.down=>A.down and C.down;
A.down==A.Interface.fail=>A.application.fail and A.services.fail;
C.down==C.Interface.fail=>C.application.fail and C.services.fail;
A.services.fail=>A.DNS.fail=>*.web_browse_in_url.fail
(e) The original event list is checked starting from E1. E1 is read:
E1=RP.ping.C.fail:t1, which parses to:
Node objects: source node RP, destination node C;
Application: RP.ping (ping belongs to Applications);
Application state: fail;
ping is an application, and it requires the applications, services, protocols, and interfaces on RP and C, and on the nodes S and R that C topologically depends on, to be working normally. Either S.down or C.down therefore implies E1, so E1 is correlated; the analysis engine marks E1 and adds it to the work event list;
Reading continues with E3:
E3=D.web_browse_in_url.Web_Server.fail:t3, which parses to:
Node objects: D, Web_Server;
Application: web_browse_in_url;
Application state: fail;
From the result obtained earlier, A.services.fail=>A.DNS.fail=>*.web_browse_in_url.fail, it follows that E3 is also a related event of E1, so E3 is marked and added to the work event list.
In the same way, E4 and E7 are found to be related events of E1, so they are marked and added to the work event list.
(f) No unmarked events remain in the original event list, so the output module is called to format and output the result:
Output alarm:
Alarm1=
{
Cause:RP.ping.S.fail:t0
Affects:
[
RP.ping.C.fail:t1
D.web_browse_in_url.Web_server.fail:t3
RP.ping.A.fail:t4
RP.web_browse_in_url.Web_Server.fail:t7
]
}
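As an illustration of the formatted output in step (f), the sketch below renders a root-cause event and its correlated events in the same layout as Alarm1. The function name `format_alarm` is hypothetical; the patent only states that an output module formats the result.

```python
def format_alarm(name, cause, affected):
    """Render one alarm: the root-cause event followed by the events it explains."""
    lines = [f"{name}=", "{", f"Cause:{cause}", "Affects:", "["]
    lines.extend(affected)
    lines.extend(["]", "}"])
    return "\n".join(lines)


alarm_text = format_alarm(
    "Alarm1",
    "RP.ping.S.fail:t0",
    [
        "RP.ping.C.fail:t1",
        "D.web_browse_in_url.Web_server.fail:t3",
        "RP.ping.A.fail:t4",
        "RP.web_browse_in_url.Web_Server.fail:t7",
    ],
)
print(alarm_text)
```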
(g) Using the fault feature parameters and the fault propagation path, construct a new failure scenario for these events, Scene1: S.down => {A.down and C.down and *.web_browse_in_url.fail}, and add it to the failure scenario table.
(h) Empty the work event list, and remove these events from the original event list.
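Steps (g) and (h) can be sketched as follows. The feature-indexed table is an assumption chosen so that later lookups by a single feature are quick; `Scenario`, `ScenarioTable`, and `close_analysis` are hypothetical names, and the patent does not specify how the table is organised internally.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    cause: str
    features: frozenset


class ScenarioTable:
    """Failure scenario table indexed by feature pattern (an assumed layout)."""

    def __init__(self):
        self._by_feature = defaultdict(list)

    def add(self, scenario):
        for feature in scenario.features:
            self._by_feature[feature].append(scenario)

    def lookup(self, feature):
        return list(self._by_feature.get(feature, []))


def close_analysis(table, original_events, work_events, name, cause, features):
    """Store a new scenario, drop the analysed events from the original list,
    and empty the work list."""
    scenario = Scenario(name, cause, frozenset(features))
    table.add(scenario)
    remaining = [e for e in original_events if e not in work_events]
    work_events.clear()
    return scenario, remaining


table = ScenarioTable()
original = ["RP.ping.C.fail:t1", "RP.ping.D.fail:t5"]  # second event is hypothetical
work = ["RP.ping.C.fail:t1"]
scene1, original = close_analysis(
    table, original, work, "Scene1", "S.down",
    ["A.down", "C.down", "*.web_browse_in_url.fail"],
)
```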
(j) If new events have been added to the original event list at this point, go to step (3); otherwise suspend and wait for new event input;
(k) Suppose new events arrive:
E9=D.web_browse_in_url.Web_Server.fail:t9
E10=A.down:t10;
(l) The event analysis engine reads E9 and queries the scenario table. It finds that the event feature pattern *.web_browse_in_url.fail in Scene1 matches, so E9 is added to the work event list. The engine then checks whether the original event list contains the feature events A.down and C.down. It reads E10, which satisfies A.down, and adds E10 to the work event list. No other events remain in the list at this point, and one feature, C.down, still has to be confirmed, so the real-time detection interface is called; the detection finds C.down=true. The scenario is therefore matched, and the result S.down is obtained directly. The procedure is as described in step (1).
In the previous step, if the real-time detection result for C had been C.down=false, the above scenario could not be fully trusted and would be assigned a confidence probability, indicating that other causes are also present.
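The scenario match in step (l), including the fallback to real-time detection, might be sketched as below. The text does not say how the confidence probability is computed; the fraction of confirmed features used here is purely an assumption, and `signature`, `match_scenario`, and the `probe` callback are hypothetical names.

```python
from fnmatch import fnmatchcase


def signature(event_text):
    """Reduce an event to 'source.application.state' (or 'node.state'),
    dropping the timestamp and the destination node."""
    body = event_text.rsplit(":", 1)[0]
    parts = body.split(".")
    if len(parts) == 4:
        return f"{parts[0]}.{parts[1]}.{parts[3]}"
    return body


def match_scenario(features, events, probe):
    """Return the fraction of scenario features confirmed either by a pending
    event or, failing that, by a real-time probe (assumed confidence measure)."""
    signatures = [signature(e) for e in events]
    confirmed = 0
    for feature in features:
        if any(fnmatchcase(s, feature) for s in signatures):
            confirmed += 1
        elif probe(feature):
            confirmed += 1
    return confirmed / len(features)


scene1_features = ["A.down", "C.down", "*.web_browse_in_url.fail"]
new_events = ["D.web_browse_in_url.Web_Server.fail:t9", "A.down:t10"]

# Real-time detection confirms C.down: the scenario matches fully.
full = match_scenario(scene1_features, new_events, lambda f: f == "C.down")
# Real-time detection reports C.down=false: only two of three features hold.
partial = match_scenario(scene1_features, new_events, lambda f: False)
```

Under this assumed measure, the first call yields full confidence in S.down, while the second yields a lower value that signals another cause may be involved.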
By applying integrated domain information, including the hierarchy of managed objects and their interrelations based on the network information model, automatically learned historical troubleshooting information, network operating parameters collected in real time, dynamic network topology information, and the time characteristics of events, and by using a dynamic analysis method in the reasoning process, the invention better solves the problem of failure correlation analysis in complex network environments.
With reference to Fig. 1, the real-time network failure correlation analysis system of the present invention comprises:
Analysis control engine: the main control logic executor of the analysis process; it calls the other modules and interfaces according to the analysis control engine algorithm to complete the failure correlation analysis;
Information model: describes a series of management classes corresponding to network protocol objects and device objects, together with the various relationships between them; the management classes defined in the information model fall into three broad categories: the topology submodel, the open service submodel, and the network service submodel;
Information model query interface: provides functions for querying management classes, management class attributes, and the relationships between management classes from the information model, supplying the analysis control engine with information from the information model at run time;
Event extraction interface: receives the various network events sent by network equipment, including event notifications of protocols such as SNMP TRAP, SYSLOG, and CMIP Event Report, converts these events into a unified format, and passes them to the preprocessing module;
Preprocessing module: performs simple preprocessing on the received original events, such as filtering (removing events that administrators need not care about, according to configured rules), compression (removing duplicate events), and redefinition (redefining one or more events as a new event), to facilitate correlation analysis;
Real-time network parameter detection interface: detects real-time information such as the attributes, performance, and reachability of the various devices and services in the network; it is called by the failure analysis engine, accepts parameters from the failure analysis engine that determine which network equipment to detect in real time, and returns the result to the failure analysis engine;
Failure scenario table generation module: builds a failure scenario from a group of events found to be correlated and stores it in the failure scenario table, where it can be looked up quickly and used by subsequent analysis;
Topology synchronization module: triggered by network topology change events, it runs the topological dependency generation algorithm, generates topological dependencies that correctly reflect the current network topology connections, and stores them in the topology dependency library for use by the failure correlation analysis.
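One way to picture what the topology synchronization module produces is a breadth-first traversal from the node that runs the analysis program, recording for each node the neighbour through which it is reached. This is a sketch under assumptions: the link list is a hypothetical topology consistent with the worked example, `reachability_dependencies` is an invented name, and the traversal keeps only one path per node, so redundant links are ignored.

```python
from collections import deque


def reachability_dependencies(links, reference):
    """Map each node to the neighbour it depends on to be reachable
    from the reference point (breadth-first, one path per node)."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    depends_on = {}
    visited = {reference}
    queue = deque([reference])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                depends_on[neighbour] = node
                queue.append(neighbour)
    return depends_on


# Hypothetical topology: RP reaches R, R reaches S, and S reaches A and C.
links = [("RP", "R"), ("R", "S"), ("S", "A"), ("S", "C")]
dependencies = reachability_dependencies(links, "RP")
```

Rerunning the function with the updated link list whenever a topology change notification arrives would keep the stored dependencies consistent with the actual network.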

Claims (10)

1. A real-time correlation analysis method for network failures, comprising:
(1) an event extraction interface collects the various failure events generated in the network and writes them into an original event list;
(2) reading an event from the original event list, performing event matching against historical failure scenario information, and detecting the operating parameters of network equipment and services in real time;
(3) if no matching event is found, selecting, based on the information model and the topological dependencies, the network objects related to the event currently being processed and detecting them in real time, and applying the real-time detection result as a condition in the reasoning process;
(4) returning to the original event list and continuing to search for events that are related to the event currently being processed or that agree with the real-time detection result, and adding these events to a work list;
(5) when the original event list contains no further events that can be added to the work list, constructing a new failure scenario from the events in the work list, adding it to the historical failure scenario information, and emptying the work list;
(6) reading the next event that meets the selection strategy from the original event list and returning to step (2); if the list contains no events, suspending and waiting for event input.
2. The real-time correlation analysis method for network failures as claimed in claim 1, characterized in that the information model comprises:
(1) an object-oriented abstraction of the various managed objects in the managed network;
(2) a hierarchical information model formed according to the inheritance relationships between the abstracted management classes;
(3) correlations between management classes, defined in the information model by association classes.
3. The real-time correlation analysis method for network failures as claimed in claim 1 or 2, characterized in that the topological dependency comprises:
(1) keeping the topological dependencies consistent with the actual network topology while the network is running;
(2) setting the network node that runs the failure correlation analysis program as the reference point;
(3) calculating the reachability dependencies from the reference point to every other node;
(4) using topology change notifications from equipment to trigger the topology synchronization program, which recalculates the topological dependencies from the latest topology.
4. The real-time correlation analysis method for network failures as claimed in claim 1, characterized in that the reasoning process comprises:
(1) assigning a confidence probability to each reasoning step, and deriving the probability of the final analysis result from the probabilities of the individual steps;
(2) when creating a failure scenario, defining a time constraint function to describe the time characteristics of events and the time relationships between correlated events;
(3) representing and matching alarm content using a formal method.
5. The real-time correlation analysis method for network failures as claimed in claim 1, characterized in that the historical failure scenario information is organized as a failure scenario table that facilitates fast lookup.
6. The real-time correlation analysis method for network failures as claimed in claim 1, characterized in that said step (1) further comprises:
(1-1) dynamically changing the length of the original event queue according to predefined rules when handling different event types;
(1-2) deciding which event serves as the starting point of the correlation analysis according to event level and user-defined rules;
(1-3) preprocessing the original events: providing an extensible event acquisition interface for failure events of different protocols, converting them into a unified internal format, and filtering them.
7. The real-time correlation analysis method for network failures as claimed in claim 1, characterized in that constructing the new failure scenario comprises:
(1) extracting the fault feature parameters;
(2) extracting the fault propagation path;
(3) constructing the new failure scenario from the fault feature parameters and the propagation path.
8. A real-time correlation analysis system for network failures, comprising:
Analysis control engine: calls the other modules and interfaces according to the analysis control engine algorithm to complete the failure correlation analysis;
Event extraction interface: receives the various network events sent by network equipment, converts the events into a unified format, and writes them into the original event list for use by the analysis control engine;
Real-time network parameter detection interface: detects real-time information such as the attributes, performance, and reachability of the various devices and services in the network; it is called by the analysis control engine, accepts parameters from the failure analysis engine that determine which network equipment to detect in real time, and returns the result to the analysis control engine;
Information model: describes a series of management classes corresponding to network protocol objects and device objects, and the interdependencies between them;
Information model query interface: provides functions for querying management classes, management class attributes, and the relationships between management classes from the information model, supplying the analysis control engine with information from the information model at run time;
Topology synchronization module: triggered by network topology change events, it runs the topological dependency generation algorithm, generates topological dependencies that correctly reflect the current network topology connections, and stores them in the topology dependency library, which provides the relevant information to the analysis control engine;
Failure scenario table generation module: builds a failure scenario from a group of events found to be correlated and stores it in the failure scenario table, against which subsequent events are matched.
9. The real-time correlation analysis system for network failures as claimed in claim 8, characterized in that the information model is stored as a hash table file, and the analysis control engine extracts information from the information model through the information model query interface during analysis.
10. The real-time correlation analysis system for network failures as claimed in claim 8 or 9, characterized in that it further comprises a preprocessing module, which preprocesses the received original events according to predetermined preprocessing rules.
CNB031347290A 2003-09-29 2003-09-29 Network failure real-time relativity analysing method and system Expired - Fee Related CN100456687C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB031347290A CN100456687C (en) 2003-09-29 2003-09-29 Network failure real-time relativity analysing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB031347290A CN100456687C (en) 2003-09-29 2003-09-29 Network failure real-time relativity analysing method and system

Publications (2)

Publication Number Publication Date
CN1529455A CN1529455A (en) 2004-09-15
CN100456687C true CN100456687C (en) 2009-01-28

Family

ID=34286184

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB031347290A Expired - Fee Related CN100456687C (en) 2003-09-29 2003-09-29 Network failure real-time relativity analysing method and system

Country Status (1)

Country Link
CN (1) CN100456687C (en)

Families Citing this family (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100417080C (en) * 2005-02-01 2008-09-03 华为技术有限公司 Method for detecting network chain fault and positioning said fault
CN100450016C (en) * 2005-06-03 2009-01-07 华为技术有限公司 Method for implementing online maintenance in communication network
FI20050625A0 (en) * 2005-06-13 2005-06-13 Nokia Corp Binary class control
CN100382509C (en) * 2005-11-28 2008-04-16 华为技术有限公司 Fault positioning method in wireless network
CN101388794B (en) * 2008-10-10 2011-12-07 中兴通讯股份有限公司 Method and system for positioning network management system exception affair
CN101394314B (en) * 2008-10-20 2011-03-23 北京邮电大学 Fault positioning method for Web application system
CN101610174B (en) * 2009-07-24 2011-08-24 深圳市永达电子股份有限公司 Log correlation analysis system and method
CN102640154B (en) * 2009-07-30 2015-03-25 惠普开发有限公司 Constructing a bayesian network based on received events associated with network entities
JP5542398B2 (en) * 2009-09-30 2014-07-09 株式会社日立製作所 Root cause analysis result display method, apparatus and system for failure
CN102045213B (en) * 2009-10-22 2014-04-02 华为技术有限公司 Fault positioning method and device
CN102158360B (en) * 2011-04-01 2013-10-30 华中科技大学 Network fault self-diagnosis method based on causal relationship positioning of time factors
CN102164089B (en) * 2011-05-13 2014-12-24 哈尔滨工程大学船舶装备科技有限公司 Routing-based IETM (Interactive Electronic Technical Manual) fault diagnosis recording and playback method
CN102307135A (en) * 2011-05-24 2012-01-04 中国电子科技集团公司第十研究所 Method for processing baseband data transmission data in real time by utilizing VxWorks platform
CN102404141B (en) * 2011-11-04 2014-03-12 华为技术有限公司 Method and device of alarm inhibition
GB2521774A (en) * 2012-10-25 2015-07-01 Hewlett Packard Development Co Event correlation
CN103152219B (en) * 2013-02-18 2015-12-09 中国工商银行股份有限公司 A kind of event monitoring system of computer network system and event-monitoring method
US9952922B2 (en) 2013-07-18 2018-04-24 Nxp Usa, Inc. Fault detection apparatus and method
KR101545215B1 (en) * 2013-10-30 2015-08-18 삼성에스디에스 주식회사 system and method for automatically manageing fault events of data center
CN104539941B (en) * 2014-12-25 2016-12-07 南京大学镇江高新技术研究院 Based on the traffic video private network Fault Locating Method improving code book
US10339032B2 (en) * 2016-03-29 2019-07-02 Microsoft Technology Licensing, LLD System for monitoring and reporting performance and correctness issues across design, compile and runtime
CN106484595A (en) * 2016-10-09 2017-03-08 华青融天(北京)技术股份有限公司 A kind of event-handling method and device
CN109428741A (en) * 2017-08-22 2019-03-05 中兴通讯股份有限公司 A kind of detection method and device of network failure
CN108171341A (en) * 2017-12-19 2018-06-15 深圳交控科技有限公司 The state analysis method and device of signalling arrangement
CN109308248A (en) * 2018-08-27 2019-02-05 上海功致信息科技有限公司 Event relation analyzing method and system
CN109597752B (en) * 2018-10-19 2022-11-04 中国船舶重工集团公司第七一六研究所 Fault propagation path simulation method based on complex network model
CN110855503A (en) * 2019-11-22 2020-02-28 叶晓斌 Fault cause determining method and system based on network protocol hierarchy dependency relationship
CN113206749B (en) * 2020-01-31 2023-11-17 瞻博网络公司 Programmable diagnostic model of correlation of network events
US11269711B2 (en) 2020-07-14 2022-03-08 Juniper Networks, Inc. Failure impact analysis of network events
CN114629776B (en) * 2020-12-11 2023-05-30 中国联合网络通信集团有限公司 Fault analysis method and device based on graph model
CN114363149B (en) * 2021-12-23 2023-12-26 上海哔哩哔哩科技有限公司 Fault processing method and device
CN116132214A (en) * 2022-12-30 2023-05-16 中国联合网络通信集团有限公司 Event transmission method, device, equipment and medium based on event bus model

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO1997050209A1 (en) * 1996-06-27 1997-12-31 Telefonaktiebolaget Lm Ericsson (Publ) A method for fault control of a telecommunications network and a telecommunications system
WO2003036914A1 (en) * 2001-10-25 2003-05-01 General Dynamics Government Systems Corporation A method and system for modeling, analysis and display of network security events

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10467083B2 (en) 2017-06-08 2019-11-05 International Business Machines Corporation Event relationship analysis in fault management
CN113169898A (en) * 2018-11-07 2021-07-23 西门子股份公司 System and method for error identification and error cause analysis in a network of network components
CN113169898B (en) * 2018-11-07 2022-12-27 西门子股份公司 System and method for error identification and error cause analysis in a network of network components
CN113271216A (en) * 2020-02-14 2021-08-17 华为技术有限公司 Data processing method and related equipment
WO2021159676A1 (en) * 2020-02-14 2021-08-19 华为技术有限公司 Data processing method and related device

Also Published As

Publication number Publication date
CN1529455A (en) 2004-09-15

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: HUAWEI TECHNOLOGY CO., LTD.

Free format text: FORMER OWNER: GANGWAN NETWORK CO., LTD.

Effective date: 20061013

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20061013

Address after: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Applicant after: Huawei Technologies Co., Ltd.

Address before: 100089, No. 21 West Third Ring Road, Beijing, Haidian District, Long Ling Building, 13 floor

Applicant before: Harbour Networks Holdings Limited

C14 Grant of patent or utility model
GR01 Patent grant
ASS Succession or assignment of patent right

Owner name: HUAWEI TECHNOLOGIES SERVICE GMBH

Free format text: FORMER OWNER: HUAWEI TECHNOLOGY CO LTD

Effective date: 20120217

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 518129 SHENZHEN, GUANGDONG PROVINCE TO: 065000 LANGFANG, HEBEI PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20120217

Address after: 065000 west of Wangjing Road, Langfang economic and Technological Development Zone, Hebei

Patentee after: Huawei Technoloy Service Co., Ltd.

Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen

Patentee before: Huawei Technologies Co., Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090128

Termination date: 20150929

EXPY Termination of patent right or utility model