Summary of the invention
The invention provides a method and system for real-time correlation analysis of network fault events based on integrated field information. It overcomes the deficiencies of existing fault correlation analysis methods, which ignore dynamic network state information, rely too heavily on preset rules in the reasoning process, and lack any automatic learning ability. The method effectively identifies the key event at the source of a fault and locates it in the network.
Technical content of the invention: a method for real-time correlation analysis of network faults, comprising:
(1) an event extraction interface collects the fault events generated in the network and writes them to the raw event list;
(2) an event is read from the raw event list and matched against the historical fault scenario information, and the operating parameters of network devices and services are detected in real time;
(3) if no matching event exists, the network objects related to the event currently being processed are selected on the basis of the information model and the topology dependencies and are detected in real time, and the real-time detection results are fed back into the reasoning process as conditions;
(4) the raw event list is searched again for events related to the event currently being processed, or events identical to the real-time detection results, and these events are added to the work list;
(5) when the raw event list contains no further events that can be added to the work list, a new fault scenario is constructed from the events in the work list and added to the historical fault scenario information, and the work list is emptied;
(6) the next event that satisfies the selection policy is read from the raw event list and processing returns to step (2); if the list contains no events, the process suspends and waits for event input.
The information model is built as follows:
(1) the managed objects in the managed network are abstracted in an object-oriented manner;
(2) the abstracted managed classes are organized by their inheritance relations into a hierarchical information model;
(3) the correlations between managed classes are defined in the information model by association classes.
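The three points above can be illustrated with a small sketch. It is only an assumption about how such a model might look in code: `Application` and `Service` follow managed classes named in this description, while `ManagedObject`, `Association`, `MODEL`, and `depends_on` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


class ManagedObject:
    """Hypothetical root of the managed-class hierarchy."""

    def __init__(self, name: str) -> None:
        self.name = name


class Software(ManagedObject):
    pass


class Application(Software):
    pass


class Service(Software):
    pass


@dataclass(frozen=True)
class Association:
    """Association class: a typed relation between two managed classes."""
    kind: str      # e.g. "dependency", "containment", "binding"
    source: type
    target: type


# Assumed model content: applications depend on services.
MODEL = [Association("dependency", Application, Service)]


def depends_on(cls: type) -> list:
    """Return the classes cls depends on, including relations inherited from ancestors."""
    return [a.target for a in MODEL
            if a.kind == "dependency" and issubclass(cls, a.source)]
```

Because the lookup uses `issubclass`, a relation declared once on a managed class also applies to every class derived from it, which is one way the hierarchy and the association classes can work together.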
The topology dependencies are maintained as follows:
(1) during network operation, the topology dependencies are kept consistent with the actual network topology;
(2) the network node on which the fault correlation analysis program runs is set as the reference point;
(3) the reachability dependencies from the reference point to every other node are calculated;
(4) topology change notifications from devices trigger the topology synchronization program, which recalculates the topology dependencies from the latest topology.
The reasoning process comprises:
(1) each reasoning step is assigned a confidence probability, and the probability of the final analysis result is derived from the per-step probabilities;
(2) when a fault scenario is created, a time constraint function is defined to describe the temporal characteristics of an event and the temporal relations between associated events;
(3) alarm content is represented and matched using a formal method.
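As a rough illustration of points (1) and (2), the sketch below combines per-step confidence probabilities and checks a time constraint between two associated events. The text does not state how the per-step probabilities are combined or what form the time constraint function takes; multiplying the probabilities (which assumes independent steps) and using a fixed maximum delay are both assumptions, and `chain_confidence` and `within_window` are hypothetical names.

```python
from math import prod


def chain_confidence(step_probabilities):
    """Confidence of a conclusion reached through a chain of reasoning steps.

    Assumption: the steps are independent, so their probabilities multiply.
    """
    return prod(step_probabilities)


def within_window(cause_time, effect_time, max_delay):
    """Hypothetical time constraint: the effect must follow the cause within max_delay."""
    return 0 <= effect_time - cause_time <= max_delay
```

For example, a two-step inference with step confidences 0.9 and 0.8 would yield a final confidence of about 0.72 under this assumption.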
The historical fault scenario information is organized as a fault scenario table that supports fast lookup.
The collection of raw fault events further comprises:
(1) when handling different event types, the length of the raw event queue is changed dynamically according to predefined rules;
(2) event level and user-defined rules determine which event serves as the starting point of correlation analysis;
(3) raw events are preprocessed: an extensible event acquisition interface is provided for fault events of different protocols, and the events are converted to a unified internal format and filtered.
Constructing a new fault scenario comprises:
(1) extracting the fault feature parameters;
(2) extracting the fault propagation path;
(3) constructing the new fault-resolution scenario from the fault feature parameters and the propagation path.
A system for real-time correlation analysis of network faults, comprising:
Analysis control engine: calls the other modules and interfaces according to the analysis control engine algorithm to complete the fault correlation analysis;
Event extraction interface: receives the network events sent by network devices, converts them to a unified format, and writes them to the raw event list for use by the analysis control engine;
Real-time network parameter detection interface: detects real-time information such as the attributes, performance, and reachability of the devices and services in the network; it is called by the analysis control engine, accepts parameters from the engine that determine which network devices to detect in real time, and returns the results to the engine;
Information model: describes a series of managed classes corresponding to network protocol objects and device objects, together with the interdependencies between them;
Information model query interface: provides functions for querying managed classes, managed class attributes, and the relations between managed classes in the information model, supplying the analysis control engine with information from the model at run time;
Topology synchronization module: triggered by network topology change events, runs the topology dependency generation algorithm, generates topology dependencies that correctly reflect the current network topology connections, and stores them in the topology dependency database, which supplies the analysis control engine with the relevant information;
Fault scenario table generation module: builds a fault scenario from a group of events found to be correlated and stores it in the fault scenario table, against which subsequent events are matched.
The information model is stored as a hash table file, and during analysis the analysis control engine retrieves information from the model through the information model query interface.
The system further comprises a preprocessing module, which preprocesses received raw events according to predetermined preprocessing rules.
Technical effect of the invention: the method makes full use of the dynamic and static, real-time and historical information available in the network. When a network fault occurs, it effectively identifies the key event at the source of the fault amid the complex fault symptoms and the event storm they cause, and locates it in the network. Because the analysis uses topology dependencies that are synchronized with the actual network topology, together with network operating parameters obtained in real time, the accuracy of fault location is improved. Preprocessing the raw input events (protocol format conversion, filtering, and selection) avoids starting correlation analysis from every input event and so improves processing efficiency. Building a table of historical fault-handling scenarios gives the method the ability to learn from past experience; events are matched quickly against the scenario table, and events already covered by the table obtain a match directly, which avoids running the full correlation analysis on every event and further improves processing efficiency. Finally, applying probabilistic logic, time constraint functions, and regular-expression fuzzy matching in the analysis algorithm allows complex relations between events to be handled flexibly and improves the adaptability of the correlation analysis.
Embodiment
With reference to Figure 1, the invention uses the analysis control engine as its control module. The engine carries out real-time correlation analysis of network faults by interacting with the information model query interface, the event extraction interface and preprocessing module, the real-time network parameter detection interface, the fault scenario table generation module, and the topology synchronization module. The concrete steps are:
1. The event extraction interface extracts fault event information from the various network devices and service objects over different protocols (SNMP, SYSLOG, etc.) and converts it to a unified internal format. The event preprocessing module then compresses and filters this event information (according to preset filters) and writes it to the raw event list. Preprocessing the raw events effectively improves processing efficiency;
2. The analysis control engine selects an event from the raw event list according to event level and type and performs correlation analysis on it. The analysis draws together the fault scenario table, the information model, real-time detection information, and topology information. During analysis the engine can continue to read events from the raw event list as needed to construct the event propagation path, until no further matching event can be found;
(1) The historical fault scenario information is organized as a fault scenario table that supports fast lookup, so that events can be matched quickly against the scenario table;
(2) An object-oriented, hierarchical network information model is constructed. Managed objects such as the hardware, links, software, and network services in the network are abstracted in an object-oriented manner, and the resulting managed classes are organized by their inheritance relations into a hierarchical information model. Association classes in the model also define the correlations between managed classes, such as containment, dependency, and binding. The model is stored as a hash table file and is accessed through the model object management interface; further relations can be derived from the hierarchy and interdependencies of the managed classes that the model defines. The information model describes a series of managed classes corresponding to network protocol objects and device objects, and the various relations between them. The managed classes defined in the information model fall into three broad categories: the topology submodel, the open service submodel, and the network service submodel.
The open service system submodel is used below as an example to introduce the definitions of managed classes. This submodel mainly describes each node device in the data communication network and its internal modules. It abstracts every network node that provides data transport or data processing services as an open service system, in which software and hardware are assembled into different systems in an extensible and tailorable way. Its managed classes are:
A. Open service system (Open Service System): represents any system that provides data services at any layer of the data communication network, including routers, switches, and servers;
B. Software (software): a functional module of the open service system implemented in software;
C. Hardware (hardware): a functional module of the open service system implemented in hardware and firmware;
D. Application (application): application programs of all kinds, such as mail clients;
E. Operating system (os): real-time and time-sharing operating systems of all kinds, such as VxWorks, Windows, Unix, and Linux;
F. Resource (resource): basic shared objects in the system, such as memory, disk, CPU, and interrupts;
G. Device (device): each module that makes up the hardware;
H. Service (service);
I. Protocol stack (protocol stack);
J. Kernel (kernel);
K. Driver (driver);
L. Memory (memory);
M. Hard disk (harddisk);
N. Central processing unit (cpu);
O. Bus (bus);
P. Adapter (adapter);
Q. Network adapter (network adapter);
U. Controller (controller).
In this information model there are various dependencies between managed classes, such as protocol dependencies and open service dependencies.
(3) Real-time detection: the reasoning process is combined with real-time detection of the operating parameters of network devices and services.
(4) Topology dependencies are calculated in real time from a specified reference point. The network node on which the fault correlation analysis program runs is set as the reference point, the reachability dependencies to every other node are calculated from it, and these are kept synchronized with the network topology during network operation. Topology dependencies describe the physical links between nodes and are the basis of protocol interconnection and service availability. The reference point is the node taken as the starting point when considering the reachability of a given node in the topology graph; in a real managed network it is usually the node hosting the network management platform, or the node hosting a network probe (software or hardware). With reference to Figure 3, establishing the dependencies is a recursive algorithm. Every topology change automatically triggers the algorithm to update the dependencies, which keeps fault location and correlation accurate and yields the set of possibly related network instance objects that need to be detected in the next step.
(5) The core logic of the correlation analysis is carried out inside the analysis control engine. With reference to Figure 2:
a. Read an event Ei (i = 1 to n) from the list and match it against the scenario table to see whether any historical fault scenario is related to this event (that is, whether a characteristic event of the scenario matches this event). Each matching scenario is handled according to step (b);
b. Call the real-time detection module to detect the real-time state of the relevant instances of the object classes involved in the scenario (also taking into account the nodes that are topologically dependent on, or depended on by, the node that produced the event), and check whether the returned results fall within the characteristic range described by the scenario. Then search the raw event list for subsequent events produced by the related instances and check whether they match the features defined by the scenario. If these checks pass, mark the related events and call the output module to format and output the analysis result;
c. If the checks in (b) fail, call the model query interface to look up, in the network information model, the managed class corresponding to the object that produced the event. Also taking into account the nodes topologically related to the node that produced the event, obtain the set of possibly related network instance objects that need to be detected in the next step;
d. Call the real-time detection module to check whether the current state of these objects falls within the characteristic range described by the relations defined in the information model, then check whether the raw event list contains related events sent by these objects. If it does, add those events to the work event list and go to step (e). If the checks fail, check whether the work event list is empty. If it is empty, go to step (e). If it is not empty, call the fault scenario construction module to construct a new fault scenario for these events and add it to the fault scenario table, empty the work event list, mark and remove these events, format and output the analysis result, and go to step (e);
e. Read the next event that satisfies the selection policy from the raw event list and go to step (a); if the list contains no events, suspend and wait for event input.
The reasoning used for matching and real-time state detection in the steps above comprises: probability-based rule reasoning, in which each reasoning step is assigned a confidence probability and the probability of the final analysis result is derived from the per-step probabilities; handling of time constraints, in which a time constraint function defined when a fault scenario is created describes the temporal characteristics of an event and the temporal relations between associated events; and fuzzy matching of alarm content using regular expressions.
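The control flow of steps (a) to (e) can be sketched as follows. This is a deliberately simplified illustration, not the engine itself: events are reduced to `(object, text)` pairs, a scenario is reduced to a set of event texts, the confirmation by real-time detection in step (b) is omitted, and `related_objects` and `is_failed` are hypothetical callbacks standing in for the model and topology queries and for the real-time detection interface.

```python
def correlate(raw_events, scenario_table, related_objects, is_failed):
    """Group raw events into correlated work lists (simplified sketch).

    raw_events      : list of (object, text) tuples, oldest first
    scenario_table  : list of sets of event texts learned in earlier passes
    related_objects : hypothetical callback, object -> iterable of related objects
    is_failed       : hypothetical callback standing in for real-time detection
    """
    results = []
    pending = list(raw_events)
    while pending:                       # step e: take the next event
        seed = pending.pop(0)
        # Step a: look for a stored scenario that contains the seed event.
        known = next((s for s in scenario_table if seed[1] in s), None)
        if known is not None:
            # Step b (simplified): collect the other events the scenario predicts.
            work = [seed] + [e for e in pending if e[1] in known]
        else:
            # Steps c-d: detect related objects, then collect their events.
            failed = {o for o in related_objects(seed[0]) if is_failed(o)}
            work = [seed] + [e for e in pending if e[0] in failed]
            if len(work) > 1:
                # Learn a new scenario from the correlated events.
                scenario_table.append({text for _, text in work})
        pending = [e for e in pending if e not in work]
        results.append(work)
    return results
```

Each returned work list starts with the seed event and is followed by the events correlated with it; a scenario learned in one call is reused for matching in later calls with the same `scenario_table`.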
3. After the correlation analysis is complete (all events in the current event list have been scanned), a fault scenario is constructed from the events correlated in this pass and added to the fault scenario table. These events are then removed from the raw event list and the analysis result is constructed and output;
4. While the analysis control engine performs the work above, the event acquisition module (comprising the event acquisition interface and the event preprocessing module) concurrently writes newly received events to the raw event list, and the topology synchronization module monitors changes in the network topology and refreshes the network topology dependency database as they occur. If the raw event list contains no events, the analysis control engine suspends and waits for new events to be written. When the event preprocessing module writes a new event to the raw event list and finds the analysis control engine suspended, it wakes the engine.
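The suspend-and-wake behaviour described in step 4 is a producer-consumer pattern. A minimal sketch using a condition variable is shown below; the class `RawEventList` and its `write` and `read` methods are hypothetical names, and the text does not say which synchronization mechanism the system actually uses.

```python
import threading
from collections import deque


class RawEventList:
    """Raw event list shared by the preprocessing module and the engine."""

    def __init__(self):
        self._events = deque()
        self._cond = threading.Condition()

    def write(self, event):
        """Called by the preprocessing side; wakes the engine if it is suspended."""
        with self._cond:
            self._events.append(event)
            self._cond.notify()

    def read(self):
        """Called by the engine; suspends while the list is empty."""
        with self._cond:
            while not self._events:
                self._cond.wait()
            return self._events.popleft()
```

The engine thread simply calls `read()` in a loop. The `while` around `wait()` guards against spurious wake-ups, so the engine only proceeds when an event is really available.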
A concrete example using a local area network follows. With reference to Figure 4, A, C, and D are hosts in the LAN running the Linux operating system, S is a layer-3 switch, and R is a router that connects the LAN to the Web server and also serves as the LAN's gateway. A and C are connected directly to S, D is connected directly to R, and RP is a PC running Windows. RP is also the reference point for the correlation analysis, and the correlation analysis system runs on this host.
First, with reference to Figure 5, this embodiment adopts a simplified information model. In this network, hosts A, C, D, and RP, router R, and switch S can all be regarded as open service systems. Each open service system contains a protocol stack, which handles communication between an application and its peer entities in other open service systems on the network. Data flows down through the application, operating system, protocol, and interface into the physical network, reaches another open service system via layer-2 forwarding and layer-3 routing, and flows up through the interface, protocol, and operating system to the application at the other end.
1) Information model instantiation
In a real network environment, the model above generates instances corresponding to its entities. For example, the application on router R is named Application_R and the operating system on R is named OS_R.
By analogy, we obtain the other instances: Protocols_R, Interface_R;
Similarly:
For host A, we obtain Application_A, Service_A, OS_A, Protocols_A, Interface_A;
For host C, we obtain Application_C, Service_C, OS_C, Protocols_C, Interface_C;
For host D, we obtain Application_D, Service_D, OS_D, Protocols_D, Interface_D;
The following dependencies exist:
Application->Service;
Service->OS;
OS->Protocols;
Protocols->Interface; (note: this is a simplified model);
Assume the model also contains the definition web_browse_in_url->DNS service;
X.interface.fail is equivalent to X.down;
2) Topology dependency generation
For the network shown in Figure 4, the network management platform obtains the topology data by automatic discovery and then runs the topology dependency generation algorithm (with RP as the reference point) to obtain the following set of topology dependencies:
RD={A->S,C->S,S->R,D->R,Internet->R,R->RP}
where 'X->Y' can be read as "to reach X, one must first pass through Y";
R->RP means that R is a network node directly connected to the reference point RP;
When the network topology or the reference point changes, the algorithm automatically updates the dependencies so that they continue to reflect the actual state of the network.
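The dependency set RD can be reproduced with a short sketch. The description calls the generation algorithm recursive (Figure 3) but does not give it; the version below substitutes an iterative breadth-first traversal from the reference point, and the link list is an assumption read off the topology described for Figure 4. `topology_dependencies` and `LINKS` are hypothetical names.

```python
from collections import deque


def topology_dependencies(links, reference):
    """Map each node X to the node Y it is reached through (X->Y).

    Breadth-first traversal from the reference point; if a node is
    reachable over several paths, only the first one found is recorded.
    """
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    depends_on = {}
    seen = {reference}
    queue = deque([reference])
    while queue:
        node = queue.popleft()
        for neighbour in sorted(adjacency.get(node, ())):
            if neighbour not in seen:
                seen.add(neighbour)
                depends_on[neighbour] = node   # neighbour is reached via node
                queue.append(neighbour)
    return depends_on


# Assumed physical links for the example network.
LINKS = [("A", "S"), ("C", "S"), ("S", "R"), ("D", "R"),
         ("Internet", "R"), ("R", "RP")]
RD = topology_dependencies(LINKS, "RP")
```

Re-running the function after a topology change, or with a different `reference`, yields the updated dependency set, which corresponds to the automatic update described above.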
3) The event extraction interface begins receiving the events generated in the network.
Suppose a DNS service is running on host A (it can be regarded as a Service), and a program on host D is continually accessing the home page www.harbournetworks.com on the Web server. This program can be regarded as an Application, which we name web_browse_in_url.
Suppose that at some moment the event extraction interface receives the following events from the SNMP agents of the hosts. After formatting, the events are represented as follows:
{
E0=RP.ping.S.fail:t0, at time t0 switch S cannot be pinged from RP,
E1=RP.ping.C.fail:t1, at time t1 host C cannot be pinged from RP,
E2=RP.ping.C.fail:t2, at time t2 host C cannot be pinged from RP,
E3=D.web_browse_in_url.Web_Server.fail:t3, at time t3 the Web server cannot be accessed from host D,
E4=RP.ping.A.fail:t4, at time t4 host A cannot be pinged from RP,
E5=RP.ping.A.fail:t5, at time t5 host A cannot be pinged from RP,
E6=R.down:t6, at time t6 R fails,
E7=RP.web_browse_in_url.web_server.fail:t7, at time t7 the Web server cannot be accessed from host RP,
E8=R.up:t8, at time t8 R resumes operation,
}
4) E0 ... E8 are then fed to the preprocessing module. Processing yields the compressed raw event set below; note that the repeated events (E2, E5) and the paired fault and fault-clear events (E6, E8) have been filtered out:
{
E0=RP.ping.S.fail:t0, at time t0 switch S cannot be pinged from RP,
E1=RP.ping.C.fail:t1, at time t1 host C cannot be pinged from RP,
E3=D.web_browse_in_url.Web_server.fail:t3, at time t3 the Web server cannot be accessed from host D,
E4=RP.ping.A.fail:t4, at time t4 host A cannot be pinged from RP,
E7=RP.web_browse_in_url.Web_Server.fail:t7, at time t7 the Web server cannot be accessed from host RP,
}
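The filtering in this step can be sketched as follows. The preprocessing rules themselves are not given in the text, so the two rules below (drop an event whose text repeats an earlier one, and drop an X.down together with the X.up that clears it) are assumptions chosen to reproduce the example; `preprocess` is a hypothetical name and timestamps are left out of the event text.

```python
def preprocess(events):
    """Compress a list of (event_id, text) pairs given in arrival order."""
    seen = set()
    kept = []
    for event_id, text in events:
        obj, _, state = text.rpartition(".")
        if state == "up":
            # The clearing event cancels the earlier fault for the same object.
            kept = [e for e in kept if e[1] != obj + ".down"]
            seen.discard(obj + ".down")
            continue
        if text in seen:
            continue                      # repeated event
        seen.add(text)
        kept.append((event_id, text))
    return kept


RAW = [
    ("E0", "RP.ping.S.fail"), ("E1", "RP.ping.C.fail"),
    ("E2", "RP.ping.C.fail"), ("E3", "D.web_browse_in_url.Web_Server.fail"),
    ("E4", "RP.ping.A.fail"), ("E5", "RP.ping.A.fail"),
    ("E6", "R.down"), ("E7", "RP.web_browse_in_url.web_server.fail"),
    ("E8", "R.up"),
]
COMPRESSED = preprocess(RAW)
```

With the nine events above, the identifiers remaining in `COMPRESSED` are E0, E1, E3, E4, and E7, matching the compressed set listed in this step.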
5) Real-time correlation analysis of the fault events in the communication network using the integrated field information:
(a) The analysis control engine reads an event from the raw event list: E0=RP.ping.S.fail:t0;
Parsing it yields:
Node objects: source node RP, destination node S,
Application: RP.ping (ping belongs to Applications);
Application state: fail;
E0 is marked and added to the work event list;
(b) The scenario table is opened and queried for scenarios related to RP, S, or ping. The table is found to be empty (the system has just been initialized for the first time and no scenarios have been added yet), and it is closed;
(c) The information model query interface is called to query ping (Application), yielding the relations Applications->Services, Services->Protocols, Protocols->Interface. The topology dependency database is then queried, yielding R->RP, S->R;
(d) The real-time network state detection interface is called to check S.Interface. The state of S.Interface is found to be fail, so the following results can be inferred from the dependencies:
S.Interface.fail==S.down;
S.down=>A.down and C.down;
A.down==A.Interface.fail=>A.application.fail and A.services.fail
C.down==C.Interface.fail=>C.application.fail and C.services.fail;
A.services.fail=>A.DNS.fail=>*.browse_web_in_url.fail
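The node-level part of this inference (S.down implies A.down and C.down) follows mechanically from the topology dependencies. The sketch below propagates a failure through a dependency map of the form X->Y; `unreachable_when_down` is a hypothetical name and the map repeats the set RD from the example.

```python
def unreachable_when_down(depends_on, failed):
    """Return every node that becomes unreachable when `failed` goes down.

    depends_on maps X to Y for each dependency X->Y
    ("to reach X, one must first pass through Y").
    """
    down = {failed}
    changed = True
    while changed:                      # repeat until no new node is added
        changed = False
        for node, via in depends_on.items():
            if via in down and node not in down:
                down.add(node)
                changed = True
    return down - {failed}


RD = {"A": "S", "C": "S", "S": "R", "D": "R", "Internet": "R", "R": "RP"}
AFFECTED = unreachable_when_down(RD, "S")
```

For a failure of S the affected set is A and C; the service-level consequences (for example A.DNS.fail) would come from the information model dependencies rather than from the topology.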
(e) The raw event list is checked starting from E1. E1 is read:
E1=RP.ping.C.fail:t1; parsing it yields:
Node objects: source node RP, destination node C,
Application: RP.ping (ping belongs to Applications);
Application state: fail;
Ping is an application, and it requires the applications, services, protocols, and interfaces on RP and C, and on the topologically depended-on nodes S and R, to be working normally. Since S.down and C.down each imply E1, E1 is correlated; the analysis engine marks E1 and adds it to the work event list;
Reading continues with E3:
E3=D.web_browse_in_url.Web_server.fail:t3; parsing it yields:
Node objects: D, Web_server;
Application: web_browse_in_url;
Application state: fail;
From the result obtained earlier, A.services.fail=>A.DNS.fail=>*.browse_web_in_url.fail, it follows that E3 is also an event related to E1, so E3 is marked and added to the work event list.
Similarly, E4 and E7 can be shown to be events related to E1, so they are marked and added to the work list.
(f) No unmarked events remain in the raw event list, so the output module is called to format and output the result:
Output alarm:
Alarm1=
{
Cause:RP.ping.S.fail:t0
Affects:
[
RP.ping.C.fail:t1
D.web_browse_in_url.Web_server.fail:t3
RP.ping.A.fail:t4
RP.web_browse_in_url.Web_Server.fail:t7
]
}
(g) A new fault-resolution scenario is constructed for these events from the fault feature parameters and the fault propagation path, Scene1: S.down=>{A.down and C.down and *.web_browse_in_url.fail}, and added to the fault scenario table.
(h) The work event list is emptied and these events are removed from the raw event list.
(j) If new events have been added to the raw event list by this time, go to 3); otherwise suspend and wait for new event input;
(k) Suppose new events arrive:
E9=D.web_browse_in_url.Web_Server.fail:t9
E10=A.down:t10;
(l) The event analysis engine reads E9 and queries the event scenario table. It finds that the event feature pattern *.web_browse_in_url.fail in Scene1 matches, so E9 is added to the work event list. The engine then checks whether the raw event list contains the characteristic events A.down and C.down. It reads E10, which satisfies A.down, and adds E10 to the work event list. No other events remain in the list, and one feature, C.down, still needs to be confirmed, so the real-time detection interface is called and finds C.down=true. The scenario is therefore matched and the result S.down is obtained directly, following the scenario-table matching described in step (1).
In the previous step, if the real-time detection of C had returned C.down=false, the scenario above could not be fully trusted; it can instead be assigned a confidence probability, indicating that other causes may also be present.
By using integrated field information, including the hierarchy and correlations of managed objects from the network information model, automatically learned fault-handling history, network operating parameters collected in real time, dynamic network topology information, and event timing characteristics, and by applying dynamic analysis methods in the reasoning process, the invention better solves the problem of fault correlation analysis in complex network environments.
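The scenario match and its confidence can be sketched with regular expressions. Two things here are assumptions rather than part of the description: the informal pattern *.web_browse_in_url.fail is translated into a regular expression that also accepts a target name before ".fail", and the confidence is taken as the fraction of scenario features that were confirmed, since the text gives no formula. `SCENE1` and `match_scenario` are hypothetical names.

```python
import re

# Scene1 from the example, with features written as regular expressions.
SCENE1 = {
    "cause": "S.down",
    "features": [
        r"A\.down",
        r"C\.down",
        r".+\.web_browse_in_url\..*fail",
    ],
}


def match_scenario(scenario, observed):
    """Return (cause, confidence) for a scenario given observed facts.

    observed holds event texts and real-time detection results
    (timestamps removed). Confidence is the confirmed fraction of features.
    """
    confirmed = [
        feature for feature in scenario["features"]
        if any(re.fullmatch(feature, fact) for fact in observed)
    ]
    return scenario["cause"], len(confirmed) / len(scenario["features"])


FULL = match_scenario(
    SCENE1, ["D.web_browse_in_url.Web_Server.fail", "A.down", "C.down"])
PARTIAL = match_scenario(
    SCENE1, ["D.web_browse_in_url.Web_Server.fail", "A.down"])
```

`FULL` corresponds to the case C.down=true, where every feature is confirmed, and `PARTIAL` to the case C.down=false, where the cause S.down is reported with reduced confidence.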
With reference to Figure 1, the system for real-time correlation analysis of network faults comprises:
Analysis control engine: the main executor of the control logic of the analysis process; it calls the other modules and interfaces according to the analysis control engine algorithm to complete the fault correlation analysis;
Information model: describes a series of managed classes corresponding to network protocol objects and device objects, and the various relations between them; the managed classes defined in the information model fall into three broad categories: the topology submodel, the open service submodel, and the network service submodel;
Information model query interface: provides functions for querying managed classes, managed class attributes, and the relations between managed classes in the information model, supplying the analysis control engine with information from the model at run time;
Event extraction interface: receives the network events sent by network devices, including event notifications in protocols such as SNMP Trap, SYSLOG, and CMIP Event Report, converts them to a unified format, and passes them to the preprocessing module;
Preprocessing module: preprocesses received raw events by simple filtering (removing, according to configured rules, events that administrators do not need to see), compression (removing repeated events), redefinition (redefining one or more events as a new event), and similar operations, to assist correlation analysis;
Real-time network parameter detection interface: detects real-time information such as the attributes, performance, and reachability of the devices and services in the network; it is called by the analysis control engine, accepts parameters from the engine that determine which network devices to detect in real time, and returns the results to the engine;
Fault scenario table generation module: builds a fault scenario from a group of events found to be correlated and stores it in the fault scenario table, where the stored scenarios can be looked up quickly by subsequent analysis;
Topology synchronization module: triggered by network topology change events, runs the topology dependency generation algorithm, generates topology dependencies that correctly reflect the current network topology connections, and stores them in the topology dependency database for use by the fault correlation analysis.