CN115408189A - Artificial intelligence and big data combined anomaly detection method and service system - Google Patents

Artificial intelligence and big data combined anomaly detection method and service system

Info

Publication number
CN115408189A
Authority
CN
China
Prior art keywords
record
probability
records
abnormal
local
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202211059848.2A
Other languages
Chinese (zh)
Inventor
易江枫
许闻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN202211059848.2A priority Critical patent/CN115408189A/en
Publication of CN115408189A publication Critical patent/CN115408189A/en
Withdrawn legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079: Root cause analysis, i.e. error or fault diagnosis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00: Error detection; Error correction; Monitoring
    • G06F11/07: Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703: Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0766: Error or fault reporting or storing
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367: Ontology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an anomaly detection method combining artificial intelligence and big data, which comprises the following steps: the cloud server collects system operation records from an HBase database, wherein the operation records comprise user tags, operation time and operation instructions; sorting according to operation time, and generating a record table from the sorted operation records, wherein the record table comprises a plurality of records; inputting the record table into an improved Bayesian probability model, acquiring the probability of each character string in each record, retaining the records with the probability greater than a first threshold value, and discarding the records with the probability less than or equal to the first threshold value; and processing the characteristic number of the reserved record, and inputting the record subjected to the characteristic number processing into a local abnormal factor and knowledge graph combined model for abnormal detection.

Description

Artificial intelligence and big data combined anomaly detection method and service system
Technical Field
The invention belongs to the technical field of information, and particularly relates to an anomaly detection method and a service system combining artificial intelligence and big data.
Background
Currently, with the development of Internet technology, more and more data needs to be stored and processed by the platforms of security organizations and colleges. The existing big data development platform HBase provides a storage mechanism for big data, but data anomalies or attack threats can occur on the platform, and operation and maintenance personnel are required to find and stop them within a short time.
At present, operation and maintenance personnel usually detect some important parameters by a manual detection method, find problems in time, or perform early warning on user operation behaviors by some artificial intelligence algorithms. However, the above detection mechanism is not efficient and has insufficient detection accuracy.
Disclosure of Invention
The invention provides an anomaly detection method and a service system combining artificial intelligence and big data, solves the problems of low efficiency and insufficient precision of an anomaly detection mechanism in the prior art, and effectively improves the efficiency and precision of anomaly detection.
In order to achieve the above object, the present invention provides an anomaly detection method combining artificial intelligence and big data, comprising:
the method comprises the steps that a cloud server collects system operation records from an HBase database, wherein the operation records comprise user labels, operation time and operation instructions;
sorting according to the operation time, and generating a record table by the sorted operation records, wherein the record table comprises a plurality of records;
inputting the record table into an improved Bayesian probability model, acquiring the probability of each character string in each record, retaining the records with the probability greater than a first threshold value, and discarding the records with the probability less than or equal to the first threshold value;
and processing the characteristic number of the reserved record, and inputting the record subjected to the characteristic number processing into a local abnormal factor and knowledge graph combined model for abnormal detection.
Optionally, inputting the record table into an improved bayesian probability model, and obtaining the probability of each character string in each record, including:
calculating the probability $P_s$ of the occurrence of each character in each record:
$P_s = P[s_n = b_n] \cdot u$
wherein u is the probability weight of the occurrence of the current record, s is a character string identifier, b is a next character string identifier, and n is a positive integer;
if the probability $P_s$ of the character is larger than the character threshold value, setting the character as a candidate node;
setting a tree structure, initializing a root node, and sequentially setting each candidate node in the tree structure;
and recursively traversing the candidate nodes to obtain the probability of each character string in each record.
Optionally, recursively traversing the candidate nodes to obtain a probability of each character string in each record, including:
calculating the probability $P_b$ of the next character string after the current character string;
if $P_b$ is larger than the second threshold value, the candidate node corresponding to the character string is reserved, otherwise, the candidate node corresponding to the character string is discarded;
and carrying out weighted average on the reserved candidate node probability, and calculating the probability of the record corresponding to the candidate node.
Optionally, performing feature number processing on the reserved record, including:
obtaining a maximum length of the retained record;
numerically padding the records whose length is less than the maximum length, so that the number of features of every record is the same.
Optionally, inputting the record after feature number processing into a local anomaly factor and knowledge graph joint model for anomaly detection, including:
performing exception screening on the record by using a K-means clustering algorithm;
using a local abnormal factor model to screen local factors for the record after abnormal screening;
generating a knowledge graph based on the records after the local factor screening, wherein the knowledge graph comprises a plurality of triples;
and completing the abnormal reason of the record based on the knowledge graph.
Optionally, the record is screened for abnormalities using a K-means clustering algorithm, including:
initializing a plurality of recorded cluster groups, and presetting m cluster centers in the cluster groups;
setting any node in the cluster group as a first cluster center;
calculating the Euclidean distance from any node to the cluster center;
selecting the node with the maximum Euclidean distance as a second cluster center;
repeating the calculation and selection processes until m cluster centers are selected;
calculating the distance from each record to m cluster centers, finding the cluster center closest to each record, and calculating the gravity center of the cluster group corresponding to the cluster center;
and for each cluster center, calculating the distance between the gravity center of the corresponding cluster group and the cluster center, arranging the distances in descending order, and marking the records whose distance is larger than a preset threshold value as abnormal records.
Optionally, the local anomaly factor model is used to perform local factor screening on the record after anomaly screening, including:
defining a neighborhood size k and a pollution parameter c;
sequentially traversing k and c, and calculating the mean value and the variance of the local outlier factor scores of the points under different values of k and c;
for each c and k, calculating the difference in local anomaly factor scores between the predicted anomaly and the normal point;
selecting the k corresponding to the maximum value in the difference set $T_{c,k}$ as the k value of the local abnormal factor algorithm, and selecting the c value corresponding to the difference set $T_{c,k_{opt}}$ for that k value as the optimal solution of the c value of the local abnormal factor algorithm.
Optionally, based on the knowledge graph, completing the record for the reason of the abnormality, including:
processing the head entity, the tail entity and the relation of the triplets of the knowledge graph by using a pre-training language model to obtain the representation vector and the probability distribution size of each triplet;
obtaining N nearest neighbor nodes of the triple target entity by using the Euclidean distance, wherein the neighbor nodes are the abnormal root cause nodes;
and sequencing the probabilities of the N nearest neighbor nodes to obtain the neighbor node with the maximum probability, and taking the root cause content of the node as the abnormal cause of the record for completing.
The embodiment of the invention also provides an anomaly detection service system combining artificial intelligence and big data, which comprises:
the acquisition unit is used for acquiring system operation records from an HBase database, wherein the operation records comprise user tags, operation time and operation instructions;
the sorting unit is used for sorting according to operation time and generating a record table from the sorted operation records, wherein the record table comprises a plurality of records;
the processing unit is used for inputting the record table into the improved Bayesian probability model, acquiring the probability of each character string in each record, retaining the records with the probability larger than a first threshold value, and discarding the records with the probability smaller than or equal to the first threshold value; and processing the characteristic number of the reserved record, and inputting the record subjected to the characteristic number processing into a local abnormal factor and knowledge graph combined model for abnormal detection.
The embodiment of the invention provides an artificial intelligence and big data combined anomaly detection service system which comprises a memory and a processor, wherein computer executable instructions are stored in the memory, and the processor realizes the method when running the computer executable instructions on the memory.
The method and the system of the embodiment of the invention have the following advantages:
in the embodiment of the invention, the method of combining artificial intelligence and big data is adopted, and the model combining the Bayesian probability model, the local abnormal factor and the knowledge graph is adopted, so that the abnormal record can be efficiently and accurately detected, the root cause analysis of the abnormal record can be carried out, and the operation and maintenance efficiency is greatly improved.
Drawings
FIG. 1 is a flow diagram of a method for anomaly detection with artificial intelligence coupled with big data in one embodiment;
FIG. 2 is a tree structure diagram in one embodiment;
FIG. 3 is a diagram illustrating an embodiment of an anomaly detection service system architecture incorporating artificial intelligence and big data;
FIG. 4 is a diagram illustrating the hardware components of the system in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Fig. 1 is a flowchart in an embodiment of the present invention, and as shown in fig. 1, the embodiment of the present invention includes the following steps:
s101, a cloud server collects system operation records from an HBase database, wherein the operation records comprise user tags, operation time and operation instructions;
Apache HBase is the Hadoop database, a distributed and scalable big data store. HBase is built on top of HDFS, which is a distributed file system. HBase stores operation records and information of various systems, such as user operation logs of enterprise clients and e-commerce platforms. The operation records are stored in HBase sequentially by time and type, and every operation record needs to include a user tag (user ID), an operation time (the time at which a user or an administrator operated or issued a command), and the corresponding operation instruction.
The operation records are massive in volume, and this large data volume makes maintenance, especially manual operation and maintenance, much more difficult, so an appropriate method is needed to reduce the operation and maintenance difficulty.
S102, sorting according to operation time, and generating a record table from the sorted operation records, wherein the record table comprises a plurality of records;
after the operation records are obtained, they are sorted in chronological order, and the operation record table is generated from the sorted sequence, wherein the operation record table comprises a plurality of records stored in sequence.
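Step S102 can be illustrated with a minimal Python sketch. The class name `OperationRecord`, its field names, and the function `build_record_table` are hypothetical and chosen only for illustration; the text specifies only that each record carries a user tag, an operation time, and an operation instruction.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class OperationRecord:
    # Field names are assumptions; the source only names the three items.
    user_tag: str             # user ID
    operation_time: datetime  # when the command was issued
    instruction: str          # the operation instruction text


def build_record_table(records):
    """Return a new list ordered by operation time.

    sorted() is stable, so records with equal timestamps keep
    their original relative order.
    """
    return sorted(records, key=lambda r: r.operation_time)


raw = [
    OperationRecord("u2", datetime(2022, 9, 1, 10, 5), "scan 't1'"),
    OperationRecord("u1", datetime(2022, 9, 1, 9, 30), "put 't1', 'r1', 'cf:a', 'v'"),
    OperationRecord("u3", datetime(2022, 9, 1, 10, 5), "get 't1', 'r1'"),
]
record_table = build_record_table(raw)
```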
S103, inputting the record table into an improved Bayesian probability model, acquiring the probability of each character string in each record, retaining the records with the probability greater than a first threshold value, and discarding the records with the probability less than or equal to the first threshold value;
Bayesian probability is an interpretation of probability provided by Bayesian theory, which defines probability as the degree to which a person believes a proposition. Bayesian theory also holds that Bayes' theorem can be used as a rule for deriving or updating existing degrees of belief in light of new information. When two hypotheses are compared against the same data, hypothesis testing theory, which is based on the frequentist interpretation of probability, allows one model/hypothesis (the null hypothesis) to be rejected or accepted based on the probability of wrongly inferring that the data support the other model/hypothesis. The probability of such an error is called the Type I error, and it requires considering hypothetical data sets, derived from the same data source, that are more extreme than the data actually observed. This approach allows only the conclusion that either the two hypotheses are different or the observed data are a misleading set. By contrast, Bayesian methods are based on the data actually observed and can therefore directly assign posterior probabilities to any number of hypotheses. The cost of this directness is that a probability must be assigned to the parameters of the model representing each hypothesis.
In the embodiment of the invention, an improved Bayesian probability model is adopted; it follows the idea of the traditional Bayesian probability model but uses a different algorithm. The specific steps are as follows:
calculating the probability $P_s$ of each character in each record:
$P_s = P[s_n = b_n] \cdot u$
wherein u is the probability weight of the occurrence of the current record, s is a character string identifier, b is a next character string identifier, and n is a positive integer;
if the probability $P_s$ of the character is larger than the character threshold value, setting the character as a candidate node;
setting a tree structure having a root node, leaf nodes and the like, initializing the root node (root), and sequentially setting each candidate node in the tree structure; that is, the character strings of the candidate nodes are added into the tree. For example, if the character string of the current candidate node is abcd, then a, b, c, and d may be sequentially set in different leaf nodes, as shown in fig. 2.
And recursively traversing the candidate nodes to obtain the probability of each character string in each record.
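The candidate-node and tree-building steps above might be sketched as follows. This is only one possible reading: the text does not say how P[s_n = b_n] is estimated, so the sketch assumes the relative frequency of a character within the record, and the names `TrieNode`, `char_probabilities`, and `insert_candidates` are hypothetical.

```python
from collections import Counter


class TrieNode:
    """One node of the tree; children are keyed by character."""

    def __init__(self):
        self.children = {}
        self.prob = 0.0  # P_s of the character stored at this node


def char_probabilities(record, u=1.0):
    # Assumption: P[s_n = b_n] is the character's relative frequency
    # in the record; u is the record's probability weight.
    counts = Counter(record)
    total = len(record)
    return {ch: (n / total) * u for ch, n in counts.items()}


def insert_candidates(root, record, char_threshold, u=1.0):
    """Chain the characters whose P_s exceeds the threshold under root."""
    probs = char_probabilities(record, u)
    node = root
    for ch in record:
        if probs[ch] > char_threshold:  # the character is a candidate node
            node = node.children.setdefault(ch, TrieNode())
            node.prob = probs[ch]
    return root


root = TrieNode()
insert_candidates(root, "abcd", char_threshold=0.2)
```

With the string abcd, each character has an assumed probability of 0.25, so a, b, c, and d are placed on successive levels of the tree, mirroring the example in the text.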
In addition, recursively traversing the candidate nodes to obtain the probability of each character string in each record may specifically be:
calculating the probability $P_b$ of the next character string after the current character string:
$P_b = P[s_n = b_n \mid s_0] \cdot w$, where w is a weighting factor and $s_0$ is the initially set character string.
If $P_b$ is larger than a second threshold value, the candidate node corresponding to the character string is reserved; otherwise, the candidate node corresponding to the character string is discarded;
and carrying out weighted average on the reserved candidate node probability, and calculating the probability of the record corresponding to the candidate node.
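The thresholding and weighted-average steps can be sketched as below. The sketch assumes that each record has already been reduced to a list of (P_b, weight) pairs for its candidate nodes; the function names, the input layout, and the choice to score a record with no retained candidates as 0.0 are assumptions, not part of the source.

```python
def record_probability(candidates, second_threshold):
    """Weighted mean of the candidate probabilities above the threshold.

    candidates: list of (P_b, weight) pairs for one record.
    Returns 0.0 when no candidate is retained (an assumption).
    """
    kept = [(p, w) for p, w in candidates if p > second_threshold]
    total_weight = sum(w for _, w in kept)
    if total_weight == 0:
        return 0.0
    return sum(p * w for p, w in kept) / total_weight


def filter_records(table, first_threshold, second_threshold):
    """Keep only the records whose probability exceeds the first threshold."""
    scores = {rid: record_probability(c, second_threshold)
              for rid, c in table.items()}
    return {rid: s for rid, s in scores.items() if s > first_threshold}


table = {
    "r1": [(0.9, 1.0), (0.7, 1.0), (0.1, 1.0)],
    "r2": [(0.3, 1.0), (0.2, 2.0)],
}
retained = filter_records(table, first_threshold=0.5, second_threshold=0.25)
```

In this toy input, r1 keeps two candidates and scores about 0.8, so it is retained, while r2 scores 0.3 and is discarded.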
And S104, carrying out characteristic number processing on the reserved records, and inputting the records subjected to the characteristic number processing into a local abnormal factor and knowledge graph combined model for abnormal detection.
The feature number processing on the reserved records may be: obtaining the maximum length of the retained records, and numerically padding the records whose length is less than the maximum length, so that the number of features of every record is the same.
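A minimal sketch of this padding step follows. It assumes each retained record has already been encoded as a list of numbers and that the padding value is 0; the source says only that shorter records are "numerically padded", so both the encoding and the value are assumptions.

```python
def pad_records(records, pad_value=0):
    """Right-pad every numeric feature list to the longest length."""
    max_len = max((len(r) for r in records), default=0)
    return [list(r) + [pad_value] * (max_len - len(r)) for r in records]


padded = pad_records([[3, 1, 4], [1, 5], [9]])
```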
In the embodiment of the invention, the specific steps of the traditional local abnormal factor algorithm are as follows:
for each data point in the data set, calculating the Euclidean distance $dist(d_i, d_j)$ between it and every other data point;
for each data point in the data set, calculating the k-distance $k\_dist(d_i)$, where the k-distance is the distance from the data point to its k-th nearest data point, obtained by sorting the Euclidean distances from the first step in ascending order;
for each data point in the data set, obtaining the k-distance neighborhood $N_k(d_i)$, which is the set of data points whose distance to the data point is not greater than the k-distance;
calculating the k-th reachable distance $reach\_dist_k(d_r, d_i)$, which is the maximum of the k-distance of data point $d_i$ and the distance from data point $d_i$ to data point $d_r$:
$reach\_dist_k(d_r, d_i) = \max\{k\_dist(d_i), dist(d_r, d_i)\}$
calculating the local reachable density, which is the inverse of the average reachable distance from all data points within the k-distance neighborhood of data point $d_i$ to data point $d_i$.
Figure BDA0003823965020000081
calculating the local outlier factor, which is the average of the ratios of the local reachable density of each data point in the k-distance neighborhood of data point $d_i$ to the local reachable density of data point $d_i$.
Figure BDA0003823965020000082
And judging whether the point is an abnormal point or not according to the size of the local outlier factor.
Anomaly detection with the local outlier factor algorithm, also referred to herein as the local abnormal factor algorithm, judges whether a point is abnormal from the size of its local outlier factor value: if the value is much greater than 1, the point is considered an abnormal point; otherwise it is considered a normal point. The algorithm needs to calculate the distance between every pair of data points, so its time complexity is high and its efficiency is low. The algorithm also depends on two hyper-parameters that must be set: the first is the neighborhood size k, which defines the neighborhood over which the local density is computed; the second is the contamination parameter c, which determines the proportion of outliers. The values of these two hyper-parameters are crucial to the algorithm. The embodiment therefore improves the local abnormal factor algorithm in two ways: it reduces the computational cost of the algorithm, and it tunes the two hyper-parameters by a heuristic method.
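The traditional algorithm described above can be sketched in plain Python. The formula images for the local reachable density and the local outlier factor are not reproduced in this text, so the sketch follows the commonly published formulation, in which the reachable distance from point i to a neighbor j uses the k-distance of j. The function name `lof_scores` is hypothetical, and the sketch assumes no duplicate points (which would give a zero average reachable distance).

```python
import math


def lof_scores(points, k):
    """Local outlier factor of every point (standard formulation).

    Assumes 1 <= k <= len(points) - 1 and no duplicate points.
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist, neigh = [], []
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]                 # distance to the k-th nearest point
        k_dist.append(kd)
        neigh.append([j for d, j in others if d <= kd])  # ties are included

    def reach(i, j):
        # Reachable distance from point i to its neighbor j.
        return max(k_dist[j], dist[i][j])

    lrd = []
    for i in range(n):
        mean_reach = sum(reach(i, j) for j in neigh[i]) / len(neigh[i])
        lrd.append(1.0 / mean_reach)          # local reachable density

    return [sum(lrd[j] for j in neigh[i]) / (len(neigh[i]) * lrd[i])
            for i in range(n)]


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
```

For this toy data the four corner points score 1 and the isolated point (5, 5) scores well above 1, which matches the rule that a value much greater than 1 indicates an abnormal point. The nested distance table also shows the quadratic cost that the text identifies as the algorithm's weakness.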
The embodiment of the invention provides an improved local abnormal factor algorithm which comprises the following specific steps:
s1041, performing exception screening on the record by using a K-means clustering algorithm;
specifically, the K-means algorithm comprises the following steps:
initializing a plurality of recorded clusters, and presetting m cluster centers in the clusters, wherein m is a natural number;
setting any node in the cluster group as a first cluster center;
calculating the Euclidean distance Dist(x) from any node to the cluster center;
selecting the node with the maximum Euclidean distance $Dist_{max}$ as the second cluster center;
repeating the calculation and selection process (i.e., calculating the Euclidean distance Dist(x) from the remaining nodes to the cluster center and selecting the node with the maximum Euclidean distance $Dist_{max}$ as the i-th cluster center, where i is a natural number from 1 to m) until m cluster centers are selected;
calculating the distance from each record (the record is used as an input sample) to m cluster centers, finding the closest cluster center of each record, and calculating the gravity center of a cluster group corresponding to the cluster center;
and for each cluster center, calculating the distance between the gravity center of the corresponding cluster group and the cluster center, arranging the distances in descending order, and marking the records whose distance is larger than a preset threshold value as abnormal records.
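The steps of S1041 might be sketched as follows, under two stated assumptions. First, when more than one center has already been chosen, the "distance to the cluster center" is taken as the distance to the nearest chosen center (a common farthest-point heuristic; the text does not specify this). Second, because the text does not say how a flagged distance maps back to individual records, the sketch only reports which cluster groups exceed the threshold. All function names are hypothetical.

```python
import math


def farthest_point_centers(points, m, first=0):
    """Pick m centers: an arbitrary first one, then repeatedly the farthest point."""
    centers = [points[first]]
    while len(centers) < m:
        # Distance of each point to its nearest already-chosen center.
        nearest = [min(math.dist(p, c) for c in centers) for p in points]
        centers.append(points[nearest.index(max(nearest))])
    return centers


def centroid(cluster):
    """Gravity center (coordinate-wise mean) of a non-empty cluster."""
    dims = len(cluster[0])
    return tuple(sum(p[d] for p in cluster) / len(cluster) for d in range(dims))


def screen(points, m, threshold):
    centers = farthest_point_centers(points, m)
    clusters = {i: [] for i in range(m)}
    for p in points:
        idx = min(range(m), key=lambda i: math.dist(p, centers[i]))
        clusters[idx].append(p)  # assign the record to its closest center
    # Distance between each cluster's gravity center and its cluster center.
    shifts = {i: math.dist(centroid(c), centers[i])
              for i, c in clusters.items() if c}
    ranked = sorted(shifts.items(), key=lambda kv: kv[1], reverse=True)
    flagged = [i for i, s in ranked if s > threshold]
    return centers, ranked, flagged


pts = [(0, 0), (0, 1), (1, 0), (10, 10)]
centers, ranked, flagged = screen(pts, m=2, threshold=0.4)
```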
S1042, using a local abnormal factor model to screen local factors for the records after abnormal screening;
defining a neighborhood size k and a pollution parameter c, together with the value ranges of c and k;
sequentially traversing k and c, and calculating the mean value and the variance of the local outlier factor scores of the points under different values of k and c;
for each c and k, calculating the difference in local anomaly factor scores between the predicted anomaly and the normal point;
selecting the k corresponding to the maximum value in the difference set $T_{c,k}$ as the k value of the local abnormal factor algorithm, and selecting the c value corresponding to the difference set $T_{c,k}$ for that k value as the optimal solution of the c value of the local abnormal factor algorithm.
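The hyper-parameter search of S1042 can be illustrated with a small grid search. The exact definition of the difference set is not given in the text, so this sketch assumes that the difference for a pair (c, k) is the mean score of the top ceil(c·n) points (the predicted anomalies) minus the mean score of the remaining points, and it selects the pair with the largest difference. It uses the standard local outlier factor formulation; `lof_scores`, `score_gap`, and `tune` are hypothetical names, and c is assumed small enough that some points remain normal.

```python
import math


def lof_scores(points, k):
    """Standard local outlier factor; assumes no duplicate points."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    k_dist, neigh = [], []
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[k - 1][0]
        k_dist.append(kd)
        neigh.append([j for d, j in others if d <= kd])
    lrd = [len(neigh[i]) / sum(max(k_dist[j], dist[i][j]) for j in neigh[i])
           for i in range(n)]
    return [sum(lrd[j] for j in neigh[i]) / (len(neigh[i]) * lrd[i])
            for i in range(n)]


def score_gap(scores, c):
    """Assumed difference: mean score of predicted anomalies minus the rest."""
    n_out = max(1, math.ceil(c * len(scores)))
    ranked = sorted(scores, reverse=True)
    outliers, normal = ranked[:n_out], ranked[n_out:]
    return sum(outliers) / len(outliers) - sum(normal) / len(normal)


def tune(points, ks, cs):
    """Traverse every (c, k) pair and return the pair with the largest gap."""
    gaps = {(c, k): score_gap(lof_scores(points, k), c) for c in cs for k in ks}
    best_c, best_k = max(gaps, key=gaps.get)
    return best_k, best_c, gaps


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
best_k, best_c, gaps = tune(points, ks=[2, 3], cs=[0.2, 0.4])
```

The text also mentions the mean and variance of the scores; a variance-normalized difference could replace `score_gap` without changing the surrounding search.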
S1043, generating a knowledge graph based on the records screened by the local factors, wherein the knowledge graph comprises a plurality of triples;
the knowledge graph is used as an efficient organization form of security knowledge such as entities, concepts and the like, can exert the advantage of knowledge integration, organizes scattered multi-source heterogeneous security data, and provides support in data analysis and knowledge reasoning for threat modeling, risk analysis, attack reasoning and the like of a network security space.
A knowledge graph is a structured semantic knowledge base used for rapidly describing concepts and mutual relations in the physical world. The knowledge graph is represented in a form of a triple (h, r, t), and the h, r and t respectively represent a head entity, a relation and a tail entity. Although a large amount of factual data are stored in the existing knowledge graph, a large amount of entity-to-entity implicit information is still not reflected, which brings great influence on further analysis and modeling of the knowledge graph.
The knowledge graph completion task mainly infers relations between entities, such as the relation between an abnormal phenomenon and its root cause, from the relations among the data already in the knowledge graph, so that the knowledge graph becomes more complete. At present, there are two main types of models for the knowledge graph completion task: distance-based models and tensor-decomposition-based models. Distance-based models such as TransE, TransH, and TransR use the Minkowski distance to measure the plausibility of triples; although they perform well on knowledge graph completion tasks, they cannot handle complex relation patterns (one-to-many, many-to-one, etc.). The more recent RotatE model handles complex relation patterns well and performs well on the completion task, but it requires negative sampling during modeling, which reduces the efficiency of the completion task. Tensor-decomposition-based models treat the knowledge graph as a partially observed third-order tensor, so that knowledge graph completion is modeled as a tensor completion problem.
In the embodiment of the present invention, each triplet is composed of a head entity, a relationship and a tail entity, and the triplet is denoted as $(e_i, r_j, e_k)$.
And S1044, completing the abnormal reasons of the record based on the knowledge graph.
Processing the head entity, the tail entity and the relation of the triplets of the knowledge graph by using a pre-training language model to obtain the representation vector and the probability distribution size of each triplet;
Training on the triplets with the pre-trained language model BERT yields a vector representation $q_i = f(c_i)$ of each row and its probability vector $P_q$.
Obtaining N nearest neighbor nodes of the triple target entity by using Euclidean distance, wherein the neighbor nodes are the abnormal root cause nodes;
For the text description vector representation of each triple, the Euclidean distance d between it and the vector representation of each row of triples is calculated, the N nearest neighbor points are found and arranged in ascending order of distance, and the corresponding targets are found.
A probability distribution is then calculated: each of the nearest neighbors contributes to the distribution of its target, and the contributions of neighbors with the same target are aggregated.
Figure BDA0003823965020000111
The probabilities of the N nearest neighbor nodes are sorted to obtain the neighbor node with the maximum probability, and the root cause content of that node is used as the abnormal cause with which the record is completed.
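The nearest-neighbor completion of S1044 might look like the sketch below. The formula image for the probability distribution is not reproduced in this text, so the weighting used here, a normalized exp(-distance) per neighbor, is an assumption, as are the function name `complete_root_cause`, the toy vectors, and the root cause labels. In practice the vectors would come from the pre-trained language model rather than being written by hand.

```python
import math
from collections import defaultdict


def complete_root_cause(query, memory, n_neighbors):
    """Return the most probable root cause and the aggregated distribution.

    memory: list of (vector, target) pairs, one per stored triple.
    """
    # N nearest neighbors by Euclidean distance, in ascending order.
    nearest = sorted(memory, key=lambda item: math.dist(query, item[0]))[:n_neighbors]
    # Assumed weighting: closer neighbors receive more probability mass.
    weights = [math.exp(-math.dist(query, vec)) for vec, _ in nearest]
    total = sum(weights)
    distribution = defaultdict(float)
    for (_, target), w in zip(nearest, weights):
        distribution[target] += w / total  # aggregate neighbors sharing a target
    best = max(distribution, key=distribution.get)
    return best, dict(distribution)


memory = [
    ((0.0, 0.0), "disk full"),
    ((0.1, 0.0), "disk full"),
    ((1.0, 1.0), "network timeout"),
    ((5.0, 5.0), "bad credentials"),
]
best, distribution = complete_root_cause((0.0, 0.1), memory, n_neighbors=3)
```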
As shown in fig. 3, an embodiment of the present invention further provides an anomaly detection service system 30 combining artificial intelligence and big data, including:
the acquisition unit 31 is configured to acquire a system operation record from the HBase database, where the operation record includes a user tag, operation time, and an operation instruction;
Apache HBase is the Hadoop database, a distributed and scalable big data store. HBase is built on top of HDFS, which is a distributed file system. HBase stores operation records and information of various systems, such as user operation logs of enterprise clients and e-commerce platforms. The operation records are stored in HBase sequentially by time and type, and every operation record needs to include a user tag (user ID), an operation time (the time at which a user or an administrator operated or issued a command), and the corresponding operation instruction.
The operation records are massive in volume, and this large data volume makes maintenance, especially manual operation and maintenance, much more difficult, so an appropriate method is needed to reduce the operation and maintenance difficulty.
A sorting unit 32, configured to sort according to operation time, and generate a record table from the sorted operation records, where the record table includes multiple records;
after the operation records are obtained, they are sorted in chronological order, and the operation record table is generated from the sorted sequence, wherein the record table comprises a plurality of records stored in sequence.
The processing unit 33 is configured to input the record table to the improved bayesian probability model, obtain the probability of each character string in each record, retain the records with the probabilities greater than the first threshold, and discard the records with the probabilities less than or equal to the first threshold; and processing the characteristic number of the reserved record, and inputting the record subjected to the characteristic number processing into a local abnormal factor and knowledge graph combined model for abnormal detection.
Bayesian probability is an interpretation of probability provided by Bayesian theory, in which probability is defined as the degree to which a person believes a proposition. Bayesian theory also holds that Bayes' theorem can be used as a rule to derive or update existing degrees of belief in light of new information. When two hypotheses are compared on the same data, frequentist hypothesis testing rejects or accepts one model/hypothesis (the null hypothesis) based on the probability of wrongly inferring that the data support the other model/hypothesis. The probability of such an error is called the Type I error, and computing it requires considering hypothetical data sets, derived from the same data source, that are more extreme than the data actually observed. This approach allows the conclusion that either the two hypotheses are different or the observed data are a misleading set. Bayesian methods, by contrast, are based on the actually observed data and can therefore directly assign posterior probabilities to any number of hypotheses. The cost of this direct approach is that a prior probability must be assigned to the parameters of the model representing each hypothesis.
In the embodiment of the invention, the improved Bayesian probability model adopts the idea of the traditional Bayesian probability model but uses a different algorithm. The specific steps are as follows:
calculating the probability $P_s$ of each character occurrence in each record:
$P_s = P[s_n = b_n] \cdot u$
wherein u is the probability weight of the occurrence of the current record, s is a character string identifier, b is a next character string identifier, and n is a positive integer;
if the probability $P_s$ of the character is larger than the character threshold value, setting the character as a candidate node;
setting a tree structure that has a root node, leaf nodes, and so on, initializing the root node root, and placing each candidate node in the tree structure in turn; that is, the character strings of the candidate nodes are added to the tree. For example, if the character string of the current candidate node is abcd, then a, b, c and d may be placed in turn in different leaf nodes.
And recursively traversing the candidate nodes to obtain the probability of each character string in each record.
In addition, recursively traversing the candidate nodes to obtain the probability of each character string in each record may specifically be:
calculating the probability $P_b$ of the next character string after the current character string:
$P_b = P[s_n = b_n \mid s_0] \cdot w$, where w is a weighting factor and $s_0$ is the initial character string.
If $P_b$ is larger than a second threshold value, the candidate node corresponding to the character string is retained; otherwise, the candidate node corresponding to the character string is discarded;
and computing a weighted average of the retained candidate node probabilities to obtain the probability of the record corresponding to the candidate nodes.
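The steps above leave several details open, so the following is only one possible reading, sketched under explicit assumptions: $P[s_n = b_n]$ is estimated as a relative character frequency over all records, the conditional probability is estimated from adjacent character pairs, a single global value is used for each of the weights u and w, and the "weighted average" uses equal weights. All function names and the sample strings are hypothetical.

```python
from collections import Counter, defaultdict


def char_probabilities(records, u=1.0):
    """P_s per character: relative frequency times the weight u (assumed global)."""
    counts = Counter(ch for rec in records for ch in rec)
    total = sum(counts.values())
    return {ch: (n / total) * u for ch, n in counts.items()}


def transition_probabilities(records, w=1.0):
    """P_b per (current, next) pair: conditional frequency times the weight w."""
    pairs = defaultdict(Counter)
    for rec in records:
        for cur, nxt in zip(rec, rec[1:]):
            pairs[cur][nxt] += 1
    return {(cur, nxt): (n / sum(nxts.values())) * w
            for cur, nxts in pairs.items() for nxt, n in nxts.items()}


def build_tree(records, p_s, char_threshold):
    """Insert the candidate characters of each record into a nested-dict tree."""
    root = {}
    for rec in records:
        node = root
        for ch in rec:
            if p_s[ch] > char_threshold:  # only candidate characters become nodes
                node = node.setdefault(ch, {})
    return root


def traverse(node, prefix, kept, p_b, second_threshold, out):
    """Recursively walk the tree, discarding nodes whose P_b is not above the threshold."""
    if prefix:
        # Equal-weight average of the retained probabilities along this path.
        out[prefix] = sum(kept) / len(kept) if kept else 0.0
    for ch, child in node.items():
        if not prefix:
            traverse(child, ch, kept, p_b, second_threshold, out)
            continue
        prob = p_b.get((prefix[-1], ch), 0.0)
        if prob > second_threshold:
            traverse(child, prefix + ch, kept + [prob], p_b, second_threshold, out)


# Illustrative data only.
records = ["abcd", "abce", "abcd", "xbzq"]
p_s = char_probabilities(records)
p_b = transition_probabilities(records)
tree = build_tree(records, p_s, char_threshold=0.1)
string_probs = {}
traverse(tree, "", [], p_b, 0.3, string_probs)
```

A nested dictionary is used for the tree because it keeps the sketch short; a dedicated node class would serve equally well.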
The feature number processing on the retained records may be: obtaining the maximum length of the retained records; and numerically padding the records whose length is less than the maximum length, so that every record has the same number of features.
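The padding step can be illustrated with a short sketch. It assumes each retained record has already been encoded as a list of numbers, and it pads with 0 on the right; both the encoding and the padding value are assumptions, since the text only says the records are numerically padded.

```python
def pad_records(encoded, pad_value=0):
    """Right-pad every encoded record to the length of the longest one."""
    max_len = max((len(r) for r in encoded), default=0)
    return [list(r) + [pad_value] * (max_len - len(r)) for r in encoded]


# Illustrative data only.
padded = pad_records([[1, 2, 3], [4], [5, 6]])
```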
The embodiment of the invention provides an improved local abnormal factor algorithm which comprises the following specific steps:
performing exception screening on the record by using a K-means clustering algorithm;
specifically, the K-means algorithm comprises the following steps:
initializing a plurality of clusters of records, and presetting m cluster centers for the clusters, wherein m is a natural number;
setting any node in the cluster group as the first cluster center;
calculating the Euclidean distance $Dist(x)$ from each node to the cluster center;
selecting the node with the maximum Euclidean distance $Dist_{max}$ as the second cluster center;
repeating the calculation and selection process (that is, calculating the Euclidean distance $Dist(x)$ from the nodes to the cluster center, and selecting the node with the maximum Euclidean distance $Dist_{max}$ as the i-th cluster center, where i is a natural number from 1 to m) until m cluster centers are selected;
calculating the distance from each record (the record is used as an input sample) to the m cluster centers, finding the closest cluster center for each record, and calculating the center of gravity of the cluster group corresponding to each cluster center;
and, for each cluster center, calculating the distance between the center of gravity of the corresponding cluster and the cluster center, arranging the distances in descending order, and marking those with a distance larger than a preset threshold as abnormal records.
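The center selection and screening steps can be sketched as below. This is a hedged illustration: records are assumed to be numeric tuples, the distance from a node to "the cluster center" is read as the distance to its nearest already-selected center (a farthest-first rule), and the function names and sample points are hypothetical.

```python
import math


def farthest_first_centers(points, m, first_index=0):
    """Pick m centers: start from one point, then repeatedly take the farthest point."""
    centers = [points[first_index]]
    while len(centers) < m:
        # Distance from each point to its nearest already-selected center.
        nearest = [min(math.dist(p, c) for c in centers) for p in points]
        centers.append(points[nearest.index(max(nearest))])
    return centers


def assign(points, centers):
    """Group every point with its closest cluster center."""
    groups = [[] for _ in centers]
    for p in points:
        idx = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
        groups[idx].append(p)
    return groups


def centroid(group):
    """Center of gravity of a non-empty group of points."""
    return tuple(sum(col) / len(group) for col in zip(*group))


def center_shifts(groups, centers):
    """(distance between center of gravity and cluster center, cluster index), descending."""
    shifts = [(math.dist(centroid(g), c), i)
              for i, (g, c) in enumerate(zip(groups, centers)) if g]
    return sorted(shifts, reverse=True)


# Illustrative data only.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (30, 30)]
centers = farthest_first_centers(points, m=3)
groups = assign(points, centers)
shifts = center_shifts(groups, centers)
flagged = [i for d, i in shifts if d > 0.49]  # 0.49 is an arbitrary example threshold
```

Note that `math.dist` requires Python 3.8 or later.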
Using a local outlier factor (LOF, also called local abnormal factor) model to perform local factor screening on the records that remain after the abnormality screening;
defining a neighborhood size k, a contamination parameter c, and the value ranges of c and k;
sequentially traversing k and c, and calculating the mean value and the variance of the local outlier factor scores of the points under different values of k and c;
for each c and k, calculating the difference in local outlier factor scores between the predicted abnormal points and the normal points;
selecting the k corresponding to the maximum value in the difference set $T_{c,k}$ as the k value of the local outlier factor algorithm, and selecting the c value corresponding to the difference set $T_{c,k}$ for that k value as the optimal c value of the local outlier factor algorithm.
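A parameter search of this kind can be sketched in pure Python. Several points are assumptions rather than statements from the text: the LOF computation is a simplified variant that uses exactly k neighbors per point, and the difference $T_{c,k}$ is taken as a standardized difference between the mean scores of the predicted abnormal and normal points, using their variances. Function names and sample points are hypothetical.

```python
import math
from statistics import mean, pvariance


def lof_scores(points, k):
    """Simplified local outlier factor with exactly k neighbors per point (k < len(points))."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    neigh, kdist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neigh.append(order[:k])
        kdist.append(dist[i][order[k - 1]])
    lrd = []  # local reachability density
    for i in range(n):
        reach = [max(kdist[j], dist[i][j]) for j in neigh[i]]
        lrd.append(len(reach) / max(sum(reach), 1e-12))
    return [mean(lrd[j] for j in neigh[i]) / lrd[i] for i in range(n)]


def difference_statistic(scores, c):
    """Assumed form of T_{c,k}: standardized gap between abnormal and normal scores."""
    n = len(scores)
    n_out = min(n - 1, max(1, math.floor(c * n)))  # keep both groups non-empty
    ranked = sorted(scores, reverse=True)
    out, inl = ranked[:n_out], ranked[n_out:]
    spread = math.sqrt((pvariance(out) + pvariance(inl)) / n_out)
    return (mean(out) - mean(inl)) / max(spread, 1e-12)


def select_k_and_c(points, k_values, c_values):
    """Traverse k and c and return (T, k, c) for the largest difference statistic."""
    best = None
    for k in k_values:
        scores = lof_scores(points, k)
        for c in c_values:
            t = difference_statistic(scores, c)
            if best is None or t > best[0]:
                best = (t, k, c)
    return best


# Illustrative data only: a small square of points plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (8, 8)]
scores = lof_scores(points, k=2)
best_t, best_k, best_c = select_k_and_c(points, k_values=[2, 3], c_values=[0.1, 0.2])
```

In practice a library implementation of LOF would normally replace `lof_scores`; the pure-Python version is used here only so the sketch has no external dependencies.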
Generating a knowledge graph based on the records screened by the local factors, wherein the knowledge graph comprises a plurality of triples;
As an efficient way of organizing security knowledge such as entities and concepts, the knowledge graph can exploit the advantages of knowledge integration, organize scattered multi-source heterogeneous security data, and support data analysis and knowledge reasoning for threat modeling, risk analysis, attack reasoning and the like in the network security space.
A knowledge graph is a structured semantic knowledge base used for rapidly describing concepts and their mutual relations in the physical world. The knowledge graph is represented in the form of triples (h, r, t), where h, r and t respectively represent a head entity, a relation and a tail entity. Although a large amount of fact data is stored in existing knowledge graphs, much implicit entity-to-entity information is still not reflected, which significantly affects further analysis and modeling of the knowledge graph.
The knowledge graph completion task mainly infers relations between entities, such as the relation between an abnormal phenomenon and its root cause, from the relations among the data already in the knowledge graph, so that the knowledge graph becomes more complete. At present, there are two main types of models for the knowledge graph completion task: distance-based models and tensor-decomposition-based models. Distance-based models (such as TransE, TransH and TransR) use the Minkowski distance to measure the plausibility of triples. Although such models perform well in the knowledge graph completion task, they cannot handle complex relation patterns (one-to-many, many-to-one, etc.). The more recent RotatE model handles complex relation patterns well and is effective in the knowledge graph completion task, but it requires negative sampling during modeling, which reduces the efficiency of the completion task. Tensor-decomposition-based models regard the knowledge graph as a partially observed third-order tensor, so that knowledge graph completion is modeled as a tensor completion problem.
In the embodiment of the present invention, each triple is composed of a head entity, a relation and a tail entity, and the triple is denoted $(e_i, r_j, e_k)$.
Completing the anomaly causes of the records based on the knowledge graph.
Processing the head entity, the tail entity and the relation of the triples of the knowledge graph by using a pre-trained language model to obtain the representation vector and the probability distribution of each triple;
processing the triples with the pre-trained language model BERT to obtain the vector representation $q_i = f(c_i)$ of each row and its probability vector $P_q$.
Obtaining the N nearest neighbor nodes of the target entity of the triple by using the Euclidean distance, wherein the neighbor nodes are the abnormal root cause nodes;
for the text description vector representation of each triple, calculating the distance d between it and the vector representation of each row by using the Euclidean distance, finding the N nearest neighbors, arranged in ascending order of distance, and finding the corresponding targets.
A probability distribution is then calculated: the distribution over targets is computed from the k nearest neighbors, and the probabilities of identical targets are aggregated.
Figure BDA0003823965020000161
The probabilities of the N nearest neighbor nodes are sorted to obtain the neighbor node with the maximum probability, and the root cause content of that node is taken as the anomaly cause of the record to be completed.
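The nearest-neighbor completion step can be sketched as follows. The probability formula itself is not reproduced in this text, so the weighting used here (a softmax over negative Euclidean distances, summed per target) is an assumption. The short tuples stand in for BERT representation vectors, and the function name, the `memory` structure, and the root cause labels are hypothetical.

```python
import math
from collections import defaultdict


def complete_root_cause(query, memory, n_neighbors=3):
    """Return the most probable root cause among the N nearest stored vectors.

    memory is a list of (vector, root_cause) pairs.
    """
    nearest = sorted(memory, key=lambda item: math.dist(query, item[0]))[:n_neighbors]
    # Assumed weighting: closer neighbors get exponentially larger weight.
    weights = [math.exp(-math.dist(query, vec)) for vec, _ in nearest]
    total = sum(weights)
    probs = defaultdict(float)
    for (_, target), wt in zip(nearest, weights):
        probs[target] += wt / total  # aggregate neighbors that share a target
    best = max(probs, key=probs.get)
    return best, dict(probs)


# Illustrative data only.
memory = [
    ((0.0, 0.0), "disk full"),
    ((0.0, 1.0), "disk full"),
    ((5.0, 5.0), "network timeout"),
    ((6.0, 5.0), "network timeout"),
]
cause, distribution = complete_root_cause((0.2, 0.1), memory, n_neighbors=3)
```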
The embodiment of the invention provides an artificial intelligence and big data combined anomaly detection service system which comprises a memory and a processor, wherein a computer executable instruction is stored on the memory, and the processor realizes the method when running the computer executable instruction on the memory.
The method and the system of the embodiment of the invention have the following advantages:
In the embodiment of the invention, artificial intelligence is combined with big data, and a model combining a Bayesian probability model, local abnormal factors and a knowledge graph is adopted, so that abnormal records can be detected efficiently and accurately, root cause analysis of the abnormal records can be carried out, and the operation and maintenance efficiency is greatly improved.
Embodiments of the present invention also provide a computer-readable storage medium, on which computer-executable instructions are stored, where the computer-executable instructions are used to execute the method in the foregoing embodiments.
FIG. 4 is a diagram illustrating the hardware components of the system in one embodiment. It will be appreciated that fig. 4 only shows a simplified design of the system. In practical applications, the systems may also respectively include other necessary elements, including but not limited to any number of input/output systems, processors, controllers, memories, etc., and all systems that can implement the big data management method of the embodiments of the present application are within the protection scope of the present application.
The memory includes, but is not limited to, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM), or a compact disc read-only memory (CD-ROM), and is used for storing instructions and data.
The input system is for inputting data and/or signals and the output system is for outputting data and/or signals. The output system and the input system may be separate devices or may be an integral device.
The processor may include one or more processors, for example, one or more Central Processing Units (CPUs), and in the case of one CPU, the CPU may be a single-core CPU or a multi-core CPU. The processor may also include one or more special purpose processors, which may include GPUs, FPGAs, etc., for accelerated processing.
The memory is used to store program codes and data of the network device.
The processor is used for calling the program codes and data in the memory and executing the steps in the method embodiment. For details, reference may be made to the description in the method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed system and method may be implemented in other ways. For example, the division of the unit is only one logical function division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. The shown or discussed mutual coupling, direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, systems or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the present application are all or partially generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable system. The computer instructions may be stored on or transmitted over a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optic, digital Subscriber Line (DSL)), or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that includes one or more of the available media. The usable medium may be a read-only memory (ROM), or a Random Access Memory (RAM), or a magnetic medium, such as a floppy disk, a hard disk, a magnetic tape, a magnetic disk, or an optical medium, such as a Digital Versatile Disk (DVD), or a semiconductor medium, such as a Solid State Disk (SSD).
The above is only a specific embodiment of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think of various equivalent modifications or substitutions within the technical scope of the present application, and these modifications or substitutions should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. An anomaly detection method combining artificial intelligence and big data is characterized by comprising the following steps:
the cloud server collects system operation records from an HBase database, wherein the operation records comprise user tags, operation time and operation instructions;
sorting according to the operation time, and generating a record table by the sorted operation records, wherein the record table comprises a plurality of records;
inputting the record table into an improved Bayesian probability model, acquiring the probability of each character string in each record, retaining the records with the probability greater than a first threshold value, and discarding the records with the probability less than or equal to the first threshold value;
and processing the characteristic number of the reserved record, and inputting the record subjected to the characteristic number processing into a local abnormal factor and knowledge graph combined model for abnormal detection.
2. The method of claim 1, wherein inputting the record table into a modified bayesian probability model to obtain the probability of each string in each record comprises:
calculating the probability $P_s$ of each character occurrence in each record:
$P_s = P[s_n = b_n] \cdot u$
wherein u is the probability weight of the occurrence of the current record, s is a character string identifier, b is a next character string identifier, and n is a positive integer;
if the probability $P_s$ of the character is larger than the character threshold value, setting the character as a candidate node;
setting a tree structure, initializing a root node, and sequentially setting each candidate node in the tree structure;
and recursively traversing the candidate nodes to obtain the probability of each character string in each record.
3. The method of claim 2, wherein recursively traversing the candidate nodes to obtain a probability of each string in each record comprises:
calculating the probability $P_b$ of the next character string after the current character string;
if $P_b$ is larger than the second threshold value, the candidate node corresponding to the character string is retained, otherwise, the candidate node corresponding to the character string is discarded;
and carrying out weighted average on the reserved candidate node probability, and calculating the probability of the record corresponding to the candidate node.
4. The method of claim 1, wherein performing the feature number processing on the retained record comprises:
obtaining a maximum length of the retained record;
and performing numerical padding on the records whose length is smaller than the maximum length, so that the number of features of each record is the same.
5. The method of claim 1, wherein inputting the feature processed record into a local anomaly factor and knowledge graph joint model for anomaly detection comprises:
performing exception screening on the record by using a K-means clustering algorithm;
using a local abnormal factor model to screen local factors for the record after abnormal screening;
generating a knowledge graph based on the records after the local factor screening, wherein the knowledge graph comprises a plurality of triples;
and completing the abnormal reasons of the records based on the knowledge graph.
6. The method of claim 5, wherein the records are screened for abnormalities using a K-means clustering algorithm comprising:
initializing a plurality of recorded cluster groups, and presetting m cluster centers in the cluster groups;
setting any node in the cluster group as a first cluster center;
calculating the Euclidean distance from any node to the cluster center;
selecting the node with the maximum Euclidean distance as a second cluster center;
repeating the calculation and selection processes until m cluster centers are selected;
calculating the distance from each record to m cluster centers, finding the cluster center closest to each record, and calculating the gravity center of the cluster group corresponding to the cluster center;
and for each cluster center, calculating the distance between the gravity center of the cluster corresponding to the cluster center and the cluster center, arranging the distances in the descending order, and setting the distance larger than a preset threshold value as an abnormal record.
7. The method of claim 5, wherein the locally factor screening of the abnormality screened records using a locally abnormal factor model comprises:
defining a neighborhood size k and a contamination parameter c;
sequentially traversing k and c, and calculating the mean value and the variance of the local outlier factor scores of the points under different values of k and c;
for each c and k, calculating the difference in local anomaly factor scores between the predicted anomaly and the normal point;
selecting the k corresponding to the maximum value in the difference set $T_{c,k}$ as the k value of the local abnormal factor algorithm, and selecting the c value corresponding to the difference set $T_{c,k,opt}$ for that k value as the optimal c value of the local abnormal factor algorithm.
8. The method of claim 5, wherein complementing the record for a cause of the anomaly based on the knowledge-graph comprises:
processing the head entity, the tail entity and the relation of the triplets of the knowledge graph by using a pre-training language model to obtain the representation vector and the probability distribution size of each triplet;
obtaining N nearest neighbor nodes of the triple target entity by using Euclidean distance, wherein the neighbor nodes are the abnormal root cause nodes;
and sequencing the probabilities of the N nearest neighbor nodes to obtain the neighbor node with the maximum probability, and taking the root cause content of the node as the abnormal cause of the record to be completed.
9. An anomaly detection service system combining artificial intelligence and big data, characterized by comprising:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring system operation records from an HBase database, and the operation records comprise user tags, operation time and operation instructions;
the sorting unit is used for sorting according to operation time and generating a record table from the sorted operation records, wherein the record table comprises a plurality of records;
the processing unit is used for inputting the record table into the improved Bayesian probability model, acquiring the probability of each character string in each record, retaining the records with the probability larger than a first threshold value, and discarding the records with the probability smaller than or equal to the first threshold value; and processing the characteristic number of the reserved record, and inputting the record subjected to the characteristic number processing into a local abnormal factor and knowledge graph combined model for abnormal detection.
10. An artificial intelligence and big data combined anomaly detection service system, comprising a memory and a processor, wherein the memory stores computer-executable instructions, and the processor executes the computer-executable instructions on the memory to realize the method of any one of claims 1 to 8.
CN202211059848.2A 2022-08-30 2022-08-30 Artificial intelligence and big data combined anomaly detection method and service system Withdrawn CN115408189A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211059848.2A CN115408189A (en) 2022-08-30 2022-08-30 Artificial intelligence and big data combined anomaly detection method and service system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211059848.2A CN115408189A (en) 2022-08-30 2022-08-30 Artificial intelligence and big data combined anomaly detection method and service system

Publications (1)

Publication Number Publication Date
CN115408189A true CN115408189A (en) 2022-11-29

Family

ID=84163470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211059848.2A Withdrawn CN115408189A (en) 2022-08-30 2022-08-30 Artificial intelligence and big data combined anomaly detection method and service system

Country Status (1)

Country Link
CN (1) CN115408189A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115912359A (en) * 2023-02-23 2023-04-04 豪派(陕西)电子科技有限公司 Digitalized potential safety hazard identification, investigation and treatment method based on big data

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115912359A (en) * 2023-02-23 2023-04-04 豪派(陕西)电子科技有限公司 Digitalized potential safety hazard identification, investigation and treatment method based on big data
CN115912359B (en) * 2023-02-23 2023-07-25 豪派(陕西)电子科技有限公司 Digital potential safety hazard identification, investigation and treatment method based on big data

Similar Documents

Publication Publication Date Title
CA3088899C (en) Systems and methods for preparing data for use by machine learning algorithms
CN108475287B (en) Outlier detection for streaming data
CN111694879B (en) Multielement time sequence abnormal mode prediction method and data acquisition monitoring device
Chen et al. Entity embedding-based anomaly detection for heterogeneous categorical events
BR102018009859A2 (en) METHOD AND SYSTEM FOR DATA-BASED OPTIMIZATION OF PERFORMANCE INDICATORS IN MANUFACTURING AND PROCESS INDUSTRIES
CN111612041B (en) Abnormal user identification method and device, storage medium and electronic equipment
Liu et al. Categorization and construction of rule based systems
KR101965277B1 (en) System and method for analysis of hypergraph data and computer program for the same
US11645500B2 (en) Method and system for enhancing training data and improving performance for neural network models
KR20220133914A (en) Efficient ground truth annotation
Voznica et al. Deep learning from phylogenies to uncover the epidemiological dynamics of outbreaks
CN113554175B (en) Knowledge graph construction method and device, readable storage medium and terminal equipment
KR20230031889A (en) Anomaly detection in network topology
CN114168608A (en) Data processing system for updating knowledge graph
CN112162860A (en) CPU load trend prediction method based on IF-EMD-LSTM
Kitonyi et al. Hybrid gradient descent grey wolf optimizer for optimal feature selection
CN115408189A (en) Artificial intelligence and big data combined anomaly detection method and service system
Smith TreeSearch: morphological phylogenetic analysis in R
CN114880482A (en) Graph embedding-based relation graph key personnel analysis method and system
Zhou et al. On the opportunities of green computing: A survey
US20220051126A1 (en) Classification of erroneous cell data
CN117421171A (en) Big data task monitoring method, system, device and storage medium
CN114925210B (en) Knowledge graph construction method, device, medium and equipment
Gias et al. Samplehst: Efficient on-the-fly selection of distributed traces
US20230018525A1 (en) Artificial Intelligence (AI) Framework to Identify Object-Relational Mapping Issues in Real-Time

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20221129
