CN111800312B - Message content analysis-based industrial control system anomaly detection method and system - Google Patents

Publication number
CN111800312B
Authority
CN
China
Prior art keywords
message
cluster
detection
model
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010578994.0A
Other languages
Chinese (zh)
Other versions
CN111800312A (en
Inventor
王大秋
赵海江
蒲蔚若
张明星
顾俊杰
廖砚
朱小勇
肖波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nuclear Power Institute of China
Original Assignee
Nuclear Power Institute of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nuclear Power Institute of China
Priority to CN202010578994.0A
Publication of CN111800312A
Application granted
Publication of CN111800312B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/02 Capturing of monitoring data
    • H04L43/028 Capturing of monitoring data by filtering
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/16 Threshold monitoring

Landscapes

  • Engineering & Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Environmental & Geological Engineering (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a method and system for detecting anomalies in an industrial control system based on message content analysis. The method comprises: step one, collecting data messages in the industrial control system and mapping the messages to corresponding feature vectors; step two, constructing prototype models of the network messages from an input data set formed by the feature vectors; step three, analyzing the occurrence frequency of the features in each prototype model and filtering rare features from the model using a frequency threshold to obtain a detection model; and step four, performing anomaly detection for the industrial control system using the model optimized in step three and the detection message acquired in step one. The invention addresses the limitation that the binary protocols commonly used in industrial computer networks impose on traditional anomaly detection methods.

Description

Message content analysis-based industrial control system anomaly detection method and system
Technical Field
The invention relates to the technical field of industrial control system detection, and in particular to a method and system for detecting anomalies in an industrial control system based on message content analysis.
Background
Protecting industrial computer networks is a difficult task. Networks in industrial control systems typically use proprietary systems and protocols, and in many cases, technical details about these proprietary systems and protocols are rarely published to the outside. As a result, operators of industrial control systems are often unaware of the specific implementation details of the systems they operate, and in-depth analysis of communication content is difficult to achieve. There is a need for a detection method that can analyze the content of unknown binary protocols in an industrial control system and perform anomaly detection.
Disclosure of Invention
The invention provides an industrial control system anomaly detection method based on message content analysis, which solves the problem that a binary protocol commonly used in an industrial computer network limits a traditional anomaly detection method.
The invention is realized by the following technical scheme:
A method for detecting anomalies in an industrial control system based on message content analysis comprises the following steps:
step one, collecting data messages in the industrial control system and mapping the messages to corresponding feature vectors;
step two, constructing prototype models of the network messages based on an input data set formed by the feature vectors;
step three, analyzing the occurrence frequency of the features in each prototype model and filtering rare features from the model using a frequency threshold to obtain a detection model;
step four, performing anomaly detection for the industrial control system using the model optimized in step three and the detection message acquired in step one.
The invention comprises constructing, through message content analysis, prototype models that reflect in a generic way the exchanged data of protocols whose content and structure are unknown, together with a method for optimizing the constructed prototype models to reduce errors.
Preferably, mapping the acquired message to the corresponding feature vector in step one of the present invention specifically comprises:
extracting all substrings of length n from the acquired message m and recording their numbers of occurrences, each substring being associated with one dimension of the feature space, so that the message m is represented as a vector $x = \phi(m)$ of substring occurrence counts; the mapping is expressed as: $\phi(m) = (\phi_s(m))_{s \in S}$
where $\phi_s(m) = f(s, m)$, the set S denotes all possible substrings of length n, and the function $f(s, m)$ returns the number of occurrences of the substring s in the input message m;
with this mapping, a set of messages $\{m_1, m_2, \dots, m_N\}$ is converted into a set of vectors $X = \{x_1, x_2, \dots, x_N\}$, where $x_i = \phi(m_i)$, and X is used as the input data set for model construction in step two; alternatively, a single message to be examined is converted into a single vector for the anomaly decision in step four.
Preferably, step two of the present invention specifically comprises the following steps:
step S21, selecting k samples from the input data set to initialize the clusters $C_1, \dots, C_k$;
step S22, for each remaining sample $x \in X \setminus \bigcup_{i=1}^{k} C_i$ in the input data, measuring the similarity of the sample to each cluster and assigning it to the closest cluster to obtain the updated cluster $C_j \leftarrow C_j \cup \{x\}$, where $j = \arg\max_{i \in [1,k]} \mathrm{sim}(x, C_i)$;
step S23, repeating step S22 until all remaining samples have been assigned to the k clusters, which yields k prototype models.
Preferably, step S21 of the present invention specifically includes:
step S211, randomly selecting a sample from the data set as the initial cluster center $c_1$;
step S212, calculating, for each sample, the shortest distance to the cluster centers selected so far, that is, the distance to its nearest cluster center, denoted $D(x)$:
$$D(x) = \min_{j} \sqrt{\sum_{i} \frac{(x_i - c_{j,i})^2}{\sigma_i^2}}$$
where $\sigma_i^2$ is the variance of the i-th dimension over all samples in the data set;
step S213, calculating the probability of each sample being selected as the next cluster center,
$$P(x) = \frac{D(x)^2}{\sum_{x' \in X} D(x')^2}$$
and then selecting the next cluster center by roulette-wheel selection;
steps S212 to S213 are repeated until k cluster centers have been selected.
Preferably, in step S22 of the present invention, the similarity between a sample and a cluster is computed as a cosine similarity, specifically:
the similarity of an input sample x to a cluster $C_i$ is the cosine similarity of x and the cluster center $c_i$:
$$\mathrm{sim}(x, C_i) = \cos\theta = \frac{x \cdot c_i}{\lVert x \rVert \, \lVert c_i \rVert}$$
where $\theta$ is the angle between the vectors x and $c_i$; the smaller the angle, the greater the similarity of the two vectors;
the center $c_i$ of cluster $C_i$ is the centroid of all samples belonging to the cluster:
$$c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$
where $|C_i|$ denotes the number of samples in cluster $C_i$; whenever a sample is assigned to its closest cluster, the center of that cluster changes.
Preferably, step three of the present invention sets a frequency threshold t and filters out the features that occur in fewer than t input samples of a prototype model, yielding the detection model $M_i$, where $i \in [1, k]$.
Preferably, step four of the present invention specifically includes:
step S41, taking the ratio of the number of features of the detected message m that are known to the detection model $M_i$ to the total number of features l of the message as the score of m against $M_i$; this score is computed once for each model $M_i$ in M, and the highest score is taken as the confidence $\alpha$ of the detected message m, i.e.
$$\alpha(m, M) = \max_{i \in [1,k]} \frac{1}{l} \sum_{s \in M_i} \phi_s(m)$$
step S42, comparing the confidence $\alpha$ with the anomaly decision threshold T: when $\alpha(m, M) \le T$, the message is considered malicious and an anomaly alarm is issued; otherwise the message is considered benign and detection continues with the next message.
In another aspect, the invention also provides an industrial control system anomaly detection system based on message content analysis, which comprises a message collection and preprocessing module, a learning module, an optimization module and a detection module;
the message collection and preprocessing module is used for collecting data messages in the industrial control system, mapping the messages to corresponding feature vectors and passing the feature vectors to the learning module and the detection module;
the learning module is used for constructing prototype models of the network messages based on an input data set consisting of the feature vectors passed on by the message collection and preprocessing module;
the optimization module is used for analyzing the occurrence frequency of the features in each prototype model and filtering rare features from the models using a frequency threshold to obtain a detection model;
the detection module uses the detection model and the feature vector of the detection message passed on by the message collection and preprocessing module to decide whether the detection message is anomalous.
Preferably, the mapping of the acquired message to the corresponding feature vector by the message collection and preprocessing module of the present invention specifically comprises:
extracting all substrings of length n from the acquired message m and recording their numbers of occurrences, each substring being associated with one dimension of the feature space, so that the message m is represented as a vector $x = \phi(m)$ of substring occurrence counts; the mapping is expressed as: $\phi(m) = (\phi_s(m))_{s \in S}$
where $\phi_s(m) = f(s, m)$, the set S denotes all possible substrings of length n, and the function $f(s, m)$ returns the number of occurrences of the substring s in the input message m;
with this mapping, a set of messages $\{m_1, m_2, \dots, m_N\}$ is converted into a set of vectors $X = \{x_1, x_2, \dots, x_N\}$, where $x_i = \phi(m_i)$, and X is used as the input data set for model construction in step two; alternatively, a single message to be examined is converted into a single vector for the anomaly decision in step four.
Preferably, the learning module of the present invention is configured to perform the following steps:
step S21, selecting k samples from the input data set to initialize the clusters $C_1, \dots, C_k$;
step S22, for each remaining sample $x \in X \setminus \bigcup_{i=1}^{k} C_i$ in the input data, measuring the similarity of the sample to each cluster and assigning it to the closest cluster to obtain the updated cluster $C_j \leftarrow C_j \cup \{x\}$, where $j = \arg\max_{i \in [1,k]} \mathrm{sim}(x, C_i)$;
step S23, repeating step S22 until all remaining samples have been assigned to the k clusters, which yields k prototype models.
The invention has the following advantages and beneficial effects:
1. The anomaly detection method based on message content analysis overcomes the limitation that the binary protocols commonly used in industrial computer networks impose on traditional anomaly detection methods. The invention analyzes the message content of the industrial control system, builds prototype models that reflect in a generic way the exchanged data of protocols whose content and structure are unknown, and uses these models to detect anomalies in the detection messages, thereby improving the stability and security of the industrial control system.
2. The invention has wide application range, can be well suitable for the abnormality detection of the nuclear reactor industrial control system, and can also be applied to the abnormality detection of other industrial control systems.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is a schematic block diagram of the system of the present invention.
FIG. 2 is a flow chart of the present invention for building prototype models.
FIG. 3 is a flow chart of the optimization model of the present invention.
FIG. 4 is a flow chart of the optimization model based anomaly detection of the present invention.
Detailed Description
Hereinafter, the terms "comprising" and "may include" as used in various embodiments of the present invention indicate the presence of the disclosed function, operation or element, and do not preclude the addition of one or more further functions, operations or elements. Furthermore, as used in various embodiments of the present invention, the terms "comprises," "comprising," "includes," "including," "has," "having" and their derivatives indicate only the presence of the specified features, numbers, steps, operations, elements, components, or combinations thereof, and should not be construed as excluding the existence or possible addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.
In various embodiments of the invention, the expression "A or B" or "at least one of A or/and B" includes any or all combinations of the words listed together. For example, the expression "A or B" or "at least one of A or/and B" may include A, may include B, or may include both A and B.
Expressions (such as "first", "second", and the like) used in various embodiments of the present invention may modify various constituent elements in various embodiments, but may not limit the respective constituent elements. For example, the above description does not limit the order and/or importance of the elements described. The foregoing description is for the purpose of distinguishing one element from another. For example, the first user device and the second user device indicate different user devices, although both are user devices. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of various embodiments of the present invention.
It should be noted that: if it is described that one constituent element is "connected" to another constituent element, the first constituent element may be directly connected to the second constituent element, and a third constituent element may be "connected" between the first constituent element and the second constituent element. In contrast, when one constituent element is "directly connected" to another constituent element, it is understood that there is no third constituent element between the first constituent element and the second constituent element.
The terminology used in the various embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the various embodiments of the invention. As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of the present invention belong. The terms (such as those defined in commonly used dictionaries) should be interpreted as having a meaning that is consistent with their contextual meaning in the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein in various embodiments of the present invention.
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
Example 1
The embodiment provides an industrial control system anomaly detection method based on message content analysis.
The detection method of the embodiment is implemented by relying on four key modules of the detection system, as shown in fig. 1, the four key modules of the detection system are specifically:
A. Message collection and preprocessing module. This module collects data messages in the industrial control system and maps them to corresponding feature vectors; the feature vectors form the input data set used by the subsequent learning module and detection module.
B. Learning module. This module builds the prototype models. To do so, it automatically partitions the network traffic into k sets of messages with similar content, approximating the states of the underlying protocol. The module not only separates message types but also derives a prototype model for each message type in the process. Prototype models represent the structure of the individual message types and also describe the data they typically contain.
C. Optimization module. One of the main obstacles to analyzing unknown protocols is noise, that is, seemingly random data that prevents the inference of a suitable content model. This module addresses the problem by analyzing the occurrence frequency of the features in each prototype model and filtering rare features using a frequency threshold t.
D. Detection module. After the prototype models have been built and optimized, this module uses the deviation between an input message and the built models as the criterion for anomaly detection. The learning module and the optimization module form the basis of the detection module, so the detector D(k, t, T) is parameterized by the number of message sets k, the frequency threshold t and the anomaly decision threshold T.
In the message collection and preprocessing module, messages are collected in one of two ways: by monitoring the industrial control system, or by having the operator of the industrial control system supply messages to the learning module. A collected message m is mapped to its feature vector $x = \phi(m)$ as follows: all substrings of length n are extracted from the message m and their numbers of occurrences are recorded, each substring being associated with one dimension of the feature space, so that the message m can be represented as a vector of substring occurrence counts. Formally: $\phi(m) = (\phi_s(m))_{s \in S}$, where $\phi_s(m) = f(s, m)$, the set S denotes all possible substrings of length n, and the function $f(s, m)$ returns the number of occurrences of s in the input message m. With this mapping, a set of messages $\{m_1, m_2, \dots, m_N\}$ can be converted into a set of vectors $X = \{x_1, x_2, \dots, x_N\}$, where $x_i = \phi(m_i)$, and X is used as the input data set for the learning phase; alternatively, a single message to be examined is converted into a single vector for the anomaly decision in the detection phase.
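The mapping described above can be sketched in a few lines of Python. This is a minimal illustration rather than the patent's implementation: the function names `ngram_counts` and `to_dense` and the sample message are hypothetical, and the sketch works on byte-level substrings.

```python
from collections import Counter

def ngram_counts(message: bytes, n: int) -> Counter:
    """Map a message to phi(m): the count of every length-n substring it contains."""
    # Only substrings that actually occur are stored; all others are implicitly 0.
    return Counter(message[i:i + n] for i in range(len(message) - n + 1))

def to_dense(counts: Counter, vocabulary: list) -> list:
    """Project sparse counts onto a fixed ordering of substrings, one dimension each."""
    return [counts.get(s, 0) for s in vocabulary]

# Hypothetical 6-byte message with n = 2.
phi = ngram_counts(b"\x01\x02\x01\x02\x03\x01", n=2)
vector = to_dense(phi, [b"\x01\x02", b"\x02\x01", b"\x02\x03", b"\x03\x01", b"\xff\xff"])
```

Storing the counts sparsely avoids materializing the full set S, which grows exponentially with n.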
A flow chart for building a prototype model of a network message in the learning module is shown in fig. 2.
After the input data set has been constructed, k samples are selected from the input data to initialize the clusters $C_1, \dots, C_k$. First, a sample is randomly selected from the data set as the initial cluster center $c_1$. Next, for each sample the shortest distance to the cluster centers selected so far is calculated, that is, the distance to its nearest cluster center, denoted $D(x)$:
$$D(x) = \min_{j} \sqrt{\sum_{i} \frac{(x_i - c_{j,i})^2}{\sigma_i^2}}$$
where $\sigma_i^2$ is the variance of the i-th dimension over all samples in the data set. Then the probability of each sample being selected as the next cluster center is calculated,
$$P(x) = \frac{D(x)^2}{\sum_{x' \in X} D(x')^2}$$
and the next cluster center is selected by roulette-wheel selection. This is repeated until k cluster centers have been selected.
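The initialization above resembles k-means++ seeding. The following sketch makes two assumptions that the text does not state explicitly: the distance is a Euclidean distance with each dimension scaled by its variance, and the selection probability is proportional to the squared distance. All function names are illustrative.

```python
import math
import random

def dimension_variances(samples):
    """Per-dimension variance over all samples; zero variances are replaced by 1."""
    n, dims = len(samples), len(samples[0])
    means = [sum(x[i] for x in samples) / n for i in range(dims)]
    variances = [sum((x[i] - means[i]) ** 2 for x in samples) / n for i in range(dims)]
    return [v if v > 0 else 1.0 for v in variances]

def scaled_distance(x, c, variances):
    """Euclidean distance with every dimension divided by its variance (assumed form)."""
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, c, variances)))

def select_centers(samples, k, rng=None):
    """Return the indices of k initial cluster centers."""
    rng = rng or random.Random()
    variances = dimension_variances(samples)
    centers = [rng.randrange(len(samples))]  # first center: uniform at random
    while len(centers) < k:
        # Squared distance of every sample to its nearest already chosen center.
        weights = [min(scaled_distance(x, samples[c], variances) for c in centers) ** 2
                   for x in samples]
        total = sum(weights)
        if total == 0:
            raise ValueError("fewer than k distinct samples")
        # Roulette wheel: spin once and walk the cumulative weights.
        threshold = rng.random() * total
        cumulative, chosen = 0.0, None
        for index, weight in enumerate(weights):
            cumulative += weight
            if weight > 0:
                chosen = index  # also serves as a fallback against rounding error
                if threshold < cumulative:
                    break
        centers.append(chosen)
    return centers

samples = [[0, 0], [0, 1], [10, 10], [10, 11]]
centers = select_centers(samples, k=2, rng=random.Random(1))
```

Samples far from every existing center receive a larger slice of the wheel, so the initial centers tend to be spread across the data.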
Then, for each remaining sample $x \in X \setminus \bigcup_{i=1}^{k} C_i$, the similarity of the sample to each cluster is measured and the sample is assigned to the closest cluster $C_j$, where
$$j = \arg\max_{i \in [1,k]} \mathrm{sim}(x, C_i)$$
The similarity of a sample to a cluster is defined as the cosine similarity of the input sample x and the center $c_i$ of the cluster $C_i$:
$$\mathrm{sim}(x, C_i) = \cos\theta = \frac{x \cdot c_i}{\lVert x \rVert \, \lVert c_i \rVert}$$
where $\theta$ is the angle between the vectors x and $c_i$; the smaller the angle, the greater the similarity of the two vectors. The center $c_i$ of cluster $C_i$ is the centroid of all samples belonging to the cluster:
$$c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$
where $|C_i|$ denotes the number of samples in cluster $C_i$; whenever a sample is assigned to its closest cluster, the center of that cluster changes.
To save working memory, the invention stores only the counts of the messages and their substrings, not the messages themselves. For each cluster $C_i$, only the total number of samples contained in the cluster and the cumulative count vector, which serves as the prototype model, are stored:
$$P_i = \sum_{x \in C_i} x$$
This is sufficient to compute the similarity between messages and clusters defined above and can be used for the subsequent anomaly detection.
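A cluster kept in this compact form might look like the sketch below, which stores only the sample count and the cumulative count vector and assigns each new sample by cosine similarity. The `Cluster` class and the `assign` helper are hypothetical names introduced for illustration.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

class Cluster:
    """Keeps only the sample count and the cumulative count vector of a cluster."""

    def __init__(self, seed):
        self.size = 1
        self.total = list(seed)  # cumulative count vector (the prototype model)

    def center(self):
        # Centroid = cumulative vector divided by the number of samples.
        return [value / self.size for value in self.total]

    def add(self, x):
        self.size += 1
        self.total = [t + v for t, v in zip(self.total, x)]

def assign(x, clusters):
    """Add x to the cluster whose center is most similar and return its index."""
    j = max(range(len(clusters)), key=lambda i: cosine(x, clusters[i].center()))
    clusters[j].add(x)
    return j

clusters = [Cluster([4, 0]), Cluster([0, 4])]
j = assign([3, 1], clusters)
```

Because cosine similarity ignores vector length, comparing against the cumulative vector gives the same ranking as comparing against the centroid, which is why the two stored quantities suffice.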
A flow chart of the optimization model in the optimization module is shown in fig. 3.
This module optimizes the model by using occurrence counts to filter out irrelevant information. When the prototype model is built, the number of samples f in which each feature (substring) $s \in S$ occurs is recorded, so the prototype model $P_i$ accurately represents these feature frequencies in the training samples. A detection scheme for the binary protocol can therefore be derived directly from the prototype model $P_i$ without additional training.
By setting the frequency threshold t, the method filters out features that occur in fewer than t input samples associated with a given cluster, thereby effectively reducing noise from the training data. The model $M_i$ is thus a set of features that can be used for detection, and these sets together make up the overall content model $M = \{M_1, \dots, M_k\}$ used by D(k, t, T), where $M_i = \{ s \in S \mid P_i(s) \ge t \}$.
An optimization model based anomaly detection flow diagram in the detection module is shown in fig. 4.
The detection module makes the overall anomaly decision using the optimized model M and the detection message m passed on by the message collection and preprocessing module; D(k, t, T) assumes that a benign detected message m is similar to what the model expects. The module takes the ratio of the number of features of the message m that are known to the detection model $M_i$ to the total number of features l of the message as the score of m against $M_i$. This score is computed once for each model $M_i$ in M, and the highest score is taken as the confidence $\alpha$ of the detected message m, i.e.
$$\alpha(m, M) = \max_{i \in [1,k]} \frac{1}{l} \sum_{s \in M_i} \phi_s(m)$$
The confidence $\alpha$ is compared with the anomaly decision threshold T: when $\alpha(m, M) \le T$, the message is considered malicious because it does not match the expectations of any model, and the module issues an anomaly alarm; otherwise the message is considered benign and the module continues with the next message.
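The decision rule can be sketched as follows, assuming each detection model is held as a set of byte substrings and that features are counted with multiplicity. The function names, the toy models and the threshold value are illustrative only.

```python
from collections import Counter

def ngram_counts(message: bytes, n: int) -> Counter:
    """Count every length-n substring of the message."""
    return Counter(message[i:i + n] for i in range(len(message) - n + 1))

def confidence(message: bytes, models, n: int) -> float:
    """Highest share of the message's features that any single model knows."""
    counts = ngram_counts(message, n)
    total = sum(counts.values())  # l: total number of features in the message
    if total == 0:
        return 0.0
    return max(
        (sum(c for s, c in counts.items() if s in model) / total for model in models),
        default=0.0,
    )

def is_anomalous(message: bytes, models, n: int, threshold: float) -> bool:
    """Flag the message when its confidence does not exceed the threshold T."""
    return confidence(message, models, n) <= threshold

# Two toy models over 2-byte substrings (hypothetical data).
models = [{b"ab", b"bc"}, {b"xy"}]
score = confidence(b"abcd", models, n=2)         # "ab" and "bc" are known, "cd" is not
alarm = is_anomalous(b"abcd", models, n=2, threshold=0.7)
```

Taking the maximum over the models means a message only has to fit one message type well in order to be accepted.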
Example 2
In this embodiment, the detection method provided in embodiment 1 is adopted to detect a group of collected messages, and the specific process is as follows:
As shown in Table 1 below, a set of messages is captured and rules describing the set are learned; these rules could also be used to generate new messages that conform to the format of the set. The exact protocol specification cannot be learned; instead, the protocol is described approximately on the basis of the captured network traffic. The set of messages shown contains the binary command 101111, which requests a sensor value from a particular device. Each message also carries the length of the transmitted string as a 1-byte integer at the front of the message and a further unspecified flag at the end. Within this set of messages, only the part holding the device identifier and the final binary flag vary, so the algorithm treats the rest as constant.
TABLE 1 (provided as an image in the original publication; not reproduced here)
As shown in Table 2 below, an attack message is built by randomly selecting an existing network message and replacing a variable field in it with the encoded attack field, while leaving the constant fields and the unmodified variable fields unchanged.
TABLE 2 (provided as an image in the original publication; not reproduced here)
With n = 4 for feature extraction, the feature vectors extracted from the set of messages are shown in Table 3 below.
TABLE 3 (provided as an image in the original publication; not reproduced here)
Let the number of message groups be k = 2 and select point 5 as the initial cluster center. Then $D(x)^2$ and the probability of being selected as the second cluster center are computed for each sample. For points 1 through 8, $D(x)^2$ is 34.13, 27.23, 22.67, 22.67, 0.0, 39.23, 20.0, 50.87; $P(x)$ is 0.157, 0.126, 0.105, 0.105, 0.0, 0.181, 0.092, 0.234; and the cumulative sums are 0.157, 0.283, 0.388, 0.493, 0.493, 0.674, 0.766, 1. A random number between 0 and 1 is then generated; assuming it is 0.25, point 2 is selected as the second cluster center. Cluster $C_1$ is initialized with point 2 and cluster $C_2$ with point 5.
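The selection of point 2 can be reproduced with a short script. It takes the eight listed distance values directly as selection weights, which is an assumption about how they enter the probability; the helper name `pick` is illustrative.

```python
from itertools import accumulate

# Distance values for points 1..8 as listed in the example; point 5 is the
# first center, so its weight is 0.
weights = [34.13, 27.23, 22.67, 22.67, 0.0, 39.23, 20.0, 50.87]

total = sum(weights)
probabilities = [w / total for w in weights]
cumulative = list(accumulate(probabilities))

def pick(r: float) -> int:
    """Return the 1-based number of the first point whose cumulative sum reaches r."""
    for point, bound in enumerate(cumulative, start=1):
        if r <= bound:
            return point
    return len(cumulative)  # guards against rounding error when r is close to 1

second_center = pick(0.25)  # 0.25 lies between the bounds of points 1 and 2
```

Point 5 can never be drawn: its cumulative bound equals that of point 4, so point 4 is always reached first.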
The remaining points are then clustered. The similarities of point 1 to the two clusters $C_1$ and $C_2$ are computed (the two values are given only as images in the original publication), and point 1 is added to cluster $C_2$, whose center then becomes [0.5,2.5,4.0,1.0,2.5,2.0,1.5,2.5,2.5,1.5,1.0,3.0,2.0,2.0,2.5,1.0]. The other points are clustered in the same way. In the final clustering result, $C_1$ contains points 2, 3, 6 and 8, with center [0.5,0.75,2.0,2.5,1.0,1.5,2.25,3.25,0.75,2.5,0.75,3.25,2.25,2.75,3.25,1.5], and $C_2$ contains points 1, 4, 5 and 7, with center [0.75,2.0,3.0,1.5,2.5,1.75,1.75,2.5,2.0,1.75,1.5,2.75,1.5,2.25,2.25,1.0]. Once clustering is complete, the prototype models $P_i$ are generated: $P_1$ is [2,3,8,10,4,6,9,13,3,10,3,13,9,11,13,6] and $P_2$ is [3,8,12,6,10,7,7,10,8,7,6,11,6,9,9,4].
The prototype models are then optimized. With the frequency threshold t = 8, features such as 0000, 0001 and 0100 in $P_1$ are filtered out, while features such as 0010, 0011 and 0110 are retained. Written as decimal feature indices counted from 1, the optimized content model M is {[3,4,7,8,10,12,13,14,15], [2,3,5,8,9,12,14,15]}.
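The filtering step can be checked against the prototype vectors given earlier in this example. The sketch applies the threshold directly to the cumulative counts and numbers the features from 1, which is how the listed index sets appear to be formed; `detection_model` is a hypothetical helper name.

```python
# Prototype models from the example (cumulative counts of the 16 four-bit features).
P1 = [2, 3, 8, 10, 4, 6, 9, 13, 3, 10, 3, 13, 9, 11, 13, 6]
P2 = [3, 8, 12, 6, 10, 7, 7, 10, 8, 7, 6, 11, 6, 9, 9, 4]

def detection_model(prototype, t):
    """Keep the 1-based indices of the features whose count reaches the threshold t."""
    return [index for index, count in enumerate(prototype, start=1) if count >= t]

t = 8
M = [detection_model(P1, t), detection_model(P2, t)]
```

Under this numbering, feature 1 corresponds to the substring 0000 and feature 16 to 1111.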
Assume the anomaly decision threshold is T = 0.7 and the feature vector extracted from the attack message m is [0,2,2,4,3,0,1,2,3,5,4,3,4,0,2,1]. The numbers of features of m that conform to the models $M_1$ and $M_2$ are 2+4+1+2+5+3+4+2 = 23 and 2+2+3+2+3+3+2 = 17, respectively, and the total number of features of m is 36. The confidence of message m is therefore $\alpha$ = 23/36 ≈ 0.64 < T, so the message is considered anomalous and an alarm is issued.
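The confidence value can be recomputed from the numbers in this example. The sketch assumes the model indices are 1-based and that matching features are counted with multiplicity; the variable names are illustrative.

```python
# Feature vector of the attack message and the optimized model from the example.
features = [0, 2, 2, 4, 3, 0, 1, 2, 3, 5, 4, 3, 4, 0, 2, 1]
M = [[3, 4, 7, 8, 10, 12, 13, 14, 15], [2, 3, 5, 8, 9, 12, 14, 15]]
T = 0.7

total = sum(features)  # total number of features l in the message
# For each model, add up the counts of the message features that the model retains.
matched = [sum(features[index - 1] for index in model) for model in M]
alpha = max(matched) / total
anomalous = alpha <= T
```

The best-matching model covers 23 of the 36 features, so the message falls below the threshold and is reported.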
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (3)

1. A message content analysis-based industrial control system anomaly detection method is characterized by comprising the following steps:
step one, collecting data messages in the industrial control system and mapping the messages to corresponding feature vectors;
step two, constructing prototype models of the network messages based on an input data set formed by the feature vectors;
step three, analyzing the occurrence frequency of the features in each prototype model and filtering out the rare features in the model by using a frequency threshold to obtain a detection model;
step four, performing anomaly detection on the industrial control system by using the detection model obtained in step three and the detection message collected in step one;
the mapping of the collected messages to the corresponding feature vectors in step one is specifically:
extracting all substrings of length n from a collected message m and recording the number of occurrences of each substring, wherein each substring is associated with one dimension of the feature space, so that the message m is expressed as a vector $x = \phi(m)$ of substring occurrence counts, the mapping being expressed as: $\phi(m) = (\phi_s(m))_{s \in S}$,
wherein $\phi_s(m) = f(s, m)$, the set $S$ represents all possible substrings of length n, and the function $f(s, m)$ represents the number of occurrences of the substring $s$ in the input message m;
the mapping converts a set of messages $\{m_1, m_2, \ldots, m_N\}$ into a set of vectors $X = \{x_1, x_2, \ldots, x_N\}$, wherein $x_i = \phi(m_i)$, and X is taken as the input data set and used for model construction in step two; or converts a single message, serving as a detection message, into a single vector for the anomaly judgment of the detection message in step four;
the second step specifically comprises the following steps:
step S21, selecting k samples from the input data set for initializing the clusters $C_1, \ldots, C_k$;
the step S21 specifically includes:
step S211, randomly selecting a sample from the data set as the initial cluster center $c_1$;
step S212, calculating the shortest distance between each sample and the currently existing cluster centers, namely the distance between each sample and its nearest cluster center, denoted $D(x)$,
Figure FDA0003149378290000011
wherein
Figure FDA0003149378290000012
Is the variance of the ith dimension for all samples in the dataset;
step S213, calculating the probability of each sample being selected as the next cluster center
Figure FDA0003149378290000013
then selecting the next cluster center by the roulette-wheel method;
repeating the steps S212 to S213 until k cluster centers are selected;
step S22, for each sample remaining in the input data
Figure FDA0003149378290000014
measuring the similarity of the sample to each cluster and assigning the sample to the closest cluster according to the similarity, to obtain an updated cluster $C_j$, wherein
Figure FDA0003149378290000021
in the step S22, the similarity between a sample and a cluster is calculated using cosine similarity, specifically:
the similarity is expressed by the cosine similarity of the two vectors, namely the input sample $x$ and the center $c_i$ of the cluster $C_i$:
$$\cos\theta = \frac{x \cdot c_i}{\lVert x \rVert \, \lVert c_i \rVert}$$
where $\theta$ represents the angle between the vectors $x$ and $c_i$; the smaller the angle, the greater the similarity of the two vectors;
the center $c_i$ of the cluster $C_i$ is the centroid of all samples belonging to the cluster:
$$c_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x$$
wherein $\lvert C_i \rvert$ denotes the number of samples in the cluster $C_i$; when a sample is assigned to its closest cluster, the center of that cluster changes;
step S23, repeating step S22 until all the remaining samples are distributed into k clusters, namely k prototype models are obtained;
step four specifically includes:
step S41, taking the ratio of the number of features of the detection message m that conform to the detection model $M_i$ to the total number of features l of the detection message m as the score of the detection message m against the detection model $M_i$; a score is calculated for the detection message m against each model $M_i$ in M, and the highest score is taken as the confidence $\alpha$ of the detection message m, namely:
$$\alpha(m, M) = \max_{i \in [1, k]} \frac{1}{l} \sum_{s \in M_i} \phi_s(m)$$
step S42, comparing the confidence $\alpha$ with the anomaly judgment threshold T: when $\alpha(m, M) \le T$, the message is considered malicious and an anomaly alarm is issued; otherwise, the message is considered benign and detection continues with the next message.
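The substring-count mapping of step one can be illustrated with a short sketch. It is not the patented implementation: the function names `ngram_counts` and `to_dense` are hypothetical, messages are assumed to be byte strings, and a binary alphabet with n = 4 is assumed purely so that the feature space has the 16 dimensions (0000 to 1111) used in the description's example.

```python
from collections import Counter
from itertools import product

def ngram_counts(message: bytes, n: int) -> Counter:
    """Count every length-n substring of the message (sparse form of phi(m))."""
    return Counter(message[i:i + n] for i in range(len(message) - n + 1))

def to_dense(counts: Counter, alphabet: bytes, n: int) -> list:
    """Expand the sparse counts into one dimension per possible substring in S."""
    return [counts[bytes(s)] for s in product(alphabet, repeat=n)]

# Illustrative message over the assumed alphabet {0, 1}.
message = b"0010110"
counts = ngram_counts(message, 4)     # substrings 0010, 0101, 1011, 0110
vector = to_dense(counts, b"01", 4)   # 16-dimensional count vector x = phi(m)
print(vector)
```

For real traffic the set S grows as 256^n, so a sparse representation such as the `Counter` above is usually the more practical choice than the dense vector.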
2. The method according to claim 1, wherein in step three, the features in the prototype model that occur in fewer than t input samples are filtered out by setting a frequency threshold t, so as to obtain the detection models $M_i$, wherein $i \in [1, k]$.
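The frequency-threshold filtering of claim 2 might be sketched as follows. The function name `build_detection_model` and the toy cluster are hypothetical; a prototype model is assumed to be represented by the count vectors of the samples in one cluster, and the resulting detection model by the list of retained feature indices (0-based here).

```python
def build_detection_model(cluster_samples, t):
    """Keep the features that occur in at least t samples of one cluster."""
    dims = len(cluster_samples[0])
    # Number of samples in which each feature occurs at least once.
    support = [sum(1 for x in cluster_samples if x[d] > 0) for d in range(dims)]
    return [d for d in range(dims) if support[d] >= t]

# Toy cluster of three count vectors over four features.
cluster = [
    [1, 0, 2, 0],
    [0, 0, 1, 3],
    [2, 0, 1, 0],
]
model = build_detection_model(cluster, t=2)
print(model)
```

With t = 2, feature 1 (never seen) and feature 3 (seen in one sample only) are dropped as rare, while features 0 and 2 are retained.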
3. An industrial control system anomaly detection system based on message content analysis is characterized by comprising a message collection and preprocessing module, a learning module, an optimization module and a detection module;
the message collecting and preprocessing module is used for collecting data messages in the industrial control system, mapping the messages to corresponding feature vectors and transmitting the feature vectors to the learning module and the detection module;
the learning module is used for constructing a prototype model of the network message based on an input data set consisting of the feature vectors transmitted by the message collecting and preprocessing module;
the optimization module is used for analyzing the occurrence frequency of the features in each prototype model and filtering the rare features in the models by using a frequency threshold value to obtain a detection model;
the detection module judges whether a detection message is abnormal by using the detection model and the feature vector of the detection message transmitted by the message collecting and preprocessing module; the mapping of the collected messages to the corresponding feature vectors by the message collecting and preprocessing module is specifically as follows:
extracting all substrings of length n from a collected message m and recording the number of occurrences of each substring, wherein each substring is associated with one dimension of the feature space, so that the message m is expressed as a vector $x = \phi(m)$ of substring occurrence counts, the mapping being expressed as: $\phi(m) = (\phi_s(m))_{s \in S}$,
wherein $\phi_s(m) = f(s, m)$, the set $S$ represents all possible substrings of length n, and the function $f(s, m)$ represents the number of occurrences of the substring $s$ in the input message m;
the mapping converts a set of messages $\{m_1, m_2, \ldots, m_N\}$ into a set of vectors $X = \{x_1, x_2, \ldots, x_N\}$, wherein $x_i = \phi(m_i)$, and X is taken as the input data set and used for model construction in step two; or converts a single message, serving as a detection message, into a single vector for the anomaly judgment of the detection message; the learning module is configured to perform the following steps:
step S21, selecting k samples from the input data set for initializing the clusters $C_1, \ldots, C_k$;
the step S21 specifically includes:
step S211, randomly selecting a sample from the data set as the initial cluster center $c_1$;
step S212, calculating the shortest distance between each sample and the currently existing cluster centers, namely the distance between each sample and its nearest cluster center, denoted $D(x)$,
Figure FDA0003149378290000031
wherein
Figure FDA0003149378290000032
Is the variance of the ith dimension for all samples in the dataset;
step S213, calculating the probability of each sample being selected as the next cluster center
Figure FDA0003149378290000033
then selecting the next cluster center by the roulette-wheel method;
repeating the steps S212 to S213 until k cluster centers are selected;
step S22, for each sample remaining in the input data
Figure FDA0003149378290000034
measuring the similarity of the sample to each cluster and assigning the sample to the closest cluster according to the similarity, to obtain an updated cluster $C_j$, wherein
Figure FDA0003149378290000035
in the step S22, the similarity between a sample and a cluster is calculated using cosine similarity, specifically:
the similarity is expressed by the cosine similarity of the two vectors, namely the input sample $x$ and the center $c_i$ of the cluster $C_i$:
$$\cos\theta = \frac{x \cdot c_i}{\lVert x \rVert \, \lVert c_i \rVert}$$
where $\theta$ represents the angle between the vectors $x$ and $c_i$; the smaller the angle, the greater the similarity of the two vectors;
the center $c_i$ of the cluster $C_i$ is the centroid of all samples belonging to the cluster:
$$c_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x$$
wherein $\lvert C_i \rvert$ denotes the number of samples in the cluster $C_i$; when a sample is assigned to its closest cluster, the center of that cluster changes;
step S23, repeating step S22 until all the remaining samples are distributed into k clusters, namely k prototype models are obtained;
the anomaly judgment process of the detection module comprises the following steps:
step S41, taking the ratio of the number of features of the detection message m that conform to the detection model $M_i$ to the total number of features l of the detection message m as the score of the detection message m against the detection model $M_i$; a score is calculated for the detection message m against each model $M_i$ in M, and the highest score is taken as the confidence $\alpha$ of the detection message m, namely:
$$\alpha(m, M) = \max_{i \in [1, k]} \frac{1}{l} \sum_{s \in M_i} \phi_s(m)$$
step S42, comparing the confidence $\alpha$ with the anomaly judgment threshold T: when $\alpha(m, M) \le T$, the message is considered malicious and an anomaly alarm is issued; otherwise, the message is considered benign and detection continues with the next message.
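The clustering performed by the learning module (steps S21 to S23) could be sketched along the following lines. Several points are assumptions rather than the patented formulas, which appear only as images: the distance behind D(x) is taken to be a variance-normalised Euclidean distance, the roulette-wheel weights are taken to be proportional to D(x)² as in standard k-means++ seeding, and all function names are hypothetical. The sketch also assumes k does not exceed the number of samples.

```python
import math
import random

def std_euclidean(x, c, var):
    # Variance-normalised Euclidean distance (assumed form of the distance in D(x)).
    return math.sqrt(sum((a - b) ** 2 / v for a, b, v in zip(x, c, var) if v > 0))

def cosine(x, c):
    # Cosine similarity between a sample x and a cluster center c.
    dot = sum(a * b for a, b in zip(x, c))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in c))
    return dot / norm if norm else 0.0

def init_centers(X, k, rng):
    dims = len(X[0])
    mean = [sum(x[d] for x in X) / len(X) for d in range(dims)]
    var = [sum((x[d] - mean[d]) ** 2 for x in X) / len(X) for d in range(dims)]
    chosen = [rng.randrange(len(X))]                 # S211: random first center
    while len(chosen) < k:
        # S212: distance of every sample to its nearest existing center.
        D = [min(std_euclidean(x, X[j], var) for j in chosen) for x in X]
        weights = [d * d for d in D]                 # S213: assumed D(x)^2 weighting
        if sum(weights) == 0:                        # every sample coincides with a center
            nxt = next(i for i in range(len(X)) if i not in chosen)
        else:                                        # roulette-wheel draw
            nxt = rng.choices(range(len(X)), weights=weights, k=1)[0]
        chosen.append(nxt)
    return chosen

def cluster(X, k, seed=0):
    rng = random.Random(seed)
    chosen = init_centers(X, k, rng)
    clusters = [[X[j]] for j in chosen]
    centers = [list(X[j]) for j in chosen]
    for idx, x in enumerate(X):
        if idx in chosen:
            continue
        # S22: assign to the most similar cluster, then recompute its centroid.
        j = max(range(k), key=lambda i: cosine(x, centers[i]))
        clusters[j].append(x)
        n = len(clusters[j])
        centers[j] = [sum(s[d] for s in clusters[j]) / n for d in range(len(x))]
    return clusters, centers                         # S23: k prototype models

X = [[5, 0], [4, 1], [6, 0], [0, 5], [1, 4], [0, 6]]
clusters, centers = cluster(X, k=2)
print(centers)
```

Each remaining sample is visited once, matching the single pass described in step S23; the seed only fixes the random draws so that runs are repeatable.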
CN202010578994.0A 2020-06-23 2020-06-23 Message content analysis-based industrial control system anomaly detection method and system Active CN111800312B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010578994.0A CN111800312B (en) 2020-06-23 2020-06-23 Message content analysis-based industrial control system anomaly detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010578994.0A CN111800312B (en) 2020-06-23 2020-06-23 Message content analysis-based industrial control system anomaly detection method and system

Publications (2)

Publication Number Publication Date
CN111800312A CN111800312A (en) 2020-10-20
CN111800312B true CN111800312B (en) 2021-08-24

Family

ID=72804638

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010578994.0A Active CN111800312B (en) 2020-06-23 2020-06-23 Message content analysis-based industrial control system anomaly detection method and system

Country Status (1)

Country Link
CN (1) CN111800312B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112666907B (en) * 2020-12-23 2022-04-01 北京天融信网络安全技术有限公司 Industrial control strategy generation method and device, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE60205572D1 (en) * 2001-12-25 2005-09-22 Matsushita Electric Ind Co Ltd Device and system for detecting anomalies
CN106502234A (en) * 2016-10-17 2017-03-15 重庆邮电大学 Industrial control system method for detecting abnormality based on double skeleton patterns
CN106911514A (en) * 2017-03-15 2017-06-30 江苏省电力试验研究院有限公司 SCADA network intrusion detection method and system based on the IEC 60870-5-104 protocol
CN109660518A (en) * 2018-11-22 2019-04-19 北京六方领安网络科技有限公司 Communication data detection method, device and the machine readable storage medium of network
CN110768946A (en) * 2019-08-13 2020-02-07 中国电力科学研究院有限公司 Industrial control network intrusion detection system and method based on bloom filter
CN110909811A (en) * 2019-11-28 2020-03-24 国网湖南省电力有限公司 OCSVM-based power grid abnormal behavior detection and analysis method and system
CN111262722A (en) * 2019-12-31 2020-06-09 中国广核电力股份有限公司 Safety monitoring method for industrial control system network

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2528044B (en) * 2014-07-04 2018-08-22 Arc Devices Ni Ltd Non-touch optical detection of vital signs
CN105204487A (en) * 2014-12-26 2015-12-30 北京邮电大学 Intrusion detection method and intrusion detection system for industrial control system based on communication model
US20170046510A1 (en) * 2015-08-14 2017-02-16 Qualcomm Incorporated Methods and Systems of Building Classifier Models in Computing Devices
CN107241226B (en) * 2017-06-29 2020-10-16 北京工业大学 Fuzzy test method based on industrial control private protocol

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Rolling element bearing fault detection in industrial environments based on a K-means clustering approach"; C.T. Yiakopoulos; Expert Systems with Applications; 20100916; full text *
"Research on an industrial control network anomaly detection algorithm based on FCM-SVM"; 崔君容; China Master's Theses Full-text Database; 20190115; full text *

Also Published As

Publication number Publication date
CN111800312A (en) 2020-10-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant