CN111800312B

CN111800312B - Message content analysis-based industrial control system anomaly detection method and system

Info

Publication number: CN111800312B
Application number: CN202010578994.0A
Authority: CN
Inventors: 王大秋; 赵海江; 蒲蔚若; 张明星; 顾俊杰; 廖砚; 朱小勇; 肖波
Original assignee: Nuclear Power Institute of China
Current assignee: Nuclear Power Institute of China
Priority date: 2020-06-23
Filing date: 2020-06-23
Publication date: 2021-08-24
Anticipated expiration: 2040-06-23
Also published as: CN111800312A

Abstract

The invention discloses an industrial control system anomaly detection method and system based on message content analysis. The method comprises: step one, collecting data messages in the industrial control system and mapping the messages to corresponding feature vectors; step two, constructing prototype models of the network messages from the input data set formed by the feature vectors; step three, analyzing the occurrence frequency of the features in each prototype model and filtering rare features out of the model with a frequency threshold to obtain a detection model; and step four, performing anomaly detection on the industrial control system using the model optimized in step three and the detection information collected in step one. The invention addresses the limitation that the binary protocols commonly used in industrial computer networks impose on traditional anomaly detection methods.

Description

Message content analysis-based industrial control system anomaly detection method and system

Technical Field

The invention relates to the technical field of industrial control system detection, and in particular to an industrial control system anomaly detection method and system based on message content analysis.

Background

Protecting industrial computer networks is a difficult task. Networks in industrial control systems typically use proprietary systems and protocols, and in many cases, technical details about these proprietary systems and protocols are rarely published to the outside. As a result, operators of industrial control systems are often unaware of the specific implementation details of the systems they operate, and in-depth analysis of communication content is difficult to achieve. There is a need for a detection method that can analyze the content of unknown binary protocols in an industrial control system and perform anomaly detection.

Disclosure of Invention

The invention provides an industrial control system anomaly detection method based on message content analysis, which solves the problem that a binary protocol commonly used in an industrial computer network limits a traditional anomaly detection method.

The invention is realized by the following technical scheme:

An industrial control system anomaly detection method based on message content analysis comprises the following steps:

step one, collecting data messages in the industrial control system and mapping the messages to corresponding feature vectors;

step two, constructing prototype models of the network messages from the input data set formed by the feature vectors;

step three, analyzing the occurrence frequency of the features in each prototype model and filtering rare features out of the model with a frequency threshold to obtain a detection model;

and step four, performing anomaly detection on the industrial control system using the model optimized in step three and the detection information collected in step one.

The invention comprises constructing, from message content analysis, prototype models that reflect in a general way the exchanged data of protocols whose content and structure are unknown, and a method for optimizing the constructed prototype models to reduce errors.

Preferably, mapping the collected message to the corresponding feature vector in step one of the present invention is specifically:

extracting all substrings of length n from the collected message m and recording their numbers of occurrences, each substring being associated with one dimension of the feature space, so that the message m is represented as a vector $x = \phi(m)$ of substring occurrence counts, the mapping being expressed as $\phi(m) = (\phi_s(m))_{s \in S}$;

where $\phi_s(m) = f(s, m)$, the set $S$ contains all possible substrings of length n, and the function $f(s, m)$ gives the number of occurrences of the substring $s$ in the input message m;

converting a set of messages $m_1, m_2, \ldots, m_N$ into a set of vectors $X = \{x_1, x_2, \ldots, x_N\}$, where $x_i = \phi(m_i)$, and using $X$ as the input data set for model construction in step two; or converting a single message, as a detected message, into a single vector for the anomaly decision on the detected message in step four.
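The mapping above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function names `ngram_counts` and `to_vector` are hypothetical, the message is assumed to be available as a byte string, and the sample message (a bit string written as ASCII characters) is made up for the example.

```python
from collections import Counter


def ngram_counts(message: bytes, n: int) -> Counter:
    """Count every length-n substring of the message (sliding window, step 1)."""
    return Counter(message[i:i + n] for i in range(len(message) - n + 1))


def to_vector(counts: Counter, vocabulary: list) -> list:
    """Project the sparse counts onto a fixed, ordered list of substrings."""
    return [counts.get(s, 0) for s in vocabulary]


# Hypothetical message: the bit string 0110110 written as ASCII characters.
message = b"0110110"
counts = ngram_counts(message, n=4)
vector = to_vector(counts, [b"0110", b"1101", b"1011", b"0000"])
print(vector)
```

Keeping the counts sparse and projecting them onto an ordered vocabulary only when a dense vector is needed avoids allocating one dimension for every possible substring in $S$.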

Preferably, the second step of the present invention specifically comprises the following steps:

step S21, selecting k samples from the input data set to initialize the clusters $C_1, \ldots, C_k$;

step S22, for each remaining sample in the input data, measuring the similarity of the sample to each cluster and assigning the sample to the closest cluster according to the similarity, to obtain the updated cluster $C_j$, where $j$ is the index of the most similar cluster;

and step S23, repeating step S22 until all remaining samples have been assigned to the k clusters, whereby k prototype models are obtained.

Preferably, step S21 of the present invention specifically includes:

step S211, randomly selecting a sample from the data set as the initial cluster center $c_1$;

step S212, calculating, for each sample, the shortest distance to the cluster centers selected so far, i.e. the distance between the sample and its nearest cluster center, denoted $D(x)$, where the distance calculation uses the variance of the i-th dimension over all samples in the data set;

step S213, calculating the probability of each sample being selected as the next cluster center, $P(x) = \dfrac{D(x)^2}{\sum_{x' \in X} D(x')^2}$, and then selecting the next cluster center by roulette-wheel selection;

and repeating steps S212 to S213 until k cluster centers are selected.

Preferably, in step S22 of the present invention, the similarity between a sample and a cluster is calculated by cosine similarity, specifically:

the similarity is expressed as the cosine similarity of the input sample $x$ and the center $c_i$ of cluster $C_i$:

$$\cos\theta = \frac{x \cdot c_i}{\lVert x \rVert \, \lVert c_i \rVert}$$

where $\theta$ is the angle between the vectors $x$ and $c_i$; the smaller the angle, the greater the similarity of the two vectors;

the center $c_i$ of cluster $C_i$ is the centroid of all samples belonging to the cluster:

$$c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$

where $|C_i|$ denotes the number of samples in cluster $C_i$; whenever a sample is assigned to its closest cluster, the center of that cluster changes.
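The assignment step can be illustrated with a short Python sketch. It assumes each cluster is held as a plain list of sample vectors and recomputes the centroid on demand; the helper names `cosine`, `centroid` and `assign` and the toy vectors are hypothetical and chosen only for illustration.

```python
import math


def cosine(x, c):
    """Cosine of the angle between vectors x and c (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, c))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_c = math.sqrt(sum(b * b for b in c))
    return dot / (norm_x * norm_c) if norm_x and norm_c else 0.0


def centroid(samples):
    """Component-wise mean of all samples in a cluster."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]


def assign(x, clusters):
    """Add x to the cluster with the most similar centroid; return its index."""
    j = max(range(len(clusters)), key=lambda i: cosine(x, centroid(clusters[i])))
    clusters[j].append(x)  # the centroid of cluster j changes as a result
    return j


# Two toy clusters, each initialized with a single sample.
clusters = [[[1.0, 0.0]], [[0.0, 1.0]]]
j = assign([2.0, 1.0], clusters)
print(j, centroid(clusters[j]))
```

Because the centroid is recomputed from the cluster contents, the order in which samples are assigned influences the final clusters, which matches the remark that the center changes with every assignment.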

Preferably, step S3 of the present invention filters out, by setting a frequency threshold $t$, the features that occur in fewer than $t$ input samples of the prototype model, so as to obtain the detection models $M_i$, where $i \in [1, k]$.

Preferably, step S4 of the present invention specifically includes:

step S41, taking the ratio of the number of features of the detected message $m$ that belong to the detection model $M_i$ to the total number of features $l$ of $m$ as the score of $m$ against $M_i$; calculating this score once for each detection model $M_i$ in $M$ and taking the highest score as the confidence $\alpha$ of the detected message $m$, i.e.

$$\alpha(m, M) = \max_{1 \le i \le k} \frac{1}{l} \sum_{s \in M_i} \phi_s(m)$$

step S42, comparing the confidence $\alpha$ with the anomaly decision threshold $T$: when $\alpha(m, M) \le T$, the message is considered malicious and an anomaly alarm is issued; otherwise the message is considered benign and detection continues with the next message.

On the other hand, the invention also provides an industrial control system anomaly detection system based on message content analysis, which comprises a message collection and preprocessing module, a learning module, an optimization module and a detection module;

the message collecting and preprocessing module is used for collecting data messages in the industrial control system, mapping the messages to corresponding characteristic vectors and transmitting the characteristic vectors to the learning module and the detection module;

the learning module is used for constructing a prototype model of the network message based on an input data set consisting of the feature vectors transmitted by the message collecting and preprocessing module;

the optimization module is used for analyzing the occurrence frequency of the features in each prototype model and filtering the rare features in the models by using a frequency threshold value to obtain a detection model;

the detection module utilizes the detection model and the feature vector of the detection information transmitted by the message collection and preprocessing module to judge whether the detection information is abnormal.

Preferably, the mapping of the collected message to the corresponding feature vector by the message collection and preprocessing module of the present invention is specifically:

extracting all substrings of length n from the collected message m and recording their numbers of occurrences, each substring being associated with one dimension of the feature space, so that the message m is represented as a vector $x = \phi(m)$ of substring occurrence counts, the mapping being expressed as $\phi(m) = (\phi_s(m))_{s \in S}$;

Preferably, the learning module of the present invention is configured to perform steps S21 to S23 described above.

The invention has the following advantages and beneficial effects:

1. By adopting the message content analysis-based anomaly detection method for industrial control systems, the invention addresses the limitation that the binary protocols commonly used in industrial computer networks impose on traditional anomaly detection methods. The invention analyzes the message content of the industrial control system, builds prototype models that reflect in a general way (as general rules) the exchanged data of protocols whose content and structure are unknown, and uses the prototype models to detect anomalies in the detection information, thereby greatly improving the stability and safety of the industrial control system.

2. The invention has a wide range of application: it is well suited to anomaly detection in nuclear reactor industrial control systems and can also be applied to anomaly detection in other industrial control systems.

Drawings

The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:

FIG. 1 is a schematic block diagram of the system of the present invention.

FIG. 2 is a flow chart of the present invention for building prototype models.

FIG. 3 is a flow chart of the optimization model of the present invention.

FIG. 4 is a flow chart of the optimization model based anomaly detection of the present invention.

Detailed Description

Hereinafter, the terms "comprising" or "may include" used in various embodiments of the present invention indicate the presence of the disclosed function, operation or element, and do not limit the addition of one or more functions, operations or elements. Furthermore, as used in various embodiments of the present invention, the terms "comprises," "comprising," "includes," "including," "has," "having" and their derivatives are intended only to indicate a particular feature, number, step, operation, element, component, or combination of the foregoing, and should not be construed as excluding the existence of, or the possibility of adding, one or more other features, numbers, steps, operations, elements, components, or combinations of the foregoing.

In various embodiments of the invention, the expression "A or B" or "at least one of A or/and B" includes any or all combinations of the words listed together. For example, the expression "A or B" or "at least one of A or/and B" may include A, may include B, or may include both A and B.

Expressions (such as "first", "second", and the like) used in various embodiments of the present invention may modify various constituent elements in various embodiments, but may not limit the respective constituent elements. For example, the above description does not limit the order and/or importance of the elements described. The foregoing description is for the purpose of distinguishing one element from another. For example, the first user device and the second user device indicate different user devices, although both are user devices. For example, a first element could be termed a second element, and, similarly, a second element could be termed a first element, without departing from the scope of various embodiments of the present invention.

It should be noted that: if it is described that one constituent element is "connected" to another constituent element, the first constituent element may be directly connected to the second constituent element, and a third constituent element may be "connected" between the first constituent element and the second constituent element. In contrast, when one constituent element is "directly connected" to another constituent element, it is understood that there is no third constituent element between the first constituent element and the second constituent element.

The terminology used in the various embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the various embodiments of the invention. As used herein, the singular forms are intended to include the plural forms as well, unless the context clearly indicates otherwise. Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of the present invention belong. The terms (such as those defined in commonly used dictionaries) should be interpreted as having a meaning that is consistent with their contextual meaning in the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein in various embodiments of the present invention.

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.

Example 1

The embodiment provides an industrial control system anomaly detection method based on message content analysis.

The detection method of this embodiment relies on four key modules of the detection system, shown in FIG. 1:

A. Message collection and preprocessing module. This module collects data messages in the industrial control system and maps them to corresponding feature vectors. The feature vectors form the input data set and are used by the subsequent learning module and detection module.

B. Learning module. This module builds the prototype models. To do so, it automatically partitions the network traffic into k sets of messages with similar content, approximating the states of the underlying protocol. The module not only separates message types but also derives a prototype model for each message type in the process. A prototype model represents the structure of an individual message type and also describes the data that messages of this type typically contain.

C. Optimization module. One of the main obstacles to analyzing unknown protocols is noise, i.e. seemingly random data that prevents the inference of a suitable content model. This module addresses the problem by analyzing the occurrence frequency of the features in each prototype model and filtering out rare features with a frequency threshold t.

D. Detection module. After the prototype models are built and optimized, this module uses the deviation between an input message and the built models as the criterion for anomaly detection. The learning module and the optimization module form the basis of the detection module, so the detector D(k, t, T) is parameterized by the number of message sets k, the frequency threshold t and the anomaly decision threshold T.

The message collection and preprocessing module supports two ways of collecting messages: monitoring the industrial control system to capture messages, or having the operator of the industrial control system provide messages to the learning module. A collected message m is mapped to its feature vector $x = \phi(m)$ as follows: all substrings of length n are extracted from m and their numbers of occurrences are recorded; each substring is associated with one dimension of the feature space, so the message m can be represented as a vector of substring occurrence counts. Formally, $\phi(m) = (\phi_s(m))_{s \in S}$, where $\phi_s(m) = f(s, m)$, the set $S$ contains all possible substrings of length n, and the function $f(s, m)$ gives the number of occurrences of $s$ in the input message m. With this mapping, a set of messages $m_1, m_2, \ldots, m_N$ can be converted into a set of vectors $X = \{x_1, x_2, \ldots, x_N\}$ with $x_i = \phi(m_i)$, which serves as the input data set for the learning phase; alternatively, a single message to be checked is converted into a single vector for the anomaly decision in the detection phase.

A flow chart for building a prototype model of a network message in the learning module is shown in fig. 2.

After the input data set is constructed, k samples are selected from the input data to initialize the clusters $C_1, \ldots, C_k$. First, a sample is randomly selected from the data set as the initial cluster center $c_1$. Next, for each sample, the shortest distance to the cluster centers selected so far, i.e. the distance to its nearest cluster center, is calculated and denoted $D(x)$; the distance calculation uses the variance of the i-th dimension over all samples in the data set. Then the probability of each sample being selected as the next cluster center is calculated as

$$P(x) = \frac{D(x)^2}{\sum_{x' \in X} D(x')^2}$$

and the next cluster center is chosen by roulette-wheel selection. This is repeated until k cluster centers are selected.

Then, for each remaining sample, the similarity of the sample to each cluster is measured and the sample is assigned to the closest cluster $C_j$. The similarity of a sample to a cluster is defined as the cosine similarity of the input sample $x$ and the center $c_i$ of cluster $C_i$:

$$\cos\theta = \frac{x \cdot c_i}{\lVert x \rVert \, \lVert c_i \rVert}$$

where $\theta$ is the angle between the vectors $x$ and $c_i$; the smaller the angle, the greater the similarity of the two vectors. The center $c_i$ of cluster $C_i$ is the centroid of all samples belonging to the cluster:

$$c_i = \frac{1}{|C_i|} \sum_{x \in C_i} x$$

To save working memory, the invention stores only the counts of messages and of their substrings, not the messages themselves. For each cluster $C_i$, only the total number of samples contained in the cluster and the cumulative count vector are saved as the prototype model:

$$P_i = \sum_{x \in C_i} x$$

This is sufficient to calculate the similarity between messages and clusters defined above and can be used for the subsequent anomaly detection.
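A compact cluster summary of this kind might look as follows in Python. The class name `Prototype` and its methods are hypothetical, and the sample vectors are toy values; the sketch only shows that a sample count plus a summed count vector is enough to recover the cluster center without keeping any message.

```python
class Prototype:
    """Cluster summary that keeps only a sample count and a summed count vector."""

    def __init__(self, dim: int):
        self.count = 0            # number of samples absorbed so far
        self.total = [0] * dim    # cumulative count vector

    def add(self, x):
        """Absorb one feature vector; the vector itself is not stored."""
        self.count += 1
        self.total = [t + v for t, v in zip(self.total, x)]

    def center(self):
        """Centroid recovered from the summary: total divided by count."""
        return [t / self.count for t in self.total]


p = Prototype(dim=3)
p.add([1, 0, 2])
p.add([3, 2, 0])
print(p.count, p.total, p.center())
```

Memory use per cluster is then fixed by the number of feature dimensions and does not grow with the number of training messages.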

A flow chart of the optimization model in the optimization module is shown in fig. 3.

The module optimizes the model by using occurrence counts to filter out irrelevant information. While the prototype models are built, the frequency with which each feature (substring) $s \in S$ occurs in the samples is recorded, so each prototype model $P_i$ accurately represents these feature frequencies in the training samples. A detection scheme for the binary protocol can therefore be derived directly from the prototype models $P_i$ without additional training.

By setting the frequency threshold t, the method filters out features that occur in fewer than t of the input samples associated with a cluster, which effectively reduces noise from the training data. Each model $M_i$ is thus a set of features that can be used for detection, and together these sets make up the overall content model $M = \{M_1, \ldots, M_k\}$ used by D(k, t, T), where $M_i = \{ s \in S \mid P_i(s) \ge t \}$ and $P_i(s)$ is the entry of $P_i$ for feature $s$.

An optimization model based anomaly detection flow diagram in the detection module is shown in fig. 4.

The detection module performs anomaly detection using the optimized model M and the detected message m delivered by the message collection and preprocessing module; the detector D(k, t, T) assumes that a detected message m should resemble what the models expect. The module takes the ratio of the number of features of the message m that belong to the detection model $M_i$ to the total number of features $l$ of m as the score of m against $M_i$. A score is calculated once for each model $M_i$ in M, and the highest score is taken as the confidence $\alpha$ of the detected message m, i.e.

$$\alpha(m, M) = \max_{1 \le i \le k} \frac{1}{l} \sum_{s \in M_i} \phi_s(m)$$

The confidence $\alpha$ is compared with the anomaly decision threshold T. If $\alpha(m, M) \le T$, the message is considered malicious because it does not match what any of the models expects, and the module issues an anomaly alarm; otherwise the message is considered benign and the module continues with the next message.

Example 2

In this embodiment, the detection method provided in embodiment 1 is adopted to detect a group of collected messages, and the specific process is as follows:

as shown in table 1 below, a set of messages is captured, rules describing the set of messages are then learned, and the rules may be used to generate new messages that conform to the format of the set of messages. It is clear that the exact protocol specification cannot be learned, but rather the protocol is approximately described in terms of captured network traffic. The set of messages depicted obviously contains a binary transmission command 101111 to detect a sensor value from a particular device. In addition, it contains the length of the transmission string, the front of the message being a 1-byte integer, and another unspecified flag at the end. In this set of messages, only the respective part of the device identifier and the last binary flag have changed, so that the algorithm considers the rest to be unchanged.

TABLE 1

As shown in table 2 below, the attack message randomly selects an existing network message and replaces the variable field in the message with the encoded attack field, while leaving the other constant or unmodified variable fields unchanged.

TABLE 2

With n = 4 for feature extraction, the feature vectors extracted from the set of messages are shown in Table 3 below.

TABLE 3

Let the number of message groups be k = 2 and select point 5 as the initial cluster center. Then $D(x)$ and the probability of being selected as the second cluster center are calculated for each sample. For points 1 to 8, $D(x)^2$ is 34.13, 27.23, 22.67, 22.67, 0.0, 39.23, 20.0, 50.87; $P(x)$ is 0.157, 0.126, 0.105, 0.105, 0.0, 0.181, 0.092, 0.234; and the cumulative sums are 0.157, 0.283, 0.388, 0.493, 0.493, 0.674, 0.766, 1. A random number between 0 and 1 is generated; assuming the generated number is 0.25, point 2 is selected as the second cluster center. Cluster $C_1$ is initialized with point 2 and cluster $C_2$ with point 5.
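The selection of the second cluster center can be reproduced with a small Python sketch that takes the squared distances listed above as given. The helper name `pick` is hypothetical, and the fixed value 0.25 stands in for the random draw, as in the example.

```python
from bisect import bisect_left
from itertools import accumulate

# Squared distances D(x)^2 of points 1..8 to the first center (point 5).
d2 = [34.13, 27.23, 22.67, 22.67, 0.0, 39.23, 20.0, 50.87]

total = sum(d2)
probs = [v / total for v in d2]       # P(x) = D(x)^2 / sum of all D(x)^2
cumulative = list(accumulate(probs))  # roulette-wheel interval boundaries


def pick(r, cumulative):
    """Return the 1-based number of the point whose interval contains r."""
    return bisect_left(cumulative, r) + 1


chosen = pick(0.25, cumulative)
print(chosen)
```

Point 5 has zero probability, so its interval on the wheel is empty and it cannot be drawn a second time.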

The remaining points are then clustered. The similarities of point 1 to the two clusters $C_1$ and $C_2$ are calculated, and point 1 is added to cluster $C_2$, whose center becomes [0.5,2.5,4.0,1.0,2.5,2.0,1.5,2.5,2.5,1.5,1.0,3.0,2.0,2.0,2.5,1.0]. The other points are clustered in the same way. In the final clustering result, $C_1$ contains points 2, 3, 6, 8 with center [0.5,0.75,2.0,2.5,1.0,1.5,2.25,3.25,0.75,2.5,0.75,3.25,2.25,2.75,3.25,1.5], and $C_2$ contains points 1, 4, 5, 7 with center [0.75,2.0,3.0,1.5,2.5,1.75,1.75,2.5,2.0,1.75,1.5,2.75,1.5,2.25,2.25,1.0]. After clustering is complete, the prototype models $P_i$ are generated: $P_1$ is [2,3,8,10,4,6,9,13,3,10,3,13,9,11,13,6] and $P_2$ is [3,8,12,6,10,7,7,10,8,7,6,11,6,9,9,4].

The prototype models are then optimized. With the frequency threshold t = 8, features such as 0000, 0001 and 0100 in $P_1$ are filtered out, while features such as 0010, 0011 and 0110 are retained. The optimized content model M, with each feature written as its decimal position counted from 1, is {[3,4,7,8,10,12,13,14,15], [2,3,5,8,9,12,14,15]}.
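The filtering step can be checked with a few lines of Python applied to the two prototype vectors of this example. The function name `filter_features` is hypothetical; positions are reported counted from 1 to match the decimal notation used for M.

```python
# Prototype models from the example (cumulative count vectors, 16 features each).
P1 = [2, 3, 8, 10, 4, 6, 9, 13, 3, 10, 3, 13, 9, 11, 13, 6]
P2 = [3, 8, 12, 6, 10, 7, 7, 10, 8, 7, 6, 11, 6, 9, 9, 4]


def filter_features(prototype, t):
    """Keep the 1-based positions of features whose count is at least t."""
    return [i + 1 for i, count in enumerate(prototype) if count >= t]


M = [filter_features(P, t=8) for P in (P1, P2)]
print(M)
```

Raising t removes more features and makes the detection models stricter, while lowering it keeps more of the noise present in the training traffic.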

Assume the anomaly decision threshold T is 0.7 and the feature vector extracted from the attack message m is [0,2,2,4,3,0,1,2,3,5,4,3,4,0,2,1]. The numbers of features matching the two models in M are 2+4+1+2+5+3+4+2 = 23 and 2+2+3+2+3+3+2 = 17, respectively, and the total number of features of m is 36. The confidence of the message m is therefore α = 23/36 ≈ 0.64 < T, so the message is considered anomalous and an alarm is issued.
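The decision for this attack message can be reproduced with the following Python sketch. The function name `confidence` is hypothetical; the model M, the feature vector and the threshold are the values of this example, with model entries read as positions counted from 1.

```python
# Optimized content model M (1-based feature positions) and the attack message vector.
M = [[3, 4, 7, 8, 10, 12, 13, 14, 15], [2, 3, 5, 8, 9, 12, 14, 15]]
m = [0, 2, 2, 4, 3, 0, 1, 2, 3, 5, 4, 3, 4, 0, 2, 1]
T = 0.7  # anomaly decision threshold


def confidence(vector, models):
    """Highest share of the message's features that a single model covers."""
    total = sum(vector)
    if total == 0:
        return 0.0
    return max(sum(vector[i - 1] for i in model) / total for model in models)


alpha = confidence(m, M)
is_anomalous = alpha <= T
print(round(alpha, 2), is_anomalous)
```

Taking the maximum over all models means a message is flagged only when none of the learned message types explains enough of its content.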

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A message content analysis-based industrial control system anomaly detection method is characterized by comprising the following steps:

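step one, collecting data messages in the industrial control system and mapping the messages to corresponding feature vectors;

step two, constructing prototype models of the network messages from the input data set formed by the feature vectors;

step three, analyzing the occurrence frequency of the features in each prototype model and filtering rare features out of the model with a frequency threshold to obtain a detection model;

step four, performing anomaly detection on the industrial control system using the model optimized in step three and the detection information collected in step one;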

the mapping of the collected message to the corresponding feature vector in step one is specifically:

extracting all substrings of length n from the collected message m and recording their numbers of occurrences, each substring being associated with one dimension of the feature space, so that the message m is represented as a vector $x = \phi(m)$ of substring occurrence counts, the mapping being expressed as $\phi(m) = (\phi_s(m))_{s \in S}$;

converting a set of messages $m_1, m_2, \ldots, m_N$ into a set of vectors $X = \{x_1, x_2, \ldots, x_N\}$, where $x_i = \phi(m_i)$, and using $X$ as the input data set for model construction in step two; or converting a single message, as a detected message, into a single vector for the anomaly decision on the detected message in step four;

the second step specifically comprises the following steps:

The step S21 specifically includes:

step S211, randomly selecting a sample from the data set as the initial cluster center $c_1$;

step S212, calculating, for each sample, the shortest distance to the cluster centers selected so far, i.e. the distance between the sample and its nearest cluster center, denoted $D(x)$, where the distance calculation uses the variance of the i-th dimension over all samples in the data set;

step S213, calculating the probability of each sample being selected as the next cluster center, $P(x) = \dfrac{D(x)^2}{\sum_{x' \in X} D(x')^2}$, and then selecting the next cluster center by roulette-wheel selection;

repeating steps S212 to S213 until k cluster centers are selected;

step S22, for each remaining sample in the input data, measuring the similarity of the sample to each cluster and assigning the sample to the closest cluster according to the similarity, to obtain the updated cluster $C_j$;

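in step S22, the similarity between a sample and a cluster is calculated by cosine similarity, specifically: the similarity is expressed as the cosine similarity $\cos\theta = \dfrac{x \cdot c_i}{\lVert x \rVert \, \lVert c_i \rVert}$ of the input sample $x$ and the center $c_i$ of cluster $C_i$, the center being the centroid $c_i = \dfrac{1}{|C_i|} \sum_{x \in C_i} x$ of all samples belonging to the cluster,

wherein $|C_i|$ denotes the number of samples in cluster $C_i$, and whenever a sample is assigned to its closest cluster, the center of that cluster changes;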

step S23, repeating step S22 until all the remaining samples are distributed into k clusters, namely k prototype models are obtained;

the step S4 specifically includes:

step S41, taking the ratio of the number of features of the detected message $m$ that belong to the detection model $M_i$ to the total number of features $l$ of $m$ as the score of $m$ against $M_i$; calculating this score once for each detection model $M_i$ in $M$ and taking the highest score as the confidence $\alpha$ of the detected message $m$, namely $\alpha(m, M) = \max_{1 \le i \le k} \frac{1}{l} \sum_{s \in M_i} \phi_s(m)$.

2. The method according to claim 1, wherein in step S3, features occurring in fewer than $t$ input samples of the prototype model are filtered out by setting a frequency threshold $t$, so as to obtain the detection models $M_i$, where $i \in [1, k]$.

3. An industrial control system anomaly detection system based on message content analysis is characterized by comprising a message collection and preprocessing module, a learning module, an optimization module and a detection module;

the detection module judges whether the detection information is abnormal or not by using the detection model and the characteristic vector of the detection information transmitted by the message collection and preprocessing module; the mapping of the acquired message to the corresponding feature vector by the message collecting and preprocessing module is specifically as follows:

converting a set of messages $m_1, m_2, \ldots, m_N$ into a set of vectors $X = \{x_1, x_2, \ldots, x_N\}$, where $x_i = \phi(m_i)$, and using $X$ as the input data set for model construction in step two; or converting a single message, as a detected message, into a single vector for the anomaly decision on the detected message; the learning module is configured to perform the steps of:

The step S21 specifically includes:

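step S211, randomly selecting a sample from the data set as the initial cluster center $c_1$;

step S212, calculating, for each sample, the shortest distance to the cluster centers selected so far, i.e. the distance between the sample and its nearest cluster center, denoted $D(x)$, where the distance calculation uses the variance of the i-th dimension over all samples in the data set;

step S213, calculating the probability of each sample being selected as the next cluster center, $P(x) = \dfrac{D(x)^2}{\sum_{x' \in X} D(x')^2}$, and then selecting the next cluster center by roulette-wheel selection;

repeating steps S212 to S213 until k cluster centers are selected;

step S22, for each remaining sample in the input data, measuring the similarity of the sample to each cluster and assigning the sample to the closest cluster according to the similarity, to obtain the updated cluster $C_j$;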

the abnormality judgment process of the detection module comprises the following steps: