CN110995692A

CN110995692A - Network security intrusion detection method based on factor analysis and subspace collaborative representation

Info

Publication number: CN110995692A
Application number: CN201911192193.4A
Authority: CN
Inventors: 张明明; 李萌; 陈咏秋
Original assignee: State Grid Jiangsu Electric Power Co Ltd; Jiangsu Electric Power Information Technology Co Ltd; Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Current assignee: State Grid Jiangsu Electric Power Co Ltd; Jiangsu Electric Power Information Technology Co Ltd; Information and Telecommunication Branch of State Grid Jiangsu Electric Power Co Ltd
Priority date: 2019-11-28
Filing date: 2019-11-28
Publication date: 2020-04-10

Abstract

The invention discloses a network security intrusion detection method based on factor analysis and subspace collaborative representation. The method obtains security factors reflecting different security states of a network at a certain moment or within a certain time period, the security factors comprising TCP connection basic characteristics, TCP connection content characteristics, time-based network traffic statistical characteristics and host-based network traffic statistical characteristics. The factors are analyzed by 4 algorithms to obtain the contribution weight of each factor to network intrusion detection, the factors are ranked according to the mean of the contribution weights obtained by the 4 algorithms, and the N factors with the largest contribution weights are extracted. A subspace collaborative representation classification algorithm is then used to detect the network security state. The method is fast and effective, ingenious in design and novel in concept, and has good application prospects.

Description

Network security intrusion detection method based on factor analysis and subspace collaborative representation

Technical Field

The invention relates to the technical field of network security, in particular to a network security intrusion detection method based on factor analysis and subspace collaborative representation.

Background

The wide application of information technology and the rapid development of cyberspace have greatly promoted social prosperity and progress, but information security problems arising in the course of informatization, such as virus infection, illegal intrusion, brute-force cracking and denial-of-service attacks, are increasingly prominent. To prevent such incidents, the network security situation should be predicted and analyzed in advance and protective measures taken according to the severity of the security hazard, so that asset losses can be effectively reduced.

Operation and maintenance work, which is as important as technology construction work, has gradually received attention. How to reduce operation and maintenance costs, improve operation and maintenance efficiency and ensure operation and maintenance safety is a broad subject. Network security analysis is an essential link in operation and maintenance; because it bears on the stable operation of the most critical systems, it receives particular attention.

Network security is an indispensable part of management in enterprises large and small. Current network security prediction and analysis can neither assess the network security situation nor provide a method for quantitative analysis of network security; this is the problem to be solved.

Disclosure of Invention

To overcome the defects of the prior art, the invention aims to provide a network security intrusion detection method based on factor analysis and subspace collaborative representation. The method obtains security factors reflecting different security states of a network at a certain moment or within a certain time period, the security factors comprising TCP connection basic characteristics, TCP connection content characteristics, time-based network traffic statistical characteristics and host-based network traffic statistical characteristics. The factors are analyzed by 4 algorithms to obtain the contribution weight of each factor to network intrusion detection, the factors are ranked according to the mean of the contribution weights obtained by the 4 algorithms, the N factors with the largest contribution weights are extracted, and a subspace collaborative representation classification algorithm is used to detect the security state of the network.

In order to achieve the purpose, the invention adopts the technical scheme that:

A network security intrusion detection method based on factor analysis and subspace collaborative representation comprises the following steps,

step (A), standardizing and normalizing network data;

step (B), performing factor analysis on the network data, obtaining the contribution weight of each factor by using four factor analysis methods, and calculating the contribution weight mean value of each factor;

step (C), extracting the N factors with the largest contribution weights and detecting the network data by using a subspace collaborative representation classification algorithm; if the detection result is normal, returning to step (A) and evaluating the network data at the next moment or in the next time period; if the detection result is abnormal, treating it as an abnormal event and outputting an alarm.

In the invention, in the step (A), the network data comprises TCP connection basic characteristics, TCP connection content characteristics, time-based network flow statistical characteristics and host-based network flow statistical characteristics.

The aforementioned TCP connection basic characteristics include the following factors,

network connection duration, network protocol type, network service type of the target host, number of bytes of data from the source host to the target host, number of bytes of data from the target host to the source host.

The content characteristics of the aforementioned TCP connection include the following factors,

number of failed login attempts, whether login was successful, number of "compromised" conditions that occurred.

The aforementioned time-based network traffic statistics include factors,

the number of connections having the same target host as the current connection in the past two seconds; the number of connections having the same service as the current connection in the past two seconds; among the connections having the same target host as the current connection in the past two seconds, the percentage of connections in which a "SYN" error occurred; among the connections having the same service as the current connection in the past two seconds, the percentage of connections in which a "SYN" error occurred; among the connections having the same target host as the current connection in the past two seconds, the percentage of connections in which a "REJ" error occurred; among the connections having the same service as the current connection in the past two seconds, the percentage of connections in which a "REJ" error occurred; among the connections having the same target host as the current connection in the past two seconds, the percentage of connections having the same service as the current connection; among the connections having the same target host as the current connection in the past two seconds, the percentage of connections having a different service from the current connection; among the connections having the same service as the current connection in the past two seconds, the percentage of connections having a different target host from the current connection.

The aforementioned host-based network traffic statistics include factors,

among the previous 100 connections, the number of connections having the same target host as the current connection; among the previous 100 connections, the percentage of connections having the same target host and the same service as the current connection; among the previous 100 connections, the percentage of connections having a different service from the current connection and the same source port as the current connection; among the previous 100 connections having the same target host as the current connection, the percentage of connections having a different source host from the current connection; among the previous 100 connections having the same target host as the current connection, the percentage of connections in which a "SYN" error occurred; among the previous 100 connections having the same target host and the same service as the current connection, the percentage of connections in which a "SYN" error occurred; among the previous 100 connections having the same target host as the current connection, the percentage of connections in which a "REJ" error occurred; among the previous 100 connections having the same target host and the same service as the current connection, the percentage of connections in which a "REJ" error occurred.
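The two families of traffic statistics differ mainly in how the window of earlier connections is chosen: by time (the past two seconds) or by count (the previous 100 connections). The following is a minimal sketch of that idea only. The `Conn` record, its field names, the status codes and the statistic names are illustrative assumptions and are not taken from the patent; only a few of the listed factors are computed.

```python
from dataclasses import dataclass

@dataclass
class Conn:              # hypothetical record layout, not taken from the patent
    time: float          # arrival time in seconds
    dst_host: str
    service: str
    flag: str            # assumed status codes: "SF" normal, "S0" SYN error, "REJ" rejected

SYN_ERROR_FLAGS = {"S0", "S1", "S2", "S3"}   # assumed encoding of "SYN" errors

def time_window(history, current, seconds=2.0):
    """Earlier connections that arrived within `seconds` of the current one."""
    return [c for c in history if 0 <= current.time - c.time <= seconds]

def host_window(history, size=100):
    """The most recent `size` earlier connections (history is in time order)."""
    return history[-size:]

def window_stats(window, current):
    """Counts and percentages relative to `current` over one window."""
    same_host = [c for c in window if c.dst_host == current.dst_host]
    same_srv = [c for c in window if c.service == current.service]

    def rate(conns, predicate):
        return sum(1 for c in conns if predicate(c)) / len(conns) if conns else 0.0

    return {
        "count": len(same_host),
        "srv_count": len(same_srv),
        "syn_error_rate": rate(same_host, lambda c: c.flag in SYN_ERROR_FLAGS),
        "rej_error_rate": rate(same_host, lambda c: c.flag == "REJ"),
        "same_service_rate": rate(same_host, lambda c: c.service == current.service),
        "srv_diff_host_rate": rate(same_srv, lambda c: c.dst_host != current.dst_host),
    }

history = [
    Conn(0.0, "10.0.0.5", "http", "SF"),   # older than two seconds
    Conn(9.0, "10.0.0.5", "http", "S0"),
    Conn(9.5, "10.0.0.5", "ftp", "REJ"),
    Conn(9.8, "10.0.0.7", "http", "SF"),
]
current = Conn(10.0, "10.0.0.5", "http", "S0")

time_stats = window_stats(time_window(history, current), current)
host_stats = window_stats(host_window(history), current)
```

Whether the current connection itself counts toward the window is not stated in the text; this sketch excludes it.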

In step (A), standardizing the data comprises the following steps,

(A1) calculating the mean of the samples,

wherein X is the sample data;

(A2) calculating the standard deviation of the samples,

(A3) standardizing the data according to the mean and standard deviation of the samples,

normalizing the data comprises the following steps,

(A4) calculating the minimum value of the samples, X_min = min{X'_ij};

(A5) calculating the maximum value of the samples, X_max = max{X'_ij};

(A6) normalizing the data according to the minimum and maximum values of the samples,
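Steps (A1) to (A6) can be sketched as a z-score standardization followed by min-max scaling. This is a minimal sketch under stated assumptions: statistics are computed per feature (column), the population standard deviation is used, and constant columns are mapped to 0. The text does not specify any of these choices, and the function names are illustrative.

```python
import math

def standardize(rows):
    """Z-score each column: (x - mean) / std, using the population std."""
    n = len(rows)
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / n) for c, m in zip(cols, means)]
    # A constant column has std 0; map it to 0.0 instead of dividing by zero.
    return [[(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
            for row in rows]

def min_max(rows):
    """Rescale each column to [0, 1]: (x - min) / (max - min)."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    highs = [max(c) for c in cols]
    return [[(v - lo) / (hi - lo) if hi > lo else 0.0
             for v, lo, hi in zip(row, lows, highs)]
            for row in rows]

data = [[0.0, 100.0], [2.0, 300.0], [4.0, 500.0]]   # 3 samples, 2 factors
scaled = min_max(standardize(data))
```

After both steps every factor lies in [0, 1], so factors with very different raw ranges (for example durations and byte counts) become comparable.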

In the network security intrusion detection method based on factor analysis and subspace collaborative representation, step (B), performing factor analysis on the network data, obtaining the contribution weight of each factor by using four factor analysis methods and calculating the mean contribution weight of each factor, comprises the following steps,

(B1) calculating the contribution weight of each factor by using a variance threshold filtering method;

(B2) calculating the contribution weight of each factor by using a feature selection method based on mutual information;

(B3) calculating the contribution weight of each factor by using a feature selection method based on Lasso regression;

(B4) calculating the contribution weight of each factor by using a feature selection method based on the ReliefF algorithm;

(B5) averaging the 4 contribution weights of each factor.
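Step (B5), together with the later extraction of the N factors with the largest contribution weights, can be sketched as follows. The factor names and weight values are made up for illustration, and the sketch assumes the four methods produce weights on comparable scales, which the text does not address.

```python
def mean_weights(weight_sets):
    """Average the per-factor weights produced by several selection methods."""
    names = weight_sets[0].keys()
    return {name: sum(w[name] for w in weight_sets) / len(weight_sets) for name in names}

def top_n(weights, n):
    """Names of the n factors with the largest mean weight, largest first."""
    return sorted(weights, key=weights.get, reverse=True)[:n]

# Illustrative weights only; one dictionary per method (B1) to (B4).
variance = {"duration": 0.2, "src_bytes": 0.8, "failed_logins": 0.4}
mutual_info = {"duration": 0.1, "src_bytes": 0.6, "failed_logins": 0.5}
lasso = {"duration": 0.0, "src_bytes": 0.4, "failed_logins": 0.7}
relieff = {"duration": 0.1, "src_bytes": 0.6, "failed_logins": 0.4}

averaged = mean_weights([variance, mutual_info, lasso, relieff])
selected = top_n(averaged, 2)
```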

The aforementioned variance threshold filtering method, (B1), comprises the following steps,

(1) calculating the variance Var(i) of each factor;

(2) sorting the obtained variances in descending order;

(3) retaining the factors whose variance is greater than the threshold T as the filtering result.
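The three steps above can be sketched with the standard library. The factor names, sample values and threshold are illustrative, and the population variance is an assumption since the text does not say which variance estimator is used.

```python
from statistics import pvariance

def variance_filter(columns, threshold):
    """Return (name, variance) pairs with variance > threshold, largest first."""
    variances = {name: pvariance(values) for name, values in columns.items()}
    ranked = sorted(variances.items(), key=lambda item: item[1], reverse=True)
    return [(name, var) for name, var in ranked if var > threshold]

columns = {
    "constant_factor": [0.0, 0.0, 0.0, 0.0],   # carries no information
    "src_bytes": [0.1, 0.2, 0.1, 0.2],
    "duration": [0.0, 0.5, 1.0, 0.5],
}
kept = variance_filter(columns, threshold=0.001)
kept_names = [name for name, _ in kept]
```

A factor that never changes cannot help distinguish normal traffic from attacks, which is why it is filtered out first.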

The aforementioned mutual information-based feature selection method, (B2), comprises the following steps,

(1) the feature matrix is denoted X and the class (label) vector is denoted Y, where n is the number of samples, s is the number of features, and x_i is the i-th feature vector (i = 1, …, s);

(2) calculating the mutual information MI(i) between each feature vector x_i and Y;

(3) sorting the obtained mutual information values MI in descending order;

(4) retaining the factors whose mutual information is greater than the threshold T as the result of feature selection.

The mutual information MI is calculated as follows,

$$MI(X;Y)=\sum_{y\in Y}\sum_{x\in X}p(x,y)\log\frac{p(x,y)}{p(x)\,p(y)}$$

where X and Y are two discrete random variables, p(x, y) is the joint probability distribution of X and Y, and p(x) and p(y) are the marginal probability distributions of X and Y, respectively.
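For discrete variables the mutual information can be estimated directly from observed frequencies. The sketch below uses the natural logarithm, so the result is in nats. The function name and the toy data are illustrative, and continuous factors would first need to be discretized, a step the text does not describe.

```python
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Sum over observed (x, y) of p(x,y) * log(p(x,y) / (p(x) * p(y)))."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in joint.items())

labels = [0, 0, 1, 1]
informative = [0, 0, 1, 1]   # identical to the labels
useless = [0, 1, 0, 1]       # statistically independent of the labels

mi_informative = mutual_information(informative, labels)
mi_useless = mutual_information(useless, labels)
```

A factor that determines the label reaches the entropy of the label (log 2 here), while an independent factor scores zero.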

The aforementioned feature selection method based on Lasso regression, (B3), comprises the following steps,

(1) the feature matrix is denoted X and the class (label) vector is denoted Y, where n is the number of samples and s is the number of features;

(2) performing Lasso regression on X and Y;

(3) sorting the obtained Lasso regression results in descending order;

(4) retaining the factors whose regression result is greater than the threshold T as the result of feature selection.
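The text does not say how the Lasso regression is solved or which quantity is ranked. As one possible reading, the sketch below fits the Lasso by cyclic coordinate descent and uses the absolute value of each coefficient as the factor weight. The objective scaling, the `alpha` value and the toy data are assumptions.

```python
def soft_threshold(value, limit):
    """Shrink `value` toward zero by `limit`; values within the limit become 0."""
    if value > limit:
        return value - limit
    if value < -limit:
        return value + limit
    return 0.0

def lasso_weights(rows, targets, alpha=0.1, sweeps=200):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent."""
    n, s = len(rows), len(rows[0])
    w = [0.0] * s
    for _ in range(sweeps):
        for j in range(s):
            # Correlation of factor j with the residual that ignores factor j.
            rho = 0.0
            for row, y in zip(rows, targets):
                partial = sum(w[k] * row[k] for k in range(s) if k != j)
                rho += row[j] * (y - partial)
            norm = sum(row[j] ** 2 for row in rows)
            w[j] = soft_threshold(rho / n, alpha) / (norm / n) if norm > 0 else 0.0
    return [abs(v) for v in w]   # assumed weight: coefficient magnitude

# Factor 0 equals the label; factor 1 is uncorrelated with it.
rows = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
targets = [1.0, 1.0, -1.0, -1.0]
weights = lasso_weights(rows, targets)
```

The L1 penalty drives the coefficient of the uninformative factor exactly to zero, which is what makes Lasso usable as a feature selector.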

The aforementioned feature selection method based on the ReliefF algorithm, (B4), comprises the following steps,

(1) the feature matrix is denoted X and the class (label) vector is denoted Y, where n is the number of samples and s is the number of features;

(2) calculating the weight W of each factor by using the ReliefF algorithm;

(3) sorting the obtained factor weights in descending order;

(4) retaining the factors whose weight is greater than the threshold T as the result of feature selection.

The weight W in step (2) is calculated as follows:

(1) randomly selecting a sample point R_i in X;

(2) finding the k nearest-neighbor samples H_j of the same class as R_i;

(3) for each class C ≠ class(R_i), finding the k nearest-neighbor samples M_j(C) of R_i in class C;

(4) looping p times to update the contribution weight of each factor, the update formula being as follows:

(5) repeating steps (1), (2), (3) and (4) m times.
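The update formula itself is not reproduced in the text, so the sketch below uses the commonly published ReliefF update as an assumption: same-class neighbours (hits) lower a factor's weight by their difference from R_i, and other-class neighbours (misses) raise it, weighted by class prior. It also assumes numeric factors already scaled to [0, 1], Manhattan distance for the neighbour search, and reads the loop over p in step (4) as a loop over the factors.

```python
import random

def relieff(rows, labels, k=2, m=20, seed=0):
    """ReliefF weights for numeric factors already scaled to [0, 1]."""
    rng = random.Random(seed)
    n, s = len(rows), len(rows[0])
    prior = {c: labels.count(c) / n for c in set(labels)}
    weights = [0.0] * s

    def distance(p, q):
        return sum(abs(a - b) for a, b in zip(p, q))

    def nearest(index, cls):
        """Indices of the k samples of class `cls` closest to rows[index]."""
        others = [j for j in range(n) if j != index and labels[j] == cls]
        return sorted(others, key=lambda j: distance(rows[index], rows[j]))[:k]

    for _ in range(m):                                    # step (5): m repetitions
        i = rng.randrange(n)                              # step (1): random sample R_i
        hits = nearest(i, labels[i])                      # step (2): neighbours H_j
        misses = {c: nearest(i, c) for c in prior if c != labels[i]}   # step (3): M_j(C)
        for a in range(s):                                # step (4): update every factor
            for j in hits:
                weights[a] -= abs(rows[i][a] - rows[j][a]) / (m * k)
            for c, neighbours in misses.items():
                scale = prior[c] / (1.0 - prior[labels[i]])
                for j in neighbours:
                    weights[a] += scale * abs(rows[i][a] - rows[j][a]) / (m * k)
    return weights

# Factor 0 separates the two classes; factor 1 varies the same way in both.
rows = [[0.0, 0.0], [0.1, 1.0], [0.0, 0.5], [1.0, 0.0], [0.9, 1.0], [1.0, 0.5]]
labels = [0, 0, 0, 1, 1, 1]
weights = relieff(rows, labels)
```

A factor earns a high weight when it differs between classes but stays stable within a class.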

In the network security intrusion detection method based on factor analysis and subspace collaborative representation, step (C), extracting the N factors with the largest contribution weights and detecting the network data by using a subspace collaborative representation classification algorithm, comprises the following steps,

(C1) inputting training samples X (with C classes) and the corresponding labels B, a sample y to be tested and a parameter λ, and extracting the N factors with the largest contribution weights;

(C2) dividing the training samples X into C subsets according to class;

(C3) calculating the biased Tikhonov matrix Γ_{l,y} between the l-th class and the test sample y,

(C4) calculating the approximation of the test sample by the l-th class,

(C5) repeating steps (C3) and (C4) C times;

(C6) calculating the distance r_l between the approximation of each class and y, and taking the class with the smallest distance as the classification of y;

(C7) if the classification result is normal, monitoring the next piece of network data; if the classification result is abnormal, issuing a warning.

The biased Tikhonov matrix Γ_{l,y} is calculated as follows:

wherein x_1, x_2, …, x_n form the subspace X_l of the l-th class.

The approximation of the test sample is calculated as follows:

The distance between said approximation and the sample y is calculated as follows:
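The formulas for steps (C3) to (C6) are not reproduced in the text. The sketch below therefore follows the usual formulation of collaborative representation with a distance-weighted Tikhonov regularizer, which is an assumption: the diagonal of Γ_{l,y} holds the Euclidean distances from y to each training sample of class l, the coefficients solve (X_l^T X_l + λ Γ^T Γ) α = X_l^T y, the approximation is X_l α, and y is assigned to the class with the smallest residual. All function names and data are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    size = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * size
    for r in reversed(range(size)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, size))
        x[r] = (aug[r][size] - tail) / aug[r][r]
    return x

def class_residual(samples, y, lam):
    """Distance between y and its regularised approximation from one class."""
    n = len(samples)
    # Diagonal of the biased Tikhonov matrix: distance from y to each sample.
    gamma = [math.dist(y, x) for x in samples]
    gram = [[dot(samples[i], samples[j]) + (lam * gamma[i] ** 2 if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    alpha = solve(gram, [dot(x, y) for x in samples])
    approx = [sum(a * x[d] for a, x in zip(alpha, samples)) for d in range(len(y))]
    return math.dist(approx, y)

def classify(train, y, lam=0.1):
    """`train` maps class label -> list of training vectors; returns the closest class."""
    return min(train, key=lambda label: class_residual(train[label], y, lam))

train = {
    "normal": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "attack": [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]],
}
first = classify(train, [0.05, 0.05, 0.95])
second = classify(train, [0.95, 0.05, 0.0])
```

Weighting the regularizer by distance penalizes coefficients on training samples far from y, so each class is represented mainly by its samples nearest to the test sample.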

The invention has the following beneficial effects: the network security intrusion detection method based on factor analysis and subspace collaborative representation obtains security factors reflecting different security states of a network at a certain moment or within a certain time period, the security factors comprising TCP connection basic characteristics, TCP connection content characteristics, time-based network traffic statistical characteristics and host-based network traffic statistical characteristics. The factors are analyzed by 4 algorithms to obtain the contribution weight of each factor to network intrusion detection, the factors are ranked according to the mean of the contribution weights obtained by the 4 algorithms, the N factors with the largest contribution weights are extracted, and the subspace collaborative representation classification algorithm is used to detect the security state of the network.

Drawings

FIG. 1 is a flow chart of a network security intrusion detection method based on factor analysis and subspace collaborative representation according to the present invention;

FIG. 2 is a block diagram of the intrusion detection method for network security based on factor analysis and subspace collaborative representation according to the present invention.

Detailed Description

The invention will be further described with reference to the accompanying drawings.

As shown in FIG. 1, the network security intrusion detection method based on factor analysis and subspace collaborative representation of the present invention obtains security factors reflecting different security states of a network at a certain moment or within a certain time period, including TCP connection basic characteristics, TCP connection content characteristics, time-based network traffic statistical characteristics and host-based network traffic statistical characteristics. The factors are analyzed by 4 algorithms to obtain the contribution weight of each factor to network intrusion detection, the factors are ranked according to the mean of the contribution weights obtained by the 4 algorithms, the N factors with the largest contribution weights are extracted, and a subspace collaborative representation classification algorithm is used to detect the security state of the network. The method is fast and effective, ingenious in design and novel in concept, and comprises the following steps,

step (A), standardizing and normalizing network data;

In step (A), the network data comprises TCP connection basic characteristics, TCP connection content characteristics, time-based network traffic statistical characteristics and host-based network traffic statistical characteristics.

The aforementioned time-based network traffic statistics include factors,

The aforementioned host-based network traffic statistics include factors,

In step (A), standardizing the data comprises the following steps,

(A1) calculating the mean of the samples,

wherein X is the sample data;

(A2) calculating the standard deviation of the samples,

normalizing the data comprises the following steps,

(A4) calculating the minimum value of the samples, X_min = min{X'_ij};

(A5) calculating the maximum value of the samples, X_max = max{X'_ij};

(B5) the 4 contribution weights of each factor are averaged

(1) calculating the variance var (i) of each factor;

(2) sorting the obtained variances in a descending order;

(1) the feature matrix is denoted X and the class (label) vector is denoted Y;

(2) calculating the mutual information MI(i) between each feature vector x_i and Y;

(3) sorting the obtained mutual information values MI in descending order;

The mutual information MI is calculated as follows,

(1) the feature matrix is denoted X and the class (label) vector is denoted Y, where n is the number of samples and s is the number of features;

(2) performing Lasso regression on X and Y;

(3) sorting the obtained Lasso regression results in a descending order;

(1) the feature matrix is denoted X and the class (label) vector is denoted Y, where n is the number of samples and s is the number of features;

(2) calculating the weight W of each factor by using the ReliefF algorithm;

(3) sorting the obtained factor weights in descending order;

The weight W in step (2) is calculated as follows:

(1) randomly selecting a sample point R_i in X;

(2) finding the k nearest-neighbor samples H_j of the same class as R_i;

(5) repeating steps (1), (2), (3) and (4) m times.

(C4) calculating the approximation of the test sample by the l-th class,

(C5) repeating steps (C3) and (C4) C times;

obtaining the classification of y;

The biased Tikhonov matrix Γ_{l,y} is calculated as follows:

wherein x_1, x_2, …, x_n form the subspace X_l of the l-th class.

The approximation of the test sample is calculated as follows:

The distance between said approximation and the sample y is calculated as follows:

With only a limited set of network data factors available, the network security intrusion detection method based on factor analysis and subspace collaborative representation obtains the mean contribution weight of each factor by using four factor analysis methods, and then predicts whether the network is under attack by applying the subspace collaborative representation classification algorithm to the N factors with the largest contribution weights. In testing, the accuracy of the method was at least 97.6 percent.

The foregoing illustrates and describes the principles, general features, and advantages of the present invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above, which are described in the specification and illustrated only to illustrate the principle of the present invention, but that various changes and modifications may be made therein without departing from the spirit and scope of the present invention, which fall within the scope of the invention as claimed. The scope of the invention is defined by the appended claims and equivalents thereof.

Claims

1. A network security intrusion detection method based on factor analysis and subspace collaborative representation, characterized in that it comprises the following steps,

step (A), standardizing and normalizing network data;

step (B), performing factor analysis on the network data, obtaining the contribution weight of each factor by using four factor analysis algorithms, and calculating the contribution weight mean value of each factor;

2. The method for detecting network security intrusion based on factor analysis and subspace collaborative representation according to claim 1, wherein: step (A), the network data comprises TCP connection basic characteristics, TCP connection content characteristics, time-based network flow statistical characteristics and host-based network flow statistical characteristics;

the TCP connection base characteristics include the following factors: network connection duration, network protocol type, network service type of the target host, number of bytes of data from the source host to the target host, number of bytes of data from the target host to the source host;

the content characteristics of the TCP connection include the following factors: the number of failed login attempts, whether login was successful, and the number of "compromised" conditions that occurred;

the time-based network traffic statistics include the following factors: the number of connections having the same target host as the current connection in the past two seconds; the number of connections having the same service as the current connection in the past two seconds; among the connections having the same target host as the current connection in the past two seconds, the percentage of connections in which a "SYN" error occurred; among the connections having the same service as the current connection in the past two seconds, the percentage of connections in which a "SYN" error occurred; among the connections having the same target host as the current connection in the past two seconds, the percentage of connections in which a "REJ" error occurred; among the connections having the same service as the current connection in the past two seconds, the percentage of connections in which a "REJ" error occurred; among the connections having the same target host as the current connection in the past two seconds, the percentage of connections having the same service as the current connection; among the connections having the same target host as the current connection in the past two seconds, the percentage of connections having a different service from the current connection; among the connections having the same service as the current connection in the past two seconds, the percentage of connections having a different target host from the current connection;

the host-based network traffic statistics include the following factors: among the previous 100 connections, the number of connections having the same target host as the current connection; among the previous 100 connections, the percentage of connections having the same target host and the same service as the current connection; among the previous 100 connections, the percentage of connections having a different service from the current connection and the same source port as the current connection; among the previous 100 connections having the same target host as the current connection, the percentage of connections having a different source host from the current connection; among the previous 100 connections having the same target host as the current connection, the percentage of connections in which a "SYN" error occurred; among the previous 100 connections having the same target host and the same service as the current connection, the percentage of connections in which a "SYN" error occurred; among the previous 100 connections having the same target host as the current connection, the percentage of connections in which a "REJ" error occurred; among the previous 100 connections having the same target host and the same service as the current connection, the percentage of connections in which a "REJ" error occurred.

3. The method for detecting network security intrusion based on factor analysis and subspace collaborative representation according to claim 1, wherein: in step (A), standardizing the data comprises the following steps,

(A1) calculating the mean of the samples,

wherein X is the sample data;

(A2) calculating the standard deviation of the samples,

4. The method for detecting network security intrusion based on factor analysis and subspace collaborative representation according to claim 1, wherein: in step (A), normalizing the data comprises the following steps,

(A4) calculating the minimum value of the samples, X_min = min{X'_ij};

(A5) calculating the maximum value of the samples, X_max = max{X'_ij};

5. The method for detecting network security intrusion based on factor analysis and subspace collaborative representation according to claim 1, wherein: step (B), performing factor analysis on the network data, obtaining the contribution weight of each factor by using four factor analysis algorithms and calculating the mean contribution weight of each factor, comprises the following steps,

(B5) the 4 contribution weights for each factor are averaged.

6. The method for detecting network security intrusion based on factor analysis and subspace collaborative representation according to claim 5, wherein: the variance threshold filtering method of (B1) comprises the following steps,

(1) calculating the variance Var(i) of each factor;

(2) sorting the obtained variances in a descending order;

7. The method for detecting network security intrusion based on factor analysis and subspace collaborative representation according to claim 5, wherein: the mutual information-based feature selection method of (B2) comprises the following steps,

(1) the feature matrix is denoted X and the class (label) vector is denoted Y;

(2) calculating the mutual information MI(i) between each feature vector x_i and Y;

(3) sorting the obtained mutual information values MI in descending order;

(4) retaining the factors whose mutual information is greater than the threshold T as the result of feature selection;

the mutual information MI is calculated as follows,

8. The method for detecting network security intrusion based on factor analysis and subspace collaborative representation according to claim 5, wherein: the feature selection method based on Lasso regression of (B3) comprises the following steps,

(1) the feature matrix is denoted X and the class (label) vector is denoted Y, where n is the number of samples and s is the number of features;

(2) performing Lasso regression on X and Y;

(3) sorting the obtained Lasso regression results in a descending order;

9. The method for detecting network security intrusion based on factor analysis and subspace collaborative representation according to claim 5, wherein: the feature selection method based on the ReliefF algorithm of (B4) comprises the following steps,

(1) the feature matrix is denoted X and the class (label) vector is denoted Y, where n is the number of samples and s is the number of features;

(2) calculating the weight W of each factor by using the ReliefF algorithm;

(3) sorting the obtained factor weights in descending order;

(4) retaining the factors whose weight is greater than the threshold T as the result of feature selection;

the weight W in step (2) is calculated as follows:

(1) randomly selecting a sample point R_i in X;

(2) finding the k nearest-neighbor samples H_j of the same class as R_i;

(5) repeating steps (1), (2), (3) and (4) m times.

10. The method for detecting network security intrusion based on factor analysis and subspace collaborative representation according to claim 1, wherein: step (C), extracting the N factors with the largest contribution weights and detecting the network data by using a subspace collaborative representation classification algorithm, comprises the following steps,

(C1) inputting training samples X with C classes and the corresponding labels B, a sample y to be tested and a parameter λ, and extracting the N factors with the largest contribution weights;

(C4) calculating the approximation of the test sample by the l-th class,

(C5) repeating steps (C3) and (C4) C times;

obtaining the classification of y;

(C7) if the classification result is normal, monitoring the next piece of network data; if the classification result is abnormal, issuing a warning;

the biased Tikhonov matrix Γ_{l,y} is calculated as follows:

wherein x_1, x_2, …, x_n form the subspace X_l of the l-th class;

the approximation of the test sample is calculated as follows:

the distance between said approximation and the sample y is calculated as follows: