CN115396235A - Network attacker identification method and system based on hacker portrait

Info

Publication number: CN115396235A
Application number: CN202211308863.6A
Authority: CN
Inventors: 王建龙; 关乐嘉; 庄唯; 李姝�; 殷倩
Original assignee: Beijing Tianyun Sea Number Technology Co ltd
Current assignee: Beijing Tianyun Sea Number Technology Co ltd
Priority date: 2022-10-25
Filing date: 2022-10-25
Publication date: 2022-11-25
Anticipated expiration: 2042-10-25
Also published as: CN115396235B

Abstract

The invention discloses a network attacker identification method and system based on a hacker portrait. The method comprises the following steps: collecting network traffic data, and extracting traffic characteristics and time characteristics as traffic characteristic data; preprocessing the traffic characteristic data and inputting it into an abnormal traffic classifier constructed and trained on an SAE-BNN network model to obtain a normal or abnormal classification result; for abnormal traffic characteristic data, determining its attack attribute characteristics, and calculating the similarity of its data connection vector relative to the data connection vectors of hacker data with the same attack attribute characteristics to obtain a similarity characteristic; taking the traffic characteristics, the time characteristics, the attack attribute characteristics and the similarity characteristic as a hacker portrait, and calculating the similarity between this hacker portrait and the portraits with the same attack attribute characteristics in a hacker portrait library; and judging, according to the relation between the similarity and a preset threshold value, whether the hacker portrait corresponds to a known hacker. This technical scheme improves hacker identification efficiency and effectively improves the generalization capability and identification effect of the model.

Description

Network attacker identification method and system based on hacker portrait

Technical Field

The invention relates to the technical field of network security, and in particular to a network attacker identification method based on a hacker portrait and a network attacker identification system based on a hacker portrait.

Background

Cyberspace has become the "fifth-dimension space" after sea, land, air and outer space and a new battlefield on which countries compete, and cyberspace security has become an indispensable part of security planning. Building a safe cyberspace environment requires not only raising alarms and taking remedial measures against network attacks, but also confirming the identity of the attacker who carries them out, so that the attacker hidden behind the network is found and the problem is solved at its root.

Sending malicious traffic is a common form of network attack. According to research by NETSCOUT's ATLAS Security Engineering and Response Team, about 2.9 million Distributed Denial of Service (DDoS) attacks occurred in the first quarter of 2021, 31% more than in the same period of 2020, and the attacked industries include healthcare, education, online services and other fields. The team also predicts that DDoS attacks will continue to increase, reaching record numbers and becoming more and more complex.

Therefore, how to accurately identify malicious traffic in massive traffic data and find out hackers behind sending the malicious traffic and implementing an attack is an important task for maintaining network space security.

At present, in the intrusion detection research aiming at network traffic identification, network traffic data is mainly trained through a deep learning method, and a classifier is obtained to distinguish normality from abnormality.

Document [1] (mamingbai, chenwei, wu etiquette. Intrusion detection method based on CNN_BiLSTM network [J]. Computer Engineering and Applications, 2022, 58(10): 116-124.) first screens the features of the UNSW-NB15 data set with a random forest method, then extracts features in parallel with a CNN (Convolutional Neural Network) and a BiLSTM (Bi-directional Long Short-Term Memory), splices the features extracted by the two models, performs a second feature extraction with a self-attention mechanism, and classifies the resulting features with a Gated Recurrent Unit (GRU). Document [2] (facial brightness, ji Sha Bei, liu Wan, xie Jian Wu. Network intrusion detection based on GRU and feature embedding [J]. Journal of Applied Sciences, 2021, 39(04): 559-568.) proposes a network intrusion detection model based on GRU and feature embedding: the vectors mapped by a word embedding layer are constructed into continuous features so that time sequence information in the data can be extracted effectively, the features are then passed into a GRU layer, and the result is output through two fully connected layers. Document [3] (YANG H, WANG F. Wireless Network Intrusion Detection Based on Improved Convolutional Neural Network [J]. IEEE Access, 2019, 7: 64366-64374.) has the CNN save the convolution results at the second convolution, then performs convolution, pooling and full connection on them in parallel, and classifies with softmax after merging the features.

During training, too many feature dimensions increase the computational cost, and redundant features do not noticeably improve the classification result, so researchers have proposed feature selection methods to screen important features. Document [4] (wufeng. Intrusion detection system feature selection method based on improved pigeon group optimization algorithm [J]. University of southwest schoolwork (natural science edition), 2021, 46(05): 140-146.) selects features in the KDDCUP99 and UNSW-NB15 data sets with an improved pigeon group optimization algorithm, removes redundant features, and then classifies with a decision tree model, accelerating model convergence while preserving classification accuracy. Document [5] (SELVAKUMAR B, MUNEESWARAN K. Firefly algorithm based feature selection for network intrusion detection [J]. Computers & Security, 2019, 81.) reduces the dimensionality of traffic features with a firefly algorithm and trains a C4.5 decision tree and a Bayesian network model on the reduced data, obtaining higher detection accuracy. Document [6] (ALZUBI Q M, ANBAR M, ALQATTAN Z, et al. Intrusion detection system based on a modified binary grey wolf optimisation [J]. Neural Computing and Applications, 2020, 32.) proposes a feature selection algorithm based on a modified binary grey wolf optimization algorithm that removes redundant features and retains only some key features; simulation results show that the detection model with feature selection outperforms the model without feature selection in both time and accuracy.

The feature selection method just screens out different features, cannot further extract implicit relations of the features, and cannot form high-level reconstruction features capable of effectively expressing the labels.

Methods based on machine learning and methods based on deep learning can only identify whether behavior is normal or an attack; they cannot find the attacker behind the attack. Researchers have therefore applied user profiling techniques to the identification of network attackers. Document [7] (Hongfei, liao light faithful. Hacker portrait early warning model based on K-medoids clustering [J]. Computer Engineering and Design, 2021, 42(05): 1244-1249.) constructs hacker portraits by extracting hacker behavior characteristics from security log data, clusters the portraits with a K-medoids clustering method to build hacker group portraits, analyzes the characteristics of each cluster, and gives corresponding defense measures for the different attack means. Document [8] (Zhao just, yaohong. Kernel. Abnormal behavior detection model based on user portrait [J]. Information Network Security, 2017(7): 18-24.) constructs user portraits from both the attributes and the behaviors of users and proposes an intrusion detection model based on the user portrait. Document [9] (yellow aspiration macro, zhang wave. Attacker portrait construction based on big data and graph community clustering algorithm [J]. Computer Application Research, 2021, 38(01): 232-236.) proposes a method for constructing hacker portraits based on big data stream analysis technology and the Louvain community discovery algorithm (Big Data Stream Analysis and Louvain, BDSAL), which can quickly form unified attack events from massive, multi-source, heterogeneous data, construct a hacker portrait that accurately depicts hacker information, and discover attackers.

However, the research based on hacker portraits uses the constructed portrait to identify whether a user is an attacker, i.e. a hacker, rather than to determine the identity of the hacker, and it lacks an efficient hacker identification method, so hackers cannot be identified quickly.

If only intrusion detection is performed on the traffic, normal traffic and abnormal traffic can be distinguished, but the hacker users who send the malicious traffic cannot be found. If the identity of the hacker is recognized only through the hacker portrait, the time consumption is huge, because in a real scenario the volume of normal traffic is much larger than that of abnormal traffic, and normal traffic does not need to be identified at all. Intrusion detection and hacker portrait technology should therefore be combined: intrusion detection is first used to screen out the abnormal traffic in the network, and the abnormal traffic is then analyzed, its features are extracted, and it is identified in order to find the hacker user who sent it.

However, at present, no research combining the two methods exists, and the identity of a hacker sending abnormal traffic cannot be efficiently and accurately identified.

Disclosure of Invention

Aiming at the problems, the invention provides a network attacker identification method and a network attacker identification system based on a hacker portrait, wherein a hacker portrait is formed by extracting a hacker attribute label, a flow characteristic label, a time characteristic label and a connection similarity label of a hacker, an initial hacker portrait library is established, then characteristic extraction is carried out on abnormal flow data to be identified, a connection similarity label is extracted by analyzing the similarity between hackers with the same attack type, and on the basis, a user portrait is established for an identified abnormal flow user, so that the hacker is accurately depicted and matched with the hacker in the initial portrait library according to the corresponding abnormal type, the hacker identification efficiency is improved, and an abnormal flow classifier based on an SAE-BNN network model can effectively improve the generalization capability and the identification effect of the model.

In order to achieve the above object, the present invention provides a network attacker identification method based on a hacker portrait, comprising:

collecting network traffic data, and extracting traffic characteristics and time characteristics of the network traffic data as traffic characteristic data;

preprocessing the flow characteristic data, inputting the flow characteristic data into an abnormal flow classifier constructed and trained based on an SAE-BNN network model, coding the flow characteristic data by a sparse self-encoder SAE added with a hidden layer neuron with sparsity limitation, taking the characteristic data after coding reconstruction as the input of a Bayesian neural network BNN, obtaining a classification result prediction score by forward propagation calculation and a one-dimensional full connection layer, and obtaining a normal or abnormal classification result by softmax function classification;

aiming at abnormal flow characteristic data, determining attack attribute characteristics corresponding to the abnormal flow characteristic data, and calculating the similarity of a data connection vector corresponding to the abnormal flow characteristic data relative to a data connection vector of hacker data conforming to the current attack attribute characteristics to obtain similarity characteristics;

taking the flow characteristic, the time characteristic, the attack attribute characteristic and the similarity characteristic corresponding to the abnormal flow characteristic data as a hacker portrait, and calculating the similarity of the hacker portrait relative to the same attack attribute characteristic portrait in a pre-constructed hacker portrait library;

if the similarity between the hacker portrait and a certain portrait is higher than a preset threshold value, judging the hacker corresponding to that portrait to be the attacking hacker of the current abnormal flow characteristic data;

and if the similarity between the hacker portrait and all the portraits in the hacker portrait library is not higher than the preset threshold value, adding the hacker portrait to the hacker portrait library.
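The matching and library-update steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each portrait has already been encoded as a numeric vector, uses cosine similarity as the portrait similarity measure, and uses a threshold of 0.9. The names `match_or_register`, `library` and the generated identifiers are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_or_register(portrait, attack_type, library, threshold=0.9):
    """Compare a portrait only with library portraits of the same attack type.

    `library` maps attack type -> list of (hacker_id, portrait_vector).
    Returns (hacker_id, is_new). The similarity measure and the threshold
    value are assumptions made for illustration.
    """
    candidates = library.setdefault(attack_type, [])
    best_id, best_sim = None, -1.0
    for hacker_id, known in candidates:
        sim = cosine_similarity(portrait, known)
        if sim > best_sim:
            best_id, best_sim = hacker_id, sim
    if best_id is not None and best_sim > threshold:
        return best_id, False  # matched a known hacker
    # No portrait exceeded the threshold: register a new portrait.
    new_id = f"{attack_type}-{len(candidates) + 1}"
    candidates.append((new_id, list(portrait)))
    return new_id, True
```

Restricting the comparison to portraits with the same attack attribute keeps the number of similarity computations small, which is the efficiency argument made in the text.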

In the above technical solution, preferably, the specific process of preprocessing the flow characteristic data comprises:

converting the flow characteristic data into numerical characteristic data;

carrying out standardization processing on the flow characteristic data by adopting a mean variance normalization method so that the characteristic range of the flow characteristic data is in the same preset interval;

and carrying out class imbalance data processing on the flow characteristic data by adopting a Borderline-smote oversampling algorithm.

In the above technical solution, preferably, the method for constructing and training the abnormal flow classifier comprises:

the abnormal flow classifier comprises the sparse self-encoder SAE and the Bayesian neural network BNN, data is input by the sparse self-encoder SAE, the sparse self-encoder SAE comprises a hidden layer neuron added with sparsity limitation, reconstructed feature data output by the sparse self-encoder SAE is used as the input of the Bayesian neural network BNN, and the Bayesian neural network BNN comprises a forward propagation calculation layer, a one-dimensional full-connection layer and a softmax function layer;

and respectively taking the abnormal flow characteristic sample data and the corresponding normal or abnormal classification result as the input and the output of the abnormal flow classifier, and training the abnormal flow classifier until the classification result loss value loss of the abnormal flow classifier reaches a convergence threshold value.

In the above technical solution, preferably, the specific process of calculating the similarity between the data connection vector of the abnormal traffic characteristic data and the data connection vector of the hacker data conforming to the current attack attribute characteristic includes:

recording a source IP, a source port, a destination IP and a destination port corresponding to the abnormal flow characteristic data to form a data connection vector corresponding to the abnormal flow characteristic data;

acquiring data connection vectors of all abnormal network traffic sample data with the same attack attribute characteristics as the abnormal traffic characteristic data;

and calculating cosine similarity of the data connection vector corresponding to the abnormal flow characteristic data and the data connection vector of each abnormal flow characteristic sample data, summing and averaging all the cosine similarity, calculating total connection similarity, and obtaining the similarity characteristic according to the total connection similarity.
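The connection similarity computation above can be sketched as follows. This is an illustrative sketch that assumes the source IP, source port, destination IP and destination port have already been converted to numbers; the function names are hypothetical, and the text does not specify how the final similarity characteristic is derived from the total connection similarity, so the averaged value is returned directly.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def total_connection_similarity(conn, same_type_conns):
    """Average cosine similarity between one connection vector and the
    connection vectors of all samples with the same attack attribute.

    Each vector is (src_ip, src_port, dst_ip, dst_port) in numeric form
    (the numeric encoding is an assumption of this sketch).
    """
    if not same_type_conns:
        return 0.0
    sims = [cosine_similarity(conn, other) for other in same_type_conns]
    return sum(sims) / len(sims)
```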

In the above technical solution, preferably, the pre-constructing process of the hacker profile library includes:

classifying the abnormal flow characteristic sample data according to attack attribute characteristics, extracting the flow characteristics and time characteristics of each piece of abnormal flow characteristic sample data, and calculating the corresponding similarity characteristics of each piece of abnormal flow characteristic sample data;

and the traffic characteristic, the time characteristic, the attack attribute characteristic and the similarity characteristic corresponding to each abnormal traffic characteristic sample data are used as hacker portraits of hackers corresponding to the current abnormal traffic characteristic sample data and are added into the hacker portraits database.

The invention also provides a network attacker identification system based on a hacker portrait, which applies the network attacker identification method based on a hacker portrait disclosed in any one of the above technical schemes and comprises:

the data feature extraction module is used for collecting network traffic data and extracting traffic features and time features of the network traffic data as traffic feature data;

the abnormal data classification module is used for preprocessing the flow characteristic data, inputting the flow characteristic data into an abnormal flow classifier constructed and trained based on an SAE-BNN network model, coding the flow characteristic data through a sparse self-encoder SAE added with a hidden layer neuron with sparsity limitation, taking the coded and reconstructed characteristic data as the input of a Bayesian neural network BNN, obtaining a classification result prediction score through forward propagation calculation and a one-dimensional full-link layer, and obtaining a normal or abnormal classification result through softmax function classification;

the data characteristic calculation module is used for determining the attack attribute characteristics corresponding to the abnormal flow characteristic data aiming at the abnormal flow characteristic data, and calculating the similarity of the data connection vector corresponding to the abnormal flow characteristic data relative to the data connection vector of the hacker data according with the current attack attribute characteristics to obtain the similarity characteristics;

the abnormal data matching module is used for taking the flow characteristic, the time characteristic, the attack attribute characteristic and the similarity characteristic corresponding to the abnormal flow characteristic data as a hacker portrait and calculating the similarity of the hacker portrait relative to the same attack attribute characteristic portrait in a pre-constructed hacker portrait library;

the hacker identification module is used for judging the hacker corresponding to a certain portrait to be the attacking hacker of the current abnormal flow characteristic data when the similarity between the hacker portrait and that portrait is higher than a preset threshold value, and for adding the hacker portrait to the hacker portrait library when the similarity between the hacker portrait and all the portraits in the hacker portrait library is not higher than the preset threshold value.

In the above technical solution, preferably, the specific process by which the abnormal data classification module preprocesses the flow characteristic data comprises:

converting the flow characteristic data into numerical characteristic data;

In the above technical solution, preferably, the method for constructing and training the abnormal traffic classifier includes:

the abnormal flow classifier comprises the sparse self-encoder SAE and the Bayesian neural network BNN, data is input by the sparse self-encoder SAE, the sparse self-encoder SAE comprises a hidden layer neuron added with sparsity limitation, reconstructed feature data output by the sparse self-encoder SAE is used as input of the Bayesian neural network BNN, and the Bayesian neural network BNN comprises a forward propagation calculation layer, a one-dimensional full connection layer and a softmax function layer;

and training the abnormal flow classifier by using the abnormal flow characteristic sample data and the corresponding normal or abnormal classification result as the input and the output of the abnormal flow classifier respectively until the classification result loss value loss of the abnormal flow classifier reaches a convergence threshold value.

In the above technical solution, preferably, the specific process of the data feature calculation module calculating the similarity of the data connection vector of the abnormal traffic feature data with respect to the data connection vector of the hacker data conforming to the current attack attribute feature includes:

acquiring data connection vectors of all abnormal traffic characteristic sample data with the same attack attribute characteristics as the abnormal traffic characteristic data;

calculating cosine similarity of the corresponding data connection vector of the abnormal flow characteristic data and the data connection vector of each abnormal flow characteristic sample data, summing and averaging all the cosine similarity, calculating total connection similarity, and obtaining the similarity characteristic according to the total connection similarity.

In the above technical solution, preferably, the pre-constructing process of the hacker portrait library includes:

and the flow characteristic, the time characteristic, the attack attribute characteristic and the similarity characteristic corresponding to each piece of abnormal flow characteristic sample data are used as hacker portraits of hackers corresponding to the current abnormal flow characteristic sample data and are added into the hacker portraits library.

Compared with the prior art, the invention has the beneficial effects that: the hacker attribute labels, the flow characteristic labels, the time characteristic labels and the connection similarity labels of hackers are extracted to jointly form a hacker portrait, an initial hacker portrait library is established, then characteristic extraction is carried out on abnormal flow data to be identified, the connection similarity labels are extracted by analyzing the similarity between hackers of the same attack type, on the basis, a user portrait is constructed on identified abnormal flow users, accordingly, accurate depiction is carried out on the hackers, the hackers are matched with the hackers in the initial portrait library according to the corresponding abnormal types, hacker identification efficiency is improved, and the generalization capability and the identification effect of the models can be effectively improved through an abnormal flow classifier based on an SAE-BNN network model.

Drawings

FIG. 1 is a flowchart illustrating a network attacker identification method based on a hacker profile according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a model structure of an abnormal traffic classifier based on an SAE-BNN network model according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating an example of a hacker portrait according to one embodiment of the disclosure;

FIG. 4 is a schematic diagram of a process for constructing an initial hacker profile library according to one embodiment of the disclosure;

FIG. 5 is a block diagram of a hacker profile-based network attacker identification system according to an embodiment of the present invention.

In the drawings, the correspondence between each component and the reference numeral is:

1. data feature extraction module; 2. abnormal data classification module; 3. data feature calculation module; 4. abnormal data matching module; 5. attacking hacker identification module.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

The invention is described in further detail below with reference to the following drawings:

as shown in fig. 1, a network attacker identification method based on hacker portrayal provided by the present invention includes:

collecting network flow data, and extracting flow characteristics and time characteristics of the network flow data as flow characteristic data;

preprocessing flow characteristic data, inputting the flow characteristic data into an abnormal flow classifier constructed and trained based on an SAE-BNN network model, coding the flow characteristic data by a sparse self-encoder SAE added with a hidden layer neuron with sparsity limitation, taking the characteristic data after coding reconstruction as the input of a Bayesian neural network BNN, obtaining a classification result prediction score by forward propagation calculation and a one-dimensional full connection layer, and obtaining a normal or abnormal classification result by softmax function classification;

aiming at the abnormal flow characteristic data, determining the attack attribute characteristics corresponding to the abnormal flow characteristic data, and calculating the similarity of the data connection vector corresponding to the abnormal flow characteristic data relative to the data connection vector of the hacker data according with the current attack attribute characteristics to obtain the similarity characteristics;

if the similarity between the hacker portrait and a certain portrait is higher than a preset threshold value, judging the hacker corresponding to that portrait to be the attacking hacker of the current abnormal flow characteristic data;

and if the similarity between the hacker portrait and all the portraits in the hacker portrait library is not higher than a preset threshold value, adding the hacker portrait to the hacker portrait library.

In the embodiment, a hacker portrait is formed by extracting a hacker attribute tag, a flow characteristic tag, a time characteristic tag and a connection similarity tag of a hacker together, an initial hacker portrait library is established, then characteristic extraction is carried out on abnormal flow data to be identified, and the connection similarity tag is extracted by analyzing the similarity between hackers of the same attack type, on the basis, a user portrait is constructed on an identified abnormal flow user, so that the hacker is accurately depicted, and is matched with the hacker in the initial portrait library according to the corresponding abnormal type, thereby improving the hacker identification efficiency.

Specifically, firstly, collecting network flow data, performing feature extraction and preprocessing on the data, classifying the data to obtain abnormal flow feature data, constructing a hacker portrait of the abnormal flow feature data, matching the hacker portrait with a corresponding hacker portrait of the same type in an initial hacker portrait library, determining the identity of an unknown hacker according to a matching result, and otherwise, constructing a new hacker portrait of the hacker and updating the hacker portrait library.

The abnormal flow classifier which can accurately classify different abnormal types of network flow data can be obtained by training the SAE-BNN network model, and the classification effect of the network flow data is effectively improved.

In the embodiment of the invention, the UNSW-NB15 data set is selected for experiments on the network attacker identification method. The data set comprises 2.54 million records, 49 features and 9 attack types; normal data accounts for 88 percent and attack data for 12 percent, as detailed in the following table:

TABLE 1 UNSW-NB15 dataset

By summarizing the existing literature and analyzing in depth the traffic sent by network hackers, the traffic characteristics and time characteristics of the network traffic data are extracted as follows:

flow characteristic extraction: the flow characteristics specifically include four characteristics of destination port, destination IP, byte number and protocol type.

The destination ports and destination IPs of a normal user are diverse, whereas a hacker carrying out an attack may continuously target a certain IP or port, so the hacker's destination port and destination IP tend to be fixed and differ from those of normal users. The destination IP and destination port thus reflect the difference between normal and abnormal users from the perspective of the traffic receiver. A normal user performs varied operations, so the number of bytes varies, whereas an attacker sometimes repeats the same attack operation continuously, so the number of bytes varies little. The protocol types on which different attack means depend also differ. The number of bytes and the protocol type thus reflect the difference between normal and abnormal users from the perspective of the traffic sender.

Time characteristic extraction: the time characteristics specifically include traffic arrival time and traffic duration, and the time characteristics can represent the difference between normal users and hackers and between different hackers in the time dimension.

Traffic from normal users generally shows no obvious regularity; its types are diverse and it generally conforms to the network traffic distribution rule. Attack traffic from hackers, by contrast, may be concentrated in a certain time period and is generally of the same type, which does not conform to the network traffic distribution rule. On this basis, normal traffic can be distinguished from attack traffic.

In the collected network traffic data, the attack data is much less than the normal data, and part of the data is non-numerical data, so that the data needs to be preprocessed, and preferably, the specific process of preprocessing the traffic characteristic data includes:

converting the flow characteristic data into numerical characteristic data;

and carrying out class imbalance data processing on the flow characteristic data by adopting a Borderline-SMOTE oversampling algorithm.

Specifically, in the implementation process, the processing mode is as follows:

(1) Data type conversion: some of the extracted features are not numerical, so this part of the data needs to be converted.

For the destination IP, the string-type IP is converted into integer data. An IP address comprises four bytes, each with a value range of 0-255. Each byte is written as a two-digit hexadecimal number, padded with a leading 0 if it has fewer than two digits; the four hexadecimal numbers are concatenated into an 8-digit hexadecimal number, which is then converted into a decimal number, namely the numerical representation of the IP address. For example, for the IP address 192.163.88.5, the bytes are c0, a3, 58 and 05 in hexadecimal; concatenating them gives the hexadecimal number c0a35805, which converts to the decimal number 3231930373, i.e., the numeric IP address.
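The IP conversion described above can be sketched in a few lines of Python. The function name `ip_to_int` is hypothetical; the steps follow the text (two-digit hexadecimal per byte, concatenation, conversion to decimal).

```python
def ip_to_int(ip: str) -> int:
    """Convert a dotted IPv4 string to its decimal integer representation."""
    parts = ip.split(".")
    if len(parts) != 4:
        raise ValueError(f"not an IPv4 address: {ip!r}")
    hex_digits = ""
    for part in parts:
        value = int(part)
        if not 0 <= value <= 255:
            raise ValueError(f"byte out of range in {ip!r}")
        # Two-digit hexadecimal, padded with a leading 0 when needed.
        hex_digits += format(value, "02x")
    # The 8-digit hexadecimal string is read as one base-16 number.
    return int(hex_digits, 16)


print(ip_to_int("192.163.88.5"))  # 3231930373
```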

For the protocol type, the different protocols are numbered in order of first appearance, and the sequence number of each protocol is its discrete numerical feature.
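A minimal sketch of this protocol encoding follows. The function name `encode_protocols` is hypothetical, and starting the sequence numbers at 1 is an assumption; the text only states that protocols are numbered by order of appearance.

```python
def encode_protocols(protocols):
    """Map each protocol name to a sequence number by order of first appearance.

    Returns the encoded list and the protocol -> number mapping.
    Numbering from 1 is an assumption of this sketch.
    """
    mapping = {}
    encoded = []
    for name in protocols:
        if name not in mapping:
            mapping[name] = len(mapping) + 1
        encoded.append(mapping[name])
    return encoded, mapping
```

The same mapping would need to be reused when encoding new traffic, so that a protocol always receives the same number.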

(2) Data normalization: the value ranges of the converted features differ too widely. For example, a converted IP address can be a very large number (3231930373 in the example above), while a duration may be only a few milliseconds. When feature ranges differ this much, the classification result and accuracy are noticeably affected, so the features need to be brought into the same interval. The invention uses mean-variance normalization so that each feature is rescaled to a mean of 0 and a variance of 1; the formula of the mean-variance normalization is as follows:

$$x_{scale} = \frac{x - \mu}{S}$$

where $x$ is the value to be normalized, $x_{scale}$ is the normalized value, $\mu$ is the mean, and $S$ is the standard deviation.
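By way of illustration, mean-variance normalization can be applied column by column with NumPy as sketched below. The function name `standardize`, the guard for constant columns and the sample values are assumptions added for the example.

```python
import numpy as np


def standardize(features: np.ndarray) -> np.ndarray:
    """Rescale each column to mean 0 and variance 1: (x - mu) / S."""
    mu = features.mean(axis=0)
    s = features.std(axis=0)
    # Assumption: a constant column is left at 0 rather than dividing by 0.
    s = np.where(s == 0, 1.0, s)
    return (features - mu) / s


# Hypothetical rows: [numeric IP address, duration in seconds]
X = np.array([[3231930373.0, 0.002],
              [3231930400.0, 0.150],
              [3232235777.0, 0.010]])
Z = standardize(X)
print(Z.mean(axis=0), Z.std(axis=0))
```

After the transformation both columns lie on a comparable scale, regardless of their original magnitudes.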

(3) Processing the class imbalance data:

Since attack data in the data set is far scarcer than normal data, there is a serious class imbalance problem. A classifier trained on unbalanced data biases its predictions toward the label with more samples, so the classification accuracy it reports is misleadingly high. Common methods for solving data imbalance are oversampling and undersampling. Undersampling brings the numbers of positive and negative examples close by removing majority-class samples, but the removed data may include samples that strongly influence the classification result, so the classifier loses information that is important for the majority class. Oversampling brings the numbers of positive and negative examples close by adding minority-class samples until balance is reached. To avoid the drop in classification accuracy that information loss may cause, the invention uses oversampling to solve the class imbalance problem.

Specifically, the method performs class-imbalance processing on the data set with the borderline-smote oversampling algorithm. The borderline-smote method is an improvement on the smote method. The smote algorithm adds minority-class samples in the following basic steps:

1) Calculating the Euclidean distance from each minority-class sample to all other minority-class samples to obtain its k nearest neighbors;

2) According to the sampling rate N, for each minority-class sample x, randomly selecting several samples y from its k nearest neighbors;

3) For each sample x and its neighboring sample y, a new sample is synthesized by the following equation:

$$x_{new} = x + rand(0, 1) \times |x - y|$$

where $x_{new}$ is the synthesized sample point, rand(0, 1) is a random value between 0 and 1, and |x - y| is the distance between the two points.
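The three smote steps above can be sketched as follows, purely as an illustration. The sketch interpolates along the segment from x to y, i.e. x + r (y - x) with r drawn from [0, 1), which is the usual smote form and is assumed here to be what the synthesis step intends. The function name `smote`, its parameters and the sample points are hypothetical.

```python
import numpy as np


def smote(minority, k=5, n_per_sample=1, seed=None):
    """Synthesize n_per_sample new points for every minority-class sample."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, n - 1)
    # Step 1: Euclidean distances between minority samples, k nearest neighbors.
    diff = minority[:, None, :] - minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]
    synthetic = []
    for i in range(n):
        # Step 2: randomly pick neighbors y of sample x (with replacement).
        for j in rng.choice(neighbors[i], size=n_per_sample):
            # Step 3: place the new point between x and y.
            r = rng.random()
            synthetic.append(minority[i] + r * (minority[j] - minority[i]))
    return np.array(synthetic)


points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote(points, k=2, n_per_sample=3, seed=0)
print(new_points.shape)  # (12, 2)
```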

The smote algorithm does not take the surrounding samples into account. If most of the surrounding samples belong to the minority class, the synthesized sample carries little new information; if most of them belong to the majority class, the synthesized sample point may be noise and affect the classification result.

The Borderline-smote method divides the minority-class samples into three types: Safe (more than half of the surrounding samples belong to the minority class), Danger (more than half of the surrounding samples belong to the majority class, i.e., boundary sample points) and Noise (all of the surrounding samples belong to the majority class). It randomly selects only from the Danger samples and synthesizes new samples with the smote method. Compared with the smote method, Borderline-smote synthesizes new minority-class samples only at the boundary, so the synthesized minority-class points are distributed more reasonably and the added minority-class samples are more accurate.
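For illustration only, the Safe / Danger / Noise division can be sketched as below. The sketch assumes that the "surrounding samples" are the k nearest neighbors taken over the whole data set, and that a tie (possible only for even k) is treated as Safe. The function name `borderline_categories` and the sample coordinates are hypothetical.

```python
import numpy as np


def borderline_categories(X, y, minority_label=1, k=5):
    """Label every minority sample as 'safe', 'danger' or 'noise'."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    categories = {}
    for i in np.flatnonzero(y == minority_label):
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf  # exclude the sample itself
        nearest = np.argsort(dist)[:k]
        n_major = int(np.sum(y[nearest] != minority_label))
        if n_major == k:
            categories[int(i)] = "noise"   # all neighbors are majority class
        elif 2 * n_major > k:
            categories[int(i)] = "danger"  # more than half are majority class
        else:
            categories[int(i)] = "safe"    # assumption: ties count as safe
    return categories


X = [[0, 0], [0, 1], [1, 0], [1, 1], [4, 4], [4, 5],       # majority (0)
     [0.5, 0.5],                                           # minority in majority cluster
     [10, 10], [10, 11], [11, 10], [11, 11],               # minority cluster
     [5, 4], [5, 3]]                                       # minority at the boundary
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
print(borderline_categories(X, y, k=3))
```

Only the samples labeled `danger` would then be passed to the smote synthesis step.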

The borderline-smote oversampling method increases the small amount of negative data in the network flow data so that the positive and negative samples are balanced, which improves the classification effect of the model.

As shown in fig. 2, in the above embodiment, preferably, the method for constructing and training the abnormal traffic classifier includes:

the abnormal flow classifier comprises a sparse self-encoder SAE and a Bayesian neural network BNN. The preprocessed flow characteristic data is input to the sparse self-encoder SAE, whose hidden-layer neurons are subject to a sparsity limitation; preferably, the SAE comprises two hidden layers. Encoding by the SAE yields reconstructed features, which are a dimension-reduced, efficient representation of the original data.

Then, the reconstructed feature data output by the SAE encoding is converted into distribution form and used as the input of the Bayesian neural network BNN. The BNN comprises a forward propagation calculation layer, a one-dimensional full-connection layer and a softmax function layer: forward propagation is performed on the input data, the hidden layer abstracts the features into data that better separates the different classes, a standard one-dimensional full-connection layer (dense layer) produces the prediction score of the classification result, and a softmax function classifies the flow data and outputs the classification result (normal flow or attack flow).

In the abnormal flow classifier, preliminary features are first extracted from the network flow data, the features are then encoded by the SAE to further optimize their representation, and finally the BNN is trained on the features extracted by the SAE, which can effectively improve the generalization capability and recognition effect of the model. The SAE works in a way similar to the human brain, which can complete a specified action by stimulating only certain neurons: adding a sparsity limitation to the hidden-layer neurons of the encoder suppresses most of their outputs. Extracting features through the SAE therefore compresses the effective information in the input and extracts the important features. In addition, the weight parameters of the BNN are random variables. Unlike a traditional neural network, which fits label values through a loss function, the BNN fits a posterior distribution and can learn a predictive distribution, which improves the robustness and generalization capability of the network model. Combining the two improves abnormal flow identification while extracting lower-dimensional features, so that each type of abnormal flow is identified well.
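The data path through the classifier can be illustrated with the following simplified NumPy sketch of a forward pass. It is not the trained model of the invention: the weights are random, the sparsity limitation is assumed to take the common KL-divergence form with a target activation `rho`, and the BNN is reduced to a single dense layer whose weights are sampled from independent Gaussians and averaged over several draws. All function names, layer sizes and parameter values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def kl_sparsity(hidden, rho=0.05):
    """Assumed sparsity penalty: KL(rho || mean activation), summed over units."""
    rho_hat = np.clip(hidden.mean(axis=0), 1e-8, 1 - 1e-8)
    return float(np.sum(rho * np.log(rho / rho_hat)
                        + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))))


def sae_encode(x, layers):
    """Pass the input through the encoder's hidden layers."""
    h = x
    for W, b in layers:
        h = sigmoid(h @ W + b)
    return h


def bnn_predict(h, w_mu, w_sigma, b, n_samples=50):
    """Average softmax outputs over sampled weights of one Bayesian dense layer."""
    probs = np.zeros((h.shape[0], w_mu.shape[1]))
    for _ in range(n_samples):
        W = w_mu + w_sigma * rng.standard_normal(w_mu.shape)
        probs += softmax(h @ W + b)
    return probs / n_samples


x = rng.standard_normal((4, 8))  # 4 preprocessed records, 8 features
encoder = [(0.1 * rng.standard_normal((8, 6)), np.zeros(6)),
           (0.1 * rng.standard_normal((6, 3)), np.zeros(3))]  # two hidden layers
h = sae_encode(x, encoder)
p = bnn_predict(h, rng.standard_normal((3, 2)), 0.1 * np.ones((3, 2)), np.zeros(2))
print(p.shape, kl_sparsity(h) >= 0.0)
```

Each row of `p` holds the averaged probabilities for the two classes (normal, attack); during training, a penalty such as `kl_sparsity` would be added to the reconstruction loss of the encoder.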

Based on this network model structure, the abnormal traffic classifier is trained with the abnormal traffic feature sample data as input and the corresponding normal or abnormal classification result as output, until the loss value of the classification result reaches the convergence threshold. The label of one type of attack traffic is marked as 1, and all remaining data is marked as 0 and treated as normal data, so that a classifier is constructed for that type of abnormal traffic. Each type of abnormal traffic is trained in this way, yielding abnormal traffic classifiers that can identify the different types.
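The per-type labeling described above (one attack type marked 1, everything else marked 0) can be illustrated with the following sketch. The function names and the attack type names "dos" and "probe" are hypothetical.

```python
def one_vs_rest_labels(attack_types, target):
    """Mark the target attack type as 1 and all other records as 0."""
    return [1 if t == target else 0 for t in attack_types]


def build_label_sets(attack_types, normal_label="normal"):
    """Build one binary label list per attack type, one classifier each."""
    targets = sorted(set(attack_types) - {normal_label})
    return {t: one_vs_rest_labels(attack_types, t) for t in targets}


types = ["normal", "dos", "probe", "dos", "normal"]
print(build_label_sets(types))
# {'dos': [0, 1, 0, 1, 0], 'probe': [0, 0, 1, 0, 0]}
```

Each label list would then be paired with the same feature data to train the classifier for that attack type.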

The preprocessed flow characteristic data is then input into the trained abnormal flow classifier to obtain a normal or abnormal classification result.

As shown in fig. 3 and 4, when the initial portrait library is constructed from abnormal traffic characteristic data, the attack type can be obtained directly from the training data as the attack attribute characteristic; for abnormal flow characteristic data to be identified, a classifier is constructed for the data of each attack type, and the data is passed through the classifiers in sequence to obtain its attack type as the attack attribute characteristic.

Further preferably, the destination port ratio is calculated by:

$$\text{destination port ratio} = \frac{S}{D}$$

where $S$ is the number of attack data records with the same destination port under the same attack type, and $D$ is the number of data records of that attack type.

Different hackers usually differ markedly in destination port ratio. For example, when a hacker scans a single port, the same port appears many times, so the destination port ratio is large; when multiple ports are scanned, the ratio is smaller. The destination port ratio can therefore effectively distinguish different kinds of hackers.
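As an illustration of the destination port ratio, the sketch below computes S / D for every record of one attack type. It assumes that S counts the records sharing the destination port of the record in question; the function name and the port lists are hypothetical.

```python
from collections import Counter


def destination_port_ratios(ports):
    """ports: destination ports of all records of one attack type (D = len)."""
    total = len(ports)
    counts = Counter(ports)
    # S for a record is the number of records with the same destination port.
    return [counts[p] / total for p in ports]


single_port_scan = [80] * 9 + [443]
multi_port_scan = [21, 22, 23, 25, 80]
print(destination_port_ratios(single_port_scan)[0])  # 0.9
print(destination_port_ratios(multi_port_scan)[0])   # 0.2
```

The single-port scan yields a large ratio and the multi-port scan a small one, in line with the distinction drawn in the text.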

Further, the specific calculation process of the connection similarity includes:

recording the source IP, source port, destination IP and destination port corresponding to one piece of abnormal flow characteristic data to form the data connection vector corresponding to that data, denoted $A$;

obtaining the data connection vectors of all abnormal network flow sample data with the same attack attribute characteristic as the abnormal flow characteristic data, denoted $B_i$ ($i = 1, \dots, N$);

calculating the cosine similarity between the data connection vector of the abnormal flow characteristic data and the data connection vector of each piece of abnormal flow characteristic sample data according to the following formula, to obtain the connection similarity $sim_i$ of $A$ and $B_i$:

$$sim_i = \frac{A \cdot B_i}{\|A\| \, \|B_i\|}$$

summing and averaging all $N$ cosine similarities according to the following formula to obtain the total connection similarity $sim$, and obtaining the similarity characteristic from the total connection similarity:

$$sim = \frac{1}{N} \sum_{i=1}^{N} sim_i$$
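The connection similarity calculation can be sketched as follows, for illustration only. Each data connection vector is assumed to hold the numeric source IP, source port, numeric destination IP and destination port, in that order; the function names and sample values are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between two data connection vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Assumption: a zero vector is treated as having similarity 0.
    return float(a @ b / denom) if denom else 0.0


def connection_similarity(vector, sample_vectors):
    """Average of the cosine similarities to all N sample vectors."""
    sims = [cosine_similarity(vector, s) for s in sample_vectors]
    return sum(sims) / len(sims)


# [source IP, source port, destination IP, destination port]
current = [3231930373, 51000, 3232235777, 80]
samples = [[3231930373, 51002, 3232235777, 80],
           [3231930400, 40000, 3232235777, 443]]
print(connection_similarity(current, samples))
```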

Similarity analysis of the malicious traffic yields similarity characteristics that can be used to distinguish different types of hackers. Moreover, the same hacker may attack in different time periods, but the attack behaviors and means remain the same, so the similarity characteristic stays essentially consistent; this characteristic tag can therefore also identify the same type of hacker attacking in different time periods.

The flow characteristic, the time characteristic, the attack attribute characteristic and the similarity characteristic obtained for the abnormal flow characteristic data in the above embodiment are together used as the hacker portrait of the hacker behind that data. The cosine similarity between the portrait of the hacker to be identified and each hacker portrait in the hacker portrait library with the same attack attribute characteristic is calculated, and the maximum similarity is compared with a threshold value D. If the similarity to a hacker portrait in the hacker portrait library is greater than the threshold value D, the matching is successful: a hacker with the same identity as the hacker to be identified exists in the hacker portrait library, and the identity of the hacker is thereby determined.

If the hacker portrait cannot be matched with any hacker portrait of that attack type, the hacker is a new type of hacker within the attack type; the constructed hacker portrait is then added to the hacker portrait library, updating the initial hacker portrait library.
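The matching and library update described above can be illustrated with the sketch below. It assumes that each hacker portrait has already been encoded as a numeric vector and that `library` holds only the portraits sharing the same attack attribute characteristic. The function names, the identifier scheme and the threshold value 0.8 are hypothetical; the text does not specify a value for the threshold D.

```python
import numpy as np


def cosine(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


def identify(portrait, library, threshold):
    """Return (hacker id, is_new); add the portrait to the library if unmatched."""
    best_id, best_sim = None, float("-inf")
    for hacker_id, known in library.items():
        sim = cosine(portrait, known)
        if sim > best_sim:
            best_id, best_sim = hacker_id, sim
    # Compare the maximum similarity with the threshold D.
    if best_id is not None and best_sim > threshold:
        return best_id, False
    # No match: treat as a new hacker of this attack type and update the library.
    new_id = f"hacker_{len(library) + 1}"
    library[new_id] = list(portrait)
    return new_id, True


library = {"hacker_1": [1.0, 0.0, 0.0], "hacker_2": [0.0, 1.0, 0.0]}
print(identify([0.9, 0.1, 0.0], library, threshold=0.8))  # ('hacker_1', False)
print(identify([0.0, 0.0, 1.0], library, threshold=0.8))  # ('hacker_3', True)
```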

In the above embodiment, preferably, the pre-constructing process of the hacker profile library includes:

classifying the abnormal flow characteristic sample data according to the attack attribute characteristics, extracting the flow characteristics and the time characteristics of each abnormal flow characteristic sample data, and calculating the corresponding similarity characteristics of each abnormal flow characteristic sample data;

taking the flow characteristic, the time characteristic, the attack attribute characteristic and the similarity characteristic corresponding to each piece of abnormal flow characteristic sample data as the hacker portrait of the hacker who launched the attack in the current abnormal flow characteristic sample data, and adding the hacker portrait to the hacker portrait library. Hacker portraits continue to be built for the next hacker until the hacker portraits of every attack category have all been added to the hacker portrait library.

As shown in fig. 5, the present invention further provides a network attacker identification system based on a hacker portrayal, which applies the network attacker identification method based on a hacker portrayal disclosed in any one of the above embodiments, and the system comprises:

the data feature extraction module 1 is used for collecting network traffic data and extracting traffic features and time features of the network traffic data as traffic feature data;

the abnormal data classification module 2 is used for preprocessing the flow characteristic data, inputting the flow characteristic data into an abnormal flow classifier constructed and trained based on an SAE-BNN network model, coding the flow characteristic data by a sparse self-encoder SAE added with a hidden layer neuron with sparsity limitation, taking the characteristic data after coding reconstruction as the input of a Bayesian neural network BNN, obtaining a classification result prediction score by forward propagation calculation and a one-dimensional full-link layer, and obtaining a normal or abnormal classification result by softmax function classification;

the data feature calculation module 3 is configured to determine, for the abnormal traffic feature data, an attack attribute feature corresponding to the abnormal traffic feature data, and calculate a similarity of a data connection vector corresponding to the abnormal traffic feature data with respect to a data connection vector of hacker data conforming to the current attack attribute feature, to obtain a similarity feature;

the abnormal data matching module 4 is used for calculating the similarity of the hacker portrait relative to the same attack attribute feature portrait in a pre-constructed hacker portrait library by taking the flow feature, the time feature, the attack attribute feature and the similarity feature corresponding to the abnormal flow feature data as the hacker portrait;

the hacker identification module 5 is used for determining, when the similarity between the hacker portrait and a portrait in the hacker portrait library is higher than a preset threshold value, that the hacker corresponding to that portrait is the hacker behind the current abnormal flow characteristic data; and is further used for adding the hacker portrait to the hacker portrait library when the similarity between the hacker portrait and every portrait in the hacker portrait library is not higher than the preset threshold value.

In the above embodiment, preferably, the abnormal data classification module 2 performs a specific process of preprocessing the flow characteristic data:

converting the flow characteristic data into numerical characteristic data;

In the above embodiment, preferably, the method for constructing and training the abnormal traffic classifier includes:

the abnormal flow classifier comprises a sparse self-encoder SAE and a Bayesian neural network BNN, data are input by the sparse self-encoder SAE, the sparse self-encoder SAE comprises a hidden layer neuron added with sparsity limitation, reconstructed characteristic data output by the sparse self-encoder SAE is used as input of the Bayesian neural network BNN, and the Bayesian neural network BNN comprises a forward propagation calculation layer, a one-dimensional full-connection layer and a softmax function layer;

and training the abnormal flow classifier by using the abnormal flow characteristic sample data and the corresponding normal or abnormal classification result as the input and the output of the abnormal flow classifier respectively until the classification result loss value loss of the abnormal flow classifier reaches the convergence threshold value.

In the foregoing embodiment, preferably, the specific process of the data feature calculating module 3 calculating the similarity of the data connection vector of the abnormal traffic feature data with respect to the data connection vector of the hacker data conforming to the current attack attribute feature includes:

calculating cosine similarity of the corresponding data connection vector of the abnormal flow characteristic data and the data connection vector of each abnormal flow characteristic sample data, summing and averaging all the cosine similarity, calculating to obtain total connection similarity, and obtaining similarity characteristics according to the total connection similarity.

taking the flow characteristic, the time characteristic, the attack attribute characteristic and the similarity characteristic corresponding to each piece of abnormal flow characteristic sample data as the hacker portrait of the hacker corresponding to the current abnormal flow characteristic sample data, and adding the hacker portrait to the hacker portrait library.

The hacker portrait-based network attacker identification system disclosed in the above embodiment applies the hacker portrait-based network attacker identification method disclosed in the above embodiment; in a specific implementation, the modules are implemented according to the steps of the network attacker identification method, which are not described herein again.

The above describes only preferred embodiments of the present invention and is not intended to limit the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A network attacker identification method based on hacker portrayal is characterized by comprising the following steps:

preprocessing the flow characteristic data, inputting the flow characteristic data into an abnormal flow classifier constructed and trained based on an SAE-BNN network model, coding the flow characteristic data by a sparse self-encoder SAE added with a hidden layer neuron with sparsity limitation, taking the characteristic data after coding reconstruction as the input of a Bayesian neural network BNN, obtaining a classification result prediction score by forward propagation calculation and a one-dimensional full-link layer, and classifying the flow characteristic data by a softmax function to obtain a normal or abnormal classification result;

2. The hacker portrayal-based network attacker identification method according to claim 1, wherein the specific process of preprocessing the traffic feature data is as follows:

converting the flow characteristic data into numerical characteristic data;

3. The hacker portrayal-based network attacker identification method of claim 1, wherein the construction and training method of the abnormal traffic classifier comprises:

4. The hacker portrayal-based network attacker identification method of claim 3, wherein the specific process of calculating the similarity of the data connection vector of the abnormal traffic feature data relative to the data connection vector of the hacker data conforming to the current attack attribute features comprises:

5. The hacker portrayal-based network attacker identification method of claim 3, wherein the pre-construction process of the hacker portrayal library comprises:

classifying the abnormal flow characteristic sample data according to the attack attribute characteristics, extracting the flow characteristics and the time characteristics of each piece of abnormal flow characteristic sample data, and calculating the corresponding similarity characteristics of each piece of abnormal flow characteristic sample data;

6. A hacker portrayal-based network attacker identification system, which is applied to the hacker portrayal-based network attacker identification method of any one of claims 1 to 5, and comprises:

the data feature calculation module is used for determining attack attribute features corresponding to abnormal traffic feature data aiming at the abnormal traffic feature data, and calculating the similarity of data connection vectors corresponding to the abnormal traffic feature data relative to data connection vectors of hacker data conforming to the current attack attribute features to obtain similarity features;

the hacker identification module is used for determining, when the similarity between the hacker portrait and a portrait in the hacker portrait library is higher than a preset threshold value, that the hacker corresponding to that portrait is the hacker behind the current abnormal flow characteristic data; and is further used for adding the hacker portrait to the hacker portrait library when the similarity between the hacker portrait and all the portraits in the hacker portrait library is not higher than the preset threshold value.

7. The hacker profile-based network attacker identification system of claim 6, wherein the anomaly data classification module performs a specific process of preprocessing the traffic characteristic data:

converting the flow characteristic data into numerical characteristic data;

8. The hacker portrayal-based network attacker identification system of claim 6, wherein the construction and training method of the abnormal traffic classifier comprises:

9. The hacker portrayal-based network attacker identification system of claim 8, wherein the specific process of the data feature calculation module calculating the similarity of the data connection vector of the abnormal traffic feature data relative to the data connection vector of the hacker data conforming to the current attack attribute feature comprises:

10. The hacker portrayal-based network attacker identification system of claim 8, wherein the pre-construction process of the hacker portrayal library comprises: