CN115396235B

CN115396235B - Network attacker identification method and system based on hacker portrait

Info

Publication number: CN115396235B
Application number: CN202211308863.6A
Authority: CN
Inventors: 王建龙; 关乐嘉; 庄唯; 李姝�; 殷倩
Original assignee: Beijing Tianyun Sea Number Technology Co ltd
Current assignee: Beijing Tianyun Sea Number Technology Co ltd
Priority date: 2022-10-25
Filing date: 2022-10-25
Publication date: 2023-01-13
Anticipated expiration: 2042-10-25
Also published as: CN115396235A

Abstract

The invention discloses a network attacker identification method and system based on a hacker portrait. The method comprises the following steps: collecting network flow data, and extracting flow characteristics and time characteristics as flow characteristic data; preprocessing the flow characteristic data and inputting it into an abnormal flow classifier constructed and trained based on an SAE-BNN network model to obtain a normal or abnormal classification result; for abnormal flow characteristic data, determining the attack attribute characteristics, and calculating the similarity of its data connection vector relative to the data connection vectors of hacker data with the current attack attribute characteristics to obtain similarity characteristics; taking the flow characteristic, the time characteristic, the attack attribute characteristic and the similarity characteristic as a hacker portrait, and calculating the similarity between the hacker portrait and the portraits with the same attack attribute characteristic in the hacker portrait library; and judging whether the portrait corresponds to a known hacker according to the relation between the similarity and a preset threshold value. This technical scheme improves hacker identification efficiency and effectively improves the generalization capability and identification effect of the model.

Description

Network attacker identification method and system based on hacker portrait

Technical Field

The invention relates to the technical field of network security, in particular to a network attacker identification method based on hacker portrayal and a network attacker identification system based on hacker portrayal.

Background

Cyberspace has become the "fifth domain" after sea, land, air, and outer space, and a new battlefield on which countries compete; cyberspace security has become an indispensable part of security planning. Building a secure cyberspace environment requires not only raising alarms and taking remedial measures against network attacks, but also confirming the identity of the attacker who carried them out, finding the attacker hidden behind the network, and thereby solving the problem at its root.

Attacking a network by sending malicious traffic is a common form of attack. According to research by NETSCOUT's ATLAS Security Engineering and Response Team, about 2.9 million Distributed Denial of Service (DDoS) attacks occurred in the first quarter of 2021, 31% more than in the same period of 2020, and the attacked industries included healthcare, education, online services and other fields. The team also predicted that DDoS attacks would continue to increase, reaching record numbers and becoming more complex.

Therefore, how to accurately identify malicious traffic in massive traffic data, and find out a hacker behind sending the malicious traffic and implementing an attack is an important task for maintaining the security of a network space.

At present, intrusion detection research on network traffic identification mainly trains models on network traffic data with deep learning methods, obtaining a classifier that distinguishes normal traffic from abnormal traffic.

Document [1] (mamingbai, chenwei, wu etiquette. Intrusion detection method based on CNN_BiLSTM network [J]. Computer Engineering and Applications, 2022, 58(10): 116-124.) first screens the features of the UNSW-NB15 data set with a random forest method, then uses a CNN (Convolutional Neural Network) and a BiLSTM (Bi-directional Long Short-Term Memory) network to extract features in parallel, splices the features extracted by the two models, applies a self-attention mechanism for a second round of feature extraction, and classifies the resulting features with a Gated Recurrent Unit (GRU). Document [2] (facial brightness, ji Sha Bei, liu Wan, xie Jian Wu. Network intrusion detection based on GRU and feature embedding [J]. Journal of Applied Sciences, 2021, 39(04): 559-568.) proposes a network intrusion detection model based on GRU and feature embedding: the vectors mapped by a word embedding layer are assembled into continuous features so that the time sequence information in the data can be extracted effectively, the features are then passed to a GRU layer, and the result is output through two fully connected layers. Document [3] (YANG H, WANG F. Wireless Network Intrusion Detection Based on Improved Convolutional Neural Network [J]. IEEE Access, 2019, 7: 64366-64374.) has the CNN store the result of the second convolution, then performs convolution, pooling and full connection in parallel, and classifies with softmax after merging the features.

During training, too many feature dimensions increase the computational cost, and redundant features do not noticeably improve the classification result, so scholars have proposed feature selection methods to screen for important features. Document [4] (wufeng. Intrusion detection system feature selection method based on improved pigeon group optimization algorithm [J]. University of southwest schoolwork (natural science edition), 2021, 46(05): 140-146.) selects features from the KDDCUP99 and UNSW-NB15 data sets with an improved pigeon group optimization algorithm, removes redundant features, and then classifies with a decision tree model, which accelerates model convergence while preserving classification accuracy. Document [5] (SELVAKUMAR B, MUNEESWARAN K. Firefly algorithm based feature selection for network intrusion detection [J]. Computers & Security, 2019, 81.) reduces the dimensionality of the flow features with a firefly algorithm and trains a C4.5 decision tree and a Bayesian network model on the reduced data, obtaining higher detection accuracy. Document [6] (ALZUBI Q M, ANBAR M, ALQATTAN Z, et al. Intrusion detection system based on a modified binary grey wolf optimisation [J]. Neural Computing and Applications, 2020, 32.) proposes a feature selection algorithm based on a modified binary grey wolf optimization algorithm that removes redundant features and retains only a portion of key features; simulation results show that the detection model with feature selection outperforms the model without feature selection in both time and precision.

Feature selection methods only screen out a subset of features; they cannot further extract the implicit relations among the features or form high-level reconstructed features that effectively express the labels.

Both machine learning and deep learning methods can only identify whether behavior is normal or an attack; they cannot find the attacker behind the attack. Researchers have therefore applied user profiling techniques to the identification of network attackers. Document [7] (flood flight, liao light faithfulness. Hacker portrait early warning model based on K-Medoids clustering [J]. Computer Engineering and Design, 2021, 42(05): 1244-1249.) extracts hacker behavior characteristics from security log data to construct hacker portraits, clusters the portraits with the K-medoids clustering method to build hacker group portraits, analyzes the characteristics of each cluster, and proposes defenses matched to the different attack means. Document [8] (Zhao just, yao xing ren. Abnormal behavior detection model based on user portrait [J]. Information Network Security, 2017(7): 18-24.) constructs user portraits from both user attributes and behaviors and proposes an intrusion detection model based on the user portrait. Document [9] (yellow aspiration macro, zhang wave. Attacker portrait construction based on big data and graph community clustering algorithm [J]. Computer Application Research, 2021, 38(01): 232-236.) proposes a method for constructing a hacker portrait based on Big Data Stream Analysis technology and the Louvain community discovery algorithm (Big Data Stream Analysis and Louvain, BDSAL), which can quickly form unified attack events from massive, multi-source and heterogeneous data, construct a hacker portrait that accurately depicts hacker information, and discover attackers.

Existing research based on hacker portraits constructs a portrait to decide whether a user is an attacker (that is, a hacker) rather than to determine which hacker it is, and lacks an efficient hacker identification method, so hackers cannot be identified quickly.

If only intrusion detection is performed on the traffic, normal and abnormal traffic can be distinguished, but the hacker behind the malicious traffic cannot be found. If hacker identity is recognized from hacker portraits alone, the time cost is huge, because in real scenarios the volume of normal traffic is far larger than that of abnormal traffic, and for normal traffic no hacker identity needs to be recognized. Intrusion detection and hacker portrait technology should therefore be combined: abnormal traffic in the network is first screened out by intrusion detection, and features are then analyzed and extracted from the abnormal traffic to identify the hacker who sent it.

However, at present, no research combining the two methods exists, and the identity of a hacker sending abnormal traffic cannot be efficiently and accurately identified.

Disclosure of Invention

In view of the above problems, the invention provides a network attacker identification method and system based on hacker portraits. A hacker portrait is formed jointly from a hacker's attribute label, flow characteristic label, time characteristic label and connection similarity label, and an initial hacker portrait library is established. Features are then extracted from the abnormal flow data to be identified, and the connection similarity label is obtained by analyzing the similarity between hackers with the same attack type. On this basis, a user portrait is constructed for the user behind the identified abnormal flow, so that the hacker is accurately characterized and matched against the hackers of the corresponding abnormal type in the initial portrait library, which improves hacker identification efficiency. The abnormal flow classifier based on the SAE-BNN network model effectively improves the generalization capability and identification effect of the model.

In order to achieve the above object, the present invention provides a network attacker identification method based on hacker portrayal, comprising:

collecting network flow data, and extracting flow characteristics and time characteristics of the network flow data to serve as flow characteristic data;

preprocessing the flow characteristic data, inputting the flow characteristic data into an abnormal flow classifier constructed and trained based on an SAE-BNN network model, coding the flow characteristic data by a sparse self-encoder SAE added with a hidden layer neuron with sparsity limitation, taking the characteristic data after coding reconstruction as the input of a Bayesian neural network BNN, obtaining a classification result prediction score by forward propagation calculation and a one-dimensional full connection layer, and obtaining a normal or abnormal classification result by softmax function classification;

aiming at abnormal flow characteristic data, determining attack attribute characteristics corresponding to the abnormal flow characteristic data, and calculating the similarity of a data connection vector corresponding to the abnormal flow characteristic data relative to a data connection vector of hacker data conforming to the current attack attribute characteristics to obtain similarity characteristics;

taking the flow characteristic, the time characteristic, the attack attribute characteristic and the similarity characteristic corresponding to the abnormal flow characteristic data as a hacker portrait, and calculating the similarity of the hacker portrait relative to the same attack attribute characteristic portrait in a pre-constructed hacker portrait library;

if the similarity between the hacker portrait and a certain portrait in the library is higher than a preset threshold value, judging that the hacker corresponding to that portrait is the attacking hacker of the current abnormal flow characteristic data;

and if the similarity between the hacker portrait and all the portraits in the hacker portrait library is not higher than the preset threshold value, adding the hacker portrait to the hacker portrait library.
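The matching and library-update steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the portrait is assumed to be a plain numeric vector, cosine similarity is assumed as the portrait similarity measure, and the names `match_portrait` and `library`, the generated identifiers, and the threshold value 0.9 are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def match_portrait(portrait, attack_type, library, threshold=0.9):
    """Return the matched hacker id, or None after adding a new portrait.

    library maps attack_type -> list of (hacker_id, portrait_vector).
    Only portraits with the same attack attribute are compared.
    """
    candidates = library.setdefault(attack_type, [])
    best_id, best_sim = None, -1.0
    for hacker_id, known in candidates:
        sim = cosine(portrait, known)
        if sim > best_sim:
            best_id, best_sim = hacker_id, sim
    if best_sim > threshold:
        return best_id
    # No portrait exceeded the threshold: register a new hacker portrait.
    new_id = f"{attack_type}-{len(candidates) + 1}"  # hypothetical id scheme
    candidates.append((new_id, list(portrait)))
    return None
```

Restricting the comparison to portraits of the same attack type keeps the number of similarity computations small, which is the efficiency argument the text makes for classifying the traffic first.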

In the above technical solution, preferably, the specific process of preprocessing the flow characteristic data is as follows:

converting the flow characteristic data into numerical characteristic data;

carrying out standardization processing on the flow characteristic data by adopting a mean variance normalization method so that the characteristic range of the flow characteristic data is in the same preset interval;

and carrying out class imbalance data processing on the flow characteristic data by adopting a Borderline-smote oversampling algorithm.

In the above technical solution, preferably, the method for constructing and training the abnormal traffic classifier includes:

the abnormal flow classifier comprises the sparse self-encoder SAE and the Bayesian neural network BNN, data is input by the sparse self-encoder SAE, the sparse self-encoder SAE comprises a hidden layer neuron added with sparsity limitation, reconstructed feature data output by the sparse self-encoder SAE is used as the input of the Bayesian neural network BNN, and the Bayesian neural network BNN comprises a forward propagation calculation layer, a one-dimensional full-connection layer and a softmax function layer;

and training the abnormal flow classifier by using the abnormal flow characteristic sample data and the corresponding normal or abnormal classification result as the input and the output of the abnormal flow classifier respectively until the classification result loss value loss of the abnormal flow classifier reaches a convergence threshold value.

In the above technical solution, preferably, the specific process of calculating the similarity between the data connection vector of the abnormal traffic characteristic data and the data connection vector of the hacker data conforming to the current attack attribute characteristic includes:

recording a source IP, a source port, a destination IP and a destination port corresponding to the abnormal flow characteristic data to form a data connection vector corresponding to the abnormal flow characteristic data;

acquiring data connection vectors of all abnormal network traffic sample data with the same attack attribute characteristics as the abnormal traffic characteristic data;

calculating cosine similarity of the data connection vector corresponding to the abnormal flow characteristic data and the data connection vector of each abnormal flow characteristic sample data, summing and averaging all the cosine similarity, calculating total connection similarity, and obtaining the similarity characteristic according to the total connection similarity.
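The connection-similarity calculation described above can be illustrated with a short sketch. It assumes that the two IP addresses are first converted to integers so that the data connection vector is fully numeric; the function names and the handling of an empty sample set are illustrative choices, not taken from the text.

```python
import math

def connection_vector(src_ip, src_port, dst_ip, dst_port):
    """Build the numeric data connection vector (src IP, src port, dst IP, dst port)."""
    def ip_to_int(ip):
        a, b, c, d = (int(p) for p in ip.split("."))
        return (a << 24) | (b << 16) | (c << 8) | d
    return [ip_to_int(src_ip), src_port, ip_to_int(dst_ip), dst_port]

def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def total_connection_similarity(vec, sample_vecs):
    """Average cosine similarity between vec and every sample vector."""
    if not sample_vecs:
        return 0.0  # assumption: no samples of this attack type yet
    return sum(cosine(vec, s) for s in sample_vecs) / len(sample_vecs)
```

Here `sample_vecs` would hold the connection vectors of all abnormal samples that share the attack attribute of the record being examined.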

In the above technical solution, preferably, the pre-constructing process of the hacker profile library includes:

classifying the abnormal flow characteristic sample data according to the attack attribute characteristics, extracting the flow characteristics and the time characteristics of each piece of abnormal flow characteristic sample data, and calculating the corresponding similarity characteristics of each piece of abnormal flow characteristic sample data;

and the traffic characteristic, the time characteristic, the attack attribute characteristic and the similarity characteristic corresponding to each abnormal traffic characteristic sample data are used as hacker portraits of hackers corresponding to the current abnormal traffic characteristic sample data and are added into the hacker portraits database.

The invention also provides a network attacker identification system based on the hacker portrait, which applies the network attacker identification method based on the hacker portrait disclosed in any one of the above technical schemes and comprises the following modules:

the data feature extraction module is used for collecting network traffic data and extracting traffic features and time features of the network traffic data as traffic feature data;

the abnormal data classification module is used for preprocessing the flow characteristic data, inputting the flow characteristic data into an abnormal flow classifier constructed and trained based on an SAE-BNN network model, coding the flow characteristic data through a sparse self-encoder SAE added with a hidden layer neuron with sparsity limitation, taking the characteristic data after coding reconstruction as the input of a Bayesian neural network BNN, obtaining a classification result prediction score through forward propagation calculation and a one-dimensional full-link layer, and obtaining a normal or abnormal classification result through softmax function classification;

the data feature calculation module is used for determining attack attribute features corresponding to abnormal traffic feature data aiming at the abnormal traffic feature data, and calculating the similarity of data connection vectors corresponding to the abnormal traffic feature data relative to data connection vectors of hacker data conforming to the current attack attribute features to obtain similarity features;

the abnormal data matching module is used for taking the flow characteristic, the time characteristic, the attack attribute characteristic and the similarity characteristic corresponding to the abnormal flow characteristic data as a hacker portrait and calculating the similarity of the hacker portrait relative to the same attack attribute characteristic portrait in a pre-constructed hacker portrait library;

the hacker identification module is used for judging, when the similarity between the hacker portrait and a certain portrait in the library is higher than a preset threshold value, that the hacker corresponding to that portrait is the attacking hacker of the current abnormal flow characteristic data; and is further used for adding the hacker portrait to the hacker portrait library when the similarity between the hacker portrait and all portraits in the hacker portrait library is not higher than the preset threshold value.

In the above technical solution, preferably, the abnormal data classification module performs a specific process of preprocessing the flow characteristic data:

converting the flow characteristic data into numerical characteristic data;

and carrying out class imbalance data processing on the flow characteristic data by adopting a Borderline-smote oversampling algorithm.

In the above technical solution, preferably, the specific process of the data feature calculating module calculating the similarity between the data connection vector of the abnormal traffic feature data and the data connection vector of the hacker data conforming to the current attack attribute feature includes:

acquiring data connection vectors of all abnormal traffic characteristic sample data with the same attack attribute characteristics as the abnormal traffic characteristic data;

calculating cosine similarity of the corresponding data connection vector of the abnormal flow characteristic data and the data connection vector of each abnormal flow characteristic sample data, summing and averaging all the cosine similarity, calculating total connection similarity, and obtaining the similarity characteristic according to the total connection similarity.

Compared with the prior art, the invention has the beneficial effects that: a hacker portrait is formed by extracting a hacker attribute label, a flow characteristic label, a time characteristic label and a connection similarity label of a hacker, an initial hacker portrait library is established, then characteristic extraction is carried out on abnormal flow data to be identified, the connection similarity label is extracted by analyzing the similarity between hackers of the same attack type, and on the basis, a user portrait is constructed for an identified abnormal flow user, so that the hacker is accurately depicted and matched with the hacker in the initial portrait library according to the corresponding abnormal type, the hacker identification efficiency is improved, and the generalization capability and the identification effect of the model can be effectively improved by using the abnormal flow classifier based on the SAE-BNN network model.

Drawings

FIG. 1 is a flowchart illustrating a network attacker identification method based on a hacker profile according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a model structure of an abnormal traffic classifier based on an SAE-BNN network model according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating an example of a hacker portrait according to one embodiment of the disclosure;

FIG. 4 is a schematic diagram of a process for constructing an initial hacker profile library according to one embodiment of the disclosure;

FIG. 5 is a block diagram of a hacker profile-based network attacker identification system according to an embodiment of the present invention.

In the drawings, the correspondence between each component and the reference numeral is:

1. data feature extraction module; 2. abnormal data classification module; 3. data feature calculation module; 4. abnormal data matching module; 5. attacking hacker identification module.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.

The invention is described in further detail below with reference to the attached drawing figures:

as shown in fig. 1, the network attacker identification method based on hacker portrayal provided by the invention comprises:

collecting network flow data, and extracting flow characteristics and time characteristics of the network flow data as flow characteristic data;

preprocessing flow characteristic data, inputting the flow characteristic data into an abnormal flow classifier constructed and trained based on an SAE-BNN network model, coding the flow characteristic data by a sparse self-encoder SAE added with a hidden layer neuron with sparsity limitation, taking the characteristic data after coding reconstruction as the input of a Bayesian neural network BNN, obtaining a classification result prediction score by forward propagation calculation and a one-dimensional full-link layer, and obtaining a normal or abnormal classification result by softmax function classification;

aiming at the abnormal flow characteristic data, determining the attack attribute characteristics corresponding to the abnormal flow characteristic data, and calculating the similarity of the data connection vector corresponding to the abnormal flow characteristic data relative to the data connection vector of the hacker data according with the current attack attribute characteristics to obtain the similarity characteristics;

and if the similarity between the hacker portrait and all the portraits in the hacker portrait library is not higher than a preset threshold value, adding the hacker portrait into the hacker portrait library.

In the embodiment, a hacker portrait is formed by extracting a hacker attribute label, a flow characteristic label, a time characteristic label and a connection similarity label of a hacker, an initial hacker portrait library is established, then characteristic extraction is carried out on abnormal flow data to be identified, the connection similarity label is extracted by analyzing the similarity between hackers of the same attack type, and on the basis, a user portrait is constructed for the identified abnormal flow user, so that the hacker is accurately depicted and matched with the hacker in the initial portrait library according to the corresponding abnormal type, the hacker identification efficiency is improved, and the abnormal flow classifier based on the SAE-BNN network model can effectively improve the generalization capability and the identification effect of the model.

Specifically, network flow data is first collected, features are extracted and preprocessed, and the data is classified to obtain abnormal flow feature data. A hacker portrait is then constructed from the abnormal flow feature data and matched against the portraits of the same type in the initial hacker portrait library. If a match is found, the identity of the unknown hacker is determined from the matching result; otherwise, a new hacker portrait is constructed for the hacker and the hacker portrait library is updated.

The SAE-BNN network model is trained, so that the abnormal traffic classifier which can accurately classify network traffic data of different abnormal types can be obtained, and the classification effect of the network traffic data is effectively improved.
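To make the data flow through the classifier concrete, the following is a rough forward-pass sketch only. It is not the trained model of the invention: the layer sizes, the random weights, the sigmoid activation, the KL-divergence form of the sparsity constraint, the Gaussian weight distributions of the Bayesian layer, and the Monte Carlo averaging are all assumptions introduced for illustration.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dense(x, weights, bias):
    """y_j = sum_i x_i * W[i][j] + b_j for one input vector x."""
    return [sum(xi * weights[i][j] for i, xi in enumerate(x)) + bias[j]
            for j in range(len(bias))]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_sparsity(rho, rho_hat):
    """KL(rho || rho_hat): a common penalty for keeping hidden units sparse."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))

def sae_encode(x, w_enc, b_enc):
    """Encoder half of the sparse autoencoder: hidden activations in (0, 1)."""
    return [sigmoid(z) for z in dense(x, w_enc, b_enc)]

def bnn_predict(h, w_mu, w_sigma, bias, rng, n_samples=20):
    """Average softmax outputs over weights sampled from N(mu, sigma^2)."""
    n_classes = len(bias)
    avg = [0.0] * n_classes
    for _ in range(n_samples):
        w = [[rng.gauss(w_mu[i][j], w_sigma[i][j]) for j in range(n_classes)]
             for i in range(len(h))]
        probs = softmax(dense(h, w, bias))
        avg = [a + p / n_samples for a, p in zip(avg, probs)]
    return avg

rng = random.Random(0)
n_in, n_hidden, n_classes = 6, 3, 2          # six preprocessed features
w_enc = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
b_enc = [0.0] * n_hidden
w_mu = [[rng.uniform(-0.5, 0.5) for _ in range(n_classes)] for _ in range(n_hidden)]
w_sigma = [[0.1] * n_classes for _ in range(n_hidden)]

x = [0.2, -1.0, 0.5, 0.0, 1.3, -0.7]         # one standardized record
h = sae_encode(x, w_enc, b_enc)
penalty = kl_sparsity(0.05, sum(h) / len(h))  # would be added to the SAE loss
probs = bnn_predict(h, w_mu, w_sigma, [0.0] * n_classes, rng)
label = "abnormal" if probs[1] > probs[0] else "normal"
```

With untrained random weights the label is meaningless; the sketch only shows the order of operations: encode with the sparse autoencoder, pass the encoded features through a layer with sampled weights, and classify with softmax.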

In this embodiment of the invention, the UNSW-NB15 data set is used to test the network attacker identification method. The data set contains about 2.54 million records, 49 features and 9 attack types; normal data accounts for 88% and attack data for 12%, as detailed in the following table:

TABLE 1 UNSW-NB15 dataset

Based on a review of the existing literature and an in-depth analysis of the traffic sent by network hackers, the traffic characteristics and time characteristics of the network traffic data are extracted as follows:

flow characteristic extraction: the traffic characteristics specifically comprise four characteristics of destination port, destination IP, byte number and protocol type.

The destination ports and destination IPs of a normal user are diverse, whereas a hacker carrying out an attack may repeatedly target a particular IP or port, so the hacker's destination port and destination IP tend to be concentrated on a single target, unlike a normal user's. The destination IP and the destination port reflect the difference between normal users and abnormal users from the perspective of the traffic receiver. The operations of a normal user vary, and so do the byte counts; an attacker, however, sometimes repeats the same attack operation, so the byte counts differ little. The protocol types on which different attack means depend also differ. The number of bytes and the protocol type reflect the difference between normal users and abnormal users from the perspective of the traffic sender.

Time feature extraction: the time characteristics specifically include traffic arrival time and traffic duration, and the time characteristics can represent the difference between normal users and hackers and between different hackers in the time dimension.

Traffic from normal users generally shows no obvious regularity; its types are diverse and it generally follows the usual distribution of network traffic. Attack traffic from hackers, by contrast, may be concentrated in a certain time period and is generally all of the same type, which does not follow that distribution. On this basis, normal traffic can be distinguished from attack traffic.

In the collected network traffic data there is far less attack data than normal data, and part of the data is non-numerical, so the data needs to be preprocessed. Preferably, the specific process of preprocessing the traffic characteristic data includes:

converting the flow characteristic data into numerical characteristic data;

Specifically, in the implementation process, the processing mode is as follows:

(1) Data type conversion: some of the extracted features are not numerical, so this part of the data needs to be converted.

For the destination IP, the string-type IP address is converted into integer data. An IP address consists of four bytes, each in the range 0-255. Each byte is written as a two-digit hexadecimal number, padded with a leading 0 if it has fewer than two digits; the four hexadecimal numbers are concatenated into one 8-digit hexadecimal number, which is then converted to decimal to give the numerical representation of the IP address. For example, for the IP address 192.163.88.5, the bytes are c0, a3, 58 and 05 in hexadecimal; concatenating them gives the hexadecimal number c0a35805, which converts to the decimal number 3231930373, the numerical IP address.
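The conversion just described can be written in a few lines. The function name `ip_to_int` and the validation step are illustrative additions; the hexadecimal concatenation follows the procedure in the text.

```python
def ip_to_int(ip):
    """Convert a dotted-quad IPv4 string to its integer representation."""
    parts = [int(p) for p in ip.split(".")]
    if len(parts) != 4 or any(not 0 <= p <= 255 for p in parts):
        raise ValueError(f"not a valid IPv4 address: {ip!r}")
    # Two hex digits per byte, zero-padded, then concatenated: "c0a35805".
    hex_string = "".join(f"{p:02x}" for p in parts)
    return int(hex_string, 16)

print(ip_to_int("192.163.88.5"))  # 3231930373
```

The standard library gives the same value with `int(ipaddress.IPv4Address("192.163.88.5"))`, which may be preferable in production code.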

For the protocol type, the distinct protocols are numbered in order of first appearance, and the sequence number of each protocol is its discrete numerical feature.
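A small sketch of this order-of-appearance encoding follows. Starting the numbering at 0 is an assumption, since the text does not say whether sequence numbers begin at 0 or 1, and the function name is hypothetical.

```python
def encode_by_first_appearance(values):
    """Map each distinct value to the index of its first appearance."""
    codes = {}
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes)  # next unused sequence number
        encoded.append(codes[v])
    return encoded, codes

encoded, codes = encode_by_first_appearance(["tcp", "udp", "tcp", "icmp", "udp"])
print(encoded)  # [0, 1, 0, 2, 1]
```

The returned `codes` mapping should be kept and reused for new traffic so that a protocol always receives the same number.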

(2) Data normalization: the value ranges of the converted features differ greatly; for example, a converted IP address may be a number in the millions or larger, while a duration may be only a few milliseconds. The invention applies mean-variance normalization so that each feature has a mean of 0 and a variance of 1. The formula of the mean-variance normalization is as follows:

$$x_{scale} = \frac{x - \mu}{S}$$

where x is the value to be normalized, x_scale is the normalized value, μ is the mean, and S is the standard deviation.
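As a quick illustration of mean-variance normalization, the sketch below standardizes one feature column by subtracting the mean and dividing by the standard deviation. It assumes the population standard deviation is intended and maps a constant column to zeros to avoid division by zero; neither detail is specified in the text.

```python
import statistics

def standardize(values):
    """Mean-variance normalization: (x - mean) / standard deviation."""
    mu = statistics.mean(values)
    s = statistics.pstdev(values)  # population standard deviation (assumed)
    if s == 0:
        return [0.0 for _ in values]
    return [(x - mu) / s for x in values]

scaled = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(scaled[0], scaled[-1])  # -1.5 2.0
```

In practice the mean and standard deviation would be computed on the training data and reused unchanged for traffic seen later.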

(3) Processing the class imbalance data:

Because the data set contains far less attack data than normal data, there is a serious class imbalance problem. A classifier trained on imbalanced data biases its predictions toward the class with more samples, which harms the classification of the minority class. Common remedies for data imbalance are oversampling and undersampling. Undersampling brings the numbers of positive and negative examples close by removing majority-class samples, but the removed data may include samples that strongly influence the classification result, so the classifier loses important information about the majority class. Oversampling brings the positive and negative examples into balance by adding minority-class samples. To avoid the loss of classification accuracy that information loss may cause, the invention uses an oversampling method to solve the class imbalance problem.

Specifically, the method performs class imbalance processing on the data set with the Borderline-SMOTE oversampling algorithm, which is an improvement on the SMOTE method. The SMOTE algorithm adds minority class samples through the following basic steps:

1) Calculate the Euclidean distance from each minority class sample to all other minority class samples to obtain its k nearest neighbors;

2) According to the sampling rate N, for each minority class sample x, randomly select several samples y from its k nearest neighbors;

3) For each sample x and each selected neighbor y, synthesize a new sample by the following equation:

$$x_{new} = x + rand(0,1) \times |x - y|$$

where x_new is the synthesized sample point, rand(0, 1) is a random value between 0 and 1, and |x - y| is the distance between the two points.
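The three SMOTE steps above can be sketched in plain Python as follows. This is a hedged illustration rather than the patent's code: the helper names `k_nearest` and `smote` are hypothetical, and the new point is placed on the segment between x and y using the common vector form x + r * (y - x), which is an assumption about how the distance term is applied per coordinate.

```python
import math
import random


def k_nearest(sample, pool, k):
    """Step 1: the k points of pool closest to sample (Euclidean), excluding sample itself."""
    others = [p for p in pool if p is not sample]
    return sorted(others, key=lambda p: math.dist(sample, p))[:k]


def smote(minority, k=3, n_per_sample=1, rng=None):
    """Synthesize n_per_sample new points for every minority sample."""
    rng = rng or random.Random(0)
    synthetic = []
    for x in minority:
        neighbours = k_nearest(x, minority, k)
        for _ in range(n_per_sample):          # step 2: sampling rate N
            y = rng.choice(neighbours)         # random neighbour among the k nearest
            r = rng.random()                   # rand(0, 1)
            # Step 3: interpolate between x and y (vector form, an assumption).
            synthetic.append([xi + r * (yi - xi) for xi, yi in zip(x, y)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(minority, k=2, n_per_sample=3)
print(len(new_points))   # 12
```

Every synthesized point lies on a segment joining two existing minority samples, so it stays inside the region the minority class already occupies.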

The SMOTE algorithm does not take the surrounding samples into account. If most of the surrounding samples belong to the minority class, the synthesized sample carries little new information; if most of them belong to the majority class, the synthesized sample points may be noise that degrades the classification result.

The Borderline-SMOTE method divides the minority class samples into three categories: Safe (more than half of the surrounding samples belong to the minority class), Danger (more than half of the surrounding samples belong to the majority class; these are the boundary sample points), and Noise (all of the surrounding samples belong to the majority class). It randomly selects samples only from the Danger category and synthesizes new samples from them with the SMOTE method. Compared with SMOTE, Borderline-SMOTE synthesizes minority class samples only near the class boundary, so the synthesized minority sample points are distributed more reasonably and the added minority class samples are more accurate.
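The Safe / Danger / Noise split can be illustrated with the sketch below, which labels each minority sample from the classes of its k nearest neighbours. The function name `classify_minority` and the toy one-dimensional data are hypothetical, and treating an exact tie (half majority, half minority) as Danger is an assumption; with an odd k no tie can occur.

```python
import math


def classify_minority(minority, majority, k=3):
    """Label every minority sample as 'safe', 'danger' or 'noise'."""
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]  # 1 = majority
    labels = []
    for x in minority:
        others = [(p, c) for p, c in labelled if p is not x]
        nearest = sorted(others, key=lambda pc: math.dist(x, pc[0]))[:k]
        m = sum(c for _, c in nearest)          # majority-class neighbours
        if m == len(nearest):
            labels.append("noise")              # surrounded only by majority samples
        elif 2 * m >= len(nearest):
            labels.append("danger")             # boundary point: used for synthesis
        else:
            labels.append("safe")               # mostly minority neighbours
    return labels


minority = [(0.0,), (0.1,), (0.2,), (0.3,), (5.0,), (10.0,)]
majority = [(4.0,), (5.2,), (9.8,), (10.1,), (10.3,), (20.0,)]
print(classify_minority(minority, majority, k=3))
# ['safe', 'safe', 'safe', 'safe', 'danger', 'noise']
```

Only the samples labelled `danger` would then be passed to the SMOTE synthesis step.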

The Borderline-SMOTE oversampling method increases the minority class samples (the attack data) in the network traffic data, so that the positive and negative samples are balanced and the classification effect of the model improves.

As shown in fig. 2, in the above embodiment, preferably, the method for constructing and training the abnormal traffic classifier includes:

The abnormal traffic classifier comprises a sparse autoencoder (SAE) and a Bayesian neural network (BNN). The preprocessed traffic feature data are input to the SAE, whose hidden layer neurons are subject to a sparsity constraint; preferably the SAE comprises two hidden layers. Encoding by the SAE yields reduced-dimension reconstruction features that express the original data efficiently.

Then, the reconstructed feature data output by the SAE encoder are converted into distribution form and used as the input of the BNN. The BNN comprises a forward propagation calculation layer, a one-dimensional fully connected layer, and a softmax function layer. Forward propagation is performed on the input data, and the hidden layer abstracts the features into a representation that separates the different classes better. A standard one-dimensional fully connected (dense) layer then produces the classification prediction scores, and a softmax function classifies the traffic data and outputs the classification result (normal traffic or attack traffic).

In the abnormal traffic classifier, preliminary feature extraction is first performed on the network traffic data; the SAE then encodes the features to further optimize their representation; finally the BNN is trained on the features extracted by the SAE, which effectively improves the generalization capability and recognition effect of the model. The SAE works in a way similar to the human brain, which can complete a given action by stimulating only certain neurons: adding a sparsity constraint to the hidden layer neurons of the encoder suppresses most of their outputs. Feature extraction through the SAE therefore compresses the effective information in the input and extracts the important features. In addition, the weight parameters of the BNN are random variables. Unlike a traditional neural network, which fits label values through a loss function, the BNN fits a posterior distribution and can learn a predictive distribution, which improves the robustness and generalization capability of the network model. Combining the two improves abnormal traffic identification while working with lower-dimensional features, and gives a good identification effect for each type of abnormal traffic.
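Two ideas in this paragraph lend themselves to a short sketch: the sparsity constraint on the SAE hidden units, and prediction with random BNN weights. The code below is only an illustration under stated assumptions, not the patent's network. It assumes the sparsity constraint is the common KL-divergence penalty with target activation `rho`, and that the BNN weights follow independent Gaussians whose means and standard deviations (`w_mean`, `w_std`) are already learned; all names and numbers are hypothetical.

```python
import math
import random


def kl_sparsity_penalty(mean_activations, rho=0.05):
    """Sum of KL(rho || rho_hat_j) over hidden units (an assumed form of the sparsity limit)."""
    total = 0.0
    for rho_hat in mean_activations:
        rho_hat = min(max(rho_hat, 1e-8), 1 - 1e-8)   # keep the logarithms finite
        total += (rho * math.log(rho / rho_hat)
                  + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))
    return total


def softmax(scores):
    m = max(scores)                                   # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def bayesian_dense_predict(features, w_mean, w_std, n_samples=200, rng=None):
    """Average the softmax output over weight samples w ~ N(w_mean, w_std^2)."""
    rng = rng or random.Random(0)
    n_classes = len(w_mean)
    avg = [0.0] * n_classes
    for _ in range(n_samples):
        scores = []
        for c in range(n_classes):
            w = [rng.gauss(mu, sd) for mu, sd in zip(w_mean[c], w_std[c])]
            scores.append(sum(wi * fi for wi, fi in zip(w, features)))
        probs = softmax(scores)
        avg = [a + p / n_samples for a, p in zip(avg, probs)]
    return avg                                        # [P(class 0), P(class 1), ...]


encoded = [1.0, 0.0]                                  # stand-in for SAE-encoded features
probs = bayesian_dense_predict(encoded,
                               w_mean=[[2.0, 0.0], [0.0, 2.0]],
                               w_std=[[0.1, 0.1], [0.1, 0.1]])
print(probs)
print(kl_sparsity_penalty([0.05, 0.5]))               # only the second unit is penalized
```

Averaging over sampled weights is what yields a predictive distribution rather than a single point estimate; the penalty term is zero when every hidden unit's mean activation equals `rho` and grows as units become more active.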

Based on the network model structure of the abnormal traffic classifier, the classifier is trained by using the abnormal traffic feature sample data and the corresponding normal or abnormal classification results as its input and output respectively, until the classification loss value of the abnormal traffic classifier reaches the convergence threshold. To construct a classifier for a given type of abnormal traffic, the label of that attack traffic is set to 1 and all remaining data are labeled 0 and treated as normal data. Each type of abnormal traffic is trained separately in this way, yielding abnormal traffic classifiers that can identify the different types.
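The per-attack-type labelling described above (the target attack is 1, everything else is 0) can be sketched as follows. The function names and the label string `"normal"` are hypothetical; the attack type names are illustrative only.

```python
def one_vs_rest_labels(attack_types, target):
    """Label the target attack type 1 and every other record 0."""
    return [1 if t == target else 0 for t in attack_types]


def build_label_sets(attack_types, normal_label="normal"):
    """One binary label vector per attack type, used to train one classifier each."""
    targets = sorted({t for t in attack_types if t != normal_label})
    return {t: one_vs_rest_labels(attack_types, t) for t in targets}


records = ["normal", "dos", "probe", "dos"]
print(build_label_sets(records))
# {'dos': [0, 1, 0, 1], 'probe': [0, 0, 1, 0]}
```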

The preprocessed traffic feature data are then input into the trained abnormal traffic classifier to obtain a normal or abnormal classification result.

As shown in fig. 3 and 4, when the initial portrait library is constructed, the attack type of each piece of abnormal traffic feature data can be obtained directly from the training data and used as its attack attribute feature. For abnormal traffic feature data to be identified, the classifiers constructed for each attack type are applied in turn, and the attack type so identified is used as the attack attribute feature.

Further preferably, the destination port ratio is calculated by:

$$\text{destination port ratio} = \frac{S}{D}$$

where S is the number of pieces of attack data that share the same destination port under the same attack type, and D is the number of pieces of data of this attack type.

Different hackers usually differ markedly in destination port ratio. For example, when a hacker scans a single port, many records share the same port, so the destination port ratio is large; when multiple ports are scanned, the ratio is smaller. The destination port ratio can therefore effectively distinguish different kinds of hackers.
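A minimal sketch of the destination port ratio follows. It assumes that S is the count of the most frequently targeted destination port among the records of one attack type; the text does not state this explicitly, so this reading, like the function name `destination_port_ratio`, is an assumption.

```python
from collections import Counter


def destination_port_ratio(dest_ports):
    """S / D, with S taken as the count of the most common destination port (assumption)."""
    if not dest_ports:
        raise ValueError("at least one record is required")
    s = Counter(dest_ports).most_common(1)[0][1]   # records sharing the same port
    d = len(dest_ports)                            # all records of this attack type
    return s / d


single_port_scan = [22] * 9 + [80]
multi_port_scan = list(range(1, 11))
print(destination_port_ratio(single_port_scan))   # 0.9
print(destination_port_ratio(multi_port_scan))    # 0.1
```

The two toy inputs mirror the single-port and multi-port scanning cases described in the text.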

Further, the specific calculation process of the connection similarity includes:

The source IP, source port, destination IP, and destination port corresponding to one piece of abnormal traffic feature data are recorded to form the data connection vector of that abnormal traffic feature data, denoted

$$A = (a_1, a_2, a_3, a_4);$$

The data connection vectors of all abnormal network traffic sample data with the same attack attribute feature as the abnormal traffic feature data are obtained, denoted

$$B_i = (b_{i1}, b_{i2}, b_{i3}, b_{i4}), \quad i = 1, \dots, N;$$

The cosine similarity between the data connection vector of the abnormal traffic feature data and the data connection vector of each piece of abnormal traffic feature sample data is calculated according to the following formula, giving the connection similarity sim_i of A and B_i:

$$sim_i = \frac{A \cdot B_i}{\|A\| \, \|B_i\|} = \frac{\sum_{j=1}^{4} a_j b_{ij}}{\sqrt{\sum_{j=1}^{4} a_j^2} \, \sqrt{\sum_{j=1}^{4} b_{ij}^2}}$$

All N cosine similarities are summed and averaged according to the following formula to obtain the total connection similarity sim, from which the similarity feature is obtained:

$$sim = \frac{1}{N} \sum_{i=1}^{N} sim_i.$$
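The connection similarity calculation can be sketched as follows: a cosine similarity against each sample connection vector, then the mean over all N samples. The function names are hypothetical, the vectors are assumed to be already converted to numbers (source IP, source port, destination IP, destination port), and returning 0.0 for a zero-length vector is an assumption.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0                      # undefined case, treated as no similarity
    return dot / (norm_a * norm_b)


def connection_similarity(vector, sample_vectors):
    """Total connection similarity: mean of sim_i over all N sample vectors."""
    sims = [cosine_similarity(vector, s) for s in sample_vectors]
    return sum(sims) / len(sims)


current = [1.0, 0.0, 0.0, 0.0]
samples = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
print(connection_similarity(current, samples))   # 0.5
```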

Similarity analysis of the malicious traffic yields similarity features that can be used to distinguish different types of hackers. Moreover, the same hacker may attack in different time periods, but the attack behaviors and means used are the same, so the similarity feature stays essentially consistent; this feature can therefore also identify the same type of hacker attacking in different time periods.

The traffic feature, time feature, attack attribute feature, and similarity feature calculated for the abnormal traffic feature data in the above embodiment together form the hacker portrait of the hacker who launched that abnormal traffic. The cosine similarity between the portrait of the hacker to be identified and each hacker portrait in the hacker portrait library with the same attack attribute feature is calculated, and the maximum similarity is compared with a threshold D. If the similarity between the hacker to be identified and a hacker portrait in the library is greater than the threshold D, the match succeeds: a hacker with the same identity as the hacker to be identified exists in the hacker portrait library, and the identity of the hacker is thereby determined.

If the hacker cannot be matched with any hacker portrait of that attack type, the hacker is judged to be a new hacker of that attack type; the constructed hacker portrait is added to the hacker portrait library, and the initial hacker portrait library is thereby updated.
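The match-or-register logic of the two preceding paragraphs can be sketched as below. This is an illustration under assumptions: the portrait is represented as a purely numeric vector, the library is a dictionary keyed by attack type, and the identifier scheme for new hackers is invented for the example; the names `match_or_register`, `library`, and `threshold` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def match_or_register(portrait, attack_type, library, threshold):
    """Return (hacker_id, best_similarity, is_new).

    library maps attack type -> list of (hacker_id, portrait_vector).
    """
    candidates = library.setdefault(attack_type, [])
    best_id, best_sim = None, -1.0
    for hacker_id, vector in candidates:          # only portraits of the same attack type
        sim = cosine_similarity(portrait, vector)
        if sim > best_sim:
            best_id, best_sim = hacker_id, sim
    if best_id is not None and best_sim > threshold:
        return best_id, best_sim, False           # maximum similarity exceeds D: known hacker
    new_id = f"{attack_type}-{len(candidates) + 1}"   # hypothetical identifier scheme
    candidates.append((new_id, list(portrait)))       # update the portrait library
    return new_id, best_sim, True


library = {"dos": [("dos-1", [1.0, 0.0, 0.0])]}
print(match_or_register([0.9, 0.1, 0.0], "dos", library, threshold=0.9))
print(match_or_register([0.0, 1.0, 0.0], "dos", library, threshold=0.9))
```

The first call matches the existing portrait `dos-1`; the second finds no portrait above the threshold, so a new entry is added to the library.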

In the above embodiment, preferably, the pre-constructing process of the hacker profile library includes:

classifying the abnormal flow characteristic sample data according to the attack attribute characteristics, extracting the flow characteristics and the time characteristics of each abnormal flow characteristic sample data, and calculating the corresponding similarity characteristics of each abnormal flow characteristic sample data;

taking the traffic feature, time feature, attack attribute feature, and similarity feature corresponding to each piece of abnormal traffic feature sample data as the hacker portrait of the hacker corresponding to that sample data, and adding the hacker portrait to the hacker portrait library. Hacker portraits continue to be built for the next hacker until the hacker portraits of every attack category have been added to the hacker portrait library.

As shown in fig. 5, the present invention further provides a network attacker identification system based on hacker portrayal, which is applied to the network attacker identification method based on hacker portrayal disclosed in any one of the above embodiments, and includes:

the data feature extraction module 1 is used for collecting network traffic data and extracting traffic features and time features of the network traffic data as traffic feature data;

the abnormal data classification module 2 is used for preprocessing the traffic feature data and inputting it into an abnormal traffic classifier constructed and trained based on an SAE-BNN network model, wherein the traffic feature data is encoded by a sparse autoencoder SAE whose hidden layer neurons are subject to a sparsity constraint, the encoded and reconstructed feature data is used as the input of a Bayesian neural network BNN, a classification prediction score is obtained by forward propagation calculation and a one-dimensional fully connected layer, and a normal or abnormal classification result is obtained by softmax function classification;

the data feature calculation module 3 is configured to determine, for the abnormal traffic feature data, an attack attribute feature corresponding to the abnormal traffic feature data, and calculate a similarity of a data connection vector corresponding to the abnormal traffic feature data with respect to a data connection vector of hacker data conforming to the current attack attribute feature, to obtain a similarity feature;

the abnormal data matching module 4 is used for calculating the similarity of the hacker portrait relative to the same attack attribute feature portrait in a pre-constructed hacker portrait library by taking the flow feature, the time feature, the attack attribute feature and the similarity feature corresponding to the abnormal flow feature data as the hacker portrait;

the hacker identification module 5 is used for judging, when the similarity between the hacker portrait and a certain portrait in the library is higher than a preset threshold, that the hacker corresponding to that portrait is the attacker behind the current abnormal traffic feature data; it is also used for adding the hacker portrait to the hacker portrait library when the similarity between the hacker portrait and every portrait in the hacker portrait library is not higher than the preset threshold.

In the above embodiment, preferably, the specific process by which the abnormal data classification module 2 preprocesses the traffic feature data is:

converting the flow characteristic data into numerical characteristic data;

In the above embodiment, preferably, the method for constructing and training the abnormal traffic classifier comprises:

the abnormal flow classifier comprises a sparse self-encoder SAE and a Bayesian neural network BNN, data are input by the sparse self-encoder SAE, the sparse self-encoder SAE comprises a hidden layer neuron added with sparsity limitation, reconstructed characteristic data output by the sparse self-encoder SAE is used as input of the Bayesian neural network BNN, and the Bayesian neural network BNN comprises a forward propagation calculation layer, a one-dimensional full-connection layer and a softmax function layer;

and training the abnormal flow classifier by using the abnormal flow characteristic sample data and the corresponding normal or abnormal classification result as the input and the output of the abnormal flow classifier respectively until the classification result loss value loss of the abnormal flow classifier reaches the convergence threshold value.

In the above embodiment, preferably, the specific process of the data feature calculating module 3 calculating the similarity of the data connection vector of the abnormal traffic feature data with respect to the data connection vector of the hacker data conforming to the current attack attribute feature includes:

and calculating cosine similarity of the corresponding data connection vector of the abnormal flow characteristic data and the data connection vector of each abnormal flow characteristic sample data, summing all the cosine similarity and averaging to obtain total connection similarity, and obtaining the similarity characteristic according to the total connection similarity.

and taking the traffic feature, time feature, attack attribute feature, and similarity feature corresponding to each piece of abnormal traffic feature sample data as the hacker portrait of the hacker corresponding to that sample data, and adding the hacker portrait to the hacker portrait library.

The hacker-portrait-based network attacker identification system disclosed in the above embodiment applies the hacker-portrait-based network attacker identification method disclosed in the above embodiment; in a specific implementation, each module is implemented according to the corresponding steps of the method, which are not described again here.

The above is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes will occur to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A network attacker identification method based on hacker portrayal is characterized by comprising the following steps:

collecting network traffic data, and extracting traffic characteristics and time characteristics of the network traffic data as traffic characteristic data;

preprocessing the traffic feature data and inputting it into an abnormal traffic classifier constructed and trained based on an SAE-BNN network model, encoding the traffic feature data by a sparse autoencoder SAE whose hidden layer neurons are subject to a sparsity constraint, taking the encoded and reconstructed feature data as the input of a Bayesian neural network BNN, obtaining a classification prediction score by forward propagation calculation and a one-dimensional fully connected layer, and classifying the traffic feature data by a softmax function to obtain a normal or abnormal classification result;

if the similarity between the hacker portrait and a certain portrait is higher than a preset threshold, judging that the hacker corresponding to that portrait is the attacking hacker of the current abnormal traffic feature data;

2. The hacker portrayal-based network attacker identification method according to claim 1, wherein the specific process of preprocessing the traffic feature data is as follows:

converting the flow characteristic data into numerical characteristic data;

3. The hacker portrayal-based network attacker identification method of claim 1, wherein the construction and training method of the abnormal traffic classifier comprises:

4. The hacker portrayal-based network attacker identification method of claim 3, wherein the specific process of calculating the similarity of the data connection vector of the abnormal traffic feature data relative to the data connection vector of the hacker data conforming to the current attack attribute features comprises:

5. The hacker portrayal-based network attacker identification method of claim 3, wherein the pre-construction process of the hacker portrayal library comprises:

and the flow characteristic, the time characteristic, the attack attribute characteristic and the similarity characteristic corresponding to each piece of abnormal flow characteristic sample data are used as hacker portraits of hackers corresponding to the current abnormal flow characteristic sample data and are added into the hacker portraits library.

6. A hacker portrayal-based network attacker identification system, which is applied to the hacker portrayal-based network attacker identification method of any one of claims 1 to 5, and comprises:

the abnormal data classification module is used for preprocessing the traffic feature data and inputting it into an abnormal traffic classifier constructed and trained based on an SAE-BNN network model, wherein the traffic feature data is encoded by a sparse autoencoder SAE whose hidden layer neurons are subject to a sparsity constraint, the encoded and reconstructed feature data is used as the input of a Bayesian neural network BNN, a classification prediction score is obtained by forward propagation calculation and a one-dimensional fully connected layer, and a normal or abnormal classification result is obtained by softmax function classification;

the data characteristic calculation module is used for determining the attack attribute characteristics corresponding to the abnormal flow characteristic data aiming at the abnormal flow characteristic data, and calculating the similarity of the data connection vector corresponding to the abnormal flow characteristic data relative to the data connection vector of the hacker data according with the current attack attribute characteristics to obtain the similarity characteristics;

7. The hacker portrayal-based network attacker identification system of claim 6, wherein the abnormal data classification module is used for preprocessing the traffic characteristic data according to a specific process that:

converting the flow characteristic data into numerical characteristic data;

8. The hacker portrayal-based network attacker identification system of claim 6, wherein the construction and training method of the abnormal traffic classifier comprises:

9. The hacker portrayal-based network attacker identification system of claim 8, wherein the specific process of the data feature calculation module calculating the similarity of the data connection vector of the abnormal traffic feature data relative to the data connection vector of the hacker data conforming to the current attack attribute feature comprises:

and calculating cosine similarity of a corresponding data connection vector of the abnormal flow characteristic data and a data connection vector of each abnormal flow characteristic sample data, summing and averaging all the cosine similarity, calculating total connection similarity, and obtaining the similarity characteristic according to the total connection similarity.

10. The hacker portrayal-based network attacker identification system of claim 8, wherein the pre-construction process of the hacker portrayal library comprises:

classifying the abnormal flow characteristic sample data according to attack attribute characteristics, extracting the flow characteristics and time characteristics of each piece of abnormal flow characteristic sample data, and calculating the corresponding similarity characteristics of each piece of abnormal flow characteristic sample data;