CN112994966A - General network flow identification method based on deep learning - Google Patents

General network flow identification method based on deep learning

Info

Publication number
CN112994966A
Authority
CN
China
Prior art keywords
network
flow
data
identification
deep neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911298083.6A
Other languages
Chinese (zh)
Inventor
邹智超
张舜卿
徐树公
曹姗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN201911298083.6A
Publication of CN112994966A

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Environmental & Geological Engineering (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

A general network traffic identification method based on deep learning. While a user equipment establishes a connection and communicates with a remote server, the user equipment accesses the network through a wired switch or a wireless base station, and its traffic data is forwarded through a gateway. The traffic data is captured at the gateway and divided into flows according to its network-layer information; flow-level features are extracted from the headers of the transport-layer and application-layer information and used as the input of a deep neural network, which determines the service type to be identified. The invention achieves higher identification accuracy across a variety of traffic identification tasks in wired and wireless network environments, while reducing the complexity of practical deployment.

Description

General network flow identification method based on deep learning
Technical Field
The invention relates to a technology in the field of communications, in particular to a general network traffic identification method based on deep learning, applicable to scenarios such as video service type identification and web traffic analysis.
Background
Network traffic identification means capturing packet information from network traffic, extracting and analyzing features that help distinguish service types, and then using those features to predict the service types present in the traffic. The conventional approach determines the service type manually or by matching the transport ports used by the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP). However, as network services evolve, new services increasingly communicate over the dynamic (private) ports defined by the Internet Assigned Numbers Authority (IANA) rather than over well-known or registered ports. In addition, identifying only the service protocol reveals neither the service provider nor the specific service attributes, so the value of such identification is very limited.
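The limitation of port matching described above can be illustrated with a minimal sketch. The port ranges are the standard IANA ranges; the `KNOWN_SERVICES` table and both function names are hypothetical and chosen only for illustration, not taken from the patent.

```python
# Standard IANA port ranges: 0-1023 well-known, 1024-49151 registered,
# 49152-65535 dynamic (private).
def port_category(port: int) -> str:
    if not 0 <= port <= 65535:
        raise ValueError(f"invalid port: {port}")
    if port <= 1023:
        return "well-known"
    if port <= 49151:
        return "registered"
    return "dynamic"

# Hypothetical lookup table standing in for a conventional port-matching rule set.
KNOWN_SERVICES = {53: "DNS", 80: "HTTP", 443: "HTTPS"}

def identify_by_port(port: int) -> str:
    # Port matching only works when the service uses a port listed in the table.
    return KNOWN_SERVICES.get(port, "unknown")
```

A service that communicates over a dynamic port, for example port 50000, falls outside any such table and is reported as "unknown", which is the gap the feature-based method is meant to close.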
In the prior art, machine-learning classifiers identify network services from collected packet features. Training such a classifier requires capturing a large amount of feature information, and selecting which features to feed to the classifier remains an open problem. As a result, these techniques can generally identify only a particular class of network services, and each new identification requirement calls for a fresh attempt at combining features.
Disclosure of Invention
To address the above shortcomings of the prior art, the invention provides a general network traffic identification method based on deep learning, which collects and processes traffic data at a gateway and identifies the network services in that traffic. Key traffic features are extracted by analyzing the characteristics of the traffic data, and the recognition model is built with deep learning, so that the method achieves higher recognition accuracy across a variety of traffic identification tasks in wired and wireless network environments while reducing the complexity of practical deployment.
The invention is realized by the following technical scheme:
The invention relates to a general network traffic identification method based on deep learning. While a user equipment establishes a connection and communicates with a remote server, the user equipment accesses the network through a wired switch or a wireless base station, and its traffic data is forwarded through a gateway. The traffic data is captured at the gateway and divided into flows according to the network-layer information; flow-level features are extracted from the header information of the transport-layer and application-layer information and used as the input of a deep neural network, which determines the service type to be identified.
The invention also relates to a network traffic type identification system for implementing the method, comprising a data capture module, a data preprocessing module, a feature selection module and an identification module, wherein: the data capture module captures traffic data from a specific network card; the data preprocessing module extracts flow-level features from the captured traffic data and digitizes them; and the feature selection module determines the feature input of the classifier according to the likelihood functions of the label-value features and numerical features obtained from the captured traffic data samples.
Technical effects
Compared with the prior art, the method selects a small number of flow-level features, chosen per task type, for each network traffic identification task, thereby identifying the traffic type during data transmission. By using both transport-layer and application-layer information within a common network traffic identification framework, different traffic types, including Voice over Internet Protocol (VoIP) and video streaming, can be distinguished. Because the framework is built on a deep neural network (DNN) architecture with strong generalization capability, it can be extended directly to other classification tasks. According to measurements, the method improves recognition accuracy by more than 30%. The invention extracts the useful information from packet headers; because the selected information consists of flow-level features, the complexity of the recognition algorithm is greatly reduced and the recognition time is cut by more than 20%.
Drawings
FIG. 1 is a schematic diagram of deep learning-based network service identification according to the present invention;
FIG. 2 is a schematic diagram of the label-value features and numerical features of the present invention;
FIG. 3 is a schematic diagram of the network traffic identification framework;
FIG. 4 is a schematic diagram of the Android emulator and the Wireshark traffic monitor in the embodiment;
FIG. 5 is a diagram comparing accuracy in the embodiment;
FIG. 6 is a schematic diagram of recognition results under different feature configurations in the embodiment.
Detailed Description
As shown in FIG. 1, in the general network traffic identification method based on deep learning of this embodiment, after a network service is initiated, the specific service type is detected and identified by capturing traffic data, specifically: while a user equipment establishes a connection and communicates with a remote server, the user equipment accesses the network through a wired switch or a wireless base station, and its traffic data is forwarded through a gateway. The traffic data is captured at the gateway and divided into flows according to its network-layer information. A training set is extracted from the header information of the transport-layer and application-layer information and is used to train the deep learning network and to determine the feature input for each recognition task. The corresponding features are then extracted for each recognition task, and the recognition result is obtained from the trained deep learning network.
In the flow division, the flow-based label-value features and numerical features of the header information, which is extracted from the packets captured at the network card, are analyzed by means of likelihood functions.
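The flow division step can be sketched as follows. The text only states that flows are divided according to network-layer information, so the flow key used here (the unordered pair of IP addresses plus the protocol), the packet dictionary fields, and the two flow-level features are assumptions made for illustration.

```python
from collections import defaultdict

def divide_into_flows(packets):
    """Group packet records into flows keyed by network-layer fields (assumed key)."""
    flows = defaultdict(list)
    for pkt in packets:
        # Sort the endpoints so both directions of a conversation share one flow.
        endpoint_a, endpoint_b = sorted([pkt["src_ip"], pkt["dst_ip"]])
        flows[(endpoint_a, endpoint_b, pkt["protocol"])].append(pkt)
    return dict(flows)

def flow_level_features(flow):
    """Aggregate per-packet values into flow-level features (illustrative choice)."""
    sizes = [pkt["size"] for pkt in flow]
    return {"packet_count": len(sizes), "total_bytes": sum(sizes)}

# Hypothetical captured packets.
packets = [
    {"src_ip": "10.0.0.2", "dst_ip": "93.184.216.34", "protocol": "TCP", "size": 60},
    {"src_ip": "93.184.216.34", "dst_ip": "10.0.0.2", "protocol": "TCP", "size": 1500},
    {"src_ip": "10.0.0.2", "dst_ip": "8.8.8.8", "protocol": "UDP", "size": 80},
]
flows = divide_into_flows(packets)
```

Here the two TCP packets travelling in opposite directions end up in the same flow, and the UDP packet forms a second flow; a real deployment might key flows differently.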
The label-value features are features that have no numerical meaning and need to be converted into numerical form, i.e. data traffic features such as protocol version, network port number and transmission direction, as well as features that collect the different values of specific attributes, such as the total size of the transmitted packets. As shown in FIG. 2, specifically: the k-th feature f_k (equation image BDA0002321111470000021, not reproduced here) takes M_k different values, where (equation image BDA0002321111470000022, not reproduced here) is the set of all label-value features and (equation image BDA0002321111470000023, not reproduced here) is the probability distribution function of the M_k values in the i-th data sample. The likelihood function α(f_k) of the k-th feature f_k is given by (equation image BDA0002321111470000031, not reproduced here), where M and N denote the total numbers of the two types of features, i and j index different data samples, the likelihood function of the k-th label-value feature f_k between the i-th and j-th data samples is given by (equation image BDA0002321111470000032, not reproduced here), and N_s is the total number of data samples.
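The likelihood formulas themselves are contained in equation images that are not available in this text, so they are not reconstructed here. The sketch below only illustrates the one ingredient the text names explicitly, the probability distribution of a feature's distinct values within one data sample, estimated empirically; the function name and the example values are hypothetical.

```python
from collections import Counter

def value_distribution(values):
    """Empirical probability of each distinct value of one feature in one sample."""
    total = len(values)
    if total == 0:
        return {}
    counts = Counter(values)
    return {value: count / total for value, count in counts.items()}

# Hypothetical observations of a protocol feature in one data sample.
protocol_values = ["TCP", "TCP", "UDP", "TCP"]
distribution = value_distribution(protocol_values)
```

In this example the feature takes two distinct values, with "TCP" observed in three of four packets and "UDP" in one of four.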
The numerical features are specifically as follows: g_k (equation image BDA0002321111470000033, not reproduced here) is the k-th feature of the set of all numerical features (equation image BDA0002321111470000034, not reproduced here). When g_k has M_k possible values, its distribution over the i-th sample is (equation image BDA0002321111470000035, not reproduced here), and its corresponding likelihood function β(g_k) is (equation image BDA0002321111470000036, not reproduced here), where the likelihood function of g_k on the i-th data sample is (equation image BDA0002321111470000037, not reproduced here).
As shown in FIG. 3, the network traffic type identification system implementing the above method in this embodiment comprises a data capture module, a data preprocessing module, a feature selection module and an identification module, wherein: the data capture module captures traffic data from a specific network card; the data preprocessing module extracts flow-level features from the captured traffic data and digitizes them, i.e. converts non-numeric information (such as version information and protocol name information) into numeric information; and the feature selection module determines the feature input of the classifier according to the likelihood functions of the label-value features and numerical features obtained from the captured traffic data samples.
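The digitization performed by the data preprocessing module can be sketched as a simple integer encoding of non-numeric fields. The patent does not specify the encoding scheme, so the first-seen integer index used here, the record fields, and the function names are assumptions for illustration.

```python
def build_encoder(values):
    """Map each distinct value to an integer index in first-seen order."""
    mapping = {}
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
    return mapping

def digitize(flow_records, categorical_fields):
    """Replace non-numeric fields with integer codes; numeric fields pass through."""
    encoders = {
        field: build_encoder(record[field] for record in flow_records)
        for field in categorical_fields
    }
    rows = [
        [encoders[key][value] if key in encoders else value
         for key, value in record.items()]
        for record in flow_records
    ]
    return rows, encoders

# Hypothetical flow-level records with version and protocol name information.
records = [
    {"protocol": "TCP", "version": "TLSv1.2", "total_bytes": 1560},
    {"protocol": "UDP", "version": "QUIC", "total_bytes": 80},
    {"protocol": "TCP", "version": "TLSv1.3", "total_bytes": 420},
]
rows, encoders = digitize(records, ["protocol", "version"])
```

The resulting rows contain only numbers and can be passed on to feature selection; the returned encoders allow the same mapping to be reused on later samples.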
The identification module comprises an input controller and a deep neural network, and adjusts its inputs to accommodate different network recognition tasks, wherein: the input controller generates the training set used to train the deep neural network and receives online recognition tasks; the deep neural network produces the corresponding recognition result either from the training set or from the likelihood functions and network service obtained for the online recognition task.
The training set is generated as follows: the labeled data sets corresponding to the different network services {S_i} are collected to form an offline training set, and the deep neural network is trained by offline pre-training. In the offline training phase, for a given recognition task, e.g. recognizing whether network service S_i is present in the captured traffic, the input controller takes as the identification result (equation image BDA0002321111470000038, not reproduced here), where B(S_i) is a binary value: "1" indicates that service S_i is present in the captured traffic and "0" that it is absent.
For an online identification task, the input controller computes the likelihood functions {α(f_k)} and {β(g_k)} and passes them, together with the network service S_i, to the deep neural network to obtain the mapping result B(S_i).
The deep neural network comprises an input layer, three hidden layers and an output layer, wherein: the hidden layers use the ReLU activation function and have sizes 128, 64 and 32 in order, and the output layer has size 2.
The initial parameters of the deep neural network are preferably weighted by the likelihood functions {α(f_k)} and {β(g_k)}.
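The layer structure described above (three ReLU hidden layers of sizes 128, 64 and 32 and an output layer of size 2) can be sketched as a plain forward pass. The input size of 8, the random Gaussian initialization, and the rule that output index 1 means "service present" are assumptions for illustration; the likelihood-weighted initialization preferred in the text is not reproduced because its formula is not available.

```python
import random

HIDDEN_SIZES = [128, 64, 32]  # hidden layer sizes stated in the text
OUTPUT_SIZE = 2               # output layer size stated in the text

def init_params(input_size, seed=0):
    """Random initialization (an assumption, not the likelihood-weighted scheme)."""
    rng = random.Random(seed)
    sizes = [input_size] + HIDDEN_SIZES + [OUTPUT_SIZE]
    params = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        scale = (2.0 / n_in) ** 0.5
        weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        params.append((weights, [0.0] * n_out))
    return params

def dense(weights, biases, inputs):
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def relu(values):
    return [max(0.0, v) for v in values]

def forward(params, features):
    """ReLU on the three hidden layers, linear output layer of size 2."""
    hidden = list(features)
    for weights, biases in params[:-1]:
        hidden = relu(dense(weights, biases, hidden))
    weights, biases = params[-1]
    return dense(weights, biases, hidden)

def predict_present(params, features):
    """Map the two outputs to a binary B(S_i): 1 if index 1 scores higher (assumed)."""
    scores = forward(params, features)
    return 1 if scores[1] > scores[0] else 0

params = init_params(input_size=8)     # 8 input features is a hypothetical choice
scores = forward(params, [0.1] * 8)
```

The network here is untrained, so its outputs carry no meaning; the sketch only shows the shape of the computation that offline pre-training would then fit to the labeled data.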
As shown in FIG. 4, this embodiment is implemented in the following environment: network traffic of a mobile operating system is simulated on the Windows 10 operating system. An Android emulator generates the data traffic, and the Wireshark traffic monitor captures the network traffic from the host's network card. To compare the performance of the present method with conventional methods, this embodiment selects two common classifiers, J48 and Bayesian network, as the baseline.
Step 1) Parameter configuration: the header information features selected in data preprocessing are shown in Table 1; in the feature selection process, a given traffic data sample is selected and used to calculate {α(f_k)} and {β(g_k)}, with the results shown in Table 2.
TABLE 1
(table image BDA0002321111470000041, not reproduced here)
TABLE 2
(table image BDA0002321111470000042, not reproduced here)
Step 2) During identification with the deep learning network, different feature configurations are selected for comparison according to the values of {α(f_k)} and {β(g_k)}, as shown in Table 3.
TABLE 3
(table image BDA0002321111470000051, not reproduced here)
As shown in FIG. 5, the accuracy of the present method and of the conventional methods is evaluated with the F-Measure metric. The present method outperforms the conventional J48 and Bayesian network classifiers: overall accuracy improves by 30%, the F-Measure exceeds 98%, and the recognition accuracy fluctuates little across different traffic identification tasks. Further comparing recognition accuracy under different feature configurations shows that features with the smallest α(f_k) or β(g_k) values have little influence on recognition accuracy, whereas deleting the features with the largest α(f_k) or β(g_k) values changes the recognition accuracy significantly.
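For reference, the F-Measure used in this evaluation is commonly computed as the harmonic mean of precision and recall. The sketch below uses that standard definition for a binary task; the function name and the example labels are hypothetical and are not data from the embodiment.

```python
def f_measure(true_labels, predicted_labels, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(1 for t, p in pairs if t == positive and p == positive)
    false_pos = sum(1 for t, p in pairs if t != positive and p == positive)
    false_neg = sum(1 for t, p in pairs if t == positive and p != positive)
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 1 means the service is present in the captured traffic.
true_labels = [1, 1, 1, 0, 0]
predicted_labels = [1, 1, 0, 1, 0]
score = f_measure(true_labels, predicted_labels)
```

With two true positives, one false positive and one false negative, precision and recall are both 2/3, so the F-Measure is also 2/3.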
FIG. 6 and Table 4 compare the total time consumption of this embodiment, i.e. the time spent on modeling and on identifying services, with that of the conventional methods: the present method is significantly better than J48 for all services, and its delay is more than 20% lower than that of the Bayesian network. This lower identification delay allows the proposed method to complete traffic identification for real-time network services, to schedule the traffic carried in the network without affecting communication performance, and to cut off dangerous service traffic in time to keep the network secure.
TABLE 4
(table image BDA0002321111470000052, not reproduced here)
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims and all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (9)

1. A general network traffic identification method based on deep learning, characterized in that, while a user equipment establishes a connection and communicates with a remote server, the user equipment accesses a network through a wired switch or a wireless base station and traffic data is forwarded through a gateway; the traffic data is captured at the gateway and divided into flows according to the network-layer information in the traffic data; and flow-level features are extracted from the header information of the transport-layer information and application-layer information in the traffic data and used as the input of a deep neural network, thereby determining the service type to be identified;
the flow division analyzes the flow-based label-value features and numerical features by means of likelihood functions, wherein: a label-value feature is a feature that has no numerical meaning and needs to be converted into numerical form; a numerical feature consists of the different values of a specific attribute.
2. The method as claimed in claim 1, wherein the label-value features, i.e. data traffic features, include protocol version, network port number, transmission direction and total size of the transmitted packets.
3. The method as claimed in claim 2, wherein the total size of the transmitted packets is specified as follows: the k-th feature f_k (equation image FDA0002321111460000011, not reproduced here) takes M_k different values, where (equation image FDA0002321111460000012, not reproduced here) is the set of all label-value features and (equation image FDA0002321111460000013, not reproduced here) is the probability distribution function of the M_k values in the i-th data sample; the likelihood function α(f_k) of the k-th feature f_k is given by (equation image FDA0002321111460000014, not reproduced here), where M and N denote the total numbers of the two types of features, i and j index different data samples, the likelihood function of the k-th label-value feature f_k between the i-th and j-th data samples is given by (equation image FDA0002321111460000015, not reproduced here), and N_s is the total number of data samples.
4. The deep learning-based general network traffic identification method according to claim 1, wherein the numerical features are specifically: g_k (equation image FDA0002321111460000016, not reproduced here) is the k-th feature of the set of all numerical features (equation image FDA0002321111460000017, not reproduced here); when g_k has M_k possible values, its distribution over the i-th sample is (equation image FDA0002321111460000018, not reproduced here), and its corresponding likelihood function β(g_k) is (equation image FDA0002321111460000019, not reproduced here), where the likelihood function of g_k on the i-th data sample is (equation image FDA00023211114600000110, not reproduced here).
5. A network traffic type identification system for implementing the method of any preceding claim, comprising a data capture module, a data preprocessing module, a feature selection module and an identification module, wherein: the data capture module captures traffic data from a specific network card; the data preprocessing module extracts flow-level features from the captured traffic data and digitizes them; and the feature selection module determines the feature input of the classifier according to the likelihood functions of the label-value features and numerical features obtained from the captured traffic data samples.
6. The system of claim 5, wherein the identification module comprises an input controller and a deep neural network, and adjusts its inputs to accommodate different network recognition tasks, wherein: the input controller generates the training set used to train the deep neural network and receives online recognition tasks; the deep neural network produces the corresponding recognition result either from the training set or from the likelihood functions and network service obtained for the online recognition task.
7. The network traffic type recognition system of claim 5, wherein the training set is generated as follows: the labeled data sets corresponding to the different network services {S_i} are collected to form an offline training set, and the deep neural network is trained by offline pre-training; in the offline training phase, for a given recognition task, i.e. recognizing whether network service S_i is present in the captured traffic, the input controller takes as the identification result (equation image FDA0002321111460000021, not reproduced here), where B(S_i) is a binary value: "1" indicates that service S_i is present in the captured traffic and "0" that it is absent.
8. The system of claim 5, wherein, for an online identification task, the input controller computes the likelihood functions {α(f_k)} and {β(g_k)} and passes them, together with the network service S_i, to the deep neural network to obtain the mapping result B(S_i).
9. The network traffic type recognition system of claim 5, wherein the deep neural network comprises an input layer, three hidden layers and an output layer, wherein: the hidden layers use the ReLU activation function and have sizes 128, 64 and 32 in order, and the output layer has size 2; the initial parameters of the deep neural network are weighted by the likelihood functions {α(f_k)} and {β(g_k)}.
CN201911298083.6A 2019-12-17 2019-12-17 General network flow identification method based on deep learning Pending CN112994966A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201911298083.6A | 2019-12-17 | 2019-12-17 | General network flow identification method based on deep learning

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201911298083.6A | 2019-12-17 | 2019-12-17 | General network flow identification method based on deep learning

Publications (1)

Publication Number | Publication Date
CN112994966A | 2021-06-18

Family

ID=76341798

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911298083.6A Pending CN112994966A (en) 2019-12-17 2019-12-17 General network flow identification method based on deep learning

Country Status (1)

Country Link
CN (1) CN112994966A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20170317894A1 (en) * | 2016-05-02 | 2017-11-02 | Huawei Technologies Co., Ltd. | Method and apparatus for communication network quality of service capability exposure
US20180212992A1 (en) * | 2017-01-24 | 2018-07-26 | Cisco Technology, Inc. | Service usage model for traffic analysis
CN108900432A (en) * | 2018-07-05 | 2018-11-27 | 中山大学 | Content awareness method based on network flow behavior
CN110247930A (en) * | 2019-07-01 | 2019-09-17 | 北京理工大学 | Encrypted network traffic identification method based on deep neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
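张路煜 et al.: "Unknown protocol identification method based on convolutional neural networks", 《微电子学与计算机》 *
邹智超 et al.: "A Real-time Network Traffic Identifier for Open 5G/B5G Networks via Prototype Analysis", 2019 IEEE GLOBECOM WORKSHOPS *
陈雪娇 et al.: "Key technologies for identifying SSL-encrypted application flows under class-imbalanced network application flows", 《电信科学》 *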

Similar Documents

Publication Publication Date Title
Shapira et al. FlowPic: A generic representation for encrypted traffic classification and applications identification
Shapira et al. Flowpic: Encrypted internet traffic classification is as easy as image recognition
CN113179223B (en) Network application identification method and system based on deep learning and serialization features
Zhang et al. Autonomous unknown-application filtering and labeling for dl-based traffic classifier update
CN110311829B (en) Network traffic classification method based on machine learning acceleration
CN112564974B (en) Deep learning-based fingerprint identification method for Internet of things equipment
CN110247930B (en) Encrypted network flow identification method based on deep neural network
CN105871832B (en) Method and device for identifying encrypted network application traffic based on protocol attributes
CN110391958B (en) Method for automatically extracting and identifying characteristics of network encrypted flow
WO2020119481A1 (en) Network traffic classification method and system based on deep learning, and electronic device
WO2020062390A1 (en) Network traffic classification method and system, and electronic device
CN112163594A (en) Network encryption traffic identification method and device
CN111953669B (en) Tor flow tracing and application type identification method and system suitable for SDN
US20230119593A1 (en) Method and apparatus for training facial feature extraction model, method and apparatus for extracting facial features, device, and storage medium
CN111860628A (en) Deep learning-based traffic identification and feature extraction method
CN109525508B (en) Encrypted stream identification method and device based on flow similarity comparison and storage medium
Song et al. Encrypted traffic classification based on text convolution neural networks
CN111147394B (en) Multi-stage classification detection method for remote desktop protocol traffic behavior
CN110460502B (en) Application program flow identification method under VPN based on distributed feature random forest
CN111385297A (en) Wireless device fingerprint identification method, system, device and readable storage medium
CN111611280A (en) Encrypted traffic identification method based on CNN and SAE
CN109299742A (en) Method, apparatus, device and storage medium for automatically discovering unknown network flows
CN109151880A (en) Mobile application flow identification method based on multilayer classifier
CN112367274A (en) Industrial control unknown protocol flow identification method
CN109660656A (en) Method for identifying application programs on intelligent terminals

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210618