Software mirror image flow identification and classification method based on the PCA algorithm
Technical Field
The invention belongs to the technical field of software traffic identification, and particularly relates to a software mirror image flow identification and classification method based on the PCA algorithm.
Background
Network assets are the devices used in a computer (or communication) network, mainly hosts, network devices (routers, switches, etc.) and security devices (firewalls, etc.); the value of a network is proportional to the square of the number of its users. Network assets are deployed with great freedom: the applications installed on them vary widely and are hard to manage, and although individual assets run their own software management tools, few tools manage asset software across the whole network.
In recent years network technology has developed rapidly, a wide variety of application software has appeared, and the combinations of software installed and deployed on network assets have diversified accordingly. However, because functional requirements differ, software products vary in quality. As large amounts of varied software accumulate on network assets and are interconnected through the network, the number of vulnerabilities in the network grows, and these vulnerabilities give malicious actors a back door through which network assets can be used to threaten the information security of individuals and enterprises, and even national network security.
However, prior-art software management tools mainly target a single software application or a specific type of application, and cannot determine from network data which software applications are deployed on all the assets attached to the network. Consequently, when a software application deployed on one asset is found to have a vulnerability, the existing tool can only notify that asset's manager to repair it; whether software deployed on other assets in the network has the same vulnerability remains unknown. Prior-art network asset management therefore lacks analysis and collection of network-wide deployment data: software is managed only per asset, and the manager neither knows what software is actually deployed on each asset nor can find and repair vulnerabilities in time.
Disclosure of Invention
The invention aims to provide a software classification method that uses mirror image flow and machine learning to classify captured mirror image flow according to the IP and domain name behaviors that software generates spontaneously while in use.
The technical scheme of the invention provides a software mirror image flow identification and classification method based on the PCA algorithm, which comprises a model library generation step, a test library generation step and a classification identification step;
the model library generation step is to collect and install installation packages of a plurality of different types of application software, capture and parse the traffic data generated during installation of the application software, collect domain name and IP data, generate a corresponding training set labeled with software names and software classes, and train on the training set with the PCA algorithm to obtain a feature matrix for each software class, thereby forming a software classification model;
the test library generation step is to capture and parse the mirror image flow data of each asset in the network, screen out the source IPs that meet the software classification conditions together with their IP sessions, and then form a test library by taking the payload groups of application-layer payload byte data of those IP sessions as a test set;
and the classification identification step is to compare the test set from the test library generation step against the software classification model from the model library generation step, and output the class of the software.
Specifically, the model library generation step comprises the processes of application software collection, software traffic collection, software-related domain name collection, software-related IP collection, training set generation and training model generation;
the application software collection is to collect, via the Internet, installation packages of a plurality of types of application software, including communication software, transmission software, office software and multimedia software;
the software traffic collection is to capture the IP session traffic that each collected application software spontaneously initiates outward during installation, use, update and similar operations; each session is captured in full for payload acquisition, because the client and the server exchange a large amount of host information during the handshake, and encrypted sessions additionally exchange digital certificates.
The software-related domain name collection is to parse the DNS protocol traffic spontaneously generated by each application software and extract the domain names and/or CNAME domain names used between the software and its servers; the software mainly uses these domain names for information collection such as uploading asset terminal information, synchronization, software updating and collecting user operations;
the software-related IP collection is to parse the response data of DNS A and/or AAAA record queries in the DNS protocol traffic spontaneously generated by each application software and extract the resolved IPs of the software-related domain names, or to obtain the latest resolved IPs of those domain names through the Internet, such as using *** public DNS;
the training set generation is to collect the application-layer payload byte data of the IP sessions spontaneously generated by each application software, label each application-layer payload with its software name and software class, and use the resulting payload group as the training set;
the training model generation is to take the training set as training samples and train on them with the PCA algorithm to obtain a feature matrix for each software class. Because PCA reduces dimensionality while retaining the features that carry the most software-class information, it lowers training complexity and speeds up both training and recognition; and unlike traditional methods that rely on fixed features, the method uses session content as the training content, so that feature acquisition, training and recognition can be completed automatically.
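For reference, the standard PCA computation that could underlie such a feature matrix is written out below. The symbols ($x_i$, $n_c$, $\mu_c$, $C_c$, $W_c$, $k$) are notation introduced here for illustration and do not appear in the original text; fitting one projection matrix per software class $c$ is only one reading of "a feature matrix for each software class".

```latex
\begin{aligned}
\mu_c &= \frac{1}{n_c}\sum_{i=1}^{n_c} x_i, \qquad x_i \in \mathbb{R}^{128},\\
C_c &= \frac{1}{n_c - 1}\sum_{i=1}^{n_c} (x_i - \mu_c)(x_i - \mu_c)^{\top},\\
C_c\, w_j &= \lambda_j w_j, \qquad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_{128} \ge 0,\\
W_c &= [\, w_1 \;\; w_2 \;\; \cdots \;\; w_k \,] \in \mathbb{R}^{128 \times k},\\
y &= W_c^{\top}(x - \mu_c) \in \mathbb{R}^{k}.
\end{aligned}
```

Here $k < 128$ is the number of retained components, and the fraction of variance kept is $\sum_{j \le k} \lambda_j / \sum_{j} \lambda_j$. Working with the $k$-dimensional vector $y$ instead of the 128-byte payload vector $x$ is what reduces the training and recognition cost.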
The IP session traffic comprises IP sessions of the DNS, HTTP and HTTPS protocols.
The application-layer payload byte data is not less than 128 bytes; if it is less than 128 bytes, it is padded with zeros;
the software classes comprise communication software, transmission software, office software, multimedia software, development software, security software, mail software, industry software, game software and mobile phone application software.
Further, the test library generation step comprises the processes of mirror image flow extraction and analysis, source IP data extraction and test set generation;
the mirror image flow extraction and analysis is to mirror the traffic data of the assets in the network, parse the captured mirror image flow, extract the source IPs from its DNS protocol traffic, and screen out the source IPs that meet the software classification conditions together with their IP sessions;
the source IP data extraction is to extract the IP sessions of each such source IP;
the test set generation is to extract the payload group of application-layer payload byte data of each IP session of the source IP to generate a test set;
a source IP and its IP sessions meet the software classification conditions if: a DNS A and/or AAAA record query session in the DNS protocol traffic contains a software-related domain name, or a DNS CNAME query contains a software-related domain name, or a DNS A and/or AAAA record response session contains a software-related IP.
The IP sessions comprise the HTTP and HTTPS application protocol session traffic spontaneously generated by software; the spontaneous software behaviors include visiting the official website by default, updating, reporting error logs, reporting statistical logs, operation backup and uploading configuration. Most of these operations are carried out over HTTP or HTTPS.
The application-layer payload byte data is not less than 128 bytes; if it is less than 128 bytes, it is padded with zeros.
Compared with the prior art, the technical scheme of the invention collects mirror image flow at the aggregation and core switches, parses the domain name and IP requests in the payloads of each asset's traffic, derives characteristic behavior time-sequence features, and finally identifies the names of the local software generating network behavior on each asset in real-time traffic by means of real-time and offline DNS data and the behavior characteristics of a large amount of local software. The invention thus discovers the software classes that may be present in the mirror image flow by constructing software-related session features, and helps the user understand which types of software are deployed on each asset.
Drawings
The foregoing and following detailed description of the invention will be apparent when read in conjunction with the following drawings, in which:
FIG. 1 is a logical schematic of a basic scheme of the present invention.
Detailed Description
The technical solutions for achieving the objects of the present invention are further illustrated by the following specific examples, and it should be noted that the technical solutions claimed in the present invention include, but are not limited to, the following examples.
Example 1
As the most basic implementation of the present invention, as shown in FIG. 1, the software mirror image flow identification and classification method based on the PCA algorithm disclosed in this embodiment comprises a model library generation step, a test library generation step and a classification identification step.
In the model library generation step, installation packages of a plurality of different types of application software are collected and installed, the traffic data generated during installation of the application software is captured and parsed, domain name and IP data are collected, a corresponding training set labeled with software names and software classes is generated, and the training set is then trained on with the PCA algorithm to obtain a feature matrix for each software class, forming a software classification model.
In the test library generation step, the mirror image flow data of each asset in the network is captured and parsed, the source IPs that meet the software classification conditions are screened out together with their IP sessions, and the payload groups of application-layer payload byte data of those IP sessions are then taken as a test set to form the test library.
In the classification identification step, the test set from the test library generation step is compared against the software classification model from the model library generation step, and the class of the software is output.
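The text does not say how the test set is compared with the model. One common way to classify with per-class PCA models is to project a test vector onto each class's subspace and pick the class with the smallest reconstruction error. The sketch below illustrates that idea under this assumption; the model layout (`class -> (mean, components)`), the function names and the toy 3-dimensional data are all hypothetical.

```python
import numpy as np

def residual(x, mean, components):
    """Distance between x and its projection onto one class's PCA subspace."""
    centered = x - mean
    projected = components.T @ (components @ centered)
    return float(np.linalg.norm(centered - projected))

def classify(x, model):
    """Return the class whose subspace reconstructs x with the smallest error."""
    return min(model, key=lambda c: residual(x, *model[c]))

# Toy 3-dimensional model (assumption); real payload vectors would have 128 dimensions.
# Each entry is (class mean, matrix whose rows are orthonormal principal directions).
model = {
    "office": (np.array([0.0, 0.0, 0.0]), np.array([[1.0, 0.0, 0.0]])),
    "game":   (np.array([0.0, 0.0, 0.0]), np.array([[0.0, 1.0, 0.0]])),
}
label = classify(np.array([5.0, 0.5, 0.0]), model)
```

In this toy case the test vector lies almost entirely along the "office" direction, so its residual against the "office" subspace is much smaller than against "game". Other decision rules, such as a nearest-neighbor search in the projected space, would fit the description equally well.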
The method collects mirror image flow at the aggregation and core switches, parses the domain name and IP requests in the payloads of each asset's traffic, derives characteristic behavior time-sequence features, and finally identifies the names of the local software generating network behavior on each asset in real-time traffic by means of real-time and offline DNS data and the behavior characteristics of a large amount of local software. The invention thus discovers the software classes that may be present in the mirror image flow by constructing software-related session features, and helps the user understand which types of software are deployed on each asset.
Example 2
As a preferred implementation of the present invention, on the basis of Example 1, the model library generation step further comprises the processes of application software collection, software traffic collection, software-related domain name collection, software-related IP collection, training set generation and training model generation.
The application software collection is to collect, via the Internet, installation packages of a plurality of types of application software, including communication software, transmission software, office software and multimedia software;
the software traffic collection is to capture the IP session traffic that each collected application software spontaneously initiates outward during installation, use, update and similar operations, the IP session traffic comprising IP sessions of the DNS, HTTP and HTTPS protocols; each session is captured in full for payload acquisition, because the client and the server exchange a large amount of host information during the handshake, and encrypted sessions additionally exchange digital certificates.
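As a minimal sketch of how captured packets might be grouped into IP sessions so that each session's payload is available in full, the code below keys packets on a direction-independent 5-tuple and concatenates their application-layer bytes. The `Packet` record, its fields and the function names are hypothetical; the text does not prescribe a capture format or tool.

```python
from collections import OrderedDict
from typing import NamedTuple

class Packet(NamedTuple):
    # Hypothetical, already-decoded packet record (not a real capture API).
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    proto: str      # "TCP" or "UDP"
    payload: bytes  # application-layer bytes only

def session_key(pkt):
    """Direction-independent key so both directions of a session share one entry."""
    a = (pkt.src_ip, pkt.src_port)
    b = (pkt.dst_ip, pkt.dst_port)
    return (pkt.proto,) + (a + b if a <= b else b + a)

def group_sessions(packets):
    """Concatenate payload bytes per session, in arrival order."""
    sessions = OrderedDict()
    for pkt in packets:
        sessions.setdefault(session_key(pkt), bytearray()).extend(pkt.payload)
    return {key: bytes(buf) for key, buf in sessions.items()}

packets = [
    Packet("10.0.0.5", 50000, "203.0.113.7", 443, "TCP", b"\x16\x03\x01"),
    Packet("203.0.113.7", 443, "10.0.0.5", 50000, "TCP", b"\x16\x03\x03"),
    Packet("10.0.0.5", 50001, "203.0.113.9", 80, "TCP", b"GET / HTTP/1.1\r\n"),
]
sessions = group_sessions(packets)
```

The first two packets belong to the same HTTPS session and end up in one entry; the third forms a separate HTTP session. TCP reassembly and retransmission handling are deliberately left out of this sketch.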
The software-related domain name collection is to parse the DNS protocol traffic spontaneously generated by each application software and extract the domain names and/or CNAME domain names used between the software and its servers; the software mainly uses these domain names for information collection such as uploading asset terminal information, synchronization, software updating and collecting user operations.
The software-related IP collection is to parse the response data of DNS A and/or AAAA record queries in the DNS protocol traffic spontaneously generated by each application software and extract the resolved IPs of the software-related domain names, or to obtain the latest resolved IPs of those domain names through the Internet, such as using *** public DNS.
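The two collection processes above can be sketched as a single pass over already-parsed DNS answers. The `(name, rtype, value)` tuple format, the function name and the example domains and addresses (all from reserved documentation ranges) are assumptions made for illustration; decoding DNS packets themselves is out of scope here.

```python
def collect_domains_and_ips(dns_answers):
    """Collect software-related domain names and resolved IPs.

    dns_answers: iterable of (name, rtype, value) tuples taken from parsed
    DNS responses (hypothetical format).
    """
    domains, ips = set(), set()
    for name, rtype, value in dns_answers:
        domains.add(name.rstrip(".").lower())
        if rtype == "CNAME":
            # The CNAME target is itself a software-related domain name.
            domains.add(value.rstrip(".").lower())
        elif rtype in ("A", "AAAA"):
            # A carries an IPv4 address, AAAA an IPv6 address.
            ips.add(value)
    return domains, ips

answers = [
    ("update.example-app.test.", "CNAME", "cdn.example-cdn.test."),
    ("cdn.example-cdn.test.", "A", "192.0.2.10"),
    ("cdn.example-cdn.test.", "AAAA", "2001:db8::10"),
]
domains, ips = collect_domains_and_ips(answers)
```

The resulting two sets are what a later screening stage would match DNS queries and responses against.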
The training set generation is to collect the application-layer payload byte data of the IP sessions spontaneously generated by each application software, the application-layer payload byte data being not less than 128 bytes and padded with zeros if it is less than 128 bytes, to label each application-layer payload with its software name and software class, and to use the resulting payload group as the training set.
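A minimal sketch of training set generation follows, turning each labeled payload into a fixed-length numeric vector. The text only states that payloads shorter than 128 bytes are zero-padded; truncating longer payloads to 128 bytes is an assumption made here so that all vectors have the same length. The function names and sample labels are hypothetical.

```python
import numpy as np

PAYLOAD_LEN = 128  # vector length taken from the 128-byte rule in the text

def payload_to_vector(payload: bytes, length: int = PAYLOAD_LEN) -> np.ndarray:
    """Zero-pad short payloads; truncation of long ones is an assumption."""
    fixed = payload[:length].ljust(length, b"\x00")
    return np.frombuffer(fixed, dtype=np.uint8).astype(np.float64)

def build_training_set(labelled_payloads):
    """labelled_payloads: iterable of (payload, software_name, software_class)."""
    rows, names, classes = [], [], []
    for payload, name, cls in labelled_payloads:
        rows.append(payload_to_vector(payload))
        names.append(name)
        classes.append(cls)
    return np.vstack(rows), names, classes

samples = [
    (b"GET /update HTTP/1.1\r\n", "ExampleChat", "communication"),
    (bytes(range(200)), "ExampleOffice", "office"),
]
X, names, classes = build_training_set(samples)
```

Each row of `X` is one payload, and `names` and `classes` carry the per-payload labels in the same order.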
The training model generation is to take the training set as training samples and train on them with the PCA algorithm to obtain a feature matrix for each software class. Because PCA reduces dimensionality while retaining the features that carry the most software-class information, it lowers training complexity and speeds up both training and recognition; and unlike traditional methods that rely on fixed features, the method uses session content as the training content, so that feature acquisition, training and recognition can be completed automatically.
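The training step might look like the following sketch, which fits one PCA basis per software class using a singular value decomposition. Treating the "feature matrix" as the top principal directions of each class, the choice of `n_components`, and the synthetic data are all assumptions; the text does not specify these details.

```python
import numpy as np

def fit_class_pca(X, n_components):
    """Return (mean, components) for one software class.

    X holds one payload vector per row; components has shape
    (n_components, n_features) with orthonormal rows.
    """
    mean = X.mean(axis=0)
    # SVD of the centered data: rows of Vt are the principal directions,
    # ordered by decreasing singular value (decreasing variance).
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components]

def fit_model(X, classes, n_components=2):
    """Build one feature matrix per software class."""
    labels = np.asarray(classes)
    return {c: fit_class_pca(X[labels == c], n_components)
            for c in sorted(set(classes))}

# Synthetic stand-in for real payload vectors (assumption).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(50, 5, (20, 128)), rng.normal(200, 5, (20, 128))])
classes = ["office"] * 20 + ["game"] * 20
model = fit_model(X, classes, n_components=3)
```

Keeping only a few components per class is what delivers the reduced training and recognition cost described above, at the price of discarding low-variance directions.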
Preferably, the software classes include communication software, transmission software, office software, multimedia software, development software, security software, mail software, industry software, game software and mobile phone application software.
Further, the test library generation step comprises the processes of mirror image flow extraction and analysis, source IP data extraction and test set generation.
The mirror image flow extraction and analysis is to mirror the traffic data of the assets in the network, parse the captured mirror image flow, extract the source IPs from its DNS protocol traffic, and screen out the source IPs that meet the software classification conditions together with their IP sessions; the source IP data extraction is to extract the IP sessions of each such source IP; the test set generation is to extract the payload group of application-layer payload byte data of each IP session of the source IP and generate a test set, the application-layer payload byte data being not less than 128 bytes and padded with zeros if it is less than 128 bytes;
and a source IP and its IP sessions that meet the software classification conditions should satisfy at least one of the following:
1. a DNS A and/or AAAA record query session in the DNS protocol traffic contains a software-related domain name;
2. a DNS CNAME query contains a software-related domain name;
3. a DNS A and/or AAAA record response session contains a software-related IP.
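The three screening conditions can be sketched as a predicate over the DNS events observed for each source IP. The event dictionary layout (`kind`, `rtype`, `value`), the function names and the example data are hypothetical; they stand in for whatever a DNS parser would produce from the mirror image flow.

```python
def source_matches(dns_events, software_domains, software_ips):
    """Check whether one source IP satisfies at least one screening condition.

    dns_events: list of dicts with keys "kind" ("query" or "response"),
    "rtype" ("A", "AAAA" or "CNAME") and "value" (a domain name or an IP).
    """
    for ev in dns_events:
        kind, rtype, value = ev["kind"], ev["rtype"], ev["value"]
        if kind == "query" and rtype in ("A", "AAAA") and value in software_domains:
            return True   # condition 1: A/AAAA query for a software-related domain
        if kind == "query" and rtype == "CNAME" and value in software_domains:
            return True   # condition 2: CNAME query for a software-related domain
        if kind == "response" and rtype in ("A", "AAAA") and value in software_ips:
            return True   # condition 3: A/AAAA response with a software-related IP
    return False

def screen_sources(events_by_source, software_domains, software_ips):
    """Return the source IPs whose DNS activity meets any condition."""
    return sorted(src for src, events in events_by_source.items()
                  if source_matches(events, software_domains, software_ips))

events_by_source = {
    "10.0.0.5": [{"kind": "query", "rtype": "A", "value": "update.example-app.test"}],
    "10.0.0.6": [{"kind": "response", "rtype": "AAAA", "value": "2001:db8::10"}],
    "10.0.0.7": [{"kind": "query", "rtype": "A", "value": "unrelated.test"}],
}
matched = screen_sources(events_by_source, {"update.example-app.test"}, {"2001:db8::10"})
```

Only the IP sessions of the matched source IPs would then be passed on to test set generation, which keeps unrelated traffic out of the classifier.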
The IP sessions comprise the HTTP and HTTPS application protocol session traffic spontaneously generated by software; the spontaneous software behaviors include visiting the official website by default, updating, reporting error logs, reporting statistical logs, operation backup and uploading configuration. Most of these operations are carried out over HTTP or HTTPS.