CN116186708A - Class identification model generation method, device, computer equipment and storage medium - Google Patents

Class identification model generation method, device, computer equipment and storage medium

Info

Publication number
CN116186708A
Authority
CN
China
Prior art keywords: target, request data, network request, network, data
Legal status
Pending
Application number
CN202211668533.8A
Other languages
Chinese (zh)
Inventor
郑子彬
黄进波
林昊
蔡倬
赵山河
张文锋
Current Assignee
Merchants Union Consumer Finance Co Ltd
Sun Yat Sen University
Original Assignee
Merchants Union Consumer Finance Co Ltd
Sun Yat Sen University
Application filed by Merchants Union Consumer Finance Co Ltd and Sun Yat Sen University
Priority to CN202211668533.8A
Publication of CN116186708A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F 21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F 21/57: Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F 21/577: Assessing vulnerabilities and evaluating computer system security
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Abstract

The application relates to a class identification model generation method, apparatus, computer device and storage medium. The method comprises the following steps: acquiring original network request data, and performing text field expansion on the original network request data to obtain target network request data; selecting target network field data from the target network request data, and calculating target word frequency and target reverse file frequency corresponding to each word in the target network field data; obtaining target word frequency characteristics based on the target word frequency and the target reverse file frequency corresponding to each word; performing category coding on each network field in the target network request data to obtain target category characteristics corresponding to each network field; fusing the target word frequency characteristics and the target category characteristics to obtain target text characteristics; training the initial decision tree model based on the target text characteristics and the target network request data to obtain a target class identification model. The method can improve the efficiency of detecting and identifying the network attack category.

Description

Class identification model generation method, device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a class identification model generating method, apparatus, and computer device.
Background
With the development of internet technology, the internet has become integrated into many aspects of people's lives, and the business efficiency of each industry is continuously improving. However, current network technology is not perfect, and lawbreakers often exploit existing internet vulnerabilities to frequently launch attacks and illegally steal private website data; if certain platforms (such as financial platforms) are hacked, the consequences can be severe. In addition, because threats continuously evolve while threat rules are not updated in real time, novel attacks cannot be resisted, so existing rule-based Web application intrusion threat detection and protection products suffer from problems such as high false alarm and missed detection rates.
Conventional technology mainly performs rule matching by analyzing known attack characteristics, which makes the rules difficult to maintain and the feature library bloated; as a result, unknown vulnerabilities or attack techniques cannot be detected, and the efficiency of detecting and identifying network attack categories is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a class recognition model generation method, apparatus, computer device, and storage medium capable of improving the efficiency of detecting and identifying network attack categories.
A class identification model generation method, the method comprising:
acquiring original network request data, and performing text field expansion on the original network request data to obtain target network request data;
selecting target network field data from the target network request data, and calculating target word frequency and target reverse file frequency corresponding to each word in the target network field data;
fusing the target word frequency corresponding to each word in the target network field data with the target reverse file frequency to obtain an intermediate word frequency characteristic matrix;
performing dimension reduction processing on the intermediate word frequency feature matrix to obtain target word frequency features;
performing category coding on each network field in the target network request data to obtain target category characteristics corresponding to each network field;
fusing the target word frequency characteristics and the target category characteristics to obtain target text characteristics;
training an initial decision tree model based on the target text features and the target network request data to obtain a target class identification model, wherein the target class identification model is used for obtaining a target network attack class based on the network request data to be identified.
In one embodiment, obtaining original network request data, performing text field expansion on the original network request data, and obtaining target network request data includes:
searching the missing network fields in the original network request data to obtain target missing positions corresponding to the missing network fields;
acquiring target special characters, and adding the target special characters in each target missing position to obtain intermediate network request data;
and obtaining a target expansion field, selecting a network field to be expanded from the intermediate network request data, and adding the target expansion field at a position corresponding to the network field to be expanded to obtain the target network request data.
In one embodiment, selecting target network field data from the target network request data, and calculating a target word frequency and a target reverse file frequency corresponding to each word in the target network field data includes:
calculating the number of times of each word in the target network field data in each target network request data corresponding to the target network field data to obtain the target number;
calculating the sum of the occurrence times of each word in each target network request data corresponding to the target network field data to obtain the total number of targets;
obtaining target word frequencies corresponding to words in the target network field data based on the proportion of the target number to the corresponding target total number;
taking the total number of the target network request data corresponding to the target network field data as a first total number;
taking the total number of the target network request data corresponding to each word in the target network field data as a second total number;
calculating the ratio of the first total number to the corresponding second total number, and taking the ratio as a target ratio;
and calculating the logarithm of the target proportion to obtain the target reverse file frequency corresponding to each word in the target network field data.
In one embodiment, performing the dimension reduction processing on the intermediate word frequency feature matrix to obtain the target word frequency features includes:
calculating a transposed matrix corresponding to the intermediate word frequency feature matrix to obtain a transposed word frequency feature matrix;
fusing the transposed word frequency feature matrix and the intermediate word frequency feature matrix to obtain a first feature matrix;
obtaining a left singular matrix based on the first feature matrix;
calculating square roots corresponding to all eigenvalues in the first eigenvalue matrix, and fusing the square roots corresponding to all eigenvalues to obtain an intermediate singular value matrix;
fusing the intermediate word frequency feature matrix and the transposed word frequency feature matrix to obtain a second feature matrix;
obtaining a right singular matrix based on the second feature matrix;
and fusing the left singular matrix, the intermediate singular value matrix and the right singular matrix to obtain the target word frequency characteristic.
In one embodiment, performing category encoding on each network field in the target network request data, and obtaining the target category feature corresponding to each network field includes:
calculating the total number of categories of each category corresponding to each network field in the target network request data;
and sequentially carrying out numerical coding on each category of each network field until the total number of categories corresponding to each category of each network field is coded, and obtaining the target category characteristics corresponding to each network field.
In one embodiment, training the initial decision tree model based on the target text feature and the target network request data to obtain a target class identification model includes:
calculating splitting gains corresponding to all text features in the target text features based on the initial decision tree model;
determining root node division characteristics based on comparison results of the splitting gains corresponding to the text characteristics;
constructing a first histogram based on the root node partition characteristics, wherein the first histogram comprises a target number of bins, each bin in the target number of bins comprises a corresponding first-order derivative total value and a second-order derivative total value of samples, and the samples are sample data in the target request data;
obtaining a plurality of first splitting thresholds based on the first derivative total value of the sample and the second derivative total value of the sample corresponding to each bin;
determining a first target splitting threshold based on the comparison result of the plurality of first splitting thresholds, dividing the first histogram based on the first target splitting threshold, and generating branch nodes, wherein each branch node contains sample data corresponding to the target network request data;
calculating the splitting gain of each text feature corresponding to each branch node, and determining the node division feature of each branch node based on the comparison result of the splitting gain of each text feature corresponding to each branch node;
constructing a corresponding current histogram based on each node partition characteristic, and obtaining a plurality of current split thresholds corresponding to the current histogram based on a first-order total value and a second-order total value of samples of each bin corresponding to the current histogram;
determining a current target splitting threshold value based on comparison results of a plurality of current splitting threshold values corresponding to the current histogram, dividing the current histogram based on the current target splitting threshold value, and generating a new branch node;
repeating the operations of calculating the splitting gain of each text feature corresponding to each branch node and determining the node division feature of each branch node based on the comparison result of the splitting gains of the text features corresponding to each branch node, until a corresponding current histogram has been built for each text feature in the target text features, and determining the target category recognition model.
In one embodiment, training the initial decision tree model based on the target text feature and the target network request data, and after obtaining the target class identification model, further includes:
acquiring network request data to be identified, and performing text field expansion on the network request data to be identified to obtain target network request data to be identified;
selecting target network field data to be identified from the target network request data to be identified, and calculating current target word frequency and current target reverse file frequency corresponding to each word in the target network field data to be identified;
fusing the current target word frequency corresponding to each word in the target network field data to be identified with the current target reverse file frequency to obtain a current intermediate word frequency characteristic matrix;
performing dimension reduction processing on the current intermediate word frequency feature matrix to obtain current target word frequency features;
performing category coding on each network field in the network request data to be identified to obtain the current target category characteristics corresponding to each network field;
fusing the current target word frequency characteristics and the current target category characteristics to obtain current target text characteristics;
and acquiring a target class identification model, and inputting the current target text characteristics and the target network request data to be identified into the target class identification model to obtain a target network attack class corresponding to the network request data to be identified.
A class identification model generation apparatus, the apparatus comprising:
the data processing module is used for acquiring original network request data, and performing text field expansion on the original network request data to obtain target network request data;
the calculation module is used for selecting target network field data from the target network request data and calculating target word frequency and target reverse file frequency corresponding to each word in the target network field data;
the matrix generation module is used for fusing the target word frequency corresponding to each word in the target network field data with the target reverse file frequency to obtain an intermediate word frequency characteristic matrix;
the first feature determining module is used for performing dimension reduction processing on the intermediate word frequency feature matrix to obtain target word frequency features;
the second feature determining module is used for carrying out category coding on each network field in the target network request data to obtain target category features corresponding to each network field;
the target feature determining module is used for fusing the target word frequency feature and the target category feature to obtain a target text feature;
the model generation module is used for training the initial decision tree model based on the target text characteristics and the target network request data to obtain a target class identification model, and the target class identification model is used for obtaining a target network attack class based on the network request data to be identified.
A computer device comprising a memory storing a computer program and a processor which when executing the computer program performs the steps of:
acquiring original network request data, and performing text field expansion on the original network request data to obtain target network request data;
selecting target network field data from the target network request data, and calculating target word frequency and target reverse file frequency corresponding to each word in the target network field data;
fusing the target word frequency corresponding to each word in the target network field data with the target reverse file frequency to obtain an intermediate word frequency characteristic matrix;
performing dimension reduction processing on the intermediate word frequency feature matrix to obtain target word frequency features;
performing category coding on each network field in the target network request data to obtain target category characteristics corresponding to each network field;
fusing the target word frequency characteristics and the target category characteristics to obtain target text characteristics;
training an initial decision tree model based on the target text features and the target network request data to obtain a target class identification model, wherein the target class identification model is used for obtaining a target network attack class based on the network request data to be identified.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring original network request data, and performing text field expansion on the original network request data to obtain target network request data;
selecting target network field data from the target network request data, and calculating target word frequency and target reverse file frequency corresponding to each word in the target network field data;
fusing the target word frequency corresponding to each word in the target network field data with the target reverse file frequency to obtain an intermediate word frequency characteristic matrix;
performing dimension reduction processing on the intermediate word frequency feature matrix to obtain target word frequency features;
performing category coding on each network field in the target network request data to obtain target category characteristics corresponding to each network field;
fusing the target word frequency characteristics and the target category characteristics to obtain target text characteristics;
training an initial decision tree model based on the target text features and the target network request data to obtain a target class identification model, wherein the target class identification model is used for obtaining a target network attack class based on the network request data to be identified.
The category identification model generation method, apparatus, computer device and storage medium obtain original network request data and perform text field expansion on it to obtain target network request data; select target network field data from the target network request data and calculate the target word frequency and target reverse file frequency corresponding to each word in the target network field data; fuse the target word frequency and the target reverse file frequency of each word to obtain an intermediate word frequency feature matrix; perform dimension reduction on the intermediate word frequency feature matrix to obtain target word frequency features; perform category coding on each network field in the target network request data to obtain the corresponding target category features; fuse the target word frequency features and the target category features to obtain target text features; and train an initial decision tree model based on the target text features and the target network request data to obtain a target class identification model, which is used for obtaining a target network attack class from network request data to be identified. By performing missing value processing and text field expansion on the original network request data, richer target network request data is obtained. Converting the text information in the original network request data into numerical features that reflect the text characteristics, and extracting the target word frequency features from the more informative target network field data, greatly improves the accuracy and efficiency with which the target class identification model detects and identifies network attack categories. In addition, extracting the category features of each field in the target network request data enriches the information used to train the initial classification model and, to a certain extent, helps the target class identification model learn more rules. Under the combined effect of the target word frequency features and the target category features, the efficiency of detecting and identifying network attack categories is improved.
Drawings
FIG. 1 is an application environment diagram of a class identification model generation method in one embodiment;
FIG. 2 is a flow diagram of a class identification model generation method in one embodiment;
FIG. 3 is a flow diagram of data processing in one embodiment;
FIG. 4 is a flow diagram of data computation in one embodiment;
FIG. 5 is a flow chart of target word frequency feature determination in one embodiment;
FIG. 6 is a flow diagram of object class feature determination in one embodiment;
FIG. 7 is a flow diagram of object class identification model determination in one embodiment;
FIG. 8 is a flow diagram of the use of a target class identification model in one embodiment;
FIG. 9 is a flow diagram of a target attack class identification process in one embodiment;
FIG. 10 is a block diagram showing the structure of a class identification model generating device in one embodiment;
FIG. 11 is an internal block diagram of a computer device in one embodiment;
FIG. 12 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The category identification model generation method provided by the embodiment of the application can be applied to an application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104 or may be located on a cloud or other network server. The terminal 102 is used to input the original network request data. The server 104 is configured to obtain target network request data and target text features corresponding to the original network request data, train an initial classification model based on the target text features and the target network request data, determine a target class recognition model, obtain network request data to be recognized, and obtain a network attack class corresponding to the network request data to be recognized based on the target class recognition model. The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, internet of things devices, and portable wearable devices, where the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart vehicle devices, and the like. The portable wearable device may be a smart watch, smart bracelet, headset, or the like. The server 104 may be implemented as a stand-alone server or as a server cluster of multiple servers.
In one embodiment, as shown in fig. 2, a class identification model generating method is provided, and the method is applied to the server in fig. 1 for illustration, and includes the following steps:
step S200, original network request data are obtained, text field expansion is carried out on the original network request data, and target network request data are obtained.
The original network request data refer to web request data collected through various websites, and mainly comprise the web request data and pre-labeled attack category labels. web request data refers to protocol content of http including, but not limited to, http request method (method), user agent information (user agent), uniform resource locator (url), reference information of http request header (header), and body information of http. The attack categories include, but are not limited to, SQL injection attack, XSS cross-site scripting attack, command execution attack, directory traversal attack and remote code execution attack, and the tags of the attack categories are sequentially 0-4. Text field augmentation refers to an operation of selecting a prescribed field on the basis of original network request data, and adding corresponding text information to the prescribed field. The target network request data refers to web request data obtained after the original network request data has undergone missing value processing and text field expansion. Missing value processing refers to the operation of replacing the missing fields in the original network request data with a unified text character.
Specifically, the original network request data collected from each website has the problems of partial field missing, partial field insufficient information and the like, and in order to obtain more accurate and detailed data information, the obtained original network request data needs to be subjected to missing value processing and text field expansion so as to enrich the web request data and make full data preparation for the execution of subsequent processes.
Step S202, selecting target network field data from the target network request data, and calculating target word frequency and target reverse file frequency corresponding to each word in the target network field data.
The target network field data refers to network fields, selected from the target network request data, that have a great influence on network attack detection and identification; in this embodiment, the network fields corresponding to the user agent information (user agent) and the uniform resource locator (url) are selected as the target network field data. The target word frequency refers to the TF (Term Frequency) part of TF-IDF (Term Frequency-Inverse Document Frequency), which can be used to describe how strongly a certain word expresses a certain document. The underlying assumption is that the query keywords should be more important than other words, and the importance of a query keyword to a document is proportional to the number of times the word appears in that document. The target reverse file frequency refers to the IDF (Inverse Document Frequency) part of TF-IDF. The IDF can be used to measure the category distinguishing capability of a certain word at the level of the whole document set; it can identify words that occur many times in documents but do not express the meaning of a document well. For example, words used as connectives or to keep sentences fluent appear in most documents, so their TF is large, yet they do not help distinguish the relevance of documents. Therefore, the IDF needs to be considered: the IDF of a certain word is obtained by dividing the total number of documents by the number of documents containing the word and taking the logarithm of the resulting quotient. TF-IDF is a common weighting technique for information retrieval and text mining; it is a statistical method commonly used to evaluate the importance of a word to one document in a document set or corpus. The main idea is that if a word appears with high frequency in one article and rarely appears in other articles, the word or phrase is considered to have good category distinguishing capability and is suitable for classifying documents. Since each field in the target network field data is text, this method can be applied to obtain the TF-IDF value corresponding to each word of each field in the target network field data, thereby obtaining the corresponding text features.
Specifically, in order to obtain more informative field information from the target network field data, the word frequency characteristics corresponding to each word are obtained by calculating the target word frequency and the target reverse file frequency corresponding to each word in the target network field data. The target word frequency is calculated according to formula (1), where n_{i,j} denotes the number of times the i-th word appears in the target network field data j, and the denominator \sum_k n_{k,j} denotes the sum of the numbers of occurrences of all words in the target network field data j. The target reverse file frequency is calculated according to formula (2), where |D| denotes the total number of target network request data and |\{ j : t_i \in d_j \}| denotes the number of target network request data items containing the word t_i (i.e., those with n_{i,j} \neq 0). If a word does not appear in any target network request data this denominator would be zero, so 1 + |\{ j : t_i \in d_j \}| is generally used as the denominator.

TF_{i,j} = n_{i,j} / \sum_k n_{k,j}    (1)

IDF_i = \log ( |D| / |\{ j : t_i \in d_j \}| )    (2)
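For illustration only, the following is a minimal Python sketch of formulas (1) and (2); it is not taken from the patent itself, and the tokenization, variable names and the use of the smoothed denominator 1 + |{ j : t_i ∈ d_j }| everywhere are assumptions based on the description above.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Compute TF-IDF per formulas (1) and (2) for a list of tokenized documents.

    documents: list of word lists, e.g. one list per url/user-agent field value.
    Returns one dict per document mapping word -> TF-IDF weight.
    """
    total_docs = len(documents)          # |D|, the "first total number"
    doc_freq = Counter()                 # how many documents contain each word ("second total number")
    for doc in documents:
        doc_freq.update(set(doc))

    weights = []
    for doc in documents:
        counts = Counter(doc)            # n_{i,j}
        total_words = sum(counts.values())   # sum_k n_{k,j}
        doc_weights = {}
        for word, n_ij in counts.items():
            tf = n_ij / total_words                                 # formula (1)
            idf = math.log(total_docs / (1 + doc_freq[word]))       # formula (2), smoothed denominator
            doc_weights[word] = tf * idf                            # fused TF-IDF value
        weights.append(doc_weights)
    return weights

# Example: two simplified "url" field values split into words
docs = [["index", "php", "id=1"], ["admin", "php", "cmd=ls"]]
print(tf_idf(docs))
```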
And step S204, fusing the target word frequency corresponding to each word in the target network field data with the target reverse file frequency to obtain an intermediate word frequency characteristic matrix.
The fusion refers to multiplying the target word frequency corresponding to each word by the target reverse text frequency to obtain TF-IDF corresponding to each word, and the TF-IDF corresponding to each word is represented in a matrix, namely the intermediate word frequency feature matrix. The intermediate word frequency feature matrix refers to a TF-IDF matrix corresponding to each word extracted from the target network field data, and can be used for representing the TF-IDF features of the target network field data.
Specifically, when the target word frequency (TF) and the target reverse file frequency (IDF) corresponding to each word are calculated, the target word frequency and the target reverse file frequency corresponding to each word still need to be multiplied to obtain TF-IDF values corresponding to each word, and the TF-IDF corresponding to each word is represented in a matrix to obtain an intermediate word frequency feature matrix.
And S206, performing dimension reduction processing on the intermediate word frequency feature matrix to obtain target word frequency features.
The target word frequency feature is a feature matrix obtained after the intermediate word frequency feature matrix is subjected to dimension reduction, and in the embodiment, the intermediate word frequency feature is subjected to dimension reduction by adopting a singular value decomposition method. The singular value decomposition is mainly used for searching main dimensions in data distribution, and mapping original high-dimensional data into a low-dimensional subspace to realize data dimension reduction.
Specifically, since the number of text words corresponding to the target network field data is large, the feature dimension of the obtained intermediate word frequency feature matrix is too high, which is not conducive to classification by the machine learning model in the subsequent process, so the intermediate word frequency feature matrix needs to undergo dimension reduction processing to extract its key part. The singular value decomposition of a matrix A is defined as shown in formula (3), where the column vector u_j of the matrix U is a left singular vector; the matrix S is the singular value matrix, a diagonal matrix whose diagonal elements are the singular values of A arranged from large to small, with s_j being the j-th singular value; and the column vector v_j of the matrix V is a right singular vector. Further, since in many cases the sum of the first 10% or even 1% of the singular values accounts for 99% or more of the sum of all singular values, the matrix can be approximately described by the first r singular values, so the m \times n matrix A can be approximated according to formula (4).

A = U S V^T    (3)

A_{m \times n} \approx U_{m \times r} S_{r \times r} V^T_{r \times n}    (4)
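As an illustrative sketch only, the truncation of formula (4) could be carried out with numpy as follows; the rank r and the toy matrix are assumptions, and the product of the truncated left singular matrix, singular value matrix and right singular matrix is returned as described in the later embodiment.

```python
import numpy as np

def reduce_tfidf(A: np.ndarray, r: int) -> np.ndarray:
    """Rank-r approximation of the intermediate word frequency (TF-IDF) matrix A,
    as in formula (4): A_{m x n} ~= U_{m x r} S_{r x r} V^T_{r x n}."""
    # full_matrices=False gives the thin SVD: U (m x k), s (k,), Vt (k x n)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # keep only the first r singular values/vectors (the "key part" of the matrix)
    U_r, S_r, Vt_r = U[:, :r], np.diag(s[:r]), Vt[:r, :]
    return U_r @ S_r @ Vt_r   # fuse left singular, singular value and right singular matrices

# Example: a toy 4 x 6 TF-IDF matrix reduced with r = 2
A = np.random.rand(4, 6)
print(reduce_tfidf(A, 2).shape)   # (4, 6), but only rank 2
```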
Step S208, performing category coding on each network field in the target network request data to obtain target category characteristics corresponding to each network field.
Wherein, the class coding means that each class in the network field is coded with a numerical value, and in this embodiment, a hard coding method is used to code each class for each network field in the target network request data with a numerical value. The target category feature refers to the conversion of each network field in the target network request data into a corresponding numerical feature.
Specifically, some network fields in the target network request data have fixed categories, so that each network field in the target network request data can be subjected to category coding to obtain target category characteristics corresponding to the target request data. In this embodiment, the hard coding method is used to perform class coding on each network field, that is, the method based on counting is directly used to perform coding on each text class.
And step S210, fusing the target word frequency characteristics and the target category characteristics to obtain target text characteristics.
The fusion means to combine the target word frequency characteristics and the target category characteristics and integrate the two types of characteristics into the same matrix. The target text feature refers to a feature used in classification of the machine learning model, which is a feature set of TF-IDF features extracted from target request data and class features obtained after class encoding of the target request data.
Specifically, the features finally input into the machine learning model include both the target word frequency features and the target category features, and the two types of features are combined to obtain the target text features that are finally input into the machine learning model.
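A minimal sketch of this fusion step, assuming the two feature blocks are already available as numeric arrays; the array shapes and names here are illustrative only.

```python
import numpy as np

# Assumed shapes: word_freq_features is the dimension-reduced TF-IDF matrix
# (one row per request); category_features holds the numerically encoded fields.
word_freq_features = np.random.rand(1000, 50)                    # e.g. 1000 requests, 50 reduced dims
category_features = np.random.randint(0, 100, size=(1000, 5))    # e.g. 5 encoded network fields

# "Fusing" the two kinds of features: column-wise concatenation into one matrix
target_text_features = np.hstack([word_freq_features, category_features.astype(float)])
print(target_text_features.shape)   # (1000, 55)
```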
Step S212, training an initial decision tree model based on the target text features and the target network request data to obtain a target class identification model, wherein the target class identification model is used for obtaining a target network attack class based on the network request data to be identified.
The initial decision tree model refers to an original decision tree model, the initial decision tree model is trained through target text features and target network request data, and model parameters are adjusted, and in the embodiment, the initial decision tree model refers to a LightGBM model.
Specifically, in order to obtain a target category recognition model with better classification capability, an initial decision tree model needs to be used to classify the target text features extracted from the target network request data. In this embodiment, the initial decision tree model is a LightGBM model. LightGBM uses a histogram-based decision tree algorithm, whose basic idea is to discretize continuous floating point feature values into k integers and construct a histogram with k bins; when traversing the data, statistics are accumulated in the histogram using the discretized values as indexes, and after one pass over the data the histogram has accumulated the required statistics. The optimal segmentation point is then found by traversing the discrete values of the histogram. After the final decision tree is constructed, the probability of each category given by the decision tree is calculated, and the category with the highest probability is taken as the final attack category; if the probabilities of all categories are lower than 0.5, it is considered that no network attack threat exists.
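For illustration, a minimal training sketch using the open-source lightgbm Python package is shown below (the patent names the LightGBM model but does not specify an implementation); the parameter values, data shapes and the handling of the 0.5 threshold are assumptions for demonstration rather than the patent's exact configuration.

```python
import numpy as np
import lightgbm as lgb

# target text features: fused TF-IDF + category features (see the fusion sketch above)
# labels: attack category tags 0-4 (SQL injection, XSS, command execution,
#         directory traversal, remote code execution)
X = np.random.rand(1000, 55)
y = np.random.randint(0, 5, size=1000)

# LightGBM builds histogram-based decision trees; max_bin controls the number
# of discrete bins (k) used when bucketing continuous feature values.
model = lgb.LGBMClassifier(objective="multiclass", max_bin=255,
                           n_estimators=200, learning_rate=0.05)
model.fit(X, y)

# Predicted class probabilities for new requests; if every class probability
# is below 0.5 the request is treated as carrying no network attack threat.
proba = model.predict_proba(X[:3])
for p in proba:
    label = int(np.argmax(p)) if p.max() >= 0.5 else None
    print(label, p.max())
```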
According to the above category identification model generation method, original network request data are obtained, and text field expansion is performed on the original network request data to obtain target network request data; target network field data are selected from the target network request data, and the target word frequency and target reverse file frequency corresponding to each word in the target network field data are calculated; the target word frequency and target reverse file frequency of each word are fused to obtain an intermediate word frequency feature matrix; dimension reduction is performed on the intermediate word frequency feature matrix to obtain target word frequency features; category coding is performed on each network field in the target network request data to obtain the corresponding target category features; the target word frequency features and the target category features are fused to obtain target text features; and an initial decision tree model is trained based on the target text features and the target network request data to obtain a target class identification model, which is used for obtaining a target network attack class from network request data to be identified. By performing missing value processing and text field expansion on the original network request data, richer target network request data is obtained. Converting the text information in the original network request data into numerical features that reflect the text characteristics, and extracting the target word frequency features from the more informative target network field data, greatly improves the accuracy and efficiency with which the target class identification model detects and identifies network attack categories. In addition, extracting the category features of each field in the target network request data enriches the information used to train the initial classification model and, to a certain extent, helps the target class identification model learn more rules. Under the combined effect of the target word frequency features and the target category features, the efficiency of detecting and identifying network attack categories is improved.
In one embodiment, as shown in fig. 3, step S200 includes:
step S300, searching the missing network fields in the original network request data to obtain target missing positions corresponding to the missing network fields.
The target missing position refers to a position corresponding to a certain field in the original network request data when the field lacks text.
Specifically, the original request data may include partial network field data missing, and the missing network field data needs to be processed uniformly for subsequent use of the machine learning model. In this embodiment, the positions of the missing network fields in the original request data are searched by using the pandas library of python, and the positions are used as target missing positions corresponding to the missing network fields.
And step S302, obtaining target special characters, and adding the target special characters at each target missing position to obtain intermediate network request data.
Wherein, the target special character refers to a specific character used to uniformly mark the missing network fields; in this embodiment, "NAN" is used as the special character. The intermediate network request data refers to the network request data obtained after the missing network fields in the original network request data have been filled with the target special character.
Specifically, because all network fields in the original network request data are text information, the missing network fields can be filled with unified target special characters so as to ensure that the texts of the missing network fields are consistent, and the extraction of the subsequent target text features is convenient.
Step S304, a target expansion field is obtained, a network field to be expanded is selected from the intermediate network request data, and the target expansion field is added at a position corresponding to the network field to be expanded, so that the target network request data is obtained.
Wherein the target augmentation field refers to text information for populating the network field to be augmented. The network field to be expanded refers to a network field having text content that has a large influence on the machine learning model classification result, and in this embodiment, user agent information (user agent) and uniform resource locator (url) are selected as the network field to be expanded.
Specifically, in order to extract more useful information from the intermediate network request data, the intermediate network request data needs to be expanded. Since some network fields in the intermediate network request data carry more useful information, those fields are selected as the network fields to be expanded. In this embodiment, the user agent information (user agent) and the uniform resource locator (url) are selected as the network fields to be expanded: the user agent information is divided into browser name, operating system name, device name, device brand and device model by using the user_agents library in python, and the uniform resource locator is divided into protocol name, domain name server, relative path and query condition by using urlparse. Here user_agents is a Python library that provides a simple way to identify/detect devices by parsing the browser/HTTP user agent string, and urlparse (in Python's urllib.parse) defines a standard interface for decomposing Uniform Resource Locator (URL) strings into components (protocol, domain name server, path, etc.).
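As an illustration of this expansion step, a minimal sketch using pandas, user_agents and urllib.parse might look as follows; the column names and the sample request are assumptions, not fields defined by the patent.

```python
import pandas as pd
from urllib.parse import urlparse
from user_agents import parse as parse_ua

def expand_fields(df: pd.DataFrame) -> pd.DataFrame:
    """Missing-value processing and text field expansion on raw web request data."""
    # fill every missing network field with the unified special character "NAN"
    df = df.fillna("NAN")

    # expand the user agent field into browser / operating system / device information
    ua = df["user_agent"].map(parse_ua)
    df["browser"] = ua.map(lambda u: u.browser.family)
    df["os"] = ua.map(lambda u: u.os.family)
    df["device"] = ua.map(lambda u: u.device.family)
    df["device_brand"] = ua.map(lambda u: u.device.brand)
    df["device_model"] = ua.map(lambda u: u.device.model)

    # expand the url field into protocol, domain name server, path and query condition
    parsed = df["url"].map(urlparse)
    df["scheme"] = parsed.map(lambda p: p.scheme)
    df["netloc"] = parsed.map(lambda p: p.netloc)
    df["path"] = parsed.map(lambda p: p.path)
    df["query"] = parsed.map(lambda p: p.query)
    return df

sample = pd.DataFrame({
    "method": ["GET"],
    "user_agent": ["Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 Chrome/96.0"],
    "url": ["http://example.com/index.php?id=1"],
})
print(expand_fields(sample).iloc[0])
```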
In this embodiment, the missing network field processing and the text field expansion performed on the original network request data unify the text of the missing network fields, facilitate the extraction of the subsequent target text features, and increase the amount of useful information, so that the subsequently acquired target text features are better and the execution efficiency of the subsequent process is improved.
In one embodiment, as shown in fig. 4, step S202 includes:
step S400, calculating the number of times of each word in the target network field data in each target network request data corresponding to the target network field data to obtain the target number.
Wherein, the target number refers to the number of times each word in the target network field data appears in each target network request data, namely the number of times a word appears in one target network request data.
Specifically, before calculating the corresponding target word frequency, the number of times each word in the target network field data appears in each target network request data needs to be known, so as to provide a data basis for the execution of the subsequent process, and in this embodiment, the target number is the numerator in equation (1).
Step S402, calculating the sum of the occurrence times of each word in each target network request data corresponding to the target network field data to obtain the target total number.
Wherein, the target total number refers to the sum of the occurrence times of all words corresponding to the target network field data in each target network request data.
Specifically, in order to calculate the target word frequency later, the sum of the occurrence times of all words corresponding to the target network field data in each target network request data needs to be calculated first, and the sum is taken as the target total number. In this embodiment, the target total number is the denominator in the formula (1).
And step S404, obtaining the target word frequency corresponding to each word in the target network field data based on the ratio of the target number to the corresponding target total number.
Specifically, the target word frequency corresponding to each word is obtained by calculating the ratio of the target number corresponding to each word to the corresponding target total number, and the target word frequency corresponding to each word can be used for calculating the text feature corresponding to the word. The formula of the target word frequency can be seen as formula (1).
In step S406, the total number of the target network request data corresponding to the target network field data is taken as the first total number.
Wherein the first total number is the total number of data requested by the target network.
Specifically, since each target network request data selects the corresponding target network field data, the total number of the target network request data corresponding to the target network field data is actually the total number of the target network request data, and the first total number is obtained to provide a data basis for calculating the target reverse file frequency of the subsequent process.
Step S408, taking the total number of the target network request data corresponding to each word in the target network field data as the second total number.
The second total number refers to the total number of target network request data containing a word in the target network field data, that is, the corresponding number of target network request data of the target network request data where the word in the target network field data is located.
Specifically, this step mainly calculates the second total number to provide a data basis for the subsequent process: the smaller the second total number corresponding to a certain word, the larger the IDF, which indicates that the word has good category distinguishing capability. Further, in this embodiment, it is noted that when a certain word does not appear in any of the target network request data, the second total number is generally set to 1.
In step S410, a ratio of the first total number to the corresponding second total number is calculated, and the ratio is taken as a target ratio.
The target ratio refers to a ratio corresponding to the first total number as a numerator and the second total number as a denominator.
Specifically, the quotient obtained by dividing the first total number by the second total number is needed to obtain the target reverse file frequency in the subsequent process, and data preparation is performed for obtaining the target reverse file frequency.
Step S412, the logarithm of the target proportion is calculated, and the target reverse file frequency corresponding to each word in the target network field data is obtained.
Specifically, to obtain the target reverse file frequency corresponding to each word in the target network field data, the logarithm of the target proportion corresponding to each word also needs to be taken. The calculation formula of the target reverse file frequency is shown in formula (2).
In this embodiment, by calculating the target word frequency and the target reverse text frequency corresponding to each word in the target network field data, the text corresponding to each network field in the target network field data can be converted into the numerical feature reflecting the text feature, which is beneficial to classifying and training the machine learning model in the subsequent process, so that the efficiency of detecting and identifying the network attack type by the class identification model obtained by training is improved.
In one embodiment, as shown in fig. 5, step S206 includes:
and S500, calculating a transposed matrix corresponding to the intermediate word frequency feature matrix to obtain the transposed word frequency feature matrix.
The transposed matrix refers to a new matrix obtained by interchanging rows and columns of the original matrix. The transposed word frequency feature matrix refers to a matrix obtained by exchanging rows and columns in the intermediate word frequency feature matrix.
Specifically, in order to calculate the target word frequency feature after the dimension reduction of the intermediate word frequency feature, a transposed matrix of the intermediate word frequency feature matrix, that is, the transposed word frequency feature matrix, needs to be calculated, so that data preparation is performed for the execution of the subsequent process.
Step S502, fusing the transposed word frequency feature matrix and the intermediate word frequency feature matrix to obtain a first feature matrix.
Wherein, the fusion refers to the operation of multiplying the transposed word frequency feature matrix with the intermediate word frequency feature matrix. The first feature matrix refers to a new matrix obtained by multiplying the transposed word frequency feature matrix by the intermediate word frequency feature matrix.
Specifically, the first feature matrix is calculated to prepare data for the execution of the subsequent process, so as to support the operation of reducing the dimension of the intermediate word frequency feature matrix, thereby improving the execution efficiency of the related process.
Step S504, obtaining a left singular matrix based on the first feature matrix.
The left singular matrix refers to the set of left singular vectors arranged as a rectangular array; it is an eigenvector matrix, a specific example being the matrix U in formula (3). A left singular vector, which describes a direction of maximum effect of the matrix, is a feature vector used to represent certain characteristics of the words in the target network field data.
Specifically, in this embodiment, a singular value decomposition method is used to perform dimension reduction processing on the intermediate word frequency feature matrix, and a left singular matrix needs to be calculated in the singular value decomposition process, and a column vector in the left singular matrix is a left singular vector.
Step S506, calculating square roots corresponding to all eigenvalues in the first eigenvalue matrix, and fusing the square roots corresponding to all eigenvalues to obtain an intermediate singular value matrix.
Wherein the intermediate singular value matrix is a diagonal matrix whose diagonal elements are the square roots of the eigenvalues of the first feature matrix, arranged in descending order from large to small; a specific example is the matrix S in formula (3). A diagonal matrix is a matrix in which all elements other than those on the main diagonal are 0.
Specifically, in this embodiment, the singular value decomposition method is used to perform the dimension reduction processing on the intermediate word frequency feature matrix, and in the singular value decomposition process, a singular value matrix, that is, an intermediate singular value matrix, is required to be calculated, where elements on a diagonal line of the intermediate singular value matrix are arranged in order from large to small, where the elements are equal to square roots of corresponding feature values in the first feature matrix.
And step S508, fusing the intermediate word frequency feature matrix and the transposed word frequency feature matrix to obtain a second feature matrix.
Wherein, the fusion refers to the operation of multiplying the intermediate word frequency characteristic matrix with the transposed word frequency characteristic matrix. The second feature matrix refers to a new matrix obtained by multiplying the intermediate word frequency feature matrix by the transposed word frequency feature matrix.
Specifically, the positions of the matrices are different when the matrices are multiplied, the meaning of the final multiplication result representation is also different, and the second feature matrix is mainly used for preparing data for the subsequent right singular matrix.
And step S510, obtaining a right singular matrix based on the second feature matrix.
The right singular matrix is an eigenvector matrix, and is a set of right singular vectors arranged according to a rectangular array. The right singular vector is a feature vector for representing some characteristics of the target network request data (document), and is also a direction vector describing the maximum effect of the matrix.
Specifically, the intermediate word frequency feature matrix is subjected to dimension reduction processing, a right singular matrix needs to be known, column vectors in the right singular matrix are right singular vectors, and a left singular matrix, an intermediate singular value matrix and the right singular matrix are data necessary for performing singular value dimension reduction decomposition on the intermediate word frequency feature.
And step S512, fusing the left singular matrix, the intermediate singular value matrix and the right singular matrix to obtain the target word frequency characteristic.
Specifically, in order to retain the key feature dimensions of the intermediate word frequency feature matrix, the left singular matrix, the intermediate singular value matrix and the right singular matrix are multiplied in sequence, thereby reducing the dimension of the intermediate word frequency feature matrix and obtaining the target word frequency feature.
In this embodiment, the left singular matrix, the intermediate singular value matrix and the right singular matrix are computed from the intermediate word frequency feature matrix and then fused to obtain the target word frequency feature. This extracts the key part of the intermediate word frequency feature matrix, helps retain the most useful information, and facilitates the subsequent learning and training of the initial classification model, thereby improving the classification ability of the target class identification model and the efficiency of detecting and identifying attack classes.
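As a concrete illustration of steps S500 to S512, the following is a minimal sketch of the singular-value-decomposition dimension reduction, assuming Python with NumPy and scikit-learn; the retained component count (100) is an illustrative choice, not a value fixed by this embodiment.

```python
# Sketch only: truncated SVD of a TF-IDF (intermediate word frequency) matrix.
import numpy as np
from sklearn.decomposition import TruncatedSVD

def reduce_word_freq_matrix(tfidf_matrix, n_components=100):
    # TruncatedSVD internally uses the left/right singular vectors and the
    # singular values (square roots of the eigenvalues of X^T X) and keeps
    # only the n_components largest ones.
    svd = TruncatedSVD(n_components=n_components, random_state=0)
    return svd.fit_transform(tfidf_matrix), svd   # rows: documents, cols: reduced features

def svd_rank_k(X, k):
    # Explicit dense equivalent: U, S, Vt are the left singular matrix,
    # singular values and right singular matrix; the product of their
    # truncated forms is the rank-k approximation described above.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]
```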
In one embodiment, as shown in fig. 6, step S208 includes:
step S600, calculating the total number of categories of each category corresponding to each network field in the target network request data.
Wherein the total number of categories refers to the number of categories corresponding to each network field in the target network request data.
Specifically, in this embodiment, before hard coding is used to perform category coding on each network field in the target network request data, the total number of categories corresponding to each network field must be known; this provides the end point at which the subsequent numerical coding stops.
Step S602, sequentially performing numerical coding on each category of each network field until all categories of each network field, up to its total number of categories, have been coded, to obtain the target category feature corresponding to each network field.
Wherein numerical coding refers to coding text with numbers.
Specifically, in order to obtain more feature information about the target network request data and to convert the text of each network field into numerical data that a machine learning model can process, numerical coding is performed on the text of each network field by direct counting. For example, the request method field (method) contains the two categories GET and POST, so the corresponding total number of categories is 2; during numerical coding GET is replaced by 1 and POST by 2. Similarly, if the user agent information field can take 100 distinct text values, those categories are numerically coded 1 to 100 in sequence.
In this embodiment, the target category feature is obtained by performing category coding on the target network request data, so that the text corresponding to each network field in the target network request data is converted into numerical data, which improves the processing efficiency of the machine learning model in subsequent steps. In addition, extracting the target category features enriches the target text features finally input into the initial classification model, which helps the initial classification model learn the patterns in the target network request data during training, thereby improving the efficiency with which the machine learning model identifies attack types.
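The counting-style numerical coding described above can be sketched as follows; this is an illustrative example rather than the exact implementation, and the field values (GET, POST) simply repeat the example in the text.

```python
# Sketch: map each distinct category of one network field to 1..K
# in order of first appearance (GET -> 1, POST -> 2, ...).
def encode_field(values):
    mapping = {}
    codes = []
    for v in values:
        if v not in mapping:
            mapping[v] = len(mapping) + 1
        codes.append(mapping[v])
    return codes, mapping

codes, mapping = encode_field(["GET", "POST", "GET", "POST"])
# codes == [1, 2, 1, 2]; mapping == {"GET": 1, "POST": 2}
```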
In one embodiment, as shown in fig. 7, step S212 includes:
step S700, calculating a splitting gain corresponding to each text feature in the target text features based on the initial decision tree model.
Split gain is the feature selection criterion used during decision tree generation to measure how well a split divides a data set into subsets of higher purity.
Specifically, in this embodiment the initial decision tree model is a LightGBM model. LightGBM uses a histogram-based decision tree algorithm in which a histogram must be built for each feature; before the current node is divided and its histogram is built, the text feature with the strongest classification ability among the target text features is determined by calculating the splitting gain corresponding to each text feature.
Step S702, determining the root node division characteristics based on the comparison result of the splitting gains corresponding to the text characteristics.
The root node division feature is the text feature with the strongest ability to classify and distinguish among the target text features.
Specifically, before determining the root node division feature, the splitting gain corresponding to each text feature in the target text feature is calculated, and the text feature corresponding to the maximum splitting gain is screened from the target text features to be used as the root node division feature.
Step S704, constructing a first histogram based on the root node division feature, where the first histogram includes a target number of bins, each bin holds the corresponding total first-order derivative value and total second-order derivative value of its samples, and the samples are sample data in the target network request data.
The first histogram is the histogram formed by binning all feature values of the root node division feature into individual bins. The total first-order derivative value of the samples is the sum of the first-order derivatives of the loss function over the samples in the corresponding bin. The total second-order derivative value of the samples is the sum of the second-order derivatives of the loss function over the samples in the corresponding bin. The target number is the predefined number of bins.
Specifically, constructing the first histogram is the process of binning the root node division feature: the continuous floating-point feature values of that feature are discretized into a target number of integers, a histogram whose width equals the target number is built on that basis, and for each bin the accumulated first-order derivatives of the loss function, the accumulated second-order derivatives, and the number of samples are computed, thereby constructing the corresponding first histogram.
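The histogram construction can be sketched as follows, assuming Python with NumPy; the quantile-based binning and the bin count of 255 are illustrative assumptions rather than details fixed by this embodiment.

```python
import numpy as np

def build_histogram(feature_values, grads, hessians, n_bins=255):
    """Discretize one feature into n_bins and accumulate, per bin, the
    gradient (first-order derivative) sum, hessian (second-order
    derivative) sum and sample count."""
    feature_values = np.asarray(feature_values, dtype=float)
    edges = np.quantile(feature_values, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, feature_values)      # integer bin index per sample
    grad_sum = np.zeros(n_bins)
    hess_sum = np.zeros(n_bins)
    count = np.zeros(n_bins, dtype=int)
    for b, g, h in zip(bins, grads, hessians):
        grad_sum[b] += g
        hess_sum[b] += h
        count[b] += 1
    return bins, grad_sum, hess_sum, count
```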
Step S706, obtaining a plurality of first split thresholds based on the first derivative total value of the samples and the second derivative total value of the samples corresponding to each bin.
The first splitting threshold refers to an index for judging the optimal dividing point of the histogram.
Specifically, before the first histogram is divided, its optimal dividing point must be found. A continuous feature has only one splitting threshold, whereas a discrete feature may have several, each corresponding to one bin. The splitting threshold corresponding to each bin is obtained as the total first-order derivative value of its samples divided by the total second-order derivative value of its samples plus a regularization term, so the splitting thresholds can be calculated from the per-bin first-order and second-order derivative totals of the first histogram, preparing the data for the subsequent steps.
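For reference, the following sketch scans the per-bin statistics for a candidate dividing point; the gain formula shown is the standard gradient-boosting form (built from the same gradient and hessian totals as the per-bin ratio described above), and the regularization value is an illustrative assumption.

```python
import numpy as np

def best_split_from_histogram(grad_sum, hess_sum, reg_lambda=1.0):
    """Scan bin boundaries of one histogram and return the boundary with
    the largest gain, using G_L^2/(H_L+l) + G_R^2/(H_R+l) - G^2/(H+l)."""
    G, H = grad_sum.sum(), hess_sum.sum()
    GL = np.cumsum(grad_sum)[:-1]          # left totals for each candidate boundary
    HL = np.cumsum(hess_sum)[:-1]
    GR, HR = G - GL, H - HL
    gains = GL**2 / (HL + reg_lambda) + GR**2 / (HR + reg_lambda) - G**2 / (H + reg_lambda)
    best = int(np.argmax(gains))
    return best, gains[best]               # boundary index and its gain
```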
Step S708, determining a first target split threshold based on the comparison result of the plurality of first split thresholds, dividing the first histogram based on the first target split threshold, and generating branch nodes, where each branch node contains sample data corresponding to the target network request data.
The first target splitting threshold is the largest of the first splitting thresholds and is used to determine the optimal splitting point of the first histogram. Branch nodes are the child nodes obtained by dividing the root node.
Specifically, the largest of the first splitting thresholds is selected as the first target splitting threshold, the first histogram is divided at the corresponding position, and new decision tree nodes are generated. A continuous feature has only one splitting threshold, whereas a discrete feature may have several, each corresponding to one bin; when splitting on a discrete feature, a sample is added to the left subtree if its bin number falls within the bin set associated with that feature's splitting threshold, and to the right subtree otherwise.
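The routing of samples at a discrete split described above can be sketched as follows; the left_bin_set argument is a hypothetical name for the bin set associated with the chosen splitting threshold.

```python
def route_samples(sample_bins, left_bin_set):
    """Send a sample to the left subtree when its bin index is in the
    split's bin set, otherwise to the right subtree."""
    left = [i for i, b in enumerate(sample_bins) if b in left_bin_set]
    right = [i for i, b in enumerate(sample_bins) if b not in left_bin_set]
    return left, right

# Example: bins {0, 2} go left, everything else right.
left_idx, right_idx = route_samples([0, 1, 2, 3, 0], {0, 2})
# left_idx == [0, 2, 4], right_idx == [1, 3]
```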
Step S710, calculating a splitting gain of each text feature corresponding to each branch node, and determining a node division feature of each branch node based on a comparison result of the splitting gain of each text feature corresponding to each branch node.
The node division feature of a node is obtained by calculating the splitting gain of every text feature for which a histogram has not yet been constructed at the current node, and selecting the text feature with the maximum splitting gain as the node division feature.
Specifically, after new nodes are generated, each node holds its own sample data, and each sample contains the target text features. When the current node is a division node, the splitting gains of the text features that have not yet served as node division features at this node are calculated, and the text feature with the maximum splitting gain is selected as the node division feature of the node, so as to divide the node.
Step S712, constructing a corresponding current histogram based on each node division feature, and obtaining a plurality of current splitting thresholds corresponding to the current histogram based on the total first-order derivative value and the total second-order derivative value of the samples in each bin of the current histogram.
The current histogram refers to a corresponding histogram when the current node is a partition node. The current split threshold refers to the split threshold corresponding to the current histogram.
Specifically, each time the node division feature of the current division node is determined, a corresponding current histogram is constructed for that feature, and the current splitting thresholds of the current histogram are calculated from the total first-order derivative value and total second-order derivative value of the samples in each of its bins, which provides the judgment basis for the subsequent division of the current histogram.
Step S714, determining a current target split threshold based on the comparison result of the multiple current split thresholds corresponding to the current histogram, and dividing the current histogram based on the current target split threshold to generate a new branch node.
Wherein, the current target splitting threshold value refers to the maximum current splitting threshold value selected from the current splitting threshold values, and the splitting threshold value can determine the optimal dividing point of the current histogram.
Specifically, the largest of the current splitting thresholds corresponding to the current histogram is selected as the current target splitting threshold, the dividing point of the current histogram is determined from it, and the current histogram is divided, thereby generating new branch nodes.
Step S716, repeating the operations of calculating the splitting gain of each text feature at each branch node and determining the node division feature of each branch node based on the comparison of those splitting gains, until a corresponding current histogram has been constructed for every text feature in the target text features, and determining the target category recognition model.
Specifically, the node division feature of each branch node is obtained, the current histogram corresponding to that feature is constructed, the optimal dividing point in the current histogram is found, and new branch nodes are generated until every sample has a definite attack category. The probabilities of the various attack categories are then computed from the classification path of each sample, and the attack category with the highest probability is taken as the target attack category. In one embodiment, if the probabilities of all attack categories are below 0.5, the final result is judged to be no web attack threat. In addition, the initial classification model is trained multiple times to obtain several decision tree models with different parameters, and the decision tree model with the strongest classification ability and best generalization performance is selected as the target class identification model.
In this embodiment, the initial decision tree model is trained on the target text features and the target network request data to obtain the target class identification model. Because the target text features contain abundant useful information, the initial decision tree model can learn the patterns in the target network request data over multiple rounds of training, the model with the best classification ability during training is selected as the target class identification model, and the efficiency with which the target class identification model detects and identifies network attack classes is greatly improved.
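The training and model-selection step can be sketched with the lightgbm package as follows; the hyperparameter grid, the validation split and the macro-F1 selection criterion are illustrative assumptions rather than values specified by this embodiment.

```python
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def train_candidates(X, y, param_grid):
    """Train several LightGBM classifiers with different parameters and
    keep the one with the best validation score."""
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    best_model, best_score = None, -1.0
    for params in param_grid:
        model = lgb.LGBMClassifier(objective="multiclass", **params)
        model.fit(X_tr, y_tr)
        score = f1_score(y_val, model.predict(X_val), average="macro")
        if score > best_score:
            best_model, best_score = model, score
    return best_model

param_grid = [{"num_leaves": 31, "learning_rate": 0.1},
              {"num_leaves": 63, "learning_rate": 0.05}]
# target_model = train_candidates(target_text_features, labels, param_grid)
```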
In one embodiment, as shown in fig. 8, after step S212, the method further includes:
step S800, obtaining network request data to be identified, and performing text field expansion on the network request data to be identified to obtain target network request data to be identified.
The network request data to be identified refers to web request data to be input into a target class identification model for judging the network attack class of the network request data.
Specifically, before the network request to be identified is input into the target class identification model, the data needs to be subjected to missing value processing and text field expansion to obtain more useful information of the data, so that text features corresponding to the data can be extracted better.
Step S802, selecting target network field data to be identified from the target network request data to be identified, and calculating current target word frequency and current target reverse file frequency corresponding to each word in the target network field data to be identified.
The target network field data to be identified are the network fields, selected from the target network request data to be identified, that have a larger influence on network attack type detection and identification. The current target word frequency is the TF word frequency corresponding to each word in the target network field data to be identified. The current target reverse file frequency is the IDF value corresponding to each word in that field data.
Specifically, in order to obtain word frequency characteristics (TF-IDF characteristics) corresponding to each word in the network field data to be identified, TF values and IDF values corresponding to each word need to be calculated, so as to prepare data for a subsequent process.
Step S804, the current target word frequency corresponding to each word in the target network field data to be identified and the current target reverse file frequency are fused, and a current intermediate word frequency characteristic matrix is obtained.
The fusion refers to multiplying the current target word frequency corresponding to each word by the current target reverse text frequency to obtain the current TF-IDF value corresponding to each word, and integrating the values into a current intermediate word frequency feature matrix according to the arrangement of the square array. The current intermediate word frequency feature matrix refers to a TF-IDF matrix corresponding to each word extracted from the network field data to be identified.
Specifically, when the product of the current target word frequency corresponding to each word and the current target reverse file frequency is calculated, each product value corresponding to each word is integrated into a matrix to form a current intermediate word frequency feature matrix.
And step S806, performing dimension reduction processing on the current intermediate word frequency feature matrix to obtain the current target word frequency feature.
The current target word frequency feature refers to a matrix obtained after the dimension reduction of the current intermediate word frequency feature matrix, and the matrix contains key features of the target network request data to be identified.
Specifically, after an intermediate word frequency feature matrix corresponding to the network request data to be identified is obtained, performing dimension reduction processing on the intermediate word frequency feature matrix by adopting a singular value decomposition method to obtain dimension reduced target word frequency features.
Step S808, performing category coding on each network field in the network request data to be identified to obtain the current target category characteristics corresponding to each network field.
The current target category features are the numerical features obtained by converting each network field in the target network request data to be identified.
Specifically, calculating the total number of categories corresponding to each network field in the network request data to be identified, sequentially carrying out numerical coding on the categories corresponding to each network field, converting the text into a numerical form, and obtaining the characteristics of the current target category corresponding to each network field.
Step 810, fusing the current target word frequency characteristic and the current target category characteristic to obtain a current target text characteristic.
The fusion refers to the operation of arranging and integrating the current word frequency characteristics and the current target category characteristics into a matrix according to a rectangular array. The current target text feature refers to a set of all features extracted from the target network request data to be identified.
Specifically, the current target word frequency features and the current target category features are combined into one matrix to form the current target text features, so that they can be conveniently input into the target category recognition model.
Step S812, a target class identification model is obtained, the current target text characteristics and the target network request data to be identified are input into the target class identification model, and the target network attack class corresponding to the network request data to be identified is obtained.
The target network attack category refers to an attack category of network request data to be identified through detection and identification of a target category identification model.
Specifically, the current target text features and the target network request data to be identified are input into the target class identification model, which calculates the predicted probability of each attack class; the attack class with the highest probability is selected as the target network attack class, and when the highest probability is below 0.5 the network request data to be identified is judged to carry no web attack threat.
In this embodiment, the target network request data to be identified and the corresponding target text features are acquired and input into the target class identification model, which detects and identifies the network attack class present in the network request data to be identified; this improves the efficiency of detecting and identifying the target network attack class in the network request data to be identified.
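A minimal inference sketch for one preprocessed request follows, applying the 0.5 threshold described above; the function and variable names are hypothetical, and the model is assumed to expose scikit-learn-style predict_proba and classes_ attributes.

```python
import numpy as np

NO_THREAT = "no_web_attack_threat"

def identify_attack(request_features, model, threshold=0.5):
    """Return the predicted attack class for one request, or NO_THREAT
    when every class probability falls below the threshold."""
    probs = model.predict_proba(np.asarray(request_features).reshape(1, -1))[0]
    if probs.max() < threshold:
        return NO_THREAT, probs
    return model.classes_[int(np.argmax(probs))], probs
```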
In one embodiment, web request data collected from financial platforms, web sites and the like, together with the corresponding labels, are used as the original network request data. The original network request data are subjected to missing value processing and text field expansion to obtain the corresponding target network request data; target network field data are selected from the target network request data, and the target word frequency and target reverse file frequency corresponding to each word in the target network field data are calculated; an intermediate word frequency feature matrix is obtained from the target word frequency and target reverse file frequency of each word, and singular value decomposition is performed on this matrix to obtain the reduced-dimension target word frequency features; numerical category coding is performed on each network field in the target network request data, that is, the text categories of each network field are coded by counting, to obtain the corresponding target category features; the target text features are obtained from the target word frequency features and the target category features; and the initial decision tree model is trained multiple times to obtain several decision tree models with different parameters, from which the model with the strongest classification ability is selected as the target class identification model. Network request data to be identified are then acquired from a platform or website, the target network request data to be identified and the corresponding current target text features are extracted and input into the target class identification model, the probability that the network request data to be identified belongs to each attack class is calculated, and it is judged whether any attack class has a probability greater than 0.5. If so, the attack class with the highest probability among those exceeding 0.5 is taken as the target network attack class corresponding to the network request data to be identified; if not, it is judged that the network request data to be identified contains no network attack, that is, no web attack threat exists. In one embodiment, the processing flow of the original network request data is shown in fig. 9, where SVD refers to singular value decomposition and the web threat probabilities are the probabilities corresponding to the various attack categories. Furthermore, in some embodiments, the initial decision tree model may employ machine learning classifiers other than the LightGBM model, such as AdaBoost or XGBoost. It can be seen from the above that detecting and identifying the network attack class based on a decision tree model helps improve the efficiency of detecting and identifying network attack classes.
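Putting the pieces together, the following end-to-end sketch wires the feature extraction and training steps with scikit-learn and lightgbm; the field names ("url", "body", "method", "user_agent"), the 100-component SVD and all parameter values are illustrative assumptions, not details fixed by this application.

```python
import numpy as np
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import OrdinalEncoder
import lightgbm as lgb

def build_features(df, text_cols, cat_cols, vecs=None, svd=None, enc=None):
    """TF-IDF + truncated SVD on the selected text fields, ordinal codes on
    the categorical fields; pass the fitted transformers back in at inference."""
    if vecs is None:
        vecs = {c: TfidfVectorizer().fit(df[c].fillna("unknown")) for c in text_cols}
    tfidf = hstack([vecs[c].transform(df[c].fillna("unknown")) for c in text_cols])
    if svd is None:
        svd = TruncatedSVD(n_components=100, random_state=0).fit(tfidf)
    word_freq = svd.transform(tfidf)
    if enc is None:
        enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
        enc.fit(df[cat_cols].fillna("unknown"))
    cats = enc.transform(df[cat_cols].fillna("unknown"))
    return np.hstack([word_freq, cats]), vecs, svd, enc

# X, vecs, svd, enc = build_features(train_df, ["url", "body"], ["method", "user_agent"])
# model = lgb.LGBMClassifier(objective="multiclass").fit(X, labels)
```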
Based on the same inventive concept, an embodiment of the application also provides a class identification model generating device for implementing the class identification model generation method described above. The solution implemented by the device is similar to that described in the above method, so for the specific limitations of the class identification model generating device embodiments provided below, reference may be made to the limitations of the class identification model generation method above, which are not repeated here.
In one embodiment, as shown in fig. 10, there is provided a class identification model generating apparatus including: a data processing module 1000, a computing module 1002, a matrix generation module 1004, a first feature determination module 1006, a second feature determination module 1008, a target feature determination module 1010, and a model generation module 1012, wherein:
the data processing module 1000 is configured to obtain original network request data, and perform text field expansion on the original network request data to obtain target network request data.
A calculating module 1002, configured to select target network field data from the target network request data, and calculate a target word frequency and a target reverse file frequency corresponding to each word in the target network field data.
And the matrix generating module 1004 is configured to fuse the target word frequency corresponding to each word in the target network field data with the target reverse file frequency to obtain an intermediate word frequency feature matrix.
And the first feature determining module 1006 is configured to perform dimension reduction processing on the intermediate word frequency feature matrix to obtain a target word frequency feature.
And the second feature determining module 1008 is configured to perform category encoding on each network field in the target network request data, so as to obtain a target category feature corresponding to each network field.
And the target feature determining module 1010 is configured to fuse the target word frequency feature and the target category feature to obtain a target text feature.
The model generating module 1012 is configured to train the initial decision tree model based on the target text feature and the target network request data to obtain a target class identification model, where the target class identification model is configured to obtain a target network attack class based on the network request data to be identified.
In one embodiment, the data processing module 1000 is further configured to search for missing network fields in the original network request data, so as to obtain target missing positions corresponding to each missing network field; acquiring target special characters, and adding the target special characters in each target missing position to obtain intermediate network request data; and obtaining a target expansion field, selecting a network field to be expanded from the intermediate network request data, and adding the target expansion field at a position corresponding to the network field to be expanded to obtain the target network request data.
In one embodiment, the calculating module 1002 is further configured to calculate the number of times that each word in the target network field data appears in each target network request data corresponding to the target network field data, to obtain a target number; calculating the sum of the occurrence times of each word in each target network request data corresponding to the target network field data to obtain the total number of targets; obtaining target word frequencies corresponding to words in the target network field data based on the proportion of the target number to the corresponding target total number; taking the total number of the target network request data corresponding to the target network field data as a first total number; taking the total number of the target network request data corresponding to each word in the target network field data as a second total number; calculating the ratio of the first total number to the corresponding second total number, and taking the ratio as a target ratio; and calculating the logarithm of the target proportion to obtain the target reverse file frequency corresponding to each word in the target network field data.
In one embodiment, the first feature determining module 1006 is further configured to calculate a transposed matrix corresponding to the intermediate word frequency feature matrix, to obtain a transposed word frequency feature matrix; fusing the transposed word frequency feature matrix and the intermediate word frequency feature matrix to obtain a first feature matrix; obtaining a left singular matrix based on the first feature matrix; calculating square roots corresponding to all eigenvalues in the first eigenvalue matrix, and fusing the square roots corresponding to all eigenvalues to obtain an intermediate singular value matrix; fusing the intermediate word frequency feature matrix and the transposed word frequency feature matrix to obtain a second feature matrix; obtaining a right singular matrix based on the second feature matrix; and fusing the left singular matrix, the intermediate singular value matrix and the right singular matrix to obtain the target word frequency characteristic.
In one embodiment, the second feature determining module 1008 is further configured to calculate a total number of categories of each category corresponding to each network field in the target network request data; and sequentially carrying out numerical coding on each category of each network field until the total number of categories corresponding to each category of each network field is coded, and obtaining the target category characteristics corresponding to each network field.
In one embodiment, the model generating module 1012 is further configured to calculate, based on the initial decision tree model, a splitting gain corresponding to each text feature in the target text feature; determining root node division characteristics based on comparison results of the splitting gains corresponding to the text characteristics; constructing a first histogram based on the root node partition characteristics, wherein the first histogram comprises a target number of bins, each bin in the target number of bins comprises a corresponding first-order derivative total value and a second-order derivative total value of samples, and the samples are sample data in the target request data; obtaining a plurality of first splitting thresholds based on the first derivative total value of the sample and the second derivative total value of the sample corresponding to each bin; determining a first target splitting threshold based on the comparison result of the plurality of first splitting thresholds, dividing the first histogram based on the first target splitting threshold, and generating branch nodes, wherein each branch node contains sample data corresponding to the target network request data; calculating the splitting gain of each text feature corresponding to each branch node, and determining the node division feature of each branch node based on the comparison result of the splitting gain of each text feature corresponding to each branch node; constructing a corresponding current histogram based on each node partition characteristic, and obtaining a plurality of current split thresholds corresponding to the current histogram based on a first-order total value and a second-order total value of samples of each bin corresponding to the current histogram; determining a current target splitting threshold value based on comparison results of a plurality of current splitting threshold values corresponding to the current histogram, dividing the current histogram based on the current target splitting threshold value, and generating a new branch node; repeating the calculation of the splitting gain of each text feature corresponding to each branch node, determining the node division feature operation of each branch node based on the comparison result of the splitting gain of each text feature corresponding to each branch node until each text feature in the target text feature builds a corresponding current histogram, and determining the target category recognition model.
In one embodiment, the class identification model generating device further includes a model using module 1014, configured to obtain network request data to be identified, and perform text field expansion on the network request data to be identified to obtain target network request data to be identified; selecting target network field data to be identified from the target network request data to be identified, and calculating current target word frequency and current target reverse file frequency corresponding to each word in the target network field data to be identified; fusing the current target word frequency corresponding to each word in the target network field data to be identified with the current target reverse file frequency to obtain a current intermediate word frequency characteristic matrix; performing dimension reduction processing on the current intermediate word frequency feature matrix to obtain current target word frequency features; performing category coding on each network field in the network request data to be identified to obtain the current target category characteristics corresponding to each network field; fusing the current target word frequency characteristics and the current target category characteristics to obtain current target text characteristics; and acquiring a target class identification model, and inputting the current target text characteristics and the target network request data to be identified into the target class identification model to obtain a target network attack class corresponding to the network request data to be identified.
The respective modules in the above-described class identification model generation apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 11. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is for storing network request data. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a class identification model generation method.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure thereof may be as shown in fig. 12. The computer device includes a processor, a memory, an input/output interface, a communication interface, a display unit, and an input means. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface, the display unit and the input device are connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a class identification model generation method. The display unit of the computer device is used for forming a visual picture, and can be a display screen, a projection device or a virtual reality imaging device. The display screen can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be a key, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structures shown in fig. 11 and 12 are block diagrams of only portions of structures that are relevant to the present application and are not intended to limit the computer device on which the present application may be implemented, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided that includes a memory storing a computer program and a processor that, when executing the computer program, implements the steps of the method embodiments described above.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method embodiments described above.
In one embodiment, a computer program product or computer program is provided that includes computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the steps in the above-described method embodiments.
It should be noted that, the user information (including, but not limited to, user equipment information, user personal information, etc.) and the data (including, but not limited to, data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party, and the collection, use and processing of the related data are required to comply with the related laws and regulations and standards of the related countries and regions.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The nonvolatile Memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash Memory, optical Memory, high density embedded nonvolatile Memory, resistive random access Memory (ReRAM), magnetic random access Memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric Memory (Ferroelectric Random Access Memory, FRAM), phase change Memory (Phase Change Memory, PCM), graphene Memory, and the like. Volatile memory can include random access memory (Random Access Memory, RAM) or external cache memory, and the like. By way of illustration, and not limitation, RAM can be in the form of a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM), and the like. The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum computing-based data processing logic units, etc., without being limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples only represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the present application. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (10)

1. A class identification model generation method, the method comprising:
acquiring original network request data, and performing text field expansion on the original network request data to obtain target network request data;
selecting target network field data from the target network request data, and calculating target word frequency and target reverse file frequency corresponding to each word in the target network field data;
Fusing the target word frequency corresponding to each word in the target network field data with the target reverse file frequency to obtain an intermediate word frequency characteristic matrix;
performing dimension reduction processing on the intermediate word frequency feature matrix to obtain target word frequency features;
performing category coding on each network field in the target network request data to obtain target category characteristics corresponding to each network field;
fusing the target word frequency characteristics and the target category characteristics to obtain target text characteristics;
training an initial decision tree model based on the target text features and the target network request data to obtain a target class identification model, wherein the target class identification model is used for obtaining a target network attack class based on the network request data to be identified.
2. The method of claim 1, wherein the obtaining the original network request data, performing text field expansion on the original network request data, and obtaining the target network request data comprises:
searching the missing network fields in the original network request data to obtain target missing positions corresponding to the missing network fields;
Acquiring target special characters, and adding the target special characters in each target missing position to obtain intermediate network request data;
and obtaining a target expansion field, selecting a network field to be expanded from the intermediate network request data, and adding the target expansion field at a position corresponding to the network field to be expanded to obtain the target network request data.
3. The method of claim 1, wherein selecting target network field data from the target network request data, and calculating a target word frequency and a target reverse file frequency corresponding to each word in the target network field data comprises:
calculating the number of times of each word in the target network field data in each target network request data corresponding to the target network field data to obtain the target number;
calculating the sum of the occurrence times of each word in each target network request data corresponding to the target network field data to obtain the total number of targets;
obtaining target word frequencies corresponding to words in the target network field data based on the proportion of the target number to the corresponding target total number;
taking the total number of the target network request data corresponding to the target network field data as a first total number;
Taking the total number of the target network request data corresponding to each word in the target network field data as a second total number;
calculating the ratio of the first total number to the corresponding second total number, and taking the ratio as a target ratio;
and calculating the logarithm of the target proportion to obtain the target reverse file frequency corresponding to each word in the target network field data.
4. The method of claim 1, wherein the performing the dimension reduction on the intermediate word frequency feature matrix to obtain a target word frequency feature comprises:
calculating a transposed matrix corresponding to the intermediate word frequency feature matrix to obtain a transposed word frequency feature matrix;
fusing the transposed word frequency feature matrix and the intermediate word frequency feature matrix to obtain a first feature matrix;
obtaining a left singular matrix based on the first feature matrix;
calculating square roots corresponding to all eigenvalues in the first eigenvalue matrix, and fusing the square roots corresponding to all eigenvalues to obtain an intermediate singular value matrix;
fusing the intermediate word frequency feature matrix and the transposed word frequency feature matrix to obtain a second feature matrix;
obtaining a right singular matrix based on the second feature matrix;
And fusing the left singular matrix, the intermediate singular value matrix and the right singular matrix to obtain the target word frequency characteristic.
5. The method of claim 1, wherein the performing category encoding on each network field in the target network request data to obtain the target category feature corresponding to each network field comprises:
calculating the total number of categories of each category corresponding to each network field in the target network request data;
and sequentially carrying out numerical coding on each category of each network field until the total number of categories corresponding to each category of each network field is coded, and obtaining the target category characteristics corresponding to each network field.
6. The method of claim 1, wherein training an initial decision tree model based on the target text feature and the target network request data to obtain a target class identification model comprises:
calculating splitting gains corresponding to all text features in the target text features based on the initial decision tree model;
determining root node division characteristics based on comparison results of the splitting gains corresponding to the text characteristics;
Constructing a first histogram based on the root node partition characteristics, wherein the first histogram comprises a target number of bins, each bin in the target number of bins comprises a corresponding first-order derivative total value and a second-order derivative total value of samples, and the samples are sample data in the target request data;
obtaining a plurality of first splitting thresholds based on the first derivative total value of the sample and the second derivative total value of the sample corresponding to each bin;
determining a first target splitting threshold based on the comparison result of the plurality of first splitting thresholds, dividing the first histogram based on the first target splitting threshold, and generating branch nodes, wherein each branch node contains sample data corresponding to the target network request data;
calculating the splitting gain of each text feature corresponding to each branch node, and determining the node division feature of each branch node based on the comparison result of the splitting gain of each text feature corresponding to each branch node;
constructing a corresponding current histogram based on each node partition characteristic, and obtaining a plurality of current split thresholds corresponding to the current histogram based on a first-order total value and a second-order total value of samples of each bin corresponding to the current histogram;
Determining a current target splitting threshold value based on comparison results of a plurality of current splitting threshold values corresponding to the current histogram, dividing the current histogram based on the current target splitting threshold value, and generating a new branch node;
repeating the calculation of the splitting gain of each text feature corresponding to each branch node, determining the node division feature operation of each branch node based on the comparison result of the splitting gain of each text feature corresponding to each branch node until each text feature in the target text feature builds a corresponding current histogram, and determining the target category recognition model.
7. The method of claim 1, wherein training the initial decision tree model based on the target text feature and the target network request data, after obtaining a target class identification model, further comprises:
acquiring network request data to be identified, and performing text field expansion on the network request data to be identified to obtain target network request data to be identified;
selecting target network field data to be identified from the target network request data to be identified, and calculating current target word frequency and current target reverse file frequency corresponding to each word in the target network field data to be identified;
Fusing the current target word frequency corresponding to each word in the target network field data to be identified with the current target reverse file frequency to obtain a current intermediate word frequency characteristic matrix;
performing dimension reduction processing on the current intermediate word frequency feature matrix to obtain current target word frequency features;
performing category coding on each network field in the network request data to be identified to obtain the current target category characteristics corresponding to each network field;
fusing the current target word frequency characteristics and the current target category characteristics to obtain current target text characteristics;
and acquiring a target class identification model, and inputting the current target text characteristics and the target network request data to be identified into the target class identification model to obtain a target network attack class corresponding to the network request data to be identified.
8. A class identification model generation apparatus, characterized in that the apparatus comprises:
the data processing module is used for acquiring original network request data, and performing text field expansion on the original network request data to obtain target network request data;
the calculation module is used for selecting target network field data from the target network request data and calculating target word frequency and target reverse file frequency corresponding to each word in the target network field data;
The matrix generation module is used for fusing the target word frequency corresponding to each word in the target network field data with the target reverse file frequency to obtain an intermediate word frequency characteristic matrix;
the first feature determining module is used for performing dimension reduction processing on the intermediate word frequency feature matrix to obtain target word frequency features;
the second feature determining module is used for carrying out category coding on each network field in the target network request data to obtain target category features corresponding to each network field;
the target feature determining module is used for fusing the target word frequency feature and the target category feature to obtain a target text feature;
the model generation module is used for training the initial decision tree model based on the target text characteristics and the target network request data to obtain a target class identification model, and the target class identification model is used for obtaining a target network attack class based on the network request data to be identified.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.
CN202211668533.8A 2022-12-24 2022-12-24 Class identification model generation method, device, computer equipment and storage medium Pending CN116186708A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211668533.8A CN116186708A (en) 2022-12-24 2022-12-24 Class identification model generation method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211668533.8A CN116186708A (en) 2022-12-24 2022-12-24 Class identification model generation method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116186708A true CN116186708A (en) 2023-05-30

Family

ID=86439406

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211668533.8A Pending CN116186708A (en) 2022-12-24 2022-12-24 Class identification model generation method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116186708A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117714193A (en) * 2023-12-28 2024-03-15 中国电子技术标准化研究院 Diagnostic method, diagnostic device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN103678702B (en) Video duplicate removal method and device
CN109697451B (en) Similar image clustering method and device, storage medium and electronic equipment
CN112559747B (en) Event classification processing method, device, electronic equipment and storage medium
CN110516210B (en) Text similarity calculation method and device
CN112052451A (en) Webshell detection method and device
CN114244603A (en) Anomaly detection and comparison embedded model training and detection method, device and medium
CN114330966A (en) Risk prediction method, device, equipment and readable storage medium
CN113691542A (en) Web attack detection method based on HTTP request text and related equipment
CN114840869A (en) Data sensitivity identification method and device based on sensitivity identification model
CN116186708A (en) Class identification model generation method, device, computer equipment and storage medium
CN115062134A (en) Knowledge question-answering model training and knowledge question-answering method, device and computer equipment
CN108595466B (en) Internet information filtering and internet user information and network card structure analysis method
CN113569933A (en) Trademark pattern matching method and corresponding device, equipment and medium
CN113807073B (en) Text content anomaly detection method, device and storage medium
CN116975340A (en) Information retrieval method, apparatus, device, program product, and storage medium
CN116150371A (en) Asset repayment plan mass data processing method based on sharingJDBC
CN115344563A (en) Data deduplication method and device, storage medium and electronic equipment
CN114528908A (en) Network request data classification model training method, classification method and storage medium
CN114282119A (en) Scientific and technological information resource retrieval method and system based on heterogeneous information network
CN114398980A (en) Cross-modal Hash model training method, encoding method, device and electronic equipment
CN114219571A (en) E-commerce independent site matching method and device, equipment, medium and product thereof
CN113312622A (en) Method and device for detecting URL (Uniform resource locator)
CN115204436A (en) Method, device, equipment and medium for detecting abnormal reasons of business indexes
CN117278322B (en) Web intrusion detection method, device, terminal equipment and storage medium
CN117938951B (en) Information pushing method, device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
CB02 Change of applicant information

Country or region after: China

Address after: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Applicant after: Zhaolian Consumer Finance Co.,Ltd.

Applicant after: SUN YAT-SEN University

Address before: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Applicant before: MERCHANTS UNION CONSUMER FINANCE Co.,Ltd.

Country or region before: China

Applicant before: SUN YAT-SEN University