CN112116078A

CN112116078A - Information security baseline learning method based on artificial intelligence

Info

Publication number: CN112116078A
Application number: CN202011002408.4A
Authority: CN
Inventors: 郑忠斌; 王朝栋; 彭新; 张雪帆; 严明; 梁晓萌
Original assignee: Industrial Internet Innovation Center Shanghai Co ltd
Current assignee: Industrial Internet Innovation Center Shanghai Co ltd
Priority date: 2020-09-22
Filing date: 2020-09-22
Publication date: 2020-12-22

Abstract

The invention discloses an information security baseline learning method based on artificial intelligence. A security baseline training model is constructed based on a deep learning algorithm, and through training the resulting security baseline classification model acquires strong self-learning and adaptive capability. In subsequent application, the security baseline classification model can therefore automatically generate, for different scenarios, an information security baseline capable of evaluating the information security of an industrial internet platform from multiple layers such as the network layer and the application layer, which removes the dependence on manual configuration by staff. Moreover, because the training feature data used to construct the security baseline classification model comes from data involved in the interaction between the client and the server, an information security baseline configured based on the model can monitor the security of the client, of the server, and of the interaction between the client and the server.

Description

Information security baseline learning method based on artificial intelligence

Technical Field

The embodiment of the invention relates to the technical field of information security, in particular to an information security baseline learning method based on artificial intelligence.

Background

The industrial internet is the result of the convergence of global industrial systems with advanced computing, analytics, sensing technologies and internet connectivity. In essence, equipment, production lines, factories, suppliers, products and customers are closely connected and integrated through an open, global industrial network platform, and the various resources of the industrial economy are shared efficiently. Automated and intelligent production thereby reduces cost and increases efficiency, extends the industrial chain of the manufacturing industry, and promotes its transformation. That is, by connecting intelligent machines, and ultimately humans and machines, in combination with software and big data analysis, the industrial internet restructures global industry, stimulates productivity, and makes the world better, faster, safer, cleaner and more economical.

At present, to ensure the operational security of an industrial internet platform, a security baseline is usually configured for the platform to support information security assessment, such as network-layer information security assessment and application-layer information security assessment.

However, existing information security baseline configurations for industrial internet platforms mostly monitor the security of the client only and do not fully consider the security of the network data packets sent by users, so an existing information security baseline cannot detect a distributed denial-of-service (DDoS) attack at all. Likewise, a baseline configured only for security monitoring of the client itself does not fully consider the security of the server, so anomalies of the server cannot be detected. Because security detection of the server is lacking and the information security baseline is generic and static, no targeted security judgment can be made for a particular interaction with the server.

In addition, the information security baseline of an industrial internet platform is currently configured by having the relevant staff manually set security evaluation thresholds, that is, a rigid security evaluation is performed against index data. This approach is inflexible, and because different scenarios cannot all be well served by the same configuration standard, staff must set different security evaluation thresholds for different scenarios, which undoubtedly increases the workload.

In summary, how to configure the information security baseline of an industrial internet automatically, while ensuring that the baseline is not limited to simple client-side security monitoring, is a technical problem that urgently needs to be solved.

Disclosure of Invention

The embodiment of the invention aims to provide an information security baseline learning method based on artificial intelligence, so as to solve the above technical problems.

In order to solve the above technical problem, an embodiment of the present invention provides an information security baseline learning method based on artificial intelligence, including the following steps:

based on business requirements, extracting training characteristic data from a sample data set which is constructed in advance, wherein the data stored in the sample data set is interactive data related to the interactive process between a client and a server;

constructing a safety baseline training model by using a preset deep learning algorithm;

and cleaning the training characteristic data, and performing iterative training on the safety baseline training model by using the cleaned training characteristic data until a preset iteration termination condition is met to obtain a safety baseline classification model.

The embodiment of the invention also provides an information security baseline learning device based on artificial intelligence, which comprises:

the extraction module is used for extracting training characteristic data from a sample data set which is constructed in advance based on business requirements, wherein the data stored in the sample data set is interactive data related to the interactive process of a client and a server;

the building module is used for building a safety baseline training model by utilizing a preset deep learning algorithm;

and the training module is used for cleaning the training characteristic data, and performing iterative training on the safety baseline training model by using the cleaned training characteristic data until a preset iteration termination condition is met to obtain a safety baseline classification model.

The embodiment of the invention also provides information security baseline learning equipment based on artificial intelligence, which comprises:

at least one processor; and,

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform an artificial intelligence based information security baseline learning method as described above.

Embodiments of the present invention also provide a computer-readable storage medium storing a computer program, which when executed by a processor implements the artificial intelligence based information security baseline learning method as described above.

In the information security baseline learning method based on artificial intelligence provided by the embodiment of the invention, a security baseline training model is constructed based on a deep learning algorithm, and through training the resulting security baseline classification model acquires strong self-learning and adaptive capability. In subsequent application, the security baseline classification model can therefore automatically generate, for different scenarios, an information security baseline capable of evaluating the information security of an industrial internet platform from multiple layers such as the network layer and the application layer, which removes the dependence on manual configuration by staff. Moreover, because the training feature data used to construct the security baseline classification model comes from data involved in the interaction between the client and the server, an information security baseline configured based on the model can monitor not only the security of the client but also the security of the server and of the interaction between the client and the server.

In addition, before extracting training feature data from the pre-constructed sample data set based on the business requirements, the method further includes: acquiring interactive data involved in the interaction between the client and the server in different network scenarios at different times, wherein the interactive data comprises secure data and non-secure data; based on different business requirements, selecting from the interactive data the secure data and non-secure data that meet the business requirements, to obtain sample secure data and sample non-secure data; and establishing a correspondence between the sample secure data and the client and server, and a correspondence between the sample non-secure data and the client and server, to obtain the sample data set; wherein each sample data in the sample data set comprises several feature values, and the feature value in the last column of the sample data identifies whether the sample data is secure data or non-secure data. In this embodiment, the data in the sample data set is real data, namely interactive data involved in the interaction between the client and the server of the industrial internet platform in different scenarios at different times, so a security baseline classification model trained on such real sample data can better adapt to the actual environment.

In addition, before extracting training feature data from the pre-constructed sample data set based on the business requirements, the method further includes: simulating a secure network scenario at different times, and simulating the sending and receiving of data packets by the client and the server when different services are executed in the secure network scenario, to obtain secure data; simulating an insecure network scenario at different times, and simulating the sending and receiving of data packets by the client and the server when different services are executed in the insecure network scenario, to obtain non-secure data; based on different business requirements, selecting from the secure data the data that meets the business requirements to obtain sample secure data, and selecting from the non-secure data the data that meets the business requirements to obtain sample non-secure data; and establishing a correspondence between the sample secure data and the client and server, and a correspondence between the sample non-secure data and the client and server, to obtain the sample data set; wherein each sample data in the sample data set comprises several feature values, and the feature value in the last column of the sample data identifies whether the sample data is secure data or non-secure data. In this embodiment, the data in the sample data set is pseudo data, namely data generated by scene simulation, so the sample data set can be constructed even when no real data is available.

In addition, constructing the security baseline training model by using the preset deep learning algorithm includes: constructing a neural network regression model by using a deep neural network algorithm to obtain the security baseline training model; the neural network regression model consists of an input layer, a plurality of hidden layers connected in sequence after the input layer, and an output layer connected to the last hidden layer; each hidden layer comprises a linear layer and an activation layer, wherein the linear layer computes the linear transformation weight values and Bayesian bias values of the data input into the hidden layer so as to determine a security value for the input data, and the activation layer maps the output of the hidden layer into the interval (0, 1). Because the neural network regression model constructed with the deep neural network algorithm serves as the security baseline training model, the structure of the training model is relatively simple and training is faster; by virtue of the properties of logistic regression, the output variable of the security baseline classification model trained from it can be readily interpreted as a probability; and the security baseline classification model is applicable to both continuous and discrete independent variables.
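The layer structure described above can be sketched in plain Python. This is an illustration under stated assumptions, not the patented implementation: the sigmoid function is assumed as the activation that maps into (0, 1), an ordinary bias term stands in for what the text calls the Bayesian bias value, and the layer sizes, the names `make_layer` and `forward`, and the input vector are all hypothetical.

```python
import math
import random


def sigmoid(z: float) -> float:
    """Numerically stable logistic function; maps a real number towards (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_layer(n_in: int, n_out: int, rng: random.Random) -> dict:
    """One layer: a weight matrix (n_out rows of n_in weights) and a bias vector."""
    return {
        "w": [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
        "b": [0.0] * n_out,
    }


def forward(layers: list, x: list) -> list:
    """Pass x through every layer: linear transform first, then activation."""
    for layer in layers:
        x = [
            sigmoid(sum(w_i * x_i for w_i, x_i in zip(row, x)) + bias)
            for row, bias in zip(layer["w"], layer["b"])
        ]
    return x


rng = random.Random(0)
# Hypothetical sizes: 5 input features, two hidden layers, one output unit.
network = [make_layer(5, 8, rng), make_layer(8, 4, rng), make_layer(4, 1, rng)]
score = forward(network, [0.2, 0.7, 0.1, 0.9, 0.4])[0]
```

Because every layer ends in a sigmoid, `score` always lies between 0 and 1 and can be read as a probability-like value, which is the property the paragraph above relies on.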

In addition, before the cleaning the training feature data, and performing iterative training on the safety baseline training model by using the cleaned training feature data until a preset iteration termination condition is met to obtain a safety baseline classification model, the method further includes: constructing an input matrix according to the training characteristic data in a preset format; the step of cleaning the training feature data, and performing iterative training on the safety baseline training model by using the cleaned training feature data until a preset iteration termination condition is met to obtain a safety baseline classification model includes: cleaning training characteristic data in the input matrix, inputting the cleaned training characteristic data in the input matrix as input data to the input layer in the safety baseline training model, so that the input layer transmits the input data to each network neuron in the adjacent hidden layers respectively, the network neurons calculate linear transformation weight values and Bayesian bias values of the input data in the linear layer, and the obtained linear transformation weight values and the Bayesian bias values are mapped between (0,1) through an activation layer and enter the next hidden layer until being processed by all hidden layers in the safety baseline training model and then output through the output layer; calculating a loss value of a training result output by the safety baseline training model based on a cross entropy loss function; and when the loss value tends to converge, determining that the preset iteration termination condition is met, stopping training the safe baseline training model, and taking the current safe baseline training model as the safe baseline classification model. 
This embodiment provides a specific training procedure that yields the required probability and maps the output into (0, 1), which greatly reduces computational difficulty and complexity.
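The stopping rule in this training procedure (compute a cross-entropy loss, stop once it converges) can be illustrated with a deliberately simplified sketch. A single sigmoid unit trained by gradient descent is substituted here for the multi-layer model, and the learning rate, the tolerance `tol`, the iteration cap and the toy rows are assumptions chosen only for illustration. The label convention follows the text: 1 for secure data, 0 for non-secure data.

```python
import math


def sigmoid(z: float) -> float:
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def cross_entropy(y_true: list, y_prob: list, eps: float = 1e-12) -> float:
    """Mean binary cross-entropy; eps guards against log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)


def train(rows, labels, lr=0.5, tol=1e-6, max_iter=10_000):
    """Gradient descent on one sigmoid unit; stop once the loss stops changing."""
    w = [0.0] * len(rows[0])
    b = 0.0
    prev_loss = float("inf")
    for iteration in range(1, max_iter + 1):
        probs = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) for x in rows]
        loss = cross_entropy(labels, probs)
        if abs(prev_loss - loss) < tol:  # the loss "tends to converge"
            break
        prev_loss = loss
        # Gradient of the mean cross-entropy for a sigmoid output is (p - y) * x.
        errors = [p - y for p, y in zip(probs, labels)]
        for j in range(len(w)):
            w[j] -= lr * sum(e * x[j] for e, x in zip(errors, rows)) / len(rows)
        b -= lr * sum(errors) / len(rows)
    return w, b, loss, iteration


# Toy, invented rows: two features per sample, label 1 = secure, 0 = non-secure.
rows = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.7]]
labels = [1, 1, 0, 0]
w, b, loss, iterations = train(rows, labels)
```

The convergence test compares consecutive loss values rather than the loss itself, so training ends when further iterations no longer improve the model, whatever the absolute loss level is.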

In addition, after cleaning the training feature data and iteratively training the security baseline training model with the cleaned training feature data until the preset iteration termination condition is met to obtain the security baseline classification model, the method further includes: acquiring, through a mirror interface (port mirroring), the data stream generated by the industrial internet platform to be monitored; analyzing the data stream with the security baseline classification model to obtain the probability that the data stream is a non-secure data stream; and raising a security alarm when the probability is greater than a preset threshold. In this embodiment, when the information security of the industrial internet platform is monitored with the obtained security baseline classification model, the data stream generated by the platform is acquired through the mirror interface, so the platform can be monitored without seriously affecting the normal throughput of the source port.
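A minimal sketch of the alarm step follows, assuming the trained model is available as a function that returns the probability that a flow is non-secure. The names `monitor` and `toy_scorer`, the threshold 0.8 and the feature layout are hypothetical; `toy_scorer` is only a stand-in for a trained classifier.

```python
from typing import Callable, Iterable, List, Sequence


def monitor(
    flows: Iterable[Sequence[float]],
    unsafe_probability: Callable[[Sequence[float]], float],
    threshold: float = 0.8,
) -> List[int]:
    """Return the indices of flows whose non-secure probability exceeds the threshold."""
    alarms = []
    for index, features in enumerate(flows):
        if unsafe_probability(features) > threshold:
            alarms.append(index)
    return alarms


def toy_scorer(features: Sequence[float]) -> float:
    """Stand-in scorer: treats the first feature (a retransmission rate) as the risk."""
    return min(1.0, features[0] * 2.0)


# Hypothetical feature vectors extracted from mirrored traffic.
mirrored_flows = [[0.05, 120.0], [0.45, 30.0], [0.10, 80.0]]
alarms = monitor(mirrored_flows, toy_scorer, threshold=0.8)
```

Passing the scorer in as a parameter keeps the alarm logic independent of how the classification model is implemented.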

In addition, the data stream is an encrypted data stream; the analyzing and processing the data stream by using the safety baseline classification model comprises: acquiring time characteristic information corresponding to each data packet in the encrypted data stream; and analyzing and processing the time characteristic information corresponding to each data packet by using the safety baseline classification model. According to the embodiment of the invention, for the encrypted data streams, the relevance between the data streams is determined by analyzing and processing the time characteristic information, and then the current data stream is determined to be the secure data stream or the non-secure data stream, so that the secure supervision of the encrypted data streams is realized.
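Since the payload of an encrypted stream cannot be inspected, only timing information is available as model input. The sketch below shows one way such time features might be derived from packet timestamps. The specific statistics chosen (duration, mean gap, gap standard deviation, maximum gap) and the name `timing_features` are assumptions; the source does not enumerate the time features it uses.

```python
from statistics import mean, pstdev
from typing import Dict, List


def timing_features(timestamps: List[float]) -> Dict[str, float]:
    """Summarise packet timing for one flow; payload bytes are never read."""
    ordered = sorted(timestamps)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    if not gaps:  # zero or one packet: no intervals to measure
        return {"duration": 0.0, "mean_gap": 0.0, "gap_stdev": 0.0, "max_gap": 0.0}
    return {
        "duration": ordered[-1] - ordered[0],
        "mean_gap": mean(gaps),
        "gap_stdev": pstdev(gaps),
        "max_gap": max(gaps),
    }


# Hypothetical arrival times, in seconds, of four packets in one encrypted flow.
features = timing_features([10.0, 10.5, 11.0, 13.0])
```

The resulting dictionary can be flattened into a feature vector and passed to the classification model in the same way as features from unencrypted traffic.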

Drawings

One or more embodiments are illustrated by way of example in the corresponding figures of the accompanying drawings, in which like reference numerals denote similar elements; the figures are not to scale unless otherwise specified.

FIG. 1 is a detailed flowchart of an artificial intelligence-based information security baseline learning method according to a first embodiment of the invention;

FIG. 2 is a network structure diagram of a security baseline training model involved in the artificial intelligence-based information security baseline learning method shown in FIG. 1;

FIG. 3 is a diagram illustrating an output result of the total active layer transformation of the hidden layer in the artificial intelligence based information security baseline learning method shown in FIG. 1 according to the first embodiment of the present invention;

FIG. 4 is a detailed flowchart of an artificial intelligence-based information security baseline learning method according to a second embodiment of the invention;

FIG. 5 is a schematic structural diagram of an artificial intelligence based information security baseline learning apparatus according to a third embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an artificial intelligence based information security baseline learning apparatus according to a fourth embodiment of the present invention.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the embodiments to help the reader better understand the present application; the technical solution claimed in the present application can nevertheless be implemented without these technical details, and with various changes and modifications based on the following embodiments. The embodiments are divided for convenience of description only; this division should not limit the specific implementation of the present invention, and the embodiments may be combined with and refer to one another where no contradiction arises.

The first embodiment relates to an information security baseline learning method based on artificial intelligence, which is applied to information security baseline learning equipment based on artificial intelligence.

The following describes implementation details of the information security baseline learning method based on artificial intelligence, and the following description is only provided for easy understanding and is not necessary to implement the present solution.

The specific flow of this embodiment is shown in FIG. 1 and includes the following steps:

Step 101: extract training feature data from a pre-constructed sample data set based on business requirements.

Specifically, the essence of the information security baseline learning method based on artificial intelligence provided in this embodiment is a training method or a construction method of a security baseline classification model (which may also be referred to as a security baseline classifier). Moreover, the safety baseline classification model obtained by training based on the artificial intelligence-based information safety baseline learning method provided by the embodiment is mainly applied to an industrial internet platform or an industrial internet system so as to realize the evaluation of the information safety of the industrial internet platform.

Therefore, the business requirement may be a requirement for the security of the industrial internet platform itself, a requirement for the security of the network layer, or a requirement for the security of the business layer.

It should be understood that the above examples are only examples for better understanding of the technical solution of the present embodiment, and are not to be taken as the only limitation to the present embodiment. In practical applications, there may also be business requirements for any combination of the several considerations listed above.

In addition, the main purpose of the information security baseline learning method based on artificial intelligence provided in this embodiment is to enable the security baseline classification model trained by it to evaluate, in practical application, the security of each data stream from the client to the server from multiple angles, such as the network layer and the application layer; that is, when an industrial internet platform is monitored, the security baseline classification model captures data streams from the platform. The extracted training feature data therefore needs to reflect the features of both the client and the server.

Therefore, the data stored in the pre-constructed sample data set needs to be interactive data involved in the interactive process between the client and the server.

Furthermore, it should be understood that in practical applications, the interaction between the client and the server may not only be a mere transmission and reception of a data packet, but also may generate new data along with the interaction. Therefore, the above-mentioned interactive data may be data forwarded in the interactive process, and may also be new data generated in the interactive process.

Correspondingly, when the data stored in the sample data set is the interactive data related to the interactive process between the client and the server, the training characteristic data extracted from the sample data set is the characteristic data capable of reflecting the relationship between the client and the server based on the service requirement.

Specifically, in this embodiment, the training feature data may be any one or more of the following types:

(1) the number of threads and the number of cores of the server;

(2) the average number of data flows processed by the server;

(3) the size of the data packets sent, and the size of the data packets returned in response, in the data stream transmitted during the interaction between the client and the server;

(4) the retransmission rate of the data packets sent in the data stream transmitted during the interaction between each client and the server;

(5) the response delay after each data packet in the transmitted data stream is sent, during the interaction between each client and the server;

(6) the sending interval between the data packets in the data stream transmitted during the interaction between each client and the server;

(7) the duration of the data stream transmitted during the interaction between each client and the server.

It should be understood that the above examples are only examples for better understanding of the technical solution of the present embodiment, and are not to be taken as the only limitation to the present embodiment.

In addition, it is worth mentioning that, in practical application, in order to ensure the smooth execution of the information security baseline learning method based on artificial intelligence provided in this embodiment, a sample data set needs to be constructed in advance.

Specifically, the data in the sample data set can come from roughly two sources: one is real data, namely interactive data involved in the interaction between the client and the server of an industrial internet platform at different times and in different scenarios; the other is pseudo data, namely data generated by scene simulation.

To facilitate understanding of the way in which the sample data set is constructed based on these two types of data, the following is specifically described:

Mode 1: the data in the sample data set is real data

(1.1) Acquire the interactive data involved in the interaction between the client and the server in different network scenarios at different times.

Specifically, the acquired interactive data is data whose security is known, that is, secure data and non-secure data.

(1.2) Based on different business requirements, select from the interactive data the secure data and non-secure data that meet the business requirements, to obtain sample secure data and sample non-secure data.

(1.3) Establish a correspondence between the sample secure data and the client and server, and a correspondence between the sample non-secure data and the client and server, to obtain the sample data set.

Mode 2: the data in the sample data set is pseudo data

(2.1) Simulate a secure network scenario at different times, and simulate the sending and receiving of data packets by the client and the server when different services are executed in the secure network scenario, to obtain secure data.

(2.2) Simulate an insecure network scenario at different times, and simulate the sending and receiving of data packets by the client and the server when different services are executed in the insecure network scenario, to obtain non-secure data.

(2.3) Based on different business requirements, select from the secure data the data that meets the business requirements to obtain sample secure data, and select from the non-secure data the data that meets the business requirements to obtain sample non-secure data.

(2.4) Establish a correspondence between the sample secure data and the client and server, and a correspondence between the sample non-secure data and the client and server, to obtain the sample data set.
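Mode 2 can be pictured with a small generator of labelled pseudo data. Everything numeric here is invented for illustration: the value ranges for the retransmission rate and the packet interval, the two-feature layout, and the client and server identifiers are assumptions, and a real simulation would replay actual services in a test environment rather than draw random numbers.

```python
import random


def simulate_flow(rng: random.Random, client: str, server: str, secure: bool) -> list:
    """Draw one hypothetical flow record; the ranges below are invented."""
    if secure:
        retrans_rate = rng.uniform(0.0, 0.05)
        packet_interval = rng.uniform(0.05, 1.0)
    else:
        retrans_rate = rng.uniform(0.2, 0.6)
        packet_interval = rng.uniform(0.0001, 0.01)
    # The client and server columns keep the correspondence of step (2.4);
    # the last column is the label: 1 = secure, 0 = non-secure.
    return [client, server, retrans_rate, packet_interval, 1 if secure else 0]


def build_sample_set(n_secure: int, n_insecure: int, seed: int = 0) -> list:
    """Combine simulated secure and non-secure flows into one shuffled set."""
    rng = random.Random(seed)
    rows = [simulate_flow(rng, f"client-{i}", "server-1", True) for i in range(n_secure)]
    rows += [simulate_flow(rng, f"client-{i}", "server-1", False) for i in range(n_insecure)]
    rng.shuffle(rows)
    return rows


sample_set = build_sample_set(n_secure=20, n_insecure=10)
```

Fixing the seed makes the pseudo data reproducible, which helps when comparing training runs.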

It should be noted that, whether the sample data set is constructed from real data obtained from a database or an industrial internet platform, or from pseudo data generated by scene simulation, the network scenarios covered should be as comprehensive as possible, so that a security baseline classification model trained on the training feature data extracted from the sample data set can adapt well to various network scenarios. At a minimum they should cover scenarios such as the running of third-party software, interception and tampering of information by a man-in-the-middle, networks in different states of health, and the sending of malicious data packets.

In addition, to ensure the accuracy of the trained security baseline classification model, the sample data set should contain as much data as possible.

In addition, it is worth mentioning that, in order to ensure the accuracy of the subsequent data analysis processing of the safety baseline classification model obtained by training, each sample data in the constructed sample data set usually includes several feature values.

Further, so that each piece of training feature data can be quickly identified as secure data or non-secure data during training, this embodiment specifies that the feature value in the last column of the sample data identifies whether the sample data is secure data or non-secure data.

For example, suppose a sample data X consists of n-dimensional feature values, that is, it is stored with n values. The storage format of the sample data X in the sample data set may then be X = (X(1), X(2), ..., X(n-1), X(n)).

Here n is the number of stored values (columns) of the sample data X, including the identification column, and X(n) is the value in the last column.

When the sample data X has 5 feature values in total, n may be set to 6 so that all feature values of X are recorded in the sample data set together with its security: the first 5 columns store the 5 feature values of X in order, and the value in the last column, X(6), identifies whether the sample data X is secure data or non-secure data.

In addition, in order to reduce implementation complexity and difficulty as much as possible, the sample data may be identified as non-secure data by "0" and as secure data by "1".
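The storage format can be illustrated as follows: the label occupies the last column, with 1 for secure data and 0 for non-secure data as described above. The helper names `make_sample` and `split_sample` and the five feature values are hypothetical.

```python
SECURE, NON_SECURE = 1, 0  # label encoding described in the text


def make_sample(feature_values, is_secure: bool) -> list:
    """Append the label as the last column of the stored row."""
    return list(feature_values) + [SECURE if is_secure else NON_SECURE]


def split_sample(sample: list):
    """Separate a stored row into (feature values, label)."""
    return sample[:-1], sample[-1]


# Five hypothetical feature values, so the stored row has n = 6 columns.
row = make_sample([4, 8, 0.02, 35.0, 1.2], is_secure=True)
features, label = split_sample(row)
```

Keeping the label in a fixed position means training code can split any row the same way, regardless of how many feature values a business requirement selects.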

For ease of understanding, the following description takes data generated by scene simulation as an example:

First, for security reasons, an industrial internet platform whose connections are based on Hypertext Transfer Protocol over Secure Socket Layer (HTTPS) is built, and the Central Processing Unit (CPU) cores of the server are used to run multiple threads synchronously, so as to avoid wasting resources.

Then, a packet capture module in the built industrial internet platform captures the interactive data packets generated by the interaction between the client and the server, and the captured interactive data packets are saved as a file in packet capture (pcap) format.

And then, analyzing the file stored in the pcap format, namely the captured interactive data packet, by using a data analysis module in the established industrial internet platform.

Specifically, the analysis of the interactive data packet mainly extracts fields to be acquired according to service requirements, such as a quintuple of the interactive data packet, whether the interactive data packet is a retransmission packet, a behavior pattern of a user, a data size, data packet sending and receiving time and the like, and stores the analyzed data in a queue.
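The parsing step can be sketched as below, assuming the capture file has already been decoded into one dictionary per packet. The dictionary keys, the class names `FiveTuple` and `ParsedPacket`, and the two example packets are hypothetical; decoding the pcap file itself is outside this sketch.

```python
from dataclasses import dataclass
from queue import Queue
from typing import NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str


@dataclass
class ParsedPacket:
    five_tuple: FiveTuple
    is_retransmission: bool
    size: int
    timestamp: float


def parse_packet(raw: dict) -> ParsedPacket:
    """Keep only the fields the business requirement asks for."""
    return ParsedPacket(
        five_tuple=FiveTuple(
            raw["src_ip"], int(raw["src_port"]),
            raw["dst_ip"], int(raw["dst_port"]),
            raw["protocol"],
        ),
        is_retransmission=bool(raw.get("retransmission", False)),
        size=int(raw["size"]),
        timestamp=float(raw["timestamp"]),
    )


# Hypothetical decoded packets; a pcap reader would normally supply these.
captured = [
    {"src_ip": "192.0.2.10", "src_port": 50514, "dst_ip": "198.51.100.5",
     "dst_port": 443, "protocol": "TCP", "size": 517, "timestamp": 0.000},
    {"src_ip": "198.51.100.5", "src_port": 443, "dst_ip": "192.0.2.10",
     "dst_port": 50514, "protocol": "TCP", "size": 1400, "timestamp": 0.031,
     "retransmission": True},
]

parsed_queue: Queue = Queue()
for raw_packet in captured:
    parsed_queue.put(parse_packet(raw_packet))
```

A queue decouples parsing from storage, so a separate database writer can drain it at its own pace.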

Then, a database module in the built industrial internet platform transfers the data in the queue to the corresponding database for storage, so that the relevant data can be retrieved later when the application and the front-end pages are implemented.

Further, after the above operations are completed, a command-line window is opened in the server control interface. If the server is a Linux server, the command "grep 'core id' /proc/cpuinfo | sort -u | wc -l" is entered to obtain the number of cores of the server, and this value is recorded as Core; the command "grep 'processor' /proc/cpuinfo | sort -u | wc -l" is entered to obtain the number of threads of the server, and this value is recorded as Thread.
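The same two counts can also be obtained programmatically. The sketch below mirrors the shell pipeline (select matching lines, de-duplicate, count) in Python; the function names and the embedded sample text are illustrative, and `core_and_thread_counts` only works on a Linux host where /proc/cpuinfo exists.

```python
def count_unique_lines(text: str, pattern: str) -> int:
    """Count distinct lines containing pattern (grep PATTERN | sort -u | wc -l)."""
    return len({line for line in text.splitlines() if pattern in line})


def core_and_thread_counts(path: str = "/proc/cpuinfo"):
    """Read a Linux cpuinfo file and return (Core, Thread)."""
    with open(path, encoding="utf-8") as handle:
        text = handle.read()
    return count_unique_lines(text, "core id"), count_unique_lines(text, "processor")


# Abbreviated, hypothetical cpuinfo content: 2 cores with 2 threads each.
SAMPLE_CPUINFO = (
    "processor\t: 0\ncore id\t\t: 0\n"
    "processor\t: 1\ncore id\t\t: 1\n"
    "processor\t: 2\ncore id\t\t: 0\n"
    "processor\t: 3\ncore id\t\t: 1\n"
)
Core = count_unique_lines(SAMPLE_CPUINFO, "core id")
Thread = count_unique_lines(SAMPLE_CPUINFO, "processor")
```

Like the shell command, this counts distinct "core id" values, so on a multi-socket machine where sockets reuse core ids it reports the cores of one socket rather than of the whole server.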

Then, through pseudo code, the following values are recorded: the average network throughput rate (pps) during operation of the industrial internet platform (hereinafter referred to as the platform), that is, the number of data packets forwarded by a switch per second, recorded as P pps; the average network bandwidth rate (bps) during operation of the platform, that is, the number of bits of information transmitted per second, recorded as B bps; and, while the client and the server interact over a long connection, the heartbeats that the client periodically sends to the server, which are used to count the number of users participating in the operation of the platform, recorded as Users.
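The embodiment does not reproduce its pseudo code. A minimal sketch of how P, B and Users might be computed is given below, under the assumptions that packet sizes are available in bytes for a known observation window and that a user counts as active if a heartbeat arrived within a timeout. The function names and the 30-second default timeout are illustrative choices, not values from the embodiment.

```python
def average_rates(packet_sizes_bytes, duration_seconds):
    """Return (pps, bps) averaged over the observation window."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    pps = len(packet_sizes_bytes) / duration_seconds
    bps = 8 * sum(packet_sizes_bytes) / duration_seconds  # bytes to bits
    return pps, bps


def active_users(heartbeats, now, timeout_seconds=30.0):
    """Count clients whose most recent heartbeat is within the timeout.

    heartbeats is an iterable of (client_id, timestamp) pairs.
    """
    latest = {}
    for client_id, timestamp in heartbeats:
        latest[client_id] = max(timestamp, latest.get(client_id, timestamp))
    return sum(1 for t in latest.values() if now - t <= timeout_seconds)


# Usage with made-up numbers.
P, B = average_rates([100, 200, 300], 2.0)
Users = active_users([("a", 0.0), ("b", 50.0), ("a", 40.0), ("c", 5.0)], now=60.0)
```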

Further, the number of application-layer data packets in each data flow is screened out from the obtained original data packets through pseudo code, and the value is recorded as FlowPacketsApplication.

Further, on the basis of FlowPacketsApplication, the following operations are carried out:

recording the total size of all data packets sent by the client in each data flow exchanged between a client and the server, where the value is recorded as FlowPacketsBulk;

recording the number of all data packets sent by the client in each data flow exchanged between a client and the server, where the value is recorded as FlowPacketsCount;

recording the total size of all data packets sent by the server in each data flow exchanged between a client and the server, where the value is recorded as BackFlowPacketsBulk;

recording the number of all data packets sent by the server in each data flow exchanged between a client and the server, where the value is recorded as BackFlowPacketsCount;

recording the retransmission rate of data packets in each data flow exchanged between a client and the server, where the value is recorded as FlowRetrans;

recording the sum of the response delays of data packets in each data flow exchanged between a client and the server, where the value is recorded as FlowLatency;

recording the sum of the interval durations of all data packets sent by the client in each data flow exchanged between a client and the server, where the value is recorded as FlowInterval;

recording the sum of the interval durations of all data packets sent by the server in each data flow exchanged between a client and the server, where the value is recorded as BackFlowInterval;

and recording the duration of each data flow exchanged between a client and the server, where the value is recorded as FlowDuration.
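The per-flow values above can be illustrated with the following sketch, which assumes each packet of a flow is available as a record with a timestamp, a direction flag, a size and a retransmission flag. The `Packet` type and the function names are hypothetical. FlowLatency and FlowPacketsApplication are omitted because they require request/response pairing and protocol-layer information that the embodiment does not detail.

```python
from typing import NamedTuple


class Packet(NamedTuple):
    timestamp: float      # seconds
    from_client: bool     # True if sent by the client, False if sent by the server
    size: int             # bytes
    retransmission: bool


def _sum_gaps(timestamps):
    """Sum of the gaps between consecutive timestamps."""
    ordered = sorted(timestamps)
    return sum(b - a for a, b in zip(ordered, ordered[1:]))


def flow_features(packets):
    """Compute per-flow features for one client/server data flow."""
    client = [p for p in packets if p.from_client]
    server = [p for p in packets if not p.from_client]
    times = [p.timestamp for p in packets]
    retrans = sum(p.retransmission for p in packets)
    return {
        "FlowPacketsBulk": sum(p.size for p in client),
        "FlowPacketsCount": len(client),
        "BackFlowPacketsBulk": sum(p.size for p in server),
        "BackFlowPacketsCount": len(server),
        "FlowRetrans": retrans / len(packets) if packets else 0.0,
        "FlowInterval": _sum_gaps(p.timestamp for p in client),
        "BackFlowInterval": _sum_gaps(p.timestamp for p in server),
        "FlowDuration": max(times) - min(times) if times else 0.0,
    }


# Usage with a made-up four-packet flow.
features = flow_features([
    Packet(0.0, True, 100, False),
    Packet(1.0, False, 200, False),
    Packet(2.0, True, 50, True),
    Packet(4.0, False, 300, False),
])
```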

It should be understood that the English parameter names above are only labels for the extracted feature values; in practical applications, each label corresponds to a specific numerical value.

Furthermore, it should be understood that the above examples are only examples for better understanding of the technical solution of the present embodiment, and are not to be taken as the only limitation to the present embodiment.

Step 102, constructing a safety baseline training model by using a preset deep learning algorithm.

Specifically, in order for the safety baseline classification model obtained through training to exhibit strong self-learning capability and adaptability to the environment, the safety baseline training model is constructed based on a deep learning algorithm, which belongs to the field of artificial intelligence.

In order to keep the structure of the safety baseline training model as simple as possible, thereby reducing the training difficulty and improving the training speed, to give the output variable of the finally trained safety baseline classification model a good probabilistic interpretation, and to make the model suitable for both continuous and discrete independent variables, this embodiment builds a neural network regression model based on a deep neural network algorithm and uses the built neural network regression model as the safety baseline training model required for subsequent training.

The neural network regression model is composed of an input layer, an output layer and a plurality of hidden layers between the input layer and the output layer. Each hidden layer includes a linear layer and an activation layer.

The linear layer is used for calculating the linear transformation weight values and Bayesian bias values of the input data fed into the hidden layer so as to determine the security value of the input data, and the activation layer is used for mapping the output of each hidden layer to (0,1) so as to reduce calculation difficulty and complexity.

For ease of understanding and explanation, the following description is made in conjunction with fig. 2:

As shown in fig. 2, the constructed neural network regression model has 3 hidden layers: a first hidden layer, a second hidden layer and a third hidden layer.

The input layer receives the training feature data to be analyzed during training and passes it to the first hidden layer. The linear layer in the first hidden layer calculates the linear transformation weight values of the training feature data based on a weight calculation function, calculates the Bayesian bias values of the training feature data based on a bias calculation function, and determines the security value of the training feature data from the obtained linear transformation weight values and Bayesian bias values. The activation layer in the first hidden layer then uses a preset activation function, such as a sigmoid function, to map the output of the linear layer, that is, the security value, to (0,1). Next, the second hidden layer receives the result output by the first hidden layer, repeats the operation of the first hidden layer, and passes its output to the third hidden layer; the third hidden layer repeats the same operation, and the processing result is finally output through the output layer.

The weight calculation function is specifically: weight = Parameter(torch.Tensor(out, in)); the bias calculation function is specifically: bias = Parameter(out).

Wherein out is the number of neurons in the current layer, in is the number of neurons in the previous layer, weight is the linear transformation weight value obtained by calculation, and bias is the Bayes bias value obtained by calculation.

It should be understood that for the first hidden layer, i.e. the hidden layer adjacent to the input layer, in needed to calculate weight is the number of neurons in the input layer.

Accordingly, when calculating the processing result, i.e. the security value, of the input training sample data of the corresponding hidden layer based on the weight and bias obtained by the above calculation, it is specifically based on formula (1):

In the formula, X represents the training feature data selected from the sample data set and has n feature values (the specific format is described in detail in the example in step 101); h_i represents the hidden layer of the i-th layer and is an h_i-dimensional vector; W_i is an h_i × h_{i-1} parameter matrix; b_i is an h_i-dimensional Bayesian bias vector; W_out is the 1 × h_m parameter matrix of the output layer; b_out is the Bayesian bias value of the final output of the output layer; and relu and σ represent activation functions.
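Formula (1) itself is not reproduced in this text. One form consistent with the symbol definitions above, for a network with m hidden layers, is sketched below; this is a reconstruction under the assumption that relu is applied in the hidden layers and σ at the output, and it should not be taken as the exact published formula.

```latex
\begin{aligned}
h_1 &= \mathrm{relu}(W_1 X + b_1), \\
h_i &= \mathrm{relu}(W_i h_{i-1} + b_i), \quad i = 2, \dots, m, \\
y   &= \sigma(W_{out} h_m + b_{out}).
\end{aligned}
```

Under this reading, W_1 has dimensions h_1 × n, which matches the statement that for the first hidden layer the value of in is the number of neurons in the input layer.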

In this embodiment, the activation function used for σ is the sigmoid function; that is, the value obtained according to formula (1) is processed as y = sigmoid(x) so that the output result is mapped to (0,1).

Specifically, an image corresponding to an output result after being processed by the sigmoid function is shown in fig. 3.
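The layer-by-layer computation described above can be sketched in plain Python as follows. This is an illustrative forward pass only, not the implementation of the embodiment: the function names are hypothetical, the weights are ordinary nested lists rather than framework parameters, and the hidden-layer activation is left as an argument (relu by default) because the description mentions both relu and sigmoid.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function; the result lies in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relu(x):
    return max(0.0, x)


def linear(weights, bias, inputs):
    """Compute W x + b, with W given as a list of rows."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, bias)
    ]


def forward(x, hidden_layers, w_out, b_out, hidden_activation=relu):
    """Run x through each (W_i, b_i) hidden layer, then a sigmoid output unit."""
    h = x
    for weights, bias in hidden_layers:
        h = [hidden_activation(v) for v in linear(weights, bias, h)]
    y = linear([w_out], [b_out], h)[0]
    return sigmoid(y)


# Usage: two input features, one hidden layer with a single neuron.
output = forward([1.0, 2.0], [([[1.0, -1.0]], [0.5])], [2.0], 1.0)
```

Passing `hidden_activation=sigmoid` gives the variant in which every hidden layer also maps its output to (0,1).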

Furthermore, it should be understood that fig. 2 is only for convenience of illustration, in practical applications, the number of layers of the hidden layer is not limited to 3, the number of neurons in the input layer is not limited to 4, the number of neurons in each hidden layer is not limited to 3, and the number of neurons in the output layer is not limited to 2, that is, in practical applications, the number of layers of the hidden layer is set according to the accuracy of the security baseline classification model that can be obtained as needed, and the number of neurons in the input layer, the hidden layer, and the output layer can also be set as needed.

Step 103, cleaning the training characteristic data, and performing iterative training on the safety baseline training model by using the cleaned training characteristic data until a preset iteration termination condition is met to obtain a safety baseline classification model.

It should be noted that, since the data input into the safety baseline training model must satisfy a certain format, in order to ensure the normal training operation, before executing step 103, the training feature data needs to be constructed into an input matrix according to a preset format.

For ease of understanding, this embodiment presents a specific input matrix format, detailed in table 1.

TABLE 1 input matrix

Description of the features	Characteristic data
Binding thread number of each core of the system	Core/Thread
Network throughput (pps) per user	P/Users
Bandwidth rate (bps) per user	B/Users
Number of application-layer packets in each flow	FlowPacketsApplication
Total size of all packets sent by the client in each flow	FlowPacketsBulk
Total number of packets sent by the client in each flow	FlowPacketsCount
Total size of all packets sent by the server in each flow	BackFlowPacketsBulk
Total number of packets sent by the server in each flow	BackFlowPacketsCount
Retransmission rate of packets in each flow	FlowRetrans
Sum of response delays of packets in each flow	FlowLatency
Sum of interval durations of all packets sent by the client in each flow	FlowInterval
Sum of interval durations of all packets sent by the server in each flow	BackFlowInterval
Duration of each flow	FlowDuration

It is easy to find that table 1 is an input matrix constructed based on the feature data extracted in step 101 as training sample data, and in practical applications, the content and style included in the input matrix may be determined as needed, which is not limited in this embodiment.

Note that, in practical applications, the entries in the characteristic data column of table 1 would be the specific feature values of the training sample data; the English parameter names are used above only for convenience of illustration and description.
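As an illustration of how one row of such an input matrix might be assembled, the sketch below combines the platform-level values with the per-flow values in the order of table 1 and appends a label as the last column. The function name, the dictionary-based flow argument, and the encoding of the label (1 for safety data, 0 for non-safety data) are assumptions made for this example.

```python
FLOW_FEATURES = [
    "FlowPacketsApplication", "FlowPacketsBulk", "FlowPacketsCount",
    "BackFlowPacketsBulk", "BackFlowPacketsCount", "FlowRetrans",
    "FlowLatency", "FlowInterval", "BackFlowInterval", "FlowDuration",
]


def build_input_row(core, thread, pps, bps, users, flow, label):
    """Return 13 feature values in table 1 order, followed by the label."""
    if thread <= 0 or users <= 0:
        raise ValueError("thread and users must be positive")
    row = [core / thread, pps / users, bps / users]
    row.extend(float(flow[name]) for name in FLOW_FEATURES)
    row.append(label)  # assumed encoding: 1 = safety data, 0 = non-safety data
    return row


# Usage with made-up values; every flow feature is set to 1 for brevity.
row = build_input_row(
    core=4, thread=8, pps=1000, bps=8000, users=10,
    flow={name: 1 for name in FLOW_FEATURES}, label=1,
)
```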

Accordingly, after completing the construction of the training sample data, the operation in step 103 specifically includes:

(1) cleaning training characteristic data in the input matrix, inputting the cleaned training characteristic data in the input matrix as input data to the input layer in the safety baseline training model, so that the input layer transmits the input data to each network neuron in the adjacent hidden layers respectively, the network neurons calculate linear transformation weight values and Bayesian bias values of the input data in the linear layer, and the obtained linear transformation weight values and the Bayesian bias values are mapped between (0,1) through an activation layer and enter the next hidden layer until being processed by all hidden layers in the safety baseline training model and then output through the output layer;

(2) calculating a loss value of a training result output by the safety baseline training model based on a cross entropy loss function;

(3) when the loss value tends to converge, determining that the preset iteration termination condition is met, stopping training the safety baseline training model, and taking the current safety baseline training model as the safety baseline classification model.

The loss value mentioned above is calculated based on the following formula (2):

In the formula, y_True indicates that the training feature data input into the safety baseline training model is safety data, P(Y = 1) represents the probability of safety data, P(Y = 0) represents the probability of non-safety data, and L is the calculated loss value.

Since the output result has already been mapped to (0,1), P(Y = 1) = y_True and P(Y = 0) = 1 - y_True.
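Formula (2) is not reproduced in this text. Assuming it is the standard binary cross-entropy, the loss can be sketched as below, where each label is 1 for safety data and 0 for non-safety data and each probability is the model's estimate of P(Y = 1). The function name and the clamping constant `eps` are illustrative choices, not part of the embodiment.

```python
import math


def binary_cross_entropy(labels, probabilities, eps=1e-12):
    """Mean binary cross-entropy; labels are 0 or 1, probabilities lie in (0, 1)."""
    if len(labels) != len(probabilities) or not labels:
        raise ValueError("labels and probabilities must be non-empty and equal length")
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Usage: a completely uncertain prediction for one safe and one unsafe sample.
loss = binary_cross_entropy([1, 0], [0.5, 0.5])
```

The loss is small when the predicted probability agrees with the label and grows without bound as a confident prediction turns out to be wrong.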

Furthermore, it should be understood that the feature value in the last column of each piece of training feature data indicates whether that data is safe data or non-safe data; that is, the training feature data already carries identification information describing its own safety. Therefore, the safety baseline training model is iteratively trained with the training feature data, and the result output by the safety baseline training model is compared with the feature value identifying the safety (safe data or non-safe data) of the training feature data; when the loss value of the result tends to converge, the safety baseline training model at the current moment can be used as the safety baseline classification model, that is, the training is successful.
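The embodiment does not define precisely when the loss value "tends to converge". One simple criterion, given here only as an assumption, is to stop when the loss has varied by less than a small tolerance over the most recent iterations. The function name, window size and tolerance below are hypothetical.

```python
def has_converged(losses, window=5, tolerance=1e-4):
    """True when the last `window` loss values vary by less than `tolerance`."""
    if len(losses) < window:
        return False
    recent = losses[-window:]
    return max(recent) - min(recent) < tolerance


# Usage with made-up loss histories.
still_falling = has_converged([1.0, 0.8, 0.6, 0.4, 0.2])
flattened = has_converged([0.9, 0.5, 0.30001, 0.30002, 0.30001, 0.30002, 0.30001])
```

In a training loop, the loss of each iteration would be appended to the list and training would stop the first time the function returns True.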

Further, in practical application, before the current safety baseline training model is used as the safety baseline classification model, the current safety baseline training model may be tested based on test feature data of known safety, and a specific test process is similar to the above training process, except that new feature data is input into the safety baseline training model for analysis processing.

Correspondingly, if the output result after the analysis processing of the safety baseline training model is similar to the characteristic value for identifying the safety of the test characteristic data, the safety baseline training model at the current moment can be regarded as the safety baseline classification model, namely the training is successful.

In addition, it is worth mentioning that, regarding the above-mentioned cleaning of the training feature data, in the embodiment, any one or more of the following may be used: missing value processing, abnormal value processing, duplicate removal processing, noise data processing and the like.

Specifically, missing data in the input matrix is handled according to its distribution: when the missing values are few and occur randomly, the affected group of data is deleted; when the missing values are numerous, interpolation can be used, that is, a sample similar to the one containing the missing value (a matched sample) is found in the non-missing data set, and the missing value is filled with the observed value from that sample. Abnormal values in the input matrix are standardized: for a character string, the ANSI code values of its characters are summed to obtain a numerical value for the string, and if that value is too large, it is reduced modulo a suitable prime number. For repeated data in the input matrix, a set of rules needs to be written to remove the duplicates. For noise data, the interference data needs to be quantified according to actual conditions.
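A few of these cleaning operations can be sketched as follows. The sketch covers only the simplest cases: deleting rows with missing values, converting strings to numbers by summing character codes modulo a prime, and removing exact duplicates. The function names and the prime 1000003 are illustrative assumptions; Python's `ord` returns Unicode code points, which coincide with the ANSI/ASCII values for basic Latin characters. Interpolation and noise handling are not shown.

```python
def string_to_number(text, prime=1_000_003):
    """Sum the character codes of `text` and reduce the sum modulo a prime."""
    return sum(ord(ch) for ch in text) % prime


def clean_rows(rows):
    """Drop rows with missing values, encode strings as numbers, remove duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        if any(value is None for value in row):
            continue  # few, randomly missing values: delete the whole row
        encoded = tuple(
            string_to_number(v) if isinstance(v, str) else v for v in row
        )
        if encoded in seen:
            continue  # exact duplicate of an earlier row
        seen.add(encoded)
        cleaned.append(list(encoded))
    return cleaned


# Usage with made-up rows.
result = clean_rows([[1, "AB"], [1, "AB"], [None, "x"], [2, "a"]])
```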

It is not difficult to see from the above description that the artificial intelligence based information security baseline learning method provided in this embodiment constructs the safety baseline training model based on a deep learning algorithm, so that the safety baseline classification model finally obtained through training exhibits strong self-learning and adaptive capability. That is, when the security of the industrial internet platform is subsequently monitored, the safety baseline classification model can automatically generate, according to different scenes, an information security baseline capable of evaluating the information security of the industrial internet platform from multiple network layers and the application layer, which removes the dependence on manual work and the need for manual setting by workers. Moreover, because the training feature data used to construct the safety baseline classification model comes from the data involved in the interaction between the client and the server, the information security baseline configured based on the safety baseline classification model enables security monitoring of the client, of the server, and of the interaction between the client and the server.

The second embodiment of the invention relates to an information safety baseline learning method based on artificial intelligence. The second embodiment mainly uses the security baseline classification model obtained by the training of the first embodiment, that is, the security baseline classification model obtained by the training of the first embodiment is used for information security monitoring of the industrial internet platform to be monitored.

As shown in fig. 4, the second embodiment relates to an artificial intelligence-based information security baseline learning method, which includes the following steps:

step 401, based on the service requirement, extracting training characteristic data from a sample data set which is constructed in advance, wherein the data stored in the sample data set is interactive data related to the interactive process between the client and the server.

Step 402, constructing a safety baseline training model by using a preset deep learning algorithm.

Step 403, cleaning the training characteristic data, and performing iterative training on the safety baseline training model by using the cleaned training characteristic data until a preset iteration termination condition is met, so as to obtain a safety baseline classification model.

It is to be understood that steps 401 to 403 in this embodiment are substantially the same as steps 101 to 103 in the first embodiment, and are not repeated here.

Step 404, acquiring a data stream generated by the industrial internet platform to be monitored by using the mirror image interface.

Step 405, analyzing and processing the data stream by using the safety baseline classification model to obtain the probability that the data stream is an unsafe data stream.

Step 406, performing a safety alarm when the probability is greater than a preset threshold value.
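Steps 405 and 406 can be illustrated with the short sketch below. It assumes that the classification model outputs, for each data stream, the probability that the stream is safe, so that the probability of an unsafe stream is one minus that output; the function names and the threshold of 0.8 are hypothetical, and the embodiment does not specify a threshold value.

```python
def unsafe_probability(p_safe):
    """Probability that a data stream is unsafe, given the model's safe probability."""
    return 1.0 - p_safe


def flows_to_alarm(safe_probabilities, threshold=0.8):
    """Return the indices of data streams whose unsafe probability exceeds the threshold."""
    alarms = []
    for index, p_safe in enumerate(safe_probabilities):
        if unsafe_probability(p_safe) > threshold:
            alarms.append(index)
    return alarms


# Usage with made-up model outputs for three data streams.
alarms = flows_to_alarm([0.9, 0.1, 0.5], threshold=0.8)
```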

Therefore, when the information security baseline learning method based on artificial intelligence provided in this embodiment monitors information security of the industrial internet platform to be monitored based on the obtained security baseline classification model, the data stream generated by the industrial internet platform to be monitored is obtained by using the mirror image interface, so that the industrial internet platform can be monitored without seriously affecting the normal throughput of the source port.

In addition, it is worth mentioning that, because the current industrial internet platform generally establishes a connection based on HTTPS in consideration of the security of the industrial internet platform, a data stream generated by the industrial internet platform to be monitored, which is acquired through the mirror interface, may be an encrypted data stream, that is, specific content of the authentication code information cannot be identified. However, in general, even if data streams are encrypted, there is a certain correlation between the data streams.

Therefore, when the security baseline classification model is used to analyze and process the encrypted data stream, the time characteristic information corresponding to each data packet in the encrypted data stream may be obtained first, and then the time characteristic information corresponding to each data packet is analyzed and processed by using the security baseline classification model, that is, whether the currently obtained encrypted data stream is a secure data stream or a non-secure data stream is analyzed based on the time characteristic information.

It should be noted that the above-mentioned time characteristic information may specifically be, for example, a packet size, a packet response time, a data flow active time, a co-directional packet transmission time difference, a data flow duration, and the like in this embodiment.

Therefore, the information security baseline learning method based on artificial intelligence provided in this embodiment determines the relevance between data streams by analyzing and processing the time characteristic information, and further determines whether the current data stream is a secure data stream or an insecure data stream, thereby implementing security supervision on encrypted data streams.

It should be understood that the above steps of the various methods are divided for clarity, and the implementation may be combined into one step or split into a plurality of steps, and all that includes the same logical relationship is within the protection scope of the present patent; it is within the scope of the patent to add insignificant modifications to the algorithms or processes or to introduce insignificant design changes to the core design without changing the algorithms or processes.

A third embodiment of the present invention relates to an information security baseline learning apparatus based on artificial intelligence, as shown in fig. 5, including: an extraction module 501, a construction module 502 and a training module 503.

The extracting module 501 is configured to extract training feature data from a sample data set constructed in advance based on a service requirement, where the data stored in the sample data set is interactive data related to an interactive process between a client and a server; a building module 502, configured to build a safety baseline training model by using a preset deep learning algorithm; the training module 503 is configured to clean the training feature data, and perform iterative training on the safety baseline training model by using the cleaned training feature data until a preset iteration termination condition is met, so as to obtain a safety baseline classification model.

In addition, in another example, the artificial intelligence based information security baseline learning apparatus may further include a sample data set construction module.

Specifically, the sample data set construction module may construct the sample data set based on:

acquiring interactive data related to the interactive process between the client and the server in different network scenes at different time, wherein the interactive data comprises safe data and non-safe data;

based on different service requirements, selecting safety data and non-safety data which meet the service requirements from the interactive data to obtain sample safety data and sample non-safety data;

establishing a corresponding relation between the sample safety data and the client and the server, and establishing a corresponding relation between the sample non-safety data and the client and the server to obtain the sample data set;

wherein each sample data in the sample data set comprises a number of eigenvalues and the eigenvalue of the last column in the sample data is used to identify whether the sample data is secure data or non-secure data.

In addition, in another example, the sample data set construction module may further construct the sample data set based on:

simulating a secure network scene at different times, and simulating the sending and receiving operations of data packets performed by the client and the server when different services are executed under the secure network scene to obtain secure data;

simulating an insecure network scene at different times, and simulating the sending and receiving operations of data packets performed by the client and the server when different services are executed under the insecure network scene to obtain insecure data;

based on different service requirements, selecting safety data meeting the service requirements from the safety data to obtain sample safety data, and selecting non-safety data meeting the service requirements from the non-safety data to obtain sample non-safety data;

In addition, in another example, when the building module 502 builds the safety baseline training model by using the preset deep learning algorithm, specifically:

constructing a neural network regression model by using a deep neural network algorithm to obtain the safety baseline training model;

the neural network regression model consists of an input layer, a plurality of hidden layers sequentially connected with the input layer and an output layer connected with the last hidden layer;

and each hidden layer comprises a linear layer and an activation layer, wherein the linear layer is used for calculating linear transformation weight values and Bayesian bias values of input data input into the hidden layer so as to determine input data security values, and the activation layer is used for mapping the output processed by each hidden layer to be between (0, 1).

In addition, in another example, the artificial intelligence based information security baseline learning apparatus can further include a matrix construction module.

Specifically, the matrix construction module is configured to construct an input matrix according to a preset format for the training feature data, so as to ensure that input data input into the security baseline training model can be normally processed.

Correspondingly, the training module 503 cleans the training feature data, and performs iterative training on the safety baseline training model by using the cleaned training feature data until a preset iteration termination condition is met, so as to obtain a safety baseline classification model, which specifically includes:

cleaning training characteristic data in the input matrix, inputting the cleaned training characteristic data in the input matrix as input data to the input layer in the safety baseline training model, so that the input layer transmits the input data to each network neuron in the adjacent hidden layers respectively, the network neurons calculate linear transformation weight values and Bayesian bias values of the input data in the linear layer, and the obtained linear transformation weight values and the Bayesian bias values are mapped between (0,1) through an activation layer and enter the next hidden layer until being processed by all hidden layers in the safety baseline training model and then output through the output layer;

calculating a loss value of a training result output by the safety baseline training model based on a cross entropy loss function;

and when the loss value tends to converge, determining that the preset iteration termination condition is met, stopping training the safety baseline training model, and taking the current safety baseline training model as the safety baseline classification model.

In addition, in another example, the artificial intelligence based information security baseline learning apparatus may further include a data stream acquisition module and an early warning module.

Specifically, the data stream acquiring module is used for acquiring a data stream generated by the industrial internet platform to be monitored by using the mirror image interface.

Correspondingly, the safety baseline classification model obtained through training by the training module 503 obtains the probability that the data stream is an unsafe data stream by analyzing and processing the data stream; and then when the probability is larger than a preset threshold value, triggering an early warning module to perform safety warning.

In addition, in another example, the data stream acquired by the data stream acquisition module may be an encrypted data stream.

Correspondingly, when the security baseline classification model obtained through training by the training module 503 is used for analyzing and processing the encrypted data stream, specifically:

acquiring time characteristic information corresponding to each data packet in the encrypted data stream;

and analyzing and processing the time characteristic information corresponding to each data packet by using the safety baseline classification model.

It should be understood that the present embodiment is a device embodiment corresponding to the first or second embodiment, and the present embodiment can be implemented in cooperation with the first or second embodiment. The related technical details mentioned in the first or second embodiment are still valid in this embodiment, and are not described herein again to reduce repetition. Accordingly, the related-art details mentioned in the present embodiment can also be applied to the first or second embodiment.

It should be noted that, all the modules involved in this embodiment are logic modules, and in practical application, one logic unit may be one physical unit, may also be a part of one physical unit, and may also be implemented by a combination of multiple physical units. In addition, in order to highlight the innovative part of the present invention, a unit which is not so closely related to solve the technical problem proposed by the present invention is not introduced in the present embodiment, but this does not indicate that there is no other unit in the present embodiment.

A fourth embodiment of the present invention relates to an artificial intelligence based information security baseline learning apparatus, as shown in fig. 6, comprising at least one processor 601; and a memory 602 communicatively coupled to the at least one processor 601; wherein the memory 602 stores instructions executable by the at least one processor 601 to enable the at least one processor 601 to perform the artificial intelligence based information security baseline learning method described in the first or second embodiments above.

Where the memory 602 and the processor 601 are coupled by a bus, the bus may comprise any number of interconnected buses and bridges that couple one or more of the various circuits of the processor 601 and the memory 602 together. The bus may also connect various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor 601 is transmitted over a wireless medium via an antenna, which further receives the data and transmits the data to the processor 601.

The processor 601 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. While memory 602 may be used to store data used by processor 601 in performing operations.

A fifth embodiment of the present invention relates to a computer-readable storage medium storing a computer program. The computer program, when executed by the processor, implements the artificial intelligence based information security baseline learning method embodiments described above.

That is, as those skilled in the art will understand, all or part of the steps of the methods in the embodiments described above may be implemented by a program instructing related hardware. The program is stored in a storage medium and includes several instructions that enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.

It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific embodiments for practicing the invention, and that, in practice, various changes in form and detail may be made to them without departing from the spirit and scope of the invention.

Claims

1. An artificial intelligence based information security baseline learning method, characterized by comprising the following steps:

2. The artificial intelligence based information security baseline learning method of claim 1, wherein before extracting training feature data from a pre-constructed sample data set based on business requirements, the method further comprises:

3. The artificial intelligence based information security baseline learning method of claim 1, wherein before extracting training feature data from a pre-constructed sample data set based on business requirements, the method further comprises:

4. The information security baseline learning method based on artificial intelligence of claim 1, wherein the constructing of the security baseline training model by using the preset deep learning algorithm comprises:

5. The method of claim 4, wherein before cleaning the training feature data and iteratively training the security baseline training model with the cleaned training feature data until a preset iteration termination condition is met to obtain a security baseline classification model, the method further comprises:

constructing an input matrix from the training feature data according to a preset format;

the cleaning of the training feature data and the iterative training of the security baseline training model with the cleaned training feature data until a preset iteration termination condition is met to obtain a security baseline classification model comprises:

6. The artificial intelligence based information security baseline learning method of any one of claims 1 to 5, wherein after cleaning the training feature data and iteratively training the security baseline training model with the cleaned training feature data until a preset iteration termination condition is met to obtain a security baseline classification model, the method further comprises:

acquiring, through a mirror interface, a data stream generated by an industrial internet platform to be monitored;

analyzing and processing the data stream by using the security baseline classification model to obtain the probability that the data stream is an unsafe data stream;

and issuing a security alarm when the probability is greater than a preset threshold.
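
The monitoring steps recited in claim 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `monitor`, `toy_classifier`, and `Alarm` and the threshold value 0.8 are hypothetical, the per-flow feature vectors stand in for data taken from the mirrored data stream, and the logistic function is only a placeholder for the trained security baseline classification model.

```python
import math
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence

# Hypothetical preset threshold; the claim does not specify a value.
ALARM_THRESHOLD = 0.8


@dataclass
class Alarm:
    """Record of one data stream flagged as unsafe."""
    flow_index: int
    probability: float


def toy_classifier(features: Sequence[float]) -> float:
    """Placeholder for the trained classification model (assumption).

    Maps the mean of the feature vector through a logistic function so
    that the result can be read as a probability in (0, 1).
    """
    score = sum(features) / len(features)
    return 1.0 / (1.0 + math.exp(-score))


def monitor(
    flows: Iterable[Sequence[float]],
    classify: Callable[[Sequence[float]], float],
    threshold: float = ALARM_THRESHOLD,
) -> List[Alarm]:
    """Score each flow and raise an alarm when the probability exceeds the threshold."""
    alarms: List[Alarm] = []
    for index, features in enumerate(flows):
        probability = classify(features)
        # Strictly greater than the threshold, as worded in the claim.
        if probability > threshold:
            alarms.append(Alarm(index, probability))
    return alarms


if __name__ == "__main__":
    sample_flows = [[0.0, 0.0], [10.0, 10.0], [-5.0, -5.0]]
    for alarm in monitor(sample_flows, toy_classifier):
        print(f"security alarm: flow {alarm.flow_index}, p={alarm.probability:.4f}")
```

Passing the classifier in as a callable keeps the thresholding logic independent of how the model is built, so the placeholder can be swapped for a trained model without changing `monitor`.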

7. The artificial intelligence based information security baseline learning method of claim 6, wherein the data stream is an encrypted data stream;

the analyzing and processing of the data stream by using the security baseline classification model comprises:

8. An information security baseline learning device based on artificial intelligence is characterized in that,

9. An information security baseline learning apparatus based on artificial intelligence, comprising:

at least one processor; and a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the artificial intelligence based information security baseline learning method of any of claims 1-7.

10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the artificial intelligence based information security baseline learning method of any of claims 1 to 7.