CN113033774A - Method and device for training graph processing network model, electronic equipment and storage medium

Info

Publication number: CN113033774A (application CN202110261670.9A); granted as CN113033774B
Authority: CN (China)
Prior art keywords: network, student network, correction gradient, student, gradient
Original language: Chinese (zh)
Inventors: 杨喜鹏, 蒋旻悦, 谭啸, 孙昊
Original assignee / applicant: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Accurate Pointing Information Technology Co ltd
Priority: CN202110261670.9A, filed by Beijing Baidu Netcom Science and Technology Co Ltd
Legal status: Granted, Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/08: Learning methods
    • G06N 3/082: Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a method and device for training a graph processing network model, an electronic device and a storage medium, relating in particular to artificial intelligence technologies such as deep learning and computer vision. The specific implementation scheme is as follows: training samples are respectively input into a student network and a teacher network to obtain a first feature map output by the i-th layer of the student network and a second feature map output by the i-th layer of the teacher network; a first correction gradient corresponding to the student network is determined according to the difference between the first feature map and the second feature map; a first soft label output by the student network and a second soft label output by the teacher network are acquired; a second correction gradient corresponding to the student network is determined according to the difference between the first soft label and the second soft label; and the student network is corrected based on the first correction gradient and the second correction gradient. The learning ability and performance of the student network are thereby improved.

Description

Method and device for training graph processing network model, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, in particular to artificial intelligence technologies such as deep learning and computer vision, and more particularly to a method and an apparatus for training a graph processing network model, an electronic device, and a storage medium.
Background
With the development of computer technology, deep learning has made outstanding breakthroughs in various fields. Because artificial neural networks have a strong self-learning capability, they are increasingly widely applied in fields such as pattern recognition, intelligent robots, automatic control, biology, medicine and economics. However, more advanced network models require more parameters and therefore occupy more storage space and computing resources, which is why knowledge distillation (also known as the teacher-student network) emerged. When a teacher network is used to train a student network, how to improve the performance of the student network is a problem that urgently needs to be solved.
Disclosure of Invention
The disclosure provides a training method and device for a graph processing network model, electronic equipment and a storage medium.
In one aspect of the present disclosure, a method for training a graph processing network model is provided, including:
respectively inputting training samples into a student network and a teacher network to obtain a first feature map output by the ith layer of the student network and a second feature map output by the ith layer of the teacher network, wherein i is an integer which is greater than or equal to 1 and less than or equal to N, and N is the number of network layers included in the student network and the teacher network;
determining a first correction gradient corresponding to the student network according to the difference between the first feature map and the second feature map;
acquiring a first soft label output by the student network and a second soft label output by the teacher network;
determining a corresponding second correction gradient in the student network according to the difference between the first soft label and the second soft label;
and correcting the student network based on the first correction gradient and the second correction gradient.
In another aspect of the present disclosure, a training apparatus for a graph processing network model is provided, including:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the first acquisition module is used for respectively inputting training samples into a student network and a teacher network so as to acquire a first characteristic diagram output by the ith layer of the student network and a second characteristic diagram output by the ith layer of the teacher network, i is an integer which is greater than or equal to 1 and less than or equal to N, and N is the number of network layers contained in the student network and the teacher network;
the first determining module is used for determining a first correction gradient corresponding to the student network according to the difference between the first feature map and the second feature map;
the second acquisition module is used for acquiring the first soft label output by the student network and the second soft label output by the teacher network;
a second determining module, configured to determine a corresponding second correction gradient in the student network according to a difference between the first soft tag and the second soft tag;
and the correction module is used for correcting the student network based on the first correction gradient and the second correction gradient.
In another aspect of the present disclosure, an electronic device is provided, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method for training a graph processing network model according to an embodiment of the above-described aspect.
In another aspect of the present disclosure, a non-transitory computer-readable storage medium storing thereon a computer program for causing a computer to execute a method for training a graph processing network model according to an embodiment of the above-described aspect is provided.
In another aspect of the present disclosure, a computer program product is provided, which includes a computer program, and when executed by a processor, the computer program implements the method for training a graph processing network model according to an embodiment of the above-mentioned aspect.
The training method, the device, the electronic device and the storage medium for the graph processing network model provided by the disclosure can be used for respectively inputting training samples into a student network and a teacher network to obtain a first feature map output by the ith layer of the student network and a second feature map output by the ith layer of the teacher network, then determining a first correction gradient corresponding to the student network according to the difference between the first feature map and the second feature map, then obtaining a first soft label output by the student network and a second soft label output by the teacher network, and then determining a second correction gradient corresponding to the student network according to the difference between the first soft label and the second soft label, so that the student network is corrected based on the first correction gradient and the second correction gradient. Therefore, when the student network is trained, local information is considered, and global information is concerned, so that the student network generated by training has characteristics similar to those of a teacher network, and has better learning ability and characteristic expression ability, and the effect and performance of the student network are greatly improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a schematic flowchart illustrating a training method for a graph processing network model according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart illustrating a training method for a graph processing network model according to another embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a training apparatus for a graph processing network model according to an embodiment of the present disclosure;
FIG. 4 is a block diagram of an electronic device for implementing a method for training a graph processing network model according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking and planning), and it involves technologies at both the hardware level and the software level. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing and the like; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning technology, deep learning technology, big data processing technology, knowledge graph technology and the like.
Deep learning refers to multi-layer artificial neural networks and the methods used to train them. A layer of such a network takes a large batch of matrix values as input, applies weights through a nonlinear activation function, and produces another set of data as output. With an appropriate number of layers organized and linked together, the network can perform accurate and complex processing, much as a person recognizes and labels objects in images.
Computer vision is an interdisciplinary field of science that studies how computers can gain a high-level understanding from digital images or videos. From an engineering point of view, it seeks to automate tasks that the human visual system can accomplish. Computer vision tasks include methods for acquiring, processing, analyzing and understanding digital images, and methods for extracting high-dimensional data from the real world to produce numerical or symbolic information, for example in the form of decisions.
A training method, an apparatus, an electronic device, and a storage medium of a graph processing network model according to embodiments of the present disclosure are described below with reference to the accompanying drawings.
The training method of the graph processing network model according to the embodiment of the present disclosure may be executed by the training apparatus of the graph processing network model according to the embodiment of the present disclosure, and the apparatus may be configured in an electronic device.
Fig. 1 is a schematic flowchart of a training method for a graph processing network model according to an embodiment of the present disclosure.
As shown in fig. 1, the training method of the graph processing network model may include the following steps:
step 101, respectively inputting training samples into a student network and a teacher network to obtain a first feature map output by the ith layer of the student network and a second feature map output by the ith layer of the teacher network.
Wherein i is an integer greater than or equal to 1 and less than or equal to N, and N is the number of network layers included in the student network and the teacher network.
In addition, the training sample may be an image sample containing any content, which is not limited in this disclosure.
It can be understood that the student network may be a neural network that has not yet reached a usable state, and its network parameters can be updated through training with the teacher network until it reaches a usable state. The teacher network may be a neural network that has already been trained, or a neural network that has not finished training and can learn simultaneously with the student network, and the like, which is not limited in this disclosure.
In addition, the teacher network and the student network may have the same network structure and network layer number, and the network widths may be the same or different, which is not limited in this disclosure.
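For illustration only, and not as part of the claimed method, the feature-map extraction described in step 101 could be sketched in PyTorch roughly as follows; the tiny example networks, the function name get_layer_output and the use of nn.Sequential are assumptions made for the sketch, and the compared layers are given matching output shapes for simplicity.

import torch
import torch.nn as nn

def get_layer_output(network: nn.Sequential, x: torch.Tensor, i: int) -> torch.Tensor:
    """Run x through the network and return the output of its i-th layer (1-indexed)."""
    out = x
    for layer_idx, layer in enumerate(network, start=1):
        out = layer(out)
        if layer_idx == i:
            return out
    raise ValueError(f"network has fewer than {i} layers")

# Hypothetical student and teacher networks with the same number of layers N = 3.
student = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 8, 3, padding=1))
teacher = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.Conv2d(32, 8, 3, padding=1))

sample = torch.randn(1, 3, 64, 64)   # a training sample (an image)
i = 3
first_feature_map = get_layer_output(student, sample, i)       # output of the i-th student layer
with torch.no_grad():                                          # the teacher is not updated
    second_feature_map = get_layer_output(teacher, sample, i)  # output of the i-th teacher layer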
And 102, determining a first correction gradient corresponding to the student network according to the difference between the first characteristic diagram and the second characteristic diagram.
Optionally, the distance between each pixel point in the first feature map and the corresponding pixel point in the second feature map may be determined, and then the difference between the first feature map and the second feature map may be represented according to the distance between each corresponding pixel point.
For example, the euclidean distance formula may be used to determine the distance between corresponding pixels in the first feature map and the second feature map.
For example, suppose the pixel point a1 in the first feature map has the feature value A1, and the pixel point a2 at the same position in the second feature map has the feature value A2. The distance between this pair of pixel points calculated with the Euclidean distance formula is:
d1(a1, a2) = √((A2 - A1)²)
Or, the distance between corresponding pixel points of the first feature map and the second feature map can be calculated with the Manhattan distance formula.
For example, suppose the pixel point b1 in the first feature map has the feature value B1, and the pixel point b2 at the same position in the second feature map has the feature value B2. The distance between this pair of pixel points calculated with the Manhattan distance formula is: d2(b1, b2) = |B2 - B1|.
It should be noted that the above example is only an example, and cannot be used as a limitation for determining the distance between corresponding pixels in the first feature map and the second feature map in the embodiment of the present disclosure.
In addition, when the difference between the first feature map and the second feature map is determined according to the distance between each pixel point in the first feature map and the corresponding pixel point in the second feature map, various modes can be provided.
For example, suppose the distances between corresponding pixel points in the first feature map and the second feature map are: d1, d2, d3, d4, d5. The distances of the corresponding pixel points may be summed to serve as the difference between the first feature map and the second feature map: (d1 + d2 + d3 + d4 + d5). Or, the distances of the corresponding pixel points may be averaged to serve as the difference between the first feature map and the second feature map: (d1 + d2 + d3 + d4 + d5)/5.
It should be noted that the above examples are merely illustrative, and cannot be taken as a limitation on the manner of determining the difference between the first feature map and the second feature map in the embodiments of the present disclosure.
In addition, the first correction gradient may be determined according to the difference between the first feature map and the second feature map by using gradient descent, stochastic gradient descent, or the like; the method of determining the first correction gradient is not limited in the present disclosure.
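As a minimal sketch of the distance computations and gradient determination described above (the function name and the PyTorch framing are assumptions; the patent itself does not prescribe an implementation):

import torch

def feature_map_difference(fm1: torch.Tensor, fm2: torch.Tensor,
                           metric: str = "euclidean", reduce: str = "mean") -> torch.Tensor:
    """Difference between two feature maps of the same shape, based on the distance
    between pixel points at the same positions."""
    if metric == "euclidean":
        per_pixel = torch.sqrt((fm2 - fm1) ** 2)   # d1(a1, a2) = sqrt((A2 - A1)^2)
    elif metric == "manhattan":
        per_pixel = torch.abs(fm2 - fm1)           # d2(b1, b2) = |B2 - B1|
    else:
        raise ValueError(metric)
    return per_pixel.sum() if reduce == "sum" else per_pixel.mean()

# difference = feature_map_difference(first_feature_map, second_feature_map)
# Using this difference as a loss and back-propagating it (difference.backward()) is one
# way to obtain a first correction gradient for the student parameters.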
It can be appreciated that there are a number of situations for the value of i when training a student network.
For example, when the student network is trained, i may be fixed first, a first correction gradient corresponding to the ith layer of the student network is determined, then the ith layer is trained continuously until the network parameter of the ith layer is fixed, and i takes other values, and then other layers corresponding to the student network are trained.
For example, if i is 5, a first correction gradient corresponding to the student network layer 5 can be determined according to a difference between the first feature map output by the student network layer 5 and the second feature map output by the teacher network layer 5, and then the network parameters of the student network layer 5 are trained continuously. Until the network parameters of the 5 th layer are fixed, then i takes other values, such as 12, and then the 12 th layer of the student network is trained.
Or when the student network is trained, i can be set to different values, and a plurality of corresponding network layers in the student network can be trained simultaneously.
For example, in the student network, layer 3 outputs the first feature map A1, layer 4 outputs the first feature map B1, and layer 5 outputs the first feature map C1; in the teacher network, layer 3 outputs the second feature map A2, layer 4 outputs the second feature map B2, and layer 5 outputs the second feature map C2.
Then, the distance between corresponding pixel points in the first feature map A1 and the second feature map A2, the distance between corresponding pixel points in the first feature map B1 and the second feature map B2, and the distance between corresponding pixel points in the first feature map C1 and the second feature map C2 may be calculated, for example with the Manhattan distance formula, so as to determine the difference between A1 and A2, the difference between B1 and B2, and the difference between C1 and C2, respectively. Then, three first correction gradients can be determined according to the difference between A1 and A2, the difference between B1 and B2, and the difference between C1 and C2.
It should be noted that the above examples are only illustrative, and should not be taken as a limitation on determining the corresponding first correction gradient and the like in the student network in the embodiments of the present disclosure.
It is understood that the difference between the first feature map and the second feature map and the first correction gradient corresponding to the student network may be positively correlated.
For example, if the difference between the first feature map and the second feature map is larger, the value of the first correction gradient corresponding to the student network is also larger; if the difference between the first feature map and the second feature map is smaller, the value of the first correction gradient corresponding to the student network is also smaller, which is not limited by the disclosure.
In the embodiment of the disclosure, the first correction gradient corresponding to the student network is determined according to the difference between the first feature map and the second feature map, and this difference can be represented by the distance between the pixel points at the same positions in the first feature map and the second feature map, so that in the process of determining the first correction gradient, each piece of local information is fully considered, and the accuracy and reliability of the first correction gradient are higher.
And 103, acquiring a first soft label output by the student network and a second soft label output by the teacher network.
The first soft label can be each classification label output by the student network and the corresponding probability value; the second soft label can be each classification label and each corresponding probability value output by the teacher network.
In the embodiment of the disclosure, the training samples are respectively input into the student network and the teacher network, and the first soft label output by the student network can be obtained through layer-by-layer processing of the student network, and the second soft label output by the teacher network can be obtained through layer-by-layer processing of the teacher network.
And step 104, determining a corresponding second correction gradient in the student network according to the difference between the first soft label and the second soft label.
There may be various ways to determine the difference between the first soft label and the second soft label.
For example, the difference between the first soft label and the second soft label may be determined by using the Manhattan distance formula, or by using the Euclidean distance formula. It is to be understood that the manner of determining the difference between the first soft label and the second soft label is not limited to the Manhattan distance formula, the Euclidean distance formula, and the like.
For example, the first soft label output by the student network is: class 1, corresponding probability 0.15, class 2, corresponding probability 0.05, class 3, corresponding probability 0.7, class 4, corresponding probability 0.1. The second soft label output by the teacher network is: class 1, corresponding probability of 0.05, class 2, corresponding probability of 0.05, class 3, corresponding probability of 0.8, class 4, corresponding probability of 0.1.
For example, using the Euclidean distance formula, the difference between the first soft label and the second soft label is determined as:
d1 = √((0.15 - 0.05)² + (0.05 - 0.05)² + (0.7 - 0.8)² + (0.1 - 0.1)²) ≈ 0.141
or, using a manhattan distance formula, determining that the difference between the first soft label and the second soft label is: d2=|0.15-0.05|+|0.05-0.05|+|0.7-0.8|+|0.1-0.1|=0.2。
It should be noted that the above examples are only examples, and cannot be used as a limitation on the classification labels in the first soft label and the second soft label, their corresponding probability values, or the difference between the first soft label and the second soft label in the embodiments of the present disclosure.
In addition, after the difference between the first soft label and the second soft label is determined, the second correction gradient of the student network can be determined according to the difference, and then the second correction gradient corresponding to each layer in the student network can be determined layer by using back propagation according to the network parameters corresponding to each layer in the student network.
In the embodiment of the disclosure, the first soft label is a result output after the training sample is processed layer by layer in the student network, and the second soft label is a result output after the training sample is processed layer by layer in the teacher network, so that according to the difference between the first soft label and the second soft label and the network parameters of each layer of the student network, the determined second correction gradient corresponding to each layer in the student network fully considers the global information, and the second correction gradient is more accurate and reliable.
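A minimal sketch of steps 103 and 104 under the assumption of a small classification head (the head sizes and the use of the Manhattan distance from the numeric example above are assumptions made for the sketch):

import torch
import torch.nn as nn
import torch.nn.functional as F

student_head = nn.Linear(16, 4)   # hypothetical 4-class student classifier
teacher_head = nn.Linear(16, 4)   # hypothetical 4-class teacher classifier

features = torch.randn(1, 16)     # stand-in for the features fed to both heads
first_soft_label = F.softmax(student_head(features), dim=-1)        # student class probabilities
with torch.no_grad():
    second_soft_label = F.softmax(teacher_head(features), dim=-1)   # teacher class probabilities

# Difference between the soft labels (Manhattan distance, as in the numeric example above).
soft_label_difference = torch.abs(first_soft_label - second_soft_label).sum()

# Back propagation distributes this difference layer by layer; the gradient left in each
# student parameter's .grad field then plays the role of that layer's second correction gradient.
soft_label_difference.backward()
second_correction_gradient = {name: p.grad.clone() for name, p in student_head.named_parameters()}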
And 105, correcting the student network based on the first correction gradient and the second correction gradient.
Depending on how the first correction gradient was determined, there are various cases when the student network is corrected based on the first correction gradient and the second correction gradient.
For example, the first correction gradient is determined by fixing the value of i according to a first feature map corresponding to the ith layer of the student network and a second feature map corresponding to the ith layer of the teacher network, and at this time, the network parameter of the ith layer in the student network may be corrected based on the second correction gradient and the first correction gradient, and then the network parameters of the remaining layers in the student network may be corrected based on the second correction gradient.
For example, the second correction gradient and the first correction gradient corresponding to the ith layer in the student network may be fused, and then the network parameter of the ith layer in the student network may be corrected.
The second modified gradient and the first modified gradient may be fused in various ways, for example, the second modified gradient and the first modified gradient may be directly fused, or may also be fused according to a weight, and the like, which is not limited in this disclosure.
For example, when the second correction gradient corresponding to the i-th layer in the student network is determined to be +0.05 and the first correction gradient is -0.01, and the two are directly fused, the obtained result is +0.04, and the network parameter of the i-th layer in the student network can then be adjusted up by 0.04.
Alternatively, the first correction gradient and the second correction gradient may be fused according to the weight corresponding to each correction gradient, where each weight may be a preset value, which is not limited in this disclosure.
For example, the first correction gradient corresponding to the i-th layer in the student network is -0.06 and the second correction gradient is +0.03; the weight set for the first correction gradient is 0.3 and the weight set for the second correction gradient is 0.7. When the first correction gradient and the second correction gradient are fused according to these weights, the obtained result is +0.003, and the network parameter of the i-th layer in the student network can then be adjusted up by 0.003.
It should be noted that the above examples are only illustrative, and should not be taken as limitations on network parameter modification and the like of the ith layer in the student network in the embodiments of the present disclosure.
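As an illustrative sketch of the weighted fusion just described (the helper name is an assumption, and the weights 0.3 and 0.7 are taken from the example above):

import torch

def fuse_and_apply(param: torch.Tensor, first_grad, second_grad,
                   w_first: float = 0.3, w_second: float = 0.7) -> None:
    """Fuse the first and second correction gradients by weight and adjust the parameter,
    following the convention of the examples above (a positive fused value adjusts the
    parameter up, a negative one adjusts it down)."""
    fused = w_first * first_grad + w_second * second_grad
    with torch.no_grad():
        param += fused   # e.g. 0.3 * (-0.06) + 0.7 * 0.03 = +0.003, adjust up by 0.003

# Direct fusion corresponds to w_first = w_second = 1.0,
# e.g. +0.05 + (-0.01) = +0.04, so the parameter would be adjusted up by 0.04.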
It can be understood that after the network parameter of the ith layer in the student network is corrected, the network parameters of the remaining layers can be corrected layer by layer according to the determined second correction gradient corresponding to the remaining layers in the student network.
Or, the network parameters of the ith layer and the network layers from the ith layer to the input layer in the student network can be corrected based on the second correction gradient and the first correction gradient, and then the network parameters of the network layers from the (i + 1) th layer to the output layer in the student network can be corrected based on the second correction gradient.
It can be understood that a plurality of first correction gradients can be obtained by taking different values for i and training a plurality of corresponding network layers in the student network at the same time, and at this time, when the student network is corrected, each network layer of the previous i layers needs to be considered.
For example, the value of i is 3, 4, and 5, and the 3 rd layer, the 4 th layer, and the 5 th layer in the student network are trained simultaneously, so that three first correction gradients can be obtained: t1, T2, T3.
Then, when the student network is modified, for the input layer to the 3 rd layer of the student network, the second modification gradients corresponding to the T1, the T2, the T3 and the first 3 layers can be respectively fused, so that the network parameters of the first 3 layers can be modified. For the layer 4, T2, T3 and the corresponding second correction gradient of the layer 4 may be fused, and then the network parameter of the layer 4 may be corrected. For layer 5, T3 and the corresponding second modification gradient of layer 5 may be fused, and then the network parameter of layer 5 may be modified.
It should be noted that T1 may be a common value, and the first correction gradients corresponding to the first three layers may all be considered to be T1. Or T1 may be a set, containing the first correction gradient derived by back propagation for each of the first three layers according to the difference between the first feature map and the second feature map: T11, T12, T13. Similarly, T2 may be a common value or may be a set: T21, T22, T23, T24; T3 may also be a common value or may be a set: T31, T32, T33, T34, T35, and so on, which is not limited by this disclosure.
For example, T1, T2, and T3 are all sets, and when the student network is modified, for the input layer to the 3 rd layer of the student network, T11, T21, T31 and the second modification gradient corresponding to the 1 st layer may be fused, T12, T22, T32 and the second modification gradient corresponding to the 2 nd layer may be fused, and T13, T23, T33 and the second modification gradient corresponding to the 3 rd layer may be fused, so that the network parameters of the first 3 layers may be modified respectively. For the layer 4, T24, T34 and the corresponding second correction gradient of the layer 4 may be fused, and then the network parameter of the layer 4 may be corrected. For layer 5, T35 and the corresponding second modification gradient of layer 5 may be fused, and then the network parameter of layer 5 may be modified.
It should be noted that the above examples are only illustrative, and should not be taken as a limitation when the embodiment of the present disclosure corrects the student network.
In addition, there are various ways to fuse the first modified gradient and the second modified gradient, for example, direct fusion, fusion according to weight, and the like, which are not described herein again.
And then, correcting the network parameters of each network layer from the 6 th layer to the output layer according to the second correction gradient corresponding to each of the rest network layers in the student network.
It should be noted that the above examples are only illustrative, and should not be taken as a limitation when modifications are made to the student network in the embodiments of the present disclosure.
In the embodiment of the disclosure, the student network obtained after the first correction gradient and the second correction gradient is corrected not only considers the global information but also pays attention to the local information, so that the performance of the student network is greatly improved, and the student network has characteristics more similar to those of a teacher network, so that the student network has better learning ability and characteristic expression, and the effect of the student network can be further improved.
According to the embodiment of the disclosure, training samples may be first input into a student network and a teacher network respectively to obtain a first feature map output by an i-th layer of the student network and a second feature map output by the i-th layer of the teacher network, then a first correction gradient corresponding to the student network is determined according to a difference between the first feature map and the second feature map, then a first soft label output by the student network and a second soft label output by the teacher network are obtained, and then a second correction gradient corresponding to the student network is determined according to a difference between the first soft label and the second soft label, so that the student network is corrected based on the first correction gradient and the second correction gradient. Therefore, when the student network is trained, local information is considered, and global information is concerned, so that the student network generated by training has characteristics similar to those of a teacher network, and has better learning ability and characteristic expression ability, and the effect and performance of the student network are greatly improved.
In the above embodiment, the first correction gradient corresponding to the student network can be determined according to the first feature map output by the student network and the second feature map output by the teacher network, and the second correction gradient corresponding to the student network can be determined according to the first soft label output by the student network and the second soft label output by the teacher network, so that the student network is corrected. In one possible implementation, the performance of the student network may also be evaluated using a discrimination network, which is described in detail below in conjunction with fig. 2.
Fig. 2 is a schematic flowchart of a training method for a graph processing network model according to an embodiment of the present disclosure. As shown in fig. 2, the training method of the graph processing network model may include the following steps:
step 201, respectively inputting training samples into a student network and a teacher network to obtain a first feature map output by the ith layer of the student network and a second feature map output by the ith layer of the teacher network.
Wherein i is an integer greater than or equal to 1 and less than or equal to N, and N is the number of network layers included in the student network and the teacher network.
Step 202, determining a first correction gradient corresponding to the student network according to the Euclidean distance between each pixel point in the first characteristic diagram and the corresponding pixel point in the second characteristic diagram.
Step 203, acquiring a first soft label output by the student network and a second soft label output by the teacher network.
And step 204, determining a corresponding second correction gradient in the student network according to the difference between the first soft label and the second soft label.
Step 205, inputting the first feature map into the discrimination network to obtain a discrimination result output by the discrimination network.
The discrimination network may be a trained network model, or may be a network model that has not been fully trained and can be trained jointly with the student network, which is not limited in this disclosure.
Optionally, each network parameter in the discrimination network may be fixed first and the student network trained. Then, each network parameter in the student network is fixed, and the discrimination network is trained using the student network and the teacher network. Then each network parameter in the discrimination network is fixed again, and the discrimination network is used to continue training the student network. The training of the discrimination network and the student network can be alternated and repeated in this way until both the student network and the discrimination network meet the requirements.
In addition, the first feature map output by the student network is input into the discrimination network, and after the processing of each network layer of the discrimination network, the discrimination network can output the discrimination result of the first feature map.
It will be appreciated that the discrimination result output by the discrimination network may contain a confidence level in addition to information indicating which network the first feature map was generated by. For example, the discrimination result may be: "the first feature map is generated by the teacher network" with a confidence of 0.88, "the first feature map is generated by the student network" with a confidence of 0.12, and so on.
The above-mentioned discrimination results are merely illustrative, and are not intended to limit the discrimination results in the embodiments of the present disclosure.
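Purely as an illustrative sketch of step 205 (the discrimination network architecture and tensor shapes are assumptions; any small classifier over the feature map would do):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical discrimination network: feature map in, two confidences out
# (index 0: "generated by the student network", index 1: "generated by the teacher network").
discriminator = nn.Sequential(
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 2),
)

first_feature_map = torch.randn(1, 8, 64, 64)   # stand-in for the student's i-th layer output
confidences = F.softmax(discriminator(first_feature_map), dim=-1)
# e.g. confidences = [[0.12, 0.88]] would mean: "generated by the student network" with
# confidence 0.12 and "generated by the teacher network" with confidence 0.88.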
And step 206, determining a third correction gradient corresponding to the student network according to the discrimination result output by the discrimination network.
There may be a plurality of cases when determining the third correction gradient according to different discrimination results.
Optionally, the discrimination result may indicate whether the first feature map was generated by the student network or the teacher network, and a third correction gradient corresponding to the student network is then determined according to the corresponding confidence level.
For example, in the case that the discrimination result output by the discrimination network indicates that the first feature map is generated by the student network, the third correction gradient may be determined according to the confidence of the discrimination result.
The third correction gradient determined may differ for different confidence levels.
For example, the determination result may be: the first feature map is generated by the student network and has a confidence level of 0.9, the first feature map is generated by the teacher network and has a confidence level of 0.1, and the judgment result can indicate that the first feature map is generated by the student network. The confidence level is 0.9, which indicates that the discrimination network is 90% sure that the first feature map is generated by the student network, and the performance of the student network is poor, so that the student network needs to continue training, and thus a third correction gradient G1 with a larger value can be determined.
Or, the judgment result is: the first feature map is generated by the student network and has a confidence level of 0.7, the first feature map is generated by the teacher network and has a confidence level of 0.3, and the judgment result can indicate that the first feature map is generated by the student network. The confidence level is 0.7, which indicates that the discrimination network has a 70% confidence that the first feature map was generated by the student network, so that the student network can be trained to determine the third correction gradient G2 with a relatively smaller value.
It should be noted that the above examples are only illustrative, and cannot be taken as limitations on the determination result and the corresponding confidence, the third correction gradient, and the like in the embodiments of the present disclosure.
Alternatively, in a case where the discrimination result output by the discrimination network indicates that the first feature map is generated by the teacher network, the third correction gradient may be determined to be zero.
For example, the determination result may be: the first feature map is generated by the student network and has a confidence level of 0.05, the first feature map is generated by the teacher network and has a confidence level of 0.95, and the judgment result can indicate that the first feature map is generated by the teacher network. Therefore, the performance of the student network is better, the training of the student network can be finished, and the third correction gradient can be determined to be zero.
It should be noted that the above examples are only illustrative, and cannot be taken as limitations on the determination result and the corresponding confidence, the third correction gradient, and the like in the embodiments of the present disclosure.
It can be understood that, in actual use, in the case that the discrimination result indicates that the first feature map is generated by the teacher network, if the confidence does not reach the preset threshold, the student network may still be trained, so the third correction gradient may still be determined.
For example, suppose the preset threshold is 0.9. Although the discrimination result indicates that the first feature map is generated by the teacher network, the corresponding confidence is 0.75, which does not reach the preset threshold, so the corresponding third correction gradient may be determined and the student network trained further, making the performance of the student network closer to that of the teacher network.
It should be noted that the above examples are only illustrative, and cannot be taken as limitations on the determination result and the corresponding confidence, the third correction gradient, and the like in the embodiments of the present disclosure.
In the embodiment of the disclosure, the third correction gradient is determined according to the discrimination result of the first feature map input into the discrimination network, and the global information is fully considered, so that the accuracy and reliability of the determined third correction gradient are improved.
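A minimal sketch of one way the third correction gradient could be derived from the discrimination result, in line with the confidence-dependent behaviour described above (the loss form, the 0.9 threshold and the function name are assumptions):

import torch

def third_correction_loss(confidences: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Return a loss whose back-propagated gradient acts as the third correction gradient.

    confidences[..., 0] is the confidence that the first feature map was generated by the
    student network; confidences[..., 1] that it was generated by the teacher network.
    """
    teacher_conf = confidences[..., 1]
    if float(teacher_conf.mean()) >= threshold:
        # The discrimination network is already convinced the map came from the teacher,
        # so the third correction gradient is determined to be zero.
        return torch.zeros((), requires_grad=True)
    # Otherwise, the more certain the discrimination network is that the map came from the
    # student, the larger the correction: push the student to "fool" the discriminator.
    return -torch.log(teacher_conf + 1e-8).mean()

# third_loss = third_correction_loss(confidences); third_loss.backward() then leaves the
# third correction gradient in the student parameters' .grad fields.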
And step 207, correcting the student network based on the first correction gradient, the second correction gradient and the third correction gradient.
When the student network is corrected according to the first correction gradient, the second correction gradient and the third correction gradient, various conditions exist.
For example, the first correction gradient is determined by fixing the value of i according to the ith layer of the student network, and the network parameters of the ith layer in the student network can be corrected according to the first correction gradient, the second correction gradient and the third correction gradient, and then the rest layers are corrected according to the second correction gradient.
For example, the second correction gradient, the third correction gradient, and the first correction gradient corresponding to the ith layer in the student network may be fused, and then the network parameter of the ith layer in the student network may be corrected.
The fusion method may be a plurality of methods, for example, direct fusion may be performed, or fusion may be performed according to a weight, and the like, which is not limited in this disclosure.
For example, when the third correction gradient is +0.02, the second correction gradient is-0.05, and the first correction gradient is +0.01, which correspond to the ith layer in the student network, and the three are directly merged, the obtained result is-0.02, and then the network parameter of the ith layer in the student network can be adjusted down by 0.02.
Alternatively, the first correction gradient, the second correction gradient and the third correction gradient may be fused according to the weight corresponding to each correction gradient, where each weight may be a preset value, which is not limited in this disclosure.
For example, the first correction gradient corresponding to the i-th layer in the student network is -0.1, the second correction gradient is +0.03, and the third correction gradient is +0.01; the weight set for the first correction gradient is 0.15, the weight set for the second correction gradient is 0.25, and the weight set for the third correction gradient is 0.6. When the first correction gradient, the second correction gradient and the third correction gradient are fused according to these weights, the obtained result is -0.0015, and the network parameter of the i-th layer in the student network may then be adjusted down by 0.0015.
It should be noted that the above examples are only illustrative, and should not be taken as limitations on network parameter modification and the like in the student network in the embodiments of the present disclosure.
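Purely for illustration, with the numbers of the example above and the fused value recomputed, the weighted fusion of the three correction gradients works out as:

first_grad, second_grad, third_grad = -0.1, 0.03, 0.01
w_first, w_second, w_third = 0.15, 0.25, 0.6

fused = w_first * first_grad + w_second * second_grad + w_third * third_grad
print(round(fused, 4))   # -0.0015: the i-th layer parameter is adjusted down by 0.0015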
Or, the network parameters of the ith layer and the network layers from the ith layer to the input layer in the student network can be corrected based on the third correction gradient, the second correction gradient and the first correction gradient, and then the network parameters of the network layers from the (i + 1) th layer to the output layer in the student network can be corrected based on the second correction gradient.
It can be understood that a plurality of first correction gradients are obtained by taking different values for i and training a plurality of corresponding network layers in the student network at the same time, and at this time, when the student network is corrected, each network layer of the previous i layers needs to be considered.
For example, the value of i is 3, 4, and 5, and the 3 rd layer, the 4 th layer, and the 5 th layer in the student network are trained simultaneously, so that three first correction gradients can be obtained: t1, T2, T3.
The first feature map output by layer 3 is input into the discrimination network, and the third correction gradient determined is G1; the first feature map output by layer 4 is input into the discrimination network, and the third correction gradient determined is G2; the first feature map output by layer 5 is input into the discrimination network, and the third correction gradient determined is G3.
Then, when the student network is modified, for the input layer to the 3 rd layer of the student network, T1, T2, T3, G1, G2, G3 and the second modification gradient corresponding to the 1 st layer may be fused, T1, T2, T3, G1, G2, G3 and the second modification gradient corresponding to the 2 nd layer may be fused, and T1, T2, T3, G1, G2, G3 and the second modification gradient corresponding to the 3 rd layer may be fused, so that the network parameter of the first 3 layers may be modified. For the layer 4, the T2, T3, G2, G3 and the corresponding second correction gradient of the layer 4 may be merged, and then the network parameter of the layer 4 may be corrected. For the 5 th layer, T3, G3 and the corresponding second correction gradient of the 5 th layer may be fused, and then the network parameter of the 5 th layer may be corrected.
It should be noted that T1 may be a common value, and the first correction gradients of the first three layers may all be considered to be T1; accordingly, G1 may also be a common value, and the third correction gradients corresponding to the first three layers may all be considered to be G1. Or T1 may be a set, containing the first correction gradient derived by back propagation for each of the first three layers according to the difference between the first feature map and the second feature map: T11, T12, T13; accordingly, G1 may also be a set, containing the third correction gradient derived by back propagation for each of the first three layers: G11, G12, G13. Similarly, T2 may be a common value or a set: T21, T22, T23, T24; accordingly, G2 may also be a common value or a set: G21, G22, G23, G24. Similarly, T3 may be a common value or a set: T31, T32, T33, T34, T35; accordingly, G3 may also be a common value or a set: G31, G32, G33, G34, G35, and so on, which is not limited by this disclosure.
It should be noted that, for the specific process of correcting the network parameters of the student network based on each correction gradient, reference may be made to the process of correcting the student network based on each correction gradient in the above embodiment, and details are not described here.
And then, correcting the network parameters of each network layer from the 6 th layer to the output layer in the student network according to the second correction gradient corresponding to each of the other network layers in the student network.
It should be noted that the above examples are only illustrative, and should not be taken as a limitation when modifications are made to the student network in the embodiments of the present disclosure.
In the embodiment of the disclosure, the student network obtained after the first correction gradient, the second correction gradient and the third correction gradient is corrected, not only global information of different dimensions is considered, but also local information is concerned, so that the performance of the student network is greatly improved, the student network has characteristics more similar to those of a teacher network, and the student network has better learning ability and characteristic expression, and further improves the effect of the student network.
In the embodiment of the disclosure, training samples may be first input into a student network and a teacher network respectively to obtain a first feature map output by the i-th layer of the student network and a second feature map output by the i-th layer of the teacher network; a first correction gradient corresponding to the student network is then determined according to the Euclidean distance between each pixel point in the first feature map and the corresponding pixel point in the second feature map; a first soft label output by the student network and a second soft label output by the teacher network are obtained, and a second correction gradient corresponding to the student network is determined according to the difference between the first soft label and the second soft label; the first feature map may then be input into the discrimination network to obtain a discrimination result output by the discrimination network, and a third correction gradient corresponding to the student network is determined according to that discrimination result, so that the student network is corrected based on the first correction gradient, the second correction gradient and the third correction gradient. Therefore, when the student network is trained, the global information of different dimensions is considered, and the local information is concerned, so that the student network generated by training has characteristics more similar to those of the teacher network, and has better learning ability and characteristic expression ability, and the effect and performance of the student network are greatly improved.
In order to implement the above embodiments, the present disclosure further provides a training apparatus for a graph processing network model. Fig. 3 is a schematic structural diagram of a training apparatus for a graph processing network model according to an embodiment of the present disclosure.
As shown in fig. 3, the training apparatus 300 for a graph processing network model includes: a first obtaining module 310, a first determining module 320, a second obtaining module 330, a second determining module 340, and a modifying module 350.
The first obtaining module 310 is configured to input training samples into a student network and a teacher network respectively to obtain a first feature map output by an i-th layer of the student network and a second feature map output by the i-th layer of the teacher network, where i is an integer greater than or equal to 1 and less than or equal to N, and N is the number of network layers included in the student network and the teacher network.
A first determining module 320, configured to determine a first correction gradient corresponding to the student network according to a difference between the first feature map and the second feature map.
The second obtaining module 330 is configured to obtain the first soft label output by the student network and the second soft label output by the teacher network.
A second determining module 340, configured to determine a corresponding second correction gradient in the student network according to a difference between the first soft tag and the second soft tag.
A correcting module 350, configured to correct the student network based on the first correcting gradient and the second correcting gradient.
In a possible implementation manner, the apparatus 300 may further include:
and the third acquisition module is used for inputting the first characteristic diagram into a discrimination network so as to acquire a discrimination result output by the discrimination network.
And the third determining module is used for determining a third correction gradient corresponding to the student network according to the judgment result output by the judgment network.
In a possible implementation manner, the modification module 350 is specifically configured to modify the student network based on the first modification gradient, the second modification gradient, and the third modification gradient.
In a possible implementation manner, the third determining module is specifically configured to determine the third correction gradient according to the confidence of the discrimination result when the discrimination result output by the discrimination network indicates that the first feature map is generated by the student network.
In a possible implementation manner, the third determining module is specifically configured to determine that the third correction gradient is zero when the discrimination result output by the discrimination network indicates that the first feature map is generated by the teacher network.
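A minimal sketch of the case distinction in these two implementations, assuming the discrimination network outputs the probability that the input feature map was generated by the student; the 0.5 threshold and the confidence-weighted loss are assumptions used only to make the example concrete.

```python
import torch
import torch.nn.functional as F

def third_correction_loss(discriminator, student_feature_map):
    """Hedged sketch: zero loss when the map is judged teacher-generated, otherwise
    a loss weighted by the discriminator's confidence."""
    p_student = discriminator(student_feature_map)  # probability "generated by the student"
    confidence = p_student.mean()
    if confidence < 0.5:
        # Judged as teacher-generated: the third correction gradient is zero.
        return student_feature_map.sum() * 0.0
    # Judged as student-generated: push the student feature map towards "teacher-like",
    # scaling by how confident the discrimination network is.
    target = torch.zeros_like(p_student)
    return confidence.detach() * F.binary_cross_entropy(p_student, target)
```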
In a possible implementation manner, the first determining module 320 is specifically configured to determine the first correction gradient corresponding to the student network according to the Euclidean distance between each pixel point in the first feature map and the corresponding pixel point in the second feature map.
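As a purely illustrative reading of this implementation, the per-pixel Euclidean distance can be written as below; the averaging over pixels and the assumption that both feature maps share the same shape (for example after a channel-matching convolution) are not specified by the disclosure.

```python
import torch

def first_correction_loss(first_feature_map, second_feature_map, eps=1e-8):
    """Hedged sketch: mean Euclidean distance between corresponding pixels.

    Both maps are assumed to have shape (batch, channels, height, width); each pixel
    is treated as the vector of channel values at that spatial location."""
    diff = first_feature_map - second_feature_map.detach()
    per_pixel_distance = torch.sqrt((diff ** 2).sum(dim=1) + eps)  # shape (batch, H, W)
    return per_pixel_distance.mean()
```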
In a possible implementation manner, the correction module 350 is further specifically configured to correct the network parameters of the i-th layer in the student network based on the first correction gradient and the second correction gradient, and to correct the network parameters of the remaining layers in the student network based on the second correction gradient.
In a possible implementation manner, the correction module 350 is further specifically configured to correct, based on the first correction gradient and the second correction gradient, the network parameters of the i-th layer in the student network and of each network layer from the i-th layer to the input layer, and to correct, based on the second correction gradient, the network parameters of each network layer from the (i+1)-th layer to the output layer in the student network.
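A sketch, under stated assumptions, of how the two update scopes in this implementation might be realised: the first correction gradient (computed from the i-th layer output) naturally reaches only the layers from the input up to layer i, while the second correction gradient (computed from the final soft label) reaches every layer. The argument shallow_param_names, the learning rate, and the plain gradient-descent update are all assumptions.

```python
import torch

def apply_scoped_correction(student, feat_loss, soft_loss, lr=0.01, shallow_param_names=()):
    """Hedged sketch: apply first + second correction gradients to layers 1..i,
    and only the second correction gradient to the remaining layers."""
    names, params = zip(*student.named_parameters())
    grad_first = torch.autograd.grad(feat_loss, params, retain_graph=True, allow_unused=True)
    grad_second = torch.autograd.grad(soft_loss, params, allow_unused=True)

    with torch.no_grad():
        for name, p, g1, g2 in zip(names, params, grad_first, grad_second):
            g = g2 if g2 is not None else torch.zeros_like(p)
            if name in shallow_param_names and g1 is not None:
                # Layers from the input up to layer i also receive the first correction gradient.
                g = g + g1
            p -= lr * g
```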
The functions and specific implementation principles of the modules in the embodiments of the present disclosure may refer to the embodiments of the methods, and are not described herein again.
The training device of the graph processing network model according to the embodiment of the present disclosure may first input training samples into the student network and the teacher network respectively to obtain a first feature map output by the i-th layer of the student network and a second feature map output by the i-th layer of the teacher network, then determine a first correction gradient corresponding to the student network according to the difference between the first feature map and the second feature map, then obtain a first soft label output by the student network and a second soft label output by the teacher network, and then determine a second correction gradient corresponding to the student network according to the difference between the first soft label and the second soft label, thereby correcting the student network based on the first correction gradient and the second correction gradient. Therefore, when the student network is trained, both local information and global information are taken into account, so that the student network generated by training has characteristics similar to those of the teacher network, and has better learning ability and feature expression ability, thereby greatly improving the effect and performance of the student network.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 4 shows a schematic block diagram of an example electronic device 400 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 4, the apparatus 400 includes a computing unit 401 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 402 or a computer program loaded from a storage unit 408 into a Random Access Memory (RAM) 403. In the RAM 403, various programs and data required for the operation of the device 400 can also be stored. The computing unit 401, ROM 402, and RAM 403 are connected to each other via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
A number of components in device 400 are connected to I/O interface 405, including: an input unit 406 such as a keyboard, a mouse, or the like; an output unit 407 such as various types of displays, speakers, and the like; a storage unit 408 such as a magnetic disk, optical disk, or the like; and a communication unit 409 such as a network card, modem, wireless communication transceiver, etc. The communication unit 409 allows the device 400 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 401 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 401 executes the respective methods and processes described above, such as the training method of the graph processing network model. For example, in some embodiments, the training method of the graph processing network model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 400 via the ROM 402 and/or the communication unit 409. When the computer program is loaded into the RAM 403 and executed by the computing unit 401, one or more steps of the training method of the graph processing network model described above may be performed. Alternatively, in other embodiments, the computing unit 401 may be configured by any other suitable means (e.g., by means of firmware) to perform the training method of the graph processing network model.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
According to the technical scheme of the present disclosure, training samples may first be input into the student network and the teacher network respectively to obtain a first feature map output by the i-th layer of the student network and a second feature map output by the i-th layer of the teacher network; a first correction gradient corresponding to the student network is then determined according to the difference between the first feature map and the second feature map; a first soft label output by the student network and a second soft label output by the teacher network are obtained; and a second correction gradient corresponding to the student network is determined according to the difference between the first soft label and the second soft label, so that the student network is corrected based on the first correction gradient and the second correction gradient. Therefore, when the student network is trained, both local information and global information are taken into account, so that the student network generated by training has characteristics similar to those of the teacher network, and has better learning ability and feature expression ability, thereby greatly improving the effect and performance of the student network.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. A method for training a graph processing network model, comprising:
respectively inputting training samples into a student network and a teacher network to obtain a first feature map output by the ith layer of the student network and a second feature map output by the ith layer of the teacher network, wherein i is an integer which is greater than or equal to 1 and less than or equal to N, and N is the number of network layers included in the student network and the teacher network;
determining a first correction gradient corresponding to the student network according to the difference between the first characteristic diagram and the second characteristic diagram;
acquiring a first soft label output by the student network and a second soft label output by the teacher network;
determining a second correction gradient corresponding to the student network according to the difference between the first soft label and the second soft label;
and correcting the student network based on the first correction gradient and the second correction gradient.
2. The method of claim 1, wherein, after the obtaining of the first feature map output by the ith layer of the student network and the second feature map output by the ith layer of the teacher network, the method further comprises:
inputting the first feature map into a discrimination network to obtain a discrimination result output by the discrimination network;
determining a third correction gradient corresponding to the student network according to the discrimination result output by the discrimination network;
wherein the correcting the student network based on the first correction gradient and the second correction gradient comprises:
and correcting the student network based on the first correction gradient, the second correction gradient and the third correction gradient.
3. The method of claim 2, wherein the determining a third correction gradient corresponding to the student network according to the discrimination result output by the discrimination network comprises:
and under the condition that the discrimination result output by the discrimination network indicates that the first feature map is generated by the student network, determining the third correction gradient according to the confidence of the discrimination result.
4. The method of claim 2, wherein the determining a third correction gradient corresponding to the student network according to the discrimination result output by the discrimination network comprises:
and determining the third correction gradient to be zero when the discrimination result output by the discrimination network indicates that the first feature map is generated by the teacher network.
5. The method of any one of claims 1-4, wherein the determining a first correction gradient corresponding to the student network according to the difference between the first feature map and the second feature map comprises:
and determining a first correction gradient corresponding to the student network according to the Euclidean distance between each pixel point in the first feature map and the corresponding pixel point in the second feature map.
6. The method of any one of claims 1-4, wherein the correcting the student network based on the first correction gradient and the second correction gradient comprises:
correcting the network parameters of the ith layer in the student network based on the second correction gradient and the first correction gradient;
and correcting the network parameters of the remaining layers in the student network based on the second correction gradient.
7. The method of any one of claims 1-4, wherein the correcting the student network based on the first correction gradient and the second correction gradient comprises:
correcting the network parameters of the ith layer in the student network and of each network layer from the ith layer to the input layer based on the second correction gradient and the first correction gradient;
and correcting the network parameters of each network layer from the (i+1)th layer to the output layer in the student network based on the second correction gradient.
8. A training apparatus of a graph processing network model, comprising:
a first obtaining module, configured to input training samples into a student network and a teacher network respectively, so as to obtain a first feature map output by the ith layer of the student network and a second feature map output by the ith layer of the teacher network, wherein i is an integer greater than or equal to 1 and less than or equal to N, and N is the number of network layers included in the student network and the teacher network;
a first determining module, configured to determine a first correction gradient corresponding to the student network according to the difference between the first feature map and the second feature map;
a second obtaining module, configured to obtain a first soft label output by the student network and a second soft label output by the teacher network;
a second determining module, configured to determine a second correction gradient corresponding to the student network according to a difference between the first soft label and the second soft label;
and a correction module, configured to correct the student network based on the first correction gradient and the second correction gradient.
9. The apparatus of claim 8, further comprising:
a third obtaining module, configured to input the first feature map into a discrimination network to obtain a discrimination result output by the discrimination network;
a third determining module, configured to determine a third correction gradient corresponding to the student network according to the discrimination result output by the discrimination network;
the correction module is specifically configured to:
and correcting the student network based on the first correction gradient, the second correction gradient and the third correction gradient.
10. The apparatus of claim 9, wherein the third determining module is specifically configured to:
and under the condition that the discrimination result output by the discrimination network indicates that the first feature map is generated by the student network, determining the third correction gradient according to the confidence of the discrimination result.
11. The apparatus of claim 9, wherein the third determining module is specifically configured to:
and determining the third correction gradient to be zero when the discrimination result output by the discrimination network indicates that the first feature map is generated by the teacher network.
12. The apparatus of any one of claims 8-11, wherein the first determining module is specifically configured to:
and determining a first correction gradient corresponding to the student network according to the Euclidean distance between each pixel point in the first characteristic diagram and the corresponding pixel point in the second characteristic diagram.
13. The apparatus according to any one of claims 8-11, wherein the correction module is further specifically configured to:
correcting the network parameters of the ith layer in the student network based on the second correction gradient and the first correction gradient;
and correcting the network parameters of the remaining layers in the student network based on the second correction gradient.
14. The apparatus according to any one of claims 8-11, wherein the correction module is further specifically configured to:
correcting the network parameters of the ith layer in the student network and of each network layer from the ith layer to the input layer based on the second correction gradient and the first correction gradient;
and correcting the network parameters of each network layer from the (i+1)th layer to the output layer in the student network based on the second correction gradient.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
CN202110261670.9A 2021-03-10 2021-03-10 Training method and device for graph processing network model, electronic equipment and storage medium Active CN113033774B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110261670.9A CN113033774B (en) 2021-03-10 2021-03-10 Training method and device for graph processing network model, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110261670.9A CN113033774B (en) 2021-03-10 2021-03-10 Training method and device for graph processing network model, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113033774A true CN113033774A (en) 2021-06-25
CN113033774B CN113033774B (en) 2024-06-21

Family

ID=76469335

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110261670.9A Active CN113033774B (en) 2021-03-10 2021-03-10 Training method and device for graph processing network model, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113033774B (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200151497A1 (en) * 2018-11-12 2020-05-14 Sony Corporation Semantic segmentation with soft cross-entropy loss
CN111598216A (en) * 2020-04-16 2020-08-28 北京百度网讯科技有限公司 Method, device and equipment for generating student network model and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
葛仕明; 赵胜伟; ***; 李晨钰: "Face Recognition Based on Deep Feature Distillation" (基于深度特征蒸馏的人脸识别), Journal of Beijing Jiaotong University (北京交通大学学报), No. 06, 15 December 2017 (2017-12-15) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114416934A (en) * 2021-12-24 2022-04-29 北京百度网讯科技有限公司 Multi-modal dialog generation model training method and device and electronic equipment
CN114663397A (en) * 2022-03-22 2022-06-24 小米汽车科技有限公司 Method, device, equipment and storage medium for detecting travelable area

Also Published As

Publication number Publication date
CN113033774B (en) 2024-06-21

Similar Documents

Publication Publication Date Title
CN112560874A (en) Training method, device, equipment and medium for image recognition model
CN113361572B (en) Training method and device for image processing model, electronic equipment and storage medium
CN114202076B (en) Training method of deep learning model, natural language processing method and device
CN111783621A (en) Method, device, equipment and storage medium for facial expression recognition and model training
CN113642431A (en) Training method and device of target detection model, electronic equipment and storage medium
CN113033774B (en) Training method and device for graph processing network model, electronic equipment and storage medium
CN113379813A (en) Training method and device of depth estimation model, electronic equipment and storage medium
CN112785582A (en) Training method and device for thermodynamic diagram generation model, electronic equipment and storage medium
CN114092759A (en) Training method and device of image recognition model, electronic equipment and storage medium
CN112561060A (en) Neural network training method and device, image recognition method and device and equipment
CN113947188A (en) Training method of target detection network and vehicle detection method
CN113177449A (en) Face recognition method and device, computer equipment and storage medium
CN115861462A (en) Training method and device for image generation model, electronic equipment and storage medium
CN114037052A (en) Training method and device for detection model, electronic equipment and storage medium
CN114067099A (en) Training method of student image recognition network and image recognition method
CN114111813A (en) High-precision map element updating method and device, electronic equipment and storage medium
CN115239889B (en) Training method of 3D reconstruction network, 3D reconstruction method, device, equipment and medium
CN114783597B (en) Method and device for diagnosing multi-class diseases, electronic equipment and storage medium
CN114461078B (en) Man-machine interaction method based on artificial intelligence
CN115565186A (en) Method and device for training character recognition model, electronic equipment and storage medium
CN112541557B (en) Training method and device for generating countermeasure network and electronic equipment
CN112560848B (en) Training method and device for POI (Point of interest) pre-training model and electronic equipment
CN114445668A (en) Image recognition method and device, electronic equipment and storage medium
CN113205131A (en) Image data processing method and device, road side equipment and cloud control platform
CN112749707A (en) Method, apparatus, and medium for object segmentation using neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20240528
Address after: Unit 301, unit 1, 3 / F, building 15, yard 1, gaolizhang Road, Haidian District, Beijing
Applicant after: Beijing Accurate Pointing Information Technology Co.,Ltd.
Country or region after: China
Address before: 2 / F, *** building, 10 Shangdi 10th Street, Haidian District, Beijing 100085
Applicant before: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd.
Country or region before: China
GR01 Patent grant