CN112686275B - Knowledge distillation-fused generation playback frame type continuous image recognition system and method - Google Patents

Knowledge distillation-fused generation playback frame type continuous image recognition system and method

Info

Publication number
CN112686275B
Authority
CN
China
Prior art keywords
feature
image
style
distillation
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110003856.4A
Other languages
Chinese (zh)
Other versions
CN112686275A (en)
Inventor
许晓斐 (Xu Xiaofei)
陈昊鹏 (Chen Haopeng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202110003856.4A priority Critical patent/CN112686275B/en
Publication of CN112686275A publication Critical patent/CN112686275A/en
Application granted granted Critical
Publication of CN112686275B publication Critical patent/CN112686275B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a knowledge distillation-fused generation playback frame type continuous image recognition system and method, comprising a generator and a solver. The generator comprises a feature generation model and an image generation model, which correspond to different continuous image recognition requirements and scenes. The solver comprises an image style feature extractor and a feature classifier; the image style feature extractor is responsible for extracting visual features from the image, and the feature classifier is responsible for classifying feature vectors. The method can further improve accuracy in practical continuous image recognition tasks and more effectively avoid forgetting in the model, and its feature generation strategy greatly reduces the reliability requirements placed on the generative model in the generation playback framework, as well as its training difficulty and performance overhead.

Description

Knowledge distillation-fused generation playback frame type continuous image recognition system and method
Technical Field
The invention relates to the technical field of information, in particular to a knowledge distillation-fused generation playback frame type continuous image recognition system and method.
Background
The connectionist school of artificial intelligence arose as an interdisciplinary field spanning cognitive science, brain science, computer science and related areas, and was founded largely on imitating how humans physiologically learn. From fingerprint unlocking to astronomical navigation, deep learning, as its representative technique, has been widely applied across software-based industrial practice. However, industrialization has gradually pulled deep learning practice away from the original intention of imitating human learning: humans are very good at accumulating knowledge and using old knowledge to learn new knowledge, whereas deep learning models behave in the opposite way.
As early as the 1980s, McCloskey and Cohen pointed out that when a trained neural network is retrained on new samples, it quickly forgets the old knowledge it has learned. This phenomenon is called "catastrophic forgetting", and it makes deep learning ill-suited to "continual learning" and "lifelong learning" scenarios. Subsequently, Abraham and Robins further summarized this as the "stability/plasticity dilemma" of neural networks.
Most "catastrophic forgetting" studies have focused on improving the algorithms for training networks, assuming that old training data is completely unavailable (or only partially available). Yet another solution to deep lifelong learning is to train a data-driven playback method so that old training data is reproduced, at which point the persistent learning degenerates into a common training task.
Taking the image recognition problem as an example, Zhizhong Li and Derek Hoiem, in Learning without Forgetting (ECCV 2016), summarize the general solutions to the continual learning problem (feature extraction, parameter fine-tuning and joint training) and, combining the advantages of each, propose an algorithm named "Learning without Forgetting". The authors describe a general continual image recognition scenario: a multi-task model is trained, where θ_s denotes the network trunk shared by all tasks, θ_o denotes the parameters learned for the old tasks, and θ_n denotes the parameters randomly initialized for the new task. The goal is to jointly optimize these three groups of parameters so that they perform as well as possible on both the old and the new tasks.
Feature extraction completely preserves the old knowledge while optimizing the model's performance on the new task as far as possible. This approach is generally suitable for "continual learning" that proceeds from harder tasks to relatively simple ones.
Joint training requires training the old and new tasks together and therefore requires keeping the training data of the old tasks. In such a scenario the problem degenerates into an ordinary deep learning problem; in the later literature such methods are classified as "multi-task learning".
It can be seen that neither parameter fine-tuning nor feature extraction can balance performance on the new and old tasks at the same time, while joint training requires additional data. The motivation of Learning without Forgetting is to overcome both shortcomings at once, i.e. to account for the model's performance on both the new and the old tasks without keeping the old training data.
Another early line of work mimics the human behaviour of "reviewing" during ongoing learning, and can also be regarded as a variant of the naive solution above. Robins first proposed the use of "pseudo-patterns" in continual learning: input-output pairs randomly generated by the existing model are mixed with the training data of the new task, so that the model is "rehearsed" in the continual learning scenario.
Such algorithms require explicit acquisition of the training data of the old task or simulation of the old training data in an approximate way.
As generative neural networks became extensively studied, a fairly natural idea emerged: use a generative neural network in place of the "pseudo-patterns". For rehearsal methods based on "representative samples", a generative model means that the stored exemplars need no longer be limited by storage space; for regularization-based algorithms such as Learning without Forgetting, a generative neural network can likewise be used to reproduce the old-task training data.
Hanul Shin, Jung Kwon Lee, Jaehong Kim and Jiwon Kim (Continual Learning with Deep Generative Replay, NIPS 2017) first proposed that, in a continual learning setting, a generative model can be trained as a generator of pseudo-data carrying the old knowledge; the pseudo-data is mixed with the new data before the new task is trained, so that the model can learn the new task without forgetting the old knowledge. The authors validated this approach on some simple datasets (MNIST, SVHN, etc.), where it achieved results competitive with the "Learning without Forgetting" algorithm described above; moreover, unlike Learning without Forgetting, this approach requires no additional input during the model's prediction phase.
Such results demonstrate, on the one hand, the effectiveness of generative models for continual learning and, on the other hand, expose the drawback of depending too heavily on a perfect generative model. The authors acknowledge this deficiency and simply assume that the generative model is always "perfect". Clearly, known generative models for more complex, higher-dimensional data are still unable to generate "perfect pseudo-data", so this method alone cannot, in practice, meet the continual learning requirements of high-dimensional complex data.
Patent document CN110555417A (application number: CN201910843125.3) discloses a video image recognition system and method based on deep learning, the method includes the following steps: collecting video information and first picture information, and decomposing the video information into a plurality of continuous single-frame pictures to obtain second picture information; inputting the first picture information and/or the second picture information into a clustering model for clustering classification; determining a clustering center of each type of posture, and dividing each type of posture sample into subsets; optimizing the neural network model by using a training strategy of course learning according to the divided subsets; and receiving the information of the picture to be recognized, and recognizing the posture by using the optimized neural network model.
Disclosure of Invention
In view of the defects in the prior art, the invention aims to provide a knowledge distillation-fused generation playback frame type continuous image recognition system and method.
The knowledge distillation-fused generation playback frame type continuous image recognition system provided by the invention comprises a generator and a solver;
the generator comprises a feature generation model and an image generation model, and the feature generation model and the image generation model respectively correspond to different continuous image identification requirements and scenes;
the solver comprises an image style feature extractor and a feature classifier, wherein the image style feature extractor is responsible for extracting visual features from the image, and the feature classifier is responsible for classifying feature vectors.
Preferably, the feature generation model comprises a feature discriminator and a feature generator;
the feature discriminator performs feature discrimination and attention probability discrimination.
Preferably, the image generation model comprises an image discriminator and an image generator;
the image discriminator performs image discrimination and attention feature map discrimination.
Preferably, the image style feature extractor comprises a style sharing layer and a style specific layer;
the feature classifier is used to minimize cross entropy of class output and sample labels.
Preferably, according to the attention probability and the distillation, a corresponding importance predictor is attached to the feature map outputs of each layer:
α_l = softmax(W_l · f(x_l) + b_l)
wherein f(·) represents a transfer function such as flattening or global max/mean pooling; x_l represents the feature map output of the l-th layer of the feature extractor; W_l represents the weight parameter of the fully-connected layer of the l-th importance predictor; b_l represents the bias vector of the fully-connected layer of the l-th importance predictor; α_l is the predicted importance weight of the layer-l feature maps.
Preferably, the distillation comprises probabilistic distillation, the formula being:
L_D = D_1(x) + D_2(Z(x))
wherein D_1(x) represents the binary cross entropy for judging whether the input data is true or false; D_2(·) represents the binary cross entropy for judging whether the transformed data is true or false; Z(·) is the attention-combined probabilistic distillation in the solver of the continuous image recognition framework.
Preferably, the distillation further comprises attention-specific distillation, and the formula is:
L_D = D_1(x) + D_2(R(x))
wherein R(·) is the weighted average of the image's attention feature distillation in the feature extractor and the image discrimination, within the continuous image recognition framework;
the image discriminator discriminates the image and its attention map from the image feature extractor, respectively.
The invention provides a knowledge-based distillation generation playback frame type continuous image recognition method, which comprises the following steps:
Step 1: creating and randomly initializing a style-specific feature extraction layer, a feature classifier, a feature generator and an image generator corresponding to the new image style;
Step 2: freezing the shared layer of the image feature extractor and the specific feature extraction layers of the other styles, and training the new style-specific layer of the feature extractor together with the new feature classifier until convergence;
Step 3: generating old-style images and the corresponding style labels, and sampling the feature vectors of the old-style images;
Step 4: using the sampled false features and their labels together with the training data of the current task, performing joint training on the shared layer and the style-specific layers of the feature extractor;
Step 5: sampling feature maps of the new image style training data using the shared layer of the feature extractor;
Step 6: training an image generator of the new style using the feature map outputs and the new-style images;
Step 7: performing feature sampling on the training data of the new task using the feature extractor of the new image style, and obtaining the probability outputs of the sampled image features using the feature classifier;
Step 8: training a new image feature generator using the probability outputs of the feature classifier and the image features.
Compared with the prior art, the invention has the following beneficial effects:
the significance of the invention is that based on the generation of a playback framework, aiming at a specific continuous type deep learning problem: and image recognition, which provides a mixed generation strategy of features and images, and reduces the reliability requirement of generation playback on a generated model and the time-space overhead of training and sampling of the generated model. Using knowledge distillation techniques, a bridge is established between the generator and the solver, further enhancing the robustness of the generated model, as well as its quality of generation.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a schematic diagram of a technical architecture of the present invention;
FIG. 2 is a schematic diagram of the attention-combined probability and feature map distillation model;
FIG. 3 is a schematic diagram of a feature generation model architecture;
FIG. 4 is a schematic diagram of an image generation model structure;
FIG. 5 is a schematic diagram of an image feature extractor;
FIG. 6 is a feature generation flow diagram;
FIG. 7 is a flow chart of image generation;
FIG. 8 is an adversarial training flow chart.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that various changes and modifications, obvious to those skilled in the art, can be made without departing from the spirit of the invention, and all of them fall within the scope of the present invention.
Example:
As shown in fig. 1, the system of the present embodiment is divided into two main modules, a generator and a solver. The generator is divided into a "feature generation model" and an "image generation model", which correspond to different continuous image recognition requirements and scenes. The solver is divided into an "image style feature extractor" and a "feature classifier"; the image style feature extractor is responsible for extracting visual features from the image, and the feature classifier is responsible for classifying feature vectors.
The "feature generation model" is a typical countermeasure generation network, and its model structure is composed of two parts, i.e., a discriminator and a generator, as shown in fig. 6, which is a feature generation flowchart. The discriminator is divided into two parts of feature discrimination and attention probability discrimination. Attention probability discrimination requires training an attention mask model of the probability. The model consists of two fully-connected layers, the middle layer output dimension is 512, the ReLu activation function is used, the output layer dimension depends on the maximum number of classes allowed to be identified, and the Softmax activation function is used. Attention probability discrimination the top-down attention probability prediction output from the "solver" with the middle tier dimension of 512, activated using the ReLu function, and activated using the Softmax function. The input of the feature discrimination is 2048-dimensional extracted image feature vectors, the 2 middle layer dimensions are 512 and 256, and the output layer is in binary classification. The middle layer uses the ReLu activation function, each layer sets Dropout with a probability of 0.5 to prevent overfitting. The output layer is activated using the Softmax function. The generator inputs a 128-dimensional random vector, the middle layer design and the middle layer design of the discriminator are symmetrically designed, the dimensions of 2 middle layers are 256 and 512 respectively, and LeakyReLu (alpha is 0.2) is used for activation. The output is a 2048-dimensional vector, activated using the Tanh function.
The "image generation model" is a typical convolution countermeasure generation network, and the structure is composed of two parts, namely a discriminator and a generator, as shown in fig. 7, and is an image generation flow chart. The image discriminator is divided into image discrimination and attention feature map discrimination. The input of the image discrimination is a tensor of 3 w h, which is subjected to normalization processing, of the original image. The design of the middle layer is basically symmetrical to that of the generator middle layer, 5 convolution layers are used, and the number of output channels is 64, 128, 256, 512 and 1 respectively. After each convolution layer, a batch normalization layer with e ═ 1e-5 and a LeakyReLu activation function with α ═ 0.2 are used to accelerate convergence. The output layer activates the function using softmax. Attention feature distillation requires additional training of the mask model of the feature map. The mask model consists of two fully-connected layers, the output dimension of the middle layer is 512, the ReLu activation function is used, the output layer dimension depends on the number of channels of the corresponding feature layer, and the Softmax activation function is used. The input of the attention feature map is an average value of the feature maps of the original image on the last layer of the feature extractor after being weighted by a mask model, and the dimension of the average value is 1 xw h. The intermediate layer uses 3 convolutional layers, and the number of output channels is 64, 128, and 1, respectively. After each convolution layer, a batch normalization layer with e ═ 1e-5 and a LeakyReLu activation function with α ═ 0.2 are used to accelerate convergence. The output layer activates the function using softmax.
The image style feature extractor comprises the "style sharing layers" (shared layers) and the "style specific layers". The style sharing layers follow the structural design of the backbone of ResNet (deep residual network). ResNet is one of the most influential image recognition infrastructures of recent years; its emergence largely broke through the depth bottleneck of convolutional neural networks. Weighing functionality against performance, one of its variants, ResNet-50, is selected for this system. The "style specific layers" use a convolution layer with kernel size 3 × 3 and 2048 output channels, followed by batch normalization and ReLU activation. The convolution layer is followed by a global mean pooling layer that fixes the output dimension to 1 × 2048, activated with Tanh.
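A possible PyTorch sketch of this extractor, assuming the torchvision ResNet-50 implementation for the shared trunk; the style-specific head follows the description (3 × 3 convolution, 2048 channels, batch normalization, ReLU, global mean pooling, Tanh), while the padding and the way the per-style heads are stored are assumptions:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class StyleSpecificLayer(nn.Module):
    """Style-specific head: 3x3 conv to 2048 channels, BN + ReLU, global mean pool, Tanh -> 1 x 2048."""
    def __init__(self, in_ch=2048, out_ch=2048):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, fmap):
        z = self.pool(self.conv(fmap)).flatten(1)   # (N, 2048)
        return torch.tanh(z)

class StyleFeatureExtractor(nn.Module):
    """Shared ResNet-50 trunk plus one 'multi-head' style-specific layer per image style."""
    def __init__(self, num_styles=1):
        super().__init__()
        trunk = resnet50(weights=None)
        self.shared = nn.Sequential(*list(trunk.children())[:-2])   # drop ResNet's avgpool and fc
        self.style_heads = nn.ModuleList(StyleSpecificLayer() for _ in range(num_styles))

    def forward(self, x, style_id=0):
        return self.style_heads[style_id](self.shared(x))

feats = StyleFeatureExtractor()(torch.randn(2, 3, 224, 224))   # (2, 2048) style-specific features
```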
The "feature classifier" is used to minimize the cross entropy of the class output and the sample label. The classifier first reduces the input signal into a vector of (2048,). Followed by a class probability output layer whose dimensionality is determined by the maximum number of classes of the output class. In the present system, the default is 1000. The probability output and the mask tensor are subjected to point multiplication and activated by a softmax function, and the activation process of the mask tensor is controlled by meta information saved by a system.
Combining the attention-based probability and feature map distillation process shown in fig. 2, a corresponding importance predictor is attached to the feature map outputs of each layer:
α_l = softmax(W_l · f(x_l) + b_l)
wherein f(·) represents a transfer function such as flattening or global max/mean pooling; x_l represents the feature map output of the l-th layer of the feature extractor; W_l represents the weight parameter of the fully-connected layer of the l-th importance predictor; b_l represents the bias vector of the fully-connected layer of the l-th importance predictor; α_l is the predicted importance weight of the layer-l feature maps.
This is equivalent to training a small two-layer fully-connected neural network for the feature maps of each layer. As in conventional attention techniques, the softmax activation forces the network to dynamically predict the importance of the feature maps and to suppress some unimportant feature outputs.
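As an illustration, such a per-layer importance predictor might look as follows in PyTorch; the hidden width of 512 and the softmax output follow the description, while the choice of global mean pooling for f(·) and the channel-wise re-weighting are assumptions:

```python
import torch
import torch.nn as nn

class ImportancePredictor(nn.Module):
    """Small two-layer fully-connected network that predicts a soft importance weight per
    feature-map channel and re-weights the layer output accordingly."""
    def __init__(self, num_channels, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_channels, hidden), nn.ReLU(),
            nn.Linear(hidden, num_channels), nn.Softmax(dim=1),
        )

    def forward(self, x_l):
        pooled = x_l.mean(dim=(2, 3))          # f(.): global mean pooling of the layer-l feature map
        alpha = self.net(pooled)               # softmax weights suppress unimportant channels
        return x_l * alpha[:, :, None, None]   # re-weighted feature map

weighted = ImportancePredictor(256)(torch.randn(2, 256, 14, 14))
```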
As shown in fig. 3 and 4, the adversarial generation network based on knowledge distillation can be understood simply as expanding the input space of the discriminator. Specifically, the discriminator's input is expanded to the vector space obtained after the training data and the fake data have been transformed. When training this knowledge-distillation discriminator, the transformed vectors of true and false data produced by the target model can be directly labeled as true or false, without designing an additional discrimination criterion, so that the discriminator learns the discrimination criterion of knowledge distillation. In addition, such a design avoids searching for the distillation hyper-parameter λ used in probabilistic distillation and in feature distillation.
Inspired by feature and attention knowledge distillation, besides probabilistic distillation, feature distillation and attention distillation can also be fused into the adversarial learning framework of the discriminator and generator; fig. 8 shows the adversarial training flow chart. In practice, a combination of feature and attention distillation is used in the image generator, whereas traditional probabilistic distillation is used in the feature generator.
A. Probability distillation:
the improved discriminator of the feature generator has a training target composed of two parts, which can be formalized as:
L_D = D_1(x) + D_2(Z(x))
D_1(x) represents the binary cross entropy for judging whether the input data (features, images) is true or false; D_2(·) represents the binary cross entropy for judging whether the transformed data (probability outputs, feature maps) is true or false; Z(·) is the attention-combined probabilistic distillation in the solver of the continuous image recognition framework, plus a weighted average of the original features. The discriminator discriminates, respectively, the original data space and the high-level abstract sample space obtained after transformation in the solver.
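The two-part discriminator target can be sketched as follows, assuming PyTorch and treating the solver transformation Z(·) as a frozen callable; the function name, the use of binary cross entropy with logits and the equal weighting of the two terms are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def distillation_discriminator_loss(d1, d2, solver_z, real_feat, fake_feat):
    """D1 judges raw features as true/false, D2 judges the solver-transformed (probability)
    outputs as true/false; because real and fake samples pass through the same transformation,
    no separate distillation weight lambda is needed. d1/d2 return one logit per sample."""
    ones = torch.ones(real_feat.size(0), 1)
    zeros = torch.zeros(fake_feat.size(0), 1)
    loss_raw = (F.binary_cross_entropy_with_logits(d1(real_feat), ones)
                + F.binary_cross_entropy_with_logits(d1(fake_feat), zeros))
    with torch.no_grad():                      # solver parameters receive no gradient here
        z_real, z_fake = solver_z(real_feat), solver_z(fake_feat)
    loss_distill = (F.binary_cross_entropy_with_logits(d2(z_real), ones)
                    + F.binary_cross_entropy_with_logits(d2(z_fake), zeros))
    return loss_raw + loss_distill
```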
B. Attention feature distillation:
the improved discriminator of the image generator has a training target which is also composed of two similar parts and can be formalized as follows:
L_D = D_1(x) + D_2(R(x))
wherein R(·) is the weighted average of the image's attention feature distillation in the feature extractor and the image discrimination, within the continuous image recognition framework. The discriminator discriminates, respectively, the image and its attention map from the image feature extractor.
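A small sketch of how the 1 × w × h attention map judged by this second branch could be computed from the last shared-layer feature map, i.e. a mask-weighted mean over channels as described for the image generation model above (names are illustrative):

```python
import torch

def attention_feature_map(fmap, channel_weights):
    """fmap: (N, C, h, w) last-layer feature map; channel_weights: (N, C) softmax output of the
    mask model. Returns the (N, 1, h, w) mask-weighted channel mean passed to the discriminator."""
    weighted = fmap * channel_weights[:, :, None, None]
    return weighted.mean(dim=1, keepdim=True)

amap = attention_feature_map(torch.randn(2, 2048, 7, 7),
                             torch.softmax(torch.randn(2, 2048), dim=1))   # (2, 1, 7, 7)
```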
As shown in fig. 5, the "solver" is divided into an "image feature extractor" and an "image feature classifier"; the former is responsible for extracting visual features from the image, and the latter for classifying feature vectors. In view of practical continual learning requirements, the output layer of the "image feature extractor" is a "multi-headed" style-specific layer, and in the inference phase the model requires the style of the input image to be known.
The detailed flow of the algorithm is as follows:
the parameters of the persistence type image recognition system are marked as a 5-tuple (theta) shr ,θ style ,cls,G img ,G feat ). Wherein theta is shr For the shared layer of feature extractors in the current version "solver", θ style For the style-specific layer, cls is the current version feature classifier. G img /G feat A current version of the image generator set and a feature generator set, respectively.
A new image recognition task T n Is a 3-tuple data (x) i ,y i ,s i ) The distribution of (a) represents a specific image, a category label and a corresponding image style label, respectively.
The inputs to the algorithm are:
1. the 5-tuple parameters of the current system (θ_shr, θ_style, cls, G_img, G_feat);
2. A hyper-parameter of the system;
3. the data of the new recognition task, (x_i, y_i, s_i) ∈ D_n;
where D_n denotes the training data of the n-th task.
The algorithm starts:
1. Classify the new task data according to the style labels: split the new task T_n into m subtasks t_1, t_2, ..., t_m so that each subtask contains only a single style label;
2. For each subtask t_j that does not contain a new image style, perform the following loop:
A) using an image feature extractor with a corresponding style to generate a feature vector corresponding to the image:
z = θ_style(θ_shr(X))
B) sampling false image feature data simulating old task training data from random noise using a corresponding style image feature generator:
z′ = G_feat(ε), ε ~ N(0, 1)
and N represents a normal distribution.
C) Using the true and false image features, retrain the feature classifier:
cls_j ← argmin H(cls_j(z ∪ z′), y ∪ y′)
where H denotes the cross-entropy loss.
D) Using the feature classifier of the corresponding style, mix the true and false image features, perform classification inference, and output a probability vector:
Z = cls_j(z ∪ z′)
E) Using (a weighted average of) the probability outputs Z and the image features, retrain the feature generation model:
min_G max_D [ log D_1(x) + log D_2(z) + log(1 - D_1(G(ε))) + log(1 - D_2(cls_j(G(ε)))) ]
where x represents the training data (features); z represents the transformed data (probability outputs); G(ε) represents the generator output.
3. For each subtask t_j that contains a new image style, perform the following loop:
A. Create and randomly initialize a style-specific feature extraction layer corresponding to the new image style, a feature classifier, a feature generator and an image generator:
θ_style_new, cls_new, G_feat_new, G_img_new ← random initialization
B. Freeze the shared layer of the image feature extractor and the specific feature extraction layers of the other styles, and jointly train the new style-specific layer of the feature extractor and the new feature classifier until convergence:
(θ_style_new, cls_new) ← argmin H(cls_new(θ_style_new(θ_shr(X))), Y)
C. Generating an old style image (and a corresponding style label), and sampling the feature vector of the old style image:
X′ = G_img(ε), ε ~ N(0, 1)
f′ = θ_style(θ_shr(X′))
D. Using the sampled false features as targets together with the training data of the current task, jointly train the shared layer of the feature extractor and all style-specific layers:
(θ_shr, θ_style) ← argmin [ H(cls(θ_style(θ_shr(X))), Y) + || θ_style(θ_shr(X′)) - f′ ||^2 ]
E. sampling feature maps of new image style training data using a shared layer of feature extractors (in the present system, only the last layer of feature maps of the feature extractor shared layer is used):
Z = θ_shr(X)
F. using (the weighted average of) the feature map output and the new style image, the new style image generator is trained:
min_G max_D [ log D_1(X) + log D_2(R(X)) + log(1 - D_1(G_img(ε))) + log(1 - D_2(R(G_img(ε)))) ]
G. Sample features from the training data of the new task using the feature extractor of the new image style, and use the feature classifier to obtain probability outputs for the sampled image features:
X = θ_style(θ_shr(X))
Z = cls_j(X)
H. using (a weighted average of) the probability outputs of the feature classifier and the image features, a new image feature generator is trained:
min_G max_D [ log D_1(X) + log D_2(Z) + log(1 - D_1(G_feat(ε))) + log(1 - D_2(cls_j(G_feat(ε)))) ]
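For orientation, the per-task flow above can be condensed into a Python-style outline. Every method on the hypothetical `system` object below (for example `train_new_head` or `generate_old_images`) is an assumed helper standing in for the corresponding lettered step; none of these names or signatures come from the patent:

```python
import torch

def continual_update(system, new_task):
    """One continual-learning update of the 5-tuple (theta_shr, theta_style, cls, G_img, G_feat),
    following the two loops described above; `system` bundles the models and training helpers."""
    for style_id, subtask in new_task.split_by_style():                     # split T_n by style label
        if style_id in system.known_styles():
            # Old style: replay fake features from G_feat and retrain classifier + feature generator.
            z_real = system.extract_features(subtask.images, style_id)      # step A
            z_fake = system.G_feat[style_id](torch.randn(len(z_real), 128)) # step B
            system.retrain_classifier(style_id, z_real, z_fake, subtask.labels)   # steps C-D
            system.retrain_feature_generator(style_id, z_real)              # step E
        else:
            # New style: steps A-H (create heads, freeze the shared trunk, replay old-style
            # images, joint training, then train the new image and feature generators).
            system.add_style(style_id)                                      # step A
            system.train_new_head(subtask, style_id, freeze_shared=True)    # step B
            old_imgs, old_styles = system.generate_old_images()             # step C
            system.joint_train(subtask, old_imgs, old_styles)               # step D
            fmaps = system.shared_feature_maps(subtask.images)              # step E
            system.train_image_generator(style_id, fmaps, subtask.images)   # step F
            feats, probs = system.sample_new_features(subtask, style_id)    # step G
            system.train_feature_generator(style_id, feats, probs)          # step H
    return system
```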
those skilled in the art will appreciate that, in addition to implementing the systems, apparatus, and various modules thereof provided by the present invention in purely computer readable program code, the same procedures can be implemented entirely by logically programming method steps such that the systems, apparatus, and various modules thereof are provided in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system, the device and the modules thereof provided by the present invention can be considered as a hardware component, and the modules included in the system, the device and the modules thereof for implementing various programs can also be considered as structures in the hardware component; modules for performing various functions may also be considered to be both software programs for performing the methods and structures within hardware components.
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims (1)

1. A knowledge distillation-fused generation playback frame type continuous image recognition method, characterized by comprising a generator and a solver;
the generator comprises a feature generation model and an image generation model, and the feature generation model and the image generation model respectively correspond to different continuous image identification requirements and scenes;
the solver comprises a feature extractor and a feature classifier, wherein the feature extractor is responsible for extracting visual features from the image, and the feature classifier is responsible for classifying feature vectors;
the feature generation model comprises a feature discriminator and a feature generator;
the feature discriminator carries out feature discrimination and attention probability discrimination;
the image generation model comprises an image discriminator and an image generator;
the image discriminator carries out image discrimination and attention feature map discrimination;
the feature extractor comprises a style sharing layer and a style specific layer;
the feature classifier to minimize cross entropy of class output and sample labels;
according to the attention probability and the distillation, a corresponding importance predictor is attached to the feature map outputs of each layer:
α_l = softmax(W_l · f(x_l) + b_l)
wherein f(·) represents a flattening or global max/mean pooling transfer function; x_l represents the feature map output of the l-th layer of the feature extractor; W_l represents the weight parameter of the fully-connected layer of the l-th importance predictor; b_l represents the bias vector of the fully-connected layer of the l-th importance predictor; α_l is the predicted importance weight of the layer-l feature maps;
distillation includes probabilistic distillation, with the formula:
L_D = D_1(x) + D_2(Z(x))
wherein D_1(x) represents the binary cross entropy for judging whether the input data is true or false; D_2(·) represents the binary cross entropy for judging whether the transformed data is true or false; Z(·) is the attention-combined probabilistic distillation in the solver of the continuous image recognition framework;
the distillation also includes attention feature distillation, with the formula:
L_D = D_1(x) + D_2(R(x))
wherein R(·) is the weighted average of the result of the image's attention feature distillation in the feature extractor and the image discrimination, within the continuous image recognition framework;
the image discriminator discriminates the image and its attention map from the image feature extractor, respectively;
the knowledge distillation-fused generation playback frame type continuous image recognition method comprises the following steps:
Step 1: creating and randomly initializing a feature extractor, a feature classifier, a feature generator and an image generator corresponding to the new image style;
Step 2: freezing the shared layer of the feature extractor and the other feature extractors, and training the style-specific layer of the new feature extractor together with the new feature classifier until convergence;
Step 3: generating old-style images and the corresponding style labels, and sampling the feature vectors of the old-style images;
Step 4: using the sampled false features and their labels together with the training data of the current task, performing joint training on the shared layer and the style-specific layers of the feature extractor;
Step 5: sampling feature maps of the new image style training data using the shared layer of the feature extractor;
Step 6: training an image generator of the new style using the weighted average of the feature map outputs and the new-style images;
Step 7: performing feature sampling on the training data of the new task using the feature extractor of the new image style, and obtaining the probability outputs of the sampled image features using the feature classifier;
Step 8: training a new image feature generator using the probability outputs of the feature classifier and the image features.
CN202110003856.4A 2021-01-04 2021-01-04 Knowledge distillation-fused generation playback frame type continuous image recognition system and method Active CN112686275B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110003856.4A CN112686275B (en) 2021-01-04 2021-01-04 Knowledge distillation-fused generation playback frame type continuous image recognition system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110003856.4A CN112686275B (en) 2021-01-04 2021-01-04 Knowledge distillation-fused generation playback frame type continuous image recognition system and method

Publications (2)

Publication Number Publication Date
CN112686275A CN112686275A (en) 2021-04-20
CN112686275B (en) 2022-09-20

Family

ID=75457107

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110003856.4A Active CN112686275B (en) 2021-01-04 2021-01-04 Knowledge distillation-fused generation playback frame type continuous image recognition system and method

Country Status (1)

Country Link
CN (1) CN112686275B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101950377A (en) * 2009-07-10 2011-01-19 索尼公司 The new method of novel Markov sequence maker and generation Markov sequence
CN111144490A (en) * 2019-12-26 2020-05-12 南京邮电大学 Fine granularity identification method based on alternative knowledge distillation strategy
CN111539786A (en) * 2020-04-15 2020-08-14 清华大学 Conditional attention network and application method and device thereof in personalized recommendation
CN111598144A (en) * 2020-04-27 2020-08-28 腾讯科技(深圳)有限公司 Training method and device of image recognition model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11410029B2 (en) * 2018-01-02 2022-08-09 International Business Machines Corporation Soft label generation for knowledge distillation
CN110110576B (en) * 2019-01-03 2021-03-09 北京航空航天大学 Traffic scene thermal infrared semantic generation method based on twin semantic network
CN111488917A (en) * 2020-03-19 2020-08-04 天津大学 Garbage image fine-grained classification method based on incremental learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101950377A (en) * 2009-07-10 2011-01-19 索尼公司 The new method of novel Markov sequence maker and generation Markov sequence
CN111144490A (en) * 2019-12-26 2020-05-12 南京邮电大学 Fine granularity identification method based on alternative knowledge distillation strategy
CN111539786A (en) * 2020-04-15 2020-08-14 清华大学 Conditional attention network and application method and device thereof in personalized recommendation
CN111598144A (en) * 2020-04-27 2020-08-28 腾讯科技(深圳)有限公司 Training method and device of image recognition model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"Generative Feature Replay For Class-Incremental Learning";Xialei Liu etal.;《CVPR》;20201231;第1-10页 *

Also Published As

Publication number Publication date
CN112686275A (en) 2021-04-20

Similar Documents

Publication Publication Date Title
Siqueira et al. Efficient facial feature learning with wide ensemble-based convolutional neural networks
He et al. Remote sensing scene classification using multilayer stacked covariance pooling
Zhou et al. Interpretable basis decomposition for visual explanation
Wang et al. Towards realistic predictors
Rahmon et al. Motion U-Net: Multi-cue encoder-decoder network for motion segmentation
CN110276248B (en) Facial expression recognition method based on sample weight distribution and deep learning
US20220156528A1 (en) Distance-based boundary aware semantic segmentation
Tang et al. Towards uncovering the intrinsic data structures for unsupervised domain adaptation using structurally regularized deep clustering
JP2022014776A (en) Activity detection device, activity detection system, and activity detection method
Papaioannidis et al. Fast CNN-based single-person 2D human pose estimation for autonomous systems
Balasubramanian et al. Open-set recognition based on the combination of deep learning and ensemble method for detecting unknown traffic scenarios
Ke et al. Spatial, structural and temporal feature learning for human interaction prediction
CN112686275B (en) Knowledge distillation-fused generation playback frame type continuous image recognition system and method
Piekniewski et al. Unsupervised learning from continuous video in a scalable predictive recurrent network
Mukabe et al. Object detection and classification using machine learning techniques: a comparison of haar cascades and neural networks
Ren et al. Robust visual tracking based on scale invariance and deep learning
Abudhagir et al. Faster rcnn for face detection on a facenet model
Liu et al. Action prediction network with auxiliary observation ratio regression
Angeleas et al. Neural networks as classification mechanisms of complex human activities
Fu et al. Forgery face detection via adaptive learning from multiple experts
Lesort. Apprentissage continu: S'attaquer à l'oubli foudroyant des réseaux de neurones profonds grâce aux méthodes à rejeu de données [Continual learning: tackling catastrophic forgetting in deep neural networks with data replay methods]
Sun et al. Learn to adapt for self-supervised monocular depth estimation
Cordea et al. DynFace: a multi-label, dynamic-margin-softmax face recognition model
Kalirajan et al. Deep learning for moving object detection and tracking
Johnson et al. Using Ensemble Convolutional Neural Network to Detect Deepfakes Using Periocular Data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant