CN112307939B - Video frame enhancement method using position mask attention mechanism - Google Patents

Video frame enhancement method using position mask attention mechanism

Info

Publication number
CN112307939B
CN112307939B (application CN202011172682.6A)
Authority
CN
China
Prior art keywords
feature map
video frame
matrix
attention mechanism
mask
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011172682.6A
Other languages
Chinese (zh)
Other versions
CN112307939A (en)
Inventor
马汝辉
王超逸
宋涛
华扬
管海兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CN202011172682.6A priority Critical patent/CN112307939B/en
Publication of CN112307939A publication Critical patent/CN112307939A/en
Application granted granted Critical
Publication of CN112307939B publication Critical patent/CN112307939B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A video frame enhancement method using a position mask attention mechanism takes as input the feature maps of two adjacent video frames and uses position information to align the positions of the same pixel across the two frames, so that the information of the previous frame enhances the information content of the current frame. The method comprises two parts: position distance mask generation and position attention information fusion. Position distance mask generation uses the distances between pixel positions of the two adjacent frames to produce a mask matched to the size of the input feature map. Position attention information fusion uses the generated position distance mask to guide the original attention mechanism to assign larger weights to aligned pixels, producing an enhanced feature map that replaces the original feature map of the current frame for subsequent processing. The method is based on the attention mechanism, requires no additional training parameters, converges faster and yields better predictions than the original attention mechanism, and can be widely applied to various video tasks.

Description

Video frame enhancement method using position mask attention mechanism
Technical Field
The invention relates to the field of video processing in computer vision, and in particular to a method for enhancing the current frame in various video tasks by using an attention mechanism that encodes position information.
Background
The attention mechanism is one of the hot research topics in deep learning. Attention mechanisms and their variants have attracted wide interest in many fields and have driven substantial progress. Beyond natural language processing (NLP), many attention-based approaches have achieved great success in computer vision (CV), for example in object detection and instance segmentation.
In the video domain, attention mechanisms are often used to enhance the information of a frame. The feature maps of two frames, produced by a feature extractor, are taken as input; three different 1×1 convolutions convert the target-frame feature map into a query and the reference-frame feature map into a key and a value, and the attention mechanism then produces a new feature map of the same size as the original, which replaces the target-frame feature map for subsequent processing. During training, the attention mechanism learns the similarity between pixel positions of the two input frames and assigns larger weights to similar regions. The attention mechanism is therefore a general approach to problems such as occlusion and motion blur in various video tasks.
The original attention mechanism is position-insensitive: its output does not change when the input sequence is rearranged. Some position-sensitive tasks, however, carry positional prior knowledge. Video frame enhancement, for example, assumes by default that the pixel of the previous frame that aligns with a pixel of the current frame lies at approximately the same position; encoding position information into the original attention mechanism therefore allows such tasks to be modeled better.
Existing methods for encoding position information in the attention mechanism all rely on position embedding. Position embedding defines a set of independent trainable parameters that act on the relative position vectors, and the result is applied to the query-key similarity matrix inside the softmax operation. The position embedding method therefore requires additional parameters during training, which brings extra memory cost, slow convergence and high training variance. In addition, the input size must be fixed in advance so that the number of embedding parameters stays unchanged; even a small difference in input size makes the method unusable, which limits the transferability of the model.
Disclosure of Invention
In view of the above drawbacks of the prior art, the technical problem to be solved by the present invention is to design, on the basis of the original attention-based video frame enhancement method, a general video frame enhancement module that does not restrict the input size and can encode position information in the attention mechanism. The module takes the feature maps of two adjacent video frames as input and replaces the original feature map with its output, so it is a plug-and-play module applicable to various video tasks. Two technical difficulties need to be overcome:
(1) How to make the model focus more on the regions of relatively high importance in the video. Frames in a video differ in importance, and some regions within a frame are more important than others; performance improves if the model can concentrate on those regions.
(2) How to design a representation of position information that can be encoded without additional training. The original position embedding method learns position information with a fixed set of parameters, which forces a fixed input size, requires extra memory for storing the parameters, slows down convergence and increases the variance of the training results.
The invention adopts position distance mask generation and position attention information fusion. Position distance mask generation uses the Manhattan distance to build, for each pixel of the current-frame feature map, a pixel-distance matrix over the pixels of the previous-frame feature map, and these matrices are combined into the position distance mask. Position attention information fusion scales the generated mask with a learnable scale factor and multiplies it element-wise with the product of the embeddings of the two adjacent feature maps, so that position information is encoded in the attention mechanism and nearby positions receive larger weights in the resulting enhanced feature map, improving on the original attention mechanism.
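Written compactly, the fused attention can be summarized as follows; this is only a summary of the steps detailed below, and how the learnable scalar α maps distance to weight (for example its sign) is left to training, since the text only states that a trainable scalar is broadcast over the Manhattan-distance mask M:

$$
W=\operatorname{softmax}\Big(\tanh\big(QK^{\top}\big)\odot\sigma\big(\alpha M\big)\Big),\qquad \hat{F}=W\,V,
$$

where Q is the query embedded from the current frame, K and V are the key and value embedded from the previous frame, M_{ij} is the Manhattan distance between pixel positions i and j, σ is the sigmoid function, ⊙ denotes element-wise multiplication, the softmax is taken along the last dimension, and \hat{F} is reshaped back to the size of the original feature map.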
The method comprises the following steps:
Step 1, input a video frame and extract a feature map with a pre-trained convolutional neural network.
Step 2, obtain an enhanced feature map by using the feature map enhancement module.
Step 3, perform subsequent processing and prediction with the enhanced feature map.
Step 4, output the prediction result.
Before the feature map enhancement module is used, it needs to be trained. The training steps comprise:
Step 2.1, initialize the iteration counter;
Step 2.2, if the number of iterations is within N, continue; otherwise, end the training;
Step 2.3, input two adjacent frames of the video;
Step 2.4, extract feature maps with the feature extractor;
Step 2.5, embed the two feature maps into q, k and v respectively;
Step 2.6, process with the multi-head attention mechanism;
Step 2.7, compute the position distance mask;
Step 2.8, obtain the enhanced feature map to replace the original feature map, i.e. the feature map obtained in step 2.4, perform subsequent processing, and go to step 2.2.
Further, in step 2.4, video frame features are extracted by using a pre-trained convolutional neural network.
Preferably, in step 2.4, the extracted video frame features form a feature map with a smaller spatial size and more channels than the original image.
Preferably, in step 2.4, the feature extractor performs down-sampling, typically using ResNet, to obtain the feature map.
Preferably, in step 2.4, the number of feature map channels per frame is 1024.
Further, in step 2.5, q, k and v refer to the query, key and value, respectively. The feature map of the current frame is channel-compressed with a 1×1 convolution to serve as the query, and the feature map of the previous frame is channel-compressed with two different 1×1 convolutions to obtain the key and the value, respectively.
Further, in step 2.6, using the multi-head attention mechanism, the query, key and value are reshaped from tensors of size (batch, channel, height, width) to tensors of size (batch, group, height×width, sub_channel), which serve as the new query, key and value.
Further, in step 2.7, the new query is multiplied by the transpose of the key using matrix multiplication to obtain a relationship matrix, and the activation function is applied to the relationship matrix.
The height and width of the original feature map, i.e. the feature map obtained in step 2.4, are taken as input; the Manhattan distance between each pixel position and every other position is calculated, generating a matrix of size height×width at each position and thus height×width matrices in total. These matrices are reshaped and concatenated to obtain a position mask matrix of size (height×width, height×width); a trainable scalar scale is broadcast over it, and an activation function is applied.
Preferably, in step 2.7, tanh is used as the activation function to act on the relationship matrix, and sigmoid is used as the activation function to act on the position mask matrix.
Further, in step 2.8, element-wise multiplication of the activated relationship matrix and the position mask matrix yields a weight matrix. Softmax is applied to the weight matrix along the last dimension, the result is multiplied by the new value obtained in step 2.6 and reshaped to the same size as the original feature map, and the resulting enhanced feature map replaces that of the current frame for subsequent processing and training.
Compared with the prior art, the invention has the following beneficial effects:
(1) Based on prior knowledge in video frame alignment, the invention uses a heuristic method to generate a position distance mask matched with the input size, better models the position relation in the video and obtains better performance in various tasks needing video frame enhancement.
(2) Based on prior knowledge in video frame alignment, the invention uses a heuristic method to generate a position distance mask matched to the input size, which removes the fixed-input-size restriction of existing position embedding methods and makes it convenient to train and transfer the model across different input sizes.
(3) Based on prior knowledge in video frame alignment, the invention uses a heuristic method to generate a position distance mask matched to the input size, which requires no additional trainable parameters, reduces the memory burden of model training, and allows the model to converge to the optimal result more quickly.
Drawings
FIG. 1 is a functional block diagram of an embodiment of the present application;
FIG. 2 is a schematic of a training flow of an embodiment of the present application;
FIG. 3 is a schematic operational flow diagram of an embodiment of the present application.
Detailed Description
The preferred embodiments of the present application will be described below with reference to the accompanying drawings for clarity and understanding of the technical contents thereof. The present application may be embodied in many different forms of embodiments and the scope of the present application is not limited to only the embodiments set forth herein.
The conception, the specific structure and the technical effects will be further described in order to fully understand the objects, the features and the effects of the present invention, but the present invention is not limited thereto.
An embodiment of the invention
As shown in FIG. 1, the present embodiment provides two modules to implement the method of the present invention: a feature extractor and a video frame enhancement module.
The feature extractor includes a pre-trained convolutional neural network; its function is to receive an input video frame and to extract and output a feature map.
The function of the video frame enhancement module is to output the enhanced feature map through attention information enhancement and the position distance mask.
As shown in FIG. 3, the method for enhancing video frames using a position mask attention mechanism according to the present invention includes the following steps:
Step 1, input a video frame and extract a feature map with a pre-trained convolutional neural network.
Step 2, obtain an enhanced feature map by using the feature map enhancement module.
Step 3, perform subsequent processing and prediction with the enhanced feature map.
Step 4, output the prediction result.
Before the feature map enhancement module is used, it needs to be trained. As shown in FIG. 2, the training steps include:
Step 2.1, initialize the iteration counter;
Step 2.2, if the number of iterations is within N, continue; otherwise, end the training;
Step 2.3, input two adjacent frames of the video;
Step 2.4, extract feature maps with the feature extractor;
Step 2.5, embed the two feature maps into q, k and v respectively;
Step 2.6, process with the multi-head attention mechanism;
Step 2.7, compute the position distance mask;
Step 2.8, obtain the enhanced feature map to replace the original feature map, i.e. the feature map obtained in step 2.4, perform subsequent processing, and go to step 2.2.
In step 2.4, video frame features are extracted with the pre-trained convolutional neural network. The extracted features are usually a feature map with a smaller spatial size and more channels than the original image; the feature extractor typically uses ResNet to downsample and obtain the feature map, and the number of channels of each frame's feature map is 1024.
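As an illustration only, such a 1024-channel feature map can be obtained by truncating a pre-trained ResNet-50 after its third stage; the sketch below uses torchvision for convenience, which is an assumption and not something prescribed by the patent:

```python
import torch
import torchvision

# Minimal sketch of a feature extractor (assumption): a pre-trained ResNet-50
# truncated after layer3, whose output feature map has 1024 channels at stride 16.
backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
feature_extractor = torch.nn.Sequential(
    backbone.conv1, backbone.bn1, backbone.relu, backbone.maxpool,
    backbone.layer1, backbone.layer2, backbone.layer3,
)
feature_extractor.eval()

with torch.no_grad():
    frame = torch.randn(1, 3, 384, 640)   # one RGB video frame (batch of 1)
    feat = feature_extractor(frame)       # -> shape (1, 1024, 24, 40)
print(feat.shape)
```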
In step 2.5, q, k and v refer to the query, key and value, respectively. The feature map of the current frame is channel-compressed with a 1×1 convolution to serve as the query, and the feature map of the previous frame is channel-compressed with two different 1×1 convolutions to obtain the key and the value, respectively.
In step 2.6, using the multi-head attention mechanism, the query, key and value are reshaped from tensors of size (batch, channel, height, width) to tensors of size (batch, group, height×width, sub_channel), which serve as the new query, key and value.
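A minimal PyTorch sketch of steps 2.5 and 2.6; the compressed channel count and the number of heads (`groups`) are illustrative choices, not values fixed by the patent:

```python
import torch
import torch.nn as nn

channels, embed_channels, groups = 1024, 256, 8   # illustrative sizes (assumptions)

q_conv = nn.Conv2d(channels, embed_channels, kernel_size=1)  # current frame  -> query
k_conv = nn.Conv2d(channels, embed_channels, kernel_size=1)  # previous frame -> key
v_conv = nn.Conv2d(channels, embed_channels, kernel_size=1)  # previous frame -> value

def embed_and_split(cur_feat, prev_feat):
    """Step 2.5: 1x1-convolution channel compression; step 2.6: multi-head reshape."""
    b, _, h, w = cur_feat.shape
    def to_heads(x):
        # (batch, channel, height, width) -> (batch, group, height*width, sub_channel)
        return x.view(b, groups, embed_channels // groups, h * w).transpose(2, 3)
    q = to_heads(q_conv(cur_feat))
    k = to_heads(k_conv(prev_feat))
    v = to_heads(v_conv(prev_feat))
    return q, k, v
```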
In step 2.7, the new query is multiplied by the transpose of the key using matrix multiplication to obtain a relationship matrix, and tanh is applied to the relationship matrix as the activation function.
The height and width of the original feature map, i.e. the feature map obtained in step 2.4, are taken as input; the Manhattan distance between each pixel position and every other position is calculated, generating a matrix of size height×width at each position and thus height×width matrices in total. These matrices are reshaped and concatenated to obtain a position mask matrix of size (height×width, height×width); a trainable scalar scale is broadcast over it, and sigmoid is applied as the activation function.
It should be noted that in the above process, the mask matrix is calculated according to the size of the input feature map, and the position information parameter to be trained is only a scalar.
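A sketch of the position distance mask computation under these assumptions; the initial value and sign of the single trainable scalar `scale` are not specified in the text, so the negative initialization below (which makes nearby positions receive larger weights) is only an assumption:

```python
import torch
import torch.nn as nn

def manhattan_mask(height, width, device=None):
    """(height*width, height*width) matrix of Manhattan distances between pixel positions."""
    ys, xs = torch.meshgrid(torch.arange(height, device=device),
                            torch.arange(width, device=device), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=1).float()     # (H*W, 2)
    return (coords[:, None, :] - coords[None, :, :]).abs().sum(dim=-1)    # (H*W, H*W)

scale = nn.Parameter(torch.tensor(-0.1))  # the only trainable position parameter: a scalar

def position_mask(height, width, device=None):
    # Step 2.7: broadcast the trainable scalar over the distance matrix, then apply sigmoid.
    return torch.sigmoid(scale * manhattan_mask(height, width, device))
```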
In step 2.8, element-wise multiplication of the activated relationship matrix and the position mask matrix yields a weight matrix. Softmax is applied to the weight matrix along the last dimension, the result is multiplied by the new value obtained in step 2.6 and reshaped to the same size as the original feature map, and the resulting enhanced feature map replaces that of the current frame for subsequent processing and training.
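Combining the pieces above, the fusion of steps 2.7-2.8 can be sketched as follows; the helper names come from the earlier illustrative sketches, not from the patent, and the final 1×1 projection back to the original channel count is an assumption needed only because the value was channel-compressed in the sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

out_proj = nn.Conv2d(embed_channels, channels, kernel_size=1)  # assumed projection back to 1024 channels

def enhance(cur_feat, prev_feat):
    """Returns an enhanced feature map with the same shape as cur_feat."""
    b, c, h, w = cur_feat.shape
    q, k, v = embed_and_split(cur_feat, prev_feat)        # (b, g, h*w, sub) each
    relation = torch.tanh(q @ k.transpose(-1, -2))        # (b, g, h*w, h*w)
    mask = position_mask(h, w, cur_feat.device)           # (h*w, h*w), broadcast over b and g
    weight = F.softmax(relation * mask, dim=-1)           # element-wise fusion, then softmax
    out = weight @ v                                      # (b, g, h*w, sub)
    out = out.transpose(2, 3).reshape(b, -1, h, w)        # back to (b, embed_channels, h, w)
    return out_proj(out)                                  # (b, channels, h, w) replaces cur_feat
```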
The pseudocode of the main training procedure is given as an image in the original publication (figure BDA0002747781440000051) and is not reproduced here in text form.
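In place of that figure, a minimal sketch of the training loop of steps 2.1-2.8 is given below; the task head, loss, data loader and iteration budget N are placeholders, the feature extractor is assumed to stay frozen, and `feature_extractor`, `embed_and_split`, `position_mask` and `enhance` refer to the illustrative sketches above rather than to components defined by the patent:

```python
import torch

# Hypothetical task-specific pieces: a prediction head, a loss, and a data loader
# yielding (previous frame, current frame, target) triples.
task_head = torch.nn.Conv2d(1024, 21, kernel_size=1)   # placeholder prediction head
criterion = torch.nn.CrossEntropyLoss()
params = (list(q_conv.parameters()) + list(k_conv.parameters()) +
          list(v_conv.parameters()) + list(out_proj.parameters()) +
          [scale] + list(task_head.parameters()))
optimizer = torch.optim.SGD(params, lr=1e-3, momentum=0.9)

N = 10000                                               # step 2.2: iteration budget
for it, (prev_frame, cur_frame, target) in enumerate(loader):   # steps 2.1-2.3
    if it >= N:
        break
    with torch.no_grad():                               # step 2.4: frozen feature extractor
        prev_feat = feature_extractor(prev_frame)
        cur_feat = feature_extractor(cur_frame)
    enhanced = enhance(cur_feat, prev_feat)             # steps 2.5-2.8
    loss = criterion(task_head(enhanced), target)       # subsequent processing and prediction
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```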
The method exploits prior information in video feature enhancement and, through a heuristic approach, avoids a large number of training parameters, so that the model converges faster and performs markedly better than the conventional attention mechanism.
The foregoing is a detailed description of the preferred embodiments of the present application. It should be understood that numerous modifications and variations can be devised by those skilled in the art in light of the present teachings without departing from the inventive concept. Therefore, technical solutions obtainable by those skilled in the art through logical analysis, reasoning and limited experiments based on the concepts of the present application shall fall within the scope of protection defined by the claims.

Claims (7)

1. A method for video frame enhancement using a position mask attention mechanism, comprising the steps of:
step 1, inputting a video frame, and extracting a feature map through a pre-trained convolutional neural network;
step 2, obtaining an enhanced feature map by using a feature map enhancement module;
step 3, using the enhanced feature map to perform subsequent processing and prediction;
step 4, outputting a prediction result;
the step of training the feature map enhancement module comprises:
step 2.1, initializing iterative counting;
step 2.2, if the iteration times are within N times, continuing, otherwise, ending the training;
step 2.3, inputting two adjacent frames of the video;
step 2.4, extracting feature maps by using a feature extractor;
step 2.5, embedding the two feature maps into a query, a key and a value respectively;
step 2.6, processing by using a multi-head attention mechanism;
step 2.7, calculating a position distance mask;
step 2.8, obtaining the enhanced feature map, replacing the original feature map for subsequent processing, and going to step 2.2;
in the step 2.5, the feature map of the current frame is subjected to channel compression by a 1×1 convolution to serve as the query, and the feature map of the previous frame is subjected to channel compression by two different 1×1 convolutions to obtain the key and the value, respectively;
in the step 2.6, the query, the key and the value obtained in the step 2.5 are reshaped, by using the multi-head attention mechanism, from tensors of size (batch, channel, height, width) to tensors of size (batch, group, height×width, sub_channel), which serve as the new query, key and value;
in the step 2.7, multiplying the new query obtained in the step 2.6 by the transpose of the key by using matrix multiplication to obtain a relation matrix, and acting an activation function on the relation matrix;
inputting the height and width of the original feature map, calculating the Manhattan distance between each pixel position and every other position, and generating a matrix of size height×width at each position, so that height×width matrices are obtained in total; the matrices are reshaped and concatenated to obtain a position mask matrix of size (height×width, height×width), a trainable scalar scale is broadcast over it, and an activation function is applied.
2. The video frame enhancement method of claim 1, wherein in step 2.4, the feature extractor comprises a pre-trained convolutional neural network, and the video frame features are extracted by using the pre-trained convolutional neural network.
3. The method of claim 2, wherein in step 2.4, the extracted features of the video frame are a feature map of smaller size and more channels than the original image.
4. The video frame enhancement method of claim 2, wherein in step 2.4, the feature extractor performs downsampling using ResNet to obtain a feature map.
5. The video frame enhancement method of claim 2, wherein in step 2.4, the number of feature map channels per frame is 1024.
6. The video frame enhancement method of claim 1, wherein in step 2.7, tanh is used as an activation function to act on the relation matrix, and sigmoid is used as an activation function to act on the position mask matrix.
7. The video frame enhancement method according to claim 6, characterized in that in step 2.8, element-wise multiplication is performed on the activated relationship matrix and the position mask matrix to obtain a weight matrix; softmax is performed on the weight matrix along the last dimension, the obtained result is multiplied by the new value obtained in step 2.6 and reshaped to the same size as the original feature map, and the enhanced feature map thus obtained replaces that of the current frame to complete subsequent processing and training.
CN202011172682.6A 2020-10-28 2020-10-28 Video frame enhancement method using position mask attention mechanism Active CN112307939B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011172682.6A CN112307939B (en) 2020-10-28 2020-10-28 Video frame enhancement method using position mask attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011172682.6A CN112307939B (en) 2020-10-28 2020-10-28 Video frame enhancement method using position mask attention mechanism

Publications (2)

Publication Number Publication Date
CN112307939A CN112307939A (en) 2021-02-02
CN112307939B true CN112307939B (en) 2022-10-04

Family

ID=74331565

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011172682.6A Active CN112307939B (en) 2020-10-28 2020-10-28 Video frame enhancement method using position mask attention mechanism

Country Status (1)

Country Link
CN (1) CN112307939B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113205005B (en) * 2021-04-12 2022-07-19 武汉大学 Low-illumination low-resolution face image reconstruction method
CN113793393B (en) * 2021-09-28 2023-05-09 中国人民解放军国防科技大学 Unmanned vehicle multi-resolution video generation method and device based on attention mechanism
CN114913273B (en) * 2022-05-30 2024-07-09 大连理工大学 Animation video line manuscript coloring method based on deep learning
CN116665110B (en) * 2023-07-25 2023-11-10 上海蜜度信息技术有限公司 Video action recognition method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413838A (en) * 2019-07-15 2019-11-05 上海交通大学 A kind of unsupervised video frequency abstract model and its method for building up

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111079532B (en) * 2019-11-13 2021-07-13 杭州电子科技大学 Video content description method based on text self-encoder
CN111242837B (en) * 2020-01-03 2023-05-12 杭州电子科技大学 Face anonymity privacy protection method based on generation countermeasure network
CN111488474B (en) * 2020-03-21 2022-03-18 复旦大学 Fine-grained freehand sketch image retrieval method based on attention enhancement

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110413838A (en) * 2019-07-15 2019-11-05 上海交通大学 A kind of unsupervised video frequency abstract model and its method for building up

Also Published As

Publication number Publication date
CN112307939A (en) 2021-02-02

Similar Documents

Publication Publication Date Title
CN112307939B (en) Video frame enhancement method using position mask attention mechanism
Zhang et al. Context encoding for semantic segmentation
Tian et al. Designing and training of a dual CNN for image denoising
CN109344288B (en) Video description combining method based on multi-modal feature combining multi-layer attention mechanism
CN109543502B (en) Semantic segmentation method based on deep multi-scale neural network
CN109948475B (en) Human body action recognition method based on skeleton features and deep learning
CN110458085B (en) Video behavior identification method based on attention-enhanced three-dimensional space-time representation learning
CN112990116B (en) Behavior recognition device and method based on multi-attention mechanism fusion and storage medium
CN113313173B (en) Human body analysis method based on graph representation and improved transducer
CN111310766A (en) License plate identification method based on coding and decoding and two-dimensional attention mechanism
CN112801280A (en) One-dimensional convolution position coding method of visual depth self-adaptive neural network
CN113076957A (en) RGB-D image saliency target detection method based on cross-modal feature fusion
Sethy et al. Off-line Odia handwritten numeral recognition using neural network: a comparative analysis
CN114494701A (en) Semantic segmentation method and device based on graph structure neural network
CN107239827B (en) Spatial information learning method based on artificial neural network
CN116863194A (en) Foot ulcer image classification method, system, equipment and medium
CN115713632A (en) Feature extraction method and device based on multi-scale attention mechanism
Zhen et al. Toward compact transformers for end-to-end object detection with decomposed chain tensor structure
CN109063555B (en) Multi-pose face recognition method based on low-rank decomposition and sparse representation residual error comparison
Liu et al. Deep dual-stream network with scale context selection attention module for semantic segmentation
CN111242216A (en) Image generation method for generating anti-convolution neural network based on conditions
CN113627436B (en) Unsupervised segmentation method for surface-stamped character image
CN112560712B (en) Behavior recognition method, device and medium based on time enhancement graph convolutional network
CN113011163A (en) Compound text multi-classification method and system based on deep learning model
Zhou et al. Lightweight Self-Attention Network for Semantic Segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant