CN116030077B - Video salient region detection method based on multi-dataset collaborative learning - Google Patents

Video salient region detection method based on multi-dataset collaborative learning

Info

Publication number
CN116030077B
CN116030077B
Authority
CN
China
Prior art keywords
data set
dataset
specific
data
domain
Prior art date
Legal status
Active
Application number
CN202310314307.8A
Other languages
Chinese (zh)
Other versions
CN116030077A (en)
Inventor
张云佐
张天
郑宇鑫
武存宇
刘亚猛
于璞泽
康伟丽
朱鹏飞
王双双
Current Assignee
Shijiazhuang Tiedao University
Original Assignee
Shijiazhuang Tiedao University
Priority date
Filing date
Publication date
Application filed by Shijiazhuang Tiedao University
Priority to CN202310314307.8A
Publication of CN116030077A
Application granted
Publication of CN116030077B

Classifications

    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a video salient region detection method based on multi-dataset collaborative learning. The method comprises the following steps: acquiring a plurality of video saliency data sets with different distributions; constructing a multi-dataset collaboration network, in which a dataset-specific unit models the statistical characteristics of each data set and a dataset adversarial module drives the network to learn common saliency features, the two components working together to alleviate the distribution differences between the data sets; and providing corresponding multi-dataset training and testing modes for different application scenarios, with a composite batch training mechanism adopted to optimize the collaborative learning process. Unlike the common single-dataset and fine-tuning training modes, the method uses the information of multiple data sets to improve both the detection accuracy of video salient regions and the generalization of the model to out-of-domain data.

Description

Video salient region detection method based on multi-dataset collaborative learning
Technical Field
The invention relates to the technical field of image communication methods, in particular to a video salient region detection method based on multi-dataset collaborative learning.
Background
Video salient region detection is one of the basic tasks in video processing and computer vision, and an important preprocessing step in perceptual video coding. It aims to simulate the human visual attention system by predicting how much attention people pay to each region of a video during free viewing, expressed in the form of a saliency map. In perceptual video coding, the salient regions of the video are captured first and more bit resources are then allocated to them, so that the salient regions stay sharp while the non-salient regions are moderately distorted; this reduces the video bit rate without changing subjective visual perception, improves the compression ratio, and in turn reduces storage space and relieves the bandwidth burden of video communication.
With the development of deep learning, the field of video salient region detection has advanced greatly, but most video saliency detection models are trained in a single-dataset or fine-tuning manner. Because the amount of data in a single data set is limited, detection accuracy on it is approaching saturation, and models trained this way lack sufficient generalization ability, which hinders their application in real life. Training with multiple data sets expands the amount of training data and seems to solve this problem, but there is usually a distribution bias between data sets, and a model trained directly on multiple data sets often performs worse than one trained on a single data set or with fine-tuning. It follows that decoupling the distribution differences between data sets while modeling the common, saliency-relevant features is the key to performing multi-dataset training effectively.
Disclosure of Invention
In order to solve the problems in the prior art, the invention provides a video salient region detection method based on multi-dataset collaborative learning.
A video salient region detection method based on multi-dataset collaborative learning is characterized by comprising the following steps:
s1: acquiring a plurality of video saliency data sets with labels, wherein samples and label distributions of the plurality of data sets are different;
s2: a multi-dataset collaboration network is constructed, and a saliency map of the input video is acquired by utilizing information of the multi-dataset. The network consists of an encoder of a 3D convolution backbone network, a characteristic fusion module, a data set specific unit and a data set countermeasure moduleAnd a decoder. Wherein the data set specific unit comprises a data set specific batch normalization operation, a data set specific gaussian prior graph and a data set specific gaussian smoothing filter for modeling statistical properties of each data set; the data set countermeasure module is used for judging the data set label of the input sample, and generating classification loss
Figure SMS_1
The commonality of the salient features of the network learning is promoted in the form of the countermeasure learning; the data set specific unit and the data set countermeasure module work cooperatively, so that the statistical characteristics and the remarkable commonalities of a plurality of data sets can be modeled, and the problem of distribution difference among the plurality of data sets is relieved together;
s3: aiming at an intra-domain scene, training and testing are carried out in a general mode; training and testing a target domain without a label in a domain self-adaptive mode; training and testing an unknown target domain in a domain generalization mode; and a composite batch training mechanism is adopted to assist the multi-dataset collaborative network training.
In a further aspect, the dataset-specific unit sets a corresponding branch for each data set and automatically switches to activate that branch according to the dataset label of the input, thereby modeling the exclusive characteristics of each data set. Its specific forms are the dataset-specific batch normalization operation, the dataset-specific Gaussian prior map and the dataset-specific Gaussian smoothing filter. Because the batch-normalization parameter distributions differ across data sets, the dataset-specific batch normalization operation learns a separate batch-normalization mean and variance for each data set during training. Because the Gaussian priors differ between data sets, the dataset-specific Gaussian prior map builds a distinct two-dimensional Gaussian prior map for each data set to model its center-fixation bias. Because saliency-map sharpness differs between data sets, a Gaussian smoothing filter with learnable, dataset-specific parameters is employed to eliminate this bias.
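The patent provides no source code; the following minimal PyTorch sketch (module and argument names are assumptions of this rewrite, not taken from the patent) illustrates how per-dataset batch-normalization branches can be switched by a dataset label:

```python
import torch
import torch.nn as nn

class DatasetSpecificBN3d(nn.Module):
    """One BatchNorm3d branch per data set; the branch is selected by the
    dataset label of the incoming batch, so each data set keeps its own
    normalization statistics while all other network weights stay shared."""

    def __init__(self, num_features: int, num_datasets: int):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.BatchNorm3d(num_features) for _ in range(num_datasets)])

    def forward(self, x: torch.Tensor, dataset_id: int) -> torch.Tensor:
        # Assumes every sample in the batch comes from a single data set,
        # which holds under the composite batch mechanism described below.
        return self.branches[dataset_id](x)

# Usage: two data sets share convolution weights but not BN statistics.
bn = DatasetSpecificBN3d(num_features=64, num_datasets=2)
clip = torch.randn(4, 64, 8, 28, 28)   # (batch, channels, frames, H, W)
out = bn(clip, dataset_id=0)           # activates the first data set's branch
```

The Gaussian prior map and Gaussian smoothing filter branches switch in the same label-driven way.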
In a further aspect, the dataset adversarial module consists of a gradient reversal layer and a dataset classifier. The dataset classifier consists of a convolutional layer and a fully connected layer and predicts the data set to which the input video belongs; its loss function L_cls is the multi-class cross-entropy loss. The gradient reversal layer applies no numerical transformation in forward propagation but automatically reverses the gradient direction in backward propagation.
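As a hedged illustration only (the patent specifies the structure but not the code), a gradient reversal layer and a one-convolution-one-fully-connected dataset classifier can be sketched in PyTorch as follows; the intermediate channel width and the reversal strength `lam` are assumptions:

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the incoming gradient by
    -lam in the backward pass, so that minimizing the classification loss
    pushes the shared features toward dataset-indistinguishability."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class DatasetClassifier(nn.Module):
    """A convolutional layer plus a fully connected layer predicting which
    data set a clip came from; cross-entropy over the logits gives L_cls."""

    def __init__(self, in_channels: int, num_datasets: int, lam: float = 1.0):
        super().__init__()
        self.lam = lam
        self.conv = nn.Conv3d(in_channels, 32, kernel_size=1)
        self.fc = nn.Linear(32, num_datasets)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        feat = GradReverse.apply(feat, self.lam)
        feat = self.conv(feat).mean(dim=(2, 3, 4))  # global average pooling
        return self.fc(feat)                        # dataset logits
```

Because of the reversal, the classifier learns to minimize L_cls while the shared encoder is driven to maximize it, which is the adversarial behavior described above.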
A further technical solution is that the generic mode aims to learn a unified model using the information of multiple data sets so as to improve the model's performance on each data set. In the training phase, batches of each data set are propagated forward, and the saliency prediction loss L_sal and the dataset classification loss L_cls are propagated backward. In the detection phase, the dataset adversarial module is not used, and the dataset-specific unit branch corresponding to the label of the input data set is selected.
A further technical solution is that the domain-adaptive mode aims to improve performance on an unlabeled target domain. In the training phase, batches from each source-domain data set and from one unlabeled target domain are propagated forward; for each source-domain data set, both the saliency prediction loss L_sal and the dataset classification loss L_cls are computed and back-propagated, while for the target domain only the dataset classification loss L_cls is computed and back-propagated. In the test phase, the dataset adversarial module is not used: for source-domain data sets, the corresponding dataset-specific unit branch is selected according to the data set's label, while for target-domain data, the source-domain data set with the largest amount of data is taken as its dataset label to determine the corresponding branch.
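The loss routing in this mode can be sketched as below. Everything here is an assumed interface rather than the patent's code: `model(clips, dataset_id)` is taken to return a saliency map and dataset logits, `saliency_loss` is the combined loss defined later, and treating the target domain as one extra classifier class is an implementation assumption:

```python
import torch
import torch.nn.functional as F

def domain_adaptive_step(model, saliency_loss, source_batches, target_clips):
    """source_batches: list of (clips, ground_truth) pairs, one per labeled
    source data set; target_clips: unlabeled target-domain clips."""
    loss = torch.zeros(())
    for ds_id, (clips, gt) in enumerate(source_batches):
        sal, logits = model(clips, dataset_id=ds_id)
        labels = torch.full((clips.size(0),), ds_id, dtype=torch.long)
        # Source domains: saliency supervision plus adversarial loss.
        loss = loss + saliency_loss(sal, gt) + F.cross_entropy(logits, labels)

    # Target domain: no saliency labels, so only the dataset classification
    # loss is computed and back-propagated. Which dataset-specific branch
    # the target uses during training is an implementation choice; here it
    # reuses branch 0 (cf. the test phase, which uses the largest source set).
    _, logits = model(target_clips, dataset_id=0)
    tgt_label = len(source_batches)        # target treated as an extra class
    labels = torch.full((target_clips.size(0),), tgt_label, dtype=torch.long)
    loss = loss + F.cross_entropy(logits, labels)
    return loss
```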
A further technical solution is that the domain-generalization mode aims to learn a generalizable model from multiple source-domain data sets without using any target-domain data. Because the target domain is unavailable, the training phase is identical to the generic mode and the test phase is identical to the domain-adaptive mode.
A further technical solution is that the composite batch training mechanism promotes collaborative optimization of the training process and avoids the batch jitter caused by switching between data sets. The mechanism assembles the batches from each data set into composite batches in proportion to the video counts of the source-domain data sets; the loss of each dataset batch is computed separately during forward propagation, and backward propagation to update the gradients is performed only once the losses of all dataset batches have been computed.
The beneficial effects of the above technical solution are as follows: the solution breaks the constraints of the traditional single-dataset and fine-tuning training modes and proposes a multi-dataset collaborative learning paradigm for video salient region detection. By building a unified model from the information of multiple data sets, it improves the detection accuracy of salient regions, markedly improves the model's generalization to out-of-domain data, and is therefore better suited to real-world applications.
Drawings
The invention will be described in further detail with reference to the drawings and the detailed description.
Fig. 1 is a flow chart of a video salient region detection method based on multi-dataset collaborative learning in a first embodiment of the present invention;
fig. 2 is an overall structure diagram of a video salient region detection method based on multi-dataset collaborative learning in a first embodiment of the present invention;
FIG. 3 is a network detail schematic diagram of a video salient region detection method based on multi-dataset collaborative learning in a first embodiment of the present invention;
fig. 4 (a) is a schematic structural diagram of the spatial attention-guided fusion module in the first embodiment of the present invention, and fig. 4 (b) is a schematic structural diagram of the channel attention-guided fusion module;
FIG. 5 is a schematic diagram of a data set specific unit according to a first embodiment of the present invention;
fig. 6 is a flow chart of a composite batch training method in accordance with the first embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
Example 1
The embodiment of the invention provides a video salient region detection method based on multi-dataset collaborative learning, the flow chart of which is shown in fig. 1; the method comprises the following steps:
s1: acquiring a plurality of video saliency data sets with labels, wherein samples and label distributions of the plurality of data sets are different;
s2: a multi-dataset collaboration network is constructed, and a saliency map of the input video is acquired by utilizing information of the multi-dataset. The network consists of an encoder, a feature fusion module, a dataset specific unit, a dataset challenge module and a decoder of a 3D convolutional backbone network, as shown in fig. 2. Wherein the data set specific unit comprises a data set specific batch normalization operation, a data set specific gaussian prior graph and a data set specific gaussian smoothing filter for modeling statistical properties of each data set; the data set countermeasure module is used for judging the data set label of the input sample, and generating classification loss
Figure SMS_8
To combat learningThe form promotes the commonality of the salient features of the network learning; the data set specific unit and the data set countermeasure module work cooperatively, so that the statistical characteristics and the remarkable commonalities of a plurality of data sets can be modeled, and the problem of distribution difference among the plurality of data sets is relieved together;
s3: aiming at an intra-domain scene, training and testing are carried out in a general mode; training and testing a target domain without a label in a domain self-adaptive mode; training and testing an unknown target domain in a domain generalization mode; and a composite batch training mechanism is adopted to assist the multi-dataset collaborative network training.
The present invention provides a preferred embodiment for performing S1. Four commonly used video saliency data sets are employed: DHF1K, Hollywood-2, UCF-Sports and LEDOV. DHF1K is a large video fixation database covering a wide range of content types; its 1000 videos are divided into a training set, a validation set and a test set of 600, 100 and 300 videos, respectively. Hollywood-2 consists of 1707 videos from Hollywood movies, with 823 videos for training and 884 for testing. UCF-Sports is a data set of sports videos, of which 103 are used for training and 47 for testing. LEDOV collects videos from accessible public sources, including advertisements, documentaries and the like, and consists of 44 training samples, 20 validation samples and 20 test samples. The sample and label distributions of the four differ significantly.
The present invention provides a preferred embodiment to perform S2, the network architecture of which is shown in fig. 3. The whole multi-dataset collaboration network is divided into the following parts:
a 3D convolutional backbone network encoder. The encoder employs an S3D network as a backbone network that is pre-trained on a Kinetics dataset, which can continually extract multi-scale spatio-temporal features from an input frame sequence in a sliding window fashion.
A feature fusion module. The body of the module is a bidirectional spatio-temporal feature pyramid fused by an attention-guided fusion mechanism. The bidirectional spatio-temporal feature pyramid adds a bottom-up path to the top-down path of TSFP-Net. Within this framework, the semantic information of deep features propagates along the top-down path and the positional information of shallow features propagates along the bottom-up path, so the multi-scale spatio-temporal features can be fully fused into the contextual information required for accurate prediction. Furthermore, the framework uses an attention-guided fusion mechanism to fuse neighboring features instead of simply concatenating or summing them; the mechanism learns the fusion weights automatically and adapts them to different scenes. Its specific instantiations are the spatial attention-guided fusion module (SAGF) and the channel attention-guided fusion module (CAGF), whose operation is shown in fig. 4. Through this module, the multi-scale spatio-temporal features are further enhanced.
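Fig. 4 is not reproduced here, so the following PyTorch sketch shows only one plausible form of the spatial variant (SAGF), in which a small convolution predicts per-position blending weights for two adjacent pyramid features; the exact design in the patent may differ:

```python
import torch
import torch.nn as nn

class SpatialAttentionGuidedFusion(nn.Module):
    """Blends two aligned feature maps with a learned per-position weight
    instead of simply concatenating or summing them."""

    def __init__(self, channels: int):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv3d(2 * channels, 1, kernel_size=3, padding=1),
            nn.Sigmoid())

    def forward(self, top_down: torch.Tensor, bottom_up: torch.Tensor):
        w = self.attn(torch.cat([top_down, bottom_up], dim=1))  # (B,1,T,H,W)
        return w * top_down + (1.0 - w) * bottom_up
```

A channel-attention variant (CAGF) would instead pool over space and time and predict one weight per channel.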
The dataset-specific unit. This unit is configured to model the statistical characteristics of each data set while letting the remaining, shared part of the network learn a common saliency representation. The dataset-specific unit sets a corresponding branch for each data set and automatically switches to activate that branch according to the input data set; its structure is shown in fig. 5. Its specific forms are: the dataset-specific batch normalization operation, the dataset-specific Gaussian prior map and the dataset-specific Gaussian smoothing filter. The dataset-specific batch normalization operation learns, through training, a separate batch-normalization mean and variance for each data set, addressing the differing batch-normalization parameter distributions across data sets; since the parameters of the S3D backbone come from the pre-trained model, this embodiment instead equips each multi-scale branch with its own decoder-side dataset-specific batch normalization to generate features of the same resolution. Addressing the differences in Gaussian priors between data sets, this embodiment adds a group of dataset-specific Gaussian prior maps and concatenates them with the fused features to model the center-fixation bias; the prior consists of two-dimensional Gaussian maps whose combination parameters are learnable. Addressing the differences in saliency-map sharpness between data sets, this embodiment adds a dataset-specific Gaussian smoothing filter with learnable Gaussian blur parameters before generating the final saliency map.
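As an illustration of the dataset-specific Gaussian prior, the sketch below mixes a fixed bank of two-dimensional Gaussian maps with learnable per-dataset weights; the map size, the number of basis maps and their standard deviations are assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

class DatasetGaussianPrior(nn.Module):
    """A bank of fixed 2-D Gaussian maps is combined by per-dataset
    learnable weights; the result is concatenated to the fused features
    to model each data set's center-fixation bias."""

    def __init__(self, num_datasets: int, size: int = 28, num_maps: int = 4):
        super().__init__()
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, size),
                                torch.linspace(-1, 1, size), indexing="ij")
        sigmas = [0.2, 0.4, 0.6, 0.8][:num_maps]
        bank = torch.stack(
            [torch.exp(-(xs ** 2 + ys ** 2) / (2 * s ** 2)) for s in sigmas])
        self.register_buffer("bank", bank)               # (num_maps, H, W)
        self.mix = nn.Parameter(torch.ones(num_datasets, num_maps) / num_maps)

    def forward(self, dataset_id: int) -> torch.Tensor:
        w = torch.softmax(self.mix[dataset_id], dim=0)
        return (w[:, None, None] * self.bank).sum(dim=0)  # (H, W) prior map
```

The dataset-specific smoothing filter can be treated analogously: a depthwise Gaussian kernel whose blur parameter is a per-dataset learnable value, applied just before the final output.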
The dataset adversarial module. The module consists of a gradient reversal layer and a dataset classifier. The dataset classifier consists of a convolutional layer and a fully connected layer and predicts the data set to which the input video belongs; its loss function L_cls is the multi-class cross-entropy loss. The gradient reversal layer applies no numerical transformation in forward propagation but automatically reverses the gradient direction in backward propagation.
The decoder. The decoder consists of four 3D convolutional layers and two upsampling layers. During decoding, the fused multi-scale features are aggregated along the temporal and channel dimensions and upsampled to the resolution of the original frames, and a sigmoid function then generates the final saliency map.
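The channel widths below are illustrative assumptions; the patent fixes only the structure (four 3D convolutions, two upsampling layers, sigmoid output):

```python
import torch.nn as nn

# Hedged sketch of the decoder described above; channel sizes are assumed.
decoder = nn.Sequential(
    nn.Conv3d(192, 96, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear", align_corners=False),
    nn.Conv3d(96, 48, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Upsample(scale_factor=(1, 2, 2), mode="trilinear", align_corners=False),
    nn.Conv3d(48, 24, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv3d(24, 1, kernel_size=1),
    nn.Sigmoid(),
)
```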
The present invention provides a preferred embodiment for performing S3: for in-domain scenarios, training and testing in the generic mode; for an unlabeled target domain, training and testing in the domain-adaptive mode; for an unknown target domain, training and testing in the domain-generalization mode; and adopting a composite batch training mechanism to assist the training of the multi-dataset collaboration network.
The generic mode aims to learn a unified model using the information of multiple data sets so as to improve the model's performance on each data set. In the training phase, batches of each data set are propagated forward, and the saliency prediction loss L_sal and the dataset classification loss L_cls are propagated backward. In the detection phase, the dataset adversarial module is not used, and the dataset-specific unit branch corresponding to the label of the input data set is selected. The saliency prediction loss is
L_sal = L_KL - λ1·L_CC - λ2·L_NSS, with λ1 = 0.5 and λ2 = 0.1,
where L_KL, L_CC and L_NSS denote the KL divergence loss, the linear correlation coefficient and the regularized scanpath saliency (NSS) loss, respectively.
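A sketch of this combined loss in PyTorch follows; the tensor layout and normalizations are assumptions of this rewrite, with signs chosen so that higher CC and NSS lower the loss, matching the formula above:

```python
import torch

def saliency_loss(pred, gt_map, fixations, l1=0.5, l2=0.1, eps=1e-7):
    """pred, gt_map: (B, H, W) saliency maps; fixations: binary (B, H, W)
    fixation mask used by the NSS term."""
    # KL divergence between the distributions formed by the two maps.
    p = pred / (pred.sum(dim=(1, 2), keepdim=True) + eps)
    g = gt_map / (gt_map.sum(dim=(1, 2), keepdim=True) + eps)
    kl = (g * torch.log(g / (p + eps) + eps)).sum(dim=(1, 2)).mean()

    # Linear correlation coefficient between prediction and ground truth.
    pc = pred - pred.mean(dim=(1, 2), keepdim=True)
    gc = gt_map - gt_map.mean(dim=(1, 2), keepdim=True)
    cc = (pc * gc).sum(dim=(1, 2)) / (
        pc.pow(2).sum(dim=(1, 2)).sqrt() * gc.pow(2).sum(dim=(1, 2)).sqrt() + eps)

    # Normalized scanpath saliency: mean standardized value at fixations.
    z = pc / (pred.std(dim=(1, 2), keepdim=True) + eps)
    nss = (z * fixations).sum(dim=(1, 2)) / (fixations.sum(dim=(1, 2)) + eps)

    return kl - l1 * cc.mean() - l2 * nss.mean()
```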
The domain-adaptive mode aims to improve performance on an unlabeled target domain. In the training phase, batches from each source-domain data set and from one unlabeled target domain are propagated forward; for each source-domain data set, both the saliency prediction loss L_sal and the dataset classification loss L_cls are computed and back-propagated, while for the target domain only the dataset classification loss L_cls is computed and back-propagated. In the test phase, the dataset adversarial module is not used: for source-domain data sets, the corresponding dataset-specific unit branch is selected according to the data set's label, while for target-domain data, the source-domain data set with the largest amount of data is taken as its dataset label to determine the corresponding branch.
The domain-generalization mode aims to learn a generalizable model from multiple source-domain data sets without using any target-domain data. Because the target domain is unavailable, the training phase is identical to the generic mode and the test phase is identical to the domain-adaptive mode.
The composite batch training mechanism promotes collaborative optimization of the training process and avoids the batch jitter caused by switching between data sets. The mechanism assembles the batches from each data set into composite batches in proportion to the video counts of the source-domain data sets; the loss of each dataset batch is computed separately during forward propagation, and once the losses of all dataset batches have been computed, backward propagation is performed to update the gradients, as shown in the flowchart of fig. 6. According to the characteristics of the data sets selected in S1, this embodiment builds the composite batches from DHF1K, Hollywood-2, UCF-Sports and LEDOV data in the ratio 8:3:1:4.
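A minimal sketch of one composite-batch optimization step follows, using the assumed interfaces of the earlier sketches (the patent specifies the mechanism, not the code):

```python
import torch

def composite_batch_step(model, optimizer, batch_loss_fn, composite_batch):
    """composite_batch: one sub-batch per source data set, sized in the
    8:3:1:4 ratio of DHF1K, Hollywood-2, UCF-Sports and LEDOV used in
    this embodiment."""
    optimizer.zero_grad()
    total = torch.zeros(())
    for ds_id, (clips, gt) in enumerate(composite_batch):
        pred, logits = model(clips, dataset_id=ds_id)  # dataset-specific branch
        total = total + batch_loss_fn(pred, logits, gt, ds_id)
    total.backward()   # a single backward pass once every sub-batch is seen
    optimizer.step()
    return float(total)
```

Accumulating the losses and stepping once keeps the shared weights from oscillating between dataset-specific optima, which is exactly the batch jitter the mechanism is designed to avoid.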
To verify the effectiveness of the first embodiment above, the method of the present invention is compared with other state-of-the-art methods on the DHF1K, Hollywood-2 and UCF-Sports data sets and on LEDOV, using five commonly adopted metrics: AUC-Judd (AUC-J), Similarity Metric (SIM), shuffled AUC (s-AUC), linear correlation coefficient (CC) and normalized scanpath saliency (NSS). The larger these five metrics, the more accurate the detected salient regions. The experimental results are shown in tables 1 and 2.
Table 1 Comparison of prediction accuracy on three data sets
[table provided as an image in the original publication]
Table 2 Comparison of prediction accuracy on the LEDOV dataset
[table provided as an image in the original publication]
As can be seen from tables 1 and 2, this embodiment is ahead of existing methods on multiple metrics on every data set. Furthermore, comparing this embodiment against other training modes, as shown in table 3, its accuracy is far higher than that of the earlier training modes.
Table 3 NSS comparison results for multiple training modes
[table provided as an image in the original publication]
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the same, but rather, various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the scope of the present application.

Claims (6)

1. A video salient region detection method based on multi-dataset collaborative learning, characterized by comprising the following steps:
s1: acquiring a plurality of video saliency data sets with labels, wherein samples and label distributions of the plurality of data sets are different;
s2: constructing a multi-data set collaboration network, and acquiring a saliency map of an input video by utilizing information of the multi-data set; the network consists of an encoder, a characteristic fusion module, a data set specific unit, a data set countermeasure module and a decoder of the 3D convolution backbone network; wherein the data set specific unit comprises a data set specific batch normalization operation, a data set specific gaussian prior graph and a data set specific gaussian smoothing filter for modeling statistical properties of each data set; the data set batch normalization operation refers to learning a specific batch normalization mean and variance for each data set for the case that the batch normalization parameter distribution across the data is different; the specific Gaussian prior graph of the data sets is that a specific two-dimensional Gaussian prior graph is built for each data set aiming at the difference of the Gaussian prior graphs among the data sets, and the specific two-dimensional Gaussian prior graph is used for modeling the central fixation deviation of each data set; the data set specific Gaussian smoothing filter is used for eliminating the definition deviation by learning the Gaussian smoothing filter with specific parameters for each data set according to the difference of the definition of the salient graphs among the data sets; the data set countermeasure module is used for judging the data set label of the input sample, and generating classification loss
Figure QLYQS_1
The commonality of the salient features of the network learning is promoted in the form of the countermeasure learning; the data set specific unit and the data set countermeasure module work cooperatively to model the statistical characteristics and the remarkable commonalities of a plurality of data sets, and shareAnd simultaneously, the problem of distribution difference among a plurality of data sets is solved; the specific flow is as follows: firstly capturing multi-scale space-time characteristics through an encoder of a 3D convolution backbone network, fusing the multi-scale characteristics by adopting a characteristic fusion module, then transmitting the multi-scale characteristics into a convolution layer with a specific batch normalization operation of a data set to obtain characteristics with normalized deviation removed, wherein the characteristics are transmitted into a data set countermeasure module for countermeasure learning on one hand, and spliced with a specific Gaussian prior graph of the data set on the other hand, and finally output a significant graph is obtained through the encoder and a Gaussian smoothing filter of the data set;
s3: aiming at an intra-domain scene, training and testing are carried out in a general mode; training and testing a target domain without a label in a domain self-adaptive mode; training and testing an unknown target domain in a domain generalization mode; a compound batch training mechanism is adopted to assist the multi-data set collaborative network training; the composite batch training mechanism is that firstly, according to the video quantity proportion of a plurality of source domain data sets, batches from each data set are combined into composite batches, loss of each data set batch is calculated during forward transmission, and after the loss from all the data set batches is calculated, reverse transmission is carried out to update the gradient; the mechanism can promote collaborative optimization of the training process and avoid batch jitter caused by switching different data sets.
2. The video salient region detection method based on multi-dataset collaborative learning according to claim 1, wherein the dataset-specific unit sets a corresponding branch for each data set and automatically switches to activate that branch according to the dataset label of the input, thereby modeling dataset-specific features; its specific forms are the dataset-specific batch normalization operation, the dataset-specific Gaussian prior map and the dataset-specific Gaussian smoothing filter.
3. The video salient region detection method based on multi-dataset collaborative learning according to claim 1, wherein the dataset adversarial module consists of a gradient reversal layer and a dataset classifier; the dataset classifier consists of a convolutional layer and a fully connected layer and predicts the data set to which the input video belongs, its loss function L_cls being the multi-class cross-entropy loss; the gradient reversal layer applies no numerical transformation in forward propagation but automatically reverses the gradient direction in backward propagation.
4. The video salient region detection method based on multi-dataset collaborative learning according to claim 1, wherein the generic mode aims to learn a unified model using the information of multiple data sets so as to improve the model's performance on each data set; in the training phase, batches of each data set are propagated forward, and the saliency prediction loss L_sal and the dataset classification loss L_cls are propagated backward; in the detection phase, the dataset adversarial module is not used, and the dataset-specific unit branch corresponding to the label of the input data set is selected.
5. The video salient region detection method based on multi-dataset collaborative learning according to claim 1, wherein the domain-adaptive mode aims to improve performance on an unlabeled target domain; in the training phase, batches from each source-domain data set and from one unlabeled target domain are propagated forward; for each source-domain data set, both the saliency prediction loss L_sal and the dataset classification loss L_cls are computed and back-propagated, while for the target domain only the dataset classification loss L_cls is computed and back-propagated; in the test phase, the dataset adversarial module is not used: for source-domain data sets, the corresponding dataset-specific unit branch is selected according to the data set's label, while for target-domain data, the source-domain data set with the largest amount of data is taken as its dataset label to determine the corresponding branch.
6. The video salient region detection method based on multi-dataset collaborative learning according to claim 1, wherein the domain-generalization mode aims to learn a generalizable model from multiple source-domain data sets without using target-domain data; because the target domain is unavailable, the training phase is identical to the generic mode and the test phase is identical to the domain-adaptive mode.
CN202310314307.8A 2023-03-28 2023-03-28 Video salient region detection method based on multi-dataset collaborative learning Active CN116030077B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310314307.8A CN116030077B (en) 2023-03-28 2023-03-28 Video salient region detection method based on multi-dataset collaborative learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310314307.8A CN116030077B (en) 2023-03-28 2023-03-28 Video salient region detection method based on multi-dataset collaborative learning

Publications (2)

Publication Number Publication Date
CN116030077A CN116030077A (en) 2023-04-28
CN116030077B (en) 2023-06-06

Family

ID=86077919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310314307.8A Active CN116030077B (en) 2023-03-28 2023-03-28 Video salient region detection method based on multi-dataset collaborative learning

Country Status (1)

Country Link
CN (1) CN116030077B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116612122B (en) * 2023-07-20 2023-10-10 湖南快乐阳光互动娱乐传媒有限公司 Image significance region detection method and device, storage medium and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112884730A (en) * 2021-02-05 2021-06-01 南开大学 Collaborative significance object detection method and system based on collaborative learning
CN113705811A (en) * 2021-10-29 2021-11-26 腾讯科技(深圳)有限公司 Model training method, device, computer program product and equipment
CN113902783A (en) * 2021-11-19 2022-01-07 东北大学 Three-modal image fused saliency target detection system and method

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12020167B2 (en) * 2018-05-17 2024-06-25 Magic Leap, Inc. Gradient adversarial training of neural networks
US20220198339A1 (en) * 2020-12-23 2022-06-23 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for training machine learning model based on cross-domain data
CN114676785A (en) * 2022-04-08 2022-06-28 云从科技集团股份有限公司 Method, system, equipment and medium for generating target detection model
CN115035346A (en) * 2022-06-23 2022-09-09 温州大学 Classification method for Alzheimer disease based on cooperative learning method enhancement


Also Published As

Publication number Publication date
CN116030077A (en) 2023-04-28


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant