CN111488886B - Panoramic image significance prediction method, system and terminal for arranging attention features - Google Patents
- Publication number
- CN111488886B · CN202010171615A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention provides a panoramic image saliency prediction method based on arrangement attention features, comprising the following steps: extracting a template feature map and a channel-by-channel feature map, and multiplying the two to generate channel-by-channel features; performing attention feature arrangement on the generated channel-by-channel features; and selecting, according to the ranking result, the channel-by-channel features that are useful for fine-grained saliency prediction and inputting the selected features into a convolutional neural network for head gaze point prediction. The invention also provides a system and a terminal corresponding to the method. The invention not only better simulates the human visual attention mechanism but also achieves higher prediction accuracy.
Description
Technical Field
The invention relates to the technical field of image saliency prediction, in particular to a panoramic image saliency prediction method based on arrangement attention features, and especially to one based on partial attention features (foreground and background), channel-by-channel features, and arrangement attention features.
Background
In recent years, with the rapid development of mobile internet and advanced display technology, virtual reality (VR) has gradually entered people's lives and become widely used. Among its applications, the presentation of panoramic images and panoramic video through a head-mounted display (HMD) is particularly important. Unlike traditional images and videos, panoramic images and panoramic videos provide users with an immersive and interactive visual experience. Specifically, users can freely move their heads in the head-mounted display to view content within a 360°×180° field of view; in other words, people can freely turn their heads toward the regions of the panoramic image that most attract their visual attention. The head gaze point is therefore critical to exploring and modeling visual attention in panoramic images, and predicting the head gaze point in panoramic images is necessary.
Models for saliency prediction of head gaze points in panoramic images can be divided into two categories: the first comprises saliency prediction methods based on low-level feature extraction; the second comprises saliency prediction methods based on high-level semantic features extracted with deep learning. Representative of the first category is "GBVS360, BMS360, ProSal: Extending existing saliency prediction models from 2D to omnidirectional images," published by Lebreton et al. in 2018 in Signal Processing: Image Communication, which extends the two conventional saliency prediction methods BMS and GBVS into BMS360 and GBVS360 so that they apply to panoramic images.
In addition, there is "The prediction of head and eye movement for 360 degree images," published by Zhu et al. in 2018 in Signal Processing: Image Communication, which simulates a viewing window by projecting the panoramic image into multiple view blocks, extracts bottom-up and top-down features on those blocks, and finally fuses the extracted features to obtain a saliency map of the head gaze point. However, these methods are heuristic and their prediction accuracy is limited. The second category comprises saliency prediction methods based on deep learning; a well-performing current method is "SalGAN: Visual saliency prediction with adversarial networks," published by Pan et al. in 2018 in the CVPR Scene Understanding Workshop, which realizes saliency prediction by introducing adversarial examples and performing adversarial training. However, when CNN models of this kind perform saliency prediction on a panoramic image, not all features extracted by the CNN are useful for the final fine-grained saliency prediction; that is, feature redundancy exists, which may adversely affect the saliency prediction.
Disclosure of Invention
In view of the above shortcomings of the prior art, the present invention aims to provide a panoramic image saliency prediction method, system and terminal based on arrangement attention features, performing panoramic image saliency prediction based on partial attention features (foreground and background attention), channel-by-channel features and an arrangement attention feature model.
According to a first aspect of the present invention, there is provided a panoramic image saliency prediction method based on arrangement attention characteristics, comprising:
extracting a template feature map and a channel-by-channel feature map, and multiplying the template feature map and the channel-by-channel feature map to generate channel-by-channel features;
performing attention feature arrangement on the generated channel-by-channel features;
and selecting, according to the ranking result, the channel-by-channel features that are useful for fine-grained saliency prediction, and inputting the selected channel-by-channel features into a convolutional neural network for head gaze point prediction.
Optionally, the extracting the template feature map includes:
extracting a foreground attention map and a background attention map using a two-phase branched network based on a ResNet50 predictive network;
and carrying out weighted fusion on the obtained foreground attention force diagram and the obtained background attention force diagram to obtain a template feature diagram.
Optionally, the extracting the foreground attention map and the background attention map using the two-stage branch network of the ResNet 50-based predictive network includes:
the prediction in the first stage is performed as follows:

F1 = φ1(M1), B1 = ψ1(M1)

where F1 and B1 denote the predicted foreground attention map and background attention map, respectively, M1 is the feature map obtained through the ResNet50 prediction network, and φ1 and ψ1 denote two independent ResNet50 prediction networks;
in the second stage, the foreground attention map and background attention map generated in the first stage are enhanced, as follows:

F_att = φ2(M2), B_att = ψ2(M2)

where F_att and B_att denote the final predicted foreground attention map and background attention map, respectively, M2 is the feature map predicted in the second stage, and φ2 and ψ2 denote two independent second-stage ResNet50 prediction networks.
Optionally, the obtained foreground attention map and the obtained background attention map are subjected to weighted fusion to obtain a template feature map, which means that:
fusing the obtained foreground attention map and background attention map using a linear weighting method to obtain the template feature map.
Optionally, the extracting the channel-by-channel feature map includes:
the channel-by-channel feature map is extracted using the ResNet50-based prediction network and is the feature map output by the last layer of that network.
Optionally, performing attention feature arrangement on the generated channel-by-channel features includes:

arranging the channel-by-channel feature maps in descending order of their corresponding scores, wherein the larger the score of a channel-by-channel feature map, the more important that channel feature is for the final fine-grained saliency prediction.
Optionally, the attention feature arrangement of the generated channel-by-channel features is implemented as follows:

the importance of each channel-by-channel feature map is revealed by a ranking network that automatically learns ranking scores, where the ranking score is defined as:

r' = f_n(S') + f_max(S')

where f_n is a CNN-based network, f_max is a network comprising a channel-wise global max-pooling layer, S' denotes the channel-by-channel feature maps, and r' denotes the ranking scores;

and the channel-by-channel feature maps are arranged from the largest to the smallest ranking score.
Optionally, selecting the channel-by-channel features useful for fine-grained saliency prediction for feature enhancement comprises:

selecting features important for fine-grained saliency prediction according to the ranking scores of the channel-by-channel feature maps and experimental results, and discarding features with smaller ranking scores, namely redundant features;

and feeding the selected channel-by-channel features into a convolutional neural network to output the predicted saliency map.
According to a second aspect of the present invention, there is provided a panoramic image saliency prediction system based on arrangement attention characteristics, comprising:
the feature extraction module is used for extracting a template feature map and a channel-by-channel feature map, and multiplying the template feature map and the channel-by-channel feature map to generate channel-by-channel features;
an attention feature arrangement module for performing attention feature arrangement on the channel-by-channel features generated by the feature extraction module;

and a feature enhancement module for selecting, according to the ranking result of the attention feature arrangement module, the channel-by-channel features useful for fine-grained saliency prediction for feature enhancement, and inputting the selected channel-by-channel features into a convolutional neural network for head gaze point prediction.
According to a third aspect of the present invention, there is provided a panoramic image saliency prediction terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, said processor being operable to perform the above-described panoramic image saliency prediction method based on a permutation attention characteristic when said program is executed.
Compared with the prior art, the invention has the following beneficial effects:
according to the method, the system and the terminal, after the template feature map and the channel-by-channel feature map are extracted, the attention is arranged, the features which are useful for fine granularity saliency prediction are arranged and selected based on the score index, and the method can be used for obtaining high saliency prediction accuracy.
According to the method, the system and the terminal, partial attention (foreground and background attention) feature extraction and channel-by-channel feature extraction are organically integrated together, training is carried out in an end-to-end mode, a human visual attention mechanism can be well simulated, and meanwhile, high prediction accuracy can be obtained.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:
FIG. 1 is a flow chart of a conventional saliency prediction method;
FIG. 2 is a flowchart of a panoramic image saliency prediction method according to an embodiment of the present invention;
FIG. 3 is a block diagram of a panoramic image saliency prediction method according to a preferred embodiment of the present invention;
FIG. 4 is a diagram of an arrangement mechanism in arrangement attention according to a preferred embodiment of the present invention;
fig. 5 is a block diagram of a panoramic image saliency prediction system according to a preferred embodiment of the present invention.
Detailed Description
The following describes embodiments of the present invention in detail. The embodiments are implemented on the premise of the technical scheme of the invention, and detailed implementations and specific operation processes are given. It should be noted that those skilled in the art can make variations and modifications without departing from the spirit of the invention, all of which fall within the scope of the invention.
Fig. 1 is a flowchart of a conventional panoramic image saliency prediction method, and it can be seen from the figure that the conventional panoramic image saliency prediction method generally only includes panoramic image input, feature extraction, and prediction result. The problem with this conventional approach is that all features extracted by the CNN model are used for fine-grained saliency prediction, but not all features are effective for fine-grained saliency prediction, i.e. there is a feature redundancy, thus resulting in lower prediction accuracy.
Fig. 2 is a flowchart of a panoramic image saliency prediction method according to an embodiment of the present invention.
Referring to fig. 2, the panoramic image saliency prediction method based on arrangement attention features in this embodiment first extracts a template feature map and a channel-by-channel feature map and multiplies them to generate channel-by-channel features; the generated channel-by-channel features are then sent to the attention arrangement module for feature ranking; finally, the features useful for fine-grained saliency prediction are selected in the feature enhancement module and input into a convolutional neural network to predict the head gaze point. The method of this embodiment not only better simulates the human visual attention mechanism but also achieves higher prediction accuracy.
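As a concrete illustration of this flow, the sketch below mimics the three stages with NumPy, using random arrays in place of the ResNet50 outputs; all shapes, the scoring rule, and the `top_k` value are illustrative assumptions rather than details taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 8, 16, 32                  # channels / height / width (illustrative)

# Stand-ins for the two network outputs described in the text.
mask = rng.random((H, W))            # template feature map (fused fg/bg attention)
feats = rng.random((C, H, W))        # channel-by-channel feature map (last layer)

# Stage 1: multiply the template map into every channel.
channel_feats = feats * mask[None, :, :]

# Stage 2: score each channel (channel-wise global max pooling as a simple score).
scores = channel_feats.reshape(C, -1).max(axis=1)

# Stage 3: keep the highest-scoring channels, discard the redundant rest.
top_k = 4
keep = np.argsort(scores)[::-1][:top_k]
selected = channel_feats[keep]       # fed to the prediction CNN in the real model

print(selected.shape)                # (4, 16, 32)
```

In the actual model the scoring and the final prediction are learned networks; the sketch only shows how the data flows between the stages.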
Fig. 3 is a frame diagram of a panoramic image saliency prediction method according to a preferred embodiment of the present invention.
Referring to fig. 3, in the preferred embodiment, the panoramic image saliency prediction method based on the arrangement attention characteristics may be performed as follows:
S1: extracting the attention maps of the foreground and background parts using a two-stage branch network based on the ResNet50 prediction network, and performing weighted fusion of the obtained foreground and background attention maps to obtain a template feature map (mask);

S2: extracting a channel-by-channel feature map using the ResNet50 prediction network, the channel-by-channel feature map being the output of the last layer of the ResNet50 network;

S3: multiplying the obtained template feature map by the channel-by-channel feature map to generate channel-by-channel features; a proposed ranking mechanism then automatically learns a ranking score to reveal the importance of each feature map; finally, the features (expressed as tensors) produced by the channel-wise global max-pooling layer and the CNN-extracted spatial attention features are added element-wise to generate the ranking score corresponding to each feature map;

S4: features important for fine-grained saliency prediction are selected for feature enhancement, and redundant features are discarded; finally, the selected useful features are fed into a convolutional neural network to output the predicted saliency map.
In some preferred embodiments, S1 may be performed as follows:
S1.1: predicting the attention maps of the foreground and background portions using the two-stage branch network based on the ResNet50 prediction network, where the first-stage prediction is performed as follows:

F1 = φ1(M1), B1 = ψ1(M1)

where F1 and B1 denote the predicted foreground attention map and background attention map, respectively, M1 is the feature map obtained through the ResNet50 prediction network, and φ1 and ψ1 denote two independent ResNet50 prediction networks.
In the second stage, the foreground attention map and background attention map generated in the first stage are enhanced, as follows:

F_att = φ2(M2), B_att = ψ2(M2)

where F_att and B_att denote the final predicted foreground attention map and background attention map, respectively, M2 is the feature map predicted in the second stage, and φ2 and ψ2 denote two independent second-stage ResNet50 prediction networks.
S1.2: fusing the two obtained attention maps using a linear weighting method to obtain the template feature map (mask).
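A minimal sketch of this linear weighted fusion, assuming a single scalar weight `alpha` and an inverted background map so that high values mark salient regions in both terms (the patent specifies neither the weights nor any inversion):

```python
import numpy as np

def fuse_attention(fg, bg, alpha=0.5):
    """Linear weighted fusion of foreground/background attention maps.

    `alpha` and the inversion of the background map are illustrative
    assumptions; the real model's fusion weights are learned/chosen elsewhere.
    """
    mask = alpha * fg + (1.0 - alpha) * (1.0 - bg)
    # Normalise to [0, 1] so the mask can scale feature maps directly.
    mask -= mask.min()
    span = mask.max()
    return mask / span if span > 0 else mask

rng = np.random.default_rng(1)
fg = rng.random((16, 32))            # foreground attention map F_att (stand-in)
bg = rng.random((16, 32))            # background attention map B_att (stand-in)
mask = fuse_attention(fg, bg)
```

The normalisation step is a convenience so that the resulting mask multiplies feature maps without rescaling their dynamic range.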
In some preferred embodiments, S2 may be performed as follows:
S2.1: the feature map output by the last layer of the ResNet50 prediction network is adjusted using an upsampling operation and a dimensionality reduction operation, and the adjusted feature map is then sent to the attention arrangement module for feature arrangement.
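The adjustment in S2.1 can be sketched with nearest-neighbour upsampling and a 1×1-convolution-style channel reduction; the scale factor, output channel count, and random projection weights below are illustrative assumptions:

```python
import numpy as np

def upsample_nearest(x, scale):
    """Nearest-neighbour upsampling of a (C, H, W) tensor."""
    return x.repeat(scale, axis=1).repeat(scale, axis=2)

def reduce_channels(x, w):
    """A 1x1 convolution is a channel-mixing matmul: (C_out, C_in) @ (C_in, H*W)."""
    c, h, wd = x.shape
    return (w @ x.reshape(c, -1)).reshape(w.shape[0], h, wd)

rng = np.random.default_rng(2)
feat = rng.random((2048, 7, 7))      # ResNet50 last-layer shape for a 224x224 input
w = rng.standard_normal((256, 2048)) / np.sqrt(2048)   # stand-in for learned 1x1 weights
adjusted = reduce_channels(upsample_nearest(feat, 4), w)
print(adjusted.shape)                # (256, 28, 28)
```

In the real network both operations would be learned layers (e.g. bilinear upsampling plus a trained 1×1 convolution); the sketch only fixes the shapes.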
Referring to fig. 4, in a part of the preferred embodiment, S3 may be performed as follows:
S3.1: multiplying the obtained template feature map, as a mask, by the channel-by-channel feature map to obtain the channel-by-channel features;
S3.2: the importance of each channel-by-channel feature map is revealed by a ranking network that automatically learns ranking scores, where the ranking score is defined as:

r' = f_n(S') + f_max(S')

where f_n is a CNN-based network, f_max is a network comprising a channel-wise global max-pooling layer, S' denotes the channel-by-channel feature maps, and r' denotes the ranking scores.

S3.3: the channel-by-channel feature maps are arranged from the largest to the smallest ranking score.
In some preferred embodiments, S4 may be performed as follows:
S4.1: selecting features important for fine-grained saliency prediction according to the ranking scores of the channel-by-channel feature maps and experimental results, and discarding features with smaller ranking scores (redundant features).

S4.2: feeding the selected important features into a convolutional neural network and outputting the predicted saliency map.
Fig. 5 is a block diagram of a panoramic image saliency prediction system based on a permutation attention feature, which can be used to implement the above-described panoramic image saliency prediction method based on permutation attention feature, according to an embodiment of the present invention.
Referring to fig. 5, the panoramic image saliency prediction system based on arrangement attention features in this embodiment includes a feature extraction module, an attention feature arrangement module and a feature enhancement module, wherein: the feature extraction module extracts a template feature map and a channel-by-channel feature map and multiplies them to generate channel-by-channel features; the attention feature arrangement module performs attention feature arrangement on the channel-by-channel features generated by the feature extraction module; and the feature enhancement module selects, according to the ranking result of the attention feature arrangement module, the channel-by-channel features useful for fine-grained saliency prediction for feature enhancement, and inputs the selected channel-by-channel features into the convolutional neural network for head gaze point prediction.
In the above embodiment, the feature extraction module includes an attention feature extraction submodule and a channel-by-channel feature extraction submodule. The attention feature extraction submodule captures fine partial attention features (foreground and background regions) in the panoramic image, and performs weighted fusion of the generated foreground and background attention maps to obtain a template feature map (mask). The channel-by-channel feature extraction submodule extracts a channel-by-channel feature map using the ResNet50-based network, the channel-by-channel feature map being the output of the last layer of the ResNet50 network.
The specific implementation techniques of the arrangement attention feature module and the feature enhancement module are the same as those of the corresponding steps in the panoramic image saliency prediction method based on the arrangement attention feature, and are easy to be implemented by those skilled in the art, and are not repeated here.
Based on the above embodiments, in another embodiment of the present invention, there is provided a panoramic image saliency prediction terminal including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor being operable to perform any one of the above methods for predicting panoramic image saliency based on arrangement attention characteristics when executing the program. The method not only can better simulate the human visual attention mechanism, but also can obtain higher prediction accuracy.
Optionally, a memory is provided for storing a program. The memory may include volatile memory, such as random-access memory (RAM), e.g. static random-access memory (SRAM) or double data rate synchronous dynamic random-access memory (DDR SDRAM); the memory may also include non-volatile memory, such as flash memory. The memory is used to store computer programs (e.g., application programs and functional modules implementing the methods described above), computer instructions, etc., which may be stored in one or more memories in a partitioned manner, and the stored computer programs, computer instructions, data, etc. may be invoked by a processor.
A processor for executing the computer program stored in the memory to implement the steps in the method according to the above embodiment. Reference may be made in particular to the description of the embodiments of the method described above.
The processor and the memory may be separate structures or may be integrated structures that are integrated together. When the processor and the memory are separate structures, the memory and the processor may be connected by a bus coupling.
It should be noted that, the steps in the method provided by the present invention may be implemented by using corresponding units in the apparatus, etc., and those skilled in the art may refer to a technical solution of the apparatus to implement the step flow of the method, that is, the embodiment in the apparatus may be understood as a preferred example for implementing the method, which is not described herein.
It will be appreciated by those skilled in the art that the apparatus provided by the present invention and its various units may be implemented as logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc. by simply programming the logic of the method steps, except for implementing the apparatus provided by the present invention as pure computer readable program code. Therefore, the apparatus provided by the present invention may be regarded as a hardware component, and the units included therein for realizing various functions may also be regarded as structures within the hardware component; the means for achieving the various functions may also be considered as being either a software module for implementing the method or a structure within a hardware component.
The foregoing has been a description of specific embodiments of the invention. It is to be understood that the invention is not limited to the particular embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the claims without affecting the spirit of the invention. The above preferred features may be used alone in any of the embodiments, or in any combination without interfering with each other.
Claims (7)
1. A panoramic image saliency prediction method based on arrangement attention features, comprising:
extracting a template feature map and a channel-by-channel feature map, and multiplying the template feature map and the channel-by-channel feature map to generate channel-by-channel features;
performing attention feature arrangement on the generated channel-by-channel features;
selecting, according to the ranking result, the channel-by-channel features useful for fine-grained saliency prediction for feature enhancement, and inputting the selected channel-by-channel features into a convolutional neural network for head gaze point prediction;
the extracting the template feature map comprises the following steps:
extracting a foreground attention map and a background attention map using a two-phase branched network based on a ResNet50 predictive network;
carrying out weighted fusion on the obtained foreground attention force diagram and the background attention force diagram to obtain a template feature diagram;
the two-stage branch network extraction foreground attention and background attention profiles using a ResNet50 based predictive network, comprising:
the prediction in the first stage is performed as follows:

F1 = φ1(M1), B1 = ψ1(M1)

where F1 and B1 denote the predicted foreground attention map and background attention map, respectively, M1 is the feature map obtained through the ResNet50 prediction network, and φ1 and ψ1 denote two independent ResNet50 prediction networks;

in the second stage, the foreground attention map and background attention map generated in the first stage are enhanced, as follows:

F_att = φ2(M2), B_att = ψ2(M2)

where F_att and B_att denote the final predicted foreground attention map and background attention map, respectively, M2 is the feature map predicted by the ResNet50 network in the second stage, and φ2 and ψ2 denote two independent second-stage ResNet50 prediction networks;
the extracting the channel-by-channel feature map includes:
the channel-by-channel feature map is extracted using the ResNet50-based prediction network and is the feature map output by the last layer of that network.
2. The panoramic image saliency prediction method based on arrangement attention features according to claim 1, wherein the weighted fusion of the foreground attention map and the background attention map to obtain a template feature map means:

fusing the obtained foreground attention map and background attention map using a linear weighting method to obtain the template feature map.
3. The panoramic image saliency prediction method based on ranked attention features according to claim 1, wherein the performing attention feature ranking on the channel-by-channel features comprises:
ranking the channel-by-channel feature maps in descending order of their corresponding scores, wherein a larger score indicates that the channel feature is more important for the final fine-grained saliency prediction.
4. The panoramic image saliency prediction method based on ranked attention features according to claim 3, wherein the attention feature ranking of the generated channel-by-channel features is implemented as follows:
the importance of a channel-by-channel feature map is indicated by a ranking score learned automatically by the ranking network, and the formula for calculating the ranking score is defined as:
r' = f_n(S') + f_max(S')
wherein f_n is a CNN-based network, f_max is a network comprising a channel-by-channel global max pooling layer, S' represents a channel-by-channel feature map, and r' represents the ranking score;
and the channel-by-channel feature maps are arranged in descending order of the obtained ranking scores.
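The scoring rule r' = f_n(S') + f_max(S') and the descending arrangement can be sketched as follows. Note that f_n is a learned CNN in the patent; a fixed per-channel mean is substituted here only so the scoring-and-sorting pipeline can be demonstrated end to end.

```python
import numpy as np

rng = np.random.default_rng(2)
C, H, W = 6, 8, 8
S = rng.random((C, H, W))  # channel-by-channel feature maps S'

# f_max: channel-wise global max pooling, as stated in the claim.
f_max = S.reshape(C, -1).max(axis=1)

# f_n is a learned CNN-based network in the patent; a per-channel mean
# is used here as a stand-in, purely for illustration.
f_n = S.reshape(C, -1).mean(axis=1)

scores = f_n + f_max          # r' = f_n(S') + f_max(S')
order = np.argsort(-scores)   # indices for descending (large-to-small) order
ranked = S[order]             # channel maps arranged by ranking score
```

Negating the scores before `argsort` is the standard numpy idiom for a descending sort.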
5. The panoramic image saliency prediction method based on ranked attention features according to any one of claims 1 to 4, wherein the selecting of channel-by-channel features useful for fine-grained saliency prediction for feature enhancement comprises:
selecting, according to the ranking scores of the channel-by-channel feature maps and the experimental results, the features important for fine-grained saliency prediction, and discarding the features with smaller ranking scores, namely the redundant features;
and feeding the selected channel-by-channel features into a convolutional neural network to output the predicted saliency map.
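The selection-and-discard step above is a top-k cut over the ranking scores. In the sketch below the cut-off `k` is a hypothetical choice (the claim says it is set from scores and experimental effect), and the data are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
C, H, W = 6, 8, 8
features = rng.random((C, H, W))  # channel-by-channel features
scores = rng.random(C)            # ranking scores r' per channel

# Keep the k highest-scoring channels and discard the rest as redundant;
# k = 3 is illustrative -- the patent chooses it experimentally.
k = 3
keep = np.argsort(-scores)[:k]
selected = features[keep]

assert selected.shape == (k, H, W)
```

The `selected` tensor is what would then be fed to the prediction CNN to output the saliency map.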
6. A panoramic image saliency prediction system based on ranked attention features, comprising:
a feature extraction module for extracting a template feature map and a channel-by-channel feature map, and multiplying the template feature map and the channel-by-channel feature map to generate channel-by-channel features;
an attention feature ranking module for performing attention feature ranking on the channel-by-channel features generated by the feature extraction module;
a feature enhancement module for selecting, according to the ranking result of the attention feature ranking module, the channel-by-channel features useful for fine-grained saliency prediction for feature enhancement, and inputting the selected channel-by-channel features into a convolutional neural network for head fixation point prediction;
the extracting of the template feature map by the feature extraction module comprises:
extracting a foreground attention map and a background attention map using a two-stage branch network based on the ResNet50 prediction network;
carrying out weighted fusion on the obtained foreground attention map and background attention map to obtain the template feature map;
the extracting of the foreground attention map and the background attention map using the two-stage branch network based on the ResNet50 prediction network comprises:
the formula for the prediction in the first stage is as follows:
wherein F1 and B1 respectively represent the predicted foreground attention map and background attention map, M1 is the feature map obtained through the ResNet50 prediction network, and φ1 and its companion operator denote two independent ResNet50 prediction networks;
in the second stage, the foreground attention map and the background attention map generated in the first stage are enhanced according to the following formula:
wherein Fatt and Batt respectively represent the final predicted foreground attention map and background attention map, M2 is the feature map predicted by the ResNet50 network in the second stage, and φ2 and its companion operator denote two independent ResNet50 prediction networks in the second stage;
the extracting of the channel-by-channel feature map by the feature extraction module comprises:
extracting the channel-by-channel feature map using the ResNet50-based prediction network, the channel-by-channel feature map being the feature map output at the last layer of the ResNet50 prediction network.
7. A panoramic image saliency prediction terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, performs the method of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010171615.6A CN111488886B (en) | 2020-03-12 | 2020-03-12 | Panoramic image significance prediction method, system and terminal for arranging attention features |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111488886A CN111488886A (en) | 2020-08-04 |
CN111488886B true CN111488886B (en) | 2023-04-28 |
Family
ID=71811714
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010171615.6A Active CN111488886B (en) | 2020-03-12 | 2020-03-12 | Panoramic image significance prediction method, system and terminal for arranging attention features |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111488886B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112488122B (en) * | 2020-11-25 | 2024-04-16 | 南京航空航天大学 | Panoramic image visual saliency prediction method based on convolutional neural network |
CN114742170B (en) * | 2022-04-22 | 2023-07-25 | 马上消费金融股份有限公司 | Countermeasure sample generation method, model training method, image recognition method and device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110084249A (en) * | 2019-04-24 | 2019-08-02 | 哈尔滨工业大学 | The image significance detection method paid attention to based on pyramid feature |
CN110110642A (en) * | 2019-04-29 | 2019-08-09 | 华南理工大学 | A kind of pedestrian's recognition methods again based on multichannel attention feature |
CN110414377A (en) * | 2019-07-09 | 2019-11-05 | 武汉科技大学 | A kind of remote sensing images scene classification method based on scale attention network |
CN110648334A (en) * | 2019-09-18 | 2020-01-03 | 中国人民解放***箭军工程大学 | Multi-feature cyclic convolution saliency target detection method based on attention mechanism |
CN110827193A (en) * | 2019-10-21 | 2020-02-21 | 国家广播电视总局广播电视规划院 | Panoramic video saliency detection method based on multi-channel features |
Non-Patent Citations (2)
Title |
---|
Shengkai Xiang et al. Feature Decomposition and Attention-guided Boundary Refinement for Saliency Detection. 2019 IEEE 3rd Advanced Information Management, Communicates, Electronic and Automation Control Conference. 2019, 982-989. * |
Cui Liqun et al. Salient object detection method in composite domains. Journal of Image and Graphics. 2018, 72-82. * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||