CN114898120B - 360-degree image salient object detection method based on convolutional neural network - Google Patents

360-degree image salient object detection method based on convolutional neural network

Info

Publication number
CN114898120B
CN114898120B (application number CN202210586991.0A)
Authority
CN
China
Prior art keywords
image
features
projection
feature
equidistant
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210586991.0A
Other languages
Chinese (zh)
Other versions
CN114898120A (en)
Inventor
周晓飞 (Zhou Xiaofei)
罗晨浩 (Luo Chenhao)
张继勇 (Zhang Jiyong)
李世锋 (Li Shifeng)
周振 (Zhou Zhen)
何帆 (He Fan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202210586991.0A
Publication of CN114898120A
Application granted
Publication of CN114898120B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA], independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion of extracted features
    • G06V 10/82: Arrangements for image or video recognition or understanding using neural networks
    • G06V 2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07: Target detection
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a 360-degree image salient object detection method based on a convolutional neural network, comprising the following steps: S1, image conversion; S2, building a feature pyramid network; S3, applying four feature aggregation modules, in each of which a feature conversion submodule converts the cube projection features into equidistant features and combines them with the original equidistant image features, after which an atrous convolution pooling pyramid submodule optimizes the combined features, yielding multi-level aggregated features; S4, concatenating the multi-level aggregated features and feeding them to an attention integration module, which adaptively selects reliable spatial and channel information through spatial and channel attention mechanisms and fuses it with the original features to obtain the final features and complete the salient object detection. The method uses an image mapping relation to construct a corresponding cube projection image from the equidistant 360-degree image, and takes both types of images as input to mitigate the spherical-to-plane projection distortion that arises when a single equirectangular image is used as input.

Description

360-degree image salient object detection method based on convolutional neural network
Technical Field
The invention relates to the technical field of computer vision, in particular to a 360-degree image salient object detection method based on a convolutional neural network.
Background
A 360-degree image, i.e. a 360-degree panoramic image, is obtained by shooting an existing scene from all angles with capture equipment and post-processing the shots on a computer; it is a three-dimensional virtual scene display technology. As a new display form it has wide application scenarios, such as all-around presentation of tourist attractions and hotels, all-around analysis of the road environment for autonomous driving, and VR film and entertainment, all of which depend on the development of 360-degree imaging technology. Detecting the salient objects in a 360-degree image helps to quickly lock onto pedestrians and target buildings in the scene, and is therefore of considerable research significance in different fields.
Detecting and segmenting salient objects in natural scenes, commonly referred to as salient object detection, aims to capture the most visually attractive objects in an image and can be applied to a wide range of vision tasks such as image and video segmentation, image understanding, semantic segmentation and object highlighting. In recent years, with the continuous development of convolutional neural networks, conventional salient object detection models have achieved high performance on images with a limited field of view. A 360-degree panoramic image, however, is a novel form of image representation; the two common ways to display its global object information as a two-dimensional image are the equidistant (equirectangular) projection and the cube projection.
Among them, the equidistant projection is the most common way to store a 360-degree panoramic image as a standard 2D image; it displays the full-range image information of the real 3D world on a single two-dimensional plane, but the real semantic information is distorted by the spherical-to-plane projection. Although many scholars have proposed non-convolutional algorithms to handle this distorted information, most existing convolutional-neural-network-based salient object detection models still cannot accurately highlight salient objects from distorted semantic information, because convolutional neural networks are sensitive to regular grid data and insensitive to distorted data.
Compared with the equidistant projection, the cube projection cuts the 360-degree panoramic image onto the six faces of a cube, presenting the global information as images of six orientations (up, down, left, right, front and back).
It can be seen that although both the equidistant projection and the cube projection can present the global object information as two-dimensional images, spherical-to-plane projection distortion is inevitably introduced, so that directly applying conventional salient object detection models is unlikely to accurately highlight the salient objects in these images.
Disclosure of Invention
In view of the defects of the prior art, the invention provides a 360-degree image salient object detection method based on a convolutional neural network. A corresponding cube projection image is constructed from the equidistant 360-degree image using an image mapping relation, and both kinds of images are used as input, thereby mitigating the spherical-to-plane projection distortion caused by using the equidistant 360-degree image alone as input.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a360-degree image salient object detection method based on a convolutional neural network comprises the following steps:
s1, image conversion
S1-1, creating a data set of an equidistant 360-degree image;
s1-2, establishing an image conversion module;
s1-3, after reading equidistant 360-degree images in a data set, generating corresponding cubic projection images by using an image conversion module;
s2, constructing a characteristic pyramid network, and performing characteristic extraction on the equidistant 360-degree image and the converted cube projection image to obtain equidistant 360-degree image characteristics and cube projection characteristics;
s3, four identical feature aggregation modules are adopted, each module is subjected to conversion from cube projection features to equidistant features by a feature conversion submodule, and is combined with the features of an equidistant 360-degree image, and then a cavity convolution pooling pyramid submodule is used for optimizing the features, so that multi-level aggregation features are obtained;
and S4, connecting and feeding the multi-level aggregation features to an attention integration module, adaptively selecting reliable space and channel information through deducing a space and channel attention mechanism, and fusing the reliable space and channel information with the multi-level aggregation features to obtain final features and finish the detection of the remarkable target.
Preferably, in step S1-2, the corresponding cube projection image is generated from the equidistant 360-degree image using the mapping relationship between the equidistant projection and the cube projection.
Preferably, the mapping relationship between the equidistant projection and the cube projection is expressed as:

q_i = R_fi · p_i

θ_fi = arcsin( q_y / ||q_i|| )

φ_fi = arctan( q_x / q_z )

where θ_fi and φ_fi represent the latitude and longitude under the equidistant projection; q_x, q_y, q_z are the x, y, z components of the coordinate q_i; R_fi represents a rotation matrix; f_i is a known imaging plane; p_i is a point on the imaging plane f_i, and x, y, z are the coordinates of p_i, with 0 ≤ x, y ≤ w−1 and z = w/2; w is the side length of the cube projection image.
Preferably, the image data input to the feature pyramid network comprises the equidistant 360-degree image and the cube projection images, and an equidistant 360-degree image together with its corresponding cube projection images forms an image sample.
Preferably, the feature pyramid network is constructed as follows: an FPN is adopted as the backbone network, in which the bottom-up path is built on ResNet-50.
Preferably, in step S2, the feature extraction method includes:
the feature pyramid network extracts features from the seven input images of each image sample, namely the equidistant projection image and the six face images of the cube projection (up, down, left, right, front and back), obtaining the equidistant image features and the cube projection features;
in each independent FPN feature extraction module, the ResNet serves as the feed-forward backbone: each stage down-samples with a stride of 2, and the features of stages 2-5 participate in prediction. The outputs of conv2-conv5 (the last residual block of each stage) are used as the FPN features, corresponding to down-sampling factors of 4, 8, 16 and 32 of the input picture. In the top-down path, the smallest (topmost) feature map is enlarged by up-sampling to the size of the lateral feature map of the level below, fused with that level's features, and output level by level, giving the feature results F1-F4 of each level.
Preferably, in step S3, a set of four groups of features is output by the four identical feature aggregation modules.
Preferably, the feature conversion sub-module works as follows: the 6 cube projection features are converted into equidistant projection features using the mapping relation between the equidistant image features and the cube projection features, and are then combined with the features extracted from the original equidistant image to obtain the final mixed features.
Preferably, the atrous convolution pooling pyramid sub-module optimizes the features as follows: the given input is sampled in parallel by atrous convolutions with different sampling rates, the resulting feature maps are concatenated, which expands the number of channels, and a 1 x 1 convolution then reduces the number of channels to the expected value; this is equivalent to capturing the context of the image at multiple scales.
The invention has the following characteristics and beneficial effects:
the image mapping relation is used for constructing a corresponding cubic projection image based on the equidistant 360-degree image, and the problem of poor distortion of spherical surface-to-plane projection caused by single equal-rectangular image input is solved by using the double-type image as input.
A feature pyramid network extracts features from the image at each scale to generate a multi-scale feature representation, fusing low-resolution feature maps with strong semantic information and high-resolution feature maps with weak semantic information but rich spatial information, at little additional computational cost.
The spatial and channel attention mechanisms adaptively select spatial and channel information, so that the final features are more reliable and a more accurate salient object map is generated.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a block diagram of an embodiment of the present invention;
FIG. 2 is a block diagram of step S2 in the embodiment of the present invention;
FIG. 3 is a block diagram of step S3 in the embodiment of the present invention;
FIG. 4 is a diagram of the ASPP sub-module in step S3 in the embodiment of the present invention;
FIG. 5 is a block diagram of step S4 in the embodiment of the present invention;
FIG. 6 is a diagram of the attention mechanism submodule of step S4 in the embodiment of the present invention;
FIG. 7 is a graph showing the results of the embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used only for convenience in describing the present invention and for simplicity in description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention. Furthermore, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first," "second," etc. may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless otherwise specified.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "coupled" are to be construed broadly, e.g., as a fixed connection, a removable connection, or an integral connection; the connection may be mechanical or electrical; it may be direct, or indirect through an intervening medium, or an internal communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific situation.
The invention provides a 360-degree image salient object detection method based on a convolutional neural network, which comprises the following steps as shown in figure 1:
s1, image conversion
S1-1, creating a data set of equidistant 360-degree images.
It should be noted that in this embodiment the public 360-SOD dataset is adopted, which contains 500 high-resolution equidistant 360-degree images and their corresponding saliency maps, with people being the most common salient objects. 400 of the images are used as the training set and 100 as the test set for training, testing and evaluating the model. To ensure consistency of the input data, the input equidistant 360-degree images are resized to 1024 x 512 and the cube projection images to 256 x 256, as sketched below.
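For concreteness, the following is a minimal sketch of how one image/saliency-map pair could be read and resized to the sizes used in this embodiment; the file paths, the load_sample helper name and the use of OpenCV are illustrative assumptions rather than the reference implementation.

```python
import cv2
import numpy as np

ERP_SIZE = (1024, 512)     # (width, height) of the equidistant 360-degree input
FACE_SIZE = (256, 256)     # size of each cube projection face

def load_sample(image_path: str, mask_path: str):
    """Read one 360-SOD image/saliency-map pair and resize to the network input size."""
    erp = cv2.imread(image_path, cv2.IMREAD_COLOR)
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)
    erp = cv2.resize(erp, ERP_SIZE, interpolation=cv2.INTER_LINEAR)
    mask = cv2.resize(mask, ERP_SIZE, interpolation=cv2.INTER_NEAREST)
    return erp, mask.astype(np.float32) / 255.0   # ground truth values in [0, 1]
```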
S1-2, establishing an image conversion module, which generates the corresponding cube projection image from the equidistant 360-degree image using the mapping relation between the equidistant projection and the cube projection.
Here the mapping relation between the equidistant projection and the cube projection is expressed as:

q_i = R_fi · p_i

θ_fi = arcsin( q_y / ||q_i|| )

φ_fi = arctan( q_x / q_z )

where θ_fi and φ_fi represent the latitude and longitude under the equidistant projection, and q_x, q_y, q_z are the x, y, z components of the coordinate q_i.
It will be appreciated that in the projected representation of an equidistant 360-degree image, the cube projection is usually represented as 6 faces, each a square with side length w, the 6 faces being up, down, front, back, left and right. Each face can be seen as an image taken independently by a camera with focal length w/2 (a field of view of 90 degrees), and the projection centers of the 6 cameras coincide at a single point, the center of the cube. If the origin of the world coordinate system is set at the cube center, the external parameters of the 6 cameras are given solely by the rotation matrices R_fi, with no translation vector. A point p_i on a given imaging plane f_i of the camera system has three-dimensional coordinates x, y, z, where 0 ≤ x, y ≤ w−1 and z = w/2.
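For illustration, a minimal sketch of such an equidistant-to-cube (E2C) conversion is given below; the face ordering, the per-face rotation angles, the e2c helper name and the use of OpenCV's remap for sampling are assumptions, not the patent's reference implementation.

```python
import cv2
import numpy as np

# Rotation matrices turning the face camera towards front/right/back/left/up/down (assumed convention).
FACE_ROTATIONS = {
    "front": np.eye(3),
    "right": cv2.Rodrigues(np.array([0.0, -np.pi / 2, 0.0]))[0],
    "back":  cv2.Rodrigues(np.array([0.0,  np.pi,     0.0]))[0],
    "left":  cv2.Rodrigues(np.array([0.0,  np.pi / 2, 0.0]))[0],
    "up":    cv2.Rodrigues(np.array([ np.pi / 2, 0.0, 0.0]))[0],
    "down":  cv2.Rodrigues(np.array([-np.pi / 2, 0.0, 0.0]))[0],
}

def e2c(erp: np.ndarray, w: int = 256) -> dict:
    """Project an equidistant (equirectangular) image onto the six cube faces of side w."""
    H, W = erp.shape[:2]
    x, y = np.meshgrid(np.arange(w), np.arange(w))            # pixel grid of one face
    p = np.stack([x - w / 2 + 0.5, y - w / 2 + 0.5,           # imaging plane at z = w/2,
                  np.full_like(x, w / 2, dtype=float)], -1)   # i.e. focal length w/2 (90-degree FoV)
    faces = {}
    for name, R in FACE_ROTATIONS.items():
        q = p @ R.T                                           # q_i = R_fi · p_i
        lon = np.arctan2(q[..., 0], q[..., 2])                # longitude phi_fi
        lat = np.arcsin(q[..., 1] / np.linalg.norm(q, axis=-1))  # latitude theta_fi
        u = (lon / (2 * np.pi) + 0.5) * (W - 1)               # back to ERP pixel coordinates
        v = (lat / np.pi + 0.5) * (H - 1)
        faces[name] = cv2.remap(erp, u.astype(np.float32), v.astype(np.float32),
                                interpolation=cv2.INTER_LINEAR, borderMode=cv2.BORDER_WRAP)
    return faces
```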
S1-3, after reading the equidistant 360-degree images in the data set, the corresponding cube projection images are generated with the image conversion module.
S2, a feature pyramid network is built, and feature extraction is carried out on the equidistant 360-degree image and the converted cube projection images to obtain the equidistant 360-degree image features and the cube projection features.
Specifically, as shown in fig. 2, the feature pyramid network is constructed as follows: an FPN is adopted as the backbone network, in which the bottom-up path is built on ResNet-50.
The ResNet-50-based feature pyramid network acquires features of the image at different levels, with the weights shared across all inputs.
The image data input to the feature pyramid network comprises the equidistant 360-degree image and the cube projection images; an equidistant 360-degree image and its corresponding cube projection images form one image sample. The feature pyramid network extracts features from the seven input images of each sample, namely the equidistant projection image and the six face images of the cube projection (up, down, left, right, front and back), obtaining the equidistant image features and the cube projection features.
it should be noted that, in this embodiment, model training is performed by using dual-type mixed data, a single sample includes one isometric projection image and six cube projection images, and the module needs to perform feature extraction on seven images, so that a set of seven sets of features is finally output.
It should be noted that the feature pyramid network constructed in this embodiment is used for feature extraction and can readily be reproduced by those skilled in the art; as shown in fig. 2, it consists of a ResNet-50 backbone and 4 convolution stages with strides of 4, 8, 16 and 32 respectively.
Further, the feature extraction method comprises the following steps:
the upper layer Resnet of each independent FPN feature extraction module in the feature pyramid network is used as a part of a feedforward backsbone, each level up performs down-sampling by step =2, output 2-5 levels of features participate in prediction, conv 2-5 output layers and a last residual block layer are used as features of the FPN and respectively correspond to down-sampling multiples of 4,8, 16 and 32 of an input picture, the lowest layer from top to bottom amplifies a rightmost small feature map to the same size as a left feature map of the lowest layer in an up-sampling mode, and finally, the lowest layer is fused with the upper layer features and then output layer by layer to obtain feature results F1-4 of each layer.
S3, as shown in fig. 3, four identical feature aggregation modules are adopted, so that a set of four groups of features is output. In each feature aggregation module, a feature conversion submodule (C2E feature conversion module) converts the cube projection features into equidistant features and combines them with the equidistant 360-degree image features, and an atrous convolution pooling pyramid submodule (ASPP submodule) then optimizes the combined features.
the conversion method of the feature conversion submodule comprises the following steps: and converting the 6 cube projection features into isometric projection features by utilizing the mapping relation between the isometric image features and the cube projection features.
It should be noted that the mapping relation between the cube projection features and the equidistant projection features is expressed as:

R_fi · p_i = q_i

θ_fi = arcsin( q_y / ||q_i|| )

φ_fi = arctan( q_x / q_z )

where θ_fi and φ_fi represent the latitude and longitude under the equidistant projection, and q_x, q_y, q_z are the x, y, z components of the coordinate q_i.
It should be noted that in this embodiment the feature conversion is performed by the C2E feature conversion module, which is a conventional technique and is therefore not described in detail here; refer to fig. 3.
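For illustration, a minimal sketch of a C2E feature conversion is given below; it assumes the inverse mapping has already been precomputed into a face index map and a normalized sampling grid (face_idx and grid, both hypothetical names) following the formulas above, and uses PyTorch's grid_sample to gather the values.

```python
import torch
import torch.nn.functional as F

def c2e_features(face_feats: torch.Tensor, face_idx: torch.Tensor, grid: torch.Tensor):
    """face_feats: (6, C, h, w) cube-face features of one sample.
    face_idx:   (H, W) long tensor giving the face each equidistant cell falls on.
    grid:       (H, W, 2) sampling coordinates on that face, normalized to [-1, 1].
    Returns an equidistant feature map of shape (1, C, H, W)."""
    C = face_feats.shape[1]
    H, W = face_idx.shape
    out = torch.zeros(1, C, H, W, device=face_feats.device)
    for f in range(6):
        mask = face_idx == f
        if not mask.any():
            continue
        # sample the whole equidistant grid from face f, then keep only the cells that belong to it
        sampled = F.grid_sample(face_feats[f:f + 1], grid.unsqueeze(0),
                                mode="bilinear", align_corners=False)
        out[:, :, mask] = sampled[:, :, mask]
    return out
```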
Further, as shown in fig. 4, the atrous convolution pooling pyramid sub-module optimizes the features as follows: the given input is sampled in parallel by atrous convolutions with different sampling rates, the resulting feature maps are concatenated, which expands the number of channels, and a 1 x 1 convolution then reduces the number of channels to the expected value; this is equivalent to capturing the context of the image at multiple scales.
It should be noted that in this embodiment the feature optimization is performed by the atrous convolution pooling pyramid sub-module (ASPP sub-module), which is a conventional technique; referring to fig. 4, it comprises a 1 x 1 convolution layer, three 3 x 3 atrous convolution layers with sampling rates of 6, 12 and 18 respectively, a pooling layer and an upsampling layer.
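A minimal sketch of such an ASPP sub-module is given below; the channel sizes and the exact branch composition are assumptions consistent with the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch: int = 256, out_ch: int = 256, rates=(6, 12, 18)):
        super().__init__()
        self.branch1x1 = nn.Conv2d(in_ch, out_ch, 1)
        self.branches3x3 = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.pool_conv = nn.Conv2d(in_ch, out_ch, 1)
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)  # squeeze channels back

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [self.branch1x1(x)] + [b(x) for b in self.branches3x3]  # parallel atrous branches
        pooled = F.interpolate(self.pool_conv(self.pool(x)), size=(h, w),
                               mode="bilinear", align_corners=False)    # upsample the pooled branch
        feats.append(pooled)
        return self.project(torch.cat(feats, dim=1))                    # multi-scale context
```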
S4, as shown in FIG. 5, the multi-level aggregated features are concatenated and fed to an attention integration module, which adaptively selects reliable spatial and channel information through spatial and channel attention mechanisms and fuses it with the multi-level aggregated features to obtain the final features and complete the salient object detection.
It should be noted that in this embodiment the feature fusion is performed by the attention integration module, which is a conventional technique; referring to fig. 5, it includes three 1 x 1 convolutional layers, one 3 x 3 convolutional layer, a spatial attention module and a channel attention module. The spatial attention module and the channel attention module are conventional in the art and are therefore not described in detail in this embodiment.
As shown in fig. 6, the spatial attention mechanism in this network first reduces the channel dimension of each branch, splices the results into a single feature map, and then uses a convolutional layer to learn the overall spatial attention, which is fed back to the four branches for integration. The channel attention mechanism applies max pooling and mean pooling to the overall four-branch feature map simultaneously, obtains a transformation result through a convolution layer, and finally applies this result to all channels to obtain the attention value of each channel.
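The following is a minimal sketch of such spatial and channel attention applied to the concatenated multi-level features; the kernel size, the reduction ratio and the way the two attentions are combined are assumptions.

```python
import torch
import torch.nn as nn

class SpatialChannelAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)   # learns the overall spatial attention
        self.channel_conv = nn.Sequential(
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1),
        )

    def forward(self, x):                                  # x: concatenated multi-level features (B, C, H, W)
        # spatial attention: reduce the channel dimension (mean + max), then convolve
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.max(dim=1, keepdim=True).values], dim=1)
        s = torch.sigmoid(self.spatial_conv(s))            # (B, 1, H, W)
        # channel attention: global mean and max pooling, shared transform, then sum
        c_avg = self.channel_conv(x.mean(dim=(2, 3), keepdim=True))
        c_max = self.channel_conv(x.amax(dim=(2, 3), keepdim=True))
        c = torch.sigmoid(c_avg + c_max)                   # (B, C, 1, 1)
        return x * s * c                                   # re-weighted features fused with the input
```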
In this embodiment, the network model is built with the PyTorch framework, and the sum of the cross-entropy loss and the mean absolute error loss is used as the loss function. The weights of the feature extraction module are initialized from a ResNet-50 model pre-trained on ImageNet, and the weights of the newly added convolutional layers are initialized with the normal-distribution method proposed by Kaiming He. The model is trained end to end with the stochastic gradient descent (SGD) algorithm: the training batch size is set to 4, the momentum to 0.9, the weight decay to 0.0005, the initial learning rate to 0.002, and training runs for 40 epochs. The model generates a salient object prediction map for a 360-degree image; the prediction map is a grayscale map with pixel values from 0 to 1, where 1 indicates the region of a salient object and 0 indicates the background region.
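A minimal sketch of this training configuration is given below; the placeholder model and the dummy batch stand in for the full network and the 360-SOD training loader, and BCEWithLogitsLoss plus L1Loss is used to represent the cross-entropy and mean-absolute-error terms.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, 3, padding=1)        # placeholder standing in for the full network
criterion_bce = nn.BCEWithLogitsLoss()       # cross-entropy term (applied to logits)
criterion_mae = nn.L1Loss()                  # mean absolute error term
optimizer = torch.optim.SGD(model.parameters(), lr=0.002,
                            momentum=0.9, weight_decay=0.0005)

# dummy batch in place of the 360-SOD training loader (batch size 4)
images = torch.randn(4, 3, 512, 1024)
gt = torch.rand(4, 1, 512, 1024)

for epoch in range(40):                      # final training length: 40 epochs
    logits = model(images)                   # salient object prediction (logits)
    loss = criterion_bce(logits, gt) + criterion_mae(torch.sigmoid(logits), gt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```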
As can be seen from fig. 7, this embodiment improves on existing conventional image salient object detection methods so that detection can be adapted to equidistant 360-degree images, obtaining a better detection result. The network consists of four major modules: a data processing module (the E2C image conversion module) and three network structure modules (the feature pyramid network, the feature aggregation modules and the attention integration module). The image conversion module converts the equidistant 360-degree image into the cube projection images and thereby constructs the dual-type input data used by the network; taking the dual-type data as input avoids the spherical-to-plane projection distortion caused by a single equidistant image input. The FPN feature extraction module extracts multi-level features from the various input data with shared weights, the feature aggregation modules integrate and optimize the multi-level features, and the attention integration module performs the final reliability-weighted selection and screening to obtain a high-quality saliency map. The result is a grayscale image with pixel values in [0, 1], where 1 marks the area of the salient object in the 360-degree image and 0 marks the background area, successfully realizing the salient object detection task for 360-degree images.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions and alterations can be made in these embodiments, including components thereof, without departing from the principles and spirit of the invention, and still fall within the scope of the invention.

Claims (9)

1. A 360-degree image salient object detection method based on a convolutional neural network, characterized by comprising the following steps:
S1, image conversion:
S1-1, creating a data set of equidistant 360-degree images;
S1-2, establishing an image conversion module;
S1-3, after reading the equidistant 360-degree images in the data set, generating the corresponding cube projection images with the image conversion module;
S2, constructing a feature pyramid network and extracting features from the equidistant 360-degree image and the converted cube projection image to obtain equidistant 360-degree image features and cube projection features;
S3, adopting four identical feature aggregation modules, in each of which a feature conversion submodule converts the cube projection features into equidistant features and combines them with the equidistant 360-degree image features, after which an atrous convolution pooling pyramid submodule optimizes the combined features, yielding multi-level aggregated features;
S4, concatenating the multi-level aggregated features and feeding them to an attention integration module, which adaptively selects reliable spatial and channel information through spatial and channel attention mechanisms and fuses it with the multi-level aggregated features to obtain the final features and complete the salient object detection.
2. The convolutional neural network-based 360-degree image salient object detection method as claimed in claim 1, wherein in step S1-2 the corresponding cube projection image is generated from the equidistant 360-degree image using the mapping relationship between the equidistant projection and the cube projection.
3. The convolutional neural network-based 360-degree image salient object detection method as claimed in claim 2, wherein the mapping relationship between the equidistant projection and the cube projection is expressed as follows:
q_i = R_fi · p_i

θ_fi = arcsin( q_y / ||q_i|| )

φ_fi = arctan( q_x / q_z )

wherein θ_fi and φ_fi represent the latitude and longitude under the equidistant projection; q_x, q_y, q_z are the x, y, z components of the coordinate q_i; R_fi represents a rotation matrix; f_i is a known imaging plane; p_i is a point on the imaging plane f_i, and x, y, z are the coordinates of p_i, with 0 ≤ x, y ≤ w−1 and z = w/2; and w is the side length of the cube projection image.
4. The convolutional neural network-based 360-degree image salient object detection method of claim 1, wherein the image data input to the feature pyramid network comprises equidistant 360-degree images and cube projection images, and an equidistant 360-degree image together with its corresponding cube projection images forms an image sample.
5. The 360-degree image salient object detection method based on the convolutional neural network as claimed in claim 4, wherein the method for constructing the feature pyramid network is as follows: FPN is adopted as a backbone network, wherein a bottom-up path is built based on Resnet-50.
6. The convolutional neural network-based 360-degree image salient object detection method of claim 5, wherein in the step S2, the feature extraction method comprises the following steps:
the feature pyramid network extracts features from the seven input images of each image sample, namely the equidistant projection image and the six face images of the cube projection (up, down, left, right, front and back), obtaining the equidistant image features and the cube projection features;
in each independent FPN feature extraction module, the ResNet serves as the feed-forward backbone: each stage down-samples with a stride of 2, the features of stages 2-5 are used for prediction, and the outputs of conv2-conv5 (the last residual block of each stage) are used as the FPN features, corresponding to down-sampling factors of 4, 8, 16 and 32 of the input picture; in the top-down path the smallest feature map is enlarged by up-sampling to the size of the lateral feature map of the level below, fused with that level's features, and output level by level to obtain the feature results F1-F4 of each level.
7. The convolutional neural network-based 360-degree image salient object detection method of claim 1, wherein in the step S3, a set of four groups of features is output through four identical feature aggregation modules.
8. The convolutional neural network-based 360-degree image salient object detection method of claim 6, wherein the feature conversion submodule converts the 6 cube projection features into equidistant projection features using the mapping relation between the cube projection features and the equidistant image features.
9. The convolutional neural network-based 360-degree image salient object detection method as claimed in claim 8, wherein the atrous convolution pooling pyramid sub-module optimizes the features as follows: the given input is sampled in parallel by atrous convolutions with different sampling rates, the resulting feature maps are concatenated, which expands the number of channels, and a 1 x 1 convolution then reduces the number of channels to the expected value, which is equivalent to capturing the context of the image at multiple scales.
CN202210586991.0A 2022-05-27 2022-05-27 360-degree image salient object detection method based on convolutional neural network Active CN114898120B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210586991.0A CN114898120B (en) 2022-05-27 2022-05-27 360-degree image salient object detection method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210586991.0A CN114898120B (en) 2022-05-27 2022-05-27 360-degree image salient object detection method based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN114898120A CN114898120A (en) 2022-08-12
CN114898120B true CN114898120B (en) 2023-04-07

Family

ID=82725996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210586991.0A Active CN114898120B (en) 2022-05-27 2022-05-27 360-degree image salient object detection method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN114898120B (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827193B (en) * 2019-10-21 2023-05-09 国家广播电视总局广播电视规划院 Panoramic video significance detection method based on multichannel characteristics
CN111178163B (en) * 2019-12-12 2023-06-09 宁波大学 Stereoscopic panoramic image salient region prediction method based on cube projection format
CN112381813B (en) * 2020-11-25 2023-09-05 华南理工大学 Panoramic view visual saliency detection method based on graph convolution neural network
AU2020103901A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field
CN113536977B (en) * 2021-06-28 2023-08-18 杭州电子科技大学 360-degree panoramic image-oriented saliency target detection method
CN114359680A (en) * 2021-12-17 2022-04-15 中国人民解放军海军工程大学 Panoramic vision water surface target detection method based on deep learning

Also Published As

Publication number Publication date
CN114898120A (en) 2022-08-12

Similar Documents

Publication Publication Date Title
CN114004941B (en) Indoor scene three-dimensional reconstruction system and method based on nerve radiation field
CN112894832B (en) Three-dimensional modeling method, three-dimensional modeling device, electronic equipment and storage medium
CN107945282B (en) Rapid multi-view three-dimensional synthesis and display method and device based on countermeasure network
CN112330795B (en) Human body three-dimensional reconstruction method and system based on single RGBD image
WO2023280038A1 (en) Method for constructing three-dimensional real-scene model, and related apparatus
WO2022151661A1 (en) Three-dimensional reconstruction method and apparatus, device and storage medium
US11533431B2 (en) Method and device for generating a panoramic image
EP3407248B1 (en) An apparatus, a method and a computer program for video coding and decoding
CN114332385A (en) Monocular camera target detection and spatial positioning method based on three-dimensional virtual geographic scene
CN115690382B (en) Training method of deep learning model, and method and device for generating panorama
CN111340866A (en) Depth image generation method, device and storage medium
CN116051747A (en) House three-dimensional model reconstruction method, device and medium based on missing point cloud data
CN109788270B (en) 3D-360-degree panoramic image generation method and device
CN110580720A (en) camera pose estimation method based on panorama
CN116778288A (en) Multi-mode fusion target detection system and method
CN113902802A (en) Visual positioning method and related device, electronic equipment and storage medium
CN111028273A (en) Light field depth estimation method based on multi-stream convolution neural network and implementation system thereof
CN115527016A (en) Three-dimensional GIS video fusion registration method, system, medium, equipment and terminal
CN117197388A (en) Live-action three-dimensional virtual reality scene construction method and system based on generation of antagonistic neural network and oblique photography
US10354399B2 (en) Multi-view back-projection to a light-field
CN114092540A (en) Attention mechanism-based light field depth estimation method and computer readable medium
CN113076953A (en) Black car detection method, system, device and storage medium
CN114898120B (en) 360-degree image salient object detection method based on convolutional neural network
CN116843754A (en) Visual positioning method and system based on multi-feature fusion
CN114663599A (en) Human body surface reconstruction method and system based on multiple views

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant