CN114898120B - 360-degree image salient object detection method based on convolutional neural network - Google Patents

360-degree image salient object detection method based on convolutional neural network

Info

Publication number
CN114898120B
CN114898120B (application number CN202210586991.0A)
Authority
CN
China
Prior art keywords
image
features
projection
feature
equidistant
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210586991.0A
Other languages
Chinese (zh)
Other versions
CN114898120A (en)
Inventor
周晓飞 (Zhou Xiaofei)
罗晨浩 (Luo Chenhao)
张继勇 (Zhang Jiyong)
李世锋 (Li Shifeng)
周振 (Zhou Zhen)
何帆 (He Fan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202210586991.0A
Publication of CN114898120A
Application granted
Publication of CN114898120B

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/40: Extraction of image or video features
    • G06V 10/46: Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V 10/462: Salient features, e.g. scale invariant feature transforms [SIFT]
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA], independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806: Fusion of extracted features
    • G06V 10/82: Arrangements for image or video recognition or understanding using neural networks
    • G06V 2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07: Target detection
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a 360-degree image salient object detection method based on a convolutional neural network, comprising the following steps: S1, image conversion; S2, building a feature pyramid network; S3, applying four feature aggregation modules, in each of which a feature conversion submodule converts the cube projection features into equidistant features and combines them with the original equidistant image features, after which an atrous convolution pooling pyramid submodule optimizes the combined features, yielding multi-level aggregated features; S4, concatenating the multi-level aggregated features and feeding them to an attention integration module, which adaptively selects reliable spatial and channel information through spatial and channel attention mechanisms and fuses it with the original features to obtain the final features and complete the salient object detection. The method uses an image mapping relation to construct a corresponding cube projection image from the equidistant 360-degree image, and takes both types of images as input to mitigate the spherical-to-plane projection distortion that arises when a single equirectangular image is used as input.

Description

360-degree image salient object detection method based on convolutional neural network
Technical Field
The invention relates to the technical field of computer vision, in particular to a 360-degree image salient object detection method based on a convolutional neural network.
Background
A 360-degree image, i.e. a 360-degree panoramic image, is obtained by shooting an existing scene from all angles with capture equipment and post-processing the shots on a computer; it is a three-dimensional virtual scene display technology. As a new display form it has wide application scenarios, such as all-around presentation of tourist attractions and hotels, all-around analysis of the road environment for autonomous driving, and VR film and entertainment, all of which depend on the development of 360-degree imaging technology. Detecting the salient objects in a 360-degree image helps to quickly lock onto pedestrians and target buildings in the scene, and is therefore of considerable research significance in different fields.
Detecting and segmenting salient objects in natural scenes, commonly referred to as salient object detection, aims to capture the most visually attractive objects in an image and can be applied to a wide range of vision tasks such as image and video segmentation, image understanding, semantic segmentation and object highlighting. In recent years, with the continuous development of convolutional neural networks, conventional salient object detection models have achieved high performance on images with a limited field of view. A 360-degree panoramic image, however, is a novel form of image representation; the two common ways to display its global object information as a two-dimensional image are the equidistant (equirectangular) projection and the cube projection.
Among them, the equidistant projection is the most common way to store a 360-degree panoramic image as a standard 2D image; it displays the full-range image information of the real 3D world on a single two-dimensional plane, but the real semantic information is distorted by the spherical-to-plane projection. Although many scholars have proposed non-convolutional algorithms to handle this distorted information, most existing convolutional-neural-network-based salient object detection models still cannot accurately highlight salient objects from distorted semantic information, because convolutional neural networks are sensitive to regular grid data and insensitive to distorted data.
Compared with the equidistant projection, the cube projection cuts the 360-degree panoramic image onto the six faces of a cube, presenting the global information as images of six orientations (up, down, left, right, front and back).
It can be seen that although both the equidistant projection and the cube projection can present the global object information as two-dimensional images, spherical-to-plane projection distortion is inevitably introduced, so that directly applying conventional salient object detection models is unlikely to accurately highlight the salient objects in these images.
Disclosure of Invention
In view of the defects of the prior art, the invention provides a 360-degree image salient object detection method based on a convolutional neural network. A corresponding cube projection image is constructed from the equidistant 360-degree image using an image mapping relation, and both kinds of images are used as input, thereby mitigating the spherical-to-plane projection distortion caused by using the equidistant 360-degree image alone as input.
In order to solve the technical problems, the technical scheme of the invention is as follows:
a360-degree image salient object detection method based on a convolutional neural network comprises the following steps:
s1, image conversion
S1-1, creating a data set of an equidistant 360-degree image;
s1-2, establishing an image conversion module;
s1-3, after reading equidistant 360-degree images in a data set, generating corresponding cubic projection images by using an image conversion module;
s2, constructing a characteristic pyramid network, and performing characteristic extraction on the equidistant 360-degree image and the converted cube projection image to obtain equidistant 360-degree image characteristics and cube projection characteristics;
s3, four identical feature aggregation modules are adopted, each module is subjected to conversion from cube projection features to equidistant features by a feature conversion submodule, and is combined with the features of an equidistant 360-degree image, and then a cavity convolution pooling pyramid submodule is used for optimizing the features, so that multi-level aggregation features are obtained;
and S4, connecting and feeding the multi-level aggregation features to an attention integration module, adaptively selecting reliable space and channel information through deducing a space and channel attention mechanism, and fusing the reliable space and channel information with the multi-level aggregation features to obtain final features and finish the detection of the remarkable target.
Preferably, in step S1-2, the corresponding cube projection image is generated from the equidistant 360-degree image using the mapping relationship between the equidistant projection and the cube projection.
Preferably, the mapping relationship between the equidistant projection and the cube projection is expressed as:

q_i = R_fi · p_i

θ_fi = arcsin( q_y / ||q_i|| )

φ_fi = arctan( q_x / q_z )

where θ_fi and φ_fi represent the latitude and longitude under the equidistant projection; q_x, q_y, q_z are the x, y, z components of the coordinate q_i; R_fi represents a rotation matrix; f_i is a known imaging plane; p_i is a point on the imaging plane f_i, and x, y, z are the coordinates of p_i, with 0 ≤ x, y ≤ w−1 and z = w/2; w is the side length of the cube projection image.
Preferably, the image data input to the feature pyramid network comprises the equidistant 360-degree image and the cube projection images, and an equidistant 360-degree image together with its corresponding cube projection images forms an image sample.
Preferably, the feature pyramid network is constructed as follows: an FPN is adopted as the backbone network, in which the bottom-up path is built on ResNet-50.
Preferably, in step S2, the feature extraction method includes:
the feature pyramid network extracts features from the seven input images of each image sample, namely the equidistant projection image and the six face images of the cube projection (up, down, left, right, front and back), obtaining the equidistant image features and the cube projection features;
in each independent FPN feature extraction module, the ResNet serves as the feed-forward backbone: each stage down-samples with a stride of 2, and the features of stages 2-5 participate in prediction. The outputs of conv2-conv5 (the last residual block of each stage) are used as the FPN features, corresponding to down-sampling factors of 4, 8, 16 and 32 of the input picture. In the top-down path, the smallest (topmost) feature map is enlarged by up-sampling to the size of the lateral feature map of the level below, fused with that level's features, and output level by level, giving the feature results F1-F4 of each level.
Preferably, in step S3, a set of four groups of features is output by the four identical feature aggregation modules.
Preferably, the feature conversion sub-module works as follows: the 6 cube projection features are converted into equidistant projection features using the mapping relation between the equidistant image features and the cube projection features, and are then combined with the features extracted from the original equidistant image to obtain the final mixed features.
Preferably, the atrous convolution pooling pyramid sub-module optimizes the features as follows: the given input is sampled in parallel by atrous convolutions with different sampling rates, the resulting feature maps are concatenated, which expands the number of channels, and a 1 x 1 convolution then reduces the number of channels to the expected value; this is equivalent to capturing the context of the image at multiple scales.
The invention has the following characteristics and beneficial effects:
the image mapping relation is used for constructing a corresponding cubic projection image based on the equidistant 360-degree image, and the problem of poor distortion of spherical surface-to-plane projection caused by single equal-rectangular image input is solved by using the double-type image as input.
A feature pyramid network extracts features from the image at each scale to generate a multi-scale feature representation, fusing low-resolution feature maps with strong semantic information and high-resolution feature maps with weak semantic information but rich spatial information, at little additional computational cost.
The spatial and channel attention mechanisms adaptively select spatial and channel information, so that the final features are more reliable and a more accurate salient object map is generated.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a block diagram of an embodiment of the present invention;
FIG. 2 is a block diagram of step S2 in the embodiment of the present invention;
FIG. 3 is a block diagram of step S3 in the embodiment of the present invention;
FIG. 4 is a diagram of the ASPP sub-module in step S3 in the embodiment of the present invention;
FIG. 5 is a block diagram of step S4 in the embodiment of the present invention;
FIG. 6 is a diagram of the attention mechanism submodule of step S4 in the embodiment of the present invention;
FIG. 7 is a graph showing the results of the embodiment of the present invention.
Detailed Description
It should be noted that the embodiments and features of the embodiments of the present invention may be combined with each other without conflict.
In the description of the present invention, it is to be understood that the terms "center", "longitudinal", "lateral", "up", "down", "front", "back", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are used only for convenience in describing the present invention and for simplicity in description, and do not indicate or imply that the referenced devices or elements must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention. Furthermore, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first," "second," etc. may explicitly or implicitly include one or more of that feature. In the description of the present invention, "a plurality" means two or more unless otherwise specified.
In the description of the present invention, it should be noted that, unless otherwise explicitly specified or limited, the terms "mounted," "connected," and "coupled" are to be construed broadly, e.g., as a fixed connection, a removable connection, or an integral connection; the connection may be mechanical or electrical; it may be direct, or indirect through an intervening medium, or an internal communication between two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific situation.
The invention provides a 360-degree image salient object detection method based on a convolutional neural network, which comprises the following steps as shown in figure 1:
s1, image conversion
S1-1, creating a data set of equidistant 360-degree images.
It should be noted that in this embodiment the public 360-SOD dataset is adopted, which contains 500 high-resolution equidistant 360-degree images and their corresponding saliency maps, with people being the most common salient objects. 400 of the images are used as the training set and 100 as the test set for training, testing and evaluating the model. To ensure consistency of the input data, the input equidistant 360-degree images are resized to 1024 x 512 and the cube projection images to 256 x 256, as sketched below.
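For concreteness, the following is a minimal sketch of how one image/saliency-map pair could be read and resized to the sizes used in this embodiment; the file paths, the load_sample helper name and the use of OpenCV are illustrative assumptions rather than the reference implementation.

```python
import cv2
import numpy as np

ERP_SIZE = (1024, 512)     # (width, height) of the equidistant 360-degree input
FACE_SIZE = (256, 256)     # size of each cube projection face

def load_sample(image_path: str, mask_path: str):
    """Read one 360-SOD image/saliency-map pair and resize to the network input size."""
    erp = cv2.imread(image_path, cv2.IMREAD_COLOR)
    mask = cv2.imread(mask_path, cv2.IMREAD_GRAYSCALE)
    erp = cv2.resize(erp, ERP_SIZE, interpolation=cv2.INTER_LINEAR)
    mask = cv2.resize(mask, ERP_SIZE, interpolation=cv2.INTER_NEAREST)
    return erp, mask.astype(np.float32) / 255.0   # ground truth values in [0, 1]
```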
S1-2, establishing an image conversion module, which generates the corresponding cube projection image from the equidistant 360-degree image using the mapping relation between the equidistant projection and the cube projection.
Here the mapping relation between the equidistant projection and the cube projection is expressed as:

q_i = R_fi · p_i

θ_fi = arcsin( q_y / ||q_i|| )

φ_fi = arctan( q_x / q_z )

where θ_fi and φ_fi represent the latitude and longitude under the equidistant projection, and q_x, q_y, q_z are the x, y, z components of the coordinate q_i.
It will be appreciated that in the projected representation of an equidistant 360-degree image, the cube projection is usually represented as 6 faces, each a square with side length w, the 6 faces being up, down, front, back, left and right. Each face can be seen as an image taken independently by a camera with focal length w/2 (a field of view of 90 degrees), and the projection centers of the 6 cameras coincide at a single point, the center of the cube. If the origin of the world coordinate system is set at the cube center, the external parameters of the 6 cameras are given solely by the rotation matrices R_fi, with no translation vector. A point p_i on a given imaging plane f_i of the camera system has three-dimensional coordinates x, y, z, where 0 ≤ x, y ≤ w−1 and z = w/2.
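For illustration, a minimal sketch of such an equidistant-to-cube (E2C) conversion is given below; the face ordering, the per-face rotation angles, the e2c helper name and the use of OpenCV's remap for sampling are assumptions, not the patent's reference implementation.

```python
import cv2
import numpy as np

# Rotation matrices turning the face camera towards front/right/back/left/up/down (assumed convention).
FACE_ROTATIONS = {
    "front": np.eye(3),
    "right": cv2.Rodrigues(np.array([0.0, -np.pi / 2, 0.0]))[0],
    "back":  cv2.Rodrigues(np.array([0.0,  np.pi,     0.0]))[0],
    "left":  cv2.Rodrigues(np.array([0.0,  np.pi / 2, 0.0]))[0],
    "up":    cv2.Rodrigues(np.array([ np.pi / 2, 0.0, 0.0]))[0],
    "down":  cv2.Rodrigues(np.array([-np.pi / 2, 0.0, 0.0]))[0],
}

def e2c(erp: np.ndarray, w: int = 256) -> dict:
    """Project an equidistant (equirectangular) image onto the six cube faces of side w."""
    H, W = erp.shape[:2]
    x, y = np.meshgrid(np.arange(w), np.arange(w))            # pixel grid of one face
    p = np.stack([x - w / 2 + 0.5, y - w / 2 + 0.5,           # imaging plane at z = w/2,
                  np.full_like(x, w / 2, dtype=float)], -1)   # i.e. focal length w/2 (90-degree FoV)
    faces = {}
    for name, R in FACE_ROTATIONS.items():
        q = p @ R.T                                           # q_i = R_fi · p_i
        lon = np.arctan2(q[..., 0], q[..., 2])                # longitude phi_fi
        lat = np.arcsin(q[..., 1] / np.linalg.norm(q, axis=-1))  # latitude theta_fi
        u = (lon / (2 * np.pi) + 0.5) * (W - 1)               # back to ERP pixel coordinates
        v = (lat / np.pi + 0.5) * (H - 1)
        faces[name] = cv2.remap(erp, u.astype(np.float32), v.astype(np.float32),
                                interpolation=cv2.INTER_LINEAR, borderMode=cv2.BORDER_WRAP)
    return faces
```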
S1-3, after reading the equidistant 360-degree images in the data set, the corresponding cube projection images are generated with the image conversion module.
S2, a feature pyramid network is built, and feature extraction is carried out on the equidistant 360-degree image and the converted cube projection images to obtain the equidistant 360-degree image features and the cube projection features.
Specifically, as shown in fig. 2, the feature pyramid network is constructed as follows: an FPN is adopted as the backbone network, in which the bottom-up path is built on ResNet-50.
The ResNet-50-based feature pyramid network acquires features of the image at different levels, with the weights shared across all inputs.
The image data input to the feature pyramid network comprises the equidistant 360-degree image and the cube projection images; an equidistant 360-degree image and its corresponding cube projection images form one image sample. The feature pyramid network extracts features from the seven input images of each sample, namely the equidistant projection image and the six face images of the cube projection (up, down, left, right, front and back), obtaining the equidistant image features and the cube projection features.
it should be noted that, in this embodiment, model training is performed by using dual-type mixed data, a single sample includes one isometric projection image and six cube projection images, and the module needs to perform feature extraction on seven images, so that a set of seven sets of features is finally output.
It should be noted that the feature pyramid network constructed in this embodiment is used for feature extraction and can readily be reproduced by those skilled in the art; as shown in fig. 2, it consists of a ResNet-50 backbone and 4 convolution stages with strides of 4, 8, 16 and 32 respectively.
Further, the feature extraction method comprises the following steps:
the upper layer Resnet of each independent FPN feature extraction module in the feature pyramid network is used as a part of a feedforward backsbone, each level up performs down-sampling by step =2, output 2-5 levels of features participate in prediction, conv 2-5 output layers and a last residual block layer are used as features of the FPN and respectively correspond to down-sampling multiples of 4,8, 16 and 32 of an input picture, the lowest layer from top to bottom amplifies a rightmost small feature map to the same size as a left feature map of the lowest layer in an up-sampling mode, and finally, the lowest layer is fused with the upper layer features and then output layer by layer to obtain feature results F1-4 of each layer.
S3, as shown in fig. 3, four identical feature aggregation modules are adopted, so that a set of four groups of features is output. In each feature aggregation module, a feature conversion submodule (C2E feature conversion module) converts the cube projection features into equidistant features and combines them with the equidistant 360-degree image features, and an atrous convolution pooling pyramid submodule (ASPP submodule) then optimizes the combined features.
the conversion method of the feature conversion submodule comprises the following steps: and converting the 6 cube projection features into isometric projection features by utilizing the mapping relation between the isometric image features and the cube projection features.
It should be noted that the mapping relation between the cube projection features and the equidistant projection features is expressed as:

R_fi · p_i = q_i

θ_fi = arcsin( q_y / ||q_i|| )

φ_fi = arctan( q_x / q_z )

where θ_fi and φ_fi represent the latitude and longitude under the equidistant projection, and q_x, q_y, q_z are the x, y, z components of the coordinate q_i.
It should be noted that in this embodiment the feature conversion is performed by the C2E feature conversion module, which is a conventional technique and is therefore not described in detail here; refer to fig. 3.
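For illustration, a minimal sketch of a C2E feature conversion is given below; it assumes the inverse mapping has already been precomputed into a face index map and a normalized sampling grid (face_idx and grid, both hypothetical names) following the formulas above, and uses PyTorch's grid_sample to gather the values.

```python
import torch
import torch.nn.functional as F

def c2e_features(face_feats: torch.Tensor, face_idx: torch.Tensor, grid: torch.Tensor):
    """face_feats: (6, C, h, w) cube-face features of one sample.
    face_idx:   (H, W) long tensor giving the face each equidistant cell falls on.
    grid:       (H, W, 2) sampling coordinates on that face, normalized to [-1, 1].
    Returns an equidistant feature map of shape (1, C, H, W)."""
    C = face_feats.shape[1]
    H, W = face_idx.shape
    out = torch.zeros(1, C, H, W, device=face_feats.device)
    for f in range(6):
        mask = face_idx == f
        if not mask.any():
            continue
        # sample the whole equidistant grid from face f, then keep only the cells that belong to it
        sampled = F.grid_sample(face_feats[f:f + 1], grid.unsqueeze(0),
                                mode="bilinear", align_corners=False)
        out[:, :, mask] = sampled[:, :, mask]
    return out
```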
Further, as shown in fig. 4, the atrous convolution pooling pyramid sub-module optimizes the features as follows: the given input is sampled in parallel by atrous convolutions with different sampling rates, the resulting feature maps are concatenated, which expands the number of channels, and a 1 x 1 convolution then reduces the number of channels to the expected value; this is equivalent to capturing the context of the image at multiple scales.
It should be noted that in this embodiment the feature optimization is performed by the atrous convolution pooling pyramid sub-module (ASPP sub-module), which is a conventional technique; referring to fig. 4, it comprises a 1 x 1 convolution layer, three 3 x 3 atrous convolution layers with sampling rates of 6, 12 and 18 respectively, a pooling layer and an upsampling layer.
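A minimal sketch of such an ASPP sub-module is given below; the channel sizes and the exact branch composition are assumptions consistent with the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch: int = 256, out_ch: int = 256, rates=(6, 12, 18)):
        super().__init__()
        self.branch1x1 = nn.Conv2d(in_ch, out_ch, 1)
        self.branches3x3 = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.pool_conv = nn.Conv2d(in_ch, out_ch, 1)
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)  # squeeze channels back

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [self.branch1x1(x)] + [b(x) for b in self.branches3x3]  # parallel atrous branches
        pooled = F.interpolate(self.pool_conv(self.pool(x)), size=(h, w),
                               mode="bilinear", align_corners=False)    # upsample the pooled branch
        feats.append(pooled)
        return self.project(torch.cat(feats, dim=1))                    # multi-scale context
```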
S4, as shown in FIG. 5, the multi-level aggregated features are concatenated and fed to an attention integration module, which adaptively selects reliable spatial and channel information through spatial and channel attention mechanisms and fuses it with the multi-level aggregated features to obtain the final features and complete the salient object detection.
It should be noted that in this embodiment the feature fusion is performed by the attention integration module, which is a conventional technique; referring to fig. 5, it includes three 1 x 1 convolutional layers, one 3 x 3 convolutional layer, a spatial attention module and a channel attention module. The spatial attention module and the channel attention module are conventional in the art and are therefore not described in detail in this embodiment.
As shown in fig. 6, the spatial attention mechanism in this network first reduces the channel dimension of each branch, splices the results into a single feature map, and then uses a convolutional layer to learn the overall spatial attention, which is fed back to the four branches for integration. The channel attention mechanism applies max pooling and mean pooling to the overall four-branch feature map simultaneously, obtains a transformation result through a convolution layer, and finally applies this result to all channels to obtain the attention value of each channel.
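The following is a minimal sketch of such spatial and channel attention applied to the concatenated multi-level features; the kernel size, the reduction ratio and the way the two attentions are combined are assumptions.

```python
import torch
import torch.nn as nn

class SpatialChannelAttention(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)   # learns the overall spatial attention
        self.channel_conv = nn.Sequential(
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // 4, channels, 1),
        )

    def forward(self, x):                                  # x: concatenated multi-level features (B, C, H, W)
        # spatial attention: reduce the channel dimension (mean + max), then convolve
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.max(dim=1, keepdim=True).values], dim=1)
        s = torch.sigmoid(self.spatial_conv(s))            # (B, 1, H, W)
        # channel attention: global mean and max pooling, shared transform, then sum
        c_avg = self.channel_conv(x.mean(dim=(2, 3), keepdim=True))
        c_max = self.channel_conv(x.amax(dim=(2, 3), keepdim=True))
        c = torch.sigmoid(c_avg + c_max)                   # (B, C, 1, 1)
        return x * s * c                                   # re-weighted features fused with the input
```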
In this embodiment, the network model is built with the PyTorch framework, and the sum of the cross-entropy loss and the mean absolute error loss is used as the loss function. The weights of the feature extraction module are initialized from a ResNet-50 model pre-trained on ImageNet, and the weights of the newly added convolutional layers are initialized with the normal-distribution method proposed by Kaiming He. The model is trained end to end with the stochastic gradient descent (SGD) algorithm: the training batch size is set to 4, the momentum to 0.9, the weight decay to 0.0005, the initial learning rate to 0.002, and training runs for 40 epochs. The model generates a salient object prediction map for a 360-degree image; the prediction map is a grayscale map with pixel values from 0 to 1, where 1 indicates the region of a salient object and 0 indicates the background region.
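A minimal sketch of this training configuration is given below; the placeholder model and the dummy batch stand in for the full network and the 360-SOD training loader, and BCEWithLogitsLoss plus L1Loss is used to represent the cross-entropy and mean-absolute-error terms.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 1, 3, padding=1)        # placeholder standing in for the full network
criterion_bce = nn.BCEWithLogitsLoss()       # cross-entropy term (applied to logits)
criterion_mae = nn.L1Loss()                  # mean absolute error term
optimizer = torch.optim.SGD(model.parameters(), lr=0.002,
                            momentum=0.9, weight_decay=0.0005)

# dummy batch in place of the 360-SOD training loader (batch size 4)
images = torch.randn(4, 3, 512, 1024)
gt = torch.rand(4, 1, 512, 1024)

for epoch in range(40):                      # final training length: 40 epochs
    logits = model(images)                   # salient object prediction (logits)
    loss = criterion_bce(logits, gt) + criterion_mae(torch.sigmoid(logits), gt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```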
As can be seen from fig. 7, this embodiment improves on existing conventional image salient object detection methods so that detection can be adapted to equidistant 360-degree images, obtaining a better detection result. The network consists of four major modules: a data processing module (the E2C image conversion module) and three network structure modules (the feature pyramid network, the feature aggregation modules and the attention integration module). The image conversion module converts the equidistant 360-degree image into the cube projection images and thereby constructs the dual-type input data used by the network; taking the dual-type data as input avoids the spherical-to-plane projection distortion caused by a single equidistant image input. The FPN feature extraction module extracts multi-level features from the various input data with shared weights, the feature aggregation modules integrate and optimize the multi-level features, and the attention integration module performs the final reliability-weighted selection and screening to obtain a high-quality saliency map. The result is a grayscale image with pixel values in [0, 1], where 1 marks the area of the salient object in the 360-degree image and 0 marks the background area, successfully realizing the salient object detection task for 360-degree images.
The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions and alterations can be made in these embodiments, including components thereof, without departing from the principles and spirit of the invention, and still fall within the scope of the invention.

Claims (9)

1. A 360-degree image salient object detection method based on a convolutional neural network, characterized by comprising the following steps:
S1, image conversion:
S1-1, creating a data set of equidistant 360-degree images;
S1-2, establishing an image conversion module;
S1-3, after reading the equidistant 360-degree images in the data set, generating the corresponding cube projection images with the image conversion module;
S2, constructing a feature pyramid network and extracting features from the equidistant 360-degree image and the converted cube projection image to obtain equidistant 360-degree image features and cube projection features;
S3, adopting four identical feature aggregation modules, in each of which a feature conversion submodule converts the cube projection features into equidistant features and combines them with the equidistant 360-degree image features, after which an atrous convolution pooling pyramid submodule optimizes the combined features, yielding multi-level aggregated features;
S4, concatenating the multi-level aggregated features and feeding them to an attention integration module, which adaptively selects reliable spatial and channel information through spatial and channel attention mechanisms and fuses it with the multi-level aggregated features to obtain the final features and complete the salient object detection.
2. The convolutional neural network-based 360-degree image salient object detection method as claimed in claim 1, wherein in step S1-2 the corresponding cube projection image is generated from the equidistant 360-degree image using the mapping relationship between the equidistant projection and the cube projection.
3. The convolutional neural network-based 360-degree image salient object detection method as claimed in claim 2, wherein the mapping relationship between the equidistant projection and the cube projection is expressed as follows:
q_i = R_fi · p_i

θ_fi = arcsin( q_y / ||q_i|| )

φ_fi = arctan( q_x / q_z )

wherein θ_fi and φ_fi represent the latitude and longitude under the equidistant projection; q_x, q_y, q_z are the x, y, z components of the coordinate q_i; R_fi represents a rotation matrix; f_i is a known imaging plane; p_i is a point on the imaging plane f_i, and x, y, z are the coordinates of p_i, with 0 ≤ x, y ≤ w−1 and z = w/2; and w is the side length of the cube projection image.
4. The convolutional neural network-based 360-degree image salient object detection method of claim 1, wherein the image data input to the feature pyramid network comprises equidistant 360-degree images and cube projection images, and an equidistant 360-degree image together with its corresponding cube projection images forms an image sample.
5. The 360-degree image salient object detection method based on the convolutional neural network as claimed in claim 4, wherein the method for constructing the feature pyramid network is as follows: FPN is adopted as a backbone network, wherein a bottom-up path is built based on Resnet-50.
6. The convolutional neural network-based 360-degree image salient object detection method of claim 5, wherein in the step S2, the feature extraction method comprises the following steps:
the feature pyramid network extracts features from the seven input images of each image sample, namely the equidistant projection image and the six face images of the cube projection (up, down, left, right, front and back), obtaining the equidistant image features and the cube projection features;
in each independent FPN feature extraction module, the ResNet serves as the feed-forward backbone: each stage down-samples with a stride of 2, the features of stages 2-5 are used for prediction, and the outputs of conv2-conv5 (the last residual block of each stage) are used as the FPN features, corresponding to down-sampling factors of 4, 8, 16 and 32 of the input picture; in the top-down path the smallest feature map is enlarged by up-sampling to the size of the lateral feature map of the level below, fused with that level's features, and output level by level to obtain the feature results F1-F4 of each level.
7. The convolutional neural network-based 360-degree image salient object detection method of claim 1, wherein in the step S3, a set of four groups of features is output through four identical feature aggregation modules.
8. The convolutional neural network-based 360-degree image salient object detection method of claim 6, wherein the feature conversion submodule converts the 6 cube projection features into equidistant projection features using the mapping relation between the cube projection features and the equidistant image features.
9. The convolutional neural network-based 360-degree image salient object detection method as claimed in claim 8, wherein the atrous convolution pooling pyramid sub-module optimizes the features as follows: the given input is sampled in parallel by atrous convolutions with different sampling rates, the resulting feature maps are concatenated, which expands the number of channels, and a 1 x 1 convolution then reduces the number of channels to the expected value, which is equivalent to capturing the context of the image at multiple scales.
CN202210586991.0A 2022-05-27 2022-05-27 360-degree image salient object detection method based on convolutional neural network Active CN114898120B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210586991.0A CN114898120B (en) 2022-05-27 2022-05-27 360-degree image salient object detection method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210586991.0A CN114898120B (en) 2022-05-27 2022-05-27 360-degree image salient object detection method based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN114898120A CN114898120A (en) 2022-08-12
CN114898120B true CN114898120B (en) 2023-04-07

Family

ID=82725996

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210586991.0A Active CN114898120B (en) 2022-05-27 2022-05-27 360-degree image salient object detection method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN114898120B (en)

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110827193B (en) * 2019-10-21 2023-05-09 国家广播电视总局广播电视规划院 Panoramic video significance detection method based on multichannel characteristics
CN111178163B (en) * 2019-12-12 2023-06-09 宁波大学 Stereoscopic panoramic image salient region prediction method based on cube projection format
CN112381813B (en) * 2020-11-25 2023-09-05 华南理工大学 Panoramic view visual saliency detection method based on graph convolution neural network
AU2020103901A4 (en) * 2020-12-04 2021-02-11 Chongqing Normal University Image Semantic Segmentation Method Based on Deep Full Convolutional Network and Conditional Random Field
CN113536977B (en) * 2021-06-28 2023-08-18 杭州电子科技大学 360-degree panoramic image-oriented saliency target detection method
CN114359680A (en) * 2021-12-17 2022-04-15 中国人民解放军海军工程大学 Panoramic vision water surface target detection method based on deep learning

Also Published As

Publication number Publication date
CN114898120A (en) 2022-08-12

Similar Documents

Publication Publication Date Title
CN114004941B (en) Indoor scene three-dimensional reconstruction system and method based on nerve radiation field
CN112894832B (en) Three-dimensional modeling method, three-dimensional modeling device, electronic equipment and storage medium
CN107945282B (en) Rapid multi-view three-dimensional synthesis and display method and device based on countermeasure network
CN112330795B (en) Human body three-dimensional reconstruction method and system based on single RGBD image
WO2023280038A1 (en) Method for constructing three-dimensional real-scene model, and related apparatus
WO2022151661A1 (en) Three-dimensional reconstruction method and apparatus, device and storage medium
US11533431B2 (en) Method and device for generating a panoramic image
EP3407248B1 (en) An apparatus, a method and a computer program for video coding and decoding
CN114332385A (en) Monocular camera target detection and spatial positioning method based on three-dimensional virtual geographic scene
CN115690382B (en) Training method of deep learning model, and method and device for generating panorama
CN111340866A (en) Depth image generation method, device and storage medium
CN116051747A (en) House three-dimensional model reconstruction method, device and medium based on missing point cloud data
CN109788270B (en) 3D-360-degree panoramic image generation method and device
CN110580720A (en) camera pose estimation method based on panorama
CN116778288A (en) Multi-mode fusion target detection system and method
CN113902802A (en) Visual positioning method and related device, electronic equipment and storage medium
CN111028273A (en) Light field depth estimation method based on multi-stream convolution neural network and implementation system thereof
CN115527016A (en) Three-dimensional GIS video fusion registration method, system, medium, equipment and terminal
CN117197388A (en) Live-action three-dimensional virtual reality scene construction method and system based on generation of antagonistic neural network and oblique photography
US10354399B2 (en) Multi-view back-projection to a light-field
CN114092540A (en) Attention mechanism-based light field depth estimation method and computer readable medium
CN113076953A (en) Black car detection method, system, device and storage medium
CN114898120B (en) 360-degree image salient object detection method based on convolutional neural network
CN116843754A (en) Visual positioning method and system based on multi-feature fusion
CN114663599A (en) Human body surface reconstruction method and system based on multiple views

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant