CN115457275A - Visual field perception panoramic image salient object segmentation method and device - Google Patents

Visual field perception panoramic image salient object segmentation method and device

Info

Publication number
CN115457275A
Authority
CN
China
Prior art keywords
sub
visual field
stage
transformation
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211130460.7A
Other languages
Chinese (zh)
Inventor
李甲
吴俊杰
于天舒
赵沁平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University
Priority claimed from CN202211130460.7A
Publication of CN115457275A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present disclosure disclose a visual field perception method and device for segmenting salient objects in panoramic images. One embodiment of the method comprises: acquiring a panoramic image; performing projection processing on the panoramic image to obtain an equidistant cylindrical projection image and analyzing the equidistant cylindrical projection image; and acquiring a preset visual field perception convolutional neural network segmentation model, and segmenting the equidistant cylindrical projection image according to the preset visual field perception convolutional neural network segmentation model to obtain a salient object segmentation result image. The embodiment improves the reliability of segmenting salient objects in panoramic images.

Description

Visual field perception panoramic image salient object segmentation method and device
Technical Field
Embodiments of the present disclosure relate to computer vision technology, and in particular to a method and a device for segmenting salient objects in panoramic images based on visual field perception.
Background
A 360-degree omnidirectional panoramic image can present scene information over a field of view of 360 degrees in the horizontal direction and 180 degrees in the vertical direction. Compared with a traditional image, a panoramic image usually contains richer scene content, so panoramic images are increasingly used in real life. Salient object segmentation automatically processes the parts of an image that are of interest while ignoring the parts that are not, and is an important fundamental problem in computer vision. Research on salient object segmentation for 360-degree omnidirectional panoramic images is therefore of great significance for the compression, transmission, and analysis of 360-degree panoramic images.
At present, most salient object segmentation methods have been developed for traditional two-dimensional planar images, where they achieve good results. However, when these methods, which perform well on conventional planar images, are applied to salient object segmentation of 360-degree omnidirectional panoramic images, the results are unsatisfactory. In addition, existing salient object segmentation methods dedicated to 360-degree omnidirectional panoramic images are few and of limited reliability. For example, one method designs a distortion-adaptive module that cuts the equidistant cylindrical projection image into four equally sized image blocks to learn different feature kernels, and designs a multi-scale module to integrate context features, thereby improving panoramic salient object segmentation performance; another processes equidistant cylindrical projection maps with a multi-stage salient image segmentation approach that uses less-distorted perspective views and object-level semantic saliency ranking. These prior approaches to panoramic image salient object segmentation focus mainly on mitigating the distortion problem rather than on accommodating it, and they ignore one of the advantageous characteristics of panoramic images, namely the continuous and complete panoramic field of view. Whether the equidistant cylindrical projection image is cut into blocks or only perspective views with partial fields of view are used, the complete panoramic field of view is lost and additional boundary discontinuities may be introduced.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Some embodiments of the present disclosure provide a method and an apparatus for segmenting salient objects in a panoramic image with view perception, so as to solve one or more of the technical problems mentioned in the above background.
Some embodiments of the present disclosure provide a method for segmenting a salient object in a panoramic image based on field-of-view perception, the method including: acquiring a panoramic image; projecting the panoramic image to obtain equidistant cylindrical projection images and analyzing the equidistant cylindrical projection images; and acquiring a preset visual field perception convolutional neural network segmentation model, and processing the equidistant cylindrical projection image according to the preset visual field perception convolutional neural network segmentation model to obtain a significant object segmentation result image.
One of the various embodiments of the present disclosure described above has the following beneficial effects. First, a panoramic image is acquired. The panoramic image is a 360-degree omnidirectional image that can present scene information over a 360-degree horizontal and 180-degree vertical field of view, and it serves as the input to be processed in this scheme. Then, equidistant cylindrical projection processing is performed on the panoramic image to obtain an equidistant cylindrical projection image; that is, the acquired panoramic image is projected onto a two-dimensional plane. Finally, a preset visual field perception convolutional neural network segmentation model is acquired, and the equidistant cylindrical projection image is processed according to this model to obtain a salient object segmentation result image. The scheme converts the panoramic image into a two-dimensional equidistant cylindrical projection image, which is convenient for subsequent processing by a convolutional neural network model while retaining the complete panoramic field of view, and is thus beneficial for segmenting salient objects in the panoramic image. The preset visual field perception convolutional neural network segmentation model may be a trained convolutional neural network; the sample adaptive visual field transformation module it contains performs focused learning of features under different fields of view through horizontal, vertical and scaling visual field transformations, which enhances the model's adaptability to objects affected by distortion, boundary effects or variable scale, and thereby improves the reliability of salient object segmentation for panoramic images.
Drawings
The above and other features, advantages and aspects of various embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements are not necessarily drawn to scale.
Fig. 1 is a flow diagram of some embodiments of a field-of-view aware panoramic image salient object segmentation method according to the present disclosure;
FIG. 2 is a schematic illustration of a field of view perceived panoramic image salient object before and after segmentation in accordance with the present disclosure;
FIG. 3 is a block diagram of an overall framework of a view-aware convolutional neural network segmentation model of a view-aware panoramic image salient object segmentation method according to the present disclosure;
FIG. 4 is an overall frame schematic of the horizontal field of view transform submodule and the scaled field of view transform submodule in the field of view transform submodule according to the present disclosure;
fig. 5 is a schematic structural diagram of some embodiments of a device for segmenting salient objects in a panoramic image with view perception according to the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings. The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It should be noted that the modifiers "a", "an" and "the" in this disclosure are intended to be illustrative rather than limiting, and those skilled in the art should understand them as meaning "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 is a flow diagram 100 of some embodiments of a field-of-view aware panoramic image salient object segmentation method according to the present disclosure. The method for segmenting the salient objects of the panoramic image perceived by the visual field comprises the following steps 101 to 103:
step 101, acquiring a panoramic image.
In some embodiments, the execution subject of the visual field perception panoramic image salient object segmentation method may acquire the panoramic image through a wired connection or a wireless connection. The panoramic image may be a 360-degree omnidirectional image that can present scene information over a 360-degree horizontal and 180-degree vertical field of view, and it is closer to the three-dimensional scene that people experience in a real environment.
As an example, the panoramic image may refer to the schematic diagrams, shown in fig. 2, of a visual field perception panoramic image before and after salient object segmentation. The left diagram in fig. 2 may represent the panoramic image described above. Owing to its large field of view, a panoramic image usually contains more scene information, but its salient object segmentation also faces more problems, such as boundary discontinuities caused by projection, distortion that varies with position, and salient objects of variable scale, all of which affect the reliability of salient object segmentation for panoramic images.
Step 102, performing projection processing on the panoramic image to obtain an equidistant cylindrical projection image and analyzing the equidistant cylindrical projection image.
In some embodiments, the execution subject may perform projection processing on the panoramic image to obtain equidistant cylindrical projection images and analyze the equidistant cylindrical projection images. In order to process the panoramic image by using a mature two-dimensional convolutional neural network, equidistant cylindrical projection can be performed on the panoramic image in an original three-dimensional spherical form, and the panoramic image is projected onto a two-dimensional plane to obtain a rectangular equidistant cylindrical projection image with the length-width ratio of 2. In addition, for the convenience of algorithm research, the characteristics of the equidistant cylindrical projection images can be analyzed. The equidistant cylindrical projection image may be characterized by a degree of distortion of the salient region.
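For reference, the following is a minimal sketch in Python/NumPy (not part of the original filing) of the pixel-to-sphere mapping underlying an equidistant cylindrical (equirectangular) projection with a 2:1 aspect ratio; the function names are illustrative.

```python
import numpy as np

def equirect_pixel_to_sphere(x, y, width, height):
    """Map pixel coordinates of a 2:1 equirectangular image to
    spherical longitude/latitude in radians."""
    lon = (x + 0.5) / width * 2.0 * np.pi - np.pi      # longitude in [-pi, pi)
    lat = np.pi / 2.0 - (y + 0.5) / height * np.pi     # latitude in (-pi/2, pi/2)
    return lon, lat

def sphere_to_unit_vector(lon, lat):
    """Convert longitude/latitude to a 3-D unit vector on the viewing sphere."""
    return np.stack([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)], axis=-1)

# Example: the centre pixel of a 1024x512 equirectangular image looks
# almost exactly along the forward direction of the viewing sphere.
lon, lat = equirect_pixel_to_sphere(512, 256, 1024, 512)
print(lon, lat, sphere_to_unit_vector(lon, lat))
```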
Optionally, the executing body performs projection processing on the panoramic image to obtain an equidistant cylindrical projection image and analyzes the equidistant cylindrical projection image, and may further perform the following steps:
and analyzing the distortion degree of the salient region in the equidistant cylindrical projection image according to the corresponding mark of the equidistant cylindrical projection image to obtain the distortion degree of the salient region of the equidistant cylindrical projection image. Wherein, the distortion degree of the significant region can be determined by the following formula:
Figure BDA0003850089350000051
wherein D represents the significant region distortion degree. j represents the pixel row index number in the pixel image included in the equidistant cylindrical projection image. y represents a vertical coordinate of a pixel in the pixel image in the planar coordinate system. y is j And a pixel ordinate of the j-th row of pixels in the pixel image in the plane coordinate system.
Figure BDA0003850089350000052
And a pixel spherical coordinate in the projection spherical coordinate system corresponding to a vertical coordinate in the plane coordinate system of a pixel in the pixel image.
Figure BDA0003850089350000053
Represents the projection spherical coordinate systemAnd a pixel spherical coordinate corresponding to a pixel ordinate of the j-th row of pixels in the pixel image in the plane coordinate system. Q represents a corresponding coordinate pair set consisting of pixel coordinates of pixels in the pixel image in the planar coordinate system and pixel spherical coordinates corresponding to the pixel coordinates. h represents the height of the equidistant cylindrical projection image. E represents a projection operator from the plane coordinate system to the projection spherical coordinate system. N represents the number of corresponding coordinate pairs in the set of corresponding coordinate pairs.
Figure BDA0003850089350000055
And the significant region corresponding to the pixel ordinate of the jth row of pixels in the pixel image under the plane coordinate system is shown.
Figure BDA0003850089350000054
And a region on the spherical image corresponding to the pixel spherical coordinate of the pixel ordinate of the jth row of pixels in the pixel image in the planar coordinate system in the spherical coordinate system in the equidistant cylindrical projection image.
And 103, acquiring a preset visual field perception convolutional neural network segmentation model, and segmenting the equidistant cylindrical projection image according to the preset visual field perception convolutional neural network segmentation model to obtain a significant object segmentation result image.
In some embodiments, the execution subject may obtain a preset field-of-view perceptual convolutional neural network segmentation model, and perform segmentation processing on the equidistant cylindrical projection image according to the preset field-of-view perceptual convolutional neural network segmentation model to obtain a salient object segmentation result image. The preset view perception convolutional neural network segmentation model can be a trained convolutional neural network.
As an example, the above-mentioned salient object segmentation result image may refer to schematic diagrams before and after the salient object segmentation of the view-sensing panoramic image shown in fig. 2. The right-hand diagram in fig. 2 may represent the above-described salient object segmentation result image. The preset view-sensing convolutional neural network segmentation model can refer to the overall frame schematic diagram of the view-sensing convolutional neural network segmentation model shown in fig. 3. The visual field perception convolutional neural network segmentation model comprises a basic feature extraction module, a channel adaptation module, a feature fusion module, an output prediction module and a sample self-adaptive visual field transformation module. The sample self-adaptive visual field conversion module comprises a sample self-adaptive submodule, a scaling visual field conversion submodule, a vertical visual field conversion submodule, a horizontal visual field conversion submodule, a visual field holding submodule and convolution operation of preset times. The steps in fig. 3 may correspond to the steps of performing the segmentation processing on the equidistant cylindrical projection image according to the preset view perception convolutional neural network segmentation model.
In some optional implementations of some embodiments, the execution subject acquiring a preset visual field perception convolutional neural network segmentation model and performing segmentation processing on the equidistant cylindrical projection image according to the preset visual field perception convolutional neural network segmentation model to obtain a salient object segmentation result image may include the following steps:
firstly, extracting, adapting and modulating a plurality of stage features of the equidistant cylindrical projection image to obtain a plurality of stage modulation sub-features.
In some optional implementations of some embodiments, the performing multiple-stage feature extraction, adaptation and modulation on the equidistant cylindrical projection image to obtain multiple-stage modulation sub-features may include the following sub-steps:
and the first substep, extracting a plurality of stage features from the equidistant cylindrical projection image according to a basic feature extraction module in the preset visual field perception convolutional neural network segmentation model to obtain a plurality of stage basic sub-features.
In some optional implementations of some embodiments, the execution subject may sequentially perform five stages of feature extraction on the equidistant cylindrical projection image according to the basic feature extraction module in the preset visual field perception convolutional neural network segmentation model, obtaining the plurality of stage basic sub-features corresponding to stages two, three, four and five respectively. The basic sub-feature of stage one, obtained by the first feature extraction on the equidistant cylindrical projection image, is not included in the plurality of stage basic sub-features because it undergoes no subsequent processing. The basic feature extraction module may be a feature extraction backbone network.
As an example, the feature extraction backbone network may be a 50-layer residual network (ResNet), which may include a plurality of convolutional layers and downsampling layers. When the size of the equidistant cylindrical projection image is expressed as H × W × 3, the size of the stage basic sub-feature can be expressed by the formula given in the original filing as Figure BDA0003850089350000071, where H denotes the height of the equidistant cylindrical projection image, W its width, k the stage index, C the number of channels, and C_k the number of channels of the k-th stage.
The number of downsampling layers may be five, and the plurality of stages may be the last four stages; that is, the preset basic feature extraction module sequentially performs five stages of feature extraction on the equidistant cylindrical projection image according to the five downsampling layers, and the basic sub-features of the second, third, fourth and fifth stages are used for subsequent model processing.
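As an illustration of the staged backbone features described above, here is a minimal PyTorch sketch that extracts the stage-two to stage-five basic sub-features from a torchvision ResNet-50; the class and attribute names are assumptions for illustration and not taken from the patent.

```python
import torch
import torchvision

class BasicFeatureExtractor(torch.nn.Module):
    """Returns the stage-2..5 feature maps of a ResNet-50 backbone."""
    def __init__(self):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        # Stage one: the stem (conv + BN + ReLU + max-pool); its output is only
        # passed onwards and is not kept as a basic sub-feature.
        self.stem = torch.nn.Sequential(backbone.conv1, backbone.bn1,
                                        backbone.relu, backbone.maxpool)
        self.stage2 = backbone.layer1   # 256 channels
        self.stage3 = backbone.layer2   # 512 channels
        self.stage4 = backbone.layer3   # 1024 channels
        self.stage5 = backbone.layer4   # 2048 channels

    def forward(self, x):
        x = self.stem(x)
        f2 = self.stage2(x)
        f3 = self.stage3(f2)
        f4 = self.stage4(f3)
        f5 = self.stage5(f4)
        return f2, f3, f4, f5

# Example: a 512x1024 equirectangular image (aspect ratio 2:1).
features = BasicFeatureExtractor()(torch.randn(1, 3, 512, 1024))
print([f.shape for f in features])
```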
And a second substep, performing adaptation processing on the multiple stage basic sub-features according to a channel adaptation module in the preset visual field perception convolutional neural network segmentation model to obtain multiple stage adaptation sub-features.
In some optional implementations of some embodiments, the execution subject may perform two convolution operations on the basic sub-feature of each stage to obtain the plurality of stage adaptor sub-features. The spatial sizes of the convolution kernels of the two operations are 3 × 3 and 1 × 1 in turn, and the numbers of convolution kernels of the two operations decrease in turn. The number of convolution kernels of the convolutional layer corresponding to the last convolution operation may be 64, and the convolutional layer corresponding to each convolution operation is followed by a BN (batch normalization) layer and an activation function layer. The number of convolutional layers, the kernel size of each convolutional layer and the number of kernels can be adjusted according to network requirements. In addition, different types of BN layers and activation function layers may also be selected as needed.
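A minimal PyTorch sketch of the channel adaptation idea just described (a 3 × 3 convolution followed by a 1 × 1 convolution, each with BN and an activation, ending at 64 channels); the intermediate width of 128 and the choice of ReLU are assumptions for illustration.

```python
import torch.nn as nn

def conv_bn_act(in_ch, out_ch, kernel_size):
    """Convolution followed by batch normalization and an activation."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class ChannelAdaptation(nn.Module):
    """Adapts one stage's basic sub-feature to a fixed, smaller channel width."""
    def __init__(self, in_channels, mid_channels=128, out_channels=64):
        super().__init__()
        self.block = nn.Sequential(
            conv_bn_act(in_channels, mid_channels, 3),   # 3x3 kernels
            conv_bn_act(mid_channels, out_channels, 1),  # 1x1 kernels, fewer of them
        )

    def forward(self, x):
        return self.block(x)
```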
And a third substep, performing modulation processing on the plurality of stage adaptor characteristics according to a characteristic fusion module in the preset visual field perception convolutional neural network segmentation model to obtain a plurality of stage modulation sub-characteristics.
In some optional implementation manners of some embodiments, the modulating the multiple phase adaptor features according to the feature fusion module in the preset visual field perception convolutional neural network segmentation model to obtain multiple phase modulation sub-features may include the following steps:
firstly, according to the feature fusion module of the current stage, modulating each stage adapter feature of the plurality of stage adapter features and the enhancer feature of the higher stage adjacent to the current stage to obtain the modulation sub-feature of the current stage.
Then, in response to determining that the enhancer feature of the above-mentioned higher-level stage adjacent to the current stage does not exist, only the adaptor feature of the current stage is used as an input to the above-mentioned feature fusion module. The feature fusion module comprises convolution operation with preset times. And the input of the feature fusion module corresponding to the fifth stage is the adaptor feature of the fifth stage. The input of the feature fusion module corresponding to the fourth stage is the adaptor feature of the fourth stage and the enhancer feature of the fifth stage. The inputs of the feature fusion module corresponding to the third stage are the adaptor feature of the third stage and the enhancer feature of the fourth stage. The input of the feature fusion module corresponding to the second stage is the adaptor feature of the second stage and the enhancer feature of the third stage.
As an example, the above feature fusion process can be expressed by the formula given in the original filing as Figure BDA0003850089350000081, in which: F denotes a feature and k the stage index; Figure BDA0003850089350000082 denotes the adaptor sub-feature of the k-th stage; Figure BDA0003850089350000083 denotes the modulation sub-feature of the k-th stage; Figure BDA0003850089350000084 denotes the enhancer feature of stage k+1; Figure BDA0003850089350000085 denotes the intermediate feature of the k-th stage; U denotes the upsampling function; C denotes the channel-dimension stacking function; CONV_1 denotes a convolutional layer whose kernels each have spatial size 1 × 1; CONV_33 denotes two convolutional layers whose kernels each have spatial size 3 × 3; and * denotes the convolution operation. The size of the intermediate feature can be expressed by the formula given in the original filing as Figure BDA0003850089350000086, where H denotes the height of the equidistant cylindrical projection image and W its width.
Each convolutional layer is followed by a BN layer and an activation function layer. In the above feature fusion process, the summation of features of the same size may be an element-wise summation. The number of convolutional layers, the kernel size of each convolutional layer and the number of kernels can be adjusted according to network requirements. In addition, different types of BN layers and activation function layers may also be selected as needed.
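A minimal PyTorch sketch of the top-down fusion just described: the adjacent higher stage's enhancer feature is upsampled, stacked with the current stage's adaptor sub-feature along the channel dimension, and passed through two 3 × 3 convolutions and one 1 × 1 convolution; at stage five only the adaptor sub-feature is used. The exact ordering of operations and the bilinear upsampling are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_act(in_ch, out_ch, k):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, padding=k // 2, bias=False),
                         nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

class FeatureFusion(nn.Module):
    """Fuses a stage's adaptor sub-feature with the adjacent higher stage's
    enhancer feature (when it exists) into the stage's modulation sub-feature."""
    def __init__(self, channels=64, has_higher_stage=True):
        super().__init__()
        in_ch = 2 * channels if has_higher_stage else channels
        self.conv33 = nn.Sequential(conv_bn_act(in_ch, channels, 3),
                                    conv_bn_act(channels, channels, 3))
        self.conv1 = conv_bn_act(channels, channels, 1)

    def forward(self, adaptor_feat, higher_enhancer_feat=None):
        if higher_enhancer_feat is None:                 # stage five: no higher stage
            x = adaptor_feat
        else:
            up = F.interpolate(higher_enhancer_feat, size=adaptor_feat.shape[-2:],
                               mode='bilinear', align_corners=False)
            x = torch.cat([adaptor_feat, up], dim=1)     # channel-dimension stacking
        mid = self.conv33(x)                             # intermediate feature
        return self.conv1(mid)                           # modulation sub-feature
```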
And secondly, performing enhancement processing on the plurality of stage modulation sub-features to obtain a plurality of stage enhancer features. In this way, the observation behavior of human eyes on a panoramic image is applied to the neural network's feature learning on the equidistant cylindrical projection image.
In some optional implementations of some embodiments, the performing a body to perform an enhancement process on the plurality of stage modulation sub-features to obtain a plurality of stage enhancement sub-features may include: and performing enhancement processing on the plurality of stage modulation sub-features according to a sample self-adaptive visual field transformation module in the preset visual field perception convolutional neural network segmentation model to obtain a plurality of stage enhancement sub-features.
In some optional implementations of some embodiments, the performing a body performs enhancement processing on the plurality of stage modulation sub-features according to a sample adaptive visual field transformation module in the preset visual field perceptual convolutional neural network segmentation model to obtain a plurality of stage enhancement sub-features, and the method may include:
the first sub-step, according to the visual field holding submodule in the sample adaptive visual field transformation module, carries on visual field holding process to the said multiple stage modulation sub-characteristics, obtains multiple stage visual field holding sub-characteristics.
In some optional implementations of some embodiments, the execution subject may perform convolution processing a preset number of times on each stage modulation sub-feature of the plurality of stage modulation sub-features according to the visual field holding submodule in the sample adaptive visual field transformation module, so as to obtain the plurality of stage visual field holding sub-features. The convolution processing performed the preset number of times on each stage modulation sub-feature may be a combination of a preset number of convolutional layers with kernels of spatial sizes 3 × 3 and 1 × 1, and serves as further learning of the modulation sub-features.
And a second sub-step of performing field-of-view conversion processing on the plurality of stage modulation sub-features according to a field-of-view conversion sub-module in the sample adaptive field-of-view conversion module to obtain a plurality of stage field-of-view conversion sub-features.
In some optional implementation manners of some embodiments, the performing subject performs field-of-view transformation processing on the multiple phase modulation sub-features according to a field-of-view transformation submodule in the sample adaptive field-of-view transformation module to obtain multiple phase field-of-view transformation sub-features, and the method may include the following steps:
first, according to the horizontal visual field transformation submodule in the visual field transformation submodule, horizontal visual field transformation processing is performed on each of the plurality of stage modulation sub-features, and a plurality of stage horizontal visual field transformation sub-features are obtained.
And secondly, according to a vertical visual field transformation submodule in the visual field transformation submodule, performing vertical visual field transformation processing on each stage modulation sub-feature of the plurality of stage modulation sub-features to obtain a plurality of stage vertical visual field transformation sub-features.
And then, according to a zooming visual field transformation submodule in the visual field transformation submodule, carrying out zooming visual field transformation processing on each stage modulation sub-feature of the plurality of stage modulation sub-features to obtain a plurality of stage zooming visual field transformation sub-features.
And finally, obtaining each stage view field transformation sub-feature according to each stage horizontal view field transformation sub-feature of the plurality of stage horizontal view field transformation sub-features, each stage vertical view field transformation sub-feature of the plurality of stage vertical view field transformation sub-features and each stage scaling view field transformation sub-feature of the plurality of stage scaling view field transformation sub-features. The observation behavior of the panoramic image by the human eyes can be divided into horizontal left-right observation behavior in the horizontal direction, vertical up-down observation behavior in the vertical direction, and front-back far-near observation behavior, and the horizontal left-right observation behavior in the horizontal direction, the vertical up-down observation behavior in the vertical direction, and the front-back far-near observation behavior are applied to the feature learning of the equidistant cylindrical projection image by the neural network.
The transformation parts of the horizontal visual field transformation submodule, the vertical visual field transformation submodule and the scaling visual field transformation submodule each comprise several parallel sub-branches, and these parallel sub-branches share the same form but have different parameters.
Further, although the specific transformation functions of the horizontal visual field transformation process, the vertical visual field transformation process, and the scaled visual field transformation process are different from each other, the main operation flow is the same, and the main operation flow includes: the method comprises the steps of view forward transformation processing, view forward transformation feature learning, view inverse transformation processing, view inverse transformation feature learning and view transformation sub-branch feature fusion.
In some optional implementations of some embodiments, the main operation flow may include:
First, the view forward transformation processing function of the view forward transformation processing is given by the following formula:
P′_e = T(SP⁻¹(f(SP(T⁻¹(P_e))))) = F(P_e),
where F denotes the view forward transformation processing function; P_e denotes the projection point coordinate of the modulation sub-feature in the planar coordinate system; P′_e denotes the transformed projection point coordinate, in the planar coordinate system corresponding to the equidistant cylindrical projection image, after the forward visual field transformation; T denotes the spherical coordinate transformation function from a projected spherical coordinate value in the projection spherical coordinate system corresponding to the equidistant cylindrical projection image to a planar coordinate value in the planar coordinate system corresponding to the equidistant cylindrical projection image; T⁻¹ denotes the inverse function of T; SP denotes the spherical polar-plane (stereographic) projection function that projects the projected spherical coordinates in the projection spherical coordinate system onto the projected complex-plane coordinates in the projected complex-plane coordinate system corresponding to the equidistant cylindrical projection image; SP⁻¹ denotes the inverse function of SP; and f denotes a Mobius transformation function satisfying preset conditions, with the horizontal visual field transformation processing, the vertical visual field transformation processing and the scaling visual field transformation processing realized under different preset conditions.
Next, the specific form of the Mobius transformation function satisfying the preset condition is given by the following formula:
f(z) = (az + b) / (cz + d),
where f acts on the projected complex plane corresponding to the equidistant cylindrical projection image (denoted by the symbol shown as Figure BDA0003850089350000112 in the original filing); a, b, c and d each denote a constant complex number in the projected complex plane; and z denotes the projection complex variable.
The view inverse transformation processing function of the view inverse transformation processing is the inverse function F⁻¹ of the view forward transformation function. The view forward transform feature learning, the view inverse transform feature learning and the view transform sub-branch feature fusion include convolution operations.
Wherein, the difference between the rotation angles of the branches in the horizontal visual field transformation submodule may be set to 30 degrees. The vertical field of view transform submodule may have two sub-branches, and the rotation angles of the two sub-branches may be 30 degrees and-30 degrees, respectively.
Optionally, the zoom field-of-view transformation submodule may combine a parameter of the zoom center with the zoom scale factor parameter.
By way of example, if the spatial size of the module input features is h × w, the zoom center coordinates may be those given in the original filing as Figure BDA0003850089350000113 and Figure BDA0003850089350000114, and the set of scaling factors may be {0.8, 1.2, 0.7, 1.3}.
Optionally, the rotation angle parameters of each branch in the horizontal and vertical visual field transformation submodules, as well as the center parameter and the scale factors in the scaling visual field transformation submodule, may be adjusted according to actual task requirements. The sample adaptive visual field transformation module described above may also be used as a feature enhancement module for other panoramic feature enhancement processing.
In some optional implementations of some embodiments, the f represents a mobius transform function that satisfies a preset condition, and implementing the horizontal field of view transform processing, the vertical field of view transform processing, and the scaled field of view transform processing under different constraint conditions may include:
first, a mobius transform function used for the horizontal field of view transform processing and the vertical field of view transform processing is determined as a first mobius transform function. Wherein, the specific form of the first Mobius transformation function is shown as the following formula:
Figure BDA0003850089350000121
wherein the content of the first and second substances,
Figure BDA0003850089350000122
represents the complex conjugate of b.
Figure BDA0003850089350000123
Denotes the complex conjugate of a.
Secondly, in response to determining that the rotation angle about a projected spherical vector passing through the origin in the projected spherical coordinate system is within a preset range, the specific forms of the complex numbers a and b in the first Mobius transformation function are given in the original filing as Figure BDA0003850089350000124, where L denotes the projected spherical vector passing through the origin in the projected spherical coordinate system, l, m and n denote its first, second and third coordinates respectively, i denotes the imaginary unit, and θ denotes the above rotation angle.
Then, in response to determining that L = (0, 0, 1), the above first Mobius transformation function is determined as the horizontal visual field transformation processing function.
Then, in response to determination of L = (0, 1, 0), the above-described first mobius transform function is determined as the vertical field-of-view transform processing function.
Next, the Mobius transformation function used for the above scaling visual field transformation processing is determined as a second Mobius transformation function. The specific form of the second Mobius transformation function is given by the following formula:
f(z) = a′z, where a′ = ρe^(iθ′),
in which a′ denotes a complex number in exponential form in the projected complex plane, ρ denotes the modulus of the complex exponential, e denotes the base of the natural exponential, and θ′ denotes the argument of the complex exponential.
Subsequently, in response to determining that θ′ = 0 and ρ < 1, the above second Mobius transformation function is determined as a contraction function centered at the origin.
Finally, in response to determining that θ′ = 0 and ρ > 1, the above second Mobius transformation function is determined as an expansion function centered at the origin.
Optionally, in response to determining that the scaling center is not at the origin, the center may first be rotated to the origin, and after the transformation the result may be rotated back to its original position.
As an example, the above visual field transformation submodules may refer to the overall framework schematic of the horizontal visual field transformation submodule and the scaling visual field transformation submodule shown in fig. 4. Diagram (a) in fig. 4 shows the overall framework of the horizontal visual field transformation submodule, where F_h denotes the first Mobius transformation function described above. Diagram (b) in fig. 4 shows the overall framework of the scaling visual field transformation submodule, where F_zm denotes the second Mobius transformation function described above. The steps in fig. 4 may correspond to the steps of the main operation flow described above. The vertical visual field transformation submodule is similar to the horizontal visual field transformation submodule. c1 denotes a convolutional neural network, in denotes the input, and out denotes the output. The SE (Squeeze-and-Excitation) network in fig. 4 may be composed of a global average pooling layer, a fully connected layer and a classification layer.
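To make the coordinate pipeline P′_e = T(SP⁻¹(f(SP(T⁻¹(P_e))))) concrete, the following NumPy sketch builds a sampling grid for a scaling Mobius transform f(z) = ρz applied through a stereographic projection of the viewing sphere; the grid can then be fed to a resampling routine (for example torch.nn.functional.grid_sample) to warp a feature map. This is an illustrative reconstruction under the stated assumptions, not the patent's reference implementation.

```python
import numpy as np

def equirect_grid(h, w):
    """Longitude/latitude (radians) of every pixel of an h x w equirectangular map."""
    lon = (np.arange(w) + 0.5) / w * 2 * np.pi - np.pi
    lat = np.pi / 2 - (np.arange(h) + 0.5) / h * np.pi
    return np.meshgrid(lon, lat)

def sphere_to_complex(lon, lat):
    """Stereographic projection of the unit viewing sphere onto the complex plane."""
    x = np.cos(lat) * np.cos(lon)
    y = np.cos(lat) * np.sin(lon)
    z = np.sin(lat)
    return (x + 1j * y) / (1.0 - z + 1e-8)

def complex_to_sphere(c):
    """Inverse stereographic projection back to longitude/latitude."""
    u, v = c.real, c.imag
    d = 1.0 + u ** 2 + v ** 2
    x, y, z = 2 * u / d, 2 * v / d, (u ** 2 + v ** 2 - 1.0) / d
    return np.arctan2(y, x), np.arcsin(np.clip(z, -1.0, 1.0))

def scaled_fov_grid(h, w, rho=1.2):
    """Source pixel coordinates realising the scaling Mobius transform f(z) = rho*z."""
    lon, lat = equirect_grid(h, w)
    lon2, lat2 = complex_to_sphere(rho * sphere_to_complex(lon, lat))
    xs = (lon2 + np.pi) / (2 * np.pi) * w - 0.5          # back to pixel x
    ys = (np.pi / 2 - lat2) / np.pi * h - 0.5            # back to pixel y
    return xs, ys

xs, ys = scaled_fov_grid(64, 128, rho=1.2)
print(xs.shape, ys.shape)   # (64, 128) each
```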
And a third sub-step of performing field adaptive processing on the plurality of stage modulation sub-features according to a sample adaptive sub-module in the sample adaptive field-of-view transform module to obtain a plurality of stage sample adaptive sub-features.
In some optional implementations of some embodiments, the performing of visual field adaptive processing on the plurality of stage modulation sub-features according to the sample adaptation submodule in the sample adaptive visual field transformation module to obtain a plurality of stage sample adaptation sub-features may include: performing convolution processing a preset number of times on each stage modulation sub-feature of the plurality of stages according to the sample adaptation submodule in the sample adaptive visual field transformation module, so as to obtain the plurality of stage sample adaptation sub-features, wherein each feature value of the sample adaptation sub-feature of each of the plurality of stages lies between 0 and 1.
Optionally, the sample adaptation submodule may be expressed by the formula given in the original filing as Figure BDA0003850089350000131, where k denotes the stage index, Figure BDA0003850089350000132 denotes the sample adaptation sub-feature of the k-th stage, Sigmoid denotes the classification function, FC denotes the fully connected function, ReLU denotes the activation function, and GAP denotes the global average pooling function.
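A minimal PyTorch sketch of a squeeze-and-excitation style gate built from the GAP, FC, ReLU and Sigmoid components named above, producing one weight per sub-branch; the layer widths, the reduction ratio, and the choice of one weight per branch (rather than per channel) are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SampleAdaptiveGate(nn.Module):
    """Produces per-branch weights in (0, 1) from a stage's modulation sub-feature."""
    def __init__(self, channels=64, num_branches=4, reduction=4):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                         # GAP
        self.fc1 = nn.Linear(channels, channels // reduction)      # FC
        self.act = nn.ReLU(inplace=True)                           # ReLU
        self.fc2 = nn.Linear(channels // reduction, num_branches)  # FC
        self.gate = nn.Sigmoid()                                   # Sigmoid

    def forward(self, x):
        s = self.gap(x).flatten(1)                                 # (B, C)
        return self.gate(self.fc2(self.act(self.fc1(s))))          # (B, num_branches)

weights = SampleAdaptiveGate()(torch.randn(2, 64, 32, 64))
print(weights.shape, float(weights.min()), float(weights.max()))   # values in (0, 1)
```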
A fourth substep of performing fusion processing on the plurality of stage view holding sub-features, view transformation sub-features, and sample adaptive sub-features to obtain the plurality of stage enhancer features.
In some optional implementations of some embodiments, the performing a fusion process on the plurality of stage view-keeping sub-features, the view-changing sub-feature and the sample adaptive sub-feature by the performing body to obtain the plurality of stage enhancer features may include:
first, for each of the plurality of stages, the view-preserving sub-feature and the view-transforming sub-feature are adaptively stacked in channel dimensions according to the sample adaptive sub-feature. The specific implementation form of the adaptive stacking is shown as the following formula:
V f =Concat(ω n ×V n ),n=0,1,2,3。
wherein V represents a feature. V f Representing the stacking result characteristics. V 0 Representing the view-preserving sub-feature described above. V 1 The horizontal visual field conversion sub-feature is represented among the visual field conversion sub-features. V 2 A vertical field of view transform sub-feature of the field of view transform sub-features. V 3 A scaled view transform sub-feature of the view transform sub-features is represented. Concat represents the function of the superposition operation of the above features on the channel. n represents a constant. Omega n A feature value representing a sample adaptation sub-feature. The eigenvalue size of the sample adaptive sub-feature may lie between 0 and 1.
Optionally, feature weights of a horizontal field-of-view transformation sub-feature, a vertical field-of-view transformation sub-feature, a scaled field-of-view transformation sub-feature, and the field-of-view preservation sub-feature in the field-of-view transformation sub-features may be adjusted by feature values of the sample adaptive sub-features, so that the stacking result feature is applicable to different samples.
Optionally, the horizontal visual field transformation submodule, the vertical visual field transformation submodule and the scaled visual field transformation submodule may be deleted according to a requirement of an actual task.
Then, two convolution operations are performed on the stacking result feature to obtain the enhancer feature corresponding to the stage.
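A minimal PyTorch sketch of the adaptive stacking V_f = Concat(ω_n × V_n) followed by the two convolutions that produce the stage's enhancer feature; the 3 × 3 then 1 × 1 kernel sizes of those two convolutions are assumptions for illustration.

```python
import torch
import torch.nn as nn

def adaptive_stack(v_keep, v_h, v_v, v_zm, weights):
    """Channel-wise stacking of the four branch features, each scaled by its
    sample-adaptive weight: V_f = Concat(w_n * V_n), n = 0..3."""
    branches = [v_keep, v_h, v_v, v_zm]
    scaled = [weights[:, n].view(-1, 1, 1, 1) * v for n, v in enumerate(branches)]
    return torch.cat(scaled, dim=1)             # stacked along the channel dimension

# Example: four 64-channel branch features and a (B, 4) weight tensor from the gate.
B, C, H, W = 2, 64, 32, 64
branch_feats = [torch.randn(B, C, H, W) for _ in range(4)]
w = torch.rand(B, 4)
stacked = adaptive_stack(*branch_feats, w)      # (B, 256, H, W)

# Two further convolutions reduce the stacked tensor to the enhancer feature.
to_enhancer = nn.Sequential(nn.Conv2d(4 * C, C, 3, padding=1), nn.ReLU(inplace=True),
                            nn.Conv2d(C, C, 1))
print(to_enhancer(stacked).shape)               # torch.Size([2, 64, 32, 64])
```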
And thirdly, generating a remarkable object segmentation result image according to the characteristics of the plurality of stages of enhancers.
Optionally, an output prediction module may be formed from a preset number of convolutional layers, and the obtained enhancer features may be processed by the output prediction module to obtain the salient object segmentation result image.
Optionally, during model training of the preset visual field perception convolutional neural network segmentation model, the enhancer feature of each stage among the plurality of stage enhancer features is connected to the output prediction module, and the multi-stage salient object segmentation results are supervised so that each stage learns better features, thereby improving the accuracy of the salient object segmentation result.
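A minimal sketch of the multi-stage (deep) supervision described above: each stage's prediction is resized to the ground-truth resolution and a per-stage loss is accumulated. The use of binary cross-entropy and bilinear resizing are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def multi_stage_supervision_loss(stage_logits, gt_mask):
    """Sum a segmentation loss over the predictions attached to each stage.

    stage_logits: list of (B, 1, h_k, w_k) logits, one per supervised stage.
    gt_mask:      (B, 1, H, W) float tensor, binary salient-object mask in {0, 1}.
    """
    total = torch.zeros((), device=gt_mask.device)
    for logits in stage_logits:
        pred = F.interpolate(logits, size=gt_mask.shape[-2:],
                             mode='bilinear', align_corners=False)
        total = total + F.binary_cross_entropy_with_logits(pred, gt_mask)
    return total
```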
In practice, the trained convolutional neural network is used for processing the equidistant cylindrical projection image, so that the segmentation result of the salient object in the equidistant cylindrical projection image can be accurately obtained, and the reliability of segmenting the salient object in the panoramic image is improved.
In practice, the panorama image salient object segmentation method and the related content thereof realize the segmentation of the salient object of the panorama image, and in the process of segmenting the salient object, a series of processing such as feature extraction, feature modulation and feature enhancement is carried out on the panorama image, so that the accuracy and the reliability of segmenting the panorama image salient object are improved.
With further reference to fig. 5, as an implementation of the methods shown in the above figures, the present disclosure provides some embodiments of a visual field aware panorama image salient object segmentation apparatus, which correspond to those method embodiments shown in fig. 1, and which may be particularly applied in various electronic devices.
As shown in fig. 5, the visual field perceived panoramic image salient object segmentation apparatus 500 of some embodiments includes: an acquisition unit 501, a projection unit 502 and a segmentation unit 503. Wherein the acquisition unit 501 is configured to acquire a panoramic image; the projection unit 502 is configured to perform projection processing on the panoramic image to obtain equidistant cylindrical projection images and analyze the equidistant cylindrical projection images; the segmentation unit 503 is configured to acquire a preset field-of-view perceptual convolutional neural network segmentation model, and perform segmentation processing on the equidistant cylindrical projection image according to the preset field-of-view perceptual convolutional neural network segmentation model to obtain a salient object segmentation result image.
It will be understood that the units described in the apparatus 500 correspond to the various steps in the method described with reference to fig. 1. Thus, the operations, features and resulting advantages described above with respect to the method are also applicable to the apparatus 500 and the units included therein, and are not described herein again.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combinations of the above-mentioned features; it also encompasses other embodiments in which the above-mentioned features or their equivalents are combined arbitrarily without departing from the spirit of the invention, for example technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims (10)

1. A visual field perception panoramic image salient object segmentation method comprises the following steps:
acquiring a panoramic image;
carrying out projection processing on the panoramic image to obtain equidistant cylindrical projection images and analyzing the equidistant cylindrical projection images;
and acquiring a preset visual field perception convolutional neural network segmentation model, and segmenting the equidistant cylindrical projection image according to the preset visual field perception convolutional neural network segmentation model to obtain a significant object segmentation result image.
2. The method according to claim 1, wherein the segmenting the equidistant cylindrical projection image according to the preset visual field perception convolutional neural network segmentation model to obtain a significant object segmentation result image comprises:
extracting, adapting and modulating the characteristics of multiple stages of the equidistant cylindrical projection image to obtain modulation sub-characteristics of multiple stages;
performing enhancement processing on the plurality of stage modulation sub-features to obtain a plurality of stage enhancement sub-features, wherein the observing behavior of the human eye on the panoramic image is applied to feature learning of equidistant cylindrical projection images by a neural network, and the enhancement processing is performed on the plurality of stage modulation sub-features to obtain a plurality of stage enhancement sub-features, which may include the following steps:
according to a sample self-adaptive visual field transformation module in the preset visual field perception convolutional neural network segmentation model, performing enhancement processing on the multiple stage modulation sub-features to obtain multiple stage enhancement sub-features;
and generating a remarkable object segmentation result image according to the plurality of stage enhancer characteristics.
3. The method of claim 2, wherein said performing a plurality of phase feature extractions, adaptations and modulations on said equidistant cylindrical projection images resulting in a plurality of phase modulation sub-features comprises:
according to a basic feature extraction module in the preset visual field perception convolutional neural network segmentation model, performing multi-stage feature extraction on the equidistant cylindrical projection image to obtain a plurality of stage basic sub-features;
according to a channel adaptation module in the preset visual field perception convolutional neural network segmentation model, carrying out adaptation processing on the multiple stage basic sub-features to obtain multiple stage adaptation sub-features;
and modulating the multi-stage adapter characteristics according to a characteristic fusion module in the preset visual field perception convolutional neural network segmentation model to obtain multi-stage modulation sub-characteristics.
4. The method according to claim 3, wherein the performing a plurality of stage feature extractions on the equidistant cylindrical projection image according to a basic feature extraction module in the preset visual field perception convolutional neural network segmentation model to obtain a plurality of stage basic sub-features comprises:
and sequentially extracting the characteristics of five stages of the equidistant cylindrical projection images according to a basic characteristic extraction module in the preset visual field perception convolutional neural network segmentation model to obtain a plurality of stage basic sub-characteristics respectively corresponding to two, three, four and five stages.
5. The method according to claim 4, wherein the adapting the plurality of stage basis sub-features according to a channel adaptation module in the preset visual field perception convolutional neural network segmentation model to obtain a plurality of stage adaptor sub-features comprises:
and performing convolution operation twice on the basic sub-features of the multiple stages to obtain multiple stage adapter sub-features, wherein the space sizes of convolution kernels of the convolution operation twice are 3 × 3 and 1 × 1 in sequence, and the number of the convolution kernels of the convolution operation twice is decreased progressively in sequence.
6. The method according to claim 5, wherein the modulating the plurality of stage adaptation sub-features according to the feature fusion module in the preset visual field perception convolutional neural network segmentation model to obtain the plurality of stage modulation sub-features comprises:
according to the feature fusion module of the current stage, modulating each stage adaptation sub-feature of the plurality of stage adaptation sub-features together with the enhancement sub-feature of the adjacent higher stage to obtain the modulation sub-feature of the current stage;
and in response to determining that the enhancement sub-feature of the adjacent higher stage does not exist, using only the adaptation sub-feature of the current stage as the input of the feature fusion module, wherein the feature fusion module comprises a preset number of convolution operations.
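A minimal PyTorch sketch of one per-stage feature fusion module, under the assumptions that the higher-stage enhancement sub-feature is bilinearly upsampled and fused by addition and that the preset number of convolutions is two; these choices are illustrative, not stated in the claim.

```python
# Feature fusion (modulation) sketch: fuse the current-stage adaptation sub-feature
# with the adjacent higher-stage enhancement sub-feature when it exists; otherwise
# use the adaptation sub-feature alone. A preset number of convolutions follows.
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    def __init__(self, ch=64, num_convs=2):  # channel width and conv count are assumed
        super().__init__()
        self.convs = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True))
            for _ in range(num_convs)
        ])

    def forward(self, adapted, higher_enhanced=None):
        if higher_enhanced is not None:
            higher_enhanced = F.interpolate(higher_enhanced, size=adapted.shape[-2:],
                                            mode="bilinear", align_corners=False)
            x = adapted + higher_enhanced   # fusion by addition is an assumption
        else:
            x = adapted                     # top stage: no higher-stage feature exists
        return self.convs(x)
```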
7. The method according to claim 6, wherein the performing enhancement processing on the plurality of stage modulation sub-features according to the sample adaptive visual field transformation module in the preset visual field perception convolutional neural network segmentation model to obtain the plurality of stage enhancement sub-features comprises:
according to a visual field preservation submodule in the sample adaptive visual field transformation module, performing visual field preservation processing on the plurality of stage modulation sub-features to obtain a plurality of stage visual field preservation sub-features;
according to a visual field transformation submodule in the sample adaptive visual field transformation module, performing visual field transformation processing on the plurality of stage modulation sub-features to obtain a plurality of stage visual field transformation sub-features;
according to a sample adaptation submodule in the sample adaptive visual field transformation module, performing visual field adaptation processing on the plurality of stage modulation sub-features to obtain a plurality of stage sample adaptive sub-features, wherein the visual field adaptation processing comprises the following steps:
performing a preset number of convolution operations on each stage modulation sub-feature according to the sample adaptation submodule in the sample adaptive visual field transformation module to obtain the plurality of stage sample adaptive sub-features, wherein each feature value of the sample adaptive sub-feature of each stage is between 0 and 1;
and performing fusion processing on the plurality of stage visual field preservation sub-features, visual field transformation sub-features and sample adaptive sub-features to obtain the plurality of stage enhancement sub-features, wherein the fusion processing comprises the following steps:
for each stage of the plurality of stages, performing adaptive stacking of the visual field preservation sub-feature and the visual field transformation sub-features in the channel dimension according to the sample adaptive sub-feature, wherein the adaptive stacking is implemented as shown in the following formula:
V_f = Concat(ω_n × V_n), n = 0, 1, 2, 3,
wherein V_f represents the stacking result feature, V_0 represents the visual field preservation sub-feature, V_1 represents the horizontal visual field transformation sub-feature among the visual field transformation sub-features, V_2 represents the vertical visual field transformation sub-feature, V_3 represents the scaling visual field transformation sub-feature, Concat represents the operation of superposing features along the channel dimension, n is the branch index, and ω_n represents a feature value of the sample adaptive sub-feature, the magnitude of which is between 0 and 1;
and performing two convolution operations on the stacking result feature to obtain the enhancement sub-feature corresponding to the stage.
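A sketch of this sample-adaptive stacking, assuming PyTorch; the channel width, the sigmoid used to bound the weights in (0, 1), and the use of one weight map per branch are illustrative assumptions, while the Concat-of-weighted-branches form and the two trailing convolutions follow the claim.

```python
# Sample-adaptive stacking sketch (V_f = Concat(w_n * V_n), n = 0..3): w comes from
# the sample adaptation submodule, V_0 is the visual field preservation sub-feature,
# V_1..V_3 are the horizontal / vertical / scaling visual field transformation
# sub-features. Channel widths and the sigmoid bounding are assumptions.
import torch
import torch.nn as nn

class SampleAdaptiveStack(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        # sample adaptation submodule: convolutions producing one weight map per branch
        self.weight_net = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, 4, 1), nn.Sigmoid(),            # values in (0, 1)
        )
        # two convolutions applied to the stacked result
        self.fuse = nn.Sequential(
            nn.Conv2d(4 * ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 1),
        )

    def forward(self, modulated, v_keep, v_h, v_v, v_s):
        w = self.weight_net(modulated)                     # (B, 4, H, W)
        branches = [v_keep, v_h, v_v, v_s]
        stacked = torch.cat([w[:, n:n + 1] * v for n, v in enumerate(branches)], dim=1)
        return self.fuse(stacked)                          # stage enhancement sub-feature
```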
8. The method of claim 7, wherein the performing visual field preservation processing on the plurality of stage modulation sub-features according to the visual field preservation submodule in the sample adaptive visual field transformation module to obtain the plurality of stage visual field preservation sub-features comprises:
performing a preset number of convolution operations on each stage modulation sub-feature of the plurality of stage modulation sub-features according to the visual field preservation submodule in the sample adaptive visual field transformation module to obtain the plurality of stage visual field preservation sub-features.
9. The method of claim 8, wherein the performing visual field transformation processing on the plurality of stage modulation sub-features according to the visual field transformation submodule in the sample adaptive visual field transformation module to obtain the plurality of stage visual field transformation sub-features comprises:
according to a horizontal visual field transformation submodule in the visual field transformation submodule, performing horizontal visual field transformation processing on each stage modulation sub-feature of the plurality of stage modulation sub-features to obtain a plurality of stage horizontal visual field transformation sub-features;
according to a vertical visual field transformation submodule in the visual field transformation submodule, performing vertical visual field transformation processing on each stage modulation sub-feature of the plurality of stage modulation sub-features to obtain a plurality of stage vertical visual field transformation sub-features;
according to a scaling visual field transformation submodule in the visual field transformation submodule, performing scaling visual field transformation processing on each stage modulation sub-feature of the plurality of stage modulation sub-features to obtain a plurality of stage scaling visual field transformation sub-features;
and obtaining each stage visual field transformation sub-feature from the corresponding stage horizontal, vertical and scaling visual field transformation sub-features, wherein the observation behavior of the human eye on the panoramic image is divided into left-right observation in the horizontal direction, up-down observation in the vertical direction and near-far observation in the front-back direction, these three kinds of observation behavior being applied to the feature learning that the neural network performs on the equidistant cylindrical projection image, and wherein the transformation parts of the horizontal visual field transformation submodule, the vertical visual field transformation submodule and the scaling visual field transformation submodule each comprise a number of parallel sub-branches, all of which have the same form but different parameters;
in addition, the specific transformation functions of the horizontal visual field transformation processing, the vertical visual field transformation processing and the scaling visual field transformation processing are different, but the main operation flow is the same and comprises: visual field forward transformation processing, visual field forward transformation feature learning, visual field inverse transformation processing, visual field inverse transformation feature learning and visual field transformation sub-branch feature fusion, wherein the main operation flow comprises the following steps:
the visual field forward transformation processing function of the visual field forward transformation processing is shown in the following formula:
P′_e = T(SP⁻¹(f(SP(T⁻¹(P_e))))) = F(P_e),
wherein F represents the visual field forward transformation processing function, P_e represents the projection point coordinates of the modulation sub-feature in the planar coordinate system corresponding to the equidistant cylindrical projection image, P′_e represents the projection point coordinates in that planar coordinate system after the visual field forward transformation, T represents the coordinate transformation function from projected spherical coordinates in the projected spherical coordinate system corresponding to the equidistant cylindrical projection image to planar coordinates in the planar coordinate system corresponding to the equidistant cylindrical projection image, T⁻¹ represents the inverse function of T, SP represents the spherical polar plane (stereographic) projection function that projects projected spherical coordinates in the projected spherical coordinate system to projected complex plane coordinates in the projected complex plane coordinate system corresponding to the equidistant cylindrical projection image, SP⁻¹ represents the inverse function of SP, and f represents a Möbius transformation function satisfying preset conditions, the horizontal visual field transformation processing, the vertical visual field transformation processing and the scaling visual field transformation processing being realized under different preset conditions;
the specific form of the Möbius transformation function is shown in the following formula:
f(z) = (az + b) / (cz + d),
wherein the projected complex plane corresponding to the equidistant cylindrical projection image is denoted ℂ, a, b, c and d each represent a constant complex number in the projected complex plane, and z represents the projection complex variable;
the visual field inverse transformation function of the visual field inverse transformation processing is the inverse function F⁻¹ of the visual field forward transformation processing function, and the visual field forward transformation feature learning, the visual field inverse transformation feature learning and the visual field transformation sub-branch feature fusion comprise convolution operations;
wherein realizing the horizontal visual field transformation processing, the vertical visual field transformation processing and the scaling visual field transformation processing with Möbius transformation functions f under different constraint conditions comprises the following steps:
determining the Möbius transformation function used for the horizontal visual field transformation processing and the vertical visual field transformation processing as a first Möbius transformation function, the specific form of which is shown in the following formula:
f₁(z) = (az + b) / (−b̄z + ā),
wherein b̄ represents the complex conjugate of b and ā represents the complex conjugate of a;
in response to determining that the rotation angle θ about a vector passing through the origin in the projected spherical coordinate system is within a preset range, the complex numbers a and b in the first Möbius transformation function take the specific form shown in the following formula:
[formula given only as an image in the original: the complex constants a and b expressed in terms of l, m, n, θ and the imaginary unit i]
wherein L = (l, m, n) represents a projected spherical vector passing through the origin in the projected spherical coordinate system, l, m and n represent its first, second and third coordinates, i represents the imaginary unit, and θ represents the direction rotation angle;
in response to determining that L = (0, 0, 1), determining the first Möbius transformation function as the horizontal visual field transformation processing function;
in response to determining that L = (0, 1, 0), determining the first Möbius transformation function as the vertical visual field transformation processing function;
and determining the Möbius transformation function used for the scaling visual field transformation processing as a second Möbius transformation function, the specific form of which is shown in the following formula:
f₂(z) = a′z, where a′ = ρ·e^{iθ′},
wherein a′ represents a complex constant in the projected complex plane, ρ represents the modulus of a′, e represents the base of the natural exponential function, and θ′ represents the argument of a′;
in response to determining that θ′ = 0 and ρ < 1, determining the second Möbius transformation function as a visual field contraction function centered at the origin;
and in response to determining that θ′ = 0 and ρ > 1, determining the second Möbius transformation function as a visual field expansion function centered at the origin.
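A numerical sketch of the forward transformation F = T ∘ SP⁻¹ ∘ f ∘ SP ∘ T⁻¹ applied to equirectangular pixel coordinates follows. The equirectangular-to-sphere parameterization, the choice of stereographic projection from the north pole, and nearest-neighbour resampling are assumptions made for illustration; only the composition order and the scaling Möbius family f₂(z) = ρ·e^{iθ′}·z are taken from the claim text.

```python
# Sketch: warp an equirectangular (ERP) image or feature map by a Moebius transform
# of the viewing sphere. Conventions (ERP parameterization, stereographic projection
# from the north pole) are assumptions; the composition T o SP^-1 o f o SP o T^-1
# follows the claim. Inverse warping is used so every output pixel gets a sample.
import numpy as np

def erp_to_sphere(u, v, w, h):
    """T^-1: ERP pixel coords -> unit-sphere points."""
    lon = (u + 0.5) / w * 2 * np.pi - np.pi
    lat = np.pi / 2 - (v + 0.5) / h * np.pi
    return np.stack([np.cos(lat) * np.cos(lon), np.cos(lat) * np.sin(lon), np.sin(lat)], -1)

def sphere_to_erp(p, w, h):
    """T: unit-sphere points -> ERP pixel coords."""
    lon, lat = np.arctan2(p[..., 1], p[..., 0]), np.arcsin(np.clip(p[..., 2], -1, 1))
    return (lon + np.pi) / (2 * np.pi) * w - 0.5, (np.pi / 2 - lat) / np.pi * h - 0.5

def sphere_to_plane(p):
    """SP: stereographic projection from the north pole to the complex plane."""
    return (p[..., 0] + 1j * p[..., 1]) / (1 - p[..., 2] + 1e-12)

def plane_to_sphere(z):
    """SP^-1: inverse stereographic projection."""
    r2 = np.abs(z) ** 2
    return np.stack([2 * z.real, 2 * z.imag, r2 - 1], -1) / (r2 + 1)[..., None]

def moebius_warp(img, f):
    """Resample ERP array `img` (H, W[, C]) under the view transform induced by f(z)."""
    h, w = img.shape[:2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = sphere_to_plane(erp_to_sphere(u, v, w, h))
    su, sv = sphere_to_erp(plane_to_sphere(f(z)), w, h)   # coordinates to sample from (inverse warp)
    su = np.round(su).astype(int) % w                     # longitude wraps around
    sv = np.clip(np.round(sv).astype(int), 0, h - 1)
    return img[sv, su]

# Scaling ("zoom") Moebius transform f2(z) = rho * exp(i*theta') * z; with theta' = 0,
# rho < 1 contracts and rho > 1 expands the field of view about the projection centre.
zoom_in = lambda z, rho=0.5: rho * z
panorama = np.random.rand(64, 128, 3)          # stand-in for an ERP image or feature map
warped = moebius_warp(panorama, zoom_in)
```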
10. The method of claim 9, wherein said projecting the panoramic image to obtain an equidistant cylindrical projection image and analyzing it further comprises:
analyzing the distortion degree of the salient region in the equidistant cylindrical projection image according to the annotation corresponding to the equidistant cylindrical projection image to obtain the salient region distortion degree of the equidistant cylindrical projection image, wherein the salient region distortion degree is determined by the following formula:
[formula given only as an image in the original: the definition of the salient region distortion degree D]
wherein D represents the salient region distortion degree, j is the index of a pixel row in the pixel image comprised in the equidistant cylindrical projection image, y represents the ordinate of a pixel of the pixel image in the planar coordinate system, and y_j represents the pixel ordinate of the j-th row of pixels in the planar coordinate system; the formula further involves the pixel spherical coordinate in the projected spherical coordinate system corresponding to the pixel ordinate y, and the pixel spherical coordinate corresponding to the pixel ordinate y_j of the j-th row; Q represents the set of coordinate pairs formed by the pixel coordinates of pixels of the pixel image in the planar coordinate system and their corresponding pixel spherical coordinates, h represents the height of the equidistant cylindrical projection image, E represents the projection operator from the planar coordinate system to the projected spherical coordinate system, and N represents the number of coordinate pairs in the set; the formula also involves the salient region in the pixel image corresponding to the pixel ordinate of the j-th row, and the area on the spherical image of the salient region in the equidistant cylindrical projection image corresponding to the pixel spherical coordinate of the j-th row.
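The distortion-degree formula itself appears only as an image in the filing; the sketch below is therefore only one plausible reading, under the assumption that the salient region distortion is gauged by comparing, row by row, the region's pixel area in the ERP plane with its area on the sphere (each ERP row weighted by the cosine of its latitude). All names and the normalization are assumptions, not the patent's formula.

```python
# Hypothetical illustration (not the patent's exact formula): compare a salient
# region's per-row pixel area in the ERP image with its per-row spherical area,
# where a pixel in row j covers a solid angle proportional to cos(latitude_j).
# Larger disagreement between the two row profiles indicates stronger distortion.
import numpy as np

def salient_region_distortion(mask):
    """mask: (H, W) binary annotation of the salient region in the ERP image."""
    h, w = mask.shape
    lat = np.pi / 2 - (np.arange(h) + 0.5) / h * np.pi          # latitude of each pixel row
    plane_rows = mask.sum(axis=1).astype(float)                  # pixel area per row (plane)
    sphere_rows = plane_rows * np.cos(lat)                       # cos-weighted area per row (sphere)
    plane_frac = plane_rows / max(plane_rows.sum(), 1e-12)
    sphere_frac = sphere_rows / max(sphere_rows.sum(), 1e-12)
    return 0.5 * np.abs(plane_frac - sphere_frac).sum()          # total variation between the profiles

mask = np.zeros((64, 128), dtype=np.uint8)
mask[:16, 40:80] = 1                                             # region near the top, heavily stretched in ERP
print(salient_region_distortion(mask))
```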
CN202211130460.7A 2022-09-16 2022-09-16 Visual field perception panoramic image salient object segmentation method and device Pending CN115457275A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211130460.7A CN115457275A (en) 2022-09-16 2022-09-16 Visual field perception panoramic image salient object segmentation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211130460.7A CN115457275A (en) 2022-09-16 2022-09-16 Visual field perception panoramic image salient object segmentation method and device

Publications (1)

Publication Number Publication Date
CN115457275A true CN115457275A (en) 2022-12-09

Family

ID=84305066

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211130460.7A Pending CN115457275A (en) 2022-09-16 2022-09-16 Visual field perception panoramic image salient object segmentation method and device

Country Status (1)

Country Link
CN (1) CN115457275A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination