CN116778187A - Salient target detection method based on light field refocusing data enhancement - Google Patents
- Publication number: CN116778187A
- Application number: CN202310683470.1A
- Authority: CN (China)
- Prior art keywords: light field, module, refocusing, image, depth
- Legal status: Pending (assumed; not a legal conclusion)
Classifications
- G06V10/462 — Salient features, e.g. scale invariant feature transforms [SIFT]
- G06N3/0455 — Auto-encoder networks; encoder-decoder networks
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/048 — Activation functions
- G06N3/084 — Backpropagation, e.g. using gradient descent
- G06N3/09 — Supervised learning
- G06V10/761 — Proximity, similarity or dissimilarity measures
- G06V10/7715 — Feature extraction, e.g. by transforming the feature space
- G06V10/806 — Fusion of extracted features
- G06V10/82 — Image or video recognition using neural networks
- G06V2201/07 — Target detection
Abstract
The invention discloses a salient target detection method based on light field refocusing data enhancement, comprising the following steps: 1. refocus the light field data to obtain light field data under different focusing parameters; 2. perform data enhancement on the refocused data; 3. construct a deep convolutional neural network that takes the light field refocused images and the depth map as input, and train it to obtain a light field salient target detection model; 4. use the trained model to perform salient target detection on the light field refocused images and depth map to be detected, and evaluate the model's accuracy on the data to be detected. The invention realizes salient target detection based on light field refocusing data enhancement, thereby effectively improving the accuracy of salient target detection in complex and changeable environments.
Description
Technical Field
The invention belongs to the fields of computer vision, image processing and analysis, and particularly relates to a salient target detection method based on light field refocusing data enhancement.
Background
Visual saliency is an attention mechanism of the human visual system: when we observe a scene, a salient region usually attracts our attention while non-salient regions are naturally ignored, which allows humans to process a large amount of image data quickly. Salient target detection means that a computer imitates the human visual system to quickly and accurately locate the region or target of interest in the field of view; accurate salient target detection provides reliable prior information for target detection and recognition, image segmentation, and visual tracking.
According to the type of input data, salient target detection falls mainly into three categories: (1) detection based on RGB images; (2) detection based on RGB-D images; (3) detection based on light fields. In complex scenes such as strong or dim lighting, partial occlusion, cluttered backgrounds, or similar foreground and background, it is difficult to detect salient targets or regions effectively with only an RGB image as input. Methods that take RGB-D input, i.e. an RGB image together with a depth map, introduce additional depth information and have been shown to improve salient target detection, but when the depth map quality is poor the detection results are also poor. A light field records the amount of light traveling in every direction through every point in space, capturing both the positional and angular information of the light rays, and thus describes a natural scene more completely.
At present, several efforts have investigated light-field-based salient target detection; they fall broadly into feature-based and learning-based methods. Feature-based methods estimate the salient target from information such as color, depth, and background priors on the basis of a light field focal stack and an all-in-focus image. Such methods consider only a few limited features and usually do not reach high detection accuracy. Learning-based methods train a salient target detection model on a certain amount of training data and evaluate the trained model on test data. Relying on the strong learning capability of deep neural networks, learning-based methods integrate many kinds of features and greatly improve detection accuracy over feature-based methods. However, they still have drawbacks:
1. for focal-stack-based methods, the local blur between different refocused images makes it hard to obtain saliency maps with sharp edges, and when the depth-of-field range is narrow these methods can hardly achieve ideal results;
2. most learning-based methods train and test only on datasets proposed by their own authors, and this lack of cross-dataset comparison makes it difficult to demonstrate the robustness of the model;
3. most learning-based methods use focal stacks, in which the differences in focus depth between refocused images are small; moreover, the all-in-focus image is itself a special refocused image, so the two image types carry a certain data redundancy, which imposes a large computational overhead on the network.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a salient target detection method based on light field refocusing data enhancement that fully exploits the special properties of light field data and reduces the computational load, thereby effectively improving the precision and accuracy of salient target detection in complex and changeable environments.
In order to solve the technical problems, the invention adopts the following technical scheme:
the invention discloses a salient object detection method based on light field refocusing data enhancement, which is characterized by comprising the following steps of:
step 1, refocusing light field data to obtain the light field data under different focusing parameters;
step 1.1, record the light field data of the nth scene as L_F^n(u, v, x, y), where u and v denote the horizontal and vertical viewing-angle coordinates in the angular dimension, u, v ∈ [1, M], and M denotes the maximum number of viewing angles in the horizontal and vertical directions; x and y denote the horizontal and vertical pixel coordinates in the spatial dimension, x ∈ [1, X], y ∈ [1, Y], where X and Y denote the maximum spatial width and height of a view image; n ∈ [1, N], where N denotes the number of light field data; F denotes the distance from the light field camera's main lens to the sensor;
step 1.2, refocus the light field data L_F^n(u, v, x, y) of the nth scene onto a virtual focal plane F_α to obtain refocused light field data L_{F′_α}^n(u, v, x′, y′), where F′_α is the distance from the virtual focal plane F_α to the camera's main lens, and x′ and y′ denote the horizontal and vertical pixel coordinates in the spatial dimension of the refocused view images;
step 2, decode the refocused light field data L_{F′_α}^n(u, v, x′, y′) to obtain refocused images focused at different depths of the scene;
step 2.1, perform computational imaging on the refocused light field via formula (1) to obtain the image E_{F_α}^n(x′, y′) of the nth scene at the virtual focal plane F_α:

E_{F_α}^n(x′, y′) = (1 / (α²F²)) Σ_u Σ_v L_F^n(u, v, u + (x′ − u)/α, v + (y′ − v)/α)   (1)

In formula (1), α denotes the scaling factor between the distance F′_α from the virtual focal plane F_α to the sensor and the distance F from the light field camera's main lens to the sensor, i.e. α = F′_α / F;
step 2.2, take N different scaling factors {α_1, α_2, …, α_m, …, α_N} and repeat steps 1.2 to 2.1 to obtain a series of refocused images focused at different depths of the nth scene, which form the focal stack of the nth scene; here α_m denotes the mth scaling factor, the mth image of the stack is the refocused image of the nth scene at the virtual focal plane F_{α_m} under the scaling factor α_m, and N denotes the number of refocused images contained in the focal stack; let the height, width and channel number of each refocused image be H, W and C respectively;
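The refocusing and focal-stack construction of steps 1.2 to 2.2 can be sketched as a shift-and-add over the view images (a minimal numpy sketch with a nearest-neighbour shift approximation; the function names and the (M, M, X, Y) array layout are illustrative assumptions, not from the patent):

```python
import numpy as np

def refocus(light_field, alpha):
    """Shift-and-add refocusing (nearest-neighbour shift approximation).

    light_field: array of shape (M, M, X, Y) holding the view images
    L_F^n(u, v, x, y); alpha is the focal-plane scaling factor.
    Returns the refocused image E_{F_alpha}(x', y') of shape (X, Y).
    """
    M = light_field.shape[0]
    c = (M - 1) / 2.0                      # centre view index
    acc = np.zeros(light_field.shape[2:], dtype=np.float64)
    for u in range(M):
        for v in range(M):
            # per-view shift proportional to (1 - 1/alpha) times the view offset
            du = int(round((u - c) * (1 - 1.0 / alpha)))
            dv = int(round((v - c) * (1 - 1.0 / alpha)))
            acc += np.roll(light_field[u, v], shift=(du, dv), axis=(0, 1))
    return acc / (M * M)                   # average over all M*M views

def focal_stack(light_field, alphas):
    """Stack of refocused images, one per scaling factor alpha_m."""
    return np.stack([refocus(light_field, a) for a in alphas])
```

Each α_m focuses the synthetic image at a different scene depth; sweeping α over a range of values yields the focal stack described above.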
step 3, apply data enhancement to the refocused images contained in the focal stack of the nth scene to obtain the data-enhanced focal stack of the nth scene, whose mth image is the enhanced refocused image at the virtual focal plane F_{α_m} under the scaling factor α_m;
denote the depth map and ground-truth saliency map of the nth scene as D_n and G_n respectively; apply the same data enhancement to D_n and G_n to obtain the data-enhanced depth map and ground-truth saliency map;
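Step 3 requires that the same enhancement be applied jointly to the focal stack, the depth map D_n and the ground-truth map G_n so the three stay pixel-aligned. A hypothetical numpy sketch using a horizontal flip (the patent does not specify which enhancement operations are used):

```python
import numpy as np

def enhance_sample(stack, depth, gt, flip=True):
    """Apply one geometric enhancement (here: a horizontal flip) jointly to
    the focal stack (N, H, W, C), the depth map (H, W) and the ground-truth
    saliency map (H, W), keeping all three pixel-aligned."""
    if flip:
        stack = stack[:, :, ::-1, :]   # flip the width axis of the stack
        depth = depth[:, ::-1]         # flip the width axis of the depth map
        gt = gt[:, ::-1]               # flip the width axis of the label
    return (np.ascontiguousarray(stack),
            np.ascontiguousarray(depth),
            np.ascontiguousarray(gt))
```

Applying the flip to the label as well is essential: enhancing the inputs without transforming G_n identically would break the pixel-wise supervision used in training.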
Step 4, constructing a salient object detection model based on light field refocusing data enhancement, which comprises the following steps: the system comprises an encoding network, an RGB and depth fusion module, a depth recovery module, a decoding network and an optimization module;
step 4.1, the coding network comprises: RGB networks and deep networks; wherein, the RGB network takes ResNet18 as a backbone network, and comprises: j basic blocks and j channel dimension reduction modules; the depth network is composed of j convolution modules;
the refocused imageInputting the saliency target detection model, and sequentially carrying out convolution processing on j basic blocks of RGB (red, green and blue) network in the coding network to obtain j refocusing features in the nth sceneWherein (1)>Representing refocus image +.>Is the ith feature map of (2);
each channel dimension reduction module is sequentially composed of two convolution layers, a batch normalization layer and a ReLU activation layer;
the j channel dimension reduction modules respectively focus the j refocusing features in the nth sceneAfter processing, j dimension-reducing features ∈j after the dimension reduction of the nth scene are obtained>Wherein (1)>Representing the ith feature after dimension reduction;
the depth mapInputting the saliency target detection model, and sequentially carrying out convolution processing on j convolution modules in a depth network in a coding network to obtainDepth feature D in nth scene n′ ;
Step 4.2, construct the RGB and depth fusion module, which sequentially comprises: an IBR module, a convolution module Conv1 and an IRB module;
feature the jth dimension reduction in the nth sceneAnd depth feature D n′ After pixel level multiplication calculation, inputting the calculated pixel level multiplication calculation into the RGB and depth fusion module, and carrying out convolution processing by the IBR module to obtain a preliminary fusion characteristic E under an nth scene n ;
The convolution module Conv1 pair j-th dimension reduction featureAfter convolution processing, refocusing image characteristics are obtained
Fusion feature E in nth scene n Refocusing image featuresAnd depth feature D n′ After pixel level multiplication calculation, inputting the calculated result into an IRB module, and sequentially carrying out convolution, batch normalization and ReLU activation processing to obtain a final fusion feature E under an nth scene n′ ;
Step 4.3, the depth recovery module includes: a convolution module Conv2 and a fusion module;
the fusion feature E n′ Inputting the rough restoration depth map in the nth scene into the convolution module Conv2, and sequentially performing bilinear interpolation, convolution, batch normalization and ReLU activation to obtain the rough restoration depth map in the nth scene
The fusion module recovers the depth map of the roughnessAfter residual error, convolution, bilinear interpolation and Sigmoid activation processing are sequentially carried out, an accurate recovery depth map +_in an nth scene is obtained>
Step 4.4, the decoding network includes: the device comprises a bridging module and a decoding module;
the bridging module performs the dimension reduction on the j-th dimension reduction featureAfter the processing of convolution, batch normalization and ReLU activation is sequentially carried out, bridging feature B is obtained n ;
The decoding module consists of j decoding stages, each decoding stage consists of three continuous deconvolution modules, and each deconvolution module consists of a deconvolution layer, a batch normalization layer and a ReLU activation layer in sequence;
when i=1, bridging feature B will be n And fusion feature E n′ Inputting the i-th rough significant image and the i-th rough significant image into the i-th decoding stage together for processing
When i=2, 3, …, j, the i-1 th coarse significant image is up-sampled twice and then compared withInputting the i-th decoding stage together for processing to obtain the i-th rough significant image +.>Thereby outputting the j-th coarse salient image from the j-th decoding stage>And forms the roughness in the nth sceneIs a salient image collection of (1)
Step 4.5, constructing the optimization module, which comprises the following steps: an encoder, a decoder;
the encoder and decoder are used for sequentially carrying out the treatment on the jth rough salient imageProcessing to generate accurate prediction saliency map pre in nth scene n ;
Step 5, train the salient target detection model based on light field refocusing data enhancement;
step 5.1, establishing a loss function;
step 5.1.1, establish the spatial loss function, the edge loss function and the depth loss function of the nth scene through formulas (2), (3) and (4) respectively;
In formulas (2), (3) and (4), the ground-truth saliency map corresponding to the focal stack of the nth scene serves as the reference; TP_n denotes the region of pre_n correctly predicted as the salient target, FN_n denotes the region where the salient target is mispredicted as background, FP_n denotes the region of pre_n where background is mispredicted as the salient target, and β denotes the balance factor;
step 5.1.2, establish the total loss function L_n of the nth scene through formula (5);
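The patent's loss formulas (2)-(5) appear as images and are not reproduced in the text. As a hedged illustration only: the TP/FN/FP regions and the balance factor β defined above are exactly the ingredients of a soft weighted F-measure loss, one standard instantiation of which is sketched below (numpy; β² = 0.3 is a conventional choice in saliency evaluation, not a value taken from the patent):

```python
import numpy as np

def f_measure_loss(pred, gt, beta2=0.3):
    """Soft weighted F-measure loss built from the TP/FP/FN regions.

    pred, gt: arrays with values in [0, 1]. This is one common region
    loss over TP/FP/FN, not necessarily the patent's exact formula.
    """
    tp = np.sum(pred * gt)              # salient area correctly predicted
    fp = np.sum(pred * (1.0 - gt))      # background predicted as salient
    fn = np.sum((1.0 - pred) * gt)      # salient area predicted as background
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    f = (1.0 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
    return 1.0 - f                      # 0 for a perfect prediction
```

A perfect prediction drives the loss to 0, while predicting the complement of the ground truth drives it to 1.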
Step 5.2, train the salient target detection model with the stochastic gradient descent algorithm, computing the total loss function in each scene to update the network parameters until the total loss converges, so as to obtain the optimal salient target detection model for performing salient target detection on light field images.
The electronic device of the invention comprises a memory and a processor, where the memory stores a program supporting the processor in executing the salient target detection method, and the processor is configured to execute the program stored in the memory.
The computer-readable storage medium of the invention stores a computer program which, when executed by a processor, performs the steps of the salient target detection method.
Compared with the prior art, the invention has the beneficial effects that:
1. according to the invention, the deep neural network based on light field refocusing data enhancement is constructed, and the label data is used for supervising the neural network to learn, so that a light field saliency target detection characteristic model with robustness is obtained, the problem of high detection precision due to the fact that the calculation burden of the network is large in the focal stack model is solved, the calculation burden of the network is greatly reduced, and the problem of low detection precision is solved.
2. According to the invention, by constructing the depth neural network based on light field refocusing data enhancement, the input depth map is considered, the view angle change of a partial area is converted into the depth change of the whole image area, and the depth change of the whole image area is ignored based on the light field data network.
3. According to the invention, by constructing the deep neural network based on the light field refocusing data enhancement, based on the thought of U-Net, the encoder part and the decoder part are symmetrically constructed, and the channel dimension reduction module is used for reducing the dimension of the characteristics acquired by the encoder, so that the data quantity required to be calculated by the decoder is effectively reduced, and the problems of huge calculated quantity and high time cost of the deep neural network based on the refocusing data enhancement are solved.
4. According to the invention, the optimization module is used for optimizing the detection result of the light field salient target, so that the pixel points with detection errors on the outline of the salient region are corrected, the detection edge is smoother, and the accuracy of detecting the light field salient target is improved.
Drawings
FIG. 1 is a flow chart of salient object detection for a light field refocused image in accordance with the present invention;
FIG. 2 is a schematic diagram of a deep neural network based on light field refocusing data enhancement used in the present invention;
FIG. 3 shows salient target detection results of the invention and other light field salient target detection methods on parts of the DUTLF-V2, DUTLF-FS, Lytro-Illum, HFUT-Lytro and LFSD test sets.
Detailed Description
As shown in fig. 1, in this embodiment, the salient target detection method based on light field refocusing data enhancement constructs a deep neural network based on light field refocusing data enhancement to obtain a light field salient target detection model that can effectively detect salient targets in complex scenes, thereby effectively improving the accuracy and precision of salient target detection in complex and changeable environments. Specifically, the method comprises the following steps:
step 1, refocusing light field data to obtain the light field data under different focusing parameters;
step 1.1, record the light field data of the nth scene as L_F^n(u, v, x, y), where u and v denote the horizontal and vertical viewing-angle coordinates in the angular dimension, u, v ∈ [1, M], and M denotes the maximum number of viewing angles in the horizontal and vertical directions; x and y denote the horizontal and vertical pixel coordinates in the spatial dimension, x ∈ [1, X], y ∈ [1, Y], where X and Y denote the maximum spatial width and height of a view image; n ∈ [1, N], where N denotes the number of light field data; F denotes the distance from the light field camera's main lens to the sensor;
training and testing is performed in this embodiment using a light field saliency target detection dataset DUTLF-V2, the DUTLF-V2 containing a total of n=4204 scenes, wherein the training set contains 2597 scenes, the testing set contains 1247 scenes, and the maximum viewing angle number m=9 in the horizontal and vertical directions;
step 1.2, refocus the light field data L_F^n(u, v, x, y) of the nth scene onto a virtual focal plane F_α to obtain refocused light field data L_{F′_α}^n(u, v, x′, y′), where F′_α is the distance from the virtual focal plane F_α to the camera's main lens, and x′ and y′ denote the horizontal and vertical pixel coordinates in the spatial dimension of the refocused view images;
step 2, decode the refocused light field data L_{F′_α}^n(u, v, x′, y′) to obtain refocused images focused at different depths of the scene;
step 2.1, perform computational imaging on the refocused light field via formula (1) to obtain the image E_{F_α}^n(x′, y′) of the nth scene at the virtual focal plane F_α:

E_{F_α}^n(x′, y′) = (1 / (α²F²)) Σ_u Σ_v L_F^n(u, v, u + (x′ − u)/α, v + (y′ − v)/α)   (1)

In formula (1), α denotes the scaling factor between the distance F′_α from the virtual focal plane F_α to the sensor and the distance F from the light field camera's main lens to the sensor, i.e. α = F′_α / F;
step 2.2, take N different scaling factors {α_1, α_2, …, α_m, …, α_N} and repeat steps 1.2 to 2.1 to obtain a series of refocused images focused at different depths of the nth scene, which form the focal stack of the nth scene; here α_m denotes the mth scaling factor, the mth image of the stack is the refocused image of the nth scene at the virtual focal plane F_{α_m} under the scaling factor α_m, and N denotes the number of refocused images contained in the focal stack; let the height, width and channel number of each refocused image be H, W and C respectively;
in this embodiment, α is determined by the depth of the specific scene containing target, and the refocus number is determined by the depth range of the specific scene containing target. Because the depth distribution of each scene containing the target is different, most of refocusing images acquired by each scene are 3-13, in order to ensure data consistency, the scenes with small scene depth change are duplicated, the existing refocusing images are duplicated, and the scenes with large scene depth change are discarded, so that each scene contains N=12 refocusing images. To reduce the computational effort of the neural network, the focal stack is further sampled to have a height h=256 and a width w=256, the refocused image being a color image, and channel c=3.
Step 3, apply data enhancement to the refocused images contained in the focal stack of the nth scene to obtain the data-enhanced focal stack of the nth scene, whose mth image is the enhanced refocused image at the virtual focal plane F_{α_m} under the scaling factor α_m;
Denote the depth map and ground-truth saliency map of the nth scene as D_n and G_n respectively; apply the same data enhancement to D_n and G_n to obtain the data-enhanced depth map and ground-truth saliency map;
Step 4, constructing a salient object detection model based on light field refocusing data enhancement, which comprises the following steps: the system comprises an encoding network, an RGB and depth fusion module, a depth recovery module, a decoding network and an optimization module; as shown in fig. 2;
step 4.1, the coding network comprises: RGB networks and deep networks; wherein, the RGB network takes ResNet18 as a backbone network, and comprises: j basic blocks and j channel dimension reduction modules; the depth network is composed of j convolution modules;
in this embodiment, the number j=5 of basic blocks included in the RGB network.
Each channel dimension reduction module is sequentially composed of two convolution layers, a batch normalization layer and a ReLU activation layer;
the j channel dimension reduction modules respectively focus the j refocusing features in the nth sceneAfter processing, j dimension-reducing features ∈j after the dimension reduction of the nth scene are obtained>Wherein (1)>Representing the ith feature after dimension reduction;
in this embodiment, the number of channels after dimension reduction is 32.
The enhanced depth map is input into the salient target detection model and passed sequentially through the convolution processing of the j convolution modules of the depth network in the coding network to obtain the depth feature D′_n of the nth scene;
Step 4.2, construct the RGB and depth fusion module, which sequentially comprises: an IBR module, a convolution module Conv1 and an IRB module;
feature the jth dimension reduction in the nth sceneAnd depth feature D n′ After pixel level multiplication calculation, inputting the calculated values into an RGB and depth fusion module, and carrying out convolution processing by an IBR module to obtain a preliminary fusion characteristic E under an nth scene n ;
Convolving module Conv1 pair jth dimension reduction featureAfter convolution processing, refocusing image characteristics +.>
Fusion feature E in nth scene n Refocusing image featuresAnd depth feature D n′ After pixel level multiplication calculation, inputting the calculated result into an IRB module, and sequentially carrying out convolution, batch normalization and ReLU activation processing to obtain a final fusion feature E under an nth scene n′ ;
Step 4.3, the depth restoration module includes: a convolution module Conv2 and a fusion module;
The fusion feature E_n′ is input into the convolution module Conv2 and is sequentially subjected to bilinear interpolation, convolution, batch normalization and ReLU activation to obtain the coarse restored depth map in the nth scene;
The fusion module sequentially applies residual connection, convolution, bilinear interpolation and Sigmoid activation to the coarse restored depth map to obtain the accurate restored depth map in the nth scene;
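Step 4.3 relies on bilinear interpolation to bring the low-resolution fusion feature back up toward depth-map resolution. As an illustration of that operation only (not the patent's implementation, which wraps it between learned convolutions), a 2× bilinear upsampling can be sketched in NumPy:

```python
import numpy as np

def bilinear_upsample2x(x: np.ndarray) -> np.ndarray:
    """Upsample a 2-D array by a factor of 2 with bilinear interpolation,
    interpolating first along rows, then along columns."""
    h, w = x.shape
    rows = np.linspace(0, h - 1, 2 * h)   # fractional row coordinates
    cols = np.linspace(0, w - 1, 2 * w)   # fractional column coordinates
    # interpolate each column to the new row grid
    tmp = np.array([np.interp(rows, np.arange(h), x[:, j]) for j in range(w)]).T
    # interpolate each row of the intermediate to the new column grid
    return np.array([np.interp(cols, np.arange(w), tmp[i, :]) for i in range(2 * h)])
```

Corner pixels are preserved exactly, and intermediate pixels are linear blends of their neighbors.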
Step 4.4, the decoding network comprises: the device comprises a bridging module and a decoding module;
The bridging module sequentially applies convolution, batch normalization and ReLU activation to the jth dimension-reduced feature C_n^j to obtain the bridging feature B_n;
The decoding module consists of j decoding stages, each decoding stage consists of three continuous deconvolution modules, and each deconvolution module consists of a deconvolution layer, a batch normalization layer and a ReLU activation layer in sequence;
when i=1, the bridging feature B_n and the fusion feature E_n′ are input together into the ith decoding stage for processing to obtain the ith coarse salient image S_n^1;
when i=2, 3, …, j, the (i-1)th coarse salient image S_n^{i-1} is up-sampled by a factor of two and then input into the ith decoding stage together with the corresponding dimension-reduced feature for processing to obtain the ith coarse salient image S_n^i; the jth decoding stage thus outputs the jth coarse salient image S_n^j, and the coarse salient images of the nth scene form the set {S_n^1, S_n^2, …, S_n^j};
In this embodiment, the number of decoding stages is j=5.
Step 4.5, constructing an optimization module, which comprises the following steps: an encoder, a decoder;
the encoder and the decoder sequentially process the jth coarse salient image S_n^j to generate the accurate predicted saliency map pre_n in the nth scene;
Step 5, training the salient target detection model based on light field refocusing data enhancement;
step 5.1, establishing a loss function;
Step 5.1.1, establishing the spatial loss function L_s^n, the edge loss function L_e^n and the depth loss function L_d^n in the nth scene through formula (2), formula (3) and formula (4), respectively;
in formulas (2), (3) and (4), G_n′ represents the data-enhanced true saliency map corresponding to the focal stack in the nth scene, TP_n represents the region of pre_n correctly predicted as the salient target, FN_n represents the region where the salient target is mispredicted as background, FP_n represents the region of pre_n where the background is mispredicted as the salient target, and β represents the balance factor;
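Formulas (2)–(4) themselves are not reproduced in this text. As one plausible, purely hypothetical instantiation of a loss built from TP_n, FN_n, FP_n and the balance factor β, a soft F-measure loss could look like the following (the function name, the soft relaxation and the β² value are all assumptions):

```python
import numpy as np

def fmeasure_loss(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3) -> float:
    """Soft F-measure loss from TP, FN and FP with balance factor beta^2.

    pred: predicted saliency map with values in [0, 1]
    gt:   binary true saliency map
    """
    tp = np.sum(pred * gt)            # correctly predicted salient region
    fn = np.sum((1.0 - pred) * gt)    # salient target mispredicted as background
    fp = np.sum(pred * (1.0 - gt))    # background mispredicted as salient target
    f = (1.0 + beta2) * tp / ((1.0 + beta2) * tp + beta2 * fn + fp + 1e-8)
    return float(1.0 - f)             # perfect prediction -> loss near 0
```

A perfect prediction drives the loss toward 0, while a fully inverted prediction drives it toward 1.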
Step 5.1.2, establishing the total loss function L_n in the nth scene through formula (5):
In this embodiment, the network is trained for 40 epochs in the training phase; the initial learning rate is set to 0.0001, the momentum factors are set to (0.9, 0.999), the weight decay is set to 1e-8, and the learning rate drops by 20% every 10 epochs.
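Concretely, the decay schedule of this embodiment (initial rate 0.0001, dropping by 20% every 10 epochs over 40 epochs) amounts to the following small helper (the function name is illustrative):

```python
def learning_rate(epoch: int, base_lr: float = 1e-4,
                  drop: float = 0.2, step: int = 10) -> float:
    """Learning rate at a given epoch: the rate is multiplied by (1 - drop)
    after every `step` epochs, matching the 20%-every-10-epochs schedule."""
    return base_lr * (1.0 - drop) ** (epoch // step)
```

So epochs 0–9 use 1e-4, epochs 10–19 use 8e-5, epochs 20–29 use 6.4e-5, and epochs 30–39 use 5.12e-5.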
Step 5.2, training the saliency target detection model with a stochastic gradient descent algorithm, calculating the total loss function in each scene to update the network parameters until the total loss function converges, thereby obtaining the optimal saliency target detection model for performing salient target detection on light field images.
In this embodiment, an electronic device includes a memory for storing a program supporting the processor to execute the above method, and a processor configured to execute the program stored in the memory.
In this embodiment, a computer-readable storage medium stores a computer program that, when executed by a processor, performs the steps of the method described above.
Table 1 compares the salient target detection method based on light field refocusing data enhancement of the invention against 8 learning-based salient target detection methods on the test sets of the light field saliency detection datasets DUTLF-V2, DUTLF-FS, Lytro-Illum, HFUT-Lytro and LFSD, using S_α, F_β, E_φ and MAE as evaluation criteria. S_α measures the structural similarity between the predicted saliency map and the true saliency map; the closer the value is to 1, the better the salient target detection. F_β is a weighted harmonic mean of precision and recall; the closer the value is to 1, the better the salient target detection. E_φ is a metric that considers both local pixel similarity and global pixel statistics between the predicted saliency map and the true saliency map; the closer the value is to 1, the better the salient target detection. MAE is the mean absolute error between the predicted saliency map and the true saliency map; the closer the value is to 0, the better the salient target detection. The quantitative analysis of Table 1 shows that, in the test on the currently largest light field dataset DUTLF-V2, the invention obtains the best results on all evaluation indexes; in the test on the dataset DUTLF-FS, every index achieves the best result; in the test on the dataset Lytro-Illum, the invention likewise ranks first on every evaluation index; in the test on the dataset HFUT-Lytro, the invention obtains a suboptimal result on S_α, a poorer result on E_φ, the best result on F_β, and a poorer result on MAE; in the test on the dataset LFSD, all evaluation indexes give poorer results.
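The two simplest of these criteria can be sketched directly (β² = 0.3 is the value commonly used for the F-measure in salient object detection; it is an assumption, not a value stated in this text):

```python
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between predicted and true saliency maps
    (closer to 0 is better)."""
    return float(np.mean(np.abs(pred - gt)))

def f_measure(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3) -> float:
    """Weighted harmonic mean of precision and recall
    (closer to 1 is better)."""
    tp = float(np.sum(pred * gt))
    precision = tp / (float(np.sum(pred)) + 1e-8)
    recall = tp / (float(np.sum(gt)) + 1e-8)
    return (1.0 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```

S_α and E_φ involve structure- and alignment-aware terms and are not reproduced here.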
The poorer results on the HFUT-Lytro and LFSD datasets arise because these two datasets were acquired with a first-generation light field camera, and the resulting light field data suffer from problems such as color distortion.
TABLE 1
Fig. 3 compares the salient target detection method based on light field refocusing data enhancement of the invention with other current salient target detection methods on the LFSD, HFUT-Lytro, Lytro-Illum, DUTLF-FS and DUTLF-V2 datasets (from top to bottom), covering a variety of challenging scenarios including simple scenes, complex scenes, dim light and strong highlights. "Ours" denotes the light field salient target detection method of the invention; it can be seen intuitively that the method of the invention has obvious advantages in salient target localization, segmentation and edge details.
Claims (3)
1. A salient object detection method based on light field refocusing data enhancement, characterized by comprising the following steps:
step 1, refocusing light field data to obtain the light field data under different focusing parameters;
step 1.1, recording the light field data of the nth scene as L_n(u,v,x,y), wherein u and v respectively represent any horizontal viewing angle and any vertical viewing angle in the viewing angle dimension, u∈[1,M], v∈[1,M], and M represents the maximum number of viewing angles in the horizontal and vertical directions; x and y respectively represent the pixel coordinates in any horizontal direction and any vertical direction in the spatial dimension, x∈[1,X], y∈[1,Y], and X and Y respectively represent the maximum spatial width and the maximum spatial height of the viewing angle image; n∈[1,N], and N represents the number of light field data; F represents the distance from the light field camera main lens to the sensor;
step 1.2, refocusing the light field data L_n of the nth scene at a virtual focal plane F_α to obtain the refocused light field data L_n^α(u,v,x′,y′), wherein F_α′ is the distance from the virtual focal plane F_α to the camera main lens, and x′ and y′ respectively represent the pixel coordinates in any horizontal direction and any vertical direction in the spatial dimension of the refocused viewing angle image;
step 2, decoding the refocused light field data L_n^α to obtain refocused images focused at different depths of the scene;
step 2.1, performing computational imaging on the refocused light field data L_n^α by formula (1) to obtain the image I_n^α of the nth scene at the virtual focal plane F_α;
in formula (1), α represents the scaling coefficient between the distance from the virtual focal plane F_α to the sensor and the distance F from the light field camera main lens to the sensor;
step 2.2, taking N different scaling coefficients {α_1, α_2, …, α_m, …, α_N} and repeating step 1.2 to step 2.1 to obtain a series of refocused images {I_n^1, I_n^2, …, I_n^m, …, I_n^N} focused at different depths of the nth scene, which form the focal stack FS_n of the nth scene, wherein α_m represents the mth scaling coefficient and I_n^m represents the refocused image of the nth scene at the virtual focal plane corresponding to the mth scaling coefficient α_m; N represents the number of refocused images contained in the focal stack FS_n; the height, width and number of channels of I_n^m are H, W and C, respectively;
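In discrete form, the refocusing of steps 1.2 to 2.2 is a shift-and-sum over the sub-aperture views of the light field. The following simplified NumPy sketch illustrates the idea (the integer-pixel shifts and the (1 − 1/α) disparity scaling are assumptions of this illustration; formula (1) of the patent is the continuous version):

```python
import numpy as np

def refocus(lf: np.ndarray, alpha: float) -> np.ndarray:
    """Shift-and-sum refocusing of a 4-D light field lf[u, v, y, x] onto the
    virtual focal plane parameterized by alpha (a discrete sketch)."""
    U, V, H, W = lf.shape
    uc, vc = (U - 1) / 2.0, (V - 1) / 2.0   # center view coordinates
    out = np.zeros((H, W))
    for u in range(U):
        for v in range(V):
            # each sub-aperture view is shifted in proportion to (1 - 1/alpha)
            dy = int(round((u - uc) * (1.0 - 1.0 / alpha)))
            dx = int(round((v - vc) * (1.0 - 1.0 / alpha)))
            out += np.roll(lf[u, v], shift=(dy, dx), axis=(0, 1))
    return out / (U * V)   # average over all views
```

Varying alpha over the coefficients {α_1, …, α_N} and stacking the outputs yields a focal stack of images focused at different depths.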
step 3, performing data enhancement processing on the refocused images contained in the focal stack FS_n of the nth scene to obtain the data-enhanced focal stack FS_n′ of the nth scene, wherein I_n′^m represents the refocused image of the nth scene at the virtual focal plane corresponding to the mth scaling coefficient α_m after the enhancement processing;
recording the depth map and the true saliency map of the nth scene as D_n and G_n respectively, and performing data enhancement processing on the depth map D_n and the true saliency map G_n of the nth scene to obtain the data-enhanced depth map and the data-enhanced true saliency map;
step 4, constructing a salient object detection model based on light field refocusing data enhancement, which comprises: an encoding network, an RGB and depth fusion module, a depth recovery module, a decoding network and an optimization module;
step 4.1, the coding network comprises an RGB network and a depth network; the RGB network takes ResNet18 as its backbone network and comprises j basic blocks and j channel dimension reduction modules; the depth network is composed of j convolution modules;
the data-enhanced refocused image I_n′^m is input into the saliency target detection model and is sequentially convolved by the j basic blocks of the RGB network in the coding network to obtain the j refocusing features {R_n^1, R_n^2, …, R_n^j} in the nth scene, wherein R_n^i represents the ith feature map of the refocused image;
each channel dimension reduction module is sequentially composed of two convolution layers, a batch normalization layer and a ReLU activation layer;
the j channel dimension reduction modules respectively process the j refocusing features {R_n^1, R_n^2, …, R_n^j} in the nth scene to obtain the j dimension-reduced features {C_n^1, C_n^2, …, C_n^j} of the nth scene, wherein C_n^i represents the ith feature after dimension reduction;
the data-enhanced depth map is input into the saliency target detection model and is sequentially convolved by the j convolution modules of the depth network in the coding network to obtain the depth feature D_n′ in the nth scene;
step 4.2, constructing the RGB and depth fusion module, which sequentially comprises: an IBR module, a convolution module Conv1 and an IRB module;
the jth dimension-reduced feature C_n^j in the nth scene and the depth feature D_n′ are multiplied pixel-wise, and the product is input into the RGB and depth fusion module, where the IBR module performs convolution processing to obtain the preliminary fusion feature E_n in the nth scene;
the convolution module Conv1 convolves the jth dimension-reduced feature C_n^j to obtain the refocused image feature P_n;
the fusion feature E_n, the refocused image feature P_n obtained by Conv1 and the depth feature D_n′ in the nth scene are multiplied pixel-wise, and the product is input into the IRB module, which sequentially performs convolution, batch normalization and ReLU activation to obtain the final fusion feature E_n′ in the nth scene;
Step 4.3, the depth recovery module includes: a convolution module Conv2 and a fusion module;
the fusion feature E_n′ is input into the convolution module Conv2 and is sequentially subjected to bilinear interpolation, convolution, batch normalization and ReLU activation to obtain the coarse restored depth map in the nth scene;
the fusion module sequentially applies residual connection, convolution, bilinear interpolation and Sigmoid activation to the coarse restored depth map to obtain the accurate restored depth map in the nth scene;
Step 4.4, the decoding network includes: the device comprises a bridging module and a decoding module;
the bridging module sequentially applies convolution, batch normalization and ReLU activation to the jth dimension-reduced feature C_n^j to obtain the bridging feature B_n;
The decoding module consists of j decoding stages, each decoding stage consists of three continuous deconvolution modules, and each deconvolution module consists of a deconvolution layer, a batch normalization layer and a ReLU activation layer in sequence;
when i=1, the bridging feature B_n and the fusion feature E_n′ are input together into the ith decoding stage for processing to obtain the ith coarse salient image S_n^1;
when i=2, 3, …, j, the (i-1)th coarse salient image S_n^{i-1} is up-sampled by a factor of two and then input into the ith decoding stage together with the corresponding dimension-reduced feature for processing to obtain the ith coarse salient image S_n^i; the jth decoding stage thus outputs the jth coarse salient image S_n^j, and the coarse salient images of the nth scene form the set {S_n^1, S_n^2, …, S_n^j};
step 4.5, constructing the optimization module, which comprises: an encoder and a decoder;
the encoder and the decoder sequentially process the jth coarse salient image S_n^j to generate the accurate predicted saliency map pre_n in the nth scene;
step 5, training the salient target detection model based on light field refocusing data enhancement;
step 5.1, establishing a loss function;
step 5.1.1, establishing the spatial loss function L_s^n, the edge loss function L_e^n and the depth loss function L_d^n in the nth scene through formula (2), formula (3) and formula (4), respectively;
in formulas (2), (3) and (4), G_n′ represents the data-enhanced true saliency map corresponding to the focal stack in the nth scene, TP_n represents the region of pre_n correctly predicted as the salient target, FN_n represents the region where the salient target is mispredicted as background, FP_n represents the region of pre_n where the background is mispredicted as the salient target, and β represents the balance factor;
step 5.1.2, establishing the total loss function L_n in the nth scene through formula (5):
step 5.2, training the saliency target detection model with a stochastic gradient descent algorithm, calculating the total loss function in each scene to update the network parameters until the total loss function converges, thereby obtaining the optimal saliency target detection model for performing salient target detection on light field images.
2. An electronic device comprising a memory and a processor, wherein the memory is configured to store a program that supports the processor in performing the salient object detection method of claim 1, and the processor is configured to execute the program stored in the memory.
3. A computer readable storage medium having a computer program stored thereon, characterized in that the computer program when executed by a processor performs the steps of the salient object detection method of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310683470.1A CN116778187A (en) | 2023-06-09 | 2023-06-09 | Salient target detection method based on light field refocusing data enhancement |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116778187A true CN116778187A (en) | 2023-09-19 |
Family
ID=87987184
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118135120A * | 2024-05-06 | 2024-06-04 | Wuhan University | Three-dimensional reconstruction and micromanipulation system for surface morphology of nano sample |
CN118135120B * | 2024-05-06 | 2024-07-12 | Wuhan University | Three-dimensional reconstruction and micromanipulation system for surface morphology of nano sample |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11200424B2 (en) | Space-time memory network for locating target object in video content | |
CN111160297A (en) | Pedestrian re-identification method and device based on residual attention mechanism space-time combined model | |
CN107481279A (en) | A kind of monocular video depth map computational methods | |
CN110827312B (en) | Learning method based on cooperative visual attention neural network | |
CN114565655B (en) | Depth estimation method and device based on pyramid segmentation attention | |
CN113298815A (en) | Semi-supervised remote sensing image semantic segmentation method and device and computer equipment | |
CN113361542B (en) | Local feature extraction method based on deep learning | |
CN113343822B (en) | Light field saliency target detection method based on 3D convolution | |
CN112819853B (en) | Visual odometer method based on semantic priori | |
CN107766864B (en) | Method and device for extracting features and method and device for object recognition | |
CN111239684A (en) | Binocular fast distance measurement method based on YoloV3 deep learning | |
CN114140623A (en) | Image feature point extraction method and system | |
CN116778187A (en) | Salient target detection method based on light field refocusing data enhancement | |
CN114463492A (en) | Adaptive channel attention three-dimensional reconstruction method based on deep learning | |
EP3185212A1 (en) | Dynamic particle filter parameterization | |
CN112464775A (en) | Video target re-identification method based on multi-branch network | |
CN114663880A (en) | Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism | |
CN112329662B (en) | Multi-view saliency estimation method based on unsupervised learning | |
CN113011359B (en) | Method for simultaneously detecting plane structure and generating plane description based on image and application | |
CN113850761A (en) | Remote sensing image target detection method based on multi-angle detection frame | |
CN112070181B (en) | Image stream-based cooperative detection method and device and storage medium | |
CN117456330A (en) | MSFAF-Net-based low-illumination target detection method | |
CN110910497A (en) | Method and system for realizing augmented reality map | |
CN108154107B (en) | Method for determining scene category to which remote sensing image belongs | |
CN116665293A (en) | Sitting posture early warning method and system based on monocular vision |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||