CN116778187A - Salient target detection method based on light field refocusing data enhancement - Google Patents

Salient target detection method based on light field refocusing data enhancement

Info

Publication number
CN116778187A
CN116778187A (application number CN202310683470.1A)
Authority
CN
China
Prior art keywords
light field
module
refocusing
image
depth
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310683470.1A
Other languages
Chinese (zh)
Inventor
王昕
张勇
熊高敏
高隽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202310683470.1A priority Critical patent/CN116778187A/en
Publication of CN116778187A publication Critical patent/CN116778187A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/09Supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a salient object detection method based on light field refocusing data enhancement, which comprises the following steps: 1. refocusing the light field data to obtain light field data under different focusing parameters; 2. decoding the refocused light field data into refocused images and performing data enhancement on them; 3. constructing a deep convolutional neural network that takes the light field refocused images and a depth map as input, and training it to obtain a light field salient object detection model; 4. performing salient object detection on the light field refocused images and depth maps to be detected with the trained model, and evaluating the accuracy of the model on the data to be detected. The invention realizes salient object detection based on light field refocusing data enhancement, thereby effectively improving the accuracy of salient object detection for scenes in complex and changeable environments.

Description

Salient target detection method based on light field refocusing data enhancement
Technical Field
The invention belongs to the fields of computer vision, image processing and analysis, and particularly relates to a salient target detection method based on light field refocusing data enhancement.
Background
Visual saliency is an attention mechanism of the human visual system: when we observe a scene, there is usually a salient region that attracts our attention, and we naturally ignore the non-salient regions, which allows humans to process large amounts of image data quickly. Salient object detection means that a computer imitates the human visual system to quickly and accurately locate the region or target of interest in the field of view; accurate salient object detection can provide reliable prior information for object detection and recognition, image segmentation and visual tracking.
According to the type of input data, salient object detection is mainly divided into three categories: (1) salient object detection based on RGB images; (2) salient object detection based on RGB-D images; (3) salient object detection based on light fields. In complex scenes such as strong or dim lighting, partial occlusion, cluttered backgrounds, and similar foreground and background, it is difficult to effectively detect a salient target or region with only an RGB image as input. Taking RGB-D images, i.e. RGB images together with depth maps, as input additionally introduces depth information and has been shown to improve salient object detection performance, but when the quality of the depth map is poor, the detection result is also poor. The light field describes the amount of light traveling in every direction through every point in space and simultaneously records the position information and viewing-angle information of light radiation in space, so it describes a natural scene more completely.
Currently, a number of works have studied light-field-based salient object detection; they can be broadly divided into feature-based methods and learning-based methods. Feature-based methods estimate the salient target using information such as color, depth and background priors on the basis of a light field focal stack and an all-in-focus image; such methods consider only a few limited features and usually do not achieve high detection accuracy. Learning-based methods train a salient object detection model with a certain amount of training data and test the trained model on test data; relying on the strong learning ability of deep neural networks, they integrate many kinds of features and greatly improve detection accuracy compared with feature-based methods. However, these learning-based approaches still have drawbacks:
1. For focal-stack-based methods, the local blur differences between refocused images make it difficult to obtain a saliency map with sharp edges, and when the depth-of-field range is narrow, such methods can hardly achieve ideal results;
2. Most learning-based methods are trained and tested only on the dataset proposed by their own authors, and the lack of comparison across datasets makes it difficult to demonstrate the robustness of the models;
3. Most learning-based methods use focal stacks; the differences in focus depth between the refocused images contained in a focal stack are small, and the all-in-focus image is itself a special refocused image, so there is a certain redundancy between the two types of images, which imposes a large computational overhead on the network.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a salient object detection method based on light field refocusing data enhancement, so as to fully exploit the special properties of light field data while reducing the computational burden, thereby effectively improving the precision and accuracy of salient object detection for scenes in complex and changeable environments.
In order to solve the technical problems, the invention adopts the following technical scheme:
the invention discloses a salient object detection method based on light field refocusing data enhancement, which is characterized by comprising the following steps of:
step 1, refocusing light field data to obtain the light field data under different focusing parameters;
Step 1.1, record the light field data of the nth scene as $L_F^n(u,v,x,y)$, where u and v respectively denote any horizontal and vertical viewing-angle coordinate in the angular dimension, $u\in[1,M]$, $v\in[1,M]$, and M denotes the maximum number of viewing angles in the horizontal and vertical directions; x and y respectively denote the pixel coordinates in any horizontal and vertical direction in the spatial dimension, $x\in[1,X]$, $y\in[1,Y]$, and X and Y respectively denote the maximum spatial width and maximum spatial height of a viewing-angle image; $n\in[1,N]$, N denotes the number of light field scenes, and F denotes the distance from the main lens of the light field camera to the sensor;
Step 1.2, refocus the light field data $L_F^n(u,v,x,y)$ of the nth scene at a virtual focal plane $F_\alpha$ to obtain the refocused light field data $L_{F'_\alpha}^n(u,v,x',y')$, where $F'_\alpha$ is the distance from the virtual focal plane $F_\alpha$ to the main lens of the camera, and x' and y' respectively denote the pixel coordinates in any horizontal and vertical direction in the spatial dimension of the refocused viewing-angle image;
Step 2, decode the refocused light field data $L_{F'_\alpha}^n(u,v,x',y')$ to obtain refocused images focused at different depths of the scene;
Step 2.1, perform computational imaging on the refocused light field $L_{F'_\alpha}^n(u,v,x',y')$ with formula (1) to obtain the image $I_\alpha^n(x',y')$ of the nth scene at the virtual focal plane $F_\alpha$;
In formula (1), α denotes the scale coefficient between the distance from the virtual focal plane $F_\alpha$ to the sensor and the distance F from the main lens of the light field camera to the sensor;
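Formula (1) itself is not reproduced in this text. For reference, the standard digital refocusing formula for a two-plane light field parameterization (Ng et al.), to which formula (1) presumably corresponds under the definitions above (this is an assumption, not the patent's exact expression), is

$$I_\alpha^n(x',y') \;=\; \frac{1}{\alpha^2 F^2}\sum_{u=1}^{M}\sum_{v=1}^{M} L_F^n\!\left(u,\,v,\;u+\frac{x'-u}{\alpha},\;v+\frac{y'-v}{\alpha}\right),$$

i.e. each viewing-angle image is shifted in proportion to $(1-1/\alpha)$ and the shifted views are summed, which focuses the result at the virtual focal plane with $F'_\alpha=\alpha F$.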
Step 2.2, take N different scale coefficients $\{\alpha_1,\alpha_2,\dots,\alpha_m,\dots,\alpha_N\}$ and repeat steps 1.2 to 2.1 to obtain a series of refocused images $\{I_{\alpha_1}^n,I_{\alpha_2}^n,\dots,I_{\alpha_m}^n,\dots,I_{\alpha_N}^n\}$ focused at different depths of the nth scene, which form the focal stack $FS^n$ of the nth scene, where $\alpha_m$ denotes the mth scale coefficient, $I_{\alpha_m}^n$ denotes the refocused image of the nth scene at the virtual focal plane $F_{\alpha_m}$ under the mth scale coefficient $\alpha_m$, and N denotes the number of refocused images contained in the focal stack $FS^n$; let the height, width and number of channels of $I_{\alpha_m}^n$ be H, W and C respectively;
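To illustrate how a focal stack of the kind described in step 2.2 can be produced from the viewing-angle images by the shift-and-add principle, a minimal sketch is given below; it assumes the light field is stored as a NumPy array of sub-aperture views, and the function names and array layout are hypothetical rather than taken from the patent.

    import numpy as np
    from scipy.ndimage import shift as nd_shift

    def refocus(lf, alpha):
        """Shift-and-add refocusing.

        lf: array of shape (M, M, H, W, C) holding the viewing-angle images
        L_F^n(u, v, x, y); alpha: scale coefficient of the virtual focal plane.
        Returns the refocused image I_alpha^n of shape (H, W, C).
        """
        M = lf.shape[0]
        center = (M - 1) / 2.0
        acc = np.zeros(lf.shape[2:], dtype=np.float64)
        for u in range(M):
            for v in range(M):
                # each view is shifted in proportion to its offset from the
                # central view and to (1 - 1/alpha), then accumulated
                du = (u - center) * (1.0 - 1.0 / alpha)
                dv = (v - center) * (1.0 - 1.0 / alpha)
                acc += nd_shift(lf[u, v], (du, dv, 0), order=1, mode='nearest')
        return acc / (M * M)

    def build_focal_stack(lf, alphas):
        """Refocused images for the scale coefficients {alpha_1, ..., alpha_N}."""
        return np.stack([refocus(lf, a) for a in alphas], axis=0)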
Step 3, perform data enhancement processing on the refocused images contained in the focal stack $FS^n$ of the nth scene to obtain the data-enhanced focal stack $\widetilde{FS}^n=\{\tilde I_{\alpha_1}^n,\dots,\tilde I_{\alpha_m}^n,\dots,\tilde I_{\alpha_N}^n\}$ of the nth scene, where $\tilde I_{\alpha_m}^n$ denotes the refocused image of the nth scene at the virtual focal plane $F_{\alpha_m}$ under the mth scale coefficient $\alpha_m$ after enhancement processing;
Record the depth map and ground-truth saliency map of the nth scene as $D_n$ and $G_n$ respectively, and perform data enhancement processing on the depth map $D_n$ and the ground-truth saliency map $G_n$ of the nth scene to obtain the data-enhanced depth map $\tilde D_n$ and ground-truth saliency map $\tilde G_n$;
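The enhancement operations themselves are not enumerated in the text; the sketch below shows one plausible form of step 3, assuming typical flip/brightness augmentations and assuming that geometric transforms must be applied identically to every refocused image, the depth map and the ground-truth map so that they remain aligned. The function name is hypothetical.

    import random
    import numpy as np

    def augment_scene(focal_stack, depth, gt):
        """focal_stack: (N, H, W, 3); depth, gt: (H, W). Returns enhanced copies."""
        # horizontal flip applied consistently to all refocused images and labels
        if random.random() < 0.5:
            focal_stack = focal_stack[:, :, ::-1, :].copy()
            depth = depth[:, ::-1].copy()
            gt = gt[:, ::-1].copy()
        # photometric jitter applied only to the refocused images
        if random.random() < 0.5:
            focal_stack = np.clip(focal_stack * random.uniform(0.8, 1.2), 0, 255)
        return focal_stack, depth, gt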
Step 4, construct a salient object detection model based on light field refocusing data enhancement, comprising: an encoding network, an RGB-depth fusion module, a depth recovery module, a decoding network and an optimization module;
Step 4.1, the encoding network comprises an RGB network and a depth network; the RGB network takes ResNet18 as its backbone and comprises j basic blocks and j channel dimension reduction modules; the depth network is composed of j convolution modules;
The refocused image $\tilde I_{\alpha_m}^n$ is input into the salient object detection model and sequentially processed by the convolution of the j basic blocks of the RGB network in the encoding network to obtain the j refocus features $\{R_1^n,R_2^n,\dots,R_j^n\}$ of the nth scene, where $R_i^n$ denotes the ith feature map of the refocused image $\tilde I_{\alpha_m}^n$;
each channel dimension reduction module is sequentially composed of two convolution layers, a batch normalization layer and a ReLU activation layer;
The j channel dimension reduction modules respectively process the j refocus features $\{R_1^n,R_2^n,\dots,R_j^n\}$ of the nth scene to obtain the j dimension-reduced features $\{r_1^n,r_2^n,\dots,r_j^n\}$ of the nth scene, where $r_i^n$ denotes the ith feature after dimension reduction;
The depth map $\tilde D_n$ is input into the salient object detection model and sequentially processed by the convolution of the j convolution modules of the depth network in the encoding network to obtain the depth feature $D'_n$ of the nth scene;
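A minimal PyTorch sketch of the channel dimension reduction module described in step 4.1 (two convolution layers followed by batch normalization and ReLU, reducing to 32 channels as in the embodiment) and of one convolution module of the depth network is given below; kernel sizes, strides and channel widths are assumptions, since the patent does not specify them.

    import torch.nn as nn

    class ChannelReduction(nn.Module):
        """Two convolution layers, then batch normalization and ReLU."""
        def __init__(self, in_ch, out_ch=32):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )

        def forward(self, x):
            return self.body(x)

    class DepthConvBlock(nn.Module):
        """One convolution module of the depth network (assumed Conv-BN-ReLU, stride 2)."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )

        def forward(self, x):
            return self.body(x)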
Step 4.2, construct the RGB-depth fusion module, which sequentially comprises: an IBR module, a convolution module Conv1 and an IRB module;
The jth dimension-reduced feature $r_j^n$ of the nth scene and the depth feature $D'_n$ are multiplied at the pixel level, and the result is input into the RGB-depth fusion module, where the IBR module performs convolution processing to obtain the preliminary fusion feature $E_n$ of the nth scene;
The convolution module Conv1 performs convolution processing on the jth dimension-reduced feature $r_j^n$ to obtain the refocused image feature $\hat r_j^n$;
The preliminary fusion feature $E_n$, the refocused image feature $\hat r_j^n$ and the depth feature $D'_n$ of the nth scene are multiplied at the pixel level, and the result is input into the IRB module, which sequentially performs convolution, batch normalization and ReLU activation to obtain the final fusion feature $E'_n$ of the nth scene;
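A sketch of the RGB-depth fusion module of step 4.2 follows; the internal structure of the IBR and IRB modules is not spelled out beyond "convolution" and "convolution, batch normalization, ReLU", so the blocks below are assumptions consistent with that description, and the channel width is taken from the embodiment.

    import torch.nn as nn

    class RGBDepthFusion(nn.Module):
        def __init__(self, ch=32):
            super().__init__()
            # IBR module: convolution on the product of the RGB feature and the depth feature
            self.ibr = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                     nn.BatchNorm2d(ch), nn.ReLU(inplace=True))
            # Conv1: convolution on the j-th dimension-reduced feature
            self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
            # IRB module: convolution, batch normalization, ReLU on the second product
            self.irb = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1),
                                     nn.BatchNorm2d(ch), nn.ReLU(inplace=True))

        def forward(self, r_j, d_feat):
            e_n = self.ibr(r_j * d_feat)               # preliminary fusion feature E_n
            r_hat = self.conv1(r_j)                    # refocused image feature
            return self.irb(e_n * r_hat * d_feat)      # final fusion feature E'_n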
Step 4.3, the depth recovery module includes: a convolution module Conv2 and a fusion module;
The fusion feature $E'_n$ is input into the convolution module Conv2 and sequentially subjected to bilinear interpolation, convolution, batch normalization and ReLU activation to obtain the coarse recovered depth map $\hat D_n^{c}$ of the nth scene;
The fusion module sequentially applies residual, convolution, bilinear interpolation and Sigmoid activation processing to the coarse recovered depth map $\hat D_n^{c}$ to obtain the accurate recovered depth map $\hat D_n$ of the nth scene;
Step 4.4, the decoding network includes: the device comprises a bridging module and a decoding module;
The bridging module sequentially applies convolution, batch normalization and ReLU activation processing to the jth dimension-reduced feature $r_j^n$ to obtain the bridging feature $B_n$;
The decoding module consists of j decoding stages, each decoding stage consists of three continuous deconvolution modules, and each deconvolution module consists of a deconvolution layer, a batch normalization layer and a ReLU activation layer in sequence;
When i=1, the bridging feature $B_n$ and the fusion feature $E'_n$ are input together into the ith decoding stage for processing to obtain the ith coarse salient image $S_1^n$;
When i=2,3,…,j, the (i−1)th coarse salient image is up-sampled by a factor of two and input into the ith decoding stage together with the corresponding dimension-reduced feature for processing to obtain the ith coarse salient image $S_i^n$; the jth decoding stage thus outputs the jth coarse salient image $S_j^n$, and the coarse salient images form the set $\{S_1^n,S_2^n,\dots,S_j^n\}$ of the nth scene;
Step 4.5, constructing the optimization module, which comprises the following steps: an encoder, a decoder;
The encoder and the decoder sequentially process the jth coarse salient image $S_j^n$ to generate the accurate predicted saliency map $pre_n$ of the nth scene;
Step 5, train the salient object detection model based on light field refocusing data enhancement;
step 5.1, establishing a loss function;
Step 5.1.1, establish the spatial loss function $L_s^n$, the edge loss function $L_e^n$ and the depth loss function $L_d^n$ of the nth scene through formula (2), formula (3) and formula (4) respectively;
In formulas (2), (3) and (4), $\tilde G_n$ denotes the ground-truth saliency map corresponding to the focal stack $\widetilde{FS}^n$ of the nth scene, $TP_n$ denotes the region of $pre_n$ correctly predicted as the salient target, $FN_n$ denotes the region of $\tilde G_n$ in which the salient target is mispredicted as background, $FP_n$ denotes the region of $pre_n$ in which the background is mispredicted as the salient target, and β denotes a balance factor;
Step 5.1.2, establish the total loss function $L_n$ of the nth scene through formula (5);
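Formulas (2)-(5) are likewise not reproduced in this text. A plausible instantiation consistent with the quantities defined above — an assumption, not the patent's exact formulas — is a binary cross-entropy spatial loss, an F-measure-style edge loss built from $TP_n$, $FP_n$, $FN_n$ and the balance factor β, an L1 depth loss between the recovered and enhanced depth maps, and their sum as the total loss:

$$L_s^n = -\sum_{p}\Big[\tilde G_n(p)\log pre_n(p) + \big(1-\tilde G_n(p)\big)\log\big(1-pre_n(p)\big)\Big]$$

$$L_e^n = 1-\frac{(1+\beta^2)\,P_n R_n}{\beta^2 P_n + R_n},\qquad P_n=\frac{|TP_n|}{|TP_n|+|FP_n|},\quad R_n=\frac{|TP_n|}{|TP_n|+|FN_n|}$$

$$L_d^n = \big\lVert \hat D_n - \tilde D_n \big\rVert_1,\qquad L_n = L_s^n + L_e^n + L_d^n$$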
Step 5.2, train the salient object detection model using a stochastic gradient descent algorithm, computing the total loss function for each scene to update the network parameters until the total loss function converges, so as to obtain the optimal salient object detection model for performing salient object detection on light field images.
The electronic device of the present invention includes a memory and a processor, wherein the memory is configured to store a program for supporting the processor to execute the saliency target detection method, and the processor is configured to execute the program stored in the memory.
The invention relates to a computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, performs the steps of the saliency target detection method.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention constructs a deep neural network based on light field refocusing data enhancement and uses label data to supervise its learning, obtaining a robust light field salient object detection model; compared with focal-stack-based models, it greatly reduces the computational burden of the network while solving the problem of low detection accuracy.
2. By constructing a deep neural network based on light field refocusing data enhancement and additionally taking the depth map as input, the invention converts the viewing-angle change of local regions into the depth change of the whole image region, which is ignored by networks based only on light field data.
3. By constructing a deep neural network based on light field refocusing data enhancement, building the encoder and decoder symmetrically following the idea of U-Net, and using the channel dimension reduction module to reduce the dimensionality of the features extracted by the encoder, the invention effectively reduces the amount of data the decoder needs to compute, alleviating the huge computation and high time cost of deep neural networks based on refocusing data.
4. The optimization module refines the light field salient object detection result, correcting mispredicted pixels on the contour of the salient region so that the detected edges are smoother, which improves the accuracy of light field salient object detection.
Drawings
FIG. 1 is a flow chart of salient object detection for a light field refocused image in accordance with the present invention;
FIG. 2 is a schematic diagram of a deep neural network based on light field refocusing data enhancement used in the present invention;
FIG. 3 shows salient object detection results of the present invention and other light field salient object detection methods on parts of the DUTLF-V2, DUTLF-FS, Lytro-Illum, HFUT-Lytro and LFSD test sets.
Detailed Description
As shown in fig. 1, in the embodiment, a salient object detection method based on light field refocusing data enhancement is to construct a deep neural network based on light field refocusing data enhancement to obtain a light field salient object detection feature model capable of effectively detecting salient objects in complex scenes, so that accuracy and precision of detecting the salient objects of the scenes in complex and changeable environments are effectively improved. Specifically, the method comprises the following steps:
step 1, refocusing light field data to obtain the light field data under different focusing parameters;
Step 1.1, record the light field data of the nth scene as $L_F^n(u,v,x,y)$, where u and v respectively denote any horizontal and vertical viewing-angle coordinate in the angular dimension, $u\in[1,M]$, $v\in[1,M]$, and M denotes the maximum number of viewing angles in the horizontal and vertical directions; x and y respectively denote the pixel coordinates in any horizontal and vertical direction in the spatial dimension, $x\in[1,X]$, $y\in[1,Y]$, and X and Y respectively denote the maximum spatial width and maximum spatial height of a viewing-angle image; $n\in[1,N]$, N denotes the number of light field scenes, and F denotes the distance from the main lens of the light field camera to the sensor;
Training and testing are performed in this embodiment on the light field salient object detection dataset DUTLF-V2, which contains N=4204 scenes in total, of which the training set contains 2957 scenes and the test set contains 1247 scenes; the maximum number of viewing angles in the horizontal and vertical directions is M=9;
Step 1.2, refocus the light field data $L_F^n(u,v,x,y)$ of the nth scene at a virtual focal plane $F_\alpha$ to obtain the refocused light field data $L_{F'_\alpha}^n(u,v,x',y')$, where $F'_\alpha$ is the distance from the virtual focal plane $F_\alpha$ to the main lens of the camera, and x' and y' respectively denote the pixel coordinates in any horizontal and vertical direction in the spatial dimension of the refocused viewing-angle image;
Step 2, decode the refocused light field data $L_{F'_\alpha}^n(u,v,x',y')$ to obtain refocused images focused at different depths of the scene;
Step 2.1, perform computational imaging on the refocused light field $L_{F'_\alpha}^n(u,v,x',y')$ with formula (1) to obtain the image $I_\alpha^n(x',y')$ of the nth scene at the virtual focal plane $F_\alpha$;
In formula (1), α denotes the scale coefficient between the distance from the virtual focal plane $F_\alpha$ to the sensor and the distance F from the main lens of the light field camera to the sensor;
Step 2.2, take N different scale coefficients $\{\alpha_1,\alpha_2,\dots,\alpha_m,\dots,\alpha_N\}$ and repeat steps 1.2 to 2.1 to obtain a series of refocused images $\{I_{\alpha_1}^n,I_{\alpha_2}^n,\dots,I_{\alpha_m}^n,\dots,I_{\alpha_N}^n\}$ focused at different depths of the scene, which form the focal stack $FS^n$ of the nth scene, where $\alpha_m$ denotes the mth scale coefficient, $I_{\alpha_m}^n$ denotes the refocused image of the nth scene at the virtual focal plane $F_{\alpha_m}$ under the mth scale coefficient $\alpha_m$, and N denotes the number of refocused images contained in the focal stack $FS^n$; let the height, width and number of channels of $I_{\alpha_m}^n$ be H, W and C respectively;
In this embodiment, α is determined by the depth of the target contained in the specific scene, and the number of refocused images is determined by the depth range of the scene. Because the depth distribution of the targets differs from scene to scene, most scenes yield 3 to 13 refocused images; to ensure data consistency, for scenes with small depth variation the existing refocused images are duplicated, and for scenes with large depth variation the excess refocused images are discarded, so that each scene contains N=12 refocused images. To reduce the computational load of the neural network, the focal stack is further down-sampled to height H=256 and width W=256; the refocused images are color images, so the number of channels is C=3.
Step 3, perform data enhancement processing on the refocused images contained in the focal stack $FS^n$ of the nth scene to obtain the data-enhanced focal stack $\widetilde{FS}^n=\{\tilde I_{\alpha_1}^n,\dots,\tilde I_{\alpha_m}^n,\dots,\tilde I_{\alpha_N}^n\}$ of the nth scene, where $\tilde I_{\alpha_m}^n$ denotes the refocused image of the nth scene at the virtual focal plane $F_{\alpha_m}$ under the mth scale coefficient $\alpha_m$ after enhancement processing;
Record the depth map and ground-truth saliency map of the nth scene as $D_n$ and $G_n$ respectively, and perform data enhancement processing on the depth map $D_n$ and the ground-truth saliency map $G_n$ of the nth scene to obtain the data-enhanced depth map $\tilde D_n$ and ground-truth saliency map $\tilde G_n$;
Step 4, construct a salient object detection model based on light field refocusing data enhancement, comprising: an encoding network, an RGB-depth fusion module, a depth recovery module, a decoding network and an optimization module, as shown in fig. 2;
Step 4.1, the encoding network comprises an RGB network and a depth network; the RGB network takes ResNet18 as its backbone and comprises j basic blocks and j channel dimension reduction modules; the depth network is composed of j convolution modules;
in this embodiment, the number j=5 of basic blocks included in the RGB network.
Each channel dimension reduction module is sequentially composed of two convolution layers, a batch normalization layer and a ReLU activation layer;
The j channel dimension reduction modules respectively process the j refocus features $\{R_1^n,R_2^n,\dots,R_j^n\}$ of the nth scene to obtain the j dimension-reduced features $\{r_1^n,r_2^n,\dots,r_j^n\}$ of the nth scene, where $r_i^n$ denotes the ith feature after dimension reduction;
in this embodiment, the number of channels after dimension reduction is 32.
The depth map $\tilde D_n$ is input into the salient object detection model and sequentially processed by the convolution of the j convolution modules of the depth network in the encoding network to obtain the depth feature $D'_n$ of the nth scene;
Step 4.2, construct the RGB-depth fusion module, which sequentially comprises: an IBR module, a convolution module Conv1 and an IRB module;
The jth dimension-reduced feature $r_j^n$ of the nth scene and the depth feature $D'_n$ are multiplied at the pixel level, and the result is input into the RGB-depth fusion module, where the IBR module performs convolution processing to obtain the preliminary fusion feature $E_n$ of the nth scene;
The convolution module Conv1 performs convolution processing on the jth dimension-reduced feature $r_j^n$ to obtain the refocused image feature $\hat r_j^n$;
The preliminary fusion feature $E_n$, the refocused image feature $\hat r_j^n$ and the depth feature $D'_n$ of the nth scene are multiplied at the pixel level, and the result is input into the IRB module, which sequentially performs convolution, batch normalization and ReLU activation to obtain the final fusion feature $E'_n$ of the nth scene;
Step 4.3, the depth restoration module includes: a convolution module Conv2 and a fusion module;
fusion feature E n′ Inputting the rough restoration depth map into a convolution module Conv2, and sequentially performing bilinear interpolation, convolution, batch normalization and ReLU activation to obtain the rough restoration depth map under the nth scene
Fusion ofModule-to-coarse recovery depth mapAfter residual error, convolution, bilinear interpolation and Sigmoid activation processing are sequentially carried out, an accurate recovery depth map +_in an nth scene is obtained>
Step 4.4, the decoding network comprises: the device comprises a bridging module and a decoding module;
the bridging module performs the j-th dimension reduction featureAfter the processing of convolution, batch normalization and ReLU activation is sequentially carried out, bridging feature B is obtained n
The decoding module consists of j decoding stages, each decoding stage consists of three continuous deconvolution modules, and each deconvolution module consists of a deconvolution layer, a batch normalization layer and a ReLU activation layer in sequence;
When i=1, the bridging feature $B_n$ and the fusion feature $E'_n$ are input together into the ith decoding stage for processing to obtain the ith coarse salient image $S_1^n$;
When i=2,3,…,j, the (i−1)th coarse salient image is up-sampled by a factor of two and input into the ith decoding stage together with the corresponding dimension-reduced feature for processing to obtain the ith coarse salient image $S_i^n$; the jth decoding stage thus outputs the jth coarse salient image $S_j^n$, and the coarse salient images form the set $\{S_1^n,S_2^n,\dots,S_j^n\}$ of the nth scene;
In this embodiment, the decoding stage is j=5.
Step 4.5, constructing an optimization module, which comprises the following steps: an encoder, a decoder;
The encoder and the decoder sequentially process the jth coarse salient image $S_j^n$ to generate the accurate predicted saliency map $pre_n$ of the nth scene;
Step 5, train the salient object detection model based on light field refocusing data enhancement;
step 5.1, establishing a loss function;
Step 5.1.1, establish the spatial loss function $L_s^n$, the edge loss function $L_e^n$ and the depth loss function $L_d^n$ of the nth scene through formula (2), formula (3) and formula (4) respectively;
In formulas (2), (3) and (4), $\tilde G_n$ denotes the ground-truth saliency map corresponding to the focal stack $\widetilde{FS}^n$ of the nth scene, $TP_n$ denotes the region of $pre_n$ correctly predicted as the salient target, $FN_n$ denotes the region of $\tilde G_n$ in which the salient target is mispredicted as background, $FP_n$ denotes the region of $pre_n$ in which the background is mispredicted as the salient target, and β denotes a balance factor;
Step 5.1.2, establish the total loss function $L_n$ of the nth scene through formula (5);
In this embodiment, in the training phase the network is trained for 40 epochs, the initial learning rate is set to 0.0001, the momentum factors are set to (0.9, 0.999), the weight decay is set to 1e-8, and the learning rate drops by 20% every 10 epochs.
Step 5.2, train the salient object detection model using a stochastic gradient descent algorithm, computing the total loss function for each scene to update the network parameters until the total loss function converges, so as to obtain the optimal salient object detection model for performing salient object detection on light field images.
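A sketch of the training configuration described in this embodiment is given below. Note that momentum factors of (0.9, 0.999) are the form used by the Adam optimizer, although the text above speaks of stochastic gradient descent, so the optimizer choice here is an assumption; the names model, train_loader and total_loss are placeholders for components defined elsewhere.

    import torch

    # model, train_loader and total_loss are assumed to be defined elsewhere
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
                                 betas=(0.9, 0.999), weight_decay=1e-8)
    # the learning rate drops by 20% every 10 epochs
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.8)

    for epoch in range(40):
        for focal_stack, depth, gt in train_loader:
            optimizer.zero_grad()
            pred, recovered_depth = model(focal_stack, depth)
            loss = total_loss(pred, recovered_depth, gt, depth)   # L_n of step 5.1.2
            loss.backward()
            optimizer.step()
        scheduler.step()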
In this embodiment, an electronic device includes a memory for storing a program supporting the processor to execute the above method, and a processor configured to execute the program stored in the memory.
In this embodiment, a computer-readable storage medium stores a computer program that, when executed by a processor, performs the steps of the method described above.
Table 1 compares the salient object detection method based on light field refocusing data enhancement of the present invention with 8 learning-based salient object detection methods on the test sets of the light field salient object detection datasets DUTLF-V2, DUTLF-FS, Lytro-Illum, HFUT-Lytro and LFSD, using $S_\alpha$, $F_\beta$, $E_\phi$ and MAE as evaluation metrics. $S_\alpha$ measures the similarity in spatial structure between the predicted saliency map and the ground-truth saliency map; the closer its value is to 1, the better the salient object detection. $F_\beta$ is the weighted harmonic mean of precision and recall; the closer its value is to 1, the better the detection. $E_\phi$ is a metric that considers both local pixel similarity and global pixel statistics between the predicted and ground-truth saliency maps; the closer its value is to 1, the better the detection. MAE describes the extent to which correct salient pixels are assigned as non-salient; the closer its value is to 0, the better the detection. From the quantitative analysis in Table 1 it can be seen that on the currently largest light field dataset, DUTLF-V2, the invention obtains the best results on all evaluation metrics; on DUTLF-FS it obtains the best results on all average metrics; on Lytro-Illum it also ranks first on every evaluation metric; on HFUT-Lytro it obtains a suboptimal result on $S_\alpha$, a poorer result on $E_\phi$, the best result on $F_\beta$, and a poorer result on MAE; on LFSD all evaluation metrics give poorer results. The poorer results on HFUT-Lytro and LFSD arise because these two datasets were acquired with a first-generation light field camera, and the obtained light field data suffer from problems such as color distortion.
TABLE 1
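For reference, MAE and the F-measure reported in Table 1 can be computed as in the sketch below (a minimal version for a single image; the S-measure and E-measure involve structural and enhanced-alignment terms from their original papers and are not reproduced here; β² = 0.3 is the value commonly used in the saliency literature and is an assumption about the table's setting):

    import numpy as np

    def mae(pred, gt):
        """Mean absolute error between predicted and ground-truth maps in [0, 1]."""
        return np.abs(pred.astype(np.float64) - gt.astype(np.float64)).mean()

    def f_measure(pred, gt, beta2=0.3, thresh=0.5):
        """Weighted harmonic mean of precision and recall at a fixed threshold."""
        p = pred >= thresh
        g = gt >= 0.5
        tp = np.logical_and(p, g).sum()
        precision = tp / (p.sum() + 1e-8)
        recall = tp / (g.sum() + 1e-8)
        return (1 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)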
Fig. 3 compares the salient object detection method based on light field refocusing data enhancement of the present invention with other current salient object detection methods on the LFSD, HFUT-Lytro, Lytro-Illum, DUTLF-FS and DUTLF-V2 datasets (from top to bottom), covering a variety of challenging scenarios including simple scenes, complex scenes, dim light and strong light. Ours denotes the light field salient object detection method of the present invention; it can be seen intuitively that the method of the invention has obvious advantages in salient object localization, segmentation and edge detail.

Claims (3)

1. The salient object detection method based on light field refocusing data enhancement is characterized by comprising the following steps of:
step 1, refocusing light field data to obtain the light field data under different focusing parameters;
Step 1.1, record the light field data of the nth scene as $L_F^n(u,v,x,y)$, where u and v respectively denote any horizontal and vertical viewing-angle coordinate in the angular dimension, $u\in[1,M]$, $v\in[1,M]$, and M denotes the maximum number of viewing angles in the horizontal and vertical directions; x and y respectively denote the pixel coordinates in any horizontal and vertical direction in the spatial dimension, $x\in[1,X]$, $y\in[1,Y]$, and X and Y respectively denote the maximum spatial width and maximum spatial height of a viewing-angle image; $n\in[1,N]$, N denotes the number of light field scenes, and F denotes the distance from the main lens of the light field camera to the sensor;
Step 1.2, refocus the light field data $L_F^n(u,v,x,y)$ of the nth scene at a virtual focal plane $F_\alpha$ to obtain the refocused light field data $L_{F'_\alpha}^n(u,v,x',y')$, where $F'_\alpha$ is the distance from the virtual focal plane $F_\alpha$ to the main lens of the camera, and x' and y' respectively denote the pixel coordinates in any horizontal and vertical direction in the spatial dimension of the refocused viewing-angle image;
Step 2, decode the refocused light field data $L_{F'_\alpha}^n(u,v,x',y')$ to obtain refocused images focused at different depths of the scene;
Step 2.1, perform computational imaging on the refocused light field $L_{F'_\alpha}^n(u,v,x',y')$ with formula (1) to obtain the image $I_\alpha^n(x',y')$ of the nth scene at the virtual focal plane $F_\alpha$;
In formula (1), α denotes the scale coefficient between the distance from the virtual focal plane $F_\alpha$ to the sensor and the distance F from the main lens of the light field camera to the sensor;
Step 2.2, take N different scale coefficients $\{\alpha_1,\alpha_2,\dots,\alpha_m,\dots,\alpha_N\}$ and repeat steps 1.2 to 2.1 to obtain a series of refocused images $\{I_{\alpha_1}^n,I_{\alpha_2}^n,\dots,I_{\alpha_m}^n,\dots,I_{\alpha_N}^n\}$ focused at different depths of the nth scene, which form the focal stack $FS^n$ of the nth scene, where $\alpha_m$ denotes the mth scale coefficient, $I_{\alpha_m}^n$ denotes the refocused image of the nth scene at the virtual focal plane $F_{\alpha_m}$ under the mth scale coefficient $\alpha_m$, and N denotes the number of refocused images contained in the focal stack $FS^n$; let the height, width and number of channels of $I_{\alpha_m}^n$ be H, W and C respectively;
Step 3, perform data enhancement processing on the refocused images contained in the focal stack $FS^n$ of the nth scene to obtain the data-enhanced focal stack $\widetilde{FS}^n=\{\tilde I_{\alpha_1}^n,\dots,\tilde I_{\alpha_m}^n,\dots,\tilde I_{\alpha_N}^n\}$ of the nth scene, where $\tilde I_{\alpha_m}^n$ denotes the refocused image of the nth scene at the virtual focal plane $F_{\alpha_m}$ under the mth scale coefficient $\alpha_m$ after enhancement processing;
Record the depth map and ground-truth saliency map of the nth scene as $D_n$ and $G_n$ respectively, and perform data enhancement processing on the depth map $D_n$ and the ground-truth saliency map $G_n$ of the nth scene to obtain the data-enhanced depth map $\tilde D_n$ and ground-truth saliency map $\tilde G_n$;
Step 4, constructing a salient object detection model based on light field refocusing data enhancement, which comprises the following steps: the system comprises an encoding network, an RGB and depth fusion module, a depth recovery module, a decoding network and an optimization module;
step 4.1, the coding network comprises: RGB networks and deep networks; wherein, the RGB network takes ResNet18 as a backbone network, and comprises: j basic blocks and j channel dimension reduction modules; the depth network is composed of j convolution modules;
The refocused image $\tilde I_{\alpha_m}^n$ is input into the salient object detection model and sequentially processed by the convolution of the j basic blocks of the RGB network in the encoding network to obtain the j refocus features $\{R_1^n,R_2^n,\dots,R_j^n\}$ of the nth scene, where $R_i^n$ denotes the ith feature map of the refocused image $\tilde I_{\alpha_m}^n$;
each channel dimension reduction module is sequentially composed of two convolution layers, a batch normalization layer and a ReLU activation layer;
The j channel dimension reduction modules respectively process the j refocus features $\{R_1^n,R_2^n,\dots,R_j^n\}$ of the nth scene to obtain the j dimension-reduced features $\{r_1^n,r_2^n,\dots,r_j^n\}$ of the nth scene, where $r_i^n$ denotes the ith feature after dimension reduction;
The depth map $\tilde D_n$ is input into the salient object detection model and sequentially processed by the convolution of the j convolution modules of the depth network in the encoding network to obtain the depth feature $D'_n$ of the nth scene;
Step 4.2, construct the RGB-depth fusion module, which sequentially comprises: an IBR module, a convolution module Conv1 and an IRB module;
The jth dimension-reduced feature $r_j^n$ of the nth scene and the depth feature $D'_n$ are multiplied at the pixel level, and the result is input into the RGB-depth fusion module, where the IBR module performs convolution processing to obtain the preliminary fusion feature $E_n$ of the nth scene;
The convolution module Conv1 performs convolution processing on the jth dimension-reduced feature $r_j^n$ to obtain the refocused image feature $\hat r_j^n$;
The preliminary fusion feature $E_n$, the refocused image feature $\hat r_j^n$ and the depth feature $D'_n$ of the nth scene are multiplied at the pixel level, and the result is input into the IRB module, which sequentially performs convolution, batch normalization and ReLU activation to obtain the final fusion feature $E'_n$ of the nth scene;
Step 4.3, the depth recovery module includes: a convolution module Conv2 and a fusion module;
The fusion feature $E'_n$ is input into the convolution module Conv2 and sequentially subjected to bilinear interpolation, convolution, batch normalization and ReLU activation to obtain the coarse recovered depth map $\hat D_n^{c}$ of the nth scene;
The fusion module sequentially applies residual, convolution, bilinear interpolation and Sigmoid activation processing to the coarse recovered depth map $\hat D_n^{c}$ to obtain the accurate recovered depth map $\hat D_n$ of the nth scene;
Step 4.4, the decoding network includes: the device comprises a bridging module and a decoding module;
The bridging module sequentially applies convolution, batch normalization and ReLU activation processing to the jth dimension-reduced feature $r_j^n$ to obtain the bridging feature $B_n$;
The decoding module consists of j decoding stages, each decoding stage consists of three continuous deconvolution modules, and each deconvolution module consists of a deconvolution layer, a batch normalization layer and a ReLU activation layer in sequence;
When i=1, the bridging feature $B_n$ and the fusion feature $E'_n$ are input together into the ith decoding stage for processing to obtain the ith coarse salient image $S_1^n$;
When i=2,3,…,j, the (i−1)th coarse salient image is up-sampled by a factor of two and input into the ith decoding stage together with the corresponding dimension-reduced feature for processing to obtain the ith coarse salient image $S_i^n$; the jth decoding stage thus outputs the jth coarse salient image $S_j^n$, and the coarse salient images form the set $\{S_1^n,S_2^n,\dots,S_j^n\}$ of the nth scene;
Step 4.5, constructing the optimization module, which comprises the following steps: an encoder, a decoder;
The encoder and the decoder sequentially process the jth coarse salient image $S_j^n$ to generate the accurate predicted saliency map $pre_n$ of the nth scene;
Training a salient target detection model based on light field refocusing data enhancement;
step 5.1, establishing a loss function;
Step 5.1.1, establish the spatial loss function $L_s^n$, the edge loss function $L_e^n$ and the depth loss function $L_d^n$ of the nth scene through formula (2), formula (3) and formula (4) respectively;
In formulas (2), (3) and (4), $\tilde G_n$ denotes the ground-truth saliency map corresponding to the focal stack $\widetilde{FS}^n$ of the nth scene, $TP_n$ denotes the region of $pre_n$ correctly predicted as the salient target, $FN_n$ denotes the region of $\tilde G_n$ in which the salient target is mispredicted as background, $FP_n$ denotes the region of $pre_n$ in which the background is mispredicted as the salient target, and β denotes a balance factor;
Step 5.1.2, establish the total loss function $L_n$ of the nth scene through formula (5);
Step 5.2, train the salient object detection model using a stochastic gradient descent algorithm, computing the total loss function for each scene to update the network parameters until the total loss function converges, so as to obtain the optimal salient object detection model for performing salient object detection on light field images.
2. An electronic device comprising a memory and a processor, wherein the memory is configured to store a program that supports the processor to perform the significance target detection method of claim 1, the processor being configured to execute the program stored in the memory.
3. A computer readable storage medium having a computer program stored thereon, characterized in that the computer program when executed by a processor performs the steps of the salient object detection method of claim 1.
CN202310683470.1A 2023-06-09 2023-06-09 Salient target detection method based on light field refocusing data enhancement Pending CN116778187A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310683470.1A CN116778187A (en) 2023-06-09 2023-06-09 Salient target detection method based on light field refocusing data enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310683470.1A CN116778187A (en) 2023-06-09 2023-06-09 Salient target detection method based on light field refocusing data enhancement

Publications (1)

Publication Number Publication Date
CN116778187A true CN116778187A (en) 2023-09-19

Family

ID=87987184

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310683470.1A Pending CN116778187A (en) 2023-06-09 2023-06-09 Salient target detection method based on light field refocusing data enhancement

Country Status (1)

Country Link
CN (1) CN116778187A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118135120A (en) * 2024-05-06 2024-06-04 武汉大学 Three-dimensional reconstruction and micromanipulation system for surface morphology of nano sample
CN118135120B (en) * 2024-05-06 2024-07-12 武汉大学 Three-dimensional reconstruction and micromanipulation system for surface morphology of nano sample

Similar Documents

Publication Publication Date Title
US11200424B2 (en) Space-time memory network for locating target object in video content
CN111160297A (en) Pedestrian re-identification method and device based on residual attention mechanism space-time combined model
CN107481279A (en) A kind of monocular video depth map computational methods
CN110827312B (en) Learning method based on cooperative visual attention neural network
CN114565655B (en) Depth estimation method and device based on pyramid segmentation attention
CN113298815A (en) Semi-supervised remote sensing image semantic segmentation method and device and computer equipment
CN113361542B (en) Local feature extraction method based on deep learning
CN113343822B (en) Light field saliency target detection method based on 3D convolution
CN112819853B (en) Visual odometer method based on semantic priori
CN107766864B (en) Method and device for extracting features and method and device for object recognition
CN111239684A (en) Binocular fast distance measurement method based on YoloV3 deep learning
CN114140623A (en) Image feature point extraction method and system
CN116778187A (en) Salient target detection method based on light field refocusing data enhancement
CN114463492A (en) Adaptive channel attention three-dimensional reconstruction method based on deep learning
EP3185212A1 (en) Dynamic particle filter parameterization
CN112464775A (en) Video target re-identification method based on multi-branch network
CN114663880A (en) Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN113011359B (en) Method for simultaneously detecting plane structure and generating plane description based on image and application
CN113850761A (en) Remote sensing image target detection method based on multi-angle detection frame
CN112070181B (en) Image stream-based cooperative detection method and device and storage medium
CN117456330A (en) MSFAF-Net-based low-illumination target detection method
CN110910497A (en) Method and system for realizing augmented reality map
CN108154107B (en) Method for determining scene category to which remote sensing image belongs
CN116665293A (en) Sitting posture early warning method and system based on monocular vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination