CN116778187A - Salient target detection method based on light field refocusing data enhancement - Google Patents
- Publication number: CN116778187A
- Application number: CN202310683470.1A
- Authority: CN (China)
- Prior art keywords: light field, module, refocusing, image, depth
- Legal status: Pending (assumed; not a legal conclusion)
Classifications
- G06V10/462 — Salient features, e.g. scale invariant feature transforms [SIFT]
- G06N3/0455 — Auto-encoder networks; encoder-decoder networks
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/048 — Activation functions
- G06N3/084 — Backpropagation, e.g. using gradient descent
- G06N3/09 — Supervised learning
- G06V10/761 — Proximity, similarity or dissimilarity measures
- G06V10/7715 — Feature extraction, e.g. by transforming the feature space
- G06V10/806 — Fusion of extracted features
- G06V10/82 — Image or video recognition using neural networks
- G06V2201/07 — Target detection
Abstract
The invention discloses a salient target detection method based on light field refocusing data enhancement, comprising the following steps: 1. refocus the light field data to obtain light field data under different focusing parameters; 2. perform data enhancement on the refocused data; 3. construct a deep convolutional neural network that takes the light field refocused images and the depth map as input, and train it to obtain a light field salient target detection model; 4. use the trained model to perform salient target detection on the light field refocused images and depth map to be detected, and evaluate the model's accuracy on the data to be detected. The invention realizes salient target detection based on light field refocusing data enhancement, thereby effectively improving the accuracy of salient target detection in complex and changeable environments.
Description
Technical Field
The invention belongs to the fields of computer vision, image processing and analysis, and particularly relates to a salient target detection method based on light field refocusing data enhancement.
Background
Visual saliency is an attention mechanism of the human visual system: when we observe a scene, a salient region usually attracts our attention while non-salient regions are naturally ignored, which allows humans to process a large amount of image data quickly. Salient target detection means that a computer imitates the human visual system to quickly and accurately locate the region or target of interest in the field of view; accurate salient target detection provides reliable prior information for target detection and recognition, image segmentation, and visual tracking.
According to the type of input data, salient target detection falls mainly into three categories: (1) detection based on RGB images; (2) detection based on RGB-D images; (3) detection based on light fields. In complex scenes such as strong or dim lighting, partial occlusion, cluttered backgrounds, or similar foreground and background, it is difficult to detect salient targets or regions effectively with only an RGB image as input. Methods that take RGB-D input, i.e. an RGB image together with a depth map, introduce additional depth information and have been shown to improve salient target detection, but when the depth map quality is poor the detection results are also poor. A light field records the amount of light traveling in every direction through every point in space, capturing both the positional and angular information of the light rays, and thus describes a natural scene more completely.
At present, several efforts have investigated light-field-based salient target detection; they fall broadly into feature-based and learning-based methods. Feature-based methods estimate the salient target from information such as color, depth, and background priors on the basis of a light field focal stack and an all-in-focus image. Such methods consider only a few limited features and usually do not reach high detection accuracy. Learning-based methods train a salient target detection model on a certain amount of training data and evaluate the trained model on test data. Relying on the strong learning capability of deep neural networks, learning-based methods integrate many kinds of features and greatly improve detection accuracy over feature-based methods. However, they still have drawbacks:
1. for focal-stack-based methods, the local blur between different refocused images makes it hard to obtain saliency maps with sharp edges, and when the depth-of-field range is narrow these methods can hardly achieve ideal results;
2. most learning-based methods train and test only on datasets proposed by their own authors, and this lack of cross-dataset comparison makes it difficult to demonstrate the robustness of the model;
3. most learning-based methods use focal stacks, in which the differences in focus depth between refocused images are small; moreover, the all-in-focus image is itself a special refocused image, so the two image types carry a certain data redundancy, which imposes a large computational overhead on the network.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a salient target detection method based on light field refocusing data enhancement that fully exploits the special properties of light field data and reduces the computational load, thereby effectively improving the precision and accuracy of salient target detection in complex and changeable environments.
In order to solve the technical problems, the invention adopts the following technical scheme:
the invention discloses a salient object detection method based on light field refocusing data enhancement, which is characterized by comprising the following steps of:
step 1, refocusing light field data to obtain the light field data under different focusing parameters;
step 1.1, record the light field data of the nth scene as L_F^n(u, v, x, y), where u and v denote the horizontal and vertical viewing-angle coordinates in the angular dimension, u, v ∈ [1, M], and M denotes the maximum number of viewing angles in the horizontal and vertical directions; x and y denote the horizontal and vertical pixel coordinates in the spatial dimension, x ∈ [1, X], y ∈ [1, Y], where X and Y denote the maximum spatial width and height of a view image; n ∈ [1, N], where N denotes the number of light field data; F denotes the distance from the light field camera's main lens to the sensor;
step 1.2, refocus the light field data L_F^n(u, v, x, y) of the nth scene onto a virtual focal plane F_α to obtain refocused light field data L_{F′_α}^n(u, v, x′, y′), where F′_α is the distance from the virtual focal plane F_α to the camera's main lens, and x′ and y′ denote the horizontal and vertical pixel coordinates in the spatial dimension of the refocused view images;
step 2, decode the refocused light field data L_{F′_α}^n(u, v, x′, y′) to obtain refocused images focused at different depths of the scene;
step 2.1, perform computational imaging on the refocused light field via formula (1) to obtain the image E_{F_α}^n(x′, y′) of the nth scene at the virtual focal plane F_α:

E_{F_α}^n(x′, y′) = (1 / (α²F²)) Σ_u Σ_v L_F^n(u, v, u + (x′ − u)/α, v + (y′ − v)/α)   (1)

In formula (1), α denotes the scaling factor between the distance F′_α from the virtual focal plane F_α to the sensor and the distance F from the light field camera's main lens to the sensor, i.e. α = F′_α / F;
step 2.2, take N different scaling factors {α_1, α_2, …, α_m, …, α_N} and repeat steps 1.2 to 2.1 to obtain a series of refocused images focused at different depths of the nth scene, which form the focal stack of the nth scene; here α_m denotes the mth scaling factor, the mth image of the stack is the refocused image of the nth scene at the virtual focal plane F_{α_m} under the scaling factor α_m, and N denotes the number of refocused images contained in the focal stack; let the height, width and channel number of each refocused image be H, W and C respectively;
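The refocusing and focal-stack construction of steps 1.2 to 2.2 can be sketched as a shift-and-add over the view images (a minimal numpy sketch with a nearest-neighbour shift approximation; the function names and the (M, M, X, Y) array layout are illustrative assumptions, not from the patent):

```python
import numpy as np

def refocus(light_field, alpha):
    """Shift-and-add refocusing (nearest-neighbour shift approximation).

    light_field: array of shape (M, M, X, Y) holding the view images
    L_F^n(u, v, x, y); alpha is the focal-plane scaling factor.
    Returns the refocused image E_{F_alpha}(x', y') of shape (X, Y).
    """
    M = light_field.shape[0]
    c = (M - 1) / 2.0                      # centre view index
    acc = np.zeros(light_field.shape[2:], dtype=np.float64)
    for u in range(M):
        for v in range(M):
            # per-view shift proportional to (1 - 1/alpha) times the view offset
            du = int(round((u - c) * (1 - 1.0 / alpha)))
            dv = int(round((v - c) * (1 - 1.0 / alpha)))
            acc += np.roll(light_field[u, v], shift=(du, dv), axis=(0, 1))
    return acc / (M * M)                   # average over all M*M views

def focal_stack(light_field, alphas):
    """Stack of refocused images, one per scaling factor alpha_m."""
    return np.stack([refocus(light_field, a) for a in alphas])
```

Each α_m focuses the synthetic image at a different scene depth; sweeping α over a range of values yields the focal stack described above.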
step 3, apply data enhancement to the refocused images contained in the focal stack of the nth scene to obtain the data-enhanced focal stack of the nth scene, whose mth image is the enhanced refocused image at the virtual focal plane F_{α_m} under the scaling factor α_m;
denote the depth map and ground-truth saliency map of the nth scene as D_n and G_n respectively; apply the same data enhancement to D_n and G_n to obtain the data-enhanced depth map and ground-truth saliency map;
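Step 3 requires that the same enhancement be applied jointly to the focal stack, the depth map D_n and the ground-truth map G_n so the three stay pixel-aligned. A hypothetical numpy sketch using a horizontal flip (the patent does not specify which enhancement operations are used):

```python
import numpy as np

def enhance_sample(stack, depth, gt, flip=True):
    """Apply one geometric enhancement (here: a horizontal flip) jointly to
    the focal stack (N, H, W, C), the depth map (H, W) and the ground-truth
    saliency map (H, W), keeping all three pixel-aligned."""
    if flip:
        stack = stack[:, :, ::-1, :]   # flip the width axis of the stack
        depth = depth[:, ::-1]         # flip the width axis of the depth map
        gt = gt[:, ::-1]               # flip the width axis of the label
    return (np.ascontiguousarray(stack),
            np.ascontiguousarray(depth),
            np.ascontiguousarray(gt))
```

Applying the flip to the label as well is essential: enhancing the inputs without transforming G_n identically would break the pixel-wise supervision used in training.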
Step 4, constructing a salient object detection model based on light field refocusing data enhancement, which comprises the following steps: the system comprises an encoding network, an RGB and depth fusion module, a depth recovery module, a decoding network and an optimization module;
step 4.1, the coding network comprises: RGB networks and deep networks; wherein, the RGB network takes ResNet18 as a backbone network, and comprises: j basic blocks and j channel dimension reduction modules; the depth network is composed of j convolution modules;
the refocused imageInputting the saliency target detection model, and sequentially carrying out convolution processing on j basic blocks of RGB (red, green and blue) network in the coding network to obtain j refocusing features in the nth sceneWherein (1)>Representing refocus image +.>Is the ith feature map of (2);
each channel dimension reduction module is sequentially composed of two convolution layers, a batch normalization layer and a ReLU activation layer;
the j channel dimension reduction modules respectively focus the j refocusing features in the nth sceneAfter processing, j dimension-reducing features ∈j after the dimension reduction of the nth scene are obtained>Wherein (1)>Representing the ith feature after dimension reduction;
the depth mapInputting the saliency target detection model, and sequentially carrying out convolution processing on j convolution modules in a depth network in a coding network to obtainDepth feature D in nth scene n′ ;
Step 4.2, construct the RGB and depth fusion module, which sequentially comprises: an IBR module, a convolution module Conv1 and an IRB module;
feature the jth dimension reduction in the nth sceneAnd depth feature D n′ After pixel level multiplication calculation, inputting the calculated pixel level multiplication calculation into the RGB and depth fusion module, and carrying out convolution processing by the IBR module to obtain a preliminary fusion characteristic E under an nth scene n ;
The convolution module Conv1 pair j-th dimension reduction featureAfter convolution processing, refocusing image characteristics are obtained
Fusion feature E in nth scene n Refocusing image featuresAnd depth feature D n′ After pixel level multiplication calculation, inputting the calculated result into an IRB module, and sequentially carrying out convolution, batch normalization and ReLU activation processing to obtain a final fusion feature E under an nth scene n′ ;
Step 4.3, the depth recovery module includes: a convolution module Conv2 and a fusion module;
the fusion feature E n′ Inputting the rough restoration depth map in the nth scene into the convolution module Conv2, and sequentially performing bilinear interpolation, convolution, batch normalization and ReLU activation to obtain the rough restoration depth map in the nth scene
The fusion module recovers the depth map of the roughnessAfter residual error, convolution, bilinear interpolation and Sigmoid activation processing are sequentially carried out, an accurate recovery depth map +_in an nth scene is obtained>
Step 4.4, the decoding network includes: the device comprises a bridging module and a decoding module;
the bridging module performs the dimension reduction on the j-th dimension reduction featureAfter the processing of convolution, batch normalization and ReLU activation is sequentially carried out, bridging feature B is obtained n ;
The decoding module consists of j decoding stages, each decoding stage consists of three continuous deconvolution modules, and each deconvolution module consists of a deconvolution layer, a batch normalization layer and a ReLU activation layer in sequence;
when i=1, bridging feature B will be n And fusion feature E n′ Inputting the i-th rough significant image and the i-th rough significant image into the i-th decoding stage together for processing
When i=2, 3, …, j, the i-1 th coarse significant image is up-sampled twice and then compared withInputting the i-th decoding stage together for processing to obtain the i-th rough significant image +.>Thereby outputting the j-th coarse salient image from the j-th decoding stage>And forms the roughness in the nth sceneIs a salient image collection of (1)
Step 4.5, constructing the optimization module, which comprises the following steps: an encoder, a decoder;
the encoder and decoder are used for sequentially carrying out the treatment on the jth rough salient imageProcessing to generate accurate prediction saliency map pre in nth scene n ;
Step 5, train the salient target detection model based on light field refocusing data enhancement;
step 5.1, establishing a loss function;
step 5.1.1, establish the spatial loss function, the edge loss function and the depth loss function of the nth scene through formulas (2), (3) and (4) respectively;
In formulas (2), (3) and (4), the ground-truth saliency map corresponding to the focal stack of the nth scene serves as the reference; TP_n denotes the region of pre_n correctly predicted as the salient target, FN_n denotes the region where the salient target is mispredicted as background, FP_n denotes the region of pre_n where background is mispredicted as the salient target, and β denotes the balance factor;
step 5.1.2, establish the total loss function L_n of the nth scene through formula (5);
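The patent's loss formulas (2)-(5) appear as images and are not reproduced in the text. As a hedged illustration only: the TP/FN/FP regions and the balance factor β defined above are exactly the ingredients of a soft weighted F-measure loss, one standard instantiation of which is sketched below (numpy; β² = 0.3 is a conventional choice in saliency evaluation, not a value taken from the patent):

```python
import numpy as np

def f_measure_loss(pred, gt, beta2=0.3):
    """Soft weighted F-measure loss built from the TP/FP/FN regions.

    pred, gt: arrays with values in [0, 1]. This is one common region
    loss over TP/FP/FN, not necessarily the patent's exact formula.
    """
    tp = np.sum(pred * gt)              # salient area correctly predicted
    fp = np.sum(pred * (1.0 - gt))      # background predicted as salient
    fn = np.sum((1.0 - pred) * gt)      # salient area predicted as background
    precision = tp / (tp + fp + 1e-8)
    recall = tp / (tp + fn + 1e-8)
    f = (1.0 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
    return 1.0 - f                      # 0 for a perfect prediction
```

A perfect prediction drives the loss to 0, while predicting the complement of the ground truth drives it to 1.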
Step 5.2, train the salient target detection model with the stochastic gradient descent algorithm, computing the total loss function in each scene to update the network parameters until the total loss converges, so as to obtain the optimal salient target detection model for performing salient target detection on light field images.
The electronic device of the invention comprises a memory and a processor, where the memory stores a program supporting the processor in executing the salient target detection method, and the processor is configured to execute the program stored in the memory.
The computer-readable storage medium of the invention stores a computer program which, when executed by a processor, performs the steps of the salient target detection method.
Compared with the prior art, the invention has the beneficial effects that:
1. according to the invention, the deep neural network based on light field refocusing data enhancement is constructed, and the label data is used for supervising the neural network to learn, so that a light field saliency target detection characteristic model with robustness is obtained, the problem of high detection precision due to the fact that the calculation burden of the network is large in the focal stack model is solved, the calculation burden of the network is greatly reduced, and the problem of low detection precision is solved.
2. According to the invention, by constructing the depth neural network based on light field refocusing data enhancement, the input depth map is considered, the view angle change of a partial area is converted into the depth change of the whole image area, and the depth change of the whole image area is ignored based on the light field data network.
3. According to the invention, by constructing the deep neural network based on the light field refocusing data enhancement, based on the thought of U-Net, the encoder part and the decoder part are symmetrically constructed, and the channel dimension reduction module is used for reducing the dimension of the characteristics acquired by the encoder, so that the data quantity required to be calculated by the decoder is effectively reduced, and the problems of huge calculated quantity and high time cost of the deep neural network based on the refocusing data enhancement are solved.
4. According to the invention, the optimization module is used for optimizing the detection result of the light field salient target, so that the pixel points with detection errors on the outline of the salient region are corrected, the detection edge is smoother, and the accuracy of detecting the light field salient target is improved.
Drawings
FIG. 1 is a flow chart of salient object detection for a light field refocused image in accordance with the present invention;
FIG. 2 is a schematic diagram of a deep neural network based on light field refocusing data enhancement used in the present invention;
FIG. 3 shows salient target detection results of the invention and other light field salient target detection methods on parts of the DUTLF-V2, DUTLF-FS, Lytro-Illum, HFUT-Lytro and LFSD test sets.
Detailed Description
As shown in fig. 1, in this embodiment, the salient target detection method based on light field refocusing data enhancement constructs a deep neural network based on light field refocusing data enhancement to obtain a light field salient target detection model that can effectively detect salient targets in complex scenes, thereby effectively improving the accuracy and precision of salient target detection in complex and changeable environments. Specifically, the method comprises the following steps:
step 1, refocusing light field data to obtain the light field data under different focusing parameters;
step 1.1, record the light field data of the nth scene as L_F^n(u, v, x, y), where u and v denote the horizontal and vertical viewing-angle coordinates in the angular dimension, u, v ∈ [1, M], and M denotes the maximum number of viewing angles in the horizontal and vertical directions; x and y denote the horizontal and vertical pixel coordinates in the spatial dimension, x ∈ [1, X], y ∈ [1, Y], where X and Y denote the maximum spatial width and height of a view image; n ∈ [1, N], where N denotes the number of light field data; F denotes the distance from the light field camera's main lens to the sensor;
training and testing is performed in this embodiment using a light field saliency target detection dataset DUTLF-V2, the DUTLF-V2 containing a total of n=4204 scenes, wherein the training set contains 2597 scenes, the testing set contains 1247 scenes, and the maximum viewing angle number m=9 in the horizontal and vertical directions;
step 1.2, refocus the light field data L_F^n(u, v, x, y) of the nth scene onto a virtual focal plane F_α to obtain refocused light field data L_{F′_α}^n(u, v, x′, y′), where F′_α is the distance from the virtual focal plane F_α to the camera's main lens, and x′ and y′ denote the horizontal and vertical pixel coordinates in the spatial dimension of the refocused view images;
step 2, decode the refocused light field data L_{F′_α}^n(u, v, x′, y′) to obtain refocused images focused at different depths of the scene;
step 2.1, perform computational imaging on the refocused light field via formula (1) to obtain the image E_{F_α}^n(x′, y′) of the nth scene at the virtual focal plane F_α:

E_{F_α}^n(x′, y′) = (1 / (α²F²)) Σ_u Σ_v L_F^n(u, v, u + (x′ − u)/α, v + (y′ − v)/α)   (1)

In formula (1), α denotes the scaling factor between the distance F′_α from the virtual focal plane F_α to the sensor and the distance F from the light field camera's main lens to the sensor, i.e. α = F′_α / F;
step 2.2, take N different scaling factors {α_1, α_2, …, α_m, …, α_N} and repeat steps 1.2 to 2.1 to obtain a series of refocused images focused at different depths of the nth scene, which form the focal stack of the nth scene; here α_m denotes the mth scaling factor, the mth image of the stack is the refocused image of the nth scene at the virtual focal plane F_{α_m} under the scaling factor α_m, and N denotes the number of refocused images contained in the focal stack; let the height, width and channel number of each refocused image be H, W and C respectively;
in this embodiment, α is determined by the depth of the specific scene containing target, and the refocus number is determined by the depth range of the specific scene containing target. Because the depth distribution of each scene containing the target is different, most of refocusing images acquired by each scene are 3-13, in order to ensure data consistency, the scenes with small scene depth change are duplicated, the existing refocusing images are duplicated, and the scenes with large scene depth change are discarded, so that each scene contains N=12 refocusing images. To reduce the computational effort of the neural network, the focal stack is further sampled to have a height h=256 and a width w=256, the refocused image being a color image, and channel c=3.
Step 3, apply data enhancement to the refocused images contained in the focal stack of the nth scene to obtain the data-enhanced focal stack of the nth scene, whose mth image is the enhanced refocused image at the virtual focal plane F_{α_m} under the scaling factor α_m;
Denote the depth map and ground-truth saliency map of the nth scene as D_n and G_n respectively; apply the same data enhancement to D_n and G_n to obtain the data-enhanced depth map and ground-truth saliency map;
Step 4, constructing a salient object detection model based on light field refocusing data enhancement, which comprises the following steps: the system comprises an encoding network, an RGB and depth fusion module, a depth recovery module, a decoding network and an optimization module; as shown in fig. 2;
step 4.1, the coding network comprises: RGB networks and deep networks; wherein, the RGB network takes ResNet18 as a backbone network, and comprises: j basic blocks and j channel dimension reduction modules; the depth network is composed of j convolution modules;
in this embodiment, the number j=5 of basic blocks included in the RGB network.
Each channel dimension reduction module is sequentially composed of two convolution layers, a batch normalization layer and a ReLU activation layer;
the j channel dimension reduction modules respectively focus the j refocusing features in the nth sceneAfter processing, j dimension-reducing features ∈j after the dimension reduction of the nth scene are obtained>Wherein (1)>Representing the ith feature after dimension reduction;
in this embodiment, the number of channels after dimension reduction is 32.
The enhanced depth map is input into the salient target detection model and passed sequentially through the convolution processing of the j convolution modules of the depth network in the coding network to obtain the depth feature D′_n of the nth scene;
Step 4.2, construct the RGB and depth fusion module, which sequentially comprises: an IBR module, a convolution module Conv1 and an IRB module;
feature the jth dimension reduction in the nth sceneAnd depth feature D n′ After pixel level multiplication calculation, inputting the calculated values into an RGB and depth fusion module, and carrying out convolution processing by an IBR module to obtain a preliminary fusion characteristic E under an nth scene n ;
Convolving module Conv1 pair jth dimension reduction featureAfter convolution processing, refocusing image characteristics +.>
Fusion feature E in nth scene n Refocusing image featuresAnd depth feature D n′ After pixel level multiplication calculation, inputting the calculated result into an IRB module, and sequentially carrying out convolution, batch normalization and ReLU activation processing to obtain a final fusion feature E under an nth scene n′ ;
Step 4.3, the depth restoration module includes: a convolution module Conv2 and a fusion module;
The fusion feature E_n′ is input into the convolution module Conv2 and is sequentially subjected to bilinear interpolation, convolution, batch normalization and ReLU activation to obtain the coarse restored depth map in the nth scene;
The fusion module sequentially applies residual connection, convolution, bilinear interpolation and Sigmoid activation to the coarse restored depth map to obtain the accurate restored depth map in the nth scene;
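Step 4.3 relies on bilinear interpolation to bring the low-resolution fusion feature back up toward depth-map resolution. As an illustration of that operation only (not the patent's implementation, which wraps it between learned convolutions), a 2× bilinear upsampling can be sketched in NumPy:

```python
import numpy as np

def bilinear_upsample2x(x: np.ndarray) -> np.ndarray:
    """Upsample a 2-D array by a factor of 2 with bilinear interpolation,
    interpolating first along rows, then along columns."""
    h, w = x.shape
    rows = np.linspace(0, h - 1, 2 * h)   # fractional row coordinates
    cols = np.linspace(0, w - 1, 2 * w)   # fractional column coordinates
    # interpolate each column to the new row grid
    tmp = np.array([np.interp(rows, np.arange(h), x[:, j]) for j in range(w)]).T
    # interpolate each row of the intermediate to the new column grid
    return np.array([np.interp(cols, np.arange(w), tmp[i, :]) for i in range(2 * h)])
```

Corner pixels are preserved exactly, and intermediate pixels are linear blends of their neighbors.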
Step 4.4, the decoding network comprises: the device comprises a bridging module and a decoding module;
The bridging module sequentially applies convolution, batch normalization and ReLU activation to the jth dimension-reduced feature C_n^j to obtain the bridging feature B_n;
The decoding module consists of j decoding stages, each decoding stage consists of three continuous deconvolution modules, and each deconvolution module consists of a deconvolution layer, a batch normalization layer and a ReLU activation layer in sequence;
when i=1, the bridging feature B_n and the fusion feature E_n′ are input together into the ith decoding stage for processing to obtain the ith coarse salient image S_n^1;
when i=2, 3, …, j, the (i-1)th coarse salient image S_n^{i-1} is up-sampled by a factor of two and then input into the ith decoding stage together with the corresponding dimension-reduced feature for processing to obtain the ith coarse salient image S_n^i; the jth decoding stage thus outputs the jth coarse salient image S_n^j, and the coarse salient images of the nth scene form the set {S_n^1, S_n^2, …, S_n^j};
In this embodiment, the number of decoding stages is j=5.
Step 4.5, constructing an optimization module, which comprises the following steps: an encoder, a decoder;
the encoder and the decoder sequentially process the jth coarse salient image S_n^j to generate the accurate predicted saliency map pre_n in the nth scene;
Step 5, training the salient target detection model based on light field refocusing data enhancement;
step 5.1, establishing a loss function;
Step 5.1.1, establishing the spatial loss function L_s^n, the edge loss function L_e^n and the depth loss function L_d^n in the nth scene through formula (2), formula (3) and formula (4), respectively;
in formulas (2), (3) and (4), G_n′ represents the data-enhanced true saliency map corresponding to the focal stack in the nth scene, TP_n represents the region of pre_n correctly predicted as the salient target, FN_n represents the region where the salient target is mispredicted as background, FP_n represents the region of pre_n where the background is mispredicted as the salient target, and β represents the balance factor;
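Formulas (2)–(4) themselves are not reproduced in this text. As one plausible, purely hypothetical instantiation of a loss built from TP_n, FN_n, FP_n and the balance factor β, a soft F-measure loss could look like the following (the function name, the soft relaxation and the β² value are all assumptions):

```python
import numpy as np

def fmeasure_loss(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3) -> float:
    """Soft F-measure loss from TP, FN and FP with balance factor beta^2.

    pred: predicted saliency map with values in [0, 1]
    gt:   binary true saliency map
    """
    tp = np.sum(pred * gt)            # correctly predicted salient region
    fn = np.sum((1.0 - pred) * gt)    # salient target mispredicted as background
    fp = np.sum(pred * (1.0 - gt))    # background mispredicted as salient target
    f = (1.0 + beta2) * tp / ((1.0 + beta2) * tp + beta2 * fn + fp + 1e-8)
    return float(1.0 - f)             # perfect prediction -> loss near 0
```

A perfect prediction drives the loss toward 0, while a fully inverted prediction drives it toward 1.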
Step 5.1.2, establishing the total loss function L_n in the nth scene through formula (5):
In this embodiment, the network is trained for 40 epochs in the training phase; the initial learning rate is set to 0.0001, the momentum factors are set to (0.9, 0.999), the weight decay is set to 1e-8, and the learning rate drops by 20% every 10 epochs.
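Concretely, the decay schedule of this embodiment (initial rate 0.0001, dropping by 20% every 10 epochs over 40 epochs) amounts to the following small helper (the function name is illustrative):

```python
def learning_rate(epoch: int, base_lr: float = 1e-4,
                  drop: float = 0.2, step: int = 10) -> float:
    """Learning rate at a given epoch: the rate is multiplied by (1 - drop)
    after every `step` epochs, matching the 20%-every-10-epochs schedule."""
    return base_lr * (1.0 - drop) ** (epoch // step)
```

So epochs 0–9 use 1e-4, epochs 10–19 use 8e-5, epochs 20–29 use 6.4e-5, and epochs 30–39 use 5.12e-5.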
Step 5.2, training the saliency target detection model with a stochastic gradient descent algorithm, calculating the total loss function in each scene to update the network parameters until the total loss function converges, thereby obtaining the optimal saliency target detection model for performing salient target detection on light field images.
In this embodiment, an electronic device includes a memory for storing a program supporting the processor to execute the above method, and a processor configured to execute the program stored in the memory.
In this embodiment, a computer-readable storage medium stores a computer program that, when executed by a processor, performs the steps of the method described above.
Table 1 compares the salient target detection method based on light field refocusing data enhancement of the invention against 8 learning-based salient target detection methods on the test sets of the light field saliency detection datasets DUTLF-V2, DUTLF-FS, Lytro-Illum, HFUT-Lytro and LFSD, using S_α, F_β, E_φ and MAE as evaluation criteria. S_α measures the structural similarity between the predicted saliency map and the true saliency map; the closer the value is to 1, the better the salient target detection. F_β is a weighted harmonic mean of precision and recall; the closer the value is to 1, the better the salient target detection. E_φ is a metric that considers both local pixel similarity and global pixel statistics between the predicted saliency map and the true saliency map; the closer the value is to 1, the better the salient target detection. MAE is the mean absolute error between the predicted saliency map and the true saliency map; the closer the value is to 0, the better the salient target detection. The quantitative analysis of Table 1 shows that, in the test on the currently largest light field dataset DUTLF-V2, the invention obtains the best results on all evaluation indexes; in the test on the dataset DUTLF-FS, every index achieves the best result; in the test on the dataset Lytro-Illum, the invention likewise ranks first on every evaluation index; in the test on the dataset HFUT-Lytro, the invention obtains a suboptimal result on S_α, a poorer result on E_φ, the best result on F_β, and a poorer result on MAE; in the test on the dataset LFSD, all evaluation indexes give poorer results.
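The two simplest of these criteria can be sketched directly (β² = 0.3 is the value commonly used for the F-measure in salient object detection; it is an assumption, not a value stated in this text):

```python
import numpy as np

def mae(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean absolute error between predicted and true saliency maps
    (closer to 0 is better)."""
    return float(np.mean(np.abs(pred - gt)))

def f_measure(pred: np.ndarray, gt: np.ndarray, beta2: float = 0.3) -> float:
    """Weighted harmonic mean of precision and recall
    (closer to 1 is better)."""
    tp = float(np.sum(pred * gt))
    precision = tp / (float(np.sum(pred)) + 1e-8)
    recall = tp / (float(np.sum(gt)) + 1e-8)
    return (1.0 + beta2) * precision * recall / (beta2 * precision + recall + 1e-8)
```

S_α and E_φ involve structure- and alignment-aware terms and are not reproduced here.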
The poorer results on the HFUT-Lytro and LFSD datasets arise because these two datasets were acquired with a first-generation light field camera, and the resulting light field data suffer from problems such as color distortion.
TABLE 1
Fig. 3 compares the salient target detection method based on light field refocusing data enhancement of the invention with other current salient target detection methods on the LFSD, HFUT-Lytro, Lytro-Illum, DUTLF-FS and DUTLF-V2 datasets (from top to bottom), covering a variety of challenging scenarios including simple scenes, complex scenes, dim light and strong highlights. "Ours" denotes the light field salient target detection method of the invention; it can be seen intuitively that the method of the invention has obvious advantages in salient target localization, segmentation and edge details.
Claims (3)
1. A salient object detection method based on light field refocusing data enhancement, characterized by comprising the following steps:
step 1, refocusing light field data to obtain the light field data under different focusing parameters;
step 1.1, recording the light field data of the nth scene as L_n(u,v,x,y), wherein u and v respectively represent any horizontal viewing angle and any vertical viewing angle in the viewing angle dimension, u∈[1,M], v∈[1,M], and M represents the maximum number of viewing angles in the horizontal and vertical directions; x and y respectively represent the pixel coordinates in any horizontal direction and any vertical direction in the spatial dimension, x∈[1,X], y∈[1,Y], and X and Y respectively represent the maximum spatial width and the maximum spatial height of the viewing angle image; n∈[1,N], and N represents the number of light field data; F represents the distance from the light field camera main lens to the sensor;
step 1.2, refocusing the light field data L_n of the nth scene at a virtual focal plane F_α to obtain the refocused light field data L_n^α(u,v,x′,y′), wherein F_α′ is the distance from the virtual focal plane F_α to the camera main lens, and x′ and y′ respectively represent the pixel coordinates in any horizontal direction and any vertical direction in the spatial dimension of the refocused viewing angle image;
step 2, decoding the refocused light field data L_n^α to obtain refocused images focused at different depths of the scene;
step 2.1, performing computational imaging on the refocused light field data L_n^α by formula (1) to obtain the image I_n^α of the nth scene at the virtual focal plane F_α;
in formula (1), α represents the scaling coefficient between the distance from the virtual focal plane F_α to the sensor and the distance F from the light field camera main lens to the sensor;
step 2.2, taking N different scaling coefficients {α_1, α_2, …, α_m, …, α_N} and repeating step 1.2 to step 2.1 to obtain a series of refocused images {I_n^1, I_n^2, …, I_n^m, …, I_n^N} focused at different depths of the nth scene, which form the focal stack FS_n of the nth scene, wherein α_m represents the mth scaling coefficient and I_n^m represents the refocused image of the nth scene at the virtual focal plane corresponding to the mth scaling coefficient α_m; N represents the number of refocused images contained in the focal stack FS_n; the height, width and number of channels of I_n^m are H, W and C, respectively;
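In discrete form, the refocusing of steps 1.2 to 2.2 is a shift-and-sum over the sub-aperture views of the light field. The following simplified NumPy sketch illustrates the idea (the integer-pixel shifts and the (1 − 1/α) disparity scaling are assumptions of this illustration; formula (1) of the patent is the continuous version):

```python
import numpy as np

def refocus(lf: np.ndarray, alpha: float) -> np.ndarray:
    """Shift-and-sum refocusing of a 4-D light field lf[u, v, y, x] onto the
    virtual focal plane parameterized by alpha (a discrete sketch)."""
    U, V, H, W = lf.shape
    uc, vc = (U - 1) / 2.0, (V - 1) / 2.0   # center view coordinates
    out = np.zeros((H, W))
    for u in range(U):
        for v in range(V):
            # each sub-aperture view is shifted in proportion to (1 - 1/alpha)
            dy = int(round((u - uc) * (1.0 - 1.0 / alpha)))
            dx = int(round((v - vc) * (1.0 - 1.0 / alpha)))
            out += np.roll(lf[u, v], shift=(dy, dx), axis=(0, 1))
    return out / (U * V)   # average over all views
```

Varying alpha over the coefficients {α_1, …, α_N} and stacking the outputs yields a focal stack of images focused at different depths.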
step 3, performing data enhancement processing on the refocused images contained in the focal stack FS_n of the nth scene to obtain the data-enhanced focal stack FS_n′ of the nth scene, wherein I_n′^m represents the refocused image of the nth scene at the virtual focal plane corresponding to the mth scaling coefficient α_m after the enhancement processing;
recording the depth map and the true saliency map of the nth scene as D_n and G_n respectively, and performing data enhancement processing on the depth map D_n and the true saliency map G_n of the nth scene to obtain the data-enhanced depth map and the data-enhanced true saliency map;
step 4, constructing a salient object detection model based on light field refocusing data enhancement, which comprises: an encoding network, an RGB and depth fusion module, a depth recovery module, a decoding network and an optimization module;
step 4.1, the coding network comprises an RGB network and a depth network; the RGB network takes ResNet18 as its backbone network and comprises j basic blocks and j channel dimension reduction modules; the depth network is composed of j convolution modules;
the data-enhanced refocused image I_n′^m is input into the saliency target detection model and is sequentially convolved by the j basic blocks of the RGB network in the coding network to obtain the j refocusing features {R_n^1, R_n^2, …, R_n^j} in the nth scene, wherein R_n^i represents the ith feature map of the refocused image;
each channel dimension reduction module is sequentially composed of two convolution layers, a batch normalization layer and a ReLU activation layer;
the j channel dimension reduction modules respectively process the j refocusing features {R_n^1, R_n^2, …, R_n^j} in the nth scene to obtain the j dimension-reduced features {C_n^1, C_n^2, …, C_n^j} of the nth scene, wherein C_n^i represents the ith feature after dimension reduction;
the data-enhanced depth map is input into the saliency target detection model and is sequentially convolved by the j convolution modules of the depth network in the coding network to obtain the depth feature D_n′ in the nth scene;
step 4.2, constructing the RGB and depth fusion module, which sequentially comprises: an IBR module, a convolution module Conv1 and an IRB module;
the jth dimension-reduced feature C_n^j in the nth scene and the depth feature D_n′ are multiplied pixel-wise, and the product is input into the RGB and depth fusion module, where the IBR module performs convolution processing to obtain the preliminary fusion feature E_n in the nth scene;
the convolution module Conv1 convolves the jth dimension-reduced feature C_n^j to obtain the refocused image feature P_n;
the fusion feature E_n, the refocused image feature P_n obtained by Conv1 and the depth feature D_n′ in the nth scene are multiplied pixel-wise, and the product is input into the IRB module, which sequentially performs convolution, batch normalization and ReLU activation to obtain the final fusion feature E_n′ in the nth scene;
Step 4.3, the depth recovery module includes: a convolution module Conv2 and a fusion module;
the fusion feature E_n′ is input into the convolution module Conv2 and is sequentially subjected to bilinear interpolation, convolution, batch normalization and ReLU activation to obtain the coarse restored depth map in the nth scene;
the fusion module sequentially applies residual connection, convolution, bilinear interpolation and Sigmoid activation to the coarse restored depth map to obtain the accurate restored depth map in the nth scene;
Step 4.4, the decoding network includes: the device comprises a bridging module and a decoding module;
the bridging module sequentially applies convolution, batch normalization and ReLU activation to the jth dimension-reduced feature C_n^j to obtain the bridging feature B_n;
The decoding module consists of j decoding stages, each decoding stage consists of three continuous deconvolution modules, and each deconvolution module consists of a deconvolution layer, a batch normalization layer and a ReLU activation layer in sequence;
when i=1, the bridging feature B_n and the fusion feature E_n′ are input together into the ith decoding stage for processing to obtain the ith coarse salient image S_n^1;
when i=2, 3, …, j, the (i-1)th coarse salient image S_n^{i-1} is up-sampled by a factor of two and then input into the ith decoding stage together with the corresponding dimension-reduced feature for processing to obtain the ith coarse salient image S_n^i; the jth decoding stage thus outputs the jth coarse salient image S_n^j, and the coarse salient images of the nth scene form the set {S_n^1, S_n^2, …, S_n^j};
step 4.5, constructing the optimization module, which comprises: an encoder and a decoder;
the encoder and the decoder sequentially process the jth coarse salient image S_n^j to generate the accurate predicted saliency map pre_n in the nth scene;
step 5, training the salient target detection model based on light field refocusing data enhancement;
step 5.1, establishing a loss function;
step 5.1.1, establishing the spatial loss function L_s^n, the edge loss function L_e^n and the depth loss function L_d^n in the nth scene through formula (2), formula (3) and formula (4), respectively;
in formulas (2), (3) and (4), G_n′ represents the data-enhanced true saliency map corresponding to the focal stack in the nth scene, TP_n represents the region of pre_n correctly predicted as the salient target, FN_n represents the region where the salient target is mispredicted as background, FP_n represents the region of pre_n where the background is mispredicted as the salient target, and β represents the balance factor;
step 5.1.2, establishing the total loss function L_n in the nth scene through formula (5):
step 5.2, training the saliency target detection model with a stochastic gradient descent algorithm, calculating the total loss function in each scene to update the network parameters until the total loss function converges, thereby obtaining the optimal saliency target detection model for performing salient target detection on light field images.
2. An electronic device comprising a memory and a processor, wherein the memory is configured to store a program that supports the processor in performing the salient object detection method of claim 1, and the processor is configured to execute the program stored in the memory.
3. A computer readable storage medium having a computer program stored thereon, characterized in that the computer program when executed by a processor performs the steps of the salient object detection method of claim 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310683470.1A CN116778187A (en) | 2023-06-09 | 2023-06-09 | Salient target detection method based on light field refocusing data enhancement |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116778187A true CN116778187A (en) | 2023-09-19 |
Family
ID=87987184
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118135120A * | 2024-05-06 | 2024-06-04 | Wuhan University | Three-dimensional reconstruction and micromanipulation system for surface morphology of nano sample |
CN118135120B * | 2024-05-06 | 2024-07-12 | Wuhan University | Three-dimensional reconstruction and micromanipulation system for surface morphology of nano sample |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11200424B2 (en) | Space-time memory network for locating target object in video content | |
CN111160297A (en) | Pedestrian re-identification method and device based on residual attention mechanism space-time combined model | |
CN107481279A (en) | A kind of monocular video depth map computational methods | |
CN110827312B (en) | Learning method based on cooperative visual attention neural network | |
CN114565655B (en) | Depth estimation method and device based on pyramid segmentation attention | |
CN113298815A (en) | Semi-supervised remote sensing image semantic segmentation method and device and computer equipment | |
CN113361542B (en) | Local feature extraction method based on deep learning | |
CN113343822B (en) | Light field saliency target detection method based on 3D convolution | |
CN112819853B (en) | Visual odometer method based on semantic priori | |
CN107766864B (en) | Method and device for extracting features and method and device for object recognition | |
CN111239684A (en) | Binocular fast distance measurement method based on YoloV3 deep learning | |
CN114140623A (en) | Image feature point extraction method and system | |
CN116778187A (en) | Salient target detection method based on light field refocusing data enhancement | |
CN114463492A (en) | Adaptive channel attention three-dimensional reconstruction method based on deep learning | |
EP3185212A1 (en) | Dynamic particle filter parameterization | |
CN112464775A (en) | Video target re-identification method based on multi-branch network | |
CN114663880A (en) | Three-dimensional target detection method based on multi-level cross-modal self-attention mechanism | |
CN112329662B (en) | Multi-view saliency estimation method based on unsupervised learning | |
CN113011359B (en) | Method for simultaneously detecting plane structure and generating plane description based on image and application | |
CN113850761A (en) | Remote sensing image target detection method based on multi-angle detection frame | |
CN112070181B (en) | Image stream-based cooperative detection method and device and storage medium | |
CN117456330A (en) | MSFAF-Net-based low-illumination target detection method | |
CN110910497A (en) | Method and system for realizing augmented reality map | |
CN108154107B (en) | Method for determining scene category to which remote sensing image belongs | |
CN116665293A (en) | Sitting posture early warning method and system based on monocular vision |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||