CN110633632A

CN110633632A - Weakly supervised joint target detection and semantic segmentation method based on loop guidance

Info

Publication number: CN110633632A
Application number: CN201910723018.7A
Authority: CN
Inventors: 纪荣嵘; 沈云航
Original assignee: Xiamen University
Current assignee: Xiamen University
Priority date: 2019-08-06
Filing date: 2019-08-06
Publication date: 2019-12-31

Abstract

A weakly supervised joint target detection and semantic segmentation method based on loop guidance, belonging to the technical field of computer vision. Training comprises: initializing a convolutional neural network; forward-propagating the network to obtain a feature map of the image; forward-propagating the target detection branch to obtain an object localization map; forward-propagating the semantic segmentation branch to obtain a segmentation mask; obtaining pseudo ground-truth semantic segmentation labels from the object localization map; obtaining weights for the image candidate regions from the segmentation mask; calculating the loss of the semantic segmentation branch; calculating the loss of the target detection branch; updating the parameters with stochastic gradient descent; repeating these steps until convergence; and inputting an image into the network to obtain target detection and semantic segmentation results. Inference comprises: initializing a convolutional neural network; forward-propagating the network to obtain a feature map of the image; forward-propagating the target detection branch to obtain a target detection result; forward-propagating the semantic segmentation branch to obtain a semantic segmentation mask; and obtaining an instance segmentation mask from the target detection result and the semantic segmentation mask.

Description

Weakly supervised joint target detection and semantic segmentation method based on loop guidance

Technical Field

The invention belongs to the technical field of computer vision, and in particular relates to a weakly supervised joint target detection and semantic segmentation method based on loop guidance.

Background

Target detection and semantic segmentation are fundamental problems in machine vision and are widely applied in scenarios such as video surveillance and autonomous driving. In remote sensing, for example, the positions of buildings or people in an input image can be detected automatically, so that the location can be determined; in the medical field, various lesions can be analyzed from medical X-ray or microscopic images; in the military field, target detection can be used to locate enemy positions. Machine learning has achieved great success in tasks such as target detection and semantic segmentation, particularly through strongly supervised learning tasks such as classification and regression. A predictive model is learned from a training data set containing a large number of training samples, each corresponding to an event or object. A training sample consists of two parts: a feature vector (or instance) describing the event or object, and a label representing the ground-truth output. In a classification task, the label indicates the class to which the training sample belongs; in a regression task, the label is a real value corresponding to the sample.

With the rise of deep learning, a large number of excellent target detection and semantic segmentation models have emerged in recent years. As data-driven methods for image recognition continue to develop, there is growing interest in scaling up target detection and semantic segmentation systems. However, current target detection and semantic segmentation have two disadvantages. First, most successful techniques require a large training data set with ground-truth labels, yet in many scenarios strong supervision is difficult to obtain because data annotation is extremely costly. Training a high-accuracy detection and segmentation model therefore requires a large amount of finely labeled image data, in the form of bounding boxes and pixel-level labels, as supervision, which consumes considerable manpower and material resources. Second, unlike the classification task, fully labeling every object instance with a category, a bounding box, and pixel-level annotations scales poorly. Interest in unsupervised and weakly supervised target detection and semantic segmentation has therefore grown, but existing fully unsupervised, label-free methods perform poorly on these tasks, and conventional weakly supervised methods do not generalize well to images of complex scenes.

A weakly supervised problem is one in which, to accomplish a given computer vision task, manual labels weaker than those the task normally requires are used as supervision. In general, weakly supervised annotations are easier to obtain than the original annotations. For example, for a target detection task, image-level labels are weak supervision compared with object bounding boxes; for a semantic segmentation task, image-level labels and object bounding boxes are weak supervision compared with pixel-level labels.

Target detection and semantic segmentation have long been research hotspots in computer vision. Weakly supervised target detection and semantic segmentation still face challenges, which fall mainly into two aspects: robustness and computational complexity.

The robustness of target detection and semantic segmentation is mainly affected by intra-class and inter-class appearance differences: large intra-class differences and small inter-class differences generally reduce the robustness of a target detection method. Intra-class appearance differences are the variations between different individuals of the same class; for example, individual horses differ in color, texture, shape, and pose. Because of illumination, background, pose, viewpoint changes, and occlusion, even the same horse can look very different in different images, which makes it extremely difficult to build an appearance model that generalizes.

The computational complexity of target detection and semantic segmentation derives mainly from the number of target classes to be detected, the dimensionality of the class appearance descriptors, and the acquisition of large amounts of labeled data. The real world contains hundreds to thousands of object categories, appearance descriptors are high-dimensional, and acquiring sufficient labeled data is extremely time-consuming and labor-intensive, so the computational complexity of target detection and semantic segmentation is high and the design of efficient algorithms is very important. Some current work proposes new feature matching methods and localization strategies. Another line of research on computational complexity aims to reduce the search space in target detection and semantic segmentation; such methods are collectively referred to as selective search strategies (Selective Search) or object estimation (Estimation). Their core idea is that not every sub-region of an image contains a category-independent object, and only a few candidate windows are meaningful candidate regions for target detection and semantic segmentation.

Disclosure of Invention

The invention aims to provide a weakly supervised joint target detection and semantic segmentation method based on loop guidance.

The invention comprises the following steps:

(I) Model training process:

1) initializing a convolutional neural network;

2) forward-propagating the neural network to obtain a feature map of the image;

3) forward-propagating the target detection branch to obtain an object localization map;

4) forward-propagating the semantic segmentation branch to obtain a segmentation mask;

5) obtaining pseudo ground-truth semantic segmentation labels from the object localization map, and using them as supervision to train semantic segmentation;

6) obtaining the weights of the image candidate regions from the segmentation mask, and using them as a localization prior to correct the candidate regions;

7) calculating the loss of the semantic segmentation branch based on the pseudo ground-truth semantic segmentation labels;

8) calculating the loss of the target detection branch in combination with the weights of the candidate regions;

9) updating the parameters with a stochastic gradient descent algorithm;

10) repeating steps 2) to 9) until convergence;

11) inputting an image into the neural network to obtain target detection and semantic segmentation results;

In steps 5) and 6), the invention uses a loop-guided mechanism so that the two branches assist each other's learning. The object localization map from weakly supervised target detection yields pseudo ground-truth semantic segmentation labels, which serve as supervision for training semantic segmentation; the segmentation mask predicted by weakly supervised semantic segmentation yields weights for the image candidate regions, which serve as a localization prior for correcting the candidate regions.
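The data flow of one training iteration, steps 2) to 8), can be sketched as follows. This is only an illustrative skeleton: the function name `loop_guided_step`, the dictionary keys, and the choice to add the two losses into a single total are assumptions made for the example and are not specified in the text. Each entry of `net` is a hypothetical callable standing in for the corresponding module.

```python
import numpy as np

def loop_guided_step(image, image_labels, proposals, net):
    """One loop-guided iteration; `net` maps hypothetical names to callables."""
    features = net["backbone"](image)                           # step 2: feature map
    scores, loc_map = net["detect"](features, proposals)        # step 3: detection branch
    seg_mask = net["segment"](features)                         # step 4: segmentation branch
    pseudo_labels = net["to_pseudo_labels"](loc_map)            # step 5: detection guides segmentation
    weights = net["proposal_weights"](seg_mask, proposals)      # step 6: segmentation guides detection
    seg_loss = net["seg_loss"](seg_mask, pseudo_labels)         # step 7
    det_loss = net["det_loss"](scores, weights, image_labels)   # step 8
    # Assumption: the two losses are simply summed before the SGD update of step 9.
    return {"seg_loss": seg_loss, "det_loss": det_loss, "total": seg_loss + det_loss}
```

The point of the sketch is the ordering: both branches run forward first, and only then does each branch's output become the other branch's supervision or prior.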

In step 7), the loss function of the semantic segmentation branch is:

In step 8), the loss function of the target detection branch is:

(II) Model inference process:

12) initializing a convolutional neural network;

13) forward-propagating the neural network to obtain a feature map of the image;

14) forward-propagating the target detection branch to obtain a target detection result;

15) forward-propagating the semantic segmentation branch to obtain a semantic segmentation mask;

16) obtaining an instance segmentation mask from the target detection result and the semantic segmentation mask.

From the perspective of weak supervision, the method learns target detection and semantic segmentation from images with image-level weak labels (only whether an image contains the target object is known). The invention is a novel weakly supervised joint target detection and semantic segmentation method based on loop guidance. Current weakly supervised target detection and weakly supervised semantic segmentation algorithms are usually separate and perform poorly. The invention provides a multi-task learning mechanism that combines weakly supervised target detection and semantic segmentation, and a loop-guided learning mechanism through which the two tasks assist each other. A deep convolutional neural network is used to train three modules simultaneously: a backbone neural network, a target detection branch, and a semantic segmentation branch. The backbone neural network extracts features of the whole image. The target detection branch performs classification prediction on each candidate region. The semantic segmentation branch classifies each position to form a segmentation mask.

The invention provides a weakly supervised target detection and semantic segmentation method trained jointly through multi-task learning, in which each task is enhanced by the complementary information of the other. The object localization map from weakly supervised target detection provides pseudo ground-truth semantic segmentation labels for weakly supervised semantic segmentation, and the mask predicted by weakly supervised semantic segmentation provides weights for the candidate regions of weakly supervised detection. The invention introduces a loop-guided learning strategy on top of existing weakly supervised models and learns the weakly supervised target detection model and the weakly supervised semantic segmentation model simultaneously. It improves both the weakly supervised target detector and the weakly supervised semantic segmentation model, which become more accurate than the original models. Extensive experimental results show that the proposed method achieves excellent weakly supervised target detection and weakly supervised semantic segmentation performance.

Drawings

FIG. 1 illustrates the loop-guided learning method of the present invention.

FIG. 2 shows object localization maps for weakly supervised target detection.

FIG. 3 shows the structural framework of the present invention.

FIG. 4 shows the complementary information of weakly supervised target detection and weakly supervised semantic segmentation.

Detailed Description

The invention provides a weakly supervised joint target detection and semantic segmentation method based on loop guidance. The following embodiments explain the invention in detail with reference to the accompanying drawings.

The main symbols used in the invention are defined first. Here, I ∈ R^{H×W×3} denotes an input image in RGB format, t ∈ {0, 1}^c denotes the corresponding image-level label, {p_1, …, p_R} denotes the candidate regions (proposals) of the image, R denotes the number of candidate regions, c denotes the total number of categories, and H and W denote the height and width of the input image, respectively.

As shown in FIG. 1, the invention uses a loop-guided strategy to train the weakly supervised target detection and weakly supervised semantic segmentation models together. First, the target detector predicts the category and position of each object; the detection result is then converted into an object localization map; the semantic segmenter is trained with the object localization map as pseudo ground-truth semantic segmentation labels; the semantic segmenter then predicts the segmentation mask of the image; finally, the weights of the candidate regions are calculated from the segmentation mask to correct the training of the target detector. In FIG. 2, the first column shows the input image, the second column shows the CAM-based segmentation map (B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning Deep Features for Discriminative Localization," in CVPR, 2016.), the third column shows the object localization map of the invention, and the fourth column shows the corrected object localization map. First, the object localization map provides higher-quality pseudo ground-truth semantic segmentation labels than the CAM-based segmentation map. Second, weakly supervised semantic segmentation often fails to predict consistent object contours, which is why many semantic segmentation methods need to refine the segmentation mask with a CRF (P. Krähenbühl and V. Koltun, "Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials," in NeurIPS, 2011.). Finally, although weakly supervised target detection can usually predict the correct object contour, it often cannot distinguish the number of objects, and its prediction often covers only part of an object.

Experiments show that the failure modes of weakly supervised target detection and weakly supervised semantic segmentation are complementary. On the one hand, the segmentation mask predicted by weakly supervised semantic segmentation can help weakly supervised target detection escape from local minima. On the other hand, the object localization map from weakly supervised target detection can provide high-quality pseudo ground-truth semantic segmentation labels.

As shown in FIG. 3, the invention uses a network such as VGGNet (Simonyan, Karen, and Andrew Zisserman. "Very Deep Convolutional Networks for Large-Scale Image Recognition," arXiv, 2014.) as the backbone of the basic model. In general, the deeper the backbone, the greater the expressive power of the model. The model has two branches: the first is a weakly supervised target detection branch and the second is a weakly supervised semantic segmentation branch.

Weakly supervised target detection branch. The weakly supervised target detection branch uses the WSDDN (H. Bilen and A. Vedaldi, "Weakly Supervised Deep Detection Networks," in CVPR, 2016.) model as its basic model. An image is input and its feature map is obtained; the features of the R candidate regions {p_1, …, p_R} are then obtained through an SPP layer (K. He, X. Zhang, S. Ren, and J. Sun, "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition," in ECCV, 2014.). The candidate region features are then passed through two streams: a classification stream and a detection stream. Each stream uses fully connected layers to output a score matrix, giving the two score matrices x^c and x^d.

The two score matrices are normalized with a softmax layer σ(·) along the category dimension and the candidate region dimension, respectively.

The normalized score matrices are then multiplied element-wise:

x^s = σ(x^c)·σ(x^d) (3)

To obtain an image-level prediction, sum pooling over the candidate regions is used:

Finally, the cross-entropy loss is obtained:

where t_k denotes the ground-truth label of the k-th category.
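The detection branch described above can be sketched in NumPy as follows. The sketch assumes that x^c and x^d both have shape (c, R), that the image-level score of category k is the sum of x^s over the R candidate regions, and that the loss is a per-category binary cross-entropy against t_k; these forms follow the usual WSDDN formulation and are assumptions here, since the pooling and loss equations are not reproduced in this text. The names `softmax` and `wsddn_image_loss` are hypothetical.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def wsddn_image_loss(x_c, x_d, t, eps=1e-8):
    """x_c, x_d: (c, R) raw score matrices; t: (c,) image-level labels in {0, 1}."""
    s_cls = softmax(x_c, axis=0)    # normalize over categories
    s_det = softmax(x_d, axis=1)    # normalize over candidate regions
    x_s = s_cls * s_det             # element-wise product, equation (3)
    y = x_s.sum(axis=1)             # sum pooling over regions -> (c,)
    y = np.clip(y, eps, 1.0 - eps)  # avoid log(0)
    # Assumed loss form: binary cross-entropy summed over categories.
    loss = -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
    return x_s, y, float(loss)
```

Because the detection stream sums to one over regions and the classification stream is at most one, each pooled score stays in (0, 1), which is what allows it to be used directly inside the cross-entropy.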

Weakly supervised semantic segmentation branch. The weakly supervised semantic segmentation branch is based on the DeepLab-ASPP (L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs," TPAMI, 2017.) model. The invention uses the object localization map generated by weakly supervised target detection as supervision for the weakly supervised semantic segmentation branch. Most weakly supervised semantic segmentation methods use a fully convolutional network, a softmax normalization layer, and a multinomial cross-entropy loss function. The invention instead uses a sigmoid normalization layer and a binary cross-entropy loss function:

where M and S denote, respectively, the object localization map from weakly supervised target detection and the segmentation mask predicted by weakly supervised semantic segmentation, and the height and width of the segmentation mask are typically H and W, respectively.
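A sigmoid layer followed by binary cross-entropy can be sketched as below. The loss equation itself is not reproduced in this text, so the averaging over positions and the handling of uncertain positions (skipped via a hypothetical `ignore_value` of -1) are assumptions of this example rather than details taken from the source.

```python
import numpy as np

def segmentation_bce_loss(logits, pseudo_labels, ignore_value=-1, eps=1e-8):
    """logits: (c, h, w) raw scores; pseudo_labels: (c, h, w) with 1, 0 or ignore_value."""
    s = 1.0 / (1.0 + np.exp(-logits))       # sigmoid normalization layer
    s = np.clip(s, eps, 1.0 - eps)          # avoid log(0)
    valid = pseudo_labels != ignore_value   # assumption: uncertain positions carry no loss
    m = np.where(valid, pseudo_labels, 0).astype(np.float64)
    bce = -(m * np.log(s) + (1.0 - m) * np.log(1.0 - s))
    n_valid = max(int(valid.sum()), 1)      # assumption: average over labeled positions
    return float((bce * valid).sum() / n_valid)
```

Using a per-category sigmoid instead of a softmax lets each category map be supervised independently, which fits pseudo labels that are produced one category at a time.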

Loop-guided learning. In theory, the error patterns of weakly supervised target detection and weakly supervised semantic segmentation are complementary. On the one hand, weakly supervised target detection is typically formulated as multiple-instance classification. It can explicitly use background regions to penalize false positives, so weakly supervised target detection has a low false-positive rate. However, to keep self-reinforcement from trapping it in local minima, weakly supervised target detection usually penalizes only high-confidence false positives, and therefore usually has ambiguous feature patterns in non-salient regions. On the other hand, the loss of weakly supervised semantic segmentation is computed at the pixel level. Its lack of an explicit penalty for false positives results in noisy background predictions, but its fine-grained predictions in the regions that are ambiguous for weakly supervised target detection can be used to aid target localization. The invention therefore proposes a loop-guided learning strategy in which each task is assisted by the complementary information of the other.

Guidance of semantic segmentation by target detection. The object localization map from weakly supervised target detection is used to help train weakly supervised semantic segmentation. The foreground and background cues built into weakly supervised target detection are used, and notably no additional parameters are required. Specifically, the gradient of the classification score is back-propagated to the first layer of the network, yielding a coarse object localization map, as shown in the second row of FIG. 4. The coarse object localization map is first normalized to the range (0, 1). Positions with values above 0.1 are then set as foreground, positions with values below 0.005 are set as background, and the remaining positions are set as uncertain. The resulting corrected object localization map is used as the pseudo ground-truth semantic segmentation label, as shown in the third row of FIG. 4.
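The thresholding step can be sketched as follows. The thresholds 0.1 and 0.005 come from the text; the min-max normalization, the integer codes for the three region types, and the function name `localization_to_pseudo_labels` are assumptions made for this example.

```python
import numpy as np

FOREGROUND, BACKGROUND, UNCERTAIN = 1, 0, -1  # hypothetical label codes

def localization_to_pseudo_labels(loc_map, fg_thresh=0.1, bg_thresh=0.005):
    """loc_map: (h, w) coarse localization map for one category."""
    loc_map = np.asarray(loc_map, dtype=np.float64)
    lo, hi = loc_map.min(), loc_map.max()
    # Assumption: min-max normalization; a constant map is treated as all zeros.
    norm = (loc_map - lo) / (hi - lo) if hi > lo else np.zeros_like(loc_map)
    labels = np.full(loc_map.shape, UNCERTAIN, dtype=np.int64)
    labels[norm > fg_thresh] = FOREGROUND   # confident object positions
    labels[norm < bg_thresh] = BACKGROUND   # confident background positions
    return labels
```

Positions left as uncertain fall between the two thresholds, so a segmentation loss can skip them instead of being trained on a guess.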

Guidance of target detection by semantic segmentation. The segmentation mask predicted by weakly supervised semantic segmentation is used as a localization prior to correct the candidate regions. From the segmentation mask S^k, the position and shape of an object can be roughly estimated. The density of each candidate region is then calculated:

where S^k_{ij} denotes the element in the i-th row and j-th column of S^k, γ is 0.1, and max M^k denotes the maximum value of M^k. The density of the context region of each candidate region is also calculated.

A response value is then calculated for each candidate region:

Finally, the weighted candidate region scores are obtained:

X^r = σ(X^c)·σ(X^d)·W^r (9)

where X^r denotes the score matrix after the candidate regions are corrected. The corrected image-level prediction score can then be calculated:

Finally, the corrected cross-entropy loss function is obtained:
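One plausible way to turn a segmentation mask into candidate region weights is sketched below. The density and response equations are not reproduced in this text, so everything beyond γ = 0.1 is an assumption: the mask is binarized at γ times its own maximum, the density is the fraction of active positions inside a box, the context region is a ring obtained by enlarging the box by a hypothetical `context_scale`, and the weight is the inner density minus the ring density, clipped at zero. Boxes are assumed to be integer (x0, y0, x1, y1) tuples lying inside the mask.

```python
import numpy as np

def proposal_weights(seg_mask_k, boxes, gamma=0.1, context_scale=1.5):
    """seg_mask_k: (h, w) mask for category k; boxes: iterable of (x0, y0, x1, y1)."""
    h, w = seg_mask_k.shape
    binary = (seg_mask_k > gamma * seg_mask_k.max()).astype(np.float64)
    weights = []
    for x0, y0, x1, y1 in boxes:
        inner = binary[y0:y1, x0:x1]
        # Enlarge the box around its center to define the context region.
        cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
        half_w = (x1 - x0) * context_scale / 2.0
        half_h = (y1 - y0) * context_scale / 2.0
        ox0, oy0 = max(int(round(cx - half_w)), 0), max(int(round(cy - half_h)), 0)
        ox1, oy1 = min(int(round(cx + half_w)), w), min(int(round(cy + half_h)), h)
        outer = binary[oy0:oy1, ox0:ox1]
        ring_area = outer.size - inner.size
        ring_density = (outer.sum() - inner.sum()) / ring_area if ring_area > 0 else 0.0
        inner_density = inner.mean() if inner.size else 0.0
        # Assumed response: dense inside, sparse in the surrounding ring.
        weights.append(max(float(inner_density - ring_density), 0.0))
    return np.array(weights)
```

Under these assumptions a box that tightly encloses an object scores high, while a box covering only part of an object is penalized because its ring still contains object positions. The row of weights for category k would then multiply the k-th row of σ(X^c)·σ(X^d), as in equation (9).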

In the inference process, the target detection branch computes the detection results for the candidate regions, which are then filtered with non-maximum suppression. Meanwhile, the semantic segmentation branch outputs a semantic segmentation mask for the whole image. Finally, extracting the part of the mask enclosed by each bounding box yields the instance segmentation result.
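The inference steps can be sketched as greedy non-maximum suppression followed by cropping the semantic mask to each surviving box. The IoU threshold of 0.5, the mask threshold of 0.5, and the function names are assumptions for the example; boxes are assumed to be integer (x0, y0, x1, y1) rows with positive area.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an (N, 4) array of boxes."""
    ix0 = np.maximum(box[0], boxes[:, 0])
    iy0 = np.maximum(box[1], boxes[:, 1])
    ix1 = np.minimum(box[2], boxes[:, 2])
    iy1 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(ix1 - ix0, 0, None) * np.clip(iy1 - iy0, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; returns indices of kept boxes."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= iou_thresh]
    return keep

def instance_masks(boxes, scores, class_ids, seg_mask, seg_thresh=0.5, iou_thresh=0.5):
    """seg_mask: (c, h, w) sigmoid outputs. Returns a list of (class_id, box, mask)."""
    results = []
    for k in np.unique(class_ids):
        idx = np.flatnonzero(class_ids == k)
        for j in nms(boxes[idx], scores[idx], iou_thresh):
            x0, y0, x1, y1 = boxes[idx[j]]
            inst = np.zeros(seg_mask.shape[1:], dtype=bool)
            # Keep only the semantic mask of category k that lies inside the box.
            inst[y0:y1, x0:x1] = seg_mask[k, y0:y1, x0:x1] > seg_thresh
            results.append((int(k), boxes[idx[j]], inst))
    return results
```

Suppression is applied per category, so overlapping boxes of different categories are both kept and each receives its own mask.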

The invention is a novel weakly supervised joint target detection and semantic segmentation method based on loop guidance. Current weakly supervised target detection and weakly supervised semantic segmentation algorithms are usually separate and perform poorly. The invention provides a multi-task learning mechanism that combines weakly supervised target detection and semantic segmentation, and a loop-guided learning mechanism through which the two tasks assist each other. A deep convolutional neural network is used to train three modules simultaneously: a backbone neural network, a target detection branch, and a semantic segmentation branch. The backbone neural network extracts features of the whole image. The target detection branch performs classification prediction on each candidate region. The semantic segmentation branch classifies each position to form a segmentation mask.

The invention uses the complementary information of target detection and semantic segmentation to enhance each task. The object localization map from weakly supervised target detection provides pseudo ground-truth semantic segmentation labels for weakly supervised semantic segmentation, and the mask predicted by weakly supervised semantic segmentation provides weights for the candidate regions of weakly supervised detection. In summary, the invention introduces a loop-guided learning strategy on top of existing weakly supervised models and learns the weakly supervised target detection model and the weakly supervised semantic segmentation model simultaneously. As a result, both the weakly supervised target detector and the weakly supervised semantic segmentation model are improved and become more accurate than the original models. Extensive experimental results show that the proposed method achieves excellent weakly supervised target detection and weakly supervised semantic segmentation performance.

Claims

1. A weakly supervised joint target detection and semantic segmentation method based on loop guidance, characterized by comprising the following steps:

(I) Model training process:

1) initializing a convolutional neural network;

9) updating the parameters with a stochastic gradient descent algorithm;

10) repeating steps 2) to 9) until convergence;

(II) Model inference process:

12) initializing a convolutional neural network;

2. The weakly supervised joint target detection and semantic segmentation method based on loop guidance as claimed in claim 1, wherein in step 7), the loss function of the semantic segmentation branch is:

where M and S denote, respectively, the object localization map from weakly supervised target detection and the segmentation mask predicted by weakly supervised semantic segmentation, and the height and width of the segmentation mask are typically H and W, respectively.

3. The weakly supervised joint target detection and semantic segmentation method based on loop guidance as claimed in claim 1, wherein in step 8), the loss function of the target detection branch is: