CN116342857A - Weak supervision target positioning method based on category correction - Google Patents

Weak supervision target positioning method based on category correction

Info

Publication number: CN116342857A
Application number: CN202310336796.7A
Authority: CN (China)
Prior art keywords: network, positioning, foreground, mask, category
Legal status: Pending (assumed; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventors: 瞿响宇, 杜博, 王增茂, 罗伊文, 陈尚法, 何向阳
Current and original assignees: Wuhan University (WHU); Changjiang Institute of Survey Planning Design and Research Co., Ltd.
Application filed by the assignees; filing and priority date: 2023-03-28
Publication of CN116342857A: 2023-06-27

Classifications

    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition
    • G06N3/02, G06N3/08 Computing arrangements based on biological models; neural networks; learning methods
    • G06V10/40 Extraction of image or video features
    • G06V10/74, G06V10/761 Image or video pattern matching; proximity, similarity or dissimilarity measures in feature spaces
    • G06V10/764 Recognition using classification, e.g. of video objects
    • G06V10/77, G06V10/774 Processing image or video features in feature spaces; generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V10/82 Recognition using neural networks
    • Y02T10/10, Y02T10/40 Climate change mitigation technologies related to transportation; internal combustion engine based vehicles; engine management systems


Abstract

The invention belongs to the field of computer vision, and particularly relates to a weak supervision target positioning method based on category correction. To overcome the inaccurate localization of CAM technology, the method no longer uses class activation maps for positioning but adopts a coarse-to-fine pipeline. The model of the invention consists of a backbone network, a positioning network and a classification network: the positioning network first generates a class-agnostic segmentation map with an unsupervised segmentation technique to determine the rough position of the target object, and the classification network then performs fine-grained correction using the class labels. The category-correction-based method can localize objects accurately and recover contour details well.

Description

Weak supervision target positioning method based on category correction
Technical Field
The invention belongs to the field of computer vision, and relates to a weak supervision target positioning method based on category correction.
Background
Target positioning is a fundamental perception task in computer vision that aims to locate the position of a target object in an image and determine the category to which it belongs. In practical applications, however, giving an algorithm model good generalization performance usually requires large-scale, labor-intensive annotation of bounding boxes or even pixel-level labels. Because of this annotation cost, the weakly supervised target positioning task lets models locate objects relying only on class labels, which are easy to obtain and annotate. For this problem, mainstream research at home and abroad is based on CAM technology, which determines the object position from the class-related highlighted regions of the activation map. However, such methods generally locate only the most class-discriminative parts of the object, so the predicted localization box is often smaller than the target object. How to obtain an accurate localization box is therefore a problem to be solved in the field of weak supervision target positioning.
Disclosure of Invention
The invention provides a weak supervision target positioning method based on category correction. To overcome the inaccurate localization of CAM technology, positioning is no longer performed with class activation maps; instead, a coarse-to-fine pipeline is used. The algorithm of the invention consists of a positioning network and a classification network. First, the positioning network generates a class-agnostic segmentation map with an unsupervised segmentation technique to determine the rough location of the target object; fine-grained correction is then performed by the classification network via the class labels. The category-correction-based method can localize objects accurately and recover contour details well.
In the technical scheme provided by the invention, the coarse-to-fine target positioning method comprises a training stage and a testing stage, wherein the training stage comprises the following steps:
Step 1, constructing a target positioning model, wherein the target positioning model comprises a backbone network, a classification network and a positioning network; the backbone network performs feature extraction on the input image, and the classification network and the positioning network are dual heads that perform classification and mask prediction, respectively, on the features extracted by the backbone network;
Step 2, for the input image I, generating a synthetic image $I_s$ whose distribution is similar to that of the training samples together with its foreground mask $M_s$, then inputting the synthetic image $I_s$ into the target positioning model to obtain the mask $\hat{M}_s$ predicted by the positioning network;
Step 3, picture-level fine positioning stage: the difference between foreground and background is increased at the image level so that the positioning network localizes more accurately; this comprises the following substeps:
Step 3.1, obtaining the foreground mask prediction $\hat{M}_r$ of a real picture $I_r$ with the positioning network that acquired rough positioning capability in step 2;
Step 3.2, taking the Hadamard product of the predicted foreground mask $\hat{M}_r$ and the real picture $I_r$ to obtain the class-agnostic foreground attention image $I_f$; at the same time, applying a 0-1 inversion to the foreground mask to obtain $1-\hat{M}_r$, and taking the Hadamard product of the real picture $I_r$ and $1-\hat{M}_r$ to obtain the class-agnostic background attention image $I_b$;
Step 3.3, feeding the foreground attention image $I_f$ and the background attention image $I_b$ separately into the classification network for prediction to obtain the predicted probability vectors $\hat{p}_f$ and $\hat{p}_b$;
Step 4, feature-level fine positioning stage: after the foreground-background difference has been amplified at the image level, the difference between foreground and background is increased at the feature level with the same method as in step 3, so that the positioning network further corrects wrongly localized details and outputs the final positioning result;
the test phase is as follows:
disconnecting the positioning network and the classification network, and obtaining a final positioning frame by threshold screening of the foreground mask of the positioning network:
$Box = \mathrm{select}(\hat{M}_t > \theta)$

where $\hat{M}_t$ denotes the mask of the test sample predicted by the positioning network, $\theta$ is the screening threshold, and the select function selects the part of $\hat{M}_t$ greater than the threshold and returns the minimum bounding box containing all foreground coordinates as the finally determined bounding box Box.
Furthermore, in step 1, the backbone network adopts a U-Net network structure, and the positioning network adopts a convolutional (CNN) network structure.
Further, in step 2, a BigBiGAN method is used to generate the synthetic image and mask.
Further, the specific formula for obtaining the mask $\hat{M}_s$ in step 2 is as follows:

$\hat{M}_s = f(I_s; \theta_B, \theta_L)$

where $\theta_B$ and $\theta_L$ denote the parameters of the backbone network and the positioning network, respectively.
Further, the positioning network is optimized with a binary cross-entropy function, and the loss function is as follows:

$\mathcal{L}_{loc} = -\frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\left[ M_s^{ij}\log \hat{M}_s^{ij} + (1 - M_s^{ij})\log\left(1 - \hat{M}_s^{ij}\right) \right]$

where m and n are the width and height of the mask, $M_s^{ij}$ is the element in row i and column j of the foreground mask $M_s$, and $\hat{M}_s^{ij}$ is the element in row i and column j of the predicted mask $\hat{M}_s$.
Further, the probability vectors $\hat{p}_f$ and $\hat{p}_b$ predicted in step 3.3 are calculated as:

$\hat{p}_f = f(I_f; \theta_B, \theta_C), \quad \hat{p}_b = f(I_b; \theta_B, \theta_C)$

where $\theta_B$ and $\theta_C$ denote the parameters of the backbone network and the classifier, respectively; the loss functions for the foreground and background attention images are specifically as follows:

$\mathcal{L}_f = -\sum_{k=1}^{K} y_k \log \hat{p}_f^k, \quad \mathcal{L}_b = \sum_{k=1}^{K} \hat{p}_b^k \log \hat{p}_b^k$

where $\mathcal{L}_f$ is the cross-entropy function of the foreground attention image with the class label $y$, $\mathcal{L}_b$ is the negative of the entropy of the background attention image, and K is the number of categories of the whole dataset; the overall loss function of the picture-level fine positioning stage can be expressed as:

$\mathcal{L}_{img} = \alpha \mathcal{L}_f + \beta \mathcal{L}_b$

where α and β are balance parameters.
Further, the specific implementation of step 4 is as follows:
Step 4.1, for the real picture $I_r$, obtaining the feature map $F_r$ and the mask $\hat{M}_r$ with the positioning network trained in step 3, the feature map being calculated as:

$F_r = f(I_r; \theta_B)$

where $\theta_B$ are the parameters of the backbone network;
Step 4.2, taking the Hadamard product of the feature map $F_r$ with the mask $\hat{M}_r$ and with its 0-1 inversion $1-\hat{M}_r$, respectively, to obtain the foreground feature map $F_f$ and the background feature map $F_b$, with the formula:

$F_f = F_r \odot \hat{M}_r, \quad F_b = F_r \odot (1 - \hat{M}_r)$

Step 4.3, fixing the classification network weights trained in step 3 and using the classification network as a judge of mask quality; feeding the foreground feature map $F_f$ and the background feature map $F_b$ separately into the classification network for prediction to obtain the predicted probability vectors $\hat{q}_f$ and $\hat{q}_b$:

$\hat{q}_f = f(F_f; \theta_C), \quad \hat{q}_b = f(F_b; \theta_C)$

where $\theta_C$ denotes the parameters of the classifier; the specific loss function is as follows:

$\mathcal{L}_f^{feat} = -\sum_{k=1}^{K} y_k \log \hat{q}_f^k, \quad \mathcal{L}_b^{feat} = \sum_{k=1}^{K} \hat{q}_b^k \log \hat{q}_b^k$

where $\mathcal{L}_f^{feat}$ is the cross-entropy of the foreground features, $\mathcal{L}_b^{feat}$ is the negative of the entropy of the background features, and K is the number of classes of the training samples; the loss function of the feature-level fine positioning stage can be expressed as:

$\mathcal{L}_{feat} = \alpha \mathcal{L}_f^{feat} + \beta \mathcal{L}_b^{feat}$

where α and β are balance parameters.
Further, the value of the threshold value θ is 0.55±0.05.
Further, the values of α and β are both 1.
Compared with the prior art, the invention has the beneficial effects that:
the invention avoids the defect of small positioning caused by CAM technology, the CAM obtains the category attention image through the category information training in the whole course, but ignores the object area with low category identification degree, thus only performing rough positioning and having very bad positioning effect on the fine granularity data set. The invention adopts the flow combining the category irrelevant information and the category relevant information, trains the network by using the category irrelevant segmentation map, and carries out detail correction by the category information, thereby achieving the effect of fine positioning. The invention can completely locate the outline of the object, and the feature map can clearly outline the outline information of the target object. In the fine positioning stage, the category information plays an auxiliary correction role, so that the network can not ignore a foreground region with low category identification degree, and the defect of CAM technology is overcome.
Drawings
Fig. 1 is a training flow chart in an embodiment of the present invention.
FIG. 2 is a flow chart of a test in an embodiment of the invention.
Detailed Description
The invention is further described below with reference to the drawings and specific examples.
The invention provides a weak supervision target positioning algorithm based on category correction. The algorithm utilizes both class-agnostic and class-related information while avoiding the technical disadvantages of CAM [1]. The invention provides a dual-head locator-classifier network structure for learning class-agnostic and class-related information: the locator consists of a segmentation network that predicts the foreground mask of the input image, and the classifier then corrects the locator's predictions from the image level and the feature level, respectively.
[1] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.
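For illustration only, this locator-classifier dual-head structure could be organized as in the following Python/PyTorch sketch. The backbone, positioning head and classification head shown here are deliberately minimal stand-ins (the embodiment uses a U-Net style backbone and a convolutional positioning head, which are not reproduced in full), and all class and layer names are assumptions of this sketch rather than the patent's own code.

```python
import torch
import torch.nn as nn

class CategoryCorrectionLocalizer(nn.Module):
    """Backbone plus dual heads: a positioning (mask) head and a classification head."""

    def __init__(self, num_classes: int, feat_dim: int = 64):
        super().__init__()
        # stand-in encoder; the patent describes a U-Net style backbone (theta_B)
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
        )
        # positioning head (theta_L): predicts a 1-channel foreground mask in [0, 1]
        self.localizer = nn.Sequential(nn.Conv2d(feat_dim, 1, 1), nn.Sigmoid())
        # classification head (theta_C): predicts class logits from pooled features
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feat_dim, num_classes),
        )

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)          # F_r: shared features
        mask = self.localizer(feats)      # predicted foreground mask
        logits = self.classifier(feats)   # class scores
        return feats, mask, logits
```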
The present invention optimizes the algorithm model with a coarse-to-fine training process, as shown in Fig. 1. In the coarse localization stage, for class-agnostic information, we generate a synthetic image $I_s$ and its foreground mask $M_s$ with an unsupervised method; related unsupervised methods include [2], [3]. The locator predicts the mask of the synthetic image $I_s$ and is supervised by the unsupervised-generated mask $M_s$, so that it acquires a segmentation capability that is independent of category.
[2] A. Voynov, S. Morozov, and A. Babenko, "Object segmentation without labels with large-scale generative models," in International Conference on Machine Learning. PMLR, 2021, pp. 10596–10606.
[3] M. Chen, T. Artieres, and L. Denoyer, "Unsupervised object segmentation by redrawing," Advances in Neural Information Processing Systems, vol. 32, 2019.
In the fine positioning stage, the locator, which now has coarse positioning capability, first predicts the foreground mask $\hat{M}_r$ of the real picture $I_r$. To further increase the difference between foreground and background, we obtain foreground and background attention images via the Hadamard product of the foreground mask (and its inversion) with the original image. The classifier is optimized with different classification tasks on the two attention images: the foreground image is supervised with the class labels, while the background image should not be classified into any category, so it is supervised by suppressing the classes to which the foreground image strongly belongs. In addition, the invention further increases the difference between foreground and background at the feature level: the Hadamard product of the locator's feature map and the predicted foreground mask yields separated foreground and background feature maps. After image-level fine positioning, the classifier has learned a certain foreground classification capability; its weights are then fixed so that it does not participate in gradient back-propagation, and it is used to judge the separation quality of the foreground and background feature maps, thereby increasing the foreground-background discrimination at the feature level.
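As a concrete illustration of the Hadamard-product separation described above, a minimal sketch (assuming PyTorch tensors, with an image or feature batch of shape (N, C, H, W) and a predicted mask of shape (N, 1, H, W); the helper name is hypothetical) could look like this:

```python
import torch

def split_foreground_background(x: torch.Tensor, mask: torch.Tensor):
    """Split a batch of images or feature maps into foreground/background parts.

    x:    (N, C, H, W) images I_r or feature maps F_r
    mask: (N, 1, H, W) predicted foreground mask with values in [0, 1]
    Returns the Hadamard products x * mask and x * (1 - mask),
    i.e. the class-agnostic foreground and background attention inputs.
    """
    fg = x * mask          # foreground attention image / feature map
    bg = x * (1.0 - mask)  # 0-1 inverted mask gives the background part
    return fg, bg
```

The same operation applies unchanged at the image level (yielding $I_f$, $I_b$) and at the feature level (yielding $F_f$, $F_b$), which is exactly the symmetry the two fine positioning stages rely on.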
After image-level and feature-level fine positioning training, the locator can better judge foreground and background regions from a semantic perspective through category correction, while avoiding the defect of CAM technology that foreground regions without class discriminability cannot be found. This is because the locator is trained on class-agnostic features and therefore already recognizes contour and texture information well; once category correction is added, the algorithm can judge both semantics-related and semantics-agnostic foreground information.
As shown in Fig. 2, in the test stage the invention uses the trained locator alone to complete the target localization task. The locator first predicts the mask $\hat{M}_t$ of the real picture $I_r$, and the foreground mask $\hat{M}_t$ is then binarized to obtain a binary foreground mask in which foreground pixels take the value 1 and background pixels the value 0. The foreground pixels may form several disconnected clusters; the largest connected foreground cluster is selected as the predicted foreground and the rest is treated as background. For the screened foreground, the tightest bounding box containing the foreground (i.e., the smallest bounding box containing all foreground coordinates) is taken as the localization box of the target object.
The flow provided by the embodiment specifically comprises the following steps:
Step 1, constructing a target positioning model, wherein the target positioning model comprises a backbone network, a classification network and a positioning network; the backbone network performs feature extraction on the input image, and the classification network and the positioning network are dual heads that perform classification and mask prediction, respectively, on the features extracted by the backbone network.
Step 2: a synthetic image $I_s$ whose distribution is similar to that of the training samples and its foreground mask $M_s$ are generated with an unsupervised algorithm such as a GAN; in the example, the BigBiGAN-based method of [2] is selected to generate the synthetic image and mask. The synthetic image $I_s$ is then input into the target positioning model to obtain the foreground mask $\hat{M}_s$ predicted by the positioning network, with the formula:

$\hat{M}_s = f(I_s; \theta_B, \theta_L) \quad (1)$

where $\theta_B$ and $\theta_L$ denote the parameters of the backbone network and the positioning network, respectively; in the example $\theta_B$ adopts a U-Net structure, $\theta_L$ adopts a convolutional (CNN) structure, and $f$ denotes the mask-prediction process of the network. The closer a pixel value of the mask $\hat{M}_s$ is to 1, the more likely the positioning network judges that pixel to be foreground; conversely, the closer the pixel value is to 0, the more likely it is judged to be background. The algorithm then optimizes the positioning network with a binary cross-entropy loss:

$\mathcal{L}_{loc} = -\frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\left[ M_s^{ij}\log \hat{M}_s^{ij} + (1 - M_s^{ij})\log\left(1 - \hat{M}_s^{ij}\right) \right] \quad (2)$

where m and n are the width and height of the mask, $M_s^{ij}$ is the element in row i and column j of the foreground mask $M_s$, and $\hat{M}_s^{ij}$ is the element in row i and column j of the predicted mask $\hat{M}_s$. Through step 2, the positioning network acquires a class-agnostic rough positioning capability; the following steps then use the class information to perform class-related correction on the positioning network.
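A minimal sketch of one coarse-localization training step as just described (mask prediction followed by binary cross-entropy supervision against the unsupervised mask) might look as follows; it assumes the hypothetical model class sketched earlier, a standard PyTorch optimizer, and synthetic pairs (I_s, M_s) produced elsewhere, none of which are specified by the patent at code level.

```python
import torch
import torch.nn.functional as F

def coarse_localization_step(model, optimizer, synth_img, synth_mask):
    """One optimization step of the coarse stage.

    synth_img:  (N, 3, H, W) synthetic images I_s (e.g. from an unsupervised generator)
    synth_mask: (N, 1, H, W) unsupervised foreground masks M_s with values in {0, 1}
    """
    _, pred_mask, _ = model(synth_img)                 # predicted mask from theta_B, theta_L
    # resize the target mask to the prediction's resolution if they differ
    target = F.interpolate(synth_mask, size=pred_mask.shape[-2:], mode="nearest")
    loss = F.binary_cross_entropy(pred_mask, target)   # pixel-wise BCE supervision
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```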
Step 3: picture-level fine positioning stage: the disparity between foreground and background is increased at the image level to correct the coarse localization. Step 3 can be divided into the following sub-steps:
Step 3.1: the predicted mask $\hat{M}_r$ of a real picture $I_r$ is obtained through the positioning network with coarse positioning capability.
Step 3.2: the quality of the mask $\hat{M}_r$ is evaluated using the category information and corrected during training. The Hadamard product of the sample $I_r$ with the mask $\hat{M}_r$ and with its 0-1 inversion $1-\hat{M}_r$ is computed, respectively, to obtain the foreground attention image and the background attention image, with the formula:

$I_f = I_r \odot \hat{M}_r, \quad I_b = I_r \odot (1 - \hat{M}_r) \quad (3)$

Step 3.3: the foreground attention image $I_f$ and the background attention image $I_b$ are fed separately into the classification network for prediction to obtain the predicted probability vectors $\hat{p}_f$ and $\hat{p}_b$, expressed in the example as:

$\hat{p}_f = f(I_f; \theta_B, \theta_C), \quad \hat{p}_b = f(I_b; \theta_B, \theta_C) \quad (4)$

where $\theta_B$ and $\theta_C$ denote the parameters of the backbone network and the classification network, respectively. $I_f$ is supervised with the class label, and its loss function is the cross-entropy. $I_b$ does not belong to any class, so the model's class-probability prediction for $I_b$ should tend toward the average, with no class receiving a particularly high or particularly low probability; in the example this is expressed as making the entropy of the prediction probability for $I_b$ as large as possible. The loss functions for the foreground and background attention images are specifically:

$\mathcal{L}_f = -\sum_{k=1}^{K} y_k \log \hat{p}_f^k, \quad \mathcal{L}_b = \sum_{k=1}^{K} \hat{p}_b^k \log \hat{p}_b^k \quad (5)$

where $\mathcal{L}_f$ is the cross-entropy of the foreground attention image with respect to the class label $y$, $\mathcal{L}_b$ is the negative of the entropy of the background attention image, and K is the number of classes of the whole training set. The overall loss function of the picture-level fine positioning stage can be expressed as:

$\mathcal{L}_{img} = \alpha \mathcal{L}_f + \beta \mathcal{L}_b \quad (6)$

where α and β are balance parameters; in practice, extensive experiments show that setting both to 1 gives good results. In this step, on the one hand, class-related correction is applied to the locator at the image level; on the other hand, the classifier is trained to have classification capability, preparing for the class correction at the feature level in the next step.
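The image-level correction loss just described (cross-entropy on the foreground attention image plus entropy maximization on the background attention image) could be sketched roughly as below, again assuming the hypothetical model above, a classifier head that returns raw logits, and α = β = 1 as in the example.

```python
import torch
import torch.nn.functional as F

def image_level_losses(model, real_img, labels, alpha=1.0, beta=1.0):
    """Picture-level fine positioning loss: alpha * L_f + beta * L_b."""
    _, pred_mask, _ = model(real_img)
    mask = F.interpolate(pred_mask, size=real_img.shape[-2:], mode="bilinear",
                         align_corners=False)
    fg_img = real_img * mask            # I_f: class-agnostic foreground attention image
    bg_img = real_img * (1.0 - mask)    # I_b: background attention image
    _, _, fg_logits = model(fg_img)     # probabilities for the foreground image
    _, _, bg_logits = model(bg_img)     # probabilities for the background image
    loss_f = F.cross_entropy(fg_logits, labels)   # supervise foreground with class labels
    bg_prob = F.softmax(bg_logits, dim=1)
    entropy = -(bg_prob * torch.log(bg_prob + 1e-8)).sum(dim=1).mean()
    loss_b = -entropy                   # minimizing -entropy == maximizing background entropy
    return alpha * loss_f + beta * loss_b
```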
Step 4: after the foreground and background differences at the image level are increased, this step needs to further ensure that the foreground and background still have differences at the feature level, which is more beneficial to the foreground positioning by the positioning network. Step 4 may be subdivided into the following sub-steps:
step 4.1: for real picture I r Obtaining a feature map by using the positioning network trained in the step 3
Figure BDA00041567833100000614
And mask->
Figure BDA00041567833100000615
Wherein the calculation formula of the mask is the same as (1), and the calculation formula of the feature map is as follows:
Figure BDA00041567833100000616
wherein θ is B Is a parameter of the backbone network.
Step 4.2: map the characteristic map
Figure BDA0004156783310000071
And mask->
Figure BDA0004156783310000072
0-1 conversion of mask +.>
Figure BDA0004156783310000073
Respectively carrying out Hadamard products to obtain a foreground characteristic diagram +.>
Figure BDA0004156783310000074
And background feature map->
Figure BDA0004156783310000075
The formula is as follows:
Figure BDA0004156783310000076
step 4.3: fixing the classified network weight trained by the step 3, and taking the classified network weight as a judging device of mask quality. Respectively comparing the foreground feature images
Figure BDA0004156783310000077
And background feature map->
Figure BDA0004156783310000078
Feeding into a classification network for prediction to obtain predicted probability characteristics +.>
Figure BDA0004156783310000079
And +.>
Figure BDA00041567833100000710
The formula in the example is expressed as:
Figure BDA00041567833100000711
wherein θ is C Representing parameters of the classification network. The function for the foreground probability features and the background probability features are identical to those in step 3.3. The algorithm optimizes the foreground probability features using a minimized cross entropy function and the background probability features using a maximized entropy function, the specific loss function of which is as follows in the example:
Figure BDA00041567833100000712
wherein the method comprises the steps of
Figure BDA00041567833100000713
Is the cross entropy of the foreground features, +.>
Figure BDA00041567833100000714
Is the negative of the entropy of the background feature, and K is the number of classes of the training sample as a whole. The overall loss function at the feature level fine positioning stage can be expressed as:
Figure BDA00041567833100000715
where α and β are balance parameters, and in practice, a lot of experiments prove that, consistent with step 3, it is found that setting both to 1 can make the algorithm achieve good effect.
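Analogously, the feature-level correction of step 4, with the classifier weights frozen so that it only judges the quality of the separated feature maps, might be sketched as follows (same assumptions as the previous sketch; all names are illustrative, and freezing the classifier still lets gradients flow back into the backbone and positioning features).

```python
import torch
import torch.nn.functional as F

def feature_level_losses(model, real_img, labels, alpha=1.0, beta=1.0):
    """Feature-level fine positioning loss with the classifier head frozen."""
    feats, pred_mask, _ = model(real_img)               # F_r and predicted mask
    mask = F.interpolate(pred_mask, size=feats.shape[-2:], mode="bilinear",
                         align_corners=False)
    fg_feat = feats * mask                               # F_f: foreground feature map
    bg_feat = feats * (1.0 - mask)                       # F_b: background feature map
    # classifier weights are frozen: it only judges separation quality,
    # while gradients still reach the backbone/localizer through the features
    for p in model.classifier.parameters():
        p.requires_grad_(False)
    fg_logits = model.classifier(fg_feat)
    bg_logits = model.classifier(bg_feat)
    loss_f = F.cross_entropy(fg_logits, labels)          # minimize CE on foreground features
    bg_prob = F.softmax(bg_logits, dim=1)
    entropy = -(bg_prob * torch.log(bg_prob + 1e-8)).sum(dim=1).mean()
    loss_b = -entropy                                     # maximize background entropy
    return alpha * loss_f + beta * loss_b
```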
Although step 4 and step 3 are similar, step 4 is necessary. Since the classification network also participates in training in step 3, the loss function there partly corrects the features of the positioning model, but to a larger extent it adapts the classification network. In step 4, however, by fixing the classification network and performing the adjustment at the feature level, the classification information can be transferred more fully to the positioning network, so that fine-grained correction of the network's positioning result is carried out. At the same time, step 3 is also indispensable: without step 3, a classification network with generalization capability cannot be obtained.
The specific implementation also has the following notes:
in the test stage, for the selection of the threshold value theta, a great number of experiments prove that a good result can be obtained by taking the threshold value of 0.55 in the CUB data set. It should be noted that compared with the similar method, the sensitivity of the method to the threshold value is not high, and good effect can be obtained within the range of +/-0.15, and the threshold tolerance interval of the similar method is often less than +/-0.05.
It should be emphasized that the described embodiments of the present invention are illustrative rather than limiting. The invention therefore includes not only the examples described in the detailed description but also other embodiments that are obvious to a person skilled in the art from the solution of the invention, which fall within the scope of protection of the invention.

Claims (9)

1. A weak supervision target positioning method based on category correction, characterized in that it comprises a training phase and a testing phase, wherein the training phase comprises the following steps:
Step 1, constructing a target positioning model, wherein the target positioning model comprises a backbone network, a classification network and a positioning network; the backbone network performs feature extraction on the input image, and the classification network and the positioning network are dual heads that perform classification and mask prediction, respectively, on the features extracted by the backbone network;
Step 2, for the input image I, generating a synthetic image $I_s$ whose distribution is similar to that of the training samples together with its foreground mask $M_s$, then inputting the synthetic image $I_s$ into the target positioning model to obtain the mask $\hat{M}_s$ predicted by the positioning network;
Step 3, picture-level fine positioning stage: the difference between foreground and background is increased at the image level so that the positioning network localizes more accurately; this comprises the following substeps:
Step 3.1, obtaining the foreground mask prediction $\hat{M}_r$ of a real picture $I_r$ with the positioning network that acquired rough positioning capability in step 2;
Step 3.2, taking the Hadamard product of the predicted foreground mask $\hat{M}_r$ and the real picture $I_r$ to obtain the class-agnostic foreground attention image $I_f$; at the same time, applying a 0-1 inversion to the foreground mask to obtain $1-\hat{M}_r$, and taking the Hadamard product of the real picture $I_r$ and $1-\hat{M}_r$ to obtain the class-agnostic background attention image $I_b$;
Step 3.3, feeding the foreground attention image $I_f$ and the background attention image $I_b$ separately into the classification network for prediction to obtain the predicted probability vectors $\hat{p}_f$ and $\hat{p}_b$;
Step 4, feature-level fine positioning stage: after the foreground-background difference has been amplified at the image level, the difference between foreground and background is increased at the feature level with the same method as in step 3, so that the positioning network further corrects wrongly localized details and outputs the final positioning result;
the test phase is as follows:
disconnecting the positioning network and the classification network, and obtaining the final positioning frame by threshold screening of the foreground mask of the positioning network:

$Box = \mathrm{select}(\hat{M}_t > \theta)$

where $\hat{M}_t$ denotes the mask of the test sample predicted by the positioning network, $\theta$ is the screening threshold, and the select function selects the part of $\hat{M}_t$ greater than the threshold and returns the minimum bounding box containing all foreground coordinates as the finally determined bounding box Box.
2. The weak supervision target positioning method based on category correction as set forth in claim 1, wherein: in step 1, the backbone network adopts a U-Net network structure, and the positioning network adopts a convolutional (CNN) network structure.
3. The weak supervision target positioning method based on category correction as set forth in claim 1, wherein: in step 2, a BigBiGAN method is adopted to generate a composite image and a mask.
4. The weak supervision target positioning method based on category correction as set forth in claim 1, wherein: the mask $\hat{M}_s$ in step 2 is obtained with the following specific formula:

$\hat{M}_s = f(I_s; \theta_B, \theta_L)$

where $\theta_B$ and $\theta_L$ denote the parameters of the backbone network and the positioning network, respectively.
5. The weak supervision target positioning method based on category correction as defined in claim 4, wherein: the positioning network is optimized with a binary cross-entropy function, and the loss function is as follows:

$\mathcal{L}_{loc} = -\frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\left[ M_s^{ij}\log \hat{M}_s^{ij} + (1 - M_s^{ij})\log\left(1 - \hat{M}_s^{ij}\right) \right]$

where m and n are the width and height of the mask, $M_s^{ij}$ is the element in row i and column j of the foreground mask $M_s$, and $\hat{M}_s^{ij}$ is the element in row i and column j of the predicted mask $\hat{M}_s$.
6. The weak supervision target positioning method based on category correction as set forth in claim 1, wherein: the probability vectors $\hat{p}_f$ and $\hat{p}_b$ predicted in step 3.3 are calculated as:

$\hat{p}_f = f(I_f; \theta_B, \theta_C), \quad \hat{p}_b = f(I_b; \theta_B, \theta_C)$

where $\theta_B$ and $\theta_C$ denote the parameters of the backbone network and the classifier, respectively; the loss functions for the foreground and background attention images are specifically as follows:

$\mathcal{L}_f = -\sum_{k=1}^{K} y_k \log \hat{p}_f^k, \quad \mathcal{L}_b = \sum_{k=1}^{K} \hat{p}_b^k \log \hat{p}_b^k$

where $\mathcal{L}_f$ is the cross-entropy function of the foreground attention image with the class label $y$, $\mathcal{L}_b$ is the negative of the entropy of the background attention image, and K is the number of categories of the whole dataset; the overall loss function of the picture-level fine positioning stage can be expressed as:

$\mathcal{L}_{img} = \alpha \mathcal{L}_f + \beta \mathcal{L}_b$

where α and β are balance parameters.
7. The weak supervision target positioning method based on category correction as set forth in claim 1, wherein the specific implementation of step 4 is as follows:
Step 4.1, for the real picture $I_r$, obtaining the feature map $F_r$ and the mask $\hat{M}_r$ with the positioning network trained in step 3, the feature map being calculated as:

$F_r = f(I_r; \theta_B)$

where $\theta_B$ are the parameters of the backbone network;
Step 4.2, taking the Hadamard product of the feature map $F_r$ with the mask $\hat{M}_r$ and with its 0-1 inversion $1-\hat{M}_r$, respectively, to obtain the foreground feature map $F_f$ and the background feature map $F_b$, with the formula:

$F_f = F_r \odot \hat{M}_r, \quad F_b = F_r \odot (1 - \hat{M}_r)$

Step 4.3, fixing the classification network weights trained in step 3 and using the classification network as a judge of mask quality; feeding the foreground feature map $F_f$ and the background feature map $F_b$ separately into the classification network for prediction to obtain the predicted probability vectors $\hat{q}_f$ and $\hat{q}_b$:

$\hat{q}_f = f(F_f; \theta_C), \quad \hat{q}_b = f(F_b; \theta_C)$

where $\theta_C$ denotes the parameters of the classifier; the specific loss function is as follows:

$\mathcal{L}_f^{feat} = -\sum_{k=1}^{K} y_k \log \hat{q}_f^k, \quad \mathcal{L}_b^{feat} = \sum_{k=1}^{K} \hat{q}_b^k \log \hat{q}_b^k$

where $\mathcal{L}_f^{feat}$ is the cross-entropy of the foreground features, $\mathcal{L}_b^{feat}$ is the negative of the entropy of the background features, and K is the number of classes of the training samples; the loss function of the feature-level fine positioning stage can be expressed as:

$\mathcal{L}_{feat} = \alpha \mathcal{L}_f^{feat} + \beta \mathcal{L}_b^{feat}$

where α and β are balance parameters.
8. The weak supervision target positioning method based on category correction as set forth in claim 1, wherein: the value of the threshold value theta is 0.55 plus or minus 0.05.
9. The weak supervision target positioning method based on category correction as set forth in claim 6 or 7, wherein: the values of α and β are both 1.
CN202310336796.7A 2023-03-28 2023-03-28 Weak supervision target positioning method based on category correction Pending CN116342857A (en)

Priority Applications (1)

Application Number: CN202310336796.7A; Priority date: 2023-03-28; Filing date: 2023-03-28; Title: Weak supervision target positioning method based on category correction

Publications (1)

Publication Number: CN116342857A; Publication Date: 2023-06-27

Family

ID=86892823

Country Status (1)

CN (1) CN116342857A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912184A (en) * 2023-06-30 2023-10-20 哈尔滨工业大学 Weak supervision depth restoration image tampering positioning method and system based on tampering area separation and area constraint loss
CN116912184B (en) * 2023-06-30 2024-02-23 哈尔滨工业大学 Weak supervision depth restoration image tampering positioning method and system based on tampering area separation and area constraint loss


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination