CN116342857A - Weak supervision target positioning method based on category correction - Google Patents

Weak supervision target positioning method based on category correction

Info

Publication number: CN116342857A
Application number: CN202310336796.7A
Authority: CN (China)
Prior art keywords: network, positioning, foreground, mask, category
Legal status: Pending (assumed; Google has not performed a legal analysis)
Other languages: Chinese (zh)
Inventors: 瞿响宇, 杜博, 王增茂, 罗伊文, 陈尚法, 何向阳
Current and original assignees: Wuhan University (WHU); Changjiang Institute of Survey Planning Design and Research Co., Ltd.
Application filed by the assignees; filing and priority date: 2023-03-28
Publication of CN116342857A: 2023-06-27

Classifications

    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; locating or processing of specific regions to guide the detection or recognition
    • G06N3/02, G06N3/08 Computing arrangements based on biological models; neural networks; learning methods
    • G06V10/40 Extraction of image or video features
    • G06V10/74, G06V10/761 Image or video pattern matching; proximity, similarity or dissimilarity measures in feature spaces
    • G06V10/764 Recognition using classification, e.g. of video objects
    • G06V10/77, G06V10/774 Processing image or video features in feature spaces; generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V10/82 Recognition using neural networks
    • Y02T10/10, Y02T10/40 Climate change mitigation technologies related to transportation; internal combustion engine based vehicles; engine management systems


Abstract

The invention belongs to the field of computer vision, and particularly relates to a weak supervision target positioning method based on category correction. To overcome the inaccurate localization of CAM technology, the method no longer uses class activation maps for positioning but adopts a coarse-to-fine pipeline. The model of the invention consists of a backbone network, a positioning network and a classification network: the positioning network first generates a class-agnostic segmentation map with an unsupervised segmentation technique to determine the rough position of the target object, and the classification network then performs fine-grained correction using the class labels. The category-correction-based method can localize objects accurately and recover contour details well.

Description

Weak supervision target positioning method based on category correction
Technical Field
The invention belongs to the field of computer vision, and relates to a weak supervision target positioning method based on category correction.
Background
Target positioning is a fundamental perception task in computer vision that aims to locate the position of a target object in an image and determine the category to which it belongs. In practical applications, however, giving an algorithm model good generalization performance usually requires large-scale, labor-intensive annotation of bounding boxes or even pixel-level labels. Because of this annotation cost, the weakly supervised target positioning task lets models locate objects relying only on class labels, which are easy to obtain and annotate. For this problem, mainstream research at home and abroad is based on CAM technology, which determines the object position from the class-related highlighted regions of the activation map. However, such methods generally locate only the most class-discriminative parts of the object, so the predicted localization box is often smaller than the target object. How to obtain an accurate localization box is therefore a problem to be solved in the field of weak supervision target positioning.
Disclosure of Invention
The invention provides a weak supervision target positioning method based on category correction. To overcome the inaccurate localization of CAM technology, positioning is no longer performed with class activation maps; instead, a coarse-to-fine pipeline is used. The algorithm of the invention consists of a positioning network and a classification network. First, the positioning network generates a class-agnostic segmentation map with an unsupervised segmentation technique to determine the rough location of the target object; fine-grained correction is then performed by the classification network via the class labels. The category-correction-based method can localize objects accurately and recover contour details well.
In the technical scheme provided by the invention, the coarse-to-fine target positioning method comprises a training stage and a testing stage, wherein the training stage comprises the following steps:
Step 1, constructing a target positioning model, wherein the target positioning model comprises a backbone network, a classification network and a positioning network; the backbone network performs feature extraction on the input image, and the classification network and the positioning network are dual heads that perform classification and mask prediction, respectively, on the features extracted by the backbone network;
Step 2, for the input image I, generating a synthetic image $I_s$ whose distribution is similar to that of the training samples together with its foreground mask $M_s$, then inputting the synthetic image $I_s$ into the target positioning model to obtain the mask $\hat{M}_s$ predicted by the positioning network;
Step 3, picture-level fine positioning stage: the difference between foreground and background is increased at the image level so that the positioning network localizes more accurately; this comprises the following substeps:
Step 3.1, obtaining the foreground mask prediction $\hat{M}_r$ of a real picture $I_r$ with the positioning network that acquired rough positioning capability in step 2;
Step 3.2, taking the Hadamard product of the predicted foreground mask $\hat{M}_r$ and the real picture $I_r$ to obtain the class-agnostic foreground attention image $I_f$; at the same time, applying a 0-1 inversion to the foreground mask to obtain $1-\hat{M}_r$, and taking the Hadamard product of the real picture $I_r$ and $1-\hat{M}_r$ to obtain the class-agnostic background attention image $I_b$;
Step 3.3, feeding the foreground attention image $I_f$ and the background attention image $I_b$ separately into the classification network for prediction to obtain the predicted probability vectors $\hat{p}_f$ and $\hat{p}_b$;
Step 4, feature-level fine positioning stage: after the foreground-background difference has been amplified at the image level, the difference between foreground and background is increased at the feature level with the same method as in step 3, so that the positioning network further corrects wrongly localized details and outputs the final positioning result;
the test phase is as follows:
disconnecting the positioning network and the classification network, and obtaining a final positioning frame by threshold screening of the foreground mask of the positioning network:
$Box = \mathrm{select}(\hat{M}_t > \theta)$

where $\hat{M}_t$ denotes the mask of the test sample predicted by the positioning network, $\theta$ is the screening threshold, and the select function selects the part of $\hat{M}_t$ greater than the threshold and returns the minimum bounding box containing all foreground coordinates as the finally determined bounding box Box.
Furthermore, in step 1, the backbone network adopts a U-Net network structure, and the positioning network adopts a convolutional (CNN) network structure.
Further, in step 2, a BigBiGAN method is used to generate the synthetic image and mask.
Further, the specific formula for obtaining the mask $\hat{M}_s$ in step 2 is as follows:

$\hat{M}_s = f(I_s; \theta_B, \theta_L)$

where $\theta_B$ and $\theta_L$ denote the parameters of the backbone network and the positioning network, respectively.
Further, the positioning network is optimized with a binary cross-entropy function, and the loss function is as follows:

$\mathcal{L}_{loc} = -\frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\left[ M_s^{ij}\log \hat{M}_s^{ij} + (1 - M_s^{ij})\log\left(1 - \hat{M}_s^{ij}\right) \right]$

where m and n are the width and height of the mask, $M_s^{ij}$ is the element in row i and column j of the foreground mask $M_s$, and $\hat{M}_s^{ij}$ is the element in row i and column j of the predicted mask $\hat{M}_s$.
Further, the probability vectors $\hat{p}_f$ and $\hat{p}_b$ predicted in step 3.3 are calculated as:

$\hat{p}_f = f(I_f; \theta_B, \theta_C), \quad \hat{p}_b = f(I_b; \theta_B, \theta_C)$

where $\theta_B$ and $\theta_C$ denote the parameters of the backbone network and the classifier, respectively; the loss functions for the foreground and background attention images are specifically as follows:

$\mathcal{L}_f = -\sum_{k=1}^{K} y_k \log \hat{p}_f^k, \quad \mathcal{L}_b = \sum_{k=1}^{K} \hat{p}_b^k \log \hat{p}_b^k$

where $\mathcal{L}_f$ is the cross-entropy function of the foreground attention image with the class label $y$, $\mathcal{L}_b$ is the negative of the entropy of the background attention image, and K is the number of categories of the whole dataset; the overall loss function of the picture-level fine positioning stage can be expressed as:

$\mathcal{L}_{img} = \alpha \mathcal{L}_f + \beta \mathcal{L}_b$

where α and β are balance parameters.
Further, the specific implementation of step 4 is as follows:
Step 4.1, for the real picture $I_r$, obtaining the feature map $F_r$ and the mask $\hat{M}_r$ with the positioning network trained in step 3, the feature map being calculated as:

$F_r = f(I_r; \theta_B)$

where $\theta_B$ are the parameters of the backbone network;
Step 4.2, taking the Hadamard product of the feature map $F_r$ with the mask $\hat{M}_r$ and with its 0-1 inversion $1-\hat{M}_r$, respectively, to obtain the foreground feature map $F_f$ and the background feature map $F_b$, with the formula:

$F_f = F_r \odot \hat{M}_r, \quad F_b = F_r \odot (1 - \hat{M}_r)$

Step 4.3, fixing the classification network weights trained in step 3 and using the classification network as a judge of mask quality; feeding the foreground feature map $F_f$ and the background feature map $F_b$ separately into the classification network for prediction to obtain the predicted probability vectors $\hat{q}_f$ and $\hat{q}_b$:

$\hat{q}_f = f(F_f; \theta_C), \quad \hat{q}_b = f(F_b; \theta_C)$

where $\theta_C$ denotes the parameters of the classifier; the specific loss function is as follows:

$\mathcal{L}_f^{feat} = -\sum_{k=1}^{K} y_k \log \hat{q}_f^k, \quad \mathcal{L}_b^{feat} = \sum_{k=1}^{K} \hat{q}_b^k \log \hat{q}_b^k$

where $\mathcal{L}_f^{feat}$ is the cross-entropy of the foreground features, $\mathcal{L}_b^{feat}$ is the negative of the entropy of the background features, and K is the number of classes of the training samples; the loss function of the feature-level fine positioning stage can be expressed as:

$\mathcal{L}_{feat} = \alpha \mathcal{L}_f^{feat} + \beta \mathcal{L}_b^{feat}$

where α and β are balance parameters.
Further, the value of the threshold value θ is 0.55±0.05.
Further, the values of α and β are both 1.
Compared with the prior art, the invention has the beneficial effects that:
the invention avoids the defect of small positioning caused by CAM technology, the CAM obtains the category attention image through the category information training in the whole course, but ignores the object area with low category identification degree, thus only performing rough positioning and having very bad positioning effect on the fine granularity data set. The invention adopts the flow combining the category irrelevant information and the category relevant information, trains the network by using the category irrelevant segmentation map, and carries out detail correction by the category information, thereby achieving the effect of fine positioning. The invention can completely locate the outline of the object, and the feature map can clearly outline the outline information of the target object. In the fine positioning stage, the category information plays an auxiliary correction role, so that the network can not ignore a foreground region with low category identification degree, and the defect of CAM technology is overcome.
Drawings
Fig. 1 is a training flow chart in an embodiment of the present invention.
FIG. 2 is a flow chart of a test in an embodiment of the invention.
Detailed Description
The invention is further described below with reference to the drawings and specific examples.
The invention provides a weak supervision target positioning algorithm based on category correction. The algorithm utilizes both class-agnostic and class-related information while avoiding the technical disadvantages of CAM [1]. The invention provides a dual-head locator-classifier network structure for learning class-agnostic and class-related information: the locator consists of a segmentation network that predicts the foreground mask of the input image, and the classifier then corrects the locator's predictions from the image level and the feature level, respectively.
[1] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Learning deep features for discriminative localization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.
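For illustration only, this locator-classifier dual-head structure could be organized as in the following Python/PyTorch sketch. The backbone, positioning head and classification head shown here are deliberately minimal stand-ins (the embodiment uses a U-Net style backbone and a convolutional positioning head, which are not reproduced in full), and all class and layer names are assumptions of this sketch rather than the patent's own code.

```python
import torch
import torch.nn as nn

class CategoryCorrectionLocalizer(nn.Module):
    """Backbone plus dual heads: a positioning (mask) head and a classification head."""

    def __init__(self, num_classes: int, feat_dim: int = 64):
        super().__init__()
        # stand-in encoder; the patent describes a U-Net style backbone (theta_B)
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
        )
        # positioning head (theta_L): predicts a 1-channel foreground mask in [0, 1]
        self.localizer = nn.Sequential(nn.Conv2d(feat_dim, 1, 1), nn.Sigmoid())
        # classification head (theta_C): predicts class logits from pooled features
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(feat_dim, num_classes),
        )

    def forward(self, x: torch.Tensor):
        feats = self.backbone(x)          # F_r: shared features
        mask = self.localizer(feats)      # predicted foreground mask
        logits = self.classifier(feats)   # class scores
        return feats, mask, logits
```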
The present invention optimizes the algorithm model with a coarse-to-fine training process, as shown in Fig. 1. In the coarse localization stage, for class-agnostic information, we generate a synthetic image $I_s$ and its foreground mask $M_s$ with an unsupervised method; related unsupervised methods include [2], [3]. The locator predicts the mask of the synthetic image $I_s$ and is supervised by the unsupervised-generated mask $M_s$, so that it acquires a segmentation capability that is independent of category.
[2] A. Voynov, S. Morozov, and A. Babenko, "Object segmentation without labels with large-scale generative models," in International Conference on Machine Learning. PMLR, 2021, pp. 10596–10606.
[3] M. Chen, T. Artieres, and L. Denoyer, "Unsupervised object segmentation by redrawing," Advances in Neural Information Processing Systems, vol. 32, 2019.
In the fine positioning stage, the locator, which now has coarse positioning capability, first predicts the foreground mask $\hat{M}_r$ of the real picture $I_r$. To further increase the difference between foreground and background, we obtain foreground and background attention images via the Hadamard product of the foreground mask (and its inversion) with the original image. The classifier is optimized with different classification tasks on the two attention images: the foreground image is supervised with the class labels, while the background image should not be classified into any category, so it is supervised by suppressing the classes to which the foreground image strongly belongs. In addition, the invention further increases the difference between foreground and background at the feature level: the Hadamard product of the locator's feature map and the predicted foreground mask yields separated foreground and background feature maps. After image-level fine positioning, the classifier has learned a certain foreground classification capability; its weights are then fixed so that it does not participate in gradient back-propagation, and it is used to judge the separation quality of the foreground and background feature maps, thereby increasing the foreground-background discrimination at the feature level.
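As a concrete illustration of the Hadamard-product separation described above, a minimal sketch (assuming PyTorch tensors, with an image or feature batch of shape (N, C, H, W) and a predicted mask of shape (N, 1, H, W); the helper name is hypothetical) could look like this:

```python
import torch

def split_foreground_background(x: torch.Tensor, mask: torch.Tensor):
    """Split a batch of images or feature maps into foreground/background parts.

    x:    (N, C, H, W) images I_r or feature maps F_r
    mask: (N, 1, H, W) predicted foreground mask with values in [0, 1]
    Returns the Hadamard products x * mask and x * (1 - mask),
    i.e. the class-agnostic foreground and background attention inputs.
    """
    fg = x * mask          # foreground attention image / feature map
    bg = x * (1.0 - mask)  # 0-1 inverted mask gives the background part
    return fg, bg
```

The same operation applies unchanged at the image level (yielding $I_f$, $I_b$) and at the feature level (yielding $F_f$, $F_b$), which is exactly the symmetry the two fine positioning stages rely on.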
After image-level and feature-level fine positioning training, the locator can better judge foreground and background regions from a semantic perspective through category correction, while avoiding the defect of CAM technology that foreground regions without class discriminability cannot be found. This is because the locator is trained on class-agnostic features and therefore already recognizes contour and texture information well; once category correction is added, the algorithm can judge both semantics-related and semantics-agnostic foreground information.
As shown in Fig. 2, in the test stage the invention uses the trained locator alone to complete the target localization task. The locator first predicts the mask $\hat{M}_t$ of the real picture $I_r$, and the foreground mask $\hat{M}_t$ is then binarized to obtain a binary foreground mask in which foreground pixels take the value 1 and background pixels the value 0. The foreground pixels may form several disconnected clusters; the largest connected foreground cluster is selected as the predicted foreground and the rest is treated as background. For the screened foreground, the tightest bounding box containing the foreground (i.e., the smallest bounding box containing all foreground coordinates) is taken as the localization box of the target object.
The flow provided by the embodiment specifically comprises the following steps:
Step 1, constructing a target positioning model, wherein the target positioning model comprises a backbone network, a classification network and a positioning network; the backbone network performs feature extraction on the input image, and the classification network and the positioning network are dual heads that perform classification and mask prediction, respectively, on the features extracted by the backbone network.
Step 2: a synthetic image $I_s$ whose distribution is similar to that of the training samples and its foreground mask $M_s$ are generated with an unsupervised algorithm such as a GAN; in the example, the BigBiGAN-based method of [2] is selected to generate the synthetic image and mask. The synthetic image $I_s$ is then input into the target positioning model to obtain the foreground mask $\hat{M}_s$ predicted by the positioning network, with the formula:

$\hat{M}_s = f(I_s; \theta_B, \theta_L) \quad (1)$

where $\theta_B$ and $\theta_L$ denote the parameters of the backbone network and the positioning network, respectively; in the example $\theta_B$ adopts a U-Net structure, $\theta_L$ adopts a convolutional (CNN) structure, and $f$ denotes the mask-prediction process of the network. The closer a pixel value of the mask $\hat{M}_s$ is to 1, the more likely the positioning network judges that pixel to be foreground; conversely, the closer the pixel value is to 0, the more likely it is judged to be background. The algorithm then optimizes the positioning network with a binary cross-entropy loss:

$\mathcal{L}_{loc} = -\frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\left[ M_s^{ij}\log \hat{M}_s^{ij} + (1 - M_s^{ij})\log\left(1 - \hat{M}_s^{ij}\right) \right] \quad (2)$

where m and n are the width and height of the mask, $M_s^{ij}$ is the element in row i and column j of the foreground mask $M_s$, and $\hat{M}_s^{ij}$ is the element in row i and column j of the predicted mask $\hat{M}_s$. Through step 2, the positioning network acquires a class-agnostic rough positioning capability; the following steps then use the class information to perform class-related correction on the positioning network.
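A minimal sketch of one coarse-localization training step as just described (mask prediction followed by binary cross-entropy supervision against the unsupervised mask) might look as follows; it assumes the hypothetical model class sketched earlier, a standard PyTorch optimizer, and synthetic pairs (I_s, M_s) produced elsewhere, none of which are specified by the patent at code level.

```python
import torch
import torch.nn.functional as F

def coarse_localization_step(model, optimizer, synth_img, synth_mask):
    """One optimization step of the coarse stage.

    synth_img:  (N, 3, H, W) synthetic images I_s (e.g. from an unsupervised generator)
    synth_mask: (N, 1, H, W) unsupervised foreground masks M_s with values in {0, 1}
    """
    _, pred_mask, _ = model(synth_img)                 # predicted mask from theta_B, theta_L
    # resize the target mask to the prediction's resolution if they differ
    target = F.interpolate(synth_mask, size=pred_mask.shape[-2:], mode="nearest")
    loss = F.binary_cross_entropy(pred_mask, target)   # pixel-wise BCE supervision
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```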
Step 3: picture-level fine positioning stage: the disparity between foreground and background is increased at the image level to correct the coarse localization. Step 3 can be divided into the following sub-steps:
Step 3.1: the predicted mask $\hat{M}_r$ of a real picture $I_r$ is obtained through the positioning network with coarse positioning capability.
Step 3.2: the quality of the mask $\hat{M}_r$ is evaluated using the category information and corrected during training. The Hadamard product of the sample $I_r$ with the mask $\hat{M}_r$ and with its 0-1 inversion $1-\hat{M}_r$ is computed, respectively, to obtain the foreground attention image and the background attention image, with the formula:

$I_f = I_r \odot \hat{M}_r, \quad I_b = I_r \odot (1 - \hat{M}_r) \quad (3)$

Step 3.3: the foreground attention image $I_f$ and the background attention image $I_b$ are fed separately into the classification network for prediction to obtain the predicted probability vectors $\hat{p}_f$ and $\hat{p}_b$, expressed in the example as:

$\hat{p}_f = f(I_f; \theta_B, \theta_C), \quad \hat{p}_b = f(I_b; \theta_B, \theta_C) \quad (4)$

where $\theta_B$ and $\theta_C$ denote the parameters of the backbone network and the classification network, respectively. $I_f$ is supervised with the class label, and its loss function is the cross-entropy. $I_b$ does not belong to any class, so the model's class-probability prediction for $I_b$ should tend toward the average, with no class receiving a particularly high or particularly low probability; in the example this is expressed as making the entropy of the prediction probability for $I_b$ as large as possible. The loss functions for the foreground and background attention images are specifically:

$\mathcal{L}_f = -\sum_{k=1}^{K} y_k \log \hat{p}_f^k, \quad \mathcal{L}_b = \sum_{k=1}^{K} \hat{p}_b^k \log \hat{p}_b^k \quad (5)$

where $\mathcal{L}_f$ is the cross-entropy of the foreground attention image with respect to the class label $y$, $\mathcal{L}_b$ is the negative of the entropy of the background attention image, and K is the number of classes of the whole training set. The overall loss function of the picture-level fine positioning stage can be expressed as:

$\mathcal{L}_{img} = \alpha \mathcal{L}_f + \beta \mathcal{L}_b \quad (6)$

where α and β are balance parameters; in practice, extensive experiments show that setting both to 1 gives good results. In this step, on the one hand, class-related correction is applied to the locator at the image level; on the other hand, the classifier is trained to have classification capability, preparing for the class correction at the feature level in the next step.
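The image-level correction loss just described (cross-entropy on the foreground attention image plus entropy maximization on the background attention image) could be sketched roughly as below, again assuming the hypothetical model above, a classifier head that returns raw logits, and α = β = 1 as in the example.

```python
import torch
import torch.nn.functional as F

def image_level_losses(model, real_img, labels, alpha=1.0, beta=1.0):
    """Picture-level fine positioning loss: alpha * L_f + beta * L_b."""
    _, pred_mask, _ = model(real_img)
    mask = F.interpolate(pred_mask, size=real_img.shape[-2:], mode="bilinear",
                         align_corners=False)
    fg_img = real_img * mask            # I_f: class-agnostic foreground attention image
    bg_img = real_img * (1.0 - mask)    # I_b: background attention image
    _, _, fg_logits = model(fg_img)     # probabilities for the foreground image
    _, _, bg_logits = model(bg_img)     # probabilities for the background image
    loss_f = F.cross_entropy(fg_logits, labels)   # supervise foreground with class labels
    bg_prob = F.softmax(bg_logits, dim=1)
    entropy = -(bg_prob * torch.log(bg_prob + 1e-8)).sum(dim=1).mean()
    loss_b = -entropy                   # minimizing -entropy == maximizing background entropy
    return alpha * loss_f + beta * loss_b
```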
Step 4: after the foreground and background differences at the image level are increased, this step needs to further ensure that the foreground and background still have differences at the feature level, which is more beneficial to the foreground positioning by the positioning network. Step 4 may be subdivided into the following sub-steps:
step 4.1: for real picture I r Obtaining a feature map by using the positioning network trained in the step 3
Figure BDA00041567833100000614
And mask->
Figure BDA00041567833100000615
Wherein the calculation formula of the mask is the same as (1), and the calculation formula of the feature map is as follows:
Figure BDA00041567833100000616
wherein θ is B Is a parameter of the backbone network.
Step 4.2: map the characteristic map
Figure BDA0004156783310000071
And mask->
Figure BDA0004156783310000072
0-1 conversion of mask +.>
Figure BDA0004156783310000073
Respectively carrying out Hadamard products to obtain a foreground characteristic diagram +.>
Figure BDA0004156783310000074
And background feature map->
Figure BDA0004156783310000075
The formula is as follows:
Figure BDA0004156783310000076
step 4.3: fixing the classified network weight trained by the step 3, and taking the classified network weight as a judging device of mask quality. Respectively comparing the foreground feature images
Figure BDA0004156783310000077
And background feature map->
Figure BDA0004156783310000078
Feeding into a classification network for prediction to obtain predicted probability characteristics +.>
Figure BDA0004156783310000079
And +.>
Figure BDA00041567833100000710
The formula in the example is expressed as:
Figure BDA00041567833100000711
wherein θ is C Representing parameters of the classification network. The function for the foreground probability features and the background probability features are identical to those in step 3.3. The algorithm optimizes the foreground probability features using a minimized cross entropy function and the background probability features using a maximized entropy function, the specific loss function of which is as follows in the example:
Figure BDA00041567833100000712
wherein the method comprises the steps of
Figure BDA00041567833100000713
Is the cross entropy of the foreground features, +.>
Figure BDA00041567833100000714
Is the negative of the entropy of the background feature, and K is the number of classes of the training sample as a whole. The overall loss function at the feature level fine positioning stage can be expressed as:
Figure BDA00041567833100000715
where α and β are balance parameters, and in practice, a lot of experiments prove that, consistent with step 3, it is found that setting both to 1 can make the algorithm achieve good effect.
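Analogously, the feature-level correction of step 4, with the classifier weights frozen so that it only judges the quality of the separated feature maps, might be sketched as follows (same assumptions as the previous sketch; all names are illustrative, and freezing the classifier still lets gradients flow back into the backbone and positioning features).

```python
import torch
import torch.nn.functional as F

def feature_level_losses(model, real_img, labels, alpha=1.0, beta=1.0):
    """Feature-level fine positioning loss with the classifier head frozen."""
    feats, pred_mask, _ = model(real_img)               # F_r and predicted mask
    mask = F.interpolate(pred_mask, size=feats.shape[-2:], mode="bilinear",
                         align_corners=False)
    fg_feat = feats * mask                               # F_f: foreground feature map
    bg_feat = feats * (1.0 - mask)                       # F_b: background feature map
    # classifier weights are frozen: it only judges separation quality,
    # while gradients still reach the backbone/localizer through the features
    for p in model.classifier.parameters():
        p.requires_grad_(False)
    fg_logits = model.classifier(fg_feat)
    bg_logits = model.classifier(bg_feat)
    loss_f = F.cross_entropy(fg_logits, labels)          # minimize CE on foreground features
    bg_prob = F.softmax(bg_logits, dim=1)
    entropy = -(bg_prob * torch.log(bg_prob + 1e-8)).sum(dim=1).mean()
    loss_b = -entropy                                     # maximize background entropy
    return alpha * loss_f + beta * loss_b
```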
Although step 4 and step 3 are similar, step 4 is necessary. Since the classification network also participates in training in step 3, the loss function there partly corrects the features of the positioning model, but to a larger extent it adapts the classification network. In step 4, however, by fixing the classification network and performing the adjustment at the feature level, the classification information can be transferred more fully to the positioning network, so that fine-grained correction of the network's positioning result is carried out. At the same time, step 3 is also indispensable: without step 3, a classification network with generalization capability cannot be obtained.
The specific implementation also has the following notes:
in the test stage, for the selection of the threshold value theta, a great number of experiments prove that a good result can be obtained by taking the threshold value of 0.55 in the CUB data set. It should be noted that compared with the similar method, the sensitivity of the method to the threshold value is not high, and good effect can be obtained within the range of +/-0.15, and the threshold tolerance interval of the similar method is often less than +/-0.05.
It should be emphasized that the described embodiments of the present invention are illustrative rather than limiting. The invention therefore includes not only the examples described in the detailed description but also other embodiments that are obvious to a person skilled in the art from the solution of the invention, which fall within the scope of protection of the invention.

Claims (9)

1. A weak supervision target positioning method based on category correction, characterized in that it comprises a training phase and a testing phase, wherein the training phase comprises the following steps:
Step 1, constructing a target positioning model, wherein the target positioning model comprises a backbone network, a classification network and a positioning network; the backbone network performs feature extraction on the input image, and the classification network and the positioning network are dual heads that perform classification and mask prediction, respectively, on the features extracted by the backbone network;
Step 2, for the input image I, generating a synthetic image $I_s$ whose distribution is similar to that of the training samples together with its foreground mask $M_s$, then inputting the synthetic image $I_s$ into the target positioning model to obtain the mask $\hat{M}_s$ predicted by the positioning network;
Step 3, picture-level fine positioning stage: the difference between foreground and background is increased at the image level so that the positioning network localizes more accurately; this comprises the following substeps:
Step 3.1, obtaining the foreground mask prediction $\hat{M}_r$ of a real picture $I_r$ with the positioning network that acquired rough positioning capability in step 2;
Step 3.2, taking the Hadamard product of the predicted foreground mask $\hat{M}_r$ and the real picture $I_r$ to obtain the class-agnostic foreground attention image $I_f$; at the same time, applying a 0-1 inversion to the foreground mask to obtain $1-\hat{M}_r$, and taking the Hadamard product of the real picture $I_r$ and $1-\hat{M}_r$ to obtain the class-agnostic background attention image $I_b$;
Step 3.3, feeding the foreground attention image $I_f$ and the background attention image $I_b$ separately into the classification network for prediction to obtain the predicted probability vectors $\hat{p}_f$ and $\hat{p}_b$;
Step 4, feature-level fine positioning stage: after the foreground-background difference has been amplified at the image level, the difference between foreground and background is increased at the feature level with the same method as in step 3, so that the positioning network further corrects wrongly localized details and outputs the final positioning result;
the test phase is as follows:
disconnecting the positioning network and the classification network, and obtaining the final positioning frame by threshold screening of the foreground mask of the positioning network:

$Box = \mathrm{select}(\hat{M}_t > \theta)$

where $\hat{M}_t$ denotes the mask of the test sample predicted by the positioning network, $\theta$ is the screening threshold, and the select function selects the part of $\hat{M}_t$ greater than the threshold and returns the minimum bounding box containing all foreground coordinates as the finally determined bounding box Box.
2. The weak supervision target positioning method based on category correction as set forth in claim 1, wherein: in step 1, the backbone network adopts a U-Net network structure, and the positioning network adopts a convolutional (CNN) network structure.
3. The weak supervision target positioning method based on category correction as set forth in claim 1, wherein: in step 2, a BigBiGAN method is adopted to generate a composite image and a mask.
4. The weak supervision target positioning method based on category correction as set forth in claim 1, wherein: the mask $\hat{M}_s$ in step 2 is obtained with the following specific formula:

$\hat{M}_s = f(I_s; \theta_B, \theta_L)$

where $\theta_B$ and $\theta_L$ denote the parameters of the backbone network and the positioning network, respectively.
5. The weak supervision target positioning method based on category correction as defined in claim 4, wherein: the positioning network is optimized with a binary cross-entropy function, and the loss function is as follows:

$\mathcal{L}_{loc} = -\frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n}\left[ M_s^{ij}\log \hat{M}_s^{ij} + (1 - M_s^{ij})\log\left(1 - \hat{M}_s^{ij}\right) \right]$

where m and n are the width and height of the mask, $M_s^{ij}$ is the element in row i and column j of the foreground mask $M_s$, and $\hat{M}_s^{ij}$ is the element in row i and column j of the predicted mask $\hat{M}_s$.
6. The weak supervision target positioning method based on category correction as set forth in claim 1, wherein: the probability vectors $\hat{p}_f$ and $\hat{p}_b$ predicted in step 3.3 are calculated as:

$\hat{p}_f = f(I_f; \theta_B, \theta_C), \quad \hat{p}_b = f(I_b; \theta_B, \theta_C)$

where $\theta_B$ and $\theta_C$ denote the parameters of the backbone network and the classifier, respectively; the loss functions for the foreground and background attention images are specifically as follows:

$\mathcal{L}_f = -\sum_{k=1}^{K} y_k \log \hat{p}_f^k, \quad \mathcal{L}_b = \sum_{k=1}^{K} \hat{p}_b^k \log \hat{p}_b^k$

where $\mathcal{L}_f$ is the cross-entropy function of the foreground attention image with the class label $y$, $\mathcal{L}_b$ is the negative of the entropy of the background attention image, and K is the number of categories of the whole dataset; the overall loss function of the picture-level fine positioning stage can be expressed as:

$\mathcal{L}_{img} = \alpha \mathcal{L}_f + \beta \mathcal{L}_b$

where α and β are balance parameters.
7. The weak supervision target positioning method based on category correction as set forth in claim 1, wherein the specific implementation of step 4 is as follows:
Step 4.1, for the real picture $I_r$, obtaining the feature map $F_r$ and the mask $\hat{M}_r$ with the positioning network trained in step 3, the feature map being calculated as:

$F_r = f(I_r; \theta_B)$

where $\theta_B$ are the parameters of the backbone network;
Step 4.2, taking the Hadamard product of the feature map $F_r$ with the mask $\hat{M}_r$ and with its 0-1 inversion $1-\hat{M}_r$, respectively, to obtain the foreground feature map $F_f$ and the background feature map $F_b$, with the formula:

$F_f = F_r \odot \hat{M}_r, \quad F_b = F_r \odot (1 - \hat{M}_r)$

Step 4.3, fixing the classification network weights trained in step 3 and using the classification network as a judge of mask quality; feeding the foreground feature map $F_f$ and the background feature map $F_b$ separately into the classification network for prediction to obtain the predicted probability vectors $\hat{q}_f$ and $\hat{q}_b$:

$\hat{q}_f = f(F_f; \theta_C), \quad \hat{q}_b = f(F_b; \theta_C)$

where $\theta_C$ denotes the parameters of the classifier; the specific loss function is as follows:

$\mathcal{L}_f^{feat} = -\sum_{k=1}^{K} y_k \log \hat{q}_f^k, \quad \mathcal{L}_b^{feat} = \sum_{k=1}^{K} \hat{q}_b^k \log \hat{q}_b^k$

where $\mathcal{L}_f^{feat}$ is the cross-entropy of the foreground features, $\mathcal{L}_b^{feat}$ is the negative of the entropy of the background features, and K is the number of classes of the training samples; the loss function of the feature-level fine positioning stage can be expressed as:

$\mathcal{L}_{feat} = \alpha \mathcal{L}_f^{feat} + \beta \mathcal{L}_b^{feat}$

where α and β are balance parameters.
8. The weak supervision target positioning method based on category correction as set forth in claim 1, wherein: the value of the threshold value theta is 0.55 plus or minus 0.05.
9. The weak supervision target positioning method based on category correction as set forth in claim 6 or 7, wherein: the values of α and β are both 1.
CN202310336796.7A 2023-03-28 2023-03-28 Weak supervision target positioning method based on category correction Pending CN116342857A (en)

Priority Applications (1)

Application Number: CN202310336796.7A; Priority date: 2023-03-28; Filing date: 2023-03-28; Title: Weak supervision target positioning method based on category correction

Publications (1)

Publication Number: CN116342857A; Publication Date: 2023-06-27

Family

ID=86892823

Country Status (1)

CN (1) CN116342857A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912184A (en) * 2023-06-30 2023-10-20 哈尔滨工业大学 Weak supervision depth restoration image tampering positioning method and system based on tampering area separation and area constraint loss
CN116912184B (en) * 2023-06-30 2024-02-23 哈尔滨工业大学 Weak supervision depth restoration image tampering positioning method and system based on tampering area separation and area constraint loss


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination