CN114648665A - Weak supervision target detection method and system - Google Patents

Weak supervision target detection method and system

Info

Publication number
CN114648665A
Authority
CN
China
Prior art keywords
image
box
module
target
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210302852.0A
Other languages
Chinese (zh)
Inventor
马文萍
李腾武
朱浩
武越
李娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202210302852.0A priority Critical patent/CN114648665A/en
Publication of CN114648665A publication Critical patent/CN114648665A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155 Generating training patterns; Bootstrap methods, e.g. bagging or boosting, characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a weakly supervised target detection method and system that train a target detector to detect targets in images using only image-level class annotation. In the prior-box generation stage, a selective search algorithm is combined with the gradient-weighted class activation mapping method to generate better prior boxes. Meanwhile, during the detector's iterative optimization, supervision information from low-level features is added, and the concept of objectness (similarity) is introduced to measure the degree to which the target in a prior box is a complete target. This addresses the pain point that existing weakly supervised target detection methods easily fall into local optima, so that, without any target bounding-box supervision, the network tends to select prior boxes covering the whole target. The network improves weakly supervised detection performance and can be used in fields such as autonomous driving and intelligent security; experimental results show that the method is competitive.

Description

Weak supervision target detection method and system
Technical Field
The invention belongs to the technical field of computer vision and image processing, and particularly relates to a weakly supervised target detection method and system.
Background
The purpose of weakly supervised object detection is to train an object detector with only image-level (class) annotation, unlike fully supervised object detection, which requires instance-level annotation (the center coordinates, height and width of the maximum bounding rectangle of each object in an image). Annotating instance-level information requires a great deal of manpower, material and financial resources, whereas image-level class annotation is significantly cheaper, and large numbers of pictures with class labels can be crawled from web search engines, social media, and the like. Large amounts of training data improve target detection performance, so cheap, easily obtained pictures with only class labels clearly benefit the field of target detection. Learning target detectors with weak supervision has therefore received growing attention from academia, and it is an urgent need in industry.
Current weakly supervised target detection is usually based on a multiple instance learning (MIL) procedure, which tends to get trapped in locally optimal solutions. Concretely, for lack of instance-level constraints, supervision only at the image-class level can cause the detector to focus on just a local region, because classification needs only local information (for example, for a person or cat in a picture, a classifier need only attend to the face), whereas detection must accurately locate the maximum bounding rectangle of the object; this is the gap between classification and detection. The local-focus problem is especially acute for objects with large intra-class differences, typically non-rigid multi-pose objects such as humans and animals, because such objects usually have some invariant appearance, such as the face.
Meanwhile, because there is no instance-level rectangle annotation, current methods all use a large number of object proposal boxes (proposals) to ensure recall, which introduces a large amount of noise into the proposals (partial objects, background, etc.); this not only makes training unstable but also consumes a large amount of GPU computing resources.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a weakly supervised target detection method and system. Before multiple instance learning (MIL), the proposal boxes generated by the conventional Selective Search (SS) method are first screened to obtain high-quality proposal boxes, which are fed into the subsequent multiple instance learning process; then, during multiple instance learning, low-level (color, texture, etc.) supervision information is added on top of the original image-level label supervision, so that the network trains better and its detection accuracy improves.
The invention adopts the following technical scheme:

A weakly supervised target detection method comprises the following steps:

S1, reading image data and image-level labels, the image-level labels being classification labels of only the object classes in the images, and dividing the image data into a training data set and a test data set;

S2, generating candidate boxes on the training data set read in step S1 using a selective search algorithm, and then generating high-quality object proposal boxes by the gradient-weighted class activation mapping method;

S3, inputting the training data set read in step S1 into a VGG16 convolutional neural network for feature extraction, and passing the extracted proposal-box features through an ROI Pooling layer to generate feature matrices of the same shape;

S4, performing multiple instance learning by matching the image-level labels read in step S1 one-to-one with the feature matrices obtained in step S3, and constructing an MIL detector;

S5, adding low-level supervision information to the object proposal boxes obtained in step S2, establishing instance classifiers for iterative optimization, and taking the highest-scoring proposal boxes as pseudo labels to train the bounding-box regression network;

S6, determining the loss functions of the MIL detector obtained in step S4 and of the bounding-box regression network obtained in step S5;

S7, adjusting the hyper-parameters of the bounding-box regression network of step S6 to obtain a weakly supervised detection model;

S8, training the weakly supervised detection model obtained in step S7 with the training data set obtained in step S1 to obtain a trained weakly supervised detection model;

S9, generating candidate boxes on the test data set obtained in step S1 using a selective search algorithm, and then classifying the candidate boxes and performing bounding-box regression with the weakly supervised detection model trained in step S8 to obtain the final target detection boxes.
Specifically, in step S2, a first-stage image-level classifier predicts the n highest-scoring potential object classes in each test image, and a second-stage proposal-box-level classifier then yields the final proposal boxes used for detection; the first-stage loss function is:

$$L_{cls} = -\frac{1}{C}\sum_{i=1}^{C}\left[y_i\log P_i + (1-y_i)\log(1-P_i)\right]$$

where C is the total number of image classes, y_i is the image's label for the i-th class, and P_i is the prediction of the i-th sigmoid classifier.
Specifically, step S4 includes:

S401, obtaining a series of object proposal boxes for a given image and its corresponding label through step S2; the object proposal boxes are then fed into a classification data stream and a detection data stream, and two data matrices are obtained through two fully connected layers; the two data matrices are passed through two softmax operators to generate a classification score and a detection score for each proposal box;

S402, multiplying the classification and detection scores obtained in step S401 element-wise to obtain the final proposal-box scores, and summing the scores of all proposal boxes over the proposal dimension R to obtain the image-level prediction score of each class.
Further, in step S402, the loss function of the MIL detector is:
$$L_{base} = -\sum_{c=1}^{C}\left[y_c\log P_c + (1-y_c)\log(1-P_c)\right]$$

where P_c is the image-level prediction score of the c-th class and y_c is the image's label for that class.
Specifically, step S5 includes:

S501, establishing K instance classifiers on top of the MIL detector; the objectness (similarity) score O_bu(r) of the low-level features of each proposal box r is calculated and combined by weighted addition with the class score from the previous classifier; the proposal boxes with high combined scores are taken as proposal boxes containing a complete object target, the top-n high-scoring proposal boxes serve as pseudo-supervision information for the next iteration of instance-classifier training, and the K instance classifiers are trained iteratively K times in total;

S502, connecting a bounding-box regressor after the K instance classifiers; the regressor outputs a correction value for each box, correcting the four parameters x, y, w and h respectively.
Further, in step S501, the loss function L_ins^k for iteratively training the K instance classifiers is:

$$L_{ins}^k = -\frac{1}{R}\sum_{r=1}^{R} w_r^k\,\mathrm{CE}\!\left(p_r^k,\hat{y}_r^k\right)$$

where R is the number of proposal boxes generated in step S3, w_r^k is the loss weight of each proposal box, CE is the cross-entropy loss function, p_r^k is the (C+1)-class classification probability of the r-th proposal box, and ŷ_r^k is its classification label.
Further, in step S503, the generation of pseudo-supervision information and positive sample boxes specifically includes:

S5031, based on the class probabilities p_r^{k-1} of the (k-1)-th branch, performing NMS on the proposal set R with a predefined threshold T_nms; the set of boxes kept after NMS is denoted R_keep;

S5032, for each class c, if y_c = 1, searching the set R_keep obtained in step S5031 for boxes whose class-c score exceeds T_conf and giving them label c; if no box qualifies, the highest-scoring box is given label c; all found boxes are denoted R_seek;

S5033, for each found proposal, finding its corresponding neighbors in R, denoted R_neighbor;

S5034, merging R_seek obtained in step S5032 with R_neighbor from step S5033 to obtain the positive sample boxes.
Specifically, in step S6, the total loss function L of the overall network is:

$$L = L_{base} + \lambda_1\sum_{k=1}^{K} L_{ins}^k + \lambda_2 L_{box}$$

where L_base is the loss function of the MIL detector, λ1 is the loss weight of the K instance classifiers, K is the number of instance classifiers, L_ins^k is the loss function of the k-th instance classifier, λ2 is the loss weight of the bounding-box regressor, and L_box is the loss function of the bounding-box regressor.
Specifically, in step S7, adjusting the hyper-parameters includes:

The feature extraction stage uses a VGG16 network; the loss weight of the instance classifiers is λ1 = 1, the number of instance classifiers is K = 3, and the loss weight of the bounding-box regression network is λ2 = 0.3; the thresholds are T_nms = 0.3, T_conf = 0.7 and T_iou = 0.5; during network training, the initial learning rate is 0.001, the learning-rate decay is 0.0005, and the total number of iterations is 200000.
Another technical solution of the present invention is a weakly supervised target detection system, comprising:

the reading module, used for reading image data and image-level labels, the image-level labels being classification labels of only the object classes in the images, and for dividing the image data into a training data set and a test data set;

the weighting module, used for generating candidate boxes on the training data set read by the reading module using a selective search algorithm, and then generating high-quality object proposal boxes by the gradient-weighted class activation mapping method;

the matrix module, used for inputting the training data set read by the reading module into a VGG16 convolutional neural network for feature extraction, and passing the extracted proposal-box features through the ROI Pooling layer to generate feature matrices of the same shape;

the learning module, used for performing multiple instance learning by matching the image-level labels read by the reading module one-to-one with the feature matrices obtained by the matrix module, and constructing an MIL detector;

the iteration module, used for adding low-level supervision information to the object proposal boxes obtained by the weighting module, establishing instance classifiers for iterative optimization, and taking the highest-scoring proposal boxes as pseudo labels to train the bounding-box regression network;

the function module, used for determining the loss functions of the MIL detector obtained by the learning module and of the bounding-box regression network obtained by the iteration module;

the adjusting module, used for adjusting the hyper-parameters of the bounding-box regression network in the function module to obtain a weakly supervised detection model;

the training module, used for training the weakly supervised detection model obtained by the adjusting module with the training data set obtained by the reading module to obtain a trained weakly supervised detection model;

and the detection module, used for generating candidate boxes on the test data set obtained by the reading module using a selective search algorithm, and then classifying the candidate boxes and performing bounding-box regression with the weakly supervised detection model trained by the training module to obtain the final target detection boxes.
Compared with the prior art, the invention has at least the following beneficial effects:
the invention relates to a weak supervision target detection method, which comprises the steps of gradually screening proposal frames in each stage, generating the proposal frames capable of framing the whole part of a body as pseudo supervision information, supervising a final boundary frame regression network, improving the target detection precision, generating the proposal frames by the Grad-CAM technology before an MIL detector to obtain a series of high-quality (more completely covering the target) proposal frames, being beneficial to the detection of a subsequent detector and improving the defect that the traditional method is easy to fall into local optimum; the low-level feature supervision is added in the iterative optimization process, the concept of similarity is introduced, the proposal box containing the complete target is screened out, the pseudo label is generated to supervise the subsequent iterative process, the defect that the target is only framed locally in the existing method is overcome, and the detection precision is improved.
Further, in step S2, the candidate boxes generated by the selective search algorithm are filtered by the gradient-weighted class activation mapping method; the filtered proposal boxes lie close to the object targets, reducing noise and interference in the later training of the MIL detector.
Further, in step S4, an MIL classifier is constructed that correctly classifies the proposal boxes in each image using only image-level labels, without any instance-level labels, greatly reducing the manpower and material resources needed to annotate instance-level information.
Further, the loss function of the MIL detector in step S402 is set as a binary cross-entropy loss; the loss is back-propagated during training, guiding the direction of network optimization well and accelerating the training process.
Furthermore, step S5 introduces low-level supervision information, i.e., image objectness (similarity), so the method can screen the proposal box that frames the whole target out of a series of proposal boxes, addressing the problem that other contemporaneous weakly supervised target detection methods easily fall into local optima.
Further, step S501 iteratively trains K instance classifiers, refining the supervision with each iteration, so the finally generated pseudo-supervision information is of high quality. The loss function of the instance classifiers is a cross-entropy loss, which guides the direction of network optimization well and accelerates the training process.
Further, in step S503, NMS is applied to the proposal boxes with a set threshold, and heavily overlapping proposal boxes are merged to generate the pseudo-supervision information and positive samples, which better guides the training of the bounding-box regression network.
Further, in step S6, the loss functions of the different parts of the network (the MIL classifier, the K instance classifiers and the bounding-box regressor) are integrated, and weights are set for each part's loss, so the network converges faster during back-propagation training.
Furthermore, the hyper-parameter adjustment in step S7 sets the learning-rate decay, ensuring fast convergence early in training and stable convergence later in training.
In conclusion, by combining the high-level information extracted by the deep neural network with low-level supervision information, proposal boxes are screened layer by layer, and higher-quality proposal boxes are generated as supervision information for target detection, improving detection accuracy; meanwhile, the network's hyper-parameter settings are reasonable, and the training process is fast and stable.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is an overall design framework of the present invention;
fig. 2 shows a generation process of the pseudo tag.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be understood that the terms "comprises" and/or "comprising" indicate the presence of the stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Various structural schematics according to the disclosed embodiments of the invention are shown in the drawings. The figures are not drawn to scale, wherein certain details are exaggerated and possibly omitted for clarity of presentation. The shapes of the various regions, layers and their relative sizes, positional relationships are shown in the drawings as examples only, and in practice deviations due to manufacturing tolerances or technical limitations are possible, and a person skilled in the art may additionally design regions/layers with different shapes, sizes, relative positions, according to the actual needs.
The invention provides a weakly supervised target detection method that trains a target detector to detect targets in images using only image-level class annotation. In the prior-box generation stage, the Selective Search (SS) algorithm is combined with the gradient-weighted class activation mapping (Grad-CAM) method to generate better prior boxes, which have a higher intersection-over-union with the Ground Truth than those obtained by a greedy search and thus cover the whole object better. Meanwhile, during the detector's iterative optimization, supervision information from low-level features is added, and the concept of objectness (similarity) is introduced to measure the degree to which the target in a prior box is a complete target. This addresses the pain point that current weakly supervised detection methods easily fall into local optima: without any target bounding-box supervision, the network is more inclined to select prior boxes covering the whole target rather than a part of it. The network of the invention improves weakly supervised detection performance and can be used in image processing and detection fields such as face detection, pedestrian counting, vehicle detection, robot navigation and security systems. Experimental results show that the method is competitive.
Referring to FIG. 1, the weakly supervised target detection method of the present invention comprises the following steps:

S1, reading images and labels from the data set, the labels being image-level classification labels of only the object classes in the images;

S2, generating high-quality proposal boxes

Part 1 of FIG. 1 is proposal-box generation, which combines Selective Search (SS) with gradient-weighted class activation mapping (Grad-CAM) to generate better proposal boxes; these overlap the Ground Truth with a higher intersection-over-union (IoU) than a greedy search achieves and cover the whole object better.

S201, generating a large number of candidate boxes using a selective search algorithm, as in the sketch below;

S202, generating high-quality object proposal boxes through Grad-CAM.
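A minimal sketch of S201 using OpenCV's selective-search implementation (this needs the opencv-contrib-python package; the cap on the number of boxes is an assumption, the text states no limit):

    import cv2

    def selective_search_boxes(image_bgr, max_boxes=2000):
        ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
        ss.setBaseImage(image_bgr)
        ss.switchToSelectiveSearchFast()   # the 'Quality' mode is slower
        return ss.process()[:max_boxes]    # candidate boxes as (x, y, w, h)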
S2021, with only image-level labels available, training a coarse classifier for the multi-label image classification task, whose sigmoid cross-entropy loss function is:

$$L_{cls} = -\frac{1}{C}\sum_{i=1}^{C}\left[y_i\log P_i + (1-y_i)\log(1-P_i)\right]$$

where C is the total number of image classes, y_i is the image's label for the i-th class, and P_i is the prediction of the i-th sigmoid classifier.
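The same loss in PyTorch, as a minimal sketch; the batch size and the class count C = 20 (the VOC category number) are illustrative assumptions:

    import torch
    import torch.nn as nn

    criterion = nn.BCEWithLogitsLoss()              # sigmoid cross entropy
    logits = torch.randn(8, 20)                     # 8 images, C = 20 classes
    labels = torch.randint(0, 2, (8, 20)).float()   # y_i in {0, 1} per class
    loss = criterion(logits, labels)                # mean of the per-class terms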
S2022, for each image containing object class c, obtaining through the coarse classifier a class-specific activation map M_c as a weighted combination of a group of convolutional feature maps, followed by a ReLU activation:

$$M_c = \mathrm{ReLU}\!\left(\sum_k \alpha_k^c A^k\right)$$

where A^k is the k-th convolutional feature map and α_k^c is the weight of feature map A^k for class c, computed as the global average pooling of the gradient of y^c with respect to A^k:

$$\alpha_k^c = \frac{1}{Z}\sum_i\sum_j \frac{\partial y^c}{\partial A_{ij}^k}$$

where y^c is the score of the c-th classifier before the sigmoid and Z is the number of pixels in the feature map.
For each class-specific activation map of a given input image, 10 segmentation thresholds are first set, evenly distributed between the maximum grey value of the activation map and the mean grey value of all its pixels;

then, for each segmentation threshold, a binary image is obtained from the activation map;

finally, a group of bounding boxes is obtained from the maximum connected regions, each bounding box tightly enclosing one maximum connected region; these bounding boxes are the screened proposal boxes.
In this way a large number of object proposal boxes of a particular category are obtained. However, although the high response areas contain the object, they are still far from fully locating the entire object.
S2023, to solve this problem, a set of fine classifiers is further trained to better locate the whole object in a weakly supervised setting. For a given object class, only the first-stage proposal box whose softmax response is highest (or whose sigmoid score is 1) is selected as input for training the second-stage fine classifier. This is in effect a proposal-box classification task, whose loss function is a standard softmax cross-entropy over the selected proposal boxes:

$$L_{frm} = -\frac{1}{N}\sum_{i=1}^{N}\log p_i^{(c_i)}$$

where N is the number of selected proposal boxes and p_i^{(c_i)} is the softmax probability of the i-th proposal box for its class label c_i.

By repeating the operation of the first stage, higher-quality object proposal boxes are generated, locating the whole object better than the first stage.
In summary, the first-stage image-level classifier first predicts the n highest-scoring potential object classes in each test image, and the second-stage proposal-box-level classifier then yields the final proposal boxes used for detection.
S3, inputting the whole image into a convolutional neural network for feature extraction, and passing the extracted proposal-box features through an ROI Pooling layer to generate feature matrices of the same shape;

Part 2 of FIG. 1 is the feature extraction network.
S4, multiple instance learning;

Part 3 of FIG. 1 constructs the MIL detector.
S401, for a given image x and its corresponding label y = [y_1, …, y_C]^T, a series of proposal boxes R = {r_1, …, r_|R|} is obtained through step S2, where y_c = 1 or 0 denotes the presence or absence of object class c in the image and C is the number of object classes.

The proposal-box features (the output of FC7) are then fed into two data streams, called the classification data stream and the detection data stream, which produce two data matrices x^c and x^d through the two fully connected layers FC8c and FC8d, respectively.

These two data matrices are passed through two softmax operators, taken over the classes and over the proposals respectively, to generate a classification score and a detection score for each proposal box:

$$\left[\sigma_{cls}(x^c)\right]_{ij} = \frac{e^{x^c_{ij}}}{\sum_{k=1}^{C} e^{x^c_{kj}}}, \qquad \left[\sigma_{det}(x^d)\right]_{ij} = \frac{e^{x^d_{ij}}}{\sum_{k=1}^{|R|} e^{x^d_{ik}}}$$
s402, obtaining x through the product of two matrix elements-wise by the final proposal box scoreR=σcls(xc)⊙σdet(xd) Which will be used for the next stage target detector optimization. Meanwhile, the scores of all the suggested frames are added in the dimension of R to obtain the image-level prediction score of the c category:
Figure BDA0003566205320000114
the above score is the prediction score of the category c in the image, and the loss function of the MIL detector is the two-category cross entropy loss:
Figure BDA0003566205320000115
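A minimal PyTorch sketch of this two-stream head in the style of WSDDN; the layer names fc8c/fc8d come from the text, while in_dim = 4096 (VGG16's FC7 width) and num_classes = 20 are assumed values:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MILHead(nn.Module):
        def __init__(self, in_dim=4096, num_classes=20):
            super().__init__()
            self.fc8c = nn.Linear(in_dim, num_classes)   # classification stream
            self.fc8d = nn.Linear(in_dim, num_classes)   # detection stream

        def forward(self, roi_feats):                    # roi_feats: (R, in_dim)
            xc = F.softmax(self.fc8c(roi_feats), dim=1)  # softmax over classes
            xd = F.softmax(self.fc8d(roi_feats), dim=0)  # softmax over proposals
            xr = xc * xd                                 # element-wise product
            pc = xr.sum(dim=0).clamp(1e-6, 1 - 1e-6)     # image-level scores P_c
            return xr, pc

    # image-level binary cross-entropy against the 0/1 label vector y:
    # loss_base = F.binary_cross_entropy(pc, y)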
s5, adding low-level supervision information optimization iteration
In convolutional neural networks, the information obtained from the bottom convolutional layers is low-level, comprising appearance features such as edges, colors and textures; the information obtained from the top convolutional layers is high-level, comprising semantic features such as class information. Scoring a proposal box's category only by its class score considers only top-down semantic information. But whether a box contains a complete object depends not on high-level semantic information but on low-level appearance information; objectness (similarity) measures the degree to which an image region is a complete object.
An object in an image has a well-defined boundary and center, so a proposal box containing a complete object should receive a higher objectness score than one framing only part of an object or background. Low-level supervision information is therefore introduced at this stage to optimize the iterated target detector.
S501, iteratively training K classifiers

Inspired by OICR, K instance classifiers are built on top of the MIL detector; the output of the k-th classifier serves as supervision for the (k+1)-th classifier, and low-level objectness (similarity) information is used to guide network training.
Each classifier is implemented as a fully connected layer followed by a softmax over the C+1 class dimension (background is class 0). For the k-th instance classifier, the training loss function is:

$$L_{ins}^k = -\frac{1}{|R|}\sum_{r=1}^{|R|} w_r^k\,\mathrm{CE}\!\left(p_r^k,\hat{y}_r^k\right)$$

where p_r^k is the (C+1)-class classification probability of the r-th proposal box, ŷ_r^k is its classification label, and the weight w_r^k is generated from the objectness score of proposal box r.
Specifically, the low-level objectness O_bu(r) of proposal box r is first calculated, measured by superpixel straddling (SS), and is combined by weighted addition with the class score x_{c,r}^{k-1} obtained from the previous classifier:

$$s_{c,r}^{k} = \alpha\, O_{bu}(r) + (1-\alpha)\, x_{c,r}^{k-1}$$

where α balances the low-level and semantic terms; the combined score is used to select the proposal boxes that contain a complete object and to set the weights w_r^k.
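A sketch of the weighted addition; the text does not fix the weighting coefficient, so alpha is an assumed hyper-parameter, and the objectness scores are assumed to be precomputed (e.g. by superpixel straddling):

    import torch

    def fuse_scores(objectness, prev_cls_score, alpha=0.5):
        # objectness: (R,) low-level scores O_bu(r) for the proposal boxes
        # prev_cls_score: (R,) class scores from the (k-1)-th classifier
        return alpha * objectness + (1.0 - alpha) * prev_cls_score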
s502, the final part of the network is a boundary box regressor, and the regressor aims to output a correction value for each box, and respectively corrects four parameters of x, y, w and h:
Figure BDA0003566205320000129
the loss function of the bounding box regressor is:
Figure BDA00035662053200001210
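A sketch of the regressor head and its loss; reading the objective as smooth L1 over (x, y, w, h) offsets is an assumption consistent with standard detectors:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    bbox_reg = nn.Linear(4096, 4)            # one (dx, dy, dw, dh) per box
    pred = bbox_reg(torch.randn(32, 4096))   # 32 positive-sample proposals
    targets = torch.randn(32, 4)             # offsets toward the pseudo labels
    loss_box = F.smooth_l1_loss(pred, targets)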
s503, generation of pseudo-supervision information and a positive sample box.
S5031 based on k-1 branchingClass probability of (2)
Figure BDA00035662053200001211
NMS is performed on a set R of propofol, the threshold being a predefined TnmsThe set of boxes after NMS is denoted as Rkeep
S5032, for each class c (c)>0, i.e. not background class), if
Figure BDA0003566205320000131
That is, if the MIL detector previously determined that the image contains a c-class object, then R is presentkeepMedium search category score higher than TconfAnd then assign tag c to them, if none of the boxes are satisfied, the highest scoring one. All found boxes are denoted as Rseek
S5033, for each found proposal, find their neighbors in R, i.e. have more than IOU threshold T with themiouThese proposal are also labeled with the same class label. Note that these neighbor propofol are RneighborThe other boxes are considered as background. Thus each propofol has their pseudo class label;
s5034, the set of positive sample boxes is RseekAnd RneighborThe combined set.
S6, determining the loss function of the overall network (comprising the MIL detector, the K instance classifiers and the bounding-box regressor):

$$L = L_{base} + \lambda_1\sum_{k=1}^{K} L_{ins}^k + \lambda_2 L_{box}$$
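Expressed as code, with the loss weights from step S7:

    def total_loss(loss_base, loss_ins_list, loss_box, lambda1=1.0, lambda2=0.3):
        # loss_ins_list holds the K per-branch instance-classifier losses
        return loss_base + lambda1 * sum(loss_ins_list) + lambda2 * loss_box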
s7, adjusting the hyper-parameters;
S701, λ1 = 1, λ2 = 0.3;

S702, the number K of iteratively optimized detectors is 3;

S703, the feature extraction network uses VGG16;

S704, T_nms = 0.3, T_conf = 0.7, T_iou = 0.5;

S705, the initial learning rate is 0.001, the learning-rate decay is 0.0005, and the total number of iterations is 200000.
S8, training the detection model by using a training data set to obtain a trained detection model;
taking a sample pair (a picture and an image-level label corresponding to the picture) of a training data set as input of a detection model, taking the category of a target in each picture and the position and category of each optimized iteration proposal in the training data set as output of the detection model, simultaneously generating a pseudo label in an iteration process, calculating total loss by solving an error between a prediction result and the label, then minimizing the error through back propagation, and optimizing network parameters of the detection model to obtain the trained weak supervision detection model, as shown in fig. 2.
S9, testing on the test data set with the trained model.

The test data set is taken as the input of the weakly supervised detection model; its output gives the positions and classes of the targets in the test data set, which are compared with the instance-level labels of the test set to verify the model's performance.
In another embodiment of the present invention, a weakly supervised target detection system is provided that can be used to implement the weakly supervised target detection method described above; specifically, the system comprises a reading module, a weighting module, a matrix module, a learning module, an iteration module, a function module, an adjusting module, a training module, and a detection module.

The reading module is used for reading image data and image-level labels, the image-level labels being classification labels of only the object classes in the images, and for dividing the image data into a training data set and a test data set;

the weighting module is used for generating candidate boxes on the training data set read by the reading module using a selective search algorithm, and then generating high-quality object proposal boxes by the gradient-weighted class activation mapping method;

the matrix module is used for inputting the training data set read by the reading module into a VGG16 convolutional neural network for feature extraction, and passing the extracted proposal-box features through the ROI Pooling layer to generate feature matrices of the same shape;

the learning module is used for performing multiple instance learning by matching the image-level labels read by the reading module one-to-one with the feature matrices obtained by the matrix module, and constructing an MIL detector;

the iteration module is used for adding low-level supervision information to the object proposal boxes obtained by the weighting module, establishing instance classifiers for iterative optimization, and taking the highest-scoring proposal boxes as pseudo labels to train the bounding-box regression network;

the function module is used for determining the loss functions of the MIL detector obtained by the learning module and of the bounding-box regression network obtained by the iteration module;

the adjusting module is used for adjusting the hyper-parameters of the bounding-box regression network in the function module to obtain a weakly supervised detection model;

the training module is used for training the weakly supervised detection model obtained by the adjusting module with the training data set obtained by the reading module to obtain a trained weakly supervised detection model;

and the detection module is used for generating candidate boxes on the test data set obtained by the reading module using a selective search algorithm, and then classifying the candidate boxes and performing bounding-box regression with the weakly supervised detection model trained by the training module to obtain the final target detection boxes.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
1. Simulation conditions are as follows:
the hardware platform is as follows: HP-Z840 workstation, TITAN-X-12GB-GPU,64GB RAM.
The software platform is as follows: python, Pytorch deep learning framework.
2. Simulation content and results:
the datasets of the present simulation experiment are the PASCAL VOC 2007 and 2012 datasets, and the MS COCO dataset, with example level tags removed from the dataset, and only image level tags used. The PASCAL VOC 2007 and 2012 data sets consist of 9962 and 22531 images in 20 categories, respectively, with 5011 training images in the 2007 data set and 11540 training images in the 2012 data set; the MS COCO dataset consists of 123278 images of 80 categories, selecting 82783 as training images and 40504 as test images. Mean Average Precision (mAP) (IOU >0.5) was used as an evaluation criterion.
After training, the results of the model testing are shown in table 1.
Table 1 Test results of the invention on each data set

    Data set            mAP (%)
    PASCAL VOC 2007     54.2
    PASCAL VOC 2012     47.5
    MS COCO             23.2
The invention achieves a target detection precision of 54.2% on the PASCAL VOC 2007 data set, 47.5% on the PASCAL VOC 2012 data set, and 23.2% on the MS COCO data set, reaching the advanced level of contemporaneous weakly supervised target detection networks.
In summary, the method and system for detecting the weakly supervised target of the present invention have the following characteristics:
1. only image-level supervision information is required
In the target detection task, annotating instance-level information (target bounding boxes) requires a great deal of manpower, material and financial resources. The labeling cost of image classes alone is significantly lower, and large numbers of pictures with class labels can be crawled from web search engines, social media, and the like. Large amounts of training data improve detection performance; cheap, easily obtained pictures labeled only with classes benefit the development and engineering of the target detection field.
2. Better quality proposal box
Most existing weakly supervised object detection algorithms use the traditional Selective Search (SS) algorithm to generate proposal boxes and then treat detection as a classification problem over those boxes. The defects of typical machine-learning classification thus become the defects of typical weakly supervised detection: the more salient objects, or parts of objects, in an image are easily detected, while small objects or complete objects are missed. The invention performs proposal-box generation with the Grad-CAM technique before the MIL detector, obtaining a series of high-quality proposal boxes (covering the target more completely), which aids subsequent detection and mitigates the tendency of traditional methods to fall into local optima;
3. low level supervisory information
When conventional methods iteratively screen proposal boxes, the only evaluation criterion is the box's classification score; this considers only high-level semantic information and ignores the importance of low-level information (edges, textures, etc.) in judging whether a target is a complete object. The invention adds low-level feature supervision to the iterative optimization, introduces the concept of objectness (similarity), screens out the proposal boxes containing the complete target, and generates pseudo labels to supervise subsequent iterations, remedying the defect that prior methods frame only a local part of the target and improving detection accuracy.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention should not be limited thereby, and any modification made on the basis of the technical idea proposed by the present invention falls within the protection scope of the claims of the present invention.

Claims (10)

1. A weakly supervised target detection method, characterized by comprising the following steps:

S1, reading image data and image-level labels, the image-level labels being classification labels of only the object classes in the images, and dividing the image data into a training data set and a test data set;

S2, generating candidate boxes on the training data set read in step S1 using a selective search algorithm, and then generating high-quality object proposal boxes by the gradient-weighted class activation mapping method;

S3, inputting the training data set read in step S1 into a VGG16 convolutional neural network for feature extraction, and passing the extracted proposal-box features through an ROI Pooling layer to generate feature matrices of the same shape;

S4, performing multiple instance learning by matching the image-level labels read in step S1 one-to-one with the feature matrices obtained in step S3, and constructing an MIL detector;

S5, adding low-level supervision information to the object proposal boxes obtained in step S2, establishing instance classifiers for iterative optimization, and taking the highest-scoring proposal boxes as pseudo labels to train the bounding-box regression network;

S6, determining the loss functions of the MIL detector obtained in step S4 and of the bounding-box regression network obtained in step S5;

S7, adjusting the hyper-parameters of the bounding-box regression network of step S6 to obtain a weakly supervised detection model;

S8, training the weakly supervised detection model obtained in step S7 with the training data set obtained in step S1 to obtain a trained weakly supervised detection model;

S9, generating candidate boxes on the test data set obtained in step S1 using a selective search algorithm, and then classifying the candidate boxes and performing bounding-box regression with the weakly supervised detection model trained in step S8 to obtain the final target detection boxes.
2. The method of claim 1, wherein in step S2, a first-stage image-level classifier predicts the n highest-scoring potential object classes in each test image, and a second-stage proposal-box-level classifier then yields the final proposal boxes used for detection, the first-stage loss function being:

$$L_{cls} = -\frac{1}{C}\sum_{i=1}^{C}\left[y_i\log P_i + (1-y_i)\log(1-P_i)\right]$$

where C is the total number of image classes, y_i is the image's label for the i-th class, and P_i is the prediction of the i-th sigmoid classifier.
3. The weakly supervised target detection method according to claim 1, wherein step S4 specifically includes:

S401, obtaining a series of object proposal boxes for a given image and its corresponding label through step S2; then feeding the object proposal boxes into a classification data stream and a detection data stream, and obtaining two data matrices through two fully connected layers respectively; the two data matrices generating, through two softmax operators, a classification score and a detection score for each proposal box;

S402, multiplying the classification scores and detection scores obtained in step S401 element-wise to obtain the final proposal-box scores, and summing the scores of all proposal boxes over the proposal dimension R to obtain the image-level prediction score of each class.
4. The method according to claim 3, wherein in step S402, the loss function of the MIL detector is:
$$L_{base} = -\sum_{c=1}^{C}\left[y_c\log P_c + (1-y_c)\log(1-P_c)\right]$$

where P_c is the image-level prediction score of the c-th class and y_c is the image's label for that class.
5. The method for detecting the weakly supervised target according to claim 1, wherein the step S5 specifically includes:
s501, establishing K example classifiers on the basis of the MIL detector, and calculating similarity O of low-level features of the proposal box rbi(r) converting the analog to Obu(r) class score from previous classifier
Figure FDA0003566205310000022
Carrying out weighted addition, taking the proposal frames with high scores after the addition as proposal frames with complete object targets, taking the proposal frames with high scores at the first n times as pseudo-supervision information when the example classifier is iteratively trained next time, and iteratively training the K example classifiers for K times;
s502, connecting a boundary box regressor behind the K example classifiers, wherein the boundary box regressor aims to output a correction value for each frame and correct four parameters of x, y, w and h respectively.
6. The method of claim 5, wherein in step S501, the loss function L_ins^k for iteratively training the K instance classifiers is:

$$L_{ins}^k = -\frac{1}{R}\sum_{r=1}^{R} w_r^k\,\mathrm{CE}\!\left(p_r^k,\hat{y}_r^k\right)$$

where R is the number of proposal boxes generated in step S3, w_r^k is the loss weight of each proposal box, CE is the cross-entropy loss function, p_r^k is the (C+1)-class classification probability of the r-th proposal box, and ŷ_r^k is its classification label.
7. The weakly supervised target detection method of claim 5, wherein in step S503, the generation of pseudo-supervision information and positive sample boxes specifically includes:

S5031, based on the class probabilities p_r^{k-1} of the (k-1)-th branch, performing NMS on the proposal set R with a predefined threshold T_nms, the set of boxes kept after NMS being denoted R_keep;

S5032, for each class c, if y_c = 1, searching the set R_keep obtained in step S5031 for boxes whose class-c score exceeds T_conf and giving them label c; if no box qualifies, giving the highest-scoring box label c; all found boxes being denoted R_seek;

S5033, for each found proposal, finding its corresponding neighbors in R, denoted R_neighbor;

S5034, merging R_seek obtained in step S5032 with R_neighbor from step S5033 to obtain the positive sample boxes.
8. The weakly supervised target detection method of claim 1, wherein in step S6, the total loss function L is:

$$L = L_{base} + \lambda_1\sum_{k=1}^{K} L_{ins}^k + \lambda_2 L_{box}$$

where L_base is the loss function of the MIL detector, λ1 is the loss weight of the K instance classifiers, K is the number of instance classifiers, L_ins^k is the loss function of the k-th instance classifier, λ2 is the loss weight of the bounding-box regressor, and L_box is the loss function of the bounding-box regressor.
9. The method for detecting a weakly supervised target according to claim 1, wherein in step S7, the adjusting of the hyper-parameter specifically comprises:
the feature extraction stage in step S3 uses a VGG16 network; loss weight λ for an example classifier detector11, the number of example classifiers K is 3, and the loss weight λ of the bounding box regression network20.3; threshold T of NMSnms=0.3,,,Tconf=0.7,Tiou=0.5(ii) a During network training, the initial learning rate is 0.001, the learning rate attenuation is 0.0005, and the total iteration number is 200000.
10. A weakly supervised target detection system, comprising:

the reading module, used for reading image data and image-level labels, the image-level labels being classification labels of only the object classes in the images, and for dividing the image data into a training data set and a test data set;

the weighting module, used for generating candidate boxes on the training data set read by the reading module using a selective search algorithm, and then generating high-quality object proposal boxes by the gradient-weighted class activation mapping method;

the matrix module, used for inputting the training data set read by the reading module into a VGG16 convolutional neural network for feature extraction, and passing the extracted proposal-box features through the ROI Pooling layer to generate feature matrices of the same shape;

the learning module, used for performing multiple instance learning by matching the image-level labels read by the reading module one-to-one with the feature matrices obtained by the matrix module, and constructing an MIL detector;

the iteration module, used for adding low-level supervision information to the object proposal boxes obtained by the weighting module, establishing instance classifiers for iterative optimization, and taking the highest-scoring proposal boxes as pseudo labels to train the bounding-box regression network;

the function module, used for determining the loss functions of the MIL detector obtained by the learning module and of the bounding-box regression network obtained by the iteration module;

the adjusting module, used for adjusting the hyper-parameters of the bounding-box regression network in the function module to obtain a weakly supervised detection model;

the training module, used for training the weakly supervised detection model obtained by the adjusting module with the training data set obtained by the reading module to obtain a trained weakly supervised detection model;

and the detection module, used for generating candidate boxes on the test data set obtained by the reading module using a selective search algorithm, and then classifying the candidate boxes and performing bounding-box regression with the weakly supervised detection model trained by the training module to obtain the final target detection boxes.
CN202210302852.0A 2022-03-25 2022-03-25 Weak supervision target detection method and system Pending CN114648665A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210302852.0A CN114648665A (en) 2022-03-25 2022-03-25 Weak supervision target detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210302852.0A CN114648665A (en) 2022-03-25 2022-03-25 Weak supervision target detection method and system

Publications (1)

Publication Number Publication Date
CN114648665A true CN114648665A (en) 2022-06-21

Family

ID=81996067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210302852.0A Pending CN114648665A (en) 2022-03-25 2022-03-25 Weak supervision target detection method and system

Country Status (1)

Country Link
CN (1) CN114648665A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100501A (en) * 2022-06-22 2022-09-23 中国科学院大学 Accurate target detection method based on single-point supervision
CN115100501B (en) * 2022-06-22 2023-09-22 中国科学院大学 Accurate target detection method based on single-point supervision
CN114896307A (en) * 2022-06-30 2022-08-12 北京航空航天大学杭州创新研究院 Time series data enhancement method and device and electronic equipment
CN114896307B (en) * 2022-06-30 2022-09-27 北京航空航天大学杭州创新研究院 Time series data enhancement method and device and electronic equipment
CN114882325A (en) * 2022-07-12 2022-08-09 之江实验室 Semi-supervisor detection and training method and device based on two-stage object detector
CN115439688A (en) * 2022-09-01 2022-12-06 哈尔滨工业大学 Weak supervision object detection method based on surrounding area perception and association
CN115457388A (en) * 2022-09-06 2022-12-09 湖南经研电力设计有限公司 Power transmission and transformation remote sensing image ground feature identification method and system based on deep learning optimization
CN115457388B (en) * 2022-09-06 2023-07-28 湖南经研电力设计有限公司 Power transmission and transformation remote sensing image ground object identification method and system based on deep learning optimization
CN116310293A (en) * 2023-02-13 2023-06-23 中国矿业大学(北京) Method for detecting target of generating high-quality candidate frame based on weak supervised learning
CN116310293B (en) * 2023-02-13 2023-09-12 中国矿业大学(北京) Method for detecting target of generating high-quality candidate frame based on weak supervised learning
CN116612120A (en) * 2023-07-20 2023-08-18 山东高速工程检测有限公司 Two-stage road defect detection method for data unbalance
CN116612120B (en) * 2023-07-20 2023-10-10 山东高速工程检测有限公司 Two-stage road defect detection method for data unbalance

Similar Documents

Publication Publication Date Title
CN114648665A (en) Weak supervision target detection method and system
CN112396002B (en) SE-YOLOv 3-based lightweight remote sensing target detection method
CN112966684B (en) Cooperative learning character recognition method under attention mechanism
Xing et al. A convolutional neural network-based method for workpiece surface defect detection
Bevandić et al. Simultaneous semantic segmentation and outlier detection in presence of domain shift
CN110097568A (en) A kind of the video object detection and dividing method based on the double branching networks of space-time
Long et al. Object detection in aerial images using feature fusion deep networks
US11640714B2 (en) Video panoptic segmentation
CN112766170B (en) Self-adaptive segmentation detection method and device based on cluster unmanned aerial vehicle image
CN111460927A (en) Method for extracting structured information of house property certificate image
CN111368660A (en) Single-stage semi-supervised image human body target detection method
CN112613428B (en) Resnet-3D convolution cattle video target detection method based on balance loss
CN110008899B (en) Method for extracting and classifying candidate targets of visible light remote sensing image
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN113592825A (en) YOLO algorithm-based real-time coal gangue detection method
CN113139896A (en) Target detection system and method based on super-resolution reconstruction
Yadav et al. An improved deep learning-based optimal object detection system from images
CN114332921A (en) Pedestrian detection method based on improved clustering algorithm for Faster R-CNN network
CN113496480A (en) Method for detecting weld image defects
CN116091946A (en) Yolov 5-based unmanned aerial vehicle aerial image target detection method
CN113657414B (en) Object identification method
Shahriyar et al. An approach for multi label image classification using single label convolutional neural network
CN110929726B (en) Railway contact network support number plate identification method and system
Li A deep learning-based text detection and recognition approach for natural scenes
CN110287970B (en) Weak supervision object positioning method based on CAM and covering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination