CN114648665A - Weak supervision target detection method and system - Google Patents

Weak supervision target detection method and system

Info

Publication number
CN114648665A
Authority
CN
China
Prior art keywords
image
box
module
target
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210302852.0A
Other languages
Chinese (zh)
Inventor
马文萍
李腾武
朱浩
武越
李娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202210302852.0A priority Critical patent/CN114648665A/en
Publication of CN114648665A publication Critical patent/CN114648665A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155 Generating training patterns; Bootstrap methods, e.g. bagging or boosting, characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2431 Multiple classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a weakly supervised target detection method and system that train a target detector to detect targets in images using only image-level class annotation. In the prior-box generation stage, a selective search algorithm is combined with the gradient-weighted class activation mapping method to generate better prior boxes. Meanwhile, during the detector's iterative optimization, supervision information from low-level features is added, and the concept of objectness (similarity) is introduced to measure the degree to which the target in a prior box is a complete target. This addresses the pain point that existing weakly supervised target detection methods easily fall into local optima, so that, without any target bounding-box supervision, the network tends to select prior boxes covering the whole target. The network improves weakly supervised detection performance and can be used in fields such as autonomous driving and intelligent security; experimental results show that the method is competitive.

Description

Weak supervision target detection method and system
Technical Field
The invention belongs to the technical field of computer vision and image processing, and particularly relates to a weakly supervised target detection method and system.
Background
The purpose of weakly supervised object detection is to train an object detector with only image-level (class) annotation, unlike fully supervised object detection, which requires instance-level annotation (the center coordinates, height and width of the maximum bounding rectangle of each object in an image). Annotating instance-level information requires a great deal of manpower, material and financial resources, whereas image-level class annotation is significantly cheaper, and large numbers of pictures with class labels can be crawled from web search engines, social media, and the like. Large amounts of training data improve target detection performance, so cheap, easily obtained pictures with only class labels clearly benefit the field of target detection. Learning target detectors with weak supervision has therefore received growing attention from academia, and it is an urgent need in industry.
Current weakly supervised target detection is usually based on a multiple instance learning (MIL) procedure, which tends to get trapped in locally optimal solutions. Concretely, for lack of instance-level constraints, supervision only at the image-class level can cause the detector to focus on just a local region, because classification needs only local information (for example, for a person or cat in a picture, a classifier need only attend to the face), whereas detection must accurately locate the maximum bounding rectangle of the object; this is the gap between classification and detection. The local-focus problem is especially acute for objects with large intra-class differences, typically non-rigid multi-pose objects such as humans and animals, because such objects usually have some invariant appearance, such as the face.
Meanwhile, because there is no instance-level rectangle annotation, current methods all use a large number of object proposal boxes (proposals) to ensure recall, which introduces a large amount of noise into the proposals (partial objects, background, etc.); this not only makes training unstable but also consumes a large amount of GPU computing resources.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a weakly supervised target detection method and system. Before multiple instance learning (MIL), the proposal boxes generated by the conventional Selective Search (SS) method are first screened to obtain high-quality proposal boxes, which are fed into the subsequent multiple instance learning process; then, during multiple instance learning, low-level (color, texture, etc.) supervision information is added on top of the original image-level label supervision, so that the network trains better and its detection accuracy improves.
The invention adopts the following technical scheme:

A weakly supervised target detection method comprises the following steps:

S1, reading image data and image-level labels, the image-level labels being classification labels of only the object classes in the images, and dividing the image data into a training data set and a test data set;

S2, generating candidate boxes on the training data set read in step S1 using a selective search algorithm, and then generating high-quality object proposal boxes by the gradient-weighted class activation mapping method;

S3, inputting the training data set read in step S1 into a VGG16 convolutional neural network for feature extraction, and passing the extracted proposal-box features through an ROI Pooling layer to generate feature matrices of the same shape;

S4, performing multiple instance learning by matching the image-level labels read in step S1 one-to-one with the feature matrices obtained in step S3, and constructing an MIL detector;

S5, adding low-level supervision information to the object proposal boxes obtained in step S2, establishing instance classifiers for iterative optimization, and taking the highest-scoring proposal boxes as pseudo labels to train the bounding-box regression network;

S6, determining the loss functions of the MIL detector obtained in step S4 and of the bounding-box regression network obtained in step S5;

S7, adjusting the hyper-parameters of the bounding-box regression network of step S6 to obtain a weakly supervised detection model;

S8, training the weakly supervised detection model obtained in step S7 with the training data set obtained in step S1 to obtain a trained weakly supervised detection model;

S9, generating candidate boxes on the test data set obtained in step S1 using a selective search algorithm, and then classifying the candidate boxes and performing bounding-box regression with the weakly supervised detection model trained in step S8 to obtain the final target detection boxes.
Specifically, in step S2, a first-stage image-level classifier predicts the n highest-scoring potential object classes in each test image, and a second-stage proposal-box-level classifier then yields the final proposal boxes used for detection; the first-stage loss function is:

$$L_{cls} = -\frac{1}{C}\sum_{i=1}^{C}\left[y_i\log P_i + (1-y_i)\log(1-P_i)\right]$$

where C is the total number of image classes, y_i is the image's label for the i-th class, and P_i is the prediction of the i-th sigmoid classifier.
Specifically, step S4 includes:

S401, obtaining a series of object proposal boxes for a given image and its corresponding label through step S2; the object proposal boxes are then fed into a classification data stream and a detection data stream, and two data matrices are obtained through two fully connected layers; the two data matrices are passed through two softmax operators to generate a classification score and a detection score for each proposal box;

S402, multiplying the classification and detection scores obtained in step S401 element-wise to obtain the final proposal-box scores, and summing the scores of all proposal boxes over the proposal dimension R to obtain the image-level prediction score of each class.
Further, in step S402, the loss function of the MIL detector is:
$$L_{base} = -\sum_{c=1}^{C}\left[y_c\log P_c + (1-y_c)\log(1-P_c)\right]$$

where P_c is the image-level prediction score of the c-th class and y_c is the image's label for that class.
Specifically, step S5 includes:

S501, establishing K instance classifiers on top of the MIL detector; the objectness (similarity) score O_bu(r) of the low-level features of each proposal box r is calculated and combined by weighted addition with the class score from the previous classifier; the proposal boxes with high combined scores are taken as proposal boxes containing a complete object target, the top-n high-scoring proposal boxes serve as pseudo-supervision information for the next iteration of instance-classifier training, and the K instance classifiers are trained iteratively K times in total;

S502, connecting a bounding-box regressor after the K instance classifiers; the regressor outputs a correction value for each box, correcting the four parameters x, y, w and h respectively.
Further, in step S501, the loss function L_ins^k for iteratively training the K instance classifiers is:

$$L_{ins}^k = -\frac{1}{R}\sum_{r=1}^{R} w_r^k\,\mathrm{CE}\!\left(p_r^k,\hat{y}_r^k\right)$$

where R is the number of proposal boxes generated in step S3, w_r^k is the loss weight of each proposal box, CE is the cross-entropy loss function, p_r^k is the (C+1)-class classification probability of the r-th proposal box, and ŷ_r^k is its classification label.
Further, in step S503, the generation of pseudo-supervision information and positive sample boxes specifically includes:

S5031, based on the class probabilities p_r^{k-1} of the (k-1)-th branch, performing NMS on the proposal set R with a predefined threshold T_nms; the set of boxes kept after NMS is denoted R_keep;

S5032, for each class c, if y_c = 1, searching the set R_keep obtained in step S5031 for boxes whose class-c score exceeds T_conf and giving them label c; if no box qualifies, the highest-scoring box is given label c; all found boxes are denoted R_seek;

S5033, for each found proposal, finding its corresponding neighbors in R, denoted R_neighbor;

S5034, merging R_seek obtained in step S5032 with R_neighbor from step S5033 to obtain the positive sample boxes.
Specifically, in step S6, the total loss function L of the overall network is:

$$L = L_{base} + \lambda_1\sum_{k=1}^{K} L_{ins}^k + \lambda_2 L_{box}$$

where L_base is the loss function of the MIL detector, λ1 is the loss weight of the K instance classifiers, K is the number of instance classifiers, L_ins^k is the loss function of the k-th instance classifier, λ2 is the loss weight of the bounding-box regressor, and L_box is the loss function of the bounding-box regressor.
Specifically, in step S7, adjusting the hyper-parameters includes:

The feature extraction stage uses a VGG16 network; the loss weight of the instance classifiers is λ1 = 1, the number of instance classifiers is K = 3, and the loss weight of the bounding-box regression network is λ2 = 0.3; the thresholds are T_nms = 0.3, T_conf = 0.7 and T_iou = 0.5; during network training, the initial learning rate is 0.001, the learning-rate decay is 0.0005, and the total number of iterations is 200000.
Another technical solution of the present invention is a weakly supervised target detection system, comprising:

the reading module, used for reading image data and image-level labels, the image-level labels being classification labels of only the object classes in the images, and for dividing the image data into a training data set and a test data set;

the weighting module, used for generating candidate boxes on the training data set read by the reading module using a selective search algorithm, and then generating high-quality object proposal boxes by the gradient-weighted class activation mapping method;

the matrix module, used for inputting the training data set read by the reading module into a VGG16 convolutional neural network for feature extraction, and passing the extracted proposal-box features through the ROI Pooling layer to generate feature matrices of the same shape;

the learning module, used for performing multiple instance learning by matching the image-level labels read by the reading module one-to-one with the feature matrices obtained by the matrix module, and constructing an MIL detector;

the iteration module, used for adding low-level supervision information to the object proposal boxes obtained by the weighting module, establishing instance classifiers for iterative optimization, and taking the highest-scoring proposal boxes as pseudo labels to train the bounding-box regression network;

the function module, used for determining the loss functions of the MIL detector obtained by the learning module and of the bounding-box regression network obtained by the iteration module;

the adjusting module, used for adjusting the hyper-parameters of the bounding-box regression network in the function module to obtain a weakly supervised detection model;

the training module, used for training the weakly supervised detection model obtained by the adjusting module with the training data set obtained by the reading module to obtain a trained weakly supervised detection model;

and the detection module, used for generating candidate boxes on the test data set obtained by the reading module using a selective search algorithm, and then classifying the candidate boxes and performing bounding-box regression with the weakly supervised detection model trained by the training module to obtain the final target detection boxes.
Compared with the prior art, the invention has at least the following beneficial effects:
the invention relates to a weak supervision target detection method, which comprises the steps of gradually screening proposal frames in each stage, generating the proposal frames capable of framing the whole part of a body as pseudo supervision information, supervising a final boundary frame regression network, improving the target detection precision, generating the proposal frames by the Grad-CAM technology before an MIL detector to obtain a series of high-quality (more completely covering the target) proposal frames, being beneficial to the detection of a subsequent detector and improving the defect that the traditional method is easy to fall into local optimum; the low-level feature supervision is added in the iterative optimization process, the concept of similarity is introduced, the proposal box containing the complete target is screened out, the pseudo label is generated to supervise the subsequent iterative process, the defect that the target is only framed locally in the existing method is overcome, and the detection precision is improved.
Further, in step S2, the candidate boxes generated by the selective search algorithm are filtered by the gradient-weighted class activation mapping method; the filtered proposal boxes lie close to the object targets, reducing noise and interference in the later training of the MIL detector.
Further, in step S4, an MIL classifier is constructed that correctly classifies the proposal boxes in each image using only image-level labels, without any instance-level labels, greatly reducing the manpower and material resources needed to annotate instance-level information.
Further, the loss function of the MIL detector in step S402 is set as a binary cross-entropy loss; the loss is back-propagated during training, guiding the direction of network optimization well and accelerating the training process.
Furthermore, step S5 introduces low-level supervision information, i.e., image objectness (similarity), so the method can screen the proposal box that frames the whole target out of a series of proposal boxes, addressing the problem that other contemporaneous weakly supervised target detection methods easily fall into local optima.
Further, step S501 iteratively trains K instance classifiers, refining the supervision with each iteration, so the finally generated pseudo-supervision information is of high quality. The loss function of the instance classifiers is a cross-entropy loss, which guides the direction of network optimization well and accelerates the training process.
Further, in step S503, NMS is applied to the proposal boxes with a set threshold, and heavily overlapping proposal boxes are merged to generate the pseudo-supervision information and positive samples, which better guides the training of the bounding-box regression network.
Further, in step S6, the loss functions of the different parts of the network (the MIL classifier, the K instance classifiers and the bounding-box regressor) are integrated, and weights are set for each part's loss, so the network converges faster during back-propagation training.
Furthermore, the hyper-parameter adjustment in step S7 sets the learning-rate decay, ensuring fast convergence early in training and stable convergence later in training.
In conclusion, by combining the high-level information extracted by the deep neural network with low-level supervision information, proposal boxes are screened layer by layer, and higher-quality proposal boxes are generated as supervision information for target detection, improving detection accuracy; meanwhile, the network's hyper-parameter settings are reasonable, and the training process is fast and stable.
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
Drawings
FIG. 1 is an overall design framework of the present invention;
fig. 2 shows a generation process of the pseudo tag.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the description of the present invention, it should be understood that the terms "comprises" and/or "comprising" indicate the presence of the stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
Various structural schematics according to the disclosed embodiments of the invention are shown in the drawings. The figures are not drawn to scale, wherein certain details are exaggerated and possibly omitted for clarity of presentation. The shapes of the various regions, layers and their relative sizes, positional relationships are shown in the drawings as examples only, and in practice deviations due to manufacturing tolerances or technical limitations are possible, and a person skilled in the art may additionally design regions/layers with different shapes, sizes, relative positions, according to the actual needs.
The invention provides a weakly supervised target detection method that trains a target detector to detect targets in images using only image-level class annotation. In the prior-box generation stage, the Selective Search (SS) algorithm is combined with the gradient-weighted class activation mapping (Grad-CAM) method to generate better prior boxes, which have a higher intersection-over-union with the Ground Truth than those obtained by a greedy search and thus cover the whole object better. Meanwhile, during the detector's iterative optimization, supervision information from low-level features is added, and the concept of objectness (similarity) is introduced to measure the degree to which the target in a prior box is a complete target. This addresses the pain point that current weakly supervised detection methods easily fall into local optima: without any target bounding-box supervision, the network is more inclined to select prior boxes covering the whole target rather than a part of it. The network of the invention improves weakly supervised detection performance and can be used in image processing and detection fields such as face detection, pedestrian counting, vehicle detection, robot navigation and security systems. Experimental results show that the method is competitive.
Referring to FIG. 1, the weakly supervised target detection method of the present invention comprises the following steps:

S1, reading images and labels from the data set, the labels being image-level classification labels of only the object classes in the images;

S2, generating high-quality proposal boxes

Part 1 of FIG. 1 is proposal-box generation, which combines Selective Search (SS) with gradient-weighted class activation mapping (Grad-CAM) to generate better proposal boxes; these overlap the Ground Truth with a higher intersection-over-union (IoU) than a greedy search achieves and cover the whole object better.

S201, generating a large number of candidate boxes using a selective search algorithm, as in the sketch below;

S202, generating high-quality object proposal boxes through Grad-CAM.
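A minimal sketch of S201 using OpenCV's selective-search implementation (this needs the opencv-contrib-python package; the cap on the number of boxes is an assumption, the text states no limit):

    import cv2

    def selective_search_boxes(image_bgr, max_boxes=2000):
        ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
        ss.setBaseImage(image_bgr)
        ss.switchToSelectiveSearchFast()   # the 'Quality' mode is slower
        return ss.process()[:max_boxes]    # candidate boxes as (x, y, w, h)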
S2021, with only image-level labels available, training a coarse classifier for the multi-label image classification task, whose sigmoid cross-entropy loss function is:

$$L_{cls} = -\frac{1}{C}\sum_{i=1}^{C}\left[y_i\log P_i + (1-y_i)\log(1-P_i)\right]$$

where C is the total number of image classes, y_i is the image's label for the i-th class, and P_i is the prediction of the i-th sigmoid classifier.
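The same loss in PyTorch, as a minimal sketch; the batch size and the class count C = 20 (the VOC category number) are illustrative assumptions:

    import torch
    import torch.nn as nn

    criterion = nn.BCEWithLogitsLoss()              # sigmoid cross entropy
    logits = torch.randn(8, 20)                     # 8 images, C = 20 classes
    labels = torch.randint(0, 2, (8, 20)).float()   # y_i in {0, 1} per class
    loss = criterion(logits, labels)                # mean of the per-class terms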
S2022, for each image containing object class c, obtaining through the coarse classifier a class-specific activation map M_c as a weighted combination of a group of convolutional feature maps, followed by a ReLU activation:

$$M_c = \mathrm{ReLU}\!\left(\sum_k \alpha_k^c A^k\right)$$

where A^k is the k-th convolutional feature map and α_k^c is the weight of feature map A^k for class c, computed as the global average pooling of the gradient of y^c with respect to A^k:

$$\alpha_k^c = \frac{1}{Z}\sum_i\sum_j \frac{\partial y^c}{\partial A_{ij}^k}$$

where y^c is the score of the c-th classifier before the sigmoid and Z is the number of pixels in the feature map.
For each class-specific activation map of a given input image, 10 segmentation thresholds are first set, evenly distributed between the maximum grey value of the activation map and the mean grey value of all its pixels;

then, for each segmentation threshold, a binary image is obtained from the activation map;

finally, a group of bounding boxes is obtained from the maximum connected regions, each bounding box tightly enclosing one maximum connected region; these bounding boxes are the screened proposal boxes.
In this way a large number of object proposal boxes of a particular category are obtained. However, although the high response areas contain the object, they are still far from fully locating the entire object.
S2023, to solve this problem, a set of fine classifiers is further trained to better locate the whole object in a weakly supervised setting. For a given object class, only the first-stage proposal box whose softmax response is highest (or whose sigmoid score is 1) is selected as input for training the second-stage fine classifier. This is in effect a proposal-box classification task, whose loss function is a standard softmax cross-entropy over the selected proposal boxes:

$$L_{frm} = -\frac{1}{N}\sum_{i=1}^{N}\log p_i^{(c_i)}$$

where N is the number of selected proposal boxes and p_i^{(c_i)} is the softmax probability of the i-th proposal box for its class label c_i.

By repeating the operation of the first stage, higher-quality object proposal boxes are generated, locating the whole object better than the first stage.
In summary, the first-stage image-level classifier first predicts the n highest-scoring potential object classes in each test image, and the second-stage proposal-box-level classifier then yields the final proposal boxes used for detection.
S3, inputting the whole image into a convolutional neural network for feature extraction, and passing the extracted proposal-box features through an ROI Pooling layer to generate feature matrices of the same shape;

Part 2 of FIG. 1 is the feature extraction network.
S4, multiple instance learning;

Part 3 of FIG. 1 constructs the MIL detector.
S401, for a given image x and its corresponding label y = [y_1, …, y_C]^T, a series of proposal boxes R = {r_1, …, r_|R|} is obtained through step S2, where y_c = 1 or 0 denotes the presence or absence of object class c in the image and C is the number of object classes.

The proposal-box features (the output of FC7) are then fed into two data streams, called the classification data stream and the detection data stream, which produce two data matrices x^c and x^d through the two fully connected layers FC8c and FC8d, respectively.

These two data matrices are passed through two softmax operators, taken over the classes and over the proposals respectively, to generate a classification score and a detection score for each proposal box:

$$\left[\sigma_{cls}(x^c)\right]_{ij} = \frac{e^{x^c_{ij}}}{\sum_{k=1}^{C} e^{x^c_{kj}}}, \qquad \left[\sigma_{det}(x^d)\right]_{ij} = \frac{e^{x^d_{ij}}}{\sum_{k=1}^{|R|} e^{x^d_{ik}}}$$
s402, obtaining x through the product of two matrix elements-wise by the final proposal box scoreR=σcls(xc)⊙σdet(xd) Which will be used for the next stage target detector optimization. Meanwhile, the scores of all the suggested frames are added in the dimension of R to obtain the image-level prediction score of the c category:
Figure BDA0003566205320000114
the above score is the prediction score of the category c in the image, and the loss function of the MIL detector is the two-category cross entropy loss:
Figure BDA0003566205320000115
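A minimal PyTorch sketch of this two-stream head in the style of WSDDN; the layer names fc8c/fc8d come from the text, while in_dim = 4096 (VGG16's FC7 width) and num_classes = 20 are assumed values:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MILHead(nn.Module):
        def __init__(self, in_dim=4096, num_classes=20):
            super().__init__()
            self.fc8c = nn.Linear(in_dim, num_classes)   # classification stream
            self.fc8d = nn.Linear(in_dim, num_classes)   # detection stream

        def forward(self, roi_feats):                    # roi_feats: (R, in_dim)
            xc = F.softmax(self.fc8c(roi_feats), dim=1)  # softmax over classes
            xd = F.softmax(self.fc8d(roi_feats), dim=0)  # softmax over proposals
            xr = xc * xd                                 # element-wise product
            pc = xr.sum(dim=0).clamp(1e-6, 1 - 1e-6)     # image-level scores P_c
            return xr, pc

    # image-level binary cross-entropy against the 0/1 label vector y:
    # loss_base = F.binary_cross_entropy(pc, y)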
s5, adding low-level supervision information optimization iteration
In convolutional neural networks, the information obtained from the bottom convolutional layers is low-level, comprising appearance features such as edges, colors and textures; the information obtained from the top convolutional layers is high-level, comprising semantic features such as class information. Scoring a proposal box's category only by its class score considers only top-down semantic information. But whether a box contains a complete object depends not on high-level semantic information but on low-level appearance information; objectness (similarity) measures the degree to which an image region is a complete object.
An object in an image has a well-defined boundary and center, so a proposal box containing a complete object should receive a higher objectness score than one framing only part of an object or background. Low-level supervision information is therefore introduced at this stage to optimize the iterated target detector.
S501, iteratively training K classifiers

Inspired by OICR, K instance classifiers are built on top of the MIL detector; the output of the k-th classifier serves as supervision for the (k+1)-th classifier, and low-level objectness (similarity) information is used to guide network training.
Each classifier is implemented as a fully connected layer followed by a softmax over the C+1 class dimension (background is class 0). For the k-th instance classifier, the training loss function is:

$$L_{ins}^k = -\frac{1}{|R|}\sum_{r=1}^{|R|} w_r^k\,\mathrm{CE}\!\left(p_r^k,\hat{y}_r^k\right)$$

where p_r^k is the (C+1)-class classification probability of the r-th proposal box, ŷ_r^k is its classification label, and the weight w_r^k is generated from the objectness score of proposal box r.
Specifically, the low-level objectness O_bu(r) of proposal box r is first calculated, measured by superpixel straddling (SS), and is combined by weighted addition with the class score x_{c,r}^{k-1} obtained from the previous classifier:

$$s_{c,r}^{k} = \alpha\, O_{bu}(r) + (1-\alpha)\, x_{c,r}^{k-1}$$

where α balances the low-level and semantic terms; the combined score is used to select the proposal boxes that contain a complete object and to set the weights w_r^k.
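A sketch of the weighted addition; the text does not fix the weighting coefficient, so alpha is an assumed hyper-parameter, and the objectness scores are assumed to be precomputed (e.g. by superpixel straddling):

    import torch

    def fuse_scores(objectness, prev_cls_score, alpha=0.5):
        # objectness: (R,) low-level scores O_bu(r) for the proposal boxes
        # prev_cls_score: (R,) class scores from the (k-1)-th classifier
        return alpha * objectness + (1.0 - alpha) * prev_cls_score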
s502, the final part of the network is a boundary box regressor, and the regressor aims to output a correction value for each box, and respectively corrects four parameters of x, y, w and h:
Figure BDA0003566205320000129
the loss function of the bounding box regressor is:
Figure BDA00035662053200001210
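A sketch of the regressor head and its loss; reading the objective as smooth L1 over (x, y, w, h) offsets is an assumption consistent with standard detectors:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    bbox_reg = nn.Linear(4096, 4)            # one (dx, dy, dw, dh) per box
    pred = bbox_reg(torch.randn(32, 4096))   # 32 positive-sample proposals
    targets = torch.randn(32, 4)             # offsets toward the pseudo labels
    loss_box = F.smooth_l1_loss(pred, targets)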
s503, generation of pseudo-supervision information and a positive sample box.
S5031 based on k-1 branchingClass probability of (2)
Figure BDA00035662053200001211
NMS is performed on a set R of propofol, the threshold being a predefined TnmsThe set of boxes after NMS is denoted as Rkeep
S5032, for each class c (c)>0, i.e. not background class), if
Figure BDA0003566205320000131
That is, if the MIL detector previously determined that the image contains a c-class object, then R is presentkeepMedium search category score higher than TconfAnd then assign tag c to them, if none of the boxes are satisfied, the highest scoring one. All found boxes are denoted as Rseek
S5033, for each found proposal, find their neighbors in R, i.e. have more than IOU threshold T with themiouThese proposal are also labeled with the same class label. Note that these neighbor propofol are RneighborThe other boxes are considered as background. Thus each propofol has their pseudo class label;
s5034, the set of positive sample boxes is RseekAnd RneighborThe combined set.
S6, determining the loss function of the overall network (comprising the MIL detector, the K instance classifiers and the bounding-box regressor):

$$L = L_{base} + \lambda_1\sum_{k=1}^{K} L_{ins}^k + \lambda_2 L_{box}$$
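Expressed as code, with the loss weights from step S7:

    def total_loss(loss_base, loss_ins_list, loss_box, lambda1=1.0, lambda2=0.3):
        # loss_ins_list holds the K per-branch instance-classifier losses
        return loss_base + lambda1 * sum(loss_ins_list) + lambda2 * loss_box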
s7, adjusting the hyper-parameters;
S701, λ1 = 1, λ2 = 0.3;

S702, the number K of iteratively optimized detectors is 3;

S703, the feature extraction network uses VGG16;

S704, T_nms = 0.3, T_conf = 0.7, T_iou = 0.5;

S705, the initial learning rate is 0.001, the learning-rate decay is 0.0005, and the total number of iterations is 200000.
S8, training the detection model by using a training data set to obtain a trained detection model;
taking a sample pair (a picture and an image-level label corresponding to the picture) of a training data set as input of a detection model, taking the category of a target in each picture and the position and category of each optimized iteration proposal in the training data set as output of the detection model, simultaneously generating a pseudo label in an iteration process, calculating total loss by solving an error between a prediction result and the label, then minimizing the error through back propagation, and optimizing network parameters of the detection model to obtain the trained weak supervision detection model, as shown in fig. 2.
S9, testing on the test data set with the trained model.

The test data set is taken as the input of the weakly supervised detection model; its output gives the positions and classes of the targets in the test data set, which are compared with the instance-level labels of the test set to verify the model's performance.
In another embodiment of the present invention, a weakly supervised target detection system is provided that can be used to implement the weakly supervised target detection method described above; specifically, the system comprises a reading module, a weighting module, a matrix module, a learning module, an iteration module, a function module, an adjusting module, a training module, and a detection module.

The reading module is used for reading image data and image-level labels, the image-level labels being classification labels of only the object classes in the images, and for dividing the image data into a training data set and a test data set;

the weighting module is used for generating candidate boxes on the training data set read by the reading module using a selective search algorithm, and then generating high-quality object proposal boxes by the gradient-weighted class activation mapping method;

the matrix module is used for inputting the training data set read by the reading module into a VGG16 convolutional neural network for feature extraction, and passing the extracted proposal-box features through the ROI Pooling layer to generate feature matrices of the same shape;

the learning module is used for performing multiple instance learning by matching the image-level labels read by the reading module one-to-one with the feature matrices obtained by the matrix module, and constructing an MIL detector;

the iteration module is used for adding low-level supervision information to the object proposal boxes obtained by the weighting module, establishing instance classifiers for iterative optimization, and taking the highest-scoring proposal boxes as pseudo labels to train the bounding-box regression network;

the function module is used for determining the loss functions of the MIL detector obtained by the learning module and of the bounding-box regression network obtained by the iteration module;

the adjusting module is used for adjusting the hyper-parameters of the bounding-box regression network in the function module to obtain a weakly supervised detection model;

the training module is used for training the weakly supervised detection model obtained by the adjusting module with the training data set obtained by the reading module to obtain a trained weakly supervised detection model;

and the detection module is used for generating candidate boxes on the test data set obtained by the reading module using a selective search algorithm, and then classifying the candidate boxes and performing bounding-box regression with the weakly supervised detection model trained by the training module to obtain the final target detection boxes.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of the embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
1. Simulation conditions are as follows:
the hardware platform is as follows: HP-Z840 workstation, TITAN-X-12GB-GPU,64GB RAM.
The software platform is as follows: python, Pytorch deep learning framework.
2. Simulation content and results:
the datasets of the present simulation experiment are the PASCAL VOC 2007 and 2012 datasets, and the MS COCO dataset, with example level tags removed from the dataset, and only image level tags used. The PASCAL VOC 2007 and 2012 data sets consist of 9962 and 22531 images in 20 categories, respectively, with 5011 training images in the 2007 data set and 11540 training images in the 2012 data set; the MS COCO dataset consists of 123278 images of 80 categories, selecting 82783 as training images and 40504 as test images. Mean Average Precision (mAP) (IOU >0.5) was used as an evaluation criterion.
After training, the results of the model testing are shown in table 1.
Table 1 Test results of the invention on each data set

    Data set            mAP (%)
    PASCAL VOC 2007     54.2
    PASCAL VOC 2012     47.5
    MS COCO             23.2
The invention achieves a target detection precision of 54.2% on the PASCAL VOC 2007 data set, 47.5% on the PASCAL VOC 2012 data set, and 23.2% on the MS COCO data set, reaching the advanced level of contemporaneous weakly supervised target detection networks.
In summary, the method and system for detecting the weakly supervised target of the present invention have the following characteristics:
1. only image-level supervision information is required
In the target detection task, annotating instance-level information (target bounding boxes) requires a great deal of manpower, material and financial resources. The labeling cost of image classes alone is significantly lower, and large numbers of pictures with class labels can be crawled from web search engines, social media, and the like. Large amounts of training data improve detection performance; cheap, easily obtained pictures labeled only with classes benefit the development and engineering of the target detection field.
2. Better quality proposal box
Most existing weakly supervised object detection algorithms use the traditional Selective Search (SS) algorithm to generate proposal boxes and then treat detection as a classification problem over those boxes. The defects of typical machine-learning classification thus become the defects of typical weakly supervised detection: the more salient objects, or parts of objects, in an image are easily detected, while small objects or complete objects are missed. The invention performs proposal-box generation with the Grad-CAM technique before the MIL detector, obtaining a series of high-quality proposal boxes (covering the target more completely), which aids subsequent detection and mitigates the tendency of traditional methods to fall into local optima;
3. low level supervisory information
When conventional methods iteratively screen proposal boxes, the only evaluation criterion is the box's classification score; this considers only high-level semantic information and ignores the importance of low-level information (edges, textures, etc.) in judging whether a target is a complete object. The invention adds low-level feature supervision to the iterative optimization, introduces the concept of objectness (similarity), screens out the proposal boxes containing the complete target, and generates pseudo labels to supervise subsequent iterations, remedying the defect that prior methods frame only a local part of the target and improving detection accuracy.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above contents are only for illustrating the technical idea of the present invention, and the protection scope of the present invention should not be limited thereby, and any modification made on the basis of the technical idea proposed by the present invention falls within the protection scope of the claims of the present invention.

Claims (10)

1. A weakly supervised target detection method, characterized by comprising the following steps:

S1, reading image data and image-level labels, the image-level labels being classification labels of only the object classes in the images, and dividing the image data into a training data set and a test data set;

S2, generating candidate boxes on the training data set read in step S1 using a selective search algorithm, and then generating high-quality object proposal boxes by the gradient-weighted class activation mapping method;

S3, inputting the training data set read in step S1 into a VGG16 convolutional neural network for feature extraction, and passing the extracted proposal-box features through an ROI Pooling layer to generate feature matrices of the same shape;

S4, performing multiple instance learning by matching the image-level labels read in step S1 one-to-one with the feature matrices obtained in step S3, and constructing an MIL detector;

S5, adding low-level supervision information to the object proposal boxes obtained in step S2, establishing instance classifiers for iterative optimization, and taking the highest-scoring proposal boxes as pseudo labels to train the bounding-box regression network;

S6, determining the loss functions of the MIL detector obtained in step S4 and of the bounding-box regression network obtained in step S5;

S7, adjusting the hyper-parameters of the bounding-box regression network of step S6 to obtain a weakly supervised detection model;

S8, training the weakly supervised detection model obtained in step S7 with the training data set obtained in step S1 to obtain a trained weakly supervised detection model;

S9, generating candidate boxes on the test data set obtained in step S1 using a selective search algorithm, and then classifying the candidate boxes and performing bounding-box regression with the weakly supervised detection model trained in step S8 to obtain the final target detection boxes.
2. The method of claim 1, wherein in step S2, a first-stage image-level classifier predicts the n highest-scoring potential object classes in each test image, and a second-stage proposal-box-level classifier then yields the final proposal boxes used for detection, the first-stage loss function being:

$$L_{cls} = -\frac{1}{C}\sum_{i=1}^{C}\left[y_i\log P_i + (1-y_i)\log(1-P_i)\right]$$

where C is the total number of image classes, y_i is the image's label for the i-th class, and P_i is the prediction of the i-th sigmoid classifier.
3. The weakly supervised target detection method according to claim 1, wherein step S4 specifically includes:

S401, obtaining a series of object proposal boxes for a given image and its corresponding label through step S2; then feeding the object proposal boxes into a classification data stream and a detection data stream, and obtaining two data matrices through two fully connected layers respectively; the two data matrices generating, through two softmax operators, a classification score and a detection score for each proposal box;

S402, multiplying the classification scores and detection scores obtained in step S401 element-wise to obtain the final proposal-box scores, and summing the scores of all proposal boxes over the proposal dimension R to obtain the image-level prediction score of each class.
4. The method according to claim 3, wherein in step S402, the loss function of the MIL detector is:
$$L_{base} = -\sum_{c=1}^{C}\left[y_c\log P_c + (1-y_c)\log(1-P_c)\right]$$

where P_c is the image-level prediction score of the c-th class and y_c is the image's label for that class.
5. The method for detecting the weakly supervised target according to claim 1, wherein the step S5 specifically includes:
s501, establishing K example classifiers on the basis of the MIL detector, and calculating similarity O of low-level features of the proposal box rbi(r) converting the analog to Obu(r) class score from previous classifier
Figure FDA0003566205310000022
Carrying out weighted addition, taking the proposal frames with high scores after the addition as proposal frames with complete object targets, taking the proposal frames with high scores at the first n times as pseudo-supervision information when the example classifier is iteratively trained next time, and iteratively training the K example classifiers for K times;
s502, connecting a boundary box regressor behind the K example classifiers, wherein the boundary box regressor aims to output a correction value for each frame and correct four parameters of x, y, w and h respectively.
6. The method of claim 5, wherein in step S501, the loss function L_ins^k for iteratively training the K instance classifiers is:

$$L_{ins}^k = -\frac{1}{R}\sum_{r=1}^{R} w_r^k\,\mathrm{CE}\!\left(p_r^k,\hat{y}_r^k\right)$$

where R is the number of proposal boxes generated in step S3, w_r^k is the loss weight of each proposal box, CE is the cross-entropy loss function, p_r^k is the (C+1)-class classification probability of the r-th proposal box, and ŷ_r^k is its classification label.
7. The weakly supervised target detection method of claim 5, wherein in step S503, the generation of pseudo-supervision information and positive sample boxes specifically includes:

S5031, based on the class probabilities p_r^{k-1} of the (k-1)-th branch, performing NMS on the proposal set R with a predefined threshold T_nms, the set of boxes kept after NMS being denoted R_keep;

S5032, for each class c, if y_c = 1, searching the set R_keep obtained in step S5031 for boxes whose class-c score exceeds T_conf and giving them label c; if no box qualifies, giving the highest-scoring box label c; all found boxes being denoted R_seek;

S5033, for each found proposal, finding its corresponding neighbors in R, denoted R_neighbor;

S5034, merging R_seek obtained in step S5032 with R_neighbor from step S5033 to obtain the positive sample boxes.
8. The weakly supervised target detection method of claim 1, wherein in step S6, the total loss function L is:

$$L = L_{base} + \lambda_1\sum_{k=1}^{K} L_{ins}^k + \lambda_2 L_{box}$$

where L_base is the loss function of the MIL detector, λ1 is the loss weight of the K instance classifiers, K is the number of instance classifiers, L_ins^k is the loss function of the k-th instance classifier, λ2 is the loss weight of the bounding-box regressor, and L_box is the loss function of the bounding-box regressor.
9. The method for detecting a weakly supervised target according to claim 1, wherein in step S7, the adjusting of the hyper-parameter specifically comprises:
the feature extraction stage in step S3 uses a VGG16 network; loss weight λ for an example classifier detector11, the number of example classifiers K is 3, and the loss weight λ of the bounding box regression network20.3; threshold T of NMSnms=0.3,,,Tconf=0.7,Tiou=0.5(ii) a During network training, the initial learning rate is 0.001, the learning rate attenuation is 0.0005, and the total iteration number is 200000.
10. A weakly supervised target detection system, comprising:

the reading module, used for reading image data and image-level labels, the image-level labels being classification labels of only the object classes in the images, and for dividing the image data into a training data set and a test data set;

the weighting module, used for generating candidate boxes on the training data set read by the reading module using a selective search algorithm, and then generating high-quality object proposal boxes by the gradient-weighted class activation mapping method;

the matrix module, used for inputting the training data set read by the reading module into a VGG16 convolutional neural network for feature extraction, and passing the extracted proposal-box features through the ROI Pooling layer to generate feature matrices of the same shape;

the learning module, used for performing multiple instance learning by matching the image-level labels read by the reading module one-to-one with the feature matrices obtained by the matrix module, and constructing an MIL detector;

the iteration module, used for adding low-level supervision information to the object proposal boxes obtained by the weighting module, establishing instance classifiers for iterative optimization, and taking the highest-scoring proposal boxes as pseudo labels to train the bounding-box regression network;

the function module, used for determining the loss functions of the MIL detector obtained by the learning module and of the bounding-box regression network obtained by the iteration module;

the adjusting module, used for adjusting the hyper-parameters of the bounding-box regression network in the function module to obtain a weakly supervised detection model;

the training module, used for training the weakly supervised detection model obtained by the adjusting module with the training data set obtained by the reading module to obtain a trained weakly supervised detection model;

and the detection module, used for generating candidate boxes on the test data set obtained by the reading module using a selective search algorithm, and then classifying the candidate boxes and performing bounding-box regression with the weakly supervised detection model trained by the training module to obtain the final target detection boxes.
CN202210302852.0A 2022-03-25 2022-03-25 Weak supervision target detection method and system Pending CN114648665A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210302852.0A CN114648665A (en) 2022-03-25 2022-03-25 Weak supervision target detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210302852.0A CN114648665A (en) 2022-03-25 2022-03-25 Weak supervision target detection method and system

Publications (1)

Publication Number Publication Date
CN114648665A true CN114648665A (en) 2022-06-21

Family

ID=81996067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210302852.0A Pending CN114648665A (en) 2022-03-25 2022-03-25 Weak supervision target detection method and system

Country Status (1)

Country Link
CN (1) CN114648665A (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115100501A (en) * 2022-06-22 2022-09-23 中国科学院大学 Accurate target detection method based on single-point supervision
CN115100501B (en) * 2022-06-22 2023-09-22 中国科学院大学 Accurate target detection method based on single-point supervision
CN114896307A (en) * 2022-06-30 2022-08-12 北京航空航天大学杭州创新研究院 Time series data enhancement method and device and electronic equipment
CN114896307B (en) * 2022-06-30 2022-09-27 北京航空航天大学杭州创新研究院 Time series data enhancement method and device and electronic equipment
CN114882325A (en) * 2022-07-12 2022-08-09 之江实验室 Semi-supervisor detection and training method and device based on two-stage object detector
CN115439688A (en) * 2022-09-01 2022-12-06 哈尔滨工业大学 Weak supervision object detection method based on surrounding area perception and association
CN115457388A (en) * 2022-09-06 2022-12-09 湖南经研电力设计有限公司 Power transmission and transformation remote sensing image ground feature identification method and system based on deep learning optimization
CN115457388B (en) * 2022-09-06 2023-07-28 湖南经研电力设计有限公司 Power transmission and transformation remote sensing image ground object identification method and system based on deep learning optimization
CN116310293A (en) * 2023-02-13 2023-06-23 中国矿业大学(北京) Method for detecting target of generating high-quality candidate frame based on weak supervised learning
CN116310293B (en) * 2023-02-13 2023-09-12 中国矿业大学(北京) Method for detecting target of generating high-quality candidate frame based on weak supervised learning
CN116612120A (en) * 2023-07-20 2023-08-18 山东高速工程检测有限公司 Two-stage road defect detection method for data unbalance
CN116612120B (en) * 2023-07-20 2023-10-10 山东高速工程检测有限公司 Two-stage road defect detection method for data unbalance

Similar Documents

Publication Publication Date Title
CN114648665A (en) Weak supervision target detection method and system
CN112396002B (en) SE-YOLOv 3-based lightweight remote sensing target detection method
CN112966684B (en) Cooperative learning character recognition method under attention mechanism
Xing et al. A convolutional neural network-based method for workpiece surface defect detection
Bevandić et al. Simultaneous semantic segmentation and outlier detection in presence of domain shift
CN110097568A (en) A kind of the video object detection and dividing method based on the double branching networks of space-time
Long et al. Object detection in aerial images using feature fusion deep networks
US11640714B2 (en) Video panoptic segmentation
CN112766170B (en) Self-adaptive segmentation detection method and device based on cluster unmanned aerial vehicle image
CN111460927A (en) Method for extracting structured information of house property certificate image
CN111368660A (en) Single-stage semi-supervised image human body target detection method
CN112613428B (en) Resnet-3D convolution cattle video target detection method based on balance loss
CN110008899B (en) Method for extracting and classifying candidate targets of visible light remote sensing image
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN113592825A (en) YOLO algorithm-based real-time coal gangue detection method
CN113139896A (en) Target detection system and method based on super-resolution reconstruction
Yadav et al. An improved deep learning-based optimal object detection system from images
CN114332921A (en) Pedestrian detection method based on improved clustering algorithm for Faster R-CNN network
CN113496480A (en) Method for detecting weld image defects
CN116091946A (en) Yolov 5-based unmanned aerial vehicle aerial image target detection method
CN113657414B (en) Object identification method
Shahriyar et al. An approach for multi label image classification using single label convolutional neural network
CN110929726B (en) Railway contact network support number plate identification method and system
Li A deep learning-based text detection and recognition approach for natural scenes
CN110287970B (en) Weak supervision object positioning method based on CAM and covering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination