CN115565005A - Weak supervision real-time target detection method based on progressive diversified domain migration - Google Patents


Info

Publication number
CN115565005A
Authority
CN
China
Prior art keywords
domain
target
image
detection
source
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211235864.2A
Other languages
Chinese (zh)
Inventor
李成严
郑企森
王昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin University of Science and Technology
Original Assignee
Harbin University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin University of Science and Technology
Priority to CN202211235864.2A
Publication of CN115565005A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a weakly supervised real-time target detection method based on progressive diversified domain migration, aimed at solving cross-domain target detection under weak supervision. The method provides a two-stage progressive domain migration framework and a real-time target detector matched to it, using an intermediate domain to bridge the domain gap and decompose the difficult task into two easier subtasks with smaller gaps. First, the distribution of the annotated data is diversified by a domain shifter, converting source-domain images into diversified intermediate-domain images; the intermediate domain is then used as a supervised source domain and, combined with the image-level labels of the target domain, pseudo-annotated image samples are generated. The real-time target detector stacks 50 layers of residual network units plus 3 additional convolution modules as the backbone, deeply fuses features with a feature pyramid network through top-down and bottom-up paths and lateral connections, and uses the generalized intersection-over-union as the localization loss. The detection model is adjusted step by step with the artificially generated intermediate-domain and pseudo-labeled image samples, mitigating the image translation bias and the source-biased discriminability of feature-level adaptation that arise during domain migration, thereby improving detection accuracy while maintaining real-time performance under weak supervision.

Description

Weak supervision real-time target detection method based on progressive diversified domain migration
Technical Field
Target detection is an important branch of computer vision: by learning visual parameters, a machine classifies the objects in real images and finds their bounding boxes, with wide applications in intelligent surveillance, autonomous driving, image retrieval, and so on. With the rapid development of convolutional neural networks in recent years, deep-learning-based target detection algorithms have advanced greatly. However, existing target detection algorithms usually need large amounts of training data labeled with object position information, and obtaining such data is costly and labor-intensive. Weakly supervised target detection focuses on detecting and localizing targets under the supervision of image-level labels only; it relaxes the training requirement for dense annotation (such as pixel-level segmentation masks or bounding boxes), greatly reduces the difficulty of data acquisition, saves manual labeling cost, and is of strong practical significance. The invention provides a weakly supervised real-time target detection method based on progressive diversified domain migration, in which the source-domain dataset has class labels and bounding-box labels, the target-domain dataset has only image-level labels, and the classes to be detected in the target domain are all or a subset of the source-domain classes. The aim is to convert the weakly supervised target detection problem into a domain adaptation problem by using the sufficient and easily obtained full annotations of the source domain together with the image-level labels of the target domain, so that the method is more robust and achieves better detection performance in practical application scenarios.
Background
The supervised learning based target detection approach has conditional constraints that assume that the test data and the training data have the same distribution. However, domain shifts often occur in many practical applications, for example, changes in object appearance, viewpoint, background, lighting, and weather conditions all degrade the performance of the network. One possible solution is to collect annotation data for the target domain, but this is often expensive and time consuming. In order to solve the problems of scarce labeled data of a target domain, domain deviation and the like, target detection can be realized in a cross-domain self-adaptive mode, and the traditional weak supervision or unsupervised depth domain self-adaptive method is mainly based on feature level adaptation and pixel level adaptation.
Pixel-level adaptive methods focus on image translation from the source domain to the visual appearance style of the target domain, i.e., learning the annotation information of the source domain and the image style of the target domain and transferring the source-domain annotations to the generated images. Most existing pixel-level adaptive methods rest on the assumption that the image translator can perfectly translate one domain into the opposite domain, so that a translated image can be regarded as an image from that domain. However, in many adaptation situations these methods exhibit image translation bias: since the performance of the image translator depends largely on the appearance gap between the source and target domains, treating images that carry translation bias as if they came from the target domain can introduce new domain-difference problems.
Feature-level adaptive methods align the distributions of the source and target domains in a cross-domain feature space, where a model trained with supervision on the labeled source-domain dataset is expected to infer effectively on the target domain. However, the model's feature extractor is forced to extract features in a way that discriminates source-domain data, which does not transfer to the target domain. Furthermore, since target detection data interleave the instances of interest with relatively unimportant background, it is even harder for a source-biased feature extractor to extract discriminative features for target-domain instances. Thus, feature-level adapted target detectors risk source-biased discriminability and may misidentify objects in the target domain.
Meanwhile, the cross-domain adaptive methods mentioned above involve a precondition: that the target detector itself is robust and has superior detection performance. However, under weak supervision, existing methods cannot reach real-time detection speed, and their detection accuracy also needs improvement.
Disclosure of Invention
In order to solve the problem of cross-domain target detection under the weak supervision condition, the invention discloses a weak supervision real-time target detection method based on progressive diversified domain migration, which can improve the detection accuracy and meet the real-time requirement. Therefore, the invention provides the following technical scheme.
A two-stage progressive domain migration framework and a real-time target detector matched to the framework are provided. In the first stage of the domain migration framework, the characteristics of the source and target domains are combined: exploiting the imperfection of the image translation network, constraints are added to change the learning tendency of the image translation generator and discriminator, and a domain shifter capable of generating diversified data samples is designed, so as to retain the source domain's semantic information at different levels and the target domain's image style, generating an intermediate domain with diversified source-domain characteristics. In the second stage, the intermediate domain is used as a supervised source domain, and image samples with pseudo labels are generated in combination with the image-level annotations of the target domain. The real-time target detector stacks 50 layers of residual network units plus 3 additional convolution modules as the backbone; a feature pyramid network deeply fuses low-resolution, semantically strong features with high-resolution, semantically weak features through top-down and bottom-up paths and lateral connections; and the generalized intersection-over-union is used as the localization loss. The detection model is adjusted in turn with the artificially generated intermediate-domain and pseudo-labeled samples, mitigating the source-biased discriminability of feature-level domain adaptation and the image translation bias of pixel-level domain adaptation, and realizing real-time weakly supervised target detection with both speed and accuracy. The specific contents are as follows.
Design of the domain shifter: domain shifting is realized with a variant of an image translation network applied to a given source domain. To ensure structural universality, the residual generator and discriminator of the cycle-consistent generative adversarial network (CycleGAN) are selected to construct the domain shifter. To clearly distinguish the domain shift effects, the invention considers two factors in the objective function, namely color preservation and reconstruction.
Generating an intermediate domain: aligning feature distributions between two domains that differ significantly is challenging, so an intermediate feature space is introduced to simplify the adaptation task. In other words, the method does not directly bridge the gap between the source and target domains, but gradually adapts to the target domain through the connecting intermediate domain. The intermediate domain is constructed from source-domain images, synthesizing the target distribution at the pixel level.
Domain diversification: a method of diversifying the source domain by intentionally generating distinct domain differences with the domain shifter. Each factor configuration of the domain shifter produces visually distinct pictures that enrich the intermediate-domain samples.
Generating pseudo-annotated images: in the target domain, if detection is performed with a detector trained only on the source domain, the main cause of detection failure is that target classes are confused with other classes or the background, rather than inaccurate localization. Adjusting the detector on pseudo-annotated images improves the object context distribution and significantly reduces this confusion. The generated intermediate domain is taken as a supervised source domain, and pseudo-annotated images are generated in combination with the image-level labels of the target domain.
Constructing the target detector: 50 layers of residual network units are stacked, with 3 additional convolution modules, to form the backbone network; a feature pyramid network deeply fuses low-resolution, semantically strong features with high-resolution, semantically weak features through top-down and bottom-up paths and lateral connections, optimizing multi-scale feature detection; meanwhile, the generalized intersection-over-union is used as the localization loss, solving the infeasibility of optimization when bounding boxes do not overlap.
A progressive two-stage domain migration framework: the difficult task is decomposed into two easier subtasks with smaller gaps, using the intermediate domain to bridge the domain gap. First, the distribution of the annotated data is diversified by the domain shifter, converting source-domain images into diversified intermediate-domain images; the intermediate domain is then used as a supervised source domain, and pseudo-annotated images are generated in combination with the image-level labels of the target domain. The detection model is adjusted step by step with the two kinds of artificially generated image samples (intermediate-domain and pseudo-labeled) to mitigate the image translation bias and the source-biased discriminability of feature-level adaptation that arise during domain migration.
The weakly supervised real-time target detection method based on progressive diversified domain migration specifically comprises the following steps:
step 1: load the source-domain dataset into the target detector for pre-training to obtain a pre-trained model;
step 2: load the source-domain and target-domain datasets into the diversified domain shifter to generate intermediate-domain images carrying the source-domain annotation information and different target-domain image styles;
step 3: load the intermediate-domain data into the pre-trained model and perform the first adjustment of the model to obtain an intermediate-domain training model;
step 4: load the target-domain images into the intermediate-domain training model for detection, output class, confidence score, position, and other information, compare it with the weak-label (image-level label) information of the target-domain images, add pseudo labels (position information) for the classes detected in each picture, and generate pseudo-annotated images;
step 5: load the pseudo-annotated images into the intermediate-domain training model and perform the second adjustment of the model to obtain the final training model;
step 6: test the model.
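The six steps above can be sketched as an ordering skeleton. All stage functions below are hypothetical placeholders standing in for the real training and translation routines; the point is the fixed execution order, which (as noted later in the description) matters because pseudo-labeling quality depends on the detector already adjusted on the intermediate domain.

```python
# Sketch of the six-step progressive pipeline. Every stage is a stub that
# only records which model/artifact it would produce; the real routines
# (pre-training, CycleGAN-style translation, fine-tuning) are assumed.

def run_pipeline(log):
    def pretrain_on_source():            # step 1: supervised pre-training
        log.append("pre-model")

    def generate_intermediate_domain():  # step 2: diversified domain shifter
        log.append("intermediate-domain")

    def finetune_on_intermediate():      # step 3: first adjustment
        log.append("dd-model")

    def generate_pseudo_labels():        # step 4: pseudo-annotation in target domain
        log.append("pseudo-labels")

    def finetune_on_pseudo():            # step 5: second adjustment
        log.append("final-model")

    def evaluate():                      # step 6: test the final model
        log.append("test")

    for stage in (pretrain_on_source, generate_intermediate_domain,
                  finetune_on_intermediate, generate_pseudo_labels,
                  finetune_on_pseudo, evaluate):
        stage()
    return log
```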
Compared with the prior art, the invention has the following beneficial effects.
1. The problem of image translation bias adaptive to the pixel level and the problem of source bias discriminability adaptive to the feature level are effectively relieved.
2. The accuracy of cross-domain weak supervision target detection is improved, and the problem that real-time detection performance cannot be achieved under the weak supervision condition is solved.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 is a diagram of the results of a multivariate domain shifter image translation with various constraint factor configurations.
FIG. 3 is a block diagram of an object detector of the present invention.
Fig. 4 is a diagram of an SSD network structure.
Fig. 5 is a diagram of a feature pyramid network structure.
Fig. 6 is a generalized intersection ratio calculation chart.
Fig. 7 is a diagram of a residual network architecture.
FIG. 8 is a diagram of detection error analysis.
Detailed Description
The technical scheme of the invention is further explained by combining the accompanying drawings as follows:
FIG. 1 is a flow chart of the method of the present invention, and each step is described in detail according to the contents shown in the flow chart.
The invention provides a progressive two-stage domain migration framework to gradually adjust a supervised target detection network pre-trained on a source domain so as to achieve the aim of weak supervision real-time detection, wherein two methods for generating image samples are adopted: converting the source domain image with the complete annotation into an intermediate domain image through a domain shifter, aligning the characteristic distribution on a pixel level according to appearance difference, and generating a diversified image with complete annotation information; and taking the generated intermediate domain as a supervised source domain, and combining an image-level label in a target domain to generate an image with a pseudo label. The two methods produce image samples with different properties: although the samples generated by the domain diversification are not high quality images in terms of their similarity to the target domain image, the bounding box is correctly labeled. In contrast, although the pseudo-annotation generates samples without an accurate bounding box, the image quality is guaranteed since they are entirely target domain images.
The method of the present invention uses these two generated images to gradually adjust the object detection model as shown in fig. 1. Firstly, pre-training (Pre-train) a model in a Source Domain (Source Domain) by using a complete label to obtain a Pre-trained model (Pre-model); and then, adjusting (Re-train) the pre-training model by using an Intermediate Domain (Intermediate Domain) image obtained by Domain Diversification (DD), fusing image features (differential Feature) which are not possessed by the pre-training model to obtain an Intermediate Domain model (DD-model), and finally, training and adjusting the Intermediate Domain model by using a Target Domain Pseudo Label (Pseudo Label, PL) image to obtain a Final Target detection model (Final-model). It is worth noting that the order of execution of the two adjustment steps is crucial, since the performance of the pseudo-labeling is highly dependent on the target detector.
Describing the dataset in step 1: the source-domain dataset has category labels and bounding-box labels, while the target-domain dataset has only image-level labels, and the classes to be detected in the target domain are all or a subset of the source-domain classes. The object of the invention is to detect objects in the target domain as accurately as possible and to achieve real-time detection performance by using the sufficient and readily available full annotations of the source domain and the image-level labels of the target domain. In the experiments of the invention, the real-world PASCAL VOC 2007+2012 dataset is selected as the source domain, and the artistic-media datasets Clipart1k, Watercolor2k and Comic2k are selected as target domains, to verify domain-migration target detection from the real world to artistic media.
To illustrate the implementation of the diversified domain shifter in step 2: among the many possibilities for generating the intermediate domain, inspired by pixel-level adaptation constraints, the domain shifter can in fact be implemented by exploiting the imperfection of image translation. Denote a source-domain sample by x_s and a target-domain sample by x_t, with domain distributions P_s(x) and P_t(x), respectively. In general, image translation methods aim to train a generator G whose translated image T(x_s) is optimized to resemble a sample from the target domain. However, because the generator network has enough capacity to realize many different translations, the adversarial loss alone does not guarantee that a given x_s is converted into the desired target image. To compensate for this instability, image translation methods add constraints to the objective function L_img to reduce the probability of bad generators:

L_img(G, D, M) = L_GAN(G, D) + α L_res(G, M)    (1)

L_GAN(G, D) = E_{x_t~P_t}[log D(x_t)] + E_{x_s~P_s}[log(1 - D(T(x_s)))]    (2)

where D is the discriminator for adversarial learning, L_res(G, M) is the constraint loss with a possible additional module M (implying a supplementary network required by complex constraints), and α is the weight balancing the two losses.

Changing the learning tendency by substituting constraints can lead the generator G to diversify the appearance of the translated images. On this basis, different constraints can be applied to implement different domain shifters. The objective function L_DD of the domain shifter can be written as:

L_DD(G, D, M) = L_GAN(G, D) + β L_res(G, D, M)    (3)

where L_DD(G, D, M) is the loss with the constraint that makes the domain shifter distinctive, M denotes a possible additional module for the constraint loss, and β is the weight balancing the two losses.
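As a minimal numerical sketch of the adversarial objective in Eqs. (2)–(3), assuming discriminator scores in (0, 1); the function names, the `eps` stabilizer, and the example `beta` value are illustrative, not the patent's code:

```python
import numpy as np

def gan_loss(d_real, d_fake, eps=1e-12):
    """Adversarial term of Eq. (2): E[log D(x_t)] + E[log(1 - D(T(x_s)))].
    d_real: discriminator scores on target-domain images,
    d_fake: discriminator scores on translated source images."""
    d_real = np.asarray(d_real, dtype=float)
    d_fake = np.asarray(d_fake, dtype=float)
    return np.mean(np.log(d_real + eps)) + np.mean(np.log(1.0 - d_fake + eps))

def domain_shifter_objective(d_real, d_fake, constraint_loss, beta=10.0):
    """Eq. (3): L_DD = L_GAN + beta * L_res; `beta` is a hypothetical weight."""
    return gan_loss(d_real, d_fake) + beta * constraint_loss
```

A perfect discriminator (scores 1 on real, 0 on fake) drives the adversarial term to its maximum of 0, so L_DD reduces to the weighted constraint loss.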
To verify the validity of domain diversification, the invention generates 3 different domain shifters for each domain adaptation task. To ensure structural universality, the residual generator and discriminator of the cycle-consistent generative adversarial network (CycleGAN) are selected to construct the domain shifters. To clearly distinguish the domain shift effects, two factors are considered in the objective function, namely color preservation (PC) and reconstruction (RE). Fig. 2 shows the visual differences caused by each factor configuration, compared with images generated by CycleGAN; the source domain is PASCAL VOC, and the target domains are, from top to bottom, Clipart1k, Watercolor2k and Comic2k.
Color preservation (PC) based domain shifter: to constrain the domain shifter to preserve color, an L1 loss between the input image and the translated image is used. However, when the given constraint is not sufficiently effective, training instability increases, so the constraint is assigned only to the target domain for diversified transitions. The constraint loss of this domain shifter can be written as:

L_res,1(G) = E_{x~P_t}[||G(x) - x||_1]    (4)
reconstruction (RE) based domain shifter: for reconstruction, a pair of domain shifters is also needed on the original basisG’Sum discriminatorD’To perform reverse translation. At the same time, additional generative antagonistic losses are required for trainingG’. Thus, the constraint penalty of the domain shifter can be written as:
L res,2 (G, G’, D’)=Ε x~Ps(xs) [ logD’(x s )+Ε xt~Pt(xt) [log(1-D’(G’(x)))]
xs~Ps(xs) [||(G’(G(x s ))-x s )|| 1 ] +Ε xt~Pt(xt) [||(G’(G(x t ))-x t )|| 1 ] (5)
Reconstruction plus color preservation (PC + RE) based domain shifter: to consider both factors simultaneously, the sum of the two constraint penalty terms is applied, together with the additional modules G' and D':

L_res,3(G, G', D') = L_res,1(G) + L_res,2(G, G', D')    (6)
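The constraint penalties of Eqs. (4)–(6) can be sketched numerically as follows. The mean absolute difference stands in for the per-pixel L1 loss, the adversarial term for G'/D' in Eq. (5) is omitted, and all function names are illustrative assumptions:

```python
import numpy as np

def l1(a, b):
    """Mean absolute difference, standing in for the per-pixel L1 loss."""
    return np.mean(np.abs(np.asarray(a, float) - np.asarray(b, float)))

def color_preservation_loss(x_t, g_x_t):
    """Eq. (4): L1 distance between a target image x_t and its translation
    G(x_t), constraining the shifter to keep colors."""
    return l1(g_x_t, x_t)

def reconstruction_loss(x_s, rec_x_s, x_t, rec_x_t):
    """Cycle-reconstruction part of Eq. (5):
    ||G'(G(x_s)) - x_s||_1 + ||G'(G(x_t)) - x_t||_1
    (the adversarial term for G'/D' is omitted in this sketch)."""
    return l1(rec_x_s, x_s) + l1(rec_x_t, x_t)

def pc_re_loss(x_t, g_x_t, x_s, rec_x_s, rec_x_t):
    """Eq. (6): sum of the two constraint penalties (PC + RE)."""
    return (color_preservation_loss(x_t, g_x_t)
            + reconstruction_loss(x_s, rec_x_s, x_t, rec_x_t))
```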
explaining the effect of the intermediate domain training model in the step 3, comparing the proposed domain diversification method with the existing domain migration image style conversion method, namely a cross-domain target detection method (DAF) based on fast R-CNN and a Domain Transfer (DT) stage in a weak supervision cross-domain target detection method (CDWSDA), wherein a Baseline method is a pre-training model without loading, and only a completely labeled target domain data set is used for training and testing, the method of the invention improves 1.3% -17.5 mAP compared with the existing method, so that the domain diversification method has certain advantages in the aspect of detection accuracy compared with other methods.
To illustrate the generation of pseudo-annotated images in step 4: in the target domain, if detection is performed with a detector trained only on the source domain, the main cause of detection failure is that target classes are confused with other classes or the background, rather than inaccurate localization. Adjusting the detector on pseudo-annotated images improves the object context distribution and significantly reduces this confusion. Pseudo-annotation is simple to use and applicable to any target detection network, since it does not access the intermediate layers of the network.
Formally, the purpose of pseudo-annotation is, for each image x_t from the target domain X, to obtain pseudo bounding boxes of the target objects and combine them with the original image-level label of x_t to form a complete pseudo label G. Let x_t ∈ R^{H×W×3} denote an RGB image, where H and W are the height and width of the image; C denotes the set of object classes; z denotes the image-level annotation, i.e., the set of classes present in x_t. G contains elements g = (b, c), where b ∈ R^4 is a bounding box and c ∈ C.
First, the output D of the target detection network is obtained, containing detections d = (p, b, c), where c ∈ C and p ∈ R is the probability that bounding box b belongs to class c. Second, for each class c ∈ z, the top-1 confidence detection d = (p, b, c) ∈ D is taken and (b, c) is added to G. Finally, x_t and G are used to adjust the target detector. Notably, no layer of the target detection network is replaced, so the detection capability of the original network is preserved. In the method of the invention, the target detector is first adjusted with the images obtained by domain diversification, and then with the images obtained by pseudo-annotation.
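The top-1 selection rule above can be sketched as a small helper. This is an illustrative assumption of how detections might be represented, with each detection given as a (p, b, c) tuple; it is not the patent's code:

```python
def pseudo_labels(detections, image_level_classes):
    """For each class c in the image-level label z, keep the single
    highest-confidence detection (p, b, c) and emit (b, c) as a pseudo box.
    `detections` is a list of (confidence, box, class) tuples."""
    labels = []
    for c in image_level_classes:
        # top-1 confidence detection for this class, if any was detected
        best = max((d for d in detections if d[2] == c),
                   key=lambda d: d[0], default=None)
        if best is not None:
            labels.append((best[1], best[2]))
    return labels
```

Classes listed in z but never detected simply contribute no pseudo box, matching the observation that the fine-tuning target is built only from detected classes.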
Fig. 4 shows the structure of the SSD network. The Single Shot MultiBox Detector (SSD) is a convolutional-neural-network-based detection model with relatively high accuracy and real-time detection performance. The SSD model can be divided into two parts: the first is the feature extraction network, in which features of the input image are extracted by the first five convolutional stages of VGG16; the other is the multi-scale feature detection network, in which the pooled feature maps shrink layer by layer, several prior boxes are generated on each feature map, the position and category of the target object are then predicted with convolution kernels, and the final result is obtained with non-maximum suppression (NMS), realizing detection on multi-scale feature maps.
Fig. 5 shows the structure of the feature pyramid network (FPN). Although existing target detection algorithms perform well in natural scenes, their effect on small targets is not ideal, especially for single-stage methods. Multi-scale feature fusion is one of the effective strategies for this problem, so Lin et al. proposed the feature pyramid network for target detection. FPN combines semantically weak low-level features with semantically strong high-level features to improve the detection accuracy of small objects. As shown in Fig. 5, as the network propagates to deeper layers, the feature maps are down-sampled layer by layer, yielding deep feature maps rich in semantic information. To address the insufficient semantics of shallow feature maps in small-target detection, FPN up-samples the higher-level feature map by a factor of two, matches channels with a 1×1 convolution, and adds in the lower-level feature map, obtaining richer semantic information at high resolution. By strengthening the connections between feature layers and fusing their information, the feature pyramid network successfully improved the Faster R-CNN algorithm and clearly raised detection accuracy.
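The top-down pathway with lateral connections can be sketched in NumPy. This is a shape-level illustration only: plain matrices stand in for the 1×1 lateral convolutions, nearest-neighbour repetition stands in for 2× up-sampling, and the smoothing convolutions of the real FPN are omitted:

```python
import numpy as np

def upsample2x(f):
    """Nearest-neighbour 2x up-sampling of a (C, H, W) feature map."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def top_down_fpn(features, lateral_weights):
    """Minimal top-down pathway with lateral connections.
    `features` go from shallow (high resolution) to deep (low resolution);
    lateral_weights[i] is a (C_out, C_in_i) matrix playing the role of the
    1x1 lateral convolution that matches channel counts."""
    # project each backbone level to a common channel width
    laterals = [np.einsum('oc,chw->ohw', w, f)
                for w, f in zip(lateral_weights, features)]
    outputs = [laterals[-1]]              # deepest level passes through
    for lat in reversed(laterals[:-1]):   # walk back toward shallow levels
        outputs.append(lat + upsample2x(outputs[-1]))
    return outputs[::-1]                  # shallow-to-deep order
```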
Fig. 6 shows the calculation of the generalized intersection-over-union (GIoU). The intersection-over-union (IoU) is the most popular evaluation metric in target detection benchmarks. However, when optimizing the common distance losses, there is a gap between the regressed bounding-box parameters and maximizing this metric. In general, the best objective function for a metric is the metric itself; for axis-aligned two-dimensional bounding boxes the IoU can be used directly as a regression loss, but its optimization is infeasible when the boxes do not overlap. In contrast to IoU, GIoU not only focuses on the overlapping region but also penalizes the empty area of the smallest enclosing box when the predicted and ground-truth boxes are poorly aligned. Thus the value of GIoU better reflects how the two boxes overlap, including the non-overlapping case. GIoU improves on IoU and can be expressed as:

GIoU = IoU - |C \ (G ∪ P)| / |C|    (7)

where G and P are rectangular boxes of arbitrary size and C is the smallest enclosing box of G and P, as shown in Fig. 6.
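A direct implementation of Eq. (7) for axis-aligned boxes (a standalone sketch in (x1, y1, x2, y2) convention, not taken from the patent's code):

```python
def giou(box_a, box_b):
    r"""Generalized IoU of two axis-aligned boxes (x1, y1, x2, y2), Eq. (7):
    GIoU = IoU - |C \ (A ∪ B)| / |C|, where C is the smallest enclosing box.
    Ranges over (-1, 1]; equals IoU when the enclosing box is fully covered."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # smallest enclosing box C
    c_area = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return iou - (c_area - union) / c_area
```

Unlike plain IoU, the value stays informative for disjoint boxes: pushing GIoU up pulls non-overlapping boxes together, which is exactly the non-overlap case the text flags as infeasible for an IoU loss.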
Fig. 7 shows the structure of the ResNet network. ResNet is built from residual modules whose "shortcut connections" link input and output channels. A residual unit can be understood as a sub-network; stacking such units forms a deep network, which not only allows deep and complex structures with stronger feature expression, but also alleviates the overfitting and degradation problems that easily occur in deep networks. Assume the expected underlying mapping is H(x); the residual mapping is then F(x) = H(x) - x, and optimizing the residual toward 0 is easier than optimizing the original underlying mapping. Greater network depth mitigates the constraint on feature extraction caused by unstable gradients such as vanishing or exploding gradients. If the network were simply deepened on the basis of the VGG-16 backbone in SSD, the large number of parameters would occupy storage and slow detection, network degradation would occur, and high-level semantic information could not be fully used for better small-target detection. To guarantee enough depth to extract rich semantic information representing complex features, the method uses ResNet as the backbone, because its residual shortcut connections transfer gradient information better, preventing vanishing and exploding gradients. Compared with plain networks, ResNet can therefore reach thousands of layers; considering time and limited computing resources, a stack of 50 residual units, i.e., ResNet-50, is finally selected as the backbone network.
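The residual formulation F(x) = H(x) - x can be sketched as a tiny block. Fully connected matrices stand in for the unit's weight layers (the real ResNet uses convolutions plus batch normalization); this is an illustration of the shortcut, not the patent's network:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Basic residual unit: the weighted path learns the residual mapping
    F(x) = H(x) - x, and the shortcut adds the input back, so the output
    is relu(F(x) + x). w1, w2 stand in for the unit's two weight layers."""
    f = relu(x @ w1) @ w2        # residual mapping F(x)
    return relu(f + x)           # shortcut connection adds the identity
```

When the weights collapse to zero, F(x) = 0 and the block reduces to (the ReLU of) the identity, which is why "optimizing the residual to 0" is the easy case the text describes.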
For the target detector in step 5, as shown in fig. 3, the input to the network model is a 300 pixel × 300 pixel RGB image; the backbone is based on ResNet-50, with 3 additional convolution modules stacked on top to form a ResNet-SSD deep network. During forward propagation the backbone outputs feature maps at multiple resolutions; fusing the deep and shallow feature layers strengthens small-target detection and enriches the semantic information of the shallow layers. As the feature-fusion method, a Feature Pyramid Network (FPN) combines low-resolution, semantically strong features with high-resolution, semantically weak features through a top-down path, a bottom-up path, and lateral connections, constructing a deeper feature pyramid whose levels are output at different resolutions, so that small targets are detected more effectively. Meanwhile, the GIoU loss replaces the localization loss of the original SSD, overcoming the infeasibility of optimization when bounding boxes do not overlap and improving detection precision. Experiments show that, compared with existing target detection networks, the detector achieves higher precision and better results while reaching real-time detection performance.
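A toy illustration of the FPN top-down fusion described above, using 1-D lists as stand-in feature maps (a real FPN operates on convolutional feature maps and wraps the addition in 1×1 lateral and 3×3 output convolutions; this sketch keeps only the upsample-and-add skeleton):

```python
def upsample2x(feat):
    """Nearest-neighbour 2x upsampling of a 1-D 'feature map'."""
    return [v for v in feat for _ in range(2)]

def fpn_fuse(features):
    """Top-down fusion over 1-D feature maps ordered shallow -> deep.

    Each deeper (lower-resolution, semantically stronger) map is
    upsampled and added to the lateral (higher-resolution) map below
    it; every fused level is kept as a pyramid output.
    """
    fused = [features[-1]]  # the deepest level passes through unchanged
    top = features[-1]
    for lateral in reversed(features[:-1]):
        top = [l + u for l, u in zip(lateral, upsample2x(top))]
        fused.insert(0, top)
    return fused
```

The shallow outputs thus carry both their original high resolution and the semantics propagated down from the deep levels, which is what improves small-target detection.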
In the model test experiment of step 6, the target detector proposed by the invention is pre-trained on VOC2007-trainval and VOC2012-trainval and then progressively adjusted with the images produced by the PC+RE intermediate domain of domain diversification and by pseudo-labeling. When training the multi-domain shifter that generates the intermediate domain, the initial learning rate for the first 10 rounds (epochs) is 1.0×10⁻⁵; after 10 epochs of network training the learning rate decays to 0, and the remaining hyper-parameters are set according to the original CycleGAN paper. When adjusting the target detection network, the momentum parameter is 0.9 and the network is trained for 10000 iterations: the learning rate is 1.0×10⁻³ for the first 7000 iterations and 1.0×10⁻⁵ for the last 3000 iterations. All experiments of the invention are completed in PyTorch on a single NVIDIA GeForce RTX GPU; all input images are resized to 300 pixels × 300 pixels during training, and average precision in the test stage is evaluated at a GIoU threshold of 0.5.
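The two schedules above can be sketched as plain functions (the shape of the shifter's decay after epoch 10 is not stated in the text; linear decay to 0 is assumed here, following the CycleGAN convention, and the function names are illustrative):

```python
def detector_lr(iteration):
    """Step schedule for adjusting the detector: 1e-3 for the first
    7000 iterations, then 1e-5 for the remaining 3000."""
    return 1.0e-3 if iteration < 7000 else 1.0e-5

def shifter_lr(epoch, total_epochs=20, initial=1.0e-5):
    """Domain-shifter schedule: constant 1e-5 for the first 10 epochs,
    then decayed to 0 (decay shape assumed linear)."""
    if epoch < 10:
        return initial
    return initial * (total_epochs - epoch) / (total_epochs - 10)
```

With these defaults the shifter's rate reaches exactly 0 at the final epoch, matching "after 10 epochs of network training, the learning rate decays to 0".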
For the ablation experiments, suppose only unlabeled images of the target domain are available; in that case no pseudo label can be generated from the original image-level labels of the target domain. Domain Diversification (DD) applies to this case without any modification. Because image-level annotations cannot be accessed, Pseudo-Labeling (PL) does not apply; to verify the validity of pseudo-labeling, the baseline method can be applied to pseudo-labeling directly, skipping the domain-diversification adjustment of the model. To verify the necessity of image-level labels, a variant is defined in which only the single detection d_top with the highest probability p among all detections may be pseudo-labeled, denoted PL_label. The baseline method (Baseline) trains and tests only on the fully labeled target-domain data set without loading a pre-trained model. Quantitative analysis of the contributions of the components shows that domain diversification (with the PC+RE constraints) improves on the baseline by 2.2%-11.8%, demonstrating the advantage of pixel-level domain migration; pseudo-labeling improves on the baseline by 5.5%-10.9%, while adjusting the detection network with PL_label pseudo labels harms performance, proving that image-level annotation in the target domain is essential for pseudo-labeling.
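The pseudo-labeling rule discussed here, keeping for each weakly labeled class its highest-confidence detection, can be sketched as follows (the data layout is an assumption; the PL_label ablation would instead keep the single top-scoring detection overall, ignoring the image-level labels):

```python
def pseudo_label(detections, image_level_labels):
    """Promote, for each class named in the weak (image-level) labels,
    the highest-confidence detection of that class to a pseudo box.

    `detections` is a list of (class_name, score, box) tuples for one
    image; returns {class_name: box}.
    """
    best = {}
    for cls, score, box in detections:
        # Only classes confirmed by the image-level annotation qualify
        if cls in image_level_labels:
            if cls not in best or score > best[cls][0]:
                best[cls] = (score, box)
    return {cls: box for cls, (score, box) in best.items()}
```

Filtering by the image-level labels is what PL_label discards, which is why that variant damages performance: without the weak labels, confident false positives are promoted to training targets.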
Fig. 8 is a diagram illustrating the detection-error analysis, which demonstrates the positive effect of the method of the invention on domain migration. Taking PASCAL VOC → Clipart1k as an example: since the Clipart1k test set has only 500 images, the 1000 highest-confidence detections of each method are analyzed. Considering the positions, classes, and GIoU values of the predicted and ground-truth bounding boxes, detections are divided into three groups: Correct detection, Localization error, and Background error. Correct detection means GIoU > 0.5 with the correct class; Localization error means 0.1 < GIoU < 0.5 with the correct class; Background error means GIoU < 0.1, whether the class is correct or not. The error analysis shows that Domain Diversification (DD) with the PC+RE constraint, with or without Pseudo-Labeling (PL), reduces background errors relative to the baseline method, while DD with PL also significantly increases the number of correct detections.
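The grouping criteria can be written out directly (a sketch: the text leaves detections exactly at GIoU = 0.5 or 0.1 unassigned, so this version folds each boundary into the lower bucket as an assumption):

```python
def classify_detection(giou_value, class_correct):
    """Bucket one detection per the error-analysis criteria above:
    Correct if GIoU > 0.5 with the right class; Localization error if
    0.1 < GIoU <= 0.5 with the right class; Background error otherwise
    (GIoU <= 0.1, or wrong class at any GIoU)."""
    if class_correct and giou_value > 0.5:
        return "correct"
    if class_correct and 0.1 < giou_value <= 0.5:
        return "localization"
    return "background"
```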
Compared with existing cross-domain target detection methods, the method of the invention achieves state-of-the-art results on the Clipart1k, Watercolor2k, and Comic2k data sets, improving mAP by 0.4%-4.7% over existing methods. Under weak supervision, current methods cannot meet the requirement of real-time detection. The method uses diversified domain migration and pseudo-labeling in place of a single-stage weakly supervised detection scheme to achieve real-time target detection: the detection speed on the PASCAL VOC2007 test set is 38 FPS, and on Clipart1k, Watercolor2k, and Comic2k it reaches 32 FPS, 47 FPS, and 45 FPS respectively. In general, 30 FPS is sufficient for real-time detection in practice, so the method not only improves accuracy but also reaches the goal of real-time detection under weak supervision.
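The overall step-1-to-step-6 procedure of the method (pre-train, diversify, first adjustment, pseudo-label, second adjustment, test) can be sketched as one pipeline under assumed component interfaces (the `pretrain`, `generate`, `finetune`, `detect`, and `evaluate` methods are hypothetical names, not identifiers from the application):

```python
def progressive_training_pipeline(source, target, detector, domain_shifter):
    """Two-stage progressive domain-migration pipeline.

    All objects are placeholders for the components described in the
    application; this sketch only fixes the order of operations.
    """
    # Step 1: pre-train on the fully labelled source domain
    detector.pretrain(source)
    # Step 2: generate a diversified intermediate domain that keeps
    # source annotations while adopting target-domain image styles
    intermediate = domain_shifter.generate(source, target)
    # Step 3: first adjustment on the intermediate domain
    detector.finetune(intermediate)
    # Step 4: detect on target images; keep boxes whose class matches
    # the weak image-level labels as pseudo annotations
    pseudo = [d for d in detector.detect(target.images)
              if d.cls in target.image_level_labels]
    # Step 5: second adjustment on the pseudo-labelled samples
    detector.finetune(pseudo)
    # Step 6: evaluate the final model on the target domain
    return detector.evaluate(target)
```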
The foregoing is a detailed description of the method of the invention in conjunction with the accompanying drawings, provided only to assist in understanding the method. Those skilled in the art can modify and adapt the invention in its embodiments and applications according to its spirit, and this description should therefore not be construed as limiting the invention.

Claims (7)

1. A weakly supervised real-time target detection method based on progressive diversified domain migration, wherein, to solve the problem of cross-domain target detection under weak supervision, a two-stage progressive domain-migration framework and a matching real-time target detector are proposed; in the first stage of the domain-migration framework, the characteristics of the source and target domains are combined, the imperfection of the image-translation network is exploited by adding constraint conditions that change the learning tendency of the image-translation generator and discriminator, and a domain shifter capable of generating diversified data samples is designed, so as to retain semantic information of different levels of the source domain together with the image style of the target domain and to generate an intermediate domain with diversified source-domain characteristics; in the second stage of the domain-migration framework, the intermediate domain serves as a supervised source domain, and image samples with pseudo labels are generated by combining the image-level annotations of the target domain; when constructing the real-time target detector, a 50-layer stack of residual network units with 3 additional convolution modules forms the base network, a feature pyramid network deeply fuses low-resolution, semantically strong features with high-resolution, semantically weak features through top-down and bottom-up paths and lateral connections, and the generalized intersection over union is adopted as the localization loss function; the target detection model is adjusted in sequence with the artificially generated intermediate-domain and pseudo-labeled image samples, thereby alleviating problems such as the source-biased discrimination of feature-level domain adaptation and the image-translation bias of pixel-level domain adaptation, and realizing a real-time target detection task under weak supervision with both real-time performance and accuracy;
the weak supervision real-time target detection method based on progressive diversified domain migration comprises the following steps:
step 1: loading a source domain data set into a target detector for pre-training to obtain a pre-training model;
and 2, step: loading the source domain and target domain data sets into a diversified domain shifter to generate intermediate domain images with source domain labeling information and different target domain image styles;
and step 3: loading the intermediate domain data into a pre-training model, and performing first-step adjustment on the model to obtain an intermediate domain training model;
and 4, step 4: loading the target domain image into a middle domain training model for detection, outputting information such as type, confidence score, position and the like, comparing the information with weak label (image-level label) information in the target domain image, adding pseudo label (position information) for the type in a detected picture, and generating a pseudo label image;
and 5: loading the pseudo-annotation image into a middle-domain training model, and performing second-step adjustment on the model to obtain a final learning training model;
step 6: and (5) testing the model.
2. The method of claim 1, wherein the intermediate-domain image is an intermediate feature space introduced when aligning feature distributions between two domains with a large gap, used to simplify the adaptation task; in other words, the method of the invention does not bridge the gap between the source and target domains directly, but adapts gradually toward the target domain through the connecting intermediate domain; the intermediate domain is constructed from source-domain images, synthesizing the target distribution at the pixel level.
3. The method of claim 1, wherein the domain shifter implements domain migration from a given source domain using a variant of an image-translation network; to ensure generality of the structure, the residual generator and discriminator of the cycle-consistent generative adversarial network (CycleGAN) are selected, with constraint conditions, to construct the domain shifter; to make the domain-shift effects clearly distinguishable, the invention considers two factors in the objective function, namely color preservation and reconstruction.
4. The method of claim 1, wherein domain diversification diversifies the source domain by deliberately creating distinctive domain differences through the domain shifter; each factor configuration of the domain shifter produces visually distinct pictures that enrich the intermediate-domain samples.
5. The method of claim 1, wherein the pseudo-labeled image samples are obtained by using the intermediate domain as a supervised source domain and generating pseudo-labeled pictures in combination with the image-level labels of the target domain; in the target domain, when detection is performed with a target detector trained only on the source domain, the main cause of detection failure is confusion of the target class with other classes or the background, rather than inaccurate localization; adjusting the target detector on the pseudo-labeled images improves the object-context distribution and significantly reduces this confusion.
6. The method of claim 1, wherein the target detector uses a 50-layer stack of residual network units with 3 additional convolution modules as its base network, and a feature pyramid network deeply fuses low-resolution, semantically strong features with high-resolution, semantically weak features through top-down and bottom-up paths and lateral connections to optimize multi-scale feature detection; the generalized intersection over union is used as the localization loss function, overcoming the infeasibility of optimization when bounding boxes do not overlap.
7. The method of claim 1, wherein the model test selects the target-domain and source-domain data sets, sets hyper-parameters such as the momentum parameter, learning rate, and number of training rounds, runs the target detection task in the experimental environment, inputs the relevant data into the model according to the steps of the invention, and finally outputs the specific position coordinates, confidence scores, and detection speed for the objects in the target-domain images.
CN202211235864.2A 2022-10-10 2022-10-10 Weak supervision real-time target detection method based on progressive diversified domain migration Pending CN115565005A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211235864.2A CN115565005A (en) 2022-10-10 2022-10-10 Weak supervision real-time target detection method based on progressive diversified domain migration

Publications (1)

Publication Number Publication Date
CN115565005A true CN115565005A (en) 2023-01-03

Family

ID=84744989


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115880499A (en) * 2023-02-22 2023-03-31 北京猫猫狗狗科技有限公司 Occluded target detection model training method, device, medium and equipment
CN116310293A (en) * 2023-02-13 2023-06-23 中国矿业大学(北京) Method for detecting target of generating high-quality candidate frame based on weak supervised learning
CN117194983A (en) * 2023-09-08 2023-12-08 北京理工大学 Bearing fault diagnosis method based on progressive condition domain countermeasure network



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination