CN114882372A - Target detection method and device - Google Patents

Target detection method and device

Info

Publication number
CN114882372A
Authority
CN
China
Prior art keywords
feature map
low
training
dimensional feature
detection model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202210807357.5A
Other languages
Chinese (zh)
Inventor
沈孔怀 (Shen Konghuai)
朱亚伦 (Zhu Yalun)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202210807357.5A priority Critical patent/CN114882372A/en
Publication of CN114882372A publication Critical patent/CN114882372A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/10Terrestrial scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method and device for improving the detection performance of a detection model in an actual scene without increasing the amount of labeled training data. The method includes: acquiring a target image to be detected; inputting the target image into a trained detection model and outputting the position of a target object in the target image. The detection model is obtained by training with a labeled first sample set and an unlabeled second sample set: at each training iteration, a first low-dimensional feature map of a first sample image in the first sample set and a second low-dimensional feature map of a second sample image in the second sample set are extracted, the two feature maps are fused to obtain a fused feature map, and the fused feature map together with the positions of the target objects labeled in the first sample image is used for training.

Description

Target detection method and device
Technical Field
The invention relates to the technical field of computer graphics and image processing, and in particular to a target detection method and device.
Background
Current deep-learning methods are generally data-driven. Real environments, however, are complex and diverse: if the data distribution of the actual deployment scene differs markedly from that of the current training set, the performance of a detection model may drop significantly. To guarantee algorithm performance, data from the actual scenes must be covered during training, but data labeling is expensive and requires a large amount of labor and time.
Therefore, when the data distribution of the actual scene differs greatly from that of the training set, how to improve the detection performance of the detection model in the actual scene without increasing the amount of labeled training data has become an urgent technical problem.
Disclosure of Invention
The invention provides a target detection method and device for improving the detection performance of a detection model in an actual scene without increasing the amount of labeled training data.
In a first aspect, an embodiment of the present invention provides a method for target detection, including:
acquiring a target image to be detected;
inputting the target image into a trained detection model, and outputting the position of a target object in the target image;
the detection model is obtained by training with a labeled first sample set and an unlabeled second sample set; at each training iteration, a first low-dimensional feature map of a first sample image in the first sample set and a second low-dimensional feature map of a second sample image in the second sample set are extracted, the first low-dimensional feature map and the second low-dimensional feature map are fused to obtain a fused feature map, and the fused feature map together with the positions of the target objects labeled in the first sample image is used for training; both the first sample set and the second sample set contain target objects, and the second sample set contains scene features different from those of the first sample set.
According to this embodiment, a large amount of unlabeled data is used on top of the labeled data to train the detection model, and feature separation and fusion are carried out directly in the shallow layers of the network. This simplifies the training process, strengthens the low-dimensional features of the target domain, achieves low-cost domain transfer, and improves the detection stability of the detection model across different scenes.
As an optional implementation, fusing the first low-dimensional feature map and the second low-dimensional feature map to obtain a fused feature map includes:
determining the fused feature map according to a first mean and a first standard deviation of the first low-dimensional feature map on each channel of a feature extraction layer, and a second mean and a second standard deviation of the second low-dimensional feature map on each channel of the feature extraction layer;
wherein the feature extraction layer is the convolutional layer in the detection model used for extracting low-dimensional features.
As an optional implementation, determining the fused feature map includes:
normalizing the first low-dimensional feature map according to the first mean and the first standard deviation to obtain a normalized first low-dimensional feature map;
and determining the fused feature map by using the second mean as the offset of the normalized first low-dimensional feature map and the second standard deviation as its scaling weight.
As an optional implementation, at each training iteration the detection model further:
determines a domain classification loss value according to the first low-dimensional feature map, the second low-dimensional feature map and the fused feature map;
and adjusts the parameters of the detection model during training according to the domain classification loss value so as to extract domain-invariant features.
In this embodiment, the source-domain features, the target-domain features and the fused features are all fed to the domain classifier for discrimination, which helps the backbone network of the detection model extract domain-invariant features.
As an optional implementation, adjusting the parameters of the detection model during training according to the domain classification loss value to extract domain-invariant features includes:
back-propagating the gradient corresponding to the domain classification loss value through a gradient reversal layer, and adjusting the parameters of the detection model during training so as to extract domain-invariant features.
As an optional implementation, at each training iteration the detection model further:
determines a similarity loss value according to the semantic similarity between the first low-dimensional feature map and the fused feature map;
and adjusts the parameters of the detection model during training according to the similarity loss value so as to constrain the semantic consistency of the first low-dimensional feature map and the fused feature map.
Through the separation and fusion of low-dimensional features across domains and the semantic consistency constraint, this embodiment strengthens the features specific to the target domain on top of the extracted domain-invariant features, improving the detector's performance on the target domain.
As an optional implementation, at each training iteration the detection model further:
determines the candidate box sets associated with the target objects in the fused feature map;
determines a homogeneous mutual exclusion loss value according to the candidate box sets associated with different target objects;
and adjusts the parameters of the detection model during training according to the homogeneous mutual exclusion loss value so as to enlarge the distance between different candidate box sets.
On top of regressing candidate boxes to target boxes, this embodiment introduces a homogeneous mutual exclusion loss that enlarges the distance between candidate boxes matched to different targets, keeping them as far apart as possible. This improves the quality of the detector's candidate boxes and alleviates false detections of target objects in dense scenes.
As an optional implementation, determining the homogeneous mutual exclusion loss value according to the candidate box sets associated with different target objects includes:
determining a plurality of candidate box pairs from a first candidate box set associated with a first target object and a second candidate box set associated with a second target object;
and determining the homogeneous mutual exclusion loss value according to the distance between the first candidate box and the second candidate box in each pair.
In a second aspect, an embodiment of the present invention provides a training method for a detection model, where the method includes:
acquiring training samples of a detection model to be trained, wherein the training samples comprise a labeled first sample set and an unlabeled second sample set; the first sample set and the second sample set both contain target objects, the second sample set containing scene features different from the first sample set;
and, at each training iteration, extracting a first low-dimensional feature map of a first sample image in the first sample set and a second low-dimensional feature map of a second sample image in the second sample set, fusing the first low-dimensional feature map and the second low-dimensional feature map to obtain a fused feature map, and training an initial detection model with the fused feature map and the positions of the target objects labeled in the first sample image to obtain the trained detection model.
In a third aspect, an apparatus for object detection provided by an embodiment of the present invention includes a processor and a memory, where the memory is configured to store a program executable by the processor, and the processor is configured to read the program in the memory and execute the following steps:
acquiring a target image to be detected;
inputting the target image into a trained detection model, and outputting the position of a target object in the target image;
the detection model is obtained by training with a labeled first sample set and an unlabeled second sample set; at each training iteration, a first low-dimensional feature map of a first sample image in the first sample set and a second low-dimensional feature map of a second sample image in the second sample set are extracted, the first low-dimensional feature map and the second low-dimensional feature map are fused to obtain a fused feature map, and the fused feature map together with the positions of the target objects labeled in the first sample image is used for training; both the first sample set and the second sample set contain target objects, and the second sample set contains scene features different from those of the first sample set.
As an alternative embodiment, the processor is configured to perform:
determining the fused feature map according to a first mean value and a first standard deviation of the first low-dimensional feature map on each channel of the feature extraction layer and a second mean value and a second standard deviation of the second low-dimensional feature map on each channel of the feature extraction layer;
wherein the feature extraction layer is a convolution layer for extracting low-dimensional features in the detection model.
As an alternative embodiment, the processor is configured to perform:
normalizing the first low-dimensional feature map according to the first mean value and the first standard deviation to obtain a normalized first low-dimensional feature map;
and determining the fusion feature map by taking the second mean value as the offset of the normalized first low-dimensional feature map and the second standard deviation as the scaling weight of the normalized first low-dimensional feature map.
As an optional implementation, in each training of the detection model, the processor is specifically further configured to perform:
determining a domain classification loss value according to the first low-dimensional feature map, the second low-dimensional feature map and the fusion feature map;
and adjusting parameters of the detection model in the training process according to the domain classification loss value so as to extract the domain-invariant features.
As an alternative embodiment, the processor is configured to perform:
and performing back propagation on the gradient corresponding to the domain classification loss value by using a gradient inversion layer, and adjusting parameters of the detection model in a training process to extract the feature with unchanged domain.
As an optional implementation, in each training of the detection model, the processor is specifically further configured to perform:
determining a similarity loss value according to the semantic similarity of the first low-dimensional feature map and the fusion feature map;
and adjusting parameters of the detection model in a training process according to the similarity loss value so as to constrain semantic consistency of the first low-dimensional feature map and the fusion feature map.
As an optional embodiment, in each training of the detection model, the processor is specifically further configured to perform:
determining a candidate box set associated with a target object in the fusion feature map;
determining a homogeneous mutual exclusion loss value according to the candidate box sets associated with different target objects;
and adjusting the parameters of the detection model during training according to the homogeneous mutual exclusion loss value so as to enlarge the distance between different candidate box sets.
As an alternative embodiment, the processor is configured to perform:
determining a plurality of candidate box pairs from a first candidate box set associated with a first target object and a second candidate box set associated with a second target object;
and determining the homogeneous mutual exclusion loss value according to the distance between the first candidate box and the second candidate box in each pair.
In a fourth aspect, an embodiment of the present invention further provides an apparatus for target detection, where the apparatus includes:
the image acquisition unit is used for acquiring a target image to be detected;
the detection target unit is used for inputting the target image into a trained detection model and outputting the position of a target object in the target image; the detection model is obtained by training with a labeled first sample set and an unlabeled second sample set; at each training iteration, a first low-dimensional feature map of a first sample image in the first sample set and a second low-dimensional feature map of a second sample image in the second sample set are extracted, the first low-dimensional feature map and the second low-dimensional feature map are fused to obtain a fused feature map, and the fused feature map together with the positions of the target objects labeled in the first sample image is used for training; both the first sample set and the second sample set contain target objects, and the second sample set contains scene features different from those of the first sample set.
As an optional implementation manner, the detection target unit is specifically configured to:
determining the fused feature map according to a first mean value and a first standard deviation of the first low-dimensional feature map on each channel of the feature extraction layer and a second mean value and a second standard deviation of the second low-dimensional feature map on each channel of the feature extraction layer;
wherein the feature extraction layer is a convolution layer for extracting low-dimensional features in the detection model.
As an optional implementation manner, the detection target unit is specifically configured to:
normalizing the first low-dimensional feature map according to the first mean value and the first standard deviation to obtain a normalized first low-dimensional feature map;
and determining the fusion feature map by taking the second mean value as the offset of the normalized first low-dimensional feature map and the second standard deviation as the scaling weight of the normalized first low-dimensional feature map.
As an optional implementation manner, in each training of the detection model, the detection target unit is further specifically configured to:
determining a domain classification loss value according to the first low-dimensional feature map, the second low-dimensional feature map and the fusion feature map;
and adjusting the parameters of the detection model during training according to the domain classification loss value so as to extract domain-invariant features.
As an optional implementation manner, the detection target unit is specifically configured to:
and performing back propagation on the gradient corresponding to the domain classification loss value by using a gradient inversion layer, and adjusting parameters of the detection model in a training process to extract the feature with unchanged domain.
As an optional implementation manner, in each training of the detection model, the detection target unit is further specifically configured to:
determining a similarity loss value according to the semantic similarity of the first low-dimensional feature map and the fusion feature map;
and adjusting parameters of the detection model in a training process according to the similarity loss value so as to constrain semantic consistency of the first low-dimensional feature map and the fusion feature map.
As an optional implementation manner, in each training of the detection model, the detection target unit is further specifically configured to:
determining a candidate box set associated with a target object in the fusion feature map;
determining a homogeneous mutual exclusion loss value according to the candidate box sets associated with different target objects;
and adjusting the parameters of the detection model during training according to the homogeneous mutual exclusion loss value so as to enlarge the distance between different candidate box sets.
As an optional implementation manner, the detection target unit is specifically configured to:
determining a plurality of candidate box pairs from a first candidate box set associated with a first target object and a second candidate box set associated with a second target object;
and determining the homogeneous mutual exclusion loss value according to the distance between the first candidate box and the second candidate box in each pair.
In a fifth aspect, embodiments of the present invention further provide a computer storage medium, on which a computer program is stored, where the computer program is used to implement the steps of the method in the first or second aspect when executed by a processor.
These and other aspects of the present application will be more readily apparent from the following description of the embodiments.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1A is a schematic diagram of an image of a target to be detected according to an embodiment of the present invention;
fig. 1B is a schematic diagram of an image of a target to be detected according to an embodiment of the present invention;
fig. 2 is a schematic diagram of a false detection of a target object according to an embodiment of the present invention;
fig. 3 is a flowchart illustrating an embodiment of a method for target detection according to the present invention;
FIG. 4 is a schematic diagram of a detection model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a target detection model in a training phase according to an embodiment of the present invention;
fig. 6 is a flowchart of an implementation of a method for detecting a target object according to an embodiment of the present invention;
FIG. 7 is a flowchart illustrating an implementation of a model training method according to an embodiment of the present invention;
fig. 8 is a schematic diagram of an apparatus for target detection according to an embodiment of the present invention;
fig. 9 is a schematic diagram of an apparatus for object detection according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The term "and/or" in the embodiments of the present invention describes an association relationship of associated objects, and indicates that three relationships may exist, for example, a and/or B may indicate: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
The application scenario described in the embodiment of the present invention is for more clearly illustrating the technical solution of the embodiment of the present invention, and does not form a limitation on the technical solution provided in the embodiment of the present invention, and it can be known by a person skilled in the art that with the occurrence of a new application scenario, the technical solution provided in the embodiment of the present invention is also applicable to similar technical problems. In the description of the present invention, the term "plurality" means two or more unless otherwise specified.
Embodiment 1. With the improvement of modern computing power, massive data can truly be exploited. In the field of computer vision, deep-learning-based methods have greatly improved the metrics of related tasks. For example, in scenes such as video surveillance and driving assistance, abstract features of targets are extracted automatically by training on and analyzing large amounts of data, which substantially improves common downstream tasks such as classification, detection and segmentation. Current deep-learning methods are generally data-driven; real environments, however, are complex and diverse, and if the data distribution of the actual scene differs markedly from that of the current training set, model performance may drop significantly. For example, when a lightweight non-motor vehicle detector is trained only on data from the scene of fig. 1A, it may produce more missed and false detections in the scene of fig. 1B owing to differences in camera position, weather, illumination and the like. To guarantee algorithm performance, data from all scenes would have to be covered during training, but data labeling is expensive and requires a large amount of labor and time. In addition, during morning and evening traffic peaks the number of non-motor vehicles increases markedly, and false detections become more likely. As shown in fig. 2, in the training phase the upper-left labeled box is denoted GT1 and the lower-right labeled box GT2; the upper-left candidate box is denoted PR1 and the lower-right candidate box PR2. Assuming that both PR1 and PR2 are matched to GT1 and that PR1 and PR2 have the same IoU (Intersection over Union) with GT1, PR2 is more susceptible to the influence of GT2, which pulls PR2 away from its intended target and causes false detections between GT1 and GT2.
Because real environments are complex and diverse, if the data distribution of the actual scene differs markedly from that of the current training set, for example when the actual scene consists of night-time images while the training set consists of daytime images, the performance of the detection model may be significantly reduced. To address this performance degradation, the present embodiment provides a target detection method that improves the detection performance of the detection model in the actual scene without increasing the amount of labeled training data.
The core idea of the target detection method provided by this embodiment is to train the model with an unlabeled second sample set and a labeled first sample set: a second low-dimensional feature of a second sample image in the second sample set is fused with a first low-dimensional feature of a first sample image in the first sample set, and the fused features are used to train the model.
The target object in this embodiment may be a person, a vehicle, an animal or the like, which is not limited here; a vehicle includes at least one of a motor vehicle and a non-motor vehicle.
As shown in fig. 3, a specific implementation flow of the method for detecting a target provided in the embodiment of the present invention is as follows:
step 300, obtaining a target image to be detected;
in the embodiment, the target image to be detected comprises an image shot in any scene and/or an image shot by any camera device; optionally, the scene features in this embodiment are used to represent environmental features, such as lighting conditions, weather conditions, and the like, in which the image is captured, and/or style attributes used in capturing the image, which is not limited in this embodiment.
Step 301, inputting the target image into a trained detection model, and outputting the position of a target object in the target image; the detection model is obtained by training with a labeled first sample set and an unlabeled second sample set; at each training iteration, a first low-dimensional feature map of a first sample image in the first sample set and a second low-dimensional feature map of a second sample image in the second sample set are extracted, the first low-dimensional feature map and the second low-dimensional feature map are fused to obtain a fused feature map, and the fused feature map together with the positions of the target objects labeled in the first sample image is used for training; both the first sample set and the second sample set contain target objects, and the second sample set contains scene features different from those of the first sample set.
In some embodiments, the detection model is built on a feature extractor; the feature extractor in this embodiment includes, but is not limited to, at least one of a CNN (Convolutional Neural Network) and a Transformer.
In some embodiments, as shown in fig. 4, the detection model includes a feature extraction module 400 and a classification-regression module 401. The feature extraction module includes a first feature extraction module, which extracts the low-dimensional features of an image, and a second feature extraction module, which extracts the high-dimensional features. An input target image first passes through the first feature extraction module to obtain a low-dimensional feature map, which is then fed to the second feature extraction module to obtain a high-dimensional feature map. The high-dimensional feature map is input to the classification-regression module for classification and regression, which determines the candidate box set of the target object, screens a detection box for the target object from the candidate box set, and outputs it; the position of the target object in the target image is determined by this detection box.
For example, this embodiment uses VGG16 as the backbone network G (i.e., the feature extraction module) of the detection model, with 5 blocks connected in series: G1 denotes the first 2 blocks, which extract low-dimensional features, and G2 denotes the last 3 blocks, which extract high-dimensional features. The classification-regression module may, for example, be a two-stage Faster RCNN comprising an RPN (Region Proposal Network) and an RCNN (Region-CNN) head, where the RPN extracts candidate boxes and the RCNN performs a second regression to refine the candidate box positions.
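As an illustration of this split, the following sketch (layer indices and shapes are assumptions for a torchvision VGG16, not details from the patent) divides the five convolutional blocks into a low-dimensional part G1 and a high-dimensional part G2:

```python
# Minimal sketch (assumed layer indices): split a torchvision VGG16 backbone into
# G1 (first 2 conv blocks, low-dimensional features) and G2 (last 3 conv blocks,
# high-dimensional features).
import torch
import torch.nn as nn
from torchvision.models import vgg16

features = vgg16(weights=None).features   # 5 conv blocks separated by max-pool layers

# In vgg16.features the pooling layers sit at indices 4, 9, 16, 23 and 30.
G1 = nn.Sequential(*features[:10])        # blocks 1-2 -> low-dimensional feature map
G2 = nn.Sequential(*features[10:])        # blocks 3-5 -> high-dimensional feature map

x = torch.randn(1, 3, 224, 224)           # dummy input image
low = G1(x)                               # shape (1, 128, 56, 56)
high = G2(low)                            # shape (1, 512, 7, 7)
print(low.shape, high.shape)
```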
In some embodiments, the detection model is initialized with an ImageNet pre-trained model D_pretrain; that is, the model is pre-trained on the labeled data in ImageNet to obtain the initial detection model. The detection model and the pre-trained model D_pretrain have the same structure. Note that a target detection network usually pre-trains its backbone on a large classification dataset such as ImageNet, which contains a large amount of labeled training data; the feature representation learned on such a large-scale dataset helps the detection algorithm converge faster.
In some embodiments, the first low-dimensional feature map in this embodiment is obtained by feature extraction of the first sample image according to a feature extraction layer; the second low-dimensional feature map in this embodiment is obtained by performing feature extraction on the second sample image according to the feature extraction layer; in this embodiment, the first low-dimensional feature map and the second low-dimensional feature map are obtained by performing feature extraction by using the same feature extraction method, and in a specific implementation, the features of the first sample image and the second sample image may be extracted by using the same feature extraction layer, so as to obtain the first low-dimensional feature map and the second low-dimensional feature map. Wherein, the feature extraction layer is a convolution layer used for extracting low-dimensional features in the detection model. The detection model further includes a convolution layer for extracting high-dimensional features.
In some embodiments, the first low-dimensional feature map and the second low-dimensional feature map characterize at least one of the color, texture and style of the image; that is, the low-dimensional features in this embodiment represent at least one of the image's color, texture and style. Because the color, texture and style of an image are related to the environmental conditions under which it was captured, such as illumination and weather, and to the style attributes of the capturing device, extracting the low-dimensional features of the first and second sample images allows the features of different scenes to be fused effectively. Since the unlabeled second sample set is generally related to the actual detection scene, it can be collected directly from the actual environment without labeling, which improves the detection performance of the detection model in the actual scene.
In implementation, the labeled first sample set and the unlabeled second sample set contain scene features that are not entirely the same, and the scene features of the second sample set are used in the fusion of the low-dimensional features. The position of the target object in the fused feature map does not change relative to its position in the first sample image before fusion, so training with the fused feature map and the target-object positions labeled in the corresponding first sample image improves the detection performance of the detection model in the scenes of the second sample set.
The detection model in this embodiment is thus trained with sample sets having different scene features, so it can detect target images with different scene features, and the detection performance is ensured without increasing the cost of labeling data.
In some embodiments, the fused feature map is obtained as follows:
determining the fused feature map according to a first mean value and a first standard deviation of the first low-dimensional feature map on each channel of the feature extraction layer and a second mean value and a second standard deviation of the second low-dimensional feature map on each channel of the feature extraction layer;
wherein the feature extraction layer is a convolution layer for extracting low-dimensional features in the detection model.
In implementation, the per-channel mean and standard deviation of the low-dimensional feature extraction layer are used to fuse the different feature maps; the fusion combines the scene features of the second sample set with those of the first sample set, so that the detection model improves its detection performance in new scenes.
Optionally, the number of channels in this embodiment is related to the number of convolution kernels in the detection model.
It should be noted that the first low-dimensional feature map is a three-dimensional tensor; when computing its statistics, the mean and standard deviation are computed over each two-dimensional (spatial) slice along the channel dimension, yielding the first mean and the first standard deviation. Similarly, the second low-dimensional feature map is a three-dimensional tensor, and its per-channel mean and standard deviation give the second mean and the second standard deviation.
In some embodiments, the fused feature map is calculated using the mean and standard deviation as follows:
step 1) normalizing the first low-dimensional feature map according to the first mean value and the first standard deviation to obtain a normalized first low-dimensional feature map;
in practice, the normalization process is performed by the following formula:
Figure 379500DEST_PATH_IMAGE001
formula (1);
wherein, F s Representing a normalized first low-dimensional feature map, f s A first low-dimensional feature map is represented,
Figure 482454DEST_PATH_IMAGE002
which represents the first mean value of the first mean value,
Figure 78651DEST_PATH_IMAGE003
the first standard deviation is indicated.
And 2) determining the fusion feature map by taking the second mean value as the offset of the normalized first low-dimensional feature map and the second standard deviation as the scaling weight of the normalized first low-dimensional feature map.
In implementation, the fused feature map is calculated by the following formula:

$f_{fuse} = \sigma_g \cdot \dfrac{f_s - \mu_s}{\sigma_s} + \mu_g$    formula (2);

where $f_{fuse}$ denotes the fused feature map, $f_s$ denotes the first low-dimensional feature map with first mean $\mu_s$ and first standard deviation $\sigma_s$, and $f_g$ denotes the second low-dimensional feature map with second mean $\mu_g$ and second standard deviation $\sigma_g$.
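A minimal sketch of this fusion, with per-channel statistics computed as described above (the function and variable names are illustrative, not taken from the patent):

```python
# Minimal sketch of the low-dimensional feature fusion of formulas (1)-(2):
# normalize the labeled (source) feature map with its own per-channel statistics,
# then re-scale and shift it with the statistics of the unlabeled (target) feature map.
import torch

def fuse_low_dim(f_s: torch.Tensor, f_g: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """f_s, f_g: feature maps of shape (N, C, H, W). Returns the fused feature map."""
    mu_s = f_s.mean(dim=(2, 3), keepdim=True)             # first mean, per channel
    sigma_s = f_s.std(dim=(2, 3), keepdim=True) + eps     # first standard deviation
    mu_g = f_g.mean(dim=(2, 3), keepdim=True)             # second mean
    sigma_g = f_g.std(dim=(2, 3), keepdim=True)           # second standard deviation

    F_s = (f_s - mu_s) / sigma_s                          # formula (1): normalization
    return sigma_g * F_s + mu_g                           # formula (2): scale + offset

f_s = torch.randn(1, 128, 56, 56)   # first (labeled) low-dimensional feature map
f_g = torch.randn(1, 128, 56, 56)   # second (unlabeled) low-dimensional feature map
fused = fuse_low_dim(f_s, f_g)      # same shape as f_s
```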
It should be noted that when training a target detection model, loss values are normally used to adjust the model parameters during training, and training is complete when the loss value satisfies a preset condition. In this embodiment, the loss of the fused feature map under classification and regression is calculated when training the target detection model, i.e., the loss computed after the fused features are fed into the classification-regression module. In addition, because features from different domains (the first sample set and the second sample set) are fused in this embodiment, in order to constrain the fused features and ensure that domain-invariant features are extracted, this embodiment further provides the following steps:
determining a domain classification loss value according to the first low-dimensional feature map, the second low-dimensional feature map and the fusion feature map when the detection model is trained each time; and adjusting parameters of the detection model in the training process according to the domain classification loss value so as to extract the domain-invariant features.
In some embodiments, a gradient reversal layer is used to back-propagate the gradient corresponding to the domain classification loss value, and the parameters of the detection model are adjusted during training so as to extract domain-invariant features. The gradient reversal layer is used to confuse the different domains across the first low-dimensional feature map, the second low-dimensional feature map and the fused feature map.
In implementation, a typical domain classifier can be used in the training phase: the first low-dimensional feature map, the second low-dimensional feature map and the fused feature map are input to the domain classifier, and adversarial training is implemented with a gradient reversal layer (GRL) in the domain classifier to confuse the source-domain (first sample set) and target-domain (second sample set) data, so that the backbone network of the detection model extracts domain-invariant features. Optionally, when the class confidences output by the domain classifier approach a uniform distribution (for example, with two domain labels, the confidences of the two classifier outputs are both 0.5), it is determined that the backbone network can extract domain-invariant features well. The domain classification loss value may be denoted by $L_{dom}$.
It should be noted that the GRL does not change any parameter during forward propagation and only acts during back-propagation on the parameters to be updated. In the forward pass, the features extracted by the domain classifier (denoted D) are used to calculate the domain classification loss value; in the backward pass, the gradient of the domain classification loss is propagated back through D normally, i.e., the loss is minimized for D. For the backbone network G (e.g., G1 and G2), however, the gradient of the domain classification loss is reversed by the GRL before being propagated back, i.e., the loss is maximized with respect to G. This realizes adversarial training and makes the backbone network extract domain-invariant features.
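A common way to realize such a gradient reversal layer is a small custom autograd function; the sketch below is a generic PyTorch implementation under that assumption, not code taken from the patent:

```python
# Minimal sketch of a gradient reversal layer (GRL): identity in the forward pass,
# gradient multiplied by -lambda in the backward pass.
import torch

class GradientReversal(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float = 1.0):
        ctx.lambd = lambd
        return x.view_as(x)                      # identity, no parameter change

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None    # reversed gradient flows to the backbone

def grl(x, lambd: float = 1.0):
    return GradientReversal.apply(x, lambd)

# Usage: backbone features pass through grl() before the domain classifier, so
# minimizing the domain classification loss for the classifier simultaneously
# maximizes it with respect to the backbone (adversarial training).
feat = torch.randn(2, 128, requires_grad=True)
out = grl(feat).sum()
out.backward()
print(feat.grad[0, :3])   # gradients are negated relative to a plain sum()
```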
In some embodiments, the detection model further determines a similarity loss value according to semantic similarity between the first low-dimensional feature map and the fused feature map during each training; and adjusting parameters of the detection model in a training process according to the similarity loss value so as to constrain semantic consistency of the first low-dimensional feature map and the fusion feature map.
In implementation, the target objects in the first low-dimensional feature map and the fused feature map should be semantically similar, so a similarity network $G_{sim}$ can be used to compute the semantic similarity between the two and determine the similarity loss value; adding this constraint alleviates the problem that the network is otherwise difficult to converge. Optionally, the similarity network $G_{sim}$ may consist of the last 3 blocks of VGG16 and may be initialized from the pre-trained model D_pretrain; its parameters are fixed and are not updated during training of the detection model. Optionally, the similarity loss value may be a typical perceptual loss, denoted $L_{sim}$.
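A hedged sketch of such a semantic-consistency (perceptual-style) loss, assuming the similarity network reuses the last three VGG16 blocks and an MSE distance between embeddings (both assumptions; the patent only names a perceptual loss):

```python
# Minimal sketch of the similarity loss L_sim: a frozen similarity network embeds the
# first low-dimensional feature map and the fused feature map, and their distance in
# that embedding constrains semantic consistency.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

# Assumed: the similarity network reuses the last 3 conv blocks of VGG16 (indices 10:),
# initialized from a pre-trained backbone and kept frozen during detector training.
sim_net = nn.Sequential(*vgg16(weights=None).features[10:]).eval()
for p in sim_net.parameters():
    p.requires_grad_(False)

def similarity_loss(f_s: torch.Tensor, f_fuse: torch.Tensor) -> torch.Tensor:
    """f_s: first low-dimensional feature map; f_fuse: fused feature map (N, 128, H, W)."""
    return F.mse_loss(sim_net(f_fuse), sim_net(f_s))

l_sim = similarity_loss(torch.randn(1, 128, 56, 56), torch.randn(1, 128, 56, 56))
```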
In some embodiments, at each training iteration the detection model may further determine the candidate box set associated with each target object in the fused feature map. In implementation, the candidate box set is determined, for example, by the RPN: the fused features are input to the RPN, which outputs the candidate box set associated with a target object, where a candidate box set contains one or more candidate boxes. When the fused feature map contains multiple target objects, a homogeneous mutual exclusion loss value is determined according to the candidate box sets associated with the different target objects, and the parameters of the detection model are adjusted during training according to the homogeneous mutual exclusion loss value so as to enlarge the distance between different candidate box sets.
In some embodiments, the homogeneous mutual exclusion loss value is determined as follows:
determining a plurality of candidate box pairs from a first candidate box set associated with a first target object and a second candidate box set associated with a second target object, and determining the homogeneous mutual exclusion loss value according to the distance between the first candidate box and the second candidate box in each pair.
In implementation, the homogeneous mutual exclusion loss is used to enlarge the distance between candidate boxes belonging to different targets, alleviating the repeated and false detections that arise when a target image contains many densely packed target objects, such as a dense group of vehicles. A candidate box is generally associated with a target object (i.e., a target box) according to a matching policy (e.g., maximum IoU). Let $P_k$ denote the set of candidate boxes associated with the k-th target object; for $k \neq j$, $P_k \cap P_j = \varnothing$. The homogeneous mutual exclusion loss value $L_{rep}$ is then computed as

$L_{rep} = \dfrac{1}{N} \sum_{k \neq j} \sum_{p \in P_k,\, q \in P_j} D(p, q)$    formula (3);

where $D(p, q)$ denotes the pairwise distance term computed between each candidate box $p$ in $P_k$ and each candidate box $q$ in $P_j$, and $N$ denotes the number of candidate box pairs over which the distance is computed. Optionally, the distance function may be chosen as the GIoU distance.
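One plausible realization of this term, assuming the penalty is the average pairwise GIoU overlap between candidate boxes matched to different targets (the exact form of formula (3) is not fully recoverable from the text; the function below is an illustrative sketch):

```python
# Hedged sketch of a homogeneous mutual exclusion (repulsion-style) term: penalize
# overlap between candidate boxes matched to *different* target objects, so that
# minimizing the term pushes the two candidate sets apart.
import torch
from torchvision.ops import generalized_box_iou

def mutual_exclusion_loss(p_k: torch.Tensor, p_j: torch.Tensor) -> torch.Tensor:
    """p_k, p_j: candidate boxes (x1, y1, x2, y2) matched to two different targets."""
    giou = generalized_box_iou(p_k, p_j)   # pairwise GIoU matrix, values in [-1, 1]
    # Only penalize pairs that still overlap or lie close (positive GIoU);
    # N is the number of candidate-box pairs entering the average.
    penalty = giou.clamp(min=0.0)
    n = penalty.numel()
    return penalty.sum() / max(n, 1)

p1 = torch.tensor([[10., 10., 60., 60.], [12., 8., 58., 62.]])   # candidates of target 1
p2 = torch.tensor([[40., 40., 90., 90.]])                        # candidates of target 2
l_rep = mutual_exclusion_loss(p1, p2)
```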
In some embodiments, the total loss used by the detection model in the training phase is:

$L_{total} = L_{det} + \lambda_1 L_{sim} + \lambda_2 L_{rep} + \lambda_3 L_{dom}$    formula (4);

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are weight factors; $L_{det}$ denotes the detection loss, including, for example, the classification and regression losses of the RPN and RCNN; $L_{sim}$ denotes the similarity loss value; $L_{rep}$ denotes the homogeneous mutual exclusion loss value; and $L_{dom}$ denotes the domain classification loss value.
In the testing and deployment stages of the detection model, the target image to be detected passes directly through the feature extraction module to obtain a feature map, which is fed to the RPN and RCNN modules to output the predicted coordinates of the target object. That is, the feature fusion, the semantic consistency constraint $L_{sim}$ and the domain classifier with the GRL are only used during training.
As shown in fig. 5, this embodiment provides a schematic diagram of the target detection model in the training phase. VGG16 is used as the backbone network G (i.e., the feature extraction module) of the detection model and consists of 5 blocks connected in series: G1 denotes the first 2 blocks and extracts the low-dimensional features, which comprise the first low-dimensional feature map T, the second low-dimensional feature map G and the fused feature map TG; G2 denotes the last 3 blocks and extracts the high-dimensional features. The classification-regression module adopts a two-stage Faster RCNN, including the RPN and the RCNN. The DADAIN module performs the feature fusion; the GRL module reverses and back-propagates the gradient corresponding to the domain classification loss value $L_{dom}$, so that the backbone parameters are adjusted along the negative of that gradient; the similarity network $G_{sim}$ is used to compute the similarity loss value $L_{sim}$; and the RPN is also used to compute the homogeneous mutual exclusion loss value $L_{rep}$.
It should be noted that, since the feature extraction module is split into a low-dimensional part and a high-dimensional part in this embodiment, the domain classification loss is actually computed in two parts: one part computes the loss on the first low-dimensional feature map, the second low-dimensional feature map and the fused feature map output by the low-dimensional feature extraction, mainly targeting pixel-level domain alignment; the other part computes the loss on the first high-dimensional feature map, the second high-dimensional feature map and the high-dimensional fused feature map obtained by further passing the three low-dimensional maps through the high-dimensional feature extraction, mainly targeting image-level domain alignment.
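A hedged sketch of this two-level arrangement (channel counts, classifier shapes and loss forms are assumptions; only the pixel-level/image-level structure mirrors the description above, and in training both classifiers would sit behind the GRL):

```python
# Minimal sketch of the two domain classifiers: a pixel-level one on the
# low-dimensional maps and an image-level one on the high-dimensional maps.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Pixel-level: per-location 1x1-conv classifier on 128-channel low-dimensional maps.
pixel_clf = nn.Conv2d(128, 1, kernel_size=1)
# Image-level: pooled classifier on 512-channel high-dimensional maps.
image_clf = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(512, 2))

def pixel_level_loss(low_feat: torch.Tensor, domain: float) -> torch.Tensor:
    """Binary cross-entropy at every spatial location of a low-dimensional map."""
    logits = pixel_clf(low_feat)
    target = torch.full_like(logits, domain)
    return F.binary_cross_entropy_with_logits(logits, target)

def image_level_loss(high_feat: torch.Tensor, domain: int) -> torch.Tensor:
    """Cross-entropy on a single pooled prediction per high-dimensional map."""
    logits = image_clf(high_feat)
    target = torch.full((logits.size(0),), domain, dtype=torch.long)
    return F.cross_entropy(logits, target)

l_pix = pixel_level_loss(torch.randn(1, 128, 56, 56), domain=0.0)   # e.g. a source map
l_img = image_level_loss(torch.randn(1, 512, 7, 7), domain=0)
```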
As shown in fig. 6, this embodiment further provides a method for detecting a target object, and the specific implementation flow of the method is as follows:
step 600, obtaining a first sample set which is marked and a second sample set which is not marked;
step 601, in the training process of the detection model, sampling one image from the first sample set and one image from the second sample set at each iteration for training;
that is, two samples are selected per training iteration; in general, the batch size is chosen according to the desired optimization quality and speed of the model.
Step 602, extracting a first low-dimensional feature map of a first sample image and a second low-dimensional feature map of a second sample image;
step 603, fusing the first low-dimensional feature map and the second low-dimensional feature map to obtain a fused feature map;
step 604, training with the fused feature map and the positions of the target objects labeled in the first sample image;
step 605, determining a domain classification loss value according to the first low-dimensional feature map, the second low-dimensional feature map and the fusion feature map;
step 606, determining a similarity loss value according to semantic similarity of the first low-dimensional feature map and the fusion feature map;
step 607, determining the candidate box sets associated with the target objects in the fused feature map, and determining a homogeneous mutual exclusion loss value according to the candidate box sets associated with different target objects;
step 608, adjusting the parameters of the detection model during training according to the domain classification loss value, the similarity loss value, the homogeneous mutual exclusion loss value and the detection loss value, until the total loss value meets a preset condition, at which point training of the detection model is determined to be complete;
step 609, inputting the target image to be detected into the trained detection model, and outputting the position of the target object in the target image.
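The following sketch composes the earlier code sketches into one hypothetical training iteration (it assumes G1, G2, fuse_low_dim, grl, similarity_loss and mutual_exclusion_loss are defined as above; the detection head, domain labels and candidate boxes are toy placeholders, not the patent's implementation):

```python
# Hypothetical single training iteration combining the losses described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

domain_clf = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, 2))
params = list(G1.parameters()) + list(G2.parameters()) + list(domain_clf.parameters())
optimizer = torch.optim.SGD(params, lr=1e-3)

def train_step(img_s, img_g, lambda_sim=1.0, lambda_rep=1.0, lambda_dom=0.1):
    f_s, f_g = G1(img_s), G1(img_g)          # first / second low-dimensional feature maps
    f_fuse = fuse_low_dim(f_s, f_g)          # fused feature map

    high = G2(f_fuse)                        # high-dimensional features of the fusion
    l_det = high.mean() * 0                  # placeholder for the RPN/RCNN detection loss

    # Domain classification on source, target and fused low-dimensional features,
    # with gradients reversed before reaching the backbone.
    logits = torch.cat([domain_clf(grl(f)) for f in (f_s, f_g, f_fuse)])
    labels = torch.tensor([0, 1, 1])         # source=0, target/fused=1 (an assumption)
    l_dom = F.cross_entropy(logits, labels)

    l_sim = similarity_loss(f_s, f_fuse)     # semantic consistency constraint
    p1 = torch.tensor([[10., 10., 60., 60.]])    # toy candidate boxes for two targets
    p2 = torch.tensor([[40., 40., 90., 90.]])
    l_rep = mutual_exclusion_loss(p1, p2)

    total = l_det + lambda_sim * l_sim + lambda_rep * l_rep + lambda_dom * l_dom
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()

loss = train_step(torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224))
```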
In some embodiments, as shown in fig. 7, the present embodiment further provides a model training method based on the above target detection method, and a specific implementation flow of the method is as follows:
step 700, obtaining training samples of a detection model to be trained, wherein the training samples comprise a first sample set which is labeled and a second sample set which is not labeled; the first sample set and the second sample set both contain target objects, the second sample set containing scene features different from the first sample set;
in implementation, the training samples in the embodiment are different from the commonly labeled training samples, and unlabeled training samples under different scene characteristics are added on the basis of the labeled training samples, so that the accuracy of target detection under wider and more real scenes can be ensured in the training process, the detection performance is improved, and the data labeling cost is effectively saved.
Step 701, at each training iteration, extracting a first low-dimensional feature map of a first sample image in the first sample set and a second low-dimensional feature map of a second sample image in the second sample set, fusing the first low-dimensional feature map and the second low-dimensional feature map to obtain a fused feature map, and training an initial detection model with the fused feature map and the positions of the target objects labeled in the first sample image to obtain the trained detection model.
In implementation, at each training iteration a first sample image and a second sample image are input into the initial detection model, which extracts the first low-dimensional feature map of the first sample image and the second low-dimensional feature map of the second sample image; the two feature maps are then fused to obtain a fused feature map, and the initial detection model is trained with the fused feature map and the positions of the target objects labeled in the first sample image, yielding the trained detection model.
It should be noted that, since the position of the target object in the fused feature map after fusion does not change relative to the position of the target object in the first sample image before fusion, the detection performance of the detection model in the scene in the second sample set can be improved by training with the fused feature map and the position of the labeled target object in the corresponding first sample image.
In some embodiments, the detection model, upon each training, is further configured to:
determining a domain classification loss value according to the first low-dimensional feature map, the second low-dimensional feature map and the fusion feature map; and adjusting parameters of the detection model in the training process according to the domain classification loss value so as to extract the domain-invariant features.
In some embodiments, a gradient inversion layer is used to perform back propagation on a gradient corresponding to the domain classification loss value, and parameters of the detection model in a training process are adjusted to extract domain-invariant features.
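A minimal sketch of a gradient inversion (reversal) layer together with a three-way domain classifier over the source, target and fused low-dimensional feature maps is given below; the 1x1-convolution head, the class indices 0/1/2 and the scaling factor lambd are illustrative assumptions, not details fixed by this disclosure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) the gradient in the
    backward pass, so that minimizing the domain classification loss drives
    the backbone toward domain-invariant features."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


class DomainClassifier(nn.Module):
    """Classifies a low-dimensional feature map as source (0), target (1) or
    fused (2) and returns the cross-entropy domain classification loss."""

    def __init__(self, in_channels, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.head = nn.Sequential(
            nn.Conv2d(in_channels, 64, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, 3),
        )

    def forward(self, feat, domain_label):
        feat = GradReverse.apply(feat, self.lambd)
        logits = self.head(feat)
        target = torch.full((feat.size(0),), domain_label,
                            dtype=torch.long, device=feat.device)
        return F.cross_entropy(logits, target)
```

In such a sketch the domain classification loss value would be the sum of three calls, for example classifier(feat_src, 0) + classifier(feat_tgt, 1) + classifier(feat_fused, 2).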
In some embodiments, the detection model, upon each training, is further configured to:
determining a similarity loss value according to the semantic similarity of the first low-dimensional feature map and the fusion feature map; and adjusting parameters of the detection model in a training process according to the similarity loss value so as to constrain semantic consistency of the first low-dimensional feature map and the fusion feature map.
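The disclosure leaves the semantic similarity measure open; a minimal sketch using channel-flattened cosine similarity (an assumed choice, for illustration only) is:

```python
import torch.nn.functional as F


def semantic_similarity_loss(feat_src, feat_fused):
    """Similarity loss constraining semantic consistency between the first
    low-dimensional feature map and the fused feature map: the higher the
    cosine similarity, the lower the loss. Inputs are (N, C, H, W) tensors."""
    sim = F.cosine_similarity(feat_src.flatten(1), feat_fused.flatten(1), dim=1)
    return (1.0 - sim).mean()
```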
In some embodiments, the detection model, upon each training, is further configured to:
determining a candidate box set associated with a target object in the fusion feature map; determining a similar mutual exclusion loss value according to the candidate box sets associated with different target objects; and adjusting parameters of the detection model in the training process according to the similar mutual exclusion loss value so as to enlarge the distance between different candidate box sets.
In some embodiments, the similar mutual exclusion loss value is determined by:
determining a plurality of candidate box groups from a first candidate box set associated with a first target object and a second candidate box set associated with a second target object; and determining the similar mutual exclusion loss value according to the distance between the first candidate box and the second candidate box in each candidate box group.
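As a sketch of the similar mutual exclusion loss, one possible (assumed) choice is to pair every candidate box of the first set with every candidate box of the second set and to use their overlap as the distance surrogate, so that minimizing the loss pushes the two sets apart; the grouping strategy and the IoU-based distance are illustrative assumptions:

```python
import torch


def pair_iou(box_a, box_b):
    """IoU between two boxes given as (x1, y1, x2, y2) tensors."""
    x1 = torch.max(box_a[0], box_b[0])
    y1 = torch.max(box_a[1], box_b[1])
    x2 = torch.min(box_a[2], box_b[2])
    y2 = torch.min(box_a[3], box_b[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-6)


def mutual_exclusion_loss(boxes_a, boxes_b):
    """Similar mutual exclusion loss sketch: penalize the overlap of every
    (first-set, second-set) candidate box group so that candidate boxes
    matched to different targets are pushed apart. `boxes_a` and `boxes_b`
    are (K, 4) and (M, 4) tensors of candidate boxes."""
    losses = [pair_iou(a, b) for a in boxes_a for b in boxes_b]
    if not losses:
        return boxes_a.new_zeros(())
    return torch.stack(losses).mean()
```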
In some embodiments, during each training iteration the detection loss produced by classification and regression on the fusion feature map is calculated, together with the domain classification loss value, the similarity loss value and the similar mutual exclusion loss value; the parameters of the detection model are then adjusted using the total loss value, and when the total loss value meets a preset condition the training is determined to be complete, yielding the trained detection model.
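A minimal sketch of combining the four terms into the total loss used for the parameter update is given below; the loss weights are illustrative assumptions, since the disclosure only requires that all four values drive the adjustment until the preset condition on the total loss is met:

```python
def total_loss(det_loss, domain_loss, sim_loss, excl_loss,
               w_domain=0.1, w_sim=0.1, w_excl=0.1):
    """Total loss: detection (classification + regression) loss plus weighted
    domain classification, similarity and similar mutual exclusion terms."""
    return det_loss + w_domain * domain_loss + w_sim * sim_loss + w_excl * excl_loss
```

In a typical PyTorch training loop the returned value would simply be back-propagated (loss.backward()) before the optimizer step.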
The structure of the detection model in the training phase in this embodiment can refer to fig. 5 and the above description, and is not repeated here.
The method addresses the case where the data distributions of the source domain (the first sample set) and the target domain (the second sample set) differ significantly: unlabeled target-domain data is used directly, improving the performance of the detection model in the target domain, and the problem of false detections of target objects in dense scenes is alleviated at the same time. On top of the labeled data, a large amount of unlabeled data is used to train the detection model. Compared with separately pre-training a transformation network to transfer the style of the source-domain data to the target domain, the method performs separation and fusion of features directly in the shallow layers of the network, which simplifies the training process, enhances the low-dimensional features of the target domain, and achieves low-cost domain transfer. By separating and fusing low-dimensional features across domains and imposing a semantic consistency constraint, the unique features of the target domain are enhanced on top of the extracted domain-invariant features, which improves the detector's performance in the target domain and stabilizes detection across different scenes.
In addition, the domain classifier distinguishes among the source domain, the target domain and the fused features, which further helps the backbone of the detection model extract domain-invariant features. In the training stage of the detection model, on top of regressing the candidate boxes to the target boxes, a similar mutual exclusion loss is introduced to enlarge the distance between different candidate boxes, so that candidate boxes matched to different targets are kept as far apart as possible; this improves the quality of the detector's candidate boxes and alleviates false detections of target objects in dense scenes.
Embodiment 2: based on the same inventive concept, an embodiment of the present invention further provides a device for target detection. Since the device corresponds to the method in the embodiment of the present invention and solves the problem on a similar principle, the implementation of the device may refer to the implementation of the method, and repeated details are not described again.
As shown in fig. 8, the apparatus includes a processor 800 and a memory 801; the memory 801 is configured to store programs executable by the processor 800, and the processor 800 is configured to read the programs in the memory 801 and execute the following steps:
acquiring a target image to be detected;
inputting the target image into a trained detection model, and outputting the position of a target object in the target image;
the detection model is obtained by training with a labeled first sample set and an unlabeled second sample set; during each training iteration, a first low-dimensional feature map of a first sample image in the first sample set and a second low-dimensional feature map of a second sample image in the second sample set are extracted, the first low-dimensional feature map and the second low-dimensional feature map are fused to obtain a fused feature map, and the fused feature map together with the position of the target object labeled in the first sample image is used for training; wherein the first sample set and the second sample set both contain target objects, and the second sample set contains scene characteristics different from those of the first sample set.
As an alternative implementation, the processor 800 is specifically configured to perform:
determining the fused feature map according to a first mean value and a first standard deviation of the first low-dimensional feature map on each channel of the feature extraction layer and a second mean value and a second standard deviation of the second low-dimensional feature map on each channel of the feature extraction layer;
wherein the feature extraction layer is a convolution layer for extracting low-dimensional features in the detection model.
As an alternative implementation, the processor 800 is specifically configured to perform:
normalizing the first low-dimensional feature map according to the first mean value and the first standard deviation to obtain a normalized first low-dimensional feature map;
and determining the fusion feature map by taking the second mean value as the offset of the normalized first low-dimensional feature map and the second standard deviation as the scaling weight of the normalized first low-dimensional feature map.
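This per-channel normalization and re-statistics step corresponds to an adaptive-instance-normalization-style transfer; a minimal sketch, assuming (N, C, H, W) feature tensors and an added epsilon for numerical stability (the epsilon is an assumption of this example), is:

```python
import torch


def fuse_low_dim_features(feat_src, feat_tgt, eps=1e-5):
    """Normalize the first low-dimensional feature map with its own per-channel
    mean and standard deviation, then scale by the second map's standard
    deviation and shift by the second map's mean to obtain the fused map."""
    mean_src = feat_src.mean(dim=(2, 3), keepdim=True)
    std_src = feat_src.std(dim=(2, 3), keepdim=True)
    mean_tgt = feat_tgt.mean(dim=(2, 3), keepdim=True)
    std_tgt = feat_tgt.std(dim=(2, 3), keepdim=True)
    normalized = (feat_src - mean_src) / (std_src + eps)
    return normalized * std_tgt + mean_tgt
```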
As an optional implementation, in each training of the detection model, the processor 800 is specifically further configured to perform:
determining a domain classification loss value according to the first low-dimensional feature map, the second low-dimensional feature map and the fusion feature map;
and adjusting parameters of the detection model in the training process according to the domain classification loss value so as to extract domain-invariant features.
As an alternative implementation, the processor 800 is specifically configured to perform:
and performing back propagation on the gradient corresponding to the domain classification loss value by using a gradient inversion layer, and adjusting parameters of the detection model in the training process so as to extract domain-invariant features.
As an optional implementation, in each training of the detection model, the processor 800 is specifically further configured to perform:
determining a similarity loss value according to the semantic similarity of the first low-dimensional feature map and the fusion feature map;
and adjusting parameters of the detection model in a training process according to the similarity loss value so as to constrain semantic consistency of the first low-dimensional feature map and the fusion feature map.
As an optional implementation, in each training of the detection model, the processor 800 is specifically further configured to perform:
determining a candidate box set associated with a target object in the fusion feature map;
determining a similar mutual exclusion loss value according to the candidate box sets associated with different target objects;
and adjusting parameters of the detection model in the training process according to the similar mutual exclusion loss value so as to enlarge the distance between different candidate box sets.
As an alternative implementation, the processor 800 is specifically configured to perform:
determining a plurality of candidate box groups from a first candidate box set associated with a first target object and a second candidate box set associated with a second target object;
and determining the similar mutual exclusion loss value according to the distance between the first candidate box and the second candidate box in each candidate box group.
Embodiment 3: based on the same inventive concept, an embodiment of the present invention further provides a device for target detection. Since the device corresponds to the method in the embodiment of the present invention and solves the problem on a similar principle, the implementation of the device may refer to the implementation of the method, and repeated details are not described again.
As shown in fig. 9, the apparatus includes:
an image acquiring unit 900, configured to acquire a target image to be detected;
a detection target unit 901, configured to input the target image into a trained detection model, and output the position of a target object in the target image; the detection model is obtained by training with a labeled first sample set and an unlabeled second sample set; during each training iteration, a first low-dimensional feature map of a first sample image in the first sample set and a second low-dimensional feature map of a second sample image in the second sample set are extracted, the first low-dimensional feature map and the second low-dimensional feature map are fused to obtain a fused feature map, and the fused feature map together with the position of the target object labeled in the first sample image is used for training; wherein the first sample set and the second sample set both contain target objects, and the second sample set contains scene characteristics different from those of the first sample set.
As an optional implementation manner, the detection target unit 901 is specifically configured to:
determining the fused feature map according to a first mean value and a first standard deviation of the first low-dimensional feature map on each channel of the feature extraction layer and a second mean value and a second standard deviation of the second low-dimensional feature map on each channel of the feature extraction layer;
wherein the feature extraction layer is a convolution layer for extracting low-dimensional features in the detection model.
As an optional implementation manner, the detection target unit 901 is specifically configured to:
normalizing the first low-dimensional feature map according to the first mean value and the first standard deviation to obtain a normalized first low-dimensional feature map;
and determining the fusion feature map by taking the second mean value as the offset of the normalized first low-dimensional feature map and the second standard deviation as the scaling weight of the normalized first low-dimensional feature map.
As an optional implementation manner, in each training of the detection model, the detection target unit 901 is further specifically configured to:
determining a domain classification loss value according to the first low-dimensional feature map, the second low-dimensional feature map and the fusion feature map;
and adjusting parameters of the detection model in the training process according to the domain classification loss value so as to extract the domain-invariant features.
As an optional implementation manner, the detection target unit 901 is specifically configured to:
and performing back propagation on the gradient corresponding to the domain classification loss value by using a gradient inversion layer, and adjusting parameters of the detection model in the training process so as to extract domain-invariant features.
As an optional implementation manner, in each training of the detection model, the detection target unit 901 is further specifically configured to:
determining a similarity loss value according to the semantic similarity of the first low-dimensional feature map and the fusion feature map;
and adjusting parameters of the detection model in a training process according to the similarity loss value so as to constrain semantic consistency of the first low-dimensional feature map and the fusion feature map.
As an optional implementation manner, in each training of the detection model, the detection target unit 901 is further specifically configured to:
determining a candidate box set associated with a target object in the fusion feature map;
determining a similar mutual exclusion loss value according to the candidate box sets associated with different target objects;
and adjusting parameters of the detection model in the training process according to the similar mutual exclusion loss value so as to enlarge the distance between different candidate box sets.
As an optional implementation manner, the detection target unit 901 is specifically configured to:
determining a plurality of candidate box groups from a first candidate box set associated with a first target object and a second candidate box set associated with a second target object;
and determining the similar mutual exclusion loss value according to the distance between the first candidate box and the second candidate box in each candidate box group.
Based on the same inventive concept, an embodiment of the present invention further provides a computer storage medium, on which a computer program is stored, which when executed by a processor implements the following steps:
acquiring a target image to be detected;
inputting the target image into a trained detection model, and outputting the position of a target object in the target image;
the detection model is obtained by training with a labeled first sample set and an unlabeled second sample set; during each training iteration, a first low-dimensional feature map of a first sample image in the first sample set and a second low-dimensional feature map of a second sample image in the second sample set are extracted, the first low-dimensional feature map and the second low-dimensional feature map are fused to obtain a fused feature map, and the fused feature map together with the position of the target object labeled in the first sample image is used for training; wherein the first sample set and the second sample set both contain target objects, and the second sample set contains scene characteristics different from those of the first sample set.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (11)

1. A method of object detection, the method comprising:
acquiring a target image to be detected;
inputting the target image into a trained detection model, and outputting the position of a target object in the target image;
the detection model is obtained by training with a labeled first sample set and an unlabeled second sample set; during each training iteration, a first low-dimensional feature map of a first sample image in the first sample set and a second low-dimensional feature map of a second sample image in the second sample set are extracted, the first low-dimensional feature map and the second low-dimensional feature map are fused to obtain a fused feature map, and the fused feature map together with the position of the target object labeled in the first sample image is used for training; wherein the first sample set and the second sample set both contain target objects, and the second sample set contains scene characteristics different from those of the first sample set.
2. The method according to claim 1, wherein the fusing the first low-dimensional feature map and the second low-dimensional feature map to obtain a fused feature map comprises:
determining the fused feature map according to a first mean value and a first standard deviation of the first low-dimensional feature map on each channel of the feature extraction layer and a second mean value and a second standard deviation of the second low-dimensional feature map on each channel of the feature extraction layer;
wherein the feature extraction layer is a convolution layer for extracting low-dimensional features in the detection model.
3. The method of claim 2, wherein determining the fused feature map comprises:
normalizing the first low-dimensional feature map according to the first mean value and the first standard deviation to obtain a normalized first low-dimensional feature map;
and determining the fusion feature map by taking the second mean value as the offset of the normalized first low-dimensional feature map and the second standard deviation as the scaling weight of the normalized first low-dimensional feature map.
4. The method of claim 1, wherein, upon each training of the detection model, the method further comprises:
determining a domain classification loss value according to the first low-dimensional feature map, the second low-dimensional feature map and the fusion feature map;
and adjusting parameters of the detection model in the training process according to the domain classification loss value so as to extract the domain-invariant features.
5. The method of claim 4, wherein the adjusting parameters of the detection model in the training process according to the domain classification loss value to extract domain-invariant features comprises:
and performing back propagation on the gradient corresponding to the domain classification loss value by using a gradient inversion layer, and adjusting parameters of the detection model in the training process so as to extract domain-invariant features.
6. The method of claim 1, wherein, upon each training of the detection model, the method further comprises:
determining a similarity loss value according to the semantic similarity of the first low-dimensional feature map and the fusion feature map;
and adjusting parameters of the detection model in a training process according to the similarity loss value so as to constrain semantic consistency of the first low-dimensional feature map and the fusion feature map.
7. The method of claim 1, wherein, upon each training of the detection model, the method further comprises:
determining a candidate box set associated with a target object in the fusion feature map;
determining a similar mutual exclusion loss value according to the candidate box sets associated with different target objects;
and adjusting parameters of the detection model in the training process according to the similar mutual exclusion loss value so as to enlarge the distance between different candidate box sets.
8. The method of claim 7, wherein determining the similar mutual exclusion loss value according to the candidate box sets associated with different target objects comprises:
determining a plurality of candidate box groups from a first candidate box set associated with a first target object and a second candidate box set associated with a second target object;
and determining the similar mutual exclusion loss value according to the distance between the first candidate box and the second candidate box in each candidate box group.
9. A method for training a detection model, the method comprising:
acquiring training samples of a detection model to be trained, wherein the training samples comprise a labeled first sample set and an unlabeled second sample set; the first sample set and the second sample set both contain target objects, and the second sample set contains scene features different from those of the first sample set;
and extracting, during each training iteration, a first low-dimensional feature map of a first sample image in the first sample set and a second low-dimensional feature map of a second sample image in the second sample set, fusing the first low-dimensional feature map and the second low-dimensional feature map to obtain a fused feature map, and training an initial detection model with the fused feature map and the position of the target object labeled in the first sample image to obtain a trained detection model.
10. An apparatus for object detection, comprising a processor and a memory, the memory storing a program executable by the processor, the processor being configured to read the program from the memory and to perform the steps of the method of any one of claims 1 to 8 or 9.
11. A computer storage medium having a computer program stored thereon, the program, when executed by a processor, implementing the steps of the method according to any one of claims 1 to 8 or 9.
CN202210807357.5A 2022-07-11 2022-07-11 Target detection method and device Withdrawn CN114882372A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210807357.5A CN114882372A (en) 2022-07-11 2022-07-11 Target detection method and device

Publications (1)

Publication Number Publication Date
CN114882372A true CN114882372A (en) 2022-08-09

Family

ID=82683478

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210807357.5A Withdrawn CN114882372A (en) 2022-07-11 2022-07-11 Target detection method and device

Country Status (1)

Country Link
CN (1) CN114882372A (en)


Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104951800A (en) * 2015-06-15 2015-09-30 许昌学院 Resource exploitation-type area-oriented remote sensing image fusion method
CN110414631A (en) * 2019-01-29 2019-11-05 腾讯科技(深圳)有限公司 Lesion detection method, the method and device of model training based on medical image
WO2022111219A1 (en) * 2020-11-30 2022-06-02 华南理工大学 Domain adaptation device operation and maintenance system and method
CN112465140A (en) * 2020-12-07 2021-03-09 电子科技大学 Convolutional neural network model compression method based on packet channel fusion
CN112541928A (en) * 2020-12-18 2021-03-23 上海商汤智能科技有限公司 Network training method and device, image segmentation method and device and electronic equipment
CN112651441A (en) * 2020-12-25 2021-04-13 深圳市信义科技有限公司 Fine-grained non-motor vehicle feature detection method, storage medium and computer equipment
CN112801236A (en) * 2021-04-14 2021-05-14 腾讯科技(深圳)有限公司 Image recognition model migration method, device, equipment and storage medium
US20220189083A1 (en) * 2021-09-09 2022-06-16 Beijing Baidu Netcom Science Technology Co., Ltd. Training method for character generation model, character generation method, apparatus, and medium
CN114187550A (en) * 2021-12-10 2022-03-15 电子科技大学长三角研究院(湖州) Bow net core part identification method based on improved YOLO V3 network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG JIAWEI et al.: "Decouple and Reconstruct: Mining Discriminative Features for Crossdomain Object Detection", ICLR 2022 Conference *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116883765A (en) * 2023-09-07 2023-10-13 腾讯科技(深圳)有限公司 Image classification method, device, electronic equipment and storage medium
CN116883765B (en) * 2023-09-07 2024-01-09 腾讯科技(深圳)有限公司 Image classification method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
Chen et al. Disparity-based multiscale fusion network for transportation detection
Wang et al. Embedding structured contour and location prior in siamesed fully convolutional networks for road detection
CN108961327A (en) A kind of monocular depth estimation method and its device, equipment and storage medium
CN105512683B (en) Object localization method and device based on convolutional neural networks
Li et al. Robust visual tracking based on convolutional features with illumination and occlusion handing
CN113240691A (en) Medical image segmentation method based on U-shaped network
CN112699834B (en) Traffic identification detection method, device, computer equipment and storage medium
CN110827312B (en) Learning method based on cooperative visual attention neural network
CN112036514B (en) Image classification method, device, server and computer readable storage medium
Xing et al. Traffic sign recognition using guided image filtering
CN113221770B (en) Cross-domain pedestrian re-recognition method and system based on multi-feature hybrid learning
CN110222572A (en) Tracking, device, electronic equipment and storage medium
CN110852327A (en) Image processing method, image processing device, electronic equipment and storage medium
CN114549557A (en) Portrait segmentation network training method, device, equipment and medium
CN112861776A (en) Human body posture analysis method and system based on dense key points
Wang et al. Combining semantic scene priors and haze removal for single image depth estimation
CN114882372A (en) Target detection method and device
CN114596548A (en) Target detection method, target detection device, computer equipment and computer-readable storage medium
CN116824333A (en) Nasopharyngeal carcinoma detecting system based on deep learning model
Mao et al. Power transmission line image segmentation method based on binocular vision and feature pyramid network
Krajewski et al. VeGAN: Using GANs for augmentation in latent space to improve the semantic segmentation of vehicles in images from an aerial perspective
Tabelini et al. Deep traffic sign detection and recognition without target domain real images
Lin et al. Mlf-det: Multi-level fusion for cross-modal 3d object detection
CN112634331A (en) Optical flow prediction method and device
Huang et al. An anti-occlusion and scale adaptive kernel correlation filter for visual object tracking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20220809