CN115937596A

CN115937596A - Target detection method, training method and device of model thereof, and storage medium

Info

Publication number: CN115937596A
Application number: CN202211644437.XA
Authority: CN
Inventors: 刘天赐; 程博; 邵明
Original assignee: Zhejiang Dahua Technology Co Ltd
Current assignee: Zhejiang Dahua Technology Co Ltd
Priority date: 2022-12-20
Filing date: 2022-12-20
Publication date: 2023-04-07

Abstract

The application discloses a target detection method and a training method, equipment and a storage medium of a model thereof, wherein the training method of the target detection model comprises the following steps: acquiring a sample image, wherein the sample image is marked with at least one target type contained in the sample image; determining target area characteristics of each target type in the sample image by using a target detection model, wherein the target area characteristics of the target type are used for representing image data characteristics in the target area, and the target area is an area where a target corresponding to the target type is located in the sample image; comparing the target area features of each target type with the reference area features of the plurality of types respectively to obtain a plurality of comparison results, wherein the reference area features of the plurality of types comprise the reference area features of each type in the plurality of historical sample images; determining a target loss based on the plurality of alignment results; and adjusting parameters in the target detection model by using the target loss. By the scheme, the accuracy of subsequent target detection can be improved.

Description

Target detection method, training method and device of model thereof, and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a target detection method, a training method for a model thereof, a device, and a storage medium.

Background

The image processing method based on deep learning is the technical field of big fire in recent years, and the target positioning and target detection technology is taken as the mainstream technical method in the field, has been widely applied to the fields of intelligent transportation, intelligent medical treatment, intelligent city management, household intelligent service and the like, and brings safer, more comfortable and more convenient service for the life of people. The mainstream target detection technology needs to label the position and the type of a target in training data, extract the characteristic of a convolutional neural network memorability image or video, and perform classification and regression tasks to respectively obtain the type and the position information of the target in the image, so as to position and identify the target required in a project. For the target detection method, training is carried out by relying on carefully labeled data, the labeling of the data consumes a large amount of labor and time, the image labeling work in some professional fields, such as the fields of intelligent medical treatment and industrial defect detection, and the targets in the images can be labeled correctly by more specialized personnel, so that the labeling cost is increased undoubtedly.

Disclosure of Invention

The application at least provides a target detection method and a training method, equipment and a storage medium of a model thereof.

The application provides a training method of a target detection model, which comprises the following steps: acquiring a sample image, wherein the sample image is marked with at least one target type contained in the sample image; determining target area characteristics of each target type in the sample image by using a target detection model, wherein the target area characteristics of the target type are used for representing image data characteristics in the target area, and the target area is an area where a target corresponding to the target type is located in the sample image; comparing the target area features of each target type with the reference area features of the plurality of types respectively to obtain a plurality of comparison results, wherein the reference area features of the plurality of types comprise the reference area features of each type in the plurality of historical sample images; determining a target loss based on the plurality of alignment results; and adjusting parameters in the target detection model by using the target loss.

The application provides a target detection method, which comprises the following steps: acquiring a target image; carrying out target detection on the target image by using a target detection network to obtain an initial detection result corresponding to the target image, wherein the initial detection result comprises a confidence coefficient that a target area contained in the target image contains a target; the target detection network is obtained by training the training method; and determining a target detection result of the target image based on the corresponding confidence of each target area.

The application provides a training device of target detection model, includes: the system comprises a sample image acquisition module, a data acquisition module and a data processing module, wherein the sample image acquisition module is used for acquiring a sample image, and the sample image is marked with at least one target type contained in the sample image; the region feature acquisition module is used for determining target region features of each target type in the sample image by using the target detection model, wherein the target region features of the target types are used for representing image data features in the target regions, and the target regions are regions where targets corresponding to the target types are located in the sample image; the characteristic comparison module is used for comparing the target area characteristics of each target type with the reference area characteristics of a plurality of types respectively to obtain a plurality of comparison results, wherein the reference area characteristics of the plurality of types comprise the reference area characteristics of each type in a plurality of historical sample images; a loss determination module for determining a target loss based on the plurality of comparison results; and the parameter adjusting module is used for adjusting the parameters in the target detection model by using the target loss.

The application provides a target detection device, includes: the target image acquisition module is used for acquiring a target image; the detection module is used for carrying out target detection on the target image by using a target detection network to obtain an initial detection result corresponding to the target image, wherein the initial detection result comprises a confidence coefficient that a target area contained in the target image contains a target; the target detection network is obtained by training the training method; and the processing module is used for determining a target detection result of the target image based on the corresponding confidence coefficient of each target area.

The application provides an electronic device, which comprises a memory and a processor, wherein the processor is used for executing program instructions stored in the memory so as to realize the target detection method or the training method of the target detection model.

The present application provides a computer-readable storage medium having stored thereon program instructions that, when executed by a processor, implement the above-mentioned object detection method or the above-mentioned training method of an object detection model.

According to the scheme, the target detection model is trained by only using the sample image marked with the type of the contained target, the position of the target does not need to be marked, and the data marking amount can be reduced. In addition, target region features of the sample image are extracted in the training process, and the extracted target region features are compared with reference region features of various categories in other images for learning, so that feature differences in the categories can be reduced, the feature differences among the categories can be increased, and a target detection result obtained by a subsequent target detection model is more accurate.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and, together with the description, serve to explain the principles of the application.

FIG. 1 is a schematic flow chart diagram illustrating an embodiment of a target detection method of the present application;

FIG. 2 is a schematic sub-flow diagram illustrating step S12 according to an embodiment of the training method for the target detection model of the present application;

FIG. 3 is a schematic diagram of the structure of an object detection model of the present application;

FIG. 4 is a schematic diagram of a position information attention mechanism network shown in the training method of the object detection model of the present application;

FIG. 5 is a schematic flow chart diagram illustrating an embodiment of a target detection method of the present application;

FIG. 6 is another schematic flow chart diagram illustrating an embodiment of a target detection method of the present application;

FIG. 7 is a schematic structural diagram of an embodiment of a training apparatus for an object detection model according to the present application;

FIG. 8 is a schematic structural diagram of an embodiment of an object detection apparatus of the present application;

FIG. 9 is a schematic structural diagram of an embodiment of an electronic device of the present application;

FIG. 10 is a schematic structural diagram of an embodiment of a computer-readable storage medium of the present application.

Detailed Description

The embodiments of the present application will be described in detail below with reference to the drawings.

In the following description, for purposes of explanation and not limitation, specific details are set forth such as particular system structures, interfaces, techniques, etc. in order to provide a thorough understanding of the present application.

The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship. Further, the term "plurality" herein means two or more than two. Additionally, the term "at least one" herein means any one of a variety or any combination of at least two of a variety, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.

Referring to fig. 1, fig. 1 is a schematic flow chart of an embodiment of a target detection method according to the present application.

As shown in fig. 1, the target detection method provided in the embodiment of the present disclosure may include the following steps:

step S11: a sample image is acquired.

The sample image is marked with at least one target type contained by the sample image. The object type refers to the type of the object. Illustratively, the target type can be any target needing detection, such as a human body, a human face, a vehicle, a sheep and the like. For example, if a vehicle is included in the sample image, the sample image may be marked to include an identification of the vehicle. Illustratively, the number of sample images may be a single or multiple, and the target detection model is trained using a batch of sample images, for example.

Step S12: target region features for each target type in the sample image are determined using a target detection model.

Wherein the target area characteristic of the target type is used for representing the image data characteristic in the target area. The target area is the area of the target corresponding to the target type in the sample image. Illustratively, the target detection model determines a target area where a target corresponding to the target type is located in the sample image, and then determines image data characteristics of the target area as the target area characteristics.

Step S13: and comparing the target area characteristics of each target type with the reference area characteristics of a plurality of types respectively to obtain a plurality of comparison results.

Several means one or more. The same type as the target type may or may not exist in the several types. The plurality of types of reference region features include reference region features for each type in the plurality of historical sample images. Illustratively, the plurality of types may include sheep, the target type is a vehicle, the difference between the sheep and the vehicle in the image is large in terms of appearance and the like, and if the comparison result between the extracted regional features is that the two are relatively close, it is obvious that the feature extraction has a problem, so that the target loss is determined based on the comparison result between the two, the feature difference between the types can be increased, and the subsequent target detection result is more accurate. Or, the plurality of types include vehicles, the target type is a vehicle, the difference between the two types on the image should be small, and if the comparison result between the extracted regional features is that the two types are far, it is obvious that the feature extraction has a cliff, so that the target loss is determined based on the comparison result between the two types, the feature difference in the type can be reduced, and the subsequent target detection result is more accurate.

Step S14: based on the plurality of alignments, a target loss is determined.

Specifically, the target loss may be determined according to a comparison result between the target region feature of the target category and the reference region feature of each category.

Step S15: and adjusting parameters in the target detection model by using the target loss.

The method for adjusting the parameters in the model by using the target loss may be any method, such as back propagation, and is not limited in this respect.

According to the scheme, the target detection model is trained by only using the sample image marked with the type of the contained target, the position of the target does not need to be marked, and the data marking amount can be reduced. In addition, the target region features of the sample image are extracted in the training process, and the extracted target region features are compared with the reference region features of all classes in other images for learning, so that the feature difference in the classes can be reduced, the feature difference among the classes can be increased, and the target detection result obtained by a subsequent target detection model is more accurate.

In some embodiments, the sample image includes an original image and an enhanced image obtained by performing pixel enhancement processing on the original image. Optionally, the number of sample images is plural. For example, the original image includes a plurality of sheets. In some application scenarios, for a sample data set I, the sample data set includes a number of original images and category labels corresponding to the original images, where C is a total number of target categories in the sample data set. In the category label, the category position of the object included in the original image is 1, and the category position of the object not included is 0, for example, if the original image includes a category a, the category position corresponding to the original image in the label of the category a is 1. One or more categories may be present in an original image, and one or more objects of the same category may be present. In the training process, the data is sent to the network for training according to the set Batch number N as a whole. The original image and the enhanced image are trained in the same way, and are described in the following.

Before each Batch data is sent to the network training, each original image is subjected to random pixel enhancement to generate an enhanced image corresponding to the original image to form an image pair. The pixel enhancement mode is a mode such as Gaussian blur, random brightness change, random salt and pepper noise addition and the like or a random combination of several modes. Illustratively, the original image includes a plurality of sheets, and each original image corresponds to one enhanced image.

In some embodiments, the step S11 may include the following steps:

a plurality of original images are acquired, and whether the number of target original images is greater than or equal to a preset number is judged. The target original image is an original image which contains target categories, and the number of the target categories is smaller than or equal to a preset value. For example, if the preset value is 1, if the original image is marked with only the a category in the image, that is, the number of the target categories contained in the original image is 1, the original image is the target original image.

In some embodiments, in response to that the number of the target original images is greater than or equal to a preset number, the reference original images including other target categories and the target original images are subjected to stitching processing to obtain final original images, and the enhanced images of the reference original images and the enhanced images of the target original images are subjected to stitching processing to obtain final enhanced images.

For example, the target category included in the target original image is the a category, and the other categories may be any other than the a category, such as the B category and the C category. In some embodiments, the target original image may be stitched with several other reference original images. The number of reference original images may be set in advance. For example, in order to ensure the integrity of the images after image stitching, three reference original images may be used to perform stitching processing on the target original image, and in addition, target categories included in several reference original images may exist in the same target category or may all be different target categories.

In some application scenarios, for an original image with only one type of object (an object original image), random data stitching is performed according to a probability if the image is the object original image and meets the probability of stitching enhancement. The probability is specifically a ratio of the number of target original images to the total number of original images in one Batch, and the ratio is taken as the above-mentioned preset number. Three objects which contain different types of objects and are used as the original images of the objects are randomly searched for and spliced by 2-by-2 in a random order in the data of the same Batch to form a multi-class image. In other words, the reference original image may be other target original images as well. In addition, the label corresponding to the final original image obtained by splicing is the fusion of the labels of the four original images before splicing. Before splicing images, four original images needing to be spliced are respectively subjected to random pixel enhancement to generate enhanced images corresponding to the four original images, and then splicing of the enhanced images is carried out. The final original image and/or the final enhanced image which are generated finally are subjected to Gaussian blur to eliminate the boundary effect of splicing.

In some application scenes, the original images containing multi-class targets are directly subjected to random pixel enhancement to generate corresponding enhanced images, and then the enhanced images are used as comparison data and are sent to a network for learning. All the enhanced images are generated on line before each Epoch iteration, so that the original image data of the Batch generates enhanced image data of the Batch to form an image pair with the enhanced image data, the enhanced image data and the enhanced image data are respectively sent to the main network for learning, and the weight of the whole main network is shared in the learning process.

Referring to fig. 2, fig. 2 is a sub-flow diagram illustrating step S12 according to an embodiment of the training method of the object detection model of the present application. As shown in fig. 2, in some embodiments, the step S12 may include the following steps:

step S121: and performing initial feature extraction on the sample image by using the target detection model to obtain an initial feature map about each target type.

Referring to FIG. 3, FIG. 3 is a schematic structural diagram of a target detection model according to the present application. As shown in FIG. 3, the original data and the generated enhanced data are sent to the backbone network for feature extraction and semantic information learning, and FIG. 3 shows the last three feature extraction modules of the backbone network, which are F respectively ₁ 、F ₂ And F ₃ 。F ₁ 、F ₂ And F ₃ Are arranged in cascade. Wherein, F ₁ 、F ₂ And F ₃ The output can be regarded as an initial characteristic diagram. For example, a set of initial feature maps belongs to R ^H×W×(Cm) . Wherein m is a hyper-parameter, and is used for representing the number of channels of the initial feature map of a certain category, that is, the number of the initial feature maps corresponding to each target type. H and W are the height and width of the initial feature map, and C is used to represent the total number of object classes contained in all sample images, i.e., the total number of object classes contained in the sample images in the entire training set. In other words, in some scenarios, the object detection model may determine the initial feature map, the first feature map, and the second feature map for all classes in the sample image during the detection process.

Step S122: and performing advanced feature extraction on the initial feature map of each target type to obtain a first feature map related to each target type and a second feature map related to each target type.

The first feature map and the second feature map respectively contain position information of the target area. As in FIG. 3, the set of first feature maps may be the LIA of FIG. 3 ₁ The set of second feature maps may be the LIA of FIG. 3 ₂ . The position information of the target area may have a certain difference between the first feature map and the second feature map, for example, the position information in the second feature map may be more accurate.

In some embodiments, the manner of acquiring the first feature map and the second feature map in step S122 may be: and performing first feature extraction on the initial feature map to obtain a first feature map about each target type. And performing second feature extraction on the first feature map to obtain a second feature map about each target type. The number of channels of the first feature map of each target type is more than that of the channels of the second feature map of the corresponding target type. The number of channels of the first feature map of each target type is less than that of the channels of the initial feature map of each target type.

Illustratively, the number of channels of the initial feature map is mapped to C × k by using 1 × 1 convolution, and a first feature map set LIA corresponding to each sample image is generated ₁ ∈R ^H×W×() H and W are the height and width of the first feature map, C is used to represent the total number of object classes contained in all sample images, k is the hyperparameter and represents the number of channels of the first feature map for each object type. Optionally, k is less than or equal to m. In other words, by performing the first feature extraction on the initial feature map, the purpose is to compress the feature map of each target class into the first feature map set with the number of channels k.

The method for acquiring the first feature map may be: and performing first feature extraction on the initial feature map by using a first feature extraction module in the target detection model to obtain a first feature map about each target type.

Alternatively, the first feature extraction module may be designed based on a location information attention mechanism. The first feature extraction module can be seamlessly embedded into any backbone network to extract the position information again.

The manner of obtaining the second feature map may be: and performing second feature extraction on the first feature map by using a third feature extraction module in the target detection model to obtain second feature maps related to all target types. Wherein the third feature extraction module may be designed based on the location information attention mechanism. The third feature extraction module can be seamlessly embedded into any backbone network to extract the position information again.

Illustratively, the LIA to be generated ₁ Taking the features of each k channels in each C multiplied by k channel as a feature unit, respectively performing 1 multiplied by 1 convolution to compress the number of the channels from k to 1, and generating a second feature map set LIA ₂ ∈R ^H×W×C Where C is the number of target classes. And compressing the position information of each target category to 1 channel through second feature extraction, wherein the feature map of each channel comprises the position information of the corresponding target category.

Step S123: and processing each second feature graph respectively to generate an area mask corresponding to the target type.

In some embodiments, before performing step S123, the following steps may also be performed: and obtaining a sample detection result of the sample image by using the second feature map of each target type. The sample detection result includes a confidence that the target included in each target region of the sample image is of the target type. And then, selecting a second feature map with the confidence coefficient meeting a preset confidence coefficient condition from the second feature maps of the target types as a final second feature map. On this basis, the second feature maps are respectively processed, and a manner of generating the area mask corresponding to the target type may specifically be: and processing each final second feature graph respectively to generate an area mask corresponding to the related target type.

As described above, if the second feature map set includes the second feature maps corresponding to the C categories, the sample detection result may obtain the confidence that the target included in each category of the sample image is the corresponding category. The confidence may be a prediction probability. Illustratively, according to the size of the class prediction probability, the prediction label corresponding to each prediction probability is determined. For example, if the class prediction probability is greater than or equal to the preset probability, the prediction label corresponding to the class prediction probability is determined to be 1, otherwise, the prediction label is 0. The second feature map with the confidence level satisfying the preset confidence level condition may be selected as the final second feature map from the second feature maps of the target types, where the second feature map with the confidence level satisfying the preset confidence level condition is selected as the final second feature map. For example, if only 4 target categories are marked in the sample image, the number of the selected final second feature maps is at most four, and if the number of the prediction labels 1 is more than 4, the corresponding final second feature maps can be selected from high to low according to the prediction probability. In addition, for the situation that the target category contained in the label is not high and the classification confidence given by the target detection model is not high, the learning integrity of the semantic features by the current network is considered not high, so that the final second feature map is not determined for the target category. And the final second feature diagram obtained by determination is used for generating the area mask corresponding to the relevant type.

Specifically, the second feature maps are respectively processed, and a manner of generating the area mask corresponding to the target type may specifically be: and performing binarization processing on each second feature map to generate an area mask corresponding to each second feature map.

The selected final second feature map may be considered as a location information activation map of the relevant category. Carrying out binarization on each pixel value according to a set threshold value upsilon on the position activation graph, setting the value smaller than the threshold value upsilon as 0, setting the value larger than the threshold value upsilon as 1, and generating a mask M corresponding to the category _c ∈R ^H×W 。

Step S124: and for each target type, processing the first feature map corresponding to the target type by using the area mask to obtain the target area features corresponding to the target type.

In step S124, specifically, it may be: and performing mask average pooling on the first feature map by using the region mask to obtain target region features.

In other embodiments, mask Average Pooling (MAP) may be performed on the first feature MAPs corresponding to the respective categories to obtain target region features of the target categories. As described above, the total number of classes included in the entire training set is C, and if the sample image includes d classes, the mask average pooling processing is performed to obtain the target area feature f of the target class d _d Can refer to equation 1:

wherein M is _c (i, j) denotes a mask M _c ∈R ^H×W Mask at intermediate position (i, j). Wherein, LIA ₁ ∈R ^H×W×() ，LIA ₁ (i, j) is the pixel value at position (i, j) in the first feature map under a certain channel. k represents the number of k channels of the first feature map of the target class d, and the first feature map under each channelAll the characteristic graphs need to be processed to obtain f _d ∈R ^C×k . This way, the activation of the target area is achieved in the forward direction of the target detection model.

In some embodiments, before performing step S12, the following steps may also be performed: and respectively carrying out third feature extraction on the initial feature map of each target type to obtain a reference feature map of each target type. The reference feature map includes position information of the target area. And for each target type, processing the reference feature map of the target type by using the area mask to obtain the reference area feature corresponding to the target type. And then, updating the reference region characteristics of a plurality of types by using the reference region characteristics corresponding to each target type in the sample image. And the updated reference region characteristics of the plurality of types comprise the reference region characteristics of each target type in the sample image. On this basis, the step S12 may specifically include the following steps: and comparing the target area characteristics of each target type with the updated reference area characteristics of a plurality of types respectively to obtain a plurality of comparison results.

In some embodiments, the specific manner of obtaining the reference feature map may be: and performing third feature extraction on the initial feature map by using a second feature extraction module in the target detection model to obtain a reference feature map of each target type. Alternatively, the structure of the first feature extraction module may be the same as that of the second feature extraction module, specifically, the initial feature map for performing the third feature extraction and the initial feature map for obtaining the features of the target region are obtained by the same feature extraction module in the backbone network, for example, by using the feature extraction module F ₃ And determining the target region feature and the reference region feature by the output initial feature map. I.e. with F ₃ The network structures of the connected first characteristic extraction module and the second characteristic extraction module are the same.

That is, LIA 'is the reference feature map set of each type obtained by the second feature extraction module' ₁ ∈R ^H×W×(Ck) . The reference feature map is processed by using the region mask to obtain the reference region feature according to the above formula 1, whereAnd will not be described in detail.

The characteristic matrix is composed of a plurality of types of regional characteristics, each row in the characteristic matrix corresponds to one group of regional characteristics of one type, and the types corresponding to different rows are different. The comparison result includes a similarity matrix. The above-mentioned manner of comparing the target area features of each target type with the reference area features of several types to obtain a plurality of comparison results may be: and performing similarity calculation on the target area characteristics of each target type and the characteristic matrix to obtain a similarity matrix. Each row in the similarity matrix corresponds to the similarity between the target area characteristic of one target type and the history area characteristic of each type. On this basis, the above-mentioned manner for determining the target loss based on the plurality of alignment results may be: based on the similarity matrix, a target loss is determined.

For example, in the data of the same Batch, the reference region features in all sample images containing the class d participate in the feature buffer together

In or>

The structure of (1). In each Epoch iteration, it is asserted>

The reference area features of the whole Epoch image are stored to form a feature matrix which is used as a comparison learning correlation matrix of the area features and is used for calculating a similarity matrix Q of the target area features in each sample image relative to the reference area features in the image in the whole Epoch.

Assuming that the final data of one Batch generates the region-level feature vectors f corresponding to the n sample images, assume that the current feature buffer is

m is the number of all the feature vectors f', wherein m is more than or equal to n. For a Batch of data, generated->

Will count and->

The similarity Q of (2) is calculated as follows:

Q＝M _Batch ·M ^T ∈R ^n×m formula 2;

where T denotes transposition. One target area characteristic of each behavior in Q and current Batch data

The similarity of all the reference region features. Optionally, if the similarity is greater than the preset similarity, the corresponding categories of the two region features are considered to be the same category, binarization is performed on Q, the similarity of the characteristics of the same category is set to 1, and otherwise, the similarity is set to 0.

Wherein the target loss may be a cross-entropy loss. And (3) calculating Loss through a Cross Entropy Loss function (Cross Engine Loss, CE Loss), calculating gradient and propagating reversely, and updating parameters of the network.

By the method, in the process of learning the semantic features of the current Batch data, the target detection model can synthesize all previously learned region-level semantic features, so that the similarity between similar features is improved, the difference between different features is increased, and the model can learn the semantic information in the whole data more comprehensively and accurately. In addition, each iteration of the model refers to all learned semantic information, and the learning of semantic features cannot be influenced due to the particularity of certain Batch data.

In some embodiments, the structure of the first feature extraction module may be the same as that of the second feature extraction module. As in FIG. 3, a first feature extraction module E ₃ Is the same as the second feature extraction module. On this basis, the step S15 may include the steps of: and adjusting parameters of the first feature extraction module by using the target loss in a back propagation mode to obtain the updated parameters of the first feature extraction module. Then, the first feature is extractedAnd fusing the updated parameters of the module with the parameters to be updated in the second feature extraction module, and updating the second feature extraction module by using the fused parameters.

For the parameters of the second feature extraction module, momentum updating is adopted, the momentum updating time can be configured by self, and E ₃ All data are updated by back propagation after each substitution is completed, and the second feature extraction module can be used for comparing with the E ₃ The updating can be performed together, or after several iterations are performed, and the updating process can refer to formula 3:

θ '← γ θ' + (1- γ) θ equation 3;

wherein θ is E ₃ And theta' is a parameter of the second feature extraction module when the parameter is updated after the current iteration. The parameter updating in the second feature extraction module does not need back propagation, so that a large amount of calculation and storage of video memory are not needed.

As described above, in some embodiments, the sample image includes an original image and an enhanced image obtained by performing pixel enhancement processing on the original image. Then, the second feature map includes a first sub-feature map of each target type corresponding to the original image and a second sub-feature map of each target type corresponding to the enhanced image. On this basis, the training method of the target detection model provided by this embodiment may further include the following steps: and fusing the first sub-feature maps of the target types to obtain a first foreground feature map containing position information of a foreground in the original image. The foreground in the original image is the set of target regions. And fusing the second sub-feature maps of the target types to obtain a second foreground feature map containing position information of the foreground in the enhanced image. The foreground in the enhanced image is a collection of target regions. Then, based on the position difference of the foreground in the first foreground feature map and the second foreground feature map, the position difference loss is determined. On the basis of obtaining the location difference loss, the step S15 may specifically include: and adjusting parameters in the target detection model by using the target loss and the position difference loss.

For example, the manner of obtaining the first foreground feature map may beCorresponding the original image to LIA ₂ And adding in the channel direction to fuse the position information of the targets corresponding to all the categories. Then all the position information is normalized to be between {0,1} through a Sigmod function to generate a first foreground feature map LIA ₃ ，LIA ₃ Foreground position information activation map being class independent, that is to say in LIA ₃ Position information contained in as foreground, i.e. LIA ₃ ∈R ^H×W . The manner of obtaining the second foreground feature map is the same, and is not described herein again. The second feature maps may be fused by directly adding the feature maps or by performing weighted fusion.

To better understand the channel number variation process of generating the second feature map from the first feature map and then generating the corresponding first foreground feature map or the second foreground feature map, please refer to fig. 4, and fig. 4 is a schematic structural diagram of the location information attention mechanism network shown in the training method of the object detection model of the present application. That is, the first feature extraction module and the third feature extraction module belong to the location information attention mechanism network. As shown in FIG. 4, a first set of feature maps LIA ₁ ∈R ^H×W×() That is, the number of channels of the first feature map of each category in the first feature map set is k, the total number of channels is Ck, and the second feature map set LIA ₂ ∈R ^H×W×C That is, the number of channels of the second feature map of each category in the second feature map set is 1, the total number of channels is C, and then the second feature maps of each category are fused to obtain the foreground feature map LIA ₃ ∈R ^H×W That is, the foreground feature map is independent of the category and the total number of channels is 1. The formula for obtaining the foreground feature map from the initial feature map can refer to formula 4:

wherein the content of the first and second substances,

showing the mapping of the initial feature map F to the final LIA ₃ The process of (1).

In some embodiments, the number of initial feature maps is multiple. The plurality of initial feature maps are obtained by a plurality of cascade-arranged first feature extraction modules in the target detection model respectively, and as described above, the first feature extraction modules may include F ₁ 、F ₂ And F ₃ The initial feature map may include F ₁ Corresponding initial characteristic map, F ₂ Corresponding initial feature map and F ₃ Corresponding initial feature maps. The number of the second characteristic diagrams is multiple. That is, the second characteristic diagram includes F ₁ Corresponding initial characteristic map, F ₂ Corresponding second characteristic diagram and F ₃ A corresponding second profile. The number of the first foreground feature maps and the number of the second foreground feature maps are the same as the number of the initial feature maps, and the first foreground feature maps and the second foreground feature maps are in one-to-one correspondence respectively. That is, the first foreground feature map includes F ₁ Corresponding first foreground feature map, F ₂ Corresponding first foreground feature map and F ₃ And (4) corresponding first foreground feature maps. The second foreground feature map includes F ₁ Corresponding second foreground feature map, F ₂ Corresponding second foreground feature map and F ₃ And (4) corresponding second foreground feature maps. The second foreground feature map includes F ₁ Corresponding second foreground feature map, F ₂ Corresponding second foreground feature map and F ₃ A corresponding second foreground feature map. Wherein, F ₁ The corresponding first foreground feature map and the second foreground feature map correspond to each other, F ₂ The corresponding first foreground feature map and second foreground feature map correspond to each other, and F ₃ The corresponding first foreground feature map and the second foreground feature map correspond to each other. On this basis, the above-mentioned manner for determining the position difference loss based on the position difference between the first foreground feature map and the second foreground feature map with respect to the foreground may be: and determining position difference loss based on the difference between each first foreground feature map and the corresponding second foreground feature map.

Wherein, the way of calculating the position difference loss between the first foreground feature map and the second foreground feature map may be to calculate the L2 loss. For example, the L2 loss can be calculated by referring to equation 5:

wherein, F _k And

initial feature maps representing an original image and an enhanced image, respectively, where k =1,2,3, where k = 1->

Is represented by F ₁ Corresponding loss of position difference, k =2 @>

Is shown as F ₂ Corresponding loss of position difference, k =3 ″, in>

Is represented by F ₃ The corresponding position difference is lost. (i, j) represents the position of a pixel point on the foreground feature map. />

Represents a first foreground feature map, and>

representing a second foreground feature map. W, H represents the width and height, respectively, of the foreground feature map.

And (3) mutually constraining feature activation graphs (a first foreground feature graph and a second foreground feature graph) generated at different depths of the network, and guiding the network to learn more abstract semantic features instead of features such as shallow edges, colors and textures.

In some embodiments, the training method of the target detection model provided in this embodiment may further include the following steps: and determining a sample detection result of the sample image by using the target detection model. The sample detection result includes a confidence that the target included in each target region of the sample image is of the target type. And determining the binary cross entropy loss corresponding to the sample detection result. On this basis, with the target loss, the method of adjusting the parameters in the target detection model may be: and adjusting parameters in the target detection model by combining the target loss and the binary cross entropy loss.

Wherein, the detection result of the obtained sample can be obtained from the second characteristic diagram. Illustratively, for the last layer of feature extraction modules (e.g., F as described above) ₃ ) The obtained LIA2 is subjected to Global Average Pooling (GAP), and then the generated feature vectors are compressed into vectors with the length of C by using a full-link layer as a final confidence coefficient, wherein the confidence coefficient can be a class prediction probability, and the C class prediction probabilities are finally obtained. The prediction probability of the C categories may be obtained by referring to formula 6 and formula 7:

wherein the content of the first and second substances,

showing the mapping of the initial feature map F to the final LIA ₂ The process of (1). />

Indicates that the content in parentheses is subjected to global mean pooling (GAP) processing, and ` H `>

Indicating that the generated feature vector is compressed into a vector of length C using the fully connected layer. Illustratively, according to the magnitude of the class prediction probability, the prediction label corresponding to each prediction probability is determined. For example, if the class prediction probability is greater than or equal to the predetermined probability, the class prediction probability is determinedThe corresponding prediction tag is 1, otherwise it is 0.

Since the labels in the sample image are only class labels, a Binary Cross Entropy Loss function (BCE Loss) is used for Loss calculation. The specific way of performing binary cross entropy loss calculation can refer to formula 8:

where C represents the total number of categories of data, i.e. the length of the label vector y, y and

respectively representing a prediction label and a real class label output by the target detection model. As described above, when a sample image is labeled, if a type a exists in the sample image, the position of the sample image corresponding to the type a in the label is 1, and if not, the position of the sample image corresponding to the type a is 0. That is, each real tag may be 0 or 1.

In some embodiments, parameters of the model are adjusted using the target penalty, the location difference penalty, and the binary cross entropy penalty together. The parameters of the model may be adjusted in common by sequentially using the target losses, or by fusing the losses and adjusting the parameters using the fused losses. The manner in which the parameters of the model are adjusted in relation to the losses is not specified here.

In some application scenarios, when forward reasoning is performed on the trained target detection model, only the LIA needs to be generated ₁ And LIA ₂ And selecting a final second feature map through the classification confidence, thresholding the final second feature map by setting a threshold value tau, segmenting a region with higher confidence, and restoring the region into an original image to generate a minimum enclosing rectangle as a final detection result.

In this embodiment, the original image and the enhanced image are simultaneously sent to the target detection model for learning, and the original image and the enhanced image are simultaneously sent to different stages of the model for learningGenerating corresponding location information attention LIA ₁ 、LIA ₂ And LIA ₃ ，LIA ₃ LIA of class-agnostic foreground position information activation map, original image and enhanced image at depth of model blind ₃ And (4) using the position difference loss for constraint, and maximally retaining the position information of the target in the network. While the deepest layer (e.g. F) ₃ ) Feature generation LIA of ₁ 、LIA ₂ The method comprises the steps of generating region-level semantic features and constructing a Memory Bank (reference region features of a plurality of categories), wherein the Memory Bank divides semantic information of different categories into different groups to serve as references of the semantic information, and after the Memory Bank is constructed, a next model can guide the learning of the semantic features, so that the inter-category difference is maximized and the intra-category difference is minimized. And finally, after the model training is finished, directly carrying out LIA according to the characteristics of the deepest layer ₂ And selecting a target feature activation graph (a second feature graph) to be activated according to the classification confidence, and enabling the feature activation graph to act back to the original graph after thresholding so as to realize a detection task of the target of the corresponding category.

The main body of the target detection method may be a target detection apparatus, for example, the target detection apparatus may be a terminal device or a server or other processing device, where the terminal device may be a monitoring device, a network video recorder, a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like in a security system. In some possible implementations, the object detection method may be implemented by a processor invoking computer readable instructions stored in a memory.

Referring to fig. 5, fig. 5 is a schematic flowchart illustrating an embodiment of a target detection method according to the present application.

As shown in fig. 5, the target detection method provided in this embodiment includes the following steps:

step S21: and acquiring a target image.

The target image may be obtained by shooting by an executing device executing the target detection method, or may be transmitted to the executing device by another device. The target image may be an image including the target. The target may be any target that needs to be detected, such as a human body, a vehicle, etc.

Step S22: and carrying out target detection on the target image by using a target detection network to obtain an initial detection result corresponding to the target image, wherein the initial detection result comprises a confidence coefficient that a target area contained in the target image contains the target.

The obtaining method of the initial detection result may refer to a related obtaining method of the initial detection result in the training method embodiment of the target detection model, which is not described herein too much.

Step S23: and determining a target detection result of the target image based on the confidence degree corresponding to each target area.

For example, if the confidence is high, it is determined that the target exists in the target image.

For better understanding of the process of target detection, please refer to fig. 6, and fig. 6 is another schematic process diagram of an embodiment of the target detection method of the present application.

As shown in fig. 6, a target image is input into a feature extraction module in a backbone network to perform general feature extraction, a result output by the last feature extraction module is used as an initial feature, then the result is passed through a first feature extraction module E3 to obtain a first feature graph set, then the feature extraction is performed on the first feature graph set to obtain a second feature graph set, GAP and a full connection layer are performed on the second feature graph set to obtain a vector with a length of C as a confidence corresponding to each category, then the category with the confidence higher than the preset confidence is selected as a final category, a second feature graph corresponding to the final category is determined, then the second feature graph is thresholded according to a set threshold τ so as to segment a region with a higher confidence, and the segmented second feature graph is restored into an original image to generate a minimum bounding rectangle, so as to obtain a target detection result.

The main body of the target detection method may be a target detection apparatus, for example, the target detection apparatus may be a terminal device or a server or other processing device, where the terminal device may be a monitoring device, a network video recorder, a User Equipment (UE), a mobile device, a User terminal, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like in a security system. In some possible implementations, the object detection method may be implemented by a processor calling computer readable instructions stored in a memory.

Referring to fig. 7, fig. 7 is a schematic structural diagram of an embodiment of a training apparatus for a target detection model according to the present application. The training device 30 for the target detection model includes a sample image obtaining module 31, a region feature obtaining module 32, a feature comparing module 33, a loss determining module 34, and a parameter adjusting module 35. A sample image obtaining module 31, configured to obtain a sample image, where the sample image is marked with at least one target type included in the sample image; the region feature acquiring module 32 is configured to determine, by using the target detection model, a target region feature of each target type in the sample image, where the target region feature of the target type is used to represent an image data feature in the target region, and the target region is a region where a target corresponding to the target type is located in the sample image; the characteristic comparison module 33 is configured to compare the target area characteristics of each target type with the reference area characteristics of multiple types, respectively, to obtain multiple comparison results, where the reference area characteristics of multiple types include reference area characteristics of each type in multiple historical sample images; a loss determination module 34 for determining a target loss based on the plurality of alignment results; and a parameter adjusting module 35, configured to adjust parameters in the target detection model by using the target loss.

The functions of each module can be described in the embodiment of the target detection method, and are not described herein again.

Referring to fig. 8, fig. 8 is a schematic structural diagram of an embodiment of the object detection device according to the present application. The object detection device 40 includes an object image acquisition module 41, a detection module 42, and a processing module 43. A target image acquisition module 41, configured to acquire a target image; the detection module 42 is configured to perform target detection on the target image by using a target detection network to obtain an initial detection result corresponding to the target image, where the initial detection result includes a confidence that a target region included in the target image includes a target; the target detection network is obtained by training the training method; and the processing module 43 is configured to determine a target detection result of the target image based on the confidence corresponding to each target region.

The functions of the modules may be described in the embodiments of the target detection method, and are not described herein again.

Referring to fig. 9, fig. 9 is a schematic structural diagram of an embodiment of an electronic device according to the present application. The electronic device 50 comprises a memory 51 and a processor 52, the processor 52 being configured to execute program instructions stored in the memory 51 to implement the steps in any of the above-described embodiments of the object detection method. In one particular implementation scenario, the electronic device 50 may include, but is not limited to: the monitoring device, the microcomputer, and the server, and the electronic device 50 may further include a mobile device such as a notebook computer, a tablet computer, and the like, which is not limited herein.

In particular, the processor 52 is configured to control itself and the memory 51 to implement the steps in any of the above-described embodiments of the object detection method. Processor 52 may also be referred to as a CPU (Central Processing Unit). Processor 52 may be an integrated circuit chip having signal processing capabilities. The Processor 52 may also be a general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. In addition, the processor 52 may be commonly implemented by an integrated circuit chip.

Referring to fig. 10, fig. 10 is a schematic structural diagram of an embodiment of a computer-readable storage medium according to the present application. The computer readable storage medium 60 stores program instructions 61 executable by the processor, the program instructions 61 for implementing the steps in any of the above-described object detection method embodiments.

In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.

The foregoing description of the various embodiments is intended to highlight various differences between the embodiments, and the same or similar parts may be referred to each other, and for brevity, will not be described again herein.

In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a module or a unit is merely one type of logical division, and an actual implementation may have another division, for example, a unit or a component may be combined or integrated with another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some interfaces, and may be in an electrical, mechanical or other form.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit. The integrated unit, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, and various media capable of storing program codes.

Claims

1. A method for training a target detection model, comprising:

acquiring a sample image, wherein the sample image is marked with at least one target type contained in the sample image;

determining target area features of each target type in the sample image by using a target detection model, wherein the target area features of the target types are used for representing image data features in a target area, and the target area is an area where a target corresponding to the target type is located in the sample image;

comparing the target region features of each target type with a plurality of types of reference region features respectively to obtain a plurality of comparison results, wherein the plurality of types of reference region features comprise the reference region features of each type in a plurality of historical sample images;

determining a target loss based on the plurality of alignment results;

and adjusting parameters in the target detection model by using the target loss.

2. The method of claim 1, wherein determining the target region features for each target type in the sample image using a target detection model comprises:

performing initial feature extraction on the sample image by using the target detection model to obtain an initial feature map of each target type;

performing advanced feature extraction on the initial feature map of each target type to obtain a first feature map of each target type and a second feature map of each target type, wherein the first feature map and the second feature map respectively contain position information of the target area;

processing each second feature graph respectively to generate an area mask corresponding to the target type;

and for each target type, processing the first feature graph corresponding to the target type by using the area mask to obtain the target area feature corresponding to the target type.

3. The method according to claim 2, wherein performing advanced feature extraction on the initial feature map of each of the target types to obtain a first feature map of each of the target types and a second feature map of each of the target types comprises:

performing first feature extraction on the initial feature map to obtain a first feature map about each target type;

performing second feature extraction on the first feature map to obtain a second feature map about each target type;

and the number of channels of the first characteristic diagram of each target type is more than that of the channels of the second characteristic diagram of the corresponding target type.

4. The method according to claim 2, wherein before the processing is performed on each of the second feature maps to generate the area mask corresponding to the target type, the method further comprises:

obtaining a sample detection result of the sample image by using the second feature map of each target type, wherein the sample detection result comprises a confidence coefficient that a target included in each target region of the sample image is the target type;

selecting a second feature map with the confidence coefficient meeting a preset confidence coefficient condition from the second feature maps of the target types as a final second feature map;

the processing each second feature map to generate an area mask corresponding to the target type includes:

and processing each final second feature graph respectively to generate an area mask corresponding to the target type.

5. The method according to claim 2, wherein the separately processing each of the second feature maps to generate an area mask corresponding to the target type includes:

respectively carrying out binarization processing on each second feature map to generate an area mask corresponding to each second feature map;

for each target type, processing the first feature map corresponding to the target type by using the area mask to obtain a target area feature corresponding to the target type, including:

and performing mask average pooling on the first feature map by using the region mask to obtain the target region feature.

6. The method according to any one of claims 2 to 5, wherein before comparing the target region feature of each target type with a plurality of types of reference region features respectively to obtain a plurality of comparison results, the method further comprises:

respectively carrying out third feature extraction on the initial feature map of each target type to obtain a reference feature map of each target type, wherein the reference feature map comprises position information of a target area;

for each target type, processing the reference feature map of the target type by using the area mask to obtain a reference area feature corresponding to the target type;

updating the reference region features of the plurality of types by using the reference region feature corresponding to each target type in the sample image, wherein the updated reference region features of the plurality of types include the reference region feature related to each target type in the sample image;

the step of comparing the target area features of each target type with the reference area features of a plurality of types respectively to obtain a plurality of comparison results includes:

and comparing the target area characteristics of each target type with the updated reference area characteristics of a plurality of types respectively to obtain a plurality of comparison results.

7. The method according to claim 6, wherein performing advanced feature extraction on the initial feature map of each of the target types to obtain a first feature map for each of the target types and a second feature map for each of the target types comprises:

performing first feature extraction on the initial feature map by using a first feature extraction module in the target detection model to obtain a first feature map about each target type;

the third feature extraction is performed on the initial feature maps of the target types respectively to obtain reference feature maps of the target types, and the method comprises the following steps:

performing third feature extraction on the initial feature map by using a second feature extraction module in the target detection model to obtain a reference feature map of each target type, wherein the structure of the first feature extraction module is the same as that of the second feature extraction module;

the adjusting parameters in the target detection model by using the target loss includes:

adjusting parameters of the first feature extraction module by using the target loss in a back propagation mode to obtain parameters updated by the first feature extraction module;

and fusing the parameters updated by the first feature extraction module with the parameters to be updated in the second feature extraction module, and updating the second feature extraction module by using the fused parameters.

8. The method according to any one of claims 2 to 5, wherein the sample image includes an original image and an enhanced image obtained by performing pixel enhancement processing on the original image, and the second feature map includes a first sub-feature map of each target type corresponding to the original image and a second sub-feature map of each target type corresponding to the enhanced image, and the method further includes:

fusing the first sub-feature maps of the target types to obtain a first foreground feature map containing position information of a foreground in the original image, wherein the foreground in the original image is a set of each target area; fusing the second sub-feature maps of the target types to obtain a second foreground feature map containing position information of a foreground in the enhanced image, wherein the foreground in the enhanced image is a set of target areas;

determining a position difference loss based on the position difference of the first foreground characteristic map and the second foreground characteristic map about the foreground;

and adjusting parameters in the target detection model by using the target loss and the position difference loss.

9. The method according to claim 8, wherein the number of the initial feature maps is multiple, the multiple initial feature maps are obtained by multiple cascade-arranged first feature extraction modules in the target detection model, the number of the second feature maps is multiple, the number of the first foreground feature maps and the number of the second foreground feature maps are the same as the number of the initial feature maps, and the first foreground feature maps and the second foreground feature maps are in one-to-one correspondence respectively;

the determining a location difference loss based on a location difference with respect to the foreground in the first foreground feature map and the second foreground feature map comprises:

determining the position difference loss based on a difference between each of the first foreground feature maps and the corresponding second foreground feature map, respectively.

10. The method of claim 8, wherein the number of sample images is plural, and the obtaining the sample images comprises:

acquiring a plurality of original images, and judging whether the number of target original images is greater than or equal to a preset number, wherein the target original images are original images which contain target categories and the number of which is less than or equal to the preset value;

and in response to the fact that the number of the target original images is larger than or equal to the preset number, splicing the reference original images containing other target categories and the target original images to obtain final original images, and splicing the enhanced images of the reference original images and the enhanced images of the target original images to obtain final enhanced images.

11. The method according to any one of claims 1 to 5, wherein a feature matrix is composed of a plurality of types of regional features, each row in the feature matrix corresponds to a group of regional features of one type, the types corresponding to different rows are different, and the comparing of the target regional features of each target type with the reference regional features of the plurality of types to obtain a plurality of comparison results includes:

similarity calculation is carried out on the target area characteristics of each target type and the characteristic matrix to obtain a similarity matrix, each row in the similarity matrix corresponds to the similarity between the target area characteristics of one target type and the historical area characteristics of each type, and the comparison result comprises the similarity matrix;

said determining a target loss based on said plurality of alignments comprising:

determining the target loss based on a similarity matrix.

12. The method according to any one of claims 1-5, further comprising:

determining a sample detection result of the sample image by using the target detection model, wherein the sample detection result comprises a confidence level that a target contained in each target region of the sample image is the target type;

determining binary cross entropy loss corresponding to the sample detection result;

and adjusting parameters in the target detection model by combining the target loss and the binary cross entropy loss.

13. A method for detecting an object, characterized in that,

acquiring a target image;

performing target detection on the target image by using a target detection network to obtain an initial detection result corresponding to the target image, wherein the initial detection result comprises a confidence coefficient that a target area contained in the target image contains a target; the target detection network is obtained by training according to the training method of any one of claims 1-10;

and determining a target detection result of the target image based on the corresponding confidence of each target region.

14. An electronic device comprising a memory and a processor for executing program instructions stored in the memory to implement the method of any of claims 1 to 13.

15. A computer readable storage medium on which program instructions are stored, which program instructions, when executed by a processor, implement the method of any one of claims 1 to 13.