CN114694005A - Target detection model training method and device, and target detection method and device


Info

Publication number
CN114694005A
Authority
CN
China
Prior art keywords: feature map, target detection, trained, image, detection model
Legal status: Pending
Application number
CN202210435047.5A
Other languages
Chinese (zh)
Inventor
赵明瑶
罗壮
张海强
Current Assignee
Zhidao Network Technology Beijing Co Ltd
Original Assignee
Zhidao Network Technology Beijing Co Ltd
Application filed by Zhidao Network Technology Beijing Co Ltd filed Critical Zhidao Network Technology Beijing Co Ltd
Priority to CN202210435047.5A priority Critical patent/CN114694005A/en
Publication of CN114694005A publication Critical patent/CN114694005A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods


Abstract

The application discloses a target detection model training method and device and a target detection method and device, wherein the target detection model training method comprises the following steps: acquiring an image to be trained containing labeling information; performing feature extraction on the image to be trained by using a backbone network of the target detection model to obtain a multi-scale feature map, wherein the backbone network comprises an enhanced switchable atrous convolution used for extracting multi-scale feature information; performing feature fusion on the multi-scale feature map by using a feature pyramid network of the target detection model; detecting the fused multi-scale feature map by using a detection head network of the target detection model to obtain a target detection result; and determining a loss value according to the target detection result and the labeling information, and updating the parameters of the target detection model with the loss value to obtain the trained target detection model. The method improves the switchable atrous convolution in the original backbone network so that multi-scale feature information in the image can be extracted adaptively, improving the accuracy of target detection.

Description

Target detection model training method and device, and target detection method and device
Technical Field
The present application relates to the field of target detection technologies, and in particular, to a target detection model training method and apparatus, and a target detection method and apparatus.
Background
Monocular 3D target detection works from a single RGB image. In a complex road scene, the inherent scales of different detection targets, such as vehicles and pedestrians, differ greatly, and the same target appears at different scales in the RGB image depending on its distance from the observing device.
In existing monocular 3D target detection, an RGB image is input into a monocular 3D target detection network composed of a backbone network, a feature pyramid network and a detection head, which outputs vectors including the target type, the target 3D bounding box (box) code, the target direction type, the target attributes, the target centerness and the like. The target 3D box coding vector is decoded by a 3D box decoding post-processing module to obtain decoding information such as the target position, the target scale, the target depth and the target angle, and the final 3D target detection result is determined by combining Score scoring and an NMS (Non-Maximum Suppression) algorithm.
However, the above scheme handles the target scale problem only in the deep layers of the network: the feature pyramid network first outputs multi-scale feature maps, which are then trained at different scale levels based on the labeling information of targets of different scales. The shallow layers mostly adopt a general-purpose ResNet50 or ResNet101 as the backbone network, and the multi-scale problem is not considered when extracting the low-level features, which results in insufficient detection capability for multi-scale targets.
Disclosure of Invention
The embodiments of the application provide a target detection model training method and device and a target detection method and device, so as to extract multi-scale feature information of a target and improve the accuracy of target detection.
The embodiment of the application adopts the following technical scheme:
in a first aspect, an embodiment of the present application provides a target detection model training method, where the target detection model training method includes:
acquiring an image to be trained, wherein the image to be trained includes the labeling information of the image to be trained;
performing feature extraction on the image to be trained by using a backbone network of the target detection model to obtain a multi-scale feature map of the image to be trained, wherein the backbone network includes an enhanced switchable atrous convolution used for extracting multi-scale feature information in the image to be trained;
performing feature fusion on the multi-scale feature map of the image to be trained by using the feature pyramid network of the target detection model to obtain a fused multi-scale feature map;
detecting the fused multi-scale feature map by using a detection head network of the target detection model to obtain a target detection result of the image to be trained;
and determining a loss value according to the target detection result of the image to be trained and the labeling information of the image to be trained, and updating the parameters of the target detection model with the loss value to obtain the trained target detection model.
Optionally, the backbone network further includes a first global context module and a second global context module, the enhanced switchable atrous convolution includes a plurality of sequentially cascaded stages of switchable atrous convolution, and the plurality of stages include at least a first-stage switchable atrous convolution and a second-stage switchable atrous convolution,
and the performing feature extraction on the image to be trained by using the backbone network of the target detection model to obtain the multi-scale feature map of the image to be trained includes:
acquiring a first feature map of the image to be trained, wherein the first feature map is output by an upstream module corresponding to the first global context module;
processing the first feature map by using the first global context module to obtain a second feature map;
processing the second feature map by using the sequentially cascaded plurality of stages of switchable atrous convolution to obtain a third feature map;
and processing the third feature map by using the second global context module to obtain the multi-scale feature map of the image to be trained.
Optionally, the processing the first feature map by using the first global context module to obtain the second feature map includes:
carrying out global average pooling on the first feature map to obtain a first global average pooling result;
performing 1x1 convolution processing on the first global average pooling processing result to obtain a first 1x1 convolution processing result;
and performing fusion processing on the first feature map and the first 1x1 convolution processing result to obtain the second feature map.
Optionally, the first-stage switchable atrous convolution includes a first conversion function, a switchable atrous convolution corresponding to a first atrous rate, and a switchable atrous convolution corresponding to a second atrous rate, and the processing the second feature map by using the sequentially cascaded plurality of stages of switchable atrous convolution to obtain a third feature map includes:
processing the second feature map by using the first conversion function to obtain a processing result of the first conversion function;
performing 3x3 atrous convolution processing on the second feature map by using the switchable atrous convolution corresponding to the first atrous rate to obtain a first atrous convolution processing result;
performing 3x3 atrous convolution processing on the second feature map by using the switchable atrous convolution corresponding to the second atrous rate to obtain a second atrous convolution processing result;
and, according to the processing result of the first conversion function, fusing the first atrous convolution processing result and the second atrous convolution processing result to obtain a fourth feature map output by the first-stage switchable atrous convolution.
Optionally, the first conversion function includes 5x5 average pooling and 1x1 convolution, and the processing the second feature map by using the first conversion function to obtain the processing result of the first conversion function includes:
performing 5x5 average pooling on the second feature map to obtain a first average pooling result;
and performing 1x1 convolution on the first average pooling result to obtain the processing result of the first conversion function.
Optionally, the second-stage switchable atrous convolution includes a second conversion function and a switchable atrous convolution corresponding to a third atrous rate, and the processing the second feature map by using the sequentially cascaded plurality of stages of switchable atrous convolution to obtain a third feature map includes:
processing the second feature map by using the second conversion function to obtain a processing result of the second conversion function;
performing 3x3 atrous convolution processing on the second feature map by using the switchable atrous convolution corresponding to the third atrous rate to obtain a third atrous convolution processing result;
acquiring the fourth feature map output by the first-stage switchable atrous convolution;
and, according to the processing result of the second conversion function, fusing the fourth feature map and the third atrous convolution processing result to obtain the third feature map.
Optionally, the second conversion function includes 11x11 average pooling and 1x1 convolution, and the processing the second feature map by using the second conversion function to obtain the processing result of the second conversion function includes:
performing 11x11 average pooling on the second feature map to obtain a second average pooling result;
and performing 1x1 convolution on the second average pooling result to obtain the processing result of the second conversion function.
In a second aspect, an embodiment of the present application further provides a target detection method, where the target detection method includes:
acquiring an image to be detected;
detecting the image to be detected by using a target detection model to obtain a target detection result;
the target detection model is obtained by training based on any one of the target detection model training methods.
In a third aspect, an embodiment of the present application further provides a target detection model training apparatus, where the target detection model training apparatus includes:
a first acquisition unit, used for acquiring an image to be trained, wherein the image to be trained includes the labeling information of the image to be trained;
a feature extraction unit, used for performing feature extraction on the image to be trained by using a backbone network of the target detection model to obtain a multi-scale feature map of the image to be trained, wherein the backbone network includes an enhanced switchable atrous convolution used for extracting multi-scale feature information in the image to be trained;
a feature fusion unit, used for performing feature fusion on the multi-scale feature map of the image to be trained by using the feature pyramid network of the target detection model to obtain a fused multi-scale feature map;
a first detection unit, used for detecting the fused multi-scale feature map by using a detection head network of the target detection model to obtain a target detection result of the image to be trained;
and an updating unit, used for determining a loss value according to the target detection result of the image to be trained and the labeling information of the image to be trained, and updating the parameters of the target detection model with the loss value to obtain the trained target detection model.
In a fourth aspect, an embodiment of the present application further provides a target detection apparatus, where the target detection apparatus includes:
the second acquisition unit is used for acquiring an image to be detected;
the second detection unit is used for detecting the image to be detected by using the target detection model to obtain a target detection result;
and the target detection model is obtained by training based on the target detection model training device.
In a fifth aspect, an embodiment of the present application further provides an electronic device, including:
a processor; and
a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform any of the aforementioned target detection model training methods or target detection methods.
In a sixth aspect, embodiments of the present application further provide a computer-readable storage medium storing one or more programs which, when executed by an electronic device including a plurality of application programs, cause the electronic device to perform any of the target detection model training methods or target detection methods described above.
The embodiments of the application adopt at least one technical scheme that can achieve the following beneficial effects: the target detection model training method of the embodiments of the application first acquires an image to be trained, wherein the image to be trained includes the labeling information of the image to be trained; then performs feature extraction on the image to be trained by using a backbone network of the target detection model to obtain a multi-scale feature map of the image to be trained, wherein the backbone network includes an enhanced switchable atrous convolution used for extracting multi-scale feature information in the image to be trained; then performs feature fusion on the multi-scale feature map of the image to be trained by using a feature pyramid network of the target detection model to obtain a fused multi-scale feature map; detects the fused multi-scale feature map by using a detection head network of the target detection model to obtain a target detection result of the image to be trained; and finally determines a loss value according to the target detection result of the image to be trained and the labeling information of the image to be trained, and updates the parameters of the target detection model with the loss value to obtain the trained target detection model. The target detection model training method of the application improves the switchable atrous convolution in the original backbone network to obtain the enhanced switchable atrous convolution, so that multi-scale feature information in the image can be extracted adaptively, improving the accuracy of target detection.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
FIG. 1 is a schematic flow chart illustrating a method for training a target detection model according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a network structure of ResNet50/ResNet101 in the prior art;
FIG. 3 is a schematic overall flowchart of a target detection model training method according to an embodiment of the present disclosure;
FIG. 4 is a diagram illustrating a network structure of a SAC in the prior art;
FIG. 5 is a schematic diagram of an enhanced switchable atrous convolution network according to an embodiment of the present application;
FIG. 6 is a schematic flowchart of a target detection method according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an apparatus for training a target detection model according to an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of a target detection apparatus according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail and completely with reference to the following specific embodiments of the present application and the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The technical solutions provided by the embodiments of the present application are described in detail below with reference to the accompanying drawings.
An embodiment of the present application provides a method for training a target detection model, and as shown in fig. 1, provides a schematic flow chart of the method for training the target detection model in the embodiment of the present application, where the method for training the target detection model at least includes the following steps S110 to S150:
step S110, obtaining an image to be trained, wherein the image to be trained comprises the labeling information of the image to be trained.
When training the target detection model, the image to be trained must first be acquired as a training sample. The Ground Truth labeling information of the targets is annotated in the image to be trained in advance and serves as the basis for subsequently calculating the loss value during target detection model training. The target detection model training method in the embodiments of the application can be used for monocular 3D target detection; of course, a person skilled in the art can also flexibly extend the method to other target detection scenarios, such as 2D detection scenarios, which are not specifically limited herein.
Step S120: performing feature extraction on the image to be trained by using a backbone network of the target detection model to obtain a multi-scale feature map of the image to be trained, wherein the backbone network includes an enhanced switchable atrous convolution used for extracting multi-scale feature information in the image to be trained.
After the image to be trained is obtained, feature extraction is performed on it by the backbone of the target detection model to obtain the multi-scale feature map of the image to be trained.
As shown in fig. 2, which provides a schematic diagram of the ResNet50/ResNet101 network structure in the prior art, the original ResNet50/ResNet101 structure contains Conv3x3 blocks, that is, 3x3 convolutions. When modifying the network structure, all of the Conv3x3 blocks can be replaced by the enhanced switchable atrous convolution of the embodiments of the present application.
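For illustration, the following PyTorch sketch shows one way such a replacement could be scripted; the torchvision ResNet-50, the msac_factory callable, and the MSAC module it would build are assumptions for the example, not part of the patent.

```python
import torch.nn as nn
import torchvision

def replace_conv3x3_with_msac(module: nn.Module, msac_factory):
    """Recursively swap every 3x3 convolution for an enhanced module.

    msac_factory(conv) is assumed to build a drop-in replacement that
    reuses the original convolution's channel counts and stride.
    """
    for name, child in module.named_children():
        if isinstance(child, nn.Conv2d) and child.kernel_size == (3, 3):
            setattr(module, name, msac_factory(child))
        else:
            replace_conv3x3_with_msac(child, msac_factory)

backbone = torchvision.models.resnet50(pretrained=True)
# replace_conv3x3_with_msac(backbone, msac_factory)  # msac_factory defined elsewhere
```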
And S130, performing feature fusion on the multi-scale feature map of the image to be trained by using the feature pyramid network of the target detection model to obtain a fused multi-scale feature map.
Step S140: detecting the fused multi-scale feature map by using a detection head network of the target detection model to obtain a target detection result of the image to be trained.
And S150, determining a loss value according to the target detection result of the image to be trained and the labeling information of the image to be trained, and updating the parameters of the target detection model by using the loss value to obtain the trained target detection model.
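As a rough illustration of how steps S120-S150 fit together, here is a hedged PyTorch sketch of one training iteration; the backbone, fpn, head, and loss_fn names are illustrative placeholders, not the patent's actual modules.

```python
import torch

def train_step(model, optimizer, image, annotations, loss_fn):
    """One iteration covering steps S120-S150 (illustrative names)."""
    feats = model.backbone(image)        # S120: multi-scale feature maps
    fused = model.fpn(feats)             # S130: feature pyramid fusion
    preds = model.head(fused)            # S140: detection results per level
    loss = loss_fn(preds, annotations)   # S150: compare with Ground Truth
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                     # S150: update model parameters
    return loss.item()
```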
As shown in fig. 3, an overall flowchart of the target detection model training method in the embodiments of the present application is provided. First, the image to be trained and image meta-information such as camera intrinsics and image size are input into the improved backbone network, and feature extraction is performed sequentially through C3-C4-C5 to obtain a multi-scale feature map, where each of the C3-C4-C5 layers can contain the enhanced switchable atrous convolution of the embodiments of the application, used for adaptively extracting multi-scale feature information in the shallow network.
The multi-scale feature map output by the improved backbone network is then input into the Feature Pyramid Network (FPN) of the target detection model for feature fusion, which outputs fused feature maps at the 5 scale levels P3-P4-P5-P6-P7.
Finally, the detection Head of the target detection model predicts on each fused feature map and outputs vectors containing the target type, target 3D box code, target direction type, target attributes, target centerness and the like. The target 3D box coding vector is decoded by a 3D box decoding post-processing module (Box Decoder) to obtain decoding information such as the target position, target scale, target depth and target angle, and the final 3D target detection result is determined by combining Score scoring and the NMS algorithm.
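The Box Decoder itself is patent-specific, but the final Score + NMS filtering is a generic step; a minimal sketch is shown below, assuming the decoded 3D boxes have already been projected to 2D image-plane boxes so that the standard torchvision NMS applies (an assumption, since the patent does not name a library).

```python
import torch
from torchvision.ops import nms

def select_final_detections(boxes2d, scores, score_thr=0.3, iou_thr=0.5):
    """Generic Score + NMS filtering; boxes2d is an [N, 4] tensor of
    xyxy boxes, e.g. image-plane projections of the decoded 3D boxes."""
    keep = scores > score_thr             # Score scoring: drop weak detections
    boxes2d, scores = boxes2d[keep], scores[keep]
    kept = nms(boxes2d, scores, iou_thr)  # Non-Maximum Suppression
    return boxes2d[kept], scores[kept]
```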
The target detection model training method of the embodiments of the application improves the switchable atrous convolution in the original backbone network to obtain the enhanced switchable atrous convolution, so that multi-scale feature information in the image can be extracted adaptively, improving the accuracy of subsequent target detection.
In an embodiment of the present application, the backbone network further includes a first global context module and a second global context module, the enhanced switchable atrous convolution includes a plurality of sequentially cascaded stages of switchable atrous convolution, the plurality of stages include at least a first-stage switchable atrous convolution and a second-stage switchable atrous convolution, and the performing feature extraction on the image to be trained by using the backbone network of the target detection model to obtain the multi-scale feature map of the image to be trained includes: acquiring a first feature map of the image to be trained, wherein the first feature map is output by an upstream module corresponding to the first global context module; processing the first feature map by using the first global context module to obtain a second feature map; processing the second feature map by using the sequentially cascaded plurality of stages of switchable atrous convolution to obtain a third feature map; and processing the third feature map by using the second global context module to obtain the multi-scale feature map of the image to be trained.
The enhanced switchable atrous convolution (MSAC for short) of the embodiments of the present application may be obtained by modifying the original Switchable Atrous Convolution (SAC) network structure. Fig. 4 provides a schematic diagram of the network structure of a SAC in the prior art. The original SAC structure designs only one conversion function and can only switch between atrous convolutions of two different atrous rates, so the receptive field is limited and the multi-scale feature information that can be extracted is also limited. The SAC can be expressed as:

$$y = S(x) \cdot \mathrm{Conv}(x, w, 1) + (1 - S(x)) \cdot \mathrm{Conv}(x, w + \Delta w, r)$$

where $x$ is the input, $w$ is the weight, $r$ is the atrous rate of the atrous convolution (also a hyper-parameter of the SAC), $\Delta w$ is a weight to be trained, and the conversion function $S(\cdot)$ depends on the input and the position.
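To make the formula concrete, here is a minimal PyTorch sketch of a single-stage SAC; the sigmoid bounding the switch to [0, 1] and the use of two independent convolutions instead of the shared w / w + Δw weights (the locking mechanism is discussed later) are simplifications for the example.

```python
import torch.nn as nn

class SAC(nn.Module):
    """Single-stage switchable atrous convolution (sketch):
    y = S(x) * Conv(x, w, 1) + (1 - S(x)) * Conv(x, w + dw, r)."""
    def __init__(self, channels, rate=3, pool=5):
        super().__init__()
        self.conv_r1 = nn.Conv2d(channels, channels, 3, padding=1, dilation=1)
        self.conv_r = nn.Conv2d(channels, channels, 3, padding=rate, dilation=rate)
        # conversion function S(.): average pooling followed by 1x1 convolution
        self.switch = nn.Sequential(
            nn.AvgPool2d(pool, stride=1, padding=pool // 2),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),  # assumption: bound the switch to [0, 1]
        )

    def forward(self, x):
        s = self.switch(x)  # per-position switch mask
        return s * self.conv_r1(x) + (1 - s) * self.conv_r(x)
```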
Based on this, the enhanced switchable atrous convolution of the embodiments of the present application may include a plurality of sequentially cascaded stages of switchable atrous convolution (SAC). Because the stages are cascaded in sequence, switching can be performed among more than two atrous convolutions with different atrous rates, thereby further expanding the receptive field and enriching the extracted multi-scale feature information.
Fig. 5 is a schematic diagram of the network structure of the enhanced switchable atrous convolution according to an embodiment of the present application. For ease of understanding, a cascade of two stages of switchable atrous convolution is used as the example below.
Specifically, in the backbone network of this embodiment, a global context module is inserted before and after the enhanced switchable atrous convolution structure, namely a first global context module (Pre-Global Context) and a second global context module (Post-Global Context). The global context module is similar to SENet (Squeeze-and-Excitation Networks), but with two main differences: 1) the global context module has only one convolution layer and no nonlinear layer; 2) the output is added back to the backbone, instead of multiplying the input by a Sigmoid-recalibrated value.
Before the first global context module, the original network structure of the backbone preceding the 3x3 convolution is connected, and its output can be regarded as the first feature map. The first feature map is then processed by the first global context module to obtain the second feature map, the second feature map is processed by the enhanced switchable atrous convolution to obtain the third feature map, and finally the third feature map is processed by the second global context module to obtain the final multi-scale feature map.
In an embodiment of the application, the processing the first feature map by using the first global context module to obtain the second feature map includes: carrying out global average pooling on the first feature map to obtain a first global average pooling result; performing 1x1 convolution processing on the first global average pooling processing result to obtain a first 1x1 convolution processing result; and performing fusion processing on the first feature map and the first 1x1 convolution processing result to obtain the second feature map.
In an embodiment of the application, the processing the third feature map by using the second global context module to obtain the multi-scale feature map of the image to be trained includes: carrying out global average pooling on the third feature map to obtain a second global average pooling result; performing 1x1 convolution processing on the second global average pooling processing result to obtain a second 1x1 convolution processing result; and performing fusion processing on the third feature map and the second 1x1 convolution processing result to obtain the multi-scale feature map of the image to be trained.
The two global context modules of the embodiments of the present application have the same network structure: both consist of Global Average Pooling followed by Conv1x1, that is, a 1x1 convolution. Taking the first global context module as an example, global average pooling of the first feature map compresses the spatial dimensions and thus reduces the amount of computation. A 1x1 convolution is then applied to the first global average pooling result, and finally the first feature map and the first 1x1 convolution result are added to obtain the second feature map, which improves information flow and avoids the problems of gradient vanishing and degradation.
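A minimal sketch of this module, under the description above (one 1x1 convolution, no nonlinearity, output added back to the input):

```python
import torch.nn as nn

class GlobalContext(nn.Module):
    """Pre-/Post-Global Context (sketch): global average pooling and a
    single 1x1 convolution, added back to the input; unlike SENet there
    is no nonlinear layer and no Sigmoid re-weighting."""
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return x + self.fc(self.pool(x))  # [B, C, 1, 1] broadcast over H x W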
In an embodiment of the present application, the first-stage switchable atrous convolution includes a first conversion function, a switchable atrous convolution corresponding to a first atrous rate, and a switchable atrous convolution corresponding to a second atrous rate, and the processing the second feature map by using the sequentially cascaded plurality of stages of switchable atrous convolution to obtain a third feature map includes: processing the second feature map by using the first conversion function to obtain a processing result of the first conversion function; performing 3x3 atrous convolution processing on the second feature map by using the switchable atrous convolution corresponding to the first atrous rate to obtain a first atrous convolution processing result; performing 3x3 atrous convolution processing on the second feature map by using the switchable atrous convolution corresponding to the second atrous rate to obtain a second atrous convolution processing result; and, according to the processing result of the first conversion function, fusing the first atrous convolution processing result and the second atrous convolution processing result to obtain a fourth feature map output by the first-stage switchable atrous convolution.
The first-stage switchable atrous convolution (First Level SAC) of this embodiment has the same structure and parameters as the SAC network shown in fig. 4. The First Level SAC comprises three main parts: the first conversion function $S_{first}(\cdot)$, the switchable atrous convolution corresponding to the first atrous rate, and the switchable atrous convolution corresponding to the second atrous rate.
The second feature map is processed with the first conversion function $S_{first}(\cdot)$, and the processing result consists of the two parts $S_{first}(x)$ and $1 - S_{first}(x)$; that is, the first conversion function can adaptively adjust, according to what has been learned, whether the convolution is performed by the atrous convolution of the first atrous rate or of the second atrous rate. As an example, the first atrous rate may be set to atrous = 1 and the second atrous rate to atrous = 3, with both atrous convolutions using Conv3x3, that is, 3x3 convolution. Note that when atrous = 1, the switchable atrous convolution corresponding to the first atrous rate is essentially an ordinary 3x3 convolution.
After obtaining $S_{first}(x)$ and $1 - S_{first}(x)$, the processing result of the first conversion function can be used as a fusion mask: $S_{first}(x)$ is multiplied element-wise with the first atrous convolution processing result, $1 - S_{first}(x)$ is multiplied element-wise with the second atrous convolution processing result, and the two products are added to obtain the fourth feature map output by the first-stage switchable atrous convolution.
In an embodiment of the present application, the first conversion function includes 5x5 average pooling and 1x1 convolution, and the processing the second feature map by the first conversion function to obtain the processing result of the first conversion function includes: performing 5x5 average pooling on the second feature map to obtain a first average pooling result; and performing 1x1 convolution on the first average pooling result to obtain the processing result of the first conversion function.
The first conversion function of the embodiments of the present application consists of a 5x5 average pooling layer and a 1x1 convolution layer, and depends on the input and the position. The second feature map first undergoes 5x5 average pooling to obtain the first average pooling result, which on the one hand enlarges the receptive field and on the other hand reduces overfitting. A 1x1 convolution is then applied to the first average pooling result to obtain the processing result of the first conversion function.
In an embodiment of the present application, the second-stage switchable atrous convolution includes a second conversion function and a switchable atrous convolution corresponding to a third atrous rate, and the processing the second feature map by using the sequentially cascaded plurality of stages of switchable atrous convolution to obtain a third feature map includes: processing the second feature map by using the second conversion function to obtain a processing result of the second conversion function; performing 3x3 atrous convolution processing on the second feature map by using the switchable atrous convolution corresponding to the third atrous rate to obtain a third atrous convolution processing result; acquiring the fourth feature map output by the first-stage switchable atrous convolution; and, according to the processing result of the second conversion function, fusing the fourth feature map and the third atrous convolution processing result to obtain the third feature map.
The second-stage switchable atrous convolution (Second Level SAC) of the embodiments of the present application includes the second conversion function $S_{2nd}(\cdot)$ and the switchable atrous convolution corresponding to the third atrous rate; in addition, it also takes as input the output of the first-stage switchable atrous convolution, i.e. the fourth feature map. As an example, the third atrous rate may be set to 6, and this atrous convolution also uses Conv3x3, thereby further enlarging the receptive field.
Specifically, the second feature map is processed with the second conversion function $S_{2nd}(\cdot)$, and the processing result consists of the two parts $S_{2nd}(x)$ and $1 - S_{2nd}(x)$, so that the model can adaptively select, according to what has been learned, between the atrous convolution result of the third atrous rate and the fused result of the first and second atrous rates. The processing result of the second conversion function is then used as a fusion mask: $S_{2nd}(x)$ is multiplied element-wise with the fourth feature map output by the first-stage switchable atrous convolution, $1 - S_{2nd}(x)$ is multiplied element-wise with the third atrous convolution processing result, and the two products are added to obtain the output of the second-stage switchable atrous convolution, i.e. the third feature map.
In an embodiment of the present application, the second conversion function includes 11x11 average pooling and 1x1 convolution, and the processing the second feature map by the second conversion function to obtain the processing result of the second conversion function includes: performing 11x11 average pooling on the second feature map to obtain a second average pooling result; and performing 1x1 convolution on the second average pooling result to obtain the processing result of the second conversion function.
The second conversion function of the embodiments of the present application consists of an 11x11 average pooling layer and a 1x1 convolution layer, and depends on the input and the position. The second feature map first undergoes 11x11 average pooling to obtain the second average pooling result, which on the one hand enlarges the receptive field and on the other hand reduces overfitting. A 1x1 convolution is then applied to the second average pooling result to obtain the processing result of the second conversion function.
Based on the foregoing embodiments, the MSAC of the embodiments of the present application is a cascaded improvement of the original SAC structure: it designs two (or more) switches in sequence and can switch between the feature already converted by the SAC and the third atrous convolution processing result (or the results of further atrous convolutions). The two-stage MSAC can be expressed as:

$$y_{first} = S_{first}(x) \cdot \mathrm{Conv}(x, w, 1) + (1 - S_{first}(x)) \cdot \mathrm{Conv}(x, w + \Delta w_{first}, r_{first})$$
$$y = S_{2nd}(x) \cdot y_{first} + (1 - S_{2nd}(x)) \cdot \mathrm{Conv}(x, w + \Delta w_{2nd}, r_{2nd})$$

where $x$ is the input, $w$ is the weight, and $r_{first}$ is the atrous rate of the first-stage switchable atrous convolution, which can be set to atrous = 3. $\Delta w_{first}$ is the weight to be trained in the first stage, and the first conversion function $S_{first}(\cdot)$ consists of a 5x5 average pooling layer and a 1x1 convolution layer. The output of the first-stage switchable atrous convolution serves as an input of the second stage, and $r_{2nd}$ is the atrous rate of the second-stage switchable atrous convolution, which can be set to 6. $\Delta w_{2nd}$ is the weight to be trained in the second stage, and the second conversion function $S_{2nd}(\cdot)$ consists of an 11x11 average pooling layer and a 1x1 convolution layer.
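Putting the two stages together, the following sketch mirrors the formulas above; as before, the sigmoid switches and independent branch weights are simplifications for the example (the shared w with per-stage Δw is shown separately below).

```python
import torch.nn as nn

class MSAC(nn.Module):
    """Two-stage enhanced switchable atrous convolution (sketch).

    Stage 1 switches between dilation 1 and r_first = 3; stage 2 switches
    between the stage-1 output and a dilation r_2nd = 6 convolution of the
    same input, so three receptive fields are mixed adaptively."""
    def __init__(self, channels, r_first=3, r_2nd=6):
        super().__init__()
        self.conv_r1 = nn.Conv2d(channels, channels, 3, padding=1, dilation=1)
        self.conv_first = nn.Conv2d(channels, channels, 3,
                                    padding=r_first, dilation=r_first)
        self.conv_2nd = nn.Conv2d(channels, channels, 3,
                                  padding=r_2nd, dilation=r_2nd)
        self.s_first = self._switch(channels, pool=5)   # S_first: 5x5 pool + 1x1 conv
        self.s_2nd = self._switch(channels, pool=11)    # S_2nd: 11x11 pool + 1x1 conv

    @staticmethod
    def _switch(channels, pool):
        return nn.Sequential(
            nn.AvgPool2d(pool, stride=1, padding=pool // 2),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),  # assumption: bound the mask to [0, 1]
        )

    def forward(self, x):
        s1 = self.s_first(x)
        y_first = s1 * self.conv_r1(x) + (1 - s1) * self.conv_first(x)
        s2 = self.s_2nd(x)
        return s2 * y_first + (1 - s2) * self.conv_2nd(x)
```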
The atrous rates atrous = 1, 3 and 6 are all hyper-parameters; during actual training, they can be flexibly replaced with other atrous rates, such as atrous = 1, 2 and 4, according to training requirements.
The locking mechanism in the original SAC sets one weight to w and the other to w + Δw for the following reason: target detection models typically use pre-trained checkpoints to initialize their weights. However, for a SAC layer converted from a standard convolution layer, the weight for the larger atrous rate is usually missing. Since objects of different scales can be roughly detected by the same weight at different atrous rates, the missing weights can be initialized from the weights in the pre-trained model. The embodiments of the present application use w + Δw to represent the missing weight, with w taken from the pre-trained checkpoint. The w of the switchable convolution of each level is the same initial w, while Δw_first and Δw_2nd are distinct and can each be initialized to 0.
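A sketch of how this locking could look in code: each atrous branch applies w + Δw, where w is the shared pretrained weight and Δw is a per-stage residual initialized to zero (names are illustrative).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LockedAtrousConv(nn.Module):
    """3x3 atrous convolution whose effective weight is w + delta_w:
    w is the shared pretrained weight, delta_w starts at zero."""
    def __init__(self, shared_weight: nn.Parameter, dilation: int):
        super().__init__()
        self.weight = shared_weight  # pretrained w, shared across branches
        self.delta_w = nn.Parameter(torch.zeros_like(shared_weight))
        self.dilation = dilation

    def forward(self, x):
        return F.conv2d(x, self.weight + self.delta_w,
                        padding=self.dilation, dilation=self.dilation)
```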
In an embodiment of the present application, feature extraction is performed with the ResNet50/ResNet101-based backbone network shown in fig. 2, in which all Conv3x3 ordinary convolutions may be replaced by the MSAC network structure of the embodiments of the present application. Of course, those skilled in the art can flexibly configure the replacement scheme according to actual requirements, for example replacing all Conv3x3 with MSAC, or replacing the first 3 levels with SAC convolution and the blocks of the last 3 levels with MSAC convolution, and so on.
An embodiment of the present application further provides a target detection method, and as shown in fig. 6, a flowchart of the target detection method in the embodiment of the present application is provided, where the target detection method at least includes the following steps S610 to S620:
step S610, acquiring an image to be detected;
s620, detecting the image to be detected by using a target detection model to obtain a target detection result;
the target detection model is obtained by training based on any one of the target detection model training methods.
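A minimal inference sketch for steps S610-S620, assuming the trained model object was saved with torch.save (the serialization format is an assumption):

```python
import torch

def detect(model_path, image):
    """Steps S610-S620: load the trained detector and run it on one image."""
    model = torch.load(model_path, map_location="cpu")  # trained model object
    model.eval()
    with torch.no_grad():
        return model(image.unsqueeze(0))  # target detection result
```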
In order to verify the detection precision of the target detection model trained by the present application, the embodiments of the application compared, on test data, the detection results of the target detection model trained as described above with those of the original monocular 3D target detection method. Experiments show that the mAP (mean Average Precision) of the target detection model trained by the present method is 1.2-1.4 points higher than that of the original monocular 3D target detection method, and the approach can be extended to other backbone networks and to similar target detection and segmentation tasks.
An embodiment of the present application further provides a target detection model training apparatus 700, as shown in fig. 7, which provides a schematic structural diagram of the target detection model training apparatus in the embodiment of the present application, where the target detection model training apparatus 700 includes: a first obtaining unit 710, a feature extracting unit 720, a feature fusing unit 730, a first detecting unit 740, and an updating unit 750, wherein:
a first obtaining unit 710, configured to obtain an image to be trained, where the image to be trained includes label information of the image to be trained;
a feature extraction unit 720, configured to perform feature extraction on the image to be trained by using a backbone network of the target detection model to obtain a multi-scale feature map of the image to be trained, where the backbone network includes an enhanced switchable atrous convolution used for extracting multi-scale feature information in the image to be trained;
the feature fusion unit 730 is configured to perform feature fusion on the multi-scale feature map of the image to be trained by using the feature pyramid network of the target detection model to obtain a fused multi-scale feature map;
the first detecting unit 740 is configured to detect the fused multi-scale feature map by using a detection head network of the target detection model, so as to obtain a target detection result of the image to be trained;
and an updating unit 750, configured to determine a loss value according to the target detection result of the image to be trained and the label information of the image to be trained, and update the parameter of the target detection model by using the loss value, so as to obtain the trained target detection model.
In an embodiment of the present application, the backbone network further includes a first global context module and a second global context module, the enhanced switchable atrous convolution includes a plurality of sequentially cascaded stages of switchable atrous convolution, the plurality of stages include at least a first-stage switchable atrous convolution and a second-stage switchable atrous convolution, and the feature extraction unit 720 is specifically configured to: acquire a first feature map of the image to be trained, wherein the first feature map is output by an upstream module corresponding to the first global context module; process the first feature map by using the first global context module to obtain a second feature map; process the second feature map by using the sequentially cascaded plurality of stages of switchable atrous convolution to obtain a third feature map; and process the third feature map by using the second global context module to obtain the multi-scale feature map of the image to be trained.
In an embodiment of the present application, the feature extraction unit 720 is specifically configured to: perform global average pooling on the first feature map to obtain a first global average pooling result; perform 1x1 convolution processing on the first global average pooling result to obtain a first 1x1 convolution result; and fuse the first feature map and the first 1x1 convolution result to obtain the second feature map.
In an embodiment of the present application, the first-stage switchable atrous convolution includes a first conversion function, a switchable atrous convolution corresponding to a first atrous rate, and a switchable atrous convolution corresponding to a second atrous rate, and the feature extraction unit 720 is specifically configured to: process the second feature map by using the first conversion function to obtain a processing result of the first conversion function; perform 3x3 atrous convolution processing on the second feature map by using the switchable atrous convolution corresponding to the first atrous rate to obtain a first atrous convolution processing result; perform 3x3 atrous convolution processing on the second feature map by using the switchable atrous convolution corresponding to the second atrous rate to obtain a second atrous convolution processing result; and, according to the processing result of the first conversion function, fuse the first atrous convolution processing result and the second atrous convolution processing result to obtain a fourth feature map output by the first-stage switchable atrous convolution.
In an embodiment of the present application, the first conversion function includes 5x5 average pooling and 1x1 convolution, and the feature extraction unit 720 is specifically configured to: perform 5x5 average pooling on the second feature map to obtain a first average pooling result; and perform 1x1 convolution on the first average pooling result to obtain the processing result of the first conversion function.
In an embodiment of the present application, the second-stage switchable atrous convolution includes a second conversion function and a switchable atrous convolution corresponding to a third atrous rate, and the feature extraction unit 720 is specifically configured to: process the second feature map by using the second conversion function to obtain a processing result of the second conversion function; perform 3x3 atrous convolution processing on the second feature map by using the switchable atrous convolution corresponding to the third atrous rate to obtain a third atrous convolution processing result; acquire the fourth feature map output by the first-stage switchable atrous convolution; and, according to the processing result of the second conversion function, fuse the fourth feature map and the third atrous convolution processing result to obtain the third feature map.
In an embodiment of the present application, the second conversion function includes 11x11 average pooling and 1x1 convolution, and the feature extraction unit 720 is specifically configured to: perform 11x11 average pooling on the second feature map to obtain a second average pooling result; and perform 1x1 convolution on the second average pooling result to obtain the processing result of the second conversion function.
In an embodiment of the present application, the feature extraction unit 720 is specifically configured to: perform global average pooling on the third feature map to obtain a second global average pooling result; perform 1x1 convolution processing on the second global average pooling result to obtain a second 1x1 convolution result; and fuse the third feature map and the second 1x1 convolution result to obtain the multi-scale feature map of the image to be trained.
It can be understood that the above target detection model training apparatus can implement each step of the target detection model training method provided in the foregoing embodiment, and the relevant explanations about the target detection model training method are all applicable to the target detection model training apparatus, and are not described herein again.
An embodiment of the present application further provides a target detection apparatus 800, as shown in fig. 8, which provides a schematic structural diagram of the target detection apparatus in the embodiment of the present application, where the target detection apparatus 800 includes: a second obtaining unit 810 and a second detecting unit 820, wherein:
a second obtaining unit 810, configured to obtain an image to be detected;
the second detection unit 820 is configured to detect the image to be detected by using a target detection model to obtain a target detection result;
and the target detection model is obtained by training based on the target detection model training device.
It can be understood that the target detection apparatus can implement the steps of the target detection method provided in the foregoing embodiment, and the related explanations about the target detection method are applicable to the target detection apparatus, and are not described herein again.
Fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application. Referring to fig. 9, at the hardware level the electronic device includes a processor, and optionally also an internal bus, a network interface, and a memory. The memory may include volatile memory, such as Random-Access Memory (RAM), and may further include non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.
The processor, the network interface, and the memory may be connected to each other via an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 9, but this does not indicate only one bus or one type of bus.
And the memory is used for storing programs. In particular, the program may include program code comprising computer operating instructions. The memory may include both memory and non-volatile storage and provides instructions and data to the processor.
The processor reads a corresponding computer program from the nonvolatile memory to the memory and then runs the computer program to form the target detection model training device on a logic level. The processor is used for executing the program stored in the memory and is specifically used for executing the following operations:
acquiring an image to be trained, wherein the image to be trained comprises the labeling information of the image to be trained;
performing feature extraction on the image to be trained by using a backbone network of the target detection model to obtain a multi-scale feature map of the image to be trained, wherein the backbone network includes an enhanced switchable atrous convolution used for extracting multi-scale feature information in the image to be trained;
performing feature fusion on the multi-scale feature map of the image to be trained by using the feature pyramid network of the target detection model to obtain a fused multi-scale feature map;
detecting the fused multi-scale feature map by using a detection head network of the target detection model to obtain a target detection result of the image to be trained;
and determining a loss value according to the target detection result of the image to be trained and the labeling information of the image to be trained, and updating the parameters of the target detection model by using the loss value to obtain the trained target detection model.
The method performed by the target detection model training apparatus disclosed in the embodiment of fig. 1 of the present application may be applied to, or implemented by, a processor. The processor may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The various methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general-purpose processor may be a microprocessor, or any conventional processor. The steps of the methods disclosed in connection with the embodiments of the present application may be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, or EPROM, or a register. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware.
The electronic device may further execute the method executed by the target detection model training apparatus in fig. 1, and implement the functions of the target detection model training apparatus in the embodiment shown in fig. 1, which are not described herein again in this application.
An embodiment of the present application further provides a computer-readable storage medium storing one or more programs, where the one or more programs include instructions which, when executed by an electronic device including a plurality of application programs, enable the electronic device to perform the method performed by the target detection model training apparatus in the embodiment shown in fig. 1, and are specifically configured to perform:
acquiring an image to be trained, wherein the image to be trained comprises the labeling information of the image to be trained;
performing feature extraction on the image to be trained by using a backbone network of the target detection model to obtain a multi-scale feature map of the image to be trained, wherein the backbone network includes an enhanced switchable atrous convolution used for extracting multi-scale feature information in the image to be trained;
performing feature fusion on the multi-scale feature map of the image to be trained by using the feature pyramid network of the target detection model to obtain a fused multi-scale feature map;
detecting the fused multi-scale feature map by using a detection head network of the target detection model to obtain a target detection result of the image to be trained;
and determining a loss value according to the target detection result of the image to be trained and the labeling information of the image to be trained, and updating the parameters of the target detection model by using the loss value to obtain the trained target detection model.
It should be noted that the electronic device according to the embodiment of the present application may also be configured to execute the method executed by the target detection apparatus disclosed in the embodiment shown in fig. 6; the details are not repeated here.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape/magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A target detection model training method, wherein the target detection model training method comprises the following steps:
acquiring an image to be trained, wherein the image to be trained comprises the labeling information of the image to be trained;
performing feature extraction on the image to be trained by using a backbone network of the target detection model to obtain a multi-scale feature map of the image to be trained, wherein the backbone network comprises an enhanced convertible hole convolution which is used for extracting multi-scale feature information in the image to be trained;
performing feature fusion on the multi-scale feature map of the image to be trained by using the feature pyramid network of the target detection model to obtain a fused multi-scale feature map;
detecting the fused multi-scale feature map by using a detection head network of the target detection model to obtain a target detection result of the image to be trained;
and determining a loss value according to the target detection result of the image to be trained and the labeling information of the image to be trained, and updating the parameters of the target detection model by using the loss value to obtain the trained target detection model.
2. The target detection model training method of claim 1, wherein the backbone network further comprises a first global context module and a second global context module, the enhanced convertible hole convolution comprises a plurality of stages of convertible hole convolution that are cascaded in sequence, and the plurality of stages of convertible hole convolution comprise at least a first stage of convertible hole convolution and a second stage of convertible hole convolution,
the extracting features of the image to be trained by using the backbone network of the target detection model to obtain the multi-scale feature map of the image to be trained comprises the following steps:
acquiring a first feature map of the image to be trained, wherein the first feature map is output by an upstream module corresponding to the first global context module;
processing the first feature map by using the first global context module to obtain a second feature map;
processing the second feature map by using the plurality of stages of convertible hole convolution that are cascaded in sequence to obtain a third feature map;
and processing the third feature map by using the second global context module to obtain the multi-scale feature map of the image to be trained.
3. The target detection model training method of claim 2, wherein the processing the first feature map by using the first global context module to obtain the second feature map comprises:
carrying out global average pooling on the first feature map to obtain a first global average pooling result;
performing 1x1 convolution processing on the first global average pooling result to obtain a first 1x1 convolution processing result;
and performing fusion processing on the first feature map and the first 1x1 convolution processing result to obtain the second feature map.
4. The target detection model training method of claim 2, wherein the first stage of convertible hole convolution comprises a first conversion function, a convertible hole convolution corresponding to a first hole rate, and a convertible hole convolution corresponding to a second hole rate, and the processing the second feature map by using the plurality of stages of convertible hole convolution that are cascaded in sequence to obtain a third feature map comprises:
processing the second feature map by using the first conversion function to obtain a processing result of the first conversion function;
performing 3x3 hole convolution processing on the second feature map by using convertible hole convolution corresponding to the first hole rate to obtain a first hole convolution processing result;
performing 3x3 hole convolution processing on the second feature map by using the convertible hole convolution corresponding to the second hole rate to obtain a second hole convolution processing result;
and according to the processing result of the first conversion function, performing fusion processing on the first hole convolution processing result and the second hole convolution processing result to obtain a fourth feature map output by the first stage of convertible hole convolution.
5. The target detection model training method of claim 4, wherein the first conversion function comprises 5x5 average pooling and 1x1 convolution, and the processing the second feature map by using the first conversion function to obtain the processing result of the first conversion function comprises:
carrying out 5x5 average pooling on the second feature map to obtain a first average pooling result;
and carrying out 1x1 convolution on the first average pooling result to obtain a processing result of the first conversion function.
6. The target detection model training method of claim 2, wherein the second stage of convertible hole convolution comprises a second conversion function and a convertible hole convolution corresponding to a third hole rate, and the processing the second feature map by using the plurality of stages of convertible hole convolution that are cascaded in sequence to obtain a third feature map comprises:
processing the second feature map by using the second conversion function to obtain a processing result of the second conversion function;
performing 3x3 hole convolution processing on the second feature map by using convertible hole convolution corresponding to the third hole rate to obtain a third hole convolution processing result;
acquiring the fourth feature map output by the first stage of convertible hole convolution;
and according to the processing result of the second conversion function, performing fusion processing on the fourth feature map and the third hole convolution processing result to obtain the third feature map.
7. The target detection model training method of claim 6, wherein the second conversion function comprises 11x11 average pooling and 1x1 convolution, and the processing the second feature map by using the second conversion function to obtain the processing result of the second conversion function comprises:
carrying out 11x11 average pooling on the second feature map to obtain a second average pooling result;
and carrying out 1x1 convolution on the second average pooling result to obtain a processing result of the second conversion function.
8. A target detection method, wherein the target detection method comprises:
acquiring an image to be detected;
detecting the image to be detected by using a target detection model to obtain a target detection result;
wherein the target detection model is obtained by training based on the target detection model training method according to any one of claims 1 to 7.
9. A target detection model training apparatus, wherein the target detection model training apparatus comprises:
the first acquisition unit is used for acquiring an image to be trained, wherein the image to be trained comprises the labeling information of the image to be trained;
the feature extraction unit is used for extracting features of the image to be trained by using a backbone network of the target detection model to obtain a multi-scale feature map of the image to be trained, wherein the backbone network comprises an enhanced convertible hole convolution which is used for extracting multi-scale feature information in the image to be trained;
the feature fusion unit is used for performing feature fusion on the multi-scale feature map of the image to be trained by using the feature pyramid network of the target detection model to obtain a fused multi-scale feature map;
the first detection unit is used for detecting the fused multi-scale feature map by using a detection head network of the target detection model to obtain a target detection result of the image to be trained;
and the updating unit is used for determining a loss value according to the target detection result of the image to be trained and the labeling information of the image to be trained, and updating the parameters of the target detection model by using the loss value to obtain the trained target detection model.
10. A target detection apparatus, wherein the target detection apparatus comprises:
the second acquisition unit is used for acquiring an image to be detected;
the second detection unit is used for detecting the image to be detected by using the target detection model to obtain a target detection result;
wherein the target detection model is trained based on the target detection model training apparatus of claim 9.
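To make the structure recited in claims 2 to 7 concrete, the following is a minimal PyTorch sketch of the enhanced convertible hole convolution: a first global context module (claim 3), a first stage of convertible hole convolution whose first conversion function is 5x5 average pooling followed by 1x1 convolution (claims 4 and 5), a second stage whose second conversion function is 11x11 average pooling followed by 1x1 convolution (claims 6 and 7), and a second global context module. The concrete hole rates (1, 2, and 4), the sigmoid gating, the additive fusion inside the global context module, and all class and parameter names are assumptions made for illustration; the claims do not fix these details.

import torch
import torch.nn as nn

class GlobalContext(nn.Module):
    # Claim 3: global average pooling, 1x1 convolution, then
    # fusion (here: addition) with the input feature map.
    def __init__(self, channels):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return x + self.conv(self.pool(x))

class EnhancedConvertibleHoleConv(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.gc1 = GlobalContext(channels)
        self.gc2 = GlobalContext(channels)
        # First conversion function: 5x5 average pooling + 1x1
        # convolution; the sigmoid gating is an assumption.
        self.switch1 = nn.Sequential(
            nn.AvgPool2d(kernel_size=5, stride=1, padding=2),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )
        # 3x3 convertible hole convolutions for the first and
        # second hole rates (assumed here to be 1 and 2).
        self.conv_r1 = nn.Conv2d(channels, channels, 3, padding=1, dilation=1)
        self.conv_r2 = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        # Second conversion function: 11x11 average pooling + 1x1
        # convolution.
        self.switch2 = nn.Sequential(
            nn.AvgPool2d(kernel_size=11, stride=1, padding=5),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )
        # 3x3 convertible hole convolution for the third hole rate
        # (assumed here to be 4).
        self.conv_r3 = nn.Conv2d(channels, channels, 3, padding=4, dilation=4)

    def forward(self, first_feature_map):
        # First global context module -> second feature map.
        second = self.gc1(first_feature_map)
        # First stage: fuse two hole convolutions of the second
        # feature map according to the first conversion function,
        # giving the fourth feature map.
        s1 = self.switch1(second)
        fourth = s1 * self.conv_r1(second) + (1 - s1) * self.conv_r2(second)
        # Second stage: fuse the fourth feature map with a third
        # hole convolution of the second feature map according to
        # the second conversion function, giving the third feature map.
        s2 = self.switch2(second)
        third = s2 * fourth + (1 - s2) * self.conv_r3(second)
        # Second global context module -> multi-scale feature map.
        return self.gc2(third)

Under these assumptions the module preserves the spatial size of its input, so it can stand in for a standard 3x3 convolution inside a backbone stage; for example, EnhancedConvertibleHoleConv(64) maps a tensor of shape (1, 64, 56, 56) to a tensor of the same shape.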
CN202210435047.5A 2022-04-24 2022-04-24 Target detection model training method and device, and target detection method and device Pending CN114694005A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210435047.5A CN114694005A (en) 2022-04-24 2022-04-24 Target detection model training method and device, and target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210435047.5A CN114694005A (en) 2022-04-24 2022-04-24 Target detection model training method and device, and target detection method and device

Publications (1)

Publication Number Publication Date
CN114694005A true CN114694005A (en) 2022-07-01

Family

ID=82144035

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210435047.5A Pending CN114694005A (en) 2022-04-24 2022-04-24 Target detection model training method and device, and target detection method and device

Country Status (1)

Country Link
CN (1) CN114694005A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115998295A (en) * 2023-03-24 2023-04-25 广东工业大学 Blood fat estimation method, system and device combining far-near infrared light
CN116229194A (en) * 2023-05-09 2023-06-06 江西云眼视界科技股份有限公司 Method, system, computer and readable storage medium for detecting saliency target
CN116863298A (en) * 2023-06-29 2023-10-10 深圳市快瞳科技有限公司 Training and early warning sending method, system, device, equipment and medium
CN116863298B (en) * 2023-06-29 2024-05-10 深圳市快瞳科技有限公司 Training and early warning sending method, system, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN114694005A (en) Target detection model training method and device, and target detection method and device
CN111583097A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN112634209A (en) Product defect detection method and device
CN112016475B (en) Human body detection and identification method and device
CN110544214A (en) Image restoration method and device and electronic equipment
CN113807361B (en) Neural network, target detection method, neural network training method and related products
CN114861842B (en) Few-sample target detection method and device and electronic equipment
CN112597918A (en) Text detection method and device, electronic equipment and storage medium
CN115631112B (en) Building contour correction method and device based on deep learning
CN114359665A (en) Training method and device of full-task face recognition model and face recognition method
CN111709415A (en) Target detection method, target detection device, computer equipment and storage medium
CN114202457A (en) Method for processing low-resolution image, electronic device and computer program product
CN114299358A (en) Image quality evaluation method and device, electronic equipment and machine-readable storage medium
CN113744280A (en) Image processing method, apparatus, device and medium
CN113393410A (en) Image fusion method and device, electronic equipment and storage medium
CN112633066A (en) Aerial small target detection method, device, equipment and storage medium
CN114065868B (en) Training method of text detection model, text detection method and device
CN111914894A (en) Feature extraction method and device, electronic equipment and computer-readable storage medium
CN115731542A (en) Multi-mode weak supervision three-dimensional target detection method, system and equipment
CN115294361A (en) Feature extraction method and device
CN113591543B (en) Traffic sign recognition method, device, electronic equipment and computer storage medium
CN111967365B (en) Image connection point extraction method and device
CN113326891A (en) Method, system and device for detecting small target object
CN113592724A (en) Target face image restoration method and device
CN112949656B (en) Underwater terrain matching positioning method, device and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination