CN111695403B - Depth perception convolutional neural network-based 2D and 3D image synchronous detection method - Google Patents

Depth perception convolutional neural network-based 2D and 3D image synchronous detection method Download PDF

Info

Publication number
CN111695403B
Authority
CN
China
Prior art keywords
frame
anchor
anchor point
global
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010308948.9A
Other languages
Chinese (zh)
Other versions
CN111695403A (en
Inventor
吴明瞭
付智俊
郭启翔
尹思维
谢斌
何薇
焦红波
王晨阳
白世伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongfeng Automobile Co Ltd
Original Assignee
Dongfeng Automobile Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongfeng Automobile Co Ltd filed Critical Dongfeng Automobile Co Ltd
Priority to CN202010308948.9A priority Critical patent/CN111695403B/en
Publication of CN111695403A publication Critical patent/CN111695403A/en
Application granted granted Critical
Publication of CN111695403B publication Critical patent/CN111695403B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/08Projecting images onto non-planar surfaces, e.g. geodetic screens
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/593Depth or shape recovery from multiple images from stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network, which comprises the following steps: step 1, defining the target anchor point formulas, introducing a preset depth information parameter, and designating a shared center pixel position; step 2, generating preset anchor frames from the anchor point template of the target object, the visual anchor point generation formula, and the 3D prior anchor points; step 3, checking the intersection-over-union of the anchor frames; step 4, analyzing the network loss functions of the target object; step 5, establishing a depth perception convolutional region proposal network: a DenseNet convolutional neural network is introduced to obtain a feature map, the feature map is fed into global feature extraction and local feature extraction branches, and the two are finally combined with a certain weight; step 6, forward optimization: a parameter step size σ is introduced, a loop termination parameter β is set, and the parameters are optimized; step 7, outputting the 3D parameters. The invention achieves higher safety for automatic driving and can be widely applied in the field of computer vision.

Description

Depth perception convolutional neural network-based 2D and 3D image synchronous detection method
Technical Field
The invention relates to methods for detecting targets of interest in computer-vision fields such as unmanned driving and assisted driving, and in particular to a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network.
Background
Object detection refers to using computer technology to detect and identify the category and position information of objects of interest (such as vehicles, pedestrians and obstacles) in an image or video, and is one of the important research areas of computer vision. With the continuous improvement and development of deep learning, object detection technology based on deep learning has been widely applied in many real-world fields, such as unmanned driving, assisted driving, face recognition, unmanned security, human-machine interaction and behavior recognition.
As one of the important research directions of deep learning, the deep convolutional neural network has achieved remarkable results on object detection tasks and can detect and identify objects of interest in 2D image data in real time. In the field of unmanned-driving research, however, the system is required to obtain the position of the object of interest in 3D space in order to better realize the corresponding functions and thereby improve the stability and safety of the system.
The current hardware for 3D image recognition relies on cameras, which can be divided into monocular and multi-ocular cameras according to their function. A monocular camera has a fixed focal length and is mostly applied to road-condition judgment in automatic driving, but it faces an inherent contradiction between ranging coverage and ranging distance: the wider the camera's viewing angle, the shorter the distance that can be measured accurately, while the narrower the viewing angle, the longer the measurable distance — much like the human eye, which sees farther over a narrower coverage and nearer over a wider one. A binocular camera combines cameras with different focal lengths, and the focal length is related to imaging sharpness; however, it is currently difficult for vehicle-mounted cameras to zoom frequently, and multi-camera rigs cost more and increase algorithm complexity compared with a monocular camera, so they are not yet suitable for unmanned-driving systems.
To improve the accuracy of 3D image detection, existing 3D detection methods also rely on expensive lidar sensors, which provide sparse depth data as input. However, when such lidar-dependent schemes are combined with a monocular camera, the sparse depth data lacks sufficient depth information, which makes 3D image detection difficult to implement.
Taking the object detection task of an automatic driving system as an example: a traditional 2D object detection method acquires the real-time road scene during driving through a vehicle-mounted camera, feeds it into an existing algorithm, detects the objects of interest in the image with a trained detection model, and outputs the position and category information of the objects to the decision layer of the control end, which plans how the vehicle should drive. One problem, however, is that the 3D spatial position information of the detection target obtained from a monocular camera is unstable, and many influencing factors reduce its accuracy.
Disclosure of Invention
The invention aims to overcome the shortcomings of the background art and provides a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network, so that the advantage of a camera in preserving more detailed semantic information is added on top of the accurate depth information kept by a laser scanner, enabling higher drivability and safety during automatic driving.
The invention provides a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network, comprising the following steps. Step 1, define the anchor point template of the target object: define the specific formulas of the 2D target anchor point and the 3D target anchor point, introduce a preset depth information parameter, and designate a shared center pixel position. Step 2, generate the anchor frames of the model prediction feature map: according to the anchor point template of the target object, generate preset anchor frames from the visual anchor point generation formula and the pre-computed 3D prior anchor points. Step 3, check the intersection-over-union (IOU) of the anchor frames with the ground truth (GT): check whether the IOU between each generated anchor frame and the GT is greater than or equal to 0.5. Step 4, analyze the network loss functions of the target object, comprising the classification loss function LC analysis, the 2D frame regression loss function analysis, and the 3D frame regression loss function analysis. Step 5, establish the depth perception convolutional region proposal network: introduce a DenseNet convolutional neural network to obtain a feature map of dimension h×w, feed the feature map into two branches — one for global feature extraction and one for local feature extraction — and finally combine the features of the two branches with a certain weight. Step 6, forward optimization: project the 3D information to 2D information, perform forward optimization, introduce a parameter step size σ for updating θ, set a loop termination parameter β, and input the optimized parameters when α is greater than the parameter β. Step 7, perform 3D target detection according to the 3D output parameters.
In the above technical solution, in the step 1, the specific formula of the 2D target anchor point is [w, h]_2D and the specific formula of the 3D target anchor point is [w, h, l, θ]_3D, where w, h and l respectively represent given values of the width, height and length of the target detection object, and θ represents the observation viewing angle of the camera on the target detection object; a preset depth information parameter Zp is introduced and a shared center pixel position [x, y]_P is designated, where the 2D parameters are expressed in pixel coordinates as [x, y]_2D = P·[w, h]_2D, P representing the known projection matrix required to project the object; the 3D center position [x, y, z]_3D in the camera coordinate system is projected into an image given the known projection matrix P, and the depth information parameter Zp is encoded, with the following formula:
In the above technical solution, in the step 2, each anchor point in the model prediction output feature map is defined as C, and each anchor point corresponds to [tx, ty, tw, th]_2D, [tx, ty, tz]_P and [tw, th, tl, tθ]_3D; the total number of anchor points of a single pixel on the feature map of each target detection object is set to na, the number of preset training model categories to nc, and h×w is the resolution of the feature map, so the total number of output frames is nb = w×h×na; each anchor point is distributed at every pixel position [x, y]_P ∈ R^{w×h}, and the first output C represents a shared class prediction of dimension na×nc×h×w, where the output dimension of each class is na×h×w.
In the above technical solution, in the step 2, [tx, ty, tw, th]_2D representing the 2D bounding-box transformation is collectively referred to as b_2D, where the bounding-box transformation formula is as follows:
where x_P and y_P represent the spatial center position of each frame, and the transformed frame b'_2D is defined as [x, y, w, h]'_2D; the 7 transformed output variables, namely the projection center [t_x, t_y, t_z]_P, the scale [t_w, t_h, t_l]_3D and the orientation change t_θ3D, are collectively referred to as b_3D, and this b_3D transformation is applied to anchors with the parameters [w, h]_2D, z_P, [w, h, l, θ]_3D:
similarly, the inverse transform of equation (1) is used on the 3D center position [x, y, z]'_P obtained after projection in image space to calculate the camera coordinates [x, y, z]'_3D; b'_3D represents [x, y, z]'_P and [w, h, l, θ]'_3D.
In the above technical solution, in the step 3, if the IOU between the anchor frame and the GT is less than 0.5, the category of the target object is set to the background category and the boundary anchor frame is ignored or deleted; if the IOU between the anchor frame and the GT is greater than or equal to 0.5, the category index τ, the 2D ground-truth frame and the 3D ground-truth frame of the target object are generated from the matched anchor frame GT.
In the above technical solution, in the step 4, the classification loss function LC adopts a softmax-based multinomial logistic loss, with the following formula:
The 2D frame regression loss function is analyzed to match, via the intersection-over-union IOU, the ground-truth 2D frame and the transformed frame b'_2D:
The 3D frame regression loss function is analyzed to optimize each of the remaining 3D frame parameters with a smooth L1 regression loss, with the following formula:
In the above technical solution, in the step 4, the entire multi-task network loss function L is also introduced, which further includes the regularization weights λ_1 and λ_2 and is defined as follows:
in the above technical solution, in the step 5, the specific process is as follows:
step 5-1, obtaining a characteristic diagram with h x w dimensions by using a convolutional neural network DenseNet: introducing a super-parameter b, wherein b represents the number of bins of a row level, and the number is used for dividing the feature map into b in the transverse direction, and each bin represents a specific convolution kernel k; and 5-2, extracting global/local characteristics, wherein the step 5-2 is divided into two branches, and the flow is specifically as follows: step 5-2-1, global feature extraction: the global feature extraction adopts a conventional convolution, and the conventional convolution introduces global features F in the convolution process global The global feature F global In which a convolution kernel of the packing number 1 and 3*3 is introduced and then non-linearly activated by a Relu function to generate 512 feature maps, the entire feature map is acted upon by conventional 3x3 and 1x1 convolutions, and then C, θ, [ t ] are output on each feature map F x ,t y ,t w ,t h ] 2D ,[t x ,t y ,t z ] P ,[t w ,t h ,t l ,tθ] 3D A total of 13 outputs, each of which is connected to a convolution kernel O of 1*1 global The method comprises the steps of carrying out a first treatment on the surface of the Step 5-2-2, local feature extraction: for local feature extraction, depth-aware convolution is adopted, and the depth-aware convolution introduces global features F in the convolution process local The global feature F local In which the number of incoming padding is 1 and 3*3, then non-linearly activated by the Relu function to generate 512 feature maps, acting on different bins (convolution kernel pixels) with different 3x3 kernels and dividing them longitudinally by b bins, then outputting C, θ, [ t ] on each feature map F x ,t y ,t w ,t h ] 2D ,[t x ,t y ,t z ] P ,[t w ,t h ,t l ,tθ] 3D A total of 13 outputs, each of which is connected to a convolution kernel O of 1*1 local The method comprises the steps of carrying out a first treatment on the surface of the Step 5-3, weighting the output of the global feature extraction and the local feature extraction: introducing a learned weight alpha, wherein the weight alpha uses the spatial invariance of the convolutional neural network as an index of the 1 st to 13 th outputs, and a specific output function is as follows:
O i =O global i ·α i +O local i ·(1-α i ) (8)。
in the above technical solution, in the step 5, further includes steps 5-4: the backbone network of the 3D target detection method is based on DenseNet-121, and provides a dense connection mechanism for interconnecting all layers: that is, each layer will accept all its previous layers as its additional input, resNet will connect each layer together with the previous 2-3 layers by way of element level addition, whereas in DenseNet, each layer will be concat with all the previous layers in the channel dimension and as input for the next layer, denseNet contains a total of L (L+1)/2 connections for a network of L layers, and DenseNet is a feature map from the different layers directly concat.
In the above technical solution, in the step 6, the iterative steps of the algorithm are as follows: the projection of the 3D frame is compared with the 2D estimated frame b'_2D as an L1 loss and θ is continuously adjusted, the 3D frame being projected to the 2D frame according to the following formula:
γ_P = P·γ_3D,  γ_2D = γ_P / γ_P[φ_z],
x_min = min(γ_2D[φ_x]),  y_min = min(γ_2D[φ_y]),
x_max = max(γ_2D[φ_x]),  y_max = max(γ_2D[φ_y])
(9),
where φ represents the index over the axes [x, y, z]; the 2D frame parameters [x_min, y_min, x_max, y_max] projected from the 3D frame and the original 2D frame estimate b'_2D are used to calculate the L1 loss; when the loss is not updated within the range θ ± σ, the step size σ is changed by the attenuation factor γ, and when σ > β, the above operation is performed repeatedly. In the step 7, 13 parameters are output in total for the 3D result, namely: C, θ, [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P, [t_w, t_h, t_l, t_θ]_3D.
The 2D and 3D image synchronous detection method based on the depth perception convolutional neural network has the following beneficial effects: the scheme of the invention provides an algorithm that fuses lidar point clouds with RGB (red (R), green (G), blue (B) channel color) images. 3D visual analysis of targets plays an important role in the visual perception system of an autonomous car, and modern autonomous vehicles are usually equipped with multiple sensors such as lidar and cameras. In terms of application characteristics, both the camera and the lidar can be used for target detection: the laser scanner has the advantage of accurate depth information, while the camera preserves more detailed semantic information, so fusing lidar point clouds with RGB images enables automatic driving with higher performance and safety. Object detection in three-dimensional space using lidar and image data achieves highly accurate localization and recognition of objects in road scenes.
Drawings
FIG. 1 is a basic idea flow chart of a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network;
FIG. 2 is a specific flowchart of a method for detecting 2D and 3D image synchronization based on a depth perception convolutional neural network;
FIG. 3 is a schematic diagram of parameter definition of an anchor point template in a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network;
FIG. 4 is a block diagram of a three-dimensional anchor of a 3D object in a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network;
FIG. 5 is a bird's eye view of a three-dimensional anchor frame of a 3D object in a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network;
FIG. 6 is a schematic diagram of an RPN network architecture in a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network according to the present invention;
FIG. 7 is a schematic diagram of extraction of transverse segmentation local features in a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network;
FIG. 8 is a schematic diagram of longitudinal segmentation local feature extraction in the 2D and 3D image synchronous detection method based on a depth perception convolutional neural network;
fig. 9 is a network architecture diagram of Densenet in the 2D and 3D image synchronous detection method based on the depth perception convolutional neural network.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, which should not be construed as limiting the invention.
Referring to fig. 1, the basic idea of the 2D and 3D image synchronous detection method based on the depth perception convolutional neural network of the invention is: input an image, perform synchronous detection processing of the 2D and 3D images, project the 3D information to 2D information and perform forward optimization, and carry out 3D target detection according to the 3D output parameters.
Referring to fig. 2, the 2D and 3D image synchronous detection method based on the depth perception convolutional neural network comprises the following specific steps:
Step 1: define the anchor point template of the target object. To predict a 2D frame and a 3D frame simultaneously, anchor templates need to be defined in their respective dimension spaces; note that the 2D frame here is the maximum length and width of the 3D target object as observed. Specifically, taking an automobile as an example and referring to fig. 3, the specific formulas of the 2D target anchor point and the 3D target anchor point are [w, h]_2D and [w, h, l, θ]_3D, where w, h and l respectively represent the width, height and length of the target detection object, given as values in the detection camera coordinate system. In addition, since a 3D object, unlike a 2D object, can rotate, θ represents the viewing angle of the camera towards the target detection object, corresponding to a rotation around the Y axis of the camera coordinate system; this viewing angle considers the orientation of the object relative to the camera's line of sight rather than the bird's eye view (BEV) of the ground, and θ is more meaningful for intuitively estimating the viewing angle when processing 3D image features.
To fully define the position of the 2D/3D frame of a target object, a preset depth information parameter Zp is introduced and a shared center pixel position [x, y]_P is specified, where the 2D parameters are represented in pixel coordinates, i.e. [x, y]_2D = P·[w, h]_2D, with P the known projection matrix required to project the target object; in 3D object detection, the 3D center position [x, y, z]_3D in the camera coordinate system is projected into the image given the known projection matrix P, and the depth information parameter Zp is encoded, with the following formula:
Mean statistics are computed for each preset depth information parameter Zp and for the [w, h, l, θ]_3D of each 3D target object, and Zp and [w, h, l, θ]_3D are pre-computed independently for each anchor point; these parameters serve as strong prior information to mitigate the difficulty of 3D parameter estimation. Specifically, for each anchor point, the preset depth information parameter Zp and the [w, h, l, θ]_3D of the 3D object are statistics taken over matches with an intersection-over-union greater than 0.5; the anchor points represent discrete templates, and the 3D priors can be used as strong initial guesses, which assumes a reasonably consistent scene geometry.
Step 2: generate the anchor frames of the model prediction feature map according to the anchor point template of the target object. Specifically, preset anchor frames are generated from the visual anchor point generation formula and the pre-computed 3D prior anchor points; the generated three-dimensional anchor frames can be seen in fig. 4 and their bird's eye view in fig. 5.
Further, each anchor point in the model prediction output feature map is defined as C and corresponds to [tx, ty, tw, th]_2D, [tx, ty, tz]_P, [tw, th, tl, tθ]_3D; the total number of anchor points for a single pixel on the feature map of each target detection object is set to na, the number of preset training-model categories to nc, and h×w is the resolution of the feature map.
Thus, the total number of output frames is nb = w×h×na; each anchor point is distributed at every pixel position [x, y]_P ∈ R^{w×h}.
The first output C represents a shared class prediction of dimension na×nc×h×w, where the output dimension of each class is na×h×w.
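For concreteness, this shape bookkeeping can be sketched as follows (the values of na, nc, h and w are purely illustrative; only the relation nb = w×h×na and the 13-output layout — C, θ, four 2D terms, three projection-center terms and four 3D terms — come from the text):

```python
# Illustrative shape bookkeeping for the prediction head (example values only)
na, nc, h, w = 36, 4, 32, 110

nb = w * h * na                  # total number of output frames
cls_shape   = (na * nc, h, w)    # shared class prediction C (na x nc x h x w)
theta_shape = (na * 1, h, w)     # observation angle theta
box2d_shape = (na * 4, h, w)     # [tx, ty, tw, th]_2D
cen_shape   = (na * 3, h, w)     # [tx, ty, tz]_P
box3d_shape = (na * 4, h, w)     # [tw, th, tl, t_theta]_3D

print(nb, cls_shape, theta_shape, box2d_shape, cen_shape, box3d_shape)
```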
Further, [tx, ty, tw, th]_2D represents the 2D bounding-box transformation, collectively referred to as b_2D; specifically, the bounding-box transformation formula is as follows:
where x_P and y_P denote the spatial center position of each box. The transformed box b'_2D is defined as [x, y, w, h]'_2D. The following 7 outputs represent the projection center transform [t_x, t_y, t_z]_P, the scale [t_w, t_h, t_l]_3D and the orientation change t_θ3D, collectively referred to as b_3D. As in 2D, the transformation is applied to anchors with the parameters [w, h]_2D, z_P, [w, h, l, θ]_3D:
Similarly, b'_3D represents [x, y, z]'_P and [w, h, l, θ]'_3D. As previously described, the method estimates the projected 3D center rather than the camera coordinates in order to better handle the image-space-based convolutional features. During inference, the inverse transformation of formula (1) is used on the 3D center position [x, y, z]'_P obtained after projection in image space to calculate the camera coordinates [x, y, z]'_3D.
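The transformation equations themselves are given in the patent as figures; the sketch below decodes one anchor with an RPN-style parameterization consistent with the surrounding definitions (centers shifted by t·size, sizes scaled exponentially, orientation shifted additively). The exponential scaling and the helper names are assumptions, not the patent's exact equations:

```python
import numpy as np

def decode_anchor(anchor, t2d, tP, t3d):
    """Hedged sketch of applying the b_2D / b_3D transforms to one anchor.
    anchor: dict with the 2D size [w, h]_2D, shared center [x, y]_P, depth z_P
    and 3D prior [w, h, l, theta]_3D of the template."""
    w2d, h2d = anchor["wh_2d"]
    xp, yp = anchor["xy_p"]

    # transformed 2D box b'_2D = [x, y, w, h]'_2D
    b2d = [xp + t2d[0] * w2d, yp + t2d[1] * h2d,
           w2d * np.exp(t2d[2]), h2d * np.exp(t2d[3])]

    # transformed projected center [x, y, z]'_P
    center_p = [xp + tP[0] * w2d, yp + tP[1] * h2d, anchor["z_p"] + tP[2]]

    # transformed 3D size and orientation [w, h, l, theta]'_3D
    w3d, h3d, l3d, th3d = anchor["whl_theta_3d"]
    b3d = [w3d * np.exp(t3d[0]), h3d * np.exp(t3d[1]),
           l3d * np.exp(t3d[2]), th3d + t3d[3]]
    return b2d, center_p, b3d

anchor = {"wh_2d": (48.0, 40.0), "xy_p": (320.0, 180.0), "z_p": 25.0,
          "whl_theta_3d": (1.6, 1.5, 3.9, 0.0)}
print(decode_anchor(anchor, [0.1, -0.05, 0.2, 0.1], [0.1, -0.05, -1.0], [0.05, 0.0, 0.1, 0.2]))
```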
Step 3: check, according to the generated anchor frames, whether the intersection-over-union (IOU) with the ground truth (GT) is greater than or equal to 0.5.
If the IOU between the anchor frame and the GT is less than 0.5, the category of the target object is set to the background category, and the boundary anchor frame is ignored or deleted;
if the IOU between the anchor frame and the GT (ground truth) is greater than or equal to 0.5, the category index τ, the 2D ground-truth frame and the 3D ground-truth frame of the target object are generated from the matched anchor frame, and the following step 4 is performed.
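A minimal sketch of the IOU test described above, assuming axis-aligned boxes in [x1, y1, x2, y2] form (the 0.5 threshold comes from the text; everything else is illustrative):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def label_anchor(anchor_box, gt_box, gt_class):
    """Background when IOU < 0.5, otherwise the GT category index tau."""
    return gt_class if iou(anchor_box, gt_box) >= 0.5 else "background"

print(label_anchor([100, 100, 200, 200], [110, 105, 210, 215], gt_class=1))
```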
Step 4: analyze the network loss functions of the target object. This step includes the classification loss function LC analysis, the 2D frame regression loss function analysis, and the 3D frame regression loss function analysis.
The classification loss function LC adopts a softmax-based multinomial logistic loss function, with the following formula:
A 2D frame regression loss is introduced to match, via the intersection-over-union IOU, the ground-truth 2D frame and the transformed frame b'_2D:
The 3D frame regression loss function is used to optimize each of the remaining 3D frame parameters with a smooth L1 regression loss, with the following formula:
further, for the whole network framework, a whole multi-task network loss function L is also introduced, wherein the whole multi-task network loss function L is also packagedBracketing regularization weights lambda 1 And lambda (lambda) 2 The formula is defined as follows:
step 5: a depth-aware convolutional regional recommendation network is established to enhance the ability of higher-order feature space awareness in the regional recommendation network.
A hyperparameter b is introduced, where b represents the number of row-level bins, i.e. the feature map is divided into b parts in the lateral direction, each bin representing a particular convolution kernel k.
Step 5-1, introduce a DenseNet convolutional neural network. A DenseNet (a deeper convolutional neural network) is used as the basic feature extractor to obtain a feature map of dimension h×w; the feature map is sent to two branches, one for global feature extraction and the other for local feature extraction, and the features of the two branches are finally combined with a certain weight. The global block acts on the whole feature map with conventional 3x3 and 1x1 convolutions, while the local block acts on different bins with different 3x3 kernels; these bins are seen as the cross bars in fig. 6 and divide the map in the longitudinal direction into b bins. The RPN network architecture is shown in fig. 6.
It should be noted that, for the local feature extraction, two feature extraction methods are adopted in the present technology, as shown in fig. 7.
When local feature 1 is extracted, the b longitudinal bars generated from the b bins divided along the longitudinal direction are used as random functions, which increases the randomness of image extraction during convolution and thereby improves the recognition rate.
Furthermore, in order to more accurately identify the 3D target image, the present technology further provides a longitudinal segmentation method, and the specific segmentation method is shown in fig. 8.
The adopted longitudinal segmentation method yields more local features from the feature extraction, thereby improving the recognition rate.
In addition, the backbone network of the present 3D object detection method is based on DenseNet-121, whose network architecture can be seen in fig. 9. DenseNet proposes a more aggressive dense connection mechanism: all layers are interconnected, i.e. each layer accepts all of its preceding layers as additional input. In ResNet, each layer is short-circuited with an earlier layer (typically 2-3 layers back) by element-level addition; in DenseNet, each layer is connected (concat) with all previous layers in the channel dimension (where the feature maps of these layers have the same size) and serves as the input of the next layer. For an L-layer network, DenseNet contains L(L+1)/2 connections in total, a dense connectivity compared with ResNet. Moreover, DenseNet concatenates feature maps from different layers, which enables feature reuse and improves efficiency.
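As a small illustration of the dense connection mechanism described above (a generic DenseNet-style block with arbitrary layer sizes, not the DenseNet-121 configuration actually used as the backbone):

```python
import torch
import torch.nn as nn

class DenseBlockSketch(nn.Module):
    """Toy dense block: each layer receives the channel-wise concat of all
    previous feature maps, giving the L(L+1)/2-style connectivity described above."""
    def __init__(self, in_ch=64, growth=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.BatchNorm2d(in_ch + i * growth), nn.ReLU(inplace=True),
                          nn.Conv2d(in_ch + i * growth, growth, 3, padding=1))
            for i in range(n_layers)])

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))   # concat all preceding layers
        return torch.cat(feats, dim=1)

out = DenseBlockSketch()(torch.randn(1, 64, 32, 110))
print(out.shape)   # channels = 64 + 4 * 32
```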
Step 5-2, global/local feature extraction. Step 5-2 is divided into two branches, namely step 5-2-1 and step 5-2-2.
Step 5-2-1, global feature extraction. Global feature extraction adopts a conventional convolution whose kernel acts as a global convolution over the whole space; this conventional convolution introduces the global feature F_global, to which a 3*3 convolution kernel with padding 1 is applied, followed by a ReLU (Rectified Linear Unit) non-linear activation, generating 512 feature maps.
Then 13 outputs are produced (as seen above, the 13 outputs are C, θ, [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P, [t_w, t_h, t_l, t_θ]_3D), and each output is connected to a 1*1 convolution kernel O_global.
Step 5-2-2, local feature extraction. Local feature extraction adopts depth-aware convolution, i.e. local convolution. The depth-aware convolution introduces the local feature F_local, to which a 3*3 convolution kernel with padding 1 is applied, followed by a ReLU non-linear activation, generating 512 feature maps.
Then 13 outputs are produced (as seen above, the 13 outputs are C, θ, [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P, [t_w, t_h, t_l, t_θ]_3D), and each output is connected to a 1*1 convolution kernel O_local.
Step 5-3, weight the outputs of the global feature extraction and the local feature extraction. A learned weight α is introduced here; it exploits the spatial invariance of convolutional neural networks and serves as the index of the 1st to 13th outputs, with the following specific output function:
O_i = O_global,i · α_i + O_local,i · (1 − α_i)   (8)
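A hedged PyTorch-style sketch of this fusion for a single one of the 13 output heads — a conventional (global) convolution branch, a bin-wise depth-aware (local) branch with a separate 3x3 kernel per bin, and the weighting of formula (8). The layer sizes, the sigmoid used to keep α in (0, 1), and the bin layout are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GlobalLocalHead(nn.Module):
    """Sketch of one output head: O_i = O_global,i * alpha_i + O_local,i * (1 - alpha_i)."""
    def __init__(self, in_ch=512, bins=4):
        super().__init__()
        self.bins = bins
        self.global_conv = nn.Conv2d(in_ch, 1, kernel_size=1)              # 1*1 kernel O_global
        self.local_convs = nn.ModuleList(                                   # one 3x3 kernel per bin
            [nn.Conv2d(in_ch, 1, kernel_size=3, padding=1) for _ in range(bins)])
        self.alpha = nn.Parameter(torch.tensor(0.5))                        # learned fusion weight

    def forward(self, f):
        o_global = self.global_conv(f)
        # depth-aware (bin-wise) branch: split the rows into b bins, each with its own kernel
        chunks = torch.chunk(f, self.bins, dim=2)
        o_local = torch.cat([conv(c) for conv, c in zip(self.local_convs, chunks)], dim=2)
        a = torch.sigmoid(self.alpha)
        return o_global * a + o_local * (1 - a)

print(GlobalLocalHead()(torch.randn(1, 512, 32, 110)).shape)
```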
and 6, projecting the 3D information to the 2D information and performing forward optimization processing. A parameter step σ is derived here (for updating θ) and a loop termination parameter β is set, and when α is greater than parameter β, the input of the optimization parameter is performed.
The iterative step of the algorithm uses the projection of the 3D frame together with the 2D estimated frame b'_2D as an L1 loss while θ is continuously adjusted. The formula for projecting the 3D frame to the 2D frame is as follows:
γ_P = P·γ_3D,  γ_2D = γ_P / γ_P[φ_z],
x_min = min(γ_2D[φ_x]),  y_min = min(γ_2D[φ_y]),
x_max = max(γ_2D[φ_x]),  y_max = max(γ_2D[φ_y])
(9)
where φ denotes the index of the axis [ x, y, z ].
The 2D frame parameters [x_min, y_min, x_max, y_max] obtained after projecting the 3D frame and the original 2D frame estimate b'_2D are used to calculate the L1 loss; when the loss is not updated within the range θ ± σ, the step size σ is changed by the attenuation factor γ, and when σ > β, the above operation is performed repeatedly.
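A hedged sketch of this forward optimization loop over θ. Only the θ ± σ search, the decay by γ when the loss does not improve, and the σ > β termination come from the text; the corner-projection helper, the initial values of σ, γ and β, and the iteration guard are illustrative assumptions:

```python
import numpy as np

def project_3d_box_to_2d(P, center_3d, whl, theta):
    """Project the 8 corners of a 3D box and return [x_min, y_min, x_max, y_max]."""
    w, h, l = whl
    dx, dy, dz = np.meshgrid([-w / 2, w / 2], [-h / 2, h / 2], [-l / 2, l / 2])
    corners = np.stack([dx.ravel(), dy.ravel(), dz.ravel()])
    rot = np.array([[np.cos(theta), 0, np.sin(theta)],
                    [0, 1, 0],
                    [-np.sin(theta), 0, np.cos(theta)]])
    gamma_3d = rot @ corners + np.asarray(center_3d).reshape(3, 1)
    gamma_p = P @ np.vstack([gamma_3d, np.ones((1, 8))])
    gamma_2d = gamma_p[:2] / gamma_p[2]                   # divide by gamma_P[phi_z]
    return np.array([gamma_2d[0].min(), gamma_2d[1].min(),
                     gamma_2d[0].max(), gamma_2d[1].max()])

def refine_theta(P, center_3d, whl, theta, b2d_est, sigma=0.3, gamma=0.5, beta=1e-2):
    """Hill-climb theta by minimizing the L1 distance between the projected 3D box
    and the 2D estimate b'_2D; decay sigma by gamma when theta +/- sigma does not help."""
    l1 = lambda th: np.abs(project_3d_box_to_2d(P, center_3d, whl, th) - b2d_est).sum()
    best, guard = l1(theta), 1000
    while sigma > beta and guard > 0:
        guard -= 1
        loss, th = min((l1(theta + s), theta + s) for s in (-sigma, sigma))
        if loss < best:
            best, theta = loss, th
        else:
            sigma *= gamma                                # attenuation factor gamma
    return theta
```

In an actual pipeline this refinement would run once per detection after the network's forward pass, before the 13 final parameters are output in step 7.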
Step 7, 13 parameters are output in total, namely: C, θ, [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P, [t_w, t_h, t_l, t_θ]_3D; finally, 3D target detection is carried out.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
What is not described in detail in this specification is prior art known to those skilled in the art.

Claims (1)

1. A 2D and 3D image synchronous detection method based on a depth perception convolutional neural network, characterized by comprising the following steps:
step 1, defining an anchor point template of a target object: respectively defining a 2D target anchor point and a 3D target anchor point, introducing a preset depth information parameter, and designating a shared center pixel position; in said step 1, the 2D target anchor point is specifically [w, h]_2D and the 3D target anchor point is specifically [w, h, l, θ]_3D, wherein w, h and l respectively represent given values of the width, the height and the length of the target detection object, and θ represents the viewing angle of the camera to the target detection object; the introduced preset depth information parameter is Zp, and the shared center pixel position is designated as [x, y]_P, wherein the parameters of the 2D representation are represented in pixel coordinates as [x, y]_2D = P·[w, h]_2D, where P represents the coordinate point of the known projection matrix required to project the object; the 3D center position [x, y, z]_3D in the camera coordinate system is projected into an image given the known projection matrix P and the depth information parameter Zp is encoded, with the following formula:
step 2, generating an anchor frame of the model prediction feature map: according to an anchor point template defining a target object, a preset anchor frame is generated according to a visual anchor point generation formula and a pre-calculated 3D priori anchor point;
in the step 2, each anchor point in the model prediction output feature map is defined as C, and each anchor point corresponds to [tx, ty, tw, th]_2D, [tx, ty, tz]_P, [tw, th, tl, tθ]_3D; the total number of anchor points of a single pixel on the feature map of each target detection object is na, the number of preset training model categories is nc, h×w is the resolution of the feature map, and the total number of output frames is nb = w×h×na; each anchor point is distributed at each pixel position [x, y]_P ∈ R^{w×h}, and the first output C represents a shared class prediction of dimension na×nc×h×w, where the output dimension of each class is na×h×w;
in said step 2, [tx, ty, tw, th]_2D representing the 2D bounding box transformation is collectively referred to as b_2D, wherein the bounding box transformation formula is as follows:
wherein x_P and y_P represent the spatial center position of each frame, and the transformed frame b'_2D is defined as [x, y, w, h]'_2D; the 7 transformed output variables, namely the projection center [t_x, t_y, t_z]_P, the scale [t_w, t_h, t_l]_3D and the orientation change t_θ3D, are collectively referred to as b_3D, and the b_3D transformation is applied to anchors with the parameters [w, h]_2D, z_P, [w, h, l, θ]_3D:
similarly, the inverse transform of equation (1) is used on the 3D center position [x, y, z]'_P obtained after projection in image space to calculate the camera coordinates [x, y, z]'_3D; b'_3D represents [x, y, z]'_P and [w, h, l, θ]'_3D;
Step 3, checking the intersection ratio of GT of the anchor frame: checking whether the GT intersection ratio IOU of the anchor frame is more than or equal to 0.5 according to the generated anchor frame;
step 4, analyzing a network loss function of the target object: the method comprises classification loss function LC analysis, 2D frame regression loss function analysis and 3D frame regression loss function analysis;
step 5, establishing a depth perception convolution area suggestion network: introducing a DenseNet convolutional neural network to obtain a feature map with h×w dimensions, then respectively sending the feature map into two branches, wherein one is global feature extraction and the other is local feature extraction, and finally combining the features of the two branches according to a certain weight;
in the step 5, the specific process is as follows:
step 5-1, obtaining a characteristic diagram with h x w dimensions by using a convolutional neural network DenseNet: introducing a super-parameter b, wherein b represents the number of bins of a row level, and the number is used for dividing the feature map into b in the transverse direction, and each bin represents a specific convolution kernel k;
and 5-2, extracting global/local characteristics, wherein the step 5-2 is divided into two branches, and the flow is specifically as follows:
step 5-2-1, global feature extraction: the global feature extraction adopts a conventional convolution, which introduces the global feature F_global in the convolution process; a 3*3 convolution kernel with padding 1 is applied to F_global and then non-linearly activated by the ReLU function to generate 512 feature maps, and the entire feature map is acted upon by conventional 3x3 and 1x1 convolutions;
then C, θ, [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P, [t_w, t_h, t_l, t_θ]_3D, a total of 13 outputs, are output on each feature map F, each of which is connected to a 1*1 convolution kernel O_global;
Step 5-2-2, local feature extraction: depth perception for local feature extractionConvolution, which introduces global features F in the convolution process local The global feature F local A convolution kernel of padding number 1 and 3*3 is introduced, then non-linearly activated by the Relu function to generate 512 feature maps, with different 3x3 kernels acting on different bins and dividing them longitudinally by b bins,
then output C, θ, [ t ] on each feature map F x ,t y ,t w ,t h ] 2D ,[t x ,t y ,t z ] P ,[t w ,t h ,t l ,t θ ] 3D A total of 13 outputs, each of which is connected to a convolution kernel O of 1*1 local
Step 5-3, weighting the output of the global feature extraction and the local feature extraction: introducing a weight alpha learned by the neural network, wherein the weight alpha uses the spatial invariance of the convolutional neural network as an index of the 1 st to 13 th outputs, and a specific output function is as follows:
O_i = O_global,i · α_i + O_local,i · (1 − α_i)   (8)
step 6, forward optimization processing: projecting 3D information to 2D information, performing forward optimization processing, leading out a parameter step sigma for updating theta, setting a cycle termination parameter beta, and inputting an optimization parameter when alpha is larger than the parameter beta;
θ represents the angle of view of the camera to the target detection object;
in the step 6, the iterative steps of the algorithm are as follows:
by using the projection of the 3D frame and the 2D estimated frame b'_2D as an L1 loss and continuously adjusting θ, the 3D frame being projected to the 2D frame according to the following formula:
γ_P = P·γ_3D,  γ_2D = γ_P / γ_P[φ_z],
x_min = min(γ_2D[φ_x]),  y_min = min(γ_2D[φ_y]),
x_max = max(γ_2D[φ_x]),  y_max = max(γ_2D[φ_y])
(9),
wherein φ denotes the index over the axes [x, y, z],
the 2D frame parameters [x_min, y_min, x_max, y_max] projected from the 3D frame and the original 2D frame estimate b'_2D are used to calculate the L1 loss; when the loss is not updated within the range θ ± σ, the step size σ is changed by the attenuation factor γ, and when σ > β, the operation is repeatedly performed;
step 7, 3D target detection is carried out according to the 3D output parameters;
in the step 7, 13 parameters are output in total for the 3D result, the 13 parameters being respectively: C, θ, [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P, [t_w, t_h, t_l, t_θ]_3D;
in the step 3, if the IOU between the anchor frame and the GT is less than 0.5, the category of the target object is set as a background category, and the boundary anchor frame is ignored or deleted; if the IOU between the anchor frame and the GT is greater than or equal to 0.5, the category index τ, the 2D ground-truth frame and the 3D ground-truth frame of the target object are generated according to the matched anchor frame GT;
in the step 4, the classification loss function LC adopts a softmax-based multinomial logistic loss function, with the following formula:
the 2D frame regression loss function is used to match, via the intersection-over-union IOU, the ground-truth 2D frame and the transformed frame b'_2D:
the 3D frame regression loss function is used to optimize each of the remaining 3D frame parameters with a smooth L1 regression loss function, with the following formula:
in the step 4, the overall multi-task network loss function L is also introduced, wherein the regularization weights λ_1 and λ_2 are included, and the formula is defined as follows:
in the step 5, step 5-4 is further included: the backbone network of the 3D target detection method is based on DenseNet-121, which proposes a dense connection mechanism interconnecting all layers: each layer accepts all of its preceding layers as its additional input; ResNet connects each layer with the previous 2-3 layers, short-circuited by element-wise addition, whereas in DenseNet each layer is concatenated (concat) with all preceding layers in the channel dimension and serves as the input for the next layer; for a network of L layers, DenseNet contains L(L+1)/2 connections in total, and DenseNet links the feature maps from the respective layers by concatenation.
CN202010308948.9A 2020-04-19 2020-04-19 Depth perception convolutional neural network-based 2D and 3D image synchronous detection method Active CN111695403B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010308948.9A CN111695403B (en) 2020-04-19 2020-04-19 Depth perception convolutional neural network-based 2D and 3D image synchronous detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010308948.9A CN111695403B (en) 2020-04-19 2020-04-19 Depth perception convolutional neural network-based 2D and 3D image synchronous detection method

Publications (2)

Publication Number Publication Date
CN111695403A CN111695403A (en) 2020-09-22
CN111695403B true CN111695403B (en) 2024-03-22

Family

ID=72476391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010308948.9A Active CN111695403B (en) 2020-04-19 2020-04-19 Depth perception convolutional neural network-based 2D and 3D image synchronous detection method

Country Status (1)

Country Link
CN (1) CN111695403B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114266900B (en) * 2021-12-20 2024-07-05 河南大学 Monocular 3D target detection method based on dynamic convolution


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108985238B (en) * 2018-07-23 2021-10-22 武汉大学 Impervious surface extraction method and system combining deep learning and semantic probability

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07220084A (en) * 1994-02-04 1995-08-18 Canon Inc Arithmetic system, semiconductor device, and image information processor
CN106599939A (en) * 2016-12-30 2017-04-26 深圳市唯特视科技有限公司 Real-time target detection method based on region convolutional neural network
CN106886755A (en) * 2017-01-19 2017-06-23 北京航空航天大学 A kind of intersection vehicles system for detecting regulation violation based on Traffic Sign Recognition
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
EP3525131A1 (en) * 2018-02-09 2019-08-14 Bayerische Motoren Werke Aktiengesellschaft Methods and apparatuses for object detection in a scene represented by depth data of a range detection sensor and image data of a camera
CN109543601A (en) * 2018-11-21 2019-03-29 电子科技大学 A kind of unmanned vehicle object detection method based on multi-modal deep learning
CN110555407A (en) * 2019-09-02 2019-12-10 东风汽车有限公司 pavement vehicle space identification method and electronic equipment
CN110942000A (en) * 2019-11-13 2020-03-31 南京理工大学 Unmanned vehicle target detection method based on deep learning
CN110852314A (en) * 2020-01-16 2020-02-28 江西高创保安服务技术有限公司 Article detection network method based on camera projection model

Also Published As

Publication number Publication date
CN111695403A (en) 2020-09-22

Similar Documents

Publication Publication Date Title
CN111563415B (en) Binocular vision-based three-dimensional target detection system and method
CN111428765B (en) Target detection method based on global convolution and local depth convolution fusion
JP2022515895A (en) Object recognition method and equipment
CN112991413A (en) Self-supervision depth estimation method and system
EP3992908A1 (en) Two-stage depth estimation machine learning algorithm and spherical warping layer for equi-rectangular projection stereo matching
CN104766071B (en) A kind of traffic lights fast algorithm of detecting applied to pilotless automobile
KR101907883B1 (en) Object detection and classification method
CN116258817B (en) Automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction
Lore et al. Generative adversarial networks for depth map estimation from RGB video
Gwn Lore et al. Generative adversarial networks for depth map estimation from RGB video
CN116188999B (en) Small target detection method based on visible light and infrared image data fusion
Ouyang et al. A cgans-based scene reconstruction model using lidar point cloud
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN115410181A (en) Double-head decoupling alignment full scene target detection method, system, device and medium
EP3992909A1 (en) Two-stage depth estimation machine learning algorithm and spherical warping layer for equi-rectangular projection stereo matching
CN115115917A (en) 3D point cloud target detection method based on attention mechanism and image feature fusion
CN111695403B (en) Depth perception convolutional neural network-based 2D and 3D image synchronous detection method
CN106650814B (en) Outdoor road self-adaptive classifier generation method based on vehicle-mounted monocular vision
US20230230317A1 (en) Method for generating at least one ground truth from a bird's eye view
Xiao et al. Research on uav multi-obstacle detection algorithm based on stereo vision
CN114648639B (en) Target vehicle detection method, system and device
CN116563807A (en) Model training method and device, electronic equipment and storage medium
Fu et al. Linear inverse problem for depth completion with rgb image and sparse lidar fusion
CN112950786A (en) Vehicle three-dimensional reconstruction method based on neural network
Vajak et al. HistWind2—An Algorithm for Efficient Lane Detection in Highway and Suburban Environments

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant