CN111695403B - Depth perception convolutional neural network-based 2D and 3D image synchronous detection method - Google Patents

Depth perception convolutional neural network-based 2D and 3D image synchronous detection method Download PDF

Info

Publication number
CN111695403B
Authority
CN
China
Prior art keywords
frame
anchor
anchor point
global
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010308948.9A
Other languages
Chinese (zh)
Other versions
CN111695403A (en
Inventor
吴明瞭
付智俊
郭启翔
尹思维
谢斌
何薇
焦红波
王晨阳
白世伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongfeng Automobile Co Ltd
Original Assignee
Dongfeng Automobile Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongfeng Automobile Co Ltd filed Critical Dongfeng Automobile Co Ltd
Priority to CN202010308948.9A priority Critical patent/CN111695403B/en
Publication of CN111695403A publication Critical patent/CN111695403A/en
Application granted granted Critical
Publication of CN111695403B publication Critical patent/CN111695403B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/08Projecting images onto non-planar surfaces, e.g. geodetic screens
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images
    • G06T7/593Depth or shape recovery from multiple images from stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/90Determination of colour characteristics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network, which comprises the following steps: step 1, defining the target anchor point formulas, introducing a preset depth information parameter, and designating a shared center pixel position; step 2, generating preset anchor frames from the anchor point template of the target object, the visual anchor point generation formula, and the 3D prior anchor points; step 3, checking the intersection-over-union of the anchor frames; step 4, analyzing the network loss functions of the target object; step 5, establishing a depth perception convolutional region proposal network: a DenseNet convolutional neural network is introduced to obtain a feature map, the feature map is fed into global feature extraction and local feature extraction branches, and the two are finally combined with a certain weight; step 6, forward optimization: a parameter step size σ is introduced, a loop termination parameter β is set, and the parameters are optimized; step 7, outputting the 3D parameters. The invention achieves higher safety for automatic driving and can be widely applied in the field of computer vision.

Description

Depth perception convolutional neural network-based 2D and 3D image synchronous detection method
Technical Field
The invention relates to methods for detecting targets of interest in computer-vision fields such as unmanned driving and assisted driving, and in particular to a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network.
Background
Object detection refers to using computer technology to detect and identify the category and position information of objects of interest (such as vehicles, pedestrians and obstacles) in an image or video, and is one of the important research areas of computer vision. With the continuous improvement and development of deep learning, object detection technology based on deep learning has been widely applied in many real-world fields, such as unmanned driving, assisted driving, face recognition, unmanned security, human-machine interaction and behavior recognition.
As one of the important research directions of deep learning, the deep convolutional neural network has achieved remarkable results on object detection tasks and can detect and identify objects of interest in 2D image data in real time. In the field of unmanned-driving research, however, the system is required to obtain the position of the object of interest in 3D space in order to better realize the corresponding functions and thereby improve the stability and safety of the system.
The current hardware for 3D image recognition relies on cameras, which can be divided into monocular and multi-ocular cameras according to their function. A monocular camera has a fixed focal length and is mostly applied to road-condition judgment in automatic driving, but it faces an inherent contradiction between ranging coverage and ranging distance: the wider the camera's viewing angle, the shorter the distance that can be measured accurately, while the narrower the viewing angle, the longer the measurable distance — much like the human eye, which sees farther over a narrower coverage and nearer over a wider one. A binocular camera combines cameras with different focal lengths, and the focal length is related to imaging sharpness; however, it is currently difficult for vehicle-mounted cameras to zoom frequently, and multi-camera rigs cost more and increase algorithm complexity compared with a monocular camera, so they are not yet suitable for unmanned-driving systems.
To improve the accuracy of 3D image detection, existing 3D detection methods also rely on expensive lidar sensors, which provide sparse depth data as input. However, when such lidar-dependent schemes are combined with a monocular camera, the sparse depth data lacks sufficient depth information, which makes 3D image detection difficult to implement.
Taking the object detection task of an automatic driving system as an example: a traditional 2D object detection method acquires the real-time road scene during driving through a vehicle-mounted camera, feeds it into an existing algorithm, detects the objects of interest in the image with a trained detection model, and outputs the position and category information of the objects to the decision layer of the control end, which plans how the vehicle should drive. One problem, however, is that the 3D spatial position information of the detection target obtained from a monocular camera is unstable, and many influencing factors reduce its accuracy.
Disclosure of Invention
The invention aims to overcome the shortcomings of the background art and provides a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network, so that the advantage of a camera in preserving more detailed semantic information is added on top of the accurate depth information kept by a laser scanner, enabling higher drivability and safety during automatic driving.
The invention provides a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network, comprising the following steps. Step 1, define the anchor point template of the target object: define the specific formulas of the 2D target anchor point and the 3D target anchor point, introduce a preset depth information parameter, and designate a shared center pixel position. Step 2, generate the anchor frames of the model prediction feature map: according to the anchor point template of the target object, generate preset anchor frames from the visual anchor point generation formula and the pre-computed 3D prior anchor points. Step 3, check the intersection-over-union (IOU) of the anchor frames with the ground truth (GT): check whether the IOU between each generated anchor frame and the GT is greater than or equal to 0.5. Step 4, analyze the network loss functions of the target object, comprising the classification loss function LC analysis, the 2D frame regression loss function analysis, and the 3D frame regression loss function analysis. Step 5, establish the depth perception convolutional region proposal network: introduce a DenseNet convolutional neural network to obtain a feature map of dimension h×w, feed the feature map into two branches — one for global feature extraction and one for local feature extraction — and finally combine the features of the two branches with a certain weight. Step 6, forward optimization: project the 3D information to 2D information, perform forward optimization, introduce a parameter step size σ for updating θ, set a loop termination parameter β, and input the optimized parameters when α is greater than the parameter β. Step 7, perform 3D target detection according to the 3D output parameters.
In the above technical solution, in the step 1, the specific formula of the 2D target anchor point is [w, h]_2D and the specific formula of the 3D target anchor point is [w, h, l, θ]_3D, where w, h and l respectively represent given values of the width, height and length of the target detection object, and θ represents the observation viewing angle of the camera on the target detection object; a preset depth information parameter Zp is introduced and a shared center pixel position [x, y]_P is designated, where the 2D parameters are expressed in pixel coordinates as [x, y]_2D = P·[w, h]_2D, P representing the known projection matrix required to project the object; the 3D center position [x, y, z]_3D in the camera coordinate system is projected into an image given the known projection matrix P, and the depth information parameter Zp is encoded, with the following formula:
In the above technical solution, in the step 2, each anchor point in the model prediction output feature map is defined as C, and each anchor point corresponds to [tx, ty, tw, th]_2D, [tx, ty, tz]_P and [tw, th, tl, tθ]_3D; the total number of anchor points of a single pixel on the feature map of each target detection object is set to na, the number of preset training model categories to nc, and h×w is the resolution of the feature map, so the total number of output frames is nb = w×h×na; each anchor point is distributed at every pixel position [x, y]_P ∈ R^{w×h}, and the first output C represents a shared class prediction of dimension na×nc×h×w, where the output dimension of each class is na×h×w.
In the above technical solution, in the step 2, [tx, ty, tw, th]_2D representing the 2D bounding-box transformation is collectively referred to as b_2D, where the bounding-box transformation formula is as follows:
where x_P and y_P represent the spatial center position of each frame, and the transformed frame b'_2D is defined as [x, y, w, h]'_2D; the 7 transformed output variables, namely the projection center [t_x, t_y, t_z]_P, the scale [t_w, t_h, t_l]_3D and the orientation change t_θ3D, are collectively referred to as b_3D, and this b_3D transformation is applied to anchors with the parameters [w, h]_2D, z_P, [w, h, l, θ]_3D:
similarly, the inverse transform of equation (1) is used on the 3D center position [x, y, z]'_P obtained after projection in image space to calculate the camera coordinates [x, y, z]'_3D; b'_3D represents [x, y, z]'_P and [w, h, l, θ]'_3D.
In the above technical solution, in the step 3, if the IOU between the anchor frame and the GT is less than 0.5, the category of the target object is set to the background category and the boundary anchor frame is ignored or deleted; if the IOU between the anchor frame and the GT is greater than or equal to 0.5, the category index τ, the 2D ground-truth frame and the 3D ground-truth frame of the target object are generated from the matched anchor frame GT.
In the above technical solution, in the step 4, the classification loss function LC adopts a softmax-based multinomial logistic loss, with the following formula:
The 2D frame regression loss function is analyzed to match, via the intersection-over-union IOU, the ground-truth 2D frame and the transformed frame b'_2D:
The 3D frame regression loss function is analyzed to optimize each of the remaining 3D frame parameters with a smooth L1 regression loss, with the following formula:
In the above technical solution, in the step 4, the entire multi-task network loss function L is also introduced, which further includes the regularization weights λ_1 and λ_2 and is defined as follows:
in the above technical solution, in the step 5, the specific process is as follows:
step 5-1, obtaining a characteristic diagram with h x w dimensions by using a convolutional neural network DenseNet: introducing a super-parameter b, wherein b represents the number of bins of a row level, and the number is used for dividing the feature map into b in the transverse direction, and each bin represents a specific convolution kernel k; and 5-2, extracting global/local characteristics, wherein the step 5-2 is divided into two branches, and the flow is specifically as follows: step 5-2-1, global feature extraction: the global feature extraction adopts a conventional convolution, and the conventional convolution introduces global features F in the convolution process global The global feature F global In which a convolution kernel of the packing number 1 and 3*3 is introduced and then non-linearly activated by a Relu function to generate 512 feature maps, the entire feature map is acted upon by conventional 3x3 and 1x1 convolutions, and then C, θ, [ t ] are output on each feature map F x ,t y ,t w ,t h ] 2D ,[t x ,t y ,t z ] P ,[t w ,t h ,t l ,tθ] 3D A total of 13 outputs, each of which is connected to a convolution kernel O of 1*1 global The method comprises the steps of carrying out a first treatment on the surface of the Step 5-2-2, local feature extraction: for local feature extraction, depth-aware convolution is adopted, and the depth-aware convolution introduces global features F in the convolution process local The global feature F local In which the number of incoming padding is 1 and 3*3, then non-linearly activated by the Relu function to generate 512 feature maps, acting on different bins (convolution kernel pixels) with different 3x3 kernels and dividing them longitudinally by b bins, then outputting C, θ, [ t ] on each feature map F x ,t y ,t w ,t h ] 2D ,[t x ,t y ,t z ] P ,[t w ,t h ,t l ,tθ] 3D A total of 13 outputs, each of which is connected to a convolution kernel O of 1*1 local The method comprises the steps of carrying out a first treatment on the surface of the Step 5-3, weighting the output of the global feature extraction and the local feature extraction: introducing a learned weight alpha, wherein the weight alpha uses the spatial invariance of the convolutional neural network as an index of the 1 st to 13 th outputs, and a specific output function is as follows:
O i =O global i ·α i +O local i ·(1-α i ) (8)。
in the above technical solution, in the step 5, further includes steps 5-4: the backbone network of the 3D target detection method is based on DenseNet-121, and provides a dense connection mechanism for interconnecting all layers: that is, each layer will accept all its previous layers as its additional input, resNet will connect each layer together with the previous 2-3 layers by way of element level addition, whereas in DenseNet, each layer will be concat with all the previous layers in the channel dimension and as input for the next layer, denseNet contains a total of L (L+1)/2 connections for a network of L layers, and DenseNet is a feature map from the different layers directly concat.
In the above technical solution, in the step 6, the iterative steps of the algorithm are as follows: the projection of the 3D frame is compared with the 2D estimated frame b'_2D as an L1 loss and θ is continuously adjusted, the 3D frame being projected to the 2D frame according to the following formula:
γ_P = P·γ_3D,  γ_2D = γ_P / γ_P[φ_z],
x_min = min(γ_2D[φ_x]),  y_min = min(γ_2D[φ_y]),
x_max = max(γ_2D[φ_x]),  y_max = max(γ_2D[φ_y])
(9),
where φ represents the index over the axes [x, y, z]; the 2D frame parameters [x_min, y_min, x_max, y_max] projected from the 3D frame and the original 2D frame estimate b'_2D are used to calculate the L1 loss; when the loss is not updated within the range θ ± σ, the step size σ is changed by the attenuation factor γ, and when σ > β, the above operation is performed repeatedly. In the step 7, 13 parameters are output in total for the 3D result, namely: C, θ, [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P, [t_w, t_h, t_l, t_θ]_3D.
The 2D and 3D image synchronous detection method based on the depth perception convolutional neural network has the following beneficial effects: the scheme of the invention provides an algorithm that fuses lidar point clouds with RGB (red (R), green (G), blue (B) channel color) images. 3D visual analysis of targets plays an important role in the visual perception system of an autonomous car, and modern autonomous vehicles are usually equipped with multiple sensors such as lidar and cameras. In terms of application characteristics, both the camera and the lidar can be used for target detection: the laser scanner has the advantage of accurate depth information, while the camera preserves more detailed semantic information, so fusing lidar point clouds with RGB images enables automatic driving with higher performance and safety. Object detection in three-dimensional space using lidar and image data achieves highly accurate localization and recognition of objects in road scenes.
Drawings
FIG. 1 is a basic idea flow chart of a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network;
FIG. 2 is a specific flowchart of a method for detecting 2D and 3D image synchronization based on a depth perception convolutional neural network;
FIG. 3 is a schematic diagram of parameter definition of an anchor point template in a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network;
FIG. 4 is a block diagram of a three-dimensional anchor of a 3D object in a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network;
FIG. 5 is a bird's eye view of a three-dimensional anchor frame of a 3D object in a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network;
FIG. 6 is a schematic diagram of an RPN network architecture in a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network according to the present invention;
FIG. 7 is a schematic diagram of extraction of transverse segmentation local features in a 2D and 3D image synchronous detection method based on a depth perception convolutional neural network;
FIG. 8 is a schematic diagram of longitudinal segmentation local feature extraction in the 2D and 3D image synchronous detection method based on a depth perception convolutional neural network;
fig. 9 is a network architecture diagram of Densenet in the 2D and 3D image synchronous detection method based on the depth perception convolutional neural network.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, which should not be construed as limiting the invention.
Referring to fig. 1, the basic idea of the 2D and 3D image synchronous detection method based on the depth perception convolutional neural network of the invention is: input an image, perform synchronous detection processing of the 2D and 3D images, project the 3D information to 2D information and perform forward optimization, and carry out 3D target detection according to the 3D output parameters.
Referring to fig. 2, the 2D and 3D image synchronous detection method based on the depth perception convolutional neural network comprises the following specific steps:
Step 1: define the anchor point template of the target object. To predict a 2D frame and a 3D frame simultaneously, anchor templates need to be defined in their respective dimension spaces; note that the 2D frame here is the maximum length and width of the 3D target object as observed. Specifically, taking an automobile as an example and referring to fig. 3, the specific formulas of the 2D target anchor point and the 3D target anchor point are [w, h]_2D and [w, h, l, θ]_3D, where w, h and l respectively represent the width, height and length of the target detection object, given as values in the detection camera coordinate system. In addition, since a 3D object, unlike a 2D object, can rotate, θ represents the viewing angle of the camera towards the target detection object, corresponding to a rotation around the Y axis of the camera coordinate system; this viewing angle considers the orientation of the object relative to the camera's line of sight rather than the bird's eye view (BEV) of the ground, and θ is more meaningful for intuitively estimating the viewing angle when processing 3D image features.
To fully define the position of the 2D/3D frame of a target object, a preset depth information parameter Zp is introduced and a shared center pixel position [x, y]_P is specified, where the 2D parameters are represented in pixel coordinates, i.e. [x, y]_2D = P·[w, h]_2D, with P the known projection matrix required to project the target object; in 3D object detection, the 3D center position [x, y, z]_3D in the camera coordinate system is projected into the image given the known projection matrix P, and the depth information parameter Zp is encoded, with the following formula:
Mean statistics are computed for each preset depth information parameter Zp and for the [w, h, l, θ]_3D of each 3D target object, and Zp and [w, h, l, θ]_3D are pre-computed independently for each anchor point; these parameters serve as strong prior information to mitigate the difficulty of 3D parameter estimation. Specifically, for each anchor point, the preset depth information parameter Zp and the [w, h, l, θ]_3D of the 3D object are statistics taken over matches with an intersection-over-union greater than 0.5; the anchor points represent discrete templates, and the 3D priors can be used as strong initial guesses, which assumes a reasonably consistent scene geometry.
Step 2: generate the anchor frames of the model prediction feature map according to the anchor point template of the target object. Specifically, preset anchor frames are generated from the visual anchor point generation formula and the pre-computed 3D prior anchor points; the generated three-dimensional anchor frames can be seen in fig. 4 and their bird's eye view in fig. 5.
Further, each anchor point in the model prediction output feature map is defined as C and corresponds to [tx, ty, tw, th]_2D, [tx, ty, tz]_P, [tw, th, tl, tθ]_3D; the total number of anchor points for a single pixel on the feature map of each target detection object is set to na, the number of preset training-model categories to nc, and h×w is the resolution of the feature map.
Thus, the total number of output frames is nb = w×h×na; each anchor point is distributed at every pixel position [x, y]_P ∈ R^{w×h}.
The first output C represents a shared class prediction of dimension na×nc×h×w, where the output dimension of each class is na×h×w.
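For concreteness, this shape bookkeeping can be sketched as follows (the values of na, nc, h and w are purely illustrative; only the relation nb = w×h×na and the 13-output layout — C, θ, four 2D terms, three projection-center terms and four 3D terms — come from the text):

```python
# Illustrative shape bookkeeping for the prediction head (example values only)
na, nc, h, w = 36, 4, 32, 110

nb = w * h * na                  # total number of output frames
cls_shape   = (na * nc, h, w)    # shared class prediction C (na x nc x h x w)
theta_shape = (na * 1, h, w)     # observation angle theta
box2d_shape = (na * 4, h, w)     # [tx, ty, tw, th]_2D
cen_shape   = (na * 3, h, w)     # [tx, ty, tz]_P
box3d_shape = (na * 4, h, w)     # [tw, th, tl, t_theta]_3D

print(nb, cls_shape, theta_shape, box2d_shape, cen_shape, box3d_shape)
```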
Further, [tx, ty, tw, th]_2D represents the 2D bounding-box transformation, collectively referred to as b_2D; specifically, the bounding-box transformation formula is as follows:
where x_P and y_P denote the spatial center position of each box. The transformed box b'_2D is defined as [x, y, w, h]'_2D. The following 7 outputs represent the projection center transform [t_x, t_y, t_z]_P, the scale [t_w, t_h, t_l]_3D and the orientation change t_θ3D, collectively referred to as b_3D. As in 2D, the transformation is applied to anchors with the parameters [w, h]_2D, z_P, [w, h, l, θ]_3D:
Similarly, b'_3D represents [x, y, z]'_P and [w, h, l, θ]'_3D. As previously described, the method estimates the projected 3D center rather than the camera coordinates in order to better handle the image-space-based convolutional features. During inference, the inverse transformation of formula (1) is used on the 3D center position [x, y, z]'_P obtained after projection in image space to calculate the camera coordinates [x, y, z]'_3D.
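The transformation equations themselves are given in the patent as figures; the sketch below decodes one anchor with an RPN-style parameterization consistent with the surrounding definitions (centers shifted by t·size, sizes scaled exponentially, orientation shifted additively). The exponential scaling and the helper names are assumptions, not the patent's exact equations:

```python
import numpy as np

def decode_anchor(anchor, t2d, tP, t3d):
    """Hedged sketch of applying the b_2D / b_3D transforms to one anchor.
    anchor: dict with the 2D size [w, h]_2D, shared center [x, y]_P, depth z_P
    and 3D prior [w, h, l, theta]_3D of the template."""
    w2d, h2d = anchor["wh_2d"]
    xp, yp = anchor["xy_p"]

    # transformed 2D box b'_2D = [x, y, w, h]'_2D
    b2d = [xp + t2d[0] * w2d, yp + t2d[1] * h2d,
           w2d * np.exp(t2d[2]), h2d * np.exp(t2d[3])]

    # transformed projected center [x, y, z]'_P
    center_p = [xp + tP[0] * w2d, yp + tP[1] * h2d, anchor["z_p"] + tP[2]]

    # transformed 3D size and orientation [w, h, l, theta]'_3D
    w3d, h3d, l3d, th3d = anchor["whl_theta_3d"]
    b3d = [w3d * np.exp(t3d[0]), h3d * np.exp(t3d[1]),
           l3d * np.exp(t3d[2]), th3d + t3d[3]]
    return b2d, center_p, b3d

anchor = {"wh_2d": (48.0, 40.0), "xy_p": (320.0, 180.0), "z_p": 25.0,
          "whl_theta_3d": (1.6, 1.5, 3.9, 0.0)}
print(decode_anchor(anchor, [0.1, -0.05, 0.2, 0.1], [0.1, -0.05, -1.0], [0.05, 0.0, 0.1, 0.2]))
```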
Step 3: check, according to the generated anchor frames, whether the intersection-over-union (IOU) with the ground truth (GT) is greater than or equal to 0.5.
If the IOU between the anchor frame and the GT is less than 0.5, the category of the target object is set to the background category, and the boundary anchor frame is ignored or deleted;
if the IOU between the anchor frame and the GT (ground truth) is greater than or equal to 0.5, the category index τ, the 2D ground-truth frame and the 3D ground-truth frame of the target object are generated from the matched anchor frame, and the following step 4 is performed.
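A minimal sketch of the IOU test described above, assuming axis-aligned boxes in [x1, y1, x2, y2] form (the 0.5 threshold comes from the text; everything else is illustrative):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def label_anchor(anchor_box, gt_box, gt_class):
    """Background when IOU < 0.5, otherwise the GT category index tau."""
    return gt_class if iou(anchor_box, gt_box) >= 0.5 else "background"

print(label_anchor([100, 100, 200, 200], [110, 105, 210, 215], gt_class=1))
```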
Step 4: analyze the network loss functions of the target object. This step includes the classification loss function LC analysis, the 2D frame regression loss function analysis, and the 3D frame regression loss function analysis.
The classification loss function LC adopts a softmax-based multinomial logistic loss function, with the following formula:
A 2D frame regression loss is introduced to match, via the intersection-over-union IOU, the ground-truth 2D frame and the transformed frame b'_2D:
The 3D frame regression loss function is used to optimize each of the remaining 3D frame parameters with a smooth L1 regression loss, with the following formula:
further, for the whole network framework, a whole multi-task network loss function L is also introduced, wherein the whole multi-task network loss function L is also packagedBracketing regularization weights lambda 1 And lambda (lambda) 2 The formula is defined as follows:
step 5: a depth-aware convolutional regional recommendation network is established to enhance the ability of higher-order feature space awareness in the regional recommendation network.
A hyperparameter b is introduced, where b represents the number of row-level bins, i.e. the feature map is divided into b parts in the lateral direction, each bin representing a particular convolution kernel k.
Step 5-1, introduce a DenseNet convolutional neural network. A DenseNet (a deeper convolutional neural network) is used as the basic feature extractor to obtain a feature map of dimension h×w; the feature map is sent to two branches, one for global feature extraction and the other for local feature extraction, and the features of the two branches are finally combined with a certain weight. The global block acts on the whole feature map with conventional 3x3 and 1x1 convolutions, while the local block acts on different bins with different 3x3 kernels; these bins are seen as the cross bars in fig. 6 and divide the map in the longitudinal direction into b bins. The RPN network architecture is shown in fig. 6.
It should be noted that, for the local feature extraction, two feature extraction methods are adopted in the present technology, as shown in fig. 7.
When local feature 1 is extracted, the b longitudinal bars generated from the b bins divided along the longitudinal direction are used as random functions, which increases the randomness of image extraction during convolution and thereby improves the recognition rate.
Furthermore, in order to more accurately identify the 3D target image, the present technology further provides a longitudinal segmentation method, and the specific segmentation method is shown in fig. 8.
The adopted longitudinal segmentation method yields more local features from the feature extraction, thereby improving the recognition rate.
In addition, the backbone network of the present 3D object detection method is based on DenseNet-121, whose network architecture can be seen in fig. 9. DenseNet proposes a more aggressive dense connection mechanism: all layers are interconnected, i.e. each layer accepts all of its preceding layers as additional input. In ResNet, each layer is short-circuited with an earlier layer (typically 2-3 layers back) by element-level addition; in DenseNet, each layer is connected (concat) with all previous layers in the channel dimension (where the feature maps of these layers have the same size) and serves as the input of the next layer. For an L-layer network, DenseNet contains L(L+1)/2 connections in total, a dense connectivity compared with ResNet. Moreover, DenseNet concatenates feature maps from different layers, which enables feature reuse and improves efficiency.
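As a small illustration of the dense connection mechanism described above (a generic DenseNet-style block with arbitrary layer sizes, not the DenseNet-121 configuration actually used as the backbone):

```python
import torch
import torch.nn as nn

class DenseBlockSketch(nn.Module):
    """Toy dense block: each layer receives the channel-wise concat of all
    previous feature maps, giving the L(L+1)/2-style connectivity described above."""
    def __init__(self, in_ch=64, growth=32, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.BatchNorm2d(in_ch + i * growth), nn.ReLU(inplace=True),
                          nn.Conv2d(in_ch + i * growth, growth, 3, padding=1))
            for i in range(n_layers)])

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))   # concat all preceding layers
        return torch.cat(feats, dim=1)

out = DenseBlockSketch()(torch.randn(1, 64, 32, 110))
print(out.shape)   # channels = 64 + 4 * 32
```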
Step 5-2, global/local feature extraction. Step 5-2 is divided into two branches, namely step 5-2-1 and step 5-2-2.
Step 5-2-1, global feature extraction. Global feature extraction adopts a conventional convolution whose kernel acts as a global convolution over the whole space; this conventional convolution introduces the global feature F_global, to which a 3*3 convolution kernel with padding 1 is applied, followed by a ReLU (Rectified Linear Unit) non-linear activation, generating 512 feature maps.
Then 13 outputs are produced (as seen above, the 13 outputs are C, θ, [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P, [t_w, t_h, t_l, t_θ]_3D), and each output is connected to a 1*1 convolution kernel O_global.
Step 5-2-2, local feature extraction. Local feature extraction adopts depth-aware convolution, i.e. local convolution. The depth-aware convolution introduces the local feature F_local, to which a 3*3 convolution kernel with padding 1 is applied, followed by a ReLU non-linear activation, generating 512 feature maps.
Then 13 outputs are produced (as seen above, the 13 outputs are C, θ, [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P, [t_w, t_h, t_l, t_θ]_3D), and each output is connected to a 1*1 convolution kernel O_local.
Step 5-3, weight the outputs of the global feature extraction and the local feature extraction. A learned weight α is introduced here; it exploits the spatial invariance of convolutional neural networks and serves as the index of the 1st to 13th outputs, with the following specific output function:
O_i = O_global,i · α_i + O_local,i · (1 − α_i)   (8)
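A hedged PyTorch-style sketch of this fusion for a single one of the 13 output heads — a conventional (global) convolution branch, a bin-wise depth-aware (local) branch with a separate 3x3 kernel per bin, and the weighting of formula (8). The layer sizes, the sigmoid used to keep α in (0, 1), and the bin layout are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GlobalLocalHead(nn.Module):
    """Sketch of one output head: O_i = O_global,i * alpha_i + O_local,i * (1 - alpha_i)."""
    def __init__(self, in_ch=512, bins=4):
        super().__init__()
        self.bins = bins
        self.global_conv = nn.Conv2d(in_ch, 1, kernel_size=1)              # 1*1 kernel O_global
        self.local_convs = nn.ModuleList(                                   # one 3x3 kernel per bin
            [nn.Conv2d(in_ch, 1, kernel_size=3, padding=1) for _ in range(bins)])
        self.alpha = nn.Parameter(torch.tensor(0.5))                        # learned fusion weight

    def forward(self, f):
        o_global = self.global_conv(f)
        # depth-aware (bin-wise) branch: split the rows into b bins, each with its own kernel
        chunks = torch.chunk(f, self.bins, dim=2)
        o_local = torch.cat([conv(c) for conv, c in zip(self.local_convs, chunks)], dim=2)
        a = torch.sigmoid(self.alpha)
        return o_global * a + o_local * (1 - a)

print(GlobalLocalHead()(torch.randn(1, 512, 32, 110)).shape)
```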
and 6, projecting the 3D information to the 2D information and performing forward optimization processing. A parameter step σ is derived here (for updating θ) and a loop termination parameter β is set, and when α is greater than parameter β, the input of the optimization parameter is performed.
The iterative step of the algorithm uses the projection of the 3D frame together with the 2D estimated frame b'_2D as an L1 loss while θ is continuously adjusted. The formula for projecting the 3D frame to the 2D frame is as follows:
γ_P = P·γ_3D,  γ_2D = γ_P / γ_P[φ_z],
x_min = min(γ_2D[φ_x]),  y_min = min(γ_2D[φ_y]),
x_max = max(γ_2D[φ_x]),  y_max = max(γ_2D[φ_y])
(9)
where φ denotes the index of the axis [ x, y, z ].
The 2D frame parameters [x_min, y_min, x_max, y_max] obtained after projecting the 3D frame and the original 2D frame estimate b'_2D are used to calculate the L1 loss; when the loss is not updated within the range θ ± σ, the step size σ is changed by the attenuation factor γ, and when σ > β, the above operation is performed repeatedly.
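A hedged sketch of this forward optimization loop over θ. Only the θ ± σ search, the decay by γ when the loss does not improve, and the σ > β termination come from the text; the corner-projection helper, the initial values of σ, γ and β, and the iteration guard are illustrative assumptions:

```python
import numpy as np

def project_3d_box_to_2d(P, center_3d, whl, theta):
    """Project the 8 corners of a 3D box and return [x_min, y_min, x_max, y_max]."""
    w, h, l = whl
    dx, dy, dz = np.meshgrid([-w / 2, w / 2], [-h / 2, h / 2], [-l / 2, l / 2])
    corners = np.stack([dx.ravel(), dy.ravel(), dz.ravel()])
    rot = np.array([[np.cos(theta), 0, np.sin(theta)],
                    [0, 1, 0],
                    [-np.sin(theta), 0, np.cos(theta)]])
    gamma_3d = rot @ corners + np.asarray(center_3d).reshape(3, 1)
    gamma_p = P @ np.vstack([gamma_3d, np.ones((1, 8))])
    gamma_2d = gamma_p[:2] / gamma_p[2]                   # divide by gamma_P[phi_z]
    return np.array([gamma_2d[0].min(), gamma_2d[1].min(),
                     gamma_2d[0].max(), gamma_2d[1].max()])

def refine_theta(P, center_3d, whl, theta, b2d_est, sigma=0.3, gamma=0.5, beta=1e-2):
    """Hill-climb theta by minimizing the L1 distance between the projected 3D box
    and the 2D estimate b'_2D; decay sigma by gamma when theta +/- sigma does not help."""
    l1 = lambda th: np.abs(project_3d_box_to_2d(P, center_3d, whl, th) - b2d_est).sum()
    best, guard = l1(theta), 1000
    while sigma > beta and guard > 0:
        guard -= 1
        loss, th = min((l1(theta + s), theta + s) for s in (-sigma, sigma))
        if loss < best:
            best, theta = loss, th
        else:
            sigma *= gamma                                # attenuation factor gamma
    return theta
```

In an actual pipeline this refinement would run once per detection after the network's forward pass, before the 13 final parameters are output in step 7.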
Step 7, 13 parameters are output in total, namely: C, θ, [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P, [t_w, t_h, t_l, t_θ]_3D; finally, 3D target detection is carried out.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.
What is not described in detail in this specification is prior art known to those skilled in the art.

Claims (1)

1. A 2D and 3D image synchronous detection method based on a depth perception convolutional neural network, characterized by comprising the following steps:
step 1, defining an anchor point template of a target object: respectively defining a 2D target anchor point and a 3D target anchor point, introducing a preset depth information parameter, and designating a shared center pixel position; in said step 1, the 2D target anchor point is specifically [w, h]_2D and the 3D target anchor point is specifically [w, h, l, θ]_3D, wherein w, h and l respectively represent given values of the width, the height and the length of the target detection object, and θ represents the viewing angle of the camera to the target detection object; the introduced preset depth information parameter is Zp, and the shared center pixel position is designated as [x, y]_P, wherein the parameters of the 2D representation are represented in pixel coordinates as [x, y]_2D = P·[w, h]_2D, where P represents the coordinate point of the known projection matrix required to project the object; the 3D center position [x, y, z]_3D in the camera coordinate system is projected into an image given the known projection matrix P and the depth information parameter Zp is encoded, with the following formula:
step 2, generating an anchor frame of the model prediction feature map: according to an anchor point template defining a target object, a preset anchor frame is generated according to a visual anchor point generation formula and a pre-calculated 3D priori anchor point;
in the step 2, each anchor point in the model prediction output feature map is defined as C, and each anchor point corresponds to [tx, ty, tw, th]_2D, [tx, ty, tz]_P, [tw, th, tl, tθ]_3D; the total number of anchor points of a single pixel on the feature map of each target detection object is na, the number of preset training model categories is nc, h×w is the resolution of the feature map, and the total number of output frames is nb = w×h×na; each anchor point is distributed at each pixel position [x, y]_P ∈ R^{w×h}, and the first output C represents a shared class prediction of dimension na×nc×h×w, where the output dimension of each class is na×h×w;
in said step 2, [tx, ty, tw, th]_2D representing the 2D bounding box transformation is collectively referred to as b_2D, wherein the bounding box transformation formula is as follows:
wherein x_P and y_P represent the spatial center position of each frame, and the transformed frame b'_2D is defined as [x, y, w, h]'_2D; the 7 transformed output variables, namely the projection center [t_x, t_y, t_z]_P, the scale [t_w, t_h, t_l]_3D and the orientation change t_θ3D, are collectively referred to as b_3D, and the b_3D transformation is applied to anchors with the parameters [w, h]_2D, z_P, [w, h, l, θ]_3D:
similarly, the inverse transform of equation (1) is used on the 3D center position [x, y, z]'_P obtained after projection in image space to calculate the camera coordinates [x, y, z]'_3D; b'_3D represents [x, y, z]'_P and [w, h, l, θ]'_3D;
Step 3, checking the intersection ratio of GT of the anchor frame: checking whether the GT intersection ratio IOU of the anchor frame is more than or equal to 0.5 according to the generated anchor frame;
step 4, analyzing a network loss function of the target object: the method comprises classification loss function LC analysis, 2D frame regression loss function analysis and 3D frame regression loss function analysis;
step 5, establishing a depth perception convolution area suggestion network: introducing a DenseNet convolutional neural network to obtain a feature map with h×w dimensions, then respectively sending the feature map into two branches, wherein one is global feature extraction and the other is local feature extraction, and finally combining the features of the two branches according to a certain weight;
in the step 5, the specific process is as follows:
step 5-1, obtaining a characteristic diagram with h x w dimensions by using a convolutional neural network DenseNet: introducing a super-parameter b, wherein b represents the number of bins of a row level, and the number is used for dividing the feature map into b in the transverse direction, and each bin represents a specific convolution kernel k;
and 5-2, extracting global/local characteristics, wherein the step 5-2 is divided into two branches, and the flow is specifically as follows:
step 5-2-1, global feature extraction: the global feature extraction adopts a conventional convolution, which introduces the global feature F_global in the convolution process; a 3*3 convolution kernel with padding 1 is applied to F_global and then non-linearly activated by the ReLU function to generate 512 feature maps, and the entire feature map is acted upon by conventional 3x3 and 1x1 convolutions;
then C, θ, [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P, [t_w, t_h, t_l, t_θ]_3D, a total of 13 outputs, are output on each feature map F, each of which is connected to a 1*1 convolution kernel O_global;
Step 5-2-2, local feature extraction: depth perception for local feature extractionConvolution, which introduces global features F in the convolution process local The global feature F local A convolution kernel of padding number 1 and 3*3 is introduced, then non-linearly activated by the Relu function to generate 512 feature maps, with different 3x3 kernels acting on different bins and dividing them longitudinally by b bins,
then output C, θ, [ t ] on each feature map F x ,t y ,t w ,t h ] 2D ,[t x ,t y ,t z ] P ,[t w ,t h ,t l ,t θ ] 3D A total of 13 outputs, each of which is connected to a convolution kernel O of 1*1 local
Step 5-3, weighting the output of the global feature extraction and the local feature extraction: introducing a weight alpha learned by the neural network, wherein the weight alpha uses the spatial invariance of the convolutional neural network as an index of the 1 st to 13 th outputs, and a specific output function is as follows:
O_i = O_global,i · α_i + O_local,i · (1 − α_i)   (8)
step 6, forward optimization processing: projecting 3D information to 2D information, performing forward optimization processing, leading out a parameter step sigma for updating theta, setting a cycle termination parameter beta, and inputting an optimization parameter when alpha is larger than the parameter beta;
θ represents the angle of view of the camera to the target detection object;
in the step 6, the iterative steps of the algorithm are as follows:
by using the projection of the 3D frame and the 2D estimated frame b'_2D as an L1 loss and continuously adjusting θ, the 3D frame being projected to the 2D frame according to the following formula:
γ_P = P·γ_3D,  γ_2D = γ_P / γ_P[φ_z],
x_min = min(γ_2D[φ_x]),  y_min = min(γ_2D[φ_y]),
x_max = max(γ_2D[φ_x]),  y_max = max(γ_2D[φ_y])
(9),
wherein φ denotes the index over the axes [x, y, z],
the 2D frame parameters [x_min, y_min, x_max, y_max] projected from the 3D frame and the original 2D frame estimate b'_2D are used to calculate the L1 loss; when the loss is not updated within the range θ ± σ, the step size σ is changed by the attenuation factor γ, and when σ > β, the operation is repeatedly performed;
step 7, 3D target detection is carried out according to the 3D output parameters;
in the step 7, 13 parameters are output in total for the 3D result, the 13 parameters being respectively: C, θ, [t_x, t_y, t_w, t_h]_2D, [t_x, t_y, t_z]_P, [t_w, t_h, t_l, t_θ]_3D;
in the step 3, if the IOU between the anchor frame and the GT is less than 0.5, the category of the target object is set as a background category, and the boundary anchor frame is ignored or deleted; if the IOU between the anchor frame and the GT is greater than or equal to 0.5, the category index τ, the 2D ground-truth frame and the 3D ground-truth frame of the target object are generated according to the matched anchor frame GT;
in the step 4, the classification loss function LC adopts a softmax-based multinomial logistic loss function, with the following formula:
the 2D frame regression loss function is used to match, via the intersection-over-union IOU, the ground-truth 2D frame and the transformed frame b'_2D:
the 3D frame regression loss function is used to optimize each of the remaining 3D frame parameters with a smooth L1 regression loss function, with the following formula:
in the step 4, the overall multi-task network loss function L is also introduced, wherein the regularization weights λ_1 and λ_2 are included, and the formula is defined as follows:
in the step 5, step 5-4 is further included: the backbone network of the 3D target detection method is based on DenseNet-121, which proposes a dense connection mechanism interconnecting all layers: each layer accepts all of its preceding layers as its additional input; ResNet connects each layer with the previous 2-3 layers, short-circuited by element-wise addition, whereas in DenseNet each layer is concatenated (concat) with all preceding layers in the channel dimension and serves as the input for the next layer; for a network of L layers, DenseNet contains L(L+1)/2 connections in total, and DenseNet links the feature maps from the respective layers by concatenation.
CN202010308948.9A 2020-04-19 2020-04-19 Depth perception convolutional neural network-based 2D and 3D image synchronous detection method Active CN111695403B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010308948.9A CN111695403B (en) 2020-04-19 2020-04-19 Depth perception convolutional neural network-based 2D and 3D image synchronous detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010308948.9A CN111695403B (en) 2020-04-19 2020-04-19 Depth perception convolutional neural network-based 2D and 3D image synchronous detection method

Publications (2)

Publication Number Publication Date
CN111695403A CN111695403A (en) 2020-09-22
CN111695403B true CN111695403B (en) 2024-03-22

Family

ID=72476391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010308948.9A Active CN111695403B (en) 2020-04-19 2020-04-19 Depth perception convolutional neural network-based 2D and 3D image synchronous detection method

Country Status (1)

Country Link
CN (1) CN111695403B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114266900B (en) * 2021-12-20 2024-07-05 河南大学 Monocular 3D target detection method based on dynamic convolution


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108985238B (en) * 2018-07-23 2021-10-22 武汉大学 Impervious surface extraction method and system combining deep learning and semantic probability

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07220084A (en) * 1994-02-04 1995-08-18 Canon Inc Arithmetic system, semiconductor device, and image information processor
CN106599939A (en) * 2016-12-30 2017-04-26 深圳市唯特视科技有限公司 Real-time target detection method based on region convolutional neural network
CN106886755A (en) * 2017-01-19 2017-06-23 北京航空航天大学 A kind of intersection vehicles system for detecting regulation violation based on Traffic Sign Recognition
WO2019144575A1 (en) * 2018-01-24 2019-08-01 中山大学 Fast pedestrian detection method and device
EP3525131A1 (en) * 2018-02-09 2019-08-14 Bayerische Motoren Werke Aktiengesellschaft Methods and apparatuses for object detection in a scene represented by depth data of a range detection sensor and image data of a camera
CN109543601A (en) * 2018-11-21 2019-03-29 电子科技大学 A kind of unmanned vehicle object detection method based on multi-modal deep learning
CN110555407A (en) * 2019-09-02 2019-12-10 东风汽车有限公司 pavement vehicle space identification method and electronic equipment
CN110942000A (en) * 2019-11-13 2020-03-31 南京理工大学 Unmanned vehicle target detection method based on deep learning
CN110852314A (en) * 2020-01-16 2020-02-28 江西高创保安服务技术有限公司 Article detection network method based on camera projection model

Also Published As

Publication number Publication date
CN111695403A (en) 2020-09-22

Similar Documents

Publication Publication Date Title
CN111563415B (en) Binocular vision-based three-dimensional target detection system and method
CN111428765B (en) Target detection method based on global convolution and local depth convolution fusion
JP2022515895A (en) Object recognition method and equipment
CN112991413A (en) Self-supervision depth estimation method and system
EP3992908A1 (en) Two-stage depth estimation machine learning algorithm and spherical warping layer for equi-rectangular projection stereo matching
CN104766071B (en) A kind of traffic lights fast algorithm of detecting applied to pilotless automobile
KR101907883B1 (en) Object detection and classification method
CN116258817B (en) Automatic driving digital twin scene construction method and system based on multi-view three-dimensional reconstruction
Lore et al. Generative adversarial networks for depth map estimation from RGB video
Gwn Lore et al. Generative adversarial networks for depth map estimation from RGB video
CN116188999B (en) Small target detection method based on visible light and infrared image data fusion
Ouyang et al. A cgans-based scene reconstruction model using lidar point cloud
CN116612468A (en) Three-dimensional target detection method based on multi-mode fusion and depth attention mechanism
CN115410181A (en) Double-head decoupling alignment full scene target detection method, system, device and medium
EP3992909A1 (en) Two-stage depth estimation machine learning algorithm and spherical warping layer for equi-rectangular projection stereo matching
CN115115917A (en) 3D point cloud target detection method based on attention mechanism and image feature fusion
CN111695403B (en) Depth perception convolutional neural network-based 2D and 3D image synchronous detection method
CN106650814B (en) Outdoor road self-adaptive classifier generation method based on vehicle-mounted monocular vision
US20230230317A1 (en) Method for generating at least one ground truth from a bird's eye view
Xiao et al. Research on uav multi-obstacle detection algorithm based on stereo vision
CN114648639B (en) Target vehicle detection method, system and device
CN116563807A (en) Model training method and device, electronic equipment and storage medium
Fu et al. Linear inverse problem for depth completion with rgb image and sparse lidar fusion
CN112950786A (en) Vehicle three-dimensional reconstruction method based on neural network
Vajak et al. HistWind2—An Algorithm for Efficient Lane Detection in Highway and Suburban Environments

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant