CN113887289A - Monocular three-dimensional object detection method, device, equipment and product


Info

Publication number
CN113887289A
CN113887289A (application number CN202111013230.8A)
Authority
CN
China
Prior art keywords
dimensional
monocular
detected
image
central point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111013230.8A
Other languages
Chinese (zh)
Inventor
安建平 (An Jianping)
王向韬 (Wang Xiangtao)
牟晓凡 (Mou Xiaofan)
郝雨萌 (Hao Yumeng)
程新景 (Cheng Xinjing)
Current Assignee
International Network Technology Shanghai Co Ltd
Original Assignee
International Network Technology Shanghai Co Ltd
Priority date
Filing date
Publication date
Application filed by International Network Technology Shanghai Co Ltd filed Critical International Network Technology Shanghai Co Ltd
Priority to CN202111013230.8A priority Critical patent/CN113887289A/en
Publication of CN113887289A publication Critical patent/CN113887289A/en


Classifications

    • G06N3/045: Neural networks; architecture; combinations of networks
    • G06N3/08: Neural networks; learning methods
    • G06T7/50: Image analysis; depth or shape recovery
    • G06T7/66: Image analysis; geometric attributes of image moments or centre of gravity
    • G06T7/75: Image analysis; determining position or orientation of objects or cameras using feature-based methods involving models


Abstract

The invention provides a monocular three-dimensional object detection method, device, equipment and product, relating to the technical field of computer vision. The method comprises the following steps: acquiring an image to be detected; inputting the image to be detected into a monocular three-dimensional object detection model to obtain a two-dimensional detection result, an object depth, an observation angle and a size of the object to be detected, output by the model, the two-dimensional detection result including a two-dimensional central point; generating a three-dimensional central point of the object to be detected in three-dimensional space from the two-dimensional central point and the object depth; generating an orientation angle of the object to be detected from the three-dimensional central point and the observation angle; and generating a three-dimensional detection result from the three-dimensional central point, the orientation angle and the size. The method markedly improves the prediction performance for truncated objects in monocular three-dimensional detection and is not limited by the type of the detected object: monocular three-dimensional detection can be performed on both normal objects and truncated objects.

Description

Monocular three-dimensional object detection method, device, equipment and product
Technical Field
The invention relates to the technical field of computer vision, and in particular to a monocular three-dimensional object detection method, device, equipment and product.
Background
With the emergence of large-scale datasets and the development of deep learning, image-based two-dimensional target detection algorithms have advanced greatly and are widely applied in fields such as autonomous driving, video surveillance, industrial inspection and image retrieval.
The classical computer vision problem is to identify objects and scenes in an image through a mathematical model or statistical learning, and then to realize motion recognition, object trajectory tracking, behavior recognition and the like on a video sequence. However, since an image is the projection of a three-dimensional space through an optical system, two-dimensional detection can only achieve image-level recognition and cannot perceive the real three-dimensional space. Application scenarios such as autonomous driving, unmanned delivery and Augmented Reality (AR) require three-dimensional localization of objects in the environment, and in these scenarios the shortcomings of two-dimensional detection are pronounced. A higher level of computer vision must therefore accurately obtain the shape, position and posture of a target object in three-dimensional space, realizing detection, recognition, tracking of and interaction with the target in three-dimensional space through three-dimensional reconstruction techniques. This motivates the more challenging task of three-dimensional target detection.
Understanding the position of an object in three-dimensional space from an image is generally harder than image-level two-dimensional object detection. Fully recovering the three-dimensional information of an object means recovering the three-dimensional coordinates of every point on its surface and the relations between those points; in computer graphics such a reconstruction of the object surface is typically represented as a triangulated mesh with texture mapping. For detection, however, an exact reconstruction of the scene is not required: the position of the object can be represented by a cuboid in three-dimensional space. From projective geometry, it is impossible to accurately recover the three-dimensional position of an object from a single image alone; even if relative position information can be obtained, the true scale cannot. Correctly detecting the three-dimensional position of a target therefore conventionally requires, at minimum, a stereoscopic vision system composed of multiple cameras or a moving camera, or three-dimensional point cloud data obtained from sensors such as depth cameras and radar.
That is, three-dimensional target detection requires estimating the spatial position, orientation and size of the target: the spatial position is usually expressed as the three-dimensional (x, y, z) coordinates of the target in the camera coordinate system, the orientation as the azimuth angle of the target in the horizontal direction, and the size as the length, width and height of the target. Whereas two-dimensional detection only needs to recover four degrees of freedom (the pixel position of the box center plus the box width and height), three-dimensional target detection must solve for seven degrees of freedom of the target object. The input modalities of existing three-dimensional target detection methods mainly include lidar point clouds, binocular images and monocular images; for specific categories of targets, machine-learning-based methods make three-dimensional detection of objects through a monocular camera feasible, so monocular three-dimensional detection has become the approach with the lowest hardware requirements and the highest difficulty.
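As a concrete illustration of the seven degrees of freedom just described, a minimal sketch follows; the field names are illustrative, not taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class Box3D:
    """Seven degrees of freedom of a 3-D detection: position (x, y, z) in the
    camera coordinate system, one horizontal orientation angle, and size."""
    x: float       # lateral offset in the camera frame (metres)
    y: float       # vertical offset (metres)
    z: float       # depth along the optical axis (metres)
    yaw: float     # azimuth / orientation angle (radians)
    length: float
    width: float
    height: float

box = Box3D(x=1.5, y=0.2, z=12.0, yaw=0.1, length=4.2, width=1.8, height=1.5)
print(len(vars(box)))  # 7
```

Compare this with a 2-D box, which needs only four numbers (center u, v plus width and height).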
At present, a monocular three-dimensional detection process usually adopts a unified framework or strategy to detect all objects. Images captured by a camera contain truncated objects, i.e. objects located near the image boundary that are only partially visible, and these are very important targets. The object depth of a truncated object is infinitely close to zero or even less than zero. The spatial position of an object in three-dimensional space can be derived from its three-dimensional central point and object depth, but because of this depth problem the three-dimensional central point of a truncated object is difficult to obtain. Truncated objects thus differ significantly in visibility from normal objects; detection performance on them is poor, and existing methods are difficult to apply to truncated-object detection.
Disclosure of Invention
The invention provides a monocular three-dimensional object detection method, device, equipment and product, which overcome the defect in the prior art that monocular three-dimensional detection handles truncated objects poorly because of their significant difference in visibility, and markedly optimize the prediction performance for truncated objects in monocular three-dimensional detection.
The invention provides a monocular three-dimensional object detection method, which comprises the following steps:
acquiring an image to be detected; the image to be detected comprises an object to be detected, and the object to be detected is a normal object and/or a truncated object;
inputting the image to be detected into a monocular three-dimensional object detection model to obtain a two-dimensional detection result, an object depth, an observation angle and a size of the object to be detected, which are output by the monocular three-dimensional object detection model; the two-dimensional detection result comprises a two-dimensional central point, and the object depth is obtained based on the two-dimensional central point and the features of the two-dimensional central point on a feature map;
generating a three-dimensional central point of the object to be detected in a three-dimensional space according to the two-dimensional central point and the object depth;
generating an orientation angle of the object to be detected according to the three-dimensional center point and the observation angle;
and generating a three-dimensional detection result according to the three-dimensional central point, the orientation angle and the size.
According to the monocular three-dimensional object detection method provided by the invention, the monocular three-dimensional object detection model is based on a CenterNet network model.
According to the monocular three-dimensional object detection method provided by the invention, the image to be detected is input into a monocular three-dimensional object detection model, and a two-dimensional detection result, an object depth, an observation angle and a size of the object to be detected, which are output by the monocular three-dimensional object detection model, are obtained, and the method specifically comprises the following steps:
inputting the image to be detected into a monocular three-dimensional object detection model to obtain a two-dimensional detection result of the object to be detected and a feature map, both output by the monocular three-dimensional object detection model; wherein the feature map is the output of one convolutional layer of the monocular three-dimensional object detection model;
acquiring the feature of the two-dimensional central point on the feature map;
and inputting the image to be detected, the two-dimensional central point and the characteristics into the monocular three-dimensional object detection model to obtain the object depth, the observation angle and the size output by the monocular three-dimensional object detection model.
According to the monocular three-dimensional object detection method provided by the invention, the orientation angle of the object to be detected is generated according to the three-dimensional center point and the observation angle, and the method specifically comprises the following steps:
acquiring the type of the object to be detected;
when the type of the object to be detected is a normal object, generating the orientation angle according to the observation angle and the parameters of the camera used to acquire the image to be detected;
and when the type of the object to be detected is a truncated object, generating the orientation angle according to the observation angle and the three-dimensional central point.
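The truncated-object branch above can be illustrated with a small sketch using the common geometric relation between an observation (allocentric) angle and the global orientation, r_y = α + arctan2(x0, z0); the exact formula is an assumption for illustration, as the patent does not state it:

```python
import math

def orientation_from_observation(alpha: float, x0: float, z0: float) -> float:
    """Convert an observation angle to a global orientation (yaw) using the
    object's 3-D centre (x0, z0 in the horizontal plane). Assumed relation:
    r_y = alpha + arctan2(x0, z0), wrapped to (-pi, pi]."""
    ry = alpha + math.atan2(x0, z0)
    while ry <= -math.pi:
        ry += 2.0 * math.pi
    while ry > math.pi:
        ry -= 2.0 * math.pi
    return ry

# An object straight ahead (x0 = 0): observation and orientation angles agree.
print(orientation_from_observation(0.3, 0.0, 10.0))  # 0.3
```

This uses only the observation angle and the three-dimensional central point, matching the truncated-object case where camera-parameter-based conversion is not used.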
According to the monocular three-dimensional object detection method provided by the invention, the monocular three-dimensional object detection model is obtained by training through the following steps:
acquiring a sample image; wherein the sample image contains a sample object, and the sample object is a normal object and/or a truncated object;
acquiring an actual two-dimensional detection result, an actual object depth, an actual observation angle and an actual size of the sample image;
and taking the sample image as input data used for training, taking the actual two-dimensional detection result, the actual object depth, the actual observation angle and the actual size as labels, and training in a deep learning mode to obtain the monocular three-dimensional object detection model for generating the two-dimensional detection result, the object depth, the observation angle and the size of the image to be predicted.
According to the monocular three-dimensional object detection method provided by the invention, the actual two-dimensional detection result, the actual object depth, the actual observation angle and the actual size of the sample image are obtained, and the method specifically comprises the following steps:
and acquiring the actual two-dimensional detection result, the actual object depth, the actual observation angle and the actual size of the sample image based on an annotation mode.
The present invention also provides a monocular three-dimensional object detecting device, comprising:
the first acquisition module is used for acquiring an image to be detected; the image to be detected comprises an object to be detected, and the object to be detected is a normal object and/or a truncated object;
the second acquisition module is used for inputting the image to be detected into a monocular three-dimensional object detection model to obtain a two-dimensional detection result, an object depth, an observation angle and a size of the object to be detected, which are output by the monocular three-dimensional object detection model; the two-dimensional detection result comprises a two-dimensional central point, and the object depth is obtained based on the two-dimensional central point and the features of the two-dimensional central point on a feature map;
the third acquisition module is used for generating a three-dimensional central point of the object to be detected in a three-dimensional space according to the two-dimensional central point and the object depth;
the fourth acquisition module is used for generating an orientation angle of the object to be detected according to the three-dimensional center point and the observation angle;
and the three-dimensional detection module is used for generating a three-dimensional detection result according to the three-dimensional central point, the orientation angle and the size.
The invention also provides an electronic device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the processor executes the program to realize the steps of the monocular three-dimensional object detecting method.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the monocular three-dimensional object detection method as described in any one of the above.
The present invention also provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of the monocular three-dimensional object detection method as described in any one of the above.
The monocular three-dimensional object detection method, device, equipment and product provided by the invention adopt a unified framework or strategy to detect all objects, including normal objects and truncated objects. The two-dimensional central point of the object to be detected on the two-dimensional image is first obtained; the object depth, orientation angle and size of the object are directly regressed from the two-dimensional central point and the two-dimensional image information; and the three-dimensional central point of the object in three-dimensional space is then generated from the two-dimensional central point and the object depth. In particular, when the object to be detected is a truncated object whose object depth is infinitely close to zero or less than zero, the three-dimensional central point can be obtained more accurately in this way. The three-dimensional detection result corresponding to the object is then obtained based on the three-dimensional central point, the orientation angle and the size. The prediction performance for truncated objects in monocular three-dimensional detection is thus markedly optimized; the method is not limited by the type of the detected object, need not consider differences between objects, and can perform monocular three-dimensional detection on both normal objects and truncated objects. In the field of automatic driving, it can significantly improve a vehicle's perception accuracy for close-range targets, achieving a more stable perception effect in practical applications.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a monocular three-dimensional object detection method provided by the present invention;
FIG. 2 is a schematic diagram of a network structure of a decoder part of a monocular three-dimensional object detection model in the monocular three-dimensional object detection method provided by the present invention;
fig. 3 is a schematic flowchart of step S200 in the monocular three-dimensional object detecting method according to the present invention;
fig. 4 is a specific flowchart of step S500 in the monocular three-dimensional object detecting method according to the present invention;
FIG. 5 is a schematic structural diagram of a monocular three-dimensional object detecting device according to the present invention;
fig. 6 is a schematic structural diagram of a second acquisition module in the monocular three-dimensional object detecting device provided by the present invention;
fig. 7 is a schematic structural diagram of a fifth obtaining module in the monocular three-dimensional object detecting device according to the present invention;
fig. 8 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The monocular three-dimensional object detection method of the present invention is described below with reference to figs. 1 and 2. It is intended to detect all objects, including normal objects and truncated objects, using a unified framework or strategy, and comprises the following steps:
s100, obtaining an image to be detected in modes of a camera and the like, wherein the image to be detected comprises an object to be detected, and the object to be detected is a normal object and/or a truncated object. Namely, the method is not limited to the type of the detection object, and monocular three-dimensional detection can be performed on both the normal object and the truncated object without considering the difference between the objects.
S200, inputting the image to be detected into the monocular three-dimensional object detection model to obtain a two-dimensional detection result, an object depth, an observation angle and a size of the object to be detected, output by the monocular three-dimensional object detection model. The two-dimensional detection result comprises a two-dimensional central point, and the object depth is obtained based on the two-dimensional central point and the features of the two-dimensional central point on the feature map. If the object to be detected is a truncated object, the two-dimensional central point is the central point of the visible part of the truncated object in the image to be detected.
In step S200, after the image to be detected is input into the monocular three-dimensional object detection model, the object depth, observation angle and size corresponding to the object to be detected can be directly regressed; the object depth is greater than zero for a normal object, and infinitely close to zero or less than zero for a truncated object.
The main elements of three-dimensional object detection are spatial position, orientation and size. The spatial position of an object in three-dimensional space can be derived from its three-dimensional central point and object depth, but the object depth of a truncated object is infinitely close to zero or less than zero, so in conventional monocular three-dimensional detection the three-dimensional central point is difficult to obtain. For a truncated object, the offset between the projection of the three-dimensional center and the center of the two-dimensional detection box is relatively large; in particular, for a truncated object whose depth approaches zero, the offset can approach infinity, and even if the range of the offset is constrained through the loss function, the problem of accurately obtaining the three-dimensional central point of the truncated object cannot actually be solved.
In the monocular three-dimensional object detection method of the present invention, the monocular three-dimensional object detection model used is a Convolutional Neural Network (CNN) model. A CNN is essentially a Multilayer Perceptron (MLP) that adopts local connections and weight sharing, which on the one hand reduces the number of weights, making the network easier to optimize, and on the other hand reduces the risk of overfitting. The weight-sharing structure of a CNN is closer to a biological neural network, reducing the complexity of the network model and the number of weights. This advantage is most evident when the input to the network is a multi-dimensional image: the image can be used directly as input to the CNN model, avoiding the complex feature extraction and data reconstruction of traditional recognition algorithms. CNN models currently have many strengths in two-dimensional image processing; for example, the network can automatically extract image features including color, texture, shape and the topological structure of the image, and exhibits good robustness and computational efficiency, particularly in recognizing patterns invariant to displacement, scaling and other forms of distortion.
CNNs themselves may combine different neurons and learning rules, and have advantages not found in conventional techniques: good fault tolerance, parallel processing capability and self-learning capability. They can handle complex environmental information, unclear background knowledge and uncertain inference rules, tolerate considerable defects and distortions in samples, and offer high operation speed, good adaptability and high resolution. The CNN model fuses the feature extraction function into the MLP through structural reorganization and weight reduction, omitting the complicated image feature extraction step before recognition. A CNN model consists of an input layer, an output layer and multiple hidden layers, which can be classified into convolutional layers, pooling layers, ReLU layers and fully connected layers. The convolutional layer is the core of the CNN model; its parameters consist of a set of learnable filters (kernels), each of which has a small receptive field but extends through the full depth of the input volume. During the forward pass, each filter is convolved across the width and height of the input volume, computing the dot product between the filter and the input and producing a two-dimensional activation map of that filter; the convolutional layer thus convolves its input to extract higher-level features.
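The convolution operation just described, a filter sliding over the input and taking dot products to produce an activation map, can be sketched minimally in NumPy; this is a generic single-channel illustration, not the patent's implementation:

```python
import numpy as np

def conv2d_single(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid-mode 2-D convolution (cross-correlation, as in most CNN
    frameworks): slide the kernel over the input and take dot products,
    producing the two-dimensional activation map described in the text."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
k = np.ones((3, 3)) / 9.0           # a simple averaging filter
print(conv2d_single(img, k).shape)  # (2, 2)
```

A real convolutional layer repeats this across many filters and input channels and adds a bias, but the per-filter dot product is exactly this operation.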
Preferably, the monocular three-dimensional object detection model is based on a CenterNet network model, with the structure of the existing CenterNet network model optimized. In the specific network structure, an improved MobileNetV2 is adopted as the encoder of the network and a DLAUp network as the decoder; the maximum down-sampling factor of the network is 32, and the final output is a feature map down-sampled by a factor of 4 relative to the input. The encoder comprises 17 convolutional blocks, each consisting of a convolutional layer, a normalization layer and an activation function layer; by reducing some group convolutions in the MobileNet network, the model is adapted to quantized training on embedded platforms, achieving a better balance between performance and computation. The decoder is an up-sampling network with an overall inverted-triangle structure, as shown in fig. 2, where the numbers in the rectangular boxes represent the down-sampling factor of the current feature map and the dotted arrows represent the flow of the network.
S300, generating a three-dimensional central point of the object to be detected in three-dimensional space according to the two-dimensional central point and the object depth. In step S300, if the object to be detected is a truncated object, the three-dimensional central point, i.e. its coordinates (x0, y0, z0), is directly regressed from the two-dimensional central point of the truncated object and the object depth, where x0 represents the lateral distance of the three-dimensional central point in three-dimensional space, y0 its height-direction distance, and z0 its longitudinal distance.
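A hedged sketch of how a three-dimensional centre could be generated from a two-dimensional centre plus a regressed depth, assuming a standard pinhole camera model; the intrinsics fx, fy, cx, cy are conventional parameters introduced here for illustration and are not specified by the patent:

```python
import numpy as np

def backproject_center(u: float, v: float, depth: float,
                       fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Lift a 2-D centre point (u, v) plus a regressed depth to a 3-D centre
    (x0, y0, z0) with the standard pinhole model."""
    x0 = (u - cx) * depth / fx   # lateral distance
    y0 = (v - cy) * depth / fy   # height-direction distance
    z0 = depth                   # longitudinal distance (the regressed depth)
    return np.array([x0, y0, z0])

# A centre at the principal point lies straight down the optical axis.
print(backproject_center(640.0, 360.0, 10.0, 1000.0, 1000.0, 640.0, 360.0))
# [ 0.  0. 10.]
```

Note that this mapping stays well-defined even when the 2-D centre is the centre of the visible part of a truncated object, which is the point of regressing depth at that location.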
S400, generating an orientation angle of the object to be detected according to the three-dimensional center point and the observation angle.
S500, generating a three-dimensional detection result, i.e. a three-dimensional detection box corresponding to the object to be detected, according to the three-dimensional central point, the orientation angle and the size.
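Step S500's assembly of a three-dimensional detection box from centre, orientation angle and size can be sketched as follows; the corner ordering and axis convention (KITTI-style, y pointing down) are assumptions for illustration:

```python
import numpy as np

def box3d_corners(center, yaw, size) -> np.ndarray:
    """Return the eight corners (3 x 8) of a 3-D box from its centre
    (x0, y0, z0), orientation angle (yaw about the vertical axis) and
    size (l, w, h)."""
    l, w, h = size
    # corners in the object frame, box bottom at y = 0
    xs = np.array([ l,  l, -l, -l,  l,  l, -l, -l]) / 2.0
    ys = np.array([ 0,  0,  0,  0, -h, -h, -h, -h], dtype=float)
    zs = np.array([ w, -w, -w,  w,  w, -w, -w,  w]) / 2.0
    rot = np.array([[ np.cos(yaw), 0.0, np.sin(yaw)],
                    [ 0.0,         1.0, 0.0        ],
                    [-np.sin(yaw), 0.0, np.cos(yaw)]])
    corners = rot @ np.vstack([xs, ys, zs])
    return corners + np.asarray(center, dtype=float).reshape(3, 1)

c = box3d_corners((0.0, 1.0, 10.0), 0.0, (4.0, 2.0, 1.5))
print(c.shape)  # (3, 8)
```

With yaw = 0 the box extent along x equals the length l and along z the width w, which is a quick sanity check on the convention.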
The monocular three-dimensional object detection method adopts a unified framework or strategy to detect all objects, including normal objects and truncated objects. The two-dimensional central point of the object to be detected on the two-dimensional image is first obtained; the object depth, orientation angle and size of the object are directly regressed from the two-dimensional central point and the two-dimensional image information; and the three-dimensional central point of the object in three-dimensional space is then generated from the two-dimensional central point and the object depth. In particular, when the object to be detected is a truncated object whose object depth is infinitely close to zero or less than zero, the three-dimensional central point can be obtained more accurately in this way. The three-dimensional detection result corresponding to the object is then obtained based on the three-dimensional central point, the orientation angle and the size. The prediction performance for truncated objects in monocular three-dimensional detection is thus markedly optimized; the method is not limited by the type of the detected object, need not consider differences between objects, and can perform monocular three-dimensional detection on both normal objects and truncated objects. In the field of automatic driving, it can significantly improve a vehicle's perception accuracy for close-range targets, achieving a more stable perception effect in practical applications.
In the following, the monocular three-dimensional object detecting method according to the present invention is described with reference to fig. 3, and step S200 specifically includes the following steps:
S210, inputting the image to be detected into the monocular three-dimensional object detection model to obtain a two-dimensional detection result of the object to be detected and a feature map output by the monocular three-dimensional object detection model, wherein the feature map is the output of one of the convolutional layers of the monocular three-dimensional object detection model.
S220, acquiring the characteristics of the two-dimensional central point on the characteristic diagram;
and S230, inputting the image to be detected, the two-dimensional central point and the characteristics into the monocular three-dimensional object detection model to obtain the object depth, the observation angle and the size output by the monocular three-dimensional object detection model.
The two-dimensional detection result is a two-dimensional detection frame corresponding to the object to be detected, and the two-dimensional detection frame comprises a two-dimensional central point.
In this method, a feature map is the filtered output of one of the convolutional layers of the monocular three-dimensional object detection model, i.e. the features extracted by that layer.
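Step S220's lookup of the two-dimensional central point's feature on the feature map can be sketched as an index into a (C, H, W) map at the center position divided by the output stride. The stride value, function name and shapes are illustrative:

```python
import numpy as np

def feature_at_center(feature_map, center_2d, stride=4):
    """Look up the feature vector at a 2-D central point on a (C, H, W)
    feature map whose resolution is 1/stride of the input image."""
    u, v = center_2d
    col = int(round(u / stride))
    row = int(round(v / stride))
    C, H, W = feature_map.shape
    col = min(max(col, 0), W - 1)   # clamp to the map, e.g. for truncated objects
    row = min(max(row, 0), H - 1)
    return feature_map[:, row, col]

fmap = np.zeros((64, 96, 320), dtype=np.float32)
fmap[:, 10, 20] = 1.0
feat = feature_at_center(fmap, center_2d=(80.0, 40.0))  # 80/4 = 20, 40/4 = 10
```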
In the following, the monocular three-dimensional object detecting method according to the present invention is described with reference to fig. 4, and step S400 specifically includes the following steps:
S510, acquiring the type of the object to be detected, namely whether the object to be detected is a normal object or a truncated object.
S520, when the type of the object to be detected is a normal object, generating an orientation angle according to the observation angle and the camera parameter, wherein the camera is used for acquiring an image to be detected;
and S530, when the type of the object to be detected is a truncated object, generating an orientation angle according to the observation angle and the three-dimensional central point.
Taking the field of automatic driving as an example, when the object to be detected is a vehicle, the parameters modeling the three-dimensional attitude and position of the vehicle include the position of the vehicle in the three-dimensional scene and the angle of the vehicle relative to the camera, specifically two parameters: the offset T of the center position of the three-dimensional envelope from the camera, and the rotation matrix R of the vehicle. The rotation matrix R is determined by three rotation angles: the azimuth (heading), the elevation and the roll; for an automatic driving scene the elevation is 0.
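With the elevation fixed at 0 (and assuming the roll is also 0, as is typical for road vehicles, though the text states this only for the elevation), the rotation matrix R collapses to a single rotation about the vertical axis by the azimuth angle:

```python
import numpy as np

def rotation_from_azimuth(theta):
    """Vehicle rotation matrix R when elevation and roll are zero: a single
    rotation about the vertical (height) axis by the azimuth/heading angle."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[  c, 0.0,   s],
                     [0.0, 1.0, 0.0],
                     [ -s, 0.0,   c]])

R = rotation_from_azimuth(np.pi / 2)  # vehicle heading rotated 90 degrees
```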
It is difficult to estimate the azimuth angle θ directly; it can be obtained by acquiring the rotation angle α of the vehicle relative to the camera, the observation angle being θ - α. For normal objects, the rotation angle α is obtained directly by combining the two-dimensional central point position of the two-dimensional detection frame with the camera intrinsic parameters; for truncated objects, however, the two-dimensional central point position of the two-dimensional detection frame differs greatly from the actual three-dimensional central point position, so the angle is instead calculated directly from the x0 and z0 coordinate values, which improves the calculation accuracy; specifically, please refer to formula (1), where formula (1) is:
θ = α + arctan(x0 / z0)    (1)
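A sketch of the two ways the angle is recovered, assuming the standard relation θ = α + arctan(x0/z0) between global yaw and observation angle (which is what formula (1) appears to encode) and a pinhole ray angle for the normal-object case; function names and intrinsic parameters are illustrative:

```python
import numpy as np

def orientation_angle(obs_angle, x0, z0):
    """Orientation (azimuth) angle for a truncated object from the observation
    angle and the regressed 3-D central point: theta = alpha + arctan(x0 / z0)."""
    return obs_angle + np.arctan2(x0, z0)

def orientation_angle_normal(obs_angle, u, cx, fx):
    """For a normal object, the viewing-ray angle is taken instead from the
    2-D central point column u and the camera intrinsics (fx, cx)."""
    return obs_angle + np.arctan2(u - cx, fx)

theta = orientation_angle(obs_angle=0.3, x0=2.0, z0=2.0)  # 0.3 + pi/4
```

Using `arctan2` rather than `arctan` keeps the angle well-defined even when z0 is close to zero or negative, which is exactly the truncated-object regime discussed above.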
in the method, a monocular three-dimensional object detection model is obtained by training through the following steps:
and A100, acquiring an image of the sample, wherein the sample image also comprises a sample object, and the sample object is a normal object and/or a truncated object.
And A200, acquiring an actual two-dimensional detection result, an actual object depth, an actual observation angle and an actual size of the sample image by means of labeling and the like.
And A300, taking the sample image as input data used for training, taking an actual two-dimensional detection result, an actual object depth, an actual observation angle and an actual size as labels, and training in a deep learning mode to obtain a monocular three-dimensional object detection model for generating the two-dimensional detection result, the object depth, the observation angle and the size of the image to be predicted.
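The patent does not specify the training loss; one common way to supervise the four labelled targets (two-dimensional detection result, object depth, observation angle, size) is a weighted per-head regression penalty, sketched here with illustrative head names and weights:

```python
import numpy as np

def multitask_loss(pred, label, weights=(1.0, 1.0, 1.0, 1.0)):
    """One way to combine the four supervised targets during training:
    an L1 penalty per head, weighted and summed. Head names and weights
    are illustrative; the patent does not specify the exact loss."""
    heads = ("box2d", "depth", "obs_angle", "size")
    return sum(w * np.abs(np.asarray(pred[h]) - np.asarray(label[h])).mean()
               for w, h in zip(weights, heads))

# Only the depth head is off by 0.5 m here, so the total loss is 0.5.
loss = multitask_loss(
    pred={"box2d": [10, 10, 50, 50], "depth": 9.5, "obs_angle": 0.2, "size": [4, 2, 1.5]},
    label={"box2d": [10, 10, 50, 50], "depth": 10.0, "obs_angle": 0.2, "size": [4, 2, 1.5]})
```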
The monocular three-dimensional object detecting device provided by the present invention is described below, and the monocular three-dimensional object detecting device described below and the monocular three-dimensional object detecting method described above may be referred to in correspondence with each other.
The monocular three-dimensional object detecting device of the present invention, which detects all objects, including normal objects and truncated objects, using a unified framework or strategy, will be described below with reference to fig. 5. The device includes:
the first obtaining module 100 is configured to obtain an image to be detected in modes such as a camera, where the image to be detected includes an object to be detected, and the object to be detected is a normal object and/or a truncated object. Namely, the device is not limited to the type of the detection object, and can perform monocular three-dimensional detection on both the normal object and the truncated object without considering the difference between the objects.
The second obtaining module 200 is configured to input the image to be detected into the monocular three-dimensional object detection model, and obtain a two-dimensional detection result, an object depth, an observation angle, and a size of the object to be detected, which are output by the monocular three-dimensional object detection model. The two-dimensional detection result comprises a two-dimensional central point, and the object depth is obtained based on the two-dimensional central point and the characteristics of the two-dimensional central point on the characteristic diagram. If the object is truncated during the detection of the object, the two-dimensional center point represents the two-dimensional center point of the visible part of the truncated object in the image to be detected.
Different from other monocular three-dimensional detection devices, in the second obtaining module 200, after the image to be predicted is input to the monocular three-dimensional object detection model as input data, the object depth, the observation angle and the size corresponding to the object to be detected can be directly regressed, wherein the object depth is larger than zero for a normal object, and the object depth is infinitely close to zero or smaller than zero for a truncated object.
The main elements of three-dimensional object detection are spatial position, orientation and size. The spatial position of an object in three-dimensional space can be derived from the three-dimensional central point of the object and the object depth; since the object depth of a truncated object is infinitely close to zero or negative, the three-dimensional central point is difficult to obtain with traditional monocular three-dimensional detection techniques. For a truncated object, the offset between the projection of the three-dimensional center and the central point of the two-dimensional detection frame is relatively large; in particular, for a truncated object whose depth is close to zero, the offset may approach infinity, and even if the range of the offset is limited by a loss function, the accuracy of acquiring the three-dimensional central point of the truncated object cannot actually be improved.
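The divergence of this offset as the depth approaches zero can be seen directly from pinhole projection; the focal length and lateral position below are illustrative:

```python
# Pinhole projection: u = fx * x / z + cx. The image-plane offset of the
# projected 3-D center from the principal point is |fx * x / z|, which grows
# without bound as the object depth z approaches zero.
fx, x = 720.0, -2.0                          # illustrative focal length (px) and lateral position (m)
depths = (10.0, 1.0, 0.1, 0.01)
offsets = [abs(fx * x / z) for z in depths]  # 144 px at 10 m, ~144000 px at 1 cm
```

At 1 cm the projected center lies far outside any real image, which is why regressing the offset between the two-dimensional box center and the projected three-dimensional center becomes unstable for truncated objects.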
In the monocular three-dimensional object detection device, the monocular three-dimensional object detection model used is a CNN model. A CNN is essentially an MLP that adopts local connections and weight sharing, which on the one hand reduces the number of weights and makes the network easy to optimize, and on the other hand reduces the risk of overfitting. The CNN is a type of neural network whose weight-sharing structure is closer to a biological neural network, reducing the complexity of the network model and the number of weights. This advantage is more obvious when the input of the network is a multi-dimensional image, since the image can be used directly as the input data of the CNN model, avoiding the complex feature extraction and data reconstruction of traditional recognition algorithms. The CNN model has many advantages in two-dimensional image processing: for example, the network can automatically extract image features including color, texture, shape and the topological structure of the image, and it has good robustness and operational efficiency, particularly in recognizing displacement, scaling and other forms of distortion invariance.
CNNs themselves may take the form of different combinations of neurons and learning rules, and have some advantages not found in conventional techniques: good fault tolerance, parallel processing capability and self-learning capability; they can handle complex environmental information, unclear background knowledge and uncertain inference rules, allow samples to have larger defects and distortions, and offer high operation speed, good adaptive performance and high resolution. The CNN model fuses the feature extraction function into the MLP through structural recombination and weight reduction, omitting the complicated image feature extraction process before recognition. The CNN model is composed of an input layer, an output layer, and a plurality of hidden layers; the hidden layers can be classified into convolutional layers, pooling layers, ReLU layers, and fully connected layers. The convolutional layer is the core of the CNN model, and its parameters consist of a set of learnable filters (kernels).
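The layer types above (convolution, ReLU, pooling, fully connected) can be illustrated with a minimal dependency-free forward pass; the kernel and sizes are arbitrary and unrelated to the patent's network:

```python
import numpy as np

def conv2d(x, k):
    """Valid 2-D convolution (really cross-correlation, as in CNN libraries)."""
    H, W = x.shape; kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

def relu(x):
    return np.maximum(x, 0.0)

def maxpool2(x):
    """2x2 max pooling, dropping any odd trailing row/column."""
    H, W = x.shape
    return x[:H - H % 2, :W - W % 2].reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

# input -> convolution -> ReLU -> pooling -> fully connected score
img = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[-1.0, 0.0], [0.0, 1.0]])  # simple diagonal-difference filter
feat = maxpool2(relu(conv2d(img, kernel)))
score = feat.ravel() @ np.ones(feat.size)     # stand-in fully connected layer
```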
Preferably, the monocular three-dimensional object detection model is based on the CenterNet network model, with the structure of the existing CenterNet network optimized. In the specific network structure, an improved MobileNetV2 is adopted as the encoder of the network and a DLAUP network as the decoder; the maximum downsampling factor of the network is 32, and the final output is the feature map downsampled 4 times (i.e. at 1/4 of the input resolution). The encoder part comprises 17 convolutional blocks, each consisting of a convolution layer, a normalization layer and an activation function layer; by reducing some of the group convolutions in the MobileNet network, quantized training on embedded platforms is accommodated, so that performance and computation reach a better balance. The decoder part is an upsampling network structure, and the whole network has an inverted-triangle structure.
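The resolution arithmetic implied by this description (maximum downsampling 32×, final output at 1/4 of the input resolution) can be checked with a short sketch; the input size 384×1280 is an illustrative example, not taken from the patent:

```python
def pyramid_shapes(h, w, strides=(2, 4, 8, 16, 32), out_stride=4):
    """Encoder feature-map sizes at each downsampling stage, plus the
    decoder's final output size at the stated output stride of 4."""
    encoder = [(h // s, w // s) for s in strides]
    output = (h // out_stride, w // out_stride)
    return encoder, output

encoder_maps, out_map = pyramid_shapes(384, 1280)
# deepest encoder map: (12, 40) at stride 32; decoder output: (96, 320) at stride 4
```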
The third obtaining module 300 is configured to generate a three-dimensional central point of the object to be detected in the three-dimensional space according to the two-dimensional central point and the object depth. In the third obtaining module 300, if the object to be detected is a truncated object, the three-dimensional central point, i.e. the coordinate (x0, y0, z0), is directly regressed from the two-dimensional central point of the truncated object and the object depth, where x0 represents the distance of the three-dimensional central point in the lateral direction of the three-dimensional space, y0 represents its distance in the height direction, and z0 represents its distance in the longitudinal direction.
The fourth obtaining module 400 is configured to generate an orientation angle of the object to be detected according to the three-dimensional central point and the observation angle.
The three-dimensional detection module 500 is configured to generate a three-dimensional detection result, that is, a three-dimensional detection frame corresponding to the object to be detected, according to the three-dimensional center point, the orientation angle, and the size.
The monocular three-dimensional object detection device of the invention adopts a unified framework, or strategy, to detect all objects, including both normal objects and truncated objects. The two-dimensional central point of the object to be detected on the two-dimensional image is obtained first; the object depth, observation angle and size of the object to be detected are directly regressed from the two-dimensional central point and the two-dimensional image information; and the three-dimensional central point of the object in three-dimensional space is then generated from the two-dimensional central point and the object depth. In particular, when the object to be detected is a truncated object whose depth is infinitely close to zero or negative, the three-dimensional central point can be obtained more accurately in this way. The three-dimensional detection result corresponding to the object to be detected is then obtained based on the three-dimensional central point, the orientation angle and the size, which significantly optimizes the prediction performance for truncated objects in monocular three-dimensional detection. The device is not limited by the type of detected object and does not need to consider differences between objects: monocular three-dimensional detection can be performed on both normal objects and truncated objects. In the field of automatic driving, the close-range target perception accuracy of the vehicle can be remarkably improved, achieving a more stable perception effect in practical application.
In the following, referring to fig. 6, the monocular three-dimensional object detecting device according to the present invention is described, and the second obtaining module 200 specifically includes:
The first obtaining unit 210 is configured to input the image to be detected into the monocular three-dimensional object detection model, and obtain a two-dimensional detection result of the object to be detected and a feature map output by the monocular three-dimensional object detection model, where the feature map is the output of one of the convolutional layers of the monocular three-dimensional object detection model.
And a second obtaining unit 220, configured to obtain a feature of the two-dimensional center point on the feature map.
The third obtaining unit 230 is configured to input the image to be detected, the two-dimensional central point, and the feature into the monocular three-dimensional object detection model, so as to obtain an object depth, an observation angle, and a size output by the monocular three-dimensional object detection model.
The two-dimensional detection result is a two-dimensional detection frame corresponding to the object to be detected, and the two-dimensional detection frame comprises a two-dimensional central point.
In this apparatus, a feature map is the filtered output of one of the convolutional layers of the monocular three-dimensional object detection model, i.e. the features extracted by that layer.
In the following, referring to fig. 7, the monocular three-dimensional object detecting device according to the present invention is described, and the fourth obtaining module 400 specifically includes:
a fourth acquiring unit 510, configured to acquire the type of the object to be detected, that is, whether the object to be detected is a normal object or a truncated object.
A first generating unit 520, configured to generate an orientation angle according to the observation angle and the camera parameter when the type of the object to be detected is a normal object, where the camera is used to acquire an image to be detected;
and a second generating unit 530, configured to generate an orientation angle according to the observation angle and the three-dimensional center point when the type of the object to be detected is a truncated object.
Taking the field of automatic driving as an example, when the object to be detected is a vehicle, the parameters modeling the three-dimensional attitude and position of the vehicle include the position of the vehicle in the three-dimensional scene and the angle of the vehicle relative to the camera, specifically two parameters: the offset T of the center position of the three-dimensional envelope from the camera, and the rotation matrix R of the vehicle. The rotation matrix R is determined by three rotation angles: the azimuth (heading), the elevation and the roll; for an automatic driving scene the elevation is 0.
It is difficult to estimate the azimuth angle θ directly; it can be obtained by acquiring the rotation angle α of the vehicle relative to the camera, the observation angle being θ - α. For normal objects, the rotation angle α is obtained directly by combining the two-dimensional central point position of the two-dimensional detection frame with the camera intrinsic parameters; for truncated objects, however, the two-dimensional central point position of the two-dimensional detection frame differs greatly from the actual three-dimensional central point position, so the azimuth angle θ is instead calculated directly from the x0 and z0 coordinate values, which improves the calculation accuracy.
Fig. 8 illustrates a physical structure diagram of an electronic device, and as shown in fig. 8, the electronic device may include: a processor (processor)810, a communication Interface 820, a memory 830 and a communication bus 840, wherein the processor 810, the communication Interface 820 and the memory 830 communicate with each other via the communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform a monocular three-dimensional object detection method comprising the steps of:
S100, acquiring an image to be detected; the image to be detected comprises an object to be detected, and the object to be detected is a normal object and/or a truncated object;
S200, inputting the image to be detected into a monocular three-dimensional object detection model to obtain a two-dimensional detection result, an object depth, an observation angle and a size of the object to be detected, which are output by the monocular three-dimensional object detection model; the two-dimensional detection result comprises the two-dimensional central point, and the object depth is obtained based on the two-dimensional central point and the characteristics of the two-dimensional central point on a characteristic map;
S300, generating a three-dimensional central point of the object to be detected in a three-dimensional space according to the two-dimensional central point and the object depth;
S400, generating an orientation angle of the object to be detected according to the three-dimensional center point and the observation angle;
S500, generating a three-dimensional detection result according to the three-dimensional center point, the orientation angle and the size.
In addition, the logic instructions in the memory 830 may be implemented in software functional units and stored in a computer readable storage medium when the logic instructions are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being stored on a non-transitory computer readable storage medium, wherein when the computer program is executed by a processor, a computer is capable of executing the monocular three-dimensional object detection method provided by the above methods, the method comprising the steps of:
S100, acquiring an image to be detected; the image to be detected comprises an object to be detected, and the object to be detected is a normal object and/or a truncated object;
S200, inputting the image to be detected into a monocular three-dimensional object detection model to obtain a two-dimensional detection result, an object depth, an observation angle and a size of the object to be detected, which are output by the monocular three-dimensional object detection model; the two-dimensional detection result comprises the two-dimensional central point, and the object depth is obtained based on the two-dimensional central point and the characteristics of the two-dimensional central point on a characteristic map;
S300, generating a three-dimensional central point of the object to be detected in a three-dimensional space according to the two-dimensional central point and the object depth;
S400, generating an orientation angle of the object to be detected according to the three-dimensional center point and the observation angle;
S500, generating a three-dimensional detection result according to the three-dimensional center point, the orientation angle and the size.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the monocular three-dimensional object detection methods provided by the above methods, the method comprising the steps of:
S100, acquiring an image to be detected; the image to be detected comprises an object to be detected, and the object to be detected is a normal object and/or a truncated object;
S200, inputting the image to be detected into a monocular three-dimensional object detection model to obtain a two-dimensional detection result, an object depth, an observation angle and a size of the object to be detected, which are output by the monocular three-dimensional object detection model; the two-dimensional detection result comprises the two-dimensional central point, and the object depth is obtained based on the two-dimensional central point and the characteristics of the two-dimensional central point on a characteristic map;
S300, generating a three-dimensional central point of the object to be detected in a three-dimensional space according to the two-dimensional central point and the object depth;
S400, generating an orientation angle of the object to be detected according to the three-dimensional center point and the observation angle;
S500, generating a three-dimensional detection result according to the three-dimensional center point, the orientation angle and the size.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A monocular three-dimensional object detection method is characterized by comprising the following steps:
acquiring an image to be detected; the image to be detected comprises an object to be detected, and the object to be detected is a normal object and/or a truncated object;
inputting the image to be detected into a monocular three-dimensional object detection model to obtain a two-dimensional detection result, an object depth, an observation angle and a size of the object to be detected, which are output by the monocular three-dimensional object detection model; the two-dimensional detection result comprises the two-dimensional central point, and the object depth is obtained based on the two-dimensional central point and the characteristics of the two-dimensional central point on a characteristic map;
generating a three-dimensional central point of the object to be detected in a three-dimensional space according to the two-dimensional central point and the object depth;
generating an orientation angle of the object to be detected according to the three-dimensional center point and the observation angle;
and generating a three-dimensional detection result according to the three-dimensional central point, the orientation angle and the size.
2. The monocular three-dimensional object detecting method according to claim 1, wherein the monocular three-dimensional object detecting model is based on a CenterNet network model.
3. The monocular three-dimensional object detecting method according to claim 1, wherein the inputting the image to be detected into a monocular three-dimensional object detecting model to obtain a two-dimensional detecting result, an object depth, an observation angle and a size of the object to be detected output by the monocular three-dimensional object detecting model specifically comprises the following steps:
inputting the image to be detected into a monocular three-dimensional object detection model to obtain a two-dimensional detection result and a characteristic diagram of the object to be detected, which are output by the monocular three-dimensional object detection model; wherein, the characteristic diagram is obtained by outputting one layer of convolution layer of the monocular three-dimensional object detection model;
acquiring the feature of the two-dimensional central point on the feature map;
and inputting the image to be detected, the two-dimensional central point and the characteristics into the monocular three-dimensional object detection model to obtain the object depth, the observation angle and the size output by the monocular three-dimensional object detection model.
4. The monocular three-dimensional object detecting method according to claim 1, wherein the generating of the orientation angle of the object to be detected according to the three-dimensional center point and the observation angle specifically includes:
acquiring the type of the object to be detected;
when the type of the object to be detected is a normal object, generating the orientation angle according to the observation angle and the camera parameter; the camera is used for acquiring an image to be detected;
and when the type of the object to be detected is a truncated object, generating the orientation angle according to the observation angle and the three-dimensional central point.
5. The monocular three-dimensional object detecting method according to claim 3, wherein the monocular three-dimensional object detecting model is trained by the steps of:
acquiring a sample image; wherein the sample image contains a sample object, and the sample object is a normal object and/or a truncated object;
acquiring an actual two-dimensional detection result, an actual object depth, an actual observation angle and an actual size of the sample image;
and taking the sample image as input data used for training, taking the actual two-dimensional detection result, the actual object depth, the actual observation angle and the actual size as labels, and training in a deep learning mode to obtain the monocular three-dimensional object detection model for generating the two-dimensional detection result, the object depth, the observation angle and the size of the image to be predicted.
6. The monocular three-dimensional object detecting method according to claim 5, wherein the acquiring of the actual two-dimensional detection result, the actual object depth, the actual observation angle, and the actual size of the sample image specifically includes the steps of:
and acquiring the actual two-dimensional detection result, the actual object depth, the actual observation angle and the actual size of the sample image based on an annotation mode.
7. A monocular three-dimensional object detecting device, comprising:
the first acquisition module is used for acquiring an image to be detected; the image to be detected comprises an object to be detected, and the object to be detected is a normal object and/or a truncated object;
the second acquisition module is used for inputting the image to be detected into a monocular three-dimensional object detection model to obtain a two-dimensional detection result, an object depth, an observation angle and a size of the object to be detected, which are output by the monocular three-dimensional object detection model; the two-dimensional detection result comprises the two-dimensional central point, and the object depth is obtained based on the two-dimensional central point and the characteristics of the two-dimensional central point on a characteristic map;
the third acquisition module is used for generating a three-dimensional central point of the object to be detected in a three-dimensional space according to the two-dimensional central point and the object depth;
the fourth acquisition module is used for generating an orientation angle of the object to be detected according to the three-dimensional center point and the observation angle;
and the three-dimensional detection module is used for generating a three-dimensional detection result according to the three-dimensional central point, the orientation angle and the size.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein, when the processor executes the program, the steps of the monocular three-dimensional object detection method according to any one of claims 1 to 6 are implemented.
9. A non-transitory computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the monocular three-dimensional object detection method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the steps of the monocular three-dimensional object detection method according to any one of claims 1 to 6.
CN202111013230.8A 2021-08-31 2021-08-31 Monocular three-dimensional object detection method, device, equipment and product Pending CN113887289A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111013230.8A CN113887289A (en) 2021-08-31 2021-08-31 Monocular three-dimensional object detection method, device, equipment and product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111013230.8A CN113887289A (en) 2021-08-31 2021-08-31 Monocular three-dimensional object detection method, device, equipment and product

Publications (1)

Publication Number Publication Date
CN113887289A true CN113887289A (en) 2022-01-04

Family

ID=79011414

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111013230.8A Pending CN113887289A (en) 2021-08-31 2021-08-31 Monocular three-dimensional object detection method, device, equipment and product

Country Status (1)

Country Link
CN (1) CN113887289A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP4280158A1 (en) * 2022-05-19 2023-11-22 Hongfujin Precision Electrons (Yantai) Co., Ltd. Method for detecting an object based on monocular camera, electronic device, and non-transitory storage medium storing the method


Similar Documents

Publication Publication Date Title
US10885659B2 (en) Object pose estimating method and apparatus
CN110363817B (en) Target pose estimation method, electronic device, and medium
CN113065546B (en) Target pose estimation method and system based on attention mechanism and Hough voting
CN107329962B (en) Image retrieval database generation method, and method and device for enhancing reality
CN112991413A (en) Self-supervision depth estimation method and system
CN113168510A (en) Segmenting objects a priori by refining shape
CN110569844B (en) Ship recognition method and system based on deep learning
CN110070610B (en) Feature point matching method, and feature point matching method and device in three-dimensional reconstruction process
CN112084849A (en) Image recognition method and device
CN113724379B (en) Three-dimensional reconstruction method and device for fusing image and laser point cloud
CN114170290A (en) Image processing method and related equipment
CN116402976A (en) Training method and device for three-dimensional target detection model
CN114998610A (en) Target detection method, device, equipment and storage medium
CN113887289A (en) Monocular three-dimensional object detection method, device, equipment and product
CN113886510A (en) Terminal interaction method, device, equipment and storage medium
JP2023065296A (en) Planar surface detection apparatus and method
CN113658274B (en) Automatic individual spacing calculation method for primate population behavior analysis
CN114399800A (en) Human face posture estimation method and device
CN114723973A (en) Image feature matching method and device for large-scale change robustness
CN114648639A (en) Target vehicle detection method, system and device
CN114078096A (en) Image deblurring method, device and equipment
CN111461141A (en) Equipment pose calculation method device and equipment
CN115984583B (en) Data processing method, apparatus, computer device, storage medium, and program product
CN114049444B (en) 3D scene generation method and device
CN116091871B (en) Physical countermeasure sample generation method and device for target detection model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination