CN115661778A - Monocular 3D detection frame prediction method and device - Google Patents

Monocular 3D detection frame prediction method and device

Info

Publication number
CN115661778A
Authority
CN
China
Prior art keywords
dimensional
detection frame
monocular
frame prediction
sample image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211379399.XA
Other languages
Chinese (zh)
Inventor
陆强
Current Assignee
International Network Technology Shanghai Co Ltd
Original Assignee
International Network Technology Shanghai Co Ltd
Priority date
Filing date
Publication date
Application filed by International Network Technology Shanghai Co Ltd filed Critical International Network Technology Shanghai Co Ltd
Priority to CN202211379399.XA priority Critical patent/CN115661778A/en
Publication of CN115661778A publication Critical patent/CN115661778A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of automatic driving, and discloses a monocular 3D detection frame prediction method and device. The method comprises the following steps: acquiring a two-dimensional image shot by a monocular camera; and inputting the two-dimensional image into a monocular 3D detection frame prediction model to obtain a 3D detection frame output by the model. The monocular 3D detection frame prediction model is obtained by training on a two-dimensional sample image, an extrinsic correction matrix label corresponding to the two-dimensional sample image, and a 3D detection frame label corresponding to the two-dimensional sample image; the model predicts an extrinsic correction matrix from the two-dimensional image and predicts the 3D detection frame from the two-dimensional image and the predicted extrinsic correction matrix. The method and device thereby realize end-to-end prediction and correction of the 3D detection frame through data and model, so the correction of the 3D detection frame prediction result, and hence the prediction result itself, is more accurate.

Description

Monocular 3D detection frame prediction method and device
Technical Field
The invention relates to the technical field of automatic driving, in particular to a monocular 3D detection frame prediction method and device.
Background
With the development of automatic driving technology, a front-view camera is mounted at the front end of an autonomous vehicle to identify objects in front of the vehicle and output the corresponding 3D detection frames of the detected objects.
Bumps and uneven road surfaces during driving change the camera's external parameters (including the rotation matrix and translation vector), making the 3D detection frame prediction inaccurate. At present, most schemes follow the flow of 'online extrinsic calibration, extrinsic correction, perception': the external parameters of the front-view camera are corrected first, and the corrected external parameters are then used to correct the 3D detection frame output by a 3D detection frame prediction model, so as to obtain an accurate 3D detection frame. However, existing extrinsic correction is not performed by a network-learning-based method, and the extrinsic correction and the 3D detection frame prediction are carried out separately rather than as an end-to-end process. Therefore, if the extrinsic correction is inaccurate, the correction of the 3D detection frame prediction result is necessarily inaccurate as well.
Disclosure of Invention
The invention provides a monocular 3D detection frame prediction method and device, which are used for solving the problem that in the prior art, the 3D detection frame prediction result is not accurately corrected due to the separation of the external reference correction process and the 3D detection frame prediction process.
The invention provides a monocular 3D detection frame prediction method, which comprises the following steps:
acquiring a two-dimensional image shot by a monocular camera;
inputting the two-dimensional image into a monocular 3D detection frame prediction model to obtain a 3D detection frame output by the monocular 3D detection frame prediction model,
wherein the monocular 3D detection frame prediction model is obtained by training based on a two-dimensional sample image, an external reference correction matrix label corresponding to the two-dimensional sample image and a 3D detection frame label corresponding to the two-dimensional sample image,
the monocular 3D detection frame prediction model is used for predicting an external reference correction matrix according to the two-dimensional image and predicting a 3D detection frame according to the two-dimensional image and the predicted external reference correction matrix.
According to the monocular 3D detection frame prediction method provided by the present invention, the monocular 3D detection frame prediction model includes: a network backbone layer, a view conversion layer, a first submodel and a second submodel,
the network backbone layer is used for extracting two-dimensional features of the two-dimensional image;
the view conversion layer is used for converting the two-dimensional features into three-dimensional features;
the first sub-model is used for outputting the external reference correction matrix according to the three-dimensional characteristics;
and the second submodel is used for outputting the 3D detection frame according to the three-dimensional characteristics and the external reference correction matrix.
According to the monocular 3D detection frame prediction method provided by the invention, the two-dimensional image is input into a monocular 3D detection frame prediction model, and a 3D detection frame output by the monocular 3D detection frame prediction model is obtained, and the method comprises the following steps:
inputting the two-dimensional image into a network backbone layer to obtain the two-dimensional feature;
inputting the two-dimensional features into a view conversion layer to obtain the three-dimensional features;
inputting the three-dimensional characteristics into the first sub-model to obtain the external reference correction matrix;
and inputting the three-dimensional features and the external reference correction matrix into the second submodel to obtain the 3D detection frame.
According to the monocular 3D detection frame prediction method provided by the present invention, inputting the two-dimensional feature into a view conversion layer to obtain the three-dimensional feature, includes:
convolving the two-dimensional features, and determining depth information of each pixel in the two-dimensional features;
and converting the two-dimensional features with the depth information into the three-dimensional features according to the two-dimensional image coordinate system and the conversion matrix of the 3D coordinate system of the vehicle body.
According to the monocular 3D detection frame prediction method provided by the invention, the first submodel includes: a convolutional layer, a fully-connected layer and a dimension reconstruction layer, and inputting the three-dimensional features into the first submodel to obtain the extrinsic correction matrix includes:
inputting the three-dimensional features into the convolutional layer to obtain three-dimensional features extracted from the convolutional layer;
the full-connection layer outputs a one-dimensional characteristic vector according to the three-dimensional characteristics extracted by the convolution layer;
and the dimension reconstruction layer reconstructs the one-dimensional characteristic vector into the external reference correction matrix.
According to the monocular 3D detection frame prediction method provided by the present invention, the second submodel includes: a coding layer and a network output layer, and inputting the three-dimensional features and the extrinsic correction matrix into the second submodel to obtain the 3D detection frame includes:
inputting the three-dimensional features into the coding layer to obtain coded three-dimensional features;
performing matrix transformation on the coded three-dimensional characteristics by adopting the external reference correction matrix;
and the network output layer outputs the 3D detection frame according to the three-dimensional characteristics after matrix transformation.
According to the monocular 3D detection frame prediction method provided by the present invention, before acquiring a two-dimensional image captured by a monocular camera, the method further includes: training the monocular 3D detection frame prediction model specifically comprises:
inputting a two-dimensional sample image, an extrinsic correction matrix label corresponding to the two-dimensional sample image and a 3D detection frame label corresponding to the two-dimensional sample image into the monocular 3D detection frame prediction model, wherein a training result output by a first sub-model is input into a second sub-model;
and substituting the training result output by the first sub-model and the label of the extrinsic correction matrix corresponding to the two-dimensional sample image into a preset first loss function loss1, substituting the training result output by the second sub-model and the label of the 3D detection frame corresponding to the two-dimensional sample image into a preset second loss function loss2, and finishing the training when a & loss1+ b & loss2 is converged, wherein a and b are the weighting weights of loss1 and loss2 respectively.
According to the monocular 3D detection frame prediction method provided by the invention, before acquiring the two-dimensional image shot by the monocular camera, the method further comprises the following steps: training the monocular 3D detection frame prediction model specifically comprises:
inputting a two-dimensional sample image, an extrinsic correction matrix label corresponding to the two-dimensional sample image and a 3D detection frame label corresponding to the two-dimensional sample image into the monocular 3D detection frame prediction model, wherein a training result output by a first sub-model is input into a second sub-model;
the training result output by the first submodel and the label of the external reference correction matrix corresponding to the two-dimensional sample image are brought into a preset first loss function, and the internal parameters of the network backbone layer, the view conversion layer and the first submodel are determined when the first loss function is converged;
and bringing the training result output by the second sub-model and the 3D detection frame label corresponding to the two-dimensional sample image into a preset second loss function, and finishing the training when the second loss function is converged.
According to the monocular 3D detection frame prediction method provided by the invention, before training the monocular 3D detection frame prediction model, the method comprises the following steps:
randomly generating an external reference disturbance matrix under a 3D coordinate system of the vehicle body, and taking an inverse matrix of the external reference disturbance matrix as an external reference correction matrix label;
determining an enhancement transformation matrix for the two-dimensional sample image according to the external reference disturbance matrix, an internal reference matrix of the monocular camera and a transformation matrix of a two-dimensional image coordinate system and a vehicle body 3D coordinate system;
and generating an enhanced two-dimensional sample image according to the two-dimensional sample image and the enhanced transformation matrix, and taking the two-dimensional sample image and the enhanced two-dimensional sample image as a training set of a monocular 3D detection frame prediction model.
The present invention also provides a monocular 3D detection frame prediction device, comprising:
the image acquisition module is used for acquiring a two-dimensional image shot by the monocular camera;
a model operation module for inputting the two-dimensional image into a monocular 3D detection frame prediction model to obtain a 3D detection frame output by the monocular 3D detection frame prediction model,
wherein the monocular 3D detection frame prediction model is obtained by training based on a two-dimensional sample image, an external reference correction matrix label corresponding to the two-dimensional sample image and a 3D detection frame label corresponding to the two-dimensional sample image,
the monocular 3D detection frame prediction model is used for predicting an external reference correction matrix according to the two-dimensional image and predicting a 3D detection frame according to the two-dimensional image and the predicted external reference correction matrix.
The present invention also provides an electronic device comprising: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a monocular 3D detection frame prediction method as described in any one of the above when executing the program.
The present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a monocular 3D detection frame prediction method as described in any one of the above.
According to the monocular 3D detection frame prediction method and device provided by the invention, the two-dimensional image is input into the monocular 3D detection frame prediction model, which predicts an extrinsic correction matrix and then predicts the 3D detection frame from the two-dimensional image and the predicted extrinsic correction matrix. That is, extrinsic correction and 3D detection frame prediction are performed within the same model, so the 3D detection frame is predicted and corrected end to end through data and model; the correction of the 3D detection frame prediction result is therefore more accurate, and so is the prediction result itself.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a monocular 3D detection frame prediction method provided by the present invention;
FIG. 2 is a schematic diagram of a monocular 3D detection frame prediction model structure in the monocular 3D detection frame prediction method provided by the present invention;
fig. 3 is a schematic flowchart of step S120 in the monocular 3D detection frame prediction method according to the present invention;
FIG. 4 is a schematic structural diagram of a monocular 3D detection frame prediction device according to the present invention;
fig. 5 is a schematic structural diagram of an electronic device provided in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, the monocular 3D detection frame prediction method according to the embodiment of the present invention includes:
step S110, acquiring a two-dimensional image captured by a monocular camera, which is usually installed at the front end of the vehicle and is used for capturing images of the vehicle, obstacles, guideboards, and the like in front of the vehicle, wherein the captured image is a two-dimensional image.
And S120, inputting the two-dimensional image into a monocular 3D detection frame prediction model to obtain a 3D detection frame output by the monocular 3D detection frame prediction model, wherein the 3D detection frame is a 3D contour of an object (mainly comprising a front vehicle, an obstacle, a guideboard and the like) in front of the vehicle.
In this embodiment, the monocular 3D detection frame prediction model is obtained by training based on a two-dimensional sample image, an extrinsic correction matrix label corresponding to the two-dimensional sample image, and a 3D detection frame label corresponding to the two-dimensional sample image.
The monocular 3D detection frame prediction model is used for predicting an external reference correction matrix according to the two-dimensional image and predicting a 3D detection frame according to the two-dimensional image and the predicted external reference correction matrix.
According to the monocular 3D detection frame prediction method, the two-dimensional image is input into the monocular 3D detection frame prediction model, which predicts an extrinsic correction matrix and then predicts the 3D detection frame from the two-dimensional image and the predicted extrinsic correction matrix. That is, extrinsic correction and 3D detection frame prediction are performed within the same model, so the 3D detection frame is predicted and corrected end to end through data and model; the correction of the 3D detection frame prediction result is therefore more accurate, and so is the prediction result itself.
As shown in fig. 2, the monocular 3D detection frame prediction model of the present embodiment includes: the system comprises a network backbone layer, a view conversion layer, a first submodel and a second submodel.
The network backbone layer is used for extracting two-dimensional features of the two-dimensional image.
The view conversion layer is used for converting the two-dimensional features into three-dimensional features.
The first sub-model is used for outputting the external reference correction matrix according to the three-dimensional characteristics.
And the second submodel is used for outputting the 3D detection frame according to the three-dimensional characteristics and the external reference correction matrix.
In the specific model structure shown in fig. 2, the first submodel and the second submodel share the same three-dimensional features. The first submodel outputs an extrinsic correction matrix and the second submodel outputs a 3D detection frame; the extrinsic correction matrix is also fed into the second submodel to correct the 3D detection frame it outputs. Because the two submodels sit in the same model and are trained simultaneously, the 3D detection frame is predicted and corrected end to end through data and model, so the correction of the 3D detection frame prediction result is more accurate, and the resulting 3D detection frame is also more accurate.
Specifically, the first submodel is the vertical branch after the view conversion layer in fig. 2, and includes: a convolutional layer, a fully-connected layer and a dimension reconstruction layer. The second submodel is the horizontal branch after the view conversion layer in fig. 2, and includes: an encoding layer and a network output layer.
Based on the monocular 3D detection frame prediction model in fig. 2, as shown in fig. 3, step S120 includes:
and S310, inputting the two-dimensional image into a network backbone layer to obtain the two-dimensional feature, wherein the two-dimensional feature is a two-dimensional feature under a two-dimensional image coordinate system.
And S320, inputting the two-dimensional features into a view conversion layer to obtain the three-dimensional features, wherein the three-dimensional features are three-dimensional features under a 3D coordinate system of the vehicle body.
And S330, inputting the three-dimensional characteristics into the first sub-model to obtain the external reference correction matrix.
And step S340, inputting the three-dimensional characteristics and the external reference correction matrix into the second sub-model to obtain the 3D detection frame.
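The data flow of steps S310-S340 can be sketched as follows. This is a minimal Python illustration only: the function names, feature shapes and outputs below are hypothetical stand-ins, not the patent's actual learned layers; the point is merely how the first submodel's output also feeds the second submodel.

```python
import numpy as np

# Hypothetical stand-ins for the four stages of steps S310-S340. The names,
# shapes and outputs are illustrative assumptions, not the real network.
def backbone_2d(image):                  # S310: extract 2D features
    return np.random.rand(64, image.shape[0] // 4, image.shape[1] // 4)

def view_transform(feat_2d):             # S320: lift to body-frame 3D features
    return np.random.rand(feat_2d.shape[0], 16, feat_2d.shape[1], feat_2d.shape[2])

def extrinsic_head(feat_3d):             # S330: first submodel -> 4x4 correction
    return np.eye(4)

def box_head(feat_3d, correction):       # S340: second submodel -> 3D boxes
    return [{"center": np.zeros(3), "size": np.ones(3), "yaw": 0.0}]

image = np.zeros((256, 512))             # two-dimensional input image
feat_2d = backbone_2d(image)
feat_3d = view_transform(feat_2d)
correction = extrinsic_head(feat_3d)     # extrinsic correction matrix
boxes = box_head(feat_3d, correction)    # correction also feeds the box head
```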
In this embodiment, step S320 includes:
and convolving the two-dimensional features, and determining the depth information of each pixel in the two-dimensional features. Specifically, the view conversion layer is composed of a plurality of convolution blocks, the output of the view conversion layer is a depth (depth) range prediction of each pixel in the two-dimensional features (for example, the depth range is 1-60 meters, and the depth step length is 1 meter), and after the depth dimension information exists, the image features become three-dimensional features in a 3D coordinate system. Because the training of the depth information is also included in the model training process, the trained model can automatically identify the depth information of each pixel from the input two-dimensional image.
Although the two-dimensional features with depth information are nominally three-dimensional, they are merely depth information superimposed on the two-dimensional image coordinate system, not three-dimensional features in the vehicle-body 3D coordinate system, so a conversion is required. Specifically, the two-dimensional features with depth information are converted into the three-dimensional features according to the transformation matrix between the two-dimensional image coordinate system and the vehicle-body 3D coordinate system.
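The view-conversion step above can be sketched in plain numpy: per-pixel scores over depth bins of 1-60 m (step 1 m) give an expected depth, and each pixel is back-projected and mapped into the body frame. The intrinsic matrix and the camera-to-body transform here are placeholder assumptions, and random scores stand in for the convolution blocks' output.

```python
import numpy as np

# Per-pixel scores over 60 depth hypotheses (1..60 m) -> expected depth.
depth_bins = np.arange(1.0, 61.0)                # metres
H, W = 4, 6                                      # tiny feature map for illustration
scores = np.random.rand(len(depth_bins), H, W)   # stand-in for conv-block output
probs = scores / scores.sum(axis=0, keepdims=True)
depth = (probs * depth_bins[:, None, None]).sum(axis=0)  # expected depth per pixel

K = np.array([[500.0, 0.0, W / 2],               # assumed camera intrinsics
              [0.0, 500.0, H / 2],
              [0.0, 0.0, 1.0]])
cam2body = np.eye(4)                             # assumed camera->body transform

# Back-project each pixel (u, v, depth) and map into the body 3D frame.
us, vs = np.meshgrid(np.arange(W), np.arange(H))
pix = np.stack([us * depth, vs * depth, depth]).reshape(3, -1)
cam_pts = np.linalg.inv(K) @ pix
body_pts = (cam2body @ np.vstack([cam_pts, np.ones((1, cam_pts.shape[1]))]))[:3]
```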
In this embodiment, step S330 includes:
inputting the three-dimensional features into the convolutional layer to obtain the three-dimensional features extracted by the convolutional layer, and further extracting the features of the three-dimensional features by the convolutional layer to enhance the semantic property of the extracted features.
And the full connection layer outputs a one-dimensional characteristic vector according to the three-dimensional characteristic extracted by the convolution layer.
And the dimension reconstruction layer reconstructs the one-dimensional feature vector into the extrinsic correction matrix, i.e., rearranges the data in the one-dimensional feature vector into the rows and columns matching the dimension of the extrinsic correction matrix. The dimension of the extrinsic correction matrix is B×4×4, where B is the batch_size, i.e., the number of samples fed to the model in a single training pass; within each 4×4 matrix, the leading 3×3 block represents a three-dimensional rotation matrix and the remaining column represents a translation vector. Therefore, the dimension reconstruction layer is required to reconstruct the one-dimensional feature vector output by the fully-connected layer into the dimensions of the extrinsic correction matrix before outputting it.
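The reshaping performed by the dimension reconstruction layer can be sketched as follows; the 16 values per sample are random stand-ins for the fully-connected layer's real output, and the reading of the 4×4 blocks (rotation plus translation) follows the description above.

```python
import numpy as np

# Rearrange the fully-connected layer's one-dimensional feature vector into
# a batch of 4x4 extrinsic correction matrices (B x 4 x 4).
batch_size = 2                                    # B = batch_size
fc_output = np.random.rand(batch_size * 16)       # one-dimensional feature vector

correction = fc_output.reshape(batch_size, 4, 4)  # B x 4 x 4
rotation = correction[:, :3, :3]                  # rotation block
translation = correction[:, :3, 3]                # translation vector
```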
In this embodiment, step S340 includes:
and inputting the three-dimensional characteristics into the coding layer to obtain coded three-dimensional characteristics. The coding layer is mainly used for down-sampling and up-sampling three-dimensional features, so that feature semantic information is richer, and a prediction result is more accurate.
And performing matrix transformation on the coded three-dimensional features using the extrinsic correction matrix. Specifically, matrix transformation is performed on each 3-dimensional coordinate position P of the three-dimensional features, the transformed position being P′ = N·P, where N is the transformation matrix (the extrinsic correction matrix); the three-dimensional features are then rearranged according to the new transformed coordinate positions, yielding the transformed three-dimensional features.
And the network output layer outputs the 3D detection frame according to the three-dimensional characteristics after the matrix transformation.
The process of performing matrix transformation on the coded three-dimensional features through the external reference correction matrix is the process of correcting the three-dimensional features, and the 3D detection frame predicted through the corrected three-dimensional features is more accurate.
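The transformation P′ = N·P can be illustrated with a small numeric example. The correction matrix N here is an assumed small translation-only correction, not a learned value, applied to two homogeneous feature coordinate positions.

```python
import numpy as np

# Apply an assumed extrinsic correction N to feature coordinate positions.
N = np.eye(4)
N[:3, 3] = [0.1, 0.0, -0.05]             # assumed small translation correction

P = np.array([[2.0, 1.0, 10.0, 1.0],     # two feature positions (homogeneous)
              [-1.0, 0.5, 20.0, 1.0]]).T
P_corrected = N @ P                       # features are rearranged to P_corrected
```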
In this embodiment, before step S110, the method further includes: training the monocular 3D detection frame prediction model specifically comprises:
and inputting the two-dimensional sample image, the extrinsic correction matrix label corresponding to the two-dimensional sample image and the 3D detection frame label corresponding to the two-dimensional sample image into the monocular 3D detection frame prediction model, wherein a training result output by the first sub-model is input into the second sub-model.
And substituting the training result output by the first submodel and the extrinsic correction matrix label corresponding to the two-dimensional sample image into a preset first loss function loss1, substituting the training result output by the second submodel and the 3D detection frame label corresponding to the two-dimensional sample image into a preset second loss function loss2, and finishing the training when a·loss1 + b·loss2 converges, where a and b are the weighting weights of loss1 and loss2 respectively.
Specifically, for the first submodel, the first loss function loss1 during training is an extrinsic correction matrix loss function, and a regression loss function smooth L1 loss is adopted for calculation.
The output of the second submodel is also the output of the whole monocular 3D detection frame prediction model, that is, the output 3D detection frame, and when training, the second loss function includes the following three sub-loss functions:
the center point loss function, loss21, is calculated using the focal point loss, focal length.
The offset loss function loss22, offset refers to the deviation of the target center point caused by the coordinate rounding when the network downsamples, and is calculated by using smooth L1 loss.
The 3D size loss function loss23 is calculated using the smooth L1 loss.
Finally, the training is completed when a·loss1 + b1·loss21 + b2·loss22 + b3·loss23 converges, where b1, b2 and b3 are the weighting weights of loss21, loss22 and loss23 respectively; in practical applications, all weights can be set to 1.
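The combined loss above can be sketched with plain-numpy stand-ins for smooth L1 and the focal loss, using hypothetical predictions and labels and setting all weights to 1 as suggested. The data here are purely illustrative.

```python
import numpy as np

def smooth_l1(pred, target):
    # regression loss used for loss1, loss22 and loss23
    d = np.abs(pred - target)
    return np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).mean()

def focal_loss(p, y, gamma=2.0):
    # binary focal loss on center-point heatmap probabilities (loss21)
    p = np.clip(p, 1e-6, 1 - 1e-6)
    return -np.mean(y * (1 - p) ** gamma * np.log(p)
                    + (1 - y) * p ** gamma * np.log(1 - p))

# hypothetical predictions and labels, for illustration only
loss1 = smooth_l1(np.zeros((4, 4)), np.eye(4))                    # extrinsic matrix
loss21 = focal_loss(np.full((8, 8), 0.1), np.zeros((8, 8)))       # center point
loss22 = smooth_l1(np.array([0.2, 0.3]), np.array([0.25, 0.28]))  # offset
loss23 = smooth_l1(np.ones(3) * 1.1, np.ones(3))                  # 3D size

a = b1 = b2 = b3 = 1.0                    # all weights set to 1
total = a * loss1 + b1 * loss21 + b2 * loss22 + b3 * loss23
```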
During training, the following training mode can also be adopted: and inputting a two-dimensional sample image, an extrinsic correction matrix label corresponding to the two-dimensional sample image and a 3D detection frame label corresponding to the two-dimensional sample image into the monocular 3D detection frame prediction model, wherein a training result output by the first sub-model is input into the second sub-model.
And substituting the training result output by the first submodel and the external reference correction matrix label corresponding to the two-dimensional sample image into a preset first loss function, and determining the internal parameters of the network backbone layer, the view conversion layer and the first submodel when the first loss function is converged.
And bringing the training result output by the second sub-model and the 3D detection frame label corresponding to the two-dimensional sample image into a preset second loss function, and finishing training when the second loss function is converged.
Of the two training modes, the former is fully end-to-end and takes less training time; the latter first trains the layers related to the extrinsic correction matrix well, so that the model obtains a more accurate extrinsic correction matrix in application, and then trains the second submodel.
Further, before training the monocular 3D detection frame prediction model, comprising:
and randomly generating an external reference disturbance matrix under a 3D coordinate system of the vehicle body, wherein the dimension of the external reference disturbance matrix is Bx 4 x 4, B is batch _ size, and the inverse matrix of the external reference disturbance matrix is used as the external reference correction matrix label. The external reference disturbance matrix is applied to two-dimensional image input, but the label position of the 3D target frame is still kept unchanged, so that the feature after disturbance is corrected by predicting the external reference correction matrix during training, and the 3D target frame before disturbance is predicted.
And determining an enhanced transformation matrix for the two-dimensional sample image according to the external parameter disturbance matrix, the internal parameter matrix of the monocular camera and the transformation matrix from the 3D coordinate system of the vehicle body to the coordinate system of the camera. Specifically, the enhancement transformation matrix M _2d is calculated as follows:
M_2d=intrinsics*car2camera*M,
wherein intrinsics is the camera intrinsic matrix, and car2camera is the transformation matrix from the vehicle-body 3D coordinate system to the camera coordinate system.
And generating an enhanced two-dimensional sample image according to the two-dimensional sample image and the enhanced transformation matrix, and taking the two-dimensional sample image and the enhanced two-dimensional sample image as a training set of a monocular 3D detection frame prediction model. Specifically, each two-dimensional sample image is matrix multiplied by an enhancement transform matrix, thereby generating a plurality of enhanced two-dimensional sample images. The two-dimensional sample image and the enhanced two-dimensional sample image are jointly used as a training set, so that the input two-dimensional sample image has more changes and diversity, and the trained monocular 3D detection frame prediction model can more accurately predict the 3D detection frame in actual application.
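The construction of the enhancement transformation matrix can be sketched as follows, under the shape assumption that intrinsics is a 3×3 matrix while car2camera and the disturbance M are 4×4 homogeneous transforms, so that M_2d is a 3×4 projection; all numeric values are placeholders.

```python
import numpy as np

# M_2d = intrinsics * car2camera * M (shapes assumed as stated above).
intrinsics = np.array([[500.0, 0.0, 320.0],
                       [0.0, 500.0, 240.0],
                       [0.0, 0.0, 1.0]])
car2camera = np.eye(4)                     # assumed body->camera transform
M = np.eye(4)
M[1, 3] = 0.02                             # example extrinsic disturbance

M_2d = intrinsics @ car2camera[:3, :] @ M  # enhancement transformation matrix

# Re-projecting a body-frame point with the disturbed transform gives the
# augmented pixel position; the 3D detection frame label stays unchanged.
point_body = np.array([1.0, 0.5, 10.0, 1.0])
uvw = M_2d @ point_body
u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
```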
The following describes a monocular 3D detection frame predicting apparatus according to the present invention, and the monocular 3D detection frame predicting apparatus described below and the monocular 3D detection frame predicting method described above may be referred to in correspondence with each other.
As shown in fig. 4, the monocular 3D detection frame predicting device provided by the present invention includes:
and an image obtaining module 410, configured to obtain a two-dimensional image captured by the monocular camera.
And the model operation module 420 is configured to input the two-dimensional image into a monocular 3D detection frame prediction model to obtain a 3D detection frame output by the monocular 3D detection frame prediction model.
The monocular 3D detection frame prediction model is obtained by training based on a two-dimensional sample image, an external reference correction matrix label corresponding to the two-dimensional sample image and a 3D detection frame label corresponding to the two-dimensional sample image.
The monocular 3D detection frame prediction model is used for predicting an external reference correction matrix according to the two-dimensional image and predicting a 3D detection frame according to the two-dimensional image and the predicted external reference correction matrix.
With the monocular 3D detection frame prediction device of the present invention, the two-dimensional image is input into the monocular 3D detection frame prediction model, which predicts an extrinsic correction matrix and then predicts the 3D detection frame from the two-dimensional image and the predicted correction matrix. Because the extrinsic correction and the 3D detection frame prediction are performed within the same model, the 3D detection frame is predicted and corrected end to end through data and model, making the correction of the 3D detection frame prediction result, and hence the prediction result itself, more accurate.
Optionally, the monocular 3D detection frame prediction model includes: the system comprises a network backbone layer, a view conversion layer, a first submodel and a second submodel.
The network backbone layer is used for extracting two-dimensional features of the two-dimensional image.
The view conversion layer is used for converting the two-dimensional features into three-dimensional features.
The first sub-model is used for outputting the external reference correction matrix according to the three-dimensional characteristics.
And the second submodel is used for outputting the 3D detection frame according to the three-dimensional characteristics and the external reference correction matrix.
Optionally, the model operating module 420 is specifically configured to:
and inputting the two-dimensional image into a network backbone layer to obtain the two-dimensional feature.
And inputting the two-dimensional features into a view conversion layer to obtain the three-dimensional features.
And inputting the three-dimensional characteristics into the first sub-model to obtain the external reference correction matrix.
And inputting the three-dimensional features and the external reference correction matrix into the second submodel to obtain the 3D detection frame.
Optionally, the model operation module 420 is further specifically configured to:
Convolve the two-dimensional features to determine the depth information of each pixel in the two-dimensional features.
Convert the two-dimensional features carrying depth information into the three-dimensional features according to the transformation matrix between the two-dimensional image coordinate system and the vehicle-body 3D coordinate system.
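The lifting step can be sketched as a per-pixel back-projection: each pixel ray is scaled by its predicted depth and transformed into the vehicle frame. The patent does not specify the exact lifting mechanism, so the interface below (feature layout, a camera-to-vehicle matrix named `cam2car`) is an assumption:

```python
import numpy as np

def lift_to_3d(feat2d, depth, intrinsics, cam2car):
    """Lift a 2-D feature map (C x H x W) into vehicle-body 3-D space using
    per-pixel depth (H x W): back-project each pixel ray by its depth and
    transform from camera to vehicle coordinates. Returns flattened feature
    channels plus the (3 x H x W) point coordinates."""
    c, h, w = feat2d.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])  # homogeneous pixels
    rays = np.linalg.inv(intrinsics) @ pix                    # camera-frame rays
    cam_pts = rays * depth.ravel()                            # scale by depth
    cam_h = np.vstack([cam_pts, np.ones(h * w)])              # homogeneous 4 x N
    car_pts = (cam2car @ cam_h)[:3]                           # vehicle frame
    return feat2d.reshape(c, -1), car_pts.reshape(3, h, w)
```

With identity intrinsics and extrinsics, a unit depth map places every point on the z = 1 plane, a quick sanity check of the geometry.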
Optionally, the first submodel comprises: a convolutional layer, a fully connected layer, and a dimension reconstruction layer; the model operation module 420 is specifically configured to:
Input the three-dimensional features into the convolutional layer to obtain the three-dimensional features extracted by the convolutional layer.
The fully connected layer outputs a one-dimensional feature vector according to the three-dimensional features extracted by the convolutional layer.
The dimension reconstruction layer reconstructs the one-dimensional feature vector into the extrinsic correction matrix.
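The dimension reconstruction step is a plain reshape of the fully connected output into the 4 × 4 correction matrix; a minimal sketch, assuming a 16-dimensional head output (the patent does not state the vector length explicitly):

```python
import numpy as np

def reconstruct_correction(fc_out):
    """Reshape the 16-dim fully connected output into a 4x4 extrinsic
    correction matrix (the 'dimension reconstruction' step); a batched
    input of shape (B, 16) yields (B, 4, 4)."""
    fc_out = np.asarray(fc_out)
    return fc_out.reshape(*fc_out.shape[:-1], 4, 4)
```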
Optionally, the second submodel comprises: the coding layer and the network output layer, the model operating module 420 is further specifically configured to:
Input the three-dimensional features into the encoding layer to obtain encoded three-dimensional features.
Perform matrix transformation on the encoded three-dimensional features using the extrinsic correction matrix.
The network output layer outputs the 3D detection frame according to the matrix-transformed three-dimensional features.
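The matrix transformation amounts to applying the predicted 4 × 4 correction to the 3-D coordinates attached to the encoded features. A sketch, assuming the features carry point coordinates in a 3 × N array (the patent does not fix this representation):

```python
import numpy as np

def correct_features(points, correction):
    """Apply the predicted 4x4 extrinsic correction matrix to the 3-D
    coordinates (3 x N) of the encoded features; the corrected features
    then feed the network output layer that regresses the 3D boxes."""
    n = points.shape[1]
    homo = np.vstack([points, np.ones(n)])  # homogeneous 4 x N
    return (correction @ homo)[:3]
```

An identity correction leaves the features untouched, and a pure translation shifts every point uniformly, which matches the intended "undo the extrinsic perturbation" behavior.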
Optionally, the monocular 3D detection frame predicting device of the present invention further includes: a model training module to:
and inputting the two-dimensional sample image, the extrinsic correction matrix label corresponding to the two-dimensional sample image and the 3D detection frame label corresponding to the two-dimensional sample image into the monocular 3D detection frame prediction model, wherein a training result output by the first sub-model is input into the second sub-model.
Substitute the training result output by the first submodel and the extrinsic correction matrix label corresponding to the two-dimensional sample image into a preset first loss function loss1, substitute the training result output by the second submodel and the 3D detection frame label corresponding to the two-dimensional sample image into a preset second loss function loss2, and finish training when a·loss1 + b·loss2 converges, where a and b are the weighting coefficients of loss1 and loss2, respectively.
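The joint objective can be sketched as a weighted sum of the two supervision signals. The patent does not fix the loss form, so mean-squared error is used here purely as an assumption:

```python
import numpy as np

def joint_loss(pred_corr, label_corr, pred_box, label_box, a=1.0, b=1.0):
    """Weighted objective a*loss1 + b*loss2: loss1 supervises the extrinsic
    correction matrix, loss2 supervises the 3D detection frame. MSE is an
    assumption; the patent leaves the loss functions unspecified."""
    loss1 = float(np.mean((np.asarray(pred_corr) - np.asarray(label_corr)) ** 2))
    loss2 = float(np.mean((np.asarray(pred_box) - np.asarray(label_box)) ** 2))
    return a * loss1 + b * loss2
```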
Optionally, the monocular 3D detection frame predicting device of the present invention further includes: a model training module to:
and inputting the two-dimensional sample image, the extrinsic correction matrix label corresponding to the two-dimensional sample image and the 3D detection frame label corresponding to the two-dimensional sample image into the monocular 3D detection frame prediction model, wherein a training result output by the first sub-model is input into the second sub-model.
And substituting the training result output by the first submodel and the external reference correction matrix label corresponding to the two-dimensional sample image into a preset first loss function, and determining the internal parameters of the network backbone layer, the view conversion layer and the first submodel when the first loss function is converged.
And bringing the training result output by the second sub-model and the 3D detection frame label corresponding to the two-dimensional sample image into a preset second loss function, and finishing training when the second loss function is converged.
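This alternative schedule trains the two branches in sequence rather than jointly: first fit the correction branch until loss1 converges and freeze those parameters, then fit the box branch against loss2. A toy illustration with quadratic stand-in losses (the real losses and optimizer are not specified in the patent):

```python
def staged_training(steps=200, lr=0.1):
    """Two-stage schedule sketch: stage 1 minimises loss1 = (p1 - 1)^2 over
    the correction-branch parameter p1; stage 2 holds p1 fixed and minimises
    loss2 = (p2 - 2)^2 over the box-branch parameter p2."""
    p1, p2 = 5.0, -3.0                 # toy stand-ins for branch parameters
    for _ in range(steps):             # stage 1: fit the correction branch
        p1 -= lr * 2.0 * (p1 - 1.0)
    for _ in range(steps):             # stage 2: p1 frozen, fit the box branch
        p2 -= lr * 2.0 * (p2 - 2.0)
    return p1, p2
```

The design choice here is decoupling: the view-conversion and correction parameters are settled before the box regressor is trained, at the cost of a longer schedule than the jointly weighted objective.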
Optionally, the monocular 3D detection frame predicting device of the present invention further includes:
and the external reference disturbance matrix generation module is used for randomly generating an external reference disturbance matrix under a 3D coordinate system of the vehicle body, and taking an inverse matrix of the external reference disturbance matrix as the external reference correction matrix label.
And the enhancement transformation matrix determining module is used for determining an enhancement transformation matrix for the two-dimensional sample image according to the external parameter disturbance matrix, the internal parameter matrix of the monocular camera and the transformation matrix of the two-dimensional image coordinate system and the vehicle body 3D coordinate system.
And the enhanced sample image generation module is used for generating an enhanced two-dimensional sample image according to the two-dimensional sample image and the enhanced transformation matrix, and taking the two-dimensional sample image and the enhanced two-dimensional sample image as a training set of the monocular 3D detection frame prediction model.
Fig. 5 illustrates a physical structure diagram of an electronic device, which may include, as shown in fig. 5: a processor (processor) 510, a communication Interface (Communications Interface) 520, a memory (memory) 530, and a communication bus 540, wherein the processor 510, the communication Interface 520, and the memory 530 communicate with each other via the communication bus 540. Processor 510 may call logic instructions in memory 530 to perform a monocular 3D detection box prediction method comprising:
and acquiring a two-dimensional image shot by the monocular camera.
And inputting the two-dimensional image into a monocular 3D detection frame prediction model to obtain a 3D detection frame output by the monocular 3D detection frame prediction model.
The monocular 3D detection frame prediction model is obtained by training based on a two-dimensional sample image, an external reference correction matrix label corresponding to the two-dimensional sample image and a 3D detection frame label corresponding to the two-dimensional sample image.
And the monocular 3D detection frame prediction model is used for predicting an external reference correction matrix according to the two-dimensional image and predicting a 3D detection frame according to the two-dimensional image and the predicted external reference correction matrix.
Furthermore, the logic instructions in the memory 530 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, the present invention further provides a computer program product, the computer program product including a computer program, the computer program being stored on a non-transitory computer-readable storage medium, wherein when the computer program is executed by a processor, the computer is capable of executing the monocular 3D detection frame prediction method provided by the above methods, the method including:
and acquiring a two-dimensional image shot by the monocular camera.
And inputting the two-dimensional image into a monocular 3D detection frame prediction model to obtain a 3D detection frame output by the monocular 3D detection frame prediction model.
The monocular 3D detection frame prediction model is obtained by training based on a two-dimensional sample image, an external reference correction matrix label corresponding to the two-dimensional sample image and a 3D detection frame label corresponding to the two-dimensional sample image.
And the monocular 3D detection frame prediction model is used for predicting an external reference correction matrix according to the two-dimensional image and predicting a 3D detection frame according to the two-dimensional image and the predicted external reference correction matrix.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium, on which a computer program is stored, the computer program being implemented by a processor to perform the monocular 3D detection frame prediction method provided by the above methods, the method including:
and acquiring a two-dimensional image shot by the monocular camera.
And inputting the two-dimensional image into a monocular 3D detection frame prediction model to obtain a 3D detection frame output by the monocular 3D detection frame prediction model.
The monocular 3D detection frame prediction model is obtained by training based on a two-dimensional sample image, an external reference correction matrix label corresponding to the two-dimensional sample image and a 3D detection frame label corresponding to the two-dimensional sample image.
And the monocular 3D detection frame prediction model is used for predicting an external reference correction matrix according to the two-dimensional image and predicting a 3D detection frame according to the two-dimensional image and the predicted external reference correction matrix.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment may be implemented by software plus a necessary general hardware platform, and may also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (12)

1. A monocular 3D detection frame prediction method is characterized by comprising the following steps:
acquiring a two-dimensional image shot by a monocular camera;
inputting the two-dimensional image into a monocular 3D detection frame prediction model to obtain a 3D detection frame output by the monocular 3D detection frame prediction model,
wherein the monocular 3D detection frame prediction model is obtained by training based on a two-dimensional sample image, an external reference correction matrix label corresponding to the two-dimensional sample image and a 3D detection frame label corresponding to the two-dimensional sample image,
and the monocular 3D detection frame prediction model is used for predicting an external reference correction matrix according to the two-dimensional image and predicting a 3D detection frame according to the two-dimensional image and the predicted external reference correction matrix.
2. The monocular 3D detection frame prediction method of claim 1, wherein the monocular 3D detection frame prediction model comprises: a network backbone layer, a view conversion layer, a first submodel and a second submodel,
the network backbone layer is used for extracting two-dimensional features of the two-dimensional image;
the view conversion layer is used for converting the two-dimensional features into three-dimensional features;
the first sub-model is used for outputting the external reference correction matrix according to the three-dimensional characteristics;
and the second submodel is used for outputting the 3D detection frame according to the three-dimensional characteristics and the external reference correction matrix.
3. The monocular 3D detection frame prediction method of claim 2, wherein inputting the two-dimensional image into a monocular 3D detection frame prediction model to obtain a 3D detection frame output by the monocular 3D detection frame prediction model, comprises:
inputting the two-dimensional image into a network backbone layer to obtain the two-dimensional characteristic;
inputting the two-dimensional features into a view conversion layer to obtain the three-dimensional features;
inputting the three-dimensional features into the first sub-model to obtain the external reference correction matrix;
and inputting the three-dimensional features and the external reference correction matrix into the second submodel to obtain the 3D detection frame.
4. The monocular 3D detection frame prediction method of claim 3, wherein inputting the two-dimensional feature into a view translation layer to obtain the three-dimensional feature comprises:
convolving the two-dimensional features, and determining depth information of each pixel in the two-dimensional features;
and converting the two-dimensional features with the depth information into the three-dimensional features according to the conversion matrix of the two-dimensional image coordinate system and the 3D coordinate system of the vehicle body.
5. The monocular 3D detection frame prediction method of claim 3, wherein the first submodel comprises: a convolutional layer, a fully connected layer and a dimension reconstruction layer, and inputting the three-dimensional features into the first submodel to obtain the extrinsic correction matrix comprises:
inputting the three-dimensional features into the convolutional layer to obtain three-dimensional features extracted from the convolutional layer;
the full-connection layer outputs a one-dimensional characteristic vector according to the three-dimensional characteristics extracted by the convolution layer;
and the dimension reconstruction layer reconstructs the one-dimensional characteristic vector into the external reference correction matrix.
6. The monocular 3D detection frame prediction method of claim 3, wherein the second submodel comprises: an encoding layer and a network output layer, and inputting the three-dimensional features and the extrinsic correction matrix into the second submodel to obtain the 3D detection frame comprises:
inputting the three-dimensional features into the coding layer to obtain coded three-dimensional features;
performing matrix transformation on the coded three-dimensional characteristics by adopting the external reference correction matrix;
and the network output layer outputs the 3D detection frame according to the three-dimensional characteristics after matrix transformation.
7. The monocular 3D detection frame predicting method according to claim 2, further comprising, before acquiring the two-dimensional image captured by the monocular camera: training the monocular 3D detection frame prediction model specifically comprises:
inputting a two-dimensional sample image, an extrinsic correction matrix label corresponding to the two-dimensional sample image and a 3D detection frame label corresponding to the two-dimensional sample image into the monocular 3D detection frame prediction model, wherein a training result output by a first sub-model is input into a second sub-model;
substituting the training result output by the first submodel and the extrinsic correction matrix label corresponding to the two-dimensional sample image into a preset first loss function loss1, substituting the training result output by the second submodel and the 3D detection frame label corresponding to the two-dimensional sample image into a preset second loss function loss2, and finishing the training when a·loss1 + b·loss2 converges, wherein a and b are the weighting coefficients of loss1 and loss2, respectively.
8. The monocular 3D detection frame predicting method according to claim 2, further comprising, before acquiring the two-dimensional image captured by the monocular camera: training the monocular 3D detection frame prediction model specifically comprises:
inputting a two-dimensional sample image, an extrinsic correction matrix label corresponding to the two-dimensional sample image and a 3D detection frame label corresponding to the two-dimensional sample image into the monocular 3D detection frame prediction model, wherein a training result output by a first sub-model is input into a second sub-model;
the training result output by the first submodel and the external reference correction matrix label corresponding to the two-dimensional sample image are brought into a preset first loss function, and the internal parameters of the network backbone layer, the view conversion layer and the first submodel are determined when the first loss function is converged;
and bringing the training result output by the second sub-model and the 3D detection frame label corresponding to the two-dimensional sample image into a preset second loss function, and finishing training when the second loss function is converged.
9. The monocular 3D detection frame prediction method according to claim 7 or 8, comprising, prior to training the monocular 3D detection frame prediction model:
randomly generating an external reference disturbance matrix under a 3D coordinate system of the vehicle body, and taking an inverse matrix of the external reference disturbance matrix as an external reference correction matrix label;
determining an enhancement transformation matrix for the two-dimensional sample image according to the external reference disturbance matrix, an internal reference matrix of the monocular camera and a transformation matrix of a two-dimensional image coordinate system and a vehicle body 3D coordinate system;
and generating an enhanced two-dimensional sample image according to the two-dimensional sample image and the enhanced transformation matrix, and taking the two-dimensional sample image and the enhanced two-dimensional sample image as a training set of a monocular 3D detection frame prediction model.
10. A monocular 3D detection frame prediction device, comprising:
the image acquisition module is used for acquiring a two-dimensional image shot by the monocular camera;
a model operation module for inputting the two-dimensional image into a monocular 3D detection frame prediction model to obtain a 3D detection frame output by the monocular 3D detection frame prediction model,
wherein the monocular 3D detection frame prediction model is obtained by training based on a two-dimensional sample image, an external reference correction matrix label corresponding to the two-dimensional sample image and a 3D detection frame label corresponding to the two-dimensional sample image,
the monocular 3D detection frame prediction model is used for predicting an external reference correction matrix according to the two-dimensional image and predicting a 3D detection frame according to the two-dimensional image and the predicted external reference correction matrix.
11. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, characterized in that the processor implements the monocular 3D detection frame prediction method according to any one of claims 1 to 9 when executing the program.
12. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the monocular 3D detection frame prediction method according to any one of claims 1 to 9.
Application CN202211379399.XA — Monocular 3D detection frame prediction method and device. Priority/filing date: 2022-11-04. Status: Pending.
Publication: CN115661778A, published 2023-01-31. Family ID: 85017137.


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination