CN112819890A - Three-dimensional object detection method, device, equipment and storage medium


Info

Publication number
CN112819890A
CN112819890A
Authority
CN
China
Prior art keywords: dimensional, feature, estimation, depth, image
Prior art date
Legal status
Pending
Application number
CN202110019704.3A
Other languages
Chinese (zh)
Inventor
Dingfu Zhou
Xibin Song
Liangjun Zhang
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Baidu USA LLC
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Baidu USA LLC
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd, Baidu USA LLC filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110019704.3A priority Critical patent/CN112819890A/en
Publication of CN112819890A publication Critical patent/CN112819890A/en
Legal status: Pending

Classifications

    All entries fall under G (Physics), G06 (Computing; Calculating or Counting):

    • G06T7/73: Image analysis; determining position or orientation of objects or cameras using feature-based methods
    • G06N3/045: Neural networks; architecture; combinations of networks
    • G06N3/08: Neural networks; learning methods
    • G06T7/596: Image analysis; depth or shape recovery from three or more stereo images
    • G06T2207/10004: Image acquisition modality; still image; photographic image
    • G06T2207/10012: Image acquisition modality; stereo images
    • G06T2207/10024: Image acquisition modality; color image
    • G06T2207/10028: Image acquisition modality; range image; depth image; 3D point clouds
    • G06T2207/20081: Special algorithmic details; training; learning
    • G06T2207/20084: Special algorithmic details; artificial neural networks [ANN]
    • G06T2207/30256: Subject of image; vehicle exterior; lane; road marking

Abstract

The application discloses a three-dimensional object detection method, apparatus, device and storage medium, relating to the technical field of image processing and further to computer vision technology, and applicable to fields such as automatic driving, virtual reality and augmented reality. The method comprises the following steps: acquiring a single-frame color image to be detected; acquiring depth image features and three-dimensional estimation features of the single-frame color image; and performing three-dimensional object detection on the single-frame color image according to the depth image features and the three-dimensional estimation features. The embodiments of the application can improve the precision of three-dimensional object detection based on a single-frame image.

Description

Three-dimensional object detection method, device, equipment and storage medium
Technical Field
The application relates to the technical field of image processing, and in particular to computer vision technology.
Background
Object detection is an important research direction in the field of computer vision. Currently, object detection mainly comprises two-dimensional object detection and three-dimensional object detection. In two-dimensional object detection, after an object is identified and located in an image, a rectangular box is drawn tightly around it to indicate its position in the image. Three-dimensional object detection obtains the three-dimensional information of an object by identifying and locating it in three dimensions, and a cuboid box is drawn tightly around the object to represent its position in the real world. Three-dimensional object detection has important application value in fields such as intelligent robotics, automatic driving, virtual reality and augmented reality.
Disclosure of Invention
The embodiments of the application provide a three-dimensional object detection method, apparatus, device, storage medium and program product, so as to improve the precision of three-dimensional object detection based on a single-frame image.
In a first aspect, an embodiment of the present application provides a three-dimensional object detection method, including:
acquiring a single-frame color image to be detected;
acquiring a depth image characteristic and a three-dimensional estimation characteristic of the single-frame color image;
and carrying out three-dimensional object detection on the single-frame color image according to the depth image characteristic and the three-dimensional estimation characteristic.
In a second aspect, an embodiment of the present application provides a three-dimensional object detection apparatus, including:
the single-frame color image acquisition module is used for acquiring a single-frame color image to be detected;
the characteristic acquisition module is used for acquiring the depth image characteristic and the three-dimensional estimation characteristic of the single-frame color image;
and the three-dimensional object detection module is used for carrying out three-dimensional object detection on the single-frame color image according to the depth image characteristic and the three-dimensional estimation characteristic.
In a third aspect, an embodiment of the present application provides an electronic device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the three-dimensional object detection method provided in the embodiment of the first aspect.
In a fourth aspect, the present application further provides a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the three-dimensional object detection method provided in the first aspect.
In a fifth aspect, the present application further provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the three-dimensional object detection method provided in the embodiment of the first aspect is implemented.
By acquiring the depth image features and the three-dimensional estimation features of the single-frame color image to be detected, and performing three-dimensional object detection on the single-frame color image according to both, the embodiments of the application address the low detection precision of existing three-dimensional object detection based on a single-frame image, thereby improving the precision of three-dimensional object detection based on a single-frame image.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a flowchart of a three-dimensional object detection method provided in an embodiment of the present application;
fig. 2 is a flowchart of a three-dimensional object detection method according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a three-dimensional object detection method according to an embodiment of the present disclosure;
fig. 4 is a structural diagram of a three-dimensional object detection apparatus according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device for implementing the three-dimensional object detection method according to the embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered exemplary only. Those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
The three-dimensional object detection method based on a single-frame image is a main technique in three-dimensional object detection: it can produce a three-dimensional calibration of an object using only a single-frame image. At present, such methods mainly fall into two categories: direct estimation from the two-dimensional image, and "pseudo-point-cloud" methods. The direct-estimation approach feeds the single-frame image into a pre-trained object detection model to detect the three-dimensional object directly. The pseudo-point-cloud approach first estimates the depth information of the image with a depth estimation network, converts the resulting depth image into a point cloud, and finally detects the three-dimensional object with a point-cloud-based detection method.
Among these single-frame-image methods, direct estimation from the two-dimensional image cannot exploit the three-dimensional information of the scene in the image, while the pseudo-point-cloud method discards the color information of the image; both shortcomings reduce the detection precision of the three-dimensional object.
In an example, fig. 1 is a flowchart of a three-dimensional object detection method provided in an embodiment of the present application. This embodiment is applicable to performing three-dimensional object detection based on the image features and three-dimensional features of a single-frame color image. The method may be performed by a three-dimensional object detection apparatus, which may be implemented in software and/or hardware and is generally integrated in an electronic device, for example a server or a computer. As shown in fig. 1, the method comprises the following operations:
and S110, acquiring a single-frame color image to be detected.
Here, a single-frame color image is simply a single RGB (red-green-blue) image.
In the embodiment of the application, a single-frame color image to be detected can be acquired, and three-dimensional object detection is performed on the objects in the image with the single-frame color image as the only input. Three-dimensional object detection based on a single-frame image likewise marks each object in the image with a three-dimensional identification frame. For example, an object such as a car in the image is enclosed in a three-dimensional box to represent its position in the real world.
And S120, acquiring the depth image characteristic and the three-dimensional estimation characteristic of the single-frame color image.
The depth image features may be image features with high feature dimensions, such as color features, texture features, shape features, or the like, extracted from a single-frame color image. The three-dimensional estimated feature may be a three-dimensional feature estimated from a single frame color image, such as a positional feature in a three-dimensional space.
Correspondingly, after the single-frame color image to be detected is obtained, its depth image features and three-dimensional estimation features can be further acquired. The depth image features can represent image features of the single-frame color image across multiple feature dimensions, and the three-dimensional estimation features can represent its three-dimensional spatial features. That is, both are features of the same single-frame color image, but their feature types differ, as do the image attributes they emphasize.
And S130, carrying out three-dimensional object detection on the single-frame color image according to the depth image characteristic and the three-dimensional estimation characteristic.
Further, three-dimensional object detection may be performed on an object included in the single-frame color image according to the depth image feature and the three-dimensional estimation feature of the single-frame color image. The depth image features may include color features of a single-frame color image, and the three-dimensional estimation features may include three-dimensional features of objects in the image. Therefore, the three-dimensional object detection is carried out on the object included in the single-frame color image according to the depth image characteristic and the three-dimensional estimation characteristic of the single-frame color image, the color information in the single-frame color image can be utilized, the three-dimensional information of the scene in the single-frame color image can be utilized, the comprehensive mining and utilization of the characteristic information of the single-frame color image are realized, and the detection precision of the three-dimensional object in the single-frame color image is improved.
By acquiring the depth image features and the three-dimensional estimation features of the single-frame color image to be detected, and performing three-dimensional object detection on the single-frame color image according to both, the embodiment addresses the low detection precision of existing three-dimensional object detection based on a single-frame image, thereby improving the precision of three-dimensional object detection based on a single-frame image.
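For concreteness, the three-step flow of S110-S130 can be sketched in a few lines of Python. This is an illustration only: the callables image_branch, point_branch and head are hypothetical placeholders (here trivial lambdas) standing in for the networks introduced in the later embodiments, and none of these names come from the application itself.

    import numpy as np

    def detect_3d_objects(rgb, image_branch, point_branch, head):
        # S120: extract the two kinds of features from the same color frame.
        depth_image_features = image_branch(rgb)   # depth image features
        three_d_features = point_branch(rgb)       # three-dimensional estimation features
        # S130: detect 3D objects from both feature sets.
        return head(depth_image_features, three_d_features)

    # Trivial stand-ins, only to exercise the flow end to end:
    rgb = np.zeros((256, 128, 3), dtype=np.float32)            # single W x H x 3 frame (S110)
    boxes = detect_3d_objects(
        rgb,
        image_branch=lambda im: im.mean(axis=2),               # placeholder feature map
        point_branch=lambda im: im.sum(axis=2),                # placeholder feature map
        head=lambda f1, f2: [("object", f1.shape, f2.shape)],  # placeholder detector
    )
    print(boxes)   # [('object', (256, 128), (256, 128))]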
In an example, fig. 2 is a flowchart of a three-dimensional object detection method provided in an embodiment of the present application, and fig. 3 is a flowchart of a three-dimensional object detection method provided in an embodiment of the present application, which is an improvement on the above technical solutions of the embodiments of the present application, and provides a plurality of specific optional implementations of obtaining a depth image feature and a three-dimensional estimation feature of the single-frame color image and performing three-dimensional object detection on the single-frame color image according to the depth image feature and the three-dimensional estimation feature.
A three-dimensional object detection method as shown in fig. 2 and 3, comprising:
s210, acquiring a single-frame color image to be detected.
And S220, inputting the single-frame color image to a depth feature estimation network.
And S230, acquiring first output data of the depth feature estimation network as the depth image feature.
The depth feature estimation network can extract the depth image features of the single-frame color image. The first output data may be output data of a depth feature estimation network.
Optionally, the single-frame color image may be input to a depth feature estimation network (i.e., the image-based backbone network in fig. 3) to extract the depth image features of the single-frame color image (i.e., the image features in fig. 3). The depth feature estimation network may be a deep neural network such as a residual network (ResNet), a VGG (Visual Geometry Group) convolutional network, or a DLA-34 (Deep Layer Aggregation) network; any network capable of extracting the depth image features of a single-frame color image may be used, and the embodiments of the application do not limit the specific network type of the depth feature estimation network.
The input single-frame color image has size W × H × 3, where W is the image width in pixels and H its height in pixels (i.e., W × H carries the width and height information of the single-frame color image), and 3 is the number of RGB channels. The first output data obtained by feeding the single-frame color image into the depth feature estimation network has size w × h × c; that is, the depth image feature has size w × h × c, where c is the feature dimension. A larger value of c indicates a higher feature dimension; typically, c takes the value 32 or 64. It should be noted that, owing to limits on computing resources or on the performance of the computing device, w is generally smaller than W and h smaller than H.
According to the technical scheme, the depth image features of the single-frame color image are extracted by using the depth feature estimation network, so that the depth image features with higher feature dimensionality can be obtained.
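As a minimal sketch of this step, the toy backbone below maps a W × H × 3 image to a w × h × c feature map with c = 32 and w = W/4, h = H/4, so that w < W and h < H as noted above. It is a stand-in written under stated assumptions in PyTorch-style Python; the class name TinyImageBackbone is hypothetical, and any ResNet, VGG or DLA-34 backbone could replace it.

    import torch
    import torch.nn as nn

    class TinyImageBackbone(nn.Module):
        # Maps a (B, 3, H, W) color image to a (B, c, h, w) feature map,
        # with c = 32 and h = H/4, w = W/4.
        def __init__(self, c: int = 32):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, c, 3, stride=2, padding=1), nn.ReLU(),
            )

        def forward(self, rgb):
            return self.net(rgb)

    features = TinyImageBackbone()(torch.zeros(1, 3, 128, 256))
    print(features.shape)   # torch.Size([1, 32, 32, 64])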
And S240, inputting the single-frame color image to a depth information estimation network.
And S250, acquiring second output data of the depth information estimation network as a depth information characteristic.
And S260, acquiring the three-dimensional estimation feature according to the depth information feature.
The depth information estimation network can extract the depth information characteristics of the single-frame color image. The second output data may be output data of the depth information estimation network. The depth information features are depth information in the single-frame color image, and correspond to distance information of the real world. That is, the farther the distance is, the larger the depth value corresponding to the distance is; the closer the distance, the smaller the depth value it corresponds to.
Optionally, the single-frame color image may be input to a depth information estimation network (i.e., the depth estimation network in fig. 3), so as to extract the depth information features of the single-frame color image according to the depth information estimation network, and further obtain the three-dimensional estimation features according to the extracted depth information features. The depth information estimation network may be a monocular depth estimation network based on deep learning, and the like, as long as the depth information features of a single-frame color image can be extracted, and the specific network type of the depth information estimation network is not limited in the embodiments of the present application.
The second output data of the depth information estimation network may be a depth image, and the value of each pixel in the depth image may represent distance information in the real world corresponding to the pixel, that is, depth information.
According to the technical scheme, the depth information characteristic of the single-frame color image is extracted by using the depth information estimation network, so that the extraction and application of the three-dimensional characteristic in the single-frame color image can be realized.
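For illustration, a stand-in for the depth information estimation network might look as follows. This tiny network is only meant to make the interface concrete (color image in, per-pixel depth map out); the class name TinyMonoDepthNet is hypothetical, and any monocular depth estimator with the same interface could be substituted.

    import torch
    import torch.nn as nn

    class TinyMonoDepthNet(nn.Module):
        # Image in, per-pixel depth map out: (B, 3, H, W) -> (B, 1, H, W).
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 1, 3, padding=1),
                nn.Softplus(),   # depth values are non-negative distances
            )

        def forward(self, rgb):
            return self.net(rgb)

    depth_map = TinyMonoDepthNet()(torch.zeros(1, 3, 128, 256))
    print(depth_map.shape)   # torch.Size([1, 1, 128, 256])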
In an optional embodiment of the present application, the obtaining the three-dimensional estimation feature according to the depth information feature may include: determining a three-dimensional depth information coordinate according to the depth information characteristic; determining a three-dimensional space coordinate according to the three-dimensional depth information coordinate and the camera calibration internal parameter; inputting the three-dimensional space coordinates to a deep neural network; and acquiring third output data of the deep neural network as the three-dimensional estimation feature.
A three-dimensional depth information coordinate is simply a three-dimensional coordinate that includes the depth value of a pixel. The third output data is the output data of the deep neural network.
In the embodiment of the application, after the depth image including the depth information is obtained, the three-dimensional depth information coordinates of each pixel in the depth image can be obtained, and the three-dimensional point cloud data corresponding to the depth image can then be recovered by combining these coordinates with the camera calibration internal parameters (the camera intrinsics); that is, the three-dimensional space coordinates of each pixel are determined. For example, suppose a pixel in the depth image has three-dimensional depth information coordinates (x, y, d), where x and y are the pixel's horizontal and vertical image coordinates and d is its depth value. Its three-dimensional space coordinates are then Z = d, X = (x - u0) × d / f and Y = (y - v0) × d / f, where X, Y and Z are the three-dimensional space coordinates of the pixel, (u0, v0) is the optical center, and f is the camera focal length. Optionally, f may be split into a horizontal focal length fu (used for X) and a vertical focal length fv (used for Y).
Correspondingly, after the three-dimensional space coordinates corresponding to the three-dimensional point cloud data are obtained, the three-dimensional estimation features of each pixel can be computed with a further deep neural network. For example, the three-dimensional space coordinates may be input to a deep neural network such as PointNet (a point cloud processing network) or PointNet++, and the third output data of that network is taken as the three-dimensional estimation features (i.e., the point cloud features in fig. 3) corresponding to the single-frame color image. The three-dimensional estimation feature also has size w × h × c.
According to the above technical scheme, the three-dimensional space coordinates are determined by combining the camera calibration internal parameters with the three-dimensional depth information coordinates, so that the estimated three-dimensional features can be obtained from the three-dimensional space coordinates.
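The back-projection just described can be written directly from the formulas above. The NumPy sketch below implements Z = d, X = (x - u0) × d / fu, Y = (y - v0) × d / fv for every pixel of a depth map; the function name and the example intrinsics are illustrative only, and the PointNet-style feature network that would follow is omitted.

    import numpy as np

    def backproject(depth, fu, fv, u0, v0):
        # Recover per-pixel 3D coordinates from an (H, W) depth map:
        # Z = d, X = (x - u0) * d / fu, Y = (y - v0) * d / fv.
        h, w = depth.shape
        x, y = np.meshgrid(np.arange(w), np.arange(h))
        X = (x - u0) * depth / fu
        Y = (y - v0) * depth / fv
        return np.stack([X, Y, depth], axis=-1)   # (H, W, 3) array of (X, Y, Z)

    # Example: pixel (x, y) = (700, 300) with depth 10, fu = fv = 700,
    # optical center (u0, v0) = (600, 200):
    points = backproject(np.full((400, 800), 10.0), 700.0, 700.0, 600.0, 200.0)
    print(points[300, 700])   # [ 1.428...  1.428... 10. ]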
And S270, performing feature fusion on the depth image features and the three-dimensional estimation features to obtain fusion image features.
The fused image feature may be an image feature obtained by fusing a depth image feature and a three-dimensional estimation feature.
Since the depth image feature and the three-dimensional estimation feature are different types of features, the depth image feature and the three-dimensional estimation feature need to be fused to obtain a fused image feature including both the three-dimensional feature and the image feature.
In an optional embodiment of the present application, the performing feature fusion on the depth image feature and the three-dimensional estimation feature may include: determining three-dimensional estimation feature weights; determining a three-dimensional estimation feature to be fused according to the three-dimensional estimation feature weight and the three-dimensional estimation feature; carrying out feature addition fusion processing on the depth image features and the three-dimensional estimation features to be fused to obtain fused image features; or, performing feature splicing and fusion processing on the depth image features and the three-dimensional estimation features to obtain the fusion image features.
The three-dimensional estimation feature weight may be a weight parameter for determining the importance degree of the three-dimensional estimation feature. The three-dimensional estimation feature to be fused can be a three-dimensional estimation feature obtained by adjusting the ratio of the three-dimensional estimation feature in the fused image feature by using the weight of the three-dimensional estimation feature.
In the embodiment of the application, two feature fusion modes can be adopted to fuse the depth image features and the three-dimensional estimation features. The first is feature addition; optionally, the feature addition fusion can be implemented with the formula F = F1 + a × F2, where F is the fused image feature, F1 the depth image feature, F2 the three-dimensional estimation feature, and a the three-dimensional estimation feature weight. The value range of a can be set according to actual requirements; for example, if the depth image features should be emphasized in the fused image features, a may be restricted to [0, 1]. Accordingly, a × F2 is the to-be-fused three-dimensional estimation feature determined from the three-dimensional estimation feature weight and the three-dimensional estimation feature. The second fusion mode is feature splicing (concatenation); optionally, it can be implemented with the formula F = [F1, F2].
According to the technical scheme, the depth image features and the three-dimensional estimation features are subjected to feature fusion by using different feature fusion processing modes, so that the feature fusion modes of the depth image features and the three-dimensional estimation features are enriched, and meanwhile, the feature fusion can be carried out as required by using the three-dimensional estimation feature weight.
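Both fusion modes amount to a few tensor operations. The sketch below implements F = F1 + a × F2 and F = [F1, F2] as described above, assuming both feature maps share the w × h × c layout; the function name and its defaults are illustrative, not the application's concrete implementation.

    import torch

    def fuse_features(f1, f2, a=0.5, mode="add"):
        # mode="add":    F = F1 + a * F2  (a in [0, 1] emphasizes F1)
        # mode="concat": F = [F1, F2]     (channel-wise splicing)
        if mode == "add":
            return f1 + a * f2
        return torch.cat([f1, f2], dim=1)

    f1 = torch.randn(1, 32, 32, 64)   # depth image features, (B, c, h, w)
    f2 = torch.randn(1, 32, 32, 64)   # three-dimensional estimation features
    print(fuse_features(f1, f2).shape)                  # torch.Size([1, 32, 32, 64])
    print(fuse_features(f1, f2, mode="concat").shape)   # torch.Size([1, 64, 32, 64])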
And S280, inputting the fusion image characteristics to a fully-connected neural network, and acquiring output regression data of the fully-connected neural network.
And S290, carrying out three-dimensional object detection on the single-frame color image according to the output regression data.
Here, the output regression data are the regression parameters output by the fully-connected neural network.
In this embodiment of the application, after the fused image features are obtained, they may be input to the fully-connected neural network to obtain its output regression data, and finally three-dimensional object detection is performed on the single-frame color image according to the output regression data. Because the fused image features contain both the image features (including color information) and the three-dimensional features, the feature information of the single-frame color image is mined comprehensively, and the regression parameters estimated from such fused features are more accurate.
In an optional embodiment of the present application, the output regression data may include an object three-dimensional size, an object orientation angle, an object center point position, and an object center point depth; the three-dimensional object detection of the single-frame color image according to the output regression data may include: determining a three-dimensional surrounding frame corresponding to each object in the single-frame color image according to the output regression data; and identifying each object according to the three-dimensional surrounding frame.
Optionally, the output regression data produced by the fully-connected network may include, but is not limited to, the object three-dimensional size, the object orientation angle, the object center point position and the object center point depth, and the three-dimensional surrounding frame (i.e., the three-dimensional bounding box) corresponding to each object in the single-frame color image can be determined from this information. The three-dimensional surrounding frame may be any three-dimensional frame, such as a cuboid, a cube or a cylinder; the embodiments of the application do not limit its specific type, as long as the objects in the single-frame color image can be calibrated three-dimensionally. For example, the length, width and height of a cuboid can be determined from the object three-dimensional size, its orientation from the object orientation angle, the center of the surrounding frame from the object center point position, and the depth-related placement from the object center point depth.
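As a sketch of how the output regression data could be produced, the toy head below regresses the seven quantities named in this embodiment (three-dimensional size, orientation angle, center point position, center point depth) from a per-object fused feature vector. It assumes the fused feature map has already been pooled into such a vector; the class name BoxRegressionHead and the layer sizes are hypothetical, not the application's concrete network.

    import torch
    import torch.nn as nn

    class BoxRegressionHead(nn.Module):
        # Regresses 7 values per object from a fused feature vector:
        # 3D size (l, w, h), orientation angle, center point (u, v), center depth.
        def __init__(self, in_dim: int):
            super().__init__()
            self.fc = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                    nn.Linear(128, 7))

        def forward(self, fused):
            out = self.fc(fused)
            size = out[:, 0:3]     # object three-dimensional size
            angle = out[:, 3:4]    # object orientation angle
            center = out[:, 4:6]   # object center point position
            depth = out[:, 6:7]    # object center point depth
            return size, angle, center, depth

    head = BoxRegressionHead(in_dim=64)
    size, angle, center, depth = head(torch.randn(2, 64))   # 2 pooled object features
    print(size.shape, angle.shape, center.shape, depth.shape)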
According to the above technical scheme, the depth image features and the three-dimensional estimation features of the single-frame color image are obtained with different deep networks and fused into the fused image features; the fused image features are then used to obtain the data defining the three-dimensional surrounding frame that identifies each object. This improves the accuracy and precision of the three-dimensional surrounding frame, and thus the detection precision of three-dimensional objects based on a single-frame image.
In an example, fig. 4 is a structural diagram of a three-dimensional object detection apparatus provided in an embodiment of the present application. The embodiment is applicable to performing three-dimensional object detection based on the image features and three-dimensional features of a single-frame color image; the apparatus is implemented in software and/or hardware and is specifically configured in an electronic device, for example a server or a computer.
A three-dimensional object detection apparatus 300 as shown in fig. 4 includes: a single-frame color image acquisition module 310, a feature acquisition module 320, and a three-dimensional object detection module 330. Wherein:
a single-frame color image obtaining module 310, configured to obtain a single-frame color image to be detected;
a feature obtaining module 320, configured to obtain a depth image feature and a three-dimensional estimation feature of the single-frame color image;
and a three-dimensional object detection module 330, configured to perform three-dimensional object detection on the single-frame color image according to the depth image feature and the three-dimensional estimation feature.
By acquiring the depth image features and the three-dimensional estimation features of the single-frame color image to be detected, and performing three-dimensional object detection on the single-frame color image according to both, the embodiment addresses the low detection precision of existing three-dimensional object detection based on a single-frame image, thereby improving the precision of three-dimensional object detection based on a single-frame image.
Optionally, the feature obtaining module 320 is specifically configured to: inputting the single-frame color image to a depth feature estimation network; and acquiring first output data of the depth feature estimation network as the depth image feature.
Optionally, the feature obtaining module 320 is specifically configured to: inputting the single-frame color image to a depth information estimation network; acquiring second output data of the depth information estimation network as a depth information characteristic; and acquiring the three-dimensional estimation feature according to the depth information feature.
Optionally, the feature obtaining module 320 is specifically configured to: determining a three-dimensional depth information coordinate according to the depth information characteristic; determining a three-dimensional space coordinate according to the three-dimensional depth information coordinate and the camera calibration internal parameter; inputting the three-dimensional space coordinates to a deep neural network;
and acquiring third output data of the deep neural network as the three-dimensional estimation feature.
Optionally, the three-dimensional object detection module 330 is specifically configured to: performing feature fusion on the depth image features and the three-dimensional estimation features to obtain fusion image features; inputting the fusion image characteristics to a fully-connected neural network, and acquiring output regression data of the fully-connected neural network; and carrying out three-dimensional object detection on the single-frame color image according to the output regression data.
Optionally, the three-dimensional object detection module 330 is specifically configured to: determining three-dimensional estimation feature weights; determining a three-dimensional estimation feature to be fused according to the three-dimensional estimation feature weight and the three-dimensional estimation feature; carrying out feature addition fusion processing on the depth image features and the three-dimensional estimation features to be fused to obtain fused image features; or, performing feature splicing and fusion processing on the depth image features and the three-dimensional estimation features to obtain the fusion image features.
Optionally, the output regression data includes the object three-dimensional size, the object orientation angle, the object center point position and the object center point depth; the three-dimensional object detection module 330 is specifically configured to: determine a three-dimensional surrounding frame corresponding to each object in the single-frame color image according to the output regression data; and identify each object according to the three-dimensional surrounding frame.
The three-dimensional object detection apparatus can execute the three-dimensional object detection method provided by any embodiment of the application, and has the functional modules and beneficial effects corresponding to the executed method. For details not covered here, reference may be made to the description of the three-dimensional object detection method in the foregoing embodiments.
Since the above three-dimensional object detection apparatus is an apparatus capable of executing the three-dimensional object detection method of the embodiments of the application, a person skilled in the art can, based on the method described herein, understand the specific implementation of the apparatus of this embodiment and its various variations; therefore, how the apparatus implements the method is not described in detail here. Any apparatus used by those skilled in the art to implement the three-dimensional object detection method of the embodiments of the application falls within the scope of the application.
In one example, the present disclosure also provides an electronic device, a readable storage medium, and a computer program product.
FIG. 5 shows a schematic block diagram of an example electronic device 400 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the device 400 includes a computing unit 401 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 402 or loaded from a storage unit 408 into a random access memory (RAM) 403. The RAM 403 can also store the various programs and data required for the operation of the device 400. The computing unit 401, the ROM 402 and the RAM 403 are connected to one another via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
A number of components in device 400 are connected to I/O interface 405, including: an input unit 406 such as a keyboard, a mouse, or the like; an output unit 407 such as various types of displays, speakers, and the like; a storage unit 408 such as a magnetic disk, optical disk, or the like; and a communication unit 409 such as a network card, modem, wireless communication transceiver, etc. The communication unit 409 allows the device 400 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
Computing unit 401 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The calculation unit 401 executes the respective methods and processes described above, such as the three-dimensional object detection method. For example, in some embodiments, the three-dimensional object detection method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 400 via the ROM 402 and/or the communication unit 409. When the computer program is loaded into RAM 403 and executed by computing unit 401, one or more steps of the three-dimensional object detection method described above may be performed. Alternatively, in other embodiments, the computing unit 401 may be configured to perform the three-dimensional object detection method in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special- or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, also called a cloud computing server or cloud host, a host product in a cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability found in traditional physical hosts and VPS (Virtual Private Server) services.
By acquiring the depth image features and the three-dimensional estimation features of the single-frame color image to be detected, and performing three-dimensional object detection on the single-frame color image according to both, the embodiments of the application address the low detection precision of existing three-dimensional object detection based on a single-frame image, thereby improving the precision of three-dimensional object detection based on a single-frame image.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (17)

1. A three-dimensional object detection method, comprising:
acquiring a single-frame color image to be detected;
acquiring a depth image characteristic and a three-dimensional estimation characteristic of the single-frame color image;
and carrying out three-dimensional object detection on the single-frame color image according to the depth image characteristic and the three-dimensional estimation characteristic.
2. The method of claim 1, wherein said obtaining depth image characteristics of said single frame color image comprises:
inputting the single-frame color image to a depth feature estimation network;
and acquiring first output data of the depth feature estimation network as the depth image feature.
3. The method of claim 1, wherein said obtaining three-dimensional estimated features of said single frame color image comprises:
inputting the single-frame color image to a depth information estimation network;
acquiring second output data of the depth information estimation network as a depth information characteristic;
and acquiring the three-dimensional estimation feature according to the depth information feature.
4. The method of claim 3, wherein said deriving the three-dimensional estimation feature from the depth information feature comprises:
determining a three-dimensional depth information coordinate according to the depth information characteristic;
determining a three-dimensional space coordinate according to the three-dimensional depth information coordinate and the camera calibration internal parameter;
inputting the three-dimensional space coordinates to a deep neural network;
and acquiring third output data of the deep neural network as the three-dimensional estimation feature.
5. The method of claim 1, wherein the three-dimensional object detection of the single frame color image from the depth image feature and the three-dimensional estimation feature comprises:
performing feature fusion on the depth image features and the three-dimensional estimation features to obtain fusion image features;
inputting the fusion image characteristics to a fully-connected neural network, and acquiring output regression data of the fully-connected neural network;
and carrying out three-dimensional object detection on the single-frame color image according to the output regression data.
6. The method of claim 5, wherein the feature fusing the depth image features and the three-dimensional estimation features comprises:
determining three-dimensional estimation feature weights;
determining a three-dimensional estimation feature to be fused according to the three-dimensional estimation feature weight and the three-dimensional estimation feature;
carrying out feature addition fusion processing on the depth image features and the three-dimensional estimation features to be fused to obtain fused image features; or
And carrying out feature splicing and fusion processing on the depth image features and the three-dimensional estimation features to obtain fusion image features.
7. The method of claim 5, wherein the output regression data includes an object three-dimensional size, an object orientation angle, an object center point location, and an object center point depth;
the three-dimensional object detection of the single-frame color image according to the output regression data includes:
determining a three-dimensional surrounding frame corresponding to each object in the single-frame color image according to the output regression data;
and identifying each object according to the three-dimensional surrounding frame.
8. A three-dimensional object detection apparatus comprising:
the single-frame color image acquisition module is used for acquiring a single-frame color image to be detected;
the characteristic acquisition module is used for acquiring the depth image characteristic and the three-dimensional estimation characteristic of the single-frame color image;
and the three-dimensional object detection module is used for carrying out three-dimensional object detection on the single-frame color image according to the depth image characteristic and the three-dimensional estimation characteristic.
9. The apparatus of claim 8, wherein the feature acquisition module is specifically configured to:
inputting the single-frame color image to a depth feature estimation network;
and acquiring first output data of the depth feature estimation network as the depth image feature.
10. The apparatus of claim 8, wherein the feature acquisition module is specifically configured to:
inputting the single-frame color image to a depth information estimation network;
acquiring second output data of the depth information estimation network as a depth information characteristic;
and acquiring the three-dimensional estimation feature according to the depth information feature.
11. The apparatus of claim 10, wherein the feature acquisition module is specifically configured to:
determining a three-dimensional depth information coordinate according to the depth information characteristic;
determining a three-dimensional space coordinate according to the three-dimensional depth information coordinate and the camera calibration internal parameter;
inputting the three-dimensional space coordinates to a deep neural network;
and acquiring third output data of the deep neural network as the three-dimensional estimation feature.
12. The apparatus of claim 8, wherein the three-dimensional object detection module is specifically configured to:
performing feature fusion on the depth image features and the three-dimensional estimation features to obtain fusion image features;
inputting the fusion image characteristics to a fully-connected neural network, and acquiring output regression data of the fully-connected neural network;
and carrying out three-dimensional object detection on the single-frame color image according to the output regression data.
13. The apparatus of claim 12, wherein the three-dimensional object detection module is specifically configured to:
determining three-dimensional estimation feature weights;
determining a three-dimensional estimation feature to be fused according to the three-dimensional estimation feature weight and the three-dimensional estimation feature;
carrying out feature addition fusion processing on the depth image features and the three-dimensional estimation features to be fused to obtain fused image features; or
And carrying out feature splicing and fusion processing on the depth image features and the three-dimensional estimation features to obtain fusion image features.
14. The apparatus of claim 13, wherein the output regression data comprises an object three-dimensional size, an object orientation angle, an object center point location, and an object center point depth;
the three-dimensional object detection module is specifically configured to:
determining a three-dimensional surrounding frame corresponding to each object in the single-frame color image according to the output regression data;
and identifying each object according to the three-dimensional surrounding frame.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the three-dimensional object detection method of any one of claims 1-7.
16. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the three-dimensional object detection method according to any one of claims 1 to 7.
17. A computer program product comprising a computer program which, when executed by a processor, implements a three-dimensional object detection method according to any one of claims 1-7.
Application CN202110019704.3A, filed 2021-01-07 (priority date 2021-01-07): Three-dimensional object detection method, device, equipment and storage medium. Publication CN112819890A, status: Pending.

Priority Applications (1)

CN202110019704.3A, priority date 2021-01-07, filing date 2021-01-07: Three-dimensional object detection method, device, equipment and storage medium (CN112819890A)

Applications Claiming Priority (1)

CN202110019704.3A, priority date 2021-01-07, filing date 2021-01-07: Three-dimensional object detection method, device, equipment and storage medium (CN112819890A)

Publications (1)

CN112819890A, published 2021-05-18

Family

ID=75868543

Family Applications (1)

CN202110019704.3A, priority date 2021-01-07, filing date 2021-01-07: Three-dimensional object detection method, device, equipment and storage medium (CN112819890A)

Country Status (1)

Country Link
CN: CN112819890A

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229548A (en) * 2017-12-27 2018-06-29 华为技术有限公司 A kind of object detecting method and device
CN111832338A (en) * 2019-04-16 2020-10-27 北京市商汤科技开发有限公司 Object detection method and device, electronic equipment and storage medium
CN110956663A (en) * 2019-12-12 2020-04-03 深圳先进技术研究院 Neural network system and neural network method for six-dimensional attitude estimation
CN111179324A (en) * 2019-12-30 2020-05-19 同济大学 Object six-degree-of-freedom pose estimation method based on color and depth information fusion
CN111340864A (en) * 2020-02-26 2020-06-26 浙江大华技术股份有限公司 Monocular estimation-based three-dimensional scene fusion method and device
CN111739005A (en) * 2020-06-22 2020-10-02 北京百度网讯科技有限公司 Image detection method, image detection device, electronic equipment and storage medium
CN112132829A (en) * 2020-10-23 2020-12-25 北京百度网讯科技有限公司 Vehicle information detection method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DANFEI XU et al.: "PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation", IEEE, 16 December 2018 (2018-12-16), page 3 *


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination