CN115346194A - Three-dimensional detection method and device, electronic equipment and storage medium - Google Patents

Three-dimensional detection method and device, electronic equipment and storage medium

Info

Publication number
CN115346194A
CN115346194A (application number CN202211023285.1A)
Authority
CN
China
Prior art keywords
target object
depth
target
original image
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211023285.1A
Other languages
Chinese (zh)
Inventor
段由
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Elite Road Technology Co ltd
Original Assignee
Beijing Elite Road Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Elite Road Technology Co ltd filed Critical Beijing Elite Road Technology Co ltd
Priority to CN202211023285.1A priority Critical patent/CN115346194A/en
Publication of CN115346194A publication Critical patent/CN115346194A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/60Analysis of geometric attributes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10012Stereo images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30248Vehicle exterior or interior
    • G06T2207/30252Vehicle exterior; Vicinity of vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a three-dimensional detection method and device, electronic equipment and a storage medium, relating to the field of artificial intelligence, in particular to the fields of intelligent transportation, automatic driving and intelligent parking. The specific implementation scheme is as follows: performing target detection on an original image to obtain a first parameter set of a target object, where the first parameter set comprises at least one of the category of the target object, a coordinate value of the target object on a first coordinate axis in a three-dimensional coordinate system, a coordinate value of the target object on a second coordinate axis in the three-dimensional coordinate system, the size of the target object, and the angle of the target object; performing depth detection on the original image to obtain the depth of each pixel in the original image, and determining the depth of the target object from the depth of each pixel; and combining the first parameter set and the depth of the target object to obtain the three-dimensional detection parameters of the target object. The disclosure thus enables three-dimensional detection of an object.

Description

Three-dimensional detection method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, and in particular to the field of intelligent transportation, the field of automatic driving, the field of intelligent parking, etc.
Background
Three-dimensional detection, also called 3D (three-dimensional) detection, reflects the three-dimensional shape of objects in a scene. Three-dimensional detection is an indispensable technology in fields such as automatic driving, intelligent transportation, and intelligent parking.
Disclosure of Invention
The disclosure provides a three-dimensional detection method, a three-dimensional detection device, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided a three-dimensional detection method, including:
performing target detection on the original image to obtain a first parameter set of a target object, wherein the first parameter set comprises at least one of the category of the target object, a coordinate value of the target object on a first coordinate axis in a three-dimensional coordinate system, a coordinate value of the target object on a second coordinate axis in the three-dimensional coordinate system, the size of the target object and the angle of the target object;
carrying out depth detection on the original image to obtain the depth of each pixel in the original image; determining the depth of the target object by using the depth of each pixel in the original image; and
and combining the first parameter set and the depth of the target object to obtain the three-dimensional detection parameters of the target object.
According to another aspect of the present disclosure, there is provided a three-dimensional detection apparatus including:
the target detection module is used for carrying out target detection on the original image to obtain a first parameter set of a target object, wherein the first parameter set comprises at least one of the category of the target object, the coordinate value of the target object on a first coordinate axis in a three-dimensional coordinate system, the coordinate value of the target object on a second coordinate axis in the three-dimensional coordinate system, the size of the target object and the angle of the target object;
the depth detection module is used for carrying out depth detection on the original image to obtain the depth of each pixel in the original image; determining the depth of the target object by using the depth of each pixel in the original image; and
and the combination module is used for combining the first parameter set and the depth of the target object to obtain the three-dimensional detection parameters of the target object.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to cause the at least one processor to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements a method according to any of the embodiments of the present disclosure.
According to the method and the device, the three-dimensional detection parameters of the target object in the original image can be obtained by combining the results of target detection and depth detection.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram of a system 100 architecture to which the three-dimensional inspection method of the present disclosure may be applied;
FIG. 2 is a flow chart of an implementation of a three-dimensional detection method 200 according to an embodiment of the present disclosure;
FIG. 3 is a schematic three-dimensional coordinate system according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of an object detection model 400 according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a depth detection model 500 according to an embodiment of the present disclosure;
FIG. 6 is an overall flow diagram of a three-dimensional inspection method according to an embodiment of the present disclosure;
FIG. 7A is a first schematic view of a first region in an embodiment of the present disclosure;
FIG. 7B is a second schematic view of a first region in an embodiment of the present disclosure;
FIG. 7C is a third schematic view of a first region in an embodiment of the present disclosure;
FIG. 7D is a fourth schematic view of the first region in an embodiment of the present disclosure;
FIG. 8 is a schematic diagram of a three-dimensional inspection apparatus 800 according to an embodiment of the present disclosure;
FIG. 9 is a schematic diagram of a three-dimensional inspection apparatus 900 according to an embodiment of the present disclosure;
FIG. 10 illustrates a schematic block diagram of an example electronic device 1000 that can be used to implement embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the subject matter of the present disclosure.
In the related art, three-dimensional detection has wide application requirements. Three-dimensional detection is used to detect the solid shape of objects in a scene. For example, in a scene such as automatic driving, intelligent transportation, and intelligent parking, information such as a three-dimensional shape and a position of a target object (e.g., a vehicle, a pedestrian, and the like) needs to be determined.
The existing three-dimensional (3D) detection methods mainly include:
1. and obtaining point cloud by adopting a laser radar, and then carrying out 3D target detection by utilizing the point cloud data. Specifically, a laser pulse is emitted by the laser and the time of emission is recorded by the timer, and the returned laser pulse is received by the receiver and the time of return is recorded by the timer. The subtraction of the two times yields the "time of flight" of the light, while the speed of the light is constant, so that the distance can be calculated after knowing the speed and time.
2. 3D detection using a binocular camera. In this scheme, the camera needs to be calibrated to obtain its intrinsic and extrinsic parameters. By calculating the parallax between the two images, the distance to the scene in front (the range covered by the images) is measured directly, without judging what type of obstacle appears ahead. The principle of a binocular camera is similar to that of the human eyes: humans can perceive the distance of an object because the images of the same object seen by the two eyes differ, which is called parallax. The farther away the object is, the smaller the parallax; the closer it is, the greater the parallax.
3. 3D detection using a monocular camera and a deep learning algorithm. A monocular camera is the ordinary camera used in daily life and has only one lens. In principle, a monocular camera captures a two-dimensional (2D) image and cannot provide a 3D view. However, with a supervised deep learning algorithm, images captured by the monocular camera are used as training data, the three-dimensional detection parameters of the target object (such as 3D bounding-box information) are used as labels, and a deep learning model is trained on this data, so that monocular 3D detection can be realized in a specific scene. If the training process goes well, the monocular 3D detection result can theoretically approach the lidar detection result.
Each of the above approaches has disadvantages. In the first, the lidar is costly, power-hungry, and prone to failure. In the second, a large amount of parameter calibration work must be performed on the binocular camera, requiring many operation and maintenance personnel. In the third, a deep learning model for 3D detection must be trained; the training process is complex, manually labeled training samples are required, and running the 3D detection model also incurs high time and power costs.
The embodiment of the disclosure provides a three-dimensional detection method. Fig. 1 is a schematic diagram of a system 100 architecture to which the three-dimensional detection method of the embodiments of the present disclosure may be applied. As shown in fig. 1, the system architecture includes: an image acquisition device 110, a network 120 and a three-dimensional detection device 130. The image acquisition device 110 and the three-dimensional detection device 130 may establish a communication connection through the network 120. The image acquisition device 110 transmits the original image to the three-dimensional detection device 130 through the network 120, and the three-dimensional detection device 130 performs three-dimensional detection on the original image in response to receiving it. Finally, the three-dimensional detection device 130 returns the three-dimensional detection result to the image acquisition device 110, or sends it to another server or terminal device. The three-dimensional detection device 130 may include a vision processing device or a remote server. The network 120 may use wired or wireless connections. When the three-dimensional detection device 130 is a vision processing device, the image acquisition device 110 may be communicatively connected to it in a wired manner, for example performing data communication through a bus; when the three-dimensional detection device 130 is a remote server, the image acquisition device 110 may exchange data with the remote server through a wireless network. In addition, the image acquisition device 110 may be an in-vehicle camera, an intelligent traffic camera, or the like.
Fig. 2 is a flow chart of an implementation of a three-dimensional detection method 200 according to an embodiment of the disclosure. In some embodiments of the present disclosure, the three-dimensional detection method may be performed by a terminal device, a server, or other processing device. In some embodiments of the present disclosure, the three-dimensional detection method may be implemented by a processor calling computer-readable instructions stored in a memory. As shown in fig. 2, the three-dimensional detection method includes the following steps:
s210: performing target detection on the original image to obtain a first parameter set of a target object, wherein the first parameter set comprises at least one of the category of the target object, a coordinate value of the target object on a first coordinate axis in a three-dimensional coordinate system, a coordinate value of the target object on a second coordinate axis in the three-dimensional coordinate system, the size of the target object and the angle of the target object;
s220: carrying out depth detection on the original image to obtain the depth of each pixel in the original image; determining the depth of the target object by using the depth of each pixel in the original image; and the number of the first and second groups,
s230: and combining the first parameter set and the depth of the target object to obtain the three-dimensional detection parameters of the target object.
The steps S210 and S220 are executed separately, and there is no restriction on the order between them. For example, step S210 and step S220 may be executed synchronously, or step S210 is executed first and step S220 is executed later, or step S220 is executed first and step S210 is executed later; or start execution of step S210 or step S220 at any time, and so on. It is only necessary that both step S210 and step S220 are performed before step S230.
The method adopts a mode of combining target detection and depth detection to realize three-dimensional detection of the object; the three-dimensional detection is divided into two independent and simple detection processes, so that the difficulty of the three-dimensional detection can be reduced, and the consumption of time and computing resources by the three-dimensional detection is reduced.
The embodiments of the present disclosure may be applied to various scenarios, for example parking scenarios. In a parking scene, the vehicle is generally stationary and usually located in a fixed parking space, so compared with an automatic driving or driving scene, highly accurate three-dimensional detection of the target object (such as a vehicle) is not required. In addition, in a parking scene, the position and angle of the image acquisition device (such as a camera) in the parking lot are fixed, so the camera only needs to be calibrated once; in subsequent three-dimensional detection, the camera parameters can be treated as constants, and calibration parameters need not be re-entered before each detection, making the detection process simple and stable. Moreover, the image acquisition device in a parking scene is generally mounted at a higher position, and the captured original image makes it easier to determine the depth of the target object (the reason is explained in the detailed description below).
The parking scene is a broad term that covers, for example, a parking lot, a temporary parking area near a road, vehicles parked in an exhibition hall, and vehicles that are nearly stationary near a traffic light. Moreover, the application scene of the embodiments of the disclosure is not limited to parking: the method can be applied to any scene as long as the accuracy of three-dimensional detection achieved with the method provided by the disclosure meets the requirements of that scene. For example, the embodiments of the present disclosure may also be applied to warehouses, docks and similar scenes, for performing three-dimensional detection of goods and the like.
In some embodiments, the original image may be subjected to target detection by using a pre-trained target detection model, for example, the original image is input into the pre-trained target detection model, and the first parameter set of the target object output by the target detection model is obtained. The first set of parameters includes at least one of:
(1) Class of target object (denoted as cls). For example, the target detection model may output confidence levels corresponding to a plurality of categories, where the category with the highest confidence level is the category of the target object predicted by the target detection model.
(2) The coordinate value of the target object on a first coordinate axis in the three-dimensional coordinate system, and the coordinate value of the target object on a second coordinate axis in the three-dimensional coordinate system.
FIG. 3 is a schematic three-dimensional coordinate system according to an embodiment of the present disclosure. As shown in fig. 3, in some examples, the position of the lens of the image capturing device may be used as an origin, the vertical lens direction may be used as a first coordinate axis (denoted as an X axis), the vertical ground plane direction may be used as a second coordinate axis (denoted as a Y axis), and the parallel lens direction may be used as a third coordinate axis (denoted as a Z axis). The target detection model can detect 2 coordinate values of the target object, denoted as (x, y).
(3) The size of the target object. For example, the length (denoted as L), width (denoted as W), and height (denoted as H) of the target object. The dimensions of the target object are noted as (W, H, L).
(4) The angle of the target object. For example, the Yaw angle (Yaw) of the target object is denoted as θ.
FIG. 4 is a schematic diagram of an object detection model 400 according to an embodiment of the present disclosure. As shown in fig. 4, the object detection model 400 includes a backbone network (BackBone) 410 and a plurality of branch networks 420; the branch networks 420 may include a branch network 421 corresponding to the category of the target object, a branch network 422 corresponding to partial coordinates of the target object (i.e., the coordinate values of the target object on the first and second coordinate axes in the three-dimensional coordinate system), a branch network 423 corresponding to the size of the target object, and a branch network 424 corresponding to the angle of the target object. The branch networks 420 predict the first parameter set of the target object using the image features of the original image extracted by the backbone network 410. The original image input to the backbone network 410 may be an image captured by a monocular camera, referred to as a monocular image.
Analysis of the first parameter set shows that it contains most of the information required by three-dimensional detection; the three-dimensional detection data of the target object can then be determined by further combining the depth of the target object. The target detection model provided by the embodiment of the disclosure is only slightly modified from a common two-dimensional detection model: the backbone network for extracting image feature data is retained, and several branches are added after the backbone network to determine the parameters in the first parameter set. Compared with a three-dimensional detection model in the related art, this reduces the complexity of the model, reduces the consumption of time and computing power, and makes the training process of the model simpler and more convenient.
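The backbone-plus-branches structure described above can be sketched roughly as follows. This is a minimal illustration only, assuming a PyTorch-style implementation; the backbone layers, channel sizes, and head layouts are placeholder choices rather than the patent's actual network.

```python
import torch
import torch.nn as nn

class TargetDetectionModel(nn.Module):
    """Rough sketch of the backbone + multi-branch detector (Fig. 4).

    The backbone and head definitions are illustrative assumptions,
    not the network described in the patent.
    """

    def __init__(self, num_classes: int = 10, feat_dim: int = 256):
        super().__init__()
        # Backbone 410: extracts image features from the monocular image.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        # Branch 421: class confidences (cls).
        self.cls_head = nn.Linear(feat_dim, num_classes)
        # Branch 422: coordinates (x, y) on the first and second axes.
        self.xy_head = nn.Linear(feat_dim, 2)
        # Branch 423: size (W, H, L).
        self.size_head = nn.Linear(feat_dim, 3)
        # Branch 424: yaw angle theta.
        self.angle_head = nn.Linear(feat_dim, 1)

    def forward(self, image: torch.Tensor) -> dict:
        feat = self.backbone(image)
        return {
            "cls": self.cls_head(feat).softmax(dim=-1),
            "xy": self.xy_head(feat),
            "size": self.size_head(feat),
            "theta": self.angle_head(feat),
        }
```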
In some examples, embodiments of the present disclosure employ a pre-trained depth detection model to determine the depth of each pixel in an original image; for example, the original image is input into a depth detection model trained in advance, and the depth of each pixel in the original image output by the depth detection model can be obtained. And determining the depth of the target object by using the depth of each pixel in the original image. The depth of the target object can be regarded as a coordinate value (Z) of the target object on a third coordinate axis (Z axis) in the three-dimensional coordinate system, and then the depth is combined with the coordinate value (x, y) in the first parameter set, so that the position (x, y, Z) of the target object in the three-dimensional coordinate system is obtained.
Combining a coordinate value of a target object on a first coordinate axis in a three-dimensional coordinate system, a coordinate value of the target object on a second coordinate axis in the three-dimensional coordinate system and the depth of the target object to obtain the position of the target object in a three-dimensional space;
and combining at least one of the position of the target object in the three-dimensional space, the category of the target object, the size of the target object and the angle of the target object to obtain the three-dimensional detection parameters of the target object.
The three-dimensional detection is divided into two independent and simple detection processes, a part of three-dimensional data (such as a first parameter set) of a target object is obtained by using the target detection process, the other part of three-dimensional data (such as the depth of the target object) of the target object is obtained by using depth detection, and the depth of the target object is combined with the first parameter set to obtain the three-dimensional detection parameters of the target object. By the method, the difficulty of three-dimensional detection can be reduced, the consumption of time and calculation power is reduced, and the training process of the relevant model is simplified.
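The combination step itself is simple bookkeeping. A minimal sketch follows, with assumed dictionary keys rather than the patent's actual data format:

```python
def combine_parameters(first_params: dict, depth: float) -> dict:
    """Merge the detector output with the estimated depth (sketch only).

    first_params is assumed to hold 'cls', 'xy' = (x, y), 'size' = (W, H, L)
    and 'theta'; the key names are illustrative assumptions.
    """
    x, y = first_params["xy"]
    return {
        "cls": first_params["cls"],
        # The depth is treated as the z coordinate on the third axis,
        # giving the position of the target in the 3D coordinate system.
        "position": (x, y, depth),
        "size": first_params["size"],
        "theta": first_params["theta"],
    }
```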
FIG. 5 is a schematic diagram of a depth detection model 500 according to an embodiment of the present disclosure. As shown in fig. 5, the depth detection model 500 includes a backbone network (BackBone) 510 and a depth detection network 520; the backbone network 510 extracts image features of an original image (e.g., a monocular image), and the depth detection network 520 predicts the depth of each pixel in the original image using the image features.
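The depth branch follows the same backbone-plus-head pattern. The sketch below is again only an assumed PyTorch-style layout that produces one depth value per pixel; the actual depth network is not specified by the patent.

```python
import torch
import torch.nn as nn

class DepthDetectionModel(nn.Module):
    """Rough sketch of the backbone + per-pixel depth head (Fig. 5)."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()
        # Backbone 510: image feature extraction (placeholder layers).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Depth detection network 520: one depth value per pixel.
        self.depth_head = nn.Conv2d(feat_dim, 1, kernel_size=1)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # Output shape: (batch, 1, H, W) — a depth map aligned with the input.
        return self.depth_head(self.backbone(image))
```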
Fig. 6 is an overall flowchart of a three-dimensional detection method according to an embodiment of the present disclosure. As shown in fig. 6, the original images are input to the object detection model 400 and the depth detection model 500, respectively; the target detection model 400 outputs a first set of parameters for the target object and the depth detection model 500 outputs the depth of each pixel in the original image. And determining the depth of the target object by using the depth of each pixel in the original image. And combining the depth of the target object with the first parameter set to obtain the three-dimensional detection parameters of the target object. The three-dimensional detection parameters of the target object comprise at least one of the following:
(1) Class of target object (denoted as cls).
(2) The position (x, y, z) of the target object in the three-dimensional coordinate system.
(3) Size of target object (W, H, L).
(4) The angle θ (yaw angle) of the target object.
The above items (1), (3), and (4) are determined by the target detection model 400, and item (2) is obtained by combining the depth of the target object with the coordinates (x, y) determined by the target detection model 400.
According to the embodiment of the disclosure, a YOLO (You Only Look Once) model can be used as the target detection model, and a monocular depth estimation model can be used for depth detection. The present disclosure does not limit the structure of the models.
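Putting the two models together, the overall flow of Fig. 6 could be exercised roughly as follows. The function below is a sketch under assumptions: the detector is assumed to return a dictionary like the sketch above (a real YOLO detector would need an adapter), the depth is aggregated with a plain mean for illustration, and all names are hypothetical.

```python
import torch

def three_dimensional_detection(image: torch.Tensor,
                                detector: torch.nn.Module,
                                depth_model: torch.nn.Module) -> dict:
    """End-to-end sketch of Fig. 6: run both detections, then combine.

    `image` is assumed to be a batched tensor of shape (1, 3, H, W);
    `detector` and `depth_model` stand for the target detection and
    depth detection models (e.g. the sketches above).
    """
    first_params = detector(image)   # first parameter set of the target
    depth_map = depth_model(image)   # per-pixel depth of the original image

    # Aggregate per-pixel depths into one depth for the target object
    # (a plain mean over the whole map here, purely for illustration).
    target_depth = float(depth_map.mean())

    x, y = first_params["xy"].squeeze(0).tolist()
    return {
        "cls": int(first_params["cls"].argmax()),
        "position": (x, y, target_depth),  # (x, y, z)
        "size": tuple(first_params["size"].squeeze(0).tolist()),
        "theta": float(first_params["theta"]),
    }
```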
The following describes how the depth of the target object is determined based on the depth of each pixel in the original image.
The depth of the target object can be determined in at least three ways:
first, the depth of the target object is determined by using the depth of each element in the target detection frame.
For example, determining a target detection frame of the target object in the original image;
determining the depth of each pixel in the target detection frame by using the depth of each pixel in the original image and the target detection frame;
and calculating the depth average value of all pixels in the target detection frame, and taking the average value as the depth of the target object.
The target detection frame is a rectangular frame defining the target object in the original image, and most of the pixels in the target detection frame are pixels of the target object, so that the average value of the depths of all the pixels in the target detection frame can roughly represent the depth of the target object. Of course, since the target detection frame includes pixels of other objects in addition to the target object, the depth of the target object determined in this way is not accurate; however, this method has an advantage of a high operation speed because it is relatively easy to determine the target detection frame.
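A minimal numeric sketch of this first approach, assuming the depth map is a NumPy array and the detection frame is given as pixel bounds (the function and parameter names are illustrative):

```python
import numpy as np

def depth_from_box(depth_map: np.ndarray, box: tuple) -> float:
    """Average depth of all pixels inside the target detection frame.

    box = (x_min, y_min, x_max, y_max) in pixel coordinates (assumed format).
    """
    x_min, y_min, x_max, y_max = box
    region = depth_map[y_min:y_max, x_min:x_max]
    return float(region.mean())
```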
Secondly, the depth of the target object is determined by using the depth of each pixel in the area defined by the boundary of the target object.
For example, determining the boundary of the target object in the original image;
determining the depth of each pixel in the boundary limit range by using the depth of each pixel in the original image and the boundary of the target object;
and calculating the depth average value of all pixels in the boundary limit range, and taking the average value as the depth of the target object.
All pixels of the target object, and no pixels of any other object, are contained in the area defined by the boundary of the target object; therefore, determining the depth of the target object from the depths of the pixels in this area is clearly more accurate. The embodiment of the disclosure may use a mask map to determine the boundary of the target object. Besides the mask map method, there are other ways to determine the contour of the target object, for example an instance segmentation method.
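A sketch of this second approach, assuming a boolean mask that marks the pixels inside the target's boundary (for example produced by a mask map or instance segmentation; the input format is an assumption):

```python
import numpy as np

def depth_from_mask(depth_map: np.ndarray, mask: np.ndarray) -> float:
    """Average depth over the pixels inside the target boundary.

    mask is a boolean array of the same shape as depth_map, True where
    the pixel belongs to the target object (an assumed input format).
    """
    return float(depth_map[mask].mean())
```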
Thirdly, a first area in the target detection frame is determined, and the depth of the target object is determined by using the depth of each pixel in the first area.
Because the depths of many pixels are used to determine the overall depth of the target object, the accuracy requirement on the depth of any single pixel is not high, and the depth of the target object can be determined at lower cost and power consumption. For example, when the per-pixel depth accuracy is not high, some pixels in the depth detection result are deeper than the true depth and some are shallower, and the probability and magnitude of these deviations follow a random distribution. If the average over many pixels is then used as the depth of the target object, the positive and negative deviations largely cancel during averaging, so the accuracy of the resulting depth average (i.e. the depth of the target object) can still be ensured. Therefore, in the embodiments of the present disclosure, the depth of the target object may be determined using the depth of each pixel of the first region within the target detection frame. It should be noted that the embodiments of the present disclosure may also determine the depth of the target object from the depths of the pixels in the first region in other ways, for example by taking the median of the pixel depths in the first region as the depth of the target object.
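The averaging argument can be checked with a tiny simulation; the numbers below are arbitrary and only illustrate that zero-mean per-pixel errors largely cancel in the mean (the median behaves similarly):

```python
import numpy as np

rng = np.random.default_rng(0)
true_depth = 12.0  # metres, arbitrary example value
# Per-pixel estimates scattered above and below the true depth.
pixel_depths = true_depth + rng.normal(0.0, 1.5, size=5000)

print(round(float(pixel_depths.mean()), 2))       # close to 12.0
print(round(float(np.median(pixel_depths)), 2))   # also close to 12.0
```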
In view of the above analysis, the first region for determining the depth of the target object proposed by the embodiment of the present disclosure has the following features:
1. The first region is located within the target detection frame, because all pixels of the target object are within the target detection frame.
2. The center point of the first region coincides with the center point of the target detection frame. Since the target object is located in the middle of the target detection frame, making the two center points coincide ensures that the target object is located in the middle of the first region, so that most of the pixels in the first region are pixels of the target object.
3. The ratio of the area of the first region to the area of the target detection frame is greater than or equal to a preset threshold value. This is to ensure that the first region can contain a large proportion of the pixels of the target object. The preset threshold may be set according to actual conditions, for example, to 50%.
By adopting the characteristics, the depth of the target object can be accurately determined by adopting the depth of each pixel in the first area. Also, since the first region is in a fixed shape and at a fixed position in the object detection frame, the pixels contained in the first region can be easily determined. Therefore, the depth of the target object can be accurately estimated, consumption of time and calculation cost can be reduced, and speed is increased.
In an embodiment of the present disclosure, a method for determining a depth of the target object by using a depth of each pixel in an original image may include:
determining a target detection frame of the target object in the original image;
determining a first region, wherein the central point of the first region coincides with the central point of the target detection frame, and the ratio of the area of the first region to the area of the target detection frame of the target object is greater than or equal to a preset threshold value;
determining the depth of each pixel in the first area by using the depth of each pixel in the original image, the target detection frame and the first area;
the depth of the target object is determined using the depths of all pixels in the first region.
For example, the depth average of all pixels in the first region may be calculated, and the average may be taken as the depth of the target object. Or taking the median of all pixel depths in the first area as the depth of the target object; and so on.
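A sketch of this first-region computation, using either the mean or the median over the selected pixels. The region mask is assumed to be a boolean array sized to the detection box (a helper that builds such masks for the shapes of Figs. 7A–7D is sketched after that discussion below); all names are illustrative.

```python
import numpy as np

def depth_from_first_region(depth_map: np.ndarray,
                            box: tuple,
                            region_mask: np.ndarray,
                            use_median: bool = False) -> float:
    """Depth of the target from the pixels of a first region inside the box.

    box = (x_min, y_min, x_max, y_max); region_mask is a boolean array of
    the box's height x width selecting the first region (assumed inputs).
    """
    x_min, y_min, x_max, y_max = box
    crop = depth_map[y_min:y_max, x_min:x_max]
    values = crop[region_mask]
    return float(np.median(values) if use_median else values.mean())
```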
Having established the above characteristics of the first region, the following analyzes which regions are suitable as the first region.
Fig. 7A-7D are schematic diagrams of a first region in an embodiment of the present disclosure. It is to be noted that the images shown in fig. 7A to 7D are images within the object detection frame in the original image, and are not the original image.
As shown in fig. 7A, the shape of the target detection frame is rectangular;
the shape of the first region may be a diamond or a square, and the 4 vertices of the first region are respectively located at the midpoints of the 4 sides of the target detection box. The ratio of the area of the first region to the area of the target detection frame is 50%.
Taking fig. 7A as an example, in the case where the target detection frame is rectangular, the shape of the first region is a diamond shape. In the case where the target detection frame is square, the shape of the first region is square.
As can be seen from fig. 7A, most of the pixels inside the first region belong to the target object (e.g., the vehicle in fig. 7A), and most of the pixels in the target object are within the first region. Through experimental statistics, in the first region shown in fig. 7A, the pixels of the target object account for 83% of all the pixels, and therefore, the pixels in the first region can largely reflect the depth situation of the target object.
As shown in fig. 7B, the shape of the target detection frame is rectangular;
the first region may be circular or elliptical in shape.
Taking fig. 7B as an example, when the target detection frame is rectangular, the shape of the first region is an ellipse, and the 4 vertices of the first region are located at the midpoints of the 4 sides of the target detection frame. When the target detection frame is square, the first region is a circle, and the 4 sides of the target detection frame are each tangent to the first region. In the example of fig. 7B, the ratio of the area of the first region to the area of the target detection frame is approximately 80%.
As can be seen from fig. 7B, most of the pixels in the first region belong to the target object (e.g., the vehicle of fig. 7B), and most of the pixels in the target object are in the first region. Through experimental statistics, in the first region shown in fig. 7B, the pixels of the target object account for 79% of all the pixels, and therefore, the pixels in the first region can largely reflect the depth situation of the target object.
As shown in fig. 7C, the shape of the target detection frame is rectangular;
the first region may be polygonal in shape, and each vertex of the first region is located on an edge of the target detection box.
Taking fig. 7C as an example, when the target detection frame is rectangular, the shape of the first region is a regular hexagon. In the example of fig. 7C, the ratio of the area of the first region to the area of the target detection frame is approximately 75%.
As can be seen from fig. 7C, most of the pixels in the first region belong to the target object (e.g., the vehicle of fig. 7C), and most of the pixels in the target object are in the first region. Through experimental statistics, in the first region shown in fig. 7C, the pixels of the target object account for 76% of all the pixels, and therefore, the pixels in the first region can largely reflect the depth situation of the target object.
As shown in fig. 7D, the shape of the target detection frame is rectangular;
the first region may be in an irregular figure, and the first region includes midpoints of respective sides of the object detection box.
Taking fig. 7D as an example, in the case where the object detection frame is rectangular, the shape of the first region is cross-shaped. As shown in fig. 7D, 4 sides of the object detection frame coincide with 4 sides of the 12 sides of the cross respectively, and the widths of the 4 sides are all one third of the length of the corresponding sides of the object detection frame. In the example of fig. 7D, the ratio of the first region area to the target detection frame area is 5/9.
As can be seen from fig. 7D, most of the pixels in the first region belong to the target object (e.g., the vehicle of fig. 7D), and most of the pixels in the target object are in the first region. Through experimental statistics, in the first region shown in fig. 7D, the pixels of the target object account for 77% of all the pixels, and therefore, the pixels in the first region can largely reflect the depth of the target object.
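The example shapes in Figs. 7A, 7B and 7D can be generated as boolean masks centred on the detection frame; the sketch below uses simple geometric tests in normalized coordinates and is only one possible way to realize these regions (the regular hexagon of Fig. 7C is omitted for brevity).

```python
import numpy as np

def first_region_mask(h: int, w: int, shape: str = "diamond") -> np.ndarray:
    """Boolean mask (h x w) of a first region centred in the detection box.

    Supported example shapes: 'diamond' (Fig. 7A), 'ellipse' (Fig. 7B),
    'cross' (Fig. 7D); the shape names are illustrative.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    # Normalized coordinates in [-1, 1] relative to the box centre.
    u = (xs - (w - 1) / 2) / (w / 2)
    v = (ys - (h - 1) / 2) / (h / 2)
    if shape == "diamond":   # vertices at the side midpoints, ~50% of the box
        return np.abs(u) + np.abs(v) <= 1.0
    if shape == "ellipse":   # inscribed ellipse, ~pi/4 (about 80%) of the box
        return u ** 2 + v ** 2 <= 1.0
    if shape == "cross":     # arms one third of each side, 5/9 of the box
        return (np.abs(u) <= 1 / 3) | (np.abs(v) <= 1 / 3)
    raise ValueError(f"unknown shape: {shape}")
```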
The embodiment of the disclosure is particularly suitable for scenes captured from a high vantage point, in which the image acquisition device is mounted higher than the target object (e.g., a vehicle). Taking a parking scene with high-mounted cameras as an example, as shown in fig. 7A to 7D, most vehicles in the image are angled somewhat to the left or right, and each region includes parts such as the roof, the front, and the body, so the depth of the whole vehicle can be determined more accurately from the pixel depths at these positions.
In the embodiments of the present disclosure, the first region determined based on the center point of the target detection frame can cover all or most of the target object. Determining the depth of the target object from the depths of the pixels in the first region therefore reflects the true depth of the target object to the greatest extent with less computation and higher speed. The process of determining the depth of the target object from the pixel depths in the first region provided by the embodiment of the disclosure also eliminates the influence of a large amount of background.
The embodiment of the present disclosure further provides a three-dimensional detection apparatus, and fig. 8 is a schematic structural diagram of a three-dimensional detection apparatus 800 according to an embodiment of the present disclosure, which includes:
a target detection module 810, configured to perform target detection on an original image to obtain a first parameter set of a target object, where the first parameter set includes at least one of a category of the target object, a coordinate value of the target object on a first coordinate axis in a three-dimensional coordinate system, a coordinate value of the target object on a second coordinate axis in the three-dimensional coordinate system, a size of the target object, and an angle of the target object;
a depth detection module 820, configured to perform depth detection on the original image to obtain the depth of each pixel in the original image; determining the depth of the target object by using the depth of each pixel in the original image; and
and the combining module 830 is configured to combine the first parameter set and the depth of the target object to obtain a three-dimensional detection parameter of the target object.
Fig. 9 is a schematic structural diagram of a three-dimensional inspection apparatus 900 according to an embodiment of the present disclosure, and as shown in fig. 9, the three-dimensional inspection apparatus 900 includes an object inspection module 910, a depth inspection module 920, and a combination module 930. In some embodiments, the combining module 930 comprises:
a first combining sub-module 931, configured to combine the coordinate value of the target object on the first coordinate axis in the three-dimensional coordinate system, the coordinate value of the target object on the second coordinate axis in the three-dimensional coordinate system, and the depth of the target object, so as to obtain a position of the target object in the three-dimensional space;
a second combining sub-module 932, configured to combine at least one of a position of the target object in a three-dimensional space, a category of the target object, a size of the target object, and an angle of the target object to obtain a three-dimensional detection parameter of the target object.
In some embodiments, the depth of the target object corresponds to a coordinate value of the target object on a third coordinate axis in a three-dimensional coordinate system.
In some embodiments, the depth detection module 920 is configured to:
determining a target detection frame of the target object in the original image;
determining the depth of each pixel in the target detection frame by using the depth of each pixel in the original image and the target detection frame;
and calculating the depth average value of all pixels in the target detection frame, and taking the average value as the depth of the target object.
In some embodiments, the depth detection module 920 is configured to:
determining the boundary of the target object in the original image;
determining the depth of each pixel in the boundary limit range by using the depth of each pixel in the original image and the boundary of the target object;
and calculating the depth average value of all pixels in the boundary limit range, and taking the average value as the depth of the target object.
In some embodiments, the depth detection module 920 includes:
a first region determining sub-module 921, configured to determine a target detection frame of the target object in the original image, and determine a first region, where a central point of the first region coincides with a central point of the target detection frame, and a ratio of an area of the first region to an area of the target detection frame of the target object is greater than or equal to a preset threshold;
the first region depth determining sub-module 922 determines the depth of each pixel in the first region by using the depth of each pixel in the original image, the target detection frame and the first region;
the target object depth determining sub-module 923 determines the depth of the target object using the depths of all the pixels in the first region.
In some embodiments, the target object depth determination sub-module 923 is configured to calculate an average value of the depths of all the pixels in the first region, and use the average value as the depth of the target object.
In some embodiments, the target detection box is rectangular in shape;
the first area is in a diamond shape or a square shape, and the 4 vertexes of the first area are respectively located at the middle points of the 4 sides of the target detection frame.
In some embodiments, the target detection frame is rectangular in shape;
the first region is elliptical in shape, and 4 vertexes of the first region are respectively located at the midpoints of 4 edges of the target detection frame.
In some embodiments, the target detection box is square in shape;
the first area is circular in shape, and 4 sides of the target detection frame are all tangent lines of the first area.
In some embodiments, the performing target detection on the original image to obtain a first parameter set of the target object includes:
and inputting the original image into a pre-trained target detection model to obtain a first parameter set of the target object output by the target detection model.
In some embodiments, the performing depth detection on the original image to obtain the depth of each pixel in the original image includes:
and inputting the original image into a depth detection model trained in advance to obtain the depth of each pixel in the original image output by the depth detection model.
For a description of specific functions and examples of each module and sub-module of the apparatus in the embodiment of the present disclosure, reference may be made to the description of corresponding steps in the foregoing method embodiments, and details are not repeated here.
In the technical scheme of the disclosure, the acquisition, storage, application and other processing of the personal information of the users involved comply with the provisions of relevant laws and regulations, and do not violate public order or good customs.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 10 illustrates a schematic block diagram of an example electronic device 1000 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the device 1000 includes a computing unit 1001 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1002 or a computer program loaded from a storage unit 1008 into a Random Access Memory (RAM) 1003. Various programs and data necessary for the operation of the device 1000 can also be stored in the RAM 1003. The computing unit 1001, the ROM 1002, and the RAM 1003 are connected to one another by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
A number of components in device 1000 are connected to I/O interface 1005, including: an input unit 1006 such as a keyboard, a mouse, and the like; an output unit 1007 such as various types of displays, speakers, and the like; a storage unit 1008 such as a magnetic disk, an optical disk, or the like; and a communication unit 1009 such as a network card, a modem, a wireless communication transceiver, or the like. The communication unit 1009 allows the device 1000 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1001 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 1001 executes the respective methods and processes described above, such as the three-dimensional detection method. For example, in some embodiments, the three-dimensional detection method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 1008. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 1000 via the ROM 1002 and/or the communication unit 1009. When the computer program is loaded into the RAM 1003 and executed by the computing unit 1001, one or more steps of the three-dimensional detection method described above may be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured to perform the three-dimensional detection method in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special-purpose or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/acts specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that the various forms of flow shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order; this is not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (22)

1. A three-dimensional detection method comprising:
performing target detection on an original image to obtain a first parameter set of a target object, wherein the first parameter set comprises at least one of the category of the target object, a coordinate value of the target object on a first coordinate axis in a three-dimensional coordinate system, a coordinate value of the target object on a second coordinate axis in the three-dimensional coordinate system, the size of the target object and the angle of the target object;
carrying out depth detection on the original image to obtain the depth of each pixel in the original image; determining the depth of the target object by using the depth of each pixel in the original image;
and combining the first parameter set and the depth of the target object to obtain the three-dimensional detection parameters of the target object.
2. The method of claim 1, wherein the combining the first set of parameters and the depth of the target object to obtain three-dimensional detection parameters of the target object comprises:
combining the coordinate value of the target object on a first coordinate axis in a three-dimensional coordinate system, the coordinate value of the target object on a second coordinate axis in the three-dimensional coordinate system and the depth of the target object to obtain the position of the target object in a three-dimensional space;
and combining at least one of the position of the target object in the three-dimensional space, the category of the target object, the size of the target object and the angle of the target object to obtain three-dimensional detection parameters of the target object.
3. The method according to claim 1 or 2, wherein the depth of the target object corresponds to a coordinate value of the target object on a third coordinate axis in a three-dimensional coordinate system.
4. The method according to any one of claims 1-3, wherein said determining the depth of the target object using the depth of each pixel in the original image comprises:
determining a target detection frame of the target object in the original image;
determining the depth of each pixel in the target detection frame by using the depth of each pixel in the original image and the target detection frame;
and calculating the average depth of all pixels in the target detection frame, and taking the average as the depth of the target object.
5. The method according to any one of claims 1-3, wherein said determining the depth of the target object using the depth of each pixel in the original image comprises:
determining a boundary of the target object in the original image;
determining the depth of each pixel within the range delimited by the boundary by using the depth of each pixel in the original image and the boundary of the target object;
and calculating the average depth of all pixels within the range delimited by the boundary, and taking the average as the depth of the target object.
6. The method according to any one of claims 1-3, wherein said determining the depth of the target object using the depth of each pixel in the original image comprises:
determining a target detection frame of the target object in the original image, and determining a first region, wherein a central point of the first region coincides with a central point of the target detection frame, and a ratio of an area of the first region to an area of the target detection frame of the target object is greater than or equal to a preset threshold;
determining the depth of each pixel in the first region by using the depth of each pixel in the original image, the target detection frame and the first region;
and determining the depth of the target object by using the depths of all pixels in the first region.
7. The method of claim 6, wherein said determining the depth of the target object using the depths of all pixels in the first region comprises:
and calculating the average of the depths of all pixels in the first region, and taking the average as the depth of the target object.
8. The method of claim 6 or 7, wherein the target detection frame is rectangular in shape;
the first region is in a diamond shape or a square shape, and the 4 vertices of the first region are respectively located at the midpoints of the 4 sides of the target detection frame.
9. The method of claim 6 or 7, wherein the target detection frame is rectangular in shape;
the first region is in an elliptical shape, and the 4 vertices of the first region are respectively located at the midpoints of the 4 sides of the target detection frame.
10. The method of claim 6 or 7, wherein the target detection frame is square in shape;
the first region is circular in shape, and the 4 sides of the target detection frame are all tangent to the first region.
11. The method according to any one of claims 1-10, wherein the performing target detection on the original image to obtain a first parameter set of the target object comprises:
and inputting the original image into a pre-trained target detection model to obtain a first parameter set of the target object output by the target detection model.
12. The method according to any one of claims 1-11, wherein the performing depth detection on the original image to obtain the depth of each pixel in the original image comprises:
and inputting the original image into a depth detection model trained in advance to obtain the depth of each pixel in the original image output by the depth detection model.
13. A three-dimensional detection apparatus comprising:
the target detection module is used for carrying out target detection on the original image to obtain a first parameter set of a target object, wherein the first parameter set comprises at least one of the category of the target object, the coordinate value of the target object on a first coordinate axis in a three-dimensional coordinate system, the coordinate value of the target object on a second coordinate axis in the three-dimensional coordinate system, the size of the target object and the angle of the target object;
the depth detection module is used for carrying out depth detection on the original image to obtain the depth of each pixel in the original image; determining the depth of the target object by using the depth of each pixel in the original image;
and the combination module is used for combining the first parameter set and the depth of the target object to obtain the three-dimensional detection parameters of the target object.
14. The apparatus of claim 13, wherein the combining module comprises:
the first combination sub-module is used for combining the coordinate value of the target object on a first coordinate axis in a three-dimensional coordinate system, the coordinate value of the target object on a second coordinate axis in the three-dimensional coordinate system and the depth of the target object to obtain the position of the target object in a three-dimensional space;
and the second combination sub-module is used for combining at least one of the position of the target object in the three-dimensional space, the category of the target object, the size of the target object and the angle of the target object to obtain the three-dimensional detection parameters of the target object.
15. The apparatus according to claim 13 or 14, wherein the depth of the target object corresponds to a coordinate value of the target object on a third coordinate axis in a three-dimensional coordinate system.
16. The apparatus of any of claims 13-15, wherein the depth detection module is to:
determining a target detection frame of the target object in the original image;
determining the depth of each pixel in the target detection frame by using the depth of each pixel in the original image and the target detection frame;
and calculating the average depth of all pixels in the target detection frame, and taking the average as the depth of the target object.
17. The apparatus of any one of claims 13-15, wherein the depth detection module is to:
determining the boundary of the target object in the original image;
determining the depth of each pixel within the range delimited by the boundary by using the depth of each pixel in the original image and the boundary of the target object;
and calculating the average depth of all pixels within the range delimited by the boundary, and taking the average as the depth of the target object.
18. The apparatus of any of claims 13-15, wherein the depth detection module comprises:
a first region determining submodule, configured to determine a target detection frame of the target object in the original image, and determine a first region, where a central point of the first region coincides with a central point of the target detection frame, and a ratio of an area of the first region to an area of the target detection frame of the target object is greater than or equal to a preset threshold;
a first region depth determination submodule for determining the depth of each pixel in the first region by using the depth of each pixel in the original image, the target detection frame and the first region;
and a target object depth determination sub-module, configured to determine the depth of the target object by using the depths of all pixels in the first region.
19. The apparatus of claim 18, wherein the target object depth determination sub-module is configured to calculate an average of the depths of all pixels in the first region, and use the average as the depth of the target object.
20. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-12.
21. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-12.
22. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-12.
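The following is a minimal, illustrative sketch (not part of the claims or of the original disclosure) of the pipeline recited in claims 1, 2, 4, 11 and 12: a pre-trained target detection model supplies the first parameter set, a pre-trained depth detection model supplies a per-pixel depth map, the depth of the target object is taken as the average depth inside its target detection frame, and the two results are combined into three-dimensional detection parameters. The model callables, data-structure fields and units below are assumptions made for illustration only.

```python
# Hedged sketch of claims 1, 2, 4, 11, 12. The Detection2D fields, the callables
# `detect`/`estimate_depth`, and the output dictionary keys are illustrative
# assumptions, not names taken from the patent.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

import numpy as np


@dataclass
class Detection2D:
    """First parameter set produced by the target detection step (claim 1)."""
    category: str
    x: float                                    # coordinate value on the first coordinate axis
    y: float                                    # coordinate value on the second coordinate axis
    size: Tuple[float, float, float]            # size of the target object (illustrative: w, h, l)
    angle: float                                # angle of the target object
    box: Tuple[float, float, float, float]      # target detection frame (x_min, y_min, x_max, y_max)


def box_average_depth(depth_map: np.ndarray, box: Tuple[float, float, float, float]) -> float:
    """Average the per-pixel depth over the target detection frame (claim 4)."""
    x_min, y_min, x_max, y_max = (int(round(v)) for v in box)
    patch = depth_map[y_min:y_max, x_min:x_max]
    return float(patch.mean())


def three_dimensional_detection(
    image: np.ndarray,
    detect: Callable[[np.ndarray], List[Detection2D]],    # pre-trained target detection model (claim 11)
    estimate_depth: Callable[[np.ndarray], np.ndarray],   # pre-trained depth detection model (claim 12)
) -> List[Dict]:
    """Combine 2D detection and per-pixel depth into 3D detection parameters (claims 1-2)."""
    detections = detect(image)                  # first parameter set for each target object
    depth_map = estimate_depth(image)           # depth of each pixel in the original image
    results = []
    for det in detections:
        z = box_average_depth(depth_map, det.box)   # depth of the target object
        results.append({
            "category": det.category,
            "position": (det.x, det.y, z),      # (first axis, second axis, depth / third axis, claim 3)
            "size": det.size,
            "angle": det.angle,
        })
    return results
```

Any user-supplied detection and depth models matching the two callables above could be plugged in; swapping box_average_depth for a boundary-based or first-region-based average gives the variants of claims 5-10.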
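Claims 6-10 replace the full detection frame with a "first region" centred on the frame: a diamond or square whose 4 vertices lie at the midpoints of the frame's sides, an ellipse whose axis endpoints lie at those midpoints, or a circle tangent to all 4 sides of a square frame. The diamond covers half of the frame's area and the inscribed ellipse or circle covers about π/4 ≈ 78.5% of it, which is how the area-ratio threshold of claim 6 can be met. The sketch below shows one possible way to build such a region mask and average depth over it; the shape keywords and mask construction are illustrative assumptions, not the patent's prescribed implementation.

```python
# Hedged sketch of the "first region" averaging in claims 6-10; function names
# and the `shape` keyword values are assumptions made for illustration.
import numpy as np


def first_region_mask(depth_shape, box, shape="diamond"):
    """Boolean mask of the first region centred on the target detection frame.

    depth_shape: (height, width) of the depth map; box: (x_min, y_min, x_max, y_max).
    """
    x_min, y_min, x_max, y_max = box
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0   # centre coincides with the frame centre (claim 6)
    hw, hh = (x_max - x_min) / 2.0, (y_max - y_min) / 2.0   # half-width / half-height of the frame
    ys, xs = np.mgrid[0:depth_shape[0], 0:depth_shape[1]]
    u, v = (xs - cx) / hw, (ys - cy) / hh                   # frame-normalised coordinates
    if shape == "diamond":
        # claim 8: diamond/square whose 4 vertices are the midpoints of the frame's sides
        return np.abs(u) + np.abs(v) <= 1.0
    if shape in ("ellipse", "circle"):
        # claims 9-10: inscribed ellipse or circle tangent to the frame's sides
        return u ** 2 + v ** 2 <= 1.0
    raise ValueError(f"unknown first-region shape: {shape}")


def first_region_depth(depth_map, box, shape="diamond"):
    """Average the per-pixel depth over the first region (claim 7)."""
    mask = first_region_mask(depth_map.shape, box, shape)
    return float(depth_map[mask].mean())
```

For example, first_region_depth(depth_map, det.box, "ellipse") could be used in place of box_average_depth in the earlier sketch to realise the elliptical first-region variant.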
CN202211023285.1A 2022-08-25 2022-08-25 Three-dimensional detection method and device, electronic equipment and storage medium Pending CN115346194A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211023285.1A CN115346194A (en) 2022-08-25 2022-08-25 Three-dimensional detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211023285.1A CN115346194A (en) 2022-08-25 2022-08-25 Three-dimensional detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115346194A true CN115346194A (en) 2022-11-15

Family

ID=83954565

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211023285.1A Pending CN115346194A (en) 2022-08-25 2022-08-25 Three-dimensional detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115346194A (en)

Similar Documents

Publication Publication Date Title
CN113902897B (en) Training of target detection model, target detection method, device, equipment and medium
CN113378760A (en) Training target detection model and method and device for detecting target
CN112560684B (en) Lane line detection method, lane line detection device, electronic equipment, storage medium and vehicle
CN111213153A (en) Target object motion state detection method, device and storage medium
CN112863187B (en) Detection method of perception model, electronic equipment, road side equipment and cloud control platform
CN112683228A (en) Monocular camera ranging method and device
CN114494075A (en) Obstacle identification method based on three-dimensional point cloud, electronic device and storage medium
CN115147809B (en) Obstacle detection method, device, equipment and storage medium
CN112509126A (en) Method, device, equipment and storage medium for detecting three-dimensional object
CN116993817B (en) Pose determining method and device of target vehicle, computer equipment and storage medium
CN117612132A (en) Method and device for complementing bird's eye view BEV top view and electronic equipment
CN112733678A (en) Ranging method, ranging device, computer equipment and storage medium
Oniga et al. A fast ransac based approach for computing the orientation of obstacles in traffic scenes
CN116129422A (en) Monocular 3D target detection method, monocular 3D target detection device, electronic equipment and storage medium
CN115346194A (en) Three-dimensional detection method and device, electronic equipment and storage medium
CN112507964B (en) Detection method and device for lane-level event, road side equipment and cloud control platform
CN115345919B (en) Depth determination method and device, electronic equipment and storage medium
CN117408935A (en) Obstacle detection method, electronic device, and storage medium
CN114005098A (en) Method and device for detecting lane line information of high-precision map and electronic equipment
CN113554882A (en) Method, apparatus, device and storage medium for outputting information
CN115431968B (en) Vehicle controller, vehicle and vehicle control method
CN117392000B (en) Noise removing method and device, electronic equipment and storage medium
CN114612544B (en) Image processing method, device, equipment and storage medium
CN117372988B (en) Road boundary detection method, device, electronic equipment and storage medium
CN117647852B (en) Weather state detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination