CN111753739B - Object detection method, device, equipment and storage medium - Google Patents

Object detection method, device, equipment and storage medium

Info

Publication number
CN111753739B
CN111753739B (application CN202010593140.XA)
Authority
CN
China
Prior art keywords
image
position information
initial
dimensional position
map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010593140.XA
Other languages
Chinese (zh)
Other versions
CN111753739A (en)
Inventor
周定富
宋希彬
卢飞翔
方进
张良俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010593140.XA priority Critical patent/CN111753739B/en
Publication of CN111753739A publication Critical patent/CN111753739A/en
Application granted granted Critical
Publication of CN111753739B publication Critical patent/CN111753739B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/005 General purpose rendering architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/04 Texture mapping
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Graphics (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses an object detection method, device, equipment and storage medium, relating to the technical fields of artificial intelligence, object detection, deep learning, neural networks, autonomous driving, unmanned driving, assistive robotics, virtual reality, augmented reality and the like. The object detection method comprises the following steps: detecting an image to obtain initial three-dimensional position information, a first map and a model category of an object in the image; projecting by using the initial three-dimensional position information of the object and the model category to obtain a second map; and correcting the initial three-dimensional position information of the object by using the first map and the second map. The embodiments of the application can correct the initial three-dimensional position information by using the model category of the object, which helps to obtain a more accurate three-dimensional position of the object.

Description

Object detection method, device, equipment and storage medium
Technical Field
The application relates to the field of image processing, and in particular to the technical fields of artificial intelligence, object detection, deep learning, neural networks, autonomous driving, unmanned driving, assistive robotics, virtual reality, augmented reality and the like.
Background
In many cases, existing detection techniques describe a three-dimensional object as a generalized three-dimensional bounding box. The three-dimensional object detection problem then amounts to regressing the numerical values of the three-dimensional bounding box from image information. Based on this idea, a number of related three-dimensional detection methods have been proposed. Such methods treat vehicle detection as a regression problem, and the computation process is complex.
Disclosure of Invention
The application provides an object detection method, an object detection apparatus, an electronic device and a storage medium.
According to a first aspect of the present application, there is provided an object detection method comprising:
detecting an image to obtain initial three-dimensional position information, a first map and a model category of an object in the image;
projecting by using the initial three-dimensional position information of the object and the model category to obtain a second map;
and correcting the initial three-dimensional position information of the object by using the first map and the second map.
According to a second aspect of the present application, there is provided an object detection apparatus comprising:
the detection module is used for detecting an image to obtain initial three-dimensional position information, a first map and a model category of an object in the image;
the projection module is used for projecting by using the initial three-dimensional position information of the object and the model category to obtain a second map;
and the correction module is used for correcting the initial three-dimensional position information of the object by using the first map and the second map.
According to a third aspect of the present application, there is provided an electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the object detection method of any one of the embodiments of the above aspects.
According to a fourth aspect of the present application, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the object detection method in any one of the embodiments of the above aspects.
According to a fifth aspect of the present application, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
According to the technical solution of the application, the initial three-dimensional position information can be corrected by using the model category of the object, which helps to obtain a more accurate three-dimensional position of the object.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the application or to delineate the scope of the application. Other features of the present application will become apparent from the description that follows.
Drawings
The drawings are included to provide a better understanding of the present application and are not to be construed as limiting the application. Wherein:
FIG. 1 is a flow chart of a method of object detection according to an embodiment of the application;
FIG. 2 is a schematic diagram of a UV map;
FIG. 3 is a flow chart of an object detection method according to another embodiment of the present application;
FIG. 4 is a flow chart of an object detection method according to another embodiment of the present application;
FIG. 5 is a flow chart of an object detection method according to another embodiment of the present application;
FIG. 6 is a schematic diagram of a relationship of a camera coordinate system to a road surface coordinate system;
FIG. 7 is a flow chart of an object detection method according to another embodiment of the present application;
FIG. 8 is a flow chart of one example of three-dimensional object detection based on a single frame image;
FIG. 9 is a schematic illustration of estimating the point of contact O' of the vehicle center point with the ground;
FIGS. 10a, 10b and 10c are schematic diagrams of predicted UV maps;
FIG. 11 is a block diagram of an object detection apparatus according to an embodiment of the present application;
FIG. 12 is a block diagram of an object detection apparatus according to another embodiment of the present application;
fig. 13 is a block diagram of an electronic device of an object detection method according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Fig. 1 is a flow chart of an object detection method according to an embodiment of the application, which may include:
s101, detecting the image to obtain initial three-dimensional position information, a first mapping and model types of objects in the image.
S102, projecting by using the initial three-dimensional position information of the object and the model type to obtain a second map.
S103, correcting the initial three-dimensional position information of the object by using the first mapping and the second mapping.
The image in the embodiments of the application may be a frame of a video, a captured photograph, and the like, for example a frame of video taken by a vehicle-mounted camera (which may also be referred to as a video camera) or a photograph taken by a mobile phone, and the image may contain various types of obstacles. There are various methods of object detection. For example, an object detection model can be obtained by training with an artificial-intelligence algorithm such as a neural network. Detecting the image with the object detection model yields a two-dimensional detection frame of each object in the image; the position information of the two-dimensional detection frame may include the coordinates of the two-dimensional detection frame in which the object is located, for example the coordinates of its upper-left and lower-right corners. In addition, the object detection model can be used to perform three-dimensional position prediction on the image and estimate the initial three-dimensional position information of each object in the image. The initial three-dimensional position information of an object may include at least one of the following: the size of the object, the three-dimensional coordinates of its center point, its orientation angle, and the like. The models used for two-dimensional detection and three-dimensional detection may be the same model or different models.
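For illustration only, the per-object detection output described above could be organized as in the following sketch; the field names and the class numbering are assumptions of this sketch, not terminology fixed by the application.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class DetectionResult:
    """Illustrative container for what detection yields for one object."""
    box2d: Tuple[float, float, float, float]   # 2D detection frame: upper-left and lower-right corners
    size3d: Tuple[float, float, float]         # object size (length, width, height)
    center3d: Tuple[float, float, float]       # initial 3D coordinates of the center point
    yaw: float                                 # orientation angle
    model_class: int                           # e.g. 0 = car, 1 = SUV, 2 = minibus (assumed labels)
```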
In S101, the image may be UV-segmented using the object detection model to obtain a map of the object. The map of the object may be a UV map, i.e. a planar representation of the surface of a three-dimensional model. Fig. 2 shows an example of UV mapping: UV segmentation of the three-dimensional vehicle model in the left figure yields the UV segmentation result on the two-dimensional plane in the right figure, which may be referred to as a UV map. The UV map may include the (u, v) value corresponding to each pixel position (x, y) on the two-dimensional plane, and thus establishes a correspondence between two-dimensional image pixels and the three-dimensional object model.
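A minimal sketch of how a first map of this kind could be stored, assuming an H x W image and using -1 to mark background pixels (both assumptions of this sketch, not requirements of the application):

```python
import numpy as np

H, W = 360, 640
uv_map = -np.ones((H, W, 2), dtype=np.float32)   # channel 0: u, channel 1: v; -1 marks background
uv_map[200, 300] = (0.42, 0.73)                  # pixel (x=300, y=200) maps to surface point (u, v)

def lookup_uv(uv_map, x, y):
    """Return the (u, v) surface coordinate for image pixel (x, y), or None for background."""
    u, v = uv_map[y, x]
    return None if u < 0 else (float(u), float(v))
```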
In S102, the projection may be performed using the initial three-dimensional position information and the model category of the object detected from the image, together with the intrinsic parameters of the camera that captured the image, to obtain a second map. The second map may also be a UV map. The initial three-dimensional position information of the object may include the three-dimensional coordinates (X_c, Y_c, Z_c) of its center point, the orientation of the object, and the like.
Furthermore, objects in many real-life scenes carry strong prior information, including but not limited to: the shape of the object, the size of the object, the properties of the object, the locations where the object may appear, and so on. For example, for the vehicle detection problem in an autonomous driving scenario, although vehicle models differ greatly, their sizes do not vary much. Using such prior information in object detection can therefore simplify the detection problem.
In the embodiments of the application, three-dimensional models of the various categories an object may belong to can be established in advance. For example, in intelligent transportation technology, such as autonomous driving, unmanned driving and assisted driving scenarios, three-dimensional models of various types of vehicles may be built in advance, such as a car model, a minibus model, an SUV (Sport Utility Vehicle, also referred to as an off-road vehicle) model, a bus model, and the like. In an assistive robot scenario, the three-dimensional position information of surrounding scene objects can be acquired from image information, helping the robot avoid obstacles and grasp objects. In a virtual reality or augmented reality scenario, the three-dimensional information of an object is recovered from an image so that a virtual object can be placed in the real scene.
During data annotation, each object in a sample image may be annotated with a category: for example, 0 for cars, 1 for SUVs, 2 for minibuses, and so on. The labeled data sets are then used to train an object detection model based on a neural network, so that the category of a vehicle can be identified with the object detection model. If the model category of an object A is identified as a car, the car model corresponding to that category can be looked up.
Furthermore, a pre-established three-dimensional model, for example a car model, can be selected according to the model category. Taking the detected three-dimensional coordinates (X_c, Y_c, Z_c) and orientation of the object as the position and orientation of the center point of this three-dimensional model, the model can be projected onto the image by a rendering technique to obtain a projected second map. The second map may be a UV map and may have the same size as the original image and the first map; each two-dimensional pixel (x, y) of the second map has a corresponding (u, v) coordinate.
For example, the correspondence between the three-dimensional point coordinates of the three-dimensional model and the UV Map (U-V-Map) can be obtained by the following steps.
The first step: establish a correspondence between the three-dimensional model of the object and the UV map (U-V-Map). This correspondence may be established by standard UV mapping (U-V-mapping). Once it is completed, each three-dimensional point (X_m, Y_m, Z_m) corresponds to a U coordinate and a V coordinate on the UV map, yielding (u, v). In addition, the (u, v) values in the UV map may also encode the component to which the three-dimensional point belongs, such as the door, the tail, and so on.
The second step: establish the correspondence between the three-dimensional points (X_m, Y_m, Z_m) of the model, for example the three-dimensional coordinate points of the vehicle model, and the two-dimensional points (x, y) of the image coordinate system. Establishing this correspondence is a camera projection process, given by the following formula (1):
(x, y, 1)^T = K * [R, T] * [X_m, Y_m, Z_m, 1]^T / Z (1),
where K = [fx, 0, c_x; 0, fy, c_y; 0, 0, 1] is the intrinsic parameter matrix of the camera, fx and fy are the focal lengths of the camera along the X axis and the Y axis respectively, and c_x, c_y are the translations of the origin of the camera coordinate system; [R, T] is the projection matrix of the object in the camera coordinate system (including position and orientation information), and the superscript T denotes the transpose. This is the standard camera perspective projection equation.
Through these two steps, the relationship between a point (x, y) of the two-dimensional image coordinate system and a point (X_m, Y_m, Z_m) of the three-dimensional object coordinate system is obtained, while the correspondence between the three-dimensional object coordinate system and the UV coordinate system has already been determined in advance. Therefore, with the three-dimensional coordinate point (X_m, Y_m, Z_m) as the intermediate link, the (u, v) value corresponding to each image pixel (x, y) can be obtained, and the second UV map is thereby obtained.
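As an illustration of formula (1) and of how the second map could be produced from it, the following Python sketch projects the model's three-dimensional points into the image and splats their known (u, v) values into an image-sized map. The intrinsic matrix shown and the simple point-splat renderer are assumptions of this sketch; the application itself only requires some rendering technique.

```python
import numpy as np

def project_model_points(P_m, K, R, T):
    """Formula (1): (x, y, 1)^T = K * [R, T] * [X_m, Y_m, Z_m, 1]^T / Z."""
    P_cam = P_m @ R.T + T            # model coordinates -> camera coordinates, shape (N, 3)
    uvw = P_cam @ K.T                # apply camera intrinsics
    xy = uvw[:, :2] / uvw[:, 2:3]    # divide by depth Z to obtain pixel coordinates
    return xy, P_cam[:, 2]           # pixel coordinates and depths

def render_second_map(xy, depth, model_uv, H, W):
    """Crude nearest-depth point splat of per-vertex (u, v) values (stand-in for a full renderer)."""
    uv_img = -np.ones((H, W, 2), dtype=np.float32)
    zbuf = np.full((H, W), np.inf)
    for (px, py), z, uv in zip(xy.astype(int), depth, model_uv):
        if 0 <= px < W and 0 <= py < H and 0 < z < zbuf[py, px]:
            zbuf[py, px], uv_img[py, px] = z, uv
    return uv_img

# Hypothetical intrinsics (not values from the application):
K = np.array([[720.0, 0.0, 640.0],
              [0.0, 720.0, 360.0],
              [0.0,   0.0,   1.0]])
```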
In the embodiments of the application, prior information such as the model category of the object, for example a vehicle, can be used to correct the initial three-dimensional position information, which helps to obtain a more accurate three-dimensional position of the object.
Fig. 3 is a flowchart of an object detection method according to another embodiment of the present application. The same descriptions as those of the previous embodiment have the same meaning and are not repeated here.
Based on the above embodiment, in one possible implementation, in S101, detecting the image to obtain the initial three-dimensional position information of the object in the image includes:
s201, detecting an object on the first image to obtain a two-dimensional detection frame of the object; in this step, a first map of the object may also be obtained.
S202, acquiring a second image comprising the object from the first image by utilizing the two-dimensional detection frame of the object.
S203, predicting the first image and the second image by using a neural network to obtain a prediction result, wherein the prediction result includes the intersection point of the center point of the object with the ground. In this step, the prediction result may also include the model category of the object.
S204, calculating initial three-dimensional position information of the center point of the object by using the intersection point of the center point of the object and the ground.
If a plurality of objects are included in the first image, detecting the first image with the object detection model yields a two-dimensional detection frame for each of them. A second image is then obtained by cropping the first image with the two-dimensional detection frame of each object. For example, if the first image includes an object A and an object B, a second image including object A and a second image including object B can be cropped from the first image.
For each object, the original first image may be combined with the second image comprising that object at the feature layer of the neural network. For example, the features of the first image and of the second image including object A are combined to obtain one feature map, and the features of the first image and of the second image including object B are combined to obtain another feature map. The combined feature map is then input into the fully connected layers of the neural network, and the obtained prediction result may include the coordinates of the intersection point of the center point of the object with the ground.
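A minimal PyTorch sketch of this merge-then-predict step; the feature sizes, the number of fully connected layers and the output layout are assumptions made for illustration, not the exact network of the application.

```python
import torch
import torch.nn as nn

class CenterGroundPredictor(nn.Module):
    """Concatenate full-image features and cropped-object features at the feature
    layer, then predict through fully connected layers."""
    def __init__(self, c_all=256, c_obj=256, spatial=7, num_classes=4):
        super().__init__()
        in_dim = (c_all + c_obj) * spatial * spatial      # W x H x (C1 + C2), flattened
        self.fc = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, 2 + 2 + num_classes),          # O', projected center O, category logits
        )

    def forward(self, f_all, f_obj):
        merged = torch.cat([f_all, f_obj], dim=1)         # merge along the channel dimension
        out = self.fc(merged.flatten(1))
        ground_point = out[:, :2]                         # intersection point O' with the ground (image coords)
        center_proj = out[:, 2:4]                         # projection O of the 3D center on the image
        cls_logits = out[:, 4:]                           # model category scores
        return ground_point, center_proj, cls_logits
```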
In this embodiment, the initial three-dimensional position information of the object is predicted using both the original image (the first image) and the image cropped from it that contains the object in question (the second image), so the prediction result is more accurate.
Fig. 4 is a flowchart of an object detection method according to another embodiment of the present application. The same descriptions of this embodiment as those of the above embodiment have the same meaning and are not repeated here.
On the basis of any one of the foregoing embodiments, in one possible implementation manner, in S203, predicting the first image and the second image by using a neural network to obtain a prediction result includes:
s301, acquiring the characteristics of the first image and the characteristics of the second image;
s302, inputting the features of the first image and the second image into a feature layer of the neural network for merging;
s303, inputting the combined characteristics into a plurality of fully connected layers of the neural network for prediction, and obtaining a prediction result.
Illustratively, if a second image including object A and a second image including object B have been obtained from the original first image, object A and object B are predicted separately. For object A, after the first image and the second image including object A are combined by the neural network, the intersection point O'_A of the center point of object A (the center point of the object can be represented by three-dimensional coordinates) with the ground and the projection point O_A of the center point of object A on the two-dimensional image are predicted, and the model category of object A can also be predicted. For object B, after the first image and the second image including object B are combined by the neural network, the intersection point O'_B of the center point of object B with the ground and the projection point O_B of the center point of object B on the two-dimensional image are predicted, and the model category of object B can also be predicted.
Then, using the intersection point of the center point of object A with the ground, the initial three-dimensional position information of the center point of object A can be calculated; likewise, the initial three-dimensional position information of the center point of object B is calculated using the intersection point of the center point of object B with the ground. The initial three-dimensional position information of the center point of object A and/or object B calculated here may be represented by three-dimensional coordinates.
In this embodiment, after the features of the original image (the first image) and the features of the detected image containing the object (the second image) are combined in the neural network, the predicted three-dimensional position coordinates of the object are more accurate, which helps to reduce the number of subsequent correction iterations and the amount of computation.
Fig. 5 is a flowchart of an object detection method according to another embodiment of the present application. The same descriptions of this embodiment as those of the above embodiment have the same meaning and are not repeated here.
On the basis of any of the above embodiments, in one possible implementation manner, in S204, calculating initial three-dimensional position information of the center point of the object using the intersection point of the center point of the object and the ground includes:
s401, obtaining the distance between the camera and the object by using the normal vector of the height of the camera and the ground;
s402, calculating initial three-dimensional position information of the center point of the object by using the distance and the intersection point of the center point of the object and the ground.
For example, referring to fig. 6, if the vehicle is traveling on a relatively flat road, the height (Camera Height) of the camera capturing the image is h, and the normal vector (Normal Vector) of the ground under the camera coordinate system is n = (n_x, n_y, n_z). The distance Z can also be understood as the distance from the device on which the camera is mounted, for example a vehicle, to the object.
Any point G = (X, Y, Z)^T on the ground plane satisfies formula (2):
(n_x, n_y, n_z) * (X, Y, Z)^T = h (2)
Assuming that the road surface is planar, n_x and n_z are both equal to 0; with h known, formula (2), combined with the projection relations (3) and (4) below, allows the value of Z to be calculated.
In the camera coordinate system, the relationship between three-dimensional point coordinates (X, Y, Z) and the corresponding image point coordinates (x, y) satisfies the following formulas (3) and (4):
X = (x - u0)/fx * Z (3),
Y = (y - v0)/fy * Z (4),
where (u0, v0) is the optical center position of the camera, and fx, fy are the focal lengths of the camera along the X axis and the Y axis, respectively. In most cases fx = fy = f, so a uniform focal length f can be used.
By combining equations (2) - (4), the three-dimensional coordinates (X, Y, Z) of any point on the ground can be solved.
Since the intersection point O' of the object with the ground has been estimated in step S203 above, the three-dimensional position information corresponding to O' can be calculated from the coordinates (x, y) of O' in the image in combination with formulas (2)-(4), and the result can be expressed as the three-dimensional coordinates (X_c, Y_c, Z_c).
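A minimal sketch of the back-projection described by formulas (2)-(4), assuming the ground normal n and camera height h are known; the numeric values in the comment are hypothetical.

```python
import numpy as np

def backproject_ground_point(x, y, h, n, K):
    """Solve formulas (2)-(4): a ground pixel (x, y) back-projects to (X, Y, Z) with
    (X, Y, Z) = a * Z, a = ((x - u0)/fx, (y - v0)/fy, 1), and n . (X, Y, Z) = h."""
    fx, fy = K[0, 0], K[1, 1]
    u0, v0 = K[0, 2], K[1, 2]
    a = np.array([(x - u0) / fx, (y - v0) / fy, 1.0])   # viewing ray scaled by depth Z
    Z = h / float(np.dot(n, a))                         # n . (a * Z) = h  =>  Z = h / (n . a)
    return a * Z                                        # initial (X_c, Y_c, Z_c) of the center point

# Example with assumed values: camera 1.5 m above a flat road, ground normal (0, 1, 0) in camera coords:
# backproject_ground_point(700, 420, 1.5, np.array([0.0, 1.0, 0.0]), K)
```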
In this embodiment, the initial three-dimensional position information of the center point of the object can be calculated quickly and accurately using camera parameters such as the camera height together with the intersection point of the center point of the object with the ground. These three-dimensional coordinates then serve as the initial three-dimensional position information of the object, i.e. as initial values for further optimization.
Fig. 7 is a flowchart of an object detection method according to another embodiment of the present application. The same descriptions of this embodiment as those of the above embodiment have the same meaning and are not repeated here.
On the basis of any of the foregoing embodiments, in one possible implementation manner, in S103, correcting the initial three-dimensional position information of the object using the first map and the second map includes:
s501, establishing a loss function by using the first mapping and the second mapping;
s502, correcting the initial three-dimensional position information of the object by using the loss function.
The three-dimensional position of the object may include coordinates and an orientation angle of a center point of a three-dimensional detection frame of the object, and the like.
The loss function may be calculated from the difference in UV coordinates between the first map and the second map. For example, each pixel (x, y) of the first map has a corresponding (u, v) coordinate, and each pixel (x, y) of the second map has a corresponding (u', v') coordinate. The value of the loss function may be calculated from the difference between the (u, v) coordinates of the first map and the (u', v') coordinates of the second map for pixels with the same (x, y) coordinates; for example, the loss function equals the sum, over all pixel positions, of the differences of the U coordinates, the V coordinates, or both, between the two maps.
And correcting the initial three-dimensional position information of the object by using the loss function to obtain corrected position information.
For example, the correction process may include adjusting the position and orientation information of the object in the initial projection matrix R, T of the camera coordinate system, substituting the new R, T into formula (1) above to obtain the (u', v') coordinates of a new second map, and thus a new UV map. The coordinates of the new UV map and of the previous UV map are substituted into the loss function and the change in its value is compared. If the loss function value becomes smaller, the (u', v') coordinates of the new UV map are taken as the starting point for the next optimization step; if the loss function value becomes larger, the previous UV map is kept unchanged. This is repeated until the value of the loss function meets the requirement, for example falls below a certain threshold, and the corrected three-dimensional position information of the object is finally obtained.
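The application does not fix a particular optimizer for this correction loop; purely as an illustration, the sketch below combines the UV-difference loss with a simple accept-if-better perturbation of the translation T (the orientation could be perturbed in the same way). render_fn stands for the projection/rendering step of formula (1).

```python
import numpy as np

def uv_loss(uv_seg, uv_rend, mask):
    """Sum of |u' - u| + |v' - v| over pixels covered by both maps."""
    return float(np.abs(uv_seg - uv_rend)[mask].sum())

def refine_translation(render_fn, uv_seg, mask, R, T, num_iters=50, step=0.05, threshold=1e-3):
    """Keep a candidate update only if it shrinks the loss; stop once below a threshold."""
    best = uv_loss(uv_seg, render_fn(R, T), mask)
    for _ in range(num_iters):
        dT = np.random.uniform(-step, step, size=3)      # candidate translation perturbation
        cand = uv_loss(uv_seg, render_fn(R, T + dT), mask)
        if cand < best:                                  # loss became smaller: accept the new pose
            T, best = T + dT, cand
        if best < threshold:                             # loss meets the requirement: stop
            break
    return T, best
```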
In this embodiment, the loss function is established from the predicted second map and the segmented first map, and the initial three-dimensional position information of the object is corrected accordingly, so that more accurate three-dimensional position information of the object is obtained.
In one application example, the three-dimensional position information of an object can be detected with an end-to-end three-dimensional object detection algorithm based on a single-frame image and three-dimensional models. Specifically, the detection process may include the following steps:
s1: inputting a single frame image into the object detection model, and acquiring a two-dimensional detection frame of the object to be detected by using a detection algorithm (such as Mask-RCNN), wherein the UV segmentation result corresponds to each object to be detected. For example, as shown in FIG. 8, UV-Seg represents the UV segmentation results, which can be represented by UV mapping (UV-Map); bboxes represent the object box detection results. If multiple objects are included in the image, each object may have a corresponding object frame, such as a two-dimensional detection frame. Further, an image including the object can be cut out from the original image using a two-dimensional detection frame.
S2: features in the whole image and the cropped image only containing the object are extracted by using a deep learning network (such as Res-Net 50 and the like) respectively. Referring to fig. 8, the image is detected by using the object detection model, and Features of the original image and Features of clipping (Cropped Features) can also be obtained. The two are combined at the feature layer to obtain a shared feature map (Shared Feature Map). For example, in fig. 8, there are three vehicles, and the image including each vehicle may be cut out and combined with the original image to obtain three combined images. For example, the features of the original image and the cropped features are respectively represented by F all And F object And (3) representing. Will F all And F object Merging is performed at the feature layer. Feature F all Comprising the following steps: W×H×C1, feature F object Comprising the following steps: w×h×c2, after combining, becomes: w×h× (c1+c2).
The prediction (proposal) results of the object are then output through a two-layer (or deeper) network of fully connected layers (such as the depth rendering layers (Deep Render Layers) in fig. 8). As shown in fig. 9, the proposal results may include, but are not limited to: the three-dimensional detection frame ABCD of the object, the three-dimensional size of the object, the intersection point O' of the three-dimensional center point with the ground, and the projection point O of the three-dimensional center point on the two-dimensional image, where E represents the point on the ground in the same line as point O'. The three-dimensional point of the object can be calculated using the image coordinates of point E.
In addition, the proposal results may also include an initial model class. For example, the categories of vehicles may include cars, SUVs, vans, buses, etc., each category corresponding to a model of the vehicle.
Referring to fig. 8, according to the model category, the object model corresponding to that category can be selected from the object models (Object Models) of the various types prepared in advance. Object refinement (Object Refinement) then denotes the subsequent refinement of the three-dimensional object position; see the correction process described above for details.
S3: the distance Z of a preliminary object can be estimated using the camera height h and the normal vector n to the ground. Using the estimated Z and the estimated intersection O' of the three-dimensional center point of the object with the ground, initial three-dimensional position information (X, Y, Z) of the center point of an object can be calculated, see fig. 6 above.
Estimating the distance Z of the object using the camera height h and the ground normal vector n, and obtaining the preliminary initial three-dimensional position information of the center point of the object, includes the following.
if the vehicle is traveling on a relatively flat road, the camera height of the captured image is h, and the normal vector in the camera coordinate system is n= (n) x ,n y ,n z ) Any point x= (X, Y, Z) on the ground plane T The following formula (3-1) is satisfied:
(n x ,n y ,n z )*(X,Y,Z)T=h (3-1),
in the camera coordinate system, the three-dimensional point coordinates (X, Y, Z) and the corresponding image point coordinates (X, Y) satisfy the following relationship:
X=(x–u0)/fx*Z (3-2),
Y=(y–v0)/fy*Z (3-3),
where (u 0, v 0) is the optical center position of the camera, fx, fy are the focal lengths of the camera on the X, Y axis, respectively. In most cases fx=fy=f is represented by a uniform focal length.
By combining the three formulas (3-1) - (3-3), the three-dimensional coordinates of any point on the ground can be solved.
Using the image coordinates of the intersection point O' of the vehicle with the ground estimated in step S2, and combining formulas (3-1)-(3-3), the initial three-dimensional coordinates of the vehicle are calculated. These coordinates are then used as the initial position information of the vehicle, serving as initial values for further optimization.
S4: Using the initial three-dimensional point position and the estimated category of the object, for example the vehicle model category, the three-dimensional model of the vehicle may be projected onto the image by a rendering technique to obtain a projected UV map; see Predicted UV in fig. 8, which may be denoted U'V'-Map. Referring to figs. 10a, 10b and 10c: for the original image in fig. 10a, an example of the rendered U'-Map is shown in fig. 10b and an example of the rendered V'-Map in fig. 10c. On this basis, the differences (e.g. U'-U and V'-V) between the projected U'V'-Map and the segmented UV-Map are computed, see Rendering and Compare Loss in fig. 8. This difference is used as an energy loss function to correct the three-dimensional position information of the object, for example correcting the position estimate and the orientation angle of the vehicle.
Fig. 11 is a block diagram of an object detection apparatus according to an embodiment of the present application. The apparatus may include:
the detection module 210 is configured to detect an image and obtain initial three-dimensional position information, a first map, and a model category of an object in the image;
the projection module 220 is configured to perform projection using the initial three-dimensional position information of the object and the model category, so as to obtain a second map;
the correction module 230 is configured to correct the initial three-dimensional position information of the object using the first map and the second map.
As shown in fig. 12, in one possible implementation, the detection module 210 includes:
the detection sub-module 211 is configured to perform object detection on the first image to obtain a two-dimensional detection frame of the object;
an acquisition sub-module 212 for acquiring a second image including the object from the first image using a two-dimensional detection frame of the object;
a prediction sub-module 213, configured to predict the first image and the second image by using a neural network, so as to obtain a prediction result, where the prediction result includes an intersection point of a center point of the object and the ground;
the calculating sub-module 214 is configured to calculate initial three-dimensional position information of the center point of the object using an intersection point of the center point of the object and the ground.
In one possible implementation, the prediction submodule 213 is specifically configured to:
acquiring characteristics of a first image and characteristics of a second image;
inputting the features of the first image and the second image into a feature layer of the neural network for merging;
and inputting the combined characteristics into a plurality of fully connected layers of the neural network for prediction to obtain a prediction result.
In one possible implementation, the calculation submodule 214 is specifically configured to:
obtaining the distance between the camera and the object by using the height of the camera and the normal vector of the ground;
and calculating initial three-dimensional position information of the center point of the object by using the distance and the intersection point of the center point of the object and the ground.
In one possible implementation, the correction module 230 includes:
a loss function sub-module 231 for creating a loss function using the first map and the second map;
a correction sub-module 232 for correcting the initial three-dimensional position information of the object using the loss function.
The functions of each module in each device of the embodiments of the present application may be referred to the corresponding descriptions in the above methods, and are not described herein again.
According to embodiments of the present application, the present application also provides an electronic device, a readable storage medium and a computer program product.
As shown in fig. 13, there is a block diagram of an electronic device of an object detection method according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the applications described and/or claimed herein.
As shown in fig. 13, the electronic device includes: one or more processors 901, memory 902, and interfaces for connecting the components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executing within the electronic device, including instructions stored in or on memory to display graphical information of the GUI on an external input/output device, such as a display device coupled to the interface. In other embodiments, multiple processors and/or multiple buses may be used, if desired, along with multiple memories and multiple memories. Also, multiple electronic devices may be connected, each providing a portion of the necessary operations (e.g., as a server array, a set of blade servers, or a multiprocessor system). This embodiment takes a processor 901 as an example.
Memory 902 is a non-transitory computer readable storage medium provided by the present application. The memory stores instructions executable by the at least one processor to cause the at least one processor to perform the object detection method provided by the present application. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to execute the object detection method provided by the present application.
The memory 902 is used as a non-transitory computer readable storage medium for storing non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules (e.g., the detection module 210, the projection module 220, and the correction module 230 shown in fig. 11) corresponding to the object detection method according to the embodiment of the present application. The processor 901 executes various functional applications of the server and data processing, i.e., implements the object detection method in the above-described method embodiments, by running non-transitory software programs, instructions, and modules stored in the memory 902.
The memory 902 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, at least one application program required for a function; the storage data area may store data created according to the use of the electronic device of the object detection method, and the like. In addition, the memory 902 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory 902 optionally includes memory remotely located relative to processor 901, which may be connected to the electronic device of the object detection method via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the object detection method may further include: an input device 903 and an output device 904. The processor 901, the memory 902, the input device 903 and the output device 904 may be connected by a bus or in other ways; in fig. 13, connection by a bus is taken as an example.
The input device 903 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device of the object detection method, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointer stick, one or more mouse buttons, a track ball, a joystick, etc. The output means 904 may include a display device, auxiliary lighting means (e.g., LEDs), tactile feedback means (e.g., vibration motors), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device may be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASIC (application specific integrated circuit), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
These computing programs (also referred to as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
In the embodiments of the application, prior information such as the model category of the object, for example a vehicle, can be used to correct the initial three-dimensional position information, which helps to obtain a more accurate three-dimensional position of the object.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed embodiments are achieved, and are not limited herein.
The above embodiments do not limit the scope of the present application. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present application should be included in the scope of the present application.

Claims (12)

1. An object detection method comprising:
detecting an image to obtain initial three-dimensional position information, a first map and a model category of an object in the image;
projecting by using the initial three-dimensional position information of the object and the model category to obtain a second map;
correcting the initial three-dimensional position information of the object by using the first map and the second map;
the detecting the image to obtain the initial three-dimensional position information of the object in the image comprises the following steps:
performing object detection on the first image to obtain a two-dimensional detection frame of the object;
acquiring a second image comprising the object from the first image by utilizing a two-dimensional detection frame of the object;
acquiring characteristics of the first image and characteristics of the second image;
the features of the first image and the second image are input into a feature layer of a neural network to be combined;
inputting the combined characteristics into the neural network for prediction to obtain initial three-dimensional position information of the object;
detecting the image to obtain a first map, including:
and carrying out UV segmentation on the image to obtain a first map of the object.
2. The method of claim 1, wherein inputting the combined features into the neural network for prediction to obtain the initial three-dimensional position information of the object, comprises:
inputting the combined characteristics into the neural network for prediction to obtain a prediction result, wherein the prediction result comprises the intersection point of the central point of the object and the ground;
and calculating initial three-dimensional position information of the center point of the object by utilizing the intersection point of the center point of the object and the ground.
3. The method of claim 2, wherein inputting the combined features into the neural network for prediction, obtaining a prediction result, comprises:
and inputting the combined characteristics into a plurality of full-connection layers of the neural network to predict, so as to obtain the prediction result.
4. A method according to claim 2 or 3, wherein calculating initial three-dimensional position information of the center point of the object using the intersection point of the center point of the object and the ground comprises:
obtaining the distance between the camera and the object by using the height of the camera and the normal vector of the ground;
and calculating initial three-dimensional position information of the center point of the object by using the distance and the intersection point of the center point of the object and the ground.
5. The method of claim 4, wherein correcting the initial three-dimensional position information of the object using the first map and the second map comprises:
establishing a loss function using the first map and the second map;
and correcting the initial three-dimensional position information of the object by using the loss function.
6. An object detection device comprising:
the detection module is used for detecting an image to obtain initial three-dimensional position information, a first map and a model category of an object in the image;
the projection module is used for projecting by using the initial three-dimensional position information of the object and the model category to obtain a second map;
the correction module is used for correcting the initial three-dimensional position information of the object by using the first map and the second map;
the detection module comprises:
the detection sub-module is used for detecting the object of the first image to obtain a two-dimensional detection frame of the object;
an acquisition sub-module for acquiring a second image including the object from the first image using a two-dimensional detection frame of the object;
a prediction sub-module, configured to obtain a feature of the first image and a feature of the second image; the features of the first image and the second image are input into a feature layer of a neural network to be combined; inputting the combined characteristics into the neural network for prediction to obtain initial three-dimensional position information of the object;
the detection module detects the image to obtain a first map, including:
and carrying out UV segmentation on the image to obtain a first map of the object.
7. The apparatus of claim 6, wherein the prediction submodule inputs the combined features into the neural network for prediction to obtain the initial three-dimensional position information of the object, and the prediction submodule comprises:
inputting the combined characteristics into the neural network for prediction to obtain a prediction result, wherein the prediction result comprises the intersection point of the central point of the object and the ground;
and the calculating sub-module is used for calculating the initial three-dimensional position information of the center point of the object by utilizing the intersection point of the center point of the object and the ground.
8. The apparatus of claim 7, wherein the prediction submodule inputs the combined features into the neural network for prediction to obtain a prediction result, and the prediction submodule comprises:
and inputting the combined characteristics into a plurality of full-connection layers of the neural network to predict, so as to obtain the prediction result.
9. The apparatus according to claim 7 or 8, wherein the calculation submodule is specifically configured to:
obtaining the distance between the camera and the object by using the height of the camera and the normal vector of the ground;
and calculating initial three-dimensional position information of the center point of the object by using the distance and the intersection point of the center point of the object and the ground.
10. The apparatus of claim 9, wherein the correction module comprises:
a loss function sub-module for building a loss function using the first map and the second map;
and the correction sub-module is used for correcting the initial three-dimensional position information of the object by using the loss function.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-5.
CN202010593140.XA 2020-06-26 2020-06-26 Object detection method, device, equipment and storage medium Active CN111753739B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010593140.XA CN111753739B (en) 2020-06-26 2020-06-26 Object detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010593140.XA CN111753739B (en) 2020-06-26 2020-06-26 Object detection method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111753739A CN111753739A (en) 2020-10-09
CN111753739B true CN111753739B (en) 2023-10-31

Family

ID=72677363

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010593140.XA Active CN111753739B (en) 2020-06-26 2020-06-26 Object detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111753739B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112132113A (en) * 2020-10-20 2020-12-25 北京百度网讯科技有限公司 Vehicle re-identification method and device, training method and electronic equipment
CN112380991A (en) * 2020-11-13 2021-02-19 贝壳技术有限公司 Article model placing method and device, storage medium and electronic equipment
CN112819880A (en) * 2021-01-07 2021-05-18 北京百度网讯科技有限公司 Three-dimensional object detection method, device, equipment and storage medium
CN113269820A (en) * 2021-05-26 2021-08-17 北京地平线信息技术有限公司 Method and device for generating space geometric information estimation model

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109767487A (en) * 2019-01-04 2019-05-17 北京达佳互联信息技术有限公司 Face three-dimensional rebuilding method, device, electronic equipment and storage medium
CN110148217A (en) * 2019-05-24 2019-08-20 北京华捷艾米科技有限公司 A kind of real-time three-dimensional method for reconstructing, device and equipment
CN110895823A (en) * 2020-01-10 2020-03-20 腾讯科技(深圳)有限公司 Texture obtaining method, device, equipment and medium for three-dimensional model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109214980B (en) * 2017-07-04 2023-06-23 阿波罗智能技术(北京)有限公司 Three-dimensional attitude estimation method, three-dimensional attitude estimation device, three-dimensional attitude estimation equipment and computer storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109767487A (en) * 2019-01-04 2019-05-17 北京达佳互联信息技术有限公司 Face three-dimensional rebuilding method, device, electronic equipment and storage medium
CN110148217A (en) * 2019-05-24 2019-08-20 北京华捷艾米科技有限公司 A kind of real-time three-dimensional method for reconstructing, device and equipment
CN110895823A (en) * 2020-01-10 2020-03-20 腾讯科技(深圳)有限公司 Texture obtaining method, device, equipment and medium for three-dimensional model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
飞机数字化装配生产线布局仿真技术研究 (Research on layout simulation technology for aircraft digital assembly production lines); 王巍, 俞鸿均, 安宏喜, 谷天慧; 制造业自动化 (Manufacturing Automation), No. 10, pp. 64-66 *

Also Published As

Publication number Publication date
CN111753739A (en) 2020-10-09

Similar Documents

Publication Publication Date Title
CN111753739B (en) Object detection method, device, equipment and storage medium
EP3709271B1 (en) Image depth prediction neural networks
CN109003325B (en) Three-dimensional reconstruction method, medium, device and computing equipment
JP6031554B2 (en) Obstacle detection method and apparatus based on monocular camera
CN108027877B (en) System and method for non-obstacle area detection
CN112652016B (en) Point cloud prediction model generation method, pose estimation method and pose estimation device
CN110119148B (en) Six-degree-of-freedom attitude estimation method and device and computer readable storage medium
WO2019202397A2 (en) Vehicle environment modeling with a camera
KR102472767B1 (en) Method and apparatus of calculating depth map based on reliability
CN111899301A (en) Workpiece 6D pose estimation method based on deep learning
WO2023016271A1 (en) Attitude determining method, electronic device, and readable storage medium
CN111739005B (en) Image detection method, device, electronic equipment and storage medium
CN112560684B (en) Lane line detection method, lane line detection device, electronic equipment, storage medium and vehicle
CN116645649B (en) Vehicle pose and size estimation method, device and storage medium
CN112097732A (en) Binocular camera-based three-dimensional distance measurement method, system, equipment and readable storage medium
EP3293700B1 (en) 3d reconstruction for vehicle
CN112651881B (en) Image synthesizing method, apparatus, device, storage medium, and program product
CN111127522A (en) Monocular camera-based depth optical flow prediction method, device, equipment and medium
CN113269689A (en) Depth image completion method and system based on normal vector and Gaussian weight constraint
WO2023016182A1 (en) Pose determination method and apparatus, electronic device, and readable storage medium
CN112767478A (en) Appearance guidance-based six-degree-of-freedom pose estimation method
CN114663529B (en) External parameter determining method and device, electronic equipment and storage medium
Jang et al. Camera orientation estimation using motion-based vanishing point detection for advanced driver-assistance systems
CN112528932B (en) Method and device for optimizing position information, road side equipment and cloud control platform
Zhang et al. Real-time obstacle detection based on stereo vision for automotive applications

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant