CN113177976A - Depth estimation method and device, electronic equipment and storage medium - Google Patents

Depth estimation method and device, electronic equipment and storage medium

Info

Publication number
CN113177976A
CN113177976A
Authority
CN
China
Prior art keywords
image
target frame
depth information
monocular
camera
Prior art date
Legal status
Pending
Application number
CN202110488366.8A
Other languages
Chinese (zh)
Inventor
曾印权
刘帅
徐本睿
Current Assignee
Shenzhen Anngic Technology Co ltd
Original Assignee
Shenzhen Anngic Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Anngic Technology Co ltd filed Critical Shenzhen Anngic Technology Co ltd
Priority to CN202110488366.8A priority Critical patent/CN113177976A/en
Publication of CN113177976A publication Critical patent/CN113177976A/en
Pending legal-status Critical Current


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/50: Depth or shape recovery
    • G06T 7/55: Depth or shape recovery from multiple images
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/50: Context or environment of the image
    • G06V 20/56: Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a depth estimation method, a depth estimation device, an electronic device, and a storage medium, which address the problem that the accuracy of the obtained depth information is difficult to meet driving requirements. The method comprises the following steps: acquiring an image to be inferred, the image being captured by a monocular camera; and inferring the image with a neural network model to obtain a predicted target frame in the image and estimated depth information corresponding to that frame. The neural network model is trained with monocular images obtained by performing monocular processing on images collected by a binocular camera; the predicted target frame indicates the position area of the target object in the image, and the estimated depth information is the distance between the target object and the monocular camera.

Description

Depth estimation method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing and automatic driving technologies, and in particular, to a depth estimation method, apparatus, electronic device, and storage medium.
Background
In the field of assisted driving, automatic driving, or unmanned driving, a monocular camera is generally used to collect an image of the scene in front of the vehicle, and target detection is performed on that image to identify the position area of a target object; a ranging sensor such as a lidar or a millimeter-wave radar is then used to measure the distance between the target object and the monocular camera. Since the target object lies in the depth direction of the forward image (which can be understood as the Z-coordinate direction perpendicular to the XY plane), the distance between the target object and the monocular camera is also commonly referred to as the depth information of the target object.
In practice, the depth information determined by the ranging sensor is accurate as long as the height of the monocular camera relative to the ground and the angle between its optical axis and the ground do not change. However, when the vehicle moves on a mountain road or a bumpy surface, braking, starting, accelerating, decelerating, or driving uphill and downhill all occur, the error of the depth information determined by the ranging sensor grows rapidly, and the accuracy of the obtained depth information is difficult to meet the driving requirement.
Disclosure of Invention
An object of the embodiments of the present application is to provide a depth estimation method, device, electronic device, and storage medium, which are used to solve the problem that the accuracy of the obtained depth information is difficult to meet the driving requirement.
The embodiment of the application provides a depth estimation method, which comprises the following steps: acquiring an image to be inferred, the image being captured by a monocular camera; and inferring the image with a neural network model to obtain a predicted target frame in the image and estimated depth information corresponding to that frame. The neural network model is trained with monocular images, the monocular images are obtained by performing monocular processing on images collected by a binocular camera, the predicted target frame indicates the position area where the target object is located in the image, and the estimated depth information is the distance between the target object and the monocular camera. In this implementation, the neural network model used to infer the image collected by the monocular camera is trained with higher-precision images obtained through monocular processing of binocular images, and pixel points representing the true distance are selected during that processing, so the trained neural network model is more accurate. The depth estimation method does not depend on an external ranging sensor, which reduces errors caused by temporal desynchronization or spatial misalignment between a ranging sensor and the monocular camera and improves the accuracy of depth information estimation.
Optionally, in this embodiment of the application, before the neural network model is used to infer the image to be inferred, the method further comprises: acquiring a first image and a second image, where the first image and the second image are different images collected by a binocular camera for a target object, and the distance between the aperture center points of the binocular camera equals a pre-calculated baseline value; performing monocular processing on the first image and the second image to obtain a plurality of monocular images together with the target frames and depth information corresponding to them; and training a neural network with the plurality of monocular images as training data and the corresponding target frames and depth information as training labels to obtain the neural network model. Because the training data are collected with a binocular camera, twice as much data can be collected in the same time as with a monocular camera; monocular processing of the first and second images yields more training data, and training the neural network with more data makes the resulting neural network model more accurate at inference time.
Optionally, in this embodiment of the application, performing monocular processing on the first image and the second image comprises: labeling a first target frame of the target object on the first image and a second target frame of the target object on the second image, where a target frame indicates the position area of the target object in the image; and calculating the depth information of the target object from the first target frame and the second target frame. Here the first and second images of the target object are obtained with a binocular camera whose baseline is set to the pre-calculated optimal baseline value; with the distance between the aperture center points set to that value, the binocular camera can collect more pixel points representing the true distance, yielding higher-precision images.
Optionally, in this embodiment of the application, calculating the depth information of the target object from the first target frame and the second target frame comprises: selecting, from the first target frame and the second target frame, the pixel target frame containing the smaller number of pixel points; calculating the depth information of all pixel points in that pixel target frame according to the binocular ranging principle, where the depth information of a pixel point is the distance between the corresponding position point on the target object and the binocular camera; removing outlier pixel points and background pixel points from the pixel target frame to obtain a number of representative pixel points; and calculating the average of the depth information of the representative pixel points and taking that average as the depth information of the target object. Because the depth information of the target object is determined from per-pixel depth information computed by binocular ranging, no additional depth sensor (such as an infrared rangefinder) is needed in the data collection stage, which avoids the error growth caused by having to synchronize or align data from multiple sensors in the time and space dimensions and effectively improves the accuracy of the estimated depth information.
Optionally, in this embodiment of the application, the binocular camera comprises a first camera and a second camera, and calculating the depth information of all pixel points in the pixel target frame according to the binocular ranging principle comprises: performing feature matching between the first target frame and the second target frame to obtain the disparity between the first camera and the second camera; and calculating the depth information of all pixel points in the pixel target frame from the baseline value and the disparity. Again, no additional depth sensor (such as an infrared rangefinder) is needed in the data collection stage, which avoids the error growth caused by having to synchronize or align data from multiple sensors in the time and space dimensions and effectively improves the accuracy of the estimated depth information.
Optionally, in this embodiment of the application, after obtaining the predicted target frame in the image to be inferred and the estimated depth information corresponding to it, the method further comprises: judging whether the distance corresponding to the estimated depth information is smaller than a preset distance; and if so, generating and outputting an auxiliary early-warning signal, or decelerating through the braking system. Generating a warning or braking when the estimated distance falls below the preset distance avoids traffic accidents caused by late reaction or late braking and effectively improves the safety of electronic products such as intelligent vehicles with assisted driving systems, unmanned aerial vehicles, unmanned vehicles, or robots.
Optionally, in this embodiment of the application, after obtaining the predicted target frame in the image to be inferred and the estimated depth information corresponding to it, the method further comprises: judging whether the ratio of the predicted target frame to the image to be inferred is greater than a preset ratio; and if so, re-planning the path to avoid the target object in the predicted target frame. Re-planning the path when the predicted target frame occupies more than the preset share of the image allows the target object to be avoided and effectively improves the safety of electronic products such as intelligent vehicles with assisted driving systems, unmanned aerial vehicles, unmanned vehicles, or robots.
An embodiment of the present application further provides a depth estimation apparatus, including: the inference image acquisition module is used for acquiring an image to be inferred, and the image to be inferred is acquired by using a monocular camera; the image depth estimation module is used for reasoning the image to be inferred by using a neural network model to obtain a prediction target frame in the image to be inferred and estimation depth information corresponding to the prediction target frame, the neural network model is obtained after a monocular image is used for training, the monocular image is obtained by performing monocular processing on an image acquired by a binocular camera, the prediction target frame represents a position area where a target object in the image to be inferred is located, and the estimation depth information is the distance between the target object and the monocular camera.
Optionally, in an embodiment of the present application, the depth estimation apparatus further includes: the binocular image acquisition module is used for acquiring a first image and a second image, the first image and the second image are different images acquired by a binocular camera aiming at a target object, and the distance between the center points of the apertures of the binocular camera is equal to a pre-calculated baseline value; the image monocular processing module is used for performing monocular processing on the first image and the second image to obtain a plurality of monocular images, and target frames and depth information corresponding to the monocular images; and the network model training module is used for training the neural network by taking the plurality of monocular images as training data and taking the target frames and the depth information corresponding to the plurality of monocular images as training labels to obtain the neural network model.
Optionally, in an embodiment of the present application, the image monocular processing module includes: the target image labeling module is used for labeling a first target frame of a target object on the first image and labeling a second target frame of the target object on the second image, and the target frames represent the position area of the target object in the image; and the depth information calculation module is used for calculating the depth information of the target object according to the first target frame and the second target frame.
Optionally, in an embodiment of the present application, the depth information calculating module includes: the target frame screening module is used for screening out a pixel target frame with a small number of pixel points in the target frame from the first target frame and the second target frame; the depth information calculation module is used for calculating the depth information of all pixel points in the pixel target frame according to a binocular ranging principle, and the depth information of the pixel points is the distance between the position points of the pixel points in the target object and the binocular camera; the depth information screening module is used for eliminating outlier pixel points and background pixel points from all pixel points in the pixel target frame to obtain a plurality of representative pixel points; and the depth information determining module is used for calculating the average value of the depth information of the representative pixel points and determining the average value of the depth information as the depth information of the target object.
Optionally, in this embodiment of the present application, the binocular camera includes: a first camera and a second camera; a depth information calculation module comprising: the camera parallax obtaining module is used for the camera to perform feature matching on the first target frame and the second target frame to obtain parallax between the first camera and the second camera; and the pixel depth calculation module is used for calculating the depth information of all pixel points in the pixel target frame according to the baseline value and the parallax.
Optionally, in an embodiment of the present application, the depth estimation apparatus further includes: the depth distance judging module is used for judging whether the distance corresponding to the estimated depth information is smaller than a preset distance or not; and the auxiliary early warning deceleration module is used for generating and outputting an auxiliary early warning signal if the distance corresponding to the estimated depth information is smaller than the preset distance, or decelerating through a braking system.
Optionally, in an embodiment of the present application, the depth estimation apparatus further includes: the frame proportion judging module is used for judging whether the proportion value of the prediction target frame and the image to be inferred is larger than a preset proportion; and the path replanning module is used for replanning the path to avoid the target object in the predicted target frame if the ratio value of the predicted target frame to the image to be inferred is greater than the preset ratio.
An embodiment of the present application further provides an electronic device, including: a processor and a memory, the memory storing processor-executable machine-readable instructions, the machine-readable instructions when executed by the processor performing the method as described above.
Embodiments of the present application also provide a storage medium having a computer program stored thereon, where the computer program is executed by a processor to perform the method as described above.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a schematic flow chart of a data acquisition phase provided in an embodiment of the present application;
fig. 2 is a schematic diagram illustrating a binocular ranging principle provided by an embodiment of the present application;
fig. 3 is a schematic distribution diagram of pixel depth information provided in the embodiment of the present application;
FIG. 4 is a schematic diagram of a trained neural network model provided by an embodiment of the present application;
FIG. 5 is a schematic process diagram of a monocular processing provided in an embodiment of the present application;
FIG. 6 is a schematic flow chart illustrating a model inference phase provided by an embodiment of the present application;
fig. 7 is a schematic structural diagram of a depth estimation device according to an embodiment of the present application.
Detailed Description
The technical solution in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Before the depth estimation method provided by the embodiment of the present application is introduced, some concepts related to the embodiment of the present application are introduced:
the target detection network is a neural network for detecting a target object in an image, that is, the target object in the image is detected, and a position range, a classification and a probability of the target object in the image are given, the position range can be specifically labeled in the form of a detection frame, the classification refers to a specific class of the target object, and the probability refers to a probability that the target object in the detection frame is in the specific class.
A server refers to a device that provides computing services over a network, for example an x86 server or a non-x86 server; non-x86 servers include mainframes, minicomputers, and UNIX servers.
It should be noted that the depth estimation method provided in the embodiments of the present application may be executed by an electronic device, where the electronic device refers to a device terminal having a function of executing a computer program or the server described above, and the device terminal includes, for example: smart phones, Personal Computers (PCs), tablet computers, Personal Digital Assistants (PDAs), or Mobile Internet Devices (MIDs), etc.
Before introducing the depth estimation method provided by the embodiments of the present application, the application scenarios to which it applies are introduced. These scenarios include, but are not limited to, assisted driving, autonomous driving, and unmanned driving. For example, the depth estimation method can be used in an assisted driving system, an unmanned aerial vehicle, an unmanned vehicle, or a robot to estimate the depth information in images collected by a monocular camera, improving the accuracy of depth estimation for such images. In practice, a neural network can be trained with the image target frames and image depth information obtained by the depth estimation method, and the trained neural network model can then be used to infer the target frame and depth information of a target object in an image. Specifically, the trained neural network model is used to infer the depth information of an obstacle ahead in an assisted driving system, unmanned aerial vehicle, unmanned vehicle, or robot; the depth information here can be understood as the distance between the obstacle ahead and the capturing camera, and it allows the system to perform obstacle avoidance, for example by outputting an auxiliary warning, braking and decelerating, changing lanes and steering, or re-planning the path.
The reasons why the accuracy of the depth information determined with the above ranging sensor is poor are analysed as follows. First, on a mountain road or a bumpy surface, the ranging sensor and the monocular camera shake to different degrees; when several targets to be detected appear in the monocular camera image, the depth readings from the ranging sensor are hard to match and align with those targets, so an additional correction algorithm is needed to compensate for the differing fields of view, alignment angles, and shaking of the two devices before matching and alignment can be performed on the corrected data. Second, the ranging action of the ranging sensor and the exposure of the monocular camera are hard to synchronize in time, that is, the depth measurement and the image are hard to acquire at exactly the same moment; even if both are controlled by a high-precision control chip, it is difficult to keep the timing error below a millisecond.
To overcome the above defects, the depth estimation method provided in the embodiments of the present application mainly uses a neural network model to infer the image to be inferred collected by a monocular camera. Because the neural network model is trained with higher-precision monocular images, and pixel points representing the true distance are selected during the monocular processing, the trained neural network model is more accurate. The depth estimation method does not depend on an external ranging sensor, which reduces errors caused by temporal desynchronization or spatial misalignment between a ranging sensor and the monocular camera and improves the accuracy of depth information estimation.
It can be understood that the depth estimation method may comprise four stages: data acquisition, model training, model inference, and depth information application. Data acquisition refers to collecting binocular images and performing monocular processing on them to obtain monocular images, target frames, and depth information; the binocular images can be collected with a binocular camera set to the optimal baseline value. Model training refers to training a neural network with the monocular images, target frames, and depth information to obtain a neural network model. Model inference refers to using the neural network model to infer the target frame and depth information of an image to be inferred. Depth information application refers to the practical use of the target frame and depth information, such as warning, decelerating, or re-planning a path.
Please refer to fig. 1, which illustrates a schematic flow chart of a data obtaining phase according to an embodiment of the present application; the data acquiring stage may specifically include:
step S110: a first image and a second image are acquired, and the first image and the second image are different images acquired by a binocular camera aiming at a target object.
The binocular image refers to different images acquired by a binocular camera for a target object, for example, the binocular image may include a first image and a second image.
There are many embodiments of the step S110, including but not limited to the following: in the first embodiment, a pre-stored binocular image is acquired, specifically for example: acquiring a binocular image from a file system, or acquiring the binocular image from a database, or acquiring the binocular image from a mobile storage device; a second acquisition mode, which is to acquire binocular images on the internet by using software such as a browser, or to access the internet by using other application programs to acquire binocular images; in the third acquisition mode, a binocular camera is used for acquiring binocular images of a plurality of targets to be detected in front of the target, and an optimal baseline value is calculated according to the binocular images; then, setting the distance between the central point of the aperture of the first camera and the central point of the aperture of the second camera as an optimal baseline value; and finally, acquiring binocular images of a plurality of targets to be detected in front by using a binocular camera with the optimal baseline value.
The third embodiment described above specifically includes, for example: placing a plurality of targets to be detected at equal intervals on an axis within the farthest detection distance of the binocular camera, and acquiring the real distance of each target to be detected; and setting the baseline value of the binocular camera to be a smaller initial value, and then collecting the binocular image in front by using the binocular camera. And finally, acquiring a first image of the target object by using the first camera, and acquiring a second image of the target object by using the second camera, wherein the first camera and the second camera are binocular cameras, and the distance between the center point of the aperture of the first camera and the center point of the aperture of the second camera is equal to a pre-calculated baseline value.
It should be noted that, the first camera and the second camera may be two cameras of a binocular camera installed on an unmanned aerial vehicle, an unmanned vehicle, an intelligent vehicle with an auxiliary driving system, or a robot, or a multi-camera installed on an unmanned aerial vehicle, an unmanned vehicle, an intelligent vehicle with an auxiliary driving system, or a robot, and specifically, for example: the multi-view camera is a camera which almost simultaneously obtains pictures of a target object, for example, a controller simultaneously sends an instruction for acquiring images to the multi-view camera.
Please refer to fig. 2 for a schematic diagram of the binocular ranging principle provided by an embodiment of the present application. Suppose the binocular camera includes a first camera on the left and a second camera on the right, both controlled by the same single chip to expose and collect images simultaneously. The depth information can be calculated with the formula z = f · b / d, where z is the calculated distance from the point O of the target object to the binocular camera, f is the focal length of the first or second camera (the two focal lengths may be the same), b is the current baseline value of the binocular camera, i.e. the distance between the aperture center point C1 of the first camera and the aperture center point C2 of the second camera, and d is the distance between the imaging point P1 of the target object on the first camera and its imaging point P2 on the second camera, i.e. the disparity between the first camera and the second camera.
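To make the relation concrete, the following Python sketch (not from the patent; the function and variable names are illustrative) computes per-pixel depth from focal length, baseline, and disparity:

```python
# Minimal sketch of the binocular ranging relation z = f * b / d. Focal length and
# disparity are assumed to be in pixels and the baseline in metres, so the returned
# depth is in metres; all names here are illustrative assumptions.
import numpy as np

def depth_from_disparity(focal_px, baseline_m, disparity_px):
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full_like(disparity_px, np.inf)   # unmatched / zero-disparity pixels stay at infinity
    valid = disparity_px > 0
    depth[valid] = focal_px * baseline_m / disparity_px[valid]
    return depth
```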
From the above analysis it can be seen that, with the focal length of the binocular camera unchanged and the distance between the camera and the target object (i.e. the depth information of the target object) unchanged, increasing the baseline value of the binocular camera (i.e. the distance between the first and second cameras) increases the disparity between the two cameras, so more pixel points representing the true distance can be obtained on the imaging plane. However, if the baseline value is too large, target objects near the two side edges can no longer be imaged by both cameras at the same time. Therefore an optimal baseline value needs to be calculated to avoid the situation where a target object cannot be imaged in both cameras simultaneously.
The above-mentioned manner of calculating the optimal baseline value may be calculated by using a given error range, and the specific process may include:
step S111: a plurality of targets to be detected are placed on an axis in the farthest detection distance of the binocular camera at equal intervals, a baseline value of the binocular camera is set to be a smaller initial value, then the baseline value of the binocular camera is adjusted, and binocular images of a plurality of target objects in front are collected by the binocular camera.
The embodiment of step S111 described above is, for example: adjusting a baseline value of the binocular camera, acquiring binocular images of a plurality of target objects in front by using the binocular camera, and obtaining binocular images, wherein the binocular images comprise a first image and a second image, and the first image and the second image are exposed and acquired simultaneously aiming at the same target object. It will be appreciated that the plurality of target objects described above are each within the farthest detection distance of the binocular camera, where the farthest detection distance may be represented by Dm in fig. 2.
Step S112: labeling each target on the binocular image respectively to obtain a target frame pair, wherein the target frame pair comprises a first target frame and a second target frame.
The embodiment of step S112 described above is, for example: a first target frame of each target object is marked on a first image in the binocular images, and a second target frame of each target object is marked on a second image in the binocular images. It can be understood that, because the number of the target to be detected and the number of the binocular images are both small, the labeling action can be performed manually, and thus the labeling precision and the image quality are effectively improved.
Step S113: and calculating the first target frame and the second target frame by using a binocular ranging principle to obtain the calculated distance from each target object to the binocular camera.
The embodiment of step S113 described above is, for example: screening out a pixel target frame with a small number of pixel points in the target frame from the first target frame and the second target frame; calculating the depth information of all pixel points in the pixel target frame according to a binocular ranging principle; removing outlier pixel points and background pixel points from all pixel points in the pixel target frame to obtain a plurality of representative pixel points; the average value of the depth information of the plurality of representative pixels is calculated, and the average value of the depth information is determined as the measurement distance between the target object and the binocular camera, where the measurement distance is the depth information between the target object and the binocular camera, and the specific calculation process for calculating the depth information of all the pixels in the pixel target frame according to the binocular range finding principle will be described in detail in the following step S130.
Step S114: and counting errors between the measured distance and the real distance of each target object, and screening out a maximum error value from the errors between the measured distance and the real distance.
The embodiment of step S114 described above is, for example: obtaining the real distance from each target object to the binocular camera, and using the formula E_i = |z_i - z*_i| / z*_i to compute the error between the measured distance and the true distance of each target object, and then using the formula E_m = max(E_1, E_2, E_3, ..., E_i, ..., E_n) to screen out the maximum error value from those errors; where E_i represents the error between the measured distance and the true distance of the i-th target object, z_i represents the calculated distance from the i-th target object to the binocular camera, z*_i represents the real distance between the i-th target object and the binocular camera (i.e. the distance at which the target to be detected was placed), n represents the total number of target objects, which can be set according to the specific situation, for example 5 to 10, and E_m represents the maximum error value selected from the errors between the measured distance and the true distance.
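A minimal sketch of this error statistic, under the assumption that the per-target measured and true distances are already available as lists, might look like this:

```python
# Sketch of step S114 as reconstructed above: relative error per target object and the
# maximum error E_m. measured_m and true_m are assumed lists of per-target distances.
def max_relative_error(measured_m, true_m):
    errors = [abs(z - z_true) / z_true for z, z_true in zip(measured_m, true_m)]
    return max(errors), errors
```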
Step S115: if the maximum error value is greater than the preset limit value, gradually increase the baseline value of the binocular camera and then repeat steps S111 to S115 until the maximum error value is less than or equal to the preset limit value. The amount by which the baseline value is increased each time may be set according to the specific situation, for example 0.1, 0.2, 0.4, 1, 3, or 7, and the preset limit value may also be set according to the specific situation, for example 2% or 5%.
Step S116: and if the maximum error value is less than or equal to the preset limit value, determining the baseline value of the binocular camera at the moment as the optimal baseline value.
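The search loop of steps S111 to S116 can be sketched as follows; measure_max_error is a hypothetical placeholder standing in for one full capture, label, range, and evaluate cycle, and the increment values are illustrative:

```python
# Hedged sketch of the optimal-baseline search in steps S111 to S116.
# measure_max_error(b) is an assumed callback: set the baseline to b, collect binocular
# images, label target frames, run binocular ranging, and return the maximum error E_m.
def find_optimal_baseline(measure_max_error, initial_baseline=0.1,
                          increments=(0.1, 0.2, 0.4, 1.0, 3.0, 7.0), error_limit=0.02):
    baseline = initial_baseline
    if measure_max_error(baseline) <= error_limit:
        return baseline
    for step in increments:
        baseline += step                      # gradually increase the baseline value
        if measure_max_error(baseline) <= error_limit:
            return baseline                   # E_m within the preset limit: optimal baseline
    return baseline                           # fall back to the largest baseline tried
```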
After step S110, step S120 is performed: a first target frame of the target object is marked on the first image, and a second target frame of the target object is marked on the second image.
The embodiment of step S120 described above is, for example: using a target detection network to detect a first target frame of the target object on the first image and a second target frame of the target object on the second image. The target frame indicates the position region of the target object in the image, and the target detection network in this embodiment may be a Region-based Convolutional Neural Network (RCNN), Fast RCNN, a Feature Fusion Single Shot multibox Detector (FSSD), or a similar network. Of course, in a specific implementation, the target frames of the first image and the second image may also be labeled manually.
After step S120, step S130 is performed: and calculating the depth information of the target object according to the first target frame and the second target frame, wherein the depth information of the target object is the distance between the target object and the binocular camera.
The above-mentioned embodiment of calculating the depth information of the target object according to the first target frame and the second target frame in step S130 may include:
step S131: and screening out a pixel target frame with a small number of pixel points in the target frame from the first target frame and the second target frame.
Step S132: and calculating the depth information of all pixel points in the pixel target frame according to a binocular ranging principle, wherein the depth information of the pixel points is the distance between the position point of the pixel points in the target object and the binocular camera.
The embodiment of step S132 described above is, for example: performing feature matching on the first target frame and the second target frame to obtain parallax between the first camera and the second camera; calculating the focal length, the baseline value and the parallax of the first camera or the second camera according to a binocular ranging principle to obtain depth information of all pixel points in the pixel target frame; wherein the focal length of the first camera and the focal length of the second camera may be equal.
Step S133: and removing outlier pixel points and background pixel points from all pixel points in the pixel target frame to obtain a plurality of representative pixel points.
Please refer to fig. 3, which illustrates a distribution diagram of pixel depth information provided in an embodiment of the present application. The embodiment of step S133 described above is, for example: reorder the depth information of all pixel points in the pixel target frame from small to large to obtain an ordered pixel point set, which can be denoted Dw, with the number of elements of Dw denoted w. Then outlier pixel points and background pixel points are removed from the pixel target frame; concretely, the T_l-th to T_h-th pixel points are selected directly from the sorted set Dw, and these pixel points form a set Ds whose number of elements is denoted a. Here T_l = w · p_l (truncated to an integer) and T_h = w · p_h (truncated to an integer), where p_l and p_h can be set according to the specific application scenario and serve to eliminate outlier and background pixel points in the target frame, for example p_l is set to 1%-3% and p_h to 40%-60%.
In the implementation process, the depth information of the target object is prevented from being determined only by the depth information of a single pixel point by means of screening and average calculation of the depth information of all the pixel points in the target frame, so that the accuracy of the depth information of the target object is effectively improved.
Step S134: and calculating the average value of the depth information of the representative pixel points, and determining the average value of the depth information as the depth information of the target object.
The embodiment of step S134 described above is, for example: using the formula Z = (1/a) · sum_{k=1..a} D_k to calculate the average of the depth information of the representative pixel points, and determining that average as the depth information of the target object; where Z represents the average depth, a represents the number of representative pixel points, and D_k represents the depth information of the k-th pixel point. Note that many products have volume limitations, so a binocular camera is often hard to use to good effect on the product itself; for example, an unmanned aerial vehicle cannot carry a binocular camera with a large baseline (say, a baseline ten times the size of the drone body), while the disparity of images collected with a small baseline cannot meet the precision requirement. Because the binocular camera is used only in the data acquisition stage, its baseline value can in theory be made as large as needed, which effectively ensures that the disparity of the images it collects meets the precision requirement.
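Steps S133 and S134 taken together can be sketched as below; the percentile values are illustrative stand-ins for p_l and p_h, and the routine assumes the pixel target frame contains enough pixel points:

```python
# Sketch of steps S133 and S134: keep the T_l-th to T_h-th pixel points of the sorted
# depth set Dw (dropping outliers and background pixels), then average them.
import numpy as np

def object_depth(pixel_depths, p_lo=0.02, p_hi=0.5):
    dw = np.sort(np.asarray(pixel_depths, dtype=np.float64))  # ordered set Dw, size w
    w = dw.size
    t_lo, t_hi = int(w * p_lo), int(w * p_hi)                 # T_l and T_h, truncated to integers
    ds = dw[t_lo:t_hi] if t_hi > t_lo else dw                 # representative pixel points Ds
    return float(ds.mean())                                   # depth information Z of the target object
```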
Please refer to fig. 4, which is a schematic diagram of a neural network model training provided in the embodiment of the present application; the above-mentioned depth estimation method may include: the method comprises four stages of data acquisition, model training, model reasoning and depth information application, wherein the data acquisition stage is introduced above, and the model training stage is described in detail below. Optionally, in this embodiment of the application, after estimating the depth information of the target object according to the first target box and the second target box, the depth information may also be used as a training label training model, and a specific process of training the neural network model may include:
step S210: a first image and a second image are acquired, and the first image and the second image are different images acquired by a binocular camera aiming at a target object.
The implementation principle and implementation manner of step S210 are similar to those of step S110, and therefore, the implementation principle and implementation manner will not be described here, and if it is not clear, reference may be made to the description of step S110.
Step S220: and performing monocular processing on the first image and the second image to obtain a plurality of monocular images, and a target frame and depth information corresponding to the monocular images.
Please refer to fig. 5, which is a schematic process diagram of the monocular processing provided in the embodiment of the present application; the monocular processing refers to marking the position of a target frame and depth information corresponding to the target frame in each image in the binocular images. The embodiment of step S220 described above is, for example: and performing monocular processing on the first image to obtain a first monocular image, performing monocular processing on the second image to obtain a second monocular image, wherein the first monocular image and the second monocular image are independent from each other and are not connected with each other, and the first monocular image and the second monocular image are called as the monocular image in the following. The above-mentioned course of the monocular processing includes, for example: identifying a target frame of the target object on the first image by using the target detection network, where the target frame on the first image may be denoted as P1(xL, y, wL, h); wherein, xL represents the abscissa of the center point of the target object in the first image of the target frame, y represents the ordinate of the target object in the first image, wL represents the width of the target frame of the target object on the first image, and h represents the height of the target frame of the target object on the first image. Labeling a target frame P2(xR, y, wR, h) on the second image similarly; the embodiment in connection with step S130 calculates the depth information of the target object, i.e., the distance from the target object to the binocular camera may be represented as Z in the drawing. A target frame P1(xL, y, wL, h, Z) with depth information on the first monocular image and a target frame P2(xR, y, wR, h, Z) with depth information on the second monocular image may be obtained.
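A minimal sketch of what a monocularized label might look like in code is shown below; the LabeledBox structure and the numeric values are assumptions for illustration, following the P1/P2 notation above:

```python
# Hedged sketch of a monocularized training label: one target frame plus its depth Z.
from dataclasses import dataclass

@dataclass
class LabeledBox:
    x: float  # abscissa of the target-frame centre in the image (xL or xR)
    y: float  # ordinate of the target-frame centre
    w: float  # width of the target frame (wL or wR)
    h: float  # height of the target frame
    z: float  # depth information of the target object from binocular ranging

p1 = LabeledBox(x=412.0, y=230.5, w=96.0, h=64.0, z=23.7)  # label on the first monocular image
p2 = LabeledBox(x=398.0, y=230.5, w=94.0, h=64.0, z=23.7)  # label on the second monocular image
```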
Step S230: and training the neural network by taking the plurality of monocular images as training data and taking the depth information and the target frame corresponding to the plurality of monocular images as training labels to obtain a neural network model.
The embodiment of the step S230 is, for example: and training a neural network by taking the plurality of monocular images as training data and the target frames and the depth information corresponding to the plurality of monocular images as training labels, and estimating the prediction distance between the target object and the camera and the prediction frame of the target object in the images by using the neural network. Then, distance loss between the depth information of the target object and the predicted distance, and regression loss between the target frame and the predicted frame are calculated, respectively. Finally, a total loss value is calculated according to the distance loss and the regression loss, and then a network weight parameter of the neural network is updated according to the total loss value until the loss value is smaller than a preset proportion (for example, 5% or 10% and the like) or the number of training batches (epochs) is larger than a preset threshold (for example, 100 or 1000 and the like), so that the neural network model can be obtained. The neural network herein may include: VGG networks, ResNet networks, YOLO networks, MobileNet networks, Wide ResNet networks, and inclusion networks, among others.
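A hedged sketch of the combined loss described above follows, written with PyTorch for illustration; the specific loss functions (Smooth L1 for the frame, L1 for the distance) and the weights are assumptions, since the text only requires a regression loss and a distance loss:

```python
# Hedged sketch of the combined loss in step S230; loss choices and weights are assumptions.
import torch
import torch.nn as nn

box_loss_fn = nn.SmoothL1Loss()
dist_loss_fn = nn.L1Loss()

def total_loss(pred_boxes, gt_boxes, pred_depth, gt_depth, w_box=1.0, w_dist=1.0):
    regression_loss = box_loss_fn(pred_boxes, gt_boxes)  # between predicted frame and target frame
    distance_loss = dist_loss_fn(pred_depth, gt_depth)   # between predicted distance and depth label
    return w_box * regression_loss + w_dist * distance_loss
```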
In the implementation process, the neural network model trained by the monocular image is used for reasoning the image to be inferred acquired by the monocular camera, and the monocular image is obtained by performing monocular processing on the image acquired by the binocular camera. That is to say, because the neural network model is trained by using the image with higher precision and processed by the monocular image, and the pixel points representing the real distance are screened out in the process of processing by the monocular image, the accuracy of the trained neural network model is higher. The depth estimation method does not depend on an external distance measurement sensor and the like, reduces errors caused by time asynchronization or spatial misalignment of the distance measurement sensor and the monocular camera, and improves the accuracy of depth information estimation.
Please refer to fig. 6, which illustrates a flow chart of a model inference phase provided in the embodiment of the present application; the above-mentioned depth estimation method may include: the method comprises four stages of data acquisition, model training, model reasoning and depth information application, wherein the model training stage is introduced above, and the model reasoning stage is described in detail below. Optionally, in this embodiment of the application, after obtaining the neural network model, the trained neural network model may be further used to perform inference on an image (also referred to as a monocular image) acquired by the monocular camera, and the process of performing inference on the monocular image may include:
step S310: and acquiring an image to be inferred, wherein the image to be inferred is acquired by using a monocular camera.
The acquiring method of the image to be inferred in the step S310 includes: the first acquisition mode is that a monocular camera, a video recorder, a color camera or other terminal equipment is used for shooting a target object to obtain an image to be inferred; then the terminal equipment sends an image to be reasoned to the electronic equipment, then the electronic equipment receives the image to be reasoned sent by the terminal equipment, and the electronic equipment can store the image to be reasoned into a file system, a database or mobile storage equipment; the second obtaining mode is to obtain a pre-stored image to be inferred, and specifically includes: acquiring an image to be inferred from a file system, or acquiring the image to be inferred from a database, or acquiring the image to be inferred from mobile storage equipment; and in the third acquisition mode, software such as a browser is used for acquiring the image to be inferred on the Internet, or other application programs are used for accessing the Internet to acquire the image to be inferred.
Step S320: and reasoning the image to be inferred by using a neural network model to obtain a predicted target frame and estimated depth information corresponding to the predicted target frame, wherein the neural network model is obtained after training by using a monocular image, and the monocular image is obtained by performing monocular processing on an image acquired by a binocular camera.
The embodiment of the step S320 includes: reasoning an image to be inferred by using a neural network model to obtain a prediction target frame and estimation depth information corresponding to the prediction target frame; the depth estimation method can also be applied to electronic products such as an auxiliary driving system, an unmanned aerial vehicle, an unmanned vehicle or a robot.
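A minimal inference sketch, under the assumption that the trained model returns predicted frames and estimated depths as a pair, might look like this; the interface is illustrative, not defined by the patent:

```python
# Hedged sketch of the model inference stage; the model's output format is an assumption.
import torch

def infer(model, image_tensor):
    model.eval()
    with torch.no_grad():
        boxes, depths = model(image_tensor.unsqueeze(0))  # add a batch dimension
    return boxes[0], depths[0]                            # predictions for the single input image
```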
The above-mentioned depth estimation method may include: the method comprises four stages of data acquisition, model training, model reasoning and depth information application, wherein the model reasoning stage is introduced above, and the depth information application stage is described in detail below.
Optionally, after obtaining the predicted target frame and the estimated depth information corresponding to the predicted target frame, the method may further perform early warning or deceleration on an electronic product such as an unmanned aerial vehicle, an unmanned vehicle, or a robot, where the early warning or deceleration process may include: judging whether the distance corresponding to the estimated depth information is smaller than a preset distance or not; and if the distance corresponding to the estimated depth information is smaller than the preset distance, generating and outputting an auxiliary early warning signal, or decelerating through a braking system. This embodiment is, for example: an electronic product of an intelligent vehicle, an unmanned aerial vehicle, an unmanned vehicle or a robot of the auxiliary driving system judges whether the distance corresponding to the estimated depth information is smaller than a preset distance; if the distance corresponding to the estimated depth information is smaller than the preset distance, an auxiliary early warning signal is generated and output by electronic products such as an intelligent vehicle, an unmanned aerial vehicle, an unmanned vehicle or a robot of the auxiliary driving system, or the electronic products such as the intelligent vehicle, the unmanned aerial vehicle or the robot of the auxiliary driving system decelerate through a braking system; the preset distance may be set according to specific situations, for example, the preset distance is set to be 100 meters or 150 meters, and the like.
In the implementation process, when the distance corresponding to the estimated depth information is smaller than the preset distance, the auxiliary early warning signal is generated and output, or the speed is reduced through the braking system, so that the problem that the traffic accident is caused by untimely response or untimely braking acceleration is avoided, and the safety of electronic products such as an intelligent vehicle, an unmanned aerial vehicle, an unmanned vehicle or a robot of an auxiliary driving system is effectively improved.
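The decision logic can be sketched as follows; the 100 m threshold and the warn/brake callbacks are illustrative placeholders, not APIs from the patent:

```python
# Hedged sketch of the early-warning / deceleration decision.
def handle_estimated_depth(estimated_depth_m, warn, brake, preset_distance_m=100.0):
    if estimated_depth_m < preset_distance_m:
        warn("target object ahead at %.1f m" % estimated_depth_m)  # auxiliary early-warning signal
        brake()                                                    # or decelerate through the braking system
```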
Optionally, if the depth estimation method is applied to an electronic product such as an unmanned aerial vehicle, an unmanned vehicle, or a robot, after obtaining the predicted target frame and the estimated depth information corresponding to the predicted target frame, the path may be re-planned to actively avoid a target object in the predicted target frame (the target object may be a road block, a pedestrian, or an oncoming vehicle, etc.), and the process of re-planning the path may include: judging whether the ratio value of the prediction target frame and the image to be inferred is larger than a preset ratio or not; and if the ratio value of the predicted target frame and the image to be inferred is larger than the preset ratio, re-planning the path to avoid the target object in the predicted target frame. This embodiment is, for example: electronic products such as an intelligent vehicle, an unmanned aerial vehicle, an unmanned vehicle or a robot of the auxiliary driving system judge whether the ratio value of the prediction target frame and the image to be inferred is larger than a preset ratio or not; if the ratio of the predicted target frame to the image to be inferred is larger than the preset ratio, the electronic products of an intelligent vehicle, an unmanned aerial vehicle, an unmanned vehicle or a robot of the auxiliary driving system replans the path to avoid the target object in the predicted target frame, wherein the target object can be an obstacle in the road; the preset proportion here may be set according to specific situations, for example, the preset proportion is set to 80% or 90%, and so on.
In the implementation process, when the ratio of the predicted target frame to the image to be inferred is greater than the preset ratio, the path is re-planned so that the target object in the predicted target frame is avoided, which effectively improves the safety of electronic products such as intelligent vehicles with driver-assistance systems, unmanned aerial vehicles, unmanned vehicles, and robots. A minimal sketch of this ratio test is given below.
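The ratio test can be sketched as follows; interpreting the "ratio" as the area of the predicted target frame divided by the image area is an assumption made for illustration (the text only speaks of a ratio value), and the helper name is hypothetical.

def should_replan(box_wh: tuple, image_wh: tuple, preset_ratio: float = 0.8) -> bool:
    # Ratio of the predicted target frame to the image to be inferred,
    # interpreted here (as an assumption) as an area ratio.
    box_area = box_wh[0] * box_wh[1]
    image_area = image_wh[0] * image_wh[1]
    return box_area / image_area > preset_ratio

# Example: a 1000x600 target frame in a 1280x720 image covers about 65% of the
# frame, so with the 80% preset ratio no re-planning is triggered yet.
print(should_replan((1000, 600), (1280, 720)))  # False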
Please refer to fig. 7, which is a schematic structural diagram of a depth estimation apparatus according to an embodiment of the present application. An embodiment of the present application provides a depth estimation apparatus 400, which includes:
the inference image acquisition module 410 is used for acquiring an image to be inferred, wherein the image to be inferred is acquired by using a monocular camera;
the image depth estimation module 420 is configured to perform inference on the image to be inferred by using a neural network model to obtain a predicted target frame in the image to be inferred and estimated depth information corresponding to the predicted target frame, where the neural network model is obtained after training with monocular images, the monocular images are obtained by performing monocular processing on images acquired by a binocular camera, the predicted target frame represents the position area where a target object in the image to be inferred is located, and the estimated depth information is the distance between the target object and the monocular camera. A schematic sketch of how these two modules could be used at inference time is given below.
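As a rough illustration of how the two modules above could be wired together at inference time, consider the sketch below; the model object and its predict call are placeholder assumptions rather than an interface defined by this application.

import cv2  # assumed available for reading the frame from the monocular camera

def estimate_depths(image_path: str, model):
    # Inference image acquisition module: obtain the image to be inferred.
    image = cv2.imread(image_path)
    # Image depth estimation module: the trained network returns predicted
    # target frames and, for each frame, the estimated distance to the camera.
    boxes, depths = model.predict(image)  # placeholder interface
    return list(zip(boxes, depths))       # [(target_frame, distance_m), ...]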
Optionally, in an embodiment of the present application, the depth estimation apparatus further includes:
the binocular image acquisition module is used for acquiring a first image and a second image, where the first image and the second image are different images of a target object acquired by a binocular camera, and the distance between the aperture center points of the binocular camera is equal to a pre-calculated baseline value;
the image monocular processing module is used for performing monocular processing on the first image and the second image to obtain a plurality of monocular images, together with the target frames and depth information corresponding to the monocular images;
and the network model training module is used for training a neural network, with the plurality of monocular images as training data and the target frames and depth information corresponding to the plurality of monocular images as training labels, to obtain the neural network model. A sketch of this training-data preparation is given after this list.
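A hedged sketch of the training-data preparation implied by these three modules follows; the label_fn callback and the final training call are hypothetical stand-ins for whatever annotation and training code is actually used.

def build_training_set(stereo_pairs, label_fn):
    samples = []
    for first_image, second_image in stereo_pairs:
        # Monocular processing: label both images and compute the object depth
        # from the stereo pair (see the depth calculation modules below).
        box_1, box_2, depth = label_fn(first_image, second_image)
        # Each image of the pair becomes an independent monocular sample whose
        # labels are its target frame and the shared depth information.
        samples.append((first_image, box_1, depth))
        samples.append((second_image, box_2, depth))
    return samples

# model = train_network(build_training_set(pairs, annotate_pair))  # hypothetical call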
Optionally, in an embodiment of the present application, the image monocular processing module includes:
the target image labeling module is used for labeling a first target frame of the target object on the first image and labeling a second target frame of the target object on the second image, where a target frame represents the position area of the target object in the image;
and the depth information calculation module is used for calculating the depth information of the target object according to the first target frame and the second target frame.
Optionally, in an embodiment of the present application, the depth information calculating module includes:
the target frame screening module is used for screening out, from the first target frame and the second target frame, the target frame containing the smaller number of pixel points as the pixel target frame;
the depth information calculation module is used for calculating, according to the binocular ranging principle, the depth information of all pixel points in the pixel target frame, where the depth information of a pixel point is the distance between the binocular camera and the position point on the target object corresponding to that pixel point;
the depth information screening module is used for eliminating outlier pixel points and background pixel points from all pixel points in the pixel target frame to obtain a plurality of representative pixel points;
and the depth information determining module is used for calculating the average value of the depth information of the plurality of representative pixel points and determining this average value as the depth information of the target object. A sketch of this procedure is given after this list.
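The four sub-modules above can be illustrated with the sketch below; the median-absolute-deviation rule used to discard outlier and background pixels is an assumption chosen for the example, since the text does not prescribe a particular screening rule.

import numpy as np

def object_depth(depths_box1: np.ndarray, depths_box2: np.ndarray) -> float:
    # Target frame screening: keep the target frame with the smaller number of
    # pixel points (each array holds per-pixel depths for one target frame).
    depths = depths_box1 if depths_box1.size < depths_box2.size else depths_box2
    # Depth information screening: discard outlier / background pixels, here
    # via a simple median-absolute-deviation filter (illustrative choice).
    med = np.median(depths)
    mad = np.median(np.abs(depths - med)) + 1e-6
    representative = depths[np.abs(depths - med) < 3 * mad]
    # Depth information determination: the mean of the representative pixel
    # depths is taken as the depth information of the target object.
    return float(representative.mean())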
Optionally, in this embodiment of the present application, the binocular camera includes a first camera and a second camera, and the depth information calculation module includes:
the camera parallax obtaining module is used for performing feature matching on the first target frame and the second target frame to obtain the parallax between the first camera and the second camera;
and the pixel depth calculation module is used for calculating the depth information of all pixel points in the pixel target frame according to the baseline value and the parallax. A sketch of the underlying binocular ranging relation is given after this list.
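These two modules rest on the standard binocular ranging relation Z = f * B / d (focal length f in pixels, baseline B in meters, disparity d in pixels); the text does not restate the formula here, so the sketch below should be read as the usual stereo-geometry assumption rather than the application's exact computation.

def depth_from_disparity(disparity_px: float, baseline_m: float, focal_px: float) -> float:
    # Depth of a pixel from the pre-calculated baseline value and the parallax
    # (disparity) obtained by feature matching of the two target frames.
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a valid depth")
    return focal_px * baseline_m / disparity_px

# Example: with f = 1000 px, B = 0.12 m and d = 8 px the depth is 15 m.
print(depth_from_disparity(8.0, 0.12, 1000.0))  # 15.0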
Optionally, in this embodiment of the present application, the depth estimation apparatus may further include:
the depth distance judging module is used for judging whether the distance corresponding to the estimated depth information is smaller than a preset distance;
and the auxiliary early-warning and deceleration module is used for, if the distance corresponding to the estimated depth information is smaller than the preset distance, generating and outputting an auxiliary early-warning signal, or decelerating through a braking system.
Optionally, in this embodiment of the present application, the depth estimation apparatus may further include:
and the frame proportion judging module is used for judging whether the proportion value of the predicted target frame and the image to be inferred is larger than the preset proportion.
And the path replanning module is used for replanning the path to avoid the target object in the predicted target frame if the ratio value of the predicted target frame to the image to be inferred is greater than the preset ratio.
It should be understood that this apparatus corresponds to the depth estimation method embodiment described above and can perform the steps involved in that method embodiment; for the specific functions of the apparatus, reference may be made to the description above, and a detailed description is omitted here as appropriate to avoid redundancy. The apparatus includes at least one software functional module that can be stored in a memory in the form of software or firmware, or built into the operating system (OS) of the device.
An electronic device provided in an embodiment of the present application includes a processor and a memory, where the memory stores machine-readable instructions executable by the processor; when the machine-readable instructions are executed by the processor, the method described above is performed.
An embodiment of the present application further provides a storage medium, on which a computer program is stored; when the computer program is executed by a processor, the method described above is performed.
The storage medium may be implemented by any type of volatile or nonvolatile storage device or combination thereof, such as a Static Random Access Memory (SRAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), an Erasable Programmable Read-Only Memory (EPROM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), a magnetic Memory, a flash Memory, a magnetic disk, or an optical disk.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
In addition, functional modules of the embodiments in the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an alternative embodiment of the embodiments of the present application, but the scope of the embodiments of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the embodiments of the present application, and all the changes or substitutions should be covered by the scope of the embodiments of the present application.

Claims (10)

1. A method of depth estimation, comprising:
acquiring an image to be inferred, wherein the image to be inferred is acquired by using a monocular camera;
and reasoning the image to be inferred by using a neural network model to obtain a predicted target frame in the image to be inferred and estimated depth information corresponding to the predicted target frame, wherein the neural network model is obtained after training by using a monocular image, the monocular image is obtained by performing monocular processing on an image acquired by a binocular camera, the predicted target frame represents a position area where a target object in the image to be inferred is located, and the estimated depth information is the distance between the target object and the monocular camera.
2. The method of claim 1, further comprising, before said using a neural network model to reason about the image to be inferred:
acquiring a first image and a second image, wherein the first image and the second image are different images acquired by the binocular camera aiming at a target object, and the distance between the aperture center points of the binocular camera is equal to a pre-calculated baseline value;
performing monocular processing on the first image and the second image to obtain a plurality of monocular images, and a target frame and depth information corresponding to the monocular images;
and training a neural network by taking the plurality of monocular images as training data and taking target frames and depth information corresponding to the plurality of monocular images as training labels to obtain a neural network model.
3. The method of claim 2, wherein said monocular processing of the first image and the second image comprises:
marking a first target frame of the target object on the first image, and marking a second target frame of the target object on the second image, wherein the target frame represents a position area of the target object in the image;
and calculating the depth information of the target object according to the first target frame and the second target frame.
4. The method of claim 3, wherein the calculating the depth information of the target object according to the first target box and the second target box comprises:
screening out, from the first target frame and the second target frame, the target frame containing the smaller number of pixel points as a pixel target frame;
calculating the depth information of all pixel points in the pixel target frame, wherein the depth information of the pixel points is the distance between the position point of the pixel points in the target object and the binocular camera;
removing outlier pixel points and background pixel points from all pixel points in the pixel target frame to obtain a plurality of representative pixel points;
and calculating the depth information average value of the plurality of representative pixel points, and determining the depth information average value as the depth information of the target object.
5. The method of claim 4, wherein the binocular camera comprises: a first camera and a second camera; the calculating the depth information of all pixel points in the pixel target frame includes:
performing feature matching on the first target frame and the second target frame to obtain a parallax between the first camera and the second camera;
and calculating the depth information of all pixel points in the pixel target frame according to the baseline value and the parallax.
6. The method according to any one of claims 1 to 5, further comprising, after obtaining the predicted target frame in the image to be inferred and the estimated depth information corresponding to the predicted target frame:
judging whether the distance corresponding to the estimated depth information is smaller than a preset distance or not;
if yes, generating and outputting an auxiliary early warning signal, or decelerating through a braking system.
7. The method according to any one of claims 1 to 5, further comprising, after obtaining the predicted target frame in the image to be inferred and the estimated depth information corresponding to the predicted target frame:
judging whether the ratio of the prediction target frame to the image to be inferred is larger than a preset ratio or not;
if so, re-planning the path to avoid the target object in the predicted target frame.
8. A depth estimation device, comprising:
the system comprises a reasoning image acquisition module, a reasoning image acquisition module and a reasoning image acquisition module, wherein the reasoning image acquisition module is used for acquiring an image to be reasoned, and the image to be reasoned is acquired by using a monocular camera;
the image depth estimation module is used for reasoning the image to be inferred by using a neural network model to obtain a prediction target frame in the image to be inferred and estimation depth information corresponding to the prediction target frame, the neural network model is obtained after training by using a monocular image, the monocular image is obtained by performing monocular processing on an image acquired by a binocular camera, the prediction target frame represents a position area where a target object in the image to be inferred is located, and the estimation depth information is the distance between the target object and the monocular camera.
9. An electronic device, comprising: a processor and a memory, the memory storing machine-readable instructions executable by the processor, the machine-readable instructions, when executed by the processor, performing the method of any of claims 1 to 7.
10. A storage medium, having stored thereon a computer program which, when executed by a processor, performs the method of any one of claims 1 to 7.
CN202110488366.8A 2021-04-29 2021-04-29 Depth estimation method and device, electronic equipment and storage medium Pending CN113177976A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110488366.8A CN113177976A (en) 2021-04-29 2021-04-29 Depth estimation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110488366.8A CN113177976A (en) 2021-04-29 2021-04-29 Depth estimation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113177976A true CN113177976A (en) 2021-07-27

Family

ID=76928139

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110488366.8A Pending CN113177976A (en) 2021-04-29 2021-04-29 Depth estimation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113177976A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113643359A (en) * 2021-08-26 2021-11-12 广州文远知行科技有限公司 Target object positioning method, device, equipment and storage medium
CN116051628A (en) * 2023-01-16 2023-05-02 北京卓翼智能科技有限公司 Unmanned aerial vehicle positioning method and device, electronic equipment and storage medium
CN116051628B (en) * 2023-01-16 2023-10-27 北京卓翼智能科技有限公司 Unmanned aerial vehicle positioning method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
JP2023523243A (en) Obstacle detection method and apparatus, computer device, and computer program
CN108692719B (en) Object detection device
CN111932901B (en) Road vehicle tracking detection apparatus, method and storage medium
US20170091565A1 (en) Object detection apparatus, object detection method, and program
CN113228043A (en) System and method for obstacle detection and association of mobile platform based on neural network
CN108645375B (en) Rapid vehicle distance measurement optimization method for vehicle-mounted binocular system
US10832428B2 (en) Method and apparatus for estimating a range of a moving object
CN110220500B (en) Binocular camera-based distance measurement method for unmanned driving
CN113177976A (en) Depth estimation method and device, electronic equipment and storage medium
JP6552448B2 (en) Vehicle position detection device, vehicle position detection method, and computer program for vehicle position detection
CN114495064A (en) Monocular depth estimation-based vehicle surrounding obstacle early warning method
DE102021132853A1 (en) CAMERA CALIBRATION BASED ON DEEP LEARNING
US11866056B2 (en) Ballistic estimation of vehicle data
JPH09226490A (en) Detector for crossing object
JP2018048949A (en) Object recognition device
KR20170039465A (en) System and Method for Collecting Traffic Information Using Real time Object Detection
CN107458308B (en) Driving assisting method and system
EP3813020A1 (en) Camera orientation estimation
EP2778603B1 (en) Image processing apparatus and image processing method
CN111612818A (en) Novel binocular vision multi-target tracking method and system
EP3288260B1 (en) Image processing device, imaging device, equipment control system, equipment, image processing method, and carrier means
CN110781730A (en) Intelligent driving sensing method and sensing device
US20220309776A1 (en) Method and system for determining ground level using an artificial neural network
JPH1139464A (en) Image processor for vehicle
EP4300423A1 (en) Robust stereo camera image processing method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination