WO2022165722A1 - Monocular depth estimation method, apparatus and device - Google Patents

Monocular depth estimation method, apparatus and device

Info

Publication number
WO2022165722A1
Authority
WO
WIPO (PCT)
Prior art keywords
map
camera
image
dsn
estimated
Prior art date
Application number
PCT/CN2021/075318
Other languages
English (en)
French (fr)
Inventor
摩拉莱斯•斯皮诺扎•卡洛斯•埃曼纽尔
李正卿
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to PCT/CN2021/075318
Publication of WO2022165722A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/50 - Depth or shape recovery

Definitions

  • the embodiments of the present application relate to the field of computer vision, and in particular, to a method, apparatus, and device for monocular depth estimation.
  • Monocular depth estimation uses images captured by a single camera as input to estimate real-world depth images (depth maps). Each pixel in the depth map stores a depth value, and the depth value is the distance between the three-dimensional (3-dimension, 3D) coordinate point of the real world corresponding to the pixel and the viewpoint of the camera. Monocular depth estimation can be applied to many important application scenarios that require 3D environmental information. These application scenarios include but are not limited to augmented reality (AR), navigation (such as autonomous driving), scene reconstruction, scene recognition, object detection, etc.
  • The monocular camera used in monocular depth estimation (MDE) is usually an RGB or grayscale camera.
  • RGB or grayscale (Gray) cameras can capture higher-quality images when the lighting is good, the light ratio is small, and the camera/scene motion is stable.
  • Taking the monocular camera being an RGB camera as an example, the depth map is obtained by acquiring two RGB frames from the monocular camera and performing a stereo matching calculation on the two frames.
  • However, the above method of estimating depth by stereo matching on two RGB frames has the following problems.
  • The monocular camera needs to be moving while the target object in the real world remains static, and the method requires rich texture detail in the environment and good lighting conditions.
  • In real life, the target objects in a scene are often dynamic, such as cars on a road, which makes the above monocular depth estimation method unsuitable for depth estimation of target objects in real life.
  • the present application provides a monocular depth estimation method, apparatus and device, which are suitable for depth estimation of target objects in general or common scenes in real life.
  • an embodiment of the present application provides a monocular depth estimation method.
  • The method may include: acquiring an image to be estimated and a first parameter corresponding to the image to be estimated, where the first parameter is a camera calibration parameter of the camera that captures the image to be estimated.
  • The image to be estimated is input into a first neural network model, and a first distance-scaled normal (DSN) map output by the first neural network model is obtained, where the first DSN map is used to represent the orientation of the plane of the target object corresponding to the image to be estimated and the distance between this plane and the camera.
  • According to the image to be estimated and the first parameter, a first camera filter map is determined, where the first camera filter map is used to represent the mapping relationship between the 3D points of the target object in space and the 2D plane, and the 2D plane is the imaging plane of the camera.
  • a first depth map corresponding to the image to be estimated is determined according to the first DSN map and the first camera filter map.
  • This implementation obtains the depth map based on the first DSN map and the first camera filter map, and the depth map can accurately reflect the distance of the target object, thereby improving the accuracy of monocular depth estimation. Different from the method of estimating depth by stereo matching on two RGB frames, this implementation can perform depth estimation on a single frame of the image to be estimated, without the scene limitation that the monocular camera must be in motion and the real-world target object must be stationary, and can therefore be applied to depth estimation of target objects in general or common scenes in real life.
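  • For illustration only, the following sketch (not part of the patent text) shows the described two-stage pipeline in Python, assuming the network output and the camera filter map are stored as H x W x 3 arrays and that the per-pixel combination is a dot product followed by taking the reciprocal; the function and variable names are placeholders.

```python
import numpy as np

def estimate_depth(image, dsn_model, camera_filter_map):
    """Sketch of the described pipeline: a neural network predicts a DSN map,
    which is combined per pixel with a precomputed camera filter map to obtain depth.
    The dot-product combination is an assumption, not a formula quoted from the text."""
    dsn_map = dsn_model(image)                                        # (H, W, 3) DSN vectors N
    inv_depth = np.einsum('hwc,hwc->hw', dsn_map, camera_filter_map)  # per-pixel N . F
    depth = 1.0 / np.clip(inv_depth, 1e-6, None)                      # depth is the reciprocal of inverse depth
    return depth
```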
  • In a possible design, the first neural network model is obtained by training using a training image and a second DSN map corresponding to the training image, where the second DSN map is determined according to the second depth map corresponding to the training image and the camera calibration parameters corresponding to the training image.
  • Since the first neural network model is obtained by training using the training image and the second DSN map corresponding to the training image, the first neural network model has the ability to output the DSN map corresponding to an input image; the depth map corresponding to the input image can then be obtained from that DSN map and the camera filter map corresponding to the input image, thereby achieving monocular depth estimation.
  • this training image is used as input to the initial neural network model.
  • the loss function includes at least one of a first loss function, a second loss function or a third loss function, and the loss function is used to adjust parameters of the initial neural network model to obtain the first neural network model through training.
  • The first loss function is used to represent the error between the second DSN map and the third DSN map, where the third DSN map is the DSN map corresponding to the training image output by the initial neural network model.
  • The second loss function is used to represent the error between the second depth map and the third depth map, where the third depth map is determined according to the third DSN map and the second camera filter map, and the second camera filter map is determined according to the camera calibration parameters corresponding to the training image and the training image. The third loss function is used to represent the matching degree between the second depth map and the third depth map.
  • By evaluating one or more of the error between the second DSN map and the third DSN map, the error between the second depth map and the third depth map, or the matching degree between the second depth map and the third depth map, the neural network model is adjusted so that the adjusted neural network model meets one or more accuracy requirements, thereby improving the accuracy of the monocular depth estimation method of the embodiments of the present application that uses the trained neural network model.
  • the training image may be an image captured by any camera, such as an RGB camera, a grayscale camera, a night vision camera, a thermal camera, a panoramic camera, an event camera, or an infrared camera.
  • In this way, the neural network model is trained using images captured by any camera such as an RGB camera, a grayscale camera, a night vision camera, a thermal camera, a panoramic camera, an event camera, or an infrared camera, so that the monocular depth estimation method can support depth estimation for images captured by different types of cameras.
  • In a possible design, determining the first camera filter map according to the image to be estimated and the first parameter includes: determining the first camera filter map according to the position coordinates of the pixels of the image to be estimated and the first parameter, where the first camera filter map includes a camera filter map vector corresponding to each pixel, the camera filter map vector is used to represent the mapping relationship between a 3D point and the pixel, and the pixel is the point at which the 3D point is projected onto the 2D plane.
  • the first camera filter map is determined according to the position coordinates of the pixel points of the image to be estimated and the first parameter.
  • the camera filter map is related to the pixel points of the input image and the camera model, and is not affected by the 3D structure of the target object in the scene.
  • For images captured by the same camera, the corresponding camera filter maps are the same and need to be calculated only once.
  • If the camera is changed, the camera filter map can be recalculated according to the camera calibration parameters of the new camera.
  • the depth map is obtained through the camera filter map and DSN map, which can improve the processing speed of monocular depth estimation.
  • In a possible design, the position coordinates of the pixel include an abscissa and an ordinate, and the camera filter map vector corresponding to the pixel includes a first camera filter map component and a second camera filter map component. The first camera filter map component is determined according to the abscissa and the first parameter; the second camera filter map component is determined according to the ordinate and the first parameter, or according to the abscissa, the ordinate and the first parameter.
  • In a possible design, when the field of view of the camera that captures the image to be estimated is less than 180 degrees, the first parameter includes the center coordinates (c_x, c_y) and the focal length (f_x, f_y) of the camera; the first camera filter map component is determined according to the abscissa and the first parameter, and the second camera filter map component is determined according to the ordinate and the first parameter.
  • In a possible design, when the field of view of the camera that captures the image to be estimated is less than 180 degrees, the first parameter includes the center coordinates (c_x, c_y) and the focal length (f_x, f_y) of the camera, the first camera filter map component is F_u, and the second camera filter map component is F_v.
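  • As an illustration of the case above (field of view less than 180 degrees), the following sketch computes a candidate camera filter map from (c_x, c_y) and (f_x, f_y); the back-projection formula used here is the standard pinhole one and is an assumption, since the exact expressions for F_u and F_v are not reproduced in this text.

```python
import numpy as np

def pinhole_camera_filter_map(width, height, fx, fy, cx, cy):
    """Candidate camera filter map for a camera with FOV < 180 degrees, using the
    standard pinhole back-projection F = ((u - cx)/fx, (v - cy)/fy, 1) as an assumed
    form: the first component depends on the abscissa u, the second on the ordinate v."""
    u, v = np.meshgrid(np.arange(width), np.arange(height))  # pixel coordinates, shape (H, W)
    Fu = (u - cx) / fx                                       # first camera filter map component
    Fv = (v - cy) / fy                                       # second camera filter map component
    Fz = np.ones_like(Fu, dtype=float)
    return np.stack([Fu, Fv, Fz], axis=-1)                   # (H, W, 3), computed once per camera
```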
  • In a possible design, when the field of view of the camera that captures the image to be estimated is greater than 180 degrees, the first parameter includes the width pixel value W and the height pixel value H of the image to be estimated; the first camera filter map component is determined according to the abscissa and the first parameter, and the second camera filter map component is determined according to the abscissa, the ordinate and the first parameter.
  • In a possible design, the first DSN map includes the first DSN vector corresponding to each pixel of the image to be estimated, and determining the first depth map corresponding to the image to be estimated according to the first DSN map and the first camera filter map includes: determining the depth value corresponding to the pixel according to the first DSN vector corresponding to the pixel and the camera filter map vector corresponding to the pixel, where the first depth map includes the depth values corresponding to the pixels.
  • In a possible design, determining the depth value corresponding to the pixel according to the first DSN vector corresponding to the pixel and the camera filter map vector corresponding to the pixel includes: first determining the inverse depth value corresponding to the pixel according to the first DSN vector and the camera filter map vector, and then determining the depth value corresponding to the pixel according to the inverse depth value, where ρ is the inverse depth value corresponding to the pixel, N is the first DSN vector corresponding to the pixel, and F is the camera filter map vector corresponding to the pixel.
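  • The exact formula relating ρ, N and F is not reproduced above; the following small numerical check (an illustration, not the patent's derivation) shows the standard plane-geometry reading that is consistent with these definitions: for a plane with unit normal n at distance d from the camera, every point X on it satisfies n·X = d, and writing X = z·F gives 1/z = (n/d)·F, i.e., the inverse depth is the dot product of the distance-scaled normal N = n/d and the camera filter map vector F.

```python
import numpy as np

n = np.array([0.0, 0.0, 1.0])      # unit normal of a plane facing the camera
d = 2.5                            # distance from the camera viewpoint to the plane
N = n / d                          # distance-scaled normal (DSN vector)
F = np.array([0.1, -0.2, 1.0])     # example camera filter map vector (back-projected ray)

z = d / (n @ F)                    # depth of the ray/plane intersection along F
print(np.isclose(1.0 / z, N @ F))  # True: inverse depth rho equals N . F
```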
  • the method may further include: acquiring a training image, a second depth image corresponding to the training image, and camera calibration parameters corresponding to the training image.
  • the initial neural network model is trained by using the training image, the second depth image corresponding to the training image, and the camera calibration parameters corresponding to the training image, and the first neural network model is obtained.
  • In a possible design, training the initial neural network model using the training image, the second depth image corresponding to the training image, and the camera calibration parameters corresponding to the training image to obtain the first neural network model includes: determining a second camera filter map according to the camera calibration parameters corresponding to the training image and the training image, where the second camera filter map includes the camera filter map vectors corresponding to the pixels of the training image.
  • A second DSN map is obtained according to the second camera filter map and the second depth image, where the second DSN map is used to represent the orientation of the plane on which the 3D point in the scene corresponding to each pixel is located and the distance between that plane and the camera.
  • the training image is input to the initial neural network model, and the third DSN map output by the initial neural network model is obtained.
  • According to the loss function, the parameters of the initial neural network model are adjusted to obtain the first neural network model.
  • In a possible design, the camera filter map vector corresponding to a pixel i = (u, v) of the training image is F, and obtaining the second DSN map according to the second camera filter map and the second depth image includes: determining, for each pixel i, the DSN vector N_i = (N_xi, N_yi, N_zi) of the plane on which the 3D point in the scene corresponding to pixel i is located, according to the camera filter map vector corresponding to pixel i and the inverse depth value ρ_i of that 3D point, where the inverse depth value is the reciprocal of the depth value. The second DSN map includes the DSN vectors of the planes on which the 3D points in the scene corresponding to the pixels of the training image are located.
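  • For illustration, one plausible way (an assumption, not the procedure claimed above) to build such a second DSN map from a depth image and a camera filter map is to solve, for each pixel, the DSN vector N that best satisfies ρ_j = N·F_j over a small neighborhood:

```python
import numpy as np

def dsn_from_depth(depth, filter_map, k=3):
    """Build a DSN map from a depth map and a camera filter map by solving, for each
    pixel, rho_j = N . F_j over a k x k neighborhood in the least-squares sense.
    This neighborhood-fitting scheme is illustrative; the patent's exact procedure
    is not reproduced here."""
    H, W, _ = filter_map.shape
    inv_depth = 1.0 / np.clip(depth, 1e-6, None)   # rho is the reciprocal of depth
    dsn = np.zeros((H, W, 3))
    r = k // 2
    for y in range(r, H - r):
        for x in range(r, W - r):
            F = filter_map[y - r:y + r + 1, x - r:x + r + 1].reshape(-1, 3)
            rho = inv_depth[y - r:y + r + 1, x - r:x + r + 1].reshape(-1)
            N, _, _, _ = np.linalg.lstsq(F, rho, rcond=None)
            dsn[y, x] = N
    return dsn
```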
  • In a possible design, the first loss function measures, over the entire set of valid pixels, the error between the DSN vector corresponding to each pixel i in the second DSN map and the DSN vector N_i = (N_xi, N_yi, N_zi) corresponding to pixel i in the third DSN map.
  • In a possible design, the second loss function measures, over the entire set of valid pixels, the norm of the error between the inverse depth value corresponding to each pixel i in the second depth map and the inverse depth value ρ_i corresponding to pixel i in the third depth map.
  • In the third loss function, I is the image data in the second depth map that matches the third depth map, and the loss is computed over the matching set of pixels.
  • In a possible design, the loss function is a combination of the above loss functions weighted by λ_DEP, λ_DSN and λ_INP respectively, where λ_DEP, λ_DSN and λ_INP are each greater than or equal to 0.
  • acquiring the training image and the second depth image corresponding to the training image includes at least one of the following:
  • Acquire multiple training images, which are image data obtained by shooting scenes with multiple calibrated and synchronized cameras, and use 3D vision technology to process the multiple training images to obtain the second depth images corresponding to the multiple training images; or,
  • the data optimization includes at least one of hole filling optimization, sharpening occlusion edge optimization, or temporal consistency optimization.
  • the environmental conditions of the scene captured by the original image are changed to obtain the training images under different environmental conditions.
  • The second depth image corresponding to each training image is obtained by a depth sensor; or, the second depth image corresponding to each training image is the depth image output by a teacher monocular depth estimation network after it processes the input training image.
  • an embodiment of the present application provides a monocular depth estimation apparatus, which has a function of implementing the first aspect or any possible design of the first aspect.
  • the functions can be implemented by hardware, and can also be implemented by hardware executing corresponding software.
  • the hardware or software includes one or more modules corresponding to the above functions, for example, an acquisition unit or module, a DSN unit or module, a camera filter mapping unit or module, and a depth estimation unit or module.
  • embodiments of the present application provide an electronic device, which may include: one or more processors; one or more memories; wherein the one or more memories are used to store one or more programs ; the one or more processors are configured to run the one or more programs to implement the method according to the first aspect or any possible design of the first aspect.
  • Embodiments of the present application provide a computer-readable storage medium, which is characterized by comprising a computer program that, when executed on a computer, causes the computer to execute the method according to the first aspect or any possible design of the first aspect.
  • An embodiment of the present application provides a chip, which is characterized in that it includes a processor and a memory, where the memory is used for storing a computer program, and the processor is used for calling and running the computer program stored in the memory to perform the method according to the first aspect or any possible design of the first aspect.
  • embodiments of the present application provide a computer program product, which, when the computer program product runs on a computer, causes the computer to execute the method described in the first aspect or any possible design of the first aspect.
  • In the embodiments of the present application, the image to be estimated and the camera calibration parameters are obtained, the first DSN map corresponding to the image to be estimated is obtained through the first neural network model, the first camera filter map is determined according to the camera calibration parameters, and a depth map is then obtained based on the first DSN map and the first camera filter map. Different from directly using a neural network model to output the depth map, the embodiments of the present application obtain the depth map based on the first DSN map and the first camera filter map, and this depth map can accurately reflect the distance of the target object, thereby improving the accuracy of monocular depth estimation.
  • In addition, the embodiments of the present application can perform depth estimation on a single frame of the image to be estimated, without the limitation that the monocular camera must be in motion and the real-world target object must be in a static state, and can therefore be applied to depth estimation of target objects in general or common real-life scenes.
  • FIG. 1 is a schematic diagram of a system architecture 100 provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a convolutional neural network (CNN) 200 provided by an embodiment of the present application;
  • FIG. 3 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a system architecture 400 provided by an embodiment of the present application.
  • FIG. 5 is a flowchart of a monocular depth estimation method provided by an embodiment of the present application.
  • FIG. 6 is a flowchart of another monocular depth estimation method provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a geometric correspondence between a 3D point and a perspective (pinhole) camera model in a scene provided by an embodiment of the present application;
  • FIG. 8 is a schematic diagram of a monocular depth estimation processing process provided by an embodiment of the present application.
  • FIG. 9 is a flowchart of a method for training a first neural network model provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a training process provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of a training process of a first neural network model provided by an embodiment of the application.
  • FIG. 12 is a schematic diagram of a training process of a first neural network model provided by an embodiment of the application.
  • FIG. 13 is a schematic structural diagram of a monocular depth estimation apparatus provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of an electronic device provided by an embodiment of the application.
  • FIG. 15 is a schematic structural diagram of another monocular depth estimation apparatus provided by an embodiment of the present application.
  • At least one (item) refers to one or more, and "a plurality” refers to two or more.
  • "And/or" is used to describe the relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" can mean: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural.
  • the character “/” generally indicates that the associated objects are an “or” relationship.
  • "At least one of the following item(s)" or similar expressions refer to any combination of these items, including any combination of a single item or a plurality of items.
  • For example, at least one of a, b or c can mean: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b and c can each be single or multiple.
  • The monocular depth estimation method of the embodiments of the present application does not require the scene limitation that the monocular camera be in a moving state and the target object in the real world be in a static state, and can be applied to depth estimation of target objects in real-life general or common scenes.
  • the general or general scene described in the embodiments of the present application specifically refers to any scene without conditional restrictions, where the conditional restrictions may include but are not limited to lighting condition restrictions, camera type restrictions, target object type restrictions, or camera and target objects in the scene Restrictions on the relative positional relationship between them, etc.
  • the lighting condition can be the lighting of the environment in which the scene is located.
  • Camera types can be RGB cameras, grayscale cameras, event cameras, night vision cameras, or thermal cameras, etc.
  • the target object type can be a person, an animal, or an object, etc.
  • the relative positional relationship between the camera and the target object in the scene can be close-range, distant, still or moving, and so on.
  • The monocular depth estimation method of the embodiments of the present application can be applied to scenarios such as augmented reality (AR), navigation (e.g., automatic driving or assisted driving), scene reconstruction, scene understanding, or object detection, and the like.
  • In the embodiments of the present application, the image to be estimated is input into the first neural network model, the first distance-scaled normal (DSN) map output by the first neural network model is obtained, the first camera filter map is determined according to the image to be estimated and the first parameter, and then the first depth map corresponding to the image to be estimated is determined according to the first DSN map and the first camera filter map.
  • the first parameter is a camera calibration parameter of the camera that captures the image to be estimated.
  • the first DSN map is related to the 3D structure (eg, geometric structure) of the target object in the scene, and is not affected by the camera model (eg, including geometric projection model and camera calibration parameters, etc.).
  • the first camera filter map is related to the pixel points of the image to be estimated and the camera model, and is not affected by the 3D structure of the target object in the scene. Compared with the traditional depth estimation network, determining the depth map through the first DSN map and the first camera filter map can improve the accuracy and efficiency of monocular depth estimation. For its specific implementation, reference may be made to the explanations of the following embodiments.
  • Target object: including but not limited to people, animals or objects.
  • the objects may be objects in the natural environment, such as grass, or trees, etc., or may be objects in the human environment, such as buildings, roads, or vehicles.
  • the surface of the target object usually has a regular planar structure. In some embodiments, even if the surface of the target object does not have a completely planar structure, the surface of the target object may be divided into a plurality of small planar regions.
  • 3D point of the target object in space: a point on a plane of the outer surface of the target object in space.
  • For example, when the target object is a vehicle, the 3D point of the target object in space can be any point on the outer surface of the vehicle, such as a point on the plane formed by the front windshield, or a point on the plane formed by the license plate.
  • DSN map: used to represent the orientation of the plane of the target object corresponding to the input image and the distance between the plane and the camera.
  • the camera refers to the camera that captures the input image
  • the plane of the target object refers to the plane of the outer surface of the target object in 3D space.
  • For example, if the target object is a cube and the camera shoots the cube from an angle such that the obtained input image presents 3 planes of the cube, the planes of the target object refer to the 3 planes of the cube. In this case, the DSN map includes the DSN vectors of the three planes of the cube, and the DSN vector of each plane can represent the orientation of the respective plane and the distance between the respective plane and the camera.
  • the DSN vector of each plane is related to the 3D structure of the target object in the scene and is not affected by the camera model.
  • the DSN vectors corresponding to the coplanar 3D points of the target object in the same plane are the same.
  • the DSN vector of each pixel is always the same as the DSN vector of the adjacent pixels belonging to the same plane.
  • each pixel in the DSN map stores a DSN vector.
  • the number of data stored in each pixel in the DSN map is called the number of channels. Since each pixel stores a DSN vector here, the number of channels is equal to the number of components included in the DSN vector.
  • For example, when the DSN vector has three components, the data stored by one pixel is 3-channel data, and each channel of a pixel in the DSN map is used to store one component of the DSN vector, that is, one dimensional component.
  • the monocular depth estimation method may use a neural network model to process an input image to obtain a DSN map.
  • the input image is the image to be estimated, and the first neural network model (also called the target neural network model) is used to process the to-be-estimated image to obtain the first DSN map.
  • In the training process of the neural network model, the input image is the training image, and the initial neural network model is used to process the training image to obtain the second DSN map.
  • This embodiment of the present application uses the first DSN map and the second DSN map to distinguish the DSN maps output by the neural network model in different processes.
  • Camera filter map: used to represent the mapping relationship between the 3D points of the target object in space and the 2D plane, where the 2D plane is the imaging plane of the camera.
  • the camera filter map is related to the pixel points of the input image and the camera model, and is not affected by the 3D structure of the target object in the scene.
  • the camera model may include a geometric projection model, camera calibration parameters, and the like, and the camera calibration parameters may include camera center coordinates, focal length, and the like.
  • For images captured by the same camera, the corresponding camera filter maps are the same and need to be calculated only once.
  • If the camera is changed, the camera filter map can be recalculated according to the camera calibration parameters of the new camera.
  • each pixel in the camera filter map stores a camera filter map vector.
  • For example, the camera 1 shoots the target object 1 to obtain the input image 11, and the camera 1 shoots the target object 2 to obtain the input image 12. Since the input image 11 and the input image 12 are both collected by the camera 1, the camera filter map corresponding to the input image 11 is the same as the camera filter map corresponding to the input image 12.
  • Depth map: used to represent the distance (depth) from the 3D points in space of the target object corresponding to the input image to the camera.
  • Each pixel in the depth map stores a depth value that estimates the distance between the real-world 3D point corresponding to the pixel and the camera's viewpoint at the time the input image was captured by the camera.
  • the real-world 3D point can be any target object in any scene, a 3D point in space.
  • the depth value of a pixel in the depth map can be determined by two parts, which include the DSN vector of the pixel and the camera filter map vector of the pixel.
  • The electronic device in the embodiments of the present application may be a mobile phone, a tablet computer (Pad), a computer with a wireless transceiver function, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, a terminal device in industrial control, a terminal device in assisted driving, a terminal device in self driving, a terminal device in remote medical surgery, a terminal device in a smart grid, a terminal device in transportation safety, a terminal device in a smart city, a terminal device in a smart home, a smart watch, a smart bracelet, smart glasses, or other sports accessories or wearables, and the like.
  • a terminal device in a smart home may be a smart home appliance such as a smart TV and a smart speaker.
  • A neural network can be composed of neural units. A neural unit can refer to an operation unit that takes inputs x_s and an intercept 1 as input, and the output of the operation unit can be f(Σ_s W_s x_s + b), where:
  • W_s is the weight of x_s, and
  • b is the bias of the neural unit.
  • f is an activation function of the neural unit, which is used to perform nonlinear transformation on the features obtained in the neural network, and convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
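  • As a small illustration of such a neural unit (with the sigmoid activation mentioned above), using placeholder names:

```python
import numpy as np

def neural_unit(x, w, b):
    """Single neural unit: weighted sum of the inputs x_s plus the bias b,
    passed through a sigmoid activation function f."""
    z = np.dot(w, x) + b
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid: f(z) = 1 / (1 + e^(-z))

# Example: three inputs, three weights, one bias.
print(neural_unit(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, 0.2]), b=0.3))
```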
  • a neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
  • a deep neural network also known as a multi-layer neural network, can be understood as a neural network with multiple hidden layers.
  • the DNN is divided according to the positions of different layers.
  • the neural network inside the DNN can be divided into three categories: input layer, hidden layer, and output layer. Generally speaking, the first layer is the input layer, the last layer is the output layer, and the middle layers are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • Although a DNN looks complicated, the work of each layer is not complicated. In short, each layer computes the linear relationship expression y = α(W x + b), where x is the input vector, y is the output vector, b is the offset vector, W is the weight matrix (also called coefficients), and α() is the activation function.
  • Each layer simply performs this operation on the input vector x to obtain the output vector y. Since a DNN has many layers, the numbers of coefficients W and offset vectors b are correspondingly large.
  • Taking the coefficient W as an example: suppose that in a three-layer DNN, the linear coefficient from the 4th neuron in the second layer to the 2nd neuron in the third layer is defined as W^3_24, where the superscript 3 represents the layer in which the coefficient W is located, and the subscripts correspond to the output third-layer index 2 and the input second-layer index 4.
  • In summary, the coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as W^L_jk.
  • the input layer does not have a W parameter.
  • more hidden layers allow the network to better capture the complexities of the real world.
  • a model with more parameters is more complex and has a larger "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vectors of many layers).
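  • For illustration, the per-layer expression y = α(W x + b) can be chained as follows (a sketch with arbitrary layer sizes and a sigmoid as α; none of these values come from the patent):

```python
import numpy as np

def dnn_forward(x, layers):
    """Forward pass of a fully connected DNN: each layer computes y = alpha(W x + b)."""
    for W, b in layers:
        x = 1.0 / (1.0 + np.exp(-(W @ x + b)))   # sigmoid used as the activation alpha
    return x

rng = np.random.default_rng(0)
layers = [(rng.standard_normal((5, 4)), np.zeros(5)),   # hidden layer 1
          (rng.standard_normal((5, 5)), np.zeros(5)),   # hidden layer 2
          (rng.standard_normal((2, 5)), np.zeros(2))]   # output layer
print(dnn_forward(rng.standard_normal(4), layers))
```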
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • a convolutional neural network consists of a feature extractor consisting of convolutional layers and subsampling layers, which can be viewed as a filter.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • In a convolutional layer of a convolutional neural network, a neuron can be connected to only some of the neurons in the adjacent layer.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some neural units arranged in a rectangle. Neural units in the same feature plane share weights, and the shared weights here are convolution kernels. Shared weights can be understood as the way to extract features is independent of location.
  • the convolution kernel can be formalized in the form of a matrix of random size, and the convolution kernel can be learned to obtain reasonable weights during the training process of the convolutional neural network.
  • the immediate benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • the pixel value of the image can be a red-green-blue (RGB) color value, and the pixel value can be a long integer representing the color.
  • the pixel value is 256*Red+100*Green+76*Blue, where Blue represents the blue component, Green represents the green component, and Red represents the red component. In each color component, the smaller the value, the lower the brightness, and the larger the value, the higher the brightness.
  • the pixel values can be grayscale values.
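  • A trivial illustration of the long-integer pixel value, following the example RGB formula quoted above:

```python
def pixel_value(red, green, blue):
    """Long-integer pixel value using the example formula 256*Red + 100*Green + 76*Blue."""
    return 256 * red + 100 * green + 76 * blue

print(pixel_value(red=10, green=20, blue=30))  # 6840
```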
  • an embodiment of the present application provides a system architecture 100 .
  • the data collection device 160 is used to collect training data.
  • the training data in this embodiment of the present application may include the training image and the second DSN map corresponding to the training image, or include the training image and the second depth map corresponding to the training image, or include the training image and the second DSN map corresponding to the training image.
  • The training device 120 processes the training data through the training method of the first neural network model in the following embodiments of the present application, and compares the output image with the target image (for example, the second DSN map) until the difference between the image output by the training device 120 and the target image is less than a certain threshold, at which point the training of the target model/rule 101 is completed.
  • The target model/rule in this embodiment of the present application is used to process the input image to be estimated and output a first DSN map, where the first DSN map is used to represent the orientation of the plane of the target object corresponding to the image to be estimated and the distance between the plane and the camera.
  • The target model/rule 101 can be used to implement the monocular depth estimation method provided by the embodiments of the present application; that is, the image to be processed, such as the image to be estimated, is input into the target model/rule 101 after relevant preprocessing, so as to obtain the first depth map.
  • the target model/rule 101 in this embodiment of the present application may specifically be a neural network.
  • the training data maintained in the database 130 may not necessarily come from the collection of the data collection device 160, and may also be received from other devices.
  • The training device 120 may not necessarily train the target model/rule 101 entirely based on the training data maintained in the database 130, and may also obtain training data from the cloud or other places for model training. The above description should not be construed as a limitation on the embodiments of the present application.
  • The target model/rule 101 trained by the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 1, which may be a terminal device such as a laptop, an augmented reality (AR)/virtual reality (VR) device or an in-vehicle terminal, and may also be a server or the cloud.
  • The execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices.
  • The user can input data to the I/O interface 112 through the client device 140, and the input data may include: an image to be estimated.
  • the preprocessing module 113 is configured to perform preprocessing according to the input data (eg, the image to be estimated) received by the I/O interface 112 .
  • the preprocessing module 113 may be used to perform image filtering and other processing on the input data.
  • the preprocessing module 113 and the preprocessing module 114 may also be absent, and the calculation module 111 may be directly used to process the input data.
  • When the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculation and other related processing, the execution device 110 can call the data, codes, and the like in the data storage system 150 for corresponding processing, and the data and instructions obtained by the corresponding processing may also be stored in the data storage system 150.
  • the I/O interface 112 returns the processing result, such as the first depth map obtained as described above, to the client device 140, so as to be provided to the user.
  • The training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete the above tasks, thereby providing the user with the desired result.
  • the user can manually specify input data, which can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send the input data to the I/O interface 112 . If the user's authorization is required to request the client device 140 to automatically send the input data, the user can set the corresponding permission in the client device 140 .
  • the user can view the result output by the execution device 110 on the client device 140, and the specific presentation form can be a specific manner such as display, sound, and action.
  • the client device 140 can also be used as a data collection terminal to collect the input data of the input I/O interface 112 and the output result of the output I/O interface 112 as new sample data as shown in the figure, and store them in the database 130 .
  • Alternatively, the I/O interface 112 can directly store, as new sample data, the input data input into the I/O interface 112 and the output result of the I/O interface 112 shown in the figure in the database 130.
  • FIG. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 150 is an external memory relative to the execution device 110 , and in other cases, the data storage system 150 may also be placed in the execution device 110 .
  • the training device 120 and the execution device 110 are two devices, and in other cases, the training device 120 and the execution device 110 may be one device.
  • the training device 120 and the execution device 110 may be a server or a server cluster
  • the client device 140 may establish a connection with the server
  • the server may process the image to be estimated by using the monocular depth estimation method in the embodiment of the present application to obtain the A depth map, providing the first depth map to client device 140 .
  • the execution device 110 and the client device 140 are two devices, and in other cases, the execution device 110 and the client device 140 may be one device.
  • the execution device 110 and the client device 140 may be a smart phone
  • the training device 120 may be a server or a server cluster
  • the server may process the training data through the training method of the first neural network model in the embodiment of the present application
  • a target model/rule is generated, and the target model/rule is provided to the smartphone, so that the smartphone can process the image to be estimated by the monocular depth estimation method of the embodiment of the present application to obtain a first depth map.
  • a target model/rule 101 is obtained by training according to the training device 120.
  • the target model/rule 101 may be the first neural network model in the present application.
  • The first neural network model in the present application may include a CNN or a deep convolutional neural network (DCNN), among others.
  • CNN is a very common neural network. A convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture, where a deep learning architecture refers to learning at multiple levels of abstraction.
  • CNN is a feed-forward artificial neural network in which individual neurons can respond to images fed into it.
  • a convolutional neural network (CNN) 200 may include an input layer 210 , a convolutional/pooling layer 220 (where the pooling layer is optional), and a fully connected layer 230 .
  • The convolutional/pooling layer 220 may include layers 221-226. For example, in one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or can be used as the input of another convolutional layer to continue the convolution operation.
  • the convolution layer 221 may include many convolution operators.
  • The convolution operator is also called a kernel, and its role in image processing is equivalent to a filter that extracts specific information from the input image matrix. The convolution operator can essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually applied one pixel after another (or two pixels after two pixels, depending on the value of the stride) along the horizontal direction of the input image, so as to complete the work of extracting specific features from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • each weight matrix is stacked to form the depth dimension of the convolutional image, where the dimension can be understood as determined by the "multiple" described above.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to extract unwanted noise in the image.
  • the multiple weight matrices have the same size (row ⁇ column), and the size of the feature maps extracted from the multiple weight matrices with the same size is also the same, and then the multiple extracted feature maps with the same size are combined to form a convolution operation. output.
  • weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained by training can be used to extract information from the input image, so that the convolutional neural network 200 can make correct predictions .
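  • For illustration, the following sketch applies several weight matrices (kernels) of the same size to an input with a given stride, each producing one feature map that is stacked along the depth dimension; it is a simplified single-channel example, not the patent's implementation:

```python
import numpy as np

def conv2d(image, kernels, stride=1):
    """Convolve a single-channel image with several k x k kernels; the same weights
    are reused at every location, and each kernel yields one feature map."""
    H, W = image.shape
    num_kernels, k, _ = kernels.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    out = np.zeros((num_kernels, out_h, out_w))
    for n in range(num_kernels):
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
                out[n, i, j] = np.sum(patch * kernels[n])
    return out

features = conv2d(np.random.rand(8, 8), np.random.rand(3, 3, 3))  # 3 feature maps of size 6 x 6
```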
  • Compared with the features extracted by the initial convolutional layer (e.g., 221), the features extracted by the later convolutional layers (e.g., 226) become more and more complex, such as high-level semantic features.
  • features with higher semantics are more suitable for the problem to be solved.
  • A pooling layer often follows a convolutional layer; that is, it can be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers.
  • the pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain a smaller size image.
  • the average pooling operator can calculate the pixel values in the image within a certain range to produce an average value as the result of average pooling.
  • the max pooling operator can take the pixel with the largest value within a specific range as the result of max pooling. Also, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size.
  • the size of the output image after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
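  • A minimal illustration of max pooling as described above (average pooling would replace max with mean):

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Max pooling: each output pixel is the maximum of the corresponding
    size x size sub-region, so the output is smaller than the input."""
    H, W = feature_map.shape
    cropped = feature_map[:H - H % size, :W - W % size]      # drop edges that don't fit
    blocks = cropped.reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

pooled = max_pool2d(np.random.rand(6, 6))   # 3 x 3 output
```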
  • After being processed by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet sufficient to output the required output information, because, as mentioned before, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. In order to generate the final output information (the required class information or other relevant information), the convolutional neural network 200 needs to use the fully connected layer 230 to generate one output or a set of outputs of the required number of classes. Therefore, the fully connected layer 230 may include multiple hidden layers (231, 232 to 23n as shown in FIG. 2), and the parameters contained in the multiple hidden layers may be obtained by pre-training based on the relevant training data of a specific task type; for example, the task type can include image recognition, image classification, image super-resolution reconstruction, and so on.
  • After the multiple hidden layers in the fully connected layer 230, the last layer of the entire convolutional neural network 200 is the output layer 240. The output layer 240 has a loss function similar to the categorical cross-entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 200 (as shown in FIG. 2, the propagation in the direction from 210 to 240 is forward propagation) is completed, the back propagation (as shown in FIG. 2, the propagation in the direction from 240 to 210 is back propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
  • the convolutional neural network 200 shown in FIG. 2 is only used as an example of a convolutional neural network.
  • The convolutional neural network can also exist in the form of other network models, including network models that include only a part of the network structure shown in FIG. 2; for example, the convolutional neural network adopted in this embodiment of the present application may only include an input layer 210, a convolutional layer/pooling layer 220 and an output layer 240.
  • FIG. 3 is a hardware structure of a chip provided by an embodiment of the application, and the chip includes a neural network processor 30 .
  • the chip can be set in the execution device 110 as shown in FIG. 1 to complete the calculation work of the calculation module 111 .
  • the chip can also be set in the training device 120 as shown in FIG. 1 to complete the training work of the training device 120 and output the target model/rule 101 .
  • the algorithms of each layer in the convolutional neural network shown in Figure 2 can be implemented in the chip shown in Figure 3.
  • Both the monocular depth estimation method and the training method of the first neural network model in the embodiment of the present application can be implemented in the chip as shown in FIG. 3 .
  • The neural network processor 30 may be a neural-network processing unit (NPU), a tensor processing unit (TPU), or a graphics processing unit (GPU), etc., all of which are suitable for large-scale processing.
  • the NPU is mounted on the main central processing unit (CPU) (host CPU) as a co-processor, and the main CPU assigns tasks.
  • the core part of the NPU is the operation circuit 303, and the controller 304 controls the operation circuit 303 to extract the data in the memory (weight memory or input memory) and perform operations.
  • TPU is Google's fully customized artificial intelligence accelerator application-specific integrated circuit for machine learning.
  • the arithmetic circuit 303 includes multiple processing units (process engines, PEs). In some implementations, arithmetic circuit 303 is a two-dimensional systolic array. The arithmetic circuit 303 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 303 is a general-purpose matrix processor.
  • the arithmetic circuit 303 fetches the weight data of the matrix B from the weight memory 302 and buffers it on each PE in the arithmetic circuit 303 .
  • the arithmetic circuit 303 fetches the input data of the matrix A from the input memory 301 , performs matrix operation according to the input data of the matrix A and the weight data of the matrix B, and stores the partial result or the final result of the matrix in the accumulator 308 .
  • the vector calculation unit 307 can further process the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
  • the vector computing unit 307 can be used for network computation of non-convolutional/non-FC layers in the neural network, such as pooling, batch normalization, local response normalization, etc. .
  • the vector computation unit 307 can store the processed output vectors to the unified buffer 306 .
  • the vector calculation unit 307 may apply a nonlinear function to the output of the arithmetic circuit 303, such as a vector of accumulated values, to generate activation values.
  • vector computation unit 307 generates normalized values, merged values, or both.
  • vector computation unit 307 stores the processed vectors to unified memory 306 .
  • the vector processed by the vector computing unit 307 can be used as the activation input of the arithmetic circuit 303, for example, for use in subsequent layers in the neural network, as shown in FIG. 2, if the current processing layer is the hidden layer 1 (231), the vector processed by the vector calculation unit 307 can also be used for calculation in the hidden layer 2 (232).
  • Unified memory 306 is used to store input data and output data.
  • the weight data is directly stored in the weight memory 302 through a storage unit access controller (direct memory access controller, DMAC) 305.
  • Input data is also stored in unified memory 306 via the DMAC.
  • The bus interface unit (BIU) 310 is used for interaction between the DMAC and the instruction fetch buffer 309; the bus interface unit 310 is also used for the instruction fetch buffer 309 to obtain instructions from the external memory, and for the storage unit access controller 305 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to store the input data in the external memory DDR into the unified memory 306 , or store the weight data into the weight memory 302 , or store the input data into the input memory 301 .
  • An instruction fetch buffer 309 connected to the controller 304 is used to store the instructions used by the controller 304.
  • the controller 304 is used for invoking the instructions cached in the memory 309 to control the working process of the operation accelerator.
  • The unified memory 306, the input memory 301, the weight memory 302 and the instruction fetch memory 309 are all on-chip memories, and the external memory is the memory outside the NPU. The external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM) or another readable and writable memory.
  • each layer in the convolutional neural network shown in FIG. 2 may be performed by the operation circuit 303 or the vector calculation unit 307 .
  • both the training method of the first neural network model and the monocular depth estimation method in the embodiment of the present application may be executed by the operation circuit 303 or the vector calculation unit 307 .
  • an embodiment of the present application provides a system architecture 400 .
  • the system architecture includes a local device 401, a local device 402, an execution device 410 and a data storage system 450, wherein the local device 401 and the local device 402 are connected with the execution device 410 through a communication network.
  • Execution device 410 may be implemented by one or more servers.
  • the execution device 410 may be used in conjunction with other computing devices, such as data storage, routers, load balancers and other devices.
  • the execution device 410 may be arranged on one physical site, or distributed across multiple physical sites.
  • the execution device 410 may use the data in the data storage system 450 or call the program code in the data storage system 450 to implement the training method and/or the monocular depth estimation method of the first neural network model in the embodiment of the present application.
  • a user may operate respective user devices (eg, local device 401 and local device 402 ) to interact with execution device 410 .
  • Each local device may represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, gaming console, etc.
  • Each user's local device can interact with the execution device 410 through a communication network of any communication mechanism/communication standard.
  • the communication network can be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
  • The local device 401 and the local device 402 obtain the first neural network model from the execution device 410, deploy the first neural network model on the local device 401 and the local device 402, and use the first neural network model to perform monocular depth estimation.
  • the first neural network model may be directly deployed on the execution device 410.
  • The execution device 410 acquires the images to be processed from the local device 401 and the local device 402, and uses the first neural network model to perform monocular depth estimation on the images to be processed.
  • The above execution device 410 may also be a cloud device; in this case, it may be deployed in the cloud. Alternatively, the execution device 410 may be a terminal device; in this case, it may be deployed on the user terminal side. This is not limited in the embodiments of the present application.
  • FIG. 5 is a flowchart of a monocular depth estimation method according to an embodiment of the present application. As shown in FIG. 5 , the method in this embodiment may include:
  • Step 101 Acquire the image to be estimated and the first parameter corresponding to the image to be estimated.
  • the image to be estimated is an image obtained by photographing a target object in a 3D space by a camera.
  • the to-be-estimated image may be an image captured (also referred to as acquisition) by any camera such as an RGB camera, a grayscale camera, a night vision camera, a thermal camera, a panoramic camera, an event camera, or an infrared camera.
  • the image to be estimated may be a frame of image.
  • the first parameter is the camera calibration parameter of the camera that captures the image to be estimated.
  • Camera calibration parameters are parameters related to the camera's own characteristics.
  • the camera calibration parameters may include the center coordinates and focal length of the camera.
  • the camera calibration parameters may include width pixel values and height pixel values of the image to be estimated.
  • One way to acquire the image to be estimated is for the device to capture the image to be estimated with any of the above-mentioned cameras of the device itself.
  • Another way to obtain the image to be estimated may be to receive the image to be estimated sent by other devices, and the image to be estimated may be collected by cameras of other devices.
  • Step 102 Input the image to be estimated into the first neural network model, and obtain the first DSN map output by the first neural network model.
  • the first DSN map is used to represent the orientation of the plane of the target object corresponding to the image to be estimated and the distance between the plane and the camera.
  • the input image here is the image to be estimated.
  • the first DSN map is related to the geometry of the target object in 3D space, independent of the camera model, which can more accurately represent the three-dimensional world.
  • the first neural network model may be any neural network model, for example, a deep neural network (Deep Neural Network, DNN), a convolutional neural network (Convolutional Neural Networks, CNN) or a combination thereof, and the like.
  • the first neural network model is obtained by training using the training image and the second DSN map corresponding to the training image.
  • the second DSN graph is the ground truth in the process of training the neural network model.
  • the second DSN map is determined according to a second depth map corresponding to the training image and camera calibration parameters corresponding to the training image.
  • the second depth map is the ground truth in the process of training the neural network model, and the second DSN map can be determined by the camera calibration parameters corresponding to the second depth map and the training image.
  • The first neural network model is trained with the training image and the second DSN map corresponding to the training image, and learns the mapping from an input image to its DSN map, so that it can intelligently perceive the above-mentioned image to be estimated and output the DSN map corresponding to the image to be estimated.
  • Step 103 Determine a first camera filter map according to the image to be estimated and the first parameter.
  • the first camera filter map is used to represent the mapping relationship between the 3D points of the target object in space and the 2D plane, where the 2D plane is the imaging plane of the camera.
  • The input image here is the image to be estimated; that is, the first camera filter map is related to the pixels of the image to be estimated and to the camera model, and is not affected by the 3D structure of the target object in the scene of the image.
  • the first camera filter map is determined according to the position coordinates of the pixels of the image to be estimated and the first parameter.
  • the first camera filter map includes camera filter map vectors corresponding to pixels.
  • the pixel points in the first camera filter map store the camera filter map vector.
  • the camera filter vector is used to represent the mapping relationship between the 3D point and the pixel point, and the pixel point is the point where the 3D point is projected to the 2D plane (camera imaging plane).
  • the position coordinates of the pixel points may include an abscissa and an ordinate
  • the camera filter map vector corresponding to the pixel includes a first camera filter map component and a second camera filter map component
  • the first camera filter map component is determined according to the abscissa and the first parameter.
  • the second camera filter map component is determined according to the ordinate and the first parameter, or according to the abscissa, the ordinate and the first parameter.
  • the first parameter may include the center coordinates and focal length of the camera.
  • the first camera filter map component is determined according to the abscissa and the first parameter, and the second camera filter map component is determined according to the ordinate and the first parameter.
  • The above-mentioned first parameter may include the width pixel value and the height pixel value of the image to be estimated; in that case, the first camera filter map component is determined according to the abscissa and the first parameter, and the second camera filter map component is determined according to the abscissa, the ordinate and the first parameter.
  • Step 104 Determine a first depth map corresponding to the image to be estimated according to the first DSN map and the first camera filter map.
  • The first depth map may include a depth value corresponding to each pixel; in other words, a pixel in the first depth map stores a depth value, and the depth value represents the distance, at the time when the image to be estimated was captured by the camera, between the real-world 3D point corresponding to the pixel and the camera.
  • the first depth map is a dense, edge-aware, metric-scale depth map.
  • Through the above steps, two parts can be obtained: the first DSN map and the first camera filter map. From these two parts, the first depth map corresponding to the image to be estimated can finally be obtained.
  • This embodiment of the present application uses the first neural network model to obtain the first DSN map, where the first DSN map can accurately represent the geometric structure of the target object in the 3D space.
  • the first camera filter map is determined according to the camera calibration parameters of the image to be estimated.
  • the first camera filter map is related to the pixels of the image to be estimated and the camera model, and is not affected by the 3D structure of the target object in the scene.
  • the depth map is then obtained based on these two parts.
  • The embodiment of the present application obtains the depth map based on the first DSN map and the first camera filter map, and the depth map can accurately reflect the distance of the target object and improve the accuracy of monocular depth estimation.
  • the first DSN map may include the first DSN vector corresponding to the pixel points of the image to be estimated, in other words, the pixels in the first DSN map store the DSN vector. According to the first DSN vector corresponding to the pixel point and the camera filter map vector corresponding to the pixel point, the depth value corresponding to the pixel point can be determined. That is, corresponding operations are performed on vectors at the same pixel position in the first DSN map and the first camera filter map, so as to obtain the depth value of the corresponding pixel position.
  • The specific implementation of determining the depth value corresponding to pixel i may be given by the following formula 1 and formula 2.
  • ρ_i is the inverse depth value of the 3D point of the target object in the scene corresponding to pixel i
  • N i is the first DSN vector corresponding to pixel i
  • F i is the camera filter mapping vector corresponding to pixel i.
  • Ni can be obtained from pixel i of the first DSN map
  • F i can be obtained from pixel i of the first camera filter map.
  • the depth value corresponding to pixel i can be obtained by taking the inverse of the inverse depth value. For example, it is determined according to the following formula 2.
  • Z_i is the depth value corresponding to pixel i.
  • the above pixel point i may be any pixel point in the image to be estimated.
  • The depth value corresponding to pixel i is determined by formula 1 and formula 2, which can be applied to monocular depth estimation of an image to be estimated collected by a camera of the perspective camera model, and can also be applied to monocular depth estimation of an image to be estimated collected by a camera of a non-perspective camera model; cameras of non-perspective camera models include but are not limited to a panoramic camera, a 360-degree spherical camera, a catadioptric camera, a fisheye camera, and the like (a code sketch of this computation is given below).
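The following is a minimal NumPy sketch of the depth computation described above, assuming that formula 1 is the per-pixel inner product of the DSN vector and the camera filter map vector (ρ_i = N_i · F_i) and that formula 2 takes its reciprocal (Z_i = 1/ρ_i); the formulas themselves are not reproduced in this text, and all function and variable names are hypothetical.

```python
import numpy as np

def depth_from_dsn(dsn_map: np.ndarray, camera_filter_map: np.ndarray,
                   eps: float = 1e-6) -> np.ndarray:
    """Recover a depth map from a DSN map and a camera filter map.

    dsn_map:           H x W x 3 array, one DSN vector N_i per pixel.
    camera_filter_map: H x W x 3 array, one filter map vector F_i per pixel.
    Returns an H x W depth map.
    """
    # Assumed formula 1: inverse depth = <N_i, F_i> at every pixel.
    inverse_depth = np.sum(dsn_map * camera_filter_map, axis=-1)
    # Assumed formula 2: depth is the reciprocal of the inverse depth.
    return 1.0 / np.clip(inverse_depth, eps, None)
```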
  • the image to be estimated and the camera calibration parameters are obtained
  • the first DSN map corresponding to the image to be estimated is obtained through the first neural network model
  • The first camera filter map is determined according to the camera calibration parameters, and then the depth map is obtained based on the first DSN map and the first camera filter map. The depth map can accurately reflect the distance of the target object, thereby improving the accuracy of monocular depth estimation.
  • The embodiment of the present application can perform depth estimation with one frame of the image to be estimated, without the limitation that the monocular camera must be in motion while the real-world target object is static, so it can be applied to depth estimation of target objects in general or common real-life scenes.
  • FIG. 6 is a flowchart of another monocular depth estimation method according to an embodiment of the present application. As shown in FIG. 6 , the method in this embodiment may include:
  • Step 201 Acquire the image to be estimated and the first parameter corresponding to the image to be estimated.
  • the first parameter corresponding to the image to be estimated may include the center coordinate and focal length of the camera, or may include the width pixel value and the height pixel value of the image to be estimated.
  • the width pixel value and the height pixel value of the image to be estimated may be calculated based on the center coordinates of the camera.
  • Step 202 Input the image to be estimated into the first neural network model, and obtain the first DSN map output by the first neural network model.
  • For the explanation of steps 201 to 202, reference may be made to the specific explanation of steps 101 to 102 of the embodiment shown in FIG. 5, which will not be repeated here.
  • Step 203 Determine whether the field of view of the camera that shoots the image to be estimated is less than 180°, if yes, go to Step 204, and if not, go to Step 205.
  • Step 204 Determine a first camera filter map according to the abscissa and ordinate of the pixel of the image to be estimated, and the center coordinate and focal length of the camera that captures the image to be estimated.
  • the first camera filter map includes camera filter map vectors corresponding to pixels.
  • the pixel point i is taken as an example, and the specific implementation manner of determining the camera filter mapping vector corresponding to the pixel point i in this step may be determined by the following formulas 3 to 5.
  • F i is the camera filter map vector corresponding to the pixel i
  • F_u is the first camera filter map component of F_i
  • F v is the second camera filter map component of F i
  • i = (u, v), where u is the abscissa of pixel i and v is the ordinate of pixel i.
  • (c x , c y ) are the center coordinates of the camera that captures the image to be estimated
  • (f x , f y ) are the focal lengths of the camera that captures the image to be estimated.
  • the above-mentioned pixel point i may be any pixel point in the image to be estimated, so that the first camera filter map can be obtained.
  • the first camera filter map can be determined by using Formula 3 to Formula 5 in this step.
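Formulas 3 to 5 are not reproduced in this text. The sketch below assumes, consistent with the perspective-model derivation given later in this document (F_u = (u - c_x)/f_x, F_v = (v - c_y)/f_y, with a constant third component of 1 so that the inner product with the 3-component DSN vector yields the inverse depth), one way the first camera filter map could be built for a camera whose field of view is less than 180 degrees; the function name is hypothetical.

```python
import numpy as np

def perspective_camera_filter_map(width: int, height: int,
                                  cx: float, cy: float,
                                  fx: float, fy: float) -> np.ndarray:
    """Assumed camera filter map for a perspective (pinhole) camera:
    F_u = (u - c_x)/f_x, F_v = (v - c_y)/f_y, third component 1."""
    u, v = np.meshgrid(np.arange(width), np.arange(height))  # pixel grid (H x W)
    f_u = (u - cx) / fx
    f_v = (v - cy) / fy
    ones = np.ones_like(f_u, dtype=np.float64)
    return np.stack([f_u, f_v, ones], axis=-1)  # H x W x 3
```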
  • Step 205 Determine the first camera filter map according to the abscissa and ordinate of the pixel points of the image to be estimated, and the width and height pixel values of the image to be estimated.
  • the pixel point i is taken as an example, and the specific implementation manner of determining the camera filter mapping vector corresponding to the pixel point i in this step may be determined by, for example, formula 3, formula 6 and formula 7.
  • W is the width pixel value of the image to be estimated
  • H is the height pixel value of the image to be estimated
  • the above-mentioned pixel point i may be any pixel point in the image to be estimated, so that the first camera filter map can be obtained.
  • Formula 3, Formula 6, and Formula 7 in this step may be used to determine the first camera filter map.
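Formulas 6 and 7 are likewise not reproduced, so the exact construction for cameras with a field of view of 180 degrees or more is not known from this text. The sketch below is only one plausible choice: it assumes an equirectangular 360-degree image and uses the unit viewing-ray direction of each pixel as its camera filter map vector, so that the inner product with the DSN vector N = n/h still yields an inverse (radial) depth; it depends only on the width and height pixel values, as described above, and the function name is hypothetical.

```python
import numpy as np

def equirectangular_camera_filter_map(width: int, height: int) -> np.ndarray:
    """Hypothetical filter map for a 360-degree equirectangular image:
    each pixel stores its unit viewing-ray direction."""
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    theta = (u + 0.5) / width * 2.0 * np.pi - np.pi        # azimuth in [-pi, pi)
    phi = (v + 0.5) / height * np.pi - np.pi / 2.0          # elevation in [-pi/2, pi/2)
    x = np.cos(phi) * np.sin(theta)
    y = np.sin(phi)
    z = np.cos(phi) * np.cos(theta)
    return np.stack([x, y, z], axis=-1)  # H x W x 3 unit rays
```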
  • Step 206 Determine a first depth map corresponding to the image to be estimated according to the first DSN map and the first camera filter map.
  • step 206 may refer to the specific explanation of step 104 in the embodiment shown in FIG. 5 , which is not repeated here.
  • the image to be estimated and the camera calibration parameters are obtained
  • the first DSN map corresponding to the image to be estimated is obtained through the first neural network model
  • The first camera filter map is determined according to the camera calibration parameters, and then the depth map is obtained based on the first DSN map and the first camera filter map. The depth map can accurately reflect the distance of the target object, thereby improving the accuracy of monocular depth estimation.
  • The embodiment of the present application can perform depth estimation with one frame of the image to be estimated, without the limitation that the monocular camera must be in motion while the real-world target object is static, so it can be applied to depth estimation of target objects in general or common real-life scenes.
  • the monocular depth estimation method of the embodiment of the present application can be applied to perform depth estimation on images to be estimated collected by cameras of different camera models, so as to realize generalized perception of images to be estimated from different cameras.
  • the depth value can be determined by the DSN vector of the pixel and the camera filter mapping vector of the pixel. Exemplary explanations are given.
  • a 3D point of the target object in the scene is captured by a camera with position coordinates at (0, 0, 0) and stored in a 2D pixel at pixel i in the image plane.
  • P = (X, Y, Z) represents the 3D point, where (X, Y, Z) are the position coordinates of the 3D point in space (also called three-dimensional space coordinates).
  • i = (u, v), where (u, v) are the position coordinates of pixel i, measured from the upper left corner of the image plane.
  • the geometric correspondence between the position coordinates of the 3D points in the scene in space and the position coordinates of the pixel point i is given by the following formula 1:
  • the camera calibration parameters may include the center coordinates and focal length of the camera of the perspective (pinhole) camera model, (c x , c y ) are the center coordinates, and (f x , f y ) are the focal lengths.
  • the 3D point can also be modeled and represented as a unit surface normal, (n x , ny , n z ) represents the unit surface normal of the 3D point of the target object in the scene.
  • the surface in the unit surface normal refers to the plane formed by a 3D point and its adjacent coplanar 3D points (which can be extended to an infinite plane).
  • the representation of this unit vector is independent of the camera model used.
  • the distance from the camera to the extended plane of the 3D point can be defined as h. Then the distance h can be calculated geometrically by the scalar product of the unit surface normal and the three-dimensional space coordinate:
  • The geometric relationship of the perspective camera model in Equation 8 can be used to replace the 3D point space coordinates in Equation 9, giving:
  • The embodiment of the present application proposes a new 3D structure representation method, which decomposes the inverse depth of a 3D point in the scene (i.e., the inverse of the depth) into a DSN vector and a camera filter map vector, where N represents the DSN vector of the 3D point and F represents the camera filter map vector of pixel i. See Equation 11 to Equation 15 below.
  • ρ represents the inverse depth of the 3D point.
  • the inverse depth can be determined according to the 3D point structure representation provided by the embodiment of the present application, that is, the inverse depth value can be determined by the DSN vector and the camera filter map vector, and is expressed as:
  • the DSN vector of the 3D point is:
  • the camera filter map vector of the corresponding pixel i is:
  • F_u is the first camera filter map component.
  • F_v is the second camera filter map component.
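Because the equation images (Equations 8 to 15) are not reproduced in this text, the following block is a reconstruction of the decomposition that is consistent with the surrounding definitions; the symbol ρ for the inverse depth is an assumption.

```latex
% Perspective (pinhole) model:  X = (u - c_x) Z / f_x,   Y = (v - c_y) Z / f_y
% Plane distance (Equation 9):  h = n_x X + n_y Y + n_z Z
% Substituting the model and dividing by h Z gives the inverse depth:
\rho = \frac{1}{Z}
     = \underbrace{\left(\tfrac{n_x}{h},\ \tfrac{n_y}{h},\ \tfrac{n_z}{h}\right)}_{N}
       \cdot
       \underbrace{\left(\tfrac{u - c_x}{f_x},\ \tfrac{v - c_y}{f_y},\ 1\right)}_{F}
     = N \cdot F
```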
  • the camera model used may not be limited to the camera of the above-mentioned perspective camera model, but may also be a camera of a non-perspective camera model, for example, a panoramic camera, a 360-degree spherical camera , catadioptric camera, or fisheye camera, etc.
  • the 3D structure representation of the embodiments of the present application can perform geometric correspondence adaptation between the 3D structure of the scene and the non-perspective camera model.
  • the camera filter map vector (calculated according to Equation 15) needs to be updated for different types of cameras.
  • the camera may be a 360-degree panoramic camera.
  • the geometric correspondence between 3D point P and pixel i in the scene can be given by:
  • As in Equation 9, the inverse depth of a 3D point in the scene can be decomposed into a DSN vector and a camera filter map vector, where N represents the DSN vector and F represents the camera filter map vector.
  • ρ represents the inverse depth of the 3D point.
  • the DSN vector corresponding to the 3D point can be calculated by Equation 14.
  • The above provides a theoretical basis for the monocular depth estimation method of the embodiment shown in FIG. 5 or FIG. 6: the method can output the first DSN vector corresponding to each pixel of the image to be estimated through the first neural network model, obtain the first camera filter map vector corresponding to each pixel according to formula 3, and then obtain the depth value corresponding to each pixel according to formula 1 and formula 2, so as to realize monocular depth estimation and improve the accuracy of depth estimation.
  • FIG. 8 is a schematic diagram of a monocular depth estimation processing process according to an embodiment of the present application.
  • the first neural network model is a convolutional neural network as an example for schematic illustration.
  • the monocular depth estimation method can include: inputting the image data L301 into the convolutional neural network L302, for example, the above-mentioned to-be-estimated image can be used as the image data L301.
  • the convolutional neural network L302 outputs the DSN map L303.
  • a channel of a pixel in a DSN map is used to store a component of the DSN vector.
  • the convolutional neural network may be based on the encoder-decoder architecture of ResNet-18.
  • the convolutional neural network can be trained by the training method shown in Figure 9 below.
  • Taking F_i = (F_ui, F_vi) as an example, F_i is the camera filter map vector corresponding to pixel i.
  • the camera filter map L304 only needs to be calculated once.
  • the camera filter map L304 can be used to filter the DSN map L303.
  • the inverse depth map L306 is obtained by calculation, and then the depth map is obtained based on the inverse depth map L306.
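A short sketch of the FIG. 8 flow, assuming the helper functions introduced above; `dsn_network` stands in for the trained convolutional neural network L302 and is hypothetical.

```python
import numpy as np

def monocular_depth_estimation(image: np.ndarray, dsn_network,
                               camera_filter_map: np.ndarray) -> np.ndarray:
    """Assumed end-to-end flow of FIG. 8: image -> network -> DSN map,
    filter with the precomputed camera filter map to get inverse depth,
    then invert to obtain the depth map."""
    dsn_map = dsn_network(image)                                   # L302 -> L303, H x W x 3
    inverse_depth = np.sum(dsn_map * camera_filter_map, axis=-1)   # filtering with L304
    depth_map = 1.0 / np.clip(inverse_depth, 1e-6, None)           # from L306 to depth
    return depth_map
```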
  • FIG. 9 is a flowchart of a training method of a first neural network model according to an embodiment of the present application.
  • the first neural network model may also be called a monocular depth estimation model.
  • the method of this embodiment may include:
  • Step 301 Acquire a training image and a second DSN map corresponding to the training image.
  • the training image can be an image captured by any camera, such as an RGB camera, a grayscale camera, a night vision camera, a thermal camera, a panoramic camera, an event camera, or an infrared camera.
  • training images can be derived from the database shown in Figure 1.
  • The database may store multiple training images and the second DSN maps corresponding to the training images, or multiple training images and the second depth maps corresponding to the training images, or multiple training images together with both the corresponding second DSN maps and second depth maps.
  • the training data in the database can be obtained in the following ways.
  • the plurality of training images are images captured by a plurality of calibrated and synchronized cameras of the scene.
  • Using 3D vision technology, a second depth map corresponding to the multiple training images can be obtained.
  • the 3D vision technology may be a structure from motion (sfm) restoration technology, a multi-view 3D reconstruction technology, or a view synthesis technology.
  • different types of cameras include depth cameras and thermal cameras.
  • the depth cameras and thermal cameras are mounted in a frame and calibrated.
  • the depth map captured by the depth camera can be directly aligned with the image obtained by the thermal camera.
  • The depth map obtained by the depth camera can be used as the second depth map corresponding to the training image, and the image obtained by the thermal camera can be used as the training image.
  • an original image is obtained.
  • the original image can be an image captured by any of the above-mentioned types of cameras, and data optimization or data enhancement is performed on the original image to obtain a training image.
  • the data optimization includes at least one of hole filling optimization, sharpening occlusion edge optimization, or temporal consistency optimization.
  • One example is to optimize the original image through an image processing filter to obtain a training image.
  • the image processing filter may be bilateral filtering, or guided image filtering, or the like.
  • Another example is to optimize the original image by using the temporal information or consistency between frames of the video to obtain the training image. For example, add temporal constraints via optical flow methods.
  • Yet another example is to use geometric information or semantic segmentation information, which can be calculated by certain convolutional neural networks, such as a segmentation convolutional neural network for surface normal vectors or more surface categories, or convolutional neural networks for segmenting different categories such as people, vegetation, sky, cars, etc.
  • This data augmentation is used to change the environmental conditions of the scene captured by the original image to obtain training images under different environmental conditions.
  • the environmental conditions may include lighting conditions, weather conditions, visibility conditions, and the like.
  • the second depth map corresponding to the training image may be a depth map output by the teacher's monocular depth estimation network after processing the input training image.
  • the teacher monocular depth estimation network here can be a trained MDE convolutional neural network.
  • the second DSN map may be determined by using the training image, the camera calibration parameters corresponding to the training image, and the second depth map.
  • the second DSN map corresponding to the training image can be obtained by calculating in the following two ways.
  • the second DSN map includes the DSN vector of the plane where the 3D point in the scene corresponding to the pixel point of the training image is located.
  • a possible implementation is to calculate the unit surface normal corresponding to the pixel point i. (n xi , n yi , n zi ) is the unit surface normal corresponding to pixel i. In some embodiments, it can be calculated using the vector cross product of adjacent pixels of pixel i.
  • the camera calibration parameters may be camera calibration parameters of the camera that shoots the training image.
  • the distance from the plane where the 3D point corresponding to the pixel point i is located to the camera can be calculated by the following formula 19.
  • hi is the distance from the plane where the 3D point corresponding to pixel i is located to the camera
  • the camera calibration parameters include the center coordinates and focal length of the camera
  • (c x , c y ) are the center coordinates of the camera
  • (f x , f y ) is the focal length of the camera
  • (u, v) is the position coordinate of pixel i
  • Z is the depth value corresponding to pixel i.
  • the DSN vector of the plane where the 3D point corresponding to pixel i is located is calculated.
  • the DSN vector of the plane where the 3D point corresponding to the pixel point i is located can be calculated by the following formula 20.
  • N i is the DSN vector of the plane where the 3D point corresponding to the pixel point i is located.
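A hedged sketch of this first way of computing the ground-truth (second) DSN map, assuming that formula 19 is the plane-to-camera distance h = <n, P> with P back-projected by the pinhole model, and that formula 20 sets N = n/h; the function name is hypothetical.

```python
import numpy as np

def ground_truth_dsn_map(depth: np.ndarray, normals: np.ndarray,
                         cx: float, cy: float, fx: float, fy: float) -> np.ndarray:
    """depth:   H x W ground-truth depth map (second depth map).
    normals: H x W x 3 unit surface normals n = (n_x, n_y, n_z)."""
    h_img, w_img = depth.shape
    u, v = np.meshgrid(np.arange(w_img), np.arange(h_img))
    # Back-projected 3D point P = ((u-cx)Z/fx, (v-cy)Z/fy, Z) for every pixel.
    X = (u - cx) / fx * depth
    Y = (v - cy) / fy * depth
    P = np.stack([X, Y, depth], axis=-1)
    h = np.sum(normals * P, axis=-1, keepdims=True)    # assumed formula 19
    h = np.where(np.abs(h) < 1e-6, 1e-6, h)            # guard against division by zero
    return normals / h                                  # assumed formula 20: N = n / h
```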
  • the DSN map can be calculated from the inverse depth map.
  • the calculation method can be as follows: by calculating the image gradient at the pixel point i in the inverse depth map, the DSN vector corresponding to the pixel point i is obtained, that is, the DSN vector of the plane where the 3D point corresponding to the pixel point i is located.
  • the DSN vector of the plane where the 3D point corresponding to the pixel point i is located is calculated according to the following formula 21.
  • N i is the DSN vector of the plane where the 3D point corresponding to the pixel i is located
  • N i (N xi , N yi , N zi )
  • ρ_i is the inverse depth value of the 3D point in the scene corresponding to the pixel i
  • (c x , c y ) is the center coordinate of the camera
  • (f x , f y ) is the focal length of the camera
  • (u, v) is the position coordinate of the pixel i.
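A hedged sketch of this gradient-based alternative, assuming the referenced formula follows from ρ = N_x (u - c_x)/f_x + N_y (v - c_y)/f_y + N_z, so that N_x = f_x ∂ρ/∂u, N_y = f_y ∂ρ/∂v and N_z is obtained by substitution; the function name is hypothetical.

```python
import numpy as np

def dsn_from_inverse_depth(inv_depth: np.ndarray,
                           cx: float, cy: float,
                           fx: float, fy: float) -> np.ndarray:
    """Compute a DSN map from an inverse depth map via its image gradient
    (assumed reconstruction)."""
    h_img, w_img = inv_depth.shape
    u, v = np.meshgrid(np.arange(w_img), np.arange(h_img))
    drho_dv, drho_du = np.gradient(inv_depth)           # row (v) and column (u) gradients
    n_x = fx * drho_du
    n_y = fy * drho_dv
    n_z = inv_depth - (u - cx) * drho_du - (v - cy) * drho_dv
    return np.stack([n_x, n_y, n_z], axis=-1)
```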
  • Step 302 using the training image and the second DSN map to train the initial neural network model to obtain the first neural network model.
  • the training image may be input into the initial neural network model to obtain a third DSN map.
  • the second camera filter map is determined according to the camera calibration parameters corresponding to the training image and the training image.
  • A third depth map is obtained according to the second camera filter map and the third DSN map. According to the difference between the third DSN map and the second DSN map corresponding to the training image, or the difference between the third depth map and the second depth map, or the degree of matching between the third depth map and the second depth map, the parameters of the initial neural network model are adjusted, and the above process is repeated until the training ends, so as to obtain the above-mentioned first neural network model.
  • the parameters of the neural network model can be adjusted according to the loss function until the first neural network model that satisfies the training objective is obtained.
  • the loss function may include at least one of a first loss function, a second loss function, or a third loss function.
  • the first loss function is used to represent the error between the second DSN map and the third DSN map
  • the second loss function is used to represent the error between the second depth map and the third depth map.
  • the third loss function is used to represent the matching degree between the second depth map and the third depth map.
  • a training image L501 is acquired from the database L400.
  • the training image L501 is input into the convolutional neural network L502, and the convolutional neural network L502 outputs the third DSN map L503.
  • a second camera filter map L504 is obtained according to the camera calibration parameters corresponding to the training image L501.
  • the third DSN map L503 and the second camera filter map L504 are provided to a filter L505 which outputs an inverse depth map L506.
  • the above-mentioned third depth map may be obtained based on the inverse depth map L506.
  • the third DSN map L503, the inverse depth map L506, the real DSN map L508, and the real inverse depth map L509 are provided to the loss function L507 to determine the loss function value, and adjust the convolutional neural network L502 according to the loss function value.
  • the real DSN map L508 is the second DSN map corresponding to the training image.
  • the real inverse depth map L509 can be obtained from the second depth map corresponding to the training image.
  • the loss function (L507) is defined as follows.
  • λ_DEP, λ_DSN and λ_INP are hyperparameters that weight the depth loss function, the DSN loss function and the patch loss function, respectively; each of them is greater than or equal to 0. For example, assigning zero to a hyperparameter cancels the effect of its corresponding loss function on network training.
  • The norm used in the loss terms can be the L1 norm, the L2 norm, etc.
  • the inverse depth value in the depth loss function may also be replaced by the depth value as appropriate.
  • The DSN loss function is used to compute the error between the true DSN vector (L508) and the estimated DSN vector (L503).
  • The calculation of the loss function is carried out over all valid pixels (a sketch of such a combined loss is given after this passage).
  • the loss function can be calculated through the above formula 22 to formula 25 to adjust the network parameters to obtain the first neural network model.
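A hedged PyTorch sketch of a combined loss of this shape (formulas 22 to 25 are not reproduced in this text): L1 terms on the inverse depth and on the DSN vectors over the valid pixels, plus an optional patch term whose exact definition is not given here; all names and the choice of the L1 norm are assumptions.

```python
import torch

def training_loss(pred_dsn, gt_dsn, pred_inv_depth, gt_inv_depth, valid_mask,
                  patch_loss=0.0,
                  lambda_dep=1.0, lambda_dsn=1.0, lambda_inp=0.0):
    """pred_dsn/gt_dsn: B x 3 x H x W, inverse depths and mask: B x H x W.
    Setting a hyperparameter to zero cancels the corresponding term."""
    mask = valid_mask.float()
    n_valid = mask.sum().clamp(min=1.0)
    # Depth loss on inverse depth values (L1 norm assumed here).
    loss_dep = (torch.abs(pred_inv_depth - gt_inv_depth) * mask).sum() / n_valid
    # DSN loss between estimated and ground-truth DSN vectors.
    loss_dsn = (torch.abs(pred_dsn - gt_dsn).sum(dim=1) * mask).sum() / n_valid
    return lambda_dep * loss_dep + lambda_dsn * loss_dsn + lambda_inp * patch_loss
```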
  • the estimated DSN map is obtained by inputting the training image into the convolutional neural network.
  • the estimated inverse depth map is obtained from the estimated DSN map and the camera filter map corresponding to the training image.
  • the loss function value is calculated based on the estimated DSN map, the true DSN map, the estimated inverse depth map, and the true inverse depth map.
  • the parameters of the convolutional neural network are adjusted according to the value of the loss function, and the above steps are repeated to obtain the first neural network model by training.
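A hedged sketch of one training iteration following the steps above, reusing the `training_loss` sketch from the previous block; the model is assumed to output a DSN map of shape B x 3 x H x W, and all names are hypothetical.

```python
import torch

def train_step(model, optimizer, image, gt_dsn, gt_inv_depth,
               camera_filter_map, valid_mask):
    """One assumed training iteration: predict a DSN map, filter it with the
    camera filter map to get the estimated inverse depth, compute the loss,
    and update the network parameters."""
    pred_dsn = model(image)                                      # estimated DSN map
    pred_inv_depth = (pred_dsn * camera_filter_map).sum(dim=1)   # estimated inverse depth
    loss = training_loss(pred_dsn, gt_dsn, pred_inv_depth, gt_inv_depth, valid_mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```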
  • The first neural network model can learn the mapping relationship between the input image and the DSN map, so that it can output the DSN map corresponding to the input image; using that DSN map and the camera filter map corresponding to the input image, the depth map corresponding to the input image can be obtained, so as to achieve monocular depth estimation.
  • The DSN map is related to the 3D structure of the target object corresponding to the input image and is not affected by the camera model, so that in the model application stage, even if the camera model corresponding to the image to be estimated is different from the camera model corresponding to the training image, the DSN map corresponding to the image to be estimated output by the first neural network model can still accurately represent the 3D structure of the target object corresponding to the image to be estimated.
  • a depth map is obtained based on the DSN map corresponding to the image to be estimated and the camera filter map related to the camera model. The depth map can more accurately represent the distance of the target object in space, thereby improving the accuracy of monocular depth estimation.
  • The trained first neural network model can perceive an image captured by an arbitrary camera, that is, it achieves generalizable perception across cameras, and a depth map is estimated based on the output of the first neural network model.
  • the first neural network model trained through the above steps can be configured in the electronic device, so that the electronic device can realize relatively accurate monocular depth estimation.
  • The first neural network model can also be configured in the server, so that the server can process the image to be estimated sent by the electronic device and return the DSN map; the electronic device can then obtain the depth map based on the DSN map to achieve relatively accurate monocular depth estimation.
  • the first neural network model may be a software function module or a solidified hardware circuit, for example, the hardware circuit may be an arithmetic circuit, etc.
  • the specific form of the first neural network model is not specifically limited in this embodiment of the present application.
  • the following method may also be used to train the neural network model to obtain the first neural network model.
  • FIG. 11 is a schematic diagram of a training process of a first neural network model according to an embodiment of the present application.
  • The monocular depth estimation module L515 (i.e., including the part involved in the monocular depth estimation process of the embodiment shown in FIG. 8) estimates the depth by learning from the monocular depth estimation teacher network L511.
  • the teacher network is used to synthesize paired input and output ground truth data through inverse rendering.
  • the monocular depth estimation module L515 is trained using the input and output ground truth data. That is, the neural network model in the monocular depth estimation module L515 is trained to obtain the first neural network model.
  • a trained monocular depth estimation teacher network L511 can be used as the MDE processor.
  • the teacher network L511 can estimate the depth map corresponding to the generated input image.
  • the input images and estimated depth maps of the teacher network can be used as ground truth data to augment the database L400.
  • By inputting a noise image L510, the teacher network L511 can estimate a synthetic depth map L512 corresponding to the noise. Then, by using the encoder of the teacher network L511 and the noise image L510, inverse rendering can be used to synthesize the image data L513 used as training input.
  • the inverse rendering synthetic image data L513 and the synthetic depth map L512 generated by the teacher network synthesis may be added to the database L400 as a set of paired ground truth data L514.
  • the monocular depth estimation module L515 is set as a student network, and uses the inverse rendering synthetic image data L513 generated by the teacher network as input, and the synthetic depth map L512 as ground truth data for training.
  • the monocular depth estimation module L515 processes the inversely rendered synthetic image data L513 and outputs an estimated depth map L516.
  • the estimated depth map L516 and the synthetic depth map L512 are provided to the training loss function L517.
  • the training loss function L517 can adopt the loss function in the embodiment shown in FIG. 9 above, and of course other forms of loss functions can also be used.
  • This is not specifically limited in the embodiments of the present application. The neural network model in the monocular depth estimation module L515 is adjusted according to the training loss function L517 to obtain the first neural network model.
  • The monocular depth estimation module L515 can be trained by learning from other MDE processors (for example, off-the-shelf MDE software or MDE networks that have already been trained), without directly accessing the raw data used to train these MDE processors/networks. This approach is achieved through a knowledge distillation algorithm and has higher training efficiency.
  • FIG. 12 is a schematic diagram of a training process of a first neural network model according to an embodiment of the present application.
  • the monocular depth estimation module L523 (the same module as L515 above) and the MDE processor L521 use noise as input at the same time.
  • a trained monocular depth estimation teacher network L521 can be used as an MDE processor for augmenting the database L400.
  • the teacher network L521 can estimate the depth map corresponding to the input image.
  • the monocular depth estimation module L523 is set as a student network.
  • a noisy image can be simultaneously input to the monocular depth estimation module L523 and the MDE processor L521.
  • the monocular depth estimation module L523 processes the noisy image and outputs the depth map L524 estimated by the student network.
  • the monocular depth estimation teacher network L521 processes the noisy image and outputs the depth map L522 estimated by the teacher network.
  • the depth map L524 estimated by the student network and the depth map L522 estimated by the teacher network are provided to the training loss function L525.
  • The monocular depth estimation module L523 is adjusted to reduce the loss error between the depth map L524 estimated by the student network and the depth map L522 estimated by the teacher network, thereby completing the training of the student network.
  • The monocular depth estimation module L523 can be trained by learning from other MDE processors (for example, off-the-shelf MDE software or MDE networks that have already been trained), without directly accessing the raw data used to train these MDE processors/networks.
  • This approach is achieved through a cutting-edge knowledge distillation algorithm. This method has higher training efficiency.
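A hedged sketch of the distillation variant of FIG. 12, assuming an L1 error between the student's and the frozen teacher's estimated depth maps; all names are hypothetical.

```python
import torch

def distillation_step(student, teacher, optimizer, noise_batch):
    """Student (monocular depth estimation module) and frozen teacher MDE
    network take the same noise image; the student is updated to reduce the
    error between the two estimated depth maps."""
    with torch.no_grad():
        teacher_depth = teacher(noise_batch)      # depth map estimated by the teacher (L522)
    student_depth = student(noise_batch)          # depth map estimated by the student (L524)
    loss = torch.nn.functional.l1_loss(student_depth, teacher_depth)  # training loss (L525)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```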
  • the embodiments of the present application further provide a monocular depth estimation apparatus, which is used for performing the method steps in the above method embodiments.
  • the monocular depth estimation apparatus may include: an acquisition module 91 , a DSN module 92 , a camera filter mapping module 93 and a depth estimation module 94 .
  • the obtaining module 91 is configured to obtain the to-be-estimated image and the first parameter corresponding to the to-be-estimated image, where the first parameter is a camera calibration parameter of the camera that shoots the to-be-estimated image.
  • The distance scaled normal (DSN) module 92 is used to input the image to be estimated into the first neural network model and obtain the first distance scaled normal (DSN) map output by the first neural network model, where the first DSN map is used to indicate the orientation of the plane of the target object corresponding to the image to be estimated and the distance between the plane and the camera.
  • the camera filter mapping module 93 is configured to determine a first camera filter map according to the to-be-estimated image and the first parameter, where the first camera filter map is used to represent the 3D point of the target object in space The mapping relationship with the 2D plane, where the 2D plane is the imaging plane of the camera.
  • the depth estimation module 94 is configured to determine a first depth map corresponding to the image to be estimated according to the first DSN map and the first camera filter map.
  • The first neural network model is obtained by training with a training image and a second DSN map corresponding to the training image, and the second DSN map is determined according to a second depth map corresponding to the training image and the camera calibration parameters corresponding to the training image.
  • the training image is used as the input of the initial neural network model
  • the loss function includes at least one of a first loss function, a second loss function or a third loss function
  • the loss function is used to adjust the parameters of the initial neural network model, so as to train and obtain the first neural network model.
  • the first loss function is used to represent the error between the second DSN map and the third DSN map
  • the third DSN map is the DSN map corresponding to the training image output by the initial neural network model.
  • the second loss function is used to represent the error between the second depth map and the third depth map
  • the third depth map is determined according to the third DSN map and the second camera filter map
  • the The second camera filter map is determined according to the camera calibration parameters corresponding to the training image and the training image
  • the third loss function is used to represent the matching degree of the second depth map and the third depth map.
  • The camera filter map module 93 is configured to determine the first camera filter map according to the position coordinates of the pixels of the image to be estimated and the first parameter; the first camera filter map includes the camera filter map vector corresponding to the pixel, the camera filter map vector is used to represent the mapping relationship between the 3D point and the pixel, and the pixel is the point where the 3D point is projected onto the 2D plane.
  • the position coordinates of the pixel include abscissa and ordinate
  • the camera filter map vector corresponding to the pixel includes a first camera filter map component and a second camera filter map component; the first camera filter map component is determined according to the abscissa and the first parameter, and the second camera filter map component is determined according to the ordinate and the first parameter, or according to the abscissa, the ordinate and the first parameter.
  • When the field of view of the camera that captures the image to be estimated is less than 180 degrees, the first parameter includes the center coordinates (c_x, c_y) and the focal length (f_x, f_y) of the camera; the first camera filter map component is determined according to the abscissa and the first parameter, and the second camera filter map component is determined according to the ordinate and the first parameter.
  • When the field of view of the camera that captures the image to be estimated is not less than 180 degrees, the first parameter includes a width pixel value W and a height pixel value H of the image to be estimated; the first camera filter map component is determined according to the abscissa and the first parameter, and the second camera filter map component is determined according to the abscissa, the ordinate and the first parameter.
  • the first DSN map includes a first DSN vector corresponding to a pixel of the image to be estimated
  • The depth estimation module 94 is configured to determine the depth value corresponding to the pixel according to the first DSN vector corresponding to the pixel and the camera filter map vector corresponding to the pixel.
  • the first depth map includes depth values corresponding to the pixels.
  • the monocular depth estimation apparatus provided in the embodiment of the present application can be used to execute the above-mentioned monocular depth estimation method, and the content and effect thereof may refer to the method section, which will not be repeated in this embodiment of the present application.
  • the electronic device may include: an image collector 1001 configured to acquire an image to be estimated and a first parameter corresponding to the image to be estimated; one or more processors 1002 ; a memory 1003 ;
  • the various devices described above may be connected by one or more communication buses 1005 .
  • The above-mentioned memory 1003 stores one or more computer programs 1004; the one or more processors 1002 are used to execute the one or more computer programs 1004, and the one or more computer programs 1004 include instructions that can be used to execute the various steps in the above method embodiments.
  • processors 1002 are used to execute one or more computer programs 1004 to perform the following actions:
  • the to-be-estimated image and a first parameter corresponding to the to-be-estimated image are acquired, where the first parameter is a camera calibration parameter of a camera that captures the to-be-estimated image.
  • The image to be estimated is input into the first neural network model, and the first DSN map output by the first neural network model is obtained. A first camera filter map is determined according to the image to be estimated and the first parameter, and the first camera filter map is used to represent the mapping relationship between the 3D points of the target object in space and the 2D plane.
  • the 2D plane is the imaging plane of the camera.
  • a first depth map corresponding to the image to be estimated is determined according to the first DSN map and the first camera filter map.
  • The first neural network model is obtained by training with a training image and a second DSN map corresponding to the training image, and the second DSN map is determined according to a second depth map corresponding to the training image and the camera calibration parameters corresponding to the training image.
  • the training image is used as the input of the initial neural network model
  • the loss function includes at least one of a first loss function, a second loss function or a third loss function
  • the loss function is used to adjust the parameters of the initial neural network model, so as to train and obtain the first neural network model.
  • the first loss function is used to represent the error between the second DSN map and the third DSN map
  • the third DSN map is the DSN map corresponding to the training image output by the initial neural network model.
  • the second loss function is used to represent the error between the second depth map and the third depth map
  • the third depth map is determined according to the third DSN map and the second camera filter map
  • the The second camera filter map is determined according to the camera calibration parameters corresponding to the training image and the training image
  • the third loss function is used to represent the matching result of the second depth map and the third depth map.
  • The first camera filter map is determined according to the position coordinates of the pixels of the image to be estimated and the first parameter, and the first camera filter map includes the camera filter map vector corresponding to the pixel, where the camera filter map vector is used to represent the mapping relationship between the 3D point and the pixel, and the pixel is the point where the 3D point is projected onto the 2D plane.
  • the position coordinates of the pixel include abscissa and ordinate
  • the camera filter map vector corresponding to the pixel includes a first camera filter map component and a second camera filter map component; the first camera filter map component is determined according to the abscissa and the first parameter, and the second camera filter map component is determined according to the ordinate and the first parameter, or according to the abscissa, the ordinate and the first parameter.
  • When the field of view of the camera that captures the image to be estimated is less than 180 degrees, the first parameter includes the center coordinates (c_x, c_y) and the focal length (f_x, f_y) of the camera; the first camera filter map component is determined according to the abscissa and the first parameter, and the second camera filter map component is determined according to the ordinate and the first parameter.
  • When the field of view of the camera that captures the image to be estimated is not less than 180 degrees, the first parameter includes a width pixel value W and a height pixel value H of the image to be estimated; the first camera filter map component is determined according to the abscissa and the first parameter, and the second camera filter map component is determined according to the abscissa, the ordinate and the first parameter.
  • the first DSN map includes a first DSN vector corresponding to a pixel of the image to be estimated, according to the first DSN vector corresponding to the pixel and a camera filter mapping vector corresponding to the pixel , and determine the depth value corresponding to the pixel point.
  • the first depth map includes depth values corresponding to the pixels.
  • the electronic device shown in FIG. 14 may also include other devices such as an audio module and a SIM card interface, which are not limited in this embodiment of the present application.
  • the embodiment of the present application further provides a monocular depth estimation apparatus.
  • The monocular depth estimation apparatus includes a processor 1101 and a transmission interface 1102, where the transmission interface 1102 is used to obtain the image to be estimated and the first parameter corresponding to the image to be estimated.
  • the transmission interface 1102 may include a sending interface and a receiving interface.
  • The transmission interface 1102 may be any type of interface according to any proprietary or standardized interface protocol, such as a high definition multimedia interface (HDMI), Mobile Industry Processor Interface (MIPI), Display Serial Interface (DSI) standardized by MIPI, Embedded Display Port (eDP) standardized by the Video Electronics Standards Association (VESA), Display Port (DP) or V-By-One interface (a digital interface standard developed for image transmission), as well as various wired or wireless interfaces, optical interfaces, etc.
  • the processor 1101 is configured to call the program instructions stored in the memory to execute the monocular depth estimation method according to the above method embodiment.
  • the apparatus further includes a memory 1103 .
  • The processor 1101 may be a single-core processor or a multi-core processor group.
  • the transmission interface 1102 is an interface for receiving or sending data
  • the data processed by the monocular depth estimation apparatus may include video data or image data.
  • the monocular depth estimation apparatus may be a processor chip.
  • The embodiments of the present application further provide a computer storage medium. The computer storage medium may include computer instructions which, when executed on the electronic device, cause the electronic device to perform the various steps of the above method embodiments.
  • The embodiments of the present application further provide a computer program product which, when run on a computer, causes the computer to execute the various steps of the foregoing method embodiments.
  • the processor mentioned in the above embodiments may be an integrated circuit chip, which has signal processing capability.
  • each step of the above method embodiments may be completed by a hardware integrated logic circuit in a processor or an instruction in the form of software.
  • The processor can be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the steps of the methods disclosed in the embodiments of the present application may be directly embodied as executed by a hardware coding processor, or executed by a combination of hardware and software modules in the coding processor.
  • the software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory or electrically erasable programmable memory, registers and other storage media mature in the art.
  • the storage medium is located in the memory, and the processor reads the information in the memory, and completes the steps of the above method in combination with its hardware.
  • the memory mentioned in the above embodiments may be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory.
  • The non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM) or flash memory.
  • Volatile memory may be random access memory (RAM), which acts as an external cache.
  • By way of example, many forms of RAM are available, such as dynamic random access memory (DRAM), synchronous DRAM (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM) and direct rambus RAM.
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • Multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the functions, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium.
  • the technical solutions of the present application essentially, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions for causing a computer device (a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application.
  • the aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.


Abstract

一种单目深度估计方法、装置及设备。所述单目深度估计方法可以包括:获取待估计图像和相机标定参数,通过第一神经网络模型得到待估计图像对应的第一DSN图,根据相机标定参数确定第一相机滤波映射图,再基于第一DSN图和第一相机滤波映射图得到深度图。所述单目深度估计方法可以适用于现实生活中一般或普遍场景中的目标物体的深度估计,并且具有较好的深度估计准确性。

Description

单目深度估计方法、装置及设备 技术领域
本申请实施例涉及计算机视觉领域,尤其涉及一种单目深度估计方法、装置及设备。
背景技术
单目深度估计是利用单个相机拍摄的图像作为输入,估计现实世界的深度图像(深度图)。深度图中的每个像素都存储一个深度值,该深度值是该像素对应的现实世界的三维(3-dimension,3D)坐标点与相机视点之间的距离。单目深度估计可应用于诸多重要的需要三维环境信息的应用场景。这些应用场景包括但不限于增强现实(AR)、导航(如自动驾驶)、场景重建、场景识别、物体检测等。
单目深度估计(Monocular Depth Estimation,MDE)所使用的单目相机通常为RGB或灰度(Gray)摄像头。其原因是此类摄像头在智能手机、平板电脑等普通消费电子设备上被广泛使用。RGB或灰度(Gray)摄像头在光照好、小光比、相机/场景运动稳定的情况下,能捕捉到较好的图像效果。以单目相机为RGB摄像头为例,通过获取来自单目相机的两个RGB帧,对两个RGB帧进行立体匹配计算得到深度图。
上述通过两个RGB帧进行立体匹配估计深度的方式存在如下问题,在采集该两个RGB帧过程中,需要单目相机处于运动状态而现实世界的目标物体处于静止状态,且环境中的纹理细节和光照条件较好。然而,现实生活中,场景中的目标物体往往是动态的,例如,马路上的汽车等。这使得上述单目深度估计方式无法适用于现实生活中的目标物体的深度估计。
发明内容
本申请提供一种单目深度估计方法、装置及设备,以适用于现实生活中一般或普遍场景中的目标物体的深度估计。
第一方面,本申请实施例提供一种单目深度估计方法,该方法可以包括:获取待估计图像和该待估计图像对应的第一参数,该第一参数为拍摄该待估计图像的相机的相机标定参数。将该待估计图像输入至第一神经网络模型中,获取第一神经网络模型输出的第一距离缩放法线DSN图,该第一DSN图用于表示该待估计图像对应的目标物体的平面的朝向和该平面与该相机之间的距离。根据该待估计图像和该第一参数,确定第一相机滤波映射图,该第一相机滤波映射图用于表示该目标物体在空间中的3D点与2D平面的映射关系,该2D平面为该相机的成像平面。根据该第一DSN图和该第一相机滤波映射图,确定该待估计图像对应的第一深度图。
与直接使用神经网络模型输出深度图不同,本实现方式,基于第一DSN图和第一相机滤波映射图得到深度图,该深度图可以精确反映目标物体的距离,从而提升单目深度估计的准确性。与通过两个RGB帧进行立体匹配估计深度的方式不同,本实现方式通过一帧 待估计图像便可以进行深度估计,没有需要单目相机处于运动状态而现实世界的目标物体处于静止状态的场景限制,可以适用于现实生活中一般或普遍场景中的目标物体的深度估计。
一种可能的设计中,该第一神经网络模型为使用训练图像和该训练图像对应的第二DSN图进行训练得到的,该第二DSN图是根据该训练图像对应的第二深度图、以及该训练图像对应的相机标定参数确定的。
本实现方式,由于第一神经网络模型是使用训练图像和该训练图像对应的第二DSN图进行训练得到的,所以第一神经网络模型具有输出输入图像对应的DSN图的能力,进而可以使用该DSN图和输入图像对应的相机滤波映射图,可以得到输入图像对应的深度图,以实现单目深度估计。
一种可能的设计中,该训练图像作为初始神经网络模型的输入。损失函数包括第一损失函数、第二损失函数或第三损失函数中至少一项,该损失函数用于调整该初始神经网络模型的参数,以训练得到该第一神经网络模型。该第一损失函数用于表示该第二DSN图和第三DSN图之间的误差,该第三DSN图为该初始神经网络模型输出的该训练图像对应的DSN图,该第二损失函数用于表示该第二深度图和第三深度图之间的误差,该第三深度图是根据该第三DSN图和第二相机滤波映射图确定的,该第二相机滤波映射图为根据该训练图像对应的相机标定参数和该训练图像确定的,该第三损失函数用于表示该第二深度图和第三深度图的匹配程度。
本实现方式,在神经网络模型训练过程中,通过评价该第二DSN图和第三DSN图之间的误差,该第二深度图和第三深度图之间的误差,或该第二深度图和第三深度图的匹配程度中一项或多项,对神经网络模型进行调整,以使得调整后的神经网络模型,满足一项或多项精度需求,进而提升使用训练后的神经网络模型的本申请实施例的单目深度估计方法的准确率。
一种可能的设计中,该训练图像可以是RGB相机、灰度相机、夜视相机、热敏相机、全景相机、事件相机或红外相机等任意相机拍摄所得到的图像。
本实现方式,通过采用RGB相机、灰度相机、夜视相机、热敏相机、全景相机、事件相机或红外相机等任意相机拍摄所得到的图像对神经网络模型进行训练,使得本申请实施例的单目深度估计方法可以支持对不同类型的相机所采集的图像进行深度估计。
一种可能的设计中,根据该待估计图像和该第一参数,确定第一相机滤波映射图,包括:根据该待估计图像的像素点的位置坐标和该第一参数,确定该第一相机滤波映射图,该第一相机滤波映射图包括该像素点对应的相机滤波映射向量,该相机滤波向量用于表示该3D点与该像素点的映射关系,该像素点为该3D点投影至该2D平面的点。
本实现方式,通过该待估计图像的像素点的位置坐标和该第一参数,确定该第一相机滤波映射图。该相机滤波映射图与输入图像的像素点以及相机模型相关,而不受场景中目标物体的3D结构影响。对于相同相机所拍摄的输入图像,其对应的相机滤波映射图相同,可以只计算一次。在更换相机时,可以根据新的相机的相机标定参数重新计算相机滤波映射图。对于相同相机所拍摄的输入图像,通过相机滤波映射图和DSN图得到深度图,可以提升单目深度估计的处理速度。
一种可能的设计中,该像素点的位置坐标包括横坐标和纵坐标,该像素点对应的相机滤波映射向量包括第一相机滤波映射分量和第二相机滤波映射分量,该第一相机滤波映射分量是根据该横坐标和该第一参数确定的,该第二相机滤波映射分量是根据该纵坐标和该第一参数,或者根据该横坐标、该纵坐标和该第一参数确定的。
一种可能的设计中,当拍摄该待估计图像的相机的视场角小于180度时,该第一参数包括述相机的中心坐标(c x,c y)和焦距(f x,f y),该第一相机滤波映射分量是根据该横坐标和该第一参数确定的,该第二相机滤波映射分量是根据该纵坐标和该第一参数确定的。
一种可能的设计中,当拍摄该待估计图像的相机的视场角小于180度时,该第一参数包括相机的中心坐标(c x,c y)和焦距(f x,f y),该像素点的位置坐标为i=(u,v),该第一相机滤波映射分量为F u
F u=(u-c x)/f x
该第二相机滤波映射分量为F v
F v=(v-c y)/f y
一种可能的设计中,当该待估计图像的相机的视场角大于180度时,该第一参数包括该待估计图像的宽度像素值W和高度像素值H,该第一相机滤波映射分量是根据该横坐标和该第一参数确定的,该第二相机滤波映射分量是根据该横坐标、该纵坐标和该第一参数确定的。
一种可能的设计中,当该待估计图像的相机的视场角大于180度时,该第一参数包括该待估计图像的宽度像素值W和高度像素值H,该像素点的位置坐标为i=(u,v),该第一相机滤波映射分量为F u
Figure PCTCN2021075318-appb-000003
该第二相机滤波映射分量为F v
Figure PCTCN2021075318-appb-000004
Figure PCTCN2021075318-appb-000005
一种可能的设计中,该第一DSN图包括该待估计图像的像素点对应的第一DSN向量,根据该第一DSN图和该第一相机滤波映射图,确定该待估计图像对应的第一深度图,包括:根据该像素点对应的第一DSN向量和该像素点对应的相机滤波映射向量,确定该像素点对应的深度值。其中,该第一深度图包括该像素点对应的深度值。
一种可能的设计中,根据该像素点对应的第一DSN向量和该像素点对应的相机滤波映射向量,确定该像素点对应的深度值,包括:
根据公式ξ=N·F,确定该像素点对应的逆深度值。
根据该像素点对应的逆深度值,确定该像素点对应的深度值。
其中,ξ为该像素点对应的逆深度值,N为该像素点对应的第一DSN向量,F为该像素点对应的相机滤波映射向量。
一种可能的设计中,该方法还可以包括:获取训练图像、训练图像对应的第二深度图像、以及训练图像对应的相机标定参数。使用训练图像、训练图像对应的第二深度图像以及训练图像对应的相机标定参数对初始神经网络模型进行训练,获取该第一神经网络模型。
一种可能的设计中,使用训练图像、训练图像对应的第二深度图像以及训练图像对应的相机标定参数对初始神经网络模型进行训练,获取第一神经网络模型,包括:根据训练图像对应的相机标定参数和训练图像,确定第二相机滤波映射图,该第二相机滤波映射图包括该训练图像中的像素点对应的相机滤波映射向量。根据该第二相机滤波映射图和该第二深度图像,获取第二DSN图,该第二DSN图包括该训练图像的像素点对应的第二DSN向量,该第二DSN向量用于表示该像素点对应的场景中的3D点所在平面的朝向和与相机之间的距离。将训练图像输入至初始神经网络模型,获取初始神经网络模型输出的第三DSN图。根据第二DSN图、第三DSN图、第二深度图像或第二相机滤波映射图中至少两项,调整初始神经网络模型的参数,获取第一神经网络模型。
一种可能的设计中,根据第二相机滤波映射图和第二深度图像,获取第二DSN图,包括:根据训练图像的像素点i=(u,v)和像素点i的相邻像素点,确定像素点i对应的场景中的3D点的单位表面法线;
根据公式
Figure PCTCN2021075318-appb-000006
确定相机到3D点所在平面的距离h i
根据公式
Figure PCTCN2021075318-appb-000007
确定3D点所在平面的DSN向量N i
其中,i=(u,v),单位表面法线为(n xi,n yi,n zi),Z为第二深度图像中的3D点的深度值,训练图像的像素点i对应的相机滤波映射向量为F,
Figure PCTCN2021075318-appb-000008
Figure PCTCN2021075318-appb-000009
第二DSN图包括训练图像的像素点对应的场景中的3D点所在平面的DSN向量。
一种可能的设计中,根据所述第二相机滤波映射图和第二深度图像,获取第二DSN图,包括:
根据公式
Figure PCTCN2021075318-appb-000010
确定训练图像的像素点i对应的场景中的3D点所在平面的DSN向量N i,N i=(N xi,N yi,N zi);
其中,i=(u,v),ξ i为像素点i对应的场景中的3D点的逆深度值,逆深度值为深度值的倒数,第二DSN图包括训练图像的像素点对应的场景中的3D点所在平面的DSN向量。
一种可能的设计中,第一损失函数为
Figure PCTCN2021075318-appb-000011
Figure PCTCN2021075318-appb-000012
Figure PCTCN2021075318-appb-000013
其中,
Figure PCTCN2021075318-appb-000014
表示第二DSN图中像素点i对应的DSN向量,N i=(N xi,N yi,N zi)表示第三DSN图中像素点i对应的DSN向量,
Figure PCTCN2021075318-appb-000015
表示全部的有效像素集合。
一种可能的设计中,第二损失函数为
Figure PCTCN2021075318-appb-000016
Figure PCTCN2021075318-appb-000017
其中,‖·‖代表范式,
Figure PCTCN2021075318-appb-000018
表示第二深度图中像素点i对应的逆深度值,ξ i表示第三深度图中像素点i对应的逆深度值,
Figure PCTCN2021075318-appb-000019
表示全部的有效像素集合。
一种可能的设计中第三损失函数为
Figure PCTCN2021075318-appb-000020
Figure PCTCN2021075318-appb-000021
Figure PCTCN2021075318-appb-000022
其中,
Figure PCTCN2021075318-appb-000023
是拉普拉斯算子,I是第二深度图中与第三深度图相匹配的图像数据,
Figure PCTCN2021075318-appb-000024
表 示相匹配的像素集合。
一种可能的设计中,损失函数为
Figure PCTCN2021075318-appb-000025
Figure PCTCN2021075318-appb-000026
其中,λ DEP,λ DSN和λ INP分别大于或等于0。
一种可能的设计中,获取训练图像和训练图像对应的第二深度图像,包括以下至少一项:
获取多个训练图像,多个训练图像是由多个经标定和同步的相机拍摄场景得到的图像数据;使用3D视觉技术对所述多个训练图像进行处理,得到多个训练图像对应的第二深度图像;或者,
获取至少一个训练图像,以及至少一个训练图像中每个训练图像对应的第二深度图像;或者,
获取至少一个原始图像,对至少一个原始图像进行数据优化或数据增强,得到至少一个训练图像,数据优化包括空洞填充优化、锐化遮挡边缘优化或时间一致性优化中至少一项,数据增强用于改变原始图像所拍摄场景的环境条件以获取不同环境条件下的所述训练图像。
一种可能的设计中,每个训练图像对应的第二深度图像为深度传感器获取的;或者,每个训练图像对应的第二深度图像为教师单目深度估计网络对输入的训练图像进行处理后输出的深度图像。
第二方面,本申请实施例提供一种单目深度估计装置,该装置具有实现上述第一方面或第一方面任一种可能的设计的功能。所述功能可以通过硬件实现,也可以通过硬件执行相应的软件实现。所述硬件或软件包括一个或多个与上述功能相对应的模块,例如,获取单元或模块,DSN单元或模块,相机滤波映射单元或模块,深度估计单元或模块。
第三方面,本申请实施例提供一种电子设备,该电子设备可以包括:一个或多个处理器;一个或多个存储器;其中,所述一个或多个存储器用于存储一个或多个程序;所述一个或多个处理器用于运行所述一个或多个程序,以实现如第一方面或第一方面任一种可能的设计所述的方法。
第四方面,本申请实施例提供一种计算机可读存储介质,其特征在于,包括计算机程序,所述计算机程序在计算机上被执行时,使得所述计算机执行如第一方面或第一方面任一种可能的设计所述的方法。
第五方面,本申请实施例提供一种芯片,其特征在于,包括处理器和存储器,所述存储器用于存储计算机程序,所述处理器用于调用并运行所述存储器中存储的计算机程序,以执行如第一方面或第一方面任一种可能的设计所述的方法。
第六方面,本申请实施例提供一种计算机程序产品,当计算机程序产品在计算机上运行时,使得计算机执行如第一方面或第一方面任一种可能的设计所述的方法。
本申请实施例的单目深度估计方法、装置及设备,通过获取待估计图像和相机标定参数,通过第一神经网络模型得到待估计图像对应的第一DSN图,根据相机标定参数确定第一相机滤波映射图,再基于第一DSN图和第一相机滤波映射图得到深度图。与直接使用神经网络模型输出深度图不同,本申请实施例基于第一DSN图和第一相机滤波映射图得到深度图,该深度图可以精确反映目标物体的距离,从而提升单目深度估计的准确性。与通 过两个RGB帧进行立体匹配估计深度的方式不同,本申请实施例通过一帧待估计图像便可以进行深度估计,没有需要单目相机处于运动状态而现实世界的目标物体处于静止状态的场景限制,可以适用于现实生活中一般或普遍场景中的目标物体的深度估计。
附图说明
图1为本申请实施例提供的一种***架构100的示意图;
图2为本申请实施例提供的一种卷积神经网络(CNN)200的示意图;
图3为本申请实施例提供的一种芯片硬件结构的示意图;
图4为本申请实施例提供的一种***架构400的示意图;
图5为本申请实施例提供的一种单目深度估计方法的流程图;
图6为本申请实施例提供的另一种单目深度估计方法的流程图;
图7为本申请实施例提供的一种场景中的3D点和透视(针孔)相机模型之间的几何对应关系的示意图;
图8为本申请实施例提供的一种单目深度估计处理过程的示意图;
图9为本申请实施例提供的一种第一神经网络模型的训练方法的流程图;
图10为本申请实施例提供的一种训练过程的示意图;
图11为本申请实施例提供的一种第一神经网络模型的训练过程的示意图;
图12为本申请实施例提供的一种第一神经网络模型的训练过程的示意图;
图13为本申请实施例提供的一种单目深度估计装置的结构示意图;
图14为本申请实施例提供的一种电子设备的结构示意图;
图15为本申请实施例提供的另一种单目深度估计装置的结构示意图。
具体实施方式
本申请实施例涉及的术语“第一”、“第二”等仅用于区分描述的目的,而不能理解为指示或暗示相对重要性,也不能理解为指示或暗示顺序。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元。方法、***、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。
应当理解,在本申请中,“至少一个(项)”是指一个或者多个,“多个”是指两个或两个以上。“和/或”,用于描述关联对象的关联关系,表示可以存在三种关系,例如,“A和/或B”可以表示:只存在A,只存在B以及同时存在A和B三种情况,其中A,B可以是单数或者复数。字符“/”一般表示前后关联对象是一种“或”的关系。“以下至少一项(个)”或其类似表达,是指这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b或c中的至少一项(个),可以表示:a,b,c,“a和b”,“a和c”,“b和c”,或“a和b和c”,其中a,b,c分别可以是单个,也可以是多个。
与通过两个RGB帧进行立体匹配估计深度的方式不同,本申请实施例的单目深度估计方法,没有需要单目相机处于运动状态而现实世界的目标物体处于静止状态的场景限制,可以适用于现实生活中一般或普遍场景中的目标物体的深度估计。本申请实施例所述的一般或普遍场景具体指没有条件限制的任意场景,其中,条件限制可以包括但不限于光照条 件限制、相机类型限制、目标物体类型限制、或相机与场景中的目标物体之间相对位置关系限制等。举例而言,光照条件可以是场景所在环境的光照好坏。相机类型可以是RGB相机、灰度相机、事件相机、夜视相机或热敏相机等。目标物体类型可以是人物、动物或物体等。相机与场景中的目标物体之间相对位置关系可以是近景、远景、静止或移动等。
本申请实施例所述的一般或普遍场景可以包括但不限于增强现实(Augmented Reality,AR)、导航(例如,自动驾驶或辅助驾驶)、场景重建、场景理解或物体检测等。
本申请实施例提供的方案中,将待估计图像输入至第一神经网络模型中,获取第一神经网络模型的第一距离缩放法线(Distance Scaled Normal,DSN)图,根据待估计图像和第一参数,确定第一相机滤波映射图,进而根据第一DSN图和第一相机滤波映射图,确定该待估计图像对应的第一深度图。其中,第一参数是拍摄该待估计图像的相机的相机标定参数。第一DSN图与场景中目标物体的3D结构(例如,几何结构)有关,而不受相机模型(例如,包括几何投影模型和相机标定参数等)的影响。而第一相机滤波映射图与待估计图像的像素点以及相机模型相关,而不受场景中目标物体的3D结构影响。相较于传统深度估计网络,通过第一DSN图和第一相机滤波映射图确定深度图,可以提升单目深度估计的准确率和效率。其具体实现方式可以参见下述实施例的解释说明。
首先对本申请实施例中的部分用语进行解释说明,以便于理解本申请实施例的单目深度估计方法。
目标物体:包括但不限于人物、动物或物体。物体可以是自然环境中的物体,例如,草地、或树木等,也可以是人文环境中的物体,例如,建筑物、道路、或车辆等。目标物体的表面通常具有规则平面结构。在一些实施例中,即使目标物体的表面没有完全平面的结构,该目标物体的表面也是可以被分割成多个小的平面区域。
目标物体在空间中的3D点:目标物体在空间中的外表面的平面上的点。例如,目标物体为车辆,那么目标物体在空间中的3D点可以是车辆的外表面上的任意一点,比如,前档风玻璃所构成平面上的点,车牌所构成平面上的点等。
距离缩放法线(Distance Scaled Normal,DSN)图:用于表示输入图像对应的目标物体的平面的朝向和该平面与相机之间的距离。其中,相机是指拍摄该输入图像的相机,目标物体的平面是指目标物体在3D空间中的外表面的平面。例如,目标物体为一个正方体,相机从一个角度拍摄该正方体,所得到的输入图像中呈现了该正方体的3个平面,那么目标物体的平面是指该正方体的3个平面。DSN图包括该正方体的3个平面的DSN向量,每一个平面的DSN向量可以表示各自平面的朝向和各自平面与相机之间的距离。每一个平面的DSN向量与场景中目标物体的3D结构有关,而不受相机模型的影响。目标物体的处于相同平面的共面3D点所对应的DSN向量相同。在DSN图中,每个像素点的DSN向量总是与相邻的属于同一平面的像素点的DSN向量相同。以输入图像为灰度图像为例,与输入图像中的每个像素点存储一个灰度值不同,DSN图中的每个像素点存储一个DSN向量。DSN图中的每个像素点所存储的数据的个数称为通道个数。由于这里每个像素点存储一个DSN向量,所以,通道个数等于DSN向量所包括的分量的个数。例如,空间中一个 3D点P=(X,Y,Z)的DSN向量为N=(N x,N y,N z),即DSN向量所包括的分量的个数为3,那么DSN图中的一个像素点所存储的数据为3通道的数据,DSN图中的一个像素点的一个通道用于存储该DSN向量的一个分量,也即一个维度的分量。以DSN向量为N=(N x,N y,N z)为例,DSN图中的一个像素点的一个通道用于存储N x,另一个通道用于存储N y,另一个通道用于存储N z
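作为示意(非本申请原文的一部分),下面给出一段基于Python/NumPy的简单示例,用于说明上述"DSN图中每个像素点存储一个3通道DSN向量"的数据组织方式;其中图像尺寸、平面的法线与距离等数值均为为便于说明而假设的,按"单位法线除以距离"的构造方式取值(与后文公式20一致):

```python
import numpy as np

H, W = 480, 640                                   # 假设的图像高度和宽度
dsn_map = np.zeros((H, W, 3), dtype=np.float32)   # 每个像素存储一个3通道的DSN向量

# 处于同一平面的共面3D点对应的DSN向量相同。
# 例如:假设某平面的单位法线为(0, 0, 1),该平面与相机的距离为2,
# 则该平面内所有像素存储的DSN向量均相同
dsn_map[100:200, 150:300] = np.array([0.0, 0.0, 1.0], dtype=np.float32) / 2.0

N_i = dsn_map[120, 200]   # 读取像素点i=(u=200, v=120)处存储的DSN向量
print(N_i)                # 输出: [0.  0.  0.5]
```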
本申请实施例的单目深度估计方法可以使用神经网络模型对输入图像进行处理,得到DSN图。在神经网络模型应用过程中,输入图像为待估计图像,使用第一神经网络模型(也称目标神经网络模型)对待估计图像进行处理,得到第一DSN图。在神经网络模型训练过程中,输入图像为训练图像,使用初始神经网络模型对训练图像进行处理,得到第二DSN图。本申请实施例使用第一DSN图和第二DSN图以区别不同过程中神经网络模型所输出的DSN图。
相机滤波映射图:用于表示目标物体在空间中的3D点与2D平面的映射关系,该2D平面为相机的成像平面。相机滤波映射图与输入图像的像素点以及相机模型相关,而不受场景中目标物体的3D结构影响。相机模型可以包括几何投影模型、相机标定参数等,相机标定参数可以包括相机的中心坐标和焦距等。对于相同相机所拍摄的输入图像,其对应的相机滤波映射图相同,可以只计算一次。在更换相机时,可以根据新的相机的相机标定参数重新计算相机滤波映射图。以输入图像为灰度图像为例,与输入图像中的每个像素点存储一个灰度值不同,相机滤波映射图中的每个像素点存储一个相机滤波映射向量。举例而言,相机1拍摄目标物体1,得到的输入图像11,相机1拍摄目标物体2,得到的输入图像12,由于输入图像11和输入图像12均是相机1采集得到的,所以,输入图像11对应的相机滤波映射图和输入图像12对应的相机滤波映射图相同。
深度图:用于表示输入图像对应的目标物体在空间中的3D点到相机的距离(深度)。深度图中的每个像素点存储一个深度值,该深度值估计的是在相机拍摄该输入图像的时间,像素点对应的现实世界的3D点与相机视点之间的距离。该现实世界的3D点可以是任意场景中的任意目标物体,在空间中的3D点。
深度图中一个像素点的深度值可以通过两部分确定,这两部分包括该像素点的DSN向量和该像素点的相机滤波映射向量。
本申请实施例的电子设备可以是手机(mobile phone)、平板电脑(Pad)、带无线收发功能的电脑、虚拟现实(Virtual Reality,VR)终端设备、增强现实(Augmented Reality,AR)终端设备、工业控制(industrial control)中的终端设备、辅助驾驶的终端设备、无人驾驶(self driving)中的终端设备、远程手术(remote medical surgery)中的终端设备、智能电网(smart grid)中的终端设备、运输安全(transportation safety)中的终端设备、智慧城市(smart city)中的终端设备、智慧家庭(smart home)中的终端设备、智能手表、智能手环,智能眼镜,以及其他运动配件或可穿戴设备等等。例如,智慧家庭(smart home)中的终端设备可以是智能电视、智能音箱等智能家电。
由于本申请实施例涉及大量神经网络的应用,为了便于理解,下面先对本申请实施例涉及的相关术语及神经网络等相关概念进行介绍。
(1)神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以x s和截距1为输入的运算单元,该运算单元的输出可以为:
h W,b(x)=f(W Tx)=f(∑ s=1 n W sx s+b)
其中,s=1、2、……n,n为大于1的自然数,W s为x s的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于对神经网络中获取到的特征进行非线性变换,将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入。激活函数可以是sigmoid函数。神经网络是将许多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
(2)深度神经网络
深度神经网络(deep neural network,DNN),也称多层神经网络,可以理解为具有多层隐含层的神经网络。按照不同层的位置对DNN进行划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。
虽然DNN看起来很复杂,但是就每一层的工作来说,其实并不复杂,简单来说就是如下线性关系表达式:y=α(Wx+b),其中,x是输入向量,y是输出向量,b是偏移向量,W是权重矩阵(也称系数),α()是激活函数。每一层仅仅是对输入向量x经过如此简单的操作得到输出向量y。由于DNN层数多,系数W和偏移向量b的数量也比较多。这些参数在DNN中的定义如下所述:以系数W为例:假设在一个三层的DNN中,第二层的第4个神经元到第三层的第2个神经元的线性系数定义为W 24 3,上标3代表系数W所在的层数,而下标对应的是输出的第三层索引2和输入的第二层索引4。综上,第L-1层的第k个神经元到第L层的第j个神经元的系数定义为W jk L。
需要注意的是,输入层是没有W参数的。在深度神经网络中,更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,“容量”也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量形成的权重矩阵)。
(3)卷积神经网络
卷积神经网络(convolutional neuron network,CNN)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器,该特征抽取器可以看作是滤波器。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权 重可以理解为提取特征的方式与位置无关。卷积核可以以随机大小的矩阵的形式化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
(4)损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断地调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
(5)像素值
图像的像素值可以是一个红绿蓝(RGB)颜色值,像素值可以是表示颜色的长整数。例如,像素值为256*Red+100*Green+76*Blue,其中,Blue代表蓝色分量,Green代表绿色分量,Red代表红色分量。各个颜色分量中,数值越小,亮度越低,数值越大,亮度越高。对于灰度图像来说,像素值可以是灰度值。
下面介绍本申请实施例提供的***架构。
参见附图1,本申请实施例提供了一种***架构100。数据采集设备160用于采集训练数据。示例性地,本申请实施例中的训练数据可以包括训练图像和训练图像对应的第二DSN图,或者包括训练图像和训练图像对应的第二深度图,或者包括训练图像和训练图像对应的第二DSN图和第二深度图。在采集到训练数据之后,数据采集设备160将这些训练数据存入数据库130,训练设备120基于数据库130中维护的训练数据训练得到目标模型/规则101。
下面对训练设备120如何基于训练数据得到目标模型/规则101进行描述。示例性地,训练设备120通过本申请下述实施例的第一神经网络模型的训练方法对训练数据进行处理,将输出的图像与目标图像(例如,第二DSN图)进行比对,直到训练设备120输出的图像与目标图像的差值小于一定阈值,从而完成目标模型/规则101的训练。本申请实施例的目标模型/规则用于对输入的待估计图像进行处理,输出第一DSN图,该第一DSN图用于表示待估计图像对应的目标物体的平面的朝向和平面与相机之间的距离。
该目标模型/规则101能够用于实现本申请实施例提供的单目深度估计方法,即,将待处理的图像,例如待估计图像,通过相关预处理后输入该目标模型/规则101,即可得到第一深度图。本申请实施例中的目标模型/规则101具体可以为神经网络。需要说明的是,在实际的应用中,数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行目标模型/规则101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备120训练得到的目标模型/规则101可以应用于不同的***或设备中,如应用于图1所示的执行设备110,所述执行设备110可以是终端,如手机终端,平板电脑,笔记本电脑,增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR),车载终端等,还可以是服务器或者云端等。在附图1中,执行设备110配置有(input/output,I/O)接口112,用于与外部设备进行数据交互,用户可以通过客户设备140向I/O接口112输入数据,所述输入数据在本申请实施例中可以包括:待估计图像。
预处理模块113用于根据I/O接口112接收到的输入数据(如待估计图像)进行预处理,在本申请实施例中,预处理模块113可以用于对输入数据进行图像滤波等处理。
在本申请实施例中,也可以没有预处理模块113和预处理模块114,而直接采用计算模块111对输入数据进行处理。
在执行设备110对输入数据进行预处理,或者在执行设备110的计算模块111执行计算等相关的处理过程中,执行设备110可以调用数据存储***150中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储***150中。
最后,I/O接口112将处理结果,如上述得到的第一深度图返回给客户设备140,从而提供给用户。
值得说明的是,训练设备120可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则101,该相应的目标模型/规则101即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
在附图1中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据需要获得用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备140也可以作为数据采集端,采集如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果,作为新的样本数据存入数据库130。
值得注意的是,附图1仅是本申请实施例提供的一种***架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在附图1中,数据存储***150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储***150置于执行设备110中。
再例如,在附图1中,训练设备120和执行设备110是两个设备,在其他情况下,训练设备120和执行设备110可以是一个设备。举例而言,训练设备120和执行设备110可以是一个服务器或服务器集群,客户设备140可以与服务器建立连接,服务器可以通过本申请实施例的单目深度估计方法,对待估计图像进行处理,得到第一深度图,将第一深度图提供给客户设备140。
再例如,在附图1中,执行设备110和客户设备140是两个设备,在其他情况下,执行设备110和客户设备140可以是一个设备。举例而言,执行设备110和客户设备140可以是一个智能手机,训练设备120可以是一个服务器或服务器集群,服务器可以通过本申 请实施例的第一神经网络模型的训练方法对训练数据进行处理,生成目标模型/规则,将目标模型/规则提供给该智能手机,使得该智能手机可以通过本申请实施例的单目深度估计方法,对待估计图像进行处理,得到第一深度图。
如图1所示,根据训练设备120训练得到目标模型/规则101,该目标模型/规则101在本申请实施例中可以是本申请中的第一神经网络模型,具体的,本申请中的第一神经网络模型可以包括CNN或深度卷积神经网络(deep convolutional neural networks,DCNN)等等。
由于CNN是一种非常常见的神经网络,下面结合图2重点对CNN的结构进行详细的介绍。如前文的基础概念介绍所述,卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的图像作出响应。
如图2所示,卷积神经网络(CNN)200可以包括输入层210,卷积层/池化层220(其中池化层为可选的),以及全连接层(fully connected layer)230。
卷积层/池化层220:
卷积层:
如图2所示卷积层/池化层220可以包括如示例221-226层,举例来说:在一种实现中,221层为卷积层,222层为池化层,223层为卷积层,224层为池化层,225为卷积层,226为池化层;在另一种实现方式中,221、222为卷积层,223为池化层,224、225为卷积层,226为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
下面将以卷积层221为例,介绍一层卷积层的内部工作原理。
卷积层221可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用多个尺寸(行×列)相同的权重矩阵,即多个同型矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度,这里的维度可以理解为由上面所述的“多个”来决定。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化等。该多个权重矩阵尺寸(行×列)相同,经过该多个尺寸相同的权重矩阵提取后的特征图的尺寸也相同,再将提取到的多个尺寸相同的特征图合并形成卷积运算的输出。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以用来从输入图像中提取信息,从而使得卷积神经网络200进 行正确的预测。
当卷积神经网络200有多个卷积层的时候,初始的卷积层(例如221)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络200深度的加深,越往后的卷积层(例如226)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,在如图2中220所示例的221-226各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值作为平均池化的结果。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像尺寸相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
全连接层230:
在经过卷积层/池化层220的处理后,卷积神经网络200还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层220只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或其他相关信息),卷积神经网络200需要利用全连接层230来生成一个或者一组所需要的类的数量的输出。因此,在全连接层230中可以包括多层隐含层(如图2所示的231、232至23n),该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等等。
在全连接层230中的多层隐含层之后,也就是整个卷积神经网络200的最后层为输出层240,该输出层240具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络200的前向传播(如图2由210至240方向的传播为前向传播)完成,反向传播(如图2由240至210方向的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络200的损失,及卷积神经网络200通过输出层输出的结果和理想结果之间的误差。
需要说明的是,如图2所示的卷积神经网络200仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在,例如,仅包括图2中所示的网络结构的一部分,比如,本申请实施例中所采用的卷积神经网络可以仅包括输入层210、卷积层/池化层220和输出层240。
下面介绍本申请实施例提供的一种芯片硬件结构。
图3为本申请实施例提供的一种芯片硬件结构,该芯片包括神经网络处理器30。该芯片可以被设置在如图1所示的执行设备110中,用以完成计算模块111的计算工作。该芯片也可以被设置在如图1所示的训练设备120中,用以完成训练设备120的训练工作并输出目标模型/规则101。如图2所示的卷积神经网络中各层的算法均可在如图3所示的芯片 中得以实现。本申请实施例中的单目深度估计方法以及第一神经网络模型的训练方法均可在如图3所示的芯片中得以实现。
神经网络处理器30可以是神经网络处理器(neural-network processing unit,NPU),张量处理器(tensor processing unit,TPU),或者图形处理器(graphics processing unit,GPU)等一切适合用于大规模异或运算处理的处理器。以NPU为例:神经网络处理器NPU30作为协处理器挂载到主中央处理器(central processing unit,CPU)(host CPU)上,由主CPU分配任务。NPU的核心部分为运算电路303,控制器304控制运算电路303提取存储器(权重存储器或输入存储器)中的数据并进行运算。其中,TPU是谷歌(***)为机器学习全定制的人工智能加速器专用集成电路。
在一些实现中,运算电路303内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路303是二维脉动阵列。运算电路303还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路303是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路303从权重存储器302中取矩阵B的权重数据,并缓存在运算电路303中的每一个PE上。运算电路303从输入存储器301中取矩阵A的输入数据,根据矩阵A的输入数据与矩阵B的权重数据进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)308中。
向量计算单元307可以对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元307可以用于神经网络中非卷积/非FC层的网络计算,如池化(pooling),批归一化(batch normalization),局部响应归一化(local response normalization)等。
在一些实现中,向量计算单元能307将经处理的输出的向量存储到统一缓存器306。例如,向量计算单元307可以将非线性函数应用到运算电路303的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元307生成归一化的值、合并值,或二者均有。在一些实现中,向量计算单元307将经处理的向量存储到统一存储器306。在一些实现中,经向量计算单元307处理过的向量能够用作运算电路303的激活输入,例如用于神经网络中后续层中的使用,如图2所示,若当前处理层是隐含层1(231),则经向量计算单元307处理过的向量还可以被用到隐含层2(232)中的计算。
统一存储器306用于存放输入数据以及输出数据。
权重数据直接通过存储单元访问控制器(direct memory access controller,DMAC)305,被存入到权重存储器302中。输入数据也通过DMAC被存入到统一存储器306中。
总线接口单元(bus interface unit,BIU)310,用于DMAC和取指存储器(instruction fetch buffer)309的交互;总线接口单元301还用于取指存储器309从外部存储器获取指令;总线接口单元301还用于存储单元访问控制器305从外部存储器获取输入矩阵A或者权重矩阵B的原数据。
DMAC主要用于将外部存储器DDR中的输入数据存入到统一存储器306中,或将权重数据存入到权重存储器302中,或将输入数据存入到输入存储器301中。
与控制器304连接的取指存储器(instruction fetch buffer)309,用于存储控制器304使用的指令。
控制器304,用于调用取指存储器309中缓存的指令,实现控制该运算加速器的工作过程。
一般地,统一存储器306,输入存储器301,权重存储器302以及取指存储器309均为片上(On-Chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
其中,图2所示的卷积神经网络中各层的运算可以由运算电路303或向量计算单元307执行。示例性地,本申请实施例中的第一神经网络模型的训练方法以及单目深度估计方法均可以由运算电路303或向量计算单元307执行。
如图4所示,本申请实施例提供了一种***架构400。该***架构包括本地设备401、本地设备402以及执行设备410和数据存储***450,其中,本地设备401和本地设备402通过通信网络与执行设备410连接。
执行设备410可以由一个或多个服务器实现。可选的,执行设备410可以与其它计算设备配合使用,例如:数据存储器、路由器、负载均衡器等设备。执行设备410可以布置在一个物理站点上,或者分布在多个物理站点上。执行设备410可以使用数据存储***450中的数据,或者调用数据存储***450中的程序代码来实现本申请实施例的第一神经网络模型的训练方法和/或单目深度估计方法。
用户可以操作各自的用户设备(例如本地设备401和本地设备402)与执行设备410进行交互。每个本地设备可以表示任何计算设备,例如个人计算机、计算机工作站、智能手机、平板电脑、智能摄像头、智能汽车或其他类型蜂窝电话、媒体消费设备、可穿戴设备、机顶盒、游戏机等。
每个用户的本地设备可以通过任何通信机制/通信标准的通信网络与执行设备410进行交互,通信网络可以是广域网、局域网、点对点连接等方式,或它们的任意组合。
在一种实现方式中,本地设备401、本地设备402从执行设备410获取到第一神经网络模型,将第一神经网络模型部署在本地设备401、本地设备402上,利用该第一神经网络模型进行单目深度估计。
在另一种实现中,执行设备410上可以直接部署第一神经网络模型,执行设备410通过从本地设备401和本地设备402获取待处理的图像,并采用第一神经网络模型对待处理的图像进行单目深度估计。
上述执行设备410也可以为云端设备,此时,执行设备410可以部署在云端;或者,上述执行设备410也可以为终端设备,此时,执行设备410可以部署在用户终端侧,本申请实施例对此并不限定。
下面以具体地实施例对本申请的技术方案以及本申请的技术方案如何解决上述技术问题进行详细说明。下面这几个具体的实施例可以相互结合,对于相同或相似的概念或过程可能在某些实施例中不再赘述。下面将结合附图,对本申请的实施例进行描述。
图5为本申请实施例的一种单目深度估计方法的流程图,如图5所示,本实施例的方法可以包括:
步骤101、获取待估计图像和待估计图像对应的第一参数。
待估计图像是通过相机对3D空间中的目标物体进行拍摄所得到的图像。该待估计图像可以是RGB相机、灰度相机、夜视相机、热敏相机、全景相机、事件相机或红外相机等任意相机拍摄(也称采集)所得到的图像。该待估计图像可以是一帧图像。第一参数为拍摄待估计图像的相机的相机标定参数。相机标定参数是与相机自身特性相关的参数。一种示例,相机标定参数可以包括相机的中心坐标和焦距。另一种示例,相机标定参数可以包括待估计图像的宽度像素值和高度像素值。
一种获取待估计图像的方式,可以是通过设备自身的上述任意相机采集获取该待估计图像。另一种获取待估计图像的方式,可以是接收其他设备发送的待估计图像,该待估计图像可以是其他设备的相机所采集的。
步骤102、将待估计图像输入至第一神经网络模型中,获取第一神经网络模型输出的第一DSN图。
其中,第一DSN图用于表示待估计图像对应的目标物体的平面的朝向和该平面与相机之间的距离。第一DSN图的具体解释说明可以参见上述部分用语解释,这里的输入图像即为该待估计图像。第一DSN图与3D空间中的目标物体的几何结构相关,而不受相机模型的影响,该第一DSN图可以更精确地表示三维世界。
该第一神经网络模型可以是任意神经网络模型,例如,深度神经网络(Deep Neural Network,DNN)、卷积神经网络(Convolutional Neural Networks,CNN)或其组合等。
第一神经网络模型为使用训练图像和训练图像对应的第二DSN图进行训练得到的。第二DSN图为训练神经网络模型过程中的真实数据(ground truth)。在一些实施例中,第二DSN图是根据训练图像对应的第二深度图、以及训练图像对应的相机标定参数确定的。第二深度图为训练神经网络模型过程中的真实数据(ground truth),第二DSN图可以通过第二深度图和训练图像对应的相机标定参数确定。
该第一神经网络模型通过训练图像和训练图像对应的第二DSN图训练,学习到由输入图像得到DSN图的映射特征,从而可以对上述待估计图像进行智能化感知,输出待估计图像对应的DSN图。
步骤103、根据待估计图像和第一参数,确定第一相机滤波映射图。
其中,该第一相机滤波映射图用于表示目标物体在空间中的3D点与2D平面的映射关系,该2D平面为相机的成像平面。第一相机滤波映射图的具体解释说明可以参见上述部分用语解释,这里的输入图像即为该待估计图像,即第一相机滤波映射图与待估计图像的像素点以及相机模型相关,而不受场景中目标物体的3D结构影响。
一种可实现方式,根据待估计图像的像素点的位置坐标和第一参数,确定第一相机滤波映射图。该第一相机滤波映射图包括像素点对应的相机滤波映射向量。换言之,第一相机滤波映射图中的像素点存储相机滤波映射向量。该相机滤波向量用于表示3D点与像素点的映射关系,该像素点为3D点投影至2D平面(相机成像平面)的点。
示例性的,像素点的位置坐标可以包括横坐标和纵坐标,像素点对应的相机滤波映射向量包括第一相机滤波映射分量和第二相机滤波映射分量,第一相机滤波映射分量是根据横坐标和第一参数确定的,第二相机滤波映射分量是根据纵坐标和第一参数,或者根据横坐标、纵坐标和第一参数确定的。
当拍摄待估计图像的相机的视场角小于180度时,即非全景相机拍摄的待估计图像,上述第一参数可以包括相机的中心坐标和焦距。第一相机滤波映射分量是根据横坐标和第一参数确定的,第二相机滤波映射分量是根据纵坐标和第一参数确定的。当待估计图像的相机的视场角大于180度时,即全景相机拍摄的待估计图像,上述第一参数可以包括待估计图像的宽度像素值和高度像素值,第一相机滤波映射分量是根据横坐标和第一参数确定的,第二相机滤波映射分量是根据横坐标、纵坐标和第一参数确定的。
步骤104、根据第一DSN图和第一相机滤波映射图,确定待估计图像对应的第一深度图。
第一深度图可以包括像素点对应的深度值,换言之,第一深度图中的像素点存储有深度值,该深度值用于表示在相机拍摄该待估计图像的时间,像素点对应的现实世界的3D点与相机视点之间的距离。该第一深度图是稠密的、可边缘感知的公制尺度深度图。
本申请实施例通过上述步骤可以得到两部分:第一DSN图和第一相机滤波映射图,通过这两部分最终可以得到该待估计图像对应的第一深度图。本申请实施例使用第一神经网络模型得到第一DSN图,该第一DSN图可以精确的表示3D空间中的目标物体的几何结构。根据待估计图像的相机标定参数确定第一相机滤波映射图,第一相机滤波映射图与待估计图像的像素点以及相机模型相关,而不受场景中目标物体的3D结构影响。再基于这两部分得到深度图。而直接使用神经网络模型输出深度图,会出现由于拍摄待估计图像的相机与神经网络模型训练过程中的训练图像的相机的相机标定参数不同,而产生估计误差的问题。与直接使用神经网络模型输出深度图不同,本申请实施例基于第一DSN图和第一相机滤波映射图得到深度图,该深度图可以精确反映目标物体的距离,提升单目深度估计的准确性。
该第一DSN图可以包括待估计图像的像素点对应的第一DSN向量,换言之,第一DSN图中的像素点存储DSN向量。根据像素点对应的第一DSN向量和像素点对应的相机滤波映射向量,可以确定像素点对应的深度值。即对第一DSN图和第一相机滤波映射图中相同像素位置的向量进行相应运算,可以得到相应像素位置的深度值。
示例性的,以像素点i对应的第一DSN向量和该像素点i对应的相机滤波映射向量为例,确定该像素点i对应的深度值的具体实现方式可以是,通过如下公式1和公式2确定。
ξ i=N i·F i       (公式1)
其中,ξ i为像素点i对应的场景中目标物体的3D点的逆深度值,N i为像素点i对应的第一DSN向量,F i为像素点i对应的相机滤波映射向量。N i可以从第一DSN图的像素点i获取,F i可以从第一相机滤波映射图的像素点i获取。
对逆深度值取倒数便可以得到像素点i对应的深度值。例如,根据下述公式2确定。
Z i=1/ξ i        (公式2)
其中,Z i为像素点i对应的深度值,ξ i为像素点i对应的逆深度值。
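作为示意(非本申请原文的一部分),下面给出一段基于Python/NumPy的示例代码,演示公式1和公式2所描述的逐像素计算过程;其中函数名depth_from_dsn、数组形状以及除零保护方式均为为便于说明而假设的:

```python
import numpy as np

def depth_from_dsn(dsn_map: np.ndarray, filter_map: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """dsn_map为(H, W, 3)的第一DSN图N, filter_map为(H, W, 3)的第一相机滤波映射图F=(F u, F v, 1)。"""
    inv_depth = np.sum(dsn_map * filter_map, axis=-1)   # 公式1: ξ_i = N_i · F_i
    inv_depth = np.clip(inv_depth, eps, None)           # 避免除零(此处假设逆深度为正)
    return 1.0 / inv_depth                              # 公式2: Z_i = 1 / ξ_i
```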
上述像素点i可以是待估计图像中的任意一个像素点。
需要说明的是,通过公式1和公式2确定像素点i对应的深度值,既可以适用于对透视相机模型的相机所采集的待估计图像进行单目深度估计,也可以适用于对非透视相机模型的相机所采集的待估计图像进行单目深度估计,该非透视相机模型的相机包括但不限于全景相机、360度球面相机、折反射相机、鱼眼相机等。
本实施例,通过获取待估计图像和相机标定参数,通过第一神经网络模型得到待估计图像对应的第一DSN图,根据相机标定参数确定第一相机滤波映射图,再基于第一DSN图和第一相机滤波映射图得到深度图。与直接使用神经网络模型输出深度图不同,本申请实施例基于第一DSN图和第一相机滤波映射图得到深度图,该深度图可以精确反映目标物体的距离,从而提升单目深度估计的准确性。与通过两个RGB帧进行立体匹配估计深度的方式不同,本申请实施例通过一帧待估计图像便可以进行深度估计,没有需要单目相机处于运动状态而现实世界的目标物体处于静止状态的场景限制,可以适用于现实生活中一般或普遍场景中的目标物体的深度估计。
图6为本申请实施例的另一种单目深度估计方法的流程图,如图6所示,本实施例的方法可以包括:
步骤201、获取待估计图像和待估计图像对应的第一参数。
待估计图像对应的第一参数可以包括相机的中心坐标和焦距,或者,可以包括待估计图像的宽度像素值和高度像素值。
在一些实施例中,待估计图像的宽度像素值和高度像素值可以基于相机的中心坐标计算得到。
步骤202、将待估计图像输入至第一神经网络模型中,获取第一神经网络模型输出的第一DSN图。
其中,步骤201至步骤202的解释说明可以参见图5所示实施例的步骤101至步骤102的具体解释说明,此处不再赘述。
步骤203、判断拍摄待估计图像的相机的视场角是否小于180,若是,则执行步骤204,若否,则执行步骤205。
步骤204、根据待估计图像的像素点的横坐标和纵坐标、以及拍摄待估计图像的相机的中心坐标和焦距,确定第一相机滤波映射图。
如上所述,该第一相机滤波映射图包括像素点对应的相机滤波映射向量。本实施例以像素点i为例,本步骤确定该像素点i对应的相机滤波映射向量的具体实现方式可以是,通过如下公式3至公式5确定。
F i=(F u,F v,1)        (公式3)
F u=(u-c x)/f x        (公式4)
F v=(v-c y)/f y        (公式5)
其中,F i为像素点i对应的相机滤波映射向量,F u为F i的第一相机滤波映射分量,F v为F i的第二相机滤波映射分量。i=(u,v),u为像素点i的横坐标,v为像素点i的纵坐标。(c x,c y)为拍摄待估计图像的相机的中心坐标,(f x,f y)为拍摄待估计图像的相机的焦距。
上述像素点i可以是待估计图像中的任意一个像素点,从而可以得到第一相机滤波映射图。
对透视相机模型的相机所采集的待估计图像进行单目深度估计过程中,可以采用本步骤中公式3至公式5确定第一相机滤波映射图。
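作为示意(非本申请原文的一部分),下面给出与公式3至公式5对应的一段Python/NumPy示例代码;其中函数名camera_filter_map及参数顺序均为为便于说明而假设的。由于该映射图只与相机标定参数和像素位置有关,对同一相机只需计算一次即可复用:

```python
import numpy as np

def camera_filter_map(H: int, W: int, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """视场角小于180度的相机的第一相机滤波映射图, 形状为(H, W, 3)。"""
    u, v = np.meshgrid(np.arange(W, dtype=np.float32),
                       np.arange(H, dtype=np.float32))       # 像素位置坐标 i = (u, v)
    F_u = (u - cx) / fx                                       # 公式4
    F_v = (v - cy) / fy                                       # 公式5
    return np.stack([F_u, F_v, np.ones_like(F_u)], axis=-1)  # 公式3: F_i = (F_u, F_v, 1)
```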
步骤205、根据待估计图像的像素点的横坐标和纵坐标、以及待估计图像的宽度像素值和高度像素值,确定第一相机滤波映射图。
本实施例以像素点i为例,本步骤确定该像素点i对应的相机滤波映射向量的具体实现方式可以是,通过如公式3、公式6和7确定。
Figure PCTCN2021075318-appb-000040
Figure PCTCN2021075318-appb-000041
其中,W为待估计图像的宽度像素值,H为待估计图像的高度像素值。
需要说明的是,W和H也可以通过中心坐标(c x,c y)得到,例如,W=2c x,H=2c y
上述像素点i可以是待估计图像中的任意一个像素点,从而可以得到第一相机滤波映射图。
对全景相机所采集的待估计图像进行单目深度估计过程中,可以采用本步骤中公式3、公式6和公式7确定第一相机滤波映射图。
步骤206、根据第一DSN图和第一相机滤波映射图,确定待估计图像对应的第一深度图。
其中,步骤206的解释说明可以参见图5所示实施例的步骤104的具体解释说明,此处不再赘述。
本实施例,通过获取待估计图像和相机标定参数,通过第一神经网络模型得到待估计图像对应的第一DSN图,根据相机标定参数确定第一相机滤波映射图,再基于第一DSN图和第一相机滤波映射图得到深度图。与直接使用神经网络模型输出深度图不同,本申请实施例基于第一DSN图和第一相机滤波映射图得到深度图,该深度图可以精确反映目标物体的距离,从而提升单目深度估计的准确性。与通过两个RGB帧进行立体匹配估计深度的方式不同,本申请实施例通过一帧待估计图像便可以进行深度估计,没有需要单目相机处于运动状态而现实世界的目标物体处于静止状态的场景限制,可以适用于现实生活中一般或普遍场景中的目标物体的深度估计。
本申请实施例的单目深度估计方法,可以适用于对不同相机模型的相机所采集的待估计图像进行深度估计,从而实现对不同相机的待估计图像的泛化感知。
示例性的,以图7所示的场景中的3D点和透视(针孔)相机模型之间的几何对应关系,对深度值可以通过像素点的DSN向量和该像素点的相机滤波映射向量确定进行示例性解释说明。
参照图7所示,场景中目标物体的一个3D点被位置坐标位于(0,0,0)的相机捕获,并存储在位于图像平面中像素点i的2D像素里。P表示该3D点,P=(X,Y,Z),(X,Y,Z)表示该3D点在空间中的位置坐标(也称三维空间坐标),i=(u,v),(u,v)表示像素点i的位置坐标,像素点i的位置坐标是从图像平面的左上角测量。场景中的3D点在空间中的位置坐标与像素点i的位置坐标之间的几何对应关系由以下公式8给出:
u=f x·X/Z+c x,v=f y·Y/Z+c y        (公式8)
其中,相机标定参数可以包括透视(针孔)相机模型的相机的中心坐标和焦距,(c x,c y)为中心坐标,(f x,f y)为焦距。
场景中目标物体的3D点除了空间中的位置坐标P=(X,Y,Z)的表示方法以外,该3D点也可建模表示为单位表面法线,(n x,n y,n z)表示场景中目标物体的3D点的单位表面法线。单位表面法线中的表面指的是3D点与其相邻的共面3D点所构成的平面(可拓展为无穷大平面)。而这种单位向量的表示方法是与所使用的相机模型不相关的。于是,可以定义相机到3D点的扩展平面的距离为h。那么距离h在几何上可以用单位表面法线和三维空间坐标的标量积来计算得出:
h=(n x,n y,n z)·(X,Y,Z)        (公式9)
根据公式8,可以使用透视相机模型的几何关系,替换公式9中的3D点空间坐标,并得到:
h=Z(n x(u-c x)/f x+n y(v-c y)/f y+n z)        (公式10)
基于公式10,本申请实施例提出一个全新的3D结构表示方法,将场景中3D点的逆深度(即深度的倒数)分解为DSN向量和相机滤波映射向量,N表示3D点的DSN向量,F表示像素点i的相机滤波映射向量。参见下述公式11至公式15。
ξ=1/Z=(n x(u-c x)/f x+n y(v-c y)/f y+n z)/h        (公式11)
ξ=(n x/h)·(u-c x)/f x+(n y/h)·(v-c y)/f y+(n z/h)        (公式12)
其中,ξ表示3D点的逆深度。
因此,逆深度可以根据本申请实施例提供的3D点结构表示法,即逆深度值可以通过DSN向量和相机滤波映射向量确定,表示为:
ξ=N·F=N xF u+N yF v+N z        (公式13)
其中,3D点的DSN向量为:
N=(N x,N y,N z)=(n x/h,n y/h,n z/h)        (公式14)
其对应的像素点i的相机滤波映射向量为:
F=(F u,F v,1)=((u-c x)/f x,(v-c y)/f y,1)        (公式15)
其中,F u即为上述第一相机滤波映射分量,F v即为上述第二相机滤波映射分量。
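作为对上述推导(公式8至公式15)的一个数值验证示例(非本申请原文的一部分),下面的Python/NumPy代码取一个假设的平面上的3D点,检验按公式14、公式15分解后确有ξ=N·F=1/Z;其中相机标定参数与平面参数均为假设值:

```python
import numpy as np

fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0        # 假设的相机标定参数
n = np.array([0.0, 0.6, 0.8])                      # 某平面的单位表面法线(假设值)
P = np.array([0.4, 1.0, 3.0])                      # 该平面上的一个3D点, Z = 3.0
h = float(n @ P)                                   # 公式9: 相机到该平面的距离

u = fx * P[0] / P[2] + cx                          # 公式8: 透视投影得到像素坐标
v = fy * P[1] / P[2] + cy
N = n / h                                          # 公式14: DSN向量
F = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])  # 公式15: 相机滤波映射向量
assert np.isclose(N @ F, 1.0 / P[2])               # 公式13: ξ = N · F = 1/Z
```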
需要说明的是,在单目深度估计方法的诸多实施例中,所使用的相机模型可以不限于上述透视相机模型的相机,也可以是非透视相机模型的相机,例如,全景相机、360度球面相机、折反射相机、或鱼眼相机等。在这些涉及非透视相机模型的相机的实施例中,本申请实施例的3D结构表示法可对场景的3D结构和非透视相机模型之间做几何对应的适配。在使用非透视相机模型的相机的情况下,一般场景的3D结构(根据公式9做几何建模)和DSN向量(根据公式13做几何建模)仍然有效成立。但是相机滤波映射向量(根据公式15计算)需要根据不同类型的相机进行更新。例如,在一种实施例中,相机可以是 360度全景相机。在这种情况下,场景中3D点P与像素点i之间的几何对应关系可以通过以下方式给出:
Figure PCTCN2021075318-appb-000048
其中,
r=√(X²+Y²+Z²)
r为从位于(0,0,0)点的相机到场景3D点P的球面投影半径,W、H分别为图像的宽度像素值和高度像素值。通过结合公式9和公式16,场景中3D点的逆深度可分解为DSN向量和相机滤波映射向量,N表示DSN向量,F表示相机滤波映射向量。
Figure PCTCN2021075318-appb-000050
其中,ξ表示3D点的逆深度。
因此,逆深度可以使用公式13提出的场景中3D点的3D结构表示法来计算。其中3D点对应的DSN向量可由公式14计算得到。而3D点在2D平面i=(u,v)对应的相机滤波映射向量为:
Figure PCTCN2021075318-appb-000051
本申请实施例为图5或图6所示实施例的单目深度估计方法提供了理论依据。使得本申请实施例的单目深度估计方法,可以通过第一神经网络模型输出待估计图像的各个像素点对应的第一DSN向量,根据公式3得到各个像素点对应的第一相机滤波映射向量,之后根据公式1和公式2得到各个像素点对应的深度值,从而实现单目深度估计,提升深度估计的准确性。
图8为本申请实施例的一种单目深度估计处理过程的示意图,如图8所示,本实施例以第一神经网络模型为卷积神经网络为例进行示意性举例说明,本实施例的单目深度估计方法可以包括:将图像数据L301输入至卷积神经网络L302,例如,可以将上述待估计图像作为图像数据L301。卷积神经网络L302输出DSN图L303。本实施例以DSN图L303中的一个像素点所存储的数据为3通道的数据为例,即DSN图L303包括的DSN向量包括三个分量,图像数据L301的像素点i,i=(u,v),N i为像素点i对应的DSN向量,N i=(N xi,N yi,N zi)为例,进行举例说明。在DSN图L303中,位于位置坐标i=(u,v)的像素点上,存储有N i=(N xi,N yi,N zi)。例如,DSN图中的一个像素点的一个通道用于存储该DSN向量的一个分量。在一些实施例中,卷积神经网络可以是基于ResNet-18的编码器-解码器架构的。
该卷积神经网络可以通过下述图9所示的训练方法训练得到。
根据公式4和公式5,或公式6和公式7,可以通过图像平面中的位置坐标i=(u,v)和相机标定参数来计算得到相机滤波映射图L304。在相机滤波映射图L304中,位于位置坐标i=(u,v)的像素点上,存储像素点i对应的相机滤波映射向量。本实施例以F i为像素点 i对应的滤波映射向量,F i=(F ui,F vi)为例,相机滤波映射图L304中的位置坐标i=(u,v)的像素点存储有(F ui,F vi)。对于相机标定参数固定的相机,则只需计算一次相机滤波映射图L304。
根据公式1,可使用相机滤波映射图L304来对DSN图L303做滤波处理,例如,通过滤波器L305,计算得到逆深度图L306,进而基于逆深度图L306得到深度图。本实施例以ξ i为像素点i对应的逆深度值为例,逆深度图L306中的位置坐标i=(u,v)的像素点存储有ξ i
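作为对图8处理流程的一个整体性示意(非本申请原文的一部分),下面的Python示例复用了前文假设的camera_filter_map函数,并以任意可调用对象model代替卷积神经网络L302;各函数名、变量名均为假设:

```python
import numpy as np

def estimate_depth(image: np.ndarray, model, fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """对应图8中 L301 -> L306 的处理链: 图像 -> DSN图 -> 滤波得到逆深度 -> 深度图。"""
    dsn_map = model(image)                                # L303: 第一DSN图, 形状(H, W, 3)
    H, W = dsn_map.shape[:2]
    filter_map = camera_filter_map(H, W, fx, fy, cx, cy)  # L304: 对同一相机只需计算一次
    inv_depth = np.sum(dsn_map * filter_map, axis=-1)     # L305: 滤波处理, 即公式1
    return 1.0 / np.clip(inv_depth, 1e-6, None)           # 由逆深度图L306得到深度图
```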
图9为本申请实施例的一种第一神经网络模型的训练方法的流程图,该第一神经网络模型也可称为单目深度估计模型,如图9所示,本实施例的方法可以包括:
步骤301、获取训练图像和训练图像对应的第二DSN图。
该训练图像可以是RGB相机、灰度相机、夜视相机、热敏相机、全景相机、事件相机或红外相机等任意相机拍摄所得到的图像。
例如,训练图像可以来源于图1所示的数据库。数据库中可以存储有多个训练图像和训练图像对应的第二DSN图,或者存储有多个训练图像和训练图像对应的第二深度图,或者存储有多个训练图像和训练图像对应的第二DSN图以及第二深度图。
数据库中的训练数据可以通过如下方式获取。一种可实现方式,多个训练图像是由多个经标定和同步的相机拍摄场景得到的图像。使用3D视觉技术对多个训练图像进行处理,可以得到多个训练图像对应的第二深度图。该3D视觉技术可以是运动结构(structure from motion,sfm)恢复技术、多视角三维重建技术或视图合成技术等。
另一种可实现方式,将不同类型的相机安装在机架中并标定,可以得到训练图像和训练图像对应的第二DSN图。例如,不同类型的相机包括深度相机和热敏相机,将深度相机和热敏相机安装在机架中并标定,深度相机拍摄所得到的深度图可直接与热敏相机所得到的图像对齐,深度相机拍摄所得到的深度图可以作为训练图像对应的第二深度图,热敏相机所得到的图像可以作为训练图像。
又一种可实现方式,获取原始图像,原始图像可以是上述任意一种类型的相机拍摄得到的图像,对原始图像进行数据优化或数据增强,得到训练图像。该数据优化包括空洞填充优化、锐化遮挡边缘优化或时间一致性优化中至少一项。一种举例,通过图像处理滤波器来优化原始图像,以得到训练图像。该图像处理滤波器可以是双边滤波、或引导式图像滤波等。另一种举例,通过使用视频的帧与帧之间的时间信息或一致性来优化原始图像,以得到训练图像。例如,通过光流法添加时间约束。再一种举例,通过几何信息或语义分割信息来完成,这种方式可以由一些卷积神经网络来计算完成,比如,用于表面法向量分割或更表层类别的分割卷积神经网络。例如用于人物、植被、天空、汽车等不同类别分割的卷积神经网络。该数据增强用于改变原始图像所拍摄场景的环境条件以获取不同环境条件下的训练图像。该环境条件可以包括光照条件、天气条件、能见度条件等。
再一种可实现方式,训练图像对应的第二深度图可以为教师单目深度估计网络对输入的训练图像进行处理后输出的深度图。这里的教师单目深度估计网络可以是经过训练的MDE卷积神经网络。
可以理解的,上述训练数据的获取方式可以组合使用,以得到本申请实施例的数据库 中的训练数据。
当训练数据包括训练图像和训练图像对应的第二深度图时,可以通过训练图像、训练图像对应的相机标定参数以及第二深度图确定第二DSN图。例如,可以通过如下两种方式计算得到训练图像对应的第二DSN图。
第二DSN图包括训练图像的像素点对应的场景中的3D点所在平面的DSN向量。
这里以第二深度图中的像素点i,所对应的DSN向量N i=(N xi,N yi,N zi)为例,对基于深度图计算得到对应的DSN图的实现方式进行解释说明。
一种可实现方式,计算像素点i对应的单位表面法线。(n xi,n yi,n zi)为像素点i对应的单位表面法线。在一些实施例中,可以使用像素点i的相邻像素的向量叉积来计算。
然后,根据像素点i对应的单位表面法线、像素点i对应的深度值、像素点i的位置坐标以及相机标定参数,计算像素点i对应的3D点所在平面到相机的距离。其中,相机标定参数可以是拍摄训练图像的相机的相机标定参数。
例如,可以通过如下公式19计算像素点i对应的3D点所在平面到相机的距离。
h i=Z(n xi(u-c x)/f x+n yi(v-c y)/f y+n zi)        (公式19)
其中,h i为像素点i对应的3D点所在平面到相机的距离,相机标定参数包括相机的中心坐标和焦距,(c x,c y)为相机的中心坐标,(f x,f y)为相机的焦距,(u,v)为像素点i的位置坐标,Z为像素点i对应的深度值。
之后,根据像素点i对应的3D点所在平面到相机的距离和像素点i对应的单位表面法线,计算像素点i对应的3D点所在平面的DSN向量。
例如,可以通过如下公式20计算像素点i对应的3D点所在平面的DSN向量。
N i=(n xi/h i,n yi/h i,n zi/h i)        (公式20)
其中,N i为像素点i对应的3D点所在平面的DSN向量。
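作为示意(非本申请原文的一部分),下面给出一段与公式19、公式20对应的Python/NumPy示例代码;其中单位表面法线normals假设已按上文所述由相邻像素的向量叉积等方式求得,函数名与数值保护方式均为假设:

```python
import numpy as np

def dsn_from_depth(depth: np.ndarray, normals: np.ndarray,
                   fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """由第二深度图depth(H, W)和逐像素单位表面法线normals(H, W, 3)计算第二DSN图。"""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=np.float32), np.arange(H, dtype=np.float32))
    # 公式19: 相机到像素点i对应的3D点所在平面的距离h_i
    h = depth * (normals[..., 0] * (u - cx) / fx +
                 normals[..., 1] * (v - cy) / fy +
                 normals[..., 2])
    # 公式20: 将单位表面法线按1/h_i缩放得到DSN向量(此处假设h_i为正并做数值保护)
    return normals / np.clip(h, 1e-6, None)[..., None]
```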
另一种可实现方式,DSN图可由逆深度图计算得到。计算方式可以为,通过计算逆深度图中像素点i处的图像梯度,得到像素点i对应的DSN向量,即像素点i对应的3D点所在平面的DSN向量。
例如,可以通过如下公式21计算像素点i对应的3D点所在平面的DSN向量。
N xi=f x·∂ξ i/∂u,N yi=f y·∂ξ i/∂v,N zi=ξ i-(u-c x)·∂ξ i/∂u-(v-c y)·∂ξ i/∂v        (公式21)
其中,N i为像素点i对应的3D点所在平面的DSN向量,N i=(N xi,N yi,N zi),ξ i为像素点i对应的场景中的3D点的逆深度值,(c x,c y)为相机的中心坐标,(f x,f y)为相机的焦距, (u,v)为像素点i的位置坐标。
步骤302、使用训练图像和第二DSN图对初始神经网络模型进行训练,得到第一神经网络模型。
在本步骤中,可以将训练图像输入初始神经网络模型,得到第三DSN图。根据训练图像对应的相机标定参数和训练图像,确定第二相机滤波映射图。根据第二相机滤波映射图和第三DSN图,得到第三深度图。根据第三DSN图和训练图像对应的第二DSN图之间的差异,或者,第三深度图和第二深度图之间的差异,或者,第三深度图和第二深度图的匹配度中至少一项,调整初始神经网络模型的参数,重复上述过程,直至训练结束,得到上述第一神经网络模型。
换言之,可以根据损失函数调整神经网络模型的参数,直至得到满足训练目标的第一神经网络模型。损失函数可以包括第一损失函数、第二损失函数或第三损失函数中至少一项。第一损失函数用于表示第二DSN图和第三DSN图之间的误差,第二损失函数用于表示第二深度图和第三深度图之间的误差。第三损失函数用于表示第二深度图和第三深度图的匹配程度。
结合图10所示的本申请实施例的一种训练过程的示意图,对本申请实施例进行解释说明。如图10所示,从数据库L400中获取训练图像L501。将训练图像L501输入至卷积神经网络L502中,卷积神经网络L502输出第三DSN图L503。第三DSN图L503的位置坐标位于i=(u,v)的像素点存储有估计的DSN向量,N i表示该估计的DSN向量,N i=(N xi,N yi,N zi)。根据训练图像L501对应的相机标定参数得到第二相机滤波映射图L504。第二相机滤波映射图L504的位置坐标位于i=(u,v)的像素点存储有相机滤波映射向量,Fi表示该相机滤波映射向量,F i=(F ui,F vi)。将第三DSN图L503和第二相机滤波映射图L504提供至滤波器L505,滤波器L505输出逆深度图L506。逆深度图L506的位置坐标位于i=(u,v)的像素点存储有估计的逆深度值,ξ i表示该估计的逆深度值。在一些实施例中,可以基于逆深度图L506得到上述第三深度图。之后,将第三DSN图L503、逆深度图L506、真实DSN图L508以及真实逆深度图L509提供给损失函数L507,以确定损失函数值,根据损失函数值调整卷积神经网络L502。真实DSN图L508即为训练图像对应的第二DSN图。真实逆深度图L509可以由训练图像对应的第二深度图得到。真实DSN图L508的位置坐标位于i=(u,v)的像素点存储有真实DSN向量,
N i gt表示该真实DSN向量,N i gt=(N xi gt,N yi gt,N zi gt)。真实逆深度图L509的位置坐标位于i=(u,v)的像素点存储有真实逆深度值,ξ i gt表示该真实逆深度值。
一种可实现方式,损失函数L(L507)的定义如下:
L=λ DEP·L DEP+λ DSN·L DSN+λ INP·L INP        (公式22)
其中的λ DEP,λ DSN和λ INP是超参数,分别用于加权深度损失函数L DEP、DSN损失函数L DSN以及修补损失函数L INP,λ DEP,λ DSN和λ INP分别大于或等于0。例如,给一个超参数赋零会取消其相应损失函数对网络训练的影响。
L DEP是一个计算真实逆深度值ξ i gt(L509)和估计的逆深度值ξ i(L506)之间误差的损失函数。该损失函数的计算会在全部的有效像素集合Ω上进行。该有效像素可以是含有有效的真实逆深度值的像素。对于每一个位置坐标位于i=(u,v)的有效像素,计算公式如下:
L DEP=(1/|Ω|)·∑ i∈Ω‖ξ i gt-ξ i‖        (公式23)
其中‖·‖代表范式,其可以是L1范式,L2范式等。根据实施例的具体情况,深度损失函数中的逆深度值也可视情况替换为深度值。
L DSN是DSN法线损失函数,用于计算真实DSN向量N i gt(L508)和估计的DSN向量N i(L503)之间误差的损失函数。该损失函数的计算会在全部的有效像素集合Ω上进行。该有效像素可以是含有有效的真实DSN向量的像素。对于每一个位置坐标位于i=(u,v)的有效像素,计算公式如下:
L DSN=(1/|Ω|)·∑ i∈Ω‖N i gt-N i‖        (公式24)
L INP是修补损失函数,会在估计的DSN图L503中的全部像素集合P上被计算,方式如下:
Figure PCTCN2021075318-appb-000075        (公式25)
其中,Δ是拉普拉斯算子,而I是与真实逆深度图L509相匹配的图像数据,|P|表示相匹配的像素点的个数。
通过上述公式22至公式25可以计算损失函数,以调整网络参数,得到第一神经网络模型。
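作为示意(非本申请原文的一部分),下面给出一段与公式22至公式24思路对应的Python/NumPy示例代码;其中范式的具体选择(此处假设深度项取L1范式、DSN项取L2范式)、修补损失函数(公式25)因原文仅以图像给出而在此省略(相当于λ INP取0),各函数名、变量名均为假设:

```python
import numpy as np

def training_loss(xi_pred, xi_gt, dsn_pred, dsn_gt, valid_mask,
                  lam_dep: float = 1.0, lam_dsn: float = 1.0) -> float:
    """xi_*为逆深度图(H, W), dsn_*为DSN图(H, W, 3), valid_mask为有效像素掩码(H, W)。"""
    m = valid_mask.astype(bool)
    loss_dep = float(np.mean(np.abs(xi_gt[m] - xi_pred[m])))                     # 公式23(假设L1范式)
    loss_dsn = float(np.mean(np.linalg.norm(dsn_gt[m] - dsn_pred[m], axis=-1)))  # 公式24(假设L2范式)
    return lam_dep * loss_dep + lam_dsn * loss_dsn                               # 公式22(λ_INP = 0)
```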
本实施例,通过将训练图像输入至卷积神经网络,得到估计的DSN图。根据估计的DSN图和训练图像对应的相机滤波映射图,得到估计的逆深度图。基于估计的DSN图、真实的DSN图、估计的逆深度图和真实的逆深度图,计算损失函数值。根据损失函数值调整卷积神经网络的参数,重复上述步骤,以训练得到第一神经网络模型。第一神经网络模型可以学习到输入图像和DSN图的映射关系,从而使得第一神经网络模型可以基于输入图像输出输入图像对应的DSN图,使用该DSN图和输入图像对应的相机滤波映射图,可以得到输入图像对应的深度图,以实现单目深度估计。由于第一神经网络模型学习的是输入图像和DSN图的映射关系,该DSN图与输入图像对应的目标物体的3D结构有关,而不受相机模型的影响,从而使得在模型应用阶段,即使待估计图像对应的相机模型与训练图像对应的相机模型不同,该第一神经网络模型输出的待估计图像对应的DSN图,也可以较为准确的表示待估计图像对应的目标物体的3D结构。之后,基于待估计图像对应的DSN图,和与相机模型相关的相机滤波映射图,得到深度图。该深度图可以较为准确的表示目标物体在空间中的距离,从而提升单目深度估计的准确性。
由于训练图像可以是RGB相机、灰度相机、夜视相机、热敏相机、全景相机、事件相机或红外相机等任意相机拍摄所得到的图像,所以使得使用训练得到的第一神经网络模型可以感知任意相机拍摄的图像,即对相机可泛化感知。并基于第一神经网络模型的输出估计出深度图。
通过上述步骤训练得到的第一神经网络模型可以配置到电子设备中,以使得电子设 备可以实现较为准确的单目深度估计。该第一神经网络模型也可以配置到服务器中,以使得服务器可以对电子设备发送的待估计图像进行处理,返回DSN图,之后由电子设备基于DSN图得到深度图,实现较为准确的单目深度估计。
第一神经网络模型可以是软件功能模块,也可以是固化的硬件电路,例如,该硬件电路可以是运算电路等,本申请实施例对第一神经网络模型的具体形态不作具体限定。
本申请实施例还可以采用如下方法训练神经网络模型,以得到第一神经网络模型。
例如,图11为本申请实施例的一种第一神经网络模型的训练过程的示意图。如图11所示,单目深度估计模块L515(即包括图8所示实施例的单目深度估计处理过程所涉及的部分)通过学习单目深度估计教师网络L511来估计深度。教师网络用于通过逆渲染来合成配对输入输出真实(Ground Truth)数据。使用该输入输出真实(Ground Truth)数据对单目深度估计模块L515进行训练。即对单目深度估计模块L515中的神经网络模型的训练,以得到第一神经网络模型。
示例性的,一个完成训练的单目深度估计教师网络L511可被用来作为MDE处理器。教师网络L511可以估计生成输入图像对应的深度图。教师网络的输入图像和估计的深度图可作为真实(Ground Truth)数据用于扩充数据库L400。
示例性的,教师网络L511可以通过输入一张噪声图像L510,估计与噪声对应的合成深度图L512。再通过使用教师网络L511的编码器和噪声图像L510,逆向渲染合成可被用于作为训练输入的合成图像数据L513。
示例性的,通过教师网络合成生成的逆渲染合成图像数据L513和合成深度图L512可作为一组配对的真实(Ground Truth)数据L514加入到数据库L400。
示例性的,单目深度估计模块L515被设定为学生网络,并使用教师网络生成的逆渲染合成图像数据L513作为输入,合成深度图L512作为真实(Ground Truth)数据,来进行训练。单目深度估计模块L515对逆渲染合成图像数据L513进行处理,输出估计的深度图L516。之后,将估计的深度图L516和合成深度图L512提供至训练损失函数L517,该训练损失函数L517可以采用上述图9所示实施例中的损失函数,当然也可以采用其他形式的损失函数,本申请实施例不作具体限定。根据训练损失函数L517调整单目深度估计模块L515中的神经网络模型,以得到第一神经网络模型。
本实施例,可以让单目深度估计模块L515,通过学习其他MDE处理器(例如,现成的MDE软件、已完成训练的MDE网络等)进行训练,而不直接访问这些MDE处理器/网络被训练时所使用的原始数据。这种方式是通过一种前沿的知识蒸馏算法来实现的。这种方式具有较高的训练效率。
例如,图12为本申请实施例的一种第一神经网络模型的训练过程的示意图。如图12所示,单目深度估计模块L523(与上文中L515相同的模块)和MDE处理器L521同时使用噪声作为输入,通过无数据蒸馏的方法,使单目深度估计模块L523从MDE处理器L521处学习深度估计的能力。
示例性的,一个完成训练的单目深度估计教师网络L521可被用来作为MDE处理器,用于扩充数据库L400。教师网络L521可以估计输入图像对应的深度图。单目深度估计模 组L523被设定为学生网络。一张噪声图像可以被同时输入到单目深度估计模块L523和MDE处理器L521。单目深度估计模块L523对噪声图像进行处理,输出学生网络估计出的深度图L524。单目深度估计教师网络L521对噪声图像进行处理,输出教师网络估计出的深度图L522。将学生网络估计出的深度图L524和教师网络估计出的深度图L522提供至训练损失函数L525。通过训练损失函数L525,调整单目深度估计模块L523,以降低学生网络估计出的深度图L524和教师网络估计出的深度图L522之间的损失误差,实现对单目深度估计模块L523(即学生网络)的训练。
本实施例,可以让单目深度估计模块L523,通过学习其他MDE处理器(例如,现成的MDE软件、已完成训练的MDE网络等)进行训练,而不直接访问这些MDE处理器/网络被训练时所使用的原始数据。这种方式是通过一种前沿的知识蒸馏算法来实现的。这种方式具有较高的训练效率。
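作为对图12所示无数据蒸馏训练方式的一个简化示意(非本申请原文的一部分),下面给出一段基于PyTorch的示例代码;其中student、teacher被假设为可直接输出深度图的可调用网络(student内部可以是上文的DSN网络加相机滤波映射),损失采用L1距离、噪声图像的通道数与尺寸、优化器设置等均为假设:

```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, optimizer, batch_size=8, height=192, width=256):
    """图12的单步训练: 学生网络与教师网络以同一张噪声图像作为输入。"""
    noise = torch.rand(batch_size, 3, height, width)   # 作为共同输入的噪声图像
    with torch.no_grad():
        depth_teacher = teacher(noise)                 # 教师网络估计出的深度图 L522
    depth_student = student(noise)                     # 学生网络估计出的深度图 L524
    loss = F.l1_loss(depth_student, depth_teacher)     # 训练损失函数 L525(此处假设为L1距离)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```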
本申请实施例还提供一种单目深度估计装置,用于执行以上各方法实施例中的方法步骤。如图13所示,该单目深度估计装置可以包括:获取模块91、DSN模块92、相机滤波映射模块93和深度估计模块94。
获取模块91,用于获取待估计图像和待估计图像对应的第一参数,所述第一参数为拍摄所述待估计图像的相机的相机标定参数。
距离缩放法线DSN模块92,用于将所述待估计图像输入至第一神经网络模型中,获取第一神经网络模型输出的第一距离缩放法线DSN图,所述第一DSN图用于表示所述待估计图像对应的目标物体的平面的朝向和所述平面与所述相机之间的距离。
相机滤波映射模块93,用于根据所述待估计图像和所述第一参数,确定第一相机滤波映射图,所述第一相机滤波映射图用于表示所述目标物体在空间中的3D点与2D平面的映射关系,所述2D平面为所述相机的成像平面。
深度估计模块94,用于根据所述第一DSN图和所述第一相机滤波映射图,确定所述待估计图像对应的第一深度图。
在一些实施例中,所述第一神经网络模型为使用训练图像和所述训练图像对应的第二DSN图进行训练得到的,所述第二DSN图是根据所述训练图像对应的第二深度图、以及所述训练图像对应的相机标定参数确定的。
在一些实施例中,所述训练图像作为初始神经网络模型的输入,损失函数包括第一损失函数、第二损失函数或第三损失函数中至少一项,所述损失函数用于调整所述初始神经网络模型的参数,以训练得到所述第一神经网络模型。
所述第一损失函数用于表示所述第二DSN图和第三DSN图之间的误差,所述第三DSN图为所述初始神经网络模型输出的所述训练图像对应的DSN图,所述第二损失函数用于表示所述第二深度图和第三深度图之间的误差,所述第三深度图是根据所述第三DSN图和第二相机滤波映射图确定的,所述第二相机滤波映射图为根据所述训练图像对应的相机标定参数和所述训练图像确定的,所述第三损失函数用于表示所述第二深度图和第三深度图的匹配程度。
在一些实施例中,相机滤波映射模块93用于:根据所述待估计图像的像素点的位置坐标和所述第一参数,确定所述第一相机滤波映射图,所述第一相机滤波映射图包括所述像 素点对应的相机滤波映射向量,所述相机滤波向量用于表示所述3D点与所述像素点的映射关系,所述像素点为所述3D点投影至所述2D平面的点。
在一些实施例中,所述像素点的位置坐标包括横坐标和纵坐标,所述像素点对应的相机滤波映射向量包括第一相机滤波映射分量和第二相机滤波映射分量,所述第一相机滤波映射分量是根据所述横坐标和所述第一参数确定的,所述第二相机滤波映射分量是根据所述纵坐标和所述第一参数,或者根据所述横坐标、所述纵坐标和所述第一参数确定的。
在一些实施例中,当拍摄所述待估计图像的相机的视场角小于180度时,所述第一参数包括述相机的中心坐标(c x,c y)和焦距(f x,f y),所述第一相机滤波映射分量是根据所述横坐标和所述第一参数确定的,所述第二相机滤波映射分量是根据所述纵坐标和所述第一参数确定的。当所述待估计图像的相机的视场角大于180度时,所述第一参数包括所述待估计图像的宽度像素值W和高度像素值H,所述第一相机滤波映射分量是根据所述横坐标和所述第一参数确定的,所述第二相机滤波映射分量是根据所述横坐标、所述纵坐标和所述第一参数确定的。
在一些实施例中,所述第一DSN图包括所述待估计图像的像素点对应的第一DSN向量,深度估计模块94用于:根据所述像素点对应的第一DSN向量和所述像素点对应的相机滤波映射向量,确定所述像素点对应的深度值。其中,所述第一深度图包括所述像素点对应的深度值。
本申请实施例提供的单目深度估计装置可以用于执行上述单目深度估计方法,其内容和效果可参考方法部分,本申请实施例对此不再赘述。
本申请实施例另一些实施例还提供了一种电子设备,用于执行以上各方法实施例中的方法。如图14所示,该电子设备可以包括:图像采集器1001,图像采集器1001用于获取待估计图像和所述待估计图像对应的第一参数;一个或多个处理器1002;存储器1003;上述各器件可以通过一个或多个通信总线1005连接。其中上述存储器1003中存储一个或多个计算机程序1004,一个或多个处理器1002用于执行一个或多个计算机程序1004,该一个或多个计算机程序1004包括指令,上述指令可以用于执行上述方法实施例中的各个步骤。
例如,一个或多个处理器1002用于运行一个或多个计算机程序1004,以实现以下动作:
获取待估计图像和待估计图像对应的第一参数,所述第一参数为拍摄所述待估计图像的相机的相机标定参数。
将所述待估计图像输入至第一神经网络模型中,获取第一神经网络模型输出的第一距离缩放法线DSN图,所述第一DSN图用于表示所述待估计图像对应的目标物体的平面的朝向和所述平面与所述相机之间的距离。
根据所述待估计图像和所述第一参数,确定第一相机滤波映射图,所述第一相机滤波映射图用于表示所述目标物体在空间中的3D点与2D平面的映射关系,所述2D平面为所述相机的成像平面。
根据所述第一DSN图和所述第一相机滤波映射图,确定所述待估计图像对应的第一深度图。
在一些实施例中,所述第一神经网络模型为使用训练图像和所述训练图像对应的第二DSN图进行训练得到的,所述第二DSN图是根据所述训练图像对应的第二深度图、以及所述训练图像对应的相机标定参数确定的。
在一些实施例中,所述训练图像作为初始神经网络模型的输入,损失函数包括第一损失函数、第二损失函数或第三损失函数中至少一项,所述损失函数用于调整所述初始神经网络模型的参数,以训练得到所述第一神经网络模型。
所述第一损失函数用于表示所述第二DSN图和第三DSN图之间的误差,所述第三DSN图为所述初始神经网络模型输出的所述训练图像对应的DSN图,所述第二损失函数用于表示所述第二深度图和第三深度图之间的误差,所述第三深度图是根据所述第三DSN图和第二相机滤波映射图确定的,所述第二相机滤波映射图为根据所述训练图像对应的相机标定参数和所述训练图像确定的,所述第三损失函数用于表示所述第二深度图和第三深度图的匹配结果。
在一些实施例中,根据所述待估计图像的像素点的位置坐标和所述第一参数,确定所述第一相机滤波映射图,所述第一相机滤波映射图包括所述像素点对应的相机滤波映射向量,所述相机滤波向量用于表示所述3D点与所述像素点的映射关系,所述像素点为所述3D点投影至所述2D平面的点。
在一些实施例中,所述像素点的位置坐标包括横坐标和纵坐标,所述像素点对应的相机滤波映射向量包括第一相机滤波映射分量和第二相机滤波映射分量,所述第一相机滤波映射分量是根据所述横坐标和所述第一参数确定的,所述第二相机滤波映射分量是根据所述纵坐标和所述第一参数,或者根据所述横坐标、所述纵坐标和所述第一参数确定的。
在一些实施例中,当拍摄所述待估计图像的相机的视场角小于180度时,所述第一参数包括述相机的中心坐标(c x,c y)和焦距(f x,f y),所述第一相机滤波映射分量是根据所述横坐标和所述第一参数确定的,所述第二相机滤波映射分量是根据所述纵坐标和所述第一参数确定的。当所述待估计图像的相机的视场角大于180度时,所述第一参数包括所述待估计图像的宽度像素值W和高度像素值H,所述第一相机滤波映射分量是根据所述横坐标和所述第一参数确定的,所述第二相机滤波映射分量是根据所述横坐标、所述纵坐标和所述第一参数确定的。
在一些实施例中,所述第一DSN图包括所述待估计图像的像素点对应的第一DSN向量,根据所述像素点对应的第一DSN向量和所述像素点对应的相机滤波映射向量,确定所述像素点对应的深度值。其中,所述第一深度图包括所述像素点对应的深度值。
当然,图14所示的电子设备还可以包含如音频模块以及SIM卡接口等其他器件,本申请实施例对此不做任何限制。
本申请实施例还提供一种单目深度估计装置,如图15所示,该单目深度估计装置包括处理器1101和传输接口1102,该传输接口1102用于获取待估计图像和所述待估计图像对应的第一参数。
传输接口1102可以包括发送接口和接收接口,示例性的,传输接口1102可以为根据任何专有或标准化接口协议的任何类别的接口,例如高清晰度多媒体接口(high definition multimedia interface,HDMI)、移动产业处理器接口(Mobile Industry Processor Interface,MIPI)、MIPI标准化的显示串行接口(Display Serial Interface,DSI)、视频电子标准协会 (Video Electronics Standards Association,VESA)标准化的嵌入式显示端口(Embedded Display Port,eDP)、Display Port(DP)或者V-By-One接口,V-By-One接口是一种面向图像传输开发的数字接口标准,以及各种有线或无线接口、光接口等。
该处理器1101被配置为调用存储在存储器中的程序指令,以执行如上述方法实施例的单目深度估计方法,其内容和效果可参考方法部分,本申请实施例对此不再赘述。可选的,该装置还包括存储器1103。该处理器1102可以为单核处理器或多核处理器组,该传输接口1102为接收或发送数据的接口,该单目深度估计装置处理的数据可以包括视频数据或图像数据。示例性的,该单目深度估计装置可以为处理器芯片。
本申请实施例另一些实施例还提供一种计算机存储介质,该计算机存储介质可包括计算机指令,当该计算机指令在电子设备上运行时,使得该电子设备执行上述方法实施例的各个步骤。
本申请实施例另一些实施例还提供一种计算机程序产品,当该计算机程序产品在计算机上运行时,使得该计算机执行上述方法实施例的各个步骤。
以上各实施例中提及的处理器可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法实施例的各步骤可以通过处理器中的硬件的集成逻辑电路或者软件形式的指令完成。处理器可以是通用处理器、数字信号处理器(digital signal processor,DSP)、特定应用集成电路(application-specific integrated circuit,ASIC)、现场可编程门阵列(field programmable gate array,FPGA)或其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。本申请实施例公开的方法的步骤可以直接体现为硬件编码处理器执行完成,或者用编码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器,处理器读取存储器中的信息,结合其硬件完成上述方法的步骤。
上述各实施例中提及的存储器可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(read-only memory,ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(random access memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(static RAM,SRAM)、动态随机存取存储器(dynamic RAM,DRAM)、同步动态随机存取存储器(synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(double data rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(direct rambus RAM,DR RAM)。应注意,本文描述的***和方法的存储器旨在包括但不限于这些和任意其它适合类型的存储器。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本 申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的***、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的***、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个***,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (18)

  1. 一种单目深度估计方法,其特征在于,包括:
    获取待估计图像和所述待估计图像对应的第一参数,所述第一参数为拍摄所述待估计图像的相机的相机标定参数;
    将所述待估计图像输入至第一神经网络模型中,获取第一神经网络模型输出的第一距离缩放法线DSN图,所述第一DSN图用于表示所述待估计图像对应的目标物体的平面的朝向和所述平面与所述相机之间的距离;
    根据所述待估计图像和所述第一参数,确定第一相机滤波映射图,所述第一相机滤波映射图用于表示所述目标物体在空间中的3D点与2D平面的映射关系,所述2D平面为所述相机的成像平面;
    根据所述第一DSN图和所述第一相机滤波映射图,确定所述待估计图像对应的第一深度图。
  2. 根据权利要求1所述的方法,其特征在于,所述第一神经网络模型为使用训练图像和所述训练图像对应的第二DSN图进行训练得到的,所述第二DSN图是根据所述训练图像对应的第二深度图、以及所述训练图像对应的相机标定参数确定的。
  3. 根据权利要求2所述的方法,其特征在于,所述训练图像作为初始神经网络模型的输入;
    损失函数包括第一损失函数、第二损失函数或第三损失函数中至少一项,所述损失函数用于调整所述初始神经网络模型的参数,以训练得到所述第一神经网络模型;
    所述第一损失函数用于表示所述第二DSN图和第三DSN图之间的误差,所述第三DSN图为所述初始神经网络模型输出的所述训练图像对应的DSN图,所述第二损失函数用于表示所述第二深度图和第三深度图之间的误差,所述第三深度图是根据所述第三DSN图和第二相机滤波映射图确定的,所述第二相机滤波映射图为根据所述训练图像对应的相机标定参数和所述训练图像确定的,所述第三损失函数用于表示所述第二深度图和第三深度图的匹配程度。
  4. 根据权利要求1至3任一项所述的方法,其特征在于,所述根据所述待估计图像和所述第一参数,确定第一相机滤波映射图,包括:
    根据所述待估计图像的像素点的位置坐标和所述第一参数,确定所述第一相机滤波映射图,所述第一相机滤波映射图包括所述像素点对应的相机滤波映射向量,所述相机滤波向量用于表示所述3D点与所述像素点的映射关系,所述像素点为所述3D点投影至所述2D平面的点。
  5. 根据权利要求4所述的方法,其特征在于,所述像素点的位置坐标包括横坐标和纵坐标,所述像素点对应的相机滤波映射向量包括第一相机滤波映射分量和第二相机滤波映射分量,所述第一相机滤波映射分量是根据所述横坐标和所述第一参数确定的,所述第二相机滤波映射分量是根据所述纵坐标和所述第一参数,或者根据所述横坐标、所述纵坐标和所述第一参数确定的。
  6. 根据权利要求5所述的方法,其特征在于,当拍摄所述待估计图像的相机的视场角小于180度时,所述第一参数包括述相机的中心坐标(c x,c y)和焦距(f x,f y),所述第一相机滤波映射分量是根据所述横坐标和所述第一参数确定的,所述第二相机滤波映射分量是根 据所述纵坐标和所述第一参数确定的;
    当所述待估计图像的相机的视场角大于180度时,所述第一参数包括所述待估计图像的宽度像素值W和高度像素值H,所述第一相机滤波映射分量是根据所述横坐标和所述第一参数确定的,所述第二相机滤波映射分量是根据所述横坐标、所述纵坐标和所述第一参数确定的。
  7. 根据权利要求4至6任一项所述的方法,其特征在于,所述第一DSN图包括所述待估计图像的像素点对应的第一DSN向量,根据所述第一DSN图和所述第一相机滤波映射图,确定所述待估计图像对应的第一深度图,包括:
    根据所述像素点对应的第一DSN向量和所述像素点对应的相机滤波映射向量,确定所述像素点对应的深度值;
    其中,所述第一深度图包括所述像素点对应的深度值。
  8. 一种单目深度估计装置,其特征在于,包括:
    获取模块,用于获取待估计图像和所述待估计图像对应的第一参数,所述第一参数为拍摄所述待估计图像的相机的相机标定参数;
    距离缩放法线DSN模块,用于将所述待估计图像输入至第一神经网络模型中,获取第一神经网络模型输出的第一距离缩放法线DSN图,所述第一DSN图用于表示所述待估计图像对应的目标物体的平面的朝向和所述平面与所述相机之间的距离;
    相机滤波映射模块,用于根据所述待估计图像和所述第一参数,确定第一相机滤波映射图,所述第一相机滤波映射图用于表示所述目标物体在空间中的3D点与2D平面的映射关系,所述2D平面为所述相机的成像平面;
    深度估计模块,用于根据所述第一DSN图和所述第一相机滤波映射图,确定所述待估计图像对应的第一深度图。
  9. 根据权利要求8所述的装置,其特征在于,所述第一神经网络模型为使用训练图像和所述训练图像对应的第二DSN图进行训练得到的,所述第二DSN图是根据所述训练图像对应的第二深度图、以及所述训练图像对应的相机标定参数确定的。
  10. 根据权利要求9所述的装置,其特征在于,所述训练图像作为初始神经网络模型的输入;
    损失函数包括第一损失函数、第二损失函数或第三损失函数中至少一项,所述损失函数用于调整所述初始神经网络模型的参数,以训练得到所述第一神经网络模型;
    所述第一损失函数用于表示所述第二DSN图和第三DSN图之间的误差,所述第三DSN图为所述初始神经网络模型输出的所述训练图像对应的DSN图,所述第二损失函数用于表示所述第二深度图和第三深度图之间的误差,所述第三深度图是根据所述第三DSN图和第二相机滤波映射图确定的,所述第二相机滤波映射图为根据所述训练图像对应的相机标定参数和所述训练图像确定的,所述第三损失函数用于表示所述第二深度图和第三深度图的匹配程度。
  11. 根据权利要求8至10任一项所述的装置,其特征在于,所述相机滤波映射模块用于:
    根据所述待估计图像的像素点的位置坐标和所述第一参数,确定所述第一相机滤波映射图,所述第一相机滤波映射图包括所述像素点对应的相机滤波映射向量,所述相机滤波 向量用于表示所述3D点与所述像素点的映射关系,所述像素点为所述3D点投影至所述2D平面的点。
  12. 根据权利要求11所述的装置,其特征在于,所述像素点的位置坐标包括横坐标和纵坐标,所述像素点对应的相机滤波映射向量包括第一相机滤波映射分量和第二相机滤波映射分量,所述第一相机滤波映射分量是根据所述横坐标和所述第一参数确定的,所述第二相机滤波映射分量是根据所述纵坐标和所述第一参数,或者根据所述横坐标、所述纵坐标和所述第一参数确定的。
  13. 根据权利要求12所述的装置,其特征在于,当拍摄所述待估计图像的相机的视场角小于180度时,所述第一参数包括述相机的中心坐标(c x,c y)和焦距(f x,f y),所述第一相机滤波映射分量是根据所述横坐标和所述第一参数确定的,所述第二相机滤波映射分量是根据所述纵坐标和所述第一参数确定的;
    当所述待估计图像的相机的视场角大于180度时,所述第一参数包括所述待估计图像的宽度像素值W和高度像素值H,所述第一相机滤波映射分量是根据所述横坐标和所述第一参数确定的,所述第二相机滤波映射分量是根据所述横坐标、所述纵坐标和所述第一参数确定的。
  14. 根据权利要求11至13任一项所述的装置,其特征在于,所述第一DSN图包括所述待估计图像的像素点对应的第一DSN向量,所述深度估计模块用于:根据所述像素点对应的第一DSN向量和所述像素点对应的相机滤波映射向量,确定所述像素点对应的深度值;其中,所述第一深度图包括所述像素点对应的深度值。
  15. 一种单目深度估计装置,其特征在于,包括处理器和传输接口,
    所述传输接口,用于获取待估计图像和所述待估计图像对应的第一参数;
    所述处理器,被配置为调用存储在存储器中的程序指令,以执行如权利要求1-7中任一项所述的方法。
  16. 一种电子设备,其特征在于,包括:
    图像采集器,所述图像采集器用于获取待估计图像和所述待估计图像对应的第一参数;
    一个或多个处理器;
    存储器,用于存储程序指令;
    所述一个或多个处理器被配置为调用存储在所述存储器中的程序指令,以实现如权利要求1-7中任一项所述的方法。
  17. 一种计算机可读存储介质,其特征在于,包括计算机程序,所述计算机程序在计算机或处理器上被执行时,使得所述计算机或所述处理器执行权利要求1-7中任一项所述的方法。
  18. 一种计算机程序产品,其特征在于,所述计算机程序产品包括计算机程序,当所述计算机程序被计算机或处理器执行时,用于执行权利要求1-7中任一项所述的方法。
PCT/CN2021/075318 2021-02-04 2021-02-04 单目深度估计方法、装置及设备 WO2022165722A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/075318 WO2022165722A1 (zh) 2021-02-04 2021-02-04 单目深度估计方法、装置及设备

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/075318 WO2022165722A1 (zh) 2021-02-04 2021-02-04 单目深度估计方法、装置及设备

Publications (1)

Publication Number Publication Date
WO2022165722A1 true WO2022165722A1 (zh) 2022-08-11

Family

ID=82740792

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/075318 WO2022165722A1 (zh) 2021-02-04 2021-02-04 单目深度估计方法、装置及设备

Country Status (1)

Country Link
WO (1) WO2022165722A1 (zh)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108615244A (zh) * 2018-03-27 2018-10-02 中国地质大学(武汉) 一种基于cnn和深度滤波器的图像深度估计方法及***
CN108765481A (zh) * 2018-05-25 2018-11-06 亮风台(上海)信息科技有限公司 一种单目视频的深度估计方法、装置、终端和存储介质
US20210004646A1 (en) * 2019-07-06 2021-01-07 Toyota Research Institute, Inc. Systems and methods for weakly supervised training of a model for monocular depth estimation
CN110503680A (zh) * 2019-08-29 2019-11-26 大连海事大学 一种基于非监督的卷积神经网络单目场景深度估计方法
CN110738697A (zh) * 2019-10-10 2020-01-31 福州大学 基于深度学习的单目深度估计方法

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115731365A (zh) * 2022-11-22 2023-03-03 广州极点三维信息科技有限公司 基于二维图像的网格模型重建方法、***、装置及介质
CN115965758A (zh) * 2022-12-28 2023-04-14 无锡东如科技有限公司 一种图协同单目实例三维重建方法
CN115965758B (zh) * 2022-12-28 2023-07-28 无锡东如科技有限公司 一种图协同单目实例三维重建方法
CN116993679A (zh) * 2023-06-30 2023-11-03 芜湖合德传动科技有限公司 一种基于目标检测的伸缩机皮带磨损检测方法
CN116993679B (zh) * 2023-06-30 2024-04-30 芜湖合德传动科技有限公司 一种基于目标检测的伸缩机皮带磨损检测方法

Similar Documents

Publication Publication Date Title
US12008797B2 (en) Image segmentation method and image processing apparatus
WO2021043273A1 (zh) 图像增强方法和装置
WO2021043168A1 (zh) 行人再识别网络的训练方法、行人再识别方法和装置
US11232286B2 (en) Method and apparatus for generating face rotation image
WO2022165722A1 (zh) 单目深度估计方法、装置及设备
WO2021164731A1 (zh) 图像增强方法以及图像增强装置
WO2022001372A1 (zh) 训练神经网络的方法、图像处理方法及装置
WO2022042049A1 (zh) 图像融合方法、图像融合模型的训练方法和装置
US20210398252A1 (en) Image denoising method and apparatus
CN110222717B (zh) 图像处理方法和装置
WO2021063341A1 (zh) 图像增强方法以及装置
WO2022179581A1 (zh) 一种图像处理方法及相关设备
CN111797881B (zh) 图像分类方法及装置
WO2022100419A1 (zh) 一种图像处理方法及相关设备
WO2021042774A1 (zh) 图像恢复方法、图像恢复网络训练方法、装置和存储介质
CN113011562A (zh) 一种模型训练方法及装置
CN112258565B (zh) 图像处理方法以及装置
CN110222718A (zh) 图像处理的方法及装置
WO2022179606A1 (zh) 一种图像处理方法及相关装置
CN113284055A (zh) 一种图像处理的方法以及装置
CN113066018A (zh) 一种图像增强方法及相关装置
CN115239581A (zh) 一种图像处理方法及相关装置
CN114170290A (zh) 图像的处理方法及相关设备
WO2022179603A1 (zh) 一种增强现实方法及其相关设备
WO2021213012A1 (zh) 体重检测方法、人体特征参数检测方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21923747

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21923747

Country of ref document: EP

Kind code of ref document: A1