WO2014033731A2 - A system and method for depth estimation - Google Patents

A system and method for depth estimation

Info

Publication number
WO2014033731A2
WO2014033731A2 (PCT/IN2013/000421)
Authority
WO
WIPO (PCT)
Prior art keywords
real time
images
image processing
processing device
time depth
Prior art date
Application number
PCT/IN2013/000421
Other languages
French (fr)
Other versions
WO2014033731A3 (en)
Inventor
Anup SABLE
Vinay Govind Vaidya
Original Assignee
Kpit Cummins Infosystems Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kpit Cummins Infosystems Limited filed Critical Kpit Cummins Infosystems Limited
Publication of WO2014033731A2 publication Critical patent/WO2014033731A2/en
Publication of WO2014033731A3 publication Critical patent/WO2014033731A3/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery
    • G06T7/55Depth or shape recovery from multiple images

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

A system and method for real time depth imaging is disclosed herein. The method consists of capturing plurality of images through an input device placed on a rotating platform; transferring the captured images to an image processing device via a communication connection; rectification of successive captured images in common field of view of different positions without calibration of the input device by the image processing device; determination of error in extent of the rectification resulting on different pair of images by the image processing device; block matching of images to determine similarity between the image blocks by the image processing device; determination of disparity between the image blocks by the image processing device; estimation of real time depth of the image by the image processing device; and displaying the real time depth on an output device.

Description

"A SYSTEM AND METHOD FOR DEPTH ESTIMATION"
FIELD OF INVENTION:
The present invention generally relates to image processing and more specifically, relates to 3D imaging techniques that renders information regarding relative depth of a point in a real space captured in a camera.
BACKGROUND & PRIOR ART:
Conventional depth/distance/range estimation is done using sensors that directly provide depth information. Such sensors include ultrasound, radar, LIDAR, etc., and many of their variants. However, depth from imaging is a widely researched and well-established technology in today's world. Depth from imaging is developing rapidly and has gained increasing importance in various fields like security, entertainment, medical, manufacturing, etc. This technology combines 2D data, captured with some amount of overlap, to generate a perspective and realistic depth image. Numerous techniques have been developed for construction of depth maps from 2D data.
Humans perceive depth using their two eyes, which provide stereo vision. Any point in the real 3D world projects to two different points on the retinas of the two eyes, thereby forming a disparity. The human brain interprets this disparity data and creates a depth map of the scene. Many products and applications have been developed and put to use based on this concept of stereo imaging. In a real-time depth imaging system, the data capture, processing, reconstruction and display are realized simultaneously to provide 3D images from the captured 2D images. Various methods have been proposed for real time 3D imaging, with on-going research into improved methods.
There are several techniques used for calculating the depth of an object. LASER (Light Amplification by Stimulated Emission of Radiation) triangulation projects a laser on an object and acquires the height profile using a camera. In another method, a known light pattern is projected onto an object, and the depth information is calculated from the distortion of the light pattern. A Time of Flight (TOF) based depth sensor synchronizes a light source with an image sensor in order to calculate distance based on the time between the emitted pulse of light and the reflected light arriving back at the sensor. In the field of medical imaging, Optical Coherence Tomography (OCT) uses infrared light to calculate depth information by measuring the reflections of light through the cross-section of the object. These are some of the different technologies to find the distance of an object from the source. Over the last decade, these techniques have either been replaced or improved upon, depending on their performance. These methods also have some shortcomings. In the LASER triangulation method, the LASER sensor should be kept clean; otherwise the accuracy of the system may be affected. A TOF-based depth camera has low resolution, which sometimes results in a non-homogeneous depth map, intensity-based distance errors and light interference effects. These technologies are costly and take significant time to acquire data of an object. In order to optimize the cost and make the system work in various conditions, a rotating camera system is proposed.
The existing methods for real time 3D/depth imaging, like LIDAR, stereo imaging, etc., utilize lasers, sensors, special equipment, multiple cameras, etc. for reconstruction of 3D images. This increases the hardware requirement and the processing complexity of the images, making the systems complicated and costly. Thus, there is a need for real-time 3D imaging which is simple, efficient and economical.
SUMMARY OF INVENTION:
The present invention discloses a system and method for real time depth estimation to determine relative depth of a point captured in a 2D space. The method of real time depth estimation comprises: Capturing plurality of images through an input device placed on a rotating platform; Transferring the captured images to an image processing device via a communication connection; Rectification of successive captured images in common field of view (FOV) of different positions without calibration of said input device by said image processing device; determination of error in extent of said rectification resulting on different pair of images by said image processing device; Block matching of images to determine similarity between the image blocks by said image processing device; determination of disparity between the image blocks by said image processing device; estimation of real time depth of the image by said image processing device; and Displaying the real time depth on an output device.
BRIEF DESCRIPTION OF DRAWINGS:
Fig. 1 illustrates the process flow for depth estimation.
Fig. 2 illustrates re-projection error analysis.
Fig. 3 illustrates a system for depth estimation, according to the embodiment of the invention.
Fig. 4 illustrates the rotating platform setup for capturing the images.
Fig. 5 illustrates original images showing corresponding epipolar lines in both images.
Fig. 6 illustrates rectified images showing epipolar lines.
Fig. 7 illustrates computation of disparity.
Fig. 8 illustrates example original images taken at 5° rotation of the camera.
Fig. 9 illustrates rectified images of the original images of Fig. 8.
Fig. 10 illustrates a graph showing the relation between disparity and depth.
Fig. 11 illustrates a graph of distance versus % error.
DETAILED DESCRIPTION:
The present invention discloses a system and method for real time depth estimation to determine relative depth of a point captured in a 2D space. The method comprises: Capturing plurality of images through an input device placed on a rotating platform; Transferring the captured images to an image processing device via a communication connection; Rectification of successive captured images in common field of view (FOV) of different positions without calibration of said input device by said image processing device; determination of error in extent of said rectification resulting on different pair of images by said image processing device; Block matching of images to determine similarity between the image blocks by said image processing device; determination of disparity between the image blocks by said image processing device; estimation of real time depth of the image by said image processing device; and Displaying the real time depth on an output device.
In a preferred embodiment, the system of the present invention consists of the input device, the communication connection, the image processor and the output device. The input device may include a single high-resolution, high frame-per-second (FPS) camera, optical sensors, light sensors, imaging sensors, position and angle sensors, navigation sensors, RF sensors, ultrasonic sensors, or any other similar sensors. The input device is mounted on a rotating platform and is pivotally rotated on the platform in a continuous manner to capture images of its surroundings. The rotation is controlled in such a manner that there is some overlap between the data captured in one frame and the next while the input device rotates. This overlap is required to find similar points in the 3D world as seen from different positions (angles) of the input device and to use this data to create a 3D map of the surroundings. Since the data captured is huge and the time to process it would be large, a GP-GPU platform or any other similar platform having parallel processing capabilities is used to reconstruct the 3D images in real time. Such parallel processing of the large amount of captured image data provides for real-time reconstruction of depth maps from the captured images. Moreover, since the platform itself could be placed on a moving object, for example an automobile in motion, there are possibilities of distortion in the captured video because of camera jitter and the motion of the automobile. The method of the invention uses image processing techniques to compensate for these distortions and to provide a robust, real-time representation of the depth map from the acquired 2D images.
The present invention and the manner in which it is to be performed is described in the accompanying drawings and figures. While the figures are for exemplary purpose, it is not intended to limit the scope of the invention to the figures.
Fig. 1 depicts the process flow diagram according to the method of the present invention. According to this method, a plurality of images is captured through an input device. The input device may be selected from a group including, but not limited to, any existing camera having a high resolution and a high FPS, optical sensors, light sensors, imaging sensors, position and angle sensors, navigation sensors, RF sensors and the like. Additionally, a single input device or multiple input devices may be used. These captured images are then rectified. The rectified images are matched to determine similarity between the two images. This is followed by computation of disparity, which consequently renders information regarding depth estimation. Fig. 3 illustrates a system for depth estimation, according to an embodiment of the invention.
Fig. 3 illustrates an image processing device (200) for implementing one or more embodiments of the present subject matter. Figure 3 and the following discussion are intended to provide a brief, general description of a suitable computing environment in which certain embodiments of the inventive concepts contained herein may be implemented. The image processing device (200) may include a processor (202), a memory (204), a removable storage (206), and a non-removable storage (208). The image processing device (200) additionally includes a bus (210) and a network interface (212). The image processing device (200) may include or have access to one or more input devices (214), one or more output devices (216), and one or more communication connections (218). The one or more input devices (214) may be various sensors known in the art. Sensors that may be used include, but are not limited to, optical sensors, light sensors, imaging sensors, position and angle sensors, navigation sensors, RF sensors, etc.
In one embodiment of the invention, the input device (214) is an ultrasound sensor. In yet another embodiment of the invention, the input device (214) is a single, high-resolution, low-cost camera. The one or more output devices (216) may be a display. The communication connections (218) may include mobile networks such as General Packet Radio Service (GPRS), Wireless Fidelity (Wi-Fi), Worldwide Interoperability for Microwave Access (WiMax), Long Term Evolution (LTE), and the like. The memory (204) may include volatile memory and/or non-volatile memory for storing a computer program (220). A variety of computer-readable storage media may be stored in and accessed from the memory elements of the image processing device (200), the removable storage (206) and the non-removable storage (208). The processor (202), as used herein, is a multi-core parallel processor. As a large amount of data is captured by the input device (214) every second while it rotates, very high-speed processing is required to process the captured data. Hence, a multi-core parallel processor (202) is used to process the captured data at high speed and in real time.
The multi-core processor (202) carries out the processing of a huge amount of data in parallel, making real-time processing efficient. Additionally, the processor may comprise any type of computational circuit, such as, but not limited to, a microprocessor, a microcontroller, a complex instruction set computing microprocessor, a reduced instruction set computing microprocessor, a very long instruction word microprocessor, an explicitly parallel instruction computing microprocessor, a graphics processor, a digital signal processor, or any other type of processing circuit. The processor (202) may also include embedded controllers, such as generic or programmable logic devices or arrays, application specific integrated circuits, single-chip computers, smart cards, and the like. Embodiments of the present subject matter may be implemented in conjunction with program modules, including functions, data structures, and application programs, for performing tasks, or for defining abstract data types or low-level hardware contexts. Machine-readable instructions stored on any of the above-mentioned storage media may be executable by the processor (202) of the image processing device (200).
The images are captured using the rotating platform setup shown in Fig. 4, which also illustrates the direction of rotation of the input device. In an embodiment of the invention, the input device is an image capturing device, which is a single camera having a high resolution and a high FPS. The arrow in Fig. 4 denotes the direction of rotation of the camera. The camera is mounted on a rotating table that rotates through an angle θ°. In another embodiment, the system creates a depth map in a full 360° field of view. The lowermost shaded portion illustrated in Fig. 4 is the field of view (FOV) of the camera at position 1, i.e. FOV1, and the uppermost shaded portion is the FOV of the camera at position 2, i.e. FOV2, when the camera is rotated by an angle θ°. The middle shaded portion shows the common FOV of both positions. In a preferred method, the angle θ° is kept small in order to contain the problem of quick disappearance of objects in successive views. In yet another embodiment of the invention, the input device (214) is a rotating ultrasound sensor which scans the images in two planes, horizontal and vertical.
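The trade-off controlled by the angle θ° can be illustrated with simple geometry. Assuming, for illustration only, a pinhole camera rotating approximately about its optical centre with a horizontal FOV of φ degrees, a rotation of θ degrees leaves roughly a (φ − θ)/φ fraction of the view common to both positions; the snippet below sketches this with a hypothetical 60° FOV, not a parameter stated in the patent.

```python
def common_fov_fraction(fov_deg: float, theta_deg: float) -> float:
    """Approximate fraction of the horizontal FOV shared by two camera
    positions separated by a rotation of theta_deg about the optical centre.
    Returns 0 when the rotation equals or exceeds the FOV."""
    return max(0.0, (fov_deg - theta_deg) / fov_deg)

# Illustrative values only: a 60-degree-FOV webcam rotated in 5-degree steps
# (the step used for the example images in Fig. 8) keeps roughly 92% of the
# view in common between successive frames.
print(common_fov_fraction(60.0, 5.0))  # ~0.917
```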
The images obtained are then rectified by aligning the epipolar lines of one image so that they become parallel with their corresponding epipolar lines in the other image. According to the method of the present invention, the image rectification is done without calibration of the input device. In a preferred embodiment, Fusiello and Irsara's approach for image rectification is used. In another embodiment, Hartley's approach, which makes use of the fundamental matrix to calculate a rigid transformation, or Isgro and Trucco's approach, which rectifies images from feature points directly without calculating the fundamental matrix, is used. The geometric re-projection error, i.e. the error in an image between a projected point and a measured point, is calculated to assess the extent of rectification resulting on different pairs of images. The re-projection error analysis, which is carried out by the processor (202), is shown in Fig. 2. Fig. 5 illustrates the original images showing corresponding epipolar lines in both images. The images so rectified are shown in Fig. 6.
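The patent names these rectification approaches but gives no implementation. Below is a minimal sketch, under assumed defaults (ORB features, brute-force matching, RANSAC thresholds), of the Hartley-style route it mentions: estimating the fundamental matrix from matched feature points and deriving rectifying homographies with OpenCV. It is not the inventors' implementation.

```python
import cv2
import numpy as np

def rectify_uncalibrated(img1, img2):
    """Rectify two grayscale views of the same scene without camera
    calibration: match features, estimate the fundamental matrix, and
    compute Hartley-style rectifying homographies."""
    orb = cv2.ORB_create(2000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Fundamental matrix with RANSAC; keep only inlier correspondences.
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
    pts1, pts2 = pts1[mask.ravel() == 1], pts2[mask.ravel() == 1]

    # Homographies that map corresponding epipolar lines onto the same rows.
    h, w = img1.shape[:2]
    ok, H1, H2 = cv2.stereoRectifyUncalibrated(pts1, pts2, F, (w, h))
    rect1 = cv2.warpPerspective(img1, H1, (w, h))
    rect2 = cv2.warpPerspective(img2, H2, (w, h))
    return rect1, rect2, F
```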
The rectified images then undergo the process of matching features, or finding corresponding points between the two images of a scene, which helps to compute the disparity between the pixels. This is done by a block matching process, which determines the similarity between image blocks using the sum of absolute differences (SAD). For block matching, an n × n pixel block is taken around every pixel of the reference image. This block slides along the same pixel row in the other view, since the images are rectified. The block is matched to the reference block when the result of the SAD is minimum. For example, one reference matrix is considered and the SAD is computed against the other image's matrices. The one which gives the minimum sum is the matched block.
The SAD is represented by equation (1):

SAD(X, Y) = Σᵢ Σⱼ | X(i, j) − Y(i, j) |    (1)

where 'X' and 'Y' are two different blocks, and SAD(X, Y) indicates the sum of absolute differences between matrix 'X' of one view and matrix 'Y' of the other view.
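A minimal NumPy sketch of this per-pixel SAD search follows; the 7 × 7 block size and 64-pixel maximum disparity are arbitrary illustrative values, not parameters specified in the patent.

```python
import numpy as np

def sad(block_x, block_y):
    # Equation (1): sum of absolute differences between two equal-size blocks.
    return int(np.abs(block_x.astype(np.int32) - block_y.astype(np.int32)).sum())

def match_pixel(ref_view, other_view, row, col, half=3, max_disp=64):
    """Slide an n x n block (n = 2*half + 1) along the same row of the other
    rectified view and return the disparity with the minimum SAD."""
    ref = ref_view[row - half:row + half + 1, col - half:col + half + 1]
    best_d, best_cost = 0, float("inf")
    for d in range(max_disp + 1):
        c = col - d
        if c - half < 0:
            break  # candidate block would fall outside the image
        cand = other_view[row - half:row + half + 1, c - half:c + half + 1]
        cost = sad(ref, cand)
        if cost < best_cost:
            best_cost, best_d = cost, d
    return best_d
```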
After matching, the disparity between the matched pixels is calculated, i.e. the displacement of a pixel between the two views. The disparity is computed by the relation shown in equation (2):

d = x_l − x_r    (2)

where d is the disparity, x_l is the pixel position in the first view, and x_r is the pixel position in the second view. Fig. 7 illustrates the computation of disparity. In Fig. 7, '1' denotes the first-view pixel, '2' denotes the second-view pixel matched with the first-view pixel, and 'd' is the disparity.
The depth of the object is the ratio of the product of the focal length of the camera and the distance between the two camera positions to the disparity between the pixels. It is determined by the relation shown in equation (3):

Z = f · b / (x_l − x_r)    (3)

where 'Z' indicates the depth of the object from the camera, 'f' indicates the focal length of the camera, 'b' indicates the distance between the two camera positions, and x_l − x_r is the disparity between the pixels.
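Equation (3) can be exercised with a small helper; the focal length (in pixels) and baseline below are purely hypothetical values chosen for illustration, not measurements from the patent's setup.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Equation (3): Z = f * b / (x_l - x_r). Depth grows without bound as the
    disparity approaches zero, which is why far objects are hard to range."""
    if disparity_px <= 0:
        return float("inf")
    return focal_px * baseline_m / disparity_px

# Hypothetical numbers: a 700-pixel focal length and a 5 cm baseline give
# roughly 0.76 m of depth for a 46-pixel disparity and 1.17 m for 30 pixels.
for d in (46, 34, 30):
    print(d, depth_from_disparity(d, focal_px=700.0, baseline_m=0.05))
```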
Hence, the higher the disparity between the pixels of an object, the smaller the depth of that object. In embodiments where the input device (214) is a sensor, including, but not limited to, an ultrasound or infra-red sensor, the depth of the object is based on the different positions of the sensor from which a signal is transmitted, the distance covered by the transmitted signal in reaching the object, and the time taken by the signal to travel to the object.
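For such ranging sensors the underlying relation is the familiar time-of-flight one: distance equals propagation speed times round-trip time divided by two. A minimal sketch, assuming an ultrasound sensor in air (the propagation speed is a textbook value, not a parameter from the patent):

```python
def echo_distance_m(round_trip_time_s, speed_m_per_s=343.0):
    """One-way distance from a round-trip echo time; 343 m/s is the speed of
    sound in air at about 20 °C (use ~3e8 m/s for an RF or optical pulse)."""
    return speed_m_per_s * round_trip_time_s / 2.0

# e.g. an echo returning after 10 ms corresponds to an object ~1.7 m away.
print(echo_distance_m(0.010))  # 1.715
```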
The method of the present invention uses an uncalibrated rectification approach, which gives good results compared to calibrated rectification. One such example is demonstrated below.
The method was implemented on many sets of rotated images taken in different environments. Fig. 8 shows, for exemplary purposes, original images captured at a 5° rotation of the camera on the rotating platform. The rectified images corresponding to the original images are shown in Fig. 9. The circles marked in Fig. 9 are reference points taken to obtain the depth information of the marked points indicated in Table 1. The door knob is represented by a black circle, the 'V' point on the side glass is represented by a red circle, and the upper corner of the side poster is represented by the light pink circle.
Table 1 shows the pixel positions from the rectified images and the corresponding disparity calculated with respect to the objects marked with colored circles in Fig. 9. It can be noticed that the 'y' coordinate in the first view matches the 'y' coordinate in the second view. This clearly shows that the images are rectified and the disparity present is along the 'x' direction only.
Table 1: [pixel positions of the marked points in both rectified views and the corresponding disparities; published as an image in the original document]
With the calculation of disparity, it is seen that the relative depth is inversely proportional to the disparity. As illustrated in Fig. 8 and Fig. 9, the door knob has the highest disparity, i.e. 46 pixels, compared to the 'V' point on the side glass, which has 34 pixels, and the upper corner of the side poster, which has 30 pixels. Therefore the depth of an object from the input device is inversely proportional to the disparity, as shown in equation (4).
Z = k / d    (4)

where 'Z' is the depth of the object from the camera, 'd' is the disparity, and 'k' is a constant value.
Fig. 10 shows the graph representing the depth corresponding to the disparity. It is evident from the graph that as the disparity decreases, the depth increases. Therefore, according to the above relation, the door knob is the closest point, since its depth comes out to be the least, and the side poster is the farthest among the points marked in Fig. 9.
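Using equation (4) with the disparities reported for Table 1, the relative ordering of the three marked points can be reproduced with a few lines; the constant k cancels when only relative depth is compared.

```python
# Disparities (in pixels) of the marked points, as reported for Table 1.
disparities = {"door knob": 46, "'V' point on side glass": 34, "side poster corner": 30}

# Equation (4): Z = k / d, so relative depth is proportional to 1/d.
# Normalising by the farthest point puts the door knob at ~0.65x its depth.
farthest = 1.0 / min(disparities.values())
for name, d in sorted(disparities.items(), key=lambda item: -item[1]):
    print(f"{name}: relative depth {((1.0 / d) / farthest):.2f}")
```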
Table 2, below, shows the average error at the respective object distances.
Table 2: [average error versus object distance; published as an image in the original document]
Fig. 11 shows the average percentage error of the proposed system. The accuracy of the system depends on the quality and the rotation speed of the input device. For the proposed setup, the camera used is an off-the-shelf web camera of VGA resolution (640 × 480).
The existing techniques of image capture and reconstruction, like LIDAR, are expensive. The system of the present invention is a frugal method to estimate depth information of a given object. It is difficult to find the depth of far objects, as the disparity between the pixels of the two views tends towards zero. The method of the present invention works better for images having noticeable pixel differences, i.e. for images with good texture, contrast, etc.
The present invention can be applied to scene reconstruction in various domains, such as, but not limited to, earth sciences, the entertainment industry, cultural heritage, digital archival, etc. The system of the present invention also has applications in fields such as, but not limited to, robotic vision, military surveillance, and the development of autonomous vehicles. The system can further be used as a shape identifier, for example to find the shape of a bottle or a coffee cup, and to enhance the accuracy of identification systems such as facial recognition or other biometrics.
It is evident that the present invention and its advantages are not limited to the above-described embodiments only. Minor modifications, substitutions and equivalents will be apparent to those skilled in the art without departing from the spirit and scope of the present invention as described in the claims. Accordingly, the specification, examples and figures are to be regarded as illustrative of the invention only, rather than restrictive.

Claims

WE CLAIM,
1. A system and method for real time depth estimation, said method comprising,
(a) Capturing plurality of images through an input device placed on a rotating platform;
(b) Transferring the captured images to an image processing device via a communication connection;
(c) Rectification of successive captured images in common field of view (FOV) of different positions without calibration of said input device by said image processing device;
(d) determination of error in extent of said rectification resulting on different pair of images by said image processing device;
(e) Block matching of images to determine similarity between the image blocks by said image processing device;
(f) determination of disparity between the image blocks by said image processing device;
(g) estimation of real time depth of the image by said image processing device; and
(h) Displaying the real time depth on an output device.
2. The system and method for real time depth estimation according to claim 1; wherein said platform rotation is controlled such that data captured in successive images during rotation is overlapped to create 3D map of surrounding.
3. The system and method for real time depth estimation according to claim 1; wherein successive images are captured at a fixed predetermined interval.
4. The system and method for real time depth estimation according to claim 1; wherein said block matching is calculated by sum of absolute differences (SAD) method.
5. The system and method for real time depth estimation according to claim 1; wherein said relative depth of the object is a ratio of product of focal length of camera and distance between two camera positions to the disparity between pixels.
6. The system and method for real time depth estimation according to claim 1; wherein said input device and said platform both are rotating.
7. The system and method for real time depth estimation according to claim 1, wherein the input device is an optical imaging device.
8. The system and method for real time depth estimation according to claim 1, wherein the input device is an ultrasound sensor.
9. The system and method for real time depth estimation according to claim 1; wherein said image processing device comprises a processor, a memory, a removable storage, and a non-removable storage.
10. The system and method for real time depth estimation according to claim 9; wherein said image processing device additionally includes a bus and a network interface.
PCT/IN2013/000421 2012-07-10 2013-07-08 A system and method for depth estimation WO2014033731A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN1987/MUM/2012 2012-07-10
IN1987MU2012 2012-07-10

Publications (2)

Publication Number Publication Date
WO2014033731A2 true WO2014033731A2 (en) 2014-03-06
WO2014033731A3 WO2014033731A3 (en) 2014-05-08

Family

ID=49885338

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IN2013/000421 WO2014033731A2 (en) 2012-07-10 2013-07-08 A system and method for depth estimation

Country Status (1)

Country Link
WO (1) WO2014033731A2 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2541884A (en) * 2015-08-28 2017-03-08 Imp College Of Science Tech And Medicine Mapping a space using a multi-directional camera
CN109564691A (en) * 2016-06-01 2019-04-02 弗劳恩霍夫应用研究促进协会 Device and method for executing 3D estimation based on locally determining 3D information hypothesis
US11928830B2 (en) 2021-12-22 2024-03-12 Honeywell International Inc. Systems and methods for generating three-dimensional reconstructions of environments

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8345930B2 (en) * 2010-01-22 2013-01-01 Sri International Method for computing food volume in a method for analyzing food

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
None

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2541884A (en) * 2015-08-28 2017-03-08 Imp College Of Science Tech And Medicine Mapping a space using a multi-directional camera
US10796151B2 (en) 2015-08-28 2020-10-06 Imperial College Of Science, Technology And Medicine Mapping a space using a multi-directional camera
CN109564691A (en) * 2016-06-01 2019-04-02 弗劳恩霍夫应用研究促进协会 Device and method for executing 3D estimation based on locally determining 3D information hypothesis
CN109564691B (en) * 2016-06-01 2023-09-01 弗劳恩霍夫应用研究促进协会 Apparatus and method for performing 3D estimation based on locally determined 3D information hypothesis
US11928830B2 (en) 2021-12-22 2024-03-12 Honeywell International Inc. Systems and methods for generating three-dimensional reconstructions of environments

Also Published As

Publication number Publication date
WO2014033731A3 (en) 2014-05-08

Similar Documents

Publication Publication Date Title
US10796151B2 (en) Mapping a space using a multi-directional camera
US9972067B2 (en) System and method for upsampling of sparse point cloud for 3D registration
CN108419446B (en) System and method for laser depth map sampling
Remondino et al. State of the art in high density image matching
US10582188B2 (en) System and method for adjusting a baseline of an imaging system with microlens array
CN112270713B (en) Calibration method and device, storage medium and electronic device
US7768527B2 (en) Hardware-in-the-loop simulation system and method for computer vision
US10699430B2 (en) Depth estimation apparatus, autonomous vehicle using the same, and depth estimation method thereof
CN103115613B (en) Three-dimensional space positioning method
CN111028155B (en) Parallax image splicing method based on multiple pairs of binocular cameras
CN110009672A (en) Promote ToF depth image processing method, 3D rendering imaging method and electronic equipment
US11307595B2 (en) Apparatus for acquisition of distance for all directions of moving body and method thereof
CN105654547B (en) Three-dimensional rebuilding method
CN105043350A (en) Binocular vision measuring method
WO2015179216A1 (en) Orthogonal and collaborative disparity decomposition
CN102997891A (en) Device and method for measuring scene depth
CN112837207B (en) Panoramic depth measurement method, four-eye fisheye camera and binocular fisheye camera
WO2018032841A1 (en) Method, device and system for drawing three-dimensional image
KR20210022703A (en) Moving object detection and intelligent driving control methods, devices, media and devices
CN104976968A (en) Three-dimensional geometrical measurement method and three-dimensional geometrical measurement system based on LED tag tracking
CN106225676A (en) Method for three-dimensional measurement, Apparatus and system
US20140055573A1 (en) Device and method for detecting a three-dimensional object using a plurality of cameras
Jia et al. 3D depth information extraction with omni-directional camera
Harvent et al. Multi-view dense 3D modelling of untextured objects from a moving projector-cameras system
WO2014033731A2 (en) A system and method for depth estimation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13814639

Country of ref document: EP

Kind code of ref document: A2

122 Ep: pct application non-entry in european phase

Ref document number: 13814639

Country of ref document: EP

Kind code of ref document: A2