WO2005088539A2

WO2005088539A2 - Augmented reality system with coregistration of virtual objects on images of real objects

Info

Publication number: WO2005088539A2
Application number: PCT/BE2005/000036
Authority: WO
Inventors: Benoit Macq; Quentin Noirhomme; Michel Cornet D'elzius; Pierre Delville
Original assignee: Université Catholique de Louvain
Priority date: 2004-03-15
Filing date: 2005-03-15
Publication date: 2005-09-22
Also published as: WO2005088539A3; GB0405792D0

Abstract

Methods and apparatus to perform registration in markerless Augmented Reality systems is described. Spatial and/or colour filtering of the real world images is implemented to reduce computational intensity. The present invention provides a powerful tool to help the surgeon while s/he is planning and/or performing complex surgical procedures.

Description

AUGMENTED REALITY SYSTEM WITH COREGISTRATION OF VIRTUAL OBJECTS ON IMAGES OF REAL OBJECTS.

FIELD OF THE INVENTION The present invention relates to methods and apparatus for augmented reality, especially in medical applications such as surgery, and software therefor.

TECHNICAL BACKGROUND Augmented Reality (AR) systems have been investigated in several fields: in the medical domain, in entertainment, for military training, in engineering design, in robotics and telerobotics, in manufacturing and maintenance, and in consumer product design. The basic structure to support augmented reality applications is based on a perception of the real world, a representation of at least one virtual world and a registration procedure to align the real with the virtual. The registration procedure enables to know where a virtual object should be positioned in the real world. Basically, the main challenges for developing robust AR systems focus on methods allowing to perceive the real world and register it to the virtual world. The real world can be acquired by using different methods such as laser scanning, infrared scanning, computer vision with and without fiducial markers, GPS methods for mobile systems, etc. The appropriate method for a specific application will depend on the task, desired resolution, performance and financial resources. The computer vision methods allow acquiring a view of the real world from one or two cameras worn by the user or fixed in the environment. For instance, Sauer [Sauer, F., Schoepf, U. J., Khamene, A., Vogt, S., Das, M., Silverman, S. G.: "Augmented Reality system for CT-guided interventions: system description and initial phantom trials" Medical Imaging 2003: Visualization, Image-Guided Procedures, and Display, Proceedings of the SPIE, Vol. 5029. (2003) 384-394], describes a method using a stereo pair of colour cameras to capture live video and a third monochrome camera to capture a ring of infrared LED markers. It is used for head tracking and works in conjunction with retroreflective optical markers which are placed around the surgical workspace. A similar approach is proposed in [Splechtna, R.C., Fuhrmann, A.L., Wegenkittl, R: ARAS - Augmented Reality Aided Surgery System Description. VRVis Technical Report, http://www.vrvis.at/TR/2002/TR_VRVis_ 2002_ 040_ Full.pdf (2002)] where the markers used for tracking the patient are mounted on one of the self-retaining retractors, because the patients position relative to this surgical tool stays fixed during the whole intervention. Because markers are situated in discrete places and solid deformations cannot be perceived correctly, this approach is suitable if restrictions due to markers, such as full time visibility, precision of placement, etc., are not an issue. On the other hand, another class of methods performs the registration right on the virtual world and is sensibly more complex than the first one. This approach takes into account 3D perceptions of the environment and the accuracy of the method depends on the technology used to acquire the 3D scene. For instance, with a laser scanning method described in [Grimson, W.E.L., Lozano-Perez, T., Wells, W.M. Ill, Ettinger, G.J., White, S.J.,Kikinis, R: An Automatic Registration Method for Frameless Stereotaxy, Image Guided Surgery, and Enhanced Reality Visualization. Transactions on Medical Imaging (1996)] [Mellor, J.P.: Enhanced Reality Visualization in a Surgical Environment. Massachusetts Institute of Technology, Artificial Intelligence Laboratory, Technical Report N. 1544 available at www.ai.mit.edu/projects/medical-vision/ (1995)], the precision is so high that the discrepancy between real model and points obtained from the real scene is less than 2mm. On the other hand this method has two main drawbacks. The high price of a laser scan camera and the time needed between each image acquisition, which is actually about 1 second per image. Recent efforts, using stereovision algorithms, have explored the use of features to track objects in the world. For instance, [Najafi, H., Klinker, G. -.Model-Based Tracking with Stereovision for AR. The Second IEEE and ACM International Symposium on Mixed and Augmented Reality -ISMAR'03 (2003) 313-314] described the combined use of a 3D model with stereoscopic analysis and allows accurate pose estimation in the presence of partial occlusions by non-rigid objects like the hands of the user. These methods all have a common structure, a perception of the environment, and virtual images are obtained and afterwards a coregistration allows to know where these synthetic images may be positioned in the environment. 1. Perceptions of the environment

The environment may be perceived in various ways, the adequate method for an application will depend on the use, the budget, the resolution and the desired speed.

a. Global perception I) Laser scanning.

This method allows to obtain a very good resolution (less than mm resolution, even down to a hundredth of mm). There is research on co-registrations using this type of method in Harvard University and in MIT. The disadvantage compared to cameras is the time needed between each image acquisition although nowadays this can be reduced to about one second per image. II) Infrared scanning.

The goal of this technology is to offer a compromise which achieves a very low resolution in real time. As for the laser scanning big improvements are expected in the coming years. III) 3D perception on the basis of the "eyes" of the user.

This is the method allows to achieve the augmented reality without the need of using the reference of the environment but directly onto one of the user. A 3D surface of the environment is deduced from two or more views coming from two or more cameras carried by the user. This technique is rapid and cheap but is limited by problems linked to stereovision (occlusions,...). Even if stereovision is an impossible problem, one can never give a guarantee for a perfect result for all the environments under all conditions. Various companies have developed software allowing to achieve good results at the specific conditions. The libraries of Point Grey are an example thereof. IV) Fixed 3D perception.

The advantage here is that the user does not have to carry the cameras with him. However there may be occlusions in the vision of the 3D reconstruction. The positioning of the eyes of the user will then have to be found by another means. The cameras will be put in places judged to be strategic and which will optimise the 3 dimensional reconstruction. In this case if the positioning of the eyes of the user may be found in the precise way transparent spectacles may be used. However, this problem may be solved in adding two cameras on the user. If this were to be used in an operating theatre there would be a series of cameras fixed around the surgery table permitting to reconstitute a 3 dimensional surface.

b . Perception of precise points . Here one tries to position certain points which are the same on the synthetic image and in the environment. These points may be physical markers placed in a location known in the environment and for which one knows the correspondence in the synthetic image. A large range of different markers have been used. Having less points has the advantage of needing less computation for the coregistration. However, this limited number of points results in limitations as to the use of the perception of the environment. Hence, a coregistration on the formed solid object becomes far more complicated. This type of methods is to be considered only if the small number of points is not a disturbance and if having less points offers an advantage (computation time, accuracy achieved more easily on a small number of points ...). These points may also be features (contours, specific gradients, corners, ...) which are detected in the environment and on the synthetic image. If one is not sure that those characteristic points correspond to the same thing in the reality and in the virtual world, this will complicate the coregistration. There is obviously no restriction as to the frequency spectrum used for performing this perception. For example, in certain cases infrared cameras may be complementary or even more appropriate. Moreover, as each method has its own advantages it will often be useful to combine several of them in order to obtain a more complete solution.

2. Coregistration. This step consists in positioning the virtual images in or onto the environment. In the case of a global perception of the scene we distinguish achieving this by scanning, by stereovision and by infrared. In the case of a cloud of points obtained by laser scanning studies have been carried out at MIT and at the Harvard Medical School. At MIT the accuracy is such that the majority of points obtained by scanning are at less than two mm from the model. The coregistration points are superimposed on the model of the skin in order to check the quality of the registration. They can be shown as lines of dots. These dots may be colour coded at different distances from the model, e.g. the distance of 0 mm of the model, yellow at 1 mm and red at 5 mm. This system has been used in the Brigham and Women's Hospital on more than 200 cases of neurosurgery and is used in average for one to two cases per week. It allows to achieve great spatial position which interferes only a little in the normal operation procedures. It allows to reduce the costs by 1000 to 5000 dollars per case as the surgeon can perform the surgery more rapidly while remaining confident. No coregistration has been performed on the images obtained by infrared and by stereovision. In the case of a marker, one will have to determine the spatial position of the synthetic image in such a way that its markers will be closer to the ones in the environment. This is currently achievable for example by using the libraries ARToolKit7, (the Advanced Rendering Toolkit). This is a collection of libraries in C object allowing to perform a whole series of graphical applications and, amongst others, the coregistration on markers. These libraries are under license GNU.

SUMMARY OF THE INVENTION An object of the present invention is to provide improved augmented reality systems and methods, especially those which are more economical to implement. This present invention presents a markerless technique and apparatus to merge 3D virtual models onto real scenes without the need for tracking markers. Neither landmark markers nor projected structured light are required. Real scenes are acquired using a stereovision technique and a region of interest in the real scene is selected. The selection can be, for instance by means of applying colour and/or spatial filters. The registration algorithm operates by minimizing a distance such as the mean square distance between points obtained from the stereovision and a surface from a virtual model of the object to be viewed. An advantage of the present invention is that it is suitable for applications requiring focused user's attention and high accuracy in the registration process. In a further embodiment the approach is integrated into a multi-platform framework for visualization and processing of images such as medical images. The achieved spatial accuracy (e.g. rotation and translation) and time performance are suitable for real time applications such as surgery. An advantage of the present invention is that it provides a low-cost markerless approach for augmented reality systems. Instead of using fiducial markers or feature tracking, the present invention makes use of preliminary knowledge of the scene to ber viewed. This solution is cost effective as it is targeted for any stereo camera providing a pair of digital pictures of a real scene. The same camera can also be used to project the final augmented scene, e.g. in a head-mounted display. It can be considered a low-cost solution when compared to laser scanning techniques for image acquisition. The method uses off-the-shelf stereo, cameras, virtual reality goggles and MRI data sets to build up the 3D model considering the medical case. The present invention provides a system for augmenting a user's view of real- world objects to provide a combined augmented reality image comprising: a display device for displaying the combined image, a means for obtaining a first segmented image of the surface of the object from a volumetric image of the object, a stereo camera system for capturing stereo images of the object, a processor for generating a cloud of points form the stereo images and for colour or spatial filtering of the cloud of points, the processor being adapted to register the filtered cloud of points with the first segmented image and to display the combined image. The present invention also provides a method for augmenting a user's view of real- world objects to provide a combined augmented reality image comprising: obtaining a first segmented image of the surface of the object from a volumetric image of the object, capturing stereo images of the object, generating a cloud of points form the stereo images, colour or spatial filtering of the cloud of points, registering the filtered cloud of points with the first segmented image, and displaying the combined image. The present invention provides a computer program product which when executed on a processing engine augments a user's view of real- world objects to provide a combined augmented reality image the computer program product comprising code for: obtaining a first segmented image of the surface of the object from a volumetric image of the object, capturing stereo images of the object, generating a cloud of points form the stereo images, colour or spatial filtering of the cloud of points, registering the filtered cloud of points with the first segmented image, and displaying the combined image. The computer program product may be stored on a machine readable medium such as an optical disk (CD-ROM, DVD-ROM), a magnetic tape or a magnetic disk, for example. The present invention also provides a combined digital image or video image generated by any of the methods of the present invention. The present invention is suitable for indoor Augmented Reality (AR) applications with a possible preliminary modelling of the objects of interest to be seen. In particular, it is appropriate for applications where focused attention, e.g. distances between 50 and 100 cm, and high accuracy are required, e.g. surgery. In a further embodiment the present invention has been integrated into a medical framework which provides a rich set of tools for visualization and manipulation of 2D and 3D medical data sets, e.g. by using VTK and ITK libraries. The present invention will now be described with reference to the following drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

Fig. 1 shows a schematic representation of a computer arrangement of according to an embodiment of the present invention.

Fig. 2 is a schematic flow diagram of a method in accordance with an embodiment of the present invention, e.g. for an AR system for a medical application.

Fig. 3 shows the results of the method of Fig. 2.

Fig. 4 shows an image of a phantom head used in an example of the present invention. Fig. 5 shows steps in the acquisition of images in AR system in accordance with an embodiment of the present invention.

Fig. 6 shows internal information of a patient's head reconstructed from MRI data in accordance with an embodiment of the present invention.

Fig. 7 shows the axes of the stereo camera.

Fig. 8 shows an RGB spectral analysis of a stereo image, left to right red, green and blue components.

Fig. 9 shows a high pass filtering step as used in an embodiment of the present invention.

A white area stands for the accepted points and a black area for the rejected ones.

Fig. 10 shows a low pass filtering step as used in an embodiment of the present invention.

A white area stands for the rejected points and a black area for the accepted ones.

Fig. 11 - left figure shows a 2D model whereby elements of the surface are in black, right figure illustrates the distance map of this model.

Fig. 12 - left figure shows a cranium slice, right figure shows the corresponding distance map where light intensity represents distance, shorter distances are darker.

Fig. 13 shows final augmented real scene.

Fig. 14 shows a cube used for validation - left for rotation, right for translation.

Fig. 15 shows real (black) and computer (lighter grey) rotations of a cube.

Fig. 16 shows real (black) and computer (lighter grey) translations of a cube.

Fig. 17 shows stereo cameras coupled to a 3D stereo glasses and a gypsum phantom representing the patient.

DETAILED DESCRIPTION OF THE PRESENT INVENTION The present invention will be described with respect to particular embodiments and with reference to certain drawings but the invention is not limited thereto but only by the claims. Any reference signs in the claims shall not be construed as limiting the scope. The drawings described are only schematic and are non-limiting. In the drawings, the size of some of the elements may be exaggerated and not drawn on scale for illustrative purposes. Where the term "comprising" is used in the present description and claims, it does not exclude other elements or steps. Where an indefinite or definite article is used when referring to a singular noun e.g. "a" or "an", "the", this includes a plural of that noun unless something else is specifically stated. Furthermore, the terms first, second, third and the like in the description and in the claims, are used for distinguishing between similar elements and not necessarily for describing a sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the invention described herein are capable of operation in other sequences than described or illustrated herein. Moreover, the terms top, bottom, over, under and the like in the description and the claims are used for descriptive purposes and not necessarily for describing relative positions. It is to be understood that the terms so used are interchangeable under appropriate circumstances and that the embodiments of the invention described herein are capable of operation in other orientations than described or illustrated herein. Regarding the terms definition, the terms "digital" and "virtual" are used indiscriminately to refer to a world that is not "physical" or "real". The terms "real" and "physical" share the same meaning of being "not digital or virtual". The present invention provides a method and apparatus for performing augmented reality. This method consists in co-registering a virtual object with an image from one or several cameras directed towards the real object. For example, the real object is filmed by means of one or various cameras which permits to deduce an approximately three- dimensional image therefrom. The virtual model of the object can be obtained beforehand, e.g. from volumetric scans such as a CT-Scan of MRI. The present invention relates to an augmented reality application. For instance, augmented reality relates to an application wherein the normal vision of a person is enhanced by means of one or more further images. The additional image and the normal image are preferably co-registered. In an aspect of the present invention a surface-to- surface registration is performed with one surface extracted from the stereovision images and representing the real world and the other one from volumetric data, e.g. by thresholding of a volumetric scan, and representing the virtual world. The methods and apparatus of the present invention are adapted to perform the co-registration on the stereoscopic view itself. One of the inputs to a method or system in accordance with the present invention is from one or more cameras. The invention uses cameras for capturing the dynamic and noisy reality, not a static object, e.g. in real rime. The output is to an AR display, e.g. goggles. Furthermore, in accordance with the present invention not all images need be acquired prior to the AR display time. For example, in one aspect of the present invention one model is derived from a stereoscopic image whereby the stereoscopic image need not be previously acquired and can be obtained in real-time which is very different from a laser acquisition process, or PET, MRI, etc. The 3D points modelling the real time images are deduced from the stereovision images. The present invention is able to deal with noise in acquired images. The present invention uses a distance map which has noise after stereovision processing. The noise being random, a higher precision is usually obtained when working with more points. However, more points increases computational intensity. In another aspect of the present invention a filtering is done on the basis of the colour in the depth map. If faster results are required, the points of the stereovision image are subsampled. Alternatively or additionally, the images can be spatially filtered. As the coregistration is performed directly on the stereoscopic vision it the installation or projection of markers onto the object to be viewed is not required. Moreover, obtaining a three-dimensional image by means of a camera and not by means of laser scanning can provide a three-dimensional image in real time. Furthermore, the use of simple cameras makes the product much cheaper. Hence, the product allows performance of coregistration in any environment on which a model can be made. This is a cheap solution for performing the augmented reality in an environment for which a model can be made and, e.g. where the positioning of markers constitutes a disturbance or is not possible. Another aspect of the present invention is a medical intraoperative application although the present invention is not limited to this application. In particular the present invention provides an intraoperative tool allowing an augmented vision for the surgeon. In particular, the artificial reality goggles can be used for display. The present invention presents a low-cost markerless methods and apparatus for augmented reality systems. Instead of fiducial markers or feature tracking, the present invention relies, in one aspect, on preliminary knowledge of the scene to be viewed. This solution is cost effective as it is targeted for any stereo camera providing a pair of digital pictures of a real scene. The same camera (or the stored images from the camera) can also be used in the projection of the final augmented scene, e.g. in a head-mounted display. It can be considered a low-cost solution when compared to laser scanning techniques for image acquisition. The present invention is particularly suitable for indoor Augmented Reality (AR) applications with a possible preliminary modelling of the objects of interest to be seen. In particular, it is appropriate for applications where focused attention, e.g. distances between 50 and 100 cm, and high accuracy are required, e.g. surgery. In a further embodiment of the present invention the method and apparatus has been integrated into a medical framework which provides a rich set of tools for visualization and manipulation of 2D and 3D medical data sets by using VTK (Visualization ToolKit - www.vtk.org) and ITK (Insight Registration and Segmentation ToolKit - www.itk.orR) libraries. This framework is called Medical Studio (www.medicalstudio.org) and has been developed at the Communication and Remote Sensing Laboratory from UCL, Belgium (www.tele.ucl.ac.be). A system according to an embodiment of the present invention basically comprises three main parts:

1) methods, apparatus and software for real scene acquisition,

2) methods, apparatus and software for 3D mesh input and

3) methods, apparatus and software for output of an augmented view.

The great advantage to perform this integration of images into an augmented view is that functionalities such as 3D rendering, segmentation, model reconstruction from segmented volume data sets and easy visualization and manipulation of 2D and 3D datasets like zoom, translations, rotations, placing points, plan views, can be also applied to the augmented scene providing to an operative an enhanced work environment, e.g. providing a surgeon with a better understanding of complex and critical surgical procedures. A proposed system for use with the present invention is shown in Fig. 1. Fig. 1 is a schematic representation of a computing system which can be utilized with the methods and in a system according to the present invention. A computer 10 is depicted which may include a display such as AR goggles 14 and/or a video display terminal, a data input means such as a keyboard 16, and a graphic user interface indicating means such as a mouse 18. Computer 10 may be implemented as a general purpose computer, e.g. a UNIX workstation or a Personal Computer. Computer 10 includes a Central Processing Unit ("CPU") 15, such as a conventional microprocessor of which a Pentium IV processor supplied by Intel Corp. USA is only an example, and a number of other units interconnected via system bus 22 (which may include a hierarchy of busses). The computer 10 includes at least one memory. Memory may include any of a variety of data storage devices known to the skilled person such as random-access memory ("RAM"), read-only memory ("ROM"), non-volatile read/write memory such as a hard disc as known to the skilled person. For example, computer 10 may further include random-access memory ("RAM") 24, readonly memory ("ROM") 26, as well as an optional display adapter 27 for connecting system bus 22 to an optional video display terminal 14 such as AR goggles, and an optional input/output (I/O) adapter 29 for connecting peripheral devices (e.g., solid state, disk and/or tape drives 23) to system bus 22. Computer 10 further includes user interface adapter 19 for connecting a keyboard 16, mouse 18, optional speaker 36, as well as allowing stereovision inputs, e.g. from a stereo camera system 20 and optional camera controllers 40 and/or network cards. In addition there is a system 21 for capturing volumetric data, e.g. MRI or CT- Scan data as well as a controller 41 therefor. Data transfer can be allowed via a network 39, such as a LAN or WAN, connected to bus 22 via a communications adapter 39. The adapter 39 may also connect computer 10 to a data network such as the Internet. This allows transmission of volumetric data over a telecommunications network, e.g. entering the volumetric data at a near location and transmitting it to a remote location, e.g. via the Internet, where a processor carries out a method in accordance with the present invention. The present invention also includes within its scope that the relevant volumetric data are input directly into the computer 10 using the keyboard 16 or from storage devices such as 23. Computer 10 also includes a graphical user interface that resides within machine- readable media to direct the operation of computer 10. Any suitable machine-readable media may retain the graphical user interface, such as a random access memory (RAM) 24, a read-only memory (ROM) 26, a magnetic diskette, magnetic tape, or optical disk (the last three being located in disk and tape drives 23). Any suitable operating system and associated graphical user interface (e.g., Microsoft Windows) may direct CPU 15. In addition, computer 10 includes a control program 51 which resides within computer memory storage 52. Control program 51 contains instructions that when executed on CPU 15 carry out operations supporting any of the methods of the present invention. Those skilled in the art will appreciate that the hardware represented in Fig. 1 may vary for specific applications. For example, other peripheral devices such as optical disk media, audio adapters, or chip programming devices, such as PAL or EPROM programming devices well-known in the art of computer hardware, and the like may be utilized in addition to or in place of the hardware already described. In the example depicted in Fig. 1, the computer program product for carrying out any of the methods of the present invention may be stored in any of the above mentioned memories. However, it is important that while the present invention has been, and will continue to be, that those skilled in the art will appreciate that the mechanisms of the present invention are capable of being distributed as a program product in a variety of forms, and that the present invention applies equally regardless of the particular type of signal bearing media used to actually carry out the distribution. Examples of computer readable signal bearing media include: recordable type media such as floppy disks and CD ROMs and transmission type media such as digital and analogue communication links. A suitable hardware setup described above comprises the following components:

• stereoscopic goggles;

• stereo video camera, e.g. a Bumblebee camera from Point Grey Research [RESEARCH P. G.: Bumblebee camera, digiclops and tri clops libraries, http://www.ptgrey.com (2003). 2,4] and

• a computer system. The augmented scene can be displayed in the AR video-based goggles. These goggles are opaque, but as the images captured by the stereo cameras are placed right in front of them, they become virtually transparent. In the optimal (but optional) case the distance between the camera lenses should correspond to the distance between the screens of the AR goggles and the distance between the operator's eyes. The real world image is captured using two or more cameras in a stereoscopic arrangement. In one embodiment of the process of image acquisition according to the present invention is performed with PGR libraries for stereovision computation [RESEARCH P. G.: Bumblebee camera, digiclops and triclops libraries. http://www.ptgrey.com (2003). 2,4]. The 3D reconstruction as well as the image processing and registration procedure are performed with Visualization ToolKit (VTK) algorithms) [SCHROEDER W. J. (Ed.): The. visualization Toolkit User's Guide. Kitware inc., 2001. 2]. The algorithm flowchart of Fig. 2 schematically shows a method in accordance with an embodiment of the present invention. The algorithm is shown pictorially in Fig. 3 showing the results of each related step described in Fig. 2. The method has two distinct phases of processing:

a) off-line and b) real time (on-line) or optionally off-line.

Off-line processing involves the following steps:

(i) initialization of the stereo camera parameters;

(ii) initialization of the filters (spatial and/or colour) parameters;

(iii) 3D mesh reconstruction; and

(iv) basic transformations applied to the mesh and distance map computation.

Real time processing (optionally off-line processing) involves the following steps: (i) stereovision algorithm computation; (ii) points filtering; (iii) registration algorithm;

(iv) inversion of the transform matrix and

(v) rendering the AR scene.

In step 110 (Fig. 3) the stereo images are acquired. In step 100 (Fig. 2) the images from the stereo camera are processed to generate a graphics image in a suitable format, e.g. jpg images. Also in step 100 the stereo pictures (i.e. .jpg file showed in Fig. 2) provided by the stereo camera are processed through stereovision algorithms in order to obtain a cloud of points of the real scene (i.e. .pts file in Fig. 2) in step 113 (Fig. 3). For this, any suitable libraries may be used. PGR libraries provided with the Bumblebee camera. The initial images are kept, to be put back in the final augmented scene. As a result of this first step, a cloud of points has been obtained that may be transformed into a VTK data format in step 101 (Fig. 2). Once the transformation is done, these points are filtered in step 102 using colour and/or spatial filters to only select a region of interest in the image that will be used as one of the inputs to the registration algorithm. By selecting a spatial or colour region reduces the computational intensity of the registration procedure. The other input data for the registration algorithm is a reference 3D mesh representation (see Fig. 2) stored in the computer of the virtual model. The mesh is obtained from volumetric data by a suitable segmentation scheme, e.g. by first capturing MRI or CT-Scan images (Fig. 3, step 111) and then segmenting them to obtain a virtual model, e.g. by thresholding, Marching Cubes. Preferably, basic transformations (e.g. rotation and translation) are applied in step 103 (Fig. 2) to this mesh to guarantee that distance between the real world and the virtual model cannot exceed a 30° difference. This can help to prevent loss of registration as the images change. This is not a big problem considering that it is commonly accepted that an AR tool should be initialized, For instance, the surgeon will first have to look at his patient right in the eyes. After that initialization, the algorithm computes the matching of a present image from the last match of images and no more initializations are required. The next operation is the conversion of the 3D mesh representation of the virtual model into a distance map in step 104. This is done by taking the volumetric data and by first rasterizing the model surface. Then a distance transform is applied such as a Euclidean distance transform. The basic data used will typically be volumetric data and the surface of the object, e.g. the skin contours are obtained by a suitable algorithm, e.g. thresholding, Marching Cubes. Starting from a distance map from step 104 and filtered points from step 102, the registration algorithm minimizes a distance between these two sets of data in step 105. In an embodiment of the present invention the mean square distance between the points obtained from the camera images and the virtual model's surface obtained from the volumetric data. As a result a 4x4 matrix is obtained in step 106, the matrix representing the rigid transformation which matches, in the best way, the cloud of points onto the distance map.

In an embodiment of the present invention the virtual model is applied to the real images (although the present invention is not limited thereto). In step 114 of Fig. 3 filtered points of the virtual model are applied to the originally stored stereovision images. The inverse of the 4x4 matrix is applied to the reference mesh in step 106. At this point there is a match between the real object and its virtual model. In step 107 (Fig. 2, step 115, Fig. 3) augmented reality is performed by superimposing the virtual model on the initial images coming from the stereo camera. This is done using camera parameter information and stereo processing infomiation such as focal length and distances of the lenses, distance of the real object from the centre of the cameras, etc. In such a medical application the skin image generated from the volumetric data can be peeled back to reveal the internal structures located relative to the viewpoint of the camera. Thus the surgeon has "x-ray vision" useful for minimally invasive surgery. The present invention is not limited to applying the virtual to the real. An example of the apparatus and methods of the present invention will mainly be described in the following with respect to a craniofacial surgery tool and a situation close to a surgical room but the present invention is not limited thereto. Purely for demonstration purposes a gypsum "phantom" head was made of a patient's head. Such a phantom head can be made by printing a 3D model reconstructed from volumetric scans of a patient's head, e.g. reconstructed from MRI data sets of a patient (see Fig. 4 for an example generated from segmenting MRI data). The phantom head was put on a dark grey-blue sheet. A patient is typically covered with a blue or green sheet except where the surgery is to take place. Fig. 5 shows in the two top windows left and right views acquired with a stereo camera. The left bottom windows shows the cloud of points acquired and filtered in accordance with the present invention in real time. The right bottom windows shows the cloud of points filtered and correctly registered to the model 3D. The reference 3D mesh representation that will be used for the distance map creation must be as accurate as possible. When the volumetric data is obtained from a human patient, the relevant surface is the skin. The skin is segmented from the volumetric dataset, e.g. the MRI dataset or a CT Scan dataset. To do this any suitable algorithm may be used, e.g. the Marching Cubes algorithm [LORENSEN W., CLINE H.: Marching cubes, a high resolution 3D surface constmction algorithm. Computer Graphics (1997), 163-169. 4], which will construct a 3D surface from the MRI images given the intensity of the skin as an input parameter. An alternative algorithm could be a thresholding segmentation as described in the book by Sueten, "Fundamentals of Medical Imaging" Cambridge University Press, 2001. The algorithm should be carefully applied as each structure within the head with the same intensity as the skin will also be reconstructed. This could be a problem for the registration algorithm as these structures will also be on the distance map and could drive the algorithm into local minimum. Hence the segmented image should be cleaned up to remove spurious features within the volume of the model. For example a region growing algorithm could be applied to only get the skin reconstructed. The internal volume of the object to be viewed is preferably also segmented to show the internal structures which will be of interest to the operator, e.g. surgeon. For example internal brain structures can be derived. This can be done by any suitable algorithm such as Marching Cubes or a thresholding technique. For example an automatic atlas-based segmentation method [DHaese, P.F., Bois d' Aische, A., Merchan, Th. E., Macq, B., Li, R., Dawant, B.: Automatic Segmentation of Brain Structures for Radiation Therapy Planning. SPIE Medical Image Processing (2003)] is implemented in Medical Studio. These internal structures will not be used for registration but in the final display of the AR image. Fig. 6 illustrates the reconstructed 3D model highlighting some internal structures of the cranium (e.g. the segmented brain). Stereovision has already been extensively covered in many papers [SCHARSTEIN D., SZELISKI R.: A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV 47(1/2/3) (June 2002), 7-42. 4, SCHARSTEIN D., SZELISKI R.: High- accuracy stereo depth maps using stmctured light. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2003),Madison, WI l(June 2003), 195- 202. 4] . Methods of the present invention are focused on AR and not 3D recovery as such. Any suitable library can be used such as the Point Grey Research [RESEARCH P. G.: Bumblebee camera, digiclops and triclops libraries, http://www.ptgrey.com (2003). 2,4] Digiclops and Triclops libraries, for they were supplied with the Bumblebee stereo camera. Such libraries supply functions permitting computer vision applications. They are in a standard C/C++ interface and some have stereovision parameters, e.g. disparity range, matching windows size, etc. Camera calibration is part of the Digiclops library. A major characteristic (and drawback) of stereovision is its intrinsic lack of accuracy. Because of occlusions and imperfections of stereovision algorithms the output (e.g. 3D image, depth map, cloud of points) is usually "noisy". Moreover, the cloud of points after stereovision processing describes the whole scene captured and is consequently not so suitable for registration with a virtual model corresponding to a small part of it. Selecting the points belonging to the object of interest is of advantage. Two main assumptions regarding the real scene are considered:

1. Large parts of the background have a uniform colour or are of a colour not found in the object of interest.

2. The object of interest's position does not change abruptly from one frame to the other.

A further useful restriction is that the operator (e.g. surgeon) needs accuracy when the real object is at the centre of his sight. For this reason there is a spatial limitation on the relevant part of the images. To suppress points not belonging to the object (referred as 'noise') a filter based on colour and/or space assumptions is applied. These assumptions can be easily deduced from a preliminary observation. For instance during surgery where the patient body is "wrapped" in a green (or blue) sheet points in the captured image having high green (or blue) colour content can be deleted. The criteria for filtering (e.g. colour and distance or position) are determined in each of the environments where AR is to be performed on a scene-by-scene basis. In the case of the present example, the operation is carried out in an indoor space where light conditions are constant in time. After an initial initialization, the rest of the method is fully automatic. Each point of the cloud computed by stereovision algorithms is composed of 3 coordinates and 3 RGB colour intensities. The 3 axis coordinates are the point's position (in meters) in space regarding to the centre of the stereo camera along the three axis of the Bumblebee camera (Fig. 7). To identify areas of the image which are not interesting any suitable image analysis method can be used, e.g. regional colour coding or a colour histogram. For example, a "spectral" analysis highlights the differences in colour distribution. Fig. 8 shows three figures, each figure corresponds to, from left to right, the Red, Green and Blue colours. The horizontal axis represents the intensity (from 0 to 255) and the vertical axis is the number of points with that intensity. Peaks corresponding to uniform intensities of light can be detected. Those peaks correspond to the first assumption that large parts of the background have a uniform colour. It is thus possible to isolate pixels with regard to their intensity as well as their dominant colour. For instance, the dark blue sheet of Fig. 5 is responsible of the first peak of all the three drawings of Fig. 8. This "spectral" analysis permits to determine the criteria of a band pass colour filter. Band pass filters are composed of high pass and low pass filters combined. The high pass filter criteria are determined in order to filter dark backgrounds of blue dominance (the sheet in the example is grey-blue). As every colour has RGB elements, a threshold is set for each RGB component to augment filter reliability. Fig. 9 and Fig. 10 illustrate on one of the two stereo pictures the concept of colour filtering. Fig. 9 illustrates the high pass filter red component. Comparison of the two other colours (not shown here) shows the head has a dominant red colour. The low pass filter criteria are set to filter light blue sheet standing in front of the cranium. Fig. 10 illustrates the low pass filter blue component. A comparison with the two other elements (not shown here) shows clearly the sheet blue dominance. A spatial filtering step useful in the present invention is built on observations in the relevant real space, e.g. a surgical room. The surgeon space of major interest is determined by observing the surgeon when performing his surgical acts. For example, a surgeon typically stands between 10 and 80cm of the patient's head which is at the centre of his sight during surgical acts. For this reason the spatial filtering suppresses points with a z co-ordinate outside the domain [0.1,0.8]. The registration algorithm minimizes a distance between the virtual model and the real world model. For example, a mean square distance is minimised between the points derived from the stereovision camera and the surface representation derived from segmentation of volumetric data, e.g. MRI segmentation. For a given set of digitized points M and an object surface S, the best transform () is: T = argmin ( ∑ d_s(T(p)) Eq. (1) peM d_s(q) = min (d (q,s) | s e S) Eq. (2)

The surface of the volumetric data is rasterized into a 3D image. Equation 2 can be efficiently pre-computed for all points q using a suitable distance transform such as a Euclidean distance transform (EDT). For this purpose, in craniofacial surgery the segmented scalp surface is rasterized into a 3D image based on the volumetric patient data. Then, a suitable EDT is applied, e.g. an implementation of Saito's EDT [SAITO T., TORIWAKI, J.: New algorithms for Euclidian distance transformations of an n- dimensional digitized picture with applications. Pattern Recognition 27 (Nov. 1994), 1551-1565] found in [CUISENAIRE 0.: Distance transformation, fast algorithms and applications to medical image processing. Ph.D. dissertation, Universite Catholique de Louvain, B-1348 Louvain-la-Neuve. Belgium. (Oct. 1999)]. It has a O(N^4/3) complexity, N being the number of points. When q does not fall exactly on the image grid, the value of ds(q) is interpolated. Equation 1 is then easily solved, as the sum over only M usually involves a few hundred terms and ds(q) is available in the form of a lookup table (see Fig. 11 and Fig. 12). Given the low computational cost of the matching criterion, any suitable minimization technique can be applied. For example, an iterative scheme similar to Powell's [POWELL M. I. D.: An efficient method for finding the minimum of a function of several variables without calculating derivatives. The Computer Journal (1964), 155- 162] can be used to compute the minimum. Convergence properties are particularly good thanks to the use of an exact Euclidean distance map, as opposed to the often-used chamfer metric. Such a scheme sometimes only converges to a local optimum, and so a proper initial transform should be used. This can be obtained either by full search at a coarse resolution of translations and rotations, or by using a standard setup during data acquisition. In the present example, the first pictures of the real scene were known in advance. TQ is the transform symbolized by the 4x4 transform matrix:

where the 3x3 submatrix R symbolizes rotations and the vector T symbolizes the translation. This transformation is applied to the cloud of points obtained from the camera images to put them on the model surface in step 105. As the virtual image is to be placed onto the real image in this embodiment, T () is inverted and applied to the model surface. Fig. 13 shows the cloud of points filtered from the real scene and the model surface after transformation T^'!(). In the background you can see one of the 2D initial images from the stereo camera. To put the 2D image at the right 3D place compared to the cloud of points and model surface information is used such as:

1. Distance of the object from the cameras. This information is given by the coordinates of the filtered cloud of points. Each point coordinate corresponds to the position in meters from the centre of the cameras.

2. Focal length of the camera lenses.

The augmented view can be improved by eliminating some noise (e.g. which comes from the segmentation process) in the reference 3D model. An indirect method has been used to determine the precision of the method according to the present invention. The precision of display is hard to determine, unless there are several operators with AR goggles and real-time temporal resolution. The precision of the registration algorithm has already been determined for instance in [NOIRHOMME Q., ROMERO E., CUISENAIRE 0., FERRANT M., VANDERMEER EN Y., OLIVIER E., MACQ B.: Registration of transcranial magnetic stimulation, a visualization tool for brain functions. Proceedings of DSP 2002, 14th International Conference on Digital Signal Processing (July 2002), 311-314] and [CUISENAIRE 0., FERRANT M., VANDER- MEEREN Y., OLIVIER R, MACQ B.: Registration and visualization of transcranial magnetic stimulation on magnetic resonance images. Lecture Notes in Computer Science, The Fourth International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2001) 2208 (Oct. 2001)]. As the cloud of points is noisy the algorithm precision cannot be computed from the mean square distance of the points and the surface - this merely measure the inaccuracies in the image acquisition technologies. Real transformations applied to the real object (i.e. rotations and translations) should be measured and a determination made as to whether the transformations applied to the model surface match them. The absolute precision of the validation is limited by the hand precision of the operators having conducted the tests (estimated to be 1 mm in translation and 1° rotation). A series of tests have been performed to determine the spatial precision in translation and rotation as well as the computing time of the method. The experimental set up is shown in Fig. 14. Different parameters of the registration algorithm were tested in order to determine which is the most promising to improve the registration algorithm computing time. In accordance with an aspect of the present invention a promising parameter is the number of points to register. The following Table shows influence of subsampling on accuracy, computing time linearly depends on the number of points. A subsampling of 400 corresponds to 20 points and a computing time of 10 seconds.

Table: Influence of subsampling on accuracy

For the spatial precision tests the following was considered:

1. A cube in rotation (20 pictures of 1 degree rotations from -20 to +17° ) - see Fig. 15. 2. A cube in translation (20 pictures of 1 cm translations from -5cm to +15 cm) - see Fig. 16. The goal was to emulate the movement of the operator looking at the real object and treat it as real-time data. The Bumblebee camera was statically fixed on a tripod, at 70 cm from the object and at a 45° horizontal angle. A raler was disposed under the object to assure the maximum precision the human hand can reach. (Fig. 14). The mean square error found for rotation was 0.7°, and the mean square error found in translation is 1 cm. Two remarks are important here:

1. Concerning the translations, if the object of attention is at the centre of the sight, the precision is 1 mm whereas at the edge of the sight the registration is less accurate (clearly visible on Fig. 16). As an operator using AR technology does not need accuracy unless the object of his/her attention is at the centre of his/her sight, the relevant precision in translation is 1 mm.

2. The precision determined corresponds to the precision of the human hand displacing the object. The Augmented Reality systems without using fiducial markers according to the present invention can be carried out using off-the-shelf stereo cameras and virtual reality goggles. Considering a medical application, volumetric data is captured by well-known techniques, e.g. MRI, to generate data sets from which to build up the 3D model. The present invention merges the following aspects:

1. Real scene acquisition is performed using the stereovision technique. It guarantees the low-cost solution regarding to the others solutions such as laser scanning.

2. The selection of a smaller region of interest in the real scene is performed applying colour and/or spatial filters. This contributes to eliminating the noise in the acquired image and speeds up the computation time of the registration algorithm.

3. The registration algorithm is performed by minimizing a distance such as a mean square distance between the points derived from stereovision and a surface derived from the virtual model. The following problems may be encountered when implementing methods according to the present invention:

• The registration algorithm falling into local minima. A potential reason for that is the noise encountered in the virtual model. One solution is a better 3D reconstruction can be performed to eliminate undesired meshes inside the object. For example the insides of the patient 3D image should be cleaned up to leave only the skin surface.

• Cloud of points for more complex objects are noisier than those of simple objects. Some potential reasons for that are the texture and the brightness of the object surface which is too shiny. Some texture, e.g. the human skin will improve this. The present invention includes within its scope:

1. Supporting all method under the same platform, e.g. stereovision algorithm for a suitable operating system such as Linux.

2. Selecting not only with colour and/or spatial filters but also with texture.

_3. A dynamic registration for deformable objects using finite elements. For instance in a maxillo-facial surgery there are some structures that can be moved during the surgery acts.

4. The bounding box can be automatically positioned around the object of attention in order to reduce noise influence down to zero, e.g. by pattern recognition algorithms.

5. Improving registration results for more complex objects, by enhancing cloud of points reliability as well as protecting the algorithm from falling in local minima.

6. A user interface to interactively manipulate the registered virtual objects. Fig. 17 shows an arrangement in accordance with an embodiment of the present invention. The operator wears AR goggles with a stereo camera mounted in front thereof. The augmented image is displayed within the goggles. By using a fixed relationship between the camera to the goggles and by using the stereo images captured by the camera as in input to the combined real- world and virtual-world images, maintaining registration becomes easier.

Claims

1. A system for augmenting a user's view of real- world objects to provide a combined augmented reality image comprising: a display device for displaying the combined image, a means for obtaining a first segmented image of the surface of the object from a volumetric image of the object, a stereo camera system for capturing stereo images of the object, a processor for generating a cloud of points form the stereo images and for colour or spatial filtering of the cloud of points, the processor being adapted to register the filtered cloud of points with the first segmented image and to display the combined image.

2. The system according to claim 1, wherein the means for obtaining the first segmented image includes means for generating a distance map.

3. The system according to claim 1 or 2 wherein the processor is adapted to apply a spatial transform derived from the registration to a second segmented image of the volumetric data to thereby display the combined image.

4. The system according to according to any of the previous claim, wherein the processor is adapted to store the captured stereo images and to use these images in the display of the combined images.

5. The system according to any of the previous claims, further comprising goggles for displaying the combined image.

6. The system according to any of the previous claims, wherein the processor is adapted to generate a spectral analysis of the cloud of points.

7. The system according to claim 6, wherein the processor is adapted to eliminate points which belong to a set of points of a uniform colour.

8. The system according to any previous claim, wherein the processor is adapted to apply a band pass colour filter to thereby generate the filtered cloud of points.

9. A method for augmenting a user's view of real-world objects to provide a combined augmented reality image comprising: obtaining a first segmented image of the surface of the object from a volumetric image of the object, capturing stereo images of the object, generating a cloud of points form the stereo images, colour or spatial filtering of the cloud of points, registering the filtered cloud of points with the first segmented image, and displaying the combined image.

10. The method according to claim 9, wherein obtaining the first segmented image includes generating a distance map.

11. The method according to claim 9 or 10, further comprising applying a spatial transform derived from the registration step to a second segmented image of the volumetric data to thereby display the combined image.

12. The method according to according to any of the claims 9 to 11 , further comprising storing the captured stereo images and using these images in the display of the combined images.

13. The method according to any of the claims 9 to 12, further comprising generating a spectral analysis of the cloud of points.

14. The method according to claim 13, further comprising eliminating points which belong to a set of points of a uniform colour.

15. The method according to any of the claims 9 to 14, further comprising applying a band pass colour filter to thereby generate the filtered cloud of points.

16. A computer program product which when executed on a processing engine augments a user's view of real-world objects to provide a combined augmented reality image the computer program product comprising code for: obtaining a first segmented image of the suiface of the object from a volumetric image of the object, capturing stereo images of the object, generating a cloud of points form the stereo images, colour or spatial filtering of the cloud of points, registering the filtered cloud of points with the first segmented image, and displaying the combined image.

17. The computer program product according to claim 16, wherein obtaining the first segmented image includes generating a distance map.

18. The computer program product according to claim 16 or 17, further comprising code for applying a spatial transform derived from the registration step to a second segmented image of the volumetric data to thereby display the combined image.

19. The computer program product according to according to any of the claims 16 to 18, further comprising code for storing the captured stereo images and using these images in the display of the combined images.

20. The computer program product according to any of the claims 16 to 19, further comprising code generating a spectral analysis of the cloud of points.

21. The computer program product according to claim 20, further comprising code for eliminating points which belong to a set of points of a uniform colour.

22. The computer program product according to any of the claims 16 to 21, further comprising code for applying a band pass colour filter to thereby generate the filtered cloud of points.

23. A machine readable storage medium for storing a computer program product according to any of claims 16 to 22.

24. A combined digital image or video image generated by the method according to any of the claims 9 to 15.