WO2011029209A2

WO2011029209A2 - Method and apparatus for generating and processing depth-enhanced images

Info

Publication number: WO2011029209A2
Application number: PCT/CH2010/000218
Authority: WO
Inventors: Christoph Niederberger; Stephan Würmlin Stadler; Richard Keiser; Remo Ziegler; Marco Feriencik; Marcel Germann; Marcel Müller
Original assignee: Liberovision Ag
Priority date: 2009-09-10
Filing date: 2010-09-07
Publication date: 2011-03-17
Also published as: WO2011029209A3

Abstract

A method for generating and processing depth-enhanced images comprises the steps of - providing an image of a scene (10); - computing a depth-enhanced representation of the scene (10) comprising information on the relative position of scene elements in (real or virtual) 3D space, wherein each scene element (23) corresponds to a particular object (12) in the scene and to an image segment generated by observing the object with the physical camera; -inputting one or more annotation elements (24) and information on the relative position of the one or more annotation elements (24) with regard to the scene elements (23); - defining camera parameters of a viewing camera; - rendering, by means of a rendering processor (7), a rendered image as seen by the viewing camera, wherein the one or more annotation (24) elements are shown in a spatially consistent relation to the scene elements (23).

Description

METHOD AND APPARATUS FOR

GENERATING AND PROCESSING DEPTH-ENHANCED IMAGES

FIELD OF THE INVENTION

The invention relates to the field of digital image processing, and in particular to a method as described in the preamble of the independent claims.

BACKGROU N D OF THE INVENTION

An notations Methods

It is known to annotate scenes of sports events on television by manually painting on a still image of a scene, or by automatically inserting markers (e.g. touchdown line) that appear to be part of the scene, with correct location and perspective.

Annotating sports scenes for example on television can be done either

1. By drawing over the image in 2D (Fig. la).

2. By drawing using chromakeying or similar segmentation methods to distinguish between fore- and background (Fig lb).

3. By drawing in a calibrated camera "on the field" in perspectively correct 2D, and possibly using chromakeying or similar segmentation methods to distinguish between fore- and background (Fig. lc).

4. By drawing over the scene in perspectively correct "3D" (e.g. commercial content) (Fig. Id).

Output device methods

Until today, such content has been shown on conventional 2D displays. Some content has been created for 3D displays, mainly based on either stereoscopic recording (with 2 cameras) of a real scene, or artificial rendering from 3D computer graphics models.

For such displays, different methods exist: Stereoscopic displays: two views are generated (from where the viewer's eyes are expected) and

o combined into one image (red/blue or red/green) [Analglyph]

o polarized orthogonally over each other [Polarized]

o shown after each other and viewed with so called shutter glasses where the glasses shutter the view in sync with the display and the shown image

(left/right) [Time-multiplexing]

Autostereoscopic displays:

o Parallax barriers/illumination: visual medium partitions display image into different views

o Lenticular sheets: cylindrical vertical lenses refract light from display into the different views

Computer generated holography

o reproduction of the holographic light interference pattern

Comparison:

(cross talk describes the effect of one eye seeing a portion of the other eye's view)

SPECIFIC PROBLEM

Annotations / virtual elements

A full 3D immersive perception of annotations is not given with the above methods. Especially, for example, in the case when it is possible to move a virtual camera around a scene and generate a view as seen by the virtual camera, the 3D effect is imperfect due to wrong occlusion effects. Consequently, elements of the scene and annotation elements are not displayed in a consistent spatial relationship. Ideally, the virtual elements are inserted into a scene such that it looks as if they are actually a part of the scene. See the following two examples (Fig. 2a and 2b) of the desired rendering of spatial relationships. Here, as in the other examples, the cylindrical objects represent objects (e.g. players in a sports event) of the real or virtual scene, and the arrows are examples for annotation elements. The scene or at least one image of the scene is assumed to be provided, and the annotations are to be inserted in to the scene such that they appear in a realistic fashion. The annotation elements (typically modeled as 3D computer graphics objects) are perceived by a user, by means of a 2D or 3D display device, as 3D objects or as surfaces located in 3D space and being part of the scene:

Fig. 2a: The arrow is in front of the middle object and goes around the left object (partly in front, partly behind).

Fig. 2b The arrow passes in between the objects, in front of the middle one and behind the other ones.

With color segmentation, such effects are not possible since such methods have no knowledge about an ordering or the distance from the camera. But as both examples show, such information is absolutely necessary since there is no way to distinguish between the middle object and the other objects on the basis of only the color information. A workaround for this problem, in the context of sports annotation, where the 3D objects are players on a flat playing field, is to render annotation elements either onto the playing field (that is, behind all the players and in front of the playing field) or in the air above the player's heads (that is, in front of all the players and the playing field). However, it is not possible to add annotation elements at a perceived height of, say, between 0 and 2 meters, as in the above example.

Even more, with LiberoVision's 3D replay technology (as described in patent application PCT/CH2007/000265, filed 24.05.2007, which is hereby incorporated in its entirety by reference), this effect becomes even more valuable. The realistic integration of truly 3D annotation elements into the scene that change their shape when viewed (maybe only visually, that is, when perceived from a different viewpoint, while remaining unchanged geometrically) while moving the virtual camera around the scene is desired.

3D Display output Regarding the output, novel 3D displays create the feeling of a three-dimensional picture by implying a notion of 3D/depth to the viewer. Such displays are especially suited for the above described content.

Several formats exist for the current generation of images or image sequences sent to that display. These usually require either an image of the scene with some kind of depth information, or two images (stereoscopic), or the image including depth information for objects and an "empty" background image (that is, without foreground objects) and depth, or a full 3D notion of the scene.

Thus, in order to bring a scene, typically recorded with an ordinary sensing device such as, e.g. a TV camera, and in particular a sports scene, to such a display, the availability or the creation of additional information is required. These can be partially obtained by using specific devices (e.g. depth scanner) or by analyzing and interpreting the image itself. The recorded transmitted image or image sequence, according to the current state of the art does not provide that required additional information.

One object of the invention is to generate such additional information for a 3D display, in particular for known 3D display types as described in the above.

SUMMARY OF THE INVENTION

Two main aspects of the invention are described in the accompanying claims, corresponding to at least two distinct and independently realizable embodiments according to the respective independent claims.

According to a first main aspect of the invention, a method for generating and processing depth-enhanced images is provided, comprising the steps of

• providing, by means of a physical camera or a storage device, an image of a scene, the image being a still image or an image in a sequence of video images;

• computing, by means of a depth analysis processor, from the image a depth-enhanced representation of the scene, the depth-enhanced representation comprising information on the relative position of scene elements in (real or virtual) 3D space, wherein each scene element corresponds to a particular object in the scene and to an image segment generated by observing the object with the physical camera; • inputting, by means of an input device, one or more annotation elements and information on the relative position of the one or more annotation elements with regard to the scene elements;

• defining camera parameters (position, orientation and lens settings) of at least one viewing camera, the viewing camera parameters being identical to or an approximation to those of the physical camera, or being parameters of a virtual camera;

• rendering, by means of a rendering processor, at least one rendered image as seen by the at least one viewing camera, wherein the one or more annotation elements are shown in a spatially consistent relation to the scene elements.

In a preferred embodiment of the invention, the at least one rendered image is displayed on a 2D or 3D-display. Thereby, in the at least one rendered image, the inserted annotation elements appear, according to their location in 3D space, to lie behind or in front of the scene elements.

In a preferred embodiment of the invention, the depth-enhanced representation of the scene comprises, for each pixel of the entire image or of part of the image, distance ordering information and wherein the relative position of scene elements and annotation elements is expressed by their distance ordering information. Preferably, the distance ordering information is one of a distance to the viewing camera and a relative ordering according to distance to the viewing camera. The relative ordering of distance indicates which of two arbitrarily selected pixels is associated with an object that is closer to the virtual camera.

In a preferred embodiment of the invention, the depth-enhanced representation of the scene comprises a mapping of segments of the image onto scene elements of a 3D representation of the scene and the relative position of scene elements and annotation elements is expressed by their location in the 3D representation of the scene.

In yet another preferred embodiment of the invention, the method comprising the steps of:

• by means of one or more further observation devices or the storage device, providing further information; • when computing the depth-enhanced representation of the scene by means of the depth analysis processor, taking into account, in addition to the image, this further information;

• wherein the one or more further observation devices comprise one or more of distance measuring scanners, further physical cameras, position determining systems.

In another preferred embodiment of the invention, two or more viewing cameras are defined, and an image is rendered for each of the viewing cameras, resulting in a pair of stereoscopic images or a group of multiscopic images; and optionally in a video sequence of pairs or groups of images.

The camera calibration parameters for the viewing camera can be set to be identical to the calibration parameters of an existing camera, in particular the one that provided the original image. Alternatively, the parameters of the viewing camera can be modified interactively by a user, generating a virtual view.

In a further preferred embodiment of the invention, the method, in the step of inputting one or more annotation elements, comprises the step of computing a 3D intersection of these annotation elements with scene elements, detecting an intersection and optionally indicating to a user that an intersection has been detected, and/or optionally also automatically correcting the shape or the position of the annotation element such that no intersection occurs, e.g. by stretching an annotation element to pass around a scene element in the virtual 3D space.

In a further preferred embodiment of the invention, the step of inputting information on the relative position of the one or more annotation elements comprises the step of interpreting the position inputted by the input device as indicating a position on a ground plane in the scene.

In a further preferred embodiment of the invention, the step of inputting information on the relative position of the one or more annotation elements comprises the step of interpreting the position inputted by the input device as indicating a position on a plane parallel to and at a given height above a ground plane in the scene. Preferably, when indicating the position, the height above the ground plane is controllable by means of an additional input device or input parameter such as a scroll wheel, keyboard keys, pen pressure.

According to a second main aspect of the invention, a method for generating and processing depth-enhanced images is provided, comprising the steps of

• providing, by means of a physical camera or a storage device or by receiving a broadcast, an image of a scene, the image being a still image or an image in a sequence of video images;

• computing, by means of a depth analysis processor, from the image a depth-enhanced representation of the scene, the depth-enhanced representation comprising information on the relative position of scene elements in (real or virtual) 3D space, wherein each scene element corresponds to a particular object in the scene and to an image segment generated by observing the object with the physical camera;

• rendering and displaying the depth-enhanced representation of the scene on a 3D output device.

In a further preferred embodiment of the invention, the step of rendering and displaying the depth-enhanced representation of the scene comprises the steps of

• defining camera calibration parameters (position, orientation and lens settings) of two viewing cameras, the viewing camera parameters defining a pair of (virtual) stereoscopic cameras;

• rendering, by means of a rendering processor, two rendered stereoscopic images as seen by the two viewing cameras;

• displaying the two rendered stereoscopic images by means of the 3D-display.

In a further preferred embodiment of the invention, the method comprises the step of inputting, by means of an input device, one or more annotation elements and information on the relative position of the one or more annotation elements with regard to the scene elements; and, in the step of rendering and displaying the depth-enhanced representation of the scene, comprises including the annotation elements.

In a further preferred embodiment of the invention, the image is received through a broadcast, the broadcast further comprising at least one of camera calibration information and a color model, and comprising the steps of performing the subsequent computation, rendering and displaying of the depth-enhanced representation by a receiver of the broadcast, based on the image and at least one of the calibration information and the color model.

An apparatus for generating and processing depth-enhanced images, comprises a depth analysis processor and a rendering processor configured to perform the method steps of the method according to one of the preceding claims.

In preferred embodiment of the invention, the apparatus comprises a receiving unit configured to receive a broadcast, the broadcast comprising at least one image and further comprising at least one of calibration information and a color model, and the depth analysis processor being configured to compute, from the image and at least one of the calibration information and the color model, a depth-enhanced representation of the scene, the depth- enhanced representation comprising information on the relative position of scene elements in 3D space.

A computer program for generating and processing depth-enhanced images is loadable into an internal memory of a digital computer, and comprises computer program code means to make, when said computer program code means is loaded in the computer, the computer execute the method according to the invention. In a preferred embodiment of the invention, a computer program product comprises a computer readable medium, having the computer program code means recorded thereon. The computer readable medium preferably is non-transitory, that is, tangible. In another preferred embodiment of the invention, the computer program is embodied or encoded as a reproducible computer- readable signal, and thus can be transmitted in the form of such a signal.

An important application of the invention lies in the field of processing, annotating and/or displaying still images and video images from sports events. In such situations, one or more of the following points are valid, and the information according to each point can, but must not necessarily, according to different embodiments of the invention, be used to segment an image and optionally also to provide distance information: • the action takes place on a flat playing field or track (summarily called "playing field"). Alternatively, a field may have a non-flat but known topography (as e.g. a golf course or dirt bike track).

• the color of the playing field (background) is known or a color model/color scheme can be deducted from images of it.

• the location of markers on the playing field (such as lines and their corners or intersections) is known, the markers having a different color than the playing field.

• scene elements (players or participants) move, most of the time, on the surface of the playing field. Some scene elements (balls) do not.

• scene elements can be distinguished from the playing field by colour (segmentation, chromakeying).

• scene elements can be classified by color (different teams).

• camera calibration parameters (typically the relative position and orientation of the real or virtual camera with respect to the scene, and optical parameters of the camera) are known or are automatically computed from an image of the playing field and from the information about the location of the markers.

Given some or all the above information, it is possible to create a 2.5D or a 3D model from a single image (see, e.g., PCT/CH2007/000265 and references cited therein), and shall only be summarized here: There exist established procedures to segment the image by color, separating players from the background. A player or a group of players is seen as a blob of non-background color. One or more image pixels at the lower side of the blob can be assumed to belong to a part of the player that stands on a point on the playing field. The 3D location of this point is found by intersecting the line of sight corresponding to these pixels with the playing field. The blob image is projected onto a vertical surface ("painted on a billboard") standing at the location of these pixels. According to PCT/CH 2007/000265 and references, a view as seen from a virtual camera location is generated by rendering the scene comprising the background field with the painted billboards standing on it. Other implementations can use, as scene elements, more detailed 3D surfaces carrying the blob images, that is, the blob images are projected onto the 3D surfaces. Given other sources information, e.g. depth information from distance scanners and/or images from additional cameras and/or from images slightly earlier or later in time, information according to one or more of the points listed above may not be required.

In the most general case, there are no assumptions on background surface shape and colour, and all positions and shapes of both the background and of moving scene elements are estimated from multiple images.

According to one aspect of the present invention, the information about the relative spatial location of the scene elements is used to insert annotation elements such that they appear, according to their 3D location, behind or in front of the scene elements.

According to another aspect of the present invention, the information about the relative spatial location of the scene elements is used to generate image data for driving a 3D TV display. As this spatial location information can be derived from the image with only little additional information (camera calibration and/or color model), an ordinary TV image stream can be enhanced to drive a 3D display, and the enhancement can take place at the TV receiver itself. There is only a very small additional load on communication.

Throughout the present application, the term "camera" is used for the sake of convenience. It is however to be understood, that the term may stand for any sensing device providing "image" information on a scene, where an image is a 2D or enhanced representation of the scene. In the simplest case, the image is a 2D camera image, in more sophisticated implementations of the invention it is a depth-enhanced, e.g. 2.5D or "pseudo 3D" representation including depth information obtained by a distance measurement device such as a laser scanner, or by a stereoscopic (or multiscopic) system using two or more cameras and providing "depth from stereo" information as a starting point for the implementation of the present invention.

Regarding output devices, when the term "3D-display" or "3D output device" is used, it is to be understood as representing any kind of display device that evokes, in a user, the perception of depth, be it on a screen or in 3D space.

The annotation elements are typically defined by a user by means of a graphical input device. Basic 3D shapes of annotation elements (Arrows, circles) can be predefined and only stretched and positioned by a user. Entire sets of annotation elements may also be predefined and retrieved from storage for manipulation or adjustment by a user or automatically. According to another embodiment of the invention, annotation elements are generated and/or positioned automatically, for example an "offside wall" computed from player positions, or trajectories of players across the field, determined by means of motion tracking of the players.

In more detail, the method for generating and processing depth-enhanced images comprises the following steps.

1. For each pixel of the (real or virtual) image of a scene, an ordering labeling or distance measure from the camera (real or virtual) is required (so called depth).

a. This can be retrieved by pixel calculations, for example, by calculating the distance from the (real or virtual) camera to the position of the element of which the pixel is part of.

b. or with a precise segmentation of the scene into different objects (object separation).

2. Optionally filter image/depth maps for smooth borders.

3. Either

a. For rendering annotations:

i. Rendering the virtual elements (graphical annotations, possibly represented by 3D models, typically defined and positioned by a user) from the same perspective(s) as the image. This step comprises

1. Either: Automatically providing a depth map for the annotation element, according to its position in the scene and a stencil (defining where the object lies in the image), too (stencils are commonly used in computer graphics, e.g. as incorporated in OpenGL).

2. Or: the annotation element can be labeled with an ordering label according to the user's wish, that is, the user defines only the relative placement of the annotation element with regard to the scene elements.

ii. Combining the two images (i.e. rendering the composite image): For each pixel, use the pixel from the images which is less distant (that is, has a lower ordering label) from the (real or virtual) camera by comparing the distance or the ordering information, respectively.

iii. Optionally, step (i) involves an interaction (a.k.a. telestration) to define the virtual element and its position/depth in the image.

iv. Optionally make sure that step (i) and (iii) in 3D avoids penetrating the objects (scene elements) with the virtual elements.

b. For displaying on a 3D display

i. Prepare the required information depending on the output device based on one or more of:

1. The input image(s)

2. The depth information associated with the image(s)

3. Optionally: An "empty background" image for one or all of the input images. Such an image can be created by removing the "foreground objects" (that is, scene elements such as players, or the ball).

4. Optionally: Depth information of the "empty background" image(s).

5. Optionally the additional step: Deriving warped or otherwise transformed images depicting the scene from a different viewpoint (for example for stereo- or multiscopic displays), based on the input image(s) and based on

a. Image color information

b. Detected blobs (corresponding to objects)

c. Depth map(s)

d. Calibration information

ii. Transfer the prepared information to the output device. Depending on the type of output device, the information of steps 1 and 2 is sufficient and is determined, or the information of steps 1 and 2 and 3 and 4, or of all steps.

c. Or a combination of (a) and (b), where in step (b) the combined depth map of both the input image(s) and the annotations is used. As a result, the invention allows for ...

• the combination of using the step of generating a depth map/order labeling with a rendering of artificial graphical elements to receive a realistic looking annotated scene

• solving the problem on how to add graphical 3D elements into a 2D (sports) scene in between the objects (players, ...) and not only below (under) or above (over). Thus, not only flat (below) or "in front of the foreground" elements are allowed but any 3D element can be placed into the scene.

• solving the problem of creating a 3D TV capable representation easily with only few requirements and capable of real-time transmission and conversion without hardly any loss of quality/resolution.

Examples for the application of the inventive method and data processing device for the insertion of annotations, for generating a 3D-display, and for a combination of both, are described in the following:

Annotations

• "3D offside wall": instead of just placing a line on the ground into the picture, a "wall" (vertical plane) can be added, showing on-side objects in the foreground and off-side objects in the background of the plane. With a semitransparent plane, this effect visualizes the situation even better.

• "Vertical elements": Considering for example Basketball, a graphical element depicting a jump shot consisting of a vertical or curved rising arrow or similar can be added into the scene and appears to be placed at the "correct" 3D location in between the objects.

• "Volume objects", "3D arrows", ... placed at heights above ground in the scene, where the spatial relationship with the scene elements has to be taken into account.

3D Display

• This solution can, for example, be used to broadcast/transmit only the picture, calibration information, and optionally a color model (for each frame or with regular updates or over another channel) to the receiver (TV) which generates the 3D picture itself (instead of transmitting picture and depth or also the "background" and its depth - resulting in either a lower resolution or higher bandwidth requirements). The calibration information specifies the relative position and orientation of the real or virtual camera (i.e. the viewpoint from which the picture is taken) with respect to the scene, and optical parameters of the camera specifying the mapping of the scene onto the camera picture. The color model typically specifies several sets of one or more color ranges, wherein each set of one or more color ranges defines, e.g., the playing field, other background elements, players of one team, players of the other team, etc. Based on this information, the receiver is able to segment the 2D image, model the 3D relation of the playing field with respect to the camera, and determine the location of the players (scene elements) on the playing field. This gives the information required to display the scene on the 3D-display.

o The calibration and color model information requires (compared to the image(s) itself only very little data and can, thus, be included in the image without any loss of quality,

o The processing and generation of the required data for the 3D display can be easily integrated in the end-user device.

Combination

• Taking advantage of the added value of a 3D TV by using three dimensional annotation elements increases the feeling of immersion and depth.

Further preferred embodiments are evident from the dependent patent claims. Features of the method claims may be combined with features of the device claims and vice versa.

DEFINITION OF TERMS

In the general as well as in the detailed description of the invention, the following terms are used: scene: a collection of 3D objects in a 3D environment. The scene may be from the real world, or may be modeled by means of a computer ("virtual scene"). The 3D objects that constitute the scene are also called "scene elements" view: a 2D or 2.SD or 3D representation of a scene, as seen from a particular viewpoint in 3D space. Again, the view may be generated by a real camera in the real world, or by a virtual camera in a virtual scene. The term "view" expresses the fact that there is a 3D context from which the view is derived. Once a view is generated, and in the context of a display, it may also be considered to be an image.

Note: in some places in this text, the term "scene" is used in lieu of "view", since even in the case when a view is manipulated and an annotation is inserted into the view, (without a complete 3D model of the scene existing), conceptually the annotation is considered to be inserted into the scene. Thus, one may talk about "inserting an annotation into a scene", although the computer graphic manipulations operate on a view, or on a view enhanced with depth ordering information (also called 2.5D view, that is, two-and-a-half-dimensional view, or pseudo-3D). different types of cameras.

• A "real camera" is a physical camera capturing images from the real world. The parameters defining the pose (position and orientation) and optical characteristics (focal length, field of view, zoom factor etc.) of the camera are called calibration parameters, since they represent real variables in the real world that usually are determined by a calibration process. A "hypothetical camera" is related to a real camera whose calibration parameters are not known exactly, or not known at all. In such a case the image from the corresponding real camera is processed as if the calibration parameters of the real camera were those of the hypothetical camera. A "virtual camera" is a conceptual entity, defined by a set of camera parameters, which are of the same type as the above calibration parameters, but are predetermined or computed, rather than being calibrated to match the real world. Based on these camera parameters, the virtual camera is used to render images from elements in a 3D scene. The term "rendering" is commonly used in 3D computer graphics to denote the computation of a 2D image from a 3D scene. In the context of a 3D display device, one may also say that a 3D scene is rendered on the display device, be it via 2D images or by a rendering process that does not require 2D images.

• A "viewing camera" may be a virtual camera, used to define a computer generated view, or a real camera, whose captured image of the real scene is enhanced by the virtual annotation elements and/or by creating the depth-enhanced representation of the scene. The annotation elements are computed and rendered from the point of view of a virtual camera having parameters corresponding to real camera, in order to insert the annotation elements correctly (with regard to perspective and visibility) into the depth-enhanced representation. annotation: a graphic element, which conceptually is a 3D object located in 3D space, and which is inserted into an existing scene. Usually this insertion, i.e. the act of annotating, is initiated and/or controlled by a human user.

Note: In some places of this document, annotation elements may be called simply "virtual elements", as opposed to elements of the scene that represent real world objects such as players in a sports scene, and incorporate image and/or position information from these real world objects. Scene elements representing the real world objects may be inserted, according to their real position, and/or with their associated real image data into a 3D model and then rendered again for viewing on a display. Such scene elements shall also be considered "real". In a more general context, the scene may also be generated in a purely virtual manner, that is, without image information from a real, static or dynamically unfolding recorded or live event. component: an element of an image, depending on the manner in which the image is represented. Typically, a component corresponds to a pixel. connected component: a number of components that are considered together as a unit. In the context of an image, such a unit may be called a blob. In the context of the scene, such a unit often corresponds to an object.

BRIEF DESCRIPTION OF THE DRAWINGS

The subject matter of the invention will be explained in more detail in the following text with reference to preferred exemplary embodiments which are illustrated in the attached drawings, in which is shown in a schematic manner:

Fig la-d: known graphic annotations;

Fig 2a-b: issues involved in 3D graphic annotations;

Fig. 3: a system for recording, processing and displaying depth-enhanced images; Fig. 4: different stages in image processing;

Fig. 5: a flow diagram of a method for enhancing images by creating a 3D-enhanced representation incorporating annotations;

Fig. 6: a flow diagram of a method for enhancing images by creating a 3D-enhanced representation for displaying it or rendering it on a 3D display device; and

Fig. 7: an apparatus for receiving image information and for generating and displaying depth-enhanced images.

The reference symbols used in the drawings, and their meanings, are listed in summary form in the list of reference symbols. In principle, identical elements are provided with the same reference symbols in the figures.

PREFERRED EMBODIMENTS OF THE INVENTION

Fig. 3 shows, schematically, a system for recording, processing and displaying depth- enhanced images. A physical camera 1 and further observation devices, are arranged to observe a scene 10 comprising a background 11 and objects 12. The further observation devices may be further physical camera(s) 2a, position determining system(s) 2b, distance scanner(s) 2b. A virtual camera 3 is characterized by calibration data (position, orientation, optical parameters), and the system is configured to generate a view of the scene 10 as it would be seen by the virtual camera 3. The system further comprises a computer-readable storage means 4, a data processing unit programmed to operate as a depth analysis processor 5, one or more user interface devices such as a display 6a and an input device 6b (such as pointing device, keyboard, and the like), a data processing unit programmed to operate as a rendering processor 7, and a 2D or 3D display 8.

Fig. 4 shows, schematically, different stages in image processing as performed by the system: a) original image with stadium background 21, playing field background 22 and players 23 (in a highly schematic representation). b) images segmented into background and different scene elements. Each player is represented by an individual segment, i.e. a blob of pixels. Overlapping players (not shown) may be represented by just one blob comprising the pixels corresponding to both players. c) annotation element 24 in a desired 2D position in the image b) scene and annotation element 24 rendered, taking into account distance ordering of each pixel.

The system, and in particular the depth analysis processor 5 and rendering processor 7 are configured to execute the methods according to one or both of the flowcharts according to Figs. 5 and 6.

Fig. 7 schematically shows an apparatus for receiving image information and for generating and displaying depth-enhanced images. The apparatus comprises a receiving unit 9 for receiving broadcast images and additional information, a depth analysis processor 5 and a rendering processor 7. The apparatus may be embodied as a separate device connected to the 3D display 8, or it may be incorporated, together with the 3D display 8, in a common housing. If no user interaction for annotating scenes is required, no dedicated input and output devices are provided.

The individual steps involved in the preferred embodiments of the invention are explained in the following. This is done by describing different variants for the stages of

• inputting different types of input data used for the image processing,

• processing the input data, and

• outputting the enhanced image data.

Depending on the type of available input data and on user/operator requirements, different methods for processing the input data can be chosen.

INPUT

• One or more (video) images showing a scene, in particular a sports scene. If more than one image is available and to be used, then all from the same instant in time (synchronized). Typically, these are video images/frames/fields from TV cameras. The images shall be called "input images" or simply "images", where context allows. • Optional: Calibration information for at least one of the images. Calibration information includes at least camera position, focus and orientation information. It can include other parameters (distortion, etc.). This is not required for method C2 described in the Processing section below.

• Optional: color model information that can be used to separate background from foreground in the image(s). Optionally, the color model includes different information for different foreground objects (e.g. players, referees, ball, etc.) to distinguish between different foreground objects as well. The color model is, for example, determined with user interaction, e.g. by having a user assigning labels to blobs in a segmented image, e.g. by identifying particular blobs as being foreground objects or even distinguishing such objects as being part of a particular team. From this, the system learns the color distribution associated with this team. Color models may be either replaced or supplemented with other useful information for separation of foreground and background, such as shape, edge, or priors/templates. Furthermore, the color model can be learned automatically from the team's jersey colors (available from the clubs or the associations).

• Optional: One or more video images showing the same [sports] scene at a different time, e.g. one frame before or after the input image mentioned above.

• Optional: "empty" background information of the scene/image, i.e. showing no "foreground" objects/players...

PROCESSING

Step 1: Determine a distance measure for each pixel of an image from a (given or virtual) camera showing the same scene. The distance measure is not required to be a metric - the only requirement is that one is able to compare two measures and determine which one is smaller than the other, that is, which of two or more entities such as a pixel or blob or object etc., each being associated with a measure, lies closer to the camera.

If distance information is required only for a given real camera (that is, a camera providing the input image that is to be processed), the following methods A-E can be applied alone or in combination: • Method A (external information based): Use information from external device (laser scanner, object tracking device/method, e.g. by (differential) GPS or RF triangulation, etc.), for example.

• Object positions: Determine the object's location in the image and perform a blob detection including that knowledge (This is similar to method C but with additional distance or location information for refining and guiding the segmentation process which extracts the blobs).

• Scanning device:

o If the device is positioned at essentially the same location as the camera:

Pixel-wise distance information for each input image pixel is directly available.

o If the device is not positioned at essentially the same location as the camera: reproject the scanner's 3D information into the camera space and optionally perform filtering to reduce noise in the distance measurements. In other words: transform the 3D information into 3D surfaces as seen by the camera, and project the camera image onto the 3D surfaces.

• Method B (stereo based): If more than one input image is available, then

• No calibration information is required for the "second" camera, but it can be helpful

• If no 2^nd calibration is available:

o Use a stereo algorithm to determine distance information for each input image pixel (or for those pixels or image areas only for which the algorithm gives a result). "Depth from stereo" algorithms are commonly known.

o Such a stereo algorithm can use the color information in order to pre- segment the input image into foreground and background (playing field, stadium e.g.) pixels / parts (see Method C)

• If 2^nd calibration information is available, too:

o Use knowledge of calibration to put constraints on the stereo algorithm improving its quality o Optionally pre-segment the image(s) (see Method C) to gain prior knowledge for even a better quality: a blob in image A must correspond to a blob in image B, consequently intersection results of the corresponding rays restrict the possible depth values.

• Method C (blob detection): If only one input image and the color model information is available:

• Pixel-wise classification into background and (differing) foreground pixels, wherein the foreground pixels may be classified into different classes, and clustering these pixels into connected components or blobs, corresponding to objects.

• Optionally using manual interaction to distinguish between different objects (object detection, and/or pixel-wise classification)

• Optionally using (semi) automatic object recognition algorithms. These may be based on shape, colour, etc.

• Determining 3D position of the connected components by assuming that these connected components represent objects on the ground plane (field) and intersecting a ray through the connected component or pixel(s) at the bottom of the connected component with the ground plane (or a plane parallel to the ground plane).

• Calculating the distance from the camera to the calculated 3D position of the component and assign it to all pixels of the connected component.

• Method C2 (order labeling): If only one input image is given, without calibration, but the color model information is available:

• Classification and separation of input image based on the color model information into multiple objects, as described in the pending patent application PCT/CH2007/000265.

• Using manual interaction to define a layering of the objects describing its rank/distance from the camera. For example, the user clicks on one object (i.e., clicks one of its pixels according to the separation or segmentation) and defines a rank; then the user hits another object and defines another rank, where a lower rank indicates that the object is closer to the camera (front) as objects with a higher rank.

• Another variant is to assume a default calibration (e.g. from previous images from the same camera, from arbitrary assumptions, or just from a standard default calibration representing a typical camera setup). Then a distance can be assigned based on the intersection of a ray originating from the center of projection of the camera through each object's lowest pixel with the field plane and assigning that distance to all pixels belonging to the object according to the separation/segmentation.

• Method C3 (background): If an empty scene background image is available

• a pixel-wise segmentation, without requiring a color model, can be performed to get a classification into fore- and background ("Background Segmentation"). This can be done by subtracting the empty scene image, as projected according to the view seen by the camera providing the input image, from the input image. Alternatively, a statistical method can be used, assuming that, as seen over time and in different views, the color seen on a background surface that appears most often is the color of the background itself.

• Continue as in C2.

• Method D (temporal information): If no color model is given but images from different time steps are available:

• Use visual flow algorithm (in combination with calibration information) to distinguish between foreground pixels/objects and background parts.

• Determining an estimated position of such connected components by assuming that these components represent objects on the ground plane (playing field, constituting the background) and intersecting a ray through the component with the ground plane. Alternatively, the ray may be intersected with a plane parallel to the ground plane. This is done under the assumption that certain features, such as the center of gravity of a player and of the connected components, i.e. the blob, corresponding to the player, lie at a certain average height. Intersecting the ray through this center of gravity with a plane at said average height returns the estimated position of the player.

• Calculating distance from camera to the calculated position of the component and assign it to all pixels of the component.

• Method E (multi view): If more than one input image is available including color information

• Apply Method C and use multi-camera information to match components from different views to each other. Then do not intersect rays with the plane but calculate (in 3D) the point of smallest distance between the rays of corresponding parts of the matched components. Use that point as position of the component. This allows to locate objects in 3D space correctly, e.g. a flying ball, or players jumping into the air.

• Calculate distance from camera to the calculated position of the component and assign it to all pixels of the component.

• Depending in circumstances, various combinations of the above methods are possible.

If the distance information is required for a virtual camera (as described in patent application PCT/CH2007/000265):

• Rendering the scene from one or more virtual viewpoints requires position information of the objects in the scene for a proper parallax effect.

• Rendering the scene from the virtual viewpoint, e.g. with OpenGL, automatically yields a depth map of that scene. This may result in different distances for different parts of an object.

• Assigning each object a (depth) label can be used to generate a "depth map" where each pixel of the same object has the same depth value. This guarantees a consistent depth over the entire object.

• Such a method to determine the position of the objects is described in the pending patent application PCT/CH 2007/000265. If a sequence of images of the same camera is available, one can use known temporal coherence methods (e.g. filtering) to improve the quality over time.

Once the distance measure (thus, the real distance, or the relative ranking) for the scene elements is known, the annotation elements are inserted by user interaction. Typically, this is done by the user drawing, with a pointing device, on a view/image of the scene. This is explained in further detail below, under "other aspects"

OUTPUT

On a 2D display (on ly for an notations) Just render the scene in a picture and display it.

On a 3D display (optionally with an notations)

Once the picture and depth/rank information is available, the data can be transformed into the specific format required by the available 3D display.

OTHER ASPECTS

• How is the "interaction" (or telestration) to define the virtual element and its position/depth in the image done (see 3.a.iii above)? In a preferred embodiment of the invention, the following steps are implemented in this interaction:

• Input: a pointing device such as a pen or mouse or finger marks a pixel, which corresponds to a ray from the viewpoint through that pixel on the viewing plane. Therefore, it also corresponds to an infinite number of potential depth values. From a geometrical view, it is not obvious which depth value is "correct" or user- desired.

• Problem: How do you get the desired depth as an interaction metaphor?

• Possible solutions:

o the pointing device position is interpreted as indicating a position on the ground, i.e. the 3D point chosen to correspond to the pointing device's position is the one where the ray from the viewpoint passes through the ground. This is like "painting" on ground. However, if 3D annotation objects are supposed to appear at a certain height over that ground position, there is the problem that the object does not appear at the location where user is interacting with the image.

o "Painting" on a virtual plane parallel to the ground plane but at a specific height (e.g. lm), where most of the drawing objects will appear. Then, the 3D annotation object appears exactly where the interaction takes place. However, all objects are on the same height over ground.

o "Painting" on a virtual plane parallel to the ground plane but with different heights for different tools, or by controlling the height of the annotation element above ground by means of an additional input device, such as a scroll wheel or specific keyboard keys (up/down).

• Preventing annotation elements from intersecting scene objects.

• Input: depth along ray from viewpoint, or equivalent 3D position information, e.g. from an interaction as described above.

• Problem: How do you make sure that the annotation element, such as an arrow, does not go (in 3D) through the scene object but only around it?

• Possible solutions:

o Do not care

o Pre-computed collision map or similar: Calculate valid areas/volumes where no object is situated. This can be done in 3D, by intersecting volumes of scene elements and annotation elements, or in 2D, by intersecting areas, where the areas are defined by a vertical projection of the scene and annotation elements onto the playing field. This can be simplified by assuming scene elements (players) to have fixed shape such as an upright cylinder of fixed dimension. If user is inserting an annotation element at/through such an area/volume in which it intersects a scene element, the annotation element is automatically and dynamically readjusted, e.g. by bending the annotation element around the scene element. In situations where an annotation element has a variable shape, e.g. an arrow with fixed start and end points, and with other control points in between, and the user moves one of the control points, then the intersection detection, during movement of the control point, preferably is in operation and causes the line to snap to a trajectory where there is no intersection.

• Enforcing a particular spatial relationship between annotation element and scene element:

• Problem: in certain situations, it may be desirable to explicitly force an annotation element to be perceived as being located in front or behind a particular scene element or object.

• Possible solution:

o A distance/rank can be manually assigned to that scene element or object, which will cause that object to be rendered in front of or behind the annotation.

o For "front" objects: The object can be rendered a second time, without considering the distance/rank. With that, the object will appear "in front" of the annotation.

Claims

1. Method for generating and processing depth-enhanced images, comprising the steps of

• providing, by means of a physical camera (1, 2a) or a storage device (4), an image of a scene (10), the image being a still image or an image in a sequence of video images;

• computing, by means of a depth analysis processor, from the image a depth-enhanced representation of the scene (10), the depth-enhanced representation comprising information on the relative position of scene elements in (real or virtual) 3D space, wherein each scene element (23) corresponds to a particular object (12) in the scene and to an image segment generated by observing the object with the physical camera;

• inputting, by means of an input device (6b), one or more annotation elements (24) and information on the relative position of the one or more annotation elements (24) with regard to the scene elements (23);

• defining camera parameters (position, orientation and lens settings) of at least one viewing camera, the viewing camera parameters being identical to or an approximation to those of the physical camera (1, 2a), or being parameters of a virtual camera (3);

• rendering, by means of a rendering processor (7), at least one rendered image as seen by the at least one viewing camera, wherein the one or more annotation (24) elements are shown in a spatially consistent relation to the scene elements (23).

2. The method of claim 1, comprising the further step of displaying the at least one rendered image on a 2D or 3D-display (8).

3. The method of one of the preceding claims, wherein in the at least one rendered image the inserted annotation elements (24) appear, according to their location in 3D space, to lie behind or in front of the scene elements (23).

4. The method of one of the preceding claims, wherein the depth-enhanced representation of the scene (10) comprises, for each pixel of the entire image or of part of the image, distance ordering information and wherein the relative position of scene elements (23) and annotation elements (24) is expressed by their distance ordering information.

5. The method of claim 4, wherein the distance ordering information is one of a distance to the viewing camera and a relative ordering according to distance to the viewing camera.

6. The method of one of claims 1-5, wherein the depth-enhanced representation of the scene comprises a mapping of segments of the image onto scene elements of a 3D representation of the scene (10) and wherein the relative position of scene elements (23) and annotation elements (24) is expressed by their location in the 3D representation of the scene (10).

7. The method of one of the preceding claims, comprising the steps of

• by means of one or more further observation devices (2a, 2b, 2c) or the storage device (4), providing further information;

• when computing the depth-enhanced representation of the scene by means of the depth analysis processor (5), taking into account, in addition to the image, this further information;

wherein the one or more further observation devices (2a, 2b, 2c) comprise one or more of distance measuring scanners (2b), further physical cameras (2a), position determining systems (2c).

8. The method of one of the preceding claims, wherein two or more viewing cameras are defined, and an image is rendered for each of the viewing cameras, resulting in a pair of stereoscopic images or a group of multiscopic images; and optionally in a video sequence of pairs or groups of images.

9. The method of one of the preceding claims, wherein in the step of inputting one or more annotation elements (24), comprises the step of computing a 3D intersection of these annotation elements (24) with scene elements (23), detecting an intersection and optionally indicating to a user that an intersection has been detected, and/or optionally also automatically correcting the shape or the position of the annotation element (24) such that no intersection occurs, e.g. by stretching an annotation element (24) to pass around a scene element (23) in the virtual 3D space.

10. The method of one of the preceding claims, wherein the step of inputting information on the relative position of the one or more annotation elements (24) comprises the step of interpreting the position inputted by the input device (6b) as indicating a position on a ground plane (11) in the scene (10).

11. The method of one of the preceding claims, wherein the step of inputting information on the relative position of the one or more annotation elements (24) comprises the step of interpreting the position inputted by the input device (6b) as indicating a position on a plane parallel to and at a given height above a ground plane ( 11) in the scene (10).

12. The method of claim 11, wherein, when indicating the position, the height above the ground plane (11) is controllable by means of an additional input device or input parameter.

13. A method for generating and processing depth-enhanced images, comprising the steps of

• providing, by means of a physical camera (1, 2a) or a storage device (4) or by receiving a broadcast, an image of a scene (10), the image being a still image or an image in a sequence of video images;

• computing, by means of a depth analysis processor (5), from the image a depth- enhanced representation of the scene (10), the depth-enhanced representation comprising information on the relative position of scene elements (23) in (real or virtual) 3D space, wherein each scene element (23) corresponds to a particular object (12) in the scene (10) and to an image segment generated by observing the object with the physical camera;

• rendering and displaying the depth-enhanced representation of the scene (10) on a 3D output device (8).

14. The method of claim 13, wherein the step of rendering and displaying the depth- enhanced representation of the scene comprises the steps of

• rendering, by means of a rendering processor (7), two rendered stereoscopic images as seen by the two viewing cameras;

• displaying the two rendered stereoscopic images by means of the 3D-display.

15. The method of claim 13 or 14, comprising the step of inputting, by means of an input device (6b), one or more annotation elements (24) and information on the relative position of the one or more annotation elements (24) with regard to the scene elements (23); and, in the step of rendering and displaying the depth-enhanced representation of the scene (10), including the annotation elements.

16. The method of one of claims 13 to 15, wherein the image is received through a broadcast, the broadcast further comprising at least one of camera calibration information and a color model, and comprising the steps of performing the subsequent computation, rendering and displaying of the depth-enhanced representation by a receiving unit (9) of the broadcast, based on the image and at least one of the calibration information and the color model.

17. An apparatus for generating and processing depth-enhanced images, comprising a depth analysis processor (5) and a rendering processor (7) configured to perform the method steps of the method according to one of the preceding claims.

18. The apparatus of claim 17, comprising a receiving unit (9) configured to receive a broadcast, the broadcast comprising at least one image and further comprising at least one of calibration information and a color model, and the depth analysis processor (5) being configured to compute, from the image and at least one of the calibration information and the color model, a depth-enhanced representation of the scene, the depth-enhanced representation comprising information on the relative position of scene elements (23) in 3D space.

19. A non-transitory computer readable medium comprising computer readable program code encoding a computer program that, when loaded and executed on a computer, causes the computer to perform the method according to one of method steps 1 through 16.

20. A reproducible computer-readable signal encoding the computer program that, when loaded and executed on a computer, causes the computer to perform the method according to one of method steps 1 through 16.