NL2005720C2 - System and method for generating a depth map.


Info

Publication number
NL2005720C2
NL2005720C2
Authority
NL
Netherlands
Prior art keywords
depth
image
field
reference entities
camera
Prior art date
Application number
NL2005720A
Other languages
Dutch (nl)
Inventor
Luc Petrus Johannes Vosters
Gerard Haan
Original Assignee
Univ Eindhoven Tech
Priority date
Filing date
Publication date
Application filed by Univ Eindhoven Tech
Priority to NL2005720A
Application granted
Publication of NL2005720C2


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N13/106 Processing image signals
    • H04N13/128 Adjusting depth or disparity
    • H04N13/20 Image signal generators
    • H04N13/204 Image signal generators using stereoscopic image cameras
    • H04N13/246 Calibration of cameras
    • H04N13/261 Image signal generators with monoscopic-to-stereoscopic image conversion

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)

Description

System and method for generating a depth map

BACKGROUND OF THE INVENTION
Field of the invention
The present invention relates to a system for generating a depth map.
The present invention further relates to a method for generating a depth map.
Related Art
The introduction of the first 3D TV sets in the consumer market has generated vast interest in stereoscopic 3D broadcasting. Producing native 3D video for live events is still time consuming and costly. It requires broadcasters to invest in expensive new equipment, like stereo cameras and stereo rigs, and to hire specially trained stereographers. Real-time 2D-to-3D conversion would be a cheaper option, requiring only additional hardware. Unfortunately, 2D-to-3D conversion is an extremely difficult task for which no optimal, generally applicable solution exists.
However, methods have been proposed for particular broadcasting scenarios. US2009/0196492 describes a method to derive a stereoscopic image from a 2D image obtained from a soccer field. The cited publication presumes that the depth increases linearly as a function of the vertical image coordinate with a field depth gradient that is calculated from the maximum vertical field length and a user-specified maximum depth. Camera tilting, panning and zooming are not considered. This does not correspond well to human depth perception. Depending on the camera's tilt angle, the depth at the bottom of the screen, i.e. the depth offset, should either increase or decrease. Zooming in and out decreases and increases the depth offset respectively. Furthermore, zooming, panning and tilting also change the depth distribution in the field. Accordingly, there is a need for a method and apparatus to more accurately calculate a depth map.
SUMMARY OF THE INVENTION
According to a first aspect of the present invention a system for generating a depth map from an input two-dimensional image is provided. The system comprises:
- a pattern recognition unit for recognizing reference entities imaged in the input two-dimensional image and for determining an imaged size of recognized species of said imaged reference entities, said reference entities further having a real size of which at least an average value is at least approximately known,
- a depth calculation unit arranged for determining the imaged size of recognized species of said imaged reference entities in the input image and for estimating a depth coordinate of a position of said reference entities from said determined size and from said knowledge about their real size,
- a depth map generating unit for generating the depth map from said estimated depth and from a specification of a surface that carries the reference entities.
According to a second aspect of the present invention a method for generating a depth map from an input two-dimensional image is provided. The method comprises the steps of:
- recognizing reference entities imaged in the input two-dimensional image, said reference entities having an imaged size and a real size of which at least an average value is at least approximately known,
- determining the imaged size of recognized species of said imaged reference entities in the input image and estimating a depth coordinate of a position of said reference entities from said determined size and from said knowledge about their real size,
- generating the depth map from said estimated depth and from a specification of a surface that carries the reference entities.
According to the present invention, the depth in the line of sight on which the depth distribution is based is calculated on the basis of the observed size of the reference entities and their at least approximately known real size, e.g. a length or an area.
A typical embodiment of the system further comprises a parallax calculation unit for calculating a parallax from said depth map, and a stereoscopic image generator for generating a second two-dimensional image from said captured two-dimensional image and the calculated parallax.
Instead of generating a single two-dimensional image, alternatively a pair of two-dimensional images may be calculated that are symmetrically arranged with respect to the input two-dimensional image.
The system may further comprise a 3D display facility for displaying a 3D image using the captured two-dimensional image and the generated second two-dimensional image. The stereoscopic display may be implemented in various ways, for example in the form of a head mounted display, in the form of a display that cooperates with glasses, e.g. with glasses with a different polarization direction for the left and the right eye, or with glasses having optical shutters that alternately allow the left and the right eye to view the image. In a preferred embodiment the display is an autostereoscopic display.
Some autostereoscopic displays may be capable of generating a 3D image on the basis of a larger plurality of two-dimensional images. This capability can be used to give an observer the sensation that he/she can walk around the displayed scene.
In an embodiment of a system comprising such a display the parallax generator is arranged to generate a plurality of parallax maps, and the stereoscopic image generator is arranged to generate an additional two-dimensional image using each of the parallax maps.
Several options are possible for the implementation of the system. In an embodiment the entire system may be implemented in a TV-reception unit. Such a TV-reception unit comprising the system only receives the two-dimensional image signal, and optionally additional data relating to the settings and specifications of the camera that provides said two-dimensional image signal, and renders a 3D image from said two-dimensional image signal and said additional data. This has the advantage that the amount of data to be broadcast is modest.
In another embodiment the pattern recognition unit, the depth calculation unit and the depth map generating unit are part of a TV-transmission unit. In this case the TV-transmission unit transmits two-dimensional image data and in addition depth information to the TV-reception unit. The latter calculates the parallax data and generates the second image to form a stereoscopic pair with the original image. This has the advantage that the additional costs for providing the means for generating the depth map can be distributed over a plurality of users.
In again another embodiment the pattern recognition unit, the depth calculation unit, the depth map generating unit as well as the parallax calculation unit and the stereoscopic image generator are part of the TV-transmission unit. In this case the TV-transmission unit transmits the image data of the stereoscopic pair to the TV-reception unit. The latter merely has to display the stereoscopic pair. This has the advantage that the additional costs for providing the means for generating the depth map as well as the means for performing the parallax calculation and the stereoscopic image generation can be distributed over a plurality of users. Any other data format may be used to communicate data between a TV-transmission unit and a TV-reception unit in a system according to the first aspect of the invention. Examples of known suitable data formats are the Side-by-Side, Top-and-Bottom, line-column interleaved, Checker Board, time-multiplexed and color coded formats.
BRIEF DESCRIPTION OF THE DRAWINGS
These and other aspects are described in more detail with reference to the drawing. Therein:
FIG. 1 schematically shows a system comprising a camera, a transmitter and a receiver,
FIG. 1A schematically shows a mapping of a scene to a camera sensor,
FIG. 2 shows a first embodiment of a system according to the first aspect of the invention,
FIG. 2A shows a second embodiment of a system according to the first aspect of the invention,
FIG. 2B shows a third embodiment of a system according to the first aspect of the invention,
FIG. 3 shows a fourth embodiment of a system according to the first aspect of the invention,
FIG. 4 shows a part of an embodiment of a system according to the first aspect of the invention in more detail,
FIG. 5 schematically illustrates how a depth measure is derived,
FIG. 6 shows a further part of an embodiment of a system according to the first aspect of the invention in more detail,
FIG. 7 schematically illustrates a calculation of a depth map using an estimated reference depth using a first point of view,
FIG. 8 schematically illustrates a calculation of a depth map using an estimated reference depth using a second point of view,
FIG. 9 illustrates a relationship between various variables,
FIG. 10 shows a still further part of an embodiment of a system according to the first aspect of the invention in more detail,
FIG. 11 illustrates a method according to the second aspect of the present invention,
FIG. 12 illustrates results obtained with a system and method according to the present invention,
FIG. 12A and 12B show aspects of the results in more detail,
FIG. 13 illustrates a further result obtained with a system and method according to the present invention,
FIG. 14 shows a comparison between results obtained with the system and method according to the present invention and with a known system.
DETAILED DESCRIPTION OF EMBODIMENTS
In the following detailed description numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be understood by one skilled in the art that the present invention may be practiced without these specific details. In other instances, well known methods, procedures, and components have not been described in detail so as not to obscure aspects of the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Further, unless expressly stated to the contrary, "or" refers to an inclusive or and not to an exclusive or. For example, a condition A or B is satisfied by any one of the following: A is true (or present) and B is false (or not present), A is false (or not present) and B is true (or present), and both A and B are true (or present).
The invention is described more fully hereinafter with reference to the accompanying drawings, in which embodiments of the invention are shown. This invention may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. In the drawings, the size and relative sizes of layers and regions may be exaggerated for clarity.
It will be understood that when an element is referred to as being "on", "connected to" or "coupled to" another element, it can be directly on, connected or coupled to the other element or intervening elements may be present. In contrast, when an element is referred to as being "directly on," "directly connected to" or "directly coupled to" another element, there are no intervening elements present.
Like numbers refer to like elements throughout. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, third etc. may be used herein to describe various elements, modules, units and/or components, these elements, modules, units and/or components should not be limited by these terms. These terms are only used to distinguish one element, module, unit and/or component from another element, module, unit and/or component. Thus, a first element, module, unit or component discussed below could be termed a second element, module, unit or component without departing from the teachings of the present invention.
Spatially relative terms, such as "beneath", "below", "lower", "above", "upper" and the like, may be used herein for ease of description to describe one element or feature's relationship to another element(s) or feature(s) as illustrated in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if the device in the figures is turned over, elements described as "below" or "beneath" other elements or features would then be oriented "above" the other elements or features. Thus, the exemplary term "below" can encompass both an orientation of above and below. The device may be otherwise oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein interpreted accordingly.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein. All publications, patent applications, patents, and other references mentioned herein are incorporated by reference in their entirety. In case of conflict, the present specification, including definitions, will control. In addition, the materials, methods, and examples are illustrative only and not intended to be limiting.
FIG. 1 schematically shows an arrangement comprising a camera C that registers an image and generates two-dimensional image data Im(x,y) and optionally data representative for properties of the camera, such as a focal length f, a physical distance mx between two subsequent pixels of the camera sensor in a first direction, a physical distance my between two subsequent pixels of the camera sensor in a second direction, a tilt angle βt of the camera, and a pan angle βp of the camera. The camera data is provided to a video system comprising a transmission unit T and a reception unit R.
It is advantageous that the above-mentioned representative data is provided by the camera to the receiver, so that the exact values of these settings and properties are available. Alternatively this data may be estimated from the image, for example by one or more of the methods described in:
- Jain, S.; Neumann, U., "Real-time Camera Pose and Focal Length Estimation," Pattern Recognition, 2006 (ICPR 2006), 18th International Conference on, vol. 1, pp. 551-555;
- Ashraf, N.; Foroosh, H., "Robust auto-calibration of a PTZ camera with non-overlapping FOV," Pattern Recognition, 2008 (ICPR 2008), 19th International Conference on, pp. 1-4, 8-11 Dec. 2008;
- Junejo, I.N.; Foroosh, H., "Refining PTZ camera calibration," Pattern Recognition, 2008 (ICPR 2008), 19th International Conference on, pp. 1-4, 8-11 Dec. 2008;
- Farin, D.; de With, P.H.N., "Estimating Physical Camera Parameters based on Multi-Sprite Motion Estimation," in SPIE Image and Video Communications and Processing, vol. 5685, pp. 489-500, January 2005, San Jose (CA), USA;
- Junejo, I.N., "Calibration and 3D Geometry Estimation of a Pan-Tilt-Zoom Camera Network," 16th European Signal Processing Conference (EUSIPCO 2008), August 25-29, 2008, Lausanne, Switzerland.

FIG. 1A schematically shows how a long shot image taken from a scene is mapped onto the image sensor C2 of the camera C according to a pin-hole model. The scene comprises reference entities E1, E2, E3 of which at least an average size is at least approximately known and which are carried by a surface F. In this case the scene is a soccer game. The reference entities E1, E2, E3 are the players, which are arranged on a surface F formed by the playing field.
A pin-hole model gives a sufficiently accurate approximation if the distance of the reference entities in the scene is substantially greater than the dimension of the camera's aperture. This is the case with long shot images. It is presumed that the camera C is arranged in an orthogonal coordinate system x,y,z and that the position and orientation of the camera C are determined by a line of sight C1. The line of sight C1 crosses the y-axis at height H, and at a distance D0 from that point crosses the x-z plane at point of view P0. The line of sight C1 has a tilt angle βt with respect to the x-z plane and is rotated with a pan angle βp around the y-axis. In this example the direction of the x-axis is defined parallel to the field boundary of the playing field opposite the camera C.
The plane of the image sensor C2 is arranged orthogonal to the line of sight C1 and at a distance f, the focal distance, from the crossing point between the line of sight and the y-axis.
It is presumed that the image sensor C2 has a first axis xs, denoted as horizontal axis, that is arranged parallel to the plane of the field F, and that the sensor has a second axis ys orthogonal to the first axis. The sensor C2 further has a vertical center line C3.
An embodiment of a video system including the transmission unit T and the reception unit R is shown in more detail in FIG. 2.
The embodiment of the system shown in FIG. 2 comprises a pattern recognition unit 10 for recognizing reference entities in an input two-dimensional image Im(x,y). The reference entities to be recognized in the input image have a real size of which at least an average value is at least approximately known. The accuracy with which the average size of the reference entities should be known depends on the accuracy with which the depth map is to be calculated. In a typical case the average size of the reference entities to be recognized is known with an inaccuracy of at most 20%. In other cases the average size may be known with an inaccuracy of 10% or less. In again another case the size of each of the reference entities may be exactly known. The approximately known average size may be a one-dimensional feature of the reference entity, such as a length, but may alternatively be an area of the reference entity. The approximately known average size L is stored in a storage location 25. The recognition unit 10 recognizes the reference entities Ei from the image data Im(x,y) and generates an imaged size L_i^p, i.e. the size of reference entity Ei as observed in the image.
A depth calculation unit 20 is provided that is arranged for determining the size of recognized species i of said reference entities in the input image and for estimating the depth at which said reference entities are arranged from said observed size L_i^p and from said approximately known average size L. In the embodiment shown the depth calculation unit 20 generates a depth value D0 indicative of the depth at a reference position within the image, for example a centre position xc, yc of the image.
A depth map generating unit 30 of the system generates a depth map D(x,y) for the image from said estimated depth D0 and from a specification of the surface that carries the reference entities. In a typical embodiment the surface that carries the reference entities is a playing field, e.g. a soccer field or a tennis court, that can be modeled as a plane having a height H(x,z) = constant. In another embodiment the playing field may have a known height distribution, for example a golf course. In again another embodiment, e.g. in the context of water sports such as swimming or sailing, the surface may be a water surface having a statistically varying height that is substantially independent of the position. For practical purposes it may be assumed that the surface is substantially planar. The x- and the z-axis are defined within the plane describing or approximating said surface.
The system shown in FIG. 2 further comprises a parallax calculation unit 40 for calculating a parallax map P(x,y) from said depth map D(x,y). A stereoscopic image generator 50 generates a second two-dimensional image Im2(x,y) from said input two-dimensional image Im(x,y) and the calculated parallax map P(x,y).
The system further comprises a 3D display facility 60 for displaying a 3D image using the input two-dimensional image Im(x,y) and the generated second two-dimensional image Im2(x,y).
In the embodiment shown in FIG. 2 all functionality of the system is implemented in a single device. The single device is for example the reception device R, whereas the transmission device T merely transmits the image data Im(x,y) and optional camera data (e.g. f, mx, my, βt, βp) to the reception device, e.g. by wire or wirelessly. The data may be transmitted by an internet protocol.
FIG. 2A shows an alternative arrangement, wherein part of the functionality of the system is provided by the transmitter T and part is provided by the receiver R.
In the alternative arrangement of FIG. 2A the pattern recognition unit 10, the depth calculation unit 20 and the depth map generating unit 30 are part of the TV-transmission unit T. In this case the TV-transmission unit T transmits two-dimensional image data Im(x,y) and depth information D(x,y) to the TV-reception unit R. The latter comprises the parallax calculation unit 40 for calculating the parallax data P(x,y), the stereoscopic image generator 50 for generating the second image Im2(x,y) to form a stereoscopic pair with the original image Im(x,y), as well as the stereoscopic display 60.
FIG. 2B shows again another embodiment, wherein the TV-transmission unit T further includes a parallax calculation unit 40 for calculating the parallax data P(x,y) and the stereoscopic image generator 50. In this case the TV-transmission unit T transmits the image data Im(x,y) and Im2(x,y) of the stereoscopic pair to the TV-reception unit R. The latter merely has to display the stereoscopic pair Im(x,y) and Im2(x,y). Figures 2A and 2B merely show two typical examples of a partitioning of components of the system according to the invention over a TV-transmission unit T and a TV-reception unit R. Any other partitioning of the components may be considered, e.g. an embodiment wherein the TV-transmission unit T comprises components 10, 20, 25 and the TV-reception unit R comprises components 40, 50, 60. In again another embodiment the functionality of the system may be partitioned over more than two units. For example the system may have a transmission unit T, an intermediary processing unit and a reception unit R.
FIG. 3 shows a still further embodiment of the system. The embodiment of FIG. 3 differs from the embodiment of FIG. 2 in that the parallax generator 42 is arranged to generate a plurality of parallax maps Pk(x,y), and the stereoscopic image generator 52 is arranged to generate an additional two-dimensional image Imk(x,y) for each of the parallax maps. The display 62 is a multiview display that displays said two-dimensional images Im(x,y) and Imk(x,y) at mutually different angles, therewith giving an observer the sensation that he/she can walk around the displayed scene. Likewise the system of FIG. 3 may be partitioned in two or more physical units that are coupled to each other via communication links.
A typical example of a recognition unit 10 is described with reference to FIG. 4. In the example shown the recognition unit 10 has a first module 110 that determines which pixels in the image Im(x,y) are part of the carrier surface, e.g. the playing field F, or another surface, e.g. a water surface. The first module 110 typically provides a binary output signal Field(x,y), wherein Field(x,y) is 1 if a pixel is recognized as background and 0 otherwise.
In the example described here, the recognition unit 10 has a second module 120 that determines a field boundary FB(x) on the basis of the signal Field(x,y) obtained from the first module. This is advantageous as it bounds the region in the image wherein reference entities can be found that have a known average size and that are used for the depth estimation by the depth calculation unit 20. In other cases, e.g. a scene having no useful field boundaries such as a sailing match in open sea, the second module 120 of the recognition unit may be inactive, or even be absent if the system is particularly arranged for operation within that context.
Usually, the field boundary can be modelled by a piecewise linear function in x. This requires only two parameters per line segment, i.e. a slope and an offset. These parameters can be smoothed over time to get a sharp and temporally stable boundary, which is robust to misdetections in the field detector.
A reliably defined field boundary facilitates the detection of the reference entities on which the depth estimation is based. Depending on the panning angle of the camera, one or two main line segments are required to model the field boundary. If a corner of the field is visible two main line segments are necessary; otherwise one main line segment suffices.
The recognition unit has a third module 130 for segmenting the image into portions containing reference entities of known average size. This module 130 labels all pixels that are not marked as field and that are below the field boundary as reference entity in the reference entity map and generates a corresponding output signal RE(x,y), also denoted as Player(x,y) when the reference entities are players.
Turning now to FIG. 5, the depth calculation unit 20 is described in more detail. FIG. 5 schematically shows how an entity with an approximately known size, here a player with length L, is imaged onto the plane of the camera sensor. In another embodiment the entity is for example an object, such as the sail of a sailing boat. It is presumed here that the player is present at a position corresponding to the vertical center line of the plane of the image sensor. In that case the position where the player stands on the field corresponds to a vertical position l_1^p in the plane of the image sensor as defined below. The head of the player has a vertical position l_2^p in the plane of the image sensor as defined below.
Accordingly, the length L^p of the player in the plane of the camera sensor is the difference between l_2^p and l_1^p. This results in the following relations:

$$\frac{l_1^p}{f} = \tan(\alpha_v), \qquad \frac{l_2^p}{f} = \frac{D\sin(\alpha_v) + L\cos(\beta_t)}{D\cos(\alpha_v) - L\sin(\beta_t)}, \qquad L^p = l_2^p - l_1^p \tag{1}$$
Therein αv is the angle between the line of sight C1 and the line from the focal point of the camera to the coordinates in the field where the reference entity, for example the player, is present. Rewriting these equations, the depth Dpl at the position of the player in the field can be estimated as follows, using the average length L of the player, which typically is 1.80 m:

$$D_{pl} = L\cdot\frac{\sin(\beta_t)\left[L_i^p\, m_y + f\tan(\alpha_v)\right] + f\cos(\beta_t)}{\cos(\alpha_v)\left[L_i^p\, m_y + f\tan(\alpha_v)\right] - f\sin(\alpha_v)} \tag{2}$$

Therein L_i^p is the length of player Ei expressed as a number of pixels on the image sensor, and my is the scale factor indicating the distance in the plane of the image sensor corresponding to the distance between two subsequent pixels in the vertical direction of the image sensor.
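As an illustration, the following is a minimal Python sketch of Eq. 2; it assumes the camera parameters f, my, βt and the viewing angle αv are available in consistent units (angles in radians), and the function name is illustrative rather than part of the patent.

```python
import math

def player_depth(L_img_px, f, m_y, alpha_v, beta_t, L_real=1.80):
    """Estimate the depth of a player from its imaged length (Eq. 2).

    L_img_px : imaged player length in pixels (L_i^p)
    f        : focal length, in the same units as m_y * pixels
    m_y      : vertical pixel pitch of the sensor
    alpha_v  : angle (radians) between the line of sight and the line from the
               focal point to the player's position in the field
    beta_t   : tilt angle of the camera (radians)
    L_real   : assumed average real player length in metres
    """
    a = L_img_px * m_y + f * math.tan(alpha_v)
    numerator = math.sin(beta_t) * a + f * math.cos(beta_t)
    denominator = math.cos(alpha_v) * a - f * math.sin(alpha_v)
    return L_real * numerator / denominator
```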
Consider a player moving circularly around the camera. Due to his circular motion the depth remains constant. Furthermore, the angle between the two lines going from the camera center to the player's head and feet remains unchanged. Since the image sensor is very small compared to the player's depth, it can be assumed that the player's size on the image sensor remains approximately constant.
In accordance therewith it is assumed that the player length on the image sensor is independent of the player's image location, and only depends on the distance of the player to the camera.
The situation wherein a player is arranged at depth Dpl from the camera at an arbitrary position in the field is equivalent to the situation where the player is arranged at the same depth Dpl on the line in the field F corresponding to the vertical center line C3 of the camera sensor. This implies that the depth Dpl derived for this player relates to the depth of the field F at the line of sight C1 according to
$$\frac{D_{pl}}{D_0} = \frac{\sin|\beta_t|}{\sin(|\beta_t| - \alpha_v)} \tag{3}$$

provided that the height H at which the camera is arranged is significantly smaller than the depth D0 in the field. Under this assumption αv can be approximated by

$$\alpha_v = |\beta_t| - \arcsin\!\left(\frac{\sin|\beta_t|}{D_{pl}/D_0}\right) \tag{4}$$

In practice this is the case. For the sake of clarity of the figures, however, a situation is shown wherein H and D0 are of the same order of magnitude.
By substituting αv and L_i^p, which is the imaged length of the player Ei at the original player location in the image, into Eq. 2, the absolute depth Dpl can be calculated. Finally, on the basis of this estimation Dpl, an estimation D0,i is obtained for the depth offset D0 at the image center using Eq. 3 as follows:

$$D_{0,i} = D_{pl}\cdot\frac{\sin(|\beta_t| - \alpha_v)}{\sin|\beta_t|} \tag{5}$$

A more accurate estimation of D0 is obtained by calculating D0,i for every player in the current frame, and storing this value in a temporal buffer which contains all D0,i's from the P previous frames. Then a joint estimated value D0 for the current frame is calculated using a combination facility. Various methods are applicable for this purpose, such as a median filter operation or a mean operation. Also variants thereof are possible, such as a weighted mean or an (alpha-trimmed) mean filter operation. In a typical embodiment a median filter is used for this purpose. Accordingly, the joint estimated value is calculated as the median of all D0,i's in the temporal buffer. The median operation applied to the results in the temporal buffer removes outliers in the estimation of D0 that are caused by players' poses and natural length variations. Accordingly, FIG. 6 shows a depth calculation unit 20 in an embodiment of the system according to the first aspect, comprising a first facility 210 that calculates the values D0,i as described above, and a second facility including a buffer 220 and the combination facility, such as a median filter 230. The buffer 220 temporarily stores the values D0,i generated by the first facility. The values may be stored for a single image frame only if a rapid response of the depth calculation unit 20 to changing camera settings is required, or the values D0,i may be stored for a plurality of the latest received image frames. Depending on the required accuracy of the depth calculation unit 20, this plurality may be selected for example from a range of 2 to 100.
The combination facility, here the median filter 230, generates the joint estimated value D0 from the estimations D0,i according to

$$D_0 = \operatorname{Median}_i\,(D_{0,i}) \tag{6}$$
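A minimal sketch of this buffering and combination step (Eqs. 5-6), assuming the per-player estimates D0,i of each frame are computed elsewhere (e.g. with the player depth sketch above and Eq. 5); the class name and buffer length are illustrative.

```python
from collections import deque
from statistics import median

class DepthOffsetEstimator:
    """Temporal buffer with a median combiner for the depth offset D0 (Eqs. 5-6)."""

    def __init__(self, num_frames=10):
        # Keep the D0,i estimates of the last `num_frames` frames.
        self.buffer = deque(maxlen=num_frames)

    def update(self, d0_estimates_current_frame):
        """Add the D0,i values of the current frame and return the joint estimate D0."""
        self.buffer.append(list(d0_estimates_current_frame))
        all_estimates = [d for frame in self.buffer for d in frame]
        if not all_estimates:          # no players detected yet
            return None
        # The median rejects outliers caused by pose and natural length variations.
        return median(all_estimates)
```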
Referring to FIG. 7, the depth map generating unit 30 is now described in more detail. FIG. 7 shows lines L0 and L1, parallel to the x-axis; L0 is the line through P0 and L1 is the line through P2. As indicated therein, the depth D0H for an arbitrary point Pc1 at the vertical center line C3 of the image, corresponding to an observed point P1 of the field, can be calculated with

$$D_{0H} = D_0\cdot\frac{\sin|\beta_t|}{\sin(|\beta_t| - \alpha_v)} \tag{7}$$
Therein αv is the angle between the line through P1 and the line of sight C1 through P0. The angle αv changes sign when D0H is smaller than D0. Now the depth D for an arbitrary point P2 on the line L1, represented by Pc2 on the sensor of the camera C, can be calculated from D0H by

$$D = D_{0H}\cdot\frac{\cos|\delta|}{\cos(|\delta| - \alpha_H)} \tag{8}$$

depending on whether the x-axis is selected as the reference line, as shown in FIG. 5, or the z-axis is selected as the reference line. In practice the x-axis and the z-axis are constructed to be parallel with respective field boundary lines, and the axis corresponding to the field boundary line that is in front of the line of sight is selected as the reference line.
In Eq. 8 αH is the angle between D0 and D0H. From the camera depth model presented above it can be observed that the angle αH changes sign when the camera pans to the opposite side of the field.
The angle δ can be expressed in terms of the camera height H, the pan angle βp and the depth D0H by

$$\sin\delta = \frac{\sqrt{D_{0H}^2 - H^2}}{D_{0H}}\,\sin|\beta_p| = \sqrt{1 - \left(\frac{H}{D_{0H}}\right)^2}\,\sin|\beta_p| \tag{9}$$

Accordingly

$$\delta = \arcsin\!\left(\sqrt{1 - \left(\frac{H}{D_{0H}}\right)^2}\,\sin|\beta_p|\right) \tag{10a}$$

The same derivation can be made if the z-axis is chosen as the reference line. In that case δ is given by

$$\delta = \arcsin\!\left(\sqrt{1 - \left(\frac{H}{D_{0H}}\right)^2}\,\cos|\beta_p|\right) \tag{10b}$$

In practice H is often significantly smaller than D0H. In that case Eq. 10a and Eq. 10b reduce to δ = |βp| and δ = arcsin(cos|βp|) respectively. By combining Eqs. 7 and 8, the depth D can now be expressed in terms of αH and αv, and the camera's pan and tilt angles βp and βt respectively, as

$$D = D_0\cdot\frac{\sin|\beta_t|}{\sin(|\beta_t| - \alpha_v)}\cdot\frac{\cos|\delta|}{\cos(|\delta| - \alpha_H)} \tag{11}$$

The choice between Eq. 10a and Eq. 10b depends on which field boundary line is most clearly visible in the image. The angles αv and αH can be expressed in L_PV, L_PH and L_PD as illustrated in FIG. 8.
FIG. 8 shows a view according to the image received by image sensor C2. Therein (x0,y0) are the coordinates of the center of the sensor C2. A depth value has to be determined for the point with coordinates (x1,y1(x1)) on the sensor C2. This point corresponds to a point P2 in the field F. The point with coordinates (x0,y1(x0)) corresponds to a point P1 in the field on the same reference line L1 and mapped on the vertical centre line C3. The point with coordinates (x0,y0) corresponds to point P0 in the field, also mapped on the vertical centre line C3. F1 is the boundary line of the field and F2 is a set of lines parallel with this boundary line.
L_PV is the distance between the points (x0,y0) and (x0,y1(x0)),
L_PH is the distance between the points (x0,y1(x0)) and (x1,y1(x1)), and
L_PD is the distance between the points (x1,y1(x1)) and (x0,y0).
FIG. 9 shows in more detail a portion of FIG. 7 relevant to the camera C. FIG. 9 illustrates the relation between various dimensions related to the image of the reference entity on the sensor of the camera.
As illustrated in FIG. 9 the angle αv can be expressed by

$$\alpha_v = \arctan\!\left(\frac{L_{PV}}{f}\right) \tag{12}$$

FIG. 9 schematically shows the relation between L_PV, L_PH and L_PD, the focal distance f of the camera and the angles αv and αH. The sides of the triangle can be expressed as:

$$AB = \sqrt{f^2 + L_{PV}^2}, \qquad BC = L_{PH}, \qquad AC = \sqrt{f^2 + L_{PD}^2} \tag{13a,b,c}$$

Therein:
- A is the position of the camera;
- B is the point (x0,y1(x0)) on the sensor C2;
- C is the point (x1,y1(x1)) on the sensor C2.
Accordingly αH can be calculated with the law of cosines as

$$\alpha_H = \arccos\!\left(\frac{2f^2 + L_{PV}^2 + L_{PD}^2 - L_{PH}^2}{2\sqrt{(f^2 + L_{PV}^2)(f^2 + L_{PD}^2)}}\right) \tag{14}$$
Ly — \y0 — yi{xo)\* my,
Lff = λ/— y 1(^0)) · triy)2 + ((j’l - Xo) * rnx)2,
Ld = \/((yi(xi) - Vo) * )2 + (Ocz ~ -r0) · mx)2, V/ e F2 (15a,b,c)
Therein, x0,y0 are the coordinates of the center location in the image. The values m* 10 and my are the horizontal and vertical scale factor respectively that relate pixel coordinates to actual distances on the image sensor of the camera. The values xi and yi(xi) denote the horizontal and vertical image coordinates respectively of points on line 1. F2 denotes the set of all lines 1 parallel to the chosen image field boundary. Depth in the field region can now be assigned to all pixels belonging to the lines in 15 F2. This procedure is shown in FIG. 8.
The lines in the set F2 are assumed to correspond to real world lines that are parallel to the real world field boundary. In many situations this model is sufficiently accurate. A more accurate depth estimation can be obtained if the depth is assigned on image lines which intersect at the vanishing point corresponding to the perspective distortion.
In an embodiment the depth map obtained for the field is extended with a depth map Daud(x,y) for the audience as follows.
$$D_{aud}(x,y) = D_{aud,0}(x)\cdot\frac{AR - 1}{0.1\,N}\cdot\big(y_{field}(x) - y\big) + D_{aud,0}(x) \tag{16}$$

Therein N is the image height in pixels, y_field(x) denotes the boundary between the field and the audience region, D_aud,0(x) denotes the depth at position (x, y_field(x)), and AR is the audience roll-off factor, which defines the depth at y_field(x) - 0.1N as D_aud,0(x)·AR. A practical value for AR was found to be 1.05.
In an embodiment the depth map is extended with an estimation of the depth of the reference entities of known size, such as the players. In an embodiment it is assumed that the depth is constant for an individual player. Since the distance D to the players is significantly larger than the player length L in long shot images, this is a reasonable approximation. Therefore, the player depth is calculated from the vertical location where the player's feet touch the field. Accordingly:

$$D_{player}(x,y) = D_{field}(x, y_i), \qquad \forall (x,y)\in P_i,\ \forall i\in PLS \tag{17}$$

Therein y_i is the minimum y-coordinate of the set of pixels P_i belonging to player i, and PLS is the set of all players in the current frame. Here D_field is expressed directly in the image coordinates (x,y). This expression of D_field can be obtained by substituting equations 10a,b, 13a-c, 14 and 15a-c into equation 11.
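A small sketch of this player depth assignment (Eq. 17), assuming a per-pixel field depth map and one boolean mask per player are available; the feet row is taken here as the lowest player pixel in the image, which corresponds to the patent's y_i up to the y-axis convention.

```python
import numpy as np

def assign_player_depth(depth_field, player_masks):
    """Assign each player pixel the field depth at the player's feet row (Eq. 17)."""
    depth = depth_field.copy()
    for mask in player_masks:                  # one boolean H x W mask per player
        ys, xs = np.nonzero(mask)
        if ys.size == 0:
            continue
        y_feet = ys.max()                      # lowest player pixel row in the image
        depth[ys, xs] = depth_field[y_feet, xs]
    return depth
```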
The complete depth model can therewith be described by:

$$D(x,y) = \begin{cases} D_{field}(x,y) & \text{if } (x,y)\in FIELD \ \wedge\ (x,y)\notin P_i\ \forall i\in PLS\\[2pt] D_{aud}(x,y) & \text{if } (x,y)\in AUDIENCE \ \wedge\ (x,y)\notin P_i\ \forall i\in PLS\\[2pt] D_{player}(x,y) & \text{if } (x,y)\in P_i,\ i\in PLS \end{cases} \tag{18}$$

Here D_field, D_aud and D_player denote the depth models as described above.
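A compact sketch of this per-pixel dispatch (Eq. 18), assuming boolean masks for the field, audience and player regions have been computed; the function and argument names are illustrative.

```python
import numpy as np

def combine_depth_models(d_field, d_aud, d_player, field_mask, audience_mask, player_mask):
    """Combine the field, audience and player depth models per Eq. 18.

    All arguments are 2D arrays of identical shape; the masks are boolean.
    """
    depth = np.zeros_like(d_field)
    depth[field_mask & ~player_mask] = d_field[field_mask & ~player_mask]
    depth[audience_mask & ~player_mask] = d_aud[audience_mask & ~player_mask]
    depth[player_mask] = d_player[player_mask]
    return depth
```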
The depth map D(x,y) can now be quantized into an n-bit depth map by

$$\mathrm{DEPTH}(x,y) = (2^n - 1)\cdot\frac{D_{max} - D(x,y)}{D_{max} - D_{min}} \tag{19}$$

where D_min and D_max are the minimum and maximum depth respectively. These can either be constant and specified by a user, or they can be adapted at each frame to the minimum and maximum depth of D(x,y).
The disparity map generator 40 then generates a disparity map from the depth map D(x,y) or from the quantized depth map DEPTH(x,y). For a comfortable depth experience, the disparity should not exceed 3% of the horizontal image resolution. This rule of thumb is used in 3D shooting for live events. Therefore, the disparity map is calculated from the depth map by:

$$\mathrm{DISPARITY}(x,y) = \frac{\mathrm{DEPTH}(x,y)}{2^n - 1}\cdot 0.03\,M\cdot a_{user} \tag{20}$$

Therein n is the number of bits in the depth map, M is the horizontal resolution of the image, and a_user is a factor of which the value can be controlled by an operator at the transmitter or receiver.
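A sketch of the quantization and disparity steps as reconstructed in Eqs. 19-20; the exact scaling of the original equations is partly inferred, and the parameter names are illustrative.

```python
import numpy as np

def quantize_depth(D, d_min, d_max, n_bits=8):
    """Quantize the metric depth map D(x,y) into an n-bit depth map (Eq. 19)."""
    levels = 2 ** n_bits - 1
    return np.clip(levels * (d_max - D) / (d_max - d_min), 0, levels)

def disparity_map(depth_quantized, horizontal_resolution, a_user=1.0, n_bits=8):
    """Disparity from the quantized depth, capped at about 3% of the image width (Eq. 20)."""
    levels = 2 ** n_bits - 1
    return depth_quantized / levels * 0.03 * horizontal_resolution * a_user
```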
In an alternative embodiment of a system according to the first aspect of the invention, depth is assigned in lines parallel to the field boundary line, and either the x-axis or the z-axis is defined parallel to this field line. For that purpose the field boundary line with the longest line segment is chosen.
A practical algorithm to select the most suitable field boundary line is presented below. Therein the field boundary line to be selected is detected on the basis of the slopes of the line segments. It is assumed that the slope of the field boundary line which is parallel to the z-axis is larger than the slope of the field boundary line that is parallel to the x-axis. Assuming the longest line segment has slope a_l and the other line segment has slope a_s, the classification of the longest line segment with slope a_l is as follows:

if ( MAX(a_l, a_s) - MIN(a_l, a_s) > 0.05 )
{   // The line parallel to the x-axis and the line parallel to the z-axis are both present.
    // The line with slope a_l is classified as follows:
    if ( |a_l| < |a_s| )  the line with slope a_l is parallel to the x-axis;
    else                  the line with slope a_l is parallel to the z-axis;
}
else
{   // Only one of the two lines is present, but it is not known which one it is.
    if ( |a_l| > 0.07 )   the line with slope a_l is parallel to the z-axis;
    else                  the line with slope a_l is parallel to the x-axis;
}
Parallax generation can also be done with depth image based rendering schemes. In this case the 2D camera, which has registered the original image, is considered to be the Left (or Right) camera of a stereoscopic 3D camera. The Right (or Left) camera is missing. Knowing the focal length of the Left (or Right) camera, and choosing a baseline distance (BLD), i.e. the distance between the Left and Right camera in the 3D camera, the image of the Right (or Left) camera can be generated as described in Liang Zhang; Tam, W. J., "Stereoscopic image generation based on depth images for 3D TV," IEEE Transactions on Broadcasting, vol. 51, no. 2, pp. 191-199, June 2005. In this case the disparity can be calculated by:

$$\mathrm{DISPARITY} = f\cdot\frac{BLD}{\mathrm{Depth}} \tag{21}$$

where f is the camera's focal length, BLD the baseline distance (i.e. the distance between the two cameras of the 3D camera) and Depth the depth. The focal length f can be directly available from metadata or estimated from the scene. The variable Depth for a pixel used in this embodiment is not the depth defined by D in FIG. 7, but the depth D0H of the line L0, L1 that comprises the pixel. This is the depth of the line where it crosses the vertical center line of the sensor. Accordingly, the Depth is equal to D0H for all pixels on a line in the image parallel to the selected field boundary line. Therewith the depth computation is simplified in that it is not necessary to use equation 11. Within the field F the depth Depth can simply be calculated with equation 7. Beyond the field boundary, comprising the audience region, the depth can be calculated with equation 16.
In this embodiment of the invention the depth, and thus the disparity, is constant on horizontal lines of the image inside the field region. The depth should now be assigned to all pixels on a horizontal line which belong to the field. Pixels on a horizontal line which do not belong to the field region either belong to an entity, referred to as player in a typical embodiment, or to the audience region, in which case the audience depth model described in this patent application is again valid.
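A one-function sketch of Eq. 21, assuming f, BLD and Depth are expressed in consistent units; Depth may be a scalar or an array holding D0H for every pixel of a line.

```python
import numpy as np

def dibr_disparity(focal_length, baseline_distance, depth):
    """Per-pixel disparity for depth-image-based rendering (Eq. 21)."""
    return focal_length * baseline_distance / np.asarray(depth, dtype=float)
```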
Components of the pattern recognition unit 10 in an embodiment of the system are now described in more detail. In an embodiment the detector 110 for detecting the carrier surface is for example a field detector as described in K. Seo et al., "An intelligent display scheme of soccer video on mobile devices," IEEE Transactions on Circuits and Systems for Video Technology, vol. 17, no. 10, pp. 1395-1401, October 2007. This detector is accurate, robust to field shadows and relatively fast. Other suitable detectors for this purpose are described for example by A. Ekin et al. in "Automatic soccer video analysis and summarization," IEEE Transactions on Image Processing, vol. 12, pp. 796-807, July 2003, and by K. Wan et al., "Real-time goal-mouth detection in MPEG soccer video," in Proceedings of the Eleventh ACM International Conference on Multimedia, New York, NY, USA, November 2003, pp. 311-314. The field detector described by Seo has a training phase, during which it accumulates histograms for each component in the HSV color space over 30 frames. If 70% of all pixels fall into the yellow/green range of the Hue histogram bins, learning the field's color is assumed to be successful. From the Saturation histogram and the Value histogram the saturation mean (SatMean) and the peak value index (PeakValInd) are calculated respectively. Finally the input image is classified as follows:

$$\mathrm{Field}(x,y) = \begin{cases} 1, & \text{if } G(x,y) > 0.95\,R(x,y)\ \wedge\ G(x,y) > 0.95\,B(x,y)\ \wedge\ V(x,y) < 1.25\cdot\mathrm{PeakValInd}\ \wedge\ S(x,y) > 0.8\cdot\mathrm{SatMean}\\[2pt] 0, & \text{otherwise} \end{cases} \tag{22}$$
Therein R, G, B denote the color components of the RGB color space and V and S denote the Value and Saturation respectively. V and S are defined as

V = max(R,G,B), and
S = (max(R,G,B) - min(R,G,B)) / V if max(R,G,B) - min(R,G,B) ≠ 0,
S = 0 if max(R,G,B) - min(R,G,B) = 0.

In order to prevent misdetections due to noise in the components V and S when the difference max(R,G,B) - min(R,G,B) is small, V and S are set to 0 (undefined) if this difference is less than a threshold value HSVthd. This situation may occur in nearly white and nearly black regions, i.e. when R, G and B are close to 0 or 255, e.g. in parts of the image representing the players or the audience. Therefore H and S are set to zero, i.e. undefined, when max(R,G,B) - min(R,G,B) < HSVthd. The histograms from which the values PeakValInd and SatMean are derived are obtained in a training phase of the system at start-up. The training phase may be repeated after a time interval depending on the rate at which the lighting conditions vary. For example the training phase may be repeated every 2 minutes to adapt to temporal variations in lighting conditions. In the same manner the system may be trained to recognize another background, e.g. the surface of a tennis court or a water surface.
The Hue value H for a pixel may be calculated as follows from the components R, G, B:

$$\alpha = \tfrac{1}{2}(2R - G - B) \tag{23a}$$
$$\beta = \tfrac{\sqrt{3}}{2}(G - B) \tag{23b}$$
$$H = \operatorname{atan2}(\beta, \alpha) \tag{23c}$$

Therein atan2(β,α) is the angle in radians between the positive x-axis of a plane and the point given by the coordinates (α,β) therein. The variable H is merely used to determine when the training phase is completed.
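For illustration, a compact NumPy sketch of this field classification, following the reconstruction of Eqs. 22-23 given above (in particular the green-dominance conditions are inferred); SatMean and PeakValInd are assumed to have been learned in the training phase, and the HSVthd value is illustrative.

```python
import numpy as np

def classify_field(rgb, sat_mean, peak_val_ind, hsv_thd=20):
    """Binary field map Field(x,y) per the reconstructed Eq. 22.

    rgb : H x W x 3 array with R, G, B planes in the range 0..255.
    """
    r, g, b = (rgb[..., i].astype(float) for i in range(3))
    mx = np.maximum(np.maximum(r, g), b)
    mn = np.minimum(np.minimum(r, g), b)

    v = mx
    denom = np.where(v == 0, 1.0, v)
    s = np.where(mx > mn, (mx - mn) / denom, 0.0)

    # Suppress noisy V and S in nearly white / nearly black regions.
    valid = (mx - mn) >= hsv_thd
    v = np.where(valid, v, 0.0)
    s = np.where(valid, s, 0.0)

    field = (
        (g > 0.95 * r) & (g > 0.95 * b)
        & (v < 1.25 * peak_val_ind)
        & (s > 0.8 * sat_mean)
    )
    return field.astype(np.uint8)
```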
The field boundary may further be demarcated by advertisement boards.
The second module 120 uses the field map obtained from the first module 110 as an input to determine the field boundary. In practice it may happen that some pixels outside the field boundary are incorrectly classified as field pixels due to the presence of elements in the audience region having a colour similar to that of the field. In order to prevent such misdetections a median filter is applied to the field map Field(x,y). A suitable window size of the median filter is for example in the range of 9x19 to 35x35 pixels, typically 21x21 pixels. This yields a corrected field map Fieldm(x,y) having the value 1 if the majority of the pixels of the original field map Field(x,y) within the window of the median filter is classified as field pixels, and having the value 0 otherwise. This results in a field map Fieldm(x,y) in which typically all noisy isolated misdetections in the audience region are removed while the field boundary is preserved.
Starting from this corrected field map Fieldm(x,y), for every x-coordinate the first y-coordinate from the top of the image is searched where a field pixel is encountered, i.e. a pixel for which Fieldm(x,y) = 1. This results in the boundary vector FB(x) having the value ymax, wherein ymax is the highest y-coordinate for which Fieldm(x,y) = 1.
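A brief sketch of this boundary scan, assuming Fieldm(x,y) is a binary array whose row index 0 corresponds to the top of the image; the median filtering uses SciPy for brevity and the names are illustrative.

```python
import numpy as np
from scipy.ndimage import median_filter

def field_boundary(field_map, window=(21, 21)):
    """Corrected field map and boundary vector FB(x) from the binary Field(x,y) map."""
    # Majority vote in a window: removes isolated misdetections in the audience region.
    field_m = median_filter(field_map.astype(np.uint8), size=window)

    height, width = field_m.shape
    fb = np.full(width, height - 1, dtype=int)   # default: boundary at the bottom row
    for x in range(width):
        rows = np.flatnonzero(field_m[:, x])      # rows where Fieldm(x,y) == 1
        if rows.size:
            fb[x] = rows[0]                       # first field pixel from the top
    return field_m, fb
```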
In practice deviations in the boundary vector FB(x) may occur due to various causes, such as the presence of advertising boards and players moving along the boundary of the field. In an embodiment a further improvement is obtained by a piecewise linear curve fitting method.
A suitable method for this purpose is a probabilistic Hough transform as described by J. Matas et al. in "Robust detection of lines using the progressive probabilistic Hough transform," Computer Vision and Image Understanding, vol. 78, no. 1, pp. 119-137, April 2000. For the example presented here, wherein the scene is a soccer game, but also for other games played on a rectangular playing area, it was found most suitable to model the field boundary by two line segments.
The probabilistic Hough transform of Matas (also denoted as pHT) is applied on the set of pixels formed by the field boundary vector (x, FB(x)). If the number of votes in the Hough space for a line is larger than a threshold, the line is selected. In this example the threshold was set to 50. Image points lying on the selected lines are merged into one line if their distance is smaller than a maximum merging distance. Line segments smaller than the minimum line length, which is taken to be 60 pixels, are rejected. This results in line segments with similar slope on FB(x), in which the dips due to misdetections and players are ignored, if the minimum line length is chosen large enough. Then the slopes of all line segments are clustered into K clusters with the k-means clustering algorithm. For each of the K clusters the linear least squares fit is calculated, using the image points on the line segments found with the pHT as samples, to get a more accurate boundary approximation. This results in K lines, of which line j consists of a candidate slope a_cand,j and a candidate offset b_cand,j. Then the intersection points of the K lines are calculated and, of all possible piecewise linear approximations, the one with the lowest squared error on FB(x) is selected. Then, the piecewise linear boundary estimated for the previous picture is updated according to

$$a_{j,n+1} = (1-\alpha)\cdot a_{j,n} + \alpha\cdot a_{cand,j}, \qquad b_{j,n+1} = (1-\alpha)\cdot b_{j,n} + \alpha\cdot b_{cand,j}, \qquad \forall j\in\{1,\ldots,K\} \tag{24}$$
Therein α denotes the update speed of the line parameters and n is the picture number. Eq. 24 acts as a low-pass filter on the line parameters, which filters out fast temporal variations in the parameters to get a temporally stable boundary. The updated boundary at frame n+1 is rejected if its squared error on FB(x) is larger than that of the previous boundary. The advantage of this method is that dips in FB(x), caused by players overlapping with the audience boundary, are rejected in the Hough transform by the minimum line length constraint. In practice the number of line segments is set to 2, as this gives a reasonable approximation of the field boundary and requires a relatively modest computational load.
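A minimal sketch of the temporal smoothing of Eq. 24 for a single line segment; the value of the update speed α is illustrative.

```python
def update_line(prev, candidate, alpha=0.2):
    """Low-pass filter the (slope, offset) parameters of a boundary line (Eq. 24)."""
    a_prev, b_prev = prev
    a_cand, b_cand = candidate
    return ((1 - alpha) * a_prev + alpha * a_cand,
            (1 - alpha) * b_prev + alpha * b_cand)
```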
As an alternative to the pHT, a piecewise linear curve fit may be applied as described by A. Cantoni, "Optimal curve fitting with piecewise linear functions," IEEE Transactions on Computers, vol. C-20, no. 1, pp. 59-67, January 1971. Using weighted least squares this method optimally fits a piecewise linear function to a curve, given the number of linear segments and their intervals. The intervals are found by randomly initializing the curve fitting method with different intervals and selecting the intervals which give the lowest squared error over a sufficient number of initializations.
Accordingly, the piecewise linear curve fit method aims to fit N connected line segments to all data pairs (x, FB(x)). The resulting line fit LINE(x), which is merged from all line segments and has the minimum squared error Σ(FB(x) - LINE(x))², is selected as the output result.
Since for practical purposes the field boundary is typically modelled by two line segments only, their optimal intersection point can be found by applying an exhaustive search on a downsampled FB(x) vector. Advantages of this algorithm are that it requires neither the Hough transform nor k-means clustering. Furthermore, a low weight can be assigned to data points that are unreliable. The number of line segments may be one or more. However, also in this embodiment the number of line segments is typically set to 2, as this generally gives a reasonable approximation of the field boundary and requires a relatively modest computational load.
The reference entity identification module 130 is described in more detail below with reference to FIG. 10. In a typical embodiment the reference entity is a human player. The reference entity identification module 130 may identify the players by combining the field map and the field boundary as follows. In a first selection module 132 all pixels which are not marked as field and which are below the field boundary are labeled as possible reference entity in the map RE1(x,y). Players who partly overlap with the audience region cause dips in the field boundary. By the piecewise linear approximation of the field boundary, parts of the players which are in the field region can still be detected. A connected component module 134 identifies connected components in the map RE1(x,y). The connected component module may for example apply a connected component algorithm as described by S. Suzuki and K. Abe in "Topological structural analysis of digitized binary images by border following," Computer Vision, Graphics, and Image Processing, vol. 30, no. 1, pp. 32-46, April 1985. The connected component algorithm is applied on the map RE1(x,y). Subsequently, a bounding box generating module 136 constructs a bounding box (BB, see FIG. 12) around every connected component (CC, see FIG. 12) in the map CC(x,y), and generates a signal BB(x,y) as shown in FIG. 10.
Decision module 138 further improves the detection of the reference entities by verifying various constraints and removing components from the connected component map CC(x,y) that do not meet these constraints. For example, the decision module may verify the following features:
- the aspect ratio Ra of the bounding box,
- the number of pixels Nbb of the bounding box of the connected component (absolute or relative to the total number of pixels in the image Im(x,y)), and
- the ratio between the number Ncc of connected pixels in the bounding box and Nbb.

The aspect ratio Ra of the bounding box is defined as Ra = BBheight / BBwidth, wherein BBheight and BBwidth respectively are the height and the width of the bounding box.
Practical values for constraints to be set depend on the application for which the depth map generation system is used. In a typical example, wherein the scene for which the depth map is constructed is a field game, such as soccer, the following constraints proved to be useful.
5 Cond: (0.4<Ra<3) & (Nbb > γ M.N) & (Ncc/Nbb > 0.25).
The factor γ is typically in a range of about 0.0002 to 0.0010, e.g. 0.0005. For example in an image having a resolution of 960x540 pixels the minimal value of Nbb is set to 160 pixels. The skilled person may apply other boundary settings to said constraints to other scenes, e.g. scenes relating to swimming or sailing.
A connected component CC is only identified as a player if it complies with this condition.
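For illustration, the condition could be checked per connected component as in the sketch below. The threshold values are the example values quoted above, the helper name is hypothetical, and this sketch reads Nbb as the pixel count of the bounding box and Ncc as the pixel count of the connected component, so that Ncc/Nbb is the fill ratio of the box; this is one plausible reading of the condition, not necessarily the only one.

```python
def accept_as_player(bbox, component_pixels, image_shape, gamma=0.0005):
    """Check the three example constraints of Cond for one connected component.

    bbox             : (x, y, width, height) of the bounding box
    component_pixels : Ncc, number of pixels of the connected component
    image_shape      : (M, N), the image dimensions
    """
    _, _, bb_width, bb_height = bbox
    ra = bb_height / float(bb_width)      # aspect ratio BBheight / BBwidth
    nbb = bb_width * bb_height            # pixels covered by the bounding box
    m, n = image_shape
    return (0.4 < ra < 3.0) and (nbb > gamma * m * n) \
        and (component_pixels / float(nbb) > 0.25)
```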
Instead of identifying the reference entities as described with reference to FIG. 10, persons skilled in the art of image processing may apply other segmentation methods. For example, players in a field or other moving entities may be discriminated from the background using motion segmentation, color segmentation, edge detection, focus measures, or a combination of these segmentation methods.
FIG. 11 schematically shows a method of estimating a depth map according to a second aspect of the invention. In a first step S1 a two-dimensional image Im(x,y) is obtained with a camera. In a next step S2 a field detection is applied, wherein it is determined for each pixel whether or not it represents a carrier surface for the reference entities (e.g. a playing field, water surface etc.). The result of this determination is a binary image Field(x,y). In a subsequent step S3 a boundary FB(x,y) of the carrier surface is determined. Subsequently, in step S4 a depth offset calculation takes place. In this calculation reference entities having an at least approximately known average size L are recognized in the image and their depth is estimated using their apparent size and their approximately known size L. Further, the depth offset Do is calculated on the basis of the position of the recognized reference entities and their estimated depth. The depth calculation also uses camera data f, βt, βp, mx, my, which may be obtained from the camera or may be estimated from the image Im(x,y). Steps S2 and S3 are not essential to enable step S4; however, these steps do facilitate the recognition of the reference entities and the calculation of their depth, as will be shown in the sequel. In step S5 a depth map D(x,y) is calculated for the image using said camera data and the depth offset Do calculated in step S4.
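A minimal sketch of how steps S1 to S5 could be strung together is given below. Every helper name is a hypothetical placeholder for the corresponding module described in the text, the camera data are passed as a plain dictionary, and the 1.8 m player height is only an illustrative assumption for the approximately known size L.

```python
def estimate_depth_map(im, camera, player_height=1.8):
    """Hypothetical orchestration of steps S1-S5 (all helper names are
    placeholders, not part of the described embodiment).

    im            : two-dimensional input image Im(x,y) (step S1)
    camera        : dict with focal length f, tilt, pan and pixel densities mx, my
    player_height : assumed average true size L of the reference entities
    """
    field = detect_field(im)                          # S2: binary map Field(x,y)
    fb = detect_field_boundary(field)                 # S3: field boundary FB(x)
    players = detect_reference_entities(im, field, fb)

    # S4: depth per recognized entity from its apparent size and known size L,
    #     then the depth offset Do from the entity positions and depths
    depths = [depth_from_apparent_size(p.height_px, player_height,
                                       camera["f"], camera["my"]) for p in players]
    d0 = combine_depth_offsets(players, depths)       # e.g. median over entities

    # S5: depth map D(x,y) over the carrier surface from camera data and Do
    return render_depth_map(im.shape, camera, d0)
```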
The system and method according to the present invention were applied to soccer game images. FIG. 12 shows in a left and a right column intermediate results obtained for two exemplary long shot input images. The top image in each column is a grayscale representation of the original input image Im(x,y).
The middle and lower image in each of the columns show an intermediate image that is segmented into field and non-field pixels and has superposed thereon the field boundary line FB(x). Furthermore the connected components CC identified by module 134 and their bounding boxes BB constructed by module 136 are shown therein. In the middle images the boundary line FB(x) is generated with the probabilistic Hough transform method described above. In the lower images the boundary line FB(x) is generated with the optimal curve fit method described above.
In the encircled portion of the lower image in the left column of FIG. 12 it can be seen that the optimal piecewise linear curve fit method is more sensitive to player-audience overlap at the borders of the image than the pHT based method. Due to the player overlap with the audience, the optimal piecewise linear curve fit method may incidentally identify the field boundary line erroneously. Such errors caused by player-audience overlap are discarded by the pHT based method. The optimal curve fit method however relies on the border pixels in the least squares fit, resulting in an incorrect field boundary segment. The second column of FIG. 12 shows the case where the field boundary consists of 2 distinct field lines. If one line segment is significantly shorter than the other, the pHT based method only detects a single line, namely when the shorter line is shorter than the minimum line length. The optimal curve fit method relies on the border pixels and therefore it does not suffer from this problem.
FIG. 12 also demonstrates the operation of the connected component detection module 134 and the bounding box generating module 136. Examples of detected connected components are indicated by CC and their bounding boxes by BB. FIG. 12A and 12B show examples of the operation of the decision module. Components in the image that are accepted as representing a reference entity are indicated as "OK".
Connected components CC that are discarded are indicated by "NOK". From this figure we observe that all BBs satisfying the criteria in Eq. 22 are detected as players. Even most players which overlap with the audience region are detected. It can also be noticed, however, that field line interference causes false player detections. Outliers in the estimated depths Do,i caused by these false detections will, however, be discarded by the median filter module.
Camera zooming will either shrink or grow objects in the image. According to viewers' perception during zooming, the camera moves away from or approaches the objects. However, during zooming the distance of an object to the camera, and thus the depth according to the camera model used, remains unchanged, as the effective focal length also varies. This is counter-intuitive. By simply keeping the focal length f fixed in the calculation of the depth offset Do, the desired perceptive effect can be obtained. Therefore the actual value of f, which remains constant, is used in the field depth calculation. Zooming directly affects Do, since objects in the image either grow or shrink. In FIG. 13, Do is plotted per picture in a time interval in which the camera was zooming out and slightly panning too. Since Do was calculated from the only 2 players in the pictures, who were changing pose, the depth offset exhibits some variation from frame to frame. It can be noticed, however, that according to the depth model the depth offset increases. This corresponds well to the human depth perception during camera zooming. When the depth-image-based rendering method described in Liang Zhang and W.J. Tam, "Stereoscopic image generation based on depth images for 3D TV," IEEE Transactions on Broadcasting, vol. 51, no. 2, pp. 191-199, June 2005, is used, instead of keeping the focal length fixed its actual value should be used, since the perceptive effect that is desired during zooming is already included in the disparity calculation of equation 21 through the actual value of f.
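As a rough illustration of this point, the size-to-depth relation for a reference entity can be written with a simple pinhole model; the sketch below is an assumed form of that relation, not a formula quoted from the embodiment, and f_fixed denotes the constant focal length used only for the depth-offset calculation.

```python
def depth_from_apparent_size(height_px, true_size, f_fixed, my):
    """Assumed pinhole relation: an object of true size L imaged over h pixels
    at focal length f and vertical pixel density my lies at distance
    Z = f * my * L / h.  Keeping f_fixed constant makes the estimated depth
    (and hence Do) grow when zooming out shrinks the players, which matches
    the perceptual effect described above."""
    return f_fixed * my * true_size / float(height_px)
```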
The depth offset estimation is likely to improve when the number of players in the picture increases. Furthermore, temporally averaging the Do,i values over several pictures will result in a smoother estimate of Do.
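A hedged sketch of such a combination step is given below: a median over the per-player estimates Do,i of each picture, followed by a short moving average over pictures. The class name and the window length are illustrative choices, not part of the described embodiment.

```python
from collections import deque
import numpy as np

class DepthOffsetSmoother:
    """Median over the per-player estimates Do,i of one picture, followed by a
    moving average over the last few pictures."""

    def __init__(self, window=10):
        self.history = deque(maxlen=window)      # recent per-picture offsets

    def update(self, per_player_offsets):
        if len(per_player_offsets) > 0:          # at least one detected player
            self.history.append(float(np.median(per_player_offsets)))
        return float(np.mean(self.history)) if self.history else None
```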
FIG. 14 shows a comparison between depth maps obtained with the system according to the present invention as described with reference to FIG. 4 to 10 and with the system known from US2009/0196492 A1. In the first column FIG. 14 shows 4 typical long shot images of a soccer game. In the second column FIG. 14 shows the respective depth maps obtained from these long shot images by the known system.
In the third column FIG. 14 shows the respective depth maps obtained from the long shot images by the system according to the present invention. Since the camera pan and tilt angle and the focal length were not available, they were empirically estimated for each image. Panning angles were estimated from the direction of the boundary lines visible in the image. Tilting angles were estimated from the vertical position of the field boundary lines in the image. An effective focal length was estimated from the sizes of the players in the image. These estimations may also be implemented by automatic methods. In the middle and right column of FIG. 14 the estimated depth D(x,y) is represented by a grey value that increases linearly with the depth. The grey value black represents the shortest distance and the grey value white represents the longest distance on the scale.
It can be seen in the middle column of FIG. 14, illustrating the depth maps obtained with the known method, that the depth distribution is independent of the panning angle of the camera.
Turning now to the right column of FIG. 14, it can be seen that for a pan angle βp = 0 of the camera the depth is distributed mirror-symmetrically with respect to the vertical center line of the image. When the camera pans to the left, the depth increase in the right image half is smaller than the depth increase in the left image half, and vice versa for a camera pan to the right, as shown in FIG. 14(b)-14(d). Therewith the depth map generated with the system according to the present invention corresponds qualitatively to the actual depth distribution in the original scene.
In the claims the word "comprising" does not exclude other elements or steps, and the indefinite article "a" or "an" does not exclude a plurality. A single component or other unit may fulfill the functions of several items recited in the claims. Components with a data processing task may be implemented in dedicated hardware, by a suitably programmed general purpose processor, but also by partly programmable dedicated hardware. The mere fact that certain measures are recited in mutually different claims does not indicate that a combination of these measures cannot be used to advantage. Any reference signs in the claims should not be construed as limiting the scope.

Claims (15)

1. A system for generating a depth map from a two-dimensional input image (Im(x,y)), the system comprising: a pattern recognition unit (10) for recognizing reference entities depicted in the two-dimensional input image (Im(x,y)) and for determining a depicted size of recognized instances of said depicted reference entities, said reference entities further having a true size of which at least an average value is at least approximately known; a depth calculation unit (20) arranged for estimating a depth coordinate of a position of said reference entities from said determined size and from said knowledge of their true size; and a depth map generation unit (30) for generating a depth map (D(x,y)) from said estimated depth and from a specification of a surface (F) carrying the reference entities.

2. The system according to claim 1, further comprising a parallax calculation unit (40) for calculating a parallax (P(x,y)) from said depth map (D(x,y)), and a stereoscopic image generator (50) for generating at least a second two-dimensional image (Im2(x,y)) from said two-dimensional input image (Im(x,y)) and the calculated parallax (P(x,y)).

3. The system according to claim 2, further comprising a 3D display facility (60) for displaying a 3D image using the two-dimensional input image (Im(x,y)) and the generated second two-dimensional image (Im2(x,y)).

4. The system according to claim 1, wherein the depth calculation unit (20) has a first facility for calculating respective estimates (Do,i) of a depth coordinate of a point of the surface (F) imaged at a reference position, from the estimated depth coordinates of the reference entities (i).

5. The system according to claim 4, wherein the depth calculation unit (20) has a combination facility (230) for generating a combined estimated value (Do) for the depth coordinate at said reference position.

6. The system according to claim 5, wherein said combination facility (230) is a median filter, an average filter, a weighted average filter or an alpha-trimmed average filter.

7. The system according to claim 6, wherein said set of respective estimates (Do,i) comprises estimates from a plurality of image frames.

8. The system according to claim 1, comprising a segmentation facility that segments the image into functional regions.

9. The system according to claim 8, wherein the segmentation facility comprises a detection facility for detecting connected components in the input image.

10. The system according to claim 9, wherein the segmentation facility comprises: a bounding box construction facility (136) for constructing a bounding box for individual connected components; a decision module (138) for performing at least one of the following tests: a test whether the aspect ratio of the bounding box lies within a predetermined range, a test whether the number of pixels of the connected component lies within a predetermined range, and a test whether the ratio between the number of pixels of the connected component and the number of pixels in the bounding box lies within a predetermined range; and a decision element for deciding whether the connected component represents a reference entity depending on a result of said at least one test.

11. The system according to claim 8, wherein the segmentation facility comprises a field boundary detection module (120) for detecting a field boundary, and wherein the segmentation facility limits the image data in which the reference entities are detected to a region bounded by said field boundary.

12. The system according to claim 11, wherein said field boundary detection module (120) is arranged to perform a Hough transform.

13. The system according to claim 11, wherein said field boundary detection module (120) is arranged to perform a piecewise linear curve approximation.

14. The system according to claim 1, comprising a look-up table comprising a set of information units that indicate an angle of a field boundary for respective panning angles of a camera that supplies the input image.

15. A method for generating a depth map from a two-dimensional input image, comprising the following steps: recognizing reference entities depicted in the two-dimensional input image, said reference entities having a true size of which at least an average value is at least approximately known; determining a depicted size of recognized instances of said depicted reference entities in the input image and estimating a depth coordinate of a position of said reference entities from said determined size and from said knowledge of their true size; and generating a depth map from said estimated depth and from a specification of a surface carrying the reference entities.
NL2005720A 2010-11-18 2010-11-18 System and method for generating a depth map. NL2005720C2 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
NL2005720A NL2005720C2 (en) 2010-11-18 2010-11-18 System and method for generating a depth map.

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
NL2005720 2010-11-18
NL2005720A NL2005720C2 (en) 2010-11-18 2010-11-18 System and method for generating a depth map.

Publications (1)

Publication Number Publication Date
NL2005720C2 true NL2005720C2 (en) 2012-05-22

Family

ID=44146356

Family Applications (1)

Application Number Title Priority Date Filing Date
NL2005720A NL2005720C2 (en) 2010-11-18 2010-11-18 System and method for generating a depth map.

Country Status (1)

Country Link
NL (1) NL2005720C2 (en)

Effective date: 20140601