GB2571307A - 3D skeleton reconstruction from images using volumic probability data - Google Patents

3D skeleton reconstruction from images using volumic probability data

Info

Publication number
GB2571307A
GB2571307A GB1802950.4A GB201802950A GB2571307A GB 2571307 A GB2571307 A GB 2571307A GB 201802950 A GB201802950 A GB 201802950A GB 2571307 A GB2571307 A GB 2571307A
Authority
GB
United Kingdom
Prior art keywords
skeleton
candidates
candidate
generating
parts
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
GB1802950.4A
Other versions
GB201802950D0 (en)
GB2571307B (en)
Inventor
Le Floch Hervé
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Canon Inc
Original Assignee
Canon Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Canon Inc filed Critical Canon Inc
Priority to GB1802950.4A priority Critical patent/GB2571307B/en
Publication of GB201802950D0 publication Critical patent/GB201802950D0/en
Priority to US16/280,854 priority patent/US11127189B2/en
Publication of GB2571307A publication Critical patent/GB2571307A/en
Application granted granted Critical
Publication of GB2571307B publication Critical patent/GB2571307B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T 7/73 - Determining position or orientation of objects or cameras using feature-based methods
    • G06T 15/00 - 3D [Three Dimensional] image rendering
    • G06T 17/00 - Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T 7/11 - Region-based segmentation
    • G06T 7/143 - Segmentation; Edge detection involving probabilistic approaches, e.g. Markov random field [MRF] modelling
    • G06T 7/162 - Segmentation; Edge detection involving graph-based methods
    • G06T 7/174 - Segmentation; Edge detection involving the use of two or more images
    • G06T 7/277 - Analysis of motion involving stochastic approaches, e.g. using Kalman filters
    • G06T 7/55 - Depth or shape recovery from multiple images
    • G06T 7/593 - Depth or shape recovery from stereo images
    • G06V 10/426 - Graphical representations (global feature extraction for representing the structure of the pattern or shape of an object)
    • G06V 20/647 - Three-dimensional objects by matching two-dimensional images to three-dimensional objects
    • G06V 40/103 - Static body considered as a whole, e.g. static pedestrian or occupant recognition
    • G06V 40/20 - Movements or behaviour, e.g. gesture recognition
    • G06T 2207/10012 - Stereo images
    • G06T 2207/20044 - Skeletonization; Medial axis transform
    • G06T 2207/20072 - Graph-based image processing
    • G06T 2207/30196 - Human being; Person
    • G06T 2207/30221 - Sports video; Sports image

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A method and system for generating a three-dimensional (3D) pose estimation of a real world object, observed by cameras in a scene volume, are described. The method comprises a first step 13 of obtaining two simultaneous source images of the scene volume, then generating, from each source image, one or more part maps 452, 454 for one or more respective parts of the real world object, each part map for a given part comprising part probabilities for respective samples of the source image, representing probabilities that the respective samples correspond to the given part. One or more sets of part volume data for respectively the one or more parts are generated. The generating step for a set includes projecting elementary voxels of the scene volume onto projection samples 405, 454, 455 of the part maps and computing a joint part probability 1112 for each elementary voxel. One or more parts of the 3D skeleton are generated 462 using the sets of part volume data. Bounding boxes 701 may be used to limit the number of projected voxels. Candidate graph elements may be linked by weights 459. Part maps may be generated using affinity fields (801, Fig. 8).

Description

3D SKELETON RECONSTRUCTION FROM IMAGES
USING VOLUMIC PROBABILITY DATA
FIELD OF THE INVENTION
The present invention relates generally to reconstruction of 3D skeletons from views of a 3D real world object.
BACKGROUND OF THE INVENTION
Reconstruction of 3D skeletons, also known as 3D object pose estimation, is widely used in image-based rendering. Various applications for 3D object pose estimation and virtual rendering can be contemplated, including providing alternative views of the same animated 3D object from virtual cameras, for instance a new and more immersive view of a sport event.
Various attempts to provide methods and devices for 3D skeleton reconstruction have been made, including US 8,830,236 and “3D Human Pose Estimation via Deep Learning from 2D Annotations” (2016 Fourth International Conference on 3D Vision (3DV), Ernesto Brau, Hao Jiang). However, the efficiency of the techniques described in these documents remains insufficient in terms of performance, including memory use, processing time (for instance near real time, i.e. less than a few seconds before rendering), and the ability to detect a maximum number of 3D real world objects in the scene.
SUMMARY OF INVENTION
New methods and devices to reconstruct 3D skeletons from source images of the same scene are proposed.
A method for generating a 3D skeleton of a 3D real world object observed by source cameras in a scene volume according to the invention is defined in Claim 1. It comprises the following steps performed by a computer system:
obtaining, from memory of the computer system, two (or more) simultaneous source images of the scene volume recorded by the source cameras;
generating, from each source image, one or more part maps for one or more respective parts of the 3D real world object, each part map for a given part comprising part probabilities for respective samples (e.g. pixels) of the source image representing probabilities that the respective samples correspond to the given part;
generating one or more sets of part volume data for respectively the one or more parts, wherein generating a set of part volume data for a respective part includes:
projecting elementary voxels of the scene volume onto projection samples of the part maps;
computing a joint part probability for each elementary voxel based on the part probabilities of its projection samples in the part maps corresponding to the respective part;
generating one or more parts of the 3D skeleton using the one or more sets of part volume data generated.
Retrieving probabilities for object parts in the 2D space from the source images and then combining them into joint part probabilities in the 3D space makes it possible to reduce the overall complexity of the 3D skeleton reconstruction. In particular, no 2D skeletons need to be generated from the source images and then processed. Also, there is no need to resolve conflicts between two generated 2D skeletons that would not perfectly match in the 3D space.
Robustness of the 3D skeleton reconstruction is obtained through the actual determination of the 3D skeleton from the joint part probabilities, that is, directly from probabilities in the 3D space.
Various applications of the invention may be contemplated, including a method for displaying a 3D skeleton of a 3D real world object observed by source cameras in a scene volume as defined in Claim 18. It comprises the following steps performed by a computer system:
generating a 3D skeleton of the 3D real world object using the generating method above, selecting a virtual camera viewing the scene volume, and displaying, on a display screen, the generated 3D skeleton from the virtual camera viewpoint.
In this context, the invention improves the field of rendering a scene from a new viewpoint.
Correspondingly, a system, which may be a single device, for generating a 3D skeleton of a 3D real world object observed by source cameras in a scene volume according to the invention is defined in Claim 20. It comprises at least one microprocessor configured for carrying out the steps of:
obtaining, from memory of the computer system, two (or more) simultaneous source images of the scene volume recorded by the source cameras;
generating, from each source image, one or more part maps for one or more respective parts of the 3D real world object, each part map for a given part comprising part probabilities for respective samples of the source image representing probabilities that the respective samples correspond to the given part;
generating one or more sets of part volume data for respectively the one or more parts, wherein generating a set of part volume data for a respective part includes:
projecting elementary voxels of the scene volume onto projection samples of the part maps;
computing a joint part probability for each elementary voxel based on the part probabilities of its projection samples in the part maps corresponding to the respective part;
generating one or more parts of the 3D skeleton using the one or more sets of part volume data generated.
Also, a system for displaying a 3D skeleton of a 3D real world object observed by source cameras in a scene volume may be as defined in Claim 21. It comprises the above system to generate a 3D skeleton of the 3D real world object connected to a display screen, wherein the microprocessor is further configured for carrying out the steps of:
selecting a virtual camera viewing the scene volume, and displaying, on a display screen, the generated 3D skeleton from the virtual camera viewpoint.
Optional features of the invention are defined in the appended claims. Some of these features are explained here below with reference to a method, while they can be transposed into system features dedicated to any system according to the invention.
In embodiments, the method may further comprise using a first set of part volume data to restrict an amount of elementary voxels to be projected on part maps to generate a second set of part volume data. This approach aims at reducing computational costs of the overall method.
In specific embodiments, using the first set of part volume data includes:
determining part candidates of the 3D real world object from the first set of part volume data, defining bounding 3D boxes around the determined part candidates in the scene volume, wherein the amount of elementary voxels to be projected on the part maps to generate a second set of part volume data is restricted to the defined bounding boxes.
The bounding boxes thus define various sub-volumes where 3D objects are detected. Using such bounding boxes advantageously allows independent processing to be performed on each of them, thereby reducing complexity.
In other embodiments, generating a part map from a source image for a respective part includes:
obtaining one or more scaled versions of the source image, generating, from each of the source image and its scaled versions, an intermediary part map for the respective part, the intermediary part map comprising part probabilities for respective samples of the source image or its scaled version representing probabilities that the respective samples correspond to said part, and forming the part map with, for each sample considered, the highest part probability from the part probabilities of the generated intermediary part maps for the same sample considered.
The approach seeks to increase robustness of 3D object detection, and thus of 3D skeleton reconstruction.
In yet other embodiments, the method further comprises generating, from each source image, a part affinity field for the two adjacent parts that includes affinity vectors for respective samples of the source image, the magnitude and direction of each affinity vector representing estimated orientation probability and orientation of an element connecting, according to the 3D model, two said adjacent parts at the respective sample in the source image, wherein the weights set for the weighted links are based on the generated part affinity fields.
This approach also increases robustness of 3D skeleton reconstruction. This is because the part affinity fields give additional information on how candidate parts belonging to two adjacent parts should be connected.
Another aspect of the invention relates to a non-transitory computer-readable medium storing a program which, when executed by a microprocessor or computer system in a device, causes the device to perform the method as defined above.
The non-transitory computer-readable medium may have features and advantages that are analogous to those set out above and below in relation to the methods and devices.
At least parts of the methods according to the invention may be computer implemented. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, microcode, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a circuit, module or system. Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer usable program code embodied in the medium.
Since the present invention can be implemented in software, the present invention can be embodied as computer readable code for provision to a programmable apparatus on any suitable carrier medium. A tangible carrier medium may comprise a storage medium such as a hard disk drive, a magnetic tape device or a solid state memory device and the like. A transient carrier medium may include a signal such as an electrical signal, an electronic signal, an optical signal, an acoustic signal, a magnetic signal or an electromagnetic signal, e.g. a microwave or RF signal.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention will now be described, by way of example only, and with reference to the following drawings in which:
Figure 1 is a general overview of a system 10 implementing embodiments of the invention;
Figure 2 illustrates an exemplary 3D model of a 3D real world object, based on which a 3D skeleton of the 3D object can be built;
Figure 3 is a schematic block diagram of a computing device for implementation of one or more embodiments of the invention.
Figure 4 illustrates, using a flowchart, first embodiments of a method for generating a 3D skeleton of a 3D real world object observed by source cameras in a scene volume according to the present invention;
Figure 5 schematically illustrates the splitting of a cuboid into elementary cubes V(X,Y,Z) and an exemplary projection of the latter on a part map according to the invention;
Figure 6 illustrates, using a flowchart, a process for displaying a 3D skeleton of a 3D real world object observed by source cameras in a scene volume according to embodiments of the invention;
Figure 7 illustrates, using a flowchart, second embodiments of a 3D skeleton generating method according to the present invention;
Figure 8 illustrates, using a flowchart, third embodiments of a 3D skeleton generating method according to the present invention;
Figure 9 schematically illustrates a portion of a part affinity field PAF between right foot and right knee in a source image; and
Figure 10 schematically illustrates scalar products to compute weights for graph links according to embodiments of the invention.
DETAILED DESCRIPTION OF EMBODIMENTS
Figure 1 is a general overview of a system 10 implementing embodiments of the invention. The system 10 comprises a three-dimensional (3D) real world object 11 in a scene volume V surrounded by two or more source camera/sensor units 12.
The 3D real world object 11 may be of various types, including beings, animals, mammals, human beings, articulated objects (e.g. robots), still objects, and so on. The captured scene may also include a plurality of 3D objects that may move over time.
Although two main camera units 12a, 12b are shown in the Figure, there may be more of them, for instance about 7-10 camera units, up to about 30-50 camera units in a stadium.
The source camera units 12 generate synchronized videos made of 2D source images 13 (i.e. views from their viewpoints) of the scene at substantially the same time instant,
i.e. simultaneous source images. Each source camera/sensor unit 12 (12a, 12b) comprises a passive sensor (e.g. an RGB camera).
The 3D positions and orientations of the source cameras 12 within a reference 3D coordinates system SYS are known. They are named the extrinsic parameters of the source cameras.
Also, the geometric model of the source cameras 12, including the focal length of each source camera and the position in the image 13 of the orthogonal projection of the center of projection, is known in the camera coordinates system. These are named the intrinsic parameters of the source cameras. The camera model is described as a pinhole model with intrinsic parameters in this description, but a different model could be used without changing the principle of the invention. Preferably, the source cameras 12 are calibrated so that they output their source images of the scene at the same cadence and simultaneously. The intrinsic and extrinsic parameters of the cameras are assumed to be known or calculated using well-known calibration procedures.
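By way of illustration only, the pinhole projection of a 3D point of the scene onto a source image can be sketched as follows. This is a minimal Python sketch assuming the usual 3x3 intrinsic matrix K and extrinsic rotation/translation (R, t); the function and variable names are illustrative and not part of the described system.

    import numpy as np

    def project_point(X_world, K, R, t):
        """Project a 3D point expressed in the SYS coordinate system onto a camera image.

        K    : 3x3 intrinsic matrix (focal lengths and principal point).
        R, t : extrinsic rotation (3x3) and translation (3,) mapping SYS to camera coordinates.
        Returns (x, y) pixel coordinates, or None if the point lies behind the camera.
        """
        X_cam = R @ X_world + t          # world -> camera coordinates
        if X_cam[2] <= 0:                # behind the image plane
            return None
        uvw = K @ X_cam                  # perspective projection
        return uvw[0] / uvw[2], uvw[1] / uvw[2]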
In particular, these calibration procedures allow the 3D object to be reconstructed into a 3D skeleton at the real scale.
The source images 13 feed a processing or computer system 14 according to the invention.
The computer system 14 may be embedded in one of the source cameras 12 or be a separate processing unit. Any communication technique (including Wi-Fi, Ethernet, 3G, 4G, 5G mobile phone networks, and so on) can be used to transmit the source images 13 from the source cameras 12 to the computer system 14.
An output of the computer system 14 is a 3D skeleton for at least one 3D object of the scene. Preferably, a virtual image 13v built with the 3D skeleton generated and showing the same scene with the 3D object or objects from a viewpoint of a virtual camera 12v is rendered on a connected display screen 15. Alternatively, data encoding the 3D skeleton generated may be sent to a distant system (not shown) for storage and display, using for instance any communication technique. Stored 3D skeletons may also be used in human motion analysis for video monitoring purposes for instance.
Figure 2 illustrates an exemplary 3D model 20 of a 3D real world object, based on which a 3D skeleton of the 3D object may be built according to the teachings of the present invention. In the example of the Figure, the 3D object is an articulated 3D real world object of human being type. Variants may concern still objects.
The 3D model comprises N distinct parts 21 and N-1 connecting elements or links 22. The parts 21 represent modeled portions of the 3D real world object, for instance joints (shoulders, knees, elbows, pelvis, ...) or end portions (head, hands, feet) of a human being. Each part 21 is defined as a point or “voxel” in the 3D coordinates system SYS. The connecting elements 22 are portions connecting the parts 21, for instance forearm, arm, thigh, trunk and so on. Each connecting element 22 can be represented as a straight line through 3D space between the two connected parts, also named “adjacent parts”.
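By way of illustration, such a 3D model can be stored as a simple data structure listing the parts 21 and the connecting elements 22 as pairs of adjacent parts. The following Python sketch shows one possible 13-part human model; the exact part names are illustrative.

    # Illustrative 13-part human model: parts 21 and connecting elements 22 (adjacent pairs).
    PARTS = [
        "head", "neck", "right_shoulder", "right_elbow", "right_hand",
        "left_shoulder", "left_elbow", "left_hand", "pelvis",
        "right_knee", "right_foot", "left_knee", "left_foot",
    ]
    LINKS = [
        ("head", "neck"),
        ("neck", "right_shoulder"), ("right_shoulder", "right_elbow"), ("right_elbow", "right_hand"),
        ("neck", "left_shoulder"), ("left_shoulder", "left_elbow"), ("left_elbow", "left_hand"),
        ("neck", "pelvis"),
        ("pelvis", "right_knee"), ("right_knee", "right_foot"),
        ("pelvis", "left_knee"), ("left_knee", "left_foot"),
    ]
    assert len(PARTS) == 13 and len(LINKS) == len(PARTS) - 1   # N parts, N-1 connecting elements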
To generate the 3D skeleton or skeletons of the scene volume, i.e. to know where each part of the 3D real world object or objects is located in 3D within the scene volume V, an idea of the present invention consists in retrieving probabilities from the source images to detect parts of the 3D objects before merging them in 3D space. The merged probabilities can then be used to robustly detect the parts in the 3D space, i.e. in the scene volume V.
This approach advantageously reduces the complexity of the 3D skeleton reconstruction, in particular of the processing at the 2D level (i.e. on the source images), including avoiding conflict resolution between conflicting parts detected from different source images. It follows that real-time reconstruction (and thus display or human motion analysis, for instance) is more readily achieved. Real-time reconstruction for “live” TV or broadcast purposes may include a delay of a few seconds, e.g. less than 10 seconds, preferably at most 4 or 5 seconds.
The inventors have also noticed that the approach works efficiently on complex scenes (like sport events with multiple players in a stadium), with an ability to detect a large number of interacting 3D objects (multiple human players).
To that end, two or more simultaneous source images 13 of the scene volume V recorded by the source cameras 12 are first obtained. They may be obtained from memory of the computer system.
The position and orientation of the scene volume V captured are known in the 3D coordinates system SYS (for instance the 3D shape is known, typically a cuboid or cube, and the 3D locations of four of its vertices are known).
Next, from each source image, one or more part maps are generated for one or more respective parts of the 3D real world object. If various parts are present in the 3D model 20, various part maps can be generated from the same source image.
Each part map for a given part comprises part probabilities (e.g. an array of probabilities) for respective pixels of the source image representing probabilities, preferably a unary probability, that the respective pixels correspond to the given part.
Pixels of the source image are examples of “samples” forming an image. For ease of illustration, it is made reference below to pixels, while the invention may apply to any sample. A sample may be for instance a pixel in the source image, a color component of a pixel in the source image, a group of pixels in the source image, a group of pixel color components in the source image, etc.
The generated part map may differ in size from the source image, usually having a lower resolution, in which case the part map can be up-sampled to the same resolution as the source image. In case of up-sampling, each part map can thus be a 2D array matching the source image (also a 2D array): a pixel in the part map for a given part (e.g. the head of the 3D human being) takes the probability that the co-located pixel in the source image belongs to such given part (i.e. head in the example). In the case of a lower-resolution part map, a pixel in the part map for a given part may take the probability that a relatively (given the scale) co-located pixel in the source image belongs to such given part, or that a group of relatively (given the scale) co-located pixels in the source image belongs to such given part.
For ease of illustration, it is considered below that the part maps are of the same size as the source image, although the up-sampling process is optional.
In some embodiments, the part map may be filtered by a low-pass filter to extend the influence area of some detected parts when part maps generate strongly localized probabilities. For example Gaussian filtering may be used. This approach improves the process, in particular the actual detection of parts as described below.
From these part maps, one or more sets of part volume data are also generated for respectively the one or more parts. In the invention, generating a set of part volume data for a respective part includes:
projecting elementary voxels of the scene volume onto projection pixels of the part maps. It means that the scene volume V is split into elementary voxels, preferably each elementary voxel representing a cube whose edge length depends on the 3D object (e.g. 1 cm for human beings). Also, the projection matches each elementary voxel with the pixel (referred to as “projection pixel”) of the source image or part map which represents it (i.e. the pixel which views the elementary voxel from the source camera point of view). This matching is a pure geometrical issue based on known intrinsic and extrinsic parameters; and
computing a joint part probability, preferably a unary probability, for each elementary voxel based on the part probabilities of its projection pixels in the part maps corresponding to the respective part. This probability is said to be “joint” because it merges, and thus joins, several probabilities coming from several part maps for the same part. Examples of probability merging are proposed below.
The set of joint part probabilities forms part “volume data” as it can be stored in memory as a 3D matrix matching the scene volume matrix (split into elementary voxels): a voxel in the part volume data for a given part (e.g. the head of the 3D human being) takes the joint probability that the co-located voxel in the scene volume V belongs to such given part (i.e. head in the example). If various parts are present in the 3D model 20, various sets of part volume data can thus be generated.
The part volume data may also be filtered to keep the highest joint part probabilities in order to improve part detection. Such joint part probabilities spread over the scene volume for a given part can then be used to determine the actual occurrence or occurrences of said part in the scene volume (in terms of identification and location). It means that one or more parts of the 3D skeleton can be generated using the one or more sets of part volume data generated, for example where the joint probabilities are locally the highest (local maxima).
As mentioned above, an exemplary application for the present invention may relate to the display of a virtual image 13v showing the same scene from a new viewpoint, namely a virtual camera 12v. To that end, the invention also provides a method for displaying a 3D skeleton of a 3D real world object observed by source cameras in a scene volume. This method includes generating a 3D skeleton of the 3D real world object using the generating method described above.
Next, this application consists in selecting a virtual camera viewing the scene volume and displaying the generated 3D skeleton from the virtual camera on a display screen. In practice, several generated 3D skeletons are displayed simultaneously on the display, for instance when displaying a sport event. A simple 3D object as shown in Figure 2 can be used to display the generated 3D skeleton. This is useful to display animations that require low rendering costs. More promising applications can also provide an envelope to the 3D skeleton with a texture, either predefined or determined from pixel values acquired by the source cameras (for better rendering). This makes it possible, for example, to accurately render shot or filmed sportsmen as they actually look in the scene volume.
Selecting a virtual camera may merely consist in defining the extrinsic and intrinsic parameters of a camera, thereby defining the viewpoint (i.e. distance and direction from the scene volume) and the zoom (i.e. focal length) provided by the virtual image.
Generating the 3D skeletons and displaying/rendering them on the display screen 15 may be performed for successive source images 13 acquired by the source cameras 12. Of course, the displaying follows the timing of acquisition of the source images. It follows that 3D-skeleton-based animations of the captured scene can be efficiently produced and displayed.
Other applications based on the generated 3D skeleton or skeletons may be contemplated. For instance, video monitoring for surveillance purposes of areas, such as a street or a storehouse, may perform detection of 3D skeletons in captured surveillance images and then analyse the movement of these 3D skeletons in order to decide whether to trigger an alarm.
Figure 3 schematically illustrates a device 300 used for the present invention, for instance the above-mentioned computer system 14. It is preferably a device such as a microcomputer, a workstation or a light portable device. The device 300 comprises a communication bus 313 to which there are preferably connected:
- a central processing unit 311, such as a microprocessor, denoted CPU;
- a read only memory 307, denoted ROM, for storing computer programs for implementing the invention;
- a random access memory 312, denoted RAM, for storing the executable code of methods according to the invention as well as the registers adapted to record variables and parameters necessary for implementing methods according to the invention; and
- at least one communication interface 302 connected to a communication network 301 over which data may be transmitted.
Optionally, the device 300 may also include the following components:
- a data storage means 304 such as a hard disk, for storing computer programs for implementing methods according to one or more embodiments of the invention;
- a disk drive 305 for a disk 306, the disk drive being adapted to read data from the disk 306 or to write data onto said disk;
- a screen 309 for displaying data and/or serving as a graphical interface with the user, by means of a keyboard 310 or any other pointing means.
The device 300 may be connected to various peripherals, such as for example source cameras 12, each being connected to an input/output card (not shown) so as to supply data to the device 300.
Preferably the communication bus provides communication and interoperability between the various elements included in the device 300 or connected to it. The representation of the bus is not limiting and in particular the central processing unit is operable to communicate instructions to any element of the device 300 directly or by means of another element of the device 300.
The disk 306 may optionally be replaced by any information medium such as for example a compact disk (CD-ROM), rewritable or not, a ZIP disk, a USB key or a memory card and, in general terms, by an information storage means that can be read by a microcomputer or by a microprocessor, integrated or not into the apparatus, possibly removable and adapted to store one or more programs whose execution enables a method according to the invention to be implemented.
The executable code may optionally be stored either in read only memory 307, on the hard disk 304 or on a removable digital medium such as for example a disk 306 as described previously. According to an optional variant, the executable code of the programs can be received by means of the communication network 301, via the interface 302, in order to be stored in one of the storage means of the device 300, such as the hard disk 304, before being executed.
The central processing unit 311 is preferably adapted to control and direct the execution of the instructions or portions of software code of the program or programs according to the invention, which instructions are stored in one of the aforementioned storage means. On powering up, the program or programs that are stored in a non-volatile memory, for example on the hard disk 304 or in the read only memory 307, are transferred into the random access memory 312, which then contains the executable code of the program or programs, as well as registers for storing the variables and parameters necessary for implementing the invention.
In a preferred embodiment, the device is a programmable apparatus which uses software to implement the invention. However, alternatively, the present invention may be implemented in hardware (for example, in the form of an Application Specific Integrated Circuit or ASIC).
Various embodiments of the present invention are now described with reference to Figures 4 to 10.
Figure 4 illustrates, using a flowchart, first embodiments of a method according to the present invention. The method takes place in the computer system 14 which has previously received M source images 13 acquired simultaneously by M calibrated source cameras 12, for instance through a wireless or a wired network. These source images 13 are for instance stored in a reception buffer (memory) of the communication interface 302.
The method 400 may be repeated for each set of simultaneous source images 13 received from the source cameras 12 at each successive time instant. For instance, 25 Hz to 100 Hz source cameras may be used, thereby requiring a set of source images 13 to be processed every 1/100 to 1/25 second.
The scene volume V viewed by the source cameras 12 is predefined as shown by the volume parameters 401. These parameters position the scene volume in the coordinates system SYS.
The source cameras 12 have been calibrated, meaning their extrinsic and intrinsic parameters 402 are known.
The nature, and thus the 3D model 20, of each 3D real world object 11 in the scene volume V is known. For ease of explanation, the description below concentrates on a single type of 3D object, for instance a human being as modeled in Figure 2. Where the scene volume V contains various types of 3D objects, various corresponding 3D models 20 can be used following the teachings below.
The method starts with the splitting 450 of the scene volume V into elementary voxels V(X,Y,Z) 403, preferably of equal sizes, typically elementary cubes. A size of the elementary voxels may be chosen depending on the 3D object to be captured. For instance, the edge length of each elementary voxel may be set to 1 cm for a human being. Figure 5 schematically illustrates the splitting of a cuboid into elementary cubes V(X,Y,Z), only one of which is shown for the sake of clarity.
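A minimal Python sketch of this splitting step 450 is given below, assuming a cuboid scene volume defined by a corner position and an extent in the SYS coordinate system; the function name and the NumPy representation are illustrative.

    import numpy as np

    def split_scene_volume(origin, size, edge=0.01):
        """Split the cuboid scene volume V into elementary voxel centres V(X,Y,Z).

        origin : (3,) position of one corner of V in the SYS coordinate system (meters).
        size   : (3,) extent of V along X, Y and Z (meters).
        edge   : edge length of the elementary cubes (e.g. 1 cm for human beings).
        Returns an array of shape (nx, ny, nz, 3) holding the voxel centres.
        """
        nx, ny, nz = np.ceil(np.asarray(size) / edge).astype(int)
        xs = origin[0] + (np.arange(nx) + 0.5) * edge
        ys = origin[1] + (np.arange(ny) + 0.5) * edge
        zs = origin[2] + (np.arange(nz) + 0.5) * edge
        X, Y, Z = np.meshgrid(xs, ys, zs, indexing="ij")
        return np.stack([X, Y, Z], axis=-1)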
The splitting 450 may be made once and for all, meaning it is made once and the same split is used for successive sets of source images captured at successive time instants.
The method also starts with the obtaining 451 of two (or more) simultaneous source images of the scene volume recorded by the source cameras. The source images 12 are for instance retrieved from the reception buffer of the communication interface 302.
Although the source images may have different sizes from one source camera to the other, it is assumed they have the same size for illustration purposes. In any case, some source images may be resized to reach this situation. This resizing is not mandatory but helps simplify the description.
From each of these source images 13i, one or more part maps PMipart 404 are generated at step 452 for one or more respective parts 21 of the 3D real world object 11. Typically N part maps are generated (N being the number of parts in the considered 3D model 20). For illustrative purposes, the part map generated for the head (as a part of the 3D object 11) from source image ‘3’ is referenced PM3head.
Each part map PMipart comprises part probabilities PPpart(x,y) for respective pixels of the source image 13i. PPpart(x,y) represents a probability that the respective pixel in the source image 13i corresponds to the respective part ‘part’ of the 3D real world object. If the part map and the source image have the same size, the respective pixel is the pixel at location (x,y) in the source image. Otherwise, it is the relatively (given the scale or sampling factor) co-located pixel.
For instance, it may be the pixel at location (2x, 2y) when the height and width of the part map are half those of the source image.
The part map can be stored as an image having the same size as the source image, wherein each pixel takes the value of the part probability for the co-located pixel in the source image. Therefore there is a direct matching between a source image and the part maps generated from it: the co-located pixels in the part maps correspond to respective probabilities that the co-located pixel in the source image 13 belongs to a respective part of the 3D object as viewed by the camera 12.
The part maps may have a different size/resolution from the source images (e.g. they are sub-sampled compared to the size of the source image). In such a case, the intrinsic parameters of the cameras can be modified taking into account the sub-sampling factor. Another solution consists in interpolating the part maps in order to match the genuine size of the source images. In such a case, a bilinear interpolation is preferred over a nearest-neighbor or bi-cubic interpolation.
In an improved solution, the part maps may be low-pass filtered in order to increase the areas of influence of 2D pixels. For example, Gaussian filtering may be used.
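These two optional operations (up-sampling to the source-image resolution and low-pass filtering) can be sketched as follows; the SciPy-based implementation and the value of sigma are illustrative only.

    from scipy.ndimage import zoom, gaussian_filter

    def prepare_part_map(raw_map, target_shape, sigma=3.0):
        """Bring a raw part map to the source-image resolution and low-pass filter it.

        raw_map      : 2D array of part probabilities at the detector output resolution.
        target_shape : (height, width) of the source image.
        sigma        : standard deviation of the Gaussian filter, in pixels (illustrative).
        """
        scale = (target_shape[0] / raw_map.shape[0], target_shape[1] / raw_map.shape[1])
        up = zoom(raw_map, scale, order=1)          # order=1: bilinear interpolation
        return gaussian_filter(up, sigma=sigma)     # spread strongly localized probabilities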
In the example of Figure 2, thirteen parts compose the 3D model, so thirteen part maps are generated from each source image processed.
Known techniques can be used to produce these part maps from the source images 13.
One technique is described in the publication “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields” by Zhe Cao et al. (2016). This technique calculates confidence maps for part detection which bear probabilities at pixel level as defined above.
Another technique is described in the publication “DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model” by Eldar Insafutdinov et al. (2016).
More generally, a convolutional neural network (CNN) can be used, configured based on a learning library of pictures in which a matching with each part of the models has been made. Running the configured CNN on the source images then identifies occurrences of the parts within the input source images. An advantage of CNNs is that the same run of the CNN can identify, within an input image, parts from different models, provided the CNN has been trained with learning pictures embedding the various models to be searched.
Typically, the part probabilities generated are unary, i.e. set between 0 and 1.
These known techniques are dependent on the set of learning pictures used to train the CNN. To that aim, the learning pictures usually provide exemplary objects that have bounded sizes. These techniques are badly adapted to detecting objects whose size is not of the same order of magnitude as in the learning pictures. Indeed, 3D objects can be sometimes big, sometimes tiny. This is for instance the case during sport events where players move from very close to the camera to very far.
In embodiments seeking to increase robustness, it is thus proposed to improve these known techniques to produce the part maps. An idea is to use scaled versions of the same source image to increase the chances of obtaining high part probabilities.
To that end, one or more scaled versions of a given source image 13 are obtained.
For instance, a half-sized image (scale 0.5) is generated (through down-sampling) as well as a double-sized image (scale 2 - through up-sampling). Known scaling techniques can be used.
Of course, other scaling values can be used. In this example, at least one upscaled version and one downscaled version of the source image are obtained and used. In variants, only up-scaled versions or only downscaled versions are used.
Next, an intermediary part map is generated for the part currently considered, from each of the source image and its scaled versions. This operation is made using any of the above-mentioned known techniques. Thus, the intermediary part map comprises part probabilities for respective pixels of the source image or its scaled version (possibly each pixel if the intermediary part map has the same dimensions as the images), which part probabilities represent probabilities that the respective pixels in the source image or scaled version correspond to said part currently considered.
As the generated part maps are not at the same scale, they are then preferably rescaled to a unique and same scale. For instance, an intermediary part map obtained from an up-scaled source image is downscaled (using the inverse scaling factor), meaning for instance that one part probability out of two is discarded (for a half scaling). Also, an intermediary part map obtained from a downscaled source image is up-scaled (using the inverse scaling factor), meaning for instance that a part probability for a new pixel is determined from the part probabilities of neighboring pixels (e.g. through interpolation).
The obtained intermediary part maps (rescaled to the same scale) are then used to generate the part map for said source image and the part currently considered. In particular, the part map is preferably formed with, for each pixel considered, the highest part probability from the part probabilities of the generated intermediary part maps at the same pixel considered, as illustrated by the sketch given after the example below.
For instance, for a pixel (x,y) in the source image having corresponding part probabilities calculated from the source image and its scaled versions, the highest probability between
- the part probability for pixel (x,y) in the part map obtained from the source image,
- the part probability for pixel (x,y) in the part map obtained from a first downscaled version of the source image,
- the part probability for pixel (x,y) in the part map obtained from a first upscaled version of the source image,
- and so on, is selected to be the part probability associated with pixel (x,y) in the final part map output at step 452.
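The multi-scale fusion described above may be sketched as follows. The sketch assumes a callable run_detector that returns, for an input image, the intermediary part map for the part currently considered (e.g. the confidence map of a CNN-based 2D part detector); this callable and the scale values are illustrative.

    import numpy as np
    from scipy.ndimage import zoom

    def multiscale_part_map(source_image, run_detector, scales=(0.5, 1.0, 2.0)):
        """Fuse intermediary part maps computed at several scales of the source image.

        The final part map keeps, pixel by pixel, the highest part probability
        over all intermediary part maps rescaled to the source-image resolution.
        """
        h, w = source_image.shape[:2]
        fused = None
        for s in scales:
            factors = (s, s) + (1,) * (source_image.ndim - 2)   # do not scale colour channels
            intermediary = run_detector(zoom(source_image, factors, order=1))
            rescaled = zoom(intermediary,
                            (h / intermediary.shape[0], w / intermediary.shape[1]), order=1)
            fused = rescaled if fused is None else np.maximum(fused, rescaled)
        return fused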
Knowing the part maps PMipart and the scene volume V split into elementary voxels V(X,Y,Z), the computer system 14 can generate at step 453 one or more sets of part volume data for respectively the one or more parts. In fact one set is generated for each part 21.
Step 453 aims at obtaining, for each part, a 3D space corresponding to the scene volume wherein each elementary voxel bears (for instance by its value) the probability that the co-located voxel in V belongs to said part. This probability is built from the part probabilities obtained from the various source images 13.
For the example of Figure 2, thirteen 3D spaces are built (i.e. thirteen sets of part volume data).
To do so, step 453 comprises two substeps.
First, the elementary voxels V(X,Y,Z) of the scene volume are projected at step 454 onto projection pixels pi(x,y) of the part maps (which may all match their corresponding source images). Figure 5 schematically illustrates such a projection. This is a geometrical issue which depends only on the extrinsic and intrinsic parameters of each source camera 12i, given the elementary voxels considered.
As the scale/resolution of the part maps may differ from the one of the source image, the projection may consist in projecting the voxels according to the intrinsic and extrinsic parameters of the source images/cameras and in scaling the obtained 2D coordinates according to the scaling factor.
The projection may however be direct (i.e. without scaling) if the part maps have already been interpolated/up-sampled at the same scale as the source image beforehand.
Each pixel pi(x,y) captured by source camera i corresponds to a line Δ through the scene volume: all the elementary voxels of the scene volume V along this line are projected onto the same pixel. Conversely, an elementary voxel may project onto one or more pixels of the source images or part maps.
One may note that a given source camera may not view the whole scene volume V, but only a part of it, depending on how V is defined. Thus, some elementary voxels may not be projected on a projection pixel of some source images (and thus part maps). The source images on which an elementary voxel can be projected are named below “projecting images for the voxel”.
Step 454 thus matches the pixels pi(x,y) of the source images 13i (and thus of each part map generated from them) with the elementary voxels V(X,Y,Z). The matching is shown as reference 405 in the Figure. Most pixels pi(x,y) are matched with respective sets of elementary voxels V(X,Y,Z), the number of which may vary depending on whether they are viewed by the cameras. Some pixels may be matched with few elementary voxels, or even none.
Next, using this matching, each part probability PPipart(x,y) at a pixel in a part map PMipart is assigned at step 455 to the elementary voxel or voxels (if any) that are projected onto this pixel pi(x,y). In other words, each elementary voxel is associated with the part probabilities taken by its projection pixels in the part maps. This may be made part by part, thereby generating N 3D spaces corresponding to the N parts, wherein each elementary voxel is associated with usually M part probabilities (M being the number of cameras considered).
Next, a joint part probability JPPpart(X,Y,Z) can be computed at step 456 for each elementary voxel V(X,Y,Z) based on these assigned part probabilities. Thus, N volumes or part volume data PVDpart 406 can be generated for the N parts, each volume representing the distribution of probabilities that the elementary voxels belong to the respective part considered.
In one embodiment, computing the joint part probability JPPpart(X,Y,Z) for an elementary voxel (X,Y,Z) may include dividing the sum of the part probabilities of its projection pixels in the part maps corresponding to the respective part, by the number of such part maps. It means the sum of the assigned part probabilities PPpart(x,y) is computed, which sum is next divided by the number of projecting images for the voxel. This ensures that the joint part probability remains between 0 and 1.
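A minimal sketch of steps 454 to 456 for one part is given below; it assumes pinhole cameras (K, R, t per camera), part maps already at the source-image resolution, and a nearest-pixel projection, which simplifies the description above. The function name is illustrative.

    import numpy as np

    def part_volume_data(voxel_centres, cameras, part_maps):
        """Compute the joint part probabilities JPPpart(X,Y,Z) for one part.

        voxel_centres : (nx, ny, nz, 3) voxel centres of the scene volume.
        cameras       : list of (K, R, t) tuples, one per source camera.
        part_maps     : list of 2D part maps for this part, one per source camera.
        Returns the (nx, ny, nz) part volume data (average over projecting images).
        """
        acc = np.zeros(voxel_centres.shape[:3])
        count = np.zeros(voxel_centres.shape[:3])        # number of projecting images per voxel
        pts_world = voxel_centres.reshape(-1, 3)
        for (K, R, t), pm in zip(cameras, part_maps):
            h, w = pm.shape
            pts = pts_world @ R.T + t                    # SYS -> camera coordinates
            in_front = pts[:, 2] > 0
            uvw = pts @ K.T
            x = np.zeros(len(pts), dtype=int)
            y = np.zeros(len(pts), dtype=int)
            x[in_front] = np.round(uvw[in_front, 0] / uvw[in_front, 2]).astype(int)
            y[in_front] = np.round(uvw[in_front, 1] / uvw[in_front, 2]).astype(int)
            valid = in_front & (x >= 0) & (x < w) & (y >= 0) & (y < h)
            prob = np.zeros(len(pts))
            prob[valid] = pm[y[valid], x[valid]]
            acc += prob.reshape(acc.shape)
            count += valid.reshape(count.shape)
        return np.divide(acc, count, out=np.zeros_like(acc), where=count > 0)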
The following of the method consists in generating one or more parts of the 3D skeleton using the one or more sets of part volume data PVDpart so generated. These generated one or more parts thus build the 3D skeleton.
The generation comprises various steps as described now.
First, a set of part candidate or candidates is determined at step 457 from each part volume data PVDpart. Each part candidate corresponds to an elementary voxel. The determination is made based on the joint part probabilities associated with the elementary voxels. For instance, part candidate or candidates from part volume data are determined by determining local maximum or maxima of the joint part probabilities (within data PVDpart) and outputting (i.e. selecting) elementary voxel or voxels (the part candidate or candidates) corresponding to the determined local maximum or maxima.
All 3D local maximum or maxima in each part volume data PVDpart may be selected. They identify candidates in the scene volume for the part considered.
In one embodiment, only the highest local maximum is selected, or the a highest local maxima are selected (a being an integer > 1), for instance if the maximal number a of 3D objects in the scene volume is known in advance. This makes the process less complex as only a few part candidates are handled in the next steps.
In another and refining embodiment, a probability threshold can be used to keep only the 3D local maximum or maxima that are associated with joint part probabilities above said threshold. This cleans up the set of part candidates from any uncertain part candidates that would result from isolated part detection at step 452 (i.e. from few or very few source images).
Consequently, the process is simplified. A probability threshold can be defined independently for each part or for a subset of parts. This is because the method used at step 452 may be more efficient to detect some parts than other parts.
In yet another embodiment, 3D local maximum or maxima that are too close (given a guard threshold) to the envelope (faces) of the scene volume V are discarded. This is to avoid processing 3D objects 11 that may not have been entirely captured (and thus possibly truncated).
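One possible way to implement the determination of part candidates at step 457, combining the local-maxima search with the probability threshold discussed above, is sketched below; the threshold value and neighborhood size are illustrative.

    import numpy as np
    from scipy.ndimage import maximum_filter

    def part_candidates(pvd, threshold=0.3, neighborhood=3):
        """Select part candidates as 3D local maxima of the part volume data.

        pvd          : (nx, ny, nz) joint part probabilities for one part.
        threshold    : minimum joint part probability kept (illustrative value).
        neighborhood : size of the cubic window used for the local-maximum test.
        Returns the (X, Y, Z) voxel indices of the selected part candidates.
        """
        local_max = pvd == maximum_filter(pvd, size=neighborhood)
        return np.argwhere(local_max & (pvd > threshold))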
At least two sets (usually N sets) of part candidate or candidates are thus obtained from respectively the part volume data corresponding to two (usually N) parts, each part candidate corresponding to an elementary voxel with an associated joint part unary probability.
Next, a one-to-one association between a first part candidate (e.g. a candidate for a head) of a first candidate set and a second part candidate (e.g. a candidate for a neck) of the second candidate set is made. This is done using a graph wherein nodes correspond to the part candidates of the two sets with their associated joint part probabilities and weighted links between nodes are set.
For ease of illustration, it is considered here that the graph is built based on only two parts that are adjacent according to the 3D model 20. This is a simplification of more complex approaches which are based on graphs involving a higher number of parts. In variants, more complete graphs may thus also be used to find one-to-one associations, as explained below.
The one-to-one association requires a first step 458 of linking the part candidates to one another. This step may take into account the adjacency between parts according to the 3D model 20, i.e. the existence of a connecting element 22 in the model. For instance a head candidate can be connected or linked to a neck candidate in the 3D model 20.
Thus, each pair of adjacent parts in the 3D model 20 may be successively considered.
In one embodiment, all part candidates for the first adjacent part are connected to each and every part candidate for the second adjacent part. This can be made through the building of a graph as introduced above. One graph is built for each pair of adjacent parts, wherein the nodes correspond to the part candidates (i.e. voxels) for the two adjacent parts and a link between nodes is created where a part connection is made. Each node is thus assigned the joint part probability of the corresponding part candidate (voxel).
To reduce complexity, a link between two nodes is preferably set in the graph depending on morphological constraints defined by the 3D model between the two adjacent parts. For instance, decision to connect two part candidates (and thus corresponding nodes in the graph) may be based on a distance between the part candidates, given predefined (morphological) constraints.
The constraints may vary from one part to the other. For instance, a common head-neck distance is higher than 10 cm but less than 40 cm, a common pelvis-knee distance is higher than 20 cm but less than 80 cm, and so on.
Consequently, part candidates for two adjacent parts are thus preferably connected if their relative distance (in the 3D coordinates system SYS) meets the morphological constraints, e.g. is higher than a predefined floor threshold and/or less than a predefined ceiling threshold. The floor threshold helps distinguish between intermingled 3D objects while the ceiling threshold helps process distant 3D objects separately.
In a slight variant where all part candidates for the first adjacent part are first connected to each and every part candidate for the second adjacent part, the morphological constraints may be used to remove links connecting two part candidates that do not satisfy the constraints.
Once the graphs for all pairs of adjacent parts have been obtained (steps 458 to 460 may however be performed one pair after the other), each link between two connected nodes is weighted at step 459. It means a weight is assigned to the link in the graph.
In one embodiment, a weight for such a link between two nodes corresponding to part candidates of the two sets depends on a distance between the two part candidates. In a rough approach, the inverse of the distance (as measured between the two part candidates in the 3D coordinates system SYS) is used as a weight.
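Steps 458 and 459 may then be sketched as follows for one pair of adjacent parts, using the morphological floor/ceiling distances and the inverse-distance weight discussed above; the function name and the voxel edge length are illustrative.

    import numpy as np

    def build_links(candidates_a, candidates_b, d_min, d_max, voxel_edge=0.01):
        """Create weighted links between the part candidates of two adjacent parts.

        candidates_a, candidates_b : (Na, 3) and (Nb, 3) voxel indices of part candidates.
        d_min, d_max               : morphological floor/ceiling distances in meters
                                     (e.g. 0.10 m and 0.40 m for head-neck).
        A link is kept only if the 3D distance meets the constraints; its weight is
        the inverse of the distance between the two part candidates.
        """
        links = []
        for ia, a in enumerate(candidates_a):
            for ib, b in enumerate(candidates_b):
                d = np.linalg.norm((a - b) * voxel_edge)   # distance in the SYS coordinate system
                if d_min <= d <= d_max:
                    links.append((ia, ib, 1.0 / d))
        return links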
Next, each graph is solved at step 460 to find the one-to-one associations between part candidates that maximize a cost or energy.
The one-to-one associations mean that, at the end, each node (for a first adjacent part) in the graph can only be linked to at most one other node (for the second part). After being solved, the graph may comprise nodes without links. This is the case for instance when the set of part candidates for the first adjacent part includes more candidates than the set for the other adjacent part in the pair.
For instance, a bipartite solving of the graph reduces to a maximum weight bipartite graph matching problem as explained for instance in “Introduction to graph theory, volume 2” by D. B. West et al. (2001). The optimal associations between the parts give portions of 3D skeletons.
The energy E to maximize may be the sum of elementary energies assigned to the pairs of connected nodes respectively. Each elementary energy ‘e’ may be based on the joint part probabilities associated with the two nodes and on the weight of the link between the nodes:
E = Σ e, where for instance e = β·[JPPpart-1(first node) + JPPpart-2(second node)] + γ·weightlink, and β and γ are predefined parameters.
In an alternative and more efficient way, the energy can be defined as:
e = β·max{JPPpart-1(first node), JPPpart-2(second node)} + γ·weightlink
For instance, let us consider two connected nodes in the graph corresponding to a head candidate and to a neck candidate respectively. The head candidate has a joint part probability JPPhead(X1,Y1,Z1) while the neck candidate has JPPneck(X2,Y2,Z2). The two candidates (X1,Y1,Z1) and (X2,Y2,Z2) are 0.15 meter apart in system SYS, in which case the weight for the link between the two nodes is set to 1/0.15. Their associated elementary energy is the following in the first example of energy above:
e = β·[JPPhead(X1,Y1,Z1) + JPPneck(X2,Y2,Z2)] + γ/0.15
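As a purely illustrative sketch of step 460, the one-to-one associations maximizing the total energy may be obtained with an assignment solver; the use of SciPy's linear_sum_assignment, the β and γ values and the padding of non-linked pairs with a very low energy are assumptions, not features prescribed by the described method.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def solve_pair_graph(jpp_a, jpp_b, link_weights, beta=0.4, gamma=0.5):
    """Build the elementary energies e = beta*(JPP_a + JPP_b) + gamma*weight_link
    and keep the one-to-one associations maximizing E = sum(e)."""
    energy = np.full((len(jpp_a), len(jpp_b)), -1e9)        # non-linked pairs get a very low energy
    for (i, j), w in link_weights.items():                  # {(i, j): weight} for linked nodes only
        energy[i, j] = beta * (jpp_a[i] + jpp_b[j]) + gamma * w
    rows, cols = linear_sum_assignment(energy, maximize=True)
    return [(i, j) for i, j in zip(rows, cols) if energy[i, j] > -1e8]
```

On the head-neck example above, jpp_a and jpp_b would hold the joint part probabilities of the head and neck candidates, and link_weights the link weights (for instance the inverse distances mentioned above).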
The result of step 460 is a set of one-to-one-associated part candidates (there may be a single association in the set) for each graph (i.e. for each pair of adjacent parts according to the 3D model 20). Indeed, the above steps of determining 457 part candidates and of obtaining 458-460 one-to-one associations are repeated for the plurality of pairs of adjacent parts.
The above description of step 460 is based on a one-to-one graph simplification between adjacent part candidates. Some alternatives to this graph simplification exist.
For example, it is possible to construct a complete graph/tree between each head candidate and each right-hand candidate, passing through the neck candidates, right-shoulder candidates and right-elbow candidates. This tree can be segmented in a second step into independent sub-trees, each sub-tree defining a unique path between adjacent skeleton parts. The construction of the sub-trees can be viewed as a graph segmentation.
A global solution of this segmentation is the one that maximizes the total energy of the independent sub-trees.
This process of segmentation/subtree generation can be repeated for three other complete trees between respectively head candidates and right foot candidates (passing through intermediary parts), head candidates and left hand candidates, and head candidates and left foot candidates. More generally, if the 3D model has P ending parts, P-1 complete trees may be built and then solved.
The final steps consist in selecting one-to-one-associated part candidates so obtained as parts of the final 3D skeleton.
Preferably a first step 461 consists in connecting one-to-one-associated part candidates of two or more pairs of adjacent parts to obtain candidate 3D skeleton or skeletons. A connected component algorithm can be used.
The idea is merely to use each graph output to parse (thus build) the candidate 3D skeletons.
The output of a first graph is selected from which the one-to-one associations (of adjacent part candidates) are successively considered. Given an associated pair of adjacent part candidates, the outputs of the other graphs (preferably those involving one of the parts previously considered) are used to determine whether or not these adjacent part candidates are also one-to-one associated with other part candidates. In the affirmative, the various part candidates are put together in the same data structure in memory, which progressively forms a candidate 3D skeleton. And so on.
To illustrate this process still using the model of Figure 2, let us consider a first association between a head candidate (voxel or “point” P1 in the scene volume) and a neck candidate (voxel or “point” P2 in the scene volume). This association results from the solving of the head-neck graph. The solved left-shoulder-neck graph is used to determine whether an association between the same neck candidate (P2) and a left-shoulder candidate exists. In the affirmative (voxel or “point” P3 in the scene volume for the left-shoulder candidate), points P1, P2, P3 are put together in a candidate structure. And so on with the left-elbow-left-shoulder graph, left-hand-left-elbow graph, right-shoulder-neck graph, pelvis-neck graph, and so on. At the end, at most thirteen points P1-P13 in the 3D space may have been found, which form an entire 3D skeleton candidate.
A second association between a head candidate and a neck candidate may produce a second 3D skeleton candidate, be it entire (if all the graphs provide a new point) or not.
It turns out that one or more (entire or partial) 3D skeleton candidates are formed. A 3D skeleton candidate may be made of a single isolated one-to-one association between two part candidates or of a few associations.
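A minimal sketch of this grouping by connected components is given below (Python, illustrative only); the layout of the associations, keyed by pair of adjacent parts, is an assumption chosen for the example.

```python
import networkx as nx

def build_skeleton_candidates(associations):
    """associations: {("head", "neck"): [(v1, v2), ...], ...} where v1, v2 identify
    the one-to-one-associated part candidates (voxels) output at step 460.
    Each connected component of the resulting graph is one candidate 3D skeleton."""
    g = nx.Graph()
    for (part_a, part_b), pairs in associations.items():
        for va, vb in pairs:
            g.add_edge((part_a, va), (part_b, vb))
    return [set(component) for component in nx.connected_components(g)]
```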
In the graph segmentation approach described above where P-1 complete trees are built and then solved, the final stage may consist in merging together the four (more generally P-1) sub-trees (if any) sharing the same candidate for the starting end part (here for instance the head). This also provides a 3D skeleton candidate for each occurrence of an end part (e.g. head, hand, foot) of the 3D model.
From these 3D skeleton candidates obtained, one 3D skeleton candidate may be selected as a 3D skeleton of the 3D object 11. This is step 462. Of course, if a plurality of 3D objects having the same 3D model 20 is present in the scene volume, a plurality of 3D skeleton candidates is selected as 3D skeletons for these objects. The number of 3D skeleton candidates to be selected can be known in advance. However, some applications may not know such a number.
Apart from such a known number, various criteria may be used, alternatively or in combination, to select the 3D skeleton candidates. The criteria increase the relevancy of the selection (i.e. the selected candidates correspond to existing objects in the scene volume).
A first criterion is a number of parts forming the 3D skeleton candidate according to the 3D model. One easily understands that a more complete skeleton candidate is a better candidate than a more partial skeleton candidate. Thus, preferably, the number should be above a predefined threshold to select (or keep) the 3D skeleton candidate. This is to avoid having too partial 3D skeletons, and it automatically discards the isolated one-to-one associations (or the 3D skeleton candidates made of few associations). This first criterion is similar to a number of connecting elements in the 3D skeleton candidate.
A second criterion is the joint part probabilities associated with the nodes of the 3D skeleton candidate in the graph or graphs. Again, one easily understands that the higher the joint part probabilities, the more accurate the 3D skeleton candidate. Thus, a sum of these probabilities should preferably be above a predefined threshold to select (or keep) the 3D skeleton candidate. This discards the 3D skeleton candidates that are based on uncertain part detections from the source images.
A third exemplary criterion is the weights set for the links between the nodes of the 3D skeleton candidate in the graph or graphs. For instance, a sum of these weights should be above a predefined threshold to select (or keep) the 3D skeleton candidate. This criterion may be additional to the first one, since such a weight sum is strongly impacted by the number of parts (the fewer the parts and thus the links, the fewer the weights to be summed).
A fourth criterion is the visibility of the 3D skeleton candidate by the source cameras 12. Such visibility can be expressed as the number of projecting images for the voxels composing the candidate, i.e. the number of source images onto which the 3D skeleton candidate can be projected. For instance, such number (or visibility) should be above a predefined number, e.g. half the number of source cameras, to select (or keep) the 3D skeleton candidate.
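The four criteria may, for instance, be combined as simple threshold tests, as in the illustrative sketch below; the candidate fields and all threshold values are assumptions chosen for the example, not values prescribed by the described method.

```python
def keep_candidate(candidate, n_cameras,
                   min_parts=8, min_jpp_sum=3.0, min_weight_sum=2.0):
    """Apply the four exemplary selection criteria of step 462 to one 3D skeleton candidate."""
    enough_parts   = len(candidate["parts"]) >= min_parts               # first criterion
    confident      = sum(candidate["jpps"]) >= min_jpp_sum              # second criterion
    well_connected = sum(candidate["link_weights"]) >= min_weight_sum   # third criterion
    visible        = candidate["n_projecting_images"] >= n_cameras / 2  # fourth criterion
    return enough_parts and confident and well_connected and visible
```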
Some applications may require that the 3D skeleton or skeletons selected at step 462 (thus generated using the process of the Figure) be displayed, for instance using the display screen 15.
Figure 6 illustrates, using a flowchart, such a process 600 for displaying a 3D skeleton of a 3D real world object observed by source cameras in a scene volume. This is an exemplary application using the generated 3D skeleton.
Step 601 corresponds to generating a 3D skeleton of the 3D real world object using the teachings of the invention, e.g. using the process of Figure 4.
Step 602 consists in selecting a virtual camera 12v viewing the scene volume. Such a camera does not actually exist. It is defined by a set of extrinsic and intrinsic parameters chosen by the user. These parameters define from which viewpoint, at which distance and with which focal length (i.e. zoom) the user wishes to view the scene.
Using these parameters of the virtual camera, the virtual image 13v can be computed at step 603. This step merely consists in projecting the 3D skeleton or skeletons located in the 3D space onto a virtual empty image defined by the parameters of the virtual camera. This projection is similar to step 454 where the elementary voxels (here the voxels forming the 3D skeleton) are projected onto the source images.
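A minimal pinhole-projection sketch of this step is given below (illustrative only); K, R and t stand for the intrinsic matrix and the extrinsic rotation/translation of the virtual camera, which is one common way of representing the parameters mentioned above and is an assumption of this example.

```python
import numpy as np

def project_skeleton(points_3d, K, R, t):
    """Project 3D skeleton voxels (N x 3, coordinates in SYS) onto the virtual
    image plane with a pinhole model: x ~ K (R X + t)."""
    cam = R @ np.asarray(points_3d, dtype=float).T + t.reshape(3, 1)  # world -> camera frame
    pix = K @ cam                                                     # camera frame -> image plane
    return (pix[:2] / pix[2]).T                                       # homogeneous -> pixel coordinates
```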
Next, the built virtual image 13v is displayed on the display screen 15 at step 604.
Steps 603 and 604 ensure the display on a display screen of the generated 3D skeleton from the viewpoint of the virtual camera.
Figure 7 illustrates, using a flowchart, second embodiments, which improve the first embodiments described above, of a method according to the present invention. Similar references are used between Figures 4 and 7 for the same steps or data.
In step 454 of Figure 4, the elementary voxels V(X,Y,Z) of the scene volume are projected onto projection pixels pi(x,y) of the part maps (which may match in size their respective source image or not). The number of projections may be very high since it depends on the number of elementary voxels forming the scene volume (which may be huge to cover e.g. a sports field in a stadium) and on the number of part maps, i.e. on the number of source images 13 (tens of cameras may be used), on the number of 3D models 20 to be found and on the number of parts 21 forming each 3D model 20.
The process 700 of Figure 7 aims at substantially reducing this number of projections, thereby reducing computational costs of the method.
The improvement relies on the following idea: using a first set of part volume data, i.e. PVDhead 406 for instance, to restrict an amount of elementary voxels to be projected (during step 454) on part maps (generated for a second part, e.g. neck) to generate 453 a second set of part volume data, PVDneck for instance. In fact, the processing of the first set of part volume data PVDhead makes it possible to identify part candidates and thus to roughly define sub-volumes around these part candidates as locations where the 3D objects are located. It is inferred from the output of this processing that the remainder of the scene volume (thus excluding the sub-volumes) is devoid of 3D objects.
In practice, the process of Figure 4 can be done for a first part (say the head) up to step 457 (i.e. from step 450 to 457) where part candidates of the 3D real world object are determined from the first set of part volume data PVDhead. In a slight variant which further reduces complexity, this first set of part volume data PVDhead may be generated using large elementary voxels (for step 454), for instance by grouping several elementary voxels, typically a cube of x³ elementary voxels (x being an integer). In that case, the same set of part volume data PVDhead can be recomputed later on based on the restricted amount of elementary voxels as described below.
Next, bounding 3D boxes are defined at step 701 around the determined part candidates in the scene volume. For instance, the bounding box may be defined based on a predefined maximum size of the 3D object. The bounding box may be centered on a determined part candidate. The bounding box may be a cuboid or a cube whose edges are at least twice the predefined maximum size. This ensures that any 3D object to which the determined part candidate (i.e. voxel) belongs is encompassed by the bounding box.
In one specific embodiment, bounding boxes that overlap each other are merged into a new bounding box. In that case, the smallest cuboid comprising the overlapping bounding boxes may be chosen. The merging process is iterative, meaning that a new bounding box resulting from a merger can be subject to another merger with another bounding box. A number of iterations may be predefined to avoid too long a processing. Alternatively, it may not be limited, in which case iterative mergers may end in a bounding box having the size of the scene volume if enough 3D objects are spread over the whole volume.
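The bounding-box definition and the iterative merging of step 701 may be sketched as follows (illustrative only); the axis-aligned box representation, the maximum object size parameter and the iteration limit are assumptions of this example.

```python
import numpy as np

def box_around(candidate, max_object_size):
    """Axis-aligned cube centred on a part candidate, with edges twice the
    predefined maximum object size (step 701)."""
    c = np.asarray(candidate, dtype=float)
    return np.stack([c - max_object_size, c + max_object_size])  # [[min corner], [max corner]]

def overlap(a, b):
    return bool(np.all(a[0] <= b[1]) and np.all(b[0] <= a[1]))

def merge_boxes(boxes, max_iterations=10):
    """Iteratively merge overlapping boxes into the smallest enclosing cuboid."""
    for _ in range(max_iterations):
        merged, out, used = False, [], [False] * len(boxes)
        for i, a in enumerate(boxes):
            if used[i]:
                continue
            for j in range(i + 1, len(boxes)):
                if not used[j] and overlap(a, boxes[j]):
                    a = np.stack([np.minimum(a[0], boxes[j][0]),
                                  np.maximum(a[1], boxes[j][1])])
                    used[j], merged = True, True
            out.append(a)
        boxes = out
        if not merged:
            break
    return boxes
```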
Once the bounding boxes are known, the part volume data PVDpart for the other parts (but also for the same first part in case large elementary voxels were used at step 454) are generated using only the elementary voxels of the bounding boxes for the projecting step 454.
In other words, the amount of elementary voxels to be projected on the part maps to generate a second set of part volume data is restricted to the defined bounding boxes.
As a projection is only made from the elementary voxels of the bounding boxes, a joint part probability is computed at step 456 (for each part considered) only for this subset of elementary voxels and the obtained sets of part volume data PVDpart only have information in the bounding boxes.
The remainder of the process (steps 458 to 462) remains unchanged.
However, an advantageous embodiment is proposed when no bounding box overlaps or intersects another one, which may be obtained after having merged bounding boxes. As the bounding boxes are spatially distinct from one another, they can be processed independently.
This means that, once the bounding boxes are known, steps 454 to 462 can be performed on a single bounding box at a time. One or more 3D skeletons are obtained from each bounding box. This approach saves memory consumption as the amount of data to process and store at a given time is substantially reduced (because each bounding box is processed separately).
Figure 8 illustrates, using a flowchart, third embodiments, which improve the first or second embodiments described above, of a method according to the present invention. Similar references are used between Figures 4 (or 7) and 8 for the same steps or data.
As described above (Figure 4), the weights set for the links connecting two nodes (i.e. part candidates for two adjacent parts) may be the inverse of the distance between the two part candidates in the 3D space or coordinates system SYS. The sole use of the distance to weigh the links proves to be efficient to identify 3D skeletons for distant 3D objects but quite insufficient to identify 3D skeletons for intermingled 3D objects. To improve detection of the 3D objects, the third embodiments of the present invention propose to use part affinity fields PAFs to adjust the weights of the links in the graphs before the latter are solved at step 460.
Part affinity fields are known for instance from the above-cited publication “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields”. One part affinity field is generated for each pair of adjacent parts (according to the 3D model 20) and from each source image 13. It is generated by the same CNN as the one used at step 452.
Similar to the part maps, a part affinity field may have the same dimensions as the source image or reduced dimensions in which case it may be upscaled to recover the same dimensions.
In detail, a part affinity field for the two adjacent parts (e.g. right foot and right knee) includes affinity vectors for respective pixels of the source image, the magnitude and direction of each affinity vector representing estimated orientation probability and orientation of an element connecting, according to the 3D model, two occurrences of said adjacent parts at the respective pixel in the source image. According to the training base used to generate the CNN, the maximal magnitude may be limited to 1.
The resolution of the part affinity fields is usually at a lower resolution than the source images. It is possible to up-sample the part affinity field to the same resolution as the source image. In such a case, an up-sampled part affinity field for the two adjacent parts (e.g. right foot and right knee) includes an affinity vector per each pixel of the source image, the magnitude and direction of each affinity vector representing estimated orientation probability and orientation of an element connecting, according to the 3D model, two occurrences of said adjacent parts at said pixel in the source image.
This up-sampling is however optional.
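If up-sampling is performed, it may for instance be done by bilinear interpolation of the two vector components, as in the sketch below (illustrative only; the H x W x 2 array layout for a part affinity field is an assumption of this example).

```python
from scipy.ndimage import zoom

def upsample_paf(paf, target_h, target_w):
    """Bring a part affinity field (H, W, 2) to the source-image resolution by
    bilinear interpolation of both affinity-vector components."""
    fy, fx = target_h / paf.shape[0], target_w / paf.shape[1]
    return zoom(paf, (fy, fx, 1.0), order=1)   # order=1: bilinear, last axis left untouched
```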
Figure 9 schematically illustrates a portion of a part affinity field PAF between right foot and right knee in a source image (the leg of the source image is schematically traced in dotted line to show the correspondence with the affinity vectors AV). The affinity vectors of the lower part of the leg are not shown for the sake of clarity.
The directions of the affinity vectors show the estimated orientation of a connecting element between the two parts considered (i.e. here the estimated orientation of the leg).
The lengths of the affinity vectors show the confidence in the orientation estimation at each pixel. The longer the AVs (with a length limited to one pixel), the more certain their orientations are.
The knowledge of orientations with high probabilities (AVs with a modulus close to 1) gives relevant information on how to connect two part candidates. This can be used to favor some 1-to-1 matchings when solving the graph. This is the idea of the process of Figure 8.
As shown in Figure 8, step 452 is replaced by step 801 where both part maps 404 (for each part of the 3D model 20) and part affinity fields 802 (for each pair of adjacent parts in the 3D model 20) are generated from each source image 13.
With the example of Figure 2, thirteen part maps and twelve part affinity fields are generated from each source image.
Steps 453 to 458 are similar to Figure 4 or 7.
Next, at step 803, the weights set for the weighted links are based on the generated part affinity fields 802. For instance, the weight of a link connecting a first-part (e.g. right foot) candidate and a second-part (e.g. right knee) candidate in the graph is set based on the PAFs related to both adjacent first and second parts and obtained from the source images at step 801.
As it is sought to favor the pairs of first-part candidate and second-part candidate that are arranged along the same orientation as the most probable affinity vectors, the weight to be used for the link between the two candidates can be based on a scalar product between the vector formed by the two candidates and the affinity vectors. As the affinity vectors are within a 2D image, the vector formed by the two candidates can be projected onto each PAF to perform the scalar product.
In this context, it is proposed to:
project the first and second part candidates onto a generated part affinity field, and compute the weight (for the link between the two candidates) based on affinity vectors located between the two projected part candidates in the generated part affinity field.
If the scale/resolution of the part affinity field differs from the one of the source image (e.g. if no up-sampling has been conducted), the projection consists in projecting the 3D candidates according to the intrinsic and extrinsic parameters of the source images/cameras and in scaling the obtained 2D coordinates according to the scaling factor.
The projection can be direct if the part affinity fields have been interpolated/upsampled at the same scale as the source image.
The affinity vectors to be considered may be those along the segment formed by the two candidates, in particular the closest ones. For instance, the known Bresenham's line algorithm can be used to determine which pixels (and thus which associated affinity vectors) to parse along this segment.
The projection and computation are preferably repeated for all the generated part affinity fields concerning the same two adjacent parts. Of course, the same process is repeated for the other pairs of adjacent parts.
In practice, the scalar products of the vector formed by the two projected part candidates and each of the affinity vectors located between the two projected part candidates (thus identified by the Bresenham's line algorithm) can be computed (to obtain elementary link weights), and then summed. The sum can then be normalized by dividing it with the modulus of the vector formed by the two projected part candidates (i.e. the projected distance between the two candidates).
It turns out that a scalar product result for the two candidates is obtained from each part affinity field (i.e. at most twelve results are obtained for the 3D model 20). The results may then be summed to obtain a final weight which is assigned to the link between these two candidates in the graph concerned.
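The whole weight computation may be sketched as follows (illustrative only); the projection helper, the part affinity field layout (H x W x 2) and the use of a regular sampling of the segment in place of Bresenham's line algorithm are assumptions of this example.

```python
import numpy as np

def paf_link_weight(cand_a, cand_b, pafs, project):
    """Weight of the link between two 3D part candidates based on part affinity fields.
    'pafs' is a list of (H, W, 2) arrays (one per source image) and
    'project(point_3d, view)' is assumed to return the 2D pixel coordinates
    of a 3D candidate in the corresponding part affinity field."""
    total = 0.0
    for view, paf in enumerate(pafs):
        pa = np.asarray(project(cand_a, view), dtype=float)
        pb = np.asarray(project(cand_b, view), dtype=float)
        v = pb - pa
        norm = float(np.linalg.norm(v))
        if norm < 1e-6:
            continue
        dots = 0.0
        for s in np.linspace(0.0, 1.0, num=10):           # regular samples along the segment
            x, y = (pa + s * v).round().astype(int)
            if 0 <= y < paf.shape[0] and 0 <= x < paf.shape[1]:
                dots += float(np.dot(paf[y, x], v))        # scalar product with the affinity vector
        total += dots / norm                               # normalize by the projected distance
    return total                                           # sum of elementary weights over all PAFs
```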
Figure 10 schematically illustrates these scalar products in the process of solving a graph related, in the example shown, to the head and the neck as two adjacent parts. For ease of illustration, a single affinity vector is shown as a dotted arrow for the head-neck connecting element in the affinity field maps instead of the plurality shown in Figure 9. Single affinity vectors (still as dotted arrows) for other connecting elements are also shown to illustrate an entire human being.
On top of the Figure, a simplified graph is shown with two head candidates HC1 and HC2 (white dots) and two neck candidates NC1 and NC2 (black dots). Each part candidate is associated with a joint part probability (JPP) and links between the part candidates (given morphological constraints) are shown. The weight of these links is computed based on the part affinity fields shown in the lower part of the Figure. While only two PAFs, PAF1 and PAF2, are shown (for ease of illustration), a higher number can be used.
As far as the first link (NC1, HC1) is concerned, the two candidates are first projected onto PAF1 resulting in a projected vector Vp1(NC1,HC1) shown in plain arrow in (a).
The normalized sum of the scalar products of Vp1(NC1,HC1) with each affinity vector of PAF1 located between the two projected candidates (here a single affinity vector AV1(head-neck) is shown) gives a value for instance of 0.9. The elementary weight for link HC1-NC1 and PAF1 is thus 0.9.
For the same link (NC1, HC1), the two candidates are projected onto PAF2 resulting in a projected vector Vp2(NC1,HC1) shown in plain arrow in (b). The normalized sum of the scalar products of Vp2(NC1,HC1) with each affinity vector of PAF2 located between the two projected candidates (here a single affinity vector AV2(head-neck) is shown) gives a value for instance of 0.7. The elementary weight for link HC1-NC1 and PAF2 is thus 0.7.
If more PAFs are available, the same calculation is done for each of them.
Next, the elementary link weights for link HC1-NC1 are all summed. Here, only two elementary weights are summed, giving a weight for link HC1-NC1 equal to 0.9+0.7=1.6.
The same can be done for a second link (NC2, HC2). The candidates are projected onto PAF1 as shown in (c). The normalized sum of the scalar products between Vp1(NC2,HC2) and each affinity vector (here AV1(head-neck)) gives an elementary weight for link HC2-NC2 and PAF1 equal to 0.1.
The candidates are also projected onto PAF2 as shown in (d). The normalized sum of the scalar products between Vp2(NC2,HC2) and each affinity vector (here AV2(head-neck)) gives an elementary weight for link HC2-NC2 and PAF2 equal to 0.15.
Their sum gives a weight for link HC2-NC2 equal to 0.1+0.15=0.25.
The same is performed (not shown) for link HC1-NC2 and link HC2-NC1. Let us assume that the weight calculated for link HC1-NC2 is 0.3 and the weight calculated for link HC2-NC1 is 0.5.
All the calculated weights are shown in the graph on top of the Figure.
Back to Figure 8 after step 459, the remainder of the process (solving of the graphs and building the 3D skeletons) remains unchanged.
The graph solver for step 460 uses the weights calculated above. The energy to maximize is:
E = Σ e, where e = β·[JPPpart-1(first node) + JPPpart-2(second node)] + γ·weightlink
For the exemplary graph of Figure 10, β=0.4 and γ=0.5 are chosen, which give the following elementary energies for the pairs of part candidates:
eHC1-NC1 = 6.6, eHC1-NC2 = 2.8, eHC2-NC1 = 3.7, eHC2-NC2 = 6.4
Maximal energy is obtained by keeping links HC1-NC1 and HC2-NC2. Only 1-to-1 associations remain (here two).
However, the energy of HC2-NC2 may be considered too low to represent an actual portion of a 3D object. Thus, if a threshold is applied, HC2-NC2 can also be discarded, and only HC1-NC1 is kept as an output 1-to-1 association between part candidates.
Although the present invention has been described hereinabove with reference to specific embodiments, the present invention is not limited to the specific embodiments, and modifications will be apparent to a skilled person in the art which lie within the scope of the present invention.
Many further modifications and variations will suggest themselves to those versed in the art upon making reference to the foregoing illustrative embodiments, which are given by way of example only and which are not intended to limit the scope of the invention, that being determined solely by the appended claims. In particular the different features from different embodiments may be interchanged, where appropriate.
In the claims, the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality. The mere fact that different features are recited in mutually different dependent claims does not indicate that a combination of these features cannot be advantageously used.

Claims (21)

1. A method for generating a 3D skeleton of a 3D real world object observed by source cameras in a scene volume, comprising the following steps performed by a computer system:
obtaining, from memory of the computer system, two simultaneous source images of the scene volume recorded by the source cameras;
generating, from each source image, one or more part maps for one or more respective parts of the 3D real world object, each part map for a given part comprising part probabilities for respective samples of the source image representing probabilities that the respective samples correspond to the given part;
generating one or more sets of part volume data for respectively the one or more parts, wherein generating a set of part volume data for a respective part includes:
projecting elementary voxels of the scene volume onto projection samples of the part maps;
computing a joint part probability for each elementary voxel based on the part probabilities of its projection samples in the part maps corresponding to the respective part;
generating one or more parts of the 3D skeleton using the one or more sets of part volume data generated.
2. The method of Claim 1, further comprising using a first set of part volume data to restrict an amount of elementary voxels to be projected on part maps to generate a second set of part volume data.
3. The method of Claim 2, wherein using the first set of part volume data includes:
determining part candidates of the 3D real world object from the first set of part volume data, defining bounding 3D boxes around the determined part candidates in the scene volume, wherein the amount of elementary voxels to be projected on the part maps to generate a second set of part volume data is restricted to the defined bounding boxes.
4. The method of Claim 3, wherein using the first set of part volume data further includes merging bounding boxes into a new bounding box where the bounding boxes overlap each other.
5. The method of Claim 1, wherein generating a part map from a source image for a respective part includes:
obtaining one or more scaled versions of the source image, generating, from each of the source image and its scaled versions, an intermediary part map for the respective part, the intermediary part map comprising part probabilities for respective samples of the source image or its scaled version representing probabilities that the respective samples correspond to said part of the 3D real world object, and forming the part map with, for each sample considered, the highest part probability from the part probabilities of the generated intermediary part maps for the same sample considered.
6. The method of Claim 1, wherein computing the joint part probability for an elementary voxel includes dividing the sum of the part probabilities of its projection samples in the part maps corresponding to the respective part, by the number of such part maps.
7. The method of Claim 1, wherein generating one or more parts of the 3D skeleton includes:
determining two or more sets of part candidate or candidates from respectively the part volume data, each part candidate corresponding to an elementary voxel with an associated joint part probability, solving a graph to associate together part candidates coming from different sets of part candidates, wherein nodes in the graph correspond to the part candidates of the two or more sets with their associated joint part probabilities and weighted links between nodes are set, and selecting associated part candidates as parts of the 3D skeleton.
8. The method of Claim 7, wherein exactly two sets of part candidates are used in a graph that correspond to two parts that are adjacent according to a 3D model of the 3D real world object, and solving the graph consists in obtaining a one-to-one association between a first part candidate of the first candidate set and a second part candidate of the second candidate set.
9. The method of Claim 7, further comprising generating, from each source image, a part affinity field for the two adjacent parts that includes affinity vectors for respective samples of the source image, the magnitude and direction of each affinity vector representing estimated orientation probability and orientation of an element connecting, according to the 3D model, two said adjacent parts at the respective sample in the source image, wherein the weights set for the weighted links are based on the generated part affinity fields.
10. The method of Claim 9, wherein setting a weight for a link between a first part candidate of a first candidate set and a second part candidate of a second candidate includes:
projecting the first and second part candidates onto a generated part affinity field, and computing the weight based on affinity vectors located between the two projected part candidates in the generated part affinity field.
11. The method of Claim 10, wherein computing the weight includes computing the scalar products of a vector formed by the two projected part candidates and the affinity vectors located between the two projected part candidates.
12. The method of Claim 10, wherein setting the weight for the link between the first and second part candidates includes:
repeating the projecting and computing steps for all the generated part affinity fields to obtain an elementary link weight from each part affinity field, and summing the computed elementary link weights to obtain a weight for the link.
13. The method of Claim 7, wherein determining part candidate or candidates from part volume data includes determining local maximum or maxima of the joint part probabilities and outputting elementary voxel or voxels corresponding to the determined local maximum or maxima.
14. The method of Claim 7, wherein setting a link between two nodes in the graph corresponding to part candidates of two sets depends on morphological constraints defined by the 3D model between the two corresponding parts.
15. The method of Claim 7, wherein a weight for a weighted link between two nodes corresponding to part candidates of two sets depends on a distance between the two part candidates.
16. The method of Claim 8, wherein generating one or more parts of the 3D skeleton includes:
repeating the determining and solving steps for a plurality of pairs of adjacent parts, connecting one-to-one-associated part candidates of two or more pairs of adjacent parts to obtain 3D skeleton candidate or candidates, and selecting at least one 3D skeleton candidate from the obtained 3D skeleton candidate or candidates, as a 3D skeleton of the 3D real world object.
17. The method of Claim 16, wherein selecting at least one candidate 3D skeleton is based on at least one from:
a number of parts forming the 3D skeleton candidate according to the 3D model, the joint part probabilities associated with the nodes of the 3D skeleton candidate in the graphs, the weights set for the links between the nodes of the 3D skeleton candidate in the graphs, and a number of source images onto which the 3D skeleton candidate can be projected.
18. A method for displaying a 3D skeleton of a 3D real world object observed by source cameras in a scene volume, comprising the following steps performed by a computer system:
generating a 3D skeleton of the 3D real world object using the generating method of Claim 1, selecting a virtual camera viewing the scene volume, and displaying, on a display screen, the generated 3D skeleton from virtual camera viewpoint.
19. A non-transitory computer-readable medium storing a program which, when executed by a microprocessor or computer system in a device, causes the device to perform the method of Claim 1.
20. A system for generating a 3D skeleton of a 3D real world object observed by source cameras in a scene volume, comprising at least one microprocessor configured for carrying out the steps of:
obtaining, from memory of the computer system, two simultaneous source images of the scene volume recorded by the source cameras;
generating, from each source image, one or more part maps for one or more respective parts of the 3D real world object, each part map for a given part comprising part probabilities for respective samples of the source image representing probabilities that the respective samples corresponds to the given part;
generating one or more sets of part volume data for respectively the one or more parts, wherein generating a set of part volume data for a respective part includes:
projecting elementary voxels of the scene volume onto projection samples of the part maps;
computing a joint part probability for each elementary voxel based on the part probabilities of its projection samples in the part maps corresponding to the respective part;
generating one or more parts of the 3D skeleton using the one or more sets of part volume data generated.
21. A system for displaying a 3D skeleton of a 3D real world object observed by source cameras in a scene volume, comprising the generating system of Claim 20, wherein the microprocessor is further configured for carrying out the steps of:
selecting a virtual camera viewing the scene volume, and
displaying, on a display screen, the generated 3D skeleton from virtual camera viewpoint.
Amendments to the claims have been made as follows:
1. A method for generating a 3D skeleton of a 3D real world object observed by source cameras in a scene volume, comprising the following steps performed by a computer
system:
obtaining, from memory of the computer system, two simultaneous source images of the scene volume recorded by the source cameras;
generating, from each source image, one or more part maps for one or more respective parts of the 3D real world object, each part map for a given part being a 2D array of part map samples and comprising, at the part map samples, part probabilities for respective samples of the source image representing probabilities that the respective samples correspond to the given part;
generating one or more sets of part volume data for respectively the one or more parts, wherein generating a set of part volume data for a respective part includes:
projecting elementary voxels of the scene volume onto part map samples, referred to as projection samples, of the part maps;
computing a joint part probability for each elementary voxel based on the part probabilities of its projection samples in the generated part maps corresponding to the respective part;
generating one or more parts of the 3D skeleton using the one or more sets of part volume data generated.
2. The method of Claim 1, further comprising using a first set of part volume data to restrict an amount of elementary voxels to be projected on part maps to generate a second set of part volume data.
3. The method of Claim 2, wherein using the first set of part volume data includes:
determining part candidates of the 3D real world object from the first set of part volume data, defining bounding 3D boxes around the determined part candidates in the scene volume, wherein the amount of elementary voxels to be projected on the part maps to generate a second set of part volume data is restricted to the defined bounding boxes.
4. The method of Claim 3, wherein using the first set of part volume data further includes merging bounding boxes into a new bounding box where the bounding boxes overlap each other.
5. The method of Claim 1, wherein generating a part map from a source image for a respective part includes:
obtaining one or more scaled versions of the source image, generating, from each of the source image and its scaled versions, an intermediary part map for the respective part, the intermediary part map being a 2D array of part map samples and comprising, at the part map samples, part probabilities for respective samples of the source image or its scaled version representing probabilities that the respective samples correspond to said part of the 3D real world object, and forming the part map with, for each part map sample considered, the highest part probability from the part probabilities of the generated intermediary part maps for the same part map sample considered.
6. The method of Claim 1, wherein computing the joint part probability for an elementary voxel includes dividing the sum of the part probabilities of its projection samples in the part maps corresponding to the respective part, by the number of such part maps.
7. The method of Claim 1, wherein generating one or more parts of the 3D skeleton includes:
determining two or more sets of part candidate or candidates from respectively two or more generated sets of part volume data corresponding to two or more parts, each part candidate corresponding to an elementary voxel with an associated joint part probability, solving a graph to associate together part candidates coming from different sets of part candidates, wherein nodes in the graph correspond to the part candidates of the two or more sets with their associated joint part probabilities and weighted links between nodes are set, and selecting associated part candidates as parts of the 3D skeleton.
8. The method of Claim 7, wherein exactly two sets of part candidates are used in a graph that correspond to two parts that are adjacent according to a 3D model of the 3D
real world object, and solving the graph consists in obtaining a one-to-one association between a first part candidate of the first candidate set and a second part candidate of the second candidate set.
9. The method of Claim 7, further comprising generating, from each source
image, a part affinity field for the two adjacent parts that includes affinity vectors for respective samples of the source image, the magnitude and direction of each affinity vector representing estimated orientation probability and orientation of an element connecting, according to the 3D model, two said adjacent parts at the respective sample in the source image, wherein the weights set for the weighted links are based on the generated part affinity fields.
10. The method of Claim 9, wherein setting a weight for a link between a first part candidate of a first candidate set and a second part candidate of a second candidate includes:
projecting the first and second part candidates onto a generated part affinity field, and
computing the weight based on affinity vectors located between the two projected part candidates in the generated part affinity field.
11. The method of Claim 10, wherein computing the weight includes computing the scalar products of a vector formed by the two projected part candidates and the affinity vectors located between the two projected part candidates.
12. The method of Claim 10, wherein setting the weight for the link between the first and second part candidates includes:
repeating the projecting and computing steps for all the generated part affinity fields to obtain an elementary link weight from each part affinity field, and summing the computed elementary link weights to obtain a weight for the link.
13. The method of Claim 7, wherein determining a set of part candidate or candidates from a set of part volume data includes determining local maximum or maxima of the joint part probabilities and outputting elementary voxel or voxels corresponding to the determined local maximum or maxima.
14. The method of Claim 7, wherein setting a link between two nodes in the graph corresponding to part candidates of two sets depends on morphological constraints defined by the 3D model between the two corresponding parts.
15. The method of Claim 7, wherein a weight for a weighted link between two nodes corresponding to part candidates of two sets depends on a distance between the two part candidates.
16. The method of Claim 8, wherein generating one or more parts of the 3D skeleton includes:
repeating the determining and solving steps for a plurality of pairs of adjacent parts, connecting one-to-one-associated part candidates of two or more pairs of adjacent parts to obtain 3D skeleton candidate or candidates, and selecting at least one 3D skeleton candidate from the obtained 3D skeleton candidate or candidates, as a 3D skeleton of the 3D real world object.
17. The method of Claim 16, wherein selecting at least one candidate 3D skeleton is based on at least one from:
a number of parts forming the 3D skeleton candidate according to the 3D model,
the joint part probabilities associated with the nodes of the 3D skeleton candidate in the graphs, the weights set for the links between the nodes of the 3D skeleton candidate in the graphs, and
a number of source images onto which the 3D skeleton candidate can be projected.
18. A method for displaying a 3D skeleton of a 3D real world object observed by source cameras in a scene volume, comprising the following steps performed by a computer system:
generating a 3D skeleton of the 3D real world object using the generating method of Claim 1, selecting a virtual camera viewing the scene volume, and displaying, on a display screen, the generated 3D skeleton from virtual camera viewpoint.
19. A non-transitory computer-readable medium storing a program which, when executed by a microprocessor or computer system in a device, causes the device to perform the method of Claim 1.
20. A system for generating a 3D skeleton of a 3D real world object observed by source cameras in a scene volume, comprising at least one microprocessor configured for carrying out the steps of:
obtaining, from memory of the system, two simultaneous source images of the scene volume recorded by the source cameras;
generating, from each source image, one or more part maps for one or more respective parts of the 3D real world object, each part map for a given part being a 2D array of part map samples and comprising, at the part map samples, part probabilities for respective samples of the source image representing probabilities that the respective samples corresponds to the given part;
generating one or more sets of part volume data for respectively the one or more parts, wherein generating a set of part volume data for a respective part includes:
projecting elementary voxels of the scene volume onto part map samples, referred to as projection samples, of the part maps;
computing a joint part probability for each elementary voxel based on the part probabilities of its projection samples in the generated part maps corresponding to the respective part;
generating one or more parts of the 3D skeleton using the one or more sets of part volume data generated.
GB1802950.4A 2018-02-23 2018-02-23 3D skeleton reconstruction from images using volumic probability data Active GB2571307B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1802950.4A GB2571307B (en) 2018-02-23 2018-02-23 3D skeleton reconstruction from images using volumic probability data
US16/280,854 US11127189B2 (en) 2018-02-23 2019-02-20 3D skeleton reconstruction from images using volumic probability data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1802950.4A GB2571307B (en) 2018-02-23 2018-02-23 3D skeleton reconstruction from images using volumic probability data

Publications (3)

Publication Number Publication Date
GB201802950D0 GB201802950D0 (en) 2018-04-11
GB2571307A true GB2571307A (en) 2019-08-28
GB2571307B GB2571307B (en) 2020-10-21

Family

ID=61903444

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1802950.4A Active GB2571307B (en) 2018-02-23 2018-02-23 3D skeleton reconstruction from images using volumic probability data

Country Status (1)

Country Link
GB (1) GB2571307B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2575153A (en) * 2018-06-28 2020-01-01 Adobe Inc Determining image handle locations
US10621764B2 (en) 2018-07-05 2020-04-14 Adobe Inc. Colorizing vector graphic objects

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118367B (en) * 2021-11-16 2024-03-29 上海脉衍人工智能科技有限公司 Method and equipment for constructing incremental nerve radiation field

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1271411A2 (en) * 2001-06-29 2003-01-02 Samsung Electronics Co., Ltd. Hierarchical image-based apparatus and method of representation and rendering of three-dimentional objects
US20110080336A1 (en) * 2009-10-07 2011-04-07 Microsoft Corporation Human Tracking System
US20150023557A1 (en) * 2012-01-11 2015-01-22 Samsung Electronics Co., Ltd Apparatus for recognizing objects, apparatus for learning classification trees, and method for operating same

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1271411A2 (en) * 2001-06-29 2003-01-02 Samsung Electronics Co., Ltd. Hierarchical image-based apparatus and method of representation and rendering of three-dimentional objects
US20110080336A1 (en) * 2009-10-07 2011-04-07 Microsoft Corporation Human Tracking System
US20150023557A1 (en) * 2012-01-11 2015-01-22 Samsung Electronics Co., Ltd Apparatus for recognizing objects, apparatus for learning classification trees, and method for operating same

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2575153A (en) * 2018-06-28 2020-01-01 Adobe Inc Determining image handle locations
US10672174B2 (en) 2018-06-28 2020-06-02 Adobe Inc. Determining image handle locations
GB2575153B (en) * 2018-06-28 2022-01-12 Adobe Inc Determining image handle locations
US11302053B2 (en) 2018-06-28 2022-04-12 Adobe Inc. Determining image handle locations
US10621764B2 (en) 2018-07-05 2020-04-14 Adobe Inc. Colorizing vector graphic objects

Also Published As

Publication number Publication date
GB201802950D0 (en) 2018-04-11
GB2571307B (en) 2020-10-21

Similar Documents

Publication Publication Date Title
US11127189B2 (en) 3D skeleton reconstruction from images using volumic probability data
Wang et al. Multi-view stereo in the deep learning era: A comprehensive review
US11205298B2 (en) Method and system for creating a virtual 3D model
Chiang et al. A unified point-based framework for 3d segmentation
GB2573170A (en) 3D Skeleton reconstruction from images using matching 2D skeletons
Wang et al. Hmor: Hierarchical multi-person ordinal relations for monocular multi-person 3d pose estimation
Shen A survey of object classification and detection based on 2d/3d data
EP4365841A1 (en) Object pose detection method and apparatus, computer device, and storage medium
Lahoud et al. 3D vision with transformers: A survey
US11961266B2 (en) Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
US20200057778A1 (en) Depth image pose search with a bootstrapped-created database
da Silveira et al. 3d scene geometry estimation from 360 imagery: A survey
US20110001791A1 (en) Method and system for generating and displaying a three-dimensional model of physical objects
CN111402412A (en) Data acquisition method and device, equipment and storage medium
GB2571307A (en) 3D skeleton reconstruction from images using volumic probability data
WO2022208440A1 (en) Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
CN115239763A (en) Planar target tracking method based on central point detection and graph matching
Zhang et al. Construction of a feature enhancement network for small object detection
KR20210131358A (en) Image processing to determine object thickness
Cui et al. Fusing surveillance videos and three‐dimensional scene: A mixed reality system
US11461956B2 (en) 3D representation reconstruction from images using volumic probability data
Correia et al. 3D reconstruction of human bodies from single-view and multi-view images: A systematic review
Yang et al. Monocular camera based real-time dense mapping using generative adversarial network
GB2573172A (en) 3D skeleton reconstruction with 2D processing reducing 3D processing
Hu et al. 3D map reconstruction using a monocular camera for smart cities