GB2573172A - 3D skeleton reconstruction with 2D processing reducing 3D processing


Info

Publication number
GB2573172A
Authority
GB
United Kingdom
Prior art keywords
candidates
candidate
part candidates
skeleton
generating
Prior art date
Legal status
Granted
Application number
GB1806951.8A
Other versions
GB2573172B (en
GB201806951D0 (en
Inventor
Le Floch Hervé
Current Assignee
Canon Inc
Original Assignee
Canon Inc
Application filed by Canon Inc filed Critical Canon Inc
Priority to GB1806951.8A priority Critical patent/GB2573172B/en
Publication of GB201806951D0 publication Critical patent/GB201806951D0/en
Priority to US16/280,854 priority patent/US11127189B2/en
Publication of GB2573172A publication Critical patent/GB2573172A/en
Application granted granted Critical
Publication of GB2573172B publication Critical patent/GB2573172B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T7/00 Image analysis
            • G06T7/50 Depth or shape recovery
              • G06T7/55 Depth or shape recovery from multiple images
                • G06T7/593 Depth or shape recovery from multiple images from stereo images
                  • G06T7/596 Depth or shape recovery from multiple images from stereo images from three or more stereo images
          • G06T13/00 Animation
            • G06T13/20 3D [Three Dimensional] animation
              • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
          • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
          • G06T2200/00 Indexing scheme for image data processing or generation, in general
            • G06T2200/04 Indexing scheme for image data processing or generation, in general involving 3D image data
          • G06T2207/00 Indexing scheme for image analysis or image enhancement
            • G06T2207/20 Special algorithmic details
              • G06T2207/20072 Graph-based image processing
              • G06T2207/20076 Probabilistic image processing
            • G06T2207/30 Subject of image; Context of image processing
              • G06T2207/30196 Human being; Person

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

To generate 3D skeletons of a scene, i.e. to know where each part of the real world objects is located within 3D space, the invention splits the overall process into operations performed at 2D level on images and operations performed at 3D level on volume data. The complexity of the 3D skeleton reconstruction is reduced because the 2D processing reduces the amount of data to be processed at 3D level. Source images capturing the objects are obtained, from which sets of 2D part candidates are determined for parts of the objects; each 2D part candidate corresponds to a pixel in a source image. 3D part candidates can be generated in 3D space from the 2D part candidates, and generated 3D part candidates representing the same part can be converted into a single 3D part candidate. Finally, a 3D skeleton is generated from the 3D part candidates.

Description

3D SKELETON RECONSTRUCTION WITH 2D PROCESSING REDUCING 3D PROCESSING
FIELD OF THE INVENTION
The present invention relates generally to reconstruction of 3D skeletons from views of one or more 3D real world objects. Improved 2D or 3D images of the 3D real world objects can be generated from the reconstructed 3D skeletons.
BACKGROUND OF THE INVENTION
Reconstruction of 3D skeletons, also known as 3D object pose estimation, is widely used in image-based rendering. Various applications for 3D object pose estimation and virtual rendering can be contemplated, including providing alternative views of the same animated 3D object or objects from virtual cameras, for instance a new and more immersive view of a sport event with players.
Various attempts to provide methods and devices for 3D skeleton reconstruction have been made, including US 8,830,236 and the publication “3D Human Pose Estimation via Deep Learning from 2D annotations” (2016, Fourth International Conference on 3D Vision (3DV), Ernesto Brau, Hao Jiang). However, the techniques described in these documents remain insufficient in terms of performance, including memory use, processing time (for instance near real time, i.e. less than a few seconds before rendering) and the ability to detect a maximum number of 3D real world objects in the scene.
SUMMARY OF INVENTION
New methods and devices to reconstruct 3D skeletons from source images of the same scene are proposed. A method for generating a 3D skeleton of one or more 3D real world objects observed by cameras according to the invention is defined in Claim 1. It comprises the following steps performed by a computer system: obtaining (possibly from memory of the computer system) a plurality of simultaneous images of the 3D real world objects recorded by the cameras, determining, from each image, one or more sets of 2D part candidate or candidates for one or more respective parts of the 3D real world object (e.g. head, left hand, ... of a human-like object), each 2D part candidate corresponding to a sample (e.g. pixels) of the corresponding image, generating, in 3D space, 3D part candidates from the 2D part candidates, converting generated 3D part candidates representing the same part into a single 3D part candidate, and then, generating at least one 3D skeleton from the 3D part candidates.
The present invention offers a balanced split between operations performed at 2D level, here the determination of 2D part candidates, and operations performed at 3D level, here the conversion of 3D part candidates with a view to forming the 3D skeleton. It involves converting the 2D part candidates into 3D part candidates.
This split advantageously offers a good balance between processing complexity and speed performance as the amount of 3D data to be handled is substantially reduced by the 2D part candidate determining.
Various applications of the invention may be contemplated, including a method for displaying a 3D skeleton of one or more 3D real world objects observed by cameras as defined in Claim 20. It comprises the following steps performed by a computer system: generating a 3D skeleton of a 3D real world object using the generating method above, selecting a viewpoint in 3D space, and displaying, on a display screen, the generated 3D skeleton or a 3D object/character obtained from said generated 3D skeleton from the viewpoint.
More generally, the 3D skeleton generation may be applied to 2D or 3D image generation, therefore providing improved contribution to the technical field of image processing producing an improved image.
In this context, the invention may also improve the field of rendering a scene from a new viewpoint which may be seen as a new “virtual camera”.
Correspondingly, a system, which may be a single device, for generating a 3D skeleton of one or more 3D real world objects observed by cameras according to the invention is defined in Claim 22. It comprises at least one microprocessor configured for carrying out the steps of: obtaining, from memory of the computer system, a plurality of simultaneous images of the 3D real world objects recorded by the cameras, determining, from each image, one or more sets of 2D part candidate or candidates for one or more respective parts of the 3D real world object, each 2D part candidate corresponding to a sample of the corresponding image, generating, in 3D space, 3D part candidates from the 2D part candidates, converting generated 3D part candidates representing the same part into a single 3D part candidate, and then, generating at least one 3D skeleton from the 3D part candidates.
Also, a system for displaying a 3D skeleton of one or more 3D real world objects observed by cameras may be as defined in Claim 23. It comprises the above system to generate a 3D skeleton of the 3D real world object connected to a display screen, wherein the microprocessor is further configured for carrying out the steps of: selecting a viewpoint in 3D space, and displaying, on the display screen, the generated 3D skeleton from the viewpoint.
Optional features of the invention are defined in the appended claims. Some of these features are explained here below with reference to a method, while they can be transposed into system features dedicated to any system according to the invention.
In embodiments, the determining step includes: generating, from each image, one or more part maps for one or more respective parts of the 3D real world object, each part map for a given part comprising part probabilities for respective samples (e.g. pixels) of the image representing probabilities that the respective samples correspond to the given part, and determining sets of 2D part candidate or candidates from respectively the part maps.
This approach substantially reduces processing complexity as the part candidates are determined at 2D level, while a reduced number of such candidates can be used at 3D level for further processing.
In specific embodiments seeking to increase robustness, generating a part map from a source image for a respective part includes: obtaining one or more scaled versions of the source image, generating, from each of the source image and its scaled versions, an intermediate part map for the respective part, the intermediate part map comprising part probabilities for respective samples of the source image or its scaled version representing probabilities that the respective samples correspond to said part of the 3D real world object, and forming the part map with, for each sample considered, the highest part probability from the part probabilities of the same sample considered in the generated intermediate part maps.
In other specific embodiments, determining a set of 2D part candidate or candidates from a part map includes determining local maximum or maxima of the part probabilities in the part map and outputting the sample or samples corresponding to the determined local maximum or maxima as 2D part candidate or candidates.
In other embodiments, the step of generating 3D part candidates from the 2D part candidates includes: repeatedly matching two 2D part candidates from two respective sets of 2D part candidates (i.e. from two different source images) determined for the same part (e.g. head), and generating, in 3D space, 3D part candidates from respective pairs of matched 2D part candidates. It means that a pair of matching or matched 2D part candidates is used to generate one 3D part candidate in the volume.
This approach proves to be of low complexity to produce 3D part candidates for a 3D skeleton representing the observed 3D real world object.
In yet other embodiments, the method further comprises a step of filtering the generated 3D part candidates into a subset of 3D part candidates.
Through this filtering the number of 3D part candidates to be handled for 3D skeleton generation may be substantially reduced, thereby reducing processing complexity of the operations performed at 3D level.
In specific embodiments, the filtering step may include selecting 3D part candidates generated from pairs of matched 2D part candidates that share the same 2D part candidates.
The selecting step may include selecting at least one triplet of 3D part candidates generated from three respective pairs built from exactly the same three 2D part candidates.
The filtering step may also include selecting or discarding 3D part candidates generated from pairs of matched 2D part candidates based on a part distance determined between the 2D part candidates of the respective pairs.
In yet other embodiments, the 3D skeleton generating step includes using (and solving) a graph to obtain one or more one-to-one associations between 3D part candidates representing two different parts, wherein nodes of the graph correspond to the 3D part candidates representing the two different parts considered and weighted links between nodes corresponding to two 3D part candidates for the two different parts are set.
As associations are sought between candidates of different parts, preferably no link is made between nodes representing candidates of the same part. This reduces the complexity of the graph to be solved.
The graph-based approach makes it possible to efficiently find the best associations between 3D parts to build a final 3D skeleton, at reasonable processing costs.
The weight for a link may be calculated based on pairwise probability or probabilities between pairs of 2D part candidates, pairs from which two 3D part candidates forming the link are generated. Each pairwise probability can be obtained for two 2D part candidates belonging to the same source image.
The method may further comprise generating, from the source image, a part affinity field between the two different parts considered that includes affinity vectors for respective samples of the source image, the magnitude and direction of each affinity vector representing estimated orientation probability and orientation of an element connecting, according to the 3D model, the two different parts considered at the respective sample in the source image, wherein the pairwise probability is calculated based on the part affinity field generated from the source image.
The pairwise probability for the two 2D part candidates may be calculated based on affinity vectors located between the two 2D part candidates in the generated part affinity field. Calculating the pairwise probability may include computing the scalar products of a vector formed by the two 2D part candidates and the affinity vectors located between the two 2D part candidates.
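By way of illustration only, the following Python/NumPy sketch scores two 2D part candidates against a part affinity field using such scalar products. The (H, W, 2) array layout, the number of samples and the averaging of the scalar products are assumptions made for the example, not details taken from the claims.

```python
import numpy as np

def pairwise_probability(paf, p1, p2, num_samples=10):
    # paf: assumed part affinity field of shape (H, W, 2), one affinity vector per pixel.
    # p1, p2: pixel coordinates (x, y) of the two 2D part candidates in the same source image.
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    segment = p2 - p1
    length = np.linalg.norm(segment)
    if length == 0:
        return 0.0
    direction = segment / length                  # unit vector formed by the two candidates

    scores = []
    for t in np.linspace(0.0, 1.0, num_samples):  # samples located between the two candidates
        x, y = np.round(p1 + t * segment).astype(int)
        scores.append(np.dot(paf[y, x], direction))   # scalar product with the affinity vector
    return float(np.mean(scores))                 # one possible aggregation into a pairwise probability
```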
The weight for a link may also or in a variant be based on a distance between the two 3D part candidates forming the link.
Also, the two different parts considered may be adjacent according to a 3D model of the 3D real world object.
The 3D skeleton generating step may further include: repeating using a graph for successively each of a plurality of pairs of adjacent parts according to a 3D model of the 3D real world object, in order to obtain one or more one-to-one associations between 3D part candidates for each pair of adjacent parts, and connecting pairs of associated 3D part candidates that share the same 3D part candidate to obtain one or more 3D skeleton candidates.
The method may thus further comprise selecting one of the obtained 3D skeleton candidates as a 3D skeleton of the 3D real world object. Selecting one 3D skeleton candidate may be based on a number of parts forming the 3D skeleton candidate according to the 3D model.
Another aspect of the invention relates to a non-transitory computer-readable medium storing a program which, when executed by a microprocessor or computer system in a device, causes the device to perform the method as defined above.
The non-transitory computer-readable medium may have features and advantages that are analogous to those set out above and below in relation to the methods and node devices.
At least parts of the methods according to the invention may be computer implemented. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, microcode, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit", "module" or "system". Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer usable program code embodied in the medium.
Since the present invention can be implemented in software, the present invention can be embodied as computer readable code for provision to a programmable apparatus on any suitable carrier medium. A tangible carrier medium may comprise a storage medium such as a hard disk drive, a magnetic tape device or a solid state memory device and the like. A transient carrier medium may include a signal such as an electrical signal, an electronic signal, an optical signal, an acoustic signal, a magnetic signal or an electromagnetic signal, e.g. a microwave or RF signal.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention will now be described, by way of example only, and with reference to the following drawings in which:
Figure 1 is a general overview of a system 10 implementing embodiments of the invention;
Figure 2 illustrates an exemplary 3D model of a 3D real world object, based on which a 3D skeleton of the 3D object can be built;
Figure 3 is a schematic block diagram of a computing device for implementation of one or more embodiments of the invention.
Figure 4 illustrates, using a flowchart, embodiments of a method for generating a 3D skeleton of a 3D real world object observed by source cameras according to embodiments of the present invention;
Figures 4a to 4d schematically illustrate, using flowcharts, exemplary operations forming sub-steps of the process of Figure 4, according to embodiments;
Figure 5 schematically illustrates an exemplary splitting of a cuboid into elementary cubes V(X,Y,Z);
Figure 6 schematically illustrates a way to compute a part distance between two 2D part candidates according to embodiments of the present invention;
Figure 7 illustrates, using a flowchart, steps for computing a part distance between two 2D part candidates;
Figure 8 schematically illustrates a triangulation process to build a 3D part candidate from a pair of matched 2D part candidates according to embodiments of the present invention;
Figures 9a and 9b schematically illustrate a filtering process of 3D part candidates according to embodiments of the present invention;
Figure 10 schematically illustrates a conversion process of 3D part candidates according to embodiments of the present invention;
Figure 11 schematically illustrates a portion of a part affinity field PAF between right foot and right knee in a source image;
Figures 12a to 12c schematically illustrate the retrieval of pairwise probabilities for two 3D part candidates according to embodiments of the present invention;
Figures 13a and 13b schematically illustrate the retrieval of pairwise probabilities in case of 3D part candidate conversion, according to embodiments of the present invention;
Figure 14 schematically illustrates steps for generating a 3D skeleton candidate using a graph according to embodiments of the present invention; and
Figure 15 illustrates, using a flowchart, a process for displaying a 3D skeleton of a 3D real world object observed by source cameras according to embodiments of the invention.
DETAILED DESCRIPTION OF EMBODIMENTS
Figure 1 is a general overview of a system 10 implementing embodiments of the invention. The system 10 comprises a three-dimensional (3D) real world object 11 of a scene captured by two or more source camera/sensor units 12.
The 3D real world object 11 may be of various types, including beings, animals, mammals, human beings, articulated objects (e.g. robots), still objects, and so on. The captured scene may also include a plurality of 3D objects that may move over time.
Although two main camera units 12a, 12b are shown in the Figure, there may be more of them, for instance about 7-10 camera units, up to about 30-50 camera units in a stadium.
The source camera units 12 generate synchronized videos made of 2D source images 13 (i.e. views from their viewpoints) of the scene at substantially the same time instant, i.e. simultaneous source images. Each source camera/sensor unit 12 (12a, 12b) comprises a passive sensor (e.g. an RGB camera).
The 3D positions and orientations of the source cameras 12 within a reference 3D coordinates system SYS are known. They are named the extrinsic parameters of the source cameras.
Also, the geometrical model of the source cameras 12, including the focal length of each source camera and the orthogonal projection of the center of projection onto the image 13, is known in the camera coordinates system. These are named the intrinsic parameters of the source cameras. The camera model is described in this description as a pinhole model with intrinsic parameters, but any other model could be used without changing the principles of the invention. Preferably, the source cameras 12 are calibrated so that they output their source images of the scene at the same cadence and simultaneously. The intrinsic and extrinsic parameters of the cameras are assumed to be known or calculated using well-known calibration procedures.
In particular, these calibration procedures allow the 3D object to be reconstructed into a 3D skeleton at the real scale.
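As an illustration of how these parameters are used, the sketch below projects a 3D point expressed in the coordinates system SYS into a source image under a pinhole model; the function and variable names are illustrative, not taken from the patent.

```python
import numpy as np

def project_point(point_sys, K, R, t):
    # K: (3, 3) intrinsic matrix (focal lengths, principal point); R, t: extrinsic
    # rotation (3, 3) and translation (3,) mapping SYS coordinates into the camera frame.
    point_cam = R @ np.asarray(point_sys, dtype=float) + t   # world (SYS) -> camera
    u, v, w = K @ point_cam                                   # camera -> image plane
    return u / w, v / w                                       # perspective divide -> pixel (x, y)
```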
The source images 13 feed a processing or computer system 14 according to the invention.
The computer system 14 may be embedded in one of the source cameras 12 or be a separate processing unit. Any communication technique (including Wi-Fi, Ethernet, 3G, 4G or 5G mobile phone networks, and so on) can be used to transmit the source images 13 from the source cameras 12 to the computer system 14.
An output of the computer system 14 is a 3D skeleton for at least one 3D object of the scene, used to generate a 2D or 3D image, preferably of the scene. A virtual image 13v built with the generated 3D skeleton and showing the same scene with the 3D object or objects from the viewpoint of a virtual camera 12v may be rendered on a connected display screen 15. Alternatively, data encoding the generated 3D skeleton may be sent to a distant system (not shown) for storage and display, using for instance any communication technique.
Stored 3D skeletons may also be used in human motion analysis for video monitoring purposes for instance.
Figure 2 illustrates an exemplary 3D model 20 of a 3D real world object, based on which a 3D skeleton of the 3D object may be built according to the teachings of the present invention. In the example of the Figure, the 3D object is an articulated 3D real world object of human being type. Variants may regard still objects.
The 3D model comprises N distinct parts 211, 212, 213, ... and N-1 connecting elements or links 22. The parts (globally referenced 21) represent modeled portions of the 3D real world object, for instance joints (shoulders, knees, elbows, pelvis, ...) or end portions (head, hands, feet) of a human being. Each part 21 is defined as a 3D point (or position) in the 3D coordinates system SYS. The 3D point or position may be approximated to a voxel in case SYS is discretized. The connecting elements 22 are portions connecting the parts 21, for instance limbs such as forearm, arm, thigh, trunk and so on. Each connecting element 22 can be represented as a straight line through 3D space between the two connected parts, also named “adjacent parts”.
To generate the 3D skeleton or skeletons of the scene, i.e. to know where each part of the 3D real world object or objects is located within 3D space, an idea of the present invention consists in adequately splitting the overall process into processing operations performed at 2D level, i.e. on images, and processing operations performed at 3D level, i.e. on volume data.
This aims at reducing the complexity of the 3D skeleton reconstruction, as the amount of data to be processed at 3D level can be substantially reduced thanks to the 2D processing. Indeed, the reduced number of 2D points detected at image/2D level (hereafter 2D part candidates) thanks to the operations at that level translates into a similar reduction of the number of 3D points (hereafter 3D part candidates) to be processed in 3D space. Costly 3D processing operations are thus substantially reduced.
To that end, a plurality of simultaneous source images 13 of the scene captured by the source cameras 12 may be obtained, from memory of the computer system for instance.
In case a volume V of the captured scene is delimited, its position and orientation are known in the 3D coordinates system SYS (for instance the 3D shape is known, typically a cuboid or cube, and the 3D locations of four of its vertices are known).
Next, from each source image, one or more sets of 2D part candidate or candidates for one or more respective parts of the 3D real world object (e.g. head, left hand) can be determined. Each 2D part candidate corresponds to a sample (e.g. pixels) of the corresponding source image. Known techniques to detect 2D parts corresponding to a known model can be used as described below, which for instance provide, for each sample of the source image, part probabilities of corresponding to the parts forming the object. The techniques may also provide a pairwise probability for each pair of detected 2D part candidates representing two adjacent parts (i.e. a connecting element 22) in the same image, this pairwise probability representing the probability that the two detected 2D part candidates are actually connected by the identified connecting element 22 in the real object.
The determined 2D part candidates may then be converted into 3D positions in 3D space, meaning 3D part candidates are generated from the 2D part candidates (on a per-part basis). Advantageously, a matching between 2D part candidates corresponding to the same part is first made before projecting each matching pair into a 3D part candidate in 3D space. This may merely involve geometrical considerations given for instance the positions and orientations of the cameras (more generally their extrinsic and intrinsic parameters) having captured the source images from which the 2D part candidates are obtained.
To increase the robustness of the process, the 3D part candidates may optionally be filtered in order to preferably keep those generated from 2D part candidates shared by two or more 3D part candidates. Indeed, the inventor has noticed that such sharing helps to identify the most robust/relevant 3D parts of the real objects.
Next, the invention provides obtaining a 3D part candidate for a given part from several 3D part candidates generated for that part. Such conversion may involve a RANSAC (Random sample consensus algorithm) approach based on distance considerations. Advantageously, this conversion also provides reduction of 3D processing complexity, as the number of 3D part candidates is further reduced.
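A minimal sketch of such a distance-based, RANSAC-style conversion is given below; the inlier radius, the iteration count and the final averaging are illustrative assumptions rather than values prescribed by the description.

```python
import numpy as np

def merge_3d_part_candidates(points_3d, inlier_radius=0.05, iterations=100, rng=None):
    # points_3d: (N, 3) array of 3D part candidates generated for the same part.
    # inlier_radius: assumed consensus threshold in scene units (e.g. metres).
    points_3d = np.asarray(points_3d, dtype=float)
    rng = np.random.default_rng() if rng is None else rng
    best_inliers = np.zeros(len(points_3d), dtype=bool)
    for _ in range(iterations):
        hypothesis = points_3d[rng.integers(len(points_3d))]        # random candidate as hypothesis
        inliers = np.linalg.norm(points_3d - hypothesis, axis=1) < inlier_radius
        if inliers.sum() > best_inliers.sum():                      # keep the largest consensus set
            best_inliers = inliers
    return points_3d[best_inliers].mean(axis=0)                     # single 3D part candidate
```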
Next, one or more 3D skeletons can be generated from the kept 3D part candidates. This may be done iteratively by considering each pair of adjacent parts forming the model. A graph encoding all the 3D part candidates (as nodes) for a given pair is preferably used and solved using inter-candidate statistics for the links between nodes. Such a graph may help obtain the most relevant (and thus robust) associations between 3D part candidates. A connected component algorithm may then help progressively build the 3D skeleton, by successively considering the obtained associations for the various pairs of adjacent parts forming the object model and connecting those sharing the same 3D part candidate.
Thanks to the proposed approach, the isolation of 3D objects within a scene volume containing many of them can be improved. It follows that real-time reconstruction of 3D skeletons (and thus display or human motion analysis, for instance) is better achieved. Real-time reconstruction for “live” TV or broadcast purposes may include a delay of a few seconds, e.g. less than 10 seconds, preferably at most 4 or 5 seconds.
The inventors have noticed that the proposed approach works efficiently on complex scenes (like sport events with multiple players in a stadium), with an ability to detect a large number of interacting 3D objects (multiple human players).
The generated 3D skeleton may be used to generate a 2D or 3D image. The present invention thus provides improved contribution to the technical field of image processing producing an improved image.
As mentioned above, an exemplary application for the present invention may relate to the display of a virtual image 13v showing the same scene from a new viewpoint, namely a virtual camera 12v. To that end, the invention also provides a method for displaying a 3D skeleton of one or more 3D real world objects observed by source cameras. This method includes generating at least one 3D skeleton of a 3D real world object using the generating method described above.
Next, this application consists in selecting a virtual camera and displaying the generated 3D skeleton from the virtual camera on a display screen. In practice, several generated 3D skeletons are displayed simultaneously on the display, for instance when displaying a sport event. A simple 3D object as shown in Figure 2 can be used to display the generated 3D skeleton. This is useful to display animations that require low rendering costs. More advanced applications can also provide the 3D skeleton with an envelope and a texture, either predefined or determined from pixel values acquired by the source cameras (for better rendering), for example to accurately render the filmed sportsmen as they actually appear in the scene volume.
Selecting a virtual camera may merely consist in defining the extrinsic and intrinsic parameters of a camera, thereby defining the viewpoint (i.e. distance and direction from the scene volume) and the zoom (i.e. focal length) provided by the virtual image.
Generating the 3D skeletons and displaying/rendering them on the display screen 15 may be performed for successive source images 13 acquired by the source cameras 12. Of course, the displaying operation follows the timing of acquisition of the source images. It follows that 3D-skeleton-based animations of the captured scene can be efficiently produced and displayed.
Other applications based on the generated 3D skeleton or skeletons may be contemplated. For instance, video monitoring for surveillance of areas, such as a street or a warehouse, may detect 3D skeletons in captured surveillance images and then analyse the movement of these 3D skeletons to decide whether to trigger an alarm.
Figure 3 schematically illustrates a device 300 used for the present invention, for instance the above-mentioned computer system 14. It is preferably a device such as a microcomputer, a workstation or a light portable device. The device 300 comprises a communication bus 313 to which there are preferably connected: - a central processing unit 311, such as a microprocessor, denoted CPU; - a read only memory 307, denoted ROM, for storing computer programs for implementing the invention; - a random access memory 312, denoted RAM, for storing the executable code of methods according to the invention as well as the registers adapted to record variables and parameters necessary for implementing methods according to the invention; and - at least one communication interface 302 connected to a communication network 301 over which data may be transmitted.
Optionally, the device 300 may also include the following components: - a data storage means 304 such as a hard disk, for storing computer programs for implementing methods according to one or more embodiments of the invention; - a disk drive 305 for a disk 306, the disk drive being adapted to read data from the disk 306 or to write data onto said disk; - a screen 309 for displaying data and/or serving as a graphical interface with the user, by means of a keyboard 310 or any other pointing means.
The device 300 may be connected to various peripherals, such as for example source cameras 12, each being connected to an input/output card (not shown) so as to supply data to the device 300.
Preferably the communication bus provides communication and interoperability between the various elements included in the device 300 or connected to it. The representation of the bus is not limiting and in particular the central processing unit is operable to communicate instructions to any element of the device 300 directly or by means of another element of the device 300.
The disk 306 may optionally be replaced by any information medium such as for example a compact disk (CD-ROM), rewritable or not, a ZIP disk, a USB key or a memory card and, in general terms, by an information storage means that can be read by a microcomputer or by a microprocessor, integrated or not into the apparatus, possibly removable and adapted to store one or more programs whose execution enables a method according to the invention to be implemented.
The executable code may optionally be stored either in read only memory 307, on the hard disk 304 or on a removable digital medium such as for example a disk 306 as described previously. According to an optional variant, the executable code of the programs can be received by means of the communication network 301, via the interface 302, in order to be stored in one of the storage means of the device 300, such as the hard disk 304, before being executed.
The central processing unit 311 is preferably adapted to control and direct the execution of the instructions or portions of software code of the program or programs according to the invention, which instructions are stored in one of the aforementioned storage means. On powering up, the program or programs that are stored in a non-volatile memory, for example on the hard disk 304 or in the read only memory 307, are transferred into the random access memory 312, which then contains the executable code of the program or programs, as well as registers for storing the variables and parameters necessary for implementing the invention.
In a preferred embodiment, the device is a programmable apparatus which uses software to implement the invention. However, alternatively, the present invention may be implemented in hardware (for example, in the form of an Application Specific Integrated Circuit or ASIC).
Various embodiments of the present invention are now described with reference to
Figures 4 to 15.
Figure 4 illustrates, using a flowchart, embodiments of a method according to the present invention. The method takes place in the computer system 14 which has previously received M source images 13 acquired simultaneously by M calibrated source cameras 12, for instance through a wireless or a wired network. These source images 13 are for instance stored in a reception buffer (memory) of the communication interface 302. The M source images may be a subset of source images available.
The method may be repeated for each set of simultaneous source images 13 received from the source cameras 12 at each successive time instant. For instance, 25 Hz to 100 Hz source cameras may be used, thereby requiring a set of source images 13 to be processed every 1/100 to 1/25 of a second.
The scene volume V viewed by the source cameras 12 may be predefined as shown by the volume parameters 401. These parameters locate the scene volume in the coordinates system SYS. The scene volume V may be split into elementary voxels V(X,Y,Z), preferably of equal sizes, typically elementary cubes. A size of the elementary voxels may be chosen depending on the 3D object to be captured. This is the resolution of the 3D space: each voxel corresponds to a point in the 3D space.
For instance, the edge length of each elementary voxel may be set to 1 cm for a human being. Figure 5 schematically illustrates the splitting of a cuboid into elementary cubes V(X,Y,Z), only one of which is shown for the sake of clarity.
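For illustration, a possible way to enumerate such elementary voxels in Python/NumPy is sketched below; the helper name is illustrative and, for large scene volumes, voxels would rather be indexed on demand than enumerated exhaustively.

```python
import numpy as np

def voxel_centres(origin, size, edge=0.01):
    # origin: one corner of the cuboid scene volume V in SYS; size: its dimensions (metres);
    # edge: voxel edge length (1 cm by default, as in the human-being example above).
    origin = np.asarray(origin, dtype=float)
    counts = np.ceil(np.asarray(size, dtype=float) / edge).astype(int)
    axes = [origin[d] + (np.arange(counts[d]) + 0.5) * edge for d in range(3)]
    X, Y, Z = np.meshgrid(*axes, indexing="ij")
    return np.stack([X.ravel(), Y.ravel(), Z.ravel()], axis=-1)   # (Nx*Ny*Nz, 3) voxel centres V(X,Y,Z)
```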
The invention also applies to a 3D coordinates system SYS without specific scene volume and corresponding splitting into voxels.
The source cameras 12 have been calibrated, meaning their extrinsic and intrinsic parameters 402 are known.
The nature, and thus the 3D model 20, of each 3D real world object 11 in SYS is known. For ease of explanation, the description below concentrates on a single type of 3D object, for instance a human being as modelled in Figure 2. Where the captured scene contains various types of 3D objects, the various corresponding 3D models 20 can be used by applying the teachings below.
The method starts with the obtaining 400 of a plurality of simultaneous source images of the 3D objects or of the scene volume recorded by the source cameras. The source images 13 are for instance retrieved from the reception buffer of the communication interface 302.
Although the source images may have different sizes from one source camera to another, it is assumed for illustration purposes that they have the same size. In any case, some source images may be resized to reach this situation. This resizing is not mandatory but helps to simplify the description.
From each of these source images 13i, one or more sets of 2D part candidate or candidates 2D-PCij(k) 403 are determined at step 410 for one or more respective parts 21j of the 3D real world object (e.g. head, left hand, ... of a human-like object). Each 2D part candidate 2D-PCij(k) corresponds to a sample (e.g. pixels) of the corresponding source image. Such determination is based on the 3D model or models 20 to detect each part of them (or at least the maximum number of such parts) within each source image. Several occurrences of the same model can be detected within the same source image, meaning several 3D real world objects are present in the scene captured by the cameras.
In the example of Figure 2, the detected 2D skeletons are made of up to thirteen parts with up to twelve connecting elements.
Known techniques can be used to produce these 2D skeletons from the source images 13.
One technique is described in publication “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields" by Zhe Cao et al. (2016).
Another technique is described in publication “DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model" by Eldar Insafutdinov et al. (2016) or publication “Deep-Cut: Joint Subset Partition and Labelling for Multi Person Pose Estimation" by Leonid Pishchulin et al. (2016).
More generally, a convolutional neural network (CNN) can be used which is configured based on a learning library of pictures in which a matching with each part of the models has been made. The CNN detects parts with associated part probabilities and may also provide pairwise (or part affinity) probabilities between detected parts which represent the probabilities that the detected parts are associated with the same 3D object. Pairwise probabilities may be obtained by different means. For example, in the publication “DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model” by Eldar Insafutdinov et al. (2016), a logistic regression algorithm is used.
An advantage of the CNNs is that the same running of the CNN can identify, within an input image, parts from different models, provided that the CNN has learnt using learning pictures embedding the various models to be searched.
Typically, the part probabilities generated are unary, i.e. set between 0 and 1.
The technique described in publication “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields" calculates confidence maps for part detection and part affinity fields for part association. A confidence map or “part map” for a given part bears probabilities for respective pixels of the source image that these pixels correspond to said part of the 3D model 20. Each part affinity field, defined for a connecting element (or limb) between two adjacent parts, provides affinity vectors for respective pixels of the source image, the magnitude and direction of each affinity vector representing the estimated orientation probability and orientation of the limb connecting, according to the 3D model, two occurrences of said adjacent parts at the respective pixel in the source image.
The part maps and part affinity fields may have a different size/resolution from the source images (e.g. they are sub-sampled compared to the size of the source image). In such a case, the intrinsic parameters of the cameras can be modified taking into account the subsampling factor. In a variant, the part maps or part affinity fields may be interpolated in order to match the genuine size of the source images. In such a case, a bilinear interpolation is preferred over a nearest-neighbor or bi-cubic interpolation.
The part maps can be processed to obtain part candidates for each part type. In this process, each part candidate can be provided with a part probability. In other words, this technique generates, from each source image, one or more part maps for one or more respective parts of the 3D real world object, each part map for a given part comprising part probabilities for respective samples (e.g. pixels) of the source image representing probabilities that the respective samples correspond to the given part, and then the technique determines sets of 2D part candidate or candidates from respectively the part maps.
The other technique described in publication “DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model" or publication “Deep-Cut: Joint Subset Partition and Labelling for Multi Person Pose Estimation" is different. However, part candidates with associated part probabilities and (local and global) pairwise probabilities between all parts are still determined.
It turns out that step 410 generates a plurality of sets of 2D part candidates 2D-PCij(k) (where “i” identifies the source images, “j” the 3D object part concerned and “k” indexes the various 2D part candidates identified for given i and j). Each 2D part candidate is thus defined by a 2D position (the pixel concerned), a part type (defining which part of the 3D object it corresponds to) and a part probability.
Figure 4a illustrates, using a flowchart, exemplary operations forming step 410 when applied to a given source image 13i. The flowchart can thus be repeated for each of the source images 13i.
The known techniques described above are dependent on the set of learning pictures used by the CNN to learn. The learning pictures usually provide exemplary objects that have bounded sizes. These techniques are thus badly adapted to detecting objects whose size is not of the same order of magnitude as in the learning pictures. Indeed, 3D objects can be sometimes big, sometimes tiny. This is for instance the case during sport events where players move from very close to the camera to very far away.
In embodiments seeking to increase robustness, it is proposed to use scaling of the source image to increase chances to have high part probabilities.
To that end, one or more scaled versions of a given source image 13 are obtained at step 411.
For instance, a half-sized image (scale 0.5) is generated (through down-sampling) as well as a double-sized image (scale 2 - through up-sampling). Known scaling techniques can be used.
Of course, other scaling values can be used. In this example, at least one up-scaled version and one downscaled version of the source image are obtained and used. In variants, only up-scaled versions or only downscaled versions are used.
Next, an intermediate part map is generated at step 412 for each part considered, from each of the source image and its scaled versions. This operation is made using any of the above-mentioned known techniques. Thus, the intermediate part map comprises part probabilities for respective pixels of the source image or its scaled version (possibly each pixel if the intermediate part map has the same dimensions as the images), which part probabilities represent probabilities that the respective pixels in the source image or scaled version correspond to the part considered.
Pixels of the source image or of its scaled versions are examples of “samples” forming an image. For ease of illustration, it is made reference below to pixels, while the invention may apply to any sample. A sample may be for instance a pixel in the source image, a color component of a pixel in the source image, a group of pixels in the source image, a group of pixel color components in the source image, etc.
As the generated intermediate part maps are not at the same scale, they are then preferably rescaled to a unique and common scale. For instance, an intermediate part map obtained from an up-scaled source image is downscaled (using the inverse scaling factor), meaning for instance that one part probability out of two is discarded (for a half scaling). Also, an intermediate part map obtained from a downscaled source image is up-scaled (using the inverse scaling factor), meaning for instance that a part probability for a new pixel is determined from the part probabilities of neighboring pixels (e.g. through interpolation).
The obtained (rescaled to the same scale) intermediate part maps are then used to generate at step 413 the part maps for said source image and the part currently considered. In particular, the part map for a given part 21j is preferably formed with, for each pixel considered, the highest part probability from the part probabilities of the same pixel considered in the intermediate part maps generated for part 21j (from the source image and its scaled versions).
For instance, consider a pixel pi(x,y) in the source image, with head probabilities (i.e. probabilities that the respective pixels correspond to a head of the model) calculated from the source image and its scaled versions. The highest probability among the head probability for pi in the head map obtained from the source image, the head probability for pi in the head map obtained from a first downscaled version of the source image, the head probability for pi in the head map obtained from a first up-scaled version of the source image, and so on, is selected to be the head probability associated with pixel (x,y) in the final and optimized head map output at step 410.
These operations are preferably repeated for each part forming the model 20 in order to obtain a corresponding number of optimized part maps.
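The multi-scale merging of steps 411 to 413 may be sketched as follows; `run_part_detector` stands for any 2D detector (e.g. a CNN) returning one part map per image and is a placeholder, and the scale set and the use of OpenCV for (bilinear) resizing are assumptions made for the example.

```python
import numpy as np
import cv2  # assumed available for bilinear resizing

def merged_part_map(source_image, run_part_detector, scales=(1.0, 0.5, 2.0)):
    # Returns an optimized part map holding, per pixel, the highest part probability
    # found over the source image and its scaled versions.
    h, w = source_image.shape[:2]
    merged = None
    for s in scales:
        scaled = cv2.resize(source_image, None, fx=s, fy=s, interpolation=cv2.INTER_LINEAR)
        intermediate = run_part_detector(scaled)                        # intermediate part map (step 412)
        intermediate = cv2.resize(intermediate, (w, h),
                                  interpolation=cv2.INTER_LINEAR)       # back to a common scale
        merged = intermediate if merged is None else np.maximum(merged, intermediate)  # step 413
    return merged
```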
Next, from these optimized part maps, the 2D part candidates can be determined at step 414. One set of 2D part candidate or candidates 2D-PCij(k) is obtained from each part map (i.e. for each object part).
Each 2D part candidate corresponds to a pixel in the source image. The determination is made based on the part probabilities associated with each pixel in the part maps. For instance, 2D part candidate or candidates from an optimized part map are determined by determining local maximum or maxima of the part probabilities in the part map and outputting (i.e. selecting) the pixel or pixels corresponding to the determined local maximum or maxima as 2D part candidate or candidates.
All 2D local maximum or maxima in each part map may be selected. They identify 2D candidates in the source image for each part considered.
In one embodiment, only the α highest local maxima are selected (α being an integer > 1). This makes the process less complex as only a few 2D part candidates are handled in the next steps.
In another and refining embodiment, a probability threshold can be used to keep only the local maximum or maxima that are associated with part probabilities above said threshold. This cleans up the set of 2D part candidates from any uncertain part candidates that would result from isolated part detection at step 410 (i.e. from few or very few source images). Consequently, the process is simplified. A probability threshold can be defined independently for each part or for a subset of parts. This is because the method used at step 410 may be more efficient to detect some 2D parts than other parts. A variant to the flowchart of Figure 4a may simply involve generating the part maps from the source images and determining the local maxima as 2D part candidates from the part maps, without using scaled versions.
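A minimal sketch of this local-maxima selection (step 414) is given below, using SciPy's maximum filter; the probability threshold and neighbourhood size are illustrative values only.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def part_candidates_from_map(part_map, threshold=0.3, neighbourhood=5):
    # Keeps, as 2D part candidates, the pixels that are local maxima of the part map
    # and whose part probability exceeds the threshold.
    local_max = part_map == maximum_filter(part_map, size=neighbourhood)
    ys, xs = np.nonzero(local_max & (part_map > threshold))
    return [((int(x), int(y)), float(part_map[y, x])) for y, x in zip(ys, xs)]  # ((x, y), probability)
```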
Back to Figure 4, once the sets of 2D part candidates 2D-PCij(k) 403 are known (one set per source image 13 and per object part 21j), step 420 consists in generating, in 3D space, 3D part candidates from the 2D part candidates. This is done using the parameters 401, 402 of the cameras 12 and of the scene volume V, if any.
Figure 4b illustrates, using a flowchart, exemplary operations forming step 420 according to embodiments. The flowchart uses as input two sets of 2D part candidates 403 corresponding to the same object part 21j, i.e. the 2D part candidates obtained from two different source images for the same part 21 j. It outputs 3D part candidates.
It aims at matching, as many times as possible (i.e. by repeating the matching operation), two 2D part candidates from respectively the two sets of 2D part candidates: a 2D part candidate of one set is matched with a 2D part candidate of the other set. 3D part candidates are then generated from respective pairs of matched 2D part candidates: each matching pair produces a 3D part candidate.
The operations of the flowchart are thus repeated for a plurality of parts 21j with the same pair of source images, preferably for each part 21j. Next, the same operations are preferably repeated for a plurality of pairs of source images, preferably for all possible pairs or for all pairs of a circular pairing of the source images (each image being paired with only a previous one and a next one).
As shown in Figure 4b, the matching, referenced 421, may include using and solving a graph to obtain one or more one-to-one associations between a 2D part candidate of a first one of the sets (i.e. from a first source image) and a 2D part candidate of the second set (i.e. from a second source image), wherein nodes of the graph correspond to the 2D part candidates of the two sets and weighted links between nodes are set based on a part distance determined between the corresponding 2D part candidates.
The part distances between a first 2D part candidate of a first one of the two sets and respectively 2D part candidates of the second set are calculated at step 422. Each 2D part candidate of the first set is processed to obtain part distances between each pair of 2D part candidates from the two sets.
Figure 6 schematically illustrates a way to compute a part distance ρδ between two 2D part candidates 2D-PC1j(k) and 2D-PC2j(k’) determined in two source images 131 and 132 for the same part 21j. The 2D part candidates determined from the two images (for part 21j only) are shown as black stars. Figure 7 illustrates, using a flowchart, the corresponding operations.
The extrinsic and intrinsic parameters 402 of the corresponding cameras 121 and 122 are known and used to calculate the two fundamental matrices 404: M1-2 from camera 121 to camera 122 and M2-1 from camera 122 to camera 121. In epipolar geometry, it is known that the fundamental matrix projects a point of a first view onto a line (an epipolar line) in the other view. Concretely, the epipolar line is the line Δ seen from the other camera. Two directions may thus be processed, meaning for instance that the part distance ρδ may be built from a first directional part distance ρδ1-2 and a second directional part distance ρδ2-1.
The top half of Figure 6 illustrates the computation of the first directional part distance ρδ1-2 while the bottom half illustrates the computation of the second directional part distance ρδ2-1.
As shown, a first one 2D-PC1j(k) of the 2D part candidates is projected 701 as a first epipolar line Δ1-2 on the source image 132 corresponding to the second 2D part candidate. Next, a first directional part distance ρδ1-2 is computed 702 between the second 2D part candidate 2D-PC2j(k’) and the first epipolar line Δ1-2. The distance may merely be the orthogonal distance between the part and the line (e.g. in number of pixels).
Symmetrically, the second 2D part candidate 2D-PC2j(k’) can be projected 703 as a second epipolar line Δ2-1 on the source image 131 corresponding to the first 2D part candidate, and a second directional part distance ρδ2-1 can be computed 704 between the first 2D part candidate 2D-PC1j(k) and the second epipolar line Δ2-1.
The part distance ρδ between the two 2D part candidates may thus be selected 705 as the maximum of the first and second directional part distances ρδ1-2 and ρδ2-1: ρδ = max{ρδ1-2 ; ρδ2-1}. In a variant, the mean value of the two directional part distances can be selected.
Of course, to simplify the process, only one directional part distance can be computed and kept as part distance ρδ.
Optional step 706 discards the part distances that are evaluated as too high to correspond to 2D part candidates representing the same part of the same 3D object. In this context, step 706 may comprise comparing the part distance ρδ with a predefined threshold (e.g. 20 pixels); if the part distance is above the predefined threshold, the part distance ρδ is set to an infinite value for the pair of considered 2D part candidates. This avoids any matching between the two 2D part candidates being ultimately found using the approaches described below (e.g. step 423).
By using this algorithm, a part distance ρδ is computed for each pair of 2D part candidates determined from two different sets of 2D part candidates corresponding to the same part 21j. For instance, all part distances ρδ(2D-PC1j(k), 2D-PC2j(k’)) between a 2D part candidate 2D-PC1j(k) determined from the first set and a 2D part candidate 2D-PC2j(k’) determined from the second set for model part 21j are known at the end of step 422 (some distances may be infinite).
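The part-distance computation of Figure 7 (steps 701 to 706) may be sketched as follows; the function signature is illustrative, and the 20-pixel rejection threshold mirrors the example given above.

```python
import numpy as np

def part_distance(p1, p2, F_1to2, F_2to1, max_pixels=20.0):
    # p1, p2: pixel coordinates of two 2D part candidates in images 1 and 2;
    # F_1to2, F_2to1: the two fundamental matrices 404.
    def distance_to_epipolar_line(point, other_point, F):
        a, b, c = F @ np.array([point[0], point[1], 1.0])       # epipolar line ax + by + c = 0 in the other image
        return abs(a * other_point[0] + b * other_point[1] + c) / np.hypot(a, b)

    d_1to2 = distance_to_epipolar_line(p1, p2, F_1to2)           # steps 701-702
    d_2to1 = distance_to_epipolar_line(p2, p1, F_2to1)           # steps 703-704
    distance = max(d_1to2, d_2to1)                               # step 705
    return np.inf if distance > max_pixels else distance        # step 706
```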
The next step is step 423, which consists in determining matchings between pairs of 2D part candidates, based on these part distances.
In embodiments, this is made using a graph. This is to obtain one or more one-to-one associations between a 2D part candidate of the first set and a 2D part candidate of the second set.
The graph is built with nodes corresponding to the 2D part candidates of the two sets and with weighted links between nodes that are set based on the determined part distances between the corresponding 2D part candidates. In this graph, a node (i.e. a 2D part candidate of a first set) is linked to a plurality of other nodes, namely the nodes corresponding to the 2D part candidates of the other set. No link is set between nodes corresponding to 2D part candidates of the same set.
The weights of the links are set to the corresponding calculated part distances. A bipartite solving of this graph is performed, which reduces to a maximum weight bipartite graph matching problem as explained for instance in “Introduction to graph theory, volume 2” by D. B. West et al. (2001). The solving step outputs optimal one-to-one associations between 2D part candidates, meaning that a 2D part candidate 2D-PC1j(k) of the first set is ultimately linked to (i.e. matched with) at most one 2D part candidate 2D-PC2j(k’) of the other set (still for the currently-considered model part 21j).
The bipartite solving may be based on the link weights only, meaning the one-to-one matchings correspond to the minimum of the sum of the link weights in the graph. Optionally, the nodes may be weighted using their respective part probabilities as indicated above (in which case an appropriate formula combining the node weights and the link weights is used).
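A minimal sketch of such a one-to-one matching, assuming the part distances are gathered in a cost matrix and using the Hungarian algorithm of SciPy as one possible bipartite solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_part_candidates(dist, max_dist=20.0):
    """One-to-one matching of 2D part candidates from two sets.

    dist : (n1, n2) matrix of part distances pδ; np.inf marks pairs already
           discarded at step 706.
    Returns a list of (k, k') index pairs of matched 2D part candidates.
    """
    big = 1e6                                         # stand-in for infinite distances
    cost = np.where(np.isfinite(dist), dist, big)
    rows, cols = linear_sum_assignment(cost)          # minimises the summed cost
    return [(k, kp) for k, kp in zip(rows, cols) if dist[k, kp] <= max_dist]
```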
The pairs of matched 2D part candidates {2D-PCij, 2D-PCkj} 405 are obtained for the currently-considered model part 21j and the current pair of source images 13i, 13k. Other matched 2D part candidates are obtained, using the same algorithm, for the other model parts and for the other pairs of source images to be considered.
Alternatively to the use of a graph at step 423, the closest 2D part candidate or candidates 2D-PC2j of the second set to the first part candidate 2D-PC1j(k) can be selected based on the determined part distances. This outputs one or more matching 2D part candidates for the first 2D part candidate. For instance, these may be the N (an integer equal to or greater than 1) closest candidates or those whose part distances are less than a predefined (Euclidean) distance.
Next, the 3D part candidates 3D-PCi,kj are generated from the matching pairs. This is step 424, which uses inter-view 2D triangulation to convert two matched 2D part candidates into a 3D part candidate in 3D space. 2D triangulation is performed matching pair after matching pair.
An exemplary implementation of this step for a given matching pair {2D-PC1j(k), 2D-PC2j(k’)} is illustrated in Figure 8. It is made of three main sub-steps, namely: projecting a first one 2D-PC1j(k) of the matched 2D part candidates as a first line Δ1 in 3D space (e.g. the volume V representing the scene volume when it is defined), the projection corresponding to the line shown in Figure 5 for instance and being a geometrical operation based on the intrinsic and extrinsic parameters of the corresponding camera (here camera 121); projecting the second matched 2D part candidate 2D-PC2j(k’) as a second line Δ2 in the 3D space; and determining a 3D position (e.g. a voxel V(X,Y,Z)) locating the 3D part candidate 3D-PC1,2j, based on the first and second lines.
The two lines Δ1 and Δ2 rarely intersect each other at the same 3D position or in the same voxel. If they do intersect, the intersection point or voxel is selected as representing the part considered. Otherwise, the 3D point or voxel closest to the two lines is preferably selected. Closeness can be evaluated using a least-squares distance approach.
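For illustration purposes only, the third sub-step could be sketched as the least-squares midpoint of the two lines; the camera centres and ray directions are assumed to be available from the calibration.

```python
import numpy as np

def triangulate_midpoint(c1, d1, c2, d2):
    """3D position closest to the two back-projected lines Δ1 and Δ2.

    c1, c2 : 3D camera centres of the two views.
    d1, d2 : unit direction vectors of the two lines.
    Returns the midpoint of the shortest segment joining the two lines, used
    here as the position of the 3D part candidate.
    """
    c1, d1, c2, d2 = (np.asarray(v, dtype=float) for v in (c1, d1, c2, d2))
    # Solve for the ray parameters (t1, t2) minimising ||(c1 + t1*d1) - (c2 + t2*d2)||^2.
    A = np.stack([d1, -d2], axis=1)                   # 3x2 system
    b = c2 - c1
    (t1, t2), *_ = np.linalg.lstsq(A, b, rcond=None)
    return 0.5 * ((c1 + t1 * d1) + (c2 + t2 * d2))
```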
Back to Figure 4, the result of step 420, performed for each pair (i,k) of source images 13i, 13k and a given model part 21j, is a 3D set of 3D part candidates 406. Several such 3D sets are obtained, one for each of the parts 21 composing the object model 20.
Each 3D part candidate is thus defined by a 3D position (e.g. voxel position in SYS) and a part type (the part 21 to which it corresponds). The 3D part candidate may further be associated with the two part probabilities of the two matched 2D part candidates from which it is generated and/or with the part distance calculated for the two matched 2D part candidates.
To reduce the number of 3D part candidates to be processed further, a filtering 430 may optionally be performed which consists in filtering 3D part candidates of a given 3D set (i.e. for a given model part 21j) into a subset of 3D part candidates. An idea of this step 430 is to keep the most promising 3D part candidates.
Figure 4c illustrates, using a flowchart, exemplary operations forming step 430.
Various embodiments may be contemplated.
Some embodiments include selecting 3D part candidates generated from pairs of matched 2D part candidates that share the same 2D part candidates. This is done one 3D set after the other, i.e. one model part 21 after the other. In these embodiments, confidence is given to the 2D part candidates that are involved in several matching pairs, thereby providing confidence to the 3D part candidates generated from them.
The optimal situation is when three matching pairs are built from exactly the same three 2D part candidates, in which case the triplet of 3D part candidates generated from these three pairs can be considered as three confident 3D part candidates. Figure 9a illustrates such a situation: (2D-PC1j(k), 2D-PC2j(k’)), (2D-PC1j(k), 2D-PC3j(k”)) and (2D-PC3j(k”), 2D-PC2j(k’)) are determined as matching pairs in step 423 (the matching is shown in the Figure with a thin dotted line between the 2D part candidates), resulting in classifying the three 3D part candidates 3D-PC12j, 3D-PC13j and 3D-PC23j as confident 3D part candidates because they share the same three 2D part candidates.
At step 431, such confident 3D part candidates are thus kept through the filtering.
Optionally, the alleged confident 3D part candidates whose associated part distances are too high can be regarded as not confident and thus discarded. The others (with a low associated part distance) are kept. This means that selecting or discarding 3D part candidates generated from pairs of matched 2D part candidates can be based on the part distance pδ determined between the 2D part candidates of the respective pairs. This selects the confident 3D part candidates more reliably.
Alternatively, step 431 may merely consist in using the associated part distance to select or discard the 3D part candidates, regardless of whether they share 2D part candidates.
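A minimal sketch of the triplet check of step 431, assuming each 3D part candidate records the two 2D part candidates it was triangulated from (the data layout and field name are assumptions):

```python
from itertools import combinations

def confident_triplet_indices(candidates_3d):
    """Return indices of 3D part candidates belonging to at least one triplet
    of matching pairs built from exactly the same three 2D part candidates.

    candidates_3d : list of dicts, each with a "parents" frozenset holding the
    identifiers of its two matched 2D part candidates, e.g.
    frozenset({("view1", 4), ("view2", 7)}).
    """
    confident = set()
    for (i, a), (j, b), (k, c) in combinations(enumerate(candidates_3d), 3):
        pa, pb, pc = a["parents"], b["parents"], c["parents"]
        # Three distinct pairs drawn from only three 2D candidates form a triangle.
        if len({pa, pb, pc}) == 3 and len(pa | pb | pc) == 3:
            confident.update((i, j, k))
    return confident
```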
Once the confident 3D part candidates have been filtered, lower-confidence 3D part candidates can also be selected at step 432.
All the 3D part candidates not selected at step 431 can be considered successively for this step. Alternatively, only those not yet selected (through the triplet approach) and that share a 2D part candidate with another 3D part candidate are considered. Figure 9b illustrates such 3D candidates: (2D-PC1j(k), 2D-PC2j(k’)) and (2D-PC2j(k’), 2D-PC3j(k”)) are determined as matching pairs in step 423 but 2D-PC1j(k) and 2D-PC3j(k”) are not matched (as shown in the Figure with the thick dashed line). Thus 3D-PC12j and 3D-PC23j share a 2D part candidate 2D-PC2j(k’) but no more.
Step 432 may then consist in selecting those successively considered 3D part candidates that are closer (in the sense of a 3D Euclidean distance for instance) to a yet-selected 3D part candidate (selected during step 431 for the same model part 21j) than a predefined distance, e.g. less than 2 meters for human objects 11.
Again, optionally, the 3D part candidates successively considered whose associated part distances are too high can be discarded. The others (with low associated part distance) are thus kept as filtered 3D part candidates.
The output of step 430 is a 3D subset 407 of filtered 3D part candidates for each model part 21j considered.
Following the filtering step 430, or the 3D part candidate generation 420 if no filtering is made, the generated 3D part candidates 3D-PCi,kj representing the same part 21j are converted at step 440 into a single 3D part candidate. This makes it possible to consolidate clusters of, for instance, very close 3D part candidates into a robust 3D part candidate with which to build a final 3D skeleton.
The same conversion process can be repeated for each part.
The conversion can be based on spatial closeness, i.e. on 3D distances between the 3D part candidates generated for the part considered (generated from the various source images 13). For instance a RANSAC (RANdom SAmple Consensus) algorithm with a local/global fitting model can be applied. This is illustrated in Figure 10.
Let us consider the 3D set of 3D points (i.e. 3D part candidates) generated for the part currently considered. A RANSAC average 3D position is calculated from these 3D points.
The RANSAC approach calculates a robust average 3D position as the average of selected inliers, i.e. selected 3D points. These 3D points are accepted as inliers for the computation if their distances to the robust average 3D position are below a threshold. The number of inliers N_Inliers (i.e. the number of 3D points that are close to the average 3D position calculated by the RANSAC algorithm) is known. This is a functionality of the RANSAC algorithm.
This approach thus: a) selects two or more of the 3D part candidates generated for the same part that are close enough to one another, and b) generates a centroid 3D part candidate from the selected ones, as a single 3D part candidate, optionally replacing the selected ones.
For instance, clusters of the generated 3D part candidates are first formed. As an example, each 3D part candidate is connected to every other 3D part candidate closer than a predefined distance threshold. Next, each set of connected 3D part candidates is processed separately using the RANSAC algorithm to provide a robust average 3D position for the set and identify the inliers therefrom. The robust average 3D position should maximize the number of inliers from the set.
This robust average 3D position is kept if the number of inliers is sufficient (for instance 2 or more).
The RANSAC may be iteratively applied by substituting the inliers with the calculated average 3D position and determining a new robust average 3D position.
Alternatively, the determined inliers may be discarded for subsequent iteration without substitution.
In other words, the conversion includes repeating a) and b), for instance until a) cannot select two or more 3D part candidates or until a number of remaining 3D part candidates (inliers being discarded) is below a predefined value.
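A minimal sketch of such a conversion, assuming the 3D part candidates of one part are given as an array of 3D positions; the distance thresholds and the simple random-seed search standing in for a full RANSAC are assumptions.

```python
import numpy as np

def consolidate_part_candidates(points, inlier_dist=0.15, n_iter=100,
                                min_inliers=2, rng=None):
    """Step 440 sketch: convert clusters of 3D part candidates into single
    robust average 3D part candidates and return the remaining outliers.

    points : (N, 3) array of 3D part candidate positions for one model part.
    Returns (averages, outlier_indices) where averages is a list of
    (mean_position, n_inliers) tuples.
    """
    rng = rng or np.random.default_rng(0)
    remaining = list(range(len(points)))
    averages, outliers = [], []

    while remaining:
        pts = points[remaining]
        best = np.array([], dtype=int)
        for _ in range(n_iter):
            seed = pts[rng.integers(len(pts))]                  # random hypothesis
            inliers = np.where(np.linalg.norm(pts - seed, axis=1) < inlier_dist)[0]
            if len(inliers) > len(best):
                best = inliers
        if len(best) >= min_inliers:
            averages.append((pts[best].mean(axis=0), len(best)))
            kept = set(best.tolist())
            remaining = [remaining[i] for i in range(len(pts)) if i not in kept]
        else:
            outliers.extend(remaining)                           # the rest stay as outliers
            break
    return averages, outliers
```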
The converting iterations may convert separate sets of connected 3D part candidates. This is for instance the case when several 3D objects are captured by the cameras. Indeed, in that case, usually the same number of 3D point clouds is obtained, wherein each 3D point cloud is made of 3D part candidates generated from the source images.
As a result, one or more sets of inliers (i.e. original or filtered 3D part candidates) have been converted into one or more respective robust average 3D part candidates. Some 3D part candidates may not have been converted. This is schematically illustrated in Figure 10.
The 3D part candidates 3D-PCj (stars in the Figure) are generated from matching pairs of 2D part candidates 2D-PCj for the model part 21j considered. The RANSAC algorithm determines for instance a cluster of three 3D part candidates (bottom circle) which are all inliers for a single average 3D part candidate (the circle): for this cluster, N_Inliers=3. It also determines for instance a cluster of seven 3D part candidates (top circle), six of which (black stars) are inliers for a single average 3D part candidate (the circle): N_Inliers=6. The last 3D part candidate (white star) of this cluster remains as an outlier, in which case N_Inliers=1.
Another outlier (3D part candidate outside the two circles) also remains.
In this described RANSAC-based embodiment, the 3D part candidates 3D-PCj have the same weight. In a variant, each of the 3D part candidates can be weighted, using for instance the part probabilities obtained for the two matched 2D part candidates 2D-PCj from which the 3D part candidate is generated. The RANSAC algorithm can thus take these weights into account, for instance to compute the average 3D position as a weighted barycenter of the inliers’ 3D positions. In embodiments, the average unary probability of the two part probabilities can be used.
At the end of step 440, the set of average 3D part candidates and the set of remaining outliers form the set of final 3D part candidates 408 for the 3D skeleton generation 450.
The generating step 450 may use and solve a graph to obtain one or more one-to-one associations between final 3D part candidates representing two different parts. The graph may be built with nodes corresponding to the 3D part candidates representing the two different parts considered and with weighted links between nodes corresponding to two 3D part candidates for the two different parts that are set based on a distance between the two 3D part candidates. In this graph, a node (i.e. a 3D part candidate) corresponding to a first part (e.g. head) is linked to one or more nodes corresponding to the other part (e.g. neck). No link is set between nodes corresponding to 3D part candidates corresponding to the same part.
The two different parts considered are preferably adjacent according to the 3D model of the 3D real world object.
The graph-based solving may be used for each pair of adjacent parts, in order to progressively obtain one-to-one associations for all the pairs.
With reference to Figure 4d, to build 451 the graph for a first part and an adjacent second part, the nodes are first set for each final 3D part candidate 408 obtained for the first and second parts considered.
Each node may then be weighted based on the number N_Inliers of 3D part candidates used to generate the final 3D part candidate corresponding to the node. For instance, with regard to Figure 10, the node corresponding to the top average 3D part candidate can be weighted with the value 6; the node corresponding to the bottom average 3D part candidate can be weighted with the value 3; while the nodes corresponding to the outliers can be weighted with the value 1.
Other weights, for instance based on the part probabilities of the matched 2D part candidates from which these 3D part candidates are generated, can be taken into account. An average part probability can be used for instance.
The building of the graph also requires setting links between the nodes. No link is preferably set between nodes representing 3D part candidates of the same part. A link can always be defined between nodes representing 3D part candidates of two different parts. In embodiments, such a link between two nodes corresponding to a 3D first-part candidate and a 3D second-part candidate can be set depending on a (e.g. Euclidean) distance between the two 3D part candidates and morphological constraints defined by the 3D model between the two different parts considered. This aims at reducing complexity in the graph through morphological considerations. Indeed, a human head cannot be 2 meters away from the neck.
The constraints may indeed vary from one part to the other. For instance, a common head-neck distance is less than 40 cm, a common pelvis-knee distance is less than 80 cm, and so on.
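For illustration purposes only, such a link-setting rule could take the following form, where the table of maximum distances and the part names are assumptions:

```python
import numpy as np

# Illustrative morphological constraints in metres (the part names and values
# are assumptions, not taken from the embodiment).
MAX_PART_DISTANCE = {
    ("head", "neck"): 0.40,
    ("pelvis", "knee"): 0.80,
}

def set_link(part_a, pos_a, part_b, pos_b):
    """Decide whether a link is created between two final 3D part candidates
    of two different parts, from their distance and the morphological constraints."""
    limit = MAX_PART_DISTANCE.get((part_a, part_b)) or MAX_PART_DISTANCE.get((part_b, part_a))
    if limit is None:
        return False                      # no constraint known: treat the parts as not linkable
    return float(np.linalg.norm(np.asarray(pos_a) - np.asarray(pos_b))) <= limit
```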
Once the links are set, their weights are calculated.
In embodiments, the weight for a link is calculated based on pairwise probability or probabilities between pairs of 2D part candidates (each representing the probability of association between the corresponding paired 2D part candidates), pairs from which the two 3D part candidates forming the link are generated.
The pairwise probabilities have been briefly introduced above. A pairwise probability is obtained for two 2D part candidates belonging to the same source image. It can be obtained based on the techniques described in “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields”, “DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model” and “DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation”.
For instance, “Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields” describes how part affinity fields are obtained. One part affinity field is generated for each pair of adjacent parts (according to the 3D model 20) and from each source image 13i. It is generated by the same CNN as the one used at step 410.
The part affinity fields can be processed to obtain (local) pairwise (or part affinity) probabilities between part candidates identified from the same source image for two adjacent parts. Such pairwise probability may be the modulus of affinity vectors between the two part candidates.
Pairwise probabilities may also be obtained between parts that are not adjacent. Such pairwise probability is said to be “global” and mirrors the probability that the two parts belong to the same object.
Similar to the part maps, a part affinity field may have the same dimensions as the source image or reduced dimensions in which case it may be upscaled to recover the same dimensions.
In detail, a part affinity field between the two adjacent parts (e.g. right foot and right knee) includes affinity vectors for respective pixels of the source image, the magnitude and direction of each affinity vector representing estimated orientation probability and orientation of an element connecting, according to the 3D model, two occurrences of said adjacent parts at the respective pixel in the source image. According to the training base used to generate the CNN, the maximal magnitude may be limited to 1.
The part affinity fields are usually at a lower resolution than the source images. It is possible to up-sample the part affinity field to the same resolution as the source image. In such a case, an up-sampled part affinity field for the two adjacent parts (e.g. right foot and right knee) includes an affinity vector for each pixel of the source image, the magnitude and direction of each affinity vector representing the estimated orientation probability and orientation of an element connecting, according to the 3D model, two occurrences of said adjacent parts at said pixel in the source image.
This up-sampling is however optional.
Figure 11 schematically illustrates a portion of a part affinity field PAF between the right foot and the right knee in a source image (the leg in the source image is schematically traced in dotted line to show the correspondence with the affinity vectors AV). The affinity vectors of the lower part of the leg are not shown for clarity reasons.
The directions of the affinity vectors show the estimated orientation of a connecting element between the two parts considered (i.e. here the estimated orientation of the leg).
The lengths of the affinity vectors show the confidence in the orientation estimation at each pixel. The longer the AVs (with a length limited to one pixel), the more certain their orientations are.
The knowledge of orientations with high probabilities (AVs with a modulus close to 1) gives relevant information on how to connect two 2D part candidates. To do so, the pairwise probability between the two 2D part candidates is calculated based on the generated part affinity field, for instance based on the affinity vectors located between the two 2D part candidates in the generated part affinity field.
The affinity vectors to be considered may be along the segment formed by the two 2D part candidates, in particular the closest ones. For instance the known Bresenham's line algorithm can be used to determine which pixels (and thus associated affinity vector) to parse along this segment.
In practice, the scalar products of the vector formed by the two 2D part candidates and each of the affinity vectors located between the two 2D part candidates (thus identified by Bresenham's line algorithm) can be computed, and then summed to obtain the pairwise probability between the two 2D part candidates. The sum can then be normalized by dividing it by the modulus of the vector formed by the two 2D part candidates.
If the pairwise probability is too low, it may be set to 0 or considered as not existing.
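For illustration purposes only, the following sketch accumulates such scalar products along the segment joining the two 2D part candidates; it samples evenly spaced points rather than using Bresenham's algorithm and normalizes by the number of samples, which are simplifying assumptions.

```python
import numpy as np

def pairwise_probability(paf, p_a, p_b, min_prob=0.05):
    """Pairwise (part affinity) probability between two 2D part candidates.

    paf      : (H, W, 2) part affinity field for the pair of adjacent parts.
    p_a, p_b : (x, y) pixel positions of the two 2D part candidates, assumed
               to lie inside the field.
    min_prob : probabilities below this value are set to 0 (illustrative).
    """
    p_a, p_b = np.asarray(p_a, dtype=float), np.asarray(p_b, dtype=float)
    seg = p_b - p_a
    length = np.linalg.norm(seg)
    if length == 0:
        return 0.0
    unit = seg / length
    n_samples = max(int(length), 2)
    score = 0.0
    for t in np.linspace(0.0, 1.0, n_samples):
        x, y = np.rint(p_a + t * seg).astype(int)
        score += paf[y, x] @ unit          # scalar product with the local affinity vector
    prob = score / n_samples               # normalised accumulation
    return prob if prob >= min_prob else 0.0
```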
Techniques described in “DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model” and “DeepCut: Joint Subset Partition and Labeling for Multi Person Pose Estimation” directly provide pairwise probabilities between two 2D part candidates.
Back to the setting of a link weight in the graph, it is recalled that N_Inliers 3D first-part candidates are used to generate the first final 3D first-part candidate of the link and N_Inliers 3D second-part candidates are used to generate the second final 3D second-part candidate of the link. As each 3D part candidate comes from a pair of 2D part candidates, a plurality of 2D first-part candidates and a plurality of 2D second-part candidates may be used to produce the two final 3D part candidates.
The pairwise probabilities between pairs of 2D part candidates corresponding to these final 3D part candidates can thus be retrieved (this is done if the two final 3D part candidates are linked together given the morphological constraints).
Figures 12 illustrate such retrieval of pairwise probabilities for a pair of original 3D part candidates. This must be reiterated for each pair of one of the N_Inliers 3D first-part candidates and one of the N_Inliers 3D second-part candidates.
In the first case shown in Figure 12a, the two original 3D part candidates have been built (step 420) from two pairs of matched 2D part candidates coming from the same two source images 131 and 132. A pairwise probability may have been obtained between the two 2D part candidates belonging to the same source image. In the affirmative, this pairwise probability is retrieved. As a result, the two original 3D part candidates may be associated with 0, 1 or 2 pairwise probabilities inherited from their 2D part candidates.
In the second case shown in Figure 12b, the two original 3D part candidates have been built (step 420) from two pairs of matched 2D part candidates coming from three source images 131, 132 and 133: only two 2D part candidates belong to the same source image 132. The pairwise probability between these two 2D part candidates is retrieved, if any. As a result, the two original 3D part candidates may be associated with 0 or 1 pairwise probability inherited from their 2D part candidates.
In the last case shown in Figure 12c, the two original 3D part candidates have been built (step 420) from two pairs of matched 2D part candidates coming from four source images 131 to 134: the 2D part candidates all come from different source images. As a result, the two original 3D part candidates cannot be associated with any pairwise probability inherited from their 2D part candidates.
The link between the first final 3D first-part candidate and the second final 3D second-part candidate is thus associated with the pairwise probabilities inherited from each pair of one of their N_Inliers 3D first-part candidates and one of their N_Inliers 3D second-part candidates. There may be a high number of pairwise probabilities.
Figure 13a schematically illustrates such retrieval. For ease of illustration, a single final 3D neck (second-part) candidate is shown which is obtained as an outlier (i.e. from a single original 3D neck candidate).
Regarding the top final (average) 3D head (first-part) candidate, there are 6 pairs of original 3D head and neck candidates to be considered (the six solid lines on the left side). From each pair, 0 to 2 pairwise probabilities can be inherited (as shown in Figures 12), thereby resulting in 0 to 12 pairwise probabilities for the link (solid line on the right side) between the top final (average) 3D head candidate and the 3D neck candidate.
As far as the bottom final (average) 3D head candidate is concerned, up to 6 pairwise probabilities can be inherited. Regarding the outlier final 3D head candidate, only up to 2 pairwise probabilities can be inherited.
The weight for the link between the first final 3D first-part candidate and the second final 3D second-part candidate can thus be set based on the inherited pairwise probabilities, for instance as the maximum value or the mean value or any other formula.
Figure 13b schematically illustrates such retrieval in a different way where both final 3D head and neck candidates are built from two or more original 3D head and neck candidates. This example also shows that a pair of original 3D-PChead-3D-PCneck is not connected due for instance to morphological constraints. In other words, pairwise probabilities are discarded (or not inherited) based on morphological constraints regarding the corresponding original 3D part candidates.
In the case of Figure 13b, the link between the two final average 3D part candidates (solid line in the right side) is associated with up to 6 pairwise probabilities inherited from the three remaining pairs (solid lines in the left side).
At the end, pairwise probabilities may be obtained for the links provided in the graph.
Where no pairwise probability is inherited, it may be decided to discard such a link. In a variant, a predefined weight may be assigned to it.
Alternatively to or in combination with the use of the pairwise probabilities, the weight for a link may be based on a (e.g. Euclidean) distance between the two final 3D part candidates forming the link.
As mentioned above, such graph is built for each pair of adjacent parts within the model 20.
While the graphs are, in this example, built after the filtering and converting steps 430 and 440, the building may be made before step 430 or between steps 430 and 440. In that case, the nodes in the graph correspond to the original or filtered 3D part candidates and the links may be associated with the corresponding inherited pairwise probabilities. Next, when the filtering and converting steps are performed, the graph may be updated (deletion of the filtered-out 3D part candidates and substitution of the inliers with a new node corresponding to the average 3D part candidate).
A bipartite solving 452 of each graph as introduced above outputs optimal one-to-one associations between pairs of final 3D part candidates. It means that a final 3D part candidate 3D-PCj for part 21j is at the end linked to (i.e. matched with) at most one final 3D part candidate 3D-PCk for the other part 21k.
The bipartite solving may be based on the link weights and/or on node weights, meaning the one-to-one matchings correspond to the minima (or maxima) of a formula (e.g. sum) involving the link weights and/or the node weights.
Figure 14 schematically illustrates the generation of a 3D skeleton candidate using the graph solving. The Figure shows (in its top part) two graphs sharing the same part. For instance, graph 1 includes the final 3D part1 and part2 candidates while graph 2 includes the final 3D part2 and part3 candidates (assuming part1 and part2, as well as part2 and part3, are adjacent according to model 20).
Each graph is solved independently, meaning the 3D skeleton generating step 450 includes repeatedly using and solving a graph for each of a plurality of pairs (e.g. each and every pair) of adjacent parts according to a 3D model of the 3D real world object, in order to obtain one or more one-to-one associations between 3D part candidates for each pair of adjacent parts. As explained below, pairs of associated 3D part candidates that share the same 3D part candidate can then be connected, using for instance a connected component algorithm, to obtain one or more (full or partial) 3D skeleton candidates.
Preferably, a connected component algorithm can first be run on the graph currently considered to extract un-connected sub-graphs and solve them independently. This is to reduce processing complexity. For instance, graph 2 of the Figure may be split into independent sub-graphs SG1 and SG2.
For each of graph 1, SG1 and SG2, the graph bipartite solving 452 is run to find the solution (in terms of links) that maximizes an energy in a one-to-one relationship.
The energy E to maximize may be the sum of elementary energies e assigned to the pairs of connected nodes respectively: E = Σe. Each elementary energy e may be based on the node weights (e.g. the N_Inliers values), the link-associated pairwise probabilities and/or the Euclidean distance between the two corresponding final 3D part candidates. For instance: e = α·f(N_Inliers) + β·g(pairwise probabilities) + γ·(1 + Euclidean distance)^(-1).
The N_Inliers values weighting the two nodes are used as inputs of function f. For instance f(x,y) = max(x,y).
The pairwise probability values associated with the link are used as inputs of function g. For instance g(x1,x2,...,xn) = sum(xi > threshold). The threshold may be predefined. α, β, γ can be heuristic weights, for instance set to 0.5.
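As a purely illustrative sketch of such an elementary energy, where g is read as the sum of the pairwise probabilities above the threshold and all numeric values are assumptions:

```python
def elementary_energy(n_inliers_a, n_inliers_b, pairwise_probs, dist,
                      alpha=0.5, beta=0.5, gamma=0.5, threshold=0.2):
    """Elementary energy e assigned to one link of the graph.

    n_inliers_a, n_inliers_b : N_Inliers values weighting the two nodes.
    pairwise_probs           : pairwise probabilities inherited by the link.
    dist                     : Euclidean distance between the two final 3D candidates.
    """
    f = max(n_inliers_a, n_inliers_b)                    # f(x, y) = max(x, y)
    g = sum(p for p in pairwise_probs if p > threshold)  # confident pairwise probabilities only
    return alpha * f + beta * g + gamma / (1.0 + dist)
```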
The result of the solving 452 is a set of one-to-one-associated part candidates (there may be a single association in the set) for each graph (i.e. for each pair of adjacent parts according to the 3D model 20), as shown for instance in the bottom part of Figure 14.
The final steps consist in selecting one-to-one-associated part candidates so obtained as parts of the final 3D skeleton.
Preferably one-to-one-associated final 3D part candidates of two or more pairs of adjacent parts are connected to obtain 3D skeleton candidate or candidates. This is step 453 of building 3D skeleton candidates. A connected component algorithm can be used. As readily apparent from the Figure, the final 3D part candidates common to two (or more) obtained associations make it possible to progressively build a 3D skeleton candidate.
The idea is merely to use each graph output to parse (and thus build) the candidate 3D skeletons.
The output of a first graph is selected from which the one-to-one associations (of 3D adjacent part candidates) are successively considered. Given an associated pair of 3D adjacent part candidates (for instance parti and part2), the outputs of the other graphs (preferably those involving one of the parts previously considered, for instance part2 and part3) are used to determine whether or not these 3D adjacent part candidates are also one-to-one associated with other 3D part candidates. In the affirmative, the various 3D part candidates are put together in the same data structure in memory, which progressively forms a 3D skeleton candidate. And so on.
To illustrate this process, still using the model of Figure 2, let us consider a first association between a head candidate (voxel or “point” P1 in the scene volume) and a neck candidate (voxel or “point” P2 in the scene volume). This association results from the solving of the head-neck graph. The solved left-shoulder-neck graph is used to determine whether an association between the same neck candidate (P2) and a left-shoulder candidate exists. In the affirmative (voxel or “point” P3 in the scene volume for the left-shoulder candidate), points P1, P2, P3 are put together in a candidate structure. And so on with the left-elbow-left-shoulder graph, the left-hand-left-elbow graph, the right-shoulder-neck graph, the pelvis-neck graph, and so on. At the end, at most thirteen points P1-P13 in the 3D space may have been found, which form an entire 3D skeleton candidate. A second association between a head candidate and a neck candidate may produce a second 3D skeleton candidate, be it entire (if all the graphs provide a new point) or not.
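A minimal sketch of this step 453, assuming the solved graphs provide the one-to-one associations as pairs of part-candidate identifiers (the identifier format and the use of the networkx library are assumptions):

```python
import networkx as nx

def build_skeleton_candidates(associations):
    """Group one-to-one associations into 3D skeleton candidates using a
    connected component algorithm.

    associations : list of (id_a, id_b) tuples, one per association output by
    a solved graph, where each identifier names a final 3D part candidate,
    e.g. (("head", 0), ("neck", 2)).
    Returns a list of skeleton candidates, each a set of part-candidate ids.
    """
    graph = nx.Graph()
    graph.add_edges_from(associations)
    return [set(component) for component in nx.connected_components(graph)]
```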
It turns out that one or more (entire or partial) 3D skeleton candidates are formed. A 3D skeleton candidate may be made of a single isolated one-to-one association between two part candidates or of only a few associations.
From these 3D skeleton candidates obtained, one 3D skeleton candidate may be selected as a 3D skeleton 3D-SK of the 3D object 11. This is step 454. Of course, if a plurality of 3D objects having the same 3D model 20 is present in the scene volume, a plurality of 3D skeleton candidates is selected as 3D skeletons 409 for these objects. The number of 3D skeleton candidates to be selected can be known in advance. However, some applications may not know such number.
Apart from such a known number, various criteria may be used, alternatively or in combination, to select the 3D skeleton candidates. These criteria increase the relevancy of the selection (i.e. the selected candidates correspond to existing objects in the scene volume).
An exemplary criterion is the number of parts forming the 3D skeleton candidate according to the 3D model. One easily understands that a more complete skeleton candidate is a better candidate than a more partial one. Thus, preferably, this number should be above a predefined threshold (e.g. 9 out of 13 in the case of Figure 2) for the 3D skeleton candidate to be selected (or kept). In a variant, the 3D skeleton candidate or candidates with the highest number of parts are selected.
This is to avoid having too partial 3D skeletons, and it automatically discards the isolated one-to-one associations (or the 3D skeleton candidates made of few associations). This criterion is similar to a number of connecting elements in the 3D skeleton candidate.
Another criterion is the visibility of the 3D skeleton candidate by the source cameras 12. Such visibility can be expressed as the number of projecting images for the voxels composing the candidate, i.e. the number of source images onto which the 3D skeleton candidate can be projected. For instance, such number (or visibility) should be above a predefined number, e.g. half the number of source cameras, to select (or keep) the 3D skeleton candidate.
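For illustration purposes only, both criteria could be combined as follows, where the candidate representation and the numeric thresholds are assumptions:

```python
def select_skeletons(candidates, min_parts=9, n_cameras=8):
    """Step 454 sketch: keep 3D skeleton candidates that are complete enough
    and visible from enough source cameras.

    candidates : list of dicts such as {"parts": {...}, "n_views": 5}, where
    "parts" maps part names to 3D positions and "n_views" is the number of
    source images onto which the candidate can be projected.
    """
    min_views = n_cameras // 2          # e.g. half the number of source cameras
    return [c for c in candidates
            if len(c["parts"]) >= min_parts and c["n_views"] >= min_views]
```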
Some applications may require that the 3D skeleton or skeletons obtained at step 450 (thus generated using the process of the Figure) be displayed, for instance using the display screen 15. A 2D or 3D image of the 3D object or objects can thus be generated using the obtained 3D skeleton or skeletons.
Figure 15 illustrates, using a flowchart, such a process 1500 for displaying a 3D skeleton of one or more 3D real world objects observed by source cameras. This is an exemplary application using the generated 3D skeleton.
Step 1501 corresponds to generating a 3D skeleton of the 3D real world object using the teachings of the invention, e.g. using the process of Figure 4.
Step 1502 consists in selecting a virtual camera 12v. Such a camera does not actually exist. It is defined by a set of extrinsic and intrinsic parameters chosen by the user. These parameters define from which viewpoint, at which distance and with which focal length (i.e. zoom) the user wishes to view the scene.
Using these parameters of the virtual camera, the virtual image 13v can be computed at step 1503. This step merely consists in projecting the 3D skeleton or skeletons located in the 3D space onto a virtual empty image defined by the parameters of the virtual camera.
Next, the built virtual image 13v is displayed on the display screen 15 at step 1504.
Steps 1503 and 1504 ensure the display on a display screen of the generated 3D skeleton from the viewpoint of the virtual camera.
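For illustration purposes only, step 1503 could be sketched as follows for a pinhole virtual camera, where the function name and the visibility test are assumptions:

```python
import numpy as np

def project_skeleton(points_3d, K, R, t, image_size):
    """Project 3D skeleton points onto the virtual image 13v defined by the
    intrinsic (K) and extrinsic (R, t) parameters of the virtual camera 12v.

    points_3d  : (N, 3) array of 3D skeleton part positions.
    image_size : (width, height) of the virtual image.
    Returns the (M, 2) pixel coordinates of the points visible in the image.
    """
    cam = (R @ points_3d.T).T + t                    # world -> camera coordinates
    cam = cam[cam[:, 2] > 0]                         # keep points in front of the camera
    proj = (K @ cam.T).T
    px = proj[:, :2] / proj[:, 2:3]                  # perspective division
    w, h = image_size
    visible = (px[:, 0] >= 0) & (px[:, 0] < w) & (px[:, 1] >= 0) & (px[:, 1] < h)
    return px[visible]
```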
Although the present invention has been described hereinabove with reference to specific embodiments, the present invention is not limited to the specific embodiments, and modifications will be apparent to a skilled person in the art, which lie within the scope of the present invention.
Many further modifications and variations will suggest themselves to those versed in the art upon making reference to the foregoing illustrative embodiments, which are given by way of example only and which are not intended to limit the scope of the invention, that being determined solely by the appended claims. In particular the different features from different embodiments may be interchanged, where appropriate.
In the claims, the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality. The mere fact that different features are recited in mutually different dependent claims does not indicate that a combination of these features cannot be advantageously used.

Claims (23)

1. A method for generating a 3D skeleton of one or more 3D real world objects observed by cameras, comprising the following steps performed by a computer system: obtaining, from memory of the computer system, a plurality of simultaneous images of the 3D real world objects recorded by the cameras, determining, from each image, one or more sets of 2D part candidate or candidates for one or more respective parts of the 3D real world object, each 2D part candidate corresponding to a sample of the corresponding image, generating, in 3D space, 3D part candidates from the 2D part candidates, converting generated 3D part candidates representing the same part into a single 3D part candidate, and then, generating at least one 3D skeleton from the 3D part candidates.
2. The method of Claim 1, wherein the determining step includes: generating, from each image, one or more part maps for one or more respective parts of the 3D real world object, each part map for a given part comprising part probabilities for respective samples of the image representing probabilities that the respective samples correspond to the given part, and determining sets of 2D part candidate or candidates from respectively the part maps.
3. The method of Claim 1, wherein the step of generating 3D part candidates from the 2D part candidates includes: repeatedly matching two 2D part candidates from two respective sets of 2D part candidates determined for the same part, and generating, in 3D space, 3D part candidates from respective pairs of matched 2D part candidates.
4. The method of Claim 3, wherein matching two 2D part candidates from two respective sets of 2D part candidates determined for the same part includes using a graph to obtain one or more one-to-one associations between a 2D part candidate of a first one of the sets and a 2D part candidate of the second set, wherein nodes of the graph correspond to the 2D part candidates of the two sets and weighted links between nodes are set based on a part distance determined between the corresponding 2D part candidates.
5. The method of Claim 3, wherein matching two 2D part candidates from two 2D candidate sets determined for the same part includes: determining part distances between a first 2D part candidate of a first one of the two sets and respectively 2D part candidates of the second set, and selecting, as matching 2D part candidates for the first 2D part candidate, the closest 2D part candidate or candidates of the second set to the first part candidate, based on the determined part distances.
6. The method of Claim 4 or 5, wherein determining a part distance between two 2D part candidates includes: projecting a first one of the two 2D part candidates as a first epipolar line on an image corresponding to the second 2D part candidate, and calculating a first distance between the second 2D part candidate and the first epipolar line.
7. The method of Claim 3, wherein generating a 3D part candidate from a pair of matched 2D part candidates includes: projecting a first one of the matched 2D part candidates as a first line in 3D space, projecting the second matched 2D part candidate as a second line in the 3D space, and determining a 3D position locating the 3D part candidate, based on the first and second lines.
8. The method of Claim 1, further comprising a step of filtering the generated 3D part candidates into a subset of 3D part candidates.
9. The method of Claim 8, wherein the filtering step includes selecting 3D part candidates generated from pairs of matched 2D part candidates that share the same 2D part candidates, wherein the selecting step includes selecting at least one triplet of 3D part candidates generated from three respective pairs built from exactly the same three 2D part candidates.
10. The method of Claim 9, wherein the filtering step further includes adding, to the subset, generated 3D part candidates that are closer to a yet-selected 3D part candidate than a predefined distance.
11. The method of Claim 1, wherein converting 3D part candidates includes applying a Random sample consensus algorithm on the generated 3D part candidates.
12. The method of Claim 1, wherein converting 3D part candidates includes: a) selecting two or more of the generated 3D part candidates that are close enough between them, and b) generating a centroid 3D part candidate from the selected ones, as single 3D part candidate.
13. The method of Claim 12, wherein converting includes repeating a) and b).
14. The method of Claim 1, wherein the 3D skeleton generating step includes using a graph to obtain one or more one-to-one associations between 3D part candidates representing two different parts, wherein nodes of the graph correspond to the 3D part candidates representing the two different parts considered and weighted links between nodes corresponding to two 3D part candidates for the two different parts are set.
15. The method of Claim 14, wherein setting a link between two nodes in the graph corresponding to two 3D part candidates depends on a distance between the two 3D part candidates and morphological constraints defined by the 3D model between the two different parts considered.
16. The method of Claim 14, wherein the weight for a link is calculated based on pairwise probability or probabilities between pairs of 2D part candidates, pairs from which two 3D part candidates forming the link are generated, and each pairwise probability is obtained for two 2D part candidates belonging to the same image.
17. The method of Claim 14, wherein the weight for a link is based on a distance between the two 3D part candidates forming the link.
18. The method of Claim 14, wherein a node in the graph is weighted based on a number of 3D part candidates used to generate the 3D part candidate corresponding to the node.
19. The method of Claim 14, wherein the 3D skeleton generating step further includes: repeating using a graph for successively each of a plurality of pairs of adjacent parts according to a 3D model of the 3D real world object, in order to obtain one or more one-to-one associations between 3D part candidates for each pair of adjacent parts, and connecting pairs of associated 3D part candidates that share the same 3D part candidate to obtain one or more 3D skeleton candidates.
20. A method for displaying a 3D skeleton of one or more 3D real world objects observed by cameras, comprising the following steps performed by a computer system: generating a 3D skeleton of a 3D real world object using the generating method of Claim 1, selecting a viewpoint in 3D space, and displaying, on a display screen, the generated 3D skeleton from the viewpoint.
21. A non-transitory computer-readable medium storing a program which, when executed by a microprocessor or computer system in a device, causes the device to perform the method of Claim 1 or 20.
22. A system for generating a 3D skeleton of one or more 3D real world objects observed by cameras, comprising at least one microprocessor configured for carrying out the steps of: obtaining, from memory of the computer system, a plurality of simultaneous images of the 3D real world objects recorded by the cameras, determining, from each image, one or more sets of 2D part candidate or candidates for one or more respective parts of the 3D real world object, each 2D part candidate corresponding to a sample of the corresponding image, generating, in 3D space, 3D part candidates from the 2D part candidates, converting generated 3D part candidates representing the same part into a single 3D part candidate, and then, generating at least one 3D skeleton from the 3D part candidates.
23. A system for displaying a 3D skeleton of one or more 3D real world objects observed by cameras, comprising the generating system of Claim 22 connected to a display screen, wherein the microprocessor is further configured for carrying out the steps of: selecting a viewpoint in 3D space, and displaying, on the display screen, the generated 3D skeleton from the viewpoint.
GB1806951.8A 2018-02-23 2018-04-27 3D skeleton reconstruction with 2D processing reducing 3D processing Active GB2573172B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
GB1806951.8A GB2573172B (en) 2018-04-27 2018-04-27 3D skeleton reconstruction with 2D processing reducing 3D processing
US16/280,854 US11127189B2 (en) 2018-02-23 2019-02-20 3D skeleton reconstruction from images using volumic probability data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1806951.8A GB2573172B (en) 2018-04-27 2018-04-27 3D skeleton reconstruction with 2D processing reducing 3D processing

Publications (3)

Publication Number Publication Date
GB201806951D0 GB201806951D0 (en) 2018-06-13
GB2573172A true GB2573172A (en) 2019-10-30
GB2573172B GB2573172B (en) 2021-09-08

Family

ID=62494963

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1806951.8A Active GB2573172B (en) 2018-02-23 2018-04-27 3D skeleton reconstruction with 2D processing reducing 3D processing

Country Status (1)

Country Link
GB (1) GB2573172B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110363860B (en) * 2019-07-02 2023-08-25 北京字节跳动网络技术有限公司 3D model reconstruction method and device and electronic equipment

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104102343A (en) * 2013-04-12 2014-10-15 何安莉 Interactive Input System And Method
US20180211438A1 (en) * 2017-01-23 2018-07-26 Intel Corporation Generating three dimensional models using single two dimensional images

Also Published As

Publication number Publication date
GB2573172B (en) 2021-09-08
GB201806951D0 (en) 2018-06-13

Similar Documents

Publication Publication Date Title
US11127189B2 (en) 3D skeleton reconstruction from images using volumic probability data
GB2573170A (en) 3D Skeleton reconstruction from images using matching 2D skeletons
CN109791697B (en) Predicting depth from image data using statistical models
Stier et al. Vortx: Volumetric 3d reconstruction with transformers for voxelwise view selection and fusion
KR20180026400A (en) Three-dimensional space modeling
WO2023015409A1 (en) Object pose detection method and apparatus, computer device, and storage medium
US20200057778A1 (en) Depth image pose search with a bootstrapped-created database
US11961266B2 (en) Multiview neural human prediction using implicit differentiable renderer for facial expression, body pose shape and clothes performance capture
Wang et al. Depth estimation of video sequences with perceptual losses
CN109791704A (en) The texture rendering based on multilayer UV mapping for free-running operation FVV application
GB2571307A (en) 3D skeleton reconstruction from images using volumic probability data
US20040240741A1 (en) Method and system for creating interactive walkthroughs of real-world environment from set of densely captured images
Su et al. Virtualpose: Learning generalizable 3d human pose models from virtual data
Zhang et al. Fragmentation guided human shape reconstruction
JP2024510230A (en) Multi-view neural human prediction using implicitly differentiable renderer for facial expression, body pose shape and clothing performance capture
Zhu et al. Mvp-human dataset for 3d human avatar reconstruction from unconstrained frames
Han et al. RO-MAP: Real-Time Multi-Object Mapping with Neural Radiance Fields
Yang et al. Monocular camera based real-time dense mapping using generative adversarial network
Correia et al. 3D reconstruction of human bodies from single-view and multi-view images: A systematic review
Hua et al. Hi-map: Hierarchical factorized radiance field for high-fidelity monocular dense mapping
GB2573172A (en) 3D skeleton reconstruction with 2D processing reducing 3D processing
US11461956B2 (en) 3D representation reconstruction from images using volumic probability data
Hong et al. Real-Time 3D Visual Perception by Cross-Dimensional Refined Learning
Zhu et al. MVP-Human Dataset for 3-D Clothed Human Avatar Reconstruction From Multiple Frames
Elharrouss et al. 3D Point Cloud for Objects and Scenes Classification, Recognition, Segmentation, and Reconstruction: A Review