CN115917597A - Lifting 2D representations to 3D using attention models - Google Patents

Info

Publication number
CN115917597A
CN115917597A
Authority
CN
China
Prior art keywords
data set
encoder
processing device
joint
data
Prior art date
Legal status
Pending
Application number
CN202080102235.5A
Other languages
Chinese (zh)
Inventor
阿德里安·洛帕特
萨拉·卡劳特
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN115917597A

Classifications

    • G06T7/55 Depth or shape recovery from multiple images (G06 Computing; G06T Image data processing or generation, in general; G06T7/00 Image analysis; G06T7/50 Depth or shape recovery)
    • G06T2207/10016 Video; Image sequence (G06T2207/00 Indexing scheme for image analysis or image enhancement; G06T2207/10 Image acquisition modality)
    • G06T2207/20081 Training; Learning (G06T2207/20 Special algorithmic details)
    • G06T2207/20084 Artificial neural networks [ANN] (G06T2207/20 Special algorithmic details)
    • G06T2207/30196 Human being; Person (G06T2207/30 Subject of image; Context of image processing)

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A processing device (200) is described for forming a system for estimating a three-dimensional state of one or more articulated objects represented by a plurality of data sets (101), each data set being indicative of a projection of a joint of an object onto a two-dimensional space. The processing device comprises one or more processors (201) configured to: receive (301) a data set; process (302) the data set using an encoder architecture (104) having a plurality of encoder layers, each encoder layer comprising a respective self-attention mechanism (107) and a respective feed-forward network (109), each self-attention mechanism implementing a plurality of attention heads; and train (303) the encoder layers from the data set to improve the accuracy of the encoder architecture in estimating the three-dimensional state (106) of the one or more objects represented by the data set. This may allow 2D body keypoints to be lifted to 3D relative to the root (e.g. the pelvic joint) by extrapolating the input data to estimate depth, at real-time runtime, using a small, efficient model.

Description

Lifting 2D representations to 3D using attention models
Technical Field
The invention relates to estimating a three-dimensional position of an object from a two-dimensional dataset at a processing device.
Background
In recent years, 3D position estimation of human joints has been the subject of extensive research. Of particular interest is the definition of methods for extrapolating two-dimensional data (in the form of x, y keypoints) to 3D in order to predict the root-relative coordinates of the joints of the human skeleton. There are 17 key points describing the human skeleton, including the head, shoulders, elbows, wrists, pelvis, knees and ankles.
Initially, work was primarily based on hand-designed models with a large number of constraints to address the strong interdependence between different human joints.
With the development of Convolutional Neural Networks (CNNs), pose estimators have been developed that reconstruct 3D poses end-to-end, directly from RGB images or from intermediate 2D predictions. This approach rapidly outpaced the accuracy of the earlier hand-crafted estimators.
Current state-of-the-art approaches generally fall into two categories. Some methods predict 3D keypoints end-to-end directly from monocular images; this usually gives good results but requires very large models. Other methods perform lifting, where a 2D predictor is used to predict body pose from an image, and the 2D keypoints (relative to the image) are then lifted to 3D. This two-step approach typically uses temporal convolution layers to aggregate information from poses derived from video.
There is a need for a method of lifting 2D projections of an object's joints to 3D that overcomes the limitations of existing methods by extrapolating the input data to estimate depth using a small, efficient model that runs in real time.
Disclosure of Invention
According to one aspect, there is provided a processing device for forming a system for estimating a three-dimensional state of one or more articulated objects represented by a plurality of data sets, each data set being indicative of a projection of a joint of an object onto a two-dimensional space, the processing device comprising one or more processors configured to: receive a data set; process the data set using an encoder architecture having a plurality of encoder layers, each encoder layer including a respective self-attention mechanism and a respective feed-forward network, each self-attention mechanism implementing a plurality of attention heads; and train the encoder layers according to the data set to improve the accuracy of the encoder architecture in estimating the three-dimensional states of the one or more objects represented by the data set.
This may allow the processing device to form a system for lifting a 2D data set comprising, for example, 2D human key points (x, y coordinates) to 3D (x, y, z coordinates) relative to the root (e.g., the pelvic joint), by extrapolating the input data to estimate depth at real-time runtime using a small, efficient model.
The or each object may be a person and each data set may be indicative of a body posture. Each data set may comprise a plurality of key points (each having 2D coordinates) of the human body. In general, there are 17 key points describing the human skeleton, including the head, shoulders, elbows, wrists, pelvis, knees and ankles. Each data set indicative of a human body pose may define 2D coordinates for each of a plurality of keypoints. This may allow the 3D positions of human joints to be estimated, which may be useful in video games or virtual reality applications.
The encoder architecture may implement a Transformer architecture. In the task of lifting 2D keypoints to 3D, a Transformer encoder model can yield good accuracy.
The operation of each self-attention mechanism may be defined by a set of parameters, and the apparatus may be configured to share one or more such parameters between multiple self-attention mechanisms. Weight sharing can make the model more efficient while maintaining (or in some cases improving) the accuracy of predicting the 3D state of the object.
The device may be configured to adjust one or more of said parameters in a self-attention mechanism in response to successive data sets and, after adjusting a parameter, to propagate the parameter to one or more other self-attention mechanisms. Thus, the parameters of one or some attention layers may be shared with other attention layers. This may further improve the efficiency of the model.
The operation of each encoder layer may be defined by a set of parameters, the device may be configured to adjust one or more of these parameters in an encoder layer in response to successive data sets, and the device may be configured not to propagate such parameters to any other encoder layer. Thus, in some embodiments, the parameters of an encoder layer may not be shared with other encoder layers. This may further improve the model.
The one or more processors may be configured to perform a one-dimensional convolution on the data set to form convolved data and process the convolved data using the encoder architecture. This may allow the dimensions of the data set to be adjusted to match the dimensions of the model inputs and outputs.
The one or more processors may be configured to perform a one-dimensional convolution on the series of consecutive data sets. This may allow estimating the 3D state of the object for the 2D input sequence. For example, a 3D motion sequence of a human body may be estimated.
The series of data sets may be a series of an odd number of data sets. This allows an equal number of data sets on either side of the central data set, whose three-dimensional state is to be estimated, to be considered during training.
The device may be configured to estimate, from the series of data sets, the three-dimensional state of an intermediate one of the series. The middle data set of the series corresponds to the center of the receptive field (the total number of poses used to predict one 3D pose). During training, half of the receptive field corresponds to past poses and the other half to future poses. The intermediate pose within the receptive field may be the pose currently being lifted from 2D to 3D.
The one or more processors may be configured to train the encoder architecture to lift the data set to three dimensions. Training the encoder may allow the model to predict the 3D state of the object more accurately.
Each data set may represent the position of a human joint relative to a predetermined joint or structure of the human body. The structure may be a pelvis. The position of the pelvis may be taken as the root. The distance of the camera to the pelvis can then be determined using a separate model. This may allow the position of each joint relative to the camera to be determined.
The processing device may be configured to: receive a plurality of images, each image representing an articulated object; and, for each image, detect the positions of the joints of the object in that image, thereby forming one of the data sets. An accurate 2D pose of a human body in multiple images may be obtained using a 2D pose estimator. This allows 2D poses to be predicted from multiple images, and these 2D poses may then be used as the data sets input to the device for lifting from 2D to 3D.
Once training is complete, the system can be used in an inference phase to estimate the three-dimensional state of one or more articulated objects represented by a plurality of such data sets.
According to another aspect, there is provided a system for estimating the three-dimensional state of one or more articulated objects, the system being formed by the processing device described above.
According to yet another aspect, there is provided a method for estimating a three-dimensional state of one or more articulated objects represented by a plurality of data sets, each data set being indicative of a projection of a joint of an object onto a two-dimensional space, the method comprising: receiving a data set; processing the data set using an encoder architecture having a plurality of encoder layers, each encoder layer including a respective self-attention mechanism and a respective feed-forward network, each self-attention mechanism implementing a plurality of attention heads; and training the encoder layers according to the data set to improve the accuracy of the encoder architecture in estimating the three-dimensional state of the one or more objects represented by the data set.
The method may further comprise estimating a three-dimensional state of one or more articulated objects represented by a plurality of such data sets.
The use of this method may allow a 2D data set comprising, for example, 2D human key points (x, y coordinates) to be lifted to 3D (x, y, z coordinates) relative to the root (e.g., the pelvic joint), by extrapolating the input data to estimate depth at real-time runtime using a small, efficient model.
Drawings
The invention will now be described by way of example with reference to the accompanying drawings.
In the drawings:
FIG. 1 shows an overview of a model architecture that takes as input a sequence of data sets comprising 2D keypoints of human body poses and generates a 3D pose estimate using self-attention over long-range temporal information.
Fig. 2 shows a processing device for forming a system for estimating the three-dimensional state of one or more articulated objects represented by a plurality of data sets.
Fig. 3 shows an exemplary flow chart detailing the method steps performed by the processing device.
Figure 4 shows qualitative results for several actions of human poses using the Human3.6M dataset: (a) original RGB images used for 2D keypoint prediction with CPN; (b) 3D reconstruction using the methods described herein (N = 243, where N is the receptive field of poses in the input sequence); (c) ground truth 3D keypoints.
Detailed Description
The methods described herein are exemplified by processing data sets, each data set being indicative of a human body posture. However, it should be understood that the method may also be applied to other data sets and objects where it is desirable to convert data from 2D to 3D.
Typically, there are 17 key points that delineate the human skeleton, including the head, shoulders, elbows, wrists, pelvis, knees and ankles. In the examples described herein, each data set indicative of a human body pose defines 2D coordinates for each of these key points. Preferably, each such data set represents the positions of the joints of the human body relative to a predetermined joint or structure of the human body. In a preferred embodiment, the structure is the human pelvis.
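As a minimal illustrative sketch of this root-relative convention (the root index below is an assumption; keypoint orderings differ between datasets), a pose may be centered on the root as follows:

```python
# Sketch of root-centering: express every joint relative to the root joint
# (here the pelvis), so the root becomes the origin.
import numpy as np

PELVIS = 0  # hypothetical index of the pelvis/root joint

def root_center(pose: np.ndarray) -> np.ndarray:
    """pose: (17, C) joint coordinates, C = 2 for (x, y) or 3 for (x, y, z)."""
    return pose - pose[PELVIS]

pose3d = np.random.randn(17, 3).astype(np.float32)
centered = root_center(pose3d)
assert np.allclose(centered[PELVIS], 0.0)
```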
In examples described herein, the various data sets (e.g., various body gestures) input to the model may be derived from a larger motion capture data set that includes images depicting different body gestures. Such larger datasets may include, for example, the Human3.6M motion capture dataset (see C.Ionescu, D.Papapapava, V.Olaru, and C.Smilacis cu, "Human3.6M: 3d human perception in the natural environment and Large Scale dataset and prediction methods (Large scale datasets and predictive methods for 3d human perception in natural environments)," IEEE model Analysis and Machine Intelligence (PAMI) collection, 7 (2013), pages 1325-1339) or HumanEva (see L.Sigal, A.O.Banan, and M.J.ack, "Humanaa: synchronized video and motion capture dataset and Journal algorithm for assessing joint-like human motion and motion capture (synthesized video and track library) and Journal algorithm (Journal collection, 2, vision and motion capture set, 2. For computing Machine for motion capture dataset (see, journal of motion Analysis and prediction set, 2, and Computer for computing of motion parameters, and computing Machine (see, journal of motion Analysis and prediction). Human3.6m includes 360 ten thousand frames of 11 different subjects, but only 7 of them are annotated. The subject performed up to 15 different types of movements, which were recorded from four different angles. In contrast, humanEva is a small set of motion capture data, with only three subjects and recorded from three perspectives.
The methods described herein may follow a two-step pose estimation approach. The data sets indicative of body poses input to the model may first be obtained by using a 2D pose estimator to extract accurate 2D body poses from images. This may be done in a top-down manner. These joints can then be lifted by predicting their depth relative to the root (e.g., the pelvis).
Thus, the processing device may be configured to receive a plurality of images, each image representing one articulated object. For each image, the processing device may then detect the positions of the joints of the human body (or other object) in the image, forming one of the data sets indicative of a single pose.
For example, a 2D pose may be obtained by using a ground truth human bounding box and then applying a 2D pose estimator. Some common 2D estimation models that can be used to acquire a 2D pose sequence are: Stacked Hourglass (SH), as described by A. Newell, K. Yang and J. Deng in "Stacked hourglass networks for human pose estimation," Proceedings of the European Conference on Computer Vision (ECCV) (2016), pages 483-499; Mask R-CNN, as described by K. He, G. Gkioxari, P. Dollár and R. Girshick in "Mask R-CNN," Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017), pages 2961-2969; or Cascaded Pyramid Networks (CPN), as described by Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun in "Cascaded pyramid network for multi-person pose estimation," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018), pages 7103-7112.
Alternatively, a 2D detector that does not rely on ground truth bounding boxes may be used. For example, SH and CPN may be used as detectors for the Human3.6M motion capture dataset, and Mask R-CNN as a detector for the HumanEva motion capture dataset. However, ground truth 2D poses can also be used for training. In one particular example, SH may be pre-trained on the MPII dataset (L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler and B. Schiele, "DeepCut: Joint subset partition and labeling for multi person pose estimation," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)) and fine-tuned following the method described by D. Pavllo, C. Feichtenhofer, D. Grangier and M. Auli in "3D human pose estimation in video with temporal convolutions and semi-supervised training," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019), pages 7753-7762. Both Mask R-CNN and CPN can be pre-trained on COCO (T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár and C. L. Zitnick, "Microsoft COCO: Common objects in context," Proceedings of the European Conference on Computer Vision (ECCV) (2014), pages 740-755) and then fine-tuned on the 2D poses of Human3.6M, because the key points in each motion capture dataset are defined differently. More specifically, Mask R-CNN uses a ResNet-101 backbone with a Feature Pyramid Network (FPN). Since CPN requires a bounding box, Mask R-CNN can first be used to detect the human body; CPN then determines keypoints from the image using a ResNet-50 backbone with an input resolution of 384 × 384.
In the examples described herein, an open-source Transformer model is used to lift the keypoint data sets. In this example, a 2D keypoint sequence generates one 3D pose prediction (corresponding to the pose at the center of the input sequence/receptive field) after passing through several Transformer encoder blocks. Further 3D estimates may be calculated in a sliding-window manner.
FIG. 1 shows an overview of a model architecture that takes as input a sequence of data sets comprising 2D keypoints of human body poses and generates a 3D pose estimate using self-attention over long-range temporal information.
As shown at 101, the device takes as input a sequence of data sets comprising 2D keypoints (the input sequence is collectively referred to as the receptive field). The method can be used with different numbers of data sets; the series is preferably composed of an odd number of data sets. For example, receptive fields comprising 27, 81 and 243 keypoint data sets may be used. Preferably, the middle data set of the sequence is selected for lifting, as it corresponds to the center of the receptive field.
As described above, the data sets corresponding to the input sequence may come from a 2D predictor that estimates 2D poses from image frames, or directly from 2D ground truth. In either case, these input data sets lie in image space.
As shown in FIG. 1, in some embodiments, certain modifications may be performed to match the dimensions of the inputs and outputs of the model. In this example, the dimensions of the input data set sequence are changed by convolutional layer 102: the Transformer's input is re-projected from the input dimension [B, N, 34] to [B, N, 512], where B is the batch size, N is the receptive field (i.e., the number of body poses input to the model in each processing step), and 34 corresponds to 17 joints multiplied by 2 (the number of coordinates, i.e., x and y).
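A minimal sketch of this input re-projection, assuming a PyTorch-style implementation (the framework is not specified here, and the tensor layout is illustrative), may look as follows:

```python
# Sketch of the input re-projection at layer 102: a kernel-size-1 1D
# convolution maps each pose of 34 values (17 joints x 2 coordinates)
# to the 512-dimensional model width.
import torch
import torch.nn as nn

B, N = 4, 243               # example batch size and receptive field
x = torch.randn(B, N, 34)   # sequence of 2D poses in image space

embed = nn.Conv1d(in_channels=34, out_channels=512, kernel_size=1)

# Conv1d expects (batch, channels, length), so transpose around the call.
h = embed(x.transpose(1, 2)).transpose(1, 2)
print(h.shape)              # torch.Size([4, 243, 512])
```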
Then, a temporal encoding is added to embed the order of the input sequence of data sets (poses). As shown at 103, the temporal encoding is incorporated by adding an embedding vector to the input. This allows the model to exploit the order of the pose sequence. This may also be referred to as positional encoding, used to inject information about the relative or absolute position of tokens in the sequence. These temporal embeddings can be created using sine and cosine functions and then added to the re-projected input. In this embodiment, the injected temporal embeddings have the same dimensions as the input.
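A sketch of such a sinusoidal temporal encoding, following the standard Transformer formulation (the base constant 10000 is an assumption carried over from Vaswani et al., cited below), may be:

```python
# Sketch of a sinusoidal temporal (positional) encoding with the same
# dimension as the re-projected input, added element-wise.
import torch

def temporal_encoding(n: int, d_model: int = 512) -> torch.Tensor:
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)   # (n, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)      # (d_model/2,)
    angles = pos / torch.pow(10000.0, i / d_model)            # (n, d_model/2)
    enc = torch.zeros(n, d_model)
    enc[:, 0::2] = torch.sin(angles)   # even channels: sine
    enc[:, 1::2] = torch.cos(angles)   # odd channels: cosine
    return enc

h = torch.randn(4, 243, 512)          # re-projected input
h = h + temporal_encoding(243)        # broadcast over the batch dimension
```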
The time-embedded data sets are then input into a Transformer encoder model, which processes them. The Transformer is shown at 104 in figure 1.
A self-attention model of the kind used in natural language processing (NLP) is used to embed poses over time, rather than a fully convolutional approach. The encoder architecture has a plurality of encoder layers, each encoder layer including a respective self-attention mechanism 107 and a respective feed-forward network 109. Each self-attention mechanism implements multiple attention heads. The outputs from blocks 107 and 109 may be added and normalized, as shown at 108 and 110.
The basic Transformer encoder described by A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin in "Attention is all you need," Advances in Neural Information Processing Systems (NeurIPS) (2017), pages 5998-6008, can be taken as a baseline. In this example, there are 512-dimensional hidden layers, 8 attention heads, and 6 encoder blocks.
Since in this embodiment the decoder part of the Transformer is not used, and because of the residual connections within self-attention, the output dimensions are the same as the input dimensions: in this example, [B, N, 512].
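Under the stated hyperparameters (512-dimensional hidden layers, 8 attention heads, 6 encoder blocks), the encoder stack may be sketched as follows. This is an approximation rather than the exact implementation; in particular, the feed-forward width of 2048 is an assumption.

```python
# Sketch of the encoder stack. nn.TransformerEncoderLayer bundles the
# self-attention (107), feed-forward network (109) and add-and-norm
# steps (108, 110) of each encoder layer.
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)

h = torch.randn(4, 243, 512)   # time-embedded input
out = encoder(h)               # residual connections preserve the shape
print(out.shape)               # torch.Size([4, 243, 512])
```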
The 1D convolutional layer 105 changes the dimensions so that the output of the model (shown at 106) comprises the x, y and z coordinates of all joints, relative to the pelvis, for the pose currently being lifted.
The output embedding is re-projected to the desired dimension using 1D convolutional layer 105, from [B, 1, 512] to [B, 1, 51], where 51 corresponds to 17 joints multiplied by 3 (the x, y and z coordinates). The loss is then calculated, for example as the mean per-joint position error (MPJPE) against the dataset's 3D ground truth. The error is then back-propagated and the next training iteration begins.
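A sketch of the output head and MPJPE loss under these dimensions (the selection of the center token and the loss reduction are illustrative assumptions) may be:

```python
# Sketch of the output projection (105) and MPJPE loss: the center token
# is re-projected to 51 values (17 joints x 3 coordinates) and compared
# with the root-relative 3D ground truth.
import torch
import torch.nn as nn

out = torch.randn(4, 243, 512)             # encoder output
head = nn.Conv1d(512, 51, kernel_size=1)

mid = out.shape[1] // 2
center = out[:, mid : mid + 1, :]                       # (B, 1, 512)
pred = head(center.transpose(1, 2)).transpose(1, 2)     # (B, 1, 51)
pred = pred.view(-1, 17, 3)                             # (B, 17, 3)

gt = torch.randn(4, 17, 3)                 # 3D ground truth (root-relative)
mpjpe = torch.norm(pred - gt, dim=-1).mean()   # mean Euclidean joint error
mpjpe.backward()                           # propagate the error back
```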
Preferably, the model's output token (the pose currently being lifted from 2D to 3D within the input pose sequence) corresponds to the middle data set of the receptive field (the middle pose). This is because, during training, half of the receptive field corresponds to past poses and the other half to future poses. Thus, during training, the model uses temporal data from both past and future frames, enabling it to create temporally consistent predictions.
During inference, the model architecture remains the same as during training, but the receptive field is populated only with past pose frames, rather than with frames on either side of the current pose being lifted as during training. The model may work in a sliding-window fashion, so that a 3D reconstruction is eventually obtained for every 2D representation in the input sequence.
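A sliding-window inference loop may be sketched as follows; here `model` is assumed to wrap the embedding, encoder and output head described above, and the left-padding of short histories is an illustrative assumption rather than part of the described method:

```python
# Sketch of sliding-window inference: each frame is lifted from the n most
# recent 2D poses (past frames only at inference time).
import torch

def lift_sequence(model, poses2d: torch.Tensor, n: int = 243) -> torch.Tensor:
    """poses2d: (T, 34) stream of 2D poses; returns (T, 17, 3) estimates."""
    outputs = []
    for t in range(poses2d.shape[0]):
        window = poses2d[max(0, t - n + 1) : t + 1]
        if window.shape[0] < n:
            # repeat the first pose to fill a short history (assumption)
            pad = window[0:1].expand(n - window.shape[0], -1)
            window = torch.cat([pad, window], dim=0)
        with torch.no_grad():
            pose3d = model(window.unsqueeze(0))   # (1, 1, 51) output
        outputs.append(pose3d.view(17, 3))
    return torch.stack(outputs)
```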
One benefit of this architecture is that the size of the receptive field and the number of attention heads can be modified without affecting the size of the model. In addition, the hyperparameters of the Transformer can be modified.
In some embodiments, weight sharing may be used to maintain or improve final accuracy while significantly reducing the total number of parameters, thereby building a more efficient model.
In particular, parameters may be shared across the encoder blocks; for example, the attention layer parameters may be shared.
In one embodiment of sharing attention layer parameters, the operation of each self-attention mechanism is defined by a set of parameters, and one or more of these parameters are shared among multiple self-attention mechanisms. One or more of the parameters in a self-attention mechanism may be adjusted in response to successive data sets and, after a parameter is adjusted, it may be propagated to one or more other self-attention mechanisms.
Alternatively or additionally, the operation of each encoder layer may be defined by a set of parameters. The apparatus may be configured to adjust one or more of these parameters in an encoder layer in response to successive data sets. In one embodiment, the device may be configured not to propagate such parameters to any other encoder layer.
In particular, sharing only attention layer parameters (and not encoder block parameters) may improve the final accuracy while significantly reducing the total number of parameters.
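One plausible realization of this scheme (an assumption; the exact sharing mechanism is not pinned down here) shares a single attention module across all encoder blocks while keeping the feed-forward networks and normalization layers block-specific:

```python
# Sketch of attention-only weight sharing across encoder blocks.
import torch
import torch.nn as nn

class SharedAttentionEncoder(nn.Module):
    def __init__(self, d_model=512, nhead=8, d_ff=2048, num_layers=6):
        super().__init__()
        # one attention module, reused by every block (shared parameters)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        # feed-forward networks and norms remain block-specific
        self.ffns = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_layers)])
        self.norms1 = nn.ModuleList([nn.LayerNorm(d_model)
                                     for _ in range(num_layers)])
        self.norms2 = nn.ModuleList([nn.LayerNorm(d_model)
                                     for _ in range(num_layers)])

    def forward(self, x):
        for ffn, n1, n2 in zip(self.ffns, self.norms1, self.norms2):
            a, _ = self.attn(x, x, x)   # same attention weights in every block
            x = n1(x + a)               # add & norm
            x = n2(x + ffn(x))          # add & norm
        return x
```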
In some embodiments, additional data augmentation may be applied to the data sets during training and testing. For example, each pose may be flipped horizontally.
During training of the attention layers of the Transformer model, an optimizer such as the Adam optimizer may be used (see S. J. Reddi, S. Kale, and S. Kumar, "On the convergence of Adam and beyond," Proceedings of the International Conference on Learning Representations (ICLR) (2018)). For example, a training run on the Human3.6M motion capture dataset may last 80 epochs, and on the HumanEva motion capture dataset 1000 epochs.
The learning rate may be increased linearly over the first training steps (e.g., 1000 iterations, with a learning-rate factor of 12) and then decreased in proportion to the inverse square root of the step number. This schedule is commonly referred to as NoamOpt.
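A sketch of this schedule, using the warm-up length and factor quoted above, may be:

```python
# Sketch of the NoamOpt learning-rate schedule: linear warm-up, then decay
# proportional to the inverse square root of the step number.
def noam_lr(step: int, d_model: int = 512,
            factor: float = 12.0, warmup: int = 1000) -> float:
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5,
                                          step * warmup ** -1.5)
```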
The batch size may be proportional to the receptive field value. For example, for N = 27, 81 and 243, B = 5120, 3072 and 1536, respectively.
In terms of hardware, the system can be trained and evaluated using 8 NVIDIA V100 GPUs with parallel optimization. Taking the batch sizes into account, the training times for these receptive fields are typically about 8, 14 and 40 hours, respectively.
Fig. 2 is a schematic representation of a system 200 configured to perform the methods described herein. The system 200 may be implemented on a device such as a laptop, tablet, smartphone, or Television (TV).
The system 200 includes a processor 201 configured to process data sets in the manner described herein. For example, the processor 201 may be implemented as a computer program running on a programmable device such as a GPU or a Central Processing Unit (CPU). The system 200 includes a memory 202 arranged in communication with the processor 201. The memory 202 may be a non-volatile memory. The processor 201 may also include a cache (not shown in Fig. 2) that may be used to temporarily store data from the memory 202. The system may include more than one processor and more than one memory. The memory may store code executable by the processor. The processor may be configured to operate in accordance with a computer program stored in non-transitory form on a machine-readable storage medium. The computer program may comprise instructions for causing the processor to perform the methods described herein.
Fig. 3 shows a flow chart summarizing an example of a method for estimating a three-dimensional state of one or more articulated objects represented by a plurality of data sets (e.g., a plurality of body poses), each data set indicating a projection of a joint of an object (e.g., a human body) onto a two-dimensional space.
At step 301, the method includes receiving a data set. At step 302, the method includes processing a data set using an encoder architecture having a plurality of encoder layers, each encoder layer including a respective self-attention mechanism and a respective feed-forward network, each self-attention mechanism implementing a plurality of attention heads. At step 303, the method includes training the encoder layers according to the data set to improve the accuracy of the encoder architecture in estimating the three-dimensional state of the one or more objects represented by the data set.
Described herein is the use of a self-attention Transformer model to estimate the depth of 2D keypoints. The self-attention architecture of the encoder allows the model to produce temporally consistent poses by exploiting long-range temporal information across frames/poses.
In some embodiments, it has been found that the methods described herein can provide better results and allow for smaller model sizes than previous methods.
For input 2D predictions (Mask R-CNN and CPN), the method described herein was found to be superior to previous lifting methods and to perform comparably to methods using keypoints and features extracted from the original RGB image. For ground truth input, the performance of this model was found to be superior to previous models, achieving results comparable to those of methods that simultaneously predict body shape and pose with the Skinned Multi-Person Linear model (SMPL), or of multi-view methods. The number of parameters in the model is easily adjustable and smaller (e.g., 9.5 million) than in current methods (which may have about 11-17 million parameters), while still achieving better performance. Thus, the method can achieve better results with smaller model sizes than the prior art.
Figure 4 shows qualitative results for several actions of human poses using the Human3.6M dataset. Column (a) shows the original RGB images used for 2D keypoint prediction with CPN. Column (b) shows the 3D reconstructions produced using the methods described herein (with a receptive field of N = 243). Column (c) shows the ground truth 3D keypoints. It can be seen that the obtained 3D reconstructions match the ground truth 3D keypoints well.
Thus, the methods described herein allow 2D human key points (x, y coordinates) to be lifted to 3D (x, y, z coordinates) relative to the root (e.g., the pelvic joint) by extrapolating the input data to estimate depth at real-time runtime using a small, efficient model.
The applicant hereby discloses in isolation each individual feature described herein and any combination of two or more such features, to the extent that such features or combinations are capable of being carried out based on the present specification as a whole in the light of the common general knowledge of a person skilled in the art, irrespective of whether such features or combinations of features solve any problems disclosed herein, and without limitation to the scope of the claims. The applicant indicates that aspects of the present invention may consist of any such individual feature or combination of features. In view of the foregoing description it will be evident to a person skilled in the art that various modifications may be made within the scope of the invention.

Claims (16)

1. A processing device (200) for forming a system for estimating a three-dimensional state of one or more articulated objects represented by a plurality of data sets (101), each data set being indicative of a projection of a joint of the object onto a two-dimensional space, the processing device comprising one or more processors (201) configured to:
receive (301) the data set;
process (302) the data set using an encoder architecture (104) having a plurality of encoder layers, each encoder layer comprising a respective self-attention mechanism (107) and a respective feed-forward network (109), each self-attention mechanism implementing a plurality of attention heads; and
train (303) the encoder layers from the data set to improve the accuracy of the encoder architecture in estimating the three-dimensional states (106) of the one or more objects represented by the data set.
2. The processing device (200) according to claim 1, wherein each object is a person and each data set is indicative of a body posture.
3. The processing device (200) according to claim 1 or 2, wherein the encoder architecture (104) implements a Transformer architecture.
4. The processing device (200) according to any one of the preceding claims, wherein the operation of each self-attention mechanism (107) is defined by a set of parameters, and the device is configured to share one or more of the parameters between a plurality of self-attention mechanisms.
5. The processing device (200) according to claim 4, wherein the device is configured to adjust one or more of the parameters in a self-attention mechanism (107) in response to successive data sets and, after adjusting a parameter, to propagate the parameter to one or more other self-attention mechanisms.
6. The processing device (200) according to claim 5, wherein the operation of each encoder layer is defined by a set of parameters, the device being configured to adjust one or more of said parameters in an encoder layer in response to successive data sets, and the device being configured not to propagate said parameters to any other encoder layer.
7. The processing device (200) according to any one of the preceding claims, wherein the one or more processors (201) are configured to:
a one-dimensional convolution (102) is performed on the data set to form convolved data and the convolved data is processed using the encoder architecture (104).
8. The processing device (200) according to claim 7, wherein the one or more processors (201) are configured to perform the one-dimensional convolution (102) on a series of consecutive data sets (101).
9. The processing device (200) according to claim 8, wherein the series of data sets (101) is composed of an odd number of data sets.
10. The processing device (200) according to claim 9, wherein the device is configured to estimate a three-dimensional state of an intermediate one of the series of data sets from the series of data sets (101).
11. The processing device (200) according to any one of the preceding claims, wherein the one or more processors (201) are configured to train the encoder architecture (104) to lift the data set to three dimensions.
12. The processing device (200) according to claim 11, wherein each data set represents a position of a joint of the human body relative to a predetermined joint or structure of the human body.
13. The processing device (200) according to claim 12, wherein the structure is a pelvis.
14. The processing device (200) according to any one of the preceding claims, wherein the processing device is configured to:
receive a plurality of images, each image representing an articulated object; and
for each image, detect the positions of the joints of the object in the image, thereby forming one of the data sets (101).
15. A system for estimating the three-dimensional state of one or more articulated objects, the system being formed by the processing device (200) of any one of the preceding claims.
16. A method (300) for estimating a three-dimensional state of one or more articulated objects represented by a plurality of data sets (101), each data set being indicative of a projection of a joint of the object onto a two-dimensional space, the method comprising:
receiving (301) the data set;
processing (302) the data set using an encoder architecture (104) having a plurality of encoder layers, each encoder layer comprising a respective self-attention mechanism (107) and a respective feed-forward network (109), each self-attention mechanism implementing a plurality of attention heads; and
the encoder layer is trained (303) from the data set to improve the accuracy of the encoder architecture in estimating three-dimensional states (106) of one or more objects represented by the data set.
CN202080102235.5A 2020-08-31 2020-08-31 Lifting 2D representations to 3D using attention models Pending CN115917597A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2020/074227 WO2022042865A1 (en) 2020-08-31 2020-08-31 Lifting 2d representations to 3d using attention models

Publications (1)

Publication Number Publication Date
CN115917597A (en) 2023-04-04

Family

ID=72292544

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080102235.5A Pending CN115917597A (en) 2020-08-31 2020-08-31 Lifting 2D representations to 3D using attention models

Country Status (2)

Country Link
CN (1) CN115917597A (en)
WO (1) WO2022042865A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220108162A1 (en) * 2020-10-01 2022-04-07 Microsoft Technology Licensing, Llc Decimating hidden layers for training transformer models
CN115619950B (en) * 2022-10-13 2024-01-19 中国地质大学(武汉) Three-dimensional geological modeling method based on deep learning

Also Published As

Publication number Publication date
WO2022042865A1 (en) 2022-03-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination