CN114241514A

CN114241514A - Model training method and device for extracting human skeleton features

Info

Publication number: CN114241514A
Application number: CN202111351423.4A
Authority: CN
Inventors: 何嘉斌; 刘廷曦; 翁仁亮
Original assignee: Beijing Aibee Technology Co Ltd
Current assignee: Beijing Aibee Technology Co Ltd
Priority date: 2021-11-15
Filing date: 2021-11-15
Publication date: 2022-03-25
Anticipated expiration: 2041-11-15
Also published as: CN114241514B

Abstract

The application discloses a model training method for extracting human skeleton features, which comprises the following steps: obtaining M skeleton data sets, wherein each skeleton data set in the M skeleton data sets corresponds to one data dimension, each skeleton data set comprises N skeleton data, and the N skeleton data in each skeleton data set correspond one-to-one to N initial skeleton data. Then, a model for extracting human skeleton features is trained according to the M skeleton data sets. When the model is trained, the way the loss of the first initial skeleton data is calculated is improved: the improved calculation takes into account the losses of the first initial skeleton data in each of the M data dimensions, so that the calculated loss of the first initial skeleton data converges more easily and the training efficiency of the model is improved.

Description

Model training method and device for extracting human skeleton features

Technical Field

The present application relates to the field of data processing, and in particular, to a model training method and apparatus for extracting human skeletal features.

Background

Currently, a model for extracting human skeletal features can be trained by self-supervised training. Self-supervised training means that no manually annotated labels of the training samples are used in the process of training the model.

However, training a model for extracting human skeletal features in a self-supervised manner has low training efficiency. Therefore, a solution to this problem is urgently needed.

Disclosure of Invention

The technical problem to be solved by this application is that training a model for extracting human skeleton features by contrastive learning has low training efficiency. To address this, a model training method and device for extracting human skeleton features are provided.

In a first aspect, an embodiment of the present application provides a model training method for extracting human bone features, where the method includes:

acquiring M skeleton data sets, wherein each skeleton data set in the M skeleton data sets corresponds to a data dimension, each skeleton data set comprises N skeleton data, the N skeleton data in the skeleton data sets correspond to N initial skeleton data in a one-to-one manner, M is an integer larger than 1, and N is an integer larger than or equal to 1;

training a model for extracting human skeleton features according to the M skeleton data sets; wherein:

the loss of the model is determined based on the losses of the N initial skeleton data, the N initial skeleton data comprise first initial skeleton data, and the loss of the first initial skeleton data is determined according to the losses of the first initial skeleton data in the M data dimensions respectively.

Optionally, the training a model for extracting human bone features according to the M bone data sets includes:

processing N skeletal data in the M skeletal data sets by adopting a first data enhancement mode to obtain M first training data sets, wherein one skeletal data set corresponds to one first training data set, and one first training data set comprises N groups of training data; processing N skeletal data in the M skeletal data sets by adopting a second data enhancement mode to obtain M second training data sets, wherein one skeletal data set corresponds to one second training data set, and one second training data set comprises N groups of training data;

training a model for extracting human skeletal features based on the M first training data sets and the M second training data sets; wherein:

the M data dimensions comprise a first dimension, a first bone feature set corresponding to the first dimension is obtained through the model by a first training data set corresponding to the first dimension, a second bone feature set corresponding to the first dimension is obtained through the model by a second training data set corresponding to the first dimension, the first bone feature set and the second bone feature set respectively comprise N features, and the N features are in one-to-one correspondence with the N initial bone data.

Optionally, the loss of the first initial bone data in the first dimension is determined according to the first loss and/or the second loss, wherein:

the first loss is determined according to a first similarity of a first bone feature and a second bone feature, a first weight of the first similarity, a second weight of the similarity of the first bone feature and each of 2 x N bone features, and a multi-dimensional fusion similarity corresponding to each second weight, the first bone feature being a bone feature in a first set of bone features corresponding to the first dimension corresponding to the first initial bone data, the second bone feature being a bone feature in a second set of bone features corresponding to the first dimension corresponding to the first initial bone data, the 2 x N bone features including: a first bone feature set corresponding to the first dimension and a second bone feature set corresponding to the first dimension;

the second loss is determined according to the first similarity, the first weight, a third weight of the similarity of the second bone feature and each of the 2 x N bone features, and a multi-dimensional fusion similarity corresponding to each third weight.

Optionally, the 2 × N bone features include a third bone feature, and the multi-dimensional fusion similarity corresponding to the second weight of the similarity between the first bone feature and the third bone feature is determined by the similarity between the first bone feature and the third bone feature and the similarity between a fourth bone feature and a fifth bone feature corresponding to each of (M-1) dimensions, wherein the fourth bone feature and the first bone feature correspond to the same initial bone data, and the fifth bone feature and the third bone feature correspond to the same initial bone data.

Optionally, the 2 × N bone features include a third bone feature, and the multidimensional fusion similarity corresponding to a third weight of the similarity between the second bone feature and the third bone feature is determined by the similarity between the second bone feature and the third bone feature and the similarity between a sixth bone feature and a seventh bone feature corresponding to each of (M-1) dimensions, where the sixth bone feature and the second bone feature correspond to the same training data, and the seventh bone feature and the third bone feature correspond to the same training data.

Optionally, the weight corresponding to the first similarity is greater than the weight corresponding to the similarity between the first bone feature and an eighth bone feature, and the weight corresponding to the first similarity is greater than the weight corresponding to the similarity between the second bone feature and a ninth bone feature, wherein the eighth bone feature is any one of the other (2 x N-1) bone features except for the second bone feature in the 2 x N bone features, and the ninth bone feature is any one of the other (2 x N-1) bone features except for the first bone feature in the 2 x N bone features.

Optionally, the method further includes:

calculating the similarity between any two bone features in the 2 x N bone features corresponding to the first dimension based on the 2 x N bone features corresponding to the first dimension to obtain a 2N x 2N first similarity matrix, wherein the element corresponding to the ith row and the jth column of the first similarity matrix is used for indicating the similarity between the ith bone feature and the jth bone feature in the 2 x N bone features;

modifying the value of the diagonal element of the first similarity matrix into a preset value to obtain a second similarity matrix;

and processing the second similarity matrix according to an optimal transport assignment algorithm to obtain a weight matrix, wherein the element corresponding to the ith row and the jth column of the weight matrix is used for indicating the weight of the similarity of the ith bone feature and the jth bone feature in the 2 x N bone features.
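As a non-limiting illustration, the three steps above can be sketched as follows. The source does not name a specific optimal transport solver, so the Sinkhorn-style row and column normalization used here is an assumption, as are the function names, the preset diagonal value of -10, and the `epsilon` smoothing parameter.

```python
import numpy as np

def cosine_similarity_matrix(features):
    """Return the 2N x 2N matrix of pairwise cosine similarities."""
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    return normed @ normed.T

def sinkhorn_weights(similarity, diag_value=-10.0, epsilon=0.5, n_iters=50):
    """Map a similarity matrix to a weight matrix (assumed Sinkhorn scheme)."""
    second = similarity.copy()
    np.fill_diagonal(second, diag_value)      # the "second similarity matrix"
    kernel = np.exp(second / epsilon)
    for _ in range(n_iters):
        kernel /= kernel.sum(axis=1, keepdims=True)   # normalize rows
        kernel /= kernel.sum(axis=0, keepdims=True)   # normalize columns
    return kernel

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 8))            # 2N = 6 stand-in bone features
first_similarity = cosine_similarity_matrix(features)
weights = sinkhorn_weights(first_similarity)
```

Setting the diagonal to a strongly negative preset value drives the weight of each feature's similarity with itself toward zero, which is one plausible reason for the modification step.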

Optionally, the first similarity between the first bone feature and the second bone feature is determined by:

processing the first bone feature according to a feature mapping module to obtain a first feature;

processing the second bone features according to the feature mapping module to obtain second features;

and determining the cosine similarity of the first feature and the second feature as the first similarity.

Optionally, in the training process of the model, the parameters of the feature mapping module are adjusted according to the loss of the model.

Optionally, the M data dimensions include at least two of:

a joint point dimension, a joint point motion dimension, a bone dimension, and a bone motion dimension.

In a second aspect, an embodiment of the present application provides a model training apparatus for extracting human bone features, the apparatus including:

an obtaining unit, configured to obtain M skeleton data sets, where each skeleton data set in the M skeleton data sets corresponds to a data dimension, each skeleton data set includes N skeleton data, the N skeleton data in the skeleton data sets correspond to N initial skeleton data in a one-to-one manner, M is an integer greater than 1, and N is an integer greater than or equal to 1;

the training unit is used for training a model for extracting human skeleton characteristics according to the M skeleton data sets; wherein:

Optionally, the training unit is configured to:

Optionally, the apparatus further comprises:

a calculating unit, configured to calculate, based on the 2 × N bone features corresponding to the first dimension, a similarity between any two bone features in the 2 × N bone features corresponding to the first dimension, to obtain a 2N × 2N first similarity matrix, where an element corresponding to an ith row and a jth column of the first similarity matrix is used to indicate a similarity between an ith bone feature and a jth bone feature in the 2 × N bone features;

the modification unit is used for modifying the value of the diagonal element of the first similarity matrix into a preset value to obtain a second similarity matrix;

and the processing unit is used for processing the second similarity matrix according to an optimal transport assignment algorithm to obtain a weight matrix, wherein the element corresponding to the ith row and the jth column of the weight matrix is used for indicating the weight of the similarity between the ith bone feature and the jth bone feature in the 2 x N bone features.

Optionally, the M data dimensions include at least two of:

In a third aspect, an embodiment of the present application provides a human bone data processing system, including:

a model, a data enhancement model and a feature mapping model which are obtained by training by adopting the method of any one of the first aspect and are used for extracting human skeleton features;

the data enhancement model is used for enhancing the initial bone data to obtain training bone data;

the characteristic mapping model is used for processing the bone characteristics output by the model for extracting the human bone characteristics and outputting the characteristics for calculating the loss of the model for extracting the human bone characteristics.

In a fourth aspect, an embodiment of the present application provides an apparatus, where the apparatus includes: a processor, a memory, and a system bus; the processor and the memory are connected through the system bus; the memory is for storing one or more programs, the one or more programs comprising instructions which, when executed by the processor, cause the processor to perform the method of any of the above first aspects.

In a fifth aspect, embodiments of the present application provide a computer-readable storage medium, which stores instructions that, when executed on a terminal device, cause the terminal device to perform the method according to any one of the above first aspects.

In a sixth aspect, embodiments of the present application provide a computer program product, which when run on a terminal device, causes the terminal device to perform the method of any one of the above first aspects.

Compared with the prior art, the embodiment of the application has the following advantages:

The embodiment of the application provides a model training method for extracting human skeleton features, which comprises the following steps: obtaining M skeleton data sets, wherein each skeleton data set in the M skeleton data sets corresponds to one data dimension, each skeleton data set comprises N skeleton data, and the N skeleton data in each skeleton data set correspond one-to-one to N initial skeleton data. That is, the N skeleton data in a skeleton data set are obtained according to the N initial skeleton data, and one initial skeleton data corresponds to one skeleton data in the skeleton data set. Then, a model for extracting human skeleton features is trained according to the M skeleton data sets. The loss of the model is determined based on the losses of the N initial skeleton data, and if any one of the N initial skeleton data is referred to as "first initial skeleton data", the loss of the first initial skeleton data is determined based on the losses of the first initial skeleton data in each of the M data dimensions. Even if there is second initial skeleton data semantically close to the first initial skeleton data, the skeleton data of the second initial skeleton data in the M dimensions are not necessarily all semantically close to the skeleton data of the first initial skeleton data in the M dimensions. Therefore, determining the loss of the first initial skeleton data according to its losses in the M data dimensions can reduce the influence of the second initial skeleton data on the loss of the first initial skeleton data, so that the loss of the first initial skeleton data converges more easily; correspondingly, the loss of the model converges more easily, and the training efficiency of the model is improved.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a schematic flowchart of a model training method for extracting human skeletal features according to an embodiment of the present disclosure;

FIG. 2 is a schematic structural diagram of a model training apparatus for extracting human bone features according to an embodiment of the present disclosure;

FIG. 3 is a schematic structural diagram of a human bone data processing system according to an embodiment of the present disclosure.

Detailed Description

In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.

Contrastive learning is a form of self-supervised training in which one sample is processed with two different data enhancement modes to obtain two enhanced samples, and the model is trained using the two enhanced samples. It can be understood that, since the two enhanced samples are obtained by performing data enhancement processing on the same sample, the semantics of the two enhanced samples are consistent even though their contents differ. Therefore, if the features output by the model for the two enhanced samples are sufficiently similar to each other, and sufficiently dissimilar to the features of other samples (or of samples obtained by enhancing other samples), the model has the capability of extracting the features of the samples.

At present, a model for extracting human skeletal features may be trained by contrastive learning. In that case, N initial skeletal data may be data-enhanced using two data enhancement modes to obtain 2 × N training data, and the model is trained using the 2 × N training data. It can be understood that 2 × N skeletal features are obtained after the 2 × N training data pass through the model.

The initial bone data referred to herein may be joint data. The joint point data may be data of a plurality of (e.g., 25) joint points of the human body.

For a certain initial skeleton data, for example, the first initial skeleton data, the loss corresponding to the initial skeleton data can be calculated by the following formula (1).

In equation (1):

L_ij is the loss corresponding to the first initial bone data;

i and j represent the serial numbers, among all 2 × N bone features, of the bone features corresponding to the two enhanced training data derived from the first initial bone data;

S_ij represents the similarity of the bone features corresponding to the two enhanced training data derived from the first initial bone data;

S_ik represents the similarity of the ith bone feature and the kth bone feature among the obtained 2 × N bone features.
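Formula (1) itself is not reproduced in this text. One form that is consistent with the symbol definitions above and with the later remark about its denominator is the usual contrastive (InfoNCE-style) loss shown below. This is an assumption rather than the formula as filed, and any temperature scaling is omitted.

```latex
L_{ij} = -\log \frac{\exp(S_{ij})}{\sum_{k=1,\; k \neq i}^{2N} \exp(S_{ik})}
```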

It can be understood from formula (1) that if the initial bone data include other initial bone data semantically close to the first initial bone data, such as second initial bone data, then the two bone features among the 2 × N bone features that correspond to the second initial bone data will be highly similar to the two bone features that correspond to the first initial bone data. This enlarges the denominator of formula (1), so the value of L_ij stays large and is slow to approach 0; the model therefore converges slowly, that is, the training efficiency of the model is low.

An example follows. Let N be 3, that is, there are 3 initial bone data, denoted data A, data B, and data C. Applying the first data enhancement mode to the 3 initial bone data yields data A1, B1, and C1, and applying the second data enhancement mode yields data A2, B2, and C2. Data A1, B1, C1, A2, B2, and C2 are input into the model to obtain 6 bone features, namely Z1, Z2, Z3, Z4, Z5, and Z6. It will be appreciated that bone features Z1 and Z4 correspond to data A, bone features Z2 and Z5 correspond to data B, and bone features Z3 and Z6 correspond to data C. According to formula (1) above, the loss corresponding to data A can be represented by the following formula (2):

In equation (2):

L₁₄ is the loss corresponding to data A;

S₁₄ is the similarity of Z1 and Z4;

S₁₂ is the similarity of Z1 and Z2;

S₁₃ is the similarity of Z1 and Z3;

S₁₅ is the similarity of Z1 and Z5;

S₁₆ is the similarity of Z1 and Z6.
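Formula (2) is likewise not reproduced in this text. Assuming an InfoNCE-style loss without temperature scaling, which is an assumption rather than the formula as filed, the loss for data A with N = 3 would use exactly the five similarities listed above:

```latex
L_{14} = -\log \frac{\exp(S_{14})}{\exp(S_{12}) + \exp(S_{13}) + \exp(S_{14}) + \exp(S_{15}) + \exp(S_{16})}
```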

It can be understood that if the semantic information of data A and data B is similar, the similarity among the features Z1, Z4, Z2, and Z5 is high, so S₁₄, S₁₂, and S₁₅ in the denominator of formula (2) are all relatively large. Thus, even though S₁₄ is relatively high, the denominator is also relatively large, so the ratio of the S₁₄ term to the denominator is relatively small. As a result, L₁₄ is also large, making the model difficult to converge.
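The effect can be checked numerically with a minimal sketch. The loss form used here (negative log of the exponentiated positive similarity over the sum of exponentiated similarities to all other features, with no temperature) is an assumed InfoNCE-style form, and the similarity values are made up for illustration.

```python
import numpy as np

def contrastive_loss(sim_row, positive, anchor):
    """Assumed InfoNCE-style loss for one anchor feature (no temperature)."""
    mask = np.ones(sim_row.shape, dtype=bool)
    mask[anchor] = False                       # drop the self-similarity term
    denominator = np.exp(sim_row[mask]).sum()
    return float(-np.log(np.exp(sim_row[positive]) / denominator))

# Similarities of Z1 to Z1..Z6 (index 0 is Z1 itself, index 3 is Z4).
distinct = np.array([1.0, 0.1, 0.1, 0.9, 0.1, 0.1])   # A unlike B and C
close_ab = np.array([1.0, 0.8, 0.1, 0.9, 0.8, 0.1])   # A semantically close to B

loss_distinct = contrastive_loss(distinct, positive=3, anchor=0)
loss_close = contrastive_loss(close_ab, positive=3, anchor=0)
print(loss_distinct, loss_close)
```

With the same positive similarity S₁₄ in both cases, the loss is larger when Z2 and Z5 are also similar to Z1, which is the convergence problem described above.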

In order to solve the above problem, embodiments of the present application provide a model training method and apparatus for extracting human bone features.

Various non-limiting embodiments of the present application are described in detail below with reference to the accompanying drawings.

Referring to FIG. 1, which is a schematic flowchart of a model training method for extracting human skeletal features according to an embodiment of the present application. The method provided by the embodiment of the present application may be executed by a terminal device or a server, and the embodiment of the present application is not particularly limited in this respect.

In this embodiment, the method shown in FIG. 1 may include, for example, the following steps S101-S102.

S101: obtaining M skeleton data sets, wherein each skeleton data set in the M skeleton data sets corresponds to a data dimension, each skeleton data set comprises N skeleton data, the N skeleton data in the skeleton data sets correspond to N initial skeleton data in a one-to-one mode, M is an integer larger than 1, and N is an integer larger than or equal to 1.

Regarding the N initial skeletal data, it should be noted that, in one example, the N initial skeletal data may be data of a joint dimension, for example.

Regarding the M data dimensions, it should be noted that, in the embodiment of the present application, the M data dimensions may include at least two of a joint point dimension, a joint point motion dimension, a bone dimension, and a bone motion dimension.

Wherein:

The data of the joint point dimension may be, for example, the coordinate sequences of N joint points. A coordinate sequence is exemplified as follows: for a bone sequence with a length of T frames, where each frame contains N persons, each person has M joint points, and each joint point carries three-dimensional coordinate position information, the coordinate sequence can be expressed as a tensor of shape T × N × M × 3.

The data of the joint point motion dimension may be, for example, the motion coordinate sequences of N joint points. The motion coordinate sequence of a joint point is obtained by subtracting, in the corresponding T-frame bone sequence, the joint point coordinates of the previous frame from the corresponding joint point coordinates of the next frame. Because the first frame has no preceding frame to subtract, the first frame is discarded and then filled back in by interpolation.

The data of the bone dimension may be, for example, the coordinate sequences of N bones. The coordinate sequence of a bone is the difference between the coordinates of adjacent joint points in the same frame of the bone sequence.

The data of the bone motion dimension may be, for example, the motion coordinate sequences of N bones. The motion coordinate sequence of a bone is obtained by subtracting, in the corresponding T-frame bone sequence, the bone coordinates of the previous frame from the corresponding bone coordinates of the next frame. Because the first frame has no preceding frame to subtract, the first frame is discarded and then filled back in by interpolation.
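As a non-limiting illustration, the four data dimensions can be derived from one joint coordinate array as sketched below. The function names, the `bone_pairs` list of adjacent joint points, and the way the first frame is filled (by repeating the first available difference) are assumptions; the source only states that the first frame is restored by interpolation.

```python
import numpy as np

def temporal_difference(sequence):
    """Frame-to-frame difference along the time axis, padded back to T frames."""
    diff = sequence[1:] - sequence[:-1]          # frame t minus frame t-1
    # Assumption: the frame with no predecessor is filled by repeating the
    # first available difference; the source only says "interpolation".
    return np.concatenate([diff[:1], diff], axis=0)

def build_dimensions(joints, bone_pairs):
    """joints: array of shape (T, persons, joint_count, 3)."""
    ends = [end for end, _ in bone_pairs]
    starts = [start for _, start in bone_pairs]
    bones = joints[:, :, ends] - joints[:, :, starts]   # adjacent-joint difference
    return {
        "joint": joints,
        "joint_motion": temporal_difference(joints),
        "bone": bones,
        "bone_motion": temporal_difference(bones),
    }

rng = np.random.default_rng(0)
joints = rng.normal(size=(5, 1, 4, 3))           # T=5 frames, 1 person, 4 joints
bone_pairs = [(1, 0), (2, 1), (3, 2)]            # hypothetical 4-joint chain
dims = build_dimensions(joints, bone_pairs)
```

Each entry of `dims` keeps the frame count T, so the four data dimensions stay aligned frame by frame with the same initial bone data.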

In this embodiment of the application, to obtain the M skeleton data sets, the N initial skeleton data may, for example, be obtained first, and skeleton data in the M dimensions may then be derived from the N initial skeleton data, thereby obtaining the M skeleton data sets. It can be understood that each of the skeleton data sets includes N skeleton data, and the N skeleton data in each skeleton data set correspond one-to-one to the N initial skeleton data. For convenience of description, in the following embodiments, any one of the N initial skeleton data is referred to as "first initial skeleton data".

In one example, assume that the M data dimensions are a joint point dimension, a joint point motion dimension, and a bone dimension. The first initial bone data may be obtained first and then processed to obtain the data of the joint point dimension, the data of the joint point motion dimension, and the data of the bone dimension corresponding to the first initial bone data. When the first initial bone data is joint point data, the acquired set comprising the N initial bone data can be directly determined as the bone data set corresponding to the joint point dimension.

In the present embodiment, "joint point data" is "data of joint point dimension", "joint point movement data" is "data of joint point movement dimension", "bone data" is "data of bone dimension", and "bone movement data" is "data of bone movement dimension".

S102: training a model for extracting human skeleton features according to the M skeleton data sets, wherein the loss of the model is determined based on the losses of the N initial skeleton data, the N initial skeleton data comprise first initial skeleton data, and the loss of the first initial skeleton data is determined according to the losses of the first initial skeleton data in the M data dimensions respectively.

After the M skeletal data sets are obtained, a model for extracting human skeletal features can be trained in a general self-supervised training manner. In one example, the model may be a convolutional neural network (CNN) that includes multiple convolutional layers.

In training the model, a loss of the model may be calculated based on losses of the N pieces of initial bone data, and, for a first piece of initial bone data, a loss of the first piece of initial bone data may be calculated from losses of the first piece of initial bone data in the M data dimensions, respectively. For example, the loss of the first initial bone data is a sum of losses of the first initial bone data in the M data dimensions.
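In symbols, the summation example above can be written as follows. The notation (x_n for the n-th initial bone data, L^{(m)} for the loss in the m-th data dimension) is introduced here for illustration only, and averaging over the N initial bone data to obtain the model loss is an assumption, since the source only states that the model loss is based on the N per-sample losses.

```latex
L(x_n) = \sum_{m=1}^{M} L^{(m)}(x_n), \qquad
L_{\text{model}} = \frac{1}{N} \sum_{n=1}^{N} L(x_n)
```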

It is understood that, for the first initial bone data, even if the semantics of the second initial bone data are close to those of the first initial bone data, the semantics of the bone data corresponding to the second initial bone data in the aforementioned M dimensions are not necessarily all close to those of the bone data corresponding to the first initial bone data in the aforementioned M dimensions. Therefore, the loss of the first initial bone data is determined according to the losses of the first initial bone data in the M data dimensions, and the influence of the second initial bone data on the loss of the first initial bone data can be reduced, so that the loss of the first initial bone data is easier to converge, and correspondingly, the loss of the model is easier to converge, and the training efficiency of the model is improved.

In one example of the embodiment of the present application, S102 may, in a specific implementation, train the model for extracting human bone features according to the M bone data sets by contrastive learning. In this case, S102 may be implemented by the following steps S1-S2.

S1: processing N skeletal data in the M skeletal data sets by adopting a first data enhancement mode to obtain M first training data sets, wherein one skeletal data set corresponds to one first training data set, and one first training data set comprises N groups of training data; and processing N skeletal data in the M skeletal data sets by adopting a second data enhancement mode to obtain M second training data sets, wherein one skeletal data set corresponds to one second training data set, and one second training data set comprises N groups of training data.

The embodiment of the present application does not specifically limit the first data enhancement mode and the second data enhancement mode. In one example, the first data enhancement mode and the second data enhancement mode are each one of the following, and the two modes are different:

1) randomly rotating all the joint points of each frame slightly in the same direction;

2) randomly tilting all the joint points of each frame slightly in the same direction;

3) adding Gaussian noise to the coordinates of all joint points in all frames;

4) hiding part of the joint points in some of the frames.
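The four enhancement modes can be sketched as follows for a single sequence of shape (frames, joint points, 3). All function names and parameter ranges are hypothetical, the rotation is taken about an assumed vertical z axis, and the "tilt" of mode 2) is interpreted as a shear transform, which is an assumption.

```python
import numpy as np

def rotate(seq, rng, max_deg=10.0):
    """Mode 1: one small random rotation about the z axis for all frames."""
    angle = np.deg2rad(rng.uniform(-max_deg, max_deg))
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return seq @ rot.T

def shear(seq, rng, max_shear=0.1):
    """Mode 2: one small random shear applied to all frames."""
    off_diagonal = 1.0 - np.eye(3)
    matrix = np.eye(3) + rng.uniform(-max_shear, max_shear, size=(3, 3)) * off_diagonal
    return seq @ matrix.T

def add_gaussian_noise(seq, rng, sigma=0.01):
    """Mode 3: Gaussian noise on every coordinate of every frame."""
    return seq + rng.normal(0.0, sigma, size=seq.shape)

def mask_joints(seq, rng, frame_ratio=0.2, joint_ratio=0.2):
    """Mode 4: zero out some joint points in some frames."""
    out = seq.copy()
    frame_mask = rng.random(seq.shape[0]) < frame_ratio
    joint_mask = rng.random(seq.shape[1]) < joint_ratio
    out[np.ix_(frame_mask, joint_mask)] = 0.0
    return out

rng = np.random.default_rng(0)
sequence = rng.normal(size=(8, 25, 3))       # 8 frames, 25 joint points
view_a = rotate(sequence, rng)               # first data enhancement mode
view_b = add_gaussian_noise(sequence, rng)   # second, different mode
```

Picking two different functions for `view_a` and `view_b` mirrors the requirement that the first and second data enhancement modes differ.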

With respect to S1, an example will now be explained. Suppose N is 4 and M is 3, and the M bone data sets are respectively a bone data set corresponding to the joint point dimension, a bone data set corresponding to the joint point motion dimension, and a bone data set corresponding to the bone dimension. Each bone data set includes 4 groups of bone data. Then, using the first data enhancement mode, 3 first training data sets are obtained, which are respectively a first training data set corresponding to the joint point dimension, a first training data set corresponding to the joint point motion dimension, and a first training data set corresponding to the bone dimension. Each of these 3 first training data sets comprises 4 groups of training data, and the 4 groups of training data correspond one-to-one to the 4 groups of initial bone data. Similarly, using the second data enhancement mode, 3 second training data sets can be obtained, which are respectively a second training data set corresponding to the joint point dimension, a second training data set corresponding to the joint point motion dimension, and a second training data set corresponding to the bone dimension. Each of these 3 second training data sets likewise comprises 4 groups of training data, and the 4 groups of training data correspond one-to-one to the 4 groups of initial bone data.

S2: training a model for extracting human skeletal features based on the M first training data sets and the M second training data sets.

In the embodiment of the present application, the model processes the training data of each of the M data dimensions in a similar manner, so the following description takes a first dimension of the M data dimensions as an example. It should be noted that the first dimension is any one of the M data dimensions.

In an embodiment of the present application, the M data dimensions include a first dimension. After the first training data set corresponding to the first dimension is input into the model, a first bone feature set corresponding to the first dimension can be obtained, and after the second training data set corresponding to the first dimension is input into the model, a second bone feature set corresponding to the first dimension can be obtained, wherein the first bone feature set and the second bone feature set respectively comprise N bone features. It is understood that 2 × N bone features can be obtained after the first training data set corresponding to the first dimension and the second training data set corresponding to the first dimension are input into the model.

In this embodiment, the first bone feature set includes bone features corresponding to N sets of training data in the first training data set, and the N sets of training data in the first training data set correspond to the N initial pieces of bone data one to one, so that the N bone features in the first bone feature set correspond to the N initial pieces of bone data one to one. Similarly, the second bone feature set includes bone features corresponding to N sets of training data in the second training data set, and the N sets of training data in the second training data set correspond to the N initial bone data sets one to one, so that the N bone features in the second bone feature set correspond to the N initial bone data sets one to one.
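The correspondence between the 2 × N bone features and the N initial bone data can be illustrated with a short sketch. It assumes, purely for illustration, that the features of the first set are stored before those of the second set, and it uses a linear projection named `encode` as a stand-in for the model; neither choice is prescribed by the embodiment.

```python
import numpy as np


def encode(batch, weights):
    """Stand-in for the model: flatten each sample and project it linearly."""
    return batch.reshape(batch.shape[0], -1) @ weights


def extract_features(first_set, second_set, weights):
    """Return the 2*N features of one dimension, first set followed by second set."""
    first_features = encode(first_set, weights)      # shape (N, D)
    second_features = encode(second_set, weights)    # shape (N, D)
    return np.concatenate([first_features, second_features], axis=0)


def feature_indices(i, n):
    """Row indices of the two features derived from initial bone data i."""
    return i, i + n


rng = np.random.default_rng(0)
N, INPUT_SIZE, D = 4, 12, 8                          # hypothetical sizes
weights = rng.normal(size=(INPUT_SIZE, D))
first_set = rng.normal(size=(N, INPUT_SIZE))
second_set = rng.normal(size=(N, INPUT_SIZE))
features = extract_features(first_set, second_set, weights)
```

Under this ordering, rows i and i + N of `features` are the two bone features that correspond to the same initial bone data.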

It can be understood that, for a first initial bone data among the N initial bone data, the first bone feature set corresponding to the first dimension contains a bone feature corresponding to the first initial bone data, and the second bone feature set corresponding to the first dimension also contains a bone feature corresponding to the first initial bone data. For convenience of description, the bone feature in the first bone feature set that corresponds to the first initial bone data is referred to as the "first bone feature", and the bone feature in the second bone feature set that corresponds to the first initial bone data is referred to as the "second bone feature".

As described above, the loss of the first initial bone data is determined based on the loss of the first initial bone data in each of the M data dimensions. Next, taking the first dimension as an example, the loss of the first initial bone data in the first dimension is described.

In one example, the loss of the first initial bone data in the first dimension may be a first loss.

The first loss is determined according to a first similarity of the first bone feature and the second bone feature, a first weight of the first similarity, a second weight of the similarity of the first bone feature and each of the 2 × N bone features, and a multi-dimensional fusion similarity corresponding to each second weight. The multi-dimensional fusion similarity can be determined according to the bone features corresponding to the first initial bone data in each of the M dimensions. In this way, the first loss considers not only the bone feature of the first initial bone data in the first dimension but also the bone features of the first initial bone data in the other dimensions, so that the bone features of the different dimensions complement each other, thereby improving the model training efficiency.

In one example, if any one of the 2 × N bone features is referred to as a third bone feature, then the multi-dimensional fusion similarity corresponding to the second weight of the similarity between the first bone feature and the third bone feature may be determined according to the similarity between the first bone feature and the third bone feature and the similarity between a fourth bone feature and a fifth bone feature corresponding to each of (M-1) dimensions, where the fourth bone feature and the first bone feature correspond to the same initial bone data, and the fifth bone feature and the third bone feature correspond to the same initial bone data.

As for the fourth bone feature, it should be noted that the fourth bone feature is also a bone feature corresponding to the first initial bone data, but in one of the other (M-1) dimensions. For example, if the first bone feature is the bone feature of the first initial bone data in the joint point dimension, the fourth bone feature may be the bone feature of the first initial bone data in the joint point motion dimension.

As for the fifth bone feature, the fifth bone feature and the third bone feature correspond to the same initial bone data, but the fifth bone feature belongs to one of the other (M-1) dimensions. For example, if the third bone feature is the bone feature of second initial bone data in the joint point dimension, the fifth bone feature may be the bone feature of the second initial bone data in the joint point motion dimension.

In one example, assuming that M is 3, the first loss may be calculated by the following equation (3):

In equation (3):

L_j1 denotes the first loss corresponding to the first initial bone data;

j1 and j2 denote the serial numbers, among all 2 × N bone features corresponding to the first dimension, of the bone features of the two enhanced training data derived from the first initial bone data, where j1 is the serial number of the first bone feature and j2 is the serial number of the second bone feature;

S_j1j2 denotes the similarity of the bone features of the two enhanced training data derived from the first initial bone data, that is, the similarity of the first bone feature and the second bone feature;

W_j1j2 denotes the weight of S_j1j2;

jk denotes the serial number of the third bone feature;

W_j1jk denotes the second weight, that is, the weight of the similarity of the first bone feature and the third bone feature;

[...] represents the multi-dimensional fusion similarity corresponding to W_j1jk;

m1 and b1 denote the serial numbers, among the 2 × N bone features corresponding to each of the other two dimensions, of the bone features of the training data obtained by the first data enhancement mode, that is, of the fourth bone features;

mk and bk denote the serial numbers, among the 2 × N bone features corresponding to each of the other two dimensions, of the bone features of the enhanced training data derived from a certain initial bone data, where that initial bone data is the initial bone data corresponding to the third bone feature;

S_m1mk denotes the similarity of the fourth bone feature and the fifth bone feature in one of the other dimensions (for example, the joint point motion dimension);

S_j1jk denotes the similarity of the first bone feature and the third bone feature;

S_b1bk denotes the similarity of the fourth bone feature and the fifth bone feature in the remaining dimension (for example, the bone dimension).
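How the quantities defined above could interact can be sketched in code. Equation (3) itself is not reproduced in this text, so the sketch below is not the equation of the embodiment. It assumes, purely for illustration, that the fusion similarity is the average of S_m1mk, S_j1jk and S_b1bk, that the loss has a weighted contrastive form, and that a given initial bone data has the same serial number in every dimension. The function names `fused_similarity` and `first_loss` are hypothetical.

```python
import numpy as np


def fused_similarity(sims, j1, k):
    """Assumed fusion: average of S[j1, k] over the M per-dimension matrices."""
    return float(np.mean([s[j1, k] for s in sims]))


def first_loss(sims, weights, dim, j1, j2):
    """Assumed weighted contrastive form of the first loss for one dimension.

    sims    : list of M similarity matrices, each of shape (2N, 2N)
    weights : weight matrix of dimension `dim`, shape (2N, 2N)
    dim     : index of the first dimension within `sims`
    j1, j2  : serial numbers of the first and second bone features
    """
    two_n = weights.shape[0]
    # Weighted first similarity (numerator).
    numerator = weights[j1, j2] * np.exp(sims[dim][j1, j2])
    # Each second weight multiplies its multi-dimensional fusion similarity.
    denominator = sum(
        weights[j1, k] * np.exp(fused_similarity(sims, j1, k))
        for k in range(two_n)
    )
    return float(-np.log(numerator / denominator))
```

Under these assumptions, a larger first weight W_j1j2 or a larger first similarity S_j1j2 lowers the loss, while large fused similarities to other features raise it.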

In yet another example, the loss of the first initial bone data in the first dimension may be a second loss.

The second loss is determined according to the first similarity of the first bone feature and the second bone feature, the first weight of the first similarity, a third weight of the similarity of the second bone feature and each of the 2 × N bone features, and a multi-dimensional fusion similarity corresponding to each third weight. The multi-dimensional fusion similarity corresponding to the third weight may be determined according to the bone features of the first initial bone data in each of the M dimensions. In this way, the second loss considers not only the bone feature of the first initial bone data in the first dimension but also the bone features of the first initial bone data in the other dimensions, so that the bone features of the different dimensions complement each other, thereby improving the model training efficiency.

In one example, if any one of the 2 × N bone features is referred to as a third bone feature, then the multi-dimensional fusion similarity corresponding to the third weight of the similarity between the second bone feature and the third bone feature may be determined according to the similarity between the second bone feature and the third bone feature and the similarity between a sixth bone feature and a seventh bone feature corresponding to each of the (M-1) other dimensions, where the sixth bone feature and the second bone feature correspond to the same initial bone data, and the seventh bone feature and the third bone feature correspond to the same initial bone data.

As for the sixth bone feature, it should be noted that the sixth bone feature is also a bone feature corresponding to the first initial bone data, but in one of the other (M-1) dimensions. For example, if the second bone feature is the bone feature of the first initial bone data in the joint point dimension, the sixth bone feature may be the bone feature of the first initial bone data in the joint point motion dimension.

As for the seventh bone feature, the seventh bone feature and the third bone feature correspond to the same initial bone data, but the seventh bone feature belongs to one of the other (M-1) dimensions. For example, if the third bone feature is the bone feature of second initial bone data in the joint point dimension, the seventh bone feature may be the bone feature of the second initial bone data in the joint point motion dimension.

In one example, assuming that M is 3, the second loss may be calculated by the following equation (4):

In equation (4):

L_j2 denotes the second loss corresponding to the first initial bone data;

S_j2j1 denotes the similarity of the bone features of the two enhanced training data derived from the first initial bone data, that is, the similarity of the second bone feature and the first bone feature;

W_j2j1 denotes the weight of S_j2j1;

jk denotes the serial number of the third bone feature;

W_j2jk denotes the third weight, that is, the weight of the similarity of the second bone feature and the third bone feature;

[...] represents the multi-dimensional fusion similarity corresponding to W_j2jk;

m2 and b2 denote the serial numbers, among the 2 × N bone features corresponding to each of the other two dimensions, of the bone features of the training data obtained by the second data enhancement mode, that is, of the sixth bone features;

mk and bk denote the serial numbers, among the 2 × N bone features corresponding to each of the other two dimensions, of the bone features of the enhanced training data derived from a certain initial bone data, where that initial bone data is the initial bone data corresponding to the third bone feature;

S_m2mk denotes the similarity of the sixth bone feature and the seventh bone feature in one of the other dimensions (for example, the joint point motion dimension);

S_j2jk denotes the similarity of the second bone feature and the third bone feature;

S_b2bk denotes the similarity of the sixth bone feature and the seventh bone feature in the remaining dimension (for example, the bone dimension).

In yet another example, the loss of the first initial bone data in the first dimension may be determined based on both the first loss and the second loss; for example, it may be a weighted sum of the first loss and the second loss. In one example, the loss of the first initial bone data in the first dimension can be calculated by the following equation (5).

L_j = a * L_j1 + b * L_j2    Formula (5)

In equation (5):

L_j is the loss of the first initial bone data in the first dimension;

L_j1 is the first loss;

a is the weight of the first loss; in one example, a = 0.5;

L_j2 is the second loss;

b is the weight of the second loss; in one example, b = 0.5.
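Equation (5) can be written as a one-line helper. The function name `combined_loss` and the sample loss values are hypothetical; the default weights follow the example values a = 0.5 and b = 0.5 given above.

```python
def combined_loss(first_loss_value, second_loss_value, a=0.5, b=0.5):
    """Equation (5): L_j = a * L_j1 + b * L_j2."""
    return a * first_loss_value + b * second_loss_value


# Hypothetical loss values for one initial bone data in one dimension.
loss_j = combined_loss(1.0, 3.0)   # 0.5 * 1.0 + 0.5 * 3.0 = 2.0
```

With equal weights, the combined loss is simply the mean of the first loss and the second loss.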

In an example of the embodiment of the present application, the first weight corresponding to the first similarity is greater than the weight corresponding to the similarity between the first bone feature and an eighth bone feature, and is also greater than the weight corresponding to the similarity between the second bone feature and a ninth bone feature. The eighth bone feature is any one of the (2 × N - 1) bone features other than the second bone feature among the 2 × N bone features, and the ninth bone feature is any one of the (2 × N - 1) bone features other than the first bone feature among the 2 × N bone features.

For example:

N is 3, and for the first dimension there are 3 pieces of bone data, denoted data A, data B, and data C. After data enhancement of the 3 pieces of bone data with the first data enhancement mode, data A1, B1, and C1 are obtained, that is, the first training data set corresponding to the first dimension includes data A1, B1, and C1. After data enhancement of the 3 pieces of bone data with the second data enhancement mode, data A2, B2, and C2 are obtained, that is, the second training data set corresponding to the first dimension includes data A2, B2, and C2. Data A1, B1, C1, A2, B2, and C2 are input into the model to obtain 6 features Z1, Z2, Z3, Z4, Z5, and Z6, respectively; the first bone feature set includes features Z1, Z2, and Z3, and the second bone feature set includes features Z4, Z5, and Z6.

Then, taking the initial bone data corresponding to data A as the first initial bone data, the first bone feature is Z1 and the second bone feature is Z4, and the weight of the similarity between Z1 and Z4 is greater than the weight of each of the following similarities: Z1 and Z1; Z1 and Z2; Z1 and Z3; Z1 and Z5; Z1 and Z6; Z4 and Z2; Z4 and Z3; Z4 and Z4; Z4 and Z5; Z4 and Z6.
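The ordering of weights in this example can be checked mechanically. The sketch below uses a hand-made 6 × 6 weight matrix with hypothetical values (0.5 for the two features of the same initial bone data, 0.1 elsewhere) and 0-based indices, so that Z1 is index 0 and Z4 is index 3. The function name `positive_weight_dominates` is also hypothetical.

```python
import numpy as np


def positive_weight_dominates(weights, j1, j2):
    """True if W[j1, j2] is the strict maximum of row j1 and of column j2."""
    w = weights[j1, j2]
    row_others = np.delete(weights[j1, :], j2)   # similarities of feature j1
    col_others = np.delete(weights[:, j2], j1)   # similarities of feature j2
    return bool(np.all(w > row_others) and np.all(w > col_others))


N = 3
W = np.full((2 * N, 2 * N), 0.1)                 # hypothetical weights
for i in range(N):
    W[i, i + N] = W[i + N, i] = 0.5              # pairs (Z1, Z4), (Z2, Z5), (Z3, Z6)

dominates = positive_weight_dominates(W, 0, 3)   # first feature Z1, second feature Z4
```

In this toy matrix every row and every column sums to 1, and the weight of the similarity between Z1 and Z4 exceeds all ten weights listed above.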

It can be understood that, in this way, the first weight of the first similarity is relatively large, while the weights of the similarities between the first bone feature and the other bone features are relatively small. Therefore, even if there is second initial bone data that is semantically close to the first initial bone data, the first similarity still carries a higher weight than the other similarities when the loss of the first initial bone data is calculated. The influence of the first similarity on the loss of the first initial bone data is thereby strengthened, and the influence of the other similarities is weakened, so that the calculated loss of the first initial bone data converges more easily. The loss of the model in turn converges more easily, which improves the training efficiency of the model.

In one example, the bone features output by the model may not be well suited to directly calculating the similarity between bone features. Therefore, in the embodiment of the present application, when the similarity between bone features is calculated, a feature mapping module may first perform feature mapping on the bone features output by the model, mapping them to a feature space in which the similarity between features is convenient to calculate, and the similarity between the bone features is then calculated using the mapped features.

Taking the calculation of the similarity between the first bone feature and the second bone feature as an example, the first bone feature may be processed by the feature mapping module to obtain a first feature, and the second bone feature may be processed by the feature mapping module to obtain a second feature. The cosine similarity of the first feature and the second feature is then determined as the first similarity.
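The two steps just described, feature mapping followed by cosine similarity, can be sketched as follows. The mapping used here is a small two-layer perceptron chosen only to keep the example short; it is a stand-in, not the structure of the feature mapping module of the embodiment, and the sizes and weight names are hypothetical.

```python
import numpy as np


def feature_mapping(feature, w1, w2):
    """Stand-in mapping module: two linear layers with a ReLU in between."""
    hidden = np.maximum(feature @ w1, 0.0)
    return hidden @ w2


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between u and v; eps guards against zero vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))


def first_similarity(first_bone_feature, second_bone_feature, w1, w2):
    """Map both bone features, then compare the mapped features."""
    first_feature = feature_mapping(first_bone_feature, w1, w2)
    second_feature = feature_mapping(second_bone_feature, w1, w2)
    return cosine_similarity(first_feature, second_feature)


rng = np.random.default_rng(0)
D, HIDDEN, OUT = 8, 16, 4                    # hypothetical sizes
w1 = rng.normal(size=(D, HIDDEN))
w2 = rng.normal(size=(HIDDEN, OUT))
similarity = first_similarity(rng.normal(size=D), rng.normal(size=D), w1, w2)
```

The same mapping weights are applied to both bone features, so the comparison takes place in a single shared feature space.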

The embodiments of the present application also do not specifically limit the structure of the feature mapping module, which in one example may be a convolutional neural network including a plurality of convolutional layers.

In one example, the feature mapping module may be pre-trained.

In yet another example, the feature mapping module may be trained simultaneously with the model, in other words, during the training of the model, the parameters of the feature mapping module are adjusted according to the loss of the model. In this way, the feature mapping module and the model can be trained simultaneously without training the feature mapping module in advance.

In the embodiment of the present application, the weight of the similarity between any two bone features may be obtained through pre-calculation, and a manner of calculating the weight of the similarity corresponding to the first dimension is described next. In one example, the weight of any similarity may be calculated by the following steps S3-S5.

S3: Based on the 2 × N bone features corresponding to the first dimension, calculate the similarity between any two of the 2 × N bone features to obtain a 2N × 2N first similarity matrix, where the element in the ith row and jth column of the first similarity matrix indicates the similarity between the ith bone feature and the jth bone feature among the 2 × N bone features.

In the embodiment of the present application, the similarity between two bone features may be a cosine similarity of the two bone features. Regarding the way of calculating the cosine similarity, the embodiment of the present application is not particularly limited.
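Step S3 can be sketched with a few lines of NumPy, assuming the 2 × N bone features are stored as the rows of an array. The function name `similarity_matrix` and the array sizes are hypothetical.

```python
import numpy as np


def similarity_matrix(features, eps=1e-12):
    """Cosine similarity between every pair of rows; result has shape (2N, 2N)."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, eps)   # normalise each feature to unit length
    return unit @ unit.T                       # entry (i, j) = cosine of features i and j


rng = np.random.default_rng(0)
N, D = 3, 8                                    # hypothetical sizes
features = rng.normal(size=(2 * N, D))         # 2N bone features of one dimension
first_similarity_matrix = similarity_matrix(features)
```

The resulting matrix is symmetric, and its diagonal elements equal 1 because each feature is compared with itself.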

S4: Modify the values of the diagonal elements of the first similarity matrix to a preset value to obtain a second similarity matrix.

S5: Process the second similarity matrix according to an optimal transport allocation algorithm to obtain a weight matrix, where the element in the ith row and jth column of the weight matrix indicates the weight of the similarity between the ith bone feature and the jth bone feature among the 2 × N bone features.

Regarding S4 and S5, it should be noted that each diagonal element of the first similarity matrix indicates the similarity of a bone feature to itself, so the value of each diagonal element is 1. In order that, for the first bone feature in the first bone feature set, the bone feature with the highest similarity is the second bone feature in the second bone feature set rather than the first bone feature itself, in the embodiment of the present application the values of the diagonal elements of the first similarity matrix may be modified to a smaller preset value to obtain the second similarity matrix. The preset value is not particularly limited in the embodiments of the present application; in one example, the preset value may be 1e-3.

Then, the second similarity matrix is processed according to the optimal transport allocation algorithm to obtain the weight matrix. In the weight matrix, the elements of each row sum to 1, and the elements of each column also sum to 1. Assuming that the weight of the first similarity corresponding to the first bone feature and the second bone feature is the element in the ith row and jth column of the weight matrix, that element is the largest of the 2 × N elements in the ith row and also the largest of the 2 × N elements in the jth column. It can be understood that, in the weight matrix, the elements of the ith row are the weights of the similarities between the first bone feature and each of the 2 × N bone features, and the elements of the jth column are the weights of the similarities between the second bone feature and each of the 2 × N bone features. Therefore, it follows from the weight matrix that the weight corresponding to the first similarity is greater than the weight corresponding to the similarity between the first bone feature and the eighth bone feature, and is also greater than the weight corresponding to the similarity between the second bone feature and the ninth bone feature.

The embodiment of the present application does not specifically limit the optimal transport allocation algorithm, which may be, for example, the Sinkhorn-Knopp algorithm.
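Steps S4 and S5 can be sketched as follows, using a plain Sinkhorn-Knopp iteration that alternately normalises rows and columns. Two points are assumptions of this sketch and are not stated in the embodiment: the second similarity matrix is first passed through an exponential kernel exp(S / epsilon) so that all entries are positive, and the values of `epsilon` and `iterations` are arbitrary. The toy input uses N = 3 and lets both enhanced views of a sample map to the same unit vector.

```python
import numpy as np


def suppress_diagonal(similarity, preset=1e-3):
    """S4: replace the self-similarities on the diagonal by a small preset value."""
    second = similarity.copy()
    np.fill_diagonal(second, preset)
    return second


def sinkhorn_knopp(similarity, epsilon=0.5, iterations=500):
    """S5 sketch: scale exp(S / epsilon) until rows and columns each sum to 1."""
    weights = np.exp(similarity / epsilon)              # assumed positive kernel
    for _ in range(iterations):
        weights /= weights.sum(axis=1, keepdims=True)   # normalise rows
        weights /= weights.sum(axis=0, keepdims=True)   # normalise columns
    return weights


# Toy input: features 0..2 come from the first training data set,
# features 3..5 from the second, and feature i pairs with feature i + N.
N = 3
views = np.concatenate([np.eye(N), np.eye(N)], axis=0)
first_matrix = views @ views.T                          # cosine similarities
second_matrix = suppress_diagonal(first_matrix)
weight_matrix = sinkhorn_knopp(second_matrix)
```

In this toy case, the largest element of row 0 of `weight_matrix` lies in column 3, and the largest element of column 3 lies in row 0, which matches the property of the weight matrix described above.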

Based on the method provided by the above embodiment, the embodiment of the present application further provides an apparatus, which is described below with reference to the accompanying drawings.

Referring to Fig. 2, a schematic structural diagram of a model training apparatus for extracting human bone features according to an embodiment of the present application is provided. The apparatus 200 may specifically include, for example, an obtaining unit 201 and a training unit 202.

An obtaining unit 201, configured to obtain M skeleton data sets, where each skeleton data set in the M skeleton data sets corresponds to a data dimension, each skeleton data set includes N skeleton data, the N skeleton data in the skeleton data sets correspond to N initial skeleton data in a one-to-one manner, M is an integer greater than 1, and N is an integer greater than or equal to 1;

a training unit 202, configured to train a model for extracting human skeletal features according to the M skeletal data sets; wherein:

Optionally, the training unit 202 is configured to:

Optionally, the apparatus further comprises:

Optionally, the M data dimensions include at least two of:

Since the apparatus 200 is an apparatus corresponding to the method provided in the above method embodiment, and the specific implementation of each unit of the apparatus 200 is the same as that of the above method embodiment, for the specific implementation of each unit of the apparatus 200, reference may be made to the description part of the above method embodiment, and details are not repeated here.

The embodiment of the present application also provides a human bone data processing system. Fig. 3 is a schematic structural diagram of a human bone data processing system according to an embodiment of the present application. As shown in Fig. 3, the system 300 includes a model 310 for extracting human bone features, a data enhancement model 320, and a feature mapping model 330. Wherein:

the model 310 is obtained by training according to the method described in fig. 1;

the data enhancement model 320 is used for enhancing the initial bone data to obtain training bone data;

the feature mapping model 330 is configured to process the bone features output by the model for extracting human bone features, and output features used for calculating loss of the model for extracting human bone features.

With regard to the specific implementation manner of the data enhancement model 320 for performing the enhancement processing on the initial bone data, reference may be made to the description part of the above method embodiment, and the description will not be repeated here.

The feature mapping model 330 mentioned here may correspond to the feature mapping module in the above method embodiment. With respect to the specific implementation of the feature mapping model 330, reference may be made to the above description of the feature mapping module, and a repeated description is not made here.
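The way the three components of the system 300 fit together can be sketched as a simple container. The class name, field names, and the lambda functions used in the demonstration are hypothetical placeholders; real components would be the trained model 310, the data enhancement model 320, and the feature mapping model 330.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np

Array = np.ndarray


@dataclass
class SkeletonProcessingSystem:
    """Hypothetical container mirroring the three components of system 300."""

    model: Callable[[Array], Array]              # model 310
    data_enhancement: Callable[[Array], Array]   # data enhancement model 320
    feature_mapping: Callable[[Array], Array]    # feature mapping model 330

    def features_for_loss(self, initial_bone_data: Array) -> Array:
        """Enhance the data, extract bone features, then map them for the loss."""
        training_data = self.data_enhancement(initial_bone_data)
        bone_features = self.model(training_data)
        return self.feature_mapping(bone_features)


# Placeholder components, used only to show the data flow.
system = SkeletonProcessingSystem(
    model=lambda x: x.reshape(x.shape[0], -1),
    data_enhancement=lambda x: x * 1.0,
    feature_mapping=lambda z: z[:, :4],
)
mapped = system.features_for_loss(np.zeros((2, 3, 5)))
```

The order of the calls reflects the description above: enhancement of the initial bone data, feature extraction by the model, and feature mapping before the loss is calculated.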

An embodiment of the present application further provides a device, including: a processor, a memory, and a system bus; the processor and the memory are connected through the system bus; the memory is used for storing one or more programs, the one or more programs including instructions which, when executed by the processor, cause the processor to perform the method of any one of the above method embodiments.

The present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are executed on a terminal device, the instructions cause the terminal device to perform the method according to any one of the above method embodiments.

An embodiment of the present application further provides a computer program product, which, when running on a terminal device, causes the terminal device to execute the method described in any one of the above method embodiments.

Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice in the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.

It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the attached claims.

The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims

1. A model training method for extracting human skeletal features, the method comprising:

2. The method of claim 1, wherein training a model for extracting human skeletal features from the M sets of skeletal data comprises:

3. The method of claim 2, wherein a loss of the first initial bone data in a first dimension is determined from a first loss and/or a second loss, wherein:

the first loss is determined according to a first similarity of a first bone feature and a second bone feature, a first weight of the first similarity, a second weight of the similarity of the first bone feature and each of 2 x N bone features, and a multi-dimensional fusion similarity corresponding to each second weight, wherein the first bone feature is a bone feature corresponding to the first initial bone data in a first set of bone features corresponding to the first dimension, the second bone feature is a bone feature corresponding to the first initial bone data in a second set of bone features corresponding to the first dimension, and the 2 x N bone features comprise the first set of bone features corresponding to the first dimension and the second set of bone features corresponding to the first dimension;

4. The method of claim 3, wherein the 2 x N bone features include a third bone feature, the second weight of similarity of the first and third bone features corresponds to a multi-dimensional fusion similarity determined by the similarity of the first and third bone features and the similarity of a fourth and fifth bone features for each of (M-1) dimensions, wherein the fourth and first bone features correspond to the same initial bone data and the fifth and third bone features correspond to the same initial bone data.

5. The method of claim 3, wherein the 2 x N bone features include a third bone feature, and wherein a third weight of the similarity of the second and third bone features corresponds to a multi-dimensional fusion similarity determined by the similarity of the second and third bone features and the similarity of a sixth and seventh bone features for each of (M-1) dimensions, wherein the sixth and second bone features correspond to the same training data and the seventh and third bone features correspond to the same training data.

6. The method according to claim 3, wherein the first similarity corresponds to a weight that is greater than a weight corresponding to a similarity of the first bone feature to an eighth bone feature, the first similarity corresponds to a weight that is greater than a weight corresponding to a similarity of the second bone feature to a ninth bone feature, wherein the eighth bone feature is any one of the other (2N-1) of the 2N bone features except the second bone feature, and the ninth bone feature is any one of the other (2N-1) of the 2N bone features except the first bone feature.

7. A model training apparatus for extracting human skeletal features, the apparatus comprising:

8. A human skeletal data processing system, characterized in that the system comprises:

a model for extracting human skeletal features, which is obtained by training using the method of any one of claims 1 to 6, a data enhancement model, and a feature mapping model;

9. A device, characterized in that the device comprises: a processor, a memory, and a system bus; the processor and the memory are connected through the system bus; the memory is used to store one or more programs, the one or more programs comprising instructions which, when executed by the processor, cause the processor to perform the method of any one of claims 1 to 6.

10. A computer-readable storage medium having stored therein instructions which, when run on a terminal device, cause the terminal device to perform the method of any one of claims 1 to 6.