CN114241514B - Model training method and device for extracting human skeleton characteristics - Google Patents

Model training method and device for extracting human skeleton characteristics

Info

Publication number
CN114241514B
Authority
CN
China
Prior art keywords
bone
data
feature
features
similarity
Prior art date
Legal status
Active
Application number
CN202111351423.4A
Other languages
Chinese (zh)
Other versions
CN114241514A (en)
Inventor
何嘉斌
刘廷曦
翁仁亮
Current Assignee
Beijing Aibee Technology Co Ltd
Original Assignee
Beijing Aibee Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Aibee Technology Co Ltd filed Critical Beijing Aibee Technology Co Ltd
Priority to CN202111351423.4A
Publication of CN114241514A
Application granted
Publication of CN114241514B


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/08 - Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a model training method for extracting human skeleton features. The method comprises: obtaining M bone data sets, wherein each of the M bone data sets corresponds to one data dimension, each bone data set comprises N bone data, and the N bone data in a bone data set correspond one-to-one to N initial bone data; and training a model for extracting human skeleton features according to the M bone data sets. When the model is trained, the way the loss of the first initial bone data is calculated is improved: the improved loss calculation takes into account the losses of the first initial bone data corresponding to each of the M data dimensions, so that the calculated loss of the first initial bone data converges more easily and the training efficiency of the model is improved.

Description

Model training method and device for extracting human skeleton characteristics
Technical Field
The application relates to the field of data processing, in particular to a model training method and device for extracting human skeleton characteristics.
Background
Currently, models for extracting human skeletal features can be trained by means of self-supervised training. Self-supervised training refers to training a model without using manually annotated labels of the training samples.
In some scenarios, a model for extracting skeletal features of a human body may be trained using self-supervised training. However, when such a model is trained in a self-supervised manner, the training efficiency is low. A solution to this problem is therefore urgently needed.
Disclosure of Invention
The technical problem to be solved by the application is that training a model for extracting human skeleton features by contrastive learning has low training efficiency; to this end, a model training method and device for extracting human skeleton features are provided.
In a first aspect, an embodiment of the present application provides a model training method for extracting skeletal features of a human body, the method including:
Acquiring M bone data sets, wherein each of the M bone data sets corresponds to one data dimension, each bone data set comprises N bone data, the N bone data in a bone data set correspond one-to-one to N initial bone data, M is an integer greater than 1, and N is an integer greater than or equal to 1;
training a model for extracting human skeleton characteristics according to the M skeleton data sets; wherein:
The loss of the model is determined based on the loss of the N initial bone data, the N initial bone data comprise first initial bone data, and the loss of the first initial bone data is determined according to the loss of the first initial bone data corresponding to the M data dimensions.
Optionally, training a model for extracting human skeleton features according to the M bone data sets includes:
Processing the N bone data in each of the M bone data sets by adopting a first data enhancement mode to obtain M first training data sets, wherein one bone data set corresponds to one first training data set, and one first training data set comprises N groups of training data; processing the N bone data in each of the M bone data sets by adopting a second data enhancement mode to obtain M second training data sets, wherein one bone data set corresponds to one second training data set, and one second training data set comprises N groups of training data;
Training a model for extracting human skeletal features based on the M first training data sets and the M second training data sets; wherein:
The M data dimensions include a first dimension; a first bone feature set corresponding to the first dimension is obtained by processing, through the model, the first training data set corresponding to the first dimension; a second bone feature set corresponding to the first dimension is obtained by processing, through the model, the second training data set corresponding to the first dimension; the first bone feature set and the second bone feature set each comprise N features, and the N features correspond one-to-one to the N initial bone data.
Optionally, the loss of the first initial bone data in the first dimension is determined according to the first loss and/or the second loss, wherein:
the first loss is determined according to a first similarity of a first bone feature and a second bone feature, a first weight of the first similarity, a second weight of the similarity between the first bone feature and each of the 2*N bone features, and a multi-dimensional fusion similarity corresponding to each second weight; the first bone feature is the bone feature corresponding to the first initial bone data in the first bone feature set corresponding to the first dimension, the second bone feature is the bone feature corresponding to the first initial bone data in the second bone feature set corresponding to the first dimension, and the 2*N bone features comprise: the first bone feature set corresponding to the first dimension and the second bone feature set corresponding to the first dimension;
The second loss is determined according to the first similarity, the first weight, a third weight of the similarity between the second bone feature and each of the 2*N bone features, and a multi-dimensional fusion similarity corresponding to each third weight.
Optionally, the 2*N bone features include a third bone feature, the multi-dimensional fusion similarity corresponding to a second weight of the similarity of the first bone feature and the third bone feature is determined by the similarity of the first bone feature and the third bone feature, and the similarity of a fourth bone feature and a fifth bone feature corresponding to each of the (M-1) dimensions, wherein the fourth bone feature and the first bone feature correspond to the same initial bone data, and the fifth bone feature and the third bone feature correspond to the same initial bone data.
Optionally, the 2*N bone features include a third bone feature, the multi-dimensional fusion similarity corresponding to a third weight of the similarity of the second bone feature and the third bone feature is determined by the similarity of the second bone feature and the third bone feature, and the similarity of a sixth bone feature and a seventh bone feature corresponding to each of the (M-1) dimensions, wherein the sixth bone feature and the second bone feature correspond to the same training data, and the seventh bone feature and the third bone feature correspond to the same training data.
Optionally, the weight corresponding to the first similarity is greater than the weight corresponding to the similarity of the first bone feature and an eighth bone feature, the weight corresponding to the first similarity is greater than the weight corresponding to the similarity of the second bone feature and a ninth bone feature, wherein the eighth bone feature is any one of the other (2*N-1) bone features of the 2*N bone features other than the second bone feature, and the ninth bone feature is any one of the other (2*N-1) bone features of the 2*N bone features other than the first bone feature.
Optionally, the method further comprises:
Calculating the similarity between any two bone features in 2*N bone features corresponding to the first dimension based on 2*N bone features corresponding to the first dimension to obtain a first similarity matrix of 2N×2N, wherein elements corresponding to the ith row and the jth column of the first similarity matrix are used for indicating the similarity between the ith bone feature and the jth bone feature in 2*N bone features;
Modifying the value of the diagonal element of the first similarity matrix to a preset value to obtain a second similarity matrix;
Processing the second similarity matrix according to an optimal transport assignment algorithm to obtain a weight matrix, wherein elements corresponding to the ith row and the jth column of the weight matrix are used for indicating the weight of the similarity of the ith bone feature and the jth bone feature in the 2*N bone features.
Optionally, the first similarity between the first bone feature and the second bone feature is determined by:
processing the first bone feature according to a feature mapping module to obtain a first feature;
Processing the second bone feature according to the feature mapping module to obtain a second feature;
and determining cosine similarity of the first feature and the second feature as the first similarity.
Optionally, in the training process of the model, parameters of the feature mapping module are adjusted according to loss of the model.
Optionally, the M data dimensions include at least two of:
A joint point dimension, a joint point motion dimension, a bone dimension, and a bone motion dimension.
In a second aspect, an embodiment of the present application provides a model training apparatus for extracting skeletal features of a human body, the apparatus comprising:
The bone data acquisition unit is used for acquiring M bone data sets, wherein each of the M bone data sets corresponds to one data dimension, each bone data set comprises N bone data, the N bone data in a bone data set correspond one-to-one to N initial bone data, M is an integer greater than 1, and N is an integer greater than or equal to 1;
the training unit is used for training a model for extracting human skeleton characteristics according to the M skeleton data sets; wherein:
The loss of the model is determined based on the loss of the N initial bone data, the N initial bone data comprise first initial bone data, and the loss of the first initial bone data is determined according to the loss of the first initial bone data corresponding to the M data dimensions.
Optionally, the training unit is configured to:
Processing the N bone data in each of the M bone data sets by adopting a first data enhancement mode to obtain M first training data sets, wherein one bone data set corresponds to one first training data set, and one first training data set comprises N groups of training data; processing the N bone data in each of the M bone data sets by adopting a second data enhancement mode to obtain M second training data sets, wherein one bone data set corresponds to one second training data set, and one second training data set comprises N groups of training data;
Training a model for extracting human skeletal features based on the M first training data sets and the M second training data sets; wherein:
The M data dimensions include a first dimension; a first bone feature set corresponding to the first dimension is obtained by processing, through the model, the first training data set corresponding to the first dimension; a second bone feature set corresponding to the first dimension is obtained by processing, through the model, the second training data set corresponding to the first dimension; the first bone feature set and the second bone feature set each comprise N features, and the N features correspond one-to-one to the N initial bone data.
Optionally, the loss of the first initial bone data in the first dimension is determined according to the first loss and/or the second loss, wherein:
the first loss is determined according to a first similarity of a first bone feature and a second bone feature, a first weight of the first similarity, a second weight of the similarity between the first bone feature and each of the 2*N bone features, and a multi-dimensional fusion similarity corresponding to each second weight; the first bone feature is the bone feature corresponding to the first initial bone data in the first bone feature set corresponding to the first dimension, the second bone feature is the bone feature corresponding to the first initial bone data in the second bone feature set corresponding to the first dimension, and the 2*N bone features comprise: the first bone feature set corresponding to the first dimension and the second bone feature set corresponding to the first dimension;
The second loss is determined according to the first similarity, the first weight, a third weight of the similarity between the second bone feature and each of the 2*N bone features, and a multi-dimensional fusion similarity corresponding to each third weight.
Optionally, the 2*N bone features include a third bone feature, the multi-dimensional fusion similarity corresponding to a second weight of the similarity of the first bone feature and the third bone feature is determined by the similarity of the first bone feature and the third bone feature, and the similarity of a fourth bone feature and a fifth bone feature corresponding to each of the (M-1) dimensions, wherein the fourth bone feature and the first bone feature correspond to the same initial bone data, and the fifth bone feature and the third bone feature correspond to the same initial bone data.
Optionally, the 2*N bone features include a third bone feature, the multi-dimensional fusion similarity corresponding to a third weight of the similarity of the second bone feature and the third bone feature is determined by the similarity of the second bone feature and the third bone feature, and the similarity of a sixth bone feature and a seventh bone feature corresponding to each of the (M-1) dimensions, wherein the sixth bone feature and the second bone feature correspond to the same training data, and the seventh bone feature and the third bone feature correspond to the same training data.
Optionally, the weight corresponding to the first similarity is greater than the weight corresponding to the similarity of the first bone feature and an eighth bone feature, the weight corresponding to the first similarity is greater than the weight corresponding to the similarity of the second bone feature and a ninth bone feature, wherein the eighth bone feature is any one of the other (2*N-1) bone features of the 2*N bone features other than the second bone feature, and the ninth bone feature is any one of the other (2*N-1) bone features of the 2*N bone features other than the first bone feature.
Optionally, the apparatus further includes:
A calculating unit, configured to calculate, based on the 2*N bone features corresponding to the first dimension, the similarity between any two bone features in the 2*N bone features corresponding to the first dimension, to obtain a first similarity matrix of 2N×2N, where the element corresponding to the ith row and the jth column of the first similarity matrix is used to indicate the similarity between the ith bone feature and the jth bone feature in the 2*N bone features;
the modification unit is used for modifying the values of the diagonal elements of the first similarity matrix into preset values to obtain a second similarity matrix;
The processing unit is used for processing the second similarity matrix according to an optimal transport assignment algorithm to obtain a weight matrix, and the element corresponding to the ith row and the jth column of the weight matrix is used for indicating the weight of the similarity of the ith bone feature and the jth bone feature in the 2*N bone features.
Optionally, the first similarity between the first bone feature and the second bone feature is determined by:
processing the first bone feature according to a feature mapping module to obtain a first feature;
Processing the second bone feature according to the feature mapping module to obtain a second feature;
and determining cosine similarity of the first feature and the second feature as the first similarity.
Optionally, in the training process of the model, parameters of the feature mapping module are adjusted according to loss of the model.
Optionally, the M data dimensions include at least two of:
A joint point dimension, a joint point motion dimension, a bone dimension, and a bone motion dimension.
In a third aspect, embodiments of the present application provide a human skeletal data processing system, the system comprising:
a model for extracting human skeletal features trained using the method of any one of the implementations of the first aspect above, a data enhancement model, and a feature mapping model;
The data enhancement model is used for enhancing the initial bone data to obtain training bone data;
The feature mapping model is used for processing the skeleton features output by the model for extracting the skeleton features of the human body and outputting the features for calculating the loss of the model for extracting the skeleton features of the human body.
In a fourth aspect, an embodiment of the present application provides an apparatus, including: a processor, a memory, and a system bus; the processor and the memory are connected through the system bus; the memory is for storing one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the method of any of the first aspects above.
In a fifth aspect, an embodiment of the present application provides a computer readable storage medium having instructions stored therein, which when run on a terminal device, cause the terminal device to perform the method of any one of the first aspects above.
In a sixth aspect, embodiments of the present application provide a computer program product which, when run on a terminal device, causes the terminal device to perform the method of any of the first aspects above.
Compared with the prior art, the embodiment of the application has the following advantages:
The embodiment of the application provides a model training method for extracting human skeleton features, which comprises: obtaining M bone data sets, wherein each of the M bone data sets corresponds to one data dimension, each bone data set comprises N bone data, and the N bone data in a bone data set correspond one-to-one to N initial bone data. It can be understood that the N bone data in a bone data set are obtained from the N initial bone data, one initial bone datum corresponding to one bone datum in the bone data set. A model for extracting human skeleton features is then trained according to the M bone data sets. The loss of the model is determined based on the losses of the N initial bone data, and if any one of the N initial bone data is referred to as the "first initial bone data", the loss of the first initial bone data is determined according to the losses of the first initial bone data corresponding to each of the M data dimensions. In this way, even if there is second initial bone data that is semantically close to the first initial bone data, the semantics of the bone data corresponding to the second initial bone data in the M dimensions are not necessarily all close to the semantics of the bone data corresponding to the first initial bone data in the M dimensions. Therefore, determining the loss of the first initial bone data according to the losses of the first initial bone data corresponding to the M data dimensions reduces the influence of the second initial bone data on the loss of the first initial bone data, so that the loss of the first initial bone data converges more easily; correspondingly, the loss of the model converges more easily, and the training efficiency of the model is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required for the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained from these drawings by those skilled in the art without inventive effort.
FIG. 1 is a schematic flow chart of a model training method for extracting human skeleton characteristics according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a model training device for extracting skeletal features of a human body according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a human skeleton data processing system according to an embodiment of the present application.
Detailed Description
In order to make the present application better understood by those skilled in the art, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
Contrastive learning is a form of self-supervised training. In contrastive learning, one sample is processed with two different data enhancement modes to obtain two enhanced samples, and the model is trained using the two enhanced samples. It will be appreciated that, since the two enhanced samples are obtained by data enhancement from the same sample, their semantics are consistent although their content differs. Therefore, if the similarity of the features output by the model for the two enhanced samples is sufficiently high, and their similarity to the features of other samples (or of samples obtained by enhancing other samples with different enhancement modes) is sufficiently low, the model has the ability to extract features of samples.
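The following Python sketch illustrates the contrastive-learning setup described above; the encoder, the two augmentation functions and all names are placeholders rather than the patent's implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_views(encoder, augment_a, augment_b, samples):
    """Illustrative contrastive-learning setup: each sample is enhanced twice,
    and the encoder should give the two views of a sample similar features
    while keeping them dissimilar from the features of other samples."""
    view_a = augment_a(samples)                      # first data enhancement mode
    view_b = augment_b(samples)                      # second data enhancement mode
    feats = encoder(torch.cat([view_a, view_b], 0))  # 2*N features
    feats = F.normalize(feats, dim=1)
    return feats @ feats.t()                         # pairwise cosine similarities
```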
At present, a model for extracting human skeleton features can be trained by contrastive learning. When such a model is trained in this way, N initial skeleton data can be enhanced with two data enhancement modes to obtain 2*N training data, and the model is trained with the 2*N training data. It can be understood that 2*N skeleton features are obtained after the 2*N training data pass through the model.
The initial bone data referred to herein may be joint point data, i.e., data of a plurality of (for example, 25) joint points of the human body.
For a certain initial bone datum, for example the first initial bone data, the corresponding loss can be calculated by a loss of the following form, formula (1):

L_ij = -log( exp(S_ij) / Σ_{k=1, k≠i}^{2N} exp(S_ik) )        formula (1)

In formula (1):
L_ij is the loss corresponding to the first initial bone data;
i and j are the indices, among all 2*N bone features, of the bone features corresponding to the two training data enhanced from the first initial bone data;
S_ij is the similarity of the bone features corresponding to the two enhanced training data of the first initial bone data;
S_ik is the similarity between the i-th bone feature and the k-th bone feature among the 2*N obtained bone features.
It will be appreciated that, in formula (1), if there is other initial bone data, such as second initial bone data, whose semantics are close to those of the first initial bone data, then among the 2*N bone features the similarities between the two bone features corresponding to the second initial bone data and the two bone features corresponding to the first initial bone data are high. This makes the denominator of formula (1) larger, so the value of L_ij is larger and cannot approach 0; as a result the model converges slowly, that is, the training efficiency of the model is low.
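As a concrete illustration of a per-sample loss of the form of formula (1), the snippet below computes it from a precomputed similarity matrix; the function name and tensor layout are illustrative assumptions.

```python
import torch

def single_dimension_loss(sim, i, j):
    """Loss of one initial bone datum in a single data dimension (formula (1)).

    sim: (2N, 2N) tensor of pairwise similarities between the 2*N bone features.
    i, j: indices of the two features obtained from the same initial bone datum
          via the two data enhancement modes.
    """
    numerator = torch.exp(sim[i, j])
    mask = torch.ones_like(sim[i], dtype=torch.bool)
    mask[i] = False                      # exclude the feature's similarity to itself
    denominator = torch.exp(sim[i][mask]).sum()
    return -torch.log(numerator / denominator)
```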
An example is now described. Let N=3, i.e., there are 3 initial bone data in total, denoted data A, data B and data C. After data enhancement with the first data enhancement mode, data A1, B1 and C1 are obtained; after data enhancement with the second data enhancement mode, data A2, B2 and C2 are obtained. Data A1, B1, C1, A2, B2 and C2 are input into the model to obtain 6 bone features, Z1, Z2, Z3, Z4, Z5 and Z6, respectively. It can be understood that bone features Z1 and Z4 are the features corresponding to data A, bone features Z2 and Z5 are the features corresponding to data B, and bone features Z3 and Z6 are the features corresponding to data C. Then, according to formula (1) above, the loss corresponding to data A may be represented by the following formula (2):

L_14 = -log( exp(S_14) / (exp(S_12) + exp(S_13) + exp(S_14) + exp(S_15) + exp(S_16)) )        formula (2)
In formula (2):
L_14 is the loss corresponding to data A;
S_14 is the similarity of Z1 and Z4;
S_12 is the similarity of Z1 and Z2;
S_13 is the similarity of Z1 and Z3;
S_15 is the similarity of Z1 and Z5;
S_16 is the similarity of Z1 and Z6.
It will be appreciated that if the semantic information of data A and data B is similar, the similarities among the features Z1, Z4, Z2 and Z5 are relatively high; therefore S_14, S_12 and S_15 in the denominator of formula (2) are relatively large. Even though S_14 itself is relatively high, the denominator is also large, so the fraction in formula (2) is small and L_14 is large, making the model difficult to converge.
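The following sketch uses invented similarity values for the N=3 example above to show numerically how a semantically similar sample inflates the denominator of formula (2); all numbers are purely illustrative.

```python
import math

# Hypothetical similarities for the N = 3 example above (Z1..Z6),
# where data B is semantically similar to data A.
S = {"14": 0.9, "12": 0.8, "13": 0.1, "15": 0.8, "16": 0.1}
denom = sum(math.exp(v) for v in S.values())
L_14 = -math.log(math.exp(S["14"]) / denom)
print(round(L_14, 2))   # roughly 1.31: stays large even though S_14 itself is high

# If data B were dissimilar (S_12 = S_15 = 0.1), the same positive similarity
# yields a noticeably smaller loss:
S_dissimilar = {"14": 0.9, "12": 0.1, "13": 0.1, "15": 0.1, "16": 0.1}
denom2 = sum(math.exp(v) for v in S_dissimilar.values())
print(round(-math.log(math.exp(0.9) / denom2), 2))   # roughly 1.03
```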
In order to solve the above problems, the embodiment of the application provides a model training method and device for extracting human skeleton characteristics.
Various non-limiting embodiments of the present application are described in detail below with reference to the attached drawing figures.
Referring to fig. 1, the flow chart of a model training method for extracting human skeleton features according to an embodiment of the present application is shown. The method provided by the embodiment of the application can be executed by the terminal equipment or the server, and the embodiment of the application is not particularly limited.
In this embodiment, the method shown in fig. 1 may include, for example, the following steps: S101-S102.
S101: obtaining M bone data sets, wherein each bone data set in the M bone data sets corresponds to a data dimension, each bone data set comprises N bone data, N bone data in the bone data sets corresponds to N initial bone data one by one, M is an integer greater than 1, and N is an integer greater than or equal to 1.
With respect to the N initial bone data, it should be noted that, in one example, the N initial bone data may be, for example, data of a joint dimension.
Regarding the M data dimensions, it should be noted that, in an embodiment of the present application, the M data dimensions may include at least two of a joint point dimension, a joint point motion dimension, a bone dimension, and a bone motion dimension.
Wherein:
The data of the joint point dimension may be, for example, coordinate sequences of the joint points. Taking one coordinate sequence as an example: for a skeleton sequence of length T frames, where each frame contains P persons and each person has V joint points, each joint point containing three-dimensional coordinate position information, the coordinate sequence of the joint points can be expressed as T×P×V×3.
The data of the joint point motion dimension may be, for example, motion coordinate sequences of the joint points. The motion coordinate sequence of a joint point is obtained by subtracting, in the corresponding T-frame skeleton sequence, the joint point coordinates of the previous frame from the joint point coordinates of the current frame; because the first frame has no preceding frame to subtract from, the first frame is discarded and then complemented by interpolation.
The data of the bone dimension may be, for example, coordinate sequences of the bones. The coordinate sequence of a bone is the difference of the coordinates of adjacent joint points within the same frame of the skeleton sequence.
The data of the bone motion dimension may be, for example, motion coordinate sequences of the bones. The motion coordinate sequence of a bone is obtained by subtracting, in the corresponding T-frame bone sequence, the bone coordinates of the previous frame from the bone coordinates of the current frame; because the first frame has no preceding frame to subtract from, the first frame is discarded and then complemented by interpolation.
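A minimal sketch of deriving the four data dimensions from joint point coordinates, assuming a single person represented as a (T, V, 3) array and a hypothetical parents list describing which joint points are adjacent; the simple first-frame padding stands in for the interpolation mentioned above.

```python
import numpy as np

def build_data_dimensions(joints, parents):
    """Derive the four data dimensions described above from joint coordinates.

    joints : (T, V, 3) array, T frames, V joint points, 3-D coordinates (one person).
    parents: length-V list; parents[v] is the adjacent (parent) joint point of
             joint point v (the root may map to itself, giving a zero bone).
    """
    # Joint point motion: frame-to-frame difference; the first frame has no
    # predecessor, so it is padded by repeating the first computed difference
    # (a simple stand-in for the interpolation mentioned in the text).
    joint_motion = np.diff(joints, axis=0)
    joint_motion = np.concatenate([joint_motion[:1], joint_motion], axis=0)

    # Bone: difference between adjacent joint points within the same frame.
    bone = joints - joints[:, parents, :]

    # Bone motion: frame-to-frame difference of the bone coordinates.
    bone_motion = np.diff(bone, axis=0)
    bone_motion = np.concatenate([bone_motion[:1], bone_motion], axis=0)

    return {"joint": joints, "joint_motion": joint_motion,
            "bone": bone, "bone_motion": bone_motion}
```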
In the embodiment of the application, when the acquisition of the M bone data sets is specifically implemented, for example, the N initial bone data may be acquired first, and then the bone data of the N initial bone data in each of the M dimensions may be obtained from the N initial bone data, so as to obtain the M bone data sets. It is understood that each bone data set includes N bone data, and the N bone data in a bone data set correspond one-to-one to the N initial bone data. For convenience of description, in the following embodiments, any one of the N initial bone data is referred to as the "first initial bone data".
In one example, assume that the M data dimensions are: the joint point dimension, the joint point motion dimension, and the bone dimension. Then the first initial bone data may be acquired first and then processed to obtain the data of the joint point dimension, the data of the joint point motion dimension, and the data of the bone dimension corresponding to the first initial bone data. When the first initial bone data is itself joint point data, the acquired set of N initial bone data may be directly taken as the bone data set corresponding to the joint point dimension.
In the embodiment of the application, "joint point data" means data of the joint point dimension, "joint point motion data" means data of the joint point motion dimension, "bone data" means data of the bone dimension, and "bone motion data" means data of the bone motion dimension.
S102: training a model for extracting human skeleton characteristics according to the M skeleton data sets; wherein: the loss of the model is determined based on the loss of the N initial bone data, the N initial bone data comprise first initial bone data, and the loss of the first initial bone data is determined according to the loss of the first initial bone data corresponding to the M data dimensions.
After the M bone data sets are acquired, a model for extracting human bone features can be trained in a general self-supervised training manner. In one example, the model may be a convolutional neural network (CNN) that includes multiple convolutional layers.
When training the model, the loss of the model may be calculated based on the loss of the N initial skeletal data, and, for a first initial skeletal data, the loss of the first initial skeletal data may be calculated from the losses of the first initial skeletal data corresponding to the M data dimensions, respectively. For example, the loss of the first initial bone data is the sum of the losses of the first initial bone data corresponding to the M data dimensions.
It will be appreciated that, for the first initial bone data, even if the semantics of second initial bone data are close to those of the first initial bone data, the semantics of the bone data corresponding to the second initial bone data in the aforementioned M dimensions are not necessarily all close to those of the bone data corresponding to the first initial bone data in the M dimensions. Therefore, determining the loss of the first initial bone data according to the losses of the first initial bone data corresponding to the M data dimensions reduces the influence of the second initial bone data on the loss of the first initial bone data, so that the loss of the first initial bone data converges more easily; correspondingly, the loss of the model converges more easily, and the training efficiency of the model is improved.
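A minimal sketch of this aggregation, assuming the per-dimension, per-sample losses have already been computed; summing over dimensions follows the example above, while averaging over the N initial bone data is an illustrative choice.

```python
def model_loss(per_dimension_losses):
    """per_dimension_losses[m][n]: loss of the n-th initial bone datum in the
    m-th data dimension.  The loss of one initial bone datum is taken here as
    the sum over the M dimensions, and the model loss as the mean over the
    N initial bone data."""
    M = len(per_dimension_losses)
    N = len(per_dimension_losses[0])
    per_sample = [sum(per_dimension_losses[m][n] for m in range(M)) for n in range(N)]
    return sum(per_sample) / N
```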
In an example of the embodiment of the present application, S102 may be implemented, for example, by training the model for extracting human skeletal features according to the M bone data sets in a contrastive learning manner. For this case, S102 may be implemented by the following steps S1-S2.
S1: processing N bones data in the M bones data sets by adopting a first data enhancement mode to obtain M first training data sets, wherein one bones data set corresponds to one first training data set, and one first training data set comprises N groups of training data; and processing N bones data in the M bones data sets by adopting a second data enhancement mode to obtain M second training data sets, wherein one bones data set corresponds to one second training data set, and one second training data set comprises N groups of training data.
The embodiment of the application does not specifically limit the first data enhancement mode and the second data enhancement mode. In one example, the first data enhancement mode is one of the following, the second data enhancement mode is also one of the following, and the two data enhancement modes are different (a sketch of these enhancements is given after the list):
1) Randomly slightly rotating all the joint points of each frame in the same direction;
2) Randomly slightly tilting all the joint points of each frame in the same direction;
3) Adding Gaussian noise to all joint point coordinates in all frames;
4) Concealing some of the joint points in some of the frames.
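The sketch below gives illustrative implementations of the four enhancement modes for a (T, V, 3) joint point sequence; the parameter values are arbitrary, a single random rotation angle about the vertical axis is used for the whole sequence, and a shear matrix is used as one possible reading of "tilting".

```python
import numpy as np

def random_rotate(seq, max_deg=15):
    """Rotate all joint points of every frame by the same small random angle about Z."""
    theta = np.deg2rad(np.random.uniform(-max_deg, max_deg))
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return seq @ rot.T

def random_shear(seq, strength=0.1):
    """Slightly tilt (shear) all joint points in the same direction."""
    shear = np.eye(3) + np.random.uniform(-strength, strength, (3, 3)) * (1 - np.eye(3))
    return seq @ shear.T

def gaussian_noise(seq, sigma=0.01):
    """Add Gaussian noise to all joint point coordinates in all frames."""
    return seq + np.random.normal(0.0, sigma, seq.shape)

def random_joint_mask(seq, frame_ratio=0.3, joint_ratio=0.3):
    """Zero out (conceal) a random subset of joint points in a random subset of frames."""
    out = seq.copy()
    T, V = seq.shape[:2]
    frames = np.random.choice(T, int(T * frame_ratio), replace=False)
    joints = np.random.choice(V, int(V * joint_ratio), replace=False)
    out[np.ix_(frames, joints)] = 0.0
    return out
```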
With respect to S1, an example is now described. Assume N=4 and M=3, and that the M bone data sets are the bone data set corresponding to the joint point dimension, the bone data set corresponding to the joint point motion dimension, and the bone data set corresponding to the bone dimension, each including 4 bone data. Then 3 first training data sets can be obtained with the first data enhancement mode: a first training data set corresponding to the joint point dimension, a first training data set corresponding to the joint point motion dimension, and a first training data set corresponding to the bone dimension; each of them includes 4 groups of training data, and the 4 groups of training data correspond one-to-one to the 4 initial bone data. Similarly, 3 second training data sets can be obtained with the second data enhancement mode: a second training data set corresponding to the joint point dimension, a second training data set corresponding to the joint point motion dimension, and a second training data set corresponding to the bone dimension; each of them likewise includes 4 groups of training data that correspond one-to-one to the 4 initial bone data.
S2: based on the M first training data sets and the M second training data sets, a model for extracting human skeletal features is trained.
In the embodiment of the present application, the model processes training data of the M data dimensions similarly, and a first dimension of the M data dimensions is described as an example. It should be noted that the first dimension is any one dimension of the M data dimensions.
In an embodiment of the present application, the M data dimensions include a first dimension. After the first training data set corresponding to the first dimension is input into the model, a first bone feature set corresponding to the first dimension can be obtained, and after the second training data set corresponding to the first dimension is input into the model, a second bone feature set corresponding to the first dimension can be obtained, wherein the first bone feature set and the second bone feature set respectively comprise N bone features. It will be appreciated that 2*N bone features may be obtained after inputting the first training data set corresponding to the first dimension and the second training data set corresponding to the first dimension into the model.
In the embodiment of the present application, the first bone feature set includes bone features corresponding to N sets of training data in the first training data set, and the N sets of training data in the first training data set are in one-to-one correspondence with the N initial bone data, so that the N bone features in the first bone feature set are in one-to-one correspondence with the N initial bone data. Similarly, the second bone feature set includes bone features corresponding to N sets of training data in the second training data set, and the N sets of training data in the second training data set are in one-to-one correspondence with the N initial bone data, so that the N bone features in the second bone feature set are in one-to-one correspondence with the N initial bone data.
It will be appreciated that, for a first initial bone data of the N initial bone data, there are features corresponding thereto in a first set of bone features corresponding to the first dimension, and there are features corresponding thereto in a second set of bone features corresponding to the first dimension. For convenience of description, the feature of the first bone feature set corresponding to the first initial bone data is referred to as a "first bone feature", and the bone feature of the second bone feature set corresponding to the first initial bone data is referred to as a "second bone feature".
As described above, the loss of the first initial bone data is determined according to the loss of the first initial bone data corresponding to the M data dimensions. Next, taking the first dimension as an example, the loss of the first initial bone data in the first dimension is described.
In one example, the loss of the first initial bone data in the first dimension may be a first loss.
The first loss is determined according to a first similarity of a first bone feature and a second bone feature, a first weight of the first similarity, a second weight of the similarity between the first bone feature and each of the 2*N bone features, and a multi-dimensional fusion similarity corresponding to each second weight. The multi-dimensional fusion similarity can be determined according to the bone features of the first initial bone data corresponding to each of the M dimensions. In this way, the first loss considers not only the bone features of the first initial bone data in the first dimension but also the bone features of the first initial bone data in the other dimensions, so that the bone features of the dimensions compensate one another and the training efficiency of the model is improved.
In one example, when any one of the 2*N bone features is referred to as a third bone feature, the multi-dimensional fusion similarity corresponding to the second weight of the similarity between the first bone feature and the third bone feature may be determined according to the similarity between the first bone feature and the third bone feature, and the similarity between a fourth bone feature and a fifth bone feature corresponding to each of the (M-1) dimensions, where the fourth bone feature and the first bone feature correspond to the same initial bone data, and the fifth bone feature and the third bone feature correspond to the same initial bone data.
With respect to a fourth bone feature, it is noted that the fourth bone feature is a feature corresponding to the first initial bone data. For example, the first bone feature is a bone feature corresponding to a joint point dimension corresponding to first initial bone data, and the fourth bone feature is a bone feature corresponding to a joint point motion dimension corresponding to the first initial bone data.
With respect to the fifth bone feature, it should be noted that the fifth bone feature and the third bone feature correspond to the same initial bone data, for example, the third bone feature is a bone feature corresponding to a joint point dimension corresponding to the second initial bone data, and the fifth bone feature is a bone feature corresponding to a joint point motion dimension corresponding to the second initial bone data.
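The exact way the per-dimension similarities are combined into the multi-dimensional fusion similarity is not spelled out in this text; the sketch below simply averages them as one illustrative possibility.

```python
def fused_similarity(per_dimension_sims):
    """Multi-dimensional fusion similarity for one pair of bone features.

    per_dimension_sims: list of length M with the similarity of the
    corresponding feature pair in each data dimension (same initial bone data,
    same enhancement mode, different dimension).  The text only states that
    the fusion is determined by these M similarities; a simple average is
    used here as an illustrative, assumed choice.
    """
    return sum(per_dimension_sims) / len(per_dimension_sims)
```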
In one example, assuming that M=3, the first loss can be calculated by the following formula (3):
In formula (3):
L_j1 is the first loss corresponding to the first initial bone data;
j1 and j2 are the indices, among all 2*N bone features corresponding to the first dimension, of the bone features corresponding to the two training data enhanced from the first initial bone data, where j1 is the index of the first bone feature and j2 is the index of the second bone feature;
S_j1j2 is the similarity of the bone features corresponding to the two enhanced training data of the first initial bone data, that is, the similarity of the first bone feature and the second bone feature;
W_j1j2 is the weight of S_j1j2;
jk is the index of the third bone feature;
W_j1jk is the second weight of the similarity of the first bone feature and the third bone feature, and the corresponding multi-dimensional fusion similarity is determined from S_j1jk, S_m1mk and S_b1bk below;
m1 and b1 are the indices, among the 2*N bone features corresponding to each of the other two dimensions, of the bone features derived from the training data of the first initial bone data obtained with the first data enhancement mode;
mk and bk are the indices, among the 2*N bone features corresponding to each of the other two dimensions, of the bone features derived from the training data enhanced from the initial bone data corresponding to the third bone feature;
S_j1jk is the similarity of the first bone feature and the third bone feature;
S_m1mk is the similarity of the fourth bone feature and the fifth bone feature corresponding to one of the other dimensions (e.g., the joint point dimension);
S_b1bk is the similarity of the fourth bone feature and the fifth bone feature corresponding to the other dimension (e.g., the bone dimension).
In yet another example, the loss of the first initial bone data in the first dimension may be a second loss.
The second loss is determined according to the first similarity of the first bone feature and the second bone feature, the first weight of the first similarity, a third weight of the similarity between the second bone feature and each of the 2*N bone features, and a multi-dimensional fusion similarity corresponding to each third weight. The multi-dimensional fusion similarity corresponding to a third weight can be determined according to the bone features of the first initial bone data corresponding to each of the M dimensions. In this way, the second loss considers not only the bone features of the first initial bone data in the first dimension but also the bone features of the first initial bone data in the other dimensions, so that the features of the dimensions compensate one another and the training efficiency of the model is improved.
In one example, if any one of the 2*N bone features is referred to as the third bone feature, the multi-dimensional fusion similarity corresponding to the third weight of the similarity between the second bone feature and the third bone feature may be determined according to the similarity between the second bone feature and the third bone feature, and the similarity between a sixth bone feature and a seventh bone feature corresponding to each of the other (M-1) dimensions, where the sixth bone feature and the second bone feature correspond to the same initial bone data, and the seventh bone feature and the third bone feature correspond to the same initial bone data.
With respect to a sixth bone feature, it is noted that the sixth bone feature is a bone feature corresponding to the first initial bone data. For example, the second bone feature is a bone feature corresponding to a joint point dimension corresponding to first initial bone data, and the sixth bone feature is a bone feature corresponding to a joint point motion dimension corresponding to the first initial bone data.
With respect to the seventh bone feature, it should be noted that the seventh bone feature and the third bone feature correspond to the same initial bone data; for example, the third bone feature is the bone feature of the joint point dimension corresponding to the second initial bone data, and the seventh bone feature is the bone feature of the joint point motion dimension corresponding to the second initial bone data.
In one example, assuming that M=3, the second loss can be calculated by the following formula (4):
In formula (4):
L_j2 is the second loss corresponding to the first initial bone data;
j1 and j2 are the indices, among all 2*N bone features corresponding to the first dimension, of the bone features corresponding to the two training data enhanced from the first initial bone data, where j1 is the index of the first bone feature and j2 is the index of the second bone feature;
S_j2j1 is the similarity of the bone features corresponding to the two enhanced training data of the first initial bone data, that is, the similarity of the second bone feature and the first bone feature;
W_j2j1 is the weight of S_j2j1;
jk is the index of the third bone feature;
W_j2jk is the third weight of the similarity of the second bone feature and the third bone feature, and the corresponding multi-dimensional fusion similarity is determined from S_j2jk, S_m2mk and S_b2bk below;
m2 and b2 are the indices, among the 2*N bone features corresponding to each of the other two dimensions, of the bone features derived from the training data of the first initial bone data obtained with the second data enhancement mode;
mk and bk are the indices, among the 2*N bone features corresponding to each of the other two dimensions, of the bone features derived from the training data enhanced from the initial bone data corresponding to the third bone feature;
S_j2jk is the similarity of the second bone feature and the third bone feature;
S_m2mk is the similarity of the sixth bone feature and the seventh bone feature corresponding to one of the other dimensions (e.g., the joint point dimension);
S_b2bk is the similarity of the sixth bone feature and the seventh bone feature corresponding to the other dimension (e.g., the bone dimension).
In yet another example, the loss of the first initial bone data in the first dimension may be determined from the first loss and the second loss, e.g., the loss of the first initial bone data in the first dimension may be derived from a weighted sum of the first loss and the second loss. In one example, the loss of the first initial bone data in the first dimension may be calculated by the following equation (5).
L_j = a*L_j1 + b*L_j2        formula (5)
In formula (5):
L_j is the loss of the first initial bone data in the first dimension;
L_j1 is the first loss;
a is the weight of the first loss, in one example a=0.5;
L_j2 is the second loss;
b is the weight of the second loss, in one example b=0.5.
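A direct transcription of formula (5) as code, with the example weights a=b=0.5:

```python
def dimension_loss(L_j1, L_j2, a=0.5, b=0.5):
    """Loss of the first initial bone data in the first dimension (formula (5)):
    a weighted sum of the first loss and the second loss."""
    return a * L_j1 + b * L_j2
```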
In an example of the embodiment of the present application, the first weight corresponding to the first similarity is greater than the weight corresponding to the similarity between the first bone feature and an eighth bone feature, and the weight corresponding to the first similarity is greater than the weight corresponding to the similarity between the second bone feature and a ninth bone feature, where the eighth bone feature is any one of the (2*N-1) bone features, among the 2*N bone features, other than the second bone feature, and the ninth bone feature is any one of the (2*N-1) bone features, among the 2*N bone features, other than the first bone feature.
Illustrating:
For the first dimension, assume N=3, i.e., there are 3 bone data in total: data A, data B and data C. After data enhancement with the first data enhancement mode, data A1, B1 and C1 are obtained, i.e., the first training data set corresponding to the first dimension comprises data A1, B1 and C1. After data enhancement with the second data enhancement mode, data A2, B2 and C2 are obtained, i.e., the second training data set corresponding to the first dimension comprises data A2, B2 and C2. Data A1, B1, C1, A2, B2 and C2 are input into the model to obtain 6 features, Z1, Z2, Z3, Z4, Z5 and Z6, respectively, where the first bone feature set comprises features Z1, Z2 and Z3 and the second bone feature set comprises features Z4, Z5 and Z6.
Then: taking the first initial bone data as initial bone data corresponding to the data A as an example, the weight of the similarity of Z1 and Z4 is greater than the weight of the similarity of Z1 and Z1; the weight of the similarity of Z1 and Z4 is greater than the weight of the similarity of Z1 and Z2; the weight of the similarity of Z1 and Z4 is greater than the weight of the similarity of Z1 and Z3; the weight of the similarity of Z1 and Z4 is greater than the weight of the similarity of Z1 and Z5; the weight of the similarity of Z1 and Z4 is greater than the weight of the similarity of Z1 and Z6; the weight of the similarity of Z1 and Z4 is greater than the weight of the similarity of Z4 and Z2; the weight of the similarity of Z1 and Z4 is greater than the weight of the similarity of Z4 and Z3; the weight of the similarity of Z1 and Z4 is greater than the weight of the similarity of Z4 and Z4; the weight of the similarity of Z1 and Z4 is greater than the weight of the similarity of Z4 and Z5; the weight of the similarity of Z1 and Z4 is greater than the weight of the similarity of Z4 and Z6.
It will be appreciated that, in this way, since the first weight of the first similarity is greater and the weights of the similarities of the first bone features and other bone features are smaller, for the first initial bone data, even if there is second initial bone data that is semantically close to the first initial bone data, when the loss of the first initial bone data is calculated, the weight of the first similarity is higher than that of the other similarities, so that the influence of the first similarity on the loss of the first initial bone data is intensified, and the influence of the other similarities on the loss of the first initial bone data is weakened, so that the calculated loss of the first initial bone data is easier to converge, further, the loss of the model is easier to converge, and the training efficiency of the model is improved.
In one example, the bone features output by the model are not well suited for directly computing the similarity between bone features. Therefore, in the embodiment of the application, when calculating the similarity between bone features, a feature mapping module may first be used to map the bone features output by the model into a feature space in which similarity is more conveniently computed, and the similarity between bone features is then calculated using the mapped features.
Taking the example of calculating the similarity of the first bone feature and the second bone feature, the first bone feature may be processed according to the feature mapping module to obtain a first feature, and the second bone feature may be processed according to the feature mapping module to obtain a second feature. And then, determining cosine similarity of the first feature and the second feature as the first similarity.
Embodiments of the present application are also not particularly limited to the structure of the feature mapping module, which in one example may be a convolutional neural network comprising a plurality of convolutional layers.
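The sketch below shows a possible feature mapping module together with the cosine-similarity computation of the first similarity; a small fully connected head is used here as a stand-in for the convolutional structure mentioned above, and all names are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureMapping(nn.Module):
    """Illustrative feature mapping (projection) module; the text only states
    that it may be a convolutional network with several layers, so a small
    fully connected head is used here as a stand-in."""
    def __init__(self, in_dim, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, in_dim), nn.ReLU(),
                                 nn.Linear(in_dim, out_dim))

    def forward(self, x):
        return self.net(x)

def first_similarity(mapper, first_bone_feature, second_bone_feature):
    """Cosine similarity of the mapped first and second bone features."""
    z1 = mapper(first_bone_feature)
    z2 = mapper(second_bone_feature)
    return F.cosine_similarity(z1, z2, dim=-1)
```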
In one example, the feature mapping module may be pre-trained.
In yet another example, the feature mapping module may be trained simultaneously with the model, in other words, parameters of the feature mapping module are adjusted according to the model's loss during training of the model. In this way, the feature mapping module and the model can be trained simultaneously without training the feature mapping module in advance.
In the embodiment of the present application, the weights of the similarity between any two bone features may be calculated in advance, and a manner of calculating the weight of the similarity corresponding to the first dimension is described below. In one example, the weight of any similarity may be calculated by the following steps S3-S5.
S3: and calculating the similarity between any two bone features in 2*N bone features corresponding to the first dimension based on 2*N bone features corresponding to the first dimension to obtain a first similarity matrix of 2N×2N, wherein elements corresponding to the ith row and the jth column of the first similarity matrix are used for indicating the similarity between the ith bone feature and the jth bone feature in 2*N bone features.
In the embodiment of the present application, the similarity between two bone features may be cosine similarity of two bone features. Regarding the calculation manner of the cosine similarity, the embodiment of the application is not particularly limited.
S4: and modifying the value of the diagonal element of the first similarity matrix to a preset value to obtain a second similarity matrix.
S5: and processing the second similarity matrix according to an optimal transmission allocation algorithm to obtain a weight matrix, wherein elements corresponding to the ith row and the jth column of the weight matrix are used for indicating the weight of the similarity of the ith bone feature and the jth bone feature in the 2*N bone features.
With respect to S4 and S5, it should be noted that, since each diagonal element of the first similarity matrix indicates the similarity between a bone feature and itself, the diagonal elements of the first similarity matrix are all 1. In order that, for the first bone feature in the first bone feature set, the bone feature with the highest similarity becomes the second bone feature in the second bone feature set rather than the first bone feature itself, the values of the diagonal elements of the first similarity matrix may be modified to a small preset value to obtain the second similarity matrix. The preset value is not particularly limited in the embodiment of the present application; in one example, it may be 1e-3.
The second similarity matrix is then processed according to an optimal transport assignment algorithm to obtain a weight matrix. In the weight matrix, the sum of the elements in each row is 1, and the sum of the elements in each column is also 1. Assuming that the weight of the first similarity between the first bone feature and the second bone feature is the element in the i-th row and j-th column of the weight matrix, that element is the largest of the 2*N elements in the i-th row and also the largest of the 2*N elements in the j-th column. It is understood that the elements of the i-th row are the weights of the similarities between the first bone feature and the 2*N bone features, and the elements of the j-th column are the weights of the similarities between the second bone feature and the 2*N bone features. Thus, it can be seen from the weight matrix that the weight of the first similarity is greater than the weight of the similarity between the first bone feature and the eighth bone feature, and greater than the weight of the similarity between the second bone feature and the ninth bone feature.
The embodiment of the present application does not particularly limit the optimal transport assignment algorithm; for example, it may be the Sinkhorn-Knopp algorithm.
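A compact sketch of S3-S5 is given below, assuming NumPy and a Sinkhorn-Knopp style normalization with an entropic temperature; the temperature and iteration count are assumptions, since the application only requires that the rows and columns of the resulting weight matrix each sum to 1.

```python
import numpy as np

def compute_weight_matrix(features, preset=1e-3, eps=0.05, n_iter=100):
    # features: the 2N bone features of one dimension, shape (2N, feature_dim)
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                   # S3: first similarity matrix (cosine, 2N x 2N)
    np.fill_diagonal(sim, preset)   # S4: replace diagonal values with the small preset value
    w = np.exp(sim / eps)           # S5: Sinkhorn-Knopp iterations on a positive kernel
    for _ in range(n_iter):
        w /= w.sum(axis=1, keepdims=True)   # normalize rows
        w /= w.sum(axis=0, keepdims=True)   # normalize columns
    return w   # approximately doubly stochastic after convergence (row/column sums ~1)

features_2n = np.random.randn(8, 16)        # stands in for the 2N bone features of one dimension
weight_matrix = compute_weight_matrix(features_2n)
```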
Based on the method provided by the embodiment, the embodiment of the application also provides a device, and the device is described below with reference to the accompanying drawings.
Referring to fig. 2, a schematic structural diagram of a model training device for extracting skeletal features of a human body is provided in an embodiment of the present application. The apparatus 200 may specifically include, for example: an acquisition unit 201 and a training unit 202.
An obtaining unit 201, configured to obtain M bone data sets, where each bone data set in the M bone data sets corresponds to one data dimension, each bone data set includes N bone data, N bone data in the bone data sets corresponds to N initial bone data one by one, M is an integer greater than 1, and N is an integer greater than or equal to 1;
A training unit 202, configured to train a model for extracting human skeleton features according to the M bone data sets; wherein:
The loss of the model is determined based on the loss of the N initial bone data, the N initial bone data comprise first initial bone data, and the loss of the first initial bone data is determined according to the loss of the first initial bone data corresponding to the M data dimensions.
Optionally, the training unit 202 is configured to:
Processing N bone data in the M bone data sets by adopting a first data enhancement mode to obtain M first training data sets, wherein one bone data set corresponds to one first training data set, and one first training data set includes N groups of training data; processing N bone data in the M bone data sets by adopting a second data enhancement mode to obtain M second training data sets, wherein one bone data set corresponds to one second training data set, and one second training data set includes N groups of training data;
Training a model for extracting human skeletal features based on the M first training data sets and the M second training data sets; wherein:
The M data dimensions include a first dimension; a first bone feature set corresponding to the first dimension is obtained by passing the first training data set corresponding to the first dimension through the model, and a second bone feature set corresponding to the first dimension is obtained by passing the second training data set corresponding to the first dimension through the model; the first bone feature set and the second bone feature set each include N features, and the N features are in one-to-one correspondence with the N initial bone data.
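The data enhancement step can be pictured as follows; the two enhancement modes shown (a random rotation and a small coordinate jitter) are illustrative assumptions, since the application does not fix the concrete enhancement operations.

```python
import numpy as np

rng = np.random.default_rng(0)

def first_enhancement(seq):
    # first data enhancement mode (assumed): random rotation about the vertical axis
    theta = rng.uniform(-np.pi / 6, np.pi / 6)
    rot = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                    [0.0, 1.0, 0.0],
                    [-np.sin(theta), 0.0, np.cos(theta)]])
    return seq @ rot.T

def second_enhancement(seq):
    # second data enhancement mode (assumed): small Gaussian jitter on joint coordinates
    return seq + rng.normal(scale=0.01, size=seq.shape)

bone_data_set = [rng.normal(size=(50, 25, 3)) for _ in range(4)]   # N sequences: frames x joints x xyz
first_training_data_set = [first_enhancement(s) for s in bone_data_set]
second_training_data_set = [second_enhancement(s) for s in bone_data_set]
```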
Optionally, the loss of the first initial bone data in the first dimension is determined according to the first loss and/or the second loss, wherein:
the first loss is determined according to a first similarity of a first bone feature and a second bone feature, a first weight of the first similarity, a second weight of the similarity of each of the first bone feature and 2*N bone features, and a multi-dimensional fusion similarity corresponding to each second weight, the first bone feature is a bone feature corresponding to the first initial bone data in a first set of bone features corresponding to the first dimension, the second bone feature is a bone feature corresponding to the first initial bone data in a second set of bone features corresponding to the first dimension, the 2*N bone features include: a first bone feature set corresponding to the first dimension and a second bone feature set corresponding to the first dimension;
The second penalty is determined from the first similarity, the first weight, a third weight of similarity for each of the second and 2*N bone features, and a multi-dimensional fusion similarity for each third weight.
Optionally, the 2*N bone features include a third bone feature, the multi-dimensional fusion similarity corresponding to a second weight of the similarity of the first bone feature and the third bone feature is determined by the similarity of the first bone feature and the third bone feature, and the similarity of a fourth bone feature and a fifth bone feature corresponding to each of the (M-1) dimensions, wherein the fourth bone feature and the first bone feature correspond to the same initial bone data, and the fifth bone feature and the third bone feature correspond to the same initial bone data.
Optionally, the 2*N bone features include a third bone feature, the multi-dimensional fusion similarity corresponding to a third weight of the similarity of the second bone feature and the third bone feature is determined by the similarity of the second bone feature and the third bone feature, and the similarity of a sixth bone feature and a seventh bone feature corresponding to each of the (M-1) dimensions, wherein the sixth bone feature and the second bone feature correspond to the same training data, and the seventh bone feature and the third bone feature correspond to the same training data.
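One plausible reading of the multi-dimensional fusion similarity is a simple combination, here the mean, of the per-dimension similarities of the corresponding feature pairs; the averaging below is an assumption, since the application does not state the exact fusion function.

```python
import numpy as np

def fused_similarity(per_dimension_sims):
    # per_dimension_sims[m]: similarity of the corresponding bone feature pair
    # (e.g. first/third, fourth/fifth, ...) computed in data dimension m
    return float(np.mean(per_dimension_sims))   # assumed fusion: plain average over M dimensions

print(fused_similarity([0.82, 0.75, 0.79, 0.80]))   # example with M = 4 data dimensions
```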
Optionally, the weight corresponding to the first similarity is greater than the weight corresponding to the similarity of the first bone feature and an eighth bone feature, the weight corresponding to the first similarity is greater than the weight corresponding to the similarity of the second bone feature and a ninth bone feature, wherein the eighth bone feature is any one of the other (2*N-1) bone features of the 2*N bone features other than the second bone feature, and the ninth bone feature is any one of the other (2*N-1) bone features of the 2*N bone features other than the first bone feature.
Optionally, the apparatus further includes:
A calculating unit, configured to calculate, based on the 2*N bone features corresponding to the first dimension, the similarity between any two of the 2*N bone features to obtain a first similarity matrix of 2N×2N, where the element in the i-th row and j-th column of the first similarity matrix indicates the similarity between the i-th bone feature and the j-th bone feature among the 2*N bone features;
the modification unit is used for modifying the values of the diagonal elements of the first similarity matrix into preset values to obtain a second similarity matrix;
The processing unit is used for processing the second similarity matrix according to an optimal transport assignment algorithm to obtain a weight matrix, where the element in the i-th row and j-th column of the weight matrix indicates the weight of the similarity between the i-th bone feature and the j-th bone feature among the 2*N bone features.
Optionally, the first similarity between the first bone feature and the second bone feature is determined by:
processing the first bone feature according to a feature mapping module to obtain a first feature;
Processing the second bone feature according to the feature mapping module to obtain a second feature;
and determining cosine similarity of the first feature and the second feature as the first similarity.
Optionally, in the training process of the model, parameters of the feature mapping module are adjusted according to loss of the model.
Optionally, the M data dimensions include at least two of:
A joint dimension, a joint motion dimension, a bone dimension, and a bone motion dimension.
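For illustration, these four data dimensions can be derived from raw joint coordinates roughly as follows; the skeleton topology (the parent list) is a hypothetical example and not one specified by this application.

```python
import numpy as np

PARENTS = [0, 0, 1, 2, 1]   # hypothetical 5-joint skeleton: index of each joint's parent joint

def build_data_dimensions(joints):
    # joints: array of shape (frames, joints, 3)
    joint_motion = np.diff(joints, axis=0)       # joint motion: frame-to-frame joint displacement
    bones = joints - joints[:, PARENTS, :]       # bone: vector from parent joint to joint
    bone_motion = np.diff(bones, axis=0)         # bone motion: frame-to-frame bone change
    return {"joint": joints, "joint_motion": joint_motion,
            "bone": bones, "bone_motion": bone_motion}

dims = build_data_dimensions(np.random.randn(50, 5, 3))
```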
Since the apparatus 200 corresponds to the method provided in the above method embodiment, the specific implementation of each of its units is the same as in that embodiment; reference may be made to the description of the method embodiment, and details are not repeated here.
The embodiment of the application also provides a human skeleton data processing system. Fig. 3 is a schematic structural diagram of a human skeleton data processing system according to an embodiment of the present application. The system 300 includes a model 310 for extracting human skeletal features, a data enhancement model 320, and a feature mapping model 330. Wherein:
The model 310 is trained using the method described in fig. 1;
The data enhancement model 320 is configured to perform enhancement processing on the initial bone data to obtain training bone data;
The feature mapping model 330 is configured to process the skeletal features output by the model for extracting skeletal features of the human body, and output features for calculating the loss of the model for extracting skeletal features of the human body.
For a specific implementation of the enhancement processing of the initial bone data by the data enhancement model 320, reference may be made to the description of the above method embodiments, which is not repeated here.
The feature mapping model 330 referred to herein may correspond to the feature mapping module in the method embodiment above. With respect to the specific implementation of the feature mapping model 330, reference may be made to the description of the feature mapping module above, which is not repeated here.
The embodiment of the application also provides a device, including: a processor, a memory, and a system bus; the processor and the memory are connected through the system bus; the memory is configured to store one or more programs, the one or more programs including instructions which, when executed by the processor, cause the processor to perform the method of any one of the above method embodiments.
An embodiment of the present application further provides a computer readable storage medium, where instructions are stored in the computer readable storage medium, and when the instructions are executed on a terminal device, the instructions cause the terminal device to perform the method according to any one of the above method embodiments.
Embodiments of the present application also provide a computer program product which, when run on a terminal device, causes the terminal device to perform the method of any of the above method embodiments.
Other embodiments of the application will be apparent to those skilled in the art from consideration of the specification and practice of the application disclosed herein. This application is intended to cover any variations, uses, or adaptations of the application following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the application pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It is to be understood that the application is not limited to the precise arrangements and instrumentalities shown in the drawings and described above, and that various modifications and changes may be made without departing from its scope. The scope of the application is limited only by the appended claims.
The foregoing description of the preferred embodiments of the application is not intended to limit the application to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the application are intended to be included within the scope of the application.

Claims (10)

1. A model training method for extracting human skeletal features, the method comprising:
Acquiring M bone data sets, wherein each bone data set in the M bone data sets corresponds to a data dimension, each bone data set comprises N bone data, N bone data in the bone data sets corresponds to N initial bone data one by one, M is an integer greater than 1, and N is an integer greater than or equal to 1;
training a model for extracting human skeleton characteristics by using a contrast learning mode according to the M skeleton data sets; wherein:
The loss of the model is determined based on the loss of the N initial bone data, the N initial bone data comprise first initial bone data, and the loss of the first initial bone data is determined according to the loss of the first initial bone data corresponding to the M data dimensions.
2. The method of claim 1, wherein training a model for extracting human skeletal features from the M sets of skeletal data comprises:
Processing N bone data in the M bone data sets by adopting a first data enhancement mode to obtain M first training data sets, wherein one bone data set corresponds to one first training data set, and one first training data set comprises N groups of training data; processing N bone data in the M bone data sets by adopting a second data enhancement mode to obtain M second training data sets, wherein one bone data set corresponds to one second training data set, and one second training data set comprises N groups of training data;
Training a model for extracting human skeletal features based on the M first training data sets and the M second training data sets; wherein:
The M data dimensions comprise first dimensions, a first bone feature set corresponding to the first dimensions is obtained through the model by a first training data set corresponding to the first dimensions, a second bone feature set corresponding to the first dimensions is obtained through the model by a second training data set corresponding to the first dimensions, the first bone feature set and the second bone feature set respectively comprise N features, and the N features are in one-to-one correspondence with the N initial bone data.
3. The method of claim 2, wherein the loss of the first initial bone data in the first dimension is determined from the first loss and/or the second loss, wherein:
The first loss is determined according to a first similarity of a first bone feature and a second bone feature, a first weight of the first similarity, a second weight of the similarity of each of the first bone feature and 2*N bone features, and a multi-dimensional fusion similarity corresponding to each second weight, wherein the first bone feature is a bone feature corresponding to the first initial bone data in a first bone feature set corresponding to the first dimension, the second bone feature is a bone feature corresponding to the first initial bone data in a second bone feature set corresponding to the first dimension, and the 2*N bone features comprise a first bone feature set corresponding to the first dimension and a second bone feature set corresponding to the first dimension;
The second penalty is determined from the first similarity, the first weight, a third weight of similarity for each of the second and 2*N bone features, and a multi-dimensional fusion similarity for each third weight.
4. The method of claim 3, wherein the 2*N bone features include a third bone feature, the second weights of the similarities of the first and third bone features corresponding to a multi-dimensional fusion similarity determined from the similarities of the first and third bone features and the similarities of a fourth and fifth bone feature corresponding to each of the (M-1) dimensions, wherein the fourth and first bone features correspond to the same initial bone data, and the fifth and third bone features correspond to the same initial bone data.
5. A method according to claim 3, wherein the 2*N bone features include a third bone feature, the third weights of the similarity of the second and third bone features corresponding to a multi-dimensional fusion similarity determined from the similarity of the second and third bone features and the similarity of a sixth and seventh bone feature corresponding to each of the (M-1) dimensions, wherein the sixth and second bone features correspond to the same training data, and the seventh and third bone features correspond to the same training data.
6. The method according to claim 3, wherein the weight corresponding to the first similarity is greater than the weight corresponding to the similarity of the first bone feature and an eighth bone feature, and the weight corresponding to the first similarity is greater than the weight corresponding to the similarity of the second bone feature and a ninth bone feature, wherein the eighth bone feature is any one of the other (2*N-1) bone features of the 2*N bone features other than the second bone feature, and the ninth bone feature is any one of the other (2*N-1) bone features of the 2*N bone features other than the first bone feature.
7. A model training apparatus for extracting skeletal features of a human body, the apparatus comprising:
The bone data acquisition unit is used for acquiring M bone data sets, each bone data set in the M bone data sets corresponds to one data dimension, each bone data set comprises N bone data, N bone data in the bone data sets corresponds to N initial bone data one by one, M is an integer greater than 1, and N is an integer greater than or equal to 1;
the training unit is used for training a model for extracting human skeleton characteristics in a contrast learning mode according to the M skeleton data sets; wherein:
The loss of the model is determined based on the loss of the N initial bone data, the N initial bone data comprise first initial bone data, and the loss of the first initial bone data is determined according to the loss of the first initial bone data corresponding to the M data dimensions.
8. A human skeletal data processing system, the system comprising:
a model for extracting human skeletal features, a data enhancement model, and a feature mapping model trained using the method of any one of claims 1-6;
The data enhancement model is used for enhancing the initial bone data to obtain training bone data;
The feature mapping model is used for processing the skeleton features output by the model for extracting the skeleton features of the human body and outputting the features for calculating the loss of the model for extracting the skeleton features of the human body.
9. A model training apparatus for extracting skeletal features of a human body, the apparatus comprising: a processor, a memory, and a system bus; the processor and the memory are connected through the system bus; the memory is for storing one or more programs, the one or more programs comprising instructions which, when executed by the processor, cause the processor to perform the method of any of claims 1-6.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein instructions, which when run on a terminal device, cause the terminal device to perform the method of any of claims 1 to 6.
CN202111351423.4A 2021-11-15 2021-11-15 Model training method and device for extracting human skeleton characteristics Active CN114241514B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111351423.4A CN114241514B (en) 2021-11-15 2021-11-15 Model training method and device for extracting human skeleton characteristics

Publications (2)

Publication Number Publication Date
CN114241514A CN114241514A (en) 2022-03-25
CN114241514B true CN114241514B (en) 2024-05-28

Family

ID=80749490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111351423.4A Active CN114241514B (en) 2021-11-15 2021-11-15 Model training method and device for extracting human skeleton characteristics

Country Status (1)

Country Link
CN (1) CN114241514B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111611880A (en) * 2020-04-30 2020-09-01 杭州电子科技大学 Efficient pedestrian re-identification method based on unsupervised contrast learning of neural network
CN111991772A (en) * 2020-09-08 2020-11-27 衢州职业技术学院 Device and system for assisting upper limb training
CN113408208A (en) * 2021-06-25 2021-09-17 成都欧珀通信科技有限公司 Model training method, information extraction method, related device and storage medium
CN113593611A (en) * 2021-07-26 2021-11-02 平安科技(深圳)有限公司 Voice classification network training method and device, computing equipment and storage medium
CN113642400A (en) * 2021-07-12 2021-11-12 东北大学 Graph convolution action recognition method, device and equipment based on 2S-AGCN

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research and simulation of human behavior analysis methods; Gan Ling; Chen; Computer Simulation; 2013-09-15 (No. 09); 419-423 *

Similar Documents

Publication Publication Date Title
EP4163831A1 (en) Neural network distillation method and device
Pan On visual knowledge
CN113408570A (en) Image category identification method and device based on model distillation, storage medium and terminal
CN111091010A (en) Similarity determination method, similarity determination device, network training device, network searching device and storage medium
CN114612902A (en) Image semantic segmentation method, device, equipment, storage medium and program product
CN116129141A (en) Medical data processing method, apparatus, device, medium and computer program product
CN111652349A (en) Neural network processing method and related equipment
CN115169548A (en) Tensor-based continuous learning method and device
CN110084267B (en) Portrait clustering method, device, electronic equipment and readable storage medium
CN110427864B (en) Image processing method and device and electronic equipment
CN116994021A (en) Image detection method, device, computer readable medium and electronic equipment
Zhou et al. Personalized and occupational-aware age progression by generative adversarial networks
CN110675311A (en) Sketch generation method and device under sketch order constraint and storage medium
CN107729885B (en) Face enhancement method based on multiple residual error learning
CN114241514B (en) Model training method and device for extracting human skeleton characteristics
CN116888605A (en) Operation method, training method and device of neural network model
CN113079136A (en) Motion capture method, motion capture device, electronic equipment and computer-readable storage medium
CN116798127A (en) Taiji boxing whole body posture estimation method, device and medium based on full convolution
CN111445545A (en) Text-to-map method, device, storage medium and electronic equipment
CN114092610B (en) Character video generation method based on generation of confrontation network
CN113792821B (en) Model training method and device for extracting human skeleton features
CN110533749B (en) Dynamic texture video generation method, device, server and storage medium
Wang et al. TASTA: Text‐Assisted Spatial and Temporal Attention Network for Video Question Answering
CN113222100A (en) Training method and device of neural network model
CN112580658B (en) Image semantic description method, device, computing equipment and computer storage medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant