CN113792821B - Model training method and device for extracting human skeleton features - Google Patents


Info

Publication number
CN113792821B
Authority
CN
China
Prior art keywords
bone
data
feature
features
skeletal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111348828.2A
Other languages
Chinese (zh)
Other versions
CN113792821A (en)
Inventor
何嘉斌 (He Jiabin)
刘廷曦 (Liu Tingxi)
翁仁亮 (Weng Renliang)
Current Assignee
Beijing Aibee Technology Co Ltd
Original Assignee
Beijing Aibee Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Aibee Technology Co Ltd
Priority to CN202111348828.2A
Publication of CN113792821A
Application granted
Publication of CN113792821B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a model training method and device for extracting human skeleton features. A first skeleton data set and a second skeleton data set can be obtained, each comprising N pieces of skeleton data; the two sets are produced by applying two different data enhancement modes to the same N pieces of initial skeleton data. A model for extracting human skeleton features is then trained based on the first and second skeleton data sets. During training, the calculation of the loss of the first initial bone data is improved: the improved calculation strengthens the influence of the first similarity on that loss, so the calculated loss of the first initial bone data converges more easily, the loss of the model in turn converges more easily, and the training efficiency of the model is improved.

Description

Model training method and device for extracting human skeleton features
Technical Field
The present application relates to the field of data processing, and in particular, to a model training method and apparatus for extracting human skeletal features.
Background
Currently, models for extracting features, such as human skeletal features, can be trained by self-supervised training. Self-supervised training means that no manually annotated labels for the training samples are used during model training.
Contrastive learning is one form of self-supervised training. In contrastive learning, a single sample is processed with two different data enhancement modes to obtain two enhanced samples, and the model is trained on these two enhanced samples. Because both enhanced samples are obtained by enhancing the same sample, their contents differ but their semantics are consistent. Therefore, if the features the model outputs for the two enhanced samples are sufficiently similar, while their similarity to other samples (or to enhanced versions of other samples) is sufficiently low, the model has acquired the ability to extract sample features.
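The contrastive setup just described can be sketched in a few lines of NumPy. This is an illustrative, assumed implementation (an NT-Xent-style loss, a standard contrastive loss, not the patent's weighted variant introduced later): the batch holds two enhanced views of each sample, and each feature's positive is the other view of the same sample.

```python
import numpy as np

def nt_xent_loss(features, temperature=0.5):
    """features: (2N, d) array; rows i and i+N are the two enhanced views of sample i."""
    n2 = features.shape[0]
    n = n2 // 2
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T / temperature          # pairwise cosine similarities, temperature-scaled
    np.fill_diagonal(sim, -np.inf)       # a view is never its own positive
    total = 0.0
    for i in range(n2):
        pos = (i + n) % n2               # index of the other view of the same sample
        total += -sim[i, pos] + np.log(np.exp(sim[i]).sum())
    return total / n2
```

Pulling the two views of each sample together lowers this loss, while similarity to views of other samples raises it, which is exactly the behaviour the paragraph above describes.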
At present, a model for extracting features can be trained by contrastive learning, but training a feature-extraction model in this way is inefficient. A solution to this problem is therefore urgently needed.
Disclosure of Invention
The technical problem to be solved by this application is the low training efficiency of feature-extraction models trained by contrastive learning; to this end, a model training method and a model training device for extracting features are provided.
In a first aspect, an embodiment of the present application provides a model training method for extracting human bone features, where the method includes:
acquiring a first bone data set and a second bone data set, wherein the first bone data set and the second bone data set both comprise N pieces of bone data, the first bone data set is obtained by performing data enhancement on N pieces of initial bone data in a first data enhancement mode, and the second bone data set is obtained by performing data enhancement on the N pieces of initial bone data in a second data enhancement mode;
training a model for extracting human skeletal features based on the first set of skeletal data and the second set of skeletal data; wherein:
a first skeletal feature set is derived from the first skeletal data set via the model, and a second skeletal feature set is derived from the second skeletal data set via the model, the first skeletal feature set and the second skeletal feature set each comprising N skeletal features; a loss of the model is determined based on losses of the N pieces of initial skeletal data, the N pieces of initial skeletal data including first initial skeletal data, the loss of the first initial skeletal data being determined according to a first similarity between a first skeletal feature and a second skeletal feature and a weight corresponding to the first similarity, the first skeletal feature being the feature of the first skeletal feature set corresponding to the first initial skeletal data, and the second skeletal feature being the feature of the second skeletal feature set corresponding to the first initial skeletal data; the weight corresponding to the first similarity is greater than the weight corresponding to the similarity between the first skeletal feature and a third skeletal feature, and greater than the weight corresponding to the similarity between the second skeletal feature and a fourth skeletal feature, wherein the third skeletal feature is any one of the (2N-1) skeletal features, among the 2N skeletal features, other than the second skeletal feature, the fourth skeletal feature is any one of the (2N-1) skeletal features, among the 2N skeletal features, other than the first skeletal feature, and the 2N skeletal features comprise the N skeletal features of the first skeletal feature set and the N skeletal features of the second skeletal feature set.
Optionally,
the loss of the first initial bone data is determined according to the first similarity, the weight corresponding to the first similarity, the similarity between the first bone feature and each of the (2N-1) bone features other than the first bone feature, and the weight of each such similarity.
Optionally, the first similarity between the first bone feature and the second bone feature is determined by:
processing the first bone feature according to a feature mapping module to obtain a first feature;
processing the second bone features according to the feature mapping module to obtain second features;
and determining the cosine similarity of the first feature and the second feature as the first similarity.
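A minimal sketch of this similarity computation, with an assumed two-layer projection standing in for the feature mapping module (the patent does not specify its architecture, so the layer shapes and weights here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(128, 64))          # assumed weights of the feature mapping module
W2 = rng.normal(size=(64, 32))

def project(feature):
    """Feature mapping module: linear -> ReLU -> linear (assumed form)."""
    return np.maximum(feature @ W1, 0.0) @ W2

def first_similarity(bone_feat_a, bone_feat_b):
    """Map both bone features, then return their cosine similarity."""
    a, b = project(bone_feat_a), project(bone_feat_b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Mapping before comparing is the usual "projection head" arrangement in contrastive training; the cosine similarity of the mapped features is what the text above calls the first similarity.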
Optionally, in the training process of the model, the parameters of the feature mapping module are adjusted according to the loss of the model.
Optionally, the first initial bone data includes data of at least one of the following dimensions:
a joint dimension, a joint motion dimension, a bone dimension, and a bone motion dimension.
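As an illustration, the four dimensions can all be derived from a sequence of joint coordinates: the articulation-point (joint) dimension is the raw coordinates, the motion dimensions are frame-to-frame differences, and the bone dimension is the vector from each joint's parent to the joint. The `parents` topology list is a hypothetical input; the patent does not fix a skeleton layout.

```python
import numpy as np

def skeleton_modalities(joints, parents):
    """joints: (T frames, V joints, 3 coords); parents: parent index per joint."""
    joint_motion = np.diff(joints, axis=0)    # frame-to-frame joint displacement
    bones = joints - joints[:, parents, :]    # vector from parent joint to child joint
    bone_motion = np.diff(bones, axis=0)      # frame-to-frame bone displacement
    return joints, joint_motion, bones, bone_motion
```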
Optionally, the model includes a feature extraction module and a feature fusion module; when the first initial bone data comprises data of at least two dimensions, the feature extraction module is used for extracting features of the data of each dimension in the first bone data, and the feature fusion module is used for fusing data features corresponding to the extracted dimensions to obtain bone features of the first initial bone data, wherein the first bone data is obtained by processing the first initial bone data in a first data enhancement mode.
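A toy sketch of this extraction-then-fusion arrangement, with assumed stand-ins (random linear-ReLU maps) for the real per-dimension encoders, and concatenation as the fusion operator, which the patent leaves unspecified:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_extractor(in_dim, out_dim):
    """Toy per-dimension feature extraction module (stand-in for a real encoder)."""
    W = rng.normal(size=(in_dim, out_dim))
    return lambda x: np.maximum(x @ W, 0.0)

def fuse(feature_list):
    """Feature fusion module: concatenation (one possible, assumed, fusion choice)."""
    return np.concatenate(feature_list, axis=-1)

def extract_bone_feature(modalities, extractors):
    """Extract a feature per data dimension, then fuse them into one bone feature."""
    return fuse([f(m) for m, f in zip(modalities, extractors)])
```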
Optionally, the method further includes:
calculating, based on the 2N bone features, the similarity between every two of the 2N bone features to obtain a 2N×2N first similarity matrix, wherein the element in the ith row and jth column of the first similarity matrix indicates the similarity between the ith bone feature and the jth bone feature among the 2N bone features;
modifying the values of the diagonal elements of the first similarity matrix to a preset value to obtain a second similarity matrix;
and processing the second similarity matrix according to an optimal transport allocation algorithm to obtain a weight matrix, wherein the element in the ith row and jth column of the weight matrix indicates the weight of the similarity between the ith bone feature and the jth bone feature among the 2N bone features.
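These three steps can be sketched with Sinkhorn iterations, a standard solver for optimal-transport problems, used here as an assumed concrete choice for the patent's unspecified optimal-transport allocation algorithm; the preset diagonal value and the entropy parameter `eps` are likewise assumptions.

```python
import numpy as np

def weight_matrix(features, preset=-1e9, eps=0.1, iters=50):
    """features: (2N, d). Returns a 2N x 2N weight matrix over pairwise similarities."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                         # first similarity matrix (2N x 2N)
    np.fill_diagonal(sim, preset)         # second similarity matrix: self-pairs suppressed
    K = np.exp(sim / eps)                 # Sinkhorn kernel from the similarities
    for _ in range(iters):                # alternate row / column normalisation
        K /= K.sum(axis=1, keepdims=True)
        K /= K.sum(axis=0, keepdims=True)
    return K
```

The large negative preset drives the diagonal weights to zero, so a feature never receives weight for its similarity with itself, while the alternating normalisation spreads the remaining weight in proportion to the similarities.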
In a second aspect, an embodiment of the present application provides a model training method for extracting features, where the method includes:
acquiring a first data set and a second data set, wherein the first data set and the second data set both comprise N pieces of data, the first data set is obtained by performing data enhancement on N pieces of initial data in a first data enhancement mode, and the second data set is obtained by performing data enhancement on the N pieces of initial data in a second data enhancement mode;
training a model for extracting features based on the first data set and the second data set; wherein:
a first feature set is obtained from the first data set via the model, and a second feature set is obtained from the second data set via the model, the first feature set and the second feature set each comprising N features; the loss of the model is determined based on losses of the N pieces of initial data, the N pieces of initial data including first initial data, the loss of the first initial data being determined according to a first similarity between a first feature and a second feature and a weight corresponding to the first similarity, the first feature being the feature of the first feature set corresponding to the first initial data, and the second feature being the feature of the second feature set corresponding to the first initial data; the weight corresponding to the first similarity is greater than the weight corresponding to the similarity between the first feature and a third feature, and greater than the weight corresponding to the similarity between the second feature and a fourth feature, wherein the third feature is any one of the (2N-1) features, among the 2N features, other than the second feature, the fourth feature is any one of the (2N-1) features, among the 2N features, other than the first feature, and the 2N features comprise the N features of the first feature set and the N features of the second feature set.
Optionally,
the loss of the first initial data is determined according to the first similarity, the weight corresponding to the first similarity, the similarity between the first feature and each of the (2N-1) features other than the first feature, and the weight of each such similarity.
Optionally, the first similarity between the first feature and the second feature is determined by:
processing the first feature with a feature mapping module to obtain a first mapped feature;
processing the second feature with the feature mapping module to obtain a second mapped feature;
and determining the cosine similarity between the first mapped feature and the second mapped feature as the first similarity.
Optionally, in the training process of the model, the parameters of the feature mapping module are adjusted according to the loss of the model.
Optionally, the first initial data includes data of at least one dimension.
Optionally, the model includes a feature extraction sub-module and a feature fusion sub-module; when the first initial data comprises data of at least two dimensions, the feature extraction submodule is used for extracting features of the data of each dimension in the first data, and the feature fusion submodule is used for fusing data features corresponding to the extracted dimensions to obtain features of the first initial data, wherein the first data is obtained by processing the first initial data in a first data enhancement mode.
Optionally, the method further includes:
calculating, based on the 2N features, the similarity between every two of the 2N features to obtain a 2N×2N first similarity matrix, where the element in the ith row and jth column of the first similarity matrix indicates the similarity between the ith feature and the jth feature among the 2N features;
modifying the values of the diagonal elements of the first similarity matrix to a preset value to obtain a second similarity matrix;
and processing the second similarity matrix according to an optimal transport allocation algorithm to obtain a weight matrix, where the element in the ith row and jth column of the weight matrix indicates the weight of the similarity between the ith feature and the jth feature among the 2N features.
In a third aspect, an embodiment of the present application provides a human bone data processing system, including:
a model for extracting human skeleton features trained by the method of any one of the first aspect, a data enhancement model, and a feature mapping model;
the data enhancement model is configured to enhance initial bone data to obtain training bone data;
and the feature mapping model is configured to process the bone features output by the model for extracting human skeleton features and to output the features used to calculate the loss of that model.
In a fourth aspect, the embodiment of the present application provides a model training apparatus for extracting human bone features, where the apparatus includes:
an obtaining unit, configured to obtain a first bone data set and a second bone data set, where the first bone data set and the second bone data set both include N pieces of bone data, the first bone data set is obtained by performing data enhancement on N pieces of initial bone data in a first data enhancement manner, and the second bone data set is obtained by performing data enhancement on the N pieces of initial bone data in a second data enhancement manner;
a training unit for training a model for extracting human skeletal features based on the first skeletal data set and the second skeletal data set; wherein:
a first skeletal feature set is derived from the first skeletal data set via the model, and a second skeletal feature set is derived from the second skeletal data set via the model, the first skeletal feature set and the second skeletal feature set each comprising N skeletal features; a loss of the model is determined based on losses of the N pieces of initial skeletal data, the N pieces of initial skeletal data including first initial skeletal data, the loss of the first initial skeletal data being determined according to a first similarity between a first skeletal feature and a second skeletal feature and a weight corresponding to the first similarity, the first skeletal feature being the feature of the first skeletal feature set corresponding to the first initial skeletal data, and the second skeletal feature being the feature of the second skeletal feature set corresponding to the first initial skeletal data; the weight corresponding to the first similarity is greater than the weight corresponding to the similarity between the first skeletal feature and a third skeletal feature, and greater than the weight corresponding to the similarity between the second skeletal feature and a fourth skeletal feature, wherein the third skeletal feature is any one of the (2N-1) skeletal features, among the 2N skeletal features, other than the second skeletal feature, the fourth skeletal feature is any one of the (2N-1) skeletal features, among the 2N skeletal features, other than the first skeletal feature, and the 2N skeletal features comprise the N skeletal features of the first skeletal feature set and the N skeletal features of the second skeletal feature set.
Optionally,
the loss of the first initial bone data is determined according to the first similarity, the weight corresponding to the first similarity, the similarity between the first bone feature and each of the (2N-1) bone features other than the first bone feature, and the weight of each such similarity.
Optionally, the first similarity between the first bone feature and the second bone feature is determined by:
processing the first bone feature according to a feature mapping module to obtain a first feature;
processing the second bone features according to the feature mapping module to obtain second features;
and determining the cosine similarity of the first feature and the second feature as the first similarity.
Optionally, in the training process of the model, the parameters of the feature mapping module are adjusted according to the loss of the model.
Optionally, the first initial bone data includes data of at least one of the following dimensions:
a joint dimension, a joint motion dimension, a bone dimension, and a bone motion dimension.
Optionally, the model includes a feature extraction module and a feature fusion module; when the first initial bone data comprises data of at least two dimensions, the feature extraction module is used for extracting features of the data of each dimension in the first bone data, and the feature fusion module is used for fusing data features corresponding to the extracted dimensions to obtain bone features of the first initial bone data, wherein the first bone data is obtained by processing the first initial bone data in a first data enhancement mode.
Optionally, the apparatus further comprises:
a calculating unit, configured to calculate, based on the 2N bone features, the similarity between every two of the 2N bone features to obtain a 2N×2N first similarity matrix, where the element in the ith row and jth column of the first similarity matrix indicates the similarity between the ith bone feature and the jth bone feature among the 2N bone features;
a modifying unit, configured to modify the values of the diagonal elements of the first similarity matrix to a preset value to obtain a second similarity matrix;
and a processing unit, configured to process the second similarity matrix according to an optimal transport allocation algorithm to obtain a weight matrix, where the element in the ith row and jth column of the weight matrix indicates the weight of the similarity between the ith bone feature and the jth bone feature among the 2N bone features.
In a fifth aspect, an embodiment of the present application provides a model training apparatus for extracting features, the apparatus including:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a first data set and a second data set, the first data set and the second data set both comprise N pieces of data, the first data set is obtained by performing data enhancement on N pieces of initial data by adopting a first data enhancement mode, and the second data set is obtained by performing data enhancement on the N pieces of initial data by adopting a second data enhancement mode;
a training unit for training a model for extracting features based on the first data set and the second data set; wherein:
a first feature set is obtained from the first data set via the model, and a second feature set is obtained from the second data set via the model, the first feature set and the second feature set each comprising N features; the loss of the model is determined based on losses of the N pieces of initial data, the N pieces of initial data including first initial data, the loss of the first initial data being determined according to a first similarity between a first feature and a second feature and a weight corresponding to the first similarity, the first feature being the feature of the first feature set corresponding to the first initial data, and the second feature being the feature of the second feature set corresponding to the first initial data; the weight corresponding to the first similarity is greater than the weight corresponding to the similarity between the first feature and a third feature, and greater than the weight corresponding to the similarity between the second feature and a fourth feature, wherein the third feature is any one of the (2N-1) features, among the 2N features, other than the second feature, the fourth feature is any one of the (2N-1) features, among the 2N features, other than the first feature, and the 2N features comprise the N features of the first feature set and the N features of the second feature set.
Optionally,
the loss of the first initial data is determined according to the first similarity, the weight corresponding to the first similarity, the similarity between the first feature and each of the (2N-1) features other than the first feature, and the weight of each such similarity.
Optionally, the first similarity between the first feature and the second feature is determined by:
processing the first feature with a feature mapping module to obtain a first mapped feature;
processing the second feature with the feature mapping module to obtain a second mapped feature;
and determining the cosine similarity between the first mapped feature and the second mapped feature as the first similarity.
Optionally, in the training process of the model, the parameters of the feature mapping module are adjusted according to the loss of the model.
Optionally, the first initial data includes data of at least one dimension.
Optionally, the model includes a feature extraction sub-module and a feature fusion sub-module; when the first initial data comprises data of at least two dimensions, the feature extraction submodule is used for extracting features of the data of each dimension in the first data, and the feature fusion submodule is used for fusing data features corresponding to the extracted dimensions to obtain features of the first initial data, wherein the first data is obtained by processing the first initial data in a first data enhancement mode.
Optionally, the apparatus further comprises:
a calculating unit, configured to calculate, based on the 2N features, the similarity between every two of the 2N features to obtain a 2N×2N first similarity matrix, where the element in the ith row and jth column of the first similarity matrix indicates the similarity between the ith feature and the jth feature among the 2N features;
a modifying unit, configured to modify the values of the diagonal elements of the first similarity matrix to a preset value to obtain a second similarity matrix;
and a processing unit, configured to process the second similarity matrix according to an optimal transport allocation algorithm to obtain a weight matrix, where the element in the ith row and jth column of the weight matrix indicates the weight of the similarity between the ith feature and the jth feature among the 2N features.
In a sixth aspect, an embodiment of the present application provides a device, comprising: a processor, a memory, and a system bus; the processor and the memory are connected through the system bus; the memory is configured to store one or more programs comprising instructions which, when executed by the processor, cause the processor to perform the method of any one of the first aspect or the second aspect above.
In a seventh aspect, an embodiment of the present application provides a computer-readable storage medium, where instructions are stored, and when the instructions are executed on a terminal device, the instructions cause the terminal device to perform the method according to any one of the above first aspect or any one of the above second aspect.
In an eighth aspect, embodiments of the present application provide a computer program product, which when run on a terminal device, causes the terminal device to perform the method of any one of the above first aspects or any one of the above second aspects.
Compared with the prior art, the embodiment of the application has the following advantages:
the embodiment of the application provides a model training method for extracting human skeletal features, which can obtain a first skeletal data set and a second skeletal data set, wherein the first skeletal data set and the second skeletal data set both comprise N pieces of skeletal data, the first skeletal data set is obtained by performing data enhancement on N pieces of initial skeletal data in a first data enhancement mode, and the second skeletal data set is obtained by performing data enhancement on the N pieces of initial skeletal data in a second data enhancement mode; training a model for extracting human skeletal features based on the first set of skeletal data and the second set of skeletal data. In training the model, the first set of skeletal data may yield a first set of skeletal features via the model, and the second set of skeletal data may yield a second set of skeletal features via the model. In order to improve the convergence speed of the model and improve the training efficiency of the model, the embodiment of the application improves the loss of the model. Specifically, the method comprises the following steps: a loss of the model is determined based on a loss of the N initial skeletal data. 
For convenience of description, any one of the N pieces of initial bone data is referred to as the "first initial bone data". The loss of the first initial bone data is determined according to a first similarity between a first bone feature and a second bone feature and a weight corresponding to the first similarity, where the first bone feature is the feature of the first bone feature set corresponding to the first initial bone data, and the second bone feature is the feature of the second bone feature set corresponding to the first initial bone data. The weight corresponding to the first similarity is greater than the weight corresponding to the similarity between the first bone feature and a third bone feature, and greater than the weight corresponding to the similarity between the second bone feature and a fourth bone feature, where the third bone feature is any one of the (2N-1) bone features, among the 2N bone features, other than the second bone feature, the fourth bone feature is any one of the (2N-1) bone features, among the 2N bone features, other than the first bone feature, and the 2N bone features comprise the N bone features of the first bone feature set and the N bone features of the second bone feature set. It can be seen that, in the embodiment of the present application, even if there is second initial bone data semantically close to the first initial bone data, the weight of the first similarity is higher than the weights of the other similarities when the loss of the first initial bone data is calculated. The influence of the first similarity on that loss is therefore strengthened, so the calculated loss of the first initial bone data converges more easily, the loss of the model in turn converges more easily, and the training efficiency of the model is improved.
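The weighting scheme summarized above can be illustrated with a hypothetical per-sample loss in which each similarity is scaled by its weight before the softmax-style normalisation; the function name and exact form are assumptions for illustration, not the patent's formula.

```python
import numpy as np

def weighted_sample_loss(sims, weights, pos_index, temperature=0.5):
    """sims / weights: the similarities between one feature and the other 2N-1
    features, and their weights; pos_index marks the first (positive) similarity."""
    logits = np.asarray(weights) * np.asarray(sims) / temperature
    # InfoNCE-style term: pull the weighted positive similarity up relative to the rest.
    return float(-logits[pos_index] + np.log(np.exp(logits).sum()))
```

Because the positive similarity carries the largest weight, its logit dominates the normalising sum sooner, which is why the per-sample loss (and hence the model loss) converges more easily than with uniform weights.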
The embodiment of the application further provides a model training method for extracting features, which can obtain a first data set and a second data set, each comprising N pieces of data; the first data set is obtained by enhancing N pieces of initial data in a first data enhancement mode, and the second data set is obtained by enhancing the same N pieces of initial data in a second data enhancement mode. A model for extracting features is then trained based on the first data set and the second data set. In training the model, the first data set yields a first feature set via the model, and the second data set yields a second feature set via the model. In order to speed up convergence and improve training efficiency, the embodiment of the application improves the loss of the model. Specifically, the loss of the model is determined based on the losses of the N pieces of initial data.
For convenience of description, any one initial data in the N initial data is referred to as "first initial data", and a loss of the first initial data is determined according to a first similarity between a first feature and a second feature and a weight corresponding to the first similarity, where: the first feature is a feature of the first feature set corresponding to the first initial data, the second feature is a feature of the second set of features corresponding to the first initial data, and the weight corresponding to the first similarity is larger than the weight corresponding to the similarity between the first feature and the third feature, the weight corresponding to the first similarity is larger than the weight corresponding to the similarity of the second feature and the fourth feature, wherein the third feature is any one of (2 x N-1) other features than the second feature among the 2 x N features, the fourth feature is any one of the other (2 x N-1) features than the first feature among the 2 x N features, the 2 x N features include N features in the first set of features and N features in the second set of features. As can be seen from this, in the embodiment of the present application, even if there is second initial data semantically close to the first initial data, when the loss of the first initial data is calculated, the weight of the first similarity is higher than the weights of other similarities, and therefore, the influence of the first similarity on the loss of the first initial data is strengthened, so that the calculated loss of the first initial data is more easily converged, and further, the loss of the model is more easily converged, thereby improving the training efficiency of the model.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the present application, and other drawings can be obtained by those skilled in the art without creative efforts.
Fig. 1 is a schematic flowchart of a model training method for extracting human skeletal features according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a model for extracting human bone features according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a human skeletal data processing system according to an embodiment of the present disclosure;
FIG. 4 is a schematic flowchart of a model training method for feature extraction according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a model training apparatus for extracting human bone features according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of a model training apparatus for extracting features according to an embodiment of the present disclosure.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Through research, the inventors of the present application found that training a model for extracting features by contrastive learning suffers from low training efficiency. This is illustrated next by taking the training of a model for extracting human bone features by contrastive learning as an example.
At present, when a model for extracting human skeletal features is trained by contrastive learning, two data enhancement modes may be used to enhance N pieces of initial skeletal data, yielding 2 × N pieces of training data, and the model is trained with these 2 × N pieces of training data. It can be understood that 2 × N bone features are obtained after the 2 × N pieces of training data pass through the model.
The initial bone data referred to herein may be joint data. The joint point data may be data of a plurality of (e.g., 25) joint points of the human body.
For a certain initial skeleton data, for example, the first initial skeleton data, the loss corresponding to the initial skeleton data can be calculated by the following formula (1).
ℓ(i, j) = -log( exp(sim(z_i, z_j)/τ) / Σ_{k=1, k≠i}^{2N} exp(sim(z_i, z_k)/τ) )
Formula (1)
In formula (1):
ℓ(i, j) is the loss corresponding to the first initial bone data;
i and j represent the serial numbers, among all 2 × N bone features, of the two bone features corresponding to the two enhanced training data derived from the first initial bone data;
sim(z_i, z_j) represents the similarity of the bone features corresponding to the two enhanced training data derived from the first initial bone data;
sim(z_i, z_k) represents the similarity of the ith bone feature and the kth bone feature among the obtained 2 × N bone features;
τ is a temperature hyperparameter.
It can be understood that, for formula (1), if there is other initial bone data, such as second initial bone data, that is semantically close to the first initial bone data, then the similarities between the two bone features corresponding to the second initial bone data and the two bone features corresponding to the first initial bone data among the 2 × N bone features are high. This makes the denominator of formula (1) large, so that the loss ℓ(i, j) is too large to approach 0; as a result, the model converges slowly, i.e., the training efficiency of the model is low.
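The standard contrastive loss of formula (1) can be sketched as follows. This is a minimal illustration, assuming cosine similarity and a temperature hyperparameter tau; the function names are chosen here for illustration only.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two feature vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def contrastive_loss(features, i, j, tau=0.5):
    """Per-sample loss in the form of formula (1): `features` holds the 2N
    bone features; i and j index the two features derived from the same
    initial bone data."""
    num = np.exp(cosine_sim(features[i], features[j]) / tau)
    den = sum(np.exp(cosine_sim(features[i], features[k]) / tau)
              for k in range(len(features)) if k != i)
    return float(-np.log(num / den))
```

Because the numerator term also appears in the denominator, the fraction is below 1 and the loss is positive; a semantically close negative pair inflates the denominator and keeps the loss away from 0, which is the convergence problem described above.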
Now, the following examples are given: assume N =3, that is, there are 3 initial bone data, and assume that these 3 initial bone data are data a, data B, and data C. Then the data a1, B1 and C1 are obtained after the 3 initial bone data are data enhanced by the first data enhancement mode, and the data a2, B2 and C2 are obtained after the 3 initial bone data are data enhanced by the second data enhancement mode. The data A1, B1, C1, A2, B2 and C2 are input into the model to obtain 6 bone features, namely Z1, Z2, Z3, Z4, Z5 and Z6. It will be appreciated that bone features Z1 and Z4 are features corresponding to data a, bone features Z2 and Z5 are features corresponding to data B, and bone features Z3 and Z6 are features corresponding to data C. Then according to equation (1) above: the corresponding loss of data a can be represented by the following equation (2):
ℓ_A = -log( exp(sim(Z1, Z4)/τ) / ( exp(sim(Z1, Z2)/τ) + exp(sim(Z1, Z3)/τ) + exp(sim(Z1, Z4)/τ) + exp(sim(Z1, Z5)/τ) + exp(sim(Z1, Z6)/τ) ) )
formula (2)
In equation (2):
ℓ_A is the loss corresponding to the data A;
sim(Z1, Z4) is the similarity of Z1 and Z4;
sim(Z1, Z2), sim(Z1, Z3), sim(Z1, Z5) and sim(Z1, Z6) are, respectively, the similarities of Z1 with Z2, Z3, Z5 and Z6.
It can be understood that if the semantic information of the data A and the data B is similar, the similarities among the features Z1, Z4, Z2 and Z5 are high; therefore, the denominator terms exp(sim(Z1, Z2)/τ) and exp(sim(Z1, Z5)/τ) of formula (2) are both relatively large. Thus, even if sim(Z1, Z4) is relatively high, the denominator is also relatively large, so the fraction inside the logarithm is relatively small, and the loss ℓ_A is in turn large, making the model difficult to converge.
In order to solve the above problem, embodiments of the present application provide a model training method for extracting human bone features and a model training method for extracting features.
Various non-limiting embodiments of the present application are described in detail below with reference to the accompanying drawings.
Referring to fig. 1, the figure is a schematic flowchart of a model training method for extracting human skeletal features according to an embodiment of the present application. The method provided by the embodiment of the present application may be executed by a terminal device or a server, and the embodiment of the present application is not particularly limited.
In this embodiment, the method shown in fig. 1 may include the following steps, for example: S101-S102.
S101: the method comprises the steps of obtaining a first bone data set and a second bone data set, wherein the first bone data set and the second bone data set both comprise N pieces of bone data, the first bone data set is obtained by performing data enhancement on N pieces of initial bone data in a first data enhancement mode, and the second bone data set is obtained by performing data enhancement on the N pieces of initial bone data in a second data enhancement mode.
The N pieces of initial skeleton data may be single-dimensional skeleton data or multi-dimensional skeleton data, and the embodiment of the present application is not particularly limited.
In an embodiment of the present application, the bone data may include: data of the dimensions of the articulation points, data of the dimensions of the movement of the articulation points, data of the dimensions of the bones, and data of the dimensions of the movement of the bones. In other words, each piece of initial bone data in the N pieces of initial bone data may be data of a joint dimension, data of a joint motion dimension, data of a bone dimension, or data of a bone motion dimension. Each piece of initial bone data in the N pieces of initial bone data may also be data of at least two dimensions of an articulation point dimension, an articulation point motion dimension, a bone dimension, and a bone motion dimension.
Wherein:
the data of the joint dimension may be, for example, coordinate sequences of N joints, and the coordinate sequence of one joint is exemplified as follows: a bone sequence with a length of T frames, each frame containing N persons, each person having M joints, each joint containing three-dimensional coordinate position information, the coordinate sequence of the joint can be expressed as:
X = { x_{t,n,m} ∈ R^3 | t = 1, …, T; n = 1, …, N; m = 1, …, M }, i.e., X ∈ R^{T×N×M×3}.
The data of the joint movement dimension may be, for example, a movement coordinate sequence of N joint points. The motion coordinate sequence of a joint point is obtained by subtracting, within the corresponding T-frame skeleton sequence, the joint point coordinates of each previous frame from those of the next frame; because the first frame has no preceding frame to subtract, it is discarded and then filled back in by interpolation.
The data of the bone dimension may be, for example, a coordinate sequence of N bones. The coordinate sequence of a certain bone is the difference of the coordinates of adjacent joint points in the same frame of bone sequence.
The data of the bone motion dimension may be, for example, a motion coordinate sequence of N bones. The motion coordinate sequence of a bone is obtained by subtracting, within the corresponding T-frame bone sequence, the bone coordinates of each previous frame from those of the next frame; because the first frame has no preceding frame to subtract, it is discarded and then filled back in by interpolation.
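Under the definitions above, the four data streams can be sketched for a single person as follows. The array shape and the simple repeat-fill of the first motion frame are assumptions; the patent fills the discarded first frame by interpolation.

```python
import numpy as np

def four_streams(joints):
    """joints: array of shape (T, M, 3), i.e. T frames of M three-dimensional
    joint coordinates. Returns the joint, joint-motion, bone and bone-motion
    streams, all with the same shape as the input."""
    # Joint motion: coordinates of the next frame minus those of the previous
    # frame; the first frame has no predecessor, so it is dropped and filled
    # back in (here simply by repeating the first motion frame).
    d = np.diff(joints, axis=0)
    joint_motion = np.concatenate([d[:1], d], axis=0)
    # Bones: differences of adjacent joint coordinates within the same frame;
    # the last joint has no following neighbour, so its bone is zero-padded.
    bones = np.concatenate([joints[:, 1:] - joints[:, :-1],
                            np.zeros_like(joints[:, :1])], axis=1)
    # Bone motion: frame-to-frame difference of the bone stream, filled the
    # same way as the joint motion.
    db = np.diff(bones, axis=0)
    bone_motion = np.concatenate([db[:1], db], axis=0)
    return joints, joint_motion, bones, bone_motion
```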
For convenience of description, in the following embodiments, any one piece of initial skeleton data of the N pieces of initial skeleton data is referred to as "first initial skeleton data".
In one example, when the first initial bone data is obtained, the first joint point data may be obtained first, and then the first joint point motion data, the first bone data, and the first bone motion data are obtained based on the first joint point data, and then at least two items of the first joint point data, the first joint point motion data, the first bone data, and the first bone motion data are used as the first initial bone data. In the present embodiment, "joint point data" is "data of joint point dimension", "joint point movement data" is "data of joint point movement dimension", "bone data" is "data of bone dimension", and "bone movement data" is "data of bone movement dimension".
It is understood that the first initial bone data comprises data of multiple dimensions, so that training samples can be diversified, and the effect of model training can be optimized.
The embodiment of the present application does not specifically limit the first data enhancement mode and the second data enhancement mode. In one example, the first data enhancement mode may be one of the following, the second data enhancement mode is one of the following, and the first data enhancement mode and the second data enhancement mode are different.
1) All the joint points of each frame randomly and slightly rotate towards the same direction;
2) all the joint points of each frame are slightly inclined randomly towards the same direction;
3) adding a Gaussian noise to the coordinates of all joint points in all frames;
4) and hiding part of joint points in the partial frame.
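A possible sketch of these four enhancement modes is given below; the rotation axis, parameter ranges, and masking probability are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def rotate(joints, max_deg=15.0):
    # 1) Randomly rotate all joints of every frame slightly in the same
    # direction (here about the z-axis; axis and range are assumptions).
    a = rng.uniform(-np.deg2rad(max_deg), np.deg2rad(max_deg))
    R = np.array([[np.cos(a), -np.sin(a), 0.0],
                  [np.sin(a),  np.cos(a), 0.0],
                  [0.0,        0.0,       1.0]])
    return joints @ R.T

def tilt(joints, max_s=0.1):
    # 2) Randomly tilt (shear) all joints slightly in the same direction.
    S = np.eye(3)
    S[0, 2] = rng.uniform(-max_s, max_s)
    return joints @ S.T

def gaussian_noise(joints, sigma=0.01):
    # 3) Add Gaussian noise to the coordinates of all joints in all frames.
    return joints + rng.normal(0.0, sigma, size=joints.shape)

def mask_joints(joints, p=0.1):
    # 4) Hide (zero out) a random subset of joints in a subset of frames.
    keep = rng.random(joints.shape[:-1]) >= p
    return joints * keep[..., None]
```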
S102: training a model for extracting human skeletal features based on the first set of skeletal data and the second set of skeletal data.
When the model is trained based on the first set of bone data and the second set of bone data, the first set of bone data and the second set of bone data may be input into the model, wherein after the first set of bone data is input into the model, a first set of bone features may be obtained, N bone features may be included in the first set of bone features, and the N pieces of initial bone data are in one-to-one correspondence with the N bone features in the first set of bone features. A second set of skeletal features may be obtained after inputting the second set of skeletal data into the model. N bone features can be included in the second bone feature set, and the N pieces of initial bone data are in one-to-one correspondence with the N bone features in the second bone feature set. It will be appreciated that 2 x N bone features may be obtained after the first and second sets of bone data are input into the model.
It can be understood that, for a first initial bone data of the N initial bone data, there is a corresponding bone feature in the first set of bone features, and there is also a corresponding bone feature in the second set of bone features. For convenience of description, the bone feature corresponding to the first initial bone data in the first set of bone features is referred to as the "first bone feature", and the bone feature corresponding to the first initial bone data in the second set of bone features is referred to as the "second bone feature". Specifically, the first initial bone data is enhanced in the first data enhancement mode to obtain first bone data, which passes through the model to yield the first bone feature; the first initial bone data is enhanced in the second data enhancement mode to obtain second bone data, which passes through the model to yield the second bone feature.
In the embodiment of the application, in order to improve the convergence speed of the model, the loss calculation mode of the model is improved. Specifically, the method comprises the following steps:
the loss of the model is determined based on the loss of the N initial skeletal data, for example, the loss of the model is an average of the losses of the N initial skeletal data.
And for first initial bone data, determining the loss of the first initial bone data according to a first similarity of the first bone characteristic and the second bone characteristic and a weight corresponding to the first similarity. The weight corresponding to the first similarity is greater than the weight corresponding to the similarity between the first bone feature and the third bone feature, and the weight corresponding to the first similarity is greater than the weight corresponding to the similarity between the second bone feature and the fourth bone feature. Wherein the third bone feature is any one of the other (2 x N-1) bone features except the second bone feature, and the fourth bone feature is any one of the other (2 x N-1) bone features except the first bone feature.
For example, the following steps are carried out:
n =3, i.e., there are 3 initial skeletal data in total, assuming that these 3 initial skeletal data are data a, data B, and data C. After data enhancement is performed on the 3 initial bone data in the first data enhancement mode, data a1, B1 and C1 are obtained, i.e., the first set of bone data includes data a1, B1 and C1. After data enhancement is performed on the 3 initial bone data in the second data enhancement mode, data a2, B2 and C2 are obtained, i.e., the second set of bone data includes data a2, B2 and C2. Data A1, B1, C1, A2, B2 and C2 are input into a model, 6 bone features are obtained, namely Z1, Z2, Z3, Z4, Z5 and Z6 respectively, the first bone feature set comprises bone features Z1, Z2 and Z3, and the second bone feature set comprises bone features Z4, Z5 and Z6.
Then: taking the first initial bone data as data a for example, the weight of the similarity between Z1 and Z4 is greater than the weight of the similarity between Z1 and Z1; the weight of the similarity of Z1 and Z4 is greater than the weight of the similarity of Z1 and Z2; the weight of the similarity of Z1 and Z4 is greater than the weight of the similarity of Z1 and Z3; the weight of the similarity of Z1 and Z4 is greater than the weight of the similarity of Z1 and Z5; the weight of the similarity of Z1 and Z4 is greater than the weight of the similarity of Z1 and Z6; the weight of the similarity of Z1 and Z4 is greater than the weight of the similarity of Z4 and Z2; the weight of the similarity of Z1 and Z4 is greater than the weight of the similarity of Z4 and Z3; the weight of the similarity of Z1 and Z4 is greater than the weight of the similarity of Z4 and Z4; the weight of the similarity of Z1 and Z4 is greater than the weight of the similarity of Z4 and Z5; the weight of the similarity of Z1 and Z4 is greater than the weight of the similarity of Z4 and Z6.
It can be understood that, in this way, since the weight of the first similarity is larger and the weight of the similarity of the first bone feature to other bone features is smaller, for the first initial bone data, even if there is second initial bone data semantically close to the first initial bone data, the weight of the first similarity is higher than the weight of the other similarities when calculating the loss of the first initial bone data, and therefore, the influence of the first similarity on the loss of the first initial bone data is strengthened, so that the calculated loss of the first initial bone data is easier to converge, and further, the loss of the model is easier to converge, thereby improving the training efficiency of the model.
As described above, the loss of the first initial bone data is determined according to the first similarity between the first bone feature and the second bone feature and the weight corresponding to the first similarity. In one example, the loss may be determined based on the first bone feature, the second bone feature, the weight corresponding to the first similarity, and a loss calculation logic; different loss calculation logics calculate the loss of the first initial bone data in different ways. Several ways of calculating the loss of the first initial bone data are described below.
In one implementation: the loss of the first initial bone data can be calculated by the following equation (3).
ℓ(i, j) = -log( exp(w_{i,j} · sim(z_i, z_j)/τ) / Σ_{k=1, k≠i}^{2N} exp(sim(z_i, z_k)/τ) )
formula (3)
In equation (3):
ℓ(i, j) is the loss corresponding to the first initial bone data;
i and j represent the serial numbers, among all 2 × N bone features, of the two bone features corresponding to the two enhanced training data derived from the first initial bone data;
sim(z_i, z_j) represents the similarity of the bone features corresponding to the two enhanced training data derived from the first initial bone data;
w_{i,j} represents the weight of sim(z_i, z_j);
sim(z_i, z_k) represents the similarity of the ith and kth bone features among the obtained 2 × N bone features, and τ is a temperature hyperparameter.
It is understood that, using equation (3), the influence of the similarity of the first and second bone features on the loss of the first initial bone data can be strengthened, so that the loss of the model converges as soon as possible.
In yet another implementation, the loss of the first initial bone data is determined according to the first similarity and its corresponding weight, together with the similarities between the first bone feature and each of the other (2 × N - 1) bone features and the weights corresponding to those similarities. In one example, the loss of the first initial bone data can be calculated by the following formula (4).
ℓ(i, j) = -log( exp(sim(z_i, z_j)/τ) / Σ_{k=1, k≠i}^{2N} exp(w_{i,k} · sim(z_i, z_k)/τ) )
Formula (4)
In equation (4):
ℓ(i, j) is the loss corresponding to the first initial bone data;
i and j represent the serial numbers, among all 2 × N bone features, of the two bone features corresponding to the two enhanced training data derived from the first initial bone data;
sim(z_i, z_j) represents the similarity of the bone features corresponding to the two enhanced training data derived from the first initial bone data, and w_{i,j} represents the weight of sim(z_i, z_j);
sim(z_i, z_k) represents the similarity of the ith and kth bone features among the obtained 2 × N bone features, and w_{i,k} represents the weight of sim(z_i, z_k).
As can be seen from formula (4), even if there is second initial bone data semantically close to the first initial bone data, the weight of the first similarity is higher than the weights of the other similarities when the loss of the first initial bone data is calculated. Formula (4) therefore weakens the influence of the second initial bone data on the loss of the first initial bone data, which manifests itself as a smaller contribution to the denominator, so that the loss of the model converges more easily.
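One possible reading of this re-weighted loss can be sketched as follows; the matrix-based interface and the exact placement of the weights inside the exponentials are assumptions.

```python
import numpy as np

def weighted_contrastive_loss(sim, w, i, j, tau=0.5):
    """Per-sample loss in the spirit of formula (4): the denominator terms
    are re-weighted so that confusable negatives contribute less. `sim` and
    `w` are 2N x 2N matrices of pairwise similarities and their weights;
    i and j index the positive pair."""
    num = np.exp(sim[i, j] / tau)
    den = sum(np.exp(w[i, k] * sim[i, k] / tau)
              for k in range(sim.shape[0]) if k != i)
    return float(-np.log(num / den))
```

Down-weighting the similarity of a semantically close negative shrinks its exponential in the denominator, so the loss of the sample drops and the model converges more easily.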
The effect of formula (4) is now illustrated with an example:
assume N =3, that is, there are 3 initial bone data, and assume that these 3 initial bone data are data a, data B, and data C. Then the data a1, B1 and C1 are obtained after the 3 initial bone data are data enhanced by the first data enhancement mode, and the data a2, B2 and C2 are obtained after the 3 initial bone data are data enhanced by the second data enhancement mode. The data A1, B1, C1, A2, B2 and C2 are input into the model to obtain 6 bone features, namely Z1, Z2, Z3, Z4, Z5 and Z6. It will be appreciated that bone features Z1 and Z4 are features corresponding to data a, bone features Z2 and Z5 are features corresponding to data B, and bone features Z3 and Z6 are features corresponding to data C. Then according to equation (4) above: the corresponding loss of data a can be represented by the following equation (5):
ℓ_A = -log( exp(sim(Z1, Z4)/τ) / ( exp(w_{1,2}·sim(Z1, Z2)/τ) + exp(w_{1,3}·sim(Z1, Z3)/τ) + exp(w_{1,4}·sim(Z1, Z4)/τ) + exp(w_{1,5}·sim(Z1, Z5)/τ) + exp(w_{1,6}·sim(Z1, Z6)/τ) ) )
formula (5)
In equation (5):
ℓ_A is the loss corresponding to the data A;
sim(Z1, Zk), for k = 2, 3, 4, 5, 6, is the similarity of Z1 and Zk, and w_{1,k} is the weight of sim(Z1, Zk).
It can be understood that if the semantic information of the data A and the data B is similar, the similarities among the bone features Z1, Z4, Z2 and Z5 are high, so the denominator terms sim(Z1, Z2) and sim(Z1, Z5) of formula (5) are both relatively large. However, because the weights w_{1,2} and w_{1,5} are relatively small, the contributions of sim(Z1, Z2) and sim(Z1, Z5) to the denominator of formula (5) become small, so the fraction inside the logarithm in formula (5) is relatively large, and the loss ℓ_A is in turn small, making the model easier to converge.
In yet another example, since equation (3) above can reinforce the effect of the first similarity on the loss of the first initial bone data, and equation (4) can weaken the effect of the second initial bone data on the loss of the first initial bone data, equation (3) and equation (4) may be combined to calculate the loss of the first initial bone data, for example, equation (6) may be referred to.
ℓ(i, j) = -log( exp(w_{i,j} · sim(z_i, z_j)/τ) / Σ_{k=1, k≠i}^{2N} exp(w_{i,k} · sim(z_i, z_k)/τ) )
Formula (6)
As for the meaning of each parameter in the formula (6), reference may be made to the above description sections for the formula (3) and the formula (4), and the description will not be repeated here.
In one example, the bone features directly output by the model may be ill-suited to calculating similarities between bone features. Therefore, in the embodiment of the present application, when calculating the similarity between bone features, a feature mapping module may first perform feature mapping on the bone features output by the model, mapping them into a feature space in which similarities between features are convenient to calculate, and the similarity between bone features is then calculated using the mapped features.
Taking the calculation of the similarity between the first bone feature and the second bone feature as an example, the first bone feature may be processed according to a feature mapping module to obtain a first feature, and the second bone feature may be processed according to the feature mapping module to obtain a second feature. And then, determining the cosine similarity of the first feature and the second feature as the first similarity.
The structure of the feature mapping module is not particularly limited in the embodiments of the present application, and in one example, the feature mapping module may be a Convolutional Neural Network (CNN) including a plurality of Convolutional layers.
In one example, the feature mapping module may be pre-trained.
In yet another example, the feature mapping module may be trained simultaneously with the model, in other words, during the training of the model, the parameters of the feature mapping module are adjusted according to the loss of the model. In this way, the feature mapping module and the model can be trained simultaneously without training the feature mapping module in advance.
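A minimal sketch of the feature mapping followed by cosine similarity is given below. The two-layer MLP and its sizes are assumptions; the patent only requires a module that maps features into a similarity-friendly space, such as a small convolutional network.

```python
import numpy as np

rng = np.random.default_rng(0)

class ProjectionHead:
    """Feature mapping module sketch: a two-layer MLP with ReLU (the layer
    count and dimensions here are assumptions, not the patent's design)."""
    def __init__(self, dim_in, dim_out):
        self.W1 = rng.normal(0.0, 0.1, (dim_in, dim_in))
        self.W2 = rng.normal(0.0, 0.1, (dim_in, dim_out))

    def __call__(self, x):
        h = np.maximum(x @ self.W1, 0.0)  # ReLU
        return h @ self.W2

def first_similarity(bone_feat_a, bone_feat_b, head):
    # Map both bone features, then take the cosine similarity of the results.
    f1, f2 = head(bone_feat_a), head(bone_feat_b)
    return float(f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-12))
```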
In the embodiment of the present application, the weight of the similarity between any two features may be obtained through pre-calculation, and a manner of calculating the weight of the similarity will be described next. In one example, the weight of any similarity may be calculated by the following steps S1-S3.
S1: calculating the similarity between any two of the 2 x N bone features based on the 2 x N bone features to obtain a 2N x 2N first similarity matrix, wherein the element corresponding to the ith row and the jth column of the first similarity matrix is used for indicating the similarity between the ith bone feature and the jth bone feature in the 2 x N features.
In the embodiment of the present application, the similarity between two bone features may be a cosine similarity of the two bone features. Regarding the way of calculating the cosine similarity, the embodiment of the present application is not particularly limited.
S2: and modifying the value of the diagonal element of the first similarity matrix into a preset value to obtain a second similarity matrix.
S3: and processing the second similarity matrix according to an optimal transmission distribution algorithm to obtain a weight matrix, wherein the element corresponding to the ith row and the jth column of the weight matrix is used for indicating the weight of the similarity of the ith bone feature and the jth bone feature in the 2 x N bone features.
Regarding S2 and S3, it should be noted that, since the diagonal elements of the first similarity matrix are used to indicate the similarity of a certain bone feature to itself, the value of the diagonal elements of the first similarity matrix is 1. In order to make the bone feature with the highest similarity to the first bone feature in the first bone feature set be the second bone feature in the second bone feature set, in the embodiment of the present application, the value of the diagonal element of the first similarity matrix may be modified to a smaller preset value, so as to obtain the second similarity matrix. The preset value is not particularly limited in the embodiments of the present application, and in one example, the preset value may be 1 e-3.
And then, processing the second similarity matrix according to an optimal transmission distribution algorithm to obtain a weight matrix. In the weight matrix, the sum of each row element is 1, and the sum of each column element is also 1. And, assuming that the weight of the first similarity corresponding to the first bone feature and the second bone feature is the element corresponding to the ith row and the jth column of the weight matrix, the element with the largest value among the 2 × N elements of the ith row of the weight matrix is the weight of the first similarity, and the element with the largest value among the 2 × N elements of the jth column of the weight matrix is the weight of the first similarity. It is understood that in the weight matrix, the element in the ith row is the weight of the similarity between the first bone feature and the 2 × N bone features, and the element in the jth column is the weight of the similarity between the second bone feature and the 2 × N bone features. Therefore, by the weight matrix, it is known that: the weight corresponding to the first similarity is larger than the weight corresponding to the similarity between the first bone characteristic and a third bone characteristic, and the weight corresponding to the first similarity is larger than the weight corresponding to the similarity between the second bone characteristic and a fourth bone characteristic.
The embodiment of the present application does not specifically limit the optimal transmission allocation algorithm, which may be, for example, a Sinkhorn-Knopp algorithm.
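Steps S1-S3 can be sketched as follows; the exponential lift before normalisation and the fixed iteration count are assumptions about how the Sinkhorn-Knopp step is realised.

```python
import numpy as np

def similarity_weights(features, preset=1e-3, n_iter=100):
    """S1-S3: pairwise cosine similarities -> diagonal replaced by a small
    preset value -> Sinkhorn-Knopp iteration towards a doubly-stochastic
    weight matrix (every row and every column sums to 1)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                      # S1: 2N x 2N first similarity matrix
    np.fill_diagonal(sim, preset)      # S2: suppress self-similarity
    w = np.exp(sim)                    # positive matrix for Sinkhorn-Knopp
    for _ in range(n_iter):            # S3: alternate row/column normalisation
        w = w / w.sum(axis=1, keepdims=True)
        w = w / w.sum(axis=0, keepdims=True)
    return w
```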
The model network structure is not particularly limited by the embodiments of the present application, and in one example, the model may be a convolutional neural network including a plurality of convolutional layers.
As mentioned above, each piece of the N pieces of initial skeletal data includes data in at least one of the following dimensions: an articulation point dimension, an articulation point motion dimension, a bone dimension, and a bone motion dimension. When each piece of initial bone data comprises data of at least two dimensions, the model can comprise a feature extraction module and a feature fusion module. Wherein: after the first bone data is input into the model, the feature extraction module is configured to perform feature extraction on data of each dimension in the first bone data, and the feature fusion module is configured to fuse the extracted data features corresponding to each dimension to obtain the bone feature (i.e., the first bone feature) of the first initial bone data. Correspondingly, after the second bone data is input into the model, the feature extraction module is configured to perform feature extraction on data of each dimension in the second bone data, and the feature fusion module is configured to fuse the extracted data features corresponding to each dimension to obtain the bone feature of the first initial bone data (i.e., the second bone feature).
In one example, the model may include a plurality of feature extraction modules and one feature fusion module, where the number of feature extraction modules is the same as the number of dimensions of the first initial bone data, and one feature extraction module corresponds to one dimension. For example, when the first initial bone data includes data of an articulation point dimension, an articulation point motion dimension, a bone dimension, and a bone motion dimension, the model includes 4 feature extraction modules and one feature fusion module, the 4 feature extraction modules being respectively: a feature extraction module for extracting features of the data of the articulation point dimension, a feature extraction module for extracting features of the data of the articulation point motion dimension, a feature extraction module for extracting features of the data of the bone dimension, and a feature extraction module for extracting features of the data of the bone motion dimension.
Regarding the structure of the model, the case in which the first initial bone data includes data of the aforementioned four dimensions is described below with reference to fig. 2. Fig. 2 is a schematic structural diagram of a model for extracting human bone features according to an embodiment of the present disclosure.
As shown in fig. 2, the model 200 includes an articulation point feature extraction module 210, an articulation point motion feature extraction module 220, a bone feature extraction module 230, a bone motion feature extraction module 240, and a feature fusion module 250.
Wherein:
after entering the first bone data into the model 200:
the joint point feature extraction module 210 is configured to extract features of joint point data in the first bone data;
the joint point movement feature extraction module 220 is configured to extract features of joint point movement data in the first bone data;
the bone feature extraction module 230 is configured to extract features of bone data in the first bone data;
the bone motion feature extraction module 240 is configured to extract features of bone motion data in the first bone data;
the feature fusion module 250 is configured to fuse the output of the joint point feature extraction module 210, the output of the joint point motion feature extraction module 220, the output of the bone feature extraction module 230, and the output of the bone motion feature extraction module 240 to obtain a first bone feature.
Similarly,
after inputting the second bone data into the model 200:
the joint point feature extraction module 210 is configured to extract features of joint point data in the second bone data;
the joint point movement feature extraction module 220 is configured to extract features of joint point movement data in the second bone data;
the bone feature extraction module 230 is configured to extract features of bone data in the second bone data;
the bone motion feature extraction module 240 is configured to extract features of the bone motion data in the second bone data;
the feature fusion module 250 is configured to fuse the output of the joint point feature extraction module 210, the output of the joint point motion feature extraction module 220, the output of the bone feature extraction module 230, and the output of the bone motion feature extraction module 240 to obtain a second bone feature.
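The division of labor among modules 210-250 can be illustrated with a minimal structural sketch. The averaging "extractor" below merely stands in for the convolutional layers mentioned earlier, and the concatenation-based fusion is an assumption — the embodiment does not fix a particular fusion operation:

```python
def make_extractor():
    """Per-dimension feature extractor (placeholder for a conv stack)."""
    def extract(frames):
        # frames: a list of per-frame feature vectors for one dimension.
        length = len(frames[0])
        # Average over time as a stand-in for learned temporal convolutions.
        return [sum(f[k] for f in frames) / len(frames) for k in range(length)]
    return extract

# Stand-ins for modules 210, 220, 230 and 240, one per data dimension.
extractors = {name: make_extractor()
              for name in ("joint", "joint_motion", "bone", "bone_motion")}

def fuse(per_dim_features):
    """Stand-in for feature fusion module 250: concatenate the outputs."""
    fused = []
    for name in ("joint", "joint_motion", "bone", "bone_motion"):
        fused.extend(per_dim_features[name])
    return fused

def model_forward(skeleton_data):
    """skeleton_data maps each dimension name to its frame sequence."""
    per_dim = {name: extractors[name](skeleton_data[name])
               for name in extractors}
    return fuse(per_dim)
```

Running the same forward pass on the first bone data and the second bone data yields the first bone feature and the second bone feature, respectively, as described above.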
The embodiment of the application also provides a human skeleton data processing system. Fig. 3 is a schematic structural diagram of a human bone data processing system according to an embodiment of the present disclosure. As shown in fig. 3, the system 300 includes a model 310 for extracting human skeletal features, a data enhancement model 320, and a feature mapping model 330. Wherein:
the model 310 is obtained by training according to the method described in fig. 1;
the data enhancement model 320 is used for enhancing the initial bone data to obtain training bone data;
the feature mapping model 330 is configured to process the bone features output by the model for extracting human bone features, and output features used for calculating loss of the model for extracting human bone features.
With regard to the specific implementation manner of the data enhancement model 320 for performing the enhancement processing on the initial bone data, reference may be made to the description part of the above method embodiment, and the description will not be repeated here.
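For illustration only, the data enhancement model 320 can be pictured as producing two differently augmented views of the same initial skeleton. The specific augmentations below (planar rotation and coordinate jitter) are assumptions made for this sketch, not the enhancement modes defined in the method embodiment:

```python
import math
import random

def rotate_xy(skeleton, angle):
    """Rotate every (x, y) joint coordinate about the origin."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in skeleton]

def jitter(skeleton, scale, rng):
    """Add small random noise to each joint coordinate."""
    return [(x + rng.uniform(-scale, scale), y + rng.uniform(-scale, scale))
            for x, y in skeleton]

def two_views(skeleton, rng):
    """First and second data enhancement modes: two different random
    augmentations of the same piece of initial skeleton data."""
    view1 = rotate_xy(skeleton, rng.uniform(-0.3, 0.3))
    view2 = jitter(skeleton, 0.05, rng)
    return view1, view2
```

Applying `two_views` to each of the N pieces of initial bone data would yield the first bone data set and the second bone data set used for training.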
The feature mapping model 330 mentioned here may correspond to the feature mapping module in the above method embodiment. With respect to the specific implementation of the feature mapping model 330, reference may be made to the above description of the feature mapping module, and a repeated description is not made here.
The embodiment of the application also provides a model training method for extracting features. Referring to fig. 4, which is a schematic flowchart of a model training method for extracting features according to an embodiment of the present application, the method may be implemented, for example, through the following S401-S402.
S401: the method comprises the steps of obtaining a first data set and a second data set, wherein the first data set and the second data set both comprise N data, the first data set is obtained after data enhancement is carried out on N initial data in a first data enhancement mode, and the second data set is obtained after data enhancement is carried out on the N initial data in a second data enhancement mode.
S402: training a model for extracting features based on the first data set and the second data set.
Wherein:
the first data set obtains a first feature set through the model, the second data set obtains a second feature set through the model, and the first feature set and the second feature set both comprise N features; the loss of the model is determined based on the loss of the N initial data, the N initial data comprises first initial data, the loss of the first initial data is determined according to a first similarity of a first feature and a second feature and a weight corresponding to the first similarity, the first feature is a feature corresponding to the first initial data in the first feature set, the second feature is a feature corresponding to the first initial data in the second feature set, the weight corresponding to the first similarity is greater than the weight corresponding to the similarity of the first feature and a third feature, and the weight corresponding to the first similarity is greater than the weight corresponding to the similarity of the second feature and a fourth feature, wherein the third feature is any one of the (2 × N-1) other features than the second feature among the 2 × N features, the fourth feature is any one of the (2 × N-1) other features than the first feature among the 2 × N features, and the 2 × N features include the N features of the first feature set and the N features of the second feature set.
Optionally,
the loss of the first initial data is determined according to the first similarity, the weight corresponding to the first similarity, the similarity between the first feature and each of the (2 × N-1) features other than the first feature, and the weight of each such similarity.
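One way to realize a loss of this shape is a weighted InfoNCE-style objective over the positive pair and all other similarities. The embodiment does not spell out the exact formula, so the softmax form and the temperature value below are illustrative assumptions:

```python
import math

def weighted_contrastive_loss(sim_row, weight_row, pos_index, temperature=0.1):
    """Loss contribution of one piece of initial data.

    sim_row   : similarities of the first feature to all 2N features.
    weight_row: the matching row of the weight matrix.
    pos_index : position of the second feature (the positive pair).
    """
    # Weight each similarity, then apply a temperature-scaled softmax;
    # the positive pair supplies the numerator, all pairs the denominator.
    logits = [w * s / temperature for s, w in zip(sim_row, weight_row)]
    denom = sum(math.exp(v) for v in logits)
    return -math.log(math.exp(logits[pos_index]) / denom)
```

Under this sketch, a sample whose positive pair has both high similarity and high weight incurs a smaller loss than a sample whose negatives dominate, which is the behavior the weighting scheme is meant to encourage.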
Optionally, the first similarity between the first feature and the second feature is determined by:
processing the first characteristic according to a characteristic mapping module to obtain a first mapping characteristic;
processing the second characteristic according to the characteristic mapping module to obtain a second mapping characteristic;
and determining the cosine similarity of the first mapping characteristic and the second mapping characteristic as the first similarity.
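The three steps above can be sketched directly. The single linear layer standing in for the feature mapping module is an assumption; the module's internal structure is not limited by the embodiment:

```python
import math

def map_feature(feature, weights):
    """Feature mapping module, sketched as one linear layer (assumption)."""
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]

def cosine_similarity(a, b):
    """Cosine similarity of two mapped features."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def first_similarity(feat1, feat2, weights):
    """Map both features, then take their cosine similarity."""
    return cosine_similarity(map_feature(feat1, weights),
                             map_feature(feat2, weights))
```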
When this method is applied to the scenario of extracting human skeleton features, the first feature mentioned here corresponds to the aforementioned first bone feature, and the second feature corresponds to the second bone feature; likewise, the first mapping feature corresponds to the first feature of that scenario, and the second mapping feature corresponds to the second feature of that scenario.
Optionally, in the training process of the model, the parameters of the feature mapping module are adjusted according to the loss of the model.
Optionally, the first initial data includes data of at least one dimension.
Optionally, the model includes a feature extraction sub-module and a feature fusion sub-module; when the first initial data comprises data of at least two dimensions, the feature extraction submodule is used for extracting features of the data of each dimension in the first data, and the feature fusion submodule is used for fusing data features corresponding to the extracted dimensions to obtain features of the first initial data, wherein the first data is obtained by processing the first initial data in a first data enhancement mode.
Optionally, the method further includes:
calculating the similarity between any two features of the 2 x N features based on the 2 x N features to obtain a 2N x 2N first similarity matrix, where an element corresponding to the ith row and the jth column of the first similarity matrix is used to indicate the similarity between the ith feature and the jth feature in the 2 x N features;
modifying the value of the diagonal element of the first similarity matrix into a preset value to obtain a second similarity matrix;
and processing the second similarity matrix according to an optimal transmission distribution algorithm to obtain a weight matrix, wherein the element corresponding to the ith row and the jth column of the weight matrix is used for indicating the weight of the similarity of the ith characteristic and the jth characteristic in the 2 x N characteristics.
The method shown in fig. 4 can be applied to various scenarios. Training the model for extracting human skeleton features is one such scenario. Of course, other scenarios are also possible, such as training a model for extracting image features, and so on.
The method shown in fig. 4 is based on the same concept as the method shown in fig. 1. Therefore, with regard to the specific implementation of the method shown in fig. 4, reference may be made to the description above of the method shown in fig. 1, and the description is not repeated here.
Based on the methods provided by the above embodiments, the embodiments of the present application also provide corresponding apparatuses, which are described below with reference to the accompanying drawings.
Referring to fig. 5, the figure is a schematic structural diagram of a model training device for extracting human bone features according to an embodiment of the present application. The apparatus 500 may specifically include, for example: an acquisition unit 501 and a training unit 502.
An obtaining unit 501, configured to obtain a first bone data set and a second bone data set, where the first bone data set and the second bone data set both include N pieces of bone data, the first bone data set is obtained by performing data enhancement on N pieces of initial bone data in a first data enhancement manner, and the second bone data set is obtained by performing data enhancement on the N pieces of initial bone data in a second data enhancement manner;
a training unit 502 for training a model for extracting human skeletal features based on the first skeletal data set and the second skeletal data set; wherein:
the first set of skeletal data is derived via the model from a first set of skeletal features, the second set of skeletal data is derived via the model from a second set of skeletal features, the first set of skeletal features and the second set of skeletal features each comprise N skeletal features; a loss of the model is determined based on a loss of the N initial skeletal data, the N initial skeletal data including first initial skeletal data, the loss of the first initial skeletal data being determined according to a first similarity between a first skeletal feature and a second skeletal feature, and a weight corresponding to the first similarity, the first skeletal feature being a feature of the first skeletal feature set corresponding to the first initial skeletal data, the second skeletal feature being a feature of the second skeletal feature set corresponding to the first initial skeletal data, and the weight corresponding to the first similarity being greater than the weight corresponding to a similarity between the first skeletal feature and a third skeletal feature, the weight corresponding to the first similarity being greater than the weight corresponding to a similarity between the second skeletal feature and a fourth skeletal feature, wherein, the third bone feature is any one of (2 x N-1) other bone features than the second bone feature, the fourth bone feature is any one of (2 x N-1) other bone features than the first bone feature, and the 2 x N bone features include N bone features of the first set of bone features and N bone features of the second set of bone features.
Optionally,
the loss of the first initial bone data is determined according to the first similarity, the weight corresponding to the first similarity, the similarity between the first bone feature and each of the (2 × N-1) bone features other than the first bone feature, and the weight of each such similarity.
Optionally, the first similarity between the first bone feature and the second bone feature is determined by:
processing the first bone feature according to a feature mapping module to obtain a first feature;
processing the second bone features according to the feature mapping module to obtain second features;
and determining the cosine similarity of the first feature and the second feature as the first similarity.
Optionally, in the training process of the model, the parameters of the feature mapping module are adjusted according to the loss of the model.
Optionally, the first initial bone data includes data of at least one of the following dimensions:
an articulation point dimension, an articulation point motion dimension, a bone dimension, and a bone motion dimension.
Optionally, the model includes a feature extraction module and a feature fusion module; when the first initial bone data comprises data of at least two dimensions, the feature extraction module is used for extracting features of the data of each dimension in the first bone data, and the feature fusion module is used for fusing data features corresponding to the extracted dimensions to obtain bone features of the first initial bone data, wherein the first bone data is obtained by processing the first initial bone data in a first data enhancement mode.
Optionally, the apparatus further comprises:
a calculating unit, configured to calculate, based on the 2 × N bone features, a similarity between any two of the 2 × N bone features, to obtain a 2N × 2N first similarity matrix, where an element corresponding to an ith row and a jth column of the first similarity matrix is used to indicate a similarity between an ith bone feature and a jth bone feature of the 2 × N bone features;
the modification unit is used for modifying the value of the diagonal element of the first similarity matrix into a preset value to obtain a second similarity matrix;
and the processing unit is used for processing the second similarity matrix according to an optimal transmission distribution algorithm to obtain a weight matrix, and elements corresponding to the ith row and the jth column of the weight matrix are used for indicating the weight of the similarity between the ith bone feature and the jth bone feature in the 2 x N bone features.
Since the apparatus 500 corresponds to the method shown in fig. 1, and the specific implementation of each unit of the apparatus 500 is based on the same concept as the above method embodiment, reference may be made to the relevant description of the above method embodiment for the specific implementation of each unit of the apparatus 500; details are not repeated here.
Referring to fig. 6, the drawing is a schematic structural diagram of a model training apparatus for extracting features according to an embodiment of the present application. The apparatus 600 may specifically include, for example: an acquisition unit 601 and a training unit 602.
An obtaining unit 601, configured to obtain a first data set and a second data set, where the first data set and the second data set both include N pieces of data, the first data set is obtained by performing data enhancement on N pieces of initial data in a first data enhancement manner, and the second data set is obtained by performing data enhancement on the N pieces of initial data in a second data enhancement manner;
a training unit 602, configured to train a model for extracting features based on the first data set and the second data set; wherein:
the first data set obtains a first feature set through the model, the second data set obtains a second feature set through the model, and the first feature set and the second feature set both comprise N features; the loss of the model is determined based on the loss of the N initial data, the N initial data comprises first initial data, the loss of the first initial data is determined according to a first similarity of a first feature and a second feature and a weight corresponding to the first similarity, the first feature is a feature corresponding to the first initial data in the first feature set, the second feature is a feature corresponding to the first initial data in the second feature set, the weight corresponding to the first similarity is greater than the weight corresponding to the similarity of the first feature and a third feature, the weight corresponding to the first similarity is greater than the weight corresponding to the similarity of the second feature and a fourth feature, wherein the third feature is any one of (2N-1) features except the second feature in the 2N features, the fourth feature is any one of (2 × N-1) other features than the first feature among the 2 × N features, and the 2 × N features include N features of the first feature set and N features of the second feature set.
Optionally,
the loss of the first initial data is determined according to the first similarity, the weight corresponding to the first similarity, the similarity between the first feature and each of the (2 × N-1) features other than the first feature, and the weight of each such similarity.
Optionally, the first similarity between the first feature and the second feature is determined by:
processing the first characteristic according to a characteristic mapping module to obtain a first mapping characteristic;
processing the second characteristic according to the characteristic mapping module to obtain a second mapping characteristic;
and determining the cosine similarity of the first mapping characteristic and the second mapping characteristic as the first similarity.
Optionally, in the training process of the model, the parameters of the feature mapping module are adjusted according to the loss of the model.
Optionally, the first initial data includes data of at least one dimension.
Optionally, the model includes a feature extraction sub-module and a feature fusion sub-module; when the first initial data comprises data of at least two dimensions, the feature extraction submodule is used for extracting features of the data of each dimension in the first data, and the feature fusion submodule is used for fusing data features corresponding to the extracted dimensions to obtain features of the first initial data, wherein the first data is obtained by processing the first initial data in a first data enhancement mode.
Optionally, the apparatus further comprises:
a calculating unit, configured to calculate, based on the 2 × N features, a similarity between any two features in the 2 × N features to obtain a 2N × 2N first similarity matrix, where an element corresponding to an ith row and a jth column of the first similarity matrix is used to indicate a similarity between an ith feature and a jth feature in the 2 × N features;
the modification unit is used for modifying the value of the diagonal element of the first similarity matrix into a preset value to obtain a second similarity matrix;
and the processing unit is used for processing the second similarity matrix according to an optimal transmission allocation algorithm to obtain a weight matrix, wherein the element corresponding to the ith row and the jth column of the weight matrix is used for indicating the weight of the similarity of the ith feature and the jth feature in the 2 x N features.
Since the apparatus 600 corresponds to the method shown in fig. 4, and the specific implementation of each unit of the apparatus 600 is based on the same concept as the above method embodiment, reference may be made to the relevant description of the above method embodiment for the specific implementation of each unit of the apparatus 600; details are not repeated here.
An embodiment of the present application further provides an apparatus, including: a processor, a memory, and a system bus; the processor and the memory are connected through the system bus; the memory is for storing one or more programs, the one or more programs comprising instructions, which when executed by the processor, cause the processor to perform the method of any of the above method embodiments.
The present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are executed on a terminal device, the instructions cause the terminal device to perform the method according to any one of the above method embodiments.
An embodiment of the present application further provides a computer program product, which, when running on a terminal device, causes the terminal device to execute the method described in any one of the above method embodiments.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice in the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (11)

1. A model training method for extracting human skeletal features, the method comprising:
acquiring a first bone data set and a second bone data set, wherein the first bone data set and the second bone data set both comprise N pieces of bone data, the first bone data set is obtained by performing data enhancement on N pieces of initial bone data in a first data enhancement mode, and the second bone data set is obtained by performing data enhancement on the N pieces of initial bone data in a second data enhancement mode;
training a model for extracting human skeletal features based on the first set of skeletal data and the second set of skeletal data; wherein:
the first set of skeletal data is derived via the model from a first set of skeletal features, the second set of skeletal data is derived via the model from a second set of skeletal features, the first set of skeletal features and the second set of skeletal features each comprise N skeletal features; a loss of the model is determined based on a loss of the N initial skeletal data, the N initial skeletal data including first initial skeletal data, the loss of the first initial skeletal data being determined according to a first similarity between a first skeletal feature and a second skeletal feature, and a weight corresponding to the first similarity, the first skeletal feature being a feature of the first skeletal feature set corresponding to the first initial skeletal data, the second skeletal feature being a feature of the second skeletal feature set corresponding to the first initial skeletal data, and the weight corresponding to the first similarity being greater than the weight corresponding to a similarity between the first skeletal feature and a third skeletal feature, the weight corresponding to the first similarity being greater than the weight corresponding to a similarity between the second skeletal feature and a fourth skeletal feature, wherein, the third bone feature is any one of (2 x N-1) other bone features than the second bone feature, the fourth bone feature is any one of (2 x N-1) other bone features than the first bone feature, and the 2 x N bone features include N bone features of the first set of bone features and N bone features of the second set of bone features.
2. The method of claim 1,
and determining the loss of the first initial bone data according to the first similarity, the weight corresponding to the first similarity, and the similarity corresponding to each bone feature in the first bone feature and the other (2 x N-1) bone features except the first bone feature, and the weight of the similarity corresponding to each bone feature.
3. The method of claim 1, wherein the first similarity between the first bone feature and the second bone feature is determined by:
processing the first bone feature according to a feature mapping module to obtain a first feature;
processing the second bone features according to the feature mapping module to obtain second features;
and determining the cosine similarity of the first feature and the second feature as the first similarity.
4. The method of claim 3, wherein parameters of the feature mapping module are adjusted according to a loss of the model during the training of the model.
5. The method of any one of claims 1-4, wherein the first initial bone data comprises data in at least one of the following dimensions:
an articulation point dimension, an articulation point motion dimension, a bone dimension, and a bone motion dimension.
6. The method of claim 5, wherein the model comprises a feature extraction module and a feature fusion module; when the first initial bone data comprises data of at least two dimensions, the feature extraction module is used for extracting features of the data of each dimension in the first bone data, and the feature fusion module is used for fusing data features corresponding to the extracted dimensions to obtain bone features of the first initial bone data, wherein the first bone data is obtained by processing the first initial bone data in a first data enhancement mode.
7. The method according to claim 1 or 2, characterized in that the method further comprises:
calculating the similarity between any two of the 2 x N bone features based on the 2 x N bone features to obtain a 2N x 2N first similarity matrix, wherein an ith row and a jth column of the first similarity matrix correspond to elements for indicating the similarity between an ith bone feature and a jth bone feature in the 2 x N bone features;
modifying the value of the diagonal element of the first similarity matrix into a preset value to obtain a second similarity matrix;
and processing the second similarity matrix according to an optimal transmission distribution algorithm to obtain a weight matrix, wherein the element corresponding to the ith row and the jth column of the weight matrix is used for indicating the weight of the similarity of the ith bone feature and the jth bone feature in the 2 x N bone features.
8. A human skeletal data processing system, characterized in that the system comprises:
a model, a data enhancement model and a feature mapping model which are obtained by training by adopting the method of any one of claims 1 to 7 and used for extracting human skeleton features;
the data enhancement model is used for enhancing the initial bone data to obtain training bone data;
the characteristic mapping model is used for processing the bone characteristics output by the model for extracting the human bone characteristics and outputting the characteristics for calculating the loss of the model for extracting the human bone characteristics.
9. A model training apparatus for extracting human skeletal features, the apparatus comprising:
an obtaining unit, configured to obtain a first bone data set and a second bone data set, where the first bone data set and the second bone data set both include N pieces of bone data, the first bone data set is obtained by performing data enhancement on N pieces of initial bone data in a first data enhancement manner, and the second bone data set is obtained by performing data enhancement on the N pieces of initial bone data in a second data enhancement manner;
a training unit for training a model for extracting human skeletal features based on the first skeletal data set and the second skeletal data set; wherein:
the first set of skeletal data is derived via the model from a first set of skeletal features, the second set of skeletal data is derived via the model from a second set of skeletal features, the first set of skeletal features and the second set of skeletal features each comprise N skeletal features; a loss of the model is determined based on a loss of the N initial skeletal data, the N initial skeletal data including first initial skeletal data, the loss of the first initial skeletal data being determined according to a first similarity between a first skeletal feature and a second skeletal feature, and a weight corresponding to the first similarity, the first skeletal feature being a feature of the first skeletal feature set corresponding to the first initial skeletal data, the second skeletal feature being a feature of the second skeletal feature set corresponding to the first initial skeletal data, and the weight corresponding to the first similarity being greater than the weight corresponding to a similarity between the first skeletal feature and a third skeletal feature, the weight corresponding to the first similarity being greater than the weight corresponding to a similarity between the second skeletal feature and a fourth skeletal feature, wherein, the third bone feature is any one of (2 x N-1) other bone features than the second bone feature, the fourth bone feature is any one of (2 x N-1) other bone features than the first bone feature, and the 2 x N bone features include N bone features of the first set of bone features and N bone features of the second set of bone features.
10. A model training apparatus, characterized in that the apparatus comprises: a processor, a memory, and a system bus; the processor and the memory are connected through the system bus; the memory is configured to store one or more programs, the one or more programs comprising instructions which, when executed by the processor, cause the processor to perform the method of any one of claims 1 to 7.
11. A computer-readable storage medium having stored therein instructions which, when run on a terminal device, cause the terminal device to perform the method of any one of claims 1 to 7.
CN202111348828.2A 2021-11-15 2021-11-15 Model training method and device for extracting human skeleton features Active CN113792821B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111348828.2A CN113792821B (en) 2021-11-15 2021-11-15 Model training method and device for extracting human skeleton features

Publications (2)

Publication Number Publication Date
CN113792821A CN113792821A (en) 2021-12-14
CN113792821B true CN113792821B (en) 2022-02-15

Family

ID=78955203

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111348828.2A Active CN113792821B (en) 2021-11-15 2021-11-15 Model training method and device for extracting human skeleton features

Country Status (1)

Country Link
CN (1) CN113792821B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112668718A (en) * 2021-01-19 2021-04-16 北京市商汤科技开发有限公司 Neural network training method and device, electronic equipment and storage medium
CN112820322A (en) * 2021-03-18 2021-05-18 中国科学院声学研究所 Semi-supervised audio event labeling method based on self-supervised contrast learning
CN113344189A (en) * 2021-06-23 2021-09-03 北京市商汤科技开发有限公司 Neural network training method and device, computer equipment and storage medium
CN113392216A (en) * 2021-06-23 2021-09-14 武汉大学 Remote supervision relation extraction method and device based on consistency text enhancement
CN113610016A (en) * 2021-08-11 2021-11-05 人民中科(济南)智能技术有限公司 Training method, system, equipment and storage medium of video frame feature extraction model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112819020A (en) * 2019-11-15 2021-05-18 富士通株式会社 Method and device for training classification model and classification method


Similar Documents

Publication Publication Date Title
CN111354017B (en) Target tracking method based on twin neural network and parallel attention module
CN109840883B (en) Method and device for training object recognition neural network and computing equipment
CN112784929B (en) Small sample image classification method and device based on double-element group expansion
CN111680550B (en) Emotion information identification method and device, storage medium and computer equipment
CN111539290A (en) Video motion recognition method and device, electronic equipment and storage medium
CN109034086A (en) Vehicle recognition methods, apparatus and system again
CN112995652A (en) Video quality evaluation method and device
CN113392937A (en) 3D point cloud data classification method and related device thereof
CN113095038A (en) Font generation method and device for generating countermeasure network based on multitask discriminator
CN116994021A (en) Image detection method, device, computer readable medium and electronic equipment
CN117115900B (en) Image segmentation method, device, equipment and storage medium
CN113792821B (en) Model training method and device for extracting human skeleton features
CN116798127A (en) Taiji boxing whole body posture estimation method, device and medium based on full convolution
CN112528872A (en) Training method and device of face detection model based on video stream and computing equipment
CN109598201B (en) Action detection method and device, electronic equipment and readable storage medium
CN115829962B (en) Medical image segmentation device, training method, and medical image segmentation method
CN116797681A (en) Text-to-image generation method and system for progressive multi-granularity semantic information fusion
CN116844008A (en) Attention mechanism guided content perception non-reference image quality evaluation method
CN114882220B (en) Domain-adaptive priori knowledge-based GAN (generic object model) image generation method and system
CN116091885A (en) RAU-GAN-based lung nodule data enhancement method
CN113408528B (en) Quality recognition method and device for commodity image, computing equipment and storage medium
CN112329736B (en) Face recognition method and financial system
CN114241514B (en) Model training method and device for extracting human skeleton characteristics
CN113840169B (en) Video processing method, device, computing equipment and storage medium
CN115223190A (en) Posture estimation method and system based on human body structure guide learning network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant