CN112818898B - Model training method and device and electronic equipment - Google Patents

Model training method and device and electronic equipment

Info

Publication number
CN112818898B
CN112818898B
Authority
CN
China
Prior art keywords
sample, human body, training, inertial motion, dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110195069.4A
Other languages
Chinese (zh)
Other versions
CN112818898A (en)
Inventor
罗宇轩
唐堂
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Priority to CN202110195069.4A
Publication of CN112818898A
Application granted
Publication of CN112818898B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Abstract

The embodiments of the invention disclose a model training method, a model training apparatus, and an electronic device. One embodiment of the method comprises the following steps: acquiring a training sample set; selecting a training sample from the training sample set and, based on the selected training sample, performing the following training steps: inputting the sample human body image of the selected training sample into an initial neural network to obtain three-dimensional human body pose information; determining a transformation matrix between the three-dimensional human body pose information and the sample inertial motion capture data; converting the pose keypoints and the inertial motion capture three-dimensional points into the same coordinate system using the transformation matrix, and determining the difference between the pose keypoints and the inertial motion capture three-dimensional points; adjusting network parameters of the initial neural network based on the determined difference; and, if a training end condition is satisfied, determining the adjusted initial neural network as the trained three-dimensional human body pose prediction network. This implementation saves the cost of calibration between different coordinate systems.

Description

Model training method and device and electronic equipment
Technical Field
Embodiments of the present disclosure relate to the field of computer technology, and in particular to a model training method, a model training apparatus, and an electronic device.
Background
Human pose estimation is an important task in computer vision and an essential step in understanding human actions and behaviors. In recent years, methods for estimating human pose using deep learning have been proposed in succession and have achieved performance far beyond that of traditional methods. In practice, human pose estimation is usually cast as a human keypoint prediction problem: the position coordinates of the human body's keypoints are predicted first, and the spatial relationships between the keypoints are then determined according to prior knowledge, yielding the predicted human skeleton.
Disclosure of Invention
This Summary is provided to introduce concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Embodiments of the present disclosure provide a model training method, a model training apparatus, and an electronic device, which save the cost of calibration between different coordinate systems and enable the three-dimensional human body pose prediction network to achieve better accuracy.
In a first aspect, an embodiment of the present disclosure provides a model training method, including: acquiring a training sample set, where a training sample includes a sample human body image and sample inertial motion capture data corresponding to the sample human body image, the sample inertial motion capture data being inertial motion capture data of the human body presented in the sample human body image, acquired when the sample human body image was captured; selecting a training sample from the training sample set and, based on the selected training sample, performing the following training steps: inputting the sample human body image of the selected training sample into an initial neural network to obtain three-dimensional human body pose information corresponding to the selected training sample; determining a transformation matrix between the three-dimensional human body pose information corresponding to the selected training sample and the corresponding sample inertial motion capture data; converting the pose keypoints indicated by the three-dimensional human body pose information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system using the transformation matrix, and determining the difference between the pose keypoints and the inertial motion capture three-dimensional points; adjusting network parameters of the initial neural network based on the determined difference; determining whether a preset training end condition is satisfied; and, if the training end condition is satisfied, determining the adjusted initial neural network as the trained three-dimensional human body pose prediction network.
In a second aspect, an embodiment of the present disclosure provides a model training apparatus, the apparatus comprising: a first acquisition unit, configured to acquire a training sample set, where a training sample includes a sample human body image and sample inertial motion capture data corresponding to the sample human body image, the sample inertial motion capture data being inertial motion capture data of the human body presented in the sample human body image, acquired when the sample human body image was captured; and a training unit, configured to select a training sample from the training sample set and, based on the selected training sample, perform the following training steps: inputting the sample human body image of the selected training sample into an initial neural network to obtain three-dimensional human body pose information corresponding to the selected training sample; determining a transformation matrix between the three-dimensional human body pose information corresponding to the selected training sample and the corresponding sample inertial motion capture data; converting the pose keypoints indicated by the three-dimensional human body pose information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system using the transformation matrix, and determining the difference between the pose keypoints and the inertial motion capture three-dimensional points; adjusting network parameters of the initial neural network based on the determined difference; determining whether a preset training end condition is satisfied; and, if the training end condition is satisfied, determining the adjusted initial neural network as the trained three-dimensional human body pose prediction network.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; and a storage device storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the model training method of the first aspect.
In a fourth aspect, an embodiment of the present disclosure provides a computer-readable medium having a computer program stored thereon which, when executed by a processor, implements the steps of the model training method of the first aspect.
Embodiments of the present disclosure provide a model training method, a model training apparatus, and an electronic device. A training sample set is first acquired; a training sample is then selected from the training sample set, and the following training steps are performed based on the selected training sample: inputting the sample human body image of the selected training sample into an initial neural network to obtain three-dimensional human body pose information corresponding to the selected training sample; determining a transformation matrix between the three-dimensional human body pose information corresponding to the selected training sample and the corresponding sample inertial motion capture data; converting the pose keypoints indicated by the three-dimensional human body pose information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system using the transformation matrix, and determining the difference between the pose keypoints and the inertial motion capture three-dimensional points; adjusting network parameters of the initial neural network based on the determined difference; determining whether a preset training end condition is satisfied; and, if the training end condition is satisfied, determining the adjusted initial neural network as the trained three-dimensional human body pose prediction network. By determining, during network training, the transformation matrix between the three-dimensional human body pose keypoints output by the network and the corresponding inertial motion capture three-dimensional points, the two sets of points can be converted into the same coordinate system. Compared with approaches that must calibrate the transformation between the inertial capture coordinate system and the camera coordinate system when inertial motion capture data are used as a dataset for a three-dimensional human pose estimation algorithm, this saves the cost of calibration between different coordinate systems and allows the three-dimensional human body pose prediction network to achieve better accuracy.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by reference to the following detailed description when taken in conjunction with the accompanying drawings. The same or similar reference numbers will be used throughout the drawings to refer to the same or like elements. It should be understood that the figures are schematic and that elements and components are not necessarily drawn to scale.
FIG. 1 is an exemplary system architecture diagram in which various embodiments of the present disclosure may be applied;
FIG. 2 is a flow chart of one embodiment of a model training method according to the present disclosure;
FIG. 3 is a flow chart of one embodiment of predicting three-dimensional human body pose information in a model training method according to the present disclosure;
FIG. 4 is a schematic structural view of one embodiment of a model training apparatus according to the present disclosure;
fig. 5 is a schematic diagram of a computer system suitable for use in implementing embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustration purposes only and are not intended to limit the scope of the present disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "including" and variations thereof as used herein are intended to be open-ended, i.e., including, but not limited to. The term "based on" is based at least in part on. The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments. Related definitions of other terms will be given in the description below.
It should be noted that the terms "first," "second," and the like in this disclosure are merely used to distinguish between different devices, modules, or units and are not used to define an order or interdependence of functions performed by the devices, modules, or units.
It should be noted that references to "a" and "a plurality" in this disclosure are illustrative rather than limiting; those of ordinary skill in the art will appreciate that they should be understood as "one or more" unless the context clearly indicates otherwise.
The names of messages or information interacted between the various devices in the embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
FIG. 1 illustrates an exemplary system architecture 100 in which embodiments of the model training methods of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include an inertial motion capture device 101, networks 1021, 1022, 1023, a terminal device 103, and a server 104. The network 1021 is the medium used to provide a communication link between the inertial motion capture device 101 and the terminal device 103. The network 1022 is used as a medium to provide a communication link between the inertial motion capture device 101 and the server 104. The network 1023 is a medium used to provide communication links between the terminal devices 103 and the server 104. The networks 1021, 1022, 1023 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
The inertial motion capture device 101 may include, but is not limited to: inertial motion capture sensors mounted on various nodes (e.g., joints) of the human body, and inertial motion capture garments. After inertial motion capture sensors are mounted at the joints of the user's body, or the user puts on an inertial motion capture garment, the pose and orientation of each body part can be acquired, and the inertial motion capture data can be transmitted to the terminal device 103 over the network 1021 or to the server 104 over the network 1022.
The user may interact with the server 104 through the network 1023 using the terminal device 103 to send or receive a message or the like, for example, the user may acquire a human body image using the terminal device 103, and the server 104 may acquire a human body image from the terminal device 103. Various communication client applications, such as an inertial motion capture application, an image acquisition application, instant messaging software, etc., may be installed on the terminal device 103.
The terminal device 103 may first acquire a training sample set, where the sample inertial motion capture data in the training sample set may be acquired from the inertial motion capture device 101; it may then select a training sample from the training sample set and, based on the selected training sample, perform the following training steps: inputting the sample human body image of the selected training sample into an initial neural network to obtain three-dimensional human body pose information corresponding to the selected training sample; determining a transformation matrix between the three-dimensional human body pose information corresponding to the selected training sample and the corresponding sample inertial motion capture data; converting the pose keypoints indicated by the three-dimensional human body pose information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system using the transformation matrix, and determining the difference between the pose keypoints and the inertial motion capture three-dimensional points; adjusting network parameters of the initial neural network based on the determined difference; determining whether a preset training end condition is satisfied; and, if the training end condition is satisfied, determining the adjusted initial neural network as the trained three-dimensional human body pose prediction network.
The terminal device 103 may be hardware or software. When the terminal device 103 is hardware, it may be any of various electronic devices that have a camera and support information interaction, including but not limited to a smartphone, a tablet computer, a laptop computer, and the like. When the terminal device 103 is software, it can be installed in the electronic devices listed above. It may be implemented as multiple pieces of software or software modules (e.g., multiple pieces of software or software modules for providing distributed services) or as a single piece of software or software module. This is not specifically limited herein.
The server 104 may be a server providing various services. For example, it may first acquire a training sample set, where the sample human body images in the training sample set may be acquired from the terminal device 103 and the sample inertial motion capture data may be acquired from the inertial motion capture device 101; it may then select a training sample from the training sample set and, based on the selected training sample, perform the following training steps: inputting the sample human body image of the selected training sample into an initial neural network to obtain three-dimensional human body pose information corresponding to the selected training sample; determining a transformation matrix between the three-dimensional human body pose information corresponding to the selected training sample and the corresponding sample inertial motion capture data; converting the pose keypoints indicated by the three-dimensional human body pose information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system using the transformation matrix, and determining the difference between the pose keypoints and the inertial motion capture three-dimensional points; adjusting network parameters of the initial neural network based on the determined difference; determining whether a preset training end condition is satisfied; and, if the training end condition is satisfied, determining the adjusted initial neural network as the trained three-dimensional human body pose prediction network.
The server 104 may be hardware or software. When the server 104 is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or as a single server. When the server 104 is software, it may be implemented as multiple pieces of software or software modules (e.g., for providing distributed services), or as a single piece of software or software module. This is not specifically limited herein.
It should be further noted that the model training method provided in the embodiments of the present disclosure may be executed by the server 104, in which case the model training apparatus is generally disposed in the server 104. The model training method provided by the embodiments of the present disclosure may also be executed by the terminal device 103, in which case the model training apparatus is generally disposed in the terminal device 103.
It should also be noted that, in the case where the model training method provided in the embodiments of the present disclosure is executed by the server 104, if the training sample set is stored locally in the server 104, the inertial motion capture device 101, the networks 1021, 1022, 1023, and the terminal device 103 may not be present in the exemplary system architecture 100.
It should be further noted that, in the case where the model training method provided in the embodiment of the present disclosure is executed by the terminal device 103, if information such as the inertial motion capturing data and the initial neural network is stored locally in the terminal device 103, the inertial motion capturing device 101, the networks 1021, 1022, 1023 and the server 104 may not exist in the exemplary system architecture 100.
It should be understood that the number of inertial motion capture devices, networks, terminal devices, and servers in FIG. 1 is merely illustrative. There may be any number of inertial motion capture devices, networks, terminal devices, and servers, as required by the implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of a model training method according to the present disclosure is shown. The model training method comprises the following steps:
Step 201, acquiring a training sample set.
In this embodiment, the execution subject of the model training method (e.g., the terminal device or the server shown in fig. 1) may acquire a training sample set. A training sample in the training sample set may include a sample human body image and sample inertial motion capture data corresponding to the sample human body image. The sample inertial motion capture data may be inertial motion capture data of the human body presented in the sample human body image, acquired when the sample human body image was captured. Inertial motion capture is a newer type of human motion capture technology: wireless motion pose sensors collect the pose and orientation of body parts, a human motion model is recovered based on the principles of human motion, and the data are transmitted wirelessly and displayed in computer software.
Here, a user's inertial motion may be captured by the inertial motion capture device while an imaging device simultaneously photographs the user, thereby obtaining a human body image corresponding to the inertial motion capture data.
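For illustration only, a training sample might be laid out as follows; this sketch and its field names are assumptions, not part of the patent text.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TrainingSample:
    """Hypothetical layout of one element of the training sample set."""
    image: np.ndarray       # H x W x 3 RGB frame of the subject
    imu_points: np.ndarray  # J x 3 joint positions from the inertial
                            # motion capture system, recorded in the
                            # inertial coordinate frame at the same
                            # instant the image was captured
```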
Step 202, selecting a training sample from the training sample set and, based on the selected training sample, performing the following training steps: inputting the sample human body image of the selected training sample into an initial neural network to obtain three-dimensional human body pose information corresponding to the selected training sample; determining a transformation matrix between the three-dimensional human body pose information corresponding to the selected training sample and the corresponding sample inertial motion capture data; converting the pose keypoints indicated by the three-dimensional human body pose information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system using the transformation matrix, and determining the difference between the pose keypoints and the inertial motion capture three-dimensional points; adjusting network parameters of the initial neural network based on the determined difference; determining whether a preset training end condition is satisfied; and, if the training end condition is satisfied, determining the adjusted initial neural network as the trained three-dimensional human body pose prediction network.
In this embodiment, the executing body may select a training sample from the training sample set obtained in step 201, and execute the following training steps based on the selected training sample.
In this embodiment, the training step 202 may include sub-steps 2021, 2022, 2023, 2024, 2025, and 2026. Wherein:
Step 2021, inputting the sample human body image of the selected training sample into the initial neural network to obtain three-dimensional human body pose information corresponding to the selected training sample.
In this embodiment, the execution body may input the sample human body image of the selected training sample into the initial neural network to obtain three-dimensional human body pose information corresponding to the selected training sample. The initial neural network may be any of various neural networks capable of deriving three-dimensional human body pose information from a human body image, for example a convolutional neural network or a deep neural network. The three-dimensional human body pose information may include the pose and orientation of various body parts, for example the directions and positions of the human body's joints.
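The patent does not fix a particular architecture. As a minimal sketch (using PyTorch purely for illustration; the class name, layer sizes, and joint count are assumptions), the initial neural network could be any image-to-keypoints regressor such as:

```python
import torch
import torch.nn as nn

class PoseNet(nn.Module):
    """Toy stand-in for the initial neural network: maps an RGB image
    to J three-dimensional pose keypoints of shape (J, 3)."""
    def __init__(self, num_joints: int = 17):
        super().__init__()
        self.num_joints = num_joints
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_joints * 3)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W) -> keypoints: (B, J, 3)
        return self.head(self.backbone(image)).view(-1, self.num_joints, 3)
```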
Step 2022, determining a transformation matrix between the three-dimensional human body pose information corresponding to the selected training sample and the corresponding sample inertial motion capture data.
In this embodiment, the execution body may determine a transformation matrix between the three-dimensional human body pose information corresponding to the selected training sample and the corresponding sample inertial motion capture data. The transformation matrix is a concept from linear algebra: a linear transformation can be represented by a matrix. If T is a linear transformation mapping R^n to R^m and x is a column vector with n elements, then the m×n matrix A satisfying T(x) = Ax is called the transformation matrix of T.
Here, the execution body may determine the transformation matrix between the three-dimensional human body pose information corresponding to the selected training sample and the corresponding sample inertial motion capture data using the least squares method. The least squares method is a mathematical optimization technique: it finds the best match to the data by minimizing the sum of squared errors, so that the sum of squared errors between the fitted values and the actual data is as small as possible.
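The patent does not name a specific solver. One standard way to realize such a least-squares fit between two matched 3D point sets is the closed-form Kabsch method; the sketch below is an assumption about how this step could look, not the patent's prescribed algorithm.

```python
import numpy as np

def fit_rigid_transform(src: np.ndarray, dst: np.ndarray):
    """Least-squares rigid transform (R, t) minimizing
    sum_i ||(R @ src[i] + t) - dst[i]||^2 over rotations R (Kabsch).

    src, dst: (N, 3) matched point sets, e.g. predicted pose keypoints
    and the corresponding inertial motion capture 3D points.
    """
    src_c = src - src.mean(axis=0)          # center both point sets
    dst_c = dst - dst.mean(axis=0)
    H = src_c.T @ dst_c                     # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(axis=0) - R @ src.mean(axis=0)
    return R, t
```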
Step 2023, converting the pose keypoints indicated by the three-dimensional human body pose information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system using the transformation matrix, and determining the difference between the pose keypoints and the inertial motion capture three-dimensional points.
In this embodiment, the execution body may use the transformation matrix to convert the pose keypoints indicated by the three-dimensional human body pose information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system. Here, the execution body may establish a coordinate system, transform the pose keypoints indicated by the three-dimensional human body pose information corresponding to the selected training sample into the established coordinate system, and transform the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample into the established coordinate system.
The execution body may then determine the difference between the pose keypoints and the inertial motion capture three-dimensional points. Specifically, the execution body may determine the difference using a preset loss function: for example, the mean squared error may be used as the loss function, or the L2 norm may be used as the loss function.
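Continuing the illustrative sketch (again an assumption, with the mean squared error standing in for the preset loss function):

```python
import torch

def keypoint_loss(pred_kpts: torch.Tensor,
                  imu_pts: torch.Tensor,
                  R: torch.Tensor,
                  t: torch.Tensor) -> torch.Tensor:
    """Map the predicted (J, 3) keypoints into the inertial coordinate
    system with the fitted transform, then score the mismatch against
    the inertial motion capture 3D points with a mean squared error."""
    aligned = pred_kpts @ R.T + t           # into the inertial frame
    return torch.mean((aligned - imu_pts) ** 2)
```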
Step 2024, adjusting the network parameters of the initial neural network based on the determined difference.
In this embodiment, the execution body may adjust the network parameters of the initial neural network based on the difference determined in step 2023. Here, the execution body may employ various implementations to adjust the network parameters based on the difference between the pose keypoints and the inertial motion capture three-dimensional points. For example, the BP (Back Propagation) algorithm or the SGD (Stochastic Gradient Descent) algorithm may be employed to adjust the network parameters of the initial neural network.
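In the PyTorch sketch used here, that adjustment is the usual backward pass plus an SGD update (illustrative only; `model`, `loss`, and the learning rate are assumed names and values from the sketches above, and the optimizer would normally be created once, outside the training step):

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

optimizer.zero_grad()   # clear gradients from the previous step
loss.backward()         # backpropagate the keypoint difference
optimizer.step()        # stochastic gradient descent update
```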
Step 2025, determining whether a preset training end condition is satisfied.
In this embodiment, the execution body may determine whether a preset training end condition is satisfied. The preset training end conditions may include, but are not limited to, at least one of the following: the training time exceeds a preset duration; the number of training iterations exceeds a preset number; the determined difference is less than a preset difference threshold.
If the training end condition is satisfied, the execution subject may execute step 2026.
Step 2026, if the training end condition is satisfied, determining the adjusted initial neural network as the trained three-dimensional human body pose prediction network.
In this embodiment, if it is determined in step 2025 that the training end condition is satisfied, the execution subject may determine the adjusted initial neural network as the trained three-dimensional human body pose prediction network.
The method provided by this embodiment of the disclosure determines, during network training, the transformation matrix between the three-dimensional human body pose keypoints output by the network and the corresponding inertial motion capture three-dimensional points, so that the two sets of points can be converted into the same coordinate system. Compared with approaches that must calibrate the transformation between the inertial capture coordinate system and the camera coordinate system when inertial motion capture data are used as a dataset for a three-dimensional human pose estimation algorithm, this saves the cost of calibration between different coordinate systems and allows the three-dimensional human body pose prediction network to achieve better accuracy.
In some alternative implementations, if it is determined in step 2025 that the training end condition is not satisfied, the execution subject may take the adjusted initial neural network as the initial neural network, select an unused training sample from the training sample set, and continue performing the training step (sub-steps 2021-2026) based on the re-selected training sample.
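Tying the sketches above together (PoseNet, fit_rigid_transform, keypoint_loss; all names, hyperparameters, and thresholds are illustrative assumptions, not the patent's prescription):

```python
import itertools
import torch

def train(model, samples, max_steps=10_000, loss_threshold=1e-4):
    """Minimal sketch of the training step: predict, align, compare,
    adjust, then re-select a sample until an end condition holds."""
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    for step, sample in enumerate(itertools.cycle(samples)):
        image = torch.as_tensor(sample.image, dtype=torch.float32)
        image = image.permute(2, 0, 1).unsqueeze(0)          # (1, 3, H, W)
        imu = torch.as_tensor(sample.imu_points, dtype=torch.float32)
        pred = model(image)[0]                               # (J, 3)
        # Fit the transform on detached predictions, then treat it as
        # a constant while scoring the difference.
        R, t = fit_rigid_transform(pred.detach().numpy(), sample.imu_points)
        loss = keypoint_loss(pred, imu,
                             torch.as_tensor(R, dtype=torch.float32),
                             torch.as_tensor(t, dtype=torch.float32))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Preset end conditions: iteration budget or small difference.
        if step + 1 >= max_steps or loss.item() < loss_threshold:
            return model   # trained 3D human body pose prediction network
```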
In some optional implementations, the training samples in the training sample set may include a sample human body video and sample inertial motion capture data corresponding to the sample human body video. The sample inertial motion capture data may be inertial motion capture data of the human body presented in the sample human body video, acquired when the sample human body video was captured. Here, a user's inertial motion may be captured by the inertial motion capture device while an imaging device simultaneously films the user, thereby obtaining a human body video corresponding to the inertial motion capture data. In this case, the execution subject may obtain the three-dimensional human body pose information corresponding to the selected training sample by inputting the sample human body video of the selected training sample into the initial neural network. The initial neural network may be any of various neural networks capable of deriving three-dimensional human body pose information from human body video, for example a convolutional neural network or a deep neural network. The three-dimensional human body pose information may include the pose and orientation of various body parts, for example the directions and positions of the human body's joints. Compared with obtaining the transformation matrix from the three-dimensional human body pose information of a single image and its corresponding inertial motion capture data, obtaining it from the three-dimensional human body pose information of a whole video and the corresponding inertial motion capture data uses a video-level transformation matrix and reduces error through a large amount of data: the random errors of individual samples partially cancel one another, and the smaller the average error of the network output, the smaller the error of the resulting affine transformation.
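A video-level fit can be sketched by stacking the matched points of every frame before solving once (an assumption, reusing fit_rigid_transform from above):

```python
import numpy as np

def fit_video_level_transform(pred_seq, imu_seq):
    """Fit ONE transform for a whole video.

    pred_seq, imu_seq: sequences of (J, 3) arrays, one per frame.
    Stacking them into a single (T*J, 3) problem lets random per-frame
    errors partially cancel in the least-squares fit.
    """
    src = np.concatenate(list(pred_seq), axis=0)
    dst = np.concatenate(list(imu_seq), axis=0)
    return fit_rigid_transform(src, dst)
```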
In some optional implementations, the execution body may use the transformation matrix to convert the pose keypoints indicated by the three-dimensional human body pose information corresponding to the selected training sample into the coordinate system in which the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data are located. The coordinate system in which the inertial motion capture three-dimensional points are located may be the human body coordinate system at the time the sample inertial motion capture data were acquired, for example a coordinate system whose origin is the lower-left corner of the human body region and whose axes are parallel and perpendicular to the ground.
In some optional implementations, the execution body may instead use the transformation matrix to convert the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample into the coordinate system in which the pose keypoints indicated by the three-dimensional human body pose information corresponding to the selected training sample are located. The coordinate system in which the pose keypoints are located may be the camera coordinate system corresponding to the sample human body image of the selected training sample.
With further reference to fig. 3, a flow 300 of one embodiment of predicting three-dimensional human body pose information in a model training method according to the present embodiment is shown. The flow 300 of predicting three-dimensional human body pose information includes the following steps:
Step 301, acquiring a human body image to be predicted.
In this embodiment, the execution subject of the model training method (for example, the terminal device or the server shown in fig. 1) may acquire the human body image to be predicted directly or indirectly. For example, when the execution subject is a terminal device, it may directly acquire a human body image to be predicted that is input by a user; when the execution subject is a server, it may acquire the human body image to be predicted, input by the user, from the terminal device over a wired or wireless connection. Here, the human body image may include various parts of the human body, such as the head, waist, arms, and legs.
Step 302, inputting the human body image into the trained three-dimensional human body pose prediction network to obtain three-dimensional human body pose information of the human body presented in the human body image.
In this embodiment, the execution subject may input the human body image into the trained three-dimensional human body pose prediction network to obtain three-dimensional human body pose information of the human body presented in the human body image. The three-dimensional human body pose prediction network here is one trained by the method shown in fig. 2, and can be used to characterize the correspondence between an image and the three-dimensional human body pose information of the human body presented in that image. The three-dimensional human body pose information may include the pose and orientation of various body parts, for example the directions and positions of the human body's joints.
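In the illustrative PyTorch sketch, prediction then reduces to a single forward pass (`model` and `image` are the assumed names from the earlier sketches):

```python
import torch

model.eval()              # the trained pose prediction network
with torch.no_grad():
    # image: (3, H, W) float tensor holding the human body image
    pose_3d = model(image.unsqueeze(0))[0]   # (J, 3) pose keypoints
```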
The method provided by this embodiment of the disclosure predicts the three-dimensional human body pose information of the human body presented in a human body image by inputting the image into a three-dimensional human body pose prediction network trained with the method shown in fig. 2; in this way, the accuracy of the predicted three-dimensional human body pose information can be improved.
With further reference to fig. 4, as an implementation of the method shown in the foregoing figures, the present disclosure provides an embodiment of a model training apparatus, which corresponds to the method embodiment shown in fig. 2, and which is particularly applicable to various electronic devices.
As shown in fig. 4, the model training apparatus 400 of this embodiment includes: a first acquisition unit 401 and a training unit 402. The first acquisition unit 401 is configured to acquire a training sample set, where a training sample includes a sample human body image and sample inertial motion capture data corresponding to the sample human body image, the sample inertial motion capture data being inertial motion capture data of the human body presented in the sample human body image, acquired when the sample human body image was captured. The training unit 402 is configured to select a training sample from the training sample set and, based on the selected training sample, perform the following training steps: inputting the sample human body image of the selected training sample into an initial neural network to obtain three-dimensional human body pose information corresponding to the selected training sample; determining a transformation matrix between the three-dimensional human body pose information corresponding to the selected training sample and the corresponding sample inertial motion capture data; converting the pose keypoints indicated by the three-dimensional human body pose information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system using the transformation matrix, and determining the difference between the pose keypoints and the inertial motion capture three-dimensional points; adjusting network parameters of the initial neural network based on the determined difference; determining whether a preset training end condition is satisfied; and, if the training end condition is satisfied, determining the adjusted initial neural network as the trained three-dimensional human body pose prediction network.
In this embodiment, the specific processing of the first acquisition unit 401 and the training unit 402 of the model training apparatus 400 may refer to step 201 and step 202 in the corresponding embodiment of fig. 2.
In some alternative implementations, the model training apparatus 400 may further include a feedback unit (not shown in the figure). The feedback unit may be configured to, if the training end condition is not satisfied, take the adjusted initial neural network as the initial neural network, select an unused training sample from the training sample set, and continue performing the training step based on the re-selected training sample.
In some alternative implementations, the model training apparatus 400 may further include a second acquisition unit (not shown in the figure) and an input unit (not shown in the figure). The second acquisition unit may be configured to acquire a human body image to be predicted; the input unit may be configured to input the human body image into the trained three-dimensional human body pose prediction network to obtain three-dimensional human body pose information of the human body presented in the human body image.
In some optional implementations, the training sample may include a sample human body video and sample inertial motion capture data corresponding to the sample human body video, the sample inertial motion capture data being inertial motion capture data of the human body presented in the sample human body video, acquired when the sample human body video was captured; and the training unit 402 may be further configured to obtain the three-dimensional human body pose information corresponding to the selected training sample by inputting the sample human body video of the selected training sample into the initial neural network.
In some optional implementations, the training unit 402 may be further configured to use the transformation matrix to convert the pose keypoints indicated by the three-dimensional human body pose information corresponding to the selected training sample into the coordinate system in which the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample are located.
In some optional implementations, the training unit 402 may be further configured to use the transformation matrix to convert the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample into the coordinate system in which the pose keypoints indicated by the three-dimensional human body pose information corresponding to the selected training sample are located.
Referring now to fig. 5, a schematic diagram of an electronic device (e.g., server or terminal device of fig. 1) 500 suitable for use in implementing embodiments of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), car terminals (e.g., car navigation terminals), and the like, and stationary terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 5 is merely an example and should not impose any limitations on the functionality and scope of use of embodiments of the present disclosure.
As shown in fig. 5, the electronic device 500 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 501, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage device 508 into a random access memory (RAM) 503. The RAM 503 also stores various programs and data required for the operation of the electronic device 500. The processing device 501, the ROM 502, and the RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to the bus 504.
In general, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 507 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 508 including, for example, magnetic tape, hard disk, etc.; and communication means 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 shows an electronic device 500 having various means, it is to be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may be implemented or provided instead. Each block shown in fig. 5 may represent one device or a plurality of devices as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or from the storage means 508, or from the ROM 502. The above-described functions defined in the methods of the embodiments of the present disclosure are performed when the computer program is executed by the processing device 501. It should be noted that, the computer readable medium according to the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In an embodiment of the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. Whereas in embodiments of the present disclosure, the computer-readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, fiber optic cables, RF (radio frequency), and the like, or any suitable combination of the foregoing.
The computer readable medium may be contained in the electronic device, or may exist separately without being incorporated into the electronic device. The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a training sample set, where a training sample includes a sample human body image and sample inertial motion capture data corresponding to the sample human body image, the sample inertial motion capture data being inertial motion capture data of the human body presented in the sample human body image, acquired when the sample human body image was captured; select a training sample from the training sample set and, based on the selected training sample, perform the following training steps: inputting the sample human body image of the selected training sample into an initial neural network to obtain three-dimensional human body pose information corresponding to the selected training sample; determining a transformation matrix between the three-dimensional human body pose information corresponding to the selected training sample and the corresponding sample inertial motion capture data; converting the pose keypoints indicated by the three-dimensional human body pose information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system using the transformation matrix, and determining the difference between the pose keypoints and the inertial motion capture three-dimensional points; adjusting network parameters of the initial neural network based on the determined difference; determining whether a preset training end condition is satisfied; and, if the training end condition is satisfied, determining the adjusted initial neural network as the trained three-dimensional human body pose prediction network.
Computer program code for carrying out operations of embodiments of the present disclosure may be written in one or more programming languages, including an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
According to one or more embodiments of the present disclosure, there is provided a model training method, the method comprising: acquiring a training sample set, wherein each training sample comprises a sample human body image and sample inertial motion capture data corresponding to the sample human body image, the sample inertial motion capture data being inertial motion capture data of the human body in the sample human body image, acquired when the sample human body image was shot; selecting a training sample from the training sample set, and executing the following training steps based on the selected training sample: inputting the sample human body image of the selected training sample into an initial neural network to obtain three-dimensional human body posture information corresponding to the selected training sample; determining a transformation matrix between the three-dimensional human body posture information corresponding to the selected training sample and the corresponding sample inertial motion capture data; converting, by using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system, and determining the difference between the posture key points and the inertial motion capture three-dimensional points; adjusting network parameters of the initial neural network based on the determined difference; determining whether a preset training ending condition is met; and if the training ending condition is met, determining the adjusted initial neural network as the trained three-dimensional human body posture prediction network. A minimal sketch of one such training step appears below.
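For illustration only, the following is a minimal PyTorch-style sketch of a single training step of this scheme. The names `pose_net`, `optimizer`, and the helper `estimate_rigid_transform` (sketched after the coordinate-conversion embodiments below) are hypothetical stand-ins; the disclosure does not prescribe a particular network architecture or loss.

```python
import torch

def training_step(pose_net, optimizer, sample_image, imu_points, estimate_rigid_transform):
    # sample_image: (1, 3, H, W) tensor of the sample human body image;
    # imu_points: (K, 3) tensor of inertial motion capture 3D points for the same frame.
    pose_keypoints = pose_net(sample_image).squeeze(0)  # assumed output shape (K, 3)

    # Determine a transformation matrix between the predicted posture key points
    # and the inertial motion capture points. Detaching here lets gradients flow
    # through the key points but not through the alignment itself.
    R, t = estimate_rigid_transform(pose_keypoints.detach(), imu_points)

    # Convert the posture key points into the coordinate system of the inertial
    # motion capture points and determine the difference between the two.
    aligned = pose_keypoints @ R.T + t
    loss = torch.mean(torch.norm(aligned - imu_points, dim=-1))

    # Adjust the network parameters of the initial neural network
    # based on the determined difference.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```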
According to one or more embodiments of the present disclosure, the method further comprises: if the training ending condition is not met, taking the adjusted initial neural network as the initial neural network, selecting an unused training sample from the training sample set, and continuing to execute the training steps based on the selected training sample. A sketch of this outer loop follows.
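Continuing the sketch above, the outer loop might look as follows; `samples` and the end-condition predicate `should_stop` are hypothetical names introduced here for illustration.

```python
def train(pose_net, optimizer, samples, estimate_rigid_transform, should_stop):
    # Keep drawing unused training samples and running training steps
    # until the preset training ending condition is met.
    for sample_image, imu_points in samples:
        loss = training_step(pose_net, optimizer, sample_image,
                             imu_points, estimate_rigid_transform)
        if should_stop(loss):
            break
    # The adjusted network is the trained three-dimensional
    # human body posture prediction network.
    return pose_net
```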
According to one or more embodiments of the present disclosure, the method further comprises: acquiring a human body image to be predicted; and inputting the human body image into the trained three-dimensional human body posture prediction network to obtain three-dimensional human body posture information of the human body displayed in the human body image, as sketched below.
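A correspondingly minimal inference sketch, again with hypothetical names:

```python
import torch

@torch.no_grad()
def predict_pose(trained_pose_net, body_image):
    # body_image: (3, H, W) tensor of the human body image to be predicted.
    trained_pose_net.eval()
    # Returns the three-dimensional human body posture information,
    # assumed here to be (K, 3) posture key points.
    return trained_pose_net(body_image.unsqueeze(0)).squeeze(0)
```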
According to one or more embodiments of the present disclosure, the training sample comprises a sample human body video and sample inertial motion capture data corresponding to the sample human body video, the sample inertial motion capture data being inertial motion capture data of the human body presented in the sample human body video, acquired when the sample human body video was captured; and inputting the sample human body image of the selected training sample into the initial neural network to obtain the three-dimensional human body posture information corresponding to the selected training sample comprises: inputting the sample human body video of the selected training sample into the initial neural network to obtain the three-dimensional human body posture information corresponding to the selected training sample.
According to one or more embodiments of the present disclosure, converting, by using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system comprises: converting, by using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample into the coordinate system in which the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample are located.
According to one or more embodiments of the present disclosure, converting, by using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system comprises: converting, by using the transformation matrix, the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample into the coordinate system in which the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample are located. A sketch of one way to compute such a transformation follows.
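The disclosure does not fix how the transformation matrix is obtained. One common choice, shown here purely as an assumption, is a least-squares rigid (Kabsch/Procrustes-style) alignment between the two corresponding point sets; the resulting transform can map either point set into the other's coordinate system.

```python
import torch

def estimate_rigid_transform(src, dst):
    # Least-squares rigid transform (R, t) with dst ≈ src @ R.T + t,
    # for corresponding 3D point sets src, dst of shape (K, 3).
    src_c, dst_c = src.mean(dim=0), dst.mean(dim=0)
    H = (src - src_c).T @ (dst - dst_c)            # 3x3 cross-covariance
    U, S, Vt = torch.linalg.svd(H)
    d = torch.sign(torch.det(Vt.T @ U.T)).item()   # guard against reflections
    R = Vt.T @ torch.diag(torch.tensor([1.0, 1.0, d])) @ U.T
    t = dst_c - src_c @ R.T
    return R, t

# Converting in the opposite direction uses the inverse transform;
# since R is orthogonal, src ≈ (dst - t) @ R.
```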
According to one or more embodiments of the present disclosure, there is provided a model training apparatus, the apparatus comprising: a first acquisition unit, used for acquiring a training sample set, wherein each training sample comprises a sample human body image and sample inertial motion capture data corresponding to the sample human body image, the sample inertial motion capture data being inertial motion capture data of the human body in the sample human body image, acquired when the sample human body image was shot; and a training unit, used for selecting a training sample from the training sample set and executing the following training steps based on the selected training sample: inputting the sample human body image of the selected training sample into an initial neural network to obtain three-dimensional human body posture information corresponding to the selected training sample; determining a transformation matrix between the three-dimensional human body posture information corresponding to the selected training sample and the corresponding sample inertial motion capture data; converting, by using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system, and determining the difference between the posture key points and the inertial motion capture three-dimensional points; adjusting network parameters of the initial neural network based on the determined difference; determining whether a preset training ending condition is met; and if the training ending condition is met, determining the adjusted initial neural network as the trained three-dimensional human body posture prediction network.
According to one or more embodiments of the present disclosure, the apparatus further comprises: a feedback unit, used for, if the training ending condition is not met, taking the adjusted initial neural network as the initial neural network, selecting an unused training sample from the training sample set, and continuing to execute the training steps based on the selected training sample.
According to one or more embodiments of the present disclosure, the apparatus further comprises: a second acquisition unit, used for acquiring a human body image to be predicted; and an input unit, used for inputting the human body image into the trained three-dimensional human body posture prediction network to obtain three-dimensional human body posture information of the human body displayed in the human body image.
According to one or more embodiments of the present disclosure, the training sample comprises a sample human body video and sample inertial motion capture data corresponding to the sample human body video, the sample inertial motion capture data being inertial motion capture data of the human body presented in the sample human body video, acquired when the sample human body video was captured; and the training unit is further used for inputting the sample human body image of the selected training sample into the initial neural network in the following manner to obtain the three-dimensional human body posture information corresponding to the selected training sample: inputting the sample human body video of the selected training sample into the initial neural network to obtain the three-dimensional human body posture information corresponding to the selected training sample.
According to one or more embodiments of the present disclosure, the training unit is further used for converting, by using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system in the following manner: converting, by using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample into the coordinate system in which the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample are located.
According to one or more embodiments of the present disclosure, the training unit is further used for converting, by using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system in the following manner: converting, by using the transformation matrix, the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample into the coordinate system in which the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample are located.
According to one or more embodiments of the present disclosure, there is provided an electronic device including: one or more processors; and a storage device for storing one or more programs that, when executed by the one or more processors, cause the one or more processors to implement the model training method as described above.
According to one or more embodiments of the present disclosure, a computer readable medium is provided, having stored thereon a computer program which, when executed by a processor, implements the steps of a model training method as described above.
The units involved in the embodiments described in the present disclosure may be implemented by means of software or by means of hardware. The described units may also be provided in a processor, which may, for example, be described as: a processor comprising a first acquisition unit and a training unit. The names of these units do not in any way limit the units themselves; for example, the first acquisition unit may also be described as "a unit that acquires a training sample set". A minimal software sketch of such units follows.
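As an illustration only, such units might be realized in software roughly as follows; all names are hypothetical.

```python
class ModelTrainingApparatus:
    # Hypothetical software realization of the described units;
    # the unit names describe roles, not fixed identifiers.
    def __init__(self, first_acquisition_unit, training_unit):
        self.first_acquisition_unit = first_acquisition_unit  # acquires the training sample set
        self.training_unit = training_unit                    # performs the training steps
```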
The foregoing description is only of the preferred embodiments of the present disclosure and an explanation of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to technical solutions formed by the specific combination of the above technical features, but also encompasses other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by substituting the above features with (but not limited to) features having similar functions disclosed in the embodiments of the present disclosure.

Claims (14)

1. A method of model training, comprising:
acquiring a training sample set, wherein each training sample comprises a sample human body image and sample inertial motion capture data corresponding to the sample human body image, the sample inertial motion capture data being inertial motion capture data of the human body in the sample human body image, acquired when the sample human body image was shot;
selecting a training sample from the training sample set, and executing the following training steps based on the selected training sample: inputting the sample human body image of the selected training sample into an initial neural network to obtain three-dimensional human body posture information corresponding to the selected training sample; determining a transformation matrix between the three-dimensional human body posture information corresponding to the selected training sample and the corresponding sample inertial motion capture data; converting, by using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system, and determining the difference between the posture key points and the inertial motion capture three-dimensional points; adjusting network parameters of the initial neural network based on the determined difference; determining whether a preset training ending condition is met; and if the training ending condition is met, determining the adjusted initial neural network as the trained three-dimensional human body posture prediction network.
2. The method according to claim 1, wherein the method further comprises:
if the training ending condition is not met, taking the adjusted initial neural network as the initial neural network, selecting an unused training sample from the training sample set, and continuing to execute the training steps based on the selected training sample.
3. The method according to claim 1, wherein the method further comprises:
acquiring a human body image to be predicted;
inputting the human body image into the trained three-dimensional human body posture prediction network to obtain three-dimensional human body posture information of the human body displayed in the human body image.
4. The method of claim 1, wherein the training sample comprises a sample human body video and sample inertial motion capture data corresponding to the sample human body video, the sample inertial motion capture data being inertial motion capture data of the human body presented in the sample human body video, acquired when the sample human body video was captured; and
wherein inputting the sample human body image of the selected training sample into the initial neural network to obtain the three-dimensional human body posture information corresponding to the selected training sample comprises:
inputting the sample human body video of the selected training sample into the initial neural network to obtain the three-dimensional human body posture information corresponding to the selected training sample.
5. The method according to any one of claims 1 to 4, wherein converting, by using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system comprises:
converting, by using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample into the coordinate system in which the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample are located.
6. The method according to any one of claims 1 to 4, wherein converting, by using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system comprises:
converting, by using the transformation matrix, the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample into the coordinate system in which the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample are located.
7. A model training device, comprising:
a first acquisition unit, used for acquiring a training sample set, wherein each training sample comprises a sample human body image and sample inertial motion capture data corresponding to the sample human body image, the sample inertial motion capture data being inertial motion capture data of the human body in the sample human body image, acquired when the sample human body image was shot; and
a training unit, used for selecting a training sample from the training sample set and executing the following training steps based on the selected training sample: inputting the sample human body image of the selected training sample into an initial neural network to obtain three-dimensional human body posture information corresponding to the selected training sample; determining a transformation matrix between the three-dimensional human body posture information corresponding to the selected training sample and the corresponding sample inertial motion capture data; converting, by using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system, and determining the difference between the posture key points and the inertial motion capture three-dimensional points; adjusting network parameters of the initial neural network based on the determined difference; determining whether a preset training ending condition is met; and if the training ending condition is met, determining the adjusted initial neural network as the trained three-dimensional human body posture prediction network.
8. The apparatus of claim 7, wherein the apparatus further comprises:
a feedback unit, used for, if the training ending condition is not met, taking the adjusted initial neural network as the initial neural network, selecting an unused training sample from the training sample set, and continuing to execute the training steps based on the selected training sample.
9. The apparatus of claim 7, wherein the apparatus further comprises:
a second acquisition unit, used for acquiring a human body image to be predicted; and
an input unit, used for inputting the human body image into the trained three-dimensional human body posture prediction network to obtain three-dimensional human body posture information of the human body displayed in the human body image.
10. The apparatus of claim 7, wherein the training sample comprises a sample human body video and sample inertial motion capture data corresponding to the sample human body video, the sample inertial motion capture data being inertial motion capture data of the human body presented in the sample human body video, acquired when the sample human body video was captured; and
the training unit is further used for inputting the sample human body image of the selected training sample into the initial neural network in the following manner to obtain the three-dimensional human body posture information corresponding to the selected training sample:
inputting the sample human body video of the selected training sample into the initial neural network to obtain the three-dimensional human body posture information corresponding to the selected training sample.
11. The apparatus according to any one of claims 7 to 10, wherein the training unit is further used for converting, by using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system in the following manner:
converting, by using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample into the coordinate system in which the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample are located.
12. The apparatus according to any one of claims 7 to 10, wherein the training unit is further used for converting, by using the transformation matrix, the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample and the inertial motion capture three-dimensional points indicated by the corresponding sample inertial motion capture data into the same coordinate system in the following manner:
converting, by using the transformation matrix, the inertial motion capture three-dimensional points indicated by the sample inertial motion capture data corresponding to the selected training sample into the coordinate system in which the posture key points indicated by the three-dimensional human body posture information corresponding to the selected training sample are located.
13. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-6.
14. A computer readable medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1-6.
CN202110195069.4A 2021-02-20 2021-02-20 Model training method and device and electronic equipment Active CN112818898B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110195069.4A CN112818898B (en) 2021-02-20 2021-02-20 Model training method and device and electronic equipment


Publications (2)

Publication Number Publication Date
CN112818898A CN112818898A (en) 2021-05-18
CN112818898B true CN112818898B (en) 2024-02-20

Family

ID=75864468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110195069.4A Active CN112818898B (en) 2021-02-20 2021-02-20 Model training method and device and electronic equipment

Country Status (1)

Country Link
CN (1) CN112818898B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113077383B * 2021-06-07 2021-11-02 Shenzhen Zhuiyi Technology Co., Ltd. Model training method and model training device
CN113688907B * 2021-08-25 2023-07-21 Beijing Baidu Netcom Science and Technology Co., Ltd. Model training and video processing method, apparatus, device, and storage medium


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10488521B2 (en) * 2017-06-13 2019-11-26 TuSimple Sensor calibration and time method for ground truth static scene sparse flow generation

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017166594A1 * 2016-03-31 2017-10-05 Baidu Online Network Technology (Beijing) Co., Ltd. Indoor map construction method, device, and storage method
CN108664122A * 2018-04-04 2018-10-16 Goertek Inc. Attitude prediction method and apparatus
CN108762495A * 2018-05-18 2018-11-06 Shenzhen University Virtual reality driving method and virtual reality system based on arm motion capture
CN109145788A * 2018-08-08 2019-01-04 Beijing Yunbo Online Technology Co., Ltd. Video-based attitude data capture method and system
CN109211267A * 2018-08-14 2019-01-15 Guangzhou Virtual Dynamics Network Technology Co., Ltd. Rapid posture calibration method and system for inertial motion capture
CN110095116A * 2019-04-29 2019-08-06 Guilin University of Electronic Technology LIFT-based localization method combining visual positioning and inertial navigation
CN210931431U * 2019-08-07 2020-07-07 Lanzhou Jiaotong University Novel multi-person motion capture device
CN112363617A * 2020-10-28 2021-02-12 Haituo Information Technology (Foshan) Co., Ltd. Method and device for acquiring human body action data

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Kinematic Tracking of Rehabilitation Patients With Markerless Pose Estimation Fused with Wearable Inertial Sensors; R. James Cotton; 2020 15th IEEE International; 2021-01-18; 508-514 *
Real-time finger tracking using active motion capture: a neural network approach robust to occlusions; Dario Pavllo et al.; Proceedings of the 11th ACM SIGGRAPH Conference on Motion; 2018-11-30 (No. 06); 1-10 *
Monocular vision-based three-dimensional human pose estimation; Feng Tao; China Masters' Theses Full-text Database (Information Science and Technology); 2020-02-15 (No. 2020(02)); I138-1741 *
Real-time multi-person pose estimation in complex scenes; Hua Guoguang; China Masters' Theses Full-text Database (Information Science and Technology); 2019-09-15 (No. 2019(09)); I138-1030 *

Also Published As

Publication number Publication date
CN112818898A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
CN110059623B (en) Method and apparatus for generating information
CN112488783B (en) Image acquisition method and device and electronic equipment
CN112818898B (en) Model training method and device and electronic equipment
CN110555861B (en) Optical flow calculation method and device and electronic equipment
CN114879846A (en) Method, device, equipment and medium for determining trigger position
CN111833459B (en) Image processing method and device, electronic equipment and storage medium
CN111915532B (en) Image tracking method and device, electronic equipment and computer readable medium
CN116079697B (en) Monocular vision servo method, device, equipment and medium based on image
CN110717467A (en) Head pose estimation method, device, equipment and storage medium
CN111680754B (en) Image classification method, device, electronic equipment and computer readable storage medium
CN111597414B (en) Display method and device and electronic equipment
CN111460334B (en) Information display method and device and electronic equipment
CN114529452A (en) Method and device for displaying image and electronic equipment
CN112784622B (en) Image processing method and device, electronic equipment and storage medium
CN113034570A (en) Image processing method and device and electronic equipment
CN113407045A (en) Cursor control method and device, electronic equipment and storage medium
CN111586295A (en) Image generation method and device and electronic equipment
CN115086538A (en) Shooting position determining method, device, equipment and medium
CN113034580A (en) Image information detection method and device and electronic equipment
CN111209050A (en) Method and device for switching working mode of electronic equipment
CN111768443A (en) Image processing method and device based on mobile camera
CN112880675B (en) Pose smoothing method and device for visual positioning, terminal and mobile robot
CN115994978A (en) Normal vector adjustment method, device, equipment and medium
CN115841151B (en) Model training method, device, electronic equipment and computer readable medium
CN112781581B (en) Method and device for generating path from moving to child cart applied to sweeper

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant