CN117523664B - Training method of human motion prediction model, human-computer interaction method, and corresponding device, equipment and storage medium - Google Patents

Training method of human motion prediction model, human-computer interaction method, and corresponding device, equipment and storage medium

Info

Publication number
CN117523664B
CN117523664B (granted from application CN202311508827.9A; published as CN117523664A)
Authority
CN
China
Prior art keywords
sequence
motion
hand
target
action
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311508827.9A
Other languages
Chinese (zh)
Other versions
CN117523664A (en)
Inventor
崔琼杰 (Cui Qiongjie)
王浩帆 (Wang Haofan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shuhang Technology Beijing Co ltd
Original Assignee
Shuhang Technology Beijing Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shuhang Technology Beijing Co ltd filed Critical Shuhang Technology Beijing Co ltd
Priority to CN202311508827.9A priority Critical patent/CN117523664B/en
Publication of CN117523664A publication Critical patent/CN117523664A/en
Application granted granted Critical
Publication of CN117523664B publication Critical patent/CN117523664B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06V40/20: Movements or behaviour, e.g. gesture recognition
    • G06V40/28: Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06N3/042: Knowledge-based neural networks; logical representations of neural networks
    • G06N3/0464: Convolutional networks [CNN, ConvNet]
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G06V10/32: Normalisation of the pattern dimensions
    • G06V10/62: Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; pattern tracking
    • G06V10/7715: Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; mappings, e.g. subspace methods
    • G06V10/82: Image or video recognition or understanding using neural networks
    • Y02T10/40: Engine management systems


Abstract

The application discloses a training method of a human motion prediction model, a human-computer interaction method, and corresponding devices, equipment and storage media. The training method comprises the following steps: acquiring an initial model, a historical motion sequence of a reference human body, and a target future motion sequence corresponding to the historical motion sequence; extracting target motion features of the subject and target motion features of the hands from the historical motion sequence using the initial model; normalizing, by the initial model, the target motion features of the subject and of the hands via a distribution norm to obtain a normalized subject motion feature sequence and a normalized hand motion feature sequence; predicting, by the initial model, a future motion sequence of the subject and a future motion sequence of the hands based on the two normalized feature sequences; and updating the parameters of the initial model based on the future motion sequence of the subject, the future motion sequence of the hands, and the target future motion sequence to obtain the human motion prediction model.

Description

Training method of human motion prediction model, human-computer interaction method, and corresponding device, equipment and storage medium
Technical Field
The application relates to the technical field of human motion prediction, and in particular to a training method of a human motion prediction model, a related method, and related products.
Background
Human motion prediction is a fundamental task that aims to predict human motion over a future period of time. Current methods generally predict future human motion with a deep learning model; specifically, the model predicts a person's motion based on interactions between human bodies. However, when such a model predicts a single person's future motion sequence based only on that person's own historical motion sequence, the accuracy of the prediction result is low.
Disclosure of Invention
The application provides a training method of a human motion prediction model, a related method, and related products. The related method is a human-computer interaction method; the related products include a training device for the human motion prediction model, a human-computer interaction device, an electronic device, and a computer-readable storage medium.
In a first aspect, a method for training a human motion prediction model is provided, the method comprising:
acquiring an initial model, a historical action sequence of a reference human body and a target future action sequence corresponding to the historical action sequence;
Extracting target motion features of a subject and target motion features of a hand from the historical motion sequence using the initial model, the subject including a portion of the reference human body other than the hand, the hand including the hand of the reference human body;
Normalizing the target motion characteristics of the main body and the target motion characteristics of the hand by the initial model through distribution norms (distribution norm, DN) to obtain a normalized main body motion characteristic sequence and a normalized hand motion characteristic sequence;
The initial model predicts a future motion sequence of the subject and a future motion sequence of the hand based on the normalized subject motion feature sequence and the normalized hand motion feature sequence;
And updating parameters of the initial model based on the future action sequence of the main body, the future action sequence of the hand and the target future action sequence to obtain a human body action prediction model.
In combination with any one of the embodiments of the present application, the updating the parameters of the initial model based on the future motion sequence of the subject, the future motion sequence of the hand, and the target future motion sequence to obtain a human motion prediction model includes:
determining a subject target motion sequence of the subject and a hand target motion sequence of the hand based on the target future motion sequence;
Determining a first loss based on a first difference between the future motion sequence of the subject and the subject target motion sequence and a second difference between the future motion sequence of the hand and the hand target motion sequence, the first loss being positively correlated with both the first difference and the second difference;
and updating parameters of the initial model based on the first loss to obtain the human motion prediction model.
In combination with any one of the embodiments of the present application, before updating the parameters of the initial model based on the first loss to obtain the human motion prediction model, the method further includes:
Determining a third difference between the normalized subject motion feature sequence and the normalized hand motion feature sequence;
Determining a second loss based on the third difference, the second loss being related to the third difference;
based on the first loss, updating parameters of the initial model to obtain the human motion prediction model, including:
and updating parameters of the initial model based on the first loss and the second loss to obtain the human motion prediction model.
In combination with any one of the embodiments of the present application, the normalized hand motion feature sequence includes a normalized left hand motion feature sequence and a normalized right hand motion feature sequence;
The determining a third difference between the normalized subject motion feature sequence and the normalized hand motion feature sequence comprises:
Respectively calculating the average value of the normalized main body action characteristic sequence, the average value of the normalized left hand action characteristic sequence and the average value of the normalized right hand action characteristic sequence along the space dimension to obtain a main body average action characteristic sequence, a left hand average action characteristic sequence and a right hand average action characteristic sequence;
And obtaining the third difference by calculating the pairwise maximum mean discrepancies among the subject average motion feature sequence, the left-hand average motion feature sequence and the right-hand average motion feature sequence.
In combination with any one of the embodiments of the present application, the extracting the target motion feature of the subject and the target motion feature of the hand from the historical motion sequence includes:
extracting the action characteristic sequence of the main body from the historical action sequence to obtain a historical main body action characteristic sequence;
extracting the action characteristic sequence of the hand from the historical action sequence to obtain a historical hand action characteristic sequence;
And obtaining target motion characteristics of the main body and target motion characteristics of the hands by performing discrete cosine transform on the historical main body motion characteristic sequence and the historical hand motion characteristic sequence respectively.
In combination with any of the embodiments of the present application, the initial model includes a graph convolutional neural network;
The discrete cosine transform is performed on the historical subject motion feature sequence and the historical hand motion feature sequence to obtain a target motion feature of the subject and a target motion feature of the hand, including:
acquiring a full connection diagram of the reference human body;
The historical main body action feature sequence and the historical hand action feature sequence are used as input of the full-connection graph, and the full-connection graph is processed by using the graph convolution neural network to obtain initial action features of the main body and initial action features of the hands;
Obtaining target action characteristics of the main body by performing discrete cosine transform on the initial action characteristics of the main body;
And obtaining target motion characteristics of the hand by performing discrete cosine transform on the initial motion characteristics of the hand.
In a second aspect, a human-computer interaction method is provided, the human-computer interaction method is applied to a human-computer interaction device, the human-computer interaction device comprises a camera, and the method comprises the following steps:
collecting a historical action sequence of a target human body through the camera;
acquiring a human motion prediction model obtained according to any one of the embodiments of the present application;
And processing the historical motion sequence of the target human body by using the human body motion prediction model to predict and obtain a future motion sequence of the target human body, wherein the future motion sequence of the target human body comprises: a future motion sequence of a subject of the target human body and a future motion sequence of a hand of the target human body, the subject including a portion of the target human body other than the hand, the hand including a left hand of the target human body and a right hand of the target human body;
a target operation is performed in response to a future motion sequence of a subject of the target human body and a future motion sequence of a hand of the target human body.
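For illustration only, the interaction flow of the second aspect can be sketched in Python/PyTorch as below; camera.capture, model, and respond are hypothetical placeholders, since the application does not specify a capture API, a model interface, or a concrete target operation:

```python
# Hedged sketch of the human-computer interaction loop of the second aspect.
import torch

def interact(camera, model, respond, n_frames: int = 5):
    # collect the historical motion sequence of the target human body
    frames = [camera.capture() for _ in range(n_frames)]   # hypothetical capture API
    history = torch.stack(frames)                          # (n_frames, joints, 3)
    # predict future motion sequences of the subject and both hands
    with torch.no_grad():
        future_subject, future_left, future_right = model(history)
    # perform the target operation in response to the predicted motion
    respond(future_subject, future_left, future_right)     # hypothetical handler
```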
In a third aspect, there is provided a training apparatus for a human motion prediction model, the apparatus comprising:
the acquisition unit is used for acquiring an initial model, a historical action sequence of a reference human body and a target future action sequence corresponding to the historical action sequence;
An extracting unit configured to extract, from the history motion sequence, a target motion feature of a subject and a target motion feature of a hand, the subject including a part of the reference human body other than the hand, the hand including a hand of the reference human body, using the initial model;
The normalization unit is used for normalizing the target action characteristics of the main body and the target action characteristics of the hand through DN based on the initial model to obtain a normalized main body action characteristic sequence and a normalized hand action characteristic sequence;
A prediction unit, configured to predict, by using the initial model, a future motion sequence of the subject and a future motion sequence of the hand based on the normalized subject motion feature sequence and the normalized hand motion feature sequence;
And the updating unit is used for updating the parameters of the initial model based on the future action sequence of the main body, the future action sequence of the hand and the target future action sequence to obtain a human body action prediction model.
In combination with any one of the embodiments of the present application, the updating unit is configured to:
determining a subject target motion sequence of the subject and a hand target motion sequence of the hand based on the target future motion sequence;
Determining a first loss based on a first difference between the future motion sequence of the subject and the subject target motion sequence and a second difference between the future motion sequence of the hand and the hand target motion sequence, the first loss being positively correlated with both the first difference and the second difference;
and updating parameters of the initial model based on the first loss to obtain the human motion prediction model.
In combination with any of the embodiments of the application, the device further comprises:
a determining unit, configured to determine a third difference between the normalized main motion feature sequence and the normalized hand motion feature sequence;
The determining unit is configured to determine a second loss based on the third difference, where the second loss is related to the third difference;
the updating unit is used for:
and updating parameters of the initial model based on the first loss and the second loss to obtain the human motion prediction model.
In combination with any one of the embodiments of the present application, the normalized hand motion feature sequence includes a normalized left hand motion feature sequence and a normalized right hand motion feature sequence;
The determining unit is used for:
Respectively calculating the average value of the normalized main body action characteristic sequence, the average value of the normalized left hand action characteristic sequence and the average value of the normalized right hand action characteristic sequence along the space dimension to obtain a main body average action characteristic sequence, a left hand average action characteristic sequence and a right hand average action characteristic sequence;
And obtaining the third difference by calculating the pairwise maximum mean discrepancies among the subject average motion feature sequence, the left-hand average motion feature sequence and the right-hand average motion feature sequence.
In combination with any one of the embodiments of the present application, the extraction unit is configured to:
extracting the action characteristic sequence of the main body from the historical action sequence to obtain a historical main body action characteristic sequence;
extracting the action characteristic sequence of the hand from the historical action sequence to obtain a historical hand action characteristic sequence;
And obtaining target motion characteristics of the main body and target motion characteristics of the hands by performing discrete cosine transform on the historical main body motion characteristic sequence and the historical hand motion characteristic sequence respectively.
In combination with any of the embodiments of the present application, the initial model includes a graph convolutional neural network;
The extraction unit is used for:
acquiring a full connection diagram of the reference human body;
The historical main body action feature sequence and the historical hand action feature sequence are used as input of the full-connection graph, and the full-connection graph is processed by using the graph convolution neural network to obtain initial action features of the main body and initial action features of the hands;
Obtaining target action characteristics of the main body by performing discrete cosine transform on the initial action characteristics of the main body;
And obtaining target motion characteristics of the hand by performing discrete cosine transform on the initial motion characteristics of the hand.
In a fourth aspect, a human-machine interaction device is provided, the human-machine interaction device comprising:
The camera is used for collecting a historical action sequence of a target human body;
An acquisition unit for acquiring a human motion prediction model obtained according to any one of the embodiments in combination with the present application;
The prediction unit is configured to process the historical motion sequence of the target human body by using the human body motion prediction model, and predict to obtain a future motion sequence of the target human body, where the future motion sequence of the target human body includes: a future motion sequence of a subject of the target human body and a future motion sequence of a hand of the target human body, the subject including a portion of the target human body other than the hand, the hand including a left hand of the target human body and a right hand of the target human body;
and the execution unit is used for responding to the future action sequence of the main body of the target human body and the future action sequence of the hand of the target human body to execute target operation.
In a fifth aspect, there is provided an electronic device, comprising: a processor and a memory, the memory being for storing computer program code, the computer program code comprising computer instructions; the computer instructions, when executed by the processor, cause the processor to perform the first aspect and any implementation thereof as described above, or cause the processor to perform the second aspect and any implementation thereof as described above.
In a sixth aspect, there is provided another electronic device comprising: a processor, a transmitting device, an input device, an output device, and a memory for storing computer program code, the computer program code comprising computer instructions; the electronic device performs the first aspect and any implementation thereof as described above, when the processor executes the computer instructions; the electronic device may alternatively perform the second aspect and any embodiments thereof as described above, when the processor executes the computer instructions.
In a seventh aspect, there is provided a computer-readable storage medium having a computer program stored therein, the computer program comprising program instructions; the program instructions, when executed by a processor, cause the processor to perform the first aspect and any implementation thereof as described above, or cause the processor to perform the second aspect and any implementation thereof as described above.
In an eighth aspect, there is provided a computer program product comprising a computer program or instructions; when the computer program or instructions are run on a computer, the computer is caused to perform the first aspect and any implementation thereof as described above, or to perform the second aspect and any implementation thereof as described above.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
In the application, the training device acquires the initial model, the historical motion sequence of the reference human body, and the target future motion sequence corresponding to the historical motion sequence. Using the initial model, it extracts the target motion features of the subject and of the hands from the historical motion sequence, and normalizes them through DN to obtain a normalized subject motion feature sequence and a normalized hand motion feature sequence, thereby eliminating the part difference between the target motion features of the subject and of the hands. Predicting the future motion sequence of the subject and of the hands based on these normalized feature sequences therefore improves the accuracy of both predictions. Finally, the parameters of the initial model are updated based on the future motion sequence of the subject, the future motion sequence of the hands, and the target future motion sequence to obtain the human motion prediction model, so that the model gains the ability to predict both future sequences; in particular, future gestures can be determined from the future motion sequence of the hands, enabling prediction of fine-grained human motion and improving the overall accuracy of human motion prediction.
Drawings
In order to more clearly describe the embodiments of the present application or the technical solutions in the background art, the following description will describe the drawings that are required to be used in the embodiments of the present application or the background art.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a schematic flow chart of a training method of a human motion prediction model according to an embodiment of the present application;
Fig. 2 is a schematic view of a human joint according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a training framework of a human motion prediction model according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a cross alignment module according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a comparison of different prediction methods according to an embodiment of the present application;
FIG. 6 is a schematic diagram showing an effect of eliminating part difference according to an embodiment of the present application;
FIG. 7a is a schematic diagram illustrating a predictive effect of a scroll action according to an embodiment of the present application;
fig. 7b is a schematic diagram of a predicted effect of a cooking action according to an embodiment of the present application;
FIG. 7c is a schematic diagram showing a predicted effect of drinking actions according to an embodiment of the present application;
FIG. 7d is a schematic diagram of a predicted effect of a eating behavior according to an embodiment of the present application;
FIG. 7e is a schematic diagram illustrating a prediction effect of a transfer action according to an embodiment of the present application;
fig. 7f is a schematic diagram of a prediction effect of a photographing action according to an embodiment of the present application;
FIG. 8 is a schematic flow chart of a man-machine interaction method according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a training device for a human motion prediction model according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a man-machine interaction device according to an embodiment of the present application;
fig. 11 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
The terms first, second and the like in the description and in the claims and in the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The execution subject of the embodiment of the application is a training device of a human motion prediction model (hereinafter simply referred to as a training device), where the training device may be any electronic device capable of executing the technical scheme disclosed in the method embodiments of the application. Optionally, the training device is a computer or a server.
It should be understood that the method embodiments of the present application may also be implemented by means of a processor executing computer program code. Embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application. Referring to fig. 1, fig. 1 is a flowchart of a training method of a human motion prediction model according to an embodiment of the present application.
101. And acquiring an initial model, a historical action sequence of a reference human body and a target future action sequence corresponding to the historical action sequence.
In the embodiment of the application, the initial model can be any deep learning model. In one implementation of obtaining an initial model, the training device receives an initial model input by a user through an input component. Optionally, the input component includes: a keyboard, a mouse, a touch screen, a touch pad, or an audio input device.
In another implementation of obtaining the initial model, the training device receives the initial model sent by a terminal. Optionally, the terminal includes: a cell phone, a computer, a tablet computer, or a server.
In the embodiment of the present application, the reference human body may be any human body. The historical motion sequence is a motion sequence performed by the reference human body, obtained by observing the reference human body. Optionally, the historical motion sequence is a historical image sequence comprising n frames of historical images, each historical image including the reference human body. Optionally, n is 5.
Optionally, the historical motion sequence includes historical motion of a joint, and fig. 2 is a schematic diagram of a human joint according to an embodiment of the present application. The specific meaning of the joints shown in fig. 2 can be seen in table 1 below.
TABLE 1
Optionally, the action categories in the historical action sequence include the categories in table 2 below.
| Lifting flight | Eating    | Photographing | Wearing          | Shaking   | Making a phone call |
| Moving         | Peeling   | Viewing       | Is occurring     | Cutting   | Treading            |
| Passing        | Drinking  | Pouring       | Action in a race | Screwing  | Binding             |
| Seeing         | Reading   | Operating     | Cooking          | Toasting  | Squeezing           |
TABLE 2
In one implementation of obtaining a historical motion sequence of a reference human body, a training device photographs the reference human body through a camera to obtain a historical image sequence including the reference human body. The training device takes the historical image sequence as the historical action sequence.
In another implementation of obtaining a historical motion sequence of a reference human body, the training device receives a historical motion sequence input by a user through an input component.
In yet another implementation of obtaining a historical motion sequence of the reference human body, the training device receives the historical motion sequence sent by the terminal.
In the implementation of the application, the target future action sequence is a continuation of the historical action sequence, and the target future action sequence is executed after the historical action sequence is executed by the reference human body. Optionally, the target future action sequence is also an action sequence which is obtained by observing the reference human body and is generated by the reference human body, but the generation time of the target future action sequence is after the generation time of the historical action sequence. Optionally, the target future motion sequence is a future image sequence comprising n frames of future images, each future image comprising a reference human body.
In one implementation of acquiring a future sequence of actions of a reference human body, the training device photographs the reference human body through a camera to obtain a future sequence of images including the reference human body. The training device takes the future image sequence as the future action sequence.
In another implementation of obtaining a future sequence of actions of the reference person, the training device receives the future sequence of actions of the reference person entered by the user through the input component.
In yet another implementation of obtaining the future motion sequence of the reference human body, the training device receives the future motion sequence of the reference human body sent by the terminal.
It should be understood that, in the embodiment of the present application, the step of performing the step of acquiring the initial model, the step of performing the step of acquiring the historical motion sequence of the reference human body, and the step of performing the step of acquiring the target future motion sequence corresponding to the historical motion sequence may be performed simultaneously or may be performed separately, which is not limited in the present application.
102. And extracting target motion characteristics of the main body and target motion characteristics of the hand from the historical motion sequence by using the initial model.
In the embodiment of the application, the subject comprises the parts of the human body other than the hands, where a hand is the part below the wrist. In one possible implementation, the hands include a left hand and a right hand, and the subject includes the parts of the human body other than the left hand and the right hand. The subject of the reference human body is the part of the reference human body other than the hands, and the hands of the reference human body are the hands in the reference human body.
103. And normalizing the target motion characteristics of the main body and the target motion characteristics of the hand by the initial model through DN to obtain a normalized main body motion characteristic sequence and a normalized hand motion characteristic sequence.
Within the same human body, there are part differences between the motions of different parts; if the future motion sequence of the human body were predicted directly from the motion features of different parts, a large error would likely result. The part difference comprises at least one of the following: a difference in scale, a difference in degrees of freedom, and a difference in motion amplitude. For example, in the reference human body, there is a part difference between the target motion features of the subject and the target motion features of the hands.
The initial model therefore normalizes the target motion features of the subject and of the hands through DN to obtain the normalized subject motion feature sequence and the normalized hand motion feature sequence, which eliminates the part difference between the two sets of features, so that the future motion sequence of the reference human body can subsequently be predicted based on the normalized subject motion feature sequence and the normalized hand motion feature sequence.
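The concrete DN computation is only detailed later (see the Fig. 4 discussion); as a minimal sketch, assuming DN standardizes each part's feature sequence to zero mean and unit variance so that parts with different scales and motion amplitudes become comparable, a PyTorch version might look as follows (the (T, J, C) tensor layout is an assumption):

```python
import torch

def distribution_norm(feat: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Assumed distribution norm (DN): standardize one part's feature
    sequence so that scale and amplitude differences between parts vanish.
    feat: (T, J, C) -- time steps, joints of the part, feature channels."""
    mean = feat.mean(dim=(0, 1), keepdim=True)  # this part's feature mean
    std = feat.std(dim=(0, 1), keepdim=True)    # this part's feature spread
    return (feat - mean) / (std + eps)

# Each part is normalized independently before joint prediction:
# s_m_hat, s_l_hat, s_r_hat = map(distribution_norm, (s_m, s_l, s_r))
```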
104. The initial model predicts a future motion sequence of the subject and a future motion sequence of the hand based on the normalized subject motion feature sequence and the normalized hand motion feature sequence.
Building on the normalization in step 103, the initial model predicts the future motion sequence of the reference human body based on the normalized subject motion feature sequence and the normalized hand motion feature sequence, which reduces error and improves prediction accuracy. In this step, the initial model's prediction results include the future motion sequence of the subject and the future motion sequence of the hands. Based on the future motion sequence of the hands, future gestures of the reference human body can be determined, enabling prediction of the reference human body's fine-grained motion.
105. And updating parameters of the initial model based on the future motion sequence of the main body, the future motion sequence of the hand and the target future motion sequence to obtain a human motion prediction model.
In step 105, the training device updates the parameters of the initial model by taking the target future action sequence as the supervision information of the prediction result of the initial model. In one possible implementation, the training device determines a subject target motion sequence of the subject and a hand target motion sequence of the hand based on the target future motion sequence, wherein the subject target motion sequence is a motion sequence of the subject in the target future motion sequence, and the hand target motion sequence is a motion sequence of the hand in the target future motion sequence. The subject target motion sequence may be used as a true value (ground truth, GT) of the subject's future motion sequence, and the hand target motion sequence may be used as GT of the hand's future motion sequence.
The training device determines a first loss based on a first difference between a future motion sequence of the subject and a target motion sequence of the subject and a second difference between the future motion sequence of the hand and the target motion sequence of the hand, wherein the first loss is positively correlated with the first difference and the second difference. In one possible implementation, the training means obtains the first loss by weighted summing the first difference and the second difference.
In another possible implementation, the future motion sequences of the hand include a left hand future motion sequence and a right hand future motion sequence, and the hand target motion sequences include a left hand target motion sequence and a right hand target motion sequence. The training device determines a subject loss according to the difference between the future motion sequence of the subject and the subject target motion sequence, determines a left hand loss according to the difference between the future motion sequence of the left hand and the left hand target motion sequence, and determines a right hand loss according to the difference between the future motion sequence of the right hand and the right hand target motion sequence. The first loss is obtained by weighted summation of the body loss, the left hand loss and the right hand loss, wherein the body loss, the left hand loss and the right hand loss are positively correlated with the first loss.
After the first loss is obtained, the training device updates parameters of the initial model based on the first loss to obtain a human motion prediction model. Specifically, the training device determines a counter-propagating gradient of the initial model based on the first loss, and updates parameters of the initial model based on the gradient until the first loss converges, thereby obtaining the human motion prediction model.
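A minimal PyTorch sketch of the first loss and the update step described above; the L1 distance and the equal weights are assumptions, since the application only requires the loss to be positively correlated with the first and second differences:

```python
import torch
import torch.nn.functional as F

def first_loss(pred_m, pred_l, pred_r, gt_m, gt_l, gt_r,
               w_m: float = 1.0, w_l: float = 1.0, w_r: float = 1.0):
    """Weighted sum of subject, left-hand and right-hand prediction errors.
    Each tensor holds (T_future, J, 3) joint positions; the weights w_*
    are illustrative hyperparameters."""
    subject_loss = F.l1_loss(pred_m, gt_m)   # subject loss
    left_loss = F.l1_loss(pred_l, gt_l)      # left-hand loss
    right_loss = F.l1_loss(pred_r, gt_r)     # right-hand loss
    return w_m * subject_loss + w_l * left_loss + w_r * right_loss

# One training step: back-propagate the gradient of the loss and update the
# initial model's parameters, repeating until the loss converges.
# loss = first_loss(pred_m, pred_l, pred_r, gt_m, gt_l, gt_r)
# optimizer.zero_grad(); loss.backward(); optimizer.step()
```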
In the embodiment of the application, the training device acquires the initial model, the historical motion sequence of the reference human body, and the target future motion sequence corresponding to the historical motion sequence. Using the initial model, it extracts the target motion features of the subject and of the hands from the historical motion sequence, and normalizes them through DN to obtain a normalized subject motion feature sequence and a normalized hand motion feature sequence, thereby eliminating the part difference between the target motion features of the subject and of the hands. Predicting the future motion sequence of the subject and of the hands based on these normalized feature sequences therefore improves the accuracy of both predictions. Finally, the parameters of the initial model are updated based on the future motion sequence of the subject, the future motion sequence of the hands, and the target future motion sequence to obtain the human motion prediction model, so that the model gains the ability to predict both future sequences; in particular, future gestures can be determined from the future motion sequence of the hands, enabling prediction of fine-grained human motion and improving the overall accuracy of human motion prediction.
As an alternative embodiment, the future motion sequence of the hand comprises a future motion sequence of the left hand and a future motion sequence of the right hand. The training device also performs the following steps before performing step 105:
201. And determining a third difference between the normalized subject motion feature sequence and the normalized hand motion feature sequence.
In the embodiment of the application, the third difference can be used to measure the part difference between the normalized subject motion feature sequence and the normalized hand motion feature sequence. In one possible implementation, the training device calculates the maximum mean discrepancy (MMD) between the normalized subject motion feature sequence and the normalized hand motion feature sequence to obtain the third difference.
202. Based on the third difference, a second loss is determined.
In the embodiment of the present application, the second loss is related to the third difference, and optionally, the training device uses the third difference as the second loss.
After determining the second penalty, the training device performs the following steps in performing step 105:
203. And updating parameters of the initial model based on the first loss and the second loss to obtain the human motion prediction model.
In one possible implementation, the training device obtains the total loss by weighted summing the first loss and the second loss. And determining a counter-propagating gradient of the initial model based on the total loss, and updating parameters of the initial model based on the gradient until the total loss converges to obtain the human body motion prediction model.
In this embodiment, after determining the third difference between the normalized subject motion feature sequence and the normalized hand motion feature sequence, the training device determines the second loss based on the third difference, the second loss being positively correlated with the third difference. Updating the parameters of the initial model based on both the first loss and the second loss lets the second loss drive DN to remove the part difference, improving the human motion prediction model's ability to eliminate the part difference between the subject and the hands, and thereby its accuracy in predicting the future motion sequence of the human body.
As an alternative embodiment, the normalized hand motion feature sequence includes a normalized left-hand motion feature sequence and a normalized right-hand motion feature sequence. The training device performs the following steps in performing step 201:
301. And calculating the average value of the normalized main body motion characteristic sequence, the average value of the normalized left hand motion characteristic sequence and the average value of the normalized right hand motion characteristic sequence along the space dimension respectively to obtain a main body average motion characteristic sequence, a left hand average motion characteristic sequence and a right hand average motion characteristic sequence.
Because the normalized subject motion feature sequence carries both temporal and spatial information, the initial model averages it along the spatial dimension to obtain the subject average motion feature sequence, pooling the spatial information while keeping the temporal information intact.
Similarly, the initial model averages the normalized left-hand motion feature sequence to obtain the left-hand average motion feature sequence, and averages the normalized right-hand motion feature sequence to obtain the right-hand average motion feature sequence.
302. And calculating the pairwise MMDs among the subject average motion feature sequence, the left-hand average motion feature sequence and the right-hand average motion feature sequence to obtain the third difference.
Specifically, the training device calculates the MMD of the main body average motion feature sequence and the left hand average motion feature sequence to obtain a main left difference, calculates the MMD of the main body average motion feature sequence and the right hand average motion feature sequence to obtain a main right difference, and calculates the MMD of the left hand average motion feature sequence and the right hand average motion feature sequence to obtain a left right difference.
After the main left difference, the main right difference and the left-right difference are obtained, the training device obtains a third difference based on the main left difference, the main right difference and the left-right difference. Optionally, the training device obtains the third difference by summing the main left difference, the main right difference, and the left-right difference.
In this embodiment, the training device averages the normalized subject motion feature sequence, the normalized left-hand motion feature sequence and the normalized right-hand motion feature sequence along the spatial dimension to obtain the subject, left-hand and right-hand average motion feature sequences, preserving the temporal information while pooling the spatial information. The pairwise MMDs among these three average motion feature sequences are then computed to obtain the third difference, so that the third difference can better characterize the part differences between the subject and the left hand, between the subject and the right hand, and between the left hand and the right hand.
Optionally, in the case where the training device obtains the third difference by performing step 301 and step 302, the second loss is calculated as:

$$\mathcal{L}_2 = \mathrm{MMD}\big(\mathrm{Avg}(\hat{S}_m), \mathrm{Avg}(\hat{S}_l)\big) + \mathrm{MMD}\big(\mathrm{Avg}(\hat{S}_m), \mathrm{Avg}(\hat{S}_r)\big) + \mathrm{MMD}\big(\mathrm{Avg}(\hat{S}_l), \mathrm{Avg}(\hat{S}_r)\big)$$

where $\mathrm{MMD}(A, B)$ denotes the maximum mean discrepancy between $A$ and $B$, $\mathrm{Avg}(\cdot)$ denotes averaging along the spatial dimension, $\hat{S}_l$ is the normalized left-hand motion feature, $\hat{S}_m$ is the normalized subject motion feature, $\hat{S}_r$ is the normalized right-hand motion feature, and $\mathcal{L}_2$ is the second loss.
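A sketch of this second loss, assuming a Gaussian-kernel biased MMD estimator; the application fixes neither the kernel nor its bandwidth:

```python
import torch

def gaussian_kernel(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0):
    """Gaussian kernel matrix between sample sets x: (n, d) and y: (m, d)."""
    return torch.exp(-torch.cdist(x, y).pow(2) / (2 * sigma ** 2))

def mmd(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0):
    """Biased estimate of the squared maximum mean discrepancy."""
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean())

def second_loss(s_m: torch.Tensor, s_l: torch.Tensor, s_r: torch.Tensor):
    """s_*: (T, J, C) normalized features. Average over the spatial (joint)
    dimension -- keeping the time dimension intact -- then sum the three
    pairwise MMDs (subject-left, subject-right, left-right)."""
    a_m, a_l, a_r = s_m.mean(dim=1), s_l.mean(dim=1), s_r.mean(dim=1)  # (T, C)
    return mmd(a_m, a_l) + mmd(a_m, a_r) + mmd(a_l, a_r)
```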
As an alternative embodiment, the training device uses the initial model to perform the following steps, extracting the target motion features of the subject and the target motion features of the hand from the historical motion sequence:
401. And extracting the action characteristic sequence of the main body from the historical action sequence to obtain the historical main body action characteristic sequence.
402. And extracting the action characteristic sequence of the hand from the historical action sequence to obtain the historical hand action characteristic sequence.
403. The target motion features of the subject and of the hands are obtained by performing a discrete cosine transform (discrete cosine transform, DCT) on the historical subject motion feature sequence and the historical hand motion feature sequence, respectively.
Specifically, the initial model obtains target motion characteristics of the subject by performing DCT on the historical subject motion characteristic sequence, and obtains target motion characteristics of the hand by performing DCT on the historical hand motion characteristic sequence.
In this embodiment, the training device extracts the subject's motion feature sequence from the historical motion sequence to obtain the historical subject motion feature sequence, and extracts the hands' motion feature sequence to obtain the historical hand motion feature sequence. Applying a DCT to the historical subject motion feature sequence then yields target motion features of the subject with temporal smoothness, so that they better represent the temporal information of the historical subject motion feature sequence; likewise, applying a DCT to the historical hand motion feature sequence yields target motion features of the hands with temporal smoothness, so that they better represent the temporal information of the historical hand motion feature sequence.
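The DCT itself is standard; below is a sketch of applying an orthonormal DCT-II along the time axis of a feature sequence (the (T, J, C) layout is an assumption):

```python
import math
import torch

def dct_matrix(n: int) -> torch.Tensor:
    """Orthonormal DCT-II basis matrix of shape (n, n)."""
    k = torch.arange(n, dtype=torch.float32).unsqueeze(1)  # frequency index
    t = torch.arange(n, dtype=torch.float32).unsqueeze(0)  # time index
    m = torch.cos(math.pi * k * (2.0 * t + 1.0) / (2.0 * n))
    m[0] *= 1.0 / math.sqrt(2.0)
    return m * math.sqrt(2.0 / n)

def dct_along_time(seq: torch.Tensor) -> torch.Tensor:
    """Transform the time axis of a (T, J, C) feature sequence into smooth
    frequency coefficients; the inverse (IDCT) is the transposed matrix."""
    m = dct_matrix(seq.shape[0]).to(seq.dtype)
    return torch.einsum('ft,tjc->fjc', m, seq)
```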
As an alternative embodiment, the initial model includes a graph convolutional neural network (graph convolutional network, GCN). The initial model performs the following steps in performing step 403:
501. Acquire a fully connected graph of the reference human body.
In the embodiment of the application, the fully connected graph of the reference human body is constructed based on the connection relations of the reference human body's bones and joints. In the fully connected graph, every node is connected to all other nodes, where the nodes correspond to the joints of the human body.
502. And using the history main motion feature sequence and the history hand motion feature sequence as input of the full-connection graph, and processing the full-connection graph by using the GCN to obtain initial motion features of the main body and initial motion features of the hand.
The initial model takes the historical main body action characteristic sequence and the historical hand action characteristic sequence as the input of the full-connection graph, so that each joint in the full-connection graph can have initial action characteristics. The initial model utilizes GCN to process the full-connection graph, so that the information of the initial motion characteristics of the joints can be transmitted through the full-connection graph, and the information of the initial motion characteristics of different joints can be fused, so that the fusion of the information of the motion characteristics of different parts in a reference human body is realized, and finally, the initial motion characteristics of a main body and the initial motion characteristics of hands can be determined based on the full-connection graph.
503. The target motion characteristics of the subject are obtained by performing DCT on the initial motion characteristics of the subject.
504. The target motion characteristics of the hand are obtained by performing DCT on the initial motion characteristics of the hand.
In this embodiment, after the initial model acquires the full-connection map of the reference human body, the initial model uses the historic main motion feature sequence and the historic hand motion feature sequence as inputs of the full-connection map, and thus each joint in the full-connection map can be provided with initial motion features. The initial model processes the full-connection graph by using the GCN, so that the information of the initial motion characteristics of the joints can be transmitted based on the connection relation of the joints, and the information of the initial motion characteristics of different joints can be fused, so that the fusion of the information of the motion characteristics of different parts in a reference human body is realized, and finally, the initial motion characteristics of a main body and the initial motion characteristics of hands can be determined based on the full-connection graph.
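A minimal sketch of one graph-convolution layer over the fully connected joint graph; the learnable dense adjacency matrix is an assumption consistent with full connectivity, not a structure fixed by the application:

```python
import torch
import torch.nn as nn

class FullyConnectedGCNLayer(nn.Module):
    """One GCN layer on a fully connected joint graph: every joint exchanges
    information with every other joint through a dense learnable adjacency,
    fusing the motion features of the subject and both hands."""
    def __init__(self, num_joints: int, in_dim: int, out_dim: int):
        super().__init__()
        # dense adjacency over all joints (initialized near identity)
        self.adj = nn.Parameter(
            torch.eye(num_joints) + 0.01 * torch.randn(num_joints, num_joints))
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, J, C) -- per-frame features of all joints
        return torch.tanh(self.adj @ self.proj(x))  # (T, J, out_dim)
```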
In one possible implementation, the embodiment of the application further provides a training framework for the human motion prediction model, corresponding to the training method of the human motion prediction model. Referring to fig. 3, fig. 3 is a schematic diagram of a training framework of a human motion prediction model according to an embodiment of the present application; the training framework can be used to train an initial model to obtain a human motion prediction model.
As shown in fig. 3, in the training framework, after the initial model acquires the historical motion sequence of the reference human body, it first applies a discrete cosine transform to the historical motion sequence, where the historical motion sequence comprises a historical subject motion sequence (X_m in fig. 3), a historical left-hand motion sequence (X_l in fig. 3) and a historical right-hand motion sequence (X_r in fig. 3). It should be appreciated that the historical hand motion features described above include historical left-hand motion features and historical right-hand motion features. The initial model then encodes the transformed result through an intra-context encoding module, which is the GCN described above, to obtain the features of the different parts. Specifically, these features comprise the target motion features of the subject (S_m in fig. 3), the target motion features of the left hand (S_l in fig. 3) and the target motion features of the right hand (S_r in fig. 3), where the subject's target motion features include those of the left wrist and the right wrist, the left hand's target motion features include those of the left wrist, and the right hand's target motion features include those of the right wrist. The cross-context alignment module of the initial model aligns the features of the different parts to obtain the normalized subject motion feature sequence, the normalized left-hand motion feature sequence and the normalized right-hand motion feature sequence (their symbols in fig. 3 are rendered as images and are not reproduced in this text). Finally, the predictor of the initial model predicts the future motion sequence of the reference human body, comprising the future motion sequence of the subject, the future motion sequence of the left hand and the future motion sequence of the right hand, based on the three normalized feature sequences. The predictor includes a multilayer perceptron (MLP) and an inverse discrete cosine transform (IDCT). The predicted future motion sequence is then supervised with the target future motion sequence corresponding to the historical motion sequence, and the parameters of the discrete cosine transform, the intra-context encoding module, the cross-context alignment module and the predictor are updated.
Fig. 4 is a schematic structural diagram of the cross-alignment module according to an embodiment of the present application. Fig. 4 shows the process by which the cross-alignment module normalizes the features of different parts through a distribution norm (DN), which is implemented as follows:
The cross-alignment module may normalize, through the DN, any two of the subject's target motion features, the left hand's target motion features and the right hand's target motion features. The following explains the normalization of motion features of different parts through the DN by describing the implementation process of normalizing the subject's target motion features and the left hand's target motion features. Specifically, this normalization can be achieved by the following formula:
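Formula (2) is reproduced as an image in the original publication and cannot be recovered verbatim from this text. Based on the parameter definitions that follow, a distribution norm of the shape below is plausible; the α-mixing of the statistics and the rescaling by the mixed variance are assumptions of this reconstruction:

```latex
% plausible reconstruction of formula (2); the mixing rule is an assumption
\begin{aligned}
\mu_{lm,\alpha} &= \alpha\,\mu_m + (1-\alpha)\,\mu_l, &
\sigma_{lm,\alpha} &= \alpha\,\sigma_m + (1-\alpha)\,\sigma_l,\\
\hat{s}_m &= \sqrt{\sigma_{lm,\alpha}}\,\frac{s_m-\mu_m}{\sqrt{\sigma_m+\epsilon}} + \mu_{lm,\alpha}, &
\hat{s}_l &= \sqrt{\sigma_{lm,\alpha}}\,\frac{s_l-\mu_l}{\sqrt{\sigma_l+\epsilon}} + \mu_{lm,\alpha}.
\end{aligned}
\tag{2}
```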
Here α is a learnable parameter with α ∈ [0.5, 1]. μ_l is the mean of the left hand's target motion features and σ_l is their variance; specifically, μ_l = Avg(S_l) and σ_l = Var(S_l), where S_l denotes the left hand's target motion features, Avg(·) denotes the mean along the joint dimension (i.e., along the spatial dimension described above) and Var(·) denotes the variance along the joint dimension. μ_m is the mean of the subject's target motion features and σ_m is their variance. μ_{lm,α} and σ_{lm,α} are the mean and variance of the motion features obtained by normalizing the subject's target motion features and the left hand's target motion features. s_m (s_l) denotes the data of the subject's (left hand's) target motion features, and ŝ_m (ŝ_l) denotes the data of the normalized subject (left-hand) motion features. ε = e−5 is a parameter that avoids division by a variance of 0.
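As a concrete illustration under the same assumptions as the reconstruction above, a DN step for the subject and the left hand might look as follows in code:

```python
import numpy as np

def distribution_norm(s_m: np.ndarray, s_l: np.ndarray, alpha: float, eps: float = 1e-5):
    """Sketch of the DN: map subject and left-hand features to a shared
    alpha-mixed distribution. Statistics are taken along the joint dimension.

    s_m, s_l: target motion features of shape (num_joints, feat_dim);
    alpha:    learnable mixing weight in [0.5, 1];
    eps:      small constant avoiding a variance of 0 (value is an assumption).
    """
    mu_m, var_m = s_m.mean(axis=0), s_m.var(axis=0)  # Avg/Var along joints
    mu_l, var_l = s_l.mean(axis=0), s_l.var(axis=0)
    mu_mix = alpha * mu_m + (1 - alpha) * mu_l       # assumed mixing rule
    var_mix = alpha * var_m + (1 - alpha) * var_l
    s_m_hat = (s_m - mu_m) / np.sqrt(var_m + eps) * np.sqrt(var_mix) + mu_mix
    s_l_hat = (s_l - mu_l) / np.sqrt(var_l + eps) * np.sqrt(var_mix) + mu_mix
    return s_m_hat, s_l_hat
```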
Formula (2) characterizes the calculation process of normalizing the subject's target motion features and the left hand's target motion features through the DN. Similarly, the subject's target motion features and the right hand's target motion features can be normalized, and the right hand's target motion features and the left hand's target motion features can be normalized. In actual processing, the different parts can be cyclically normalized by the following formula:
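Formula (3) is likewise reproduced as an image in the original publication. Given the description of DN(A, B, C) that follows, a cyclic scheme of the following shape is plausible; the exact pairing and order of the three DN calls are assumptions:

```latex
% plausible reconstruction of formula (3); the call order is an assumption
\begin{aligned}
(\hat{S}_m,\ \hat{S}_l) &= \mathrm{DN}(S_m,\ S_l,\ \alpha),\\
(\hat{S}'_m,\ \hat{S}_r) &= \mathrm{DN}(\hat{S}_m,\ S_r,\ \beta),\\
(\hat{S}'_l,\ \hat{S}'_r) &= \mathrm{DN}(\hat{S}_l,\ \hat{S}_r,\ \gamma).
\end{aligned}
\tag{3}
```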
where DN (a, B, C) represents normalization of a and B by DN, where C is a parameter required for DN, and the specific implementation process can be seen in equation (2). In the cyclic normalization process, the intermediate result of the motion characteristics of the subject (normalized subject motion characteristics)/>In the cyclic normalization process, the intermediate result of the action characteristic of the left hand (normalized left hand action characteristic),/>Is the intermediate result of the motion feature of the right hand (normalized right hand motion feature) in the loop normalization process. Beta and gamma are the same as alpha and are learnable parameters, and the three parameters can be updated through training.
Fig. 4 also shows that the MMD is calculated pairwise among the normalized subject motion features, the normalized left-hand motion features and the normalized right-hand motion features.
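A compact sketch of a pairwise MMD computation follows; the RBF kernel and its bandwidth are illustrative choices, since the text only states that the MMD is computed between the mean feature sequences of each pair of parts:

```python
import torch

def gaussian_mmd(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Maximum mean discrepancy between two feature sets using an RBF kernel.

    x: (n, d) and y: (m, d) mean action features, e.g. the subject and the
    left hand after averaging along the joint (spatial) dimension.
    """
    def kernel(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        sq_dists = torch.cdist(a, b) ** 2
        return torch.exp(-sq_dists / (2 * sigma ** 2))

    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()
```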
Alternatively, the structure of the initial model is shown in table 3 below.
Table 3 [the table is reproduced as an image in the original publication and is not recoverable from this text; the notation it uses is explained below]
In Table 3, T denotes the duration of the input historical motion sequence, and H and H_c each denote the height of the image. 12 GCN_Blocks means that the GCN comprises 12 identical modules. BN denotes batch normalization. Tanh is the activation function and Dropout is a regularization operation. DN_ml denotes the parameters used to normalize the subject's and the left hand's target motion features through the DN, DN_mr denotes the parameters used to normalize the subject's and the right hand's target motion features through the DN, and DN_rl denotes the parameters used to normalize the left hand's and the right hand's target motion features through the DN. Linear denotes a linear layer.
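Since the table itself is not reproducible, the following sketch only illustrates the block ordering its notation implies (graph convolution, then BN, Tanh and Dropout, repeated 12 times); all dimensions and the dropout rate are placeholders:

```python
import torch
import torch.nn as nn

class GCNBlock(nn.Module):
    """One of the 12 identical blocks suggested by Table 3's notation."""

    def __init__(self, num_joints: int, feat_dim: int, p_drop: float = 0.3):
        super().__init__()
        self.adj = nn.Parameter(torch.eye(num_joints))   # dense learnable adjacency
        self.lin = nn.Linear(feat_dim, feat_dim)
        self.bn = nn.BatchNorm1d(num_joints * feat_dim)  # BN in Table 3
        self.act = nn.Tanh()                             # Tanh in Table 3
        self.drop = nn.Dropout(p_drop)                   # Dropout in Table 3

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_joints, feat_dim)
        b, j, d = x.shape
        h = self.lin(self.adj @ x)                           # graph convolution
        h = self.bn(h.reshape(b, j * d)).reshape(b, j, d)
        return self.drop(self.act(h))

# stacking the 12 identical modules referred to as 12 GCN_Blocks
# (num_joints and feat_dim are placeholder values)
blocks = nn.Sequential(*[GCNBlock(num_joints=67, feat_dim=64) for _ in range(12)])
```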
In the current technology, the future motion sequence of a human body is usually predicted by a joint-point-based prediction method; for example, the method of "Learning trajectory dependencies for human motion prediction" is a joint-point-based prediction method. When the human motion prediction model trained with the training framework shown in fig. 3 is used to predict the future motion sequence of a human body, the part differences between different parts can be eliminated and gestures can be predicted. Therefore, predicting the future motion sequence of a human body with the human motion prediction model yields improved results compared with joint-point-based prediction methods.
Specifically, fig. 5 is a schematic diagram comparing different prediction methods according to an embodiment of the present application. As shown in fig. 5, the upper half shows the prediction process of the joint-point-based prediction method (hereinafter simply the conventional method), and the lower half shows the process of predicting with a human motion prediction model trained with the training framework shown in fig. 3 (hereinafter simply the method of the present application). Clearly, the method of the present application can predict not only the future motion sequence of the subject but also the future motion sequence of the hand, which includes the future motion sequence of the palm, whereas the conventional method cannot predict the future motion sequence of the hand. The method of the present application can also eliminate the part differences between different parts through the cross-alignment module, which the conventional method does not do. Therefore, future motion sequences obtained with the method of the present application can be regarded as an improvement over those obtained with the conventional method.
Referring to fig. 6, fig. 6 is a schematic diagram illustrating the effect of eliminating part differences according to an embodiment of the application. The results in fig. 6 were obtained with the human motion prediction model after 3000 training iterations. Optionally, the features are reduced in dimension for visualization by t-distributed stochastic neighbor embedding (t-SNE). Fig. 6 has three columns, each representing the motion features of one part: the first column represents the motion features of the left hand, the second column those of the subject, and the third column those of the right hand. Each column has three rows: the first row represents the historical motion features, the second row the target motion features, and the third row the normalized motion features. Each dot in the figure represents a feature. As shown in fig. 6, the distributions of the normalized left-hand motion features, the normalized subject motion features and the normalized right-hand motion features are more similar and more concentrated than the distributions of the corresponding target motion features; this eliminates ambiguity and heterogeneity between different parts and thereby facilitates interaction between the features of different parts. The normalized distributions are also more similar to one another than the distributions of the historical left-hand, subject and right-hand motion features. This means that normalizing the features of different parts through the cross-alignment module can eliminate the part differences between different parts.
Referring to fig. 7a to 7f, fig. 7a to 7f are schematic diagrams illustrating the effects of predicted future motion sequences according to embodiments of the present application. Fig. 7a shows the prediction effect for a flipping motion, fig. 7b for a cooking motion, fig. 7c for a drinking motion, fig. 7d for an eating motion, fig. 7e for a transferring motion, and fig. 7f for a photographing motion.
Specifically, in fig. 7a to 7f, the duration of the historical motion sequence is 1000 milliseconds (ms), i.e., -1000 ms to 0 ms, and the duration of the predicted future motion sequence is 1000 ms, i.e., 0 ms to 1000 ms. The historical motion sequence in fig. 7a is a right view of the flipping motion, in fig. 7b a right view of the cooking motion, in fig. 7c a left view of the drinking motion, in fig. 7d a left view of the eating motion, in fig. 7e a front view of the transferring motion, and in fig. 7f a front view of the photographing motion.
The human motion prediction model trained with the training method of the human motion prediction model provided above can be used to predict human motions. Specifically, after the historical motion sequence of a target human body is obtained, the historical motion sequence is processed with the human motion prediction model to predict the future motion sequence of the target human body, where the future motion sequence of the target human body comprises: a future motion sequence of the subject of the target human body and a future motion sequence of the hand of the target human body, the subject including the part of the target human body other than the hands, and the hand including the left hand and the right hand of the target human body. Further, in an actual application scenario, corresponding processing can be performed based on the future motion sequence of the target human body.
In one possible application scenario, predicting human motions with the human motion prediction model can be applied to human-robot interaction (HRI), which refers to the interaction between a user and a human-computer interaction device; the device may be any electronic device. In the HRI application scenario, the human-computer interaction device needs to understand the intention expressed by the user by recognizing the user's actions, so that it can execute the operation corresponding to those actions according to the expressed intention. The response speed of the human-computer interaction device to the user's actions therefore also influences the user's experience of human-computer interaction. Based on this, the embodiment of the application further provides a human-computer interaction method to improve the response speed of the human-computer interaction device to the user's actions and thereby improve the user's experience of human-computer interaction.
The execution subject of the technical solution disclosed in the embodiment of the human-computer interaction method is a human-computer interaction device, which may be any electronic device capable of executing the technical solution disclosed in this embodiment. The human-computer interaction device comprises a camera. Optionally, the human-computer interaction device may be a robot for providing a service.
It will be appreciated that embodiments of the human-computer interaction method may also be implemented by way of a processor executing computer program code. Embodiments of the present application will be described below with reference to the accompanying drawings in the embodiments of the present application. Referring to fig. 8, fig. 8 is a flow chart of a man-machine interaction method according to an embodiment of the application.
801. Acquire the historical motion sequence of the target human body through the camera.
In the embodiment of the application, the target human body is the body of the user interacting with the human-computer interaction device, and the historical motion sequence consists of the motions made by the target human body during this interaction. During the interaction, the human-computer interaction device acquires a target image sequence including the target human body through the camera, and can then obtain the historical motion sequence of the target human body based on the target image sequence. In one possible implementation, the human-computer interaction device recognizes the motion of the target human body in each image by performing motion recognition on the images in the target image sequence, thereby obtaining the historical motion sequence of the target human body.
802. Acquire the human motion prediction model trained with the training method of the human motion prediction model described above.
803. Process the historical motion sequence of the target human body with the human motion prediction model to predict the future motion sequence of the target human body.
In the embodiment of the present application, the future motion sequence of the target human body comprises: a future motion sequence of the subject of the target human body and a future motion sequence of the hand of the target human body, the subject including the part of the target human body other than the hands, and the hand including the left hand and the right hand of the target human body.
804. Execute a target operation in response to the future motion sequence of the subject of the target human body and the future motion sequence of the hand of the target human body.
In the embodiment of the application, the target operation is the operation that the human-computer interaction device determines according to the intention expressed by the user, after determining that intention based on the future motion sequence of the subject of the target human body and the future motion sequence of the hand of the target human body. For example, if the human-computer interaction device determines, based on these future motion sequences, that the intention expressed by the user is to make a call, the target operation may be to turn down the volume of the device's speaker. For another example, if the device determines that the intention expressed by the user is to leave through the door, the target operation may be to output a polite farewell voice prompt.
In the embodiment of the application, the human-computer interaction device acquires the historical motion sequence of the target human body through the camera. Then, after obtaining the human motion prediction model, it processes the historical motion sequence with the model to predict the future motion sequence of the target human body, which comprises the future motion sequence of the subject of the target human body and the future motion sequence of the hand of the target human body. Because the device can predict the future gesture of the target human body based on the future motion sequence of the hand, it achieves prediction of refined motions; combining this with the future motion sequence of the subject, the device can predict the future motions of the target human body more comprehensively. Therefore, determining the target operation based on both the future motion sequence of the subject and the future motion sequence of the hand improves the accuracy of the target operation. Moreover, by executing the target operation in response to these future motion sequences, the device can respond to the user's actions in advance, which further improves its response speed to the user's actions and thereby the user's experience of human-computer interaction.
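The following end-to-end sketch ties steps 801 to 804 together; camera, extract_pose_sequence, decide_target_operation and the model file name are all hypothetical stand-ins for the device's own routines, not APIs defined by the patent:

```python
import torch

# step 802: load the trained human motion prediction model (hypothetical file name)
model = torch.load("human_motion_predictor.pt")
model.eval()

# step 801: capture a target image sequence and recognize the historical motions
frames = camera.capture(num_frames=50)   # hypothetical camera interface
history = extract_pose_sequence(frames)  # hypothetical motion recognition routine

# step 803: predict the future motion sequences of subject, left and right hand
with torch.no_grad():
    future_subject, future_left, future_right = model(history)

# step 804: decide and execute the target operation ahead of the user's action
operation = decide_target_operation(future_subject, future_left, future_right)
operation.execute()
```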
In another embodiment of the application, a prejudging device acquires the historical motion sequence of the target human body through a camera. Then, after obtaining the human motion prediction model, it processes the historical motion sequence with the model to predict the future motion sequence of the target human body, which comprises the future motion sequence of the subject and the future motion sequence of the hand. Based on the future motion sequence of the hand, the prejudging device can predict the future gesture of the target human body and thus determine whether the target human body is about to perform a dangerous operation with its hands. When it determines that a dangerous operation is about to be performed, the device outputs a prompt message to prompt the target human body to stop the dangerous operation and to prompt related personnel to stop it in time, thereby reducing the probability of accidents through the prejudgment of dangerous actions.
It will be appreciated by those skilled in the art that, in the methods of the specific embodiments described above, the written order of the steps does not imply a strict order of execution; the specific order of execution should be determined by the functions and possible inherent logic of the steps.
The foregoing has described in detail the methods according to the embodiments of the present application; the apparatuses according to the embodiments of the present application are provided below.
Referring to fig. 9, fig. 9 is a schematic structural diagram of a training device for a human motion prediction model according to an embodiment of the present application. The training device 1 for a human motion prediction model comprises an acquisition unit 11, an extraction unit 12, a normalization unit 13, a prediction unit 14 and an updating unit 15; the training device further comprises a determining unit 16. Specifically:
An obtaining unit 11, configured to obtain an initial model, a historical motion sequence of a reference human body, and a target future motion sequence corresponding to the historical motion sequence;
An extracting unit 12 for extracting target motion features of a subject and target motion features of a hand from the history motion sequence using the initial model, the subject including a part of the reference human body other than the hand, the hand including the hand of the reference human body;
the normalization unit 13 is configured to normalize the target motion feature of the subject and the target motion feature of the hand based on the initial model through DN, to obtain a normalized subject motion feature sequence and a normalized hand motion feature sequence;
A prediction unit 14, configured to predict a future motion sequence of the subject and a future motion sequence of the hand based on the normalized subject motion feature sequence and the normalized hand motion feature sequence by using the initial model;
And an updating unit 15, configured to update parameters of the initial model based on the future motion sequence of the subject, the future motion sequence of the hand, and the target future motion sequence, so as to obtain a human motion prediction model.
In combination with any embodiment of the present application, the updating unit 15 is configured to:
determining a subject target motion sequence of the subject and a hand target motion sequence of the hand based on the target future motion sequence;
Determining a first loss based on a first difference between the future motion sequence of the subject and the subject target motion sequence and a second difference between the future motion sequence of the hand and the hand target motion sequence, the first loss being positively correlated with both the first difference and the second difference (a sketch of such a loss follows this list);
and updating parameters of the initial model based on the first loss to obtain the human motion prediction model.
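A minimal sketch of such a first loss follows; the L2 distance and the unweighted sum are assumptions, since the text only requires the loss to be positively correlated with both differences:

```python
import torch

def first_loss(pred_subject, pred_hand, target_subject, target_hand):
    """First loss: grows with the subject difference and the hand difference."""
    diff_subject = torch.norm(pred_subject - target_subject)  # first difference
    diff_hand = torch.norm(pred_hand - target_hand)           # second difference
    return diff_subject + diff_hand                           # positive correlation
```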
In combination with any of the embodiments of the present application, the device 1 further comprises:
A determining unit 16, configured to determine a third difference between the normalized subject motion feature sequence and the normalized hand motion feature sequence;
the determining unit 16 is configured to determine a second loss based on the third difference, where the second loss is related to the third difference;
The updating unit 15 is configured to:
and updating parameters of the initial model based on the first loss and the second loss to obtain the human motion prediction model.
In combination with any one of the embodiments of the present application, the normalized hand motion feature sequence includes a normalized left hand motion feature sequence and a normalized right hand motion feature sequence;
The determining unit 16 is configured to:
Respectively calculating the average value of the normalized main body action characteristic sequence, the average value of the normalized left hand action characteristic sequence and the average value of the normalized right hand action characteristic sequence along the space dimension to obtain a main body average action characteristic sequence, a left hand average action characteristic sequence and a right hand average action characteristic sequence;
And aiming at the main body average action characteristic sequence, the left hand average action characteristic sequence and the right hand average action characteristic sequence, obtaining the third difference by calculating the maximum mean value difference between every two.
In combination with any of the embodiments of the present application, the extracting unit 12 is configured to:
extracting the action characteristic sequence of the main body from the historical action sequence to obtain a historical main body action characteristic sequence;
extracting the action characteristic sequence of the hand from the historical action sequence to obtain a historical hand action characteristic sequence;
And obtaining target motion characteristics of the main body and target motion characteristics of the hands by performing discrete cosine transform on the historical main body motion characteristic sequence and the historical hand motion characteristic sequence respectively.
In combination with any of the embodiments of the present application, the initial model includes a graph convolutional neural network;
the extraction unit 12 is configured to:
acquiring a full connection diagram of the reference human body;
The historical main body action feature sequence and the historical hand action feature sequence are used as input of the full-connection graph, and the full-connection graph is processed by using the graph convolution neural network to obtain initial action features of the main body and initial action features of the hands;
Obtaining target action characteristics of the main body by performing discrete cosine transform on the initial action characteristics of the main body;
And obtaining target motion characteristics of the hand by performing discrete cosine transform on the initial motion characteristics of the hand.
In the embodiment of the application, the training device acquires the initial model, the historical motion sequence of the reference human body and the target future motion sequence corresponding to the historical motion sequence. It extracts the target motion features of the subject and the target motion features of the hand from the historical motion sequence using the initial model, and normalizes them through the DN to obtain the normalized subject motion feature sequence and the normalized hand motion feature sequence, thereby eliminating the part difference between the subject's target motion features and the hand's target motion features. Predicting the future motion sequence of the subject and the future motion sequence of the hand based on the normalized feature sequences improves the accuracy of both predicted sequences. Finally, the parameters of the initial model are updated based on the future motion sequence of the subject, the future motion sequence of the hand and the target future motion sequence to obtain the human motion prediction model, so that the model has the ability to predict the future motion sequences of the subject and of the hand; since future gestures can be determined based on the future motion sequence of the hand, prediction of the refined motions of the human body can be achieved and the prediction accuracy of human motions can be improved.
Referring to fig. 10, fig. 10 is a schematic structural diagram of a man-machine interaction device according to an embodiment of the present application, where the man-machine interaction device 2 includes: camera 21, acquisition unit 22, prediction unit 23, execution unit 24, specifically:
A camera 21 for acquiring a history motion sequence of a target human body;
an acquisition unit 22 for acquiring a human motion prediction model obtained according to any one of the embodiments of the present application;
A prediction unit 23, configured to process the historical motion sequence of the target human body by using the human motion prediction model, and predict to obtain a future motion sequence of the target human body, where the future motion sequence of the target human body includes: a future motion sequence of a subject of the target human body and a future motion sequence of a hand of the target human body, the subject including a portion of the target human body other than the hand, the hand including a left hand of the target human body and a right hand of the target human body;
An execution unit 24 for executing a target operation in response to a future motion sequence of a subject of the target human body and a future motion sequence of a hand of the target human body.
In the embodiment of the application, the human-computer interaction device acquires the historical motion sequence of the target human body through the camera. Then, after obtaining the human motion prediction model, it processes the historical motion sequence with the model to predict the future motion sequence of the target human body, which comprises the future motion sequence of the subject of the target human body and the future motion sequence of the hand of the target human body. Because the device can predict the future gesture of the target human body based on the future motion sequence of the hand, it achieves prediction of refined motions; combining this with the future motion sequence of the subject, the device can predict the future motions of the target human body more comprehensively. Therefore, determining the target operation based on both future motion sequences improves the accuracy of the target operation, and executing the target operation in response to them allows the device to respond to the user's actions in advance, improving its response speed and thereby the user's experience of human-computer interaction.
In some embodiments, the functions or modules included in the apparatus provided by the embodiments of the present application may be used to perform the methods described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
Fig. 11 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application. The electronic device 3 comprises a processor 31 and a memory 32. Optionally, the electronic device 3 further comprises an input device 33 and an output device 34. The processor 31, the memory 32, the input device 33 and the output device 34 are coupled through connectors, which include various interfaces, transmission lines, buses and the like; the embodiments of the present application are not limited in this regard. It should be appreciated that, in the various embodiments of the application, "coupled" means interconnected in a particular way, including directly or indirectly through other devices, for example through various interfaces, transmission lines, buses and the like.
The processor 31 may comprise one or more processors, for example one or more central processing units (central processing unit, CPU), which in the case of a CPU may be a single-core CPU or a multi-core CPU. Alternatively, the processor 31 may be a processor group constituted by a plurality of CPUs, the plurality of processors being coupled to each other through one or more buses. In the alternative, the processor may be another type of processor, and the embodiment of the application is not limited.
Memory 32 may be used to store computer program instructions as well as various types of computer program code for performing aspects of the present application. Optionally, the memory includes, but is not limited to, random access memory (random access memory, RAM), read-only memory (ROM), erasable programmable read-only memory (erasable programmable read only memory, EPROM), or portable read-only memory (compact disc read-only memory, CD-ROM) for associated instructions and data.
The input means 33 are for inputting data and/or signals and the output means 34 are for outputting data and/or signals. The input device 33 and the output device 34 may be separate devices or may be an integral device.
It will be appreciated that in embodiments of the present application, the memory 32 may be used to store not only relevant instructions, but also relevant data, and embodiments of the present application are not limited to the specific data stored in the memory.
It will be appreciated that fig. 11 shows only a simplified design of an electronic device. In practical applications, the electronic device may further include other necessary elements, including but not limited to any number of input/output devices, processors, memories, etc., and all electronic devices that can implement the embodiments of the present application are within the scope of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein. It will be further apparent to those skilled in the art that the descriptions of the various embodiments of the present application are provided with emphasis, and that the same or similar parts may not be described in detail in different embodiments for convenience and brevity of description, and thus, parts not described in one embodiment or in detail may be referred to in description of other embodiments.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in, or transmitted through, a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital versatile disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
Those of ordinary skill in the art will appreciate that all or part of the processes in the above method embodiments can be implemented by a computer program instructing related hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The aforementioned storage medium includes: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.

Claims (11)

1. A method of training a human motion prediction model, the method comprising:
acquiring an initial model, a historical action sequence of a reference human body and a target future action sequence corresponding to the historical action sequence;
Extracting target motion features of a subject and target motion features of a hand from the historical motion sequence using the initial model, the subject including a portion of the reference human body other than the hand, the hand including the hand of the reference human body;
Normalizing the target action features of the main body and the target action features of the hands by the initial model through a distribution norm to obtain a normalized main body action feature sequence and a normalized hand action feature sequence;
The initial model predicts a future motion sequence of the subject and a future motion sequence of the hand based on the normalized subject motion feature sequence and the normalized hand motion feature sequence;
Updating parameters of the initial model based on the future motion sequence of the subject, the future motion sequence of the hand and the target future motion sequence to obtain a human motion prediction model, wherein the updating parameters of the initial model based on the future motion sequence of the subject, the future motion sequence of the hand and the target future motion sequence to obtain a human motion prediction model comprises: updating the parameters of the initial model by using the target future motion sequence as supervision information for a prediction result of the initial model, to obtain the human motion prediction model, wherein the prediction result comprises the future motion sequence of the subject and the future motion sequence of the hand.
2. The method of claim 1, wherein updating parameters of the initial model based on the future motion sequence of the subject, the future motion sequence of the hand, and the target future motion sequence to obtain a human motion prediction model comprises:
determining a subject target motion sequence of the subject and a hand target motion sequence of the hand based on the target future motion sequence;
Determining a first loss based on a first difference between the future motion sequence of the subject and the subject target motion sequence and a second difference between the future motion sequence of the hand and the hand target motion sequence, the first loss being positively correlated with both the first difference and the second difference;
and updating parameters of the initial model based on the first loss to obtain the human motion prediction model.
3. The method of claim 2, wherein prior to said updating parameters of said initial model based on said first loss to obtain said human motion prediction model, said method further comprises:
Determining a third difference between the normalized subject motion feature sequence and the normalized hand motion feature sequence;
Determining a second loss based on the third difference, the second loss being related to the third difference;
based on the first loss, updating parameters of the initial model to obtain the human motion prediction model, including:
and updating parameters of the initial model based on the first loss and the second loss to obtain the human motion prediction model.
4. A method according to claim 3, wherein the normalized hand-motion feature sequence comprises a normalized left-hand-motion feature sequence and a normalized right-hand-motion feature sequence;
The determining a third difference between the normalized subject motion feature sequence and the normalized hand motion feature sequence comprises:
Respectively calculating the average value of the normalized main body action characteristic sequence, the average value of the normalized left hand action characteristic sequence and the average value of the normalized right hand action characteristic sequence along the space dimension to obtain a main body average action characteristic sequence, a left hand average action characteristic sequence and a right hand average action characteristic sequence;
And aiming at the main body average action characteristic sequence, the left hand average action characteristic sequence and the right hand average action characteristic sequence, obtaining the third difference by calculating the maximum mean value difference between every two.
5. The method according to claim 1 or 2, wherein the extracting target motion features of a subject and target motion features of a hand from the historical motion sequence comprises:
extracting the action characteristic sequence of the main body from the historical action sequence to obtain a historical main body action characteristic sequence;
extracting the action characteristic sequence of the hand from the historical action sequence to obtain a historical hand action characteristic sequence;
And obtaining target motion characteristics of the main body and target motion characteristics of the hands by performing discrete cosine transform on the historical main body motion characteristic sequence and the historical hand motion characteristic sequence respectively.
6. The method of claim 5, wherein the initial model comprises a graph convolutional neural network;
The discrete cosine transform is performed on the historical subject motion feature sequence and the historical hand motion feature sequence to obtain a target motion feature of the subject and a target motion feature of the hand, including:
acquiring a full connection diagram of the reference human body;
The historical main body action feature sequence and the historical hand action feature sequence are used as input of the full-connection graph, and the full-connection graph is processed by utilizing a graph convolution neural network to obtain initial action features of the main body and initial action features of the hands;
Obtaining target action characteristics of the main body by performing discrete cosine transform on the initial action characteristics of the main body;
And obtaining target motion characteristics of the hand by performing discrete cosine transform on the initial motion characteristics of the hand.
7. The man-machine interaction method is characterized in that the man-machine interaction method is applied to a man-machine interaction device, the man-machine interaction device comprises a camera, and the method comprises the following steps:
collecting a historical action sequence of a target human body through the camera;
Acquiring a human motion prediction model trained by the method according to any one of claims 1 to 6;
And processing the historical motion sequence of the target human body by using the human body motion prediction model to predict and obtain a future motion sequence of the target human body, wherein the future motion sequence of the target human body comprises: a future motion sequence of a subject of the target human body and a future motion sequence of a hand of the target human body, the subject including a portion of the target human body other than the hand, the hand including a left hand of the target human body and a right hand of the target human body;
a target operation is performed in response to a future motion sequence of a subject of the target human body and a future motion sequence of a hand of the target human body.
8. A training device for a human motion prediction model, the device comprising:
the acquisition unit is used for acquiring an initial model, a historical action sequence of a reference human body and a target future action sequence corresponding to the historical action sequence;
An extracting unit configured to extract, from the history motion sequence, a target motion feature of a subject and a target motion feature of a hand, the subject including a part of the reference human body other than the hand, the hand including a hand of the reference human body, using the initial model;
the normalization unit is used for normalizing the target action characteristics of the main body and the target action characteristics of the hand through a distribution norm based on the initial model to obtain a normalized main body action characteristic sequence and a normalized hand action characteristic sequence;
A prediction unit, configured to predict, by using the initial model, a future motion sequence of the subject and a future motion sequence of the hand based on the normalized subject motion feature sequence and the normalized hand motion feature sequence;
An updating unit, configured to update parameters of the initial model based on the future motion sequence of the subject, the future motion sequence of the hand and the target future motion sequence to obtain a human motion prediction model, wherein the updating of the parameters comprises: updating the parameters of the initial model by using the target future motion sequence as supervision information for a prediction result of the initial model, to obtain the human motion prediction model, wherein the prediction result comprises the future motion sequence of the subject and the future motion sequence of the hand.
9. A human-machine interaction device, characterized in that the human-machine interaction device comprises:
The camera is used for collecting a historical action sequence of a target human body;
An acquisition unit for acquiring the human motion prediction model trained by the method according to any one of claims 1 to 6;
The prediction unit is configured to process the historical motion sequence of the target human body by using the human body motion prediction model, and predict to obtain a future motion sequence of the target human body, where the future motion sequence of the target human body includes: a future motion sequence of a subject of the target human body and a future motion sequence of a hand of the target human body, the subject including a portion of the target human body other than the hand, the hand including a left hand of the target human body and a right hand of the target human body;
and the execution unit is used for responding to the future action sequence of the main body of the target human body and the future action sequence of the hand of the target human body to execute target operation.
10. An electronic device, comprising: a processor and a memory for storing computer program code, the computer program code comprising computer instructions;
wherein, when the processor executes the computer instructions, the electronic device performs the method of any one of claims 1 to 6;
or, when the processor executes the computer instructions, the electronic device performs the method of claim 7.
11. A computer readable storage medium having a computer program stored therein, the computer program comprising program instructions;
wherein, when the program instructions are executed by a processor, the processor is caused to perform the method of any one of claims 1 to 6;
or, when the program instructions are executed by the processor, the processor is caused to perform the method of claim 7.
CN202311508827.9A 2023-11-13 2023-11-13 Training method of human motion prediction model, human-computer interaction method, and corresponding device, equipment and storage medium Active CN117523664B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311508827.9A CN117523664B (en) 2023-11-13 2023-11-13 Training method of human motion prediction model, human-computer interaction method, and corresponding device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN117523664A CN117523664A (en) 2024-02-06
CN117523664B true CN117523664B (en) 2024-06-25

Family

ID=89760220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311508827.9A Active CN117523664B (en) 2023-11-13 2023-11-13 Training method of human motion prediction model, human-computer interaction method, and corresponding device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117523664B (en)


Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005199403A (en) * 2004-01-16 2005-07-28 Sony Corp Emotion recognition device and method, emotion recognition method of robot device, learning method of robot device and robot device
CN105524984B (en) * 2014-09-30 2021-04-13 深圳华大基因科技有限公司 Method and apparatus for predicting neoepitope
US10375098B2 (en) * 2017-01-31 2019-08-06 Splunk Inc. Anomaly detection based on relationships between multiple time series
CN110298257B (en) * 2019-06-04 2023-08-01 东南大学 Driver behavior recognition method based on human body multi-part characteristics
CN111027473B (en) * 2019-12-09 2023-05-26 山东省科学院自动化研究所 Target recognition method and system based on real-time prediction of human body joint movement
US10911775B1 (en) * 2020-03-11 2021-02-02 Fuji Xerox Co., Ltd. System and method for vision-based joint action and pose motion forecasting
CN112070027B (en) * 2020-09-09 2022-08-26 腾讯科技(深圳)有限公司 Network training and action recognition method, device, equipment and storage medium
CN113449700B (en) * 2021-08-30 2021-11-23 腾讯科技(深圳)有限公司 Training of video classification model, video classification method, device, equipment and medium
CN116189284A (en) * 2022-12-16 2023-05-30 深圳大学 Human motion prediction method, device, equipment and storage medium
CN116313036A (en) * 2023-02-16 2023-06-23 北京科鹏医疗器械有限公司 Hand motion prediction algorithm based on motion measurement and machine learning

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102136066A (en) * 2011-04-29 2011-07-27 电子科技大学 Method for recognizing human motion in video sequence

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on an initialization algorithm for the scale skeleton model of frontal human motion; Hong Tao, Wang Shenkang; Journal of Zhejiang University (Engineering Science); 2004-12-30; Vol. 38, No. 12; pp. 1585-1588, 1605 *

Also Published As

Publication number Publication date
CN117523664A (en) 2024-02-06

Similar Documents

Publication Publication Date Title
CN111666857B (en) Human behavior recognition method, device and storage medium based on environment semantic understanding
CN108205655B (en) Key point prediction method and device, electronic equipment and storage medium
CN110929622B (en) Video classification method, model training method, device, equipment and storage medium
WO2022022274A1 (en) Model training method and apparatus
CN111652863A (en) Medical image detection method, device, equipment and storage medium
WO2023231954A1 (en) Data denoising method and related device
CN110781413A (en) Interest point determining method and device, storage medium and electronic equipment
CN112149615A (en) Face living body detection method, device, medium and electronic equipment
CN111160049B (en) Text translation method, apparatus, machine translation system, and storage medium
CN112529149A (en) Data processing method and related device
CN113887501A (en) Behavior recognition method and device, storage medium and electronic equipment
CN117523664B (en) Training method of human motion prediction model, human-computer interaction method, and corresponding device, equipment and storage medium
Mahalakshmi et al. Few-shot learning-based human behavior recognition model
CN110929731B (en) Medical image processing method and device based on pathfinder intelligent search algorithm
WO2023185541A1 (en) Model training method and related device
Suarez et al. AFAR: a real-time vision-based activity monitoring and fall detection framework using 1D convolutional neural networks
CN113553893A (en) Human body falling detection method and device based on deep neural network and electronic equipment
CN117253287A (en) Action prediction model training method based on domain generalization, related method and product
CN115205750B (en) Motion real-time counting method and system based on deep learning model
CN114461078B (en) Man-machine interaction method based on artificial intelligence
CN115907041A (en) Model training method and device
CN114266352A (en) Model training result optimization method and device, storage medium and equipment
CN113762648A (en) Public defense black swan event prediction method, device, equipment and medium
CN117274615B (en) Human body action prediction method and related products
CN117523665A (en) Training method of human motion prediction model, related method and related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant