CN116824686A - Action recognition method and related device


Info

Publication number
CN116824686A
Authority
CN
China
Prior art keywords
position information
joint point
image frame
feature
module
Prior art date
Legal status
Pending
Application number
CN202210278157.5A
Other languages
Chinese (zh)
Inventor
张莹
李琛
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210278157.5A
Publication of CN116824686A
Status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a motion recognition method and a related device, which can be applied to scenarios such as cloud technology, artificial intelligence, intelligent traffic, assisted driving and vehicle-mounted scenarios. When a target object is shot to obtain an image frame to be identified, the two-dimensional joint point position information of the target object in the image frame to be identified is obtained and used as the input of an action recognition model. A feature generation module performs feature generation according to the two-dimensional joint point position information to obtain a target feature vector; a prediction module then predicts, from the target feature vector, the action rotation parameters and action displacement parameters of each joint point of the target object; and a kinematic analysis module performs kinematic analysis according to the action rotation parameters and action displacement parameters to obtain the three-dimensional joint point position information of the corresponding joint points. The calculation amount and calculation time of the action recognition model are thereby greatly reduced, the action recognition efficiency is improved, and real-time action recognition is realized.

Description

Action recognition method and related device
Technical Field
The present application relates to the field of computer vision, and in particular, to a motion recognition method and related apparatus.
Background
Artificial intelligence (Artificial Intelligence, AI) is a branch of science and technology that studies and develops theories, methods, techniques and application systems for simulating, extending and expanding human intelligence. In recent years, with the development of artificial intelligence, computer vision technology based on artificial intelligence has also developed rapidly. Human motion recognition is an important direction within it and has wide application prospects in fields such as security, human-computer interaction, object understanding, object special effects, game entertainment, film and television production, and three-dimensional (3D) modeling.
At present, when motion recognition is performed, the body type and motion parameters of a human body parameterized model can be directly estimated according to an input image or video, for example, the image characteristics of each frame of image in a video are extracted by a convolution network, and then the time sequence information of a motion sequence is captured by a time sequence network module to obtain more accurate motion estimation.
However, such methods mainly rely on large models such as ResNet-50 to encode image features for 3D human motion recognition, which involves a large amount of calculation and a long calculation time, resulting in low motion recognition efficiency and making real-time motion recognition difficult to realize.
Disclosure of Invention
In order to solve the above technical problems, the application provides a motion recognition method and a related device, which greatly reduce the calculation amount and calculation time of the motion recognition model, improve the motion recognition efficiency, and facilitate real-time motion recognition. Meanwhile, owing to the reduced calculation amount, the network structure complexity of the motion recognition model is also greatly reduced, so that motion recognition can easily be realized based on a lightweight network, which is more suitable for real-time motion recognition on a mobile terminal.
The embodiment of the application discloses the following technical scheme:
in one aspect, an embodiment of the present application provides an action recognition method, including:
when a target object is shot to obtain an image frame to be identified, acquiring two-dimensional joint point position information of the target object in the image frame to be identified;
according to the two-dimensional joint point position information, performing feature generation by using a feature generation module of an action recognition model to obtain a target feature vector;
according to the target feature vector, predicting by using a prediction module of the motion recognition model to obtain motion rotation parameters and motion displacement parameters of each joint point of the target object;
and according to the motion rotation parameters and the motion displacement parameters, performing kinematic analysis by using a kinematic analysis module of the motion recognition model to obtain the three-dimensional joint point position information of the corresponding joint point.
In one aspect, an embodiment of the present application provides an action recognition apparatus, where the apparatus includes an acquisition unit, a generation unit, a prediction unit, and an analysis unit:
the acquisition unit is used for acquiring the two-dimensional joint point position information of the target object in the image frame to be identified when the target object is shot to obtain the image frame to be identified;
The generating unit is used for generating characteristics by utilizing a characteristic generating module of the action recognition model according to the two-dimensional joint point position information to obtain a target characteristic vector;
the prediction unit is used for predicting by using a prediction module of the motion recognition model according to the target feature vector to obtain motion rotation parameters and motion displacement parameters of each joint point of the target object;
the analysis unit is used for performing kinematic analysis by utilizing the kinematic analysis module of the motion recognition model according to the motion rotation parameter and the motion displacement parameter to obtain the three-dimensional joint point position information of the corresponding joint point.
In one aspect, an embodiment of the present application provides an electronic device for motion recognition, the electronic device including a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the action recognition method according to the foregoing aspect according to instructions in the program code.
In one aspect, embodiments of the present application provide a computer-readable storage medium storing program code for performing the action recognition method of the foregoing aspect.
In one aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements the method of action recognition of the preceding aspect.
According to the technical scheme, when the target object is shot to obtain the image frame to be identified, the two-dimensional joint point position information of the target object in the image frame to be identified is firstly obtained, then the two-dimensional joint point position information is used as the input of the motion identification model, and the motion identification is realized by predicting the three-dimensional joint point position information according to the two-dimensional joint point position information through the motion identification model. Because the input of the motion recognition model is the two-dimensional node position information instead of the image or the video, the motion recognition model is not required to extract the two-dimensional node position information from a large amount of information contained in the image or the video through complex processing, so that the calculation amount and calculation time of the motion recognition model are greatly reduced, and the network structure complexity of the motion recognition model is also greatly reduced. In predicting the three-dimensional joint point position information through the motion recognition model, feature generation can be performed by using a feature generation module of the motion recognition model according to the two-dimensional joint point position information to obtain a target feature vector, and then prediction is performed by using a prediction module of the motion recognition model according to the target feature vector to obtain the motion rotation parameter and the motion displacement parameter of each joint point of the target object, so that according to the motion rotation parameter and the motion displacement parameter, kinematic analysis (such as forward kinematic analysis) is performed by using a kinematic analysis module of the motion recognition model to obtain the three-dimensional joint point position information of the corresponding joint point. Therefore, the two-dimensional joint point position information required by the motion recognition is directly used as the input of the motion recognition model, so that the calculated amount and the calculated time of the motion recognition model are greatly reduced, the motion recognition efficiency is improved, and the real-time motion recognition is convenient to realize. Meanwhile, due to the reduction of the calculated amount, the network structure complexity of the motion recognition model is greatly reduced, the motion recognition is easy to realize based on a lightweight network, and the method is more suitable for realizing real-time motion recognition of the mobile terminal.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive faculty for a person skilled in the art.
Fig. 1 is an application scenario architecture diagram of an action recognition method according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for identifying actions according to an embodiment of the present application;
FIG. 3 is a block diagram of an action recognition model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an animation of a target object generated according to three-dimensional joint point position information according to an embodiment of the present application;
fig. 5 is a schematic diagram of a human motion estimation result according to an embodiment of the present application;
fig. 6 is a schematic diagram of an effect for 3D character driving according to an embodiment of the present application;
FIG. 7 is a flowchart of a training method of an action recognition model according to an embodiment of the present application;
fig. 8 is a schematic diagram of the network structures of the Fusion Block and the FC Block provided by the embodiment of the present application;
Fig. 9 is a block diagram of an action recognition device according to an embodiment of the present application;
fig. 10 is a block diagram of a terminal according to an embodiment of the present application;
fig. 11 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
Human motion recognition has a very wide range of applications and can be used in fields such as security, human-computer interaction, object understanding, object (human body) special effects, game entertainment, film, television and short-video production, and three-dimensional (3D) modeling. For example, by recognizing human motion and locating the three-dimensional joint point position information of the human joint points, 3D modeling can be used to simulate human motion, which can in turn be used in film and television production. Object understanding can also be performed by locating the three-dimensional joint point position information of the human joint points, including understanding the motion of an object (for example, a human body), such as whether the human body is swinging its arms, dancing, or performing other motions. Human special effects can be added to the human body based on motion recognition of the human body; and human-computer interaction and game entertainment, such as realizing game interaction through motion recognition of the human body, are further examples that will not be listed one by one.
It should be noted that, with the wide application of mobile terminals, people's life and work have become inseparable from them, and realizing real-time motion recognition on mobile terminals has become a common requirement. The motion recognition method provided by the related art mainly relies on large models such as ResNet-50 to encode image features for 3D human motion recognition, which involves a large amount of calculation and a long calculation time, resulting in low motion recognition efficiency; it is therefore difficult to realize real-time motion recognition and difficult to apply on mobile terminals.
In order to solve the technical problems, the embodiment of the application provides a motion recognition method, which takes the two-dimensional joint point position information required by motion recognition directly as the input of a motion recognition model, thereby greatly reducing the calculated amount and calculation time of the motion recognition model, improving the motion recognition efficiency and facilitating the realization of real-time motion recognition. Meanwhile, due to the reduction of the calculated amount, the network structure complexity of the motion recognition model is greatly reduced, the motion recognition is easy to realize based on a lightweight network, and the method is more suitable for realizing real-time motion recognition of the mobile terminal.
As shown in fig. 1, fig. 1 shows an application scenario architecture diagram of an action recognition method. A terminal 101 may be included in the application scenario. The terminal 101 may be a mobile terminal or a fixed terminal, and the embodiment of the present application is mainly described by using the terminal 101 as a mobile terminal. The terminal 101 may be, for example, a mobile phone, a computer, an intelligent voice interaction device, an intelligent home appliance, a vehicle-mounted terminal, an aircraft, etc., but is not limited thereto. The embodiment of the application can be applied to various scenes, including but not limited to cloud technology, artificial intelligence, intelligent traffic, auxiliary driving, vehicle-mounted scenes and the like.
Taking the terminal 101 as a mobile terminal and the mobile terminal as a mobile phone as an example: if a target object (for example, a human body or an animal) is shot by the mobile phone and a special-effect animation, such as armor, is to be added to the shot target object in real time, then action recognition needs to be performed on the target object, so that armor matched with the current action of the target object can be added to the target object according to the recognized three-dimensional joint point position information.
Specifically, when the target object is photographed to obtain an image frame to be identified, the mobile phone may first obtain the two-dimensional joint point position information of the target object in the image frame to be identified. A joint point is a point representing a movable joint of the target object and may include, for example, the right heel, left heel, right knee, left knee, right hip, left hip, right wrist, left wrist, right elbow, left elbow, right shoulder, left shoulder, head, and the like. The two-dimensional joint point position information reflects the positions of the joint points of the target object in the image frame to be identified and can be extracted in advance.
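For illustration, a minimal sketch of how the two-dimensional joint point position information X for one image frame might be organized is given below; the joint order, array layout and example coordinates are assumptions of this sketch rather than requirements of the application.

```python
# Illustrative only: the application does not fix a joint order or data layout.
import numpy as np

JOINT_NAMES = [
    "right_heel", "left_heel", "right_knee", "left_knee",
    "right_hip", "left_hip", "right_wrist", "left_wrist",
    "right_elbow", "left_elbow", "right_shoulder", "left_shoulder", "head",
]

# Two-dimensional joint point position information X for one image frame:
# one (x, y) pixel coordinate per joint point, shape (num_joints, 2).
X = np.zeros((len(JOINT_NAMES), 2), dtype=np.float32)
X[JOINT_NAMES.index("head")] = (412.0, 96.0)  # example pixel coordinates
```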
The mobile phone takes the two-dimensional joint point position information as the input of the action recognition model, so that the action recognition is realized by predicting the three-dimensional joint point position information according to the two-dimensional joint point position information through the action recognition model. Because the input of the motion recognition model is the two-dimensional node position information instead of the image or the video, the motion recognition model is not required to extract the two-dimensional node position information from a large amount of information contained in the image or the video through complex processing, so that the calculation amount and calculation time of the motion recognition model are greatly reduced, and the network structure complexity of the motion recognition model is also greatly reduced.
In predicting the three-dimensional joint point position information through the motion recognition model, the mobile phone can utilize the feature generation module of the motion recognition model to perform feature generation according to the two-dimensional joint point position information to obtain a target feature vector, then utilize the prediction module of the motion recognition model to predict according to the target feature vector to obtain the motion rotation parameter and the motion displacement parameter of each joint point of the target object, and finally utilize the kinematic analysis module of the motion recognition model to perform kinematic analysis (such as forward kinematic analysis) according to the motion rotation parameter and the motion displacement parameter to obtain the three-dimensional joint point position information of the corresponding joint point. The three-dimensional joint point position information can reflect the action of the target object in three-dimensional space, and armor matched with the current action of the target object can then be added to the target object according to the three-dimensional joint point position information. The display result after the matched armor is added can be shown as 102 in fig. 1.
It will be appreciated that the methods provided by embodiments of the present application may involve artificial intelligence (Artificial Intelligence, AI), which is a theory, method, technique, and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results. Artificial intelligence is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, automatic driving, intelligent traffic, and other directions.
The method provided by the embodiment of the application can particularly relate to Computer Vision (CV), which is a science for researching how to make a machine "see", and further means that a camera and a Computer are used for replacing human eyes to recognize, follow and measure targets and the like, and further graphic processing is carried out, so that the Computer is processed into images which are more suitable for human eyes to observe or transmit to an instrument to detect. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, synchronous positioning and mapping, autopilot, intelligent transportation, etc., as well as common biometric technologies such as face recognition, fingerprint recognition, etc. The embodiment of the application mainly relates to behavior recognition, 3D technology and the like.
The method provided by the embodiment of the application can also relate to Machine Learning (ML) which is a multi-domain interdisciplinary and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, confidence networks, reinforcement learning, transfer learning, induction learning, teaching learning, and the like. For example, an action recognition model is derived based on machine learning training.
Next, taking an example of a method for identifying actions performed by a mobile terminal, the method for identifying actions provided by the embodiment of the present application will be described in detail with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 shows a flow chart of a method of action recognition, the method comprising:
s201, when a target object is shot to obtain an image frame to be identified, acquiring two-dimensional joint point position information of the target object in the image frame to be identified.
If the mobile terminal shoots a target object (the target object may be, for example, a human body or an animal; the embodiment of the application mainly takes a human body as an example) and recognizes the motion of the target object in real time, the mobile terminal can acquire the two-dimensional joint point position information of the target object in the image frame to be identified.
And S202, performing feature generation by using a feature generation module of the motion recognition model according to the two-dimensional joint point position information to obtain a target feature vector.
The mobile terminal inputs the two-dimensional joint point position information into the action recognition model, so that the action recognition is realized by predicting the three-dimensional joint point position information through the action recognition model according to the two-dimensional joint point position information. The motion recognition model can comprise a feature generation module, a prediction module and a kinematic analysis module, and can be obtained through pre-training, and a training method of the motion recognition model is described in detail later.
After the mobile terminal inputs the two-dimensional joint point position information into the action recognition model, the feature generation module can be utilized to perform feature generation according to the two-dimensional joint point position information, and a target feature vector is obtained. Wherein the two-dimensional node position information may be represented by X.
It can be understood that, when motion recognition is performed on the image frame to be recognized, the image frame to be recognized may not be the first image frame; for example, a previous image frame may be located in the same image frame sequence as the image frame to be recognized, immediately before and adjacent to it. Because motion is usually continuous, the action made by the target object in the image frame to be recognized will not change greatly from the action made in the previous image frame. Therefore, in order to ensure the stability of the prediction result, the related information of the previous image frame can be effectively utilized to stabilize the prediction result when motion recognition is performed on the image frame to be recognized.
In this case, the feature generation module may include a first feature extraction module and a first feature fusion module, and S202 may be implemented as follows: first obtain a feature extraction result of the two-dimensional joint point position information; then generate, according to the feature extraction result, a first feature vector corresponding to the image frame to be identified by using the first feature extraction module; then obtain the first feature vector corresponding to the previous image frame; and finally perform feature fusion on the first feature vector corresponding to the image frame to be identified and the first feature vector corresponding to the previous image frame by using the first feature fusion module to obtain the target feature vector.
By the method, the feature vector of the image frame to be identified can be enhanced by effectively utilizing the related information of the previous image frame, so that the target feature vector comprising richer information is obtained, and the target object in the image frame to be identified can be identified according to the richer information, so that the prediction result is stabilized.
In some cases, the feature generation module may include multiple layers of feature extraction modules. Taking the case where the multi-layer feature extraction modules are, in order, the second feature extraction module and the first feature extraction module as an example, the feature vectors extracted by the feature extraction modules increasingly reflect the three-dimensional joint point positions from the input side to the output side of the motion recognition model. In order to achieve more accurate feature fusion, the feature generation module further includes multiple layers of feature fusion modules: when the multi-layer feature extraction modules are, in order, the second feature extraction module and the first feature extraction module, the multi-layer feature fusion modules are, in order, the second feature fusion module and the first feature fusion module, with the second feature extraction module and the second feature fusion module located before the first feature extraction module. In this case, the motion recognition model can be shown with reference to fig. 3. In fig. 3, the motion recognition model may include a feature generation module 301, a prediction module 302, and a kinematic analysis module 303, and the feature generation module 301 includes a second feature extraction module 3011, a second feature fusion module 3012, a first feature extraction module 3013, and a first feature fusion module 3014.
In this case, the feature extraction result of the two-dimensional joint point position information may be obtained by performing feature extraction on the two-dimensional joint point position information with the second feature extraction module to obtain a second feature vector corresponding to the image frame to be identified, and taking this second feature vector as the feature extraction result. Correspondingly, generating the first feature vector corresponding to the image frame to be identified with the first feature extraction module according to the feature extraction result may be done by fusing, through the second feature fusion module, the second feature vector corresponding to the image frame to be identified with the second feature vector corresponding to the previous image frame to obtain a fused feature vector, and then encoding the fused feature vector through the first feature extraction module to obtain the first feature vector.
Note that the second feature extraction module 3011 is located before the first feature extraction module 3013 and extracts an earlier feature vector than the first feature extraction module 3013. The second feature extraction module 3011 may therefore be referred to as an early feature extraction module, denoted Early Stage, and the first feature extraction module 3013 may be referred to as a late feature extraction module, denoted Late Stage; accordingly, the second feature fusion module 3012 may be referred to as an early feature fusion module, denoted Early Fusion, and the first feature fusion module 3014 may be referred to as a late feature fusion module, denoted Late Fusion. The second feature extraction module 3011 performs feature extraction on the two-dimensional joint point position information, and the resulting second feature vector may be referred to as the early feature vector of the image frame to be identified, denoted here F_t^early, where t is the frame number, indicating that the image frame to be identified is the t-th image frame; the second feature vector corresponding to the previous image frame may be referred to as the early feature vector of the previous image frame, denoted F_{t-1}^early, where t-1 indicates that the previous image frame is the (t-1)-th image frame; the fused feature vector output by the second feature fusion module 3012 may be referred to as the fused early feature vector, denoted F_t^early-fused; the first feature vector output by the first feature extraction module 3013 may be referred to as the late feature vector of the image frame to be identified, denoted F_t^late; the first feature vector corresponding to the previous image frame may be referred to as the late feature vector of the previous image frame, denoted F_{t-1}^late; and the target feature vector output by the first feature fusion module 3014 may be referred to as the fused late (post-fusion) feature vector, denoted F_t^late-fused.
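The following is a minimal sketch, in PyTorch, of the Early Stage / Early Fusion / Late Stage / Late Fusion data flow described above. The layer types, feature dimensions and concatenation-based fusion are assumptions of the sketch; the application only fixes the module layout.

```python
import torch
import torch.nn as nn

class FeatureGeneration(nn.Module):
    def __init__(self, num_joints=13, feat_dim=256):
        super().__init__()
        # Early Stage: encode the 2D joint point positions of the current frame.
        self.early_stage = nn.Sequential(nn.Linear(num_joints * 2, feat_dim), nn.ReLU())
        # Early Fusion: fuse the early features of frame t and frame t-1.
        self.early_fusion = nn.Sequential(nn.Linear(feat_dim * 2, feat_dim), nn.ReLU())
        # Late Stage: encode the fused early feature into a late feature.
        self.late_stage = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
        # Late Fusion: fuse the late features of frame t and frame t-1 into the target feature.
        self.late_fusion = nn.Sequential(nn.Linear(feat_dim * 2, feat_dim), nn.ReLU())

    def forward(self, x_t, early_prev, late_prev):
        # x_t: (B, num_joints, 2) 2D joint point positions of the frame to be identified.
        early_t = self.early_stage(x_t.flatten(1))
        fused_early = self.early_fusion(torch.cat([early_t, early_prev], dim=1))
        late_t = self.late_stage(fused_early)
        target = self.late_fusion(torch.cat([late_t, late_prev], dim=1))
        # early_t and late_t are cached by the caller and reused for the next frame.
        return target, early_t, late_t
```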
Compared with the related art, the timing network module used in the related art is designed for an already-captured video, in which the frames both before and after the image frame to be identified are visible; it is therefore not suitable for real-time action recognition. The embodiment of the application uses only the previous image frame, so it is more suitable for real-time action recognition and better suited to mobile terminals.
S203, according to the target feature vector, predicting by using a prediction module of the motion recognition model to obtain motion rotation parameters and motion displacement parameters of each joint point of the target object.
And after extracting the target feature vector, predicting by using a prediction module according to the target feature vector to obtain the action rotation parameter and the action displacement parameter of each joint point of the target object. In one possible implementation, since the motion rotation parameter and the motion displacement parameter are different parameters, the emphasis points of the motion rotation parameter and the motion displacement parameter on the target feature vector may be slightly different, and thus the motion rotation parameter and the motion displacement parameter may be respectively predicted based on different branch networks of the prediction module. Specifically, referring to fig. 3, the prediction module 302 may include a rotation parameter prediction module 3021 and a displacement prediction module 3022. The rotation parameter prediction module 3021 may be represented as a Quat Head, which may predict according to a target feature vector to obtain an action rotation parameter Q of each node of the target object; the displacement prediction module 3022 may be denoted as a Trans Head, and the displacement prediction module 3022 may predict the motion displacement parameter T of each node of the target object according to the target feature vector.
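A possible sketch of the two prediction branches is given below. The six-dimensional rotation output per joint point follows the rotation representation mentioned later in this description, and modelling the displacement as a single root translation is an assumption of this sketch.

```python
import torch.nn as nn

class PredictionModule(nn.Module):
    def __init__(self, feat_dim=256, num_joints=13, rot_dim=6):
        super().__init__()
        self.num_joints, self.rot_dim = num_joints, rot_dim
        # Quat Head: action rotation parameters Q for every joint point.
        self.quat_head = nn.Linear(feat_dim, num_joints * rot_dim)
        # Trans Head: action displacement parameters T (modelled here as one root
        # displacement, which is an assumption of this sketch).
        self.trans_head = nn.Linear(feat_dim, 3)

    def forward(self, target_feature):
        q = self.quat_head(target_feature).view(-1, self.num_joints, self.rot_dim)
        t = self.trans_head(target_feature)
        return q, t
```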
S204, according to the motion rotation parameters and the motion displacement parameters, performing kinematic analysis by using a kinematic analysis module of the motion recognition model to obtain the three-dimensional joint point position information of the corresponding joint point.
Referring to fig. 3, the mobile terminal may perform kinematic analysis by using the kinematic analysis module 303 according to the motion rotation parameter and the motion displacement parameter to obtain three-dimensional joint point position information of the corresponding joint point. The kinematic analysis module may be denoted as FK Layer, and the kinematic analysis may generally include forward kinematic analysis and backward kinematic analysis, and the embodiment of the present application calculates three-dimensional node position information mainly through the forward kinematic analysis. Wherein the three-dimensional joint point position information can be represented by J.
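A minimal forward-kinematics sketch is shown below: given per-joint rotations and a root displacement, the kinematic tree is traversed and transforms are accumulated to obtain three-dimensional joint point positions. The skeleton topology (parents, offsets) is a placeholder and must list each parent before its children.

```python
import numpy as np

def forward_kinematics(rotations, root_translation, parents, offsets):
    """rotations: (J, 3, 3) local rotation matrices; root_translation: (3,);
    parents: parent index per joint (-1 for the root); offsets: (J, 3) rest-pose bone offsets."""
    num_joints = len(parents)
    global_rot = [None] * num_joints
    positions = np.zeros((num_joints, 3))
    for j in range(num_joints):
        if parents[j] < 0:                      # root joint
            global_rot[j] = rotations[j]
            positions[j] = root_translation
        else:
            p = parents[j]
            global_rot[j] = global_rot[p] @ rotations[j]
            positions[j] = positions[p] + global_rot[p] @ offsets[j]
    return positions                            # (J, 3) three-dimensional joint point positions
```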
In one possible implementation, the motion recognition model may further include a discriminator network. For example, as shown in fig. 3, the model may include a first discriminator network 304, which discriminates whether an action is true or false based on the three-dimensional joint point position information and may be denoted D_J; the model may further include a second discriminator network 305, which discriminates whether an action is true or false based on the motion rotation parameters and may be denoted D_Q. However, the discriminator networks are mainly used in the process of training the motion recognition model; when the motion recognition model is used for motion recognition, they are not required, so the discriminator networks will be described in detail together with the training process of the motion recognition model.
After the three-dimensional node position information is obtained, the mobile terminal can generate the animation of the target object according to the three-dimensional node position information. The animation may be an animation of driving a 3D model corresponding to the target object, or may be an animation of adding a human special effect to the target object. For example, when the target object displayed by the mobile terminal is shown in fig. 4 (a), after the motion recognition is performed based on the embodiment of the present application, the schematic diagram of the human special effect, i.e. adding the armor based on the three-dimensional joint point position information obtained by the motion recognition, may be shown in fig. 4 (b).
According to the technical scheme, when the target object is shot to obtain the image frame to be identified, the two-dimensional joint point position information of the target object in the image frame to be identified is firstly obtained, then the two-dimensional joint point position information is used as the input of the motion identification model, and the motion identification is realized by predicting the three-dimensional joint point position information according to the two-dimensional joint point position information through the motion identification model. Because the input of the motion recognition model is the two-dimensional node position information instead of the image or the video, the motion recognition model is not required to extract the two-dimensional node position information from a large amount of information contained in the image or the video through complex processing, so that the calculation amount and calculation time of the motion recognition model are greatly reduced, and the network structure complexity of the motion recognition model is also greatly reduced. In predicting the three-dimensional joint point position information through the motion recognition model, feature generation can be performed by using a feature generation module of the motion recognition model according to the two-dimensional joint point position information to obtain a target feature vector, and then prediction is performed by using a prediction module of the motion recognition model according to the target feature vector to obtain the motion rotation parameter and the motion displacement parameter of each joint point of the target object, so that according to the motion rotation parameter and the motion displacement parameter, kinematic analysis (such as forward kinematic analysis) is performed by using a kinematic analysis module of the motion recognition model to obtain the three-dimensional joint point position information of the corresponding joint point. Therefore, the two-dimensional joint point position information required by the motion recognition is directly used as the input of the motion recognition model, so that the calculated amount and the calculated time of the motion recognition model are greatly reduced, the motion recognition efficiency is improved, and the real-time motion recognition is convenient to realize. Meanwhile, due to the reduction of the calculated amount, the network structure complexity of the motion recognition model is greatly reduced, the motion recognition is easy to realize based on a lightweight network, and the method is more suitable for realizing real-time motion recognition of the mobile terminal.
The embodiment of the application performs quantitative evaluation and qualitative evaluation on the provided action recognition method. Wherein, the quantitative evaluation results obtained can be seen in Table 1:
TABLE 1
In table 1, MPJPE (Mean Per Joint Position Error, the average joint point coordinate error), PVE (the average error over all vertex coordinates of the 3D model) and PA-MPJPE (Procrustes Aligned Mean Per Joint Position Error) are the evaluation indexes used for quantitative evaluation; PA-MPJPE is calculated after the prediction output is aligned with the corresponding true values through a rigid transformation (such as translation, rotation and scaling). As can be seen from table 1, the values of each evaluation index obtained by the method provided by the embodiments of the present application are smaller than those of the related art. The calculation amount of the motion recognition method provided by the embodiments of the present application is thus greatly reduced compared with the motion recognition method of the related art, while the performance is clearly improved on both the synthetic data set MOCAP and the real data set 3DPW used for evaluation.
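For reference, MPJPE as referenced in Table 1 is commonly computed as the mean Euclidean distance between predicted and true three-dimensional joint point positions, for example:

```python
import numpy as np

def mpjpe(pred, gt):
    """pred, gt: (num_frames, num_joints, 3) arrays in the same units (e.g. millimetres)."""
    return np.linalg.norm(pred - gt, axis=-1).mean()
```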
The qualitative evaluation results can be shown in fig. 5 and fig. 6, wherein fig. 5 is a schematic diagram of the human body motion estimation result, and the human body motion in the video can be accurately estimated by testing the actually photographed video, as can be seen from fig. 5, the human body motion in the video can be accurately estimated by the method provided by the embodiment of the application, the human body motion estimation result is shown in fig. 5 (a) and fig. 5 (b); in addition, the video and the 3D character can also be selected randomly to test the driving effect, and as can be seen from fig. 6, the motion recognition method provided by the embodiment of the application can be applied to the human special effects such as 3D character driving, etc., where (a) in fig. 6 is a video shot on a human body, and (b) in fig. 6 is a human special effect driven by the 3D character.
Next, a training method of the motion recognition model will be described in detail. To obtain the motion recognition model through training, a corresponding initial network model is first constructed, where the initial network model includes a feature generation initial module, a prediction initial module and a kinematic analysis initial module. Referring to fig. 7, the training method includes the following steps:
s701, acquiring the history position information of the two-dimensional joint points of the history object in the history image frame.
In the embodiment of the application, the two-dimensional joint point historical position information of the historical object in the historical image frame can be used as a training sample to train to obtain the action recognition model. In general, the two-dimensional joint point history position information serving as a training sample is two-dimensional joint point history position information of each object of a plurality of history image frames, and the plurality of two-dimensional joint point history position information may be input into the initial network model in batch.
S702, according to the two-dimensional joint point historical position information, utilizing the feature generation initial module to perform feature generation to obtain a target historical feature vector.
S703, according to the target historical feature vector, predicting by using the prediction initial module to obtain the action historical rotation parameter and the action historical displacement parameter of each node of the historical object.
S704, according to the motion history rotation parameters and the motion history displacement parameters, performing kinematic analysis by using the kinematic analysis initial module to obtain three-dimensional joint point history position information of the corresponding joint point.
It should be noted that, S701-S704 are similar to the implementation manner of S201-S204 in the use process of the action recognition model, and will not be described herein.
And S705, constructing a target loss function according to the three-dimensional node historical position information.
And S706, optimizing and adjusting model parameters of the initial network model according to the target loss function to obtain the action recognition model.
After the three-dimensional joint point historical position information is obtained through prediction of the kinematic analysis initial module, in order to train the initial network model, a target loss function can be constructed based on the three-dimensional joint point historical position information, and then the initial network model is optimized and adjusted according to the target loss function until the target loss function meets preset conditions, training is stopped, and an action recognition model is obtained.
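An illustrative training loop for S701-S706 is sketched below, assuming PyTorch; the model interface, the helper compute_target_loss (standing in for the target loss function built in S705) and the optimizer choice are assumptions of the sketch.

```python
import torch

def train(model, data_loader, compute_target_loss, epochs=10, lr=1e-3):
    """model maps a batch of 2D joint point history positions to (Q, T, J3D);
    compute_target_loss stands in for the target loss function constructed in S705."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x2d_hist, q_gt, t_gt, j3d_gt in data_loader:      # S701: history samples and true values
            q, t, j3d = model(x2d_hist)                        # S702-S704: features, prediction, FK
            loss = compute_target_loss(q, t, j3d, q_gt, t_gt, j3d_gt)  # S705: target loss
            optimizer.zero_grad()
            loss.backward()                                    # S706: adjust model parameters
            optimizer.step()
```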
In the embodiment of the present application, the training of the motion recognition model may be performed on the terminal or may be performed on the server, which is not limited in the embodiment of the present application. The server can be an independent server, an integrated server, a cloud server and the like.
In one possible implementation, the motion recognition model may further include a first discriminator network, so the initial network model may also include the first discriminator network for discriminating whether the predicted three-dimensional joint point historical position information is true or false. Further, the motion recognition model may also include a second discriminator network, so the initial network model further includes the second discriminator network for determining whether the motion history rotation parameters are true or false.
The embodiment of the application judges whether the motion is true or false through the discriminator networks, such as the first discriminator network and the second discriminator network, thereby further enhancing the fluency of the motion.
For the motion recognition model provided by the embodiment of the application as shown in fig. 3, a specific network structure is given in table 2:
TABLE 2
Wherein B is the number of samples of the video frame sequence (the batch size), T is the number of frames, FC Block is a fully connected module, Fusion Block is a fusion module, BN (Batch Normalization) is batch normalization, ReLU is the activation function, and GRU is a gated recurrent unit.
Fig. 8 shows network structures of Fusion Block and FC Block, where 801 is a network structure of Fusion Block, and when the Fusion Block is used as the second feature Fusion module, an input of the Fusion Block may be, for example, a second feature vector corresponding to a previous image frame and a second feature vector of an image frame to be identified, and an output may be a Fusion feature vector; 802 is the network structure of the FC Block.
It should be noted that the specific design of the network structure defined in table 2 is only an example, and the network structure may be increased or decreased according to the computing resources, for example, the designs of FC Block and Fusion Block may appropriately increase the number of full connection layers, may increase or decrease the number of output channels, and so on.
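One possible reading of the FC Block and Fusion Block building blocks listed in table 2 is sketched below (a fully connected layer with batch normalization and ReLU; fusion by concatenating two inputs). Since fig. 8 is not reproduced here, the exact layer counts and the placement of the GRU are assumptions.

```python
import torch
import torch.nn as nn

class FCBlock(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        # Assumed composition: fully connected layer + BN + ReLU.
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim),
                                 nn.BatchNorm1d(out_dim),
                                 nn.ReLU())

    def forward(self, x):
        return self.net(x)

class FusionBlock(nn.Module):
    def __init__(self, feat_dim):
        super().__init__()
        self.fc = FCBlock(feat_dim * 2, feat_dim)

    def forward(self, feat_prev, feat_curr):
        # e.g. the second feature vector of the previous frame and that of the frame
        # to be identified, fused into one feature vector.
        return self.fc(torch.cat([feat_prev, feat_curr], dim=1))
```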
The objective loss function is a key for training the motion recognition model, and whether the objective loss function comprehensively constrains the motion recognition result (for example, three-dimensional node position information) will affect the accuracy of motion recognition performed by the motion recognition model obtained by training. Therefore, the embodiment of the application adopts various loss functions to restrict the accuracy, stability and rationality of the generated actions, so that more vivid and accurate action recognition results can be obtained by using the action recognition model.
Based on this, S705 may be implemented by constructing an action recognition loss function, an action change loss function and an adversarial loss function according to the three-dimensional joint point historical position information, where the action recognition loss function is used to measure the accuracy of action recognition, the action change loss function is used to measure the stability of action change between different image frames, and the adversarial loss function is used to measure the rationality of action recognition; and then constructing the target loss function according to at least one of the action recognition loss function, the action change loss function and the adversarial loss function.
The target loss function can be expressed as:

L_all = L_action + L_quat-velo + L_gan

where L_all is the target loss function, L_action is the action recognition loss function, L_quat-velo is the action change loss function, and L_gan is the adversarial loss function.
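As a direct transcription of the formula above (with each term computed as sketched after its own formula below):

```python
def target_loss(l_action, l_quat_velo, l_gan):
    # L_all = L_action + L_quat-velo + L_gan
    return l_action + l_quat_velo + l_gan
```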
In one possible implementation, based on the discussion of the embodiment corresponding to fig. 2, the motion recognition model may further include a first discriminator network, so the initial network model may also include the first discriminator network. In this case, the adversarial loss function may be constructed according to the three-dimensional joint point historical position information by discriminating the three-dimensional joint point historical position information with the first discriminator network to obtain a first discrimination result, and then constructing the adversarial loss function according to the three-dimensional joint point historical position information and the first discrimination result.
In order to further enhance the smoothness of the motion, the motion recognition model may further include a second discriminator network, so the initial network model also includes the second discriminator network. On this basis, the embodiment of the application may further discriminate the motion history rotation parameters with the second discriminator network to obtain a second discrimination result, and the adversarial loss function may then be constructed according to the three-dimensional joint point historical position information, the first discrimination result, and the second discrimination result.
For the rationality of motion recognition, a generative adversarial approach is adopted to judge whether the motion is true or false, and the adversarial loss function can be expressed as:

L_gan = w_gan (L_gan-Q + L_gan-J)

where L_gan-Q is the loss function of the second discriminator network and L_gan-J is the loss function of the first discriminator network, both built on an L2 loss over the discrimination results D_Q(·) of the second discriminator network and D_J(·) of the first discriminator network; Q_pred denotes the predicted motion history rotation parameters, Q_gt denotes the motion rotation parameter true values, J_pred denotes the predicted three-dimensional joint point historical position information, J_gt denotes the position information true values, L_gan denotes the adversarial loss function, and w_gan is a weight that can be set according to actual requirements.
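A hedged sketch of the generator-side adversarial term is given below, using LSGAN-style L2 losses on the outputs of the two discriminators; the exact per-discriminator loss form is not fully recoverable from the text, so this formulation is an assumption.

```python
def gan_loss_for_generator(d_q, d_j, q_pred, j_pred, w_gan=0.1):
    # d_q, d_j: the second and first discriminator networks D_Q and D_J.
    # Encourage the discriminators to score the predictions as real (target value 1).
    l_gan_q = ((d_q(q_pred) - 1.0) ** 2).mean()
    l_gan_j = ((d_j(j_pred) - 1.0) ** 2).mean()
    return w_gan * (l_gan_q + l_gan_j)
```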
In one possible implementation manner, the manner of constructing the motion recognition loss function according to the three-dimensional joint point historical position information may be to determine a first loss function according to the three-dimensional joint point historical position information and the position information true value; determining a second loss function according to the action history rotation parameter and the action rotation parameter true value corresponding to the three-dimensional joint point history position information; determining a third loss function according to the action history displacement parameter and the displacement parameter true value corresponding to the three-dimensional joint point history position information; and then carrying out weighted summation on the first loss function, the second loss function and the third loss function to obtain the action recognition loss function.
For the accuracy of motion recognition, the embodiment of the application constrains the motion history rotation parameters, the motion history displacement parameters and the three-dimensional joint point historical position information to be close to their corresponding true values; that is, the motion recognition loss function can be calculated by the following formula:

L_action = w_quat L_quat + w_trans L_trans + w_joint L_joint

where L_joint is the first loss function, L_quat is the second loss function, L_trans is the third loss function, and ‖·‖_1 denotes the L1 loss used to compare the predictions with their true values. The model usually outputs a 6-dimensional representation O of the motion history rotation parameter, which can be converted into a quaternion representation Q and a rotation matrix representation R, so that Q, R and O are three representation modes of the motion history rotation parameter. Q_pred, R_pred and O_pred respectively denote the predicted motion history rotation parameters under the different representation modes, and Q_gt, R_gt and O_gt respectively denote the motion rotation parameter true values under the different representation modes; T_pred denotes the predicted motion history displacement parameters and T_gt denotes the displacement parameter true values; J_pred denotes the predicted three-dimensional joint point historical position information and J_gt denotes the position information true values. w_quat, w_trans and w_joint are the weights of the respective loss functions and can be set according to actual requirements.
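A sketch of L_action is given below, using the weights listed later in this section and an L1 loss for each term; only one rotation representation is shown for brevity.

```python
import torch.nn.functional as F

def action_loss(q_pred, q_gt, t_pred, t_gt, j_pred, j_gt,
                w_quat=10.0, w_trans=15.0, w_joint=20.0):
    l_quat = F.l1_loss(q_pred, q_gt)    # rotation parameters
    l_trans = F.l1_loss(t_pred, t_gt)   # displacement parameters
    l_joint = F.l1_loss(j_pred, j_gt)   # three-dimensional joint point positions
    return w_quat * l_quat + w_trans * l_trans + w_joint * l_joint
```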
In one possible implementation manner, the manner of constructing the motion change loss function according to the three-dimensional joint point historical position information may be to determine a fourth loss function according to the difference value between the three-dimensional joint point historical position information corresponding to any two historical image frames and the first difference value truth value; determining a fifth loss function according to the difference value between the action history rotation parameters corresponding to the two history image frames and the second difference value true value; determining a sixth loss function according to the difference value between the action history displacement parameters corresponding to the two history image frames and the third difference value true value; and then carrying out weighted summation on the fourth loss function, the fifth loss function and the sixth loss function to obtain an action change loss function.
Aiming at motion stability, the embodiment of the application respectively restricts the change of motion history rotation parameters, the change of motion history displacement parameters and the change of three-dimensional joint point history position information to be similar to the corresponding change true values, namely, the motion change loss function can be calculated by the following formula:
L velo =w velo (L quat-velo +L trans-velo +L joint-velo )
wherein L_joint-velo is the fourth loss function, L_quat-velo is the fifth loss function, L_trans-velo is the sixth loss function, and w_velo is the weight of these loss functions and can be set according to actual requirements; each term is again an L1 loss. The fifth loss function is determined from the differences between the predicted motion history rotation parameters corresponding to any two history image frames, in each of the representation modes, and the second difference true values in the corresponding representation modes; the sixth loss function is determined from the difference between the predicted motion history displacement parameters corresponding to any two history image frames and the third difference true value; and the fourth loss function is determined from the difference between the predicted three-dimensional joint point historical position information corresponding to any two history image frames and the first difference true value.
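A corresponding sketch of the motion change loss is shown below, under the assumption that predictions and true values are stacked along a frame dimension so that the differences between adjacent frames can be taken directly; the dictionary keys and shapes follow the previous sketch and are likewise assumptions.

```python
import torch.nn.functional as F

def velocity_loss(pred_seq, gt_seq, w_velo=5.0):
    """L_velo = w_velo * (L_quat-velo + L_trans-velo + L_joint-velo), all terms L1."""
    def frame_diff(x):
        # Difference between temporally adjacent frames; assumes shape (batch, frames, ...).
        return x[:, 1:] - x[:, :-1]
    # Fifth loss function: change of rotation parameters in all three representations.
    l_quat_velo = sum(F.l1_loss(frame_diff(pred_seq[k]), frame_diff(gt_seq[k]))
                      for k in ("Q", "R", "O"))
    # Sixth loss function: change of displacement parameters.
    l_trans_velo = F.l1_loss(frame_diff(pred_seq["T"]), frame_diff(gt_seq["T"]))
    # Fourth loss function: change of 3D joint positions.
    l_joint_velo = F.l1_loss(frame_diff(pred_seq["J"]), frame_diff(gt_seq["J"]))
    return w_velo * (l_quat_velo + l_trans_velo + l_joint_velo)
```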
In one possible implementation, each of the weights may be set to w_quat = 10.0, w_joint = 20.0, w_trans = 15.0, w_velo = 5.0 and w_gan = 0.1. Each weight can be adjusted according to actual requirements, and the value of each weight is not limited in the embodiment of the application.
It should be noted that, based on the implementation manner provided in the above aspects, further combinations may be further performed to provide further implementation manners.
Based on the action recognition method provided in the corresponding embodiment of fig. 2, the embodiment of the application further provides an action recognition device 900. Referring to fig. 9, the action recognition apparatus 900 includes an acquisition unit 901, a generation unit 902, a prediction unit 903, and an analysis unit 904:
the acquiring unit 901 is configured to acquire two-dimensional joint point position information of a target object in an image frame to be identified when the target object is photographed to obtain the image frame to be identified;
the generating unit 902 is configured to perform feature generation by using a feature generation module of the motion recognition model according to the two-dimensional joint point position information, so as to obtain a target feature vector;
the prediction unit 903 is configured to predict, according to the target feature vector, by using a prediction module of the motion recognition model, to obtain a motion rotation parameter and a motion displacement parameter of each joint point of the target object;
the analysis unit 904 is configured to perform kinematic analysis by using a kinematic analysis module of the motion recognition model according to the motion rotation parameter and the motion displacement parameter, so as to obtain three-dimensional joint point position information of the corresponding joint point.
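To make the division of labour between these units more concrete, the following PyTorch-style sketch maps them onto a single per-frame forward pass. The layer sizes, the 6D rotation parameterization, and the placeholder forward_kinematics function are illustrative assumptions; the application does not fix a particular network architecture.

```python
import torch
import torch.nn as nn

def forward_kinematics(rot6d, trans):
    # Placeholder for the kinematic analysis module: a real implementation would
    # convert the 6D rotations to rotation matrices, compose them along the skeleton
    # hierarchy with fixed bone offsets, and add the root displacement.
    batch, num_joints, _ = rot6d.shape
    return trans[:, None, :].expand(batch, num_joints, 3)

class ActionRecognitionModel(nn.Module):
    def __init__(self, num_joints=24, feat_dim=256):
        super().__init__()
        # Feature generation module: maps 2D joint coordinates to a target feature vector.
        self.feature_gen = nn.Sequential(
            nn.Linear(num_joints * 2, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim), nn.ReLU())
        # Prediction module: outputs per-joint 6D rotation parameters plus a 3D displacement.
        self.predictor = nn.Linear(feat_dim, num_joints * 6 + 3)
        self.num_joints = num_joints

    def forward(self, joints_2d):                 # joints_2d: (batch, num_joints * 2)
        feat = self.feature_gen(joints_2d)        # target feature vector
        out = self.predictor(feat)
        rot6d = out[:, :-3].view(-1, self.num_joints, 6)   # motion rotation parameters
        trans = out[:, -3:]                                 # motion displacement parameters
        joints_3d = forward_kinematics(rot6d, trans)        # kinematic analysis module
        return joints_3d, rot6d, trans
```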
In a possible implementation manner, the feature generating module includes a first feature extracting module and a first feature fusion module, and the generating unit 902 is specifically configured to:
acquiring a feature extraction result of the two-dimensional joint point position information;
generating a first feature vector corresponding to the image frame to be identified by utilizing the first feature extraction module according to the feature extraction result of the two-dimensional joint point position information;
and carrying out feature fusion on a first feature vector corresponding to the image frame to be identified and a first feature vector corresponding to a previous image frame by the first feature fusion module to obtain the target feature vector, wherein the previous image frame and the image frame to be identified are positioned in the same image frame sequence, and the previous image frame is positioned before the image frame to be identified and is adjacent to the image frame to be identified in the image frame sequence.
In a possible implementation manner, the feature generating module further includes a second feature extracting module and a second feature fusion module, where the second feature extracting module and the second feature fusion module are located before the first feature extracting module in the action recognition model, and the generating unit 902 is specifically configured to:
Performing feature extraction on the two-dimensional joint point position information through the second feature extraction module to obtain a second feature vector corresponding to the image frame to be identified;
determining a second feature vector corresponding to the image frame to be identified as the feature extraction result;
fusing a second feature vector corresponding to the image frame to be identified with a second feature vector corresponding to the previous image frame through the second feature fusion module to obtain a fused feature vector;
and encoding the fusion feature vector through the first feature extraction module to obtain the first feature vector.
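The two-level feature generation described above might be sketched as follows. Treating "fusion" as concatenation followed by a linear layer, and the particular layer sizes, are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class FeatureGeneration(nn.Module):
    def __init__(self, num_joints=24, dim=256):
        super().__init__()
        self.second_extract = nn.Sequential(nn.Linear(num_joints * 2, dim), nn.ReLU())
        self.second_fuse = nn.Linear(dim * 2, dim)   # fuses current and previous second feature vectors
        self.first_extract = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())  # encodes the fused vector
        self.first_fuse = nn.Linear(dim * 2, dim)    # fuses current and previous first feature vectors

    def forward(self, joints_2d, prev_second, prev_first):
        second = self.second_extract(joints_2d)                             # second feature vector
        fused = self.second_fuse(torch.cat([second, prev_second], dim=-1))  # fused feature vector
        first = self.first_extract(fused)                                   # first feature vector
        target = self.first_fuse(torch.cat([first, prev_first], dim=-1))    # target feature vector
        return target, second, first   # second/first are carried forward to the next frame
```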
In a possible implementation manner, the generating unit 902 is further configured to:
and generating the animation of the target object according to the three-dimensional joint point position information.
In a possible implementation manner, the initial network model corresponding to the action recognition model includes a feature generation initial module, a prediction initial module, and a kinematic analysis initial module, and the apparatus further includes a training unit, where the training unit is configured to:
acquiring the history position information of a two-dimensional joint point of a history object in a history image frame;
according to the two-dimensional joint point historical position information, utilizing the feature generation initial module to perform feature generation to obtain a target historical feature vector;
According to the target historical feature vector, predicting by utilizing the prediction initial module to obtain an action historical rotation parameter and an action historical displacement parameter of each joint point of the historical object;
according to the motion history rotation parameters and the motion history displacement parameters, performing kinematic analysis by using the kinematic analysis initial module to obtain three-dimensional joint point history position information of the corresponding joint points;
constructing a target loss function according to the three-dimensional joint point historical position information;
and optimizing and adjusting model parameters of the initial network model according to the target loss function to obtain the action recognition model.
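A schematic training loop for the initial network model could look like the sketch below. It assumes a data loader yielding two-dimensional joint point historical position information together with ground-truth rotations, displacements and 3D joints, and it uses a reduced form of the target loss; the optimizer choice and hyperparameters are assumptions, not requirements of the application.

```python
import torch
import torch.nn.functional as F

def train(model, loader, epochs=10, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for joints_2d, rot_gt, trans_gt, joints_gt in loader:
            # Feature generation -> prediction -> kinematic analysis on historical frames.
            joints_3d, rot6d, trans = model(joints_2d)
            # Reduced target loss: only the 6D rotation, displacement and joint terms;
            # the quaternion/matrix conversions, the motion change term and the
            # adversarial term are omitted here for brevity.
            loss = (10.0 * F.l1_loss(rot6d, rot_gt)
                    + 15.0 * F.l1_loss(trans, trans_gt)
                    + 20.0 * F.l1_loss(joints_3d, joints_gt))
            opt.zero_grad()
            loss.backward()
            opt.step()
```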
In a possible implementation manner, the training unit is specifically configured to:
respectively constructing an action recognition loss function, an action change loss function and an adversarial loss function according to the three-dimensional joint point historical position information, wherein the action recognition loss function is used for measuring the accuracy of action recognition, the action change loss function is used for measuring the stability of action change between different image frames, and the adversarial loss function is used for measuring the rationality of action recognition;
and constructing the target loss function according to at least one of the action recognition loss function, the action change loss function and the adversarial loss function.
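Assuming the recognition, motion change and adversarial loss helpers sketched elsewhere in this description, the three terms could be assembled into the target loss as follows; using all three terms at once, rather than a subset, is one possible choice.

```python
def target_loss(pred, gt, pred_seq, gt_seq, d_joint, d_rot):
    # Full target loss when all three terms are used; any subset is also valid
    # ("at least one of" the three component loss functions).
    return (recognition_loss(pred, gt)
            + velocity_loss(pred_seq, gt_seq)
            + adversarial_loss(d_joint, d_rot, pred["J"], pred["O"]))
```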
In a possible implementation manner, the initial network model further comprises a first discriminator network, and the training unit is specifically configured to:
discriminating the three-dimensional joint point historical position information through the first discriminator network to obtain a first discrimination result;
and constructing the adversarial loss function according to the three-dimensional joint point historical position information and the first discrimination result.
In a possible implementation, the initial network model further includes a second discriminator network, and the training unit is further configured to:
discriminating the action history rotation parameters through the second discriminator network to obtain a second discrimination result;
the training unit is specifically configured to:
and constructing the adversarial loss function according to the three-dimensional joint point historical position information, the first discrimination result and the second discrimination result.
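A minimal sketch of this adversarial term with two discriminators is given below: the first discriminator scores the predicted three-dimensional joint positions (first discrimination result) and the second scores the predicted rotation parameters (second discrimination result). The MLP discriminator architecture and the binary cross-entropy formulation are assumptions for illustration; the application does not specify the discriminator design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_discriminator(in_dim, hidden=256):
    # Simple MLP discriminator returning a single real/fake logit.
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.LeakyReLU(0.2),
                         nn.Linear(hidden, 1))

def adversarial_loss(d_joint, d_rot, joints_3d, rot6d, w_gan=0.1):
    # Generator-side adversarial term: both discriminators should classify the
    # predicted quantities as "real" (i.e., plausible human motion).
    logit_j = d_joint(joints_3d.flatten(1))   # first discrimination result
    logit_r = d_rot(rot6d.flatten(1))         # second discrimination result
    real = torch.ones_like(logit_j)
    return w_gan * (F.binary_cross_entropy_with_logits(logit_j, real)
                    + F.binary_cross_entropy_with_logits(logit_r, real))
```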
In a possible implementation manner, the training unit is specifically configured to:
determining a first loss function according to the three-dimensional joint point historical position information and the position information true value;
determining a second loss function according to the action history rotation parameter and the action rotation parameter true value corresponding to the three-dimensional joint point history position information;
Determining a third loss function according to the action history displacement parameter and the displacement parameter true value corresponding to the three-dimensional joint point history position information;
and carrying out weighted summation on the first loss function, the second loss function and the third loss function to obtain the action recognition loss function.
In a possible implementation manner, the training unit is specifically configured to:
determining a fourth loss function according to the difference value between the three-dimensional joint point historical position information corresponding to the two adjacent historical image frames and the first difference value true value;
determining a fifth loss function according to the difference value between the action history rotation parameters corresponding to the two adjacent history image frames and the second difference value true value;
determining a sixth loss function according to the difference value between the action history displacement parameters corresponding to the two adjacent history image frames and the third difference value true value;
and carrying out weighted summation on the fourth loss function, the fifth loss function and the sixth loss function to obtain the motion change loss function.
The embodiment of the application further provides an electronic device for motion recognition. The electronic device may be a terminal; the following description takes the terminal being a smart phone as an example:
Fig. 10 is a block diagram illustrating a part of a structure of a smart phone according to an embodiment of the present application. Referring to fig. 10, the smart phone includes: Radio Frequency (RF) circuit 1010, memory 1020, input unit 1030, display unit 1040, sensor 1050, audio circuit 1060, wireless fidelity (WiFi) module 1070, processor 1080, and power source 1090. The input unit 1030 may include a touch panel 1031 and other input devices 1032, the display unit 1040 may include a display panel 1041, and the audio circuit 1060 may include a speaker 1061 and a microphone 1062. It will be appreciated that the smartphone structure shown in fig. 10 does not constitute a limitation on the smartphone, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
The memory 1020 may be used to store software programs and modules that the processor 1080 executes to perform various functional applications and data processing of the smartphone. The memory 1020 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, phonebooks, etc.) created according to the use of the smart phone. In addition, memory 1020 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
Processor 1080 is the control center of the smartphone, connects the various parts of the entire smartphone with various interfaces and lines, performs various functions of the smartphone and processes the data by running or executing software programs and/or modules stored in memory 1020, and invoking data stored in memory 1020. Optionally, processor 1080 may include one or more processing units; preferably, processor 1080 may integrate an application processor primarily handling operating systems, user interfaces, applications, etc., with a modem processor primarily handling wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 1080.
In this embodiment, processor 1080 in the smartphone may perform the following steps:
when a target object is shot to obtain an image frame to be identified, acquiring two-dimensional joint point position information of the target object in the image frame to be identified;
according to the two-dimensional joint point position information, performing feature generation by using a feature generation module of an action recognition model to obtain a target feature vector;
according to the target feature vector, predicting by using a prediction module of the motion recognition model to obtain motion rotation parameters and motion displacement parameters of each joint point of the target object;
And according to the motion rotation parameters and the motion displacement parameters, performing kinematic analysis by using a kinematic analysis module of the motion recognition model to obtain the three-dimensional joint point position information of the corresponding joint point.
Referring to fig. 11, fig. 11 is a schematic diagram of a server 1100 according to an embodiment of the present application. The server 1100 may vary considerably in configuration or performance, and may include one or more central processing units (Central Processing Units, abbreviated as CPUs) 1122 (e.g., one or more processors), a memory 1132, and one or more storage media 1130 (e.g., one or more mass storage devices) storing application programs 1142 or data 1144. The memory 1132 and the storage medium 1130 may be transitory or persistent storage. The program stored on the storage medium 1130 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Still further, the central processor 1122 may be configured to communicate with the storage medium 1130, and execute, on the server 1100, the series of instruction operations in the storage medium 1130.
The server 1100 may also include one or more power supplies 1126, one or more wired or wireless network interfaces 1150, one or more input/output interfaces 1158, and/or one or more operating systems 1141, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
In the present embodiment, the steps required to be performed by the central processor 1122 in the server 1100 may be implemented based on the server structure shown in fig. 11.
According to an aspect of the present application, there is provided a computer-readable storage medium for storing a program code for executing the action recognition method according to the foregoing embodiments.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the methods provided in the various alternative implementations of the above embodiments.
The description of each process or structure corresponding to the drawings has its own emphasis; for parts of a certain process or structure that are not described in detail, reference may be made to the related descriptions of other processes or structures.
The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (15)

1. A method of motion recognition, the method comprising:
when a target object is shot to obtain an image frame to be identified, acquiring two-dimensional joint point position information of the target object in the image frame to be identified;
according to the two-dimensional joint point position information, performing feature generation by using a feature generation module of an action recognition model to obtain a target feature vector;
according to the target feature vector, predicting by using a prediction module of the motion recognition model to obtain motion rotation parameters and motion displacement parameters of each joint point of the target object;
and according to the motion rotation parameters and the motion displacement parameters, performing kinematic analysis by using a kinematic analysis module of the motion recognition model to obtain the three-dimensional joint point position information of the corresponding joint point.
2. The method according to claim 1, wherein the feature generation module includes a first feature extraction module and a first feature fusion module, and the performing feature generation by using the feature generation module of the motion recognition model according to the two-dimensional joint point position information to obtain a target feature vector includes:
acquiring a feature extraction result of the two-dimensional joint point position information;
generating a first feature vector corresponding to the image frame to be identified by utilizing the first feature extraction module according to the feature extraction result of the two-dimensional joint point position information;
and carrying out feature fusion on a first feature vector corresponding to the image frame to be identified and a first feature vector corresponding to a previous image frame by the first feature fusion module to obtain the target feature vector, wherein the previous image frame and the image frame to be identified are positioned in the same image frame sequence, and the previous image frame is positioned before the image frame to be identified and is adjacent to the image frame to be identified in the image frame sequence.
3. The method of claim 2, wherein the feature generation module further includes a second feature extraction module and a second feature fusion module, the second feature extraction module and the second feature fusion module being located before the first feature extraction module in the motion recognition model, and the obtaining the feature extraction result of the two-dimensional joint point position information includes:
Performing feature extraction on the two-dimensional joint point position information through the second feature extraction module to obtain a second feature vector corresponding to the image frame to be identified;
determining a second feature vector corresponding to the image frame to be identified as the feature extraction result;
the generating, by using the first feature extraction module, a first feature vector corresponding to the image frame to be identified according to the feature extraction result of the two-dimensional joint point position information includes:
fusing a second feature vector corresponding to the image frame to be identified with a second feature vector corresponding to the previous image frame through the second feature fusion module to obtain a fused feature vector;
and encoding the fusion feature vector through the first feature extraction module to obtain the first feature vector.
4. A method according to any one of claims 1-3, wherein the method further comprises:
and generating the animation of the target object according to the three-dimensional joint point position information.
5. A method according to any one of claims 1-3, wherein the initial network model to which the motion recognition model corresponds includes a feature generation initial module, a prediction initial module, and a kinematic analysis initial module, the method further comprising:
Acquiring the history position information of a two-dimensional joint point of a history object in a history image frame;
according to the two-dimensional joint point historical position information, utilizing the feature generation initial module to perform feature generation to obtain a target historical feature vector;
according to the target historical feature vector, predicting by utilizing the prediction initial module to obtain an action historical rotation parameter and an action historical displacement parameter of each joint point of the historical object;
according to the motion history rotation parameters and the motion history displacement parameters, performing kinematic analysis by using the kinematic analysis initial module to obtain three-dimensional joint point history position information of the corresponding joint points;
constructing a target loss function according to the three-dimensional joint point historical position information;
and optimizing and adjusting model parameters of the initial network model according to the target loss function to obtain the action recognition model.
6. The method of claim 5, wherein the constructing a target loss function according to the three-dimensional joint point historical position information comprises:
respectively constructing an action recognition loss function, an action change loss function and an adversarial loss function according to the three-dimensional joint point historical position information, wherein the action recognition loss function is used for measuring the accuracy of action recognition, the action change loss function is used for measuring the stability of action change between different image frames, and the adversarial loss function is used for measuring the rationality of action recognition;
and constructing the target loss function according to at least one of the action recognition loss function, the action change loss function and the adversarial loss function.
7. The method of claim 6, wherein the initial network model further comprises a first discriminator network, and the constructing the adversarial loss function according to the three-dimensional joint point historical position information comprises:
discriminating the three-dimensional joint point historical position information through the first discriminator network to obtain a first discrimination result;
and constructing the adversarial loss function according to the three-dimensional joint point historical position information and the first discrimination result.
8. The method of claim 7, wherein the initial network model further comprises a second discriminator network, and the method further comprises:
discriminating the action history rotation parameters through the second discriminator network to obtain a second discrimination result;
the constructing the adversarial loss function according to the three-dimensional joint point historical position information and the first discrimination result comprises:
and constructing the adversarial loss function according to the three-dimensional joint point historical position information, the first discrimination result and the second discrimination result.
9. The method of claim 6, wherein constructing the motion recognition loss function from the three-dimensional joint point historical location information comprises:
determining a first loss function according to the three-dimensional joint point historical position information and the position information true value;
determining a second loss function according to the action history rotation parameter and the action rotation parameter true value corresponding to the three-dimensional joint point history position information;
determining a third loss function according to the action history displacement parameter and the displacement parameter true value corresponding to the three-dimensional joint point history position information;
and carrying out weighted summation on the first loss function, the second loss function and the third loss function to obtain the action recognition loss function.
10. The method of claim 6, wherein constructing the motion change loss function from the three-dimensional joint point historical location information comprises:
determining a fourth loss function according to the difference value between the three-dimensional joint point historical position information corresponding to the two adjacent historical image frames and the first difference value true value;
determining a fifth loss function according to the difference value between the action history rotation parameters corresponding to the two adjacent history image frames and the second difference value true value;
Determining a sixth loss function according to the difference value between the action history displacement parameters corresponding to the two adjacent history image frames and the third difference value true value;
and carrying out weighted summation on the fourth loss function, the fifth loss function and the sixth loss function to obtain the motion change loss function.
11. An action recognition device, characterized in that the device comprises an acquisition unit, a generation unit, a prediction unit and an analysis unit:
the acquisition unit is used for acquiring the two-dimensional joint point position information of the target object in the image frame to be identified when the target object is shot to obtain the image frame to be identified;
the generating unit is used for generating characteristics by utilizing a characteristic generating module of the action recognition model according to the two-dimensional joint point position information to obtain a target characteristic vector;
the prediction unit is used for predicting by using a prediction module of the motion recognition model according to the target feature vector to obtain motion rotation parameters and motion displacement parameters of each joint point of the target object;
the analysis unit is used for performing kinematic analysis by utilizing the kinematic analysis module of the motion recognition model according to the motion rotation parameter and the motion displacement parameter to obtain the three-dimensional joint point position information of the corresponding joint point.
12. The apparatus of claim 11, wherein the feature generation module includes a first feature extraction module and a first feature fusion module, and the generation unit is specifically configured to:
acquiring a feature extraction result of the two-dimensional joint point position information;
generating a first feature vector corresponding to the image frame to be identified by utilizing the first feature extraction module according to the feature extraction result of the two-dimensional joint point position information;
and carrying out feature fusion on a first feature vector corresponding to the image frame to be identified and a first feature vector corresponding to a previous image frame by the first feature fusion module to obtain the target feature vector, wherein the previous image frame and the image frame to be identified are positioned in the same image frame sequence, and the previous image frame is positioned before the image frame to be identified and is adjacent to the image frame to be identified in the image frame sequence.
13. An electronic device for motion recognition, the electronic device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of any of claims 1-10 according to instructions in the program code.
14. A computer readable storage medium for storing program code which, when executed by a processor, causes the processor to perform the method of any of claims 1-10.
15. A computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any of claims 1-10.
CN202210278157.5A 2022-03-21 2022-03-21 Action recognition method and related device Pending CN116824686A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210278157.5A CN116824686A (en) 2022-03-21 2022-03-21 Action recognition method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210278157.5A CN116824686A (en) 2022-03-21 2022-03-21 Action recognition method and related device

Publications (1)

Publication Number Publication Date
CN116824686A true CN116824686A (en) 2023-09-29

Family

ID=88128004

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210278157.5A Pending CN116824686A (en) 2022-03-21 2022-03-21 Action recognition method and related device

Country Status (1)

Country Link
CN (1) CN116824686A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117152327A (en) * 2023-10-31 2023-12-01 腾讯科技(深圳)有限公司 Parameter adjusting method and related device
CN117152327B (en) * 2023-10-31 2024-02-09 腾讯科技(深圳)有限公司 Parameter adjusting method and related device

Similar Documents

Publication Publication Date Title
CN110472531B (en) Video processing method, device, electronic equipment and storage medium
US12008810B2 (en) Video sequence selection method, computer device, and storage medium
CN111862213A (en) Positioning method and device, electronic equipment and computer readable storage medium
CN111340013B (en) Face recognition method and device, computer equipment and storage medium
WO2023098128A1 (en) Living body detection method and apparatus, and training method and apparatus for living body detection system
JP2022526513A (en) Video frame information labeling methods, appliances, equipment and computer programs
CN113822254B (en) Model training method and related device
CN111401192B (en) Model training method and related device based on artificial intelligence
CN114758362B (en) Clothing changing pedestrian re-identification method based on semantic perception attention and visual shielding
CN109886356A (en) A kind of target tracking method based on three branch's neural networks
CN113963304B (en) Cross-modal video time sequence action positioning method and system based on time sequence-space diagram
KR20220004009A (en) Key point detection method, apparatus, electronic device and storage medium
CN114972958B (en) Key point detection method, neural network training method, device and equipment
CN116958584B (en) Key point detection method, regression model training method and device and electronic equipment
CN111126515A (en) Model training method based on artificial intelligence and related device
CN107479715A (en) The method and apparatus that virtual reality interaction is realized using gesture control
CN116824686A (en) Action recognition method and related device
CN114283899A (en) Method for training molecule binding model, and molecule screening method and device
CN114093024A (en) Human body action recognition method, device, equipment and storage medium
CN117011449A (en) Reconstruction method and device of three-dimensional face model, storage medium and electronic equipment
CN114067394A (en) Face living body detection method and device, electronic equipment and storage medium
CN113569809A (en) Image processing method, device and computer readable storage medium
WO2022219384A1 (en) Method and apparatus for generating point cloud encoder,method and apparatus for generating point cloud data, electronic device and computer storage medium
EP4390728A1 (en) Model training method and apparatus, device, medium and program product
Zhao et al. Augmented Reality System Based on Real-Time Object 6D Pose Estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40092648

Country of ref document: HK