CN111222459B

CN111222459B - Visual angle independent video three-dimensional human body gesture recognition method

Info

Publication number: CN111222459B
Application number: CN202010010324.9A
Authority: CN
Inventors: 邱丰; 马利庄
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2020-01-06
Filing date: 2020-01-06
Publication date: 2023-05-12
Anticipated expiration: 2040-01-06
Also published as: CN111222459A

Abstract

The invention relates to a visual angle independent video three-dimensional human body gesture recognition method, which comprises the following steps: step 1: virtual data generation phase: synthesizing virtual camera parameters based on a human body posture data set containing three-dimensional labels at will, and then generating a two-dimensional/three-dimensional data tuple; step 2: model training stage: training a first module of a modularized neural network for obtaining a model with camera view generalization capability and a second module of the modularized neural network for obtaining a model capable of protecting inter-frame motion continuity by using the generated two-dimensional/three-dimensional data elements respectively; step 3: unconstrained video reasoning phase: and (3) predicting the video acquired by any unconstrained acquisition by utilizing the multi-module depth neural network trained in the step (2) to acquire a three-dimensional human body posture recognition result. Compared with the prior art, the method is based on the modularized neural network combined training method, and the generalization capability of three-dimensional human body gesture recognition is effectively improved.

Description

Visual angle independent video three-dimensional human body gesture recognition method

Technical Field

The invention relates to a three-dimensional human body posture recognition technology in the technical field of computer vision, in particular to a video three-dimensional human body posture recognition method which aims at the unknown visual angle data synthesis, modularized neural network training and preprocessing method of video tasks, namely visual angle independence.

Background

In recent decades, with the development of technology related to artificial intelligence and deep learning, the problem of human body posture recognition has also advanced. Video human body gesture recognition, in particular to three-dimensional human body gesture recognition aiming at video, has long been important content in the fields of computer vision and intelligent human-computer interaction; the portable mobile electronic device integrates multiple subjects such as digital image processing, man-machine interaction, computer graphics, computer vision and the like, and further integrates the popularization of portable mobile electronic devices such as security monitoring networks, intelligent robots, intelligent mobile phones, tablet personal computers and the like into the life of people.

The existing three-dimensional human body posture recognition algorithm can be divided into single-stage human body posture recognition and multi-stage human body posture recognition according to a predicted target: the method has the advantages that more hidden information in the pictures is utilized, the accuracy is higher in a laboratory environment, but the method is limited by the fact that the RGB picture data with three-dimensional labels are missing and cannot be separated from the laboratory acquisition environment, so that generalization capability is poor, and the method is difficult to convert into products with strong usability to generate commercial value; the method is characterized in that training of a two-dimensional human body posture estimation part can be performed by collecting a large number of internet unconstrained pictures in a manual labeling mode, and the problem of two-dimensional to three-dimensional prediction is proved to be a task which is relatively easy to complete through the paper of Martinez et al. In order to facilitate transformation, the invention generally adopts a multi-stage human body gesture recognition architecture, however, on the basis of the existing stronger two-dimensional human body key point detection model, the existing method is still easy to be overfitted to the camera parameters of the dataset because of the visual angle deficiency limited by the three-dimensional acquisition data.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provide a visual angle independent video three-dimensional human body gesture recognition method, and provides a virtual visual angle synthesis method, wherein a camera visual angle enhancement module is utilized to generate a random visual angle, and a camera projection relationship is matched to obtain a two-dimensional/three-dimensional data tuple which is used for multi-module training and generalization capability verification of a neural network; in addition, the input of three-dimensional prediction is normalized by utilizing a two-dimensional human body posture detection frame mode, so that the three-dimensional human body posture estimation method in an unconstrained environment is separated from the limit of internal parameters and external parameters of a camera, and the method has stronger generalization capability.

The aim of the invention can be achieved by the following technical scheme:

a visual angle independent video three-dimensional human body gesture recognition method, the recognition method comprising:

step 1: virtual data generation phase: synthesizing virtual camera parameters based on a human body posture data set containing three-dimensional labels at will, and then generating a two-dimensional/three-dimensional data tuple;

step 2: model training stage: training a first module of a modularized neural network for obtaining a model with camera view generalization capability and a second module of the modularized neural network for obtaining a model capable of protecting inter-frame motion continuity by using the generated two-dimensional/three-dimensional data elements respectively;

step 3: unconstrained video reasoning phase: and (3) predicting the video acquired by any unconstrained acquisition by utilizing the multi-module depth neural network trained in the step (2) to acquire a three-dimensional human body posture recognition result.

Further, the step 1 specifically includes: for any human body posture data set containing three-dimensional labels, a camera view angle enhancement module is adopted to synthesize virtual camera parameters, and a projection relation is utilized to generate a two-dimensional/three-dimensional data tuple.

Further, the camera parameters include external parameters for determining the position and orientation of the camera and internal parameters for determining the projected focal length of the camera.

Further, the first module in the step 2 performs training of view enhancement by using a single frame data tuple.

Further, the second module in the step 2 performs timing model training by using the continuous sequence of data tuples.

The first module and the second module only need to meet the condition that the first module is a single-frame two-dimensional to three-dimensional prediction module, the second module is a time sequence three-dimensional to three-dimensional correction module, and the first module and the second module are connected in series to complete two-dimensional to three-dimensional prediction.

Further, before inputting the neural network, the steps 2 and 3 further include a two-dimensional detection normalization preprocessing process for performing camera-independent two-dimensional detection on the two-dimensional detection result, where the corresponding description formula is as follows:

wherein K is _x,y Representing two-dimensional detection returnsUnifying the two-dimensional point coordinates after pretreatment,

representing the original two-dimensional point coordinates,

representing the center coordinates, w, of a two-dimensional detection frame ^d ,h ^d The width and the height of the two-dimensional detection frame are respectively.

Further, the video obtained by unconstrained acquisition in the step 3 specifically includes natural condition acquisition or video sequences subjected to scaling, clipping, speed changing and other color adjustment transformations.

Compared with the prior art, the invention has the following advantages:

(1) The visual angle independent video three-dimensional human body gesture recognition method provided by the invention has the advantages that in the virtual data generation stage, a reasonable random visual angle is assumed to replace a camera visual angle used in the acquisition of a data set in the original fixed visual angle training, so that the dependence of the internal parameters and the external parameters of the camera of the data set is overcome; in the model training stage, the modularized design can train two independent modules respectively, can also train the video stream data tuples completely in series, has definite task purposes of the two independent modules, can be verified independently, and has strong generalization capability.

(2) Because the invention utilizes the time sequence model training, the prediction prompt for a long time can be obtained on the basis of controlling the receptive field; in the unconstrained video reasoning stage, due to effective normalization frame design and selection, projection relation dependence is decoupled, and a good prediction effect can be obtained for a large number of videos which are acquired through the Internet and lack of camera parameters, extreme character proportion (often represented by too small scale proportion of characters in the original video), cutting and other processing.

(3) The invention provides a visual angle independent video three-dimensional human body gesture recognition method, which trains a plurality of module neural networks by using a large number of two-dimensional/three-dimensional data tuples enhanced by camera visual angles, and simultaneously carries out preprocessing of two-dimensional input by using a camera independent two-dimensional detection normalization method; the first module in the invention can adapt to unconstrained three-dimensional human body posture estimation tasks to obtain stronger camera generalization capability, and the second module can effectively utilize continuous characteristics on time sequences to ensure that prediction key points obtain better spatial stability and the whole prediction achieves ideal precision.

Drawings

FIG. 1 is a flow chart of a method structure of the present invention;

FIG. 2 is a schematic diagram of rotation (attitude angle) control at the time of camera parameter generation in the method of the present invention;

FIG. 3 is a diagram showing an example of the structure of a first modular neural network and a second modular neural network in the method of the present invention;

FIG. 4 is a schematic view of the projection relationship in the method of the present invention;

FIG. 5 is a schematic diagram of a two-dimensional detection frame normalization method in the method of the present invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

Fig. 1 is a flow chart of the overall structure of a visual angle independent video three-dimensional human body gesture recognition method of the present invention, which mainly includes the following three stages: the virtual data generation stage, the model training stage and the unconstrained video reasoning stage, and the method also comprises a camera-independent two-dimensional detection normalization method which can be used in both the training stage and the reasoning stage;

virtual data generation phase: for any disclosed three-dimensional human body posture academic data set or three-dimensional human body posture data set acquired by a motion capture system, corresponding two-dimensional projection and three-dimensional posture data tuples are generated through the camera view angle synthesis principle and projection transformation provided by the invention;

model training stage: training a neural network model, and training a first module provided by the invention by using a large-scale visual angle enhancement single-frame data tuple to obtain better visual angle resistance and decoupling with camera parameters; for the second module provided by the invention, the video stream data containing the three-dimensional labels is used for time sequence learning and prediction, so that the spatial continuity on the time sequence is obtained, and the gesture recognition accuracy is improved;

unconstrained video reasoning phase: the method comprises the steps of preprocessing a video data stream obtained under a general environment (in-the-wild) and a module based on any two-dimensional human body key point detection, and carrying out forward prediction on processed two-dimensional data sequentially through a first module and a second module by using a special normalization method of two-dimensional human body detection key point results, so as to obtain human body gestures based on three-dimensional key point representation.

The human body posture data representation method is mainly a key point-skeleton representation method; the first module provided by the invention is mainly used for improving the view angle generalization capability, and the second module is mainly used for obtaining the characteristics of larger receptive field and good stability of time sequence prediction. The described scenarios include, but are not limited to, research and applications involving video human gesture recognition. The invention is based on a modularized neural network combined training method, and effectively improves the generalization capability of three-dimensional human body gesture recognition.

The application range of the virtual data generation stage method includes, but is not limited to, a published academic data set, a motion capture system acquisition data set and the like, and the application range is applicable only if three-dimensional labels and camera parameters (namely, only if a two-dimensional/three-dimensional projection relationship exists).

The implementation of the first module and the second module defined in the respective method processes of the model training stage and the unconstrained video reasoning stage is not limited to the implementation in the specification of the invention, and all neural networks and the like which are suitable for predicting three-dimensional human body gestures based on single-frame two-dimensional human body detection and predicting continuous sequence three-dimensional human body gestures based on time sequence models can replace the first module and the second module referred to by the invention.

The method process of the unconstrained video reasoning stage is applicable to unconstrained video, i.e., natural condition acquisition, or video sequences that have undergone transformations including, but not limited to, scaling, cropping, shifting, and other color adjustments.

The normalization method serving as a special preprocessing stage in the model training stage and the unconstrained video reasoning stage is suitable for unconstrained videos: even if the relative projection position relationship between the person and the camera is destroyed, the method is applicable only if the two-dimensional human body key point result or the detection module can be detected.

Further, the specific flow details of each stage in the method of the invention are as follows:

in the visual angle independent video three-dimensional human body gesture recognition method provided by the invention, the virtual data generation stage further comprises the following steps: for the existing three-dimensional human body posture data set, a plurality of different camera parameters are generated through a random scheme with reasonable design, wherein the parameters comprise external parameters for determining the position and the orientation of a camera and internal parameters for determining the projection focal length frame of the camera. Generating corresponding two-dimensional human body data under different camera parameters for the three-dimensional human body data by continuously utilizing a projection relation on the basis of random parameters, so as to obtain a two-dimensional/three-dimensional human body posture data tuple; for the video data set, the motion range of the observed human body in the video sequence is considered, and reasonable internal parameters are obtained according to the motion trail, so that the projection view cone comprises a motion point set of three-dimensional human body key points as far as possible.

In the visual angle independent video three-dimensional human body gesture recognition method provided by the invention, the model training stage further comprises the following steps: and training the neural network of the sub-module. That is, for the first module, training for view enhancement can be performed using either single frames or continuous, but mainly single frame data tuples to obtain a model with camera view generalization capability; for the second module, sequential model training is performed mainly using a continuous sequence of data tuples to obtain a model that can protect inter-frame motion continuity. Care should be taken that: generally, for RGB video input, the first module can be understood as a two-dimensional to three-dimensional regression problem, and the second module is a sequential three-dimensional to sequential three-dimensional regression problem; for RGB-D video input, additional depth latitude may be added to both the first and second modules.

In the visual angle independent video three-dimensional human body gesture recognition method provided by the invention, the unconstrained video reasoning stage further comprises the following steps: the method comprises the steps of obtaining a two-dimensional detection result through any known two-dimensional human body key point detection method, preprocessing data by using the camera-independent two-dimensional detection normalization method, and sequentially carrying out forward reasoning through the first module and the second module to obtain a three-dimensional human body posture estimation result corresponding to a video sequence.

In the visual angle independent video three-dimensional human body gesture recognition method provided by the invention, the camera independent two-dimensional detection normalization method further comprises the following steps: the two-dimensional human body key point detection frame (or multiplied by a proper coefficient) is used as a normalization standard to normalize the two-dimensional human body key point input, and the normalization method has better camera resistance and better resistance to projection relation loss and damage caused by local zooming, cutting and the like of the picture.

The video three-dimensional human body gesture recognition method independent of the visual angle provided by the invention is specifically described below with reference to the specific embodiment.

In the first phase of the method of the invention, the virtual data generation phase: firstly, according to the existing three-dimensional human body posture data set, such as a published academic data set of Human3.6M, generating a random configuration of camera parameters for three-dimensional human body world coordinates of each video segment, wherein the configuration can set the camera position and the rotation angle according to the observed human body height and the activity range, such as: randomly determining a point as a passing point O in the direction of the optical axis of the camera by taking the mean value of the projection of the character moving range on the ground plane as a center point, taking 0.75 of the height of the character as an observation sphere center and taking 0.5 times of the height of the character as a Gaussian radius; randomly selecting the Euclidean distance of a camera from an O point in uniform distribution within 4.0 meters to 6.5 meters, wherein the camera attitude angle is shown in figure 2, the fixed roll (rotation in the camera shooting direction of a rotation on camera's direction vector) angle is fixed, the pitch (rotation in the transverse direction of a Rotate on the cross product of the other camera's up and direction vectors camera) angle is randomly generated between-15 degrees and +15 degrees, and the yaw (rotation in the vertical direction of a rotation on camera's up vector camera) angle is randomly generated between 0 degrees and +360 degrees; because the human3.6m dataset is built with camera references, reference generation may be temporarily omitted here. And randomly generating the values again during each sampling, and obtaining a two-dimensional/three-dimensional human body posture data tuple by utilizing a projection relation.

In the second stage of the invention, the model training stage: training a first module (camera-agnostic regressor) by adopting a random single-frame data tuple, wherein the implementation adopts a two-time iterative regression model with residual error connection based on a deep neural network, and obtains three-dimensional human body posture key points according to two-dimensional human body posture key point regression, and the model has the characteristics of camera independence and wide viewing angle; the second module (temporal regressor) is improved based on a hole convolution model on time sequence, and a three-dimensional posture correction network is designed for expanding the characteristics of the receptive field by using a hole convolution method and increasing the spatial continuity of the three-dimensional prediction result on time sequence, so that the function of compensating the prediction result of the first module is achieved. The loss function used for two-part supervision of the training process is shown below.

Wherein L is the total loss,

and->

The weights of the first module and the second module respectively,/>

and

the losses of the first module and the second module, respectively.

In the third stage of the invention, the unconstrained video reasoning stage: based on the existing two-dimensional detection result, the video acquired by any unconstrained condition is sequentially predicted by utilizing the multi-module deep neural network acquired by the training convergence of the second stage of the invention to acquire a three-dimensional human body posture result. The implementation of the first and second modules and the reasoning process are shown in fig. 3, where single-frame and multi-frame in fig. 3 represent a single and multiple two-dimensional frames, respectively.

The camera-independent two-dimensional detection normalization method used in the second and third stages of the invention: as shown in fig. 4, the principal point and the optical point in fig. 4 respectively represent the center points of the two-dimensional detection frame before and after projection, and the projection relationship and the conventional method generally use the pixel size of the original picture or the size of the square circumscribed by the original picture as the normalization standard, and the calculation function is as follows:

it can be seen that it depends on camera parameters (focal length) and its equivalent also on artwork size, lacking transformation resistance such as cropping. The normalization method provided by the invention is shown in fig. 5, and a specific calculation function is shown as follows, and the normalization method has the characteristics of keeping the two-dimensional detection size stable and being irrelevant to camera parameters:

wherein K is _x,y Representing the two-dimensional point coordinates after the two-dimensional detection normalization pretreatment,

representing the original two-dimensional point coordinates +.>

Representing the center coordinate of the two-dimensional detection frame, m is greater than 1, and the embodiment takes 1.2, w ^d ,h ^d The width and the height of the two-dimensional detection frame are respectively.

While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made and equivalents will be apparent to those skilled in the art without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims

1. The visual angle independent video three-dimensional human body gesture recognition method is characterized by comprising the following steps of:

2. The visual angle independent video three-dimensional human body gesture recognition method according to claim 1, wherein the step 1 specifically comprises: for any human body posture data set containing three-dimensional labels, a camera view angle enhancement module is adopted to synthesize virtual camera parameters, and a projection relation is utilized to generate a two-dimensional/three-dimensional data tuple.

3. The visual angle independent three-dimensional human body gesture recognition method of claim 2, wherein the camera parameters include an external parameter determining the position and orientation of the camera and an internal parameter determining the projected focal length of the camera.

4. The method of claim 1, wherein the first module in step 2 performs visual angle enhancement training using single frame data tuples.

5. The visual angle independent three-dimensional human body gesture recognition method of claim 1, wherein the second module in step 2 performs time sequence model training using a continuous sequence of data tuples.

6. The three-dimensional human body gesture recognition method of video independent of visual angle according to claim 1, wherein before inputting the neural network, the method further comprises a two-dimensional detection normalization preprocessing process of camera independent of the two-dimensional detection result, which corresponds to the description formula:

representing the original two-dimensional point coordinates,

7. The visual angle independent video three-dimensional human body gesture recognition method according to claim 1, wherein the video obtained by unconstrained acquisition in the step 3 specifically comprises a video sequence obtained by natural condition acquisition or scaling, clipping, speed changing and other color adjustment transformation.