CN110705463A - Video human behavior recognition method and system based on multi-mode double-flow 3D network - Google Patents

Video human behavior recognition method and system based on multi-mode double-flow 3D network Download PDF

Info

Publication number
CN110705463A
CN110705463A (application CN201910936088.0A)
Authority
CN
China
Prior art keywords
video
sequence
depth
network
stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910936088.0A
Other languages
Chinese (zh)
Inventor
马昕
武寒波
宋锐
荣学文
田国会
李贻斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN201910936088.0A priority Critical patent/CN110705463A/en
Publication of CN110705463A publication Critical patent/CN110705463A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video human behavior recognition method and system based on a multi-modal dual-stream 3D network, comprising the following steps: generating a depth dynamic image sequence (DDIS) from the depth video; generating a pose estimation map sequence (PEMS) from the RGB video; inputting the depth dynamic image sequence and the pose estimation map sequence into 3D convolutional neural networks respectively, constructing a DDIS stream and a PEMS stream, and obtaining their respective classification results; and fusing the obtained classification results to obtain the final behavior recognition result. The beneficial effects of the invention are as follows: by modeling the local spatio-temporal structure of the video, the DDIS describes the human motion in long-term behavior videos as well as the contours of interacted objects. The PEMS clearly captures changes of the human pose and eliminates interference from cluttered backgrounds. The multi-modal dual-stream 3D network architecture effectively models the global spatio-temporal dynamics of behavior videos under different data modalities and achieves excellent recognition performance.

Description

Video human behavior recognition method and system based on multi-mode double-flow 3D network
Technical Field
The invention relates to the technical field of human behavior recognition, in particular to a video human behavior recognition method and system based on a multi-mode double-flow 3D network.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
Video-based human behavior recognition has attracted increasing attention in the field of computer vision in recent years due to its wide range of applications, such as intelligent surveillance, video retrieval, and elderly care. Video behavior recognition is more challenging than image classification, since videos are high-dimensional and the temporal structure between successive frames provides important additional information. Therefore, spatio-temporal feature learning is of great significance for video-based behavior recognition. Spatial features typically describe the appearance of humans and objects and the configuration of the scene, while temporal features mainly capture the change of motion over time. Although much work has been done on behavior recognition, how to effectively extract discriminative spatio-temporal information from video to improve recognition performance is still being actively explored.
In recent years, deep neural networks have gained widespread attention and success in video-based behavior recognition, driven by significant increases in computing power and the availability of large annotated data sets. Three typical deep network structures commonly used to model video spatio-temporal dynamics are: two-stream convolutional neural networks (CNNs), CNN + recurrent neural network (RNN) combinations, and 3D convolutional neural networks. The two-stream CNN is the most popular framework among 2D convolutional approaches: RGB images and optical flow are input into a spatial-stream network and a temporal-stream network, respectively, to extract appearance and motion information for behavior recognition. Although the two-stream CNN achieves good performance, it cannot directly learn the temporal patterns of behavior videos. The RNN has good modeling capability for the temporal dependencies of long videos and can effectively address this problem; the CNN + RNN structure uses the CNN to learn spatial features and the RNN to model temporal dynamics. However, neither of the two architectures above captures the spatial and temporal information of the video at the same time. To overcome this limitation, the 3D CNN extends the two-dimensional convolution kernel into a three-dimensional one and can simultaneously encode the spatio-temporal dynamics of behavior videos.
The inventors have found that, although behavior recognition research based on deep learning has made great progress, it remains a complex and challenging topic. First, the CNN is a texture-driven network structure that is better at describing the color and texture of objects than human motion; CNN-based methods therefore tend to predict behavior from scenes and objects, which makes them more susceptible to cluttered backgrounds. Second, although the 3D CNN has the significant advantage of learning spatio-temporal features simultaneously, it is generally applied only to RGB video. RGB data are highly sensitive to color and lighting variation, occlusion, and background clutter; furthermore, it is difficult to obtain higher-level cues, such as body pose and body contour information, from RGB video alone.
Disclosure of Invention
In order to solve the above problems, the invention provides a video human behavior recognition method and system based on a multi-modal dual-stream 3D network; by designing a new multi-modal dual-stream 3D network framework and exploiting the complementary characteristics of different modalities, the performance of behavior recognition is improved.
In some embodiments, the following technical scheme is adopted:
a video human behavior recognition method based on a multi-modal dual-stream 3D network comprises the following steps:
generating a depth dynamic image sequence (DDIS) from the depth video;
generating a pose estimation map sequence (PEMS) from the RGB video;
inputting the depth dynamic image sequence and the pose estimation map sequence into 3D convolutional neural networks respectively, and constructing a DDIS stream and a PEMS stream to obtain their respective classification results;
and fusing the obtained classification results to obtain the final behavior recognition result.
In other embodiments, a video human behavior recognition system based on a multi-modal dual-stream 3D network is disclosed, comprising:
a module for generating a depth dynamic image sequence from the depth video;
a module for generating a pose estimation map sequence from the RGB video;
a module for inputting the depth dynamic image sequence and the pose estimation map sequence into 3D convolutional neural networks respectively to obtain their respective classification results;
and a module for fusing the obtained classification results to obtain the final behavior recognition result.
In other embodiments, a terminal device is disclosed that includes a processor and a computer-readable storage medium, the processor being configured to implement instructions; the computer-readable storage medium is used for storing a plurality of instructions, and the instructions are adapted to be loaded by the processor to execute the above video human behavior recognition method based on the multi-modal dual-stream 3D network.
In other embodiments, a computer-readable storage medium is disclosed, in which a plurality of instructions are stored, the instructions being adapted to be loaded by a processor of a terminal device to execute the above video human behavior recognition method based on the multi-modal dual-stream 3D network.
Compared with the prior art, the invention has the beneficial effects that:
the invention respectively inputs a depth dynamic graph sequence (DDIS) generated based on a depth video and a pose evaluation graph sequence (PEMS) generated based on an RGB video into a dual-stream 3D CNN to model the global space-time dynamics of the video. The DDIS can well describe the human motion in the long-term behavior video and the outline of an interactive object by modeling the local space-time structure information of the video. PEMS can clearly catch the change of human posture, eliminate the mixed and disorderly interference of background.
The method provided by the invention integrates the advantages of DDIS and PEMS, and explores the learning ability of the 3D convolutional neural network on global space-time characteristics under different data modes. The effectiveness and superiority of the method are verified through a large number of experiments carried out on an SBU Kinect interaction data set, a UTD-MHAD data set and an NTURGB + D data set.
Drawings
Fig. 1 is a flowchart of the video human behavior recognition method based on the multi-modal dual-stream 3D network according to an embodiment of the present invention;
Fig. 2 is a flowchart of DDIS construction according to an embodiment of the present invention;
Fig. 3 shows some RGB images and the corresponding pose estimation maps for four behavior classes in the NTU RGB + D data set according to an embodiment of the present invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
Example one
In one or more embodiments, a video human behavior recognition method based on a multi-modal dual-stream 3D network is disclosed, as shown in fig. 1, including the following steps:
(1) For one behavior sample, synchronized RGB and depth videos are collected.
(2) A depth dynamic image sequence is generated from the depth video.
(3) A pose estimation map sequence is generated from the RGB video.
(4) The depth dynamic image sequence and the pose estimation map sequence are input into 3D convolutional neural networks respectively to obtain their classification results.
(5) The obtained classification results are fused to obtain the final behavior recognition result.
In this embodiment, the depth dynamic image sequence DDIS and the pose estimation map sequence PEMS are used as the data inputs of the two 3D CNNs in the framework to construct a DDIS stream and a PEMS stream; the two streams simultaneously model the global spatio-temporal dynamics of the video under two different data modalities. Finally, the classification results of the two streams are fused to obtain the final behavior prediction. The proposed framework fuses the complementary information of the depth data and the pose data to obtain better recognition performance.
The method of this embodiment is described in detail below:
1. Depth dynamic image sequence
Although 3D CNNs perform well in video-based behavior recognition, they usually require fixed-length input sequences, which limits their flexibility and their ability to model long durations. Most existing 3D CNN-based behavior recognition methods adopt sparse sampling to obtain the video sequence input of the network; such sampling cannot model the global temporal structure of the whole video well, and some important motion details may be lost for complex long-term behavior videos.
The embodiment provides a depth dynamic image sequence (DDIS), a representation that simply and effectively describes the spatio-temporal dynamics of a depth video. Suppose a depth video V containing N frames is represented as
V = {I_1, I_2, ..., I_N},
where I_t is the t-th depth image. Inspired by the dynamic image, this embodiment extends it to depth data. First, a set of short video segments is generated with a sliding-window approach: a sliding window W of width L is moved along the time axis of the depth video sequence with step length s to generate T short segments, namely
V = {S_1, S_2, ..., S_T},
where S_t is the t-th video segment. The window width L is set to different values for different data sets, and the step length s is closely related to the video sequence length N; it is computed by formula (1):
[Formula (1) is provided as an image in the original document; here [a] denotes the largest integer not exceeding a.]
To effectively capture the local spatio-temporal dynamics within each short segment of the depth video, rank pooling is adopted to aggregate the spatio-temporal information of each segment into a depth dynamic image.
As a new temporal pooling method, rank pooling not only captures the temporal evolution of the video well but is also easy to implement. Rank pooling learns a linear function with a pairwise linear ranking machine; the parameters of this function encode the temporal order of the video frames and are used as a new representation of the video.
Let
S_t = {I_1^t, I_2^t, ..., I_L^t},
where I_j^t denotes the j-th image in the t-th segment of the depth video; the sliding-window width L is also the length of each video segment. A time-varying mean vector operation is used to capture the temporal information between successive frames in the t-th video segment, see equation (2):
m_j^t = (1/j) * sum_{k=1..j} v_k^t,   j = 1, ..., L,   (2)
where v_k^t denotes the vectorized frame I_k^t. The smoothed vector sequence {m_1^t, m_2^t, ..., m_L^t} still preserves the temporal order of the L consecutive images in the t-th depth video segment. A linear ranking function is defined as f(m_j^t) = <α, m_j^t>, with α ∈ R^D. α is the parameter vector of the ranking function and preserves the relative temporal order of the video frames, i.e., if i > j, then the ranking function values satisfy f(m_i^t) > f(m_j^t). The objective function of rank pooling is defined using structural risk minimization, see equation (3):
α* = argmin_α (1/2)||α||^2 + C * sum_{i>j} ε_ij,  subject to <α, m_i^t - m_j^t> >= 1 - ε_ij,  ε_ij >= 0,   (3)
where ε_ij are slack variables and α* is the optimal parameter vector satisfying the objective. α* is then reshaped into a two-dimensional matrix, which is the generated dynamic image; it aggregates all the image frames of a short depth video segment and simultaneously describes the spatial motion and temporal structure information of that segment.
The present embodiment constructs the DDIS by applying rank pooling to each of the T short segments of the depth video, as shown in Fig. 2. The DDIS describes the spatio-temporal changes of the human silhouette and contains discriminative motion details of human behavior, which helps improve recognition performance. In addition, the DDIS models the spatio-temporal structure of long-term depth behavior videos in a very lightweight form, enhancing the ability of the 3D CNN to learn global spatio-temporal features of long videos.
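For illustration, the following sketch implements the rank-pooling step of equations (2)-(3) on one depth segment: it computes the time-varying mean vectors and trains the pairwise linear ranking machine as a linear SVM on signed difference vectors, then reshapes the learned parameter vector into the dynamic image. The solver choice (sklearn's LinearSVC), the C value, and the toy data are assumptions of this sketch rather than details fixed by the patent.

```python
import numpy as np
from sklearn.svm import LinearSVC

def depth_dynamic_image(segment):
    """Aggregate one depth segment (L x H x W) into a single dynamic image
    via rank pooling on time-varying mean vectors, cf. equations (2)-(3)."""
    L, H, W = segment.shape
    frames = segment.reshape(L, -1).astype(np.float64)            # vectorized frames v_k
    # equation (2): time-varying mean vectors m_1, ..., m_L
    means = np.cumsum(frames, axis=0) / np.arange(1, L + 1)[:, None]
    # equation (3): pairwise ranking constraints <alpha, m_i - m_j> >= 1 for i > j,
    # solved here as a linear SVM over signed difference vectors
    diffs, labels = [], []
    for i in range(L):
        for j in range(i):
            diffs.append(means[i] - means[j]); labels.append(+1)
            diffs.append(means[j] - means[i]); labels.append(-1)
    ranker = LinearSVC(C=1.0, fit_intercept=False, max_iter=5000)
    ranker.fit(np.asarray(diffs), np.asarray(labels))
    alpha = ranker.coef_.ravel()                                   # optimal parameter vector
    return alpha.reshape(H, W)                                     # the depth dynamic image

# toy usage: pool one random 20-frame 64x64 depth segment; a DDIS is the
# stack of dynamic images obtained from the T segments of one depth video
segment = np.random.rand(20, 64, 64).astype(np.float32)
print(depth_dynamic_image(segment).shape)   # (64, 64)
```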
2. Pose estimation map sequence
Pose data provide rich information reflecting the motion of each body part, are closely related to behavior recognition, and fusing human pose information helps improve recognition performance. Pose estimation has achieved great success in recent years and has proven to be an effective way to obtain skeleton data. Compared with acquiring skeleton data with a depth sensor, image-based pose estimation has wider applicability and is not limited to indoor human behavior recognition. The pose estimation maps generated by applying 2D pose estimation to an RGB video sequence retain the intuitive information of the human pose while avoiding interference from cluttered backgrounds, thereby enhancing behavior recognition performance.
Part Affinity Fields (PAFs) are a non-parametric representation used for effective multi-person 2D pose estimation by learning to associate body parts with individuals in an image. First, the entire image is fed into a two-branch multi-stage CNN: one branch predicts a set of 2D confidence maps of body part locations for part detection, and the other predicts a set of 2D vector fields encoding part-to-part association for limb connection. Each branch is an iterative architecture that refines the predictions over successive stages. Second, based on the refined confidence maps and affinity fields, a greedy bottom-up parsing step performs bipartite matching to associate candidate body parts. Finally, connections sharing the same body-part detection candidates are assembled into whole-body pose estimates for the people in the image.
The embodiment adopts the PAF-based pose estimation method to perform 2D pose estimation on the RGB video: for an RGB video sequence, a corresponding pose estimation map is generated by applying pose estimation to each color image; sparse sampling is then applied to the originally generated sequence of RGB pose estimation maps to obtain a lightweight, compact data representation, referred to as the pose estimation map sequence (PEMS). Due to space limitations, Fig. 3 shows only part of the PEMS images for four behavior classes in the NTU RGB + D data set. The PEMS encodes the spatial dynamics of the human pose over time and provides rich body-part motion information. In addition, the PEMS directly separates the human body from the video, eliminating the influence of cluttered backgrounds and non-human motion, which greatly improves behavior recognition performance.
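The 2D pose estimation itself relies on an existing PAF-based estimator; the short sketch below only illustrates the subsequent sparse-sampling step that turns the full sequence of pose estimation maps into a compact, fixed-length PEMS. The uniform sampling rule and the target length of 16 frames (matching the network input described later) are assumptions of this sketch.

```python
import numpy as np

def sparse_sample(pose_maps, num_frames=16):
    """Uniformly sample `num_frames` pose estimation maps from the full
    sequence (shape: N x H x W x 3) to build a compact PEMS."""
    n = pose_maps.shape[0]
    idx = np.linspace(0, n - 1, num_frames).round().astype(int)
    return pose_maps[idx]

# toy usage: 180 pose estimation maps of size 112x112 -> 16-frame PEMS
maps = np.random.rand(180, 112, 112, 3).astype(np.float32)
pems = sparse_sample(maps)
print(pems.shape)   # (16, 112, 112, 3)
```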
3. Multimodal dual-stream 3D network
As is well known, videos are high-dimensional, the temporal structure between successive frames provides very valuable behavioral information, and spatio-temporal dynamics are of great importance for behavior recognition in video. Effective spatio-temporal structure modeling helps enhance video behavior recognition performance. The traditional two-stream network uses a spatial-stream CNN and a temporal-stream CNN to extract spatial and temporal information respectively; however, it uses only a single RGB image as the input of the spatial stream and thus cannot capture the global spatial dynamics of human behavior in the video, nor can it learn temporal and spatial features simultaneously. To address this, the 3D CNN was proposed for joint modeling of spatio-temporal dynamics and has achieved good results in video-based human behavior recognition.
By combining the characteristics of the two-stream structure and the 3D network, this embodiment proposes a new behavior recognition framework. Compared with conventional deep-network-based methods that use only a single data modality for behavior recognition, this embodiment adopts a dual-stream 3D structure to fuse information from multiple data modalities and obtain higher recognition performance. As shown in Fig. 1, the framework of this embodiment comprises two branches under different data modalities: the DDIS stream and the PEMS stream. Both streams are implemented with a 3D CNN model. The DDIS generated from the depth video and the PEMS generated from the RGB video are input into the 3D CNNs of the corresponding streams. Finally, the classification scores of the two branches are averaged to produce the final classification score of the video for behavior recognition.
The proposed multi-modal dual-stream 3D network framework fuses the advantages of two different data modalities and significantly improves recognition performance. A behavior sample is represented in two different data modalities, each processed in one of the two branches of the recognition framework. In the DDIS stream, the DDIS describes the spatio-temporal dynamics of the whole depth video in a light, concise form, enhancing the learning capability of the 3D CNN in the spatio-temporal domain. In addition, the DDIS reflects changes of the human shape over time and the contour information of objects involved in the behavior, providing discriminative cues for recognizing human-object interactions. In the PEMS stream, the PEMS encodes spatial changes of the body pose over time and directly extracts from the video the pose information covering the motion of each body part, eliminating the influence of background clutter. The two data modalities have complementary advantages, and their fusion yields better recognition; each branch of the framework uses a 3D CNN to model the global spatio-temporal dynamics of the behavior video, which further improves the recognition capability of the method.
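A minimal sketch of the dual-stream inference described above, using torchvision's Kinetics-pretrained r3d_18 as a stand-in for the 3D-ResNet18 backbone of each stream; the batch layout (batch x channels x 16 frames x 112 x 112) follows the implementation details given below, while replicating single-channel DDIS frames to three channels and the use of softmax scores are assumptions of this sketch.

```python
import torch
from torchvision.models.video import r3d_18

NUM_CLASSES = 60   # e.g. NTU RGB+D

def build_stream(num_classes):
    """One branch of the framework: a Kinetics-pretrained 3D ResNet-18
    with its classification layer replaced for the target data set."""
    net = r3d_18(pretrained=True)    # legacy flag; loads Kinetics-400 weights
    net.fc = torch.nn.Linear(net.fc.in_features, num_classes)
    return net

ddis_stream = build_stream(NUM_CLASSES).eval()
pems_stream = build_stream(NUM_CLASSES).eval()

# one behavior sample in both modalities: batch x channels x frames x H x W
ddis = torch.randn(1, 3, 16, 112, 112)   # DDIS frames replicated to 3 channels
pems = torch.randn(1, 3, 16, 112, 112)   # PEMS frames (RGB pose maps)

with torch.no_grad():
    score_d = torch.softmax(ddis_stream(ddis), dim=1)
    score_p = torch.softmax(pems_stream(pems), dim=1)
    final_score = (score_d + score_p) / 2          # average fusion of the two streams
    print(final_score.argmax(dim=1))               # predicted behavior class
```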
4. Experimental part
The effectiveness of the proposed method was verified with extensive experiments on three challenging RGB-D data sets. First, the data sets used for video behavior recognition are briefly introduced; next, the implementation details are described; then, the performance of the multi-modal dual-stream 3D network architecture is evaluated on the three RGB-D data sets, and the experimental results are reported in detail and compared with other methods.
4.1 data set
SBU Kinect interaction data set. The data set was collected with a Microsoft Kinect sensor and contains eight types of human-human interaction behaviors: approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands. Seven participants form 21 combinations, each containing a different pair of participants. Each interaction is performed once or twice per combination, giving approximately 300 interaction videos in total. Following the standard evaluation protocol, 5-fold cross-validation is performed on this data set.
UTD-MHAD data set. This data set was acquired with a Kinect camera and wearable inertial sensors in an indoor scene and contains 27 actions performed by 8 participants (4 females and 4 males), each participant repeating each action 4 times, giving 861 video sequences in total. In the experiments, the video samples of participants 1, 3, 5, and 7 are used for training and those of participants 2, 4, 6, and 8 for testing.
NTU RGB + D data set. The data set was acquired simultaneously with three Microsoft Kinect v2 cameras and is currently the largest RGB-D behavior recognition data set, with 56880 behavior samples. It contains 60 behavior classes performed by 40 participants. The data set is extremely challenging due to the large number of behavior samples and classes and the rich intra-class/inter-class variation. It has two standard evaluation protocols: cross-subject evaluation and cross-view evaluation. For cross-subject evaluation, the behavior samples of 20 subjects are used for training and those of the other 20 subjects for testing; for cross-view evaluation, the behavior samples captured by cameras 2 and 3 are used for training, while the remaining samples captured by camera 1 are used for testing.
4.2 implementation details
For the NTU RGB + D data set, the background in the original RGB-D videos is cluttered and occupies a larger spatial proportion than the human body region. Therefore, to eliminate the interference of the cluttered background and focus better on the human body region, this embodiment crops the human foreground region out of the original RGB-D videos. Specifically, after 2D pose estimation is applied to the RGB video sequence, pose estimation maps of the same size as the original RGB images are obtained; these maps contain only the human poses, so the cluttered background is eliminated. A bounding box of the human foreground is then detected on each generated pose estimation map, and the cropped human foreground is resized to 256 × 256 to construct the PEMS. Furthermore, since the RGB and depth video sequences in this data set are fully aligned and synchronized, the bounding boxes detected on the pose estimation maps (which have the same size as the corresponding RGB images) can also be projected proportionally into the corresponding depth images to crop the human foreground there. The cropped depth images of each behavior sample are resized to a fixed size of 265 × 320, i.e., the same aspect ratio as the original depth images, to generate the DDIS representation.
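As an illustration of the cropping step just described, the sketch below takes the human bounding box from the non-background pixels of a pose estimation map (which has the same size as the RGB frame), scales it proportionally into the aligned depth frame, and resizes the cropped foreground to 265 × 320. The use of OpenCV, the nearest-neighbour interpolation, and the toy resolutions are assumptions of this sketch.

```python
import numpy as np
import cv2

def bbox_from_pose_map(pose_map):
    """Bounding box (x0, y0, x1, y1) of the non-zero (human) pixels."""
    ys, xs = np.nonzero(pose_map.sum(axis=2) > 0)
    return xs.min(), ys.min(), xs.max(), ys.max()

def crop_depth_with_pose_bbox(pose_map, depth_frame, out_size=(320, 265)):
    """Project the pose-map bounding box into the aligned depth frame,
    crop the human foreground and resize it (width x height = 320 x 265)."""
    x0, y0, x1, y1 = bbox_from_pose_map(pose_map)
    sy = depth_frame.shape[0] / pose_map.shape[0]   # row scale
    sx = depth_frame.shape[1] / pose_map.shape[1]   # column scale
    crop = depth_frame[int(y0 * sy):int(y1 * sy) + 1,
                       int(x0 * sx):int(x1 * sx) + 1]
    return cv2.resize(crop, out_size, interpolation=cv2.INTER_NEAREST)

# toy usage: a 1080x1920 pose map aligned with a 424x512 depth frame
pose_map = np.zeros((1080, 1920, 3), np.uint8); pose_map[200:900, 700:1200] = 255
depth = np.random.randint(0, 4000, (424, 512), np.uint16)
print(crop_depth_with_pose_bbox(pose_map, depth).shape)   # (265, 320)
```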
For the UTD-MHAD data set, all frames in the DDIS are cropped and resized to 120 × 160, i.e., half the resolution of the original depth frames, and the resolution of the image frames in the PEMS is 1/4 of that of the original pose maps. For the SBU Kinect interaction data set, the image frame size of both the DDIS and the PEMS is 120 × 160, i.e., 1/4 of the original data resolution.
To implement the proposed multi-modal dual-stream 3D network framework, this embodiment adopts 3D-ResNet18 as the basic 3D network of the two streams, considering that Res3D performs better than C3D on different benchmark data sets. In the training phase, the DDIS stream and the PEMS stream are trained separately; the 3D-ResNet18 model in each stream is initialized with pre-trained weights obtained on the Kinetics data set, and the number of outputs of the classification layer is set to the number of classes of the corresponding data set. The 3D-ResNet18 network used in the proposed method can be replaced by any other 3D CNN structure. The DDIS and the PEMS each contain 16 frames, and each frame is resized to 112 × 112 to meet the 112 × 112 × 16 input requirement of the 3D-ResNet18 model. Data augmentation includes random cropping and horizontal flipping. The network weights are learned with mini-batch stochastic gradient descent, with the momentum set to 0.9 and the weight decay set to 0.0001. Each 3D network model is trained for 500 epochs with an initial learning rate of 0.01 that is decayed by a factor of 10. The mini-batch size is 32 samples for the SBU Kinect and UTD-MHAD data sets and 64 samples for the NTU RGB + D data set. All experiments were performed with the PyTorch toolbox. During testing, the DDIS and PEMS of a given test video are input into the 3D CNNs of the two branches to generate class score vectors under the two data modalities, and the class score vectors of the two streams are averaged to obtain the final class score of the video for behavior prediction.
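A condensed sketch of the training configuration described above for one stream (SGD with momentum 0.9, weight decay 0.0001, initial learning rate 0.01, decayed by a factor of 10, mini-batch of 32); the placeholder dataset, the decay milestones, and the reduced toy sizes are assumptions of this sketch rather than values fixed by the text.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models.video import r3d_18

# placeholder data: 64 clips of shape 3 x 16 x 112 x 112, 27 classes (UTD-MHAD)
clips = torch.randn(64, 3, 16, 112, 112)
labels = torch.randint(0, 27, (64,))
loader = DataLoader(TensorDataset(clips, labels), batch_size=32, shuffle=True)

model = r3d_18(pretrained=True)                        # Kinetics-pretrained weights
model.fc = torch.nn.Linear(model.fc.in_features, 27)   # classes of the target data set

optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
# learning rate decayed by a factor of 10; the milestones are placeholders
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[200, 400], gamma=0.1)
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(500):                               # 500 training epochs per stream
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()
```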
4.3 Performance evaluation
Table 1 Performance evaluation of the multi-modal dual-stream 3D CNN
To evaluate the performance of the proposed method, extensive experiments were performed on the three challenging RGB-D data sets. In particular, to fully verify the effectiveness of the proposed multi-modal dual-stream fusion, the performance of behavior recognition using only a single data modality was also evaluated; the experimental results and comparisons are shown in Table 1.
First, as can be seen from Table 1, for the SBU Kinect interaction data set the PEMS stream performs about 4.5% better than the DDIS stream, probably because the data set contains only a few simple human-human interactions with obvious differences, so the PEMS, which excludes cluttered-background interference and focuses on the human pose, is more discriminative for behavior recognition. For the UTD-MHAD data set, the recognition rate of the DDIS stream is 87.42%, about 4% higher than that of the PEMS stream. This data set contains behaviors with similar motion patterns, such as "draw triangle" and "draw circle", and some temporally reversed behaviors, such as "swipe left" and "swipe right"; therefore the DDIS, whose frames capture the local spatio-temporal dynamics of the behavior, achieves better recognition performance. For the NTU RGB + D data set, the DDIS stream outperforms the PEMS stream under both evaluation protocols, with recognition rates 8.31% and 6.1% higher under the cross-subject and cross-view settings, respectively, presumably because the data set contains rich daily behaviors and human-object interactions, so the DDIS, which models the spatio-temporal motion of the human body and describes the contours of interacted objects, obtains better recognition performance.
As shown in Table 1, the proposed multi-modal dual-stream fusion improves the recognition accuracy on the SBU Kinect interaction data set to 100%. For the UTD-MHAD data set, the dual-stream fusion also achieves the highest recognition rate of 91.61%, approximately 4% and 8% higher than the DDIS and PEMS streams, respectively. In addition, the method achieves good recognition results under both evaluation protocols of the NTU RGB + D data set, with accuracies of 89.38% and 93.48%, respectively. These results show that the multi-modal dual-stream fusion scheme significantly improves the recognition results on all three data sets and outperforms either single stream, indicating that the DDIS stream and the PEMS stream are complementary.
4.4 comparison of different fusion strategies
Table 2 Performance comparison of different fusion strategies on the three data sets
This embodiment explores several fusion strategies for combining the classification outputs of the two branches. For a test sample, the classification score vectors from the DDIS stream and the PEMS stream are fused by averaging, taking the element-wise maximum, or element-wise multiplication to obtain the final fused score vector, and the index of the maximum value in the final class score vector of the test video is taken as the predicted class label. The comparison of the three common fusion methods is shown in Table 2. As can be seen from Table 2, average fusion achieves the highest recognition accuracy on all three data sets compared with maximum and multiplicative fusion. Based on this comparison, this embodiment combines the recognition results of the dual-stream architecture using the average fusion strategy.
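The three fusion rules compared in Table 2 reduce to element-wise operations on the two class-score vectors, as in the short sketch below; the toy score values are illustrative only.

```python
import numpy as np

def fuse(score_ddis, score_pems, strategy="average"):
    """Combine the class-score vectors of the DDIS and PEMS streams."""
    if strategy == "average":
        fused = (score_ddis + score_pems) / 2
    elif strategy == "maximum":
        fused = np.maximum(score_ddis, score_pems)
    elif strategy == "multiply":
        fused = score_ddis * score_pems
    else:
        raise ValueError(strategy)
    return fused.argmax()          # index of the predicted class

s_d = np.array([0.1, 0.7, 0.2])    # illustrative DDIS-stream scores
s_p = np.array([0.3, 0.4, 0.3])    # illustrative PEMS-stream scores
for name in ("average", "maximum", "multiply"):
    print(name, fuse(s_d, s_p, name))
```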
4.5 comparison with other Excellent algorithms
To verify the superiority of the proposed method, this embodiment compares it with other strong methods on the SBU Kinect interaction, UTD-MHAD, and NTU RGB + D data sets. Table 3 shows the results of several methods on the SBU Kinect interaction data set. Most previous methods are based on the skeleton modality, among which RotClips + MTCNN and GCA-LSTM raise the recognition rate above 94%. TL(Mul(CL1Dtree + CMHI)) uses 3D skeletons, body-part images, and motion history images to collect the body pose, body-part contour, and body motion information of a behavior and integrates them into a hybrid deep-learning architecture for human behavior recognition; by fusing the three kinds of feature information, its recognition rate reaches 99.3%. Compared with these methods, the multi-modal dual-stream 3D CNN framework of this embodiment performs best, reaching an accuracy of 100%, which demonstrates the effectiveness of the method.
Table 4 compares the proposed method with other methods on the UTD-MHAD data set. The recognition accuracy obtained by fusing the DDIS stream and the PEMS stream is 91.61%, better than most previous algorithms. However, the performance is 7.19% lower than that of Ensem-NN, the state-of-the-art method on this data set. Ensem-NN proposes a skeleton-based ensemble neural network for video human behavior recognition, in which four different sub-networks are designed on top of a base network to extract different behavior features. Ensem-NN integrates the spatio-temporal characteristics of the whole skeleton, the local characteristics of different body parts, and the motion characteristics of skeleton joints between consecutive frames, thereby achieving efficient behavior recognition. In contrast, the proposed method lacks modeling of the local features of human behavior, which may weaken its recognition capability.
For the NTU RGB + D data set, the comparison between the proposed method and other advanced algorithms is shown in Table 5. The method of this embodiment outperforms most previous work under both the cross-subject and cross-view settings, with accuracies of 89.38% and 93.48%, respectively, demonstrating its effectiveness. In particular, the proposed method achieves significant improvements over multi-modality-based approaches in both settings. Furthermore, as can be seen from Table 5, compared with most single-modality methods, the performance of the proposed method is significantly improved in both evaluation settings. Glimpse Clouds achieves comparable performance to the proposed method under the cross-view setting, while the proposed method is 2.78% better under the cross-subject setting. It is also worth mentioning that DGNN performs strongly compared with the proposed method: under the cross-subject protocol, the method of this embodiment and DGNN achieve similar recognition results, but under the cross-view protocol the accuracy of DGNN is 96.1%, about 2.6% higher than that of the method of this embodiment. This is probably because DGNN represents joint and bone information as a directed acyclic graph and designs a directed graph neural network to predict behaviors from an adaptively learned graph structure, which effectively models the dependencies between joints and bones in the skeleton data and the temporal information between consecutive frames, thereby improving recognition performance. The method of this embodiment mainly focuses on modeling global spatio-temporal dynamic information and omits the description of local human-behavior features, so its recognition accuracy is slightly lower.
From the experimental results in Tables 3, 4, and 5, the DDIS stream alone also achieves good recognition on the three data sets, indicating that the 3D CNN has good feature-learning ability on, and applicability to, depth video. In summary, the multi-modal dual-stream 3D CNN framework proposed in this embodiment has good recognition performance on the three evaluated RGB-D data sets, mainly for the following reasons:
(1) The DDIS generated from the depth video provides spatio-temporal dynamic information of the human shape and contour information of the objects related to the behavior. The PEMS directly extracts human poses from the RGB video, which contain discriminative pose information. The DDIS stream and the PEMS stream are complementary, and their fusion improves behavior recognition performance.
(2) In the proposed multi-modal dual-stream structure, a 3D CNN is applied to each branch stream to model the global spatio-temporal features of the behavior video, which further improves the recognition results.
Table 3 Comparison of experimental results on the SBU Kinect interaction data set
Table 4 Comparison of experimental results on the UTD-MHAD data set
Table 5 Comparison of experimental results on the NTU RGB + D data set
The multi-modal dual-stream 3D CNN framework proposed in this embodiment models global spatio-temporal dynamics under different data modalities to achieve more efficient video human behavior recognition. The proposed DDIS representation describes the change of the human shape over time and provides rich spatio-temporal motion information of the behavior; in addition, it captures the details of human-object interaction, which benefits the recognition of interaction behaviors. The PEMS representation explicitly models changes of the human pose by directly separating the human body from the RGB video, which greatly helps eliminate the effects of background clutter and non-human motion. The video representations under the two data modalities are compact and concise, reducing the computation over the whole video. The DDIS and the PEMS are input into 3D CNNs to construct a dual-stream network structure in which each branch models the global spatio-temporal dynamics of a behavior sample. The method achieves 100% recognition accuracy on the SBU Kinect interaction data set and outperforms most existing methods on the UTD-MHAD and NTU RGB + D data sets. In future work, we will investigate how to extract and fuse local features of human behaviors to obtain more accurate recognition results.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, it is not intended to limit the scope of the present invention, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive efforts by those skilled in the art based on the technical solution of the present invention.

Claims (9)

1. A video human behavior recognition method based on a multi-modal dual-stream 3D network, characterized by comprising the following steps:
generating a depth dynamic image sequence DDIS from the depth video;
generating a pose estimation map sequence PEMS from the RGB video;
inputting the depth dynamic image sequence and the pose estimation map sequence into 3D convolutional neural networks respectively, and constructing a DDIS stream and a PEMS stream to obtain respective classification results;
and fusing the obtained classification results to obtain a final behavior recognition result.
2. The video human behavior recognition method based on the multi-modal dual-stream 3D network according to claim 1, wherein generating the depth dynamic image sequence from the depth video specifically comprises:
moving a sliding window of width L along the time axis of the depth video sequence with a set step length s to generate T short segments; and aggregating the spatio-temporal information in each short segment into a depth dynamic image by rank pooling to obtain the depth dynamic image sequence.
3. The video human behavior recognition method based on the multi-modal dual-stream 3D network according to claim 2, wherein the step length s and the length N of the depth video sequence satisfy the following relation:
[Formula (1); provided as an image in the original document.]
4. The video human behavior recognition method based on the multi-modal dual-stream 3D network according to claim 2, wherein aggregating the spatio-temporal information in each short segment into a depth dynamic image by rank pooling specifically comprises: letting S_t = {I_1^t, I_2^t, ..., I_L^t}, wherein I_j^t represents the j-th image in the t-th segment of the depth video, and the width L of the sliding window is the length of the video segment;
capturing the temporal information between consecutive frames in the t-th video segment through a time-varying mean vector operation;
defining the objective function of rank pooling using structural risk minimization;
and obtaining the optimal parameter vector satisfying the objective function, wherein the parameter vector is reshaped into a two-dimensional matrix representing the generated dynamic image, which aggregates all image frames of a short segment of the depth video and can simultaneously describe the spatial motion and temporal structure information of that short segment.
5. The video human behavior recognition method based on the multi-modal dual-stream 3D network according to claim 1, wherein generating the pose estimation map sequence from the RGB video specifically comprises:
for an RGB video sequence, generating a corresponding pose estimation map by applying pose estimation to each color image; and then applying sparse sampling to the originally generated sequence of RGB pose estimation maps to obtain the pose estimation map sequence.
6. The video human behavior recognition method based on the multi-modal dual-stream 3D network according to claim 1, wherein the classification score vectors generated by the DDIS stream and the PEMS stream are averaged to obtain the final classification score of the behavior video.
7. A video human behavior recognition system based on a multi-modal dual-stream 3D network, characterized by comprising:
a module for generating a depth dynamic image sequence from the depth video;
a module for generating a pose estimation map sequence from the RGB video;
a module for inputting the depth dynamic image sequence and the pose estimation map sequence into 3D convolutional neural networks respectively to obtain their respective classification results;
and a module for fusing the obtained classification results to obtain a final behavior recognition result.
8. A terminal device comprising a processor and a computer-readable storage medium, the processor being configured to implement instructions; the computer-readable storage medium being configured to store a plurality of instructions, wherein the instructions are adapted to be loaded by the processor to execute the video human behavior recognition method based on the multi-modal dual-stream 3D network according to any one of claims 1-6.
9. A computer-readable storage medium, in which a plurality of instructions are stored, wherein the instructions are adapted to be loaded by a processor of a terminal device to execute the video human behavior recognition method based on the multi-modal dual-stream 3D network according to any one of claims 1-6.
CN201910936088.0A 2019-09-29 2019-09-29 Video human behavior recognition method and system based on multi-mode double-flow 3D network Pending CN110705463A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910936088.0A CN110705463A (en) 2019-09-29 2019-09-29 Video human behavior recognition method and system based on multi-mode double-flow 3D network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910936088.0A CN110705463A (en) 2019-09-29 2019-09-29 Video human behavior recognition method and system based on multi-mode double-flow 3D network

Publications (1)

Publication Number Publication Date
CN110705463A (en) 2020-01-17

Family

ID=69197787

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910936088.0A Pending CN110705463A (en) 2019-09-29 2019-09-29 Video human behavior recognition method and system based on multi-mode double-flow 3D network

Country Status (1)

Country Link
CN (1) CN110705463A (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325124A (en) * 2020-02-05 2020-06-23 上海交通大学 Real-time man-machine interaction system under virtual scene
CN111428066A (en) * 2020-04-24 2020-07-17 南京图格医疗科技有限公司 Method for classifying and segmenting lesion image based on convolutional neural network
CN111968091A (en) * 2020-08-19 2020-11-20 南京图格医疗科技有限公司 Method for detecting and classifying lesion areas in clinical image
CN112132089A (en) * 2020-09-28 2020-12-25 天津天地伟业智能安全防范科技有限公司 Excavator behavior analysis method based on 3D convolution and optical flow
CN112348125A (en) * 2021-01-06 2021-02-09 安翰科技(武汉)股份有限公司 Capsule endoscope image identification method, equipment and medium based on deep learning
CN112396018A (en) * 2020-11-27 2021-02-23 广东工业大学 Badminton player foul action recognition method combining multi-modal feature analysis and neural network
CN112633209A (en) * 2020-12-29 2021-04-09 东北大学 Human action recognition method based on graph convolution neural network
CN112668550A (en) * 2021-01-18 2021-04-16 沈阳航空航天大学 Double-person interaction behavior recognition method based on joint point-depth joint attention RGB modal data
CN112906520A (en) * 2021-02-04 2021-06-04 中国科学院软件研究所 Gesture coding-based action recognition method and device
CN113255489A (en) * 2021-05-13 2021-08-13 东南大学 Multi-mode diving event intelligent evaluation method based on label distribution learning
CN113297955A (en) * 2021-05-21 2021-08-24 中国矿业大学 Sign language word recognition method based on multi-mode hierarchical information fusion
WO2021253938A1 (en) * 2020-06-19 2021-12-23 深圳市商汤科技有限公司 Neural network training method and apparatus, and video recognition method and apparatus
CN114500879A (en) * 2022-02-09 2022-05-13 腾讯科技(深圳)有限公司 Video data processing method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3203412A1 (en) * 2016-02-05 2017-08-09 Delphi Technologies, Inc. System and method for detecting hand gestures in a 3d space
CN108388882A (en) * 2018-03-16 2018-08-10 中山大学 Based on the gesture identification method that the overall situation-part is multi-modal RGB-D
CN108648746A (en) * 2018-05-15 2018-10-12 南京航空航天大学 A kind of open field video natural language description generation method based on multi-modal Fusion Features
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A kind of deep video Activity recognition method and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3203412A1 (en) * 2016-02-05 2017-08-09 Delphi Technologies, Inc. System and method for detecting hand gestures in a 3d space
CN108388882A (en) * 2018-03-16 2018-08-10 中山大学 Based on the gesture identification method that the overall situation-part is multi-modal RGB-D
CN108648746A (en) * 2018-05-15 2018-10-12 南京航空航天大学 A kind of open field video natural language description generation method based on multi-modal Fusion Features
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A kind of deep video Activity recognition method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王凤艳: "Research on Gesture Recognition Algorithms Based on Multi-modal Input", China Master's Theses Full-text Database (Information Science and Technology) *

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325124B (en) * 2020-02-05 2023-05-12 上海交通大学 Real-time man-machine interaction system under virtual scene
CN111325124A (en) * 2020-02-05 2020-06-23 上海交通大学 Real-time man-machine interaction system under virtual scene
CN111428066A (en) * 2020-04-24 2020-07-17 南京图格医疗科技有限公司 Method for classifying and segmenting lesion image based on convolutional neural network
CN111428066B (en) * 2020-04-24 2021-08-24 南京图格医疗科技有限公司 Method for classifying and segmenting lesion image based on convolutional neural network
WO2021253938A1 (en) * 2020-06-19 2021-12-23 深圳市商汤科技有限公司 Neural network training method and apparatus, and video recognition method and apparatus
CN111968091A (en) * 2020-08-19 2020-11-20 南京图格医疗科技有限公司 Method for detecting and classifying lesion areas in clinical image
CN111968091B (en) * 2020-08-19 2022-04-01 南京图格医疗科技有限公司 Method for detecting and classifying lesion areas in clinical image
CN112132089A (en) * 2020-09-28 2020-12-25 天津天地伟业智能安全防范科技有限公司 Excavator behavior analysis method based on 3D convolution and optical flow
CN112396018A (en) * 2020-11-27 2021-02-23 广东工业大学 Badminton player foul action recognition method combining multi-modal feature analysis and neural network
CN112396018B (en) * 2020-11-27 2023-06-06 广东工业大学 Badminton player foul action recognition method combining multi-mode feature analysis and neural network
CN112633209B (en) * 2020-12-29 2024-04-09 东北大学 Human action recognition method based on graph convolution neural network
CN112633209A (en) * 2020-12-29 2021-04-09 东北大学 Human action recognition method based on graph convolution neural network
WO2022148216A1 (en) * 2021-01-06 2022-07-14 安翰科技(武汉)股份有限公司 Capsule endoscope image recognition method based on deep learning, and device and medium
CN112348125A (en) * 2021-01-06 2021-02-09 安翰科技(武汉)股份有限公司 Capsule endoscope image identification method, equipment and medium based on deep learning
CN112668550A (en) * 2021-01-18 2021-04-16 沈阳航空航天大学 Double-person interaction behavior recognition method based on joint point-depth joint attention RGB modal data
CN112668550B (en) * 2021-01-18 2023-12-19 沈阳航空航天大学 Double interaction behavior recognition method based on joint point-depth joint attention RGB modal data
CN112906520A (en) * 2021-02-04 2021-06-04 中国科学院软件研究所 Gesture coding-based action recognition method and device
CN113255489A (en) * 2021-05-13 2021-08-13 东南大学 Multi-mode diving event intelligent evaluation method based on label distribution learning
CN113255489B (en) * 2021-05-13 2024-04-16 东南大学 Multi-mode diving event intelligent evaluation method based on mark distribution learning
CN113297955A (en) * 2021-05-21 2021-08-24 中国矿业大学 Sign language word recognition method based on multi-mode hierarchical information fusion
CN114500879A (en) * 2022-02-09 2022-05-13 腾讯科技(深圳)有限公司 Video data processing method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110705463A (en) Video human behavior recognition method and system based on multi-mode double-flow 3D network
Zhang et al. Dynamic hand gesture recognition based on short-term sampling neural networks
Habibie et al. In the wild human pose estimation using explicit 2d features and intermediate 3d representations
Cao et al. Egocentric gesture recognition using recurrent 3d convolutional neural networks with spatiotemporal transformer modules
US10824916B2 (en) Weakly supervised learning for classifying images
Baradel et al. Human action recognition: Pose-based attention draws focus to hands
Liu et al. Continuous gesture recognition with hand-oriented spatiotemporal feature
US9098740B2 (en) Apparatus, method, and medium detecting object pose
Fooladgar et al. Multi-modal attention-based fusion model for semantic segmentation of RGB-depth images
Zhai et al. Optical flow estimation using dual self-attention pyramid networks
CN110390294B (en) Target tracking method based on bidirectional long-short term memory neural network
CN112861808B (en) Dynamic gesture recognition method, device, computer equipment and readable storage medium
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN113239820A (en) Pedestrian attribute identification method and system based on attribute positioning and association
Núnez et al. Real-time human body tracking based on data fusion from multiple RGB-D sensors
CN114419732A (en) HRNet human body posture identification method based on attention mechanism optimization
Rehman et al. Face detection and tracking using hybrid margin-based ROI techniques
CN112906520A (en) Gesture coding-based action recognition method and device
CN114641799A (en) Object detection device, method and system
Dewan et al. Spatio-temporal Laban features for dance style recognition
Zhou et al. A study on attention-based LSTM for abnormal behavior recognition with variable pooling
CN112862860A (en) Object perception image fusion method for multi-modal target tracking
Chang et al. 2d–3d pose consistency-based conditional random fields for 3d human pose estimation
Liang et al. Egocentric hand pose estimation and distance recovery in a single RGB image
Zhang et al. Visual Object Tracking via Cascaded RPN Fusion and Coordinate Attention.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200117