CN110991278A - Human body action recognition method and device in video of computer vision system

Human body action recognition method and device in video of computer vision system

Info

Publication number
CN110991278A
Authority
CN
China
Prior art keywords
video
action
extracting
data preprocessing
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911142506.5A
Other languages
Chinese (zh)
Inventor
吉长江 (Ji Changjiang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Moviebook Technology Corp Ltd
Original Assignee
Beijing Moviebook Technology Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Moviebook Technology Corp Ltd filed Critical Beijing Moviebook Technology Corp Ltd
Priority to CN201911142506.5A
Publication of CN110991278A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a method and a device for recognizing human actions in videos of a computer vision system, and relates to the field of human action recognition. The method comprises the following steps: acquiring a video of a computer vision system and performing data preprocessing on the video; extracting feature information from the preprocessed video using a parallel multi-feature fusion network algorithm; training a deep learning model for human actions according to the extracted feature information; and classifying and recognizing a video to be recognized using the trained deep learning model. The device comprises an acquisition module, an extraction module, a training module, and a recognition module. The method and the device improve the accuracy of action recognition and of classification.

Description

Human body action recognition method and device in video of computer vision system
Technical Field
The present application relates to the field of human motion recognition, and in particular, to a method and an apparatus for recognizing human motion in a video of a computer vision system.
Background
In earlier research on traditional algorithms, action recognition typically required manually designed rules to extract feature descriptors. Among feature extraction methods, global features contain more information about the human body and describe the moving object as a whole, but this approach is sensitive to noise, occlusion, and similar changes. Local features such as spatio-temporal interest points do not depend on segmentation, localization, and tracking of the moving human body and are less sensitive to noise and occlusion, but stable spatio-temporal interest points are difficult to extract. Algorithms that combine global and local features, fusing the advantages of both, are more common. For example, joint positions are detected, different types of motion models are trained using multi-pose estimation, semantic spatial relations between people are extracted, and human action recognition is achieved in combination with appearance features. Trajectory features in a video sequence can be obtained using an optical flow field, after which histogram-of-optical-flow, motion-direction-histogram, and motion-boundary-histogram features are extracted to build a motion descriptor; during feature extraction along the motion trajectory, the influence of camera motion is removed from the optical flow image, further improving the accuracy of human action recognition. Unlike methods based on RGB image data, there are algorithms based on depth image input including 3D joint points, such as geometric-relation modeling that simulates body parts using rotations and translations in 3D space, and a motion-feature acquisition method based on depth motion maps that uses an L2-regularized collaborative-representation classifier for action classification.
With the development of deep neural networks such as convolutional neural networks and long short-term memory networks, human action recognition has made considerable progress, and the performance of these algorithms has surpassed that of traditional ones. Researchers have proposed a two-stream CNN (Convolutional Neural Network) structure, in which a spatial stream processes single frames and a temporal stream processes dense optical flow information over consecutive frames; the spatial and temporal networks are trained together and their results are finally fused for classification. Building on the two-stream method, a temporal segment network has been proposed to handle the long time spans of video sequences, analyzing the whole action video with uniform sparse sampling and video-level supervision. Combining a CNN with an LSTM (Long Short-Term Memory) network improves classification performance by letting memory units effectively represent the order of frames. A C3D network has also been proposed to extract spatio-temporal features from video, achieving good experimental results; 3D convolution captures temporal information well.
However, recognition of human interactions still faces many problems. First, the space of human interaction features is complex. In real conditions, the spatial complexity of actions makes extracting human action features very difficult, especially for multi-person interaction recognition. Actions are also prone to occlusion: during an action there may be occlusion between a person and the background, mutual occlusion between people, and self-occlusion, all of which make detection of the moving individuals more complex. In addition, dynamic background changes during action execution, together with differing backgrounds, illumination, and resolutions and interference from irrelevant persons, make human action recognition considerably harder. Second, action characteristics differ across time stages. In human interactions, different actions have different completion periods; for example, a handshake action and a charging action take different amounts of time, and the same action performed by different people takes different lengths of time. For the same action, the amount of feature information also differs across its time stages: in the starting and ending stages the feature differences between actions are small, whereas in the execution stage they are pronounced. These stage-dependent feature differences make action recognition harder. Acquiring key frames during action execution is therefore important for feature extraction, and incorporating the action's time stage into video down-sampling, so as to obtain key frames carrying more action feature information, is a challenge for human interaction recognition. Third, the interaction feature information is complex. For single-person action recognition, only the motion features of one person need to be considered, such as global features (optical flow, silhouettes) or local features (local spatio-temporal interest points). For feature extraction in human interactions, not only the motion features of each moving individual but also the interaction information between persons, such as their relative position and orientation during the action, must be considered. The feature extraction process also suffers interference from individual actions unrelated to the interaction.
Disclosure of Invention
It is an object of the present application to overcome the above problems, or at least to partially solve or mitigate them.
According to one aspect of the application, a method for recognizing human actions in a video of a computer vision system is provided, comprising the following steps:
acquiring a video of a computer vision system and performing data preprocessing on the video;
extracting feature information from the preprocessed video using a parallel multi-feature fusion network algorithm;
training a deep learning model for human actions according to the extracted feature information;
and classifying and recognizing a video to be recognized using the trained deep learning model.
Optionally, extracting feature information from the preprocessed video using a parallel multi-feature fusion network algorithm comprises:
extracting feature information from the preprocessed video with a parallel multi-feature fusion algorithm based on an Inception network and a ResNet deep residual network.
Optionally, performing data preprocessing on the video comprises:
when two-person interactive action recognition is performed, clipping and dividing the video of the two-person interactive action into two action videos each containing only a single person.
Optionally, extracting feature information from the preprocessed video using a parallel multi-feature fusion network algorithm comprises:
extracting feature information separately from the whole video and from the segmented single-person videos using the parallel multi-feature fusion network algorithm.
Optionally, the method further comprises:
during feature information extraction, adopting a Gaussian-model down-sampling method that fuses time-stage characteristics, assigning different sampling intervals to the human bodies at different time stages of the interactive action, and removing redundant information.
According to another aspect of the present application, an apparatus for recognizing human actions in a video of a computer vision system is provided, comprising:
an acquisition module configured to acquire a video of a computer vision system and perform data preprocessing on the video;
an extraction module configured to extract feature information from the preprocessed video using a parallel multi-feature fusion network algorithm;
a training module configured to train a deep learning model for human actions according to the extracted feature information;
and a recognition module configured to classify and recognize a video to be recognized using the trained deep learning model.
Optionally, the extraction module is specifically configured to:
extract feature information from the preprocessed video with a parallel multi-feature fusion algorithm based on an Inception network and a ResNet deep residual network.
Optionally, the acquisition module is specifically configured to:
when two-person interactive action recognition is performed, clip and divide the video of the two-person interactive action into two action videos each containing only a single person.
Optionally, the extraction module is specifically configured to:
extract feature information separately from the whole video and from the segmented single-person videos using the parallel multi-feature fusion network algorithm.
Optionally, the extraction module is further configured to:
during feature information extraction, adopt a Gaussian-model down-sampling method that fuses time-stage characteristics, assign different sampling intervals to the human bodies at different time stages of the interactive action, and remove redundant information.
According to yet another aspect of the application, there is provided a computing device comprising a memory, a processor and a computer program stored in the memory and executable by the processor, wherein the processor implements the method as described above when executing the computer program.
According to yet another aspect of the application, a computer-readable storage medium, preferably a non-volatile readable storage medium, is provided, having stored therein a computer program which, when executed by a processor, implements a method as described above.
According to yet another aspect of the application, there is provided a computer program product comprising computer readable code which, when executed by a computer device, causes the computer device to perform the method described above.
According to the technical solution of the application, a video of a computer vision system is acquired and preprocessed; a parallel multi-feature fusion network algorithm is used to extract feature information from the preprocessed video; a deep learning model is trained for human actions according to the extracted feature information; and the trained deep learning model is used to classify and recognize the video to be recognized, which improves the accuracy of both action recognition and classification. Further, to address the differences in action characteristics across different time stages of the video, video key frames are obtained by a Gaussian-model down-sampling method that fuses time-stage characteristics, removing the influence of a large amount of redundant information and further improving action recognition accuracy.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the present application will be described in detail hereinafter by way of illustration and not limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow diagram of a method for human motion recognition in a video of a computer vision system according to one embodiment of the present application;
FIG. 2 is a flow diagram of a method for human motion recognition in a video of a computer vision system according to another embodiment of the present application;
FIG. 3 is a block diagram of a human motion recognition device in video of a computer vision system according to another embodiment of the present application;
FIG. 4 is a block diagram of a computing device according to another embodiment of the present application;
FIG. 5 is a structural diagram of a computer-readable storage medium according to another embodiment of the present application.
Detailed Description
FIG. 1 is a flow diagram of a method for human motion recognition in a video of a computer vision system according to one embodiment of the present application. Referring to FIG. 1, the method includes:
101: acquiring a video of a computer vision system and performing data preprocessing on the video;
102: extracting feature information from the preprocessed video using a parallel multi-feature fusion network algorithm;
103: training a deep learning model for human actions according to the extracted feature information;
104: classifying and recognizing a video to be recognized using the trained deep learning model.
In this embodiment, optionally, extracting feature information from the preprocessed video using a parallel multi-feature fusion network algorithm includes:
extracting feature information from the preprocessed video with a parallel multi-feature fusion algorithm based on an Inception network and a ResNet deep residual network.
In this embodiment, optionally, performing data preprocessing on the video includes:
when two-person interactive action recognition is performed, clipping and dividing the video of the two-person interactive action into two action videos each containing only a single person.
In this embodiment, optionally, extracting feature information from the preprocessed video using a parallel multi-feature fusion network algorithm includes:
extracting feature information separately from the whole video and from the segmented single-person videos using the parallel multi-feature fusion network algorithm.
In this embodiment, optionally, the method further includes:
during feature information extraction, adopting a Gaussian-model down-sampling method that fuses time-stage characteristics, assigning different sampling intervals to the human bodies at different time stages of the interactive action, and removing redundant information.
According to the method provided by this embodiment, a video of a computer vision system is acquired and preprocessed; feature information is extracted from the preprocessed video using a parallel multi-feature fusion network algorithm; a deep learning model is trained for human actions according to the extracted feature information; and the trained deep learning model is used to classify and recognize a video to be recognized, which improves the accuracy of both action recognition and classification. Further, to address the differences in action characteristics across different time stages of a video, video key frames are obtained by a Gaussian-model down-sampling method that fuses time-stage characteristics, which removes the influence of a large amount of redundant information and further improves action recognition accuracy.
FIG. 2 is a flow diagram of a method for human motion recognition in a video of a computer vision system according to another embodiment of the present application. Referring to FIG. 2, the method includes:
201: acquiring a video of a computer vision system and performing data preprocessing on the video;
In the data preprocessing stage, to address the limitation of insufficient database capacity, the data volume can be enlarged through horizontal video flipping, random cropping, and similar augmentations, as illustrated in the sketch below.
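As a minimal illustration of this augmentation step, the following sketch flips and randomly crops a clip held as a NumPy array. The function name, clip layout, and crop size are illustrative assumptions; the patent does not specify an implementation.

```python
import numpy as np

def augment_clip(frames, crop_hw=(224, 224), rng=None):
    """Enlarge the training data by horizontal flipping and random cropping.

    frames: NumPy array of shape (T, H, W, C) holding one video clip.
    Returns two augmented clips (illustrative helper, not from the patent).
    """
    rng = rng or np.random.default_rng()
    t, h, w, c = frames.shape
    ch, cw = crop_hw
    # Pick one crop offset and apply it to every frame so the clip
    # stays temporally consistent.
    y = int(rng.integers(0, h - ch + 1))
    x = int(rng.integers(0, w - cw + 1))
    cropped = frames[:, y:y + ch, x:x + cw, :]
    # Horizontal flip: reverse the width axis of every frame.
    flipped = frames[:, :, ::-1, :]
    return cropped, flipped
```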
In this embodiment, the data preprocessing performed on the video may specifically include:
when two-person interactive action recognition is performed, clipping and dividing the video of the two-person interactive action into two action videos each containing only a single person.
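A sketch of this splitting step follows, assuming per-frame person bounding boxes are already available (for example from any off-the-shelf person detector; the patent does not name one). The helper and its box format are hypothetical.

```python
import cv2  # used only for resizing the per-person crops
import numpy as np

def split_two_person_clip(frames, boxes_a, boxes_b, out_hw=(224, 224)):
    """Cut a two-person interaction clip into two single-person clips.

    frames:  array of shape (T, H, W, C) for the whole video.
    boxes_a, boxes_b: per-frame (x1, y1, x2, y2) boxes, one per person.
    Returns two (T, out_h, out_w, C) arrays, one clip per person.
    """
    clips = ([], [])
    for frame, ba, bb in zip(frames, boxes_a, boxes_b):
        for box, clip in zip((ba, bb), clips):
            x1, y1, x2, y2 = (int(v) for v in box)
            crop = frame[y1:y2, x1:x2]
            # Resize every crop to a fixed size so the clip can be batched.
            clip.append(cv2.resize(crop, (out_hw[1], out_hw[0])))
    return np.stack(clips[0]), np.stack(clips[1])
```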
202: extracting feature information from the preprocessed video with a parallel multi-feature fusion algorithm based on an Inception network and a ResNet deep residual network;
The Inception network improves network performance while reducing the number of network parameters by introducing feature receptive fields of different scales; the ResNet network addresses the network degradation caused by increasing network depth, thereby achieving higher classification accuracy.
In this embodiment, this step may specifically include:
when two-person interactive action recognition is performed, extracting feature information separately from the whole video and from the segmented single-person videos using the parallel multi-feature fusion network algorithm.
Detailed individual action features can be extracted from the segmented single-person videos, while the whole video allows feature information such as the relative position and orientation of the two persons to be learned.
203: during feature information extraction, adopting a Gaussian-model down-sampling method that fuses time-stage characteristics, assigning different sampling intervals to the human bodies at different time stages of the interactive action, and removing redundant information;
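One way to realize such stage-aware down-sampling is sketched below: frame-sampling density follows a Gaussian over time, so the action-execution stage (where, as noted in the background, inter-class feature differences are largest) is sampled densely while the start and end stages are sampled sparsely. Centering the Gaussian at the middle of the clip and the sigma_ratio value are illustrative assumptions, not parameters given by the patent.

```python
import numpy as np

def gaussian_keyframe_indices(num_frames, num_keyframes, sigma_ratio=0.25):
    """Select key-frame indices with Gaussian-weighted sampling density."""
    t = np.arange(num_frames)
    mu = (num_frames - 1) / 2.0            # assume execution stage is mid-clip
    sigma = sigma_ratio * num_frames
    weights = np.exp(-0.5 * ((t - mu) / sigma) ** 2)
    weights /= weights.sum()
    # Equal steps along the cumulative weight curve translate into small
    # frame intervals where density is high (middle) and large intervals
    # where it is low (start/end), discarding redundant near-duplicates.
    cdf = np.cumsum(weights)
    targets = np.linspace(cdf[0], cdf[-1], num_keyframes)
    return np.searchsorted(cdf, targets).clip(0, num_frames - 1)

# e.g. gaussian_keyframe_indices(120, 16) picks 16 of 120 frames,
# concentrated around the middle of the clip.
```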
204: training a deep learning model for human actions according to the extracted feature information;
205: classifying and recognizing the video to be recognized using the trained deep learning model.
In this embodiment, when two-person interactive action recognition is performed, the preliminary recognition results obtained from the whole video and from the segmented individual videos are fused at the decision level in the video classification stage, which improves action classification accuracy.
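A minimal sketch of such decision-level fusion follows, assuming each branch outputs class probabilities: the whole-video and per-person predictions are combined by weighted averaging. The fusion rule and the weights are assumptions; the patent does not fix them.

```python
import numpy as np

def decision_level_fusion(p_whole, p_person_a, p_person_b,
                          weights=(0.5, 0.25, 0.25)):
    """Fuse preliminary per-branch class probabilities at decision level.

    Each argument is a (num_classes,) probability vector from one branch.
    Returns the fused class index and the fused probability vector.
    """
    probs = np.stack([p_whole, p_person_a, p_person_b])
    fused = np.average(probs, axis=0, weights=weights)
    return int(np.argmax(fused)), fused
```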
According to the method provided by this embodiment, a video of a computer vision system is acquired and preprocessed; feature information is extracted from the preprocessed video using a parallel multi-feature fusion network algorithm; a deep learning model is trained for human actions according to the extracted feature information; and the trained deep learning model is used to classify and recognize a video to be recognized, which improves the accuracy of both action recognition and classification. Further, to address the differences in action characteristics across different time stages of a video, video key frames are obtained by a Gaussian-model down-sampling method that fuses time-stage characteristics, which removes the influence of a large amount of redundant information and further improves action recognition accuracy.
FIG. 3 is a block diagram of a human motion recognition device in a video of a computer vision system according to another embodiment of the present application. Referring to FIG. 3, the apparatus includes:
an acquisition module 301 configured to acquire a video of a computer vision system and perform data preprocessing on the video;
an extraction module 302 configured to extract feature information from the preprocessed video using a parallel multi-feature fusion network algorithm;
a training module 303 configured to train a deep learning model for human actions according to the extracted feature information;
and a recognition module 304 configured to classify and recognize a video to be recognized using the trained deep learning model.
In this embodiment, optionally, the extraction module is specifically configured to:
extract feature information from the preprocessed video with a parallel multi-feature fusion algorithm based on an Inception network and a ResNet deep residual network.
In this embodiment, optionally, the acquisition module is specifically configured to:
when two-person interactive action recognition is performed, clip and divide the video of the two-person interactive action into two action videos each containing only a single person.
In this embodiment, optionally, the extraction module is specifically configured to:
extract feature information separately from the whole video and from the segmented single-person videos using the parallel multi-feature fusion network algorithm.
In this embodiment, optionally, the extraction module is further configured to:
during feature information extraction, adopt a Gaussian-model down-sampling method that fuses time-stage characteristics, assign different sampling intervals to the human bodies at different time stages of the interactive action, and remove redundant information.
The apparatus provided in this embodiment can perform the method provided in any of the above method embodiments; the details of the process are described in the method embodiments and are not repeated here.
According to the device provided by this embodiment, a video of a computer vision system is acquired and preprocessed; feature information is extracted from the preprocessed video using a parallel multi-feature fusion network algorithm; a deep learning model is trained for human actions according to the extracted feature information; and the trained deep learning model is used to classify and recognize a video to be recognized, which improves the accuracy of both action recognition and classification. Further, to address the differences in action characteristics across different time stages of a video, video key frames are obtained by a Gaussian-model down-sampling method that fuses time-stage characteristics, which removes the influence of a large amount of redundant information and further improves action recognition accuracy.
An embodiment also provides a computing device. Referring to FIG. 4, the computing device comprises a memory 1120, a processor 1110, and a computer program stored in the memory 1120 and executable by the processor 1110; the computer program is stored in a space 1130 for program code in the memory 1120 and, when executed by the processor 1110, implements method steps 1131 for performing any of the methods according to the present application.
An embodiment of the application also provides a computer-readable storage medium. Referring to FIG. 5, the computer-readable storage medium comprises a storage unit for program code, provided with a program 1131' for performing the method steps according to the present application, which program is executed by a processor.
An embodiment of the application also provides a computer program product containing instructions which, when run on a computer, cause the computer to carry out the steps of the method according to the present application.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed by a computer, the procedures or functions described in the embodiments of the application are performed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that incorporates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state disk (SSD)), among others.
Those of skill in the art would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium such as a random access memory, a read-only memory, a flash memory, a hard disk, a solid-state disk, a magnetic tape, a floppy disk, an optical disk, or any combination thereof.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method of human action recognition in a video of a computer vision system, comprising:
acquiring a video of a computer vision system and performing data preprocessing on the video;
extracting feature information from the preprocessed video using a parallel multi-feature fusion network algorithm;
training a deep learning model for human actions according to the extracted feature information;
and classifying and recognizing a video to be recognized using the trained deep learning model.
2. The method according to claim 1, wherein extracting feature information from the preprocessed video using a parallel multi-feature fusion network algorithm comprises:
extracting feature information from the preprocessed video with a parallel multi-feature fusion algorithm based on an Inception network and a ResNet deep residual network.
3. The method of claim 1, wherein performing data preprocessing on the video comprises:
when two-person interactive action recognition is performed, clipping and dividing the video of the two-person interactive action into two action videos each containing only a single person.
4. The method according to claim 3, wherein extracting feature information from the preprocessed video using a parallel multi-feature fusion network algorithm comprises:
extracting feature information separately from the whole video and from the segmented single-person videos using the parallel multi-feature fusion network algorithm.
5. The method according to any one of claims 1-4, further comprising:
during feature information extraction, adopting a Gaussian-model down-sampling method that fuses time-stage characteristics, assigning different sampling intervals to the human bodies at different time stages of the interactive action, and removing redundant information.
6. An apparatus for human action recognition in a video of a computer vision system, comprising:
an acquisition module configured to acquire a video of a computer vision system and perform data preprocessing on the video;
an extraction module configured to extract feature information from the preprocessed video using a parallel multi-feature fusion network algorithm;
a training module configured to train a deep learning model for human actions according to the extracted feature information;
and a recognition module configured to classify and recognize a video to be recognized using the trained deep learning model.
7. The apparatus of claim 6, wherein the extraction module is specifically configured to:
extract feature information from the preprocessed video with a parallel multi-feature fusion algorithm based on an Inception network and a ResNet deep residual network.
8. The apparatus of claim 6, wherein the acquisition module is specifically configured to:
when two-person interactive action recognition is performed, clip and divide the video of the two-person interactive action into two action videos each containing only a single person.
9. The apparatus of claim 8, wherein the extraction module is specifically configured to:
extract feature information separately from the whole video and from the segmented single-person videos using the parallel multi-feature fusion network algorithm.
10. The apparatus of any one of claims 6-9, wherein the extraction module is further configured to:
during feature information extraction, adopt a Gaussian-model down-sampling method that fuses time-stage characteristics, assign different sampling intervals to the human bodies at different time stages of the interactive action, and remove redundant information.
CN201911142506.5A (priority 2019-11-20, filed 2019-11-20): Human body action recognition method and device in video of computer vision system. Published as CN110991278A (Pending).

Priority Applications (1)

Application Number: CN201911142506.5A (published as CN110991278A) · Priority Date: 2019-11-20 · Filing Date: 2019-11-20 · Title: Human body action recognition method and device in video of computer vision system

Applications Claiming Priority (1)

Application Number: CN201911142506.5A (published as CN110991278A) · Priority Date: 2019-11-20 · Filing Date: 2019-11-20 · Title: Human body action recognition method and device in video of computer vision system

Publications (1)

Publication Number: CN110991278A · Publication Date: 2020-04-10

Family

ID=70085382

Family Applications (1)

Application Number: CN201911142506.5A (Pending, published as CN110991278A) · Title: Human body action recognition method and device in video of computer vision system

Country Status (1)

Country Link
CN (1) CN110991278A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111639571A (en) * 2020-05-20 2020-09-08 浙江工商大学 Video motion recognition method based on contour convolution neural network
CN111709291A (en) * 2020-05-18 2020-09-25 杭州电子科技大学 Takeaway personnel identity identification method based on fusion information
CN111783520A (en) * 2020-05-18 2020-10-16 北京理工大学 Double-flow network-based laparoscopic surgery stage automatic identification method and device
CN112801061A (en) * 2021-04-07 2021-05-14 南京百伦斯智能科技有限公司 Posture recognition method and system
CN113887516A (en) * 2021-10-29 2022-01-04 北京邮电大学 Feature extraction system and method for human body action recognition
CN117994822A (en) * 2024-04-07 2024-05-07 南京信息工程大学 Cross-mode pedestrian re-identification method based on auxiliary mode enhancement and multi-scale feature fusion

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084202A (en) * 2019-04-29 2019-08-02 东南大学 A kind of video behavior recognition methods based on efficient Three dimensional convolution

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110084202A (en) * 2019-04-29 2019-08-02 东南大学 A kind of video behavior recognition methods based on efficient Three dimensional convolution

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
渠畅 (Qu Chang): "Research on Key Technologies of Human Action Recognition in Video Surveillance", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709291A (en) * 2020-05-18 2020-09-25 杭州电子科技大学 Takeaway personnel identity identification method based on fusion information
CN111783520A (en) * 2020-05-18 2020-10-16 北京理工大学 Double-flow network-based laparoscopic surgery stage automatic identification method and device
CN111709291B (en) * 2020-05-18 2023-05-26 杭州电子科技大学 Takeaway personnel identity recognition method based on fusion information
CN111639571A (en) * 2020-05-20 2020-09-08 浙江工商大学 Video motion recognition method based on contour convolution neural network
CN111639571B (en) * 2020-05-20 2023-05-23 浙江工商大学 Video action recognition method based on contour convolution neural network
CN112801061A (en) * 2021-04-07 2021-05-14 南京百伦斯智能科技有限公司 Posture recognition method and system
CN113887516A (en) * 2021-10-29 2022-01-04 北京邮电大学 Feature extraction system and method for human body action recognition
CN113887516B (en) * 2021-10-29 2024-05-24 北京邮电大学 Feature extraction system and method for human motion recognition
CN117994822A (en) * 2024-04-07 2024-05-07 南京信息工程大学 Cross-mode pedestrian re-identification method based on auxiliary mode enhancement and multi-scale feature fusion

Similar Documents

Publication Publication Date Title
CN109919031B (en) Human behavior recognition method based on deep neural network
CN108470332B (en) Multi-target tracking method and device
Hong et al. Multimodal GANs: Toward crossmodal hyperspectral–multispectral image segmentation
CN110991278A (en) Human body action recognition method and device in video of computer vision system
Li et al. Attentive contexts for object detection
CN107424171B (en) Block-based anti-occlusion target tracking method
Sheng et al. Siamese denoising autoencoders for joints trajectories reconstruction and robust gait recognition
CN112861575A (en) Pedestrian structuring method, device, equipment and storage medium
CN113378649A (en) Identity, position and action recognition method, system, electronic equipment and storage medium
JP2022082493A (en) Pedestrian re-identification method for random shielding recovery based on noise channel
Henrio et al. Anomaly detection in videos recorded by drones in a surveillance context
CN105844204B (en) Human behavior recognition method and device
CN111353385B (en) Pedestrian re-identification method and device based on mask alignment and attention mechanism
CN114842391A (en) Motion posture identification method and system based on video
Jin et al. Cvt-assd: convolutional vision-transformer based attentive single shot multibox detector
Chen et al. Efficient activity detection in untrimmed video with max-subgraph search
CN116416503A (en) Small sample target detection method, system and medium based on multi-mode fusion
CN113761282B (en) Video duplicate checking method and device, electronic equipment and storage medium
Sriram et al. Analytical review and study on object detection techniques in the image
Ehsan et al. An accurate violence detection framework using unsupervised spatial–temporal action translation network
CN110598540A (en) Method and system for extracting gait contour map in monitoring video
Shf et al. Review on deep based object detection
CN110956097A (en) Method and module for extracting occluded human body and method and device for scene conversion
CN115719428A (en) Face image clustering method, device, equipment and medium based on classification model
Ingale et al. Deep Learning for Crowd Image Classification for Images Captured Under Varying Climatic and Lighting Condition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200410