CN110866458A - Multi-user action detection and identification method and device based on three-dimensional convolutional neural network - Google Patents
- Publication number: CN110866458A (application number CN201911032206.1A)
- Authority
- CN
- China
- Prior art keywords: person, sequence, video, neural network, convolutional neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V 40/20: Movements or behaviour, e.g. gesture recognition (G06V 40/00, recognition of biometric, human-related or animal-related patterns in image or video data)
- G06F 18/214: Generating training patterns; bootstrap methods, e.g. bagging or boosting (G06F 18/21, design or setup of recognition systems or techniques)
- G06N 3/045: Combinations of networks (G06N 3/04, neural network architecture)
- G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items (G06V 20/40, scene-specific elements in video content)
- G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames (G06V 20/40, scene-specific elements in video content)
Abstract
The application discloses a multi-person action detection and identification method and device based on a three-dimensional convolutional neural network. The method comprises: preprocessing the videos of a training data set to obtain a body-action sequence for each person; and inputting each extracted body-action sequence into a three-dimensional convolutional neural network model for action detection and identification, wherein the model comprises two 3D convolution-pooling units, two convolutional layers, two max-pooling layers, one flatten layer, two fully connected layers, and an output layer. The device comprises a preprocessing module and a detection and identification module. The method and the device can be applied to multi-person action detection, and the three-dimensional convolutional neural network model shows good generalization capability and higher identification accuracy when applied to different types of video data.
Description
Technical Field
The application relates to the field of video processing, and in particular to a multi-person action detection and identification method and device based on a three-dimensional convolutional neural network.
Background
Human motion detection and recognition is a specific application of video processing that aims to recognize human actions and activities from the behavior of subjects and their surrounding environment, and it is a hot problem in the field of computer vision. Human motion detection is an important task in many applications, such as video surveillance systems, video retrieval, and multimedia applications involving human motion. Human action recognition methods can be divided into two broad categories: detection-and-recognition methods and classification methods. Detection-and-recognition methods first detect a person's motion and then recognize the action. These methods are validated on conventional surveillance data sets such as KTH, Weizmann, IXMAS, UCF-ARG, and PETS. These data sets are recorded under controlled conditions, so that each individual is in the best position relative to the camera, with a simple static background and similar lighting variations. Classification methods, in contrast, classify videos directly according to the actions they contain. These methods are evaluated on newer data sets consisting of videos collected from the web (for example from YouTube) or recorded with mobile cameras under practical conditions, with complex backgrounds and uncontrolled lighting; examples include Hollywood, Hollywood2, UCF Sports, UCF50, UCF101, HMDB51, and HMDB. Such data sets explore the diversity of video content (human body size, changes in body position, and so on), changes in camera motion, and the varied backgrounds of such recordings. To improve human action recognition performance, most recent studies adopt deep learning models of various kinds. Since human behavior is derived from multiple movements of the body or its parts, the recognition process must involve video processing in order to understand the patterns of change in visual appearance.
Many approaches feed several input streams into deep learning models to recognize human activities. For example, one line of work recognizes human behavior using CNN models combined with long short-term memory (LSTM) networks; another uses multiple features with CNN models, taking raw frames, optical flow, and stacked motion difference images as inputs for human behavior recognition; yet another uses six-stream features for general action recognition, with inputs including the full image, a human image containing only the human body, and the optical-flow result of each of the preceding features.
The methods described above have the following drawbacks:
1. most existing methods target single-person motion detection and identification; because multi-person motion detection is considerably more complex, methods aimed at it are scarce;
2. existing multi-person motion detection and identification methods often suffer from low identification accuracy because of the complexity and diversity of multi-person scenes;
3. existing methods are usually developed for one specific type of video data set, for example tested only on conventional surveillance data sets or only on video data sets collected from the web; few models can be applied to both types of data sets at once, i.e., existing models generalize poorly.
Disclosure of Invention
It is an object of the present application to overcome, or at least partially solve or mitigate, the above problems.
According to one aspect of the application, a method for detecting and identifying actions of multiple persons based on a three-dimensional convolutional neural network is provided, and the method comprises the following steps:
preprocessing a video of a training data set to obtain a sequence of body actions of each person;
and inputting the extracted body-action sequence of each person into a three-dimensional convolutional neural network model for action detection and identification, wherein the three-dimensional convolutional neural network model comprises two 3D convolution-pooling units, two convolutional layers, two max-pooling layers, one flatten layer, two fully connected layers, and an output layer.
Optionally, the preprocessing the video of the training data set includes:
separating the human body from the video of the training data set;
and extracting a sequence of the body action of each person from the video after the human body is separated.
Optionally, the separating the human body from the video of the training data set includes:
determining a background frame using a background estimation method that takes the sum of absolute differences (SAD) between every two consecutive images as an initialization value and uses the entropy of each block;
and computing the difference between each frame and the background frame to obtain an absolute difference image, then performing structure-texture decomposition on the absolute difference image to obtain the region of the moving object, thereby completing human body separation.
Optionally, the sequence of the body actions of each person is extracted from the human-separated video using an extended version of a tracking method based on the kernelized correlation filter.
Optionally, the preprocessing further comprises extracting motion history images from the video of the training data set;
and combining the extracted motion history image with the body-action sequence of each person and inputting them together into the three-dimensional convolutional neural network model for action detection and identification.
According to another aspect of the present application, there is provided a multi-person motion detection and recognition apparatus based on a three-dimensional convolutional neural network, comprising:
a pre-processing module configured to pre-process the video of the training data set to obtain a sequence of body movements of each person;
and a detection and identification module configured to input the extracted body-action sequence of each person into a three-dimensional convolutional neural network model for action detection and identification, wherein the three-dimensional convolutional neural network model comprises two 3D convolution-pooling units, two convolutional layers, two max-pooling layers, one flatten layer, two fully connected layers, and an output layer.
Optionally, the preprocessing module includes:
a human body separation submodule configured to separate a human body from a video of a training data set;
a sequence extraction sub-module configured to extract a sequence of body movements of each person from the video of the separated human body.
Optionally, the human body separation submodule includes:
a background frame submodule configured to determine a background frame using a background estimation method that takes the SAD between every two consecutive images as an initialization value and uses the entropy of each block;
and a structure-texture decomposition submodule configured to compute an absolute difference image against the background frame and then perform structure-texture decomposition on it to obtain the region of the moving object and complete human body separation.
Optionally, the sequence of the body actions of each person is extracted from the human-separated video using an extended version of a tracking method based on the kernelized correlation filter.
Optionally, the preprocessing module further includes:
a history image extraction sub-module configured to extract motion history images from the video of the training data set;
and in the detection and identification module, the extracted motion history image is combined with the body-action sequence of each person, and the two are jointly input into the three-dimensional convolutional neural network model for action detection and identification.
The application discloses a multi-person action detection and identification method and device based on a three-dimensional convolutional neural network. Because the training data are preprocessed, a sequence is extracted for every person, and an improved three-dimensional convolutional neural network model is adopted to detect and identify human actions, the method and the device can be applied to multi-person action detection, and the model shows good generalization capability and higher identification accuracy when applied to different types of video data.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the present application will be described in detail hereinafter by way of illustration and not limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a schematic flow chart diagram of a method for multi-user motion detection and recognition based on a three-dimensional convolutional neural network according to an embodiment of the present application;
FIG. 2 is a schematic block diagram of a multi-user motion detection and recognition apparatus based on a three-dimensional convolutional neural network according to an embodiment of the present application;
FIG. 3 is a block schematic diagram of a computing device of one embodiment of the present application;
fig. 4 is a schematic block diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
Fig. 1 is a schematic flow diagram of a method for multi-person motion detection and recognition based on a three-dimensional convolutional neural network, according to an embodiment of the present application, which may generally include:
s1, preprocessing the video of the training data set to obtain a sequence of the body actions of each person:
Before model training on a training data set, the data set needs to be preprocessed: the human body is separated from each video, and a sequence of body movements during the motion is extracted. Preprocessing is completed using a background modeling technique, specifically as follows. A background frame can be determined quickly and efficiently using the SAD (sum of absolute differences) between every two consecutive images as an initialization value, followed by a background estimation method based on the entropy of each block. To minimize the information in each frame and reduce noise and falsely modeled areas, the cvSub() function in OpenCV may be used to subtract each frame from the background frame, yielding an absolute difference image; a structure-texture decomposition of this image is then computed, and only the structural component, which contains the uniform part of the image, is used in subsequent steps. At this point the region of the moving object has been obtained, and the preprocessed data is then resized.
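The background-differencing step above can be sketched in a few lines of NumPy. This is a hedged illustration, not the patent's implementation: the background candidate is chosen here simply as the frame with the smallest SAD against its neighbour, standing in for the full SAD-plus-block-entropy estimation, and the block-entropy computation is shown separately. All function names and parameters are hypothetical.

```python
import numpy as np

def estimate_background(frames):
    """Pick a background candidate: the frame whose SAD against the next
    frame is smallest (i.e. the pair with the least motion). A simplified
    stand-in for the SAD-initialized background estimation in the text."""
    sads = [np.abs(frames[i].astype(np.int32) - frames[i + 1].astype(np.int32)).sum()
            for i in range(len(frames) - 1)]
    return frames[int(np.argmin(sads))]

def block_entropy(img, block=8):
    """Per-block Shannon entropy of a grayscale image, the second cue the
    described background evaluation method relies on."""
    h, w = img.shape
    ent = np.zeros((h // block, w // block))
    for i in range(ent.shape[0]):
        for j in range(ent.shape[1]):
            patch = img[i * block:(i + 1) * block, j * block:(j + 1) * block]
            p = np.bincount(patch.ravel(), minlength=256) / patch.size
            p = p[p > 0]
            ent[i, j] = -(p * np.log2(p)).sum()
    return ent

def abs_diff_image(frame, background):
    """Absolute difference image between a frame and the background
    (the role played by cvSub()/cvAbsDiff() in OpenCV)."""
    return np.abs(frame.astype(np.int32) - background.astype(np.int32)).astype(np.uint8)
```

Structure-texture decomposition of the resulting difference image is a separate step (e.g. total-variation based) and is not sketched here.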
When there are many moving objects or people in a scene, each person must be detected and tracked so that a sequence can be generated per person. For this purpose, an extended version of a tracking method based on the kernelized correlation filter (KCF) is used, and a sequence representing each person's body actions is extracted from the tracking results. However, these sequences, being RGB video, may contain redundant information such as the static background; for this reason, motion history images (MHI) are extracted from the videos, combined with the sequences, and jointly fed into a three-dimensional convolutional neural network model (3DCNN) for training. Using MHI improves recognition accuracy and reduces recognition time.
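The motion history image idea can be sketched as follows; this is a minimal NumPy illustration under stated assumptions, not the patent's exact MHI formulation. Each MHI pixel records the most recent timestamp at which motion was detected there, and stale entries are cleared after a fixed duration (cv2.motempl.updateMotionHistory in opencv-contrib behaves similarly). The frame-differencing motion mask and all names and thresholds are illustrative.

```python
import numpy as np

def update_mhi(mhi, motion_mask, timestamp, duration):
    """One MHI update step: moving pixels take the current timestamp;
    entries older than `duration` are zeroed."""
    mhi = mhi.copy()
    mhi[motion_mask] = timestamp
    mhi[(~motion_mask) & (mhi < timestamp - duration)] = 0
    return mhi

def mhi_from_frames(frames, thresh=30, duration=5.0):
    """Build an MHI from grayscale frames by thresholding consecutive
    absolute differences to obtain the motion mask at each step."""
    mhi = np.zeros(frames[0].shape, dtype=np.float32)
    for t in range(1, len(frames)):
        diff = np.abs(frames[t].astype(np.int32) - frames[t - 1].astype(np.int32))
        mhi = update_mhi(mhi, diff > thresh, float(t), duration)
    return mhi
```

Recent motion thus appears bright and older motion fades, which is why an MHI compactly summarizes where and how recently each person moved.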
S2, inputting the extracted sequence of each person's body actions into a three-dimensional convolutional neural network model for action detection and recognition, wherein the three-dimensional convolutional neural network model comprises two 3D convolution-pooling units, two convolutional layers, two max-pooling layers, one flatten layer, two fully connected layers, and an output layer:
A three-dimensional convolutional neural network (3DCNN) is a supervised learning model with a multi-level deep network that can learn many invariant features from an input video; convolution and pooling are its main building blocks. In this embodiment, the 3DCNN model architecture comprises two 3D convolution-pooling units, two convolutional layers, two max-pooling layers, one flatten layer, two fully connected layers, and an output layer; the output layer contains ten neurons, corresponding to the number of action classes, and the activation function is the ReLU function. The MHI obtained in the previous step is combined with each person's sequence, and the two are jointly fed into the three-dimensional convolutional neural network model for training.
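The two building blocks named above, 3D convolution and 3D max pooling, can be illustrated with a minimal NumPy forward pass. This is a sketch of the operations only, not the embodiment's network (which would normally be built with a deep learning framework such as Keras Conv3D/MaxPooling3D layers); shapes and names are assumptions.

```python
import numpy as np

def conv3d(volume, kernel):
    """Valid 3D convolution (cross-correlation, as in CNNs) of a
    (T, H, W) volume with a (t, h, w) kernel."""
    T, H, W = volume.shape
    t, h, w = kernel.shape
    out = np.zeros((T - t + 1, H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = (volume[i:i+t, j:j+h, k:k+w] * kernel).sum()
    return out

def max_pool3d(volume, size=2):
    """Non-overlapping 3D max pooling; any trailing remainder is dropped."""
    T, H, W = (d // size for d in volume.shape)
    v = volume[:T*size, :H*size, :W*size]
    return v.reshape(T, size, H, size, W, size).max(axis=(1, 3, 5))

def relu(x):
    """ReLU activation, used throughout the embodiment's model."""
    return np.maximum(x, 0.0)
```

Stacking conv3d, relu, and max_pool3d twice, flattening, and applying two dense layers plus a ten-way output mirrors the layer sequence the embodiment describes.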
The method described in this embodiment uses the KTH, Weizmann, and UCF-ARG data sets, three conventional surveillance data sets, for model training. Model testing uses the PETS, UCF101, Hollywood, and MHAD data sets, that is, one conventional surveillance data set and three video data sets collected from the web; the test data are fed directly into the trained 3DCNN for multi-person detection and recognition, without any preprocessing. The resolution of the data set was 32 × 32, the temporal depth was 10, the batch size during training was 128, and the learning rate was 1. Tests on these different types of data sets show that, compared with the prior art, the generalization capability and accuracy of the 3DCNN model of this embodiment on different data sets are improved.
Fig. 2 is a schematic block diagram of a multi-person motion detection and recognition apparatus based on a three-dimensional convolutional neural network according to an embodiment of the present application, which may generally include:
a pre-processing module 1 configured to pre-process the video of the training data set, obtaining a sequence of body movements of each person:
Before model training on a training data set, the data set needs to be preprocessed: the human body is separated from each video, and a sequence of body movements during the motion is extracted. Preprocessing is completed using a background modeling technique, specifically as follows. A background frame can be determined quickly and efficiently using the SAD (sum of absolute differences) between every two consecutive images as an initialization value, followed by a background estimation method based on the entropy of each block. To minimize the information in each frame and reduce noise and falsely modeled areas, the cvSub() function in OpenCV may be used to subtract each frame from the background frame, yielding an absolute difference image; a structure-texture decomposition of this image is then computed, and only the structural component, which contains the uniform part of the image, is used in subsequent steps. At this point the region of the moving object has been obtained, and the preprocessed data is then resized.
When there are many moving objects or people in a scene, each person must be detected and tracked so that a sequence can be generated per person. For this purpose, an extended version of a tracking method based on the kernelized correlation filter (KCF) is used, and a sequence representing each person's body actions is extracted from the tracking results. However, these sequences, being RGB video, may contain redundant information such as the static background; for this reason, motion history images (MHI) are extracted from the videos, combined with the sequences, and jointly fed into a three-dimensional convolutional neural network model (3DCNN) for training. Using MHI improves recognition accuracy and reduces recognition time.
A detection and recognition module 2 configured to input the extracted sequence of each person's body actions into a three-dimensional convolutional neural network model for action detection and recognition, the three-dimensional convolutional neural network model comprising two 3D convolution-pooling units, two convolutional layers, two max-pooling layers, one flatten layer, two fully connected layers, and an output layer:
A three-dimensional convolutional neural network (3DCNN) is a supervised learning model with a multi-level deep network that can learn many invariant features from an input video; convolution and pooling are its main building blocks. In this embodiment, the 3DCNN model architecture comprises two 3D convolution-pooling units, two convolutional layers, two max-pooling layers, one flatten layer, two fully connected layers, and an output layer; the output layer contains ten neurons, corresponding to the number of action classes, and the activation function is the ReLU function. The MHI obtained in the previous step is combined with each person's sequence, and the two are jointly fed into the three-dimensional convolutional neural network model for training.
The apparatus of this embodiment uses the KTH, Weizmann, and UCF-ARG data sets, three conventional surveillance data sets, for model training. Model testing uses the PETS, UCF101, Hollywood, and MHAD data sets, that is, one conventional surveillance data set and three video data sets collected from the web; the test data are fed directly into the trained 3DCNN for multi-person detection and recognition, without any preprocessing. The resolution of the data set was 32 × 32, the temporal depth was 10, the batch size during training was 128, and the learning rate was 1. Tests on these different types of data sets show that, compared with the prior art, the generalization capability and accuracy of the 3DCNN model of this embodiment on different data sets are improved.
Embodiments of the application also provide a computing device. Referring to fig. 3, the computing device comprises a memory 1120, a processor 1110, and a computer program stored in the memory 1120 and executable by the processor 1110; the computer program is stored in a space 1130 for program code in the memory 1120 and, when executed by the processor 1110, implements the method steps 1131 for performing any of the methods according to the invention.
The embodiment of the application also provides a computer-readable storage medium. Referring to fig. 4, the computer-readable storage medium comprises a storage unit for program code, in which a program 1131' for performing the method steps according to the invention is stored; the program is executed by a processor.
The embodiment of the application also provides a computer program product containing instructions which, when run on a computer, cause the computer to carry out the steps of the method according to the invention.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of the application are performed in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example from one website, computer, server, or data center to another via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium accessible by a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), among others.
Those skilled in the art will further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, the various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape (magnetic tape), a floppy disk (floppy disk), an optical disk (optical disk), and any combination thereof.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A multi-person action detection and identification method based on a three-dimensional convolutional neural network comprises the following steps:
preprocessing a video of a training data set to obtain a sequence of body actions of each person;
and inputting the extracted body-action sequence of each person into a three-dimensional convolutional neural network model for action detection and identification, wherein the three-dimensional convolutional neural network model comprises two 3D convolution-pooling units, two convolutional layers, two max-pooling layers, one flatten layer, two fully connected layers, and an output layer.
2. The method of claim 1, wherein preprocessing the video of the training data set comprises:
separating the human body from the video of the training data set;
and extracting a sequence of the body action of each person from the video after the human body is separated.
3. The method of claim 2, wherein the separating the human body from the video of the training data set comprises:
determining a background frame using a background estimation method that takes the SAD between every two consecutive images as an initialization value and uses the entropy of each block;
and computing the difference between each frame and the background frame to obtain an absolute difference image, then performing structure-texture decomposition on the absolute difference image to obtain the region of the moving object, thereby completing human body separation.
4. The method of claim 2, wherein the sequence of body actions of each person is extracted from the human-separated video using an extended version of a tracking method based on the kernelized correlation filter.
5. The method according to any one of claims 1 to 4,
the preprocessing further comprises extracting motion history images from the video of the training data set;
and combining the extracted motion history images with the sequence of body actions of each person, and inputting the combination into the three-dimensional convolutional neural network model for action detection and identification.
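A motion history image (MHI) collapses a clip's motion into a single frame whose intensity encodes how recently each pixel moved. The following is a minimal numpy sketch, not taken from the patent; the duration `tau`, decay `delta`, and difference threshold are illustrative assumptions.

```python
import numpy as np

def motion_history(frames, tau=255, delta=32, thresh=30):
    # Pixels that just moved get the maximal value tau; older motion decays
    # by delta per frame, so brighter pixels mean more recent motion.
    mhi = np.zeros(frames[0].shape, np.float32)
    for prev, cur in zip(frames, frames[1:]):
        moving = np.abs(cur.astype(np.int16) - prev.astype(np.int16)) > thresh
        mhi = np.where(moving, float(tau), np.maximum(mhi - delta, 0))
    return mhi.astype(np.uint8)

frames = [np.zeros((8, 8), np.uint8) for _ in range(3)]
frames[1][2, 2] = 255   # motion appears between frames 0 and 1
frames[2][2, 2] = 255   # the person stays put, so this motion ages
frames[2][5, 5] = 255   # fresh motion between frames 1 and 2
mhi = motion_history(frames)
print(mhi[5, 5], mhi[2, 2])   # recent motion bright, older motion dimmer
```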
6. A multi-person action detection and identification device based on a three-dimensional convolutional neural network comprises:
a pre-processing module configured to pre-process the video of the training data set to obtain a sequence of body movements of each person;
and the detection and identification module is configured to input the extracted sequence of body actions of each person into a three-dimensional convolutional neural network model for action detection and identification, wherein the three-dimensional convolutional neural network model comprises two 3D convolution-pooling units, two convolutional layers, two max-pooling layers, a flatten layer, two fully connected layers, and an output layer.
7. The apparatus of claim 6, wherein the preprocessing module comprises:
a human body separation submodule configured to separate a human body from a video of a training data set;
a sequence extraction sub-module configured to extract a sequence of body movements of each person from the video of the separated human body.
8. The apparatus of claim 7, wherein the human body separating submodule comprises:
a background frame sub-module configured to determine a background frame using a background estimation method based on the initialization value of the sum of absolute differences (SAD) between each pair of consecutive images and the entropy of each block;
and a structure-texture decomposition sub-module configured to subtract the determined background frame from the video frames to obtain an absolute difference image, and then perform structure-texture decomposition on the absolute difference image to obtain the region of the moving object, thereby completing human body separation.
9. The apparatus of claim 7, wherein the sequence of body actions of each person is extracted from the video after human body separation using an extended version of a kernelized correlation filter (KCF) based tracking method.
10. The apparatus according to any one of claims 6-9,
the preprocessing module further comprises:
a history image extraction sub-module configured to extract motion history images from the video of the training data set;
and in the detection and identification module, the extracted motion history images are combined with the sequence of body actions of each person, and the combination is input into the three-dimensional convolutional neural network model for action detection and identification.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911032206.1A CN110866458A (en) | 2019-10-28 | 2019-10-28 | Multi-user action detection and identification method and device based on three-dimensional convolutional neural network |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110866458A (en) | 2020-03-06 |
Family
ID=69653544
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911032206.1A Pending CN110866458A (en) | 2019-10-28 | 2019-10-28 | Multi-user action detection and identification method and device based on three-dimensional convolutional neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110866458A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109919011A (en) * | 2019-01-28 | 2019-06-21 | 浙江工业大学 | A kind of action video recognition methods based on more duration informations |
Non-Patent Citations (2)
Title |
---|
NOOR ALMAADEED等: ""A Novel Approach for Robust Multi Human Action Detection and Recognition based on 3-Dimentional Convolutional Neural Networks"", 《PATTERN RECOGNITION LETTERS》 * |
OMAR ELHARROUSS等: ""Moving object detection zone using a block based background model"", 《IET COMPUTER VISION》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112131995A (en) * | 2020-09-16 | 2020-12-25 | 北京影谱科技股份有限公司 | Action classification method and device, computing equipment and storage medium |
CN112530144A (en) * | 2020-11-06 | 2021-03-19 | 华能国际电力股份有限公司上海石洞口第一电厂 | Method and system for warning violation behaviors of thermal power plant based on neural network |
CN112530144B (en) * | 2020-11-06 | 2022-06-28 | 华能国际电力股份有限公司上海石洞口第一电厂 | Method and system for warning violation behaviors of thermal power plant based on neural network |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Teoh et al. | Face recognition and identification using deep learning approach | |
CN109558832B (en) | Human body posture detection method, device, equipment and storage medium | |
CN112446270B (en) | Training method of pedestrian re-recognition network, pedestrian re-recognition method and device | |
Wang et al. | Hierarchical attention network for action recognition in videos | |
CN111709311B (en) | Pedestrian re-identification method based on multi-scale convolution feature fusion | |
CN109472191B (en) | Pedestrian re-identification and tracking method based on space-time context | |
CN107067413B (en) | A kind of moving target detecting method of time-space domain statistical match local feature | |
Seal et al. | Human face recognition using random forest based fusion of à-trous wavelet transform coefficients from thermal and visible images | |
Allaert et al. | Micro and macro facial expression recognition using advanced local motion patterns | |
CN113378649A (en) | Identity, position and action recognition method, system, electronic equipment and storage medium | |
CN111353399A (en) | Tamper video detection method | |
CN111814690A (en) | Target re-identification method and device and computer readable storage medium | |
CN110866458A (en) | Multi-user action detection and identification method and device based on three-dimensional convolutional neural network | |
Ali et al. | Deep Learning Algorithms for Human Fighting Action Recognition. | |
Sowmyayani et al. | Fall detection in elderly care system based on group of pictures | |
Putra | A Novel Method for Handling Partial Occlusion on Person Re-identification using Partial Siamese Network | |
Al-Dmour et al. | Masked face detection and recognition system based on deep learning algorithms | |
Ghafoor et al. | Egocentric video summarization based on people interaction using deep learning | |
Manssor et al. | TIRFaceNet: thermal IR facial recognition | |
Zhang et al. | ATMLP: Attention and Time Series MLP for Fall Detection | |
Phang et al. | Real-time multi-camera multi-person action recognition using pose estimation | |
Pushparaj et al. | Using 3D convolutional neural network in surveillance videos for recognizing human actions. | |
Wong et al. | Multi-Camera Face Detection and Recognition in Unconstrained Environment | |
Muhamad et al. | A comparative study using improved LSTM/GRU for human action recognition | |
Voronin et al. | Action recognition using the 3D dense microblock difference |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200306 |