CN112580494A - Method and device for identifying and tracking personnel in monitoring video based on deep learning - Google Patents

Method and device for identifying and tracking personnel in monitoring video based on deep learning

Info

Publication number
CN112580494A
Authority
CN
China
Prior art keywords
tracking
attributes
tracked
target picture
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011487218.6A
Other languages
Chinese (zh)
Inventor
宋波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Moviebook Technology Corp ltd
Original Assignee
Beijing Moviebook Technology Corp ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Moviebook Technology Corp ltd filed Critical Beijing Moviebook Technology Corp ltd
Priority to CN202011487218.6A
Publication of CN112580494A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Alarm Systems (AREA)

Abstract

The application discloses a method and a device for identifying and tracking people in a monitoring video based on deep learning, and relates to the field of identification and tracking. The method comprises the following steps: inputting a data set into a MobileNet model and training with a plurality of attributes of a person as feature inputs to obtain the categories corresponding to the attributes; when person identification and tracking is carried out, inputting a monitoring video and a target picture into an RCNN model, taking the object to be tracked contained in the target picture as the query condition, identifying and tracking the object in the monitoring video according to the categories obtained by the MobileNet model, and outputting a tracking result. The device comprises a training module and a tracking module. Compared with methods that train and identify on a single feature, the method greatly improves the accuracy of identification and tracking while also improving search efficiency.

Description

Method and device for identifying and tracking personnel in monitoring video based on deep learning
Technical Field
The present application relates to the field of identification and tracking, and in particular, to a method and an apparatus for identifying and tracking people in a surveillance video based on deep learning.
Background
Currently, a variety of methods are available for tracking and identifying people in a video. Jing et al. used a constrained optimization process, called deformed outline, to locate glasses on a person in a given set of images: a three-step procedure in which an edge map generator filters out unnecessary edge points and the shape and position of the glasses are then optimized. Lorenzo et al. used the Histogram of Oriented Gradients (HOG) and Local Binary Patterns (LBP) to categorize different attributes of a garment, such as the length of the sleeves, the color of the fabric, the pattern on the fabric, and the presence or absence of a collar. Seo et al. used a CNN-based architecture to classify garment images.
Most of the existing methods train on a single feature of a person and detect the person based on that single feature, such as detection based only on clothing style or only on glasses, and detection based on a single feature suffers from low efficiency and accuracy.
Disclosure of Invention
It is an object of the present application to overcome, or at least partially solve or mitigate, the above problems.
According to one aspect of the application, a method for identifying and tracking people in a monitoring video based on deep learning is provided, and comprises the following steps:
inputting a data set into a MobileNet model, and training with a plurality of attributes of a person as feature inputs to obtain the categories corresponding to the attributes;
when person identification and tracking is carried out, a monitoring video and a target picture are input into an RCNN (Region-CNN) model, the object to be tracked contained in the target picture is taken as the query condition, the object in the monitoring video is identified and tracked according to the categories obtained by the MobileNet model, and a tracking result is output.
Optionally, training with a plurality of attributes of the person as feature inputs to obtain the categories corresponding to the plurality of attributes comprises:
performing training with gender, shirt style, and eye state as feature inputs to obtain eight categories corresponding to these three attributes.
Optionally, identifying and tracking the object in the monitoring video according to the category obtained by the MobileNet model by using the object to be tracked included in the target picture as the query condition comprises:
taking the object to be tracked contained in the target picture as the query condition, extracting features from the monitoring video with a CNN (Convolutional Neural Network) according to the category obtained by the MobileNet model, sending the extracted features into an SVM (Support Vector Machine) classifier for discrimination, and then correcting the bounding box with a regressor.
Optionally, the method further comprises:
the RCNN model uses a selective search algorithm to view video frames through windows of different sizes, and each window size locates the position of the object to be tracked based on texture, color, and intensity.
Optionally, outputting the tracking result comprises:
outputting the tracking result as a region-of-interest bounding box (RoI bbox).
According to another aspect of the present application, there is provided an apparatus for identifying and tracking people in surveillance video based on deep learning, including:
a training module configured to input the data set into a MobileNet model, train with a plurality of attributes of a person as feature inputs, and obtain the categories corresponding to the plurality of attributes;
and a tracking module configured to, when person identification and tracking is performed, input a monitoring video and a target picture into the RCNN model, take the object to be tracked contained in the target picture as the query condition, identify and track the object in the monitoring video according to the categories obtained by the MobileNet model, and output a tracking result.
Optionally, the training module is specifically configured to:
perform training with gender, shirt style, and eye state as feature inputs to obtain eight categories corresponding to these three attributes.
Optionally, the tracking module is specifically configured to:
take the object to be tracked contained in the target picture as the query condition, use a CNN to extract features from the surveillance video according to the category obtained by the MobileNet model, send the extracted features to an SVM classifier for discrimination, and then correct the bounding box with a regressor.
Optionally, the tracking module is specifically configured to:
the RCNN model uses a selective search algorithm to view video frames through windows of different sizes, and each window size locates the position of the object to be tracked based on texture, color, and intensity.
Optionally, the tracking module is specifically configured to:
output the tracking result as a region-of-interest bounding box (RoI bbox).
According to yet another aspect of the application, there is provided a computing device comprising a memory, a processor and a computer program stored in the memory and executable by the processor, wherein the processor implements the method as described above when executing the computer program.
According to yet another aspect of the application, a computer-readable storage medium, preferably a non-volatile readable storage medium, is provided, having stored therein a computer program which, when executed by a processor, implements a method as described above.
According to yet another aspect of the application, there is provided a computer program product comprising computer readable code which, when executed by a computer device, causes the computer device to perform the method described above.
According to the technical scheme of the application, the data set is input into the MobileNet model and a plurality of attributes of a person are used as feature inputs for training to obtain the categories corresponding to the attributes. When a person is identified and tracked, the monitoring video and the target picture are input into the RCNN model, the object to be tracked contained in the target picture is used as the query condition, the object in the monitoring video is identified and tracked according to the categories obtained by the MobileNet model, and a tracking result is output.
The above and other objects, advantages and features of the present application will become more apparent to those skilled in the art from the following detailed description of specific embodiments thereof, taken in conjunction with the accompanying drawings.
Drawings
Some specific embodiments of the present application will be described in detail hereinafter by way of illustration and not limitation with reference to the accompanying drawings. The same reference numbers in the drawings identify the same or similar elements or components. Those skilled in the art will appreciate that the drawings are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flowchart of a method for identifying and tracking people in a surveillance video based on deep learning according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for identifying and tracking people in a surveillance video based on deep learning according to another embodiment of the present application;
FIG. 3 is a block diagram of an apparatus for identifying and tracking people in surveillance video based on deep learning according to another embodiment of the present application;
FIG. 4 is a block diagram of a computing device according to another embodiment of the present application;
fig. 5 is a diagram of a computer-readable storage medium structure according to another embodiment of the present application.
Detailed Description
Fig. 1 is a flowchart of a method for identifying and tracking people in a surveillance video based on deep learning according to an embodiment of the present application. Referring to fig. 1, the method includes:
101: inputting a data set into a MobileNet model, and training with a plurality of attributes of a person as feature inputs to obtain the categories corresponding to the attributes;
102: when person identification and tracking is carried out, the monitoring video and the target picture are input into the RCNN model, the object to be tracked contained in the target picture is taken as the query condition, the object in the monitoring video is identified and tracked according to the categories obtained by the MobileNet model, and a tracking result is output.
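As an illustration only, the following Python sketch shows one way the two steps above could be wired together. The predict_category and detect_people callables are hypothetical stand-ins for the trained MobileNet classifier and the RCNN detector; neither name nor interface comes from the application.

    import cv2  # OpenCV is assumed to be available for reading the monitoring video

    def track_person(video_path, target_picture, predict_category, detect_people):
        """Sketch of steps 101-102.

        predict_category: callable mapping an image to an attribute-category id
            (standing in for the trained MobileNet model).
        detect_people: callable mapping a frame to a list of (bbox, crop) candidates
            (standing in for the RCNN model).
        """
        # Derive the query condition from the target picture.
        query_category = predict_category(target_picture)

        results = []
        capture = cv2.VideoCapture(video_path)
        frame_index = 0
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            # Keep only detections whose attribute category matches the query condition.
            for bbox, crop in detect_people(frame):
                if predict_category(crop) == query_category:
                    results.append((frame_index, bbox))
            frame_index += 1
        capture.release()
        return results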
In this embodiment, optionally, training with a plurality of attributes of the person as feature inputs to obtain the categories corresponding to the plurality of attributes includes:
performing training with gender, shirt style, and eye state as feature inputs to obtain eight categories corresponding to these three attributes.
In this embodiment, optionally, identifying and tracking the object in the surveillance video according to the category obtained by the MobileNet model by using the object to be tracked included in the target picture as the query condition includes:
taking the object to be tracked contained in the target picture as the query condition, using a CNN to extract features from the monitoring video according to the category obtained by the MobileNet model, sending the extracted features to an SVM classifier for discrimination, and then using a regressor to correct the bounding box.
In this embodiment, optionally, the method further includes:
the RCNN model uses a selective search algorithm to view video frames through windows of different sizes, and each window size locates the position of the object to be tracked based on texture, color, and intensity.
In this embodiment, optionally, outputting the tracking result includes:
outputting the tracking result as a region-of-interest bounding box (RoI bbox).
In the method provided by this embodiment, the data set is input into the MobileNet model and multiple attributes of a person are used as feature inputs for training to obtain the categories corresponding to the attributes. When a person is identified and tracked, the monitoring video and the target picture are input into the RCNN model, the object to be tracked contained in the target picture is used as the query condition, the object in the monitoring video is identified and tracked according to the categories obtained by the MobileNet model, and a tracking result is output.
Fig. 2 is a flowchart of a method for identifying and tracking people in a surveillance video based on deep learning according to another embodiment of the present application. Referring to fig. 2, the method includes:
201: collecting a data set in advance and preprocessing the data set;
the data set is preprocessed by manually labeling data and then converted into a binary system for storage, so that the training time and the size of the test data set can be effectively reduced.
202: inputting the preprocessed data set into a MobileNet model, and training with gender, shirt style, and eye state as feature inputs to obtain eight categories corresponding to the three attributes;
Here, the MobileNet model employs depthwise separable convolutions, in which a standard convolution is factorized into a depthwise convolution followed by a pointwise (1×1) convolution; this drastically reduces the number of parameters compared with a standard convolution of the same depth in the network, resulting in a lightweight deep neural network.
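A minimal PyTorch sketch of such a depthwise separable block is given below; the layer sizes and the use of batch normalization and ReLU are illustrative choices, not details from the application.

    import torch.nn as nn

    def depthwise_separable_conv(in_ch, out_ch, stride=1):
        return nn.Sequential(
            # Depthwise: one 3x3 filter per input channel (groups=in_ch).
            nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride, padding=1, groups=in_ch, bias=False),
            nn.BatchNorm2d(in_ch),
            nn.ReLU(inplace=True),
            # Pointwise: 1x1 convolution that mixes the channels.
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    # For in_ch=64, out_ch=128 the two convolutions hold 64*9 + 64*128 = 8768 weights,
    # versus 64*128*9 = 73728 weights for a standard 3x3 convolution.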
In this embodiment, the trained MobileNet model can be used together with the RCNN model on one platform: feature extraction and training are performed through the MobileNet model to obtain the categories corresponding to multiple attributes of a person, and these categories are then used as the criterion when the object is identified and tracked by the RCNN model.
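One way such an eight-category attribute classifier could be set up is sketched below with a torchvision MobileNetV2 backbone; the application only specifies "a MobileNet model", so the backbone variant, optimizer, and hyperparameters here are assumptions.

    import torch
    import torch.nn as nn
    from torchvision import models

    NUM_CATEGORIES = 8  # 2 genders x 2 shirt styles x 2 eye states

    model = models.mobilenet_v2(weights=None)
    model.classifier[1] = nn.Linear(model.last_channel, NUM_CATEGORIES)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    def train_step(images, category_labels):
        """images: (N, 3, 224, 224) tensor; category_labels: (N,) tensor in [0, 7],
        e.g. encoded as gender*4 + shirt_style*2 + eye_state."""
        optimizer.zero_grad()
        loss = criterion(model(images), category_labels)
        loss.backward()
        optimizer.step()
        return loss.item()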
203: when a person is identified and tracked, inputting a monitoring video and a target picture into the RCNN model, taking the object to be tracked contained in the target picture as the query condition, using a selective search algorithm to view video frames through windows of different sizes according to the categories obtained by the MobileNet model, locating the position of the object to be tracked with each window size based on texture, color, and intensity, extracting features from the monitoring video with a CNN, sending the extracted features into an SVM classifier for discrimination, and correcting the bounding box with a regressor;
the CNN specifically performs feature extraction and object identification on the region determined by the selective search algorithm. Usually, the determined area is multiple, for example, 2000 areas are extracted from each frame of image by using a selective search algorithm, and the like, which is not specifically limited in this embodiment.
204: the tracking result is output as a region-of-interest bounding box (RoI bbox).
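As a small illustration of step 204, the RoI bbox can be rendered on the frame with OpenCV; the colour and line width below are arbitrary choices.

    import cv2

    def draw_roi_bbox(frame, bbox):
        """Draw the region-of-interest bounding box of the tracking result on a frame."""
        x, y, w, h = (int(round(v)) for v in bbox)
        cv2.rectangle(frame, (x, y), (x + w, y + h), color=(0, 255, 0), thickness=2)
        return frame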
In the method provided by this embodiment, the data set is input into the MobileNet model and multiple attributes of a person are used as feature inputs for training to obtain the categories corresponding to the attributes. When a person is identified and tracked, the monitoring video and the target picture are input into the RCNN model, the object to be tracked contained in the target picture is used as the query condition, the object in the monitoring video is identified and tracked according to the categories obtained by the MobileNet model, and a tracking result is output.
FIG. 3 is a block diagram of an apparatus for identifying and tracking people in surveillance video based on deep learning according to another embodiment of the present application. Referring to fig. 3, the apparatus includes:
a training module 301 configured to input the data set into a MobileNet model and train with a plurality of attributes of a person as feature inputs to obtain the categories corresponding to the plurality of attributes;
and a tracking module 302 configured to, when person identification and tracking is performed, input the monitoring video and the target picture into the RCNN model, take the object to be tracked contained in the target picture as the query condition, identify and track the object in the monitoring video according to the categories obtained by the MobileNet model, and output a tracking result.
In this embodiment, optionally, the training module is specifically configured to:
perform training with gender, shirt style, and eye state as feature inputs to obtain eight categories corresponding to these three attributes.
In this embodiment, optionally, the tracking module is specifically configured to:
take the object to be tracked contained in the target picture as the query condition, use a CNN to extract features from the monitoring video according to the category obtained by the MobileNet model, send the extracted features to an SVM classifier for discrimination, and then use a regressor to correct the bounding box.
In this embodiment, optionally, the tracking module is specifically configured to:
the RCNN model uses a selective search algorithm to view video frames through windows of different sizes, and each window size locates the position of the object to be tracked based on texture, color, and intensity.
In this embodiment, optionally, the tracking module is specifically configured to:
output the tracking result as a region-of-interest bounding box (RoI bbox).
The apparatus provided in this embodiment may perform the method provided in any of the above method embodiments, and details of the process are described in the method embodiments and are not described herein again.
According to the device provided by this embodiment, the data set is input into the MobileNet model and a plurality of attributes of a person are used as feature inputs for training to obtain the categories corresponding to the attributes. When a person is identified and tracked, the monitoring video and the target picture are input into the RCNN model, the object to be tracked contained in the target picture is used as the query condition, the object in the monitoring video is identified and tracked according to the categories obtained by the MobileNet model, and a tracking result is output.
Embodiments also provide a computing device. Referring to fig. 4, the computing device comprises a memory 1120, a processor 1110, and a computer program stored in the memory 1120 and executable by the processor 1110; the computer program is stored in a space 1130 for program code in the memory 1120 and, when executed by the processor 1110, implements the method steps 1131 for performing any of the methods according to the present application.
The embodiment of the application also provides a computer-readable storage medium. Referring to fig. 5, the computer-readable storage medium comprises a storage unit for program code, the storage unit being provided with a program 1131' that, when executed by a processor, performs the steps of the method according to the present application.
The embodiment of the application also provides a computer program product containing instructions which, when run on a computer, cause the computer to carry out the steps of the method according to the present application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed by a computer, cause the computer to perform, in whole or in part, the procedures or functions described in accordance with the embodiments of the application. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by a program, and the program may be stored in a computer-readable storage medium, where the storage medium is a non-transitory medium, such as a random access memory, a read only memory, a flash memory, a hard disk, a solid state disk, a magnetic tape (magnetic tape), a floppy disk (floppy disk), an optical disk (optical disk), and any combination thereof.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for identifying and tracking people in a monitoring video based on deep learning is characterized by comprising the following steps:
inputting a data set into a MobileNet model, and training with a plurality of attributes of a person as feature inputs to obtain the categories corresponding to the attributes;
when person identification and tracking is carried out, a monitoring video and a target picture are input into an RCNN model, the object to be tracked contained in the target picture is taken as the query condition, the object in the monitoring video is identified and tracked according to the categories obtained by the MobileNet model, and a tracking result is output.
2. The method of claim 1, wherein using a plurality of attributes of a person as a feature input for training to obtain categories corresponding to the plurality of attributes comprises:
performing training with gender, shirt style, and eye state as feature inputs to obtain eight categories corresponding to these three attributes.
3. The method according to claim 1, wherein identifying and tracking the object in the surveillance video according to the category obtained by the MobileNet model by using the object to be tracked included in the target picture as a query condition comprises:
taking the object to be tracked contained in the target picture as the query condition, using a CNN to extract features from the surveillance video according to the category obtained by the MobileNet model, sending the extracted features to an SVM classifier for discrimination, and then using a regressor to correct the bounding box.
4. The method of claim 1, further comprising:
the RCNN model uses a selective search algorithm to view video frames through windows of different sizes, and each window size locates the position of the object to be tracked based on texture, color, and intensity.
5. The method of any one of claims 1-4, wherein outputting the tracking result comprises:
outputting the tracking result as a region-of-interest bounding box (RoI bbox).
6. A device for identifying and tracking people in surveillance videos based on deep learning is characterized by comprising:
a training module configured to input the data set into a MobileNet model, train with a plurality of attributes of a person as feature inputs, and obtain the categories corresponding to the plurality of attributes;
and a tracking module configured to, when person identification and tracking is performed, input a monitoring video and a target picture into the RCNN model, take the object to be tracked contained in the target picture as the query condition, identify and track the object in the monitoring video according to the categories obtained by the MobileNet model, and output a tracking result.
7. The apparatus of claim 6, wherein the training module is specifically configured to:
perform training with gender, shirt style, and eye state as feature inputs to obtain eight categories corresponding to these three attributes.
8. The apparatus of claim 6, wherein the tracking module is specifically configured to:
take the object to be tracked contained in the target picture as the query condition, use a CNN to extract features from the surveillance video according to the category obtained by the MobileNet model, send the extracted features to an SVM classifier for discrimination, and then use a regressor to correct the bounding box.
9. The apparatus of claim 6, wherein the tracking module is specifically configured to:
the RCNN model uses a selective search algorithm to view video frames through windows of different sizes, and each window size locates the position of the object to be tracked based on texture, color, and intensity.
10. The apparatus according to any one of claims 6-9, wherein the tracking module is specifically configured to:
output the tracking result as a region-of-interest bounding box (RoI bbox).
CN202011487218.6A 2020-12-16 2020-12-16 Method and device for identifying and tracking personnel in monitoring video based on deep learning Pending CN112580494A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011487218.6A CN112580494A (en) 2020-12-16 2020-12-16 Method and device for identifying and tracking personnel in monitoring video based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011487218.6A CN112580494A (en) 2020-12-16 2020-12-16 Method and device for identifying and tracking personnel in monitoring video based on deep learning

Publications (1)

Publication Number Publication Date
CN112580494A (en) 2021-03-30

Family

ID=75135519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011487218.6A Pending CN112580494A (en) 2020-12-16 2020-12-16 Method and device for identifying and tracking personnel in monitoring video based on deep learning

Country Status (1)

Country Link
CN (1) CN112580494A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130343642A1 (en) * 2012-06-21 2013-12-26 Siemens Corporation Machine-learnt person re-identification
CN106295701A (en) * 2016-08-11 2017-01-04 五八同城信息技术有限公司 user identification method and device
CN108334849A (en) * 2018-01-31 2018-07-27 中山大学 A kind of recognition methods again of the pedestrian based on Riemann manifold
CN108921051A (en) * 2018-06-15 2018-11-30 清华大学 Pedestrian's Attribute Recognition network and technology based on Recognition with Recurrent Neural Network attention model
US20200184256A1 (en) * 2018-12-10 2020-06-11 Canon Kabushiki Kaisha Method, system and apparatus for performing re-identification in images
CN109919977A (en) * 2019-02-26 2019-06-21 鹍骐科技(北京)股份有限公司 A kind of video motion personage tracking and personal identification method based on temporal characteristics
CN110222625A (en) * 2019-06-03 2019-09-10 上海眼控科技股份有限公司 A kind of method and system that the identifying rows in video monitoring image are humanized
CN111339975A (en) * 2020-03-03 2020-06-26 华东理工大学 Target detection, identification and tracking method based on central scale prediction and twin neural network
CN112069908A (en) * 2020-08-11 2020-12-11 西安理工大学 Pedestrian re-identification method based on co-occurrence attribute

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
JINGHAO LUO et al.: "Learning What and Where from Attributes to Improve Person Re-Identification", 2019 IEEE International Conference on Image Processing (ICIP), 26 August 2019 (2019-08-26), pages 165-169 *
WATSON, G. et al.: "Person re-identification combining deep features and attribute detection", Multimedia Tools and Applications, vol. 79, 16 December 2019 (2019-12-16), page 6463, XP037057866, DOI: 10.1007/s11042-019-08499-9 *
LI, Hao et al.: "Cross-modal person re-identification framework based on improved hard triplet loss", Computer Science, vol. 47, no. 10, 15 October 2020 (2020-10-15), pages 180-186 *
WANG, Fenhua et al.: "Person re-identification based on multi-scale and attention fusion learning", Journal of Electronics & Information Technology, vol. 42, no. 12, 22 July 2020 (2020-07-22), pages 3045-3052 *

Similar Documents

Publication Publication Date Title
CN107895367B (en) Bone age identification method and system and electronic equipment
CN103793697B (en) The identity mask method and face personal identification method of a kind of facial image
CN110705405B (en) Target labeling method and device
US9460518B2 (en) Visual clothing retrieval
CN108288051B (en) Pedestrian re-recognition model training method and device, electronic equipment and storage medium
CN103824053B (en) The sex mask method and face gender detection method of a kind of facial image
CN111161311A (en) Visual multi-target tracking method and device based on deep learning
CN108171207A (en) Face identification method and device based on video sequence
CN109446889B (en) Object tracking method and device based on twin matching network
CN110427859A (en) A kind of method for detecting human face, device, electronic equipment and storage medium
CN109284729A (en) Method, apparatus and medium based on video acquisition human face recognition model training data
CN104751108A (en) Face image recognition device and face image recognition method
Wang et al. Scene text detection and tracking in video with background cues
CN108229289B (en) Target retrieval method and device and electronic equipment
CN108494778A (en) Identity identifying method and device
CN106663196A (en) Computerized prominent person recognition in videos
Ye et al. Scene text detection via integrated discrimination of component appearance and consensus
CN112052746A (en) Target detection method and device, electronic equipment and readable storage medium
CN111753602A (en) Motion recognition method and device, electronic equipment and storage medium
CN112991280B (en) Visual detection method, visual detection system and electronic equipment
CN110443181A (en) Face identification method and device
JP2016194858A (en) Image processing apparatus, image processing method, and image processing system
CN113780145A (en) Sperm morphology detection method, sperm morphology detection device, computer equipment and storage medium
Vargas-Muñoz et al. Interactive coconut tree annotation using feature space projections
Kanna et al. Deep learning based video analytics for person tracking

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination