CN110598592A - Intelligent real-time video monitoring method suitable for nursing places - Google Patents

Intelligent real-time video monitoring method suitable for nursing places

Info

Publication number
CN110598592A
CN110598592A (application CN201910805510.9A)
Authority
CN
China
Prior art keywords
detection
detection frame
frame
network
video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910805510.9A
Other languages
Chinese (zh)
Inventor
袁贤
彭富明
孙瑜
陈阳阳
杨涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Tech University
Original Assignee
Nanjing Tech University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Tech University
Priority to CN201910805510.9A
Publication of CN110598592A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 Classification techniques based on distances to training or reference patterns
    • G06F 18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)
  • Closed-Circuit Television Systems (AREA)

Abstract

The invention provides an intelligent real-time video monitoring method suitable for nursing places, which comprises the following steps: inputting the image sequence into a designed detection network; multi-model fusion of the detection network; context-based non-main category suppression; attribute identification of the person object; and tracking of the person object of interest.

Description

Intelligent real-time video monitoring method suitable for nursing places
Technical Field
The invention relates to an intelligent monitoring technology for nursing institutions, in particular to an intelligent real-time video monitoring method suitable for nursing places.
Background
The core technologies of intelligent video monitoring are target identification and tracking. Most current deep-learning-based tracking is built on detection. Research on deep-learning target detection covers related fields such as target detection from single-frame images, target detection from video, and target detection from three-dimensional point clouds. With the development of deep learning, convolutional neural networks have been applied to image-based target detection mainly on single-frame images, while video-based target detection has gradually attracted attention; by exploiting the temporal information of the video and the relevance of its context, end-to-end object detection can be realized. The difficulty is that changes in viewing angle, illumination and object deformation can cause detection in the image to fail, and video in particular often suffers from motion blur and low resolution.
After a moving object is detected, in the application background of a medical environment it is further necessary to identify whether the object is a patient and to acquire the patient's identity information. At present, target recognition can rely on traditional hand-crafted features or on abstract depth features obtained through a convolutional neural network, and can reach very high accuracy. Target identification is realized by classification methods, such as classification based on feature distance, classification based on statistical learning, and classification based on logistic regression. Classification methods are numerous and mature; the main difficulty is to design a suitable combination of feature extraction and classification.
Disclosure of Invention
The invention aims to provide an intelligent real-time video monitoring method suitable for nursing places.
The technical scheme for realizing the purpose of the invention is as follows: an intelligent real-time video monitoring method suitable for nursing places, comprising the following steps:
inputting the image sequence into a designed detection network;
multi-model fusion of the detection network;
context-based non-main category suppression;
attribute identification of the person object;
tracking of the person object of interest.
Further, the detection networks include a ResNet residual network and a GoogLeNet network.
Further, the multi-model fusion of the detection network takes the union of the detection results of the multiple models.
Further, the non-main categories of the multi-model fused image are suppressed by using target window statistics over the multi-frame information of the video sequence, specifically:
step S301, let A be the set of all bounding boxes over multiple frames of the video;
step S302, sum the detection scores of each bounding box in A for each category to obtain the total score of each category over the video frames;
step S303, sort the object categories from high to low by total score and set a threshold; if an object's category score is below the threshold, the object is judged to be of a non-main category, its bounding box is deleted, and the object does not undergo person-object attribute identification.
Further, the attribute identification of the person object specifically comprises:
inputting the image with the non-main categories suppressed into a VGG-16 convolutional neural network;
fusing the information of the feature maps of convolutional layers 1, 3 and 5 of the VGG-16 network;
after the fused depth features are obtained, calculating the Euclidean distance between a frame of the monitoring video and the templates in a library by a classification method based on feature distance;
if the template image with the minimum Euclidean distance to the bounding box is found and that distance is below a preset threshold, judging that the target object in the detection frame is the corresponding object in the library, assigning the attribute entry of the template object to the detection frame as semantic information, and marking the detection frame as a patient detection frame;
if the distance is larger than the preset threshold, judging that the target object in the detection frame is not that template object in the library;
and if the Euclidean distances between the image in the detection frame and all the images in the library exceed the threshold, judging that the object in the detection frame does not belong to the patient group and abandoning subsequent tracking of that detection frame.
Further, a Deep-SORT tracking algorithm is adopted to track the patient detection frame.
Compared with the prior art, the invention has the following advantages: (1) by adopting multi-model fusion and context-based non-main category suppression, the accuracy of moving-target detection can be improved; (2) features of the detected moving-target image are extracted under a VGG-16 framework, multi-layer features are fused, and similarity measurement with the fused depth features determines whether the target in a detection frame is a patient; this improves the accuracy of person identification, and because person attribute identification is introduced, more reference information is provided for monitoring and nursing staff than in other application scenarios where the moving target is merely outlined by a bounding box.
The invention is further described below with reference to the accompanying drawings.
Drawings
Fig. 1 is a schematic diagram of a framework flow of a target identification and tracking method provided by the present invention.
Fig. 2 is a schematic diagram of the multi-model fusion of two SSD-based detectors provided by the present invention.
Fig. 3 is a schematic diagram of a framework flow of the target attribute identification method provided by the present invention.
Detailed Description
The framework flow of the target recognition and tracking method is shown in fig. 1. A video image sequence is obtained by an ordinary camera installed at a suitable position in the nursing place and is input, respectively, into an improved SSD framework with a ResNet residual network and an improved SSD framework with a GoogLeNet network, as shown in fig. 2. Different detectors respond to the single-frame image, extract different features, and produce different detection frame (bounding box) results for an object. The improvement lies in the SSD detection framework: the original SSD classifies with a VGG network, whereas here a ResNet residual network and a GoogLeNet network are used on the SSD framework.
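To make the two-detector stage concrete, the following is a minimal sketch in Python/PyTorch. The patent's detectors are improved SSD frameworks with ResNet and GoogLeNet backbones; since those exact models cannot be reconstructed from the text, off-the-shelf torchvision detectors are used here purely as stand-ins, and the function name `detect` is illustrative.

```python
# Sketch only: two independent detectors run on the same frame and each
# returns its own set of bounding boxes. Stand-in torchvision models are
# used in place of the patent's ResNet-SSD and GoogLeNet-SSD.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

detector_a = torchvision.models.detection.ssd300_vgg16(weights="DEFAULT").eval()
detector_b = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def detect(model, frame_rgb):
    """Run one detector on a single RGB frame (H x W x 3 uint8 array)."""
    out = model([to_tensor(frame_rgb)])[0]
    # Each result: boxes (N x 4, xyxy), labels (N), scores (N)
    return out["boxes"].numpy(), out["labels"].numpy(), out["scores"].numpy()

# boxes_a, labels_a, scores_a = detect(detector_a, frame)
# boxes_b, labels_b, scores_b = detect(detector_b, frame)
```

Each detector independently returns its own boxes, labels and scores for the same frame; these two result sets are what the fusion step described next merges.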
The two different sets of detection frames obtained by the two detection methods in the previous step then undergo multi-model fusion, as shown in fig. 2; the core of multi-model fusion is the strategy chosen for fusing the multi-modal information. In the application context of a nursing place, the results of the multi-model detection are merged as a union in order to reduce the missed-detection rate. To make full use of the detection frame information, detection repositioning is performed, i.e. the position information of the detection frames is re-corrected according to the overlap of the bounding boxes and their scores. The SSD detection frames based on the residual network (denoted A) and the SSD detection frames based on the GoogLeNet network (denoted B) are integrated so that they jointly predict the detection result, and the detections of A and B are collected together. For example, if A detects that patient 1 appears in region X of the picture while B detects no patient, the final result is that patient 1 appears in region X; if A and B both detect patient 1, but in regions X and Y respectively, the final result is that the patient appears in the region X ∪ Y.
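A rough sketch of the union fusion and detection repositioning follows (plain NumPy). It assumes that "re-correcting the position information according to the overlap of the bounding boxes and the scores" means merging same-class boxes with high IoU into a single score-weighted box; the IoU threshold and helper names are illustrative.

```python
import numpy as np

def iou(box, boxes):
    """IoU of one xyxy box against an array of xyxy boxes."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-9)

def fuse_union(boxes_a, scores_a, labels_a, boxes_b, scores_b, labels_b, iou_thr=0.5):
    """Union of two detectors' outputs; overlapping same-class boxes are
    merged into one score-weighted box (detection repositioning)."""
    boxes = np.vstack([boxes_a, boxes_b])
    scores = np.concatenate([scores_a, scores_b])
    labels = np.concatenate([labels_a, labels_b])
    used = np.zeros(len(boxes), dtype=bool)
    fused = []
    for i in np.argsort(-scores):                      # highest score first
        if used[i]:
            continue
        same = (labels == labels[i]) & ~used
        same &= iou(boxes[i], boxes) >= iou_thr
        idx = np.where(same)[0]
        w = scores[idx][:, None]
        merged_box = (boxes[idx] * w).sum(0) / w.sum() # score-weighted average
        fused.append((merged_box, labels[i], scores[idx].max()))
        used[idx] = True
    return fused
```

Because the result is a union, a box detected by only one of A and B is still kept, which matches the goal of reducing the missed-detection rate.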
The corrected detection frames obtained in the previous step are then optimized to eliminate interference in a complex environment. Making full use of the information along the video-stream time sequence, context-based non-main category suppression is designed and used: all detection frames detected over multiple frames of the video are statistically analysed, non-main category objects with low scores are suppressed, and the detection result is further optimized. Specifically (a sketch follows these steps):
step S301, let A be the set of all bounding boxes over multiple frames of the video; when a moving object is detected, a rectangular frame surrounds it; this rectangular frame is the bounding box and delimits a local area of the detection frame, and only this local area is used for classification or regression, which reduces the amount of computation;
step S302, sum the detection scores of each bounding box in A for each category to obtain the total score of each category over the video frames;
step S303, sort the object categories from high to low by total score; categories whose total score is far lower than that of the main categories are suppressed, and the bounding boxes of those non-main categories are removed.
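A minimal sketch of steps S301 to S303, assuming each frame's detections are available as (label, score, box) tuples. The patent only states that a threshold is set; here the threshold is taken, as an assumption, to be a fraction of the top category's total score.

```python
from collections import defaultdict

def suppress_non_main_categories(frames_dets, score_ratio=0.1):
    """frames_dets: list over video frames, each a list of (label, score, box).
    Sums detection scores per category over all frames (the set A of bounding
    boxes), ranks categories by total score, and drops boxes whose category
    total is far below that of the leading (main) categories."""
    totals = defaultdict(float)
    for dets in frames_dets:
        for label, score, _ in dets:
            totals[label] += score                     # step S302: per-category total
    if not totals:
        return frames_dets
    main_total = max(totals.values())                  # step S303: rank by total
    keep = {c for c, s in totals.items() if s >= score_ratio * main_total}
    return [[d for d in dets if d[0] in keep] for dets in frames_dets]
```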
For the detection result obtained in the previous step, the identity information of the detected target is further required, as shown in fig. 3. A VGG-16 convolutional neural network is used to extract features: besides the feature map of the top convolutional layer, the larger feature map of an earlier layer is also used, and the last-layer feature map is upsampled and then fused with the earlier-layer feature map to improve spatial resolution; the information of the feature maps of convolutional layers 1, 3 and 5 of VGG-16 is used for the fusion. After the fused depth features are obtained, the Euclidean distance between the detection frame and each template in the library is calculated by a classification method based on feature distance. If, after traversing the templates, the template image with the minimum Euclidean distance to the detection frame is found and that distance is below a certain threshold, the target object in the detection frame is judged to be the corresponding object in the library, the attribute entry of the template object is assigned to the detection frame as semantic information, and the detection frame is marked as a patient detection frame. If the distance is larger than the threshold, the target object in the detection frame is judged not to be that template object; if the Euclidean distances between the image in the detection frame and all the images in the library exceed the threshold, the object in the detection frame is judged not to belong to the patient group, and subsequent tracking of the detection frame is abandoned. For example: suppose three patients A, B and C are in the template library and the threshold is set to 10; if the Euclidean distances between the actually obtained image and A, B and C are 40 (greater than 10, so not A), 56 (greater than 10, so not B) and 78 (greater than 10, so not C), then since all the distances (40, 56, 78) exceed 10, the monitored target is judged not to be a patient, is directly discarded, and is not tracked subsequently.
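The multi-layer feature fusion and template matching can be sketched roughly as below (Python/PyTorch). Which torchvision VGG-16 layers correspond to the patent's "convolution layers 1, 3 and 5" is an interpretation (here the outputs of convolutional blocks 1, 3 and 5); the global average pooling to a descriptor, the omitted ImageNet input normalization, and the distance threshold are likewise assumptions, and the threshold must be calibrated on the template library.

```python
import torch
import torch.nn.functional as F
import torchvision

vgg = torchvision.models.vgg16(weights="DEFAULT").features.eval()
# Indices in vgg16.features at which convolutional blocks 1, 3 and 5 end
# (assumed reading of "convolution layers 1, 3 and 5").
TAPS = {4: "block1", 16: "block3", 30: "block5"}

@torch.no_grad()
def fused_descriptor(img_chw):
    """Fuse shallow and deep VGG-16 feature maps into one descriptor."""
    feats, x = {}, img_chw.unsqueeze(0)
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in TAPS:
            feats[TAPS[i]] = x
    # Upsample the deeper (smaller) maps to the shallow map's size and concatenate.
    size = feats["block1"].shape[-2:]
    fused = torch.cat([F.interpolate(f, size=size, mode="bilinear", align_corners=False)
                       for f in feats.values()], dim=1)
    return fused.mean(dim=(2, 3)).squeeze(0)           # global average pool

def match_to_library(crop_chw, library, dist_thr=10.0):
    """library: dict of template name -> descriptor. Returns the matched
    patient name, or None if every Euclidean distance exceeds the threshold
    (the target is then judged not to belong to the patient group)."""
    d = fused_descriptor(crop_chw)
    dists = {name: torch.dist(d, t).item() for name, t in library.items()}
    best = min(dists, key=dists.get)
    return best if dists[best] < dist_thr else None
```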
Finally, to achieve stable tracking of the target, an efficient Deep-SORT-based tracking algorithm is adopted: the tracker compensates for the detector, the tracking algorithm transfers detection frames with higher scores to adjacent frames, and non-maximum suppression eliminates redundant detection frames, improving accuracy. This joint optimization further improves the accuracy and precision of object detection in the video. The information and detection frame of the target object are displayed in real time on a human-machine interaction interface, supplemented by measures such as an audible alarm, effectively helping the staff to nurse the patients.
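The non-maximum suppression mentioned above can be sketched as follows (plain NumPy). The Deep-SORT tracker itself, a Kalman-filter motion model combined with appearance features for data association, is not reproduced here; the IoU threshold is illustrative.

```python
import numpy as np

def nms(boxes, scores, iou_thr=0.5):
    """Standard non-maximum suppression: keep the highest-scoring box and
    drop any remaining box that overlaps it by more than iou_thr."""
    order = np.argsort(-scores)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        rest = order[1:]
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        overlap = inter / (area_i + area_r - inter + 1e-9)
        order = rest[overlap < iou_thr]
    return np.array(keep)
```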

Claims (6)

1. An intelligent real-time video monitoring method suitable for nursing places, characterized by comprising the following steps:
inputting the image sequence into a designed detection network;
multi-model fusion of the detection network;
context-based non-main category suppression;
attribute identification of the person object;
tracking of the person object of interest.
2. The method of claim 1, wherein the detection networks include a ResNet residual network and a GoogLeNet network.
3. The method of claim 1, wherein the multi-model fusion of the detection network takes the union of the results of the multi-model detection.
4. The method according to claim 1, wherein the non-main categories of the multi-model fused image are suppressed by using target window statistics over the multi-frame information of the video sequence, specifically:
step S301, let A be the set of all bounding boxes over multiple frames of the video;
step S302, sum the detection scores of each bounding box in A for each category to obtain the total score of each category over the video frames;
step S303, sort the object categories from high to low by total score and set a threshold; if an object's category score is below the threshold, the object is judged to be of a non-main category, its bounding box is deleted, and the object does not undergo person-object attribute identification.
5. The method of claim 1, wherein the attribute identification of the person object specifically comprises: inputting the image with the non-main categories suppressed into a VGG-16 convolutional neural network;
fusing the information of the feature maps of convolutional layers 1, 3 and 5 of the VGG-16 network;
after the fused depth features are obtained, calculating the Euclidean distance between a frame of the monitoring video and the templates in a library by a classification method based on feature distance;
if the template image with the minimum Euclidean distance to the bounding box is found and that distance is below a preset threshold, judging that the target object in the detection frame is the corresponding object in the library, assigning the attribute entry of the template object to the detection frame as semantic information, and marking the detection frame as a patient detection frame;
if the distance is larger than the preset threshold, judging that the target object in the detection frame is not that template object in the library;
and if the Euclidean distances between the image in the detection frame and all the images in the library exceed the threshold, judging that the object in the detection frame does not belong to the patient group and abandoning subsequent tracking of that detection frame.
6. The method of claim 5, wherein the patient detection frame is tracked using a Deep-SORT tracking algorithm.
CN201910805510.9A 2019-08-29 2019-08-29 Intelligent real-time video monitoring method suitable for nursing places Withdrawn CN110598592A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910805510.9A CN110598592A (en) 2019-08-29 2019-08-29 Intelligent real-time video monitoring method suitable for nursing places

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910805510.9A CN110598592A (en) 2019-08-29 2019-08-29 Intelligent real-time video monitoring method suitable for nursing places

Publications (1)

Publication Number Publication Date
CN110598592A 2019-12-20

Family

ID=68856179

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910805510.9A Withdrawn CN110598592A (en) 2019-08-29 2019-08-29 Intelligent real-time video monitoring method suitable for nursing places

Country Status (1)

Country Link
CN (1) CN110598592A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325279A (en) * 2020-02-26 2020-06-23 福州大学 Pedestrian and personal sensitive article tracking method fusing visual relationship
CN111325279B (en) * 2020-02-26 2022-06-10 福州大学 Pedestrian and personal sensitive article tracking method fusing visual relationship
CN118152989A (en) * 2024-05-11 2024-06-07 飞狐信息技术(天津)有限公司 Video auditing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20191220)