CN112257527A - Mobile phone detection method based on multi-target fusion and space-time video sequence

Mobile phone detection method based on multi-target fusion and space-time video sequence

Info

Publication number
CN112257527A
CN112257527A
Authority
CN
China
Prior art keywords
frame
mobile phone
video image
detection
detection method
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011079614.5A
Other languages
Chinese (zh)
Other versions
CN112257527B (en)
Inventor
龚勋 (Gong Xun)
王琛中 (Wang Chenzhong)
王立 (Wang Li)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southwest Jiaotong University
Original Assignee
Southwest Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southwest Jiaotong University filed Critical Southwest Jiaotong University
Priority to CN202011079614.5A
Publication of CN112257527A
Application granted
Publication of CN112257527B
Legal status: Active (current)
Anticipated expiration: (date not listed)

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 - Static hand or arm
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 - Reducing energy consumption in communication networks
    • Y02D30/70 - Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a mobile phone detection method based on multi-target fusion and a space-time video sequence. The method comprises: training an improved yolo model to obtain a detection model, and inputting video image frames to run the detection model to obtain a first-frame prediction; decoding the first-frame prediction, removing boxes whose score falls below a preset value, performing non-maximum suppression (NMS) with a DIoU threshold, and suppressing a mobile phone box when, in the decoded result of a frame, only the mobile phone box appears; taking the suppressed result as the target template and an input video image frame as the candidate search region, feeding both into a fully-connected Siamese network, and selecting the result with the highest score-map similarity to mark the mobile phone box in the video image frame; once the set number of frames has been tracked, repeating the above steps until the video input ends. The invention builds on a lightweight detection network from the one-stage family, finely modifies the network structure and the training and detection procedures, and achieves higher detection accuracy without reducing detection speed.

Description

Mobile phone detection method based on multi-target fusion and space-time video sequence
Technical Field
The invention relates to the technical field of image processing, and in particular to a mobile phone detection method based on multi-target fusion and space-time video sequences.
Background
Detection accuracy and speed are the core problems of object detection. To obtain more accurate results, a heavyweight, high-accuracy detection algorithm is usually chosen, which severely limits the inference speed of the system on mobile-end devices.
Chinese patent application No. 202010048048.5 discloses an intelligent monitoring method, device and readable medium for recognizing unauthorized photographing with mobile phones. An intelligent monitoring system performs machine learning on a massive collection of mobile phone appearances; a camera probe, communicating with the monitoring system in real time, is installed wherever photographing must be prevented; the camera transmits captured images to the monitoring system in real time; the system identifies whether a mobile phone is present; if so, it judges from the captured images whether the phone is being used to take pictures, and if it judges that it is, it outputs alarm information in real time so that staff can intervene promptly. That scheme performs initial detection with an algorithm using Darknet53 as the backbone and then monitors with methods such as skeleton generation and action recognition; other methods use similar algorithms for initial localization and then detect by searching from the whole image down to local regions. Owing to this design, however, such detection systems are essentially not real-time on mobile devices.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art by providing a mobile phone detection method based on multi-target fusion and space-time video sequences that remedies the deficiencies of existing detection methods.
The purpose of the invention is achieved by the following technical scheme: the mobile phone detection method based on multi-target fusion and space-time video sequences comprises the following steps:
training the improved yolo model to obtain a detection model, and inputting a video image frame to run the detection model to obtain a first-frame prediction;
decoding the first-frame prediction, removing boxes whose score falls below a preset value, performing non-maximum suppression (NMS) with a DIoU threshold, and suppressing a mobile phone box when, in the decoded result of a frame, only the mobile phone box appears;
taking the suppressed result as the target template and an input video image frame as the candidate search region, feeding both into a fully-connected Siamese network, and selecting the result with the highest score-map similarity to mark the mobile phone box in the video image frame;
if the set number of frames has been tracked, repeating the above steps until the video image input ends.
Further, if the set number of frames has not yet been tracked, the mobile phone detection method repeats the step of taking the suppressed result as the target template, inputting a video image frame as the candidate search region, feeding both into the fully-connected Siamese network, and selecting the result with the highest score-map similarity to mark the mobile phone box in the video image frame.
Further, before training the improved yolo model to obtain a detection model and inputting video image frames to run it for the first-frame prediction, the mobile phone detection method comprises the step of acquiring a training set and a test set.
Further, the step of acquiring the training set and the test set comprises: splitting the recorded video into frames, labeling the resulting pictures, extracting a subset of frames at intervals to construct a data set, and dividing the data set into a training set and a test set at a set ratio.
Further, decoding the first-frame prediction, removing boxes whose score falls below a preset value, performing NMS with a DIoU threshold, and suppressing a mobile phone box that appears alone in a decoded frame comprises:
decoding the first-frame prediction according to the formulas b_x = sigmoid(t_x) + c_x, b_y = sigmoid(t_y) + c_y, b_w = p_w · e^(t_w), b_h = p_h · e^(t_h), conf = sigmoid(raw_conf) and prob = sigmoid(raw_prob);
removing boxes whose confidence or class probability fails the score threshold of 0.4, and performing NMS with a DIoU threshold of 0.1;
and, if the decoded result of a frame contains a mobile phone box but no person box, hand box or camera box, rejecting the phone-related prediction boxes in that image, thereby suppressing the mobile phone box.
Further, the improvements to the yolo model include the following:
adding an s branch to yolov3-tiny for detecting small objects, to improve the detection of small targets such as cameras;
on the basis of the preceding structure, adding SPP (Spatial Pyramid Pooling), SAM (Spatial Attention Module) and CAM (Channel Attention Module) modules with residual connections, to strengthen feature extraction (a sketch of the SPP block follows below).
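A minimal PyTorch sketch of the SPP block named above is given below. The kernel sizes (5, 9, 13) and its placement before the detection head follow common YOLO practice and are assumptions here; the SAM and CAM attention modules and the residual connection are omitted.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial Pyramid Pooling block as commonly added to YOLO necks:
    parallel max-pools at several kernel sizes, concatenated with the
    input so the head sees multi-scale context."""
    def __init__(self, kernels=(5, 9, 13)):
        super().__init__()
        # stride 1 and padding k//2 keep the spatial size unchanged
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels)

    def forward(self, x):
        # channel count grows from C to C * (1 + len(kernels))
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)
```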
The invention has the following advantages: the mobile phone detection method based on multi-target fusion and a space-time video sequence builds on a lightweight detection network from the one-stage family and finely modifies the network structure and the training and detection procedures, achieving high detection accuracy without reducing detection speed. Detected targets are then followed with a tracking algorithm, which handles difficult samples with heavy occlusion or strong viewing-angle tilt, reduces the system's resource consumption, and greatly improves the overall inference speed on mobile devices.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as presented in the figures, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application. The invention is further described below with reference to the accompanying drawings.
As shown in FIG. 1, the invention relates to a mobile phone detection method based on multi-target fusion and a space-time video sequence, which specifically comprises the following steps:
and S1, performing framing processing on the video recorded by the camera in the actual application scene, and randomly extracting partial pictures at intervals to construct a data set. And labeling the mobile phone, the human body, the hand and the camera in each image by using LabelImg labeling software, and dividing the data set into a training set and a test set according to a certain proportion.
S2: the detection model is trained with the improved yolov3 network. The training inputs are the training-set pictures and their corresponding labels; the network outputs the predicted offsets t_x, t_y, t_w and t_h, the raw confidence, and the raw class probabilities.
Further, during training, a focal loss is used for the confidence loss. Considering that the positive/negative sample imbalance of the yolov3 network model is much lower than that of RetinaNet, α is set to 0.4. The confidence loss is computed as:
L_focal = -α_t · (1 - p_t)^γ · log(p_t)
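A sketch of this confidence loss in PyTorch. α = 0.4 follows the patent; γ = 2.0 is the common default from the focal-loss literature and is an assumption here, as are the function and argument names.

```python
import torch

def confidence_focal_loss(raw_conf, target, alpha=0.4, gamma=2.0):
    """Focal loss on objectness confidence; `target` holds 0/1 labels."""
    p = torch.sigmoid(raw_conf)                  # predicted confidence
    p_t = torch.where(target == 1, p, 1 - p)     # prob. of the true label
    alpha_t = torch.where(target == 1,
                          torch.full_like(p, alpha),
                          torch.full_like(p, 1 - alpha))
    # clamp avoids log(0) on perfectly confident predictions
    return (-alpha_t * (1 - p_t) ** gamma
            * torch.log(p_t.clamp(min=1e-7))).mean()
```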
and S3, operating the detection model to obtain a predicted value of the first frame.
S4: the prediction is decoded according to the following formulas; boxes whose confidence or class probability fails the score threshold of 0.40 are removed, and NMS is performed with a DIoU threshold of 0.1:
b_x = sigmoid(t_x) + c_x
b_y = sigmoid(t_y) + c_y
b_w = p_w · e^(t_w)
b_h = p_h · e^(t_h)
conf = sigmoid(raw_conf)
prob = sigmoid(raw_prob)
wherein b_x, b_y, b_h and b_w respectively denote the center coordinates and the height and width of the predicted box; p_h and p_w respectively denote the height and width of the prior box; t_x and t_y denote the predicted offsets of the object center from the top-left corner of its grid cell, and t_w and t_h the predicted offsets relative to the prior box; c_x and c_y denote the coordinates of the top-left corner of the grid cell; and score = conf (confidence) × prob (class probability).
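The decoding formulas and the DIoU-based NMS of step S4 can be sketched as follows in NumPy. Only the two thresholds (score 0.4, DIoU 0.1) come from the patent; the greedy suppression loop and all function and argument names are illustrative assumptions.

```python
import numpy as np

def decode(t_xy, t_wh, raw_conf, raw_prob, c_xy, p_wh):
    """Apply the decoding formulas above (array shapes assumed compatible)."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    b_xy = sigmoid(t_xy) + c_xy        # box center
    b_wh = p_wh * np.exp(t_wh)         # box width/height
    conf = sigmoid(raw_conf)
    prob = sigmoid(raw_prob)
    return b_xy, b_wh, conf[..., None] * prob   # score = conf * prob

def diou(a, b):
    """DIoU of two (x1, y1, x2, y2) boxes: IoU minus the squared center
    distance over the squared diagonal of the enclosing box."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    iou = inter / (area_a + area_b - inter + 1e-9)
    center_dist = (((a[0] + a[2]) - (b[0] + b[2])) ** 2
                   + ((a[1] + a[3]) - (b[1] + b[3])) ** 2) / 4.0
    ex1, ey1 = min(a[0], b[0]), min(a[1], b[1])
    ex2, ey2 = max(a[2], b[2]), max(a[3], b[3])
    diag = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2 + 1e-9
    return iou - center_dist / diag

def diou_nms(boxes, scores, score_thr=0.4, diou_thr=0.1):
    """Drop low-score boxes, then greedily suppress any box whose DIoU
    with an already-kept box exceeds the threshold."""
    order = [i for i in np.argsort(-scores) if scores[i] >= score_thr]
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if diou(boxes[i], boxes[j]) <= diou_thr]
    return keep
```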
S5: if, according to the decoded result of a frame, a mobile phone box appears without any person box, hand box or camera box, the mobile phone box is suppressed.
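A sketch of the suppression rule in S5, assuming the decoded detections of one frame are available as (class, box, score) tuples; the class names are illustrative.

```python
def suppress_lone_phones(detections):
    """Reject phone boxes in a frame that has no person, hand or camera box.

    `detections` is a list of (class_name, box, score) tuples for one
    decoded frame.
    """
    context = {"person", "hand", "camera"}
    has_context = any(cls in context for cls, _, _ in detections)
    if has_context:
        return detections
    # no supporting context: drop all phone predictions in this frame
    return [d for d in detections if d[0] != "phone"]
```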
S6: the suppressed result is used as the target template and the video image frame as the candidate search region; both are fed into the fully-connected Siamese network, which produces a similarity score map by template matching.
S7: the result with the highest similarity is selected to mark the mobile phone in the video image frame.
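Steps S6 and S7 can be sketched as SiamFC-style cross-correlation in PyTorch: the template's feature map is used as a convolution kernel over the search-region feature map, and the peak of the resulting score map localizes the phone. The shared backbone that produces both feature maps is omitted, and all names are assumptions.

```python
import torch
import torch.nn.functional as F

def score_map(template_feat, search_feat):
    """Cross-correlate template and search features.

    template_feat: (1, C, h, w) features of the target template
    search_feat:   (1, C, H, W) features of the candidate search region
    Using the template as a convolution kernel yields a similarity map
    of shape (1, 1, H - h + 1, W - w + 1).
    """
    return F.conv2d(search_feat, template_feat)

def locate_peak(smap):
    """Return the (row, col) of the highest-similarity response."""
    flat = smap.view(-1).argmax()
    W = smap.shape[-1]
    return divmod(flat.item(), W)
```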
S8: judge whether the set number of frames has been tracked; if not, repeat steps S6-S8; if so, go to step S9.
S9: repeat steps S3-S9 until the video image input ends.
In terms of multi-target association, the contributions of the invention are as follows:
it was found that the giou (generalized Intersection over union) -based position loss (used in the present invention) presents an imbalance opposite to the variance-based position loss, for which the average tag box size and average position loss of the s, m, l branches are statistically generalized,and combining the quantity proportion of each branch frame, and adopting a negative exponential function (a.e)-b/x) Unbalanced fitting correction is carried out on the basic function, and the problem of unbalanced position loss of the large and small frames based on the GIoU is solved.
Under the premise that, given enough data, the average position loss of each branch's boxes is nearly equal, the average label-box size and average position loss of the s, m and l branches are accumulated during the first warm-up epoch (the initial training period with a small learning rate). Combined with the proportion of boxes on each branch, the unbalanced-fitting correction with the negative exponential base function a·e^(-b/x) is applied to adjust the position-loss weight of each branch in subsequent iterations, relieving the GIoU position-loss imbalance between large and small boxes.
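An illustrative sketch of this correction, assuming SciPy's curve_fit. The per-branch statistics below are placeholder numbers, and the way the fitted curve is converted into branch weights is an assumption, since the patent does not give the exact formula.

```python
import numpy as np
from scipy.optimize import curve_fit

# Per-branch statistics gathered during the warm-up epoch.
# The concrete numbers are placeholders for illustration only.
avg_box_size = np.array([14.0, 45.0, 120.0])   # mean label-box size: s, m, l
avg_pos_loss = np.array([0.90, 0.55, 0.35])    # mean GIoU position loss: s, m, l

def neg_exp(x, a, b):
    """Base correction function a * exp(-b / x)."""
    return a * np.exp(-b / x)

(a, b), _ = curve_fit(neg_exp, avg_box_size, avg_pos_loss, p0=(1.0, 10.0))

# Assumed weighting: scale each branch inversely to its fitted loss so the
# s, m and l branches contribute comparably in subsequent iterations.
branch_weights = avg_pos_loss.mean() / neg_exp(avg_box_size, a, b)
```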
The label-rewriting problem in yolo was identified: an anchor box already assigned to one object may, with some probability, be overwritten by a later object, so the covered object is never trained. The specific improvement proceeds as follows (a simplified sketch follows the list):
if an anchor box has already been given a label by an original object, judge whether that original object holds a unique box;
if the original object holds a unique box, judge whether the current object can be assigned another anchor; if so, cancel the current object's claim on this anchor box; otherwise, search for the anchor box with the next-highest IoU and assign that instead;
if the original object does not hold a unique box, distinguish the cases: if the anchor is the current object's highest-IoU anchor but not the original object's highest-IoU anchor, the original assignment is overwritten; if it is not the current object's highest-IoU anchor but is the original object's highest-IoU anchor, judge whether the current object can be assigned another anchor, cancelling its claim on this anchor box if so and overwriting the original assignment otherwise; if the anchor is the highest-IoU anchor for neither object, the one with the lower IoU is overwritten.
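A much-simplified sketch of the re-assignment idea: it implements only the core rule that an object must not be stripped of its sole anchor, not the full case analysis above; all names are assumptions.

```python
def assign_object(obj_id, ranked_anchors, assigned, counts):
    """Assign `obj_id` to its best free (or safely stealable) anchor.

    ranked_anchors: anchor indices for this object, highest IoU first.
    assigned: anchor index -> object id currently holding that anchor.
    counts: object id -> number of anchors the object currently holds.
    An occupied anchor is only overwritten if its current owner keeps
    at least one other anchor; otherwise the next-best anchor is tried.
    """
    for a in ranked_anchors:
        owner = assigned.get(a)
        if owner is None:
            assigned[a] = obj_id
            counts[obj_id] = counts.get(obj_id, 0) + 1
            return True
        if counts.get(owner, 0) > 1:   # owner keeps another box: safe steal
            counts[owner] -= 1
            assigned[a] = obj_id
            counts[obj_id] = counts.get(obj_id, 0) + 1
            return True
    return False   # no assignable anchor remains for this object
```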
Considering that a primary/auxiliary distinction is needed between the mobile phone and the other auxiliary detection targets, all losses of the mobile phone class are multiplied by a priority coefficient of 1.10.
The threshold obtained by ATSS (Adaptive Training Sample Selection) is bounded: when the threshold falls below a preset value, the corresponding candidate training samples are considered low quality, so the ATSS threshold is abandoned and only the candidate with the highest IoU is selected. In the invention, the preset value is 0.10.
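A sketch of the bounded selection, assuming ATSS's published mean-plus-standard-deviation threshold over candidate IoUs; the 0.10 floor follows the patent, the rest is illustrative.

```python
import numpy as np

def select_positive_samples(ious, floor=0.10):
    """ATSS-style positive-sample selection with a lower bound.

    ious: IoU of each candidate anchor with the ground-truth box.
    ATSS takes mean + std of the candidate IoUs as the positivity
    threshold; if that threshold falls below `floor`, the candidates
    are deemed low quality and only the single best-IoU anchor is kept.
    """
    thr = ious.mean() + ious.std()
    if thr < floor:
        return np.array([ious.argmax()])
    return np.flatnonzero(ious >= thr)
```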
The multiple target objects are associated at essentially no extra computational cost, reducing the computing resources consumed by the cognition-style detection scheme.
In terms of space-time information fusion, the invention exploits context in both the temporal and spatial dimensions, markedly alleviating the occlusion and drift problems during tracking.
The foregoing describes preferred embodiments of the invention. It is to be understood that the invention is not limited to the precise forms disclosed herein, and that various other combinations, modifications and environments falling within the scope of the inventive concept, whether described above or apparent to those skilled in the relevant art, may be resorted to. Modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (6)

1. A mobile phone detection method based on multi-target fusion and a space-time video sequence, characterized in that the mobile phone detection method comprises the following steps:
training the improved yolo model to obtain a detection model, and inputting a video image frame to run the detection model to obtain a first-frame prediction;
decoding the first-frame prediction, removing boxes whose score falls below a preset value, performing non-maximum suppression (NMS) with a DIoU threshold, and suppressing a mobile phone box when, in the decoded result of a frame, only the mobile phone box appears;
taking the suppressed result as the target template and an input video image frame as the candidate search region, feeding both into a fully-connected Siamese network, and selecting the result with the highest score-map similarity to mark the mobile phone box in the video image frame;
if the set number of frames has been tracked, repeating the above steps until the video image input ends.
2. The mobile phone detection method based on multi-target fusion and a space-time video sequence according to claim 1, characterized in that: if the set number of frames has not yet been tracked, the steps of taking the suppressed result as the target template, inputting a video image frame as the candidate search region, feeding both into the fully-connected Siamese network, and selecting the result with the highest score-map similarity to mark the mobile phone box in the video image frame are repeated.
3. The mobile phone detection method based on multi-target fusion and a space-time video sequence according to claim 1, characterized in that: the mobile phone detection method further comprises the step of acquiring a training set and a test set before training the improved yolov3 model to obtain a detection model and inputting video image frames to run the detection model for the first-frame prediction.
4. The mobile phone detection method based on multi-target fusion and a space-time video sequence according to claim 3, characterized in that: the step of acquiring the training set and the test set comprises: splitting the recorded video into frames, labeling the resulting pictures, extracting a subset of frames at intervals to construct a data set, and dividing the data set into a training set and a test set at a set ratio.
5. The mobile phone detection method based on multi-target fusion and a space-time video sequence according to claim 1, characterized in that: decoding the first-frame prediction, removing boxes whose score falls below a preset value, performing NMS with a DIoU threshold, and suppressing a mobile phone box that appears alone in a decoded frame comprises the following steps:
decoding the first-frame prediction according to the formulas b_x = sigmoid(t_x) + c_x, b_y = sigmoid(t_y) + c_y, b_w = p_w · e^(t_w), b_h = p_h · e^(t_h), conf = sigmoid(raw_conf) and prob = sigmoid(raw_prob);
removing boxes whose confidence or class probability fails the score threshold of 0.4, and performing NMS with a DIoU threshold of 0.1;
and, if the decoded result of a frame contains a mobile phone box but no person box, hand box or camera box, rejecting the phone-related prediction boxes in that image, thereby suppressing the mobile phone box.
6. The mobile phone detection method based on multi-target fusion and a space-time video sequence according to claim 1, characterized in that the improvements to the yolo model include the following:
adding an s branch to yolov3-tiny for detecting small objects, to improve the detection of small targets such as cameras;
on the basis of the model structure of the preceding step, adding SPP, SAM and CAM modules with residual connections, to strengthen feature extraction.
CN202011079614.5A 2020-10-10 2020-10-10 Mobile phone detection method based on multi-target fusion and space-time video sequence Active CN112257527B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011079614.5A CN112257527B (en) 2020-10-10 2020-10-10 Mobile phone detection method based on multi-target fusion and space-time video sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011079614.5A CN112257527B (en) 2020-10-10 2020-10-10 Mobile phone detection method based on multi-target fusion and space-time video sequence

Publications (2)

Publication Number Publication Date
CN112257527A true CN112257527A (en) 2021-01-22
CN112257527B CN112257527B (en) 2022-09-02

Family

ID=74242754

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011079614.5A Active CN112257527B (en) 2020-10-10 2020-10-10 Mobile phone detection method based on multi-target fusion and space-time video sequence

Country Status (1)

Country Link
CN (1) CN112257527B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733821A (en) * 2021-03-31 2021-04-30 成都西交智汇大数据科技有限公司 Target detection method fusing lightweight attention model
CN112967289A (en) * 2021-02-08 2021-06-15 上海西井信息科技有限公司 Security check package matching method, system, equipment and storage medium
CN113139092A (en) * 2021-04-28 2021-07-20 北京百度网讯科技有限公司 Video searching method and device, electronic equipment and medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108614894A (en) * 2018-05-10 2018-10-02 西南交通大学 A kind of face recognition database's constructive method based on maximum spanning tree
CN109508710A (en) * 2018-10-23 2019-03-22 东华大学 Based on the unmanned vehicle night-environment cognitive method for improving YOLOv3 network
CN109934121A (en) * 2019-02-21 2019-06-25 江苏大学 A kind of orchard pedestrian detection method based on YOLOv3 algorithm
CN110443210A (en) * 2019-08-08 2019-11-12 北京百度网讯科技有限公司 A kind of pedestrian tracting method, device and terminal
CN110472467A (en) * 2019-04-08 2019-11-19 江西理工大学 The detection method for transport hub critical object based on YOLO v3
CN110619309A (en) * 2019-09-19 2019-12-27 天津天地基业科技有限公司 Embedded platform face detection method based on octave convolution sum YOLOv3
WO2020047854A1 (en) * 2018-09-07 2020-03-12 Intel Corporation Detecting objects in video frames using similarity detectors
CN111161311A (en) * 2019-12-09 2020-05-15 中车工业研究院有限公司 Visual multi-target tracking method and device based on deep learning
AU2020100705A4 (en) * 2020-05-05 2020-06-18 Chang, Jiaying Miss A helmet detection method with lightweight backbone based on yolov3 network
CN111753767A (en) * 2020-06-29 2020-10-09 广东小天才科技有限公司 Method and device for automatically correcting operation, electronic equipment and storage medium

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108614894A (en) * 2018-05-10 2018-10-02 西南交通大学 A kind of face recognition database's constructive method based on maximum spanning tree
WO2020047854A1 (en) * 2018-09-07 2020-03-12 Intel Corporation Detecting objects in video frames using similarity detectors
CN109508710A (en) * 2018-10-23 2019-03-22 东华大学 Based on the unmanned vehicle night-environment cognitive method for improving YOLOv3 network
CN109934121A (en) * 2019-02-21 2019-06-25 江苏大学 A kind of orchard pedestrian detection method based on YOLOv3 algorithm
CN110472467A (en) * 2019-04-08 2019-11-19 江西理工大学 The detection method for transport hub critical object based on YOLO v3
CN110443210A (en) * 2019-08-08 2019-11-12 北京百度网讯科技有限公司 A kind of pedestrian tracting method, device and terminal
CN110619309A (en) * 2019-09-19 2019-12-27 天津天地基业科技有限公司 Embedded platform face detection method based on octave convolution sum YOLOv3
CN111161311A (en) * 2019-12-09 2020-05-15 中车工业研究院有限公司 Visual multi-target tracking method and device based on deep learning
AU2020100705A4 (en) * 2020-05-05 2020-06-18 Chang, Jiaying Miss A helmet detection method with lightweight backbone based on yolov3 network
CN111753767A (en) * 2020-06-29 2020-10-09 广东小天才科技有限公司 Method and device for automatically correcting operation, electronic equipment and storage medium

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
PRANAV ADARSH et al.: "YOLO v3-Tiny: Object Detection and Recognition using one stage improved model", 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS) *
TAKUYA FUKAGAI et al.: "Speed-Up of Object Detection Neural Network with GPU", 2018 25th IEEE International Conference on Image Processing (ICIP) *
ZHEDONG ZHENG et al.: "Pedestrian Alignment Network for Large-scale Person Re-Identification", IEEE Transactions on Circuits and Systems for Video Technology *
ZIQI YANG et al.: "A Temporal Sequence Dual-Branch Network for Classifying Hybrid Ultrasound Data of Breast Cancer", IEEE Access *
HOU Zhiqiang et al.: "Improved Faster R-CNN algorithm based on dual-threshold non-maximum suppression" (基于双阈值-非极大值抑制的Faster R-CNN改进算法), Opto-Electronic Engineering (光电工程) *
LIU Tingyu et al.: "Intelligent small-target detection method for rapidly constructing digital-twin models of workshop personnel macro behavior" (面向车间人员宏观行为数字孪生模型快速构建的小目标智能检测方法), Computer Integrated Manufacturing Systems (计算机集成制造系统) *
YI Shi et al.: "Pheasant recognition method based on an enhanced Tiny-YOLOv3 model" (基于增强型Tiny-YOLOV3模型的野鸡识别方法), Transactions of the Chinese Society of Agricultural Engineering (农业工程学报) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112967289A (en) * 2021-02-08 2021-06-15 上海西井信息科技有限公司 Security check package matching method, system, equipment and storage medium
CN112733821A (en) * 2021-03-31 2021-04-30 成都西交智汇大数据科技有限公司 Target detection method fusing lightweight attention model
CN113139092A (en) * 2021-04-28 2021-07-20 北京百度网讯科技有限公司 Video searching method and device, electronic equipment and medium
CN113139092B (en) * 2021-04-28 2023-11-03 北京百度网讯科技有限公司 Video searching method and device, electronic equipment and medium

Also Published As

Publication number Publication date
CN112257527B (en) 2022-09-02

Similar Documents

Publication Publication Date Title
CN112257527B (en) Mobile phone detection method based on multi-target fusion and space-time video sequence
WO2020173226A1 (en) Spatial-temporal behavior detection method
CN108806334A (en) A kind of intelligent ship personal identification method based on image
CN113052876B (en) Video relay tracking method and system based on deep learning
CN110555420B (en) Fusion model network and method based on pedestrian regional feature extraction and re-identification
CN107256386A (en) Human behavior analysis method based on deep learning
CN115661615A (en) Training method and device of image recognition model and electronic equipment
CN111898566B (en) Attitude estimation method, attitude estimation device, electronic equipment and storage medium
CN114022837A (en) Station left article detection method and device, electronic equipment and storage medium
CN109800756A (en) A kind of text detection recognition methods for the intensive text of Chinese historical document
CN113096159A (en) Target detection and track tracking method, model and electronic equipment thereof
US20230222841A1 (en) Ensemble Deep Learning Method for Identifying Unsafe Behaviors of Operators in Maritime Working Environment
CN109753901A (en) Indoor pedestrian's autonomous tracing in intelligent vehicle, device, computer equipment and storage medium based on pedestrian's identification
CN113065568A (en) Target detection, attribute identification and tracking method and system
CN113901911B (en) Image recognition method, image recognition device, model training method, model training device, electronic equipment and storage medium
CN109086725A (en) Hand tracking and machine readable storage medium
CN116883883A (en) Marine ship target detection method based on generation of anti-shake of countermeasure network
CN114821647A (en) Sleeping post identification method, device, equipment and medium
WO2022205329A1 (en) Object detection method, object detection apparatus, and object detection system
CN111539390A (en) Small target image identification method, equipment and system based on Yolov3
WO2023070955A1 (en) Method and apparatus for detecting tiny target in port operation area on basis of computer vision
CN110210436A (en) A kind of vehicle-mounted camera line walking image-recognizing method
CN115131826A (en) Article detection and identification method, and network model training method and device
CN113469138A (en) Object detection method and device, storage medium and electronic equipment
CN112069997A (en) Unmanned aerial vehicle autonomous landing target extraction method and device based on DenseHR-Net

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant