CN110287826A - Video object detection method based on attention mechanism

Video object detection method based on attention mechanism

Info

Publication number
CN110287826A
CN110287826A
Authority
CN
China
Prior art keywords
detected
frame
feature map
feature
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910499786.9A
Other languages
Chinese (zh)
Other versions
CN110287826B (en)
Inventor
Li Jianqiang
Bai Jun
Liu Yaqi
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201910499786.9A priority Critical patent/CN110287826B/en
Publication of CN110287826A publication Critical patent/CN110287826A/en
Application granted granted Critical
Publication of CN110287826B publication Critical patent/CN110287826B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a video object detection method based on an attention mechanism, in the field of computer vision. The method comprises the following steps: step S1, extract the candidate feature map of the current time frame; step S2, set a fusion window over the past time period, compute the Laplacian variance of each frame in the window, use the normalized variances as the weights of the frames in the window, weight and sum the candidate feature maps of all frames in the window to obtain a temporal feature, and concatenate the candidate feature of the current time frame with the temporal feature to obtain the feature map to be detected; step S3, extract feature maps at additional scales from the feature map to be detected using convolutional layers; step S4, predict target categories and positions on the feature maps of different scales using convolutional layers. The feature fusion method of the invention assigns different weights to frame features of different quality in the past time period, so that temporal information is fused more fully and the performance of the detection model is improved.

Description

Video object detection method based on attention mechanism
Technical field
The present invention relates to computer vision, deep learning, and video object detection technology.
Background art
Image object detection methods based on deep learning have made enormous progress over the past five years, for example the RCNN family of networks, the SSD network, and the YOLO family of networks. However, in fields such as video surveillance and driver assistance, object detection based on video is in far wider demand. Because video exhibits motion blur, occlusion, diverse shape changes, and diverse illumination changes, good detection results cannot be obtained by detecting targets in video with image object detection techniques alone. In video, adjacent frames are continuous in time and similar in space, and the positions of targets are correlated from frame to frame; how to exploit the temporal information of targets in video has become the key to improving video object detection performance.
Current video object detection frameworks fall mainly into three classes. One class treats video frames as independent images and detects them with image object detection algorithms; such methods ignore temporal information, detect each frame independently, and give unsatisfactory results. Another class combines object detection with object tracking; such methods post-process detection results in order to track targets, and because tracking accuracy depends on detection, errors propagate easily. A third class detects only on a few key frames and generates the features of the remaining frames from optical flow information and key-frame features; although such methods use temporal information, computing optical flow is very expensive, making them hard to use for fast detection.
Summary of the invention
The object of the present invention is to provide a video object detection method that fully fuses temporal features and is both fast and accurate.
To solve the above technical problems, the present invention provides a video object detection method based on an attention mechanism, comprising the following steps:
Step S1: input the video frame image of the current time point into a MobileNet network to extract a candidate feature map.
Step S2: set a temporal feature fusion window in the past time period adjacent to the current time point; for each video frame to be fused in the fusion window, compute the Laplacian variance of its image and, after normalization, use the variances as the fusion weights of the frames to be fused; weight and sum the candidate feature maps of all frames to be fused according to the fusion weights to obtain the temporal feature required by the current frame; concatenate the candidate feature of the current-time video frame with the temporal feature along the channel dimension to obtain a feature map to be detected that has fused the temporal information.
Step S3: on the feature map to be detected, extract feature maps to be detected at additional scales using convolutional feature extraction layers and max-pooling layers.
Step S4: on the feature maps to be detected at different scales, use convolutional layers to predict the target categories and bounding-box coordinates in the current frame.
Further, in step S1, to detect the video frame at the current time point t, the current video frame image I_t ∈ ℝ^{3×H_I×W_I}, where H_I and W_I are the height and width of the video frame, is first input into the MobileNet network for feature extraction, yielding the candidate feature map F_t ∈ ℝ^{C_1×H_1×W_1}, where ℝ denotes the real numbers and C_1, H_1, and W_1 are the number of feature channels, the height, and the width of the candidate feature map.
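As an illustration of step S1, the following is a minimal sketch in PyTorch, assuming torchvision's MobileNetV2 feature extractor as the backbone and an arbitrary 300×300 input size; neither the framework nor the exact MobileNet variant is specified by the patent.

```python
import torch
import torchvision

# Hypothetical backbone: the patent names "Mobilenet" but fixes no framework or
# variant; torchvision's MobileNetV2 feature stack stands in for it here.
backbone = torchvision.models.mobilenet_v2(weights=None).features

frame = torch.randn(1, 3, 300, 300)   # I_t, with H_I = W_I = 300 (assumed size)
with torch.no_grad():
    candidate = backbone(frame)       # F_t, of shape (1, C_1, H_1, W_1)
print(candidate.shape)                # torch.Size([1, 1280, 10, 10])
```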
Further, in step S2, a feature fusion window of width w = s is set in the past time period of the current time point t. Let the video frame images to be fused in the feature fusion window be {I_{t-i}}, i ∈ [1, s], and the corresponding candidate feature maps be {F_{t-i}}, i ∈ [1, s]. Each video frame image I_{t-i} to be fused is converted to a grayscale image G_{t-i}, and the Laplacian variance of the image is computed on the grayscale image. The Laplacian operator at coordinate (x, y) of a grayscale image G is

∇²G(x, y) = G(x+1, y) + G(x-1, y) + G(x, y+1) + G(x, y-1) - 4·G(x, y)

The Laplacian of an image computes the second derivative of each pixel in every direction, capturing regions where pixel values change rapidly, and can be used to detect corners in the image. The Laplacian variance of an image reflects the pixel-value variation of the whole image: a larger Laplacian variance indicates a sharper image, and a smaller one a blurrier image.
First, the Laplacian mean μ_{t-i} of each grayscale image G_{t-i} is computed, where H_I and W_I are the height and width of the grayscale image:

μ_{t-i} = (1 / (H_I·W_I)) · Σ_{x=1..H_I} Σ_{y=1..W_I} ∇²G_{t-i}(x, y)
Next, the Laplacian variance σ²_{t-i} of each grayscale image G_{t-i} is computed:

σ²_{t-i} = (1 / (H_I·W_I)) · Σ_{x=1..H_I} Σ_{y=1..W_I} (∇²G_{t-i}(x, y) - μ_{t-i})²
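A minimal sketch of this sharpness score: the 4-neighbour Laplacian kernel defined above is applied with a convolution and the population variance of the response is taken (the valid convolution drops border pixels, a small deviation from the formula):

```python
import torch
import torch.nn.functional as F

# 4-neighbour discrete Laplacian kernel matching the operator above.
LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def laplacian_variance(gray: torch.Tensor) -> torch.Tensor:
    """gray: (H_I, W_I) grayscale image; returns the scalar variance sigma^2."""
    response = F.conv2d(gray.view(1, 1, *gray.shape), LAPLACIAN)
    return response.var(unbiased=False)   # population variance, as in the formula
```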
If a video frame is sharp, its candidate features help target detection; conversely, some frames are blurred by target motion, and the candidate features of such frames hinder detection. Video frames of different sharpness should therefore be assigned different fusion weights, so that the detection model attends more to sharp features than to blurred ones. First, the fusion weight α_{t-i} of every video frame to be fused is computed:

α_{t-i} = σ²_{t-i} / Σ_{j=1..s} σ²_{t-j}
The candidate features of the frames in the feature fusion window are fused by weighted summation to obtain the temporal feature F̃_t of the current time point:

F̃_t = Σ_{i=1..s} α_{t-i}·F_{t-i}

The temporal feature and the candidate feature of the current frame are concatenated along the channel dimension, completing the fusion of the temporal information and yielding the first feature map to be detected.
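The weight normalization and weighted fusion just described can be sketched as follows, reusing the hypothetical laplacian_variance helper from above; all tensor shapes are assumptions:

```python
import torch

def fuse_window(gray_frames, candidate_maps, current_map):
    """gray_frames:    s grayscale images G_{t-i}, each (H_I, W_I)
       candidate_maps: s candidate feature maps F_{t-i}, each (C_1, H_1, W_1)
       current_map:    F_t of the current frame, (C_1, H_1, W_1)"""
    variances = torch.stack([laplacian_variance(g) for g in gray_frames])
    alphas = variances / variances.sum()              # normalised fusion weights
    temporal = sum(a * f for a, f in zip(alphas, candidate_maps))  # weighted sum
    return torch.cat([current_map, temporal], dim=0)  # concat along channels
```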
Further, in step S3, after the feature map to be detected that has fused the temporal feature at the current time point is obtained, 3×3 convolutional layers and 2×2 pooling layers are used to further extract features from it while reducing its size, in order to obtain feature maps to be detected at more scales. Local information is richer in the large feature maps, which suits the prediction of small targets, while the small feature maps contain stronger global semantic information, which suits the detection of larger targets. After e-1 feature extractions, e feature maps to be detected are finally obtained.
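A sketch of this extra-scale extraction: e-1 blocks of 3×3 convolution followed by 2×2 max pooling; the intermediate channel width is a placeholder, not a value from the patent:

```python
import torch.nn as nn

class ExtraScales(nn.Module):
    """e-1 rounds of 3x3 conv + ReLU + 2x2 max pooling, collecting all e maps."""
    def __init__(self, in_channels: int, e: int = 4, channels: int = 256):
        super().__init__()
        blocks, c = [], in_channels
        for _ in range(e - 1):
            blocks.append(nn.Sequential(
                nn.Conv2d(c, channels, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(kernel_size=2),
            ))
            c = channels
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        maps = [x]                     # the fused map is the first scale
        for block in self.blocks:
            maps.append(block(maps[-1]))
        return maps                    # e feature maps, from large to small
```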
Further, in step S4, multi-scale feature maps to be detected are obtained through the additional feature extraction. Anchor boxes with prior positions are set on the feature maps to be detected at the different scales, and on each of these feature maps two 3×3 convolutional layers operating over the channel dimension are used to predict, respectively, the offsets of the target bounding boxes relative to the anchor boxes and the categories of the targets. Let the number of categories be d (including background); for each feature map to be detected, the 3×3 convolutional classification prediction layer and the 3×3 convolutional bounding-box prediction layer yield the classification prediction result and the bounding-box prediction result.
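The two prediction layers can be sketched per scale as below; the channel layout (n_i·d class channels and 4·n_i box channels per location) follows the standard SSD-style convention that this step describes:

```python
import torch.nn as nn

class PredictionHead(nn.Module):
    """Per-scale head from step S4: one 3x3 conv for class scores and one for
       bounding-box offsets relative to the anchors."""
    def __init__(self, in_channels: int, num_anchors: int, num_classes: int):
        super().__init__()
        self.cls = nn.Conv2d(in_channels, num_anchors * num_classes, 3, padding=1)
        self.box = nn.Conv2d(in_channels, num_anchors * 4, 3, padding=1)

    def forward(self, x):
        return self.cls(x), self.box(x)   # (n_i*d, H, W) and (4*n_i, H, W) maps
```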
Description of the drawings
Fig. 1 is a schematic diagram of the present invention.
Specific embodiments
In conjunction with the accompanying drawings, the present invention is explained in further detail below. The drawings are simplified schematics that illustrate the basic structure of the invention only in a schematic manner, and therefore show only the components relevant to the invention.
Embodiment 1
As shown in Fig. 1, this example provides a video object detection method based on an attention mechanism, comprising the following steps:
Step S1: input the video frame image of the current time point into a MobileNet network to extract a candidate feature map.
Step S2: set a temporal feature fusion window in the past time period adjacent to the current time point; for each video frame to be fused in the fusion window, compute the Laplacian variance of its image and, after normalization, use the variances as the fusion weights of the frames to be fused; weight and sum the candidate feature maps of all frames to be fused according to the weights to obtain the temporal feature required by the current frame; concatenate the candidate feature of the current-time video frame with the temporal feature along the channel dimension to obtain a feature map to be detected that has fused the temporal information.
Step S3: on the feature map to be detected, extract feature maps to be detected at additional scales using convolutional feature extraction layers and max-pooling layers.
Step S4: on the feature maps to be detected at different scales, use convolutional layers to predict the target categories and bounding-box coordinates in the current frame.
In step S1, to detect the video frame at the current time point t, the current video frame image I_t ∈ ℝ^{3×H_I×W_I}, where H_I and W_I are the height and width of the frame image, is first input into MobileNet for feature extraction, yielding the candidate feature map F_t ∈ ℝ^{C_1×H_1×W_1}, where C_1, H_1, and W_1 are the number of channels, the height, and the width of the candidate feature map.
In step S2, a feature fusion window of width w = s is set in the past time period of the current time point t. Let the length of the past time period be q; the window width is then set by the following rule: if the number of past time steps is greater than s, the fusion window width is set to s; if it is less than s, there are not enough features and the fusion window width is set to the number of past time steps, i.e. w = min(s, q).
Let the video frame images to be fused in the feature fusion window be {I_{t-i}}, i ∈ [1, s], and the corresponding candidate feature maps be {F_{t-i}}, i ∈ [1, s]. Each video frame image I_{t-i} to be fused is converted to a grayscale image G_{t-i}, and the Laplacian variance of the image is computed on the grayscale image. The Laplacian operator at coordinate (x, y) of a grayscale image G is:

∇²G(x, y) = G(x+1, y) + G(x-1, y) + G(x, y+1) + G(x, y-1) - 4·G(x, y)
where G(x, y) is the pixel value of the grayscale image G at coordinate (x, y). The Laplacian of an image computes the second derivative of each pixel in every direction, capturing regions where pixel values change rapidly, and can be used to detect corners in the image. The Laplacian variance of an image reflects the pixel-value variation of the whole image: a larger Laplacian variance indicates a sharper image, and a smaller one a blurrier image.
First, the Laplacian mean μ_{t-i} of each grayscale image G_{t-i} is computed, where H_I and W_I are the height and width of the grayscale image:

μ_{t-i} = (1 / (H_I·W_I)) · Σ_{x=1..H_I} Σ_{y=1..W_I} ∇²G_{t-i}(x, y)
Next, the Laplacian variance σ²_{t-i} of each grayscale image G_{t-i} is computed:

σ²_{t-i} = (1 / (H_I·W_I)) · Σ_{x=1..H_I} Σ_{y=1..W_I} (∇²G_{t-i}(x, y) - μ_{t-i})²
If a video frame is sharp, its candidate features help target detection; conversely, some frames are blurred by target motion, and the candidate features of such frames hinder detection. Video frames of different sharpness should therefore be assigned different fusion weights, with sharper frames receiving larger feature weights, so that the detection model attends more to sharp features than to blurred ones. First, the fusion weight α_{t-i} of every video frame to be fused is computed:

α_{t-i} = σ²_{t-i} / Σ_{j=1..s} σ²_{t-j}
The candidate features of the frames in the feature fusion window are fused by weighted summation to obtain the temporal feature F̃_t of the current time point:

F̃_t = Σ_{i=1..s} α_{t-i}·F_{t-i}
The temporal feature and the candidate feature of the current frame are concatenated along the channel dimension, completing the fusion of the temporal information and yielding the first feature map to be detected.
In step S3, after the feature map to be detected that has fused the temporal feature at the current time point is obtained, convolutional layers and pooling layers are used to further extract features from it while reducing its size, in order to obtain feature maps to be detected at more scales. Local information is richer in the large feature maps, which suits the prediction of small targets, while the small feature maps contain stronger global semantic information, which suits the detection of larger targets. After e-1 feature extractions, e feature maps to be detected are finally obtained.
In step S4, multi-scale feature maps to be detected are obtained through the additional feature extraction. Anchor boxes with prior positions are set on the feature maps to be detected at the different scales, and on each of these feature maps two convolutional layers operating over the channel dimension are used to predict, respectively, the offsets of the target bounding boxes relative to the anchor boxes and the categories of the targets. Let the number of categories be d (including background). For each feature map to be detected, with C_Fi, H_Fi, and W_Fi its number of channels, height, and width, and n_i anchor boxes at each pixel location, the convolutional classification prediction layer and the convolutional bounding-box prediction layer yield a classification prediction result of shape (n_i·d)×H_Fi×W_Fi and a bounding-box prediction result of shape (4·n_i)×H_Fi×W_Fi.
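For orientation, the sketches above can be chained for one time step t as follows; the window size s, input resolution, number of scales e, anchor count, and class count d are all assumed values, and the randomly initialised modules stand in for trained ones:

```python
import torch

s, d = 4, 21                                      # assumed window and class count
frames = [torch.randn(3, 300, 300) for _ in range(s + 1)]   # I_{t-s}, ..., I_t
grays = [f.mean(dim=0) for f in frames[:-1]]      # crude RGB-to-gray conversion

with torch.no_grad():
    feats = [backbone(f.unsqueeze(0))[0] for f in frames]   # F_{t-i} and F_t
    fused = fuse_window(grays, feats[:-1], feats[-1])       # map to be detected
    scales = ExtraScales(fused.shape[0], e=4)(fused.unsqueeze(0))
    heads = [PredictionHead(m.shape[1], num_anchors=6, num_classes=d)
             for m in scales]
    preds = [h(m) for h, m in zip(heads, scales)]  # (class, box) pair per scale
```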

Claims (5)

1. A video object detection method based on an attention mechanism, characterized by comprising the following steps:
Step S1: input the video frame image of the current time point into MobileNet to extract a candidate feature map;
Step S2: set a temporal feature fusion window in the past time period adjacent to the current time point; for each video frame to be fused in the feature fusion window, compute the Laplacian variance of its image and, after normalization, use the variances as the fusion weights of the frames to be fused; weight and sum the candidate feature maps of all frames to be fused according to the weights to obtain the temporal feature required by the current frame; concatenate the candidate feature of the current-time video frame with the temporal feature along the channel dimension to obtain a feature map to be detected that has fused the temporal information;
Step S3: on the feature map to be detected, extract feature maps to be detected at additional scales using convolutional feature extraction layers and max-pooling layers;
Step S4: on the feature maps to be detected at different scales, use convolutional layers to predict the target categories and bounding-box coordinates in the current frame.
2. The video object detection method based on an attention mechanism according to claim 1, characterized in that
in step S1, to detect the video frame at the current time point t, the current video frame image I_t ∈ ℝ^{3×H_I×W_I}, where H_I and W_I are the height and width of the video frame, is first input into the MobileNet network for feature extraction, obtaining the candidate feature map F_t ∈ ℝ^{C_1×H_1×W_1}, where ℝ denotes the real numbers and C_1, H_1, and W_1 are the number of feature channels, the height, and the width of the candidate feature map.
3. The video object detection method based on an attention mechanism according to claim 2, characterized in that
in step S2, a feature fusion window of width w = s is set in the past time period of the current time point t; let the video frame images to be fused in the feature fusion window be {I_{t-i}}, i ∈ [1, s], and the corresponding candidate feature maps be {F_{t-i}}, i ∈ [1, s]; each video frame image I_{t-i} to be fused is converted to a grayscale image G_{t-i};
the Laplacian variance σ²_{t-i} of each grayscale image G_{t-i} is computed; the fusion weights α_{t-i} of all video frames to be fused are obtained by normalizing the Laplacian variances; the candidate features of the frames in the feature fusion window are fused by weighted summation to obtain the temporal feature F̃_t of the current time point; the temporal feature and the candidate feature of the current frame are concatenated along the channel dimension, completing the fusion of the temporal information and yielding the first feature map to be detected.
4. The video object detection method based on an attention mechanism according to claim 3, characterized in that
in step S3, after the feature map to be detected that has fused the temporal feature at the current time point is obtained, 3×3 convolutional layers and 2×2 pooling layers are used to further extract features from the feature map to be detected while reducing its size; after e-1 feature extractions, e feature maps to be detected are finally obtained.
5. The video object detection method based on an attention mechanism according to claim 4, characterized in that
in step S4, multi-scale feature maps to be detected are obtained through the additional feature extraction; anchor boxes with prior positions are set on the feature maps to be detected at the different scales, and on these feature maps two 3×3 convolutional layers operating over the channel dimension are used to predict, respectively, the offsets of the target bounding boxes relative to the anchor boxes and the categories of the targets; for each feature map to be detected, the 3×3 convolutional classification prediction layer and the 3×3 convolutional bounding-box prediction layer yield the classification prediction result and the bounding-box prediction result.
CN201910499786.9A 2019-06-11 2019-06-11 Video target detection method based on attention mechanism Active CN110287826B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910499786.9A CN110287826B (en) 2019-06-11 2019-06-11 Video target detection method based on attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910499786.9A CN110287826B (en) 2019-06-11 2019-06-11 Video target detection method based on attention mechanism

Publications (2)

Publication Number Publication Date
CN110287826A true CN110287826A (en) 2019-09-27
CN110287826B CN110287826B (en) 2021-09-17

Family

ID=68003699

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910499786.9A Active CN110287826B (en) 2019-06-11 2019-06-11 Video target detection method based on attention mechanism

Country Status (1)

Country Link
CN (1) CN110287826B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674886A (en) * 2019-10-08 2020-01-10 中兴飞流信息科技有限公司 Video target detection method fusing multi-level features
CN110751646A (en) * 2019-10-28 2020-02-04 支付宝(杭州)信息技术有限公司 Method and device for identifying damage by using multiple image frames in vehicle video
CN111310609A (en) * 2020-01-22 2020-06-19 西安电子科技大学 Video target detection method based on time sequence information and local feature similarity
CN112016472A (en) * 2020-08-31 2020-12-01 山东大学 Driver attention area prediction method and system based on target dynamic information
CN112434607A (en) * 2020-11-24 2021-03-02 北京奇艺世纪科技有限公司 Feature processing method and device, electronic equipment and computer-readable storage medium
CN112561001A (en) * 2021-02-22 2021-03-26 南京智莲森信息技术有限公司 Video target detection method based on space-time feature deformable convolution fusion
CN112686913A (en) * 2021-01-11 2021-04-20 天津大学 Object boundary detection and object segmentation model based on boundary attention consistency
CN113688801A (en) * 2021-10-22 2021-11-23 南京智谱科技有限公司 Chemical gas leakage detection method and system based on spectrum video
WO2022036567A1 (en) * 2020-08-18 2022-02-24 深圳市大疆创新科技有限公司 Target detection method and device, and vehicle-mounted radar
CN114594770A (en) * 2022-03-04 2022-06-07 深圳市千乘机器人有限公司 Inspection method for inspection robot without stopping
CN115131710A (en) * 2022-07-05 2022-09-30 福州大学 Real-time action detection method based on multi-scale feature fusion attention

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102393958A (en) * 2011-07-16 2012-03-28 西安电子科技大学 Multi-focus image fusion method based on compressive sensing
CN103152513A (en) * 2011-12-06 2013-06-12 瑞昱半导体股份有限公司 Image processing method and relative image processing device
CN103702032A (en) * 2013-12-31 2014-04-02 华为技术有限公司 Image processing method, device and terminal equipment
CN105913404A (en) * 2016-07-01 2016-08-31 湖南源信光电科技有限公司 Low-illumination imaging method based on frame accumulation
US20170127016A1 (en) * 2015-10-29 2017-05-04 Baidu Usa Llc Systems and methods for video paragraph captioning using hierarchical recurrent neural networks
CN107481238A (en) * 2017-09-20 2017-12-15 众安信息技术服务有限公司 Image quality measure method and device
US20180060666A1 (en) * 2016-08-29 2018-03-01 Nec Laboratories America, Inc. Video system using dual stage attention based recurrent neural network for future event prediction
CN108921803A (en) * 2018-06-29 2018-11-30 华中科技大学 A kind of defogging method based on millimeter wave and visual image fusion
CN109104568A (en) * 2018-07-24 2018-12-28 苏州佳世达光电有限公司 The intelligent cleaning driving method and drive system of monitoring camera
CN109684912A (en) * 2018-11-09 2019-04-26 中国科学院计算技术研究所 A kind of video presentation method and system based on information loss function
CN109829398A (en) * 2019-01-16 2019-05-31 北京航空航天大学 A kind of object detection method in video based on Three dimensional convolution network

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102393958A (en) * 2011-07-16 2012-03-28 西安电子科技大学 Multi-focus image fusion method based on compressive sensing
CN103152513A (en) * 2011-12-06 2013-06-12 瑞昱半导体股份有限公司 Image processing method and relative image processing device
CN103702032A (en) * 2013-12-31 2014-04-02 华为技术有限公司 Image processing method, device and terminal equipment
US20170127016A1 (en) * 2015-10-29 2017-05-04 Baidu Usa Llc Systems and methods for video paragraph captioning using hierarchical recurrent neural networks
CN105913404A (en) * 2016-07-01 2016-08-31 湖南源信光电科技有限公司 Low-illumination imaging method based on frame accumulation
US20180060666A1 (en) * 2016-08-29 2018-03-01 Nec Laboratories America, Inc. Video system using dual stage attention based recurrent neural network for future event prediction
CN107481238A (en) * 2017-09-20 2017-12-15 众安信息技术服务有限公司 Image quality measure method and device
CN108921803A (en) * 2018-06-29 2018-11-30 华中科技大学 A kind of defogging method based on millimeter wave and visual image fusion
CN109104568A (en) * 2018-07-24 2018-12-28 苏州佳世达光电有限公司 The intelligent cleaning driving method and drive system of monitoring camera
CN109684912A (en) * 2018-11-09 2019-04-26 中国科学院计算技术研究所 A kind of video presentation method and system based on information loss function
CN109829398A (en) * 2019-01-16 2019-05-31 北京航空航天大学 A kind of object detection method in video based on Three dimensional convolution network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
XIN WANG: "Infrared dim target detection based on visual attention", 《INFRARED PHYSICS & TECHNOLOGY》 *
王昕: "基于提升小波变换的图像清晰度评价算法", 《万方数据知识服务平台》 *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674886A (en) * 2019-10-08 2020-01-10 中兴飞流信息科技有限公司 Video target detection method fusing multi-level features
CN110674886B (en) * 2019-10-08 2022-11-25 中兴飞流信息科技有限公司 Video target detection method fusing multi-level features
CN110751646A (en) * 2019-10-28 2020-02-04 支付宝(杭州)信息技术有限公司 Method and device for identifying damage by using multiple image frames in vehicle video
CN111310609A (en) * 2020-01-22 2020-06-19 西安电子科技大学 Video target detection method based on time sequence information and local feature similarity
WO2022036567A1 (en) * 2020-08-18 2022-02-24 深圳市大疆创新科技有限公司 Target detection method and device, and vehicle-mounted radar
CN112016472A (en) * 2020-08-31 2020-12-01 山东大学 Driver attention area prediction method and system based on target dynamic information
CN112016472B (en) * 2020-08-31 2023-08-22 山东大学 Driver attention area prediction method and system based on target dynamic information
CN112434607A (en) * 2020-11-24 2021-03-02 北京奇艺世纪科技有限公司 Feature processing method and device, electronic equipment and computer-readable storage medium
CN112434607B (en) * 2020-11-24 2023-05-26 北京奇艺世纪科技有限公司 Feature processing method, device, electronic equipment and computer readable storage medium
CN112686913A (en) * 2021-01-11 2021-04-20 天津大学 Object boundary detection and object segmentation model based on boundary attention consistency
CN112686913B (en) * 2021-01-11 2022-06-10 天津大学 Object boundary detection and object segmentation model based on boundary attention consistency
CN112561001A (en) * 2021-02-22 2021-03-26 南京智莲森信息技术有限公司 Video target detection method based on space-time feature deformable convolution fusion
CN113688801A (en) * 2021-10-22 2021-11-23 南京智谱科技有限公司 Chemical gas leakage detection method and system based on spectrum video
CN114594770A (en) * 2022-03-04 2022-06-07 深圳市千乘机器人有限公司 Inspection method for inspection robot without stopping
CN114594770B (en) * 2022-03-04 2024-04-26 深圳市千乘机器人有限公司 Inspection method for inspection robot without stopping
CN115131710A (en) * 2022-07-05 2022-09-30 福州大学 Real-time action detection method based on multi-scale feature fusion attention

Also Published As

Publication number Publication date
CN110287826B (en) 2021-09-17

Similar Documents

Publication Publication Date Title
CN110287826A (en) A kind of video object detection method based on attention mechanism
CN108596974B (en) Dynamic scene robot positioning and mapping system and method
Hoogendoorn et al. Extracting microscopic pedestrian characteristics from video data
JP5102410B2 (en) Moving body detection apparatus and moving body detection method
TWI393074B (en) Apparatus and method for moving object detection
CN102741884B (en) Moving body detecting device and moving body detection method
CN110175576A (en) A kind of driving vehicle visible detection method of combination laser point cloud data
CN109460753A (en) A method of detection over-water floats
CN106529419B (en) The object automatic testing method of saliency stacking-type polymerization
CN107886120A (en) Method and apparatus for target detection tracking
CN110033473A (en) Motion target tracking method based on template matching and depth sorting network
CN110415277A (en) Based on light stream and the multi-target tracking method of Kalman filtering, system, device
CN107784291A (en) target detection tracking method and device based on infrared video
CN108596055A (en) The airport target detection method of High spatial resolution remote sensing under a kind of complex background
CN103425967A (en) Pedestrian flow monitoring method based on pedestrian detection and tracking
CN105243356B (en) A kind of method and device that establishing pedestrian detection model and pedestrian detection method
CN110033475A (en) A kind of take photo by plane figure moving object segmentation and removing method that high-resolution texture generates
CN108648211A (en) A kind of small target detecting method, device, equipment and medium based on deep learning
WO2008020598A1 (en) Subject number detecting device and subject number detecting method
CN112836640A (en) Single-camera multi-target pedestrian tracking method
CN109191498A (en) Object detection method and system based on dynamic memory and motion perception
CN106504274A (en) A kind of visual tracking method and system based under infrared camera
CN106204633A (en) A kind of student trace method and apparatus based on computer vision
CN105809716A (en) Superpixel and three-dimensional self-organizing background subtraction algorithm-combined foreground extraction method
CN106156714A (en) The Human bodys' response method merged based on skeletal joint feature and surface character

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant