CN115131710A - Real-time action detection method based on multi-scale feature fusion attention - Google Patents


Info

Publication number
CN115131710A
CN115131710A (application CN202210785189.4A)
Authority
CN
China
Prior art keywords
frame
frames
attention
network
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210785189.4A
Other languages
Chinese (zh)
Inventor
柯逍
缪欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2022-07-05
Publication date: 2022-09-30
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202210785189.4A
Publication of CN115131710A
Legal status: Pending (Current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/762: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V 10/763: Non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames


Abstract

The invention relates to a real-time action detection method based on multi-scale feature fusion attention. First, the data-set video clips are divided into frame sets and augmented through a random reordering operation. Second, key frames are extracted from the input video clips, and optical flow information is extracted from the obtained key frames. The video clips, key frames and key-frame optical flow are then fed into a ResNext101 network and a Darknet network, respectively, for feature extraction; the features are enhanced by a multi-scale feature fusion attention module, the spatiotemporal features are concatenated and further fused through channel attention, class bounding boxes and confidence scores are obtained through classification and regression, and the final prediction is obtained through non-maximum suppression (NMS).

Description

Real-time action detection method based on multi-scale feature fusion attention
Technical Field
The invention relates to the field of pattern recognition and computer vision, in particular to a real-time action detection method based on multi-scale feature fusion attention.
Background
With the development of science and technology, action detection has become a hot research topic in recent years, and real-time action detection is increasingly applied in fields such as autonomous driving, security monitoring, transportation and human-computer interaction. Most state-of-the-art action detection methods adopt a two-stream architecture; however, optical flow is time-consuming to compute and requires a large amount of storage, and as the motion amplitude of a video changes, the optical flow of the whole video inevitably contains noisy segments, which degrades the final action feature representation. Second, most methods extract clip-level or frame-level features through 2D/3D networks and consider temporal dependencies only at a single scale (i.e., short-term or long-term), ignoring multi-scale temporal dependencies. Moreover, the temporal features and spatial features extracted by the deep networks are directly concatenated, which ignores the fact that these features come from different data sources and that the element-wise relationships within them also differ.
Disclosure of Invention
In view of this, the present invention provides a real-time action detection method based on multi-scale feature fusion attention, which can effectively recognize student behaviors. The method first divides the data-set video clips into frame sets and augments them through a random reordering operation. Second, key frames are extracted from the input video clips, and optical flow information is extracted from the obtained key frames. The video clips, key frames and key-frame optical flow are then fed into a ResNext101 network and a Darknet network, respectively, for feature extraction; the features are enhanced by a multi-scale feature fusion attention module, the spatiotemporal features are concatenated and further fused through channel attention, class bounding boxes and confidence scores are obtained through classification and regression, and the final prediction is obtained through non-maximum suppression (NMS).
The invention specifically adopts the following technical scheme:
A real-time motion detection method based on multi-scale feature fusion attention, characterized by comprising the following steps:
step S1: dividing the data-set video clips into frame sets, and performing data augmentation on them through a random reordering operation; extracting key frames from the video clips, and extracting optical flow information from the key frames;
step S2: inputting the obtained video clips into a ResNext101 network to extract temporal features and compressing them, and inputting the key frames and the key-frame optical flow into a Darknet network to extract spatial features and motion features;
step S3: obtaining multi-scale features by stacking motion attention modules with different dilation rates;
step S4: concatenating the spatiotemporal features and further fusing them through channel attention;
step S5: obtaining class bounding boxes and confidence scores through classification and regression networks, and finally obtaining the bounding box with the highest probability as the prediction result through non-maximum suppression (NMS).
Further, step S1 specifically includes the following steps:
step S11: uniformly sampling the data-set video clip at an interval of p frames, and dividing the sampled clip into n equal-length frame sets, i.e. S = {s_1, s_2, …, s_n}, where each frame set s_i consists of an equal-length sequence of video frames;
step S12: randomly reordering the frame sets {s_1, s_2, …, s_n} to form a new video clip S' = {s'_1, s'_2, …, s'_n}, achieving data augmentation for the training process;
step S13: dividing the input video clip into a beginning part, a middle part and an ending part, and randomly extracting one frame from each part as a key frame to briefly represent the video action;
step S14: extracting optical flow information for the key frames using the RAFT model; a code sketch of steps S11-S14 is given below.
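The preprocessing of steps S11-S14 can be sketched as follows in Python; the interval p, the number of frame sets n and the raft_model callable are illustrative placeholders (any RAFT-style estimator with a frame-pair interface), not the patented implementation, and computing flow between consecutive key frames is one plausible reading of step S14.

```python
import random

def build_frame_sets(frames, p=2, n=4):
    """Step S11: keep every p-th frame, then split the sampled clip into n equal-length frame sets."""
    sampled = frames[::p]
    set_len = len(sampled) // n
    return [sampled[i * set_len:(i + 1) * set_len] for i in range(n)]

def shuffle_frame_sets(frame_sets):
    """Step S12: randomly reorder the frame sets to form an augmented clip S'."""
    order = list(range(len(frame_sets)))
    random.shuffle(order)
    return [frame_sets[i] for i in order]

def pick_key_frames(frames):
    """Step S13: one random frame from the beginning, middle and ending thirds of the clip."""
    third = len(frames) // 3
    parts = [frames[:third], frames[third:2 * third], frames[2 * third:]]
    return [part[random.randrange(len(part))] for part in parts]

def key_frame_flow(key_frames, raft_model):
    """Step S14: optical flow between consecutive key frames; raft_model is assumed to map
    a frame pair to a flow field."""
    return [raft_model(key_frames[i], key_frames[i + 1]) for i in range(len(key_frames) - 1)]
```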
Further, step S2 specifically includes the following steps:
step S21: inputting the obtained video clip into the 3D backbone network ResNext101 to extract the temporal feature M ∈ R^(C×T×H×W), where T is the number of input frames, H and W are the height and width of the input images, and C is the number of output channels;
step S22: inputting the key frames into the 2D backbone network Darknet to extract the spatial feature K ∈ R^(C'×H×W);
step S23: inputting the key-frame optical flow extracted by the RAFT model into the 2D backbone network Darknet to extract the motion feature O ∈ R^(C''×H×W);
step S24: to match the output feature maps of the 2D backbone network, reducing the depth dimension of the ResNext101 output feature M to 1, compressing the output volume to [C × H × W] and obtaining the compressed feature M' ∈ R^(C×H×W); the corresponding tensor shapes are sketched below.
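A minimal sketch of the tensor shapes in steps S21-S24, assuming PyTorch; the tensors are random stand-ins for the ResNext101 and Darknet outputs, the channel counts are illustrative, and the mean over the temporal axis is one plausible way to compress the depth dimension to 1, not necessarily the patented one.

```python
import torch

# Illustrative tensors with the shapes used in steps S21-S23 (a batch dimension is added).
B, C, T, H, W = 1, 2048, 16, 7, 7
M = torch.randn(B, C, T, H, W)   # 3D backbone output: temporal feature M in R^(C×T×H×W)
K = torch.randn(B, 1024, H, W)   # 2D backbone output on key frames: spatial feature K
O = torch.randn(B, 1024, H, W)   # 2D backbone output on key-frame flow: motion feature O

# Step S24: collapse the depth/temporal dimension of M to 1 so that M' matches the 2D maps.
M_prime = M.mean(dim=2)          # one way to compress T -> 1; shape [B, C, H, W]
assert M_prime.shape == (B, C, H, W)
```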
Further, step S3 specifically includes the following steps:
step S31: passing each of the three extracted features K, O and M' through two projection layers to generate 512-channel feature maps; each projection uses a 1×1 convolutional layer to reduce the channel dimension and a 3×3 convolutional layer to refine the semantic context;
step S32: stacking motion attention modules with different dilation rates to generate output features K', O' and M'' with multiple receptive fields, covering objects of all sizes;
the structure of the motion attention module is expressed as:
X_out = X_attn * X_res + X_in
X_attn = F_attn(APool(X_in); θ, Ω)
X_res = F(X_in; θ, Ω)
where F(·) denotes a residual function, APool(·) denotes an average pooling layer, and θ and Ω denote the structures of the convolutional layers; APool(·) performs a non-full compression, and X_attn * X_res is then upsampled to match the output of X_in; a sketch of this module is given below.
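A minimal PyTorch sketch of one motion attention module; the branch structures θ and Ω are not specified in the text, so a dilated 3×3 convolution is assumed for F(·) and a sigmoid-gated convolution for F_attn(·), and only the overall form X_out = X_attn * X_res + X_in, with average pooling and upsampling, follows the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionAttention(nn.Module):
    """Sketch of one motion attention module: X_out = X_attn * X_res + X_in.
    The branch structures (θ, Ω) are assumptions; only the overall form follows the text."""
    def __init__(self, channels: int, dilation: int = 1, pool_size: int = 4):
        super().__init__()
        self.pool_size = pool_size
        # Residual branch F(·): a dilated 3x3 convolution (assumed).
        self.res = nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation)
        # Attention branch F_attn(·): convolution + sigmoid on the pooled map (assumed).
        self.attn = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_res = self.res(x)
        # Non-full compression via average pooling, then attention weights.
        pooled = F.adaptive_avg_pool2d(x, self.pool_size)
        x_attn = self.attn(pooled)
        # Upsample the attended response back to the input resolution.
        up = F.interpolate(x_attn, size=x.shape[-2:], mode="bilinear", align_corners=False)
        return up * x_res + x

# Stacking modules with different dilation rates yields the multi-receptive-field
# features K', O' and M'' of step S32 (512 channels from the projection of step S31).
block = nn.Sequential(MotionAttention(512, dilation=1),
                      MotionAttention(512, dilation=2),
                      MotionAttention(512, dilation=4))
```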
Further, step S4 specifically includes the following steps:
step S41: concatenating the features K', O' and M'' to obtain the feature A ∈ R^((C+C'+C'')×H×W);
step S42: inputting the feature A into two convolutional layers to generate a new feature map B ∈ R^(C×H×W); then reshaping the feature map B to size C×N to obtain F ∈ R^(C×N), where N = H×W;
step S43: multiplying F ∈ R^(C×N) by its transpose F^T ∈ R^(N×C) to compute the feature correlation between channels, generating a matrix G ∈ R^(C×C);
step S44: inputting the matrix into a Softmax layer to generate the channel attention map Q ∈ R^(C×C);
step S45: performing matrix multiplication between the channel attention map Q and the feature F, and reshaping the result back to the same three-dimensional shape as the feature map B to obtain the feature F' ∈ R^(C×H×W);
step S46: combining the tensor F' with the original input feature map B through a summation operation to obtain the output C ∈ R^(C×H×W):
C = δ·F' + B
where δ is a trainable parameter; a sketch of this channel attention fusion is given below.
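A minimal PyTorch sketch of the channel attention fusion of steps S41-S46; the kernel sizes of the two projection convolutions are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Sketch of steps S41-S46: concatenate K', O', M'', project to C channels,
    compute a C×C channel attention map and fuse it back with a trainable scalar δ."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.project = nn.Sequential(                      # step S42: two conv layers, A -> B
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.Conv2d(out_channels, out_channels, 1),
        )
        self.delta = nn.Parameter(torch.zeros(1))          # trainable δ in C = δ·F' + B

    def forward(self, k: torch.Tensor, o: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        a = torch.cat([k, o, m], dim=1)                    # step S41: A in R^((C+C'+C'')×H×W)
        b = self.project(a)                                # step S42: B in R^(C×H×W)
        n, c, h, w = b.shape
        f = b.view(n, c, h * w)                            # step S42: F in R^(C×N), N = H×W
        g = torch.bmm(f, f.transpose(1, 2))                # step S43: G = F F^T in R^(C×C)
        q = torch.softmax(g, dim=-1)                       # step S44: channel attention map Q
        f_prime = torch.bmm(q, f).view(n, c, h, w)         # step S45: reshape back to C×H×W
        return self.delta * f_prime + b                    # step S46: C = δ·F' + B

# For the 512-channel K', O' and M'' of step S32: in_channels = 3 * 512, out_channels = 512.
fusion = ChannelAttentionFusion(3 * 512, 512)
```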
Further, step S5 specifically includes the following steps:
step S51: passing the fused features through a convolutional layer with 1×1 kernels to generate an output of size [(5 × (NumCls + 5)) × H × W], where each group of (NumCls + 5) channels comprises NumCls class action scores cls, the 4 coordinates [bx, by, bw, bh] and a confidence score Conf;
step S52: selecting 5 prior anchor boxes on the data set by a k-means clustering algorithm (a sketch follows);
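A minimal sketch of the anchor selection in step S52, using plain Euclidean k-means over ground-truth (width, height) pairs; IoU-based distances are a common alternative that the text does not rule in or out.

```python
import numpy as np

def kmeans_anchors(wh: np.ndarray, k: int = 5, iters: int = 100, seed: int = 0) -> np.ndarray:
    """Cluster ground-truth box (width, height) pairs into k prior anchor boxes."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the nearest anchor center.
        assign = np.argmin(((wh[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        new_centers = np.array([wh[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers

# Example with random box sizes standing in for data-set statistics.
anchors = kmeans_anchors(np.abs(np.random.randn(1000, 2)) * 100 + 20)
```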
step S53: on the basis of the initial anchor boxes, regressing the bounding-box position and confidence through a Sigmoid, computing the bounding-box loss with the CIoU loss and the confidence loss with a binary cross-entropy loss, where the CIoU loss is computed as:
L_CIoU = 1 - IoU + ρ²(b, b^gt)/u² + α·v
v = (4/π²)·(arctan(w^gt/h^gt) - arctan(w/h))², α = v/((1 - IoU) + v)
where b and b^gt denote the center points of the two rectangular boxes, i.e. the coordinates [bx, by] and [x^gt, y^gt], ρ denotes the Euclidean distance between the two center points, u denotes the diagonal length of the smallest region enclosing the two rectangular boxes, IoU is the ratio of the overlapping area of the bounding boxes to their total (union) area, v measures the consistency of the aspect ratios of the predicted box (w, h) and the ground-truth box (w^gt, h^gt), and α is the corresponding trade-off weight;
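A minimal PyTorch sketch of the CIoU bounding-box loss of step S53 for boxes given as [cx, cy, w, h]; it follows the standard CIoU formulation rather than reproducing the patent's own figure.

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """CIoU loss for predicted and ground-truth boxes of shape [N, 4] in [cx, cy, w, h] format."""
    # Convert center/size to corner coordinates.
    p_x1, p_y1 = pred[:, 0] - pred[:, 2] / 2, pred[:, 1] - pred[:, 3] / 2
    p_x2, p_y2 = pred[:, 0] + pred[:, 2] / 2, pred[:, 1] + pred[:, 3] / 2
    g_x1, g_y1 = gt[:, 0] - gt[:, 2] / 2, gt[:, 1] - gt[:, 3] / 2
    g_x2, g_y2 = gt[:, 0] + gt[:, 2] / 2, gt[:, 1] + gt[:, 3] / 2

    # IoU term.
    inter = (torch.min(p_x2, g_x2) - torch.max(p_x1, g_x1)).clamp(0) * \
            (torch.min(p_y2, g_y2) - torch.max(p_y1, g_y1)).clamp(0)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter + eps
    iou = inter / union

    # ρ²(b, b^gt): squared center distance; u²: squared diagonal of the enclosing box.
    rho2 = (pred[:, 0] - gt[:, 0]) ** 2 + (pred[:, 1] - gt[:, 1]) ** 2
    cw = torch.max(p_x2, g_x2) - torch.min(p_x1, g_x1)
    ch = torch.max(p_y2, g_y2) - torch.min(p_y1, g_y1)
    u2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term v and its weight α.
    v = (4 / math.pi ** 2) * (torch.atan(gt[:, 2] / (gt[:, 3] + eps)) -
                              torch.atan(pred[:, 2] / (pred[:, 3] + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return (1 - iou + rho2 / u2 + alpha * v).mean()
```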
step S54: classifying through a fully connected layer and a Softmax layer, and computing the classification loss with the Focal Loss, whose calculation formula is:
L_cls = -α·(1 - cls^gt)^γ·log(cls^gt)
where both α and γ are adjustable hyper-parameters, and cls^gt is the model's predicted probability for the ground-truth class, taking a value between 0 and 1;
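A minimal sketch of this focal loss under the reading that cls^gt is the Softmax probability assigned to the ground-truth class; this interpretation and the default α, γ values are assumptions.

```python
import torch

def focal_loss(cls_gt: torch.Tensor, alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Focal classification loss, with cls_gt read as the predicted probability of the true class."""
    cls_gt = cls_gt.clamp(1e-7, 1.0)   # numerical stability before the logarithm
    return (-alpha * (1 - cls_gt) ** gamma * torch.log(cls_gt)).mean()

# Example: probabilities assigned to the true class for a batch of 4 predictions.
print(focal_loss(torch.tensor([0.9, 0.6, 0.3, 0.05])))
```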
step S55: adding the bounding-box loss, the confidence loss and the classification loss to obtain the total loss, and updating the network parameters by back-propagation;
step S56: selecting a confidence threshold, taking out and sorting, for each class, the boxes whose scores are greater than the threshold, filtering out low-score predicted bounding boxes, applying non-maximum suppression (NMS) using the positions and scores of the boxes, and finally obtaining the bounding box with the highest probability as the prediction result;
non-maximum suppression (NMS) sorts the scores of all predicted bounding boxes, selects the highest score and its corresponding box, traverses the remaining boxes and deletes any box whose IoU with the current highest-scoring box is greater than a certain threshold; it then selects the highest-scoring box among the unprocessed boxes and repeats the process, as sketched below.
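A minimal PyTorch sketch of the NMS procedure of step S56 for boxes given as [x1, y1, x2, y2]; torchvision.ops.nms provides an equivalent optimized routine.

```python
import torch

def nms(boxes: torch.Tensor, scores: torch.Tensor, iou_thresh: float = 0.5) -> torch.Tensor:
    """Greedy NMS over boxes of shape [N, 4] in [x1, y1, x2, y2] format; returns kept indices."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        # IoU of the current highest-scoring box with all remaining boxes.
        x1 = torch.max(boxes[i, 0], boxes[rest, 0])
        y1 = torch.max(boxes[i, 1], boxes[rest, 1])
        x2 = torch.min(boxes[i, 2], boxes[rest, 2])
        y2 = torch.min(boxes[i, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-7)
        # Keep only boxes that do not overlap too much with the selected box.
        order = rest[iou <= iou_thresh]
    return torch.tensor(keep, dtype=torch.long)
```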
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the multi-scale feature fusion attention-based real-time motion detection method as described above when executing the program.
A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements a multi-scale feature fusion attention based real-time motion detection method as described above.
Compared with the prior art, the invention and the preferred scheme thereof have the following beneficial effects:
by rearranging and splicing the video segments, the diversity of data is increased on the premise of ensuring that the semantic information and the time dependency of the video are not damaged according to the time dependency among the video segments.
Aiming at introducing optical flow information to confusable actions in action detection for processing hard samples, a key frame-based optical flow information data input method is provided to replace the traditional optical flow data input. And time sequence information among video frames is reserved, and motion information is acquired through change among key frames and optical flow information. Compared with the traditional data input, the motion information can be acquired more clearly, the generation of noise data is effectively avoided, and the calculation amount and the storage space of optical flow information are saved.
Based on the multi-scale feature fusion attention, the multi-scale features are fused by extracting the multi-scale motion features of the targets with different scales, and the multi-scale feature fusion method is different from the traditional multi-scale feature fusion method, wherein the multi-scale feature attention module only uses the last layer of feature map to perform multi-scale fusion, and the calculation cost is reduced.
Drawings
The invention is described in further detail below with reference to the accompanying drawings and the detailed description;
fig. 1 is a schematic diagram of the flow and working principle of the embodiment of the invention.
Detailed Description
In order to make the features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail as follows:
it should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in fig. 1, the present embodiment provides a real-time motion detection method based on multi-scale feature fusion attention, which specifically includes the following steps:
step S1: dividing the data-set video clips into frame sets, and performing data augmentation on them through a random reordering operation; then extracting key frames from the video clips, and extracting optical flow information from the key frames;
step S2: inputting the obtained video clip into a ResNext101 network to extract temporal features and compressing them, and inputting the key frames and the key-frame optical flow into a Darknet network to extract spatial features and motion features;
step S3: performing multi-scale feature fusion on the features through a multi-scale feature fusion attention module;
step S4: concatenating the spatiotemporal features and further fusing them through channel attention;
step S5: obtaining class bounding boxes and confidence scores through classification and regression networks, and finally obtaining the bounding box with the highest probability as the prediction result through non-maximum suppression (NMS).
In this embodiment, the step S1 includes the following steps:
step S11: uniformly sampling the data-set video clip at an interval of p frames, and dividing the sampled clip into n equal-length frame sets, i.e. S = {s_1, s_2, …, s_n}, where each frame set s_i consists of an equal-length sequence of video frames;
step S12: randomly reordering the frame sets {s_1, s_2, …, s_n} to form a new video clip S' = {s'_1, s'_2, …, s'_n}, achieving data augmentation for the training process;
step S13: dividing the input video clip into a beginning part, a middle part and an ending part, and randomly extracting one frame from each part as a key frame to briefly represent the video action;
step S14: extracting optical flow information from the key frames using the RAFT model;
the RAFT model is an end-to-end deep neural network for optical flow estimation; it has strong generalization ability and is efficient in terms of training speed, number of parameters and inference time.
In this embodiment, step S2 specifically includes the following steps:
step S21: inputting the obtained video clip into the 3D backbone network ResNext101 to extract the temporal feature M ∈ R^(C×T×H×W), where T is the number of input frames, H and W are the height and width of the input images, and C is the number of output channels;
step S22: inputting the key frames into the 2D backbone network Darknet to extract the spatial feature K ∈ R^(C'×H×W);
step S23: inputting the key-frame optical flow extracted by the RAFT model into the 2D backbone network Darknet to extract the motion feature O ∈ R^(C''×H×W);
step S24: to match the output feature maps of the 2D backbone network, reducing the depth dimension of the ResNext101 output feature M to 1, compressing the output volume to [C × H × W] and obtaining the compressed feature M' ∈ R^(C×H×W).
In this embodiment, step S3 specifically includes the following steps:
step S31: passing each of the three extracted features K, O and M' through two projection layers (a 1×1 convolutional layer to reduce the channel dimension and a 3×3 convolutional layer to refine the semantic context), generating 512-channel feature maps;
step S32: stacking motion attention modules with different dilation rates, generating output features K', O' and M'' with multiple receptive fields, covering objects of all sizes.
The motion attention module may be expressed as:
X_out = X_attn * X_res + X_in
X_attn = F_attn(APool(X_in); θ, Ω)
X_res = F(X_in; θ, Ω)
where F(·) denotes a residual function, APool(·) denotes an average pooling layer, and θ and Ω denote the structures of the convolutional layers. APool(·) performs a non-full compression, and X_attn * X_res is then upsampled to match the output of X_in;
in this embodiment, step S4 specifically includes the following steps:
step S41: concatenating the features K', O' and M'' to obtain the feature A ∈ R^((C+C'+C'')×H×W);
step S42: inputting the feature A into two convolutional layers to generate a new feature map B ∈ R^(C×H×W); then reshaping B to size C×N to obtain F ∈ R^(C×N), where N = H×W;
step S43: multiplying F ∈ R^(C×N) by its transpose F^T ∈ R^(N×C) to compute the feature correlation between channels, generating a matrix G ∈ R^(C×C);
step S44: inputting the matrix into a Softmax layer to generate the channel attention map Q ∈ R^(C×C);
step S45: performing matrix multiplication between the channel attention map Q and the feature F, and reshaping the result back to the same three-dimensional shape as the feature map B to obtain the feature F' ∈ R^(C×H×W);
step S46: combining the tensor F' with the original input feature map B through a summation operation to obtain the output C ∈ R^(C×H×W):
C = δ·F' + B
where δ is a trainable parameter.
In this embodiment, step S5 specifically includes the following steps:
step S51: passing the fused features through a convolutional layer with 1×1 kernels to generate an output of size [(5 × (NumCls + 5)) × H × W], where each group of (NumCls + 5) channels comprises NumCls class action scores cls, the 4 coordinates [bx, by, bw, bh] and a confidence score Conf;
step S52: selecting 5 prior anchor boxes on the data set by a k-means clustering algorithm;
step S53: on the basis of the initial anchor boxes, regressing the bounding-box position and confidence through a Sigmoid, computing the bounding-box loss with the CIoU loss and the confidence loss with a binary cross-entropy loss, where the CIoU loss is computed as:
L_CIoU = 1 - IoU + ρ²(b, b^gt)/u² + α·v
v = (4/π²)·(arctan(w^gt/h^gt) - arctan(w/h))², α = v/((1 - IoU) + v)
where b and b^gt denote the center points of the two rectangular boxes, i.e. the coordinates [bx, by] and [x^gt, y^gt], ρ denotes the Euclidean distance between the two center points, u denotes the diagonal length of the smallest region enclosing the two rectangular boxes, IoU is the ratio of the overlapping area of the bounding boxes to their total (union) area, v measures the consistency of the aspect ratios of the predicted box (w, h) and the ground-truth box (w^gt, h^gt), and α is the corresponding trade-off weight.
Step S54: classifying through a full connection layer and a Softmax layer, and calculating the classification Loss through the Focal local, wherein the calculation formula is as follows:
L_cls = -α·(1 - cls^gt)^γ·log(cls^gt)
where both α and γ are adjustable hyper-parameters, and cls^gt is the model's predicted probability for the ground-truth class, taking a value between 0 and 1.
Step S55: adding the boundary frame loss, the confidence coefficient loss and the classification loss to obtain a total loss, and reversely updating the network parameters;
step S56: selecting a confidence threshold, taking out the frames and scores of each class with the scores larger than a certain threshold for sorting, filtering out low-threshold prediction boundary frames, performing NMS (non-maximum suppression) by using the positions and scores of the frames, and finally obtaining the boundary frame with the maximum probability as a prediction result.
NMS (non-maximum suppression) sorts the scores of all predicted bounding boxes, selects the highest score and its corresponding box, traverses the rest boxes, and deletes its box if the IOU is larger than a certain threshold value. And continuing to select one with the highest score from the unprocessed boxes, and repeating the process.
In particular, the invention performs real-time action detection based on multi-scale feature fusion attention. By rearranging and re-splicing the frame sets according to the temporal dependency among the video segments, the diversity of the data is increased while the semantic information and the temporal dependency of the video are preserved. To handle hard samples, i.e. easily confusable actions in action detection, optical flow information is introduced, and a key-frame-based optical flow input is proposed to replace the traditional optical flow input. The temporal information between video frames is retained, and motion information is obtained from the changes between key frames and their optical flow. Compared with the traditional input, motion information can be captured more clearly, the generation of noisy data is effectively avoided, and the computation and storage cost of optical flow is reduced. Based on the multi-scale feature fusion attention, multi-scale motion features are extracted for targets of different sizes and then fused; unlike traditional multi-scale feature fusion, the multi-scale feature attention module performs multi-scale fusion on the last feature map only, which reduces the computational cost.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each flow of the flowcharts, and combinations of flows in the flowcharts, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows.
The foregoing is directed to preferred embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow. However, any simple modification, equivalent change and modification of the above embodiments according to the technical essence of the present invention are within the protection scope of the technical solution of the present invention.
The present invention is not limited to the above preferred embodiments, and other various real-time motion detection methods based on multi-scale feature fusion attention can be obtained by anyone skilled in the art according to the teaching of the present invention.

Claims (6)

1. A real-time motion detection method based on multi-scale feature fusion attention, characterized by comprising the following steps:
step S1: dividing the data-set video clips into frame sets, and performing data augmentation on them through a random reordering operation; extracting key frames from the video clips, and extracting optical flow information from the key frames;
step S2: inputting the obtained video clips into a ResNext101 network to extract temporal features and compressing them, and inputting the key frames and the key-frame optical flow into a Darknet network to extract spatial features and motion features;
step S3: obtaining multi-scale features by stacking motion attention modules with different dilation rates;
step S4: concatenating the spatiotemporal features and further fusing them through channel attention;
step S5: obtaining class bounding boxes and confidence scores through classification and regression networks, and finally obtaining the bounding box with the highest probability as the prediction result through non-maximum suppression (NMS).
2. The real-time motion detection method based on multi-scale feature fusion attention of claim 1, characterized in that: step S1 specifically includes the following steps:
step S11: uniformly sampling the data-set video clip at an interval of p frames, and dividing the sampled clip into n equal-length frame sets, i.e. S = {s_1, s_2, …, s_n}, where each frame set s_i consists of an equal-length sequence of video frames;
step S12: randomly reordering the frame sets {s_1, s_2, …, s_n} to form a new video clip S' = {s'_1, s'_2, …, s'_n}, achieving data augmentation for the training process;
step S13: dividing the input video clip into a beginning part, a middle part and an ending part, and randomly extracting one frame from each part as a key frame to briefly represent the video action;
step S14: extracting optical flow information for the key frames using the RAFT model.
3. The multi-scale feature fusion attention-based real-time motion detection method according to claim 2, characterized in that: step S2 specifically includes the following steps:
step S21: inputting the obtained video clip into the 3D backbone network ResNext101 to extract the temporal feature M ∈ R^(C×T×H×W), where T is the number of input frames, H and W are the height and width of the input images, and C is the number of output channels;
step S22: inputting the key frames into the 2D backbone network Darknet to extract the spatial feature K ∈ R^(C'×H×W);
step S23: inputting the key-frame optical flow extracted by the RAFT model into the 2D backbone network Darknet to extract the motion feature O ∈ R^(C''×H×W);
step S24: to match the output feature maps of the 2D backbone network, reducing the depth dimension of the ResNext101 output feature M to 1, compressing the output volume to [C × H × W] and obtaining the compressed feature M' ∈ R^(C×H×W).
4. The multi-scale feature fusion attention-based real-time motion detection method according to claim 3, characterized in that: step S3 specifically includes the following steps:
step S31: passing each of the three extracted features K, O and M' through two projection layers to generate 512-channel feature maps; each projection uses a 1×1 convolutional layer to reduce the channel dimension and a 3×3 convolutional layer to refine the semantic context;
step S32: stacking motion attention modules with different dilation rates to generate output features K', O' and M'' with multiple receptive fields, covering objects of all sizes;
the structure of the motion attention module is expressed as:
X_out = X_attn * X_res + X_in
X_attn = F_attn(APool(X_in); θ, Ω)
X_res = F(X_in; θ, Ω)
where F(·) denotes a residual function, APool(·) denotes an average pooling layer, and θ and Ω denote the structures of the convolutional layers; APool(·) performs a non-full compression, and X_attn * X_res is then upsampled to match the output of X_in.
5. The multi-scale feature fusion attention-based real-time motion detection method of claim 4, wherein: step S4 specifically includes the following steps:
step S41: concatenating the features K', O' and M'' to obtain the feature A ∈ R^((C+C'+C'')×H×W);
step S42: inputting the feature A into two convolutional layers to generate a new feature map B ∈ R^(C×H×W); then reshaping the feature map B to size C×N to obtain F ∈ R^(C×N), where N = H×W;
step S43: multiplying F ∈ R^(C×N) by its transpose F^T ∈ R^(N×C) to compute the feature correlation between channels, generating a matrix G ∈ R^(C×C);
step S44: inputting the matrix into a Softmax layer to generate the channel attention map Q ∈ R^(C×C);
step S45: performing matrix multiplication between the channel attention map Q and the feature F, and reshaping the result back to the same three-dimensional shape as the feature map B to obtain the feature F' ∈ R^(C×H×W);
step S46: combining the tensor F' with the original input feature map B through a summation operation to obtain the output C ∈ R^(C×H×W):
C = δ·F' + B
where δ is a trainable parameter.
6. The multi-scale feature fusion attention-based real-time motion detection method of claim 5, wherein: step S5 specifically includes the following steps:
step S51: passing the fused features through a convolutional layer with 1×1 kernels to generate an output of size [(5 × (NumCls + 5)) × H × W], where each group of (NumCls + 5) channels comprises NumCls class action scores cls, the 4 coordinates [bx, by, bw, bh] and a confidence score Conf;
step S52: selecting 5 prior anchor boxes on the data set by a k-means clustering algorithm;
step S53: on the basis of the initial anchor boxes, regressing the bounding-box position and confidence through a Sigmoid, computing the bounding-box loss with the CIoU loss and the confidence loss with a binary cross-entropy loss, where the CIoU loss is computed as:
L_CIoU = 1 - IoU + ρ²(b, b^gt)/u² + α·v
v = (4/π²)·(arctan(w^gt/h^gt) - arctan(w/h))², α = v/((1 - IoU) + v)
where b and b^gt denote the center points of the two rectangular boxes, i.e. the coordinates [bx, by] and [x^gt, y^gt], ρ denotes the Euclidean distance between the two center points, u denotes the diagonal length of the smallest region enclosing the two rectangular boxes, IoU is the ratio of the overlapping area of the bounding boxes to their total (union) area, v measures the consistency of the aspect ratios of the predicted box (w, h) and the ground-truth box (w^gt, h^gt), and α is the corresponding trade-off weight;
step S54: classifying through a fully connected layer and a Softmax layer, and computing the classification loss with the Focal Loss, whose calculation formula is:
L_cls = -α·(1 - cls^gt)^γ·log(cls^gt)
where both α and γ are adjustable hyper-parameters, and cls^gt is the model's predicted probability for the ground-truth class, taking a value between 0 and 1;
step S55: adding the bounding-box loss, the confidence loss and the classification loss to obtain the total loss, and updating the network parameters by back-propagation;
step S56: selecting a confidence threshold, taking out and sorting, for each class, the boxes whose scores are greater than the threshold, filtering out low-score predicted bounding boxes, applying non-maximum suppression (NMS) using the positions and scores of the boxes, and finally obtaining the bounding box with the highest probability as the prediction result;
non-maximum suppression (NMS) sorts the scores of all predicted bounding boxes, selects the highest score and its corresponding box, traverses the remaining boxes and deletes any box whose IoU with the current highest-scoring box is greater than a certain threshold; it then selects the highest-scoring box among the unprocessed boxes and repeats the process.
CN202210785189.4A 2022-07-05 2022-07-05 Real-time action detection method based on multi-scale feature fusion attention Pending CN115131710A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210785189.4A CN115131710A (en) 2022-07-05 2022-07-05 Real-time action detection method based on multi-scale feature fusion attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210785189.4A CN115131710A (en) 2022-07-05 2022-07-05 Real-time action detection method based on multi-scale feature fusion attention

Publications (1)

Publication Number Publication Date
CN115131710A true CN115131710A (en) 2022-09-30

Family

ID=83382942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210785189.4A Pending CN115131710A (en) 2022-07-05 2022-07-05 Real-time action detection method based on multi-scale feature fusion attention

Country Status (1)

Country Link
CN (1) CN115131710A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115763167A (en) * 2022-11-22 2023-03-07 黄华集团有限公司 Solid cabinet breaker and control method thereof
CN115883878A (en) * 2022-11-25 2023-03-31 南方科技大学 Video editing method and device, electronic equipment and storage medium
CN117671357A (en) * 2023-12-01 2024-03-08 广东技术师范大学 Pyramid algorithm-based prostate cancer ultrasonic video classification method and system
WO2024136115A1 (en) * 2022-12-23 2024-06-27 한국전자기술연구원 Human micro-gesture recognition system and method to which multi-frame time-axis channel-crossing algorithm is applied

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
CN110287826A (en) * 2019-06-11 2019-09-27 北京工业大学 A kind of video object detection method based on attention mechanism
CN111627052A (en) * 2020-04-30 2020-09-04 沈阳工程学院 Action identification method based on double-flow space-time attention mechanism
CN112434608A (en) * 2020-11-24 2021-03-02 山东大学 Human behavior identification method and system based on double-current combined network
CN114373194A (en) * 2022-01-14 2022-04-19 南京邮电大学 Human behavior identification method based on key frame and attention mechanism
WO2022134655A1 (en) * 2020-12-25 2022-06-30 神思电子技术股份有限公司 End-to-end video action detection and positioning system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
CN110287826A (en) * 2019-06-11 2019-09-27 北京工业大学 A kind of video object detection method based on attention mechanism
CN111627052A (en) * 2020-04-30 2020-09-04 沈阳工程学院 Action identification method based on double-flow space-time attention mechanism
CN112434608A (en) * 2020-11-24 2021-03-02 山东大学 Human behavior identification method and system based on double-current combined network
WO2022134655A1 (en) * 2020-12-25 2022-06-30 神思电子技术股份有限公司 End-to-end video action detection and positioning system
CN114373194A (en) * 2022-01-14 2022-04-19 南京邮电大学 Human behavior identification method based on key frame and attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
张聪聪; 何宁: "Human action recognition method based on a key-frame two-stream convolutional network", Journal of Nanjing University of Information Science & Technology (Natural Science Edition), no. 06, 28 November 2019, pages 96-101 *
柯逍 et al.: "Real-time action detection method based on spatiotemporal cross-perception", Acta Electronica Sinica, no. 2, 29 February 2024, pages 574-588 *
柯逍; 缪欣: "Real-Time Action Detection Method based on Multi-Scale Spatiotemporal Feature", 2022 International Conference on Image Processing, Computer Vision and Machine Learning, 30 October 2022, pages 245-248, XP034273756, DOI: 10.1109/ICICML57342.2022.10009833 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115763167A (en) * 2022-11-22 2023-03-07 黄华集团有限公司 Solid cabinet breaker and control method thereof
CN115763167B (en) * 2022-11-22 2023-09-22 黄华集团有限公司 Solid cabinet circuit breaker and control method thereof
CN115883878A (en) * 2022-11-25 2023-03-31 南方科技大学 Video editing method and device, electronic equipment and storage medium
WO2024136115A1 (en) * 2022-12-23 2024-06-27 한국전자기술연구원 Human micro-gesture recognition system and method to which multi-frame time-axis channel-crossing algorithm is applied
CN117671357A (en) * 2023-12-01 2024-03-08 广东技术师范大学 Pyramid algorithm-based prostate cancer ultrasonic video classification method and system
CN117671357B (en) * 2023-12-01 2024-07-05 广东技术师范大学 Pyramid algorithm-based prostate cancer ultrasonic video classification method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination