CN115131710A - Real-time action detection method based on multi-scale feature fusion attention - Google Patents


Info

Publication number
CN115131710A
CN115131710A (application CN202210785189.4A)
Authority
CN
China
Prior art keywords
frame
frames
attention
network
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210785189.4A
Other languages
Chinese (zh)
Inventor
柯逍
缪欣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2022-07-05
Publication date: 2022-09-30
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN202210785189.4A
Publication of CN115131710A
Legal status: Pending (Current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/762: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • G06V 10/763: Non-hierarchical techniques, e.g. based on statistics of modelling distributions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00: Scenes; Scene-specific elements
    • G06V 20/40: Scenes; Scene-specific elements in video content
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames


Abstract

The invention relates to a real-time action detection method based on multi-scale feature fusion attention. First, the data-set video clips are divided into frame sets and augmented through a random reordering operation. Second, key frames are extracted from the input video clips, and optical flow information is extracted from the obtained key frames. The video clips, key frames and key-frame optical flow are then fed into a ResNext101 network and a Darknet network, respectively, for feature extraction; the features are enhanced by a multi-scale feature fusion attention module, the spatiotemporal features are concatenated and further fused through channel attention, class bounding boxes and confidence scores are obtained through classification and regression, and the final prediction is obtained through non-maximum suppression (NMS).

Description

Real-time action detection method based on multi-scale feature fusion attention
Technical Field
The invention relates to the field of pattern recognition and computer vision, in particular to a real-time action detection method based on multi-scale feature fusion attention.
Background
With the development of science and technology, action detection has become a hot research topic in recent years, and real-time action detection is increasingly applied in fields such as autonomous driving, security monitoring, transportation and human-computer interaction. Most state-of-the-art action detection methods adopt a two-stream architecture; however, optical flow is time-consuming to compute and requires a large amount of storage, and as the motion amplitude of a video changes, the optical flow of the whole video inevitably contains noisy segments, which degrades the final action feature representation. Second, most methods extract clip-level or frame-level features through 2D/3D networks and consider temporal dependencies only at a single scale (i.e., short-term or long-term), ignoring multi-scale temporal dependencies. Moreover, the temporal features and spatial features extracted by the deep networks are directly concatenated, which ignores the fact that these features come from different data sources and that the element-wise relationships within them also differ.
Disclosure of Invention
In view of this, the present invention provides a real-time action detection method based on multi-scale feature fusion attention, which can effectively recognize student behaviors. The method first divides the data-set video clips into frame sets and augments them through a random reordering operation. Second, key frames are extracted from the input video clips, and optical flow information is extracted from the obtained key frames. The video clips, key frames and key-frame optical flow are then fed into a ResNext101 network and a Darknet network, respectively, for feature extraction; the features are enhanced by a multi-scale feature fusion attention module, the spatiotemporal features are concatenated and further fused through channel attention, class bounding boxes and confidence scores are obtained through classification and regression, and the final prediction is obtained through non-maximum suppression (NMS).
The invention specifically adopts the following technical scheme:
A real-time motion detection method based on multi-scale feature fusion attention, characterized by comprising the following steps:
step S1: dividing the data-set video clips into frame sets, and performing data augmentation on them through a random reordering operation; extracting key frames from the video clips, and extracting optical flow information from the key frames;
step S2: inputting the obtained video clips into a ResNext101 network to extract temporal features and compressing them, and inputting the key frames and the key-frame optical flow into a Darknet network to extract spatial features and motion features;
step S3: obtaining multi-scale features by stacking motion attention modules with different dilation rates;
step S4: concatenating the spatiotemporal features and further fusing them through channel attention;
step S5: obtaining class bounding boxes and confidence scores through classification and regression networks, and finally obtaining the bounding box with the highest probability as the prediction result through non-maximum suppression (NMS).
Further, step S1 specifically includes the following steps:
step S11: uniformly sampling the data-set video clip at an interval of p frames, and dividing the sampled clip into n equal-length frame sets, i.e. S = {s_1, s_2, …, s_n}, where each frame set s_i consists of an equal-length sequence of video frames;
step S12: randomly reordering the frame sets {s_1, s_2, …, s_n} to form a new video clip S' = {s'_1, s'_2, …, s'_n}, achieving data augmentation for the training process;
step S13: dividing the input video clip into a beginning part, a middle part and an ending part, and randomly extracting one frame from each part as a key frame to briefly represent the video action;
step S14: extracting optical flow information for the key frames using the RAFT model; a code sketch of steps S11-S14 is given below.
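The preprocessing of steps S11-S14 can be sketched as follows in Python; the interval p, the number of frame sets n and the raft_model callable are illustrative placeholders (any RAFT-style estimator with a frame-pair interface), not the patented implementation, and computing flow between consecutive key frames is one plausible reading of step S14.

```python
import random

def build_frame_sets(frames, p=2, n=4):
    """Step S11: keep every p-th frame, then split the sampled clip into n equal-length frame sets."""
    sampled = frames[::p]
    set_len = len(sampled) // n
    return [sampled[i * set_len:(i + 1) * set_len] for i in range(n)]

def shuffle_frame_sets(frame_sets):
    """Step S12: randomly reorder the frame sets to form an augmented clip S'."""
    order = list(range(len(frame_sets)))
    random.shuffle(order)
    return [frame_sets[i] for i in order]

def pick_key_frames(frames):
    """Step S13: one random frame from the beginning, middle and ending thirds of the clip."""
    third = len(frames) // 3
    parts = [frames[:third], frames[third:2 * third], frames[2 * third:]]
    return [part[random.randrange(len(part))] for part in parts]

def key_frame_flow(key_frames, raft_model):
    """Step S14: optical flow between consecutive key frames; raft_model is assumed to map
    a frame pair to a flow field."""
    return [raft_model(key_frames[i], key_frames[i + 1]) for i in range(len(key_frames) - 1)]
```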
Further, step S2 specifically includes the following steps:
step S21: inputting the obtained video clip into the 3D backbone network ResNext101 to extract the temporal feature M ∈ R^(C×T×H×W), where T is the number of input frames, H and W are the height and width of the input images, and C is the number of output channels;
step S22: inputting the key frames into the 2D backbone network Darknet to extract the spatial feature K ∈ R^(C'×H×W);
step S23: inputting the key-frame optical flow extracted by the RAFT model into the 2D backbone network Darknet to extract the motion feature O ∈ R^(C''×H×W);
step S24: to match the output feature maps of the 2D backbone network, reducing the depth dimension of the ResNext101 output feature M to 1, compressing the output volume to [C × H × W] and obtaining the compressed feature M' ∈ R^(C×H×W); the corresponding tensor shapes are sketched below.
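A minimal sketch of the tensor shapes in steps S21-S24, assuming PyTorch; the tensors are random stand-ins for the ResNext101 and Darknet outputs, the channel counts are illustrative, and the mean over the temporal axis is one plausible way to compress the depth dimension to 1, not necessarily the patented one.

```python
import torch

# Illustrative tensors with the shapes used in steps S21-S23 (a batch dimension is added).
B, C, T, H, W = 1, 2048, 16, 7, 7
M = torch.randn(B, C, T, H, W)   # 3D backbone output: temporal feature M in R^(C×T×H×W)
K = torch.randn(B, 1024, H, W)   # 2D backbone output on key frames: spatial feature K
O = torch.randn(B, 1024, H, W)   # 2D backbone output on key-frame flow: motion feature O

# Step S24: collapse the depth/temporal dimension of M to 1 so that M' matches the 2D maps.
M_prime = M.mean(dim=2)          # one way to compress T -> 1; shape [B, C, H, W]
assert M_prime.shape == (B, C, H, W)
```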
Further, step S3 specifically includes the following steps:
step S31: passing each of the three extracted features K, O and M' through two projection layers to generate 512-channel feature maps; each projection uses a 1×1 convolutional layer to reduce the channel dimension and a 3×3 convolutional layer to refine the semantic context;
step S32: stacking motion attention modules with different dilation rates to generate output features K', O' and M'' with multiple receptive fields, covering objects of all sizes;
the structure of the motion attention module is expressed as:
X_out = X_attn * X_res + X_in
X_attn = F_attn(APool(X_in); θ, Ω)
X_res = F(X_in; θ, Ω)
where F(·) denotes a residual function, APool(·) denotes an average pooling layer, and θ and Ω denote the structures of the convolutional layers; APool(·) performs a non-full compression, and X_attn * X_res is then upsampled to match the output of X_in; a sketch of this module is given below.
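A minimal PyTorch sketch of one motion attention module; the branch structures θ and Ω are not specified in the text, so a dilated 3×3 convolution is assumed for F(·) and a sigmoid-gated convolution for F_attn(·), and only the overall form X_out = X_attn * X_res + X_in, with average pooling and upsampling, follows the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionAttention(nn.Module):
    """Sketch of one motion attention module: X_out = X_attn * X_res + X_in.
    The branch structures (θ, Ω) are assumptions; only the overall form follows the text."""
    def __init__(self, channels: int, dilation: int = 1, pool_size: int = 4):
        super().__init__()
        self.pool_size = pool_size
        # Residual branch F(·): a dilated 3x3 convolution (assumed).
        self.res = nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation)
        # Attention branch F_attn(·): convolution + sigmoid on the pooled map (assumed).
        self.attn = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=dilation, dilation=dilation),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x_res = self.res(x)
        # Non-full compression via average pooling, then attention weights.
        pooled = F.adaptive_avg_pool2d(x, self.pool_size)
        x_attn = self.attn(pooled)
        # Upsample the attended response back to the input resolution.
        up = F.interpolate(x_attn, size=x.shape[-2:], mode="bilinear", align_corners=False)
        return up * x_res + x

# Stacking modules with different dilation rates yields the multi-receptive-field
# features K', O' and M'' of step S32 (512 channels from the projection of step S31).
block = nn.Sequential(MotionAttention(512, dilation=1),
                      MotionAttention(512, dilation=2),
                      MotionAttention(512, dilation=4))
```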
Further, step S4 specifically includes the following steps:
step S41: concatenating the features K', O' and M'' to obtain the feature A ∈ R^((C+C'+C'')×H×W);
step S42: inputting the feature A into two convolutional layers to generate a new feature map B ∈ R^(C×H×W); then reshaping the feature map B to size C×N to obtain F ∈ R^(C×N), where N = H×W;
step S43: multiplying F ∈ R^(C×N) by its transpose F^T ∈ R^(N×C) to compute the feature correlation between channels, generating a matrix G ∈ R^(C×C);
step S44: inputting the matrix into a Softmax layer to generate the channel attention map Q ∈ R^(C×C);
step S45: performing matrix multiplication between the channel attention map Q and the feature F, and reshaping the result back to the same three-dimensional shape as the feature map B to obtain the feature F' ∈ R^(C×H×W);
step S46: combining the tensor F' with the original input feature map B through a summation operation to obtain the output C ∈ R^(C×H×W):
C = δ·F' + B
where δ is a trainable parameter; a sketch of this channel attention fusion is given below.
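A minimal PyTorch sketch of the channel attention fusion of steps S41-S46; the kernel sizes of the two projection convolutions are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn

class ChannelAttentionFusion(nn.Module):
    """Sketch of steps S41-S46: concatenate K', O', M'', project to C channels,
    compute a C×C channel attention map and fuse it back with a trainable scalar δ."""
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.project = nn.Sequential(                      # step S42: two conv layers, A -> B
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.Conv2d(out_channels, out_channels, 1),
        )
        self.delta = nn.Parameter(torch.zeros(1))          # trainable δ in C = δ·F' + B

    def forward(self, k: torch.Tensor, o: torch.Tensor, m: torch.Tensor) -> torch.Tensor:
        a = torch.cat([k, o, m], dim=1)                    # step S41: A in R^((C+C'+C'')×H×W)
        b = self.project(a)                                # step S42: B in R^(C×H×W)
        n, c, h, w = b.shape
        f = b.view(n, c, h * w)                            # step S42: F in R^(C×N), N = H×W
        g = torch.bmm(f, f.transpose(1, 2))                # step S43: G = F F^T in R^(C×C)
        q = torch.softmax(g, dim=-1)                       # step S44: channel attention map Q
        f_prime = torch.bmm(q, f).view(n, c, h, w)         # step S45: reshape back to C×H×W
        return self.delta * f_prime + b                    # step S46: C = δ·F' + B

# For the 512-channel K', O' and M'' of step S32: in_channels = 3 * 512, out_channels = 512.
fusion = ChannelAttentionFusion(3 * 512, 512)
```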
Further, step S5 specifically includes the following steps:
step S51: passing the fused features through a convolutional layer with 1×1 kernels to generate an output of size [(5 × (NumCls + 5)) × H × W], where each group of (NumCls + 5) channels comprises NumCls class action scores cls, the 4 coordinates [bx, by, bw, bh] and a confidence score Conf;
step S52: selecting 5 prior anchor boxes on the data set by a k-means clustering algorithm (a sketch follows);
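A minimal sketch of the anchor selection in step S52, using plain Euclidean k-means over ground-truth (width, height) pairs; IoU-based distances are a common alternative that the text does not rule in or out.

```python
import numpy as np

def kmeans_anchors(wh: np.ndarray, k: int = 5, iters: int = 100, seed: int = 0) -> np.ndarray:
    """Cluster ground-truth box (width, height) pairs into k prior anchor boxes."""
    rng = np.random.default_rng(seed)
    centers = wh[rng.choice(len(wh), k, replace=False)]
    for _ in range(iters):
        # Assign each box to the nearest anchor center.
        assign = np.argmin(((wh[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        new_centers = np.array([wh[assign == j].mean(axis=0) if np.any(assign == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers

# Example with random box sizes standing in for data-set statistics.
anchors = kmeans_anchors(np.abs(np.random.randn(1000, 2)) * 100 + 20)
```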
step S53: on the basis of the initial anchor boxes, regressing the bounding-box position and confidence through a Sigmoid, computing the bounding-box loss with the CIoU loss and the confidence loss with a binary cross-entropy loss, where the CIoU loss is computed as:
L_CIoU = 1 - IoU + ρ²(b, b^gt)/u² + α·v
v = (4/π²)·(arctan(w^gt/h^gt) - arctan(w/h))², α = v/((1 - IoU) + v)
where b and b^gt denote the center points of the two rectangular boxes, i.e. the coordinates [bx, by] and [x^gt, y^gt], ρ denotes the Euclidean distance between the two center points, u denotes the diagonal length of the smallest region enclosing the two rectangular boxes, IoU is the ratio of the overlapping area of the bounding boxes to their total (union) area, v measures the consistency of the aspect ratios of the predicted box (w, h) and the ground-truth box (w^gt, h^gt), and α is the corresponding trade-off weight;
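A minimal PyTorch sketch of the CIoU bounding-box loss of step S53 for boxes given as [cx, cy, w, h]; it follows the standard CIoU formulation rather than reproducing the patent's own figure.

```python
import math
import torch

def ciou_loss(pred: torch.Tensor, gt: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """CIoU loss for predicted and ground-truth boxes of shape [N, 4] in [cx, cy, w, h] format."""
    # Convert center/size to corner coordinates.
    p_x1, p_y1 = pred[:, 0] - pred[:, 2] / 2, pred[:, 1] - pred[:, 3] / 2
    p_x2, p_y2 = pred[:, 0] + pred[:, 2] / 2, pred[:, 1] + pred[:, 3] / 2
    g_x1, g_y1 = gt[:, 0] - gt[:, 2] / 2, gt[:, 1] - gt[:, 3] / 2
    g_x2, g_y2 = gt[:, 0] + gt[:, 2] / 2, gt[:, 1] + gt[:, 3] / 2

    # IoU term.
    inter = (torch.min(p_x2, g_x2) - torch.max(p_x1, g_x1)).clamp(0) * \
            (torch.min(p_y2, g_y2) - torch.max(p_y1, g_y1)).clamp(0)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter + eps
    iou = inter / union

    # ρ²(b, b^gt): squared center distance; u²: squared diagonal of the enclosing box.
    rho2 = (pred[:, 0] - gt[:, 0]) ** 2 + (pred[:, 1] - gt[:, 1]) ** 2
    cw = torch.max(p_x2, g_x2) - torch.min(p_x1, g_x1)
    ch = torch.max(p_y2, g_y2) - torch.min(p_y1, g_y1)
    u2 = cw ** 2 + ch ** 2 + eps

    # Aspect-ratio consistency term v and its weight α.
    v = (4 / math.pi ** 2) * (torch.atan(gt[:, 2] / (gt[:, 3] + eps)) -
                              torch.atan(pred[:, 2] / (pred[:, 3] + eps))) ** 2
    alpha = v / ((1 - iou) + v + eps)
    return (1 - iou + rho2 / u2 + alpha * v).mean()
```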
step S54: classifying through a fully connected layer and a Softmax layer, and computing the classification loss with the Focal Loss, whose calculation formula is:
L_cls = -α·(1 - cls^gt)^γ·log(cls^gt)
where both α and γ are adjustable hyper-parameters, and cls^gt is the model's predicted probability for the ground-truth class, taking a value between 0 and 1;
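A minimal sketch of this focal loss under the reading that cls^gt is the Softmax probability assigned to the ground-truth class; this interpretation and the default α, γ values are assumptions.

```python
import torch

def focal_loss(cls_gt: torch.Tensor, alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Focal classification loss, with cls_gt read as the predicted probability of the true class."""
    cls_gt = cls_gt.clamp(1e-7, 1.0)   # numerical stability before the logarithm
    return (-alpha * (1 - cls_gt) ** gamma * torch.log(cls_gt)).mean()

# Example: probabilities assigned to the true class for a batch of 4 predictions.
print(focal_loss(torch.tensor([0.9, 0.6, 0.3, 0.05])))
```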
step S55: adding the bounding-box loss, the confidence loss and the classification loss to obtain the total loss, and updating the network parameters by back-propagation;
step S56: selecting a confidence threshold, taking out and sorting, for each class, the boxes whose scores are greater than the threshold, filtering out low-score predicted bounding boxes, applying non-maximum suppression (NMS) using the positions and scores of the boxes, and finally obtaining the bounding box with the highest probability as the prediction result;
non-maximum suppression (NMS) sorts the scores of all predicted bounding boxes, selects the highest score and its corresponding box, traverses the remaining boxes and deletes any box whose IoU with the current highest-scoring box is greater than a certain threshold; it then selects the highest-scoring box among the unprocessed boxes and repeats the process, as sketched below.
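A minimal PyTorch sketch of the NMS procedure of step S56 for boxes given as [x1, y1, x2, y2]; torchvision.ops.nms provides an equivalent optimized routine.

```python
import torch

def nms(boxes: torch.Tensor, scores: torch.Tensor, iou_thresh: float = 0.5) -> torch.Tensor:
    """Greedy NMS over boxes of shape [N, 4] in [x1, y1, x2, y2] format; returns kept indices."""
    order = scores.argsort(descending=True)
    keep = []
    while order.numel() > 0:
        i = order[0]
        keep.append(i.item())
        if order.numel() == 1:
            break
        rest = order[1:]
        # IoU of the current highest-scoring box with all remaining boxes.
        x1 = torch.max(boxes[i, 0], boxes[rest, 0])
        y1 = torch.max(boxes[i, 1], boxes[rest, 1])
        x2 = torch.min(boxes[i, 2], boxes[rest, 2])
        y2 = torch.min(boxes[i, 3], boxes[rest, 3])
        inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter + 1e-7)
        # Keep only boxes that do not overlap too much with the selected box.
        order = rest[iou <= iou_thresh]
    return torch.tensor(keep, dtype=torch.long)
```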
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the multi-scale feature fusion attention-based real-time motion detection method as described above when executing the program.
A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements a multi-scale feature fusion attention based real-time motion detection method as described above.
Compared with the prior art, the invention and the preferred scheme thereof have the following beneficial effects:
by rearranging and splicing the video segments, the diversity of data is increased on the premise of ensuring that the semantic information and the time dependency of the video are not damaged according to the time dependency among the video segments.
Aiming at introducing optical flow information to confusable actions in action detection for processing hard samples, a key frame-based optical flow information data input method is provided to replace the traditional optical flow data input. And time sequence information among video frames is reserved, and motion information is acquired through change among key frames and optical flow information. Compared with the traditional data input, the motion information can be acquired more clearly, the generation of noise data is effectively avoided, and the calculation amount and the storage space of optical flow information are saved.
Based on the multi-scale feature fusion attention, the multi-scale features are fused by extracting the multi-scale motion features of the targets with different scales, and the multi-scale feature fusion method is different from the traditional multi-scale feature fusion method, wherein the multi-scale feature attention module only uses the last layer of feature map to perform multi-scale fusion, and the calculation cost is reduced.
Drawings
The invention is described in further detail below with reference to the accompanying drawings and the detailed description;
fig. 1 is a schematic diagram of the flow and working principle of the embodiment of the invention.
Detailed Description
In order to make the features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail as follows:
it should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present application. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
As shown in fig. 1, the present embodiment provides a real-time motion detection method based on multi-scale feature fusion attention, which specifically includes the following steps:
step S1: dividing the data-set video clips into frame sets, and performing data augmentation on them through a random reordering operation; then extracting key frames from the video clips, and extracting optical flow information from the key frames;
step S2: inputting the obtained video clip into a ResNext101 network to extract temporal features and compressing them, and inputting the key frames and the key-frame optical flow into a Darknet network to extract spatial features and motion features;
step S3: performing multi-scale feature fusion on the features through a multi-scale feature fusion attention module;
step S4: concatenating the spatiotemporal features and further fusing them through channel attention;
step S5: obtaining class bounding boxes and confidence scores through classification and regression networks, and finally obtaining the bounding box with the highest probability as the prediction result through non-maximum suppression (NMS).
In this embodiment, the step S1 includes the following steps:
step S11: uniformly sampling the data-set video clip at an interval of p frames, and dividing the sampled clip into n equal-length frame sets, i.e. S = {s_1, s_2, …, s_n}, where each frame set s_i consists of an equal-length sequence of video frames;
step S12: randomly reordering the frame sets {s_1, s_2, …, s_n} to form a new video clip S' = {s'_1, s'_2, …, s'_n}, achieving data augmentation for the training process;
step S13: dividing the input video clip into a beginning part, a middle part and an ending part, and randomly extracting one frame from each part as a key frame to briefly represent the video action;
step S14: extracting optical flow information from the key frames using the RAFT model;
the RAFT model is an end-to-end deep neural network for optical flow estimation; it has strong generalization ability and is efficient in terms of training speed, number of parameters and inference time.
In this embodiment, step S2 specifically includes the following steps:
step S21: inputting the obtained video clip into the 3D backbone network ResNext101 to extract the temporal feature M ∈ R^(C×T×H×W), where T is the number of input frames, H and W are the height and width of the input images, and C is the number of output channels;
step S22: inputting the key frames into the 2D backbone network Darknet to extract the spatial feature K ∈ R^(C'×H×W);
step S23: inputting the key-frame optical flow extracted by the RAFT model into the 2D backbone network Darknet to extract the motion feature O ∈ R^(C''×H×W);
step S24: to match the output feature maps of the 2D backbone network, reducing the depth dimension of the ResNext101 output feature M to 1, compressing the output volume to [C × H × W] and obtaining the compressed feature M' ∈ R^(C×H×W).
In this embodiment, step S3 specifically includes the following steps:
step S31: passing each of the three extracted features K, O and M' through two projection layers (a 1×1 convolutional layer to reduce the channel dimension and a 3×3 convolutional layer to refine the semantic context), generating 512-channel feature maps;
step S32: stacking motion attention modules with different dilation rates, generating output features K', O' and M'' with multiple receptive fields, covering objects of all sizes.
The motion attention module may be expressed as:
X_out = X_attn * X_res + X_in
X_attn = F_attn(APool(X_in); θ, Ω)
X_res = F(X_in; θ, Ω)
where F(·) denotes a residual function, APool(·) denotes an average pooling layer, and θ and Ω denote the structures of the convolutional layers. APool(·) performs a non-full compression, and X_attn * X_res is then upsampled to match the output of X_in;
in this embodiment, step S4 specifically includes the following steps:
step S41: concatenating the features K', O' and M'' to obtain the feature A ∈ R^((C+C'+C'')×H×W);
step S42: inputting the feature A into two convolutional layers to generate a new feature map B ∈ R^(C×H×W); then reshaping B to size C×N to obtain F ∈ R^(C×N), where N = H×W;
step S43: multiplying F ∈ R^(C×N) by its transpose F^T ∈ R^(N×C) to compute the feature correlation between channels, generating a matrix G ∈ R^(C×C);
step S44: inputting the matrix into a Softmax layer to generate the channel attention map Q ∈ R^(C×C);
step S45: performing matrix multiplication between the channel attention map Q and the feature F, and reshaping the result back to the same three-dimensional shape as the feature map B to obtain the feature F' ∈ R^(C×H×W);
step S46: combining the tensor F' with the original input feature map B through a summation operation to obtain the output C ∈ R^(C×H×W):
C = δ·F' + B
where δ is a trainable parameter.
In this embodiment, step S5 specifically includes the following steps:
step S51: passing the fused features through a convolutional layer with 1×1 kernels to generate an output of size [(5 × (NumCls + 5)) × H × W], where each group of (NumCls + 5) channels comprises NumCls class action scores cls, the 4 coordinates [bx, by, bw, bh] and a confidence score Conf;
step S52: selecting 5 prior anchor boxes on the data set by a k-means clustering algorithm;
step S53: on the basis of the initial anchor boxes, regressing the bounding-box position and confidence through a Sigmoid, computing the bounding-box loss with the CIoU loss and the confidence loss with a binary cross-entropy loss, where the CIoU loss is computed as:
L_CIoU = 1 - IoU + ρ²(b, b^gt)/u² + α·v
v = (4/π²)·(arctan(w^gt/h^gt) - arctan(w/h))², α = v/((1 - IoU) + v)
where b and b^gt denote the center points of the two rectangular boxes, i.e. the coordinates [bx, by] and [x^gt, y^gt], ρ denotes the Euclidean distance between the two center points, u denotes the diagonal length of the smallest region enclosing the two rectangular boxes, IoU is the ratio of the overlapping area of the bounding boxes to their total (union) area, v measures the consistency of the aspect ratios of the predicted box (w, h) and the ground-truth box (w^gt, h^gt), and α is the corresponding trade-off weight.
Step S54: classifying through a full connection layer and a Softmax layer, and calculating the classification Loss through the Focal local, wherein the calculation formula is as follows:
L_cls = -α·(1 - cls^gt)^γ·log(cls^gt)
where both α and γ are adjustable hyper-parameters, and cls^gt is the model's predicted probability for the ground-truth class, taking a value between 0 and 1.
Step S55: adding the boundary frame loss, the confidence coefficient loss and the classification loss to obtain a total loss, and reversely updating the network parameters;
step S56: selecting a confidence threshold, taking out the frames and scores of each class with the scores larger than a certain threshold for sorting, filtering out low-threshold prediction boundary frames, performing NMS (non-maximum suppression) by using the positions and scores of the frames, and finally obtaining the boundary frame with the maximum probability as a prediction result.
NMS (non-maximum suppression) sorts the scores of all predicted bounding boxes, selects the highest score and its corresponding box, traverses the rest boxes, and deletes its box if the IOU is larger than a certain threshold value. And continuing to select one with the highest score from the unprocessed boxes, and repeating the process.
In particular, the invention performs real-time action detection based on multi-scale feature fusion attention. By rearranging and re-splicing the frame sets according to the temporal dependency among the video segments, the diversity of the data is increased while the semantic information and the temporal dependency of the video are preserved. To handle hard samples, i.e. easily confusable actions in action detection, optical flow information is introduced, and a key-frame-based optical flow input is proposed to replace the traditional optical flow input. The temporal information between video frames is retained, and motion information is obtained from the changes between key frames and their optical flow. Compared with the traditional input, motion information can be captured more clearly, the generation of noisy data is effectively avoided, and the computation and storage cost of optical flow is reduced. Based on the multi-scale feature fusion attention, multi-scale motion features are extracted for targets of different sizes and then fused; unlike traditional multi-scale feature fusion, the multi-scale feature attention module performs multi-scale fusion on the last feature map only, which reduces the computational cost.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations of methods, apparatus (devices), and computer program products according to embodiments of the invention. It will be understood that each flow of the flowcharts, and combinations of flows in the flowcharts, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows.
The foregoing is directed to preferred embodiments of the present invention, other and further embodiments of the invention may be devised without departing from the basic scope thereof, and the scope thereof is determined by the claims that follow. However, any simple modification, equivalent change and modification of the above embodiments according to the technical essence of the present invention are within the protection scope of the technical solution of the present invention.
The present invention is not limited to the above preferred embodiments, and other various real-time motion detection methods based on multi-scale feature fusion attention can be obtained by anyone skilled in the art according to the teaching of the present invention.

Claims (6)

1. A real-time motion detection method based on multi-scale feature fusion attention, characterized by comprising the following steps:
step S1: dividing the data-set video clips into frame sets, and performing data augmentation on them through a random reordering operation; extracting key frames from the video clips, and extracting optical flow information from the key frames;
step S2: inputting the obtained video clips into a ResNext101 network to extract temporal features and compressing them, and inputting the key frames and the key-frame optical flow into a Darknet network to extract spatial features and motion features;
step S3: obtaining multi-scale features by stacking motion attention modules with different dilation rates;
step S4: concatenating the spatiotemporal features and further fusing them through channel attention;
step S5: obtaining class bounding boxes and confidence scores through classification and regression networks, and finally obtaining the bounding box with the highest probability as the prediction result through non-maximum suppression (NMS).
2. The real-time motion detection method based on multi-scale feature fusion attention of claim 1, characterized in that: step S1 specifically includes the following steps:
step S11: uniformly sampling the data-set video clip at an interval of p frames, and dividing the sampled clip into n equal-length frame sets, i.e. S = {s_1, s_2, …, s_n}, where each frame set s_i consists of an equal-length sequence of video frames;
step S12: randomly reordering the frame sets {s_1, s_2, …, s_n} to form a new video clip S' = {s'_1, s'_2, …, s'_n}, achieving data augmentation for the training process;
step S13: dividing the input video clip into a beginning part, a middle part and an ending part, and randomly extracting one frame from each part as a key frame to briefly represent the video action;
step S14: extracting optical flow information for the key frames using the RAFT model.
3. The multi-scale feature fusion attention-based real-time motion detection method according to claim 2, characterized in that: step S2 specifically includes the following steps:
step S21: inputting the obtained video clip into the 3D backbone network ResNext101 to extract the temporal feature M ∈ R^(C×T×H×W), where T is the number of input frames, H and W are the height and width of the input images, and C is the number of output channels;
step S22: inputting the key frames into the 2D backbone network Darknet to extract the spatial feature K ∈ R^(C'×H×W);
step S23: inputting the key-frame optical flow extracted by the RAFT model into the 2D backbone network Darknet to extract the motion feature O ∈ R^(C''×H×W);
step S24: to match the output feature maps of the 2D backbone network, reducing the depth dimension of the ResNext101 output feature M to 1, compressing the output volume to [C × H × W] and obtaining the compressed feature M' ∈ R^(C×H×W).
4. The multi-scale feature fusion attention-based real-time motion detection method according to claim 3, characterized in that: step S3 specifically includes the following steps:
step S31: passing each of the three extracted features K, O and M' through two projection layers to generate 512-channel feature maps; each projection uses a 1×1 convolutional layer to reduce the channel dimension and a 3×3 convolutional layer to refine the semantic context;
step S32: stacking motion attention modules with different dilation rates to generate output features K', O' and M'' with multiple receptive fields, covering objects of all sizes;
the structure of the motion attention module is expressed as:
X_out = X_attn * X_res + X_in
X_attn = F_attn(APool(X_in); θ, Ω)
X_res = F(X_in; θ, Ω)
where F(·) denotes a residual function, APool(·) denotes an average pooling layer, and θ and Ω denote the structures of the convolutional layers; APool(·) performs a non-full compression, and X_attn * X_res is then upsampled to match the output of X_in.
5. The multi-scale feature fusion attention-based real-time motion detection method of claim 4, wherein: step S4 specifically includes the following steps:
step S41: concatenating the features K', O' and M'' to obtain the feature A ∈ R^((C+C'+C'')×H×W);
step S42: inputting the feature A into two convolutional layers to generate a new feature map B ∈ R^(C×H×W); then reshaping the feature map B to size C×N to obtain F ∈ R^(C×N), where N = H×W;
step S43: multiplying F ∈ R^(C×N) by its transpose F^T ∈ R^(N×C) to compute the feature correlation between channels, generating a matrix G ∈ R^(C×C);
step S44: inputting the matrix into a Softmax layer to generate the channel attention map Q ∈ R^(C×C);
step S45: performing matrix multiplication between the channel attention map Q and the feature F, and reshaping the result back to the same three-dimensional shape as the feature map B to obtain the feature F' ∈ R^(C×H×W);
step S46: combining the tensor F' with the original input feature map B through a summation operation to obtain the output C ∈ R^(C×H×W):
C = δ·F' + B
where δ is a trainable parameter.
6. The multi-scale feature fusion attention-based real-time motion detection method of claim 5, wherein: step S5 specifically includes the following steps:
step S51: passing the fused features through a convolutional layer with 1×1 kernels to generate an output of size [(5 × (NumCls + 5)) × H × W], where each group of (NumCls + 5) channels comprises NumCls class action scores cls, the 4 coordinates [bx, by, bw, bh] and a confidence score Conf;
step S52: selecting 5 prior anchor boxes on the data set by a k-means clustering algorithm;
step S53: on the basis of the initial anchor boxes, regressing the bounding-box position and confidence through a Sigmoid, computing the bounding-box loss with the CIoU loss and the confidence loss with a binary cross-entropy loss, where the CIoU loss is computed as:
L_CIoU = 1 - IoU + ρ²(b, b^gt)/u² + α·v
v = (4/π²)·(arctan(w^gt/h^gt) - arctan(w/h))², α = v/((1 - IoU) + v)
where b and b^gt denote the center points of the two rectangular boxes, i.e. the coordinates [bx, by] and [x^gt, y^gt], ρ denotes the Euclidean distance between the two center points, u denotes the diagonal length of the smallest region enclosing the two rectangular boxes, IoU is the ratio of the overlapping area of the bounding boxes to their total (union) area, v measures the consistency of the aspect ratios of the predicted box (w, h) and the ground-truth box (w^gt, h^gt), and α is the corresponding trade-off weight;
step S54: classifying through a fully connected layer and a Softmax layer, and computing the classification loss with the Focal Loss, whose calculation formula is:
L_cls = -α·(1 - cls^gt)^γ·log(cls^gt)
where both α and γ are adjustable hyper-parameters, and cls^gt is the model's predicted probability for the ground-truth class, taking a value between 0 and 1;
step S55: adding the bounding-box loss, the confidence loss and the classification loss to obtain the total loss, and updating the network parameters by back-propagation;
step S56: selecting a confidence threshold, taking out and sorting, for each class, the boxes whose scores are greater than the threshold, filtering out low-score predicted bounding boxes, applying non-maximum suppression (NMS) using the positions and scores of the boxes, and finally obtaining the bounding box with the highest probability as the prediction result;
non-maximum suppression (NMS) sorts the scores of all predicted bounding boxes, selects the highest score and its corresponding box, traverses the remaining boxes and deletes any box whose IoU with the current highest-scoring box is greater than a certain threshold; it then selects the highest-scoring box among the unprocessed boxes and repeats the process.
CN202210785189.4A 2022-07-05 2022-07-05 Real-time action detection method based on multi-scale feature fusion attention Pending CN115131710A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210785189.4A CN115131710A (en) 2022-07-05 2022-07-05 Real-time action detection method based on multi-scale feature fusion attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210785189.4A CN115131710A (en) 2022-07-05 2022-07-05 Real-time action detection method based on multi-scale feature fusion attention

Publications (1)

Publication Number Publication Date
CN115131710A true CN115131710A (en) 2022-09-30

Family

ID=83382942

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210785189.4A Pending CN115131710A (en) 2022-07-05 2022-07-05 Real-time action detection method based on multi-scale feature fusion attention

Country Status (1)

Country Link
CN (1) CN115131710A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115763167A (en) * 2022-11-22 2023-03-07 黄华集团有限公司 Solid cabinet breaker and control method thereof
CN115883878A (en) * 2022-11-25 2023-03-31 南方科技大学 Video editing method and device, electronic equipment and storage medium
CN117671357A (en) * 2023-12-01 2024-03-08 广东技术师范大学 Pyramid algorithm-based prostate cancer ultrasonic video classification method and system
WO2024136115A1 (en) * 2022-12-23 2024-06-27 한국전자기술연구원 Human micro-gesture recognition system and method to which multi-frame time-axis channel-crossing algorithm is applied

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
CN110287826A (en) * 2019-06-11 2019-09-27 北京工业大学 A kind of video object detection method based on attention mechanism
CN111627052A (en) * 2020-04-30 2020-09-04 沈阳工程学院 Action identification method based on double-flow space-time attention mechanism
CN112434608A (en) * 2020-11-24 2021-03-02 山东大学 Human behavior identification method and system based on double-current combined network
CN114373194A (en) * 2022-01-14 2022-04-19 南京邮电大学 Human behavior identification method based on key frame and attention mechanism
WO2022134655A1 (en) * 2020-12-25 2022-06-30 神思电子技术股份有限公司 End-to-end video action detection and positioning system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109101896A (en) * 2018-07-19 2018-12-28 电子科技大学 A kind of video behavior recognition methods based on temporal-spatial fusion feature and attention mechanism
CN110287826A (en) * 2019-06-11 2019-09-27 北京工业大学 A kind of video object detection method based on attention mechanism
CN111627052A (en) * 2020-04-30 2020-09-04 沈阳工程学院 Action identification method based on double-flow space-time attention mechanism
CN112434608A (en) * 2020-11-24 2021-03-02 山东大学 Human behavior identification method and system based on double-current combined network
WO2022134655A1 (en) * 2020-12-25 2022-06-30 神思电子技术股份有限公司 End-to-end video action detection and positioning system
CN114373194A (en) * 2022-01-14 2022-04-19 南京邮电大学 Human behavior identification method based on key frame and attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
张聪聪; 何宁: "Human action recognition method based on a key-frame two-stream convolutional network", Journal of Nanjing University of Information Science & Technology (Natural Science Edition), no. 06, 28 November 2019, pages 96-101 *
柯逍 et al.: "Real-time action detection method based on spatiotemporal cross-perception", Acta Electronica Sinica, no. 2, 29 February 2024, pages 574-588 *
柯逍; 缪欣: "Real-Time Action Detection Method based on Multi-Scale Spatiotemporal Feature", 2022 International Conference on Image Processing, Computer Vision and Machine Learning, 30 October 2022, pages 245-248, XP034273756, DOI: 10.1109/ICICML57342.2022.10009833 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115763167A (en) * 2022-11-22 2023-03-07 黄华集团有限公司 Solid cabinet breaker and control method thereof
CN115763167B (en) * 2022-11-22 2023-09-22 黄华集团有限公司 Solid cabinet circuit breaker and control method thereof
CN115883878A (en) * 2022-11-25 2023-03-31 南方科技大学 Video editing method and device, electronic equipment and storage medium
WO2024136115A1 (en) * 2022-12-23 2024-06-27 한국전자기술연구원 Human micro-gesture recognition system and method to which multi-frame time-axis channel-crossing algorithm is applied
CN117671357A (en) * 2023-12-01 2024-03-08 广东技术师范大学 Pyramid algorithm-based prostate cancer ultrasonic video classification method and system
CN117671357B (en) * 2023-12-01 2024-07-05 广东技术师范大学 Pyramid algorithm-based prostate cancer ultrasonic video classification method and system


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination