CN114782859A - Method for establishing space-time perception positioning model of target behaviors and application - Google Patents

Method for establishing space-time perception positioning model of target behaviors and application

Info

Publication number
CN114782859A
Authority
CN
China
Prior art keywords
target
image
behavior
feature
mask
Prior art date
Legal status
Granted
Application number
CN202210313781.4A
Other languages
Chinese (zh)
Other versions
CN114782859B (en)
Inventor
左峥嵘 (Zuo Zhengrong)
沈凡姝 (Shen Fanshu)
王岳环 (Wang Yuehuan)
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN202210313781.4A
Publication of CN114782859A
Application granted
Publication of CN114782859B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for establishing a target behavior perception space-time positioning model and an application thereof, belonging to the technical field of image processing. The method comprises establishing a target behavior perception space-time positioning model based on a deep network, the model comprising a space-time behavior perception sub-network and a spatial target positioning sub-network. The space-time behavior perception sub-network comprises: a mask prediction module, which obtains a target area mask feature map from the support set images and the query set images; an image information perception module, which obtains a query image information feature map; an image-level feature fusion layer, into which the target area mask feature map and the image information feature map are input to obtain a target-information-enhanced feature map; a motion information perception module, which obtains a motion information feature map from the dense frame sequence images; a global feature fusion layer, into which the target-information-enhanced feature map and the motion information feature map are input to obtain a space-time behavior perception feature map; and a behavior classification module, into which the space-time behavior perception feature map is input to obtain the classification result. A target positioning result is obtained through the spatial target positioning sub-network. The invention effectively focuses on and utilizes the target area information and achieves higher positioning accuracy.

Description

Method for establishing space-time perception positioning model of target behaviors and application
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a method for establishing a space-time perception positioning model of target behaviors and application of the space-time perception positioning model.
Background
Video sequence space-time perception has been a popular problem in the field of computer vision in recent years. Its aim is to enable a model, through learning from video, to locate and detect targets in the video and to identify and mark each behavior from its starting frame to its ending frame. Because training is usually performed on large amounts of data and a video segment contains many frames, each training iteration consumes considerable computer memory, and training often takes a long time. In addition, because target scales vary greatly, missed detections and false detections are frequent, and when the target's contour shape changes little, behavior classification is also difficult.
Most existing video sequence space-time perception methods learn with deep networks; compared with traditional methods based on hand-crafted features, they can take more features into account and have stronger adaptability and feature expression capability. The mainstream deep-network-based methods include the following. Methods based on two-stream feature extraction rely on the color channels to acquire spatial information and on computed optical flow to acquire temporal information, so the computation contains a large amount of color feature information, the optical flow must be computed separately and cannot be fused end to end with the color channels, the computation is heavy, and resource consumption is high. Three-dimensional space-time perception methods add a time dimension to conventional 2D convolution; they generally use the same input frame rate and feature maps for target detection and behavior localization, so the extracted feature maps often lack task-specific representation capability when facing multiple tasks, and a large amount of training input is needed to obtain good convergence.
In addition, most existing space-time behavior perception models are trained on behavior data sets in which human bodies are the main subjects. A rigid body does not undergo the complex deformation of its relative structure that a human does (for example, the self-deformation produced by sitting down or raising a hand); it only exhibits observational deformation caused by differences in viewing angle and distance. The environment a person is in strongly influences the classification of the person's behavior (for example, falling down versus lying down), whereas for an aircraft the model needs to pay more attention to changes in the target's attitude rather than relative changes in its structure. Moreover, state switches such as target turning or release may occur against the same sky background, so the classification of some behaviors depends only weakly on the background, and more attention to the target's attitude changes is needed to guide behavior classification.
In summary, because existing methods perform only coarse perception and feature extraction on the whole image, the pixel information of the region where the target is located is not fully utilized, and they are not suitable for perceiving the behavior of aerial targets. These methods also usually rely on large labeled data sets, have high computational complexity, and consume a large amount of memory during computation. Existing video sequence space-time perception and localization methods therefore need further improvement.
Disclosure of Invention
In view of the defects and improvement needs of the prior art, the invention provides a method for establishing a target behavior space-time perception positioning model and an application thereof, aiming to solve the technical problem that existing behavior-aware space-time localization methods do not fully utilize the pixel information of the region where the target is located and are therefore unsuitable for behavior-aware space-time localization of rigid targets such as aircraft.
To achieve the above object, according to one aspect of the present invention, there is provided a method for establishing a target behavior-aware spatiotemporal localization model, including: establishing a deep neural network, and training the deep neural network by using a data set to obtain a target behavior perception space-time positioning model;
the deep neural network comprises a space-time behavior perception sub-network and a spatial target positioning sub-network, wherein the space-time behavior perception sub-network comprises:

a mask prediction module, which performs target area feature perception on the support set images and the query set images and uses the perception result to predict a target mask for the query set images, obtaining a target area mask feature map;

an image information perception module, which performs feature extraction on the query set images to obtain an image information feature map;

an image-level feature fusion layer, which performs channel-by-channel feature fusion of the target area mask feature map and the image information feature map and superimposes the fused features on the image information feature map to obtain a target-information-enhanced feature map;

a motion information perception module, which perceives the motion information in the dense frame sequence images to obtain a motion information feature map;

a global feature fusion layer, which performs feature compression on the motion information feature map and cascades it with the target-information-enhanced feature map to obtain a space-time behavior perception feature map;

and a behavior classification module, which extracts regions of interest from the space-time behavior perception feature map and predicts behavior class probabilities to obtain the behavior class prediction result for the target.

The spatial target positioning sub-network performs target identification, detection and localization on the query set images to obtain the target positioning frame prediction result.

The support set images are T frames randomly sampled from a video sequence labeled with mask labels; the query set images are T frames randomly sampled from the input video sequence and labeled with target positioning frame labels; the support set images and the query set images are different from each other, and T is a positive integer. The dense frame sequence images are αT frames randomly sampled from the input video sequence, with α > 1.
The established target behavior perception space-time positioning model performs mask prediction while extracting features from the input video sequence frames, obtaining a target area mask feature map. It fuses this mask feature map channel by channel with the image information feature map to obtain a target-information-enhanced feature map, fuses the enhanced feature map with the extracted motion information feature map into a space-time behavior perception feature map, and finally extracts regions of interest from the space-time behavior perception feature map to obtain the behavior classification result for the target. By predicting the target area mask feature map and fusing it channel by channel with the image information feature map, the model's attention to the pixel information of the target area is enhanced and the influence of background interference is reduced, so that the pixel information of the region where the target is located is fully utilized, the accuracy of target area feature extraction is improved, and the detection and classification accuracy of behavior perception is ultimately improved; the model can therefore effectively and accurately realize behavior-aware space-time localization of rigid targets such as aircraft.
When performing behavior-aware space-time localization, the established target behavior perception space-time positioning model only needs to extract image features, predict the target area mask feature map, perceive motion information, and perform target detection; it does not rely on color information. This makes the model pay more attention to the contour and structure information of the target in the video sequence to be detected, makes it applicable to video sequences displayed in grayscale, such as infrared video sequences, and gives it stronger generalization capability. Meanwhile, abandoning the color channels relaxes the memory requirements for training and running the model and greatly reduces the memory occupied at run time.
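To make the data flow above concrete, the following is a minimal sketch of how the two sub-networks might be wired together, assuming PyTorch-style modules; the class, argument names and tensor interfaces are hypothetical illustrations, not the patent's actual implementation.

```python
import torch
import torch.nn as nn

class SpatioTemporalBehaviorModel(nn.Module):
    """Hypothetical skeleton of the two sub-networks described above."""
    def __init__(self, mask_predictor, image_encoder, motion_encoder,
                 image_fusion, global_fusion, behavior_head, localizer):
        super().__init__()
        self.mask_predictor = mask_predictor    # mask prediction module
        self.image_encoder = image_encoder      # image information perception module
        self.motion_encoder = motion_encoder    # motion information perception module
        self.image_fusion = image_fusion        # image-level feature fusion layer
        self.global_fusion = global_fusion      # global feature fusion layer
        self.behavior_head = behavior_head      # behavior classification module
        self.localizer = localizer              # spatial target positioning sub-network

    def forward(self, support_imgs, support_masks, query_imgs, query_boxes, dense_imgs):
        # 1) target area mask feature map from support + query images
        f_mask = self.mask_predictor(support_imgs, support_masks, query_imgs, query_boxes)
        # 2) image information feature map from the (sparse) query frames
        f_img = self.image_encoder(query_imgs)
        # 3) channel-by-channel fusion -> target-information-enhanced feature map
        f_enh = self.image_fusion(f_mask, f_img)
        # 4) motion information feature map from the dense frame sequence
        f_mot = self.motion_encoder(dense_imgs)
        # 5) compress + concatenate -> space-time behavior perception feature map
        f_st = self.global_fusion(f_mot, f_enh)
        # 6) behavior class prediction and target positioning frame prediction
        behavior_logits = self.behavior_head(f_st, query_boxes)
        boxes = self.localizer(query_imgs)
        return behavior_logits, boxes
```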
Further, the mask prediction module comprises:

a first feature extraction module, which performs feature extraction on the support set images and the query set images respectively to obtain a support set image feature map and a query set image feature map;

a first background suppression module, which uses the mask labels of the support set images to suppress the background in the support set image features, obtaining a first target-enhanced feature map;

a second background suppression module, which uses the target positioning frame labels of the query set images to suppress the background in the query set image features, obtaining a second target-enhanced feature map;

a correlation measurement module, which measures the similarity of the first and second target-enhanced feature maps to obtain a hyper-correlation feature matrix;

a feature compression module, which compresses the hyper-correlation feature matrix along the depth dimension to obtain a feature compression matrix;

a feature fusion module, which uses the residual idea to perform feature fusion on each level of features in the feature compression matrix, obtaining a feature fusion matrix;

and a down-sampling layer, which converts the feature fusion matrix into a two-dimensional feature map, obtaining the target area mask feature map.
In the mask prediction module of the invention, the background suppression modules use the mask labels or target positioning frame labels to suppress the background, which effectively reduces the dependence on the background and realizes target enhancement. The correlation measurement module measures the similarity between the target-enhanced query set images and support set images, so that the target area mask feature map can be predicted accurately with only a small number of labeled support set images. The feature compression module keeps the feature dimensions of the query set images unchanged while compressing the feature dimensions of the support set images, reducing the dimensionality of the hyper-correlation feature matrix without losing feature information; this greatly reduces the memory consumed by an otherwise oversized matrix in subsequent computation. The feature fusion module performs a feature fusion operation on each level of the compressed hyper-correlation matrix using the residual idea, fusing features of different receptive fields so that the query set image features and the support set image features are fused better; as the network deepens, the extracted features gradually tend toward semantic information, so both high-level semantic information and low-level detail information are retained. In general, the target behavior perception space-time positioning model established by the invention can accurately predict the target area mask feature map, reduces the dependence on background, and occupies less memory.
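As an illustration of the background suppression idea used inside the mask prediction module (resize the mask or box label to each feature level, then take the Hadamard product), here is a small sketch; the function name and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def suppress_background(feature_pyramid, label_map):
    """feature_pyramid: list of tensors (B, C_l, H_l, W_l);
    label_map: (B, 1, H, W) binary mask label or rasterized box label.
    Returns the target-enhanced features at every level."""
    enhanced = []
    for feat in feature_pyramid:
        # bilinear interpolation to the size of this feature level
        lbl = F.interpolate(label_map.float(), size=feat.shape[-2:],
                            mode="bilinear", align_corners=False)
        # Hadamard product keeps target-region responses and suppresses background
        enhanced.append(feat * lbl)
    return enhanced
```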
Further, during model training, the mask prediction module further comprises:

a context decoder, which restores the target area mask feature map to the same size as the original input image to obtain the query set target mask prediction map.
Training the deep neural network by using a data set, wherein the training comprises a first stage training and a second stage training;
the first stage of training comprises:
constructing a support set image by using the video sequence labeled with the mask label, constructing a query set image by using the video sequence labeled with the mask label and the target positioning frame label, and constructing a first data set by using the constructed support set image and the query set image;
performing supervised training on the mask prediction module by using a first data set to obtain a pre-trained mask prediction module;
the second stage of training comprises:
loading the pre-trained mask prediction module parameters into the deep neural network; constructing support set images from the video sequences labeled with mask labels; constructing query set images and dense frame sequence images from the video sequences labeled with target behavior labels and target positioning frame labels; constructing a second data set from the constructed support set images, query set images and dense frame sequence images; and training the deep neural network model with the second data set to obtain the target behavior perception space-time positioning model.
The invention adopts a two-stage training mode to train the established model: in the first stage of training, a mask prediction module is supervised and trained by using a support set image constructed by a video sequence with a marked mask label and a query set image constructed by a video sequence with a marked mask label and a target positioning frame label, so that the mask prediction module can obtain good target area pixel classification capability and accurately predict a target area mask feature map; in the second stage of training, the mask prediction module trained in the first stage is used for predicting a mask feature map of a target area, and the whole model is trained end to end, so that the parameters of the whole model are adjusted; by combining the two stages of training, the model training process does not depend on a large number of data sets labeled with mask labels and target positioning box labels at the same time.
Further, the training loss function is

$$\mathcal{L} = \eta_1 \mathcal{L}_{cls} + \eta_2 \mathcal{L}_{loc} + \varepsilon \mathcal{L}_{mask},$$

where $\mathcal{L}$ represents the total loss; $\mathcal{L}_{cls}$ represents the behavior classification loss calculated from the space-time feature information; $\mathcal{L}_{loc}$ represents the behavior frame positioning loss calculated from the space-time feature information; $\mathcal{L}_{mask}$ represents the prediction loss of the mask prediction; and $\eta_1$, $\eta_2$, $\varepsilon$ are preset weight parameters. In the first stage of training, $\eta_1 = \eta_2 = 0$ and $\varepsilon$ is greater than zero; in the second stage of training, $\eta_1$, $\eta_2$ and $\varepsilon$ are all greater than zero.
In the model training process of the first stage, the training loss function only considers the prediction loss of the mask prediction, and can effectively ensure that the mask prediction module can accurately predict the mask feature map of the target area after the training of the first stage; in the second stage of training, the training loss function considers the behavior classification loss, the behavior frame positioning loss and the prediction loss of the mask prediction at the same time, so that the model can accurately predict the behavior classification result and the target behavior frame positioning result, and the mask prediction module is subjected to fine adjustment, thereby further ensuring the accuracy of behavior perception space-time positioning by using the model after the model training is finished.
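A minimal sketch of the staged loss weighting, assuming the three component losses have already been computed; the concrete weight values follow the settings given further below (η1 = η2 = 0, ε = 1 in the first stage; η1 = η2 = 0.5, ε = 0.1 in the second stage), and everything else is illustrative.

```python
def total_loss(loss_cls, loss_loc, loss_mask, stage):
    """Weighted sum of behavior classification, behavior frame positioning,
    and mask prediction losses, with stage-dependent weights."""
    if stage == 1:          # first stage: only the mask prediction branch is optimized
        eta1, eta2, eps = 0.0, 0.0, 1.0
    else:                   # second stage: end-to-end training of the whole model
        eta1, eta2, eps = 0.5, 0.5, 0.1
    return eta1 * loss_cls + eta2 * loss_loc + eps * loss_mask
```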
Further, for the $i$-th frame of input image, the mask prediction loss corresponding to the target mask prediction map output by the mask prediction module is

$$\mathcal{L}_{mask}^{(i)} = -\frac{1}{N_i}\sum_{x}\Big[y_i(x)\log p_i(x) + \big(1 - y_i(x)\big)\log\big(1 - p_i(x)\big)\Big],$$

where $N_i$ represents the total number of pixels in the target prediction mask image, $x$ represents a pixel point in the target prediction mask image, $y_i(x)$ represents the label value corresponding to pixel $x$ of the $i$-th frame of input image (1 for a positive sample and 0 for a negative sample), and $p_i(x)$ represents the probability that pixel $x$ of the $i$-th frame of input image is predicted to be a positive sample. In the first stage of training, the mask prediction loss is obtained by measuring the error between the prediction result and the mask labels of the query data set; in the second stage of training, the mask prediction loss is calculated by measuring the error between the prediction result and the target positioning frame labels of the query data set.
In the first stage of training, the mask label of the query set image is directly used as an optimization target to calculate the prediction loss of the mask, so that supervision training of a mask prediction module is realized, and the mask prediction module has better mask prediction capability; in the second stage of training, the target positioning frame label of the query set image is used as an optimization target to calculate the prediction loss of the mask, so that the weak supervision training of the mask prediction module is realized, the dependence on a large number of data sets labeled with the mask label and the target positioning frame label at the same time is avoided, and the prediction effect of the mask prediction module is further improved.
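The per-pixel loss above is a standard binary cross-entropy. The sketch below also shows one possible reading of the second-stage weak supervision, where the target positioning frame is rasterized into a pseudo-mask that serves as the optimization target; the helper names are hypothetical.

```python
import torch
import torch.nn.functional as F

def mask_prediction_loss(pred_prob, target):
    """pred_prob: (B, 1, H, W) probability that each pixel is a positive sample;
    target: (B, 1, H, W) with 1 for positive pixels and 0 for negative pixels."""
    return F.binary_cross_entropy(pred_prob, target.float())

def boxes_to_pseudo_mask(boxes, height, width):
    """Rasterize target positioning frames (x1, y1, x2, y2) into a binary map,
    used as the optimization target in the weakly supervised second stage."""
    mask = torch.zeros(len(boxes), 1, height, width)
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        mask[i, 0, int(y1):int(y2), int(x1):int(x2)] = 1.0
    return mask
```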
Further, in the first training stage, $\eta_1 = \eta_2 = 0$ and $\varepsilon = 1$;

in the second training stage, $\eta_1 = \eta_2 = 0.5$ and $\varepsilon = 0.1$.
Further, the image-level feature fusion layer performs channel-by-channel feature fusion of the target area mask feature map and the image information feature map and superimposes the fused features on the image information feature map to obtain the target-information-enhanced feature map, which comprises:

converting the target area mask feature map, through a bilinear interpolation operation, to the same size as the corresponding channel features of the image information feature map; performing a Hadamard product operation with the corresponding channel features of the image information feature map; and adding the result to the image information feature map to obtain the target-information-enhanced feature map.
In the invention, after the image-level feature fusion layer unifies dimensions of feature images to be fused, pixel-level feature fusion is realized through Hadamard product operation, and the feature information of an image domain target can be enhanced while the feature difference among original channels of the feature images is better kept.
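A short sketch of the image-level fusion just described: interpolate the mask feature map to the size of the image information feature map, take the channel-wise Hadamard product, and add the result back to the image information feature map. Tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def image_level_fusion(f_mask, f_img):
    """f_mask: (B, 1, Hm, Wm) target area mask feature map;
    f_img: (B, C, H, W) image information feature map.
    Returns the target-information-enhanced feature map (B, C, H, W)."""
    m = F.interpolate(f_mask, size=f_img.shape[-2:], mode="bilinear",
                      align_corners=False)
    # Hadamard product per channel (broadcast over C), then residual addition
    return f_img * m + f_img
```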
Further, the global feature fusion layer performs feature compression on the motion information feature map and cascades it with the target-information-enhanced feature map to obtain the space-time behavior perception feature map, which comprises:

compressing the motion information feature map, through a convolution operation, to the same dimensions as the target-information-enhanced feature map, and then performing cascade splicing with the target-information-enhanced feature map to obtain the space-time behavior perception feature map.
In the invention, the global feature fusion layer performs convolution feature extraction on the motion information feature map to realize further compression of features, and realizes feature fusion of the compressed motion information feature map and the feature map enhanced by the target information in a feature cascade splicing mode, so that the feature map obtained by fusion retains image information and motion information.
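A sketch of the global fusion step, assuming 5D video features of shape (B, C, T, H, W); the 1 × 1 spatial kernel with a time dimension of 5 follows the embodiment described later, while the channel widths and the assumption that the temporal lengths already match are illustrative.

```python
import torch
import torch.nn as nn

class GlobalFeatureFusion(nn.Module):
    """Compress the motion feature map with a convolution, then concatenate it
    with the target-information-enhanced feature map along the channel axis."""
    def __init__(self, motion_channels, enhanced_channels):
        super().__init__()
        # kernel of size 1x1 in space with a time dimension of 5, padded to keep T
        self.compress = nn.Conv3d(motion_channels, enhanced_channels,
                                  kernel_size=(5, 1, 1), padding=(2, 0, 0))

    def forward(self, f_motion, f_enhanced):
        f_motion = self.compress(f_motion)
        # assumes the temporal/spatial sizes have already been reduced to match f_enhanced
        return torch.cat([f_motion, f_enhanced], dim=1)
```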
According to another aspect of the present invention, there is provided a method for space-time perception and localization of target behaviors, comprising:
randomly sampling T frame images from a video sequence marked with a mask label to serve as support set images;
randomly sampling T frame images from a video sequence to be predicted to serve as sparse frame sequence images, and randomly sampling alpha T frame images from the video sequence to be predicted to serve as dense frame sequence images; alpha is more than 1;
inputting the sparse frame sequence images into the spatial target positioning sub-network of the target behavior perception space-time positioning model established by the above method for establishing a target behavior space-time perception positioning model, so as to predict target positioning frames for the targets in the frame images, and labeling the sparse frame sequence images with the target positioning frame prediction results to obtain the query set images;
and inputting the support set image, the query set image and the dense frame sequence image into a target behavior perception space-time positioning model, predicting the behavior category and the target positioning frame of a frame image target according to the output of the target behavior perception space-time positioning model, and completing the target behavior space-time positioning.
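The inference procedure just described can be sketched as follows; the sampling helper and the model interface are hypothetical and mirror the skeleton shown earlier.

```python
import random

def sample_frames(num_frames, count):
    """Randomly sample `count` frame indices from a sequence of `num_frames`."""
    return sorted(random.sample(range(num_frames), count))

def spatiotemporal_inference(model, support_seq, support_masks, video, T, alpha=4):
    sparse_idx = sample_frames(len(video), T)           # sparse frame sequence
    dense_idx = sample_frames(len(video), alpha * T)    # dense frame sequence
    sparse = [video[i] for i in sparse_idx]
    dense = [video[i] for i in dense_idx]
    # 1) spatial target positioning sub-network predicts boxes for the sparse frames
    boxes = model.localizer(sparse)
    # 2) the box predictions label the sparse frames, forming the query set
    # 3) the full model predicts behavior classes and refined positioning frames
    behavior, boxes = model(support_seq, support_masks, sparse, boxes, dense)
    return behavior, boxes
```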
According to yet another aspect of the present invention, there is provided a computer readable storage medium comprising a stored computer program; when the computer program is executed by a processor, the device on which the computer readable storage medium resides is controlled to execute the method for establishing a target behavior space-time perception positioning model provided by the invention and/or the method for space-time perception and localization of target behaviors provided by the invention.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
(1) The invention introduces into the behavior-aware space-time positioning model a mask prediction module that predicts the target area mask feature map, and fuses the predicted target area mask feature map channel by channel with the image information feature map. This enhances the model's attention to the pixel information of the target area and reduces the influence of background interference, so that the pixel information of the region where the target is located is fully utilized, the accuracy of target area feature extraction is improved, and the detection and classification accuracy of behavior perception is ultimately improved; the model can therefore effectively and accurately realize behavior-aware space-time localization of rigid targets such as aircraft.
(2) When performing behavior-aware space-time localization, the established target behavior perception space-time positioning model only needs to extract image features, predict the target area mask feature map, perceive motion information, and perform target detection; it does not rely on color information. This makes the model pay more attention to the contour and structure information of the target in the video sequence to be detected, makes it applicable to video sequences displayed in grayscale, such as infrared video sequences, and gives it stronger generalization capability. Meanwhile, abandoning the color channels relaxes the memory requirements for training and running the model and greatly reduces the memory occupied at run time.
(3) The invention performs weakly supervised learning on the input video sequence. In the mask prediction module, multi-scale global feature information of the image is obtained by the first feature extraction module; the correlation measurement module allows rapid convergence even when the sample size is insufficient and reduces the number of parameters that need training; the feature compression module compresses the extracted global feature information, reducing the memory consumed by model computation at run time; and the down-sampling layer encodes and fuses the obtained feature information to produce the final mask prediction feature map. This realizes a weakly supervised mask prediction function, reduces the workload of producing large numbers of data labels, and relieves the dependence on strongly supervised ground-truth labels.
Drawings
FIG. 1 is a schematic structural diagram of a spatiotemporal perceptual positioning model of a target behavior according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a mask prediction module according to an embodiment of the present invention;
FIG. 3 is a block diagram of a context decoder according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an image information sensing module according to an embodiment of the present invention;
fig. 5 is a tag diagram of an input target positioning box and a mask prediction diagram of a mask prediction module according to an embodiment of the present invention; wherein, each line respectively corresponds to the image original image, the mask prediction image and the target positioning block diagram; the mask prediction results of two randomly selected frames of images in different sequences are respectively shown in (a), (b), (c) and (d);
FIG. 6 is a gradient class activation thermodynamic diagram for different behaviors in accordance with the method of the present invention; the same line is from samples of different frames of the same video sequence, original images of the video frames are sampled and displayed on the left side of a dotted line, the effect after gradient activation thermodynamic diagrams are superposed is on the right side of the dotted line, and the higher the brightness is, the stronger the perception to the area is; (a) activating a thermodynamic diagram for a gradient class corresponding to the decoy release behavior sequence, (b) activating a thermodynamic diagram for a gradient class corresponding to the target overturn behavior sequence;
FIG. 7 is a graphical representation of spatiotemporal behavior perception results of bait release for a single target multi-behavior state according to embodiments of the present invention and other methods; target behavior information contained in the frame sequence images of the same row sequentially comprises target flight, target turning, target release of bait bombs, bait bomb release process and bait bomb release end; wherein, (a) is the detection result of the method of the invention, (b) is the detection result of IM-BPSTM method, (c) is the information description diagram of the label of the detection result;
FIG. 8 is a graph showing the spatiotemporal perception results of bait release behavior for a multi-objective multi-state situation in accordance with an embodiment of the present invention; wherein the different line images are respectively from samples of different video sequences;
FIG. 9 is a graph showing the temporal-spatial perception results of the bait cartridge release behavior when the target is occluded according to the present invention;
FIG. 10 is a graph showing spatiotemporal perception results for target escape behavior according to an embodiment of the present invention; wherein (a), (b), (c), (d) are from samples of different video sequences, respectively;
FIG. 11 is a diagram showing the time-space sensing results of the embodiment of the present invention for the target takeoff behavior;
FIG. 12 is a graph showing the spatiotemporal perception results for target landing behavior according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
In the present application, the terms "first," "second," and the like (if any) in the description and the drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
In order to solve the technical problems that the existing behavior sensing space-time positioning method is not suitable for performing behavior sensing space-time positioning on rigid body targets such as aircrafts and the like because the pixel information of the region where the target is located is not fully utilized, the invention provides a method for establishing a target behavior space-time sensing positioning model and application thereof, and the overall thought is as follows: a mask prediction module for predicting a mask feature map of a target area is introduced into a model, the mask feature map of the target area obtained by prediction of the mask feature map prediction module and an image information feature map are subjected to channel-by-channel fusion, and then subsequent behavior classification is carried out, so that the attention of the model to pixel information of the target area can be enhanced, and the influence of background interference information is reduced, so that the pixel information of the area where the target is located is fully utilized, and meanwhile, color channels are abandoned, so that the memory condition of the model depending on training operation is reduced, and the memory occupation amount during operation is greatly reduced.
The following are examples.
Example 1:
a method for establishing a target behavior perception space-time positioning model comprises the following steps: and establishing a deep neural network, and training the deep neural network by using a data set to obtain a target behavior perception space-time positioning model.
As shown in fig. 1, in this embodiment the deep neural network comprises a space-time behavior perception sub-network and a spatial target positioning sub-network, and the space-time behavior perception sub-network comprises:

a mask prediction module, which performs target area feature perception on the support set images and the query set images and uses the perception result to predict a target mask for the query set images, obtaining the target area mask feature map;

an image information perception module, which performs feature extraction on the query set images to obtain the image information feature map;

an image-level feature fusion layer, which performs feature fusion of the target area mask feature map and the image information feature map and superimposes the fused features on the image information feature map to obtain the target-information-enhanced feature map;

a motion information perception module, which perceives the motion information in the dense frame sequence images to obtain the motion information feature map;

a global feature fusion layer, which performs feature compression on the motion information feature map and cascades it with the target-information-enhanced feature map to obtain the space-time behavior perception feature map;

and a behavior classification module, which extracts regions of interest from the space-time behavior perception feature map and predicts behavior class probabilities to obtain the behavior class prediction result for the target.

The spatial target positioning sub-network performs target identification, detection and localization on the query set images to obtain the target positioning frame prediction result.

The support set images are T frames randomly sampled from a video sequence labeled with mask labels; the query set images are T frames randomly sampled from the input video sequence and labeled with target positioning frame labels; the support set images and the query set images are different from each other, and T is a positive integer. The dense frame sequence images are αT frames randomly sampled from the input video sequence, with α > 1. Since the image information focuses more on the features of the two-dimensional plane while the motion information focuses more on the time dimension, in this embodiment the support set images and the query set images are taken as the input of the mask prediction module, the query set images are taken as the input of the image information perception module, and the dense frame sequence images are taken as the input of the motion information perception module. Optionally, in this embodiment, the support set images and the query set images are sampled every 8 frames and the dense frame sequence images are sampled every 2 frames, that is, α = 4.
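A small sketch of the sampling scheme used in this embodiment (every 8th frame for the support and query sets, every 2nd frame for the dense sequence, giving α = 4); the helper is illustrative.

```python
def sample_sparse_and_dense(num_frames, sparse_stride=8, dense_stride=2):
    """Return frame indices for the sparse (support/query) and dense sequences."""
    sparse_idx = list(range(0, num_frames, sparse_stride))
    dense_idx = list(range(0, num_frames, dense_stride))
    # with these strides len(dense_idx) is about 4x len(sparse_idx), i.e. alpha = 4
    return sparse_idx, dense_idx
```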
Referring to fig. 2, in this embodiment the mask prediction module comprises:

a first feature extraction module, which performs feature extraction on the support set images and the query set images respectively to obtain a support set image feature map and a query set image feature map;

a first background suppression module, which uses the mask labels of the support set images to suppress the background in the support set image features, obtaining a first target-enhanced feature map;

a second background suppression module, which uses the target positioning frame labels of the query set images to suppress the background in the query set image features, obtaining a second target-enhanced feature map;

a correlation measurement module, which measures the similarity of the first and second target-enhanced feature maps to obtain a hyper-correlation feature matrix;

a feature compression module, which compresses the hyper-correlation feature matrix along the depth dimension to obtain a feature compression matrix;

a feature fusion module, which uses the residual idea to perform feature fusion on each level of features in the feature compression matrix, obtaining a feature fusion matrix;

and a down-sampling layer, which converts the feature fusion matrix into a two-dimensional feature map, obtaining the target area mask feature map.
As an alternative implementation, in this embodiment ResNet-50 is used as the first feature extraction module. It should be noted that other common backbone feature extraction networks composed of multiple layers of network sub-modules may also be applied to the present invention.
In fig. 2, the first background suppression module uses the mask label $M_{s,i}$ of the support set image to suppress the background in the support set image features and obtain the first target-enhanced feature map, specifically:

the mask label of the support set image is converted by bilinear interpolation to the same size as the corresponding feature of each level of the support set image feature map and then multiplied element-wise (Hadamard product) with the support set image feature map to obtain the first target-enhanced feature map. The corresponding computational expression is

$$\hat{F}_{s,i}^{l} = \Psi_l\!\left(M_{s,i}\right) \odot F_{s,i}^{l}, \qquad l = 1,\dots,L,$$

where $\Psi_l(\cdot)$ denotes the bilinear interpolation function that converts the support mask label $M_{s,i}$ to the same size as the $l$-th level support feature $F_{s,i}^{l}$, $\odot$ denotes the Hadamard product, and $L$ denotes the total number of levels of the support set image feature map. The set of all-level features of the support set image feature map for the $i$-th frame is written $\{F_{s,i}^{l}\}_{l=1}^{L}$, and the set of all-level features of the first target-enhanced feature map for the $i$-th frame is written $\{\hat{F}_{s,i}^{l}\}_{l=1}^{L}$.
Similarly, the second background suppression module uses the target positioning frame label $B_{q,i}$ of the query set image to suppress the background in the query set image features and obtain the second target-enhanced feature map, specifically:

the target positioning frame label of the query set image is converted by bilinear interpolation to the same size as the corresponding feature of each level of the query set image feature map and then multiplied element-wise (Hadamard product) with the query set image feature map to obtain the second target-enhanced feature map. The corresponding computational expression is

$$\hat{F}_{q,i}^{l} = \Psi_l\!\left(B_{q,i}\right) \odot F_{q,i}^{l}, \qquad l = 1,\dots,L,$$

where $\Psi_l(\cdot)$ denotes the bilinear interpolation function that converts the target positioning frame label $B_{q,i}$ to the same size as the $l$-th level query feature $F_{q,i}^{l}$, $\odot$ denotes the Hadamard product, and $L$ denotes the total number of levels of the query set image feature map. The set of all-level features of the query set image feature map for the $i$-th frame is written $\{F_{q,i}^{l}\}_{l=1}^{L}$, and the set of all-level features of the second target-enhanced feature map for the $i$-th frame is written $\{\hat{F}_{q,i}^{l}\}_{l=1}^{L}$.
The background suppression modules thus use the mask labels or target positioning frame labels to suppress the background, effectively reducing the dependence on the background and realizing target enhancement.
In fig. 2, the correlation measurement module measures the similarity of the first target-enhanced feature map and the second target-enhanced feature map to obtain the hyper-correlation feature matrix. The corresponding computational expression is

$$C_{i}^{l} = \xi\!\left(\operatorname{sim}\!\left(\hat{F}_{q,i}^{l}, \hat{F}_{s,i}^{l}\right)\right),$$

where $\operatorname{sim}(\cdot,\cdot)$ denotes the similarity measure between the two target-enhanced feature maps, $\xi(\cdot)$ denotes an activation function used to suppress noise in the similarity measurement result (in this embodiment, the ReLU activation function), and the set of all-level features of the hyper-correlation feature matrix for the $i$-th frame is written $\{C_{i}^{l}\}_{l=1}^{L}$.

In this embodiment, the similarity measure is thus computed between the support set image feature map pixel-fused with the mask label (the first target-enhanced feature map) and the query set image feature map pixel-fused with the target frame label (the second target-enhanced feature map) to obtain the hyper-correlation feature matrix.
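The patent text does not spell out the similarity measure, so the sketch below assumes cosine similarity between query and support feature vectors at every pair of spatial positions, followed by ReLU to play the role of ξ(·) in suppressing noisy correlations.

```python
import torch
import torch.nn.functional as F

def hyper_correlation(f_query, f_support):
    """f_query: (B, C, Hq, Wq); f_support: (B, C, Hs, Ws).
    Returns a 4D correlation tensor (B, Hq, Wq, Hs, Ws)."""
    q = F.normalize(f_query.flatten(2), dim=1)    # (B, C, Hq*Wq), unit norm per position
    s = F.normalize(f_support.flatten(2), dim=1)  # (B, C, Hs*Ws)
    corr = torch.einsum("bcq,bcs->bqs", q, s)     # cosine similarity for every pair
    corr = F.relu(corr)                           # xi(): suppress noisy correlations
    b, _, hq, wq = f_query.shape
    hs, ws = f_support.shape[-2:]
    return corr.view(b, hq, wq, hs, ws)
```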
As shown in fig. 2, in this embodiment the feature compression module comprises 3 compression sub-modules, each of which contains 2 convolution layers with 5 × 5 kernels and 1 convolution layer with a 3 × 3 kernel, with an activation layer and a normalization layer after each convolution layer.

It compresses the hyper-correlation feature matrix along the depth dimension to reduce the memory occupied during computation while retaining the more effective feature information, obtaining the feature compression matrix, namely

$$\bar{C}_{i}^{l} = f_{k\_sqz_{l}}\!\left(C_{i}^{l}\right),$$

where $f_{k\_sqz_{l}}(\cdot)$ denotes a combination of 4D convolution, group normalization and ReLU activation, $k\_sqz_{l}$ denotes the convolution kernel size corresponding to the $l$-th level, and the set of all compressed levels of the hyper-correlation feature matrix for the $i$-th frame is written $\{\bar{C}_{i}^{l}\}_{l=1}^{L}$.

In this embodiment, the feature compression module compresses the cascaded support and query features, keeping the feature dimensions of the query image unchanged while compressing the feature dimensions of the support image, so that the hyper-correlation feature matrix is reduced in dimensionality without loss of feature information; this greatly reduces the memory consumed by an otherwise oversized matrix in subsequent computation. It should be noted that the number of compression sub-modules in the feature compression module can be adjusted according to the actual computational load and the required compression effect, but should generally be greater than 2.
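The module described above uses 4D convolutions over the correlation tensor; the sketch below is a simplified stand-in that folds the query positions into the batch dimension and compresses only the support-side spatial dimensions with 2D convolutions (two 5 × 5 layers and one 3 × 3 layer, each followed by group normalization and ReLU). Channel widths and strides are assumptions.

```python
import torch
import torch.nn as nn

class CompressionSubmodule(nn.Module):
    """Simplified compression block: two 5x5 convolutions and one 3x3 convolution,
    each followed by group normalization and ReLU, reducing the support-side
    resolution of the correlation tensor."""
    def __init__(self, out_ch=16):
        super().__init__()
        def block(ci, co, k, stride):
            return nn.Sequential(
                nn.Conv2d(ci, co, k, stride=stride, padding=k // 2),
                nn.GroupNorm(4, co),
                nn.ReLU(inplace=True),
            )
        self.body = nn.Sequential(
            block(1, out_ch, 5, 2),       # compress the support spatial dimensions
            block(out_ch, out_ch, 5, 1),
            block(out_ch, out_ch, 3, 1),
        )

    def forward(self, corr):
        # corr: (B, Hq, Wq, Hs, Ws) -> fold query positions into the batch dimension
        b, hq, wq, hs, ws = corr.shape
        x = self.body(corr.reshape(b * hq * wq, 1, hs, ws))
        return x.reshape(b, hq, wq, x.shape[1], x.shape[2], x.shape[3])
```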
In this embodiment, the feature fusion module has a structure similar to that of the feature compression module and comprises 3 fusion sub-modules, each of which contains 3 convolution layers with 3 × 3 kernels, with an activation layer and a normalization layer after each convolution layer.

Using the residual idea, it performs a feature fusion operation on each level of features in the feature compression matrix, with $k\_mix_{l}$ denoting the convolution kernel size corresponding to the $l$-th level; this realizes the fusion of the multi-level structural features, and the fusion output of the last level yields the feature fusion matrix.

In this embodiment, the feature fusion module adopts a step-by-step nesting manner and uses a multi-layer residual structure to fuse features of different receptive fields, so that the query image features and the support image features can be fused better; as the number of network layers increases, the extracted features gradually tend toward semantic information, and better multi-level feature information is obtained, such as deep abstract information together with low-level detail information.
In this embodiment, the down-sampling layer comprises a mean pooling layer, which converts the feature fusion matrix into a 2-dimensional feature map, i.e. the target area mask feature map; this map is then passed, together with the image information feature map, to the image-level feature fusion layer to obtain the feature map with strengthened target position information.
In order to train the mask prediction module effectively and to evaluate its prediction error, in this embodiment the behavior-aware space-time positioning model further comprises, in the training phase:

a context decoder, which restores the target area mask feature map to the same size as the original input image to obtain the query set target mask prediction map. Its structure is shown in fig. 3 and comprises several 3 × 3 2D convolution layers, ReLU activation layers and up-sampling layers connected in series, with the output of a final softmax layer taken as the prediction result; by means of the context decoder, the target mask prediction map, restored to the same size as the original input image, is obtained from the predicted target area feature map. In this embodiment, the mask labels of the query set images are used as the optimization target to calculate the mask prediction loss, realizing supervised training of the mask prediction module and giving it good mask prediction capability.
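A sketch of a context decoder of the kind described (3 × 3 convolutions, ReLU and up-sampling connected in series, with a softmax output); the number of stages and channel widths are assumptions.

```python
import torch
import torch.nn as nn

class ContextDecoder(nn.Module):
    """Restore the target area mask feature map to the input image size
    and output per-pixel class probabilities (background / target)."""
    def __init__(self, in_ch=1, mid_ch=32, num_stages=3):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(num_stages):
            layers += [
                nn.Conv2d(ch, mid_ch, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            ]
            ch = mid_ch
        layers.append(nn.Conv2d(mid_ch, 2, kernel_size=3, padding=1))
        self.body = nn.Sequential(*layers)

    def forward(self, f_mask):
        return torch.softmax(self.body(f_mask), dim=1)  # (B, 2, H, W)
```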
As shown in fig. 1, the structure of the image information perception module in this embodiment is shown in fig. 4 and comprises a convolution layer, a pooling layer, 4 convolution blocks connected in sequence, and a mean pooling layer. Its computation can be expressed as this cascade of operations applied to the overall representation of the T frame images from the query set, where the convolution kernels are conventional n × n kernels with time dimension t and ρ denotes the mean pooling operation.
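Since fig. 4 is not reproduced here, the sketch below only mirrors the described composition (a convolution layer, a pooling layer, four convolution blocks and a mean pooling layer) using 3D convolutions over the frame dimension; kernel sizes and channel widths are assumptions.

```python
import torch
import torch.nn as nn

class ImageInfoPerception(nn.Module):
    """Conv + pool, four convolution blocks, then mean pooling over space."""
    def __init__(self, in_ch=1, base_ch=32):
        super().__init__()
        def conv_block(ci, co, t=3, n=3):
            return nn.Sequential(
                nn.Conv3d(ci, co, kernel_size=(t, n, n),
                          padding=(t // 2, n // 2, n // 2)),
                nn.ReLU(inplace=True),
            )
        self.stem = nn.Sequential(conv_block(in_ch, base_ch),
                                  nn.MaxPool3d(kernel_size=(1, 2, 2)))
        self.blocks = nn.Sequential(*[conv_block(base_ch, base_ch) for _ in range(4)])
        self.pool = nn.AdaptiveAvgPool3d((None, 14, 14))  # rho: mean pooling over space

    def forward(self, x):        # x: (B, C, T, H, W) sparse query frames
        return self.pool(self.blocks(self.stem(x)))
```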
As shown in fig. 1, to mask a feature map on a target area
Figure BDA00035681383000001411
And image information feature map
Figure BDA00035681383000001412
In the embodiment, the image-level feature fusion layer masks the feature map of the target region
Figure BDA00035681383000001413
And image information feature map
Figure BDA00035681383000001414
Performing channel-by-channel feature fusion, and combining the fused features with an image information feature map
Figure BDA00035681383000001415
Obtaining a target information enhanced feature map by superposition
Figure BDA00035681383000001416
The process can be represented as:
the target area mask feature map is converted by a bilinear interpolation operation to the same dimension as the channel features of the image information feature map, a Hadamard product is then taken with the corresponding channel of the image information feature map, and the result is added to the image information feature map to obtain the target information enhanced feature map, namely:

$$F_{te}^{c} = \Psi_c\left(F_{mask}^{c}\right) \odot F_{img}^{c} + F_{img}^{c},\qquad c = 1,\dots,C$$

wherein ⊙ denotes the Hadamard product; Ψc(·) denotes the bilinear interpolation function that converts the target area mask feature map to the same channel size as the image information feature map; C is the number of feature channels; F_img^c denotes the feature representation of the c-th channel of the image information feature map; and F_mask^c denotes the feature representation of the c-th channel of the target area mask feature map.
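A minimal sketch of this channel-wise fusion, assuming PyTorch and 4-D (batch, channel, height, width) feature maps of matching channel counts:

```python
import torch
import torch.nn.functional as F

def image_level_fusion(mask_feat, img_feat):
    """mask_feat: (B, C, h, w) target-area mask feature map
       img_feat:  (B, C, H, W) image information feature map"""
    resized = F.interpolate(mask_feat, size=img_feat.shape[-2:],
                            mode="bilinear", align_corners=False)  # Psi_c: bilinear resize
    return resized * img_feat + img_feat                           # Hadamard product, then add back
```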
As shown in fig. 1, in this embodiment, the structure of the motion information perception module is similar to that of the image information perception module; its calculation applies the same sequence of convolution and pooling operations to the overall representation of the frame images from the dense frame sequence.
As shown in FIG. 1, in this embodiment, the global feature fusion layer performs feature compression on the motion information feature map and cascades it with the target information enhanced feature map to obtain the spatio-temporal behavior perception feature map, as follows:
the motion information feature map is compressed by a convolution operation to the same dimensionality as the target information enhanced feature map and is then concatenated with the target information enhanced feature map (feature cascade splicing) to obtain the spatio-temporal behavior perception feature map; the convolution operation employs a convolution kernel of size 1 × 1 with a time dimension of 5.
in this embodiment, after the global feature fusion layer unifies dimensions of the feature maps to be fused, feature fusion is realized in a tensor splicing manner, so that the feature maps obtained by fusion retain image information and motion information at the same time.
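A minimal sketch of this fusion step, assuming PyTorch, 5-D motion features of shape (batch, channel, time, height, width) and illustrative channel counts; the assumption that the temporal extent equals the 5-frame kernel is noted in the comments.

```python
import torch
import torch.nn as nn

class GlobalFeatureFusion(nn.Module):
    def __init__(self, motion_ch=256, target_ch=128):
        super().__init__()
        # 1 x 1 spatial kernel with time dimension 5; no temporal padding, so the
        # 5-frame window is collapsed (assumes the motion feature has T' = 5 here)
        self.compress = nn.Conv3d(motion_ch, target_ch, kernel_size=(5, 1, 1))

    def forward(self, motion_feat, target_feat):
        """motion_feat: (B, Cm, T', H, W); target_feat: (B, Ct, H, W)"""
        m = self.compress(motion_feat).squeeze(2)   # -> (B, Ct, H, W) when T' == 5
        return torch.cat([m, target_feat], dim=1)   # feature cascade (channel concatenation)
```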
In this embodiment, an image information perception module that acquires image information from sparse frame images and a motion information perception module that acquires motion information from dense frame images are constructed by nesting combinations of convolution blocks and pooling layers multiple times. The features produced by the image information perception module are fused with the target area mask feature map at the image level, and the global feature fusion layer then fuses the spatio-temporal information to obtain a target-information-enhanced spatio-temporal feature map.
As shown in FIG. 1, in this embodiment, the behavior classification module comprises a region-of-interest extractor that extracts regions of interest from the input spatio-temporal behavior perception feature map, and a multi-class classifier that outputs the maximum-likelihood behavior classification result.
Referring to fig. 1, in the present embodiment, the spatial domain target positioning sub-network includes:
a second feature extraction module for performing feature extraction on the query set images;
the suggestion frame generation module is used for carrying out target detection according to the features extracted by the second feature extraction module to generate suggestion frames; the suggestion box is used for identifying the possible existing area of the target;
and the positioning frame regression module is used for identifying a suggestion frame with a target from the suggestion frames generated by the suggestion frame generation module to serve as a target positioning frame.
As an optional implementation, in this embodiment the spatial domain target positioning sub-network adopts a Faster R-CNN network and the second feature extraction module adopts a ResNet-50 backbone network; it should be noted that other common target detection and positioning networks and feature extraction backbone networks may also be applied to the present invention.
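As an illustration only, torchvision's stock Faster R-CNN with a ResNet-50-FPN backbone can serve as a stand-in for such a sub-network (the patent's own implementation is not published as code, and the stock model expects 3-channel input):

```python
import torch
import torchvision

# Untrained detector used purely as a structural example of the sub-network choice.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)
detector.eval()
with torch.no_grad():
    frames = [torch.rand(3, 256, 256)]   # one query-set frame (dummy data)
    proposals = detector(frames)         # list of dicts with boxes, labels, scores
```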
In the embodiment, the deep neural network is trained by using a data set, wherein the training comprises a first stage training and a second stage training;
the first stage of training comprises:
constructing a support set image by using the video sequence labeled with the mask label, constructing a query set image by using the video sequence labeled with the mask label and the target positioning frame label, and constructing a first data set by using the constructed support set image and the constructed query set image;
performing supervised training on the mask prediction module by using a first data set to obtain a pre-trained mask prediction module;
the second stage of training comprises:
loading pre-trained mask prediction module parameters in a deep neural network, constructing a support set image by using a video sequence labeled with a mask label, constructing a query set image and a dense frame sequence image by using a video sequence labeled with a target behavior label and a target positioning frame label, constructing a second data set by using the constructed support set image, the query set image and the dense frame sequence image, and training a deep neural network model by using the second data set to obtain a target behavior perception space-time positioning model;
the training loss function is:

$$\mathcal{L}_{total} = \eta_1\,\mathcal{L}_{cls} + \eta_2\,\mathcal{L}_{loc} + \varepsilon\,\mathcal{L}_{mask}$$

wherein L_total denotes the total loss; L_cls denotes the behavior classification loss calculated from the spatio-temporal feature information, implemented as a multi-class cross entropy; L_loc denotes the behavior frame positioning loss calculated from the spatio-temporal feature information, comprising a classification loss and a bounding-box regression loss; L_mask denotes the prediction loss of the mask prediction; η1, η2 and ε are preset weight parameters. In this embodiment, in the first training stage η1 = η2 = 0 and ε = 1; in the second training stage η1 = η2 = 0.5 and ε = 0.1.
For the ith frame of input image, the mask prediction module outputs a target mask prediction image, and the corresponding mask prediction loss is a pixel-wise binary cross entropy over that image:

$$\mathcal{L}_{mask}^{i} = -\sum_{x}\left[y_i(x)\log p_i(x) + \left(1 - y_i(x)\right)\log\left(1 - p_i(x)\right)\right]$$

wherein x ranges over the pixel points of the target mask prediction image; y_i(x) denotes the label value corresponding to pixel x of the ith frame input image, with positive samples valued 1 and negative samples valued 0; and p_i(x) denotes the probability that pixel x of the ith frame input image is predicted as a positive sample. In the first training stage, the mask prediction loss is obtained by measuring the error between the prediction result and the mask labels of the query data set; in the second training stage, it is obtained by measuring the error between the prediction result and the target positioning frame labels of the query data set.
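A minimal sketch of these losses, assuming PyTorch; the mean reduction of the binary cross entropy is an assumption, since the exact normalization is not stated in the text:

```python
import torch
import torch.nn.functional as F

def mask_prediction_loss(pred_prob, target_mask):
    """pred_prob: (B, H, W) probability that each pixel is a positive (target) sample
       target_mask: (B, H, W) labels in {0, 1} (mask label or box-derived pseudo label)"""
    return F.binary_cross_entropy(pred_prob, target_mask.float())

def total_loss(loss_cls, loss_box, loss_mask, stage):
    # stage-dependent weights eta1, eta2, epsilon from the embodiment above
    eta1, eta2, eps = (0.0, 0.0, 1.0) if stage == 1 else (0.5, 0.5, 0.1)
    return eta1 * loss_cls + eta2 * loss_box + eps * loss_mask
```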
In this embodiment, before model training, preprocessing operations such as deformation and image normalization are performed on images obtained by random sampling in an input video sequence, and training sample data expansion is performed; optionally, in this example, the data enhancement method includes random cropping, random size modification, and random flipping;
In the first stage, the mask prediction module is trained with supervision to obtain good pixel classification capability for the target area. Optionally, in this embodiment, the public Pascal VOC data set is used for this supervised training; it contains 20 classes of images (aircraft and others) in different scenes together with the corresponding target frame labels and mask labels.
In the second stage, the mask prediction module is trained with weak supervision: the target frame labels are used as pseudo labels of the query set images to obtain the prediction mask of the image target area. Based on the query set images and target positioning frame labels, the image information perception module and the motion information perception module are trained with supervision and the parameters of the overall model are adjusted. In this embodiment, the inputs are gray-scale video sequences of an aircraft in different states in real scenes, together with the corresponding target positioning frame labels. Because there is a domain difference between the first data set and the second data set, the mask prediction module is still fine-tuned in the second training stage: a support set image is constructed from the video sequence labeled with mask labels, and the target frame labels are used as pseudo labels of the query set images to adjust the model parameters, reducing the prediction loss caused by the domain difference.
Performing weakly supervised learning on the mask prediction module by inputting the target frame label as a pseudo label of the query set image comprises:
according to the target positioning frame label of the input image I_qi, each pixel (x_i, y_i) of the ith frame input image is assigned to the set of target-area pixels if it lies inside the labeled box, i.e. if x_i^0 ≤ x_i ≤ x_i^0 + w_i and y_i^0 ≤ y_i ≤ y_i^0 + h_i, and to the set of background-area pixels otherwise; here (x_i^0, y_i^0) are the coordinates of the top-left vertex of the labeled box of the ith frame input image, and w_i and h_i denote the width and height of the labeled box. The resulting pseudo mask supervises the output of the mask prediction module for the query set image.
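A minimal sketch of this box-to-pseudo-mask conversion, assuming PyTorch tensors and (x0, y0, w, h) box labels:

```python
import torch

def box_to_pseudo_mask(box, height, width):
    """box: (x0, y0, w, h) top-left corner plus box width/height, in pixels."""
    x0, y0, w, h = box
    mask = torch.zeros(height, width)
    mask[int(y0):int(y0 + h), int(x0):int(x0 + w)] = 1.0   # target-area pixel set
    return mask                                            # everything else: background set
```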
As shown in fig. 5, the first row shows original images of the input video sequence, the second row shows the mask prediction maps output by the mask prediction module and the context decoder, and the third row shows the target positioning frame label maps obtained by converting the target coordinate labels; (a), (b), (c) and (d) show the mask prediction results for two randomly selected frames from different sequences. Optionally, in this embodiment, the first training stage runs for 2000 rounds with an initial learning rate of 0.001 and the adaptive-learning-rate Adam optimizer; the second training stage runs for 100 rounds with an initial learning rate of 0.01, the learning rate adjusted by stochastic gradient descent with a momentum term, and the weight decay set to 10^-5 to prevent overfitting. The input video is uniformly sampled along the time axis at 10 frames per second; the image sequence size is (1920, 1080), and the data enhancement module randomly scales the shorter side to 256 pixels and randomly crops to obtain a uniform (256, 256) input.
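A minimal sketch of the corresponding optimization and preprocessing setup, assuming PyTorch/torchvision; the momentum value of 0.9 is an assumption, since the text states only that a momentum term is used:

```python
import torch
from torchvision import transforms

def make_optimizer(model, stage):
    if stage == 1:
        return torch.optim.Adam(model.parameters(), lr=1e-3)      # stage 1: Adam, lr 0.001
    return torch.optim.SGD(model.parameters(), lr=1e-2,
                           momentum=0.9, weight_decay=1e-5)        # stage 2: SGD + momentum

frame_transform = transforms.Compose([
    transforms.Resize(256),            # scale the shorter side to 256 pixels
    transforms.RandomCrop(256),        # uniform 256 x 256 input
    transforms.RandomHorizontalFlip(), # random flipping (one of the listed augmentations)
    transforms.ToTensor(),
])
```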
In summary, the target behavior perception spatio-temporal positioning model established in this embodiment performs mask prediction while extracting features from the input video sequence frames to obtain a target area mask feature map, fuses this map channel by channel with the image information feature map to obtain a target information enhanced feature map, fuses the target information enhanced feature map with the extracted motion information feature map into a spatio-temporal behavior perception feature map, and finally extracts regions of interest from the spatio-temporal behavior perception feature map to obtain the behavior classification result of the target. Predicting the target area mask feature map and fusing it channel by channel with the image information feature map strengthens the model's attention to the pixel information of the target area and reduces the influence of background interference, so the pixel information of the region where the target is located is fully exploited, the accuracy of target area feature extraction is improved, and the detection and classification accuracy of spatio-temporal behavior perception is ultimately improved; the model can therefore effectively and accurately realize behavior perception spatio-temporal positioning of rigid targets such as aircraft.
When the model performs behavior perception spatio-temporal positioning, it only needs to extract image features, predict the target area mask feature map, perceive motion information and perform target detection; it does not depend on color information and focuses instead on the contour structure of the target in the video sequence to be detected, so it is applicable to video sequences displayed in gray scale, such as infrared sequences, and has stronger generalization capability. Discarding the color channels also relaxes the memory requirements of training and inference and greatly reduces the runtime memory footprint.
Example 2:
a space-time perception positioning method for target behaviors comprises the following steps:
randomly sampling T frame images from a video sequence marked with a mask label to serve as support set images;
randomly sampling T frame images from a video sequence to be predicted to serve as sparse frame sequence images, and randomly sampling alpha T frame images from the video sequence to be predicted to serve as dense frame sequence images; alpha is more than 1;
inputting the sparse frame sequence images into a spatial target positioning sub-network in the target behavior sensing space-time positioning model established by the method for establishing the target behavior space-time sensing positioning model provided by the embodiment 1 to predict a target positioning frame of a frame image target, and labeling the sparse frame sequence images by using a target positioning frame prediction result to obtain a query set image;
the support set image, the query set image and the dense frame sequence image are input into the target behavior space-time perception positioning model established by the method for establishing the target behavior space-time perception positioning model provided by the embodiment 1, the behavior category and the target positioning frame of the frame image target are predicted according to the output of the target behavior space-time perception positioning model, and the target behavior space-time positioning is completed.
Example 3:
a computer readable storage medium comprising a stored computer program; when the computer program is executed by the processor, the apparatus on which the computer readable storage medium is located is controlled to execute the method for establishing the target behavior spatio-temporal perceptual positioning model provided in the above embodiment 1, and/or the method for establishing the target behavior spatio-temporal perceptual positioning model provided in the above embodiment 2.
The performance of the model established by the invention is further explained by combining the specific target behavior space-time perception positioning result.
For convenience of description, in the following description, the model established by the above embodiment 1 will be denoted as IM-BPSTM-Mask model.
Based on the gradient class activation heat map method Grad-CAM, an importance weight matrix is constructed from the forward result and the back-propagated gradient values of the input image feature map and is displayed as a heat map. Fig. 6 shows the gradient class activation heat maps obtained with the IM-BPSTM-Mask model established in embodiment 1, where images in the same row come from different frames of the same video sequence; the left side of the dotted line shows the original video-frame samples, and the right side shows the effect after the gradient class activation heat maps are superimposed, with higher brightness indicating stronger perception of the region. (a) shows the gradient class activation heat maps for the bait release behavior sequence, and (b) those for the target overturn behavior sequence. According to the results in fig. 6, the IM-BPSTM-Mask model reduces the activation of background interference regions and focuses mainly on the region where the target is located and on the regions relevant to behavior perception. From (a) in fig. 6, the activation region for the bait release behavior sequence is concentrated mainly on the target itself and the position of the bait cartridge, and on the target region before release starts and after it ends, effectively ignoring the interference of the bait cartridge tail smoke in the background. From (b) in fig. 6, the activation region for the target overturn behavior sequence is concentrated mainly on the head region, where the change of the target's pitch angle is easiest to judge. The IM-BPSTM-Mask model can therefore effectively focus on the typical characteristics of target behavior change while suppressing the perception of irrelevant background information.
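For reference, a minimal Grad-CAM sketch in PyTorch is given below; the `features` sub-module name and the 2D (1, C, H, W) activation shape are assumptions about the model being visualized:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, class_idx):
    feats = {}
    def hook(_, __, output):
        feats["a"] = output
        output.retain_grad()                          # keep gradients of the activation map
    handle = model.features.register_forward_hook(hook)
    score = model(image.unsqueeze(0))[0, class_idx]   # forward result for the chosen class
    score.backward()                                  # back-propagated gradient values
    handle.remove()
    a, g = feats["a"], feats["a"].grad                # activations and their gradients
    weights = g.mean(dim=(2, 3), keepdim=True)        # importance weight per channel
    cam = F.relu((weights * a).sum(dim=1))            # weighted combination, then ReLU
    return cam / (cam.max() + 1e-9)                   # normalized heat map
```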
The beneficial effects of the target behavior space-time perception positioning method provided by the present invention are further verified below by comparing embodiment 2 with existing behavior perception space-time positioning methods. Gray-scale aircraft video sequences are used as input; the data set contains 107 video sequences with 11186 frames in total, of which 65 sequences are used as training samples and 42 as test samples. The same video sequence may contain multiple behavior classes of the target, the same target may exhibit multiple behavior classes at the same time, and a video sequence may contain multiple targets. The label information of each behavior category is shown in fig. 7. Statistics of the number of samples per behavior category are given in table 1; since the task mainly targets behaviors such as bait release and escape, the numbers of take-off and landing samples are small.
TABLE 1
Behavior category    Number of samples
Flying               13483
Bait release         5266
Escape               951
Somersault           730
Taking off           197
Landing              109
Ground sliding       230
The intersection over union (IoU) is used as the basis for comparing detection results; it reflects the ratio of the intersection area to the union area of the predicted rectangular box and the actual target box. When the computed IoU is greater than a set threshold, the detection result is considered correct.
TP (true positive) denotes the number of target ground-truth boxes detected correctly; FP (false positive) denotes predictions whose IoU with the ground-truth box is less than the threshold; FN (false negative) denotes the number of missed target boxes; TN (true negative) denotes the number of non-target boxes. Precision measures the proportion of correct samples among the predictions, and Recall measures the proportion of samples in the test set that are correctly detected:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
AP@50 and AP@75 denote the average precision AP at IoU thresholds of 0.5 and 0.75, respectively; AR@50:5:95 denotes the average recall AR over IoU thresholds from 50% to 95% in steps of 5%.
The behavior classification module is based on the structure of the classic two-stage target detection framework Faster R-CNN, adjusted to suit the video spatio-temporal positioning task. RoI features are extracted from the output of the last feature map of the global feature fusion layer introduced by the present invention: the two-dimensional RoI of each frame is expanded into a three-dimensional RoI by copying along the time dimension, spatial RoI pooling and global average pooling over the time dimension are applied to obtain the RoI features, and the max-pooled RoI features are then fed into a sigmoid-based classifier for multi-label prediction.
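A minimal sketch of this RoI pathway, assuming PyTorch/torchvision; the channel count, RoI size and the use of `roi_align` for the spatial RoI pooling are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class BehaviorClassifier(nn.Module):
    def __init__(self, channels=256, num_classes=7, roi_size=7):
        super().__init__()
        self.roi_size = roi_size
        self.head = nn.Linear(channels, num_classes)

    def forward(self, feat, boxes):
        """feat: (B, C, T, H, W) spatio-temporal behavior perception feature map
           boxes: list of (N_i, 4) tensors of per-image RoIs in feature-map coordinates"""
        B, C, T, H, W = feat.shape
        per_time = []
        for t in range(T):                                   # copy each 2D RoI along time
            per_time.append(roi_align(feat[:, :, t], boxes, output_size=self.roi_size))
        roi_feat = torch.stack(per_time, dim=0).mean(dim=0)  # global average pooling over time
        roi_feat = roi_feat.amax(dim=(2, 3))                 # spatial max pooling -> (sum N_i, C)
        return torch.sigmoid(self.head(roi_feat))            # multi-label behavior scores
```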
The parameters of the spatial domain target positioning sub-network are taken from an existing target detection model and are not jointly trained with the IM-BPSTM-Mask model provided by the invention. A Faster R-CNN target detection model with a ResNet-50-FPN backbone, trained from scratch on ImageNet, is used as the baseline; it is pre-trained with Pascal VOC and then fine-tuned with frame sequence images containing target positioning frame labels to obtain the spatial domain target positioning sub-network parameters used here. Its test result on the data set reaches 0.896 AP@50 and 0.816 AP@75, with a recall of 0.703 AR@50:5:95.
Performance is compared among I-BPSTM, a model that uses only the image information perception module; IM-BPSTM, a model that extracts pixel-level features with the image information perception module and directly fuses the motion information feature map; and IM-BPSTM-Mask, the model established in embodiment 1. The final result is averaged over the softmax prediction scores, and the performance index is the mean average precision (mAP) over the 7 behavior classes, measured with a frame-level IoU threshold of 0.5.
Table 2 shows the comparison of the test results of IM-BPSTM-Mask and other space-time positioning model methods.
As can be seen from the table 2, the target behavior space-time perception positioning model IM-BPSTM-Mask provided by the invention obtains the best detection performance in the categories of flying, bait releasing, escaping, overturning and ground sliding, obtains the suboptimal detection performance in the two categories of taking off and landing, and has a smaller difference with the optimal performance. As shown in table 1, since the embodiment mainly aims at the detection of bait release and escape behavior, the data sample size of take-off and landing behavior is small and not representative. For comprehensive behavior category perception, the IM-BPSTM-Mask model obtains the best detection performance on each category of average detection precision mAP indexes, which means that the algorithm provided by the invention has more stable performance and higher average accuracy.
TABLE 2
Behavior category    I-BPSTM    IM-BPSTM    IM-BPSTM-Mask
Flying               0.991      0.996       0.997
Bait release         0.365      0.665       0.705
Escape               0.021      0.551       0.679
Somersault           0.152      0.219       0.281
Taking off           0.545      0.667       0.627
Landing              0.487      0.978       0.846
Ground sliding       0.150      0.406       0.885
mAP                  0.680      0.819       0.841
Further, fig. 7 shows the behavior space-time perception results of the bait release process for the single-target multi-behavior state according to the embodiment of the present invention and other methods; target behavior information contained in the frame sequence images in the same row sequentially comprises target flight, target turning start, target releasing start, a bait bomb releasing process and a bait bomb releasing end; wherein, (a) is the detection result of the method of the invention, (b) is the detection result of the IM-BPSTM method, and (c) is the information description chart of the detection result label. The target positioning result is labeled on the graph in a form of a coordinate frame, information at the upper left corner of the coordinate frame corresponding to the target identifies the behavior state of the current frame of the target and the probability estimation value of the behavior, and the correspondence between the labeling information and the behavior category is shown in (c) of fig. 7. According to the comparison results of the corresponding time points of (a) and (b) in fig. 7, the IM-BPSTM-Mask provided by the invention can sense the target behavior change at the target turning starting time, the target bait release starting time and the target bait release ending time earlier. The method can more accurately position the moment of the target behavior of the video sequence and has more sensitive perception capability on the change of the target behavior.
In order to illustrate that the target behavior space-time perception positioning method provided by the present invention is also applicable to multi-target, multi-behavior situations, this embodiment performs target spatio-temporal behavior perception on bait release behaviors under multi-target, multi-behavior conditions. The spatio-temporal behavior perception positioning results of the IM-BPSTM-Mask model are shown in fig. 8, where the images in different rows come from samples of different video sequences. It can be seen that the behavior of each target in a video sequence is determined independently: while one target is in the release state, another target may be in flight, and even with multiple targets the model still perceives in time the moment a target begins to release bait. Unlike most behavior perception models, which can only perform uniform behavior perception on the objects in a video sequence, i.e. classify behavior from the sequence images as a whole, the method provided by the invention can accurately locate the positions of different targets and perceive the current-frame behavior of each target in time.
Occlusion of the target makes behavior perception considerably more difficult; as shown in fig. 9, the target behavior space-time perception positioning method provided by the invention can still judge and locate the behavior of the target in a timely and accurate manner even when the target is small and is occluded by the bait bomb tail smoke.
FIG. 10 shows the spatio-temporal perception positioning results of the method of the present invention for the target escape behavior, where (a), (b), (c) and (d) come from samples of different video sequences. FIG. 11 shows the spatio-temporal perception positioning results of the method for the target take-off behavior, and FIG. 12 those for the target landing behavior. The method provided by the invention obtains good spatio-temporal perception positioning of target behaviors under various environments, target scales and target states, and guarantees the accuracy of behavior frame positioning and behavior classification while accurately locating the target position.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for establishing a target behavior perception space-time positioning model is characterized by comprising the following steps: establishing a deep neural network, and training the deep neural network by using a data set to obtain a target behavior perception space-time positioning model;
the deep neural network comprises a space-time behavior perception sub-network and a space-time target positioning sub-network, and the space-time behavior perception sub-network comprises:
a mask prediction module for performing target area feature perception on the support set images and the query set images, and performing target mask prediction on the query set images by using the perception result to obtain a target area mask feature map;
an image information perception module for performing feature extraction on the query set images to obtain an image information feature map;
an image-level feature fusion layer for performing channel-by-channel feature fusion on the target area mask feature map and the image information feature map, and superposing the fused features with the image information feature map to obtain a target information enhanced feature map;
a motion information perception module for perceiving motion information of the dense frame sequence images to obtain a motion information feature map;
a global feature fusion layer for performing feature compression on the motion information feature map and feature cascading with the target information enhanced feature map to obtain a space-time behavior perception feature map;
and a behavior classification module for performing region-of-interest extraction and behavior class probability prediction on the space-time behavior perception feature map to obtain a behavior class prediction result of the target;
the spatial domain target positioning sub-network is used for performing target identification, detection and positioning on the query set images to obtain a target positioning frame prediction result;
wherein the support set images are T frame images randomly sampled from a video sequence labeled with mask labels; the query set images are T frame images randomly sampled from the input video sequence and labeled with target positioning frame labels; the support set images and the query set images are different from each other, and T is a positive integer; the dense frame sequence images are αT frame images randomly sampled from the input video sequence, with α > 1.
2. The method of building a target behavior-aware spatiotemporal localization model according to claim 1, wherein the mask prediction module comprises:
a first feature extraction module for respectively performing feature extraction on the support set images and the query set images to obtain a support set image feature map and a query set image feature map;
a first background suppression module for suppressing the background in the support set image features by using the mask labels of the support set images to obtain a first target enhanced feature map;
a second background suppression module for suppressing the background in the query set image features by using the target positioning frame labels of the query set images to obtain a second target enhanced feature map;
a correlation measurement module for measuring the similarity of the first target enhanced feature map and the second target enhanced feature map to obtain a super-correlation feature matrix;
a feature compression module for performing depth-dimension compression on the super-correlation feature matrix to obtain a feature compression matrix;
a feature fusion module for performing feature fusion on the features of each layer of the feature compression matrix by using a residual structure to obtain a feature fusion matrix;
and a down-sampling layer for converting the feature fusion matrix into a two-dimensional feature map, namely the target area mask feature map.
3. The method for building a model of object behavior aware spatiotemporal localization as claimed in claim 2, wherein in the model training process, the mask prediction module further comprises:
a context decoder for restoring the target area mask feature map to the same size as the original input image to obtain a query set target mask prediction map;
Training the deep neural network by using a data set, wherein the training comprises a first-stage training and a second-stage training;
the first stage training includes:
constructing a support set image by using the video sequence labeled with the mask label, constructing a query set image by using the video sequence labeled with the mask label and the target positioning frame label, and constructing a first data set by using the constructed support set image and the constructed query set image;
performing supervision training on the mask prediction module by using the first data set to obtain a pre-trained mask prediction module;
the second stage training comprises:
loading the pre-trained mask prediction module parameters in the deep neural network, constructing a support set image by using a video sequence labeled with a mask label, constructing a query set image and a dense frame sequence image by using a video sequence labeled with a target behavior label and a target positioning frame label, constructing a second data set by using the constructed support set image, the query set image and the dense frame sequence image, and training the deep neural network model by using the second data set to obtain a target behavior perception space-time positioning model.
4. The method of claim 3, wherein the training loss function is:
$$\mathcal{L}_{total} = \eta_1\,\mathcal{L}_{cls} + \eta_2\,\mathcal{L}_{loc} + \varepsilon\,\mathcal{L}_{mask}$$

wherein L_total denotes the total loss; L_cls denotes the behavior classification loss calculated from the spatio-temporal feature information; L_loc denotes the behavior frame positioning loss calculated from the spatio-temporal feature information; L_mask denotes the prediction loss of the mask prediction; η1, η2 and ε are preset weight parameters; in the first training stage, η1 = η2 = 0 and ε is greater than zero; in the second training stage, η1, η2 and ε are all greater than zero.
5. The method for establishing the target behavior perception space-time positioning model according to claim 4, wherein for the ith frame of input image, the mask prediction module outputs a target mask prediction image, and the corresponding mask prediction loss is:

$$\mathcal{L}_{mask}^{i} = -\sum_{x}\left[y_i(x)\log p_i(x) + \left(1 - y_i(x)\right)\log\left(1 - p_i(x)\right)\right]$$

wherein x denotes a pixel point of the target mask prediction image; y_i(x) denotes the label value corresponding to pixel x of the ith frame input image, with positive samples valued 1 and negative samples valued 0; p_i(x) denotes the probability that pixel x of the ith frame input image is predicted as a positive sample; in the first training stage, the mask prediction loss is obtained by measuring the error between the prediction result and the mask labels of the query data set; in the second training stage, the mask prediction loss is obtained by measuring the error between the prediction result and the target positioning frame labels of the query data set.
6. The method of claim 5, wherein in the first training stage, η1 = η2 = 0 and ε = 1; in the second training stage, η1 = η2 = 0.5 and ε = 0.1.
7. The method for establishing the target behavior perception space-time positioning model according to any one of claims 2 to 6, wherein the image-level feature fusion layer performing channel-by-channel feature fusion on the target area mask feature map and the image information feature map, and superposing the fused features with the image information feature map to obtain the target information enhanced feature map, comprises:
converting the target area mask feature map by a bilinear interpolation operation to the same size as the channel features of the image information feature map, performing a Hadamard product operation with the corresponding channel features of the image information feature map, and adding the result to the image information feature map to obtain the target information enhanced feature map.
8. The method for establishing the target behavior perception space-time positioning model according to any one of claims 1 to 6, wherein the global feature fusion layer performing feature compression on the motion information feature map and feature cascading with the target information enhanced feature map to obtain the space-time behavior perception feature map comprises:
compressing the motion information feature map by a convolution operation to the same dimensionality as the target information enhanced feature map, and then performing feature cascade splicing with the target information enhanced feature map to obtain the space-time behavior perception feature map.
9. A space-time perception positioning method for target behaviors is characterized by comprising the following steps:
randomly sampling T frame images from a video sequence marked with a mask label to serve as support set images;
randomly sampling T frame images from a video sequence to be predicted to serve as sparse frame sequence images, and randomly sampling alpha T frame images from the video sequence to be predicted to serve as dense frame sequence images; alpha is more than 1;
inputting the sparse frame sequence images into a spatial target positioning sub-network in a target behavior space-time perception positioning model established by the method for establishing the target behavior space-time perception positioning model according to any one of claims 1 to 8 so as to predict a target positioning frame of a frame image target, and labeling the sparse frame sequence images by using the target positioning frame prediction result to obtain a query set image;
and inputting the support set image, the query set image and the dense frame sequence image into the target behavior perception space-time positioning model, predicting the behavior category and the target positioning frame of a frame image target according to the output of the target behavior perception space-time positioning model, and completing target behavior space-time positioning.
10. A computer-readable storage medium comprising a stored computer program; when being executed by a processor, the computer program controls a device on which the computer-readable storage medium is located to execute the method for establishing the target behavior space-time perception positioning model according to any one of claims 1-8, and/or the target behavior space-time perception positioning method according to claim 9.
CN202210313781.4A 2022-03-28 2022-03-28 Method for establishing target behavior perception space-time positioning model and application Active CN114782859B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210313781.4A CN114782859B (en) 2022-03-28 2022-03-28 Method for establishing target behavior perception space-time positioning model and application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210313781.4A CN114782859B (en) 2022-03-28 2022-03-28 Method for establishing target behavior perception space-time positioning model and application

Publications (2)

Publication Number Publication Date
CN114782859A true CN114782859A (en) 2022-07-22
CN114782859B CN114782859B (en) 2024-07-19

Family

ID=82425713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210313781.4A Active CN114782859B (en) 2022-03-28 2022-03-28 Method for establishing target behavior perception space-time positioning model and application

Country Status (1)

Country Link
CN (1) CN114782859B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019238976A1 (en) * 2018-06-15 2019-12-19 Université de Liège Image classification using neural networks
JP6830707B1 (en) * 2020-01-23 2021-02-17 同▲済▼大学 Person re-identification method that combines random batch mask and multi-scale expression learning
WO2022000426A1 (en) * 2020-06-30 2022-01-06 中国科学院自动化研究所 Method and system for segmenting moving target on basis of twin deep neural network

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115880647A (en) * 2023-02-22 2023-03-31 山东山大鸥玛软件股份有限公司 Method, system, equipment and storage medium for analyzing abnormal behaviors of examinee examination room
CN117274788A (en) * 2023-10-07 2023-12-22 南开大学 Sonar image target positioning method, system, electronic equipment and storage medium
CN117274788B (en) * 2023-10-07 2024-04-30 南开大学 Sonar image target positioning method, system, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114782859B (en) 2024-07-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant