CN114782859A - Method for establishing space-time perception positioning model of target behaviors and application - Google Patents

Method for establishing space-time perception positioning model of target behaviors and application

Info

Publication number
CN114782859A
Authority
CN
China
Prior art keywords
target
image
behavior
feature
mask
Prior art date
Legal status
Granted
Application number
CN202210313781.4A
Other languages
Chinese (zh)
Other versions
CN114782859B (en)
Inventor
左峥嵘 (Zuo Zhengrong)
沈凡姝 (Shen Fanshu)
王岳环 (Wang Yuehuan)
Current Assignee
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN202210313781.4A
Publication of CN114782859A
Application granted
Publication of CN114782859B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for establishing a target behavior perception space-time positioning model and an application thereof, belonging to the technical field of image processing. The method comprises establishing a target behavior perception space-time positioning model based on a deep network, the model comprising a space-time behavior perception sub-network and a spatial target positioning sub-network. The space-time behavior perception sub-network comprises: a mask prediction module, which obtains a target area mask feature map from the support set images and the query set images; an image information perception module, which obtains a query image information feature map; an image-level feature fusion layer, into which the target area mask feature map and the image information feature map are input to obtain a target-information-enhanced feature map; a motion information perception module, which obtains a motion information feature map from the dense frame sequence images; a global feature fusion layer, into which the target-information-enhanced feature map and the motion information feature map are input to obtain a space-time behavior perception feature map; and a behavior classification module, into which the space-time behavior perception feature map is input to obtain the classification result. A target positioning result is obtained through the spatial target positioning sub-network. The invention effectively focuses on and utilizes the target area information and achieves higher positioning accuracy.

Description

Method for establishing space-time perception positioning model of target behaviors and application
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a method for establishing a space-time perception positioning model of target behaviors and application of the space-time perception positioning model.
Background
Video sequence space-time perception has been a popular problem in the field of computer vision in recent years. Its aim is to enable a model, through learning from video, to locate and detect targets in the video and to identify and mark each behavior from its starting frame to its ending frame. Because training is usually performed on large amounts of data and a video segment contains many frames, each training iteration consumes considerable computer memory, and training often takes a long time. In addition, because target scales vary greatly, missed detections and false detections are frequent, and when the target's contour shape changes little, behavior classification is also difficult.
Most existing video sequence space-time perception methods learn with deep networks; compared with traditional methods based on hand-crafted features, they can take more features into account and have stronger adaptability and feature expression capability. The mainstream deep-network-based methods include the following. Methods based on two-stream feature extraction rely on the color channels to acquire spatial information and on computed optical flow to acquire temporal information, so the computation contains a large amount of color feature information, the optical flow must be computed separately and cannot be fused end to end with the color channels, the computation is heavy, and resource consumption is high. Three-dimensional space-time perception methods add a time dimension to conventional 2D convolution; they generally use the same input frame rate and feature maps for target detection and behavior localization, so the extracted feature maps often lack task-specific representation capability when facing multiple tasks, and a large amount of training input is needed to obtain good convergence.
In addition, most existing space-time behavior perception models are trained on behavior data sets in which human bodies are the main subjects. A rigid body does not undergo the complex deformation of its relative structure that a human does (for example, the self-deformation produced by sitting down or raising a hand); it only exhibits observational deformation caused by differences in viewing angle and distance. The environment a person is in strongly influences the classification of the person's behavior (for example, falling down versus lying down), whereas for an aircraft the model needs to pay more attention to changes in the target's attitude rather than relative changes in its structure. Moreover, state switches such as target turning or release may occur against the same sky background, so the classification of some behaviors depends only weakly on the background, and more attention to the target's attitude changes is needed to guide behavior classification.
In summary, because existing methods perform only coarse perception and feature extraction on the whole image, the pixel information of the region where the target is located is not fully utilized, and they are not suitable for perceiving the behavior of aerial targets. These methods also usually rely on large labeled data sets, have high computational complexity, and consume a large amount of memory during computation. Existing video sequence space-time perception and localization methods therefore need further improvement.
Disclosure of Invention
In view of the defects and improvement needs of the prior art, the invention provides a method for establishing a target behavior space-time perception positioning model and an application thereof, aiming to solve the technical problem that existing behavior-aware space-time localization methods do not fully utilize the pixel information of the region where the target is located and are therefore unsuitable for behavior-aware space-time localization of rigid targets such as aircraft.
To achieve the above object, according to one aspect of the present invention, there is provided a method for establishing a target behavior-aware spatiotemporal localization model, including: establishing a deep neural network, and training the deep neural network by using a data set to obtain a target behavior perception space-time positioning model;
the deep neural network comprises a space-time behavior perception sub-network and a spatial target positioning sub-network, wherein the space-time behavior perception sub-network comprises:

a mask prediction module, which performs target area feature perception on the support set images and the query set images and uses the perception result to predict a target mask for the query set images, obtaining a target area mask feature map;

an image information perception module, which performs feature extraction on the query set images to obtain an image information feature map;

an image-level feature fusion layer, which performs channel-by-channel feature fusion of the target area mask feature map and the image information feature map and superimposes the fused features on the image information feature map to obtain a target-information-enhanced feature map;

a motion information perception module, which perceives the motion information in the dense frame sequence images to obtain a motion information feature map;

a global feature fusion layer, which performs feature compression on the motion information feature map and cascades it with the target-information-enhanced feature map to obtain a space-time behavior perception feature map;

and a behavior classification module, which extracts regions of interest from the space-time behavior perception feature map and predicts behavior class probabilities to obtain the behavior class prediction result for the target.

The spatial target positioning sub-network performs target identification, detection and localization on the query set images to obtain the target positioning frame prediction result.

The support set images are T frames randomly sampled from a video sequence labeled with mask labels; the query set images are T frames randomly sampled from the input video sequence and labeled with target positioning frame labels; the support set images and the query set images are different from each other, and T is a positive integer. The dense frame sequence images are αT frames randomly sampled from the input video sequence, with α > 1.
The established target behavior perception space-time positioning model performs mask prediction while extracting features from the input video sequence frames, obtaining a target area mask feature map. It fuses this mask feature map channel by channel with the image information feature map to obtain a target-information-enhanced feature map, fuses the enhanced feature map with the extracted motion information feature map into a space-time behavior perception feature map, and finally extracts regions of interest from the space-time behavior perception feature map to obtain the behavior classification result for the target. By predicting the target area mask feature map and fusing it channel by channel with the image information feature map, the model's attention to the pixel information of the target area is enhanced and the influence of background interference is reduced, so that the pixel information of the region where the target is located is fully utilized, the accuracy of target area feature extraction is improved, and the detection and classification accuracy of behavior perception is ultimately improved; the model can therefore effectively and accurately realize behavior-aware space-time localization of rigid targets such as aircraft.
When performing behavior-aware space-time localization, the established target behavior perception space-time positioning model only needs to extract image features, predict the target area mask feature map, perceive motion information, and perform target detection; it does not rely on color information. This makes the model pay more attention to the contour and structure information of the target in the video sequence to be detected, makes it applicable to video sequences displayed in grayscale, such as infrared video sequences, and gives it stronger generalization capability. Meanwhile, abandoning the color channels relaxes the memory requirements for training and running the model and greatly reduces the memory occupied at run time.
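To make the data flow above concrete, the following is a minimal sketch of how the two sub-networks might be wired together, assuming PyTorch-style modules; the class, argument names and tensor interfaces are hypothetical illustrations, not the patent's actual implementation.

```python
import torch
import torch.nn as nn

class SpatioTemporalBehaviorModel(nn.Module):
    """Hypothetical skeleton of the two sub-networks described above."""
    def __init__(self, mask_predictor, image_encoder, motion_encoder,
                 image_fusion, global_fusion, behavior_head, localizer):
        super().__init__()
        self.mask_predictor = mask_predictor    # mask prediction module
        self.image_encoder = image_encoder      # image information perception module
        self.motion_encoder = motion_encoder    # motion information perception module
        self.image_fusion = image_fusion        # image-level feature fusion layer
        self.global_fusion = global_fusion      # global feature fusion layer
        self.behavior_head = behavior_head      # behavior classification module
        self.localizer = localizer              # spatial target positioning sub-network

    def forward(self, support_imgs, support_masks, query_imgs, query_boxes, dense_imgs):
        # 1) target area mask feature map from support + query images
        f_mask = self.mask_predictor(support_imgs, support_masks, query_imgs, query_boxes)
        # 2) image information feature map from the (sparse) query frames
        f_img = self.image_encoder(query_imgs)
        # 3) channel-by-channel fusion -> target-information-enhanced feature map
        f_enh = self.image_fusion(f_mask, f_img)
        # 4) motion information feature map from the dense frame sequence
        f_mot = self.motion_encoder(dense_imgs)
        # 5) compress + concatenate -> space-time behavior perception feature map
        f_st = self.global_fusion(f_mot, f_enh)
        # 6) behavior class prediction and target positioning frame prediction
        behavior_logits = self.behavior_head(f_st, query_boxes)
        boxes = self.localizer(query_imgs)
        return behavior_logits, boxes
```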
Further, the mask prediction module comprises:

a first feature extraction module, which performs feature extraction on the support set images and the query set images respectively to obtain a support set image feature map and a query set image feature map;

a first background suppression module, which uses the mask labels of the support set images to suppress the background in the support set image features, obtaining a first target-enhanced feature map;

a second background suppression module, which uses the target positioning frame labels of the query set images to suppress the background in the query set image features, obtaining a second target-enhanced feature map;

a correlation measurement module, which measures the similarity of the first and second target-enhanced feature maps to obtain a hyper-correlation feature matrix;

a feature compression module, which compresses the hyper-correlation feature matrix along the depth dimension to obtain a feature compression matrix;

a feature fusion module, which uses the residual idea to perform feature fusion on each level of features in the feature compression matrix, obtaining a feature fusion matrix;

and a down-sampling layer, which converts the feature fusion matrix into a two-dimensional feature map, obtaining the target area mask feature map.
In the mask prediction module of the invention, the background suppression modules use the mask labels or target positioning frame labels to suppress the background, which effectively reduces the dependence on the background and realizes target enhancement. The correlation measurement module measures the similarity between the target-enhanced query set images and support set images, so that the target area mask feature map can be predicted accurately with only a small number of labeled support set images. The feature compression module keeps the feature dimensions of the query set images unchanged while compressing the feature dimensions of the support set images, reducing the dimensionality of the hyper-correlation feature matrix without losing feature information; this greatly reduces the memory consumed by an otherwise oversized matrix in subsequent computation. The feature fusion module performs a feature fusion operation on each level of the compressed hyper-correlation matrix using the residual idea, fusing features of different receptive fields so that the query set image features and the support set image features are fused better; as the network deepens, the extracted features gradually tend toward semantic information, so both high-level semantic information and low-level detail information are retained. In general, the target behavior perception space-time positioning model established by the invention can accurately predict the target area mask feature map, reduces the dependence on background, and occupies less memory.
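As an illustration of the background suppression idea used inside the mask prediction module (resize the mask or box label to each feature level, then take the Hadamard product), here is a small sketch; the function name and tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def suppress_background(feature_pyramid, label_map):
    """feature_pyramid: list of tensors (B, C_l, H_l, W_l);
    label_map: (B, 1, H, W) binary mask label or rasterized box label.
    Returns the target-enhanced features at every level."""
    enhanced = []
    for feat in feature_pyramid:
        # bilinear interpolation to the size of this feature level
        lbl = F.interpolate(label_map.float(), size=feat.shape[-2:],
                            mode="bilinear", align_corners=False)
        # Hadamard product keeps target-region responses and suppresses background
        enhanced.append(feat * lbl)
    return enhanced
```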
Further, during model training, the mask prediction module further comprises:

a context decoder, which restores the target area mask feature map to the same size as the original input image to obtain the query set target mask prediction map.
Training the deep neural network by using a data set, wherein the training comprises a first stage training and a second stage training;
the first stage of training comprises:
constructing a support set image by using the video sequence labeled with the mask label, constructing a query set image by using the video sequence labeled with the mask label and the target positioning frame label, and constructing a first data set by using the constructed support set image and the query set image;
performing supervised training on the mask prediction module by using a first data set to obtain a pre-trained mask prediction module;
the second stage of training comprises:
loading the pre-trained mask prediction module parameters into the deep neural network; constructing support set images from the video sequences labeled with mask labels; constructing query set images and dense frame sequence images from the video sequences labeled with target behavior labels and target positioning frame labels; constructing a second data set from the constructed support set images, query set images and dense frame sequence images; and training the deep neural network model with the second data set to obtain the target behavior perception space-time positioning model.
The invention adopts a two-stage training mode to train the established model: in the first stage of training, a mask prediction module is supervised and trained by using a support set image constructed by a video sequence with a marked mask label and a query set image constructed by a video sequence with a marked mask label and a target positioning frame label, so that the mask prediction module can obtain good target area pixel classification capability and accurately predict a target area mask feature map; in the second stage of training, the mask prediction module trained in the first stage is used for predicting a mask feature map of a target area, and the whole model is trained end to end, so that the parameters of the whole model are adjusted; by combining the two stages of training, the model training process does not depend on a large number of data sets labeled with mask labels and target positioning box labels at the same time.
Further, the training loss function is

$$\mathcal{L} = \eta_1 \mathcal{L}_{cls} + \eta_2 \mathcal{L}_{loc} + \varepsilon \mathcal{L}_{mask},$$

where $\mathcal{L}$ represents the total loss; $\mathcal{L}_{cls}$ represents the behavior classification loss calculated from the space-time feature information; $\mathcal{L}_{loc}$ represents the behavior frame positioning loss calculated from the space-time feature information; $\mathcal{L}_{mask}$ represents the prediction loss of the mask prediction; and $\eta_1$, $\eta_2$, $\varepsilon$ are preset weight parameters. In the first stage of training, $\eta_1 = \eta_2 = 0$ and $\varepsilon$ is greater than zero; in the second stage of training, $\eta_1$, $\eta_2$ and $\varepsilon$ are all greater than zero.
In the model training process of the first stage, the training loss function only considers the prediction loss of the mask prediction, and can effectively ensure that the mask prediction module can accurately predict the mask feature map of the target area after the training of the first stage; in the second stage of training, the training loss function considers the behavior classification loss, the behavior frame positioning loss and the prediction loss of the mask prediction at the same time, so that the model can accurately predict the behavior classification result and the target behavior frame positioning result, and the mask prediction module is subjected to fine adjustment, thereby further ensuring the accuracy of behavior perception space-time positioning by using the model after the model training is finished.
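A minimal sketch of the staged loss weighting, assuming the three component losses have already been computed; the concrete weight values follow the settings given further below (η1 = η2 = 0, ε = 1 in the first stage; η1 = η2 = 0.5, ε = 0.1 in the second stage), and everything else is illustrative.

```python
def total_loss(loss_cls, loss_loc, loss_mask, stage):
    """Weighted sum of behavior classification, behavior frame positioning,
    and mask prediction losses, with stage-dependent weights."""
    if stage == 1:          # first stage: only the mask prediction branch is optimized
        eta1, eta2, eps = 0.0, 0.0, 1.0
    else:                   # second stage: end-to-end training of the whole model
        eta1, eta2, eps = 0.5, 0.5, 0.1
    return eta1 * loss_cls + eta2 * loss_loc + eps * loss_mask
```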
Further, for the $i$-th frame of input image, the mask prediction loss corresponding to the target mask prediction map output by the mask prediction module is

$$\mathcal{L}_{mask}^{(i)} = -\frac{1}{N_i}\sum_{x}\Big[y_i(x)\log p_i(x) + \big(1 - y_i(x)\big)\log\big(1 - p_i(x)\big)\Big],$$

where $N_i$ represents the total number of pixels in the target prediction mask image, $x$ represents a pixel point in the target prediction mask image, $y_i(x)$ represents the label value corresponding to pixel $x$ of the $i$-th frame of input image (1 for a positive sample and 0 for a negative sample), and $p_i(x)$ represents the probability that pixel $x$ of the $i$-th frame of input image is predicted to be a positive sample. In the first stage of training, the mask prediction loss is obtained by measuring the error between the prediction result and the mask labels of the query data set; in the second stage of training, the mask prediction loss is calculated by measuring the error between the prediction result and the target positioning frame labels of the query data set.
In the first stage of training, the mask label of the query set image is directly used as an optimization target to calculate the prediction loss of the mask, so that supervision training of a mask prediction module is realized, and the mask prediction module has better mask prediction capability; in the second stage of training, the target positioning frame label of the query set image is used as an optimization target to calculate the prediction loss of the mask, so that the weak supervision training of the mask prediction module is realized, the dependence on a large number of data sets labeled with the mask label and the target positioning frame label at the same time is avoided, and the prediction effect of the mask prediction module is further improved.
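The per-pixel loss above is a standard binary cross-entropy. The sketch below also shows one possible reading of the second-stage weak supervision, where the target positioning frame is rasterized into a pseudo-mask that serves as the optimization target; the helper names are hypothetical.

```python
import torch
import torch.nn.functional as F

def mask_prediction_loss(pred_prob, target):
    """pred_prob: (B, 1, H, W) probability that each pixel is a positive sample;
    target: (B, 1, H, W) with 1 for positive pixels and 0 for negative pixels."""
    return F.binary_cross_entropy(pred_prob, target.float())

def boxes_to_pseudo_mask(boxes, height, width):
    """Rasterize target positioning frames (x1, y1, x2, y2) into a binary map,
    used as the optimization target in the weakly supervised second stage."""
    mask = torch.zeros(len(boxes), 1, height, width)
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        mask[i, 0, int(y1):int(y2), int(x1):int(x2)] = 1.0
    return mask
```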
Further, in the first training stage, $\eta_1 = \eta_2 = 0$ and $\varepsilon = 1$;

in the second training stage, $\eta_1 = \eta_2 = 0.5$ and $\varepsilon = 0.1$.
Further, the image-level feature fusion layer performs channel-by-channel feature fusion of the target area mask feature map and the image information feature map and superimposes the fused features on the image information feature map to obtain the target-information-enhanced feature map, which comprises:

converting the target area mask feature map, through a bilinear interpolation operation, to the same size as the corresponding channel features of the image information feature map; performing a Hadamard product operation with the corresponding channel features of the image information feature map; and adding the result to the image information feature map to obtain the target-information-enhanced feature map.
In the invention, after the image-level feature fusion layer unifies dimensions of feature images to be fused, pixel-level feature fusion is realized through Hadamard product operation, and the feature information of an image domain target can be enhanced while the feature difference among original channels of the feature images is better kept.
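A short sketch of the image-level fusion just described: interpolate the mask feature map to the size of the image information feature map, take the channel-wise Hadamard product, and add the result back to the image information feature map. Tensor shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def image_level_fusion(f_mask, f_img):
    """f_mask: (B, 1, Hm, Wm) target area mask feature map;
    f_img: (B, C, H, W) image information feature map.
    Returns the target-information-enhanced feature map (B, C, H, W)."""
    m = F.interpolate(f_mask, size=f_img.shape[-2:], mode="bilinear",
                      align_corners=False)
    # Hadamard product per channel (broadcast over C), then residual addition
    return f_img * m + f_img
```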
Further, the global feature fusion layer performs feature compression on the motion information feature map and cascades it with the target-information-enhanced feature map to obtain the space-time behavior perception feature map, which comprises:

compressing the motion information feature map, through a convolution operation, to the same dimensions as the target-information-enhanced feature map, and then performing cascade splicing with the target-information-enhanced feature map to obtain the space-time behavior perception feature map.
In the invention, the global feature fusion layer performs convolution feature extraction on the motion information feature map to realize further compression of features, and realizes feature fusion of the compressed motion information feature map and the feature map enhanced by the target information in a feature cascade splicing mode, so that the feature map obtained by fusion retains image information and motion information.
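A sketch of the global fusion step, assuming 5D video features of shape (B, C, T, H, W); the 1 × 1 spatial kernel with a time dimension of 5 follows the embodiment described later, while the channel widths and the assumption that the temporal lengths already match are illustrative.

```python
import torch
import torch.nn as nn

class GlobalFeatureFusion(nn.Module):
    """Compress the motion feature map with a convolution, then concatenate it
    with the target-information-enhanced feature map along the channel axis."""
    def __init__(self, motion_channels, enhanced_channels):
        super().__init__()
        # kernel of size 1x1 in space with a time dimension of 5, padded to keep T
        self.compress = nn.Conv3d(motion_channels, enhanced_channels,
                                  kernel_size=(5, 1, 1), padding=(2, 0, 0))

    def forward(self, f_motion, f_enhanced):
        f_motion = self.compress(f_motion)
        # assumes the temporal/spatial sizes have already been reduced to match f_enhanced
        return torch.cat([f_motion, f_enhanced], dim=1)
```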
According to another aspect of the present invention, there is provided a method for space-time perception and localization of target behaviors, comprising:
randomly sampling T frame images from a video sequence marked with a mask label to serve as support set images;
randomly sampling T frame images from a video sequence to be predicted to serve as sparse frame sequence images, and randomly sampling alpha T frame images from the video sequence to be predicted to serve as dense frame sequence images; alpha is more than 1;
inputting the sparse frame sequence images into the spatial target positioning sub-network of the target behavior perception space-time positioning model established by the above method for establishing a target behavior space-time perception positioning model, so as to predict target positioning frames for the targets in the frame images, and labeling the sparse frame sequence images with the target positioning frame prediction results to obtain the query set images;
and inputting the support set image, the query set image and the dense frame sequence image into a target behavior perception space-time positioning model, predicting the behavior category and the target positioning frame of a frame image target according to the output of the target behavior perception space-time positioning model, and completing the target behavior space-time positioning.
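The inference procedure just described can be sketched as follows; the sampling helper and the model interface are hypothetical and mirror the skeleton shown earlier.

```python
import random

def sample_frames(num_frames, count):
    """Randomly sample `count` frame indices from a sequence of `num_frames`."""
    return sorted(random.sample(range(num_frames), count))

def spatiotemporal_inference(model, support_seq, support_masks, video, T, alpha=4):
    sparse_idx = sample_frames(len(video), T)           # sparse frame sequence
    dense_idx = sample_frames(len(video), alpha * T)    # dense frame sequence
    sparse = [video[i] for i in sparse_idx]
    dense = [video[i] for i in dense_idx]
    # 1) spatial target positioning sub-network predicts boxes for the sparse frames
    boxes = model.localizer(sparse)
    # 2) the box predictions label the sparse frames, forming the query set
    # 3) the full model predicts behavior classes and refined positioning frames
    behavior, boxes = model(support_seq, support_masks, sparse, boxes, dense)
    return behavior, boxes
```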
According to yet another aspect of the present invention, there is provided a computer readable storage medium comprising a stored computer program; when the computer program is executed by a processor, the device on which the computer readable storage medium resides is controlled to execute the method for establishing a target behavior space-time perception positioning model provided by the invention and/or the method for space-time perception and localization of target behaviors provided by the invention.
Generally, by the above technical solution conceived by the present invention, the following beneficial effects can be obtained:
(1) The invention introduces into the behavior-aware space-time positioning model a mask prediction module that predicts the target area mask feature map, and fuses the predicted target area mask feature map channel by channel with the image information feature map. This enhances the model's attention to the pixel information of the target area and reduces the influence of background interference, so that the pixel information of the region where the target is located is fully utilized, the accuracy of target area feature extraction is improved, and the detection and classification accuracy of behavior perception is ultimately improved; the model can therefore effectively and accurately realize behavior-aware space-time localization of rigid targets such as aircraft.
(2) When performing behavior-aware space-time localization, the established target behavior perception space-time positioning model only needs to extract image features, predict the target area mask feature map, perceive motion information, and perform target detection; it does not rely on color information. This makes the model pay more attention to the contour and structure information of the target in the video sequence to be detected, makes it applicable to video sequences displayed in grayscale, such as infrared video sequences, and gives it stronger generalization capability. Meanwhile, abandoning the color channels relaxes the memory requirements for training and running the model and greatly reduces the memory occupied at run time.
(3) The invention performs weakly supervised learning on the input video sequence. In the mask prediction module, multi-scale global feature information of the image is obtained by the first feature extraction module; the correlation measurement module allows rapid convergence even when the sample size is insufficient and reduces the number of parameters that need training; the feature compression module compresses the extracted global feature information, reducing the memory consumed by model computation at run time; and the down-sampling layer encodes and fuses the obtained feature information to produce the final mask prediction feature map. This realizes a weakly supervised mask prediction function, reduces the workload of producing large numbers of data labels, and relieves the dependence on strongly supervised ground-truth labels.
Drawings
FIG. 1 is a schematic structural diagram of a spatiotemporal perceptual positioning model of a target behavior according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a mask prediction module according to an embodiment of the present invention;
FIG. 3 is a block diagram of a context decoder according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an image information sensing module according to an embodiment of the present invention;
fig. 5 is a tag diagram of an input target positioning box and a mask prediction diagram of a mask prediction module according to an embodiment of the present invention; wherein, each line respectively corresponds to the image original image, the mask prediction image and the target positioning block diagram; the mask prediction results of two randomly selected frames of images in different sequences are respectively shown in (a), (b), (c) and (d);
FIG. 6 is a gradient class activation thermodynamic diagram for different behaviors in accordance with the method of the present invention; the same line is from samples of different frames of the same video sequence, original images of the video frames are sampled and displayed on the left side of a dotted line, the effect after gradient activation thermodynamic diagrams are superposed is on the right side of the dotted line, and the higher the brightness is, the stronger the perception to the area is; (a) activating a thermodynamic diagram for a gradient class corresponding to the decoy release behavior sequence, (b) activating a thermodynamic diagram for a gradient class corresponding to the target overturn behavior sequence;
FIG. 7 is a graphical representation of spatiotemporal behavior perception results of bait release for a single target multi-behavior state according to embodiments of the present invention and other methods; target behavior information contained in the frame sequence images of the same row sequentially comprises target flight, target turning, target release of bait bombs, bait bomb release process and bait bomb release end; wherein, (a) is the detection result of the method of the invention, (b) is the detection result of IM-BPSTM method, (c) is the information description diagram of the label of the detection result;
FIG. 8 is a graph showing the spatiotemporal perception results of bait release behavior for a multi-objective multi-state situation in accordance with an embodiment of the present invention; wherein the different line images are respectively from samples of different video sequences;
FIG. 9 is a graph showing the temporal-spatial perception results of the bait cartridge release behavior when the target is occluded according to the present invention;
FIG. 10 is a graph showing spatiotemporal perception results for target escape behavior according to an embodiment of the present invention; wherein (a), (b), (c), (d) are from samples of different video sequences, respectively;
FIG. 11 is a diagram showing the time-space sensing results of the embodiment of the present invention for the target takeoff behavior;
FIG. 12 is a graph showing the spatiotemporal perception results for target landing behavior according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and do not limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
In the present application, the terms "first," "second," and the like (if any) in the description and the drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
In order to solve the technical problems that the existing behavior sensing space-time positioning method is not suitable for performing behavior sensing space-time positioning on rigid body targets such as aircrafts and the like because the pixel information of the region where the target is located is not fully utilized, the invention provides a method for establishing a target behavior space-time sensing positioning model and application thereof, and the overall thought is as follows: a mask prediction module for predicting a mask feature map of a target area is introduced into a model, the mask feature map of the target area obtained by prediction of the mask feature map prediction module and an image information feature map are subjected to channel-by-channel fusion, and then subsequent behavior classification is carried out, so that the attention of the model to pixel information of the target area can be enhanced, and the influence of background interference information is reduced, so that the pixel information of the area where the target is located is fully utilized, and meanwhile, color channels are abandoned, so that the memory condition of the model depending on training operation is reduced, and the memory occupation amount during operation is greatly reduced.
The following are examples.
Example 1:
a method for establishing a target behavior perception space-time positioning model comprises the following steps: and establishing a deep neural network, and training the deep neural network by using a data set to obtain a target behavior perception space-time positioning model.
As shown in fig. 1, in this embodiment the deep neural network comprises a space-time behavior perception sub-network and a spatial target positioning sub-network, and the space-time behavior perception sub-network comprises:

a mask prediction module, which performs target area feature perception on the support set images and the query set images and uses the perception result to predict a target mask for the query set images, obtaining the target area mask feature map;

an image information perception module, which performs feature extraction on the query set images to obtain the image information feature map;

an image-level feature fusion layer, which performs feature fusion of the target area mask feature map and the image information feature map and superimposes the fused features on the image information feature map to obtain the target-information-enhanced feature map;

a motion information perception module, which perceives the motion information in the dense frame sequence images to obtain the motion information feature map;

a global feature fusion layer, which performs feature compression on the motion information feature map and cascades it with the target-information-enhanced feature map to obtain the space-time behavior perception feature map;

and a behavior classification module, which extracts regions of interest from the space-time behavior perception feature map and predicts behavior class probabilities to obtain the behavior class prediction result for the target.

The spatial target positioning sub-network performs target identification, detection and localization on the query set images to obtain the target positioning frame prediction result.

The support set images are T frames randomly sampled from a video sequence labeled with mask labels; the query set images are T frames randomly sampled from the input video sequence and labeled with target positioning frame labels; the support set images and the query set images are different from each other, and T is a positive integer. The dense frame sequence images are αT frames randomly sampled from the input video sequence, with α > 1. Since the image information focuses more on the features of the two-dimensional plane while the motion information focuses more on the time dimension, in this embodiment the support set images and the query set images are taken as the input of the mask prediction module, the query set images are taken as the input of the image information perception module, and the dense frame sequence images are taken as the input of the motion information perception module. Optionally, in this embodiment, the support set images and the query set images are sampled every 8 frames and the dense frame sequence images are sampled every 2 frames, that is, α = 4.
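A small sketch of the sampling scheme used in this embodiment (every 8th frame for the support and query sets, every 2nd frame for the dense sequence, giving α = 4); the helper is illustrative.

```python
def sample_sparse_and_dense(num_frames, sparse_stride=8, dense_stride=2):
    """Return frame indices for the sparse (support/query) and dense sequences."""
    sparse_idx = list(range(0, num_frames, sparse_stride))
    dense_idx = list(range(0, num_frames, dense_stride))
    # with these strides len(dense_idx) is about 4x len(sparse_idx), i.e. alpha = 4
    return sparse_idx, dense_idx
```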
Referring to fig. 2, in this embodiment the mask prediction module comprises:

a first feature extraction module, which performs feature extraction on the support set images and the query set images respectively to obtain a support set image feature map and a query set image feature map;

a first background suppression module, which uses the mask labels of the support set images to suppress the background in the support set image features, obtaining a first target-enhanced feature map;

a second background suppression module, which uses the target positioning frame labels of the query set images to suppress the background in the query set image features, obtaining a second target-enhanced feature map;

a correlation measurement module, which measures the similarity of the first and second target-enhanced feature maps to obtain a hyper-correlation feature matrix;

a feature compression module, which compresses the hyper-correlation feature matrix along the depth dimension to obtain a feature compression matrix;

a feature fusion module, which uses the residual idea to perform feature fusion on each level of features in the feature compression matrix, obtaining a feature fusion matrix;

and a down-sampling layer, which converts the feature fusion matrix into a two-dimensional feature map, obtaining the target area mask feature map.
As an alternative implementation, in this embodiment ResNet-50 is used as the first feature extraction module. It should be noted that other common backbone feature extraction networks composed of multiple layers of network sub-modules may also be applied to the present invention.
In fig. 2, the first background suppression module uses the mask label $M_{s,i}$ of the support set image to suppress the background in the support set image features and obtain the first target-enhanced feature map, specifically:

the mask label of the support set image is converted by bilinear interpolation to the same size as the corresponding feature of each level of the support set image feature map and then multiplied element-wise (Hadamard product) with the support set image feature map to obtain the first target-enhanced feature map. The corresponding computational expression is

$$\hat{F}_{s,i}^{l} = \Psi_l\!\left(M_{s,i}\right) \odot F_{s,i}^{l}, \qquad l = 1,\dots,L,$$

where $\Psi_l(\cdot)$ denotes the bilinear interpolation function that converts the support mask label $M_{s,i}$ to the same size as the $l$-th level support feature $F_{s,i}^{l}$, $\odot$ denotes the Hadamard product, and $L$ denotes the total number of levels of the support set image feature map. The set of all-level features of the support set image feature map for the $i$-th frame is written $\{F_{s,i}^{l}\}_{l=1}^{L}$, and the set of all-level features of the first target-enhanced feature map for the $i$-th frame is written $\{\hat{F}_{s,i}^{l}\}_{l=1}^{L}$.
Similarly, the second background suppression module uses the target positioning frame label $B_{q,i}$ of the query set image to suppress the background in the query set image features and obtain the second target-enhanced feature map, specifically:

the target positioning frame label of the query set image is converted by bilinear interpolation to the same size as the corresponding feature of each level of the query set image feature map and then multiplied element-wise (Hadamard product) with the query set image feature map to obtain the second target-enhanced feature map. The corresponding computational expression is

$$\hat{F}_{q,i}^{l} = \Psi_l\!\left(B_{q,i}\right) \odot F_{q,i}^{l}, \qquad l = 1,\dots,L,$$

where $\Psi_l(\cdot)$ denotes the bilinear interpolation function that converts the target positioning frame label $B_{q,i}$ to the same size as the $l$-th level query feature $F_{q,i}^{l}$, $\odot$ denotes the Hadamard product, and $L$ denotes the total number of levels of the query set image feature map. The set of all-level features of the query set image feature map for the $i$-th frame is written $\{F_{q,i}^{l}\}_{l=1}^{L}$, and the set of all-level features of the second target-enhanced feature map for the $i$-th frame is written $\{\hat{F}_{q,i}^{l}\}_{l=1}^{L}$.
The background suppression modules thus use the mask labels or target positioning frame labels to suppress the background, effectively reducing the dependence on the background and realizing target enhancement.
In fig. 2, the correlation measurement module measures the similarity of the first target-enhanced feature map and the second target-enhanced feature map to obtain the hyper-correlation feature matrix. The corresponding computational expression is

$$C_{i}^{l} = \xi\!\left(\operatorname{sim}\!\left(\hat{F}_{q,i}^{l}, \hat{F}_{s,i}^{l}\right)\right),$$

where $\operatorname{sim}(\cdot,\cdot)$ denotes the similarity measure between the two target-enhanced feature maps, $\xi(\cdot)$ denotes an activation function used to suppress noise in the similarity measurement result (in this embodiment, the ReLU activation function), and the set of all-level features of the hyper-correlation feature matrix for the $i$-th frame is written $\{C_{i}^{l}\}_{l=1}^{L}$.

In this embodiment, the similarity measure is thus computed between the support set image feature map pixel-fused with the mask label (the first target-enhanced feature map) and the query set image feature map pixel-fused with the target frame label (the second target-enhanced feature map) to obtain the hyper-correlation feature matrix.
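The patent text does not spell out the similarity measure, so the sketch below assumes cosine similarity between query and support feature vectors at every pair of spatial positions, followed by ReLU to play the role of ξ(·) in suppressing noisy correlations.

```python
import torch
import torch.nn.functional as F

def hyper_correlation(f_query, f_support):
    """f_query: (B, C, Hq, Wq); f_support: (B, C, Hs, Ws).
    Returns a 4D correlation tensor (B, Hq, Wq, Hs, Ws)."""
    q = F.normalize(f_query.flatten(2), dim=1)    # (B, C, Hq*Wq), unit norm per position
    s = F.normalize(f_support.flatten(2), dim=1)  # (B, C, Hs*Ws)
    corr = torch.einsum("bcq,bcs->bqs", q, s)     # cosine similarity for every pair
    corr = F.relu(corr)                           # xi(): suppress noisy correlations
    b, _, hq, wq = f_query.shape
    hs, ws = f_support.shape[-2:]
    return corr.view(b, hq, wq, hs, ws)
```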
As shown in fig. 2, in this embodiment the feature compression module comprises 3 compression sub-modules, each of which contains 2 convolution layers with 5 × 5 kernels and 1 convolution layer with a 3 × 3 kernel, with an activation layer and a normalization layer after each convolution layer.

It compresses the hyper-correlation feature matrix along the depth dimension to reduce the memory occupied during computation while retaining the more effective feature information, obtaining the feature compression matrix, namely

$$\bar{C}_{i}^{l} = f_{k\_sqz_{l}}\!\left(C_{i}^{l}\right),$$

where $f_{k\_sqz_{l}}(\cdot)$ denotes a combination of 4D convolution, group normalization and ReLU activation, $k\_sqz_{l}$ denotes the convolution kernel size corresponding to the $l$-th level, and the set of all compressed levels of the hyper-correlation feature matrix for the $i$-th frame is written $\{\bar{C}_{i}^{l}\}_{l=1}^{L}$.

In this embodiment, the feature compression module compresses the cascaded support and query features, keeping the feature dimensions of the query image unchanged while compressing the feature dimensions of the support image, so that the hyper-correlation feature matrix is reduced in dimensionality without loss of feature information; this greatly reduces the memory consumed by an otherwise oversized matrix in subsequent computation. It should be noted that the number of compression sub-modules in the feature compression module can be adjusted according to the actual computational load and the required compression effect, but should generally be greater than 2.
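The module described above uses 4D convolutions over the correlation tensor; the sketch below is a simplified stand-in that folds the query positions into the batch dimension and compresses only the support-side spatial dimensions with 2D convolutions (two 5 × 5 layers and one 3 × 3 layer, each followed by group normalization and ReLU). Channel widths and strides are assumptions.

```python
import torch
import torch.nn as nn

class CompressionSubmodule(nn.Module):
    """Simplified compression block: two 5x5 convolutions and one 3x3 convolution,
    each followed by group normalization and ReLU, reducing the support-side
    resolution of the correlation tensor."""
    def __init__(self, out_ch=16):
        super().__init__()
        def block(ci, co, k, stride):
            return nn.Sequential(
                nn.Conv2d(ci, co, k, stride=stride, padding=k // 2),
                nn.GroupNorm(4, co),
                nn.ReLU(inplace=True),
            )
        self.body = nn.Sequential(
            block(1, out_ch, 5, 2),       # compress the support spatial dimensions
            block(out_ch, out_ch, 5, 1),
            block(out_ch, out_ch, 3, 1),
        )

    def forward(self, corr):
        # corr: (B, Hq, Wq, Hs, Ws) -> fold query positions into the batch dimension
        b, hq, wq, hs, ws = corr.shape
        x = self.body(corr.reshape(b * hq * wq, 1, hs, ws))
        return x.reshape(b, hq, wq, x.shape[1], x.shape[2], x.shape[3])
```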
In this embodiment, the feature fusion module has a structure similar to that of the feature compression module and comprises 3 fusion sub-modules, each of which contains 3 convolution layers with 3 × 3 kernels, with an activation layer and a normalization layer after each convolution layer.

Using the residual idea, it performs a feature fusion operation on each level of features in the feature compression matrix, with $k\_mix_{l}$ denoting the convolution kernel size corresponding to the $l$-th level; this realizes the fusion of the multi-level structural features, and the fusion output of the last level yields the feature fusion matrix.

In this embodiment, the feature fusion module adopts a step-by-step nesting manner and uses a multi-layer residual structure to fuse features of different receptive fields, so that the query image features and the support image features can be fused better; as the number of network layers increases, the extracted features gradually tend toward semantic information, and better multi-level feature information is obtained, such as deep abstract information together with low-level detail information.
In this embodiment, the down-sampling layer comprises a mean pooling layer, which converts the feature fusion matrix into a 2-dimensional feature map, i.e. the target area mask feature map; this map is then passed, together with the image information feature map, to the image-level feature fusion layer to obtain the feature map with strengthened target position information.
In order to train the mask prediction module effectively and to evaluate its prediction error, in this embodiment the behavior-aware space-time positioning model further comprises, in the training phase:

a context decoder, which restores the target area mask feature map to the same size as the original input image to obtain the query set target mask prediction map. Its structure is shown in fig. 3 and comprises several 3 × 3 2D convolution layers, ReLU activation layers and up-sampling layers connected in series, with the output of a final softmax layer taken as the prediction result; by means of the context decoder, the target mask prediction map, restored to the same size as the original input image, is obtained from the predicted target area feature map. In this embodiment, the mask labels of the query set images are used as the optimization target to calculate the mask prediction loss, realizing supervised training of the mask prediction module and giving it good mask prediction capability.
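A sketch of a context decoder of the kind described (3 × 3 convolutions, ReLU and up-sampling connected in series, with a softmax output); the number of stages and channel widths are assumptions.

```python
import torch
import torch.nn as nn

class ContextDecoder(nn.Module):
    """Restore the target area mask feature map to the input image size
    and output per-pixel class probabilities (background / target)."""
    def __init__(self, in_ch=1, mid_ch=32, num_stages=3):
        super().__init__()
        layers, ch = [], in_ch
        for _ in range(num_stages):
            layers += [
                nn.Conv2d(ch, mid_ch, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            ]
            ch = mid_ch
        layers.append(nn.Conv2d(mid_ch, 2, kernel_size=3, padding=1))
        self.body = nn.Sequential(*layers)

    def forward(self, f_mask):
        return torch.softmax(self.body(f_mask), dim=1)  # (B, 2, H, W)
```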
As shown in fig. 1, the structure of the image information perception module in this embodiment is shown in fig. 4 and comprises a convolution layer, a pooling layer, 4 convolution blocks connected in sequence, and a mean pooling layer. Its computation can be expressed as this cascade of operations applied to the overall representation of the T frame images from the query set, where the convolution kernels are conventional n × n kernels with time dimension t and ρ denotes the mean pooling operation.
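Since fig. 4 is not reproduced here, the sketch below only mirrors the described composition (a convolution layer, a pooling layer, four convolution blocks and a mean pooling layer) using 3D convolutions over the frame dimension; kernel sizes and channel widths are assumptions.

```python
import torch
import torch.nn as nn

class ImageInfoPerception(nn.Module):
    """Conv + pool, four convolution blocks, then mean pooling over space."""
    def __init__(self, in_ch=1, base_ch=32):
        super().__init__()
        def conv_block(ci, co, t=3, n=3):
            return nn.Sequential(
                nn.Conv3d(ci, co, kernel_size=(t, n, n),
                          padding=(t // 2, n // 2, n // 2)),
                nn.ReLU(inplace=True),
            )
        self.stem = nn.Sequential(conv_block(in_ch, base_ch),
                                  nn.MaxPool3d(kernel_size=(1, 2, 2)))
        self.blocks = nn.Sequential(*[conv_block(base_ch, base_ch) for _ in range(4)])
        self.pool = nn.AdaptiveAvgPool3d((None, 14, 14))  # rho: mean pooling over space

    def forward(self, x):        # x: (B, C, T, H, W) sparse query frames
        return self.pool(self.blocks(self.stem(x)))
```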
As shown in fig. 1, to mask a feature map on a target area
Figure BDA00035681383000001411
And image information feature map
Figure BDA00035681383000001412
In the embodiment, the image-level feature fusion layer masks the feature map of the target region
Figure BDA00035681383000001413
And image information feature map
Figure BDA00035681383000001414
Performing channel-by-channel feature fusion, and combining the fused features with an image information feature map
Figure BDA00035681383000001415
Obtaining a target information enhanced feature map by superposition
Figure BDA00035681383000001416
The process can be represented as:
the target area mask feature map is converted by a bilinear interpolation operation to the same dimension as the channel features of the image information feature map, a Hadamard product is then taken with the corresponding channel of the image information feature map, and the result is added to the image information feature map to obtain the target information enhanced feature map, namely:

$$F_{te}^{c} = \Psi_c\left(F_{mask}^{c}\right) \odot F_{img}^{c} + F_{img}^{c},\qquad c = 1,\dots,C$$

wherein ⊙ denotes the Hadamard product; Ψc(·) denotes the bilinear interpolation function that converts the target area mask feature map to the same channel size as the image information feature map; C is the number of feature channels; F_img^c denotes the feature representation of the c-th channel of the image information feature map; and F_mask^c denotes the feature representation of the c-th channel of the target area mask feature map.
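A minimal sketch of this channel-wise fusion, assuming PyTorch and 4-D (batch, channel, height, width) feature maps of matching channel counts:

```python
import torch
import torch.nn.functional as F

def image_level_fusion(mask_feat, img_feat):
    """mask_feat: (B, C, h, w) target-area mask feature map
       img_feat:  (B, C, H, W) image information feature map"""
    resized = F.interpolate(mask_feat, size=img_feat.shape[-2:],
                            mode="bilinear", align_corners=False)  # Psi_c: bilinear resize
    return resized * img_feat + img_feat                           # Hadamard product, then add back
```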
As shown in fig. 1, in this embodiment, the structure of the motion information perception module is similar to that of the image information perception module; its calculation applies the same sequence of convolution and pooling operations to the overall representation of the frame images from the dense frame sequence.
As shown in FIG. 1, in this embodiment, the global feature fusion layer performs feature compression on the motion information feature map and cascades it with the target information enhanced feature map to obtain the spatio-temporal behavior perception feature map, as follows:
the motion information feature map is compressed by a convolution operation to the same dimensionality as the target information enhanced feature map and is then concatenated with the target information enhanced feature map (feature cascade splicing) to obtain the spatio-temporal behavior perception feature map; the convolution operation employs a convolution kernel of size 1 × 1 with a time dimension of 5.
in this embodiment, after the global feature fusion layer unifies dimensions of the feature maps to be fused, feature fusion is realized in a tensor splicing manner, so that the feature maps obtained by fusion retain image information and motion information at the same time.
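A minimal sketch of this fusion step, assuming PyTorch, 5-D motion features of shape (batch, channel, time, height, width) and illustrative channel counts; the assumption that the temporal extent equals the 5-frame kernel is noted in the comments.

```python
import torch
import torch.nn as nn

class GlobalFeatureFusion(nn.Module):
    def __init__(self, motion_ch=256, target_ch=128):
        super().__init__()
        # 1 x 1 spatial kernel with time dimension 5; no temporal padding, so the
        # 5-frame window is collapsed (assumes the motion feature has T' = 5 here)
        self.compress = nn.Conv3d(motion_ch, target_ch, kernel_size=(5, 1, 1))

    def forward(self, motion_feat, target_feat):
        """motion_feat: (B, Cm, T', H, W); target_feat: (B, Ct, H, W)"""
        m = self.compress(motion_feat).squeeze(2)   # -> (B, Ct, H, W) when T' == 5
        return torch.cat([m, target_feat], dim=1)   # feature cascade (channel concatenation)
```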
In this embodiment, an image information perception module that acquires image information from sparse frame images and a motion information perception module that acquires motion information from dense frame images are constructed by nesting combinations of convolution blocks and pooling layers multiple times. The features produced by the image information perception module are fused with the target area mask feature map at the image level, and the global feature fusion layer then fuses the spatio-temporal information to obtain a target-information-enhanced spatio-temporal feature map.
As shown in FIG. 1, in this embodiment, the behavior classification module comprises a region-of-interest extractor that extracts regions of interest from the input spatio-temporal behavior perception feature map, and a multi-class classifier that outputs the maximum-likelihood behavior classification result.
Referring to fig. 1, in the present embodiment, the spatial domain target positioning sub-network includes:
a second feature extraction module for performing feature extraction on the query set images;
the suggestion frame generation module is used for carrying out target detection according to the features extracted by the second feature extraction module to generate suggestion frames; the suggestion box is used for identifying the possible existing area of the target;
and the positioning frame regression module is used for identifying a suggestion frame with a target from the suggestion frames generated by the suggestion frame generation module to serve as a target positioning frame.
As an optional implementation, in this embodiment the spatial domain target positioning sub-network adopts a Faster R-CNN network and the second feature extraction module adopts a ResNet-50 backbone network; it should be noted that other common target detection and positioning networks and feature extraction backbone networks may also be applied to the present invention.
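As an illustration only, torchvision's stock Faster R-CNN with a ResNet-50-FPN backbone can serve as a stand-in for such a sub-network (the patent's own implementation is not published as code, and the stock model expects 3-channel input):

```python
import torch
import torchvision

# Untrained detector used purely as a structural example of the sub-network choice.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)
detector.eval()
with torch.no_grad():
    frames = [torch.rand(3, 256, 256)]   # one query-set frame (dummy data)
    proposals = detector(frames)         # list of dicts with boxes, labels, scores
```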
In the embodiment, the deep neural network is trained by using a data set, wherein the training comprises a first stage training and a second stage training;
the first stage of training comprises:
constructing a support set image by using the video sequence labeled with the mask label, constructing a query set image by using the video sequence labeled with the mask label and the target positioning frame label, and constructing a first data set by using the constructed support set image and the constructed query set image;
performing supervised training on the mask prediction module by using a first data set to obtain a pre-trained mask prediction module;
the second stage of training comprises:
loading pre-trained mask prediction module parameters in a deep neural network, constructing a support set image by using a video sequence labeled with a mask label, constructing a query set image and a dense frame sequence image by using a video sequence labeled with a target behavior label and a target positioning frame label, constructing a second data set by using the constructed support set image, the query set image and the dense frame sequence image, and training a deep neural network model by using the second data set to obtain a target behavior perception space-time positioning model;
the training loss function is:

$$\mathcal{L}_{total} = \eta_1\,\mathcal{L}_{cls} + \eta_2\,\mathcal{L}_{loc} + \varepsilon\,\mathcal{L}_{mask}$$

wherein L_total denotes the total loss; L_cls denotes the behavior classification loss calculated from the spatio-temporal feature information, implemented as a multi-class cross entropy; L_loc denotes the behavior frame positioning loss calculated from the spatio-temporal feature information, comprising a classification loss and a bounding-box regression loss; L_mask denotes the prediction loss of the mask prediction; η1, η2 and ε are preset weight parameters. In this embodiment, in the first training stage η1 = η2 = 0 and ε = 1; in the second training stage η1 = η2 = 0.5 and ε = 0.1.
For the ith frame of input image, the mask prediction module outputs a target mask prediction image, and the corresponding mask prediction loss is a pixel-wise binary cross entropy over that image:

$$\mathcal{L}_{mask}^{i} = -\sum_{x}\left[y_i(x)\log p_i(x) + \left(1 - y_i(x)\right)\log\left(1 - p_i(x)\right)\right]$$

wherein x ranges over the pixel points of the target mask prediction image; y_i(x) denotes the label value corresponding to pixel x of the ith frame input image, with positive samples valued 1 and negative samples valued 0; and p_i(x) denotes the probability that pixel x of the ith frame input image is predicted as a positive sample. In the first training stage, the mask prediction loss is obtained by measuring the error between the prediction result and the mask labels of the query data set; in the second training stage, it is obtained by measuring the error between the prediction result and the target positioning frame labels of the query data set.
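A minimal sketch of these losses, assuming PyTorch; the mean reduction of the binary cross entropy is an assumption, since the exact normalization is not stated in the text:

```python
import torch
import torch.nn.functional as F

def mask_prediction_loss(pred_prob, target_mask):
    """pred_prob: (B, H, W) probability that each pixel is a positive (target) sample
       target_mask: (B, H, W) labels in {0, 1} (mask label or box-derived pseudo label)"""
    return F.binary_cross_entropy(pred_prob, target_mask.float())

def total_loss(loss_cls, loss_box, loss_mask, stage):
    # stage-dependent weights eta1, eta2, epsilon from the embodiment above
    eta1, eta2, eps = (0.0, 0.0, 1.0) if stage == 1 else (0.5, 0.5, 0.1)
    return eta1 * loss_cls + eta2 * loss_box + eps * loss_mask
```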
In this embodiment, before model training, preprocessing operations such as deformation and image normalization are performed on images obtained by random sampling in an input video sequence, and training sample data expansion is performed; optionally, in this example, the data enhancement method includes random cropping, random size modification, and random flipping;
In the first stage, the mask prediction module is trained with supervision to obtain good pixel classification capability for the target area. Optionally, in this embodiment, the public Pascal VOC data set is used for this supervised training; it contains 20 classes of images (aircraft and others) in different scenes together with the corresponding target frame labels and mask labels.
In the second stage, the mask prediction module is trained with weak supervision: the target frame labels are used as pseudo labels of the query set images to obtain the prediction mask of the image target area. Based on the query set images and target positioning frame labels, the image information perception module and the motion information perception module are trained with supervision and the parameters of the overall model are adjusted. In this embodiment, the inputs are gray-scale video sequences of an aircraft in different states in real scenes, together with the corresponding target positioning frame labels. Because there is a domain difference between the first data set and the second data set, the mask prediction module is still fine-tuned in the second training stage: a support set image is constructed from the video sequence labeled with mask labels, and the target frame labels are used as pseudo labels of the query set images to adjust the model parameters, reducing the prediction loss caused by the domain difference.
Performing weakly supervised learning on the mask prediction module by inputting the target frame label as a pseudo label of the query set image comprises:
according to the target positioning frame label of the input image I_qi, each pixel (x_i, y_i) of the ith frame input image is assigned to the set of target-area pixels if it lies inside the labeled box, i.e. if x_i^0 ≤ x_i ≤ x_i^0 + w_i and y_i^0 ≤ y_i ≤ y_i^0 + h_i, and to the set of background-area pixels otherwise; here (x_i^0, y_i^0) are the coordinates of the top-left vertex of the labeled box of the ith frame input image, and w_i and h_i denote the width and height of the labeled box. The resulting pseudo mask supervises the output of the mask prediction module for the query set image.
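A minimal sketch of this box-to-pseudo-mask conversion, assuming PyTorch tensors and (x0, y0, w, h) box labels:

```python
import torch

def box_to_pseudo_mask(box, height, width):
    """box: (x0, y0, w, h) top-left corner plus box width/height, in pixels."""
    x0, y0, w, h = box
    mask = torch.zeros(height, width)
    mask[int(y0):int(y0 + h), int(x0):int(x0 + w)] = 1.0   # target-area pixel set
    return mask                                            # everything else: background set
```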
As shown in fig. 5, the first row shows original images of the input video sequence, the second row shows the mask prediction maps output by the mask prediction module and the context decoder, and the third row shows the target positioning frame label maps obtained by converting the target coordinate labels; (a), (b), (c) and (d) show the mask prediction results for two randomly selected frames from different sequences. Optionally, in this embodiment, the first training stage runs for 2000 rounds with an initial learning rate of 0.001 and the adaptive-learning-rate Adam optimizer; the second training stage runs for 100 rounds with an initial learning rate of 0.01, the learning rate adjusted by stochastic gradient descent with a momentum term, and the weight decay set to 10^-5 to prevent overfitting. The input video is uniformly sampled along the time axis at 10 frames per second; the image sequence size is (1920, 1080), and the data enhancement module randomly scales the shorter side to 256 pixels and randomly crops to obtain a uniform (256, 256) input.
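A minimal sketch of the corresponding optimization and preprocessing setup, assuming PyTorch/torchvision; the momentum value of 0.9 is an assumption, since the text states only that a momentum term is used:

```python
import torch
from torchvision import transforms

def make_optimizer(model, stage):
    if stage == 1:
        return torch.optim.Adam(model.parameters(), lr=1e-3)      # stage 1: Adam, lr 0.001
    return torch.optim.SGD(model.parameters(), lr=1e-2,
                           momentum=0.9, weight_decay=1e-5)        # stage 2: SGD + momentum

frame_transform = transforms.Compose([
    transforms.Resize(256),            # scale the shorter side to 256 pixels
    transforms.RandomCrop(256),        # uniform 256 x 256 input
    transforms.RandomHorizontalFlip(), # random flipping (one of the listed augmentations)
    transforms.ToTensor(),
])
```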
In summary, the target behavior perception spatio-temporal positioning model established in this embodiment performs mask prediction while extracting features from the input video sequence frames to obtain a target area mask feature map, fuses this map channel by channel with the image information feature map to obtain a target information enhanced feature map, fuses the target information enhanced feature map with the extracted motion information feature map into a spatio-temporal behavior perception feature map, and finally extracts regions of interest from the spatio-temporal behavior perception feature map to obtain the behavior classification result of the target. Predicting the target area mask feature map and fusing it channel by channel with the image information feature map strengthens the model's attention to the pixel information of the target area and reduces the influence of background interference, so the pixel information of the region where the target is located is fully exploited, the accuracy of target area feature extraction is improved, and the detection and classification accuracy of spatio-temporal behavior perception is ultimately improved; the model can therefore effectively and accurately realize behavior perception spatio-temporal positioning of rigid targets such as aircraft.
When the model performs behavior perception spatio-temporal positioning, it only needs to extract image features, predict the target area mask feature map, perceive motion information and perform target detection; it does not depend on color information and focuses instead on the contour structure of the target in the video sequence to be detected, so it is applicable to video sequences displayed in gray scale, such as infrared sequences, and has stronger generalization capability. Discarding the color channels also relaxes the memory requirements of training and inference and greatly reduces the runtime memory footprint.
Example 2:
a space-time perception positioning method for target behaviors comprises the following steps:
randomly sampling T frame images from a video sequence marked with a mask label to serve as support set images;
randomly sampling T frame images from a video sequence to be predicted to serve as sparse frame sequence images, and randomly sampling alpha T frame images from the video sequence to be predicted to serve as dense frame sequence images; alpha is more than 1;
inputting the sparse frame sequence images into a spatial target positioning sub-network in the target behavior sensing space-time positioning model established by the method for establishing the target behavior space-time sensing positioning model provided by the embodiment 1 to predict a target positioning frame of a frame image target, and labeling the sparse frame sequence images by using a target positioning frame prediction result to obtain a query set image;
the support set image, the query set image and the dense frame sequence image are input into the target behavior space-time perception positioning model established by the method for establishing the target behavior space-time perception positioning model provided by the embodiment 1, the behavior category and the target positioning frame of the frame image target are predicted according to the output of the target behavior space-time perception positioning model, and the target behavior space-time positioning is completed.
Example 3:
a computer readable storage medium comprising a stored computer program; when the computer program is executed by the processor, the apparatus on which the computer readable storage medium is located is controlled to execute the method for establishing the target behavior spatio-temporal perceptual positioning model provided in the above embodiment 1, and/or the method for establishing the target behavior spatio-temporal perceptual positioning model provided in the above embodiment 2.
The performance of the model established by the invention is further explained by combining the specific target behavior space-time perception positioning result.
For convenience of description, in the following description, the model established by the above embodiment 1 will be denoted as IM-BPSTM-Mask model.
Based on the gradient class activation heat map method Grad-CAM, an importance weight matrix is constructed from the forward result and the back-propagated gradient values of the input image feature map and is displayed as a heat map. Fig. 6 shows the gradient class activation heat maps obtained with the IM-BPSTM-Mask model established in embodiment 1, where images in the same row come from different frames of the same video sequence; the left side of the dotted line shows the original video-frame samples, and the right side shows the effect after the gradient class activation heat maps are superimposed, with higher brightness indicating stronger perception of the region. (a) shows the gradient class activation heat maps for the bait release behavior sequence, and (b) those for the target overturn behavior sequence. According to the results in fig. 6, the IM-BPSTM-Mask model reduces the activation of background interference regions and focuses mainly on the region where the target is located and on the regions relevant to behavior perception. From (a) in fig. 6, the activation region for the bait release behavior sequence is concentrated mainly on the target itself and the position of the bait cartridge, and on the target region before release starts and after it ends, effectively ignoring the interference of the bait cartridge tail smoke in the background. From (b) in fig. 6, the activation region for the target overturn behavior sequence is concentrated mainly on the head region, where the change of the target's pitch angle is easiest to judge. The IM-BPSTM-Mask model can therefore effectively focus on the typical characteristics of target behavior change while suppressing the perception of irrelevant background information.
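For reference, a minimal Grad-CAM sketch in PyTorch is given below; the `features` sub-module name and the 2D (1, C, H, W) activation shape are assumptions about the model being visualized:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, class_idx):
    feats = {}
    def hook(_, __, output):
        feats["a"] = output
        output.retain_grad()                          # keep gradients of the activation map
    handle = model.features.register_forward_hook(hook)
    score = model(image.unsqueeze(0))[0, class_idx]   # forward result for the chosen class
    score.backward()                                  # back-propagated gradient values
    handle.remove()
    a, g = feats["a"], feats["a"].grad                # activations and their gradients
    weights = g.mean(dim=(2, 3), keepdim=True)        # importance weight per channel
    cam = F.relu((weights * a).sum(dim=1))            # weighted combination, then ReLU
    return cam / (cam.max() + 1e-9)                   # normalized heat map
```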
The beneficial effects of the target behavior space-time perception positioning method provided by the present invention are further verified below by comparing embodiment 2 with existing behavior perception space-time positioning methods. Gray-scale aircraft video sequences are used as input; the data set contains 107 video sequences with 11186 frames in total, of which 65 sequences are used as training samples and 42 as test samples. The same video sequence may contain multiple behavior classes of the target, the same target may exhibit multiple behavior classes at the same time, and a video sequence may contain multiple targets. The label information of each behavior category is shown in fig. 7. Statistics of the number of samples per behavior category are given in table 1; since the task mainly targets behaviors such as bait release and escape, the numbers of take-off and landing samples are small.
TABLE 1
Behavior category    Number of samples
Flying               13483
Bait release         5266
Escape               951
Somersault           730
Taking off           197
Landing              109
Ground sliding       230
The intersection over union (IoU) is used as the basis for comparing detection results; it reflects the ratio of the intersection area to the union area of the predicted rectangular box and the actual target box. When the computed IoU is greater than a set threshold, the detection result is considered correct.
TP (true positive) denotes the number of target ground-truth boxes detected correctly; FP (false positive) denotes predictions whose IoU with the ground-truth box is less than the threshold; FN (false negative) denotes the number of missed target boxes; TN (true negative) denotes the number of non-target boxes. Precision measures the proportion of correct samples among the predictions, and Recall measures the proportion of samples in the test set that are correctly detected:
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
AP@50 and AP@75 denote the average precision AP at IoU thresholds of 0.5 and 0.75, respectively; AR@50:5:95 denotes the average recall AR over IoU thresholds from 50% to 95% in steps of 5%.
The behavior classification module is based on the structure of the classic two-stage target detection framework Faster R-CNN, adjusted to suit the video spatio-temporal positioning task. RoI features are extracted from the output of the last feature map of the global feature fusion layer introduced by the present invention: the two-dimensional RoI of each frame is expanded into a three-dimensional RoI by copying along the time dimension, spatial RoI pooling and global average pooling over the time dimension are applied to obtain the RoI features, and the max-pooled RoI features are then fed into a sigmoid-based classifier for multi-label prediction.
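A minimal sketch of this RoI pathway, assuming PyTorch/torchvision; the channel count, RoI size and the use of `roi_align` for the spatial RoI pooling are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class BehaviorClassifier(nn.Module):
    def __init__(self, channels=256, num_classes=7, roi_size=7):
        super().__init__()
        self.roi_size = roi_size
        self.head = nn.Linear(channels, num_classes)

    def forward(self, feat, boxes):
        """feat: (B, C, T, H, W) spatio-temporal behavior perception feature map
           boxes: list of (N_i, 4) tensors of per-image RoIs in feature-map coordinates"""
        B, C, T, H, W = feat.shape
        per_time = []
        for t in range(T):                                   # copy each 2D RoI along time
            per_time.append(roi_align(feat[:, :, t], boxes, output_size=self.roi_size))
        roi_feat = torch.stack(per_time, dim=0).mean(dim=0)  # global average pooling over time
        roi_feat = roi_feat.amax(dim=(2, 3))                 # spatial max pooling -> (sum N_i, C)
        return torch.sigmoid(self.head(roi_feat))            # multi-label behavior scores
```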
The parameters of the spatial domain target positioning sub-network are taken from an existing target detection model and are not jointly trained with the IM-BPSTM-Mask model provided by the invention. A Faster R-CNN target detection model with a ResNet-50-FPN backbone, trained from scratch on ImageNet, is used as the baseline; it is pre-trained with Pascal VOC and then fine-tuned with frame sequence images containing target positioning frame labels to obtain the spatial domain target positioning sub-network parameters used here. Its test result on the data set reaches 0.896 AP@50 and 0.816 AP@75, with a recall of 0.703 AR@50:5:95.
Performance is compared among I-BPSTM, a model that uses only the image information perception module; IM-BPSTM, a model that extracts pixel-level features with the image information perception module and directly fuses the motion information feature map; and IM-BPSTM-Mask, the model established in embodiment 1. The final result is averaged over the softmax prediction scores, and the performance index is the mean average precision (mAP) over the 7 behavior classes, measured with a frame-level IoU threshold of 0.5.
Table 2 shows the comparison of the test results of IM-BPSTM-Mask and other space-time positioning model methods.
As can be seen from the table 2, the target behavior space-time perception positioning model IM-BPSTM-Mask provided by the invention obtains the best detection performance in the categories of flying, bait releasing, escaping, overturning and ground sliding, obtains the suboptimal detection performance in the two categories of taking off and landing, and has a smaller difference with the optimal performance. As shown in table 1, since the embodiment mainly aims at the detection of bait release and escape behavior, the data sample size of take-off and landing behavior is small and not representative. For comprehensive behavior category perception, the IM-BPSTM-Mask model obtains the best detection performance on each category of average detection precision mAP indexes, which means that the algorithm provided by the invention has more stable performance and higher average accuracy.
TABLE 2
Behavior category    I-BPSTM    IM-BPSTM    IM-BPSTM-Mask
Flying               0.991      0.996       0.997
Bait release         0.365      0.665       0.705
Escape               0.021      0.551       0.679
Somersault           0.152      0.219       0.281
Taking off           0.545      0.667       0.627
Landing              0.487      0.978       0.846
Ground sliding       0.150      0.406       0.885
mAP                  0.680      0.819       0.841
Further, fig. 7 shows the behavior space-time perception results of the bait release process for the single-target multi-behavior state according to the embodiment of the present invention and other methods; target behavior information contained in the frame sequence images in the same row sequentially comprises target flight, target turning start, target releasing start, a bait bomb releasing process and a bait bomb releasing end; wherein, (a) is the detection result of the method of the invention, (b) is the detection result of the IM-BPSTM method, and (c) is the information description chart of the detection result label. The target positioning result is labeled on the graph in a form of a coordinate frame, information at the upper left corner of the coordinate frame corresponding to the target identifies the behavior state of the current frame of the target and the probability estimation value of the behavior, and the correspondence between the labeling information and the behavior category is shown in (c) of fig. 7. According to the comparison results of the corresponding time points of (a) and (b) in fig. 7, the IM-BPSTM-Mask provided by the invention can sense the target behavior change at the target turning starting time, the target bait release starting time and the target bait release ending time earlier. The method can more accurately position the moment of the target behavior of the video sequence and has more sensitive perception capability on the change of the target behavior.
In order to illustrate that the target behavior space-time perception positioning method provided by the present invention is also applicable to multi-target, multi-behavior situations, this embodiment performs target spatio-temporal behavior perception on bait release behaviors under multi-target, multi-behavior conditions. The spatio-temporal behavior perception positioning results of the IM-BPSTM-Mask model are shown in fig. 8, where the images in different rows come from samples of different video sequences. It can be seen that the behavior of each target in a video sequence is determined independently: while one target is in the release state, another target may be in flight, and even with multiple targets the model still perceives in time the moment a target begins to release bait. Unlike most behavior perception models, which can only perform uniform behavior perception on the objects in a video sequence, i.e. classify behavior from the sequence images as a whole, the method provided by the invention can accurately locate the positions of different targets and perceive the current-frame behavior of each target in time.
Occlusion of the target makes behavior perception considerably more difficult; as shown in fig. 9, the target behavior space-time perception positioning method provided by the invention can still judge and locate the behavior of the target in a timely and accurate manner even when the target is small and is occluded by the bait bomb tail smoke.
FIG. 10 shows the spatio-temporal perception positioning results of the method of the present invention for the target escape behavior, where (a), (b), (c) and (d) come from samples of different video sequences. FIG. 11 shows the spatio-temporal perception positioning results of the method for the target take-off behavior, and FIG. 12 those for the target landing behavior. The method provided by the invention obtains good spatio-temporal perception positioning of target behaviors under various environments, target scales and target states, and guarantees the accuracy of behavior frame positioning and behavior classification while accurately locating the target position.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for establishing a target behavior perception space-time positioning model is characterized by comprising the following steps: establishing a deep neural network, and training the deep neural network by using a data set to obtain a target behavior perception space-time positioning model;
the deep neural network comprises a space-time behavior perception sub-network and a space-time target positioning sub-network, and the space-time behavior perception sub-network comprises:
a mask prediction module for performing target area feature perception on the support set images and the query set images, and performing target mask prediction on the query set images by using the perception result to obtain a target area mask feature map;
an image information perception module for performing feature extraction on the query set images to obtain an image information feature map;
an image-level feature fusion layer for performing channel-by-channel feature fusion on the target area mask feature map and the image information feature map, and superposing the fused features with the image information feature map to obtain a target information enhanced feature map;
a motion information perception module for perceiving motion information of the dense frame sequence images to obtain a motion information feature map;
a global feature fusion layer for performing feature compression on the motion information feature map and feature cascading with the target information enhanced feature map to obtain a space-time behavior perception feature map;
and a behavior classification module for performing region-of-interest extraction and behavior class probability prediction on the space-time behavior perception feature map to obtain a behavior class prediction result of the target;
the spatial domain target positioning sub-network is used for performing target identification, detection and positioning on the query set images to obtain a target positioning frame prediction result;
wherein the support set images are T frame images randomly sampled from a video sequence labeled with mask labels; the query set images are T frame images randomly sampled from the input video sequence and labeled with target positioning frame labels; the support set images and the query set images are different from each other, and T is a positive integer; the dense frame sequence images are αT frame images randomly sampled from the input video sequence, with α > 1.
2. The method of building a target behavior-aware spatiotemporal localization model according to claim 1, wherein the mask prediction module comprises:
a first feature extraction module for respectively performing feature extraction on the support set images and the query set images to obtain a support set image feature map and a query set image feature map;
a first background suppression module for suppressing the background in the support set image features by using the mask labels of the support set images to obtain a first target enhanced feature map;
a second background suppression module for suppressing the background in the query set image features by using the target positioning frame labels of the query set images to obtain a second target enhanced feature map;
a correlation measurement module for measuring the similarity of the first target enhanced feature map and the second target enhanced feature map to obtain a super-correlation feature matrix;
a feature compression module for performing depth-dimension compression on the super-correlation feature matrix to obtain a feature compression matrix;
a feature fusion module for performing feature fusion on the features of each layer of the feature compression matrix by using a residual structure to obtain a feature fusion matrix;
and a down-sampling layer for converting the feature fusion matrix into a two-dimensional feature map, namely the target area mask feature map.
3. The method for building a model of object behavior aware spatiotemporal localization as claimed in claim 2, wherein in the model training process, the mask prediction module further comprises:
a context decoder for restoring the target area mask feature map to the same size as the original input image to obtain a query set target mask prediction map;
Training the deep neural network by using a data set, wherein the training comprises a first-stage training and a second-stage training;
the first stage training includes:
constructing a support set image by using the video sequence labeled with the mask label, constructing a query set image by using the video sequence labeled with the mask label and the target positioning frame label, and constructing a first data set by using the constructed support set image and the constructed query set image;
performing supervision training on the mask prediction module by using the first data set to obtain a pre-trained mask prediction module;
the second stage training comprises:
loading the pre-trained mask prediction module parameters in the deep neural network, constructing a support set image by using a video sequence labeled with a mask label, constructing a query set image and a dense frame sequence image by using a video sequence labeled with a target behavior label and a target positioning frame label, constructing a second data set by using the constructed support set image, the query set image and the dense frame sequence image, and training the deep neural network model by using the second data set to obtain a target behavior perception space-time positioning model.
4. The method of claim 3, wherein the training loss function is:
$$\mathcal{L}_{total} = \eta_1\,\mathcal{L}_{cls} + \eta_2\,\mathcal{L}_{loc} + \varepsilon\,\mathcal{L}_{mask}$$

wherein L_total denotes the total loss; L_cls denotes the behavior classification loss calculated from the spatio-temporal feature information; L_loc denotes the behavior frame positioning loss calculated from the spatio-temporal feature information; L_mask denotes the prediction loss of the mask prediction; η1, η2 and ε are preset weight parameters; in the first training stage, η1 = η2 = 0 and ε is greater than zero; in the second training stage, η1, η2 and ε are all greater than zero.
5. The method for establishing the target behavior perception space-time positioning model according to claim 4, wherein for the ith frame of input image, the mask prediction module outputs a target mask prediction image, and the corresponding mask prediction loss is:

$$\mathcal{L}_{mask}^{i} = -\sum_{x}\left[y_i(x)\log p_i(x) + \left(1 - y_i(x)\right)\log\left(1 - p_i(x)\right)\right]$$

wherein x denotes a pixel point of the target mask prediction image; y_i(x) denotes the label value corresponding to pixel x of the ith frame input image, with positive samples valued 1 and negative samples valued 0; p_i(x) denotes the probability that pixel x of the ith frame input image is predicted as a positive sample; in the first training stage, the mask prediction loss is obtained by measuring the error between the prediction result and the mask labels of the query data set; in the second training stage, the mask prediction loss is obtained by measuring the error between the prediction result and the target positioning frame labels of the query data set.
6. The method of claim 5, wherein in the first training stage, η1 = η2 = 0 and ε = 1; in the second training stage, η1 = η2 = 0.5 and ε = 0.1.
7. The method for establishing the target behavior perception space-time positioning model according to any one of claims 2 to 6, wherein the image-level feature fusion layer performing channel-by-channel feature fusion on the target area mask feature map and the image information feature map, and superposing the fused features with the image information feature map to obtain the target information enhanced feature map, comprises:
converting the target area mask feature map by a bilinear interpolation operation to the same size as the channel features of the image information feature map, performing a Hadamard product operation with the corresponding channel features of the image information feature map, and adding the result to the image information feature map to obtain the target information enhanced feature map.
8. The method for establishing the target behavior perception space-time positioning model according to any one of claims 1 to 6, wherein the global feature fusion layer performing feature compression on the motion information feature map and feature cascading with the target information enhanced feature map to obtain the space-time behavior perception feature map comprises:
compressing the motion information feature map by a convolution operation to the same dimensionality as the target information enhanced feature map, and then performing feature cascade splicing with the target information enhanced feature map to obtain the space-time behavior perception feature map.
9. A space-time perception positioning method for target behaviors is characterized by comprising the following steps:
randomly sampling T frame images from a video sequence marked with a mask label to serve as support set images;
randomly sampling T frame images from a video sequence to be predicted to serve as sparse frame sequence images, and randomly sampling alpha T frame images from the video sequence to be predicted to serve as dense frame sequence images; alpha is more than 1;
inputting the sparse frame sequence images into a spatial target positioning sub-network in a target behavior space-time perception positioning model established by the method for establishing the target behavior space-time perception positioning model according to any one of claims 1 to 8 so as to predict a target positioning frame of a frame image target, and labeling the sparse frame sequence images by using the target positioning frame prediction result to obtain a query set image;
and inputting the support set image, the query set image and the dense frame sequence image into the target behavior perception space-time positioning model, predicting the behavior category and the target positioning frame of a frame image target according to the output of the target behavior perception space-time positioning model, and completing target behavior space-time positioning.
10. A computer-readable storage medium comprising a stored computer program; when being executed by a processor, the computer program controls a device on which the computer-readable storage medium is located to execute the method for establishing the target behavior space-time perception positioning model according to any one of claims 1-8, and/or the target behavior space-time perception positioning method according to claim 9.
CN202210313781.4A 2022-03-28 2022-03-28 Method for establishing target behavior perception space-time positioning model and application Active CN114782859B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210313781.4A CN114782859B (en) 2022-03-28 2022-03-28 Method for establishing target behavior perception space-time positioning model and application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210313781.4A CN114782859B (en) 2022-03-28 2022-03-28 Method for establishing target behavior perception space-time positioning model and application

Publications (2)

Publication Number Publication Date
CN114782859A true CN114782859A (en) 2022-07-22
CN114782859B CN114782859B (en) 2024-07-19

Family

ID=82425713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210313781.4A Active CN114782859B (en) 2022-03-28 2022-03-28 Method for establishing target behavior perception space-time positioning model and application

Country Status (1)

Country Link
CN (1) CN114782859B (en)



Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019238976A1 (en) * 2018-06-15 2019-12-19 Université de Liège Image classification using neural networks
JP6830707B1 (en) * 2020-01-23 2021-02-17 同▲済▼大学 Person re-identification method that combines random batch mask and multi-scale expression learning
WO2022000426A1 (en) * 2020-06-30 2022-01-06 中国科学院自动化研究所 Method and system for segmenting moving target on basis of twin deep neural network

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115880647A (en) * 2023-02-22 2023-03-31 山东山大鸥玛软件股份有限公司 Method, system, equipment and storage medium for analyzing abnormal behaviors of examinee examination room
CN117274788A (en) * 2023-10-07 2023-12-22 南开大学 Sonar image target positioning method, system, electronic equipment and storage medium
CN117274788B (en) * 2023-10-07 2024-04-30 南开大学 Sonar image target positioning method, system, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN114782859B (en) 2024-07-19


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant