CN117115706A - Video scene graph generation method based on multi-scale space-time attention network - Google Patents

Video scene graph generation method based on multi-scale space-time attention network

Info

Publication number
CN117115706A
Authority
CN
China
Prior art keywords
video
scale
features
matrix
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311048203.3A
Other languages
Chinese (zh)
Inventor
余宙
王朱佳
俞俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202311048203.3A priority Critical patent/CN117115706A/en
Publication of CN117115706A publication Critical patent/CN117115706A/en
Pending legal-status Critical Current

Classifications

    • G06V 20/41: Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G06N 3/045: Combinations of networks
    • G06N 3/0455: Auto-encoder networks; encoder-decoder networks
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G06V 10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/52: Scale-space analysis, e.g. wavelet analysis
    • G06V 10/62: Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; pattern tracking
    • G06V 10/764: Recognition or understanding using classification, e.g. of video objects
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82: Recognition or understanding using neural networks
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V 20/70: Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06V 2201/07: Target detection (indexing scheme relating to image or video recognition or understanding)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video dynamic scene graph generation method based on a multi-scale spatio-temporal attention network. The method comprises the following steps: 1. dividing the data set; 2. extracting features from video frames with a pre-trained object detection network and classifying the detected objects; 3. constructing language features of the objects; 4. constructing comprehensive features of person-object relation pairs and storing them as a sparse matrix; 5. constructing the multi-scale spatio-temporal attention network; 6. constructing a classification network enhanced by a pre-trained model; 7. defining the loss functions; 8. training the model; 9. computing the network predictions. The invention proposes a multi-scale spatio-temporal attention network that introduces the idea of multi-scale modeling on top of the classical Transformer architecture, so as to accurately model the dynamic, fine-grained semantics of video.

Description

Video scene graph generation method based on multi-scale space-time attention network
Technical Field
The invention provides a video dynamic scene graph generation (Dynamic Scene Graph Generation) method based on a Multi-Scale Spatial-Temporal Transformer (MSTT).
Background
The dynamic scene graph generation (DSGG) task aims to simultaneously detect the objects appearing in a video and predict the relationships between them, thereby generating a series of triplets of the form <subject, predicate, object>. A dynamic scene graph builds on the static scene graph by adding a time axis that represents how the relationships between objects change over time. Generating dynamic scene graphs is a challenging task because both spatial and temporal factors must be considered. Whereas conventional static image scene graph generation models focus only on the current frame and typically capture only the static relationships between objects in an image, dynamic scene graph generation models must model the semantics of dynamic video at a finer granularity, ideally allowing more specific relationships to be inferred for the current frame from the other frames of the video. This fine-grained relational modeling helps to understand interaction dynamics in video more accurately. Therefore, as the demand for fine granularity increases, research on dynamic video scene graph generation is of significant importance.
In recent years, research on video scene graph generation can be divided into two branches: coarse-grained video scene graph generation and fine-grained video scene graph generation. Coarse-grained methods aim to generate a scene graph for the scenes appearing in a video clip. They have achieved good results in understanding video clips and predicting the relationships between objects, but they cannot capture how object relationships evolve over time across video frames. As the demand for video understanding grows, more and more research is devoted to fine-grained video scene graph generation. However, these methods attend only to the correlation of objects in the global space in the spatial dimension, and only to the short-term temporal correlation of objects in the temporal dimension. Such single-scale attention network models cannot capture the local spatial correlations given by the relative positions between objects, nor the long-term temporal correlations of the same pair of objects throughout the video, and their understanding capability is therefore limited. For this reason, designing different modeling scales in the spatial and temporal dimensions to capture more useful information undoubtedly helps to deepen the understanding of the video scene graph and thus enhance the expressive power of the finally generated video dynamic scene graph.
In practical applications, video scene graphs have a wide range of uses. First, a video scene graph provides the semantic relationships and interaction patterns between objects in a video, helping machines understand video content; this is very useful for video content analysis, content recommendation, video retrieval, and video content understanding tasks. Second, a video scene graph provides object and motion information, helping users edit and clip videos: by automatically identifying and analyzing scenes and dynamic elements, video editing and composition become more convenient. In addition, a video scene graph can help a computer make deeper visual inferences; by understanding the relationships and dynamic changes between objects, the computer can better infer future dynamic behavior and thus provide more intelligent and accurate visual reasoning results. Finally, a video scene graph can be used to generate and enhance video content: by modifying and adjusting the video scene graph, new video content can be generated or the visual effects of existing videos can be enhanced.
In summary, video dynamic scene graph generation is a topic worthy of in-depth research. This patent aims to start from several key points of the task, address the difficulties and key issues of current methods, and form a complete video scene graph generation system.
Previous video scene graph generation methods have focused on only the correlation of objects in global space in the spatial dimension and on only the short-term temporal correlation of objects in the video in the temporal dimension. Such methods, while simple to implement and easy to understand, still suffer from two problems:
1. Local spatial relationships that are highly correlated with a person-object pair cannot be captured in a targeted way. As a result, the model cannot focus on the spatial proximity between objects, and the accuracy of relationship modeling drops when cross-occlusion and similar situations occur in a dynamic scene. The method therefore adds a local spatial scale for local spatial relationship modeling, so as to capture the relationship between the person and the object more accurately and to reduce the interference of other irrelevant objects. This module plays an important role in accurately modeling object relationships involved in dense interactions in dynamic video scenes.
2. Long-term evolution trends and persistent behaviors of person-object relationships cannot be accurately predicted and captured. As a result, the model fails to take into account persistent behaviors and the dynamic evolution between objects, which may lead to erroneous modeling when objects are occluded for long periods. The method therefore adds a long-term temporal scale for long-term temporal relationship modeling. Attending to the persistent behaviors and dynamic evolution of objects makes it possible to capture long-term relationships such as continuous interaction and pursuit between objects. This module plays an important role in modeling long-term dynamic relationships in dynamic video scenes.
Disclosure of Invention
The invention provides a video scene graph generation method based on a multi-scale space-time attention network aiming at the two problems. The core method is to propose a multi-scale space-time attention network (MSTT) to realize accurate modeling of video dynamic fine granularity semantics.
The invention mainly comprises two points:
1. The idea of multi-scale modeling is innovatively introduced on top of the classical Transformer architecture, with modeling performed in the spatial and temporal dimensions respectively. In the spatial dimension, in addition to preserving the attention to object correlations in the global space used by traditional methods, the local spatial correlation of the relative positions between objects is also modeled. In the temporal dimension, in addition to preserving the traditional focus on the short-term temporal correlation of objects in the video, the long-term temporal correlation of the same pair of objects throughout the whole video is also attended to.
2. The interactive understanding capability of the model visual language is enhanced by adopting a large-scale visual language pre-training model CLIP. By converting the relationship categories into textual descriptions and generating text embeddings using the CLIP model, better semantic representation and understanding capabilities can be provided.
The technical scheme adopted by the invention for solving the technical problems comprises the following steps:
step 1: partitioning of data sets
The data set is partitioned.
Step 2: extracting features from video frames by using a pre-trained target detection network, and classifying targets
For the video frames in the dataset, a pre-trained object detector is used for object recognition and feature extraction. The class of each detected object is also predicted and used as a prior condition for the model.
Step 3: Building the language features of the targets
The class predicted for each target is converted into a word vector containing semantic information using a pre-trained word-vector model.
Step 4: Building comprehensive features of person-object relation pairs and storing them as a sparse matrix
The features obtained in step 2 and step 3 are spliced pairwise between subjects and objects; the spliced feature comprises the visual feature of the subject, the visual feature of the object, the language feature of the subject, the language feature of the object, and the joint feature between subject and object.
For all frames of a video, the pairwise features between the subjects and objects appearing in the video are stored in a sparse matrix, where the number of rows of the matrix is the number of frames of the video, the number of columns is the number of object categories, and each column corresponds to the relation pair between the person and one object category.
step 5: construction of a multiscale spatiotemporal attention network
The integrated features of step 4 are input to a multi-scale spatio-temporal attention network having two modules, a spatial encoder and a temporal decoder, respectively. The spatial encoder comprises a local spatial encoding and a global spatial encoding, and the time decoder comprises a long-term time decoding and a short-term time decoding.
Step 6: building a pre-trained model enhanced classification network
The output of step 5 is fed into the classification network enhanced by the pre-trained model for the final relationship classification, so as to enhance visual-language interaction understanding. Finally, the attention relationship class prediction vector, the position relationship class prediction vector, and the contact relationship class prediction vector are output.
Step 7: loss function
For the object classification of step 2, the prediction vector and the target vector are fed into the loss function and the loss value is computed. For the relationship classification of step 6, the three kinds of output prediction vectors and their corresponding target vectors are fed into the corresponding loss functions, and three loss values are output respectively.
Step 8: training model
According to the loss values produced by the loss functions of step 7, the gradients are back-propagated to the model parameters of the neural network of step 6 using the back-propagation algorithm, until the whole network model converges, i.e. the loss value of the model approaches 0 and no longer shows a decreasing trend.
Step 9: network predictor calculation
The prediction vectors output in step 6 are ranked, and the final classification prediction result is selected according to different judgment criteria. The judgment strategies are as follows: 1) With-constraint strategy: only one predicate is allowed between each person-object pair; this strategy constrains the number of relationships between pairs in the generated scene graph. 2) No-constraint strategy: multiple predicates are allowed between each person-object pair; there is no limit on the number of relationships between pairs in the scene graph generated under this strategy.
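For illustration only, the sketch below shows one way the two judgment strategies could be applied to a matrix of predicate confidences; the tensor layout and the optional top-k cut-off are assumptions and are not part of the claimed method.

```python
import torch

def select_predictions(rel_scores, with_constraint=True, top_k=None):
    """Illustrative selection of predicted predicates for each person-object pair.

    rel_scores: [num_pairs, num_predicates] tensor of predicate confidences.
    with_constraint: if True, keep only the single highest-scoring predicate per pair;
                     otherwise keep every (pair, predicate) score.
    """
    if with_constraint:
        score, pred = rel_scores.max(dim=1)            # one predicate per pair
        triplets = [(i, pred[i].item(), score[i].item())
                    for i in range(rel_scores.size(0))]
    else:
        pairs, preds = torch.meshgrid(
            torch.arange(rel_scores.size(0)),
            torch.arange(rel_scores.size(1)), indexing="ij")
        triplets = list(zip(pairs.flatten().tolist(),
                            preds.flatten().tolist(),
                            rel_scores.flatten().tolist()))
    triplets.sort(key=lambda t: t[2], reverse=True)    # rank by confidence
    return triplets[:top_k] if top_k else triplets
```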
The step 1 is specifically realized as follows:
70% of the data in the dataset was used for training, the remaining 30% for validation and testing.
The step 2 of extracting features from the video frame by using the pre-trained target detection network and predicting category distribution is specifically as follows:
For an input video V = [I_1, I_2, …, I_T] (where T represents the number of frames of the video), each frame I_t is passed through the detector to obtain its bounding boxes {b_t^i}, their class distributions {d_t^i}, and the visual feature v_t^i corresponding to each bounding box.
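As a concrete illustration of this step, the sketch below uses a torchvision Faster R-CNN together with RoIAlign over one feature-pyramid level to obtain boxes, labels, confidences and box-level visual features; the patent does not prescribe a particular detector, and a full class distribution would come from the detector's classification logits rather than the top-class confidence used here.

```python
import torch
import torchvision
from torchvision.ops import roi_align

# Assumed stand-in for the pre-trained object detection network of step 2.
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

@torch.no_grad()
def detect_frame(frame):
    """frame: float tensor [3, H, W] scaled to [0, 1]."""
    det = detector([frame])[0]                         # boxes, labels, confidence scores
    # Note: calling the backbone directly skips the detector's internal
    # resize/normalize transform; this is a simplification for illustration.
    fmap = detector.backbone(frame.unsqueeze(0))["0"]  # stride-4 FPN level, 256 channels
    feats = roi_align(fmap, [det["boxes"]], output_size=7,
                      spatial_scale=0.25)              # [N_t, 256, 7, 7] box features
    return det["boxes"], det["labels"], det["scores"], feats.flatten(1)
```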
The language features of the building target described in the step 3 are specifically as follows:
the object class labels are mapped into 200-dimensional semantic embedded vectors by pre-trained GloVe-200 d. The semantic vector between two objects a and b in the t-th frame is expressed as
And (3) constructing the comprehensive characteristics of the character relation pairs and storing the comprehensive characteristics as a sparse matrix, wherein the comprehensive characteristics are as follows:
the token vector between two objects a and b in the t-th frame can be expressed as:
wherein the method comprises the steps of<,>Representing a stitching operation in the channel dimension,representing a flattening operation +.>Representing addition by element. W (W) s ,W o W is provided u Is a linear matrix used to compress the visual features to 512 dimensions. />Feature map representing joint box by Roialign calculation, f box Is a deformation function for transforming the boundary box corresponding to the subject and object into a transformation functionFeatures of the same shape.
The features given by Equation 1 are stored in a sparse matrix. Specifically, for C object categories and a video V with T frames, the input matrix can be simply represented as X ∈ R^{T×C×D}, where D represents the dimension of the input representation. The rows of the sparse matrix represent video frames and the columns represent pairwise combinations of the person and an object category. Since some of the objects in the object label set never appear in the video, the columns of the never-appearing objects are deleted from the sparse matrix to reduce redundancy, i.e. the final input matrix is denoted X ∈ R^{T×C'×D}, where C' represents the number of object categories that actually appear in the current video V.
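The sketch below illustrates how the pairwise features could be packed into such a frame-by-object-class matrix with never-appearing classes removed; the dictionary-based input format is an assumption made for illustration.

```python
import torch

def pack_pair_features(pair_feats, num_classes, feat_dim):
    """pair_feats: dict {(frame_idx, object_class): feature tensor of shape [feat_dim]}.

    Returns X of shape [T, C', feat_dim] plus the kept class ids, where columns
    of object classes that never appear in the video are dropped.
    """
    T = max(t for t, _ in pair_feats) + 1
    X = torch.zeros(T, num_classes, feat_dim)
    present = torch.zeros(num_classes, dtype=torch.bool)
    for (t, c), feat in pair_feats.items():
        X[t, c] = feat
        present[c] = True
    kept = torch.nonzero(present).squeeze(1)           # object classes that appear
    return X[:, kept, :], kept                         # [T, C', feat_dim]
```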
The construction of the multi-scale space-time attention network in the step 5 is specifically as follows:
5-1, constructing a multi-scale space encoder:
this step requires the construction of a global spatial encoder and a local spatial encoder.
5-1-1. Building a global space encoder:
input matrix for video VTaking the example of pulling the input of the encoder in which the t-th line (representing all occurrences of the character relation representation sequence in the t-th frame) is taken as input sequence +.>In a spatial encoder at the global scale, a single-headed dot product self-attention mechanism is employed. In this operation, Q, K, V shares the same input, and the resulting output after passing through the n-layer encoder is expressed as:
the encoder consists of n stacked MultiHeadAtts global (. Cndot.) the input of the nth layer is the output of the (n-1) th layer.
5-1-2. Build a local spatial encoder:
first, a center point of an object is calculated from a bounding box in which the object appears in each frame. Then, according to the distance between the center points of the objects, the nearest object of each object is obtained, and a mask matrix M epsilon R is constructed C'×C' Wherein the value is 0 or 1. When the value in the matrix M is 1, it indicates that the object is the nearest visual object to the current object; when the value is 0, this indicates that the object is an invisible object of the current object. The method comprises obtaining affinity matrix after matrix multiplication of Q and KThe dot product is calculated by the mask matrix M, and then matrix multiplication is performed by V. The output after passing through the n-layer encoder is expressed as:
splicing the spatial contexts acquired by the two scales in the channel dimension to be used as the output of a final multi-scale spatial encoder
5-2, constructing a multi-scale time decoder:
this step requires the construction of a long-term time decoder and a short-term time decoder.
5-2-1. Build long-time decoder:
output matrix for multi-scale spatial encoderTo select as input to the decoder the c-th column (representing the sequence of relation characterizations between the person and the object with category label c), i.e. the input sequence isQ, K, V share the same input and the output after passing through an n-layer decoder is expressed as:
5-2-2. Construct short-term time decoder:
similarly to 5-1-2, a mask matrix M εR with values of 0 or 1 is also provided in the decoder T×T For limiting the range of frames that can be of interest for each instant t. Where 1 represents a frame visible at the current time and 0 represents a frame not visible at the current time. The input to this module is still the output matrix of the multi-scale space encoder, while the mask matrix is additionally input. Similar to a multi-scale encoder, the output representation after passing through an n-layer decoder is shown below, which contains relational evolution information on a short-term time scale:
the output of the multi-scale decoder includes the captured long-term and short-term time-dependent information. This information is stitched together in the channel dimension to form the final decoder output.
The step 6 of constructing a pre-training model enhanced classification network is specifically as follows:
6-1. Design Prompt (Prompt):
a Prompt (Prompt) structure is designed in the form of 'a photo of a person [ relationship ] a/an [ object ]', wherein predicate tags are filled in the position of [ relationship ] and object tags are filled in the position of [ object ]. A (R C) sentence text description is generated for the R predicate tags and the C object tags.
6-2, generating text embedding of the text description statement:
text embeddings of text description statements are generated offline using a pre-trained CLIP text encoder and used as weights to initialize the learnable classifier. In a task for a specific data set, a different kind of relationship needs to be predicted for each pair of person relationships, and typically includes three relationships, namely, attention relationship, position relationship, and contact relationship. To achieve this goal, it is necessary to divide the text embedding into 3 categories and use them to initialize the weights of the 3 classifiers.
6-3, fine tuning a classifier:
to adapt to a particular dataset, the classifier is fine-tuned and the learning rate is set to 2e -5 Training is performed. The fine tuning process optimizes the classifier weights by performing supervised learning on specific task datasets so that they can better adapt to specific relationship classification tasks.
The loss function described in step 7 is specifically as follows:
7-1. Compute the gap between the object class prediction distribution O_i and its ground-truth label O_i^gt, using the cross entropy (softmax cross entropy) loss:

Loss_obj = CrossEntropy(O_i, O_i^gt)

7-2. Compute the gap between the attention relationship class prediction distribution r_ai and its ground-truth label r_ai^gt, using the cross entropy (softmax cross entropy) loss:

Loss_rel_a = CrossEntropy(r_ai, r_ai^gt)

7-3. Compute the gap between the position relationship class prediction distribution r_si and its ground-truth label r_si^gt, using the binary cross entropy (sigmoid binary cross entropy) loss:

Loss_rel_s = BinaryCrossEntropy(r_si, r_si^gt)

7-4. Compute the gap between the contact relationship class prediction distribution r_ci and its ground-truth label r_ci^gt, using the binary cross entropy (sigmoid binary cross entropy) loss:

Loss_rel_c = BinaryCrossEntropy(r_ci, r_ci^gt)

7-5. The total model loss is:

Loss = Loss_obj + Loss_rel_a + Loss_rel_s + Loss_rel_c   (Equation 12)
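A sketch of this loss combination, assuming the classification heads output logits; PyTorch's cross-entropy and binary cross-entropy with logits stand in for the softmax and sigmoid cross-entropies named above.

```python
import torch.nn.functional as F

def total_loss(obj_logits, obj_gt, att_logits, att_gt,
               spa_logits, spa_gt, con_logits, con_gt):
    """Loss = Loss_obj + Loss_rel_a + Loss_rel_s + Loss_rel_c (Equation 12)."""
    loss_obj = F.cross_entropy(obj_logits, obj_gt)            # softmax cross-entropy
    loss_rel_a = F.cross_entropy(att_logits, att_gt)          # attention relationship
    loss_rel_s = F.binary_cross_entropy_with_logits(          # position relationship
        spa_logits, spa_gt.float())                           # (multi-label)
    loss_rel_c = F.binary_cross_entropy_with_logits(          # contact relationship
        con_logits, con_gt.float())
    return loss_obj + loss_rel_a + loss_rel_s + loss_rel_c
```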
The training model described in the step 8 is specifically as follows:
and (3) carrying out gradient feedback on the model parameters of the neural network in the step (5) and the step (6) by using a back propagation algorithm according to the loss value generated by the loss function in the step (7), and continuously optimizing until the whole network model is converged.
The model predictive value calculation in the step 9 is specifically as follows:
and (3) carrying out generation prediction of the video scene graph on the converged model, respectively inputting the output results obtained in the step (5) into the initialized class 3 classifier in the step (6), and deciding the final classification prediction result according to different judgment standards. The judgment policy is as follows: 1) With constraint policies, only one predicate is allowed to exist between each person pair. This strategy constrains the number of relationships between pairs of people in the generated scene graph. 2) An unconstrained strategy allows multiple predicates to exist between each person pair. There is no limit to the number of relationships between pairs of characters in the scene graph generated under such a policy.
The invention is characterized in that: in the spatial dimension, beyond the traditional method, the invention additionally attends to the local spatial relationship between the two objects whose detected bounding-box center points are closest. Modeling relationships by selecting the person-object pairs whose spatial locations are closest highlights the spatial proximity between them. Compared with conventional methods, such precise modeling facilitates a better understanding of the interaction dynamics between people and objects and provides more accurate semantic analysis results. In the temporal dimension, the invention additionally attends to the long-term temporal relationship of the same pair of objects across all frames. By attending to the dynamic evolution of the same pair of objects across all frames, long-term temporal modeling can model the long-term relationships between objects more comprehensively, which helps to generate more accurate and consistent scene graphs. In cases of occlusion and overlap, objects may be gradually exposed over multiple frames, and long-term temporal modeling can capture these patterns of persistent behavior across the occlusion. In conclusion, the multi-scale modeling idea introduced in the spatial and temporal dimensions is of great significance for the task of modeling dynamic fine-grained video semantics.
Drawings
Fig. 1: the overall architecture of a multi-scale spatiotemporal attention network.
Fig. 2: attention design under multi-scale space encoders and temporal decoders.
Fig. 3: the CLIP model enhances classification networks.
Detailed Description
The detailed parameters of the present invention are described in further detail below.
As shown in Figs. 1, 2 and 3, the invention provides a video scene graph generation method based on a multi-scale spatio-temporal attention network.
The step 1 is specifically realized as follows:
70% of the data in the dataset was used for training, the remaining 30% for validation and testing.
The step 2 of extracting features from the video frame by using the pre-trained target detection network and predicting category distribution is specifically as follows:
For an input video V = [I_1, I_2, …, I_T] (where T represents the number of frames of the video and depends on the actual data of each video in the dataset), each frame I_t is passed through the detector to obtain its bounding boxes {b_t^i}, their class distributions {d_t^i}, and the visual feature v_t^i corresponding to each bounding box.
The semantic features of the building target described in the step 3 are specifically as follows:
the object class labels are mapped into 200-dimensional semantic embedded vectors by pre-trained GloVe-200 d. The semantic vector between two objects a and b in the t-th frame is expressed as
The construction of the comprehensive features of the person-object relation pairs and their storage as a sparse matrix described in step 4 are specifically as follows:
taking the calculation of the token vector between two objects a and b in the t-th frame as an example. First, a linear matrix W is used s ,W o The visual features of objects a and b are compressed to 512 dimensions. Second, feature mapping of the combined frames of the objects a and b is obtained through RoIAlign calculationAnd converting the bounding box corresponding to the subject and object into a text corresponding to +.>The same shape features, transformation function f box The joint frame feature map is added to the feature of the bounding box transformation and passed through a linear matrix W u Compressed to 512 dimensions. Finally, the semantic feature obtained in the step 3 is +.>And splicing the features to obtain characterization vectors of the objects a and b in the t-th frame, wherein the characterization vectors are 512+512+512+200+200=1936 dimensions.
The features are then stored in a sparse matrix. Specifically, for C object categories and a video V with T frames, the input matrix can be simply represented as X ∈ R^{T×C×D}, where D represents the dimension of the input representation (i.e. 1936). The rows of the sparse matrix represent video frames and the columns represent pairwise combinations of the person and an object category. Since some of the objects in the object label set never appear in the video, the columns of the never-appearing objects are deleted from the sparse matrix to reduce redundancy, i.e. the final input matrix is denoted X ∈ R^{T×C'×D}, where C' represents the number of object categories that actually appear in the current video V.
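The following sketch assembles such a 1936-dimensional representation; the 2048-dimensional detector features, the 256 x 7 x 7 union-box map and the RoIAlign scale are assumptions chosen only to match the stated 512 + 512 + 512 + 200 + 200 layout.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

W_s = nn.Linear(2048, 512)          # subject visual feature compression
W_o = nn.Linear(2048, 512)          # object visual feature compression
W_u = nn.Linear(256 * 7 * 7, 512)   # union-box feature compression
f_box = nn.Sequential(              # deformation of the two boxes into a 256x7x7 map
    nn.Linear(8, 256 * 7 * 7), nn.ReLU())

def pair_feature(v_s, v_o, fmap, box_s, box_o, s_s, s_o):
    """v_s, v_o: [2048]; fmap: [1, 256, H, W]; box_*: [4] (x1, y1, x2, y2); s_*: [200]."""
    union = torch.stack([torch.minimum(box_s[:2], box_o[:2]),
                         torch.maximum(box_s[2:], box_o[2:])]).reshape(1, 4)
    u = roi_align(fmap, [union], output_size=7, spatial_scale=0.25)   # [1, 256, 7, 7]
    b = f_box(torch.cat([box_s, box_o]).unsqueeze(0)).view(1, 256, 7, 7)
    joint = W_u((u + b).flatten(1)).squeeze(0)          # element-wise add, flatten, 512-d
    return torch.cat([W_s(v_s), W_o(v_o), joint, s_s, s_o])   # 1936-d in total
```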
The construction of the multi-scale space-time attention network in the step 5 is specifically as follows:
5-1, constructing a multi-scale space encoder:
this step requires the construction of a global spatial encoder and a local spatial encoder.
5-1-1. Building a global space encoder:
input matrix for video VTaking the example of pulling the input of the encoder in which the t-th line (representing all occurrences of the character relation representation sequence in the t-th frame) is taken as input sequence +.>Where d=1936. In a spatial encoder at the global scale, a single-headed dot product self-attention mechanism is employed. In this operation Q, K, V shares the same input and the resulting output of the global space encoder after passing through the n-layer encoder. The encoder consists of n stacked MultiHeadAtts global (. Cndot.) the input of the nth layer is the output of the (n-1) th layer. N set by the method is 1, namely, only 1 layer of global space encoder is stacked.
5-1-2. Build a local spatial encoder:
first, a center point of an object is calculated from a bounding box in which the object appears in each frame. Then, according to the distance between the center points of the objects, the nearest object of each object is obtained, and a mask matrix M epsilon R is constructed C'×C' Wherein the value is 0 or 1. When the value in the matrix M is 1, it indicates that the object is the nearest visual object to the current object; when the value is 0, this indicates that the object is an invisible object of the current object. The method comprises obtaining affinity matrix A εR after matrix multiplication of Q and K C'×C' The dot product is calculated by the mask matrix M, and then matrix multiplication is performed by V. Finally, the output of the local space encoder is obtained after the n layers of encoders. The method sets n to 1, i.e. stacks only 1 layer of local spatial encoders.
The spatial contexts obtained at the two scales are spliced in the channel dimension as the output of the final multi-scale spatial encoder X_s.
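A sketch of the nearest-object mask M ∈ R^{C'×C'} built from bounding-box centers, as referenced in 5-1-2; the box format (x1, y1, x2, y2) and the choice to let each object also attend to itself are assumptions.

```python
import torch

def nearest_object_mask(boxes):
    """boxes: [C', 4] boxes (x1, y1, x2, y2) of the objects present in one frame.

    Returns M of shape [C', C'] with 1 where the column object is the nearest
    (visible) object of the row object, and 0 elsewhere; each object also sees itself.
    """
    centers = (boxes[:, :2] + boxes[:, 2:]) / 2            # box center points
    dist = torch.cdist(centers, centers)                   # pairwise center distances
    dist.fill_diagonal_(float("inf"))                      # exclude self when picking nearest
    nearest = dist.argmin(dim=1)                           # nearest neighbour per object
    M = torch.eye(len(boxes))
    M[torch.arange(len(boxes)), nearest] = 1.0
    return M                                               # used as A.mul(M) before @ V
```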
5-2, constructing a multi-scale time decoder:
this step requires the construction of a long-term time decoder and a short-term time decoder.
5-2-1. Build long-time decoder:
output matrix for multi-scale spatial encoderTo select as input to the decoder the c-th column (representing the sequence of relation characterizations between the person and the object with category label c), i.e. the input sequence isWhere d=1936. Q, K, V share the same input and the output of the long-time decoder is obtained after passing through the n-layer decoder. The method sets n to 1, i.e. stacks only 1 layer long time decoder.
5-2-2. Construct short-term time decoder:
similarly to 5-1-2, a mask matrix M εR with values of 0 or 1 is also provided in the decoder T×T For limiting the range of frames that can be of interest for each instant t t-p ,I t+q ]Where p represents the previous p-frame of the current frame and q represents the next q-frame of the current frame. The invention sets p to 1 and q to 0. In the mask matrix, 1 indicates a frame visible at the current time, and 0 indicates a frame invisible at the current time. The input to this module is still the output matrix of the multi-scale space encoder, while the mask matrix is additionally input. The specific implementation is similar to a multi-scale encoder, and the output of the short-term time decoder is obtained after passing through an n-layer decoder. The method sets n to 1, i.e. stacks only 1 layer short-term time decoders.
The output of the multi-scale decoder includes the captured long-term and short-term time-dependent information. This information is stitched together in the channel dimension to form the final decoder output.
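A sketch of the T×T short-term frame mask with p = 1 and q = 0 as set above, i.e. each frame may attend to itself and to the previous frame; returning a boolean "disallowed" mask for use with standard attention APIs is an implementation assumption.

```python
import torch

def short_term_mask(T, p=1, q=0):
    """Band mask over frames: frame t may attend to frames t-p .. t+q (inclusive)."""
    t = torch.arange(T)
    offset = t.unsqueeze(0) - t.unsqueeze(1)        # offset[i, j] = j - i
    visible = (offset >= -p) & (offset <= q)        # 1 = visible frame, 0 = not visible
    return ~visible                                  # True marks frames to be masked out

# e.g. short_term_mask(4) disallows everything except the current and previous frame.
```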
The step 6 of constructing a pre-training model enhanced classification network is specifically as follows:
6-1. Design Prompt (Prompt):
a Prompt (Prompt) structure is designed in the form of 'a photo of a person [ relationship ] a/an [ object ]', wherein predicate tags are filled in the position of [ relationship ] and object tags are filled in the position of [ object ]. A (R C) sentence text description is generated for the R predicate tags and the C object tags.
6-2, generating text embedding of the text description statement:
text embeddings of text description statements are generated off-line using a pre-trained CLIP text encoder (512-d), and these embeddings are used as weights to initialize the learnable classifier. In a task for a specific data set, a different kind of relationship needs to be predicted for each pair of person relationships, and typically includes three relationships, namely, attention relationship, position relationship, and contact relationship. To achieve this goal, it is necessary to divide the text embedding into 3 categories and use them to initialize the weights of the 3 classifiers.
6-3, fine tuning a classifier:
to adapt to a particular dataset, the classifier is fine-tuned and the learning rate is set to 2e -5 Training is performed. The fine tuning process optimizes the classifier weights by performing supervised learning on specific task datasets so that they can better adapt to specific relationship classification tasks.
The loss function described in step 7 is specifically as follows:
7-1. Compute the gap between the object class prediction distribution O_i and its ground-truth label O_i^gt; the cross entropy (softmax cross entropy) loss is used here.
7-2. Compute the gap between the attention relationship class prediction distribution r_ai and its ground-truth label r_ai^gt; the cross entropy (softmax cross entropy) loss is used here.
7-3. Compute the gap between the position relationship class prediction distribution r_si and its ground-truth label r_si^gt; the binary cross entropy (sigmoid binary cross entropy) loss is used here.
7-4. Compute the gap between the contact relationship class prediction distribution r_ci and its ground-truth label r_ci^gt; the binary cross entropy (sigmoid binary cross entropy) loss is used here.
7-5. The total model loss is the sum of these losses.
The training model described in the step 8 is specifically as follows:
and (3) carrying out gradient feedback on the model parameters of the neural network in the step (5) and the step (6) by utilizing a back propagation algorithm according to the loss value generated by the loss function in the step (7) until the whole network model is converged, namely the loss value of the model approaches to 0 and does not have a descending trend.
The model predictive value calculation in the step 9 is specifically as follows:
and (3) carrying out generation prediction of the video scene graph on the converged model, respectively inputting the output results obtained in the step (5) into the initialized class 3 classifier in the step (6), and deciding the final classification prediction result according to different judgment standards. The judgment policy is as follows: 1) With constraint policies, only one predicate is allowed to exist between each person pair. This strategy constrains the number of relationships between pairs of people in the generated scene graph. 2) An unconstrained strategy allows multiple predicates to exist between each person pair. There is no limit to the number of relationships between pairs of characters in the scene graph generated under such a policy.

Claims (8)

1. The method for generating the video scene graph based on the multi-scale space-time attention network is characterized by comprising the following steps of:
step 1: dividing a data set;
step 2: extracting features from the video frames by using a pre-trained target detection network, and classifying targets; predicting the category of each detected object as a priori condition of the model;
step 3: language features of build targets
Converting the classification result in the step 2 into word vectors containing semantic information according to a pre-trained word vector model;
step 4: building comprehensive features of character relation pairs and storing the comprehensive features as sparse matrixes
Performing paired splicing among subjects and objects on the features obtained in the step 2 and the step 3, wherein the paired splicing comprises visual features of the subjects, visual features of the objects, language features of the subjects, language features of the objects and joint features between the subjects and the objects;
for all frames in a video, storing paired features between subjects and objects appearing in the video into a sparse matrix, wherein the number of rows of the matrix represents the number of frames of the video, the number of columns of the matrix represents the number of categories of objects, and the meaning of each column is a relation pair between the same object and a person;
step 5: construction of a multiscale spatiotemporal attention network
Inputting the integrated features of step 4 into a multi-scale spatiotemporal attention network comprising: a spatial encoder and a temporal decoder; the space encoder comprises local space coding and global space coding, and the time decoder comprises long-term time decoding and short-term time decoding;
step 6: building a pre-trained model enhanced classification network
Inputting the output result of the step 5 into a classification network enhanced by the pre-training model to carry out final relationship classification so as to enhance the interactive understanding of visual language; finally outputting attention relation type predictive vectors, position relation type predictive vectors and contact relation type predictive vectors;
step 7: loss function
For the object classification in the step 2, inputting the prediction vector and the target vector into a loss function, and calculating a loss value; for the relation classification of the step 6, respectively inputting the 3 types of predictive vectors output by the relation classification and the corresponding target vectors into corresponding loss functions, and respectively outputting 3 loss values;
step 8: training model
Carrying out gradient feedback on the model parameters of the neural network in the step 6 by using a back propagation algorithm according to the loss value generated by the loss function in the step 7, and continuously optimizing until the whole network model is converged, namely the training loss of the model is reduced to a certain range and is not reduced any more;
step 9: network predictor calculation
And (3) sorting according to the prediction vectors output in the step (6), and selecting a final classification prediction result according to different judgment standards.
2. The method for generating a video scene graph based on a multi-scale spatio-temporal attention network according to claim 1, wherein the extracting features of the video frames with the pre-trained object detection network in step 2 is specifically as follows:
for an input video V = [I_1, I_2, …, I_T], where T represents the number of frames of the video,
each frame I_t is passed through the detector to obtain its bounding boxes {b_t^i}, their class distributions {d_t^i}, and the visual feature v_t^i corresponding to each bounding box, where b represents an object bounding box, d represents an object class distribution, and v represents an object visual feature.
3. The method for generating a video scene graph based on a multi-scale spatiotemporal attention network according to claim 2, wherein the language features of the build object in step 3 are as follows:
mapping the object class labels into 200-dimensional semantic embedding vectors through the pre-trained GloVe-200d model; the semantic vectors of two objects a and b in the t-th frame are denoted s_t^a and s_t^b.
4. The method for generating a video scene graph based on a multi-scale spatiotemporal attention network according to claim 3, wherein the comprehensive features of the person relationship pairs constructed in step 4 are stored as sparse matrices, specifically as follows:
the relation representation vector between two objects a and b in the t-th frame can be expressed as:
x_t^{ab} = < W_s v_t^a, W_o v_t^b, W_u φ(u_t^{ab} ⊕ f_box(b_t^a, b_t^b)), s_t^a, s_t^b >   (Equation 1)
wherein <,> denotes the splicing operation in the channel dimension, φ(·) denotes a flattening operation, and ⊕ denotes element-wise addition; W_s, W_o and W_u are linear matrices for compressing the visual features to 512 dimensions; u_t^{ab} denotes the feature map of the joint box computed by RoIAlign, and f_box is a deformation function for transforming the bounding boxes corresponding to the subject and object into features of the same shape as u_t^{ab};
storing the features given by Equation 1 into a sparse matrix: for C object categories and a video V with T frames, the input matrix is represented as X ∈ R^{T×C×D},
wherein D represents the dimension of the input representation; the rows of the sparse matrix represent video frames and the columns represent pairwise combinations of the person and the objects; the final input matrix is expressed as X ∈ R^{T×C'×D}, where C' represents the number of object categories that actually appear in the current video V.
5. The method for generating a video scene graph based on a multi-scale spatiotemporal attention network of claim 4, characterized by constructing a multi-scale spatiotemporal attention network of step 5 comprising the steps of:
5-1, constructing a multi-scale space encoder:
5-1-1. Building a global space encoder:
input matrix for video VThe input sequence is +.>Wherein t represents the t frame of the video; in a spatial encoder at a global scale, a single-head dot product self-attention mechanism is adopted; in this operation, Q, K, V shares the same input, and the resulting output after passing through the n-layer encoder is expressed as:
the encoder consists of n stacked MultiHeadAtts global (. Cndot.) the input of the nth layer is the output of the (n-1) th layer;
5-1-2. Build a local spatial encoder:
calculating the center point of the object according to the boundary frame of the object in each frame;
obtaining the nearest object of each object according to the distances between the object center points, and constructing a mask matrix M ∈ R^{C'×C'} whose values are 0 or 1; when a value in the matrix M is 1, it indicates that the object is the nearest visible object of the current object; when the value is 0, it indicates that the object is invisible to the current object;
the affinity matrix A ∈ R^{C'×C'} obtained from the matrix multiplication of Q and K is multiplied element-wise with the mask matrix M and then matrix-multiplied with V; the output after the n-layer encoder is the local spatial context X_local^{(n)};
splicing the spatial contexts acquired at the two scales in the channel dimension as the output X_s of the final multi-scale spatial encoder;
5-2, constructing a multi-scale time decoder:
5-2-1. Build long-time decoder:
output matrix for multi-scale spatial encoderTo select the c-th column as the input of the decoder, the input sequence is +.>Q, K, V share the same input and the output after passing through an n-layer decoder is expressed as:
5-2-2. Construct short-term time decoder:
a mask matrix M ∈ R^{T×T} whose values are 0 or 1 is also set in the decoder to restrict the range of frames that can be attended to at each time step t; wherein 1 represents a frame visible at the current time and 0 represents a frame invisible at the current time; the input of this module is the output matrix of the multi-scale spatial encoder, with the mask matrix as an additional input; the output after the n-layer decoder, which contains the relation-evolution information on the short-term time scale, is the short-term temporal context X_short^{(n)};
the output of the multi-scale decoder includes captured long-term and short-term time dependent information; this information is stitched together in the channel dimension to form the final decoder output.
6. The method for generating a video scene graph based on a multi-scale spatiotemporal attention network of claim 5, wherein said constructing a pre-trained model enhanced classification network of step 6 comprises the steps of:
6-1, design prompting:
a text prompt structure with [ relation ] and [ object ] is designed, wherein predicate labels are filled in the position of [ relation ] and object labels are filled in the position of [ object ]; generating (R x C) sentence text descriptions for the R predicate tags and the C object tags;
6-2, generating text embedding of the text description statement:
generating text inserts of text description sentences off-line with a pre-trained CLIP text encoder and using these inserts as weights for initializing a learnable classifier;
6-3, fine tuning a classifier:
fine-tuning the classifier with the learning rate set to 2e-5; the fine-tuning optimizes the weights of the classifier by supervised learning on the specific task dataset.
7. A method of generating a video scene graph based on a multi-scale spatiotemporal attention network as recited in claim 6, wherein said loss function of step 7 comprises the steps of:
7-1. computing the gap between the object class prediction distribution O_i and its real label O_i^gt, using the cross entropy (softmax cross entropy) loss Loss_obj;
7-2. computing the gap between the attention relationship class prediction distribution r_ai and its real label r_ai^gt, using the cross entropy (softmax cross entropy) loss Loss_rel_a;
7-3. computing the gap between the position relationship class prediction distribution r_si and its real label r_si^gt, using the binary cross entropy (sigmoid binary cross entropy) loss Loss_rel_s;
7-4. computing the gap between the contact relationship class prediction distribution r_ci and its real label r_ci^gt, using the binary cross entropy (sigmoid binary cross entropy) loss Loss_rel_c;
7-5. the total model loss is:
Loss = Loss_obj + Loss_rel_a + Loss_rel_s + Loss_rel_c (Equation 12).
8. The method for generating a video scene graph based on a multi-scale spatiotemporal attention network of claim 1, wherein the criteria comprises:
with constraint strategy, only one predicate is allowed to exist between each person pair; this strategy constrains the number of relationships between pairs of characters in the generated scene graph;
an unconstrained strategy allowing multiple predicates to exist between each person pair; there is no limit to the number of relationships between pairs of characters in the scene graph generated under such a policy.
CN202311048203.3A 2023-08-21 2023-08-21 Video scene graph generation method based on multi-scale space-time attention network Pending CN117115706A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311048203.3A CN117115706A (en) 2023-08-21 2023-08-21 Video scene graph generation method based on multi-scale space-time attention network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311048203.3A CN117115706A (en) 2023-08-21 2023-08-21 Video scene graph generation method based on multi-scale space-time attention network

Publications (1)

Publication Number Publication Date
CN117115706A true CN117115706A (en) 2023-11-24

Family

ID=88801431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311048203.3A Pending CN117115706A (en) 2023-08-21 2023-08-21 Video scene graph generation method based on multi-scale space-time attention network

Country Status (1)

Country Link
CN (1) CN117115706A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117372936A (en) * 2023-12-07 2024-01-09 江西财经大学 Video description method and system based on multi-mode fine granularity alignment network
CN117372936B (en) * 2023-12-07 2024-03-22 江西财经大学 Video description method and system based on multi-mode fine granularity alignment network

Similar Documents

Publication Publication Date Title
Li et al. Groupformer: Group activity recognition with clustered spatial-temporal transformer
CN113673489B (en) Video group behavior identification method based on cascade Transformer
US10503978B2 (en) Spatio-temporal interaction network for learning object interactions
CN110929092B (en) Multi-event video description method based on dynamic attention mechanism
Oliver et al. Layered representations for human activity recognition
CN111581437A (en) Video retrieval method and device
CN113221641B (en) Video pedestrian re-identification method based on generation of antagonism network and attention mechanism
CN110914836A (en) System and method for implementing continuous memory bounded learning in artificial intelligence and deep learning for continuously running applications across networked computing edges
CN110580292A (en) Text label generation method and device and computer readable storage medium
CN113723166A (en) Content identification method and device, computer equipment and storage medium
Jun A forecasting model for technological trend using unsupervised learning
CN112804558B (en) Video splitting method, device and equipment
CN117115706A (en) Video scene graph generation method based on multi-scale space-time attention network
Wang et al. Reliable identification of redundant kernels for convolutional neural network compression
CN113961736A (en) Method and device for generating image by text, computer equipment and storage medium
CN114677631B (en) Cultural resource video Chinese description generation method based on multi-feature fusion and multi-stage training
CN113420179B (en) Semantic reconstruction video description method based on time sequence Gaussian mixture hole convolution
CN114627282A (en) Target detection model establishing method, target detection model application method, target detection model establishing device, target detection model application device and target detection model establishing medium
CN117315070A (en) Image generation method, apparatus, electronic device, storage medium, and program product
CN114782995A (en) Human interaction behavior detection method based on self-attention mechanism
CN114550159A (en) Image subtitle generating method, device and equipment and readable storage medium
CN115705756A (en) Motion detection method, motion detection device, computer equipment and storage medium
Ren The advance of generative model and variational autoencoder
Nayak et al. Learning a sparse dictionary of video structure for activity modeling
Yang et al. 3D U-Net for Video Anomaly Detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination