CN117372706A

CN117372706A - Multi-scale deformable character interaction relation detection method

Info

Publication number: CN117372706A
Application number: CN202310846089.2A
Authority: CN
Inventors: 贾海涛; 余梦鹏; 张宏博; 张钰琪
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2023-07-11
Filing date: 2023-07-11
Publication date: 2024-01-09

Abstract

The invention relates to the field of human interaction relation detection in the field of image understanding, in particular to a multi-scale deformable human interaction relation detection method. At present, due to the lack of multi-scale features, the conventional algorithm based on the Transformer is difficult to accurately identify small targets from the high-resolution feature map so as to influence the prediction of the human interaction relationship, and the addition of the multi-scale features can provide new features for the human interaction relationship detection algorithm, but the complexity is also increased sharply due to the addition of the features. In order to solve the problems, the invention provides a character interaction relation detection algorithm based on the improvement of a QPIC algorithm. Secondly, the recognition accuracy is improved by introducing the multi-scale features. Thirdly, the feature vector is initially encoded through a multi-scale deformable attention module, and feature points with the most remarkable features are sampled, so that the algorithm is light in weight, and the calculation complexity is reduced.

Description

Multi-scale deformable character interaction relation detection method

Technical Field

The invention relates to the field of human interaction relation detection in the field of image understanding, in particular to a multi-scale deformable human interaction relation detection method.

Background

The research background of human interaction relation detection can be traced back to an early stage in the field of computer vision, in order for a computer to better understand an image, it is necessary to not only identify objects in the image, but also understand the relation between these objects, and the role they play in the image. The detection of the interaction relation of the person aims to solve the problem of deep semantic understanding between the person and the object in the image.

The DETR structure minimally modifies the transducer structure, largely preserving the characteristics of the transducer, and is a milestone-like ride-through, as opposed to the prior art structures based on fast R-CNN. Thanks to the high scalability of the DETR architecture, as it enlarges the heterochromatic in the field of target detection, various network structures based on DETR come out endlessly, and have achieved significant achievements in the respective fields. In the task of detecting the human interaction relationship, the QPIC changes the query vector for the entity target into the query vector for the human interaction pair on the basis of DETR, and meanwhile, increases the human interaction relationship detection head, so that the context information is effectively aggregated, and the structure of the transducer is applied to human interaction relationship detection for the first time and achieves good effects, but the following problems still exist:

the traditional convolution-based algorithm is very mature in the field of target detection, and can extract multi-scale features through FPN to optimize the detection effect, but the conventional transform-based algorithm is difficult to accurately identify small targets from a high-resolution feature map due to lack of multi-scale features, so that prediction of human interaction relations is affected, huge calculation cost is generated when the transform is directly used for noticing to process the multi-scale feature map, and the conventional transform-based human interaction relation detection algorithm is limited to use of a single-scale feature map. Because of this limitation, previous transducer-based approaches have exhibited undesirable performance, particularly in situations where background information exists on humans, objects, and interactions between them at different scales.

While the addition of multi-scale features may provide new features for human interaction detection algorithms, the addition of features may also lead to dramatic increases in complexity. Moreover, the QPIC algorithm itself is computationally complex, and adding multi-scale features directly only increases the complexity of the algorithm to an unacceptable level. Therefore, how to reduce the algorithm complexity is also one of the issues of intensive research.

The single-phase approach of the last two years typically builds a feature extractor consisting of a hierarchical backbone of CNNs (e.g., hourglass-104, DLA-34, resNet-50, and ResNet 101) and a transfomer encoder. However, these approaches ignore two drawbacks of using the CNN backbone. First, CNNs are poor in capturing non-local semantic features, and cannot establish a relationship between pixels far apart, i.e., cannot obtain a global receptive field, such as a relationship between a person and an object, and in addition, even if the number of network layers is deepened, the fewer feasible information transfer paths between pixels far apart are, the average global receptive field cannot be achieved. Second, the use of large receptive low resolution feature maps ignores spatial information over a small range, and is still affected even though the attention-oriented transform encoder can supplement semantic information from the image.

Disclosure of Invention

The present character interaction relation detection algorithm based on the Transformer is limited to use of a single-scale feature map, the complexity of the algorithm is also greatly increased by directly introducing multi-scale features, and the invention aims to overcome the defects in the discussion and provide a multi-scale deformable character interaction relation detection method.

The method for detecting the multi-scale deformable character interaction relationship comprises the following specific processes:

step 1: giving an original image, inputting the original image into a Swin transform network, extracting the feature images of the last three layers, and performing dimension reduction through 1x1 convolution to obtain an image feature vector.

Step 2: and the feature vector is initially encoded through a multi-scale deformable attention module, feature points with the most remarkable features are sampled, and the computational complexity is reduced.

Step 3: the query vector passes through the self-attention module in the decoder, and then is sent to the double-flow character entity attention mechanism together with the feature vector obtained by the encoder to carry out the operation of cross attention, wherein the cross attention is divided into a double-flow network, and the features of the human and the object are extracted in a finer way.

Step 4: and respectively obtaining four predictions of an object boundary frame, a human body boundary frame, an object category and a person interaction category through the FFN full-connection layer.

Compared with the prior art, the invention has the beneficial effects that:

(1) Compared with the traditional CNN backbone network, the invention uses the Swin Transformer to enhance the feature extraction capability;

(2) For character entities with smaller targets, the method introduces multi-scale features to improve the recognition accuracy;

(3) In algorithm complexity, the invention uses a multi-scale deformable attention mechanism to reduce the number of sampling points, so that the algorithm is light.

Drawings

Fig. 1 is: multi-scale deformable single-stage character interaction detection algorithm schematic diagram

Fig. 2 is: multi-scale deformable attention schematic diagram

Fig. 3 is: reference point and sample point schematic

Fig. 4 is: schematic diagram of attention mechanism of character entity under multi-scale condition

Fig. 5 is: schematic diagram of attention mechanism of double-flow character entity under multi-scale condition

Detailed Description

The following describes the present embodiment with reference to the accompanying drawings, and a specific process of a multi-scale deformable character interaction relation detection method is as follows:

step 1: FIG. 1 is a scale-upDeformed single-stage character interaction detection algorithm schematic diagram, specifically, inputting one sheetUsing a signature of the last three stages of the Swin transformA 1x1 convolution is used to convolve the feature map x ₁ 、x ₂ And x ₃ From dimension C _s Projected to dimension C _d . Then, multi-scale feature map x ₁ 、x ₂ X ₃ Straightened into a sequence and spliced to finally obtain C _d The feature vector of the dimension is merged into the position coding information at the same time and is used as the input of the encoder. Because of the introduction of multi-scale features, position coding requires the introduction of multi-scale feature level information for identifying from which level the feature comes, in addition to identifying the relative position in the image.

Step 2: given a multi-scale feature mapWherein->L represents the number of feature maps, and L is used to specifically identify the index of the feature map. P is p _q ∈[0,1] ² Then a two-dimensional mapping reference point in the multi-scale image feature vector represents the query element. Unlike single-scale deformable attention mechanisms, because of the incorporation of multi-scale features, two-dimensional mapping reference points need to find specific coordinates in feature maps of different scales at the same time, so that the reference point coordinates here need to be normalized instead of using absolute positions of the coordinates, and alignment of the positions of the reference points of multiple feature maps is facilitated. The deformable attention mechanism incorporating the multi-scale features is shown by equation (1):

like a single-scale deformable module, M representsIs the total number of attention headers in the multi-header attention mechanism, m represents the index of the index header, K represents the total number of the filtered sampling points, K represents the index of the subscript of the specific sampling point, and Deltap _mlqk Represents the offset of the sampling point relative to the reference point, ΔA _mlqk Attention weights representing the attention of multiple heads.This is to map the coordinates of the two-dimensional mapping reference points to the feature maps of different scales, so that the positions of the reference points in different scales are aligned. The overall process is more intuitively illustrated by the multi-scale deformable attention schematic shown in fig. 2. In the attention calculation process, each query element in the query vector is mapped into the feature vector of the image, which corresponds to z _q . The feature vectors are respectively mapped into feature graphs with different scales, and feature space initial reference points W 'in different scales are respectively obtained' _m x ^l Offset of reference point Δp _mlqk Attention weight ΔA _mlqk . Finally according to the formulaAnd adding the results of the feature graphs with different scales to obtain the output of the current attention layer. The structure of the multi-scale attention mechanism is substantially equivalent to the single-scale deformable attention mechanism, essentially equivalent to performing the single-scale attention mechanism multiple times at different scales, when l=1, k=1 are both completely equivalent.

Step 3: specifically, given an HOI query feature z _q Human entity attention determines sampling locations of human and object features, respectivelyAnd->As shown in FIG. 4, in order to make the feature points clearer, the original picture is changed into a gray level picture, black arrows in the picture represent full-connection layers, and red arrows represent human sampling in the feature pictures with different scalesThe dots, yellow arrows represent feature points of objects in the different scale feature map. First, z is _q Input into the full connection layer to get +.>And->The kth sample position and the mth attention head of the human and the object at the ith feature layer are shown in formulas (2), (3), respectively:

wherein h is _q Δh is the reference point and the sampling offset of the human, respectively. o (o) _q Δo is the reference point and the sampling offset of the object, respectively, which are defined byAnd->Is obtained through the full connection layer. Then, based on the sampled position, calculate human +.>And object->Is characterized by the following formula (4), (5):

step 4: due to the present invention cross-attention in the decoderInstead of the physical attention of the person for double flow, the double flow structure outputs the results of the cross attention respectivelyAnd->There is therefore also a need for corresponding improvements in the original character interaction relationship prediction head. The specific calculation is shown in the formulas (6), (7), (8) and (9):

as shown in FIG. 5, the human deformable attention output in the above formulaThe interaction detection head can thus obtain a human body frame. Output of the object deformable attention mechanism>The interaction detection head can obtain the object frame and the object category. Whereas prediction of interaction categories requires simultaneous reference +.>And->

Claims

1. A multi-scale deformable character interaction relation detection method is characterized in that: the method comprises the following specific processes:

step 1: giving an original image, inputting the original image into a Swin transform network, extracting the feature images of the last three layers, and performing dimension reduction through 1x1 convolution to obtain an image feature vector;

step 2: the feature vector is initially encoded through a multi-scale deformable attention module, feature points with the most remarkable features are sampled, and the calculation complexity is reduced;

step 3: the query vector in the decoder passes through the self-attention module and then is sent to the double-flow character entity attention mechanism together with the feature vector obtained by the encoder to carry out the operation of cross attention, wherein the cross attention is divided into a double-flow network, and the features of the human and the object are extracted in a finer way;

2. The method of claim 1, wherein: in the step 1, a Swin transform network is used for extracting the feature images of the last three stages, dimension reduction is carried out through 1x1 convolution, a multi-scale feature image is obtained, the multi-scale feature image is straightened into a sequence and spliced, a feature vector is finally obtained, and meanwhile, position coding information is integrated and multi-scale feature level information is introduced for identification.

3. The method of claim 1, wherein: and 2, introducing a multi-scale deformable attention module, mapping the feature vectors into feature graphs of different scales respectively, and adding the results of the feature graphs of different scales to obtain the output of the current attention layer.

4. The method of claim 1, wherein: the dual-stream character entity attention mechanism in the step 3 performs a cross-attention operation, where the cross-attention is divided into dual-stream networks.

5. The method of claim 1, wherein: the human interaction relation prediction head in the step 4 can obtain four predictions of a human body frame, an object category and a human interaction category.