CN117372706A - Multi-scale deformable character interaction relation detection method - Google Patents
Multi-scale deformable character interaction relation detection method
- Publication number
- CN117372706A (application CN202310846089.2A)
- Authority
- CN
- China
- Prior art keywords
- scale
- feature
- attention
- features
- deformable
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/7715—Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Computing Systems (AREA)
- Software Systems (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Molecular Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Social Psychology (AREA)
- Human Computer Interaction (AREA)
- Psychiatry (AREA)
- Image Analysis (AREA)
Abstract
The invention relates to human-object interaction detection within the field of image understanding, in particular to a multi-scale deformable human-object interaction detection method. At present, owing to the lack of multi-scale features, conventional Transformer-based algorithms struggle to accurately identify small targets in high-resolution feature maps, which degrades the prediction of human-object interactions; adding multi-scale features can supply new cues to an interaction detection algorithm, but also causes the complexity to rise sharply. To solve these problems, the invention provides an interaction detection algorithm that improves on the QPIC algorithm. First, a Swin Transformer backbone enhances feature extraction. Second, multi-scale features are introduced to improve recognition accuracy. Third, the feature vectors are initially encoded by a multi-scale deformable attention module that samples only the most salient feature points, making the algorithm lightweight and reducing computational complexity.
Description
Technical Field
The invention relates to human-object interaction detection in the field of image understanding, in particular to a multi-scale deformable human-object interaction detection method.
Background
The research background of human-object interaction detection traces back to the early days of computer vision: for a computer to understand an image well, it must not only identify the objects in the image, but also understand the relations between those objects and the roles they play. Human-object interaction detection aims at this deeper semantic understanding between humans and objects in an image.
The DETR architecture modifies the Transformer minimally, largely preserving its characteristics, and was a milestone compared with the earlier Faster R-CNN-based structures. Thanks to the high scalability of the DETR architecture, as it made its mark in object detection, DETR-based network structures have emerged in great numbers and achieved significant results in their respective fields. For human-object interaction detection, QPIC builds on DETR by changing the query vectors from entity targets to human-object interaction pairs and adding an interaction detection head, effectively aggregating context information; it applied the Transformer structure to interaction detection for the first time and achieved good results, but the following problems remain:
the traditional convolution-based algorithm is very mature in the field of target detection, and can extract multi-scale features through FPN to optimize the detection effect, but the conventional transform-based algorithm is difficult to accurately identify small targets from a high-resolution feature map due to lack of multi-scale features, so that prediction of human interaction relations is affected, huge calculation cost is generated when the transform is directly used for noticing to process the multi-scale feature map, and the conventional transform-based human interaction relation detection algorithm is limited to use of a single-scale feature map. Because of this limitation, previous transducer-based approaches have exhibited undesirable performance, particularly in situations where background information exists on humans, objects, and interactions between them at different scales.
While adding multi-scale features can supply new cues to an interaction detection algorithm, it also causes a sharp increase in complexity. Moreover, the QPIC algorithm is itself computationally heavy, and adding multi-scale features directly would push its complexity to an unacceptable level. How to reduce the algorithm's complexity is therefore also a key research question.
Single-stage approaches of the last two years typically build a feature extractor consisting of a hierarchical CNN backbone (e.g., Hourglass-104, DLA-34, ResNet-50, or ResNet-101) and a Transformer encoder. However, these approaches ignore two drawbacks of using a CNN backbone. First, CNNs are poor at capturing non-local semantic features and cannot establish relations between pixels that are far apart, i.e., they cannot obtain a global receptive field, such as the relation between a person and an object; even with deeper networks, the feasible information paths between distant pixels remain few, so a true global receptive field is still not achieved. Second, using low-resolution feature maps with large receptive fields discards spatial information at small scales, and even though the attention-based Transformer encoder can supplement semantic information from the image, performance is still affected.
Disclosure of Invention
Existing Transformer-based human-object interaction detection algorithms are limited to a single-scale feature map, while directly introducing multi-scale features greatly increases the algorithm's complexity. The invention aims to overcome the above drawbacks and provide a multi-scale deformable human-object interaction detection method.
The method for detecting the multi-scale deformable character interaction relationship comprises the following specific processes:
Step 1: Given an original image, input it into a Swin Transformer network, extract the feature maps of the last three stages, and reduce their dimension with a 1x1 convolution to obtain the image feature vectors.
Step 2: and the feature vector is initially encoded through a multi-scale deformable attention module, feature points with the most remarkable features are sampled, and the computational complexity is reduced.
Step 3: the query vector passes through the self-attention module in the decoder, and then is sent to the double-flow character entity attention mechanism together with the feature vector obtained by the encoder to carry out the operation of cross attention, wherein the cross attention is divided into a double-flow network, and the features of the human and the object are extracted in a finer way.
Step 4: and respectively obtaining four predictions of an object boundary frame, a human body boundary frame, an object category and a person interaction category through the FFN full-connection layer.
Compared with the prior art, the invention has the beneficial effects that:
(1) Compared with a traditional CNN backbone network, the invention uses a Swin Transformer to enhance the feature extraction capability;
(2) For human and object entities that are small targets, the invention introduces multi-scale features to improve recognition accuracy;
(3) Regarding algorithm complexity, the invention uses a multi-scale deformable attention mechanism to reduce the number of sampling points, making the algorithm lightweight.
Drawings
Fig. 1: Schematic of the multi-scale deformable single-stage human-object interaction detection algorithm
Fig. 2: Schematic of multi-scale deformable attention
Fig. 3: Schematic of reference points and sampling points
Fig. 4: Schematic of the human-object entity attention mechanism in the multi-scale case
Fig. 5: Schematic of the dual-stream human-object entity attention mechanism in the multi-scale case
Detailed Description
The embodiment is described below with reference to the drawings. The specific process of the multi-scale deformable human-object interaction detection method is as follows:
Step 1: Fig. 1 shows the multi-scale deformable single-stage human-object interaction detection algorithm. Specifically, an image is input and the feature maps of the last three stages of the Swin Transformer are taken; a 1x1 convolution projects the feature maps x_1, x_2 and x_3 from dimension C_s to dimension C_d. Then the multi-scale feature maps x_1, x_2 and x_3 are flattened into sequences and concatenated, finally yielding C_d-dimensional feature vectors, which are merged with the position-encoding information and used as the encoder input. Because multi-scale features are introduced, the position encoding must carry, besides the relative position within the image, multi-scale feature-level information identifying which level each feature comes from.
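A minimal NumPy sketch (an implementation assumption, not text from the patent) of the flatten-and-concatenate in step 1; `level_embed` is a hypothetical name for the learned per-scale embedding that carries the feature-level information, and the positional encoding is omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
C_d = 256
# Three feature maps of decreasing resolution, already reduced to C_d channels
# by the 1x1 convolution (random stand-ins for Swin Transformer outputs).
feature_maps = [rng.standard_normal((C_d, h, w)) for h, w in [(32, 32), (16, 16), (8, 8)]]
# Hypothetical learned per-scale embedding: one C_d vector per feature level.
level_embed = rng.standard_normal((len(feature_maps), C_d))

tokens = []
for l, x in enumerate(feature_maps):
    c, h, w = x.shape
    flat = x.reshape(c, h * w).T      # (H_l*W_l, C_d): flatten each map into a sequence
    flat = flat + level_embed[l]      # mark which level each token comes from
    tokens.append(flat)

encoder_input = np.concatenate(tokens, axis=0)   # concatenate the sequences
```

The resulting sequence of 32x32 + 16x16 + 8x8 = 1344 tokens is what the multi-scale deformable encoder then processes.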
Step 2: given a multi-scale feature mapWherein->L represents the number of feature maps, and L is used to specifically identify the index of the feature map. P is p q ∈[0,1] 2 Then a two-dimensional mapping reference point in the multi-scale image feature vector represents the query element. Unlike single-scale deformable attention mechanisms, because of the incorporation of multi-scale features, two-dimensional mapping reference points need to find specific coordinates in feature maps of different scales at the same time, so that the reference point coordinates here need to be normalized instead of using absolute positions of the coordinates, and alignment of the positions of the reference points of multiple feature maps is facilitated. The deformable attention mechanism incorporating the multi-scale features is shown by equation (1):
like a single-scale deformable module, M representsIs the total number of attention headers in the multi-header attention mechanism, m represents the index of the index header, K represents the total number of the filtered sampling points, K represents the index of the subscript of the specific sampling point, and Deltap mlqk Represents the offset of the sampling point relative to the reference point, ΔA mlqk Attention weights representing the attention of multiple heads.This is to map the coordinates of the two-dimensional mapping reference points to the feature maps of different scales, so that the positions of the reference points in different scales are aligned. The overall process is more intuitively illustrated by the multi-scale deformable attention schematic shown in fig. 2. In the attention calculation process, each query element in the query vector is mapped into the feature vector of the image, which corresponds to z q . The feature vectors are respectively mapped into feature graphs with different scales, and feature space initial reference points W 'in different scales are respectively obtained' m x l Offset of reference point Δp mlqk Attention weight ΔA mlqk . Finally according to the formulaAnd adding the results of the feature graphs with different scales to obtain the output of the current attention layer. The structure of the multi-scale attention mechanism is substantially equivalent to the single-scale deformable attention mechanism, essentially equivalent to performing the single-scale attention mechanism multiple times at different scales, when l=1, k=1 are both completely equivalent.
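A minimal NumPy sketch of equation (1) for a single query (an illustrative assumption, not code from the patent: random linear maps stand in for the learned projections that produce Δp_mlqk and A_mlqk from z_q, and nearest-neighbour sampling replaces the bilinear interpolation a real implementation would use):

```python
import numpy as np

def ms_deform_attn(z_q, p_hat, feats, M=2, K=2, seed=0):
    # Simplified single-query version of equation (1).
    # z_q: (C,) query feature; p_hat: normalized reference point in [0, 1]^2;
    # feats: list of L feature maps, each of shape (C, H_l, W_l).
    rng = np.random.default_rng(seed)
    C, L, d = feats[0].shape[0], len(feats), feats[0].shape[0] // M
    W_out = rng.standard_normal((M, C, d)) / np.sqrt(d)   # output projections W_m
    W_val = rng.standard_normal((M, d, C)) / np.sqrt(C)   # value projections W'_m
    # Random linear stand-ins for the layers predicting offsets/weights from z_q.
    dp = (rng.standard_normal((M * L * K * 2, C)) @ z_q).reshape(M, L, K, 2) * 0.02
    A = (rng.standard_normal((M * L * K, C)) @ z_q).reshape(M, L * K)
    A = np.exp(A - A.max(-1, keepdims=True))
    A = (A / A.sum(-1, keepdims=True)).reshape(M, L, K)   # softmax over (l, k)

    out = np.zeros(C)
    for m in range(M):
        head = np.zeros(d)
        for l, x in enumerate(feats):
            _, H, W = x.shape
            for k in range(K):
                # phi_l: map the normalized reference point onto level l's grid.
                px = (p_hat[0] + dp[m, l, k, 0]) * (W - 1)
                py = (p_hat[1] + dp[m, l, k, 1]) * (H - 1)
                ix = int(np.clip(round(px), 0, W - 1))    # nearest-neighbour sample
                iy = int(np.clip(round(py), 0, H - 1))    # (the real module is bilinear)
                head += A[m, l, k] * (W_val[m] @ x[:, iy, ix])
        out += W_out[m] @ head
    return out
```

Note the cost: only M·L·K feature points are sampled per query rather than attending over every pixel of every scale, which is exactly the lightweighting the step describes.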
Step 3: specifically, given an HOI query feature z q Human entity attention determines sampling locations of human and object features, respectivelyAnd->As shown in FIG. 4, in order to make the feature points clearer, the original picture is changed into a gray level picture, black arrows in the picture represent full-connection layers, and red arrows represent human sampling in the feature pictures with different scalesThe dots, yellow arrows represent feature points of objects in the different scale feature map. First, z is q Input into the full connection layer to get +.>And->The kth sample position and the mth attention head of the human and the object at the ith feature layer are shown in formulas (2), (3), respectively:
where ĥ_q and Δh are the reference point and sampling offset of the human, and ô_q and Δo are the reference point and sampling offset of the object, obtained from z_q through the fully connected layers. Then, based on the sampled positions, the human feature v^h_q and the object feature v^o_q are computed as in equations (4) and (5):

v^h_q = Σ_{m=1}^{M} W_m [ Σ_{l=1}^{L} Σ_{k=1}^{K} A^h_{mlqk} · W'_m x^l(P^h_{mlqk}) ]   (4)

v^o_q = Σ_{m=1}^{M} W_m [ Σ_{l=1}^{L} Σ_{k=1}^{K} A^o_{mlqk} · W'_m x^l(P^o_{mlqk}) ]   (5)
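The per-stream sampling positions can be sketched as follows; `stream_sampling_positions` is a hypothetical helper (an assumption, not from the patent) applied once with (ĥ_q, Δh) for the human stream and once with (ô_q, Δo) for the object stream:

```python
import numpy as np

def stream_sampling_positions(ref, offsets, level_shapes):
    # Sampling positions in the spirit of equations (2)-(3): phi_l(ref) + offset.
    # ref: normalized (2,) reference point (h_q for the human stream, o_q for
    # the object stream); offsets: (M, L, K, 2) normalized offsets (Δh or Δo);
    # level_shapes: [(H_l, W_l), ...] for the L feature levels.
    M, L, K, _ = offsets.shape
    pos = np.zeros((M, L, K, 2))
    for l, (H, W) in enumerate(level_shapes):
        scale = np.array([W - 1, H - 1], dtype=float)  # phi_l: normalized -> pixels
        pos[:, l] = (ref + offsets[:, l]) * scale
    return pos
```

Because the reference point is normalized, the same (ref, offset) pair lands at aligned positions on every feature level, which is the alignment property step 2 relies on.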
step 4: due to the present invention cross-attention in the decoderInstead of the physical attention of the person for double flow, the double flow structure outputs the results of the cross attention respectivelyAnd->There is therefore also a need for corresponding improvements in the original character interaction relationship prediction head. The specific calculation is shown in the formulas (6), (7), (8) and (9):
as shown in FIG. 5, the human deformable attention output in the above formulaThe interaction detection head can thus obtain a human body frame. Output of the object deformable attention mechanism>The interaction detection head can obtain the object frame and the object category. Whereas prediction of interaction categories requires simultaneous reference +.>And->
Claims (5)
1. A multi-scale deformable character interaction relation detection method is characterized in that: the method comprises the following specific processes:
step 1: giving an original image, inputting it into a Swin Transformer network, extracting the feature maps of the last three stages, and performing dimension reduction through a 1x1 convolution to obtain image feature vectors;
step 2: the feature vector is initially encoded through a multi-scale deformable attention module, feature points with the most remarkable features are sampled, and the calculation complexity is reduced;
step 3: the query vector in the decoder passes through the self-attention module and then is sent to the double-flow character entity attention mechanism together with the feature vector obtained by the encoder to carry out the operation of cross attention, wherein the cross attention is divided into a double-flow network, and the features of the human and the object are extracted in a finer way;
step 4: and respectively obtaining four predictions of an object boundary frame, a human body boundary frame, an object category and a person interaction category through the FFN full-connection layer.
2. The method of claim 1, wherein: in step 1, a Swin Transformer network is used to extract the feature maps of the last three stages; dimension reduction is performed through a 1x1 convolution to obtain multi-scale feature maps, which are flattened into sequences and concatenated to finally obtain the feature vectors; meanwhile, position-encoding information is incorporated and multi-scale feature-level information is introduced for identification.
3. The method of claim 1, wherein: and 2, introducing a multi-scale deformable attention module, mapping the feature vectors into feature graphs of different scales respectively, and adding the results of the feature graphs of different scales to obtain the output of the current attention layer.
4. The method of claim 1, wherein: the dual-stream character entity attention mechanism in the step 3 performs a cross-attention operation, where the cross-attention is divided into dual-stream networks.
5. The method of claim 1, wherein: the interaction relation prediction head in step 4 obtains four predictions: the human bounding box, the object bounding box, the object category, and the human-object interaction category.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310846089.2A CN117372706A (en) | 2023-07-11 | 2023-07-11 | Multi-scale deformable character interaction relation detection method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117372706A true CN117372706A (en) | 2024-01-09 |
Family
ID=89388086
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310846089.2A Pending CN117372706A (en) | 2023-07-11 | 2023-07-11 | Multi-scale deformable character interaction relation detection method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117372706A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117830874A (en) * | 2024-03-05 | 2024-04-05 | 成都理工大学 | Remote sensing target detection method under multi-scale fuzzy boundary condition |
CN117830874B (en) * | 2024-03-05 | 2024-05-07 | 成都理工大学 | Remote sensing target detection method under multi-scale fuzzy boundary condition |
CN117953590A (en) * | 2024-03-27 | 2024-04-30 | 武汉工程大学 | Ternary interaction detection method, system, equipment and medium |
CN117953589A (en) * | 2024-03-27 | 2024-04-30 | 武汉工程大学 | Interactive action detection method, system, equipment and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN117372706A (en) | Multi-scale deformable character interaction relation detection method | |
WO2020108362A1 (en) | Body posture detection method, apparatus and device, and storage medium | |
WO2021129569A1 (en) | Human action recognition method | |
Zheng et al. | Learning Cross-scale Correspondence and Patch-based Synthesis for Reference-based Super-Resolution. | |
CN112052831B (en) | Method, device and computer storage medium for face detection | |
CN111915484A (en) | Reference image guiding super-resolution method based on dense matching and self-adaptive fusion | |
CN111709980A (en) | Multi-scale image registration method and device based on deep learning | |
CN113592927B (en) | Cross-domain image geometric registration method guided by structural information | |
CN111160295A (en) | Video pedestrian re-identification method based on region guidance and space-time attention | |
WO2023159898A1 (en) | Action recognition system, method, and apparatus, model training method and apparatus, computer device, and computer readable storage medium | |
CN111738211A (en) | PTZ camera moving target detection and identification method based on dynamic background compensation and deep learning | |
CN117392496A (en) | Target detection method and system based on infrared and visible light image fusion | |
Hua et al. | Dynamic scene deblurring with continuous cross-layer attention transmission | |
CN114358150A (en) | SAR-visible light remote sensing image matching method | |
CN112329662B (en) | Multi-view saliency estimation method based on unsupervised learning | |
CN112801141B (en) | Heterogeneous image matching method based on template matching and twin neural network optimization | |
CN109934283A (en) | A kind of adaptive motion object detection method merging CNN and SIFT light stream | |
CN117133041A (en) | Three-dimensional reconstruction network face recognition method, system, equipment and medium based on deep learning | |
CN109740405B (en) | Method for detecting front window difference information of non-aligned similar vehicles | |
CN116597174A (en) | Visual SLAM loop detection system and method based on deep learning | |
CN116091793A (en) | Light field significance detection method based on optical flow fusion | |
CN110021036A (en) | Infrared target detection method, apparatus, computer equipment and storage medium | |
CN113628261B (en) | Infrared and visible light image registration method in electric power inspection scene | |
CN115620049A (en) | Method for detecting disguised target based on polarized image clues and application thereof | |
Uchigasaki et al. | Deep image compression using scene text quality assessment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||