CN117372706A - Multi-scale deformable character interaction relation detection method - Google Patents

Multi-scale deformable character interaction relation detection method

Info

Publication number
CN117372706A
Authority
CN
China
Prior art keywords
scale
feature
attention
features
deformable
Prior art date
Legal status
Pending
Application number
CN202310846089.2A
Other languages
Chinese (zh)
Inventor
贾海涛
余梦鹏
张宏博
张钰琪
Current Assignee
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202310846089.2A priority Critical patent/CN117372706A/en
Publication of CN117372706A publication Critical patent/CN117372706A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of human-object interaction (HOI) detection within image understanding, and in particular to a multi-scale deformable human-object interaction detection method. Existing Transformer-based algorithms lack multi-scale features and therefore struggle to accurately identify small targets in high-resolution feature maps, which degrades the prediction of human-object interactions; introducing multi-scale features provides new cues for HOI detection algorithms, but the added features also cause complexity to increase sharply. To solve these problems, the invention provides an HOI detection algorithm based on improvements to the QPIC algorithm. First, a Swin Transformer backbone strengthens feature extraction. Second, multi-scale features are introduced to improve recognition accuracy. Third, the feature vectors are initially encoded by a multi-scale deformable attention module, which samples only the most salient feature points, making the algorithm lightweight and reducing its computational complexity.

Description

Multi-scale deformable character interaction relation detection method
Technical Field
The invention relates to the field of human-object interaction (HOI) detection within image understanding, and in particular to a multi-scale deformable human-object interaction detection method.
Background
The research background of human-object interaction detection traces back to an early stage of computer vision: for a computer to understand an image well, it must not only identify the objects in the image but also understand the relations between those objects and the roles they play. Human-object interaction detection aims at the deep semantic understanding between people and objects in an image.
Compared with earlier structures based on Faster R-CNN, the DETR structure modifies the Transformer only minimally, largely preserving its characteristics, and represents a milestone-level leap. Thanks to the high scalability of the DETR architecture, and as DETR shone in the field of object detection, DETR-based network structures have emerged in great numbers and achieved significant results in their respective fields. In HOI detection, QPIC builds on DETR by changing the query vectors for entity targets into query vectors for human-object interaction pairs and by adding an HOI detection head, thereby effectively aggregating context information. It applied the Transformer structure to HOI detection for the first time and achieved good results, but the following problems remain:
the traditional convolution-based algorithm is very mature in the field of target detection, and can extract multi-scale features through FPN to optimize the detection effect, but the conventional transform-based algorithm is difficult to accurately identify small targets from a high-resolution feature map due to lack of multi-scale features, so that prediction of human interaction relations is affected, huge calculation cost is generated when the transform is directly used for noticing to process the multi-scale feature map, and the conventional transform-based human interaction relation detection algorithm is limited to use of a single-scale feature map. Because of this limitation, previous transducer-based approaches have exhibited undesirable performance, particularly in situations where background information exists on humans, objects, and interactions between them at different scales.
While adding multi-scale features can provide new cues for HOI detection algorithms, it also leads to a dramatic increase in complexity. Moreover, the QPIC algorithm is itself computationally complex; directly adding multi-scale features would push its complexity to an unacceptable level. How to reduce the algorithm's complexity is therefore also a topic of intensive research.
Single-stage approaches of the last two years typically build a feature extractor consisting of a hierarchical CNN backbone (e.g., Hourglass-104, DLA-34, ResNet-50 and ResNet-101) and a Transformer encoder. However, these approaches ignore two drawbacks of CNN backbones. First, CNNs are poor at capturing non-local semantic features: they cannot establish relations between distant pixels, i.e., cannot obtain a global receptive field (for instance, the relation between a person and an object), and even when the network is deepened, the feasible information-transfer paths between distant pixels remain few, so a true global receptive field is never reached. Second, using low-resolution feature maps with large receptive fields discards spatial information at small scales, and even though the attention-based Transformer encoder can supplement semantic information from the image, the result is still affected.
Disclosure of Invention
Existing Transformer-based human-object interaction detection algorithms are limited to single-scale feature maps, and directly introducing multi-scale features greatly increases algorithmic complexity. The invention aims to overcome the defects discussed above and provide a multi-scale deformable human-object interaction detection method.
The multi-scale deformable human-object interaction detection method comprises the following specific steps:
step 1: giving an original image, inputting the original image into a Swin transform network, extracting the feature images of the last three layers, and performing dimension reduction through 1x1 convolution to obtain an image feature vector.
Step 2: and the feature vector is initially encoded through a multi-scale deformable attention module, feature points with the most remarkable features are sampled, and the computational complexity is reduced.
Step 3: the query vector passes through the self-attention module in the decoder, and then is sent to the double-flow character entity attention mechanism together with the feature vector obtained by the encoder to carry out the operation of cross attention, wherein the cross attention is divided into a double-flow network, and the features of the human and the object are extracted in a finer way.
Step 4: and respectively obtaining four predictions of an object boundary frame, a human body boundary frame, an object category and a person interaction category through the FFN full-connection layer.
Compared with the prior art, the invention has the following beneficial effects:
(1) Compared with a traditional CNN backbone network, the invention uses a Swin Transformer to enhance the feature-extraction capability;
(2) for human and object entities that are small targets, the invention introduces multi-scale features to improve recognition accuracy;
(3) with respect to algorithm complexity, the invention uses a multi-scale deformable attention mechanism to reduce the number of sampling points, making the algorithm lightweight.
Drawings
Fig. 1: Schematic diagram of the multi-scale deformable single-stage human-object interaction detection algorithm.
Fig. 2: Schematic diagram of multi-scale deformable attention.
Fig. 3: Schematic diagram of reference points and sampling points.
Fig. 4: Schematic diagram of the human-object entity attention mechanism in the multi-scale case.
Fig. 5: Schematic diagram of the dual-stream human-object entity attention mechanism in the multi-scale case.
Detailed Description
The present embodiment is described below with reference to the accompanying drawings. The specific process of the multi-scale deformable human-object interaction detection method is as follows:
Step 1: Fig. 1 is a schematic diagram of the multi-scale deformable single-stage human-object interaction detection algorithm. Specifically, an image is input, the feature maps of the last three stages of the Swin Transformer are taken, and a 1x1 convolution projects the feature maps x_1, x_2 and x_3 from dimension C_s to dimension C_d. The multi-scale feature maps x_1, x_2 and x_3 are then flattened into sequences and concatenated, finally yielding C_d-dimensional feature vectors, into which positional encoding information is fused before they are used as the input of the encoder. Because multi-scale features are introduced, the positional encoding must carry, besides the relative position within the image, multi-scale feature-level information identifying which level each feature comes from. A minimal sketch of this flattening-and-encoding step follows.
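The PyTorch-style sketch below illustrates one way to realize this step under stated assumptions: pos2d (a 2D sinusoidal positional-encoding helper) and the learned level_embed table are assumed names, not components specified by the patent.

```python
import torch
import torch.nn as nn

def flatten_multi_scale(feats, level_embed, pos2d):
    """Flatten L projected feature maps into one encoder sequence.

    feats:       list of L tensors (B, C_d, H_l, W_l) after the 1x1 projection
    level_embed: (L, C_d) learned table identifying the source feature level
    pos2d:       callable (h, w, c) -> (h, w, c) 2D positional encoding (assumed helper)
    """
    tokens, pos = [], []
    for l, f in enumerate(feats):
        b, c, h, w = f.shape
        tokens.append(f.flatten(2).transpose(1, 2))      # (B, H_l*W_l, C_d)
        p = pos2d(h, w, c).reshape(h * w, c)             # relative position in image
        pos.append(p + level_embed[l])                   # plus feature-level info
    # one sequence over all scales, with matching positional/level encoding
    return torch.cat(tokens, dim=1), torch.cat(pos, dim=0)
```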
Step 2: given a multi-scale feature mapWherein->L represents the number of feature maps, and L is used to specifically identify the index of the feature map. P is p q ∈[0,1] 2 Then a two-dimensional mapping reference point in the multi-scale image feature vector represents the query element. Unlike single-scale deformable attention mechanisms, because of the incorporation of multi-scale features, two-dimensional mapping reference points need to find specific coordinates in feature maps of different scales at the same time, so that the reference point coordinates here need to be normalized instead of using absolute positions of the coordinates, and alignment of the positions of the reference points of multiple feature maps is facilitated. The deformable attention mechanism incorporating the multi-scale features is shown by equation (1):
As in the single-scale deformable module, M is the total number of attention heads in the multi-head attention mechanism and m indexes a head; K is the total number of sampled points and k indexes a specific sampling point; \Delta p_{mlqk} is the offset of a sampling point relative to the reference point, and A_{mlqk} is the corresponding multi-head attention weight. \phi_l maps the normalized coordinates of the two-dimensional reference point onto the feature map of each scale, so that the reference-point positions in the different scales are aligned. The multi-scale deformable attention schematic in Fig. 2 illustrates the overall process more intuitively. During the attention computation, each query element of the query vector is mapped into the image feature vector, corresponding to z_q. The feature vectors are mapped into the feature maps of different scales, yielding at each scale the sampled features W'_m x^l, the reference-point offsets \Delta p_{mlqk} and the attention weights A_{mlqk}. Finally, according to equation (1), the results from the feature maps of different scales are summed to obtain the output of the current attention layer. The multi-scale attention mechanism is structurally equivalent to the single-scale deformable attention mechanism, essentially amounting to performing the single-scale mechanism multiple times at different scales; when L = 1 and K = 1 the two are completely equivalent. A sketch of the sampling core follows.
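For illustration, the following is a minimal PyTorch-style sketch of the sampling-and-weighting core of equation (1), assuming the learned projections W_m and W'_m are applied outside this function; it is a simplified sketch in the spirit of the public Deformable DETR reference code, not the patent's specified implementation.

```python
import torch
import torch.nn.functional as F

def ms_deform_attn_core(values, sampling_locs, attn_weights):
    """Sampling/weighting core of equation (1); W_m and W'_m applied outside.

    values:        list of L tensors (B*M, C_v, H_l, W_l) - per-head value maps
    sampling_locs: (B*M, Q, L, K, 2) in [0, 1], ordered (x, y); keeping the
                   coordinates normalized plays the role of phi_l in eq. (1)
    attn_weights:  (B*M, Q, L, K), softmax-normalized over the L*K samples
    returns:       (B*M, C_v, Q)
    """
    out = 0.0
    for l, v in enumerate(values):
        grid = 2.0 * sampling_locs[:, :, l] - 1.0      # grid_sample expects [-1, 1]
        sampled = F.grid_sample(v, grid, mode="bilinear",
                                padding_mode="zeros", align_corners=False)
        # sampled: (B*M, C_v, Q, K); weight each point and sum over the K samples
        out = out + (sampled * attn_weights[:, :, l].unsqueeze(1)).sum(-1)
    return out                                          # also summed over levels l
```

Only L*K points per query are bilinearly sampled rather than attending to every pixel, which is what keeps the multi-scale attention lightweight.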
Step 3: specifically, given an HOI query feature z q Human entity attention determines sampling locations of human and object features, respectivelyAnd->As shown in FIG. 4, in order to make the feature points clearer, the original picture is changed into a gray level picture, black arrows in the picture represent full-connection layers, and red arrows represent human sampling in the feature pictures with different scalesThe dots, yellow arrows represent feature points of objects in the different scale feature map. First, z is q Input into the full connection layer to get +.>And->The kth sample position and the mth attention head of the human and the object at the ith feature layer are shown in formulas (2), (3), respectively:
Here h_q and \Delta h are the reference point and sampling offsets of the human, and o_q and \Delta o are the reference point and sampling offsets of the object; both pairs are obtained from z_q through fully connected layers. Then, based on the sampling positions, the human feature z^h_q and the object feature z^o_q are computed by deformable attention as in equations (4) and (5):

z^h_q = \sum_{m=1}^{M} W_m \Big[ \sum_{l=1}^{L} \sum_{k=1}^{K} A^h_{mlqk} \cdot W'_m \, x^l\big(P^h_{mlqk}\big) \Big]    (4)

z^o_q = \sum_{m=1}^{M} W_m \Big[ \sum_{l=1}^{L} \sum_{k=1}^{K} A^o_{mlqk} \cdot W'_m \, x^l\big(P^o_{mlqk}\big) \Big]    (5)

A sketch of the sampling-parameter generation follows.
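A minimal sketch of how the per-stream sampling parameters of equations (2) and (3) can be produced from z_q; the layer names, the default head/level/point counts, and the choice of a sigmoid to keep reference points normalized are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DualStreamSamplingParams(nn.Module):
    """Produces human/object reference points and offsets from an HOI query (assumed design)."""

    def __init__(self, d_model=256, n_heads=8, n_levels=3, n_points=4):
        super().__init__()
        self.n = (n_heads, n_levels, n_points)
        self.ref_h = nn.Linear(d_model, 2)   # h_q: human reference point
        self.ref_o = nn.Linear(d_model, 2)   # o_q: object reference point
        self.off_h = nn.Linear(d_model, n_heads * n_levels * n_points * 2)
        self.off_o = nn.Linear(d_model, n_heads * n_levels * n_points * 2)

    def forward(self, z_q):                   # z_q: (B, Q, d_model)
        b, q, _ = z_q.shape
        m, l, k = self.n
        h_q = self.ref_h(z_q).sigmoid()       # normalized to [0, 1]
        o_q = self.ref_o(z_q).sigmoid()
        dh = self.off_h(z_q).view(b, q, m, l, k, 2)   # Delta h
        do = self.off_o(z_q).view(b, q, m, l, k, 2)   # Delta o
        # P^h = phi_l(h_q) + Delta h,  P^o = phi_l(o_q) + Delta o  (eqs. 2-3)
        p_h = h_q[:, :, None, None, None, :] + dh
        p_o = o_q[:, :, None, None, None, :] + do
        return p_h, p_o
```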
step 4: due to the present invention cross-attention in the decoderInstead of the physical attention of the person for double flow, the double flow structure outputs the results of the cross attention respectivelyAnd->There is therefore also a need for corresponding improvements in the original character interaction relationship prediction head. The specific calculation is shown in the formulas (6), (7), (8) and (9):
as shown in FIG. 5, the human deformable attention output in the above formulaThe interaction detection head can thus obtain a human body frame. Output of the object deformable attention mechanism>The interaction detection head can obtain the object frame and the object category. Whereas prediction of interaction categories requires simultaneous reference +.>And->

Claims (5)

1. A multi-scale deformable human-object interaction detection method, characterized in that the method comprises the following steps:
step 1: given an original image, input it into a Swin Transformer network, extract the feature maps of the last three stages, and reduce their dimension through 1x1 convolution to obtain image feature vectors;
step 2: initially encode the feature vectors through a multi-scale deformable attention module and sample the most salient feature points, reducing computational complexity;
step 3: pass the query vectors through the self-attention module in the decoder, then feed them, together with the feature vectors obtained by the encoder, into the dual-stream human-object entity attention mechanism for the cross-attention operation, wherein the cross attention is split into a dual-stream network that extracts human and object features at a finer granularity;
step 4: obtain four predictions through FFN fully connected layers: the object bounding box, the human bounding box, the object category, and the human-object interaction category.
2. The method of claim 1, wherein: in step 1, a Swin Transformer network extracts the feature maps of the last three stages; dimension reduction through 1x1 convolution yields multi-scale feature maps, which are flattened into sequences and concatenated to finally obtain the feature vectors, while positional encoding information is fused in and multi-scale feature-level information is introduced for identifying the source level.
3. The method of claim 1, wherein: step 2 introduces a multi-scale deformable attention module that maps the feature vectors into feature maps of different scales and sums the results over the different scales to obtain the output of the current attention layer.
4. The method of claim 1, wherein: the dual-stream human-object entity attention mechanism in step 3 performs the cross-attention operation, with the cross attention split into a dual-stream network that extracts human and object features separately.
5. The method of claim 1, wherein: the human-object interaction prediction head in step 4 obtains four predictions: the human bounding box, the object bounding box, the object category, and the human-object interaction category.
CN202310846089.2A 2023-07-11 2023-07-11 Multi-scale deformable character interaction relation detection method Pending CN117372706A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310846089.2A CN117372706A (en) 2023-07-11 2023-07-11 Multi-scale deformable character interaction relation detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310846089.2A CN117372706A (en) 2023-07-11 2023-07-11 Multi-scale deformable character interaction relation detection method

Publications (1)

Publication Number Publication Date
CN117372706A 2024-01-09

Family

ID=89388086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310846089.2A Pending CN117372706A (en) 2023-07-11 2023-07-11 Multi-scale deformable character interaction relation detection method

Country Status (1)

Country Link
CN (1) CN117372706A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117830874A (en) * 2024-03-05 2024-04-05 成都理工大学 Remote sensing target detection method under multi-scale fuzzy boundary condition
CN117830874B (en) * 2024-03-05 2024-05-07 成都理工大学 Remote sensing target detection method under multi-scale fuzzy boundary condition
CN117953590A (en) * 2024-03-27 2024-04-30 武汉工程大学 Ternary interaction detection method, system, equipment and medium
CN117953589A (en) * 2024-03-27 2024-04-30 武汉工程大学 Interactive action detection method, system, equipment and medium

Similar Documents

Publication Publication Date Title
CN117372706A (en) Multi-scale deformable character interaction relation detection method
WO2020108362A1 (en) Body posture detection method, apparatus and device, and storage medium
WO2021129569A1 (en) Human action recognition method
Zheng et al. Learning Cross-scale Correspondence and Patch-based Synthesis for Reference-based Super-Resolution.
CN112052831B (en) Method, device and computer storage medium for face detection
CN111915484A (en) Reference image guiding super-resolution method based on dense matching and self-adaptive fusion
CN111709980A (en) Multi-scale image registration method and device based on deep learning
CN113592927B (en) Cross-domain image geometric registration method guided by structural information
CN111160295A (en) Video pedestrian re-identification method based on region guidance and space-time attention
WO2023159898A1 (en) Action recognition system, method, and apparatus, model training method and apparatus, computer device, and computer readable storage medium
CN111738211A (en) PTZ camera moving target detection and identification method based on dynamic background compensation and deep learning
CN117392496A (en) Target detection method and system based on infrared and visible light image fusion
Hua et al. Dynamic scene deblurring with continuous cross-layer attention transmission
CN114358150A (en) SAR-visible light remote sensing image matching method
CN112329662B (en) Multi-view saliency estimation method based on unsupervised learning
CN112801141B (en) Heterogeneous image matching method based on template matching and twin neural network optimization
CN109934283A (en) A kind of adaptive motion object detection method merging CNN and SIFT light stream
CN117133041A (en) Three-dimensional reconstruction network face recognition method, system, equipment and medium based on deep learning
CN109740405B (en) Method for detecting front window difference information of non-aligned similar vehicles
CN116597174A (en) Visual SLAM loop detection system and method based on deep learning
CN116091793A (en) Light field significance detection method based on optical flow fusion
CN110021036A (en) Infrared target detection method, apparatus, computer equipment and storage medium
CN113628261B (en) Infrared and visible light image registration method in electric power inspection scene
CN115620049A (en) Method for detecting disguised target based on polarized image clues and application thereof
Uchigasaki et al. Deep image compression using scene text quality assessment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination