CN114863353A - Method and device for detecting relation between person and object and storage medium - Google Patents

Method and device for detecting relation between person and object and storage medium Download PDF

Info

Publication number
CN114863353A
CN114863353A CN202210410947.4A
Authority
CN
China
Prior art keywords
network
decoder
relationship
teacher
student
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210410947.4A
Other languages
Chinese (zh)
Other versions
CN114863353B (en)
Inventor
丁长兴
屈贤
钟旭彬
王健
丁二锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
South China University of Technology SCUT
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT, Beijing Baidu Netcom Science and Technology Co Ltd filed Critical South China University of Technology SCUT
Priority to CN202210410947.4A priority Critical patent/CN114863353B/en
Priority claimed from CN202210410947.4A external-priority patent/CN114863353B/en
Publication of CN114863353A publication Critical patent/CN114863353A/en
Application granted granted Critical
Publication of CN114863353B publication Critical patent/CN114863353B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method, an apparatus and a storage medium for detecting the relationship between a person and an object, wherein the method comprises the following steps: acquiring a training set of a human-object relationship detection data set, and performing enhancement processing on the training set; constructing a Student network and initializing the Student network; constructing a Teacher network and initializing the Teacher network; using preset loss functions to supervise the output of the Student network and the output of the Teacher network during training; using a preset distillation loss function to bring the predictions of the Student Transformer decoder and the Teacher Transformer decoder closer together during training; and in testing, adopting the trained Student network to obtain the detection result of the relationship between the person and the object. The invention uses the idea of knowledge distillation to design a Teacher network with explicit semantic information, which guides the original Student network to learn a better attention matrix and thereby obtain more discriminative context information, greatly improving the performance of human-object relationship detection; the invention can be widely applied in the technical field of image processing and recognition.

Description

Method and device for detecting relation between person and object and storage medium
Technical Field
The invention relates to the technical field of image processing and recognition, in particular to a method and a device for detecting a relationship between a person and an object and a storage medium.
Background
Human-object relationship detection detects, in a picture, the positions of the interacting person and object, the class of the object and the class of their interaction relationship. Human-object relationship detection is of great importance, for example: in an automatic driving system, the detection system judges the condition of the surrounding road by detecting the relationship between people on the road and surrounding objects, so as to make safe driving decisions; in a hospital monitoring system, the detection system can judge whether a patient is in an emergency through the relationship between the patient and surrounding objects, thereby safeguarding the patient's life and health.
At present, the key problem in human-object relationship detection is how to extract discriminative global context features. Because of the powerful ability of the Transformer to extract contextual features, some approaches have already utilized Transformers for human-object relationship detection. However, in current Transformer-based methods, the query matrix of the mutual attention module in the Transformer decoder and the decoder initialization features suffer from semantic ambiguity, which greatly limits the ability of the Transformer to learn better context features and to predict relationship classes accurately.
Interpretation of terms:
CNN: convolutional Neural Networks (Convolutional Neural Networks) are a class of feed-forward Neural Networks that contain convolution computations and have a deep structure.
Disclosure of Invention
To solve at least one of the technical problems in the prior art to some extent, an object of the present invention is to provide a method, an apparatus and a storage medium for detecting a relationship between a person and an object.
The technical scheme adopted by the invention is as follows:
a method for detecting a relationship between a person and an object, comprising the steps of:
acquiring a training set of a human-object relation detection data set, and performing enhancement processing on the training set;
constructing a Student network and initializing the Student network;
constructing a Teacher network and initializing the Teacher network;
using a preset loss function to supervise the output of the Student network and the output of the Teacher network in the training process;
using a preset distillation loss function to bring the predictions of the Student Transformer decoder and the Teacher Transformer decoder closer together during training;
in the test, a trained Student network is adopted to obtain the detection result of the relationship between the person and the object.
Further, the enhancement processing of the training set includes:
randomly applying horizontal flipping, color jittering, scaling and cropping to the pictures, and finally normalizing the pictures.
Further, the building and initializing the Student network comprises:
constructing and initializing a CNN-based deep neural network;
constructing and initializing a Transformer encoder and a Transformer decoder;
and constructing a human-object relation detection network, predicting the human-object relation in the picture to be detected according to the output of the Transformer decoder, and initializing the human-object relation detection network.
Further, the deep neural network is constructed as follows:
the feature map F is obtained using the classical residual network ResNet-50 or ResNet-101, followed by a 1x1 convolution to reduce the number of channels.
Further, the Transformer position encoding adopts the standard sinusoidal form:

PE(pos, 2j) = sin(pos / 10000^(2j/D))
PE(pos, 2j+1) = cos(pos / 10000^(2j/D))

where pos represents a position in the two-dimensional picture, D is a constant, and j represents the dimension; positions with odd channel indices are encoded with the cos function, and positions with even channel indices with the sin function; finally, the output PE is a three-dimensional position encoding matrix whose dimensions are consistent with those of F.
Further, the Transformer encoder is constructed as follows:
an encoder is formed by l cascaded encoder layers, and each encoder layer consists of a self-attention module, a residual connection network, a layer normalization processing module, a feed-forward network, a residual connection network and a layer normalization processing module which are cascaded; the query matrix, the key matrix and the value matrix of the self-attention module in the encoder are respectively F + PE, F + PE and F, and the output of the Transformer encoder is E;
the Transformer decoder is constructed as follows:
a decoder is formed by l cascaded decoder layers, and each decoder layer consists of a self-attention module, a residual connection network, a layer normalization processing module, a mutual attention module, a residual connection network, a layer normalization processing module, a feed-forward network, a residual connection network and a layer normalization processing module which are cascaded; the query matrix, the key matrix and the value matrix of the mutual attention module in the decoder are respectively Q, E + PE and E, and the output of the Transformer decoder is D;
wherein F represents the output feature of the input image after passing through the CNN-based deep neural network, PE represents the position encoding of the input image, and Q represents a set of learnable vectors.
Further, the structure of the human-object relationship detection network is as follows:
it comprises 4 feed-forward networks which are used to predict the position of the person, the position of the object, the object class and the relationship class, respectively; the 4 feed-forward networks consist respectively of 3 fully connected layers (with ReLU activation functions in between), 3 fully connected layers (with ReLU activation functions in between), 1 fully connected layer and 1 fully connected layer.
Further, the relationship between the person and the object in the picture to be detected is predicted according to the output of the Transformer decoder; the jth prediction result comprises the normalized box positions of the person and the object, the object class prediction and the relationship class prediction, where N_obj and N_act are respectively the number of object classes and the number of relationship classes represented in the data set.
Further, the prediction results are matched with the annotated relationship pairs by the Hungarian algorithm, where the matching cost matrix combines, for each pair of a prediction and an annotated relationship pair, the L1 box regression cost and the generalized IoU (GIOU, short for generalized IoU) cost of the person and object boxes, the object classification cost and the relationship classification cost; Φ represents the set of empty-set indices of the annotated relationship pairs in the picture; minimizing the total matching cost yields, for each annotated relationship pair, the index of its corresponding predicted relationship pair.
Further, the Teacher network comprises: the same CNN-based deep neural network, Transformer encoder, Transformer decoder and human-object relationship detection network as the Student network, and the parameters of these networks are shared.
Further, the differences between the Student Transformer decoder and the Teacher Transformer decoder include:
in the Student Transformer decoder, the query matrix of the mutual attention module is a set Q of learnable vectors; in the Teacher Transformer decoder, the query matrix of the mutual attention module is a set Q_t of position features of the annotated relationship pairs;
in the Student Transformer decoder, the initialization feature D_0 is a zero vector; in the Teacher Transformer decoder, the initialization feature D_0^t is a set of word vector features of the objects in the annotated relationship pairs.
Further, the query matrix Q_t of the mutual attention module in the Teacher Transformer decoder is constructed as follows:

Q_t = tanh(F_q(H_t))

where H_t is the set of position feature encodings of the annotated relationship pairs in the picture; in each position feature encoding, the first 8 elements represent the center coordinates, width and height of the person box and the object box, and the last 4 elements represent the relative position and the areas of the two boxes; F_q consists of 2 fully connected layers with a ReLU activation function between them, and the initial parameters of F_q are randomly initialized;

the initialization feature D_0^t in the Teacher Transformer decoder is constructed as follows:

D_0^t = F_w(W_t)

where W_t is the set of word vectors of the objects in the annotated relationship pairs, the word vector of the object in the ith annotated relationship pair being an element of W_t; F_w consists of 2 fully connected layers with a ReLU activation function between them, and the parameters of F_w are randomly initialized.
Further, the preset loss functions include: L1 loss, generalized IoU loss, cross-entropy loss and focal loss;
wherein the L1 loss and the generalized IoU loss are used to supervise the box regression of the person and the object; the cross-entropy loss is used to supervise the object class classification; the focal loss is used to supervise the relationship class classification. Specifically, the total relationship detection loss is the weighted sum

L = λ_b·L_1 + λ_u·L_GIoU + λ_c·L_ce + λ_a·L_focal

computed separately for the Teacher network and the Student network, where L_t and L_s denote the total relationship detection losses of the Teacher and Student networks respectively, and L_1, L_GIoU, L_ce and L_focal are the L1 loss, generalized IoU loss, cross-entropy loss and focal loss of the corresponding network. The Student losses are computed on the predictions matched to the picture annotations by the Hungarian algorithm, and the Teacher losses on the Teacher network predictions; the losses are averaged over the annotated relationship pairs with non-empty-set indices, while Φ denotes the set of empty-set indices of the annotated relationship pairs in the picture; N_q is the number of predicted potential relationship pairs; the focal loss is computed per element; λ_b, λ_u, λ_c and λ_a are weights.
Further, the preset distillation loss functions comprise: cosine embedding loss and KL-divergence loss;
wherein the cosine embedding loss is used to pull the features output by the two decoders closer together, and the KL-divergence loss is used to pull the attention matrices of the two decoders closer together.
The cosine embedding loss L_cos pulls the features output by the two decoders closer together: it is computed between the ith feature vector of the Teacher Transformer decoder output D_t and the ith feature vector of the Student Transformer decoder output D.
The KL-divergence loss L_KL pulls the attention matrices of the two decoders closer together: it is computed between the average attention matrices output by the jth decoder layer for the ith feature vector of the Teacher Transformer decoder and of the Student Transformer decoder, respectively.
The overall distillation loss function is:

L_dis = α_1·L_cos + α_2·L_KL

where α_1 and α_2 are weights.
Furthermore, in the testing process, only the output of the Student network is utilized, and the Teacher network is not relied on; no extra calculation is added in the test process.
The other technical scheme adopted by the invention is as follows:
a person-to-object relationship detection apparatus comprising:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
The other technical scheme adopted by the invention is as follows:
a computer readable storage medium in which a processor executable program is stored, which when executed by a processor is for performing the method as described above.
The invention has the beneficial effects that: the invention uses the idea of knowledge distillation to design a Teacher network with explicit semantic information, which guides the original Student network to learn a better attention matrix and thereby obtain more discriminative context information, greatly improving the performance of human-object relationship detection; it solves the semantic ambiguity problem of current Transformer-based methods, improves the accuracy of human-object relationship detection, and accelerates the convergence speed of the model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description is made on the drawings of the embodiments of the present invention or the related technical solutions in the prior art, and it should be understood that the drawings in the following description are only for convenience and clarity of describing some embodiments in the technical solutions of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart illustrating steps of a method for detecting a relationship between a person and an object according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the basic network structure of the Transformer encoder according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the basic network structure of the Transformer decoder according to an embodiment of the present invention;
FIG. 4 is an exemplary diagram of the inputs to the Teacher and Student Transformer decoders according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a basic network structure of a human-object relationship detection module in an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
In the description of the present invention, it should be understood that the orientation or positional relationship referred to in the description of the orientation, such as the upper, lower, front, rear, left, right, etc., is based on the orientation or positional relationship shown in the drawings, and is only for convenience of description and simplification of description, and does not indicate or imply that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention.
In the description of the present invention, the meaning of a plurality of means is one or more, the meaning of a plurality of means is two or more, and larger, smaller, larger, etc. are understood as excluding the number, and larger, smaller, inner, etc. are understood as including the number. If the first and second are described for the purpose of distinguishing technical features, they are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated or implicitly indicating the precedence of the technical features indicated.
In the description of the present invention, unless otherwise explicitly limited, terms such as arrangement, installation, connection and the like should be understood in a broad sense, and those skilled in the art can reasonably determine the specific meanings of the above terms in the present invention in combination with the specific contents of the technical solutions.
As shown in fig. 1, this embodiment provides a method for detecting relationship between a person and an object based on Transformer and knowledge distillation, which can further improve the accuracy of detecting relationship between a person and an object by using a Teacher Transformer decoder with clear semantic information to guide a Student Transformer decoder to learn context information with more discriminative power, and the method specifically includes the following steps:
and S1, acquiring a training set of the human-object relation detection data set, and performing enhancement processing on the training set.
Performing data enhancement on the input training picture; in this embodiment, the data enhancement is performed on the training data of the data set HICO-DET, specifically: horizontally flipping an input picture with a probability of 50%; dithering brightness, contrast and saturation in the interval of [0.6,1.4 ]; scaling the glass substrate with a probability of 50%, wherein the shortest side is selected with a medium probability [480,512,544,576,608,640,672,704,736,768,800] during scaling, and the longest side is guaranteed not to exceed 1333; finally, the pictures are normalized to mean and variance [0.485,0.456,0.406] and [0.229,0.224,0.225], respectively.
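The augmentation pipeline described above can be sketched as follows; this is a minimal illustration assuming a torchvision-style implementation (the exact transform composition is not given in the patent), and the matching updates to the bounding-box annotations are omitted for brevity.

```python
# Minimal sketch of the image-level augmentation described above (assumed torchvision pipeline).
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

SCALES = [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]
MAX_SIZE = 1333

def resize_shortest_side(img, scales=SCALES, max_size=MAX_SIZE):
    """Resize so the shortest side equals a randomly chosen scale, capped so the longest side <= max_size."""
    w, h = img.size
    short, long = min(w, h), max(w, h)
    target = random.choice(scales)
    if target / short * long > max_size:          # shrink the target if the long side would exceed 1333
        target = int(max_size * short / long)
    scale = target / short
    return TF.resize(img, [int(round(h * scale)), int(round(w * scale))])

def augment(img):
    if random.random() < 0.5:                     # horizontal flip with probability 50%
        img = TF.hflip(img)
    img = T.ColorJitter(0.4, 0.4, 0.4)(img)       # brightness/contrast/saturation jitter in [0.6, 1.4]
    if random.random() < 0.5:                     # random rescaling with probability 50%
        img = resize_shortest_side(img)
    img = TF.to_tensor(img)
    return TF.normalize(img, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
```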
S2, constructing the Student network and initializing the Student network.
The step S2 specifically includes steps S21-S23:
S21, constructing and initializing a CNN-based deep neural network;
S22, constructing and initializing a Transformer encoder and a Transformer decoder;
S23, constructing a human-object relationship detection network, predicting the human-object relationship in the picture to be detected according to the output of the Transformer decoder, and initializing the human-object relationship detection network.
(1) For a training picture after data enhancement, a feature map F is obtained through a CNN-based deep neural network.
The CNN-based deep neural network constructed in this embodiment is a ResNet-50 network, followed by a 1×1 convolution to reduce the number of channels; it is initialized with the parameters of a Transformer-based object detection model trained on MS-COCO.
(2) After F is obtained, the features are input into a transform encoder.
In this embodiment, position encoding needs to be performed for each pixel of the three-dimensional feature map F. The position encoding follows the standard sinusoidal form:

PE(pos, 2j) = sin(pos / 10000^(2j/D))
PE(pos, 2j+1) = cos(pos / 10000^(2j/D))

where pos denotes a position in the two-dimensional picture, D is a constant set to 128 in this embodiment, and j denotes the dimension; positions with odd channel indices are encoded with the cos function, and positions with even channel indices with the sin function. The output PE is a three-dimensional position encoding matrix whose dimensions are consistent with those of F.
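A sketch of this sinusoidal position encoding is given below; the DETR-style two-axis layout and the temperature constant 10000 are assumptions, since the original formula images are not reproduced in the text.

```python
# Sketch of a 2D sinusoidal position encoding with D = 128 channels per spatial axis (assumed form).
import torch

def build_position_encoding(h, w, d=128, temperature=10000.0):
    """Return a (2*d, h, w) position encoding: d sine/cosine channels per spatial axis."""
    y = torch.arange(h, dtype=torch.float32).unsqueeze(1).expand(h, w)   # row index of each pixel
    x = torch.arange(w, dtype=torch.float32).unsqueeze(0).expand(h, w)   # column index of each pixel
    dim_t = temperature ** (2 * (torch.arange(d) // 2) / d)              # per-channel divisor
    pos_x = x.unsqueeze(-1) / dim_t                                      # (h, w, d)
    pos_y = y.unsqueeze(-1) / dim_t
    # Even channels use sin, odd channels use cos, matching the description above.
    pos_x = torch.stack((pos_x[..., 0::2].sin(), pos_x[..., 1::2].cos()), dim=-1).flatten(2)
    pos_y = torch.stack((pos_y[..., 0::2].sin(), pos_y[..., 1::2].cos()), dim=-1).flatten(2)
    return torch.cat((pos_y, pos_x), dim=-1).permute(2, 0, 1)            # (2*d, h, w), same spatial size as F

pe = build_position_encoding(25, 34)   # e.g. for an 800x1088 input downsampled 32x by ResNet-50
print(pe.shape)                        # torch.Size([256, 25, 34])
```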
In this embodiment, the Transformer encoder is composed of l cascaded encoder layers, with l set to 6. The encoder structure of this embodiment is shown in FIG. 2; each encoder layer is composed of a self-attention module, a residual connection network, a layer normalization processing module, a feed-forward network, a residual connection network and a layer normalization processing module which are cascaded. As can be seen from FIG. 2, the query matrix, the key matrix and the value matrix of the Transformer encoder are respectively:

Q_e = F + PE
K_e = F + PE
V_e = F

The computational process of the Transformer encoder is expressed as:

E = f_enc(F, PE)

where E is the output feature of the Transformer encoder and f_enc represents the cascaded encoder layers.

The Transformer encoder is initialized with the parameters of the encoder of a Transformer-based object detection model trained on MS-COCO.
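For illustration, a single encoder layer with the Q/K/V assignment described above could look like the following sketch; the number of attention heads and the feed-forward width are assumed values, not values taken from the patent.

```python
# Sketch of one encoder layer: self-attention -> residual + layer norm -> FFN -> residual + layer norm.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, src, pos):
        # src: (H*W, batch, d_model) flattened pixel features F; pos: matching position encoding PE.
        q = k = src + pos                          # query and key are F + PE, the value is F
        attn_out, _ = self.self_attn(q, k, value=src)
        src = self.norm1(src + attn_out)           # residual connection + layer normalization
        src = self.norm2(src + self.ffn(src))      # feed-forward + residual + layer normalization
        return src
```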
(3) Next, the Transformer encoder output feature E and the position encoding PE are input into the Transformer decoder.
In this embodiment, the Transformer decoder is composed of l cascaded decoder layers, where l is likewise set to 6, corresponding to the number of encoder layers. The decoder structure of this embodiment is shown in FIG. 3; each decoder layer is composed of a self-attention module, a residual connection network, a layer normalization processing module, a mutual attention module, a residual connection network, a layer normalization processing module, a feed-forward network, a residual connection network and a layer normalization processing module which are cascaded. The query matrix, the key matrix and the value matrix of the mutual attention module of the decoder are respectively:

Q_d = Q
K_d = E + PE
V_d = E

where Q is a set of learnable vectors.

The computational process of the Transformer decoder is expressed as:

D = f_dec(Q, D_0, E, PE)

where D is the output of the Transformer decoder and D_0 is the initialization feature, for which a zero matrix is used in this embodiment.

The Transformer decoder is initialized with the parameters of the decoder of a Transformer-based object detection model trained on MS-COCO.
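A corresponding sketch of one decoder layer is shown below; again the head count and hidden width are assumptions, and the query/key/value assignment of the mutual attention module follows the description above.

```python
# Sketch of one decoder layer: self-attention over the query slots, mutual (cross-) attention over E, FFN.
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, tgt, query_embed, memory, pos):
        # tgt: (N_q, batch, d_model) decoder features, starting from the initialization feature D_0.
        # query_embed: the learnable query set Q; memory: encoder output E; pos: position encoding PE.
        q = k = tgt + query_embed
        tgt = self.norm1(tgt + self.self_attn(q, k, value=tgt)[0])
        # Mutual attention: query uses the query set, key is E + PE, value is E.
        tgt = self.norm2(tgt + self.cross_attn(tgt + query_embed, memory + pos, value=memory)[0])
        tgt = self.norm3(tgt + self.ffn(tgt))
        return tgt
```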
(4) D is passed through the human-object relationship detection network to obtain the final prediction results; the jth prediction triple comprises the normalized box positions of the person and the object, the object class prediction and the relationship class prediction, where N_obj and N_act are respectively the number of object classes and the number of relationship classes represented in the data set. The human-object relationship detection network comprises 4 feed-forward networks F_h, F_o, F_c and F_a, which take the decoder output features and predict the person box, the object box, the object class and the relationship class, respectively. F_h and F_o each consist of 3 fully connected layers with ReLU activation functions between the layers; F_c and F_a are each a single fully connected layer. F_h, F_o and F_c are initialized with the parameters of the object detection heads of a Transformer-based object detection model trained on MS-COCO; F_a is randomly initialized.
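The four prediction heads can be sketched as follows; the output activations (sigmoid for the boxes and relationship classes, softmax applied to the object logits inside the loss) and the HICO-DET class counts are assumptions based on common practice, since the exact formulas are shown only as images.

```python
# Sketch of the four feed-forward prediction heads F_h, F_o, F_c, F_a described above.
import torch
import torch.nn as nn

class MLP(nn.Module):
    """3 fully connected layers with ReLU in between, as used for F_h and F_o."""
    def __init__(self, d_model, d_out):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_out),
        )
    def forward(self, x):
        return self.layers(x)

class HOIHeads(nn.Module):
    def __init__(self, d_model=256, n_obj=80, n_act=117):   # 80 objects / 117 relations on HICO-DET
        super().__init__()
        self.f_h = MLP(d_model, 4)                 # person box (cx, cy, w, h)
        self.f_o = MLP(d_model, 4)                 # object box
        self.f_c = nn.Linear(d_model, n_obj + 1)   # object class (+1 for "no object")
        self.f_a = nn.Linear(d_model, n_act)       # relationship class

    def forward(self, dec_out):                    # dec_out: (N_q, d_model) decoder output D
        return {
            "boxes_h": self.f_h(dec_out).sigmoid(),
            "boxes_o": self.f_o(dec_out).sigmoid(),
            "logits_obj": self.f_c(dec_out),       # softmax / cross-entropy applied in the loss
            "logits_act": self.f_a(dec_out),       # sigmoid / focal loss applied in the loss
        }
```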
After the prediction results of the Student network are obtained, they are matched with the annotated relationship pairs by the Hungarian algorithm. The matching cost matrix combines, for each pair of a prediction and an annotated relationship pair, the L1 box regression cost and the generalized IoU (GIOU, short for generalized IoU) cost of the person and object boxes, the object classification cost and the relationship classification cost; Φ represents the set of empty-set indices of the annotated relationship pairs in the picture. Minimizing the total matching cost yields, for each annotated relationship pair, the index of its corresponding predicted relationship pair.
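A minimal sketch of the Hungarian matching step is given below; the decomposition of the cost matrix into box, GIoU, object-class and relationship-class terms and the QPIC-style weights are assumptions consistent with the loss weights used later in this embodiment.

```python
# Sketch of Hungarian matching between N_q predicted pairs and N_gt annotated pairs.
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(cost_box, cost_giou, cost_obj, cost_act,
                    w_b=2.5, w_u=1.0, w_c=1.0, w_a=1.0):
    """Each cost_* is an (N_q, N_gt) matrix; returns matched (pred_idx, gt_idx) pairs."""
    cost = w_b * cost_box + w_u * cost_giou + w_c * cost_obj + w_a * cost_act
    pred_idx, gt_idx = linear_sum_assignment(cost)   # minimizes the total matching cost
    return pred_idx, gt_idx

# Usage example with random costs for 100 predicted pairs and 3 annotated pairs:
rng = np.random.default_rng(0)
costs = [rng.random((100, 3)) for _ in range(4)]
print(hungarian_match(*costs))
```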
S3, constructing the Teacher network and initializing the Teacher network.
The Teacher network is also composed of a CNN-based deep neural network, a Transformer encoder, a Transformer decoder and a human-object relationship detection network, and shares the parameters of these networks with the Student network.
The difference between the Teacher network and the Student network lies in the Transformer decoder, specifically as follows:
(1) Query matrix of the mutual attention module in the decoder: in the Student Transformer decoder, the query matrix of the mutual attention module is a set Q of learnable vectors, whereas in the Teacher Transformer decoder it is a set Q_t of position features of the annotated relationship pairs. Q_t is constructed as follows:

Q_t = tanh(F_q(H_t))

where H_t is the set of position feature encodings of the annotated relationship pairs in the picture. In each position feature encoding, the first 8 elements represent the center coordinates, width and height of the person box and the object box of the corresponding annotated relationship pair, and the last 4 elements represent the relative position and the areas of the two boxes. F_q consists of 2 fully connected layers with a ReLU activation function between them; the initial parameters of F_q are randomly initialized.
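The construction of Q_t can be sketched as follows; the exact layout of the 12-dimensional position feature (in particular the form of the relative-position terms) is an assumption paraphrased from the description above.

```python
# Sketch of building the Teacher query matrix Q_t from the annotated pair geometry.
import torch
import torch.nn as nn

def encode_pair(box_h, box_o):
    """box_* = (cx, cy, w, h), normalized. Returns the 12-d position feature (layout assumed)."""
    cx_h, cy_h, w_h, h_h = box_h
    cx_o, cy_o, w_o, h_o = box_o
    rel = [cx_o - cx_h, cy_o - cy_h]               # relative position of the two boxes (assumed form)
    areas = [w_h * h_h, w_o * h_o]                 # areas of the two boxes
    return torch.tensor([cx_h, cy_h, w_h, h_h, cx_o, cy_o, w_o, h_o, *rel, *areas])

class TeacherQuery(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        # F_q: 2 fully connected layers with ReLU in between, randomly initialized.
        self.f_q = nn.Sequential(nn.Linear(12, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

    def forward(self, h_t):                        # h_t: (N_t, 12) stacked position features
        return torch.tanh(self.f_q(h_t))           # Q_t = tanh(F_q(H_t))
```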
(2) Initialization feature of the decoder: in the Student Transformer decoder, the initialization feature D_0 is a zero vector, whereas in the Teacher Transformer decoder the initialization feature D_0^t is the set of word vector features of the objects in the annotated relationship pairs. D_0^t is constructed as follows:

D_0^t = F_w(W_t)

where W_t is the set of word vectors of the objects in the annotated relationship pairs; in this embodiment, the word vector of the object in the ith annotated relationship pair is obtained from a trained CLIP model, and the dimension of each word vector is 512. F_w consists of 2 fully connected layers with a ReLU activation function between them; the parameters of F_w are randomly initialized.
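Building the Teacher initialization feature from CLIP word vectors could look like the sketch below; the use of the public openai/CLIP package and the specific model variant are assumptions about how the 512-dimensional word vectors are obtained.

```python
# Sketch of constructing D_0^t from CLIP word vectors of the annotated objects.
import torch
import torch.nn as nn
import clip   # assumed: the public openai/CLIP package

class TeacherInit(nn.Module):
    def __init__(self, d_model=256, word_dim=512):
        super().__init__()
        # F_w: 2 fully connected layers with ReLU in between, randomly initialized.
        self.f_w = nn.Sequential(nn.Linear(word_dim, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

    def forward(self, word_vecs):                  # word_vecs: (N_t, 512) frozen CLIP text features
        return self.f_w(word_vecs)                 # D_0^t = F_w(W_t)

# Obtaining the word vectors (done once, offline):
model, _ = clip.load("ViT-B/32")
with torch.no_grad():
    tokens = clip.tokenize(["bicycle", "umbrella"])    # object names of the annotated pairs
    word_vecs = model.encode_text(tokens).float()      # (2, 512)
print(TeacherInit()(word_vecs).shape)
```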
For the Teacher Transformer decoder, the query matrix is Q_t, and the key matrix and the value matrix are respectively:

K_d = E + PE
V_d = E

which are the same as those of the Student Transformer decoder.

The computational process of the Teacher Transformer decoder is expressed as:

D_t = f_dec(Q_t, D_0^t, E, PE)

where D_t is the output of the Teacher Transformer decoder.
S4, supervising the outputs of the Student network and the Teacher network with preset loss functions during training.
After the prediction outputs of the Teacher and Student networks are obtained, several loss functions are used for supervision. For each network, the total relationship detection loss is the weighted sum

L = λ_b·L_1 + λ_u·L_GIoU + λ_c·L_ce + λ_a·L_focal

computed separately for the Teacher network (denoted L_t) and the Student network (denoted L_s), where L_1, L_GIoU, L_ce and L_focal are the L1 loss, generalized IoU loss, cross-entropy loss and focal loss of the corresponding network. The L1 loss and the generalized IoU loss supervise the box regression of the person and the object; the cross-entropy loss supervises the object class classification; the focal loss, computed per element, supervises the relationship class classification. The Student losses are computed on the predictions matched to the picture annotations by the Hungarian algorithm, and the Teacher losses on the Teacher network predictions; the losses are averaged over the annotated relationship pairs that are not padded with the empty set (Φ denotes the set of padded empty-set indices), and N_q is the number of predicted potential relationship pairs. λ_b, λ_u, λ_c and λ_a are weights, set to 2.5, 1, 1 and 1 respectively in this embodiment.
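A simplified sketch of the total detection loss for one network is given below; it operates on already-matched prediction/annotation pairs, uses a focal loss with an assumed focusing parameter of 2, and omits the empty-set padding terms, so it is an approximation of the supervision described above rather than the exact formulation.

```python
# Sketch of the weighted total detection loss over matched pairs.
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def detection_loss(pred, target, lambda_b=2.5, lambda_u=1.0, lambda_c=1.0, lambda_a=1.0):
    """pred/target hold matched boxes (xyxy), object labels and multi-hot relationship labels."""
    l_box = F.l1_loss(pred["boxes"], target["boxes"])                         # L1 box regression
    l_giou = generalized_box_iou_loss(pred["boxes"], target["boxes"],
                                      reduction="mean")                        # GIoU box regression
    l_obj = F.cross_entropy(pred["logits_obj"], target["labels_obj"])          # object class classification
    # Per-element focal loss for the multi-label relationship classification (gamma = 2 assumed):
    p = pred["logits_act"].sigmoid()
    t = target["labels_act"]
    l_focal = -((1 - p) ** 2 * t * p.clamp_min(1e-8).log()
                + p ** 2 * (1 - t) * (1 - p).clamp_min(1e-8).log()).mean()
    return lambda_b * l_box + lambda_u * l_giou + lambda_c * l_obj + lambda_a * l_focal
```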
S5, using a preset distillation loss function to bring the predictions of the Student Transformer decoder and the Teacher Transformer decoder closer together during training.
The distillation loss functions used in this embodiment include a cosine embedding loss and a KL-divergence loss.
The cosine embedding loss L_cos is used to pull the features output by the two decoders closer together: it is computed between the ith feature vector of the Teacher Transformer decoder output D_t and the ith feature vector of the Student Transformer decoder output D.
The KL-divergence loss L_KL is used to pull the attention matrices of the two decoders closer together: it is computed between the average attention matrices output by the jth decoder layer for the ith feature vector of the Teacher Transformer decoder output D_t and of the Student Transformer decoder output D, respectively.
The overall distillation loss function is:

L_dis = α_1·L_cos + α_2·L_KL

where α_1 and α_2 are weights; in this embodiment, α_1 and α_2 are set to 1 and 10 respectively.
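The two distillation terms can be sketched as follows; the normalization over queries and decoder layers is an assumption, since the exact formulas appear only as images.

```python
# Sketch of the distillation losses: cosine embedding on decoder features, KL on attention maps.
import torch
import torch.nn.functional as F

def distillation_loss(d_student, d_teacher, attn_student, attn_teacher,
                      alpha1=1.0, alpha2=10.0):
    """
    d_*:    (N_t, d_model) decoder output features of the matched queries.
    attn_*: (layers, N_t, H*W) average cross-attention maps per decoder layer (rows sum to 1).
    """
    # Cosine embedding loss: pull corresponding feature vectors together (target = +1 for every pair).
    target = torch.ones(d_student.size(0), device=d_student.device)
    l_cos = F.cosine_embedding_loss(d_student, d_teacher, target)
    # KL-divergence between the Student (input, log-probabilities) and Teacher (target) attention maps.
    l_kl = F.kl_div(attn_student.clamp_min(1e-8).log(), attn_teacher, reduction="batchmean")
    return alpha1 * l_cos + alpha2 * l_kl
```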
S6, acquiring the detection result of the relationship between the person and the object by adopting the trained Student network in testing.
In this embodiment, the test process is characterized in that only the output of the Student network is utilized, and the Teacher network is not relied on; no extra calculation is added in the test process.
To verify the effectiveness of the present invention, experiments were performed on the HICO-DET data set as shown in Table 1 below, using the mean Average Precision (mAP) metric, which is calculated as follows: for each action class contained in the data set, the prediction precision is calculated over all test images, and the average of the precisions over all action classes is the mAP.
TABLE 1 Comparison between the invention and other methods on HICO-DET (mAP)
Method    Full     Rare     Non-rare
QPIC      29.07    21.85    31.23
Ours      30.41    25.10    32.00
In summary, compared with the prior art, the present embodiment has the following advantages and beneficial effects:
(1) The method solves the problem in existing Transformer-based methods that the query matrix of the mutual attention module of the decoder and the initialization feature of the decoder are semantically ambiguous. The method uses the position information and the object word vector information of the annotated relationship pairs to construct a query matrix and an initialization feature with explicit semantic information; it uses the idea of knowledge distillation to design a Teacher network with explicit semantic information, which guides the original Student network to learn a better attention matrix and thereby obtain more discriminative context information, greatly improving the performance of human-object relationship detection.
(2) By sharing network parameters between the Teacher and the Student and using the distillation loss function, the outputs of the two networks are pulled closer together, which greatly improves the convergence speed of the networks, reduces the training time and improves the training efficiency; sharing network parameters also reduces system complexity.
(3) In the testing process, only the output of the Student network is retained, and no extra computation is added; the test time of the model is not increased while the performance is improved.
(4) Meanwhile, the method is applicable to many Transformer-based detection networks and has wide application value.
This embodiment further provides a person-object relationship detection apparatus, comprising:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method shown in FIG. 1.
The device for detecting the relationship between a person and an object according to the embodiment of the present invention can perform the method for detecting the relationship between a person and an object according to the embodiment of the method of the present invention, and can perform any combination of the steps of the embodiments of the method.
The embodiment of the application also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and executed by the processor to cause the computer device to perform the method illustrated in fig. 1.
The embodiment also provides a storage medium, which stores instructions or programs capable of executing the method for detecting the relationship between the person and the object provided by the embodiment of the method of the invention, and when the instructions or the programs are run, the steps can be implemented by any combination of the embodiment of the method, and the method has corresponding functions and beneficial effects.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the described functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is to be determined from the appended claims along with their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the foregoing description of the specification, reference to the description of "one embodiment/example," "another embodiment/example," or "certain embodiments/examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for detecting a relationship between a person and an object, comprising the steps of:
acquiring a training set of a human-object relation detection data set, and performing enhancement processing on the training set;
constructing a Student network and initializing the Student network;
constructing a Teacher network and initializing the Teacher network;
using a preset loss function to supervise the output of the Student network and the output of the Teacher network in the training process;
using a preset distillation loss function to bring the predictions of the Student Transformer decoder and the Teacher Transformer decoder closer together during training;
in the test, a trained Student network is adopted to obtain the detection result of the relationship between the person and the object.
2. The method for detecting relationship between a person and an object according to claim 1, wherein the constructing and initializing a Student network comprises:
constructing and initializing a CNN-based deep neural network;
constructing and initializing a Transformer encoder and a Transformer decoder;
and constructing a human-object relation detection network, predicting the human-object relation in the picture to be detected according to the output of the Transformer decoder, and initializing the human-object relation detection network.
3. The method for detecting the relationship between a person and an object according to claim 2, wherein the Transformer encoder is constructed as follows:
an encoder is formed by l cascaded encoder layers, and each encoder layer consists of a self-attention module, a residual connection network, a layer normalization processing module, a feed-forward network, a residual connection network and a layer normalization processing module which are cascaded; the query matrix, the key matrix and the value matrix of the self-attention module in the encoder are respectively F + PE, F + PE and F, and the output of the Transformer encoder is E;
the Transformer decoder is constructed as follows:
a decoder is formed by l cascaded decoder layers, and each decoder layer consists of a self-attention module, a residual connection network, a layer normalization processing module, a mutual attention module, a residual connection network, a layer normalization processing module, a feed-forward network, a residual connection network and a layer normalization processing module which are cascaded; the query matrix, the key matrix and the value matrix of the mutual attention module in the decoder are respectively Q, E + PE and E, and the output of the Transformer decoder is D;
wherein F represents the output feature of the input image after passing through the CNN-based deep neural network, PE represents the position encoding of the input image, and Q represents a set of learnable vectors.
4. The method according to claim 2, wherein the human-object relationship detection network comprises 4 feed-forward networks, which consist respectively of 3 fully connected layers, 3 fully connected layers, 1 fully connected layer and 1 fully connected layer;
the 4 feed-forward networks are used to predict the position of the person, the position of the object, the object class and the relationship class, respectively.
5. The method according to claim 1, wherein the differences between the Student Transformer decoder and the Teacher Transformer decoder comprise:
in the Student Transformer decoder, the query matrix of the mutual attention module is a set Q of learnable vectors;
in the Teacher Transformer decoder, the query matrix of the mutual attention module is a set Q_t of position features of the annotated relationship pairs;
in the Student Transformer decoder, the initialization feature D_0 is a zero vector; in the Teacher Transformer decoder, the initialization feature D_0^t is a set of word vector features of the objects in the annotated relationship pairs.
6. The method according to claim 5, wherein the query matrix Q_t of the mutual attention module in the Teacher Transformer decoder is constructed as follows:

Q_t = tanh(F_q(H_t))

where H_t is the set of position feature encodings of the annotated relationship pairs in the picture; in each position feature encoding, the first 8 elements represent the center coordinates, width and height of the person box and the object box, and the last 4 elements represent the relative position and the areas of the two boxes; F_q consists of 2 fully connected layers with a ReLU activation function between them, and the initial parameters of F_q are randomly initialized;

the initialization feature D_0^t in the Teacher Transformer decoder is constructed as follows:

D_0^t = F_w(W_t)

where W_t is the set of word vectors of the objects in the annotated relationship pairs, the word vector of the object in the ith annotated relationship pair being an element of W_t; F_w consists of 2 fully connected layers with a ReLU activation function between them, and the parameters of F_w are randomly initialized.
7. The method according to claim 1, wherein the preset loss functions comprise: L1 loss, generalized IoU loss, cross-entropy loss and focal loss;
wherein the L1 loss and the generalized IoU loss are used to supervise the box regression of the person and the object; the cross-entropy loss is used to supervise the object class classification; and the focal loss is used to supervise the relationship class classification.
8. The method according to claim 1, wherein the preset distillation loss functions comprise: cosine embedding loss and KL-divergence loss;
wherein the cosine embedding loss is used to pull the features output by the two decoders closer together, and the KL-divergence loss is used to pull the attention matrices of the two decoders closer together.
9. A person-object relationship detecting apparatus, comprising:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method according to any one of claims 1-8.
10. A computer-readable storage medium, in which a program executable by a processor is stored, wherein the program executable by the processor is adapted to perform the method according to any one of claims 1 to 8 when executed by the processor.
CN202210410947.4A 2022-04-19 Method and device for detecting relationship between person and object and storage medium Active CN114863353B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210410947.4A CN114863353B (en) 2022-04-19 Method and device for detecting relationship between person and object and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210410947.4A CN114863353B (en) 2022-04-19 Method and device for detecting relationship between person and object and storage medium

Publications (2)

Publication Number Publication Date
CN114863353A true CN114863353A (en) 2022-08-05
CN114863353B CN114863353B (en) 2024-08-02



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021023202A1 (en) * 2019-08-07 2021-02-11 交叉信息核心技术研究院(西安)有限公司 Self-distillation training method and device for convolutional neural network, and scalable dynamic prediction method
CN112613303A (en) * 2021-01-07 2021-04-06 福州大学 Knowledge distillation-based cross-modal image aesthetic quality evaluation method

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115407874A (en) * 2022-08-18 2022-11-29 中国兵器工业标准化研究所 Neural network-based VR maintenance training operation proficiency prediction method

Similar Documents

Publication Publication Date Title
CN110414432B (en) Training method of object recognition model, object recognition method and corresponding device
CN109471895B (en) Electronic medical record phenotype extraction and phenotype name normalization method and system
EP3819790A2 (en) Method and apparatus for visual question answering, computer device and medium
CN112559784B (en) Image classification method and system based on incremental learning
EP3859560A2 (en) Method and apparatus for visual question answering, computer device and medium
CN109711463B (en) Attention-based important object detection method
CN113792113A (en) Visual language model obtaining and task processing method, device, equipment and medium
CN112183577A (en) Training method of semi-supervised learning model, image processing method and equipment
CN112084331A (en) Text processing method, text processing device, model training method, model training device, computer equipment and storage medium
US20230153943A1 (en) Multi-scale distillation for low-resolution detection
CN113344206A (en) Knowledge distillation method, device and equipment integrating channel and relation feature learning
Ye et al. Steering angle prediction YOLOv5-based end-to-end adaptive neural network control for autonomous vehicles
CN115482418B (en) Semi-supervised model training method, system and application based on pseudo-negative labels
US20230281826A1 (en) Panoptic segmentation with multi-database training using mixed embedding
CN111325237A (en) Image identification method based on attention interaction mechanism
CN111611796A (en) Hypernym determination method and device for hyponym, electronic device and storage medium
Park et al. Bayesian weight decay on bounded approximation for deep convolutional neural networks
CN116521899B (en) Improved graph neural network-based document level relation extraction method and system
CN114863353B (en) Method and device for detecting relationship between person and object and storage medium
CN114863353A (en) Method and device for detecting relation between person and object and storage medium
CN116503761A (en) High-voltage line foreign matter detection method, model training method and device
CN114741487B (en) Image-text retrieval method and system based on image-text semantic embedding
CN115424275A (en) Fishing boat brand identification method and system based on deep learning technology
CN113779244B (en) Document emotion classification method and device, storage medium and electronic equipment
Guo et al. Deep Learning-Based Image Retrieval With Unsupervised Double Bit Hashing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant