CN114863353A - Method and device for detecting relation between person and object and storage medium - Google Patents

Method and device for detecting relation between person and object and storage medium Download PDF

Info

Publication number
CN114863353A
CN114863353A CN202210410947.4A
Authority
CN
China
Prior art keywords
network
decoder
relationship
teacher
student
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210410947.4A
Other languages
Chinese (zh)
Other versions
CN114863353B (en)
Inventor
丁长兴
屈贤
钟旭彬
王健
丁二锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
South China University of Technology SCUT
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT, Beijing Baidu Netcom Science and Technology Co Ltd filed Critical South China University of Technology SCUT
Priority to CN202210410947.4A priority Critical patent/CN114863353B/en
Priority claimed from CN202210410947.4A external-priority patent/CN114863353B/en
Publication of CN114863353A publication Critical patent/CN114863353A/en
Application granted granted Critical
Publication of CN114863353B publication Critical patent/CN114863353B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method, an apparatus and a storage medium for detecting the relationship between a person and an object, wherein the method comprises the following steps: acquiring a training set of a human-object relationship detection data set, and performing enhancement processing on the training set; constructing a Student network and initializing the Student network; constructing a Teacher network and initializing the Teacher network; using preset loss functions to supervise the output of the Student network and the output of the Teacher network during training; using a preset distillation loss function to bring the predictions of the Student Transformer decoder and the Teacher Transformer decoder closer together during training; and in testing, adopting the trained Student network to obtain the detection result of the relationship between the person and the object. The invention uses the idea of knowledge distillation to design a Teacher network with explicit semantic information, which guides the original Student network to learn a better attention matrix and thereby obtain more discriminative context information, greatly improving the performance of human-object relationship detection; the invention can be widely applied in the technical field of image processing and recognition.

Description

Method and device for detecting relation between person and object and storage medium
Technical Field
The invention relates to the technical field of image processing and recognition, in particular to a method and a device for detecting a relationship between a person and an object and a storage medium.
Background
Human-object relationship detection detects, in a picture, the positions of the interacting person and object, the class of the object and the class of their interaction relationship. Human-object relationship detection is of great importance, for example: in an automatic driving system, the detection system judges the condition of the surrounding road by detecting the relationship between people on the road and surrounding objects, so as to make safe driving decisions; in a hospital monitoring system, the detection system can judge whether a patient is in an emergency through the relationship between the patient and surrounding objects, thereby safeguarding the patient's life and health.
At present, the key problem in human-object relationship detection is how to extract discriminative global context features. Because of the powerful ability of the Transformer to extract contextual features, some approaches have already utilized Transformers for human-object relationship detection. However, in current Transformer-based methods, the query matrix of the mutual attention module in the Transformer decoder and the decoder initialization features suffer from semantic ambiguity, which greatly limits the ability of the Transformer to learn better context features and to predict relationship classes accurately.
Interpretation of terms:
CNN: convolutional Neural Networks (Convolutional Neural Networks) are a class of feed-forward Neural Networks that contain convolution computations and have a deep structure.
Disclosure of Invention
To solve at least one of the technical problems in the prior art to some extent, an object of the present invention is to provide a method, an apparatus and a storage medium for detecting a relationship between a person and an object.
The technical scheme adopted by the invention is as follows:
a method for detecting a relationship between a person and an object, comprising the steps of:
acquiring a training set of a human-object relation detection data set, and performing enhancement processing on the training set;
constructing a Student network and initializing the Student network;
constructing a Teacher network and initializing the Teacher network;
using a preset loss function to supervise the output of the Student network and the output of the Teacher network in the training process;
using a preset distillation loss function to bring the predictions of the Student Transformer decoder and the Teacher Transformer decoder closer together during training;
in the test, a trained Student network is adopted to obtain the detection result of the relationship between the person and the object.
Further, the enhancement processing of the training set includes:
randomly applying horizontal flipping, color jittering, scaling and cropping to the pictures, and finally normalizing the pictures.
Further, the building and initializing the Student network comprises:
constructing and initializing a CNN-based deep neural network;
constructing and initializing a Transformer encoder and a Transformer decoder;
and constructing a human-object relation detection network, predicting the human-object relation in the picture to be detected according to the output of the Transformer decoder, and initializing the human-object relation detection network.
Further, the deep neural network is constructed as follows:
the feature map F is obtained using the classical residual network ResNet-50 or ResNet-101, followed by a 1x1 convolution to reduce the number of channels.
Further, the Transformer position encoding adopts the standard sinusoidal form:

PE(pos, 2j) = sin(pos / 10000^(2j/D))
PE(pos, 2j+1) = cos(pos / 10000^(2j/D))

where pos represents a position in the two-dimensional picture, D is a constant, and j represents the dimension; positions with odd channel indices are encoded with the cos function, and positions with even channel indices with the sin function; finally, the output PE is a three-dimensional position encoding matrix whose dimensions are consistent with those of F.
Further, the Transformer encoder is constructed as follows:
an encoder is formed by l cascaded encoder layers, and each encoder layer consists of a self-attention module, a residual connection network, a layer normalization processing module, a feed-forward network, a residual connection network and a layer normalization processing module which are cascaded; the query matrix, the key matrix and the value matrix of the self-attention module in the encoder are respectively F + PE, F + PE and F, and the output of the Transformer encoder is E;
the Transformer decoder is constructed as follows:
a decoder is formed by l cascaded decoder layers, and each decoder layer consists of a self-attention module, a residual connection network, a layer normalization processing module, a mutual attention module, a residual connection network, a layer normalization processing module, a feed-forward network, a residual connection network and a layer normalization processing module which are cascaded; the query matrix, the key matrix and the value matrix of the mutual attention module in the decoder are respectively Q, E + PE and E, and the output of the Transformer decoder is D;
wherein F represents the output feature of the input image after passing through the CNN-based deep neural network, PE represents the position encoding of the input image, and Q represents a set of learnable vectors.
Further, the structure of the human-object relationship detection network is as follows:
it comprises 4 feed-forward networks which are used to predict the position of the person, the position of the object, the object class and the relationship class, respectively; the 4 feed-forward networks consist respectively of 3 fully connected layers (with ReLU activation functions in between), 3 fully connected layers (with ReLU activation functions in between), 1 fully connected layer and 1 fully connected layer.
Further, the relationship between the person and the object in the picture to be detected is predicted according to the output of the Transformer decoder; the jth prediction result comprises the normalized box positions of the person and the object, the object class prediction and the relationship class prediction, where N_obj and N_act are respectively the number of object classes and the number of relationship classes represented in the data set.
Further, the prediction results are matched with the annotated relationship pairs by the Hungarian algorithm, where the matching cost matrix combines, for each pair of a prediction and an annotated relationship pair, the L1 box regression cost and the generalized IoU (GIOU, short for generalized IoU) cost of the person and object boxes, the object classification cost and the relationship classification cost; Φ represents the set of empty-set indices of the annotated relationship pairs in the picture; minimizing the total matching cost yields, for each annotated relationship pair, the index of its corresponding predicted relationship pair.
Further, the Teacher network comprises: the same CNN-based deep neural network, Transformer encoder, Transformer decoder and human-object relationship detection network as the Student network, and the parameters of these networks are shared.
Further, the differences between the Student Transformer decoder and the Teacher Transformer decoder include:
in the Student Transformer decoder, the query matrix of the mutual attention module is a set Q of learnable vectors; in the Teacher Transformer decoder, the query matrix of the mutual attention module is a set Q_t of position features of the annotated relationship pairs;
in the Student Transformer decoder, the initialization feature D_0 is a zero vector; in the Teacher Transformer decoder, the initialization feature D_0^t is a set of word vector features of the objects in the annotated relationship pairs.
Further, the query matrix Q_t of the mutual attention module in the Teacher Transformer decoder is constructed as follows:

Q_t = tanh(F_q(H_t))

where H_t is the set of position feature encodings of the annotated relationship pairs in the picture; in each position feature encoding, the first 8 elements represent the center coordinates, width and height of the person box and the object box, and the last 4 elements represent the relative position and the areas of the two boxes; F_q consists of 2 fully connected layers with a ReLU activation function between them, and the initial parameters of F_q are randomly initialized;

the initialization feature D_0^t in the Teacher Transformer decoder is constructed as follows:

D_0^t = F_w(W_t)

where W_t is the set of word vectors of the objects in the annotated relationship pairs, the word vector of the object in the ith annotated relationship pair being an element of W_t; F_w consists of 2 fully connected layers with a ReLU activation function between them, and the parameters of F_w are randomly initialized.
Further, the preset loss functions include: L1 loss, generalized IoU loss, cross-entropy loss and focal loss;
wherein the L1 loss and the generalized IoU loss are used to supervise the box regression of the person and the object; the cross-entropy loss is used to supervise the object class classification; the focal loss is used to supervise the relationship class classification. Specifically, the total relationship detection loss is the weighted sum

L = λ_b·L_1 + λ_u·L_GIoU + λ_c·L_ce + λ_a·L_focal

computed separately for the Teacher network and the Student network, where L_t and L_s denote the total relationship detection losses of the Teacher and Student networks respectively, and L_1, L_GIoU, L_ce and L_focal are the L1 loss, generalized IoU loss, cross-entropy loss and focal loss of the corresponding network. The Student losses are computed on the predictions matched to the picture annotations by the Hungarian algorithm, and the Teacher losses on the Teacher network predictions; the losses are averaged over the annotated relationship pairs with non-empty-set indices, while Φ denotes the set of empty-set indices of the annotated relationship pairs in the picture; N_q is the number of predicted potential relationship pairs; the focal loss is computed per element; λ_b, λ_u, λ_c and λ_a are weights.
Further, the preset distillation loss functions comprise: cosine embedding loss and KL-divergence loss;
wherein the cosine embedding loss is used to pull the features output by the two decoders closer together, and the KL-divergence loss is used to pull the attention matrices of the two decoders closer together.
The cosine embedding loss L_cos pulls the features output by the two decoders closer together: it is computed between the ith feature vector of the Teacher Transformer decoder output D_t and the ith feature vector of the Student Transformer decoder output D.
The KL-divergence loss L_KL pulls the attention matrices of the two decoders closer together: it is computed between the average attention matrices output by the jth decoder layer for the ith feature vector of the Teacher Transformer decoder and of the Student Transformer decoder, respectively.
The overall distillation loss function is:

L_dis = α_1·L_cos + α_2·L_KL

where α_1 and α_2 are weights.
Furthermore, in the testing process, only the output of the Student network is utilized, and the Teacher network is not relied on; no extra calculation is added in the test process.
The other technical scheme adopted by the invention is as follows:
a person-to-object relationship detection apparatus comprising:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
The other technical scheme adopted by the invention is as follows:
a computer readable storage medium in which a processor executable program is stored, which when executed by a processor is for performing the method as described above.
The invention has the beneficial effects that: the invention uses the idea of knowledge distillation to design a Teacher network with explicit semantic information, which guides the original Student network to learn a better attention matrix and thereby obtain more discriminative context information, greatly improving the performance of human-object relationship detection; it solves the semantic ambiguity problem of current Transformer-based methods, improves the accuracy of human-object relationship detection, and accelerates the convergence speed of the model.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the following description is made on the drawings of the embodiments of the present invention or the related technical solutions in the prior art, and it should be understood that the drawings in the following description are only for convenience and clarity of describing some embodiments in the technical solutions of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart illustrating steps of a method for detecting a relationship between a person and an object according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of the basic network structure of the Transformer encoder according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the basic network structure of the Transformer decoder according to an embodiment of the present invention;
FIG. 4 is an exemplary diagram of the inputs to the Teacher and Student Transformer decoders according to an embodiment of the present invention;
fig. 5 is a schematic diagram of a basic network structure of a human-object relationship detection module in an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention. The step numbers in the following embodiments are provided only for convenience of illustration, the order between the steps is not limited at all, and the execution order of each step in the embodiments can be adapted according to the understanding of those skilled in the art.
In the description of the present invention, it should be understood that the orientation or positional relationship referred to in the description of the orientation, such as the upper, lower, front, rear, left, right, etc., is based on the orientation or positional relationship shown in the drawings, and is only for convenience of description and simplification of description, and does not indicate or imply that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention.
In the description of the present invention, the meaning of a plurality of means is one or more, the meaning of a plurality of means is two or more, and larger, smaller, larger, etc. are understood as excluding the number, and larger, smaller, inner, etc. are understood as including the number. If the first and second are described for the purpose of distinguishing technical features, they are not to be understood as indicating or implying relative importance or implicitly indicating the number of technical features indicated or implicitly indicating the precedence of the technical features indicated.
In the description of the present invention, unless otherwise explicitly limited, terms such as arrangement, installation, connection and the like should be understood in a broad sense, and those skilled in the art can reasonably determine the specific meanings of the above terms in the present invention in combination with the specific contents of the technical solutions.
As shown in fig. 1, this embodiment provides a method for detecting relationship between a person and an object based on Transformer and knowledge distillation, which can further improve the accuracy of detecting relationship between a person and an object by using a Teacher Transformer decoder with clear semantic information to guide a Student Transformer decoder to learn context information with more discriminative power, and the method specifically includes the following steps:
and S1, acquiring a training set of the human-object relation detection data set, and performing enhancement processing on the training set.
Performing data enhancement on the input training picture; in this embodiment, the data enhancement is performed on the training data of the data set HICO-DET, specifically: horizontally flipping an input picture with a probability of 50%; dithering brightness, contrast and saturation in the interval of [0.6,1.4 ]; scaling the glass substrate with a probability of 50%, wherein the shortest side is selected with a medium probability [480,512,544,576,608,640,672,704,736,768,800] during scaling, and the longest side is guaranteed not to exceed 1333; finally, the pictures are normalized to mean and variance [0.485,0.456,0.406] and [0.229,0.224,0.225], respectively.
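The augmentation pipeline described above can be sketched as follows; this is a minimal illustration assuming a torchvision-style implementation (the exact transform composition is not given in the patent), and the matching updates to the bounding-box annotations are omitted for brevity.

```python
# Minimal sketch of the image-level augmentation described above (assumed torchvision pipeline).
import random
import torchvision.transforms as T
import torchvision.transforms.functional as TF

SCALES = [480, 512, 544, 576, 608, 640, 672, 704, 736, 768, 800]
MAX_SIZE = 1333

def resize_shortest_side(img, scales=SCALES, max_size=MAX_SIZE):
    """Resize so the shortest side equals a randomly chosen scale, capped so the longest side <= max_size."""
    w, h = img.size
    short, long = min(w, h), max(w, h)
    target = random.choice(scales)
    if target / short * long > max_size:          # shrink the target if the long side would exceed 1333
        target = int(max_size * short / long)
    scale = target / short
    return TF.resize(img, [int(round(h * scale)), int(round(w * scale))])

def augment(img):
    if random.random() < 0.5:                     # horizontal flip with probability 50%
        img = TF.hflip(img)
    img = T.ColorJitter(0.4, 0.4, 0.4)(img)       # brightness/contrast/saturation jitter in [0.6, 1.4]
    if random.random() < 0.5:                     # random rescaling with probability 50%
        img = resize_shortest_side(img)
    img = TF.to_tensor(img)
    return TF.normalize(img, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
```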
S2, constructing the Student network and initializing the Student network.
The step S2 specifically includes steps S21-S23:
S21, constructing and initializing a CNN-based deep neural network;
S22, constructing and initializing a Transformer encoder and a Transformer decoder;
S23, constructing a human-object relationship detection network, predicting the human-object relationship in the picture to be detected according to the output of the Transformer decoder, and initializing the human-object relationship detection network.
(1) For a training picture after data enhancement, a feature map F is obtained through a CNN-based deep neural network.
The CNN-based deep neural network constructed in this embodiment is a ResNet-50 network, followed by a 1×1 convolution to reduce the number of channels; it is initialized with the parameters of a Transformer-based object detection model trained on MS-COCO.
(2) After F is obtained, the features are input into a transform encoder.
In this embodiment, position encoding needs to be performed for each pixel of the three-dimensional feature map F. The position encoding follows the standard sinusoidal form:

PE(pos, 2j) = sin(pos / 10000^(2j/D))
PE(pos, 2j+1) = cos(pos / 10000^(2j/D))

where pos denotes a position in the two-dimensional picture, D is a constant set to 128 in this embodiment, and j denotes the dimension; positions with odd channel indices are encoded with the cos function, and positions with even channel indices with the sin function. The output PE is a three-dimensional position encoding matrix whose dimensions are consistent with those of F.
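A sketch of this sinusoidal position encoding is given below; the DETR-style two-axis layout and the temperature constant 10000 are assumptions, since the original formula images are not reproduced in the text.

```python
# Sketch of a 2D sinusoidal position encoding with D = 128 channels per spatial axis (assumed form).
import torch

def build_position_encoding(h, w, d=128, temperature=10000.0):
    """Return a (2*d, h, w) position encoding: d sine/cosine channels per spatial axis."""
    y = torch.arange(h, dtype=torch.float32).unsqueeze(1).expand(h, w)   # row index of each pixel
    x = torch.arange(w, dtype=torch.float32).unsqueeze(0).expand(h, w)   # column index of each pixel
    dim_t = temperature ** (2 * (torch.arange(d) // 2) / d)              # per-channel divisor
    pos_x = x.unsqueeze(-1) / dim_t                                      # (h, w, d)
    pos_y = y.unsqueeze(-1) / dim_t
    # Even channels use sin, odd channels use cos, matching the description above.
    pos_x = torch.stack((pos_x[..., 0::2].sin(), pos_x[..., 1::2].cos()), dim=-1).flatten(2)
    pos_y = torch.stack((pos_y[..., 0::2].sin(), pos_y[..., 1::2].cos()), dim=-1).flatten(2)
    return torch.cat((pos_y, pos_x), dim=-1).permute(2, 0, 1)            # (2*d, h, w), same spatial size as F

pe = build_position_encoding(25, 34)   # e.g. for an 800x1088 input downsampled 32x by ResNet-50
print(pe.shape)                        # torch.Size([256, 25, 34])
```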
In this embodiment, the Transformer encoder is composed of l cascaded encoder layers, with l set to 6. The encoder structure of this embodiment is shown in FIG. 2; each encoder layer is composed of a self-attention module, a residual connection network, a layer normalization processing module, a feed-forward network, a residual connection network and a layer normalization processing module which are cascaded. As can be seen from FIG. 2, the query matrix, the key matrix and the value matrix of the Transformer encoder are respectively:

Q_e = F + PE
K_e = F + PE
V_e = F

The computational process of the Transformer encoder is expressed as:

E = f_enc(F, PE)

where E is the output feature of the Transformer encoder and f_enc represents the cascaded encoder layers.

The Transformer encoder is initialized with the parameters of the encoder of a Transformer-based object detection model trained on MS-COCO.
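For illustration, a single encoder layer with the Q/K/V assignment described above could look like the following sketch; the number of attention heads and the feed-forward width are assumed values, not values taken from the patent.

```python
# Sketch of one encoder layer: self-attention -> residual + layer norm -> FFN -> residual + layer norm.
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, src, pos):
        # src: (H*W, batch, d_model) flattened pixel features F; pos: matching position encoding PE.
        q = k = src + pos                          # query and key are F + PE, the value is F
        attn_out, _ = self.self_attn(q, k, value=src)
        src = self.norm1(src + attn_out)           # residual connection + layer normalization
        src = self.norm2(src + self.ffn(src))      # feed-forward + residual + layer normalization
        return src
```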
(3) Next, the Transformer encoder output feature E and the position encoding PE are input into the Transformer decoder.
In this embodiment, the Transformer decoder is composed of l cascaded decoder layers, where l is likewise set to 6, corresponding to the number of encoder layers. The decoder structure of this embodiment is shown in FIG. 3; each decoder layer is composed of a self-attention module, a residual connection network, a layer normalization processing module, a mutual attention module, a residual connection network, a layer normalization processing module, a feed-forward network, a residual connection network and a layer normalization processing module which are cascaded. The query matrix, the key matrix and the value matrix of the mutual attention module of the decoder are respectively:

Q_d = Q
K_d = E + PE
V_d = E

where Q is a set of learnable vectors.

The computational process of the Transformer decoder is expressed as:

D = f_dec(Q, D_0, E, PE)

where D is the output of the Transformer decoder and D_0 is the initialization feature, for which a zero matrix is used in this embodiment.

The Transformer decoder is initialized with the parameters of the decoder of a Transformer-based object detection model trained on MS-COCO.
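A corresponding sketch of one decoder layer is shown below; again the head count and hidden width are assumptions, and the query/key/value assignment of the mutual attention module follows the description above.

```python
# Sketch of one decoder layer: self-attention over the query slots, mutual (cross-) attention over E, FFN.
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, tgt, query_embed, memory, pos):
        # tgt: (N_q, batch, d_model) decoder features, starting from the initialization feature D_0.
        # query_embed: the learnable query set Q; memory: encoder output E; pos: position encoding PE.
        q = k = tgt + query_embed
        tgt = self.norm1(tgt + self.self_attn(q, k, value=tgt)[0])
        # Mutual attention: query uses the query set, key is E + PE, value is E.
        tgt = self.norm2(tgt + self.cross_attn(tgt + query_embed, memory + pos, value=memory)[0])
        tgt = self.norm3(tgt + self.ffn(tgt))
        return tgt
```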
(4) D is passed through the human-object relationship detection network to obtain the final prediction results; the jth prediction triple comprises the normalized box positions of the person and the object, the object class prediction and the relationship class prediction, where N_obj and N_act are respectively the number of object classes and the number of relationship classes represented in the data set. The human-object relationship detection network comprises 4 feed-forward networks F_h, F_o, F_c and F_a, which take the decoder output features and predict the person box, the object box, the object class and the relationship class, respectively. F_h and F_o each consist of 3 fully connected layers with ReLU activation functions between the layers; F_c and F_a are each a single fully connected layer. F_h, F_o and F_c are initialized with the parameters of the object detection heads of a Transformer-based object detection model trained on MS-COCO; F_a is randomly initialized.
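The four prediction heads can be sketched as follows; the output activations (sigmoid for the boxes and relationship classes, softmax applied to the object logits inside the loss) and the HICO-DET class counts are assumptions based on common practice, since the exact formulas are shown only as images.

```python
# Sketch of the four feed-forward prediction heads F_h, F_o, F_c, F_a described above.
import torch
import torch.nn as nn

class MLP(nn.Module):
    """3 fully connected layers with ReLU in between, as used for F_h and F_o."""
    def __init__(self, d_model, d_out):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_out),
        )
    def forward(self, x):
        return self.layers(x)

class HOIHeads(nn.Module):
    def __init__(self, d_model=256, n_obj=80, n_act=117):   # 80 objects / 117 relations on HICO-DET
        super().__init__()
        self.f_h = MLP(d_model, 4)                 # person box (cx, cy, w, h)
        self.f_o = MLP(d_model, 4)                 # object box
        self.f_c = nn.Linear(d_model, n_obj + 1)   # object class (+1 for "no object")
        self.f_a = nn.Linear(d_model, n_act)       # relationship class

    def forward(self, dec_out):                    # dec_out: (N_q, d_model) decoder output D
        return {
            "boxes_h": self.f_h(dec_out).sigmoid(),
            "boxes_o": self.f_o(dec_out).sigmoid(),
            "logits_obj": self.f_c(dec_out),       # softmax / cross-entropy applied in the loss
            "logits_act": self.f_a(dec_out),       # sigmoid / focal loss applied in the loss
        }
```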
After the prediction results of the Student network are obtained, they are matched with the annotated relationship pairs by the Hungarian algorithm. The matching cost matrix combines, for each pair of a prediction and an annotated relationship pair, the L1 box regression cost and the generalized IoU (GIOU, short for generalized IoU) cost of the person and object boxes, the object classification cost and the relationship classification cost; Φ represents the set of empty-set indices of the annotated relationship pairs in the picture. Minimizing the total matching cost yields, for each annotated relationship pair, the index of its corresponding predicted relationship pair.
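A minimal sketch of the Hungarian matching step is given below; the decomposition of the cost matrix into box, GIoU, object-class and relationship-class terms and the QPIC-style weights are assumptions consistent with the loss weights used later in this embodiment.

```python
# Sketch of Hungarian matching between N_q predicted pairs and N_gt annotated pairs.
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(cost_box, cost_giou, cost_obj, cost_act,
                    w_b=2.5, w_u=1.0, w_c=1.0, w_a=1.0):
    """Each cost_* is an (N_q, N_gt) matrix; returns matched (pred_idx, gt_idx) pairs."""
    cost = w_b * cost_box + w_u * cost_giou + w_c * cost_obj + w_a * cost_act
    pred_idx, gt_idx = linear_sum_assignment(cost)   # minimizes the total matching cost
    return pred_idx, gt_idx

# Usage example with random costs for 100 predicted pairs and 3 annotated pairs:
rng = np.random.default_rng(0)
costs = [rng.random((100, 3)) for _ in range(4)]
print(hungarian_match(*costs))
```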
S3, constructing the Teacher network and initializing the Teacher network.
The Teacher network is also composed of a CNN-based deep neural network, a Transformer encoder, a Transformer decoder and a human-object relationship detection network, and shares the parameters of these networks with the Student network.
The difference between the Teacher network and the Student network lies in the Transformer decoder, specifically as follows:
(1) Query matrix of the mutual attention module in the decoder: in the Student Transformer decoder, the query matrix of the mutual attention module is a set Q of learnable vectors, whereas in the Teacher Transformer decoder it is a set Q_t of position features of the annotated relationship pairs. Q_t is constructed as follows:

Q_t = tanh(F_q(H_t))

where H_t is the set of position feature encodings of the annotated relationship pairs in the picture. In each position feature encoding, the first 8 elements represent the center coordinates, width and height of the person box and the object box of the corresponding annotated relationship pair, and the last 4 elements represent the relative position and the areas of the two boxes. F_q consists of 2 fully connected layers with a ReLU activation function between them; the initial parameters of F_q are randomly initialized.
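The construction of Q_t can be sketched as follows; the exact layout of the 12-dimensional position feature (in particular the form of the relative-position terms) is an assumption paraphrased from the description above.

```python
# Sketch of building the Teacher query matrix Q_t from the annotated pair geometry.
import torch
import torch.nn as nn

def encode_pair(box_h, box_o):
    """box_* = (cx, cy, w, h), normalized. Returns the 12-d position feature (layout assumed)."""
    cx_h, cy_h, w_h, h_h = box_h
    cx_o, cy_o, w_o, h_o = box_o
    rel = [cx_o - cx_h, cy_o - cy_h]               # relative position of the two boxes (assumed form)
    areas = [w_h * h_h, w_o * h_o]                 # areas of the two boxes
    return torch.tensor([cx_h, cy_h, w_h, h_h, cx_o, cy_o, w_o, h_o, *rel, *areas])

class TeacherQuery(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        # F_q: 2 fully connected layers with ReLU in between, randomly initialized.
        self.f_q = nn.Sequential(nn.Linear(12, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

    def forward(self, h_t):                        # h_t: (N_t, 12) stacked position features
        return torch.tanh(self.f_q(h_t))           # Q_t = tanh(F_q(H_t))
```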
(2) Initialization feature of the decoder: in the Student Transformer decoder, the initialization feature D_0 is a zero vector, whereas in the Teacher Transformer decoder the initialization feature D_0^t is the set of word vector features of the objects in the annotated relationship pairs. D_0^t is constructed as follows:

D_0^t = F_w(W_t)

where W_t is the set of word vectors of the objects in the annotated relationship pairs; in this embodiment, the word vector of the object in the ith annotated relationship pair is obtained from a trained CLIP model, and the dimension of each word vector is 512. F_w consists of 2 fully connected layers with a ReLU activation function between them; the parameters of F_w are randomly initialized.
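Building the Teacher initialization feature from CLIP word vectors could look like the sketch below; the use of the public openai/CLIP package and the specific model variant are assumptions about how the 512-dimensional word vectors are obtained.

```python
# Sketch of constructing D_0^t from CLIP word vectors of the annotated objects.
import torch
import torch.nn as nn
import clip   # assumed: the public openai/CLIP package

class TeacherInit(nn.Module):
    def __init__(self, d_model=256, word_dim=512):
        super().__init__()
        # F_w: 2 fully connected layers with ReLU in between, randomly initialized.
        self.f_w = nn.Sequential(nn.Linear(word_dim, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

    def forward(self, word_vecs):                  # word_vecs: (N_t, 512) frozen CLIP text features
        return self.f_w(word_vecs)                 # D_0^t = F_w(W_t)

# Obtaining the word vectors (done once, offline):
model, _ = clip.load("ViT-B/32")
with torch.no_grad():
    tokens = clip.tokenize(["bicycle", "umbrella"])    # object names of the annotated pairs
    word_vecs = model.encode_text(tokens).float()      # (2, 512)
print(TeacherInit()(word_vecs).shape)
```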
For the Teacher Transformer decoder, the query matrix is Q_t, and the key matrix and the value matrix are respectively:

K_d = E + PE
V_d = E

which are the same as those of the Student Transformer decoder.

The computational process of the Teacher Transformer decoder is expressed as:

D_t = f_dec(Q_t, D_0^t, E, PE)

where D_t is the output of the Teacher Transformer decoder.
S4, supervising the outputs of the Student network and the Teacher network with preset loss functions during training.
After the prediction outputs of the Teacher and Student networks are obtained, several loss functions are used for supervision. For each network, the total relationship detection loss is the weighted sum

L = λ_b·L_1 + λ_u·L_GIoU + λ_c·L_ce + λ_a·L_focal

computed separately for the Teacher network (denoted L_t) and the Student network (denoted L_s), where L_1, L_GIoU, L_ce and L_focal are the L1 loss, generalized IoU loss, cross-entropy loss and focal loss of the corresponding network. The L1 loss and the generalized IoU loss supervise the box regression of the person and the object; the cross-entropy loss supervises the object class classification; the focal loss, computed per element, supervises the relationship class classification. The Student losses are computed on the predictions matched to the picture annotations by the Hungarian algorithm, and the Teacher losses on the Teacher network predictions; the losses are averaged over the annotated relationship pairs that are not padded with the empty set (Φ denotes the set of padded empty-set indices), and N_q is the number of predicted potential relationship pairs. λ_b, λ_u, λ_c and λ_a are weights, set to 2.5, 1, 1 and 1 respectively in this embodiment.
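A simplified sketch of the total detection loss for one network is given below; it operates on already-matched prediction/annotation pairs, uses a focal loss with an assumed focusing parameter of 2, and omits the empty-set padding terms, so it is an approximation of the supervision described above rather than the exact formulation.

```python
# Sketch of the weighted total detection loss over matched pairs.
import torch
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss

def detection_loss(pred, target, lambda_b=2.5, lambda_u=1.0, lambda_c=1.0, lambda_a=1.0):
    """pred/target hold matched boxes (xyxy), object labels and multi-hot relationship labels."""
    l_box = F.l1_loss(pred["boxes"], target["boxes"])                         # L1 box regression
    l_giou = generalized_box_iou_loss(pred["boxes"], target["boxes"],
                                      reduction="mean")                        # GIoU box regression
    l_obj = F.cross_entropy(pred["logits_obj"], target["labels_obj"])          # object class classification
    # Per-element focal loss for the multi-label relationship classification (gamma = 2 assumed):
    p = pred["logits_act"].sigmoid()
    t = target["labels_act"]
    l_focal = -((1 - p) ** 2 * t * p.clamp_min(1e-8).log()
                + p ** 2 * (1 - t) * (1 - p).clamp_min(1e-8).log()).mean()
    return lambda_b * l_box + lambda_u * l_giou + lambda_c * l_obj + lambda_a * l_focal
```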
S5, using a preset distillation loss function to bring the predictions of the Student Transformer decoder and the Teacher Transformer decoder closer together during training.
The distillation loss functions used in this embodiment include a cosine embedding loss and a KL-divergence loss.
The cosine embedding loss L_cos is used to pull the features output by the two decoders closer together: it is computed between the ith feature vector of the Teacher Transformer decoder output D_t and the ith feature vector of the Student Transformer decoder output D.
The KL-divergence loss L_KL is used to pull the attention matrices of the two decoders closer together: it is computed between the average attention matrices output by the jth decoder layer for the ith feature vector of the Teacher Transformer decoder output D_t and of the Student Transformer decoder output D, respectively.
The overall distillation loss function is:

L_dis = α_1·L_cos + α_2·L_KL

where α_1 and α_2 are weights; in this embodiment, α_1 and α_2 are set to 1 and 10 respectively.
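The two distillation terms can be sketched as follows; the normalization over queries and decoder layers is an assumption, since the exact formulas appear only as images.

```python
# Sketch of the distillation losses: cosine embedding on decoder features, KL on attention maps.
import torch
import torch.nn.functional as F

def distillation_loss(d_student, d_teacher, attn_student, attn_teacher,
                      alpha1=1.0, alpha2=10.0):
    """
    d_*:    (N_t, d_model) decoder output features of the matched queries.
    attn_*: (layers, N_t, H*W) average cross-attention maps per decoder layer (rows sum to 1).
    """
    # Cosine embedding loss: pull corresponding feature vectors together (target = +1 for every pair).
    target = torch.ones(d_student.size(0), device=d_student.device)
    l_cos = F.cosine_embedding_loss(d_student, d_teacher, target)
    # KL-divergence between the Student (input, log-probabilities) and Teacher (target) attention maps.
    l_kl = F.kl_div(attn_student.clamp_min(1e-8).log(), attn_teacher, reduction="batchmean")
    return alpha1 * l_cos + alpha2 * l_kl
```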
S6, acquiring the detection result of the relationship between the person and the object by adopting the trained Student network in testing.
In this embodiment, the test process is characterized in that only the output of the Student network is utilized, and the Teacher network is not relied on; no extra calculation is added in the test process.
To verify the effectiveness of the present invention, experiments were performed on the HICO-DET data set as shown in Table 1 below, using the mean Average Precision (mAP) metric, which is calculated as follows: for each action class contained in the data set, the prediction precision is calculated over all test images, and the average of the precisions over all action classes is the mAP.
TABLE 1 Comparison between the invention and other methods on HICO-DET (mAP)
Method    Full     Rare     Non-rare
QPIC      29.07    21.85    31.23
Ours      30.41    25.10    32.00
In summary, compared with the prior art, the present embodiment has the following advantages and beneficial effects:
(1) The method solves the problem in existing Transformer-based methods that the query matrix of the mutual attention module of the decoder and the initialization feature of the decoder are semantically ambiguous. The method uses the position information and the object word vector information of the annotated relationship pairs to construct a query matrix and an initialization feature with explicit semantic information; it uses the idea of knowledge distillation to design a Teacher network with explicit semantic information, which guides the original Student network to learn a better attention matrix and thereby obtain more discriminative context information, greatly improving the performance of human-object relationship detection.
(2) By sharing network parameters between the Teacher and the Student and using the distillation loss function, the outputs of the two networks are pulled closer together, which greatly improves the convergence speed of the networks, reduces the training time and improves the training efficiency; sharing network parameters also reduces system complexity.
(3) In the testing process, only the output of the Student network is retained, and no extra computation is added; the test time of the model is not increased while the performance is improved.
(4) Meanwhile, the method is applicable to many Transformer-based detection networks and has wide application value.
This embodiment further provides a person-object relationship detection apparatus, comprising:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method shown in FIG. 1.
The device for detecting the relationship between a person and an object according to the embodiment of the present invention can perform the method for detecting the relationship between a person and an object according to the embodiment of the method of the present invention, and can perform any combination of the steps of the embodiments of the method.
The embodiment of the application also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and executed by the processor to cause the computer device to perform the method illustrated in fig. 1.
The embodiment also provides a storage medium, which stores instructions or programs capable of executing the method for detecting the relationship between the person and the object provided by the embodiment of the method of the invention, and when the instructions or the programs are run, the steps can be implemented by any combination of the embodiment of the method, and the method has corresponding functions and beneficial effects.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the described functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is to be determined from the appended claims along with their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the foregoing description of the specification, reference to the description of "one embodiment/example," "another embodiment/example," or "certain embodiments/examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the present invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for detecting a relationship between a person and an object, comprising the steps of:
acquiring a training set of a human-object relation detection data set, and performing enhancement processing on the training set;
constructing a Student network and initializing the Student network;
constructing a Teacher network and initializing the Teacher network;
using a preset loss function to supervise the output of the Student network and the output of the Teacher network in the training process;
using a preset distillation loss function to bring the predictions of the Student Transformer decoder and the Teacher Transformer decoder closer together during training;
in the test, a trained Student network is adopted to obtain the detection result of the relationship between the person and the object.
2. The method for detecting relationship between a person and an object according to claim 1, wherein the constructing and initializing a Student network comprises:
constructing and initializing a CNN-based deep neural network;
constructing and initializing a Transformer encoder and a Transformer decoder;
and constructing a human-object relation detection network, predicting the human-object relation in the picture to be detected according to the output of the Transformer decoder, and initializing the human-object relation detection network.
3. The method for detecting the relationship between a person and an object according to claim 2, wherein the Transformer encoder is constructed as follows:
an encoder is formed by l cascaded encoder layers, and each encoder layer consists of a self-attention module, a residual connection network, a layer normalization processing module, a feed-forward network, a residual connection network and a layer normalization processing module which are cascaded; the query matrix, the key matrix and the value matrix of the self-attention module in the encoder are respectively F + PE, F + PE and F, and the output of the Transformer encoder is E;
the Transformer decoder is constructed as follows:
a decoder is formed by l cascaded decoder layers, and each decoder layer consists of a self-attention module, a residual connection network, a layer normalization processing module, a mutual attention module, a residual connection network, a layer normalization processing module, a feed-forward network, a residual connection network and a layer normalization processing module which are cascaded; the query matrix, the key matrix and the value matrix of the mutual attention module in the decoder are respectively Q, E + PE and E, and the output of the Transformer decoder is D;
wherein F represents the output feature of the input image after passing through the CNN-based deep neural network, PE represents the position encoding of the input image, and Q represents a set of learnable vectors.
4. The method according to claim 2, wherein the human-object relationship detection network comprises 4 feed-forward networks, which consist respectively of 3 fully connected layers, 3 fully connected layers, 1 fully connected layer and 1 fully connected layer;
the 4 feed-forward networks are used to predict the position of the person, the position of the object, the object class and the relationship class, respectively.
5. The method according to claim 1, wherein the differences between the Student Transformer decoder and the Teacher Transformer decoder comprise:
in the Student Transformer decoder, the query matrix of the mutual attention module is a set Q of learnable vectors;
in the Teacher Transformer decoder, the query matrix of the mutual attention module is a set Q_t of position features of the annotated relationship pairs;
in the Student Transformer decoder, the initialization feature D_0 is a zero vector; in the Teacher Transformer decoder, the initialization feature D_0^t is a set of word vector features of the objects in the annotated relationship pairs.
6. The method according to claim 5, wherein the query matrix Q_t of the mutual attention module in the Teacher Transformer decoder is constructed as follows:

Q_t = tanh(F_q(H_t))

where H_t is the set of position feature encodings of the annotated relationship pairs in the picture; in each position feature encoding, the first 8 elements represent the center coordinates, width and height of the person box and the object box, and the last 4 elements represent the relative position and the areas of the two boxes; F_q consists of 2 fully connected layers with a ReLU activation function between them, and the initial parameters of F_q are randomly initialized;

the initialization feature D_0^t in the Teacher Transformer decoder is constructed as follows:

D_0^t = F_w(W_t)

where W_t is the set of word vectors of the objects in the annotated relationship pairs, the word vector of the object in the ith annotated relationship pair being an element of W_t; F_w consists of 2 fully connected layers with a ReLU activation function between them, and the parameters of F_w are randomly initialized.
7. The method according to claim 1, wherein the preset loss functions comprise: L1 loss, generalized IoU loss, cross-entropy loss and focal loss;
wherein the L1 loss and the generalized IoU loss are used to supervise the box regression of the person and the object; the cross-entropy loss is used to supervise the object class classification; and the focal loss is used to supervise the relationship class classification.
8. The method according to claim 1, wherein the preset distillation loss functions comprise: cosine embedding loss and KL-divergence loss;
wherein the cosine embedding loss is used to pull the features output by the two decoders closer together, and the KL-divergence loss is used to pull the attention matrices of the two decoders closer together.
9. A person-object relationship detecting apparatus, comprising:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method according to any one of claims 1-8.
10. A computer-readable storage medium, in which a program executable by a processor is stored, wherein the program executable by the processor is adapted to perform the method according to any one of claims 1 to 8 when executed by the processor.
CN202210410947.4A 2022-04-19 Method and device for detecting relationship between person and object and storage medium Active CN114863353B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210410947.4A CN114863353B (en) 2022-04-19 Method and device for detecting relationship between person and object and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210410947.4A CN114863353B (en) 2022-04-19 Method and device for detecting relationship between person and object and storage medium

Publications (2)

Publication Number Publication Date
CN114863353A true CN114863353A (en) 2022-08-05
CN114863353B CN114863353B (en) 2024-08-02



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021023202A1 (en) * 2019-08-07 2021-02-11 交叉信息核心技术研究院(西安)有限公司 Self-distillation training method and device for convolutional neural network, and scalable dynamic prediction method
CN112613303A (en) * 2021-01-07 2021-04-06 福州大学 Knowledge distillation-based cross-modal image aesthetic quality evaluation method

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115407874A (en) * 2022-08-18 2022-11-29 中国兵器工业标准化研究所 Neural network-based VR maintenance training operation proficiency prediction method

Similar Documents

Publication Publication Date Title
CN110414432B (en) Training method of object recognition model, object recognition method and corresponding device
CN109471895B (en) Electronic medical record phenotype extraction and phenotype name normalization method and system
EP3819790A2 (en) Method and apparatus for visual question answering, computer device and medium
CN112559784B (en) Image classification method and system based on incremental learning
EP3859560A2 (en) Method and apparatus for visual question answering, computer device and medium
CN109711463B (en) Attention-based important object detection method
CN113792113A (en) Visual language model obtaining and task processing method, device, equipment and medium
CN112183577A (en) Training method of semi-supervised learning model, image processing method and equipment
CN112084331A (en) Text processing method, text processing device, model training method, model training device, computer equipment and storage medium
US20230153943A1 (en) Multi-scale distillation for low-resolution detection
CN113344206A (en) Knowledge distillation method, device and equipment integrating channel and relation feature learning
Ye et al. Steering angle prediction YOLOv5-based end-to-end adaptive neural network control for autonomous vehicles
CN115482418B (en) Semi-supervised model training method, system and application based on pseudo-negative labels
US20230281826A1 (en) Panoptic segmentation with multi-database training using mixed embedding
CN111325237A (en) Image identification method based on attention interaction mechanism
CN111611796A (en) Hypernym determination method and device for hyponym, electronic device and storage medium
Park et al. Bayesian weight decay on bounded approximation for deep convolutional neural networks
CN116521899B (en) Improved graph neural network-based document level relation extraction method and system
CN114863353B (en) Method and device for detecting relationship between person and object and storage medium
CN114863353A (en) Method and device for detecting relation between person and object and storage medium
CN116503761A (en) High-voltage line foreign matter detection method, model training method and device
CN114741487B (en) Image-text retrieval method and system based on image-text semantic embedding
CN115424275A (en) Fishing boat brand identification method and system based on deep learning technology
CN113779244B (en) Document emotion classification method and device, storage medium and electronic equipment
Guo et al. Deep Learning-Based Image Retrieval With Unsupervised Double Bit Hashing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant