CN116071809B - Face space-time representation generation method based on multi-class representation space-time interaction - Google Patents

Face space-time representation generation method based on multi-class representation space-time interaction

Info

Publication number
CN116071809B
CN116071809B (application number CN202310285315.4A)
Authority
CN
China
Prior art keywords
face
representation
space
time
relation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310285315.4A
Other languages
Chinese (zh)
Other versions
CN116071809A (en)
Inventor
蒋冬梅
李岩
王耀威
蓝湘源
吕科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peng Cheng Laboratory filed Critical Peng Cheng Laboratory
Priority to CN202310285315.4A
Publication of CN116071809A
Application granted
Publication of CN116071809B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a face space-time representation generation method based on multi-class representation space-time interaction. The method solves the problem that the face space-time representations generated by existing learning methods, which extract only local face representations with a convolutional neural network, are of low effectiveness because the high-level relation information between different face regions is ignored.

Description

Face space-time representation generation method based on multi-class representation space-time interaction
Technical Field
The invention relates to the field of face space-time representation learning, in particular to a face space-time representation generating method based on multi-class representation space-time interaction.
Background
At present, face space-time representation learning commonly extracts local face representations with a convolutional neural network and omits modeling of the high-level relation information among different face regions, which limits the effectiveness of the learned face space-time representations to a certain extent. A few methods model face images directly with Transformers, but they lose the structural information of the face and do not fully learn its local representations.
In addition, current face representation learning mainly starts from single frames of a video and ignores the temporal dynamic information of the face video, whereas obtaining a space-time coupled face representation from the image sequence is more beneficial to tasks such as living body identity recognition and verification and state assessment. To fuse different representation sequences, most current approaches either fuse the two representations at each moment and then feed the result into a temporal dynamic model (early fusion), or model the temporal dynamics of each representation separately and perform decision-level fusion afterwards (late fusion). These approaches ignore the dynamic interaction relationship between the two representations and cannot achieve simultaneous interaction within the same type of representation at different times and between different types of representations.
Accordingly, there is a need for improvement and development in the art.
Disclosure of Invention
In view of the above-mentioned defects in the prior art, the invention provides a face space-time representation generation method based on multi-class representation space-time interaction, aiming to solve the problem that the face space-time representations generated by existing methods are of limited effectiveness because existing face space-time representation learning extracts only local face representations with a convolutional neural network and ignores the high-level relation information among different face regions.
The technical scheme adopted by the invention for solving the problems is as follows:
in a first aspect, an embodiment of the present invention provides a face space-time representation generating method based on multi-class representation space-time interaction, where the method includes:
acquiring video data to be processed, and determining a face image sequence according to the video data, wherein the face image sequence comprises face images respectively corresponding to a plurality of moments;
acquiring a face local representation and a face relation representation which are respectively corresponding to the face images, wherein the face local representation is a representation extracted based on local information of the face images, and the face relation representation is a representation extracted based on association relations between different areas of the face images;
and carrying out space-time interaction on each face local representation and each face relation representation to obtain the face space-time representation corresponding to the video data, wherein the space-time interaction is interaction in time and space between representations of the same type at different moments.
In one embodiment, the method for acquiring the local face representation corresponding to each face image includes:
inputting the face image into a pre-training convolutional neural network;
and acquiring the face local representation output by the pre-training convolutional neural network based on the face image.
In one embodiment, the method for obtaining the face relation representation corresponding to each face image includes:
inputting the face image into a space diagram attention network, wherein a diagram attention mechanism of the space diagram attention network operates based on a face region relation diagram, and the face region relation diagram is used for reflecting the association relation between regions in each face image;
and acquiring the face relation representation output by the space diagram attention network based on the face image.
In one embodiment, the method for generating the face region relation graph includes:
dividing one face image into a plurality of subareas on average;
and determining the face region relation graph according to the subareas, wherein the face region relation graph comprises a plurality of nodes and edges, each node in the face region relation graph corresponds to each subarea one by one, and the edges in the face region relation graph are used for reflecting the association relation among the nodes.
In one embodiment, if two nodes in the face region relationship graph are adjacent and/or bilaterally symmetric, an edge exists between the two nodes.
In one embodiment, the calculation process of the spatial map attention network for each face image includes:
acquiring relation coefficients between every two subareas in the face image through a graph attention mechanism of the spatial graph attention network, and carrying out normalization processing on the relation coefficients;
aggregating neighbor information according to the relation coefficients after normalization processing to obtain corresponding relation characterization of each subarea;
and carrying out average processing according to each relation representation to obtain the face relation representation corresponding to the face image.
In one embodiment, the performing space-time interaction on each of the local face representations and each of the face relation representations to obtain a face space-time representation corresponding to the video data includes:
inputting each face local representation and each face relation representation into a double-flow space-time diagram attention network, wherein a diagram attention mechanism of the double-flow space-time diagram attention network operates based on double-flow face representation space-time diagrams, and the double-flow face representation space-time diagrams are used for reflecting interaction relations between each face local representation and each face relation representation;
and acquiring the face space-time representation output by the double-flow space-time diagram attention network based on the face local representation and the face relation representation.
In one embodiment, the dual-flow face representation space-time diagram comprises a plurality of local nodes, relationship nodes and edges, wherein each local node corresponds to each face local representation one by one, each relationship node corresponds to each face relationship representation one by one, and the edges in the dual-flow face representation space-time diagram are used for reflecting interaction relations between the same type of nodes and between two types of nodes.
In one embodiment, if the time interval between the moments corresponding to the two nodes in the dual-flow face representation space-time diagram is smaller than a preset threshold, an edge exists between the two nodes.
In one embodiment, the calculation process of the dual-flow space-time diagram attention network is as follows:
generating a high-dimensional embedded representation based on each face local representation and each face relation representation through a graph attention mechanism of the double-flow space-time graph attention network, wherein the high-dimensional embedded representation comprises embedded representations respectively corresponding to each moment, and the embedded representation at each moment is obtained by splicing the face local representation and the face relation representation at the moment;
and carrying out average processing according to the high-dimensional embedded representation to obtain the face space-time representation.
In one embodiment, the method further comprises:
acquiring a preset database, wherein the database comprises a plurality of candidate face space-time characterizations, and each candidate face space-time characterization corresponds to different identity information respectively;
acquiring cosine similarity between the face space-time representation and each candidate face space-time representation in the database;
and determining an identification result corresponding to the video data according to the cosine similarity.
In a second aspect, an embodiment of the present invention further provides a face space-time representation generating device based on multi-class representation space-time interaction, where the device includes:
the data acquisition module is used for acquiring video data to be processed and determining a face image sequence according to the video data, wherein the face image sequence comprises face images corresponding to a plurality of moments respectively;
the face image processing module is used for processing the face image to obtain a face local representation and a face relation representation corresponding to the face image respectively, wherein the face local representation is a representation extracted based on local information of the face image, and the face relation representation is a representation extracted based on association relation between different areas of the face image;
the representation interaction module is used for carrying out space-time interaction on each face local representation and each face relation representation to obtain the face space-time representation corresponding to the video data, wherein the space-time interaction is interaction in time and space between the same type of representation at different moments.
In a third aspect, an embodiment of the present invention further provides a terminal, where the terminal includes a memory and one or more processors; the memory stores more than one program; the program comprising instructions for performing a face space-time representation generation method based on multi-class representation space-time interactions as described in any of the above; the processor is configured to execute the program.
In a fourth aspect, an embodiment of the present invention further provides a computer readable storage medium, where a plurality of instructions are stored, where the instructions are adapted to be loaded and executed by a processor, to implement the steps of any one of the above-mentioned face space-time representation generating methods based on multi-class representation space-time interactions.
The invention has the beneficial effects that: according to the embodiment of the invention, the face local representation and the face relation representation between different face regions are learned simultaneously, and space-time dynamic interaction modeling of the two representations is realized, so that space-time interaction within the same type of representation at different times and between different types of representations is achieved, and finally a more reliable face space-time representation is obtained. The method solves the problem that the face space-time representations generated by existing learning methods, which extract only local face representations with a convolutional neural network, are of low effectiveness because the high-level relation information between different face regions is ignored.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present invention, and other drawings may be obtained according to the drawings without inventive effort to those skilled in the art.
Fig. 1 is a schematic flow chart of a face space-time representation generating method based on multi-class representation space-time interaction according to an embodiment of the invention.
Fig. 2 is an overall framework diagram of a face space-time representation generating method based on multi-class representation space-time interaction provided by an embodiment of the invention.
Fig. 3 is a schematic diagram of the construction of a face region relationship diagram according to an embodiment of the present invention.
Fig. 4 is a schematic construction diagram of a double-flow face representation space-time diagram provided by the embodiment of the invention.
Fig. 5 is a schematic block diagram of a face space-time representation generating device based on multi-class representation space-time interaction according to an embodiment of the present invention.
Fig. 6 is a schematic block diagram of a terminal according to an embodiment of the present invention.
Detailed Description
To make the purposes, technical solutions and effects of the invention clearer and more definite, the face space-time representation generation method based on multi-class representation space-time interaction disclosed by the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Aiming at the defects in the prior art, the invention provides a face space-time representation generation method based on multi-class representation space-time interaction, which comprises the following steps: acquiring video data to be processed, and determining a face image sequence according to the video data, wherein the face image sequence comprises face images respectively corresponding to a plurality of moments; acquiring a face local representation and a face relation representation which are respectively corresponding to the face images, wherein the face local representation is a representation extracted based on local information of the face images, and the face relation representation is a representation extracted based on association relations between different areas of the face images; and carrying out space-time interaction on each face local representation and each face relation representation to obtain the face space-time representation corresponding to the video data, wherein the space-time interaction is interaction in time and space between representations of the same type at different moments. According to the invention, the face local representation and the face relation representation between different face regions are learned simultaneously, and space-time dynamic interaction modeling of the two representations is realized, so that space-time interaction within the same type of representation at different times and between different types of representations is achieved, and finally a more reliable face space-time representation is obtained. This solves the problem that the face space-time representations generated by existing learning methods, which extract only local face representations with a convolutional neural network, are of low effectiveness because the high-level relation information between different face regions is ignored.
As shown in fig. 1, the method includes:
step S100, obtaining video data to be processed, and determining a face image sequence according to the video data, wherein the face image sequence comprises face images respectively corresponding to a plurality of moments;
step S200, obtaining a face local representation and a face relation representation which are respectively corresponding to the face images, wherein the face local representation is a representation extracted based on local information of the face images, and the face relation representation is a representation extracted based on association relations between different areas of the face images;
step S300, carrying out space-time interaction on each face local representation and each face relation representation to obtain the face space-time representation corresponding to the video data, wherein the space-time interaction is interaction in time and space between the same type of representation at different moments.
Specifically, the video data to be processed may be any face video. First, the video data is converted into a face image sequence, yielding a plurality of face images with different timestamps. Then, two complementary face representations, the face local representation and the face relation representation, are extracted from each face image, giving one representation sequence composed of the face local representations of all face images and another composed of their face relation representations. Finally, the two representation sequences are organically fused through space-time interaction, so that the dynamic interaction relationships within the same type of representation at different moments and between different types of representations are modeled, and an effective face space-time representation is learned. The face space-time representation can be used for tasks such as person identity recognition, verification and state assessment in video data.
In one implementation manner, the method for acquiring the face local representation corresponding to each face image includes:
step S201, inputting the face image into a pre-training convolutional neural network;
step S202, obtaining the face local representation output by the pre-training convolutional neural network based on the face image.
Specifically, the pre-trained convolutional neural network may be InsightFace, VGGFace or the like. Because such a network has been trained in advance on massive data, an accurate face local representation can be obtained after each frame of image is input into the pre-trained convolutional neural network.
For example, for one piece of video data, a face detection algorithm such as OpenFace or dlib is used to detect, track and align the face in the video data, and a face image sequence is output:

$X = \{x_1, x_2, \dots, x_T\}$

where $x_t$ denotes the face image at the $t$-th timestamp.

For the face image sequence $X$, each frame is processed by the pre-trained convolutional neural network to obtain the face local representation:

$f_t^{local} = \mathrm{CNN}(x_t)$

where $\mathrm{CNN}(\cdot)$ denotes the selected pre-trained convolutional neural network, and $f_t^{local}$ denotes the face local representation at the $t$-th timestamp.
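As a minimal illustrative sketch of this per-frame extraction step: the class name FaceLocalEncoder, the torchvision ResNet-18 backbone and the 112x112 crop size below are assumptions for demonstration, not the patent's actual InsightFace/VGGFace network.

```python
# Sketch: extract one face local representation per frame with a pre-trained CNN.
import torch
import torch.nn as nn
import torchvision.models as models

class FaceLocalEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # Drop the classification head and keep the pooled convolutional features.
        self.features = nn.Sequential(*list(backbone.children())[:-1])

    def forward(self, frames):            # frames: (T, 3, H, W) aligned face crops
        feats = self.features(frames)     # (T, 512, 1, 1) after global average pooling
        return feats.flatten(1)           # (T, 512): one local representation per frame

# Usage: a sequence of T = 16 aligned face crops from one video.
frames = torch.randn(16, 3, 112, 112)
local_reprs = FaceLocalEncoder()(frames)  # shape (16, 512)
```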
In one implementation manner, the method for acquiring the face relation representation corresponding to each face image includes:
step S203, inputting the face image into a space diagram attention network, wherein a diagram attention mechanism of the space diagram attention network operates based on a face region relation diagram, and the face region relation diagram is used for reflecting the association relation between the regions in each face image;
step S204, the face relation representation output by the space diagram attention network based on the face image is obtained.
Specifically, in this embodiment, a spatial map attention network is further preset to model association relationships between different areas of the face image, so as to obtain a face relationship representation of the face image. The diagram attention mechanism of the space diagram attention network mainly operates according to a preset face area relation diagram. The human face region relation graph reflects the association relation among the regions in each frame of human face image, so that the trained space graph attention network can accurately extract the human face relation representation in each frame of human face image.
For example, a graph attention mechanism is introduced to dynamically model the association relationships between different face regions:

$f_t^{rel} = \mathrm{SGAT}(x_t)$

where $\mathrm{SGAT}(\cdot)$ denotes the spatial graph attention network and $f_t^{rel}$ denotes the face relation representation of the face image at the $t$-th timestamp.
In one implementation manner, the method for generating the face region relation graph includes:
step S10, dividing one face image into a plurality of subareas on average;
step S11, determining the face area relation diagram according to the subareas, wherein the face area relation diagram comprises a plurality of nodes and edges, each node in the face area relation diagram corresponds to each subarea one by one, and the edges in the face area relation diagram are used for reflecting the association relation among the nodes.
Specifically, as shown in fig. 3, one face image is evenly divided into $N$ sub-regions; the sub-regions are then used as the nodes of the graph, and the association relationships among the sub-regions are used as the edges of the graph, so as to construct the face region relation graph.
In one implementation, if two nodes in the face region relationship graph are adjacent and/or bilateral symmetry, an edge exists between the two nodes.
Specifically, the existence of edges in the face region relation diagram defined in this embodiment needs to satisfy the following conditions: the two sub-areas are spatially adjacent or the two sub-areas are spatially right and left symmetric, thereby ensuring that no excessive context information is introduced.
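A minimal sketch of how such a face region relation graph could be assembled is given below; the 4x4 grid size and the added self-loops are assumptions for illustration, not values taken from the patent.

```python
# Sketch: sub-regions of an evenly divided face image are nodes; an edge links two
# nodes that are spatially adjacent (4-neighbourhood assumed) or left-right symmetric.
import numpy as np

def build_region_graph(rows=4, cols=4):
    n = rows * cols
    adj = np.zeros((n, n), dtype=np.float32)
    idx = lambda r, c: r * cols + c
    for r in range(rows):
        for c in range(cols):
            i = idx(r, c)
            # Edges to spatially adjacent sub-regions (up/down/left/right).
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    adj[i, idx(rr, cc)] = 1.0
            # Edge to the bilaterally (left-right) symmetric sub-region in the same row.
            adj[i, idx(r, cols - 1 - c)] = 1.0
    np.fill_diagonal(adj, 1.0)  # self-loops so every node can attend to itself
    return adj

region_adjacency = build_region_graph()  # (16, 16) adjacency for a 4x4 split
```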
In one implementation, the computation process of the spatial map attention network for each face image includes:
step S2041, obtaining relation coefficients between every two subareas in the face image through a graph attention mechanism of the spatial graph attention network, and carrying out normalization processing on the relation coefficients;
step S2042, aggregating neighbor information according to the relation coefficients after normalization processing to obtain corresponding relation characterization of the subareas;
and step S2043, carrying out average processing according to each relation representation to obtain the face relation representation corresponding to the face image.
Specifically, the relation coefficient between every two sub-regions is first calculated:

$e_{ij} = \sigma\left(a\left(W h_i, W h_j\right)\right)$

where $\sigma(\cdot)$ denotes a nonlinear activation function, $a(\cdot,\cdot)$ is the attention mechanism that dynamically learns the relationship between two sub-regions, $W$ denotes the feature mapping matrix used to obtain sufficient expressive power, and $h_i$ and $h_j$ denote the representations of the $i$-th and $j$-th sub-regions, respectively.

Then, the relation coefficients are normalized to facilitate the aggregation of neighbor information:

$\alpha_{ij} = \mathrm{softmax}_j\left(e_{ij}\right) = \dfrac{\exp\left(e_{ij}\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(e_{ik}\right)}$

Next, the neighbor information is aggregated using the normalized relation coefficients to obtain the relation representation of each sub-region:

$h_i' = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W h_j\right)$

Finally, the relation representations $h_i'$ of all sub-regions at moment $t$ are averaged to obtain the face relation representation $f_t^{rel}$ of the whole face image.
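As an illustrative sketch of this computation only: the single attention head, feature sizes and the dense placeholder adjacency below are assumptions, not the patent's actual configuration.

```python
# Sketch: relation coefficients between connected sub-regions, softmax normalisation,
# neighbour aggregation, then averaging over nodes to get the face relation representation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGraphAttention(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # feature mapping matrix W
        self.a = nn.Linear(2 * out_dim, 1, bias=False)   # attention mechanism a(., .)

    def forward(self, h, adj):               # h: (N, in_dim) sub-region features, adj: (N, N)
        Wh = self.W(h)                        # (N, out_dim)
        n = Wh.size(0)
        pairs = torch.cat([Wh.unsqueeze(1).expand(n, n, -1),
                           Wh.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))       # relation coefficients e_ij
        e = e.masked_fill(adj == 0, float('-inf'))        # keep only the graph's edges
        alpha = torch.softmax(e, dim=-1)                  # normalised coefficients
        h_rel = torch.relu(alpha @ Wh)                    # aggregate neighbour information
        return h_rel.mean(dim=0)              # face relation representation of the image

# Usage: 16 sub-region features of dimension 512; a dense placeholder graph is used here,
# in practice the face region relation graph built earlier would be passed in.
sgat = SpatialGraphAttention(512, 256)
f_rel = sgat(torch.randn(16, 512), torch.ones(16, 16))   # shape (256,)
```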
In one implementation, the step S300 specifically includes:
step S301, inputting each face local representation and each face relation representation into a double-flow space-time diagram attention network, wherein a diagram attention mechanism of the double-flow space-time diagram attention network operates based on double-flow face representation space-time diagrams, and the double-flow face representation space-time diagrams are used for reflecting interaction relations between each face local representation and each face relation representation;
step S302, acquiring the human face space-time representation output by the double-flow space-time diagram attention network based on the human face local representation and the human face relation representation.
Specifically, in this embodiment, a dual-flow space-time diagram attention network is preset to model the space-time dynamic interaction of the two types of representations, the face local representation and the face relation representation, so that interaction enhancement within the same type of representation at different times and between different types of representations is realized, and the interaction-enhanced face space-time representation is output. The graph attention mechanism of the dual-flow space-time diagram attention network mainly operates according to a preset dual-flow face representation space-time diagram, which reflects the interaction relations between each face local representation and each face relation representation, so that the trained network can perform interaction enhancement on the two representations and output reliable face space-time representations for tasks such as person identity recognition, verification and state assessment in video data.
In one implementation manner, the dual-flow face representation space-time diagram comprises a plurality of local nodes, relationship nodes and edges, wherein each local node corresponds to each face local representation one by one, each relationship node corresponds to each face relationship representation one by one, and the edges in the dual-flow face representation space-time diagram are used for reflecting interaction relations between the same type of nodes and between two types of nodes.
Specifically, as shown in fig. 4, the face local representation and the face relation representation at each moment are used respectively as the local nodes and relation nodes of the graph, and the interaction relations between the representations are used as the edges of the graph to construct the double-flow face representation space-time diagram. Each local node can interact with other local nodes at different moments, with relation nodes at different moments, and with itself; similarly, each relation node can interact with other relation nodes at different moments, with local nodes at different moments, and with itself. Therefore, the double-flow space-time diagram attention network running on the double-flow face representation space-time diagram can realize interaction enhancement within the same type of representation at different times and between different types of representations.
In one implementation manner, if the time interval between the moments corresponding to the two nodes in the double-flow face representation space-time diagram is smaller than a preset threshold, an edge exists between the two nodes.
Specifically, the present embodiment defines that an edge exists only between two nodes whose time interval is smaller than the preset threshold, and no edge exists between two nodes whose time interval is greater than or equal to the preset threshold, so as to ensure that excessive long-term temporal context information is not introduced. The two nodes may be two local nodes, two relation nodes, or one local node and one relation node.
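A minimal sketch of how such a double-flow face representation space-time graph could be built follows; the window of 3 frames is an assumed threshold, not a value taken from the patent.

```python
# Sketch: 2*T nodes (one local node and one relation node per timestamp), with an edge
# between any two nodes whose time interval is below the threshold.
import numpy as np

def build_spacetime_graph(num_frames, window=3):
    # Nodes 0..T-1 are local nodes, nodes T..2T-1 are relation nodes.
    times = np.concatenate([np.arange(num_frames), np.arange(num_frames)])
    adj = (np.abs(times[:, None] - times[None, :]) < window).astype(np.float32)
    return adj  # includes self-edges, local-local, relation-relation and cross edges

st_adjacency = build_spacetime_graph(16)  # (32, 32)
```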
In one implementation, the computing process of the dual-flow space-time diagram attention network is:
step S3021, generating a high-dimensional embedded representation based on each face local representation and each face relation representation through a graph attention mechanism of the dual-flow space-time graph attention network, wherein the high-dimensional embedded representation comprises embedded representations respectively corresponding to each moment, and the embedded representation at each moment is obtained by splicing the face local representation and the face relation representation at the moment;
step S3022, carrying out average processing according to the high-dimensional embedded representation to obtain the face space-time representation.
Specifically, the relationships within and between the two types of representations at different times are first dynamically modeled using the graph attention mechanism of the dual-flow space-time graph attention network:

$\left(\tilde{f}_t^{local}, \tilde{f}_t^{rel}\right) = \mathrm{STGAT}\left(f_t^{local}, f_t^{rel}\right), \quad t = 1, \dots, T$

where $\mathrm{STGAT}(\cdot)$ denotes the dual-flow space-time graph attention network, $f_t^{local}$ and $f_t^{rel}$ denote the face local representation and the face relation representation at the $t$-th timestamp, and $\tilde{f}_t^{local}$ and $\tilde{f}_t^{rel}$ denote the face local representation and the face relation representation at the $t$-th timestamp after interaction enhancement by the network.

Then, the interaction-enhanced representations $\tilde{f}_t^{local}$ and $\tilde{f}_t^{rel}$ at each moment are concatenated to obtain the high-dimensional embedded representation $z_t = [\tilde{f}_t^{local}; \tilde{f}_t^{rel}]$.

Finally, the high-dimensional embedded representations are averaged (equivalent to pooling the embedded representation at each moment along the time dimension) to obtain the global space-time dynamic representation of the whole video data, i.e., the face space-time representation $z = \frac{1}{T}\sum_{t=1}^{T} z_t$.
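The sketch below illustrates this interaction step under simplifying assumptions (a single attention head, illustrative layer sizes and a dense placeholder adjacency); it is not the patent's exact network.

```python
# Sketch: run graph attention over the 2*T stacked nodes (T local + T relation
# representations), split the enhanced nodes back into two streams, concatenate them
# per timestamp and average over time to get the face space-time representation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamSTGAT(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)
        self.a = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, local_seq, rel_seq, adj):  # (T, D), (T, D), (2T, 2T)
        T = local_seq.size(0)
        h = torch.cat([local_seq, rel_seq], dim=0)           # stack both streams: (2T, D)
        Wh = self.W(h)
        n = Wh.size(0)
        pairs = torch.cat([Wh.unsqueeze(1).expand(n, n, -1),
                           Wh.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1)).masked_fill(adj == 0, float('-inf'))
        h_enh = torch.relu(torch.softmax(e, dim=-1) @ Wh)    # interaction-enhanced nodes
        local_enh, rel_enh = h_enh[:T], h_enh[T:]
        z = torch.cat([local_enh, rel_enh], dim=-1)          # per-timestamp embedding z_t
        return z.mean(dim=0)                                 # face space-time representation

# Usage with the space-time adjacency sketched earlier (a dense placeholder here).
model = DualStreamSTGAT(dim=256)
video_repr = model(torch.randn(16, 256), torch.randn(16, 256), torch.ones(32, 32))
```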
In one implementation, the method further comprises:
step S400, a preset database is obtained, wherein the database comprises a plurality of candidate face space-time characterizations, and each candidate face space-time characterization corresponds to different identity information respectively;
step S401, obtaining cosine similarity between the face space-time representation and each candidate face space-time representation in the database;
step S402, determining an identification result corresponding to the video data according to the cosine similarity.
Specifically, one of the application scenarios of this embodiment is living body identity recognition. During identification, the cosine similarity between the face space-time representation of the current video data and each candidate face space-time representation in the preset database is calculated, and the identification result is determined according to the identity information corresponding to the candidate (or the several candidates) with the highest cosine similarity.
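A minimal sketch of this matching step is shown below; the gallery contents and identity labels are random placeholders for illustration only.

```python
# Sketch: compare the query face space-time representation against a gallery of
# candidate representations by cosine similarity and return the best-matching identity.
import torch
import torch.nn.functional as F

def identify(query, gallery, identities):
    # query: (D,), gallery: (M, D) candidate face space-time representations
    sims = F.cosine_similarity(query.unsqueeze(0), gallery, dim=-1)  # (M,)
    best = int(sims.argmax())
    return identities[best], float(sims[best])

gallery = torch.randn(3, 512)
who, score = identify(torch.randn(512), gallery, ["id_001", "id_002", "id_003"])
```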
In one implementation manner, the overall framework of the method can be realized by a deep neural network, and as shown in fig. 2, the overall framework mainly comprises three parts, namely a face local representation learning module, a face relation representation learning module and a face space-time representation interaction reinforcement learning module. In the deep neural network training stage, the whole framework performs end-to-end joint optimization.
The invention has the advantages that:
1. spatial association relations between different face areas can be mined.
2. The method realizes simultaneous interaction and enhancement of the face local representation and the face relation representation within the same representation at different moments and between different representations.
Based on the above embodiment, the present invention further provides a device for generating a face space-time representation based on multi-class representation space-time interaction, as shown in fig. 5, where the device includes:
the data acquisition module 01 is used for acquiring video data to be processed, and determining a face image sequence according to the video data, wherein the face image sequence comprises face images respectively corresponding to a plurality of moments;
the representation extraction module 02 is configured to obtain a local representation of a face and a representation of a face relationship, which correspond to each face image respectively, where the local representation of the face is a representation extracted based on local information of the face image, and the representation of the face relationship is a representation extracted based on association relationships between different areas of the face image;
the representation interaction module 03 is configured to perform space-time interaction on each of the face local representations and each of the face relationship representations to obtain face space-time representations corresponding to the video data, where the space-time interaction is interaction in time and space between the same type of representations at different moments.
Based on the above embodiment, the present invention also provides a terminal, and a functional block diagram thereof may be shown in fig. 6. The terminal comprises a processor, a memory, a network interface and a display screen which are connected through a system bus. Wherein the processor of the terminal is adapted to provide computing and control capabilities. The memory of the terminal includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the terminal is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a face space-time representation generation method based on multi-class representation space-time interactions. The display screen of the terminal may be a liquid crystal display screen or an electronic ink display screen.
It will be appreciated by those skilled in the art that the functional block diagram shown in fig. 6 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the terminal to which the present inventive arrangements may be applied, and that a particular terminal may include more or less components than those shown, or may combine some of the components, or have a different arrangement of components.
In one implementation, the memory of the terminal has stored therein one or more programs, and the execution of the one or more programs by one or more processors includes instructions for performing a face space-time representation generation method based on multi-class representation space-time interactions.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
In summary, the invention discloses a face space-time representation generation method based on multi-class representation space-time interaction. By simultaneously learning the face local representation and the face relation representation between different face regions, and by modeling the space-time dynamic interaction of the two representations, space-time interaction within the same type of representation at different moments and between different types of representations is realized, and a more reliable face space-time representation is finally obtained. The method solves the problem that the face space-time representations generated by existing learning methods, which extract only local face representations with a convolutional neural network, are of low effectiveness because the high-level relation information between different face regions is ignored.
It is to be understood that the invention is not limited in its application to the examples described above, but is capable of modification and variation in light of the above teachings by those skilled in the art, and that all such modifications and variations are intended to be included within the scope of the appended claims.

Claims (9)

1. A face space-time representation generation method based on multi-class representation space-time interaction, the method comprising:
acquiring video data to be processed, and determining a face image sequence according to the video data, wherein the face image sequence comprises face images respectively corresponding to a plurality of moments;
acquiring a face local representation and a face relation representation which are respectively corresponding to the face images, wherein the face local representation is a representation extracted based on local information of the face images, and the face relation representation is a representation extracted based on association relations between different areas of the face images;
performing space-time interaction on each face local representation and each face relation representation to obtain a face space-time representation corresponding to the video data, wherein the space-time interaction is interaction in time and space between representations of the same type at different moments;
the method for acquiring the face relation representation corresponding to each face image comprises the following steps:
inputting the face image into a space diagram attention network, wherein a diagram attention mechanism of the space diagram attention network operates based on a face region relation diagram, and the face region relation diagram is used for reflecting the association relation between regions in each face image;
acquiring the face relation representation output by the space diagram attention network based on the face image;
the method for generating the face region relation graph comprises the following steps:
dividing one face image into a plurality of subareas on average;
determining the face region relation graph according to the subareas, wherein the face region relation graph comprises a plurality of nodes and edges, each node in the face region relation graph corresponds to each subarea one by one, and the edges in the face region relation graph are used for reflecting the association relation among the nodes;
performing space-time interaction on each face local representation and each face relation representation to obtain a face space-time representation corresponding to the video data, wherein the method comprises the following steps:
inputting each face local representation and each face relation representation into a double-flow space-time diagram attention network, wherein a diagram attention mechanism of the double-flow space-time diagram attention network operates based on double-flow face representation space-time diagrams, and the double-flow face representation space-time diagrams are used for reflecting interaction relations between each face local representation and each face relation representation;
acquiring the human face space-time representation output by the double-current space-time diagram attention network based on the human face local representation and the human face relation representation;
the double-flow face representation space-time diagram comprises a plurality of local nodes, relationship nodes and edges, wherein each local node corresponds to each face local representation one by one, each relationship node corresponds to each face relationship representation one by one, and the edges in the double-flow face representation space-time diagram are used for reflecting interaction relations within nodes of the same type and between the two types of nodes;
the calculation process of the double-flow time-space diagram attention network is as follows:
generating a high-dimensional embedded representation based on each face local representation and each face relation representation through a graph attention mechanism of the double-flow space-time graph attention network, wherein the high-dimensional embedded representation comprises embedded representations respectively corresponding to each moment, and the embedded representation at each moment is obtained by splicing the face local representation and the face relation representation at the moment;
and carrying out average processing according to the high-dimensional embedded representation to obtain the face space-time representation.
2. The method for generating the face space-time representation based on the multi-class representation space-time interaction according to claim 1, wherein the method for acquiring the face local representation corresponding to each face image comprises the following steps:
inputting the face image into a pre-training convolutional neural network;
and acquiring the face local representation output by the pre-training convolutional neural network based on the face image.
3. The method for generating the face space-time representation based on the multi-class representation space-time interaction according to claim 1, wherein if two nodes in the face region relation graph are adjacent and/or bilaterally symmetric, an edge exists between the two nodes.
4. The method for generating a face space-time representation based on multi-class representation space-time interactions according to claim 1, wherein the calculation process of the spatial map attention network for each face image comprises:
acquiring relation coefficients between every two subareas in the face image through a graph attention mechanism of the spatial graph attention network, and carrying out normalization processing on the relation coefficients;
aggregating neighbor information according to the relation coefficients after normalization processing to obtain corresponding relation characterization of each subarea;
and carrying out average processing according to each relation representation to obtain the face relation representation corresponding to the face image.
5. The face space-time representation generation method based on multi-class representation space-time interaction according to claim 1, wherein if the time interval between the moments corresponding to the two nodes in the double-flow face representation space-time diagram is smaller than a preset threshold, an edge exists between the two nodes.
6. The method for generating a face space-time representation based on multi-class representation space-time interactions of claim 1, further comprising:
acquiring a preset database, wherein the database comprises a plurality of candidate face space-time characterizations, and each candidate face space-time characterization corresponds to different identity information respectively;
acquiring cosine similarity between the face space-time representation and each candidate face space-time representation in the database;
and determining an identification result corresponding to the video data according to the cosine similarity.
7. A face space-time representation generating device based on multi-class representation space-time interactions, the device comprising:
the data acquisition module is used for acquiring video data to be processed and determining a face image sequence according to the video data, wherein the face image sequence comprises face images corresponding to a plurality of moments respectively;
the face image processing module is used for processing the face image to obtain a face local representation and a face relation representation corresponding to the face image respectively, wherein the face local representation is a representation extracted based on local information of the face image, and the face relation representation is a representation extracted based on association relation between different areas of the face image;
the representation interaction module is used for carrying out space-time interaction on each face local representation and each face relation representation to obtain the face space-time representation corresponding to the video data, wherein the space-time interaction is interaction in time and space between the same type of representation at different moments;
the method for acquiring the face relation representation corresponding to each face image comprises the following steps:
inputting the face image into a space diagram attention network, wherein a diagram attention mechanism of the space diagram attention network operates based on a face region relation diagram, and the face region relation diagram is used for reflecting the association relation between regions in each face image;
acquiring the face relation representation output by the space diagram attention network based on the face image;
the method for generating the face region relation graph comprises the following steps:
dividing one face image into a plurality of subareas on average;
determining the face region relation graph according to the subareas, wherein the face region relation graph comprises a plurality of nodes and edges, each node in the face region relation graph corresponds to each subarea one by one, and the edges in the face region relation graph are used for reflecting the association relation among the nodes;
performing space-time interaction on each face local representation and each face relation representation to obtain a face space-time representation corresponding to the video data, wherein the method comprises the following steps:
inputting each face local representation and each face relation representation into a double-flow space-time diagram attention network, wherein a diagram attention mechanism of the double-flow space-time diagram attention network operates based on double-flow face representation space-time diagrams, and the double-flow face representation space-time diagrams are used for reflecting interaction relations between each face local representation and each face relation representation;
acquiring the human face space-time representation output by the double-current space-time diagram attention network based on the human face local representation and the human face relation representation;
the double-flow face representation space-time diagram comprises a plurality of local nodes, relationship nodes and edges, wherein each local node corresponds to each face local representation one by one, each relationship node corresponds to each face relationship representation one by one, and the edges in the double-flow face representation space-time diagram are used for reflecting interaction relations within nodes of the same type and between the two types of nodes;
the calculation process of the double-flow time-space diagram attention network is as follows:
generating a high-dimensional embedded representation based on each face local representation and each face relation representation through a graph attention mechanism of the double-flow space-time graph attention network, wherein the high-dimensional embedded representation comprises embedded representations respectively corresponding to each moment, and the embedded representation at each moment is obtained by splicing the face local representation and the face relation representation at the moment;
and carrying out average processing according to the high-dimensional embedded representation to obtain the face space-time representation.
8. A terminal comprising a memory and one or more processors; the memory stores more than one program; the program comprising instructions for performing a face space-time representation generation method based on multi-class representation space-time interactions as claimed in any one of claims 1-6; the processor is configured to execute the program.
9. A computer readable storage medium having stored thereon a plurality of instructions adapted to be loaded and executed by a processor to implement the steps of the method for generating a face space-time representation based on multi-class representation space-time interactions of any of the preceding claims 1-6.
CN202310285315.4A 2023-03-22 2023-03-22 Face space-time representation generation method based on multi-class representation space-time interaction Active CN116071809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310285315.4A CN116071809B (en) 2023-03-22 2023-03-22 Face space-time representation generation method based on multi-class representation space-time interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310285315.4A CN116071809B (en) 2023-03-22 2023-03-22 Face space-time representation generation method based on multi-class representation space-time interaction

Publications (2)

Publication Number Publication Date
CN116071809A CN116071809A (en) 2023-05-05
CN116071809B true CN116071809B (en) 2023-07-14

Family

ID=86177092

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310285315.4A Active CN116071809B (en) 2023-03-22 2023-03-22 Face space-time representation generation method based on multi-class representation space-time interaction

Country Status (1)

Country Link
CN (1) CN116071809B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709351A (en) * 2020-06-11 2020-09-25 江南大学 Three-branch network behavior identification method based on multipath space-time characteristic reinforcement fusion
CN112800894A (en) * 2021-01-18 2021-05-14 南京邮电大学 Dynamic expression recognition method and system based on attention mechanism between space and time streams
CN113435330A (en) * 2021-06-28 2021-09-24 平安科技(深圳)有限公司 Micro-expression identification method, device, equipment and storage medium based on video
CN113673303A (en) * 2021-06-28 2021-11-19 中国科学院大学 Human face action unit intensity regression method, device and medium
CN114170666A (en) * 2021-12-13 2022-03-11 重庆邮电大学 Facial expression recognition method based on multi-region convolutional neural network
CN115471885A (en) * 2022-08-24 2022-12-13 深圳市海清视讯科技有限公司 Action unit correlation learning method and device, electronic device and storage medium

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020113355A1 (en) * 2018-12-03 2020-06-11 Intel Corporation A content adaptive attention model for neural network-based image and video encoders
CN111160163B (en) * 2019-12-18 2022-04-01 浙江大学 Expression recognition method based on regional relation modeling and information fusion modeling
US11861940B2 (en) * 2020-06-16 2024-01-02 University Of Maryland, College Park Human emotion recognition in images or video
CN112070670B (en) * 2020-09-03 2022-05-10 武汉工程大学 Face super-resolution method and system of global-local separation attention mechanism
US20220101103A1 (en) * 2020-09-25 2022-03-31 Royal Bank Of Canada System and method for structure learning for graph neural networks
CN112232191B (en) * 2020-10-15 2023-04-18 南京邮电大学 Depression recognition system based on micro-expression analysis
CN112434608B (en) * 2020-11-24 2023-02-28 山东大学 Human behavior identification method and system based on double-current combined network
CN112633153A (en) * 2020-12-22 2021-04-09 天津大学 Facial expression motion unit identification method based on space-time graph convolutional network
CN113239822A (en) * 2020-12-28 2021-08-10 武汉纺织大学 Dangerous behavior detection method and system based on space-time double-current convolutional neural network
CN112800937B (en) * 2021-01-26 2023-09-05 华南理工大学 Intelligent face recognition method
CN113011357B (en) * 2021-03-26 2023-04-25 西安电子科技大学 Depth fake face video positioning method based on space-time fusion
WO2022231519A1 (en) * 2021-04-26 2022-11-03 Nanyang Technological University Trajectory predicting methods and systems
CN113609935A (en) * 2021-07-21 2021-11-05 无锡我懂了教育科技有限公司 Lightweight vague discrimination method based on deep learning face recognition
CN113705384B (en) * 2021-08-12 2024-04-05 西安交通大学 Facial expression recognition method considering local space-time characteristics and global timing clues
CN114694220B (en) * 2022-03-25 2024-06-21 上海大学 Double-flow face counterfeiting detection method based on Swin Transformer
CN114821804A (en) * 2022-05-18 2022-07-29 江苏奥斯汀光电科技股份有限公司 Attention mechanism-based action recognition method for graph convolution neural network
CN114842542B (en) * 2022-05-31 2023-06-13 中国矿业大学 Facial action unit identification method and device based on self-adaptive attention and space-time correlation
CN115640925A (en) * 2022-09-28 2023-01-24 中铁第四勘察设计院集团有限公司 Wisdom building site management system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709351A (en) * 2020-06-11 2020-09-25 江南大学 Three-branch network behavior identification method based on multipath space-time characteristic reinforcement fusion
CN112800894A (en) * 2021-01-18 2021-05-14 南京邮电大学 Dynamic expression recognition method and system based on attention mechanism between space and time streams
CN113435330A (en) * 2021-06-28 2021-09-24 平安科技(深圳)有限公司 Micro-expression identification method, device, equipment and storage medium based on video
CN113673303A (en) * 2021-06-28 2021-11-19 中国科学院大学 Human face action unit intensity regression method, device and medium
CN114170666A (en) * 2021-12-13 2022-03-11 重庆邮电大学 Facial expression recognition method based on multi-region convolutional neural network
CN115471885A (en) * 2022-08-24 2022-12-13 深圳市海清视讯科技有限公司 Action unit correlation learning method and device, electronic device and storage medium

Also Published As

Publication number Publication date
CN116071809A (en) 2023-05-05

Similar Documents

Publication Publication Date Title
US11775574B2 (en) Method and apparatus for visual question answering, computer device and medium
CN111985229B (en) Sequence labeling method and device and computer equipment
CN115310562B (en) Fault prediction model generation method suitable for energy storage equipment in extreme state
CN111710383A (en) Medical record quality control method and device, computer equipment and storage medium
CN115577678B (en) Method, system, medium, equipment and terminal for identifying causal relationship of document-level event
CN114332893A (en) Table structure identification method and device, computer equipment and storage medium
CN113192175A (en) Model training method and device, computer equipment and readable storage medium
CN112001399A (en) Image scene classification method and device based on local feature saliency
CN115983148A (en) CFD simulation cloud picture prediction method, system, electronic device and medium
WO2021208774A1 (en) Method and apparatus for assisting machine learning model to go online
CN116071809B (en) Face space-time representation generation method based on multi-class representation space-time interaction
CN110807462B (en) Training method insensitive to context of semantic segmentation model
CN116774973A (en) Data rendering method, device, computer equipment and storage medium
CN116484224A (en) Training method, device, medium and equipment for multi-mode pre-training model
CN115687136A (en) Script program processing method, system, computer equipment and medium
CN113743448B (en) Model training data acquisition method, model training method and device
US20230022253A1 (en) Fast and accurate prediction methods and systems based on analytical models
CN115309862A (en) Causal relationship identification method and device based on graph convolution network and contrast learning
CN116992937A (en) Neural network model restoration method and related equipment
CN113642642A (en) Control identification method and device
CN113901817A (en) Document classification method and device, computer equipment and storage medium
CN111091198A (en) Data processing method and device
CN117454987B (en) Mine event knowledge graph construction method and device based on event automatic extraction
CN113449716B (en) Field positioning and classifying method, text image recognition method, device and equipment
CN117079010A (en) Multi-label image recognition method, device and equipment for structured semantic priori

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant