CN116071809B - Face space-time representation generation method based on multi-class representation space-time interaction - Google Patents

Face space-time representation generation method based on multi-class representation space-time interaction

Info

Publication number
CN116071809B
CN116071809B (application number CN202310285315.4A)
Authority
CN
China
Prior art keywords
face
representation
space
time
relation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310285315.4A
Other languages
Chinese (zh)
Other versions
CN116071809A (en)
Inventor
蒋冬梅
李岩
王耀威
蓝湘源
吕科
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peng Cheng Laboratory filed Critical Peng Cheng Laboratory
Priority to CN202310285315.4A
Publication of CN116071809A
Application granted
Publication of CN116071809B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a face space-time representation generation method based on multi-class representation space-time interaction. The method solves the problem that the face space-time representations generated by existing learning methods, which extract only local face representations with a convolutional neural network, are of low effectiveness because the high-level relation information between different face regions is ignored.

Description

Face space-time representation generation method based on multi-class representation space-time interaction
Technical Field
The invention relates to the field of face space-time representation learning, in particular to a face space-time representation generating method based on multi-class representation space-time interaction.
Background
At present, face space-time representation learning commonly extracts local face representations with a convolutional neural network and omits modeling of the high-level relation information among different face regions, which limits the effectiveness of the learned face space-time representations to a certain extent. A few methods model face images directly with Transformers, but they lose the structural information of the face and do not fully learn its local representations.
In addition, current face representation learning mainly starts from single frames of a video and ignores the temporal dynamic information of the face video, whereas obtaining a space-time coupled face representation from the image sequence is more beneficial to tasks such as living body identity recognition and verification and state assessment. To fuse different representation sequences, most current approaches either fuse the two representations at each moment and then feed the result into a temporal dynamic model (early fusion), or model the temporal dynamics of each representation separately and perform decision-level fusion afterwards (late fusion). These approaches ignore the dynamic interaction relationship between the two representations and cannot achieve simultaneous interaction within the same type of representation at different times and between different types of representations.
Accordingly, there is a need for improvement and development in the art.
Disclosure of Invention
In view of the above-mentioned defects in the prior art, the invention provides a face space-time representation generation method based on multi-class representation space-time interaction, aiming to solve the problem that the face space-time representations generated by existing methods are of limited effectiveness because existing face space-time representation learning extracts only local face representations with a convolutional neural network and ignores the high-level relation information among different face regions.
The technical scheme adopted by the invention for solving the problems is as follows:
in a first aspect, an embodiment of the present invention provides a face space-time representation generating method based on multi-class representation space-time interaction, where the method includes:
acquiring video data to be processed, and determining a face image sequence according to the video data, wherein the face image sequence comprises face images respectively corresponding to a plurality of moments;
acquiring a face local representation and a face relation representation which are respectively corresponding to the face images, wherein the face local representation is a representation extracted based on local information of the face images, and the face relation representation is a representation extracted based on association relations between different areas of the face images;
and carrying out space-time interaction on each face local representation and each face relation representation to obtain the face space-time representation corresponding to the video data, wherein the space-time interaction is interaction in time and space between representations of the same type at different moments.
In one embodiment, the method for acquiring the local face representation corresponding to each face image includes:
inputting the face image into a pre-training convolutional neural network;
and acquiring the face local representation output by the pre-training convolutional neural network based on the face image.
In one embodiment, the method for obtaining the face relation representation corresponding to each face image includes:
inputting the face image into a space diagram attention network, wherein a diagram attention mechanism of the space diagram attention network operates based on a face region relation diagram, and the face region relation diagram is used for reflecting the association relation between regions in each face image;
and acquiring the face relation representation output by the space diagram attention network based on the face image.
In one embodiment, the method for generating the face region relation graph includes:
dividing one face image into a plurality of subareas on average;
and determining the face region relation graph according to the subareas, wherein the face region relation graph comprises a plurality of nodes and edges, each node in the face region relation graph corresponds to each subarea one by one, and the edges in the face region relation graph are used for reflecting the association relation among the nodes.
In one embodiment, if two nodes in the face region relationship graph are adjacent and/or bilaterally symmetric, an edge exists between the two nodes.
In one embodiment, the calculation process of the spatial map attention network for each face image includes:
acquiring relation coefficients between every two subareas in the face image through a graph attention mechanism of the spatial graph attention network, and carrying out normalization processing on the relation coefficients;
aggregating neighbor information according to the relation coefficients after normalization processing to obtain corresponding relation characterization of each subarea;
and carrying out average processing according to each relation representation to obtain the face relation representation corresponding to the face image.
In one embodiment, the performing space-time interaction on each of the local face representations and each of the face relation representations to obtain a face space-time representation corresponding to the video data includes:
inputting each face local representation and each face relation representation into a double-flow space-time diagram attention network, wherein a diagram attention mechanism of the double-flow space-time diagram attention network operates based on double-flow face representation space-time diagrams, and the double-flow face representation space-time diagrams are used for reflecting interaction relations between each face local representation and each face relation representation;
and acquiring the face space-time representation output by the double-flow space-time diagram attention network based on the face local representation and the face relation representation.
In one embodiment, the dual-flow face representation space-time diagram comprises a plurality of local nodes, relationship nodes and edges, wherein each local node corresponds to each face local representation one by one, each relationship node corresponds to each face relationship representation one by one, and the edges in the dual-flow face representation space-time diagram are used for reflecting interaction relations between the same type of nodes and between two types of nodes.
In one embodiment, if the time interval between the moments corresponding to the two nodes in the dual-flow face representation space-time diagram is smaller than a preset threshold, an edge exists between the two nodes.
In one embodiment, the calculation process of the dual-flow space-time diagram attention network is as follows:
generating a high-dimensional embedded representation based on each face local representation and each face relation representation through a graph attention mechanism of the double-flow space-time graph attention network, wherein the high-dimensional embedded representation comprises embedded representations respectively corresponding to each moment, and the embedded representation at each moment is obtained by splicing the face local representation and the face relation representation at the moment;
and carrying out average processing according to the high-dimensional embedded representation to obtain the face space-time representation.
In one embodiment, the method further comprises:
acquiring a preset database, wherein the database comprises a plurality of candidate face space-time characterizations, and each candidate face space-time characterization corresponds to different identity information respectively;
acquiring cosine similarity between the face space-time representation and each candidate face space-time representation in the database;
and determining an identification result corresponding to the video data according to the cosine similarity.
In a second aspect, an embodiment of the present invention further provides a face space-time representation generating device based on multi-class representation space-time interaction, where the device includes:
the data acquisition module is used for acquiring video data to be processed and determining a face image sequence according to the video data, wherein the face image sequence comprises face images corresponding to a plurality of moments respectively;
the face image processing module is used for processing the face image to obtain a face local representation and a face relation representation corresponding to the face image respectively, wherein the face local representation is a representation extracted based on local information of the face image, and the face relation representation is a representation extracted based on association relation between different areas of the face image;
the representation interaction module is used for carrying out space-time interaction on each face local representation and each face relation representation to obtain the face space-time representation corresponding to the video data, wherein the space-time interaction is interaction in time and space between the same type of representation at different moments.
In a third aspect, an embodiment of the present invention further provides a terminal, where the terminal includes a memory and one or more processors; the memory stores more than one program; the program comprising instructions for performing a face space-time representation generation method based on multi-class representation space-time interactions as described in any of the above; the processor is configured to execute the program.
In a fourth aspect, an embodiment of the present invention further provides a computer readable storage medium, where a plurality of instructions are stored, where the instructions are adapted to be loaded and executed by a processor, to implement the steps of any one of the above-mentioned face space-time representation generating methods based on multi-class representation space-time interactions.
The invention has the beneficial effects that: according to the embodiment of the invention, the face local representation and the face relation representation between different face regions are learned simultaneously, and space-time dynamic interaction modeling of the two representations is realized, so that space-time interaction within the same type of representation at different times and between different types of representations is achieved, and finally a more reliable face space-time representation is obtained. The method solves the problem that the face space-time representations generated by existing learning methods, which extract only local face representations with a convolutional neural network, are of low effectiveness because the high-level relation information between different face regions is ignored.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present invention, and other drawings may be obtained according to the drawings without inventive effort to those skilled in the art.
Fig. 1 is a schematic flow chart of a face space-time representation generating method based on multi-class representation space-time interaction according to an embodiment of the invention.
Fig. 2 is an overall framework diagram of a face space-time representation generating method based on multi-class representation space-time interaction provided by an embodiment of the invention.
Fig. 3 is a schematic diagram of the construction of a face region relationship diagram according to an embodiment of the present invention.
Fig. 4 is a schematic construction diagram of a double-flow face representation space-time diagram provided by the embodiment of the invention.
Fig. 5 is a schematic block diagram of a face space-time representation generating device based on multi-class representation space-time interaction according to an embodiment of the present invention.
Fig. 6 is a schematic block diagram of a terminal according to an embodiment of the present invention.
Detailed Description
To make the purposes, technical solutions and effects of the invention clearer and more definite, the face space-time representation generation method based on multi-class representation space-time interaction disclosed by the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless expressly stated otherwise, as understood by those skilled in the art. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs unless defined otherwise. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Aiming at the defects in the prior art, the invention provides a face space-time representation generation method based on multi-class representation space-time interaction, which comprises the following steps: acquiring video data to be processed, and determining a face image sequence according to the video data, wherein the face image sequence comprises face images respectively corresponding to a plurality of moments; acquiring a face local representation and a face relation representation which are respectively corresponding to the face images, wherein the face local representation is a representation extracted based on local information of the face images, and the face relation representation is a representation extracted based on association relations between different areas of the face images; and carrying out space-time interaction on each face local representation and each face relation representation to obtain the face space-time representation corresponding to the video data, wherein the space-time interaction is interaction in time and space between representations of the same type at different moments. According to the invention, the face local representation and the face relation representation between different face regions are learned simultaneously, and space-time dynamic interaction modeling of the two representations is realized, so that space-time interaction within the same type of representation at different times and between different types of representations is achieved, and finally a more reliable face space-time representation is obtained. This solves the problem that the face space-time representations generated by existing learning methods, which extract only local face representations with a convolutional neural network, are of low effectiveness because the high-level relation information between different face regions is ignored.
As shown in fig. 1, the method includes:
step S100, obtaining video data to be processed, and determining a face image sequence according to the video data, wherein the face image sequence comprises face images respectively corresponding to a plurality of moments;
step S200, obtaining a face local representation and a face relation representation which are respectively corresponding to the face images, wherein the face local representation is a representation extracted based on local information of the face images, and the face relation representation is a representation extracted based on association relations between different areas of the face images;
step S300, carrying out space-time interaction on each face local representation and each face relation representation to obtain the face space-time representation corresponding to the video data, wherein the space-time interaction is interaction in time and space between the same type of representation at different moments.
Specifically, the video data to be processed may be any face video. First, the video data is converted into a face image sequence, yielding a plurality of face images with different timestamps. Then, two complementary face representations, the face local representation and the face relation representation, are extracted from each face image, giving one representation sequence composed of the face local representations of all face images and another composed of their face relation representations. Finally, the two representation sequences are organically fused through space-time interaction, so that the dynamic interaction relationships within the same type of representation at different moments and between different types of representations are modeled, and an effective face space-time representation is learned. The face space-time representation can be used for tasks such as person identity recognition, verification and state assessment in video data.
In one implementation manner, the method for acquiring the face local representation corresponding to each face image includes:
step S201, inputting the face image into a pre-training convolutional neural network;
step S202, obtaining the face local representation output by the pre-training convolutional neural network based on the face image.
Specifically, the pre-trained convolutional neural network may be InsightFace, VGGFace or the like. Because such a network has been trained in advance on massive data, an accurate face local representation can be obtained after each frame of image is input into the pre-trained convolutional neural network.
For example, for one piece of video data, a face detection algorithm such as OpenFace or dlib is used to detect, track and align the face in the video data, and a face image sequence is output:

$X = \{x_1, x_2, \dots, x_T\}$

where $x_t$ denotes the face image at the $t$-th timestamp.

For the face image sequence $X$, each frame is processed by the pre-trained convolutional neural network to obtain the face local representation:

$f_t^{local} = \mathrm{CNN}(x_t)$

where $\mathrm{CNN}(\cdot)$ denotes the selected pre-trained convolutional neural network, and $f_t^{local}$ denotes the face local representation at the $t$-th timestamp.
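As a minimal illustrative sketch of this per-frame extraction step: the class name FaceLocalEncoder, the torchvision ResNet-18 backbone and the 112x112 crop size below are assumptions for demonstration, not the patent's actual InsightFace/VGGFace network.

```python
# Sketch: extract one face local representation per frame with a pre-trained CNN.
import torch
import torch.nn as nn
import torchvision.models as models

class FaceLocalEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # Drop the classification head and keep the pooled convolutional features.
        self.features = nn.Sequential(*list(backbone.children())[:-1])

    def forward(self, frames):            # frames: (T, 3, H, W) aligned face crops
        feats = self.features(frames)     # (T, 512, 1, 1) after global average pooling
        return feats.flatten(1)           # (T, 512): one local representation per frame

# Usage: a sequence of T = 16 aligned face crops from one video.
frames = torch.randn(16, 3, 112, 112)
local_reprs = FaceLocalEncoder()(frames)  # shape (16, 512)
```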
In one implementation manner, the method for acquiring the face relation representation corresponding to each face image includes:
step S203, inputting the face image into a space diagram attention network, wherein a diagram attention mechanism of the space diagram attention network operates based on a face region relation diagram, and the face region relation diagram is used for reflecting the association relation between the regions in each face image;
step S204, the face relation representation output by the space diagram attention network based on the face image is obtained.
Specifically, in this embodiment, a spatial map attention network is further preset to model association relationships between different areas of the face image, so as to obtain a face relationship representation of the face image. The diagram attention mechanism of the space diagram attention network mainly operates according to a preset face area relation diagram. The human face region relation graph reflects the association relation among the regions in each frame of human face image, so that the trained space graph attention network can accurately extract the human face relation representation in each frame of human face image.
For example, a graph attention mechanism is introduced to dynamically model the association relationships between different face regions:

$f_t^{rel} = \mathrm{SGAT}(x_t)$

where $\mathrm{SGAT}(\cdot)$ denotes the spatial graph attention network and $f_t^{rel}$ denotes the face relation representation of the face image at the $t$-th timestamp.
In one implementation manner, the method for generating the face region relation graph includes:
step S10, dividing one face image into a plurality of subareas on average;
step S11, determining the face area relation diagram according to the subareas, wherein the face area relation diagram comprises a plurality of nodes and edges, each node in the face area relation diagram corresponds to each subarea one by one, and the edges in the face area relation diagram are used for reflecting the association relation among the nodes.
Specifically, as shown in fig. 3, one face image is evenly divided into $N$ sub-regions; the sub-regions are then used as the nodes of the graph, and the association relationships among the sub-regions are used as the edges of the graph, so as to construct the face region relation graph.
In one implementation, if two nodes in the face region relationship graph are adjacent and/or bilateral symmetry, an edge exists between the two nodes.
Specifically, the existence of edges in the face region relation diagram defined in this embodiment needs to satisfy the following conditions: the two sub-areas are spatially adjacent or the two sub-areas are spatially right and left symmetric, thereby ensuring that no excessive context information is introduced.
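A minimal sketch of how such a face region relation graph could be assembled is given below; the 4x4 grid size and the added self-loops are assumptions for illustration, not values taken from the patent.

```python
# Sketch: sub-regions of an evenly divided face image are nodes; an edge links two
# nodes that are spatially adjacent (4-neighbourhood assumed) or left-right symmetric.
import numpy as np

def build_region_graph(rows=4, cols=4):
    n = rows * cols
    adj = np.zeros((n, n), dtype=np.float32)
    idx = lambda r, c: r * cols + c
    for r in range(rows):
        for c in range(cols):
            i = idx(r, c)
            # Edges to spatially adjacent sub-regions (up/down/left/right).
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    adj[i, idx(rr, cc)] = 1.0
            # Edge to the bilaterally (left-right) symmetric sub-region in the same row.
            adj[i, idx(r, cols - 1 - c)] = 1.0
    np.fill_diagonal(adj, 1.0)  # self-loops so every node can attend to itself
    return adj

region_adjacency = build_region_graph()  # (16, 16) adjacency for a 4x4 split
```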
In one implementation, the computation process of the spatial map attention network for each face image includes:
step S2041, obtaining relation coefficients between every two subareas in the face image through a graph attention mechanism of the spatial graph attention network, and carrying out normalization processing on the relation coefficients;
step S2042, aggregating neighbor information according to the relation coefficients after normalization processing to obtain corresponding relation characterization of the subareas;
and step S2043, carrying out average processing according to each relation representation to obtain the face relation representation corresponding to the face image.
Specifically, the relation coefficient between every two sub-regions is first calculated:

$e_{ij} = \sigma\left(a\left(W h_i, W h_j\right)\right)$

where $\sigma(\cdot)$ denotes a nonlinear activation function, $a(\cdot,\cdot)$ is the attention mechanism that dynamically learns the relationship between two sub-regions, $W$ denotes the feature mapping matrix used to obtain sufficient expressive power, and $h_i$ and $h_j$ denote the representations of the $i$-th and $j$-th sub-regions, respectively.

Then, the relation coefficients are normalized to facilitate the aggregation of neighbor information:

$\alpha_{ij} = \mathrm{softmax}_j\left(e_{ij}\right) = \dfrac{\exp\left(e_{ij}\right)}{\sum_{k \in \mathcal{N}_i} \exp\left(e_{ik}\right)}$

Next, the neighbor information is aggregated using the normalized relation coefficients to obtain the relation representation of each sub-region:

$h_i' = \sigma\left(\sum_{j \in \mathcal{N}_i} \alpha_{ij} W h_j\right)$

Finally, the relation representations $h_i'$ of all sub-regions at moment $t$ are averaged to obtain the face relation representation $f_t^{rel}$ of the whole face image.
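As an illustrative sketch of this computation only: the single attention head, feature sizes and the dense placeholder adjacency below are assumptions, not the patent's actual configuration.

```python
# Sketch: relation coefficients between connected sub-regions, softmax normalisation,
# neighbour aggregation, then averaging over nodes to get the face relation representation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialGraphAttention(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.W = nn.Linear(in_dim, out_dim, bias=False)  # feature mapping matrix W
        self.a = nn.Linear(2 * out_dim, 1, bias=False)   # attention mechanism a(., .)

    def forward(self, h, adj):               # h: (N, in_dim) sub-region features, adj: (N, N)
        Wh = self.W(h)                        # (N, out_dim)
        n = Wh.size(0)
        pairs = torch.cat([Wh.unsqueeze(1).expand(n, n, -1),
                           Wh.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1))       # relation coefficients e_ij
        e = e.masked_fill(adj == 0, float('-inf'))        # keep only the graph's edges
        alpha = torch.softmax(e, dim=-1)                  # normalised coefficients
        h_rel = torch.relu(alpha @ Wh)                    # aggregate neighbour information
        return h_rel.mean(dim=0)              # face relation representation of the image

# Usage: 16 sub-region features of dimension 512; a dense placeholder graph is used here,
# in practice the face region relation graph built earlier would be passed in.
sgat = SpatialGraphAttention(512, 256)
f_rel = sgat(torch.randn(16, 512), torch.ones(16, 16))   # shape (256,)
```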
In one implementation, the step S300 specifically includes:
step S301, inputting each face local representation and each face relation representation into a double-flow space-time diagram attention network, wherein a diagram attention mechanism of the double-flow space-time diagram attention network operates based on double-flow face representation space-time diagrams, and the double-flow face representation space-time diagrams are used for reflecting interaction relations between each face local representation and each face relation representation;
step S302, acquiring the human face space-time representation output by the double-flow space-time diagram attention network based on the human face local representation and the human face relation representation.
Specifically, in this embodiment, a dual-flow space-time diagram attention network is preset to model the space-time dynamic interaction of the two types of representations, the face local representation and the face relation representation, so that interaction enhancement within the same type of representation at different times and between different types of representations is realized, and the interaction-enhanced face space-time representation is output. The graph attention mechanism of the dual-flow space-time diagram attention network mainly operates according to a preset dual-flow face representation space-time diagram, which reflects the interaction relations between each face local representation and each face relation representation, so that the trained network can perform interaction enhancement on the two representations and output reliable face space-time representations for tasks such as person identity recognition, verification and state assessment in video data.
In one implementation manner, the dual-flow face representation space-time diagram comprises a plurality of local nodes, relationship nodes and edges, wherein each local node corresponds to each face local representation one by one, each relationship node corresponds to each face relationship representation one by one, and the edges in the dual-flow face representation space-time diagram are used for reflecting interaction relations between the same type of nodes and between two types of nodes.
Specifically, as shown in fig. 4, the face local representation and the face relation representation at each moment are used respectively as the local nodes and relation nodes of the graph, and the interaction relations between the representations are used as the edges of the graph to construct the double-flow face representation space-time diagram. Each local node can interact with other local nodes at different moments, with relation nodes at different moments, and with itself; similarly, each relation node can interact with other relation nodes at different moments, with local nodes at different moments, and with itself. Therefore, the double-flow space-time diagram attention network running on the double-flow face representation space-time diagram can realize interaction enhancement within the same type of representation at different times and between different types of representations.
In one implementation manner, if the time interval between the moments corresponding to the two nodes in the double-flow face representation space-time diagram is smaller than a preset threshold, an edge exists between the two nodes.
Specifically, the present embodiment defines that an edge exists only between two nodes whose time interval is smaller than the preset threshold, and no edge exists between two nodes whose time interval is greater than or equal to the preset threshold, so as to ensure that excessive long-term temporal context information is not introduced. The two nodes may be two local nodes, two relation nodes, or one local node and one relation node.
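A minimal sketch of how such a double-flow face representation space-time graph could be built follows; the window of 3 frames is an assumed threshold, not a value taken from the patent.

```python
# Sketch: 2*T nodes (one local node and one relation node per timestamp), with an edge
# between any two nodes whose time interval is below the threshold.
import numpy as np

def build_spacetime_graph(num_frames, window=3):
    # Nodes 0..T-1 are local nodes, nodes T..2T-1 are relation nodes.
    times = np.concatenate([np.arange(num_frames), np.arange(num_frames)])
    adj = (np.abs(times[:, None] - times[None, :]) < window).astype(np.float32)
    return adj  # includes self-edges, local-local, relation-relation and cross edges

st_adjacency = build_spacetime_graph(16)  # (32, 32)
```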
In one implementation, the computing process of the dual-flow space-time diagram attention network is:
step S3021, generating a high-dimensional embedded representation based on each face local representation and each face relation representation through a graph attention mechanism of the dual-flow space-time graph attention network, wherein the high-dimensional embedded representation comprises embedded representations respectively corresponding to each moment, and the embedded representation at each moment is obtained by splicing the face local representation and the face relation representation at the moment;
step S3022, carrying out average processing according to the high-dimensional embedded representation to obtain the face space-time representation.
Specifically, the relationships within and between the two types of representations at different times are first dynamically modeled using the graph attention mechanism of the dual-flow space-time graph attention network:

$\left(\tilde{f}_t^{local}, \tilde{f}_t^{rel}\right) = \mathrm{STGAT}\left(f_t^{local}, f_t^{rel}\right), \quad t = 1, \dots, T$

where $\mathrm{STGAT}(\cdot)$ denotes the dual-flow space-time graph attention network, $f_t^{local}$ and $f_t^{rel}$ denote the face local representation and the face relation representation at the $t$-th timestamp, and $\tilde{f}_t^{local}$ and $\tilde{f}_t^{rel}$ denote the face local representation and the face relation representation at the $t$-th timestamp after interaction enhancement by the network.

Then, the interaction-enhanced representations $\tilde{f}_t^{local}$ and $\tilde{f}_t^{rel}$ at each moment are concatenated to obtain the high-dimensional embedded representation $z_t = [\tilde{f}_t^{local}; \tilde{f}_t^{rel}]$.

Finally, the high-dimensional embedded representations are averaged (equivalent to pooling the embedded representation at each moment along the time dimension) to obtain the global space-time dynamic representation of the whole video data, i.e., the face space-time representation $z = \frac{1}{T}\sum_{t=1}^{T} z_t$.
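The sketch below illustrates this interaction step under simplifying assumptions (a single attention head, illustrative layer sizes and a dense placeholder adjacency); it is not the patent's exact network.

```python
# Sketch: run graph attention over the 2*T stacked nodes (T local + T relation
# representations), split the enhanced nodes back into two streams, concatenate them
# per timestamp and average over time to get the face space-time representation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamSTGAT(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)
        self.a = nn.Linear(2 * dim, 1, bias=False)

    def forward(self, local_seq, rel_seq, adj):  # (T, D), (T, D), (2T, 2T)
        T = local_seq.size(0)
        h = torch.cat([local_seq, rel_seq], dim=0)           # stack both streams: (2T, D)
        Wh = self.W(h)
        n = Wh.size(0)
        pairs = torch.cat([Wh.unsqueeze(1).expand(n, n, -1),
                           Wh.unsqueeze(0).expand(n, n, -1)], dim=-1)
        e = F.leaky_relu(self.a(pairs).squeeze(-1)).masked_fill(adj == 0, float('-inf'))
        h_enh = torch.relu(torch.softmax(e, dim=-1) @ Wh)    # interaction-enhanced nodes
        local_enh, rel_enh = h_enh[:T], h_enh[T:]
        z = torch.cat([local_enh, rel_enh], dim=-1)          # per-timestamp embedding z_t
        return z.mean(dim=0)                                 # face space-time representation

# Usage with the space-time adjacency sketched earlier (a dense placeholder here).
model = DualStreamSTGAT(dim=256)
video_repr = model(torch.randn(16, 256), torch.randn(16, 256), torch.ones(32, 32))
```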
In one implementation, the method further comprises:
step S400, a preset database is obtained, wherein the database comprises a plurality of candidate face space-time characterizations, and each candidate face space-time characterization corresponds to different identity information respectively;
step S401, obtaining cosine similarity between the face space-time representation and each candidate face space-time representation in the database;
step S402, determining an identification result corresponding to the video data according to the cosine similarity.
Specifically, one of the application scenarios of this embodiment is living body identity recognition. During identification, the cosine similarity between the face space-time representation of the current video data and each candidate face space-time representation in the preset database is calculated, and the identification result is determined according to the identity information corresponding to the candidate (or the several candidates) with the highest cosine similarity.
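A minimal sketch of this matching step is shown below; the gallery contents and identity labels are random placeholders for illustration only.

```python
# Sketch: compare the query face space-time representation against a gallery of
# candidate representations by cosine similarity and return the best-matching identity.
import torch
import torch.nn.functional as F

def identify(query, gallery, identities):
    # query: (D,), gallery: (M, D) candidate face space-time representations
    sims = F.cosine_similarity(query.unsqueeze(0), gallery, dim=-1)  # (M,)
    best = int(sims.argmax())
    return identities[best], float(sims[best])

gallery = torch.randn(3, 512)
who, score = identify(torch.randn(512), gallery, ["id_001", "id_002", "id_003"])
```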
In one implementation manner, the overall framework of the method can be realized by a deep neural network, and as shown in fig. 2, the overall framework mainly comprises three parts, namely a face local representation learning module, a face relation representation learning module and a face space-time representation interaction reinforcement learning module. In the deep neural network training stage, the whole framework performs end-to-end joint optimization.
The invention has the advantages that:
1. spatial association relations between different face areas can be mined.
2. The method realizes simultaneous interaction and enhancement of the face local representation and the face relation representation within the same representation at different moments and between different representations.
Based on the above embodiment, the present invention further provides a device for generating a face space-time representation based on multi-class representation space-time interaction, as shown in fig. 5, where the device includes:
the data acquisition module 01 is used for acquiring video data to be processed, and determining a face image sequence according to the video data, wherein the face image sequence comprises face images respectively corresponding to a plurality of moments;
the representation extraction module 02 is configured to obtain a local representation of a face and a representation of a face relationship, which correspond to each face image respectively, where the local representation of the face is a representation extracted based on local information of the face image, and the representation of the face relationship is a representation extracted based on association relationships between different areas of the face image;
the representation interaction module 03 is configured to perform space-time interaction on each of the face local representations and each of the face relationship representations to obtain face space-time representations corresponding to the video data, where the space-time interaction is interaction in time and space between the same type of representations at different moments.
Based on the above embodiment, the present invention also provides a terminal, and a functional block diagram thereof may be shown in fig. 6. The terminal comprises a processor, a memory, a network interface and a display screen which are connected through a system bus. Wherein the processor of the terminal is adapted to provide computing and control capabilities. The memory of the terminal includes a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface of the terminal is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a face space-time representation generation method based on multi-class representation space-time interactions. The display screen of the terminal may be a liquid crystal display screen or an electronic ink display screen.
It will be appreciated by those skilled in the art that the functional block diagram shown in fig. 6 is merely a block diagram of some of the structures associated with the present inventive arrangements and is not limiting of the terminal to which the present inventive arrangements may be applied, and that a particular terminal may include more or less components than those shown, or may combine some of the components, or have a different arrangement of components.
In one implementation, the memory of the terminal has stored therein one or more programs, and the execution of the one or more programs by one or more processors includes instructions for performing a face space-time representation generation method based on multi-class representation space-time interactions.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
In summary, the invention discloses a face space-time representation generation method based on multi-class representation space-time interaction. By simultaneously learning the face local representation and the face relation representation between different face regions, and by modeling the space-time dynamic interaction of the two representations, space-time interaction within the same type of representation at different moments and between different types of representations is realized, and a more reliable face space-time representation is finally obtained. The method solves the problem that the face space-time representations generated by existing learning methods, which extract only local face representations with a convolutional neural network, are of low effectiveness because the high-level relation information between different face regions is ignored.
It is to be understood that the invention is not limited in its application to the examples described above, but is capable of modification and variation in light of the above teachings by those skilled in the art, and that all such modifications and variations are intended to be included within the scope of the appended claims.

Claims (9)

1. A face space-time representation generation method based on multi-class representation space-time interaction, the method comprising:
acquiring video data to be processed, and determining a face image sequence according to the video data, wherein the face image sequence comprises face images respectively corresponding to a plurality of moments;
acquiring a face local representation and a face relation representation which are respectively corresponding to the face images, wherein the face local representation is a representation extracted based on local information of the face images, and the face relation representation is a representation extracted based on association relations between different areas of the face images;
performing space-time interaction on each face local representation and each face relation representation to obtain a face space-time representation corresponding to the video data, wherein the space-time interaction is interaction in time and space between representations of the same type at different moments;
the method for acquiring the face relation representation corresponding to each face image comprises the following steps:
inputting the face image into a space diagram attention network, wherein a diagram attention mechanism of the space diagram attention network operates based on a face region relation diagram, and the face region relation diagram is used for reflecting the association relation between regions in each face image;
acquiring the face relation representation output by the space diagram attention network based on the face image;
the method for generating the face region relation graph comprises the following steps:
dividing one face image into a plurality of subareas on average;
determining the face region relation graph according to the subareas, wherein the face region relation graph comprises a plurality of nodes and edges, each node in the face region relation graph corresponds to each subarea one by one, and the edges in the face region relation graph are used for reflecting the association relation among the nodes;
performing space-time interaction on each face local representation and each face relation representation to obtain a face space-time representation corresponding to the video data, wherein the method comprises the following steps:
inputting each face local representation and each face relation representation into a double-flow space-time diagram attention network, wherein a diagram attention mechanism of the double-flow space-time diagram attention network operates based on double-flow face representation space-time diagrams, and the double-flow face representation space-time diagrams are used for reflecting interaction relations between each face local representation and each face relation representation;
acquiring the human face space-time representation output by the double-current space-time diagram attention network based on the human face local representation and the human face relation representation;
the double-flow face representation space-time diagram comprises a plurality of local nodes, relationship nodes and edges, wherein each local node corresponds to each face local representation one by one, each relationship node corresponds to each face relationship representation one by one, and the edges in the double-flow face representation space-time diagram are used for reflecting interaction relations within nodes of the same type and between the two types of nodes;
the calculation process of the double-flow time-space diagram attention network is as follows:
generating a high-dimensional embedded representation based on each face local representation and each face relation representation through a graph attention mechanism of the double-flow space-time graph attention network, wherein the high-dimensional embedded representation comprises embedded representations respectively corresponding to each moment, and the embedded representation at each moment is obtained by splicing the face local representation and the face relation representation at the moment;
and carrying out average processing according to the high-dimensional embedded representation to obtain the face space-time representation.
2. The method for generating the face space-time representation based on the multi-class representation space-time interaction according to claim 1, wherein the method for acquiring the face local representation corresponding to each face image comprises the following steps:
inputting the face image into a pre-training convolutional neural network;
and acquiring the face local representation output by the pre-training convolutional neural network based on the face image.
3. The method for generating the face space-time representation based on the multi-class representation space-time interaction according to claim 1, wherein if two nodes in the face region relation graph are adjacent and/or bilaterally symmetric, an edge exists between the two nodes.
4. The method for generating a face space-time representation based on multi-class representation space-time interactions according to claim 1, wherein the calculation process of the spatial map attention network for each face image comprises:
acquiring relation coefficients between every two subareas in the face image through a graph attention mechanism of the spatial graph attention network, and carrying out normalization processing on the relation coefficients;
aggregating neighbor information according to the relation coefficients after normalization processing to obtain corresponding relation characterization of each subarea;
and carrying out average processing according to each relation representation to obtain the face relation representation corresponding to the face image.
5. The face space-time representation generation method based on multi-class representation space-time interaction according to claim 1, wherein if the time interval between the moments corresponding to the two nodes in the double-flow face representation space-time diagram is smaller than a preset threshold, an edge exists between the two nodes.
6. The method for generating a face space-time representation based on multi-class representation space-time interactions of claim 1, further comprising:
acquiring a preset database, wherein the database comprises a plurality of candidate face space-time characterizations, and each candidate face space-time characterization corresponds to different identity information respectively;
acquiring cosine similarity between the face space-time representation and each candidate face space-time representation in the database;
and determining an identification result corresponding to the video data according to the cosine similarity.
7. A face space-time representation generating device based on multi-class representation space-time interactions, the device comprising:
the data acquisition module is used for acquiring video data to be processed and determining a face image sequence according to the video data, wherein the face image sequence comprises face images corresponding to a plurality of moments respectively;
the face image processing module is used for processing the face image to obtain a face local representation and a face relation representation corresponding to the face image respectively, wherein the face local representation is a representation extracted based on local information of the face image, and the face relation representation is a representation extracted based on association relation between different areas of the face image;
the representation interaction module is used for carrying out space-time interaction on each face local representation and each face relation representation to obtain the face space-time representation corresponding to the video data, wherein the space-time interaction is interaction in time and space between the same type of representation at different moments;
the method for acquiring the face relation representation corresponding to each face image comprises the following steps:
inputting the face image into a space diagram attention network, wherein a diagram attention mechanism of the space diagram attention network operates based on a face region relation diagram, and the face region relation diagram is used for reflecting the association relation between regions in each face image;
acquiring the face relation representation output by the space diagram attention network based on the face image;
the method for generating the face region relation graph comprises the following steps:
dividing one face image into a plurality of subareas on average;
determining the face region relation graph according to the subareas, wherein the face region relation graph comprises a plurality of nodes and edges, each node in the face region relation graph corresponds to each subarea one by one, and the edges in the face region relation graph are used for reflecting the association relation among the nodes;
performing space-time interaction on each face local representation and each face relation representation to obtain a face space-time representation corresponding to the video data, wherein the method comprises the following steps:
inputting each face local representation and each face relation representation into a double-flow space-time diagram attention network, wherein a diagram attention mechanism of the double-flow space-time diagram attention network operates based on double-flow face representation space-time diagrams, and the double-flow face representation space-time diagrams are used for reflecting interaction relations between each face local representation and each face relation representation;
acquiring the human face space-time representation output by the double-current space-time diagram attention network based on the human face local representation and the human face relation representation;
the double-flow face representation space-time diagram comprises a plurality of local nodes, relationship nodes and edges, wherein each local node corresponds to each face local representation one by one, each relationship node corresponds to each face relationship representation one by one, and the edges in the double-flow face representation space-time diagram are used for reflecting interaction relations within nodes of the same type and between the two types of nodes;
the calculation process of the double-flow time-space diagram attention network is as follows:
generating a high-dimensional embedded representation based on each face local representation and each face relation representation through a graph attention mechanism of the double-flow space-time graph attention network, wherein the high-dimensional embedded representation comprises embedded representations respectively corresponding to each moment, and the embedded representation at each moment is obtained by splicing the face local representation and the face relation representation at the moment;
and carrying out average processing according to the high-dimensional embedded representation to obtain the face space-time representation.
8. A terminal comprising a memory and one or more processors; the memory stores more than one program; the program comprising instructions for performing a face space-time representation generation method based on multi-class representation space-time interactions as claimed in any one of claims 1-6; the processor is configured to execute the program.
9. A computer readable storage medium having stored thereon a plurality of instructions adapted to be loaded and executed by a processor to implement the steps of the method for generating a face space-time representation based on multi-class representation space-time interactions of any of the preceding claims 1-6.
CN202310285315.4A 2023-03-22 2023-03-22 Face space-time representation generation method based on multi-class representation space-time interaction Active CN116071809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310285315.4A CN116071809B (en) 2023-03-22 2023-03-22 Face space-time representation generation method based on multi-class representation space-time interaction

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310285315.4A CN116071809B (en) 2023-03-22 2023-03-22 Face space-time representation generation method based on multi-class representation space-time interaction

Publications (2)

Publication Number Publication Date
CN116071809A CN116071809A (en) 2023-05-05
CN116071809B true CN116071809B (en) 2023-07-14

Family

ID=86177092

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310285315.4A Active CN116071809B (en) 2023-03-22 2023-03-22 Face space-time representation generation method based on multi-class representation space-time interaction

Country Status (1)

Country Link
CN (1) CN116071809B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709351A (en) * 2020-06-11 2020-09-25 江南大学 Three-branch network behavior identification method based on multipath space-time characteristic reinforcement fusion
CN112800894A (en) * 2021-01-18 2021-05-14 南京邮电大学 Dynamic expression recognition method and system based on attention mechanism between space and time streams
CN113435330A (en) * 2021-06-28 2021-09-24 平安科技(深圳)有限公司 Micro-expression identification method, device, equipment and storage medium based on video
CN113673303A (en) * 2021-06-28 2021-11-19 中国科学院大学 Human face action unit intensity regression method, device and medium
CN114170666A (en) * 2021-12-13 2022-03-11 重庆邮电大学 Facial expression recognition method based on multi-region convolutional neural network
CN115471885A (en) * 2022-08-24 2022-12-13 深圳市海清视讯科技有限公司 Action unit correlation learning method and device, electronic device and storage medium

Family Cites Families (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020113355A1 (en) * 2018-12-03 2020-06-11 Intel Corporation A content adaptive attention model for neural network-based image and video encoders
CN111160163B (en) * 2019-12-18 2022-04-01 浙江大学 Expression recognition method based on regional relation modeling and information fusion modeling
US11861940B2 (en) * 2020-06-16 2024-01-02 University Of Maryland, College Park Human emotion recognition in images or video
CN112070670B (en) * 2020-09-03 2022-05-10 武汉工程大学 Face super-resolution method and system of global-local separation attention mechanism
US20220101103A1 (en) * 2020-09-25 2022-03-31 Royal Bank Of Canada System and method for structure learning for graph neural networks
CN112232191B (en) * 2020-10-15 2023-04-18 南京邮电大学 Depression recognition system based on micro-expression analysis
CN112434608B (en) * 2020-11-24 2023-02-28 山东大学 Human behavior identification method and system based on double-current combined network
CN112633153A (en) * 2020-12-22 2021-04-09 天津大学 Facial expression motion unit identification method based on space-time graph convolutional network
CN113239822A (en) * 2020-12-28 2021-08-10 武汉纺织大学 Dangerous behavior detection method and system based on space-time double-current convolutional neural network
CN112800937B (en) * 2021-01-26 2023-09-05 华南理工大学 Intelligent face recognition method
CN113011357B (en) * 2021-03-26 2023-04-25 西安电子科技大学 Depth fake face video positioning method based on space-time fusion
WO2022231519A1 (en) * 2021-04-26 2022-11-03 Nanyang Technological University Trajectory predicting methods and systems
CN113609935A (en) * 2021-07-21 2021-11-05 无锡我懂了教育科技有限公司 Lightweight vague discrimination method based on deep learning face recognition
CN113705384B (en) * 2021-08-12 2024-04-05 西安交通大学 Facial expression recognition method considering local space-time characteristics and global timing clues
CN114694220B (en) * 2022-03-25 2024-06-21 上海大学 Double-flow face counterfeiting detection method based on Swin Transformer
CN114821804A (en) * 2022-05-18 2022-07-29 江苏奥斯汀光电科技股份有限公司 Attention mechanism-based action recognition method for graph convolution neural network
CN114842542B (en) * 2022-05-31 2023-06-13 中国矿业大学 Facial action unit identification method and device based on self-adaptive attention and space-time correlation
CN115640925A (en) * 2022-09-28 2023-01-24 中铁第四勘察设计院集团有限公司 Wisdom building site management system

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111709351A (en) * 2020-06-11 2020-09-25 江南大学 Three-branch network behavior identification method based on multipath space-time characteristic reinforcement fusion
CN112800894A (en) * 2021-01-18 2021-05-14 南京邮电大学 Dynamic expression recognition method and system based on attention mechanism between space and time streams
CN113435330A (en) * 2021-06-28 2021-09-24 平安科技(深圳)有限公司 Micro-expression identification method, device, equipment and storage medium based on video
CN113673303A (en) * 2021-06-28 2021-11-19 中国科学院大学 Human face action unit intensity regression method, device and medium
CN114170666A (en) * 2021-12-13 2022-03-11 重庆邮电大学 Facial expression recognition method based on multi-region convolutional neural network
CN115471885A (en) * 2022-08-24 2022-12-13 深圳市海清视讯科技有限公司 Action unit correlation learning method and device, electronic device and storage medium

Also Published As

Publication number Publication date
CN116071809A (en) 2023-05-05

Similar Documents

Publication Publication Date Title
US11775574B2 (en) Method and apparatus for visual question answering, computer device and medium
CN111985229B (en) Sequence labeling method and device and computer equipment
CN115310562B (en) Fault prediction model generation method suitable for energy storage equipment in extreme state
CN111710383A (en) Medical record quality control method and device, computer equipment and storage medium
CN115577678B (en) Method, system, medium, equipment and terminal for identifying causal relationship of document-level event
CN114332893A (en) Table structure identification method and device, computer equipment and storage medium
CN113192175A (en) Model training method and device, computer equipment and readable storage medium
CN112001399A (en) Image scene classification method and device based on local feature saliency
CN115983148A (en) CFD simulation cloud picture prediction method, system, electronic device and medium
WO2021208774A1 (en) Method and apparatus for assisting machine learning model to go online
CN116071809B (en) Face space-time representation generation method based on multi-class representation space-time interaction
CN110807462B (en) Training method insensitive to context of semantic segmentation model
CN116774973A (en) Data rendering method, device, computer equipment and storage medium
CN116484224A (en) Training method, device, medium and equipment for multi-mode pre-training model
CN115687136A (en) Script program processing method, system, computer equipment and medium
CN113743448B (en) Model training data acquisition method, model training method and device
US20230022253A1 (en) Fast and accurate prediction methods and systems based on analytical models
CN115309862A (en) Causal relationship identification method and device based on graph convolution network and contrast learning
CN116992937A (en) Neural network model restoration method and related equipment
CN113642642A (en) Control identification method and device
CN113901817A (en) Document classification method and device, computer equipment and storage medium
CN111091198A (en) Data processing method and device
CN117454987B (en) Mine event knowledge graph construction method and device based on event automatic extraction
CN113449716B (en) Field positioning and classifying method, text image recognition method, device and equipment
CN117079010A (en) Multi-label image recognition method, device and equipment for structured semantic priori

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant