CN112668408A - Face key point generation method and device, storage medium and electronic equipment - Google Patents

Face key point generation method and device, storage medium and electronic equipment

Info

Publication number
CN112668408A
CN112668408A
Authority
CN
China
Prior art keywords
face
sequence
features
key point
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011463289.2A
Other languages
Chinese (zh)
Inventor
Zhao Mingyao (赵明瑶)
Yan Song (闫嵩)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dami Technology Co Ltd
Original Assignee
Beijing Dami Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dami Technology Co Ltd filed Critical Beijing Dami Technology Co Ltd
Priority to CN202011463289.2A priority Critical patent/CN112668408A/en
Publication of CN112668408A publication Critical patent/CN112668408A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a face key point generation method and device, a storage medium and electronic equipment, and belongs to the field of computer technology. The face key point generation method comprises the following steps: extracting features from audio data to obtain sound features, extracting features from a template face to obtain face features, processing a face sequence to obtain sequence features, superposing the sound features, the face features and the sequence features to generate input features, and generating a face key point sequence from the input features. In this way, the computer can directly generate, from the audio data, the face key point information of a naturally changing virtual image, improving the realism and fluency of the virtual image's speaking motion.

Description

Face key point generation method and device, storage medium and electronic equipment
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for generating key points of a human face, a storage medium and electronic equipment.
Background
Sound and vision both play an important role in how humans transmit and receive information, and to some extent the two channels carry overlapping information. For example, when a familiar voice is heard, what the person in the picture is saying can be inferred, so the two kinds of information can be converted into each other, which has great commercial value in practical application scenarios. With the development of deep learning and the growth of computing power, many deep-learning-based audio processing and image generation methods achieve good results. However, in current methods that generate face images directly from audio, the motion of the generated speaking face mostly cannot change naturally, or can only change according to a fixed pattern. In view of this, a method is desired that can directly process input audio data and generate a high-quality virtual image whose mouth movements and facial expressions change naturally.
Disclosure of Invention
The embodiments of the application provide a face key point generation method, a face key point generation device, a storage medium and electronic equipment, which can directly generate a naturally changing virtual image based on audio data and improve its realism. The technical scheme is as follows:
in a first aspect, an embodiment of the present application provides a method for generating a face key point, including:
carrying out feature extraction on the audio data to obtain sound features;
extracting the features of the template face to obtain face features;
processing the face sequence to obtain sequence characteristics; the face sequence comprises an angle sequence constraint feature and a boundary key point constraint feature;
superposing the sound feature, the face feature and the sequence feature to generate an input feature;
and generating a face key point sequence according to the input features.
In a second aspect, an embodiment of the present application provides a device for generating a face keypoint, where the device includes:
the first extraction module is used for extracting the characteristics of the audio data to obtain sound characteristics;
the second extraction module is used for extracting the characteristics of the template face to obtain face characteristics;
the processing module is used for processing the face sequence to obtain sequence characteristics; the face sequence comprises an angle sequence constraint feature and a boundary key point constraint feature;
the superposition module is used for superposing the sound feature, the face feature and the sequence feature to generate an input feature;
and the generating module is used for generating a face key point sequence according to the input features.
In a third aspect, embodiments of the present application provide a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to perform the above-mentioned method steps.
In a fourth aspect, an embodiment of the present application provides an electronic device, including: a memory and a processor; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the above-mentioned method steps.
The beneficial effects brought by the technical scheme provided by some embodiments of the application at least comprise:
when the face key point generation method, the face key point generation device, the storage medium and the electronic equipment work, the audio data are subjected to feature extraction to obtain the voice features, the template face is subjected to feature extraction to obtain the face features, the face sequence is processed to obtain the sequence features, the voice features, the face features and the sequence features are overlapped to generate the input features, and the face key point sequence is generated according to the input features. According to the embodiment of the application, the human face key point related information of the naturally-changed virtual image can be directly generated based on the audio data, and the reality degree and the fluency of the speaking action of the virtual image are improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic diagram of a communication system architecture provided in an embodiment of the present application;
fig. 2 is a schematic flow chart of a method for generating key points of a human face according to an embodiment of the present application;
fig. 3 is another schematic flow chart of a method for generating face key points according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of a face keypoint generating device according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The following description refers to the accompanying drawings, in which like numerals refer to the same or similar elements throughout the different views, unless otherwise specified. The implementations described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims.
In the description of the present application, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. The specific meaning of the above terms in the present application can be understood in a specific case by those of ordinary skill in the art. "and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
In order to solve the above-mentioned problem that, in current methods for generating a face image directly from audio, the motion of the generated speaking face mostly cannot change naturally or can only change according to a fixed pattern, a face key point generation method is provided. The method can run on a computer system, which can be the computer system of a smart phone, a notebook computer, a tablet computer or the like.
Fig. 1 is a schematic diagram of a communication system architecture provided in the present application.
Referring to fig. 1, a communication system 01 includes a terminal device 101, a network device 102 and a server 103; when the communication system 01 includes a core network, the network device 102 may also be connected to the core network. The network device 102 may also communicate with an Internet Protocol (IP) network, such as the Internet, a private IP network or another data network. The network device 102 provides services for the terminal device 101 and the server 103 within its coverage area. A user may use the terminal device 101 to interact with the server 103 through the network device 102, for example to receive or send messages. Various communication client applications, such as voice interaction applications and animation applications, may be installed on the terminal device 101. The server 103 may be a server that stores the face key point generation method provided in the embodiment of the present application and provides various services; it is configured to detect, store and process files such as the audio data and template faces uploaded by the terminal device 101, and to send the processing result back to the terminal device 101.
In the following method embodiments, for convenience of description, the execution subject of each step is described simply as a computer.
The method for generating the face key points according to the embodiment of the present application will be described in detail below with reference to fig. 2 to 3.
Please refer to fig. 2, which is a flowchart illustrating a method for generating face key points according to an embodiment of the present application. The method may comprise the steps of:
s201, extracting the characteristics of the audio data to obtain sound characteristics.
In general, a feature may be a property, or a collection of properties, that distinguishes one class of objects from other classes, and it can be extracted by measuring or processing data. For audio data, a sound feature may be an intrinsic characteristic of the audio that distinguishes it from other kinds of audio, for example: Mel-Frequency Cepstral Coefficient (MFCC) features, Mel-Filter Bank (MFB) features, Spectral Sub-band Centroid (SSC) features, and the like. The computer calculates the center positions of the audio data over its time interval based on a preset frame rate, traverses the time interval, extracts the Mel cepstrum coefficient (MFCC) sound features in sub-intervals of preset length before and after each center position, and processes the MFCC sound features to obtain the sound features.
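As an illustrative, non-limiting sketch of the step above (not a statement of the claimed implementation), the per-frame MFCC extraction could be organised as follows in Python; the librosa library, the 16 kHz sampling rate, the 100-features-per-second rate and all other parameter values are assumptions for this example only.

# Illustrative sketch only: extract an MFCC window centred on each generated face frame.
# Assumes librosa is available; parameter values are examples, not requirements.
import librosa

def extract_frame_mfcc(wav_path, fps=25, win_sec=0.3, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=16000)              # load audio at an assumed 16 kHz
    hop = sr // 100                                        # 100 MFCC frames per second
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, hop_length=hop).T   # (T, 13)
    per_face = 100 // fps                                  # MFCC frames per generated face
    half = int(win_sec * 100 / 2)                          # e.g. 150 ms on each side
    windows = []
    for i in range(mfcc.shape[0] // per_face):
        centre = i * per_face + per_face // 2              # centre position of this part
        lo, hi = max(0, centre - half), min(mfcc.shape[0], centre + half)
        windows.append(mfcc[lo:hi])                        # sub-interval around the centre
    return windows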
S202, extracting the characteristics of the template face to obtain the face characteristics.
Generally, face features refer to features of face key point coordinate information, for example the coordinates of 81 face key points or of 68 face key points. The computer identifies the template face in a data set to obtain face key point coordinate information, counts all the face key point coordinate information in the data set to obtain average face key point coordinate information, determines target face key point coordinate information, obtains initial input features based on the target face key point coordinate information and the average face key point coordinate information, and processes the initial input features to obtain the face features.
And S203, processing the face sequence to obtain sequence characteristics.
Generally, the face sequence includes an angle sequence constraint feature and a boundary key point constraint feature, which a user may set manually or select from a template; the angle sequence constraint feature includes parameters in the x and y directions, and the boundary key point constraint feature includes parameters for 3 boundary points. The computer obtains the angle sequence constraint feature and the boundary key point constraint feature, processes the angle sequence constraint feature to obtain an angle sequence constraint sequence, processes the boundary key point constraint feature to obtain a boundary key point constraint sequence, and superposes the angle sequence constraint sequence and the boundary key point constraint sequence to obtain the sequence features.
And S204, overlapping the sound feature, the face feature and the sequence feature to generate an input feature.
In general, superposition refers to merging multiple vectors or arrays into one vector or array, and includes Cat superposition and Stack superposition. The computer performs Cat superposition on the sound feature, the face feature and the sequence feature to obtain a first superposition feature, and performs Stack superposition on the first superposition feature to obtain the input feature. Cat superposition can be understood as splicing without adding a dimension, while Stack superposition adds a new dimension (which dimension is added depends on the dimensions of the input).
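For illustration only, the difference between Cat superposition and Stack superposition can be seen in the following PyTorch sketch; the tensor shapes mirror the examples given later in this description and are assumptions rather than requirements.

# Illustrative sketch: "Cat" concatenates along an existing dimension, "Stack" adds a new one.
import torch

sound = torch.randn(1, 256)                       # one frame of sound features (example shape)
face = torch.randn(1, 512)                        # one frame of face features
seq = torch.randn(1, 48)                          # one frame of sequence features

cat_feat = torch.cat([sound, face, seq], dim=1)   # (1, 816): spliced, no new dimension added

frames = [torch.randn(816) for _ in range(30)]    # 30 per-frame feature vectors
stacked = torch.stack(frames, dim=0)              # (30, 816): a new leading dimension is added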
And S205, generating a face key point sequence according to the input features.
Generally, the face key point sequence includes a parameter relating the sequence size to the audio data length, the number of face key points and their corresponding coordinates. The computer processes the input features to obtain face key point related features, and then processes the face key point related features through a multilayer fully-connected network to obtain the face key point sequence, where the input features are processed with a Long Short-Term Memory (LSTM) neural network to obtain the face key point related features.
As described above, feature extraction is performed on the audio data to obtain the sound features, feature extraction is performed on the template face to obtain the face features, the face sequence is processed to obtain the sequence features, the sound features, the face features and the sequence features are superposed to generate the input features, and the face key point sequence is generated from the input features. According to the embodiment of the application, the face key point information of a naturally changing virtual image can be generated directly from the audio data, improving the realism and fluency of the virtual image's speaking motion.
Referring to fig. 3, another flow chart of a method for generating face key points is provided in the present application. The face key point generating method can comprise the following steps:
s301, calculating the center position of the audio data in the time interval based on a preset frame rate.
Generally, frame rate refers to the frequency at which bitmap images appear on a display in consecutive frames. The computer generates the face key point sequence based on the audio data, and the face key point sequence corresponds to face key point coordinates of consecutive frames, so the frame rate of the generated avatar video must be determined. For example: with a frame rate of 25 frames per second and 3 minutes of audio data, 4500 frames of video are generated; the audio data yields 100 MFCC features per second, and 25 faces are generated per second, so the 100 MFCC features are divided into 25 parts, and the MFCC feature at the center position of each of the 25 parts is taken as the feature of the center position of the corresponding generated face.
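As a non-limiting illustration of the worked example above (25 frames per second, 100 MFCC features per second), the mapping from generated face frames to MFCC centre positions could be computed as follows; the function name and argument names are purely illustrative.

# Illustrative sketch: centre MFCC index for each generated face frame.
def mfcc_centre_indices(audio_seconds, fps=25, mfcc_per_second=100):
    per_face = mfcc_per_second // fps              # 100 / 25 = 4 MFCC frames per face
    n_faces = audio_seconds * fps                  # e.g. 180 s * 25 fps = 4500 frames
    return [i * per_face + per_face // 2 for i in range(n_faces)]

centres = mfcc_centre_indices(audio_seconds=180)   # 4500 centre positions for 3 minutes of audio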
S302, traversing the time interval, extracting the MFCC sound features of the Mel cepstrum coefficients in sub time intervals with preset lengths before and after the central position, and processing the MFCC sound features to obtain the sound features.
Generally, the sub-time interval with a preset length refers to the duration of audio data corresponding to each generated face image frame, for example 150 ms. After determining the center positions on the time interval of the audio data, the computer traverses the time interval, extracts the Mel cepstrum coefficient (MFCC) sound features in sub-intervals of preset length before and after each center position, and processes the MFCC sound features to obtain the sound features. For example: 13-dimensional MFCC features, first-order derivative features (12-dimensional) and second-order derivative features (11-dimensional) of the MFCC are extracted from the audio data in a sub-interval and combined by Cat to form a first sound feature with dimension (1, 36), i.e. 36 MFCC-related dimensions; traversing 1 second of audio data yields a second sound feature with dimension (30, 36), corresponding to 300 ms before and after the center position; the second sound feature is then encoded by a sound feature encoder consisting of a convolutional network and a fully-connected network, giving an encoded sound feature vector with dimension (1, 256). The first and second sound features are represented by arrays, and the sound feature is represented by a vector.
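A minimal, non-limiting sketch of this step follows. The 13 + 12 + 11 = 36-dimensional first sound feature is assembled here by taking differences along the coefficient axis, which is an assumption (the description only states the dimensions), and the convolution-plus-fully-connected sound feature encoder uses illustrative layer sizes; only the (30, 36) input and (1, 256) output dimensions come from the example above.

# Illustrative sketch: assemble the (30, 36) second sound feature and encode it to (1, 256).
import numpy as np
import torch
import torch.nn as nn

def window_feature(mfcc_window):                  # mfcc_window: (30, 13) MFCCs around a centre
    d1 = np.diff(mfcc_window, n=1, axis=1)        # (30, 12) first-order differences (assumption)
    d2 = np.diff(mfcc_window, n=2, axis=1)        # (30, 11) second-order differences (assumption)
    return np.concatenate([mfcc_window, d1, d2], axis=1)   # (30, 36) second sound feature

class SoundEncoder(nn.Module):
    """Convolution + fully-connected sound feature encoder; layer sizes are illustrative."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(nn.Conv1d(36, 64, kernel_size=3, padding=1), nn.ReLU())
        self.fc = nn.Linear(64 * 30, 256)
    def forward(self, x):                         # x: (batch, 30, 36)
        h = self.conv(x.transpose(1, 2))          # (batch, 64, 30)
        return self.fc(h.flatten(1))              # (batch, 256) encoded sound feature

feat = window_feature(np.random.randn(30, 13))    # stand-in MFCC window
sound_feature = SoundEncoder()(torch.from_numpy(feat).float().unsqueeze(0))   # (1, 256)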
S303, identifying a template face in the data set to obtain face key point coordinate information, and counting all the face key point coordinate information in the data set to obtain average face key point coordinate information.
Generally, a data set is a collection of multiple template faces provided in advance by a user on the computer. Before the computer identifies a template face in the data set to acquire face key point coordinate information, the method further comprises: detecting the face images in the data set based on a face detection algorithm to obtain a detection result file, and analyzing the detection result file to generate a template face. The face detection algorithm may be the key point extraction algorithm of the dlib face recognition library, a face key point positioning model preset by the user, or an algorithm called from an artificial intelligence open platform (such as the Baidu open platform); the information in the detection result file comprises a plurality of face key point coordinates, namely cheek coordinates, eyebrow coordinates, eye coordinates, mouth coordinates and nose coordinates. After obtaining the sound features, the computer identifies the template faces in the data set to obtain face key point coordinate information, and counts all the face key point coordinate information in the data set to obtain average face key point coordinate information, for example: the coordinates of 68 face key points are identified in each template face using Dlib feature recognition or a deep neural network; if the 68 face key point coordinates of the first template face are ((73,25), (85,30), (90,34), ...) and those of the second template face are ((65,20), (87,32), (92,30), ...), the average face key point coordinates are ((69,22.5), (86,31), (91,32), ...).
S304, determining coordinate information of target face key points, obtaining initial input features based on the coordinate information of the target face key points and the coordinate information of the average face key points, and processing the initial input features to obtain face features.
Generally, after obtaining the average face key point coordinate information, the computer determines target face key point coordinate information, obtains initial input features based on the target and average face key point coordinate information, and processes the initial input features to obtain the face features. For example: the computer determines the 68 face key point coordinates ((73,25), (85,30), (90,34), ...) of the first template face as the target face key point coordinate information, and subtracts the average face key point coordinates ((69,22.5), (86,31), (91,32), ...) from them to obtain initial input features that can be expressed as ((4,2.5), (-1,-1), (-1,2), ...). The initial input features are then processed by a face key point feature extraction module (composed of multiple layers of fully-connected networks) to obtain the face features, where the initial input features are represented by arrays and the face features by vectors.
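For illustration, the initial input feature (target key points minus average key points) and the multi-layer fully-connected face key point feature extraction module could be sketched as follows; the intermediate layer size is an assumption, while the (1, 512) output dimension matches the face feature dimension mentioned later in this description.

# Illustrative sketch: initial input feature and fully-connected face feature extraction.
import numpy as np
import torch
import torch.nn as nn

def initial_input_feature(target_kps, all_kps):
    # target_kps: (68, 2); all_kps: (num_faces, 68, 2) key points for the whole data set
    average_kps = all_kps.mean(axis=0)               # (68, 2) average face key points
    return (target_kps - average_kps).reshape(1, -1) # (1, 136) initial input feature

class FaceEncoder(nn.Module):
    """Face key point feature extraction module made of fully-connected layers (assumed sizes)."""
    def __init__(self, in_dim=68 * 2, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))
    def forward(self, x):
        return self.net(x)                            # (1, 512) face feature

all_kps = np.random.rand(100, 68, 2) * 200            # stand-in data set of 100 template faces
x = torch.from_numpy(initial_input_feature(all_kps[0], all_kps)).float()
face_feature = FaceEncoder()(x)                        # (1, 512)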
S305, obtaining an angle sequence constraint characteristic and a boundary key point constraint characteristic, and processing the angle sequence constraint characteristic to obtain an angle sequence constraint sequence.
Generally, the face sequence includes an angle sequence constraint feature and a boundary key point constraint feature, which a user may set manually or select from a template. The angle sequence constraint feature includes parameters in the x and y directions, for example: an angle sequence constraint feature of (30, 60), where 30 indicates a rotation of 30 degrees about the x coordinate axis and 60 indicates a rotation of 60 degrees about the y coordinate axis. The computer processes the angle sequence constraint feature through a sequence feature extraction module (consisting of a single-layer fully-connected network) to obtain the angle sequence constraint sequence, where the angle sequence constraint feature and the boundary key point constraint feature are represented by arrays and the angle sequence constraint sequence by a vector. For example: for N consecutive faces, the computer obtains an angle sequence constraint feature of dimension (N, 2), where 2 represents the angle parameters in the x and y directions, and the dimension of the angle sequence constraint sequence after the sequence feature extraction module (a single-layer fully-connected network) is (N, 12).
S306, processing the boundary key point constraint characteristics to obtain a boundary key point constraint sequence, and overlapping the angle sequence constraint sequence and the boundary key point constraint sequence to obtain sequence characteristics.
Generally, after the computer obtains the angle sequence constraint sequence, the sequence feature extraction module (consisting of a single-layer fully-connected network) processes the boundary key point constraint feature to obtain the boundary key point constraint sequence. The boundary key point constraint feature includes 6 parameters, namely the x and y coordinates of 3 boundary points, for example: a boundary key point constraint feature of ((35,70), (55,120), (75,70)) represents the coordinates of the left boundary point, the lower boundary point and the right boundary point of the generated face, for a total of 6 parameters. The boundary key point constraint sequence is represented by a vector, and the angle sequence constraint sequence and the boundary key point constraint sequence are superposed to obtain the sequence features. For example: for N consecutive faces, the computer obtains a boundary key point constraint feature of dimension (N, 6), and the dimension of the boundary key point constraint sequence after the sequence feature extraction module (a single-layer fully-connected network) is (N, 36). The sequence feature obtained by superposing this boundary key point constraint sequence with the angle sequence constraint sequence from S305 has dimension (N, 48), and the sequence feature is represented by a vector.
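A non-limiting sketch of the sequence feature extraction follows: two single-layer fully-connected projections map the (N, 2) angle constraints to (N, 12) and the (N, 6) boundary constraints to (N, 36), and the results are superposed into the (N, 48) sequence feature, matching the dimensions in the examples above; class and attribute names are illustrative.

# Illustrative sketch: single-layer fully-connected sequence feature extraction module.
import torch
import torch.nn as nn

class SequenceFeature(nn.Module):
    def __init__(self):
        super().__init__()
        self.angle_fc = nn.Linear(2, 12)       # (N, 2) angle constraints -> (N, 12)
        self.boundary_fc = nn.Linear(6, 36)    # (N, 6) boundary constraints -> (N, 36)
    def forward(self, angles, boundaries):     # angles: (N, 2), boundaries: (N, 6)
        return torch.cat([self.angle_fc(angles), self.boundary_fc(boundaries)], dim=1)  # (N, 48)

seq_feature = SequenceFeature()(torch.randn(30, 2), torch.randn(30, 6))   # (30, 48)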
S307, performing Cat superposition on the voice feature, the face feature and the sequence feature to obtain a first superposition feature.
Generally, the sound feature, the face feature and the sequence feature each correspond to one frame of the generated face, for example: for one generated face frame, the dimension of the sound feature is (1,256), the dimension of the face feature is (1,512) and the dimension of the sequence feature is (1,48). The computer performs Cat superposition on each of the sound feature, the face feature and the sequence feature to obtain first superposition features, for example: with N an integer greater than 1, the sound features of N frames are superposed to obtain a first sound superposition feature with dimension (N,256), the face features of N frames are superposed to obtain a first face superposition feature with dimension (N,512), and the sequence features of N frames are superposed to obtain a first sequence superposition feature with dimension (N,48).
And S308, stacking the first stacking features to obtain input features.
Generally, after obtaining the first superposition features, the computer performs Stack superposition on them to obtain the input features, for example: the first sound superposition feature has dimension (N,256), the first face superposition feature has dimension (N,512) and the first sequence superposition feature has dimension (N,48); the computer performs Stack superposition to obtain an input feature of dimension (N,816).
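For illustration, the stated dimensions (N, 256), (N, 512) and (N, 48) combining into (N, 816) correspond to a concatenation along the feature axis; the sketch below assumes exactly that, without asserting it is the only way the Stack superposition can be realised.

# Illustrative sketch: combine the per-frame superposition features into the (N, 816) input feature.
import torch

N = 30
sound_cat = torch.randn(N, 256)    # first sound superposition feature
face_cat = torch.randn(N, 512)     # first face superposition feature
seq_cat = torch.randn(N, 48)       # first sequence superposition feature

input_feature = torch.cat([sound_cat, face_cat, seq_cat], dim=1)   # (N, 816) input feature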
And S309, processing the input features to obtain the relevant features of the key points of the human face.
Generally, after the computer obtains the input features, a Long Short-Term Memory (LSTM) neural network is used to process the input features to obtain the face key point related features, where the LSTM neural network has 256 hidden nodes and 3 layers, and the face key point related features are represented by vectors.
And S310, processing the relevant features of the face key points through a multilayer full-connection network to obtain a face key point sequence.
Generally, after the computer obtains the face key point related features, it processes them through a multilayer fully-connected network to obtain the face key point sequence. The face key point sequence includes a parameter S relating the sequence size to the audio data length, the number P of face key points, and their coordinates N, where N is always equal to 2; the face key point sequence is represented by an array of shape (S, P, N). For example, the computer obtains a face key point sequence of (1,50,2), which means that one frame of face is generated, and the generated face has 50 face key points with x- and y-axis coordinates ((125,75), (130,80), (140,83), ...). When the face key point related features are processed through the multilayer fully-connected network to obtain the face key point sequence, the encoded face features are represented by (S,512)-dimensional vectors, for example: the 50 face key point features of one generated face frame are encoded into a (1,512) representation through the multilayer fully-connected network.
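A non-limiting sketch of the sequence decoder is given below: a 3-layer LSTM with 256 hidden nodes followed by a multilayer fully-connected head that outputs P face key points with two coordinates each. Only the hidden size, the number of LSTM layers and the (S, P, 2) output shape come from the description above; the intermediate layer sizes and class names are assumptions.

# Illustrative sketch: LSTM + multilayer fully-connected decoder producing the key point sequence.
import torch
import torch.nn as nn

class KeypointDecoder(nn.Module):
    def __init__(self, in_dim=816, hidden=256, num_layers=3, num_points=50):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=num_layers, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden, 512), nn.ReLU(),
                                  nn.Linear(512, num_points * 2))
        self.num_points = num_points
    def forward(self, x):                      # x: (batch, S, 816) input features
        h, _ = self.lstm(x)                    # (batch, S, 256) face key point related features
        out = self.head(h)                     # (batch, S, num_points * 2)
        return out.view(x.size(0), -1, self.num_points, 2)   # (batch, S, P, 2)

kps = KeypointDecoder()(torch.randn(1, 30, 816))   # (1, 30, 50, 2) face key point sequence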
When the scheme of the embodiment of the application is executed, the center positions on the time interval of the audio data are calculated based on a preset frame rate; the time interval is traversed to extract the Mel cepstrum coefficient (MFCC) sound features in sub-intervals of preset length before and after each center position, and the MFCC sound features are processed to obtain the sound features. The template faces in the data set are identified to obtain face key point coordinate information, all the face key point coordinate information in the data set is counted to obtain average face key point coordinate information, target face key point coordinate information is determined, initial input features are obtained based on the target and average face key point coordinate information, and the initial input features are processed to obtain the face features. The angle sequence constraint feature and the boundary key point constraint feature are obtained, the angle sequence constraint feature is processed to obtain an angle sequence constraint sequence, the boundary key point constraint feature is processed to obtain a boundary key point constraint sequence, and the two sequences are superposed to obtain the sequence features. Cat superposition is performed on the sound features, the face features and the sequence features to obtain first superposition features, Stack superposition is performed on the first superposition features to obtain the input features, the input features are processed to obtain face key point related features, and the face key point related features are processed through a multilayer fully-connected network to obtain the face key point sequence. According to the embodiment of the application, the face key point information of a naturally changing virtual image can be generated directly from the audio data, improving the realism and fluency of the virtual image's speaking motion.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method of the present application.
Please refer to fig. 4, which shows a schematic structural diagram of a face key point generating apparatus according to an exemplary embodiment of the present application; the apparatus is hereinafter referred to as the generating apparatus 4 for short. The generating apparatus 4 may be implemented by software, hardware or a combination of both as all or part of a terminal. The apparatus comprises the following modules:
a first extraction module 401, configured to perform feature extraction on audio data to obtain a sound feature;
a second extraction module 402, configured to perform feature extraction on the template face to obtain face features;
a processing module 403, configured to process the face sequence to obtain a sequence feature; the face sequence comprises an angle sequence constraint feature and a boundary key point constraint feature;
a superposition module 404, configured to superpose the sound feature, the face feature, and the sequence feature to generate an input feature;
and a generating module 405, configured to generate a face key point sequence according to the input features.
Optionally, the first extraction module 401 further includes:
a calculation unit, configured to calculate the center positions on the time interval of the audio data based on a preset frame rate, traverse the time interval to extract the Mel cepstrum coefficient (MFCC) sound features in sub-intervals of preset length before and after each center position, and process the MFCC sound features to obtain the sound features.
Optionally, the second extracting module 402 further includes:
the identification unit is used for identifying a template face in the data set to acquire the coordinate information of key points of the face; counting all the face key point coordinate information in the data set to obtain average face key point coordinate information; determining coordinate information of target face key points, and obtaining initial input features based on the coordinate information of the target face key points and the coordinate information of the average face key points; and processing the initial input features to obtain the human face features.
The analysis unit is used for detecting the face images in the data set based on a face detection algorithm to obtain a detection result file; analyzing the detection result file to generate a template face; and the information in the detection result file comprises a plurality of face key point coordinates of the cheek coordinates, the eyebrow coordinates, the eye coordinates, the mouth coordinates and the nose coordinates.
Optionally, the processing module 403 further includes:
the acquiring unit is used for acquiring the angle sequence constraint characteristic and the boundary key point constraint characteristic; processing the angle sequence constraint characteristics to obtain an angle sequence constraint sequence; processing the boundary key point constraint characteristics to obtain a boundary key point constraint sequence; and overlapping the angle sequence constraint sequence and the boundary key point constraint sequence to obtain sequence characteristics.
Optionally, the stacking module 404 further includes:
the merging unit is used for performing Cat superposition on the sound feature, the face feature and the sequence feature to obtain a first superposition feature; and performing Stack superposition on the first superposition characteristics to obtain input characteristics.
Optionally, the generating module 405 further includes:
the processing unit is used for processing the input features to obtain relevant features of key points of the human face; processing the relevant features of the face key points through a multilayer full-connection network to obtain a face key point sequence; the face key point sequence comprises sequence size and audio data length association parameters, face key point number and corresponding coordinates.
The embodiment of the present application and the method embodiments of fig. 2 to 3 are based on the same concept, and the technical effects brought by the embodiment are also the same, and the specific process may refer to the description of the method embodiments of fig. 2 to 3, and will not be described again here.
The device 4 may be a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), a system on chip (SoC), a central processing unit (CPU), a graphics processing unit (GPU), an embedded neural network processing unit (NPU), a tensor processing unit (TPU) or a similar server-side or mobile-side image processor, a neural network acceleration processor, a network processor (NP), a digital signal processing circuit, a microcontroller unit (MCU), a programmable logic device (PLD) or another integrated chip.
When the scheme of the embodiment of the application is executed, the center positions on the time interval of the audio data are calculated based on a preset frame rate; the time interval is traversed to extract the Mel cepstrum coefficient (MFCC) sound features in sub-intervals of preset length before and after each center position, and the MFCC sound features are processed to obtain the sound features. The template faces in the data set are identified to obtain face key point coordinate information, all the face key point coordinate information in the data set is counted to obtain average face key point coordinate information, target face key point coordinate information is determined, initial input features are obtained based on the target and average face key point coordinate information, and the initial input features are processed to obtain the face features. The angle sequence constraint feature and the boundary key point constraint feature are obtained, the angle sequence constraint feature is processed to obtain an angle sequence constraint sequence, the boundary key point constraint feature is processed to obtain a boundary key point constraint sequence, and the two sequences are superposed to obtain the sequence features. Cat superposition is performed on the sound features, the face features and the sequence features to obtain first superposition features, Stack superposition is performed on the first superposition features to obtain the input features, the input features are processed to obtain face key point related features, and the face key point related features are processed through a multilayer fully-connected network to obtain the face key point sequence. According to the embodiment of the application, the face key point information of a naturally changing virtual image can be generated directly from the audio data, improving the realism and fluency of the virtual image's speaking motion.
An embodiment of the present application further provides a computer storage medium, where the computer storage medium may store a plurality of instructions, where the instructions are suitable for being loaded by a processor and executing the above method steps, and a specific execution process may refer to specific descriptions of the embodiment shown in fig. 2 or fig. 3, which is not described herein again.
The present application further provides a computer program product storing at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the face key point generation method according to the above embodiments.
Please refer to fig. 5, which is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. As shown in fig. 5, the electronic device 5 may include: at least one processor 501, at least one network interface 504, a user interface 503, memory 505, at least one communication bus 502.
Wherein a communication bus 502 is used to enable connective communication between these components.
The user interface 503 may include a Display (Display) and a Microphone (Microphone), and the optional user interface 503 may also include a standard wired interface and a wireless interface.
The network interface 504 may optionally include a standard wired interface, a wireless interface (e.g., WI-FI interface), among others.
The processor 501 may include one or more processing cores. The processor 501 connects various parts throughout the terminal 500 using various interfaces and lines, and performs various functions of the terminal 500 and processes data by running or executing instructions, programs, code sets or instruction sets stored in the memory 505 and by calling data stored in the memory 505. Optionally, the processor 501 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA) and Programmable Logic Array (PLA). The processor 501 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs and the like; the GPU performs all tensor operations of the deep learning network and is responsible for rendering and drawing the content to be displayed on the display screen; the modem handles wireless communications. It is understood that the modem may also not be integrated into the processor 501 and may instead be implemented by a separate chip.
The memory 505 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). Optionally, the memory 505 includes a non-transitory computer-readable medium. The memory 505 may be used to store instructions, programs, code sets, or instruction sets. The memory 505 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like; the stored data area may store data and the like referred to in the above respective method embodiments. The memory 505 may alternatively be at least one memory device located remotely from the processor 501. As shown in fig. 5, the memory 505, which is a type of computer storage medium, may include an operating system, a network communication module, a user interface module, and a face key point generation application.
In the electronic device 500 shown in fig. 5, the user interface 503 is mainly used as an interface for providing input for a user, and acquiring data input by the user; the processor 501 may be configured to call the face keypoint generation application stored in the memory 505, and specifically perform the following operations:
carrying out feature extraction on the audio data to obtain sound features;
extracting the features of the template face to obtain face features;
processing the face sequence to obtain sequence characteristics; the face sequence comprises an angle sequence constraint feature and a boundary key point constraint feature;
superposing the sound feature, the face feature and the sequence feature to generate an input feature;
and generating a face key point sequence according to the input features.
In one embodiment, the processor 501 performs the feature extraction on the audio data to obtain the sound features, including:
calculating a center position on a time interval of the audio data based on a preset frame rate;
traversing the time interval to extract the Mel cepstrum coefficient (MFCC) sound characteristics in sub time intervals with preset lengths before and after the central position;
and processing the MFCC sound characteristics to obtain the sound characteristics.
In one embodiment, the processor 501 performs the feature extraction on the template face to obtain the face feature, including:
identifying a template face in a data set to acquire coordinate information of key points of the face;
counting all the face key point coordinate information in the data set to obtain average face key point coordinate information;
determining coordinate information of target face key points, and obtaining initial input features based on the coordinate information of the target face key points and the coordinate information of the average face key points;
and processing the initial input features to obtain the human face features.
In one embodiment, before the processor 501 executes the template face in the recognition data set to acquire the coordinate information of the face key point, the method further includes:
detecting a face image in the data set based on a face detection algorithm to obtain a detection result file;
analyzing the detection result file to generate a template face;
and the information in the detection result file comprises a plurality of face key point coordinates of the cheek coordinates, the eyebrow coordinates, the eye coordinates, the mouth coordinates and the nose coordinates.
In one embodiment, the processor 501 performs the processing on the face sequence to obtain sequence features, including:
acquiring an angle sequence constraint characteristic and a boundary key point constraint characteristic;
processing the angle sequence constraint characteristics to obtain an angle sequence constraint sequence;
processing the boundary key point constraint characteristics to obtain a boundary key point constraint sequence;
and overlapping the angle sequence constraint sequence and the boundary key point constraint sequence to obtain sequence characteristics.
In one embodiment, processor 501 performs the superimposing of the voice feature, the face feature, and the sequence feature to generate an input feature, including:
performing Cat superposition on the sound feature, the face feature and the sequence feature to obtain a first superposition feature;
and performing Stack superposition on the first superposition characteristics to obtain input characteristics.
In one embodiment, the processor 501 performs the generating of the face key point sequence according to the input features, including:
processing the input features to obtain face key point related features;
processing the relevant features of the face key points through a multilayer full-connection network to obtain a face key point sequence; the face key point sequence comprises sequence size and audio data length association parameters, face key point number and corresponding coordinates.
The technical concept of the embodiment of the present application is the same as that of fig. 2 or fig. 3, and the specific process may refer to the method embodiment of fig. 2 or fig. 3, which is not described herein again.
In the embodiment of the application, the center positions on the time interval of the audio data are calculated based on a preset frame rate; the time interval is traversed to extract the Mel cepstrum coefficient (MFCC) sound features in sub-intervals of preset length before and after each center position, and the MFCC sound features are processed to obtain the sound features. The template faces in the data set are identified to obtain face key point coordinate information, all the face key point coordinate information in the data set is counted to obtain average face key point coordinate information, target face key point coordinate information is determined, initial input features are obtained based on the target and average face key point coordinate information, and the initial input features are processed to obtain the face features. The angle sequence constraint feature and the boundary key point constraint feature are obtained, the angle sequence constraint feature is processed to obtain an angle sequence constraint sequence, the boundary key point constraint feature is processed to obtain a boundary key point constraint sequence, and the two sequences are superposed to obtain the sequence features. Cat superposition is performed on the sound features, the face features and the sequence features to obtain first superposition features, Stack superposition is performed on the first superposition features to obtain the input features, the input features are processed to obtain face key point related features, and the face key point related features are processed through a multilayer fully-connected network to obtain the face key point sequence. According to the embodiment of the application, the face key point information of a naturally changing virtual image can be generated directly from the audio data, improving the realism and fluency of the virtual image's speaking motion.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a read-only memory or a random access memory.
The above disclosure describes only preferred embodiments of the present application and is not intended to limit the scope of the present application; the present application is therefore not limited thereto, and any equivalent variations and modifications made in accordance with the present application still fall within its scope.

Claims (11)

1. A method for generating key points of a human face is characterized by comprising the following steps:
carrying out feature extraction on the audio data to obtain sound features;
extracting the features of the template face to obtain face features;
processing the face sequence to obtain sequence characteristics; the face sequence comprises an angle sequence constraint feature and a boundary key point constraint feature;
superposing the sound feature, the face feature and the sequence feature to generate an input feature;
and generating a face key point sequence according to the input features.
2. The method of claim 1, wherein the performing feature extraction on the audio data to obtain sound features comprises:
calculating a center position on a time interval of the audio data based on a preset frame rate;
traversing the time interval to extract the Mel cepstrum coefficient (MFCC) sound characteristics in sub time intervals with preset lengths before and after the central position;
and processing the MFCC sound characteristics to obtain the sound characteristics.
3. The method of claim 1, wherein the extracting the features of the template face to obtain the face features comprises:
identifying a template face in a data set to acquire coordinate information of key points of the face;
counting all the face key point coordinate information in the data set to obtain average face key point coordinate information;
determining coordinate information of target face key points, and obtaining initial input features based on the coordinate information of the target face key points and the coordinate information of the average face key points;
and processing the initial input features to obtain the human face features.
4. The method of claim 3, wherein before the recognizing the template face in the data set to obtain the face key point coordinate information, the method further comprises:
detecting a face image in the data set based on a face detection algorithm to obtain a detection result file;
analyzing the detection result file to generate a template face;
and the information in the detection result file comprises a plurality of face key point coordinates of the cheek coordinates, the eyebrow coordinates, the eye coordinates, the mouth coordinates and the nose coordinates.
5. The method of claim 1, wherein the processing the face sequence to obtain the sequence feature comprises:
acquiring an angle sequence constraint characteristic and a boundary key point constraint characteristic;
processing the angle sequence constraint characteristics to obtain an angle sequence constraint sequence;
processing the boundary key point constraint characteristics to obtain a boundary key point constraint sequence;
and overlapping the angle sequence constraint sequence and the boundary key point constraint sequence to obtain sequence characteristics.
6. The method of claim 1, wherein the superposing the sound feature, the face feature and the sequence feature to generate an input feature comprises:
performing Cat superposition on the sound feature, the face feature and the sequence feature to obtain a first superposition feature;
and performing Stack superposition on the first superposition characteristics to obtain input characteristics.
7. The method of claim 1, wherein the generating a face keypoint sequence from the input features comprises:
processing the input features to obtain face key point related features;
processing the relevant features of the face key points through a multilayer full-connection network to obtain a face key point sequence; the face key point sequence comprises sequence size and audio data length association parameters, face key point number and corresponding coordinates.
8. The method of claim 7, wherein the processing the input features to obtain face keypoint related features is processing the input features to obtain face keypoint related features using a long-short term memory (LSTM) neural network.
9. A face keypoint generation apparatus, comprising:
the first extraction module is used for extracting the characteristics of the audio data to obtain sound characteristics;
the second extraction module is used for extracting the characteristics of the template face to obtain face characteristics;
the processing module is used for processing the face sequence to obtain sequence characteristics; the face sequence comprises an angle sequence constraint feature and a boundary key point constraint feature;
the superposition module is used for superposing the sound feature, the face feature and the sequence feature to generate an input feature;
and the generating module is used for generating a face key point sequence according to the input features.
10. A computer storage medium, characterized in that it stores a plurality of instructions adapted to be loaded by a processor and to perform the method steps according to any of claims 1 to 6.
11. An electronic device, comprising: a memory and a processor; wherein the memory stores a computer program adapted to be loaded by the processor and to perform the method steps of any of claims 1 to 6.
CN202011463289.2A 2020-12-11 2020-12-11 Face key point generation method and device, storage medium and electronic equipment Pending CN112668408A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011463289.2A CN112668408A (en) 2020-12-11 2020-12-11 Face key point generation method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011463289.2A CN112668408A (en) 2020-12-11 2020-12-11 Face key point generation method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN112668408A true CN112668408A (en) 2021-04-16

Family

ID=75405453

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011463289.2A Pending CN112668408A (en) 2020-12-11 2020-12-11 Face key point generation method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112668408A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160004962A1 (en) * 2014-07-02 2016-01-07 International Business Machines Corporation Classifying features using a neurosynaptic system
CN109492531A (en) * 2018-10-10 2019-03-19 深圳前海达闼云端智能科技有限公司 Face image key point extraction method and device, storage medium and electronic equipment
CN110688911A (en) * 2019-09-05 2020-01-14 深圳追一科技有限公司 Video processing method, device, system, terminal equipment and storage medium
CN111354079A (en) * 2020-03-11 2020-06-30 腾讯科技(深圳)有限公司 Three-dimensional face reconstruction network training and virtual face image generation method and device
CN111695471A (en) * 2020-06-02 2020-09-22 北京百度网讯科技有限公司 Virtual image generation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112669417B (en) Virtual image generation method and device, storage medium and electronic equipment
CN110390704B (en) Image processing method, image processing device, terminal equipment and storage medium
CN111243626B (en) Method and system for generating speaking video
WO2024051445A1 (en) Image generation method and related device
US8725507B2 (en) Systems and methods for synthesis of motion for animation of virtual heads/characters via voice processing in portable devices
CN112668407A (en) Face key point generation method and device, storage medium and electronic equipment
CN112650831A (en) Virtual image generation method and device, storage medium and electronic equipment
CN112652041B (en) Virtual image generation method and device, storage medium and electronic equipment
CN109977847A (en) Image generating method and device, electronic equipment and storage medium
CN114895817B (en) Interactive information processing method, network model training method and device
CN113228163A (en) Real-time text and audio based face reproduction
CN110347872A (en) Video cover image extracting method and device, storage medium and electronic equipment
WO2022242381A1 (en) Image generation method and apparatus, device, and storage medium
CN113316078B (en) Data processing method and device, computer equipment and storage medium
JP2023552854A (en) Human-computer interaction methods, devices, systems, electronic devices, computer-readable media and programs
CN114187547A (en) Target video output method and device, storage medium and electronic device
CN113705316A (en) Method, device and equipment for acquiring virtual image and storage medium
CN110794964A (en) Interaction method and device for virtual robot, electronic equipment and storage medium
JP2023059937A (en) Data interaction method and device, electronic apparatus, storage medium and program
US20230281833A1 (en) Facial image processing method and apparatus, device, and storage medium
CN113516990A (en) Voice enhancement method, method for training neural network and related equipment
CN116524924A (en) Digital human interaction control method, device, electronic equipment and storage medium
CN111265851B (en) Data processing method, device, electronic equipment and storage medium
CN112668408A (en) Face key point generation method and device, storage medium and electronic equipment
CN113257238B (en) Training method of pre-training model, coding feature acquisition method and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination