CN113851145B - Virtual human action sequence synthesis method combining voice and semantic key actions


Info

Publication number
CN113851145B
CN113851145B (Application CN202111111485.8A)
Authority
CN
China
Prior art keywords
sequence
voice
key
action
key action
Prior art date
Legal status
Active
Application number
CN202111111485.8A
Other languages
Chinese (zh)
Other versions
CN113851145A (en)
Inventor
曾鸣
刘鹏飞
Current Assignee
Xiamen University
Original Assignee
Xiamen University
Priority date
Filing date
Publication date
Application filed by Xiamen University filed Critical Xiamen University
Priority to CN202111111485.8A
Publication of CN113851145A
Application granted
Publication of CN113851145B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G — PHYSICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 21/10 — Transformation of speech into a non-audible representation; transforming into visible information
    • G10L 21/18 — Transformation of speech into a non-audible representation; details of the transformation process
    • G10L 15/02 — Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/16 — Speech classification or search using artificial neural networks
    • G10L 15/26 — Speech to text systems
    • G10L 25/24 — Speech or voice analysis characterised by the type of extracted parameters, the extracted parameters being the cepstrum
    • G10L 25/30 — Speech or voice analysis characterised by the analysis technique, using neural networks
    • G06N 3/045 — Computing arrangements based on biological models; neural networks; combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A virtual human action sequence synthesis method combining voice and semantic key actions, relating to action synthesis. A key action pool is constructed by manually selecting and recording key action target videos; a voice feature sequence is extracted from an input voice stream; the voice feature sequence is input to a voice recognition module, which outputs a corresponding text sequence; the voice feature sequence is input to a mouth shape inference module, which outputs a mouth shape feature point sequence; the mouth shape feature point trajectory sequence is input to a face texture matching module, which outputs a face texture image sequence; the text sequence and the voice audio stream are input to a key action selection module, which outputs a key action sequence; the voice audio stream, the text sequence and the key action sequence are input to a background frame selection module, which outputs a background frame sequence; and the face texture image sequence and the background frame sequence are input to a foreground-background mixing module, which outputs a virtual human speaking video in which the virtual human actions are consistent with the voice semantics. By explicitly using semantics to constrain the actions, the consistency between the virtual human actions and the voice semantics is improved.

Description

Virtual human action sequence synthesis method combining voice and semantic key actions
Technical Field
The invention relates to the technical field of action synthesis, in particular to a virtual human action sequence synthesis method combining voice and semantic key actions.
Background
Traditional virtual human body pose synthesis methods generally use neural networks to generate human body gestures directly from voice or text, and suffer from two main problems: on the one hand, the generated actions are poorly editable and show limited variation; on the other hand, the generation process tends to be a mapping between modalities without explicit constraints on the consistency between actions and content semantics.
In the paper Ginosar S., Bar A., Kohavi G., et al., "Learning Individual Styles of Conversational Gesture," Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019: 3497-3506, a human body gesture sequence is generated from a segment of input voice audio using a generative adversarial network, and a human body texture sequence corresponding to the gestures is synthesized accordingly; however, the generated motion is not editable, and texture defects and discontinuities are obvious.
Disclosure of Invention
The invention aims to solve the problem that virtual human actions synthesized by traditional virtual human synthesis techniques lack consistency and association with the voice semantics, and provides a virtual human action sequence synthesis method combining voice and semantic key actions.
The invention comprises the following steps:
1) Manually selecting and recording key action target videos to construct a key action pool;
2) Extracting a voice feature sequence from an input voice stream;
3) Inputting the voice feature sequence to a voice recognition module and outputting a corresponding text sequence;
4) Inputting the voice feature sequence to a mouth shape inference module and outputting a mouth shape feature point sequence;
5) Inputting the mouth shape feature point trajectory sequence to a face texture matching module and outputting a face texture image sequence;
6) Inputting the text sequence and the voice audio stream to a key action selection module and outputting a key action sequence;
7) Inputting the voice audio stream, the text sequence and the key action sequence to a background frame selection module and outputting a background frame sequence;
8) Inputting the face texture image sequence and the background frame sequence to a foreground-background mixing module and outputting a virtual human speaking video in which the virtual human actions are consistent with the voice semantics.
In step 1), key actions refer to hand and other body actions that are strongly correlated with semantics; the key action types that the virtual human can synthesize are selected in advance, and one video is recorded for each key action to be used as the background in the foreground-background mixing module. The set of key action target videos forms the key action pool. In particular, the pool must also contain a video, numbered 0, in which the human body is kept in a natural posture without performing any action, so that it is available to the key action selection module when selecting key actions. The other key action videos require that the human body postures at the beginning and the end do not differ greatly from the posture held in video 0; the numbering of the other key action videos is not restricted, as long as they can be distinguished.
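For illustration only, a minimal sketch of how such a key action pool might be organized in code; the class and method names (KeyActionPool, register, idle) and the file paths are hypothetical and not part of the patent:

```python
# Minimal sketch of a key action pool: each key action is a recorded target
# video plus a label, with index 0 reserved for the idle / natural-posture video.
from dataclasses import dataclass, field

@dataclass
class KeyActionVideo:
    index: int          # 0 is reserved for the idle video
    label: str          # e.g. "wave", "hands_on_chest", "spread_hands"
    path: str           # path to the recorded key action target video

@dataclass
class KeyActionPool:
    videos: dict = field(default_factory=dict)

    def register(self, index: int, label: str, path: str) -> None:
        self.videos[index] = KeyActionVideo(index, label, path)

    def idle(self) -> KeyActionVideo:
        # Video 0 keeps the body in a natural posture with no action.
        return self.videos[0]

pool = KeyActionPool()
pool.register(0, "idle", "videos/idle.mp4")
pool.register(1, "wave", "videos/wave.mp4")
pool.register(2, "hands_on_chest", "videos/hands_on_chest.mp4")
pool.register(3, "spread_hands", "videos/spread_hands.mp4")
```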
In step 2), the speech features are features related to human auditory perception, such as Mel-frequency cepstral coefficients (MFCCs).
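For illustration, a minimal sketch of step 2) that computes an MFCC sequence with the librosa library; the library choice, the 16 kHz sampling rate and the coefficient count are assumptions, not requirements of the patent:

```python
# Extract a per-frame MFCC feature sequence from the input voice stream.
import librosa

def extract_speech_features(wav_path: str, n_mfcc: int = 13):
    # Load the input voice stream as 16 kHz mono audio.
    y, sr = librosa.load(wav_path, sr=16000)
    # MFCC matrix of shape (n_mfcc, n_frames); transpose to one vector per frame.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.T  # voice feature sequence, one n_mfcc-dimensional vector per frame
```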
In step 3), the voice recognition module is a voice recognition network.
In step 4), the mouth shape inference module is an LSTM network.
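As one possible reading of step 4), a small LSTM that maps per-frame MFCC vectors to 2-D mouth landmarks might look as follows; the layer sizes, landmark count and input dimension are illustrative assumptions only:

```python
# Sketch of the mouth shape inference module as an LSTM over speech features.
import torch
import torch.nn as nn

class MouthShapeLSTM(nn.Module):
    def __init__(self, n_mfcc=13, hidden=128, n_landmarks=20):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_mfcc, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_landmarks * 2)  # (x, y) per landmark

    def forward(self, mfcc_seq):              # (batch, frames, n_mfcc)
        h, _ = self.lstm(mfcc_seq)
        return self.head(h)                   # (batch, frames, n_landmarks * 2)

# Usage: wrap the voice feature sequence from step 2) as a batch of size 1, e.g.
# model = MouthShapeLSTM(); mouth_seq = model(torch.randn(1, 200, 13))
```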
In step 5), the face texture matching module is configured to match a face texture image to each mouth shape in the mouth shape feature point sequence, where the mouth shape in the image is consistent with the corresponding mouth shape feature points. The specific steps for matching one mouth shape feature point p in the sequence are as follows:
(1) Performing face detection and face feature point alignment on all videos in the key action pool, and extracting a face texture image set F = {f_i} and the corresponding mouth shape feature points P = {p_i}, where p_i is the position of the mouth shape feature points in face texture image f_i;
(2) Selecting the M mouth shape feature points in the set P that are most similar to the mouth shape feature point p, and selecting the corresponding face texture images to obtain a candidate set of M face texture images F' = {f'_i | i ∈ [0, M-1]};
(3) Computing the median of the face texture image set F' and using it as the face texture image matched to the mouth shape feature point p.
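A minimal sketch of this matching step: a nearest-neighbour search over mouth landmarks followed by a median over the M candidate images. The L2 distance over flattened landmarks and the interpretation of the median as a per-pixel median are assumptions made here for illustration:

```python
import numpy as np

def match_face_texture(p, P, F, M=5):
    """p: (n_landmarks, 2) query mouth shape; P: (N, n_landmarks, 2) pool landmarks;
    F: (N, H, W, 3) face texture images aligned with P."""
    # Distance of the query mouth shape to every mouth shape in the pool.
    d = np.linalg.norm(P.reshape(len(P), -1) - p.reshape(1, -1), axis=1)
    idx = np.argsort(d)[:M]                   # M most similar mouth shapes
    candidates = F[idx].astype(np.float32)    # M candidate face textures F'
    # Per-pixel median of the candidates is used as the matched face texture.
    return np.median(candidates, axis=0).astype(np.uint8)
```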
In step 6), the key action selection module is used to extract semantic information from the voice information and the text information, and to assign a corresponding key action to each sentence of the text sequence. The specific steps are as follows:
(1) Splitting the input text sequence L into substrings at punctuation marks to obtain L = (L_0, L_1, ..., L_N);
(2) Identifying the end point of each sentence in the input voice audio stream according to loudness, and then separating the audio stream segment A = (A_0, A_1, ..., A_i, ..., A_N) corresponding to each text substring L_i;
(3) Converting each audio stream segment A_i into a spectral feature map and inputting it into the fully convolutional neural network in the key action selection module to extract audio features, obtaining the latent space vectors HA = (HA_0, HA_1, ..., HA_i, ..., HA_N) corresponding to the audio;
(4) Converting each text substring L_i into word vectors with a word encoder and inputting them into the recurrent neural network in the key action selection module to extract text features, obtaining the latent space vectors HL = (HL_0, HL_1, ..., HL_i, ..., HL_N) corresponding to the text;
(5) Inputting HA_i and HL_i into the attention sub-network in the key action selection module to obtain the weighted feature vector H_i;
(6) Inputting H_i into a softmax function and discretizing according to the maximum value to obtain the key action category one-hot encoding KL_i; in particular, if there is no specific action, the one-hot encoding is an all-zero vector;
(7) Comparing each one-hot encoding KL_i obtained in step (6) against the key action pool to obtain the corresponding key action sequence K = (K_0, K_1, ..., K_i, ..., K_N), where K_i is the key action type number corresponding to KL_i, and outputting K for subsequent use.
After step 6), the key action sequence K = (K_0, K_1, ..., K_i, ..., K_N) can be manually corrected or edited, so that a specific key action can be manually assigned to a given segment of voice; this makes the method more interactive and editable, and the virtual human action sequence synthesized after manual selection still matches the semantic scene of the voice.
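A minimal sketch of one possible key action selection network following steps (3)-(6): a fully convolutional audio branch over the spectral feature map, a recurrent text branch over the word vectors, an attention-based fusion, and a softmax over key action classes. All layer sizes, the GRU choice and the attention form are illustrative assumptions, not values specified by the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KeyActionSelector(nn.Module):
    def __init__(self, n_mels=80, word_dim=300, hidden=128, n_actions=4):
        super().__init__()
        # Fully convolutional audio branch over the spectral feature map.
        self.audio_cnn = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))
        # Recurrent text branch over the word vectors.
        self.text_rnn = nn.GRU(word_dim, hidden, batch_first=True)
        # Attention sub-network: weight the two latent vectors before fusion.
        self.attn = nn.Linear(2 * hidden, 2)
        self.classifier = nn.Linear(hidden, n_actions)

    def forward(self, spectrogram, word_vectors):
        # spectrogram: (batch, n_mels, T); word_vectors: (batch, words, word_dim)
        ha = self.audio_cnn(spectrogram).squeeze(-1)         # (batch, hidden)
        _, hl = self.text_rnn(word_vectors)                  # (1, batch, hidden)
        hl = hl.squeeze(0)
        w = torch.softmax(self.attn(torch.cat([ha, hl], dim=-1)), dim=-1)
        h = w[:, :1] * ha + w[:, 1:] * hl                    # weighted feature H_i
        logits = self.classifier(h)
        # Discretize by the maximum value to a one-hot key action encoding KL_i.
        # (Here class 0 stands in for "no action"; the patent instead uses an
        # all-zero vector for that case.)
        return F.one_hot(logits.argmax(dim=-1), logits.size(-1))
```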
In step 7), the background frame selection module is used to select a background frame sequence that is consistent with the key action sequence obtained in step 6) and whose head posture is consistent with the pauses and rhythm of the voice. Preferably, the background frame selection module solves for the frame sequence with a dynamic programming algorithm; the dynamic programming algorithm jointly considers the smoothness of key action transitions and the degree of matching between head motion speed and voice pauses, and obtains the optimal background frame sequence by an iterative method.
In step 8), the foreground-background mixing module uses a Laplacian pyramid image blending algorithm to blend the face texture image sequence obtained in step 5) with the background frame sequence obtained in step 7), obtaining a virtual human action sequence consistent with the voice semantics.
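A minimal sketch of Laplacian pyramid blending for step 8), implemented here with OpenCV; the face mask and the number of pyramid levels are assumptions, since the patent only names the blending technique:

```python
import cv2
import numpy as np

def laplacian_pyramid_blend(face, background, mask, levels=5):
    """face, background: (H, W, 3) uint8 images; mask: (H, W, 3) float32 in [0, 1],
    1 over the face region and 0 elsewhere."""
    gp_f = [face.astype(np.float32)]
    gp_b = [background.astype(np.float32)]
    gp_m = [mask.astype(np.float32)]
    # Gaussian pyramids of foreground, background and mask.
    for _ in range(levels):
        gp_f.append(cv2.pyrDown(gp_f[-1]))
        gp_b.append(cv2.pyrDown(gp_b[-1]))
        gp_m.append(cv2.pyrDown(gp_m[-1]))
    blended = None
    # Build Laplacian levels, blend them with the downsampled mask, and collapse.
    for i in range(levels, 0, -1):
        size = (gp_f[i - 1].shape[1], gp_f[i - 1].shape[0])
        lap_f = gp_f[i - 1] - cv2.pyrUp(gp_f[i], dstsize=size)
        lap_b = gp_b[i - 1] - cv2.pyrUp(gp_b[i], dstsize=size)
        m = gp_m[i - 1]
        lap = m * lap_f + (1 - m) * lap_b
        if blended is None:  # coarsest level: blend the Gaussian residual
            blended = gp_m[i] * gp_f[i] + (1 - gp_m[i]) * gp_b[i]
        blended = cv2.pyrUp(blended, dstsize=size) + lap
    return np.clip(blended, 0, 255).astype(np.uint8)
```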
The invention constructs a key action pool by collecting data, selects key actions from the pool using the semantic information of voice and text, and uses them to guide the generation of the subsequent action video. On top of automatic key action selection, the invention also supports manual editing of the action sequence, which improves the interactivity and diversity of the generated actions. In summary, the invention combines text and voice semantic information to select key actions and explicitly uses semantics to constrain actions, improving the consistency between virtual human actions and voice semantics.
Drawings
Fig. 1 is an overall flow chart of the present invention.
FIG. 2 is a flow chart of a key action selection module according to the present invention.
Fig. 3 is a flowchart of a background frame selection module according to the present invention.
Detailed Description
The invention will be further illustrated by the following examples in conjunction with the accompanying drawings.
Referring to Figs. 1 to 3, this embodiment illustrates the virtual human action sequence synthesis method combining voice and semantic key actions by taking the synthesis of virtual human gestures as an example; however, the scope of actions that can be generated according to the invention is not limited thereto, and any virtual human action sequence can be synthesized with the method. The specific steps are as follows:
1) Manually selecting and recording key gesture target videos to construct a key gesture pool.
2) Extracting a voice feature sequence from the input voice stream.
3) Inputting the voice feature sequence to the voice recognition module and outputting the corresponding text sequence.
4) Inputting the voice feature sequence to the mouth shape inference module and outputting a mouth shape feature point sequence.
5) Inputting the mouth shape feature point trajectory sequence to the face texture matching module and outputting a face texture image sequence.
6) Inputting the text sequence and the voice audio stream to the key action selection module and outputting a key gesture sequence.
7) Inputting the voice audio stream, the text sequence and the key gesture sequence to the background frame selection module and outputting a background frame sequence.
8) Inputting the face texture image sequence and the background frame sequence to the foreground-background mixing module and outputting a virtual human speaking video in which the virtual human posture is consistent with the voice semantics.
The virtual human action sequence synthesis method combining voice and semantic key actions is described in detail as follows:
the traditional virtual human body state synthesis method generally utilizes a neural network method to directly generate human body gestures from voice or text, and mainly has two problems: on the one hand, the generated actions have poor editability and limited variation; on the other hand, the generation process tends to be a mapping between patterns without explicit constraints on consistency of actions and content semantics.
On this basis, the invention provides a virtual human action sequence synthesis method combining voice and semantic key actions. A key action pool is constructed by collecting data, and key actions are then selected from the pool using the semantic information of voice and text to guide the generation of the subsequent action video. On top of automatic key action selection, the invention also supports manual editing of the action sequence, which improves the interactivity and diversity of the generated actions. In summary, the invention combines the semantic information of voice and text to select key actions and explicitly uses semantics to constrain actions, improving the consistency between actions and semantics.
Step 6) and step 7) are key modules of the invention, and control the generation of key action sequences consistent with voice semantics. Specific embodiments of the key action selection module and the background frame selection module are described in detail below. The other steps are performed as described in the summary of the invention to complete the specific embodiment.
The key action selecting module:
(1) For the input text L, substring segmentation is performed at punctuation marks. Taking L = "Hello, I am a virtual anchor, nice to meet you" as an example, this gives L_0 = "Hello", L_1 = "I am a virtual anchor" and L_2 = "nice to meet you";
(2) Each text substring L_i is input into the voice synthesis network, and the corresponding audio stream segments A = (A_0, A_1, A_2) are output;
(3) Each audio stream segment A_i is converted into a spectral feature map and input into the fully convolutional neural network in the key action selection module to extract audio features, obtaining the latent space vectors HA = (HA_0, HA_1, HA_2) corresponding to the audio;
(4) Each text substring L_i is converted into word vectors with a word encoder and then input into the recurrent neural network in the key action selection module to extract text features, obtaining the latent space vectors HL = (HL_0, HL_1, HL_2) corresponding to the text;
(5) HA_i and HL_i are input into the attention sub-network in the key action selection module to obtain the weighted feature vector H_i;
(6) H_i is input into the softmax function and discretized according to the maximum value to obtain the key action category one-hot encoding KL_i, whose length is the total number of key actions in the key action pool. In this embodiment, the entries of KL_i correspond to three action types: waving the hands, covering the chest, and spreading the hands;
(7) Each one-hot encoding KL_i obtained in step (6) is compared against the key action pool to obtain the corresponding set of key action segments K = (K_0, K_1, K_2), which is output for subsequent use.
A background frame selection module:
In this embodiment, the sentence "Hello, I am a virtual anchor, nice to meet you" is taken as an example to show specifically how the background frame selection module selects the background frame sequence. Referring to Fig. 3, the specific steps are as follows:
(1) The input text sequence L is split into substrings at punctuation marks to obtain L = (L_0, L_1, ..., L_N).
(2) The end point of each sentence in the input voice audio stream is identified according to loudness, and the audio stream segment A = (A_0, A_1, ..., A_i, ..., A_N) corresponding to each text substring L_i is separated.
(3) Each audio stream segment A_i is converted into a spectral feature map and input into the fully convolutional neural network in the key action selection module to extract audio features, obtaining the latent space vectors HA = (HA_0, HA_1, ..., HA_i, ..., HA_N) corresponding to the audio.
(4) Each text substring L_i is converted into word vectors with a word encoder and then input into the recurrent neural network in the key action selection module to extract text features, obtaining the latent space vectors HL = (HL_0, HL_1, ..., HL_i, ..., HL_N) corresponding to the text.
(5) HA_i and HL_i are input into the attention sub-network in the key action selection module to obtain the weighted feature vector H_i.
(6) H_i is input into the softmax function and discretized according to the maximum value to obtain the key action category one-hot encoding KL_i. In particular, if there is no specific action, the one-hot encoding is an all-zero vector.
(7) Each one-hot encoding KL_i obtained in step (6) is compared against the key action pool to obtain the corresponding key action sequence K = (K_0, K_1, ..., K_i, ..., K_N), where K_i is the key action type number corresponding to KL_i, which is output for subsequent use.
Preferably, after step 6) the key action sequence K = (K_0, K_1, ..., K_i, ..., K_N) can be manually corrected or edited, so that a specific key action can be manually assigned to a given segment of voice; this makes the method more interactive, and the virtual human action sequence synthesized after manual selection still matches the semantic scene of the voice.
In step 7), the background frame selection module is used to select a background frame sequence that is consistent with the key action sequence obtained in step 6) and whose head posture is consistent with the pauses and rhythm of the voice. Without loss of generality, the output video frame rate of the algorithm is assumed to be the same as that of the manually recorded virtual human key action videos. The background frame selection module specifically comprises the following steps:
(1) The ending time point of the audio stream segment A_i corresponding to each text substring L_i is calculated and multiplied by the output video frame rate to obtain the ending frame number E = (e_0, e_1, ..., e_i, ..., e_N) of each text substring L_i in the output video;
(2) The output video background frame sequence is denoted B = (b_0, b_1, ..., b_i, ..., b_M), where M is the total number of frames in the video output by the algorithm, and the frame sequence of the i-th key action has total frame length m_i. For each text substring L_i, if the corresponding key action K_i is not equal to 0, b_j = K_i is set for j ∈ [e_i - m_i, e_i];
(3) For the output background frame sequence between any two key actions, i.e. key actions K_i and K_j satisfying K_i ≠ 0, K_j ≠ 0 and K_l = 0 for l ∈ (i, j), the frame sequence is solved with the dynamic programming algorithm of steps (4) to (7) below. The dynamic programming algorithm jointly considers the smoothness of key action transitions and the degree of matching between head motion speed and voice pauses, and obtains the optimal background frame sequence by an iterative method;
(4) The output background frame sequence between any two key actions is filled with frames from video number 0 in the key action pool. Let the background frames to be filled be S = (s_0, s_1, ..., s_N), where N = e_j - m_j - e_i, and let the frames of video number 0 be T = (t_0, t_1, ..., t_M), where M is the total number of frames in video number 0. The target video refers to the number 0 key action video;
(5) The head movement speed V = (v_0, v_1, ..., v_M) of the target video at each frame is computed with a face feature point alignment algorithm. Using loudness statistics of the audio stream, whether the input voice audio stream is in a speaking state at each frame to be filled is computed and denoted a = (a_0, a_1, ..., a_N), where a_i = 1 indicates that the voice is not speaking and otherwise the voice is speaking;
(6) Let F(n, m, 0) be the cost of using the m-th frame of the target video as the n-th frame to be filled, and let F(n+1, m, 1) be the cost of repeating the m-th frame of the target video as the (n+1)-th frame to be filled. The recursion for the cost function F is:
F(n, m, 0) = min(F(n-1, m-1, 0), F(n-1, m-1, 1)) - a_n * v_m
F(n, m, 1) = F(n-1, m, 0) + α * v_m
F(0, m, 0) = -a_0 * v_m + v(k_i m_i, k_0 m)
F(n, 0, 0) = ∞ if n > 0
F(n, m, 1) = ∞ if n = 0 or m = 0
(7) The optimal background frame sequence is given by min_m { min(F(N-1, m, 0), F(N-1, m, 1)) } and can be solved in O(MN) time by iteratively evaluating the recursion.
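The recursion above can be evaluated bottom-up. A minimal sketch in Python follows, assuming α = 1, omitting the transition-smoothness term v(·,·) of the base case, and adding simple backtracking to recover the chosen target-video frame indices; these are illustrative assumptions, not specified by the patent:

```python
import numpy as np

def fill_background_frames(a, v, alpha=1.0):
    """a: (N,) with a[n] = 1 when the voice is silent at output frame n;
    v: (M,) head movement speed of target-video frame m.
    Returns the target-video frame index chosen for each of the N filled frames."""
    a, v = np.asarray(a, dtype=float), np.asarray(v, dtype=float)
    N, M = len(a), len(v)
    INF = float("inf")
    F = np.full((N, M, 2), INF)
    parent = np.zeros((N, M, 2), dtype=int)    # previous state flag, for backtracking
    F[0, :, 0] = -a[0] * v                     # base case (transition term omitted)
    for n in range(1, N):
        for m in range(1, M):
            prev = (F[n - 1, m - 1, 0], F[n - 1, m - 1, 1])
            best = int(np.argmin(prev))
            F[n, m, 0] = prev[best] - a[n] * v[m]      # advance to the next target frame
            parent[n, m, 0] = best
            F[n, m, 1] = F[n - 1, m, 0] + alpha * v[m]  # repeat the current target frame
            parent[n, m, 1] = 0
    # Best final state over target frame m and the two flags, then backtrack.
    m = int(np.argmin(F[N - 1].min(axis=1)))
    s = int(np.argmin(F[N - 1, m]))
    frames = []
    for n in range(N - 1, -1, -1):
        frames.append(m)
        prev_s = parent[n, m, s]
        if s == 0:
            m -= 1          # state 0 consumed a new target frame
        s = prev_s
    return frames[::-1]
```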
The foregoing is only a preferred embodiment of the present invention, and the scope of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art within the technical scope disclosed by the present invention, according to the technical scheme and the inventive concept thereof, shall fall within the scope of protection of the present invention.

Claims (9)

1. A virtual human action sequence synthesis method combining voice and semantic key actions, characterized by comprising the following steps:
1) Manually selecting and recording key action target videos to construct a key action pool;
2) Extracting a voice feature sequence from an input voice stream;
3) Inputting the voice feature sequence to a voice recognition module and outputting a corresponding text sequence;
4) Inputting the voice feature sequence to a mouth shape inference module and outputting a mouth shape feature point sequence;
5) Inputting the mouth shape feature point trajectory sequence to a face texture matching module and outputting a face texture image sequence;
6) Inputting the text sequence and the voice audio stream to a key action selection module and outputting a key action sequence;
wherein the key action selection module is used to extract semantic information from the voice information and the text information and to assign a corresponding key action to each sentence of the text sequence, specifically comprising the following steps:
(1) Splitting the input text sequence L into substrings at punctuation marks to obtain L = (L_0, L_1, ..., L_i, ..., L_N);
(2) Identifying the end point of each sentence in the input voice audio stream according to loudness, and then separating the audio stream segment A = (A_0, A_1, ..., A_i, ..., A_N) corresponding to each text substring L_i;
(3) Converting each audio stream segment A_i into a spectral feature map and inputting it into the fully convolutional neural network in the key action selection module to extract audio features, obtaining the latent space vectors HA = (HA_0, HA_1, ..., HA_i, ..., HA_N) corresponding to the audio;
(4) Converting each text substring L_i into word vectors with a word encoder and inputting them into the recurrent neural network in the key action selection module to extract text features, obtaining the latent space vectors HL = (HL_0, HL_1, ..., HL_i, ..., HL_N) corresponding to the text;
(5) Inputting HA_i and HL_i into the attention sub-network in the key action selection module to obtain the weighted feature vector H_i;
(6) Inputting H_i into a softmax function and discretizing according to the maximum value to obtain the key action category one-hot encoding KL_i; if there is no specific action, the one-hot encoding is an all-zero vector;
(7) Comparing each one-hot encoding KL_i obtained in step (6) against the key action pool to obtain the corresponding key action sequence K = (K_0, K_1, ..., K_i, ..., K_N), where K_i is the key action type number corresponding to KL_i, and outputting K for subsequent use;
7) Inputting the voice audio stream, the text sequence and the key action sequence to a background frame selection module and outputting a background frame sequence;
8) Inputting the face texture image sequence and the background frame sequence to a foreground-background mixing module and outputting a virtual human speaking video in which the virtual human actions are consistent with the voice semantics.
2. The method for synthesizing a virtual human action sequence combining voice and semantic key actions according to claim 1, characterized in that in step 1), the key actions refer to hand and other body actions strongly correlated with semantics; the key action types that the virtual human can synthesize are selected in advance, and one video is recorded for each key action to be used as the background in the foreground-background mixing module; the set of key action target videos forms the key action pool; the pool must also contain a video, numbered 0, in which the human body is kept in a natural posture without performing any action, so that it is available to the key action selection module when selecting key actions; the other key action videos require that the human body postures at the beginning and the end do not differ greatly from the posture held in video 0; the numbering of the other key action videos is not restricted.
3. The method of claim 1, wherein in step 2), the speech features are auditory related features of the human ear, including mel-frequency cepstral coefficients.
4. The method of claim 1, wherein in step 3), the speech recognition module is a speech recognition network.
5. The method of claim 1, wherein in step 4), the mouth-shape inference module is an LSTM network.
6. The method of claim 1, wherein in step 5), the face texture matching module is configured to match a face texture image to each mouth shape in the mouth shape feature point sequence, the mouth shape of the image being consistent with the corresponding mouth shape feature points; the specific steps for matching one mouth shape feature point p in the sequence are as follows:
(1) Performing face detection and face feature point alignment on all videos in the key action pool, and extracting a face texture image set F = {f_i} and the corresponding mouth shape feature points P = {p_i}, where p_i is the position of the mouth shape feature points in face texture image f_i;
(2) Selecting the M mouth shape feature points in the set P that are most similar to the mouth shape feature point p, and selecting the corresponding face texture images to obtain a candidate set of M face texture images F' = {f'_i | i ∈ [0, M-1]};
(3) Computing the median of the face texture image set F' and using it as the face texture image matched to the mouth shape feature point p.
7. The method for synthesizing a virtual human action sequence combining voice and semantic key actions according to claim 1, characterized in that in step 7), the background frame selection module is configured to select a background frame sequence that is consistent with the key action sequence obtained in step 6) and whose head posture is consistent with the pauses and rhythm of the voice.
8. The method for synthesizing a virtual human action sequence combining voice and semantic key actions according to claim 7, characterized in that the background frame selection module solves for the frame sequence with a dynamic programming algorithm; the dynamic programming algorithm jointly considers the smoothness of key action transitions and the degree of matching between head motion speed and voice pauses, and solves for the optimal background frame sequence by an iterative method.
9. The method for synthesizing a virtual human action sequence combining voice and semantic key actions according to claim 1, characterized in that in step 8), the foreground-background mixing module uses a Laplacian pyramid image blending algorithm to mix the face texture image sequence obtained in step 5) with the background frame sequence obtained in step 7) to obtain a virtual human action sequence consistent with the voice semantics.
CN202111111485.8A 2021-09-23 2021-09-23 Virtual human action sequence synthesis method combining voice and semantic key actions Active CN113851145B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111111485.8A CN113851145B (en) 2021-09-23 2021-09-23 Virtual human action sequence synthesis method combining voice and semantic key actions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111111485.8A CN113851145B (en) 2021-09-23 2021-09-23 Virtual human action sequence synthesis method combining voice and semantic key actions

Publications (2)

Publication Number Publication Date
CN113851145A CN113851145A (en) 2021-12-28
CN113851145B true CN113851145B (en) 2024-06-07

Family

ID=78979053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111111485.8A Active CN113851145B (en) 2021-09-23 2021-09-23 Virtual human action sequence synthesis method combining voice and semantic key actions

Country Status (1)

Country Link
CN (1) CN113851145B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116778040B (en) * 2023-08-17 2024-04-09 北京百度网讯科技有限公司 Face image generation method based on mouth shape, training method and device of model
CN117528197B (en) * 2024-01-08 2024-04-02 北京天工异彩影视科技有限公司 High-frame-rate playback type quick virtual film making system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20010038772A (en) * 1999-10-27 2001-05-15 최창석 Automatic and adaptive synchronization method of image frame using speech duration time in the system integrated with speech and face animation
WO2001091482A1 (en) * 2000-05-23 2001-11-29 Media Farm, Inc. Remote displays in mobile communication networks
CN103218842A (en) * 2013-03-12 2013-07-24 西南交通大学 Voice synchronous-drive three-dimensional face mouth shape and face posture animation method
CN110688911A (en) * 2019-09-05 2020-01-14 深圳追一科技有限公司 Video processing method, device, system, terminal equipment and storage medium
WO2020232867A1 (en) * 2019-05-21 2020-11-26 平安科技(深圳)有限公司 Lip-reading recognition method and apparatus, computer device, and storage medium
CN112562722A (en) * 2020-12-01 2021-03-26 新华智云科技有限公司 Audio-driven digital human generation method and system based on semantics
CN112887698A (en) * 2021-02-04 2021-06-01 中国科学技术大学 High-quality face voice driving method based on nerve radiation field

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20010038772A (en) * 1999-10-27 2001-05-15 최창석 Automatic and adaptive synchronization method of image frame using speech duration time in the system integrated with speech and face animation
WO2001091482A1 (en) * 2000-05-23 2001-11-29 Media Farm, Inc. Remote displays in mobile communication networks
CN103218842A (en) * 2013-03-12 2013-07-24 西南交通大学 Voice synchronous-drive three-dimensional face mouth shape and face posture animation method
WO2020232867A1 (en) * 2019-05-21 2020-11-26 平安科技(深圳)有限公司 Lip-reading recognition method and apparatus, computer device, and storage medium
CN110688911A (en) * 2019-09-05 2020-01-14 深圳追一科技有限公司 Video processing method, device, system, terminal equipment and storage medium
CN112562722A (en) * 2020-12-01 2021-03-26 新华智云科技有限公司 Audio-driven digital human generation method and system based on semantics
CN112887698A (en) * 2021-02-04 2021-06-01 中国科学技术大学 High-quality face voice driving method based on nerve radiation field

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A cluster scheduling algorithm for distributed Kahn process networks; 钱正平; 齐德昱; 曾鸣; Application Research of Computers; 2009-12-15 (Issue 12); full text *
Multi-label sequence labeling based on sequence graph models; 王少敬; 刘鹏飞; 邱锡鹏; Journal of Chinese Information Processing; 2020-06-15 (Issue 06); full text *

Also Published As

Publication number Publication date
CN113851145A (en) 2021-12-28

Similar Documents

Publication Publication Date Title
CN109376582B (en) Interactive face cartoon method based on generation of confrontation network
CN113851145B (en) Virtual human action sequence synthesis method combining voice and semantic key actions
Wang et al. Seeing what you said: Talking face generation guided by a lip reading expert
Chuang et al. Mood swings: expressive speech animation
CN112151030B (en) Multi-mode-based complex scene voice recognition method and device
CN110880315A (en) Personalized voice and video generation system based on phoneme posterior probability
CN113592985B (en) Method and device for outputting mixed deformation value, storage medium and electronic device
CN110853670A (en) Music-driven dance generating method
Ma et al. Unpaired image-to-speech synthesis with multimodal information bottleneck
CN114173188B (en) Video generation method, electronic device, storage medium and digital person server
CN116051692B (en) Three-dimensional digital human face animation generation method based on voice driving
CN115511994A (en) Method for quickly cloning real person into two-dimensional virtual digital person
WO2023035969A1 (en) Speech and image synchronization measurement method and apparatus, and model training method and apparatus
CN116597857A (en) Method, system, device and storage medium for driving image by voice
Liu et al. Moda: Mapping-once audio-driven portrait animation with dual attentions
Websdale et al. Speaker-independent speech animation using perceptual loss functions and synthetic data
CN116828129B (en) Ultra-clear 2D digital person generation method and system
Tan et al. Style2talker: High-resolution talking head generation with emotion style and art style
CN117078816A (en) Virtual image generation method, device, terminal equipment and storage medium
Yin et al. Asymmetrically boosted hmm for speech reading
Wang et al. Ca-wav2lip: Coordinate attention-based speech to lip synthesis in the wild
JP4617500B2 (en) Lip sync animation creation device, computer program, and face model creation device
CN110648666B (en) Method and system for improving conference transcription performance based on conference outline
CN114155321A (en) Face animation generation method based on self-supervision and mixed density network
Zhao et al. Generating diverse gestures from speech using memory networks as dynamic dictionaries

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant