CN116825083A - Speech synthesis system based on face grid - Google Patents

Speech synthesis system based on face grid

Info

Publication number
CN116825083A
CN116825083A
Authority
CN
China
Prior art keywords
lip
synthesis system
speech synthesis
video
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310176960.2A
Other languages
Chinese (zh)
Inventor
金宸极
林菲
张聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202310176960.2A priority Critical patent/CN116825083A/en
Publication of CN116825083A publication Critical patent/CN116825083A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The application belongs to the technical field of computer vision and particularly relates to a speech synthesis system based on a face mesh. The method comprises the following steps: S1, constructing a lip movement model: high-level lip movement features are extracted from video data by an encoder; S2, video speech recognition: a video is selected as the model input, and the content spoken by the speaker is predicted from the lip motion video to form text; S3, text-to-speech generation: the mel spectrum corresponding to the text is synthesized in an autoregressive manner, and the audio waveform is synthesized by an audio decoder. Compared with the prior art, the face mesh-based speech synthesis system has the advantages that accuracy can be improved and lip reading can be performed directly from high-level features (lip movements).

Description

Speech synthesis system based on face grid
Technical Field
The application belongs to the technical field of computer vision and particularly relates to a speech synthesis system based on a face mesh.
Background
The relationship between lip movements and pronunciation when a human speaks has been demonstrated in many previous studies. Trained professionals can understand what others say by watching their lips, i.e., lip reading, a skill that is also used to help hearing-impaired people communicate with others. However, relying on manual lip reading is inefficient, and although artificial intelligence can also be used for such translation, its accuracy is not ideal. On this basis, a system that can replace manual lip reading and improve accuracy is needed.
Disclosure of Invention
The present application has been made in view of the above problems, and an object of the present application is to provide a face mesh-based speech synthesis system capable of improving accuracy.
In order to achieve the above purpose, the present application adopts the following technical scheme: the face mesh-based speech synthesis system comprises the following steps:
S1, constructing a lip movement model: high-level lip movement features are extracted from the video data by an encoder;
S2, video speech recognition: a video is selected as the model input, and the content spoken by the speaker is predicted from the lip motion video to form text;
S3, text-to-speech generation: the mel spectrum corresponding to the text is synthesized in an autoregressive manner, and the audio waveform is synthesized by an audio decoder.
In the face mesh based speech synthesis system described above, the encoder uses a 2D convolutional network to extract the lip movement high-level features from Landmarks.
In the above face mesh-based speech synthesis system, a Lip2Wav dataset is constructed by selecting videos of several different speakers from a video website, with a total length of 120 hours;
the content recited by a single speaker has the same context, and audio content similar to the speaker's timbre and style is synthesized.
In the above face mesh-based speech synthesis system, the mel spectrum S = (s_1, s_2, …, s_{t'}) of the corresponding audio segment is predicted from the ordered lip motion sequence L = (l_1, l_2, …, l_t);
the speech time step s_{k'} at time k should be modeled according to the following equation:
s_{k'} = f(l_{k∈(k±δ)}, s_{<k'}).
In the above face mesh-based speech synthesis system, a multi-layer convolutional network is created in the encoder; each convolutional layer expands the number of feature channels, and the network uses residual connections and batch normalization.
In the above face mesh-based speech synthesis system, the spatio-temporal lip motion Landmarks are encoded by the convolutional network, and the model network accepts lip motion information of dimension F as input;
the Features of each Landmark are characterized by different convolution channels;
a multi-layer convolutional network is established in the encoder, each convolutional layer expands the number of feature channels, and residual connections and batch normalization are used between layers;
the three-dimensional coordinates are downsampled to one-dimensional features in the final convolutional layer, while the time dimension is preserved throughout; the final output of the convolutional component is a tensor of dimension T×F, where T is the number of time steps and F is the number of features modeled per time step.
In the above face mesh-based speech synthesis system, when synthesizing the mel spectrum output of time step t_k, the decoder receives the output of time step t_{k-1}, computes attention with the high-level lip movement features extracted by the encoder, extracts the preceding short-time information through two unidirectional LSTM networks, and uses a Linear Projection Layer to synthesize the mel spectrum output of the time step.
In the above face mesh-based speech synthesis system, random noise drawn from Uniform(−2, 2)·l_1 is added to the true mel spectrum used as the decoder input, where l_1 is the l_1 loss of the previous training round;
the mel spectrogram with added noise then has the same l_1 distance to the target as the predicted mel spectrogram: l_1(Target + Noise, Target) = l_1(Predict, Target).
In the above-mentioned face mesh-based speech synthesis system, correction and normalization processing are performed in the preprocessing stage.
In the above-mentioned face mesh-based speech synthesis system, the facial points are corrected by a general correction or spatial rotation algorithm.
Compared with the prior art, the face mesh-based speech synthesis system has the following advantages: 1. Accuracy can be improved, and lip reading can be performed directly from high-level features (lip movements). 2. The speech content of a target can be synthesized under silent surveillance. 3. Synthesizing the target's speech in a noisy environment can serve as a reference for noise-reduction tasks. 4. Speech can be synthesized for disabled people who have lost the ability to speak later in life.
Detailed Description
The present application will be described in further detail with reference to the following embodiments.
A new feature extraction encoder for Face Mesh Landmarks is presented herein; to our knowledge, Face Mesh Landmarks have rarely been used as an input modality in previous research. On this basis, a (model name) model is built, realizing for the first time a lip reading model that synthesizes natural speech audio from lip movements. The model has the potential to synthesize natural audio directly from lip motion capture.
An innovative Scheduled Sampling Method is used herein to force model alignment when the information weights of the encoder-decoder structure are unbalanced.
Most studies related to lip reading recognition [4,5,6] intuitively select video as the model input and predict what the speaker says from the lip motion video, i.e., video language prediction. Some works [7,8] select a CNN network to extract high-level lip features from the video data and predict the spoken content on that basis. With the prominence of the seq2seq model in other fields, some studies [9,10] combine the LSTM structure with the Transformer, taking greater account of the influence of temporal order on lip reading results. There are also works [11] that use speech recognition techniques to quickly label lip-reading content, because the datasets used for predicting lip-reading content from video require additional text labels. Research on generating natural speech from text has a very long history, and researchers have proposed many different technical approaches to this problem [12,13]. However, the naturalness of the audio synthesized by these methods is generally rather low compared with natural human pronunciation, and it is prone to the uncanny valley effect in usage scenarios such as intelligent voice assistants, resulting in a poor experience. End-to-end speech synthesis systems built on deep neural networks can synthesize natural audio that is much closer to human pronunciation. The Tacotron2 [14] model is an end-to-end speech synthesis system proposed by Google; it synthesizes the mel spectrum corresponding to a text in an autoregressive manner and synthesizes the audio waveform through an audio decoder such as WaveRNN [15] or WaveGlow [16]. To synthesize more realistic, natural audio, our model adopts a structure similar to the Tacotron2 model in the decoder part and also selects the mel spectrum as the intermediate feature. Experimental results show that the synthesized audio has high quality, and several indicators that measure audio quality are selected to evaluate the synthesized results. Prajwal et al. [17], inspired by Tacotron2, built the Lip2Wav model by adding a face encoder on top of Tacotron2, realizing the synthesis of natural speech from lip-reading video; this work greatly inspired us, and the Lip2Wav dataset they constructed also provides great assistance to the work herein. In contrast to selecting such low-level features as the model input, our work chooses to extract information directly from high-level features such as lip movements. In terms of application scenarios, our method can be applied to scenes where it is inconvenient to capture video with a camera; for example, for a speaker who has lost the ability to speak, lip movements could once again be captured with a mask-like device. For scenes where the video picture can be captured, the lip movement features can be extracted with a mature facial motion prediction model.
The video datasets prepared for the lip reading recognition task are numerous and can be broadly divided into two categories: (1) word-centric datasets, typically constructed by recording several everyday words read by different laboratory participants in a controlled environment, such as AVLetters [18], GRID [19], and OuluVS [20]; (2) conversation-centric multi-speaker datasets, derived from the broadcast content of a large number of television programs by intercepting shots in which the speaker can be captured, such as LRW [21], LRS2 [22], and LRS3 [23]. The construction of these datasets is based on the assumption that different people share commonalities in the lip movements for the same pronunciation, which suits the task of recognizing lip-read content. The task of synthesizing audio from lip reading, however, focuses more on the lip movement characteristics of a particular person speaking in a certain context; for example, when assisting an aphasic person, we would want the synthesized audio to have the timbre characteristics of that specific person, and these earlier datasets do not serve such tasks well. Prajwal et al. [17] constructed the Lip2Wav dataset by selecting videos of five different speakers from YouTube, 120 hours in total; the content recited by a single speaker shares the same context, which makes it convenient for a model to learn that speaker's language style and to synthesize audio content similar to the speaker's timbre and style. Researchers have since used this dataset in related research on synthesizing audio from video. We build a LipLandmark2Wav model that synthesizes audio from lip motion. The encoder part uses a 2D convolutional network to extract high-level lip movement features from the Landmarks. The decoder network follows the implementation of the decoder in [14], processing the encoder's results in the Attention module. The decoder predicts the mel spectrum in an autoregressive manner and synthesizes the audio from the mel spectrum.
Unlike most existing studies, our work aims to predict the mel spectrum S = (s_1, s_2, …, s_{t'}) of the corresponding audio segment from an ordered lip motion sequence L = (l_1, l_2, …, l_t). To our knowledge, studies that use a sequence of continuous actions as model input are not common, and we make some adjustments to the model in order to efficiently extract features from such an input modality and synthesize audio. From everyday experience, we can assume that the sound a speaker makes at a given moment is related to the lip movement at the same moment. However, since different phonemes can share the same viseme, determining the audio content at a given moment requires comprehensive consideration of the context information. Specifically, the speech time step s_{k'} at time k should be modeled according to the following equation:
s_{k'} = f(l_{k∈(k±δ)}, s_{<k'})
Since it is not clear how wide a range of lip movements affects the result at the current time step during an actual utterance, we do not impose additional restrictions on the lip movement information attended to by the decoder in the training phase; the attention is computed by the model through the Attention mechanism. The alignment of encoder and decoder time steps obtained in the experimental results shows that the model's attention to lip movement information over different time steps meets our expectations. Our model structure is shown in the figure. Our model takes Face Mesh Landmarks as input, and each Landmark contains three-dimensional spatial information. We sample Face Mesh Landmarks from the video dataset at a frequency of 30 frames per second and select the 80 Landmarks representing the lips as the model input. The final model accepts a matrix of dimension T×F, where T is the number of time steps (frames) of the Landmark sequence and F is the number of Landmarks selected per frame (80 when only the lip content is used). With reference to the Tacotron2 model, our model does not directly produce an audio waveform; rather, it produces mel spectrograms conditioned on the lip motion. We sample the original audio at 16 kHz and window it with a length of 50 ms, with the window advancing 12.5 ms per frame, i.e., a win-size of 800 and a hop-size of 200. This setting yields about 80 mel frames per second of audio, on the same order of magnitude as the 30 frames per second of video samples. The mel dimension is set to 80.
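As a purely illustrative aside (not part of the original disclosure), the window and hop settings above can be reproduced with a standard mel-spectrogram extractor; the librosa-based helper below is an assumption about tooling, not the implementation used in this application.

import librosa
import numpy as np

SR = 16_000                          # audio sample rate (Hz)
WIN_MS, HOP_MS = 50, 12.5            # window / hop lengths quoted in the text
WIN_SIZE = int(SR * WIN_MS / 1000)   # 800 samples
HOP_SIZE = int(SR * HOP_MS / 1000)   # 200 samples -> ~80 mel frames per second
N_MELS = 80                          # mel dimension used by the model

def extract_mel(wav_path: str) -> np.ndarray:
    """Return an (n_mels, T') log-mel spectrogram for one utterance."""
    audio, _ = librosa.load(wav_path, sr=SR)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=SR, n_fft=WIN_SIZE, win_length=WIN_SIZE,
        hop_length=HOP_SIZE, n_mels=N_MELS)
    return np.log(np.clip(mel, 1e-5, None))   # log compression for training stability

# 1 second of audio -> SR / HOP_SIZE = 80 mel frames, vs. 30 landmark frames of video.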
The model accepts the feature coordinates of Face Mesh Landmarks as input. The Encoder needs to learn how to extract fine-grained lip movement feature sequences from these spatio-temporal coordinate sequences. At the same time, we consider that the synthesized audio sequence and the lip motion sequence are highly correlated in the time dimension, so we do not want the Encoder to over-process the time dimension. Considerable research [24,25] has demonstrated the effectiveness of convolutional neural networks in tasks involving the time dimension, and the work of Hou et al. [26] proposed an end-to-end encoder-decoder style 3D convolutional neural network to encode and decode video signals for video object segmentation. Therefore, a convolutional network is also used in our work to encode the spatio-temporal lip motion Landmarks. Our model network accepts lip motion information of dimension F as input. Unlike approaches that process images, we characterize the Features of each Landmark with different convolution channels so that the convolution kernel can respond to all Features of a Landmark simultaneously. A multi-layer convolutional network is created in the encoder; each convolutional layer expands the number of feature channels, and residual connections and batch normalization are used between layers. The 3-dimensional coordinates are downsampled to 1-dimensional features in the last convolutional layer, while the time dimension is preserved throughout. The final output of the convolutional part is a tensor of dimension T×F, where F is the number of features modeled at each time step. In addition to the convolutional network, we also use an LSTM network to extract short-term features so that each time step input to the Decoder stage can contain a range of short-term context information. To synthesize smoother, more natural speech, our model follows an approach similar to that used in Tacotron2 by Jonathan Shen et al. [10]. Their research uses this decoder to synthesize the mel spectrum of natural human speech, which is consistent with our objective, so our work builds on that result.
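For illustration only, a minimal sketch of an encoder with this shape is given below, using the channel widths (120, 160, 240, 320) and the 384-dimensional hidden size quoted later in the text; the module name, kernel shapes, and layer arrangement are assumptions, and the residual connections mentioned above are omitted for brevity.

import torch
import torch.nn as nn

class LandmarkEncoder(nn.Module):
    """Conv stack over (x, y, z) landmark coordinates + LSTM for short-term context.
    Input: (batch, T, n_landmarks, 3)   Output: (batch, T, hidden)"""
    def __init__(self, n_landmarks=80, channels=(120, 160, 240, 320), hidden=384):
        super().__init__()
        layers, in_ch = [], n_landmarks
        for out_ch in channels:                      # each layer widens the channel count
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=(3, 1), padding=(1, 0)),
                       nn.BatchNorm2d(out_ch), nn.ReLU()]
            in_ch = out_ch
        # (residual connections from the description are omitted in this minimal sketch)
        self.convs = nn.Sequential(*layers)
        # last layer collapses the 3 coordinate dims to 1, keeping the time axis intact
        self.coord_pool = nn.Conv2d(in_ch, hidden, kernel_size=(1, 3))
        self.lstm = nn.LSTM(hidden, hidden // 2, bidirectional=True, batch_first=True)

    def forward(self, landmarks):
        x = landmarks.permute(0, 2, 1, 3)            # treat landmarks as conv channels
        x = self.convs(x)
        x = self.coord_pool(x).squeeze(-1)           # (B, hidden, T)
        x = x.transpose(1, 2)                        # (B, T, hidden)
        out, _ = self.lstm(x)
        return out

enc = LandmarkEncoder()
feats = enc(torch.randn(2, 90, 80, 3))               # 3 s of lip landmarks at 30 fps
print(feats.shape)                                    # torch.Size([2, 90, 384])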
The decoder is an autoregressive recurrent neural network; when synthesizing time step t_k, the decoder outputs the mel spectrum of that time step. In the TTS task, Tacotron2 also needs a Linear Projection to synthesize a stop token in order to avoid the decoder network looping without restriction, and this is preserved in subsequent studies using Tacotron2-like architectures, such as the work of Abdelaziz et al. [27]. However, in the task of synthesizing audio from lip motion, the length of the lip motion signal has a clear linear relationship with the length of the corresponding synthesized audio, so no additional prediction of a stop token is required.
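A single decoder step along these lines might be sketched as follows; the hidden sizes, the prenet, and the simple dot-product attention are assumptions (Tacotron2 itself uses location-sensitive attention), and this is not the patented network.

import torch
import torch.nn as nn

class MelDecoderStep(nn.Module):
    """One autoregressive step: previous mel frame + attention over encoder
    features -> two unidirectional LSTM cells -> linear projection to a mel frame."""
    def __init__(self, enc_dim=384, n_mels=80, hidden=512):
        super().__init__()
        self.prenet = nn.Sequential(nn.Linear(n_mels, 256), nn.ReLU())
        self.attn_query = nn.Linear(hidden, enc_dim)
        self.lstm1 = nn.LSTMCell(256 + enc_dim, hidden)
        self.lstm2 = nn.LSTMCell(hidden, hidden)
        self.proj = nn.Linear(hidden + enc_dim, n_mels)

    def forward(self, prev_mel, enc_out, state1, state2):
        # enc_out: (B, T_enc, enc_dim); state1/state2: (h, c) tuples for each LSTM cell
        query = self.attn_query(state1[0]).unsqueeze(1)             # (B, 1, enc_dim)
        scores = torch.softmax((query * enc_out).sum(-1), dim=-1)   # dot-product attention
        context = (scores.unsqueeze(-1) * enc_out).sum(1)           # (B, enc_dim)
        x = torch.cat([self.prenet(prev_mel), context], dim=-1)
        state1 = self.lstm1(x, state1)
        state2 = self.lstm2(state1[0], state2)
        mel = self.proj(torch.cat([state2[0], context], dim=-1))
        return mel, scores, state1, state2

step = MelDecoderStep()
h0 = (torch.zeros(2, 512), torch.zeros(2, 512))
mel, attn, s1, s2 = step(torch.zeros(2, 80), torch.randn(2, 90, 384), h0, h0)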
Since the model uses an autoregressive decoder, previous studies [28,29] mostly use Teacher Forcing so that the model can converge quickly, i.e., the real mel spectrogram of the previous time step is used as the input to the decoder at the next time step. However, in our study, since it is difficult for the encoder to extract accurate lip movement features in the early stage of training, the decoder quickly overfits to this input without considering the lip movement features, and the encoder cannot be sufficiently trained. For this purpose we use a Scheduled Sampling Method that differs from previous studies. To enable the model to better learn the relation between lip motion and the synthesized audio, we use Teacher Forcing at the beginning of training, i.e., the real mel spectrogram of the previous time step is used as the decoder input at the next time step. As training progresses, we gradually weaken Teacher Forcing and replace the real mel spectrogram with the mel spectrogram predicted by the decoder. Such a method works well in previous studies [17,30].
However, we found that even if Teacher Forcing is attenuated in subsequent training, the model is still prone to over-reliance on the input from the previous time step and pays insufficient attention to the lip motion information encoded by the encoder. To solve this problem, we add some random noise to the true mel spectrum fed to the decoder: Noise ∼ Uniform(−2, 2) · l_1,
where l_1 is the l_1 loss of the previous training round. In this way, the mel spectrogram with added noise has the same l_1 distance to the target as the predicted mel spectrogram: l_1(Target + Noise, Target) = l_1(Predict, Target).
Upon receiving the output of time step t_{k-1}, the decoder computes attention with the high-level lip movement features extracted by the encoder, extracts the preceding short-time information through two unidirectional LSTM networks, and uses a Linear Projection Layer to synthesize the mel spectrum of the current time step. To avoid failure to converge in the early stage of training due to an excessive l_1 loss, we propose to first use the real mel spectrum without noise for Teacher-Forced training and to add noise to the mel spectrum only after training reaches 20K steps. Experiments show that adding noise very effectively helps the model attend to lip movements; the related experimental results are shown later.
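The fragment below is an illustrative sketch of how the noisy teacher-forced input described above could be formed; the tensor names, the handling of the 20K-step threshold, and the sampling-probability schedule are assumptions beyond what the text states.

import torch

def decoder_input(prev_target, prev_pred, prev_l1_loss, step,
                  noise_start=20_000, teacher_forcing_prob=1.0):
    """Pick the previous-frame mel fed to the autoregressive decoder.

    prev_target / prev_pred: (batch, n_mels) ground-truth and predicted mel frames
    prev_l1_loss: scalar l1 loss from the previous training round
    """
    use_teacher = torch.rand(()) < teacher_forcing_prob    # scheduled sampling coin flip
    if not use_teacher:
        return prev_pred.detach()                           # feed back the model's own output
    frame = prev_target
    if step >= noise_start:
        # Noise ~ Uniform(-2, 2) * l1, so the noisy target sits at roughly the same
        # l1 distance from the target as the model's own predictions do.
        noise = (torch.rand_like(frame) * 4.0 - 2.0) * prev_l1_loss
        frame = frame + noise
    return frame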
Our model accepts the three-dimensional coordinates of facial points as input, and to our knowledge, the currently common lip-reading recognition datasets do not support such an input modality. To compare with other studies, we constructed a Pipeline that can extract the desired three-dimensional coordinates of facial points from other lip-reading recognition video datasets and perform correction and normalization preprocessing.
We select the Lip2Wav dataset constructed by Prajwal et al. [17] as the lip-reading recognition video dataset and use MediaPipe Face Mesh, open-sourced by Google, to recognize and label point by point the faces that appear in the video. This model is based on the work of Grishchenko et al. [31]; it can mark 478 different points on the whole face, of which 80 points represent lip details. Our work also compares the effect of using only lip points versus full-face points on the experimental results.
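A minimal landmark-extraction sketch, assuming the standard MediaPipe Python solution API and OpenCV for video decoding, might look as follows; the helper name is a placeholder, not the pipeline used in this application.

import cv2
import mediapipe as mp
import numpy as np

def video_to_landmarks(video_path: str) -> np.ndarray:
    """Return an array of shape (frames, 478, 3) of Face Mesh landmark coordinates."""
    mesh = mp.solutions.face_mesh.FaceMesh(
        static_image_mode=False, max_num_faces=1, refine_landmarks=True)
    frames = []
    cap = cv2.VideoCapture(video_path)
    while True:
        ok, bgr = cap.read()
        if not ok:
            break
        result = mesh.process(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB))
        if result.multi_face_landmarks:                 # skip frames with no detected face
            lm = result.multi_face_landmarks[0].landmark
            frames.append([[p.x, p.y, p.z] for p in lm])
    cap.release()
    mesh.close()
    return np.asarray(frames, dtype=np.float32)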
For the recognized Face Mesh Landmarks, correction and normalization are performed in the preprocessing stage to reduce the influence of the face's motion in three-dimensional space on the model.
It should be noted that our model is not tied to any particular method of capturing Face Mesh Landmarks; besides predicting Landmarks from video as used in our work, we can also envisage capturing Face Mesh Landmarks using Lidar or other custom devices. It can be expected that a more accurate characterization of facial motion would provide more accurate audio synthesis results.
Depending on the number of facial feature points selected, we can adjust the hidden dimension of the encoder. In our more typical experiments, lip features with 80 different points are used; on this basis, the hidden dimension of the encoder LSTM is set to 384, and in order to encode such dimensions, the convolutional layers of the convolutional network are set to 120, 160, 240 and 320 channels in order.
We used an Nvidia RTX 3090 graphics card for training and set the batch size to 32. The learning rate is increased linearly to 2×10^-3 over the first 40,000 training steps and gradually decays in subsequent steps; Adam [32] is chosen as the optimizer, and the model is trained for about 400K steps. More training parameters and details are presented in the code.
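One possible realization of this schedule is sketched below, assuming a linear warm-up over the first 40,000 steps; the shape of the subsequent decay is not specified in the text, so the inverse-square-root tail here is an assumption.

import torch

model = torch.nn.Linear(384, 80)            # stand-in for the full encoder-decoder
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3, betas=(0.9, 0.999))

WARMUP = 40_000                              # steps of linear warm-up to the peak lr of 2e-3

def lr_scale(step: int) -> float:
    if step < WARMUP:
        return (step + 1) / WARMUP           # linear ramp toward 1.0 (i.e. lr = 2e-3)
    return (WARMUP / (step + 1)) ** 0.5      # assumed gradual decay; shape not given in the text

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_scale)
# During training (batch size 32, ~400K steps): call optimizer.step() then scheduler.step()
# once per step; the scheduler multiplies the base lr of 2e-3 by lr_scale(step).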
We used a unified preprocessing pipeline in the experiments to process multiple sets of audiovisual data and trained with our proposed model. To compare with previous studies, we chose to examine the effect of the model in unconstrained and constrained environments respectively. To measure the quality of the synthesized audio, we use short-time objective intelligibility (STOI) [33], extended short-time objective intelligibility (ESTOI) [34], and the perceptual evaluation of speech quality (PESQ) [35] as evaluation criteria. These indicators are often regarded in previous work as important metrics for evaluating synthesized audio. Of course, whether the quality of synthesized speech sounds natural is a subjective judgment, and using only objective criteria is clearly not sufficient. We therefore also organized different participants to make subjective judgments on the quality of the synthesized audio and to score it. Unlike previous studies in which participants were only asked to listen to the audio when scoring, we asked the participants to rate while watching the video frames.
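For reference, the objective metrics above can be computed with the open-source pystoi and pesq packages, as in the sketch below; these particular libraries and the 16 kHz wideband setting are assumptions, not necessarily the tools used in this work.

import numpy as np
from pystoi import stoi          # pip install pystoi
from pesq import pesq            # pip install pesq

FS = 16_000                      # evaluation sample rate

def evaluate(reference: np.ndarray, synthesized: np.ndarray) -> dict:
    """Score a synthesized waveform against the ground-truth recording."""
    n = min(len(reference), len(synthesized))          # align lengths before scoring
    ref, syn = reference[:n], synthesized[:n]
    return {
        "STOI":  stoi(ref, syn, FS, extended=False),
        "ESTOI": stoi(ref, syn, FS, extended=True),
        "PESQ":  pesq(FS, ref, syn, "wb"),             # wideband mode for 16 kHz audio
    }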
We performed training experiments on the model; in this section we present some preliminary results from these experiments. During training, we record the attention values of the Attention module in the model's Decoder, which show, for a given decoder time step, how much attention is paid to each encoder time step. In theory, the audio at a certain moment is related to the lip movement at that same moment, so the attention map should present a diagonal alignment. If Teacher Forcing alone were used as the scheduled sampling method, as in previous studies, the model would fail to establish this alignment even by 10K steps.
The experimental results obtained by the current method are still not ideal, and the structure of the model has room for further improvement.
In the present experiment, if training is performed directly with the Teacher Forcing method, the decoder very easily fits directly to its own input and thus ignores the content of the encoder; the encoder is then not sufficiently trained, and it is difficult for the attention module of the decoder to establish the ideal alignment. If the Teacher Forcing method is not used in the initial stage of training, the attention module can establish the ideal alignment after a certain number of rounds, but the encoder easily overfits, and the decoder ignores the information of the preceding and following time steps when synthesizing audio, so overfitted results easily occur.
When speaking, a person's head often moves and rotates irregularly with the content of the speech, which can easily enlarge the spatial range of the data and even cause loss of lip data. In the previous study [17], normalization of the facial data was achieved mainly by cropping data frames based on face recognition software. Since the three-dimensional coordinates of the facial feature points can be obtained in our dataset, the facial feature points can be corrected by algorithms such as spatial rotation. Two different correction algorithms were designed in our work, and we compared the impact of using different correction algorithms on model training. General correction: the face is only translated to the center of the screen and scaled to an appropriate size. Rotation correction: the spatial normal vector of the face is computed and the face is rotated until this normal vector coincides with the normal vector of the camera; general correction is then applied.
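The two corrections can be sketched with numpy as follows; the three anchor-landmark indices and the assumed camera normal (0, 0, -1) are placeholders rather than values from this application.

import numpy as np

def general_correction(landmarks: np.ndarray) -> np.ndarray:
    """Translate the face to the origin and scale it to unit size.
    landmarks: (478, 3) array of one frame's Face Mesh coordinates."""
    centered = landmarks - landmarks.mean(axis=0)
    scale = np.abs(centered).max() + 1e-8
    return centered / scale

def rotation_correction(landmarks: np.ndarray,
                        anchors=(33, 263, 152)) -> np.ndarray:
    """Rotate the face so its normal vector matches the camera's viewing axis,
    then apply the general correction. The anchor indices are placeholders."""
    a, b, c = (landmarks[i] for i in anchors)
    normal = np.cross(b - a, c - a)
    normal /= np.linalg.norm(normal) + 1e-8
    target = np.array([0.0, 0.0, -1.0])             # assumed camera normal
    v = np.cross(normal, target)
    cos = float(np.dot(normal, target))
    vx = np.array([[0, -v[2], v[1]],
                   [v[2], 0, -v[0]],
                   [-v[1], v[0], 0]])                # Rodrigues' rotation formula
    R = np.eye(3) + vx + vx @ vx / (1.0 + cos + 1e-8)
    return general_correction(landmarks @ R.T)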
MediaPipe Face Mesh can mark a total of 478 different Landmarks on the full face, of which 80 data points characterize the lip region. Prajwal et al. [17] demonstrated in their work that the focus of a lip reading model falls mainly on the lip area; what, then, is the difference in the results between using full-face Landmarks and using only lip Landmarks?
We designed a controlled experiment to study the effect of using different types of Landmarks on model training. To accommodate different numbers of Landmarks, we also make some adjustments to the convolutional layers in the encoder. To compare the differences, we also tried using the work of Bulat et al. [36] as the facial feature point extractor. The experimental parameters are selected as shown in Table 1.
Table 1. The number of Landmarks and embedding dimensions used by the Encoder in the various experiments. In the first column, MP means Landmarks are predicted using MediaPipe and FA means they are predicted using Face Alignment; OL means only lip Landmarks are used, and WF means the whole face's Landmarks are used.
In this work, we studied a method for synthesizing corresponding audio based on lip movements. We used Face Mesh Landmarks as the model input and built the corresponding encoder based on the characteristics of the input data; to our knowledge, no previous related study has used Landmarks as the input for the lip reading recognition task. In our experiments, using Landmarks as a high-level feature representing lip motion can accelerate the model's fitting speed, but too fast a fitting speed also indirectly results in some parts of the encoder-decoder model being over-fitted while other parts are not adequately trained. For this problem, we designed corresponding ablation experiments in the hope of achieving more desirable results.
Our work opens up new directions for related research in lip reading recognition; for example, it may ultimately help to solve the cocktail party problem in environments where video is available. In addition, it can also be used to synthesize speech audio for disabled people who have lost the ability to speak later in life, making it easier for them to communicate with the people around them.
Although our study used Landmarks extracted from video, we also leave scalable space for the model. It is anticipated that researchers may attempt to obtain Landmark data in more accurate ways in the near future; for example, more accurate depth information could be obtained with the depth camera on an iPhone for face modeling. We believe that accurate lip motion data can help us explore some of the problems encountered in the experiments more deeply and may bring more useful insights to the field of lip reading recognition.
References
[1] Cappelletta, Luca, and Naomi Harte. "Phoneme-to-viseme mapping for visual speech recognition." ICPRAM (2). 2012.
[2] Shaikh, Ayaz A., et al. "Lip reading using optical flow and support vector machines." 2010 3rd International Congress on Image and Signal Processing. Vol. 1. IEEE, 2010.
[3] Chung, Joon Son, and A. P. Zisserman. "Lip reading in profile." (2017).
[4] Ma, Pingchuan, Stavros Petridis, and Maja Pantic. "Visual Speech Recognition for Multiple Languages in the Wild." arXiv preprint arXiv:2202.13084 (2022).
[5] Fung, Ivan, and Brian Mak. "End-to-end low-resource lip-reading with maxout CNN and LSTM." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
[6] NadeemHashmi, Saquib, et al. "A lip reading model using CNN with batch normalization." 2018 Eleventh International Conference on Contemporary Computing (IC3).
[7] Huang, Hongyang, et al. "A Novel Machine Lip Reading Model." Procedia Computer Science 199 (2022): 1432-1437.
[8] Wang, Huijuan, Gangqiang Pu, and Tingyu Chen. "A Lip Reading Method Based on 3D Convolutional Vision Transformer." IEEE Access 10 (2022): 77205-77212.
[9] Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "ASR is all you need: Cross-modal distillation for lip reading." ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020.
[10] Shen, Jonathan, et al. "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions." 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
[11] Kalchbrenner, Nal, et al. "Efficient neural audio synthesis." International Conference on Machine Learning. PMLR, 2018.
[12] Prenger, Ryan, Rafael Valle, and Bryan Catanzaro. "WaveGlow: A flow-based generative network for speech synthesis." ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.
[13] Prajwal, K. R., et al. "Learning individual speaking styles for accurate lip to speech synthesis." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
[14] Matthews, Iain, et al. "Extraction of visual features for lipreading." IEEE Transactions on Pattern Analysis and Machine Intelligence 24.2 (2002): 198-213.
[15] Cooke, Martin, et al. "An audio-visual corpus for speech perception and automatic speech recognition." The Journal of the Acoustical Society of America 120.5 (2006): 2421-2424.
[16] Zhao, Guoying, Mark Barnard, and Matti Pietikainen. "Lipreading with local spatiotemporal descriptors." IEEE Transactions on Multimedia 11.7 (2009): 1254-1265.
[17] Chung, Joon Son, and Andrew Zisserman. "Lip reading in the wild." Asian Conference on Computer Vision. Springer, Cham, 2016.
[18] Son Chung, Joon, et al. "Lip reading sentences in the wild." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
[19] Afouras, Triantafyllos, et al. "Deep audio-visual speech recognition." IEEE Transactions on Pattern Analysis and Machine Intelligence (2018).
[20] Bai, Shaojie, J. Zico Kolter, and Vladlen Koltun. "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling." arXiv preprint arXiv:1803.01271 (2018).
[21] Xu, Kai, et al. "Spatiotemporal CNN for video object segmentation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.
[22] Hou, Rui, et al. "An efficient 3D CNN for action/object segmentation in video." arXiv preprint arXiv:1907.08895 (2019).
[23] Abdelaziz, Ahmed Hussen, et al. "Audiovisual Speech Synthesis using Tacotron2." arXiv preprint arXiv:2008.00620 (2020).
[24] Toomarian, Nikzad, and Jacob Bahren. "Fast temporal neural learning using teacher forcing." (1995).
[25] Mihaylova, Tsvetomila, and André F. T. Martins. "Scheduled sampling for transformers." arXiv preprint arXiv:1906.07651 (2019).
[26] Afouras, Triantafyllos, Joon Son Chung, and Andrew Zisserman. "Deep lip reading: a comparison of models and an online application." arXiv preprint arXiv:1806.06053 (2018).
[27] Grishchenko, Ivan, et al. "Attention mesh: High-fidelity face mesh prediction in real-time." arXiv preprint arXiv:2006.10962 (2020).
[28] Kingma, Diederik P., and Jimmy Ba. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980 (2014).
[29] Taal, Cees H., et al. "A short-time objective intelligibility measure for time-frequency weighted noisy speech." 2010 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2010.
[30] Jensen, Jesper, and Cees H. Taal. "An algorithm for predicting the intelligibility of speech masked by modulated noise maskers." IEEE/ACM Transactions on Audio, Speech, and Language Processing 24.11 (2016): 2009-2022.
[31] Rix, Antony W., et al. "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs." 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No. 01CH37221). Vol. 2. IEEE, 2001.
[32] Bulat, Adrian, and Georgios Tzimiropoulos. "How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks)." Proceedings of the IEEE International Conference on Computer Vision. 2017.
[33] Fan, Yuchen, et al. "TTS synthesis with bidirectional LSTM based recurrent neural networks." Fifteenth Annual Conference of the International Speech Communication Association. 2014.
[34] Agiomyrgiannakis, Yannis. "Vocaine the vocoder and applications in speech synthesis." 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015.
[35] Thangthai, Kwanchiva, Helen L. Bear, and Richard Harvey. "Comparing phonemes and visemes with DNN-based lipreading." arXiv preprint arXiv:1805.02924 (2018).
[36] Lan, Yuxuan, et al. "Improving visual features for lip-reading." Auditory-Visual Speech Processing 2010. 2010.
The specific embodiments described herein are offered by way of example only to illustrate the spirit of the application. Those skilled in the art may make various modifications or additions to the described embodiments or substitutions thereof without departing from the spirit of the application or exceeding the scope of the application as defined in the accompanying claims.

Claims (10)

1. A face mesh-based speech synthesis system, comprising the steps of:
S1, constructing a lip movement model: high-level lip movement features are extracted from the video data by an encoder;
S2, video speech recognition: a video is selected as the model input, and the content spoken by the speaker is predicted from the lip motion video to form text;
S3, text-to-speech generation: the mel spectrum corresponding to the text is synthesized in an autoregressive manner, and the audio waveform is synthesized by an audio decoder.
2. The face mesh-based speech synthesis system of claim 1, wherein in step S1, the encoder uses a 2D convolutional network to extract high-level lip movement features from the Landmarks.
3. The face mesh-based speech synthesis system of claim 2, wherein in step S1, a Lip2Wav dataset is constructed by selecting videos of several different speakers from a video website, with a total length of 120 hours;
in step S3, the content recited by a single speaker has the same context, and audio content similar to the speaker's timbre and style is synthesized.
4. The face mesh-based speech synthesis system according to claim 3, wherein in step S3, the mel spectrum S = (s_1, s_2, …, s_{t'}) of the corresponding audio segment is predicted from the ordered lip motion sequence L = (l_1, l_2, …, l_t);
the speech time step s_{k'} at time k should be modeled according to the following equation:
s_{k'} = f(l_{k∈(k±δ)}, s_{<k'}).
5. The face mesh-based speech synthesis system of claim 1, wherein a multi-layer convolutional network is created in the encoder, each convolutional layer expands the number of feature channels, and the network employs residual connections and batch normalization.
6. The face mesh-based speech synthesis system of claim 1, wherein the convolutional network encodes the spatio-temporal lip motion Landmarks, and the model network accepts lip motion information of dimension F as input;
the Features of each Landmark are characterized by different convolution channels;
a multi-layer convolutional network is established in the encoder, each convolutional layer expands the number of feature channels, and residual connections and batch normalization are used between layers;
the three-dimensional coordinates are downsampled to one-dimensional features in the final convolutional layer, while the time dimension is preserved throughout; the final output of the convolutional component is a tensor of dimension T×F, where T is the number of time steps and F is the number of features modeled per time step.
7. The face mesh-based speech synthesis system of claim 1, wherein in step S3, when synthesizing the mel spectrum output of time step t_k, the decoder receives the output of time step t_{k-1}, computes attention with the high-level lip movement features extracted by the encoder, extracts the preceding short-time information through two unidirectional LSTM networks, and uses a Linear Projection Layer to synthesize the mel spectrum output of the time step.
8. The face mesh-based speech synthesis system of claim 1, wherein random noise is added to the true mel spectrum used as the decoder input, where l_1 is the l_1 loss of the previous training round;
the mel spectrogram with added noise has the same l_1 distance to the target as the predicted mel spectrogram: l_1(Target + Noise, Target) = l_1(Predict, Target).
9. The face mesh-based speech synthesis system according to claim 1, wherein in step S2, correction and normalization processing is performed in a preprocessing stage.
10. The face mesh-based speech synthesis system according to claim 1, wherein in step S2, the facial points are corrected by a general correction or a spatial rotation algorithm.
CN202310176960.2A 2023-02-28 2023-02-28 Speech synthesis system based on face grid Pending CN116825083A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310176960.2A CN116825083A (en) 2023-02-28 2023-02-28 Speech synthesis system based on face grid

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310176960.2A CN116825083A (en) 2023-02-28 2023-02-28 Speech synthesis system based on face grid

Publications (1)

Publication Number Publication Date
CN116825083A true CN116825083A (en) 2023-09-29

Family

ID=88115557

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310176960.2A Pending CN116825083A (en) 2023-02-28 2023-02-28 Speech synthesis system based on face grid

Country Status (1)

Country Link
CN (1) CN116825083A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117292672A (en) * 2023-11-27 2023-12-26 厦门大学 High-quality speech synthesis method based on correction flow model
CN117292672B (en) * 2023-11-27 2024-01-30 厦门大学 High-quality speech synthesis method based on correction flow model

Similar Documents

Publication Publication Date Title
Prajwal et al. Learning individual speaking styles for accurate lip to speech synthesis
CN109308731B (en) Speech driving lip-shaped synchronous face video synthesis algorithm of cascade convolution LSTM
Chen Audiovisual speech processing
US10375534B2 (en) Video transmission and sharing over ultra-low bitrate wireless communication channel
CN116250036A (en) System and method for synthesizing photo-level realistic video of speech
Fu et al. Audio/visual mapping with cross-modal hidden Markov models
Taylor et al. Audio-to-visual speech conversion using deep neural networks
CN113516990A (en) Voice enhancement method, method for training neural network and related equipment
CN116051692B (en) Three-dimensional digital human face animation generation method based on voice driving
CN117275452A (en) Speech synthesis system based on face grid
CN116825083A (en) Speech synthesis system based on face grid
Hassid et al. More than words: In-the-wild visually-driven prosody for text-to-speech
Rothkrantz Lip-reading by surveillance cameras
KR102319753B1 (en) Method and apparatus for producing video contents based on deep learning
Kakumanu et al. Speech driven facial animation
CN117409121A (en) Fine granularity emotion control speaker face video generation method, system, equipment and medium based on audio frequency and single image driving
Hong et al. When hearing the voice, who will come to your mind
Tao et al. Realistic visual speech synthesis based on hybrid concatenation method
Liu et al. Real-time speech-driven animation of expressive talking faces
Um et al. Facetron: A Multi-speaker Face-to-Speech Model based on Cross-Modal Latent Representations
CN114360491A (en) Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium
Zhou et al. Multimodal voice conversion under adverse environment using a deep convolutional neural network
CN114492579A (en) Emotion recognition method, camera device, emotion recognition device and storage device
Barve et al. Synchronized Speech and Video Synthesis
Kakumanu et al. A comparison of acoustic coding models for speech-driven facial animation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination