CN114793300A - Virtual video customer service robot synthesis method and system based on generative adversarial network - Google Patents


Info

Publication number
CN114793300A
CN114793300A (application CN202110097183.3A)
Authority
CN
China
Prior art keywords
video
customer service
service robot
synthesis
virtual video
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110097183.3A
Other languages
Chinese (zh)
Inventor
张轩宇
王逸超
刘昱麟
朱鹏飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202110097183.3A priority Critical patent/CN114793300A/en
Publication of CN114793300A publication Critical patent/CN114793300A/en
Pending legal-status Critical Current

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/8146Monomedia components thereof involving graphical data, e.g. 3D object, 2D graphics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Graphics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to the technical field of face video synthesis, and discloses a virtual video customer service robot synthesis method and system based on a generative adversarial network (GAN). The innovation of the method and system lies in providing two schemes for synthesizing a virtual video customer service robot, so that users can choose between them according to their needs. The synthesis schemes support multiple languages, free choice of the customer service persona, and a variety of application scenarios, and incorporate the speaker's emotion into the video synthesis process, giving the synthesized speech and video good realism. A Web-based system integrates the modules, allowing users to log in to a website, upload audio and video materials, synthesize them online, and quickly produce videos in batches.

Description

Virtual video customer service robot synthesis method and system based on a generative adversarial network
Technical Field
The invention relates to the technical field of face video synthesis, and in particular to a method and system for synthesizing a virtual video customer service robot based on a generative adversarial network.
Background
Face video synthesis is an emerging and challenging problem in computer vision, and virtual video robots built on this technology are attracting growing attention. A virtual video customer service robot comprises modules such as lip-shape generation, expression generation, and speech synthesis, and aims to faithfully reproduce a person's lip movement, voice, and facial expression while speaking.
Inspired by the success of deep learning in computer vision, deep-learning-based face video synthesis has achieved excellent performance and good visual quality. Several influential benchmark datasets, such as GRID [1], TIMIT [2], and LRW [3], have been proposed in this field. These datasets provide a large number of paired audio-video samples and have greatly advanced face video synthesis. Building on them, many strong algorithms have emerged, such as ObamaNet [4], LipGAN [5], ExprGAN [6], and Wav2Lip [7]. Taking LipGAN as an example: audio and video features are extracted by the encoder-decoder structure of the generator in a generative adversarial network, and a discriminator compares the generated video with the real video, enabling end-to-end training and strong performance on both static images and dynamic videos. These algorithms have played an important role in advancing face video synthesis. In recent years, companies such as Baidu, Sohu, and iFlytek have designed virtual video robots based on face video synthesis to carry out simple tasks such as news broadcasting and customer service, promoting the practical deployment of artificial intelligence.
The prior art has the following defects and shortcomings:
Most existing virtual video customer service synthesis methods and systems cannot achieve realistic, reliable, integrated synthesis from text to video. Specifically: lip shape and speech are not well aligned, the speaker's language cannot be switched according to user needs, and facial expressions and speech intonation cannot be generated to match the emotion of the spoken text. Although such systems provide basic video customer service, they fall short of the way real people speak, and the traces of manual processing are obvious.
Disclosure of Invention
To address the defects of the prior art, the invention provides a method and system for synthesizing a virtual video customer service robot based on a generative adversarial network. Its innovation lies in offering two synthesis schemes, so that users can choose between them according to their needs; the schemes support synthesis in multiple languages, free choice of the customer service persona, and a variety of application scenarios, and incorporate the speaker's emotion into the video synthesis process, giving the result good realism. A Web-based system is integrated, allowing users to log in to a website, upload audio and video materials, synthesize them online, and quickly produce videos in batches.
To achieve these aims, the invention provides the following technical scheme: a virtual video customer service robot synthesis method and system based on a generative adversarial network, comprising a lip-shape generator module, an expression generator module, a text emotion analysis module, and a text-to-speech synthesis module.
The virtual video customer service robot synthesis method based on a generative adversarial network comprises the following steps:
Step one: collect 1000 fifteen-second segments of China Central Television news broadcast video as a corresponding Chinese corpus-video dataset. Train Wav2Lip and First Order Motion models on this dataset so that they better fit the characteristics of Chinese pronunciation, and use them as the lip-shape generator.
Step two: train an ExprGAN model on the Oulu-CASIA NIR & VIS facial expression dataset as the expression generator, train a bidirectional LSTM model as the text emotion analysis module, and call the Baidu TTS interface to synthesize speech with emotion.
Step three: integrate the four modules and develop for the Web. Build the front end with the VUE framework, build the back end by wrapping the model interfaces with Python's flask and django packages, and use nginx as a reverse proxy to integrate a virtual video customer service robot synthesis website and platform supporting both schemes.
Step four: the user selects one of the two synthesis schemes according to their needs.
Step five: the user logs in to the website and submits the source materials to synthesize the virtual customer service face video.
The scheme of step one is migration synthesis; it is better suited to scenarios with high requirements on lip alignment and produces clear, realistic face video.
The scheme of step two is text synthesis; it is better suited to large-scale commercial scenarios, directly synthesizing realistic lip shapes, expressions, and voice from text, with good temporal stability, fast synthesis, and a lifelike result.
Further, if the user selects the scheme of step one, a source video in which the corresponding text is read aloud and a persona picture of the customer service avatar must be provided to the platform server.
Further, if the user selects the scheme of step two, a video representing the virtual customer service persona and the text to be read by the customer service must be provided to the platform server.
The Wav2Lip model extracts features from consecutive video frames and audio, introduces a synthesis loss, and synthesizes smooth lip-movement video through a generative adversarial network. The First Order Motion model animates still pictures without any labels or prior information: after training on a set of videos depicting facial features, the model can be used for lip migration.
The ExprGAN model is an expression editing algorithm with controllable expression intensity: it can change a facial image into a target expression in multiple styles, with continuously controllable intensity. The bidirectional LSTM model is used for text emotion analysis; it handles degree words better and captures bidirectional semantic dependencies.
Further, if the user selects the scheme of step one, a video with accurate lip movement and natural facial expression is generated directly by the trained First Order Motion model.
Further, if the user selects the scheme of step two, the text is input into the emotion analysis module to determine the corresponding emotion; TTS is called to generate audio with matching intonation; the video is input into the lip-shape generator and combined with the TTS audio to synthesize lip movement; finally, the video is input into the expression generator, which adjusts the facial expression according to the analyzed emotion to produce the result.
The virtual video customer service robot synthesis system based on a generative adversarial network comprises: a cloud server, a memory, a processor, and a computer program stored on the memory and executable on the processor; when executing the program, the processor implements the integration scheme and the method steps above.
Compared with the prior art, the virtual video customer service robot synthesis method and system based on a generative adversarial network provide the following beneficial effects:
1. Two schemes for synthesizing a virtual video customer service robot are provided: one records a video of the corresponding text being read aloud in advance and migrates the facial features onto a persona picture of the virtual customer service; the other takes text and a persona video as input and directly synthesizes a virtual customer service video with emotion and realistic lip movement.
2. The synthesis schemes let users synthesize multiple languages, freely choose the customer service persona, and apply the system to various scenarios; incorporating the speaker's emotion into video synthesis gives the method good realism and extensibility.
3. A Web-based system is integrated, allowing users to log in to a website, upload audio and video materials, synthesize them online, and quickly produce video customer service robots in batches.
Drawings
FIG. 1 is a flow chart of the virtual video customer service robot synthesis method based on a generative adversarial network according to the present invention;
FIG. 2 is a schematic diagram of the overall Wav2Lip network structure according to the second embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIGS. 1-2, a virtual video customer service robot synthesis system based on a generative adversarial network includes a lip-shape generator module, an expression generator module, a text emotion analysis module, and a text-to-speech synthesis module.
Example 1:
the embodiment of the invention provides a virtual video customer service robot synthesis method based on a generated countermeasure network, which comprises the following steps:
101: the you-get tool was used to collect 1000 central station news simulcast videos of different people as the corresponding chinese corpus-video dataset, and arranged in the format of the LRS2 dataset.
Further, the mpeg tool is used for extracting audio from the video, the audio file is converted into Mel blocks through the python library librosa for network reading, the video is cut into MP4 format files with the resolution of 256 × 256 and the time length of 15 seconds, and the data set is preprocessed.
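A minimal sketch of the video-trimming step above (file names are hypothetical; this only builds the ffmpeg command, assuming the ffmpeg CLI is installed and would be invoked via `subprocess.run(cmd)`):

```python
# Build the ffmpeg command used to trim a source clip to 15 seconds and
# rescale it to a 256x256 MP4, as described above. File names are hypothetical.
def ffmpeg_clip_command(src, dst, duration=15, size=256):
    return [
        "ffmpeg", "-i", src,
        "-t", str(duration),            # clip duration in seconds
        "-vf", f"scale={size}:{size}",  # rescale to size x size
        "-y", dst,                      # overwrite output if it exists
    ]

cmd = ffmpeg_clip_command("news_0001.mp4", "clip_0001.mp4")
```

The mel-spectrogram extraction would then be done per clip with librosa, as the text describes.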
102: the Wav2Lip network model was trained on the collected chinese dataset. The model can extract the mapping relation between the sound and the lip shape through a face decoder and an audio decoder to generate a synthetic lip shape, and continuously corrects the synthetic effect through a pre-trained lip shape synthetic discriminator and a visual effect discriminator jointly trained with the generator to be used as the lip shape generator of the scheme in the step two.
During specific implementation, training is carried out on a pre-training model of the original network, so that the network can give consideration to the characteristics of Chinese pronunciation on the basis of keeping the original performance, and the lip synthesis effect is improved.
103: train three further models, First Order Motion, ExprGAN, and Bi-LSTM, to serve respectively as the lip synthesizer of scheme one, the expression generator of scheme two, and the text emotion analysis module.
104: call the Baidu speech synthesis API and integrate the trained models of both schemes. Build the front end with the vue framework, wrap the model interfaces with Python's flask and django packages to build the back end, and set up a Web-based website.
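The website setup in step 104, together with the nginx reverse proxy mentioned in step three, might look like the following minimal configuration sketch (server name, ports, and paths are illustrative assumptions, not taken from the patent):

```nginx
# Hypothetical reverse-proxy configuration: nginx serves the VUE front end
# and forwards API requests to the flask/django back end.
server {
    listen 80;
    server_name example.com;

    location / {
        root /var/www/frontend/dist;       # built VUE front end
        try_files $uri $uri/ /index.html;
    }

    location /api/ {
        proxy_pass http://127.0.0.1:5000/; # flask/django model back end
        proxy_set_header Host $host;
    }
}
```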
105: the user selects the scheme of step one or step two as required. If the scheme of step one is selected, a pre-recorded reading video and a persona picture of the virtual customer service must be prepared; if the scheme of step two is selected, a persona video (or image) of the virtual customer service and the text the robot should read must be prepared.
106: the user logs in to the website and submits the materials to obtain the synthesis result.
In summary, the invention provides a virtual video customer service robot synthesis method based on a generative adversarial network. Its innovation lies in the two synthesis schemes, which users can choose between according to their needs; the schemes support synthesis in different languages, free choice of the customer service persona, and application to various scenarios, and incorporate the speaker's emotion into the video synthesis process, giving good realism. A Web-based system is integrated, allowing users to log in to a website, upload audio and video materials, and quickly synthesize them in batches.
Example 2:
the scheme in example 1 is further described below with reference to specific examples and calculation formulas, which are described in detail below:
I. Data preparation
The invention uses the you-get tool to collect 1000 segments of China Central Television news broadcast video featuring different speakers as the corresponding Chinese corpus-video dataset, organized in the format of the LRS2 dataset. The dataset is then preprocessed with the ffmpeg and librosa tools.
The dataset consists of paired audio and video. The video part comprises broadcasts by 5 different male and 5 different female presenters, at 25 fps, cropped to 256 × 256, 25 seconds long, in MP4 format; the audio part is the mel spectrogram extracted from each video, so that the network can read the sound information directly.
II. Model training
The invention comprises four modules: a lip-shape generation module, an expression generation module, a text emotion analysis module, and a speech synthesis module, detailed as follows.
(1) A lip shape generating module:
the lip generating module in the scheme of the First step adopts a First Order Motion model, and does not need to use any label or prior information to carry out animation processing on the picture. The model is trained on a set of videos depicting facial features, after which the model can be used for lip migration. The specific implementation is to separate the appearance information from the motion information by using a method for generating a confrontation network. In order to support the robustness of the model to complex motion, the model extracts face key points and local affine transformation in a source video, and a generator network models the motion of a target object, namely extracts static appearance information from a source image and combines the static appearance information with motion information obtained from a driving video to obtain a synthetic video.
And step two, the Lip generating module in the scheme adopts a Wav2Lip model, and the model consists of a generator and two discriminators. The generator may be divided into a face information encoder, a voice information encoder, and a face information decoder. The face information encoder consists of a series of jump-connected residual convolution blocks. It masks lip information of a set of random video frames R as prior pose P, and concatenates with R by channel number as encoder input. The encoder extracts lip information in the input as a facial feature map for decoding and reconstruction of a subsequent network; the voice coder is coded by a series of two-dimensional convolution blocks, extracts voice information input into a Mel block S, and then cascades the voice information with a facial feature map; the facial information decoder decodes the features encoded by the two encoders, and reconstructs lip video matched with audio through a series of up-sampling and deconvolution operations, wherein the specific lip reconstruction L1 loss is as follows:
$$L_{recon} = \frac{1}{N}\sum_{i=1}^{N} \left\lVert L_g^{(i)} - L_G^{(i)} \right\rVert_1$$
where $L_g$ is the lip frame reconstructed by the generator, $L_G$ is the ground-truth frame, and $N$ is the number of input frames.
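As an illustration, the reconstruction loss above can be sketched in plain Python (flat lists of pixel values stand in for image tensors; a real implementation would operate on batched tensors in a deep-learning framework):

```python
def lip_reconstruction_l1(generated, real):
    # Mean absolute (L1) error between generated and ground-truth lip pixels,
    # mirroring the L_recon term above. Here the mean is taken per pixel over
    # flat lists; the paper's formulation averages over N frames.
    assert len(generated) == len(real)
    return sum(abs(g - r) for g, r in zip(generated, real)) / len(generated)
```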
The lip-sync discriminator penalizes lip generation that is out of sync with the audio. The generated video frames are concatenated along the time dimension and fed into the pre-trained lip-sync discriminator, which examines the lower half of each generated face; training minimizes the synchronization loss:
$$E_{sync} = -\frac{1}{N}\sum_{i=1}^{N} \log P_{sync}^{(i)}$$
where $P_{sync}^{(i)}$ is the discriminator's estimated probability that the generated lips in sample $i$ are synchronized with the audio.
the weight of the lip-shaped synchronous generator is kept unchanged in the training process of the GAN network, the lip-shaped synchronous generator has 91% accuracy rate for judging whether lip-shaped audio is synchronous or not, and the training of the generator can be well restrained.
A visual-quality discriminator is trained jointly with the generator network to penalize the generation of distorted faces. The discriminator $D$ consists of a series of convolution blocks, each containing one convolutional layer and one Leaky ReLU activation layer. During discriminator training, the network minimizes the loss $L_{disc}$, as follows:
$$L_{disc} = \mathbb{E}_{x \sim L_G}\left[\log D(x)\right] + \mathbb{E}_{x \sim L_g}\left[\log\left(1 - D(x)\right)\right]$$
Finally, the total loss of the network is
$$L_{total} = (1 - s_w - s_g)\, L_{recon} + s_w\, E_{sync} + s_g\, L_{gen}$$
where $s_w$ and $s_g$ are preset weights, $L_{recon}$ is the reconstruction loss, $E_{sync}$ is the synchronization loss, and $L_{gen}$ is the generator's adversarial loss.
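The synchronization and total-loss terms above can be sketched numerically as follows (a plain-Python illustration under the stated weighting; scalar loss values stand in for the batched tensor computation):

```python
import math

def sync_loss(p_sync):
    # E_sync: mean negative log of the lip-sync discriminator's per-sample
    # probability that lip movement and audio are in sync.
    return -sum(math.log(p) for p in p_sync) / len(p_sync)

def total_loss(l_recon, e_sync, l_gen, s_w, s_g):
    # Weighted combination of the three terms, as in the total-loss equation:
    # (1 - s_w - s_g) * L_recon + s_w * E_sync + s_g * L_gen
    return (1 - s_w - s_g) * l_recon + s_w * e_sync + s_g * l_gen
```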
(2) The expression generation module:
the expression generation module is composed of an ExprGAN model. ExprGAN is an expression editing algorithm capable of controlling expression intensity, a facial image can be changed into a target expression with various styles, and the expression intensity can be continuously controlled. The generator of ExprGAN consists of an encoder, whose input is the face image, and a decoder, whose output is the reconstructed image; the discriminator of ExprGAN is used to constrain the intensity and authenticity of the expression. The whole network can be divided into three phases: a controller learning stage, an image reconstruction stage and an image refinement stage. And generating the facial video with the specified expression through three stages.
(3) The text emotion analysis module:
the text sentiment analysis is used for analyzing the sentiment tendency of the sentence, and the two-way LSTM model is used for analyzing the text sentiment, so that words can be better processed, and two-way semantic dependence can be captured. The bidirectional LSTM model is synthesized by a forward LSTM model and a backward LSTM model. The LSTM model consists of an input word at the time t, a cell state, a temporary cell state, a hidden layer state, a forgetting gate, a memory gate and an output gate. The calculation process can be summarized as that the gating cell state is used for forgetting and memorizing new information, so that information useful for calculation at the subsequent moment is transmitted, and useless information is discarded; the hidden layer state and new input of the previous step participate in the operation of each step, and the forgetting and memorizing content of each step is determined. The required emotional tendency judgment can be obtained by synthesizing the forward LSTM and the backward LSTM, namely splicing the obtained output results of the hidden layer states of the two LSTMs. The emotion is divided into six emotions, namely neutral, happy, angry, hurry, surprise and fear.
(4) A speech synthesis module:
the invention calls a TTS interface of hundred degrees at the module. The technology can well complete Chinese voice synthesis, rhythm processing can naturally process problems of text such as punctuation, polyphones and the like, the effect is vivid, and the whole system can be well served.
III. Model integration
If the user selects the scheme of step one, a video with accurate lip movement and natural facial expression is generated directly by the trained First Order Motion model. If the user selects the scheme of step two, the text is input into the emotion analysis module to determine the corresponding emotion; TTS is called to generate audio with matching intonation; the video is input into the lip-shape generator and combined with the TTS audio to synthesize lip-movement video synchronized with the sound; finally, the video is input into the expression generator, which adjusts the facial expression according to the analyzed emotion to produce the result.
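The step-two pipeline described above can be sketched as a composition of the four modules. The callables and their signatures here are hypothetical stand-ins, since the patent does not fix the module interfaces:

```python
def synthesize_text_scheme(text, avatar_video, *,
                           analyze_emotion, tts, lip_generator, expression_generator):
    # 1) analyze the text's emotion, 2) synthesize emotional speech,
    # 3) drive lip movement from the audio, 4) adjust the facial expression.
    emotion = analyze_emotion(text)
    audio = tts(text, emotion)
    lip_video = lip_generator(avatar_video, audio)
    return expression_generator(lip_video, emotion)

# Stub modules standing in for Bi-LSTM, Baidu TTS, Wav2Lip, and ExprGAN.
result = synthesize_text_scheme(
    "hello", "avatar.mp4",
    analyze_emotion=lambda t: "happy",
    tts=lambda t, e: f"audio[{t}/{e}]",
    lip_generator=lambda v, a: f"lip[{v}+{a}]",
    expression_generator=lambda v, e: f"expr[{v}/{e}]",
)
```

Passing the modules in as callables keeps the orchestration independent of any one model implementation, matching the modular structure the patent describes.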
This embodiment has three key innovations:
First, two schemes for synthesizing a virtual video customer service robot are provided.
Technical effect: the scheme of step one is migration synthesis, with a lifelike result, suited to scenarios with high demands on video realism; the scheme of step two is text synthesis, which quickly synthesizes realistic lip shapes, expressions, and voice from text in one step, with good temporal stability, and is better suited to large-scale commercial scenarios.
Second, a method is provided that gives synthesized Chinese video accurate lip movement and natural expression.
Technical effect: the trained model performs excellently; the lip-sync error LSE-D drops from 10.33 to 6.39, the lip-sync confidence LSE-C rises from 3.199 to 7.789, and the visual quality improves from 3.91 to 4.12. The model also moves from purely synthesizing lip shapes to jointly synthesizing expressions.
Third, a system integrating virtual video customer service robot synthesis is provided.
Technical effect: the four modules are integrated into one system with a website, realizing one-stop synthesis for both schemes.
In conclusion, the invention realizes virtual video customer service robot synthesis through four modules and two schemes; it drives lip shapes accurately, synthesizes expressions and speech naturally, and achieves a good visual effect. The integrated system also lets users rapidly produce virtual video customer service robots in batches.
Example 3:
the embodiment of the invention can be used in the generation of the virtual video customer service and can also be used in the following application scenes.
If a historical figure and a static picture are used for completing specific actions such as singing, festival blessing and the like, if a problem corpus is imported in advance, the system of the virtual video customer service robot can be applied to a campus greeting robot, a psychological consultation robot and the like, students and the robots can realize real face-to-face communication, and better human-computer interaction is realized.
Example 4:
a virtual video customer service robot composition system based on a generation countermeasure network, the system comprising: a website domain name, a cloud server, a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor executing the program to perform the method steps in embodiments 1 and 2.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (11)

1. A virtual video customer service robot synthesis method and system based on a generative adversarial network, characterized in that: the virtual video customer service robot synthesis system based on a generative adversarial network comprises a lip-shape generator module, an expression generator module, a text emotion analysis module, and a text-to-speech synthesis module.
2. The method and system for synthesizing a virtual video customer service robot based on a generative adversarial network according to claim 1, characterized in that: the synthesis method comprises the following steps:
Step 1: collect 1,000 fifteen-second clips of China Central Television (CCTV) news broadcast video as a paired Chinese corpus-video data set, and train Wav2Lip and the First Order Motion Model on this data set so that they better fit the characteristics of Chinese pronunciation, using the result as the lip-shape generator;
Step 2: train an ExprGAN model on the Oulu-CASIA NIR&VIS facial expression data set as the expression generator, train a bidirectional LSTM model as the text emotion analysis module, and call the Baidu TTS interface to synthesize emotional speech;
Step 3: integrate the four modules; build the front end with the Vue framework for Web-based deployment, build the back end by wrapping the interfaces with Python's Flask and Django packages, and use nginx as a reverse proxy, thereby assembling a virtual video customer service robot synthesis website and platform that offers two synthesis schemes;
Step 4: the user selects one of the two synthesis schemes according to his or her needs;
Step 5: log in to the website and submit the source material to synthesize the face video of the virtual customer service agent.
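Step 3 above describes serving a Vue front end and Flask/Django back ends behind an nginx reverse proxy. The fragment below is a minimal, hypothetical sketch of such a configuration; the domain name, file paths, route prefixes, and ports are illustrative assumptions, not values from the patent.

```nginx
# Hypothetical nginx reverse-proxy sketch for the two-scheme synthesis website.
server {
    listen 80;
    server_name example-virtual-agent.cn;   # placeholder domain

    # Vue front end served as prebuilt static files
    location / {
        root /var/www/synthesis-frontend/dist;
        try_files $uri $uri/ /index.html;
    }

    # Flask API for the migration-synthesis scheme (assumed local port)
    location /api/migration/ {
        proxy_pass http://127.0.0.1:5000/;
        proxy_set_header Host $host;
    }

    # Django API for the text-synthesis scheme (assumed local port)
    location /api/text/ {
        proxy_pass http://127.0.0.1:8000/;
        proxy_set_header Host $host;
    }
}
```

A setup along these lines lets the static front end and the two Python back ends share a single public origin, which also avoids browser cross-origin restrictions between the Vue pages and the synthesis APIs.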
3. The method and system for synthesizing a virtual video customer service robot based on a generative adversarial network according to claim 2, characterized in that: the scheme of Step 1 is migration synthesis, which is better suited to scenarios with strict lip-alignment requirements and can render the face video clearly and realistically.
4. The method and system for synthesizing a virtual video customer service robot based on a generative adversarial network according to claim 2, characterized in that: the scheme of Step 2 is text synthesis, which is better suited to large-scale commercial application scenarios and can synthesize realistic lip shapes, expressions, and voice directly from text; the synthesized video exhibits good temporal stability, fast synthesis, and a lifelike effect.
5. The method and system for synthesizing a virtual video customer service robot based on a generative adversarial network according to claim 2, characterized in that: if the user selects the scheme of Step 1, the user must provide the platform server with a source video in which the corresponding text has been read aloud in advance, together with the video customer service persona.
6. The method and system for synthesizing a virtual video customer service robot based on a generative adversarial network according to claim 2, characterized in that: if the user selects the scheme of Step 2, the user provides the platform server with any video representing the virtual customer service persona and with the text to be read by the customer service agent.
7. The method and system for synthesizing a virtual video customer service robot based on a generative adversarial network according to claim 2, characterized in that: the Wav2Lip model extracts features from the video and audio of consecutive frames and introduces a synthesis loss so that the adversarial network synthesizes lip-motion video with good smoothness; the First Order Motion Model animates a still picture without using any labels or prior information, i.e., it is trained on a set of videos depicting facial features, so that the model can be used for lip migration.
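The synchronization objective behind claim 7 can be illustrated with a simplified, stdlib-only sketch: score how well each video-frame embedding matches its audio embedding, then penalize low scores with a cross-entropy term. Wav2Lip itself uses a pretrained expert network over learned CNN embeddings; the toy list-based embeddings and the `sync_probability` clamp below are illustrative assumptions, not the patent's implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (plain Python lists)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def sync_probability(video_emb, audio_emb):
    """Map similarity to a positive 'in sync' score; the floor avoids log(0)."""
    return max(cosine_similarity(video_emb, audio_emb), 1e-7)

def sync_loss(video_embs, audio_embs):
    """Average negative log 'in sync' score over the frame windows."""
    probs = [sync_probability(v, a) for v, a in zip(video_embs, audio_embs)]
    return -sum(math.log(p) for p in probs) / len(probs)

# Toy check: identical embeddings are maximally in sync, so the loss vanishes.
v = [[0.2, 0.9, 0.1], [0.5, 0.5, 0.7]]
print(sync_loss(v, v) < 1e-9)  # True
```

Driving this loss down during training pushes the generator toward lip motion that the sync scorer judges consistent with the audio, which is the smoothness/synchrony property the claim describes.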
8. The method and system for synthesizing a virtual video customer service robot based on a generative adversarial network according to claim 2, characterized in that: the ExprGAN model is an expression editing algorithm with controllable expression intensity, which can transform a facial image into target expressions of various styles while continuously controlling the expression intensity; the bidirectional LSTM model analyzes the emotion of the text, handling degree words better and capturing bidirectional semantic dependencies.
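Claim 8 motivates the bidirectional LSTM by its handling of degree words and two-way semantic dependence. The toy lexicon scorer below is not an LSTM; it only illustrates why a sentiment word's score must be conditioned on neighboring degree words and negators, which is the kind of context a bidirectional recurrent model carries in its hidden states. The lexicon, modifier weights, and function names are invented for illustration.

```python
# Invented toy lexicon; a trained BiLSTM learns these interactions from data.
LEXICON = {"good": 1.0, "bad": -1.0, "helpful": 1.0, "slow": -0.5}
MODIFIERS = {"very": 2.0, "slightly": 0.5}   # degree words scale the next word
NEGATORS = {"not"}                           # negators flip the next word

def score_sentence(tokens):
    """Score each token, then let modifiers/negators rescale their neighbor."""
    scores = [LEXICON.get(t, 0.0) for t in tokens]
    for i, t in enumerate(tokens):
        if t in MODIFIERS and i + 1 < len(tokens):
            scores[i + 1] *= MODIFIERS[t]
        if t in NEGATORS and i + 1 < len(tokens):
            scores[i + 1] *= -1.0
    return sum(scores)

print(score_sentence("the agent was very helpful".split()))  # 2.0
print(score_sentence("the agent was not helpful".split()))   # -1.0
```

A unidirectional model reading left to right sees "very" or "not" before the word it modifies and must remember it; a bidirectional LSTM reads both ways, so each position directly sees its left and right context.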
9. The method and system for synthesizing a virtual video customer service robot based on a generative adversarial network according to claim 2, characterized in that: if the user selects the scheme of Step 1, a video with accurate lip motion and natural facial expressions is generated directly by the trained First Order Motion model.
10. The method and system for synthesizing a virtual video customer service robot based on a generative adversarial network according to claim 2, characterized in that: if the user selects the scheme of Step 2, the text is input to the emotion analysis module to determine the corresponding emotion; audio with the corresponding vocal tone is generated by calling the TTS; the video is input to the lip-shape generator and, together with the TTS-generated audio, synthesized into a video with lip motion; this video is then input to the expression generator, which adjusts the facial expression according to the analyzed emotion to obtain the final result.
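The data flow of claim 10 can be sketched end to end with stub components. Every class and method name below is a hypothetical stand-in for the trained modules (the bidirectional LSTM analyzer, the Baidu TTS call, the Wav2Lip lip generator, and ExprGAN); the stubs only show the order in which the modules are wired together, not their real behavior.

```python
class EmotionAnalyzer:            # stand-in for the bidirectional LSTM module
    def analyze(self, text):
        return "happy" if "thank" in text.lower() else "neutral"

class TextToSpeech:               # stand-in for the Baidu TTS interface
    def synthesize(self, text, emotion):
        return {"text": text, "emotion": emotion, "kind": "audio"}

class LipGenerator:               # stand-in for Wav2Lip / First Order Motion
    def generate(self, video, audio):
        return {"video": video, "audio": audio, "kind": "lip_video"}

class ExpressionGenerator:        # stand-in for ExprGAN
    def apply(self, lip_video, emotion):
        return {"clip": lip_video, "expression": emotion, "kind": "final_video"}

def synthesize_customer_service_video(text, persona_video):
    """Claim-10 order: emotion -> TTS audio -> lip video -> expression edit."""
    emotion = EmotionAnalyzer().analyze(text)
    audio = TextToSpeech().synthesize(text, emotion)
    lip_video = LipGenerator().generate(persona_video, audio)
    return ExpressionGenerator().apply(lip_video, emotion)

result = synthesize_customer_service_video("Thank you for waiting!", "persona.mp4")
print(result["expression"])  # happy
```

Note that the emotion computed in the first stage is consumed twice, by the TTS (vocal tone) and by the expression generator (facial expression), which is why the claim analyzes the text before any audio or video synthesis.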
11. The method and system for synthesizing a virtual video customer service robot based on a generative adversarial network according to claim 2, characterized in that: the virtual video customer service robot synthesis system based on the generative adversarial network comprises the following devices: a cloud server, a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the integration scheme and the method steps when executing the program.
CN202110097183.3A 2021-01-25 2021-01-25 Virtual video customer service robot synthesis method and system based on generation countermeasure network Pending CN114793300A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110097183.3A CN114793300A (en) 2021-01-25 2021-01-25 Virtual video customer service robot synthesis method and system based on generation countermeasure network

Publications (1)

Publication Number Publication Date
CN114793300A true CN114793300A (en) 2022-07-26

Family

ID=82460250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110097183.3A Pending CN114793300A (en) 2021-01-25 2021-01-25 Virtual video customer service robot synthesis method and system based on generation countermeasure network

Country Status (1)

Country Link
CN (1) CN114793300A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115511704A (en) * 2022-11-22 2022-12-23 成都新希望金融信息有限公司 Virtual customer service generation method and device, electronic equipment and storage medium
CN116668611A (en) * 2023-07-27 2023-08-29 小哆智能科技(北京)有限公司 Virtual digital human lip synchronization method and system

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106485774A (en) * 2016-12-30 2017-03-08 当家移动绿色互联网技术集团有限公司 Expression based on voice Real Time Drive person model and the method for attitude
CN106653052A (en) * 2016-12-29 2017-05-10 Tcl集团股份有限公司 Virtual human face animation generation method and device
CN109118562A (en) * 2018-08-31 2019-01-01 百度在线网络技术(北京)有限公司 Explanation video creating method, device and the terminal of virtual image
CN110381266A (en) * 2019-07-31 2019-10-25 百度在线网络技术(北京)有限公司 A kind of video generation method, device and terminal
US20200051303A1 (en) * 2018-08-13 2020-02-13 Pinscreen, Inc. Real-time avatars using dynamic textures
CN110941954A (en) * 2019-12-04 2020-03-31 深圳追一科技有限公司 Text broadcasting method and device, electronic equipment and storage medium
CN111027425A (en) * 2019-11-28 2020-04-17 深圳市木愚科技有限公司 Intelligent expression synthesis feedback interaction system and method
CN111145282A (en) * 2019-12-12 2020-05-12 科大讯飞股份有限公司 Virtual image synthesis method and device, electronic equipment and storage medium
CN111325817A (en) * 2020-02-04 2020-06-23 清华珠三角研究院 Virtual character scene video generation method, terminal device and medium
CN111401101A (en) * 2018-12-29 2020-07-10 上海智臻智能网络科技股份有限公司 Video generation system based on portrait
CN112116592A (en) * 2020-11-19 2020-12-22 北京瑞莱智慧科技有限公司 Image detection method, training method, device and medium of image detection model
CN112184858A (en) * 2020-09-01 2021-01-05 魔珐(上海)信息科技有限公司 Virtual object animation generation method and device based on text, storage medium and terminal

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ALIAKSANDR SIAROHIN et al.: "First Order Motion Model for Image Animation", 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), pages 1-4 *
HUI DING et al.: "ExprGAN: Facial Expression Editing with Controllable Expression Intensity", arXiv, page 1 *
PRAJWAL K. R. et al.: "A Lip Sync Expert Is All You Need for Speech to Lip Generation in the Wild", arXiv, pages 1-3 *
GAO Xiang; HUANG Faxiu; LIU Chunping; CHEN Hu: "Real-time facial expression transfer method combining 3DMM and GAN", Computer Applications and Software, no. 04 *


Similar Documents

Publication Publication Date Title
Cao et al. Expressive speech-driven facial animation
CA2285158C (en) A method and an apparatus for the animation, driven by an audio signal, of a synthesised model of human face
US6766299B1 (en) Speech-controlled animation system
CN112562720A (en) Lip-synchronization video generation method, device, equipment and storage medium
CN113192161A (en) Virtual human image video generation method, system, device and storage medium
KR102035596B1 (en) System and method for automatically generating virtual character's facial animation based on artificial intelligence
US20120130717A1 (en) Real-time Animation for an Expressive Avatar
Cosatto et al. Lifelike talking faces for interactive services
US11847726B2 (en) Method for outputting blend shape value, storage medium, and electronic device
US20030163315A1 (en) Method and system for generating caricaturized talking heads
CN110880198A (en) Animation generation method and device
CN114793300A (en) Virtual video customer service robot synthesis method and system based on generation countermeasure network
CN115511994A (en) Method for quickly cloning real person into two-dimensional virtual digital person
CN116597857A (en) Method, system, device and storage medium for driving image by voice
CN113395569A (en) Video generation method and device
CN116582726A (en) Video generation method, device, electronic equipment and storage medium
CN114155321B (en) Face animation generation method based on self-supervision and mixed density network
CN113221840B (en) Portrait video processing method
Perng et al. Image talk: a real time synthetic talking head using one single image with chinese text-to-speech capability
JP2003132363A (en) Animation producing system
Wolfe et al. Exploring localization for mouthings in sign language avatars
JP3368739B2 (en) Animation production system
CN116561294A (en) Sign language video generation method and device, computer equipment and storage medium
Kolivand et al. Realistic lip syncing for virtual character using common viseme set
Loh et al. Accuracy performance and potentiality of real-time avatar lip sync animation in different languages

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination