CN112820265B - Speech synthesis model training method and related device - Google Patents

Speech synthesis model training method and related device

Info

Publication number
CN112820265B
CN112820265B (application CN202010960441.1A)
Authority
CN
China
Prior art keywords
role
audio
response
voice
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010960441.1A
Other languages
Chinese (zh)
Other versions
CN112820265A (en)
Inventor
廖锡光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010960441.1A priority Critical patent/CN112820265B/en
Publication of CN112820265A publication Critical patent/CN112820265A/en
Application granted granted Critical
Publication of CN112820265B publication Critical patent/CN112820265B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04 - Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047 - Architecture of speech synthesisers

Abstract

The embodiment of the application discloses a speech synthesis model training method and a related device. An audio and video work is collected, and an audio and video clip corresponding to a first response role is extracted from the audio and video work. The audio corresponding to the first response role and the text corresponding to the audio are identified according to the audio and video clip, and a speech synthesis model corresponding to the first response role is then obtained by training on the audio and its corresponding text. Voice interaction with the user can thus be carried out in the voice of the first response role, which makes the voice interaction more interesting. Because the speech synthesis model corresponding to each first response role is obtained by training with audio and video works as the audio source, no dubbing personnel or stars need to be invited to record audio in advance, which reduces the generation cost of the speech synthesis models used in voice interaction and improves the efficiency of model generation.

Description

Speech synthesis model training method and related device
Technical Field
The application relates to the field of artificial intelligence, in particular to a voice synthesis model training method and a related device.
Background
With the development of artificial intelligence technology, intelligent voice devices, such as smart phones, smart speakers, chat robots, etc., are increasingly being used by a wide variety of users. The user can interact with the intelligent voice equipment through voice, so that the intelligent voice equipment can respond according to the voice sent by the user.
In order to enrich voice interaction and make it more vivid and interesting, the user can customize a response role, so that the intelligent voice device interacts with the user in the voice of the customized response role and the user feels as if talking with that response role. At present, the voice of an intelligent voice device is produced by having dubbing personnel or stars record audio for standard texts in advance; model training is then performed on that audio to obtain a speech synthesis model of that person (namely, of a response role), so that the speech synthesis model of the person customized by the user can be used to synthesize speech and converse with the user.
However, this approach requires inviting dubbing personnel or stars to record audio in advance in order to train the speech synthesis model, which makes model generation costly and inefficient.
Disclosure of Invention
In order to solve the technical problems, the application provides a voice synthesis model training method and a related device, which reduce the generation cost of a voice synthesis model used in voice interaction and improve the generation efficiency of the model.
The embodiment of the application discloses the following technical scheme:
in a first aspect, an embodiment of the present application provides a method for training a speech synthesis model, where the method includes:
collecting audio and video works;
extracting an audio and video clip corresponding to the first response role from the audio and video work;
identifying the audio corresponding to the first response role and the text corresponding to the audio according to the audio-video fragment;
and training according to the audio and the text corresponding to the audio to obtain a voice synthesis model corresponding to the first response role.
In a second aspect, an embodiment of the present application provides a speech synthesis model training apparatus, where the apparatus includes a collecting unit, an extracting unit, a recognizing unit, and a training unit:
the collecting unit is used for collecting the audio and video works;
the extraction unit is used for extracting an audio and video fragment corresponding to the first response role from the audio and video work;
the identification unit is used for identifying the audio corresponding to the first response role and the text corresponding to the audio according to the audio-video fragment;
and the training unit is used for training and obtaining a voice synthesis model corresponding to the first response role according to the audio and the text corresponding to the audio.
In a third aspect, an embodiment of the present application provides an apparatus for training a speech synthesis model, the apparatus comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of the first aspect according to instructions in the program code.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium storing program code for performing the method of the first aspect.
According to the technical scheme, when the voice synthesis model is trained, the existing audio and video works are used as training samples, namely the audio and video works are collected, and the audio and video segments corresponding to the first response role are extracted from the audio and video works. And identifying the audio corresponding to the first response role and the text corresponding to the audio according to the audio and video fragment, and further training according to the text corresponding to the audio and the audio to obtain a voice synthesis model corresponding to the first response role. The voice interaction with the user can be carried out through the voice of the first response role, and the interestingness of the voice interaction is improved. Because the voice synthesis model corresponding to each first response role is obtained by training by taking the audio and video works as audio sources, no dubbing personnel or stars are required to be invited to record audio in advance, the generation cost of the voice synthesis model used in voice interaction is reduced, and the generation efficiency of the model is improved.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions of the prior art, the drawings required in the description of the embodiments or of the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the application, and that a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a system architecture of a training method for a speech synthesis model according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for training a speech synthesis model according to an embodiment of the present application;
fig. 3 is a schematic diagram of a module structure of voice interaction according to an embodiment of the present application;
FIG. 4 is a flowchart of a voice interaction method according to an embodiment of the present application;
FIG. 5 is a flowchart of a voice interaction method according to an embodiment of the present application;
FIG. 6 is a block diagram of a speech synthesis model training device according to an embodiment of the present application;
fig. 7 is a block diagram of a terminal device according to an embodiment of the present application;
fig. 8 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
At present, in order to enrich voice interaction and make it more vivid and interesting, the voice of an intelligent voice device is produced by having dubbing personnel or stars record audio for standard texts in advance; model training is then performed to obtain a speech synthesis model of that person (namely, of a response role), so that the user-customized speech synthesis model of the person can be used to synthesize speech and converse with the user, giving the user the feeling of talking with the star.
However, different users have different preferences: some users like star A, some like star B, and some may like the role played by star C in a movie. To meet the requirements of different users as far as possible, numerous stars would need to be invited to record audio in advance, so that a large amount of audio data is available to train speech synthesis models corresponding to the different characters.
However, this approach requires inviting dubbing personnel or stars to record audio in advance, so the speech synthesis models used in voice interaction are costly and inefficient to generate.
In order to solve the technical problems, the embodiment of the application provides a voice interaction method, wherein a voice synthesis model used in the voice interaction method is obtained by training with an audio-video work as an audio source, so that dubbing personnel or stars are not required to be invited to record audio in advance, the generation cost of the voice synthesis model used in voice interaction is reduced, and the generation efficiency of the model is improved.
In addition, all audio and video works can be used as audio sources for training the speech synthesis model, the audio sources are rich, and the training of the rich speech synthesis model is facilitated.
Embodiments of the present application may relate to the field of artificial intelligence (Artificial Intelligence, AI). Artificial intelligence is a theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain the best results.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
In embodiments of the present application, the artificial intelligence techniques that may be involved include computer vision (image), speech technology, natural language processing, machine learning, and the like. Computer Vision (CV) is the science of how to make machines "see"; more specifically, it uses cameras and computers instead of human eyes to identify and measure targets, and performs further graphic processing so that the resulting images are more suitable for human observation or for transmission to instruments for detection.
For example, the embodiment of the application can extract the audio and video clips corresponding to the response role from the audio and video works through the image recognition (Image recognition, IR) technology in the computer vision technology, and further extract the audio of the response role and the text corresponding to the audio.
The key technologies of Speech Technology are speech recognition technology and speech synthesis technology (Text To Speech, TTS). Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and voice is expected to become one of the most promising modes of human-computer interaction. For example, voiceprint recognition technology can be used to recognize which user input the voice information, speech recognition technology can be used to recognize the content of the voice information, and speech synthesis technology can be used to generate the response voice corresponding to the voice information.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like; for example, question answering technology can respond to the user's voice information to generate a response voice.
Machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It studies how computers simulate or implement human learning behaviour to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way of giving computers intelligence; it is applied throughout all areas of artificial intelligence. Machine learning typically includes deep learning (Deep Learning) techniques, which include artificial neural networks such as convolutional neural networks (Convolutional Neural Network, CNN), recurrent neural networks (Recurrent Neural Network, RNN), and deep neural networks (Deep Neural Network, DNN). In this embodiment, the speech synthesis model may be obtained through training by means of machine learning, so that in the voice interaction process the speech synthesis model is used to generate the response voice for the voice information.
Referring to fig. 1, fig. 1 is a schematic diagram of a system architecture of a speech synthesis model training method according to an embodiment of the present application. The system architecture comprises a terminal device 101 and a server 102, wherein the terminal device 101 can interact with a user in voice, and when the user inputs a certain voice, the terminal device 101 can respond to the voice, so that the user can talk with the terminal device 101.
The terminal device 101 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a smart watch, a smart speaker, a smart television, or the like having a smart voice assistant.
The server 102 may be used to store a large number of audio and video works as well as the speech synthesis models corresponding to different response roles. Of course, the server 102 may also obtain a large number of audio and video works from other servers. The server 102 may be an independent physical server, or a server cluster or distributed system formed by a plurality of physical servers; the terminal device 101 and the server 102 may be directly or indirectly connected through wired or wireless communication, which is not limited in the present application. The server 102 may collect audio and video works. The audio and video works may be film and television works, animation works, musical works, and the like, and include at least one role that can serve as a first response role.
The server 102 extracts an audio and video clip corresponding to the first response role from the audio and video work, for example the audio and video clips of character A in "×××". The server 102 identifies the audio corresponding to the first response role and the text corresponding to the audio according to the audio and video clip, and then trains on the audio and its corresponding text to obtain the speech synthesis model corresponding to the first response role.
The trained speech synthesis model may be stored in the server 102, and when the user performs speech interaction with the terminal device 101, the server 102 may invoke the speech synthesis model according to the speech information sent by the terminal device 101, so as to simulate the sound of the first response character to generate the response speech.
In the system architecture shown in fig. 1, the server 102 is used to execute the above-mentioned speech synthesis model training method, and the response speech generated by calling the speech synthesis model training is sent to the terminal device 101, so that the terminal device 101 sends out the response speech. Of course, the server 102 may transmit the speech synthesis model to the terminal device 101, and the terminal device 101 may generate the response speech using the speech synthesis model, so that the terminal device 101 transmits the response speech. In some cases, the terminal device 101 may also implement a voice synthesis model training method, store some corresponding relations between the response roles and the voice synthesis model, thereby generating response voices by the terminal device 101 itself, and issue the response voices. The embodiment of the present application does not limit the execution subject of the speech synthesis model training method, and the system architecture shown in fig. 1 is only an example and does not limit the present application.
Next, a method for training a speech synthesis model provided by the embodiment of the present application will be described in detail with reference to the accompanying drawings, using a server as an execution body.
Referring to fig. 2, fig. 2 is a flowchart of a method for training a speech synthesis model according to an embodiment of the present application, where the method includes:
s201, collecting audio and video works.
There are a large number of audio and video works on the network, such as film and television works and animation works. These works contain many characters, and as the works are played, users come to like many of these characters and hope to interact with them.
In order to enable a user to generate the feeling of interacting with a role in an audio-video work, the embodiment of the application can take the role in the audio-video work as a first response role to interact with the user. This requires training a speech synthesis model from the audio-visual work corresponding to the first response character.
Therefore, the server can collect a large number of audio and video works, and train the voice synthesis model of the first response role by taking the audio and video works as training samples.
S202, extracting an audio and video clip corresponding to the first response role from the audio and video work.
Because a first response role does not appear throughout the whole audio and video work, some segments do not include the first response role. Therefore, the server can extract the audio and video clip corresponding to the first response role from the audio and video work.

For example, character A in a television play "×" may be used as a first response role; when the speech synthesis model corresponding to character A needs to be trained, the audio and video clips of character A can be extracted.

The audio and video clips may be extracted by manually intercepting the clips of the first response role, or by identifying, through image recognition, the audio and video clips in which the first response role is speaking.
In some cases, because any character in the audio and video work may be customized by a user as a response role, a speech synthesis model may be trained for every character in the audio and video work, i.e., every character in the work is treated as a first response role.

In some cases, the speaking durations of different roles differ: some roles are leading roles with long speaking durations, while others may speak for only a very short time. Therefore, in order to ensure the accuracy of model training, audio and video clip extraction can be performed only for roles whose speaking duration reaches a preset threshold, so as to train the corresponding speech synthesis models.

Because an audio and video work includes a large number of roles, there may still be many roles whose speaking duration reaches the preset threshold. Some of them are roles that audiences deeply like and pay attention to and are very likely to be customized as response roles by users, while others are hardly ever customized as response roles. Users discuss the roles of an audio and video work on the network, i.e. produce comment information. Therefore, comment information for the audio and video work can be obtained; the comment information reflects the popularity of each role, that is, how much users like it. The higher a role's popularity (the better its evaluation), the more easily it is customized as a response role by users. Accordingly, a role whose evaluation is higher than a preset threshold is determined from the audio and video work as a first response role according to the comment information, as illustrated in the sketch below.
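As a rough illustration only (not part of the patent), the selection described above could be sketched in Python as follows; the field names, thresholds, and the way the popularity score is derived from comment information are all assumptions of this sketch:

```python
from dataclasses import dataclass

@dataclass
class RoleStats:
    role_id: str
    speaking_seconds: float   # total time the role is detected speaking in the work
    popularity: float         # score in [0, 1] derived from comment information

def select_first_response_roles(stats, min_speaking_seconds=600.0, min_popularity=0.7):
    """Return the roles whose speaking duration and evaluation both reach the thresholds."""
    long_enough = [s for s in stats if s.speaking_seconds >= min_speaking_seconds]
    return [s.role_id for s in long_enough if s.popularity >= min_popularity]

# Example: role A speaks a lot and is well liked; role C is liked but barely speaks.
stats = [
    RoleStats("role_A", 5400.0, 0.92),
    RoleStats("role_B", 1800.0, 0.35),
    RoleStats("role_C", 40.0, 0.88),
]
print(select_first_response_roles(stats))  # ['role_A']
```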
S203, identifying the audio corresponding to the first response role and the text corresponding to the audio according to the audio-video fragment.
In this embodiment, the text corresponding to the audio may be obtained by directly recognizing the subtitle text in the audio and video clip through image recognition, or by recognizing the text corresponding to the audio through a speech recognition model.

For example, if character A is a first response role, the audio of character A and the text corresponding to that audio are identified in the audio and video clips corresponding to character A, and the speech synthesis model of character A is then obtained by training on the identified audio and its corresponding text.
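A minimal sketch of this pairing step is given below, assuming injected helpers: the subtitle reader, the speech recognizer, and the audio extractor are passed in as callables because the patent does not name concrete components; only the pairing logic is shown.

```python
from typing import Callable, Optional

def build_training_pairs(clip_paths: list,
                         extract_audio: Callable[[str], bytes],
                         read_subtitles: Optional[Callable[[str], str]] = None,
                         recognize_speech: Optional[Callable[[bytes], str]] = None):
    """Return (audio, text) training pairs for one first response role."""
    pairs = []
    for path in clip_paths:
        audio = extract_audio(path)
        # Route 1: image recognition of the subtitle text in the clip.
        # Route 2: a speech recognition model applied to the extracted audio.
        text = read_subtitles(path) if read_subtitles else recognize_speech(audio)
        if text:                      # drop clips with no usable text
            pairs.append((audio, text))
    return pairs
```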
S204, training according to the audio and the text corresponding to the audio to obtain a voice synthesis model corresponding to the first response role.
The audio of the response role and the text corresponding to the audio are recognized from the audio and video clip and used as training data; the speech synthesis model is then trained on the audio and its corresponding text to obtain the speech synthesis model of the response role. The process of S202-S204 can be represented by the model training module in Fig. 3.

The speech synthesis model may be any of various neural network models, such as a CNN model or a DNN model.
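For orientation only, a toy PyTorch training loop for one first response role is sketched below. The architecture (embedding, LSTM, linear projection to mel frames), the assumption that targets are aligned to the token length, and all hyper-parameters are illustrative; a real TTS model (for example a Tacotron-style model) would be considerably larger.

```python
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    def __init__(self, vocab_size, mel_dim=80, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.to_mel = nn.Linear(hidden, mel_dim)

    def forward(self, token_ids):            # (batch, text_len) -> (batch, text_len, mel_dim)
        x = self.embed(token_ids)
        x, _ = self.encoder(x)
        return self.to_mel(x)

def train_role_model(batches, vocab_size, epochs=10):
    """batches: list of (token_ids, mel_targets) tensor pairs of matching length."""
    model = TinyTTS(vocab_size)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for token_ids, mel_targets in batches:
            opt.zero_grad()
            loss = loss_fn(model(token_ids), mel_targets)
            loss.backward()
            opt.step()
    return model   # one speech synthesis model per first response role
```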
According to the technical scheme, when the voice synthesis model is trained, the existing audio and video works are used as training samples, namely the audio and video works are collected, and the audio and video segments corresponding to the first response role are extracted from the audio and video works. And identifying the audio corresponding to the first response role and the text corresponding to the audio according to the audio and video fragment, and further training according to the text corresponding to the audio and the audio to obtain a voice synthesis model corresponding to the first response role. The voice interaction with the user can be carried out through the voice of the first response role, and the interestingness of the voice interaction is improved. Because the voice synthesis model corresponding to each first response role is obtained by training by taking the audio and video works as audio sources, no dubbing personnel or stars are required to be invited to record audio in advance, the generation cost of the voice synthesis model used in voice interaction is reduced, and the generation efficiency of the model is improved.
Based on the speech synthesis model obtained by the foregoing training, a method for performing speech interaction using the speech synthesis model will be described next. Referring to fig. 4, fig. 4 is a flowchart of a voice interaction method according to an embodiment of the present application, where the method includes:
s401, voice information input by a user is acquired.
In order to enrich voice interaction and make it more vivid and interesting, the user can customize a response role on the terminal device, so that the terminal device interacts with the user in the voice of the customized response role and the user feels as if talking with that response role.

For example, if the user likes character A in a film or television work, the user can set character A as the response role on the terminal device, so that the terminal device responds to the voice information input by the user with the sound of character A.
When the user wants to customize a response role for the terminal device, namely the intelligent voice device, the intelligent voice device can receive the user's custom request and send it to the server; the custom request includes a role identifier, and the server sets the response role of the intelligent voice device according to the role identifier.
In the embodiment of the present application, the custom request may be triggered in several ways. In one way, the terminal device provides a candidate list of response roles, and the user selects a favorite response role from the list, thereby triggering the custom request. Alternatively, the user directly inputs a favorite response role on the terminal device, thereby triggering the custom request.
It should be noted that, when the response role is customized, the role attributes of the response role may also be customized, as in the role attribute module shown in Fig. 3 (a schematic diagram of the module structure of voice interaction). The role attributes may include, for example, the nickname of the role, the age of the role, and the hobbies of the role. Therefore, in one possible implementation, the custom request may further include attribute information of the response role, in which case setting the response role of the intelligent voice device according to the role identifier consists in setting the response role of the intelligent voice device and its corresponding role attributes according to the role identifier and the attribute information.
Of course, in some cases, the role attribute may be set after the response role is set, and at this time, after the response role of the intelligent voice device is set according to the role identifier, attribute information of the response role may also be obtained, so as to set the role attribute corresponding to the response role.
In this embodiment, the user may increase the interest by customizing the character attribute of the response character.
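A rough sketch of the role attribute and custom request data described above follows; the field names and the device_config dictionary are assumptions of this illustration, not structures defined by the patent.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RoleAttributes:
    nickname: Optional[str] = None
    age: Optional[int] = None
    hobbies: list = field(default_factory=list)

@dataclass
class CustomRequest:
    role_id: str                               # identifies the chosen response role
    attributes: Optional[RoleAttributes] = None
    voiceprint: Optional[bytes] = None         # registered when several users share one device

def set_response_role(device_config, request):
    """Store the response role and, if present, its role attributes for the device."""
    device_config["response_role"] = request.role_id
    if request.attributes is not None:
        device_config["role_attributes"] = request.attributes

# Example: customize character A with the nickname "boss".
config = {}
set_response_role(config, CustomRequest("role_A", RoleAttributes(nickname="boss")))
```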
It should be noted that in some scenarios one terminal device, i.e. one intelligent voice device, may be shared by multiple users. Because different users have different preferences, the response roles they customize may differ, which meets personalized requirements. For example, in a home scenario, the wife selects character A in the television play "×××" as her response role, while another member of the family selects character B in the television play "×××" as the response role.
In the case that a plurality of response roles are set on one terminal device, in order to respond to different users accurately with their respective customized response roles, an association relationship between a user and a response role may be established, as in the response role association module shown in Fig. 3. When user A inputs voice information, the response needs to use the voice of character A; when user B inputs voice information, the response needs to use the voice of character B.
The association relationship between a user and a response role can be established in different ways. In some cases, because the voiceprint information of different users differs, the association relationship between a user and a response role can be represented by the association relationship between voiceprint information and the response role. That is, the user can register his or her own voiceprint information on the terminal device when customizing the response role; the voiceprint information represents the user's identity. Therefore, after the response role is customized, the server can obtain the voiceprint information and establish the association relationship between the voiceprint information and the response role according to the role identifier and the voiceprint information.
In other cases, the role attributes set by different users for their response roles may differ, so different users can be distinguished by role attributes, in particular by the role nickname in the role attributes. Therefore, the association relationship between a user and a response role can be represented by the association relationship between the role nickname and the response role. That is, the user can define the role nickname of the response role when customizing it, for example calling character A "boss". After the response role is customized, the association relationship between the role nickname and the response role can be established according to the role identifier and the role nickname.
In other cases, different users may define the same role nickname for different response roles. For example, the response role customized by user A is character A with the nickname "boss", while the response role customized by user B is character B, also with the nickname "boss". Therefore, in order to accurately determine the response role for voice information in subsequent use, the association relationship among the voiceprint information, the role nickname, and the response role can be established, so that when the role nicknames are the same the response role can still be determined according to the voiceprint information.
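The sketch below illustrates such an association table, keyed by nickname and a voiceprint identifier so that two users who chose the same nickname for different roles stay distinguishable. The voiceprint identifier stands in for whatever voiceprint registration returns; it and the dictionary layout are assumptions of this sketch.

```python
# association table: (role nickname, voiceprint id) -> role id
associations = {}

def register_association(role_id, nickname, voiceprint_id):
    """Record that this user's nickname and voiceprint both point at role_id."""
    associations[(nickname, voiceprint_id)] = role_id

# Two users both call their role "boss"; the voiceprint keeps them apart.
register_association("role_A", "boss", "vp_user_A")
register_association("role_B", "boss", "vp_user_B")
```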
Of course, in some cases, the setting of the response role, the setting of the role attribute, the establishment of the association relationship, and the notification of the server may also be completed by the intelligent voice device itself, which is not limited in this embodiment.
After the customization is completed on the terminal device, if the user wants to control the terminal device by voice, that is, to perform voice interaction, the user can input voice information through the microphone of the terminal device. After acquiring the voice information, the terminal device can send it to the server, so that the server generates the response information of the customized response role according to the voice information.
S402, determining a target response role matched with the voice information according to the voice information.
After the server acquires the voice information input by the user, the target response role matched with the voice information can be determined, so that the voice information can be responded to in the voice of the target response role.

Several association relationships are described above under S401; the way of determining the target response role matched with the voice information differs according to which association relationship is used. If the association relationship is between voiceprint information and response roles, voiceprint recognition can be performed on the voice information to obtain a voiceprint recognition result. The voiceprint recognition result reflects the voiceprint information in the voice information, namely which user input the voice information, so the target response role can be determined according to the voiceprint recognition result and the association relationship.

For example, the response role selected by user A is character A, and the association relationship between the voiceprint information of user A and character A is stored in the server. When the user controls the intelligent television by voice and wants to play a certain television play, the voice information input by the user is "I want to watch ×". The server performs voiceprint recognition on the voice information to obtain a voiceprint recognition result; the result shows that the voiceprint matches the voiceprint information of user A, so the target response role can be determined to be character A.
If the association relationship is that of a role nickname and a response role, the method for determining the target response role matched with the voice information may be to identify the role nickname included in the voice information, and determine the target response role according to the association relationship and the role nickname included in the voice information.
For example, the response role selected by user A is character A, the nickname set for character A is "boss", and the association relationship between the nickname "boss" and character A is stored in the server. When the user controls the intelligent television by voice and wants to play a certain television play, the voice information input by the user is "Boss, I want to watch ×". The server can identify the role nickname "boss" contained in the voice information, and the target response role can then be determined to be character A.
Of course, if the association relationship is among voiceprint information, role nicknames, and response roles, the target response role can be determined by identifying the role nickname included in the voice information and determining the response role corresponding to that nickname; if the nickname corresponds to a plurality of response roles, or the user who input the voice information is not the user who customized that nickname, the target response role is further determined according to the voiceprint information and the association relationship.
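This nickname-first, voiceprint-fallback logic could look roughly as follows; the association table uses the same assumed (nickname, voiceprint id) keying as the earlier sketch, and the voiceprint recognizer is passed in as a callable because no concrete model is fixed by the patent.

```python
from typing import Callable, Optional

def resolve_target_role(associations,                      # (nickname, voiceprint_id) -> role_id
                        nickname: Optional[str],
                        audio: bytes,
                        recognize_voiceprint: Callable[[bytes], str]) -> Optional[str]:
    """Try the spoken nickname first; fall back to voiceprint recognition when ambiguous or absent."""
    matches = [role for (name, _vp), role in associations.items() if name == nickname]
    if len(matches) == 1:
        return matches[0]                      # the nickname alone is unambiguous
    voiceprint_id = recognize_voiceprint(audio)
    for (name, vp), role in associations.items():
        if vp == voiceprint_id and (nickname is None or name == nickname):
            return role
    return None                                # no customized response role matches this speaker
```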
S403, determining a voice synthesis model corresponding to the target response role according to the corresponding relation between the first response role and the voice synthesis model.
Through the embodiment corresponding to Fig. 2, the speech synthesis model of each first response role is obtained by training, and the correspondence between the speech synthesis model and the first response role is stored. The server then looks up the speech synthesis model corresponding to the target response role according to the target response role and this correspondence.
In some cases, since the speech synthesis models of all the characters may not be trained, there may be cases where the speech synthesis model corresponding to the target response character is not included in the trained speech synthesis models. In these cases, the target response role may be used as the first response role, and the step of extracting the audio/video clip corresponding to the first response role from the audio/video work in the embodiment corresponding to fig. 2 may be re-performed, so as to update the speech synthesis model obtained by training.
Through these steps, more of the target response roles that users may customize can be covered, and the set of speech synthesis models is supplemented according to this feedback, so that users' interaction requirements are met as far as possible.
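A brief sketch of this lookup-with-fallback behaviour is shown below; `train_pipeline` is an assumed callable wrapping the S202-S204 training flow, not a concrete function from the patent.

```python
def get_synthesis_model(role_to_model, target_role, train_pipeline):
    """Look up the target role's model; if missing, train one and record it."""
    model = role_to_model.get(target_role)
    if model is None:                        # not covered by the trained models yet
        model = train_pipeline(target_role)  # re-run S202-S204 with this role as the first response role
        role_to_model[target_role] = model   # update the set of trained speech synthesis models
    return model
```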
S404, calling a voice synthesis model corresponding to the target response role to generate response voice corresponding to the voice information.
After determining the target response role, the server may invoke a speech synthesis model corresponding to the target response role, generate response speech corresponding to the speech information, and send the response speech to the terminal device, so that the terminal device sends the response speech by using the sound of the target response role.
If the voice information is used for controlling the terminal equipment to execute the action, the terminal equipment can execute the corresponding action after sending out the response voice.
For example, if the target response role is character A and the voice information input by the user is "Old man, I want to see ×", the generated response voice may be "Good" (in the sound of character A); after receiving the response voice, the terminal device sends out the response voice "Good" in the sound of character A and plays the television play "×".
In some embodiments, if the user has customized a role attribute for the response role, the terminal device can return that role attribute in the response voice when the user asks about it.
For example, the target response role is character A and its role attribute includes the nickname "×". If the user inputs "Who are you", the generated response voice may be "I am ×" (in the voice of character A); after receiving the response voice, the terminal device sends out the response voice "I am ×" in the voice of character A.
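Pulling the pieces together, the response step could be sketched as below under the same assumptions as the previous snippets; `answer_text_for` stands in for the question-answering step and `synthesize` for speech synthesis model inference, neither of which is named by the patent.

```python
def respond(voice_text, target_role, role_to_model, answer_text_for, synthesize):
    """Produce the response voice for one piece of user voice information."""
    reply_text = answer_text_for(voice_text, target_role)   # e.g. "Good" or "I am ..."
    model = role_to_model[target_role]                       # model chosen in S403
    waveform = synthesize(model, reply_text)                 # response voice in the role's sound
    return waveform                                          # sent back to the terminal device
```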
According to the technical scheme, when the voice information input by the user is acquired so as to realize voice interaction, the target response role matched with the voice information can be determined according to the voice information, and then the voice synthesis model corresponding to the target response role is determined according to the corresponding relation between the first response role and the voice synthesis model. And then, calling a voice synthesis model corresponding to the target response role to generate response voice corresponding to the voice information, so that voice interaction with the user through the voice of the target response role is realized, and the interestingness of the voice interaction is improved.
Next, the voice interaction method provided by the embodiment of the application will be described in connection with an actual application scenario. For example, the smart television can be controlled through voice, and the smart television can interact with a user through voice as a smart voice device. Referring to fig. 5, the method includes:
s501, a user opens the intelligent television to enter a response role setting interface.
S502, the user selects a favorite character (for example, a character A) as a target response character.
S503, when performing voice control on the intelligent television, the user inputs the voice information "Old man, I want to watch ×".

S504, the server acquires the voice information "Old man, I want to watch ×".
S505, the server determines the target response role as the role A according to the voice information.
S506, the server determines a voice synthesis model corresponding to the target response role according to the corresponding relation between the first response role and the voice synthesis model.
S507, the server calls a voice synthesis model corresponding to the target response role, and generates response voice 'good' (the voice of the role A) corresponding to the voice information.
S508, the intelligent television receives the response voice "good" and responds to the "good" with the sound of the role A.
S509, playing a television play by the intelligent television.
Based on the speech synthesis model training method provided in the corresponding embodiment of fig. 2, the embodiment of the present application further provides a speech synthesis model training device 600, referring to fig. 6, the device 600 includes a collecting unit 601, an extracting unit 602, a recognizing unit 603, and a training unit 604:
the collecting unit 601 is configured to collect an audio and video work;
the extracting unit 602 is configured to extract an audio/video clip corresponding to the first response role from the audio/video work;
the identifying unit 603 is configured to identify, according to the audio-video clip, audio corresponding to the first response role and text corresponding to the audio;
The training unit 604 is configured to train to obtain a speech synthesis model corresponding to the first response role according to the audio and the text corresponding to the audio.
In a possible implementation manner, the apparatus further includes an acquisition unit and a determination unit:
the acquisition unit is used for acquiring comment information aiming at the audio and video works;
and the determining unit is used for determining the role with the good evaluation degree larger than the preset threshold value as the first response role from the audio-video works according to the evaluation information.
In a possible implementation manner, the apparatus further includes a generating unit:
the acquisition unit is also used for acquiring voice information input by a user;
the determining unit is further used for determining a target response role matched with the voice information according to the voice information;
determining a voice synthesis model corresponding to the target response role according to the corresponding relation between the first response role and the voice synthesis model;
the generating unit is used for calling the voice synthesis model corresponding to the target response role and generating response voice corresponding to the voice information.
In one possible implementation manner, if the speech synthesis model corresponding to the target response role is not included in the speech synthesis model obtained by training, the determining unit is further configured to:
And taking the target response role as the first response role, triggering the extraction unit 602 to re-execute the step of extracting the audio and video fragments corresponding to the first response role from the audio and video works so as to update the trained voice synthesis model.
In one possible implementation manner, before the voice information input by the user is obtained, the apparatus further includes a receiving unit and a setting unit:
the receiving unit is used for receiving a user-defined request of the user, wherein the user-defined request comprises a role identifier;
the setting unit is used for setting the response role of the intelligent voice equipment according to the role identification.
In one possible implementation manner, if the response role of the intelligent voice device includes a plurality of response roles, the user-defined request includes voiceprint information of the user, and the apparatus further includes an establishing unit:
the establishing unit is used for establishing the association relation between the voiceprint information and the response role according to the role identifier and the voiceprint information;
the determining unit is further configured to:
performing voiceprint recognition according to the voice information to obtain a voiceprint recognition result;
and determining the target response role according to the voiceprint recognition result and the association relation.
In a possible implementation manner, the custom request further includes attribute information of the response role, and the setting unit is configured to:
and setting a response role and corresponding role attributes of the intelligent voice equipment according to the role identification and the attribute information.
In one possible implementation manner, if the response role of the intelligent voice device includes a plurality of response roles, the attribute information includes a role nickname, and the establishing unit is further configured to:
establishing an association relationship between the role nickname and the response role according to the role identifier and the role nickname;
the determining unit is used for:
identifying a character nickname included in the voice information;
and determining the target response role according to the association relationship and the role nickname included in the voice information.
The embodiment of the application also provides equipment for training the speech synthesis model, and the equipment is described below with reference to the accompanying drawings. Referring to fig. 7, an embodiment of the present application provides a device for training a speech synthesis model, where the device may be a terminal device, and the terminal device is exemplified by a smart phone:
fig. 7 is a block diagram showing a part of a structure of a smart phone related to a terminal device provided by an embodiment of the present application. Referring to fig. 7, the smart phone includes: radio Frequency (RF) circuit 710, memory 720, input unit 730, display unit 740, sensor 750, audio circuit 760, wireless fidelity (wireless fidelity, wiFi) module 770, processor 780, and power supply 790. Those skilled in the art will appreciate that the smartphone structure shown in fig. 7 is not limiting of the smartphone and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
The memory 720 may be used to store software programs and modules, and the processor 780 performs various functional applications and data processing of the smartphone by running the software programs and modules stored in the memory 720. The memory 720 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, phonebooks, etc.) created according to the use of the smart phone, etc. In addition, memory 720 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The processor 780 is a control center of the smart phone, connects various parts of the entire smart phone using various interfaces and lines, and performs various functions of the smart phone and processes data by running or executing software programs and/or modules stored in the memory 720, and calling data stored in the memory 720. Optionally, the processor 780 may include one or more processing units; preferably, the processor 780 may integrate an application processor that primarily processes operating systems, user interfaces, applications, etc., with a modem processor that primarily processes wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 780.
In this embodiment, the processor 780 in the terminal device 700 may perform the following steps;
collecting audio and video works;
extracting an audio and video clip corresponding to the first response role from the audio and video work;
identifying the audio corresponding to the first response role and the text corresponding to the audio according to the audio-video fragment;
and training according to the audio and the text corresponding to the audio to obtain a voice synthesis model corresponding to the first response role.
The device for training a speech synthesis model may further include a server, and referring to fig. 8, fig. 8 is a schematic diagram of a server 800 according to an embodiment of the present application, where the server 800 may have a relatively large difference due to configuration or performance, and may include one or more central processing units (Central Processing Units, abbreviated as CPU) 822 (e.g., one or more processors) and a memory 832, and one or more storage media 830 (e.g., one or more mass storage devices) storing application programs 842 or data 844. Wherein the memory 832 and the storage medium 830 may be transitory or persistent. The program stored in the storage medium 830 may include one or more modules (not shown), each of which may include a series of instruction operations on a server. Still further, the central processor 822 may be configured to communicate with the storage medium 830 to execute a series of instruction operations in the storage medium 830 on the server 800.
The server 800 may also include one or more power supplies 826, one or more wired or wireless network interfaces 850, one or more input/output interfaces 858, and/or one or more operating systems 841, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
In this embodiment, the cpu 822 in the server 800 may perform the following steps;
collecting audio and video works;
extracting an audio and video clip corresponding to the first response role from the audio and video work;
identifying the audio corresponding to the first response role and the text corresponding to the audio according to the audio-video fragment;
and training according to the audio and the text corresponding to the audio to obtain a voice synthesis model corresponding to the first response role.
Embodiments of the present application also provide a computer readable storage medium for storing a program code for executing the voice interaction method described in the foregoing embodiments.
Embodiments of the present application also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the speech synthesis model training method described in the previous embodiments.
The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented, for example, in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (14)

1. A method of training a speech synthesis model, the method comprising:
collecting audio and video works;
determining the role of which the speaking time reaches a preset threshold value from all roles in the audio and video works;
comment information aiming at the audio and video works is obtained;
according to the comment information, determining a role with a good evaluation degree larger than a preset threshold value from the roles with the speaking time reaching the preset threshold value as a first response role;
extracting an audio-video clip in which the first answering character is speaking from the audio-video work in an image recognition mode;
identifying the audio corresponding to the first response role and the text corresponding to the audio according to the audio-video fragment;
Training according to the audio and the text corresponding to the audio to obtain a voice synthesis model corresponding to the first response role;
acquiring voice information input by a user;
identifying a role nickname included in the voice information, determining a response role corresponding to the role nickname, if the response role corresponding to the role nickname includes a plurality of roles, performing voiceprint identification according to the voice information to obtain a voiceprint identification result, and determining a target response role matched with the voice information according to the voiceprint identification result and the associated information; the association information is the association relationship among the character identification of the response character, the character nickname and the voiceprint information of the user;
determining a voice synthesis model corresponding to the target response role according to the corresponding relation between the first response role and the voice synthesis model;
and if the speech synthesis model corresponding to the target response role is not included in the speech synthesis model obtained through training, taking the target response role as the first response role, and re-executing the step of extracting the audio and video fragment which is speaking by the first response role from the audio and video work so as to update the speech synthesis model obtained through training.
2. The method according to claim 1, wherein the method further comprises:
and calling a voice synthesis model corresponding to the target response role to generate response voice corresponding to the voice information.
3. The method of claim 2, wherein prior to the obtaining the voice information input by the user, the method further comprises:
receiving a user-defined request of the user, wherein the user-defined request comprises a role identifier;
and setting a response role of the intelligent voice equipment according to the role identification.
4. The method of claim 3, wherein if the responsive role of the intelligent voice device includes a plurality of responsive roles, the user-defined request includes voiceprint information of the user, the method further comprising:
and establishing the association relation between the voiceprint information and the response role according to the role identification and the voiceprint information.
5. The method of claim 3, wherein the custom request further includes attribute information of the response role, and the setting the response role of the intelligent voice device according to the role identifier includes:
and setting a response role and corresponding role attributes of the intelligent voice equipment according to the role identification and the attribute information.
6. The method of claim 5, wherein if the responsive personas of the intelligent voice device include a plurality of, the attribute information includes a persona nickname, the method further comprising:
and establishing an association relationship between the role nickname and the response role according to the role identifier and the role nickname.
7. The device is characterized by comprising a collecting unit, an extracting unit, a recognizing unit, a training unit, an obtaining unit and a determining unit;
the collecting unit is used for collecting the audio and video works;
the device is also for: determining the role of which the speaking time reaches a preset threshold value from all roles in the audio and video works; comment information aiming at the audio and video works is obtained; according to the comment information, determining a role with a good evaluation degree larger than a preset threshold value from the roles with the speaking time reaching the preset threshold value as a first response role;
the extracting unit is used for extracting the audio-video clip of the first answering character speaking from the audio-video work in an image recognition mode;
the identification unit is used for identifying the audio corresponding to the first response role and the text corresponding to the audio according to the audio-video fragment;
The training unit is used for training according to the audio and the text corresponding to the audio to obtain a voice synthesis model corresponding to the first response role;
the acquisition unit is used for acquiring voice information input by a user;
the determining unit is used for identifying a role nickname included in the voice information, determining a response role corresponding to the role nickname, if the response role corresponding to the role nickname comprises a plurality of roles, performing voiceprint identification according to the voice information to obtain a voiceprint identification result, and determining a target response role matched with the voice information according to the voiceprint identification result and the associated information; the association information is the association relationship among the character identification of the response character, the character nickname and the voiceprint information of the user; determining a voice synthesis model corresponding to the target response role according to the corresponding relation between the first response role and the voice synthesis model;
and the determining unit is further configured to, if the speech synthesis model obtained by training does not include the speech synthesis model corresponding to the target response role, re-execute the step of extracting the audio/video segment of the speech of the first response role from the audio/video work by taking the target response role as the first response role, so as to update the speech synthesis model obtained by training.
8. The apparatus of claim 7, wherein the apparatus further comprises: a generating unit;
the generating unit is used for calling the voice synthesis model corresponding to the target response role and generating response voice corresponding to the voice information.
9. The apparatus of claim 8, wherein prior to the obtaining the voice information input by the user, the apparatus further comprises: a receiving unit and a setting unit;
the receiving unit is used for receiving a user-defined request of the user, wherein the user-defined request comprises a role identifier;
the setting unit is used for setting the response role of the intelligent voice equipment according to the role identification.
10. The apparatus of claim 9, wherein if the responsive role of the intelligent voice device includes a plurality of responsive roles, the customized request includes voiceprint information of the user, the apparatus further comprising: a building unit;
the establishing unit is used for establishing the association relation between the voiceprint information and the response role according to the role identifier and the voiceprint information.
11. The apparatus of claim 10, wherein the custom request further includes attribute information of the response role, and the setting unit is configured to:
And setting a response role and corresponding role attributes of the intelligent voice equipment according to the role identification and the attribute information.
12. The apparatus of claim 11, wherein if the responsive character of the intelligent voice device comprises a plurality of character nicknames, the attribute information comprises character nicknames, the establishing unit is further configured to:
and establishing an association relationship between the role nickname and the response role according to the role identifier and the role nickname.
13. An apparatus for speech synthesis model training, the apparatus comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the method of any of claims 1-6 according to instructions in the program code.
14. A computer readable storage medium, characterized in that the computer readable storage medium is for storing a program code for causing a computer device to perform the method of any one of claims 1-6.
CN202010960441.1A 2020-09-14 2020-09-14 Speech synthesis model training method and related device Active CN112820265B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010960441.1A CN112820265B (en) 2020-09-14 2020-09-14 Speech synthesis model training method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010960441.1A CN112820265B (en) 2020-09-14 2020-09-14 Speech synthesis model training method and related device

Publications (2)

Publication Number Publication Date
CN112820265A CN112820265A (en) 2021-05-18
CN112820265B true CN112820265B (en) 2023-12-08

Family

ID=75853210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010960441.1A Active CN112820265B (en) 2020-09-14 2020-09-14 Speech synthesis model training method and related device

Country Status (1)

Country Link
CN (1) CN112820265B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113903325B (en) * 2021-05-31 2022-10-18 北京荣耀终端有限公司 Method and device for converting text into 3D audio

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10271204A (en) * 1997-03-21 1998-10-09 Hitachi Vlsi Eng Corp Automatic answering device and telephone set provided with automatic answering function
CN108172209A (en) * 2018-01-09 2018-06-15 上海大学 Build voice idol method
CN109272984A (en) * 2018-10-17 2019-01-25 百度在线网络技术(北京)有限公司 Method and apparatus for interactive voice
CN109637518A (en) * 2018-11-07 2019-04-16 北京搜狗科技发展有限公司 Virtual newscaster's implementation method and device
US10296959B1 (en) * 2015-03-30 2019-05-21 Audible, Inc. Automated recommendations of audio narrations
CN109862422A (en) * 2019-02-28 2019-06-07 腾讯科技(深圳)有限公司 Method for processing video frequency, device, computer readable storage medium and computer equipment
CN109963184A (en) * 2017-12-14 2019-07-02 阿里巴巴集团控股有限公司 A kind of method, apparatus and electronic equipment of audio-video network broadcasting
WO2019199516A1 (en) * 2018-04-11 2019-10-17 Motorola Solutions, Inc. System and method for tailoring an electronic digital assistant query as a function of captured multi-party voice dialog and an electronically stored multi-party voice-interaction template
CN110366032A (en) * 2019-08-09 2019-10-22 腾讯科技(深圳)有限公司 Video data handling procedure, device and video broadcasting method, device
CN111048095A (en) * 2019-12-24 2020-04-21 苏州思必驰信息科技有限公司 Voice transcription method, equipment and computer readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10607595B2 (en) * 2017-08-07 2020-03-31 Lenovo (Singapore) Pte. Ltd. Generating audio rendering from textual content based on character models

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10271204A (en) * 1997-03-21 1998-10-09 Hitachi Vlsi Eng Corp Automatic answering device and telephone set provided with automatic answering function
US10296959B1 (en) * 2015-03-30 2019-05-21 Audible, Inc. Automated recommendations of audio narrations
CN109963184A (en) * 2017-12-14 2019-07-02 阿里巴巴集团控股有限公司 A kind of method, apparatus and electronic equipment of audio-video network broadcasting
CN108172209A (en) * 2018-01-09 2018-06-15 上海大学 Build voice idol method
WO2019199516A1 (en) * 2018-04-11 2019-10-17 Motorola Solutions, Inc. System and method for tailoring an electronic digital assistant query as a function of captured multi-party voice dialog and an electronically stored multi-party voice-interaction template
CN109272984A (en) * 2018-10-17 2019-01-25 百度在线网络技术(北京)有限公司 Method and apparatus for interactive voice
CN109637518A (en) * 2018-11-07 2019-04-16 北京搜狗科技发展有限公司 Virtual newscaster's implementation method and device
CN109862422A (en) * 2019-02-28 2019-06-07 腾讯科技(深圳)有限公司 Method for processing video frequency, device, computer readable storage medium and computer equipment
CN110366032A (en) * 2019-08-09 2019-10-22 腾讯科技(深圳)有限公司 Video data handling procedure, device and video broadcasting method, device
CN111048095A (en) * 2019-12-24 2020-04-21 苏州思必驰信息科技有限公司 Voice transcription method, equipment and computer readable storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Role of Clear Speech Attribute on Fricatives Perception for the Hard of Hearing;N.H. Shobha;《2008 International Conference on Audio, Language and Image Processing》;全文 *
针对老年陪伴机器人的语音交互设计研究;王攀凯;《中国优秀硕士学位论文全文数据库》;全文 *
黄孝建.《多媒体技术》.北京邮电大学出版社,2010,第229-230页. *

Also Published As

Publication number Publication date
CN112820265A (en) 2021-05-18

Similar Documents

Publication Publication Date Title
US20240054117A1 (en) Artificial intelligence platform with improved conversational ability and personality development
TWI778477B (en) Interaction methods, apparatuses thereof, electronic devices and computer readable storage media
CN107203953B (en) Teaching system based on internet, expression recognition and voice recognition and implementation method thereof
Iannizzotto et al. A vision and speech enabled, customizable, virtual assistant for smart environments
CN110519636B (en) Voice information playing method and device, computer equipment and storage medium
CN107632706B (en) Application data processing method and system of multi-modal virtual human
CN107294837A (en) Engaged in the dialogue interactive method and system using virtual robot
CN112040263A (en) Video processing method, video playing method, video processing device, video playing device, storage medium and equipment
CN106297801A (en) Method of speech processing and device
CN106774845B (en) intelligent interaction method, device and terminal equipment
CN107480766B (en) Method and system for content generation for multi-modal virtual robots
CN114244816B (en) Synchronous communication method, terminal and readable storage medium
CN111291151A (en) Interaction method and device and computer equipment
CN108763475B (en) Recording method, recording device and terminal equipment
CN113378583A (en) Dialogue reply method and device, dialogue model training method and device, and storage medium
CN113703585A (en) Interaction method, interaction device, electronic equipment and storage medium
CN113689530B (en) Method and device for driving digital person and electronic equipment
CN112820265B (en) Speech synthesis model training method and related device
CN114048299A (en) Dialogue method, apparatus, device, computer-readable storage medium, and program product
CN111369275B (en) Group identification and description method, coordination device and computer readable storage medium
CN117424956A (en) Setting item processing method and device, electronic equipment and storage medium
CN113542797A (en) Interaction method and device in video playing and computer readable storage medium
JP2018186326A (en) Robot apparatus and program
CN112988956A (en) Method and device for automatically generating conversation and method and device for detecting information recommendation effect
CN109725798B (en) Intelligent role switching method and related device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40044198

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant