CN117219043A - Model training method, model application method and related device

Info

Publication number
CN117219043A
Authority
CN
China
Prior art keywords
information
voice
parameter
text information
characteristic
Prior art date
Legal status
Pending
Application number
CN202310015162.1A
Other languages
Chinese (zh)
Inventor
李广之
段志毅
杨颖
翁超
戴北
甄帅
卞衍尧
陆远
Current Assignee
Shenzhen Tencent Information Technology Co Ltd
Original Assignee
Shenzhen Tencent Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Tencent Information Technology Co Ltd
Priority to CN202310015162.1A
Publication of CN117219043A
Legal status: Pending

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The embodiment of the application discloses a model training method, a model application method and a related device. During model training, the difference between the undetermined voice information and the target sample voice information reflects the accuracy of the initial voice synthesis model when it directly synthesizes voice information based on text information and adjustment parameters. The voice synthesis model obtained by adjusting the parameters of the initial voice synthesis model based on this difference can therefore directly synthesize, from the text information to be synthesized and the corresponding adjustment parameters, the corresponding voice information, so that the voice information both satisfies the adjustment of the pronunciation mode required by the adjustment parameters and fits the overall voice pronunciation characteristics of the text information to be synthesized. On the premise of ensuring accurate adjustment of the voice information, the authenticity of the adjusted voice information is improved, and the voice synthesis effect is thereby improved.

Description

Model training method, model application method and related device
Technical Field
The application relates to the field of data processing, in particular to a model training method, a model application method and a related device.
Background
Speech synthesis is one of the popular data processing technologies. Its function is to simulate the way real speech sounds and to generate corresponding voice information based on input text information. When using speech synthesis, in order to make the obtained voice information more realistic and better fit the actual requirement, the voice synthesis party generally adjusts the relevant parameters of the voice information according to its own voice synthesis requirements.
In the related art, the voice synthesis technology does not support parameter adjustment during the generation of voice information; a voice synthesizer can only adjust parameters such as intonation and duration after the voice information has been synthesized.
Because this parameter adjustment can only be performed after the audio has been synthesized and is applied to the voice information itself, it is difficult to refer to the context information contained in the text information to be synthesized. The adjusted voice information therefore tends to sound distorted, and the voice synthesis effect is poor.
Disclosure of Invention
In order to solve the above technical problems, the application provides a model training method that enables the trained model to synthesize voice information directly from the adjustment parameters and the text to be synthesized, so that the voice information both meets the adjustment requirements and fits the text characteristics of the text to be synthesized, thereby improving the voice synthesis effect.
The embodiment of the application discloses the following technical scheme:
in a first aspect, an embodiment of the present application discloses a model training method, the method including:
acquiring a sample text information set, wherein the sample text information set comprises a plurality of sample text information, the sample text information is provided with corresponding sample voice information and sample adjusting parameters, and the sample voice information is generated based on the sample adjusting parameters;
respectively taking the plurality of sample text information as target sample text information, and generating voice characteristic information corresponding to the target sample text information according to the target sample text information through an initial voice synthesis model, wherein the voice characteristic information is used for identifying the pronunciation mode of the target sample text information in the voice information;
adjusting the voice characteristic information according to target sample adjusting parameters corresponding to the target sample text information through the initial voice synthesis model to obtain adjusted voice characteristic information;
generating undetermined voice information corresponding to the target sample text information according to the adjusted voice characteristic information through the initial voice synthesis model;
according to the difference between the undetermined voice information and the target sample voice information corresponding to the target sample text information, adjusting the model parameters corresponding to the initial voice synthesis model to obtain a voice synthesis model, wherein the voice synthesis model is used for synthesizing the voice information according to the text information to be synthesized and the adjusting parameters corresponding to the text information to be synthesized.
In a second aspect, an embodiment of the present application discloses a model application method, where the method includes:
acquiring text information to be synthesized and an adjusting parameter corresponding to the text information to be synthesized, which are input by a voice synthesis object, wherein the adjusting parameter is used for adjusting the pronunciation mode of the text information to be synthesized in the voice information;
inputting the text information to be synthesized and the adjusting parameters corresponding to the text information to be synthesized into a voice synthesis model, and generating target voice information corresponding to the text information to be synthesized through the voice synthesis model;
and sending the target voice information to the voice synthesis object.
In a third aspect, an embodiment of the present application discloses a model training apparatus, where the apparatus includes an obtaining unit, a first generating unit, a first adjusting unit, a second generating unit, and a second adjusting unit:
the acquisition unit is used for acquiring a sample text information set, wherein the sample text information set comprises a plurality of sample text information, the sample text information is provided with corresponding sample voice information and sample adjusting parameters, and the sample voice information is generated based on the sample adjusting parameters;
the first generation unit is used for respectively taking the plurality of sample text information as target sample text information, generating voice characteristic information corresponding to the target sample text information according to the target sample text information through an initial voice synthesis model, and the voice characteristic information is used for identifying the pronunciation mode of the target sample text information in the voice information;
the first adjusting unit is used for adjusting the voice characteristic information according to the target sample adjusting parameters corresponding to the target sample text information through the initial voice synthesis model to obtain adjusted voice characteristic information;
the second generating unit is used for generating undetermined voice information corresponding to the target sample text information according to the adjusted voice characteristic information through the initial voice synthesis model;
the second adjusting unit is configured to adjust model parameters corresponding to the initial speech synthesis model according to a difference between the to-be-determined speech information and the target sample speech information corresponding to the target sample text information, so as to obtain a speech synthesis model, where the speech synthesis model is configured to synthesize the speech information according to the to-be-synthesized text information and the adjustment parameters corresponding to the to-be-synthesized text information.
In one possible implementation manner, the first generating unit is specifically configured to:
determining phoneme characteristic information, semantic characteristic information and prosody characteristic information corresponding to the target sample text information, wherein the phoneme characteristic information is used for identifying a phoneme composition corresponding to the target sample text information, the semantic characteristic information is used for identifying semantics corresponding to the target sample text information, and the prosody characteristic information is used for identifying pronunciation prosody corresponding to the target sample text information;
and generating voice characteristic information corresponding to the target sample text information according to the phoneme characteristic information, the semantic characteristic information and the prosody characteristic information.
In one possible implementation manner, the sample text information has corresponding sample emotion tags, the initial speech synthesis model includes initial emotion feature information corresponding to a plurality of emotion tags, and the first generation unit is specifically configured to:
determining target initial emotion feature information corresponding to a target sample emotion label corresponding to the target sample text information;
generating voice characteristic information corresponding to the target sample text information according to the target initial emotion characteristic information and the target sample text information;
The second adjusting unit is specifically configured to:
according to the difference between the undetermined voice information and the target sample voice information corresponding to the target sample text information, adjusting model parameters corresponding to the initial voice synthesis model and the target initial emotion feature information to obtain a voice synthesis model, wherein the voice synthesis model comprises emotion feature information respectively corresponding to the plurality of emotion tags, the emotion feature information is obtained by adjusting the initial emotion feature information corresponding to the emotion tags, and the voice synthesis model is used for synthesizing the voice information according to the text information to be synthesized, the adjusting parameters corresponding to the text information to be synthesized and the emotion tags corresponding to the text information to be synthesized.
In a possible implementation manner, the target sample adjustment parameter includes a first adjustment parameter, where the first adjustment parameter is used to adjust a first feature parameter included in the voice feature information, and the first adjustment unit is specifically configured to:
and adjusting the first characteristic parameters included in the voice characteristic information according to the first adjusting parameters through the initial voice synthesis model to obtain adjusted voice characteristic information.
In a possible implementation manner, the initial speech synthesis model includes a parameter prediction part, the target sample adjustment parameter includes a second adjustment parameter, the second adjustment parameter is used for adjusting a second characteristic parameter determined according to the speech characteristic information, the speech characteristic information does not include the second characteristic parameter, and the target sample text information has a corresponding sample second characteristic parameter;
the apparatus further comprises a determination unit:
the determining unit is used for determining undetermined second characteristic parameters corresponding to the voice characteristic information through the parameter predicting part;
the first adjusting unit is specifically configured to:
determining a second characteristic parameter to be adjusted according to the second characteristic parameter of the sample and the second adjustment parameter;
adjusting the voice characteristic information according to the second characteristic parameters to be adjusted through the initial voice synthesis model to obtain adjusted voice characteristic information;
the second adjusting unit is specifically configured to:
and adjusting model parameters corresponding to the parameter prediction part according to the difference between the second characteristic parameter to be determined and the second characteristic parameter of the sample, and adjusting model parameters except for the parameter prediction part in the initial speech synthesis model according to the difference between the speech information to be determined and the target sample speech information corresponding to the target sample text information to obtain the speech synthesis model.
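For illustration only, the following is a minimal sketch of how the two differences described above could be turned into training losses, assuming a PyTorch implementation; the tensor names and the choice of an L1 loss are assumptions and not part of the disclosed embodiments.

```python
import torch
import torch.nn.functional as F

def training_losses(pending_second_param: torch.Tensor,
                    sample_second_param: torch.Tensor,
                    pending_speech: torch.Tensor,
                    target_sample_speech: torch.Tensor):
    """Two-part objective sketched from the description above:
    - the parameter prediction part is supervised by the difference between the
      undetermined second characteristic parameter and the sample second
      characteristic parameter;
    - the rest of the initial speech synthesis model is supervised by the
      difference between the undetermined speech information and the target
      sample speech information."""
    param_loss = F.l1_loss(pending_second_param, sample_second_param)
    speech_loss = F.l1_loss(pending_speech, target_sample_speech)
    return param_loss, speech_loss

# Hypothetical usage: back-propagate param_loss into the parameter prediction
# part only, and speech_loss into the model parameters outside that part.
```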
In a possible implementation manner, the target sample adjustment parameter has a corresponding emotion tag, and the second adjustment unit is specifically configured to:
determining a first emotion characteristic parameter corresponding to the emotion label;
normalizing the sample second characteristic parameter according to the first emotion characteristic parameter;
and according to the difference between the undetermined second characteristic parameters and the normalized sample second characteristic parameters, adjusting model parameters corresponding to the parameter prediction part, wherein the parameter prediction part in the voice synthesis model is used for determining normalized second characteristic parameters corresponding to voice characteristic information, and determining second characteristic parameters corresponding to the voice characteristic information according to the second emotion characteristic parameters corresponding to emotion labels corresponding to the text information to be synthesized and the normalized second characteristic parameters.
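A small illustrative sketch of the normalization described above, assuming the first emotion characteristic parameter is a per-emotion mean/standard-deviation statistic of the second characteristic parameter (for example pitch); the tag names, statistics and values below are hypothetical.

```python
import torch

# Hypothetical per-emotion statistics (mean and standard deviation of the second
# characteristic parameter, e.g. pitch, observed for speech with that emotion).
emotion_stats = {
    "angry": (200.0, 40.0),   # illustrative values only
    "calm": (140.0, 20.0),
}

def normalize_sample_second_param(sample_second_param: torch.Tensor, emotion_tag: str) -> torch.Tensor:
    """Normalize the sample second characteristic parameter with the first
    emotion characteristic parameter of its emotion label, so that the
    parameter prediction part learns an emotion-independent target."""
    mean, std = emotion_stats[emotion_tag]
    return (sample_second_param - mean) / std

def denormalize_predicted_param(normalized_param: torch.Tensor, emotion_tag: str) -> torch.Tensor:
    """At synthesis time, restore the second characteristic parameter from the
    normalized prediction using the second emotion characteristic parameter of
    the emotion label attached to the text information to be synthesized."""
    mean, std = emotion_stats[emotion_tag]
    return normalized_param * std + mean
```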
In one possible implementation, the first adjustment parameter includes any one or a combination of a drag control parameter for adjusting the drag parameter in the first characteristic parameter, an accent control parameter for adjusting the accent parameter in the first characteristic parameter, and a break control parameter for adjusting the break parameter in the first characteristic parameter.
In one possible implementation, the second adjustment parameter includes any one or a combination of a duration control parameter, a intonation control parameter, and a heave control parameter, where the duration control parameter is used to adjust the duration parameter in the second feature parameter, the intonation control parameter is used to adjust the intonation parameter in the second feature parameter, and the heave control parameter is used to adjust the heave parameter in the second feature parameter.
In one possible implementation, the second adjusting unit is specifically configured to:
generating a first spectrogram according to the undetermined voice information, and generating a second spectrogram according to the target sample voice information;
determining, by a generative adversarial network discriminator, a similarity parameter between the first spectrogram and the second spectrogram, where the similarity parameter is used to identify the difference between the first spectrogram and the second spectrogram;
and adjusting the model parameters corresponding to the initial voice synthesis model according to the similarity parameter to obtain a voice synthesis model, wherein the similarity parameter determined according to the voice synthesis model is larger than a preset threshold.
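The following is a rough sketch of a generative adversarial network discriminator that outputs a similarity parameter for a pair of spectrograms; the convolutional architecture, tensor shapes and names are assumptions made purely for illustration and are not the structure disclosed in the application.

```python
import torch
import torch.nn as nn

class SpectrogramSimilarityDiscriminator(nn.Module):
    """Takes the first (undetermined speech) and second (target sample speech)
    spectrograms as two input channels and outputs a similarity parameter in
    (0, 1); higher values mean the two spectrograms are harder to tell apart."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 1), nn.Sigmoid(),
        )

    def forward(self, first_spec: torch.Tensor, second_spec: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([first_spec, second_spec], dim=1))

discriminator = SpectrogramSimilarityDiscriminator()
first_spec = torch.randn(1, 1, 80, 200)    # spectrogram of the undetermined speech (dummy data)
second_spec = torch.randn(1, 1, 80, 200)   # spectrogram of the target sample speech (dummy data)
similarity = discriminator(first_spec, second_spec)   # compared against the preset threshold
```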
In a fourth aspect, an embodiment of the present application discloses a model application apparatus, where the apparatus includes an obtaining unit, a generating unit, and a sending unit:
the acquisition unit is used for acquiring text information to be synthesized, input by a voice synthesis object, and adjusting parameters corresponding to the text information to be synthesized, wherein the adjusting parameters are used for adjusting the pronunciation mode of the text information to be synthesized in the voice information;
the generating unit is used for inputting the text information to be synthesized and the adjusting parameters corresponding to the text information to be synthesized into a voice synthesis model, and generating target voice information corresponding to the text information to be synthesized through the voice synthesis model;
the sending unit is used for sending the target voice information to the voice synthesis object.
In a possible implementation manner, the generating unit is specifically configured to:
determining phoneme characteristic information, semantic characteristic information and prosody characteristic information corresponding to the text information to be synthesized, wherein the phoneme characteristic information is used for identifying a phoneme composition corresponding to the text information to be synthesized, the semantic characteristic information is used for identifying semantics corresponding to the text information to be synthesized, and the prosody characteristic information is used for identifying pronunciation prosody corresponding to the text information to be synthesized;
generating voice characteristic information corresponding to the text information to be synthesized according to the phoneme characteristic information, the semantic characteristic information and the prosody characteristic information, wherein the voice characteristic information is used for identifying the pronunciation mode of the text information to be synthesized in the voice information;
Adjusting the voice characteristic information according to the adjusting parameters corresponding to the text information to be synthesized;
and generating target voice information corresponding to the text information to be synthesized according to the adjusted voice characteristic information.
In a possible implementation manner, the speech synthesis model includes a parameter adjustment portion and a parameter prediction portion, where the adjustment parameters include a first adjustment parameter and a second adjustment parameter, the first adjustment parameter is used to adjust a first feature parameter included in the speech feature information, and the second adjustment parameter is used to adjust a second feature parameter determined according to the speech feature information, and the speech feature information does not include the second feature parameter;
the generating unit is specifically configured to:
determining, by the parameter prediction portion, a second feature parameter corresponding to the voice feature information according to the voice feature information;
determining a second characteristic parameter to be adjusted according to the second adjustment parameter and the second characteristic parameter;
and adjusting, by the parameter adjusting section, a first characteristic parameter included in the voice characteristic information according to the first adjustment parameter, and adjusting the voice characteristic information according to the second characteristic parameter to be adjusted.
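As a purely illustrative sketch of this flow, assume the second characteristic parameter is a per-phoneme duration predicted by the parameter prediction portion and both adjustment parameters act as multiplicative scales; the names and shapes below are hypothetical and not taken from the application.

```python
import torch

def length_regulate(speech_features: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Expand per-phoneme speech feature rows according to (adjusted) durations,
    one common way a duration-like second characteristic parameter can drive
    the adjustment of the speech feature information."""
    repeated = [speech_features[i].repeat(int(durations[i].item()), 1)
                for i in range(speech_features.size(0))]
    return torch.cat(repeated, dim=0)

speech_features = torch.randn(5, 256)                     # one row per phoneme (dummy)
predicted_durations = torch.tensor([3., 4., 2., 5., 3.])  # from the parameter prediction portion
duration_scale = 1.2                                      # second adjustment parameter (assumed form)
accent_scale = torch.ones(5, 256)
accent_scale[2] *= 1.1                                    # first adjustment parameter on one phoneme (assumed)

adjusted_features = speech_features * accent_scale        # parameter adjusting section: first characteristic parameter
durations_to_adjust = torch.round(predicted_durations * duration_scale)  # second characteristic parameter to be adjusted
frames = length_regulate(adjusted_features, durations_to_adjust)         # adjusted speech feature information
```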
In a possible implementation manner, the adjustment parameter includes an emotion tag, and the generating unit is specifically configured to:
determining a second characteristic parameter after normalization processing corresponding to the voice characteristic information according to the voice characteristic information;
and determining a second characteristic parameter corresponding to the voice characteristic information according to the emotion characteristic parameter corresponding to the emotion label and the normalized second characteristic parameter.
In a possible implementation manner, the adjustment parameters include an emotion tag and an emotion degree parameter, the emotion degree parameter is used for identifying a degree of adjusting a pronunciation mode of the text information to be synthesized in voice information to an emotion identified by the emotion tag, and the generating unit is specifically configured to:
determining emotion characteristic information corresponding to the emotion tag;
and generating voice characteristic information corresponding to the text information to be synthesized according to the emotion characteristic information, the emotion degree parameter, the phoneme characteristic information, the semantic characteristic information and the prosody characteristic information.
In one possible implementation, the apparatus further includes a display unit:
the display unit is used for displaying an information input interface to the voice synthesis object, wherein the information input interface is used for inputting text information to be synthesized and adjustment parameters;
The acquisition unit is specifically configured to:
and acquiring, through the information input interface, the text information to be synthesized input by the voice synthesis object and the adjusting parameters corresponding to the text information to be synthesized.
In a fifth aspect, embodiments of the present application disclose a computer device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to execute the model training method according to any one of the first aspect or the model application method according to any one of the second aspect according to instructions in the program code.
In a sixth aspect, an embodiment of the present application discloses a computer readable storage medium, where the computer readable storage medium is configured to store a computer program, where the computer program is configured to execute the model training method according to any one of the first aspects, or the model application method according to any one of the second aspects.
In a seventh aspect, embodiments of the present application disclose a computer program product comprising instructions which, when run on a computer, cause the computer to perform the model training method according to any of the first aspects, or the model application method according to any of the second aspects.
According to the technical scheme, when model training is carried out, a sample text information set for model training is first obtained. The sample text information set comprises a plurality of sample text information, each having corresponding sample voice information and sample adjusting parameters, wherein the sample voice information is generated based on the sample text information and the sample adjusting parameters; that is, the sample voice information matches the adjustment, by the adjusting parameters, of the pronunciation mode of the sample text information in the voice information. In the voice information synthesis process, the plurality of sample text information are respectively used as target sample text information, and voice characteristic information corresponding to the target sample text information is generated from the target sample text information through an initial voice synthesis model; the voice characteristic information is used for identifying the pronunciation mode of the target sample text information in the voice information. By adjusting the voice characteristic information with the target sample adjusting parameters corresponding to the target sample text information, the pronunciation mode of the target sample text information can be adjusted in the same way as in the target sample voice information corresponding to the target sample text information. The initial speech synthesis model then generates undetermined speech information corresponding to the target sample text information from the adjusted speech characteristic information, so the difference between the undetermined speech information and the target sample speech information reflects the accuracy of the initial speech synthesis model when it directly synthesizes speech information based on text information and adjustment parameters. The speech synthesis model obtained by adjusting the parameters of the initial speech synthesis model based on this difference can therefore directly synthesize, from the text information to be synthesized and the corresponding adjustment parameters, the corresponding speech information, so that the speech information both meets the requirement of the adjustment parameters on the pronunciation mode and fits the overall speech pronunciation characteristics of the text information to be synthesized. On the premise of guaranteeing accurate adjustment of the speech information, the authenticity of the adjusted speech information is improved, and the speech synthesis effect is thereby improved.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of a model training method in a practical application scenario provided in an embodiment of the present application;
FIG. 2 is a flowchart of a model training method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a model training method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a model training method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a model training method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a model training method according to an embodiment of the present application;
FIG. 7 is a flowchart of a method for model application according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a method for applying a model according to an embodiment of the present application;
FIG. 9 is a schematic diagram of an information input interface according to an embodiment of the present application;
Fig. 10 is a schematic diagram of a model training method in a practical application scenario according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a model training method in a practical application scenario according to an embodiment of the present application;
FIG. 12 is a block diagram of a model training apparatus according to an embodiment of the present application;
FIG. 13 is a block diagram of a model application device according to an embodiment of the present application;
fig. 14 is a block diagram of a terminal according to an embodiment of the present application;
fig. 15 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
When obtaining the voice information corresponding to the text information to be synthesized through voice synthesis, in order to improve the effect of the voice information or meet various voice synthesis requirements of the voice synthesis object, the voice synthesis object generally performs various adjustments on the voice information, such as adjusting intonation, pause and the like of pronunciation in the voice information.
In the related art, since the speech synthesis model used to generate voice information has no capability of adjusting the voice information, the voice synthesis object needs to adjust the voice information after the speech synthesis model outputs it. Because the voice information in the related art is adjusted only after the voice synthesis is finished, it can only be adjusted according to the pronunciation mode of specific characters, without reference to the text characteristics of the whole text information to be synthesized. It is therefore difficult to adjust the voice information in combination with the context information in the text information; the adjusted voice information matches the text characteristics of the text information to be synthesized poorly, lacks a sense of reality, and the voice synthesis effect is poor.
In order to solve the above technical problems, the embodiment of the application provides a model training method. In the model training process, the model directly combines the text information and the adjusting parameters to generate undetermined voice information, and the model parameters are then adjusted based on the difference between the undetermined voice information and the sample voice information. In this way the model can learn how to synthesize voice information in combination with the adjusting parameters while keeping the voice information close to the accurate sample voice information corresponding to the text information. The trained model can thus accurately synthesize voice information by combining the adjusting parameters and the text information, voice synthesis and parameter adjustment are carried out synchronously, and the parameter adjustment can be combined with the text characteristics of the text information, thereby improving the authenticity of the voice information and the voice synthesis effect.
It will be appreciated that the method may be applied to a processing device that is capable of model training, for example, a terminal device or a server having model training functionality. The method can be independently executed by the terminal equipment or the server, can also be applied to a network scene of communication between the terminal equipment and the server, and is executed by the cooperation of the terminal equipment and the server. The terminal equipment can be a computer, a mobile phone and other equipment. The server can be understood as an application server or a Web server, and can be an independent server or a cluster server in actual deployment.
The present application also relates to artificial intelligence (Artificial Intelligence, AI) technology, which is a theory, method, technique and application system that simulates, extends and expands human intelligence, senses environment, acquires knowledge and uses knowledge to obtain optimal results using a digital computer or a machine controlled by a digital computer. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
The artificial intelligence technology is a comprehensive subject involving a wide range of fields, covering both hardware-level technologies and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, machine learning/deep learning and other directions. The present application mainly relates to the speech processing, natural language processing and machine learning technologies among them.
The key technologies of speech technology (Speech Technology) are automatic speech recognition (ASR), speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of human-computer interaction in the future, and voice will become one of the best human-computer interaction modes in the future.
Natural language processing (Natural Language Processing, NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between people and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field involves natural language, i.e. the language that people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graph techniques, and the like.
Machine Learning (ML) is a multi-domain interdisciplinary, involving multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, etc. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, confidence networks, reinforcement learning, transfer learning, induction learning, teaching learning, and the like.
In this application, natural language processing technology can be applied to extract text information in a plurality of text feature dimensions, machine learning technology can be used in the model training and parameter adjusting parts, and the overall speech synthesis can be realized through speech technology.
In order to facilitate understanding of the technical solution provided by the embodiments of the present application, a model training method provided by the embodiments of the present application will be described below with reference to an actual application scenario.
Referring to fig. 1, fig. 1 is a schematic diagram of a model training method in an actual application scenario, where a processing device is a model training server 101 with a model training function, provided by an embodiment of the present application.
The sample text information set acquired by the model training server 101 includes N sample text information, i.e., sample text information 1, sample text information 2, …, sample text information N, and each sample text information has corresponding sample voice information and sample adjustment parameters. Each of the N sample text information is in turn used as target sample text information; the model training server 101 inputs the target sample text information and the target sample adjustment parameters corresponding to it into an initial speech synthesis model, and speech feature information corresponding to the target sample text information can be determined through the initial speech synthesis model. The speech feature information is used to identify the pronunciation mode of the target sample text information in the speech information; that is, the speech information is generated according to the pronunciation mode identified in the speech feature information.
In order to enable the initial speech synthesis model to learn how to generate final speech information directly based on the adjustment parameters, in the model training process, the speech feature information can be adjusted according to the adjustment parameters of the target sample through the initial speech synthesis model to obtain adjusted speech feature information, and then undetermined speech information is generated according to the adjusted speech feature information through the initial speech synthesis model.
Because the target sample voice information is accurate and real voice information that corresponds to the target sample text information after being adjusted based on the target sample adjusting parameters, it matches the text expression of the target sample text information. The model training server 101 can therefore adjust the model parameters corresponding to the initial voice synthesis model according to the difference between the undetermined voice information and the target sample voice information, so that the undetermined voice information output by the initial voice synthesis model becomes close to the target sample voice information. In this way the initial voice synthesis model learns how to combine the text information and the adjusting parameters to generate voice information that matches the text expression of the text information and whose pronunciation mode is adjusted as identified by the adjusting parameters, which improves the sense of reality of the generated voice information and the voice synthesis effect. After the speech synthesis model is obtained through training, as shown in fig. 1, relatively real and accurate speech information corresponding to the text information to be synthesized can be generated directly by the speech synthesis model according to the text information to be synthesized and the adjustment parameters, and subsequent adjustment of the speech information is not needed.
Next, a model training method provided by an embodiment of the present application will be described with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a flowchart of a model training method according to an embodiment of the present application, where the method includes:
s201: a sample text information set is obtained.
First, the processing device obtains a sample text information set for model training. The sample text information set includes a plurality of sample text information, each having corresponding sample voice information and sample adjustment parameters, and the sample voice information is generated based on the sample adjustment parameters. That is, the sample adjustment parameters identify the adjustment mode, such as the adjustment direction and adjustment strength, of the pronunciation mode of the sample text information in the voice information, and the sample voice information is real and accurate voice information that corresponds to the sample text information and fits its text expression.
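For illustration, one possible way to represent an element of such a sample text information set in code; the field names, the use of a wav path, and the example values are assumptions, not part of the disclosed embodiments.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class TrainingSample:
    """One element of the sample text information set (S201)."""
    sample_text: str                          # sample text information
    sample_speech_path: str                   # sample voice information, e.g. a recorded wav file
    sample_adjust_params: Dict[str, float]    # sample adjusting parameters, e.g. {"duration": 1.2}
    sample_emotion_tag: Optional[str] = None  # optional sample emotion label (introduced later)

sample_set = [
    TrainingSample("今天天气真好", "sample_0001.wav",
                   {"duration": 1.1, "intonation": 0.9}, "lively"),
]
```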
S202: and respectively taking the plurality of sample text information as target sample text information, and generating voice characteristic information corresponding to the target sample text information according to the target sample text information through an initial voice synthesis model.
In the related art, a speech synthesis model generates speech information based only on the text information to be synthesized, and the speech information can only be adjusted on the basis of the already generated speech information. The text features and text expressions of the text information to be synthesized are referred to only in the generation process of the speech information, so the adjustment in the related art can hardly fit the text expression of the text information to be synthesized; the adjusted speech information therefore differs considerably from the text expression of the text information to be synthesized and lacks realism.
In order to solve the above technical problems, the initial speech synthesis model in the present application does not generate speech information from the sample text information alone, but from the sample text information together with the corresponding sample adjustment parameters, so that the adjustment expressed by the sample adjustment parameters is embodied in the generation process of the speech information rather than applied after the speech information has been generated. Because the speech information is generated based on the text characteristics and the text expression of the sample text information, the adjustment can fit the text expression of the sample text information, which enhances the sense of reality of the speech information.
The processing device sequentially extracts each sample text information in the sample text information set as target sample text information, and inputs the target sample text information and target sample adjusting parameters corresponding to the target sample text information into the initial speech synthesis model, namely the target sample text information can be any one of a plurality of sample text information. Firstly, through the initial speech synthesis model, speech feature information corresponding to the target sample text information can be generated according to the target sample text information, the speech feature information is used for identifying the pronunciation mode of the target sample text information in the speech information, for example, the pronunciation, intonation, pause, fluctuation and the like of characters in the target sample text information in the speech information can be determined based on the speech feature information, so that the speech information corresponding to the target sample text information can be generated based on the speech feature information.
It should be emphasized that the speech feature information is not the speech information finally output by the model; the speech information is generated, based on the model parameters, from the pronunciation mode identified in the speech feature information, and this generation process combines the text features and the text expression corresponding to the text information.
S203: and adjusting the voice characteristic information according to the target sample adjusting parameters corresponding to the target sample text information through the initial voice synthesis model to obtain the adjusted voice characteristic information.
As described above, the pronunciation mode of the text information in the voice information can be identified through the voice feature information. Therefore, in order to make the model learn how to adjust the pronunciation mode through the adjustment parameters in the process of synthesizing the voice information, the processing device can, before generating the voice information, adjust the voice feature information through the initial voice synthesis model according to the target sample adjustment parameters corresponding to the target sample text information, thereby adjusting the pronunciation mode of the target sample text information in the final voice information and obtaining the adjusted voice feature information.
S204: and generating undetermined voice information corresponding to the text information of the target sample according to the adjusted voice characteristic information through the initial voice synthesis model.
Through the initial speech synthesis model, the processing device can analyze the pronunciation mode of the target sample text information in the speech information based on the adjusted speech feature information, so that undetermined speech information corresponding to the target sample text information can be generated according to that pronunciation mode. The undetermined speech information is the speech information generated, based on the target sample text information and the target sample adjustment parameters, by the initial speech synthesis model that has not yet been trained.
It can be understood that, because the initial speech synthesis model generates the speech information based on the adjusted speech feature information, and the adjusted speech feature information is obtained by adjusting the speech feature information based on the target sample adjustment parameter, the parameter adjustment in the speech synthesis model of the present application is performed during the generation of the speech information, and the model does not need to perform parameter adjustment after the speech information has been generated.
S205: and according to the difference between the to-be-determined voice information and the target sample voice information corresponding to the target sample text information, adjusting model parameters corresponding to the initial voice synthesis model to obtain a voice synthesis model.
Because the target sample voice information is real and accurate voice information that corresponds to the target sample text information and satisfies the adjustment identified by the target sample adjusting parameter, the difference between the undetermined voice information and the target sample voice information reflects both how accurately and realistically the undetermined voice information fits the text expression of the target sample text information and how accurately the initial voice synthesis model adjusts parameters based on the target sample adjusting parameter. The processing device can adjust the model parameters corresponding to the initial voice synthesis model according to this difference, so that the undetermined voice information output by the adjusted initial voice synthesis model gradually approaches the target sample voice information. The model thereby learns how to accurately adjust the pronunciation mode of the target sample text information in the voice information based on the target sample adjustment parameters, and how to generate, in combination with the parameter adjustment, voice information that matches the text characteristics and text expression of the target sample text information. The voice synthesis model obtained through training therefore has the capability of directly combining text information and adjustment parameters to generate voice information, avoiding adjustment after the voice information is generated; the voice information generated by the trained voice synthesis model can be combined with the expressed meaning of the text information while meeting the voice adjustment requirement of the adjustment parameters, which improves the authenticity and accuracy of the generated voice information.
Therefore, the voice synthesis model can be used for synthesizing voice information according to the text information to be synthesized and the adjusting parameters corresponding to the text information to be synthesized, wherein the text information to be synthesized is any text information needing to be synthesized by voice, the adjusting parameters are used for reflecting the requirement of voice synthesis objects on adjusting the pronunciation mode of the text information to be synthesized in the voice information when the voice synthesis objects are synthesized by the voice information, and the voice synthesis objects are objects initiating voice synthesis.
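Putting S201 to S205 together, a single training step might look like the toy sketch below, assuming a PyTorch-style implementation; the encoder/decoder structure, the multiplicative adjustment and the L1 loss are stand-ins chosen only to illustrate the flow and are not the disclosed model.

```python
import torch
import torch.nn as nn

class InitialSpeechSynthesisModel(nn.Module):
    """Toy stand-in for the initial speech synthesis model: it encodes the text
    into speech feature information (S202), adjusts it with the target sample
    adjusting parameter (S203), and decodes undetermined speech information,
    here a mel-spectrogram-like tensor (S204)."""
    def __init__(self, vocab_size=100, feat_dim=128, mel_dim=80):
        super().__init__()
        self.encoder = nn.Embedding(vocab_size, feat_dim)
        self.decoder = nn.Linear(feat_dim, mel_dim)

    def forward(self, token_ids: torch.Tensor, adjust_scale: float) -> torch.Tensor:
        speech_features = self.encoder(token_ids)           # S202: speech feature information
        adjusted_features = speech_features * adjust_scale   # S203: toy multiplicative adjustment
        return self.decoder(adjusted_features)               # S204: undetermined speech information

model = InitialSpeechSynthesisModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

token_ids = torch.randint(0, 100, (1, 12))   # target sample text information as dummy token ids
target_speech = torch.randn(1, 12, 80)       # target sample voice information as a dummy mel target
sample_adjust_scale = 1.2                    # target sample adjusting parameter (assumed multiplicative form)

optimizer.zero_grad()
pending_speech = model(token_ids, sample_adjust_scale)
loss = nn.functional.l1_loss(pending_speech, target_speech)   # S205: difference between the two
loss.backward()
optimizer.step()                              # adjust the model parameters of the initial model
```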
According to the technical scheme, when model training is carried out, a sample text information set for model training is first obtained. The sample text information set comprises a plurality of sample text information, each having corresponding sample voice information and sample adjusting parameters, wherein the sample voice information is generated based on the sample text information and the sample adjusting parameters; that is, the sample voice information matches the adjustment, by the adjusting parameters, of the pronunciation mode of the sample text information in the voice information. In the voice information synthesis process, the plurality of sample text information are respectively used as target sample text information, and voice characteristic information corresponding to the target sample text information is generated from the target sample text information through an initial voice synthesis model; the voice characteristic information is used for identifying the pronunciation mode of the target sample text information in the voice information. By adjusting the voice characteristic information with the target sample adjusting parameters corresponding to the target sample text information, the pronunciation mode of the target sample text information can be adjusted in the same way as in the target sample voice information corresponding to the target sample text information. The initial speech synthesis model then generates undetermined speech information corresponding to the target sample text information from the adjusted speech characteristic information, so the difference between the undetermined speech information and the target sample speech information reflects the accuracy of the initial speech synthesis model when it directly synthesizes speech information based on text information and adjustment parameters. The speech synthesis model obtained by adjusting the parameters of the initial speech synthesis model based on this difference can therefore directly synthesize, from the text information to be synthesized and the corresponding adjustment parameters, the corresponding speech information, so that the speech information both meets the requirement of the adjustment parameters on the pronunciation mode and fits the overall speech pronunciation characteristics of the text information to be synthesized. On the premise of guaranteeing accurate adjustment of the speech information, the authenticity of the adjusted speech information is improved, and the speech synthesis effect is thereby improved.
It will be appreciated that the feature information that can be analyzed from text information may include a variety of types. For example, when the text information is Chinese text information, the phoneme features corresponding to the text information can be analyzed through the pinyin of the characters, and the pause features, interval features and the like of the text information when it is pronounced in the voice information can be analyzed through the word combinations of the characters; feature information of these dimensions influences, to a certain extent, the pronunciation mode of the text information in the voice information.
Based on this, in one possible implementation, to improve the accuracy of the voice feature information, the processing device may determine the voice feature information corresponding to the text information by combining feature information of multiple dimensions. When generating the voice feature information corresponding to the target sample text information according to the target sample text information, the processing device may determine phoneme feature information, semantic feature information and prosodic feature information corresponding to the target sample text information. The phoneme feature information is used to identify the phoneme composition corresponding to the target sample text information; phonemes are the smallest speech units divided according to the natural attributes of speech, analyzed according to the pronunciation actions in syllables, with one action forming one phoneme. For example, the Chinese syllable "ā" has only one phoneme, "ài" (爱) has two phonemes, "dài" (代) has three phonemes, and so on.
The semantic feature information is used to identify the semantics corresponding to the target sample text information. It can be understood that, under different semantics, the same text information may have different pronunciation modes in the voice information. For example, when the text information "I go" expresses the action "I go to some place", the character "I" is pronounced in the third tone in the common pronunciation mode, whereas when "I go" is used as an exclamation ("I go! He is too much!"), the character is pronounced in the fourth tone in the common pronunciation mode, and the pronunciation speed of the text information also differs between the two expressions. Therefore, in order to improve the accuracy of the pronunciation manner identified by the speech feature information, the processing device may determine the speech feature information corresponding to the target sample text information in combination with the semantics of the target sample text information, where the semantic feature information may be determined, for example, through a BERT (Bidirectional Encoder Representations from Transformers) model.
The prosody feature information is used to identify the pronunciation prosody corresponding to the target sample text information, such as word segmentation, prosodic phrases and prosodic words. With the prosody feature information, the finally generated voice information can better fit the real prosody with which the target sample text information would be expressed in speech.
In summary, the processing device may generate, according to the phoneme feature information, the semantic feature information and the prosodic feature information, the speech feature information corresponding to the target sample text information, so that the pronunciation manner identified by the speech feature information fits the phoneme composition, the text semantics and the prosodic features of the target sample text information at the same time. Speech information generated based on this speech feature information is then accurate in pronunciation because it fits the phoneme composition, and realistic in pronunciation because it fits the text semantics and prosodic features.
As shown in fig. 3, fig. 3 is a schematic diagram of a model training method provided in an embodiment of the present application, in which a speech feature information determining part in an initial speech synthesis model includes a deep learning model encoder (Transformer encoder), a transformer encoder (BERT encoder) and an embedding layer (Embedded), after target sample text information and target sample adjustment parameters are input into the initial speech synthesis model, the initial speech synthesis model may first determine a phoneme level feature (Phoneme level features), a character level feature (Character level features) and a Word and phrase level feature (Word & phrase level features) corresponding to the target sample text information, the phoneme level feature may be a phoneme composition of the target sample text information, the character level feature may be a character included in the target sample text information, the Word and phrase level feature may be a Word and phrase included in the target sample text information, and a prosody when the target sample text information is expressed by speech may be analyzed through the Word and phrase.
Phoneme feature information corresponding to the phoneme level features may be generated by the deep learning encoder, semantic feature information corresponding to the character level features may be generated by the transformer encoder, and prosodic feature information corresponding to the word and phrase level features may be generated by the embedding layer. The three parts of feature information are fused to obtain the speech feature information; the speech feature information is then adjusted based on the target sample adjustment parameters to obtain the adjusted speech feature information, and the pending speech information is generated based on the adjusted speech feature information.
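As an illustration of how the three feature streams might be encoded and fused, the following is a minimal sketch assuming PyTorch; the module names, dimensions and fusion by summation are assumptions for illustration, not the fixed implementation of the embodiment.

```python
import torch
import torch.nn as nn

class SpeechFeatureEncoder(nn.Module):
    """Sketch of the speech feature determining part: phoneme, character and
    word/phrase level features are encoded separately and then fused."""
    def __init__(self, n_phonemes, n_chars, n_prosody_tags, d_model=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        # phoneme level features -> Transformer encoder
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.phoneme_encoder = nn.TransformerEncoder(layer, num_layers=4)
        # character level features -> BERT-style encoder (simple stand-in here)
        self.char_emb = nn.Embedding(n_chars, d_model)
        self.char_encoder = nn.TransformerEncoder(layer, num_layers=4)
        # word & phrase level features (prosody tags) -> embedding layer
        self.prosody_emb = nn.Embedding(n_prosody_tags, d_model)

    def forward(self, phonemes, chars, prosody_tags):
        phon = self.phoneme_encoder(self.phoneme_emb(phonemes))
        sem = self.char_encoder(self.char_emb(chars))
        pros = self.prosody_emb(prosody_tags)
        # fuse the three streams; summation assumes the streams have been
        # aligned to the same sequence length (e.g. upsampled to phoneme level)
        return phon + sem + pros
```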
It will be appreciated that, when communicating by voice, if the emotion of the communicating party is different, the pronunciation manner will also differ when the same information is expressed by voice. For example, a voice message uttered in a "lively" emotion is typically faster and higher in intonation than one uttered in a "flat" emotion. Based on this, in order to further improve the authenticity and diversity of the synthesized speech information and meet diversified speech synthesis requirements, in one possible implementation the processing device may also train the initial speech synthesis model in combination with emotion information.
Firstly, a corresponding sample emotion tag can be added to the sample text information, so that each piece of sample text information in the sample text information set has a corresponding sample emotion tag. The sample emotion tag identifies the emotion bias of the sample text information, that is, the emotion conveyed when the sample text information is expressed through the sample voice information. The sample emotion tags can be obtained through manual analysis or through various automatic text emotion analysis methods, for example by identifying the sample text information with an emotion recognition model. In addition, when the sample voice information corresponding to the sample text information is recorded manually, the recorder may record the voice information based on a specific emotion, and the sample emotion tag of the sample text information can be determined based on that emotion.
The initial speech synthesis model may include initial emotion feature information corresponding to each of a plurality of emotion labels. After parameter adjustment during model training, the initial emotion feature information becomes the emotion feature information corresponding to the respective emotion labels. The emotion feature information is used to shift the pronunciation manner identified by the speech feature information toward the corresponding emotion, so that the adjusted speech feature information can convey that emotion through the pronunciation manner. For example, after the speech feature information is adjusted by the emotion feature information corresponding to the emotion label "angry", the finally generated speech information can have a higher intonation and a faster speaking speed, thereby reflecting the "angry" emotion.
When generating the voice feature information corresponding to the target sample text information according to the target sample text information, the processing device may determine the target initial emotion feature information corresponding to the target sample emotion tag corresponding to the target sample text information, and then generate the voice feature information corresponding to the target sample text information according to the target initial emotion feature information and the target sample text information, so that the pronunciation mode identified by the voice feature information can represent the emotion corresponding to the target sample emotion tag, as shown in fig. 4.
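As a simple illustration only, the following sketch shows one way the emotion feature information could condition the speech feature information; the embedding-table lookup, additive conditioning and the optional degree factor are assumptions for illustration.

```python
import torch
import torch.nn as nn

class EmotionConditioner(nn.Module):
    """Sketch: one trainable emotion feature vector per emotion label,
    looked up by the sample emotion tag and added to the speech features."""
    def __init__(self, num_emotions, d_model=256):
        super().__init__()
        # "initial emotion feature information" for each emotion label
        self.emotion_table = nn.Embedding(num_emotions, d_model)

    def forward(self, speech_features, emotion_id, degree=1.0):
        # look up the target emotion vector, optionally scale it by an
        # emotion degree parameter, and add it to every frame
        emo = self.emotion_table(emotion_id) * degree      # (B, d_model)
        return speech_features + emo.unsqueeze(1)          # broadcast over time
```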
In order to enable the emotion feature information to accurately reflect the characteristics of the corresponding emotion in the pronunciation manner, when the speech synthesis model is obtained by adjusting the model parameters of the initial speech synthesis model according to the difference between the pending speech information and the target sample speech information corresponding to the target sample text information, the processing device may adjust both the model parameters of the initial speech synthesis model and the target initial emotion feature information according to that difference. Because the emotion identified by the target sample emotion tag is the exact emotion expressed by the target sample speech information, the difference between the target sample speech information and the pending speech information reflects how accurately the initial speech synthesis model shifts the pronunciation manner identified by the speech feature information toward the emotion corresponding to the target sample emotion tag when generating the speech feature information from the target initial emotion feature information. With this training manner, the trained speech synthesis model contains accurate emotion feature information for each emotion, which can be used to generate speech feature information that truly reflects the pronunciation manner under each emotion, and the speech synthesis model learns how to accurately generate speech feature information based on the emotion feature information.
Thus, through this adjustment, the speech synthesis model may include emotion feature information corresponding to each of the plurality of emotion tags. The emotion feature information can be used to synthesize speech information according to the text information to be synthesized, the adjustment parameters corresponding to the text information to be synthesized and the emotion tag corresponding to the text information to be synthesized, so that, while fitting the textual expression of the text to be synthesized, the speech information satisfies the adjustment of the pronunciation manner required by the adjustment parameters and at the same time reflects the emotion corresponding to the emotion tag, enriching the realism of the speech information.
It can be understood that, when performing parameter adjustment on the speech feature information, some feature parameters exist directly in the speech feature information, and the model can adjust them directly through the adjustment parameters. For example, the break parameter is used to control pauses in the speech information, and this feature parameter can be adjusted directly by inserting information intervals in the speech feature information; the accent parameter is used to control stressed and unstressed pronunciation in the speech information, and can be adjusted directly by changing the volume corresponding to specific text in the speech feature information; the trailing tone parameter is used to control whether a trailing tone appears in the speech information (i.e. whether the pronunciation of certain text information is dragged out), and can be adjusted directly by changing the amount of information corresponding to the trailing tone of specific text in the speech feature information.
Other feature parameters do not exist directly in the speech feature information and need to be obtained through an overall analysis of the speech feature information, such as the intonation, duration and fluctuation characteristics of the speech information. For the adjustment of these feature parameters, the model must therefore first be able to accurately analyze the feature parameters of the speech information before adjustment. Next, how the parameter adjustment capability of the model can be trained is described for these two different classes of feature parameters.
In a possible implementation manner, the target sample adjustment parameter includes a first adjustment parameter, where the first adjustment parameter is used to adjust a first feature parameter included in the voice feature information, that is, the first feature parameter may be directly obtained from the voice feature information. When voice characteristic information is regulated according to target sample regulation parameters corresponding to target sample text information through an initial voice synthesis model to obtain regulated voice characteristic information, processing equipment can regulate first characteristic parameters included in the voice characteristic information according to first regulation parameters through the initial voice synthesis model to obtain regulated voice characteristic information.
The first adjustment parameter may include any one or a combination of a trailing tone control parameter, an accent control parameter and a break control parameter. The trailing tone control parameter is used to adjust the trailing tone parameter in the first feature parameters, the accent control parameter is used to adjust the accent parameter in the first feature parameters, and the break control parameter is used to adjust the break parameter in the first feature parameters; the effects of the trailing tone parameter, the accent parameter and the break parameter are described above. As shown in fig. 5, after generating the speech feature information by combining the feature information of multiple dimensions, the initial speech synthesis model may obtain the adjusted speech feature information according to the first adjustment parameters (including the accent control parameter, the trailing tone control parameter and the break control parameter) in the target sample adjustment parameters.
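As an illustration only, the following sketch shows one way such first adjustment parameters could act directly on the speech feature sequence; the per-token on/off flags and the learned control embeddings are assumptions, not the embodiment's fixed implementation.

```python
import torch
import torch.nn as nn

class FirstParamAdjuster(nn.Module):
    """Sketch: the first feature parameters (break, accent, trailing tone)
    exist directly in the speech features, so control embeddings are simply
    added at the positions the control parameters mark."""
    def __init__(self, d_model=256):
        super().__init__()
        self.emphasis_emb = nn.Embedding(2, d_model)   # accent on/off per token
        self.stretch_emb = nn.Embedding(2, d_model)    # trailing tone on/off
        self.interrupt_emb = nn.Embedding(2, d_model)  # break on/off

    def forward(self, speech_features, accent_flags, stretch_flags, break_flags):
        # each *_flags tensor is (B, T) with 0/1 markers aligned to the features
        return (speech_features
                + self.emphasis_emb(accent_flags)
                + self.stretch_emb(stretch_flags)
                + self.interrupt_emb(break_flags))
```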
In one possible implementation manner, the target sample adjustment parameter includes a second adjustment parameter, where the second adjustment parameter is used to adjust a second feature parameter determined according to the voice feature information, and the voice feature information does not include the second feature parameter, that is, the model cannot directly adjust the second feature parameter in the voice feature information based on the second adjustment parameter. Therefore, the processing device needs to train the model first to have the capability of accurately analyzing the second characteristic parameters corresponding to the voice characteristic information.
In an embodiment of the present application, the initial speech synthesis model includes a parameter prediction portion, where the parameter prediction portion is configured to determine a second feature parameter corresponding to the speech feature information. The target sample text information has a corresponding sample second feature parameter, and the sample second feature parameter is an accurate second feature parameter corresponding to the voice feature information generated based on the target sample text information.
The processing device may determine, through the parameter prediction part, a pending second feature parameter corresponding to the speech feature information, where the pending second feature parameter is the second feature parameter of the speech feature information as analyzed by the model. When adjusting the speech feature information according to the target sample adjustment parameters corresponding to the target sample text information through the initial speech synthesis model to obtain the adjusted speech feature information, the processing device may train parameter prediction and parameter adjustment separately. This avoids the problem of low training efficiency caused by inaccurate predicted second feature parameters, for example the situation in which the difference between the generated speech information and the sample speech information remains large even though the parameter adjustment capability itself is accurate.
The processing device may determine the second feature parameter to be adjusted according to the second feature parameter of the sample and the second adjustment parameter, and because the second feature parameter of the sample and the second adjustment parameter are both accurate parameters corresponding to the text information of the target sample, the second feature parameter to be adjusted is an accurate feature parameter corresponding to the voice information of the target sample. The processing equipment can adjust the voice characteristic information according to the second characteristic parameter to be adjusted through the initial voice synthesis model to obtain the adjusted voice characteristic information, so that the adjusted voice characteristic information meets the second characteristic parameter to be adjusted.
When adjusting the model parameters of the initial speech synthesis model according to the difference between the pending speech information and the target sample speech information corresponding to the target sample text information to obtain the speech synthesis model, note that the pending second feature parameter is the second feature parameter determined by the model, while the sample second feature parameter is the accurate second feature parameter corresponding to the target sample text information. The difference between the pending second feature parameter and the sample second feature parameter therefore reflects the accuracy of the parameter prediction part in predicting the second feature parameter, and the processing device may adjust the model parameters of the parameter prediction part according to this difference.
Meanwhile, the processing device may adjust the model parameters of the initial speech synthesis model other than the parameter prediction part according to the difference between the pending speech information and the target sample speech information corresponding to the target sample text information, to obtain the speech synthesis model. Because the pending speech information is generated based on the accurate second feature parameters to be adjusted, this difference specifically reflects the accuracy of the model in generating speech feature information, adjusting parameters and generating speech information from the speech feature information, so the initial speech synthesis model can be trained in a targeted manner.
The second adjustment parameter may include any one or a combination of a duration control parameter, an intonation control parameter and a fluctuation control parameter, where the duration control parameter is used to adjust the duration parameter in the second feature parameters, the intonation control parameter is used to adjust the intonation parameter in the second feature parameters, and the fluctuation control parameter is used to adjust the fluctuation parameter in the second feature parameters. The duration parameter controls the pronunciation duration of the text information in the speech information, the intonation parameter controls the pronunciation intonation of the text information in the speech information, and the fluctuation parameter controls the rise and fall of the pronunciation of the text information in the speech information. As shown in fig. 6, the second adjustment parameters include an intonation control parameter, a fluctuation control parameter and a duration control parameter, and the sample second feature parameters include a sample intonation parameter, a sample fluctuation parameter and a sample duration parameter. The parameter prediction part of the initial speech synthesis model includes a duration adapter (Duration adapter), an intonation adapter (Pitch adapter) and a fluctuation adapter (Range adapter), which are respectively configured to determine the pending duration parameter, the pending intonation parameter and the pending fluctuation parameter corresponding to the speech feature information. The intonation parameter to be adjusted, the fluctuation parameter to be adjusted and the duration parameter to be adjusted are determined based on the sample second feature parameters and the second adjustment parameters and, together with the first adjustment parameters, are used to generate the adjusted speech feature information. Even when the second adjustment parameters require no adjustment, the training of the parameter prediction part can still be carried out, and adjusting the speech feature information with the sample second feature parameters still trains the model's parameter adjustment capability.
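The following sketch illustrates, under stated assumptions, one possible form of such an adapter and the way the parameter actually applied could differ between training and application: during training the accurate sample parameter is scaled by the control parameter, while at inference the adapter's prediction is scaled instead. The small convolutional architecture and function names are illustrative only.

```python
import torch
import torch.nn as nn

class VarianceAdapter(nn.Module):
    """Sketch of one adapter (duration / intonation / fluctuation): a small
    conv stack predicting one scalar parameter per frame of the features."""
    def __init__(self, d_model=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, 1, kernel_size=1),
        )

    def forward(self, speech_features):            # (B, T, d_model)
        x = speech_features.transpose(1, 2)        # (B, d_model, T)
        return self.net(x).squeeze(1)              # pending parameter, (B, T)

def second_param_to_apply(sample_param, predicted_param, control_param, training):
    """Assumed training trick: scale the accurate sample parameter during
    training, and the adapter's predicted parameter during application."""
    base = sample_param if training else predicted_param
    return base * control_param
```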
In the parameter prediction process, the more complex the value of the parameter and the larger the variation amplitude, the higher the prediction difficulty is generally, based on which, in one possible implementation manner, in order to reduce the training difficulty of the parameter prediction part in the model, the processing device may perform normalization processing on the second characteristic parameter to reduce the variation amplitude of the parameter, thereby reducing the prediction difficulty of the parameter.
As described above, the pronunciation modes of the voice information are different in different emotions, and similarly, the voice information corresponding to the same emotion is generally close in pronunciation modes, so when the second characteristic parameters are normalized, the processing device can analyze the commonality of the second characteristic parameters of the voice characteristic information of the same emotion, thereby obtaining the standard of normalization processing.
The processing device may determine the emotion tag corresponding to each sample text information according to the above multiple emotion tag determining manners, and at the same time, the processing device may obtain multiple text information corresponding to the same emotion tag, and determine, by analyzing accurate voice feature information corresponding to the text information, an emotion feature parameter corresponding to the emotion tag, where the emotion feature parameter is used as a reference for normalizing a second feature parameter of the voice feature information corresponding to the text information corresponding to the emotion tag. For example, the emotion feature parameter may be a mean and a variance of a second feature parameter corresponding to the voice feature information under the emotion label.
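A minimal sketch of how the emotion feature parameters could be computed from the samples of one emotion tag is given below; the input layout and function name are assumptions for illustration.

```python
from collections import defaultdict
import numpy as np

def emotion_feature_params(samples):
    """Sketch: per-emotion mean and scale of a second feature parameter
    (here the intonation/pitch values), computed over all samples sharing
    an emotion tag. `samples` is assumed to be an iterable of
    (emotion_tag, pitch_values) pairs."""
    grouped = defaultdict(list)
    for emotion_tag, pitch_values in samples:
        grouped[emotion_tag].extend(pitch_values)
    # mean and sigma per emotion tag, used as the normalization reference
    return {tag: (float(np.mean(vals)), float(np.std(vals)))
            for tag, vals in grouped.items()}
```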
In this implementation, the target sample adjustment parameter has a corresponding emotion tag. When adjusting the model parameters of the parameter prediction part according to the difference between the pending second feature parameter and the sample second feature parameter, the processing device may first determine the first emotion feature parameter corresponding to the emotion tag, and then normalize the sample second feature parameter according to the first emotion feature parameter.
The processing device may adjust the model parameter corresponding to the parameter prediction portion according to the difference between the second feature parameter to be determined and the second feature parameter of the normalized sample, so that the parameter prediction portion may determine the second feature parameter corresponding to the voice feature information after normalization. It can be understood that, since the normalized second characteristic parameter is not the second characteristic parameter that needs to be adjusted finally, in order to perform accurate parameter adjustment on the voice characteristic information, the normalized second characteristic parameter needs to be subjected to inverse normalization before performing parameter adjustment, so as to obtain the accurate second characteristic parameter.
That is, the parameter prediction part in the speech synthesis model may be configured to determine a normalized second feature parameter corresponding to the speech feature information, and determine the second feature parameter corresponding to the speech feature information according to the second emotion feature parameter corresponding to the emotion label corresponding to the text information to be synthesized and the normalized second feature parameter.
For example, because the intonation (pitch) parameters of different emotions often differ widely, the processing device may predetermine the mean and variance of the intonation parameters for each emotion. Let μ_emo denote the mean of the intonation parameters corresponding to the sad (emo) emotion and σ_emo denote the variance of the intonation parameters corresponding to the sad emotion; then the intonation parameter pitch_ori of the sad emotion is normalized to obtain the normalized intonation parameter pitch_norm as follows:

pitch_norm = (pitch_ori − μ_emo) / σ_emo
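A minimal sketch of this normalization and the corresponding inverse normalization applied before the parameter is actually used, assuming the per-emotion statistics described above; the function names are illustrative.

```python
def normalize_pitch(pitch_ori, mu_emo, sigma_emo):
    # training target for the intonation adapter: per-emotion normalized pitch
    return (pitch_ori - mu_emo) / sigma_emo

def denormalize_pitch(pitch_norm, mu_emo, sigma_emo):
    # inverse normalization before the parameter is applied to the features
    return pitch_norm * sigma_emo + mu_emo
```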
in one possible implementation, to further improve the accuracy of the speech synthesis, the processing device may analyze in combination with generating a countermeasure network (Generative Adversarial Nets, GAN) model with a higher resolution in information analysis when analyzing the differences between the pending speech information and the target sample speech information.
When adjusting the model parameters of the initial speech synthesis model according to the difference between the pending speech information and the target sample speech information corresponding to the target sample text information to obtain the speech synthesis model, the processing device may generate a first spectrogram from the pending speech information and a second spectrogram from the target sample speech information. The first spectrogram unfolds the pending speech information along the time axis, so it can represent the time-domain information characteristics of the pending speech information accurately and finely; the second spectrogram unfolds the target sample speech information along the time axis, so it can represent the time-domain information characteristics of the target sample speech information accurately and finely.
The processing device may determine a similarity parameter between the first spectrogram and the second spectrogram through a generative adversarial network discriminator (GAN discriminator). The similarity parameter is used to identify the difference between the first spectrogram and the second spectrogram; the larger the similarity parameter, the more similar the two spectrograms, i.e. the smaller the difference between the pending speech information and the target sample speech information. The processing device may adjust the model parameters of the initial speech synthesis model according to the similarity parameter to obtain the speech synthesis model, such that the similarity parameter determined based on the speech synthesis model is larger than a preset threshold, i.e. the speech information determined by the speech synthesis model is close to the sample speech information.
The generative adversarial network discriminator can be obtained by training the generative adversarial network model. During training, the processing device may add noise interference to the speech information spectrogram through a generative adversarial network generator (GAN generator), so that the discriminator analyzes the difference between the disturbed spectrogram and the spectrogram before disturbance; the trained discriminator thus has the ability to accurately identify differences between spectrograms.
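As an illustration of how a spectrogram discriminator and the associated losses might look, the following is a minimal sketch assuming PyTorch and an LSGAN-style objective; the architecture and loss form are assumptions and do not reproduce the embodiment's exact discriminator.

```python
import torch
import torch.nn as nn

class SpectrogramDiscriminator(nn.Module):
    """Sketch of a GAN discriminator that scores how close a predicted mel
    spectrogram is to the ground-truth one."""
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 128, kernel_size=5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(128, 128, kernel_size=5, padding=2), nn.LeakyReLU(0.2),
            nn.Conv1d(128, 1, kernel_size=1),
        )

    def forward(self, mel):                     # (B, n_mels, T)
        return self.net(mel).mean(dim=(1, 2))   # one similarity score per sample

def adversarial_losses(disc, mel_pred, mel_true):
    # discriminator learns to separate real from generated spectrograms;
    # the synthesis model is pushed to make the two indistinguishable
    d_loss = ((disc(mel_true) - 1) ** 2).mean() + (disc(mel_pred.detach()) ** 2).mean()
    g_loss = ((disc(mel_pred) - 1) ** 2).mean()
    return d_loss, g_loss
```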
Based on the speech synthesis model trained by the above model training method, next, the application process of the model will be described in detail.
First, an embodiment of the present application provides a method for applying a model, referring to fig. 7, and fig. 7 is a flowchart of the method for applying a model provided in the embodiment of the present application, where the method includes:
S701: acquiring the text information to be synthesized, input by the speech synthesis object, and the adjustment parameters corresponding to the text information to be synthesized.
Here the speech synthesis object is the object that needs speech synthesis, typically the provider of the adjustment parameters and the text information to be synthesized. The adjustment parameters are used to adjust the pronunciation manner of the text information to be synthesized in the speech information. The adjustment parameters may include the first adjustment parameter and the second adjustment parameter described above; since an emotion tag can be used to shift the pronunciation manner in the speech information toward the corresponding emotion, the emotion tag may also be regarded as an adjustment parameter.
S702: inputting the text information to be synthesized and the adjusting parameters corresponding to the text information to be synthesized into a voice synthesis model, and generating target voice information corresponding to the text information to be synthesized through the voice synthesis model.
With the speech synthesis model obtained by the above model training method, the processing device can directly generate the corresponding target speech information based on the text information to be synthesized and the adjustment parameters. The target speech information is speech information that fits the textual expression of the text to be synthesized and whose pronunciation manner conforms to the adjustment of the adjustment parameters, so it has high realism and accuracy. Since the speech synthesis model integrates the adjustment of the pronunciation manner by the adjustment parameters into the generation of the target speech information, no subsequent adjustment work on the target speech information is required, which improves the efficiency of adjusting the speech information.
S703: and sending the target voice information to the voice synthesis object.
After the target voice information is generated, the processing device may send the target voice information to the voice synthesis object so that the voice synthesis object can apply the target voice information.
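A minimal sketch of the application flow S701-S703 is given below; the request/response field names and the model's call signature are assumptions for illustration only.

```python
def synthesize_for_object(model, request):
    """Sketch of steps S701-S703 on the processing device."""
    # S701: obtain the text to be synthesized and its adjustment parameters
    text = request["text_to_synthesize"]
    params = request["adjustment_parameters"]   # e.g. emotion tag, speed, pitch

    # S702: generate the target speech information through the speech synthesis model
    target_speech = model.synthesize(text, params)

    # S703: return the target speech information to the speech synthesis object
    return target_speech
```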
In one possible implementation manner, when generating the target voice information corresponding to the text information to be synthesized, the processing device may determine, through the voice synthesis model, phoneme characteristic information, semantic characteristic information and prosody characteristic information corresponding to the text information to be synthesized, where the phoneme characteristic information is used to identify a phoneme composition corresponding to the text information to be synthesized, the semantic characteristic information is used to identify a semantic meaning corresponding to the text information to be synthesized, and the prosody characteristic information is used to identify a pronunciation prosody corresponding to the text information to be synthesized.
Through the speech synthesis model, the processing device may generate, according to the phoneme feature information, the semantic feature information and the prosodic feature information, the speech feature information corresponding to the text information to be synthesized by combining multi-dimensional text features, where the speech feature information identifies the pronunciation manner of the text to be synthesized in the speech information. Then, through the speech synthesis model, the processing device may adjust the speech feature information according to the adjustment parameters corresponding to the text information to be synthesized, and finally generate the target speech information corresponding to the text information to be synthesized according to the adjusted speech feature information. This application manner again shows that, in the present application, the speech information is generated only after the adjustment parameters have acted on the pronunciation manner.
In a possible implementation manner, the speech synthesis model includes a parameter adjustment portion and a parameter prediction portion, where the adjustment parameters include a first adjustment parameter for adjusting a first feature parameter included in the speech feature information and a second adjustment parameter for adjusting a second feature parameter determined according to the speech feature information, and the speech feature information does not include the second feature parameter. That is, the speech synthesis model of the present application may adjust the feature parameters directly included in the speech feature information, or may adjust the feature parameters that need to be determined by analyzing the speech feature information.
When the voice characteristic information is adjusted according to the adjusting parameters corresponding to the text information to be synthesized, the processing equipment can determine second characteristic parameters corresponding to the voice characteristic information according to the voice characteristic information through the parameter predicting part, and then determine the second characteristic parameters to be adjusted according to the second adjusting parameters and the second characteristic parameters. The processing device may adjust the first feature parameter included in the voice feature information according to the first adjustment parameter and adjust the voice feature information according to the second feature parameter to be adjusted through the parameter adjusting portion, so as to obtain adjusted voice feature information, where the adjusted voice feature information simultaneously satisfies the adjustment of the first adjustment parameter in the first feature parameter dimension and the adjustment of the second adjustment parameter in the second feature parameter dimension.
In one possible implementation, to enrich the way and dimension of the speech information adjustment, the processing device may also provide parameter adjustment of the emotion dimension to the speech synthesis object through the speech synthesis model. The processing device may provide an emotion tag input function to the speech synthesis object so that the speech synthesis object inputs an emotion tag for identifying an emotion desired by the speech synthesis object to be expressed by the synthesized speech information as one of the adjustment parameters.
Specifically, the adjustment parameters may include an emotion tag. When determining the second feature parameter corresponding to the speech feature information according to the speech feature information, the processing device may first determine the normalized second feature parameter corresponding to the speech feature information, which is the feature parameter under the emotion identified by the emotion tag. The processing device may then determine the second feature parameter corresponding to the speech feature information according to the emotion feature parameter corresponding to the emotion tag and the normalized second feature parameter, where the emotion feature parameter is used to perform inverse normalization on the normalized second feature parameter under that emotion tag. The emotion feature parameter is obtained from a commonality analysis of the second feature parameters of a plurality of pieces of speech feature information corresponding to the emotion tag, and may be, for example, the mean and variance of those second feature parameters.
When performing parameter adjustment with emotion-dimension adjustment parameters, the processing device may provide the speech synthesis object not only with the direction of emotion adjustment but also with the degree of emotion adjustment. The adjustment parameters may include an emotion tag and an emotion degree parameter, where the emotion degree parameter identifies the degree to which the pronunciation manner of the text to be synthesized in the speech information is shifted toward the emotion identified by the emotion tag. For example, when the emotion tag is "angry", different emotion degree parameters allow the finally generated speech information to express different degrees of anger such as "slightly angry", "moderately angry" and "very angry", which improves the freedom of adjustment in the emotion dimension and enriches the speech adjustment effects.
When generating the speech feature information corresponding to the text information to be synthesized according to the phoneme feature information, the semantic feature information and the prosodic feature information, the emotion feature information corresponding to the emotion tag is used to shift the pronunciation manner identified by the speech feature information toward the corresponding emotion; therefore, the degree of that shift can be controlled by adjusting the emotion feature information.
The processing device may determine, through the speech synthesis model, the emotion feature information corresponding to the emotion tag, and then generate the speech feature information corresponding to the text information to be synthesized jointly from the emotion feature information, the emotion degree parameter, the phoneme feature information, the semantic feature information and the prosodic feature information. In this way the speech feature information satisfies the emotion adjustment direction and degree required by the speech synthesis object while still fully fitting the textual expression of the text to be synthesized, providing speech information that better satisfies the speech synthesis object. For example, take "this is an example" as the text information to be synthesized, with adjustment parameters specifying the emotion tag "angry", an emotion degree parameter of 0.5, and a 50% increase in the intonation of the word "example" in the text. The speech synthesis model analyzes the text to be synthesized to obtain a pinyin sequence as the phoneme level features, the corresponding Chinese characters as the character level features, and the prosodic analysis result of the text as the word and phrase level features, thereby obtaining the phoneme feature information, the semantic feature information and the prosodic feature information. The emotion feature information corresponding to the emotion tag "angry" may be, for example, vector information; this vector is multiplied by the emotion degree parameter 0.5 and used as the emotion feature for generating the speech feature information. The intonation parameter of the word "example" predicted by the intonation adapter is multiplied by 150% and used as the intonation parameter to be adjusted, with which the speech feature information is adjusted.
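The arithmetic of the worked example above can be summarized by the following sketch; all concrete values other than the 0.5 emotion degree and the 150% intonation scale are assumed for illustration.

```python
import numpy as np

# Assumed values for illustration only.
emotion_vector = np.array([0.2, -0.1, 0.4])      # emotion feature information for "angry"
emotion_degree = 0.5
style_feature = emotion_vector * emotion_degree   # scaled emotion feature used for synthesis

predicted_pitch_of_example = 210.0                # Hz, assumed prediction by the intonation adapter
intonation_control = 1.5                          # +50% intonation on the word "example"
pitch_to_apply = predicted_pitch_of_example * intonation_control  # 315.0 Hz
```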
Referring to fig. 8, fig. 8 is a schematic diagram of a model application method provided in an embodiment of the present application. After the text information to be synthesized and the adjustment parameters are input into the speech synthesis model, the phoneme feature information, semantic feature information and prosodic feature information can be obtained from the phoneme level features, character level features and word and phrase level features corresponding to the text information to be synthesized, respectively. The speech feature information corresponding to the text information to be synthesized is determined by combining the text feature information of the three dimensions with the emotion feature information and emotion degree parameter corresponding to the emotion tag. The parameter prediction part of the speech synthesis model includes a duration adapter, an intonation adapter and a fluctuation adapter, which are respectively used to determine the duration parameter, the intonation parameter and the fluctuation parameter corresponding to the speech feature information. The second adjustment parameters among the adjustment parameters include a duration control parameter, an intonation control parameter and a fluctuation control parameter; the duration parameter to be adjusted is determined from the duration control parameter and the predicted duration parameter, the intonation parameter to be adjusted from the intonation control parameter and the predicted intonation parameter, and the fluctuation parameter to be adjusted from the fluctuation control parameter and the predicted fluctuation parameter. The speech feature information is adjusted by combining these three second feature parameters to be adjusted with the accent control parameter, the trailing tone control parameter and the break control parameter included in the first adjustment parameters, obtaining the adjusted speech feature information; finally, the speech information corresponding to the text information to be synthesized is generated based on the adjusted speech feature information.
In order to facilitate the input of information for speech synthesis by a speech synthesis object, in one possible implementation, the processing device may present the speech synthesis object with an information input interface for inputting text information to be synthesized and adjustment parameters. The processing equipment can acquire the text information to be synthesized and the adjusting parameters corresponding to the text information to be synthesized, which are input by the voice synthesis object, through the information input interface.
As shown in fig. 9, fig. 9 is a schematic diagram of an information input interface provided in an embodiment of the present application. In the information input interface, the speech synthesis object may select the timbre of the synthesized speech information (including multiple timbre options such as timbre A), select the emotion tag, and input the emotion degree parameter, the duration control parameter (using the speaking speed as the duration control parameter, so that different speech durations are achieved by adjusting the speaking speed), the intonation control parameter (including intonation raising/lowering and intonation range), the break control parameter (realized as pauses in the speech information) and the fluctuation control parameter (realized as intonation intensity). After the speech information is synthesized, the speech synthesis object can listen to it through the interface and view the speech information display, in which the intonation parameters, duration parameters and prosody corresponding to each character of the text to be synthesized, namely "a test", can be seen.
In order to facilitate understanding of the technical solution provided by the embodiment of the present application, the model training method provided by the embodiment of the present application will be described below with reference to an actual application scenario.
Referring to fig. 10, fig. 10 is a schematic diagram of a model training method in a practical application scenario according to an embodiment of the present application.
The practical application scenarios can involve many fields. For example, in the game field, the speech synthesis model trained by the above method can generate the speech information of non-player characters (NPCs) in a game, so that the speech provided by non-player characters to the player is more natural and real, bringing a higher-quality game experience to the player. Meanwhile, because the model in the present application supports generating speech information by combining diversified parameter adjustments with text features, a game developer can efficiently generate diversified high-quality speech information, enriching game content and improving game development efficiency.
Or, in the field of man-machine interaction, various devices with voice interaction functions, which are used by an object in daily life, can generate voice information in real time through the voice synthesis model for voice interaction with the object. Based on the voice interaction requirements of different objects, the voice interaction device can set different tone, intonation, speech speed, emotion and other adjustment parameters in the voice synthesis model so as to meet the diversified voice interaction requirements of the objects as much as possible.
Besides the above-mentioned fields, the speech synthesis model trained by the model training method can be applied to various fields requiring real, natural and diversified speech information, such as the vehicle-mounted technical field, the intelligent education field, and the like, and is not limited herein.
In order to improve the accuracy of the sample information, before the sample text information and sample speech information are produced, a persona (character setting) may first be defined; emotions are classified according to the persona, the same emotion is graded by degree, and emotion expressions under different scenes are matched. Then, text information is designed according to the persona and the emotions, with corresponding text designed for the persona's different emotional states. During recording, the recording scene, character state and the like are defined in advance, a recording object that matches the persona is selected, and recording starts only after the recording object has entered the role, so as to obtain sample text information that matches the emotion and real, accurate corresponding sample speech information as the sample information.
In data preprocessing, the processing device may add the corresponding sample emotion tag and adjustment parameters to the sample text information and perform model training based on this information to produce a text-to-speech (TTS) model. Through the speech synthesis model, the speech synthesis object can synthesize the corresponding speech information based on the text information to be synthesized and the adjustment parameters input via the information input interface.
The model training process is shown in fig. 11. The phoneme level features, character level features, and word and phrase level features of the sample text information are input to the deep learning model encoder, the transformer encoder and the embedding layer, respectively: the phoneme level features go to the deep learning model encoder, the character level features go to the transformer encoder, and the word and phrase level features, for example word segmentation, prosodic phrase and prosodic word information, serve as the input of the embedding layer. The model selects the corresponding emotion feature vector (style embedding) according to the sample emotion tag of the sample text information; this emotion feature vector is the emotion feature information described above, and it is multiplied by the corresponding emotion degree parameter, for example 1.5, i.e. 1.5 times the emotion feature vector is used as the feature of the emotion dimension, so as to synthesize the corresponding emotion. This is realized by a hidden layer for emotion-dimension adjustment in the initial speech synthesis model, and during model application this hidden layer can be used to generate speech feature information based on the emotion feature information and emotion degree parameter. The second feature parameters are adjusted in a similar way: during application the predicted values are adjusted, while during training the sample second feature parameters are multiplied by the corresponding second adjustment parameters, so as to control the speech synthesis effect.
The initial speech synthesis model may generate a corresponding parameter vector based on the first tuning parameters, which parameter vector is used to tune the first feature parameters in the speech feature information, e.g., accent parameter vectors may be generated based on accent control parameters (Emphasis embedding), trailing tone parameter vectors may be generated based on trailing tone control parameters (Stretch embedding), and break parameter vectors may be generated based on break control parameters (Interrupt embedding). The voice characteristic information can be adjusted through the parameter vectors and the tone parameter to be adjusted, the fluctuation parameter to be adjusted and the duration parameter to be adjusted in the second characteristic parameters to be adjusted, so that the adjusted voice characteristic information is obtained.
The adjusted speech feature information may be upsampled through self-attention upsampling (Self-Attention upsampling) to obtain the pending speech information corresponding to the sample text information; self-attention upsampling makes the connection between the speech segments corresponding to each character more natural. A spectrogram (mel spectrogram) corresponding to the pending speech information can be obtained through an autoregressive decoder (Autoregressive decoder). The initial speech synthesis model may use the generative adversarial network discriminator to generate a similarity parameter between this spectrogram and the spectrogram corresponding to the sample speech information (ground-truth mel spectrogram) to determine whether the two spectrograms match, thereby analyzing the difference between the pending speech information and the sample speech information. This difference can be used to adjust the autoregressive decoder, the self-attention upsampling, and the model parameters involved in adjusting and generating the speech feature information, and can also be used to adjust the emotion feature vectors; the difference between the pending second feature parameters and the sample second feature parameters can be used to adjust the adapters that determine the second feature parameters.
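The following sketch illustrates the last stage of this pipeline in a strongly simplified, non-autoregressive form: upsampling of the adjusted speech features followed by projection to a mel spectrogram. It is an assumption-laden illustration (fixed upsampling factor, a plain Transformer stack in place of the decoder), not the embodiment's decoder.

```python
import torch
import torch.nn as nn

class AcousticDecoder(nn.Module):
    """Sketch: self-attention upsampling of the adjusted speech features
    followed by projection to a mel spectrogram (non-autoregressive
    simplification of the described pipeline)."""
    def __init__(self, d_model=256, n_mels=80, upsample=4):
        super().__init__()
        self.upsample = upsample
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.attention = nn.TransformerEncoder(layer, num_layers=2)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, adjusted_features):                      # (B, T, d_model)
        x = adjusted_features.repeat_interleave(self.upsample, dim=1)
        x = self.attention(x)                                  # smooth the joins
        return self.to_mel(x)                                  # (B, T*upsample, n_mels)
```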
Finally, the speech synthesis model obtained through training can output a spectrogram corresponding to the speech information, can also directly output the speech information, and provides a diversified result output mode for selecting a speech synthesis object.
Based on the model training method provided by the foregoing embodiment, the embodiment of the present application further provides a model training device, referring to fig. 12, fig. 12 is a block diagram of a model training device 1200 provided by the embodiment of the present application, where the device 1200 includes an obtaining unit 1201, a first generating unit 1202, a first adjusting unit 1203, a second generating unit 1204, and a second adjusting unit 1205:
the obtaining unit 1201 is configured to obtain a sample text information set, where the sample text information set includes a plurality of sample text information, and the sample text information has corresponding sample voice information and a sample adjustment parameter, and the sample voice information is generated based on the sample adjustment parameter;
the first generating unit 1202 is configured to generate, according to an initial speech synthesis model, speech feature information corresponding to the target sample text information by using the plurality of sample text information as target sample text information, where the speech feature information is used to identify a pronunciation mode of the target sample text information in the speech information;
The first adjusting unit 1203 is configured to adjust, according to the initial speech synthesis model and the target sample adjustment parameter corresponding to the target sample text information, the speech feature information to obtain adjusted speech feature information;
the second generating unit 1204 is configured to generate, according to the adjusted speech feature information, to-be-determined speech information corresponding to the target sample text information by using the initial speech synthesis model;
the second adjusting unit 1205 is configured to adjust model parameters corresponding to the initial speech synthesis model according to a difference between the to-be-determined speech information and the target sample speech information corresponding to the target sample text information, so as to obtain a speech synthesis model, where the speech synthesis model is configured to synthesize the speech information according to the to-be-synthesized text information and the adjustment parameters corresponding to the to-be-synthesized text information.
In one possible implementation manner, the first generating unit 1202 is specifically configured to:
determining phoneme characteristic information, semantic characteristic information and prosody characteristic information corresponding to the target sample text information, wherein the phoneme characteristic information is used for identifying a phoneme composition corresponding to the target sample text information, the semantic characteristic information is used for identifying semantics corresponding to the target sample text information, and the prosody characteristic information is used for identifying pronunciation prosody corresponding to the target sample text information;
And generating voice characteristic information corresponding to the target sample text information according to the phoneme characteristic information, the semantic characteristic information and the prosody characteristic information.
In one possible implementation manner, the sample text information has a corresponding sample emotion tag, the initial speech synthesis model includes initial emotion feature information corresponding to each of a plurality of emotion tags, and the first generation unit 1202 is specifically configured to:
determining target initial emotion feature information corresponding to a target sample emotion label corresponding to the target sample text information;
generating voice characteristic information corresponding to the target sample text information according to the target initial emotion characteristic information and the target sample text information;
the second adjusting unit 1205 is specifically configured to:
according to the difference between the undetermined voice information and the target sample voice information corresponding to the target sample text information, adjusting model parameters corresponding to the initial voice synthesis model and the target initial emotion feature information to obtain a voice synthesis model, wherein the voice synthesis model comprises emotion feature information respectively corresponding to the plurality of emotion tags, the emotion feature information is obtained by adjusting the initial emotion feature information corresponding to the emotion tags, and the voice synthesis model is used for synthesizing the voice information according to the text information to be synthesized, the adjusting parameters corresponding to the text information to be synthesized and the emotion tags corresponding to the text information to be synthesized.
In a possible implementation manner, the target sample adjustment parameter includes a first adjustment parameter, where the first adjustment parameter is used to adjust a first feature parameter included in the voice feature information, and the first adjustment unit 1203 is specifically configured to:
and adjusting the first characteristic parameters included in the voice characteristic information according to the first adjusting parameters through the initial voice synthesis model to obtain adjusted voice characteristic information.
In a possible implementation manner, the initial speech synthesis model includes a parameter prediction part, the target sample adjustment parameter includes a second adjustment parameter, the second adjustment parameter is used for adjusting a second characteristic parameter determined according to the speech characteristic information, the speech characteristic information does not include the second characteristic parameter, and the target sample text information has a corresponding sample second characteristic parameter;
the apparatus further comprises a determination unit:
the determining unit is used for determining undetermined second characteristic parameters corresponding to the voice characteristic information through the parameter predicting part;
the first adjusting unit 1203 is specifically configured to:
determining a second characteristic parameter to be adjusted according to the second characteristic parameter of the sample and the second adjustment parameter;
Adjusting the voice characteristic information according to the second characteristic parameters to be adjusted through the initial voice synthesis model to obtain adjusted voice characteristic information;
the second adjusting unit 1205 is specifically configured to:
and adjusting model parameters corresponding to the parameter prediction part according to the difference between the second characteristic parameter to be determined and the second characteristic parameter of the sample, and adjusting model parameters except for the parameter prediction part in the initial speech synthesis model according to the difference between the speech information to be determined and the target sample speech information corresponding to the target sample text information to obtain the speech synthesis model.
In one possible implementation manner, the target sample adjustment parameter has a corresponding emotion tag, and the second adjustment unit 1205 is specifically configured to:
determining a first emotion characteristic parameter corresponding to the emotion label;
normalizing the sample second characteristic parameters according to the first emotion characteristic parameters;
and according to the difference between the undetermined second characteristic parameters and the normalized sample second characteristic parameters, adjusting model parameters corresponding to the parameter prediction part, wherein the parameter prediction part in the voice synthesis model is used for determining normalized second characteristic parameters corresponding to voice characteristic information, and determining second characteristic parameters corresponding to the voice characteristic information according to the second emotion characteristic parameters corresponding to emotion labels corresponding to the text information to be synthesized and the normalized second characteristic parameters.
In one possible implementation, the first adjustment parameter includes any one or a combination of a drag control parameter for adjusting the drag parameter in the first characteristic parameter, an accent control parameter for adjusting the accent parameter in the first characteristic parameter, and a break control parameter for adjusting the break parameter in the first characteristic parameter.
In one possible implementation, the second adjustment parameter includes any one or a combination of a duration control parameter, a intonation control parameter, and a heave control parameter, where the duration control parameter is used to adjust the duration parameter in the second feature parameter, the intonation control parameter is used to adjust the intonation parameter in the second feature parameter, and the heave control parameter is used to adjust the heave parameter in the second feature parameter.
In one possible implementation, the second adjusting unit 1205 is specifically configured to:
generating a first spectrogram according to the undetermined voice information, and generating a second spectrogram according to the target sample voice information;
determining, by a discriminator of a generative adversarial network, a similarity parameter between the first spectrogram and the second spectrogram, the similarity parameter being used to identify the difference between the first spectrogram and the second spectrogram;
and adjusting model parameters corresponding to the initial voice synthesis model according to the similarity parameter to obtain a voice synthesis model, wherein the similarity parameter determined based on the voice synthesis model is greater than a preset threshold.
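A minimal Python (PyTorch) sketch of this discriminator-based training signal is given below, assuming a small convolutional discriminator over mel-spectrograms and a least-squares GAN objective; neither assumption is mandated by the embodiment.

import torch
import torch.nn as nn

class SpectrogramDiscriminator(nn.Module):
    # Assumed convolutional discriminator; scores how "real" a spectrogram looks.
    def __init__(self, n_mels=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1),
        )

    def forward(self, spec):  # spec: (batch, 1, n_mels, frames)
        return self.net(spec)

def adversarial_losses(disc, pending_spec, sample_spec):
    # Least-squares GAN: the discriminator separates the first spectrogram
    # (undetermined speech) from the second spectrogram (target sample speech),
    # while the synthesis model is pushed until its output is scored as real,
    # i.e. until the similarity parameter exceeds the preset threshold.
    real_score = disc(sample_spec)
    fake_score = disc(pending_spec.detach())
    d_loss = ((real_score - 1) ** 2).mean() + (fake_score ** 2).mean()
    g_loss = ((disc(pending_spec) - 1) ** 2).mean()
    return d_loss, g_loss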
Based on the model application method provided by the foregoing embodiment, the embodiment of the present application further provides a model application device, referring to fig. 13, fig. 13 is a block diagram of a model application device 1300 provided by the embodiment of the present application, where the device includes an obtaining unit 1301, a generating unit 1302, and a sending unit 1303:
the obtaining unit 1301 is configured to obtain text information to be synthesized and an adjustment parameter corresponding to the text information to be synthesized, where the adjustment parameter is used to adjust a pronunciation mode of the text information to be synthesized in the speech information;
the generating unit 1302 is configured to input the text information to be synthesized and the adjustment parameters corresponding to the text information to be synthesized into a speech synthesis model, and generate, by using the speech synthesis model, target speech information corresponding to the text information to be synthesized;
the sending unit 1303 is configured to send the target voice information to the voice synthesis object.
In one possible implementation manner, the generating unit 1302 is specifically configured to:
determining phoneme characteristic information, semantic characteristic information and prosody characteristic information corresponding to the text information to be synthesized, wherein the phoneme characteristic information is used for identifying a phoneme composition corresponding to the text information to be synthesized, the semantic characteristic information is used for identifying semantics corresponding to the text information to be synthesized, and the prosody characteristic information is used for identifying pronunciation prosody corresponding to the text information to be synthesized;
generating voice characteristic information corresponding to the text information to be synthesized according to the phoneme characteristic information, the semantic characteristic information and the prosody characteristic information, wherein the voice characteristic information is used for identifying the pronunciation mode of the text information to be synthesized in the voice information;
adjusting the voice characteristic information according to the adjusting parameters corresponding to the text information to be synthesized;
and generating target voice information corresponding to the text information to be synthesized according to the adjusted voice characteristic information.
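Taken together, the above steps amount to the following high-level Python sketch of the inference flow; the helper names (extract_*, fuse, adjust, vocoder) are illustrative stand-ins and not an interface defined by the embodiment.

def synthesize(model, text, adjust_params):
    phoneme_feat = model.extract_phoneme_features(text)    # phoneme composition
    semantic_feat = model.extract_semantic_features(text)  # sentence-level semantics
    prosody_feat = model.extract_prosody_features(text)    # pronunciation prosody
    speech_feat = model.fuse(phoneme_feat, semantic_feat, prosody_feat)
    adjusted_feat = model.adjust(speech_feat, adjust_params)  # apply adjustment parameters
    return model.vocoder(adjusted_feat)                       # target speech information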
In a possible implementation manner, the speech synthesis model includes a parameter adjustment portion and a parameter prediction portion, where the adjustment parameters include a first adjustment parameter and a second adjustment parameter, the first adjustment parameter is used to adjust a first feature parameter included in the speech feature information, and the second adjustment parameter is used to adjust a second feature parameter determined according to the speech feature information, and the speech feature information does not include the second feature parameter;
The generating unit 1302 is specifically configured to:
determining, by the parameter prediction portion, a second feature parameter corresponding to the voice feature information according to the voice feature information;
determining a second characteristic parameter to be adjusted according to the second adjustment parameter and the second characteristic parameter;
and adjusting, by the parameter adjusting section, a first characteristic parameter included in the voice characteristic information according to the first adjustment parameter, and adjusting the voice characteristic information according to the second characteristic parameter to be adjusted.
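One possible way to combine the parameter prediction part and the parameter adjustment part at inference is sketched below in Python; the attribute names and the elementwise scaling of the predicted second characteristic parameters are assumptions made for the example.

def adjust_speech_features(model, speech_feat, first_adjust, second_adjust):
    # Parameter prediction part: second characteristic parameters (e.g. duration,
    # intonation) inferred from the speech characteristic information.
    predicted_second = model.parameter_predictor(speech_feat)
    # Second characteristic parameter to be adjusted, obtained from the prediction
    # and the second adjustment parameter (elementwise scaling is an assumption).
    second_to_apply = predicted_second * second_adjust
    # Parameter adjustment part: apply the first adjustment parameter directly to
    # the first characteristic parameters, then apply the adjusted second parameters.
    feat = model.parameter_adjuster.apply_first(speech_feat, first_adjust)
    feat = model.parameter_adjuster.apply_second(feat, second_to_apply)
    return feat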
In a possible implementation manner, the adjustment parameter includes an emotion tag, and the generating unit 1302 is specifically configured to:
determining a second characteristic parameter after normalization processing corresponding to the voice characteristic information according to the voice characteristic information;
and determining a second characteristic parameter corresponding to the voice characteristic information according to the emotion characteristic parameter corresponding to the emotion label and the normalized second characteristic parameter.
In a possible implementation manner, the adjustment parameters include an emotion tag and an emotion degree parameter, where the emotion degree parameter is used for identifying a degree to which the pronunciation mode of the text information to be synthesized in the voice information is adjusted toward the emotion identified by the emotion tag, and the generating unit 1302 is specifically configured to:
Determining emotion characteristic information corresponding to the emotion tag;
and generating voice characteristic information corresponding to the text information to be synthesized according to the emotion characteristic information, the emotion characteristic parameter, the phoneme characteristic information, the semantic characteristic information and the prosody characteristic information.
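For illustration, if the emotion characteristic information is taken to be an embedding vector, the emotion degree parameter can be read as an interpolation weight toward that embedding; the Python sketch below uses that assumption, with a hypothetical neutral embedding as the reference.

import torch

def apply_emotion_degree(emotion_embedding, neutral_embedding, degree):
    # degree = 0.0 keeps a neutral pronunciation, degree = 1.0 applies the emotion fully.
    degree = max(0.0, min(1.0, float(degree)))
    return neutral_embedding + degree * (emotion_embedding - neutral_embedding)

emotion = torch.randn(256)   # hypothetical emotion characteristic information
neutral = torch.zeros(256)   # hypothetical neutral reference embedding
half_strength = apply_emotion_degree(emotion, neutral, 0.5)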
In one possible implementation, the apparatus further includes a display unit:
the display unit is used for displaying an information input interface to the voice synthesis object, wherein the information input interface is used for inputting text information to be synthesized and adjustment parameters;
the acquiring unit 1301 is specifically configured to:
and acquiring text information to be synthesized and input by a voice synthesis object and adjusting parameters corresponding to the text information to be synthesized through the information input interface.
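A console prompt can serve as a minimal stand-in for such an information input interface; the Python sketch below is purely illustrative and collects only a hypothetical subset of the adjustment parameters (an emotion tag and an emotion degree).

def collect_synthesis_request():
    # Minimal illustrative stand-in for the information input interface: prompts the
    # voice synthesis object for the text to be synthesized and adjustment parameters.
    text = input("Text to be synthesized: ")
    emotion = input("Emotion tag (e.g. neutral/happy/sad): ") or "neutral"
    degree = float(input("Emotion degree (0.0-1.0): ") or "1.0")
    return {"text": text, "emotion_tag": emotion, "emotion_degree": degree}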
The embodiment of the present application also provides a computer device, which is described below with reference to the accompanying drawings. Referring to fig. 14, an embodiment of the present application provides a device, which may also be a terminal device; the terminal device may be any intelligent terminal, including a mobile phone, a tablet computer, a personal digital assistant (Personal Digital Assistant, PDA for short), a point of sale terminal (Point of Sale, POS for short), a vehicle-mounted computer, and the like. The following description takes a mobile phone as an example of the terminal device:
Fig. 14 is a block diagram showing a part of the structure of a mobile phone related to a terminal device provided by an embodiment of the present application. Referring to fig. 14, the mobile phone includes: radio frequency (RF) circuitry 710, memory 720, input unit 730, display unit 740, sensor 750, audio circuitry 760, wireless fidelity (Wireless Fidelity, WiFi) module 770, processor 780, and power supply 790. It will be appreciated by those skilled in the art that the handset structure shown in fig. 14 does not limit the handset, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
The following describes the components of the mobile phone in detail with reference to fig. 14:
The RF circuit 710 may be configured to receive and transmit signals during a message exchange or a call; specifically, it receives downlink information from a base station and passes it to the processor 780 for processing, and sends uplink data to the base station. Generally, the RF circuitry 710 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (Low Noise Amplifier, LNA for short), a duplexer, and the like. In addition, the RF circuitry 710 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to the global system for mobile communications (Global System of Mobile communication, GSM for short), general packet radio service (General Packet Radio Service, GPRS for short), code division multiple access (Code Division Multiple Access, CDMA for short), wideband code division multiple access (Wideband Code Division Multiple Access, WCDMA for short), long term evolution (Long Term Evolution, LTE for short), email, short message service (Short Messaging Service, SMS for short), and the like.
The memory 720 may be used to store software programs and modules, and the processor 780 performs the various functional applications and data processing of the handset by running the software programs and modules stored in the memory 720. The memory 720 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, application programs required for at least one function (such as a sound playing function and an image playing function), and the like, and the data storage area may store data created according to the use of the handset (such as audio data and a phonebook), and the like. In addition, the memory 720 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other non-volatile solid-state storage devices.
The input unit 730 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the handset. In particular, the input unit 730 may include a touch panel 731 and other input devices 732. The touch panel 731, also referred to as a touch screen, may collect touch operations performed by a user on or near it (for example, operations performed by the user on or near the touch panel 731 using any suitable object or accessory such as a finger or a stylus) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 731 may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch position, detects the signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends the coordinates to the processor 780; it can also receive and execute commands sent by the processor 780. In addition, the touch panel 731 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 730 may include other input devices 732 in addition to the touch panel 731. In particular, the other input devices 732 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and a power switch key), a trackball, a mouse, and a joystick.
The display unit 740 may be used to display information input by a user or information provided to the user and various menus of the mobile phone. The display unit 740 may include a display panel 741, and optionally, the display panel 741 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD) or an Organic Light-Emitting Diode (OLED) or the like. Further, the touch panel 731 may cover the display panel 741, and when the touch panel 731 detects a touch operation thereon or thereabout, the touch operation is transferred to the processor 780 to determine the type of touch event, and then the processor 780 provides a corresponding visual output on the display panel 741 according to the type of touch event. Although in fig. 14, the touch panel 731 and the display panel 741 are two separate components to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 731 and the display panel 741 may be integrated to implement the input and output functions of the mobile phone.
The handset may also include at least one sensor 750, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, where the ambient light sensor may adjust the brightness of the display panel 741 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 741 and/or the backlight when the mobile phone is moved to the ear. As one kind of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used in applications that recognize the posture of the mobile phone (such as switching between landscape and portrait orientation, related games, and magnetometer posture calibration) and in vibration-recognition-related functions (such as a pedometer and tap detection); other sensors that may also be configured in the handset, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, are not described in detail herein.
Audio circuitry 760, speaker 761, and microphone 762 may provide an audio interface between a user and the mobile phone. The audio circuit 760 may transmit an electrical signal converted from received audio data to the speaker 761, and the speaker 761 converts the electrical signal into a sound signal for output; on the other hand, the microphone 762 converts a collected sound signal into an electrical signal, which is received by the audio circuit 760 and converted into audio data; the audio data is then processed by the processor 780 and either transmitted via the RF circuit 710 to, for example, another mobile phone, or output to the memory 720 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 770, the mobile phone can help the user send and receive e-mails, browse web pages, access streaming media, and the like, providing wireless broadband Internet access for the user. Although fig. 14 shows the WiFi module 770, it is understood that it is not an essential part of the mobile phone and may be omitted as required without changing the essence of the invention.
The processor 780 is the control center of the mobile phone; it connects the various parts of the entire mobile phone using various interfaces and lines, and performs the various functions of the mobile phone and processes data by running or executing the software programs and/or modules stored in the memory 720 and calling the data stored in the memory 720, thereby monitoring the mobile phone as a whole. Optionally, the processor 780 may include one or more processing units; preferably, the processor 780 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interfaces, applications, and the like, and the modem processor mainly handles wireless communication. It will be appreciated that the modem processor may alternatively not be integrated into the processor 780.
The handset further includes a power supply 790 (such as a battery) for supplying power to the various components. Preferably, the power supply may be logically connected to the processor 780 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented by the power management system.
Although not shown, the mobile phone may further include a camera, a Bluetooth module, and the like, which are not described herein.
In this embodiment, the processor 780 included in the terminal device further has the following functions:
acquiring a sample text information set, wherein the sample text information set comprises a plurality of sample text information, the sample text information is provided with corresponding sample voice information and sample adjusting parameters, and the sample voice information is generated based on the sample adjusting parameters;
respectively taking the plurality of sample text information as target sample text information, and generating voice characteristic information corresponding to the target sample text information according to the target sample text information through an initial voice synthesis model, wherein the voice characteristic information is used for identifying the pronunciation mode of the target sample text information in the voice information;
adjusting the voice characteristic information according to target sample adjusting parameters corresponding to the target sample text information through the initial voice synthesis model to obtain adjusted voice characteristic information;
Generating undetermined voice information corresponding to the target sample text information according to the adjusted voice characteristic information through the initial voice synthesis model;
according to the difference between the undetermined voice information and the target sample voice information corresponding to the target sample text information, adjusting the model parameters corresponding to the initial voice synthesis model to obtain a voice synthesis model, wherein the voice synthesis model is used for synthesizing the voice information according to the text information to be synthesized and the adjusting parameters corresponding to the text information to be synthesized.
Alternatively, the processor 780 has the following functions:
acquiring text information to be synthesized and an adjusting parameter corresponding to the text information to be synthesized, which are input by a voice synthesis object, wherein the adjusting parameter is used for adjusting the pronunciation mode of the text information to be synthesized in the voice information;
inputting the text information to be synthesized and the adjusting parameters corresponding to the text information to be synthesized into a voice synthesis model, and generating target voice information corresponding to the text information to be synthesized through the voice synthesis model;
and sending the target voice information to the voice synthesis object.
Referring to fig. 15, fig. 15 is a schematic diagram of a server 800 according to an embodiment of the present application. The server 800 may vary considerably depending on configuration or performance, and may include one or more central processing units (Central Processing Units, abbreviated as CPUs) 822 (e.g., one or more processors), a memory 832, and one or more storage media 830 (e.g., one or more mass storage devices) storing application programs 842 or data 844. The memory 832 and the storage medium 830 may be transitory or persistent storage. The program stored in the storage medium 830 may include one or more modules (not shown), and each module may include a series of instruction operations on the server. Still further, the central processor 822 may be configured to communicate with the storage medium 830 and execute, on the server 800, the series of instruction operations in the storage medium 830.
The server 800 may also include one or more power supplies 826, one or more wired or wireless network interfaces 850, one or more input/output interfaces 858, and/or one or more operating systems 841, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 15.
The embodiments of the present application also provide a computer readable storage medium storing a computer program for executing any one of the model training method or the model application method described in the foregoing embodiments.
Embodiments of the present application also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform the model training method or the model application method of any of the above embodiments.
Those of ordinary skill in the art will appreciate that all or part of the steps for implementing the above method embodiments may be completed by hardware related to program instructions. The above program may be stored in a computer-readable storage medium, and the program, when executed, performs the steps of the above method embodiments. The aforementioned storage medium may be at least one of the following media capable of storing program code: a read-only memory (ROM), a RAM, a magnetic disk, an optical disk, or the like.
It should be noted that the embodiments in this specification are described in a progressive manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the other embodiments. In particular, the apparatus and system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant parts, reference may be made to the description of the method embodiments. The apparatus and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, that is, they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without undue burden.
The foregoing is only one specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the technical scope of the present application should be included in the scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims (20)

1. A method of model training, the method comprising:
acquiring a sample text information set, wherein the sample text information set comprises a plurality of sample text information, the sample text information is provided with corresponding sample voice information and sample adjusting parameters, and the sample voice information is generated based on the sample adjusting parameters;
respectively taking the plurality of sample text information as target sample text information, and generating voice characteristic information corresponding to the target sample text information according to the target sample text information through an initial voice synthesis model, wherein the voice characteristic information is used for identifying the pronunciation mode of the target sample text information in the voice information;
adjusting the voice characteristic information according to target sample adjusting parameters corresponding to the target sample text information through the initial voice synthesis model to obtain adjusted voice characteristic information;
generating undetermined voice information corresponding to the target sample text information according to the adjusted voice characteristic information through the initial voice synthesis model;
according to the difference between the undetermined voice information and the target sample voice information corresponding to the target sample text information, adjusting the model parameters corresponding to the initial voice synthesis model to obtain a voice synthesis model, wherein the voice synthesis model is used for synthesizing the voice information according to the text information to be synthesized and the adjusting parameters corresponding to the text information to be synthesized.
2. The method according to claim 1, wherein the generating the voice feature information corresponding to the target sample text information according to the target sample text information includes:
determining phoneme characteristic information, semantic characteristic information and prosody characteristic information corresponding to the target sample text information, wherein the phoneme characteristic information is used for identifying a phoneme composition corresponding to the target sample text information, the semantic characteristic information is used for identifying semantics corresponding to the target sample text information, and the prosody characteristic information is used for identifying pronunciation prosody corresponding to the target sample text information;
and generating voice characteristic information corresponding to the target sample text information according to the phoneme characteristic information, the semantic characteristic information and the prosody characteristic information.
3. The method according to claim 1, wherein the sample text information has a corresponding sample emotion tag, the initial speech synthesis model includes initial emotion feature information corresponding to each of a plurality of emotion tags, and the generating speech feature information corresponding to the target sample text information according to the target sample text information includes:
Determining target initial emotion feature information corresponding to a target sample emotion label corresponding to the target sample text information;
generating voice characteristic information corresponding to the target sample text information according to the target initial emotion characteristic information and the target sample text information;
the step of adjusting model parameters corresponding to the initial speech synthesis model according to the difference between the undetermined speech information and the target sample speech information corresponding to the target sample text information to obtain a speech synthesis model, comprising the following steps:
according to the difference between the undetermined voice information and the target sample voice information corresponding to the target sample text information, adjusting model parameters corresponding to the initial voice synthesis model and the target initial emotion feature information to obtain a voice synthesis model, wherein the voice synthesis model comprises emotion feature information respectively corresponding to the plurality of emotion tags, the emotion feature information is obtained by adjusting the initial emotion feature information corresponding to the emotion tags, and the voice synthesis model is used for synthesizing the voice information according to the text information to be synthesized, the adjusting parameters corresponding to the text information to be synthesized and the emotion tags corresponding to the text information to be synthesized.
4. The method according to claim 1, wherein the target sample adjustment parameters include first adjustment parameters, the first adjustment parameters are used for adjusting first feature parameters included in the voice feature information, the adjusting, by the initial voice synthesis model, the voice feature information according to the target sample adjustment parameters corresponding to the target sample text information, to obtain adjusted voice feature information includes:
and adjusting the first characteristic parameters included in the voice characteristic information according to the first adjusting parameters through the initial voice synthesis model to obtain adjusted voice characteristic information.
5. The method according to claim 1, wherein the initial speech synthesis model includes a parameter prediction section, the target sample adjustment parameter includes a second adjustment parameter for adjusting a second feature parameter determined according to the speech feature information, the second feature parameter is not included in the speech feature information, and the target sample text information has a corresponding sample second feature parameter;
the method further comprises the steps of:
determining undetermined second characteristic parameters corresponding to the voice characteristic information through the parameter prediction part;
The step of adjusting the voice characteristic information according to the target sample adjusting parameters corresponding to the target sample text information through the initial voice synthesis model to obtain adjusted voice characteristic information comprises the following steps:
determining a second characteristic parameter to be adjusted according to the second characteristic parameter of the sample and the second adjustment parameter;
adjusting the voice characteristic information according to the second characteristic parameters to be adjusted through the initial voice synthesis model to obtain adjusted voice characteristic information;
the step of adjusting model parameters corresponding to the initial speech synthesis model according to the difference between the undetermined speech information and the target sample speech information corresponding to the target sample text information to obtain a speech synthesis model, comprising the following steps:
and adjusting model parameters corresponding to the parameter prediction part according to the difference between the second characteristic parameter to be determined and the second characteristic parameter of the sample, and adjusting model parameters except for the parameter prediction part in the initial speech synthesis model according to the difference between the speech information to be determined and the target sample speech information corresponding to the target sample text information to obtain the speech synthesis model.
6. The method of claim 5, wherein the target sample adjustment parameters have corresponding emotion tags, and wherein adjusting the model parameters corresponding to the parameter prediction portion based on the difference between the pending second feature parameters and the sample second feature parameters comprises:
determining a first emotion characteristic parameter corresponding to the emotion label;
normalizing the sample second characteristic parameters according to the first emotion characteristic parameter;
and according to the difference between the undetermined second characteristic parameters and the normalized sample second characteristic parameters, adjusting model parameters corresponding to the parameter prediction part, wherein the parameter prediction part in the voice synthesis model is used for determining normalized second characteristic parameters corresponding to voice characteristic information, and determining second characteristic parameters corresponding to the voice characteristic information according to the second emotion characteristic parameters corresponding to emotion labels corresponding to the text information to be synthesized and the normalized second characteristic parameters.
7. The method of claim 4, wherein the first adjustment parameter includes any one or a combination of a drag control parameter for adjusting the drag parameter in the first characteristic parameter, an accent control parameter for adjusting the accent parameter in the first characteristic parameter, and a break control parameter for adjusting the break parameter in the first characteristic parameter.
8. The method of claim 5, wherein the second adjustment parameter includes any one or a combination of a duration control parameter for adjusting the duration parameter in the second feature parameter, an intonation control parameter for adjusting the intonation parameter in the second feature parameter, and a heave control parameter for adjusting the heave parameter in the second feature parameter.
9. The method according to claim 1, wherein the adjusting model parameters corresponding to the initial speech synthesis model according to the difference between the pending speech information and the target sample speech information corresponding to the target sample text information to obtain the speech synthesis model includes:
generating a first spectrogram according to the undetermined voice information, and generating a second spectrogram according to the target sample voice information;
determining, by a discriminator of a generative adversarial network, a similarity parameter between the first spectrogram and the second spectrogram, the similarity parameter being used to identify the difference between the first spectrogram and the second spectrogram;
and adjusting model parameters corresponding to the initial speech synthesis model according to the similarity parameter to obtain a speech synthesis model, wherein the similarity parameter determined based on the speech synthesis model is greater than a preset threshold.
10. A method of model application, the method comprising:
acquiring text information to be synthesized and an adjusting parameter corresponding to the text information to be synthesized, which are input by a voice synthesis object, wherein the adjusting parameter is used for adjusting the pronunciation mode of the text information to be synthesized in the voice information;
inputting the text information to be synthesized and the adjusting parameters corresponding to the text information to be synthesized into a voice synthesis model, and generating target voice information corresponding to the text information to be synthesized through the voice synthesis model;
and sending the target voice information to the voice synthesis object.
11. The method according to claim 10, wherein the generating the target voice information corresponding to the text information to be synthesized includes:
determining phoneme characteristic information, semantic characteristic information and prosody characteristic information corresponding to the text information to be synthesized, wherein the phoneme characteristic information is used for identifying a phoneme composition corresponding to the text information to be synthesized, the semantic characteristic information is used for identifying semantics corresponding to the text information to be synthesized, and the prosody characteristic information is used for identifying pronunciation prosody corresponding to the text information to be synthesized;
Generating voice characteristic information corresponding to the text information to be synthesized according to the phoneme characteristic information, the semantic characteristic information and the prosody characteristic information, wherein the voice characteristic information is used for identifying the pronunciation mode of the text information to be synthesized in the voice information;
adjusting the voice characteristic information according to the adjusting parameters corresponding to the text information to be synthesized;
and generating target voice information corresponding to the text information to be synthesized according to the adjusted voice characteristic information.
12. The method according to claim 11, wherein the speech synthesis model includes a parameter adjustment section and a parameter prediction section, the adjustment parameters including a first adjustment parameter for adjusting a first feature parameter included in the speech feature information and a second adjustment parameter for adjusting a second feature parameter determined from the speech feature information, the second feature parameter not being included in the speech feature information;
the adjusting the voice characteristic information according to the adjusting parameters corresponding to the text information to be synthesized comprises the following steps:
determining, by the parameter prediction portion, a second feature parameter corresponding to the voice feature information according to the voice feature information;
Determining a second characteristic parameter to be adjusted according to the second adjustment parameter and the second characteristic parameter;
and adjusting, by the parameter adjusting section, a first characteristic parameter included in the voice characteristic information according to the first adjustment parameter, and adjusting the voice characteristic information according to the second characteristic parameter to be adjusted.
13. The method according to claim 12, wherein the adjustment parameter includes an emotion tag, and the determining a second feature parameter corresponding to the voice feature information according to the voice feature information includes:
determining a second characteristic parameter after normalization processing corresponding to the voice characteristic information according to the voice characteristic information;
and determining a second characteristic parameter corresponding to the voice characteristic information according to the emotion characteristic parameter corresponding to the emotion label and the normalized second characteristic parameter.
14. The method according to claim 11, wherein the adjustment parameters include an emotion tag and an emotion degree parameter, the emotion degree parameter being used for identifying a degree to which the pronunciation manner of the text information to be synthesized in speech information is adjusted toward the emotion identified by the emotion tag, and the generating speech feature information corresponding to the text information to be synthesized according to the phoneme feature information, the semantic feature information and the prosodic feature information includes:
Determining emotion characteristic information corresponding to the emotion tag;
and generating voice characteristic information corresponding to the text information to be synthesized according to the emotion characteristic information, the emotion characteristic parameter, the phoneme characteristic information, the semantic characteristic information and the prosody characteristic information.
15. The method according to claim 10, wherein the method further comprises:
displaying an information input interface to the voice synthesis object, wherein the information input interface is used for inputting text information to be synthesized and adjusting parameters;
the obtaining the text information to be synthesized and the adjusting parameters corresponding to the text information to be synthesized, which are input by the speech synthesis object, includes:
and acquiring text information to be synthesized and input by a voice synthesis object and adjusting parameters corresponding to the text information to be synthesized through the information input interface.
16. A model training device, characterized in that the device comprises an acquisition unit, a first generation unit, a first adjustment unit, a second generation unit and a second adjustment unit:
the acquisition unit is used for acquiring a sample text information set, wherein the sample text information set comprises a plurality of sample text information, the sample text information is provided with corresponding sample voice information and sample adjusting parameters, and the sample voice information is generated based on the sample adjusting parameters;
The first generation unit is used for respectively taking the plurality of sample text information as target sample text information, generating voice characteristic information corresponding to the target sample text information according to the target sample text information through an initial voice synthesis model, and the voice characteristic information is used for identifying the pronunciation mode of the target sample text information in the voice information;
the first adjusting unit is used for adjusting the voice characteristic information according to the target sample adjusting parameters corresponding to the target sample text information through the initial voice synthesis model to obtain adjusted voice characteristic information;
the second generating unit is used for generating undetermined voice information corresponding to the target sample text information according to the adjusted voice characteristic information through the initial voice synthesis model;
the second adjusting unit is configured to adjust model parameters corresponding to the initial speech synthesis model according to a difference between the to-be-determined speech information and the target sample speech information corresponding to the target sample text information, so as to obtain a speech synthesis model, where the speech synthesis model is configured to synthesize the speech information according to the to-be-synthesized text information and the adjustment parameters corresponding to the to-be-synthesized text information.
17. A model application apparatus, characterized in that the apparatus comprises an acquisition unit, a generation unit and a transmission unit:
the acquisition unit is used for acquiring text information to be synthesized and input by a voice synthesis object and adjusting parameters corresponding to the text information to be synthesized, wherein the adjusting parameters are used for adjusting the pronunciation mode of the text information to be synthesized in the voice information;
the generating unit is used for inputting the text information to be synthesized and the adjusting parameters corresponding to the text information to be synthesized into a voice synthesis model, and generating target voice information corresponding to the text information to be synthesized through the voice synthesis model;
the sending unit is used for sending the target voice information to the voice synthesis object.
18. A computer device, the computer device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the model training method of any of claims 1-9, or the model application method of any of claims 10-15, according to instructions in the program code.
19. A computer readable storage medium for storing a computer program for executing the model training method of any one of claims 1-9 or the model application method of any one of claims 10-15.
20. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the model training method of any of claims 1-9 or the model application method of any of claims 10-15.
CN202310015162.1A 2023-01-04 2023-01-04 Model training method, model application method and related device Pending CN117219043A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310015162.1A CN117219043A (en) 2023-01-04 2023-01-04 Model training method, model application method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310015162.1A CN117219043A (en) 2023-01-04 2023-01-04 Model training method, model application method and related device

Publications (1)

Publication Number Publication Date
CN117219043A true CN117219043A (en) 2023-12-12

Family

ID=89034056

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310015162.1A Pending CN117219043A (en) 2023-01-04 2023-01-04 Model training method, model application method and related device

Country Status (1)

Country Link
CN (1) CN117219043A (en)


Legal Events

Date Code Title Description
PB01 Publication