CN115376486A - Speech synthesis method, device, computer equipment and storage medium - Google Patents

Speech synthesis method, device, computer equipment and storage medium

Info

Publication number
CN115376486A
Authority
CN
China
Prior art keywords
emotion
text
sample
network
synthesized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211016347.6A
Other languages
Chinese (zh)
Inventor
陈学源
吴志勇
徐东
赵伟峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Shenzhen International Graduate School of Tsinghua University
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd
Shenzhen International Graduate School of Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Music Entertainment Technology Shenzhen Co Ltd and Shenzhen International Graduate School of Tsinghua University
Priority to CN202211016347.6A
Publication of CN115376486A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3344 - Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the present application disclose a speech synthesis method, apparatus, computer device and storage medium. The method comprises: acquiring a target emotion category and a target emotion intensity corresponding to a text to be synthesized; determining the emotional features of the text to be synthesized according to emotion tokens respectively corresponding to a plurality of preset emotion categories, the target emotion category and the target emotion intensity, where any emotion token is used for characterizing the features of the corresponding preset emotion category; determining the emotional text features of the text to be synthesized according to the text features and the emotional features of the text to be synthesized; and synthesizing, according to the emotional text features, emotional speech in which the text to be synthesized conforms to the target emotion category and the target emotion intensity. In this way, speech rich in emotion can be synthesized, improving the listening experience of the user.

Description

Speech synthesis method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of speech technology, and in particular, to a speech synthesis method, apparatus, computer device, and storage medium.
Background
Current speech synthesis techniques typically focus only on making the speech content match the text content, which can leave the synthesized audio sounding stiff and acoustically unappealing. Since emotion is a state of the human voice beyond the text itself (such as an internal feeling or intention), synthesizing emotional speech to improve the listening experience is a hot research problem in current speech synthesis technology.
Disclosure of Invention
The embodiments of the present application provide a speech synthesis method, apparatus, computer device and storage medium, which can synthesize speech rich in emotion so as to improve the listening experience of the user.
A first aspect of an embodiment of the present application discloses a speech synthesis method, including:
acquiring a target emotion category and a target emotion intensity corresponding to a text to be synthesized;
determining the emotional features of the text to be synthesized according to emotion tokens respectively corresponding to a plurality of preset emotion categories, the target emotion category, and the target emotion intensity; any emotion token is used for characterizing the features of the corresponding preset emotion category;
determining the emotional text features of the text to be synthesized according to the text features and the emotional features of the text to be synthesized;
and synthesizing, according to the emotional text features, emotional speech in which the text to be synthesized conforms to the target emotion category and the target emotion intensity.
A second aspect of the embodiments of the present application discloses a speech synthesis apparatus, including:
the acquiring unit is used for acquiring a target emotion category and a target emotion intensity corresponding to the text to be synthesized;
the first determining unit is used for determining the emotional features of the text to be synthesized according to emotion tokens respectively corresponding to a plurality of preset emotion categories, the target emotion category, and the target emotion intensity; any emotion token is used for characterizing the features of the corresponding preset emotion category;
the second determining unit is used for determining the emotional text features of the text to be synthesized according to the text features and the emotional features of the text to be synthesized;
and the synthesis unit is used for synthesizing, according to the emotional text features, emotional speech in which the text to be synthesized conforms to the target emotion category and the target emotion intensity.
In a third aspect of embodiments of the present application, a computer device is disclosed, which includes a processor, a memory, and a network interface, where the processor, the memory, and the network interface are connected to each other, where the memory is used to store a computer program, and the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method of the first aspect.
A fourth aspect of embodiments of the present application discloses a computer-readable storage medium, in which a computer program is stored, the computer program comprising program instructions, which, when executed by a processor, cause the processor to perform the method of the first aspect.
A fifth aspect of embodiments of the present application discloses a computer program product or a computer program comprising program instructions stored in a computer-readable storage medium. The processor of the computer device reads the program instructions from the computer-readable storage medium, and the processor executes the program instructions to cause the computer device to perform the method of the first aspect described above.
In the embodiments of the present application, a computer device can acquire a target emotion category and a target emotion intensity corresponding to a text to be synthesized, and determine the emotional features of the text to be synthesized according to emotion tokens respectively corresponding to a plurality of preset emotion categories, the target emotion category, and the target emotion intensity, where any emotion token is used for characterizing the features of the corresponding preset emotion category. The emotional text features of the text to be synthesized can then be determined from the text features and the emotional features of the text to be synthesized, so that emotional speech conforming to the target emotion category and the target emotion intensity can be synthesized from the emotional text features. By implementing this method, the emotional features of the text to be synthesized are obtained from the specified emotion category and emotion intensity, and are combined with the text features of the text to be synthesized to obtain emotional text features rich in both textual and emotional information; speech with the specified emotion category and intensity is then synthesized from these emotional text features. Speech rich in emotion can thus be synthesized, improving the listening experience of the user as well as the flexibility and adjustability of the synthesized speech. In addition, when the emotional features are determined, the emotion is characterized using the emotion tokens corresponding to the various emotion categories, which enhances the controllability and interpretability of the emotion characterization.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are merely some embodiments of the present application; other drawings can be obtained from them by those of ordinary skill in the art without creative effort.
Fig. 1a is a schematic structural diagram of a speech synthesis scene according to an embodiment of the present application;
FIG. 1b is a block diagram of a speech synthesis system according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a speech synthesis method provided in an embodiment of the present application;
FIG. 3a is a schematic structural diagram of obtaining a target emotion category and a target emotion intensity according to an embodiment of the present application;
FIG. 3b is a schematic structural diagram of an emotion speech synthesis model provided in an embodiment of the present application;
FIG. 3c is a schematic structural diagram of an emotion control network provided in an embodiment of the present application;
FIG. 3d is a schematic structural diagram of another emotion speech synthesis model provided in the embodiment of the present application;
FIG. 4 is a flow chart of another speech synthesis method provided by the embodiments of the present application;
FIG. 5a is a schematic structural diagram of a reference model provided in an embodiment of the present application;
FIG. 5b is a schematic structural diagram of an emotion extraction network provided in an embodiment of the present application;
FIG. 5c is a schematic structural diagram of an emotion characterization network provided in an embodiment of the present application;
FIG. 5d is a schematic structural diagram of another reference model provided in the embodiments of the present application;
fig. 6 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Artificial intelligence is a comprehensive discipline that involves a wide range of technologies at both the hardware and software levels. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The key technologies of Speech Technology are automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, and speech is expected to become one of the most promising human-computer interaction modes.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field involves natural language, i.e. the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Based on the voice technology and the machine learning technology mentioned in the artificial intelligence technology, the application provides an emotion voice synthesis model (i.e. a model for performing emotion control on a text and obtaining corresponding emotion voice). The emotion voice synthesis model can carry out emotion control on the text to be synthesized through emotion types and emotion intensities to obtain corresponding emotion voice; furthermore, the embodiment of the application also provides a speech synthesis scheme based on the emotion speech synthesis model; specifically, the principle of the scheme is as follows: aiming at a text to be synthesized, which needs to be subjected to voice synthesis, a target emotion type and a target emotion intensity corresponding to the text to be synthesized can be obtained firstly; and then performing emotion control on the text to be synthesized based on the target emotion category and the target emotion intensity to obtain corresponding emotion voice. For example, the emotional features of the text to be synthesized may be determined based on the emotion tokens, the target emotion categories and the target emotion intensities respectively corresponding to the plurality of preset emotion categories. The emotion token is used for representing characteristics of the corresponding preset emotion category. And further acquiring text characteristics of the text to be synthesized so as to obtain the corresponding emotional voice of the text to be synthesized under the target emotion category and the target emotion intensity according to the emotion characteristics and the text characteristics. For example, the emotion text feature of the text to be synthesized can be determined, so as to determine the corresponding emotion voice according to the emotion text feature.
In one implementation, the speech synthesis scheme may be implemented by invoking an end-to-end emotion speech synthesis model, for example, the text to be synthesized, the target emotion category and the target emotion intensity may be directly accepted as input, so as to obtain corresponding emotion speech through the emotion speech synthesis model.
In summary, the speech synthesis scheme proposed by the embodiment of the present application may have the following beneficial effects: the emotion characteristics of the text to be synthesized can be obtained through the appointed emotion types and the emotion intensities, the emotion text characteristics rich in text information and emotion information are further obtained by combining the text characteristics of the text to be synthesized, and the synthesized voice of the appointed emotion types and the emotion intensities is synthesized through the emotion text characteristics, so that the synthesis of the voice rich in emotion can be realized, the auditory effect of a user is improved, and the flexibility and the adjustability of the synthesized voice can be improved. In addition, when determining the emotional characteristics, the emotion can be characterized by utilizing the emotion tokens corresponding to various emotion categories so as to enhance the controllability and interpretability of the emotion characterization. In addition, the emotion voice synthesis is realized by calling the emotion voice synthesis model, and the automation and the intellectualization of the voice synthesis can be improved.
The embodiments of the present application can be applied to various application scenarios such as human-computer interaction, speech synthesis, voice interaction, voice assistants and voiced novel generation, so that control over the emotion category and emotion intensity can be achieved with the above speech synthesis scheme, speech rich in emotion can be obtained, and the user experience can be further improved. For example, in a voiced novel generation scenario, a user may input a desired emotion category and emotion intensity in a voiced novel application; after receiving them, the application can read the novel aloud with the specified emotion category and emotion intensity based on the novel text and the input emotion category and emotion intensity. For another example, in application scenarios such as human-computer interaction, voice interaction and voice assistants, the user may input the desired emotion category and emotion intensity on the relevant interface in advance, and in subsequent interactions will hear emotional speech of interest, improving the user's listening experience.
In the application scenarios mentioned above, the user only needs to provide the text to be synthesized, the emotion category and the emotion intensity as input, and the end-to-end emotion-controllable speech synthesis model can directly accept these three inputs and output synthesized speech with the specified emotion category and intensity. For example, as shown in FIG. 1a, if the user wants the text to be synthesized to be "Happy New Year!", the emotion category to be "happy" and the emotion intensity to be "0.9", the user can directly input these three items in the relevant interface of the application program corresponding to the application scenario; when the application program receives them, it can automatically return the corresponding emotional speech. Specifically, the application program can obtain the corresponding emotional speech according to the speech synthesis scheme provided in the present application and play it back.
In a specific implementation, the execution subject of the above-mentioned speech synthesis scheme may be a computer device including, but not limited to, a terminal or a server. In other words, the computer device may be a server or a terminal, or a system of a server and a terminal. The above-mentioned terminal may be an electronic Device, including but not limited to a Mobile phone, a tablet computer, a desktop computer, a notebook computer, a palm computer, a vehicle-mounted Device, an intelligent voice interaction Device, an Augmented Reality/Virtual Reality (AR/VR) Device, a helmet mounted display, a wearable Device, an intelligent speaker, an intelligent household appliance, an aircraft, a digital camera, a camera, and other Mobile Internet Devices (MID) with network access capability. The above-mentioned server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, vehicle-road cooperation, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
In one implementation, when the computer device is a server, the embodiment of the present application provides a speech synthesis system, which may include at least one terminal and at least one server, as shown in fig. 1 b; taking the terminal as an example, the user may input the target emotion category and the target emotion intensity corresponding to the text to be synthesized on the terminal interface, and the terminal may acquire the target emotion category and the target emotion intensity. After the terminal acquires the target emotion category and the target emotion intensity, the target emotion category and the target emotion intensity can be uploaded to the server, so that the server synthesizes emotion voice for a text to be synthesized according to the acquired target emotion category and the target emotion intensity, and corresponding emotion voice is obtained. After the emotion voice is synthesized, the server can also return the emotion voice to the terminal, so that the emotion voice can be broadcasted on the terminal interface.
Based on the above-mentioned speech synthesis scheme, embodiments of the present application provide a speech synthesis method that can be executed by the above-mentioned computer device. Referring to fig. 2, the speech synthesis method includes, but is not limited to, the following steps:
s201: and acquiring a target emotion category and a target emotion intensity corresponding to the text to be synthesized.
The text to be synthesized may be any text; for example, it may be a novel text, a news text, a user dialogue text, or the like. The target emotion category may be one of one or more preset emotion categories; for example, the preset emotion categories may include happy, angry, sad, surprised, fearful, disgusted, and so on (not exhaustively listed here), and the target emotion category may be any one of them, such as happy or fearful. The target emotion intensity is a representation of the intensity of the emotion corresponding to the target emotion category and may be expressed as a numerical value; for example, the emotion intensity may range from 0 to 1, where a larger value indicates a stronger emotion and a smaller value indicates a weaker emotion. That is, the target emotion intensity may be any value from 0 to 1, such as 0.4 or 0.9. For example, assuming the target emotion category is happy, a target emotion intensity of 0.9 indicates a strong degree of happiness, while a target emotion intensity of 0.1 indicates a weak degree of happiness.
In one implementation, when a speech synthesis requirement exists, a target emotion category and a target emotion intensity corresponding to a text to be synthesized can be acquired.
Alternatively, the computer device may determine that a speech synthesis requirement exists when it receives a speech synthesis request. For example, a user may send a speech synthesis request for a text to be synthesized to the computer device; after the computer device receives the speech synthesis request, it determines that there is a speech synthesis requirement for that text. In a possible implementation, when a user needs to perform speech synthesis on a certain text, the user can perform the relevant operations through a user operation interface output by the terminal so as to send a speech synthesis request for the text to be synthesized to the computer device.
See, for example, FIG. 3 a: the terminal used by the user can display a user operation interface in the terminal screen, and the user operation interface can at least comprise a data setting area marked by 301 and a confirmation control marked by 302, wherein the data setting area can comprise a text setting area, an emotion category setting area and an emotion intensity setting area. The text setting area may be used to input text information, for example, a text may be directly input in the text setting area, or a storage area address where the text is located may also be input, and the storage area address is used to link to the corresponding text; the emotion category setting area can be used for inputting specific emotion categories; the emotion intensity setting area can be used to input a specific emotion intensity.
If a user wants to perform speech synthesis on a certain text, the user can input the related information in the data setting area 301, for example the text "Happy New Year!", the emotion category "happy", and the emotion intensity "0.9" shown in fig. 3a; then the user may perform a trigger operation (e.g., a click or press operation) on the confirmation control 302, which triggers the terminal to acquire the input data for speech synthesis from the data setting area 301. The input data may include the text to be synthesized, the target emotion category, and the target emotion intensity, i.e. the data set in the data setting area 301; for example, the text to be synthesized is "Happy New Year!", the target emotion category is "happy", and the target emotion intensity is "0.9". After the terminal acquires the input data, it may send a speech synthesis request containing the input data to the computer device.
Alternatively, the speech synthesis requirement may also be generated by triggering a speech synthesis timing task. For example, a speech synthesis timing task may be set, in which a trigger condition for speech synthesis of the text to be synthesized is indicated. The text to be synthesized can be pre-stored in a certain storage area, when the voice synthesis timing task is triggered, the text to be synthesized can be directly obtained from the storage area, voice synthesis operation is performed, the target emotion type and the target emotion intensity of the text to be synthesized can be stored in association with the corresponding text to be synthesized, and the corresponding target emotion type and the target emotion intensity can also be obtained when the text to be synthesized is obtained. The trigger condition may be that the current time reaches a preset speech synthesis time; or the residual storage space of the storage area reaches the preset residual storage space; or, a new text to be synthesized is added to the storage area, and the like.
S202: and determining the emotional characteristics of the text to be synthesized according to the emotion tokens, the target emotion categories and the target emotion intensities which respectively correspond to the plurality of preset emotion categories.
One preset emotion category corresponds to one emotion token, and any emotion token can be used for representing characteristics corresponding to the preset emotion category. The emotion tokens respectively corresponding to the multiple preset emotion categories are obtained in the process of training the emotion speech synthesis model, wherein the obtaining manner of the emotion tokens may refer to the relevant description in step S402.
The number of the target emotion categories can be one or more, correspondingly, the target emotion intensity can also be one or more, and one target emotion category can correspond to one target emotion intensity. In the embodiment of the present application, a target emotion category and a target emotion intensity are taken as an example for related description.
Broadly, when emotional features are characterized over the whole emotion space, the emotion tokens corresponding to all of the preset emotion categories in the emotion space can be weighted and summed, and the result of the weighted summation is the generalized emotional feature, as shown in the following formula (1).
E = Σ_{i=1}^{N} ω_i · T_i    (1)
where E denotes the emotional feature, N denotes the total number of preset emotion categories (i.e., the total number of emotion tokens), T_i denotes the emotion token corresponding to the i-th preset emotion category (or, directly, the i-th emotion token), and ω_i denotes the token weight of the i-th emotion token, which is the emotion intensity corresponding to the i-th preset emotion category.
As can be seen, the above representation of emotional features covers both generalized emotion categories and generalized emotion intensities. The purpose in the embodiments of the present application, however, is to control the emotion intensity of the text to be synthesized for a specific emotion category. In this case, a reference emotion category may first be set; for example, the reference emotion category may be the neutral emotion category or another emotion category. Considering that the neutral emotion category, relative to the other emotion categories, contains no emotion at all, the embodiments of the present application take the neutral emotion category as the reference emotion category in the following explanation. The neutral emotion category is taken as the emotion characterization with the weakest intensity for every emotion category, i.e. an emotion intensity of 0, while each emotion category other than neutral is taken as the characterization of that category at its strongest intensity, i.e. an emotion intensity of 1. In the high-dimensional emotion space, the closer an emotion representation is to the neutral emotion token, the weaker the emotion intensity it represents; conversely, the closer it is to a specific emotion token, the stronger the emotion intensity of that category. Here, the specific emotion token is the emotion token corresponding to the target emotion category; that is, the emotion intensity of the emotion token corresponding to the neutral emotion category is 0 and the emotion intensity of the emotion token corresponding to the target emotion category is 1. Since the embodiments of the present application need to obtain the emotion representation of the target emotion category at a specific emotion intensity (i.e., the target emotion intensity), an interpolation between the specific emotion token (the token of the target emotion category) and the neutral emotion token, i.e. an interpolation between 0 and 1, can be used to determine the emotional features of the text to be synthesized. Specifically, the acquired target emotion intensity is used as the token weight of the specific emotion token, the difference between 1 and the target emotion intensity is used as the token weight of the neutral emotion token, and the token weights of the emotion tokens corresponding to all other preset emotion categories default to 0. Such an implementation of determining the emotional features may be represented by the following formula (2).
E = α · T_hap + (1 - α) · T_neu    (2)
where T_hap denotes the emotion token corresponding to the target emotion category, T_neu denotes the emotion token corresponding to the neutral emotion category, α denotes the token weight of the emotion token corresponding to the target emotion category, i.e. the target emotion intensity corresponding to the target emotion category, and (1 - α) denotes the token weight of the emotion token corresponding to the neutral emotion category. For example, taking the target emotion category as happy, T_hap in formula (2) can be understood as the emotion token corresponding to the happy emotion category, and α as the token weight corresponding to the happy emotion category, i.e. the target emotion intensity.
As can be seen from the above, the emotion tokens of a plurality of preset emotion categories can be weighted and combined to enhance the controllability and interpretability of the emotion representations. In addition, the emotion representation space may be enlarged by using a neutral emotion as a representation when the emotion intensity is 0 and using an extreme specific emotion as a representation when the emotion intensity is 1.
In one implementation, step S202 can be performed by an emotion control network, which can be a network in the emotion speech synthesis model, as shown in fig. 3b. In a specific implementation, the emotional features can be obtained by calling the emotion speech synthesis model and, specifically, by performing emotion control with the emotion control network of the emotion speech synthesis model. The emotion control network contains the weighting logic required in the embodiments of the present application (this weighting logic is the specific implementation logic for determining the emotional features of the text to be synthesized, as shown in formula (2) above), and the emotion tokens corresponding to the plurality of preset emotion categories are pre-configured in the emotion control network. FIG. 3c is a schematic diagram of the network structure of the emotion control network; as can be seen from FIG. 3c, the main idea of the emotion control network is to perform a weighted combination of a number of different emotion tokens.
The specific implementation of obtaining the emotional features through the emotion control network may be as follows: the target emotion category and the target emotion intensity are input into the emotion control network in the emotion speech synthesis model, and the emotion tokens respectively corresponding to the plurality of preset emotion categories in the emotion control network are weighted based on the target emotion category and the target emotion intensity to obtain the emotional features of the text to be synthesized. For example, the token weights of the emotion tokens respectively corresponding to the plurality of preset emotion categories may be determined, each token weight may be multiplied by its corresponding emotion token to obtain a plurality of weighted results, and the weighted results may be summed; the summation result is the emotional feature. The token weights may be determined as follows: the target emotion category is matched against the plurality of preset emotion categories, and the token weight of the emotion token corresponding to the matched preset emotion category is set to the target emotion intensity; the token weight of the emotion token corresponding to the neutral emotion category among the preset emotion categories is set to the difference between 1 and the target emotion intensity; and the token weights of the emotion tokens corresponding to all other preset emotion categories are set to 0.
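To make this weighting logic concrete, the following is a minimal NumPy sketch of it (the function name, the 256-dimensional tokens, the category list and the random token values are illustrative assumptions rather than details given by the patent):

```python
import numpy as np

def emotion_feature(emotion_tokens, category_names, target_category,
                    target_intensity, reference_category="neutral"):
    """Weighted combination of emotion tokens, following formulas (1)/(2).

    emotion_tokens: array of shape [N, D], one learned token per preset category
    category_names: list of N preset category names, e.g. ["neutral", "happy", ...]
    target_intensity: float in [0, 1]
    """
    n = len(category_names)
    weights = np.zeros(n)
    # The matched target category gets the target intensity as its token weight.
    weights[category_names.index(target_category)] = target_intensity
    # The reference (neutral) token gets 1 minus the target intensity;
    # all other tokens keep weight 0 by default.
    weights[category_names.index(reference_category)] += 1.0 - target_intensity
    # Formula (1): E = sum_i w_i * T_i (here only two weights are non-zero).
    return weights @ emotion_tokens

# Example: 7 preset categories, 256-dimensional tokens (toy random values).
names = ["neutral", "happy", "angry", "sad", "surprised", "fearful", "disgusted"]
tokens = np.random.randn(7, 256).astype(np.float32)
E = emotion_feature(tokens, names, target_category="happy", target_intensity=0.9)
print(E.shape)  # (256,)
```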
In this way, the emotion control network allows the emotion category and emotion intensity of the speech to be synthesized to be adjusted as desired, so that speech with any different emotion categories and emotion intensities can be synthesized, improving the adjustability and flexibility of speech synthesis.
In the embodiments of the present application, the emotion category and emotion intensity can be controlled at the sentence level or at the phoneme level. For sentence-level control, only the sentence-level emotion needs to be regulated; it is then expanded to the length of the phoneme sequence. For phoneme-level control, the emotion corresponding to each phoneme needs to be controlled separately; for example, an emotion intensity may be assigned to each phoneme in the phoneme sequence, and the phonemes are weighted by these intensities. The emotion intensity over the phonemes may be set to fade gradually (e.g. 0.5, 0.3, 0.1, ...), to grow gradually (e.g. 0.1, 0.3, 0.5, ...), or to transition smoothly between different emotions (i.e., using a different emotion category as the reference emotion; the neutral emotion token in formula (2) above may be replaced by, for example, a sad emotion token or a fearful emotion token).
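A small sketch of the difference between sentence-level and phoneme-level control, using the same interpolation idea (the toy random tokens, five phonemes and linearly increasing intensity schedule are assumptions for illustration):

```python
import numpy as np

# Toy tokens for the target ("happy") and reference ("neutral") categories.
T_hap = np.random.randn(256).astype(np.float32)
T_neu = np.random.randn(256).astype(np.float32)

def interpolate(alpha):
    # Formula (2): E = alpha * T_target + (1 - alpha) * T_reference
    return alpha * T_hap + (1.0 - alpha) * T_neu

num_phonemes = 5

# Sentence-level control: one emotion feature, expanded to the phoneme length.
sentence_E = np.tile(interpolate(0.9), (num_phonemes, 1))        # [5, 256]

# Phoneme-level control: a separate intensity per phoneme, e.g. gradually stronger.
intensities = np.linspace(0.1, 0.9, num_phonemes)                # 0.1 ... 0.9
phoneme_E = np.stack([interpolate(a) for a in intensities])      # [5, 256]
```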
It should be noted that, for the emotion control network, the key point is to use multiple emotion tokens to characterize emotion categories and emotion intensities, and the specific characterization manner may be linear (such as the weighting process mentioned above) or non-linear, such as taking a logarithm or mapping to other spaces, and the characterization manner is not limited specifically.
S203: acquiring text characteristics of the text to be synthesized, and determining the emotional text characteristics of the text to be synthesized according to the emotional characteristics and the text characteristics.
In one implementation, the text feature is obtained by encoding a phoneme sequence corresponding to the text to be synthesized. Optionally, the text feature may be obtained by calling an acoustic network in the emotion speech synthesis model, and the acoustic network may include an encoding network and a decoding network; firstly, a text to be synthesized may be converted into a phoneme sequence, and after the phoneme sequence is obtained, the phoneme sequence is input into an encoding network for encoding processing to obtain a text feature of the text to be synthesized. The conversion of the text to be synthesized into the phoneme sequence may also be implemented by a network, for example, the network may be a phoneme conversion network, that is, the acoustic network may further include a phoneme conversion network; in a specific implementation, the text to be synthesized may be input into the phoneme conversion network for text-to-phoneme conversion, so as to obtain a phoneme sequence, as shown in fig. 3 d.
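As an illustration of this encoding step, the sketch below uses the third-party g2p_en package for grapheme-to-phoneme conversion and a small PyTorch Transformer encoder as a stand-in for the phoneme conversion network and encoding network (the package choice, toy vocabulary mapping, layer sizes and the example sentence are all assumptions, not details from the patent):

```python
import torch
import torch.nn as nn
from g2p_en import G2p  # assumed third-party G2P front end

# Grapheme-to-phoneme conversion (the patent only requires "a phoneme conversion network").
g2p = G2p()
phonemes = [p for p in g2p("Happy New Year!") if p.strip()]       # ARPAbet symbols
vocab = {p: i for i, p in enumerate(sorted(set(phonemes)))}       # toy vocabulary
phoneme_ids = torch.tensor([[vocab[p] for p in phonemes]])        # [1, L]

# Small Transformer encoder as a stand-in for the encoding network.
d_model = 256
embed = nn.Embedding(len(vocab), d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
    num_layers=4,
)
text_features = encoder(embed(phoneme_ids))                       # [1, L, 256]
```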
After the text features are obtained, the emotion text features of the text to be synthesized can be determined based on the text features and the emotion features. The manner of determining the emotional text feature of the text to be synthesized may include the following 3 cases.
Case (1): the text features and the emotion features can be directly subjected to fusion processing to obtain the emotion text features. The fusion processing may be addition processing, and it is understandable that the text feature may be composed of one or more text feature values, and the emotion feature may also be composed of one or more emotion feature values, where the feature length of the text feature is the same as the feature length of the emotion feature, and then the text feature value in the text feature and the emotion feature value in the emotion feature may be added correspondingly, and the addition result may be used as the emotion text feature. For example, assuming that text features can be characterized as (A1, A2, A3, A4) and emotion features can be characterized as (B1, B2, B3, B4), emotion text features can be characterized as (A1 + B1, A2+ B2, A3+ B3, A4+ B4).
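A minimal sketch of case (1), under the assumption that the text features form a per-phoneme sequence while the emotional feature is sentence-level and is therefore broadcast over every phoneme position before the element-wise addition (shapes and random values are illustrative):

```python
import numpy as np

L, D = 7, 256                                   # phoneme count, feature dimension
text_features = np.random.randn(L, D)           # output of the encoding network
emotion_feature = np.random.randn(D)            # sentence-level emotional feature

# Case (1): element-wise addition; the sentence-level emotional feature is
# broadcast over every phoneme position so the feature lengths match.
emotion_text_features = text_features + emotion_feature[None, :]  # [L, D]
```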
Case (2): the method comprises the steps of firstly obtaining a reference weight sequence corresponding to text features of a text to be synthesized, and carrying out vector embedding processing on the reference weight sequence to obtain a reference weight embedding vector aiming at the reference weight sequence; then, according to the emotion feature, the text feature and the reference weight embedded vector, determining the emotion text feature of the text to be synthesized, and if the emotion feature, the text feature and the reference weight embedded vector can be subjected to fusion processing, so as to obtain a corresponding emotion text feature, as shown in fig. 3 d. The fusion process can be understood by referring to the above description, and is not described herein again. The reference weight sequence can comprise one or more reference weights, the sequence length of the reference weight sequence is the same as that of the phoneme sequence, and the reference weight in the reference weight sequence can represent the contribution degree of each phoneme in the phoneme sequence to the current emotion representation or the contribution degree of each phoneme to the emotion representation under the target emotion category; and the value corresponding to the reference weight is in positive correlation with the contribution degree, that is, the larger the value corresponding to the reference weight is, the larger the corresponding contribution degree is, and the smaller the value corresponding to the reference weight is, the smaller the corresponding contribution degree is. In one possible implementation, all of the reference weights in the reference weight sequence may be set to the highest weight value, for example, if any of the reference weights ranges from 0 to 1, the highest weight value is 1, which indicates that all of the reference weights in the reference weight sequence are set to 1. Then, the reference weight embedding vector corresponding to the reference weight sequence is added to the output (text feature) of the encoding network, that is, the reference weight embedding vector and the text feature are fused as described above.
As noted, the larger the value of a reference weight, the larger the corresponding contribution, so setting all the reference weights in the reference weight sequence to 1 maximizes the contribution of each phoneme to the emotion representation under the target emotion category. When the reference weight embedding vector corresponding to this reference weight sequence is subsequently fused to obtain the emotional text features, the emotion representation of the target emotion category becomes stronger and more extreme. Since the emotional features mentioned above (which are fused with the text features and the reference weight embedding vector to obtain the emotional text features) are determined from the target emotion category and the neutral emotion category, setting the weights to 1 makes the emotion representation of the target emotion category stronger and more extreme and thus enlarges the difference between the target emotion category and the neutral emotion category; in other words, it enlarges the space of the emotion representation and reduces its averaging effect, thereby improving the speech synthesis effect.
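A PyTorch sketch of case (2), where every reference weight is set to the highest value 1 and then embedded and fused with the text and emotional features (the linear layer used to embed the weights, and all shapes, are assumptions; the patent only requires a vector-embedding step):

```python
import torch
import torch.nn as nn

L, D = 7, 256
text_features = torch.randn(1, L, D)            # encoder output
emotion_feature = torch.randn(1, 1, D)          # sentence-level emotional feature

# All reference weights are set to the highest value (1) at inference time.
ref_weights = torch.ones(1, L, 1)

# A small projection of the reference weights (the exact embedding layer is an
# assumption; the patent only specifies "vector embedding processing").
weight_embed = nn.Linear(1, D)
ref_weight_embedding = weight_embed(ref_weights)                  # [1, L, D]

# Fuse text features, emotional feature and reference-weight embedding by addition.
emotion_text_features = text_features + emotion_feature + ref_weight_embedding
```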
Case (3): after obtaining the emotion text feature through the condition (1) or the condition (2), the variable information adaptation processing may be performed on the emotion text feature, where the variable information adaptation processing may be understood as fusing the feature of the variable information into the emotion text feature to obtain the finally required emotion text feature. For example, the variable information may be duration, pitch, energy, and the like, that is, the features of these dimensions may also be incorporated into the emotion text feature to enhance the emotion representation of the emotion text feature, thereby improving the emotion representation effect. The variable information adaptation process may be implemented by a variable information adaptation network, for example, the variable information adaptation network may be a network in the acoustic network, that is, the acoustic network may further include a variable information adaptation network; in a specific implementation, the emotion text feature may be input into the variable information adaptation network to perform variable information adaptation processing, so as to obtain a final required emotion text feature, as shown in fig. 3 d.
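For case (3), the sketch below shows a greatly simplified, FastSpeech 2-style variable information (variance) adaptor that folds pitch and energy embeddings into the emotional text features and uses predicted durations to expand them to frame length (the single-linear-layer predictors and all sizes are simplifying assumptions, not the patent's architecture):

```python
import torch
import torch.nn as nn

class TinyVarianceAdaptor(nn.Module):
    """Simplified variance adaptor: predicts pitch and energy from the emotional
    text features and adds their embeddings back in; predicted durations expand
    the phoneme-level features to frame length."""
    def __init__(self, d=256):
        super().__init__()
        self.duration = nn.Linear(d, 1)
        self.pitch = nn.Linear(d, 1)
        self.energy = nn.Linear(d, 1)
        self.pitch_embed = nn.Linear(1, d)
        self.energy_embed = nn.Linear(1, d)

    def forward(self, x):                         # x: [1, L, d]
        x = x + self.pitch_embed(self.pitch(x)) + self.energy_embed(self.energy(x))
        frames_per_phoneme = self.duration(x).exp().round().clamp(min=1).long()  # [1, L, 1]
        # Expand each phoneme feature to its predicted number of frames.
        x = torch.repeat_interleave(x, frames_per_phoneme[0, :, 0], dim=1)
        return x

adaptor = TinyVarianceAdaptor()
out = adaptor(torch.randn(1, 7, 256))             # frame-level emotional text features
```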
It should be noted that the end-to-end acoustic network in the embodiments of the present application may be implemented with various network structures as described above (as shown in fig. 3d, the acoustic network may be FastSpeech 2), or may be based on other acoustic networks such as Tacotron 2; the acoustic network is not specifically limited in this application.
S204: and synthesizing the corresponding emotional voice of the text to be synthesized under the target emotion category and the target emotion intensity according to the emotional text characteristics.
In one implementation, the emotional text features may be decoded to obtain a mel spectrogram of the text to be synthesized, a mel spectrogram being a spectrogram whose frequency axis has been converted to the mel scale. The decoding may be implemented by the decoding network in the acoustic network, i.e. the emotional text features may be input into the decoding network for decoding to obtain the mel spectrogram, as shown in fig. 3d. The mel spectrogram is then vocoded, thereby synthesizing the emotional speech of the text to be synthesized under the target emotion category and the target emotion intensity. A neural vocoder (for example, WaveRNN) may be used to vocode the mel spectrogram and generate the waveform, so as to obtain the emotional speech corresponding to the text to be synthesized under the target emotion category and the target emotion intensity. In a specific application scenario (such as voice interaction or a voice assistant), the emotional speech can be played back once it is obtained.
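The patent vocodes the mel spectrogram with a neural vocoder such as WaveRNN; as a simple, self-contained stand-in, the sketch below inverts a (toy) mel spectrogram with librosa's Griffin-Lim-based mel inversion and writes the waveform to disk (the sample rate, FFT size, hop length and file name are assumed values):

```python
import numpy as np
import librosa
import soundfile as sf

# Assume the decoding network produced a power-scale mel spectrogram of shape
# [80 mel bins, T frames]; random values stand in for a real prediction here.
mel = np.abs(np.random.randn(80, 200)).astype(np.float32)

# Griffin-Lim mel inversion as a simple stand-in for a neural vocoder.
wav = librosa.feature.inverse.mel_to_audio(mel, sr=22050, n_fft=1024, hop_length=256)
sf.write("emotional_speech.wav", wav, 22050)
```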
As can be seen from the above description, the synthesis of emotion speech can be realized by calling an emotion speech synthesis model in the embodiment of the present application; the steps S201 to S204 can be understood as a processing procedure (or inference procedure) of the emotion voice synthesis model in the actual application scenario. Briefly, the embodiment of the application provides an end-to-end emotion type and emotion degree controllable emotion voice synthesis model, and the emotion voice synthesis model can directly accept a text to be synthesized, emotion type and emotion intensity as input to obtain corresponding emotion voice.
In the embodiment of the application, the emotion characteristics of the text to be synthesized can be obtained through the specified emotion types and the specified emotion intensities, the emotion text characteristics rich in text information and emotion information can be further obtained by combining the text characteristics of the text to be synthesized, and the synthesized voice with the specified emotion types and emotion intensities can be synthesized through the emotion text characteristics, so that the synthesis of the voice rich in emotion can be realized, the auditory effect of a user can be improved, and the flexibility and the adjustability of the synthesized voice can be improved. In addition, when determining the emotional characteristics, the emotion can be characterized by utilizing the emotion tokens corresponding to various emotion categories so as to enhance the controllability and interpretability of the emotion characterization. In addition, an end-to-end emotion type and emotion degree controllable emotion voice synthesis model can be established, and emotion voice synthesis can be realized by directly utilizing the emotion voice synthesis model so as to improve automation and intellectualization of voice synthesis.
Referring to fig. 4, fig. 4 is a schematic flowchart of another speech synthesis method according to an embodiment of the present application. The method is applied to computer equipment and can be executed by the computer equipment; the embodiment of the application mainly explains a training process of obtaining an emotion voice synthesis model by training a reference model, wherein the reference model comprises an acoustic network and an emotion network. As shown in fig. 4, the speech synthesis method may include:
s401: training sample pairs are obtained.
The training sample pair may include sample data and tag data corresponding to the sample data, where the sample data may include a sample text and a sample audio, and the tag data may include an emotion category tag corresponding to the sample data, and the emotion category tag may be used to indicate an emotion category corresponding to the sample data, for example, an emotion type tag corresponding to a certain sample data is used to indicate a happy emotion category. It should be understood that the number of training sample pairs may include one or more, and the embodiment of the present application mainly uses one training sample pair as an example for describing model training.
S402: and acquiring a sample Mel frequency spectrum diagram of the sample audio, and inputting the sample Mel frequency spectrum diagram into the emotion network in the reference model to obtain the sample emotion characteristics and the weights of the emotion tokens respectively corresponding to the plurality of preset emotion categories.
In one implementation, a mel-frequency spectrogram corresponding to a sample audio frequency may be obtained first, and for example, the mel-frequency spectrogram may be referred to as a sample mel-frequency spectrogram; the sample Mel frequency spectrogram can be used as input of an emotion network in a reference model, so that the sample Mel frequency spectrogram is processed by the emotion network to obtain sample emotion characteristics and weights of emotion tokens respectively corresponding to a plurality of preset emotion types. Wherein, the weight of the emotion token corresponding to each of the plurality of preset emotion categories can be used for calculating the loss value of the subsequent model; the sample Mel frequency spectrogram can be used for subsequent model loss value calculation, as described in step S404.
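Computing a sample mel spectrogram can be done with standard tooling; a minimal librosa sketch is shown below (the file name and the 80-mel, 1024-point-FFT, 256-hop settings are common TTS assumptions, not values prescribed by the patent):

```python
import librosa

# Compute the sample mel spectrogram for one sample audio file.
wav, sr = librosa.load("sample_audio.wav", sr=22050)   # hypothetical sample audio path
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel)                     # [80, T], input to the emotion network
```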
The model structure diagram of the reference model may be as shown in fig. 5a, and as shown in fig. 5a, the reference model may include an acoustic network and an emotion network, the acoustic network may be an end-to-end network, and the emotion network may include an emotion extraction network and an emotion characterization network. Optionally, the specific implementation manner of implementing step S402 by using the emotion network may be: the method comprises the steps of inputting a sample Mel frequency spectrogram into an emotion extraction network for feature extraction to obtain global emotion features of sample audio, wherein the global emotion features can be directly output by the emotion extraction network; further, the global emotion characteristics can be input into the emotion characterization network for feature extraction again, so that sample emotion characteristics of the sample audio under a plurality of preset emotion categories and weights of emotion tokens respectively corresponding to the plurality of preset emotion categories are obtained. Wherein, the sample emotional characteristics can be directly the output of the emotion characterization network; the weights of the plurality of emotion tokens can be obtained in the processing process of the emotion characterization network, and the specific obtaining process can be referred to the description in the emotion characterization network. The emotion extraction network and emotion characterization network are described in detail below.
(1) Emotion extraction network
The network structure of the emotion extraction network, which may also be referred to as a sample audio encoding network, may be as shown in FIG. 5b. As shown in fig. 5b, the input to the emotion extraction network may be a sample mel spectrogram, which, as described above, is the mel spectrogram corresponding to a sample audio; the output of the emotion extraction network may include a global emotion feature and may further include a sample weight sequence, whose length is the same as that of the sample phoneme sequence corresponding to the sample mel spectrogram. The emotion extraction network (sample audio encoding network) mainly consists of L convolutional networks, a gated recurrent network, a length adjustment network and an attention network. The L convolutional networks extract emotional features from the sample mel spectrogram, and each convolutional network may comprise a one-dimensional convolution, batch normalization and an activation function. The gated recurrent network further extracts emotional features from the sample mel spectrogram. The length adjustment network performs sequence-length conversion; in this application it converts a sequence whose length equals that of the frame sequence into a sequence whose length equals that of the sample phoneme sequence. At this point, the emotional features extracted by the gated recurrent network have the same length as the frame sequence, where the frame sequence refers to the original sequence corresponding to the sample mel spectrogram; this original sequence is usually longer than the sample phoneme sequence corresponding to the sample text (for the sample phoneme sequence, see the description of step S403 below), so to make the sequences equal in length, the length adjustment network adjusts the sequence length. Specifically, the length adjustment network may adopt a frame-merging operation, applying an average or maximum merging to all adjacent frames belonging to the same phoneme, so as to convert the frame-length sequence into a sequence of the same length as the sample phoneme sequence.
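The frame-merging operation of the length adjustment network can be sketched as follows, assuming per-phoneme frame counts (durations) are available, for example from forced alignment; the averaging variant is shown:

```python
import torch

def merge_frames_by_phoneme(frame_features, durations):
    """Average all adjacent frames belonging to the same phoneme so that a
    frame-length sequence becomes a phoneme-length sequence.

    frame_features: [T, D] features after the convolution + GRU stages
    durations:      list of frame counts per phoneme, summing to T
    """
    merged, start = [], 0
    for d in durations:
        merged.append(frame_features[start:start + d].mean(dim=0))
        start += d
    return torch.stack(merged)                    # [num_phonemes, D]

out = merge_frames_by_phoneme(torch.randn(10, 256), durations=[3, 2, 5])
print(out.shape)                                  # torch.Size([3, 256])
```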
The attention mechanism network is used to obtain the global emotion feature and the sample weight sequence, and may be a single-head attention network. The output of the length adjustment network serves as the key and value of the attention network, and a randomly initialized learnable vector serves as the query; the attention network then yields the global emotion feature and the sample weight sequence. The global emotion feature characterizes the global emotion information of the sample audio, and each sample weight in the sample weight sequence characterizes the contribution of the corresponding phoneme in the sample audio to the current emotion feature. Since the phonemes contained in the sample audio are generally identical to those of the sample text, each sample weight may also be understood as characterizing the contribution of the corresponding phoneme in the sample phoneme sequence of the sample text to the emotion characterization, or its contribution under the emotion category indicated by the emotion category label. A minimal sketch of this attention step follows.
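The sketch below is one way to realize this step; the scaled dot-product scoring and the feature dimension are assumptions, since the text only requires single-head attention with a randomly initialized learnable query.

```python
import torch
import torch.nn as nn

class GlobalEmotionAttention(nn.Module):
    """Single-head attention whose keys/values are the phoneme-length features
    and whose query is a randomly initialized learnable vector."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, dim))             # learnable query
        self.scale = dim ** -0.5

    def forward(self, phoneme_feats: torch.Tensor):                # [N_phonemes, dim]
        scores = (self.query @ phoneme_feats.t()) * self.scale     # [1, N_phonemes]
        weights = torch.softmax(scores, dim=-1)                    # sample weight sequence
        global_feat = weights @ phoneme_feats                      # [1, dim]
        return global_feat.squeeze(0), weights.squeeze(0)          # global emotion feature, weights
```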
It should be noted that the key point of the emotion extraction network is to use an attention mechanism to extract the global emotion feature and the sample weight sequence. Besides the structure shown in fig. 5b, in other implementations the emotion extraction network may also be built on a convolutional neural network, an LSTM, fully connected layers and so on; the present application is not specifically limited in this respect.
(2) Emotion characterization network
The network structure of the emotion characterization network may be as shown in fig. 5c. The key point of the emotion characterization network is to constrain different emotion categories onto different emotion tokens; it may also be referred to as an emotion token layer. As shown in fig. 5c, the input of the emotion characterization network may be the global emotion feature output by the sample audio encoder (emotion extraction network); the output may include a sentence-level emotion vector, which is the sample emotion feature mentioned above, and may further include the emotion tokens corresponding to the plurality of preset emotion categories. For example, if there are N preset emotion categories, N emotion tokens are output, one per category; the preset emotion categories may include, for example, neutral, happy, angry, sad, surprised, fearful and hateful. The emotion characterization network (emotion token layer) mainly consists of an attention mechanism network, which further extracts emotion features from the global emotion feature to obtain the sample emotion feature; the attention used may be single-head, so in short the network can be understood as a single-head attention network dedicated to emotion characterization.
The attention mechanism network may use N randomly initialized emotion tokens (N being the total number of preset emotion categories) as keys and values, and the input of the emotion characterization network (i.e., the global emotion feature) as the query. The core of an attention mechanism is to let the network focus on what matters most, usually expressed through attention weights, and in essence it can be understood as a weighted summation. Correspondingly, in the embodiment of the present application the attention mechanism amounts to a weighted summation of the N emotion tokens: the sample emotion feature is the result of this weighted summation, which makes it more interpretable than the global emotion feature obtained by the emotion extraction network. The weighted summation involves a weight (also called an attention weight) for each of the N emotion tokens.
The weight of an emotion token may be obtained by computing the similarity between the global emotion feature and that token. For example, the weight of each emotion token may be determined as follows: compute the similarity between the global emotion feature and each emotion token, then normalize the similarities (e.g., with softmax); each normalized result is the weight of the corresponding emotion token. With the weights obtained in this way, the N emotion tokens can be weighted and summed to obtain the sample emotion feature, as sketched below.
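The sketch below shows one way to realize this token weighting with dot-product similarity followed by softmax; the token count N = 7 and the feature dimension are illustrative.

```python
import torch
import torch.nn as nn

class EmotionTokenLayer(nn.Module):
    """N randomly initialized emotion tokens serve as keys/values; the global
    emotion feature is the query; the output is the weighted token sum."""
    def __init__(self, n_tokens: int = 7, dim: int = 256):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, dim))   # one token per preset category
        self.scale = dim ** -0.5

    def forward(self, global_feat: torch.Tensor):                # [dim]
        sims = (self.tokens @ global_feat) * self.scale          # similarity with each token, [N]
        weights = torch.softmax(sims, dim=-1)                    # normalized token weights
        emotion_vec = weights @ self.tokens                      # sentence-level emotion vector
        return emotion_vec, weights
```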
When the reference model is trained, each emotion token may be initialized randomly; through continuous iterative training, the N emotion tokens and their corresponding weights are updated continuously. After training is completed, the resulting N emotion tokens and their weights (i.e., those in the emotion characterization network at the last training iteration) are the data required by the embodiment of the present application. The N emotion tokens are the emotion tokens involved in step S202 above, and their weights are required for calculating the model loss value in step S404. One emotion token corresponds to one preset emotion category, so the weights of the N emotion tokens are the weights of the corresponding preset emotion categories, N being the total number of preset emotion categories.
In the embodiment of the present application, a cross-entropy loss may be added on the weights in the attention mechanism network of the emotion characterization network (i.e., the weights of the emotion tokens). Cross entropy is a multi-class classification loss, so adding it over the weights of the emotion tokens effectively classifies each sample onto an emotion token (each token corresponds to one emotion category); during training of the reference model, each sample datum carries an emotion category label indicating its emotion category. When the cross-entropy loss is computed from the token weights of a sample and its emotion category label, a wrong classification (the predicted category differs from the label) produces a large loss value; as training proceeds the loss value decreases, and a small loss value indicates a correct classification, i.e., the emotion category of the sample has been constrained onto the corresponding emotion token. For example, if the emotion category label of a sample is happy, the cross-entropy loss constrains the emotion characterization of the happy category onto the happy emotion token. After training on a large number of samples, the emotion characterization of each emotion category is constrained onto its designated emotion token, so that one emotion token corresponds to the emotion characterization of one emotion category. The entire emotion space can then be characterized well by weighted combinations of the emotion tokens. A hedged sketch of this loss term follows.
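A minimal sketch of the loss term, assuming the token weights are already softmax-normalized so the cross entropy reduces to the negative log of the weight at the labelled category:

```python
import torch

def token_classification_loss(token_weights: torch.Tensor, emotion_label: int) -> torch.Tensor:
    """Cross entropy over the token weights: drives the weight of the token
    matching the sample's emotion category label towards 1."""
    return -torch.log(token_weights[emotion_label] + 1e-8)   # epsilon guards log(0)
```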
It should be noted that the key point of the emotion characterization network is to constrain the emotion characterizations of different emotion categories onto different emotion tokens. The specific number and types of emotion tokens depend on the data used; besides the emotion categories mentioned above, other categories such as euphoria or adoration are also possible, and the present application does not limit the preset emotion categories.
It is to be understood that the sample audio is usually a time-domain signal, while signal processing usually works with frequency-domain data, so the time-domain sample audio can be converted into the frequency domain. Based on this, the sample Mel spectrogram of the sample audio may be obtained as follows: first, the sample audio is converted from the time domain to the frequency domain by a fast Fourier transform to obtain a spectrogram. The spectrogram represents the distribution of the sample audio over different frequencies on a logarithmic frequency scale; compared with the time-domain signal, this frequency-domain representation highlights the characteristics of the sample audio more easily and reduces the amount of data, thereby speeding up processing. After the spectrogram is obtained, its frequency axis may be further transformed to obtain the Mel spectrogram, for example by mapping the logarithmic frequency scale to the Mel scale. A common way to do this in practice is sketched below.
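As a concrete illustration, the sketch below computes a log-Mel spectrogram with librosa; the sampling rate, FFT size, hop length and 80 Mel bands are common defaults, not parameters fixed by this application.

```python
import librosa
import numpy as np

def sample_mel_spectrogram(wav_path: str, sr: int = 22050, n_mels: int = 80) -> np.ndarray:
    audio, sr = librosa.load(wav_path, sr=sr)                 # time-domain sample audio
    mel = librosa.feature.melspectrogram(y=audio, sr=sr,
                                         n_fft=1024, hop_length=256,
                                         n_mels=n_mels)       # STFT -> Mel filter bank
    return librosa.power_to_db(mel)                           # log scale, shape [n_mels, T_frames]
```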
S403: and inputting the sample text and the sample emotional characteristics into an acoustic network in a reference model to obtain a predicted Mel frequency spectrogram of the sample text.
In one implementation, the acoustic network may include an encoding network and a decoding network. The encoding network encodes the sample phoneme sequence corresponding to the sample text to obtain the sample text feature of the sample text. Since the encoding network processes phonemes rather than raw text, the sample text may first undergo text-to-phoneme conversion to obtain its sample phoneme sequence; this conversion may be implemented by a phoneme conversion network, which may be included in the acoustic network, as shown in fig. 5d. After the sample phoneme sequence is obtained, it may be input into the encoding network of the acoustic network for encoding, yielding the sample text feature of the sample text. A toy illustration of the text-to-phoneme step follows.
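The toy sketch below only illustrates the idea of text-to-phoneme conversion with a hand-written lexicon; the real phoneme conversion network is a learned or rule-based front end, and the lexicon entries here are purely hypothetical.

```python
# Hypothetical two-word lexicon for illustration only.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def text_to_phonemes(text: str) -> list:
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(TOY_LEXICON.get(word, ["<unk>"]))     # unknown words -> placeholder
    return phonemes

# text_to_phonemes("hello world") -> ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```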
The sample emotion feature may also be input into the acoustic network of the reference model, where the sample emotion text feature is obtained from the sample emotion feature and the sample text feature; the sample emotion text feature thus contains features of multiple dimensions (text and emotion). The sample emotion text feature may then be input into the decoding network for decoding, yielding the predicted Mel spectrogram of the sample text.
Optionally, the specific implementation manner of obtaining the sample emotion text feature according to the sample emotion feature and the sample text feature may include the following 3 cases:
case (1): and directly carrying out fusion processing on the sample emotion characteristics and the sample text characteristics to obtain corresponding sample emotion text characteristics. The fusion process may refer to the description in step S203, and is not described herein again.
Case (2): a sample weight sequence is obtained from the output of the emotion extraction network. The sequence length of the sample weight sequence is the same as that of the sample phoneme sequence, and each sample weight characterizes the contribution of the corresponding phoneme in the sample phoneme sequence to the emotion characterization (see the description in step S402). After the sample weight sequence is obtained, vector embedding may be performed on it to obtain a sample weight embedding vector; the embedding converts the dimension of the sample weight sequence to match that of the sample emotion feature or the sample text feature so that fusion can be carried out. The sample emotion text feature may then be obtained from the sample emotion feature, the sample text feature and the sample weight embedding vector, for example by fusing the three; the fusion process is described in step S203 and is not repeated here. An additive-fusion sketch is given below.
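Assuming the additive fusion described for the training phase, case (2) could be sketched as follows; broadcasting the sentence-level emotion vector over the phoneme axis is an assumption about how the dimensions are matched.

```python
import torch

def fuse_emotion_text(text_feats: torch.Tensor,      # [N_phonemes, dim]
                      emotion_vec: torch.Tensor,     # [dim], sample emotion feature
                      weight_emb: torch.Tensor       # [N_phonemes, dim], sample weight embedding
                      ) -> torch.Tensor:
    """Additive fusion of text features, emotion feature and weight embedding."""
    return text_feats + emotion_vec.unsqueeze(0) + weight_emb   # sample emotion text feature
```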
Optionally, each sample weight in the sample weight sequence may be represented by a numerical value. Because the sample weight sequence consists of many discrete values that may all differ, directly embedding every distinct value would require a large amount of computation and slow down model training. A binning operation may therefore be performed on the sample weights. Binning can be understood as quantization: all values within a certain range are mapped to a single value, which reduces the computation of the subsequent vector embedding and thereby speeds up model training.
The binning operation may be implemented as follows: determine H bin categories for the sample weight sequence and a bin value for each category. Each bin category represents a numerical range; for example, one of the H categories may cover 0–0.1 and another 0.1–0.2, and so on. The bin value of a category is the single value assigned to all weights falling in that category; for example, all values in the 0–0.1 category may be set to 0.05. The bin categories may be derived from the maximum and minimum values of the sample weight sequence: compute the difference between the maximum and minimum, compute the ratio of this difference to a reference value, take the floor or ceiling of the ratio as the number of bin categories H, and split the range from minimum to maximum into H numerical ranges (which may or may not be of equal width); these H ranges are the H bin categories. For example, with H = 4 and a minimum and maximum of 0.1 and 0.9, the 4 bin categories may be 0.1–0.3, 0.3–0.5, 0.5–0.7 and 0.7–0.9. H may also simply be preset, for example to 10 or 8. Once the H bin categories are determined, the bin category of each sample weight in the sample weight sequence is determined, and the weight is replaced by the bin value of its category. For example, if a sample weight is 0.01, its bin category is 0–0.1 and the bin value of that category is 0.05, then after binning the sample weight becomes 0.05. A numerical sketch follows.
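Below is a numerical sketch of the binning operation with equal-width bins and bin centres as bin values; the choice of H and of centre values is illustrative, not mandated by the text.

```python
import numpy as np

def bin_sample_weights(weights: np.ndarray, h: int = 8) -> np.ndarray:
    """Quantize each sample weight to the centre value of its bin."""
    lo, hi = weights.min(), weights.max()
    edges = np.linspace(lo, hi, h + 1)                         # H equal-width value ranges
    centres = (edges[:-1] + edges[1:]) / 2                     # one bin value per range
    idx = np.clip(np.digitize(weights, edges) - 1, 0, h - 1)   # bin index of each weight
    return centres[idx]

# bin_sample_weights(np.array([0.01, 0.12, 0.49, 0.90]), h=4)
```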
Case (3): after the sample emotion text feature is obtained through case (1) or case (2), variable information adaptation may be performed on it. Variable information adaptation can be understood as fusing one or more kinds of variable information (such as duration, pitch and energy) into the sample emotion text feature to obtain the finally required sample emotion text feature. This may be implemented by a variable information adaptation network, which may be part of the acoustic network; in a specific implementation, the sample emotion text feature is input into the variable information adaptation network for variable information adaptation, yielding the finally required sample emotion text feature, as shown in fig. 5d.
S404: and training the reference model according to the weights of the emotion tokens respectively corresponding to the prediction Mel frequency spectrogram, the sample Mel frequency spectrogram, the emotion category labels and the plurality of preset emotion categories to obtain a trained reference model, and determining an emotion voice synthesis model according to the trained reference model.
In one implementation, a model loss value of the reference model may be determined based on the predicted Mel spectrogram, the sample Mel spectrogram, the weights of the emotion tokens corresponding to the plurality of preset emotion categories (which serve as the predicted emotion category probabilities), and the emotion category label; the reference model is then trained with this model loss value to obtain the trained reference model. For example, the model parameters of the reference model may be updated in the direction that decreases the model loss value, thereby achieving the effect of model training.
Optionally, the model loss value of the reference model may be determined as follows: a first loss value is computed from the predicted Mel spectrogram and the sample Mel spectrogram, and a second loss value is computed from the weights of the emotion tokens corresponding to the plurality of preset emotion categories and the emotion category label. The loss function for the first loss value may be a mean squared error loss, and that for the second loss value may be a cross-entropy loss. After the two loss values are obtained, the model loss value may be determined from them: for example, the sum of the first and second loss values may be used as the model loss value; alternatively, the two loss values may each be weighted and the sum of the weighted results used as the model loss value. The weighting values may be preset and are not specifically limited in this application; for example, the weight of the first loss value may be 0.3 and that of the second loss value 0.7. With this training mode, the reference model is trained with losses of multiple dimensions, which improves the robustness of the model and yields a better speech synthesis effect. A sketch of the combined objective follows.
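A sketch of the combined objective; the 0.3/0.7 weighting simply mirrors the example values above, and the normalized-weight form of the cross entropy is an assumption.

```python
import torch
import torch.nn.functional as F

def reference_model_loss(pred_mel: torch.Tensor, sample_mel: torch.Tensor,
                         token_weights: torch.Tensor, emotion_label: int,
                         w_mel: float = 0.3, w_emo: float = 0.7) -> torch.Tensor:
    mel_loss = F.mse_loss(pred_mel, sample_mel)                    # first loss value (MSE)
    emo_loss = -torch.log(token_weights[emotion_label] + 1e-8)     # second loss value (cross entropy)
    return w_mel * mel_loss + w_emo * emo_loss                     # model loss value
```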
In one implementation, after the trained reference model is obtained, the emotion speech synthesis model may be determined from it. Optionally, the emotion tokens corresponding to the plurality of preset emotion categories may be obtained from the output of the emotion characterization network (i.e., its output during the last iteration of the training process), and an emotion control network is constructed from these emotion tokens. The acoustic network is then extracted from the trained reference model, and the emotion speech synthesis model is determined from this acoustic network and the emotion control network; that is, the emotion speech synthesis model may consist of the acoustic network of the trained reference model and the emotion control network, as shown in fig. 3b or fig. 3d.
The above method embodiments are all illustrations of the method of the present application; each embodiment has its own emphasis, and for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments. For example, after the emotion speech synthesis model is obtained through training, the target emotion category and target emotion intensity corresponding to a text to be synthesized can be obtained, so that the emotion category and emotion intensity of the text to be synthesized are controlled based on the emotion speech synthesis model and the emotion speech of the text under the target emotion category and intensity is obtained; this is not repeated here.
In the embodiment of the present application, data without emotion intensity labels can be used directly for training, and no additional preprocessing such as emotion intensity ranking or extraction needs to be performed on the emotion data set, which simplifies the training flow of the reference model. Moreover, with the trained reference model, an end-to-end emotion speech synthesis model with controllable emotion category and emotion intensity can be built, which directly accepts the text to be synthesized, the emotion category and the emotion intensity as input and outputs the corresponding emotion speech, improving the automation and intelligence of speech synthesis.
To aid understanding of the speech synthesis method of the present application, the following description refers to the model structure diagrams shown in fig. 3d and fig. 5d. To realize speech synthesis under a specified emotion category and emotion intensity, the embodiment of the present application may build on an acoustic network such as FastSpeech 2, to which an emotion extraction network (also referred to as a sample audio encoder), an emotion characterization network (also referred to as an emotion token layer) and an emotion control network are added for extracting, characterizing and controlling emotion respectively. The embodiment involves a training phase (steps S401–S404) and an inference phase (steps S201–S204). The two phases use different model frames: the training phase uses the reference model mentioned above (comprising the acoustic network, the emotion extraction network and the emotion characterization network), as shown in fig. 5d, while the inference phase uses the emotion speech synthesis model mentioned above (comprising the acoustic network and the emotion control network), as shown in fig. 3d. The two phases are explained below.
In the training phase, the reference model receives a <sample text, sample audio> pair with an emotion category label as input. The emotion extraction network extracts the global emotion feature from the sample Mel spectrogram of the sample audio; the global emotion feature is then fed into the emotion characterization network to obtain a sentence-level emotion feature of a specific emotion category, which is expanded to the length of the sample phoneme sequence to form the sample emotion feature. The sample emotion feature is added to the output of the encoding network in the acoustic network (i.e., the sample text feature). In addition, the sample weight embedding vector corresponding to the sample weight sequence output by the emotion extraction network may also be added to the encoder output, giving the finally required sample emotion text feature; optionally, the sample weight sequence may first be binned and each bin embedded separately to obtain the sample weight embedding vector. From the sample emotion text feature, the predicted Mel spectrogram is obtained, which completes the training of the reference model, as shown in fig. 5d.
In the inference phase, speech of a specified emotion category and emotion intensity can be synthesized from a manually specified text to be synthesized, target emotion category and target emotion intensity. The emotion control network receives the target emotion category and target emotion intensity as input, applies a specific weighting to the emotion tokens obtained in the training phase, and outputs an emotion feature of the specified emotion category and intensity, which is then added to the output of the encoding network in the acoustic network (i.e., the text feature). In addition, a reference weight embedding vector corresponding to a reference weight sequence may also be added to the encoder output, giving the emotion text feature; the Mel spectrogram is then obtained from the emotion text feature and the emotion speech from the Mel spectrogram, as shown in fig. 3d. One possible weighting scheme is sketched below.
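One plausible weighting scheme for the emotion control network is sketched below: the target intensity is placed on the token of the target category and the remaining tokens are left at zero. The application only requires that the tokens be weighted according to the target category and intensity, so this particular scheme is an assumption.

```python
import torch

def control_emotion(tokens: torch.Tensor,      # [N, dim], trained emotion tokens
                    target_idx: int,           # index of the target emotion category
                    intensity: float           # target emotion intensity, e.g. in [0, 1]
                    ) -> torch.Tensor:
    weights = torch.zeros(tokens.shape[0])
    weights[target_idx] = intensity            # emphasize the chosen emotion token
    return weights @ tokens                    # emotion feature fed to the acoustic network
```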
It can be seen that the embodiment of the present application provides a speech synthesis method with controllable emotion category and intensity. In the training phase, a sentence-level global emotion feature and a sample weight sequence as long as the sample phoneme sequence are obtained through the emotion extraction network; different categories of emotion are then constrained onto different emotion tokens by the dedicated emotion characterization network, yielding a sentence-level emotion characterization vector (i.e., the sample emotion feature) and a set of emotion tokens representing the preset emotion categories, which together characterize the entire emotion space well. In the inference phase, different weighted combinations of the emotion tokens can be specified through the emotion control network, giving emotion characterizations of different categories and intensities, and the corresponding audio is synthesized by the end-to-end acoustic network. The emotion characterization can be controlled at sentence level or at the finer-grained phoneme level. In addition, weight embedding (the sample weight embedding vector in the training phase and the reference weight embedding vector in the inference phase) is introduced, which further expands the emotion characterization space, reduces the effect of emotion characterization averaging, and improves speech synthesis quality.
Fig. 6 is a schematic structural diagram of a speech synthesis apparatus according to an embodiment of the present application. The speech synthesis apparatus described in this embodiment includes:
an obtaining unit 601, configured to obtain a target emotion category and a target emotion intensity corresponding to a text to be synthesized;
a first determining unit 602, configured to determine an emotion feature of the text to be synthesized according to an emotion token, the target emotion category, and the target emotion intensity that respectively correspond to multiple preset emotion categories; any emotion token is used for representing the characteristics corresponding to the preset emotion types;
a second determining unit 603, configured to determine, according to the text feature of the text to be synthesized and the emotion feature, an emotion text feature of the text to be synthesized;
and a synthesizing unit 604, configured to synthesize, according to the emotion text features, emotion voice of which the text to be synthesized meets the target emotion category and the target emotion intensity.
In an implementation manner, the first determining unit 602 is specifically configured to:
inputting the target emotion category and the target emotion intensity into an emotion control network in an emotion voice synthesis model, and weighting emotion tokens respectively corresponding to the preset emotion categories by the emotion control network based on the target emotion category and the target emotion intensity to obtain emotion characteristics of the text to be synthesized; and the emotion token corresponding to the preset emotion category is obtained in the training process of the emotion voice synthesis model.
In one implementation, the emotion speech synthesis model is obtained by training a reference model, and the reference model includes an acoustic network and an emotion network; the apparatus further comprises a training unit 605, specifically configured to:
obtaining a training sample pair, wherein the training sample pair comprises sample data and label data corresponding to the sample data, the sample data comprises a sample text and a sample audio, and the label data comprises an emotion category label corresponding to the sample data;
obtaining a sample Mel frequency spectrogram of the sample audio, inputting the sample Mel frequency spectrogram into the emotion network, and obtaining sample emotion characteristics and weights of emotion tokens respectively corresponding to the preset emotion categories;
inputting the sample text and the sample emotional characteristics into the acoustic network to obtain a predicted Mel frequency spectrogram of the sample text;
and training the reference model according to the weights of the emotion tokens respectively corresponding to the prediction Mel frequency spectrogram, the sample Mel frequency spectrogram, the emotion class labels and the preset emotion classes to obtain a trained reference model, and determining an emotion voice synthesis model according to the trained reference model.
In one implementation, the emotion network comprises an emotion extraction network and an emotion characterization network; the training unit 605 is specifically configured to:
inputting the sample Mel frequency spectrogram into the emotion extraction network for feature extraction to obtain global emotion features of the sample audio;
inputting the global emotion characteristics into the emotion characterization network for characteristic extraction, and obtaining sample emotion characteristics of the sample audio under the plurality of preset emotion categories and weights of emotion tokens respectively corresponding to the plurality of preset emotion categories.
In one implementation, the acoustic network includes an encoding network and a decoding network; the training unit 605 is specifically configured to:
performing text-to-phoneme conversion on the sample text to obtain a sample phoneme sequence corresponding to the sample text;
inputting the sample phoneme sequence into the coding network for coding to obtain sample text characteristics of the sample text;
obtaining sample emotion text characteristics according to the sample emotion characteristics and the sample text characteristics;
and inputting the sample emotion text characteristics into the decoding network for decoding processing to obtain a predicted Mel frequency spectrogram of the sample text.
In an implementation manner, the training unit 605 is specifically configured to:
obtaining a sample weight sequence from the output of the emotion extraction network; the sequence length of the sample weight sequence is the same as that of the sample phoneme sequence, and each sample weight in the sample weight sequence is used for characterizing the contribution degree of a corresponding phoneme in the sample phoneme sequence to emotion characterization;
performing vector embedding on the sample weight sequence to obtain a sample weight embedded vector aiming at the sample weight sequence;
and embedding the vector according to the sample emotion characteristics, the sample text characteristics and the sample weight to obtain sample emotion text characteristics.
In an implementation manner, the training unit 605 is specifically configured to:
obtaining emotion tokens respectively corresponding to a plurality of preset emotion categories from the output of the emotion representation network;
constructing an emotion control network by the emotion tokens respectively corresponding to the plurality of preset emotion categories;
and determining the emotion voice synthesis model based on the acoustic network in the trained reference model and the emotion control network.
In an implementation manner, the second determining unit 603 is specifically configured to:
acquiring a reference weight sequence corresponding to the text feature of the text to be synthesized, wherein the sequence length of the reference weight sequence is the same as the sequence length of a phoneme sequence of the text to be synthesized;
performing vector embedding processing on the reference weight sequence to obtain a reference weight embedded vector aiming at the reference weight sequence;
and determining the emotional text features of the text to be synthesized according to the emotional features, the text features and the reference weight embedded vector.
In an implementation manner, the synthesis unit 604 is specifically configured to:
decoding the emotional text features to obtain a Mel frequency spectrum diagram of the text to be synthesized;
and performing vocoder conversion on the Mel frequency spectrogram to obtain the emotional voice of which the text to be synthesized conforms to the target emotion category and the target emotion intensity.
It is understood that the division of the units in the embodiments of the present application is illustrative, and is only one logical function division, and there may be another division manner in actual implementation. Each functional unit in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
Fig. 7 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure. The computer device described in this embodiment includes: a processor 701, a memory 702, and a network interface 703. Data may be exchanged between the processor 701, the memory 702, and the network interface 703.
The processor 701 may be a Central Processing Unit (CPU), and may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 702, which may include both read-only memory and random-access memory, provides program instructions and data to the processor 701. A portion of the memory 702 may also include non-volatile random access memory. When the processor 701 calls the program instruction, it is configured to:
acquiring a target emotion category and a target emotion intensity corresponding to a text to be synthesized;
determining the emotional characteristics of the text to be synthesized according to the emotion tokens, the target emotion types and the target emotion intensity which respectively correspond to the preset emotion types; any one emotion token is used for representing the characteristics of the corresponding preset emotion category;
determining the emotional text characteristics of the text to be synthesized according to the text characteristics and the emotional characteristics of the text to be synthesized;
and synthesizing the emotional voice of which the text to be synthesized accords with the target emotion category and the target emotion intensity according to the emotional text characteristics.
In one implementation, the processor 701 is specifically configured to:
inputting the target emotion type and the target emotion intensity into an emotion control network in an emotion voice synthesis model, and weighting emotion tokens respectively corresponding to the preset emotion types by the emotion control network based on the target emotion type and the target emotion intensity to obtain the emotion characteristics of the text to be synthesized; and obtaining the emotion token corresponding to the preset emotion category in the training process of the emotion voice synthesis model.
In one implementation, the emotion speech synthesis model is obtained by training a reference model, and the reference model includes an acoustic network and an emotion network; the processor 701 is further configured to:
obtaining a training sample pair, wherein the training sample pair comprises sample data and label data corresponding to the sample data, the sample data comprises a sample text and a sample audio, and the label data comprises an emotion category label corresponding to the sample data;
obtaining a sample Mel frequency spectrogram of the sample audio, inputting the sample Mel frequency spectrogram into the emotion network, and obtaining sample emotion characteristics and weights of emotion tokens respectively corresponding to the preset emotion categories;
inputting the sample text and the sample emotional characteristics into the acoustic network to obtain a predicted Mel frequency spectrogram of the sample text;
and training the reference model according to the weights of the emotion tokens respectively corresponding to the prediction Mel frequency spectrogram, the sample Mel frequency spectrogram, the emotion class labels and the preset emotion classes to obtain a trained reference model, and determining an emotion voice synthesis model according to the trained reference model.
In one implementation, the emotion network comprises an emotion extraction network and an emotion characterization network; the processor 701 is specifically configured to:
inputting the sample Mel frequency spectrogram into the emotion extraction network for feature extraction to obtain global emotion features of the sample audio;
inputting the global emotion characteristics into the emotion characterization network for characteristic extraction, and obtaining sample emotion characteristics of the sample audio under the plurality of preset emotion categories and weights of emotion tokens respectively corresponding to the plurality of preset emotion categories.
In one implementation, the acoustic network includes an encoding network and a decoding network; the processor 701 is specifically configured to:
performing text-to-phoneme conversion on the sample text to obtain a sample phoneme sequence corresponding to the sample text;
inputting the sample phoneme sequence into the coding network for coding to obtain sample text characteristics of the sample text;
obtaining sample emotion text characteristics according to the sample emotion characteristics and the sample text characteristics;
and inputting the sample emotion text characteristics into the decoding network for decoding processing to obtain a predicted Mel frequency spectrogram of the sample text.
In one implementation, the processor 701 is specifically configured to:
obtaining a sample weight sequence from the output of the emotion extraction network; the sequence length of the sample weight sequence is the same as that of the sample phoneme sequence, and each sample weight in the sample weight sequence is used for characterizing the contribution degree of a corresponding phoneme in the sample phoneme sequence to emotion characterization;
performing vector embedding on the sample weight sequence to obtain a sample weight embedded vector aiming at the sample weight sequence;
and embedding the vector according to the sample emotion characteristics, the sample text characteristics and the sample weight to obtain sample emotion text characteristics.
In an implementation manner, the processor 701 is specifically configured to:
obtaining emotion tokens respectively corresponding to a plurality of preset emotion categories from the output of the emotion representation network;
constructing an emotion control network by the emotion tokens respectively corresponding to the plurality of preset emotion categories;
and determining the emotion voice synthesis model based on the acoustic network in the trained reference model and the emotion control network.
In one implementation, the processor 701 is specifically configured to:
acquiring a reference weight sequence corresponding to the text feature of the text to be synthesized, wherein the sequence length of the reference weight sequence is the same as the sequence length of a phoneme sequence of the text to be synthesized;
performing vector embedding processing on the reference weight sequence to obtain a reference weight embedded vector aiming at the reference weight sequence;
and determining the emotional text feature of the text to be synthesized according to the emotional feature, the text feature and the reference weight embedded vector.
In one implementation, the processor 701 is specifically configured to:
decoding the emotional text features to obtain a Mel frequency spectrum diagram of the text to be synthesized;
and performing vocoder conversion on the Mel frequency spectrogram to obtain the emotional voice of which the text to be synthesized conforms to the target emotion category and the target emotion intensity.
The embodiment of the present application further provides a computer storage medium in which program instructions are stored; when the program is executed, part or all of the steps of the speech synthesis method in the embodiments corresponding to fig. 2 or fig. 4 may be performed.
It should be noted that, for simplicity of description, the above-mentioned embodiments of the method are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the order of acts described, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art will recognize that the embodiments described in this specification are preferred embodiments and that acts or modules referred to are not necessarily required for this application.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by hardware instructed by a program, and the program may be stored in a computer-readable storage medium, which may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and the like.
Embodiments of the present application also provide a computer program product or computer program comprising program instructions stored in a computer readable storage medium. The processor of the computer device reads the program instructions from the computer-readable storage medium, and the processor executes the program instructions to cause the computer device to perform the steps performed in the embodiments of the methods described above.
The foregoing describes a speech synthesis method, apparatus, computer device, and storage medium provided in the embodiments of the present application in detail, and specific examples are applied herein to explain the principles and implementations of the present application, and the descriptions of the foregoing embodiments are only used to help understand the method and core ideas of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (11)

1. A method of speech synthesis, the method comprising:
acquiring a target emotion category and a target emotion intensity corresponding to a text to be synthesized;
determining the emotional characteristics of the text to be synthesized according to the emotion tokens, the target emotion types and the target emotion intensity which respectively correspond to the preset emotion types; any one emotion token is used for representing the characteristics of the corresponding preset emotion category;
determining the emotional text characteristics of the text to be synthesized according to the text characteristics and the emotional characteristics of the text to be synthesized;
and synthesizing the emotional voice of which the text to be synthesized accords with the target emotion category and the target emotion intensity according to the emotional text characteristics.
2. The method of claim 1, wherein the determining the emotional characteristics of the text to be synthesized according to the emotion tokens, the target emotion categories and the target emotion intensities respectively corresponding to a plurality of preset emotion categories comprises:
inputting the target emotion type and the target emotion intensity into an emotion control network in an emotion voice synthesis model, and weighting emotion tokens respectively corresponding to the preset emotion types by the emotion control network based on the target emotion type and the target emotion intensity to obtain the emotion characteristics of the text to be synthesized; and the emotion token corresponding to the preset emotion category is obtained in the training process of the emotion voice synthesis model.
3. The method of claim 2, wherein the emotion speech synthesis model is obtained by training a reference model, and the reference model comprises an acoustic network and an emotion network; the method further comprises the following steps:
obtaining a training sample pair, wherein the training sample pair comprises sample data and label data corresponding to the sample data, the sample data comprises a sample text and a sample audio, and the label data comprises an emotion category label corresponding to the sample data;
obtaining a sample Mel frequency spectrogram of the sample audio, inputting the sample Mel frequency spectrogram into the emotion network, and obtaining sample emotion characteristics and weights of emotion tokens respectively corresponding to the plurality of preset emotion categories;
inputting the sample text and the sample emotional characteristics into the acoustic network to obtain a predicted Mel frequency spectrogram of the sample text;
and training the reference model according to the weights of the emotion tokens respectively corresponding to the prediction Mel frequency spectrogram, the sample Mel frequency spectrogram, the emotion class labels and the preset emotion classes to obtain a trained reference model, and determining an emotion voice synthesis model according to the trained reference model.
4. The method of claim 3, wherein the emotion networks comprise an emotion extraction network and an emotion characterization network; the inputting the sample Mel frequency spectrogram into the emotion network to obtain sample emotion characteristics and weights of emotion tokens respectively corresponding to the preset emotion categories, comprising:
inputting the sample Mel frequency spectrogram into the emotion extraction network for feature extraction to obtain global emotion features of the sample audio;
inputting the global emotion characteristics into the emotion characterization network for characteristic extraction, and obtaining sample emotion characteristics of the sample audio under the plurality of preset emotion categories and weights of emotion tokens respectively corresponding to the plurality of preset emotion categories.
5. The method of claim 4, wherein the acoustic network comprises an encoding network and a decoding network; the step of inputting the sample text and the sample emotional characteristics into the acoustic network to obtain a predicted Mel frequency spectrogram of the sample text comprises:
performing text-to-phoneme conversion on the sample text to obtain a sample phoneme sequence corresponding to the sample text;
inputting the sample phoneme sequence into the coding network for coding to obtain sample text characteristics of the sample text;
obtaining sample emotion text characteristics according to the sample emotion characteristics and the sample text characteristics;
and inputting the sample emotion text characteristics into the decoding network for decoding processing to obtain a predicted Mel frequency spectrogram of the sample text.
6. The method of claim 5, wherein obtaining sample emotion text features from the sample emotion features and the sample text features comprises:
obtaining a sample weight sequence from the output of the emotion extraction network; the sequence length of the sample weight sequence is the same as that of the sample phoneme sequence, and each sample weight in the sample weight sequence is used for characterizing the contribution degree of a corresponding phoneme in the sample phoneme sequence to emotion characterization;
performing vector embedding on the sample weight sequence to obtain a sample weight embedded vector aiming at the sample weight sequence;
and embedding the vector according to the sample emotion characteristics, the sample text characteristics and the sample weight to obtain sample emotion text characteristics.
7. The method of claim 4, wherein determining the emotion speech synthesis model from the trained reference model comprises:
obtaining emotion tokens respectively corresponding to a plurality of preset emotion types from the output of the emotion characterization network;
constructing an emotion control network by the emotion tokens respectively corresponding to the plurality of preset emotion categories;
and determining the emotion voice synthesis model based on the acoustic network in the trained reference model and the emotion control network.
8. The method according to claim 1 or 2, wherein the determining the emotional text feature of the text to be synthesized according to the text feature of the text to be synthesized and the emotional feature comprises:
acquiring a reference weight sequence corresponding to the text features of the text to be synthesized, wherein the sequence length of the reference weight sequence is the same as the sequence length of a phoneme sequence of the text to be synthesized;
carrying out vector embedding processing on the reference weight sequence to obtain a reference weight embedding vector aiming at the reference weight sequence;
and determining the emotional text features of the text to be synthesized according to the emotional features, the text features and the reference weight embedded vector.
9. The method according to claim 1 or 2, wherein the synthesizing of the emotion voice of the text to be synthesized according to the emotion text features comprises:
decoding the emotional text features to obtain a Mel frequency spectrum diagram of the text to be synthesized;
and performing vocoder conversion on the Mel frequency spectrogram to obtain the emotional voice of which the text to be synthesized conforms to the target emotion category and the target emotion intensity.
10. A computer device comprising a processor, a memory and a network interface, the processor, the memory and the network interface being interconnected, wherein the memory is configured to store a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method of any of claims 1-9.
11. A computer storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions that, when executed by a processor, cause a computer device having the processor to perform the method of any one of claims 1-9.
CN202211016347.6A 2022-08-23 2022-08-23 Speech synthesis method, device, computer equipment and storage medium Pending CN115376486A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211016347.6A CN115376486A (en) 2022-08-23 2022-08-23 Speech synthesis method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211016347.6A CN115376486A (en) 2022-08-23 2022-08-23 Speech synthesis method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115376486A true CN115376486A (en) 2022-11-22

Family

ID=84067921

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211016347.6A Pending CN115376486A (en) 2022-08-23 2022-08-23 Speech synthesis method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115376486A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116843798A (en) * 2023-07-03 2023-10-03 支付宝(杭州)信息技术有限公司 Animation generation method, model training method and device


Similar Documents

Publication Publication Date Title
CN110211563B (en) Chinese speech synthesis method, device and storage medium for scenes and emotion
CN111276120B (en) Speech synthesis method, apparatus and computer-readable storage medium
CN111312245B (en) Voice response method, device and storage medium
CN112071330B (en) Audio data processing method and device and computer readable storage medium
CN112837669B (en) Speech synthesis method, device and server
CN111930900B (en) Standard pronunciation generating method and related device
CN111081230A (en) Speech recognition method and apparatus
WO2022252904A1 (en) Artificial intelligence-based audio processing method and apparatus, device, storage medium, and computer program product
CN113327580A (en) Speech synthesis method, device, readable medium and electronic equipment
JP2024505076A (en) Generate diverse, natural-looking text-to-speech samples
CN114360493A (en) Speech synthesis method, apparatus, medium, computer device and program product
CN114387946A (en) Training method of speech synthesis model and speech synthesis method
WO2024055752A1 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses
CN116312476A (en) Speech synthesis method and device, storage medium and electronic equipment
CN116959464A (en) Training method of audio generation network, audio generation method and device
CN116978364A (en) Audio data processing method, device, equipment and medium
CN117219052A (en) Prosody prediction method, apparatus, device, storage medium, and program product
CN115359780A (en) Speech synthesis method, apparatus, computer device and storage medium
CN113314097B (en) Speech synthesis method, speech synthesis model processing device and electronic equipment
CN115376486A (en) Speech synthesis method, device, computer equipment and storage medium
CN115762465A (en) Training and using method and training and using device of speech generation model
CN114373443A (en) Speech synthesis method and apparatus, computing device, storage medium, and program product
CN115132195A (en) Voice wake-up method, apparatus, device, storage medium and program product
CN115206281A (en) Speech synthesis model training method and device, electronic equipment and medium
CN116959421B (en) Method and device for processing audio data, audio data processing equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination