CN112364653A - Text analysis method, apparatus, server and medium for speech synthesis - Google Patents

Text analysis method, apparatus, server and medium for speech synthesis Download PDF

Info

Publication number
CN112364653A
Authority
CN
China
Prior art keywords
text
text analysis
model
speech
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011240517.XA
Other languages
Chinese (zh)
Inventor
刘世超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202011240517.XA
Publication of CN112364653A
Legal status: Pending

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F 40/00 Handling natural language data
                    • G06F 40/20 Natural language analysis
                        • G06F 40/279 Recognition of textual entities
                            • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking
                            • G06F 40/295 Named entity recognition
                • G06F 9/00 Arrangements for program control, e.g. control units
                    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
                        • G06F 9/46 Multiprogramming arrangements
                            • G06F 9/52 Program synchronisation; Mutual exclusion, e.g. by means of semaphores
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 Computing arrangements based on biological models
                    • G06N 3/02 Neural networks
                        • G06N 3/04 Architecture, e.g. interconnection topology
                            • G06N 3/044 Recurrent networks, e.g. Hopfield networks
                            • G06N 3/045 Combinations of networks
                        • G06N 3/08 Learning methods
        • G10 MUSICAL INSTRUMENTS; ACOUSTICS
            • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
                • G10L 13/00 Speech synthesis; Text to speech systems
                    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
                        • G10L 13/10 Prosody rules derived from text; Stress or intonation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of this application disclose a text analysis method, apparatus, server and medium for speech synthesis. One embodiment of the method comprises: acquiring the text of the speech to be synthesized; and inputting the text of the speech to be synthesized into a pre-trained text analysis model to obtain a text analysis result, wherein the text analysis model comprises a feature sharing layer and at least two subtask models, and the text analysis result comprises the labels output for the text of the speech to be synthesized that correspond to the at least two subtask models. This embodiment shortens the speech synthesis front-end processing flow and can reduce the risk of overfitting in the subtasks, thereby improving the performance of the text analysis model.

Description

Text analysis method, apparatus, server and medium for speech synthesis
Technical Field
The embodiments of the application relate to the technical field of computers, and in particular to a text analysis method, apparatus, server and medium for speech synthesis.
Background
With the development of artificial intelligence, Text To Speech (TTS) technology has found more and more applications. Current speech synthesis technology is generally divided into two parts: a front end and a back end. The front end mainly performs linguistic analysis on the text, which may include, but is not limited to, at least one of the following: language judgment, text normalization (e.g., determining the reading of symbols and numbers), linguistic feature extraction (e.g., word segmentation, polyphonic characters), prosodic analysis and prediction, and the like.
A pipeline structure is usually adopted to process the various subtasks of the speech synthesis front end sequentially, so as to complete the text analysis.
Disclosure of Invention
The embodiment of the application provides a text analysis method, a text analysis device, a text analysis server and a text analysis medium for speech synthesis.
In a first aspect, an embodiment of the present application provides a text analysis method for speech synthesis, the method comprising: acquiring the text of the speech to be synthesized; and inputting the text of the speech to be synthesized into a pre-trained text analysis model to obtain a text analysis result, wherein the text analysis model comprises a feature sharing layer and at least two subtask models, and the text analysis result comprises labels, output for the text of the speech to be synthesized, that correspond to the at least two subtask models.
In some embodiments, the text analysis model is trained by: acquiring a training sample set, wherein the training samples in the training sample set comprise sample inputs and sample labeling information for training the at least two subtask models; and training the text analysis model by taking the sample inputs of the training samples in the training sample set as input and the sample labeling information corresponding to the input sample inputs as the expected output.
In some embodiments, the at least two subtask models include a prosody prediction model; the prosody prediction model comprises a character vector representation network and a sequence labeling network.
In some embodiments, the sample input and sample label information used to train the prosody prediction model includes sample text and prosody pause labels corresponding to the sample text.
In some embodiments, the character vector representation network comprises a BERT (Bidirectional Encoder Representations from Transformers) model, and the sequence labeling network comprises a long short-term memory network and a conditional random field layer.
In some embodiments, the feature sharing layer includes at least one of the BERT model, the long-short term memory network, and the conditional random field layer.
In some embodiments, the at least two subtask models further include at least two of the following: a word segmentation model, a part-of-speech tagging model, a named entity recognition model and a polyphone pronunciation prediction model.
In a second aspect, an embodiment of the present application provides a text analysis apparatus for speech synthesis, including: an acquisition unit configured to acquire a text of a speech to be synthesized; the analysis unit is configured to input a text of the speech to be synthesized into a pre-trained text analysis model to obtain a text analysis result, wherein the text analysis model comprises a feature sharing layer and at least two subtask models, and the text analysis result comprises labels which are output in the text of the speech to be synthesized and correspond to the at least two subtask models.
In some embodiments, the text analysis model is trained by: acquiring a training sample set, wherein the training samples in the training sample set comprise sample inputs and sample labeling information for training the at least two subtask models; and training the text analysis model by taking the sample inputs of the training samples in the training sample set as input and the sample labeling information corresponding to the input sample inputs as the expected output.
In some embodiments, the at least two subtask models include a prosody prediction model; the prosody prediction model comprises a character vector representation network and a sequence labeling network.
In some embodiments, the sample input and sample label information used to train the prosody prediction model includes sample text and prosody pause labels corresponding to the sample text.
In some embodiments, the character vector representation network comprises a BERT model, and the sequence labeling network comprises a long-short term memory network and a conditional random field layer.
In some embodiments, the feature sharing layer includes at least one of the BERT model, the long-short term memory network, and the conditional random field layer.
In some embodiments, the at least two subtask models further include at least two of the following: a word segmentation model, a part-of-speech tagging model, a named entity recognition model and a polyphone pronunciation prediction model.
In a third aspect, an embodiment of the present application provides a server, where the server includes: one or more processors; a storage device having one or more programs stored thereon; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described in any implementation of the first aspect.
In a fourth aspect, the present application provides a computer-readable medium, on which a computer program is stored, which when executed by a processor implements the method described in any implementation manner of the first aspect.
According to the text analysis method, apparatus, server and medium for speech synthesis provided by the embodiments of the present application, parallel processing of multiple subtasks is achieved through a text analysis model comprising a feature sharing layer and at least two subtask models, which shortens the speech synthesis front-end processing flow. Moreover, each subtask can learn features from the other subtasks through the feature sharing layer, which can reduce the risk of overfitting in the subtasks and thereby improve the performance of the text analysis model.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a text analysis method for speech synthesis according to the present application;
FIG. 3 is a schematic diagram of an application scenario of a text analysis method for speech synthesis according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an embodiment of a text analysis model for a text analysis method for speech synthesis according to the present application;
FIG. 5 is a schematic block diagram of one embodiment of a text analysis apparatus for speech synthesis according to the present application;
FIG. 6 is a schematic block diagram of an electronic device suitable for use in implementing embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the disclosure and are not limiting of the disclosure. It should be further noted that, for the convenience of description, only the portions relevant to the present disclosure are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 shows an exemplary architecture 100 to which the text analysis method for speech synthesis or the text analysis apparatus for speech synthesis of the present application may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The terminal devices 101, 102, 103 interact with a server 105 via a network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, a text editing application, a reading application, and the like.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices having a display screen and supporting voice playing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like. When the terminal apparatuses 101, 102, 103 are software, they can be installed in the electronic apparatuses listed above. It may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services) or as a single piece of software or software module. And is not particularly limited herein.
The server 105 may be a server providing various services, such as a background server providing support for a news reading application on the terminal devices 101, 102, 103. The background server may process the text of the speech to be synthesized received from a terminal device (e.g., news text in a news reading application), perform analysis and other processing, and generate a processing result (e.g., a text analysis result for the news text).
Note that, the text of the speech to be synthesized may also be directly stored locally in the server 105, and the server 105 may directly extract and process the locally stored text of the speech to be synthesized, in which case, the terminal apparatuses 101, 102, and 103 and the network 104 may not be present.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the text analysis method for speech synthesis provided in the embodiment of the present application is generally executed by the server 105, and accordingly, the text analysis device for speech synthesis is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a text analysis method for speech synthesis according to the present application is shown. The text analysis method for speech synthesis comprises the following steps:
step 201, obtaining a text of a speech to be synthesized.
In the present embodiment, the execution subject of the text analysis method for speech synthesis (such as the server 105 shown in fig. 1) may acquire the text of the speech to be synthesized through a wired or wireless connection. As an example, the execution subject may obtain a text of the speech to be synthesized that is stored locally in advance, or may obtain a text of the speech to be synthesized sent by an electronic device (for example, a terminal device shown in fig. 1) communicatively connected to it. The text of the speech to be synthesized may be determined according to the actual application scenario. As an example, it may include news text in a news client. As yet another example, it may include e-book text in a reading application.
Step 202, inputting the text of the speech to be synthesized into a pre-trained text analysis model to obtain a text analysis result.
In this embodiment, the executing entity may input the text of the speech to be synthesized, which is obtained in step 201, to a pre-trained text analysis model to obtain a text analysis result. The text analysis model may include a feature sharing layer and at least two subtask models. The text analysis result may include a label output corresponding to the at least two subtask models in the text of the speech to be synthesized.
In this embodiment, the feature sharing layer may be configured to implement sharing of hidden layer parameters among the at least two subtask models. The feature sharing layer may include various feedforward neural networks (FNNs). Alternatively, the feature sharing layer may be constructed based on a Transformer encoder. The subtask models may include various models for performing speech synthesis front-end tasks, which may include, but are not limited to, at least one of the following: a language judgment model, a text normalization model, a word segmentation model, a part-of-speech tagging model, a polyphone pronunciation prediction model, and the like.
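As an illustration of this structure, the following is a minimal PyTorch sketch, not the implementation disclosed here: a shared encoder stands in for the feature sharing layer, with one per-character label head per subtask model. All class names, layer sizes and task names are assumptions invented for the example.

```python
import torch
import torch.nn as nn

class MultiTaskTextAnalysisModel(nn.Module):
    """Sketch of a feature-sharing multi-task model: one shared
    encoder feeds a separate label head per subtask."""

    def __init__(self, vocab_size, hidden_dim, task_label_sizes):
        super().__init__()
        # Feature sharing layer: here an embedding plus a small
        # feedforward encoder; a Transformer encoder would also fit.
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.shared = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
        )
        # One per-character classification head per subtask model
        # (word segmentation, prosody, polyphone pronunciation, ...).
        self.heads = nn.ModuleDict({
            task: nn.Linear(hidden_dim, n_labels)
            for task, n_labels in task_label_sizes.items()
        })

    def forward(self, char_ids):
        features = self.shared(self.embed(char_ids))  # (batch, seq, hidden)
        # All subtask heads read the same shared features in parallel.
        return {task: head(features) for task, head in self.heads.items()}

model = MultiTaskTextAnalysisModel(
    vocab_size=8000, hidden_dim=256,
    task_label_sizes={"segmentation": 4, "prosody": 5, "polyphone": 40},
)
outputs = model(torch.randint(0, 8000, (1, 16)))  # one 16-character text
```

Each forward pass yields the labels of all subtasks in parallel, which is the property the embodiment relies on to shorten the front-end flow.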
In the present embodiment, as an example, the subtask models may include a language judgment model and a text normalization model. For the text of the speech to be synthesized "5000 years of brilliant civilization", the text analysis result may include the language label "Chinese" output by the language judgment model and the reading label "five thousand" for "5000" output by the text normalization model. As yet another example, the subtask models may include a word segmentation model and a polyphone pronunciation prediction model. For the text of the speech to be synthesized "Nanjing city Changjiang river bridge" (南京市长江大桥), the text analysis result may include the word segmentation result "Nanjing city / Changjiang river bridge" output by the word segmentation model and the pronunciation label "cháng" for the polyphonic character "长" output by the polyphone pronunciation prediction model.
In some optional implementations of this embodiment, the text analysis model may be obtained by training through the following steps:
in a first step, a set of training samples is obtained.
In these implementations, the executive for training the text analysis model described above may obtain the set of training samples in various ways. The training samples in the training sample set may include sample input and sample labeling information for training the at least two subtask models. As an example, the sample input and sample labeling information used to train the language judgment model may include a sample text and labeling information used to characterize the language to which the sample text belongs. As yet another example, the sample input and sample annotation information used to train the segmentation model may include sample text and annotation information used to characterize the results of the segmentation of the sample text. As yet another example, the sample input and sample annotation information used to train the polyphonic pronunciation prediction model may include sample text containing the polyphonic characters and annotation information used to characterize the reading of the polyphonic characters in the sample text in the current context.
In the second step, the text analysis model is trained by taking the sample inputs of the training samples in the training sample set as input and the sample labeling information corresponding to the input sample inputs as the expected output.
In these implementations, the execution subject takes the sample input of a training sample in the training sample set as input, takes the sample labeling information corresponding to that sample input as the expected output, and trains the text analysis model by a machine learning method.
Based on this optional implementation, the execution subject may train the text analysis model with the training samples corresponding to the respective subtasks. Thus, a subtask with an insufficient sample size can still learn the relevant shared features from the training samples of the other subtasks, as in the sketch below.
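A sketch of the corresponding training step, reusing the MultiTaskTextAnalysisModel sketched above. Computing the loss only on the head whose annotation a sample carries is a common multi-task recipe and an assumption here; the embodiment does not fix the loss scheme.

```python
import torch
import torch.nn.functional as F
from torch.optim import Adam

def train_step(model, optimizer, char_ids, task, labels):
    """One multi-task update: run the shared encoder, but back-propagate
    only the loss of the subtask this sample is annotated for, so the
    shared features still receive gradients from every subtask's data."""
    optimizer.zero_grad()
    logits = model(char_ids)[task]                # (batch, seq, n_labels)
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1))
    loss.backward()
    optimizer.step()
    return loss.item()

optimizer = Adam(model.parameters(), lr=1e-4)
# Batches drawn from different subtasks update the same shared encoder.
loss = train_step(model, optimizer,
                  char_ids=torch.randint(0, 8000, (1, 16)),
                  task="prosody",
                  labels=torch.randint(0, 5, (1, 16)))
```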
In some optional implementations of this embodiment, the at least two subtask models further include at least two of the following: a word segmentation model, a part-of-speech tagging model, a named entity recognition model and a polyphone pronunciation prediction model. These subtask models may be implemented with various deep neural network models, which are not described here again.
Based on the optional implementation manner, the execution main body can integrate a plurality of voice synthesis front-end tasks through the text analysis model, so that the processing flow is reduced, and the text analysis efficiency is improved.
With continued reference to fig. 3, fig. 3 is a schematic diagram of an application scenario of a text analysis method for speech synthesis according to an embodiment of the present application. In the application scenario of fig. 3, a user 301 uses a terminal device 302 to send the text "The weather is really nice today" 303 of the speech to be synthesized to a background server 304. The background server 304 then inputs the text 303 of the speech to be synthesized into a pre-trained text analysis model 305, and obtains the text analysis result "today / weather / really nice (hǎo)" 306 through the feature sharing layer 3051, the word segmentation model 3052 and the polyphone pronunciation prediction model 3053. Optionally, the background server 304 may also perform speech synthesis back-end processing on the text analysis result to generate the synthesized speech 307 corresponding to the text 303, and may send the generated synthesized speech 307 to the terminal device 302.
At present, a pipeline structure is usually adopted to process each subtask of the speech synthesis front end sequentially, which makes the front-end processing flow excessively long and limits accuracy because the errors of the successive stages accumulate. The method provided by the above embodiment of the present application realizes parallel processing of multiple subtasks through a text analysis model comprising a feature sharing layer and at least two subtask models, shortening the speech synthesis front-end processing flow. Moreover, each subtask can learn features from the other subtasks through the feature sharing layer, which can reduce the risk of overfitting in the subtasks and thereby improve the performance of the text analysis model.
With further reference to FIG. 4, a schematic diagram 400 of an embodiment of a text analysis model for a text analysis method for speech synthesis is shown.
The structure 400 of an embodiment of the text analysis model of the text analysis method for speech synthesis provided by the embodiment may include a feature sharing layer 401, a prosody prediction model 402, and a subtask model 403. The prosody prediction model 402 may include a character vector representation network 4021 and a sequence tagging network 4022. The above-described character vector representation network 4021 may include various neural network models for word vector generation, such as a word2vec model. The sequence labeling network 4022 may include various models for sequence labeling, such as Hidden Markov Models (HMMs).
It should be noted that the above feature sharing layer 401 may be the same as that described in step 202 in the foregoing embodiment, and is not described herein again.
In some optional implementations of the present embodiment, the sample input and sample annotation information for training the prosody prediction model may include sample text and prosody pause annotations corresponding to the sample text. The prosody pause labeling scheme may be, for example, the five-level (0 to 4) ToBI (Tones and Break Indices) prosody labeling system. Alternatively, the prosody pause labels may use a four-level prosodic structure (#1, #2, #3, #4). Here, "#1" may denote a prosodic word, the smallest unit, with no pause; "#2" may denote stress with a short pause; "#3" may denote a prosodic phrase with a pause; and "#4" may denote the end of a complete sentence. Thus, the sample text may be, for example, "I still remember the memorable scene of watching the flag-raising for the first time", and the corresponding prosody pause annotation may be "I still remember #1 the memorable #2 scene #1 of watching #1 the flag-raising #1 for the first time #4".
Based on this optional implementation, the execution subject may perform end-to-end prosody prediction with the prosody prediction model. Compared with the pipeline approach, the prosody prediction subtask no longer needs to build on word segmentation and part-of-speech tagging results, which reduces the annotation required for intermediate steps and the accumulation of errors from upstream stages.
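To make the four-level labeling concrete, the following hypothetical preprocessing helper (not part of the disclosure) converts a "#1"-"#4"-annotated sample into per-character boundary tags suitable for a sequence labeling network:

```python
import re

def prosody_annotation_to_tags(annotated):
    """Turn text annotated with #1-#4 prosody-pause marks into parallel
    (characters, tags) sequences for sequence labeling. Each character is
    tagged with the boundary level that follows it, '0' for no boundary."""
    chars, tags = [], []
    for token in re.split(r"(#[1-4])", annotated):
        if re.fullmatch(r"#[1-4]", token):
            if tags:
                tags[-1] = token[1]  # boundary attaches to preceding char
        else:
            for ch in token:
                chars.append(ch)
                tags.append("0")
    return chars, tags

# The fig. 3 example text "今天天气真好" with assumed pause annotations:
chars, tags = prosody_annotation_to_tags("今天#1天气#2真好#4")
# chars: ['今', '天', '天', '气', '真', '好']
# tags:  ['0', '1', '0', '2', '0', '4']
```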
In some optional implementations of this embodiment, the character vector representation network may include a BERT model. The sequence labeling network can comprise a long-short term memory network and a conditional random field layer.
Based on this optional implementation, text prosody prediction can be realized by pre-training with a BERT model and connecting a long short-term memory network to a conditional random field layer, thereby improving the accuracy of prosody prediction.
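A minimal sketch of such a BERT, long short-term memory and conditional random field stack, assuming the Hugging Face transformers and pytorch-crf packages; the layer dimensions and the pre-trained checkpoint name are assumptions, not details from the disclosure.

```python
import torch.nn as nn
from transformers import BertModel   # Hugging Face Transformers
from torchcrf import CRF             # pytorch-crf package

class ProsodyPredictionModel(nn.Module):
    """Sketch of a BERT -> BiLSTM -> CRF stack for character-level
    prosody boundary tagging (illustrative only)."""

    def __init__(self, num_tags, bert_name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)  # char vectors
        self.lstm = nn.LSTM(self.bert.config.hidden_size, 256,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(512, num_tags)  # 2 * 256 -> tag scores
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, input_ids, attention_mask, tags=None):
        vectors = self.bert(input_ids, attention_mask=attention_mask)[0]
        emissions = self.proj(self.lstm(vectors)[0])
        if tags is not None:  # training: negative log-likelihood
            return -self.crf(emissions, tags, mask=attention_mask.bool())
        return self.crf.decode(emissions, mask=attention_mask.bool())
```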
In some optional implementations of this embodiment, based on the optional implementations, the feature sharing layer may include at least one of the BERT model, the long-short term memory network, and the conditional random field layer.
Based on this optional implementation, the representations generated by at least one of the BERT model, the long short-term memory network and the conditional random field layer can be shared across the subtasks, thereby improving the effect of the overall text analysis model.
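One possible wiring of this sharing (an assumption for illustration; the embodiment leaves the exact wiring open) lets additional subtask heads consume the BERT and BiLSTM features of the ProsodyPredictionModel sketched above:

```python
import torch.nn as nn

class SharedBackboneTextAnalysisModel(nn.Module):
    """Sketch: reuse the BERT + BiLSTM features of the prosody model
    as the feature sharing layer for additional subtask heads."""

    def __init__(self, prosody_model, task_label_sizes):
        super().__init__()
        self.backbone = prosody_model  # shares its BERT and BiLSTM
        self.extra_heads = nn.ModuleDict({
            task: nn.Linear(512, n_labels)  # 512 = 2 * LSTM hidden size
            for task, n_labels in task_label_sizes.items()
        })

    def forward(self, input_ids, attention_mask):
        vectors = self.backbone.bert(
            input_ids, attention_mask=attention_mask)[0]
        features = self.backbone.lstm(vectors)[0]  # shared representation
        return {task: head(features)
                for task, head in self.extra_heads.items()}
```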
In some optional implementations of this embodiment, the at least two subtask models further include at least two of the following: a word segmentation model, a part-of-speech tagging model, a named entity recognition model and a polyphone pronunciation prediction model.
As can be seen from fig. 4, the structure 400 of an embodiment of the text analysis model in the present embodiment embodies the structure of the prosody prediction model included in the text analysis model, the prosody prediction model specifically comprising a character vector representation network and a sequence labeling network. The scheme described in this embodiment can therefore integrate prosody prediction into the text analysis model used for speech synthesis front-end processing, further enriching the functions implemented by the text analysis model.
With further reference to fig. 5, as an implementation of the methods shown in the above-mentioned figures, the present application provides an embodiment of a text analysis apparatus for speech synthesis, which corresponds to the embodiment of the method shown in fig. 2, and which is particularly applicable in various electronic devices.
As shown in fig. 5, the text analysis apparatus 500 for speech synthesis provided by the present embodiment includes an acquisition unit 501 and an analysis unit 502. The acquiring unit 501 is configured to acquire a text of a speech to be synthesized; the analysis unit 502 is configured to input a text of the speech to be synthesized into a pre-trained text analysis model, and obtain a text analysis result, where the text analysis model includes a feature sharing layer and at least two subtask models, and the text analysis result includes a label output in the text of the speech to be synthesized corresponding to the at least two subtask models.
In the present embodiment, in the text analysis device 500 for speech synthesis: the specific processing of the obtaining unit 501 and the analyzing unit 502 and the technical effects thereof can refer to the related descriptions of step 201 and step 202 in the corresponding embodiment of fig. 2, which are not repeated herein.
In some optional implementations of this embodiment, the text analysis model may be trained by: acquiring a training sample set, wherein the training samples in the training sample set comprise sample inputs and sample labeling information for training the at least two subtask models; and training the text analysis model by taking the sample inputs of the training samples in the training sample set as input and the sample labeling information corresponding to the input sample inputs as the expected output.
In some optional implementations of the embodiment, the at least two subtask models may include a prosody prediction model. The prosody prediction model may include a character vector representation network and a sequence labeling network.
In some optional implementations of the present embodiment, the sample input and the sample annotation information for training the prosody prediction model may include sample text and prosody pause annotations corresponding to the sample text.
In some optional implementations of this embodiment, the character vector representation network may include a BERT model. The sequence labeling network can comprise a long-short term memory network and a conditional random field layer.
In some optional implementations of this embodiment, the feature sharing layer may include at least one of the BERT model, the long-short term memory network, and the conditional random field layer.
In some optional implementations of this embodiment, the at least two subtask models may further include at least two of the following: a word segmentation model, a part-of-speech tagging model, a named entity recognition model and a polyphone pronunciation prediction model.
In the apparatus provided by the above embodiment of the present application, the analysis unit 502 inputs the text of the speech to be synthesized acquired by the acquisition unit 501 into a text analysis model comprising a feature sharing layer and at least two subtask models, so that parallel processing of multiple subtasks is realized and the speech synthesis front-end processing flow is shortened. Moreover, each subtask can learn features from the other subtasks through the feature sharing layer, which can reduce the risk of overfitting in the subtasks and thereby improve the performance of the text analysis model.
Referring now to FIG. 6, a block diagram of an electronic device (e.g., the server of FIG. 1) 600 suitable for implementing embodiments of the present application is shown. The terminal device in the embodiments of the present application may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a PDA (personal digital assistant), a PAD (tablet), a PMP (portable multimedia player), a vehicle-mounted terminal (e.g., a car navigation terminal), and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like. The server shown in fig. 6 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 6, the electronic device 600 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. The RAM 603 also stores various programs and data necessary for the operation of the electronic device 600. The processing device 601, the ROM 602 and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, etc.; an output device 607 including, for example, a Liquid Crystal Display (LCD), a speaker, or the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 6 may represent one device or may represent multiple devices as desired.
In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of the embodiments of the present application.
It should be noted that the computer readable medium described in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (Radio Frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the server; or may exist separately and not be assembled into the server. The computer readable medium carries one or more programs which, when executed by the server, cause the server to: acquiring a text of a voice to be synthesized; inputting a text of the voice to be synthesized into a pre-trained text analysis model to obtain a text analysis result, wherein the text analysis model comprises a feature sharing layer and at least two subtask models, and the text analysis result comprises labels which are output in the text of the voice to be synthesized and correspond to the at least two subtask models.
Computer program code for carrying out operations of embodiments of the present application may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language, Python, or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or hardware. The described units may also be provided in a processor, and may be described as: a processor includes an acquisition unit and an analysis unit. Where the names of these units do not in some cases constitute a limitation on the unit itself, for example, the acquisition unit may also be described as a "unit that acquires text of speech to be synthesized".
The above description is only of preferred embodiments of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure in the embodiments of the present application is not limited to technical solutions formed by the specific combination of the above features, and also encompasses other technical solutions formed by any combination of the above features or their equivalents without departing from the disclosed concept, for example, technical solutions formed by substituting the above features with (but not limited to) features with similar functions disclosed in the embodiments of the present application.

Claims (10)

1. A method of text analysis for speech synthesis, comprising:
acquiring a text of a voice to be synthesized;
inputting the text of the speech to be synthesized into a pre-trained text analysis model to obtain a text analysis result, wherein the text analysis model comprises a feature sharing layer and at least two subtask models, and the text analysis result comprises labels which are output in the text of the speech to be synthesized and correspond to the at least two subtask models.
2. The method of claim 1, wherein the text analysis model is trained by:
acquiring a training sample set, wherein training samples in the training sample set comprise sample input and sample labeling information for training the at least two subtask models;
and taking the sample input of the training sample in the training sample set as input, taking the sample marking information corresponding to the input sample input as expected output, and training to obtain the text analysis model.
3. The method of claim 1, wherein the at least two subtask models include a prosodic prediction model; the prosody prediction model comprises a character vector representation network and a sequence labeling network.
4. The method of claim 3, wherein the sample input and sample annotation information used to train the prosodic prediction model comprises sample text and prosodic pause annotations corresponding to the sample text.
5. The method of claim 4, wherein the character vector representation network comprises a BERT model and the sequence annotation network comprises a long-short term memory network and a conditional random field layer.
6. The method of claim 5, wherein the feature sharing layer comprises at least one of the BERT model, a long-short term memory network, and a conditional random field layer.
7. The method according to any one of claims 1 to 6, wherein the at least two subtask models further comprise at least two of the following: a word segmentation model, a part-of-speech tagging model, a named entity recognition model and a polyphone pronunciation prediction model.
8. A text analysis apparatus for speech synthesis, comprising:
an acquisition unit configured to acquire a text of a speech to be synthesized;
the analysis unit is configured to input the text of the speech to be synthesized into a pre-trained text analysis model to obtain a text analysis result, wherein the text analysis model comprises a feature sharing layer and at least two subtask models, and the text analysis result comprises labels, which are output in the text of the speech to be synthesized and correspond to the at least two subtask models.
9. A server, comprising:
one or more processors;
a storage device having one or more programs stored thereon;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-7.
CN202011240517.XA 2020-11-09 2020-11-09 Text analysis method, apparatus, server and medium for speech synthesis Pending CN112364653A (en)

Priority Applications (1)

Application number: CN202011240517.XA; priority date: 2020-11-09; filing date: 2020-11-09; title: Text analysis method, apparatus, server and medium for speech synthesis (published as CN112364653A).


Publications (1)

Publication number: CN112364653A; publication date: 2021-02-12.

Family

ID=74510312

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011240517.XA Pending CN112364653A (en) 2020-11-09 2020-11-09 Text analysis method, apparatus, server and medium for speech synthesis

Country Status (1)

Country Link
CN (1) CN112364653A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115587570A (en) * 2022-12-05 2023-01-10 零犀(北京)科技有限公司 Method, device, model, equipment and medium for marking prosodic boundary and polyphone
WO2023123892A1 (en) * 2021-12-31 2023-07-06 科大讯飞股份有限公司 Construction method for information prediction module, information prediction method, and related device


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9190054B1 (en) * 2012-03-31 2015-11-17 Google Inc. Natural language refinement of voice and text entry
US20190281159A1 (en) * 2016-06-13 2019-09-12 Google Llc Automated call requests with status updates
CN107464559A (en) * 2017-07-11 2017-12-12 中国科学院自动化研究所 Joint forecast model construction method and system based on Chinese rhythm structure and stress
CN110223671A (en) * 2019-06-06 2019-09-10 标贝(深圳)科技有限公司 Language rhythm Boundary Prediction method, apparatus, system and storage medium
CN110782870A (en) * 2019-09-06 2020-02-11 腾讯科技(深圳)有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN110767212A (en) * 2019-10-24 2020-02-07 百度在线网络技术(北京)有限公司 Voice processing method and device and electronic equipment
CN111339771A (en) * 2020-03-09 2020-06-26 广州深声科技有限公司 Text prosody prediction method based on multi-task multi-level model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张鹏远 et al.: "基于预训练语言表示模型的汉语韵律结构预测" (Chinese prosodic structure prediction based on a pre-trained language representation model), 《天津大学学报(自然科学与工程技术版)》 (Journal of Tianjin University (Science and Technology)), pages 265-270 *


Similar Documents

Publication Publication Date Title
CN107945786B (en) Speech synthesis method and device
CN111143535B (en) Method and apparatus for generating a dialogue model
CN107657017B (en) Method and apparatus for providing voice service
CN112489620B (en) Speech synthesis method, device, readable medium and electronic equipment
CN112786006B (en) Speech synthesis method, synthesis model training method, device, medium and equipment
CN111177393B (en) Knowledge graph construction method and device, electronic equipment and storage medium
CN111368559A (en) Voice translation method and device, electronic equipment and storage medium
CN112786011B (en) Speech synthesis method, synthesis model training method, device, medium and equipment
US11758088B2 (en) Method and apparatus for aligning paragraph and video
CN112509562B (en) Method, apparatus, electronic device and medium for text post-processing
US11132996B2 (en) Method and apparatus for outputting information
CN111382261B (en) Abstract generation method and device, electronic equipment and storage medium
CN112331176B (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN110880198A (en) Animation generation method and device
US20240029709A1 (en) Voice generation method and apparatus, device, and computer readable medium
US11036996B2 (en) Method and apparatus for determining (raw) video materials for news
CN111767740A (en) Sound effect adding method and device, storage medium and electronic equipment
CN111368560A (en) Text translation method and device, electronic equipment and storage medium
CN111681661B (en) Speech recognition method, apparatus, electronic device and computer readable medium
CN112906380A (en) Method and device for identifying role in text, readable medium and electronic equipment
CN112364653A (en) Text analysis method, apparatus, server and medium for speech synthesis
CN111862933A (en) Method, apparatus, device and medium for generating synthesized speech
CN111815274A (en) Information processing method and device and electronic equipment
CN111916050A (en) Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN112652329B (en) Text realignment method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination