CN115620700A - Speech synthesis method and system based on long-sentence punctuation-mark preprocessing - Google Patents

Speech synthesis method and system based on long-sentence punctuation-mark preprocessing

Info

Publication number
CN115620700A
CN115620700A (application CN202210977599.9A)
Authority
CN
China
Prior art keywords
punctuation
word vector
text
long text
free long
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210977599.9A
Other languages
Chinese (zh)
Inventor
杨静波
汤跃忠
陈龙
陈云坤
田野
刘丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Third Research Institute Of China Electronics Technology Group Corp
Beijing Zhongdian Huisheng Technology Co ltd
Original Assignee
Third Research Institute Of China Electronics Technology Group Corp
Beijing Zhongdian Huisheng Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Third Research Institute Of China Electronics Technology Group Corp and Beijing Zhongdian Huisheng Technology Co ltd
Priority to CN202210977599.9A
Publication of CN115620700A
Legal status: Pending


Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 — Speech synthesis; Text to speech systems
    • G10L 13/08 — Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/27 — Speech or voice analysis techniques characterised by the analysis technique
    • G10L 25/30 — Speech or voice analysis techniques using neural networks
    • G10L 2013/083 — Special characters, e.g. punctuation marks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a speech synthesis method and system based on punctuation-mark preprocessing of long sentences. The method comprises the following steps: obtaining a punctuation-free long text; performing punctuation processing on the punctuation-free long text; and performing speech synthesis on the long text after punctuation processing. With the invention, punctuation is added to the user's text by inserting a text preprocessing step between text input and speech synthesis. The whole process is completed in the background of the system, so the user perceives no change at any point and the operation is imperceptible.

Description

Speech synthesis method and system based on long-sentence punctuation-mark preprocessing
Technical Field
The invention relates to the field of computer technology, and in particular to a speech synthesis method and system based on punctuation-mark preprocessing of long sentences.
Background
With the development of artificial intelligence, speech synthesis technology has attracted increasing attention. It can be applied to human-computer interaction, conversion of text into natural spoken output, and similar fields, and it is already widely used in scenarios such as intelligent question answering, voice broadcasting, audiobooks, and virtual anchors.
Speech synthesis, the technology of converting arbitrary input text into the corresponding speech, is an important research branch of natural language processing.
Disclosure of Invention
The present invention arises from the applicant's recognition of the following facts and problems: speech synthesis software supports converting text into an audio file, but text edited by users sometimes contains long or even ultra-long sentences without any punctuation, which leads to poor listening quality during synthesis, such as swallowed words and listener fatigue.
To solve this technical problem, embodiments of the present invention provide a speech synthesis method and system based on punctuation-mark preprocessing of long sentences, so as to improve the user experience.
The speech synthesis method based on punctuation-mark preprocessing of long sentences comprises the following steps:
obtaining a punctuation-free long text;
performing punctuation processing on the punctuation-free long text;
and performing speech synthesis on the long text after punctuation processing.
According to some embodiments of the invention, obtaining the punctuation-free long text comprises:
acquiring an input text from a user input interface of a speech synthesis system;
judging whether the input text is a punctuation-free long text, and if so, acquiring the punctuation-free long text;
and performing punctuation processing on the punctuation-free long text comprises:
performing punctuation processing on the punctuation-free long text in the background, imperceptibly to the user.
According to some embodiments of the invention, performing punctuation processing on the punctuation-free long text comprises:
performing punctuation processing on the punctuation-free long text based on a deep learning model.
According to some embodiments of the invention, performing punctuation processing on the punctuation-free long text based on the deep learning model comprises:
extracting the word vectors of the punctuation-free long text;
obtaining the probability of each word vector corresponding to each class of punctuation mark;
and determining, on the principle of global optimality, the preset punctuation mark corresponding to each word vector according to the probabilities of that word vector for the various punctuation marks.
According to some embodiments of the invention, extracting the word vectors of the punctuation-free long text comprises:
extracting the word vectors of the punctuation-free long text by using an ALBERT model;
obtaining the probability of each word vector corresponding to each class of punctuation mark comprises:
obtaining the probability of each word vector corresponding to each class of punctuation mark by using a BiLSTM model;
and determining, on the principle of global optimality, the preset punctuation mark corresponding to each word vector according to the probabilities of that word vector for the various punctuation marks comprises:
determining the preset punctuation mark corresponding to each word vector by using a CRF model according to the probabilities of that word vector for the various punctuation marks.
A speech synthesis system according to an embodiment of the present invention includes:
an acquisition unit for acquiring a punctuation-free long text;
a processing unit for performing punctuation processing on the punctuation-free long text;
and a speech synthesis engine for performing speech synthesis on the long text after punctuation processing.
According to some embodiments of the invention, the acquisition unit is configured to:
acquire an input text from a user input interface of the speech synthesis system;
judge whether the input text is a punctuation-free long text, and if so, acquire the punctuation-free long text;
and the processing unit is configured to:
perform punctuation processing on the punctuation-free long text in the background, imperceptibly to the user.
According to some embodiments of the invention, the processing unit is configured to:
perform punctuation processing on the punctuation-free long text based on a deep learning model.
According to some embodiments of the invention, the processing unit is configured to:
extract the word vectors of the punctuation-free long text;
obtain the probability of each word vector corresponding to each class of punctuation mark;
and determine, on the principle of global optimality, the preset punctuation mark corresponding to each word vector according to the probabilities of that word vector for the various punctuation marks.
According to some embodiments of the invention, the processing unit is configured to:
extract the word vectors of the punctuation-free long text by using an ALBERT model;
obtain the probability of each word vector corresponding to each class of punctuation mark by using a BiLSTM model;
and determine the preset punctuation mark corresponding to each word vector by using a CRF model according to the probabilities of that word vector for the various punctuation marks.
By adopting the embodiments of the invention, a text preprocessing step is added between text input and speech synthesis: punctuation is added to the user's text without being displayed on the software page, which solves the problem of poor listening quality during synthesis.
The foregoing is only an overview of the technical solution of the present invention. To make the technical means of the present invention more clearly understood, and to make the above and other objects, features, and advantages of the present invention more readily apparent, embodiments of the present invention are described below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. In the drawings:
FIG. 1 is a flow chart of a speech synthesis method based on punctuation-mark preprocessing of long sentences according to an embodiment of the present invention;
FIG. 2 is a flow chart of a speech synthesis method based on punctuation-mark preprocessing of long sentences according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a user of the speech synthesis system entering text in a software operation interface according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a speech synthesis system performing punctuation processing on a punctuation-free long text in the background according to an embodiment of the present invention;
fig. 5 is a schematic diagram of the working principle of the deep learning model in the embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. Additionally, in some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
As shown in fig. 1, the speech synthesis method based on punctuation-mark preprocessing of long sentences according to the embodiment of the present invention includes:
s1, obtaining a punctuate-free long text; the term "punctualess long text" is understood to mean a sentence that should have punctuation but not punctuation according to the language specification. "non-punctuation long text" meets two criteria: firstly, the whole sentence has no punctuation mark; secondly, according to the requirement of language specification, the sentences should have pause. For example: the ' don't care ' sentence has no punctuation in the whole sentence, and there should be a pause between ' don ' and ' don't care ', so the ' don't care ' sentence is a long text without punctuation. Although the sentence which can be played by me has no punctuation, the words in the sentence can be completely expressed without pause, so that the sentence does not belong to the category of punctuation-free long texts.
S2, performing punctuation processing on the punctuation-free long text;
S3, performing speech synthesis on the long text after punctuation processing.
Existing speech synthesis technology synthesizes the text entered by the user directly. If that text is a punctuation-free long text, the played audio reads the characters straight through in one breath, with no pause. The result is tiring to listen to, the user still has to work out where the pauses should be while listening, comprehension of the content suffers, and the user may need to listen several times in a row to really understand it. The user experience is very poor.
With the embodiments of the invention, a text preprocessing step is added between text input and speech synthesis to add punctuation to the user's text; the punctuation-processed long text is then synthesized and played, pausing where pauses belong, so the user understands the content directly while listening and the experience is excellent. The whole process is completed in the background of the system, so the user perceives no change at any point: the operation is imperceptible.
On the basis of the above embodiment, modified embodiments are further proposed. It should be noted that, to keep the description brief, each modified embodiment describes only its differences from the above embodiment.
According to some embodiments of the invention, obtaining the punctuation-free long text comprises:
acquiring an input text from a user input interface of a speech synthesis system;
and judging whether the input text is a punctuation-free long text, and if so, acquiring the punctuation-free long text.
It can be understood that the input text is extracted directly from the user input interface and then judged to be, or not to be, a punctuation-free long text. If it is, one step is added before the speech synthesis of the original speech synthesis flow: punctuation is added, and the punctuated text is then played.
The adding process is completed by the background system. From the user's perspective the whole flow has no impact: the user simply enters text on the input interface, exactly as with the original speech synthesis system. Moreover, the punctuation processing is applied to the obtained punctuation-free long text and does not affect the input text displayed on the user input interface.
In this way the intelligence of the speech synthesis method is improved and the user experience is improved, without requiring the user to change any habits.
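The judgement "is this input a punctuation-free long text?" is described above only in prose. The sketch below is a minimal Python illustration of one way it might be implemented; the punctuation set and the length threshold (used here as a rough stand-in for the "should contain a pause" criterion) are not specified in the patent and are assumptions chosen for demonstration.

```python
# Hypothetical helper: the punctuation set and length threshold are illustrative
# assumptions, not values given by the patent.
PUNCTUATION = set("，。！？；：、,.!?;:")
LENGTH_THRESHOLD = 15  # rough proxy for "the sentence should contain a pause"

def is_punctuation_free_long_text(text: str) -> bool:
    """True if the input contains no punctuation mark at all and is long
    enough that, by ordinary language norms, it would contain a pause."""
    stripped = text.strip()
    has_punctuation = any(ch in PUNCTUATION for ch in stripped)
    return not has_punctuation and len(stripped) >= LENGTH_THRESHOLD

if __name__ == "__main__":
    samples = [
        "今天天气很好我们一起去公园散步然后去吃饭",  # no punctuation, needs pauses
        "今天天气很好，我们一起去公园散步。",          # already punctuated
    ]
    for s in samples:
        print(s, "->", is_punctuation_free_long_text(s))
```

Only texts for which this check returns True would be routed through the background punctuation preprocessing; all other input goes to the synthesis engine unchanged.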
According to some embodiments of the invention, performing punctuation processing on the punctuation-free long text comprises:
performing punctuation processing on the punctuation-free long text in the background, imperceptibly to the user.
The entire punctuation preprocessing runs in the background, imperceptibly to the user: nothing is presented on the software interface the user sees, the original text entered by the user is preserved without modification, and the user experience is good.
According to some embodiments of the invention, performing punctuation processing on the punctuation-free long text comprises:
performing punctuation processing on the punctuation-free long text based on a deep learning model.
Completing the punctuation processing of punctuation-free long texts automatically with a deep learning model ensures both processing efficiency and processing accuracy.
According to some embodiments of the invention, performing punctuation processing on the punctuation-free long text based on the deep learning model comprises:
extracting the word vectors of the punctuation-free long text; it can be understood that each word in the punctuation-free long text is vectorized to obtain the word vector corresponding to that word;
obtaining the probability of each word vector corresponding to each class of punctuation mark;
and determining, on the principle of global optimality, the preset punctuation mark corresponding to each word vector according to the probabilities of that word vector for the various punctuation marks.
If each word vector's probabilities were classified directly and independently, with the highest-probability punctuation mark chosen as the output, the information between adjacent word vectors could not be taken into account, a global optimum could not be reached, and the classification result would be unsatisfactory. The invention determines the preset punctuation mark for each word vector with the global optimum in mind, which improves overall performance.
According to some embodiments of the invention, extracting the word vectors of the punctuation-free long text comprises:
extracting the word vectors of the punctuation-free long text by using an ALBERT model.
ALBERT is a pre-trained language representation model for natural language processing. It computes the relationships between words and uses those relationships to adjust weights and extract the important features of a text. Pre-trained with a self-attention structure, it learns deep bidirectional representations that fuse left and right context in every layer, so it can capture contextual information in a real sense and learn the relationships between consecutive text segments.
The invention uses the ALBERT model to represent word vectors, which can effectively represent the ambiguity of words. Through factorized embedding parameterization and cross-layer parameter sharing, the ALBERT model greatly reduces the number of parameters, to only 1.8 M, while the BERT model has 64 times as many parameters; the ALBERT model therefore costs less memory during training and is easier to deploy.
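As an illustration of this word-vector extraction step, the sketch below uses the Hugging Face transformers library. The checkpoint name is a placeholder assumption (the patent names none), and many public Chinese ALBERT checkpoints pair the ALBERT encoder with a BERT-style character tokenizer, which is why a BERT tokenizer is used here.

```python
import torch
from transformers import AlbertModel, BertTokenizerFast

# Placeholder checkpoint: the patent does not name one.
CHECKPOINT = "voidful/albert_chinese_tiny"

tokenizer = BertTokenizerFast.from_pretrained(CHECKPOINT)
encoder = AlbertModel.from_pretrained(CHECKPOINT)
encoder.eval()

def extract_word_vectors(text: str) -> torch.Tensor:
    """Return one contextual vector per character/token of the input text."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**inputs)
    # last_hidden_state: (batch=1, seq_len, hidden_size)
    return outputs.last_hidden_state.squeeze(0)

vectors = extract_word_vectors("今天天气很好我们一起去公园散步")
print(vectors.shape)  # (sequence length incl. special tokens, hidden size)
```

These per-character contextual vectors are what the BiLSTM layer described next consumes.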
According to some embodiments of the invention, obtaining the probability of each word vector corresponding to each class of punctuation mark comprises:
obtaining the probability of each word vector corresponding to each class of punctuation mark by using a BiLSTM model.
A BiLSTM (bidirectional long short-term memory neural network) combines a forward LSTM and a backward LSTM; this bidirectional recurrent neural network is used to obtain the semantic information of the Chinese sequence from the word vectors. An LSTM (long short-term memory neural network) is a widely used recurrent neural network that is sensitive to short-term input while also being able to store long-term state.
The processing of the BiLSTM model comprises the following steps:
The word vectors are fed as the input of each time step of the bidirectional long short-term memory network to obtain the forward and backward hidden-state sequences of the bidirectional LSTM layer. After both the forward and backward passes are complete, the hidden states are concatenated position by position into a complete hidden-state sequence. A linear output layer then maps the complete hidden-state sequence to k dimensions, where k is the number of selectable punctuation classes, yielding a probability value for each word vector with respect to each class of punctuation mark.
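A minimal PyTorch sketch of this BiLSTM scoring layer follows; the input width, hidden size, and number of punctuation classes k are illustrative assumptions, since the patent fixes none of these values.

```python
import torch
import torch.nn as nn

class BiLSTMEmitter(nn.Module):
    """Maps a sequence of word vectors to a k-dimensional score per position,
    one score for each selectable punctuation class."""

    def __init__(self, input_dim: int = 312, hidden_dim: int = 256, num_classes: int = 5):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim,
                              batch_first=True, bidirectional=True)
        # Forward and backward hidden states are concatenated position by
        # position, then mapped linearly to the k punctuation classes.
        self.to_scores = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, word_vectors: torch.Tensor) -> torch.Tensor:
        hidden_seq, _ = self.bilstm(word_vectors)   # (batch, seq_len, 2*hidden_dim)
        return self.to_scores(hidden_seq)           # (batch, seq_len, num_classes)

emitter = BiLSTMEmitter()
scores = emitter(torch.randn(1, 20, 312))           # e.g. 20 word vectors
print(scores.shape)                                  # torch.Size([1, 20, 5])
```

The resulting per-position score matrix is exactly what the CRF layer described next takes as its emission scores.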
According to some embodiments of the invention, determining, on the principle of global optimality, the preset punctuation mark corresponding to each word vector according to the probabilities of that word vector for the various punctuation marks comprises:
determining the preset punctuation mark corresponding to each word vector by using a CRF model according to the probabilities of that word vector for the various punctuation marks.
If the score values of all positions were classified directly and independently, with the highest score taken as the output, the information between adjacent positions could not be considered, a global optimum could not be reached, and the classification result would be unsatisfactory. The third layer of the model, the CRF layer, is therefore introduced: instead of classifying the score of each position independently, it takes the information between adjacent positions into account to obtain a globally optimal classification, a design that improves the overall performance of the model.
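A sketch of such a CRF layer is shown below, using the third-party pytorch-crf package as one possible implementation (the patent prescribes no particular library); the number of classes and the tag meanings are assumptions.

```python
import torch
from torchcrf import CRF  # pip install pytorch-crf

NUM_CLASSES = 5  # assumed tag set, e.g. none / comma / period / question / exclamation

crf = CRF(NUM_CLASSES, batch_first=True)

# Emission scores from the BiLSTM layer: (batch, seq_len, NUM_CLASSES).
emissions = torch.randn(1, 20, NUM_CLASSES)

# Training: the CRF returns the log-likelihood of the gold tag sequence,
# so the training loss is its negative.
gold_tags = torch.randint(0, NUM_CLASSES, (1, 20))
loss = -crf(emissions, gold_tags)

# Inference: Viterbi decoding chooses the globally best tag sequence,
# taking transitions between adjacent positions into account rather than
# picking each position's highest score independently.
best_paths = crf.decode(emissions)  # list with one tag-index list per sequence
print(loss.item(), best_paths[0][:10])
```

The decoded tag sequence tells the preprocessing step which punctuation mark, if any, to insert after each character before the text is handed to the synthesis engine.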
A speech synthesis method based on punctuation-mark preprocessing of long sentences according to an embodiment of the present invention is described in detail below through a specific embodiment with reference to figs. 2 to 5. It is to be understood that the following description is illustrative only and not limiting in any way; all similar structures adopted by the invention, and similar variations thereof, are intended to fall within the scope of the invention.
When much existing synthesis software handles the synthesis of punctuation-free long texts, the mechanism adopted is to have professional voice talent read large amounts of long text, collect the resulting audio data for training, and build a voice library that supports long-text synthesis; alternatively, users who report such problems are asked to add the punctuation manually. The user experience is poor and the time and economic costs are high, for example:
a large amount of long-text data must be prepared, suitable voice talent found for recording, the recorded audio files collected for training, and a voice library with long-text audio built and loaded into the speech synthesis engine, so the text and audio data must be prepared and the time cost is high;
suitable voice talent must be engaged for recording and the related recording fees paid, so the economic cost is high;
the blunt feedback that the user should add punctuation manually puts extra labor on the user;
and the underlying problem remains unsolved, so the synthesized audio is still tiring to listen to during use and the user experience is poor.
The invention therefore aims to solve, through friendlier operation, the problem of synthesizing punctuation-free long text into natural and fluent speech.
For punctuation-free long text content in speech synthesis software, the speech synthesis method of the embodiment of the invention adopts a deep learning approach: the text is preprocessed with natural language understanding technology and corrected in the background, in a form the user does not perceive. The flow chart is shown in fig. 2.
The punctuation processing function of the embodiment of the invention is completed automatically by an ALBERT + BiLSTM + CRF deep learning model.
The ALBERT + BiLSTM + CRF model consists of three models: an ALBERT model (A Lite BERT, a lightweight bidirectional Transformer encoder representation), a BiLSTM model (bidirectional long short-term memory neural network), and a CRF model (conditional random field).
Firstly, the word vectors of the punctuation-free long text are obtained with a pre-trained ALBERT model and recorded as a sequence x = (x1, x2, …, xn). By exploiting the relationships between the words (characters), these word vectors effectively extract the features of the text.
The word vectors are then fed as the input of each time step of the bidirectional long short-term memory network to obtain the forward and backward hidden-state sequences of the bidirectional LSTM layer. After both the forward and backward passes are complete, the hidden states are concatenated position by position into a complete hidden-state sequence, which a linear output layer maps to k dimensions, k being the number of selectable punctuation classes. The sentence features extracted after this mapping form a score matrix P of size n × k, in which each element P(i, j) is the probability value of word vector xi corresponding to the j-th class of punctuation mark.
If the score values of all positions were classified directly and independently, with the highest score taken as the output, the information between adjacent positions could not be considered, a global optimum could not be reached, and the classification result would be unsatisfactory. The third layer of the model, the CRF layer, is therefore introduced: instead of classifying the score of each position independently, it takes the information between adjacent positions into account to obtain a globally optimal classification, a design that improves the overall performance of the model.
The model structure is shown in fig. 5, which uses as its example a short sentence entered without punctuation; the result obtained after punctuation preprocessing is the same sentence with the appropriate punctuation inserted.
After punctuation preprocessing, the processing remains imperceptible in the software interface of the speech synthesis system, i.e. the added punctuation is not presented, so the user still sees the original punctuation-free long text in the interface.
Without punctuation preprocessing, when the user enters a punctuation-free long text in the speech synthesis system and it is synthesized directly, words may be swallowed and the whole text synthesized and played in one breath; the result is tiring to listen to and the user experience is poor. Fig. 3 is a schematic diagram of the user input interface.
When the user enters a punctuation-free long text in the speech synthesis system, it is synthesized in the engine only after punctuation preprocessing; a screenshot of the background data after punctuation has been added intelligently by the deep learning algorithm is shown in fig. 4.
Adding a punctuation preprocessing step before the punctuation-free long text is synthesized also reduces the computation required of the engine: normal text is synthesized by the engine directly, and punctuation preprocessing by the deep learning algorithm is applied only when a punctuation-free long text is to be synthesized.
The entire punctuation preprocessing runs in the background, imperceptibly to the user: nothing is presented on the software interface the user sees, the original text entered by the user is preserved without modification, and the user experience is good.
The long text is punctuation-preprocessed by the deep learning algorithm, the appropriate punctuation is added in a reasonable way, and the text is then fed to the engine for synthesis; the effect is far better than that of unprocessed audio and the listening experience is good.
It should be noted that the above-mentioned embodiments are only preferred embodiments of the present invention, and are not intended to limit the present invention, and those skilled in the art can make various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
A speech synthesis system according to an embodiment of the present invention includes:
an acquisition unit for acquiring a punctuation-free long text;
a processing unit for performing punctuation processing on the punctuation-free long text;
and a speech synthesis engine for performing speech synthesis on the long text after punctuation processing.
By adopting the embodiments of the invention, a text preprocessing step is added between text input and speech synthesis: punctuation is added to the user's text without being displayed on the software page, which solves the problem of poor listening quality during synthesis.
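The three units above might be wired together as in the following minimal structural sketch; every name in it is illustrative, and the punctuate and synthesize callables stand in for the deep learning model and the speech synthesis engine described above rather than any concrete implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SpeechSynthesisSystem:
    """Structural sketch of the acquisition unit, processing unit and engine."""
    is_punctuation_free_long_text: Callable[[str], bool]  # acquisition-unit check
    punctuate: Callable[[str], str]                       # processing unit (ALBERT + BiLSTM + CRF)
    synthesize: Callable[[str], bytes]                    # speech synthesis engine

    def run(self, user_input: str) -> bytes:
        # Acquisition unit: take the text from the input interface and decide
        # whether background punctuation preprocessing is needed.
        text = user_input
        if self.is_punctuation_free_long_text(text):
            # Processing unit: runs in the background; the interface keeps
            # showing the user's original, unmodified input.
            text = self.punctuate(text)
        # Speech synthesis engine: synthesize the (possibly punctuated) text.
        return self.synthesize(text)
```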
On the basis of the above embodiment, various modified embodiments are further proposed. It should be noted that, to keep the description brief, each modified embodiment describes only its differences from the above embodiment.
According to some embodiments of the invention, the acquisition unit is configured to:
acquire an input text from a user input interface of the speech synthesis system;
and judge whether the input text is a punctuation-free long text, and if so, acquire the punctuation-free long text.
According to some embodiments of the invention, the processing unit is configured to:
perform punctuation processing on the punctuation-free long text in the background, imperceptibly to the user.
The entire punctuation preprocessing runs in the background, imperceptibly to the user: nothing is presented on the software interface the user sees, the original text entered by the user is preserved without modification, and the user experience is good.
According to some embodiments of the invention, the processing unit is configured to:
perform punctuation processing on the punctuation-free long text based on a deep learning model.
According to some embodiments of the invention, the processing unit is configured to:
extract the word vectors of the punctuation-free long text;
obtain the probability of each word vector corresponding to each class of punctuation mark;
and determine, on the principle of global optimality, the preset punctuation mark corresponding to each word vector according to the probabilities of that word vector for the various punctuation marks.
According to some embodiments of the invention, the processing unit is configured to:
extract the word vectors of the punctuation-free long text by using an ALBERT model;
obtain the probability of each word vector corresponding to each class of punctuation mark by using a BiLSTM model;
and determine the preset punctuation mark corresponding to each word vector by using a CRF model according to the probabilities of that word vector for the various punctuation marks.
It should be noted that suffixes such as "module", "component", or "unit" used to denote elements merely facilitate the description of the present invention and have no specific meaning in themselves; thus "module", "component", and "unit" may be used interchangeably.
Although some embodiments described herein include some features that are included in other embodiments rather than others, combinations of features from different embodiments are meant to be within the scope of the invention and to form different embodiments. The particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples; for example, in the claims, any of the claimed embodiments may be used in any combination.
The terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
Any reference signs placed between parentheses shall not be construed as limiting the claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The words first, second, third and the like are used to distinguish between similar objects and do not necessarily describe any order; these words may be interpreted as names.
"and/or" describes the association relationship of the associated objects, meaning that there may be three relationships, e.g., a and/or B, which may mean: a exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.

Claims (10)

1. A speech synthesis method based on punctuation-mark preprocessing of long sentences, characterized by comprising the following steps:
obtaining a punctuation-free long text;
performing punctuation processing on the punctuation-free long text;
and performing speech synthesis on the long text after punctuation processing.
2. The method of claim 1, wherein obtaining the punctuation-free long text comprises:
acquiring an input text from a user input interface of a speech synthesis system;
judging whether the input text is a punctuation-free long text, and if so, acquiring the punctuation-free long text;
and wherein performing punctuation processing on the punctuation-free long text comprises:
performing punctuation processing on the punctuation-free long text in the background, imperceptibly to the user.
3. The method of claim 1, wherein performing punctuation processing on the punctuation-free long text comprises:
performing punctuation processing on the punctuation-free long text based on a deep learning model.
4. The method of claim 3, wherein performing punctuation processing on the punctuation-free long text based on the deep learning model comprises:
extracting the word vectors of the punctuation-free long text;
obtaining the probability of each word vector corresponding to each class of punctuation mark;
and determining, on the principle of global optimality, the preset punctuation mark corresponding to each word vector according to the probabilities of that word vector for the various punctuation marks.
5. The method of claim 4, wherein extracting the word vectors of the punctuation-free long text comprises:
extracting the word vectors of the punctuation-free long text by using an ALBERT model;
wherein obtaining the probability of each word vector corresponding to each class of punctuation mark comprises:
obtaining the probability of each word vector corresponding to each class of punctuation mark by using a BiLSTM model;
and wherein determining, on the principle of global optimality, the preset punctuation mark corresponding to each word vector according to the probabilities of that word vector for the various punctuation marks comprises:
determining the preset punctuation mark corresponding to each word vector by using a CRF model according to the probabilities of that word vector for the various punctuation marks.
6. A speech synthesis system based on punctuation-mark preprocessing of long sentences, characterized by comprising:
an acquisition unit for acquiring a punctuation-free long text;
a processing unit for performing punctuation processing on the punctuation-free long text;
and a speech synthesis engine for performing speech synthesis on the long text after punctuation processing.
7. The system of claim 6, wherein the acquisition unit is configured to:
acquire an input text from a user input interface of the speech synthesis system;
judge whether the input text is a punctuation-free long text, and if so, acquire the punctuation-free long text;
and wherein the processing unit is configured to:
perform punctuation processing on the punctuation-free long text in the background, imperceptibly to the user.
8. The system of claim 6, wherein the processing unit is configured to:
perform punctuation processing on the punctuation-free long text based on a deep learning model.
9. The system of claim 8, wherein the processing unit is configured to:
extract the word vectors of the punctuation-free long text;
obtain the probability of each word vector corresponding to each class of punctuation mark;
and determine, on the principle of global optimality, the preset punctuation mark corresponding to each word vector according to the probabilities of that word vector for the various punctuation marks.
10. The system of claim 9, wherein the processing unit is configured to:
extract the word vectors of the punctuation-free long text by using an ALBERT model;
obtain the probability of each word vector corresponding to each class of punctuation mark by using a BiLSTM model;
and determine the preset punctuation mark corresponding to each word vector by using a CRF model according to the probabilities of that word vector for the various punctuation marks.
CN202210977599.9A · 2022-08-15 (priority and filing date) · Speech synthesis method and system based on long-sentence punctuation-mark preprocessing · Pending · Published as CN115620700A

Priority Applications (1)

Application number CN202210977599.9A · Priority date 2022-08-15 · Filing date 2022-08-15 · Title: Speech synthesis method and system based on long-sentence punctuation-mark preprocessing

Applications Claiming Priority (1)

Application number CN202210977599.9A · Priority date 2022-08-15 · Filing date 2022-08-15 · Title: Speech synthesis method and system based on long-sentence punctuation-mark preprocessing

Publications (1)

Publication number CN115620700A · Publication date 2023-01-17

Family

ID=84856438

Family Applications (1)

Application number CN202210977599.9A · Title: Speech synthesis method and system based on long-sentence punctuation-mark preprocessing · Priority date 2022-08-15 · Filing date 2022-08-15 · Status: Pending

Country Status (1)

Country Link
CN (1) CN115620700A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination