CN117854474A - Speech data set synthesis method and system with expressive force and electronic equipment - Google Patents

Speech data set synthesis method and system with expressive force and electronic equipment

Info

Publication number
CN117854474A
CN117854474A (application number CN202410185825.9A)
Authority
CN
China
Prior art keywords
expressive force
data set
voice data
text
expressive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410185825.9A
Other languages
Chinese (zh)
Inventor
俞凯
刘森
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sipic Technology Co Ltd
Original Assignee
Sipic Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sipic Technology Co Ltd filed Critical Sipic Technology Co Ltd
Priority to CN202410185825.9A priority Critical patent/CN117854474A/en
Publication of CN117854474A publication Critical patent/CN117854474A/en
Pending legal-status Critical Current


Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 - Speech synthesis; Text to speech systems
    • G10L13/02 - Methods for producing synthetic speech; Speech synthesisers
    • G10L13/027 - Concept to speech synthesisers; Generation of natural phrases from machine-based concepts
    • G10L13/08 - Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 - Prosody rules derived from text; Stress or intonation
    • G10L2013/083 - Special characters, e.g. punctuation marks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the invention provides a method, a system and an electronic device for synthesizing a speech data set with expressive force. The method comprises the following steps: obtaining original speech data with expressive force, segmenting it into expressive speech segments, and performing speech recognition on the segments to obtain a speech data set of expressive speech segments and recognized text; adjusting the onomatopoeic words and punctuation marks of the recognized text in the data set to obtain a corrected speech data set; performing text expressive force analysis on the corrected data set, and determining the text expressive force category corresponding to each expressive speech segment and its recognized text; and using a large model to label the corrected data set with text expressive force categories in batches according to the rules corresponding to each category, obtaining a batch-labeled speech data set with expressive force. The embodiment of the invention constructs a highly expressive speech synthesis data set, which yields stronger expressive force in TTS.

Description

Speech data set synthesis method and system with expressive force and electronic equipment
Technical Field
The invention relates to the field of intelligent speech, and in particular to a method, a system and an electronic device for synthesizing a speech data set with expressive force.
Background
With the development of deep learning, the quality of TTS (text-to-speech) has improved to the point where TTS models can generate speech very similar to human speech. However, these models tend to be adept only at synthesizing speech with relatively simple emotional characteristics. Material such as novels, poems and talk shows is often rich in expressive text, and this textual form shapes the rhythm of the speaker's delivery, making it difficult for TTS models to reach the desired level of expressiveness.
To solve the above problems, the prior art may:
1. Apply a pre-trained language model to expressive speech synthesis. A pre-trained BERT model, trained on massive amounts of text, is used to extract semantic features of the text. The extracted semantic features are added as auxiliary information to a speech synthesis model with Tacotron2 as its backbone for training, and experimental results show that the expressiveness of the synthesized speech improves significantly when the semantic features are added.
2. Model syntactic features with a graph neural network. Since a syntactic dependency tree is itself a tree structure, it can be well represented by a graph data structure. In addition, SyntaSpeech proposes a graph encoder that helps the model learn syntactic features, which are then used to assist the prediction of acoustic features, prosody and phoneme duration. For part-of-speech tags, researchers typically use a learnable embedding table to fuse them into the speech synthesis model.
In the process of implementing the present invention, the inventor finds that at least the following problems exist in the related art:
the prior art relies on pre-trained language models or on coarse-grained semantic representations of basic syntactic structure. It does not analyze the expressive force of the text itself: linguistic features are represented only by extracting part-of-speech tags or dependency syntax, without deeply and thoroughly mining the linguistic features of the text, so TTS performs poorly when generating highly expressive speech.
Disclosure of Invention
In order to at least solve the problem in the prior art that reliance on pre-trained language models or coarse-grained syntactic representations fails to deeply mine the linguistic expressive force of text, so that TTS performs poorly when generating highly expressive speech, the embodiments of the present invention provide the following solutions.
In a first aspect, an embodiment of the present invention provides a method for synthesizing a speech data set with expressive force, including:
obtaining original speech data with expressive force, segmenting the original speech data to obtain expressive speech segments, and performing speech recognition on the expressive speech segments to obtain a speech data set of expressive speech segments and recognized text, wherein the original speech data comprises: pingshu storytelling recordings;
adjusting the onomatopoeic words and punctuation marks of the recognized text in the speech data set to correct recognition errors in the expressive speech segments caused by pitch and speaking rate, so as to obtain a corrected speech data set;
performing text expressive force analysis on the corrected speech data set, and determining the text expressive force category corresponding to each expressive speech segment and its recognized text in the corrected speech data set, wherein the text expressive force categories comprise: sentence pattern, scene, rhetorical technique, imitated character and emotional color;
and labeling the corrected speech data set with the text expressive force categories in batches by using a large model according to the rules corresponding to the text expressive force categories, so as to obtain a batch-labeled speech data set with expressive force.
In a second aspect, an embodiment of the present invention provides a synthesis system for a speech data set having expressive force, including:
the speech data set determining module is used for obtaining original speech data with expressive force, segmenting the original speech data to obtain expressive speech segments, and performing speech recognition on the expressive speech segments to obtain a speech data set of expressive speech segments and recognized text, wherein the original speech data comprises: pingshu storytelling recordings;
the correction module is used for adjusting the onomatopoeic words and punctuation marks of the recognized text in the speech data set to correct recognition errors in the expressive speech segments caused by pitch and speaking rate, so as to obtain a corrected speech data set;
the expressive force analysis module is used for performing text expressive force analysis on the corrected speech data set, and determining the text expressive force category corresponding to each expressive speech segment and its recognized text in the corrected speech data set, wherein the text expressive force categories comprise: sentence pattern, scene, rhetorical technique, imitated character and emotional color;
and the data set synthesis module is used for labeling the corrected speech data set with the text expressive force categories in batches by using a large model according to the rules corresponding to the text expressive force categories, so as to obtain a batch-labeled speech data set with expressive force.
In a third aspect, there is provided an electronic device, comprising: at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for synthesizing an expressive speech data set according to any one of the embodiments of the invention.
In a fourth aspect, embodiments of the present invention provide a storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the steps of the method for synthesizing an expressive speech data set of any of the embodiments of the present invention.
In a fifth aspect, embodiments of the present invention provide a computer program product comprising computer programs/instructions, wherein the computer programs/instructions, when executed by a processor, implement the steps of the method for synthesizing an expressive speech data set of any of the embodiments of the present invention.
The embodiment of the invention has the following beneficial effects: a highly expressive speech synthesis data set is constructed. The data set is derived from expressive audio, the audio is processed in a targeted manner, text expressive force features are classified from the perspectives of linguistics and literary studies, and a large language model is designed to assist the labeling, so that the expressive force categories in the data set are labeled more accurately and efficiently. In terms of TTS effect, it yields stronger expressive force than current open-source data sets.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a method for synthesizing a speech dataset with expressive power according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a method for synthesizing a speech data set with expressive force according to an embodiment of the present invention;
FIG. 3 is a statistical schematic diagram of StoryTTS of a method for synthesizing a speech dataset with expressive force according to an embodiment of the present invention;
FIG. 4 is a schematic diagram showing a method for synthesizing a speech dataset with expressive power according to an embodiment of the present invention;
FIG. 5 is a schematic diagram showing the specific classification of sentence patterns, scenes and rhetorical techniques in StoryTTS of a method for synthesizing a speech dataset with expressive force according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of character types in StoryTTS of a method for synthesizing an expressive speech dataset according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of LLMs expressive force annotation of a method for synthesizing a speech dataset with expressive force according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of the expressive force encoder model of a method for synthesizing a speech dataset with expressive force according to an embodiment of the present invention;
FIG. 9 is a schematic diagram showing evaluation results of a method for synthesizing a speech dataset with expressive force according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of a synthesizing system for a speech dataset with expressive power according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of an embodiment of an electronic device for synthesizing a voice data set with expressive force according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Fig. 1 is a flowchart of a method for synthesizing a speech data set with expressive force according to an embodiment of the present invention, which includes the following steps:
s11: the method comprises the steps of obtaining original voice data with expressive force, dividing the original voice data to obtain expressive force voice segments, and carrying out voice recognition on the expressive force voice segments to obtain a voice data set with expressive force voice segments and recognition texts, wherein the original voice data comprises the following steps: a comment;
s12: adjusting the personification words and punctuation marks of the recognized text in the voice data set to correct the erroneous recognition of the voice segment with expressive force due to the pitch and the speech speed, so as to obtain a corrected voice data set;
s13: performing text expressive force analysis on the corrected voice data set, and determining each text expressive force category corresponding to each expressive force voice segment and the identification text in the corrected voice data set, wherein the text expressive force category comprises: sentence pattern, scene, technique of congratulation, imitation of character and emotion color;
s14: and marking the corrected voice data set with the text expressive force category in batches by utilizing the large model according to the rule corresponding to the text expressive force category, and obtaining the voice data set with expressive force with batch marking.
In this embodiment, the method constructs a text-to-speech (TTS) data set that is the first TTS data set with rich expressiveness in both speech and text, and that is also equipped with comprehensive annotations of the text expressiveness related to the speech. The data set has high sound quality and well-organized, chapter-coherent characteristics, and contains a sufficient amount of data. Meanwhile, the method establishes an LLM (large language model)-supported framework to annotate text expressiveness in five different dimensions (sentence pattern, scene, rhetorical technique, imitated character and emotional color). Further, experiments verify that a TTS model can produce speech with enhanced expressiveness when the annotated text expressiveness labels are integrated. The overall flow of the method is shown in fig. 2.
For step S11, the method obtains the original expressive speech data from the internet by searching, crawling and the like. Specifically, regarding the selection and retrieval of raw speech data, the method chooses pingshu, a traditional Chinese form of verbal art in which a performer tells a story, imitates various sounds and portrays characters to attract the audience. This form of spoken art is typically based on historical novels, which allows the spoken story to have not only rich speech prosody but also diverse text expressiveness in terms of language structure, rhetoric, role playing, etc. Therefore, a storytelling program largely meets the method's goal of highly expressive speech with rich expressiveness labels. By way of example, the method selects a pingshu program named "The Wise Dongfang Shuo", which tells the legend of Dongfang Shuo, a key figure of the ancient Han dynasty. To construct the data set, the method retrieves the recorded speech data from a public website, organized into 160 consecutive chapters. Each chapter lasts about 24 minutes, for a total of 64 hours including intermittent rests.
After obtaining the raw pingshu speech data, the method estimates the SNR (signal-to-noise ratio) of the speech data before speech cutting, where the noise power is computed from silence segments predicted by a VAD (voice activity detection) tool. As shown in fig. 3, the estimated SNR is 32 dB, indicating high audio quality of the waveform. Subsequently, statistical analysis was performed on several common Mandarin (ZH) and English (EN) data sets. As shown in fig. 4, the compared prior-art data sets include LJSpeech, Blizzard-2013, HiFi-TTS, LibriTTS, Aishell3 and Biaobei, alongside the StoryTTS data set of the present method. The results show that StoryTTS exhibits a significantly higher pitch standard deviation than the other data sets, and that only the StoryTTS data set of the present method provides expressiveness annotations, giving convincing evidence of its strong acoustic expressiveness.
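The SNR estimate described above (noise power taken from VAD-predicted silence) can be sketched as follows. This is a minimal illustration assuming 16 kHz mono PCM and the webrtcvad package; the embodiment does not specify the exact VAD tool, frame size or aggressiveness used.

```python
# Sketch: estimate SNR of a recording by treating VAD-silence frames as noise.
# webrtcvad is one possible VAD, not necessarily the tool used in the embodiment.
import numpy as np
import soundfile as sf
import webrtcvad

def estimate_snr(wav_path, frame_ms=30, aggressiveness=2):
    audio, sr = sf.read(wav_path, dtype="int16")
    assert sr == 16000 and audio.ndim == 1, "expects 16 kHz mono PCM"
    vad = webrtcvad.Vad(aggressiveness)
    frame_len = int(sr * frame_ms / 1000)

    speech_power, noise_power = [], []
    for start in range(0, len(audio) - frame_len, frame_len):
        frame = audio[start:start + frame_len]
        power = np.mean(frame.astype(np.float64) ** 2)
        if vad.is_speech(frame.tobytes(), sr):
            speech_power.append(power)
        else:
            noise_power.append(power)  # silence frames approximate the noise floor

    if not speech_power or not noise_power:
        return None
    return 10 * np.log10(np.mean(speech_power) / (np.mean(noise_power) + 1e-12))
```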
Specifically, to process the initially coarsely segmented speech data, the method first segments the chapter-level speech into utterances according to the duration of silence periods using a VAD tool. In this step, long silences are also removed, yielding 60.9 hours of utterance-level speech segments. Subsequently, since no matching text transcripts are available, the speech segments are recognized with the speech recognition model Whisper to obtain text transcripts. Some speech segments remain too long after the VAD processing. To solve this problem, the method examines such segments and their corresponding text: the overly long text is manually divided into shorter sentences, and the text fragments are then synchronized with the speech using the aeneas tool. This alignment allows the speech to be clipped accurately, producing a speech data set of 33,108 expressive speech segments with recognized text.
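A condensed sketch of the segmentation-and-transcription step is given below, assuming the openai-whisper package and that the VAD pass has already produced (start, end) timestamps; the later manual re-splitting and aeneas alignment of overly long segments is omitted.

```python
# Sketch: cut chapter-level audio into utterances at silence boundaries and
# transcribe each segment with Whisper. Segment times are assumed to come from
# a VAD tool; the model size and language setting are illustrative choices.
import os
import soundfile as sf
import whisper

def transcribe_segments(chapter_wav, vad_segments, out_dir="segments"):
    """vad_segments: list of (start_sec, end_sec) produced by a VAD tool."""
    os.makedirs(out_dir, exist_ok=True)
    model = whisper.load_model("large")          # any Whisper checkpoint works
    audio, sr = sf.read(chapter_wav)
    dataset = []
    for i, (start, end) in enumerate(vad_segments):
        clip = audio[int(start * sr):int(end * sr)]
        clip_path = f"{out_dir}/utt_{i:05d}.wav"
        sf.write(clip_path, clip, sr)
        result = model.transcribe(clip_path, language="zh")
        dataset.append({"audio": clip_path, "text": result["text"].strip()})
    return dataset  # overly long items would later be re-split and re-aligned
```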
For step S12, considering that the pitch and speaking rate in a pingshu storytelling performance are extremely variable, the speech recognition results show a higher error rate than for standard speech. To cope with this problem, the method examines each speech segment and corrects the recognition errors. In addition, onomatopoeic components in the speech are replaced with appropriate words in the corresponding text. To achieve the highest accuracy, manual inspection may be used; a pre-trained model based on preset rules may also be used to correct the onomatopoeic words and punctuation marks. Punctuation and onomatopoeia play a vital role in text expressiveness: emotions such as surprise or shock can be conveyed through exclamation marks, and character dialogue or inner thoughts can be marked with quotation marks. While Whisper can recognize some punctuation marks, its output is still far from what is expected. Punctuation therefore needs to be further corrected and added during text review to ensure that it is used as accurately as possible. This attention to punctuation accuracy also greatly facilitates the subsequent text emotion analysis. The text of the corrected speech data set is highly colloquial and rich in role playing, psychological, action and environmental descriptions.
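The correction pass is primarily manual; the sketch below only illustrates the kind of rule-based pre-filter suggested above. The substitution table and the review heuristic are hypothetical placeholders, not rules given in the source.

```python
# Sketch: rule-based touch-up of recognized text. The substitution table and the
# terminal-punctuation heuristic are hypothetical stand-ins for the "preset rules"
# mentioned above; in the embodiment, corrections were primarily checked manually.
ONOMATOPOEIA_FIXES = {
    # hypothetical ASR confusion -> intended onomatopoeic word
    "哗哗": "哗啦哗啦",
}

TERMINAL = ("。", "！", "？", "…", "”")

def rough_correct(text: str):
    """Return (corrected text, needs_manual_review)."""
    for wrong, right in ONOMATOPOEIA_FIXES.items():
        text = text.replace(wrong, right)
    # Sentences without terminal punctuation, or with unbalanced quotation marks,
    # are flagged for the manual pass that adds exclamation/quotation marks.
    needs_review = (not text.endswith(TERMINAL)) or (text.count("“") != text.count("”"))
    return text, needs_review
```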
For step S13, for the study of speech-related text expressiveness, the method divides it into five dimensions drawing on literary studies, linguistics and rhetoric. These dimensions are sentence pattern, scene, rhetorical technique, imitated character and emotional color. Rhetorical techniques such as exaggeration, and sentence patterns such as declarative sentences, are commonly used means of text expression; for example, using an exclamatory sentence or adding exaggeration can evoke an excited or surprised emotion. In view of the characteristics of the StoryTTS data set of the method, scenes such as role playing (i.e., imitating characters) are adopted: role-playing scenes typically carry strong emotional content, while narration typically lacks emotional elements. The specific classification of sentence patterns, scenes and rhetorical techniques is shown in fig. 5. Emotional color tends to directly affect the performer's expression. Instead of dividing emotions into polarities or predefined categories, the method chooses a more precise approach: summarizing the emotional color of each sentence with a few words. This can describe text emotion more accurately than traditional classification approaches.
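One possible way to carry the five annotation dimensions per utterance is a simple record type, as sketched below; the listed category values are abbreviated examples rather than the full inventories shown in fig. 5.

```python
# Sketch: per-utterance annotation record for the five expressiveness dimensions.
# Category values are abbreviated examples; emotional color is free-form words.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ExpressivenessLabel:
    sentence_pattern: str          # e.g. "declarative", "exclamatory", "interrogative"
    scene: str                     # e.g. "narration", "role-playing", "inner monologue"
    rhetorical_technique: str      # e.g. "none", "exaggeration", "metaphor"
    imitated_character: str        # one of the 19 character types, e.g. "elderly man"
    emotional_colors: List[str] = field(default_factory=list)  # a few summary words

label = ExpressivenessLabel(
    sentence_pattern="exclamatory",
    scene="role-playing",
    rhetorical_technique="exaggeration",
    imitated_character="elderly man",
    emotional_colors=["surprised", "anxious"],
)
```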
Regarding imitated characters in the speech data set StoryTTS (i.e., the character roles performed in the storytelling), the performer often imitates a character's speech pattern when delivering speech; for example, the performer deliberately lowers the pitch and slows down when playing an elderly person, and raises the pitch and speeds up when imitating a contrasting character. According to characteristics such as age, gender and status, the characters can be divided into 19 character types, which is also the basis of the performer's imitation. The six most common character types are shown in fig. 6.
For step S14, the method uses a large model to label the corrected speech data set with the text expressive force categories in batches according to the rules corresponding to the text expressive force categories. Further, this comprises:
the large model receives the continuously input corrected speech data set with text expressive force categories and labels it with context information;
the large model receives source information about the corrected speech data set with text expressive force categories, and, based on this source information, its attention to the onomatopoeia, inner monologue and character imitation in the corrected speech data set is reinforced;
and the large model labels the words corresponding to the text expressive force categories in the reinforced corrected speech data set in batches, obtaining a word-labeled speech data set with expressive force.
The Claude2 large model is used to label, in batches, the words corresponding to sentence patterns, scenes, rhetorical techniques and imitated characters in the reinforced corrected speech data set.
The GPT4 large model is used to label, in batches, the words corresponding to emotional colors in the reinforced corrected speech data set.
In this embodiment, LLMs have been used in extensive research. To speed up the labeling process and reduce cost, the method uses GPT4 and Claude2 for batch annotation, both of which are more powerful than GPT3. In the annotation process, Claude2 is used to annotate sentence patterns, rhetorical techniques, scenes and imitated characters. However, for summarizing the emotional color of the text, the method finds that Claude2 performs poorly, and therefore turns to GPT4, which proves to be more skilled in this regard. In the prompt, the LLM is given the persona of a linguist. The model is then informed that its input text is continuous and must be labeled using context information; for example, two consecutive sentences may belong to the same role-playing scene. Subsequently, the source and features of the text are described in detail, highlighting its richness in elements such as onomatopoeia, inner monologue and role playing. Finally, the model is instructed to annotate each sentence in a prescribed format according to the summarized cues and requirements, where sentence pattern, scene, rhetorical technique and imitated character must be assigned to specific categories, and each emotional color should be summarized in a few words.
Initially, the method tried labeling in a zero-shot setting, but the results were insufficiently accurate. The method therefore provides the model with labeled example text and explains the underlying rationale behind each labeling decision. In this setting, the model shows improved accuracy and meets the method's labeling requirements. Fig. 7 shows an example of annotation using an LLM.
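A minimal sketch of the batch-annotation call is given below, using the OpenAI chat API as a stand-in for both GPT4 and Claude2; the system prompt, the few-shot example and the output format are paraphrased assumptions rather than the exact prompts used in the embodiment.

```python
# Sketch: batch expressiveness annotation with an LLM. The system prompt and the
# few-shot example are paraphrased assumptions; the real prompts cast the model as
# a linguist and require a fixed output format per sentence.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; Claude2 would be called analogously

SYSTEM_PROMPT = (
    "You are a linguist annotating consecutive sentences from a Mandarin "
    "storytelling (pingshu) transcript. Sentences are continuous, so use context "
    "(adjacent sentences may share one role-playing scene). For each sentence output: "
    "sentence_pattern | scene | rhetorical_technique | imitated_character | "
    "emotional_color (a few words)."
)

FEW_SHOT = [  # hand-annotated examples with brief rationales, abbreviated here
    {"role": "user", "content": "“不好！”"},
    {"role": "assistant", "content": "exclamatory | role-playing | none | scholar | alarmed, urgent"},
]

def annotate_batch(sentences, model="gpt-4"):
    labels = []
    for sent in sentences:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "system", "content": SYSTEM_PROMPT}, *FEW_SHOT,
                      {"role": "user", "content": sent}],
            temperature=0,
        )
        labels.append(resp.choices[0].message.content.strip())
    return labels
```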
As an embodiment, after obtaining the batch-labeled speech data set with expressive force, the method further comprises:
constructing an expressive force encoder;
and testing the batch-labeled speech data set with expressive force by using the expressive force encoder to verify the effectiveness of the batch labeling.
The structure of the expressive force encoder includes: a BERT model, a multi-head attention layer, an up-sampling layer, a linear layer and four independent learnable embedding layers. The BERT model receives the input text of the speech data set and outputs word-level embeddings; a sentence-level BERT receives the emotional colors labeled by the large model and outputs emotional color embeddings; the multi-head attention layer, the up-sampling layer and the linear layer determine phoneme-level emotion distributions from the word-level embeddings and the emotional color embeddings; the four independent learnable embedding layers separately receive the scene, sentence pattern, rhetorical technique and imitated character labeled by the large model and output the corresponding embeddings; and the test result is obtained from the phoneme-level emotion distributions and the corresponding embeddings.
To fully exploit the expressiveness annotations, the method designs an expressive force encoder. Four independent learnable embedding tables provide the model with information for four labels: sentence pattern, scene, rhetorical technique and imitated character. For each sentence, four class indices are assigned according to the four expressiveness labels. These indices are then looked up in the corresponding embedding tables, whose vector dimensions are 32, 64 and 256, respectively. The structure of the expressive force encoder is shown in fig. 8.
For the modeling of emotional color, the method adopts a different model structure. Although emotional descriptions are typically condensed into a few words representing the overall emotion of a sentence, the emotion may change within the sentence; for example, in exclamatory sentences, the emotion tends to intensify near the end. The method first uses a pre-trained BERT to extract word-level embeddings for the whole sentence, then uses a sentence BERT to extract the embedding of the emotional color words. Cross-attention between these embeddings captures the emotion distribution over different positions in the text and improves expressive accuracy. The result is then up-sampled to the phoneme level based on the word-to-phoneme correspondence and added to the encoder output together with the previous four embeddings.
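A structural sketch of the expressive force encoder, in PyTorch, is given below. Hidden sizes, class counts (other than the 19 character types), the BERT checkpoint and the repetition-based upsampling are assumptions where the text does not specify them.

```python
# Structural sketch of the expressiveness encoder: four learnable label tables
# (sentence pattern, scene, rhetorical technique, imitated character) plus a
# cross-attention path that spreads a sentence-level emotional-color embedding
# over token positions, then upsamples to phoneme level.
import torch
import torch.nn as nn
from transformers import AutoModel

class ExpressivenessEncoder(nn.Module):
    def __init__(self, d_model=256, n_classes=(7, 5, 9, 19), emb_dims=(32, 32, 64, 256)):
        super().__init__()
        self.label_embs = nn.ModuleList([nn.Embedding(n, d) for n, d in zip(n_classes, emb_dims)])
        self.label_proj = nn.ModuleList([nn.Linear(d, d_model) for d in emb_dims])
        self.word_bert = AutoModel.from_pretrained("bert-base-chinese")  # token-level semantics
        self.attn = nn.MultiheadAttention(embed_dim=768, num_heads=8, batch_first=True)
        self.out = nn.Linear(768, d_model)

    def forward(self, token_ids, attn_mask, emo_emb, label_ids, tok2phone):
        """Per-utterance (batch size 1) for simplicity.
        emo_emb: (1, 768) sentence embedding of the emotional-color words
                 (e.g. from a sentence-BERT model).
        label_ids: four (1,) index tensors for the four expressiveness labels.
        tok2phone: (T,) number of phonemes attributed to each BERT token."""
        tokens = self.word_bert(input_ids=token_ids, attention_mask=attn_mask).last_hidden_state
        # Cross-attention between token embeddings and the emotional-color embedding
        # captures how the emotion is distributed over positions in the sentence.
        emo, _ = self.attn(query=tokens, key=emo_emb[:, None, :], value=emo_emb[:, None, :])
        phone_emo = torch.repeat_interleave(self.out(emo), tok2phone, dim=1)  # token -> phoneme
        # Sum the four projected label embeddings into one expressiveness vector.
        label_vec = sum(p(e(i)) for p, e, i in zip(self.label_proj, self.label_embs, label_ids))
        return phone_emo, label_vec  # both are added to the TTS encoder output
```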
In addition, the method builds a TTS model to analyze the influence of the annotated text expressiveness labels on the synthesized speech. The baseline model is implemented on top of the existing VQTTS system, which uses self-supervised vector-quantized acoustic features instead of the traditional mel spectrogram. Specifically, it consists of an acoustic model t2v and a vocoder v2w: t2v receives the phoneme sequence and outputs VQ acoustic features together with auxiliary features consisting of pitch, energy and probability of voice, and v2w receives them and synthesizes the waveform.
In this way, the method constructs a highly expressive speech synthesis data set: the data set is derived from expressive audio, the audio is processed in a targeted manner, text expressiveness features are classified from the perspectives of linguistics and literary studies, and a large language model is designed to assist the labeling, so that the expressiveness categories in the data set are labeled more accurately and efficiently. In terms of TTS effect, it yields stronger expressiveness than current open-source data sets.
The method performs experiments to evaluate the impact of each of the five text expressiveness labels on the expressiveness of the synthesized speech; the cumulative effect of using all the labels together is also evaluated. For these experiments, the acoustic models are each trained for 300 epochs with a batch size of 8. The vocoder is shared and trained on StoryTTS for 100 epochs with a batch size of 8. The remaining model configuration keeps the original parameters. Each experiment is run on a single 2080Ti GPU. For preprocessing the text data, a grapheme-to-phoneme (G2P) tool is used for text-to-phoneme conversion. 5% of the text is set aside for the test and validation sets, where the test set consists of 3 consecutive chapters. To obtain ground-truth phoneme durations, the Montreal Forced Aligner, which is based on Kaldi, is used for forced alignment.
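For Mandarin, the G2P step can be illustrated with pypinyin as below; the source only states that a G2P tool is used, so this is an assumed substitute rather than the actual tool.

```python
# Sketch: Mandarin text-to-phoneme conversion. pypinyin is one common choice;
# the embodiment only states that "a G2P tool" is used.
from pypinyin import lazy_pinyin, Style

def g2p(text: str):
    # Tone-numbered pinyin syllables, a common phoneme-like unit for Mandarin TTS.
    return lazy_pinyin(text, style=Style.TONE3, neutral_tone_with_five=True)

print(g2p("讲故事"))  # e.g. ['jiang3', 'gu4', 'shi5']
```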
To evaluate the synthesized speech, the method performs a MOS (mean opinion score) listening test involving 20 listeners, who are asked to score each sample. The MOS rating uses a 1-5 scale with 0.5-point increments and is reported with a 95% confidence interval. During the test, listeners are instructed to specifically evaluate the expressiveness of the synthesized speech while also assessing speech quality. For objective evaluation, MCD (mel-cepstral distortion) is calculated using DTW (dynamic time warping), and log-F0 RMSE (root mean square error), also computed with DTW, is analyzed. MCD measures general speech quality, while log-F0 RMSE evaluates the prosody of the speech; lower values of both indicate better sound quality and rhythm.
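The two objective metrics can be sketched as follows, using librosa for DTW and pyworld for F0 extraction; MFCCs stand in for mel-cepstral coefficients, voiced/unvoiced handling is simplified, and frame settings and coefficient counts are assumptions rather than the values used in the experiments.

```python
# Sketch: MCD over DTW-aligned cepstra (MFCCs as a stand-in for mel-cepstra) and
# log-F0 RMSE over DTW-aligned F0 tracks. Unvoiced frames are clamped rather than
# excluded, which a full implementation would handle more carefully.
import numpy as np
import librosa
import pyworld

def mcd_dtw(ref_wav, syn_wav, sr=22050, n_mfcc=13):
    ref, _ = librosa.load(ref_wav, sr=sr)
    syn, _ = librosa.load(syn_wav, sr=sr)
    c_ref = librosa.feature.mfcc(y=ref, sr=sr, n_mfcc=n_mfcc)[1:]  # drop c0 (energy)
    c_syn = librosa.feature.mfcc(y=syn, sr=sr, n_mfcc=n_mfcc)[1:]
    _, wp = librosa.sequence.dtw(X=c_ref, Y=c_syn, metric="euclidean")
    diff = c_ref[:, wp[:, 0]] - c_syn[:, wp[:, 1]]
    return (10.0 / np.log(10)) * np.mean(np.sqrt(2.0 * np.sum(diff ** 2, axis=0)))

def log_f0_rmse_dtw(ref_wav, syn_wav, sr=22050):
    ref, _ = librosa.load(ref_wav, sr=sr)
    syn, _ = librosa.load(syn_wav, sr=sr)
    f0_ref, _ = pyworld.harvest(ref.astype(np.float64), sr)
    f0_syn, _ = pyworld.harvest(syn.astype(np.float64), sr)
    lf0_ref = np.log(np.maximum(f0_ref, 1e-3))[None, :]
    lf0_syn = np.log(np.maximum(f0_syn, 1e-3))[None, :]
    _, wp = librosa.sequence.dtw(X=lf0_ref, Y=lf0_syn, metric="euclidean")
    err = lf0_ref[0, wp[:, 0]] - lf0_syn[0, wp[:, 1]]
    return float(np.sqrt(np.mean(err ** 2)))
```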
The evaluation results are given in fig. 9. Both the subjective and the objective scores are better than the baseline model when the expressiveness labels are incorporated. The improvements from sentence pattern and scene are relatively small. This may be because declarative sentences are ubiquitous in the data set, limiting the information available to the model; furthermore, although the scene types are fairly evenly distributed, their diversity is insufficient to provide adequate information, since role-playing and inner-monologue scenes can involve many different imitated characters. Rhetorical technique and emotional color bring more obvious improvements. Among the expressiveness labels, the imitated character is the most effective, because it directly tells the model which character is currently being imitated, enabling the model to learn how that character speaks and to synthesize speech close to the original data. Finally, fusing all expressiveness labels provides the most significant improvement: it is clearly better than the other settings in both objective and subjective terms, providing the model with more, and more accurate, information about imitated characters and scenes, and also benefiting from the complementary contributions of sentence pattern, rhetorical technique and emotional color.
In general, the method presents, for the first time, a TTS data set that is highly expressive from both the acoustic and the text perspective. The data set is derived from high-quality recordings of a Mandarin storytelling program and provides a valuable resource for researchers studying acoustic expressiveness. In addition, the method comprehensively analyzes the text and divides speech-related text expressiveness into five different dimensions. LLMs are then used, provided with a few manually annotated examples, for batch annotation; this efficient labeling also offers insight for similar data-labeling tasks. The data set is thus equipped with rich text expressiveness annotations. Experimental results show that a TTS model combined with the annotated text expressiveness labels can generate speech with markedly improved expressiveness.
Fig. 10 is a schematic structural diagram of a system for synthesizing a speech data set with expressive force according to an embodiment of the present invention, where the system may perform the method for synthesizing a speech data set with expressive force according to any of the above embodiments and be configured in a terminal.
The synthesizing system 10 of the present embodiment for a speech data set with expressive force includes: the speech data set determination module 11, the correction module 12, the expressive force analysis module 13 and the data set synthesis module 14.
The speech data set determining module 11 is configured to obtain original speech data with expressive force, segment the original speech data to obtain expressive speech segments, and perform speech recognition on the expressive speech segments to obtain a speech data set of expressive speech segments and recognized text, where the original speech data comprises: pingshu storytelling recordings. The correction module 12 is configured to adjust the onomatopoeic words and punctuation marks of the recognized text in the speech data set to correct recognition errors in the expressive speech segments caused by pitch and speaking rate, so as to obtain a corrected speech data set. The expressive force analysis module 13 is configured to perform text expressive force analysis on the corrected speech data set and determine the text expressive force category corresponding to each expressive speech segment and its recognized text in the corrected speech data set, where the text expressive force categories comprise: sentence pattern, scene, rhetorical technique, imitated character and emotional color. The data set synthesis module 14 is configured to label the corrected speech data set with the text expressive force categories in batches by using a large model according to the rules corresponding to the text expressive force categories, so as to obtain a batch-labeled speech data set with expressive force.
The embodiment of the invention also provides a non-volatile computer storage medium, where the computer storage medium stores computer-executable instructions that can perform the method for synthesizing a speech data set with expressive force in any of the above method embodiments.
as one embodiment, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
obtaining original speech data with expressive force, segmenting the original speech data to obtain expressive speech segments, and performing speech recognition on the expressive speech segments to obtain a speech data set of expressive speech segments and recognized text, wherein the original speech data comprises: pingshu storytelling recordings;
adjusting the onomatopoeic words and punctuation marks of the recognized text in the speech data set to correct recognition errors in the expressive speech segments caused by pitch and speaking rate, so as to obtain a corrected speech data set;
performing text expressive force analysis on the corrected speech data set, and determining the text expressive force category corresponding to each expressive speech segment and its recognized text in the corrected speech data set, wherein the text expressive force categories comprise: sentence pattern, scene, rhetorical technique, imitated character and emotional color;
and labeling the corrected speech data set with the text expressive force categories in batches by using a large model according to the rules corresponding to the text expressive force categories, so as to obtain a batch-labeled speech data set with expressive force.
As a non-volatile computer readable storage medium, it may be used to store a non-volatile software program, a non-volatile computer executable program, and modules, such as program instructions/modules corresponding to the methods in the embodiments of the present invention. One or more program instructions are stored in a non-transitory computer readable storage medium that, when executed by a processor, perform the method of synthesizing a expressive speech data set in any of the method embodiments described above.
Fig. 11 is a schematic diagram of the hardware structure of an electronic device for performing the method of synthesizing a speech data set with expressive force according to another embodiment of the present application. As shown in fig. 11, the device includes:
one or more processors 1110, and a memory 1120, one processor 1110 being illustrated in fig. 11. The apparatus of the method of synthesizing a expressive speech data set may further include: an input device 1130 and an output device 1140.
The processor 1110, memory 1120, input devices 1130, and output devices 1140 may be connected by a bus or other means, for example in fig. 11.
The memory 1120 is used as a non-volatile computer readable storage medium, and may be used to store a non-volatile software program, a non-volatile computer executable program, and a module, such as program instructions/modules corresponding to the method for synthesizing a speech data set with expressive force in the embodiments of the present application. The processor 1110 executes various functional applications of the server and data processing, that is, implements the synthesizing method of the voice data set having expressive power of the above-described method embodiment, by running the nonvolatile software programs, instructions, and modules stored in the memory 1120.
Memory 1120 may include a storage program area that may store an operating system, at least one application program required for functionality, and a storage data area; the storage data area may store data, etc. In addition, memory 1120 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, memory 1120 optionally includes memory remotely located relative to processor 1110, which may be connected to the mobile device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 1130 may receive input numerical or character information. The output device 1140 may comprise a display device such as a display screen.
The one or more modules are stored in the memory 1120 that, when executed by the one or more processors 1110, perform the method of synthesizing a expressive speech data set in any of the method embodiments described above.
The product can execute the method provided by the embodiment of the application, and has the corresponding functional modules and beneficial effects of the execution method. Technical details not described in detail in this embodiment may be found in the methods provided in the embodiments of the present application.
The non-transitory computer readable storage medium may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the device, etc. Further, the non-volatile computer-readable storage medium may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some embodiments, the non-transitory computer readable storage medium may optionally include memory remotely located relative to the processor, which may be connected to the apparatus via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiment of the invention also provides electronic equipment, which comprises: the system comprises at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method for synthesizing a expressive speech data set according to any one of the embodiments of the invention.
The electronic device of the embodiments of the present application exists in a variety of forms, including but not limited to:
(1) Mobile communication devices, which are characterized by mobile communication functionality and are aimed at providing voice, data communication. Such terminals include smart phones, multimedia phones, functional phones, low-end phones, and the like.
(2) Ultra mobile personal computer equipment, which belongs to the category of personal computers, has the functions of calculation and processing and generally has the characteristic of mobile internet surfing. Such terminals include PDA, MID, and UMPC devices, etc., such as tablet computers.
(3) Portable entertainment devices such devices can display and play multimedia content. The device comprises an audio player, a video player, a palm game machine, an electronic book, an intelligent toy and a portable vehicle navigation device.
(4) Other electronic devices with data processing functions.
In this document, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises", "comprising" or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed or inherent to such process, method, article or apparatus. Without further limitation, an element preceded by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article or apparatus that comprises the element.
The apparatus embodiments described above are merely illustrative, where the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of synthesizing an expressive speech data set, comprising:
obtaining original speech data with expressive force, segmenting the original speech data to obtain expressive speech segments, and performing speech recognition on the expressive speech segments to obtain a speech data set of expressive speech segments and recognized text, wherein the original speech data comprises: pingshu storytelling recordings;
adjusting the onomatopoeic words and punctuation marks of the recognized text in the speech data set to correct recognition errors in the expressive speech segments caused by pitch and speaking rate, so as to obtain a corrected speech data set;
performing text expressive force analysis on the corrected speech data set, and determining the text expressive force category corresponding to each expressive speech segment and its recognized text in the corrected speech data set, wherein the text expressive force categories comprise: sentence pattern, scene, rhetorical technique, imitated character and emotional color;
and labeling the corrected speech data set with the text expressive force categories in batches by using a large model according to the rules corresponding to the text expressive force categories, so as to obtain a batch-labeled speech data set with expressive force.
2. The method of claim 1, wherein labeling the corrected speech data set with the text expressive force categories in batches by using the large model according to the rules corresponding to the text expressive force categories comprises:
the large model receives the continuously input corrected speech data set with text expressive force categories and labels it with context information;
the large model receives source information about the corrected speech data set with text expressive force categories, and, based on this source information, its attention to the onomatopoeia, inner monologue and character imitation in the corrected speech data set is reinforced;
and the large model labels the words corresponding to the text expressive force categories in the reinforced corrected speech data set in batches, obtaining a word-labeled speech data set with expressive force.
3. The method of claim 2, wherein the batch labeling, by using the large model, of the words corresponding to the text expressive force categories in the reinforced corrected speech data set comprises:
labeling, in batches by using the Claude2 large model, the words corresponding to sentence patterns, scenes, rhetorical techniques and imitated characters in the reinforced corrected speech data set.
4. The method of claim 2, wherein the batch labeling, by using the large model, of the words corresponding to the text expressive force categories in the reinforced corrected speech data set comprises:
labeling, in batches by using the GPT4 large model, the words corresponding to emotional colors in the reinforced corrected speech data set.
5. The method of claim 1, wherein after obtaining the batch-labeled speech data set with expressive force, the method further comprises:
constructing an expressive force encoder;
and testing the batch-labeled speech data set with expressive force by using the expressive force encoder to verify the effectiveness of the batch labeling.
6. The method of claim 5, wherein the structure of the expressive force encoder comprises: a BERT model, a multi-head attention layer, an up-sampling layer, a linear layer and four independent learnable embedding layers, wherein the BERT model receives the input text of the speech data set and outputs word-level embeddings; a sentence-level BERT receives the emotional colors labeled by the large model and outputs emotional color embeddings; the multi-head attention layer, the up-sampling layer and the linear layer determine phoneme-level emotion distributions from the word-level embeddings and the emotional color embeddings; the four independent learnable embedding layers separately receive the scene, sentence pattern, rhetorical technique and imitated character labeled by the large model and output the corresponding embeddings; and a test result is obtained from the phoneme-level emotion distributions and the corresponding embeddings.
7. A system for synthesizing an expressive speech data set, comprising:
a speech data set determining module, configured to obtain original speech data with expressive force, segment the original speech data to obtain expressive speech segments, and perform speech recognition on the expressive speech segments to obtain a speech data set of expressive speech segments and recognized text, wherein the original speech data comprises: pingshu storytelling recordings;
a correction module, configured to adjust the onomatopoeic words and punctuation marks of the recognized text in the speech data set to correct recognition errors in the expressive speech segments caused by pitch and speaking rate, so as to obtain a corrected speech data set;
an expressive force analysis module, configured to perform text expressive force analysis on the corrected speech data set and determine the text expressive force category corresponding to each expressive speech segment and its recognized text in the corrected speech data set, wherein the text expressive force categories comprise: sentence pattern, scene, rhetorical technique, imitated character and emotional color;
and a data set synthesis module, configured to label the corrected speech data set with the text expressive force categories in batches by using a large model according to the rules corresponding to the text expressive force categories, so as to obtain a batch-labeled speech data set with expressive force.
8. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the method of any of claims 1-6.
9. An electronic device, comprising: at least one processor, and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps of the method of any one of claims 1-6.
10. A storage medium having stored thereon a computer program, which when executed by a processor performs the steps of the method according to any of claims 1-6.
CN202410185825.9A 2024-02-19 2024-02-19 Speech data set synthesis method and system with expressive force and electronic equipment Pending CN117854474A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410185825.9A CN117854474A (en) 2024-02-19 2024-02-19 Speech data set synthesis method and system with expressive force and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410185825.9A CN117854474A (en) 2024-02-19 2024-02-19 Speech data set synthesis method and system with expressive force and electronic equipment

Publications (1)

Publication Number Publication Date
CN117854474A true CN117854474A (en) 2024-04-09

Family

ID=90544270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410185825.9A Pending CN117854474A (en) 2024-02-19 2024-02-19 Speech data set synthesis method and system with expressive force and electronic equipment

Country Status (1)

Country Link
CN (1) CN117854474A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination