CN113823259B - Method and device for converting text data into phoneme sequence - Google Patents

Method and device for converting text data into phoneme sequence

Info

Publication number: CN113823259B (granted publication); earlier version: CN113823259A (application)
Application number: CN202110832833.4A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: polyphone, sentence, features, grammar, determining
Inventors: 吴志勇, 宋长河, 周逸轩, 卞衍尧
Applicant and current assignee: Tencent Technology Shenzhen Co Ltd; Shenzhen International Graduate School of Tsinghua University
Legal status: Active (granted)

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00: Speech synthesis; Text to speech systems
    • G10L 13/06: Elementary speech units used in speech synthesisers; Concatenation rules
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10: Prosody rules derived from text; Stress or intonation
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique
    • G10L 25/30: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the analysis technique using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Machine Translation (AREA)

Abstract

A method, apparatus, device, and computer-readable storage medium for converting text data into a phoneme sequence are disclosed. The method comprises: extracting, based on a sentence in the text data, sentence semantic features corresponding to the sentence and character semantic features corresponding to one or more consecutive characters in the sentence; determining grammar features corresponding to the sentence based on the sentence semantic features; determining polyphone features, which indicate polyphone pronunciation information of the characters, based on the character semantic features and the grammar features; and determining a phoneme sequence corresponding to the sentence based on the grammar features and the polyphone features. The method and apparatus extract the grammar features and polyphone features from the text data with neural networks, fuse these features in a cascaded manner, and optionally introduce tone variation (sandhi) information from the text data, so that the synthesized speech sounds more natural.

Description

Method and device for converting text data into phoneme sequence
Technical Field
The present disclosure relates to the field of artificial intelligence services, and more particularly, to a method, apparatus, device, and computer-readable storage medium for converting text data into a phoneme sequence.
Background
Text-to-Speech (TTS) technology has been proposed to convert text data into speech. It has been widely used in products such as voice assistants, intelligent navigation, and electronic books. TTS draws on both linguistics and psychology and, through the design of neural networks, intelligently transforms written text into a natural stream of speech. However, current TTS technology is still not sufficiently friendly to ideographic languages such as Chinese.
Currently, before speech can be generated, an input character sequence must be converted into the corresponding sequence of pronunciation phonemes. This conversion process is also referred to as front-end processing in TTS technology. Ideographic languages typically exhibit tone variation (tone sandhi); Chinese, for example, has second-tone, third-tone, and neutral-tone variation. These variations make the converted pronunciation phoneme sequence inaccurate. At present, the conversion of ideographic languages into phoneme sequences is almost always based on conversion rules preset by linguists. For Chinese, for example, a linguist typically summarizes a series of rules for converting Chinese characters into pronunciation labels and then writes these rules in a form the computer can process. However, the effort required to establish such preset conversion rules is enormous, and it is difficult for them to cover all cases. In addition, as the rules become more complex, the conversion of the same Chinese character may be matched by multiple rules, creating rule conflicts. With the growth of available data, more and more researchers have attempted to use statistics-based methods for front-end processing, but these methods depend heavily on feature engineering and on the experience of the model designers.
Some researchers have also considered using neural networks to solve the above problems. However, existing neural network schemes still suffer from difficulties such as laborious pronunciation labeling and inaccurate prediction, which lead to low speech synthesis quality. Accordingly, the front-end processing schemes in existing TTS technology need further improvement in order to synthesize speech that is friendlier to ideographic languages.
Disclosure of Invention
Embodiments of the present disclosure provide a method and apparatus for converting text data into a phoneme sequence, a method and apparatus for simplifying a complex text processing model into a lightweight text processing model, and a computer readable storage medium.
Embodiments of the present disclosure provide a method of converting text data into a phoneme sequence, comprising: extracting, based on a sentence in the text data, sentence semantic features corresponding to the sentence and character semantic features corresponding to one or more consecutive characters in the sentence; determining grammar features corresponding to the sentence based on the sentence semantic features; determining polyphone features, which indicate polyphone pronunciation information of the characters, based on the character semantic features and the grammar features; and determining a phoneme sequence corresponding to the sentence based on the grammar features and the polyphone features.
Embodiments of the present disclosure provide an apparatus for converting text data into a phoneme sequence, comprising: an extraction unit configured to extract, based on a sentence in the text data, sentence semantic features of the sentence and character semantic features corresponding to one or more consecutive characters in the sentence; a first determination unit configured to determine grammar features corresponding to the sentence based on the sentence semantic features; a second determination unit configured to determine polyphone features, which indicate polyphone pronunciation information of the characters, based on the character semantic features and the grammar features; and a third determination unit configured to determine a phoneme sequence corresponding to the sentence based on the grammar features and the polyphone features.
Embodiments of the present disclosure provide an apparatus for converting text data into a phoneme sequence, comprising: a processor; and a memory, wherein the memory stores a computer executable program that, when executed by the processor, performs the method described above.
Embodiments of the present disclosure provide an apparatus for simplifying a complex text processing model into a lightweight text processing model, comprising: a processor; and a memory storing computer instructions which, when executed by the processor, implement the above-described method.
Embodiments of the present disclosure provide a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the above-described method.
According to another aspect of the present disclosure, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable medium and executes the computer instructions to cause the computer device to perform the aspects described above or methods provided in various alternative implementations of the aspects described above.
Embodiments of the present disclosure provide a method of converting text data into a phoneme sequence that extracts grammar features and polyphone features from the text data using neural networks, fuses these features in a cascaded manner, and optionally introduces tone variation (sandhi) information from the text data. Compared with previous methods, it has the following four advantages.
① Embodiments of the present disclosure fuse multiple features of the text data in a cascaded manner, so as to obtain features that incorporate the interaction information among those features.
② Embodiments of the present disclosure introduce polyphone features in the front-end processing to eliminate ambiguity in the speech synthesis process, thereby providing more correct dictionary pronunciations for the character sequence to be synthesized.
③ Embodiments of the present disclosure introduce grammar features in the front-end processing to assist prosody control, so that the prosody of the synthesized speech is more accurate.
④ Embodiments of the present disclosure also introduce tone variation (sandhi) information in the front-end processing, thereby making the synthesized speech more natural.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required to be used in the description of the embodiments will be briefly described below. The drawings in the following description are only exemplary embodiments of the present disclosure.
Fig. 1A is an example schematic diagram illustrating an application scenario according to an embodiment of the present disclosure.
Fig. 1B is an example schematic diagram showing a model for converting text data into a phoneme sequence.
Fig. 2A is a flowchart illustrating a method of converting text data into a phoneme sequence in accordance with an embodiment of the present disclosure.
Fig. 2B is a schematic diagram illustrating a method 200 of converting text data into a phoneme sequence in accordance with an embodiment of the present disclosure.
Fig. 3 is yet another schematic diagram illustrating a method 200 of converting text data into a phoneme sequence in accordance with an embodiment of the present disclosure.
Fig. 4A is an example schematic diagram of a component span score set according to an embodiment of the present disclosure.
Fig. 4B is a schematic diagram of a syntax tree according to an embodiment of the present disclosure.
Fig. 5 is yet another schematic diagram of the polyphone analysis module according to an embodiment of the present disclosure.
Fig. 6 is a schematic diagram of an apparatus for converting text data into a phoneme sequence according to an embodiment of the disclosure.
Fig. 7 is a block diagram illustrating an apparatus for converting text data into a phoneme sequence according to an embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the present disclosure more apparent, exemplary embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present disclosure and not all of the embodiments of the present disclosure, and that the present disclosure is not limited by the example embodiments described herein.
In the present specification and drawings, steps and elements that are substantially the same or similar are denoted by the same or similar reference numerals, and repeated descriptions of such steps and elements are omitted. Meanwhile, in the description of the present disclosure, the terms "first", "second", and the like are used merely to distinguish the descriptions and are not to be construed as indicating or implying relative importance or order.
For purposes of describing the present disclosure, the following presents concepts related to the present disclosure.
The present disclosure may utilize an acoustic model to implement the method of converting text data into a phoneme sequence. The first encoder, the second encoder, the component analysis module, the pinyin prediction layer, the tone variation analysis module, the decoder, the speech generation module, and the like mentioned below are constituent modules of the acoustic model.
The acoustic model of the present disclosure may be based on artificial intelligence (AI). Artificial intelligence is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. For example, the acoustic model of the present disclosure can handle text in a plurality of different languages in a manner similar to a human reading and understanding those languages, which is achieved by studying the design principles and implementation methods of various intelligent machines.
Artificial intelligence technology relates to a wide range of technology, both hardware-level and software-level. The artificial intelligence software technology mainly comprises the directions of computer vision technology, natural language processing, machine learning/deep learning and the like.
Optionally, the acoustic model in the present disclosure employs natural language processing (NLP) techniques. Natural language processing is an important direction in the fields of computer science and artificial intelligence; it studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Based on natural language processing techniques, the acoustic model of the present disclosure can therefore analyze the input text data, extract features from it, and then generate audio data in a manner that resembles a human reading the text aloud.
Optionally, the natural language processing techniques employed by embodiments of the present disclosure may also be based on machine learning (ML) and deep learning. Machine learning is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. Natural language processing techniques use machine learning to study how a computer can simulate or implement human language-learning behavior: by analyzing existing, annotated text data, the computer acquires new knowledge or skills and reorganizes its existing knowledge structures to continuously improve its performance. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning.
Alternatively, the acoustic models used in the embodiments described below may all be artificial intelligence models, in particular artificial-intelligence-based neural network models. Typically, such a neural network model is implemented as an acyclic graph in which neurons are arranged in different layers. The neural network model includes an input layer and an output layer separated by at least one hidden layer. The hidden layer transforms the input received by the input layer into a representation useful for generating the output in the output layer. The network nodes are fully connected to nodes in adjacent layers via edges, and there are no edges between nodes within the same layer. Data received at the nodes of the input layer is propagated to the nodes of the output layer via hidden layers, activation layers, pooling layers, convolutional layers, and the like. The input and output of the neural network model may take various forms, which the present disclosure does not limit.
Embodiments of the present disclosure provide solutions related to techniques such as artificial intelligence, natural language processing, and machine learning, and are specifically described by the following embodiments.
The acoustic model of the embodiments of the present disclosure may be integrated in an electronic device, which may be a terminal, a server, or the like. For example, the acoustic model may be integrated in a terminal. The terminal may be, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a desktop computer, a personal computer (PC), a smart speaker, a smart watch, or the like. As another example, the acoustic model may be integrated in a server. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which the present disclosure does not limit.
It will be appreciated that the means for applying the acoustic model of the embodiments of the present disclosure to infer may be either a terminal, a server, or a system of terminals and servers.
It will be appreciated that the method of converting text data into acoustic features of the acoustic model of the embodiments of the present disclosure may be performed on a terminal, may be performed on a server, or may be performed jointly by a terminal and a server.
The acoustic model provided by the embodiments of the present disclosure may also relate to artificial intelligence cloud services in the field of cloud technology. Cloud technology refers to a hosting technology that unifies resources such as hardware, software, and networks in a wide area network or a local area network to realize the computation, storage, processing, and sharing of data. Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, application technology, and the like based on the cloud computing business model; it can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. The background services of technical network systems, such as video websites, image websites, and other portals, require large amounts of computing and storage resources. With the rapid development of the internet industry, every item may have its own identification mark in the future, which will need to be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data will require strong backing from the system, which can only be realized through cloud computing.
Among them, artificial intelligence cloud services are generally also called AIaaS (AI as a Service). This is currently the mainstream service mode for artificial intelligence platforms: an AIaaS platform decouples several common AI services and provides independent or packaged services in the cloud. This service mode is similar to an AI theme mall: all developers can access one or more of the artificial intelligence services provided by the platform through application programming interfaces (APIs), and some experienced developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate, and maintain their own proprietary cloud artificial intelligence services.
Fig. 1A is an example schematic diagram illustrating a scenario 100 of acoustic model reasoning in accordance with an embodiment of the present disclosure. Fig. 1B is an example schematic diagram showing a model for converting text data into a phoneme sequence.
Currently, there are a variety of read-aloud applications. A user may install such an application on a user terminal and indicate to it the text data that needs to be converted into audio data. The user terminal may then transmit a text-data conversion request to the application's server via the network, receive the converted audio data corresponding to the text data, and play the audio data.
After receiving the text data to be converted, the server converts the text data using an acoustic model to obtain audio data, and then feeds back the audio data (e.g., audio data corresponding to the text data in fig. 1A) to the user.
The user may score the audio data. For example, if the user finds that the audio data corresponds well to the text data, pronounces polyphones accurately, and is close to the effect of a human reading aloud, the user may give the audio data a higher score, and the server may treat that text data-audio data pair as a positive sample for training the acoustic model in real time. If the user gives the audio data a lower score, the server may treat the pair as a negative sample for training the acoustic model in real time. A collection of many such text data-audio data pairs is also referred to as a text-to-speech dataset.
Of course, the server may also obtain training samples for the acoustic model in other ways. For example, the server may collect audio of human-read text, together with the corresponding text, that already exists on the internet, and then train the acoustic model with such human-read data. For example, referring to Fig. 1A, the server may obtain text from a database and then use it for training the acoustic model.
Current acoustic models used to convert text data into phoneme sequences are either complex, require laborious pronunciation annotation, or produce inaccurate predictions. Several existing acoustic models are briefly described below with reference to Fig. 1B.
Shan, Dai, Zhang, Zou, et al. have proposed grapheme-to-phoneme (G2P) networks, or auxiliary G2P networks, as shown in Fig. 1B, to alleviate the inaccuracy of synthesized speech caused by complex pronunciation labeling rules. In such a G2P network, a polyphone disambiguation model is added to make the conversion between polyphones and phonemes more accurate. Specifically, the polyphone disambiguation model takes contextual information as input and is trained independently on a separate dataset, in the expectation of resolving the ambiguity in polyphones. Although this approach uses a polyphone disambiguation model for the conversion between polyphones and phonemes, it requires the contextual information to be extracted by an independent module, which makes the process cumbersome and the disambiguation effect poor.
Recently, Dai et al. have also attempted to assist in resolving the ambiguity in polyphones by extracting embedded information from a pre-trained BERT model, and Pan et al. have attempted to correct pronunciation errors using tone variation (sandhi) rules. However, in these methods each sub-module is trained independently on heterogeneous datasets, so the knowledge learned by the sub-modules remains isolated from each other, resulting in poor performance and robustness.
On this basis, the present disclosure provides a method of converting text data into a phoneme sequence, comprising: extracting, based on a sentence in the text data, sentence semantic features corresponding to the sentence and character semantic features corresponding to one or more consecutive characters in the sentence; determining grammar features corresponding to the sentence based on the sentence semantic features; determining polyphone features, which indicate polyphone pronunciation information of the characters, based on the character semantic features and the grammar features; and determining a phoneme sequence corresponding to the sentence based on the grammar features and the polyphone features.
Thus, the acoustic model of the present disclosure is made up of a plurality of cascaded sub-modules. For example, these sub-modules may handle, respectively, component (constituency) analysis with word segmentation and part-of-speech tagging, grammar-tree-based linguistic feature learning, attention-based polyphone disambiguation, tone variation prediction, speech generation, and so on. By cascading the sub-modules, the information learned by each sub-module is fused and shared, which improves performance and robustness. In addition, the acoustic model of the present disclosure can be further based on component analysis, thereby improving the naturalness of the synthesized speech. For example, the present disclosure extracts grammar features that are highly correlated with prosody from the component analysis tree and feeds them to the TTS model as an additional input, which avoids training a separate prosodic structure prediction module and improves performance and robustness.
Embodiments in accordance with the present disclosure are described in more detail below in conjunction with fig. 2A-5.
Fig. 2A is a flowchart illustrating a method 200 of converting text data into a phoneme sequence in accordance with an embodiment of the present disclosure. Fig. 2B is a schematic diagram illustrating a method 200 of converting text data into a phoneme sequence in accordance with an embodiment of the present disclosure.
The method 200 of converting text data into a phoneme sequence according to an embodiment of the present disclosure may be applied to any electronic device. It is understood that the electronic device may be a different kind of hardware device, such as a Personal Digital Assistant (PDA), an audio/video device, a mobile phone, an MP3 player, a personal computer, a laptop computer, a server, etc. For example, the electronic device may be the server and user terminal of fig. 1A, etc. Hereinafter, the present disclosure is described by taking a server as an example, and those skilled in the art should understand that the present disclosure is not limited thereto.
For example, the method 200 according to an embodiment of the present disclosure includes the following steps S201 to S204. First, in step S201, sentence semantic features corresponding to a sentence and character semantic features corresponding to one or more consecutive characters in the sentence are extracted based on the sentence in the text data. Next, in step S202, grammar features corresponding to the sentence are determined based on the sentence semantic features. Next, in step S203, polyphone features are determined based on the character semantic features and the grammar features; the polyphone features indicate polyphone pronunciation information of the characters. Finally, in step S204, a phoneme sequence corresponding to the sentence is determined based on the grammar features and the polyphone features.
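To make the cascade of steps S201 to S204 concrete, the following sketch shows one possible orchestration in Python; the module objects and their signatures are illustrative assumptions rather than the patent's reference implementation.

```python
# Hypothetical orchestration of steps S201-S204; the injected callables stand in
# for the neural sub-modules described below (not the patent's actual code).
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Frontend:
    semantic_module: Callable[[str], Tuple[list, list]]   # S201: sentence + character semantics
    grammar_module: Callable[[list], list]                # S202: grammar features
    polyphone_module: Callable[[list, list], list]        # S203: polyphone features
    speech_module: Callable[[list, list], List[str]]      # S204: phoneme sequence

    def text_to_phonemes(self, sentence: str) -> List[str]:
        sent_sem, char_sem = self.semantic_module(sentence)    # step S201
        grammar = self.grammar_module(sent_sem)                # step S202
        polyphone = self.polyphone_module(char_sem, grammar)   # step S203
        return self.speech_module(grammar, polyphone)          # step S204
```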
For example, the text data described in this disclosure may be any element that constitutes the text to be read aloud in Fig. 1A, e.g., a word, sentence, phrase, paragraph, or chapter. The present disclosure does not impose any limitation on the length or language of the text data; for example, the text data may include text in English, Chinese, Russian, Japanese, Korean, etc., such as the Chinese example sentence rendered below as "child lovely". The following description uses Chinese as an example, and those skilled in the art will appreciate that the present disclosure is not limited thereto.
For example, referring to Fig. 2B, step S201 may be performed using a sentence semantic analysis module. Optionally, the sentence semantic analysis module may be a neural network model. Such a neural network model is implemented, for example, as an acyclic graph in which neurons are arranged in different layers. The neural network model includes an input layer and an output layer separated by at least one hidden layer. The hidden layer transforms the input received by the input layer into a representation useful for generating the output in the output layer. The network nodes are fully connected to nodes in adjacent layers via edges, and there are no edges between nodes within the same layer. Data received at the nodes of the input layer is propagated to the nodes of the output layer via hidden layers, activation layers, pooling layers, convolutional layers, and the like. The input and output of the neural network model may take various forms, which the present disclosure does not limit. Alternatively, the sentence semantic analysis module may be a BERT (Bidirectional Encoder Representations from Transformers) network designed for Chinese characters. The BERT network can be trained on a large-scale unlabeled corpus to acquire the semantic information in the text data for subsequent processing.
For example, the sentence semantic features extracted by the sentence semantic analysis module may correspond to the semantics of the entire sentence, and the character semantic features may correspond to the semantics of combinations of all or some of the consecutive characters in the sentence. For example, if a sentence in the text data is the example sentence rendered as "child lovely", the sentence semantic features correspond to the meaning of that sentence, and the character semantic features may correspond, respectively, to the meanings of the characters or consecutive characters rendered as "child", "lovely", the sentence-final particle, and so on. In embodiments of the present disclosure, the sentence semantic features and the character semantic features may also be identical features, in order to facilitate subsequent operations. The character semantic features may also consist of a subset of the elements of the sentence semantic features; the present disclosure is not limited in this respect.
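A minimal sketch of step S201 is given below, with a publicly available Chinese BERT checkpoint standing in for the sentence semantic analysis module; the checkpoint name, the example sentence, and the use of the [CLS]/[SEP] tokens as the start and end markers are assumptions.

```python
# Sketch of step S201 with a public Chinese BERT checkpoint (an assumption; the
# patent only specifies "a BERT network designed for Chinese characters").
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese")

sentence = "小朋友好可爱哦"  # illustrative Chinese input sentence
inputs = tokenizer(sentence, return_tensors="pt")   # [CLS]/[SEP] act as c_0 / c_(L+1)

with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state        # shape (1, L + 2, 768)

sentence_semantic_features = hidden                  # embedding of the whole sequence
character_semantic_features = hidden[:, 1:-1, :]     # per-character embeddings
```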
With continued reference to Fig. 2B, step S202 may be performed using a grammar analysis module and a grammar feature learning module. Optionally, the grammar analysis module and the grammar feature learning module may include one or more neural network models. Optionally, the neural network model in the grammar analysis module is trained with a grammar tree dataset as samples. The structure and training of the neural network models in the grammar analysis module and the grammar feature learning module are further described with reference to Figs. 3 to 5 and are not repeated here.
For example, the grammar analysis module may determine the grammar features corresponding to the sentence based on the sentence semantic features. Optionally, these grammar features integrate the syntactic structure information of the sentence, the part-of-speech information of each word in the sentence, word segmentation boundary information, word segmentation position information, and the like. For example, after training, the neurons of the neural network model in the grammar analysis module can predict the part-of-speech information, word segmentation boundary information, and word segmentation position information corresponding to a sentence based on its sentence semantic features. In addition, after training, the neural network model in the grammar analysis module can predict phrase-level information (e.g., phrase combination information, phrase composition information, phrase boundary information, and phrase position information) and sentence-level information (e.g., sentence syntactic structure information, sentence boundary information, and sentence attention information) based on the sentence semantic features of a sentence; the present disclosure is not limited thereto.
The neural network may then use this information to determine the syntax coding features corresponding to the sentence. The grammar analysis module then uses the syntax coding features corresponding to the sentence to determine the grammar features corresponding to the sentence.
For example, the neural network of the grammar feature learning module may then further determine grammar features for polyphone pronunciation and grammar features for prosody based on the grammar features determined by the grammar analysis module. Optionally, since determining the polyphone information of a character usually requires only the information of a few characters before and after that character, rather than the information of the entire sentence, the length (span) of the grammar features for polyphones may be smaller than that of the grammar features for prosody.
Next, step S203 may be performed using the polyphone analysis module. Optionally, the neural network in the polyphone analysis module may fuse the grammar features and the character semantic features, thereby extracting the polyphone pronunciation information of the characters. For example, the polyphone analysis module may splice (concatenate) the grammar features and the character semantic features into initial polyphone features, and then determine the polyphone features based on the initial polyphone features and the dictionary pronunciation information corresponding to each character in the character combination. Optionally, after training, the neural network model in the polyphone analysis module may incorporate the dictionary pronunciation information into its neuron parameters. The dictionary pronunciation information concerns the polyphone pronunciations of a character as recorded in the dictionary. For example, the neural network model in the polyphone analysis module may predict the way a character is pronounced in the dictionary based on the initial polyphone features (e.g., outputting the pinyin information of a polyphone). The structure and training of the neural network model in the polyphone analysis module, and how dictionary pronunciation information is extracted through training, are further described with reference to Figs. 3 to 5 and are not repeated here.
For example, as described above, the neural network of the grammar feature learning module may determine the grammar features for polyphone pronunciation based on the grammar features corresponding to the sentence. Thus, the polyphone analysis module may also determine the polyphone features based on the grammar features for polyphone pronunciation and the character semantic features. Optionally, the grammar features for polyphone pronunciation may be determined by the grammar feature learning module; that is, the polyphone analysis module may be cascaded with the grammar feature learning module so as to directly obtain the grammar features for polyphone pronunciation. The polyphone analysis module may then splice the grammar features for polyphone pronunciation and the character semantic features into initial polyphone features, and then determine the polyphone features based on the initial polyphone features and the dictionary pronunciation information corresponding to each character in the character combination. Although Fig. 2B shows the polyphone analysis module cascaded with the grammar feature learning module to obtain the grammar information for polyphones, those skilled in the art will appreciate that the polyphone analysis module may also be cascaded with the grammar analysis module to directly obtain the grammar features determined by the grammar analysis module.
In addition, the polyphone analysis module may be cascaded with the tone variation analysis module so that the polyphone features are input to the tone variation analysis module. Optionally, tone variation (sandhi) information may be fused into the neurons of the neural network in the tone variation analysis module; the tone variation analysis module can therefore use its built-in neural network to fuse the polyphone features with the tone variation information, thereby determining polyphone features fused with the tone variation information.
As an example, the polyphone features, and the polyphone features fused with the tone variation information, may be a pinyin sequence used for generating speech, or a multi-dimensional numerical vector that fuses the pinyin information of each character in the sentence; the present disclosure is not limited in this respect.
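Tone variation can be illustrated with the best-known Mandarin sandhi rule: when two third-tone syllables are adjacent, the first is pronounced with a second tone. The tone variation analysis module learns such behavior inside a neural network; the rule-based sketch below only illustrates the phenomenon on a tone-numbered pinyin sequence.

```python
def apply_third_tone_sandhi(pinyin):
    """Toy illustration of one sandhi rule: '3 3' becomes '2 3' (e.g. ni3 hao3 -> ni2 hao3)."""
    out = list(pinyin)
    for i in range(len(out) - 1):
        if out[i].endswith("3") and out[i + 1].endswith("3"):
            out[i] = out[i][:-1] + "2"   # first of two third tones becomes second tone
    return out

print(apply_third_tone_sandhi(["ni3", "hao3"]))  # ['ni2', 'hao3']
```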
Next, step S204 may be performed using the speech generation module. For example, as described above, the neural network of the grammar feature learning module may determine the grammar features for prosody based on the grammar features corresponding to the sentence. The speech generation module may then determine the phoneme sequence corresponding to the sentence based on the grammar features for prosody and the polyphone features. That is, the speech generation module may be cascaded with the grammar feature learning module so as to directly obtain the grammar features for prosody. Although Fig. 2B shows the speech generation module cascaded with the grammar feature learning module to obtain the grammar information for prosody, those skilled in the art will appreciate that the speech generation module may also be cascaded with the grammar analysis module to directly obtain the grammar features determined by the grammar analysis module. The syntactic structure of a sentence is generally highly similar to its prosodic structure, so the grammar features for prosody can be used to assist prosody control; there is no need to design matching rules specifically for prosody control or to train a separate prosodic structure prediction module, which reduces the difficulty of prosody control and makes the synthesized speech more natural.
As an example, the speech generation module may also be cascaded directly with the tone variation analysis module to directly obtain the polyphone features fused with the tone variation information. In that case, the speech generation module may determine the phoneme sequence corresponding to the sentence based on the polyphone features fused with the tone variation information and the grammar features for prosody. Although Fig. 2B shows the speech generation module cascaded with the tone variation analysis module to obtain the polyphone features fused with the tone variation information, those skilled in the art will appreciate that the speech generation module may also be cascaded with the polyphone analysis module to directly obtain the polyphone features determined by the polyphone analysis module.
For example, in the case where the polyphone features, or the polyphone features fused with the tone variation information, are a pinyin sequence used for generating speech (or a multi-dimensional numerical vector fusing the pinyin information of each character in the sentence), the speech generation module may further include a pinyin-to-phoneme conversion module to convert the polyphone features (or the polyphone features fused with the tone variation information) into an initial phoneme sequence. Other models in the speech generation module (which may be linear models or neural network models) then combine the initial phoneme sequence with the grammar features (or the grammar features for prosody) to determine the phoneme sequence corresponding to the sentence.
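As an illustration of what a pinyin-to-phoneme conversion module might do, the sketch below splits a tone-numbered pinyin syllable into an initial and a tone-carrying final; the initial/final inventory is an assumption rather than the patent's actual phoneme set.

```python
# Minimal table-driven pinyin-to-phoneme split (initial + final + tone).
# The phoneme inventory here is illustrative, not the patent's.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]

def pinyin_to_phonemes(syllable: str) -> list:
    tone, base = syllable[-1], syllable[:-1]   # e.g. "hao3" -> "3", "hao"
    for ini in INITIALS:                       # two-letter initials are listed first
        if base.startswith(ini):
            return [ini, base[len(ini):] + tone]
    return [base + tone]                       # zero-initial syllable

print(pinyin_to_phonemes("zhong1"))  # ['zh', 'ong1']
print(pinyin_to_phonemes("ai4"))     # ['ai4']
```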
In addition, the speech generation module may also include a neural network that converts the phoneme sequence into audio data, and a vocoder. For example, the neural network that converts the phoneme sequence into audio data may be an attention-based autoregressive neural network model (e.g., Tacotron) or a duration-predictor-based non-autoregressive feed-forward neural network model (e.g., FastSpeech); the present disclosure is not limited thereto. The audio data in the present disclosure may be mel-spectrogram audio feature data, but it may also be audio data in any format that can be decoded by a vocoder.
Thus, embodiments of the present disclosure provide a method of converting text data into a phoneme sequence that extracts grammar features and polyphone features from the text data using neural networks, fuses these features in a cascaded manner, and optionally introduces tone variation (sandhi) information from the text data. Compared with previous methods, it has the following four advantages.
① Embodiments of the present disclosure fuse multiple features of the text data in a cascaded manner, so as to obtain features that incorporate the interaction information among those features.
② Embodiments of the present disclosure introduce polyphone features in the front-end processing to eliminate ambiguity in the speech synthesis process, thereby providing more correct dictionary pronunciations for the character sequence to be synthesized.
③ Embodiments of the present disclosure introduce grammar features in the front-end processing to assist prosody control, so that the prosody of the synthesized speech is more accurate.
④ Embodiments of the present disclosure also introduce tone variation (sandhi) information in the front-end processing, thereby making the synthesized speech more natural.
The above-described individual modules are described in more detail below in connection with fig. 3 to 5.
Fig. 3 is yet another schematic diagram illustrating a method 200 of converting text data into a phoneme sequence in accordance with an embodiment of the present disclosure. Fig. 4A is an example schematic diagram of a component span score set according to an embodiment of the present disclosure. Fig. 4B is a schematic diagram of a syntax tree according to an embodiment of the present disclosure. Fig. 5 is yet another schematic diagram of a multi-tone word analysis module according to an embodiment of the present disclosure.
Referring to Fig. 3, the sentence semantic analysis module extracts, based on a sentence in the text data, the sentence semantic features corresponding to the sentence and the character semantic features corresponding to one or more consecutive characters in the sentence. The sentence semantic analysis module inputs the sentence semantic features and the character semantic features into the grammar analysis module and the polyphone analysis module, respectively. As one example, the sentence semantic analysis module is a BERT model for Chinese characters, whose output is a sequence of Chinese-character BERT embeddings. For example, the Chinese-character BERT embedding sequence [c_0, c_1, c_2, ..., c_k, ..., c_L, c_(L+1)] may serve as the input, where L is the length (number of characters) of the input sentence, and c_0 and c_(L+1) are special start and end markers used to assist the computations of the subsequent component analysis module and dynamic programming decoder.
The grammar analysis module may include a first encoder, a component analysis module, and a dynamic programming decoder. The dynamic programming decoder is used only during training, whereas the first encoder and the component analysis module are used both during training and during inference. As one example, the syntax coding features corresponding to the sentence are determined by the first encoder based on the sentence semantic features, the grammar features corresponding to the sentence are determined by the component analysis module based on the syntax coding features, and both the first encoder and the component analysis module are trained on a grammar tree dataset.
For example, the training of the first encoder may include the following steps. First, based on a grammar sample sentence in the grammar tree dataset, the first encoder is used to determine the syntax coding features corresponding to the sample sentence. The component analysis module is then used to determine the grammar features corresponding to the grammar sample sentence and to extract the component span scores from the syntax coding features corresponding to the grammar sample sentence. Next, based on the component span scores, the predicted part-of-speech label, predicted word segmentation boundary label, and predicted word segmentation position label corresponding to each word in the grammar sample sentence are determined. A value of the first loss function is then computed from the predicted part-of-speech, word segmentation boundary, and word segmentation position labels of each word in the grammar sample sentence and the corresponding actual part-of-speech, word segmentation boundary, and word segmentation position labels. Finally, based on the value of the first loss function, the parameters of the neurons in the first encoder and the component analysis module are adjusted so that the first loss function converges.
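A skeleton of one step of this training procedure is sketched below; the use of cross-entropy as the first loss function and the way the gold spans index the score tensor are assumptions made for illustration.

```python
# Skeleton of one training step for the first encoder + component analysis module.
# Cross-entropy as the "first loss function" and the gold-span indexing are assumptions.
import torch
import torch.nn.functional as F

def train_step(first_encoder, component_module, batch, optimizer):
    optimizer.zero_grad()
    syntax_coding = first_encoder(batch["sentence_semantics"])   # per-character features
    span_scores = component_module(syntax_coding)                # s(i, j, l) label logits
    # Compare the predicted labels of each annotated span against the actual
    # part-of-speech / word-boundary / word-position labels from the grammar tree.
    loss = F.cross_entropy(span_scores[batch["gold_spans"]], batch["gold_labels"])
    loss.backward()
    optimizer.step()
    return loss.item()
```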
Optionally, the first encoder is a neural network model formed by stacking 8 identical converters (Transformers). Each converter includes a cascade of a multi-head attention layer, a first regularization layer, a feed-forward layer, and a second regularization layer, and the output of the first regularization layer is also input to the second regularization layer. The sentence semantic features may be input not only to the multi-head attention layer of the first converter but also to its first regularization layer; the present disclosure is not limited in this respect.
As shown in Fig. 3, the first encoder may take the sentence semantic features as input, produce the syntax coding features as output, and input the syntax coding features to the component analysis module. That is, the first encoder determines the syntax coding features corresponding to the sentence based on the sentence semantic features. Continuing the above example, assume the first encoder takes the Chinese-character BERT embedding sequence [c_0, c_1, c_2, ..., c_k, ..., c_L, c_(L+1)] as input and outputs syntax coding features [y_0, y_1, y_2, ..., y_k, ..., y_L, y_(L+1)] of the same length as the BERT embedding sequence.
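An approximate sketch of the first encoder is given below; a standard post-norm Transformer encoder layer stands in for the converter structure described above, and the hidden size, head count, and feed-forward width are assumptions.

```python
# Approximate sketch of the first encoder: 8 stacked Transformer layers mapping the
# BERT embedding sequence [c_0, ..., c_(L+1)] to syntax coding features [y_0, ..., y_(L+1)].
import torch
import torch.nn as nn

class FirstEncoder(nn.Module):
    def __init__(self, d_model: int = 768, n_heads: int = 8, n_layers: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=2048,
            batch_first=True)                       # attention -> norm -> feed-forward -> norm
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, bert_embeddings: torch.Tensor) -> torch.Tensor:
        # bert_embeddings: (batch, L + 2, d_model) -> syntax coding features of the same shape
        return self.encoder(bert_embeddings)
```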
Alternatively, the component analysis module may take the syntax coding features as input and the grammar features as output. For example, the component analysis module may determine a component span score set from the syntax coding features. As one example, when the first encoder is trained using the grammar tree dataset, the component analysis module is used to determine the grammar features corresponding to the grammar sample sentence and to extract the component span scores from its syntax coding features. Continuing the above example, the component analysis module combines the syntax coding features [y_0, y_1, y_2, ..., y_k, ..., y_L, y_(L+1)] and predicts a component span score set s(i, j, ·) from them. For example, the component span score set s(i, j, ·) may be calculated as in equations (1) and (2):

s(i, j, ·) = W_2 ReLU(LayerNorm(W_1 v + b_1)) + b_2,    (1)

v = [→y_j - →y_i ; ←y_(i+1) - ←y_(j+1)],    (2)

where 0 ≤ i ≤ j ≤ L and v combines the elements of the syntax coding features at the positions indicated by i and j. The vectors →y_k and ←y_k capture the bidirectional context information of the character at position k and are derived from the element y_k of the syntax coding features: →y_k represents the context information of y_k taken from its even-position elements, and ←y_k represents the context information of y_k taken from its odd-position elements. W_1, W_2, b_1, and b_2 are the neuron parameters to be trained in the component analysis module.
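A sketch of the span scorer in equations (1) and (2) is given below; the hidden size and label count are assumptions, and the even/odd split of y_k follows the description above.

```python
# Sketch of the span scorer in equations (1)-(2); dimensions are assumptions.
import torch
import torch.nn as nn

class SpanScorer(nn.Module):
    def __init__(self, d_model: int = 768, hidden: int = 250, n_labels: int = 64):
        super().__init__()
        self.w1 = nn.Linear(d_model, hidden)      # W_1, b_1
        self.norm = nn.LayerNorm(hidden)
        self.w2 = nn.Linear(hidden, n_labels)     # W_2, b_2

    def span_vector(self, y: torch.Tensor, i: int, j: int) -> torch.Tensor:
        # Split y_k into even-position (forward) and odd-position (backward) halves
        # and take endpoint differences, as in equation (2).
        fwd, bwd = y[:, 0::2], y[:, 1::2]          # y: (L + 2, d_model)
        return torch.cat([fwd[j] - fwd[i], bwd[i + 1] - bwd[j + 1]], dim=-1)

    def forward(self, y: torch.Tensor, i: int, j: int) -> torch.Tensor:
        v = self.span_vector(y, i, j)              # v for span (i, j)
        return self.w2(torch.relu(self.norm(self.w1(v))))   # s(i, j, .)
```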
For example, the component analysis module may further construct the component analysis tree T in Fig. 3, represented by the gray upper-triangular matrix, based on the component span score set, to characterize the scores of the components over the respective spans. For example, the component analysis tree T may be expressed by formula (3):

T := {(i_t, j_t; l_t) : t = 1, ..., |T|}.    (3)

Thus, the optimal component analysis tree T_best can be expressed as formula (4):

T_best = argmax_T s(T), where s(T) = Σ_((i,j,l) ∈ T) s(i, j, l).    (4)

That is, the optimal label l for a span (i, j) can be found by solving equation (4), splitting the span (i, j) into two sub-spans (i, m) and (m, j). The sub-spans (i, m) and (m, j) correspond to the two sub-components of the component analysis tree T under the span (i, j), and the recursion over the two sub-components may be expressed by equation (5):

s_best(i, j) = max_l s(i, j, l) + max_(i<m<j) [ s_best(i, m) + s_best(m, j) ].    (5)

The labels here may be part-of-speech labels, word segmentation boundary labels, word segmentation position labels, and the like; the present disclosure is not limited thereto. For the i-th character in the sentence, its span (i-1, i) does not need to be further subdivided, and its optimal label is determined directly with equation (6):

s_best(i-1, i) = max_l s(i-1, i, l).    (6)

Since the component analysis tree T represented by the gray upper-triangular matrix in Fig. 3 does not hold a valid label for every span (i, j), an auxiliary empty label ∅ is introduced, whose span score is defined to be 0. That is, a component whose span score is 0 in the component span score set can be expressed by formula (7):

s(i, j, ∅) = 0.    (7)
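The recursion in equations (4) to (6) can be sketched as a small dynamic program over precomputed label scores; the nested-list layout scores[i][j] is an assumption, and the sketch returns only the best score (the patent's dynamic programming decoder additionally reconstructs the tree and its labels).

```python
# Minimal dynamic-programming sketch of equations (4)-(6) over precomputed scores,
# where scores[i][j] is the list of label scores s(i, j, l) for span (i, j).
import functools

def best_parse_score(scores, L):
    """Score of the best component analysis tree over fenceposts 0..L (equation (4))."""
    @functools.lru_cache(maxsize=None)
    def best(i, j):
        label_score = max(scores[i][j])               # max over labels l (incl. the empty label)
        if j == i + 1:                                # single character: equation (6)
            return label_score
        split = max(best(i, m) + best(m, j) for m in range(i + 1, j))
        return label_score + split                    # equation (5)
    return best(0, L)
```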
Fig. 4A shows, as an example, a span-score representation for the example sentence rendered as "child lovely", which is a further instantiation of the gray upper-triangular matrix in Fig. 3. Furthermore, the span scores corresponding to some of the components in the component span score set s(i, j, ·) may be copied to the next layer for subsequent parsing by the dynamic programming decoder. These span scores may correspond to part-of-speech labels and word segmentation boundary labels, respectively.
The expanded grammar tree shown in Fig. 4B can then be further derived from the example in Fig. 4A using the dynamic programming decoder. In the expanded grammar tree, IP identifies the grammar features of the entire sentence, NP identifies noun-phrase-related grammar features, VP identifies verb-phrase-related grammar features, ADVP identifies adverb-phrase-related grammar features, and SP identifies the grammar features related to the sentence-final particle. The expanded grammar tree is an example of determining, based on the component span scores, the predicted part-of-speech label, predicted word segmentation boundary label, and predicted word segmentation position label corresponding to each word in the grammar sample sentence.
In the expanded grammar tree shown in Fig. 4B, NN is the part of speech predicted by the component analysis module for the word rendered as "child", i.e., it is predicted to be a noun. AD is the part of speech predicted for the word rendered as "good", i.e., it is predicted to be an adverb. VA is the part of speech predicted for "lovely", i.e., it is predicted to be a predicative adjective. IJ is the part of speech predicted for the sentence-final particle, i.e., it is predicted to be an interjection. B is the word segmentation boundary and position label predicted for a character that begins a word, M is the label predicted for a character in the middle of a word, and E is the label predicted for a character that ends a word; the sentence-final particle is predicted to be a single-character word located at the end of the sentence and is likewise labeled E. The dynamic programming decoder is used to determine, based on the component span scores, the predicted part-of-speech label, predicted word segmentation boundary label, and predicted word segmentation position label corresponding to each word in the grammar sample sentence.
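The B, M, and E word segmentation labels can be generated mechanically from a given word segmentation. The sketch below follows the labeling described above, treating a single-character word as E; the example segmentation is only an illustrative guess at the running example.

```python
def segmentation_labels(words):
    """Per-character B/M/E position labels, following the expanded grammar tree above."""
    labels = []
    for word in words:
        if len(word) == 1:
            labels.append("E")            # single-character word (e.g. the final particle)
        else:
            labels += ["B"] + ["M"] * (len(word) - 2) + ["E"]
    return labels

print(segmentation_labels(["小朋友", "好", "可爱", "哦"]))
# ['B', 'M', 'E', 'E', 'B', 'E', 'E']
```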
As described above, when training the first encoder in the grammar analysis module, the dynamic programming decoder may output the predicted word segmentation labels and predicted part-of-speech labels corresponding to a sample sentence in the grammar tree dataset. The value of the first loss function is then computed from the predicted part-of-speech, word segmentation boundary, and word segmentation position labels of each word in the sample sentence and the corresponding actual labels of each word in the grammar sample sentence. The parameters of the neurons in the first encoder and the component analysis module can thus be adjusted based on the value of the first loss function so that the first loss function converges.
That is, when the first loss function converges, training of the first encoder is complete. Through this training process, the first encoder learns the part-of-speech information, word segmentation boundary information, and word segmentation position information corresponding to each word segment in the sample sentences of the grammar tree dataset. When facing the scenario shown in Fig. 1A, the first encoder may predict the part-of-speech information, word segmentation boundary information, and word segmentation position information of the sentence to be read aloud based on what it learned from the sample sentences. The first encoder may likewise predict phrase-level information (e.g., phrase combination information, phrase constituent information, phrase boundary information, phrase position information) and sentence-level information (e.g., sentence syntax structure information, sentence boundary information, sentence attention information) of the sentence to be read aloud based on the phrase-level and sentence-level information it learned from the sample sentences. The first encoder is then used to determine the grammar coding features corresponding to the sentence, and the component analysis module further combines these grammar coding features to output the grammar features.
With continued reference to Fig. 3, the grammar features output by the component analysis module are input to the grammar feature learning module. The grammar feature learning module includes a shared hidden layer, a first convolutional neural network layer (shown as the first CNN), and a second convolutional neural network layer (shown as the second CNN). The shared hidden layer receives the grammar features, further fuses them, and then inputs the fused grammar features into the first and second convolutional neural network layers, respectively. Optionally, the first and second convolutional neural network layers are independent 1-dimensional convolutional structures, each of which outputs a feature representation whose length equals the character length of the sentence.
As shown in Fig. 3, the first CNN inputs the grammar features for prosody into the speech generation module, and the second CNN inputs the grammar features for polyphones into the polyphone analysis module. As described above, determining the pronunciation of a polyphonic character usually requires only the few characters before and after it rather than the entire sentence, so the span of the grammar features for polyphones may be smaller than that of the grammar features for prosody. Thus, as one example, the first CNN that generates the grammar features for prosody may use convolution kernels with spans [3, 5, 7], while the second CNN that generates the grammar features for polyphones may use convolution kernels with spans [1, 3, 5]. Those skilled in the art will appreciate that the present disclosure is not limited thereto.
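A minimal sketch of such a grammar feature learning module is given below; the channel sizes and the summation of the multi-scale kernel outputs are assumptions (the disclosure only specifies the kernel spans [3, 5, 7] and [1, 3, 5] and that the output length equals the character length of the sentence).

```python
import torch
import torch.nn as nn

class GrammarFeatureLearning(nn.Module):
    """Sketch: a shared hidden layer followed by two independent 1-D CNN
    branches whose outputs keep the sequence length; sizes are illustrative."""
    def __init__(self, in_dim=256, hidden=256, out_dim=128):
        super().__init__()
        self.shared = nn.Linear(in_dim, hidden)                # shared hidden layer
        # Branch for prosody: wider receptive fields (kernel spans 3, 5, 7).
        self.prosody_convs = nn.ModuleList(
            [nn.Conv1d(hidden, out_dim, k, padding=k // 2) for k in (3, 5, 7)])
        # Branch for polyphones: narrower receptive fields (kernel spans 1, 3, 5).
        self.polyphone_convs = nn.ModuleList(
            [nn.Conv1d(hidden, out_dim, k, padding=k // 2) for k in (1, 3, 5)])

    def forward(self, grammar_feats):                          # (batch, seq_len, in_dim)
        h = torch.relu(self.shared(grammar_feats)).transpose(1, 2)
        prosody = sum(conv(h) for conv in self.prosody_convs).transpose(1, 2)
        polyphone = sum(conv(h) for conv in self.polyphone_convs).transpose(1, 2)
        return prosody, polyphone                              # both (batch, seq_len, out_dim)
```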
The polyphone analysis module is further described with continued reference to Figs. 3 and 5. The polyphone analysis module comprises a splicer, a second encoder, and a pinyin prediction layer. The splicer splices the character semantic features and the grammar features for polyphones into initial polyphone features. The second encoder then determines the polyphone features based on the initial polyphone features and the dictionary pronunciation information corresponding to each character in the character combination.
Optionally, the second encoder is a neural network model formed by stacking two identical converters (transformers). Each converter includes a cascade of a multi-head attention layer, a first regularization layer, a feed-forward layer, and a second regularization layer, and the output of the first regularization layer is also fed into the second regularization layer. The semantic features may be input not only to the multi-head attention layer of the first converter but also to its first regularization layer; the present disclosure is not limited in this respect.
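A hedged sketch of one such converter block, under the assumptions that the regularization layers are layer normalization and that the additions follow the standard transformer residual pattern, is:

```python
import torch.nn as nn

class ConverterBlock(nn.Module):
    """Sketch of one converter (transformer) block: multi-head attention ->
    first regularization layer -> feed-forward -> second regularization layer,
    with the first regularization output also fed into the second one.
    Dimensions are illustrative assumptions."""
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                                  # (batch, seq_len, d_model)
        h = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        return self.norm2(h + self.ff(h))

# The second encoder stacks two identical converter blocks.
second_encoder = nn.Sequential(ConverterBlock(), ConverterBlock())
```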
Optionally, the second encoder is trained on a polyphone dataset, and its training includes the following steps. First, based on a polyphone sample sentence in the polyphone dataset, the initial polyphone features corresponding to the polyphone sample sentence are determined. Next, based on these initial polyphone features, the second encoder is used to determine the polyphone features corresponding to the polyphone sample sentence. Then, based on the polyphone features corresponding to the polyphone sample sentence, the pinyin prediction layer is used to determine the predicted pinyin label corresponding to each character in the polyphone sample sentence. Next, the value of the second loss function is calculated based on the predicted pinyin labels and the actual pinyin labels corresponding to each character in the polyphone sample sentence. Finally, the parameters of the neurons in the second encoder and the pinyin prediction layer are adjusted based on the value of the second loss function, so that the second loss function converges.
Continuing the above example, after training of the parsing module is complete, a polyphone sample sentence in the polyphone dataset may be converted into initial polyphone features using the trained semantic analysis module, grammar feature learning module, and splicer. The second encoder then predicts the polyphone features corresponding to the polyphone sample sentence based on the initial polyphone features and inputs them to the pinyin prediction layer. Referring to Fig. 5, the second encoder outputs the polyphone feature corresponding to each character sequentially, in time-step order, and the pinyin prediction layer accordingly determines the predicted pinyin label for each character. For example, for "small", the pinyin prediction layer decodes the polyphone feature as "xiao3", i.e., "xiao" in the third tone. Similarly, the pinyin of each remaining character in the sentence is decoded in time-step order.
Then, during training of the polyphone analysis module, the value of the second loss function may be calculated based on the predicted pinyin label and the actual pinyin label corresponding to each character in the polyphone sample sentence. As shown in Fig. 3, the actual pinyin labels may be the pinyin information corresponding to each character in the dictionary. For example, if the dictionary pinyin label for "small" is likewise "xiao" in the third tone, the predicted result matches the actual result. When the value of the second loss function converges over the samples in the polyphone dataset, training of the polyphone analysis module is complete.
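One training step of the polyphone analysis module might be sketched as follows; the function and tensor names are hypothetical, and the cross-entropy form of the second loss is assumed here as one common choice.

```python
import torch.nn.functional as F

def second_loss_step(second_encoder, pinyin_prediction_layer, optimizer,
                     initial_polyphone_feats, pinyin_label_ids):
    """Illustrative training step: initial_polyphone_feats has shape
    (batch, seq_len, dim) and pinyin_label_ids holds the index of the actual
    (dictionary) pinyin label, e.g. the id of "xiao3", for each character."""
    polyphone_feats = second_encoder(initial_polyphone_feats)
    logits = pinyin_prediction_layer(polyphone_feats)      # (batch, seq_len, num_pinyin)
    loss = F.cross_entropy(logits.transpose(1, 2), pinyin_label_ids)
    optimizer.zero_grad()
    loss.backward()            # adjusts both the second encoder and the prediction layer
    optimizer.step()
    return loss.item()
```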
Through this training process, the second encoder learns the dictionary pronunciation information corresponding to each character in the sample sentences of the polyphone dataset. When facing the scenario shown in Fig. 1A, the second encoder may predict the dictionary pronunciation information of each character of the sentence to be read aloud based on the dictionary pronunciation information learned from the sample sentences, and further determine the polyphone features corresponding to the sentence.
With continued reference to Fig. 3, the polyphone analysis module inputs the polyphone features to the tone variation analysis module. The tone variation analysis module includes a cascaded third convolutional neural network, a non-shared output layer, and a softmax layer. The softmax layer outputs the polyphone features fused with tone variation information.
The polyphone features fused with tone variation information are determined by the tone variation analysis module, which is trained on a text-to-speech dataset. Training of the tone variation analysis module includes the following steps. First, based on a speech sample sentence in the text-to-speech dataset, the tone variation analysis module is used to determine the predicted polyphone features corresponding to the speech sample sentence. Then, the value of the third loss function is calculated based on the predicted polyphone features and the actual polyphone features corresponding to the speech sample sentence. Finally, the neuron parameters in the tone variation analysis module are adjusted based on the value of the third loss function, so that the third loss function converges.
Through this training process, the tone variation analysis module learns the tone variation information exhibited by the sample sentences of the text-to-speech dataset when actually read aloud. When facing the scenario shown in Fig. 1A, the tone variation analysis module can predict the tone variation information of each character of the sentence to be read aloud based on the tone variation information learned from the sample sentences, and further determine the polyphone features of the sentence fused with tone variation information.
The tone variation analysis module further fuses and adjusts the polyphone features through the third convolutional neural network, the non-shared output layer, and the softmax layer to obtain the polyphone features fused with tone variation information, and it scores these features against actual text-to-speech data, so that the resulting pinyin sequence is more accurate.
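One possible reading of this structure, sketched under the assumptions that the third convolutional neural network is a 1-D convolution and that the non-shared output layer is a linear layer whose parameters are not shared with the pinyin prediction layer, is:

```python
import torch
import torch.nn as nn

class ToneVariationModule(nn.Module):
    """Sketch: third 1-D CNN -> non-shared output layer -> softmax, producing
    polyphone features fused with tone variation information (here rendered as
    a probability distribution over pinyin labels; this rendering, the kernel
    size and the layer sizes are all assumptions)."""
    def __init__(self, dim=128, num_pinyin=1500, kernel=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        self.out = nn.Linear(dim, num_pinyin)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, polyphone_feats):                    # (batch, seq_len, dim)
        h = self.conv(polyphone_feats.transpose(1, 2)).transpose(1, 2)
        return self.softmax(self.out(torch.relu(h)))       # (batch, seq_len, num_pinyin)
```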
With continued reference to Fig. 3, the tone variation analysis module inputs the polyphone features fused with tone variation information to the speech generation module. The speech generation module comprises a pinyin-to-phoneme conversion module, a phoneme splicing module, an audio conversion module, and a vocoder. Specifically, the pinyin-to-phoneme conversion module may take the polyphone features fused with tone variation information as input, or directly take the polyphone features determined by the polyphone analysis module as input, and convert them into an initial phoneme sequence.
The phoneme splicing module may then splice the grammar features for prosody with the initial phoneme sequence to obtain the phoneme sequence corresponding to the sentence. Since the phoneme sequence includes not only the phoneme information of the sentence but also its prosody information, a separate audio conversion module needs to be trained to decode the phoneme sequence into audio data. Optionally, the audio conversion module may include a third encoder, an attention layer, and a decoder.
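For illustration only, a pinyin label such as "xiao3" could be split into initial, final, and tone phonemes and paired with the prosody grammar features as sketched below; the initial inventory and the pairing format are hypothetical and not taken from the disclosure.

```python
# Hypothetical pinyin-to-phoneme split and phoneme/prosody splicing.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")

def pinyin_to_phonemes(pinyin):
    """'xiao3' -> ['x', 'iao', '3'] (initial, final, tone)."""
    tone, base = pinyin[-1], pinyin[:-1]
    initial = next((i for i in INITIALS if base.startswith(i)), "")
    final = base[len(initial):]
    return [p for p in (initial, final, tone) if p]

def splice(pinyins, prosody_feats):
    """Pair each character's phonemes with its prosody grammar feature."""
    return [(pinyin_to_phonemes(p), f) for p, f in zip(pinyins, prosody_feats)]
```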
Optionally, the speech generation module is trained on the text-to-speech dataset, and its training includes the following steps. First, based on a speech sample sentence in the text-to-speech dataset, the speech generation module is used to generate the predicted audio corresponding to the speech sample sentence. Then, the value of the fourth loss function is calculated based on the predicted audio and the actual audio corresponding to the speech sample sentence. Finally, the neuron parameters in the speech generation module (e.g., the neuron parameters in the audio conversion module) are adjusted based on the value of the fourth loss function, so that the fourth loss function converges.
Through this training process, the speech generation module learns the speech information exhibited by the sample sentences of the text-to-speech dataset when actually read aloud. When facing the scenario shown in Fig. 1A, the speech generation module may predict the speech information of each character of the sentence to be read aloud based on the speech information learned from the sample sentences, and further determine the speech corresponding to the sentence.
As mentioned above, during the training process described above, each module can be trained independently: gradients are not back-propagated across module boundaries (i.e., the gradients are locked), which avoids interference between the modules.
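In a deep-learning framework, such gradient locking between cascaded modules is typically realized by detaching the upstream output before it enters the downstream module; a minimal sketch with hypothetical module names is:

```python
def forward_with_gradient_lock(upstream_module, downstream_module, inputs):
    """Illustrative 'gradient lock' between two cascaded modules: the
    downstream module consumes the upstream output as a constant, so losses
    computed downstream never back-propagate into (and interfere with) the
    independently trained upstream module."""
    upstream_out = upstream_module(inputs)           # e.g. grammar features
    return downstream_module(upstream_out.detach())  # gradients stop here
```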
As shown in Figs. 3-5, the acoustic model of the present disclosure is composed of a plurality of cascaded sub-modules. For example, these sub-modules may respectively handle component analysis for word segmentation and part-of-speech tagging, grammar-tree-based linguistic feature learning, attention-based polyphone disambiguation, tone variation prediction, and speech generation. By cascading the sub-modules, the information learned by each sub-module is fused and shared, which improves performance and robustness. In addition, the acoustic model of the present disclosure builds on component analysis, which improves the naturalness of the synthesized speech. For example, the present disclosure extracts grammar features strongly correlated with prosody from the component analysis tree and feeds them to the TTS system as additional input, thereby avoiding training a separate prosodic structure prediction module while improving performance and robustness.
In addition, the present disclosure also provides an apparatus for converting text data into a phoneme sequence. Fig. 6 is a block diagram illustrating an apparatus 600 for converting text data into a phoneme sequence according to an embodiment of the present disclosure. The apparatus 600 comprises an extraction unit, a first determination unit, a second determination unit, and a third determination unit. The extraction unit is configured to extract, based on a sentence in the text data, semantic features corresponding to the sentence and character semantic features corresponding to one or more continuous characters in the sentence. The first determination unit is configured to determine grammar features corresponding to the sentence based on the semantic features corresponding to the sentence. The second determination unit is configured to determine polyphone features indicating polyphone pronunciation information of characters based on the character semantic features and the grammar features corresponding to the sentence. The third determination unit is configured to determine a phoneme sequence corresponding to the sentence based on the grammar features and the polyphone features.
Furthermore, the second determination unit is further configured to: determine grammar features for polyphone pronunciation based on the grammar features corresponding to the sentence, and determine the polyphone features based on the grammar features for polyphone pronunciation and the character semantic features.
The third determination unit is further configured to: determine grammar features for prosody based on the grammar features corresponding to the sentence, and determine the phoneme sequence corresponding to the sentence based on the grammar features for prosody and the polyphone features.
The extraction unit may be similar to the semantic analysis module described above, the first determination unit may be similar to the parsing module described above (or the combination of the parsing module and the grammar feature learning module), the second determination unit may be similar to the polyphone analysis module described above (or the combination of the polyphone analysis module and the tone variation analysis module), and the third determination unit may be similar to the speech generation module described above. For brevity, the description is not repeated here.
Fig. 7 is a block diagram illustrating an apparatus 700 for converting text data into a phoneme sequence according to an embodiment of the present disclosure.
Referring to Fig. 7, the apparatus 700 may include a processor 701 and a memory 702. The processor 701 and the memory 702 may be connected by a bus 703.
The processor 701 may perform various actions and processes according to programs stored in the memory 702. In particular, the processor 701 may be an integrated circuit chip having signal processing capabilities. The processor may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor or any conventional processor, and may be of the X86 architecture or the ARM architecture.
The memory 702 stores computer instructions that, when executed by the processor 701, implement the method 200. The memory 702 may be volatile memory or non-volatile memory, or may include both. Non-volatile memory may be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory may be random access memory (RAM), which acts as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM). It should be noted that the memory of the methods described in this disclosure is intended to comprise, without being limited to, these and any other suitable types of memory.
Embodiments of the present disclosure provide a computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the above-described method.
According to another aspect of the present disclosure, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from the computer readable storage medium and executes them to cause the computer device to perform the methods described above or the methods provided in the various alternative implementations thereof.
It is noted that the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In general, the various example embodiments of the disclosure may be implemented in hardware or special purpose circuits, software, firmware, logic, or any combination thereof. Some aspects may be implemented in hardware, while other aspects may be implemented in firmware or software which may be executed by a controller, microprocessor or other computing device. While aspects of the embodiments of the present disclosure are illustrated or described as block diagrams, flow charts, or using some other pictorial representation, it is well understood that the blocks, apparatus, systems, techniques or methods described herein may be implemented in, as non-limiting examples, hardware, software, firmware, special purpose circuits or logic, general purpose hardware or controller or other computing devices, or some combination thereof.
The exemplary embodiments of the invention described in detail above are illustrative only and are not limiting. It will be appreciated by those skilled in the art that various modifications and combinations of the embodiments or features thereof can be made without departing from the principles and spirit of the invention, and such modifications are intended to be within the scope of the invention.

Claims (13)

1. A method of converting text data into a sequence of phonemes, comprising:
extracting, based on a sentence in the text data, semantic features corresponding to the sentence and character semantic features corresponding to one or more continuous characters in the sentence;
determining grammar features corresponding to the sentence based on the semantic features corresponding to the sentence;
determining polyphone features based on the character semantic features and the grammar features corresponding to the sentence, the polyphone features indicating polyphone pronunciation information of a character; and
determining a phoneme sequence corresponding to the sentence based on the grammar features and the polyphone features,
wherein the determining polyphone features based on the character semantic features and the grammar features corresponding to the sentence comprises:
determining grammar features for polyphone pronunciation based on the grammar features corresponding to the sentence, and
determining polyphone features based on the grammar features for polyphone pronunciation and the character semantic features;
wherein said determining polyphone features based on said grammar features for polyphone pronunciation and said character semantic features comprises:
splicing the grammar features for polyphone pronunciation and the character semantic features into initial polyphone features, and
determining the polyphone features based on the initial polyphone features and dictionary pronunciation information corresponding to each character in the character combination.
2. The method of claim 1, wherein the determining a phoneme sequence corresponding to the sentence based on the grammar features and the polyphone features further comprises:
determining grammar features for prosody based on the grammar features corresponding to the sentence,
and determining the phoneme sequence corresponding to the sentence based on the grammar features for prosody and the polyphone features.
3. The method of claim 2, wherein the determining a phoneme sequence corresponding to the sentence based on the grammar features and the polyphone features further comprises:
determining, based on the polyphone features and tone variation information, polyphone features fused with the tone variation information;
and determining the phoneme sequence corresponding to the sentence based on the polyphone features fused with the tone variation information and the grammar features for prosody.
4. The method of claim 1, wherein the determining the grammar features corresponding to the sentence based on the semantic features corresponding to the sentence further comprises:
determining grammar coding features corresponding to the sentence based on the semantic features corresponding to the sentence and the part-of-speech information, word segmentation boundary information, and word segmentation position information corresponding to each word segment in the sentence;
and determining the grammar features corresponding to the sentence based on the grammar coding features corresponding to the sentence.
5. The method of claim 4, wherein the grammar coding features corresponding to the sentence are determined by a first encoder, the grammar features corresponding to the sentence are determined by a component analysis module based on the grammar coding features, and the first encoder and the component analysis module are trained on a grammar tree dataset, the training of the first encoder comprising:
determining, with the first encoder, grammar coding features corresponding to a grammar sample sentence based on the grammar sample sentence in the grammar tree dataset;
determining, with the component analysis module, grammar features corresponding to the grammar sample sentence and extracting component span scores from the grammar coding features corresponding to the grammar sample sentence;
determining, based on the component span scores, a predicted part-of-speech label, a predicted word segmentation boundary label, and a predicted word segmentation position label corresponding to each word segment in the grammar sample sentence;
calculating a value of a first loss function based on the predicted part-of-speech label, predicted word segmentation boundary label, and predicted word segmentation position label corresponding to each word segment in the grammar sample sentence and the actual part-of-speech label, actual word segmentation boundary label, and actual word segmentation position label corresponding to each word segment in the grammar sample sentence; and
adjusting parameters of neurons in the first encoder and the component analysis module based on the value of the first loss function, so that the first loss function converges.
6. The method of claim 1, wherein the polyphone features are determined by a second encoder based on the initial polyphone features, the second encoder being trained on a polyphone dataset, the training of the second encoder comprising:
determining initial polyphone features corresponding to a polyphone sample sentence based on the polyphone sample sentence in the polyphone dataset;
determining, with the second encoder, polyphone features corresponding to the polyphone sample sentence based on the initial polyphone features corresponding to the polyphone sample sentence;
determining, with a pinyin prediction layer, a predicted pinyin label corresponding to each character in the polyphone sample sentence based on the polyphone features corresponding to the polyphone sample sentence;
calculating a value of a second loss function based on the predicted pinyin label corresponding to each character in the polyphone sample sentence and the actual pinyin label corresponding to each character in the polyphone sample sentence; and
adjusting parameters of neurons in the second encoder and the pinyin prediction layer based on the value of the second loss function, so that the second loss function converges.
7. The method of claim 3, wherein the polyphone features fused with the tone variation information are determined by a tone variation analysis module trained on a text-to-speech dataset, the training of the tone variation analysis module comprising:
determining, with the tone variation analysis module, predicted polyphone features corresponding to a speech sample sentence based on the speech sample sentence in the text-to-speech dataset;
calculating a value of a third loss function based on the predicted polyphone features corresponding to the speech sample sentence and the actual polyphone features corresponding to the speech sample sentence; and
adjusting neuron parameters in the tone variation analysis module based on the value of the third loss function, so that the third loss function converges.
8. The method of claim 7, further comprising: determining, based on the phoneme sequence corresponding to the sentence, audio corresponding to the sentence using a speech generation module trained on the text-to-speech dataset, the training of the speech generation module comprising:
generating, with the speech generation module, predicted audio corresponding to a speech sample sentence based on the speech sample sentence in the text-to-speech dataset;
calculating a value of a fourth loss function based on the predicted audio corresponding to the speech sample sentence and the actual audio corresponding to the speech sample sentence; and
adjusting neuron parameters in the speech generation module based on the value of the fourth loss function, so that the fourth loss function converges.
9. The method of claim 2, wherein the grammar features for prosody are determined by a shared hidden layer and a first convolutional neural network layer, the grammar features for polyphone pronunciation are determined by the shared hidden layer and a second convolutional neural network layer, and the convolution kernel span of the first convolutional neural network is larger than the convolution kernel span of the second convolutional neural network.
10. An apparatus for converting text data into a sequence of phonemes, comprising:
an extraction unit configured to extract, based on a sentence in the text data, semantic features corresponding to the sentence and character semantic features corresponding to one or more continuous characters in the sentence;
a first determination unit configured to determine grammar features corresponding to the sentence based on the semantic features corresponding to the sentence;
a second determination unit configured to determine polyphone features indicating polyphone pronunciation information of a character based on the character semantic features and the grammar features corresponding to the sentence; and
a third determination unit configured to determine a phoneme sequence corresponding to the sentence based on the grammar features and the polyphone features,
wherein the determining polyphone features based on the character semantic features and the grammar features corresponding to the sentence comprises:
determining grammar features for polyphone pronunciation based on the grammar features corresponding to the sentence, and
determining polyphone features based on the grammar features for polyphone pronunciation and the character semantic features;
wherein said determining polyphone features based on said grammar features for polyphone pronunciation and said character semantic features comprises:
splicing the grammar features for polyphone pronunciation and the character semantic features into initial polyphone features, and
determining the polyphone features based on the initial polyphone features and dictionary pronunciation information corresponding to each character in the character combination.
11. The apparatus of claim 10, wherein,
The third determination unit is configured to:
determine grammar features for prosody based on the grammar features corresponding to the sentence,
and determine the phoneme sequence corresponding to the sentence based on the grammar features for prosody and the polyphone features.
12. An apparatus for converting text data into a sequence of phonemes, comprising:
a processor; and
a memory, wherein the memory has stored therein a computer executable program which, when executed by the processor, performs the method of any one of claims 1-9.
13. A computer readable storage medium having stored thereon computer instructions which, when executed by a processor, implement the method of any of claims 1-9.

