CN116778907A - Multi-modal speech synthesis method, device, equipment and storage medium


Info

Publication number
CN116778907A
Authority
CN
China
Prior art keywords
vector
sequence information
character
synthesized
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310688242.3A
Other languages
Chinese (zh)
Inventor
张旭龙
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202310688242.3A
Publication of CN116778907A
Legal status: Pending

Links

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L 13/10 Prosody rules derived from text; Stress or intonation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Acoustics & Sound (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the technical field of artificial intelligence and discloses a multi-modal speech synthesis method, device, equipment and storage medium. The method comprises: preprocessing a text to be synthesized to obtain character sequence information, character-level graph sequence information and word-level graph sequence information as the input sequence information; encoding the character sequence information to obtain a time domain coding vector; encoding the character-level graph sequence information and the word-level graph sequence information to obtain a first and a second spatial domain coding vector; performing a first cross-modal attention calculation on the time domain coding vector and the first spatial domain coding vector to obtain a first decoding vector; performing a second cross-modal attention calculation on the first decoding vector and the second spatial domain coding vector to obtain a second decoding vector; and obtaining a speech spectrogram from the second decoding vector to generate the synthesized speech. The application preserves the prosody and accuracy of the synthesized speech and effectively improves the level of financial services.

Description

Multi-modal speech synthesis method, device, equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular to a multi-modal speech synthesis method, apparatus, device and storage medium.
Background
Speech synthesis, i.e. text-to-speech (TTS) technology, comprises multiple stages such as text analysis, acoustic modeling and waveform synthesis. In the financial field, financial institutions have widely introduced speech synthesis technology in business scenarios such as greeting customers, business consultation, promotional broadcasting and question answering. In order to simplify the speech synthesis process, reduce human intervention and lower the need for linguistic background knowledge, end-to-end speech synthesis systems take text or phonetic characters directly as input and output audio waveforms. Existing end-to-end speech synthesis methods, however, ignore the importance of visual information, use only the features of single-modality text information, and therefore cannot realize speech synthesis accurately and comprehensively.
Prosody conveys information such as rhythm, stress and intonation in speech, and prosodic information determines the naturalness and fluency of synthesized speech, so prosody plays a very important role in speech synthesis. Existing end-to-end speech synthesis methods train the speech synthesis model on a standard database of one-to-one text-speech pairs; because the capacity of such a database is limited, the model cannot learn the prosodic patterns of semantic connections, so the synthesized speech lacks a sense of prosody, which affects the level of financial services: for example, stiff and lifeless voice customer service reduces customer satisfaction.
Disclosure of Invention
Accordingly, in order to solve the above problems, it is necessary to provide a multi-modal speech synthesis method, apparatus, device and storage medium, so as to address the problems of single-modality features and poor prosody in synthesized speech.
A multi-modal based speech synthesis method comprising:
preprocessing a text to be synthesized to obtain input sequence information; the input sequence information comprises character sequence information, character-level graph sequence information and word-level graph sequence information;
coding the character sequence information to obtain a time domain coding vector; coding the character-level graph sequence information to obtain a first spatial domain coding vector; coding the word-level graph sequence information to obtain a second spatial domain coding vector;
performing first cross-modal attention calculation on the time domain coding vector and the first spatial domain coding vector to obtain a first decoding vector;
performing second cross-modal attention calculation on the first decoding vector and the second spatial domain coding vector to obtain a second decoding vector;
and obtaining a voice spectrogram according to the second decoding vector so as to generate the synthetic voice of the text to be synthesized.
A multi-modal based speech synthesis apparatus, comprising:
the preprocessing module is used for preprocessing the text to be synthesized to obtain input sequence information; the input sequence information comprises character sequence information, character-level graph sequence information and word-level graph sequence information;
the coding processing module is used for carrying out coding processing on the character sequence information to obtain a time domain coding vector; coding the character-level graph sequence information to obtain a first spatial domain coding vector; coding the word-level graph sequence information to obtain a second spatial domain coding vector;
the first attention calculating module is used for carrying out first cross-modal attention calculation on the time domain coding vector and the first space domain coding vector to obtain a first decoding vector;
the second attention calculating module is used for carrying out second cross-modal attention calculation on the first decoding vector and the second spatial domain coding vector to obtain a second decoding vector;
and the synthetic voice generating module is used for obtaining a voice spectrogram according to the second decoding vector so as to generate the synthetic voice of the text to be synthesized.
A computer device comprising a memory, a processor and computer readable instructions stored in the memory and executable on the processor, the processor implementing the above multi-modal speech synthesis method when executing the computer readable instructions.
One or more readable storage media storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the multi-modal speech synthesis method described above.
The multi-modal speech synthesis method, device, equipment and storage medium obtain input sequence information by preprocessing the text to be synthesized; the input sequence information comprises character sequence information, character-level graph sequence information and word-level graph sequence information; encode the character sequence information to obtain a time domain coding vector, the character-level graph sequence information to obtain a first spatial domain coding vector, and the word-level graph sequence information to obtain a second spatial domain coding vector; perform a first cross-modal attention calculation on the time domain coding vector and the first spatial domain coding vector to obtain a first decoding vector; perform a second cross-modal attention calculation on the first decoding vector and the second spatial domain coding vector to obtain a second decoding vector; and obtain a speech spectrogram according to the second decoding vector to generate the synthesized speech of the text to be synthesized. The application computes the hidden states of the character-level graph embedding with a character-level graph embedding coding scheme and the hidden states of the word-level graph embedding with a word-level graph embedding coding scheme, so semantic information of the text can be extracted at multiple levels and the prosody of the synthesized speech is improved; meanwhile, multi-modal feature fusion is performed with a cross-modal attention mechanism, and the computed attention weights allow the time domain coding vector to receive information from the two different spatial domain coding vectors and perform feature selection, ensuring the prosody and accuracy of the synthesized speech and effectively improving the level of financial services.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments of the present application will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a multi-modal speech synthesis method according to an embodiment of the application;
FIG. 2 is a schematic flow diagram of a multi-modal speech synthesis method according to an embodiment of the application;
FIG. 3 is a schematic diagram of a multi-modal speech synthesis apparatus according to an embodiment of the application;
FIG. 4 is a schematic diagram of a computer device in accordance with an embodiment of the application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The embodiment of the application can acquire and process the relevant voice data based on artificial intelligence technology. Artificial Intelligence (AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results. Artificial intelligence infrastructure technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
Speech synthesis technology, also called text-to-speech technology, converts text information generated by a computer or input from outside into intelligible, fluent speech output. A text-to-speech system is in fact an artificial intelligence system: to synthesize high-quality speech, it must understand the semantic content of the text well, in addition to relying on various rules, including semantic rules, lexical rules and phonetic rules. The multi-modal speech synthesis method of the application can be applied to financial service scenarios to improve the level of financial services. For example, a bank deploys an intelligent voice interaction robot in its self-service system and uses high-quality synthesized speech to guide customers through card opening, account transfers, remittances and other business; an insurance company adopts intelligent customer service in its claims business, using high-quality synthesized speech to help customers understand the claims process and resolve claims issues.
In one embodiment, as shown in fig. 1, a multi-modal speech synthesis method is provided, which includes the following steps S10 to S50.
S10, preprocessing a text to be synthesized to obtain input sequence information; the input sequence information includes character sequence information, character-level graph sequence information and word-level graph sequence information.
Understandably, preprocessing for speech synthesis includes language processing, which plays an important role in the text-to-speech process. Language processing analyzes the text to be synthesized so that, by simulating the way humans understand natural language, the computer fully understands the text and generates the input sequence information. Preprocessing also includes prosody processing, which plans segmental features such as pitch, duration and intensity for the synthesized speech, so that it expresses the intended meaning correctly and sounds more natural. Graph-to-sequence embedding converts the input text sequence into a graph structure that represents the text content, grammatical relations and semantic connections between text units, thereby preserving prosodic information. In one embodiment, character sequence information is obtained by performing character embedding on the input text to be synthesized, and character-level graph embedding and word-level graph embedding are performed on the text to be synthesized to obtain the character-level graph sequence information and the word-level graph sequence information, respectively.
S20, carrying out coding processing on the character sequence information to obtain a time domain coding vector; coding the character-level graph sequence information to obtain a first spatial domain coding vector; and coding the word-level graph sequence information to obtain a second spatial domain coding vector.
Understandably, the time domain coding vector is derived from the character sequence information and represents the time-step order between characters of the text to be synthesized; a spatial domain coding vector is derived from graph sequence information and represents the semantic associations among characters or words of the text to be synthesized. End-to-end speech synthesis uses a sequence-to-sequence (Seq2Seq) model consisting of an encoder and a decoder. The encoder is a recurrent neural network for text understanding that encodes the input sequence information into hidden state vectors; the decoder is a recurrent neural network for text generation that decodes the hidden state vectors output by the encoder at each time step. The input to each encoder is first represented as one-hot coding vectors: N states are encoded with N-bit state registers, each state having its own register bit, so a one-hot vector has exactly one element equal to 1 and the rest equal to 0. The encoder maps each element of the character sequence information to a discrete one-hot vector and then encodes it into a low-dimensional continuous embedding to obtain the time domain coding vector; similarly, it maps each element of the character-level graph sequence information and the word-level graph sequence information to discrete one-hot vectors and encodes them into low-dimensional continuous embeddings to obtain the first and second spatial domain coding vectors.
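To make the encoding step concrete, the following is a minimal PyTorch sketch of one such encoder branch; the class name, layer sizes and the use of a bidirectional GRU are illustrative assumptions, not the patent's exact network.

```python
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    """One encoder branch: token IDs -> low-dimensional embeddings -> per-step
    hidden states (the 'coding vectors'). The embedding lookup is equivalent to
    multiplying a one-hot vector by a trainable matrix."""
    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) integer indices of characters or graph nodes
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        hidden, _ = self.rnn(embedded)         # (batch, seq_len, 2 * hidden_dim)
        return hidden

# Three independent branches, one per input sequence (vocabulary sizes are illustrative):
char_encoder = SequenceEncoder(vocab_size=100)         # character sequence -> time domain coding vectors
char_graph_encoder = SequenceEncoder(vocab_size=100)   # character-level graph sequence -> first spatial domain coding vectors
word_graph_encoder = SequenceEncoder(vocab_size=5000)  # word-level graph sequence -> second spatial domain coding vectors
```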
S30, performing first cross-modal attention calculation on the time domain coding vector and the first space domain coding vector to obtain a first decoding vector.
It will be appreciated that during end-to-end speech synthesis the decoder decodes step by step, and if it receives too much information at each decoding step, internal confusion and decoding errors may result. For example, the encoder encodes "the weather is good today" and passes the encoded vectors to the decoder, which must reproduce "the weather is good today". The attention mechanism in neural networks is a resource allocation scheme that, under limited computing power, allocates computing resources to the more important parts of the input and thereby alleviates information overload. Introducing an attention mechanism in the decoder avoids confusing the content: when decoding one character, the characters belonging to the same word are more relevant than the characters of neighbouring words, so the attention mechanism lets the decoder focus on "today" at that step without paying much attention to "weather". A feature vector represents an entity, which may be an image, a single word or a sentence. Multi-modal feature vectors come from multiple representations of the same entity and exhibit smoothness, temporal and spatial consistency, sparsity and natural clustering. When a neural network constructs multi-modal features, each modality first passes through its own independent network layer, and one or more hidden layers then map the modalities into a joint space to obtain cross-modal joint features. In an embodiment, the character sequence information is passed through an encoder to obtain the time domain coding vector, the character-level graph sequence information is passed through an encoder to obtain the first spatial domain coding vector, a first cross-modal attention calculation is performed on the time domain coding vector and the first spatial domain coding vector, and the features of the character sequence information and of the character-level graph sequence information are combined to obtain the first decoding vector.
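The joint-space construction described above can be sketched as follows; this is a hedged PyTorch illustration with hypothetical layer names and dimensions, showing each modality passing through its own layer before a shared hidden layer maps both into a joint space.

```python
import torch
import torch.nn as nn

class JointSpaceFusion(nn.Module):
    """Each modality passes through its own independent layer, then a shared
    hidden layer maps both into a joint space (cross-modal joint features)."""
    def __init__(self, dim_a: int, dim_b: int, joint_dim: int = 256):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, joint_dim)  # independent layer for modality A
        self.proj_b = nn.Linear(dim_b, joint_dim)  # independent layer for modality B
        self.joint = nn.Sequential(nn.Linear(2 * joint_dim, joint_dim), nn.Tanh())

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.proj_a(feat_a), self.proj_b(feat_b)], dim=-1)
        return self.joint(fused)  # joint cross-modal feature

fusion = JointSpaceFusion(dim_a=512, dim_b=512)
joint_feature = fusion(torch.randn(1, 20, 512), torch.randn(1, 20, 512))  # (1, 20, 256)
```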
S40, performing second cross-modal attention calculation on the first decoding vector and the second spatial domain coding vector to obtain a second decoding vector.
Understandably, the text to be synthesized has semantic connections between words in addition to the semantic connections between characters. The first decoding vector is a vector representation in which the features of the character sequence information and the character-level graph sequence information are combined, and the second decoding vector is a vector representation in which the first decoding vector and the word-level graph sequence information are further combined. In an embodiment, the word-level graph sequence information is passed through an encoder to obtain the second spatial domain coding vector, a second cross-modal attention calculation is performed on the first decoding vector and the second spatial domain coding vector, and the features of the character sequence information, the character-level graph sequence information and the word-level graph sequence information are combined to obtain the second decoding vector.
S50, obtaining a voice spectrogram according to the second decoding vector so as to generate the synthesized voice of the text to be synthesized.
Understandably, the speech synthesis process also includes a post-processing network added after the decoder. The decoder obtains a plurality of second decoding vectors after attention calculation over a plurality of time steps, and the post-processing network optimizes these vectors to obtain speech mel spectrum features. A speech mel spectrum is generated from these features, spectrum conversion is applied to the mel spectrum, and it is inverse-transformed into waveform samples, thereby obtaining the synthesized speech.
This embodiment obtains input sequence information by preprocessing the text to be synthesized; the input sequence information comprises character sequence information, character-level graph sequence information and word-level graph sequence information; encodes the character sequence information to obtain a time domain coding vector, the character-level graph sequence information to obtain a first spatial domain coding vector, and the word-level graph sequence information to obtain a second spatial domain coding vector; performs a first cross-modal attention calculation on the time domain coding vector and the first spatial domain coding vector to obtain a first decoding vector; performs a second cross-modal attention calculation on the first decoding vector and the second spatial domain coding vector to obtain a second decoding vector; and obtains a speech spectrogram according to the second decoding vector to generate the synthesized speech of the text to be synthesized. The application computes the hidden states of the character-level graph embedding with a character-level graph embedding coding scheme and the hidden states of the word-level graph embedding with a word-level graph embedding coding scheme, so semantic information of the text can be extracted at multiple levels and the prosody of the synthesized speech is improved; meanwhile, multi-modal feature fusion is performed with a cross-modal attention mechanism, and the computed attention weights allow the time domain coding vector to receive information from the two different spatial domain coding vectors and perform feature selection, improving the accuracy of the synthesized speech.
Optionally, in step S10, that is, preprocessing the text to be synthesized to obtain input sequence information, the method includes:
s101, extracting character characteristic data of the text to be synthesized;
s102, performing phoneme embedding on the character characteristic data to generate character sequence information.
Understandably, a character is the smallest unit or symbol in the text to be synthesized, including letters, digits, operators, punctuation marks and functional symbols. Phonemes are the minimal phonetic units divided according to the natural attributes of speech; they are analyzed according to the articulatory actions within a syllable, with one action constituting one phoneme. Embedding is a data representation technique: through an added hidden layer of the neural network, it preserves the associations between data while controlling the output dimension of the hidden layer, achieving dimensionality reduction of high-cardinality categorical data. In this embodiment, character characteristic data of the text to be synthesized is extracted and phoneme embedding is performed on it to obtain the character sequence information.
This embodiment converts the character characteristic data of the text to be synthesized into character sequence information so that the encoder can encode it to obtain the hidden state vector of the time domain.
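As a toy illustration of phoneme embedding, the sketch below maps characters to phoneme IDs with a hypothetical lookup table and embeds them; a real system would rely on a pronunciation lexicon or a trained grapheme-to-phoneme model rather than this hand-written table.

```python
import torch
import torch.nn as nn

# Hypothetical grapheme-to-phoneme table, for illustration only.
G2P = {"a": 0, "b": 1, "c": 2, " ": 3}
UNK = 3  # unknown characters fall back to this ID in the toy example
phoneme_embedding = nn.Embedding(num_embeddings=len(G2P), embedding_dim=64)

def text_to_character_sequence(text: str) -> torch.Tensor:
    ids = torch.tensor([G2P.get(ch, UNK) for ch in text.lower()])
    return phoneme_embedding(ids)  # (len(text), 64): the character sequence information

print(text_to_character_sequence("abc a").shape)  # torch.Size([5, 64])
```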
Optionally, in step S10, that is, preprocessing the text to be synthesized to obtain input sequence information, the method further includes:
s103, extracting character characteristic data of the text to be synthesized;
s104, carrying out graph embedding on the character characteristic data to obtain character level graph node information and character level graph boundary information;
s105, generating the character level diagram sequence information according to the character level diagram node information and the character level diagram boundary information.
Understandably, the present embodiment extracts character feature data of a text to be synthesized; graph embedding is carried out on character characteristic data: representing character characteristic data (each letter, symbol and the like) by nodes of the graph to obtain character level graph node information; modeling the adjacency relation between character characteristic data by using the boundary of the graph, namely connecting adjacent graph nodes by directional edges to obtain character-level graph boundary information; and splicing the character-level graph node information and the character-level graph boundary information to generate character-level graph sequence information.
According to the embodiment, character characteristic data of the text to be synthesized are converted into a graph structure form to be embedded, and the text content is represented by node embedding, so that modal characteristics are enriched; the semantic connection between the grammar relation and the text characters is represented by boundary embedding, and prosodic information is reserved; meanwhile, the encoder is convenient to encode the character-level diagram sequence information so as to obtain the character-level hidden state vector of the space domain.
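A minimal sketch of the character-level graph construction follows; the data structures (a node list plus directed adjacency edges) are an assumption about one straightforward way to realize the node and edge information described above.

```python
def build_character_graph(text: str):
    """Nodes are characters; directed edges connect adjacent characters.
    The returned lists correspond to node information and edge information."""
    nodes = list(text)                                   # character-level graph nodes
    edges = [(i, i + 1) for i in range(len(nodes) - 1)]  # directed edges between neighbours
    return nodes, edges

nodes, edges = build_character_graph("weather")
# nodes: ['w', 'e', 'a', 't', 'h', 'e', 'r']
# edges: [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
# The node and edge information are then serialized (e.g. concatenated) into the
# character-level graph sequence that is fed to the character-level graph encoder.
```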
Optionally, in step S10, that is, preprocessing the text to be synthesized to obtain input sequence information, the method further includes:
s106, extracting word characteristic data of the text to be synthesized;
s107, carrying out graph embedding on the word characteristic data to obtain word level graph node information and word level graph boundary information;
s108, generating the word-level diagram sequence information according to the word-level diagram node information and the word-level diagram boundary information.
It is understood that a parse tree is a tree-like data structure that is used to determine whether the syntactic structure of a word sequence meets a given grammar. According to the embodiment, grammar processing is carried out on the text to be synthesized through a preset syntactic analysis tree, and word characteristic data of the text to be synthesized are extracted; graph embedding is carried out on word characteristic data: representing word characteristic data (each word) by using nodes of the graph to obtain word level graph node information; modeling the adjacency relation between the word characteristic data by using the boundary of the graph, namely connecting adjacent graph nodes by directional edges to obtain word-level graph boundary information; and splicing the word-level graph node information and the word-level graph boundary information to generate word-level graph sequence information.
According to the embodiment, word characteristic data of the text to be synthesized are converted into a graph structure form to be embedded, and the text content is represented by node embedding, so that modal characteristics are enriched; semantic connection between grammar relations and text words is represented by boundary embedding, and prosodic information is reserved; meanwhile, the encoder is convenient to encode the word-level graph sequence information so as to obtain the word-level hidden state vector of the space domain.
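The word-level graph can be sketched analogously; here the parse-tree edges are supplied as hypothetical head indices, since the patent does not specify a particular parser.

```python
def build_word_graph(words, head_indices):
    """Nodes are words; edges follow a syntactic parse given as one head index per
    word (-1 marks the root), plus adjacency edges for reading order."""
    nodes = list(words)
    edges = [(head, i) for i, head in enumerate(head_indices) if head >= 0]  # parse-tree edges
    edges += [(i, i + 1) for i in range(len(nodes) - 1)]                     # adjacency edges
    return nodes, edges

# Hypothetical parse of "the weather is good"; the head indices are illustrative.
nodes, edges = build_word_graph(["the", "weather", "is", "good"], [1, 2, -1, 2])
```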
Optionally, in step S30, the performing a first cross-modal attention calculation on the time-domain encoded vector and the first spatial-domain encoded vector to obtain a first decoded vector includes:
s301, acquiring a first query space vector corresponding to the first space domain coding vector;
s302, acquiring a first key space vector and a first value space vector corresponding to the time domain coding vector;
s303, calculating a first attention weight according to the first query space vector and the first key space vector;
s304, carrying out weighted calculation on the first value space vector according to the first attention weight value to obtain the first decoding vector.
It will be appreciated that a self-attention mechanism typically adopts a Query-Key-Value scheme: in single-modality self-attention, the query vector (Query, Q), key vector (Key, K) and value vector (Value, V) are obtained by multiplying the same coding vector by three corresponding trainable parameter matrices. Q and K are first dot-multiplied to compute similarity, the resulting similarity weight matrix is normalized, and the normalized weights are then multiplied with V to compute a weighted sum, giving the decoding vector. This embodiment instead employs a cross-modal attention mechanism in which Q, K and V do not come from the same coding vector: Q comes from a spatial domain coding vector, while K and V come from the time domain coding vector, and the similarity between the spatial domain and time domain coding vectors is computed to combine the modal features. The cross-modal attention decoder contains two different attention layers. In the first attention layer, a first query space vector Q1 corresponding to the first spatial domain coding vector is acquired; a first key space vector K1 and a first value space vector V1 corresponding to the time domain coding vector are acquired; a first attention weight a1 is calculated from Q1 and K1; and the first value space vector V1 is weighted by a1 to obtain the first decoding vector context1.
In this embodiment, cross-modal attention calculation is performed between the first spatial domain coding vector and the time domain coding vector, which enriches the modal features and preserves the prosodic information features between the character sequence and the character-level graph sequence.
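A minimal PyTorch sketch of one such cross-modal attention layer is given below, assuming single-head scaled dot-product attention with Q projected from the spatial domain coding vectors and K, V projected from the time domain coding vectors; the class and parameter names are illustrative, not the patent's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalAttention(nn.Module):
    """One cross-modal attention layer: Q is projected from the spatial domain
    vectors, K and V from the time domain vectors (or, in the second layer, from
    the first decoding vectors)."""
    def __init__(self, dim: int):
        super().__init__()
        self.w_q = nn.Linear(dim, dim)
        self.w_k = nn.Linear(dim, dim)
        self.w_v = nn.Linear(dim, dim)

    def forward(self, spatial: torch.Tensor, temporal: torch.Tensor) -> torch.Tensor:
        q = self.w_q(spatial)                                   # (batch, len_s, dim)
        k = self.w_k(temporal)                                  # (batch, len_t, dim)
        v = self.w_v(temporal)
        scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)  # similarity between Q and K
        weights = F.softmax(scores, dim=-1)                     # normalized attention weights
        return weights @ v                                      # weighted sum of V -> decoding vectors
```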
Optionally, in step S40, the performing a second cross-modal attention calculation on the first decoding vector and the second spatial domain coding vector to obtain a second decoding vector includes:
s401, acquiring a second query space vector corresponding to the second space domain coding vector;
s402, acquiring a second key space vector and a second value space vector corresponding to the first decoding vector;
s403, calculating a second attention weight according to the second query space vector and the second key space vector;
s404, carrying out weighted calculation on the second value space vector according to the second attention weight value to obtain the second decoding vector.
It will be appreciated that the cross-modal attention decoder of this embodiment contains two different attention layers. In the second attention layer, a second query space vector Q2 corresponding to the second spatial domain coding vector is acquired; a second key space vector K2 and a second value space vector V2 corresponding to the first decoding vector are acquired; a second attention weight a2 is calculated from Q2 and K2; and the second value space vector V2 is weighted by a2 to obtain the second decoding vector context2.
In this embodiment, cross-modal attention calculation is performed between the second spatial domain coding vector and the first decoding vector, which enriches the modal features, preserves the prosodic information features among the character sequence, the character-level graph sequence and the word-level graph sequence, and improves the accuracy of speech synthesis.
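Reusing the CrossModalAttention sketch above, the two layers can be chained as follows; the tensor shapes and dimensions are illustrative dummies.

```python
import torch

# Dummy encoder outputs (batch = 1; sequence lengths and dimensions are illustrative).
time_domain_vectors = torch.randn(1, 20, 512)  # from the character encoder
char_graph_vectors = torch.randn(1, 20, 512)   # from the character-level graph encoder
word_graph_vectors = torch.randn(1, 6, 512)    # from the word-level graph encoder

attn1 = CrossModalAttention(dim=512)  # first attention layer
attn2 = CrossModalAttention(dim=512)  # second attention layer

context1 = attn1(spatial=char_graph_vectors, temporal=time_domain_vectors)  # first decoding vectors
context2 = attn2(spatial=word_graph_vectors, temporal=context1)             # second decoding vectors
print(context2.shape)  # torch.Size([1, 6, 512])
```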
Optionally, in step S50, the obtaining a speech spectrogram sequence according to the second decoding vector to generate the synthesized speech of the text to be synthesized includes:
s501, generating a voice Mel spectrum according to the second decoding vector;
s502, performing spectrum conversion on the voice Mel spectrum to obtain the synthesized voice of the text to be synthesized.
Understandably, the speech mel spectrum contains time-frequency information, perception-related amplitude information and perception-related frequency information. The decoding vectors are optimized according to the characteristic that human hearing is sensitive to low-frequency sounds and insensitive to high-frequency sounds, so as to generate the speech mel spectrum. In an embodiment, the decoder obtains a plurality of second decoding vectors after cross-modal attention calculation over a plurality of time steps, optimizes the second decoding vectors through the post-processing network to obtain speech mel spectrum features, and generates the speech mel spectrum from these features; after the speech mel spectrum corresponding to the text to be synthesized is generated, WaveNet performs spectrum conversion on it, inverse-transforming the mel spectrum features into waveform samples to generate the synthesized speech corresponding to the text to be synthesized.
In this embodiment, performing spectrum conversion on the speech mel spectrum to generate the synthesized speech corresponding to the text to be synthesized ensures the naturalness and accuracy of the synthesized speech.
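The final inversion step can be sketched with Griffin-Lim via librosa as a stand-in for the WaveNet vocoder named in the embodiment; the mel spectrum here is random dummy data and the parameters are illustrative.

```python
import numpy as np
import librosa
import soundfile as sf

# Hypothetical mel spectrum produced by the decoder and post-processing network:
# shape (n_mels, frames), non-negative power values (random dummy data here).
mel = np.abs(np.random.randn(80, 200)).astype(np.float32)

# Griffin-Lim inversion via librosa, standing in for the WaveNet vocoder.
waveform = librosa.feature.inverse.mel_to_audio(
    mel, sr=22050, n_fft=1024, hop_length=256, n_iter=32
)
sf.write("synthesized.wav", waveform, 22050)
```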
FIG. 2 is a schematic flow diagram of the multi-modal speech synthesis method: the text to be synthesized is input; three independent encoder recurrent neural networks (character embedding to character encoder, character-level graph embedding to character-level graph encoder, and word-level graph embedding to word-level graph encoder) perform the encoding to obtain the character-level spatial domain coding vector, the word-level spatial domain coding vector and the time domain coding vector; the two attention layers of the cross-modal attention decoder perform the decoding to obtain the mel spectrogram; and the synthesized speech is generated from the mel spectrogram. The processing of the two spatial domains is auxiliary: the two spatial domain modalities assist the attention calculation of the time domain modality in the cross-modal attention decoder, so that the time domain coding vector can receive prosodic information from the two spatial domain coding vectors and perform spatial feature combination.
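Putting the pieces together, the flow of FIG. 2 can be sketched end to end as below; nn.MultiheadAttention stands in for the cross-modal attention layers, the inputs are assumed to be already-embedded sequences, and all names and dimensions are illustrative rather than the patent's exact architecture.

```python
import torch
import torch.nn as nn

class MultiModalTTS(nn.Module):
    """End-to-end sketch of the FIG. 2 flow: three encoders, two attention layers
    and a post-processing projection that predicts mel spectrum frames."""
    def __init__(self, dim: int = 512, n_mels: int = 80):
        super().__init__()
        self.char_encoder = nn.GRU(dim, dim, batch_first=True)        # character embedding branch
        self.char_graph_encoder = nn.GRU(dim, dim, batch_first=True)  # character-level graph branch
        self.word_graph_encoder = nn.GRU(dim, dim, batch_first=True)  # word-level graph branch
        self.attn1 = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.attn2 = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.postnet = nn.Linear(dim, n_mels)                         # stand-in post-processing network

    def forward(self, char_seq, char_graph_seq, word_graph_seq):
        # Inputs are assumed to be already-embedded sequences of shape (batch, len, dim).
        time_vec, _ = self.char_encoder(char_seq)             # time domain coding vectors
        space1, _ = self.char_graph_encoder(char_graph_seq)   # first spatial domain coding vectors
        space2, _ = self.word_graph_encoder(word_graph_seq)   # second spatial domain coding vectors
        context1, _ = self.attn1(query=space1, key=time_vec, value=time_vec)
        context2, _ = self.attn2(query=space2, key=context1, value=context1)
        return self.postnet(context2)                          # predicted mel spectrum frames

model = MultiModalTTS()
mel = model(torch.randn(1, 20, 512), torch.randn(1, 20, 512), torch.randn(1, 6, 512))
print(mel.shape)  # torch.Size([1, 6, 80])
```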
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
In an embodiment, a multi-modal speech synthesis apparatus is provided, which corresponds one-to-one to the multi-modal speech synthesis method in the above embodiment. As shown in fig. 3, the multi-modal speech synthesis apparatus includes a preprocessing module 10, an encoding processing module 20, a first attention calculation module 30, a second attention calculation module 40 and a synthesized speech generation module 50. The functional modules are described in detail as follows:
the preprocessing module 10 is used for preprocessing the text to be synthesized to obtain input sequence information; the input sequence information comprises character sequence information, character-level graph sequence information and word-level graph sequence information;
the encoding processing module 20 is configured to encode the character sequence information to obtain a time domain coding vector; encode the character-level graph sequence information to obtain a first spatial domain coding vector; and encode the word-level graph sequence information to obtain a second spatial domain coding vector;
a first attention computation module 30, configured to perform a first cross-modal attention computation on the time domain encoded vector and the first spatial domain encoded vector, to obtain a first decoded vector;
a second attention computation module 40, configured to perform a second cross-modal attention computation on the first decoding vector and the second spatial domain coding vector, to obtain a second decoding vector;
a synthetic speech generating module 50, configured to obtain a speech spectrogram according to the second decoding vector, so as to generate a synthetic speech of the text to be synthesized.
Optionally, the preprocessing module 10 includes:
the first character feature data extraction unit is used for extracting character feature data of the text to be synthesized;
and the character sequence information generating unit is used for carrying out phoneme embedding on the character characteristic data to generate character sequence information.
Optionally, the preprocessing module 10 further includes:
the second character feature data extraction unit is used for extracting character feature data of the text to be synthesized;
the character characteristic data graph embedding unit is used for performing graph embedding on the character characteristic data to obtain character-level graph node information and character-level graph edge information;
and the character-level graph sequence information generating unit is used for generating the character-level graph sequence information according to the character-level graph node information and the character-level graph edge information.
Optionally, the preprocessing module 10 further includes:
the word characteristic data extraction unit is used for extracting word characteristic data of the text to be synthesized;
the word characteristic data graph embedding unit is used for performing graph embedding on the word characteristic data to obtain word-level graph node information and word-level graph edge information;
and the word-level graph sequence information generating unit is used for generating the word-level graph sequence information according to the word-level graph node information and the word-level graph edge information.
Optionally, the first attention computing module 30 includes:
a first spatial domain coding vector processing unit, configured to obtain a first query spatial vector corresponding to the first spatial domain coding vector;
a time domain coding vector processing unit, configured to obtain a first key space vector and a first value space vector corresponding to the time domain coding vector;
a first attention weight calculation unit configured to calculate a first attention weight according to the first query space vector and the first key space vector;
and the first decoding vector calculation unit is used for carrying out weighted calculation on the first value space vector according to the first attention weight value to obtain the first decoding vector.
Optionally, the second attention computing module 40 includes:
a second spatial domain coding vector processing unit, configured to obtain a second query spatial vector corresponding to the second spatial domain coding vector;
a first decoding vector processing unit, configured to obtain a second key space vector and a second value space vector corresponding to the first decoding vector;
a second attention weight calculation unit configured to calculate a second attention weight according to the second query space vector and the second key space vector;
and the second decoding vector calculation unit is used for carrying out weighted calculation on the second value space vector according to the second attention weight value to obtain the second decoding vector.
Optionally, the synthetic speech generating module 50 includes:
a voice mel-spectrum generating unit, configured to generate a voice mel-spectrum according to the second decoding vector;
and the voice synthesis unit is used for carrying out spectrum conversion on the voice Mel spectrum to obtain the synthesized voice of the text to be synthesized.
For specific limitations of the multi-modal based speech synthesis apparatus, reference may be made to the above limitations of the multi-modal based speech synthesis method, which are not repeated here. The above-described modules in the multi-modal based speech synthesis apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a server, whose internal structure may be as shown in fig. 4. The computer device includes a processor, a memory, a network interface and a database connected by a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a readable storage medium and an internal memory. The readable storage medium stores an operating system, computer readable instructions and a database. The internal memory provides an environment for the execution of the operating system and the computer readable instructions in the readable storage medium. The database of the computer device is used for storing data related to the multi-modal speech synthesis method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer readable instructions, when executed by the processor, implement a multi-modal speech synthesis method. The readable storage media provided by the present embodiment include non-volatile readable storage media and volatile readable storage media.
In one embodiment, a computer device is provided that includes a memory, a processor, and computer readable instructions stored on the memory and executable on the processor, when executing the computer readable instructions, performing the steps of:
preprocessing a text to be synthesized to obtain input sequence information; the input sequence information comprises character sequence information, character-level graph sequence information and word-level graph sequence information;
coding the character sequence information to obtain a time domain coding vector; coding the character-level graph sequence information to obtain a first spatial domain coding vector; coding the word-level graph sequence information to obtain a second spatial domain coding vector;
performing first cross-modal attention calculation on the time domain coding vector and the first spatial domain coding vector to obtain a first decoding vector;
performing second cross-modal attention calculation on the first decoding vector and the second spatial domain coding vector to obtain a second decoding vector;
and obtaining a voice spectrogram according to the second decoding vector so as to generate the synthetic voice of the text to be synthesized.
In one embodiment, one or more computer-readable storage media are provided having computer-readable instructions stored thereon, the readable storage media provided by the present embodiment including non-volatile readable storage media and volatile readable storage media. The readable storage medium has stored thereon computer readable instructions which when executed by one or more processors perform the steps of:
preprocessing a text to be synthesized to obtain input sequence information; the input sequence information comprises character sequence information, character-level graph sequence information and word-level graph sequence information;
coding the character sequence information to obtain a time domain coding vector; coding the character-level graph sequence information to obtain a first spatial domain coding vector; coding the word-level graph sequence information to obtain a second spatial domain coding vector;
performing first cross-modal attention calculation on the time domain coding vector and the first spatial domain coding vector to obtain a first decoding vector;
performing second cross-modal attention calculation on the first decoding vector and the second spatial domain coding vector to obtain a second decoding vector;
and obtaining a voice spectrogram according to the second decoding vector so as to generate the synthetic voice of the text to be synthesized.
Those skilled in the art will appreciate that all or part of the processes of the above embodiment methods may be implemented by instructing the relevant hardware through computer readable instructions stored on a non-volatile or volatile readable storage medium; when executed, the instructions may include the processes of the above method embodiments. Any reference to memory, storage, database or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include Read Only Memory (ROM), Programmable ROM (PROM), Electrically Programmable ROM (EPROM), Electrically Erasable Programmable ROM (EEPROM) or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), Direct Rambus Dynamic RAM (DRDRAM) and Rambus Dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (10)

1. A multi-modal based speech synthesis method, comprising:
preprocessing a text to be synthesized to obtain input sequence information; the input sequence information comprises character sequence information, character-level graph sequence information and word-level graph sequence information;
coding the character sequence information to obtain a time domain coding vector; coding the character-level graph sequence information to obtain a first spatial domain coding vector; coding the word-level graph sequence information to obtain a second spatial domain coding vector;
performing first cross-modal attention calculation on the time domain coding vector and the first spatial domain coding vector to obtain a first decoding vector;
performing second cross-modal attention calculation on the first decoding vector and the second spatial domain coding vector to obtain a second decoding vector;
and obtaining a voice spectrogram according to the second decoding vector so as to generate the synthetic voice of the text to be synthesized.
2. The multi-modal based speech synthesis method of claim 1, wherein preprocessing the text to be synthesized to obtain input sequence information includes:
extracting character characteristic data of the text to be synthesized;
and performing phoneme embedding on the character characteristic data to generate character sequence information.
3. The multi-modal based speech synthesis method of claim 1 wherein the preprocessing of the text to be synthesized to obtain input sequence information further comprises:
extracting character characteristic data of the text to be synthesized;
performing graph embedding on the character characteristic data to obtain character-level graph node information and character-level graph edge information;
and generating the character-level graph sequence information according to the character-level graph node information and the character-level graph edge information.
4. The multi-modal based speech synthesis method of claim 1 wherein the preprocessing of the text to be synthesized to obtain input sequence information further comprises:
extracting word characteristic data of the text to be synthesized;
performing graph embedding on the word characteristic data to obtain word-level graph node information and word-level graph edge information;
and generating the word-level graph sequence information according to the word-level graph node information and the word-level graph edge information.
5. The multi-modal based speech synthesis method of claim 1, wherein the performing a first cross-modal attention calculation on the time-domain encoded vector and the first spatial-domain encoded vector to obtain a first decoded vector comprises:
acquiring a first query space vector corresponding to the first space domain coding vector;
acquiring a first key space vector and a first value space vector corresponding to the time domain coding vector;
calculating a first attention weight according to the first query space vector and the first key space vector;
and carrying out weighted calculation on the first value space vector according to the first attention weight value to obtain the first decoding vector.
6. The multi-modal based speech synthesis method of claim 1, wherein the performing a second cross-modal attention computation on the first decoded vector and the second spatial domain encoded vector to obtain a second decoded vector comprises:
acquiring a second query space vector corresponding to the second space domain coding vector;
acquiring a second key space vector and a second value space vector corresponding to the first decoding vector;
calculating a second attention weight according to the second query space vector and the second key space vector;
and carrying out weighted calculation on the second value space vector according to the second attention weight value to obtain the second decoding vector.
7. The multi-modal based speech synthesis method of claim 1, wherein the obtaining a sequence of speech spectrograms from the second decoding vector to generate the synthesized speech of the text to be synthesized comprises:
generating a voice mel spectrum according to the second decoding vector;
and carrying out spectrum conversion on the voice Mel spectrum to obtain the synthesized voice of the text to be synthesized.
8. A multi-modal based speech synthesis apparatus comprising:
the preprocessing module is used for preprocessing the text to be synthesized to obtain input sequence information; the input sequence information comprises character sequence information, character-level graph sequence information and word-level graph sequence information;
the coding processing module is used for carrying out coding processing on the character sequence information to obtain a time domain coding vector; coding the character-level graph sequence information to obtain a first spatial domain coding vector; coding the word-level graph sequence information to obtain a second spatial domain coding vector;
the first attention calculating module is used for carrying out first cross-modal attention calculation on the time domain coding vector and the first space domain coding vector to obtain a first decoding vector;
the second attention calculating module is used for carrying out second cross-modal attention calculation on the first decoding vector and the second spatial domain coding vector to obtain a second decoding vector;
and the synthetic voice generating module is used for obtaining a voice spectrogram according to the second decoding vector so as to generate the synthetic voice of the text to be synthesized.
9. A computer device comprising a memory, a processor, and computer readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer readable instructions, implements the multimodal-based speech synthesis method of any of claims 1 to 7.
10. A computer-readable storage medium storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the multi-modal based speech synthesis method of any one of claims 1-7.
CN202310688242.3A 2023-06-09 2023-06-09 Multi-mode-based speech synthesis method, device, equipment and storage medium Pending CN116778907A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310688242.3A CN116778907A (en) 2023-06-09 2023-06-09 Multi-mode-based speech synthesis method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310688242.3A CN116778907A (en) 2023-06-09 2023-06-09 Multi-mode-based speech synthesis method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116778907A 2023-09-19

Family

ID=88009225

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310688242.3A Pending CN116778907A (en) 2023-06-09 2023-06-09 Multi-mode-based speech synthesis method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116778907A (en)

Similar Documents

Publication Publication Date Title
JP7464621B2 (en) Speech synthesis method, device, and computer-readable storage medium
CN111754976B (en) Rhythm control voice synthesis method, system and electronic device
CN112687259B (en) Speech synthesis method, device and readable storage medium
WO2020118521A1 (en) Multi-speaker neural text-to-speech synthesis
EP4029010B1 (en) Neural text-to-speech synthesis with multi-level context features
CN116364055B (en) Speech generation method, device, equipment and medium based on pre-training language model
CN115485766A (en) Speech synthesis prosody using BERT models
US20220246132A1 (en) Generating Diverse and Natural Text-To-Speech Samples
CN114207706A (en) Generating acoustic sequences via neural networks using combined prosodic information
CN116129863A (en) Training method of voice synthesis model, voice synthesis method and related device
CN113053357B (en) Speech synthesis method, apparatus, device and computer readable storage medium
WO2021212954A1 (en) Method and apparatus for synthesizing emotional speech of specific speaker with extremely few resources
CN111710326A (en) English voice synthesis method and system, electronic equipment and storage medium
WO2022121179A1 (en) Speech synthesis method and apparatus, device, and storage medium
KR102639322B1 (en) Voice synthesis system and method capable of duplicating tone and prosody styles in real time
CN114360493A (en) Speech synthesis method, apparatus, medium, computer device and program product
CN113823259B (en) Method and device for converting text data into phoneme sequence
CN116863912A (en) Speech synthesis method, device, equipment and medium
CN112185340A (en) Speech synthesis method, speech synthesis device, storage medium and electronic apparatus
CN115424604B (en) Training method of voice synthesis model based on countermeasure generation network
CN116844522A (en) Phonetic boundary label marking method and speech synthesis method
CN115359780A (en) Speech synthesis method, apparatus, computer device and storage medium
CN116778907A (en) Multi-mode-based speech synthesis method, device, equipment and storage medium
Eirini End-to-End Neural based Greek Text-to-Speech Synthesis
CN114267330A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination