CN108447486A - Speech translation method and device - Google Patents

Speech translation method and device

Info

Publication number
CN108447486A
CN108447486A (application CN201810167142.5A)
Authority
CN
China
Prior art keywords
unit
voice
text
sample
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810167142.5A
Other languages
Chinese (zh)
Other versions
CN108447486B (en)
Inventor
王雨蒙
徐伟
江源
胡国平
胡郁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd
Priority to CN201810167142.5A (granted as CN108447486B)
Priority to PCT/CN2018/095766 (published as WO2019165748A1)
Publication of CN108447486A
Application granted
Publication of CN108447486B
Legal status: Active
Anticipated expiration

Classifications

    • G10L 15/26: Speech to text systems (under G10L 15/00 Speech recognition; G10L Speech analysis or synthesis, speech recognition, speech or voice processing, speech or audio coding or decoding; G Physics)
    • G06F 40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation (under G06F 40/40 Processing or translation of natural language; G06F Electric digital data processing; G Physics)
    • G10L 13/02: Methods for producing synthetic speech; speech synthesisers (under G10L 13/00 Speech synthesis; text-to-speech systems)
    • G10L 13/08: Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme-to-phoneme translation, prosody generation or stress or intonation determination (under G10L 13/00)
    • G10L 13/033: Voice editing, e.g. manipulating the voice of the synthesiser (under G10L 13/00, G10L 13/02)

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

This application discloses a speech translation method and device. The method includes: after a first target speech of a source speaker is obtained, generating a second target speech by performing speech translation on the first target speech, where the language of the second target speech is different from the language of the first target speech, and the second target speech carries the timbre characteristics of the source speaker. It can be seen that, because the timbre characteristics of the source speaker are taken into account when speech translation is performed on the source speaker's speech, the translated speech also has the timbre characteristics of the source speaker, so that the translated speech sounds more like speech spoken directly by the source speaker.

Description

Speech translation method and device
Technical field
This application relates to the field of computer technology, and in particular to a speech translation method and device.
Background technology
As artificial intelligence technology becomes increasingly mature, people increasingly seek to solve problems with intelligent technology. For example, in the past a person had to spend a great deal of time learning a new language before being able to communicate with people whose mother tongue it is; now, a person can simply speak into a translator, which uses speech recognition, machine translation and speech synthesis to pronounce the meaning of what was said in the other language.
However, in current speech translation technology, after the speech of the source speaker is translated, the resulting translated speech carries entirely the timbre characteristics of the speaker used in the speech synthesis model; to the listener, it is the timbre of a speaker completely different from the source speaker.
Summary of the invention
The main purpose of the embodiments of the present application is to provide a speech translation method and device that give the translated speech the timbre characteristics of the source speaker when the source speaker's speech is translated.
An embodiment of the present application provides a speech translation method, including:
obtaining a first target speech of a source speaker;
generating a second target speech by performing speech translation on the first target speech, where the language of the second target speech is different from the language of the first target speech, and the second target speech carries the timbre characteristics of the source speaker.
Optionally, the generating a second target speech by performing speech translation on the first target speech includes:
generating a speech recognition text by performing speech recognition on the first target speech;
generating a translation text by performing text translation on the speech recognition text;
generating the second target speech by performing speech synthesis on the translation text.
Optionally, the generating the second target speech by performing speech synthesis on the translation text includes:
segmenting the translation text into text units of a preset size to obtain target text units;
obtaining acoustic parameters of each target text unit, where the acoustic parameters carry the timbre characteristics of the source speaker;
performing speech synthesis on the translation text according to the acoustic parameters of each target text unit to generate the second target speech.
Optionally, the method further includes:
obtaining a first sample speech of the source speaker, where the language of the first sample speech is the same as the language of the second target speech;
segmenting a recognition text of the first sample speech into text units of the preset size to obtain first sample text units;
extracting, from the first sample speech, a first speech segment corresponding to each first sample text unit;
extracting acoustic parameters from the first speech segment;
building a first acoustic model using each first sample text unit and the acoustic parameters corresponding to the first sample text unit.
In this case, the obtaining acoustic parameters of each target text unit includes:
obtaining the acoustic parameters of each target text unit using the first acoustic model.
Optionally, the method further includes:
obtaining a second sample speech of the source speaker, where the language of the second sample speech is different from the language of the second target speech;
segmenting a recognition text of the second sample speech into text units of the preset size to obtain second sample text units;
converting each second sample text unit to obtain a first converted text unit, where the first converted text unit is a text unit used by the language of the second target speech;
extracting, from the second sample speech, a second speech segment corresponding to each second sample text unit;
extracting acoustic parameters from the second speech segment to obtain acoustic parameters corresponding to the first converted text unit;
building a second acoustic model using each second sample text unit, the first converted text unit corresponding to the second sample text unit, and the acoustic parameters corresponding to the first converted text unit.
In this case, the obtaining acoustic parameters of each target text unit includes:
obtaining the acoustic parameters of each target text unit using the second acoustic model.
Optionally, the method further includes:
collecting multiple first sample texts, where the language of the first sample texts is the same as the language of the second sample speech;
segmenting the first sample texts into text units of the preset size to obtain third sample text units;
converting each third sample text unit to obtain a second converted text unit, where the second converted text unit is the text unit with which the third sample text unit is pronounced in the pronunciation manner of the second target speech.
In this case, the converting each second sample text unit to obtain a first converted text unit includes:
determining a third sample text unit identical to the second sample text unit;
using the second converted text unit corresponding to the determined third sample text unit as the first converted text unit.
Optionally, the method further includes:
collecting multiple second sample texts, where the language of the second sample texts is the same as the language of the second sample speech;
segmenting the second sample texts into text units of the preset size to obtain fourth sample text units;
converting each fourth sample text unit to obtain a third converted text unit, where the third converted text unit is the text unit with which the fourth sample text unit is pronounced in the pronunciation manner of the second target speech;
building an encoding-decoding model for the syllables in the second sample texts by learning the combination and order relations, within the corresponding syllable, of the fourth sample text units belonging to a single syllable, the combination and order relations, within the second sample text, of at least two consecutive syllables, and the combination and order relations, within the second sample text, of the fourth sample text units in the at least two consecutive syllables.
In this case, the converting each second sample text unit to obtain a first converted text unit includes:
converting the second sample text unit using the encoding-decoding model to obtain the first converted text unit.
An embodiment of the present application further provides a speech translation device, including:
a voice acquisition unit, configured to obtain a first target speech of a source speaker;
a speech translation unit, configured to generate a second target speech by performing speech translation on the first target speech, where the language of the second target speech is different from the language of the first target speech, and the second target speech carries the timbre characteristics of the source speaker.
An embodiment of the present application further provides a speech translation device, including a processor, a memory and a system bus;
the processor and the memory are connected by the system bus;
the memory is configured to store one or more programs, the one or more programs including instructions that, when executed by the processor, cause the processor to perform any one of the methods described above.
An embodiment of the present application further provides a computer-readable storage medium including instructions that, when run on a computer, cause the computer to perform any one of the methods described above.
According to the speech translation method and device provided in the embodiments of the present application, after a first target speech of a source speaker is obtained, a second target speech is generated by performing speech translation on the first target speech, where the language of the second target speech is different from the language of the first target speech, and the second target speech carries the timbre characteristics of the source speaker. It can be seen that, because the timbre characteristics of the source speaker are taken into account when the source speaker's speech is translated, the translated speech also has the timbre characteristics of the source speaker, so that the translated speech sounds more like speech spoken directly by the source speaker.
Description of the drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Apparently, the accompanying drawings in the following description show merely some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from them without creative efforts.
Fig. 1 is a first flow diagram of a speech translation method according to an embodiment of the present application;
Fig. 2 is a second flow diagram of a speech translation method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a speech synthesis model according to an embodiment of the present application;
Fig. 4 is a first flow diagram of an acoustic model construction method according to an embodiment of the present application;
Fig. 5 is a second flow diagram of an acoustic model construction method according to an embodiment of the present application;
Fig. 6 is a flow diagram of a sample text unit collection method according to an embodiment of the present application;
Fig. 7 is a schematic diagram of the relationship between phoneme sequences according to an embodiment of the present application;
Fig. 8 is a flow diagram of an encoding-decoding model construction method according to an embodiment of the present application;
Fig. 9 is a schematic diagram of an encoding process according to an embodiment of the present application;
Fig. 10 is a schematic structural diagram of a speech translation device according to an embodiment of the present application;
Fig. 11 is a schematic diagram of the hardware architecture of a speech translation device according to an embodiment of the present application.
Detailed description of embodiments
In current speech translation technology, after the speech of a source speaker is translated, the resulting translated speech carries only the timbre characteristics of the speaker used in the synthesis model; to the listener it sounds like an entirely different speaker from the source speaker, that is, one person appears to be speaking while another person delivers the translation, producing the effect of two different voices.
To this end, the embodiments of the present application provide a speech translation method and device. When the speech of the source speaker is to be translated, that is, when the speech of the source speaker needs to be translated into another language, the speech translation is carried out with a speech synthesis model that belongs to the source speaker, so that the translated speech has the timbre characteristics of the source speaker. The translated speech therefore sounds more like speech spoken directly by the source speaker, which improves the user experience.
To make the objectives, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Apparently, the described embodiments are merely some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative efforts shall fall within the protection scope of the present application.
First embodiment
Referring to Fig. 1, which is a flow diagram of a speech translation method provided in this embodiment, the method includes the following steps:
S101: Obtain a first target speech of a source speaker.
For ease of distinction, this embodiment defines the speech to be translated, that is, the speech before translation, as the first target speech, and defines the speaker who utters the first target speech as the source speaker.
This embodiment does not limit the source of the first target speech. For example, the first target speech may be someone's live speech or recorded speech, or it may be special-effect speech obtained by applying machine processing to that live speech or recorded speech.
This embodiment also does not limit the length of the first target speech. For example, the first target speech may be a word, a sentence or a paragraph.
S102: Generate a second target speech by performing speech translation on the first target speech, where the language of the second target speech is different from the language of the first target speech, and the second target speech carries the timbre characteristics of the source speaker.
For ease of distinction, this embodiment defines the speech obtained by translating the first target speech as the second target speech. It should be noted that, when the first target speech is the above-mentioned machine-processed special-effect speech, the same special-effect processing also needs to be applied to the second target speech obtained after translation.
This embodiment does not limit the language types of the first target speech and the second target speech, as long as the languages of the first target speech and the second target speech are different and the two speeches are equivalent in meaning. For example, the first target speech is the Chinese "你好" and the second target speech is the English "hello"; or the first target speech is the English "hello" and the second target speech is the Chinese "你好".
In practical applications, a user such as the source speaker may preset on the translator the language to be used after translation; after the translator obtains the first target speech of the source speaker, it can perform speech translation so that the translated second target speech is in the preset language.
In this embodiment, the timbre characteristics of the source speaker may be collected in advance and used to build a speech synthesis model that belongs to the source speaker. On this basis, when speech translation is performed on the first target speech of the source speaker, the speech synthesis model belonging to the source speaker may be used, so that the translated second target speech is endowed with the timbre characteristics of the source speaker. This timbre-adaptive approach makes the listener feel, in terms of hearing, that the second target speech has the speaking effect of the source speaker; that is, the speech before translation and the speech after translation are the same or similar in timbre.
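As an illustration only, the overall flow described above can be summarized in the following minimal Python sketch; the helper callables (asr, translator, speaker_tts) are hypothetical stand-ins and are not APIs defined in this application.

```python
# Minimal sketch of the timbre-preserving translation flow; all names are
# illustrative stand-ins, not APIs defined in this application.
def translate_speech(first_target_speech, asr, translator, speaker_tts):
    recognition_text = asr(first_target_speech)       # speech recognition
    translation_text = translator(recognition_text)   # text translation
    # synthesis with a model built from the source speaker's own recordings,
    # so the output carries the source speaker's timbre characteristics
    return speaker_tts(translation_text)

# Toy usage with stand-in callables.
second_target_speech = translate_speech(
    b"...recorded waveform...",
    asr=lambda wav: "你好",
    translator=lambda text: "hello",
    speaker_tts=lambda text: b"...synthesized waveform in the speaker's timbre...",
)
```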
To sum up, according to the speech translation method provided in this embodiment, after a first target speech of a source speaker is obtained, a second target speech is generated by performing speech translation on the first target speech, where the language of the second target speech is different from the language of the first target speech, and the second target speech carries the timbre characteristics of the source speaker. It can be seen that, because the timbre characteristics of the source speaker are taken into account when the source speaker's speech is translated, the translated speech also has the timbre characteristics of the source speaker, so that the translated speech sounds more like speech spoken directly by the source speaker.
Second embodiment
With reference to the accompanying drawings, this embodiment describes, through the following S202-S204, a specific implementation of S102 in the first embodiment.
Referring to Fig. 2, which is a flow diagram of a speech translation method provided in this embodiment, the method includes the following steps:
S201: Obtain a first target speech of a source speaker.
It should be noted that S201 in this embodiment is the same as S101 in the first embodiment; for the related description, refer to the first embodiment, and details are not repeated here.
S202: Generate a speech recognition text by performing speech recognition on the first target speech.
After the first target speech is obtained, it is converted into a speech recognition text by a speech recognition technique, for example a speech recognition technique based on artificial neural networks.
For example, if the first target speech is the Chinese speech "你好", performing speech recognition on it yields the Chinese text "你好".
S203: Generate a translation text by performing text translation on the speech recognition text.
For example, assume the language before translation is Chinese and the language after translation is set to English. The speech recognition text is then a Chinese text, and an English translation text can be obtained by passing the Chinese text through a Chinese-to-English translation model; for instance, performing text translation on the Chinese text "你好" yields the English text "hello".
S204: Generate the second target speech by performing speech synthesis on the translation text, where the language of the second target speech is different from the language of the first target speech, and the second target speech carries the timbre characteristics of the source speaker.
In the current state of speech translation, the difference in timbre between the speech before translation and the speech after translation is very noticeable. To overcome this defect, this embodiment may model the speech acoustic parameters of the source speaker in advance to obtain a speech synthesis model that belongs to the source speaker. In this way, when the translation text is synthesized into speech, this speech synthesis model can be used so that the translated speech, that is, the second target speech, has the timbre characteristics of the source speaker, achieving the auditory effect of the source speaker speaking and translating by himself or herself. For example, if the translation text is the English text "hello", the translated speech, that is, the second target speech, is the English speech "hello".
Specifically, the speech synthesis model may include an acoustic model and a duration model, as shown in the speech synthesis model schematic diagram of Fig. 3.
After the translation text of the first target speech is obtained, text analysis is first performed on the translation text to determine each syllable in the translation text and to obtain the phonemes that make up each syllable. The phoneme information is then input into the acoustic model shown in Fig. 3, so that the acoustic model determines and outputs the acoustic parameters of each phoneme; the acoustic parameters carry the timbre characteristics of the source speaker and may include parameters such as spectrum and fundamental frequency. In addition, the phoneme information is also input into the duration model shown in Fig. 3, so that the duration model outputs duration parameters. This embodiment does not limit how the duration parameters are determined; as an example, the speaking rate of the first target speech may be determined, or a default speaking rate may be used, and the time it would take to read the translation text at that rate is calculated and used as the duration parameter.
Next, the speech synthesis model uses the acoustic parameters output by the acoustic model so that each phoneme in the translation text is pronounced according to its corresponding acoustic parameters, and also uses the duration parameters output by the duration model so that the pronunciation follows the specified duration, thereby synthesizing translated speech that carries the timbre characteristics of the source speaker, that is, the second target speech.
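The cooperation of the acoustic model and the duration model of Fig. 3 can be sketched as follows; the function names, the per-phoneme parameter format and the default speaking rate are assumptions made for illustration, not details fixed by this application.

```python
# Illustrative sketch of the Fig. 3 synthesis pipeline (assumed interfaces).
def synthesize_second_target_speech(translation_text, text_analyzer,
                                    acoustic_model, duration_model, vocoder):
    # Text analysis: translation text -> syllables -> phoneme sequence.
    phonemes = text_analyzer(translation_text)
    # Acoustic model: per-phoneme spectrum / fundamental-frequency parameters
    # carrying the source speaker's timbre characteristics.
    acoustic_params = [acoustic_model(p) for p in phonemes]
    # Duration model: how long each phoneme (or the whole text) should last.
    durations = duration_model(phonemes)
    # Each phoneme is pronounced with its own parameters for the given duration.
    return vocoder(acoustic_params, durations)

# One simple duration rule mentioned above: reading time at a default speaking
# rate (12 phonemes per second is an arbitrary example value).
default_rate = 12.0
duration_model = lambda phonemes: len(phonemes) / default_rate
```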
In one implementation of this embodiment, S204 may be realized in the following manner, which may specifically include the following steps:
Step A: Segment the translation text into text units of a preset size to obtain target text units.
The translation text is divided into text units of the preset size. For example, when the translation text is a Chinese text, it may be divided in units such as phonemes, characters or words; when the translation text is an English text, it may be divided in units such as phonemes or words. For ease of distinction, this embodiment defines each text unit divided from the translation text as a target text unit.
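For illustration, Step A for an English translation text might look like the sketch below, where the tiny pronunciation lexicon is a hypothetical stand-in for a real grapheme-to-phoneme resource.

```python
# Step A sketch: cut the translation text into phoneme-sized target text units.
# The lexicon is a toy stand-in for a real pronunciation dictionary.
LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

def cut_into_target_text_units(translation_text):
    units = []
    for word in translation_text.lower().split():
        units.extend(LEXICON.get(word, [word]))  # fall back to the word itself
    return units

print(cut_into_target_text_units("hello world"))
# ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```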
Step B: Obtain acoustic parameters of each target text unit, where the acoustic parameters carry the timbre characteristics of the source speaker.
This embodiment may use the acoustic model shown in Fig. 3 to obtain the acoustic parameters of each target text unit. Because this acoustic model belongs to the source speaker, the acoustic parameters obtained with it carry the timbre characteristics of the source speaker.
It should be noted that the construction method of the acoustic model shown in Fig. 3, and how to use this acoustic model to obtain the acoustic parameters of the target text units, will be described in detail in the subsequent third embodiment.
Step C: Perform speech synthesis on the translation text according to the acoustic parameters of each target text unit to generate the second target speech.
After the acoustic parameters of each target text unit in the translation text, for example parameters such as spectrum and fundamental frequency, have been obtained through Step B, the speech synthesis model shown in Fig. 3 can make each target text unit pronounced according to its corresponding acoustic parameters, so as to synthesize the translation text into a second target speech that carries the timbre characteristics of the source speaker.
To sum up, according to the speech translation method provided in this embodiment, after a first target speech of a source speaker is obtained, text translation is performed on the speech recognition text of the first target speech, and speech synthesis is then performed by obtaining the acoustic parameters of each text unit in the translation text, generating the second target speech. Because the acoustic parameters carry the timbre characteristics of the source speaker, the translated speech also has the timbre characteristics of the source speaker, so that the translated speech sounds more like speech spoken directly by the source speaker.
Third embodiment
This embodiment describes the construction method of the acoustic model used in the second embodiment, as well as the specific implementation of Step B in the second embodiment, that is, how to obtain the acoustic parameters of the target text units using the acoustic model.
In this embodiment, when the source speaker uses the translator for the first time, he or she may be prompted to make a recording according to instructions, which is used to build the acoustic model. The recording content is optional, and the source speaker can choose the language according to his or her own reading ability; that is, the recording language chosen by the source speaker may be the same as or different from the language of the translated speech (i.e. the second target speech). This embodiment describes the acoustic model construction method separately for these two language selection results.
In the first acoustic model construction method, the recording language chosen by the source speaker is the same as the language of the translated speech (i.e. the second target speech). This model construction method is described in detail below.
Referring to Fig. 4, which is a flow diagram of an acoustic model construction method provided in this embodiment, the method includes the following steps:
S401: Obtain a first sample speech of the source speaker, where the language of the first sample speech is the same as the language of the second target speech.
In this embodiment, in order to make the translated speech, that is, the second target speech, pronounced with the timbre characteristics of the source speaker, a recording of the source speaker may be obtained. This recording may be in the same language as the translated speech, and the text corresponding to this recording should cover all the phoneme content of that language as far as possible.
For ease of distinction, this embodiment defines this recording as the first sample speech.
Take the case where the speech before translation, that is, the first target speech, is Chinese speech and the translated speech, that is, the second target speech, is English speech as an example. First, it is confirmed whether the source speaker has the ability to read English aloud normally; for example, the translator may ask the source speaker whether he or she can read English aloud. If the source speaker replies "I can read English aloud" by voice, button or another form, the translator may provide a small amount of fixed English text and prompt the source speaker to read it aloud; the fixed English text covers all English phonemes as far as possible. The source speaker reads the fixed English text aloud, so that the translator obtains the speech of this fixed English text, and this speech is the first sample speech.
S402: Segment the recognition text of the first sample speech into text units of the preset size to obtain first sample text units.
After the first sample speech is obtained, it is converted into a speech recognition text by a speech recognition technique, for example a speech recognition technique based on artificial neural networks. The speech recognition text is then divided into text units of the preset size (the same dividing unit as in Step A of the second embodiment), for example in units of phonemes. For ease of distinction, this embodiment defines each text unit divided from this speech recognition text as a first sample text unit.
S403: Extract, from the first sample speech, a first speech segment corresponding to each first sample text unit, and extract acoustic parameters from the first speech segment.
The first sample speech is divided according to the text segmentation applied to the recognition text of the first sample speech; in this way, the speech segment corresponding to each first sample text unit in the first sample speech can be determined. For example, both the recognition text of the first sample speech and the first sample speech are divided in units of phonemes, so as to obtain the speech segment corresponding to each phoneme in the recognition text. For ease of distinction, this embodiment defines the speech segment corresponding to a first sample text unit as a first speech segment.
For each first sample text unit, the corresponding acoustic parameters, such as spectrum and fundamental frequency, are extracted from its corresponding first speech segment; in this way, the timbre characteristic data of the source speaker is obtained.
S404: Build a first acoustic model using each first sample text unit and the acoustic parameters corresponding to the first sample text unit.
Each first sample text unit and the acoustic parameters corresponding to each first sample text unit may be stored to form a first data set. Taking the text units in the first data set as phonemes as an example, it should be noted that, if the first data set cannot cover all the phonemes of the post-translation language, the phonemes that are not covered and the default acoustic parameters set for these phonemes may be added to the first data set. In this way, an acoustic model belonging to the source speaker can be built based on the correspondence between the first sample text units and the acoustic parameters in the first data set. In the actual construction, the first data set is directly used as training data to train the acoustic model of the source speaker; the training process is the same as in the prior art. This embodiment defines the constructed acoustic model as the first acoustic model.
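The assembly of the first data set and a very simplified "training" of the first acoustic model can be sketched as follows. Averaging the parameters observed for each phoneme is only a placeholder for the statistical or neural training that the application leaves to the prior art; uncovered phonemes receive the default parameters, as described above.

```python
# Sketch of S401-S404 under simplifying assumptions (per-phoneme averages stand
# in for real acoustic-model training).
from collections import defaultdict

def build_first_acoustic_model(sample_units, sample_params, all_phonemes, default_params):
    """sample_units[i] is a phoneme; sample_params[i] is its parameter vector
    (e.g. spectrum and fundamental-frequency features) extracted from the
    corresponding first speech segment."""
    buckets = defaultdict(list)
    for unit, params in zip(sample_units, sample_params):
        buckets[unit].append(params)

    model = {}
    for phoneme in all_phonemes:
        observed = buckets.get(phoneme)
        if observed:                                   # covered by the recording
            model[phoneme] = [sum(col) / len(observed) for col in zip(*observed)]
        else:                                          # uncovered -> default parameters
            model[phoneme] = list(default_params)
    return model

def get_acoustic_params(model, target_text_units):
    # Step B with the first acoustic model: look up each target text unit.
    return [model[unit] for unit in target_text_units]
```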
In one embodiment, this acoustic model can implement Step B "obtain acoustic parameters of each target text unit" in the second embodiment, which may specifically include: obtaining the acoustic parameters of each target text unit using the first acoustic model. In this embodiment, the acoustic model of the source speaker, that is, the first acoustic model, is used to directly generate the acoustic parameters of each target text unit; the specific generation method may be the same as in the prior art, for example an existing parameter-based speech synthesis method.
In the second acoustic model construction method, the recording language chosen by the source speaker is different from the language of the translated speech (i.e. the second target speech). This model construction method is described in detail below.
Referring to Fig. 5, which is a flow diagram of another acoustic model construction method provided in this embodiment, the method includes the following steps:
S501: Obtain a second sample speech of the source speaker, where the language of the second sample speech is different from the language of the second target speech.
In this embodiment, in order to make the translated speech, that is, the second target speech, pronounced with the timbre characteristics of the source speaker, a recording of the source speaker may be obtained. This recording may be in a language different from that of the translated speech; for example, this recording may be in the same language as the speech before translation, that is, the first target speech. In addition, the text corresponding to this recording should cover all the phoneme content of that language as far as possible.
For ease of distinction, this embodiment defines this recording as the second sample speech.
Still take the case where the speech before translation, that is, the first target speech, is Chinese speech and the translated speech, that is, the second target speech, is English speech as an example. First, it is confirmed whether the source speaker has the ability to read English aloud normally; for example, the translator may ask the source speaker whether he or she can read English aloud. If the source speaker replies "I cannot read English aloud" by voice, button or another form, the translator may provide language options. Assuming the source speaker selects Chinese, the translator may provide a small amount of fixed Chinese text and prompt the source speaker to read it aloud; the fixed Chinese text covers all Chinese phonemes as far as possible. The source speaker reads the fixed Chinese text aloud, so that the translator obtains the speech of this fixed Chinese text, and this speech is the second sample speech.
S502: Segment the recognition text of the second sample speech into text units of the preset size to obtain second sample text units.
After the second sample speech is obtained, it is converted into a speech recognition text by a speech recognition technique, for example a speech recognition technique based on artificial neural networks. The speech recognition text is then divided into text units of the preset size (the same dividing unit as in Step A of the second embodiment), for example in units of phonemes. For ease of distinction, this embodiment defines each text unit divided from this speech recognition text as a second sample text unit.
S503: Convert each second sample text unit to obtain a first converted text unit, where the first converted text unit is a text unit used by the language of the second target speech.
Each second sample text unit needs to be converted into the corresponding text unit of the post-translation language; this embodiment defines the converted text unit as a first converted text unit. For example, assuming the second sample text units are Chinese phonemes and the post-translation language is English, the first converted text units are English phonemes.
It should be noted that the specific text unit conversion method will be described in detail in the subsequent fourth embodiment.
S504: Extract, from the second sample speech, a second speech segment corresponding to each second sample text unit, and extract acoustic parameters from the second speech segment to obtain acoustic parameters corresponding to the first converted text unit.
The second sample speech is divided according to the text segmentation applied to the recognition text of the second sample speech; in this way, the speech segment corresponding to each second sample text unit in the second sample speech can be determined. For example, both the recognition text of the second sample speech and the second sample speech are divided in units of phonemes, so as to obtain the speech segment corresponding to each phoneme in the recognition text. For ease of distinction, this embodiment defines the speech segment corresponding to a second sample text unit as a second speech segment.
For each second sample text unit, the corresponding acoustic parameters, such as spectrum and fundamental frequency, are extracted from its corresponding second speech segment and used as the acoustic parameters of the first converted text unit corresponding to that second sample text unit.
S505: Build a second acoustic model using each second sample text unit, the first converted text unit corresponding to the second sample text unit, and the acoustic parameters corresponding to the first converted text unit.
Each second sample text unit, the first converted text unit corresponding to each second sample text unit and the acoustic parameters corresponding to each first converted text unit may be stored to form a second data set. Taking the text units in the second data set as phonemes as an example, it should be noted that, if the second data set cannot cover all the phonemes of the post-translation language, the phonemes that are not covered and the default acoustic parameters set for these phonemes may be added to the second data set. In this way, an acoustic model belonging to the source speaker can be built based on the correspondences among the pre-conversion phonemes, the post-conversion phonemes and the acoustic parameters in the second data set. In the actual construction, the second data set is directly used as training data to train the acoustic model of the source speaker; the training process is the same as in the prior art. This embodiment defines the constructed acoustic model as the second acoustic model.
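Under the same simplifying assumptions as the sketch for the first acoustic model, the second data set differs only in that every second sample text unit is first mapped (S503) to a unit of the post-translation language, and the parameters extracted from its second speech segment are stored under that converted unit:

```python
# Sketch of S501-S505 (illustrative names; convert_unit implements S503 and is
# described in the fourth embodiment).
from collections import defaultdict

def build_second_acoustic_model(second_sample_units, second_sample_params,
                                convert_unit, target_phonemes, default_params):
    buckets = defaultdict(list)
    for unit, params in zip(second_sample_units, second_sample_params):
        converted = convert_unit(unit)        # e.g. Chinese phoneme -> English phoneme
        buckets[converted].append(params)

    model = {}
    for phoneme in target_phonemes:
        observed = buckets.get(phoneme)
        model[phoneme] = ([sum(col) / len(observed) for col in zip(*observed)]
                          if observed else list(default_params))
    return model
```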
In one embodiment, this acoustic model can implement Step B "obtain acoustic parameters of each target text unit" in the second embodiment, which may specifically include: obtaining the acoustic parameters of each target text unit using the second acoustic model. In this embodiment, the acoustic model of the source speaker, that is, the second acoustic model, is used to directly generate the acoustic parameters of each target text unit; the specific generation method may be the same as in the prior art, for example an existing parameter-based speech synthesis method.
To sum up, according to the speech translation method provided in this embodiment, after a first target speech of a source speaker is obtained, text translation is performed on the speech recognition text of the first target speech, and speech synthesis is then performed by obtaining the acoustic parameters of each text unit in the translation text, generating the second target speech. The acoustic parameters of each text unit may be determined by an acoustic model of the source speaker built in advance. Because the acoustic parameters carry the timbre characteristics of the source speaker, the translated speech also has the timbre characteristics of the source speaker, so that the translated speech sounds more like speech spoken directly by the source speaker.
Fourth embodiment
This embodiment describes the specific implementation of S503 in the third embodiment. In order to implement S503, a text unit mapping model needs to be built in advance, so that S503 is realized by the text unit conversion function of this text unit mapping model. This embodiment describes two construction methods for the text unit mapping model.
In the first construction method of the text unit mapping model, the correspondence between the text unit sequences of the two languages is established directly, and the conversion between text units is carried out according to this correspondence. This model construction method is described in detail below.
As shown in Fig. 6, which is a flow diagram of a sample text unit collection method provided in this embodiment, the method includes the following steps:
S601: Collect multiple first sample texts, where the language of the first sample texts is the same as the language of the second sample speech.
In order to implement S503, that is, to convert each second sample text unit in the recognition text of the second sample speech (i.e. the recording of the source speaker) into a text unit used by the post-translation language, the corresponding conversions need to be collected in advance. A large number of text corpora in the same language as the second sample speech are collected; this embodiment defines each collected text corpus as a first sample text. This embodiment does not limit the form of the first sample text; it may be a word, a sentence or a paragraph.
For example, assuming the second sample speech is Chinese speech, a large number of Chinese text corpora need to be collected in advance (as shown in Fig. 7); each Chinese text is a first sample text.
S602: Segment each first sample text into text units of the preset size to obtain third sample text units.
Each first sample text is divided into text units of the preset size (the same dividing unit as in Step A of the second embodiment), for example in units of phonemes. For ease of distinction, this embodiment defines each text unit divided from a first sample text as a third sample text unit.
Continuing the example of the previous step, assuming a first sample text is a Chinese text, the Chinese text needs to be converted into Chinese pinyin, and each Chinese phoneme in the pinyin is labeled to obtain a Chinese phoneme sequence (as shown in Fig. 7). For example, for the Chinese text "你好", the pinyin "[n i] [h ao]" can be obtained, from which the four Chinese phonemes "n", "i", "h" and "ao", that is, four third sample text units, are labeled in turn.
S603: Convert each third sample text unit to obtain a second converted text unit, where the second converted text unit is the text unit with which the third sample text unit is pronounced in the pronunciation manner of the second target speech.
The pronunciation of the first sample text in the pronunciation manner of the translated speech, that is, the second target speech, is labeled, so that the text unit corresponding to each third sample text unit in the first sample text can be found from the labeled pronunciation. For ease of distinction, this embodiment defines this text unit as a second converted text unit.
Continuing the example of the previous step, assuming the first sample text is the Chinese text "你好" and the translated speech, that is, the second target speech, is English speech, then "你好" can be labeled with a pronunciation in the form of English phonetic symbols, for example /niːhaʊ/, from which four English phonemes such as "n", "iː", "h" and "aʊ", that is, four second converted text units, are labeled in turn. In this way, the four Chinese third sample text units "n", "i", "h" and "ao" correspond in turn to these four English second converted text units.
It can be understood that the same Chinese character, for example a polyphonic character, may be pronounced differently in different Chinese words or sentences; therefore, the second converted text units corresponding to the third sample text units that make up that character may also differ. This situation also exists in other languages. In this embodiment, however, it is sufficient that the phoneme labeling before and after conversion follows fixed pronunciation rules.
Based on the above, each third sample text unit and the second converted text unit corresponding to each third sample text unit may be stored to form a text unit set. It should be noted that, because the second converted text units in this text unit set belong to the phonemes of the post-translation language, the second converted text units in this text unit set should cover all the text units of the post-translation language as far as possible.
When the text unit mapping model is built, the third sample text units in this text unit set and their corresponding second converted text units may be mapped directly in table form. On this basis, the text unit mapping model can implement step S503 in the third embodiment based on the mapping relations between text units.
In the first implementation, step S503, "convert each second sample text unit to obtain a first converted text unit", may specifically include: determining a third sample text unit identical to the second sample text unit; and using the second converted text unit corresponding to the determined third sample text unit as the first converted text unit. In this implementation, for each second sample text unit, a third sample text unit identical to the second sample text unit is looked up in the above phoneme set, and, based on the phoneme mapping relations, the second converted text unit corresponding to that third sample text unit is determined and used as the converted phoneme of the second sample text unit, that is, the first converted text unit.
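The table-style mapping of this first construction method amounts to a simple lookup; the sketch below uses the "你好" example from above, and the English-phoneme spellings are illustrative.

```python
# Toy lookup table: third sample text unit -> second converted text unit.
PHONEME_TABLE = {"n": "n", "i": "iː", "h": "h", "ao": "aʊ"}

def convert_second_sample_units(second_sample_units, table=PHONEME_TABLE):
    # For each second sample text unit, find the identical third sample text
    # unit and return its second converted text unit as the first converted unit.
    return [table[unit] for unit in second_sample_units if unit in table]

print(convert_second_sample_units(["n", "i", "h", "ao"]))
# ['n', 'iː', 'h', 'aʊ']
```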
In the second construction method of the text unit mapping model, a network model between the text unit sequences of the two languages is trained, for example the encoding-decoding model shown in Fig. 7, and this network model is used as the text unit mapping model. Such a text unit mapping model can make the text unit mapping results more accurate. This model construction method is described in detail below.
In the second construction method, referring to Fig. 8, which shows a flow diagram of an encoding-decoding model construction method, the method includes the following steps:
S801: Collect multiple second sample texts, where the language of the second sample texts is the same as the language of the second sample speech.
It should be noted that step S801 is similar to step S601; it is only necessary to replace the first sample texts in S601 with the second sample texts. For the related content, refer to the description of S601; details are not repeated here.
S802: Segment each second sample text into text units of the preset size to obtain fourth sample text units.
It should be noted that step S802 is similar to step S602; it is only necessary to replace the first sample texts in S602 with the second sample texts and to replace the third sample text units with the fourth sample text units. For the related content, refer to the description of S602; details are not repeated here.
S803: Convert each fourth sample text unit to obtain a third converted text unit, where the third converted text unit is the text unit with which the fourth sample text unit is pronounced in the pronunciation manner of the second target speech.
It should be noted that step S803 is similar to step S603; it is only necessary to replace the third sample text units in S603 with the fourth sample text units and to replace the second converted text units with the third converted text units. For the related content, refer to the description of S603; details are not repeated here.
S804: For the syllables in each second sample text, build an encoding-decoding model by learning the combination and order relations, within the corresponding syllable, of the fourth sample text units belonging to a single syllable, the combination and order relations, within the second sample text, of at least two consecutive syllables, and the combination and order relations, within the second sample text, of the fourth sample text units in the at least two consecutive syllables.
In this embodiment, the fourth sample text unit sequences and the third converted text unit sequences may be used to train a network model between the text unit systems of the two languages; this network model may include the encoding network and decoding network shown in Fig. 7. In the following, this encoding-decoding model is introduced with the fourth sample text unit sequence being a Chinese phoneme sequence and the third converted text unit sequence being an English phoneme sequence.
Specifically, by adding a layer of syllable information, the encoding network gains the ability to handle the linking between different syllables, which serves to optimize both the phoneme combinations within syllables and the phoneme mapping of the whole text. The encoding network may include three encoding processes: encoding of the phonemes within each syllable, encoding between syllables, and encoding of the whole text; at each encoding step, the later encoding takes the earlier encoding results into account. The encoding process of this encoding network is described below taking Fig. 9 as an example.
As shown in Fig. 9, assume that a collected second sample text is a Chinese text such as "你好"; the fourth sample text unit sequence is then "n", "i", "h", "ao". First, all the Chinese phonemes "n", "i", "h", "ao" belonging to the Chinese text are uniformly vectorized, for example by methods such as Word2Vector, and the Chinese phonemes belonging to a single syllable are encoded by a bidirectional long short-term memory neural network (Bidirectional Long Short-Term Memory, BLSTM). The resulting encoding contains the relations between the phonemes within each syllable; that is, the combination and order relations between "n" and "i", corresponding to the Chinese syllable "ni", and the combination and order relations between "h" and "ao", corresponding to the Chinese syllable "hao", are learned.
Then, all the syllables "ni" and "hao" of the Chinese text are vectorized, for example by methods such as Word2Vector. After the encoding results of the first-layer BLSTM network (the within-syllable phoneme learning network shown in Fig. 9) are obtained, the first-layer encoding results are combined with the vector of each syllable and encoded between syllables by a bidirectional BLSTM network. The resulting encoding contains the relations between syllables; that is, the combination and order relations between "ni" and "hao", corresponding to the Chinese text "你好", are learned.
Finally, the encoding results of the second-layer BLSTM network (the inter-syllable learning network shown in Fig. 9) are combined with the vector features of all the phonemes within each syllable and encoded by a third BLSTM layer. The resulting encoding contains the relations between the phonemes of the whole Chinese text; that is, the combination and order relations among "n", "i", "h" and "ao", corresponding to the Chinese text "你好", are learned.
After the above three layers of encoding, the third-layer encoding result is used as the input of the decoding network shown in Fig. 7, and the decoding network shown in Fig. 7 outputs the corresponding English phoneme sequence, for example "n", "iː", "h", "aʊ".
It can be understood that, when the encoding-decoding model is trained with a large number of Chinese texts, the model learns the combination and order relations between two or more syllables and also learns, for each syllable, the combination and order relations of its phonemes within that syllable. When the Chinese phoneme sequence of a Chinese text needs to be converted into an English phoneme sequence, based on this learning result, the English phoneme sequence that best matches the Chinese phoneme sequence can be selected according to the combination and order relations of the Chinese phonemes within the Chinese text. Moreover, whether the Chinese text is a short word or a long sentence, the corresponding English phoneme sequence links up well; this approach makes the correspondence between phoneme sequences more flexible and accurate.
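A compact sketch of the three-level encoder (phonemes within a syllable, then syllables, then all phonemes of the text) is given below in PyTorch. The layer sizes, the mean-pooling of the within-syllable codes and the specific use of nn.LSTM with bidirectional=True are assumptions made for illustration; the application itself only fixes the three-level BLSTM structure and does not prescribe an implementation, and the decoder that emits the English phoneme sequence is omitted.

```python
# Illustrative three-level BLSTM encoder following Figs. 7 and 9 (assumed
# hyper-parameters and pooling; not code from this application).
import torch
import torch.nn as nn

class HierarchicalPhonemeEncoder(nn.Module):
    def __init__(self, n_phonemes, n_syllables, dim=32):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phonemes, dim)
        self.syll_emb = nn.Embedding(n_syllables, dim)
        # layer 1: phonemes inside one syllable
        self.intra = nn.LSTM(dim, dim, bidirectional=True, batch_first=True)
        # layer 2: linking between syllables (syllable vector + layer-1 summary)
        self.inter = nn.LSTM(3 * dim, dim, bidirectional=True, batch_first=True)
        # layer 3: all phonemes of the text, conditioned on the syllable codes
        self.text = nn.LSTM(3 * dim, dim, bidirectional=True, batch_first=True)

    def forward(self, syllable_ids, phoneme_ids_per_syllable):
        intra_codes, phone_vecs = [], []
        for phones in phoneme_ids_per_syllable:          # one syllable at a time
            e = self.phone_emb(phones).unsqueeze(0)      # (1, n_phones, dim)
            out, _ = self.intra(e)                       # (1, n_phones, 2*dim)
            intra_codes.append(out.mean(dim=1))          # (1, 2*dim) syllable summary
            phone_vecs.append(e.squeeze(0))              # (n_phones, dim)
        intra_codes = torch.cat(intra_codes).unsqueeze(0)    # (1, n_syll, 2*dim)

        syll = self.syll_emb(syllable_ids).unsqueeze(0)      # (1, n_syll, dim)
        inter_out, _ = self.inter(torch.cat([syll, intra_codes], dim=-1))

        # broadcast each syllable's layer-2 code onto its phonemes, then encode the text
        per_phone = []
        for i, vecs in enumerate(phone_vecs):
            code = inter_out[0, i].expand(vecs.size(0), -1)    # (n_phones, 2*dim)
            per_phone.append(torch.cat([vecs, code], dim=-1))  # (n_phones, 3*dim)
        text_in = torch.cat(per_phone).unsqueeze(0)            # (1, total_phones, 3*dim)
        text_out, _ = self.text(text_in)
        return text_out   # fed to a decoder that emits the target phoneme sequence

# Toy call for "你好": syllables [ni, hao], phonemes [n, i] and [h, ao] as ids.
enc = HierarchicalPhonemeEncoder(n_phonemes=100, n_syllables=500)
codes = enc(torch.tensor([3, 7]), [torch.tensor([1, 2]), torch.tensor([4, 5])])
```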
It should be noted that coding/decoding model is not limited to the training between Chinese aligned phoneme sequence and English aligned phoneme sequence, It is suitable between arbitrary two kinds of different languages.
Based on the above, the step in 3rd embodiment can be realized based on the learning outcome of coding/decoding model S503.In second of realization method, the second sample text unit " is converted, obtains the first conversion by step S503 Unit-in-context " can specifically include:Using the coding/decoding model, the second sample text unit is converted, is obtained First converting text unit.In the present embodiment, using the second sample text unit as the encoding and decoding mould built in advance The input of type, output can be obtained transformed first converting text unit, and in transfer process, coding/decoding model can be based on Above-mentioned learning outcome, according to the syntagmatic and ordinal relation between each second sample text unit, selection and every 1 second First converting text unit of sample text unit collocation, relative to the first realization method of S503, due to this realization method Learnt practical collocation mode between the unit-in-context sequence of different language in advance so that transformed unit-in-context more subject to Really.
In summary, in the voice translation method provided in this embodiment, when the text unit sequence of the recognition text of the source speaker's recording needs to be converted, that is, converted into the text unit sequence of the post-translation language, a text unit mapping model can be built in advance. This model can be built on the basis of the correspondence between text unit sequences of different languages, or by training an encoding-decoding network. Performing the text unit conversion with this text unit mapping model yields the required text unit conversion result.
Fifth embodiment
Referring to Figure 10, which is a schematic composition diagram of a speech translation apparatus provided in this embodiment, the speech translation apparatus 1000 includes:
Voice acquisition unit 1001, configured to obtain a first target voice of a source speaker;
Speech interpreting unit 1002, configured to generate a second target voice by performing voice translation on the first target voice, wherein the language of the second target voice is different from the language of the first target voice, and the second target voice carries the timbre characteristics of the source speaker.
In one implementation of this embodiment, the speech interpreting unit 1002 may include the following subunits (a pipeline sketch follows this list):
Text recognition subunit, configured to generate a speech recognition text by performing speech recognition on the first target voice;
Text translation subunit, configured to generate a translated text by performing text translation on the speech recognition text;
Voice translation subunit, configured to generate the second target voice by performing speech synthesis on the translated text.
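The three subunits above form a recognize-translate-synthesize pipeline. A minimal sketch of that flow follows; all function names are hypothetical stubs standing in for the respective subunits, not APIs defined by the patent.

# Hypothetical pipeline sketch; the three stage functions are stubs for the
# text recognition, text translation, and timbre-preserving synthesis subunits.
def speech_to_text(voice: bytes, lang: str) -> str:
    raise NotImplementedError("ASR backend goes here")

def translate_text(text: str, src: str, tgt: str) -> str:
    raise NotImplementedError("text translation backend goes here")

def synthesize_speech(text: str, lang: str, timbre_of: bytes) -> bytes:
    raise NotImplementedError("timbre-preserving synthesis goes here")

def translate_voice(first_target_voice: bytes, src_lang: str, tgt_lang: str) -> bytes:
    """Generate the second target voice from the first target voice."""
    recognized_text = speech_to_text(first_target_voice, lang=src_lang)
    translated_text = translate_text(recognized_text, src=src_lang, tgt=tgt_lang)
    # the synthesized voice is in tgt_lang but keeps the source speaker's timbre
    return synthesize_speech(translated_text, lang=tgt_lang, timbre_of=first_target_voice)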
In one implementation of this embodiment, the voice translation subunit may include the following subunits (an illustrative sketch follows this list):
Target unit dividing subunit, configured to segment the translated text according to text units of a preset size to obtain each target text unit;
Acoustic parameter obtaining subunit, configured to obtain the acoustic parameters of each target text unit, wherein the acoustic parameters carry the timbre characteristics of the source speaker;
Translated speech generating subunit, configured to perform speech synthesis on the translated text according to the acoustic parameters of each target text unit, so as to generate the second target voice.
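As a rough illustration of how these subunits might cooperate, the translated text is segmented into text units of a preset size, acoustic parameters carrying the source speaker's timbre are fetched per unit, and the parameter sequence is handed to a vocoder. The segmentation granularity, the parameter store, and the vocoder callable are all assumptions made for this sketch.

from typing import Dict, List

def split_into_units(translated_text: str, size: int = 1) -> List[str]:
    """Target unit dividing subunit: naively cut the translated text into
    units of a preset size (here, groups of `size` characters)."""
    return [translated_text[i:i + size] for i in range(0, len(translated_text), size)]

def synthesize(translated_text: str,
               acoustic_params: Dict[str, List[float]],
               vocoder) -> bytes:
    """Translated speech generating subunit: look up per-unit acoustic parameters
    (which carry the source speaker's timbre) and vocode them into the second target voice."""
    units = split_into_units(translated_text)
    param_sequence = [acoustic_params[u] for u in units if u in acoustic_params]
    return vocoder(param_sequence)   # vocoder is an assumed callable, e.g. a parametric synthesizer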
In one implementation of this embodiment, the apparatus 1000 may further include the following units (a sketch of the resulting acoustic model follows this list):
First sample acquisition unit, configured to obtain a first sample voice of the source speaker, wherein the language of the first sample voice is the same as the language of the second target voice;
First sample dividing unit, configured to segment the recognition text of the first sample voice according to text units of the preset size to obtain each first sample text unit;
First segment extraction unit, configured to extract, from the first sample voice, a first speech segment corresponding to the first sample text unit;
First parameter extraction unit, configured to extract acoustic parameters from the first speech segment;
First model construction unit, configured to build a first acoustic model by using each first sample text unit and the acoustic parameters corresponding to that first sample text unit;
Then the acoustic parameter obtaining subunit may specifically be configured to obtain the acoustic parameters of each target text unit by using the first acoustic model.
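As a toy illustration of the units above, the first acoustic model can be thought of as a mapping learned from (first sample text unit, acoustic parameters) pairs; averaging stands in here for training a real statistical or neural acoustic model, and the whole sketch is an assumption about one possible realization.

from typing import Dict, List, Tuple

def build_first_acoustic_model(
        pairs: List[Tuple[str, List[float]]]) -> Dict[str, List[float]]:
    """Each pair is (first sample text unit, acoustic parameters extracted from the
    speech segment aligned to that unit). Averaging per unit stands in for a real model."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for unit, params in pairs:
        if unit not in sums:
            sums[unit] = [0.0] * len(params)
            counts[unit] = 0
        sums[unit] = [s + p for s, p in zip(sums[unit], params)]
        counts[unit] += 1
    return {u: [s / counts[u] for s in sums[u]] for u in sums}

# Usage: model = build_first_acoustic_model([("ni", [0.2, 0.5]), ("hao", [0.1, 0.7])])
# model[unit] then supplies the acoustic parameters of a matching target text unit.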
In one implementation of this embodiment, the apparatus 1000 may further include the following units (a sketch follows this list):
Second sample acquisition unit, configured to obtain a second sample voice of the source speaker, wherein the language of the second sample voice is different from the language of the second target voice;
Second sample dividing unit, configured to segment the recognition text of the second sample voice according to text units of the preset size to obtain each second sample text unit;
Text unit converting unit, configured to convert the second sample text unit to obtain a first converting text unit, wherein the first converting text unit is a text unit used by the language of the second target voice;
Second segment extraction unit, configured to extract, from the second sample voice, a second speech segment corresponding to the second sample text unit;
Second parameter extraction unit, configured to extract acoustic parameters from the second speech segment and thereby obtain the acoustic parameters corresponding to the first converting text unit;
Second model construction unit, configured to build a second acoustic model by using each second sample text unit, the first converting text unit corresponding to that second sample text unit, and the acoustic parameters corresponding to that first converting text unit;
Then the acoustic parameter obtaining subunit may specifically be configured to obtain the acoustic parameters of each target text unit by using the second acoustic model.
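The second acoustic model differs from the first in that it is keyed by the first converting text unit, i.e., the target-language rendering of a source-language sample unit. A minimal sketch under the same simplifying assumptions as above:

from typing import Dict, List, Tuple

def build_second_acoustic_model(
        samples: List[Tuple[str, str, List[float]]]) -> Dict[str, List[float]]:
    """Each sample is (second sample text unit, first converting text unit,
    acoustic parameters of the matching second speech segment). Indexing by the
    converting text unit lets target-language units in the translated text be
    mapped directly to the source speaker's parameters."""
    model: Dict[str, List[float]] = {}
    for _second_unit, converting_unit, params in samples:
        model[converting_unit] = params   # last-seen parameters; a real model would pool or train
    return model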
In one implementation of this embodiment, the apparatus 1000 may further include the following units (a lookup sketch follows this list):
First text collecting unit, configured to collect multiple first sample texts, wherein the language of the first sample texts is the same as the language of the second sample voice;
Third sample dividing unit, configured to segment the first sample texts according to text units of the preset size to obtain each third sample text unit;
First unit converting unit, configured to convert the third sample text unit to obtain a second converting text unit, wherein the second converting text unit is a text unit in which the third sample text unit is pronounced in the articulation manner of the second target voice;
Then the text unit converting unit may include:
Identical unit determining subunit, configured to determine a third sample text unit identical to the second sample text unit;
Text unit converting subunit, configured to take the second converting text unit corresponding to the determined third sample text unit as the first converting text unit.
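The two subunits above amount to a dictionary lookup: each collected third sample text unit already has a second converting text unit, so converting a second sample text unit is a matter of finding the identical third sample text unit and reusing its conversion. A small, purely illustrative sketch (the example mappings are toy values):

from typing import Dict, Optional

def build_conversion_table(third_units: Dict[str, str]) -> Dict[str, str]:
    """third_units maps each third sample text unit to its second converting text unit
    (its pronunciation written with the target voice's articulation)."""
    return dict(third_units)

def convert_unit(second_sample_unit: str, table: Dict[str, str]) -> Optional[str]:
    """Identical-unit determining + text unit converting subunits: find the identical
    third sample text unit and return its second converting text unit as the
    first converting text unit; None if no identical unit was collected."""
    return table.get(second_sample_unit)

# e.g. table = build_conversion_table({"ni": "nee", "hao": "how"})   # toy, assumed mappings
#      convert_unit("ni", table)  ->  "nee"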
In one implementation of this embodiment, the apparatus 1000 may further include:
Second text collecting unit, configured to collect multiple second sample texts, wherein the language of the second sample texts is the same as the language of the second sample voice;
Fourth sample dividing unit, configured to segment the second sample texts according to text units of the preset size to obtain each fourth sample text unit;
Second unit converting unit, configured to convert the fourth sample text unit to obtain a third converting text unit, wherein the third converting text unit is a text unit in which the fourth sample text unit is pronounced in the articulation manner of the second target voice;
Coding/decoding model construction unit, configured to build a coding/decoding model for the syllables in the second sample texts by learning the combination relationship and order relationship, within the corresponding syllable, of the fourth sample text units belonging to the same syllable, learning the combination relationship and order relationship of at least two consecutive syllables in the second sample texts, and learning the combination relationship and order relationship, in the second sample texts, of the fourth sample text units within the at least two consecutive syllables;
Then the text unit converting unit may specifically be configured to convert the second sample text unit by using the coding/decoding model to obtain the first converting text unit.
Sixth embodiment
Referring to Figure 11, which is a hardware architecture diagram of a speech translation apparatus provided in this embodiment, the speech translation apparatus 1100 includes a memory 1101, a receiver 1102, and a processor 1103 connected to the memory 1101 and the receiver 1102, respectively. The memory 1101 is configured to store a set of program instructions, and the processor 1103 is configured to call the program instructions stored in the memory 1101 to perform the following operations:
Obtaining a first target voice of a source speaker;
Generating a second target voice by performing voice translation on the first target voice, wherein the language of the second target voice is different from the language of the first target voice, and the second target voice carries the timbre characteristics of the source speaker.
In one implementation of this embodiment, the processor 1103 is further configured to call the program instructions stored in the memory 1101 to perform the following operations:
Generating a speech recognition text by performing speech recognition on the first target voice;
Generating a translated text by performing text translation on the speech recognition text;
Generating the second target voice by performing speech synthesis on the translated text.
In one implementation of this embodiment, the processor 1103 is further configured to call the program instructions stored in the memory 1101 to perform the following operations:
Segmenting the translated text according to text units of a preset size to obtain each target text unit;
Obtaining the acoustic parameters of each target text unit, wherein the acoustic parameters carry the timbre characteristics of the source speaker;
Performing speech synthesis on the translated text according to the acoustic parameters of each target text unit to generate the second target voice.
In one implementation of this embodiment, the processor 1103 is further configured to call the program instructions stored in the memory 1101 to perform the following operations:
Obtaining a first sample voice of the source speaker, wherein the language of the first sample voice is the same as the language of the second target voice;
Segmenting the recognition text of the first sample voice according to text units of the preset size to obtain each first sample text unit;
Extracting, from the first sample voice, a first speech segment corresponding to the first sample text unit;
Extracting acoustic parameters from the first speech segment;
Building a first acoustic model by using each first sample text unit and the acoustic parameters corresponding to that first sample text unit;
Obtaining the acoustic parameters of each target text unit by using the first acoustic model.
In one implementation of this embodiment, the processor 1103 is further configured to call the program instructions stored in the memory 1101 to perform the following operations:
Obtaining a second sample voice of the source speaker, wherein the language of the second sample voice is different from the language of the second target voice;
Segmenting the recognition text of the second sample voice according to text units of the preset size to obtain each second sample text unit;
Converting the second sample text unit to obtain a first converting text unit, wherein the first converting text unit is a text unit used by the language of the second target voice;
Extracting, from the second sample voice, a second speech segment corresponding to the second sample text unit;
Extracting acoustic parameters from the second speech segment to obtain the acoustic parameters corresponding to the first converting text unit;
Building a second acoustic model by using each second sample text unit, the first converting text unit corresponding to that second sample text unit, and the acoustic parameters corresponding to that first converting text unit;
Obtaining the acoustic parameters of each target text unit by using the second acoustic model.
In one implementation of this embodiment, the processor 1103 is further configured to call the program instructions stored in the memory 1101 to perform the following operations:
Collecting multiple first sample texts, wherein the language of the first sample texts is the same as the language of the second sample voice;
Segmenting the first sample texts according to text units of the preset size to obtain each third sample text unit;
Converting the third sample text unit to obtain a second converting text unit, wherein the second converting text unit is a text unit in which the third sample text unit is pronounced in the articulation manner of the second target voice;
Determining a third sample text unit identical to the second sample text unit;
Taking the second converting text unit corresponding to the determined third sample text unit as the first converting text unit.
In one implementation of this embodiment, the processor 1103 is further configured to call the program instructions stored in the memory 1101 to perform the following operations:
Collecting multiple second sample texts, wherein the language of the second sample texts is the same as the language of the second sample voice;
Segmenting the second sample texts according to text units of the preset size to obtain each fourth sample text unit;
Converting the fourth sample text unit to obtain a third converting text unit, wherein the third converting text unit is a text unit in which the fourth sample text unit is pronounced in the articulation manner of the second target voice;
For the syllables in the second sample texts, building a coding/decoding model by learning the combination relationship and order relationship, within the corresponding syllable, of the fourth sample text units belonging to the same syllable, learning the combination relationship and order relationship of at least two consecutive syllables in the second sample texts, and learning the combination relationship and order relationship, in the second sample texts, of the fourth sample text units within the at least two consecutive syllables;
Converting the second sample text unit by using the coding/decoding model to obtain the first converting text unit.
In addition, this embodiment further provides a computer-readable storage medium including instructions that, when run on a computer, cause the computer to execute any one of the implementations of the above voice translation method.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the methods of the above embodiments can be implemented by software plus a necessary general hardware platform. Based on this understanding, the technical solution of the present application, or the part of it that contributes to the prior art, can essentially be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network communication device such as a media gateway, or the like) to execute the methods described in the embodiments, or in certain parts of the embodiments, of the present application.
It should be noted that the embodiments in this specification are described in a progressive manner, each embodiment focuses on its differences from the other embodiments, and the identical or similar parts of the embodiments may be referred to each other. As for the apparatus disclosed in the embodiments, since it corresponds to the method disclosed in the embodiments, its description is relatively simple, and the relevant parts can be found in the description of the method.
It should also be noted that, in this document, relational terms such as first and second are used merely to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" and any other variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or further includes elements inherent to such a process, method, article or device. In the absence of further restrictions, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes that element.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein can be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A voice translation method, characterized by comprising:
Obtaining a first target voice of a source speaker;
Generating a second target voice by performing voice translation on the first target voice, wherein the language of the second target voice is different from the language of the first target voice, and the second target voice carries the timbre characteristics of the source speaker.
2. The method according to claim 1, characterized in that generating the second target voice by performing voice translation on the first target voice comprises:
Generating a speech recognition text by performing speech recognition on the first target voice;
Generating a translated text by performing text translation on the speech recognition text;
Generating the second target voice by performing speech synthesis on the translated text.
3. The method according to claim 2, characterized in that generating the second target voice by performing speech synthesis on the translated text comprises:
Segmenting the translated text according to text units of a preset size to obtain each target text unit;
Obtaining the acoustic parameters of each target text unit, wherein the acoustic parameters carry the timbre characteristics of the source speaker;
Performing speech synthesis on the translated text according to the acoustic parameters of each target text unit to generate the second target voice.
4. The method according to claim 3, characterized in that the method further comprises:
Obtaining a first sample voice of the source speaker, wherein the language of the first sample voice is the same as the language of the second target voice;
Segmenting the recognition text of the first sample voice according to text units of the preset size to obtain each first sample text unit;
Extracting, from the first sample voice, a first speech segment corresponding to the first sample text unit;
Extracting acoustic parameters from the first speech segment;
Building a first acoustic model by using each first sample text unit and the acoustic parameters corresponding to that first sample text unit;
Then, obtaining the acoustic parameters of each target text unit comprises:
Obtaining the acoustic parameters of each target text unit by using the first acoustic model.
5. The method according to claim 3, characterized in that the method further comprises:
Obtaining a second sample voice of the source speaker, wherein the language of the second sample voice is different from the language of the second target voice;
Segmenting the recognition text of the second sample voice according to text units of the preset size to obtain each second sample text unit;
Converting the second sample text unit to obtain a first converting text unit, wherein the first converting text unit is a text unit used by the language of the second target voice;
Extracting, from the second sample voice, a second speech segment corresponding to the second sample text unit;
Extracting acoustic parameters from the second speech segment to obtain the acoustic parameters corresponding to the first converting text unit;
Building a second acoustic model by using each second sample text unit, the first converting text unit corresponding to that second sample text unit, and the acoustic parameters corresponding to that first converting text unit;
Then, obtaining the acoustic parameters of each target text unit comprises:
Obtaining the acoustic parameters of each target text unit by using the second acoustic model.
6. The method according to claim 5, characterized in that the method further comprises:
Collecting multiple first sample texts, wherein the language of the first sample texts is the same as the language of the second sample voice;
Segmenting the first sample texts according to text units of the preset size to obtain each third sample text unit;
Converting the third sample text unit to obtain a second converting text unit, wherein the second converting text unit is a text unit in which the third sample text unit is pronounced in the articulation manner of the second target voice;
Then, converting the second sample text unit to obtain the first converting text unit comprises:
Determining a third sample text unit identical to the second sample text unit;
Taking the second converting text unit corresponding to the determined third sample text unit as the first converting text unit.
7. The method according to claim 5, characterized in that the method further comprises:
Collecting multiple second sample texts, wherein the language of the second sample texts is the same as the language of the second sample voice;
Segmenting the second sample texts according to text units of the preset size to obtain each fourth sample text unit;
Converting the fourth sample text unit to obtain a third converting text unit, wherein the third converting text unit is a text unit in which the fourth sample text unit is pronounced in the articulation manner of the second target voice;
For the syllables in the second sample texts, building a coding/decoding model by learning the combination relationship and order relationship, within the corresponding syllable, of the fourth sample text units belonging to the same syllable, learning the combination relationship and order relationship of at least two consecutive syllables in the second sample texts, and learning the combination relationship and order relationship, in the second sample texts, of the fourth sample text units within the at least two consecutive syllables;
Then, converting the second sample text unit to obtain the first converting text unit comprises:
Converting the second sample text unit by using the coding/decoding model to obtain the first converting text unit.
8. A speech translation apparatus, characterized by comprising:
A voice acquisition unit, configured to obtain a first target voice of a source speaker;
A speech interpreting unit, configured to generate a second target voice by performing voice translation on the first target voice, wherein the language of the second target voice is different from the language of the first target voice, and the second target voice carries the timbre characteristics of the source speaker.
9. A speech translation apparatus, characterized by comprising: a processor, a memory and a system bus;
The processor and the memory are connected through the system bus;
The memory is configured to store one or more programs, the one or more programs comprising instructions that, when executed by the processor, cause the processor to execute the method according to any one of claims 1-7.
10. A computer-readable storage medium, comprising instructions that, when run on a computer, cause the computer to execute the method according to any one of claims 1-7.
CN201810167142.5A 2018-02-28 2018-02-28 Voice translation method and device Active CN108447486B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810167142.5A CN108447486B (en) 2018-02-28 2018-02-28 Voice translation method and device
PCT/CN2018/095766 WO2019165748A1 (en) 2018-02-28 2018-07-16 Speech translation method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810167142.5A CN108447486B (en) 2018-02-28 2018-02-28 Voice translation method and device

Publications (2)

Publication Number Publication Date
CN108447486A true CN108447486A (en) 2018-08-24
CN108447486B CN108447486B (en) 2021-12-03

Family

ID=63192800

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810167142.5A Active CN108447486B (en) 2018-02-28 2018-02-28 Voice translation method and device

Country Status (2)

Country Link
CN (1) CN108447486B (en)
WO (1) WO2019165748A1 (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108986793A (en) * 2018-09-28 2018-12-11 北京百度网讯科技有限公司 translation processing method, device and equipment
CN109119063A (en) * 2018-08-31 2019-01-01 腾讯科技(深圳)有限公司 Video dubs generation method, device, equipment and storage medium
CN109754808A (en) * 2018-12-13 2019-05-14 平安科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium of voice conversion text
CN110415680A (en) * 2018-09-05 2019-11-05 满金坝(深圳)科技有限公司 A kind of simultaneous interpretation method, synchronous translation apparatus and a kind of electronic equipment
CN110610720A (en) * 2019-09-19 2019-12-24 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN110970014A (en) * 2019-10-31 2020-04-07 阿里巴巴集团控股有限公司 Voice conversion, file generation, broadcast, voice processing method, device and medium
WO2020077868A1 (en) * 2018-10-17 2020-04-23 深圳壹账通智能科技有限公司 Simultaneous interpretation method and apparatus, computer device and storage medium
CN111105781A (en) * 2019-12-23 2020-05-05 联想(北京)有限公司 Voice processing method, device, electronic equipment and medium
CN111368559A (en) * 2020-02-28 2020-07-03 北京字节跳动网络技术有限公司 Voice translation method and device, electronic equipment and storage medium
CN111696518A (en) * 2020-06-05 2020-09-22 四川纵横六合科技股份有限公司 Automatic speech synthesis method based on text
CN111785258A (en) * 2020-07-13 2020-10-16 四川长虹电器股份有限公司 Personalized voice translation method and device based on speaker characteristics
CN112420008A (en) * 2019-08-22 2021-02-26 北京峰趣互联网信息服务有限公司 Method and device for recording songs, electronic equipment and storage medium
WO2021134592A1 (en) * 2019-12-31 2021-07-08 深圳市欢太科技有限公司 Speech processing method, apparatus and device, and storage medium
CN113160793A (en) * 2021-04-23 2021-07-23 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and storage medium based on low resource language
CN113362818A (en) * 2021-05-08 2021-09-07 山西三友和智慧信息技术股份有限公司 Voice interaction guidance system and method based on artificial intelligence
WO2021208531A1 (en) * 2020-04-16 2021-10-21 北京搜狗科技发展有限公司 Speech processing method and apparatus, and electronic device
US11488577B2 (en) * 2019-09-27 2022-11-01 Baidu Online Network Technology (Beijing) Co., Ltd. Training method and apparatus for a speech synthesis model, and storage medium
CN116343751A (en) * 2023-05-29 2023-06-27 深圳市泰为软件开发有限公司 Voice translation-based audio analysis method and device

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113808576A (en) * 2020-06-16 2021-12-17 阿里巴巴集团控股有限公司 Voice conversion method, device and computer system
CN112382297A (en) * 2020-11-13 2021-02-19 北京有竹居网络技术有限公司 Method, apparatus, device and medium for generating audio
CN112530404A (en) * 2020-11-30 2021-03-19 深圳市优必选科技股份有限公司 Voice synthesis method, voice synthesis device and intelligent equipment
CN112509553B (en) * 2020-12-02 2023-08-01 问问智能信息科技有限公司 Speech synthesis method, device and computer readable storage medium
CN112818707B (en) * 2021-01-19 2024-02-27 传神语联网网络科技股份有限公司 Reverse text consensus-based multi-turn engine collaborative speech translation system and method
CN113327575B (en) * 2021-05-31 2024-03-01 广州虎牙科技有限公司 Speech synthesis method, device, computer equipment and storage medium
EP4266306A1 (en) * 2022-04-22 2023-10-25 Papercup Technologies Limited A speech processing system and a method of processing a speech signal
CN114818748B (en) * 2022-05-10 2023-04-21 北京百度网讯科技有限公司 Method for generating translation model, translation method and device

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1553381A (en) * 2003-05-26 2004-12-08 杨宏惠 Multi-language correspondent list style language database and synchronous computer inter-transtation and communication
CN101114447A (en) * 2006-07-26 2008-01-30 株式会社东芝 Speech translation device and method
CN101154221A (en) * 2006-09-28 2008-04-02 株式会社东芝 Apparatus performing translation process from inputted speech
CN101727904A (en) * 2008-10-31 2010-06-09 国际商业机器公司 Voice translation method and device
CN102270450A (en) * 2010-06-07 2011-12-07 株式会社曙飞电子 System and method of multi model adaptation and voice recognition
CN102821259A (en) * 2012-07-20 2012-12-12 冠捷显示科技(厦门)有限公司 TV (television) system with multi-language speech translation and realization method thereof
CN104252861A (en) * 2014-09-11 2014-12-31 百度在线网络技术(北京)有限公司 Video voice conversion method, video voice conversion device and server
JP2015026054A (en) * 2013-07-29 2015-02-05 韓國電子通信研究院Electronics and Telecommunications Research Institute Automatic interpretation device and method
CN104899192A (en) * 2014-03-07 2015-09-09 韩国电子通信研究院 Apparatus and method for automatic interpretation
CN105390141A (en) * 2015-10-14 2016-03-09 科大讯飞股份有限公司 Sound conversion method and sound conversion device
CN105426362A (en) * 2014-09-11 2016-03-23 株式会社东芝 Speech Translation Apparatus And Method
CN106791913A (en) * 2016-12-30 2017-05-31 深圳市九洲电器有限公司 Digital television program simultaneous interpretation output intent and system
US20170255616A1 (en) * 2016-03-03 2017-09-07 Electronics And Telecommunications Research Institute Automatic interpretation system and method for generating synthetic sound having characteristics similar to those of original speaker's voice
CN107632980A (en) * 2017-08-03 2018-01-26 北京搜狗科技发展有限公司 Voice translation method and device, the device for voiced translation
CN107992485A (en) * 2017-11-27 2018-05-04 北京搜狗科技发展有限公司 A kind of simultaneous interpretation method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105786801A (en) * 2014-12-22 2016-07-20 中兴通讯股份有限公司 Speech translation method, communication method and related device
CN106156009A (en) * 2015-04-13 2016-11-23 中兴通讯股份有限公司 Voice translation method and device
CN107465816A (en) * 2017-07-25 2017-12-12 广西定能电子科技有限公司 A kind of call terminal and method of instant original voice translation of conversing
CN107731232A (en) * 2017-10-17 2018-02-23 深圳市沃特沃德股份有限公司 Voice translation method and device

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109119063A (en) * 2018-08-31 2019-01-01 腾讯科技(深圳)有限公司 Video dubs generation method, device, equipment and storage medium
CN109119063B (en) * 2018-08-31 2019-11-22 腾讯科技(深圳)有限公司 Video dubs generation method, device, equipment and storage medium
CN110415680A (en) * 2018-09-05 2019-11-05 满金坝(深圳)科技有限公司 A kind of simultaneous interpretation method, synchronous translation apparatus and a kind of electronic equipment
CN110415680B (en) * 2018-09-05 2022-10-04 梁志军 Simultaneous interpretation method, simultaneous interpretation device and electronic equipment
US11328133B2 (en) 2018-09-28 2022-05-10 Beijing Baidu Netcom Science And Technology Co., Ltd. Translation processing method, translation processing device, and device
CN108986793A (en) * 2018-09-28 2018-12-11 北京百度网讯科技有限公司 translation processing method, device and equipment
WO2020077868A1 (en) * 2018-10-17 2020-04-23 深圳壹账通智能科技有限公司 Simultaneous interpretation method and apparatus, computer device and storage medium
CN109754808A (en) * 2018-12-13 2019-05-14 平安科技(深圳)有限公司 Method, apparatus, computer equipment and the storage medium of voice conversion text
CN112420008A (en) * 2019-08-22 2021-02-26 北京峰趣互联网信息服务有限公司 Method and device for recording songs, electronic equipment and storage medium
CN110610720B (en) * 2019-09-19 2022-02-25 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN110610720A (en) * 2019-09-19 2019-12-24 北京搜狗科技发展有限公司 Data processing method and device and data processing device
US11488577B2 (en) * 2019-09-27 2022-11-01 Baidu Online Network Technology (Beijing) Co., Ltd. Training method and apparatus for a speech synthesis model, and storage medium
CN110970014A (en) * 2019-10-31 2020-04-07 阿里巴巴集团控股有限公司 Voice conversion, file generation, broadcast, voice processing method, device and medium
CN110970014B (en) * 2019-10-31 2023-12-15 阿里巴巴集团控股有限公司 Voice conversion, file generation, broadcasting and voice processing method, equipment and medium
CN111105781A (en) * 2019-12-23 2020-05-05 联想(北京)有限公司 Voice processing method, device, electronic equipment and medium
CN111105781B (en) * 2019-12-23 2022-09-23 联想(北京)有限公司 Voice processing method, device, electronic equipment and medium
WO2021134592A1 (en) * 2019-12-31 2021-07-08 深圳市欢太科技有限公司 Speech processing method, apparatus and device, and storage medium
CN111368559A (en) * 2020-02-28 2020-07-03 北京字节跳动网络技术有限公司 Voice translation method and device, electronic equipment and storage medium
WO2021208531A1 (en) * 2020-04-16 2021-10-21 北京搜狗科技发展有限公司 Speech processing method and apparatus, and electronic device
CN111696518A (en) * 2020-06-05 2020-09-22 四川纵横六合科技股份有限公司 Automatic speech synthesis method based on text
CN111785258B (en) * 2020-07-13 2022-02-01 四川长虹电器股份有限公司 Personalized voice translation method and device based on speaker characteristics
CN111785258A (en) * 2020-07-13 2020-10-16 四川长虹电器股份有限公司 Personalized voice translation method and device based on speaker characteristics
CN113160793A (en) * 2021-04-23 2021-07-23 平安科技(深圳)有限公司 Speech synthesis method, device, equipment and storage medium based on low resource language
CN113362818A (en) * 2021-05-08 2021-09-07 山西三友和智慧信息技术股份有限公司 Voice interaction guidance system and method based on artificial intelligence
CN116343751A (en) * 2023-05-29 2023-06-27 深圳市泰为软件开发有限公司 Voice translation-based audio analysis method and device
CN116343751B (en) * 2023-05-29 2023-08-11 深圳市泰为软件开发有限公司 Voice translation-based audio analysis method and device

Also Published As

Publication number Publication date
CN108447486B (en) 2021-12-03
WO2019165748A1 (en) 2019-09-06

Similar Documents

Publication Publication Date Title
CN108447486A (en) A kind of voice translation method and device
CN110491382B (en) Speech recognition method and device based on artificial intelligence and speech interaction equipment
CN112735373B (en) Speech synthesis method, device, equipment and storage medium
WO2022078146A1 (en) Speech recognition method and apparatus, device, and storage medium
CN112037754B (en) Method for generating speech synthesis training data and related equipment
CN106531150B (en) Emotion synthesis method based on deep neural network model
CN109523989A (en) Phoneme synthesizing method, speech synthetic device, storage medium and electronic equipment
CN107731228A (en) The text conversion method and device of English voice messaging
CN108711420A (en) Multilingual hybrid model foundation, data capture method and device, electronic equipment
CN108231062A (en) A kind of voice translation method and device
CN109887484A (en) A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
GB2326320A (en) Text to speech synthesis using neural network
CN110010136A (en) The training and text analyzing method, apparatus, medium and equipment of prosody prediction model
KR20140071070A (en) Method and apparatus for learning pronunciation of foreign language using phonetic symbol
CN115953521B (en) Remote digital person rendering method, device and system
CN112463942A (en) Text processing method and device, electronic equipment and computer readable storage medium
CN111951781A (en) Chinese prosody boundary prediction method based on graph-to-sequence
Poncelet et al. Low resource end-to-end spoken language understanding with capsule networks
CN114387945A (en) Voice generation method and device, electronic equipment and storage medium
Bharti et al. Automated speech to sign language conversion using Google API and NLP
CN115424604B (en) Training method of voice synthesis model based on countermeasure generation network
CN112242134A (en) Speech synthesis method and device
CN115762471A (en) Voice synthesis method, device, equipment and storage medium
CN113257225B (en) Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics
Reddy et al. Speech-to-Text and Text-to-Speech Recognition Using Deep Learning

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant