CN108447486A - A kind of voice translation method and device - Google Patents
- Publication number
- CN108447486A (application number CN201810167142.5A)
- Authority
- CN
- China
- Prior art keywords
- unit
- voice
- text
- sample
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
- G10L13/033—Voice editing, e.g. manipulating the voice of the synthesiser
Abstract
This application discloses a voice translation method and device. The method includes: after a first target voice of a source speaker is obtained, generating a second target voice by performing voice translation on the first target voice, where the language of the second target voice differs from the language of the first target voice, and the second target voice carries the timbre characteristics of the source speaker. Because the timbre characteristics of the source speaker are taken into account when the source speaker's voice is translated, the translated voice also carries those timbre characteristics, so that it sounds more like a voice spoken directly by the source speaker.
Description
Technical field
This application relates to the field of computer technology, and in particular to a voice translation method and device.
Background technology
As artificial intelligence technology matures, people increasingly turn to intelligent technology to solve problems. For example, a person once had to spend a great deal of time learning a new language before being able to communicate with native speakers of that language. Now, a person can simply speak into a translator, which chains speech recognition, machine translation, and speech synthesis to recognize the spoken input, translate it, and pronounce the translated meaning.
However, with current voice translation technology, after the voice of a source speaker is translated, the resulting voice carries only the timbre characteristics of the speaker used in the speech synthesis model. To the ear, it has the timbre of an entirely different speaker from the source speaker.
Invention content
The main purpose of the embodiments of the present application is to provide a voice translation method and device that, when translating the voice of a source speaker, give the translated voice the timbre characteristics of the source speaker.

An embodiment of the present application provides a voice translation method, including:

obtaining a first target voice of a source speaker;

generating a second target voice by performing voice translation on the first target voice, where the language of the second target voice differs from the language of the first target voice, and the second target voice carries the timbre characteristics of the source speaker.
Optionally, generating the second target voice by performing voice translation on the first target voice includes:

generating a speech recognition text by performing speech recognition on the first target voice;

generating a translated text by performing text translation on the speech recognition text;

generating the second target voice by performing speech synthesis on the translated text.
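The three optional stages above can be sketched as a minimal pipeline. All three stage functions below are illustrative stand-ins assumed for the example (including the toy dictionary and the sample text "你好"), not components specified by this application:

```python
# Minimal sketch of the recognize -> translate -> synthesize pipeline.
# Every stage function here is a hypothetical placeholder.

def recognize_speech(audio: bytes) -> str:
    """Stand-in ASR: would convert the source speaker's audio to text."""
    return "你好"  # assume the Chinese greeting was recognized

def translate_text(text: str, target_lang: str) -> str:
    """Stand-in MT: would translate the recognized text into the target language."""
    return {"你好": "hello"}.get(text, text)  # toy dictionary lookup

def synthesize_speech(text: str, speaker_acoustics: dict) -> dict:
    """Stand-in TTS: would synthesize audio carrying the source speaker's timbre."""
    return {"text": text, "timbre": speaker_acoustics["timbre"]}

def translate_voice(audio: bytes, target_lang: str, speaker_acoustics: dict) -> dict:
    text = recognize_speech(audio)            # first target voice -> recognition text
    translated = translate_text(text, target_lang)  # recognition text -> translated text
    return synthesize_speech(translated, speaker_acoustics)  # -> second target voice

result = translate_voice(b"...", "en", {"timbre": "source-speaker"})
print(result)  # {'text': 'hello', 'timbre': 'source-speaker'}
```

The point of the sketch is only the data flow: the speaker's acoustic profile is threaded through to the final synthesis stage, which is what lets the translated voice keep the source timbre.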
Optionally, generating the second target voice by performing speech synthesis on the translated text includes:

cutting the translated text into text units of a preset size to obtain target text units;

obtaining the acoustic parameters of each target text unit, where the acoustic parameters carry the timbre characteristics of the source speaker;

performing speech synthesis on the translated text according to the acoustic parameters of each target text unit to generate the second target voice.
Optionally, the method further includes:

obtaining a first sample voice of the source speaker, where the language of the first sample voice is the same as the language of the second target voice;

cutting the recognition text of the first sample voice into text units of the preset size to obtain first sample text units;

extracting, from the first sample voice, the first speech fragment corresponding to each first sample text unit;

extracting acoustic parameters from the first speech fragment;

building a first acoustic model using each first sample text unit and the acoustic parameters corresponding to it.

Then, obtaining the acoustic parameters of each target text unit includes: obtaining the acoustic parameters of each target text unit using the first acoustic model.
Optionally, the method further includes:

obtaining a second sample voice of the source speaker, where the language of the second sample voice differs from the language of the second target voice;

cutting the recognition text of the second sample voice into text units of the preset size to obtain second sample text units;

converting each second sample text unit into a first converting text unit, where the first converting text unit is a text unit used by the language of the second target voice;

extracting, from the second sample voice, the second speech fragment corresponding to each second sample text unit;

extracting acoustic parameters from the second speech fragment to obtain the acoustic parameters corresponding to each first converting text unit;

building a second acoustic model using each second sample text unit, the first converting text unit corresponding to it, and the acoustic parameters corresponding to that first converting text unit.

Then, obtaining the acoustic parameters of each target text unit includes: obtaining the acoustic parameters of each target text unit using the second acoustic model.
Optionally, the method further includes:

collecting multiple first sample texts, where the language of the first sample texts is the same as the language of the second sample voice;

cutting the first sample texts into text units of the preset size to obtain third sample text units;

converting each third sample text unit into a second converting text unit, where the second converting text unit is a text unit that pronounces the third sample text unit in the articulation of the second target voice.

Then, converting the second sample text unit to obtain the first converting text unit includes:

determining a third sample text unit identical to the second sample text unit;

taking the second converting text unit corresponding to the determined third sample text unit as the first converting text unit.
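The lookup described above reduces to reusing a precomputed mapping from third sample text units to their second converting text units. A toy sketch, with the example characters and romanized pronunciation units assumed purely for illustration:

```python
# Toy sketch of the optional lookup: a third-sample text unit identical to the
# second-sample unit supplies its precomputed converted (pronunciation) unit.
# The units and pronunciations below are illustrative assumptions.

# third sample text unit -> second converting text unit
# (the unit as pronounced in the articulation of the second target voice)
second_converting = {"你": "ni", "好": "hao"}

def first_converting_unit(second_sample_unit: str) -> str:
    """Reuse the converted unit of the identical third sample text unit."""
    if second_sample_unit in second_converting:
        return second_converting[second_sample_unit]
    raise KeyError(f"no identical third sample text unit for {second_sample_unit!r}")

print([first_converting_unit(u) for u in ["你", "好"]])  # ['ni', 'hao']
```

The design point is that the conversion is computed once over the collected sample texts, then applied to the speaker's recordings by exact unit matching.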
Optionally, the method further includes:

collecting multiple second sample texts, where the language of the second sample texts is the same as the language of the second sample voice;

cutting the second sample texts into text units of the preset size to obtain fourth sample text units;

converting each fourth sample text unit into a third converting text unit, where the third converting text unit is a text unit that pronounces the fourth sample text unit in the articulation of the second target voice;

building an encoding/decoding model by learning, for the syllables in the second sample texts, the combination and order relations between fourth sample text units belonging to the same syllable, the combination and order relations between at least two consecutive syllables in the second sample texts, and the combination and order relations, within the second sample texts, of the fourth sample text units in those at least two consecutive syllables.

Then, converting the second sample text unit to obtain the first converting text unit includes: converting the second sample text unit using the encoding/decoding model to obtain the first converting text unit.
An embodiment of the present application also provides a speech translation device, including:

a voice obtaining unit, configured to obtain a first target voice of a source speaker;

a voice translation unit, configured to generate a second target voice by performing voice translation on the first target voice, where the language of the second target voice differs from the language of the first target voice, and the second target voice carries the timbre characteristics of the source speaker.

An embodiment of the present application also provides a speech translation device, including a processor, a memory, and a system bus, where the processor and the memory are connected by the system bus, the memory is configured to store one or more programs, and the one or more programs include instructions that, when executed by the processor, cause the processor to execute any one of the methods described above.

An embodiment of the present application also provides a computer-readable storage medium including instructions that, when run on a computer, cause the computer to execute any one of the methods described above.
In the voice translation method and device provided by the embodiments of the present application, after the first target voice of a source speaker is obtained, a second target voice is generated by performing voice translation on the first target voice, where the language of the second target voice differs from the language of the first target voice, and the second target voice carries the timbre characteristics of the source speaker. Because the timbre characteristics of the source speaker are taken into account when the source speaker's voice is translated, the translated voice also carries those characteristics, so that it sounds more like a voice spoken directly by the source speaker.
Description of the drawings
To explain the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is the first flow diagram of a voice translation method provided by an embodiment of the present application;
Fig. 2 is the second flow diagram of a voice translation method provided by an embodiment of the present application;
Fig. 3 is a schematic diagram of a speech synthesis model provided by an embodiment of the present application;
Fig. 4 is the first flow diagram of an acoustic model construction method provided by an embodiment of the present application;
Fig. 5 is the second flow diagram of an acoustic model construction method provided by an embodiment of the present application;
Fig. 6 is a flow diagram of a sample text unit collection method provided by an embodiment of the present application;
Fig. 7 is a schematic diagram of the relations between phoneme sequences provided by an embodiment of the present application;
Fig. 8 is a flow diagram of an encoding/decoding model construction method provided by an embodiment of the present application;
Fig. 9 is a schematic diagram of the encoding process provided by an embodiment of the present application;
Fig. 10 is a schematic diagram of the composition of a speech translation device provided by an embodiment of the present application;
Fig. 11 is a schematic diagram of the hardware structure of a speech translation device provided by an embodiment of the present application.
Specific implementation mode
With current voice translation technology, after the voice of a source speaker is translated, the resulting voice carries only the timbre characteristics of the speaker used in the synthesis model. To the ear, it has the timbre of an entirely different speaker: it sounds as if one person is speaking and another person is translating, the voice effect of two different people.

For this reason, the embodiments of the present application provide a voice translation method and device. When the voice of a source speaker is to be translated into another language, voice translation is performed using a speech synthesis model belonging to the source speaker, so that the translated voice has the timbre characteristics of the source speaker and sounds more like a voice spoken directly by the source speaker, thereby improving the user experience.
To make the purposes, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art from the embodiments in the present application without creative work fall within the protection scope of the present application.
First embodiment
Referring to Fig. 1, which is a flow diagram of a voice translation method provided by this embodiment, the method includes the following steps:
S101: Obtain the first target voice of the source speaker.
For ease of distinction, this embodiment defines the voice to be translated, that is, the voice before translation, as the first target voice, and defines the speaker who utters the first target voice as the source speaker.

This embodiment does not limit the source of the first target voice. For example, the first target voice may be someone's live speech or recorded speech, or special-effect speech obtained by machine processing of that live or recorded speech.

Nor does this embodiment limit the length of the first target voice. For example, the first target voice may be a word, a sentence, or a passage.
S102: Generate a second target voice by performing voice translation on the first target voice, where the language of the second target voice differs from the language of the first target voice, and the second target voice carries the timbre characteristics of the source speaker.
For ease of distinction, this embodiment defines the voice obtained by translating the first target voice as the second target voice. Note that when the first target voice is machine-processed special-effect speech as described above, the second target voice obtained by translation also needs to be given the same special-effect processing.

This embodiment does not limit the language types of the first target voice and the second target voice, provided the two languages differ while the voices are equivalent in meaning. For example, the first target voice is the Chinese expression for "hello" and the second target voice is the English "hello", or the first target voice is the English "hello" and the second target voice is the Chinese expression for "hello".

In practice, a user such as the source speaker can preset the post-translation language on the translator. After the speech synthesis model of the translator obtains the first target voice of the source speaker, voice translation can be performed so that the translated second target voice is in the preset language.

In this embodiment, the timbre characteristics of the source speaker can be collected in advance to build a speech synthesis model belonging to the source speaker. On this basis, when voice translation is performed on the first target voice of the source speaker, the speech synthesis model belonging to the source speaker can be used, so that the translated second target voice is endowed with the timbre characteristics of the source speaker. This kind of timbre adaptation makes the listener perceive the second target voice as having the speaking effect of the source speaker; that is, the voices before and after translation are the same or similar in timbre.
In summary, in the voice translation method provided by this embodiment, after the first target voice of a source speaker is obtained, a second target voice is generated by performing voice translation on the first target voice, where the language of the second target voice differs from the language of the first target voice, and the second target voice carries the timbre characteristics of the source speaker. Because the timbre characteristics of the source speaker are taken into account when translating the source speaker's voice, the translated voice also has those characteristics and sounds more like a voice spoken directly by the source speaker.
Second embodiment
With reference to the drawings, this embodiment introduces, through the following S202-S204, a specific implementation of S102 in the first embodiment.

Referring to Fig. 2, which is a flow diagram of a voice translation method provided by this embodiment, the method includes the following steps:

S201: Obtain the first target voice of the source speaker.

Note that S201 in this embodiment is the same as S101 in the first embodiment; for the related description, refer to the first embodiment, which is not repeated here.
S202: Generate a speech recognition text by performing speech recognition on the first target voice.

After the first target voice is obtained, it is converted into a speech recognition text by a speech recognition technique, for example one based on an artificial neural network. For example, if the first target voice is Chinese speech for "hello", performing speech recognition on it yields the corresponding Chinese text.
S203: Generate a translated text by performing text translation on the speech recognition text.

For example, suppose the pre-translation language is Chinese and the post-translation language is set to English. Then the speech recognition text is a Chinese text, and passing it through a Chinese-to-English translation model yields an English translated text; for example, translating the Chinese text for "hello" yields the English text "hello".
S204: Generate the second target voice by performing speech synthesis on the translated text, where the language of the second target voice differs from the language of the first target voice, and the second target voice carries the timbre characteristics of the source speaker.

In current voice translation, the timbre difference between the pre-translation and post-translation voices is very obvious. To overcome this defect, this embodiment can model the speech acoustic parameters of the source speaker in advance to obtain a speech synthesis model belonging to the source speaker. In this way, when the translated text is synthesized into speech, the speech synthesis model can be used so that the translated voice, i.e. the second target voice, has the timbre characteristics of the source speaker, achieving the auditory effect of the source speaker speaking and translating in person. For example, if the translated text is the English text "hello", the translated voice, i.e. the second target voice, is the English speech "hello".
Specifically, the speech synthesis model may include an acoustic model and a duration model, as shown in the speech synthesis model schematic of Fig. 3.

After the translated text of the first target voice is obtained, text analysis is first performed on the translated text to determine each syllable in it and the phonemes composing each syllable. The phoneme information is then input into the acoustic model shown in Fig. 3, which determines and outputs the acoustic parameters of each phoneme; these acoustic parameters carry the timbre characteristics of the source speaker and may include parameters such as spectrum and fundamental frequency. In addition, the phoneme information is also input into the duration model shown in Fig. 3, which outputs duration parameters. This embodiment does not limit how the duration parameters are determined. As an example, the speech rate of the first target voice may be determined, or a default speech rate used, and the time the translated text takes when read at that rate is computed and used as the duration parameter.
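As an arithmetic sketch of the default-speech-rate option just described (the rate of 8 phonemes per second and the phoneme sequence are illustrative assumptions, not values from this application):

```python
# Sketch of the duration-parameter computation: the time the translated
# text would take when read at a given (e.g. default) speech rate.

def duration_seconds(text_units: list[str], units_per_second: float) -> float:
    """Duration spent reading the text units at the given speech rate."""
    return len(text_units) / units_per_second

phonemes = ["HH", "AH", "L", "OW"]      # "hello" as a toy phoneme sequence
print(duration_seconds(phonemes, 8.0))  # 0.5 seconds at 8 phonemes/second
```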
Next, the speech synthesis model uses the acoustic parameters output by the acoustic model to make each phoneme in the translated text pronounce according to its corresponding acoustic parameters, and uses the duration parameters output by the duration model to pronounce according to the specified duration, so that the synthesized translated speech carries the timbre characteristics of the source speaker, yielding the second target voice.
In one implementation of this embodiment, S204 can be realized in the following way, which may include the following steps:
Step A: Cut the translated text into text units of a preset size to obtain target text units.

The translated text is divided into text units of a preset size. For example, when the translated text is a Chinese text, it can be divided in units of phonemes, characters, or words; when the translated text is an English text, it can be divided in units of phonemes, words, and so on. For ease of distinction, this embodiment defines each text unit divided out of the translated text as a target text unit.
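Step A can be sketched as follows; the two granularities shown (character for Chinese, word for English) are illustrative choices among those mentioned above, and the sample strings are assumptions:

```python
# Sketch of step A: cut the translated text into text units of a preset size.

def cut_text(text: str, unit: str) -> list[str]:
    """Divide text into target text units of the chosen granularity."""
    if unit == "char":   # e.g. a Chinese text cut per character
        return [c for c in text if not c.isspace()]
    if unit == "word":   # e.g. an English text cut per word
        return text.split()
    raise ValueError(f"unsupported unit size: {unit!r}")

print(cut_text("你好", "char"))         # ['你', '好']
print(cut_text("hello world", "word"))  # ['hello', 'world']
```

A phoneme-level cut, also mentioned above, would additionally require a pronunciation lexicon mapping each word or character to its phonemes.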
Step B: Obtain the acoustic parameters of each target text unit, where the acoustic parameters carry the timbre characteristics of the source speaker.

This embodiment can use the acoustic model shown in Fig. 3 to obtain the acoustic parameters of each target text unit. Because this acoustic model belongs to the source speaker, the acoustic parameters obtained with it carry the timbre characteristics of the source speaker.

Note that the construction method of the acoustic model shown in Fig. 3, and how to use it to obtain the acoustic parameters of target text units, are introduced in detail in the third embodiment below.
Step C: Perform speech synthesis on the translated text according to the acoustic parameters of each target text unit to generate the second target voice.

Once the acoustic parameters of each target text unit in the translated text are obtained through step B, for example parameters such as spectrum and fundamental frequency, the speech synthesis model shown in Fig. 3 can make each target text unit pronounce according to its corresponding acoustic parameters, thereby synthesizing the translated text into a second target voice with the timbre characteristics of the source speaker.
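Step C can be sketched as assembling each unit's parameter frame in order; the units and parameter values below are dummy placeholders for illustration, and a real system would feed such frames to a vocoder to produce waveform samples:

```python
# Sketch of step C: drive synthesis from each target text unit's acoustic
# parameters (spectrum, fundamental frequency). Values are dummy placeholders.

acoustic_params = {  # per-unit parameters as an acoustic model might supply them
    "hel": {"spectrum": [0.1, 0.2], "f0_hz": 120.0},
    "lo":  {"spectrum": [0.3, 0.4], "f0_hz": 118.0},
}

def synthesize(units: list[str]) -> list[dict]:
    """Concatenate each unit's parameter frame in order, ready for a vocoder."""
    return [acoustic_params[u] for u in units]

frames = synthesize(["hel", "lo"])
print([f["f0_hz"] for f in frames])  # [120.0, 118.0]
```

Because the parameters come from a model trained on the source speaker, the frames, and hence the synthesized voice, carry the speaker's timbre.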
In summary, in the voice translation method provided by this embodiment, after the first target voice of a source speaker is obtained, text translation is performed on the speech recognition text of the first target voice, and speech synthesis is then performed by obtaining the acoustic parameters of each text unit in the translated text, generating the second target voice. Because the acoustic parameters carry the timbre characteristics of the source speaker, the translated voice also has those characteristics and sounds more like a voice spoken directly by the source speaker.
Third embodiment
This embodiment introduces the construction method of the acoustic model in the second embodiment and the specific implementation of step B in the second embodiment, i.e. how to use the acoustic model to obtain the acoustic parameters of target text units.

In this embodiment, when the source speaker uses the translator for the first time, he or she can be prompted to record speech according to a specification, to be used for building the acoustic model. The recording content is optional, and the source speaker can choose the recording language according to his or her own reading ability. That is, the recording language chosen by the source speaker can be the same as, or different from, the language of the translated voice (i.e. the second target voice). This embodiment introduces the acoustic model construction method for each of these two language-selection results.
In the first construction method of the acoustic model, the recording language chosen by the source speaker is the same as the language of the translated voice (i.e. the second target voice). This model construction method is introduced below.
Referring to Fig. 4, which is a flow diagram of an acoustic model construction method provided by this embodiment, the method includes the following steps:
S401:Obtain the first sample voice of the source speaker, wherein the languages of the first sample voice with it is described
The languages of second target voice are identical.
It in the present embodiment, can be special according to the tone color of source speaker in order to make voice i.e. the second target voice after translation
Sign is pronounced, and one section of recording of source speaker can be obtained, this section recording can be identical as the languages of voice after translation, and
And the correspondence text of this section recording, all phoneme contents of text languages should be covered as possible.
For ease of distinguishing, this section recording is defined as first sample voice by the present embodiment.
Take the case where the pre-translation speech (i.e., the first target speech) is Chinese and the translated speech (i.e., the second target speech) is English. First, confirm whether the source speaker is able to read English aloud normally; for example, the translator may ask the source speaker whether he or she can read English. If the source speaker replies "I can read English" by voice, button press, or another means, the translator may present a short fixed English text and prompt the source speaker to read it aloud. The fixed English text should cover all English phonemes as far as possible. The source speaker reads the fixed English text aloud, so that the translator obtains the speech of the fixed English text; this speech is the first sample speech.
S402: Segment the recognition text of the first sample speech according to text units of the preset size, obtaining first sample text units.
After the first sample speech is obtained, it is converted into a speech recognition text by a speech recognition technology, for example one based on an artificial neural network. The speech recognition text is then divided according to text units of the preset size (the same division unit as in step A of the second embodiment), for example with the phoneme as the unit. For ease of distinction, this embodiment defines each text unit divided from the speech recognition text as a first sample text unit.
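A minimal sketch of this segmentation step, assuming a hypothetical pronunciation lexicon that maps each word of the recognition text to its phonemes (the lexicon entries and `<unk>` placeholder are illustrative, not from the patent):

```python
# Segment a recognition text into phoneme-sized text units,
# using a toy pronunciation lexicon (illustrative entries only).
LEXICON = {
    "ni": ["n", "i"],
    "hao": ["h", "ao"],
}

def split_into_phonemes(words):
    """Return the first sample text units (phonemes) for a word list."""
    units = []
    for w in words:
        units.extend(LEXICON.get(w, ["<unk>"]))  # unknown words get a placeholder
    return units

print(split_into_phonemes(["ni", "hao"]))  # phoneme units of the sample text
```

In a real system the lexicon would be a full grapheme-to-phoneme resource for the chosen language.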
S403: Extract, from the first sample speech, a first speech segment corresponding to each first sample text unit, and extract acoustic parameters from the first speech segment.
The first sample speech is divided in the same way as the recognition text of the first sample speech, so that the speech segment corresponding to each first sample text unit within the first sample speech can be determined. For example, both the recognition text and the first sample speech are divided with the phoneme as the unit, obtaining the speech segment corresponding to each phoneme in the recognition text. For ease of distinction, this embodiment defines the speech segment corresponding to a first sample text unit as a first speech segment.
For each first sample text unit, the corresponding acoustic parameters, such as spectrum and fundamental frequency, are extracted from the corresponding first speech segment; in this way, the timbre characteristic data of the source speaker is obtained.
S404: Build a first acoustic model using each first sample text unit and the acoustic parameters corresponding to the first sample text unit.
Each first sample text unit and its corresponding acoustic parameters may be stored to form a first data set. Taking the case where the text unit in the first data set is the phoneme, it should be noted that if the first data set cannot cover all the phonemes of the post-translation language, the uncovered phonemes, together with default acoustic parameters set for them, may be added to the first data set. An acoustic model belonging to the source speaker can then be built based on the correspondence between first sample text units and acoustic parameters in the first data set. In concrete construction, the first data set is directly used as training data to train the acoustic model of the source speaker; the training process is the same as in the prior art. This embodiment defines the built acoustic model as the first acoustic model.
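A minimal sketch of assembling the first data set with default parameters for uncovered phonemes; the phoneme inventory, parameter names, and default values are illustrative assumptions, not from the patent.

```python
# Build the first data set: observed phoneme -> acoustic parameters,
# filling uncovered phonemes of the target language with defaults.
TARGET_PHONEMES = {"n", "i", "h", "ao", "e"}   # illustrative inventory
DEFAULT_PARAMS = {"energy": 0.0}               # illustrative defaults

def build_first_data_set(observed):
    data = dict(observed)
    for phoneme in TARGET_PHONEMES - data.keys():
        data[phoneme] = dict(DEFAULT_PARAMS)   # default for uncovered phoneme
    return data

ds = build_first_data_set({"n": {"energy": 0.17}, "i": {"energy": 0.06}})
print(sorted(ds))  # the data set now covers the full inventory
```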
In one embodiment, the first acoustic model may be used to implement step B in the second embodiment, "obtaining the acoustic parameters of each target text unit"; specifically: using the first acoustic model, obtain the acoustic parameters of each target text unit. In this embodiment, the acoustic parameters of each target text unit are generated directly with the acoustic model of the source speaker, i.e., the first acoustic model. The concrete generation method may be the same as in the prior art, for example an existing parameter-based speech synthesis method.
In the second construction method, the recording language chosen by the source speaker is different from the language of the translated speech (i.e., the second target speech). This model construction method is described below.
Referring to Fig. 5, which is a flow diagram of another acoustic model construction method provided in this embodiment, the method includes the following steps:
S501: Obtain a second sample speech of the source speaker, wherein the language of the second sample speech is different from the language of the second target speech.
In this embodiment, in order for the translated speech (i.e., the second target speech) to be pronounced with the timbre characteristics of the source speaker, a recording of the source speaker may be obtained. The language of this recording may be different from that of the translated speech; for example, it may be the same as that of the pre-translation speech, i.e., the first target speech. The text corresponding to the recording should cover, as far as possible, all the phonemes of its language. For ease of distinction, this embodiment defines this recording as the second sample speech.
Again take the case where the pre-translation speech (i.e., the first target speech) is Chinese and the translated speech (i.e., the second target speech) is English. First, confirm whether the source speaker is able to read English aloud normally; for example, the translator may ask the source speaker whether he or she can read English. If the source speaker replies "I cannot read English" by voice, button press, or another means, the translator may offer language options. Suppose the source speaker selects Chinese; the translator may then present a short fixed Chinese text and prompt the source speaker to read it aloud. The fixed Chinese text should cover all Chinese phonemes as far as possible. The source speaker reads the fixed Chinese text aloud, so that the translator obtains the speech of the fixed Chinese text; this speech is the second sample speech.
S502: Segment the recognition text of the second sample speech according to text units of the preset size, obtaining second sample text units.
After the second sample speech is obtained, it is converted into a speech recognition text by a speech recognition technology, for example one based on an artificial neural network. The speech recognition text is then divided according to text units of the preset size (the same division unit as in step A of the second embodiment), for example with the phoneme as the unit. For ease of distinction, this embodiment defines each text unit divided from the speech recognition text as a second sample text unit.
S503: Convert each second sample text unit to obtain a first converted text unit, wherein the first converted text unit is a text unit used by the language of the second target speech.
For each second sample text unit, the second sample text unit needs to be converted into the corresponding text unit of the post-translation language; this embodiment defines the converted text unit as the first converted text unit. For example, if the second sample text unit is a Chinese phoneme and the post-translation language is English, the first converted text unit is an English phoneme. It should be noted that the specific text unit conversion method will be described in the fourth embodiment below.
S504: Extract, from the second sample speech, a second speech segment corresponding to each second sample text unit, and extract acoustic parameters from the second speech segment, obtaining the acoustic parameters corresponding to the first converted text unit.
The second sample speech is divided in the same way as the recognition text of the second sample speech, so that the speech segment corresponding to each second sample text unit within the second sample speech can be determined. For example, both the recognition text and the second sample speech are divided with the phoneme as the unit, obtaining the speech segment corresponding to each phoneme in the recognition text. For ease of distinction, this embodiment defines the speech segment corresponding to a second sample text unit as a second speech segment.
For each second sample text unit, the corresponding acoustic parameters, such as spectrum and fundamental frequency, are extracted from the corresponding second speech segment and taken as the acoustic parameters of the first converted text unit corresponding to that second sample text unit.
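A minimal sketch of pairing each source-language phoneme with its converted target-language phoneme and the parameters extracted from its speech segment; the Chinese-to-English phoneme mapping here is purely illustrative, not the patent's table.

```python
# Form (second sample unit, first converted unit, acoustic params) triples
# for the second data set. The phoneme mapping is illustrative only.
PHONEME_MAP = {"n": "n", "i": "I", "h": "h", "ao": "aU"}

def build_triples(segment_params):
    """segment_params: source phoneme -> extracted acoustic parameters."""
    return [(src, PHONEME_MAP[src], params)
            for src, params in segment_params.items()
            if src in PHONEME_MAP]           # skip phonemes with no mapping

triples = build_triples({"n": {"energy": 0.17}, "ao": {"energy": 0.09}})
print(triples)
```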
S505: Build a second acoustic model using each second sample text unit, the first converted text unit corresponding to the second sample text unit, and the acoustic parameters corresponding to the first converted text unit.
Each second sample text unit, the first converted text unit corresponding to it, and the acoustic parameters corresponding to each first converted text unit may be stored to form a second data set. Taking the case where the text unit in the second data set is the phoneme, it should be noted that if the second data set cannot cover all the phonemes of the post-translation language, the uncovered phonemes, together with default acoustic parameters set for them, may be added to the second data set. An acoustic model belonging to the source speaker can then be built based on the correspondence in the second data set between pre-conversion phonemes, post-conversion phonemes, and the acoustic parameters of the post-conversion phonemes. In concrete construction, the second data set is directly used as training data to train the acoustic model of the source speaker; the training process is the same as in the prior art. This embodiment defines the built acoustic model as the second acoustic model.
In one embodiment, the second acoustic model may be used to implement step B in the second embodiment, "obtaining the acoustic parameters of each target text unit"; specifically: using the second acoustic model, obtain the acoustic parameters of each target text unit. In this embodiment, the acoustic parameters of each target text unit are generated directly with the acoustic model of the source speaker, i.e., the second acoustic model. The concrete generation method may be the same as in the prior art, for example an existing parameter-based speech synthesis method.
In summary, in the speech translation method provided by this embodiment, after the first target speech of the source speaker is obtained, text translation is performed on the speech recognition text of the first target speech; then speech synthesis is performed by obtaining the acoustic parameters of each text unit in the translated text, generating the second target speech. The acoustic parameters of each text unit can be determined by the acoustic model of the source speaker built in advance. Since the acoustic parameters carry the timbre characteristics of the source speaker, the translated speech also has the timbre characteristics of the source speaker, so that the translated speech sounds more like speech spoken directly by the source speaker.
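The overall flow summarized above can be sketched as a pipeline of stubbed stages; every function body here is a placeholder standing in for the recognizer, translator, acoustic model lookup, and synthesizer the patent describes, and the word map and parameter values are invented for illustration.

```python
# End-to-end sketch of the described pipeline; each stage is a stub.
def recognize(speech):            # speech recognition
    return speech["text"]

def translate(text):              # text translation (stubbed word map)
    return {"ni hao": "hello"}.get(text, text)

def to_units(text):               # split translated text into units
    return list(text)

def acoustic_params(unit, model): # look up speaker-specific parameters
    return model.get(unit, {"energy": 0.0})

def synthesize(units, model):     # parameter-based synthesis (stubbed)
    return [(u, acoustic_params(u, model)) for u in units]

speaker_model = {"h": {"energy": 0.2}, "e": {"energy": 0.1}}
first_target = {"text": "ni hao"}
second_target = synthesize(to_units(translate(recognize(first_target))),
                           speaker_model)
print(second_target[0])  # first synthesized unit with its parameters
```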
Fourth embodiment
This embodiment introduces a specific implementation of S503 in the third embodiment. To implement S503, a text unit mapping model needs to be built in advance, and S503 is then implemented using the text unit conversion function of this text unit mapping model. This embodiment describes two construction methods for the text unit mapping model.
In the first construction method of the text unit mapping model, a correspondence between the text unit sequences of the two languages is established directly, and the conversion between text units is realized according to this correspondence. This model construction method is described below.
As shown in Fig. 6, which is a flow diagram of a sample text unit collection method provided in this embodiment, the method includes the following steps:
S601: Collect a plurality of first sample texts, wherein the language of the first sample texts is the same as the language of the second sample speech.
To implement S503, that is, to convert each second sample text unit in the recognition text of the second sample speech (i.e., the recorded speech of the source speaker) into a text unit used by the post-translation language, a large amount of text corpus in the same language as the second sample speech needs to be collected in advance. This embodiment defines each collected text corpus item as a first sample text. This embodiment does not limit the form of the first sample text: it may be a word, a sentence, or a passage.
For example, suppose the second sample speech is Chinese speech; then a large amount of Chinese text corpus needs to be collected in advance (as shown in Fig. 7), each Chinese text being a first sample text.
S602: Segment each first sample text according to text units of the preset size, obtaining third sample text units.
Each first sample text is divided according to text units of the preset size (the same division unit as in step A of the second embodiment), for example with the phoneme as the unit. For ease of distinction, this embodiment defines each text unit divided from the first sample text as a third sample text unit.
Continuing the example above, suppose the first sample text is a Chinese text; the Chinese text needs to be converted into Chinese pinyin, and each Chinese phoneme in the pinyin is labeled, obtaining a Chinese phoneme sequence (as shown in Fig. 7). For example, for the Chinese text meaning "hello", the pinyin "[n i] [h ao]" can be obtained, from which the four Chinese phonemes "n", "i", "h", "ao" are labeled in turn, i.e., four third sample text units.
S603: Convert each third sample text unit to obtain a second converted text unit, wherein the second converted text unit is the text unit with which the third sample text unit is pronounced in the pronunciation style of the second target speech.
The pronunciation of the first sample text is labeled in the pronunciation style of the translated speech, i.e., the second target speech; in this way, the text unit corresponding to each third sample text unit in the first sample text can be found from the labeled pronunciation. For ease of distinction, this embodiment defines this text unit as the second converted text unit.
Continuing the example above, suppose the first sample text is the Chinese text meaning "hello" and the translated speech, i.e., the second target speech, is English speech. Then the pronunciation of that text can be labeled in the form of English phonetic symbols, from which four English phonemes are labeled in turn: "n", "I", "h", and a fourth phoneme, i.e., four second converted text units. In this way, the four third sample text units "n", "i", "h", "ao" in Chinese form correspond in turn to the four second converted text units in English form.
It will be understood that the same Chinese character may be pronounced differently in different Chinese words or sentences, so the second converted text units corresponding to the third sample text units forming that character may also differ; this situation also exists in other languages. In this embodiment, however, it is sufficient that the labeled pronunciations before and after conversion follow fixed pronunciation rules.
Based on the above, each third sample text unit and the second converted text unit corresponding to it can be stored to form a text unit set. It should be noted that, since the second converted text units in the text unit set belong to the phonemes of the post-translation language, the second converted text units in the set should cover all the text units of the post-translation language as far as possible.
When the text unit mapping model is built, a tabular mapping can be made directly between each third sample text unit in the text unit set and its corresponding second converted text unit. On this basis, the text unit mapping model can implement step S503 in the third embodiment based on the mapping relations between text units.
In a first implementation, "converting the second sample text unit to obtain the first converted text unit" in step S503 may specifically include: determining the third sample text unit identical to the second sample text unit, and taking the second converted text unit corresponding to the determined third sample text unit as the first converted text unit. In this implementation, for each second sample text unit, the third sample text unit identical to it is looked up in the above phoneme set, and the second converted text unit corresponding to that third sample text unit is determined based on the phoneme mapping relations and taken as the converted phoneme of the second sample text unit, i.e., the first converted text unit.
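A minimal sketch of this table-lookup implementation; the mapping entries are illustrative stand-ins, not the patent's actual phoneme table.

```python
# First implementation of S503: look up each second sample text unit in
# a prebuilt table of (third sample unit -> second converted unit) pairs.
MAPPING_TABLE = {"n": "n", "i": "I", "h": "h", "ao": "aU"}  # illustrative

def convert_units(second_sample_units, table=MAPPING_TABLE):
    converted = []
    for unit in second_sample_units:
        if unit in table:                  # identical third sample unit found
            converted.append(table[unit])  # its second converted unit
        else:
            converted.append(None)         # no mapping available
    return converted

print(convert_units(["n", "i", "h", "ao"]))  # first converted text units
```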
In the second construction method of the text unit mapping model, a network model between the text unit sequences of the two languages, such as the encoding/decoding model shown in Fig. 7, is trained, and this network model is used as the text unit mapping model. Such a text unit mapping model can make the text unit mapping results more accurate. This model construction method is described below.
For the second construction method, see Fig. 8, a flow diagram of an encoding/decoding model construction method, which includes the following steps:
S801: Collect a plurality of second sample texts, wherein the language of the second sample texts is the same as the language of the second sample speech.
It should be noted that step S801 is similar to step S601; it is only necessary to replace the first sample text in S601 with the second sample text. For the related content, refer to the description of S601, which is not repeated here.
S802: Segment each second sample text according to text units of the preset size, obtaining fourth sample text units.
It should be noted that step S802 is similar to step S602; it is only necessary to replace the first sample text in S602 with the second sample text, and the third sample text unit with the fourth sample text unit. For the related content, refer to the description of S602, which is not repeated here.
S803: Convert each fourth sample text unit to obtain a third converted text unit, wherein the third converted text unit is the text unit with which the fourth sample text unit is pronounced in the pronunciation style of the second target speech.
It should be noted that step S803 is similar to step S603; it is only necessary to replace the third sample text unit in S603 with the fourth sample text unit, and the second converted text unit with the third converted text unit. For the related content, refer to the description of S603, which is not repeated here.
S804: For the syllables in the second sample text, build an encoding/decoding model by learning the combination relations and order relations, within the corresponding syllable, of the fourth sample text units belonging to the same syllable; learning the combination relations and order relations of at least two consecutive syllables in the second sample text; and learning the combination relations and order relations, in the second sample text, of the fourth sample text units in the at least two consecutive syllables.
In this embodiment, the fourth sample text unit sequences and the third converted text unit sequences can be used to train a network model between the text unit systems of the two languages; this network model may include the coding network and decoding network shown in Fig. 7. In the following, the coding network is introduced taking the fourth sample text unit sequence as a Chinese phoneme sequence and the third converted text unit sequence as an English phoneme sequence.
Specifically, the coding network realizes linking between different syllables by adding a layer of syllable information, which serves to optimize both the phoneme combinations within syllables and the overall phoneme mapping. The coding network may include three coding processes: the coding of the phonemes within each syllable, the coding between syllables, and the coding of all phonemes in the text. At each step, subsequent coding needs to take the preceding coding results into account. The coding process of the coding network is introduced below taking Fig. 9 as an example.
As shown in Fig. 9, suppose a collected second sample text is the Chinese text meaning "hello"; then the fourth sample text unit sequence is "n", "i", "h", "ao". First, all the Chinese phonemes "n", "i", "h", "ao" belonging to the Chinese text are uniformly vectorized, for example using methods such as Word2Vector, and the Chinese phonemes belonging to the same syllable are encoded by a bidirectional long short-term memory network (Bidirectional Long Short-Term Memory, BLSTM). The resulting coding result contains the relations between phonemes within a syllable: it learns the combination relation and order relation between "n" and "i", corresponding to the Chinese syllable "ni", and the combination relation and order relation between "h" and "ao", corresponding to the Chinese syllable "hao".
Then all the syllables "ni", "hao" of the Chinese text are vectorized, for example using methods such as Word2Vector. After the coding results of the first-layer BLSTM network (the within-syllable phoneme learning network shown in Fig. 9) are obtained, the first-layer coding results are combined with the vector of each syllable and encoded between syllables by a bidirectional BLSTM network. The resulting coding result contains the relations between syllables: it learns the combination relation and order relation between "ni" and "hao", corresponding to the Chinese text meaning "hello".
Finally, the coding result of the second-layer BLSTM network (the inter-syllable learning network shown in Fig. 9) is combined with the vector features of all phonemes in each syllable for a third layer of BLSTM coding. The resulting coding result contains the relations between the phonemes in the Chinese text: it learns the combination relations and order relations among "n", "i", "h", "ao", corresponding to the Chinese text meaning "hello".
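The three-level hierarchy (phonemes within a syllable, syllables within the text, all phonemes in the text) can be illustrated with a toy encoder that replaces each BLSTM layer with simple vector averaging; this only demonstrates the data flow between the three levels, not the actual network, and the phoneme vectors are invented.

```python
# Toy hierarchical "encoder": each level averages the vectors below it,
# standing in for the three BLSTM coding layers described in the patent.
PHONEME_VECS = {"n": [1.0, 0.0], "i": [0.0, 1.0],
                "h": [1.0, 1.0], "ao": [0.0, 0.0]}  # illustrative vectors

def mean(vectors):
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def encode(syllables):
    """syllables: list of phoneme lists, e.g. [['n','i'], ['h','ao']]."""
    level1 = [mean([PHONEME_VECS[p] for p in syl]) for syl in syllables]  # within syllable
    level2 = mean(level1)                                                 # between syllables
    level3 = mean([PHONEME_VECS[p] for syl in syllables for p in syl])    # all phonemes
    return level2, level3

print(encode([["n", "i"], ["h", "ao"]]))
```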
After the above three layers of coding, the third-layer coding result is used as the input of the decoding network shown in Fig. 7, and the decoding network shown in Fig. 7 outputs the corresponding English phoneme sequence: "n", "I", "h", and a fourth phoneme.
It will be understood that, when the encoding/decoding model is trained with a large amount of Chinese text, the encoding/decoding model learns the combination relations and order relations between two or more syllables, and also learns the combination relations and order relations of each phoneme within its syllable. When the Chinese phoneme sequence of a Chinese text needs to be converted into an English phoneme sequence, based on these learning results, the Chinese phoneme sequence of the Chinese text can be used, according to its combination relations and order relations in the Chinese text, to select the English phoneme sequence that best matches it. Moreover, whether the Chinese text is a shorter word or a longer sentence, the corresponding English phoneme sequence has good linking; this approach makes the correspondence results between phoneme sequences more flexible and accurate.
It should be noted that the encoding/decoding model is not limited to training between Chinese phoneme sequences and English phoneme sequences; it is applicable between any two different languages.
Based on the above, step S503 in the third embodiment can be implemented based on the learning results of the encoding/decoding model. In a second implementation, "converting the second sample text unit to obtain the first converted text unit" in step S503 may specifically include: converting the second sample text unit using the encoding/decoding model to obtain the first converted text unit. In this implementation, the second sample text units are used as the input of the encoding/decoding model built in advance, and the output is the converted first converted text units. During conversion, the encoding/decoding model can, based on the above learning results, select the first converted text unit that matches each second sample text unit according to the combination relations and order relations between the second sample text units. Compared with the first implementation of S503, because this implementation has learned in advance the actual collocation patterns between the text unit sequences of different languages, the converted text units are more accurate.
In summary, in the speech translation method provided by this embodiment, when the text unit sequence of the recognition text of the source speaker's recording needs to be converted, that is, converted into the text unit sequence of the post-translation language, a text unit mapping model can be built in advance, either based on the correspondence between the text unit sequences of different languages or by training an encoding/decoding network. The required text unit conversion results can then be obtained by performing text unit conversion with this text unit mapping model.
Fifth embodiment
Referring to Figure 10, which is a schematic composition diagram of a speech translation device provided in this embodiment, the speech translation device 1000 includes:
a speech acquisition unit 1001 for obtaining a first target speech of the source speaker;
a speech translation unit 1002 for generating a second target speech by performing speech translation on the first target speech, wherein the language of the second target speech is different from the language of the first target speech, and the second target speech carries the timbre characteristics of the source speaker.
In one implementation of this embodiment, the speech translation unit 1002 may include:
a text recognition subunit for generating a speech recognition text by performing speech recognition on the first target speech;
a text translation subunit for generating a translated text by performing text translation on the speech recognition text;
a speech translation subunit for generating the second target speech by performing speech synthesis on the translated text.
In one implementation of this embodiment, the speech translation subunit may include:
a target unit division subunit for segmenting the translated text according to text units of a preset size, obtaining target text units;
an acoustic parameter acquisition subunit for obtaining the acoustic parameters of each target text unit, wherein the acoustic parameters carry the timbre characteristics of the source speaker;
a translated speech generation subunit for performing speech synthesis on the translated text according to the acoustic parameters of each target text unit, generating the second target speech.
In one implementation of this embodiment, the device 1000 may further include:
a first sample acquisition unit for obtaining a first sample speech of the source speaker, wherein the language of the first sample speech is the same as the language of the second target speech;
a first sample division unit for segmenting the recognition text of the first sample speech according to the text units of the preset size, obtaining first sample text units;
a first segment extraction unit for extracting, from the first sample speech, a first speech segment corresponding to the first sample text unit;
a first parameter extraction unit for extracting acoustic parameters from the first speech segment;
a first model construction unit for building a first acoustic model using each first sample text unit and the acoustic parameters corresponding to the first sample text unit;
the acoustic parameter acquisition subunit may then be specifically used to obtain the acoustic parameters of each target text unit using the first acoustic model.
In one implementation of this embodiment, the device 1000 may further include:
a second sample acquisition unit for obtaining a second sample speech of the source speaker, wherein the language of the second sample speech is different from the language of the second target speech;
a second sample division unit for segmenting the recognition text of the second sample speech according to the text units of the preset size, obtaining second sample text units;
a text unit conversion unit for converting the second sample text unit to obtain a first converted text unit, wherein the first converted text unit is a text unit used by the language of the second target speech;
a second segment extraction unit for extracting, from the second sample speech, a second speech segment corresponding to the second sample text unit;
a second parameter extraction unit for extracting acoustic parameters from the second speech segment, obtaining the acoustic parameters corresponding to the first converted text unit;
a second model construction unit for building a second acoustic model using each second sample text unit, the first converted text unit corresponding to the second sample text unit, and the acoustic parameters corresponding to the first converted text unit;
the acoustic parameter acquisition subunit may then be specifically used to obtain the acoustic parameters of each target text unit using the second acoustic model.
In one implementation of this embodiment, the device 1000 may further include:
a first text collection unit for collecting a plurality of first sample texts, wherein the language of the first sample texts is the same as the language of the second sample speech;
a third sample division unit for segmenting the first sample text according to the text units of the preset size, obtaining third sample text units;
a first unit conversion subunit for converting the third sample text unit to obtain a second converted text unit, wherein the second converted text unit is the text unit with which the third sample text unit is pronounced in the pronunciation style of the second target speech;
the text unit conversion unit may then include:
an identical unit determination subunit for determining the third sample text unit identical to the second sample text unit;
a text unit conversion subunit for taking the second converted text unit corresponding to the determined third sample text unit as the first converted text unit.
In one implementation of this embodiment, the apparatus 1000 may further include:
A second text collection unit, configured to collect multiple second sample texts, wherein the language of the second sample texts is the same as the language of the second sample voice;
A fourth sample segmentation unit, configured to segment the second sample texts according to text units of the preset size, to obtain each fourth sample text unit;
A second unit conversion unit, configured to convert the fourth sample text units to obtain third converted text units, wherein a third converted text unit is a text unit in which the fourth sample text unit is pronounced in the pronunciation manner of the second target voice;
A coding/decoding model construction unit, configured to build a coding/decoding model for the syllables in the second sample texts by learning the combination relations and order relations, within the corresponding syllable, of the fourth sample text units belonging to the same syllable, learning the combination relations and order relations of at least two consecutive syllables in the second sample texts, and learning the combination relations and order relations, in the second sample texts, of the fourth sample text units in at least two consecutive syllables.
Then, the text unit conversion unit may specifically be configured to convert the second sample text units by using the coding/decoding model, to obtain the first converted text units.
Sixth embodiment
Referring to Figure 11, which is a hardware architecture diagram of a speech translation apparatus provided in this embodiment, the speech translation apparatus 1100 includes a memory 1101 and a receiver 1102, and a processor 1103 connected to the memory 1101 and the receiver 1102 respectively, wherein the memory 1101 is configured to store a set of program instructions, and the processor 1103 is configured to call the program instructions stored in the memory 1101 to perform the following operations:
Obtaining a first target voice of a source speaker;
Generating a second target voice by performing speech translation on the first target voice, wherein the language of the second target voice is different from the language of the first target voice, and the second target voice carries the timbre feature of the source speaker.
In one implementation of this embodiment, the processor 1103 is further configured to call the program instructions stored in the memory 1101 to perform the following operations:
Generating a speech recognition text by performing speech recognition on the first target voice;
Generating a translation text by performing text translation on the speech recognition text;
Generating the second target voice by performing speech synthesis on the translation text.
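The three operations above form a cascade: speech recognition, text translation, then speech synthesis. As a minimal illustrative sketch (the stage functions are hypothetical placeholders, not APIs from this patent), the cascade can be expressed as:

```python
def speech_translate(first_target_voice, recognize, translate, synthesize):
    """Cascade described above: speech recognition -> text translation ->
    speech synthesis. Each stage is passed in as a callable placeholder."""
    speech_recognition_text = recognize(first_target_voice)
    translation_text = translate(speech_recognition_text)
    second_target_voice = synthesize(translation_text)
    return second_target_voice

# Toy stand-in stages, for illustration only.
second_voice = speech_translate(
    b"\x00\x01",                                  # fake audio bytes
    recognize=lambda audio: "hello world",        # ASR stub
    translate=lambda text: "bonjour le monde",    # MT stub
    synthesize=lambda text: f"<speech:{text}>",   # TTS stub
)
```

Passing the stages as callables keeps the sketch independent of any particular recognition, translation, or synthesis engine.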
In one implementation of this embodiment, the processor 1103 is further configured to call the program instructions stored in the memory 1101 to perform the following operations:
Segmenting the translation text according to text units of a preset size, to obtain each target text unit;
Obtaining acoustic parameters of each target text unit, wherein the acoustic parameters carry the timbre feature of the source speaker;
Performing speech synthesis on the translation text according to the acoustic parameters of each target text unit, to generate the second target voice.
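As a rough sketch of the segmentation step (assuming, purely for illustration, that a "text unit" is a fixed number of characters; the patent leaves the unit granularity open):

```python
def segment_into_units(translation_text: str, preset_size: int):
    """Segment the translation text into text units of a preset size;
    the final unit may be shorter when the text length is not a multiple."""
    return [translation_text[i:i + preset_size]
            for i in range(0, len(translation_text), preset_size)]

units = segment_into_units("bonjourlemonde", 4)  # ['bonj', 'ourl', 'emon', 'de']
```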
In one implementation of this embodiment, the processor 1103 is further configured to call the program instructions stored in the memory 1101 to perform the following operations:
Obtaining a first sample voice of the source speaker, wherein the language of the first sample voice is the same as the language of the second target voice;
Segmenting the recognition text of the first sample voice according to text units of the preset size, to obtain each first sample text unit;
Extracting, from the first sample voice, a first speech fragment corresponding to the first sample text unit;
Extracting acoustic parameters from the first speech fragment;
Building a first acoustic model by using each first sample text unit and the acoustic parameters corresponding to the first sample text unit;
Obtaining the acoustic parameters of each target text unit by using the first acoustic model.
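Conceptually, the first acoustic model maps each sample text unit to the acoustic parameters extracted from the speaker's own speech fragment. A minimal dictionary-based sketch (the parameter extractor is a stub; a real system would extract spectrum, fundamental frequency, etc., and would generalize beyond seen units, e.g. with a neural network):

```python
def build_first_acoustic_model(unit_fragment_pairs, extract_params):
    """unit_fragment_pairs: iterable of (sample_text_unit, speech_fragment).
    Returns a mapping from text unit to its extracted acoustic parameters."""
    return {unit: extract_params(fragment)
            for unit, fragment in unit_fragment_pairs}

def acoustic_params_for(model, target_text_units):
    """Look up the acoustic parameters of each target text unit."""
    return [model.get(unit) for unit in target_text_units]

# Stub extractor: pretend the 'parameters' are just the fragment length.
model = build_first_acoustic_model(
    [("ni", b"\x01\x02"), ("hao", b"\x03\x04\x05")],
    extract_params=len,
)
```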
In one implementation of this embodiment, the processor 1103 is further configured to call the program instructions stored in the memory 1101 to perform the following operations:
Obtaining a second sample voice of the source speaker, wherein the language of the second sample voice is different from the language of the second target voice;
Segmenting the recognition text of the second sample voice according to text units of the preset size, to obtain each second sample text unit;
Converting the second sample text units to obtain first converted text units, wherein the first converted text units are text units used by the language of the second target voice;
Extracting, from the second sample voice, second speech fragments corresponding to the second sample text units;
Extracting acoustic parameters from the second speech fragments, to obtain the acoustic parameters corresponding to the first converted text units;
Building a second acoustic model by using each second sample text unit, the first converted text unit corresponding to the second sample text unit, and the acoustic parameters corresponding to the first converted text unit;
Obtaining the acoustic parameters of each target text unit by using the second acoustic model.
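The second acoustic model differs from the first in that the speaker's source-language units are first converted into target-language text units, so the stored acoustic parameters are keyed by the converted unit. A hypothetical sketch along the same lines as the first model (conversion and extraction are stubs):

```python
def build_second_acoustic_model(unit_fragment_pairs, convert_unit, extract_params):
    """unit_fragment_pairs: (second_sample_text_unit, speech_fragment) pairs.
    Each source-language unit is converted into a target-language text unit
    (the first converted text unit), which keys the acoustic parameters."""
    model = {}
    for unit, fragment in unit_fragment_pairs:
        first_converted_unit = convert_unit(unit)
        model[first_converted_unit] = extract_params(fragment)
    return model

# Toy conversion and extraction stubs, for illustration only.
model = build_second_acoustic_model(
    [("hola", b"\x01"), ("mundo", b"\x02\x03")],
    convert_unit=str.upper,   # stand-in for the real unit conversion
    extract_params=len,       # stand-in for real acoustic-parameter extraction
)
```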
In one implementation of this embodiment, the processor 1103 is further configured to call the program instructions stored in the memory 1101 to perform the following operations:
Collecting multiple first sample texts, wherein the language of the first sample texts is the same as the language of the second sample voice;
Segmenting the first sample texts according to text units of the preset size, to obtain each third sample text unit;
Converting the third sample text units to obtain second converted text units, wherein a second converted text unit is a text unit in which the third sample text unit is pronounced in the pronunciation manner of the second target voice;
Determining the third sample text unit identical to the second sample text unit;
Taking the second converted text unit corresponding to the determined third sample text unit as the first converted text unit.
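This conversion is essentially a table lookup: find the third sample text unit equal to the second sample text unit and reuse its second converted text unit. A sketch with a hypothetical pronunciation table (the example units and pronunciations are invented):

```python
def build_conversion_table(third_units, second_converted_units):
    """Pair each third sample text unit with its second converted text unit."""
    return dict(zip(third_units, second_converted_units))

def convert_unit(table, second_sample_unit):
    """Return the first converted text unit, or None if no identical third
    sample text unit was collected."""
    return table.get(second_sample_unit)

# Hypothetical table: source units mapped to target-language pronunciations.
table = build_conversion_table(["ni", "hao"], ["nee", "how"])
```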
In one implementation of this embodiment, the processor 1103 is further configured to call the program instructions stored in the memory 1101 to perform the following operations:
Collecting multiple second sample texts, wherein the language of the second sample texts is the same as the language of the second sample voice;
Segmenting the second sample texts according to text units of the preset size, to obtain each fourth sample text unit;
Converting the fourth sample text units to obtain third converted text units, wherein a third converted text unit is a text unit in which the fourth sample text unit is pronounced in the pronunciation manner of the second target voice;
For the syllables in the second sample texts, building a coding/decoding model by learning the combination relations and order relations, within the corresponding syllable, of the fourth sample text units belonging to the same syllable, learning the combination relations and order relations of at least two consecutive syllables in the second sample texts, and learning the combination relations and order relations, in the second sample texts, of the fourth sample text units in at least two consecutive syllables;
Converting the second sample text units by using the coding/decoding model, to obtain the first converted text units.
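A full coding/decoding model would be a learned sequence-to-sequence network. As a deliberately simplified stand-in (assuming one text unit per syllable and a 1:1 alignment in the training pairs, which the real model does not require), the learning step can be sketched as counting which target unit each source syllable aligns with, and decoding as ordered lookup:

```python
from collections import Counter, defaultdict

def train_toy_model(parallel_pairs):
    """parallel_pairs: (source_syllables, target_text_units) sequence pairs.
    Learns, per source syllable, its most frequent aligned target unit."""
    counts = defaultdict(Counter)
    for syllables, units in parallel_pairs:
        for syllable, unit in zip(syllables, units):
            counts[syllable][unit] += 1
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

def convert(model, second_sample_units):
    """Decode by per-syllable lookup, preserving the original order."""
    return [model.get(s, "<unk>") for s in second_sample_units]

# Invented training pairs, for illustration only.
model = train_toy_model([
    (["ni", "hao"], ["n-i3", "h-ao3"]),
    (["hao", "de"], ["h-ao3", "d-e5"]),
])
```

The toy counter captures only per-syllable mappings; the combination and order relations across syllables described above are what the real encoder-decoder would additionally learn.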
In addition, this embodiment further provides a computer-readable storage medium including instructions which, when run on a computer, cause the computer to execute any one implementation of the above speech translation method.
From the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the above embodiment methods can be implemented by software plus a necessary general hardware platform. Based on this understanding, the technical solution of the present application, or the part contributing to the prior art, can essentially be embodied in the form of a software product, which can be stored in a storage medium such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway) to execute the methods described in each embodiment, or in certain parts of the embodiments, of the present application.
It should be noted that the embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and identical or similar parts of the embodiments may be referred to one another. Since the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively simple, and relevant parts may refer to the description of the method.
It should also be noted that, herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. In the absence of further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A speech translation method, characterized by comprising:
Obtaining a first target voice of a source speaker;
Generating a second target voice by performing speech translation on the first target voice, wherein the language of the second target voice is different from the language of the first target voice, and the second target voice carries the timbre feature of the source speaker.
2. The method according to claim 1, characterized in that the generating a second target voice by performing speech translation on the first target voice comprises:
Generating a speech recognition text by performing speech recognition on the first target voice;
Generating a translation text by performing text translation on the speech recognition text;
Generating the second target voice by performing speech synthesis on the translation text.
3. The method according to claim 2, characterized in that the generating the second target voice by performing speech synthesis on the translation text comprises:
Segmenting the translation text according to text units of a preset size, to obtain each target text unit;
Obtaining acoustic parameters of each target text unit, wherein the acoustic parameters carry the timbre feature of the source speaker;
Performing speech synthesis on the translation text according to the acoustic parameters of each target text unit, to generate the second target voice.
4. The method according to claim 3, characterized in that the method further comprises:
Obtaining a first sample voice of the source speaker, wherein the language of the first sample voice is the same as the language of the second target voice;
Segmenting the recognition text of the first sample voice according to text units of the preset size, to obtain each first sample text unit;
Extracting, from the first sample voice, a first speech fragment corresponding to the first sample text unit;
Extracting acoustic parameters from the first speech fragment;
Building a first acoustic model by using each first sample text unit and the acoustic parameters corresponding to the first sample text unit;
Then, the obtaining acoustic parameters of each target text unit comprises:
Obtaining the acoustic parameters of each target text unit by using the first acoustic model.
5. The method according to claim 3, characterized in that the method further comprises:
Obtaining a second sample voice of the source speaker, wherein the language of the second sample voice is different from the language of the second target voice;
Segmenting the recognition text of the second sample voice according to text units of the preset size, to obtain each second sample text unit;
Converting the second sample text units to obtain first converted text units, wherein the first converted text units are text units used by the language of the second target voice;
Extracting, from the second sample voice, second speech fragments corresponding to the second sample text units;
Extracting acoustic parameters from the second speech fragments, to obtain the acoustic parameters corresponding to the first converted text units;
Building a second acoustic model by using each second sample text unit, the first converted text unit corresponding to the second sample text unit, and the acoustic parameters corresponding to the first converted text unit;
Then, the obtaining acoustic parameters of each target text unit comprises:
Obtaining the acoustic parameters of each target text unit by using the second acoustic model.
6. The method according to claim 5, characterized in that the method further comprises:
Collecting multiple first sample texts, wherein the language of the first sample texts is the same as the language of the second sample voice;
Segmenting the first sample texts according to text units of the preset size, to obtain each third sample text unit;
Converting the third sample text units to obtain second converted text units, wherein a second converted text unit is a text unit in which the third sample text unit is pronounced in the pronunciation manner of the second target voice;
Then, the converting the second sample text units to obtain first converted text units comprises:
Determining the third sample text unit identical to the second sample text unit;
Taking the second converted text unit corresponding to the determined third sample text unit as the first converted text unit.
7. The method according to claim 5, characterized in that the method further comprises:
Collecting multiple second sample texts, wherein the language of the second sample texts is the same as the language of the second sample voice;
Segmenting the second sample texts according to text units of the preset size, to obtain each fourth sample text unit;
Converting the fourth sample text units to obtain third converted text units, wherein a third converted text unit is a text unit in which the fourth sample text unit is pronounced in the pronunciation manner of the second target voice;
For the syllables in the second sample texts, building a coding/decoding model by learning the combination relations and order relations, within the corresponding syllable, of the fourth sample text units belonging to the same syllable, learning the combination relations and order relations of at least two consecutive syllables in the second sample texts, and learning the combination relations and order relations, in the second sample texts, of the fourth sample text units in at least two consecutive syllables;
Then, the converting the second sample text units to obtain first converted text units comprises:
Converting the second sample text units by using the coding/decoding model, to obtain the first converted text units.
8. A speech translation apparatus, characterized by comprising:
A voice acquisition unit, configured to obtain a first target voice of a source speaker;
A speech translation unit, configured to generate a second target voice by performing speech translation on the first target voice, wherein the language of the second target voice is different from the language of the first target voice, and the second target voice carries the timbre feature of the source speaker.
9. A speech translation apparatus, characterized by comprising: a processor, a memory, and a system bus;
The processor and the memory are connected by the system bus;
The memory is configured to store one or more programs, the one or more programs comprising instructions which, when executed by the processor, cause the processor to execute the method according to any one of claims 1-7.
10. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to execute the method according to any one of claims 1-7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810167142.5A CN108447486B (en) | 2018-02-28 | 2018-02-28 | Voice translation method and device |
PCT/CN2018/095766 WO2019165748A1 (en) | 2018-02-28 | 2018-07-16 | Speech translation method and apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810167142.5A CN108447486B (en) | 2018-02-28 | 2018-02-28 | Voice translation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108447486A true CN108447486A (en) | 2018-08-24 |
CN108447486B CN108447486B (en) | 2021-12-03 |
Family
ID=63192800
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810167142.5A Active CN108447486B (en) | 2018-02-28 | 2018-02-28 | Voice translation method and device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108447486B (en) |
WO (1) | WO2019165748A1 (en) |
Cited By (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108986793A (en) * | 2018-09-28 | 2018-12-11 | 北京百度网讯科技有限公司 | translation processing method, device and equipment |
CN109119063A (en) * | 2018-08-31 | 2019-01-01 | 腾讯科技(深圳)有限公司 | Video dubs generation method, device, equipment and storage medium |
CN109754808A (en) * | 2018-12-13 | 2019-05-14 | 平安科技(深圳)有限公司 | Method, apparatus, computer equipment and the storage medium of voice conversion text |
CN110415680A (en) * | 2018-09-05 | 2019-11-05 | 满金坝(深圳)科技有限公司 | A kind of simultaneous interpretation method, synchronous translation apparatus and a kind of electronic equipment |
CN110610720A (en) * | 2019-09-19 | 2019-12-24 | 北京搜狗科技发展有限公司 | Data processing method and device and data processing device |
CN110970014A (en) * | 2019-10-31 | 2020-04-07 | 阿里巴巴集团控股有限公司 | Voice conversion, file generation, broadcast, voice processing method, device and medium |
WO2020077868A1 (en) * | 2018-10-17 | 2020-04-23 | 深圳壹账通智能科技有限公司 | Simultaneous interpretation method and apparatus, computer device and storage medium |
CN111105781A (en) * | 2019-12-23 | 2020-05-05 | 联想(北京)有限公司 | Voice processing method, device, electronic equipment and medium |
CN111368559A (en) * | 2020-02-28 | 2020-07-03 | 北京字节跳动网络技术有限公司 | Voice translation method and device, electronic equipment and storage medium |
CN111696518A (en) * | 2020-06-05 | 2020-09-22 | 四川纵横六合科技股份有限公司 | Automatic speech synthesis method based on text |
CN111785258A (en) * | 2020-07-13 | 2020-10-16 | 四川长虹电器股份有限公司 | Personalized voice translation method and device based on speaker characteristics |
CN112420008A (en) * | 2019-08-22 | 2021-02-26 | 北京峰趣互联网信息服务有限公司 | Method and device for recording songs, electronic equipment and storage medium |
WO2021134592A1 (en) * | 2019-12-31 | 2021-07-08 | 深圳市欢太科技有限公司 | Speech processing method, apparatus and device, and storage medium |
CN113160793A (en) * | 2021-04-23 | 2021-07-23 | 平安科技(深圳)有限公司 | Speech synthesis method, device, equipment and storage medium based on low resource language |
CN113362818A (en) * | 2021-05-08 | 2021-09-07 | 山西三友和智慧信息技术股份有限公司 | Voice interaction guidance system and method based on artificial intelligence |
WO2021208531A1 (en) * | 2020-04-16 | 2021-10-21 | 北京搜狗科技发展有限公司 | Speech processing method and apparatus, and electronic device |
US11488577B2 (en) * | 2019-09-27 | 2022-11-01 | Baidu Online Network Technology (Beijing) Co., Ltd. | Training method and apparatus for a speech synthesis model, and storage medium |
CN116343751A (en) * | 2023-05-29 | 2023-06-27 | 深圳市泰为软件开发有限公司 | Voice translation-based audio analysis method and device |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113808576A (en) * | 2020-06-16 | 2021-12-17 | 阿里巴巴集团控股有限公司 | Voice conversion method, device and computer system |
CN112382297A (en) * | 2020-11-13 | 2021-02-19 | 北京有竹居网络技术有限公司 | Method, apparatus, device and medium for generating audio |
CN112530404A (en) * | 2020-11-30 | 2021-03-19 | 深圳市优必选科技股份有限公司 | Voice synthesis method, voice synthesis device and intelligent equipment |
CN112509553B (en) * | 2020-12-02 | 2023-08-01 | 问问智能信息科技有限公司 | Speech synthesis method, device and computer readable storage medium |
CN112818707B (en) * | 2021-01-19 | 2024-02-27 | 传神语联网网络科技股份有限公司 | Reverse text consensus-based multi-turn engine collaborative speech translation system and method |
CN113327575B (en) * | 2021-05-31 | 2024-03-01 | 广州虎牙科技有限公司 | Speech synthesis method, device, computer equipment and storage medium |
EP4266306A1 (en) * | 2022-04-22 | 2023-10-25 | Papercup Technologies Limited | A speech processing system and a method of processing a speech signal |
CN114818748B (en) * | 2022-05-10 | 2023-04-21 | 北京百度网讯科技有限公司 | Method for generating translation model, translation method and device |
Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1553381A (en) * | 2003-05-26 | 2004-12-08 | 杨宏惠 | Multi-language correspondent list style language database and synchronous computer inter-transtation and communication |
CN101114447A (en) * | 2006-07-26 | 2008-01-30 | 株式会社东芝 | Speech translation device and method |
CN101154221A (en) * | 2006-09-28 | 2008-04-02 | 株式会社东芝 | Apparatus performing translation process from inputted speech |
CN101727904A (en) * | 2008-10-31 | 2010-06-09 | 国际商业机器公司 | Voice translation method and device |
CN102270450A (en) * | 2010-06-07 | 2011-12-07 | 株式会社曙飞电子 | System and method of multi model adaptation and voice recognition |
CN102821259A (en) * | 2012-07-20 | 2012-12-12 | 冠捷显示科技(厦门)有限公司 | TV (television) system with multi-language speech translation and realization method thereof |
CN104252861A (en) * | 2014-09-11 | 2014-12-31 | 百度在线网络技术(北京)有限公司 | Video voice conversion method, video voice conversion device and server |
JP2015026054A (en) * | 2013-07-29 | 2015-02-05 | 韓國電子通信研究院Electronics and Telecommunications Research Institute | Automatic interpretation device and method |
CN104899192A (en) * | 2014-03-07 | 2015-09-09 | 韩国电子通信研究院 | Apparatus and method for automatic interpretation |
CN105390141A (en) * | 2015-10-14 | 2016-03-09 | 科大讯飞股份有限公司 | Sound conversion method and sound conversion device |
CN105426362A (en) * | 2014-09-11 | 2016-03-23 | 株式会社东芝 | Speech Translation Apparatus And Method |
CN106791913A (en) * | 2016-12-30 | 2017-05-31 | 深圳市九洲电器有限公司 | Digital television program simultaneous interpretation output intent and system |
US20170255616A1 (en) * | 2016-03-03 | 2017-09-07 | Electronics And Telecommunications Research Institute | Automatic interpretation system and method for generating synthetic sound having characteristics similar to those of original speaker's voice |
CN107632980A (en) * | 2017-08-03 | 2018-01-26 | 北京搜狗科技发展有限公司 | Voice translation method and device, the device for voiced translation |
CN107992485A (en) * | 2017-11-27 | 2018-05-04 | 北京搜狗科技发展有限公司 | A kind of simultaneous interpretation method and device |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105786801A (en) * | 2014-12-22 | 2016-07-20 | 中兴通讯股份有限公司 | Speech translation method, communication method and related device |
CN106156009A (en) * | 2015-04-13 | 2016-11-23 | 中兴通讯股份有限公司 | Voice translation method and device |
CN107465816A (en) * | 2017-07-25 | 2017-12-12 | 广西定能电子科技有限公司 | A kind of call terminal and method of instant original voice translation of conversing |
CN107731232A (en) * | 2017-10-17 | 2018-02-23 | 深圳市沃特沃德股份有限公司 | Voice translation method and device |
-
2018
- 2018-02-28 CN CN201810167142.5A patent/CN108447486B/en active Active
- 2018-07-16 WO PCT/CN2018/095766 patent/WO2019165748A1/en active Application Filing
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109119063A (en) * | 2018-08-31 | 2019-01-01 | 腾讯科技(深圳)有限公司 | Video dubs generation method, device, equipment and storage medium |
CN109119063B (en) * | 2018-08-31 | 2019-11-22 | 腾讯科技(深圳)有限公司 | Video dubs generation method, device, equipment and storage medium |
CN110415680A (en) * | 2018-09-05 | 2019-11-05 | 满金坝(深圳)科技有限公司 | A kind of simultaneous interpretation method, synchronous translation apparatus and a kind of electronic equipment |
CN110415680B (en) * | 2018-09-05 | 2022-10-04 | 梁志军 | Simultaneous interpretation method, simultaneous interpretation device and electronic equipment |
US11328133B2 (en) | 2018-09-28 | 2022-05-10 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Translation processing method, translation processing device, and device |
CN108986793A (en) * | 2018-09-28 | 2018-12-11 | 北京百度网讯科技有限公司 | translation processing method, device and equipment |
WO2020077868A1 (en) * | 2018-10-17 | 2020-04-23 | 深圳壹账通智能科技有限公司 | Simultaneous interpretation method and apparatus, computer device and storage medium |
CN109754808A (en) * | 2018-12-13 | 2019-05-14 | 平安科技(深圳)有限公司 | Method, apparatus, computer equipment and the storage medium of voice conversion text |
CN112420008A (en) * | 2019-08-22 | 2021-02-26 | 北京峰趣互联网信息服务有限公司 | Method and device for recording songs, electronic equipment and storage medium |
CN110610720B (en) * | 2019-09-19 | 2022-02-25 | 北京搜狗科技发展有限公司 | Data processing method and device and data processing device |
CN110610720A (en) * | 2019-09-19 | 2019-12-24 | 北京搜狗科技发展有限公司 | Data processing method and device and data processing device |
US11488577B2 (en) * | 2019-09-27 | 2022-11-01 | Baidu Online Network Technology (Beijing) Co., Ltd. | Training method and apparatus for a speech synthesis model, and storage medium |
CN110970014A (en) * | 2019-10-31 | 2020-04-07 | 阿里巴巴集团控股有限公司 | Voice conversion, file generation, broadcast, voice processing method, device and medium |
CN110970014B (en) * | 2019-10-31 | 2023-12-15 | 阿里巴巴集团控股有限公司 | Voice conversion, file generation, broadcasting and voice processing method, equipment and medium |
CN111105781A (en) * | 2019-12-23 | 2020-05-05 | 联想(北京)有限公司 | Voice processing method, device, electronic equipment and medium |
CN111105781B (en) * | 2019-12-23 | 2022-09-23 | 联想(北京)有限公司 | Voice processing method, device, electronic equipment and medium |
WO2021134592A1 (en) * | 2019-12-31 | 2021-07-08 | 深圳市欢太科技有限公司 | Speech processing method, apparatus and device, and storage medium |
CN111368559A (en) * | 2020-02-28 | 2020-07-03 | 北京字节跳动网络技术有限公司 | Voice translation method and device, electronic equipment and storage medium |
WO2021208531A1 (en) * | 2020-04-16 | 2021-10-21 | 北京搜狗科技发展有限公司 | Speech processing method and apparatus, and electronic device |
CN111696518A (en) * | 2020-06-05 | 2020-09-22 | 四川纵横六合科技股份有限公司 | Automatic speech synthesis method based on text |
CN111785258B (en) * | 2020-07-13 | 2022-02-01 | 四川长虹电器股份有限公司 | Personalized voice translation method and device based on speaker characteristics |
CN111785258A (en) * | 2020-07-13 | 2020-10-16 | 四川长虹电器股份有限公司 | Personalized voice translation method and device based on speaker characteristics |
CN113160793A (en) * | 2021-04-23 | 2021-07-23 | 平安科技(深圳)有限公司 | Speech synthesis method, device, equipment and storage medium based on low resource language |
CN113362818A (en) * | 2021-05-08 | 2021-09-07 | 山西三友和智慧信息技术股份有限公司 | Voice interaction guidance system and method based on artificial intelligence |
CN116343751A (en) * | 2023-05-29 | 2023-06-27 | 深圳市泰为软件开发有限公司 | Voice translation-based audio analysis method and device |
CN116343751B (en) * | 2023-05-29 | 2023-08-11 | 深圳市泰为软件开发有限公司 | Voice translation-based audio analysis method and device |
Also Published As
Publication number | Publication date |
---|---|
CN108447486B (en) | 2021-12-03 |
WO2019165748A1 (en) | 2019-09-06 |
Similar Documents
Publication | Title |
---|---|
CN108447486A (en) | A kind of voice translation method and device |
CN110491382B (en) | Speech recognition method and device based on artificial intelligence and speech interaction equipment | |
CN112735373B (en) | Speech synthesis method, device, equipment and storage medium | |
WO2022078146A1 (en) | Speech recognition method and apparatus, device, and storage medium | |
CN112037754B (en) | Method for generating speech synthesis training data and related equipment | |
CN106531150B (en) | Emotion synthesis method based on deep neural network model | |
CN109523989A (en) | Phoneme synthesizing method, speech synthetic device, storage medium and electronic equipment | |
CN107731228A (en) | The text conversion method and device of English voice messaging | |
CN108711420A (en) | Multilingual hybrid model foundation, data capture method and device, electronic equipment | |
CN108231062A (en) | A kind of voice translation method and device | |
CN109887484A (en) | A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device | |
GB2326320A (en) | Text to speech synthesis using neural network | |
CN110010136A (en) | The training and text analyzing method, apparatus, medium and equipment of prosody prediction model | |
KR20140071070A (en) | Method and apparatus for learning pronunciation of foreign language using phonetic symbol | |
CN115953521B (en) | Remote digital person rendering method, device and system | |
CN112463942A (en) | Text processing method and device, electronic equipment and computer readable storage medium | |
CN111951781A (en) | Chinese prosody boundary prediction method based on graph-to-sequence | |
Poncelet et al. | Low resource end-to-end spoken language understanding with capsule networks | |
CN114387945A (en) | Voice generation method and device, electronic equipment and storage medium | |
Bharti et al. | Automated speech to sign language conversion using Google API and NLP | |
CN115424604B (en) | Training method of voice synthesis model based on countermeasure generation network | |
CN112242134A (en) | Speech synthesis method and device | |
CN115762471A (en) | Voice synthesis method, device, equipment and storage medium | |
CN113257225B (en) | Emotional voice synthesis method and system fusing vocabulary and phoneme pronunciation characteristics | |
Reddy et al. | Speech-to-Text and Text-to-Speech Recognition Using Deep Learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |