CN112002304A - Speech synthesis method and device - Google Patents


Info

Publication number
CN112002304A
Authority
CN
China
Prior art keywords
information
initial
shortcut
tone
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010880919.XA
Other languages
Chinese (zh)
Other versions
CN112002304B (en)
Inventor
Zhang Jin (张进)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Tianli Network Technology Co ltd
Original Assignee
Shanghai Tianli Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Tianli Network Technology Co., Ltd.
Priority to CN202010880919.XA
Publication of CN112002304A
Application granted
Publication of CN112002304B
Current legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047: Architecture of speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a speech synthesis method for real-time vocalization on smart devices (including various computers and mobile devices), enabling a person who cannot speak to converse with others by typing on a keyboard. The coding scheme is simple: even without shortcut keys, a user produces a standard Mandarin single-character pronunciation by typing just three letters in sequence. The first letter marks the initial of the syllable, the second letter marks the final, and the third letter marks the tone and can also mark the syllable's length according to its position in the sentence. Combined with the shortcut keys and word-coding method of this patent, input is faster still, and a user can hold spoken conversations with others at an ordinary speaking pace by typing on the keyboard.

Description

Speech synthesis method and device
Technical Field
The present invention relates to speech generation technologies, and in particular, to a speech synthesis method and apparatus.
The invention combines computer input-method techniques with speech synthesis; a fuller description would be: a coding scheme for a real-time speech-synthesis input method.
Background
Existing speech synthesis systems on the market are based on text-to-speech (TTS) technology: a passage of text must be entered first and only then converted into speech, so synchronized, real-time vocalization cannot be achieved.
In other words, for a person who cannot speak to converse with others by typing on a keyboard, current systems require the whole sentence to be typed as text first and then converted into speech by the conversion system before it can be spoken.
Disclosure of Invention
Embodiments of the invention provide a speech synthesis method and apparatus that use sound codes, rather than written characters, as the medium of the synthesis process. This is efficient, and the timing of the speech output can track the moment the user forms the thought.
Synchronized, real-time speech synthesis requires the input method of this patent. Unlike other input methods, the input method of this patent is a sound-matching input method: tapping the keyboard outputs speech. By contrast, other input methods may be called character-matching input methods: tapping the keyboard outputs text.
The invention provides the coding scheme of a real-time speech-synthesis input method. Its basic form emits one Mandarin syllable for every three letters typed in sequence on the keyboard. Its extended form uses shortcut keys and word input, emitting one syllable for every 1.5 to 2 keystrokes on average, so that a user can converse with others by typing at an ordinary speaking pace.
In a first aspect of the embodiments of the present invention, a speech synthesis method is provided, including:
receiving initial information input by a user;
receiving final information input by a user;
receiving tone information and length-and-stress information input by a user;
fusing the initial information, the final information, the tone information and the length-and-stress information based on sound code rules to generate sound code information;
and acquiring voice information corresponding to the sound code information, wherein the sound code information and the voice information are set in correspondence in advance.
Optionally, in a possible implementation of the first aspect, before the step of receiving the initial information input by the user, the method further includes:
arranging an initial information receiving area, a final information receiving area and a tone information receiving area on an input device;
when the input device is triggered for the first time, the initial information receiving area acquires the initial information;
when the input device is triggered for the second time, the final information receiving area acquires the final information;
and when the input device is triggered for the third time, the tone information receiving area acquires the tone information.
Optionally, in a possible implementation of the first aspect, the tone information receiving area includes a sentence-beginning area, a sentence-end area, a word-beginning area, a word-end area and a single-character area;
tone flag-bit information is set in each of the sentence-beginning, sentence-end, word-beginning, word-end and single-character areas;
fusing the initial information, the final information, the tone information and the length-and-stress information based on the sound code rules to generate the sound code information includes:
ordering the initial in the initial information and the final in the final information to generate a single-character pinyin, and matching the single-character pinyin against the flag-bit information in the tone information and the length-and-stress information to generate the sound code information.
Optionally, in a possible implementation of the first aspect, after the step of acquiring the voice information corresponding to the sound code information, wherein the sound code information and the voice information are set in correspondence in advance, the method further includes:
playing the voice information through a loudspeaker device.
Optionally, in a possible implementation of the first aspect, playing the voice information through a loudspeaker device includes:
receiving a voice library selected by the user;
and matching the voice information against the voice library to generate playing information, and sending the playing information to the loudspeaker device for playing.
Optionally, in a possible implementation of the first aspect, after receiving the initial information input by the user, the method further includes:
receiving shortcut information input by the user, wherein an initial information receiving area and a shortcut information receiving area are arranged on the input device;
generating shortcut word information based on the initial information and the shortcut information, wherein the correspondence between the initial information plus shortcut information and the shortcut word information is preset;
and after shortcut word confirmation information input by the user is received, acquiring the voice information corresponding to the shortcut word information, wherein the voice information and the shortcut word information are set in correspondence in advance.
Optionally, in a possible implementation of the first aspect, after generating the shortcut word information based on the initial information and the shortcut information, the method further includes:
receiving shortcut information input by the user again;
generating shortcut phrase information based on the newly received initial information and shortcut information together with the previously received initial information and shortcut information, wherein the correspondence between these inputs and the shortcut phrase information is preset;
and after receiving shortcut phrase confirmation information input by the user, acquiring the voice information corresponding to the shortcut phrase information.
Optionally, in a possible implementation of the first aspect, after receiving the initial information input by the user, the method further includes:
acquiring voice information associated with the initial information, wherein the initial information and its associated voice information are set in correspondence in advance.
In a second aspect of the embodiments of the present invention, there is provided a speech synthesis apparatus, including:
an initial information receiving module, configured to receive initial information input by a user;
a final information receiving module, configured to receive final information input by a user;
a tone and length-and-stress information receiving module, configured to receive tone information and length-and-stress information input by a user;
a sound code generating module, configured to fuse the initial information, the final information, the tone information and the length-and-stress information based on sound code rules to generate sound code information;
and a voice information generating module, configured to acquire the voice information corresponding to the sound code information, wherein the sound code information and the voice information are set in correspondence in advance.
In a third aspect of the embodiments of the present invention, a readable storage medium is provided, in which a computer program is stored; when executed by a processor, the program implements the method according to the first aspect of the invention and its various possible designs.
With the speech synthesis method and apparatus of the invention, a user can input the speech to be uttered directly through the input device, without converting written characters as an intermediate step. This is efficient, and the timing of the speech output can track the moment the user forms the thought.
Drawings
FIG. 1 is a flow chart of a first embodiment of the speech synthesis method;
FIG. 2 is a schematic diagram of a first embodiment of the initial information receiving area;
FIG. 3 is a schematic diagram of a first embodiment of the final information receiving area;
FIG. 4 is a schematic diagram of a first embodiment of the tone and length-and-stress information receiving area;
FIG. 5 is a block diagram of a first embodiment of the speech synthesis apparatus;
FIG. 6 is a flow chart of the operation of pronouncing "I am a Chinese person".
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein.
It should be understood that, in various embodiments of the present invention, the sequence numbers of the processes do not mean the execution sequence, and the execution sequence of the processes should be determined by the functions and the internal logic of the processes, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
It should be understood that in the present application, "comprising" and "having" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that, in the present invention, "a plurality" means two or more. "And/or" merely describes an association between objects, indicating that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates an "or" relationship between the preceding and following objects. "Comprises A, B and C" and "comprises A, B, C" mean that all three of A, B and C are comprised; "comprises A, B or C" means that one of A, B and C is comprised; and "comprises A, B and/or C" means that any one, any two or all three of A, B and C are comprised.
It should be understood that in the present invention, "B corresponding to A", "A corresponds to B" or "B corresponds to A" means that B is associated with A and B can be determined from A. Determining B from A does not mean determining B from A alone; B may be determined from A and/or other information. A matching B means that the similarity between A and B is greater than or equal to a preset threshold.
As used herein, "if" may be interpreted as "upon", "when", "in response to determining" or "in response to detecting", depending on the context.
The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
The invention provides a speech synthesis method, the flow of which is shown in fig. 1, comprising the following steps:
and S110, receiving initial consonant information input by a user. Wherein the receiving device may be a terminal or the like having a physical keyboard or a virtual keyboard. The initial information in the present invention includes, for example: b/p/m/l … …, and vowels (e.g. a/e/o) capable of independently pronouncing, which total 23 letters. The remaining i/u/v in the keyboard (note: v refers to Pinyin Su herein) can be assigned to the double initial zh/ch/sh. As shown schematically in fig. 2.
S120, receiving final information input by the user. The single finals (a, o, e, i, u, v) are typed as in ordinary pinyin; the multi-letter finals are entered as shown in fig. 3, with the following correspondences:
ai=b,an=c,ang=d,ao=f,
ei=h,en=n,eng=g,
ia=ua=j,ian=uai=k,iang=uang=l,iao=m,ie=p,in=q,ing=r,iu=s,
ong=iong=t,ou=w,
uan=x,ue=ve=y,ui=v,un=z,uo=o。
note: for the convenience of memory, the corresponding relation between the multi-character vowels and the letters of the keyboard is also well designed, and firstly, the sequence of the pinyin vowels is generally consistent with the sequence of 26 English letters. At most, two multi-letter finals share one keyboard letter, and finals which are kept close share one letter, such as: the finals ia and ua share one key j, and through scientific tests, in practical application, the shared finals key does not have a conflict phenomenon, namely the possibility that two tones are formed by the same initial consonant and the final consonant is avoided.
In Mandarin, only the three bare vowels a (as in 啊), e (as in 饿, "hungry") and o (as in 哦) form syllables that cannot be expressed as initial + final. The coding scheme handles them by doubling the letter: aa (pinyin a), ee (pinyin e) and oo (pinyin o). (A code sketch of the whole mapping follows.)
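For illustration only, the final-to-key table above and the doubled-letter rule for the bare vowels can be written as a small lookup. The Python below is a sketch under those stated rules; the names and the helper function are ours, not the patent's:

```python
# Sketch of S120's mapping: multi-letter finals to keyboard letters, plus
# the doubled-letter codes for the three standalone vowels.
FINAL_TO_KEY = {
    "ai": "b", "an": "c", "ang": "d", "ao": "f",
    "ei": "h", "en": "n", "eng": "g",
    "ia": "j", "ua": "j", "ian": "k", "uai": "k",
    "iang": "l", "uang": "l", "iao": "m", "ie": "p",
    "in": "q", "ing": "r", "iu": "s",
    "ong": "t", "iong": "t", "ou": "w",
    "uan": "x", "ue": "y", "ve": "y", "ui": "v", "un": "z", "uo": "o",
    # the six single finals are typed as themselves (v stands for ü)
    "a": "a", "o": "o", "e": "e", "i": "i", "u": "u", "v": "v",
}

STANDALONE = {"a": "aa", "e": "ee", "o": "oo"}   # bare-vowel syllables

def first_two_keys(initial: str, final: str) -> str:
    """First two keystrokes of a syllable's sound code."""
    if not initial:                  # a syllable with no initial: a, e, o
        return STANDALONE[final]
    return initial + FINAL_TO_KEY[final]
```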
S130, receiving the tone and length-and-stress information input by the user. Under ordinary pinyin conventions the third key would need only five letters to mark tone, leaving 21 letters free for other duties. The invention therefore uses the third key to mark, besides the tone, the syllable's position within the word and sentence, so that different length-and-stress renderings of the same character can be called up according to where it sits.
The keyboard is divided into five zones, matching the five stroke zones (horizontal, vertical, left-falling, right-falling, hook) of the Wubi input method: a word-beginning zone, a word-end zone, a sentence-beginning zone, a sentence-end zone and a single-character zone, as shown in fig. 4. In spoken Mandarin the same character takes different length and stress at different positions in a sentence. Take the character "I": at the beginning of a sentence (e.g., "I am Chinese") it is heaviest and longest; at the end of a sentence (e.g., "give the things to me") it is also fairly heavy and long; at the beginning of a word it is somewhat heavy but shorter (e.g., "this is mine"); at the end of a word it is lightest and shortest ("this is for me"); and the single-character zone gives the full dictionary citation reading, unsuited to running sentences but well suited to special contexts that need emphasis, such as reading out the two characters of a title when introducing a book.
The five keys in each zone represent the four tones of standard Mandarin plus the neutral tone: yinping (first tone), yangping (second tone), shangsheng (third tone), qusheng (fourth tone) and the neutral tone, as shown in fig. 4. Note: on the letter keys in the figure, the five symbols ·, -, /, V, \ stand for the neutral, first, second, third and fourth tones respectively. Within each zone the five keys are placed according to tone frequency and standard touch-typing fingering. Taking a, s, d, f, g (the five keys from left to right on the home row) in the sentence-beginning zone as an example: a corresponds to qusheng (fourth tone), s to shangsheng (third tone), d to yangping (second tone), f to yinping (first tone) and g to the neutral tone. With standard fingering the index finger is the most agile and most used, so it strikes the f key (first tone) and g key (neutral tone); the little finger, also heavily used, strikes the a key (fourth tone). Note: professional statistics show the fourth tone is the most frequent tone in spoken Mandarin. Once the first- and fourth-tone keys are fixed, the middle d and s keys naturally fall to the second and third tones. The other zones are laid out similarly and are not repeated here.
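The third-key layout can likewise be sketched as a table. Only the a, s, d, f, g row is spelled out in the description, and reading that row as the sentence-beginning zone is our assumption; the other zones would each contribute five analogous keys:

```python
# Sketch: the third keystroke jointly encodes position (zone) and tone.
# Tone codes: 0 = neutral, 1 = yinping, 2 = yangping, 3 = shangsheng,
# 4 = qusheng.
THIRD_KEY = {
    "a": ("sentence_beginning", 4),   # little finger, the most frequent tone
    "s": ("sentence_beginning", 3),
    "d": ("sentence_beginning", 2),
    "f": ("sentence_beginning", 1),   # index finger, first tone
    "g": ("sentence_beginning", 0),   # index finger, neutral tone
    # ... the word-beginning, word-end, sentence-end and single-character
    # zones each assign five keys in the same way
}
```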
S140, fusing the initial information, the final information, and the tone and length-and-stress information based on sound code rules to generate sound code information. The fusion yields the sound code. A sound code differs from pinyin as follows: pinyin comprises an initial, a final and a tone; a sound code comprises these plus the length and stress of the syllable, represented in the invention by the five zones: sentence beginning, sentence end, word beginning, word end and single character.
S150, acquiring the voice information corresponding to the sound code information, wherein the sound code information and the voice information are set in correspondence in advance. Once the sound code is obtained, the corresponding voice information is retrieved and then played.
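Putting S110 to S150 together, a minimal end-to-end sketch looks like the following. The library contents and file names are placeholder assumptions, not recordings disclosed by the patent, and playback is left to whatever audio facility the terminal provides:

```python
# Minimal sketch of S110-S150: three keystrokes are fused into a sound
# code (S140), which indexes a prerecorded audio file (S150).
VOICE_LIBRARY = {
    "mca": "standard/mca.wav",   # illustrative entries only
    "aaf": "standard/aaf.wav",
}

def synthesize(initial_key: str, final_key: str, tone_key: str) -> str:
    """Fuse three keystrokes into a sound code and fetch its recording."""
    sound_code = initial_key + final_key + tone_key        # S140: fusion
    try:
        return VOICE_LIBRARY[sound_code]                   # S150: lookup
    except KeyError:
        raise KeyError(f"no recording for sound code {sound_code!r}") from None
```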
With the scheme above, a person unable to speak can produce speech by typing, rather than first typing out Chinese characters with a conventional input method and then converting them with a speech synthesis system. The latter route is inefficient, permits no manual intervention, and can mispronounce homophones and polyphonic characters.
The method is not only efficient (three keystrokes produce one standard single-character pronunciation) but also lets the user freely choose among the five renderings of a syllable according to the surrounding sentence, which improves the naturalness of connected speech or conveys special meaning. For example, the same sentence spoken at normal or even lighter, quicker pace expresses a modest, sincere attitude; slowed down and stressed, it expresses a superior, grudging one. In the former scene the word "you" can take the word-end rendering (shortest and lightest); in the latter, the single-character rendering (longest and heaviest).
The coding principle is highly extensible. Mandarin has fewer than 1,300 distinct syllable pronunciations, a fair share of which are uncommon dialectal or onomatopoeic readings, while three-key combinations number 26 x 26 x 26 = 17,576, more than ten times that inventory. The scheme can therefore express local dialects, onomatopoeia and foreign pronunciations. For example, the sound of a large gong being struck, or of a bronze basin hitting the ground: pronouncing it with the ordinary character "dang" would suggest a small gong, and many commentaries instead write it vividly as "duang" (first tone, drawn out). No such character or pronunciation exists in the Xinhua dictionary, yet the sound code expresses it easily and exactly: dlv, where d is the initial, l stands for the final uang, and v is the first tone in the single-character zone (whose renderings are full and long relative to the other zones).
The coding principle is extensible enough to realize functions other speech synthesis systems cannot. Take the one sentence "How could you do this?". To express anger and agitation, the pitch rises, each syllable is short and sharp, and the stress falls on "you"; to express earnest, concerned persuasion, the pitch drops and the stress and lingering fall on a different word. If the three-key sound code (with its single pitch and short renderings) cannot express the user's intent, the scheme can switch to a four-key mode whose fourth key marks special context. If singing is desired, where melodies demand more elaborate pitch and duration, the scheme can extend to five- or even six-key sound codes to cover the more complex pronunciation.
As another example, in whispered conversation the vocal cords do not vibrate, the sound is light and weak, and the airflow noise is prominent. It suffices to record a whispered voice library; when entering sound codes, the user simply selects that library.
The coding scheme also provides shortcut input rules for the pronunciation of Arabic numerals, English letters and reduplicated characters:
the number key input after each pronunciation is the pronunciation of the number itself:
number of 0 1 2 3 4 5 6 7 8 9
Phonetic alphabet ling yi er san si wu liu qi ba jiu
Sound code lrd yif era scf sia wus lsa qif baf jss
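The digit table, expressed as a lookup, a minimal sketch assuming the decoder expands a digit key typed after a finished syllable into the digit's own sound code:

```python
# Sketch: digit shortcuts, straight from the table above.
DIGIT_SOUND_CODES = {
    "0": "lrd",  # ling
    "1": "yif",  # yi
    "2": "era",  # er
    "3": "scf",  # san
    "4": "sia",  # si
    "5": "wus",  # wu
    "6": "lsa",  # liu
    "7": "qif",  # qi
    "8": "baf",  # ba
    "9": "jss",  # jiu
}
```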
An uppercase English letter typed after a finished pronunciation is a shortcut for that letter's English pronunciation (the names of some English letters cannot be spelled in pinyin, so they are not listed here). The codes could hardly be simpler: each is the single uppercase letter itself.
A comma typed after a finished pronunciation repeats the preceding syllable, short and light. In the phrase "she walks slowly", the reduplicated "slowly" (man man) would take the complete sound codes "mca mca"; with the comma shortcut the reduplicated code is simply "mca,".
A space typed after a finished pronunciation is neither a shortcut nor a confirmation of the pronunciation; it serves purely as a delimiter. For example: wov xiz uaf.
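These two punctuation rules (a comma repeats the preceding syllable short and light; a space only delimits) admit a simple input scanner. The sketch below is illustrative and its output convention is our own:

```python
def expand_stream(keys: str) -> list[tuple[str, bool]]:
    """Return (sound_code, is_light_repeat) pairs from a keystroke stream."""
    out: list[tuple[str, bool]] = []
    for token in keys.split():              # a space only delimits
        base = token.rstrip(",")
        repeats = len(token) - len(base)    # each comma = one light repeat
        out.append((base, False))
        out.extend([(base, True)] * repeats)
    return out

# expand_stream("mca, wov") -> [("mca", False), ("mca", True), ("wov", False)]
```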
Further, before the step of receiving the initial information input by the user, the method further includes:
arranging an initial information receiving area, a final information receiving area and a tone information receiving area on an input device;
when the input device is triggered for the first time, the initial information receiving area acquires the initial information;
when the input device is triggered for the second time, the final information receiving area acquires the final information;
and when the input device is triggered for the third time, the tone information receiving area acquires the tone and length-and-stress information.
In this step the initial, final and tone information are all received by the same device; whether a given input is initial, final or tone information is determined by the trigger count.
Further, as shown in fig. 4, the tone information receiving area includes a sentence-beginning area, a sentence-end area, a word-beginning area, a word-end area and a single-character area;
tone flag-bit information and length-and-stress information are set in each of these areas;
the step of fusing the initial information, the final information, and the tone and length-and-stress information based on the sound code rules and Mandarin pronunciation rules to generate the sound code information includes:
ordering the initial in the initial information and the final in the final information to generate a single-syllable combination, and matching the combination against the flag-bit information in the tone information and the length-and-stress information to generate the sound code information.
Further, after the step of acquiring the voice information corresponding to the sound code information, wherein the sound code information and the voice information are set in correspondence in advance, the method further includes:
playing the voice information through a loudspeaker device.
Further, playing the voice information through a loudspeaker device includes:
receiving sound code information input by the user;
and matching the sound code information against a pre-stored voice library to generate playing information, and sending the playing information to the loudspeaker device for playing.
Further, after receiving the initial information input by the user, the method further includes:
receiving shortcut information input by the user, wherein an initial information receiving area and a shortcut information receiving area are arranged on the input device;
generating shortcut word information based on the initial information and the shortcut information, wherein the correspondence is preset;
and after shortcut word confirmation information input by the user is received, acquiring the voice information corresponding to the shortcut word information, wherein the voice information and the shortcut word information are set in correspondence in advance.
In this embodiment, each of the 26 letters is assigned the pronunciation of a common Chinese syllable; these are called first-level shortcut keys. The principle is to use the highest-frequency syllables, pronouns above all. The one exception is the w key, which is assigned the syllable wei rather than wo ("I"); wo is assigned instead to the rarely used o key. See the first-level shortcut key table.
The system allows each user to modify the shortcut keys to match personal pronunciation habits and the needs of different trades and user groups.
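A sketch of the first-level shortcut table with a rebinding hook. Only the w and o assignments are stated in the description; the rest of the table, and the hook itself, are illustrative assumptions:

```python
FIRST_LEVEL_SHORTCUTS = {
    "w": "wei",   # w is deliberately not "wo" ("I")
    "o": "wo",    # "wo" lives on the rarely used o key instead
    # ... the remaining 24 letters carry other high-frequency syllables
}

def rebind_shortcut(key: str, syllable: str) -> None:
    """Rebind a first-level shortcut to suit a user's own habits."""
    FIRST_LEVEL_SHORTCUTS[key.lower()] = syllable
```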
Further, after generating the shortcut word information based on the initial information and the shortcut information, wherein the correspondence is preset, the method further includes:
receiving shortcut information input by the user again;
generating shortcut phrase information based on the newly received initial information and shortcut information together with those received previously, wherein the correspondence is preset;
and after receiving shortcut phrase confirmation information input by the user, acquiring the voice information corresponding to the shortcut phrase information.
Further, after receiving the initial information input by the user, the method further includes:
acquiring voice information associated with the initial information, wherein the initial information and its associated voice information are set in correspondence in advance.
The method supports associated words: after each keystroke, the computer labels the homophone words listed on screen with the digits 0 to 9, and the operator completes the pronunciation of a character or phrase simply by tapping the digit.
In addition, since the basic mode of the sound-matching input method is three keystrokes per syllable, three keystrokes cannot directly voice a three-character phrase, nor four keystrokes a four-character one. To enter phrases of three or more characters via shortcuts, a connection code, the semicolon ";", is typed after every two shortcut keys, telling the system that a word of three or more characters is underway; the remaining shortcut keys of the word then follow, again with a semicolon after every two. In other words, when entering multi-character words continuously, every third keystroke (the 3rd, 6th, 9th, and so on) must be a semicolon.
Examples (sketched in code below):
The shortcut code for "Taihangshan" is: TH;S[
The shortcut code for "Australia" is: AD;LY[
"The People's Republic of China": ZH;RM;GH;G[
where [ denotes the space key, which marks the end of the phrase input.
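The connection-code rule reduces to simple grouping. The sketch below shows only that grouping, assuming the per-key shortcut tables exist elsewhere:

```python
def split_phrase(code: str) -> list[str]:
    """Split a multi-character shortcut sequence into its shortcut keys.

    A semicolon follows every two keys; a trailing space ends the phrase.
    split_phrase("ZH;RM;GH;G ") -> ["Z", "H", "R", "M", "G", "H", "G"]
    """
    body = code.rstrip(" ")                 # the space key ends the phrase
    return [key for group in body.split(";") for key in group]
```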
In one possible embodiment, reduplicated characters are coded with a comma standing for one repetition of the preceding character: for example, "hehe" is coded "hef," and "hehehe" is coded "hef,,".
The present invention also provides a speech synthesis apparatus, as shown in fig. 5, comprising:
an initial information receiving module, configured to receive initial information input by a user;
a final information receiving module, configured to receive final information input by a user;
a tone and length-and-stress information receiving module, configured to receive tone information and length-and-stress information input by a user;
a sound code generating module, configured to fuse the initial information, the final information, the tone information and the length-and-stress information based on sound code rules to generate sound code information;
and a voice information generating module, configured to acquire the voice information corresponding to the sound code information, wherein the sound code information and the voice information are set in correspondence in advance.
The invention also provides a speech synthesis system comprising the above apparatus. The apparatus acquires the initial, final, tone and length-and-stress information, calls up the voice information (audio) in a voice database through speech software built on the method, processes the audio file, and drives a loudspeaker attached to the computer or other terminal to play it.
Fig. 6 shows the flow of pronouncing "I am a Chinese person".
The practical application fields and scenarios of the invention are as follows:
1. Voice proofreading for other input methods: because typing and voicing are synchronized, the invention can assist the proofreading of text typed with other input methods. For a professional Wubi typist in fluent touch-typing, every character typed is voiced as it appears, providing auditory proofreading. For this use the user need not master the coding rules at all; any input method can be hooked up to the coding and pronunciation library.
2. Spoken communication for people unable to speak: after some learning and training in the method, they can converse with others through the speaker of a smart device by typing on the keyboard.
3. Improving the naturalness of other speech synthesis technologies: mainstream TTS synthesizes speech from finished text using artificial intelligence and context. Although naturalness has improved greatly, individual words are still pronounced non-standardly or outright wrongly, and the manual intervention available merely disambiguates homophones and polyphonic characters and corrects wrong readings. Conventional TTS cannot alter part of an utterance to carry emotion or to stress particular content. The present scheme is a fully manually driven coding, can serve as an auxiliary intervention layer for such systems, improves naturalness, and lets the author's intent come through fully.
4. Dubbing conventional audio and video programs: compared with hiring an announcer, dubbing by coding has one great advantage: editing is convenient. Only the codes need changing, with no professional editor working in audio software, which cuts costs substantially.
5. Multi-voice radio-drama dubbing (note: this requires the coding schemes of four or more keys): traditional radio drama is produced in a broadcast studio with several professional voice actors and finished later by professional recording and audio editors, which is costly and slow. With the invention, one person, coding while switching among several voice libraries and combining pitch-shifting plug-ins, can produce an entire radio drama.
6. High-difficulty speech synthesis: poetry recitation and singing are not conventional pronunciation. With this coding scheme, sound codes of four or more keys can express the pitch, duration, intensity and other characteristics of such contexts more exactly, achieving the best synthesis effect.
The present invention also provides a readable storage medium, in which a computer program is stored, which, when being executed by a processor, is adapted to implement the methods provided by the various embodiments described above.
The readable storage medium may be a computer storage medium or a communication medium. Communication media includes any medium that facilitates transfer of a computer program from one place to another. Computer storage media may be any available media that can be accessed by a general purpose or special purpose computer. For example, a readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Of course, the readable storage medium may also be an integral part of the processor. The processor and the readable storage medium may reside in an Application Specific Integrated Circuits (ASIC). Additionally, the ASIC may reside in user equipment. Of course, the processor and the readable storage medium may also reside as discrete components in a communication device. The readable storage medium may be a read-only memory (ROM), a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
The present invention also provides a program product comprising execution instructions stored in a readable storage medium. The at least one processor of the device may read the execution instructions from the readable storage medium, and the execution of the execution instructions by the at least one processor causes the device to implement the methods provided by the various embodiments described above.
In the above embodiments of the terminal or the server, it should be understood that the Processor may be a Central Processing Unit (CPU), other general-purpose processors, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of speech synthesis, comprising:
receiving initial information input by a user;
receiving final information input by a user;
receiving tone information and length-and-stress information input by a user;
fusing the initial information, the final information, the tone information and the length-and-stress information based on sound code rules to generate sound code information;
and acquiring voice information corresponding to the sound code information, wherein the sound code information and the voice information are set in correspondence in advance.
2. The speech synthesis method according to claim 1,
before the step of receiving the initial information input by the user, the method further comprises:
arranging an initial information receiving area, a final information receiving area and a tone information receiving area on an input device;
when the input device is triggered for the first time, the initial information receiving area acquires the initial information;
when the input device is triggered for the second time, the final information receiving area acquires the final information;
and when the input device is triggered for the third time, the tone information receiving area acquires the tone information.
3. The speech synthesis method according to claim 2,
the tone information receiving area comprises a sentence-beginning area, a sentence-end area, a word-beginning area, a word-end area and a single-character area;
tone flag-bit information is set in each of the sentence-beginning, sentence-end, word-beginning, word-end and single-character areas;
fusing the initial information, the final information, the tone information and the length-and-stress information based on the sound code rules to generate the sound code information comprises:
ordering the initial in the initial information and the final in the final information to generate a single-character pinyin, and matching the single-character pinyin against the flag-bit information in the tone information and the length-and-stress information to generate the sound code information.
4. The speech synthesis method according to claim 1,
after the step of acquiring the voice information corresponding to the sound code information, wherein the sound code information and the voice information are set in correspondence in advance, the method further comprises:
playing the voice information through a loudspeaker device.
5. The speech synthesis method according to claim 1,
playing the voice information through a loudspeaker device comprises:
receiving a voice library selected by the user;
and matching the voice information against the voice library to generate playing information, and sending the playing information to the loudspeaker device for playing.
6. The speech synthesis method according to claim 1,
after receiving the initial information input by the user, the method further comprises:
receiving shortcut information input by the user, wherein an initial information receiving area and a shortcut information receiving area are arranged on the input device;
generating shortcut word information based on the initial information and the shortcut information, wherein the correspondence between the initial information plus shortcut information and the shortcut word information is preset;
and after shortcut word confirmation information input by the user is received, acquiring the voice information corresponding to the shortcut word information, wherein the voice information and the shortcut word information are set in correspondence in advance.
7. The speech synthesis method according to claim 6,
after generating the shortcut word information based on the initial information and the shortcut information, wherein the correspondence is preset, the method further comprises:
receiving shortcut information input by the user again;
generating shortcut phrase information based on the newly received initial information and shortcut information together with those received previously, wherein the correspondence is preset;
and after receiving shortcut phrase confirmation information input by the user, acquiring the voice information corresponding to the shortcut phrase information.
8. The speech synthesis method according to claim 1,
after receiving the initial information input by the user, the method further comprises:
acquiring voice information associated with the initial information, wherein the initial information and its associated voice information are set in correspondence in advance.
9. A speech synthesis apparatus, comprising:
an initial information receiving module, configured to receive initial information input by a user;
a final information receiving module, configured to receive final information input by a user;
a tone and length-and-stress information receiving module, configured to receive tone information and length-and-stress information input by a user;
a sound code generating module, configured to fuse the initial information, the final information, the tone information and the length-and-stress information based on sound code rules to generate sound code information;
and a voice information generating module, configured to acquire the voice information corresponding to the sound code information, wherein the sound code information and the voice information are set in correspondence in advance.
10. A readable storage medium, in which a computer program is stored which, when being executed by a processor, is adapted to carry out the method of any one of claims 1 to 8.
CN202010880919.XA, filed 2020-08-27: Speech synthesis method and device; granted as CN112002304B (Active)

Priority Applications (1)

CN202010880919.XA (CN112002304B): priority date 2020-08-27, filing date 2020-08-27; title: Speech synthesis method and device

Applications Claiming Priority (1)

CN202010880919.XA (CN112002304B): priority date 2020-08-27, filing date 2020-08-27; title: Speech synthesis method and device

Publications (2)

CN112002304A (application publication): 2020-11-27
CN112002304B (granted publication): 2024-03-29

Family

ID=73471231

Family Applications (1)

CN202010880919.XA (Active, granted as CN112002304B): priority date 2020-08-27, filing date 2020-08-27; title: Speech synthesis method and device

Country Status (1)

CN: CN112002304B



Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1149147A (en) * 1995-05-10 1997-05-07 关屹瀛 Computer inputting technology for Chinese characters by pronunciation, rhyme, tone and meaning and the keyboard thereof
CN1210295A (en) * 1997-05-27 1999-03-10 扶良文 Intellectual coded inputting method and keyboard for Chinese characters and alphabets
CN1175726A (en) * 1997-08-20 1998-03-11 金太星 Hanyupinying writing inputing method for computer
CN1213102A (en) * 1998-09-24 1999-04-07 陈云牧 Chinese morpheme code and its computer keyboard input
CN1258037A (en) * 1999-12-13 2000-06-28 楼建芳 Chinese keyboard and Chinese-character phonetic code input method
KR20020021182A (en) * 2000-09-08 2002-03-20 류충구 Method and apparatus for inputting Chinese characters using information of tone
CN1384421A (en) * 2001-04-30 2002-12-11 刘东华 Text pronunciation digital coding method
WO2004010674A1 (en) * 2002-07-18 2004-01-29 Min-Kyum Kim Apparatus and method for inputting alphabet characters
WO2007104262A1 (en) * 2006-03-15 2007-09-20 Chen Liang Information input method with chinese phonetic letter
CN101118463A (en) * 2006-08-04 2008-02-06 中国科学院软件研究所 Chinese phonetic input method used for digital keyboard
CN101071337A (en) * 2007-06-02 2007-11-14 张先锋 Phonetic alphabet letter-digit Chinese character input method and keyboard and screen display method
CN103054586A (en) * 2012-12-17 2013-04-24 清华大学 Chinese speech automatic audiometric method based on Chinese speech audiometric dynamic word list
CN103325372A (en) * 2013-05-20 2013-09-25 北京航空航天大学 Chinese phonetic symbol tone identification method based on improved tone core model
CN108010516A (en) * 2017-12-04 2018-05-08 广州势必可赢网络科技有限公司 Semantic independent speech emotion feature recognition method and device
CN111124146A (en) * 2019-05-01 2020-05-08 王治阳 Phoneme same-tone near-bit common Chinese character code input method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113257221A (en) * 2021-07-06 2021-08-13 成都启英泰伦科技有限公司 Voice model training method based on front-end design and voice synthesis method
CN113257221B (en) * 2021-07-06 2021-09-17 成都启英泰伦科技有限公司 Voice model training method based on front-end design and voice synthesis method
CN117672182A (en) * 2024-02-02 2024-03-08 江西拓世智能科技股份有限公司 Sound cloning method and system based on artificial intelligence
CN117672182B (en) * 2024-02-02 2024-06-07 江西拓世智能科技股份有限公司 Sound cloning method and system based on artificial intelligence

Also Published As

CN112002304B, published 2024-03-29


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant