CN112002304A - Speech synthesis method and device - Google Patents


Info

Publication number
CN112002304A
Authority
CN
China
Prior art keywords
information
initial
shortcut
tone
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010880919.XA
Other languages
Chinese (zh)
Other versions
CN112002304B (en)
Inventor
Zhang Jin (张进)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Tianli Network Technology Co ltd
Original Assignee
Shanghai Tianli Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Tianli Network Technology Co., Ltd.
Priority to CN202010880919.XA
Publication of CN112002304A
Application granted
Publication of CN112002304B
Current legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00: Speech synthesis; Text to speech systems
    • G10L13/02: Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04: Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047: Architecture of speech synthesisers

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a speech synthesis method for real-time vocalization on smart devices (including various computers and mobile devices), enabling a person who cannot speak to converse with others by typing on a keyboard. The coding scheme is simple: even without shortcut keys, a user produces a standard Mandarin single-character pronunciation by typing just three letters in sequence. The first letter marks the initial of the syllable, the second letter marks the final, and the third letter marks the tone and can also mark the syllable's length according to its position in the sentence. Combined with the shortcut keys and word-coding method of this patent, input is faster still, and a user can hold spoken conversations with others at an ordinary speaking pace by typing on the keyboard.

Description

Speech synthesis method and device
Technical Field
The present invention relates to speech generation technologies, and in particular, to a speech synthesis method and apparatus.
The invention combines computer input-method techniques with speech synthesis; a fuller description would be: a coding scheme for a real-time speech-synthesis input method.
Background
Existing speech synthesis systems on the market are based on text-to-speech (TTS) technology: a passage of text must be entered first and only then converted into speech, so synchronized, real-time vocalization cannot be achieved.
In other words, for a person who cannot speak to converse with others by typing on a keyboard, current systems require the whole sentence to be typed as text first and then converted into speech by the conversion system before it can be spoken.
Disclosure of Invention
Embodiments of the invention provide a speech synthesis method and apparatus that use sound codes, rather than written characters, as the medium of the synthesis process. This is efficient, and the timing of the speech output can track the moment the user forms the thought.
Synchronized, real-time speech synthesis requires the input method of this patent. Unlike other input methods, the input method of this patent is a sound-matching input method: tapping the keyboard outputs speech. By contrast, other input methods may be called character-matching input methods: tapping the keyboard outputs text.
The invention provides the coding scheme of a real-time speech-synthesis input method. Its basic form emits one Mandarin syllable for every three letters typed in sequence on the keyboard. Its extended form uses shortcut keys and word input, emitting one syllable for every 1.5 to 2 keystrokes on average, so that a user can converse with others by typing at an ordinary speaking pace.
In a first aspect of the embodiments of the present invention, a speech synthesis method is provided, including:
receiving initial information input by a user;
receiving final information input by a user;
receiving tone information and length-and-stress information input by a user;
fusing the initial information, the final information, the tone information and the length-and-stress information based on sound code rules to generate sound code information;
and acquiring voice information corresponding to the sound code information, wherein the sound code information and the voice information are set in correspondence in advance.
Optionally, in a possible implementation of the first aspect, before the step of receiving the initial information input by the user, the method further includes:
arranging an initial information receiving area, a final information receiving area and a tone information receiving area on an input device;
when the input device is triggered for the first time, the initial information receiving area acquires the initial information;
when the input device is triggered for the second time, the final information receiving area acquires the final information;
and when the input device is triggered for the third time, the tone information receiving area acquires the tone information.
Optionally, in a possible implementation of the first aspect, the tone information receiving area includes a sentence-beginning area, a sentence-end area, a word-beginning area, a word-end area and a single-character area;
tone flag-bit information is set in each of the sentence-beginning, sentence-end, word-beginning, word-end and single-character areas;
fusing the initial information, the final information, the tone information and the length-and-stress information based on the sound code rules to generate the sound code information includes:
ordering the initial in the initial information and the final in the final information to generate a single-character pinyin, and matching the single-character pinyin against the flag-bit information in the tone information and the length-and-stress information to generate the sound code information.
Optionally, in a possible implementation of the first aspect, after the step of acquiring the voice information corresponding to the sound code information, wherein the sound code information and the voice information are set in correspondence in advance, the method further includes:
playing the voice information through a loudspeaker device.
Optionally, in a possible implementation of the first aspect, playing the voice information through a loudspeaker device includes:
receiving a voice library selected by the user;
and matching the voice information against the voice library to generate playing information, and sending the playing information to the loudspeaker device for playing.
Optionally, in a possible implementation of the first aspect, after receiving the initial information input by the user, the method further includes:
receiving shortcut information input by the user, wherein an initial information receiving area and a shortcut information receiving area are arranged on the input device;
generating shortcut word information based on the initial information and the shortcut information, wherein the correspondence between the initial information plus shortcut information and the shortcut word information is preset;
and after shortcut word confirmation information input by the user is received, acquiring the voice information corresponding to the shortcut word information, wherein the voice information and the shortcut word information are set in correspondence in advance.
Optionally, in a possible implementation of the first aspect, after generating the shortcut word information based on the initial information and the shortcut information, the method further includes:
receiving shortcut information input by the user again;
generating shortcut phrase information based on the newly received initial information and shortcut information together with the previously received initial information and shortcut information, wherein the correspondence between these inputs and the shortcut phrase information is preset;
and after receiving shortcut phrase confirmation information input by the user, acquiring the voice information corresponding to the shortcut phrase information.
Optionally, in a possible implementation of the first aspect, after receiving the initial information input by the user, the method further includes:
acquiring voice information associated with the initial information, wherein the initial information and its associated voice information are set in correspondence in advance.
In a second aspect of the embodiments of the present invention, there is provided a speech synthesis apparatus, including:
an initial information receiving module, configured to receive initial information input by a user;
a final information receiving module, configured to receive final information input by a user;
a tone and length-and-stress information receiving module, configured to receive tone information and length-and-stress information input by a user;
a sound code generating module, configured to fuse the initial information, the final information, the tone information and the length-and-stress information based on sound code rules to generate sound code information;
and a voice information generating module, configured to acquire the voice information corresponding to the sound code information, wherein the sound code information and the voice information are set in correspondence in advance.
In a third aspect of the embodiments of the present invention, a readable storage medium is provided, in which a computer program is stored; when executed by a processor, the program implements the method according to the first aspect of the invention and its various possible designs.
With the speech synthesis method and apparatus of the invention, a user can input the speech to be uttered directly through the input device, without converting written characters as an intermediate step. This is efficient, and the timing of the speech output can track the moment the user forms the thought.
Drawings
FIG. 1 is a flow chart of a first embodiment of the speech synthesis method;
FIG. 2 is a schematic diagram of a first embodiment of the initial information receiving area;
FIG. 3 is a schematic diagram of a first embodiment of the final information receiving area;
FIG. 4 is a schematic diagram of a first embodiment of the tone and length-and-stress information receiving area;
FIG. 5 is a block diagram of a first embodiment of the speech synthesis apparatus;
FIG. 6 is a flow chart of the operation of pronouncing "I am a Chinese person".
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims, as well as in the drawings, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein.
It should be understood that, in various embodiments of the present invention, the sequence numbers of the processes do not mean the execution sequence, and the execution sequence of the processes should be determined by the functions and the internal logic of the processes, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
It should be understood that in the present application, "comprising" and "having" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that, in the present invention, "a plurality" means two or more. "And/or" merely describes an association between objects, indicating that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates an "or" relationship between the preceding and following objects. "Comprises A, B and C" and "comprises A, B, C" mean that all three of A, B and C are comprised; "comprises A, B or C" means that one of A, B and C is comprised; and "comprises A, B and/or C" means that any one, any two or all three of A, B and C are comprised.
It should be understood that in the present invention, "B corresponding to A", "A corresponds to B" or "B corresponds to A" means that B is associated with A and B can be determined from A. Determining B from A does not mean determining B from A alone; B may be determined from A and/or other information. A matching B means that the similarity between A and B is greater than or equal to a preset threshold.
As used herein, "if" may be interpreted as "upon", "when", "in response to determining" or "in response to detecting", depending on the context.
The technical solution of the present invention will be described in detail below with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
The invention provides a speech synthesis method, the flow of which is shown in fig. 1, comprising the following steps:
and S110, receiving initial consonant information input by a user. Wherein the receiving device may be a terminal or the like having a physical keyboard or a virtual keyboard. The initial information in the present invention includes, for example: b/p/m/l … …, and vowels (e.g. a/e/o) capable of independently pronouncing, which total 23 letters. The remaining i/u/v in the keyboard (note: v refers to Pinyin Su herein) can be assigned to the double initial zh/ch/sh. As shown schematically in fig. 2.
S120, receiving final information input by the user. The single finals (a, o, e, i, u, v) are typed as in ordinary pinyin; the multi-letter finals are entered as shown in fig. 3, with the following correspondences:
ai=b,an=c,ang=d,ao=f,
ei=h,en=n,eng=g,
ia=ua=j,ian=uai=k,iang=uang=l,iao=m,ie=p,in=q,ing=r,iu=s,
ong=iong=t,ou=w,
uan=x,ue=ve=y,ui=v,un=z,uo=o。
note: for the convenience of memory, the corresponding relation between the multi-character vowels and the letters of the keyboard is also well designed, and firstly, the sequence of the pinyin vowels is generally consistent with the sequence of 26 English letters. At most, two multi-letter finals share one keyboard letter, and finals which are kept close share one letter, such as: the finals ia and ua share one key j, and through scientific tests, in practical application, the shared finals key does not have a conflict phenomenon, namely the possibility that two tones are formed by the same initial consonant and the final consonant is avoided.
In Mandarin, only the three bare vowels a (as in 啊), e (as in 饿, "hungry") and o (as in 哦) form syllables that cannot be expressed as initial + final. The coding scheme handles them by doubling the letter: aa (pinyin a), ee (pinyin e) and oo (pinyin o). (A code sketch of the whole mapping follows.)
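For illustration only, the final-to-key table above and the doubled-letter rule for the bare vowels can be written as a small lookup. The Python below is a sketch under those stated rules; the names and the helper function are ours, not the patent's:

```python
# Sketch of S120's mapping: multi-letter finals to keyboard letters, plus
# the doubled-letter codes for the three standalone vowels.
FINAL_TO_KEY = {
    "ai": "b", "an": "c", "ang": "d", "ao": "f",
    "ei": "h", "en": "n", "eng": "g",
    "ia": "j", "ua": "j", "ian": "k", "uai": "k",
    "iang": "l", "uang": "l", "iao": "m", "ie": "p",
    "in": "q", "ing": "r", "iu": "s",
    "ong": "t", "iong": "t", "ou": "w",
    "uan": "x", "ue": "y", "ve": "y", "ui": "v", "un": "z", "uo": "o",
    # the six single finals are typed as themselves (v stands for ü)
    "a": "a", "o": "o", "e": "e", "i": "i", "u": "u", "v": "v",
}

STANDALONE = {"a": "aa", "e": "ee", "o": "oo"}   # bare-vowel syllables

def first_two_keys(initial: str, final: str) -> str:
    """First two keystrokes of a syllable's sound code."""
    if not initial:                  # a syllable with no initial: a, e, o
        return STANDALONE[final]
    return initial + FINAL_TO_KEY[final]
```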
S130, receiving the tone and length-and-stress information input by the user. Under ordinary pinyin conventions the third key would need only five letters to mark tone, leaving 21 letters free for other duties. The invention therefore uses the third key to mark, besides the tone, the syllable's position within the word and sentence, so that different length-and-stress renderings of the same character can be called up according to where it sits.
The keyboard is divided into five zones, matching the five stroke zones (horizontal, vertical, left-falling, right-falling, hook) of the Wubi input method: a word-beginning zone, a word-end zone, a sentence-beginning zone, a sentence-end zone and a single-character zone, as shown in fig. 4. In spoken Mandarin the same character takes different length and stress at different positions in a sentence. Take the character "I": at the beginning of a sentence (e.g., "I am Chinese") it is heaviest and longest; at the end of a sentence (e.g., "give the things to me") it is also fairly heavy and long; at the beginning of a word it is somewhat heavy but shorter (e.g., "this is mine"); at the end of a word it is lightest and shortest ("this is for me"); and the single-character zone gives the full dictionary citation reading, unsuited to running sentences but well suited to special contexts that need emphasis, such as reading out the two characters of a title when introducing a book.
The five keys in each zone represent the four tones of standard Mandarin plus the neutral tone: yinping (first tone), yangping (second tone), shangsheng (third tone), qusheng (fourth tone) and the neutral tone, as shown in fig. 4. Note: on the letter keys in the figure, the five symbols ·, -, /, V, \ stand for the neutral, first, second, third and fourth tones respectively. Within each zone the five keys are placed according to tone frequency and standard touch-typing fingering. Taking a, s, d, f, g (the five keys from left to right on the home row) in the sentence-beginning zone as an example: a corresponds to qusheng (fourth tone), s to shangsheng (third tone), d to yangping (second tone), f to yinping (first tone) and g to the neutral tone. With standard fingering the index finger is the most agile and most used, so it strikes the f key (first tone) and g key (neutral tone); the little finger, also heavily used, strikes the a key (fourth tone). Note: professional statistics show the fourth tone is the most frequent tone in spoken Mandarin. Once the first- and fourth-tone keys are fixed, the middle d and s keys naturally fall to the second and third tones. The other zones are laid out similarly and are not repeated here.
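The third-key layout can likewise be sketched as a table. Only the a, s, d, f, g row is spelled out in the description, and reading that row as the sentence-beginning zone is our assumption; the other zones would each contribute five analogous keys:

```python
# Sketch: the third keystroke jointly encodes position (zone) and tone.
# Tone codes: 0 = neutral, 1 = yinping, 2 = yangping, 3 = shangsheng,
# 4 = qusheng.
THIRD_KEY = {
    "a": ("sentence_beginning", 4),   # little finger, the most frequent tone
    "s": ("sentence_beginning", 3),
    "d": ("sentence_beginning", 2),
    "f": ("sentence_beginning", 1),   # index finger, first tone
    "g": ("sentence_beginning", 0),   # index finger, neutral tone
    # ... the word-beginning, word-end, sentence-end and single-character
    # zones each assign five keys in the same way
}
```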
S140, fusing the initial information, the final information, and the tone and length-and-stress information based on sound code rules to generate sound code information. The fusion yields the sound code. A sound code differs from pinyin as follows: pinyin comprises an initial, a final and a tone; a sound code comprises these plus the length and stress of the syllable, represented in the invention by the five zones: sentence beginning, sentence end, word beginning, word end and single character.
S150, acquiring the voice information corresponding to the sound code information, wherein the sound code information and the voice information are set in correspondence in advance. Once the sound code is obtained, the corresponding voice information is retrieved and then played.
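Putting S110 to S150 together, a minimal end-to-end sketch looks like the following. The library contents and file names are placeholder assumptions, not recordings disclosed by the patent, and playback is left to whatever audio facility the terminal provides:

```python
# Minimal sketch of S110-S150: three keystrokes are fused into a sound
# code (S140), which indexes a prerecorded audio file (S150).
VOICE_LIBRARY = {
    "mca": "standard/mca.wav",   # illustrative entries only
    "aaf": "standard/aaf.wav",
}

def synthesize(initial_key: str, final_key: str, tone_key: str) -> str:
    """Fuse three keystrokes into a sound code and fetch its recording."""
    sound_code = initial_key + final_key + tone_key        # S140: fusion
    try:
        return VOICE_LIBRARY[sound_code]                   # S150: lookup
    except KeyError:
        raise KeyError(f"no recording for sound code {sound_code!r}") from None
```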
With the scheme above, a person unable to speak can produce speech by typing, rather than first typing out Chinese characters with a conventional input method and then converting them with a speech synthesis system. The latter route is inefficient, permits no manual intervention, and can mispronounce homophones and polyphonic characters.
The method is not only efficient (three keystrokes produce one standard single-character pronunciation) but also lets the user freely choose among the five renderings of a syllable according to the surrounding sentence, which improves the naturalness of connected speech or conveys special meaning. For example, the same sentence spoken at normal or even lighter, quicker pace expresses a modest, sincere attitude; slowed down and stressed, it expresses a superior, grudging one. In the former scene the word "you" can take the word-end rendering (shortest and lightest); in the latter, the single-character rendering (longest and heaviest).
The coding principle is highly extensible. Mandarin has fewer than 1,300 distinct syllable pronunciations, a fair share of which are uncommon dialectal or onomatopoeic readings, while three-key combinations number 26 x 26 x 26 = 17,576, more than ten times that inventory. The scheme can therefore express local dialects, onomatopoeia and foreign pronunciations. For example, the sound of a large gong being struck, or of a bronze basin hitting the ground: pronouncing it with the ordinary character "dang" would suggest a small gong, and many commentaries instead write it vividly as "duang" (first tone, drawn out). No such character or pronunciation exists in the Xinhua dictionary, yet the sound code expresses it easily and exactly: dlv, where d is the initial, l stands for the final uang, and v is the first tone in the single-character zone (whose renderings are full and long relative to the other zones).
The coding principle is extensible enough to realize functions other speech synthesis systems cannot. Take the one sentence "How could you do this?". To express anger and agitation, the pitch rises, each syllable is short and sharp, and the stress falls on "you"; to express earnest, concerned persuasion, the pitch drops and the stress and lingering fall on a different word. If the three-key sound code (with its single pitch and short renderings) cannot express the user's intent, the scheme can switch to a four-key mode whose fourth key marks special context. If singing is desired, where melodies demand more elaborate pitch and duration, the scheme can extend to five- or even six-key sound codes to cover the more complex pronunciation.
As another example, in whispered conversation the vocal cords do not vibrate, the sound is light and weak, and the airflow noise is prominent. It suffices to record a whispered voice library; when entering sound codes, the user simply selects that library.
The coding scheme also provides shortcut input rules for the pronunciation of Arabic numerals, English letters and reduplicated characters:
the number key input after each pronunciation is the pronunciation of the number itself:
number of 0 1 2 3 4 5 6 7 8 9
Phonetic alphabet ling yi er san si wu liu qi ba jiu
Sound code lrd yif era scf sia wus lsa qif baf jss
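The digit table, expressed as a lookup, a minimal sketch assuming the decoder expands a digit key typed after a finished syllable into the digit's own sound code:

```python
# Sketch: digit shortcuts, straight from the table above.
DIGIT_SOUND_CODES = {
    "0": "lrd",  # ling
    "1": "yif",  # yi
    "2": "era",  # er
    "3": "scf",  # san
    "4": "sia",  # si
    "5": "wus",  # wu
    "6": "lsa",  # liu
    "7": "qif",  # qi
    "8": "baf",  # ba
    "9": "jss",  # jiu
}
```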
An uppercase English letter typed after a finished pronunciation is a shortcut for that letter's English pronunciation (the names of some English letters cannot be spelled in pinyin, so they are not listed here). The codes could hardly be simpler: each is the single uppercase letter itself.
A comma typed after a finished pronunciation repeats the preceding syllable, short and light. In the phrase "she walks slowly", the reduplicated "slowly" (man man) would take the complete sound codes "mca mca"; with the comma shortcut the reduplicated code is simply "mca,".
A space typed after a finished pronunciation is neither a shortcut nor a confirmation of the pronunciation; it serves purely as a delimiter. For example: wov xiz uaf.
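These two punctuation rules (a comma repeats the preceding syllable short and light; a space only delimits) admit a simple input scanner. The sketch below is illustrative and its output convention is our own:

```python
def expand_stream(keys: str) -> list[tuple[str, bool]]:
    """Return (sound_code, is_light_repeat) pairs from a keystroke stream."""
    out: list[tuple[str, bool]] = []
    for token in keys.split():              # a space only delimits
        base = token.rstrip(",")
        repeats = len(token) - len(base)    # each comma = one light repeat
        out.append((base, False))
        out.extend([(base, True)] * repeats)
    return out

# expand_stream("mca, wov") -> [("mca", False), ("mca", True), ("wov", False)]
```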
Further, before the step of receiving the initial information input by the user, the method further includes:
arranging an initial information receiving area, a final information receiving area and a tone information receiving area on an input device;
when the input device is triggered for the first time, the initial information receiving area acquires the initial information;
when the input device is triggered for the second time, the final information receiving area acquires the final information;
and when the input device is triggered for the third time, the tone information receiving area acquires the tone and length-and-stress information.
In this step the initial, final and tone information are all received by the same device; whether a given input is initial, final or tone information is determined by the trigger count.
Further, as shown in fig. 4, the tone information receiving area includes a sentence-beginning area, a sentence-end area, a word-beginning area, a word-end area and a single-character area;
tone flag-bit information and length-and-stress information are set in each of these areas;
the step of fusing the initial information, the final information, and the tone and length-and-stress information based on the sound code rules and Mandarin pronunciation rules to generate the sound code information includes:
ordering the initial in the initial information and the final in the final information to generate a single-syllable combination, and matching the combination against the flag-bit information in the tone information and the length-and-stress information to generate the sound code information.
Further, after the step of acquiring the voice information corresponding to the sound code information, wherein the sound code information and the voice information are set in correspondence in advance, the method further includes:
playing the voice information through a loudspeaker device.
Further, playing the voice information through a loudspeaker device includes:
receiving sound code information input by the user;
and matching the sound code information against a pre-stored voice library to generate playing information, and sending the playing information to the loudspeaker device for playing.
Further, after receiving the initial information input by the user, the method further includes:
receiving shortcut information input by the user, wherein an initial information receiving area and a shortcut information receiving area are arranged on the input device;
generating shortcut word information based on the initial information and the shortcut information, wherein the correspondence is preset;
and after shortcut word confirmation information input by the user is received, acquiring the voice information corresponding to the shortcut word information, wherein the voice information and the shortcut word information are set in correspondence in advance.
In this embodiment, each of the 26 letters is assigned the pronunciation of a common Chinese syllable; these are called first-level shortcut keys. The principle is to use the highest-frequency syllables, pronouns above all. The one exception is the w key, which is assigned the syllable wei rather than wo ("I"); wo is assigned instead to the rarely used o key. See the first-level shortcut key table.
The system allows each user to modify the shortcut keys to match personal pronunciation habits and the needs of different trades and user groups.
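A sketch of the first-level shortcut table with a rebinding hook. Only the w and o assignments are stated in the description; the rest of the table, and the hook itself, are illustrative assumptions:

```python
FIRST_LEVEL_SHORTCUTS = {
    "w": "wei",   # w is deliberately not "wo" ("I")
    "o": "wo",    # "wo" lives on the rarely used o key instead
    # ... the remaining 24 letters carry other high-frequency syllables
}

def rebind_shortcut(key: str, syllable: str) -> None:
    """Rebind a first-level shortcut to suit a user's own habits."""
    FIRST_LEVEL_SHORTCUTS[key.lower()] = syllable
```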
Further, after generating the shortcut word information based on the initial information and the shortcut information, wherein the correspondence is preset, the method further includes:
receiving shortcut information input by the user again;
generating shortcut phrase information based on the newly received initial information and shortcut information together with those received previously, wherein the correspondence is preset;
and after receiving shortcut phrase confirmation information input by the user, acquiring the voice information corresponding to the shortcut phrase information.
Further, after receiving the initial information input by the user, the method further includes:
acquiring voice information associated with the initial information, wherein the initial information and its associated voice information are set in correspondence in advance.
The method supports associated words: after each keystroke, the computer labels the homophone words listed on screen with the digits 0 to 9, and the operator completes the pronunciation of a character or phrase simply by tapping the digit.
In addition, since the basic mode of the sound-matching input method is three keystrokes per syllable, three keystrokes cannot directly voice a three-character phrase, nor four keystrokes a four-character one. To enter phrases of three or more characters via shortcuts, a connection code, the semicolon ";", is typed after every two shortcut keys, telling the system that a word of three or more characters is underway; the remaining shortcut keys of the word then follow, again with a semicolon after every two. In other words, when entering multi-character words continuously, every third keystroke (the 3rd, 6th, 9th, and so on) must be a semicolon.
Examples (sketched in code below):
The shortcut code for "Taihangshan" is: TH;S[
The shortcut code for "Australia" is: AD;LY[
"The People's Republic of China": ZH;RM;GH;G[
where [ denotes the space key, which marks the end of the phrase input.
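The connection-code rule reduces to simple grouping. The sketch below shows only that grouping, assuming the per-key shortcut tables exist elsewhere:

```python
def split_phrase(code: str) -> list[str]:
    """Split a multi-character shortcut sequence into its shortcut keys.

    A semicolon follows every two keys; a trailing space ends the phrase.
    split_phrase("ZH;RM;GH;G ") -> ["Z", "H", "R", "M", "G", "H", "G"]
    """
    body = code.rstrip(" ")                 # the space key ends the phrase
    return [key for group in body.split(";") for key in group]
```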
In one possible embodiment, reduplicated characters are coded with a comma standing for one repetition of the preceding character: for example, "hehe" is coded "hef," and "hehehe" is coded "hef,,".
The present invention also provides a speech synthesis apparatus, as shown in fig. 5, comprising:
an initial information receiving module, configured to receive initial information input by a user;
a final information receiving module, configured to receive final information input by a user;
a tone and length-and-stress information receiving module, configured to receive tone information and length-and-stress information input by a user;
a sound code generating module, configured to fuse the initial information, the final information, the tone information and the length-and-stress information based on sound code rules to generate sound code information;
and a voice information generating module, configured to acquire the voice information corresponding to the sound code information, wherein the sound code information and the voice information are set in correspondence in advance.
The invention also provides a speech synthesis system comprising the above apparatus. The apparatus acquires the initial, final, tone and length-and-stress information, calls up the voice information (audio) in a voice database through speech software built on the method, processes the audio file, and drives a loudspeaker attached to the computer or other terminal to play it.
Fig. 6 shows the flow of pronouncing "I am a Chinese person".
The practical application fields and scenarios of the invention are as follows:
1. Voice proofreading for other input methods: because typing and voicing are synchronized, the invention can assist the proofreading of text typed with other input methods. For a professional Wubi typist in fluent touch-typing, every character typed is voiced as it appears, providing auditory proofreading. For this use the user need not master the coding rules at all; any input method can be hooked up to the coding and pronunciation library.
2. Spoken communication for people unable to speak: after some learning and training in the method, they can converse with others through the speaker of a smart device by typing on the keyboard.
3. Improving the naturalness of other speech synthesis technologies: mainstream TTS synthesizes speech from finished text using artificial intelligence and context. Although naturalness has improved greatly, individual words are still pronounced non-standardly or outright wrongly, and the manual intervention available merely disambiguates homophones and polyphonic characters and corrects wrong readings. Conventional TTS cannot alter part of an utterance to carry emotion or to stress particular content. The present scheme is a fully manually driven coding, can serve as an auxiliary intervention layer for such systems, improves naturalness, and lets the author's intent come through fully.
4. Dubbing conventional audio and video programs: compared with hiring an announcer, dubbing by coding has one great advantage: editing is convenient. Only the codes need changing, with no professional editor working in audio software, which cuts costs substantially.
5. Multi-voice radio-drama dubbing (note: this requires the coding schemes of four or more keys): traditional radio drama is produced in a broadcast studio with several professional voice actors and finished later by professional recording and audio editors, which is costly and slow. With the invention, one person, coding while switching among several voice libraries and combining pitch-shifting plug-ins, can produce an entire radio drama.
6. High-difficulty speech synthesis: poetry recitation and singing are not conventional pronunciation. With this coding scheme, sound codes of four or more keys can express the pitch, duration, intensity and other characteristics of such contexts more exactly, achieving the best synthesis effect.
The present invention also provides a readable storage medium, in which a computer program is stored, which, when being executed by a processor, is adapted to implement the methods provided by the various embodiments described above.
The readable storage medium may be a computer storage medium or a communication medium. Communication media includes any medium that facilitates transfer of a computer program from one place to another. Computer storage media may be any available media that can be accessed by a general purpose or special purpose computer. For example, a readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium. Of course, the readable storage medium may also be an integral part of the processor. The processor and the readable storage medium may reside in an Application Specific Integrated Circuits (ASIC). Additionally, the ASIC may reside in user equipment. Of course, the processor and the readable storage medium may also reside as discrete components in a communication device. The readable storage medium may be a read-only memory (ROM), a random-access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
The present invention also provides a program product comprising execution instructions stored in a readable storage medium. The at least one processor of the device may read the execution instructions from the readable storage medium, and the execution of the execution instructions by the at least one processor causes the device to implement the methods provided by the various embodiments described above.
In the above embodiments of the terminal or the server, it should be understood that the Processor may be a Central Processing Unit (CPU), other general-purpose processors, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of a method disclosed in connection with the present invention may be embodied directly in a hardware processor, or in a combination of the hardware and software modules within the processor.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A method of speech synthesis, comprising:
receiving initial information input by a user;
receiving final information input by a user;
receiving tone information and length-and-stress information input by a user;
fusing the initial information, the final information, the tone information and the length-and-stress information based on sound code rules to generate sound code information;
and acquiring voice information corresponding to the sound code information, wherein the sound code information and the voice information are set in correspondence in advance.
2. The speech synthesis method according to claim 1,
before the step of receiving the initial information input by the user, the method further comprises:
arranging an initial information receiving area, a final information receiving area and a tone information receiving area on an input device;
when the input device is triggered for the first time, the initial information receiving area acquires the initial information;
when the input device is triggered for the second time, the final information receiving area acquires the final information;
and when the input device is triggered for the third time, the tone information receiving area acquires the tone information.
3. The speech synthesis method according to claim 2,
the tone information receiving area comprises a sentence-beginning area, a sentence-end area, a word-beginning area, a word-end area and a single-character area;
tone flag-bit information is set in each of the sentence-beginning, sentence-end, word-beginning, word-end and single-character areas;
fusing the initial information, the final information, the tone information and the length-and-stress information based on the sound code rules to generate the sound code information comprises:
ordering the initial in the initial information and the final in the final information to generate a single-character pinyin, and matching the single-character pinyin against the flag-bit information in the tone information and the length-and-stress information to generate the sound code information.
4. The speech synthesis method according to claim 1,
after the step of acquiring the voice information corresponding to the sound code information, wherein the sound code information and the voice information are set in correspondence in advance, the method further comprises:
playing the voice information through a loudspeaker device.
5. The speech synthesis method according to claim 1,
playing the voice information through a loudspeaker device comprises:
receiving a voice library selected by the user;
and matching the voice information against the voice library to generate playing information, and sending the playing information to the loudspeaker device for playing.
6. The speech synthesis method according to claim 1,
after receiving the initial information input by the user, the method further comprises:
receiving shortcut information input by the user, wherein an initial information receiving area and a shortcut information receiving area are arranged on the input device;
generating shortcut word information based on the initial information and the shortcut information, wherein the correspondence between the initial information plus shortcut information and the shortcut word information is preset;
and after shortcut word confirmation information input by the user is received, acquiring the voice information corresponding to the shortcut word information, wherein the voice information and the shortcut word information are set in correspondence in advance.
7. The speech synthesis method according to claim 6,
after generating the shortcut word information based on the initial information and the shortcut information, wherein the correspondence is preset, the method further comprises:
receiving shortcut information input by the user again;
generating shortcut phrase information based on the newly received initial information and shortcut information together with those received previously, wherein the correspondence is preset;
and after receiving shortcut phrase confirmation information input by the user, acquiring the voice information corresponding to the shortcut phrase information.
8. The speech synthesis method according to claim 1,
after receiving the initial information input by the user, the method further comprises:
acquiring voice information associated with the initial information, wherein the initial information and its associated voice information are set in correspondence in advance.
9. A speech synthesis apparatus, comprising:
an initial information receiving module, configured to receive initial information input by a user;
a final information receiving module, configured to receive final information input by a user;
a tone and length-and-stress information receiving module, configured to receive tone information and length-and-stress information input by a user;
a sound code generating module, configured to fuse the initial information, the final information, the tone information and the length-and-stress information based on sound code rules to generate sound code information;
and a voice information generating module, configured to acquire the voice information corresponding to the sound code information, wherein the sound code information and the voice information are set in correspondence in advance.
10. A readable storage medium, in which a computer program is stored which, when being executed by a processor, is adapted to carry out the method of any one of claims 1 to 8.
CN202010880919.XA, filed 2020-08-27: Speech synthesis method and device; granted as CN112002304B (Active)

Priority Applications (1)

CN202010880919.XA (CN112002304B): priority date 2020-08-27, filing date 2020-08-27; title: Speech synthesis method and device

Applications Claiming Priority (1)

CN202010880919.XA (CN112002304B): priority date 2020-08-27, filing date 2020-08-27; title: Speech synthesis method and device

Publications (2)

CN112002304A (application publication): 2020-11-27
CN112002304B (granted publication): 2024-03-29

Family

ID=73471231

Family Applications (1)

CN202010880919.XA (Active, granted as CN112002304B): priority date 2020-08-27, filing date 2020-08-27; title: Speech synthesis method and device

Country Status (1)

CN: CN112002304B



Patent Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1149147A (en) * 1995-05-10 1997-05-07 关屹瀛 Computer inputting technology for Chinese characters by pronunciation, rhyme, tone and meaning and the keyboard thereof
CN1210295A (en) * 1997-05-27 1999-03-10 扶良文 Intellectual coded inputting method and keyboard for Chinese characters and alphabets
CN1175726A (en) * 1997-08-20 1998-03-11 金太星 Hanyupinying writing inputing method for computer
CN1213102A (en) * 1998-09-24 1999-04-07 陈云牧 Chinese morpheme code and its computer keyboard input
CN1258037A (en) * 1999-12-13 2000-06-28 楼建芳 Chinese keyboard and Chinese-character phonetic code input method
KR20020021182A (en) * 2000-09-08 2002-03-20 류충구 Method and apparatus for inputting Chinese characters using information of tone
CN1384421A (en) * 2001-04-30 2002-12-11 刘东华 Text pronunciation digital coding method
WO2004010674A1 (en) * 2002-07-18 2004-01-29 Min-Kyum Kim Apparatus and method for inputting alphabet characters
WO2007104262A1 (en) * 2006-03-15 2007-09-20 Chen Liang Information input method with chinese phonetic letter
CN101118463A (en) * 2006-08-04 2008-02-06 中国科学院软件研究所 Chinese phonetic input method used for digital keyboard
CN101071337A (en) * 2007-06-02 2007-11-14 张先锋 Phonetic alphabet letter-digit Chinese character input method and keyboard and screen display method
CN103054586A (en) * 2012-12-17 2013-04-24 清华大学 Chinese speech automatic audiometric method based on Chinese speech audiometric dynamic word list
CN103325372A (en) * 2013-05-20 2013-09-25 北京航空航天大学 Chinese phonetic symbol tone identification method based on improved tone core model
CN108010516A (en) * 2017-12-04 2018-05-08 广州势必可赢网络科技有限公司 Semantic independent speech emotion feature recognition method and device
CN111124146A (en) * 2019-05-01 2020-05-08 王治阳 Phoneme same-tone near-bit common Chinese character code input method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113257221A (en) * 2021-07-06 2021-08-13 成都启英泰伦科技有限公司 Voice model training method based on front-end design and voice synthesis method
CN113257221B (en) * 2021-07-06 2021-09-17 成都启英泰伦科技有限公司 Voice model training method based on front-end design and voice synthesis method
CN117672182A (en) * 2024-02-02 2024-03-08 江西拓世智能科技股份有限公司 Sound cloning method and system based on artificial intelligence
CN117672182B (en) * 2024-02-02 2024-06-07 江西拓世智能科技股份有限公司 Sound cloning method and system based on artificial intelligence

Also Published As

CN112002304B, published 2024-03-29


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant