CN103366742A - Voice input method and system - Google Patents

Voice input method and system

Info

Publication number
CN103366742A
CN103366742A CN2012101013029A CN201210101302A
Authority
CN
China
Prior art keywords
syllable
candidate
word
text
candidate word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012101013029A
Other languages
Chinese (zh)
Other versions
CN103366742B (en)
Inventor
李曜
许东星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHANGHAI GEAK ELECTRONICS Co.,Ltd.
Original Assignee
Shengle Information Technology Shanghai Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shengle Information Technology Shanghai Co Ltd
Priority to CN201210101302.9A priority Critical patent/CN103366742B/en
Publication of CN103366742A publication Critical patent/CN103366742A/en
Application granted granted Critical
Publication of CN103366742B publication Critical patent/CN103366742B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a voice input method and system. The method includes: while recording, segmenting the input voice into voice segments and generating a text for each voice segment; and displaying the text of each voice segment in order and correcting the text of each voice segment in order according to the user's selections. The voice input method and system automatically segment the speech recognition result and return it paragraph by paragraph for a second confirmation by the user, so that the user can correct and confirm the returned text while still recording.

Description

Voice input method and system
Technical field
The invention belongs to the field of speech recognition, and in particular relates to a voice input method and system.
Background technology
With the progress of speech recognition technology and the rise of cloud computing, a scheme in which speech is input on a mobile terminal, transcribed to text by a cloud server, and the text returned to the mobile terminal has become a trend. Because of the size restrictions of mobile terminals, entering text directly through a physical or virtual keyboard is often unsatisfactory, and it can be predicted that voice input will replace keystroke input in more and more places.
However, the fact that speech recognition accuracy cannot reach 100% has hindered voice input from completely replacing keystroke input. In fact, owing to the complexity of real speech under the various conditions of daily life, recognition accuracy may never reach 100%; especially in noisy environments, the recognition result is bound to contain errors. That is to say, the result of speech recognition necessarily requires a process of second confirmation. An existing voice input scheme works as follows: after the record button is pressed, an interface indicating that recording is in progress, as shown in Fig. 1, pops up on the mobile terminal; the user then speaks; after the user finishes, the recognized text is displayed in a text input box 21 on an interface as shown in Fig. 2; if the text in the text input box 21 contains recognition errors, the user calls up a keyboard 22 to correct it, then confirms and saves it. In this voice input scheme, however, the user cannot edit the recognition result during recording; only after all the speech has been input in one pass can the user correct the errors in the returned text one by one, confirm and save it, and then use the confirmed text in subsequent applications such as sending short messages, sending mail, or keeping accounts. This confirmation process is therefore usually cumbersome and unfriendly for the user.
Summary of the invention
The object of the present invention is to provide a voice input method and system that can automatically segment and recognize the input speech, so that the user can correct the text of each recognized segment while still recording.
To solve the above problem, the invention provides a voice input method, comprising:
while recording, continuously segmenting the input speech into voice segments and generating a text for each voice segment; and
displaying the text of each voice segment in order, and correcting the text of each voice segment in order according to the user's selections.
Further, in the above method, a cloud server continuously segments the input speech into voice segments and generates the text of each voice segment.
Further, in the above method, the input speech is continuously segmented into voice segments by a voice activity detection algorithm.
Further, in the above method, the step of correcting the text of each voice segment in order according to the user's selections comprises:
the user selecting, in the text of each voice segment, the content that needs to be corrected;
generating candidate words for each word in said content, the syllable of each word in said content, and candidate syllables for each word in said content; and
correcting the text of the voice segment according to the candidate word, the syllable, or the candidate syllable selected by the user.
Further, in the above method, the step of correcting the text of the voice segment according to the candidate word, the syllable, or the candidate syllable selected by the user comprises:
when the user selects a candidate word, replacing the corresponding word in said content with the selected candidate word;
when the user selects the syllable, generating candidate words corresponding to the syllable, and selecting the correct candidate word from them to replace the corresponding word in said content;
when the user selects a candidate syllable, generating candidate words corresponding to the candidate syllable, and selecting the correct candidate word from them to replace the corresponding word in said content; and
when none of the generated candidate words or candidate syllables yields the correct result, calling up an input method to modify the text.
Further, in the above method, before the step of continuously segmenting the input speech into voice segments while recording and generating the text of each voice segment, the method further comprises: monitoring the noise of the recording environment while recording and obtaining a signal-to-noise ratio (SNR).
Further, in the above method, the step of generating candidate words for each word in said content, the syllable of each word in said content, and candidate syllables for each word in said content comprises:
when the SNR is greater than a predetermined threshold, reducing the number of candidate words and candidate syllables; and
when the SNR is less than the predetermined threshold, increasing the number of candidate words and candidate syllables.
According to another aspect of the present invention, a voice input system is provided, comprising:
a segmentation module, for continuously segmenting the input speech into voice segments while recording and generating a text for each voice segment; and
a correction module, for displaying the text of each voice segment in order and correcting the text of each voice segment in order according to the user's selections.
Further, in the above system, the segmentation module is located on a cloud server.
Further, in the above system, the segmentation module continuously segments the input speech into voice segments by a voice activity detection algorithm.
Further, in the above system, the correction module comprises:
a selection unit, for obtaining the content that the user has selected for correction in the text of each voice segment;
a candidate unit, for generating candidate words for each word in said content, the syllable of each word in said content, and candidate syllables for each word in said content; and
a correction unit, for correcting the text of the voice segment according to the candidate word, the syllable, or the candidate syllable selected by the user.
Further, in the above system, the correction unit is used for: when the user selects a candidate word, replacing the corresponding word in said content with the selected candidate word; when the user selects the syllable, generating candidate words corresponding to the syllable and selecting the correct candidate word from them to replace the corresponding word in said content; when the user selects a candidate syllable, generating candidate words corresponding to the candidate syllable and selecting the correct candidate word from them to replace the corresponding word in said content; and when none of the generated candidate words or candidate syllables yields the correct result, calling up an input method to modify the text.
Further, the above system also comprises a noise monitoring unit, for monitoring the noise of the recording environment while recording and obtaining the SNR.
Further, in the above system, the candidate unit is used for reducing the number of candidate words and candidate syllables when the SNR is greater than a predetermined threshold, and increasing the number of candidate words and candidate syllables when the SNR is less than the predetermined threshold.
Compared with the prior art, the present invention continuously segments the input speech into voice segments while recording, generates a text for each voice segment, displays the texts in order, and corrects them in order according to the user's selections. The speech recognition result can thus be segmented automatically and returned paragraph by paragraph for a second confirmation by the user, so that the user can correct and confirm the returned text while still recording.
In addition, the user selects the content that needs correction in the text of each voice segment; candidate words for each word in said content, the syllable of each word, and candidate syllables for each word are then generated; and the text of the voice segment is corrected according to the candidate word, the syllable, or the candidate syllable the user selects. This lets the user quickly pick the correct characters to correct the content of the text.
In addition, by monitoring the noise of the recording environment while recording to obtain the SNR, reducing the number of candidate words and candidate syllables when the SNR is greater than a predetermined threshold, and increasing them when the SNR is less than the threshold, the number of candidate results can be adjusted according to different SNRs.
Brief description of the drawings
Fig. 1 is a schematic diagram of the recording interface of an existing voice input scheme;
Fig. 2 is a schematic diagram of the recognized-text display and correction interface of an existing voice input scheme;
Fig. 3 is a flow chart of the voice input method of Embodiment 1 of the invention;
Fig. 4 is a schematic diagram of the combined recording and recognized-text display and correction interface of Embodiment 1;
Fig. 5 is a schematic diagram of the interface of Embodiment 1 that displays and corrects the recognized text segment by segment;
Fig. 6 is a flow chart of the voice input method of Embodiment 2 of the invention;
Fig. 7 is a schematic diagram of the noise monitoring interface of Embodiment 2;
Fig. 8 is a functional block diagram of the voice input system of Embodiment 3 of the invention.
Embodiment
To make the above objects, features, and advantages of the present invention more apparent, the invention is described in further detail below with reference to the drawings and specific embodiments.
Embodiment 1
As shown in Figs. 3-5, the invention provides a voice input method, comprising:
Step S11: while recording, continuously segment the input speech into voice segments and generate a text for each voice segment. Specifically, the invention can automatically segment the speech recognition result and return it paragraph by paragraph for a second confirmation by the user. A cloud server may continuously segment the input speech into voice segments and generate the text of each voice segment, segmenting by a voice activity detection algorithm. Endpoint detection means accurately determining the starting point and ending point of speech within a signal that contains speech, distinguishing speech from non-speech signals; it is an important aspect of speech processing technology. For example, while the user inputs speech continuously, the cloud server may use an endpoint detection algorithm to cut the effective speech into segments according to the rhythm of the user's pauses, convert the segments into text one by one, and return the texts to the display interface of the mobile terminal as shown in Fig. 4, which integrates the recording interface and the recognition-result display interface into a single interface;
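The pause-based endpoint detection described in Step S11 can be illustrated with a minimal energy-based sketch. This is not the patent's actual algorithm; the frame length, energy threshold, and minimum pause length are illustrative assumptions, and a production system would use a trained voice activity detector.

```python
import numpy as np

def segment_by_energy(samples, rate, frame_ms=30, threshold=0.01, min_pause_frames=10):
    """Split an audio signal into speech segments at pauses.

    A run of at least `min_pause_frames` consecutive frames whose RMS
    energy stays below `threshold` is treated as a pause, ending the
    current segment. Returns (start, end) sample indices per segment.
    """
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    rms = np.array([
        np.sqrt(np.mean(samples[i * frame_len:(i + 1) * frame_len] ** 2))
        for i in range(n_frames)
    ])
    segments, start, silence_run = [], None, 0
    for i, e in enumerate(rms):
        if e >= threshold:
            if start is None:
                start = i          # speech begins
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_pause_frames:
                # pause long enough: close the segment at the frame
                # where silence began
                segments.append((start * frame_len,
                                 (i - silence_run + 1) * frame_len))
                start, silence_run = None, 0
    if start is not None:          # trailing speech with no final pause
        segments.append((start * frame_len, n_frames * frame_len))
    return segments
```

Each returned segment would then be sent to the recognizer independently, which is what allows the texts to come back one by one while recording continues.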
Step S12: display the text of each voice segment in order;
Step S13: correct the text of each voice segment in order according to the user's selections. Specifically, in the present invention the user can correct and confirm the returned text while still recording. It should be noted that in the interaction scheme of the invention, not all text recognition results are displayed at once; only the text recognition result of the current segment is shown on the interface, as in Fig. 5. After the user corrects and confirms recognition result 1 of voice segment 1, the next recognition result 2 is displayed. The benefit of this display scheme is that a limited number of results are shown in turn on a limited screen, letting the user concentrate on the current recognition result and improving the efficiency of correcting the text. This step can specifically comprise:
Step S131: the user selects the content that needs correction in the text of each voice segment. Specifically, when the user needs to correct some words in the text recognition result, the user can tap the specific characters in the result;
Step S132: generate candidate words for each word in said content, the syllable of each word in said content, and candidate syllables for each word in said content. Specifically, when the user taps a character in the recognition result that needs correction, several candidate words for that character, the corresponding syllable, and several candidate syllables can be arranged to pop up. In this way the speech recognition result is effectively combined with the input method: multiple candidates are offered for the user to choose from, and the recognition result is degraded from characters to syllables, enlarging the hit range, so that the user does not have to type a string of letters but can find the desired word among the candidates;
Step S133: correct the text of the voice segment according to the candidate word, the syllable, or the candidate syllable selected by the user. Specifically, while the user corrects and confirms the returned recognition result, "Cancel" and "Confirm" commands as shown in Fig. 5 can be provided, for quickly deleting and for saving the text recognition result, respectively. This step can further comprise:
Step S1331: when the user selects a candidate word, replace the corresponding word in said content with the selected candidate word. Specifically, if the correct character is among the candidate words, the user simply taps that candidate word to replace the originally misrecognized character;
Step S1332: when the user selects the syllable, generate candidate words corresponding to the syllable and select the correct candidate word from them to replace the corresponding word in said content. Specifically, if the correct character is not among the candidate words, the user can tap the correct syllable and then choose the desired word from the candidate words offered for that syllable;
Step S1333: when the user selects a candidate syllable, generate candidate words corresponding to the candidate syllable and select the correct candidate word from them to replace the corresponding word in said content. Specifically, if the correct character is not among the candidate words of the correct syllable, the user can tap a candidate syllable and then choose the desired word from the candidate words offered for that candidate syllable;
Step S1334: when none of the generated candidate words or candidate syllables yields the correct result, an input method can be called up to modify the text.
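The fallback chain of steps S1331-S1334 can be sketched as follows. The candidate table, function name, and data shapes are hypothetical, purely to show the replacement logic of candidate word → syllable → keyboard fallback:

```python
# hypothetical homophone table: pinyin syllable -> candidate characters
CANDIDATES_BY_SYLLABLE = {
    "shi": ["是", "事", "市", "式"],
    "shu": ["书", "数", "树", "输"],
}

def correct_word(text, index, choice, fallback_input=None):
    """Replace the character at `index` in `text` according to the
    user's `choice`: either a candidate word (a character, S1331) or a
    syllable whose candidate list supplies the replacement (S1332/S1333);
    `fallback_input` stands in for keyboard entry (S1334).
    """
    if choice in CANDIDATES_BY_SYLLABLE:
        # user tapped a syllable or candidate syllable: its candidate
        # words are offered; the first one stands in for the user's pick
        replacement = CANDIDATES_BY_SYLLABLE[choice][0]
    elif choice is not None:
        replacement = choice          # user tapped a candidate word directly
    elif fallback_input is not None:
        replacement = fallback_input  # no correct candidate: keyboard input
    else:
        return text                   # nothing chosen, leave text unchanged
    return text[:index] + replacement + text[index + 1:]
```

In the real system the user, not the code, picks from the popped-up list; the point is that every branch ends in the same single-character replacement.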
The present invention can display the recording interface and the returned-result interface simultaneously on the interface of the mobile terminal, so that the user can see the returned text results while recording and correct them in real time. That is, the user can speak a passage continuously, correct and confirm the returned text without stopping the recording, and then continue recording; the user can also record another person's speech while simultaneously correcting and confirming the returned recognition results.
Embodiment 2
As shown in Figs. 6 and 7, the invention provides another voice input method. The difference between this embodiment and Embodiment 1 is that a step of monitoring the noise of the recording environment while recording and obtaining the SNR is added; the number of candidate results can then be adjusted according to different SNRs, and the user can be prompted when the environment is too noisy for voice input. This example can specifically comprise:
Step S21: while recording, monitor the noise of the recording environment and obtain the SNR. Specifically, this step can automatically detect the SNR of the input speech and feed it back on the interactive interface, prompting the user when the environment is too noisy for voice input; the number of candidate results can also be adjusted according to the SNR in the subsequent step S242. Because noise strongly affects speech recognition, when the ambient noise is high the recognition accuracy drops rapidly and the number of characters the user must correct rises sharply. A noise monitoring function can therefore be added in this embodiment: based on the result of endpoint detection, for each recognition result the energy of the corresponding speech segment and the energy of the silent segment (the silent-segment energy is equivalent to the noise energy) are computed respectively, so that the SNR of that section of speech can be estimated. During recording, the pollution level of the ambient noise is displayed on an interface with a recording volume bar 71 and a noise volume bar 72, as shown in Fig. 7; after the ambient noise exceeds a certain threshold, the user can be prompted with "The current noise is too high; keyboard input is recommended";
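The SNR estimate described in Step S21, which compares speech-segment energy with silent-segment (noise) energy, can be sketched as below. The function name and the 1e-12 floor are illustrative assumptions; segment bounds are the sample indices produced by endpoint detection.

```python
import numpy as np

def estimate_snr_db(samples, speech_segments):
    """Estimate the SNR of a recording in dB, treating everything
    outside the detected speech segments as noise (the silent-segment
    energy is equivalent to the noise energy).
    """
    mask = np.zeros(len(samples), dtype=bool)
    for start, end in speech_segments:
        mask[start:end] = True
    speech_power = np.mean(samples[mask] ** 2) if mask.any() else 0.0
    noise_power = np.mean(samples[~mask] ** 2) if (~mask).any() else 0.0
    # tiny floor avoids log(0) on perfectly silent input
    return 10.0 * np.log10(max(speech_power, 1e-12) / max(noise_power, 1e-12))
```

The same two power values could also drive the recording volume bar 71 and noise volume bar 72 of Fig. 7.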
Step S22: while recording, continuously segment the input speech into voice segments and generate a text for each voice segment. Specifically, a cloud server continuously segments the input speech into voice segments by a voice activity detection algorithm and generates the text of each voice segment;
Step S23: display the text of each voice segment in order;
Step S24: correct the text of each voice segment in order according to the user's selections. This step can specifically comprise:
Step S241: the user selects the content that needs correction in the text of each voice segment;
Step S242: generate candidate words for each word in said content, the syllable of each word in said content, and candidate syllables for each word in said content, so that the user can quickly pick the correct characters to correct the content of the text. This step can further comprise:
Step S2421: when the SNR is greater than a predetermined threshold, reduce the number of candidate words and candidate syllables. Specifically, a large SNR means the speech is little polluted by noise and the recognition result is highly accurate, so the number of candidate results can be reduced appropriately;
Step S2422: when the SNR is less than the predetermined threshold, increase the number of candidate words and candidate syllables. Specifically, a small SNR means the speech is heavily polluted by noise and the recognition result is much more likely to contain errors, so the number of candidate results needs to be increased so that the user can select the correct characters from them;
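The threshold rule of steps S2421-S2422 amounts to a small mapping from SNR to candidate-list length. The threshold value and the counts below are illustrative assumptions, not values from the patent:

```python
def candidate_count(snr_db, threshold_db=15.0, base=5, low=3, high=9):
    """Choose how many candidate words/syllables to pop up: fewer when
    recognition is likely accurate (high SNR), more when noise makes
    errors likely (low SNR).
    """
    if snr_db > threshold_db:
        return low    # clean audio: a short candidate list suffices
    if snr_db < threshold_db:
        return high   # noisy audio: widen the candidate list
    return base       # exactly at threshold: keep the default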
Step S243: correct the text of the voice segment according to the candidate word, the syllable, or the candidate syllable selected by the user. This step can further comprise:
Step S2431: when the user selects a candidate word, replace the corresponding word in said content with the selected candidate word;
Step S2432: when the user selects the syllable, generate candidate words corresponding to the syllable and select the correct candidate word from them to replace the corresponding word in said content;
Step S2433: when the user selects a candidate syllable, generate candidate words corresponding to the candidate syllable and select the correct candidate word from them to replace the corresponding word in said content;
Step S2434: when none of the generated candidate words or candidate syllables yields the correct result, an input method can be called up to modify the text.
In this embodiment, multiple voice technologies such as noise monitoring, endpoint detection, and continuous speech recognition are integrated into one interaction process, letting the user fully experience the convenience of voice input and improving the user experience when mixing voice input and keystroke input.
Embodiment 3
As shown in Fig. 8, the present invention also provides a voice input system, comprising a segmentation module 41, a correction module 42, and a noise monitoring unit 43.
The segmentation module 41 is used for continuously segmenting the input speech into voice segments while recording and generating a text for each voice segment. Specifically, the segmentation module 41 is located on a cloud server and continuously segments the input speech into voice segments by a voice activity detection algorithm; this module can automatically segment the speech recognition result and return it paragraph by paragraph for a second confirmation by the user.
The correction module 42 is used for displaying the text of each voice segment in order and correcting the text of each voice segment in order according to the user's selections. Specifically, this module lets the user correct and confirm the returned text while still recording. It should be noted that in the interaction scheme of the invention, not all text recognition results are displayed at once; only the text recognition result of the current segment is shown on the interface. After the user corrects and confirms the text recognition result of one voice segment, the next recognition result is displayed. The benefit of this display scheme is that a limited number of results are shown in turn on a limited screen, letting the user concentrate on the current recognition result and improving the efficiency of correcting the text. The correction module 42 can further comprise a selection unit 421, a candidate unit 422, and a correction unit 423.
The selection unit 421 is used for obtaining the content that the user has selected for correction in the text of each voice segment.
The candidate unit 422 is used for generating candidate words for each word in said content, the syllable of each word in said content, and candidate syllables for each word in said content. Specifically, when the user taps a character in the recognition result that needs correction, several candidate words for that character, the corresponding syllable, and several candidate syllables can be arranged to pop up. In this way the speech recognition result is effectively combined with the input method: multiple candidates are offered for the user to choose from, and the recognition result is degraded from characters to syllables, enlarging the hit range, so that the user does not have to type a string of letters but can find the desired word among the candidates. In addition, the candidate unit 422 can also be used for reducing the number of candidate words and candidate syllables when the SNR is greater than a predetermined threshold (a large SNR means the speech is little polluted by noise and the recognition result is highly accurate, so the number of candidate results can be reduced appropriately), and for increasing the number of candidate words and candidate syllables when the SNR is less than the predetermined threshold (a small SNR means the speech is heavily polluted by noise and the recognition result is much more likely to contain errors, so the number of candidate results needs to be increased so that the user can select the correct characters from them).
The correction unit 423 is used for correcting the text of the voice segment according to the candidate word, the syllable, or the candidate syllable selected by the user. Specifically, the correction unit 423 is used for: when the user selects a candidate word, replacing the corresponding word in said content with the selected candidate word; when the user selects the syllable, generating candidate words corresponding to the syllable and selecting the correct candidate word from them to replace the corresponding word in said content; when the user selects a candidate syllable, generating candidate words corresponding to the candidate syllable and selecting the correct candidate word from them to replace the corresponding word in said content; and when none of the generated candidate words or candidate syllables yields the correct result, calling up an input method to modify the text.
The noise monitoring unit 43 is used for monitoring the noise of the recording environment while recording and obtaining the SNR, so that the number of candidate results can be adjusted according to different SNRs and the user can be prompted when the environment is too noisy for voice input.
The present invention continuously segments the input speech into voice segments while recording, generates a text for each voice segment, displays the texts in order, and corrects them in order according to the user's selections. The speech recognition result can thus be segmented automatically and returned paragraph by paragraph for a second confirmation by the user, so that the user can correct and confirm the returned text while still recording.
In addition, the user selects the content that needs correction in the text of each voice segment; candidate words for each word in said content, the syllable of each word, and candidate syllables for each word are then generated; and the text of the voice segment is corrected according to the candidate word, the syllable, or the candidate syllable the user selects. This lets the user quickly pick the correct characters to correct the content of the text.
In addition, by monitoring the noise of the recording environment while recording to obtain the SNR, reducing the number of candidate words and candidate syllables when the SNR is greater than a predetermined threshold, and increasing them when the SNR is less than the threshold, the number of candidate results can be adjusted according to different SNRs.
Each embodiment adopts the mode of going forward one by one to describe in this instructions, and what each embodiment stressed is and the difference of other embodiment that identical similar part is mutually referring to getting final product between each embodiment.For the disclosed system of embodiment, because corresponding with the disclosed method of embodiment, so description is fairly simple, relevant part partly illustrates referring to method and gets final product.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms according to their functions. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled practitioners may implement the described functions in different ways for each particular application, but such implementations should not be regarded as exceeding the scope of the present invention.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present invention and their technical equivalents, the present invention is intended to encompass them as well.

Claims (14)

1. A voice input method, characterized in that it comprises:
continuously segmenting the input speech into speech segments during recording and generating text for each speech segment; and
displaying the text of each speech segment in sequence, and revising the text of each speech segment in sequence according to the user's selections.
2. The voice input method of claim 1, characterized in that a cloud server continuously segments the input speech into speech segments and generates the text of each speech segment.
3. The voice input method of claim 1, characterized in that a voice activity detection algorithm continuously segments the input speech into speech segments.
4. The voice input method of claim 1, characterized in that the step of revising the text of each speech segment in sequence according to the user's selections comprises:
the user selecting the content to be revised in the text of each speech segment;
generating candidate words for each word in the content, the syllable of each word in the content, and candidate syllables for each word in the content; and
revising the text of the speech segment according to the candidate word, syllable, or candidate syllable selected by the user.
5. The voice input method of claim 4, characterized in that the step of revising the text of the speech segment according to the candidate word, syllable, or candidate syllable selected by the user comprises:
when the user selects a candidate word, replacing the corresponding word in the content with the selected candidate word;
when the user selects a syllable, generating candidate words corresponding to the syllable, and selecting the correct candidate word from among them to replace the corresponding word in the content;
when the user selects a candidate syllable, generating candidate words corresponding to the candidate syllable, and selecting the correct candidate word from among them to replace the corresponding word in the content; and
when none of the generated candidate words and candidate syllables is correct, invoking an input method to modify the text.
6. The voice input method of claim 5, characterized in that, before the step of continuously segmenting the input speech into speech segments during recording and generating the text of each speech segment, the method further comprises: monitoring the recording environment for noise during recording to obtain a signal-to-noise ratio.
7. The voice input method of claim 6, characterized in that the step of generating candidate words for each word in the content, the syllable of each word in the content, and candidate syllables for each word in the content comprises:
when the signal-to-noise ratio is greater than a predetermined threshold, reducing the numbers of candidate words and candidate syllables; and
when the signal-to-noise ratio is less than the predetermined threshold, increasing the numbers of candidate words and candidate syllables.
8. A voice input system, characterized in that it comprises:
a segmentation module, configured to continuously segment the input speech into speech segments during recording and generate the text of each speech segment; and
a correction module, configured to display the text of each speech segment in sequence and revise the text of each speech segment in sequence according to the user's selections.
9. The voice input system of claim 8, characterized in that the segmentation module is located on a cloud server.
10. The voice input system of claim 8, characterized in that the segmentation module continuously segments the input speech into speech segments by means of a voice activity detection algorithm.
11. The voice input system of claim 8, characterized in that the correction module comprises:
a selection unit, configured to obtain the content to be revised that the user selects in the text of each speech segment;
a candidate unit, configured to generate candidate words for each word in the content, the syllable of each word in the content, and candidate syllables for each word in the content; and
an amendment unit, configured to revise the text of the speech segment according to the candidate word, syllable, or candidate syllable selected by the user.
12. The voice input system of claim 11, characterized in that the amendment unit is configured to: when the user selects a candidate word, replace the corresponding word in the content with the selected candidate word; when the user selects a syllable, generate candidate words corresponding to the syllable, and select the correct candidate word from among them to replace the corresponding word in the content; when the user selects a candidate syllable, generate candidate words corresponding to the candidate syllable, and select the correct candidate word from among them to replace the corresponding word in the content; and when none of the generated candidate words and candidate syllables is correct, invoke an input method to modify the text.
13. The voice input system of claim 12, characterized in that the system further comprises a noise monitoring unit, configured to monitor the recording environment for noise during recording and obtain a signal-to-noise ratio.
14. The voice input system of claim 13, characterized in that the candidate unit is configured to reduce the numbers of candidate words and candidate syllables when the signal-to-noise ratio is greater than a predetermined threshold, and to increase the numbers of candidate words and candidate syllables when the signal-to-noise ratio is less than the predetermined threshold.
CN201210101302.9A 2012-03-31 2012-03-31 Voice input method and system Active CN103366742B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210101302.9A CN103366742B (en) 2012-03-31 2012-03-31 Voice input method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210101302.9A CN103366742B (en) 2012-03-31 2012-03-31 Voice input method and system

Publications (2)

Publication Number Publication Date
CN103366742A true CN103366742A (en) 2013-10-23
CN103366742B CN103366742B (en) 2018-07-31

Family

ID=49367943

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210101302.9A Active CN103366742B (en) 2012-03-31 2012-03-31 Voice input method and system

Country Status (1)

Country Link
CN (1) CN103366742B (en)

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103559880A (en) * 2013-11-08 2014-02-05 百度在线网络技术(北京)有限公司 Voice input system and voice input method
CN103871401A (en) * 2012-12-10 2014-06-18 联想(北京)有限公司 Method for voice recognition and electronic equipment
CN105206267A (en) * 2015-09-09 2015-12-30 中国科学院计算技术研究所 Voice recognition error correction method with integration of uncertain feedback and system thereof
CN105469801A (en) * 2014-09-11 2016-04-06 阿里巴巴集团控股有限公司 Input speech restoring method and device
CN105630959A (en) * 2015-12-24 2016-06-01 联想(北京)有限公司 Text information displaying method and electronic equipment
CN106331893A (en) * 2016-08-31 2017-01-11 科大讯飞股份有限公司 Real-time subtitle display method and system
CN106603381A (en) * 2016-11-24 2017-04-26 北京小米移动软件有限公司 Chat information processing method and device
CN106710597A (en) * 2017-01-04 2017-05-24 广东小天才科技有限公司 Voice data recording method and device
CN107068145A (en) * 2016-12-30 2017-08-18 中南大学 Speech evaluating method and system
CN107230478A (en) * 2017-05-03 2017-10-03 上海斐讯数据通信技术有限公司 A kind of voice information processing method and system
CN107644646A (en) * 2017-09-27 2018-01-30 北京搜狗科技发展有限公司 Method of speech processing, device and the device for speech processes
CN107679032A (en) * 2017-09-04 2018-02-09 百度在线网络技术(北京)有限公司 Voice changes error correction method and device
CN108039173A (en) * 2017-12-20 2018-05-15 深圳安泰创新科技股份有限公司 Voice messaging input method, mobile terminal, system and readable storage medium storing program for executing
CN108320747A (en) * 2018-02-08 2018-07-24 广东美的厨房电器制造有限公司 Appliances equipment control method, equipment, terminal and computer readable storage medium
CN108600773A (en) * 2018-04-25 2018-09-28 腾讯科技(深圳)有限公司 Caption data method for pushing, subtitle methods of exhibiting, device, equipment and medium
CN108632465A (en) * 2018-04-27 2018-10-09 维沃移动通信有限公司 A kind of method and mobile terminal of voice input
CN108737634A (en) * 2018-02-26 2018-11-02 珠海市魅族科技有限公司 Pronunciation inputting method and device, computer installation and computer readable storage medium
CN109471537A (en) * 2017-09-08 2019-03-15 腾讯科技(深圳)有限公司 Pronunciation inputting method, device, computer equipment and storage medium
CN109739425A (en) * 2018-04-19 2019-05-10 北京字节跳动网络技术有限公司 A kind of dummy keyboard, pronunciation inputting method, device and electronic equipment
CN110347996A (en) * 2019-07-15 2019-10-18 北京百度网讯科技有限公司 Amending method, device, electronic equipment and the storage medium of text
CN110491370A (en) * 2019-07-15 2019-11-22 北京大米科技有限公司 A kind of voice stream recognition method, device, storage medium and server
CN110600039A (en) * 2019-09-27 2019-12-20 百度在线网络技术(北京)有限公司 Speaker attribute determination method and device, electronic equipment and readable storage medium
CN111326144A (en) * 2020-02-28 2020-06-23 网易(杭州)网络有限公司 Voice data processing method, device, medium and computing equipment
CN112151072A (en) * 2020-08-21 2020-12-29 北京搜狗科技发展有限公司 Voice processing method, apparatus and medium
CN114120992A (en) * 2020-09-01 2022-03-01 北京字节跳动网络技术有限公司 Method and device for generating video through voice, electronic equipment and computer readable medium
CN117251556A (en) * 2023-11-17 2023-12-19 北京遥领医疗科技有限公司 Patient screening system and method in registration queue

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1181574A (en) * 1996-10-31 1998-05-13 微软公司 Method and system for selecting recognized words when correcting recognized speech
JP2000259178A (en) * 1999-03-08 2000-09-22 Fujitsu Ten Ltd Speech recognition device
JP2002156996A (en) * 2000-11-16 2002-05-31 Toshiba Corp Voice recognition device, recognition result correcting method, and recording medium
CN1901041A (en) * 2005-07-22 2007-01-24 康佳集团股份有限公司 Voice dictionary forming method and voice identifying system and its method
US7310602B2 (en) * 2004-09-27 2007-12-18 Kabushiki Kaisha Equos Research Navigation apparatus
CN101131636A (en) * 2006-08-18 2008-02-27 李颖 On-line voice or Pinyin input method
CN101593076A (en) * 2008-05-28 2009-12-02 Lg电子株式会社 Portable terminal and the method that is used to revise its text
JP2010231433A (en) * 2009-03-26 2010-10-14 Fujitsu Ten Ltd Retrieval device
CN102122506A (en) * 2011-03-08 2011-07-13 天脉聚源(北京)传媒科技有限公司 Method for recognizing voice
CN102215233A (en) * 2011-06-07 2011-10-12 盛乐信息技术(上海)有限公司 Information system client and information publishing and acquisition methods
CN102299934A (en) * 2010-06-23 2011-12-28 上海博路信息技术有限公司 Voice input method based on cloud mode and voice recognition
CN102779511A (en) * 2011-05-12 2012-11-14 Nhn株式会社 Speech recognition system and method based on word-level candidate generation


Cited By (38)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103871401B (en) * 2012-12-10 2016-12-28 联想(北京)有限公司 A kind of method of speech recognition and electronic equipment
CN103871401A (en) * 2012-12-10 2014-06-18 联想(北京)有限公司 Method for voice recognition and electronic equipment
US10068570B2 (en) 2012-12-10 2018-09-04 Beijing Lenovo Software Ltd Method of voice recognition and electronic apparatus
CN103559880A (en) * 2013-11-08 2014-02-05 百度在线网络技术(北京)有限公司 Voice input system and voice input method
CN103559880B (en) * 2013-11-08 2015-12-30 百度在线网络技术(北京)有限公司 Voice entry system and method
CN105469801A (en) * 2014-09-11 2016-04-06 阿里巴巴集团控股有限公司 Input speech restoring method and device
CN105469801B (en) * 2014-09-11 2019-07-12 阿里巴巴集团控股有限公司 A kind of method and device thereof for repairing input voice
CN105206267B (en) * 2015-09-09 2019-04-02 中国科学院计算技术研究所 A kind of the speech recognition errors modification method and system of fusion uncertainty feedback
CN105206267A (en) * 2015-09-09 2015-12-30 中国科学院计算技术研究所 Voice recognition error correction method with integration of uncertain feedback and system thereof
CN105630959A (en) * 2015-12-24 2016-06-01 联想(北京)有限公司 Text information displaying method and electronic equipment
CN106331893A (en) * 2016-08-31 2017-01-11 科大讯飞股份有限公司 Real-time subtitle display method and system
CN106331893B (en) * 2016-08-31 2019-09-03 科大讯飞股份有限公司 Real-time caption presentation method and system
CN106603381A (en) * 2016-11-24 2017-04-26 北京小米移动软件有限公司 Chat information processing method and device
CN107068145A (en) * 2016-12-30 2017-08-18 中南大学 Speech evaluating method and system
CN106710597A (en) * 2017-01-04 2017-05-24 广东小天才科技有限公司 Voice data recording method and device
CN107230478A (en) * 2017-05-03 2017-10-03 上海斐讯数据通信技术有限公司 A kind of voice information processing method and system
CN107679032A (en) * 2017-09-04 2018-02-09 百度在线网络技术(北京)有限公司 Voice changes error correction method and device
CN109471537A (en) * 2017-09-08 2019-03-15 腾讯科技(深圳)有限公司 Pronunciation inputting method, device, computer equipment and storage medium
CN107644646A (en) * 2017-09-27 2018-01-30 北京搜狗科技发展有限公司 Method of speech processing, device and the device for speech processes
CN107644646B (en) * 2017-09-27 2021-02-02 北京搜狗科技发展有限公司 Voice processing method and device for voice processing
CN108039173A (en) * 2017-12-20 2018-05-15 深圳安泰创新科技股份有限公司 Voice messaging input method, mobile terminal, system and readable storage medium storing program for executing
CN108320747A (en) * 2018-02-08 2018-07-24 广东美的厨房电器制造有限公司 Appliances equipment control method, equipment, terminal and computer readable storage medium
CN108737634A (en) * 2018-02-26 2018-11-02 珠海市魅族科技有限公司 Pronunciation inputting method and device, computer installation and computer readable storage medium
CN108737634B (en) * 2018-02-26 2020-03-27 珠海市魅族科技有限公司 Voice input method and device, computer device and computer readable storage medium
CN109739425A (en) * 2018-04-19 2019-05-10 北京字节跳动网络技术有限公司 A kind of dummy keyboard, pronunciation inputting method, device and electronic equipment
CN109739425B (en) * 2018-04-19 2020-02-18 北京字节跳动网络技术有限公司 Virtual keyboard, voice input method and device and electronic equipment
CN108600773A (en) * 2018-04-25 2018-09-28 腾讯科技(深圳)有限公司 Caption data method for pushing, subtitle methods of exhibiting, device, equipment and medium
CN108632465A (en) * 2018-04-27 2018-10-09 维沃移动通信有限公司 A kind of method and mobile terminal of voice input
CN110491370A (en) * 2019-07-15 2019-11-22 北京大米科技有限公司 A kind of voice stream recognition method, device, storage medium and server
CN110347996A (en) * 2019-07-15 2019-10-18 北京百度网讯科技有限公司 Amending method, device, electronic equipment and the storage medium of text
CN110600039A (en) * 2019-09-27 2019-12-20 百度在线网络技术(北京)有限公司 Speaker attribute determination method and device, electronic equipment and readable storage medium
CN110600039B (en) * 2019-09-27 2022-05-20 百度在线网络技术(北京)有限公司 Method and device for determining speaker attribute, electronic equipment and readable storage medium
CN111326144A (en) * 2020-02-28 2020-06-23 网易(杭州)网络有限公司 Voice data processing method, device, medium and computing equipment
CN111326144B (en) * 2020-02-28 2023-03-03 网易(杭州)网络有限公司 Voice data processing method, device, medium and computing equipment
CN112151072A (en) * 2020-08-21 2020-12-29 北京搜狗科技发展有限公司 Voice processing method, apparatus and medium
CN112151072B (en) * 2020-08-21 2024-07-02 北京搜狗科技发展有限公司 Voice processing method, device and medium
CN114120992A (en) * 2020-09-01 2022-03-01 北京字节跳动网络技术有限公司 Method and device for generating video through voice, electronic equipment and computer readable medium
CN117251556A (en) * 2023-11-17 2023-12-19 北京遥领医疗科技有限公司 Patient screening system and method in registration queue

Also Published As

Publication number Publication date
CN103366742B (en) 2018-07-31

Similar Documents

Publication Publication Date Title
CN103366742A (en) Voice input method and system
US11037566B2 (en) Word-level correction of speech input
EP3469592B1 (en) Emotional text-to-speech learning system
US20120016671A1 (en) Tool and method for enhanced human machine collaboration for rapid and accurate transcriptions
KR101312849B1 (en) Combined speech and alternate input modality to a mobile device
US20210004768A1 (en) System and method for interview training with time-matched feedback
US10089974B2 (en) Speech recognition and text-to-speech learning system
EP3120345B1 (en) Incremental utterance decoder combination for efficient and accurate decoding
EP2609588B1 (en) Speech recognition using language modelling
US9135231B1 (en) Training punctuation models
KR20180090869A (en) Determine dialog states for language models
US9244906B2 (en) Text entry at electronic communication device
CN106484131B (en) Input error correction method and input method device
US10553206B2 (en) Voice keyword detection apparatus and voice keyword detection method
US20170169812A1 (en) Providing intelligent transcriptions of sound messages in a messaging application
CN103369122A (en) Voice input method and system
US9772816B1 (en) Transcription and tagging system
JP2015158582A (en) Voice recognition device and program
US20140365218A1 (en) Language model adaptation using result selection
JP6930538B2 (en) Information processing equipment, information processing methods, and programs
JP6559417B2 (en) Information processing apparatus, information processing method, dialogue system, and control program
JP2013050605A (en) Language model switching device and program for the same
CN116564286A (en) Voice input method and device, storage medium and electronic equipment
JP6260138B2 (en) COMMUNICATION PROCESSING DEVICE, COMMUNICATION PROCESSING METHOD, AND COMMUNICATION PROCESSING PROGRAM
Glackin et al. Smart Transcription

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
ASS Succession or assignment of patent right

Owner name: SHANGHAI GUOKE ELECTRONIC CO., LTD.

Free format text: FORMER OWNER: SHENGYUE INFORMATION TECHNOLOGY (SHANGHAI) CO., LTD.

Effective date: 20140919

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20140919

Address after: 201203, room 1, building 380, 108 Yin Yin Road, Shanghai, Pudong New Area

Applicant after: Shanghai Guoke Electronic Co., Ltd.

Address before: 201203 Shanghai City, Pudong New Area Shanghai City, Guo Shou Jing Road, Zhangjiang hi tech Park No. 356 building 3 Room 102

Applicant before: Shengle Information Technology (Shanghai) Co., Ltd.

EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CP03 Change of name, title or address

Address after: Room 127, building 3, 356 GuoShouJing Road, Zhangjiang High Tech Park, Pudong New Area, Shanghai 201204

Patentee after: SHANGHAI GEAK ELECTRONICS Co.,Ltd.

Address before: Room 108, building 1, 380 Yinbei Road, Pudong New Area, Shanghai 201203

Patentee before: Shanghai Nutshell Electronics Co.,Ltd.

CP03 Change of name, title or address