CN102982019B

CN102982019B - Method for phonetic notation of input method corpus, method for generating evaluation corpus, and electronic device

Info

Publication number: CN102982019B
Application number: CN201210486723.8A
Authority: CN
Inventors: 景富香
Original assignee: Baidu International Technology Shenzhen Co Ltd
Current assignee: Baidu International Technology Shenzhen Co Ltd
Priority date: 2012-11-26
Filing date: 2012-11-26
Publication date: 2019-01-15
Anticipated expiration: 2032-11-26
Also published as: CN102982019A

Abstract

The invention discloses a method for phonetic notation of an input method corpus, a method for generating an evaluation corpus, and an electronic device. The corpus phonetic notation method includes: annotating each corpus separately with at least two different phonetic notation tools, so that each corpus has at least two corresponding phonetic notations; and judging whether the at least two phonetic notations of each corpus are identical. If they differ, the phonetic notation with the better assessment result is selected as the correct phonetic notation of the corpus; if they are identical, that phonetic notation is used directly as the correct phonetic notation of the corpus. In this way, the invention greatly reduces the workload of manually verifying the correct phonetic notation of a corpus, improving both the efficiency and the accuracy of corpus phonetic notation.

Description

Method for phonetic notation of input method corpus, method for generating evaluation corpus, and electronic device

Technical field

The present invention relates to the technical field of input methods, and in particular to a method for phonetic notation of an input method corpus, a method for generating an evaluation corpus, and an electronic device.

Background art

An input method is an encoding method used to enter various symbols into a computer or other device (such as a mobile phone). The performance of an input method directly affects input efficiency on the computer or other device. It is therefore necessary to evaluate input method performance in order to provide a basis for continually improving the input method.

An input method is evaluated by performing operations such as typing and word selection on an evaluation corpus, recording during the process the position of the ideal candidate result and the number of edits needed to obtain it, and finally computing, over many rounds of typing and word selection, the distribution of the ideal candidate's position and the average number of edits needed to obtain it. Together these reflect the ease of use of the input method. The evaluation corpus is thus the precondition of input method evaluation, so finding an objective, practical, and correct evaluation corpus is of great importance to input method evaluation.

Evaluation corpora are generally collected either by hand or by automatic methods. At present, collecting an evaluation corpus by hand is inefficient, and evaluation corpora generated by common automatic methods have at least the following problems: the word segmentation mechanism is unreasonable, so most of the corpora that users actually enter are lost and the resulting corpus is unsuitable, which affects the evaluation result of the input method; and there is no mature phonetic notation tool for annotating the corpus accurately.

Summary of the invention

The main technical problem solved by the invention is to provide a method for phonetic notation of an input method corpus, a method for generating an evaluation corpus, and an electronic device that improve the efficiency of generating an evaluation corpus, while making the generated evaluation corpus closer to what users actually type, with good continuity and high phonetic notation accuracy.

To solve the above technical problem, one technical solution adopted by the invention is to provide a method for corpus phonetic notation, including: annotating each corpus separately with at least two different phonetic notation tools, so that each corpus has at least two corresponding phonetic notations; and judging whether the at least two phonetic notations of each corpus are identical, selecting, if they differ, the phonetic notation with the better assessment result as the correct phonetic notation of the corpus, and, if they are identical, using that phonetic notation directly as the correct phonetic notation of the corpus.

To solve the above technical problem, another technical solution adopted by the invention is to provide a method for generating an input method evaluation corpus, including: segmenting captured history input content into at least one corpus of a single user entry; annotating each corpus separately with at least two different phonetic notation tools, so that each corpus has at least two corresponding phonetic notations; judging whether the at least two phonetic notations of each corpus are identical, selecting, if they differ, the phonetic notation with the better assessment result as the correct phonetic notation of the corpus, and, if they are identical, using that phonetic notation directly as the correct phonetic notation of the corpus; and taking the corpus whose correct phonetic notation has been determined as the evaluation corpus.

The step of segmenting the captured history input content into corpora of a single user entry includes: performing a first segmentation of the captured history input content using punctuation marks as boundaries; and performing a second segmentation of the corpora obtained from the first segmentation by phrase segment, to obtain the corpora of a single user entry.

The step of performing the second segmentation by phrase segment on the corpora obtained from the first segmentation includes: performing the second segmentation by phrase segment using juman and knp.

After the step of segmenting the captured history input content into corpora of a single user entry, and before the step of annotating the corpora with at least two different phonetic notation tools, the method further includes: denoising the corpora of a single user entry obtained by segmentation, to eliminate meaningless corpora.

The step of denoising the corpora of a single user entry obtained by segmentation includes: denoising those corpora using customized noise rules.

After the step of denoising the corpora of a single user entry obtained by segmentation, the method further includes: calculating the frequency of each denoised corpus, and selecting corpora by a roulette algorithm.

After the step of calculating the frequency of the denoised corpora and selecting corpora by the roulette algorithm, the method further includes: among the selected corpora, for identical corpora, retaining only one of them as the corpus to be annotated with the at least two different phonetic notation tools.

After the step of generating the evaluation corpus, the method further includes: running at least one input method tool to input the evaluation corpus so as to obtain corresponding candidate results, and collecting the corresponding candidate results; and saving the evaluation corpus together with the corresponding candidate results to obtain an evaluation corpus library.

Before the step of segmenting the captured history input content into corpora of a single user entry, the method further includes: capturing content of a predetermined field or type on the network as the history input content.

To solve the above technical problem, another technical solution adopted by the invention is to provide an electronic device including a phonetic notation module, a judgment module, and a phonetic notation determining module, wherein: the phonetic notation module is configured to annotate each corpus separately with at least two different phonetic notation tools, so that each corpus has at least two corresponding phonetic notations, and to output the at least two phonetic notations of each corpus to the judgment module; the judgment module is configured to judge whether the at least two phonetic notations of each corpus are identical, and to output the judgment result to the phonetic notation determining module; and the phonetic notation determining module is configured to use the phonetic notation directly as the correct phonetic notation of the corpus when the at least two phonetic notations of the corpus are identical, and to select the phonetic notation with the better assessment result as the correct phonetic notation of the corpus when the at least two phonetic notations of the corpus differ.

To solve the above technical problem, another technical solution adopted by the invention is to provide an electronic device including a segmentation module, a phonetic notation module, a judgment module, and an evaluation corpus generation module, wherein: the segmentation module is configured to segment captured history input content into at least one corpus of a single user entry, and to output the at least one corpus to the phonetic notation module; the phonetic notation module is configured to annotate each corpus separately with at least two different phonetic notation tools, so that each corpus has at least two corresponding phonetic notations, and to output the at least two phonetic notations of each corpus to the judgment module; the judgment module is configured to judge whether the at least two phonetic notations of each corpus are identical, and to output the judgment result to the evaluation corpus generation module; and the evaluation corpus generation module is configured to use the phonetic notation directly as the correct phonetic notation of the corpus when the at least two phonetic notations of the corpus are identical, to select the phonetic notation with the better assessment result as the correct phonetic notation of the corpus when they differ, and to take the corpus whose correct phonetic notation has been determined as the evaluation corpus.

The segmentation module includes a first segmentation unit and a second segmentation unit, wherein: the first segmentation unit is configured to perform a first segmentation of the captured history input content using punctuation marks as boundaries, and to output the corpora obtained from the first segmentation to the second segmentation unit; and the second segmentation unit is configured to perform a second segmentation by phrase segment on the corpora received from the first segmentation unit, to obtain the corpora of a single user entry, and to output those corpora to the phonetic notation module.

The second segmentation unit is specifically configured to perform the second segmentation by phrase segment, using juman and knp, on the corpora obtained from the first segmentation.

The device further includes a denoising module, configured to denoise the corpora of a single user entry obtained by the segmentation module, to eliminate meaningless corpora.

The denoising module is specifically configured to denoise the corpora of a single user entry obtained by segmentation using customized noise rules.

The device further includes a corpus selection module, configured to calculate the frequency of each corpus processed by the denoising module, and to select corpora by a roulette algorithm.

The device further includes a deduplication module, configured to retain, among the corpora selected by the corpus selection module, only one of any identical corpora as the corpus to be annotated with the at least two different phonetic notation tools, and to output the retained corpus to the phonetic notation module.

The device further includes an evaluation corpus library module, configured to run at least one input method tool to input the evaluation corpus obtained by the evaluation corpus generation module so as to obtain corresponding candidate results, to collect the corresponding candidate results, and to save the evaluation corpus together with the corresponding candidate results to obtain an evaluation corpus library.

The device further includes a content capture module, configured to capture content of a predetermined field or type on the network as the history input content, and to output the captured history input content to the segmentation module.

The beneficial effects of the invention are as follows. On the one hand, the corpus phonetic notation method of the invention annotates each corpus separately with multiple phonetic notation tools and then determines the correct phonetic notation of the corpus by cross-checking. This greatly reduces the workload of manually verifying the correct phonetic notation of a corpus, and improves both the efficiency and the accuracy of corpus phonetic notation.

On the other hand, because the method for generating an input method evaluation corpus annotates the corpus with multiple phonetic notation tools and then cross-checks the multiple phonetic notations of the same corpus to determine its correct phonetic notation, it effectively reduces the workload of manually reviewing corpus phonetic notation while also effectively improving the accuracy of the evaluation corpus phonetic notation. In addition, segmenting the captured history input content into corpora of a single user entry makes the resulting evaluation corpus closer to what users actually type, and gives the evaluation corpus better continuity. The whole process of generating the evaluation corpus requires almost no manual participation, which greatly improves the efficiency of generating the evaluation corpus.

Brief description of the drawings

Fig. 1 is a flow chart of an embodiment of the corpus phonetic notation method of the invention;

Fig. 2 is a flow chart of an embodiment of the method for generating an input method evaluation corpus of the invention;

Fig. 3 is a flow chart of segmenting the captured history input content into corpora of a single user entry in an embodiment of the method for generating an input method evaluation corpus of the invention;

Fig. 4 is a flow chart of another embodiment of the method for generating an input method evaluation corpus of the invention;

Fig. 5 is a structural schematic diagram of a first embodiment of the electronic device of the invention;

Fig. 6 is a structural schematic diagram of a second embodiment of the electronic device of the invention;

Fig. 7 is a structural schematic diagram of the segmentation module in a third embodiment of the electronic device of the invention;

Fig. 8 is a structural schematic diagram of a fourth embodiment of the electronic device of the invention.

Detailed description of the embodiments

In the use of input methods, whether Japanese or Chinese, polyphonic characters (characters with more than one reading) are common. The accuracy of corpus phonetic notation therefore directly affects the user's experience of the input method. Corpus phonetic notation refers to the process of annotating a corpus with the phonetic notation tool of an input method in order to determine the correct phonetic notation of the corpus.

Referring to Fig. 1, an embodiment of the corpus phonetic notation method of the invention includes:

Step S101: annotate each corpus separately with at least two different phonetic notation tools;

Each corpus that needs phonetic notation is annotated with at least two different phonetic notation tools, so that each corpus has at least two phonetic notations. The phonetic notation tools may be any two or more different phonetic notation tools; the invention places no limitation on them.

Step S102: judge whether the at least two phonetic notations of each corpus are identical;

Judge whether the at least two phonetic notations of each corpus obtained in step S101 are identical. If they are identical, go to step S103; if they differ, go to step S104. For example, a diff (text comparison) can be run on the multiple phonetic notations of the same corpus. If the result shows a difference, the phonetic notation tools disagree on the phonetic notation of the corpus and further verification is needed to determine the correct phonetic notation, so go to step S104. If the result shows no difference, the phonetic notation tools agree on the phonetic notation of the corpus, so go to step S103.

Step S103: use the corresponding phonetic notation directly as the correct phonetic notation of the corpus;

When the at least two phonetic notations of a corpus are identical, the phonetic notation tools agree on the phonetic notation of that corpus, so the phonetic notation can be assumed correct and is used directly as the correct phonetic notation of the corpus. The process then ends.

Step S104: select the phonetic notation with the better assessment result as the correct phonetic notation of the corpus;

When the at least two phonetic notations of a corpus differ, the phonetic notation tools disagree on the phonetic notation of that corpus. The phonetic notations of the corpus must then be assessed again, and the phonetic notation with the better assessment result is selected as the correct phonetic notation of the corpus. In this assessment, the better phonetic notation can be selected from the existing phonetic notations. If the reassessment finds that none of the phonetic notation tools has annotated the corpus appropriately, the correct phonetic notation of the corpus can also be determined manually. The process then ends.

The corpus phonetic notation method of the invention is further illustrated by the following examples:

Take the corpus "stiff neck" as an example. If two phonetic notation tools produce "lào zhěn" and "lào zhěn" respectively, the diff result is identical, and "lào zhěn" is used directly as the correct phonetic notation of the corpus. If the two phonetic notation tools produce "là zhěn" and "lào zhěn" respectively, and assessment finds "lào zhěn" to be the phonetic notation with the better assessment result, then "lào zhěn" is taken as the correct phonetic notation of the corpus. If reassessment finds that neither of the two tools' phonetic notations is appropriate, "lào zhěn" can be determined manually as the phonetic notation with the better assessment result and taken as the correct phonetic notation of the corpus.

For the same corpus "stiff neck", if three phonetic notation tools produce "lào zhěn", "lào zhěn" and "lào zhěn" respectively, "lào zhěn" is used directly as the correct phonetic notation of the corpus "stiff neck". If the three phonetic notation tools produce "lào zhěn", "là zhěn" and "lào zhěn", or "lào zhěn", "lào zhěn" and "là zhěn", the phonetic notations differ; if assessment finds "lào zhěn" to be the phonetic notation with the better assessment result, "lào zhěn" is taken as the correct phonetic notation of the corpus.
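The cross-check of steps S101 to S104 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation described by the patent: the function name `determine_notation`, the tool callables, and the `assess` callback are all hypothetical, since the patent names neither specific phonetic notation tools nor a specific assessment procedure. Here a plain string comparison stands in for the diff, and a corpus is flagged for manual determination when no assessment is available.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical types: a tool maps a corpus item to its phonetic notation,
# and an assessor picks the better notation (or None if none is suitable).
Annotator = Callable[[str], str]
Assessor = Callable[[str, Sequence[str]], Optional[str]]


def determine_notation(
    corpus: str,
    tools: Sequence[Annotator],
    assess: Optional[Assessor] = None,
) -> Tuple[Optional[str], bool]:
    """Return (notation, needs_manual_review) for one corpus item."""
    if len(tools) < 2:
        raise ValueError("at least two phonetic notation tools are required")
    # Step S101: annotate the corpus item with every tool.
    notations = [tool(corpus) for tool in tools]
    # Step S102: compare the notations (a plain string comparison here).
    if len(set(notations)) == 1:
        # Step S103: all tools agree, so use the notation directly.
        return notations[0], False
    # Step S104: the tools disagree, so ask the assessor for the better one.
    chosen = assess(corpus, notations) if assess is not None else None
    # With no usable assessment, flag the item for manual determination.
    return chosen, chosen is None


# Stand-in tools built from lookup tables (hypothetical, for illustration).
tool_a = {"落枕": "lào zhěn"}.__getitem__
tool_b = {"落枕": "là zhěn"}.__getitem__
# Placeholder assessor that simply keeps the first notation; a real
# assessment would need its own criteria, which the text does not specify.
prefer_first = lambda corpus, notations: notations[0]

print(determine_notation("落枕", [tool_a, tool_a]))
print(determine_notation("落枕", [tool_a, tool_b], prefer_first))
print(determine_notation("落枕", [tool_a, tool_b]))
```

Returning a review flag alongside the notation keeps the manual fallback of step S104 explicit: only the corpora on which the tools disagree and no assessment succeeds need human attention.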

From the description of the above embodiment it can be understood that, unlike the prior art, the corpus phonetic notation method of the invention annotates each corpus separately with multiple phonetic notation tools and then determines the correct phonetic notation of the corpus by cross-checking. This greatly reduces the number of corpora whose correct phonetic notation must be verified manually, and improves both the efficiency and the accuracy of corpus phonetic notation. The method can be used for corpus phonetic notation in different languages, such as, but not limited to, Japanese corpora and Chinese corpora. Research shows that, for one embodiment of the corpus phonetic notation method of the invention, the workload of manually verifying the correct phonetic notation of corpora is reduced by 90% relative to existing phonetic notation methods, and the phonetic notation accuracy reaches 99.5% or higher.

Referring to Fig. 2, an embodiment of the method for generating an input method evaluation corpus of the invention includes:

Step S201: segment the captured history input content into at least one corpus of a single user entry;

The history input content referred to here can be obtained by capturing web content, or can be entered manually. In practice, to make the captured history input content closer to what users actually type, content of a predetermined field or type on the network can be captured as the user's history input content. For example, content from fields such as "entertainment", "finance", and "automobile" on a portal website can be captured as history input content. Content such as journal entries or signatures that users have entered on social networking sites such as twitter, facebook, or microblogs can also be captured as the user's history input content.

The captured history input content is segmented into corpora of a single user entry, so that the corpora obtained by segmentation are closer to the user's actual input. Referring to Fig. 3, segmenting the captured history input content into corpora of a single user entry in this embodiment includes the following sub-steps:

Sub-step S301: perform a first segmentation of the captured history input content using punctuation marks as boundaries;

Given the typing habits of typical users and the way language is organized, the two parts on either side of a punctuation mark are usually relatively complete linguistic units, and users generally enter them separately. Therefore, so that the corpora obtained by segmentation better match the input habits of typical users, the captured history input content is first segmented using punctuation marks as boundaries.

Sub-step S302: perform a second segmentation by phrase segment on the corpora obtained from the first segmentation;

Users typically also enter text phrase segment by phrase segment. The corpora obtained from the first segmentation are therefore segmented a second time by phrase segment. This second segmentation can be performed with juman and knp, which are two linguistic unit analysis tools.

The two segmentation steps used to obtain at least one corpus of a single user entry are illustrated by the following example:

Suppose the user's history input content is: "Going back to Beijing today, going to Shenzhen Airport to take flight h12 departing at 2:00." The result of the first segmentation, using punctuation marks as boundaries, is: "going back to Beijing today", "going to Shenzhen Airport to take", "flight", "departing". Here "going to Shenzhen Airport to take flight h12 departing at 2:00." is divided into 3 segments in the first segmentation, because a user who reaches a letter or symbol while typing usually confirms what has already been entered before continuing. The second segmentation is performed after the first segmentation is complete; applied to the result of the first segmentation above, it gives: "going back to Beijing today going to Shenzhen Airport to take flight departing".

Corpora segmented twice by the above two sub-steps better match users' input habits, so the segmented corpora are closer to the users' actual input and a better evaluation corpus can be obtained.
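The first segmentation of sub-step S301 can be sketched as follows. This is a rough illustration, assuming that every run of characters that is not a CJK ideograph or kana character (punctuation, Latin letters, digits, symbols) acts as a boundary; the function name `first_segmentation` and the sample sentence, a hypothetical Chinese rendering of the example above, are assumptions for illustration. The second segmentation by phrase segment relies on juman and knp and is not reproduced here.

```python
import re
from typing import List

# Assumption: a "text" character is a CJK ideograph or a kana character;
# any run of other characters (punctuation, letters, digits, symbols)
# is treated as a boundary between segments.
_NON_TEXT = re.compile(r"[^\u4e00-\u9fff\u3040-\u30ff]+")


def first_segmentation(history_input: str) -> List[str]:
    """Split captured input at punctuation, letters, digits and symbols."""
    return [segment for segment in _NON_TEXT.split(history_input) if segment]


# Hypothetical rendering of the example sentence used in the text.
sample = "今天回北京，去深圳机场乘坐h12航班2:00出发。"
print(first_segmentation(sample))
```

Treating letters and digits as boundaries mirrors the observation that users tend to confirm what they have typed before entering a letter or symbol, which is why the second clause of the sample yields three segments.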

Step S202: annotate each corpus separately with at least two different phonetic notation tools;

Each corpus obtained by segmentation in step S201 is annotated with at least two different phonetic notation tools, so that each corpus has at least two phonetic notations. The phonetic notation tools may be any two or more different phonetic notation tools; the invention places no limitation on them.

Step S203: judge whether the at least two phonetic notations of each corpus are identical;

Judge whether the at least two phonetic notations of each corpus obtained in step S202 are identical. If they are identical, go to step S204; if they differ, go to step S205. For example, a diff can be run on the multiple phonetic notations of the same corpus. If the result shows a difference, the phonetic notations of the corpus disagree and further verification is needed to determine the correct phonetic notation, so go to step S205. If the result shows no difference, the phonetic notation tools agree on the phonetic notation of the corpus, so go to step S204.

Step S204: use the corresponding phonetic notation directly as the correct phonetic notation of the corpus;

When the at least two phonetic notations of a corpus are identical, the phonetic notation tools agree on the phonetic notation of that corpus, so the phonetic notation can be assumed correct and is used directly as the correct phonetic notation of the corpus. Then go to step S206.

Step S205: select the phonetic notation with the better assessment result as the correct phonetic notation of the corpus;

When the at least two phonetic notations of a corpus differ, the phonetic notation tools disagree on the phonetic notation of that corpus. The phonetic notations of the corpus must then be assessed again, and the phonetic notation with the better assessment result is selected as the correct phonetic notation of the corpus. In this assessment, the better phonetic notation can be selected from the existing phonetic notations. If the reassessment finds that none of the phonetic notation tools has annotated the corpus appropriately, the correct phonetic notation of the corpus can also be determined manually. Once the correct phonetic notation of the corpus has been determined, go to step S206.

The above steps of annotating the corpus are the same as the phonetic notation process in the foregoing embodiment of the corpus phonetic notation method, and are not described again one by one here.

Step S206: take the corpus whose correct phonetic notation has been determined as the evaluation corpus;

The corpus that has been annotated and whose correct phonetic notation has been determined is taken as the evaluation corpus. In practice, the evaluation corpus can be input with at least one input method tool to obtain the corresponding candidate results, and an evaluation corpus library can be generated from the evaluation corpus together with the candidate results obtained. This effectively improves the efficiency of later obtaining the corresponding candidate results when the evaluation corpus is input. The at least one input method tool can be any one or more input method tools such as Google Input Method, Baidu Input Method, Microsoft Input Method, or Sogou Input Method.

From the description of the above embodiment it can be understood that, compared with the prior art, the method for generating an evaluation corpus of the invention requires almost no manual participation, which improves the efficiency of generating the evaluation corpus. Segmenting the captured history input content into corpora of a single user entry makes the resulting evaluation corpus closer to what users actually type and gives the evaluation corpus better continuity. Annotating the corpus with multiple phonetic notation tools and then cross-checking the multiple phonetic notations of the same corpus to determine its correct phonetic notation effectively reduces the workload of manually reviewing corpus phonetic notation, while also effectively improving the accuracy of corpus phonetic notation.

Referring to Fig. 4, another embodiment of the method for generating an input method evaluation corpus of the invention includes:

Step S401: capture content of a predetermined field or type on the network as history input content;

Content of a predetermined field or type on the network is captured as the user's history input content. For example, content from fields such as "science and technology", "tourism", and "sports" on a portal website can be captured as history input content. Content such as journal entries or signatures that users have entered on social networking sites such as twitter, facebook, or microblogs can also be captured as the user's history input content.

Step S402: segment the captured history input content into at least one corpus of a single user entry;

The captured history input content can first be segmented using punctuation marks as boundaries, and the corpora obtained from this first segmentation can then be segmented a second time by phrase segment, to obtain at least one corpus of a single user entry.

Step S403: denoise the corpora of a single user entry obtained by segmentation;

Content captured from the network follows no format specification, so the captured history input content may contain many meaningless corpora. If such meaningless corpora were used to produce the evaluation corpus, they would affect the evaluation result. The corpora of a single user entry obtained by segmentation are therefore denoised according to customized noise rules, to remove the meaningless corpora. The customized noise rules are rules that conform to normal text format specifications. For example, a corpus may contain content such as "！！！！$ $&" or "."; such content is generally regarded as a meaningless corpus and needs to be removed by denoising.
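A denoising pass of this kind can be sketched as follows. The patent does not list its customized noise rules, so the single rule shown, which treats a corpus with no letter or digit characters as meaningless, is an assumption for illustration; the names `has_no_text` and `remove_noise` are likewise hypothetical.

```python
from typing import Callable, Iterable, List

# A noise rule returns True when a corpus item should be discarded.
NoiseRule = Callable[[str], bool]


def has_no_text(corpus: str) -> bool:
    """Assumed rule: an item without any letter or digit is meaningless."""
    return not any(ch.isalnum() for ch in corpus)


def remove_noise(corpora: Iterable[str], rules: Iterable[NoiseRule]) -> List[str]:
    """Keep only the corpus items that no noise rule rejects."""
    rule_list = list(rules)
    return [c for c in corpora if not any(rule(c) for rule in rule_list)]


cleaned = remove_noise(["今天回北京", "！！！！$$&", "。", "  "], [has_no_text])
print(cleaned)
```

Expressing each rule as a separate predicate lets further customized rules be added to the list without changing the filtering loop.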

Step S404: performing frequency calculation on each corpus remaining after noise removal, and selecting corpora by a roulette algorithm;

Frequency calculation is performed on each corpus remaining after noise removal, and corpora are then selected by a roulette algorithm. Frequency calculation means counting the number of occurrences of the same corpus. Here, the frequency of a corpus is one of the factors that influence the selection result: the higher the frequency of a corpus, the greater the probability that it is selected by the roulette. Selecting corpora with a roulette algorithm takes the commonness of each corpus into account while still giving rare corpora with very low frequency a chance of being selected, so that the corpora obtained are more comprehensive.

The step of performing frequency calculation on the corpora and selecting corpora by the roulette algorithm is further described below by way of example:

For example, suppose there are 100 corpora in total after noise removal, consisting of four different corpora A, B, C, and D. Frequency calculation gives 40 for A, 35 for B, 15 for C, and 10 for D; that is, the frequencies of A, B, C, and D are 40, 35, 15, and 10 respectively. From this, the probabilities of occurrence of A, B, C, and D are 0.4 (40/100), 0.35 (35/100), 0.15 (15/100), and 0.1 (10/100) respectively. In other words, if the total number of corpora is regarded as a circle, the angles of the sectors occupied by A, B, C, and D on the circle are 0.4 × 360° = 144°, 0.35 × 360° = 126°, 0.15 × 360° = 54°, and 0.1 × 360° = 36° respectively. It can be seen that the greater the probability, the greater the chance that a corpus is selected.
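The arithmetic of this example can be reproduced with a short sketch. The function name `roulette_shares` is hypothetical; the computation is only the frequency count, the division by the total, and the scaling to 360° described above.

```python
from collections import Counter


def roulette_shares(corpora):
    """Return {corpus: (frequency, probability, sector angle in degrees)}."""
    counts = Counter(corpora)
    total = sum(counts.values())
    return {
        corpus: (n, n / total, 360 * n / total)
        for corpus, n in counts.items()
    }


# The example from the text: 40 A, 35 B, 15 C, 10 D.
example = ["A"] * 40 + ["B"] * 35 + ["C"] * 15 + ["D"] * 10
shares = roulette_shares(example)
```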

Following the above method, when corpora are actually selected, an integer range can first be defined. For example, if the corpora contain 3 A, 2 B, and 5 C, the total interval can be defined as 1-10, and the interval of A can then be defined as 1-3 (that is, the interval of A occupies the three values 1, 2, and 3). It could equally be defined as 4-6, as long as A occupies three values. By analogy, the interval of B can be defined as 4-5 and the interval of C as 6-10. A program then generates a random number in the interval 1-10; if the generated random number is any one of 1, 2, and 3 (that is, any value in the interval of A), corpus A is selected. The same method applies to the selection of the other corpora. After a corpus is selected, a check is made to determine whether that corpus has already been selected. If it has, the method proceeds directly to the next step; if it has not, the corpus is put into the selected list and the method then proceeds to the next step. These steps are repeated until the length of the selected list meets the requirement of the evaluation corpus.
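The interval-based selection can be sketched as follows. All function names are hypothetical, and the cap on `wanted` (so that the loop cannot ask for more distinct corpora than exist) is an added safeguard that the source does not mention.

```python
import random
from collections import Counter


def build_intervals(counts):
    """Give each corpus an inclusive integer interval whose width is its frequency."""
    intervals, start = {}, 1
    for corpus, n in counts.items():
        intervals[corpus] = (start, start + n - 1)
        start += n
    return intervals, start - 1  # second value is the upper end of the total interval


def spin(intervals, total, rng):
    """Draw one random integer in [1, total] and return the corpus owning it."""
    r = rng.randint(1, total)
    for corpus, (lo, hi) in intervals.items():
        if lo <= r <= hi:
            return corpus


def select_corpora(corpora, wanted, seed=None):
    """Spin repeatedly, keeping each newly drawn corpus, until `wanted` are selected."""
    rng = random.Random(seed)
    counts = Counter(corpora)
    intervals, total = build_intervals(counts)
    wanted = min(wanted, len(counts))  # safeguard: cannot exceed the distinct corpora
    selected = []
    while len(selected) < wanted:
        corpus = spin(intervals, total, rng)
        if corpus not in selected:  # already selected: skip and spin again
            selected.append(corpus)
    return selected
```

With the corpora from the text (3 A, 2 B, 5 C), `build_intervals` yields the intervals 1-3, 4-5, and 6-10 over a total interval of 1-10.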

Step S405: among the selected corpora, for identical corpora, retaining only one of them as the corpus on which phonetic notation is performed using at least two different phonetic notation tools;

The selected corpora are deduplicated to avoid repeatedly annotating and checking duplicate corpora, which reduces unnecessary burden when producing the evaluation corpus. For example, if the selected corpus is D but there are 10 copies of D, only one D is retained and the remaining 9 are removed.
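Deduplication of this kind can be sketched in one line. The function name is hypothetical; the sketch keeps the first occurrence of each corpus and preserves order, which is one reasonable reading of "retaining only one of them".

```python
def deduplicate(corpora):
    """Keep the first occurrence of each corpus, preserving the original order."""
    return list(dict.fromkeys(corpora))
```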

Step S406: performing phonetic notation on each corpus separately using at least two different phonetic notation tools;

For the corpora processed by the above steps, phonetic notation is performed on each corpus using at least two phonetic notation tools.

Step S407: judging whether the at least two phonetic notations of each corpus are identical;

It is judged whether the at least two phonetic notations of each corpus are identical. If they are identical, step S408 is performed; if they are different, step S409 is performed.

Step S408: directly using the corresponding phonetic notation as the correct phonetic notation of the corpus;

After the phonetic notation produced by the phonetic notation tools is directly used as the correct phonetic notation of the corpus, the method proceeds to step S410.

Step S409: selecting the phonetic notation with the better assessment result as the correct phonetic notation of the corpus;

The phonetic notations of the corpus are assessed, or phonetic notation is performed again, so as to select the phonetic notation with the better assessment result as the correct phonetic notation of the corpus. After the correct phonetic notation of the corpus is determined, the method proceeds to step S410.
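Steps S406 to S409 can be sketched as a cross-check over the outputs of several tools. This is an illustration under assumptions: each phonetic notation tool is modeled as a plain function from a corpus to a notation string, and `assess` is a hypothetical scoring function standing in for the assessment, whose details this passage does not specify.

```python
def annotate(corpus, tools):
    """Run every phonetic notation tool on the corpus and collect the notations."""
    return [tool(corpus) for tool in tools]


def correct_notation(corpus, tools, assess):
    """Return the agreed notation, or the best-assessed one when the tools disagree.

    `assess(corpus, notation)` is an assumed scoring callback; a higher
    score means a better assessment result.
    """
    if len(tools) < 2:
        raise ValueError("at least two phonetic notation tools are required")
    notations = annotate(corpus, tools)
    if len(set(notations)) == 1:  # all tools agree (step S408)
        return notations[0]
    return max(notations, key=lambda n: assess(corpus, n))  # step S409
```

Only the disagreeing corpora reach `assess`, which is where the reduction in manual checking described in the text would come from.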

Step S410: taking the corpus whose correct phonetic notation has been determined as the evaluation corpus;

Step S411: running at least one input method tool to input the evaluation corpus, obtaining the corresponding candidate results, and saving the evaluation corpus together with the corresponding candidate results to obtain an evaluation corpus library;

At least one input method tool is run to input the evaluation corpus obtained in the above steps, and the corresponding candidate results are obtained. The evaluation corpus and the obtained candidate results are stored together to generate the evaluation corpus library, which can effectively improve the efficiency of later obtaining the candidate results for an input evaluation corpus. The at least one input method tool here may be any one or more of Google input method, Baidu input method, Microsoft input method, Sogou input method, and the like.
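The storage step can be sketched as follows. Everything here is an assumption made for illustration: input method tools are modeled as functions from a phonetic notation to a list of candidates, the record layout is invented, and real input method tools would have to be driven through whatever automation interface they offer.

```python
def build_corpus_library(evaluation_corpora, input_methods):
    """Pair each (corpus, notation) with the candidates returned by each input method.

    `evaluation_corpora` is a list of (corpus, correct_notation) tuples and
    `input_methods` maps a tool name to a callable; both shapes are assumed.
    """
    library = []
    for corpus, notation in evaluation_corpora:
        candidates = {name: list(ime(notation)) for name, ime in input_methods.items()}
        library.append(
            {"corpus": corpus, "notation": notation, "candidates": candidates}
        )
    return library
```

Because the candidates are stored next to each corpus, a later evaluation run can compare them against the expected corpus without invoking the input method tools again.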

Referring to Fig. 5, a first embodiment of the electronic device of the present invention includes a phonetic notation module 11, a judgment module 12, and a phonetic notation determining module 13, in which:

The phonetic notation module 11 is used to perform phonetic notation on each corpus separately using at least two different phonetic notation tools, so that each corpus has at least two corresponding phonetic notations, and to output the at least two phonetic notations of each corpus to the judgment module 12;

The phonetic notation module 11 may use two or more different phonetic notation tools to perform phonetic notation on each corpus separately, and outputs the at least two resulting phonetic notations of each corpus to the judgment module 12.

The judgment module 12 is used to judge whether the at least two phonetic notations of each corpus are identical, and to output the judgment result to the phonetic notation determining module 13;

The judgment module 12 judges whether the multiple phonetic notations of each corpus are identical, and outputs the judgment result to the phonetic notation determining module 13.

The phonetic notation determining module 13 is used to directly use the corresponding phonetic notation as the correct phonetic notation of the corpus when the at least two phonetic notations of each corpus are identical, and to select the phonetic notation with the better assessment result as the correct phonetic notation of the corpus when the at least two phonetic notations of each corpus are different.

The phonetic notation determining module 13 is used to determine the correct phonetic notation of a corpus. When the judgment module 12 determines that the at least two phonetic notations of a corpus are identical, the corresponding phonetic notation is directly used as the correct phonetic notation of the corpus; when the at least two phonetic notations of a corpus are different, the phonetic notation with the better assessment result is selected as the correct phonetic notation of the corpus.

Referring to Fig. 6, a second embodiment of the electronic device of the present invention includes a cutting module 21, a phonetic notation module 22, a judgment module 23, and an evaluation corpus generation module 24, in which:

The cutting module 21 is used to cut the captured history input content into at least one corpus of a user's single input, and to output the at least one corpus obtained by cutting to the phonetic notation module 22;

The cutting module 21 may cut the history input content a first time using punctuation marks as separation boundaries, and then cut the corpora obtained from the first cut a second time by text section to obtain at least one corpus of a user's single input. Specifically, the cutting module 21 uses juman and knp to perform the second cut by text section on the corpora obtained from the first cut.

The phonetic notation module 22 is used to perform phonetic notation on each corpus separately using at least two different phonetic notation tools, so that each corpus has at least two corresponding phonetic notations, and to output the at least two phonetic notations of each corpus to the judgment module 23;

The phonetic notation module 22 may use any plurality of phonetic notation tools among Google input method, Baidu input method, Microsoft input method, Sogou input method, and the like to perform phonetic notation on each corpus separately, and outputs the at least two resulting phonetic notations of each corpus to the judgment module 23.

The judgment module 23 is used to judge whether the at least two phonetic notations of each corpus are identical, and to output the judgment result to the evaluation corpus generation module 24;

The judgment module 23 judges whether the multiple phonetic notations of each corpus received from the phonetic notation module 22 are identical, and outputs the judgment result to the evaluation corpus generation module 24.

The evaluation corpus generation module 24 is used to directly use the corresponding phonetic notation as the correct phonetic notation of the corpus when the at least two phonetic notations of each corpus are identical, to select the phonetic notation with the better assessment result as the correct phonetic notation of the corpus when the at least two phonetic notations of each corpus are different, and to take the corpus whose correct phonetic notation has been determined as the evaluation corpus.

The evaluation corpus generation module 24 determines the correct phonetic notation of each corpus, and takes the corpus whose correct phonetic notation has been determined as the evaluation corpus.

Referring to Fig. 7, in a third embodiment of the electronic device of the present invention, the cutting module includes a first cutting unit 111 and a second cutting unit 112, in which:

The first cutting unit 111 is used to cut the captured history input content a first time using punctuation marks as separation boundaries, and to output the corpora obtained from the first cut to the second cutting unit 112;

The first cutting unit 111 cuts the captured history input content a first time using punctuation marks as separation boundaries, and outputs the corpora obtained by cutting to the second cutting unit 112.

The second cutting unit 112 is used to cut the corpora received from the first cutting unit 111 after the first cut a second time by text section, to obtain the corpora of a user's single input, and to output the corpora of a user's single input to the phonetic notation module;

The second cutting unit 112 is specifically used to perform the second cut by text section on the corpora obtained from the first cut by means of juman and knp.

Referring to Fig. 8, a fourth embodiment of the electronic device of the present invention includes a content capture module 31, a cutting module 32, a noise removal module 33, a corpus selection module 34, a deduplication module 35, a phonetic notation module 36, a judgment module 37, an evaluation corpus generation module 38, and a corpus module 39, in which:

The content capture module 31 is used to capture content of a predetermined field or type on the network as history input content, and to output the captured history input content to the cutting module 32;

The content capture module 31 captures content of a specific field or type on the network as history input content.

The cutting module 32 is used to cut the captured history input content into at least one corpus of a user's single input, and to output the at least one corpus obtained by cutting to the noise removal module 33;

The noise removal module 33 is used to perform noise removal on the corpora of a user's single input obtained by the cutting module 32, so as to eliminate the meaningless corpora among them;

The noise removal module 33 may perform noise removal on the corpora of a user's single input obtained by cutting according to customized noise rules, eliminating the meaningless corpora among them.

The corpus selection module 34 is used to perform frequency calculation on each corpus processed by the noise removal module 33, and to select corpora by a roulette algorithm;

The corpus selection module 34 performs frequency calculation on each corpus remaining after noise removal, and selects corpora by a roulette algorithm.

The deduplication module 35 is used, among the corpora selected by the corpus selection module 34 and for identical corpora, to retain only one of them as the corpus on which phonetic notation is performed using at least two different phonetic notation tools, and to output the retained corpus to the phonetic notation module 36;

The deduplication module 35 removes the duplicates among identical corpora in the selected corpora, so that duplicate corpora do not appear in the generated evaluation corpus.

The phonetic notation module 36 is used to perform phonetic notation on each corpus separately using at least two different phonetic notation tools, so that each corpus has at least two corresponding phonetic notations, and to output the at least two phonetic notations of each corpus to the judgment module 37;

The judgment module 37 is used to judge whether the at least two phonetic notations of each corpus are identical, and to output the judgment result to the evaluation corpus generation module 38;

The evaluation corpus generation module 38 is used to directly use the corresponding phonetic notation as the correct phonetic notation of the corpus when the at least two phonetic notations of each corpus are identical, to select the phonetic notation with the better assessment result as the correct phonetic notation of the corpus when the at least two phonetic notations of each corpus are different, and to take the corpus whose correct phonetic notation has been determined as the evaluation corpus.

The corpus module 39 is used to run at least one input method tool to input the evaluation corpus obtained by the evaluation corpus generation module 38 so as to obtain the corresponding candidate results, to collect the corresponding candidate results, and to save the evaluation corpus together with the corresponding candidate results to obtain an evaluation corpus library.

The corpus module 39 runs one or any plurality of input method tools among Google input method, Baidu input method, Microsoft input method, Sogou input method, and the like to input the evaluation corpus and obtain the corresponding candidate results, and saves the evaluation corpus together with the corresponding candidate results to obtain the evaluation corpus library.

As the above embodiments show, the present invention has the following advantages. On the one hand, the corpus phonetic notation method of the present invention performs phonetic notation on each corpus with multiple phonetic notation tools and then determines the correct phonetic notation of the corpus by cross-checking, which greatly reduces the number of corpora whose correct phonetic notation must be checked manually, improving both the efficiency and the accuracy of corpus phonetic notation. Research has shown that, relative to existing phonetic notation methods, the corpus phonetic notation method of the present invention reduces the number of corpora requiring manual checking of the correct phonetic notation by 90%, and the accuracy of phonetic notation reaches 99.5% or more.

On the other hand, the method of the present invention for generating an input method evaluation corpus requires almost no manual participation, which improves the efficiency of generating the evaluation corpus. Performing noise removal on the captured corpora significantly improves the quality of the evaluation corpus and prevents inappropriate corpora from causing evaluation results to come out too low. In addition, cutting reduces language fragmentation in the evaluation corpus, and introducing frequency calculation into corpus selection takes the commonness of each corpus into account, so that the resulting evaluation corpus is closer to what users actually type and has better continuity. Phonetic notation is performed on the corpora with multiple phonetic notation tools, and the correct phonetic notation of each corpus is determined by cross-checking, which effectively reduces the workload of manually checking corpus phonetic notations while also improving their accuracy.

In the several embodiments provided by the present invention, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into modules is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of another form.

The functional modules described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the present invention.

In addition, the functional modules in the embodiments of the present invention may be integrated into one processing unit, each functional module may exist physically on its own, or two or more functional modules may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

The above is only an embodiment of the present invention and does not limit the patent scope of the invention. Any equivalent structure or equivalent process transformation made using the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

1. A method of corpus phonetic notation, characterized by comprising:

performing phonetic notation on each corpus separately using at least two different phonetic notation tools, so that each corpus has at least two corresponding phonetic notations;

judging whether the at least two phonetic notations of each corpus are identical; if they are different, selecting the phonetic notation with the better assessment result as the correct phonetic notation of the corpus; if they are identical, directly using the phonetic notation as the correct phonetic notation of the corpus.

2. A method for generating an input method evaluation corpus, characterized by comprising:

cutting the captured history input content into at least one corpus of a user's single input;

judging whether the at least two phonetic notations of each corpus are identical; if they are different, selecting the phonetic notation with the better assessment result as the correct phonetic notation of the corpus; if they are identical, directly using the phonetic notation as the correct phonetic notation of the corpus; and taking the corpus whose correct phonetic notation has been determined as the evaluation corpus.

3. The method according to claim 2, characterized in that the step of cutting the captured history input content into corpora of a user's single input comprises:

cutting the captured history input content a first time using punctuation marks as separation boundaries;

cutting the corpora obtained from the first cut a second time by text section, to obtain the corpora of a user's single input.

4. The method according to claim 3, characterized in that the step of cutting the corpora obtained from the first cut a second time by text section comprises: performing the second cut by text section on the corpora obtained from the first cut by means of juman and knp.

5. The method according to claim 2, characterized in that, after the step of cutting the captured history input content into corpora of a user's single input and before the step of performing phonetic notation on the corpora using at least two different phonetic notation tools, the method further comprises:

performing noise removal on the corpora of a user's single input obtained by cutting, so as to eliminate the meaningless corpora among them.

6. The method according to claim 5, characterized in that the step of performing noise removal on the corpora of a user's single input obtained by cutting comprises: performing noise removal on the corpora of a user's single input obtained by cutting using customized noise rules.

7. The method according to claim 5, characterized in that, after the step of performing noise removal on the corpora of a user's single input obtained by cutting, the method further comprises: performing frequency calculation on each corpus remaining after noise removal, and selecting corpora by a roulette algorithm.

8. The method according to claim 7, characterized in that, after the step of performing frequency calculation on the corpora remaining after noise removal and selecting corpora by a roulette algorithm, the method further comprises: among the selected corpora, for identical corpora, retaining only one of them as the corpus on which phonetic notation is performed using the at least two different phonetic notation tools.

9. The method according to claim 2, characterized in that, after the step of generating the evaluation corpus, the method further comprises:

running at least one input method tool to input the evaluation corpus so as to obtain the corresponding candidate results, and collecting the corresponding candidate results;

saving the evaluation corpus together with the corresponding candidate results to obtain an evaluation corpus library.

10. The method according to claim 2, characterized in that, before the step of cutting the captured history input content into corpora of a user's single input, the method further comprises: capturing content of a predetermined field or type on the network as history input content.

11. An electronic device, characterized by comprising a phonetic notation module, a judgment module, and a phonetic notation determining module, in which:

the phonetic notation module is used to perform phonetic notation on each corpus separately using at least two different phonetic notation tools, so that each corpus has at least two corresponding phonetic notations, and to output the at least two phonetic notations of each corpus to the judgment module;

the judgment module is used to judge whether the at least two phonetic notations of each corpus are identical, and to output the judgment result to the phonetic notation determining module;

the phonetic notation determining module is used to directly use the phonetic notation as the correct phonetic notation of the corpus when the at least two phonetic notations of each corpus are identical, and to select the phonetic notation with the better assessment result as the correct phonetic notation of the corpus when the at least two phonetic notations of each corpus are different.

12. An electronic device, characterized by comprising a cutting module, a phonetic notation module, a judgment module, and an evaluation corpus generation module, in which:

the cutting module is used to cut the captured history input content into at least one corpus of a user's single input, and to output the at least one corpus obtained by cutting to the phonetic notation module;

the phonetic notation module is used to perform phonetic notation on each corpus separately using at least two different phonetic notation tools, so that each corpus has at least two corresponding phonetic notations, and to output the at least two phonetic notations of each corpus to the judgment module;

the judgment module is used to judge whether the at least two phonetic notations of each corpus are identical, and to output the judgment result to the evaluation corpus generation module;

the evaluation corpus generation module is used to directly use the phonetic notation as the correct phonetic notation of the corpus when the at least two phonetic notations of each corpus are identical, to select the phonetic notation with the better assessment result as the correct phonetic notation of the corpus when the at least two phonetic notations of each corpus are different, and to take the corpus whose correct phonetic notation has been determined as the evaluation corpus.

13. The device according to claim 12, characterized in that the cutting module comprises a first cutting unit and a second cutting unit, in which:

the first cutting unit is used to cut the captured history input content a first time using punctuation marks as separation boundaries, and to output the corpora obtained from the first cut to the second cutting unit;

the second cutting unit is used to cut the corpora received from the first cutting unit after the first cut a second time by text section, to obtain the corpora of a user's single input, and to output the corpora of a user's single input to the phonetic notation module.

14. The device according to claim 13, characterized in that the second cutting unit is specifically used to perform the second cut by text section on the corpora obtained from the first cut by means of juman and knp.

15. The device according to claim 12, characterized in that the device further comprises a noise removal module, used to perform noise removal on the corpora of a user's single input obtained by the cutting module, so as to eliminate the meaningless corpora among them.

16. The device according to claim 15, characterized in that the noise removal module is specifically used to perform noise removal on the corpora of a user's single input obtained by cutting using customized noise rules.

17. The device according to claim 15, characterized in that the device further comprises a corpus selection module, used to perform frequency calculation on each corpus processed by the noise removal module, and to select corpora by a roulette algorithm.

18. The device according to claim 17, characterized in that the device further comprises a deduplication module, used, among the corpora selected by the corpus selection module and for identical corpora, to retain only one of them as the corpus on which phonetic notation is performed using the at least two different phonetic notation tools, and to output the retained corpus to the phonetic notation module.

19. The device according to claim 12, characterized in that the device further comprises an evaluation corpus library module, used to run at least one input method tool to input the evaluation corpus obtained by the evaluation corpus generation module so as to obtain the corresponding candidate results, to collect the corresponding candidate results, and to save the evaluation corpus together with the corresponding candidate results to obtain an evaluation corpus library.

20. The device according to claim 12, characterized in that the device further comprises a content capture module, used to capture content of a predetermined field or type on the network as history input content, and to output the captured history input content to the cutting module.