CN102982019A

CN102982019A - Method of phonetic notation of input method linguistic data and method and electronic device for generating evaluation linguistic data

Info

Publication number: CN102982019A
Application number: CN2012104867238A
Authority: CN
Inventors: 景富香
Original assignee: Baidu International Technology Shenzhen Co Ltd
Current assignee: Baidu International Technology Shenzhen Co Ltd
Priority date: 2012-11-26
Filing date: 2012-11-26
Publication date: 2013-03-20
Anticipated expiration: 2032-11-26
Also published as: CN102982019B

Abstract

The invention discloses a method for phonetic annotation of an input method corpus, and a method and an electronic device for generating an evaluation corpus. The phonetic annotation method comprises: annotating each corpus item separately with at least two different phonetic annotation tools, so that each corpus item has at least two corresponding phonetic annotations; and determining whether the at least two phonetic annotations of each corpus item are the same. If they are not, the phonetic annotation with the better evaluation result is selected as the correct phonetic annotation of the corpus item; if they are, that phonetic annotation is used as the correct phonetic annotation of the corpus item. In this way, the workload of manually checking the correct phonetic annotations of the corpus is greatly reduced, and both the efficiency and the accuracy of phonetic annotation are improved.

Description

Method for phonetic annotation of an input method corpus, and method and electronic device for generating an evaluation corpus

Technical field

The present invention relates to the technical field of input methods, and in particular to a method for phonetic annotation of an input method corpus, and a method and an electronic device for generating an evaluation corpus.

Background art

An input method is an encoding method used to enter various symbols into a computer or other device (such as a mobile phone). The performance of an input method directly affects input efficiency on the computer or other device. Input method performance therefore needs to be evaluated, in order to provide a basis for continually improving the input method.

An input method is evaluated by performing operations such as typing and candidate selection on an evaluation corpus, recording during the process the position of the ideal candidate and the number of edits needed to obtain it, and finally computing, over many typing and selection operations, the distribution of the ideal candidate's position and the mean number of edits needed to obtain the ideal candidate. These statistics reflect the usability of the input method. The evaluation corpus is thus a prerequisite for input method evaluation, so finding an objective, practical, and correct evaluation corpus is of great importance to the evaluation.
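The two statistics named above can be sketched as follows. This is a minimal illustration rather than anything specified in the source: the record layout (`TrialRecord`, `position`, `edits`) and the function name `summarize_trials` are assumptions introduced here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class TrialRecord:
    """One typing trial (hypothetical layout, not from the source)."""
    position: int  # 1-based rank of the ideal candidate in the candidate list
    edits: int     # number of edits needed before the ideal candidate was obtained


def summarize_trials(trials):
    """Return (position distribution, mean edit count) over all trials."""
    if not trials:
        raise ValueError("at least one trial is required")
    total = len(trials)
    counts = Counter(t.position for t in trials)
    # Share of trials in which the ideal candidate appeared at each position.
    distribution = {pos: n / total for pos, n in sorted(counts.items())}
    mean_edits = sum(t.edits for t in trials) / total
    return distribution, mean_edits
```

A distribution concentrated on position 1 together with a low mean edit count would indicate that the ideal candidate is usually offered first and rarely needs correction.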

Evaluation corpora are generally collected either manually or automatically. At present, manual collection is inefficient, and evaluation corpora generated automatically have at least the following problems: the segmentation mechanism is unreasonable, so that most of what users actually type is lost and the resulting corpus is incorrect, which affects the evaluation result of the input method; and there is no mature phonetic annotation tool that annotates the corpus accurately.

Summary of the invention

The main technical problem solved by the present invention is to provide a method for phonetic annotation of an input method corpus, and a method and an electronic device for generating an evaluation corpus, which improve the efficiency of generating the evaluation corpus, while the generated evaluation corpus is closer to what users actually type, has good continuity, and is annotated with high accuracy.

To solve the above technical problem, one technical solution adopted by the present invention is to provide a method for phonetic annotation of a corpus, comprising: annotating each corpus item separately with at least two different phonetic annotation tools, so that each corpus item has at least two corresponding phonetic annotations; and determining whether the at least two phonetic annotations of each corpus item are the same; if they differ, selecting the phonetic annotation with the better evaluation result as the correct phonetic annotation of the corpus item, and if they are the same, directly using that phonetic annotation as the correct phonetic annotation of the corpus item.

To solve the above technical problem, another technical solution adopted by the present invention is to provide a method for generating an input method evaluation corpus, comprising: segmenting captured historical input content into at least one corpus item corresponding to a single user entry; annotating each corpus item separately with at least two different phonetic annotation tools, so that each corpus item has at least two corresponding phonetic annotations; determining whether the at least two phonetic annotations of each corpus item are the same; if they differ, selecting the phonetic annotation with the better evaluation result as the correct phonetic annotation of the corpus item, and if they are the same, directly using that phonetic annotation as the correct phonetic annotation of the corpus item; and using the corpus items whose correct phonetic annotations have been determined as the evaluation corpus.

Wherein, the step of segmenting the captured historical input content into corpus items of a single user entry comprises: performing a first segmentation of the captured historical input content using punctuation marks as boundaries; and performing a second segmentation, by bunsetsu (phrase unit), of the corpus items obtained from the first segmentation, to obtain the corpus items of a single user entry.

Wherein, the step of performing the second segmentation by bunsetsu comprises: performing the second segmentation by bunsetsu, using JUMAN and KNP, on the corpus items obtained from the first segmentation.

Wherein, after the step of segmenting the captured historical input content into corpus items of a single user entry and before the step of annotating the corpus items with at least two different phonetic annotation tools, the method further comprises: denoising the corpus items of a single user entry obtained by segmentation, to eliminate meaningless corpus items.

Wherein, the step of denoising the corpus items of a single user entry obtained by segmentation comprises: denoising those corpus items using user-defined noise rules.
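Rule-based denoising of this kind might look like the sketch below. The source states only that the noise rules are user-defined, so the three rules in `NOISE_RULES` and the name `remove_noise` are purely illustrative assumptions.

```python
import re

# Hypothetical noise rules; the source does not list any concrete rule.
NOISE_RULES = [
    lambda s: not s.strip(),                               # empty or whitespace only
    lambda s: re.fullmatch(r"[\W\d_]+", s) is not None,    # contains no letters at all
    lambda s: len(s) > 2 and len(set(s)) == 1,             # one character repeated
]


def remove_noise(items, rules=NOISE_RULES):
    """Keep only the corpus items that no noise rule flags as meaningless."""
    return [s for s in items if not any(rule(s) for rule in rules)]
```

Keeping each rule as an independent predicate lets rules be added or removed without touching the filtering loop.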

Wherein, after the step of denoising the corpus items of a single user entry obtained by segmentation, the method further comprises: computing the frequency of each denoised corpus item, and selecting corpus items using a roulette wheel algorithm.

Wherein, after the step of computing the frequency of the denoised corpus items and selecting corpus items using the roulette wheel algorithm, the method further comprises: among the selected corpus items, for identical corpus items, retaining only one of them as the corpus item to be annotated with the at least two different phonetic annotation tools.
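The frequency calculation, roulette wheel selection, and de-duplication steps can be sketched together as follows. The source does not say whether selection is with or without replacement or how many items are drawn; this sketch assumes sampling with replacement and a caller-supplied count `k`, and the function names are hypothetical.

```python
import random
from collections import Counter


def roulette_select(items, k, rng=None):
    """Draw k corpus items with probability proportional to their frequency."""
    if rng is None:
        rng = random.Random()
    freq = Counter(items)                    # frequency calculation
    candidates = list(freq)
    weights = [freq[c] for c in candidates]  # each item's share of the wheel
    # random.choices samples with replacement, so the result may contain repeats.
    return rng.choices(candidates, weights=weights, k=k)


def deduplicate(items):
    """Keep one copy of each corpus item, preserving first-seen order."""
    return list(dict.fromkeys(items))
```

Because sampling with replacement can return the same item more than once, the de-duplication step ensures that each distinct item is annotated only once.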

Wherein, after the step of generating the evaluation corpus, the method further comprises: running at least one input method tool and entering the evaluation corpus to obtain corresponding candidate results, and collecting the corresponding candidate results; and saving the evaluation corpus together with the corresponding candidate results to obtain an evaluation corpus library.

Wherein, before the step of segmenting the captured historical input content into corpus items of a single user entry, the method further comprises: capturing content of a predetermined field or type on the network as the historical input content.

To solve the above technical problem, another technical solution adopted by the present invention is to provide an electronic device comprising a phonetic annotation module, a judging module, and a phonetic annotation determination module, wherein: the phonetic annotation module is configured to annotate each corpus item separately with at least two different phonetic annotation tools, so that each corpus item has at least two corresponding phonetic annotations, and to output the at least two phonetic annotations of each corpus item to the judging module; the judging module is configured to determine whether the at least two phonetic annotations of each corpus item are the same, and to output the result to the phonetic annotation determination module; and the phonetic annotation determination module is configured to directly use the phonetic annotation as the correct phonetic annotation of the corpus item when the at least two phonetic annotations of the corpus item are the same, and to select the phonetic annotation with the better evaluation result as the correct phonetic annotation of the corpus item when they differ.

To solve the above technical problem, another technical solution adopted by the present invention is to provide an electronic device comprising a segmentation module, a phonetic annotation module, a judging module, and an evaluation corpus generation module, wherein: the segmentation module is configured to segment captured historical input content into at least one corpus item of a single user entry, and to output the resulting corpus items to the phonetic annotation module; the phonetic annotation module is configured to annotate each corpus item separately with at least two different phonetic annotation tools, so that each corpus item has at least two corresponding phonetic annotations, and to output the at least two phonetic annotations of each corpus item to the judging module; the judging module is configured to determine whether the at least two phonetic annotations of each corpus item are the same, and to output the result to the evaluation corpus generation module; and the evaluation corpus generation module is configured to directly use the phonetic annotation as the correct phonetic annotation of the corpus item when the at least two phonetic annotations of the corpus item are the same, to select the phonetic annotation with the better evaluation result as the correct phonetic annotation of the corpus item when they differ, and to use the corpus items whose correct phonetic annotations have been determined as the evaluation corpus.

Wherein, the segmentation module comprises a first segmentation unit and a second segmentation unit, wherein: the first segmentation unit is configured to perform a first segmentation of the captured historical input content using punctuation marks as boundaries, and to output the corpus items obtained from the first segmentation to the second segmentation unit; and the second segmentation unit is configured to perform a second segmentation, by bunsetsu, of the corpus items received from the first segmentation unit, to obtain the corpus items of a single user entry, and to output those corpus items to the phonetic annotation module.

Wherein, the second segmentation unit is specifically configured to perform the second segmentation by bunsetsu, using JUMAN and KNP, on the corpus items obtained from the first segmentation.

Wherein, the device further comprises a denoising module, configured to denoise the corpus items of a single user entry obtained by the segmentation module, to eliminate meaningless corpus items.

Wherein, the denoising module is specifically configured to denoise the corpus items of a single user entry obtained by segmentation using user-defined noise rules.

Wherein, the device further comprises a corpus selection module, configured to compute the frequency of each corpus item processed by the denoising module, and to select corpus items using a roulette wheel algorithm.

Wherein, the device further comprises a de-duplication module, configured to retain, among the corpus items selected by the corpus selection module, only one of any identical corpus items as the corpus item to be annotated with the at least two different phonetic annotation tools, and to output the retained corpus items to the phonetic annotation module.

Wherein, the device further comprises an evaluation corpus library module, configured to run at least one input method tool and enter the evaluation corpus obtained by the evaluation corpus generation module to obtain corresponding candidate results, to collect the corresponding candidate results, and to save the evaluation corpus together with the corresponding candidate results to obtain an evaluation corpus library.

Wherein, the device further comprises a content capture module, configured to capture content of a predetermined field or type on the network as the historical input content, and to output the captured historical input content to the segmentation module.

The beneficial effects of the present invention are as follows. On the one hand, the corpus phonetic annotation method of the present invention annotates each corpus item separately with multiple phonetic annotation tools and then determines the correct phonetic annotation by cross-checking. This greatly reduces the workload of manually verifying the correct phonetic annotations, and improves both the efficiency and the accuracy of corpus phonetic annotation.

On the other hand, in the method of the present invention for generating an input method evaluation corpus, multiple phonetic annotation tools annotate the corpus and the multiple annotations of the same corpus item are then cross-checked to determine its correct phonetic annotation. This effectively reduces the workload of manually reviewing the phonetic annotations and also effectively improves the accuracy of the phonetic annotation of the evaluation corpus. In addition, because the captured historical input content is segmented into corpus items of a single user entry, the resulting evaluation corpus is closer to what users actually type and has better continuity. The whole process of generating the evaluation corpus requires almost no manual participation, which greatly improves the efficiency of generating the evaluation corpus.

Brief description of the drawings

Fig. 1 is a flowchart of an embodiment of the corpus phonetic annotation method of the present invention;

Fig. 2 is a flowchart of an embodiment of the method of the present invention for generating an input method evaluation corpus;

Fig. 3 is a flowchart of segmenting the captured historical input content into corpus items of a single user entry in an embodiment of the method of the present invention for generating an input method evaluation corpus;

Fig. 4 is a flowchart of another embodiment of the method of the present invention for generating an input method evaluation corpus;

Fig. 5 is a structural diagram of a first embodiment of the electronic device of the present invention;

Fig. 6 is a structural diagram of a second embodiment of the electronic device of the present invention;

Fig. 7 is a structural diagram of the segmentation module in a third embodiment of the electronic device of the present invention;

Fig. 8 is a structural diagram of a fourth embodiment of the electronic device of the present invention.

Detailed description of the embodiments

In the use of input methods, whether Japanese or Chinese, polyphonic characters are ubiquitous. The accuracy of the phonetic annotation of the corpus therefore directly affects the user's experience of the input method. Phonetic annotation here refers to the process of annotating a corpus item with the phonetic annotation tool of an input method so as to determine the correct phonetic annotation of the corpus item.

Referring to Fig. 1, an embodiment of the corpus phonetic annotation method of the present invention comprises:

Step S101: annotating each corpus item separately with at least two different phonetic annotation tools;

Each corpus item that needs phonetic annotation is annotated separately with at least two different phonetic annotation tools, so that each corpus item has at least two phonetic annotations. The phonetic annotation tools may be any two or more different phonetic annotation tools; the present invention places no limitation on this.

Step S102: determining whether the at least two phonetic annotations of each corpus item are the same;

It is determined whether the at least two phonetic annotations obtained for each corpus item in step S101 are the same. If they are the same, step S103 is performed; if they differ, step S104 is performed. For example, a diff calculation (text comparison) may be performed on the multiple phonetic annotations of the same corpus item. If the results differ, the phonetic annotation tools disagree on the phonetic annotation of this corpus item, further verification is needed to determine the correct phonetic annotation, and step S104 is performed. If the results are the same, the phonetic annotation tools agree on the phonetic annotation of this corpus item, and step S103 is performed.
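The text comparison in step S102 can be illustrated with Python's standard `difflib` module. This is only a sketch: the source does not name a diff tool, and the assumption that an annotation is a string of space-separated syllables, as well as the function name `annotation_diff`, are introduced here.

```python
from difflib import SequenceMatcher


def annotation_diff(a, b):
    """Return the syllable-level differences between two phonetic annotations.

    An empty list means the two annotations are identical.
    """
    ta, tb = a.split(), b.split()   # assumes space-separated syllables
    matcher = SequenceMatcher(a=ta, b=tb, autojunk=False)
    return [
        (tag, ta[i1:i2], tb[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"           # keep only the parts that differ
    ]
```

An empty result corresponds to the branch leading to step S103; a non-empty result corresponds to step S104 and also shows which syllables the tools disagree on.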

Step S103: directly using the corresponding phonetic annotation as the correct phonetic annotation of the corpus item;

When the at least two phonetic annotations of a corpus item are the same, the phonetic annotation tools agree on the phonetic annotation of that corpus item, and the phonetic annotation can be regarded as correct. It is used directly as the correct phonetic annotation of the corpus item, and the flow ends.

Step S104: selecting the phonetic annotation with the better evaluation result as the correct phonetic annotation of the corpus item;

When the at least two phonetic annotations of a corpus item differ, the phonetic annotation tools disagree on the phonetic annotation of that corpus item. The phonetic annotations need to be evaluated again, and the one with the better evaluation result is selected as the correct phonetic annotation of the corpus item. In this evaluation, the better phonetic annotation may be selected from the existing phonetic annotations. If the re-evaluation shows that none of the phonetic annotation tools has annotated the corpus item correctly, the correct phonetic annotation of the corpus item may instead be determined manually. The flow then ends.

The corpus phonetic annotation method of the present invention is further illustrated by the following examples:

Take the corpus item "落枕" (stiff neck). If the results from two phonetic annotation tools are "lào zhěn" and "lào zhěn", the diff calculation finds them identical, and "lào zhěn" is used directly as the correct phonetic annotation of the corpus item. If the results from the two phonetic annotation tools are "là zhěn" and "lào zhěn", and evaluation finds "lào zhěn" to be the phonetic annotation with the better evaluation result, then "lào zhěn" is used as the correct phonetic annotation of the corpus item. If re-evaluation shows that both results from the two phonetic annotation tools are incorrect, "lào zhěn" can be determined manually to be the phonetic annotation with the better evaluation result, and "lào zhěn" is used as the correct phonetic annotation of the corpus item.

For the same corpus item "落枕" (stiff neck), if the results from three phonetic annotation tools are "lào zhěn", "lào zhěn", and "lào zhěn", then "lào zhěn" is used directly as the correct phonetic annotation of the corpus item. If the results from the three phonetic annotation tools are "lào zhěn", "là zhěn", and "lào zhěn", or are "lào zhěn", "lào zhěn", and "là zhěn", the phonetic annotations are not all identical; evaluation finds "lào zhěn" to be the phonetic annotation with the better evaluation result, and "lào zhěn" is used as the correct phonetic annotation of the corpus item.
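The flow of steps S101 to S104 can be sketched as below. The source does not specify how the "better evaluation result" is obtained, so `majority_evaluator` (a strict majority vote) is a stand-in assumption, and the `Annotator` and `Resolver` interfaces and the `manual` fallback hook are likewise hypothetical.

```python
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical interfaces; the source does not define them.
Annotator = Callable[[str], str]
Resolver = Callable[[str, Sequence[str]], Optional[str]]


def majority_evaluator(item: str, candidates: Sequence[str]) -> Optional[str]:
    """Stand-in evaluation: return the strict-majority annotation, else None."""
    best, count = Counter(candidates).most_common(1)[0]
    return best if count > len(candidates) / 2 else None


def annotate(item: str,
             tools: Sequence[Annotator],
             evaluate: Resolver = majority_evaluator,
             manual: Optional[Resolver] = None) -> Optional[str]:
    """Cross-check the annotations of one corpus item across several tools."""
    if len(tools) < 2:
        raise ValueError("at least two phonetic annotation tools are required")
    candidates = [tool(item) for tool in tools]   # S101: annotate with each tool
    if len(set(candidates)) == 1:                 # S102: are all annotations equal?
        return candidates[0]                      # S103: use the shared annotation
    chosen = evaluate(item, candidates)           # S104: pick the better annotation
    if chosen is None and manual is not None:
        chosen = manual(item, candidates)         # fall back to manual determination
    return chosen
```

Only items on which the tools disagree and the evaluation is inconclusive reach the manual hook, which is how cross-checking reduces the manual verification workload.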

From the description of the above embodiment it can be understood that, unlike the prior art, the corpus phonetic annotation method of the present invention annotates each corpus item separately with multiple phonetic annotation tools and then determines the correct phonetic annotation by cross-checking. This greatly reduces the number of corpus items whose correct phonetic annotation must be verified manually, and improves both the efficiency and the accuracy of corpus phonetic annotation. The method can be used for corpora in different languages, for example but not limited to Japanese and Chinese corpora. Research shows that, in one embodiment of the corpus phonetic annotation method of the present invention, compared with existing phonetic annotation methods, the workload of manually verifying correct phonetic annotations is reduced by 90%, and the accuracy of phonetic annotation reaches 99.5% or more.

Referring to Fig. 2, an embodiment of the method of the present invention for generating an input method evaluation corpus comprises:

Step S201: segmenting the captured historical input content into at least one corpus item of a single user entry;

The historical input content referred to here may be obtained by crawling web page content, or may be entered manually. In practice, to bring the captured historical input content closer to what users actually type, content of a predetermined field or type on the network may be crawled as the user's historical input content. For example, content from fields such as "entertainment", "finance", and "automobiles" on a portal website may be crawled as historical input content. Content such as posts or signatures that users have entered on social networking sites such as Twitter, Facebook, or microblogs may also be crawled as the user's historical input content.

The captured historical input content is segmented into corpus items of a single user entry, so that the corpus items obtained by segmentation are closer to what users actually type. Referring to Fig. 3, segmenting the captured historical input content into corpus items of a single user entry comprises the following substeps in this embodiment:

Substep S301: performing a first segmentation of the captured historical input content using punctuation marks as boundaries;

Given users' typical typing habits and the way language is organized, the two parts on either side of a punctuation mark are usually fairly complete linguistic units, and users generally enter them separately. Therefore, so that the corpus items obtained by segmentation better match ordinary users' input habits, the first segmentation of the historical input content uses punctuation marks as boundaries.
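A first segmentation of this kind might be sketched as follows. The choice of boundary characters is an assumption: besides punctuation and symbols, this sketch also treats whitespace and ASCII letters and digits as boundaries, which mirrors the example sentence in this description, where "h12" and "2:00" split a clause. The function names are hypothetical.

```python
import unicodedata


def is_break_char(ch):
    """True for characters treated as segment boundaries in this sketch."""
    if ch.isspace():
        return True
    if ch.isascii() and ch.isalnum():   # ASCII letters and digits such as "h12"
        return True
    # Unicode general categories starting with P (punctuation) or S (symbol).
    return unicodedata.category(ch)[0] in ("P", "S")


def first_pass_split(text):
    """Split text into maximal runs of non-boundary characters."""
    segments, current = [], []
    for ch in text:
        if is_break_char(ch):
            if current:                 # close the segment before the boundary
                segments.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:                         # flush the final segment
        segments.append("".join(current))
    return segments
```

Using Unicode categories rather than a fixed list means full-width punctuation such as "，" and "。" is handled without enumerating every mark.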

Substep S302: performing a second segmentation, by bunsetsu, of the corpus items obtained from the first segmentation;

Generally, when typing posts, users also tend to enter text one bunsetsu (phrase unit) at a time. The corpus items obtained from the first segmentation are therefore segmented a second time by bunsetsu. This second segmentation may be performed using JUMAN and KNP, which are two tools for analyzing linguistic units.
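The second segmentation can be sketched as applying a bunsetsu splitter to each first-pass segment and flattening the result. How JUMAN and KNP are invoked is not described in the source, so the splitter is left as a pluggable callable; the `BunsetsuSplitter` interface and the function name are assumptions.

```python
from typing import Callable, Iterable, List

# Hypothetical interface: takes one segment, returns its bunsetsu in order.
# In practice this would wrap a tool such as JUMAN and KNP.
BunsetsuSplitter = Callable[[str], List[str]]


def second_pass_split(segments: Iterable[str],
                      split_bunsetsu: BunsetsuSplitter) -> List[str]:
    """Split every first-pass segment by bunsetsu and flatten the pieces."""
    result: List[str] = []
    for segment in segments:
        # Skip empty pieces that a splitter might return.
        result.extend(piece for piece in split_bunsetsu(segment) if piece)
    return result
```

Passing the splitter in as a parameter keeps the two-pass structure testable without the external analysis tools installed.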

The two segmentation steps that produce at least one corpus item of a single user entry are illustrated by the following example:

Take the user's historical input content "go back to Beijing today, go to the Shenzhen Airport to take h12 flight 2:00 and set out." The result of the first segmentation, using punctuation marks as boundaries, is: "go back to Beijing today", "go to the Shenzhen Airport to take", "flight", "set out". Here "go to the Shenzhen Airport to take h12 flight 2:00 and set out." is divided into three segments in the first segmentation, because when a user reaches a letter or symbol while typing, the user usually confirms what has already been typed before continuing. The second segmentation is performed after the first is complete; applied to the result of the first segmentation above, it yields: "return today Beijing go to the Shenzhen Airport to take flight set out".

Corpus items produced by these two segmentation steps better match users' input habits, so the segmented corpus is closer to what users actually type, and a better evaluation corpus can be obtained as a result.

Step S202: annotating each corpus item separately with at least two different phonetic annotation tools;

Each corpus item obtained by segmentation in step S201 is annotated separately with at least two different phonetic annotation tools, so that each corpus item has at least two phonetic annotations. The phonetic annotation tools may be any two or more different phonetic annotation tools; the present invention places no limitation on this.

Step S203: determining whether the at least two phonetic annotations of each corpus item are the same;

It is determined whether the at least two phonetic annotations obtained for each corpus item in step S202 are the same. If they are the same, step S204 is performed; if they differ, step S205 is performed. For example, a diff calculation may be performed on the multiple phonetic annotations of the same corpus item. If the results differ, the phonetic annotations of this corpus item disagree, further verification is needed to determine the correct phonetic annotation, and the flow proceeds to step S205. If the results are the same, the phonetic annotation tools agree on the phonetic annotation of this corpus item, and the flow proceeds to step S204.

Step S204: directly use corresponding phonetic notation as the correct phonetic notation of language material;

Identical when at least two phonetic notations of each language material, then represent each phonetic notation instrument to the phonetic notation indifference of same language material, the phonetic notation that can assert this language material is correct, directly uses this phonetic notation as the correct phonetic notation of language material, enters step S206.

Step S205: select the more excellent phonetic notation of assessment result as the correct phonetic notation of language material;

When at least two phonetic notations of each language material exist not simultaneously, represent that each phonetic notation instrument there are differences the phonetic notation of same language material, need again assess the phonetic notation of language material, select the more excellent phonetic notation of assessment result with the correct phonetic notation as language material.When assessing here, can from existing a plurality of phonetic notations, select more excellent phonetic notation as the correct phonetic notation of language material, if re-start when assessment, when each phonetic notation instrument is all incorrect to the phonetic notation of this language material, also can manually redefine the correct phonetic notation of this language material.Determine to enter step S206 after the right pronunciation of language material.

The above steps of applying phonetic notation to the corpus items follow the same flow as in the embodiment of the corpus phonetic notation method described above, and are not illustrated again here.

Step S206: use the corpus items whose correct phonetic notation has been determined as evaluation corpus items;

The corpus items that have undergone phonetic notation and whose correct phonetic notation has been determined are used as evaluation corpus items. In practical applications, at least one input method tool may be used to input the evaluation corpus items and obtain the corresponding candidate results. The evaluation corpus items and the candidate results obtained are then stored together as an evaluation corpus, which makes it more efficient to obtain the candidate results for an evaluation corpus item in later runs. The at least one input method tool may be any one or more of Google input method, Baidu input method, Microsoft input method, Sogou input method, and the like.

From the description of the above embodiment it can be understood that, compared with the prior art, the method of the present invention for generating evaluation corpus items requires almost no manual participation and improves the efficiency of generating evaluation corpus items. The evaluation corpus items obtained by segmenting the captured historical input content into single user entries are closer to what users actually enter, and the continuity of the evaluation corpus items is better. Multiple phonetic notation tools are used to annotate each corpus item, and the multiple phonetic notations of the same corpus item are cross-checked to determine the correct one, which effectively reduces the workload of manually reviewing the phonetic notations and also improves their accuracy.

Referring to Fig. 4, another embodiment of the method of the present invention for generating input method evaluation corpus items comprises:

Step S401: capture content of a predetermined field or type on the network as historical input content;

Content of a predetermined field or type on the network is crawled as the user's historical input content. For example, content in fields such as "science and technology", "tourism", and "sports" may be crawled from portal websites as historical input content. Content such as logs or signatures posted by users on social networking sites such as Twitter, Facebook, or microblogs may also be crawled as the user's historical input content.

Step S402: segment the captured historical input content into at least one corpus item of a single user entry;

The captured historical input content may first be segmented a first time using punctuation marks as separating boundaries, and the corpus items obtained from the first segmentation may then be segmented a second time by bunsetsu (phrase unit) to obtain at least one corpus item of a single user entry.
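The two-pass segmentation of step S402 can be sketched as follows. The punctuation set in `_PUNCT` is an assumption chosen for illustration, and `split_bunsetsu` is a hypothetical callback standing in for a bunsetsu segmenter (the embodiments name juman and knp for this pass, whose interfaces are not reproduced here).

```python
import re
from typing import Callable, List

# Assumed delimiter set for the first pass; the description only says
# "punctuation marks", so the exact characters are illustrative.
_PUNCT = re.compile(r"[、。，．！？!?,.;；:：\s]+")


def split_on_punctuation(text: str) -> List[str]:
    """First segmentation: cut the text at punctuation boundaries."""
    return [part for part in _PUNCT.split(text) if part]


def segment(text: str, split_bunsetsu: Callable[[str], List[str]]) -> List[str]:
    """Apply the first pass, then the caller-supplied second pass."""
    items: List[str] = []
    for clause in split_on_punctuation(text):
        # Second segmentation: delegate to a bunsetsu segmenter.
        items.extend(piece for piece in split_bunsetsu(clause) if piece)
    return items
```

Passing the second pass in as a function keeps the sketch independent of any particular segmenter, so a wrapper around an external tool can be substituted without changing the first pass.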

Step S403: perform denoising on the corpus items of single user entries obtained by segmentation;

Content crawled from the network follows no format specification, so the crawled historical input content may contain many meaningless corpus items. If such meaningless corpus items were used to produce evaluation corpus items, they would affect the evaluation results. Here, the corpus items of single user entries obtained by segmentation are denoised according to custom noise rules to remove meaningless corpus items. The custom noise rules are rules based on normal text format conventions. For example, a corpus item may contain content such as "!$$&" or "。。。。". Such content is generally regarded as meaningless and needs to be removed by denoising.
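A denoising pass of this kind can be sketched as below. The single rule used here (an item must contain at least one letter or digit) is an assumption made for illustration; the description leaves the custom noise rules open.

```python
from typing import Iterable, List


def is_meaningful(item: str) -> bool:
    """Hypothetical noise rule: keep items with at least one letter or digit.

    str.isalnum() is Unicode-aware, so kana and kanji count as letters,
    while symbols and punctuation such as "!$$&" or "。。。。" do not.
    """
    return any(ch.isalnum() for ch in item)


def denoise(items: Iterable[str]) -> List[str]:
    """Drop corpus items that fail the noise rule, preserving order."""
    return [item for item in items if is_meaningful(item)]
```

Further rules (minimum length, allowed scripts, and so on) could be added as additional predicates combined with `all(...)`.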

Step S404: calculate the frequency of each corpus item after denoising, and select corpus items by the roulette algorithm;

The frequency of each corpus item after denoising is calculated, and corpus items are then selected by the roulette algorithm. Frequency calculation means counting the number of occurrences of the same corpus item. The frequency of a corpus item is one of the factors that influence the selection result: the higher the frequency of a corpus item, the greater its probability of being selected by the roulette algorithm. Selecting corpus items with the roulette algorithm takes into account how commonly each corpus item is used, while still giving rare corpus items with very low frequency a chance of being selected, so that the corpus items obtained are more comprehensive.

The steps of calculating the frequency of the corpus items and selecting corpus items by the roulette algorithm are further explained below by way of example:

Suppose there are 100 corpus items in total after denoising, made up of four distinct corpus items A, B, C, and D. The frequency calculation gives 40 for A, 35 for B, 15 for C, and 10 for D. The probabilities of occurrence of A, B, C, and D are therefore 0.4 (40/100), 0.35 (35/100), 0.15 (15/100), and 0.1 (10/100), respectively. In other words, if all corpus items are regarded as a circle, the sector angles allotted to A, B, C, and D on the circle are 0.4 × 360° = 144°, 0.35 × 360° = 126°, 0.15 × 360° = 54°, and 0.1 × 360° = 36°, respectively. It can be seen that the larger the probability of a corpus item, the more likely it is to be selected.
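The frequency and probability figures in this example can be reproduced with a few lines of code. The function name `selection_probabilities` is hypothetical and the sketch only covers the counting step, not the selection itself.

```python
from collections import Counter
from typing import Dict, Iterable


def selection_probabilities(items: Iterable[str]) -> Dict[str, float]:
    """Count each distinct corpus item and convert counts to probabilities."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: n / total for item, n in counts.items()}


if __name__ == "__main__":
    corpus = ["A"] * 40 + ["B"] * 35 + ["C"] * 15 + ["D"] * 10
    for item, p in selection_probabilities(corpus).items():
        # Sector angle on the roulette wheel is the probability times 360.
        print(item, p, round(p * 360))
```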

Following this method, when corpus items are selected in practice, an integer interval may first be defined. For example, if the corpus contains 3 A, 2 B, and 5 C, the total interval may be defined as 1-10. The interval of A may then be defined as 1-3 (that is, A occupies the three values 1, 2, and 3); it could equally be defined as 4-6, as long as A occupies three values. Similarly, the interval of B may be defined as 4-5 and the interval of C as 6-10. A program then generates a random number in the interval 1-10. If the random number is 1, 2, or 3 (that is, any value in the interval of A), corpus item A is selected. The other corpus items are selected in the same way. After a corpus item is selected, a check may be made to determine whether it has already been selected. If it has, the next step is carried out directly; if it has not, the corpus item is added to the selected list before the next step is carried out. These steps are repeated until the length of the selected list meets the requirement of the evaluation corpus.
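The interval-based selection just described can be sketched as follows. The function name `roulette_select` and its parameters are hypothetical; the sketch assumes frequencies are given as a mapping from corpus item to count, assigns intervals in mapping order, and caps the target at the number of distinct items so the loop always terminates.

```python
import random
from bisect import bisect_left
from itertools import accumulate
from typing import Dict, List, Optional


def roulette_select(
    counts: Dict[str, int],
    target_size: int,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Draw distinct corpus items with probability proportional to frequency."""
    rng = rng or random.Random()
    items = [item for item, n in counts.items() if n > 0]
    if not items:
        return []
    # Upper bound of each interval, e.g. A:3, B:2, C:5 -> [3, 5, 10],
    # so A owns 1-3, B owns 4-5 and C owns 6-10.
    bounds = list(accumulate(counts[item] for item in items))
    total = bounds[-1]
    target = min(target_size, len(items))
    selected: List[str] = []
    seen = set()
    while len(selected) < target:
        draw = rng.randint(1, total)  # inclusive on both ends
        item = items[bisect_left(bounds, draw)]
        if item not in seen:  # skip items that were already selected
            seen.add(item)
            selected.append(item)
    return selected
```

For example, `roulette_select({"A": 3, "B": 2, "C": 5}, 2)` returns two distinct items, with C the most likely to appear. Rejecting repeated draws is simple but can become slow when nearly all items have been taken and the remaining ones have very low frequency.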

Step S405: among the selected corpus items, for identical corpus items, keep only one of them as a corpus item to be annotated by the at least two different phonetic notation tools;

The selected corpus items are deduplicated, to avoid annotating and verifying the same corpus item repeatedly and to avoid placing an unnecessary burden on the evaluation corpus items. For example, if the selected corpus item is D and there are 10 copies of D, only 1 D is kept and the other 9 are removed.
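Deduplication of this kind needs very little code. The sketch below, with the hypothetical name `deduplicate`, keeps the first occurrence of each corpus item and preserves the original order, which is an assumption since the description does not say which copy is kept.

```python
from typing import Iterable, List


def deduplicate(items: Iterable[str]) -> List[str]:
    """Keep one copy of each corpus item, in first-seen order."""
    # dict preserves insertion order, so fromkeys drops later duplicates.
    return list(dict.fromkeys(items))
```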

Step S406: use at least two different phonetic notation tools to annotate each corpus item separately;

For the corpus items processed in the above steps, at least two phonetic notation tools are used to annotate each corpus item.

Step S407: judge whether the at least two phonetic notations of each corpus item are identical;

It is judged whether the at least two phonetic notations of each corpus item are identical. If they are identical, step S408 is carried out; if they differ, step S409 is carried out.

Step S408: directly use the corresponding phonetic notation as the correct phonetic notation of the corpus item;

After the phonetic notation produced by the phonetic notation tools is used directly as the correct phonetic notation of the corpus item, the method enters step S410.

Step S409: select the phonetic notation with the better assessment result as the correct phonetic notation of the corpus item;

The phonetic notations of the corpus item are assessed, or the corpus item is annotated again, so that the phonetic notation with the better assessment result is selected as the correct phonetic notation of the corpus item. After the correct phonetic notation is determined, the method enters step S410.

Step S410: use the corpus items whose correct phonetic notation has been determined as evaluation corpus items;

Step S411: run at least one input method tool to input the evaluation corpus items, obtain the corresponding candidate results, and save the evaluation corpus items together with the corresponding candidate results to obtain an evaluation corpus;

At least one input method tool is run to input the evaluation corpus items obtained in the above steps and obtain the corresponding candidate results. The evaluation corpus items and the candidate results obtained are stored together as an evaluation corpus, which makes it more efficient to obtain the candidate results for an evaluation corpus item in later runs. The at least one input method tool may be any one or more of Google input method, Baidu input method, Microsoft input method, Sogou input method, and the like.
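Step S411 can be pictured with the sketch below. Everything about the interface is an assumption: each input method tool is modelled as a hypothetical callable that takes a phonetic notation and returns a list of candidate strings, and the record layout (`notation`, `expected`, `candidates`) is invented for illustration. Real input method tools do not expose such a function directly.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical stand-in for an input method tool: notation -> candidates.
InputMethod = Callable[[str], List[str]]


def build_evaluation_corpus(
    items: Iterable[Tuple[str, str]],
    tools: Dict[str, InputMethod],
) -> List[dict]:
    """Pair each evaluation corpus item with the candidates of every tool.

    `items` yields (correct_notation, corpus_item_text) pairs.
    """
    corpus = []
    for notation, expected in items:
        # Feed the notation to every tool and collect its candidate list.
        candidates = {name: tool(notation) for name, tool in tools.items()}
        corpus.append(
            {"notation": notation, "expected": expected, "candidates": candidates}
        )
    return corpus
```

Storing the candidates alongside each item means a later evaluation run can read them back instead of driving the input method tools again, which is the efficiency gain the step describes.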

Referring to Fig. 5, a first embodiment of the electronic device of the present invention comprises a phonetic notation module 11, a judging module 12, and a phonetic notation determining module 13, wherein:

The phonetic notation module 11 is used to annotate each corpus item separately with at least two different phonetic notation tools, so that each corpus item has at least two corresponding phonetic notations, and to output the at least two corresponding phonetic notations of each corpus item to the judging module 12;

The phonetic notation module 11 may use two or more different phonetic notation tools to annotate each corpus item separately, and outputs the at least two corresponding phonetic notations of each annotated corpus item to the judging module 12.

The judging module 12 is used to judge whether the at least two phonetic notations of each corpus item are identical, and to output the judgment result to the phonetic notation determining module 13;

The judging module 12 judges whether the multiple phonetic notations of each corpus item are identical, and outputs the judgment result to the phonetic notation determining module 13.

The phonetic notation determining module 13 is used to directly use the corresponding phonetic notation as the correct phonetic notation of the corpus item when the at least two phonetic notations of the corpus item are identical, and to select the phonetic notation with the better assessment result as the correct phonetic notation of the corpus item when the at least two phonetic notations differ.

The phonetic notation determining module 13 is used to determine the correct phonetic notation of the corpus item. When the judging module 12 judges that the at least two phonetic notations of a corpus item are identical, the corresponding phonetic notation is used directly as the correct phonetic notation of the corpus item; when the at least two phonetic notations differ, the phonetic notation with the better assessment result is selected as the correct phonetic notation of the corpus item.

Referring to Fig. 6, a second embodiment of the electronic device of the present invention comprises a segmentation module 21, a phonetic notation module 22, a judging module 23, and an evaluation corpus generation module 24, wherein:

The segmentation module 21 is used to segment the captured historical input content into at least one corpus item of a single user entry, and to output the at least one corpus item obtained by segmentation to the phonetic notation module 22;

The segmentation module 21 may segment the historical input content a first time using punctuation marks as separating boundaries, and then segment the corpus items obtained from the first segmentation a second time by bunsetsu to obtain at least one corpus item of a single user entry. Specifically, the segmentation module 21 uses juman and knp to perform the second segmentation by bunsetsu on the corpus items obtained from the first segmentation.

The phonetic notation module 22 is used to annotate each corpus item separately with at least two different phonetic notation tools, so that each corpus item has at least two corresponding phonetic notations, and to output the at least two corresponding phonetic notations of each corpus item to the judging module 23;

The phonetic notation module 22 may use any several phonetic notation tools among Google input method, Baidu input method, Microsoft input method, Sogou input method, and the like to annotate each corpus item separately, and outputs the at least two corresponding phonetic notations of each annotated corpus item to the judging module 23.

The judging module 23 is used to judge whether the at least two phonetic notations of each corpus item are identical, and to output the judgment result to the evaluation corpus generation module 24;

The judging module 23 judges whether the multiple phonetic notations of each corpus item received from the phonetic notation module 22 are identical, and outputs the judgment result to the evaluation corpus generation module 24.

The evaluation corpus generation module 24 is used to directly use the corresponding phonetic notation as the correct phonetic notation of the corpus item when the at least two phonetic notations of the corpus item are identical, to select the phonetic notation with the better assessment result as the correct phonetic notation of the corpus item when the at least two phonetic notations differ, and to use the corpus items whose correct phonetic notation has been determined as evaluation corpus items.

The evaluation corpus generation module 24 determines the correct phonetic notation of each corpus item, and uses the corpus items whose correct phonetic notation has been determined as evaluation corpus items.

Referring to Fig. 7, in a third embodiment of the electronic device of the present invention the segmentation module comprises a first segmentation unit 111 and a second segmentation unit 112, wherein:

The first segmentation unit 111 is used to segment the captured historical input content a first time using punctuation marks as separating boundaries, and to output the corpus items obtained from the first segmentation to the second segmentation unit 112;

The first segmentation unit 111 performs the first segmentation of the captured historical input content using punctuation marks as separating boundaries, and outputs the corpus items obtained to the second segmentation unit 112.

The second segmentation unit 112 is used to segment the corpus items received from the first segmentation unit 111 a second time by bunsetsu to obtain the corpus items of single user entries, and to output the corpus items of single user entries to the phonetic notation module;

Specifically, the second segmentation unit 112 uses juman and knp to perform the second segmentation by bunsetsu on the corpus items obtained from the first segmentation.

Referring to Fig. 8, a fourth embodiment of the electronic device of the present invention comprises a content capture module 31, a segmentation module 32, a denoising module 33, a corpus selection module 34, a deduplication module 35, a phonetic notation module 36, a judging module 37, an evaluation corpus generation module 38, and a corpus module 39, wherein:

The content capture module 31 is used to capture content of a predetermined field or type on the network as historical input content, and to output the captured historical input content to the segmentation module 32;

The content capture module 31 captures content of a specific field or type on the network as historical input content.

The segmentation module 32 is used to segment the captured historical input content into at least one corpus item of a single user entry, and to output the at least one corpus item obtained by segmentation to the denoising module 33;

The denoising module 33 is used to denoise the corpus items of single user entries obtained by the segmentation module 32, to eliminate meaningless corpus items among them;

The denoising module 33 may denoise the corpus items of single user entries obtained by segmentation according to custom noise rules, eliminating meaningless corpus items among them.

The corpus selection module 34 is used to calculate the frequency of each corpus item processed by the denoising module 33, and to select corpus items by the roulette algorithm;

The corpus selection module 34 calculates the frequency of each corpus item after denoising, and selects corpus items by the roulette algorithm.

The deduplication module 35 is used, among the corpus items selected by the corpus selection module 34, to keep only one of any identical corpus items as a corpus item to be annotated by the at least two different phonetic notation tools, and to output it to the phonetic notation module 36;

The deduplication module 35 removes identical corpus items from the selected corpus items, to avoid duplicate corpus items among the generated evaluation corpus items.

The phonetic notation module 36 is used to annotate each corpus item separately with at least two different phonetic notation tools, so that each corpus item has at least two corresponding phonetic notations, and to output the at least two corresponding phonetic notations of each corpus item to the judging module 37;

The judging module 37 is used to judge whether the at least two phonetic notations of each corpus item are identical, and to output the judgment result to the evaluation corpus generation module 38;

The evaluation corpus generation module 38 is used to directly use the corresponding phonetic notation as the correct phonetic notation of the corpus item when the at least two phonetic notations of the corpus item are identical, to select the phonetic notation with the better assessment result as the correct phonetic notation of the corpus item when the at least two phonetic notations differ, and to use the corpus items whose correct phonetic notation has been determined as evaluation corpus items.

The corpus module 39 is used to run at least one input method tool to input the evaluation corpus items obtained by the evaluation corpus generation module 38 so as to obtain the corresponding candidate results, to collect the corresponding candidate results, and to save the evaluation corpus items together with the corresponding candidate results to obtain an evaluation corpus.

The corpus module 39 runs one or any several input method tools among Google input method, Baidu input method, Microsoft input method, Sogou input method, and the like to input the evaluation corpus items and obtain the corresponding candidate results, and saves the evaluation corpus items together with the corresponding candidate results to obtain an evaluation corpus.

As the above embodiments show, the advantages of the present invention are as follows. On the one hand, the corpus phonetic notation method of the present invention uses multiple phonetic notation tools to annotate each corpus item separately and then determines the correct phonetic notation of the corpus item by cross-checking. This greatly reduces the number of corpus items whose phonetic notation must be verified manually, and improves both the efficiency and the accuracy of the phonetic notation. Studies show that, compared with existing phonetic notation methods, the corpus phonetic notation method of the present invention reduces the number of corpus items requiring manual verification of the phonetic notation by 90%, and the accuracy of the phonetic notation reaches more than 99.5%.

On the other hand, the method of the present invention for generating input method evaluation corpus items requires almost no manual participation and improves the efficiency of generating evaluation corpus items. Denoising the captured corpus items significantly improves the quality of the evaluation corpus items and prevents unsuitable corpus items from skewing the evaluation results downward. In addition, segmentation reduces the fragmentation of language in the evaluation corpus items, and the frequency calculation introduced during selection takes into account how commonly each corpus item is used, so that the evaluation corpus items obtained are closer to what users actually enter and their continuity is better. Multiple phonetic notation tools are used to annotate the corpus items, and the correct phonetic notation is determined by cross-checking, which effectively reduces the workload of manually verifying the phonetic notations and also improves their accuracy.

In the several embodiments provided by the present invention, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into modules is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections between devices or units through some interfaces, and may be electrical, mechanical, or of another form.

The functional modules described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the present invention.

In addition, the functional modules in the embodiments of the present invention may be integrated into one processing unit, each functional module may exist physically on its own, or two or more functional modules may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

The above are only embodiments of the present invention and do not thereby limit the scope of the claims of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims

1. the method for a language material phonetic notation is characterized in that, comprising:

Utilize at least two different phonetic notation instruments that each described language material is carried out respectively phonetic notation, so that each language material has accordingly at least two phonetic notations;

Whether at least two phonetic notations judging each described language material are identical, if difference then selects the more excellent phonetic notation of assessment result with the correct phonetic notation as described language material, if identical then directly with the correct phonetic notation of described phonetic notation as language material.

2. A method of generating input method evaluation corpus items, characterized by comprising:

segmenting captured historical input content into at least one corpus item of a single user entry;

judging whether the at least two phonetic notations of each said corpus item are identical; if they differ, selecting the phonetic notation with the better assessment result as the correct phonetic notation of said corpus item; if they are identical, directly using said phonetic notation as the correct phonetic notation of the corpus item; and using said corpus items whose correct phonetic notation has been determined as said evaluation corpus items.

3. The method according to claim 2, characterized in that the step of segmenting the captured historical input content into corpus items of single user entries comprises:

segmenting the captured historical input content a first time using punctuation marks as separating boundaries;

segmenting the corpus items obtained from said first segmentation a second time by bunsetsu to obtain the corpus items of said single user entries.

4. The method according to claim 3, characterized in that the step of segmenting the corpus items obtained from the first segmentation a second time by bunsetsu comprises: using juman and knp to perform the second segmentation by bunsetsu on the corpus items obtained from the first segmentation.

5. The method according to claim 2, characterized in that, after the step of segmenting the captured historical input content into corpus items of single user entries and before the step of using at least two different phonetic notation tools to annotate said corpus items, the method further comprises:

denoising the corpus items of said single user entries obtained by segmentation, to eliminate meaningless corpus items among them.

6. The method according to claim 5, characterized in that the step of denoising the corpus items of said single user entries obtained by segmentation comprises: denoising the corpus items of said single user entries obtained by segmentation according to custom noise rules.

7. The method according to claim 5, characterized in that, after the step of denoising the corpus items of said single user entries obtained by segmentation, the method further comprises: calculating the frequency of each said corpus item after said denoising, and selecting corpus items by the roulette algorithm.

8. The method according to claim 7, characterized in that, after the step of calculating the frequency of the corpus items after denoising and selecting corpus items by the roulette algorithm, the method further comprises: among said selected corpus items, for identical said corpus items, keeping only one of them as a corpus item to be annotated by said at least two different phonetic notation tools.

9. The method according to claim 2, characterized in that, after the step of generating said evaluation corpus items, the method further comprises:

running at least one input method tool to input said evaluation corpus items so as to obtain corresponding candidate results, and collecting the corresponding candidate results;

saving said evaluation corpus items together with the corresponding candidate results to obtain an evaluation corpus.

10. The method according to claim 2, characterized in that, before the step of segmenting the captured historical input content into corpus items of single user entries, the method further comprises: capturing content of a predetermined field or type on the network as historical input content.

11. An electronic device, characterized by comprising a phonetic notation module, a judging module, and a phonetic notation determining module, wherein:

said phonetic notation module is used to annotate each said corpus item separately with at least two different phonetic notation tools, so that each corpus item has at least two corresponding phonetic notations, and to output said at least two corresponding phonetic notations of each corpus item to said judging module;

said judging module is used to judge whether the at least two phonetic notations of each said corpus item are identical, and to output the judgment result to said phonetic notation determining module;

said phonetic notation determining module is used to directly use said phonetic notation as the correct phonetic notation of the corpus item when the at least two phonetic notations of each said corpus item are identical, and to select the phonetic notation with the better assessment result as the correct phonetic notation of said corpus item when the at least two phonetic notations of each said corpus item differ.

12. An electronic device, characterized by comprising a segmentation module, a phonetic notation module, a judging module, and an evaluation corpus generation module, wherein:

said segmentation module is used to segment captured historical input content into at least one corpus item of a single user entry, and to output the at least one corpus item obtained by segmentation to said phonetic notation module;

said judging module is used to judge whether the at least two phonetic notations of each said corpus item are identical, and to output the judgment result to said evaluation corpus generation module;

said evaluation corpus generation module is used to directly use said phonetic notation as the correct phonetic notation of the corpus item when the at least two phonetic notations of each said corpus item are identical, to select the phonetic notation with the better assessment result as the correct phonetic notation of said corpus item when the at least two phonetic notations of each said corpus item differ, and to use said corpus items whose correct phonetic notation has been determined as said evaluation corpus items.

13. The device according to claim 12, characterized in that the segmentation module comprises a first segmentation unit and a second segmentation unit, wherein:

The first segmentation unit is configured to perform a first segmentation of the captured historical input content using punctuation marks as separation boundaries, and to output the linguistic data obtained by the first segmentation to the second segmentation unit;

The second segmentation unit is configured to perform a second segmentation, according to bunsetsu (文節, phrase units), of the linguistic data received from the first segmentation unit after the first segmentation, to obtain the linguistic data of a single user entry, and to output the linguistic data of a single user entry to the phonetic annotation module.

14. The device according to claim 13, characterized in that the second segmentation unit is specifically configured to use JUMAN and KNP to perform the second segmentation, according to bunsetsu, of the linguistic data obtained by the first segmentation.
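
The two-stage segmentation of claims 13 and 14 can be sketched as follows. The punctuation set and the function names are assumptions of this sketch. The second stage is left as a caller-supplied function, since the actual interfaces of JUMAN and KNP are not reproduced here.

```python
import re
from typing import Callable, List

# Assumed punctuation set used as separation boundaries; the claims do not
# enumerate which marks are used.
PUNCTUATION = re.compile(r"[。、！？!?,.\s]+")


def first_cut(text: str) -> List[str]:
    """First segmentation: split at punctuation marks, dropping empty pieces."""
    return [piece for piece in PUNCTUATION.split(text) if piece]


def segment(text: str, second_cut: Callable[[str], List[str]]) -> List[str]:
    """Apply the first segmentation, then the second one to each piece.

    `second_cut` stands in for the phrase-unit segmentation that the claims
    attribute to JUMAN and KNP.
    """
    entries: List[str] = []
    for piece in first_cut(text):
        entries.extend(second_cut(piece))
    return entries
```

Keeping the second stage pluggable lets the punctuation split be tested without the external analyzers installed.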

15. The device according to claim 12, characterized in that the device further comprises a denoising module, configured to denoise the linguistic data of a single user entry obtained by the segmentation module, so as to eliminate meaningless linguistic data.

16. The device according to claim 15, characterized in that the denoising module is specifically configured to use user-defined noise rules to denoise the linguistic data of a single user entry obtained by the segmentation.

17. The device according to claim 15, characterized in that the device further comprises a linguistic data selection module, configured to calculate the frequency of each linguistic datum processed by the denoising module, and to select linguistic data by a roulette algorithm.
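
The frequency calculation and roulette selection of claim 17 can be sketched as follows. This sketch assumes that the selection probability of a datum is proportional to its frequency and that selection is done with replacement; the claim itself does not fix either detail. The name `roulette_select` is hypothetical.

```python
import random
from collections import Counter
from typing import List, Optional


def roulette_select(
    data: List[str], k: int, rng: Optional[random.Random] = None
) -> List[str]:
    """Pick k linguistic data, each with probability proportional to its frequency."""
    rng = rng or random.Random()
    freq = Counter(data)              # frequency calculation
    items = list(freq)
    weights = [freq[item] for item in items]
    total = sum(weights)

    chosen: List[str] = []
    for _ in range(k):
        spin = rng.uniform(0, total)  # spin the wheel once
        cumulative = 0
        for item, weight in zip(items, weights):
            cumulative += weight
            if spin <= cumulative:    # the wheel stops in this item's slice
                chosen.append(item)
                break
    return chosen
```

Because frequent data occupy a larger slice of the wheel, the selected sample tends to mirror what users actually type, while rare data still have a chance of being picked.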

18. The device according to claim 17, characterized in that the device further comprises a deduplication module, configured to, among the linguistic data selected by the linguistic data selection module, retain only one of any identical linguistic data as the linguistic datum to be annotated using the at least two different phonetic annotation tools, and to output the retained linguistic data to the phonetic annotation module.
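
The deduplication of claim 18 amounts to keeping one copy of each identical datum. A minimal sketch, assuming that the order of first occurrence should be preserved (the claim does not say), with the hypothetical name `deduplicate`:

```python
from typing import List


def deduplicate(selected: List[str]) -> List[str]:
    """Keep one copy of each identical linguistic datum, in first-seen order."""
    # dict preserves insertion order, so duplicates collapse onto the first copy.
    return list(dict.fromkeys(selected))
```

Deduplicating before annotation means each distinct datum is passed to the phonetic annotation tools only once.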

19. The device according to claim 12, characterized in that the device further comprises an evaluation corpus module, configured to run at least one input method tool and input the evaluation linguistic data obtained by the evaluation linguistic data generation module to obtain corresponding candidate results, to collect the corresponding candidate results, and to save the evaluation linguistic data and the corresponding candidate results to obtain an evaluation corpus.
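
The evaluation corpus module of claim 19 can be sketched as follows. The sketch assumes that each evaluation datum is a record with hypothetical `datum` and `annotation` fields, that the annotation is what gets typed into the input method tool, and that a tool can be modeled as a function returning a list of candidates. Real input method tools would need their own drivers.

```python
import json
from typing import Callable, Dict, Iterable, List

# Hypothetical model of an input method tool: typed reading -> candidate list.
InputMethod = Callable[[str], List[str]]


def build_evaluation_corpus(
    evaluation_data: Iterable[Dict[str, str]],
    input_methods: Dict[str, InputMethod],
) -> List[dict]:
    """Feed each evaluation datum to every tool and collect the candidates."""
    corpus = []
    for entry in evaluation_data:
        candidates = {
            name: ime(entry["annotation"]) for name, ime in input_methods.items()
        }
        corpus.append(
            {
                "datum": entry["datum"],
                "annotation": entry["annotation"],
                "candidates": candidates,
            }
        )
    return corpus


def save_corpus(corpus: List[dict], path: str) -> None:
    """Save the evaluation data and candidate results as a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(corpus, f, ensure_ascii=False, indent=2)
```

Storing the expected datum next to each tool's candidates makes it possible to check later at which rank, if any, a tool offered the expected result.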

20. The device according to claim 12, characterized in that the device further comprises a content capture module, configured to capture content of a predetermined field or type on the network as the historical input content, and to output the captured historical input content to the segmentation module.