CN103577396A - Methods and systems for generating simplified and traditional Chinese conversion template and realizing simplified and traditional Chinese conversion based on template - Google Patents


Info

Publication number
CN103577396A
Authority
CN
China
Prior art keywords
phrase
mixing
mix
candidate
mixes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210284530.4A
Other languages
Chinese (zh)
Other versions
CN103577396B (en
Inventor
朱纯深
郝天永
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
City University of Hong Kong CityU
Original Assignee
City University of Hong Kong CityU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by City University of Hong Kong CityU filed Critical City University of Hong Kong CityU
Priority to CN201210284530.4A priority Critical patent/CN103577396B/en
Publication of CN103577396A publication Critical patent/CN103577396A/en
Application granted granted Critical
Publication of CN103577396B publication Critical patent/CN103577396B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention provides a method and a system for generating a simplified-traditional Chinese conversion template, and a method and a system for simplified-traditional Chinese conversion based on the template. The conversion method comprises the following steps: obtaining a first mixed phrase; replacing the numbers in the first mixed phrase with a numeric identifier to obtain a first intermediate mixed phrase; finding, in the pre-generated template, a second intermediate mixed phrase corresponding to the first intermediate mixed phrase; and obtaining, according to the numeric identifier and the second intermediate mixed phrase, a second mixed phrase corresponding to the first mixed phrase. The first and second mixed phrases each comprise one-to-many characters and numbers; the first mixed phrase is one of a simplified Chinese mixed phrase and a traditional Chinese mixed phrase, and the second mixed phrase is the other. Mixed phrases containing one-to-many characters and numbers can thus be converted accurately and quickly between simplified and traditional Chinese.

Description

Method and system for generating a simplified-traditional Chinese conversion template, and method and system for simplified-traditional Chinese conversion based on the template
Technical field
The application relates to methods and systems for converting between simplified and traditional Chinese, and in particular to a method and system for generating a simplified-traditional Chinese conversion template and a method and system for performing the conversion based on that template.
Background
Chinese text is written in either simplified or traditional characters, and everyday communication among the four regions on both sides of the Strait often requires converting between the two. In doing so, one frequently meets a simplified character that corresponds to several traditional characters. For example, simplified 里 may correspond to traditional 里 or 裡, simplified 出 to traditional 出 or 齣, and simplified 发 to traditional 發 or 髮. The reverse also occurs: traditional 乾 may correspond to simplified 乾 (as in 乾隆, Qianlong, or 乾坤) or 干 (as in 葡萄干, raisins), and traditional 著 may correspond to simplified 着 (as in 穿着, wearing) or 著 (as in 著作, works). Existing templates of various types partly solve this one-to-many problem. However, conversion often encounters mixed phrases (ad hoc numerical phrases) made up of one-to-many characters and numbers of various kinds, for example 有40里 ("is 40 li") or a phrase meaning "sang 2 acts" that contains 出. Current conversion techniques have the following defects when converting mixed phrases of this type. Most such phrases contain no conventional dictionary or vocabulary entry, so a conventional dictionary cannot convert them; and because numbers cannot be enumerated exhaustively, a complete dictionary of this type cannot be built. For example, the simplified phrase 有40里 contains no fixed entry, so 里 (here a unit of length equal to 500 meters) may be wrongly converted to 裡 (which means "inside"). Likewise, the simplified phrase meaning "sang 2 acts" contains no fixed entry; its 出 should be converted to traditional 齣 but is wrongly converted to traditional 出. In addition, because a mixed phrase yields countless variants as its number changes, probability-based conversion models such as N-gram models become ineffective, the phrases cannot be listed in a template of any existing type, and they are difficult for any existing conversion system to handle.
Summary of the invention
To improve the accuracy and efficiency of simplified-traditional Chinese conversion, the application provides a method and system for generating a conversion template, and a method and system for performing the conversion based on the template.
One aspect of the application provides a method of generating a simplified-traditional Chinese conversion template for conversion between first and second mixed phrases, where the first and second mixed phrases comprise one-to-many characters and numbers, the first mixed phrase is one of a traditional mixed phrase and a simplified mixed phrase, and the second mixed phrase is the other. The method comprises:
Obtaining candidate pairs of first and second mixed phrases;
Extracting intermediate candidate pairs from the candidate pairs;
Obtaining, from the intermediate candidate pairs, the candidate pair with maximum coverage;
Generating the simplified-traditional Chinese conversion template from the candidate pair with maximum coverage.
Another aspect of the application provides a simplified-traditional Chinese conversion method for conversion between first and second mixed phrases, where the first and second mixed phrases comprise one-to-many characters and numbers, the first mixed phrase is one of a traditional mixed phrase and a simplified mixed phrase, and the second mixed phrase is the other. The method comprises:
Obtaining a first mixed phrase;
Replacing the numbers in the first mixed phrase with a numeric identifier to obtain a first intermediate mixed phrase;
Looking up, in the generated template, the second intermediate mixed phrase corresponding to the first intermediate mixed phrase;
Obtaining, according to the numeric identifier and the second intermediate mixed phrase, the second mixed phrase corresponding to the first mixed phrase.
Another aspect of the application provides a system for generating a simplified-traditional Chinese conversion template for conversion between first and second mixed phrases, where the first and second mixed phrases comprise one-to-many characters and numbers, the first mixed phrase is one of a traditional mixed phrase and a simplified mixed phrase, and the second mixed phrase is the other. The system comprises:
A candidate pair acquisition module, for obtaining candidate pairs of first and second mixed phrases;
An intermediate candidate pair extraction module, for extracting intermediate candidate pairs from the candidate pairs;
A maximum coverage candidate pair acquisition module, for obtaining, from the intermediate candidate pairs, the candidate pair with maximum coverage;
A template generation module, for generating the simplified-traditional Chinese conversion template from the candidate pair with maximum coverage.
Another aspect of the application provides a simplified-traditional Chinese conversion system for conversion between first and second mixed phrases, where the first and second mixed phrases comprise one-to-many characters and numbers, the first mixed phrase is one of a traditional mixed phrase and a simplified mixed phrase, and the second mixed phrase is the other. The system comprises:
A first mixed phrase acquisition module, for obtaining a first mixed phrase;
A first intermediate mixed phrase acquisition module, for replacing the numbers in the first mixed phrase with a numeric identifier to obtain a first intermediate mixed phrase;
A second intermediate mixed phrase lookup module, for looking up, in the generated template, the second intermediate mixed phrase corresponding to the first intermediate mixed phrase;
A second mixed phrase acquisition module, for obtaining, according to the numeric identifier and the second intermediate mixed phrase, the second mixed phrase corresponding to the first mixed phrase.
In summary, using pre-generated candidate pairs of first and second Chinese mixed phrases, conversion between a first mixed phrase and a second mixed phrase that contain one-to-many characters and numbers can be completed quickly and accurately.
The above and other objects, features and advantages of the application will become more apparent from the following description of its embodiments with reference to the accompanying drawings.
Brief description of the drawings
Embodiments of the application are described below with reference to the accompanying drawings, in which:
Fig. 1 is a flowchart of the method of generating a simplified-traditional Chinese conversion template provided by Embodiment 1 of the application;
Fig. 2 is a flowchart of obtaining the candidate pairs of first and second mixed phrases provided by Embodiment 1 of the application;
Fig. 3 is a flowchart of the simplified-traditional Chinese conversion method provided by Embodiment 2 of the application;
Fig. 4 is a block diagram of the system for generating a simplified-traditional Chinese conversion template provided by Embodiment 3 of the application;
Fig. 5 is a structural diagram of the candidate pair acquisition module provided by Embodiment 3 of the application;
Fig. 6 is a block diagram of the simplified-traditional Chinese conversion system provided by Embodiment 4 of the application.
Detailed description
Specific embodiments of the application are described in detail below with reference to the accompanying drawings. It should be noted that the embodiments described here are illustrative only and do not limit the application.
Embodiment 1
This embodiment provides a method of generating a simplified-traditional Chinese conversion template for conversion between first and second mixed phrases, where the first and second mixed phrases comprise one-to-many characters and numbers, the first mixed phrase is one of a traditional mixed phrase and a simplified mixed phrase, and the second mixed phrase is the other. As shown in Fig. 1, the method comprises:
S110: obtain candidate pairs of first and second mixed phrases.
For ease of description, in this embodiment the first mixed phrase is a simplified mixed phrase, that is, a phrase mixing simplified Chinese characters and numbers, and the second mixed phrase is a traditional mixed phrase, that is, a phrase mixing traditional Chinese characters and numbers.
Specifically, as shown in Fig. 2, this step comprises:
S211: obtain a first intermediate mixed phrase, and from it obtain the second mixed phrases.
Specifically, take as an example the first intermediate mixed phrase 小河長有40里 ("the small river is 40 li long"), written as "<TC>小河長有40里", where "<TC>" marks a traditional mixed phrase, the character 里 is a one-to-many character, and 40 is the number.
Next, the numeric identifier "[num]" replaces the number 40 in the first intermediate mixed phrase, giving the second intermediate mixed phrase "<TC>小河長有[num]里". Those skilled in the art will understand that "[num]" is only an example of a numeric identifier and that the scope of protection of the application is not limited to it.
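The number replacement described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `mask_numbers`, the restriction to runs of ASCII digits, and the rendering of the example phrase as the Chinese string 小河長有40里 are all assumptions made for the sketch.

```python
import re

NUM_ID = "[num]"  # the example identifier used in the text; any unambiguous token would do

# Assumption: a "number" is a run of ASCII digits. Chinese numerals would need a wider pattern.
_DIGITS = re.compile(r"[0-9]+")


def mask_numbers(phrase):
    """Replace every digit run with NUM_ID and return the masked phrase plus the numbers found."""
    numbers = _DIGITS.findall(phrase)
    masked = _DIGITS.sub(NUM_ID, phrase)
    return masked, numbers


masked, numbers = mask_numbers("小河長有40里")
print(masked, numbers)
```

Keeping the removed numbers in order alongside the masked phrase is a design choice of the sketch; it makes it possible to put them back after a template lookup.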
Next, the second intermediate mixed phrase "<TC>小河長有[num]里" is transformed with the one-to-many character 里 as the reference, giving a plurality of third intermediate mixed phrases. Here "transform" means adding, removing or changing characters in a mixed phrase, or otherwise expanding, reducing or changing its form.
For example, by removing characters, "<TC>小河長有[num]里" is reduced to "<TC>河長有[num]里", "<TC>長有[num]里", "<TC>有[num]里", "<TC>[num]里", "<TC>里" and so on. In this example one character at a time is removed from the beginning of the traditional mixed phrase. Those skilled in the art will understand that characters may also be removed from the end of the phrase, that more than one character may be removed at a time, and that the number removed need not be the same each time.
By adding characters, an expansion structure can be used to lengthen the traditional mixed phrase. The expansion structure may be placed at the beginning, at the end or in the middle of the phrase, and its length may follow a regular pattern or be random. For example, "<TC>小河長有[num]里" may be transformed by placing an expansion structure meaning "beside" in front of it, giving a phrase that reads "the small river beside ... is [num] li long".
As can be seen from the above, some of the third intermediate mixed phrases obtained contain both a number and a one-to-many character, some contain only a one-to-many character, some contain only a number, and some contain neither.
The example adopted in this embodiment is the removal of characters: "<TC>小河長有[num]里" is reduced to "<TC>河長有[num]里", "<TC>長有[num]里", "<TC>有[num]里", "<TC>[num]里", "<TC>里" and so on.
Then the third intermediate mixed phrases that do not contain both a one-to-many character and the numeric identifier are filtered out, giving the second mixed phrases.
Continuing the example, the numeric identifier "[num]" is used to filter out the phrase "<TC>里", finally giving the second mixed phrases "<TC>小河長有[num]里", "<TC>河長有[num]里", "<TC>長有[num]里", "<TC>有[num]里" and "<TC>[num]里". These second mixed phrases share the overlapping structure "[num]里".
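The reduce-then-filter step can be sketched as below. It is an illustration under stated assumptions: the helper names `tokenize` and `trimmed_variants` are hypothetical, characters are removed one at a time from the beginning only, the "<TC>" tag is left out, and the example phrase is taken to be the Chinese string 小河長有[num]里.

```python
NUM_ID = "[num]"


def tokenize(phrase):
    """Split a phrase into single characters, keeping the numeric identifier as one token."""
    tokens, i = [], 0
    while i < len(phrase):
        if phrase.startswith(NUM_ID, i):
            tokens.append(NUM_ID)
            i += len(NUM_ID)
        else:
            tokens.append(phrase[i])
            i += 1
    return tokens


def trimmed_variants(phrase, one_to_many_char):
    """Drop tokens from the front one at a time, then keep only the variants
    that still contain both the numeric identifier and the one-to-many character."""
    tokens = tokenize(phrase)
    variants = ["".join(tokens[k:]) for k in range(len(tokens))]
    return [v for v in variants if NUM_ID in v and one_to_many_char in v]


variants = trimmed_variants("小河長有[num]里", "里")
for v in variants:
    print(v)
```

The bare variant 里 is produced by the reduction and then discarded by the filter, which mirrors the filtering of "<TC>里" in the example.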
It should be noted that this embodiment may also transform the traditional mixed phrase "<TC>小河長有[num]里" with the numeric identifier "[num]" as the reference, for example into the third intermediate mixed phrases "<TC>河長有[num]里", "<TC>長有[num]里", "<TC>小河長有[num]" and so on. The one-to-many character 里 is then used to filter out the third intermediate mixed phrases that do not contain both a one-to-many character and the numeric identifier, giving the second mixed phrases.
Those skilled in the art will understand that this embodiment may also transform the phrase with both the one-to-many character 里 and the numeric identifier "[num]" as references, and then use both to filter out the mixed phrases that lack them, giving the second mixed phrases. In this case the filtering step may be omitted, which makes the operation very efficient.
It should also be noted that this embodiment may first transform the first intermediate mixed phrase with the number as the reference, then replace the numbers in the transformed phrases with the numeric identifier "[num]", and then filter using the one-to-many character and the numeric identifier. The process is as described above and is not repeated here.
Those skilled in the art will appreciate that, whichever transformation and filtering approach is used, it is only necessary to ensure that the second mixed phrases finally obtained contain both a one-to-many character and the numeric identifier.
S212: convert the second mixed phrases to obtain the corresponding first mixed phrases.
Continuing the example, the second mixed phrases "<TC>小河長有[num]里", "<TC>河長有[num]里", "<TC>長有[num]里", "<TC>有[num]里" and "<TC>[num]里" are converted to the first mixed phrases "<SC>小河长有[num]里", "<SC>河长有[num]里", "<SC>长有[num]里", "<SC>有[num]里" and "<SC>[num]里" respectively, where "<SC>" marks a simplified mixed phrase.
S213: pair each first mixed phrase with its corresponding second mixed phrase to form the candidate pairs.
Continuing the example, the second mixed phrases and their corresponding first mixed phrases form the following candidate pairs: "<SC>小河长有[num]里 → <TC>小河長有[num]里", "<SC>河长有[num]里 → <TC>河長有[num]里", "<SC>长有[num]里 → <TC>長有[num]里", "<SC>有[num]里 → <TC>有[num]里" and "<SC>[num]里 → <TC>[num]里".
It should be noted that in practice not all candidate pairs have an overlapping structure. A sentence may begin with a number and end with a one-to-many character, so that only one candidate pair can be obtained from it; or candidate pairs may be obtained from sentences with entirely different content, in which case the pairs from different sentences do not overlap. For simplicity the application uses candidate pairs with an overlapping structure as its example, but this does not limit the application.
S120: extract intermediate candidate pairs from the candidate pairs of first and second mixed phrases.
Continuing the example, a training text is used to count, for each candidate pair, how often the first mixed phrase is converted to the second mixed phrase, and the candidate pairs whose conversion frequency exceeds a preset first threshold are retained. For example, for the candidate pair "<SC>小河长有[num]里 → <TC>小河長有[num]里", the conversion frequency from the first mixed phrase to the second mixed phrase is 12; for "<SC>河长有[num]里 → <TC>河長有[num]里" it is 18; for "<SC>长有[num]里 → <TC>長有[num]里" it is 20; for "<SC>有[num]里 → <TC>有[num]里" it is 25; and for "<SC>[num]里 → <TC>[num]里" it is 34. With a preset first threshold of 19, the candidate pairs "<SC>长有[num]里 → <TC>長有[num]里", "<SC>有[num]里 → <TC>有[num]里" and "<SC>[num]里 → <TC>[num]里" are retained.
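The frequency filter can be checked with a few lines of code. The counts and the threshold of 19 are the ones given in the example; the dictionary layout and variable names are assumptions of this sketch, and the phrases are written as Chinese strings without the "<SC>" and "<TC>" tags.

```python
# Conversion frequency of each (simplified, traditional) candidate pair, as in the example.
conversion_frequency = {
    ("小河长有[num]里", "小河長有[num]里"): 12,
    ("河长有[num]里", "河長有[num]里"): 18,
    ("长有[num]里", "長有[num]里"): 20,
    ("有[num]里", "有[num]里"): 25,
    ("[num]里", "[num]里"): 34,
}

FIRST_THRESHOLD = 19  # preset first threshold from the example

# Keep only the pairs converted more often than the threshold.
retained = {
    pair: count
    for pair, count in conversion_frequency.items()
    if count > FIRST_THRESHOLD
}

for (simplified, traditional), count in retained.items():
    print(simplified, "->", traditional, count)
```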
Then the confidence of each candidate pair is computed, and the candidate pairs whose confidence exceeds a preset second threshold are retained. Specifically, because one-to-many characters are involved, one mixed phrase may turn out to correspond to two different mixed phrases when the candidate pairs are generated, so the confidence of each candidate pair must be assessed and only the pairs above the second threshold kept. For example, when the candidate pair "<SC>长有[num]里 → <TC>長有[num]里" is generated, the pair "<SC>长有[num]里 → <TC>長有[num]裡" may also be generated; when "<SC>有[num]里 → <TC>有[num]里" is generated, "<SC>有[num]里 → <TC>有[num]裡" may also be generated; and when "<SC>[num]里 → <TC>[num]里" is generated, "<SC>[num]里 → <TC>[num]裡" may also be generated. The conversion frequencies of "<SC>长有[num]里 → <TC>長有[num]里", "<SC>有[num]里 → <TC>有[num]里" and "<SC>[num]里 → <TC>[num]里" are 20, 25 and 34 respectively, while those of "<SC>长有[num]里 → <TC>長有[num]裡", "<SC>有[num]里 → <TC>有[num]裡" and "<SC>[num]里 → <TC>[num]裡" are 1, 3 and 10 respectively. The confidences of "<SC>长有[num]里 → <TC>長有[num]里", "<SC>有[num]里 → <TC>有[num]里" and "<SC>[num]里 → <TC>[num]里" are therefore 20/(20+1), 25/(25+3) and 34/(34+10). With a second threshold of 6/7, comparing each confidence with the threshold retains the candidate pairs "<SC>长有[num]里 → <TC>長有[num]里" and "<SC>有[num]里 → <TC>有[num]里". The purpose is to obtain candidate pairs that are not only converted frequently but also converted with high confidence, so that they meet practical conversion needs.
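The confidence computation can be reproduced with exact fractions. The counts and the threshold 6/7 come from the example; the nested dictionary, the function name `confidence` and the choice of the most frequent target as the candidate to keep are assumptions of this sketch.

```python
from fractions import Fraction

# For each simplified phrase, how often each competing traditional form was observed.
observed = {
    "长有[num]里": {"長有[num]里": 20, "長有[num]裡": 1},
    "有[num]里": {"有[num]里": 25, "有[num]裡": 3},
    "[num]里": {"[num]里": 34, "[num]裡": 10},
}

SECOND_THRESHOLD = Fraction(6, 7)  # preset second threshold from the example


def confidence(targets, chosen):
    """Share of all observed conversions that went to the chosen traditional form."""
    return Fraction(targets[chosen], sum(targets.values()))


retained = []
for simplified, targets in observed.items():
    best = max(targets, key=targets.get)  # most frequent traditional form
    if confidence(targets, best) > SECOND_THRESHOLD:
        retained.append((simplified, best))

print(retained)
```

Since 20/21 and 25/28 exceed 6/7 while 34/44 does not, only the first two pairs pass, which agrees with the result stated in the text.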
S130: select the candidate pair with maximum coverage from the intermediate candidate pairs.
Continuing the example, the candidate pairs "<SC>有[num]里 → <TC>有[num]里" and "<SC>[num]里 → <TC>[num]里" are both suitable, and the pair "<SC>[num]里 → <TC>[num]里" serves as a template for the pair "<SC>有[num]里 → <TC>有[num]里". The two candidate pairs can therefore be merged, which reduces the storage space and storage cost of the template and improves conversion efficiency while preserving conversion accuracy. The merge can be assessed by computing coverage. Coverage is the proportion, among all sentences of a training text, of the simplified or traditional sentences that match the candidate pair; whether simplified or traditional sentences are counted is determined by the script of that text. Coverage is calculated as follows:
$$\mathrm{Coverage}(p_{SC} \rightarrow p_{TC}) = \frac{\lvert \text{matched sentences of } p_{SC}/p_{TC} \rvert}{\lvert \text{all sentences} \rvert}$$
Here $\mathrm{Coverage}(p_{SC} \rightarrow p_{TC})$ denotes the coverage of converting a simplified mixed phrase to a traditional mixed phrase, and $\lvert \text{matched sentences of } p_{SC}/p_{TC} \rvert$ is the number of simplified or traditional sentences that match the candidate pair. Whether $p_{SC}$ or $p_{TC}$ is used in the numerator depends on the script of the training text used to generate the template; for example, if the training text is in simplified Chinese, $p_{SC}$ is used. A larger coverage matches more instances. For example, the candidate pairs "<SC>有[num]里 → <TC>有[num]里" and "<SC>[num]里 → <TC>[num]里" are merged into the candidate pair "<SC>[num]里 → <TC>[num]里", which has a larger coverage than "<SC>有[num]里 → <TC>有[num]里". In this embodiment the candidate pair "<SC>[num]里 → <TC>[num]里" is therefore the more suitable template for simplified-traditional conversion.
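A small sketch of the coverage calculation follows. The four training sentences are invented for illustration and do not come from the patent; the function names, and the choice to treat "[num]" as a run of ASCII digits when matching, are likewise assumptions.

```python
import re

NUM_ID = "[num]"


def template_to_regex(template):
    """Turn a masked phrase such as '有[num]里' into a regex in which [num] matches a digit run."""
    literal_parts = template.split(NUM_ID)
    return re.compile("[0-9]+".join(re.escape(part) for part in literal_parts))


def coverage(template, sentences):
    """Fraction of sentences that contain at least one match of the template."""
    if not sentences:
        return 0.0
    pattern = template_to_regex(template)
    matched = sum(1 for sentence in sentences if pattern.search(sentence))
    return matched / len(sentences)


# Hypothetical simplified-Chinese training sentences (assumption, for illustration only).
training_sentences = ["小河长有40里", "这条路有15里", "走了3里", "今天下雨"]

cov_long = coverage("有[num]里", training_sentences)
cov_short = coverage("[num]里", training_sentences)
print(cov_long, cov_short)
```

On this toy data the shorter pattern matches every sentence the longer one matches plus one more, which is the sense in which the merged candidate "[num]里" has the larger coverage.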
S140: generate the simplified-traditional Chinese conversion template from the candidate pair with maximum coverage.
The conversion template generated in this embodiment can subsequently be used to perform simplified-traditional Chinese conversion quickly and accurately.
Embodiment 2
This embodiment provides a simplified-traditional Chinese conversion method for conversion between first and second mixed phrases, where the first and second mixed phrases comprise one-to-many characters and numbers, the first mixed phrase is one of a traditional mixed phrase and a simplified mixed phrase, and the second mixed phrase is the other. As shown in Fig. 3, the method comprises:
S310: obtain a first mixed phrase.
Before the first mixed phrase is obtained, the method may further comprise: receiving a first mixed phrase input by a user and judging whether it contains a one-to-many character and a number; if so, proceeding to S320; if not, stopping. The number may be a Chinese numeral or an Arabic numeral.
For ease of description, in this embodiment the first mixed phrase is a simplified mixed phrase containing a one-to-many character and a number, and the second mixed phrase is a traditional mixed phrase containing a one-to-many character and a number.
Specifically, in this embodiment the first mixed phrase is 小河长有40里多 ("the small river is a little over 40 li long"), written as "<SC>小河长有40里多", where "<SC>" marks a simplified mixed phrase and the character 里 is a one-to-many character. This mixed phrase is only an example, and the scope of protection of the application is not limited to it.
S320: replace the number in the first mixed phrase with a numeric identifier to obtain the first intermediate mixed phrase.
Continuing the example, the numeric identifier "[num]" replaces the number 40 in the first mixed phrase 小河长有40里多, so that the first mixed phrase becomes the first intermediate mixed phrase 小河长有[num]里多, written as "<SC>小河长有[num]里多". It should be noted that "[num]" is only an example of a numeric identifier, and the scope of protection of the application is not limited to it.
S330: look up, in the template generated in Embodiment 1, the second intermediate mixed phrase corresponding to the first intermediate mixed phrase.
The template generated in Embodiment 1 comprises a plurality of candidate pairs of first and second mixed phrases. In this embodiment the template contains the candidate pair "<SC>[num]里 → <TC>[num]里". The system matches the whole sentence by backward maximum matching. Specifically, starting from the last character of the sentence, the longest string is matched first, and characters are then removed from the front until the remaining phrase no longer contains a one-to-many character or a number. The end of the sentence is then moved forward by one character and the matching process restarts. That is, "<SC>小河长有[num]里多", "<SC>河长有[num]里多", "<SC>长有[num]里多" and so on are tried, down to "<SC>[num]里"; as soon as a match is found, the process stops automatically. It should be noted that other matching processes, such as one-to-one character matching, personal name matching and place name matching, may run at the same time as the one-to-many matching process, and the two do not affect each other. Because this matching runs in step with the other dictionary matching (it does not need a separate pass) and the template is stored in a hash table, matching is very efficient. Through this matching process, the second intermediate mixed phrase "<TC>小河長有[num]里多" corresponding to the first intermediate mixed phrase "<SC>小河长有[num]里多" is obtained, where "<TC>" marks a traditional mixed phrase.
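The backward maximum matching against a hash-table template might look like the sketch below. It is a simplified illustration, not the patented procedure: the names `tokenize` and `find_template_match` are hypothetical, the template is a plain Python `dict`, every start position is tried for each end position instead of stopping once the remainder lacks a number or one-to-many character, and the one-to-one conversion of the other characters (such as 长 to 長) is left out.

```python
NUM_ID = "[num]"


def tokenize(phrase):
    """Split a phrase into single characters, keeping the numeric identifier as one token."""
    tokens, i = [], 0
    while i < len(phrase):
        if phrase.startswith(NUM_ID, i):
            tokens.append(NUM_ID)
            i += len(NUM_ID)
        else:
            tokens.append(phrase[i])
            i += 1
    return tokens


def find_template_match(masked_sentence, template):
    """Try spans ending at the last token first; for each end, try the longest span first.
    Return (start, end, traditional_form) for the first span found in the template."""
    tokens = tokenize(masked_sentence)
    for end in range(len(tokens), 0, -1):   # move the end toward the front
        for start in range(0, end):         # drop tokens from the front
            key = "".join(tokens[start:end])
            if key in template:             # hash-table lookup
                return start, end, template[key]
    return None


template = {"[num]里": "[num]里"}  # simplified key -> traditional value
hit = find_template_match("小河长有[num]里多", template)
print(hit)
```

The returned span covers the tokens "[num]" and "里", so a caller would know that this 里 stays 里 in traditional Chinese instead of becoming 裡.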
S340: obtain the second mixed phrase corresponding to the first mixed phrase according to the numeric identifier and the second intermediate mixed phrase.
Specifically, as is known from the description of step S310, there is a correspondence between the numeric identifier "[num]" and the numeral "40". Therefore, in this step, the numeric identifier "[num]" in the second intermediate mixed phrase "<TC> river length has [num] li more than" obtained in step S330 can be replaced with the numeral "40", thereby obtaining the second mixed phrase "<TC> river length has 40 li more than" corresponding to the first mixed phrase "<SC> river length has 40 li more than" of step S310.
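Restoring the numeral can be illustrated with a short Python sketch. It assumes that the numerals removed earlier were kept in a list in their order of appearance; the function name `restore_numerals` and the sample strings are hypothetical.

```python
import re


def restore_numerals(intermediate, numerals):
    """Replace each "[num]" with the next stored numeral, left to right."""
    remaining = iter(numerals)
    return re.sub(r"\[num\]", lambda m: next(remaining), intermediate)


# Hypothetical second intermediate mixed phrase and the numeral stored for it.
print(restore_numerals("他在[num]年後回來", ["40"]))
```

Consuming the stored numerals in order keeps phrases with several numerals correct, provided the conversion does not reorder the identifiers.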
In summary, by using the pre-generated first Chinese mixed phrase-second Chinese mixed phrase candidate pairs, conversion between a first mixed phrase and a second mixed phrase that contain one-to-many characters and numerals can be completed quickly and accurately.
Embodiment 3
The present embodiment provides a system for generating a Simplified-Traditional Chinese conversion template, used for conversion between first and second mixed phrases. The first and second mixed phrases contain one-to-many characters and numerals; the first mixed phrase is one of a Traditional Chinese mixed phrase and a Simplified Chinese mixed phrase, and the second mixed phrase is the other. As shown in Figure 4, the system comprises:
A candidate pair acquisition module 410, configured to obtain first mixed phrase-second mixed phrase candidate pairs. For the specific function of the candidate pair acquisition module 410, see S110 of Embodiment 1.
An intermediate candidate pair extraction module 420, configured to extract intermediate candidate pairs from the first mixed phrase-second mixed phrase candidate pairs. For the specific function of the intermediate candidate pair extraction module 420, see S120 of Embodiment 1.
A maximum-coverage candidate pair acquisition module 430, configured to obtain, from the intermediate candidate pairs, the candidate pairs having the maximum coverage rate. For the specific function of the maximum-coverage candidate pair acquisition module 430, see S130 of Embodiment 1.
A template generation module 440, configured to generate the Simplified-Traditional Chinese conversion template using the candidate pairs having the maximum coverage rate.
Preferably, as shown in Figure 5, the candidate pair acquisition module 410 comprises:
A second mixed phrase acquiring unit 510, configured to obtain a second mixed phrase;
A first mixed phrase acquiring unit 520, configured to convert the second mixed phrase to obtain the first mixed phrase corresponding to the second mixed phrase;
A candidate pair composing unit 530, configured to combine the first mixed phrase and the second mixed phrase into a first mixed phrase-second mixed phrase candidate pair.
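The cooperation of units 520 and 530 can be pictured with a small Python sketch. It assumes, purely for illustration, that the second mixed phrase is the Traditional Chinese side and that unit 520 converts it with a character-by-character table lookup; the application does not state how the conversion is performed, and the function name `make_candidate_pair` and the table contents are hypothetical.

```python
def make_candidate_pair(second_phrase, char_map):
    """Convert the second mixed phrase and pair it with the result."""
    # Characters without a table entry (including those of "[num]") pass through unchanged.
    first_phrase = "".join(char_map.get(ch, ch) for ch in second_phrase)
    return first_phrase, second_phrase


traditional_to_simplified = {"後": "后", "來": "来"}  # tiny hypothetical table
print(make_candidate_pair("[num]年後", traditional_to_simplified))
```

Starting from the side on which a one-to-many character is already disambiguated makes the conversion deterministic, which is one plausible reason for deriving the first mixed phrase from the second rather than the reverse.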
Preferably, the second mixed phrase acquiring unit 510 is configured to:
obtain a first intermediate mixed phrase, the first intermediate mixed phrase containing a one-to-many character and a numeral;
replace the numeral in the first intermediate mixed phrase with a numeric identifier to obtain a second intermediate mixed phrase;
convert the second intermediate mixed phrase using the one-to-many character and/or the numeric identifier to obtain a third intermediate mixed phrase;
filter out, from the third intermediate mixed phrases, the mixed phrases that do not contain the one-to-many character and the numeric identifier, to obtain the second mixed phrase.
Preferably, the second mixed phrase acquiring unit 510 is configured to:
obtain a first intermediate mixed phrase, the first intermediate mixed phrase containing a one-to-many character and a numeral;
convert the first intermediate mixed phrase using the one-to-many character and/or the numeral to obtain a fourth intermediate mixed phrase;
replace the numeral in the fourth intermediate mixed phrase with a numeric identifier to obtain a fifth intermediate mixed phrase;
filter out, from the fifth intermediate mixed phrases, the mixed phrases that do not contain the one-to-many character and the numeric identifier, to obtain the second mixed phrase.
Preferably, the intermediate candidate pair extraction module 420 is configured to:
count the conversion frequency of each first mixed phrase-second mixed phrase candidate pair in a training text;
retain the first mixed phrase-second mixed phrase candidate pairs whose conversion frequency is greater than a first preset threshold;
determine whether the confidence of each retained first mixed phrase-second mixed phrase candidate pair is greater than a second preset threshold;
if so, take the first mixed phrase-second mixed phrase candidate pairs whose confidence is greater than the second preset threshold as the intermediate candidate pairs.
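The two-threshold filtering performed by module 420 could look roughly like the following Python sketch. The confidence measure is not defined in this section, so the sketch assumes it to be the share of occurrences of the source-side phrase that were converted to the given target; that definition, the function name `extract_intermediate_pairs`, and the sample data are all hypothetical.

```python
from collections import Counter


def extract_intermediate_pairs(observed_pairs, min_freq, min_conf):
    """Keep pairs whose frequency and (assumed) confidence exceed the thresholds."""
    pair_freq = Counter(observed_pairs)                    # conversion frequency per pair
    source_freq = Counter(src for src, _ in observed_pairs)
    kept = []
    for (src, tgt), freq in pair_freq.items():
        if freq <= min_freq:                               # first preset threshold
            continue
        confidence = freq / source_freq[src]               # assumed definition
        if confidence > min_conf:                          # second preset threshold
            kept.append((src, tgt))
    return kept


# Hypothetical observations gathered from a training text.
observed = [("[num]年后", "[num]年後")] * 9 + [("[num]年后", "[num]年后")]
print(extract_intermediate_pairs(observed, min_freq=2, min_conf=0.8))
```

Applying the frequency threshold first discards rare pairs before the confidence check, mirroring the order of the steps listed above.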
The Simplified-Traditional Chinese conversion template generated in the present embodiment can subsequently be used to perform Simplified-Traditional Chinese conversion quickly and accurately.
Embodiment 4
The present embodiment provides a Simplified-Traditional Chinese conversion system for conversion between first and second mixed phrases, where the first mixed phrase is one of a Simplified Chinese mixed phrase and a Traditional Chinese mixed phrase, the second mixed phrase is the other, and the first and second mixed phrases contain one-to-many characters and numerals. As shown in Figure 6, the system comprises:
A first mixed phrase acquisition module 610, configured to obtain a first mixed phrase. For the function of the first mixed phrase acquisition module 610, see step S310 of Embodiment 2; it is not repeated here.
A first intermediate mixed phrase acquisition module 620, configured to replace the numeral in the first mixed phrase with a numeric identifier to obtain a first intermediate mixed phrase. For the function of the first intermediate mixed phrase acquisition module 620, see step S320 of Embodiment 2; it is not repeated here.
A second intermediate mixed phrase lookup module 630, configured to look up, in the template generated according to the method of Embodiment 1, the second intermediate mixed phrase corresponding to the first intermediate mixed phrase. For the function of the second intermediate mixed phrase lookup module 630, see step S330 of Embodiment 2; it is not repeated here.
A second mixed phrase acquisition module 640, configured to obtain the second mixed phrase corresponding to the first mixed phrase according to the numeric identifier and the second intermediate mixed phrase. For the function of the second mixed phrase acquisition module 640, see step S340 of Embodiment 2; it is not repeated here.
Preferably, the second intermediate mixed phrase lookup module 630 is configured to look up the second intermediate mixed phrase in the template by a string matching method, using the first mixed phrase-second mixed phrase candidate pairs obtained in Embodiment 1 and the first intermediate mixed phrase.
Preferably, the second mixed phrase acquisition module 640 is configured to replace the numeric identifier in the second intermediate mixed phrase with the numeral, thereby obtaining the second mixed phrase corresponding to the first mixed phrase.
Preferably, the system further comprises a judging module 650, configured to receive a first mixed phrase input by a user, determine that the first mixed phrase contains a one-to-many character and a numeral, and output it to the first mixed phrase acquisition module 610. The numeral is a Chinese numeral or an Arabic numeral.
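The check performed by the judging module 650 can be sketched in Python as follows. The numeral pattern, the set of one-to-many characters, and the function name `is_mixed_phrase` are assumptions made for this example.

```python
import re

# Assumed numeral pattern covering Arabic digits and common Chinese numeral characters.
NUMERAL_RE = re.compile(r"[0-9]|[〇零一二三四五六七八九十百千万亿]")


def is_mixed_phrase(phrase, one_to_many_chars):
    """True if the phrase contains both a numeral and a one-to-many character."""
    has_numeral = NUMERAL_RE.search(phrase) is not None
    has_one_to_many = any(ch in one_to_many_chars for ch in phrase)
    return has_numeral and has_one_to_many


one_to_many = {"后", "里"}  # hypothetical subset of one-to-many characters
print(is_mixed_phrase("他在40年后回来", one_to_many))
```

Only input that passes this check needs to go through the template-based path; other input can be left to ordinary dictionary conversion.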
In summary, by using the template of pre-generated first Chinese mixed phrase-second Chinese mixed phrase candidate pairs, Simplified-Traditional Chinese conversion between a first mixed phrase and a second mixed phrase that contain one-to-many characters and numerals can be completed quickly and accurately.
The method for generating a Simplified-Traditional Chinese conversion template and the template-based Simplified-Traditional Chinese conversion method provided by the present application, together with their steps, can be implemented by one or more processing devices with data-processing capability, for example one or more computers, running computer-executable instructions (these computer-executable instructions embody the idea of the methods proposed in the present application). Such a processing device may comprise a storage medium storing the aforementioned computer-executable instructions and a central processing unit.
The system for generating a Simplified-Traditional Chinese conversion template and the template-based Simplified-Traditional Chinese conversion system of the present application may be one or more processing devices running the aforementioned computer-executable instructions. Each module in such a system may be a component of the processing device that has the corresponding function when the device runs the aforementioned computer-executable instructions.
Although the present application has been described with reference to exemplary embodiments, it should be understood that the terms used are illustrative and exemplary rather than restrictive. Since the present application can be embodied in various forms without departing from the spirit or essence of the invention, it should be understood that the above embodiments are not limited to any of the foregoing details but should be construed broadly within the spirit and scope defined by the appended claims. All changes and modifications falling within the scope of the claims or their equivalents are therefore intended to be covered by the appended claims.

Claims (20)

1. A method for generating a Simplified-Traditional Chinese conversion template, used for conversion between first and second mixed phrases, the first and second mixed phrases containing one-to-many characters and numerals, the first mixed phrase being one of a Traditional Chinese mixed phrase and a Simplified Chinese mixed phrase, and the second mixed phrase being the other of the Traditional Chinese mixed phrase and the Simplified Chinese mixed phrase, characterized by comprising:
obtaining first mixed phrase-second mixed phrase candidate pairs;
extracting intermediate candidate pairs from the first mixed phrase-second mixed phrase candidate pairs;
obtaining, from the intermediate candidate pairs, the candidate pairs having the maximum coverage rate;
generating the Simplified-Traditional Chinese conversion template using the candidate pairs having the maximum coverage rate.
2. The method according to claim 1, characterized in that the step of obtaining the first mixed phrase-second mixed phrase candidate pairs comprises:
obtaining a second mixed phrase;
converting the second mixed phrase to obtain the first mixed phrase corresponding to the second mixed phrase;
combining the first mixed phrase and the second mixed phrase into the first mixed phrase-second mixed phrase candidate pair.
3. The method according to claim 2, characterized in that the step of obtaining the second mixed phrase comprises:
obtaining a first intermediate mixed phrase, the first intermediate mixed phrase containing the one-to-many character and the numeral;
replacing the numeral in the first intermediate mixed phrase with the numeric identifier to obtain a second intermediate mixed phrase;
converting the second intermediate mixed phrase using the one-to-many character and/or the numeric identifier to obtain a third intermediate mixed phrase;
filtering out, from the third intermediate mixed phrases, the mixed phrases that do not contain the one-to-many character and the numeric identifier, to obtain the second mixed phrase.
4. The method according to claim 2, characterized in that the step of obtaining the second mixed phrase comprises:
obtaining a first intermediate mixed phrase, the first intermediate mixed phrase containing the one-to-many character and the numeral;
converting the first intermediate mixed phrase using the one-to-many character and/or the numeral to obtain a fourth intermediate mixed phrase;
replacing the numeral in the fourth intermediate mixed phrase with the numeric identifier to obtain a fifth intermediate mixed phrase;
filtering out, from the fifth intermediate mixed phrases, the mixed phrases that do not contain the one-to-many character and the numeric identifier, to obtain the second mixed phrase.
5. The method according to claim 1, characterized in that the step of extracting intermediate candidate pairs from the first mixed phrase-second mixed phrase candidate pairs comprises:
counting the conversion frequency of each first mixed phrase-second mixed phrase candidate pair in a training text;
retaining the first mixed phrase-second mixed phrase candidate pairs whose conversion frequency is greater than a first preset threshold;
determining whether the confidence of each retained first mixed phrase-second mixed phrase candidate pair is greater than a second preset threshold;
if so, taking the first mixed phrase-second mixed phrase candidate pairs whose confidence is greater than the second preset threshold as the intermediate candidate pairs.
6. A Simplified-Traditional Chinese conversion method, used for conversion between first and second mixed phrases, the first and second mixed phrases containing one-to-many characters and numerals, the first mixed phrase being one of a Traditional Chinese mixed phrase and a Simplified Chinese mixed phrase, and the second mixed phrase being the other of the Traditional Chinese mixed phrase and the Simplified Chinese mixed phrase, characterized by comprising:
obtaining a first mixed phrase;
replacing the numeral in the first mixed phrase with a numeric identifier to obtain a first intermediate mixed phrase;
looking up, in the template generated according to the method of any one of claims 1-5, the second intermediate mixed phrase corresponding to the first intermediate mixed phrase;
obtaining the second mixed phrase corresponding to the first mixed phrase according to the numeric identifier and the second intermediate mixed phrase.
7. The method according to claim 6, characterized in that the second intermediate mixed phrase corresponding to the first intermediate mixed phrase is looked up in the template by a string matching method, using the first mixed phrase-second mixed phrase candidate pairs and the first intermediate mixed phrase.
8. The method according to claim 6, characterized in that the step of obtaining the second mixed phrase corresponding to the first mixed phrase according to the numeric identifier and the second intermediate mixed phrase comprises:
replacing the numeric identifier in the second intermediate mixed phrase with the numeral, thereby obtaining the second mixed phrase corresponding to the first mixed phrase.
9. The method according to claim 6, characterized in that, before the step of obtaining the first mixed phrase, the method further comprises:
a step of receiving a first mixed phrase input by a user and determining that the first mixed phrase contains the one-to-many character and the numeral.
10. The method according to claim 6, characterized in that the numeral is a Chinese numeral or an Arabic numeral.
11. A system for generating a Simplified-Traditional Chinese conversion template, used for conversion between first and second mixed phrases, the first and second mixed phrases containing one-to-many characters and numerals, the first mixed phrase being one of a Traditional Chinese mixed phrase and a Simplified Chinese mixed phrase, and the second mixed phrase being the other of the Traditional Chinese mixed phrase and the Simplified Chinese mixed phrase, characterized by comprising:
a candidate pair acquisition module, configured to obtain first mixed phrase-second mixed phrase candidate pairs;
an intermediate candidate pair extraction module, configured to extract intermediate candidate pairs from the first mixed phrase-second mixed phrase candidate pairs;
a maximum-coverage candidate pair acquisition module, configured to obtain, from the intermediate candidate pairs, the candidate pairs having the maximum coverage rate;
a template generation module, configured to generate the Simplified-Traditional Chinese conversion template using the candidate pairs having the maximum coverage rate.
12. The system according to claim 11, characterized in that the candidate pair acquisition module comprises:
a second mixed phrase acquiring unit, configured to obtain a second mixed phrase;
a first mixed phrase acquiring unit, configured to convert the second mixed phrase to obtain the first mixed phrase corresponding to the second mixed phrase;
a candidate pair composing unit, configured to combine the first mixed phrase and the second mixed phrase into the first mixed phrase-second mixed phrase candidate pair.
13. The system according to claim 12, characterized in that the second mixed phrase acquiring unit is configured to:
obtain a first intermediate mixed phrase, the first intermediate mixed phrase containing the one-to-many character and the numeral;
replace the numeral in the first intermediate mixed phrase with the numeric identifier to obtain a second intermediate mixed phrase;
convert the second intermediate mixed phrase using the one-to-many character and/or the numeric identifier to obtain a third intermediate mixed phrase;
filter out, from the third intermediate mixed phrases, the mixed phrases that do not contain the one-to-many character and the numeric identifier, to obtain the second mixed phrase.
14. The system according to claim 12, characterized in that the second mixed phrase acquiring unit is configured to:
obtain a first intermediate mixed phrase, the first intermediate mixed phrase containing the one-to-many character and the numeral;
convert the first intermediate mixed phrase using the one-to-many character and/or the numeral to obtain a fourth intermediate mixed phrase;
replace the numeral in the fourth intermediate mixed phrase with the numeric identifier to obtain a fifth intermediate mixed phrase;
filter out, from the fifth intermediate mixed phrases, the mixed phrases that do not contain the one-to-many character and the numeric identifier, to obtain the second mixed phrase.
15. The system according to claim 11, characterized in that the intermediate candidate pair extraction module is configured to:
count the conversion frequency of each first mixed phrase-second mixed phrase candidate pair in a training text;
retain the first mixed phrase-second mixed phrase candidate pairs whose conversion frequency is greater than a first preset threshold;
determine whether the confidence of each retained first mixed phrase-second mixed phrase candidate pair is greater than a second preset threshold;
if so, take the first mixed phrase-second mixed phrase candidate pairs whose confidence is greater than the second preset threshold as the intermediate candidate pairs.
16. A Simplified-Traditional Chinese conversion system, used for conversion between first and second mixed phrases, the first and second mixed phrases containing one-to-many characters and numerals, the first mixed phrase being one of a Traditional Chinese mixed phrase and a Simplified Chinese mixed phrase, and the second mixed phrase being the other of the Traditional Chinese mixed phrase and the Simplified Chinese mixed phrase, characterized by comprising:
a first mixed phrase acquisition module, configured to obtain a first mixed phrase;
a first intermediate mixed phrase acquisition module, configured to replace the numeral in the first mixed phrase with a numeric identifier to obtain a first intermediate mixed phrase;
a second intermediate mixed phrase lookup module, configured to look up, in the template generated by the system of any one of claims 11-15, the second intermediate mixed phrase corresponding to the first intermediate mixed phrase;
a second mixed phrase acquisition module, configured to obtain the second mixed phrase corresponding to the first mixed phrase according to the numeric identifier and the second intermediate mixed phrase.
17. The system according to claim 16, characterized in that the lookup module is configured to:
look up, in the template and by a string matching method, the second intermediate mixed phrase corresponding to the first intermediate mixed phrase, using the first mixed phrase-second mixed phrase candidate pairs and the first intermediate mixed phrase.
18. The system according to claim 16, characterized in that the second mixed phrase acquisition module is configured to:
replace the numeric identifier in the second intermediate mixed phrase with the numeral, thereby obtaining the second mixed phrase corresponding to the first mixed phrase.
19. The system according to claim 16, characterized by further comprising a judging module configured to:
receive a first mixed phrase input by a user and determine that the first mixed phrase contains the one-to-many character and the numeral.
20. The system according to claim 16, characterized in that the numeral is a Chinese numeral or an Arabic numeral.
CN201210284530.4A 2012-08-10 2012-08-10 Methods and systems for generating simplified and traditional Chinese conversion template and realizing simplified and traditional Chinese conversion based on template Active CN103577396B (en)


Publications (2)

Publication Number Publication Date
CN103577396A true CN103577396A (en) 2014-02-12
CN103577396B CN103577396B (en) 2017-04-12

Family

ID=50049205

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210284530.4A Active CN103577396B (en) 2012-08-10 2012-08-10 Methods and systems for generating simplified and traditional Chinese conversion template and realizing simplified and traditional Chinese conversion based on template

Country Status (1)

Country Link
CN (1) CN103577396B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050010391A1 (en) * 2003-07-10 2005-01-13 International Business Machines Corporation Chinese character / Pin Yin / English translator
CN101859295A (en) * 2009-04-07 2010-10-13 英业达股份有限公司 System and method for converting simplified Chinese character/word and traditional Chinese character/word with labels and prompts


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
罗菲: "海峡两岸微别字形研究", 《中国优秀博硕士学位论文全文数据库(硕士) 哲学与人文科学辑》 *

Also Published As

Publication number Publication date
CN103577396B (en) 2017-04-12


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant