CN103885939A

CN103885939A - Uyghur-Chinese bi-directional translation memory system construction method

Info

Publication number: CN103885939A
Application number: CN201210553917.5A
Authority: CN
Inventors: 塔拉甫·加盘; 王天军; 邹帅
Original assignee: XINJIANG INFORMATION INDUSTRY Co Ltd
Current assignee: XINJIANG INFORMATION INDUSTRY Co Ltd
Priority date: 2012-12-19
Filing date: 2012-12-19
Publication date: 2014-06-25

Abstract

The invention discloses a method for constructing a Uyghur-Chinese bidirectional translation memory system. The method comprises (1) memory bank structure and management, (2) Uyghur-Chinese sentence alignment and storage, (3) translation memory retrieval, and (4) a translation editing environment. The method improves translation efficiency and quality.

Description

Method for constructing a Uyghur-Chinese bidirectional translation memory system

Technical field

The present invention relates to translation memory technology, which is widely used in machine translation systems, and in particular to a method for constructing a Uyghur-Chinese bidirectional translation memory system.

Background technology

With the development of information technology, communication barriers between speakers of different languages have become increasingly prominent. Although machine translation helps in this respect, it still faces considerable difficulties. At present, machine translation systems mainly take one of two approaches: rule-based (relying mainly on linguistic knowledge) and corpus-based (relying mainly on examples).

Because Uyghur and Chinese belong to different language families, in-depth linguistic analysis of word segmentation, morphology, structure, ambiguous words, sentence syntax, and semantics is difficult to implement. Uyghur-Chinese translation is therefore currently mainly corpus-based. Although this approach achieves good results, building a Uyghur-Chinese corpus involves many factors, and the corpus content can hardly cover every domain, so translation quality is hard to guarantee. Although machine translation performance is not yet ideal, a memory bank that assists translation is still expected to be an effective means of improving working efficiency.

Translation memory technology was proposed in view of the shortcomings of rule-based and corpus-based translation, and because in specialized domains (scientific and technical literature, product descriptions, user manuals, etc.) vocabulary and sentences are relatively fixed and repeated sentences are common. Translation memory can be regarded as the reuse of existing resources: translating a new text reuses the translator's earlier translations, and the translator also takes part in the translation, so the quality of the final translation is guaranteed to some extent. Translation memory technology is widely applied abroad, and many computer-aided translation products such as Transit (STAR) and Trados have appeared. Computer-aided translation memory technology has also developed domestically, with products such as Yaxin CAT. Therefore, to meet the needs of Uyghur information processing, to help translators whose mother tongue is Uyghur, and to improve their translation efficiency and quality, developing a translation memory tool is of great significance.

Summary of the invention

The object of the present invention is to provide a method for constructing a Uyghur-Chinese bidirectional translation memory system that improves translation efficiency and translation quality.

The object of the present invention is achieved as follows. The method for constructing a Uyghur-Chinese bidirectional translation memory system comprises:

(1) Memory bank structure and management: the organization and storage of the information in the memory bank is regarded as a combination of many translation memory units, and can also be regarded as a parallel corpus. The memory bank stores previously translated example sentences, and a Uyghur-Chinese memory bank aligned at the sentence level is adopted. Management of the memory bank includes word search, adding sentences, deleting sentences, and importing and exporting the memory bank.

(2) Uyghur-Chinese sentence alignment and storage: all Uyghur and Chinese sentences collected in the memory bank are encoded in XML. The translation memory is stored in the form of "translation units", in which each Uyghur sentence corresponds exactly to a Chinese sentence. Each aligned sentence pair is marked by a <tu> ... </tu> element and identified by the id under <tu>.

(3) Translation memory retrieval: in a translation memory system, the closer the retrieved example is to the sentence to be translated, the better the translation quality. The minimum edit distance method is used to compute the similarity between the sentence to be translated and the existing sentences. The Levenshtein distance (LD) algorithm gives the number of word-level edit operations needed to match the two sentences, and a fuzzy-matching formula then yields the similarity between the source sentence and the target sentence.

(4) Translation editing environment: before translation, the source text of a document in a supported format (mainly .txt and .doc) is imported through an internal filter. The system then performs sentence splitting and word segmentation, computes sentence similarity by fuzzy matching, looks up the words in the existing vocabulary by binary search, and displays each word with its translation in a word list. After translation, the translation is exported in the original document format.

For a sentence to be translated that the translator enters, the system searches the translation memory bank and returns exactly matching or similar sentences for the translator to choose from. How to find similar sentences in the memory bank is the key issue. Here the edit distance, which is widely used in natural language processing, is used to compute the similarity between the input sentence and the sentences in the memory bank. During translation, the translation memory system uses the similarity computation to search the memory bank automatically for identical or partially similar sentences and recommends reference translations to the translator, who decides whether to accept, edit, or reject them. Meanwhile, the memory bank keeps learning in the background and automatically stores the source text and translation of each new sentence.

The present invention designs and implements a translation memory system model. In the memory bank design, Uyghur and Chinese sentences are stored as exactly aligned sentence pairs, and the memory bank supports query, deletion, and update operations. The key technique is computing the similarity of sentences in the memory bank, which is implemented with the edit distance commonly used in natural language processing. The translations of sentences whose similarity exceeds a threshold are offered to the user as references. Results show that this bidirectional translation memory system works well in translation. The present invention improves translation efficiency and translation quality.

Brief description of the drawings

The invention is further described below with reference to the accompanying drawing.

Fig. 1 is a schematic diagram of the Uyghur-Chinese translation memory system model.

Embodiment

A method for constructing a Uyghur-Chinese bidirectional translation memory system comprises the following.

(1) Memory bank structure and management. Throughout the memory bank, the organization and storage of information can be regarded as a combination of many translation memory units, and also as a parallel corpus. The memory bank stores previously translated example sentences. The design adopts a Uyghur-Chinese memory bank aligned at the sentence level. Once designed, the memory bank must also be well managed; management includes word search, adding sentences, deleting sentences, and importing and exporting the memory bank.

(2) Uyghur-Chinese sentence alignment and storage. All Uyghur and Chinese sentences collected in the memory bank are encoded in XML. The translation memory is stored in the form of "translation units", in which each Uyghur sentence corresponds exactly to a Chinese sentence. Each aligned sentence pair is marked by a <tu> ... </tu> element and identified by the id under <tu>.

(3) Translation memory retrieval. In a translation memory system, the closer the retrieved example is to the sentence to be translated, the better the translation quality. Sentence similarity computation is one of the key techniques of a translation memory system, so it directly affects the system's efficiency and quality. Translation memory technology currently uses similarity methods based on character strings and methods based on linguistic knowledge. Considering the differences and complexity of Uyghur and Chinese sentences in structure, semantics, and morphology, this method uses the minimum edit distance to compute the similarity between the sentence to be translated and the existing sentences. The Levenshtein distance (LD) algorithm gives the number of word-level edit operations needed to match the two sentences, and a fuzzy-matching formula then yields the similarity between the source sentence and the target sentence.

(4) Translation editing environment. The translation editing environment is the environment in which the translator carries out translation; in this system, translation is carried out within the system itself. Before translation, the source text of a document in a supported format (mainly .txt and .doc) is imported through an internal filter. The system then performs sentence splitting and word segmentation, computes sentence similarity by fuzzy matching, looks up the words in the existing vocabulary by binary search, and displays each word with its translation in a word list. After translation, the translation is exported in the original document format.
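The vocabulary lookup by binary search mentioned for the editing environment can be sketched as follows. This is a minimal illustration, not the patented implementation: the glossary entries, the names `GLOSSARY`, `lookup`, and `word_list`, and the use of English glosses in place of Uyghur translations are all assumptions made for the example.

```python
from bisect import bisect_left

# Hypothetical glossary of (source word, translation) pairs, kept sorted by
# source word. English glosses stand in for Uyghur translations here; the
# entries are illustrative and are not taken from the patent.
GLOSSARY = sorted([
    ("翻译", "translation"),
    ("系统", "system"),
    ("句子", "sentence"),
    ("记忆", "memory"),
])
KEYS = [word for word, _ in GLOSSARY]


def lookup(word):
    """Return the translation of `word`, or None if it is not in the glossary."""
    i = bisect_left(KEYS, word)  # binary search over the sorted keys
    if i < len(KEYS) and KEYS[i] == word:
        return GLOSSARY[i][1]
    return None


def word_list(tokens):
    """Pair each token of an already segmented sentence with its translation."""
    return [(token, lookup(token)) for token in tokens]
```

Binary search needs the vocabulary to stay sorted, so a real system would presumably keep the sort order on insertion or rebuild the key list after an import.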

As shown in Fig. 1, the text to be translated is first split into sentences. Each sentence is then extracted in turn, and its similarity to the sentences in the memory bank is computed. The sentence with the highest similarity is edited by the translator, and the translation result is then output.

Table 1 below shows the structure of the memory bank. Once designed, the memory bank must also be well managed; management includes word search, adding sentences, deleting sentences, and importing and exporting the memory bank.
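The workflow of Fig. 1 can be sketched roughly as below. Several points are assumptions rather than details from the patent: the fuzzy-match formula (1 minus the edit distance divided by the longer sentence length), the threshold value 0.6, whitespace tokenization, the punctuation-based sentence splitter, and the names `MEMORY`, `split_sentences`, `similarity`, and `best_match`. English sentences stand in for the Uyghur side of the hypothetical memory.

```python
import re

# Hypothetical memory of (source sentence, stored translation) pairs.
MEMORY = [
    ("open the file menu", "打开文件菜单"),
    ("close the window", "关闭窗口"),
]


def split_sentences(text):
    """Naive splitter on sentence-final punctuation (Latin, CJK, Arabic script)."""
    parts = re.findall(r"[^.!?。！？؟]+[.!?。！？؟]?", text)
    return [p.strip() for p in parts if p.strip()]


def edit_distance(a, b):
    """Word-level Levenshtein distance, keeping only two rows of the matrix."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def similarity(a, b):
    """ASSUMED fuzzy-match score in [0, 1]: 1 - distance / longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def best_match(sentence, memory, threshold=0.6):
    """Return (score, source, target) for the closest unit, or None if too dissimilar."""
    tokens = sentence.split()
    best = None
    for source, target in memory:
        score = similarity(tokens, source.split())
        if best is None or score > best[0]:
            best = (score, source, target)
    if best is not None and best[0] >= threshold:
        return best
    return None
```

With this hypothetical memory, `best_match("open the edit menu", MEMORY)` proposes the stored translation of "open the file menu", which the translator would then edit by hand. Chinese input would need a word segmenter before the comparison, since whitespace splitting does not separate Chinese words.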

Table 1
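The management operations listed above (word search, adding and deleting sentences) might look like the following in code. This is only a sketch: the field names `uid`, `uyghur`, and `chinese` and the class names are hypothetical and do not reproduce the columns of Table 1; import and export are left out here.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Unit:
    """One translation unit: an aligned sentence pair (field names are hypothetical)."""
    uid: int
    uyghur: str
    chinese: str


@dataclass
class TranslationMemory:
    """In-memory store of translation units, keyed by unit id."""
    units: Dict[int, Unit] = field(default_factory=dict)
    next_id: int = 1

    def add(self, uyghur: str, chinese: str) -> int:
        """Store a new aligned pair and return its id."""
        uid = self.next_id
        self.units[uid] = Unit(uid, uyghur, chinese)
        self.next_id += 1
        return uid

    def delete(self, uid: int) -> bool:
        """Remove a unit; return False if the id is unknown."""
        return self.units.pop(uid, None) is not None

    def search_word(self, word: str) -> List[Unit]:
        """Return every unit whose Uyghur or Chinese side contains `word`."""
        return [u for u in self.units.values()
                if word in u.uyghur or word in u.chinese]
```

Because ids are never reused after a deletion, an id stored elsewhere (for example in an exported file) keeps pointing at the same sentence pair or at nothing.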

The following example shows how aligned Uyghur-Chinese sentences are stored in the memory bank. It contains the Uyghur language tag "ug-CN", the Chinese language tag "zh-Hans", the Uyghur sentence, and the aligned Chinese sentence.
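A translation unit of this kind could be written and read back as sketched below. The `<tu>` element, its `id`, and the language tags "ug-CN" and "zh-Hans" come from the description; the child elements `<tuv>` and `<seg>` and the `xml:lang` attribute follow the common TMX convention and are an assumption, as are the function names `make_unit` and `read_unit`.

```python
import xml.etree.ElementTree as ET

# Qualified name of the standard xml:lang attribute as ElementTree spells it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def make_unit(uid, uyghur, chinese):
    """Build one <tu> element holding an aligned Uyghur/Chinese sentence pair."""
    tu = ET.Element("tu", {"id": str(uid)})
    for lang, text in (("ug-CN", uyghur), ("zh-Hans", chinese)):
        # <tuv> and <seg> are assumed here, borrowed from the TMX layout.
        tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
        ET.SubElement(tuv, "seg").text = text
    return tu


def read_unit(tu):
    """Return (id, {language tag: sentence}) for one <tu> element."""
    pairs = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.findall("tuv")}
    return int(tu.get("id")), pairs
```

Serializing a unit with `ET.tostring(make_unit(1, uyghur_sentence, chinese_sentence), encoding="unicode")` gives a `<tu id="1">` element with one `<tuv>` per language, and `read_unit` recovers the pair from the parsed element, which is the round trip that import and export of the memory bank would rely on.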

Table 2 shows the sentence matrix. As required by the similarity formula, the number of matching operations d[n, m] that the two sentences need is computed first. The steps are as follows.

Step 1: Obtain the length n of the source sentence and the length m of the target sentence. If either length is 0, return the other length.

Step 2: Initialize a matrix d of size (n+1) × (m+1), in which the first row and the first column increase from 0 up to the corresponding sentence length.

Step 3: Scan each position (i, j) of the matrix, with i and j starting from 1. If s[i] equals t[j] (s[i] is the i-th word of the source sentence and t[j] is the j-th word of an existing sentence in the memory bank), the cost is 0; otherwise it is 1. The value of d[i, j] is the minimum of d[i-1, j] + 1 (the value above plus 1), d[i, j-1] + 1 (the value to the left plus 1), and d[i-1, j-1] + cost (the value diagonally above and to the left plus 0 or 1).

Step 4: After the above steps, the value d[n, m] in the lower-right corner is the distance between the two strings.

Table 2
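The matrix computation can be sketched as follows. The sketch uses the standard Levenshtein recurrence, in which an insertion or deletion always costs 1 and only the diagonal step depends on whether the two words match; the function names `distance_matrix` and `levenshtein` are illustrative.

```python
def distance_matrix(s, t):
    """Fill the (n+1) x (m+1) edit-distance matrix for sequences s and t."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # first column: 0..n (delete every word of s)
    for j in range(m + 1):
        d[0][j] = j  # first row: 0..m (insert every word of t)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Python sequences are 0-based, so word i of s is s[i - 1].
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # value above: deletion
                d[i][j - 1] + 1,         # value to the left: insertion
                d[i - 1][j - 1] + cost,  # diagonal: match or substitution
            )
    return d


def levenshtein(s, t):
    """Return d[n][m], the lower-right cell of the matrix."""
    return distance_matrix(s, t)[len(s)][len(t)]
```

The functions accept any sequences, so passing lists of words gives the word-level distance used for sentences, while passing plain strings gives the character-level distance. An empty input needs no special case, because the initialized first row and column already hold the other sequence's length.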

Claims

1. A method for constructing a Uyghur-Chinese bidirectional translation memory system, comprising: (1) memory bank structure and management: the organization and storage of the information in the memory bank is regarded as a combination of many translation memory units, and can also be regarded as a parallel corpus; the memory bank stores previously translated example sentences, and a Uyghur-Chinese memory bank aligned at the sentence level is adopted; management of the memory bank includes word search, adding sentences, deleting sentences, and importing and exporting the memory bank; (2) Uyghur-Chinese sentence alignment and storage: all Uyghur and Chinese sentences collected in the memory bank are encoded in XML; the translation memory is stored in the form of "translation units", in which each Uyghur sentence corresponds exactly to a Chinese sentence; each aligned sentence pair is marked by a <tu> ... </tu> element and identified by the id under <tu>; (3) translation memory retrieval: in the translation memory system, the closer the retrieved example is to the sentence to be translated, the better the translation quality; the minimum edit distance method is used to compute the similarity between the sentence to be translated and the existing sentences; the Levenshtein distance algorithm gives the number of word-level edit operations needed to match the two sentences, and a fuzzy-matching formula then yields the similarity between the source sentence and the target sentence; (4) translation editing environment: before translation, the source text of a document in a supported format is imported through an internal filter; the system performs sentence splitting and word segmentation, computes sentence similarity by fuzzy matching, looks up the words in the existing vocabulary by binary search, and displays each word with its translation in a word list; after translation, the translation is exported in the original document format.