CN103885939A - Uyghur-Chinese bi-directional translation memory system construction method - Google Patents

Uyghur-Chinese bi-directional translation memory system construction method

Info

Publication number
CN103885939A
Authority
CN
China
Prior art keywords
sentence
translation
chinese
data base
dimension
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210553917.5A
Other languages
Chinese (zh)
Inventor
塔拉甫·加盘
王天军
邹帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XINJIANG INFORMATION INDUSTRY Co Ltd
Original Assignee
XINJIANG INFORMATION INDUSTRY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by XINJIANG INFORMATION INDUSTRY Co Ltd
Priority to CN201210553917.5A
Publication of CN103885939A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a method for constructing a Uyghur-Chinese bidirectional translation memory system. The method comprises (1) translation memory structure and management, (2) alignment and storage of Uyghur and Chinese sentences, (3) translation memory retrieval, and (4) a translation editing environment. The method improves translation efficiency and quality.

Description

Method for constructing a Uyghur-Chinese bidirectional translation memory system
Technical field
The present invention relates to translation memory technology, which is widely used in machine translation systems, and in particular to a method for constructing a Uyghur-Chinese bidirectional translation memory system.
Background
With the development of information technology, the communication barrier between speakers of different languages has become increasingly prominent. Although machine translation helps in this respect, it still faces serious difficulties. At present, machine translation systems mainly follow two approaches: rule-based methods (relying mainly on linguistic knowledge) and corpus-based methods (relying mainly on examples).
Because Uyghur and Chinese do not belong to the same language family, deep linguistic analysis is hard to achieve in areas such as word segmentation, morphology, structure, ambiguous words, and the syntactic and semantic structure of sentences. Chinese-Uyghur translation is therefore currently mainly corpus-based. Although this works fairly well, building a Uyghur-Chinese corpus involves many factors, and a corpus can hardly cover every domain, so translation quality is difficult to guarantee. Although machine translation performance is not yet ideal, a translation memory used as a translation aid is still expected to be an effective means of raising productivity.
Translation memory technology was proposed in view of the shortcomings of rule-based and corpus-based translation, and because in specialized domains (scientific literature, product descriptions, user manuals, and so on) vocabulary and sentences are relatively fixed and repeated sentences are common. Translation memory can also be regarded as the reuse of existing resources: translating a new text reuses the translator's earlier translations, and the translator also takes part in the translation, so the quality of the final translation is guaranteed to a certain extent. Translation memory technology is widely used abroad, where many computer-aided translation products such as Transit (STAR) and Trados have appeared. It has also developed to some extent in China, where computer-aided translation software such as Yaxin CAT has appeared. Therefore, to meet the needs of Uyghur information processing, to help translators whose mother tongue is Uyghur, and to improve their translation efficiency and quality, developing a translation memory tool is of great significance.
Summary of the invention
The object of the present invention is to provide a method for constructing a Uyghur-Chinese bidirectional translation memory system that improves translation efficiency and translation quality.
The object of the present invention is achieved by a method for constructing a Uyghur-Chinese bidirectional translation memory system, comprising:
(1) Translation memory structure and management: the organization and storage of all information is regarded as a combination of many translation memory units, and can also be regarded as a parallel corpus. The translation memory stores previously translated example sentences, and a Uyghur-Chinese translation memory aligned at the sentence level is adopted. Management operations on the translation memory include word search, adding sentences, deleting sentences, import, and export.
(2) Aligned storage of Uyghur-Chinese sentences: all Uyghur-Chinese sentences collected in the translation memory are encoded in XML. The translation memory is stored in the form of "translation units", in which each Uyghur sentence corresponds exactly to a Chinese sentence, and each pair of corresponding sentences is described by the id under the sentence-pair tag <tu> ... </tu>.
(3) Translation memory retrieval: in a translation memory system, the closer the retrieved example is to the sentence to be translated, the better the translation quality. The minimum edit distance method is used to compute the similarity between the sentence to be translated and existing sentences: the Levenshtein distance (LD) algorithm gives the number of words that must be edited to match the two sentences, and a fuzzy matching formula then gives the similarity between the source sentence and the target sentence.
(4) Translation editing environment: before translation, an internal filter imports the source text from a document in a supported format (mainly .txt and .doc). The system then performs sentence segmentation and word segmentation, computes sentence similarity by fuzzy matching, looks up words in the existing vocabulary by binary search and shows each word and its translation in a word list, and after translation exports the translation in the format of the original document.
For a sentence to be translated that the translator inputs, the system searches the translation memory and returns exactly matching or similar sentences for the translator to choose from. How to find similar sentences in the translation memory is the key issue; here the edit distance commonly used in natural language processing is used to compute the similarity between the input sentence and the sentences in the translation memory. During translation, the translation memory system uses similarity computation to automatically search the memory for identical or partly similar sentences and recommends reference translations to the translator, who decides whether to accept, edit, or reject them. Meanwhile the translation memory keeps learning in the background, automatically storing the source text and translation of new sentences.
The present invention designs and implements a translation memory system model. In the memory design, Uyghur and Chinese sentences are stored with exact sentence-to-sentence alignment, and the memory supports query, deletion, and update operations. The key technology is sentence similarity in the translation memory, which is implemented with the edit distance commonly used in natural language processing; the translations of sentences whose similarity exceeds a threshold are offered to the user as translation references. Results show that this bidirectional translation memory system works well in translation. The present invention improves translation efficiency and translation quality.
Brief description of the drawings
The invention is further described below with reference to the accompanying drawing.
Fig. 1 is a schematic diagram of the Uyghur-Chinese translation memory system model.
Detailed description of the embodiments
A method for constructing a Uyghur-Chinese bidirectional translation memory system comprises the following.
(1) Translation memory structure and management. Across the whole translation memory, the organization and storage of all information can be regarded as a combination of many translation memory units, and also as a parallel corpus. The translation memory stores previously translated example sentences. The design adopts a Uyghur-Chinese translation memory aligned at the sentence level. Once designed, the translation memory must also be managed well, which includes word search, adding sentences, deleting sentences, import, export, and so on.
(2) Aligned storage of Uyghur-Chinese sentences. All Uyghur-Chinese sentences collected in the translation memory are encoded in XML. The translation memory is stored in the form of "translation units", in which each Uyghur sentence corresponds exactly to a Chinese sentence. Each pair of corresponding sentences is described by the id under the sentence-pair tag <tu> ... </tu>.
(3) Translation memory retrieval. In a translation memory system, the closer the retrieved example is to the sentence to be translated, the better the translation quality. Sentence similarity computation is one of the key technologies in a translation memory system, so it directly affects the system's efficiency and quality. Translation memory technology currently uses similarity measures based on character strings and measures based on linguistic knowledge. Considering the differences between Uyghur and Chinese sentences in structure, semantics, morphology, and complexity, the minimum edit distance method is adopted here to compute the similarity between the sentence to be translated and existing sentences. The Levenshtein distance (LD) algorithm gives the number of words that must be edited to match the two sentences, and a fuzzy matching formula then gives the similarity between the source sentence and the target sentence.
(4) Translation editing environment. The translation editing environment is the environment in which the translator translates; in this system, translation is carried out inside the system. Before translation, an internal filter imports the source text from a document in a supported format (mainly .txt and .doc). The system then performs sentence segmentation and word segmentation, computes sentence similarity by fuzzy matching, looks up words in the existing vocabulary by binary search and shows each word and its translation in a word list, and after translation exports the translation in the format of the original document.
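The vocabulary lookup by binary search mentioned in item (4) can be sketched as follows. This is a minimal illustration, not the patented implementation: the glossary contents and the names `GLOSSARY`, `lookup`, and `word_list` are hypothetical, and the vocabulary is assumed to be kept sorted by source term.

```python
from bisect import bisect_left

# Hypothetical glossary of (source term, translation) pairs.
# Binary search requires the entries to be sorted by source term.
GLOSSARY = sorted([
    ("memory", "记忆"),
    ("sentence", "句子"),
    ("translation", "翻译"),
])
TERMS = [term for term, _ in GLOSSARY]


def lookup(word):
    """Return the stored translation of `word`, or None if it is absent."""
    i = bisect_left(TERMS, word)  # leftmost position where `word` could sit
    if i < len(TERMS) and TERMS[i] == word:
        return GLOSSARY[i][1]
    return None


def word_list(words):
    """Pair each segmented word with its glossary translation (None if missing)."""
    return [(w, lookup(w)) for w in words]
```

Each lookup costs O(log n) comparisons, which is why a sorted vocabulary is a natural fit when a word list has to be filled in for every segmented sentence.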
As shown in Fig. 1, the text to be translated is split into sentences; each sentence is then extracted in turn and its similarity to the sentences in the translation memory is computed. The sentence with the highest similarity is edited by a human, and the translation result is then output.
Table 1 below shows the structure of the translation memory. Once the translation memory has been designed, it must also be managed well, which includes word search, adding sentences, deleting sentences, import, export, and so on.
Table 1
[Table 1 image not reproduced: translation memory structure table]
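The management operations listed above (word search, adding and deleting sentences, import, export) can be sketched with a small in-memory store. Since the actual fields of Table 1 are not available in this text, the layout below (an integer id mapped to a Uyghur/Chinese sentence pair) and all names are assumptions made for illustration only.

```python
class TranslationMemory:
    """In-memory store of aligned sentence pairs keyed by an integer id.

    The field layout is an assumption; the patent's Table 1 is not reproduced.
    """

    def __init__(self):
        self.units = {}   # unit id -> (Uyghur sentence, Chinese sentence)
        self.next_id = 1

    def add(self, uyghur, chinese):
        """Store a new aligned pair and return its id."""
        unit_id = self.next_id
        self.units[unit_id] = (uyghur, chinese)
        self.next_id += 1
        return unit_id

    def delete(self, unit_id):
        """Remove a unit; return True if it existed."""
        return self.units.pop(unit_id, None) is not None

    def search(self, word):
        """Return ids of units whose Uyghur or Chinese side contains `word`."""
        return [uid for uid, (ug, zh) in self.units.items()
                if word in ug or word in zh]

    def export_units(self):
        """Return all units as (id, Uyghur, Chinese) records, ordered by id."""
        return [(uid, ug, zh) for uid, (ug, zh) in sorted(self.units.items())]

    def import_units(self, records):
        """Load (id, Uyghur, Chinese) records, overwriting units with the same id."""
        for uid, ug, zh in records:
            self.units[uid] = (ug, zh)
        self.next_id = max(self.units, default=0) + 1
```

A real system would persist the units (the description stores them as XML translation units); the in-memory dictionary here only keeps the sketch short.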
The following listing is an example of aligned Uyghur-Chinese sentence storage in the translation memory. It contains the Uyghur language tag "ug-CN", the Chinese language tag "zh-Hans", a Uyghur sentence, and the aligned Chinese sentence.
[Listing image not reproduced: aligned Uyghur-Chinese sentence storage example]
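Because the original listing survives only as an image reference, the following sketch shows one plausible shape for such a translation unit. The `<tu>` element, its id, and the language tags "ug-CN" and "zh-Hans" come from the description; the `<tuv>` and `<seg>` element names and the use of `xml:lang` follow the TMX convention and are assumptions, as are the function names and the sample sentence pair.

```python
import xml.etree.ElementTree as ET

# Qualified name of the standard xml:lang attribute in ElementTree notation.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def make_tu(tu_id, uyghur, chinese):
    """Build one <tu> element holding an aligned Uyghur/Chinese sentence pair."""
    tu = ET.Element("tu", {"id": str(tu_id)})
    for lang, text in (("ug-CN", uyghur), ("zh-Hans", chinese)):
        tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})  # assumed TMX-style variant
        ET.SubElement(tuv, "seg").text = text
    return tu


def read_tu(tu):
    """Return (id, Uyghur sentence, Chinese sentence) from a <tu> element."""
    pair = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.findall("tuv")}
    return tu.get("id"), pair["ug-CN"], pair["zh-Hans"]


# Illustrative pair only; not taken from the patent.
unit = make_tu(1, "رەھمەت", "谢谢")
print(ET.tostring(unit, encoding="unicode"))
```

Keeping both language variants inside one unit is what makes the memory usable in both directions: either side can serve as the lookup key, and the other side is returned as the reference translation.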
Table 2 shows the sentence matrix. As required by the similarity formula, first compute the number of matching operations d[n, m] needed between the two sentences, as follows.
Step 1: obtain the length n of the source sentence and the length m of the target sentence. If either length is 0, return the other length.
Step 2: initialize a matrix d of size (n+1) * (m+1), with the first row and the first column filled with values increasing from 0 up to the corresponding sentence length.
Step 3: traverse each word of the two sentences (i and j start from 1). If s[i] and t[j] are equal (s[i] is the i-th word of the source sentence and t[j] is the j-th word of an existing sentence in the translation memory), source is 0; otherwise it is 1. The value of d[i, j] is the minimum of d[i-1, j] + source (the value on the left plus 0 or 1), d[i, j-1] + source (the value above plus 0 or 1), and d[i-1, j-1] + source (the value on the upper-left diagonal plus 0 or 1).
Step 4: after the three steps above, the value d[n, m] in the lower right corner is the distance between the two strings.
Table 2
[Table 2 image not reproduced: sentence matrix]
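The four steps can be sketched in code as below, together with a fuzzy match score and a threshold-based retrieval. Several points are assumptions rather than statements from the patent. The sketch uses the standard Levenshtein recurrence, in which insertion and deletion always cost 1 and only the diagonal term adds the 0/1 value, whereas the wording of step 3 adds that value to all three terms. The fuzzy matching formula is not given in the text, so the common normalization 1 - d / max(n, m) is used as a stand-in. The threshold of 0.6 and the names `edit_distance`, `similarity`, and `best_match` are hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    n, m = len(s), len(t)
    if n == 0 or m == 0:               # step 1: an empty side gives the other length
        return max(n, m)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):             # step 2: first column 0..n
        d[i][0] = i
    for j in range(m + 1):             # step 2: first row 0..m
        d[0][j] = j
    for i in range(1, n + 1):          # step 3: fill the matrix
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match or substitution
    return d[n][m]                     # step 4: lower right corner


def similarity(s, t):
    """Assumed fuzzy match score in [0, 1]: 1 - distance / longer length."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s, t) / longest


def best_match(query, memory, threshold=0.6):
    """Return (score, source, target) of the best pair scoring above threshold, else None.

    `memory` is an iterable of (source sentence, target sentence) pairs.
    """
    scored = [(similarity(query, src), src, tgt) for src, tgt in memory]
    scored = [item for item in scored if item[0] > threshold]
    return max(scored, key=lambda item: item[0], default=None)
```

Passing lists of words instead of strings, for example `edit_distance("a b c".split(), "a x c".split())`, yields the word-level distance that the description refers to; passing raw strings yields a character-level distance.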

Claims (1)

1. A method for constructing a Uyghur-Chinese bidirectional translation memory system, comprising: (1) translation memory structure and management: the organization and storage of all information is regarded as a combination of many translation memory units, and can also be regarded as a parallel corpus; the translation memory stores previously translated example sentences, and a Uyghur-Chinese translation memory aligned at the sentence level is adopted; management operations on the translation memory include word search, adding sentences, deleting sentences, import, and export; (2) aligned storage of Uyghur-Chinese sentences: all Uyghur-Chinese sentences collected in the translation memory are encoded in XML; the translation memory is stored in the form of "translation units", in which each Uyghur sentence corresponds exactly to a Chinese sentence, and each pair of corresponding sentences is described by the id under the sentence-pair tag <tu> ... </tu>; (3) translation memory retrieval: in a translation memory system, the closer the retrieved example is to the sentence to be translated, the better the translation quality; the minimum edit distance method is used to compute the similarity between the sentence to be translated and existing sentences, wherein the Levenshtein distance algorithm gives the number of words that must be edited to match the two sentences, and a fuzzy matching formula then gives the similarity between the source sentence and the target sentence; (4) translation editing environment: before translation, an internal filter imports the source text from a document in a supported format; the system performs sentence segmentation and word segmentation, computes sentence similarity by fuzzy matching, looks up words in the existing vocabulary by binary search and shows each word and its translation in a word list, and after translation exports the translation in the format of the original document.
CN201210553917.5A 2012-12-19 2012-12-19 Uyghur-Chinese bi-directional translation memory system construction method Pending CN103885939A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210553917.5A CN103885939A (en) 2012-12-19 2012-12-19 Uyghur-Chinese bi-directional translation memory system construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210553917.5A CN103885939A (en) 2012-12-19 2012-12-19 Uyghur-Chinese bi-directional translation memory system construction method

Publications (1)

Publication Number Publication Date
CN103885939A (en) 2014-06-25

Family

ID=50954834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210553917.5A Pending CN103885939A (en) 2012-12-19 2012-12-19 Uyghur-Chinese bi-directional translation memory system construction method

Country Status (1)

Country Link
CN (1) CN103885939A (en)

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101452459A (en) * 2007-11-30 2009-06-10 英业达股份有限公司 System for searching similar translation result by utilizing indexes and method thereof
CN102193914A (en) * 2011-05-26 2011-09-21 中国科学院计算技术研究所 Computer aided translation method and system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
塔依尔江·苏拉依曼等: "维吾尔文-汉文计算机辅助翻译***中双向翻译记忆子***的设计与实现", 《新疆大学学报(自然科学版)》 *

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104239295B (en) * 2014-09-10 2017-01-18 华建宇通科技(北京)有限责任公司 Multilevel Uigur lexical analysis method for Uigur-Chinese translation systems
CN104239295A (en) * 2014-09-10 2014-12-24 华建宇通科技(北京)有限责任公司 Multilevel Uigur lexical analysis method for Uigur-Chinese translation systems
CN104391840A (en) * 2014-11-24 2015-03-04 上海迈外迪网络科技有限公司 Translation method and device
CN104965925A (en) * 2015-07-13 2015-10-07 广西达译商务服务有限责任公司 Automatic Chinese-Khmer bilingual parallel text acquisition system and implementation method
CN104933193A (en) * 2015-07-13 2015-09-23 广西达译商务服务有限责任公司 Chinese and Bahasa Melayu bilingual parallel text automatic acquisition system and realizing method thereof
CN104933195A (en) * 2015-07-13 2015-09-23 广西达译商务服务有限责任公司 Chinese and Burmese bilingual parallel text automatic acquisition system and realizing method thereof
CN104933192A (en) * 2015-07-13 2015-09-23 广西达译商务服务有限责任公司 Automatic Chinese and Filipino bilingual parallel text collection system and implementation method
CN105045861A (en) * 2015-07-13 2015-11-11 广西达译商务服务有限责任公司 System for automatically collecting Hanyu and Bahasa Indonesia bilingualism parallel texts, and implementation method
CN105045862A (en) * 2015-07-13 2015-11-11 广西达译商务服务有限责任公司 System for automatically acquiring bilingual parallel corpus of Chinese-foreign languages and realization method
CN105138548A (en) * 2015-07-13 2015-12-09 广西达译商务服务有限责任公司 System for automatically collecting Chinese-Thai bilingual parallel corpus and implementation method
CN104933194A (en) * 2015-07-13 2015-09-23 广西达译商务服务有限责任公司 Chinese and Vietnamese bilingual parallel text automatic acquisition system and realizing method thereof
CN105183723A (en) * 2015-09-17 2015-12-23 成都优译信息技术有限公司 Associating method for translation software and language material searching
CN105335359A (en) * 2015-11-18 2016-02-17 成都优译信息技术有限公司 Term extracting method used for translation teaching system
CN106250373A (en) * 2016-08-04 2016-12-21 安徽云商信息科技有限公司 Automatically the rapid translation machine of language is identified
CN106844354A (en) * 2017-01-11 2017-06-13 中国科学院合肥物质科学研究院 A kind of webpage takes word Chinese interpretation method and its device
CN107967303A (en) * 2017-11-10 2018-04-27 传神语联网网络科技股份有限公司 The method and device that language material is shown
CN107967303B (en) * 2017-11-10 2021-03-26 传神语联网网络科技股份有限公司 Corpus display method and apparatus
CN109344410A (en) * 2018-09-19 2019-02-15 中译语通科技股份有限公司 A kind of machine translation control system and method, information data processing terminal
CN109710952A (en) * 2018-12-27 2019-05-03 北京百度网讯科技有限公司 Translation history search method, device, equipment and medium based on artificial intelligence
CN109710952B (en) * 2018-12-27 2023-06-16 北京百度网讯科技有限公司 Translation history retrieval method, device, equipment and medium based on artificial intelligence
CN111563387A (en) * 2019-02-12 2020-08-21 阿里巴巴集团控股有限公司 Sentence similarity determining method and device and sentence translation method and device
CN111563387B (en) * 2019-02-12 2023-05-02 阿里巴巴集团控股有限公司 Sentence similarity determining method and device, sentence translating method and device
CN111898388A (en) * 2020-07-20 2020-11-06 北京字节跳动网络技术有限公司 Video subtitle translation editing method and device, electronic equipment and storage medium
CN112036191A (en) * 2020-08-31 2020-12-04 文思海辉智科科技有限公司 Data processing method and device and readable storage medium
CN112036191B (en) * 2020-08-31 2023-11-28 文思海辉智科科技有限公司 Data processing method and device and readable storage medium
CN113033220A (en) * 2021-04-15 2021-06-25 沈阳雅译网络技术有限公司 Lavenstein ratio-based method for constructing literary-modern translation system

Similar Documents

Publication Publication Date Title
CN103885939A (en) Uyghur-Chinese bi-directional translation memory system construction method
CN104679850B (en) Address structure method and device
CN1661593B (en) Method for translating computer language and translation system
CN104679867B (en) Address method of knowledge processing and device based on figure
CN105975625A (en) Chinglish inquiring correcting method and system oriented to English search engine
CN102253930B (en) A kind of method of text translation and device
CN102799578B (en) Translation rule extraction method and translation method based on dependency grammar tree
CN104298662A (en) Machine translation method and translation system based on organism named entities
CN105335487A (en) Agricultural specialist information retrieval system and method on basis of agricultural technology information ontology library
CN101796508A (en) Coreference resolution in an ambiguity-sensitive natural language processing system
CN108665141B (en) Method for automatically extracting emergency response process model from emergency plan
Kituku et al. A review on machine translation approaches
CN103324700A (en) Noumenon concept attribute learning method based on Web information
Vel Pre-processing techniques of text mining using computational linguistics and python libraries
Zhang et al. A tree-to-tree alignment-based model for statistical machine translation
CN113312922A (en) Improved chapter-level triple information extraction method
CN113343717A (en) Neural machine translation method based on translation memory library
Ture et al. Exploiting representations from statistical machine translation for cross-language information retrieval
Sun [Retracted] Analysis of Chinese Machine Translation Training Based on Deep Learning Technology
CN108255818B (en) Combined machine translation method using segmentation technology
Bick Dan2eng: wide-coverage Danish-English machine translation
Iswarya et al. Adapting hybrid machine translation techniques for cross-language text retrieval system
Gao et al. BIMTag: semantic annotation of web BIM product resources based on IFC ontology
Hong Construction of corpus in Artificial Intelligence age
CN115048948B (en) Cross-language abstracting method for cross-Chinese low-resource by fusing topic association diagram

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20140625
