CN103885939A - Uyghur-Chinese bi-directional translation memory system construction method - Google Patents
- Publication number
- CN103885939A
- Authority
- CN
- China
- Prior art keywords
- sentence
- translation
- chinese
- data base
- dimension
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Machine Translation (AREA)
Abstract
The invention discloses a method for constructing a Uyghur-Chinese bidirectional translation memory system. The method comprises (1) translation memory structure and management, (2) aligned storage of Uyghur and Chinese sentences, (3) translation memory retrieval, and (4) a translation editing environment. The method improves translation efficiency and quality.
Description
Technical field
The present invention relates to translation memory technology, which is widely used in machine translation systems, and in particular to a method for constructing a Uyghur-Chinese bidirectional translation memory system.
Background art
With the development of information technology, communication barriers between speakers of different languages have become increasingly prominent. Although machine translation helps in this respect, it still faces serious difficulties. At present, machine translation systems mainly adopt two approaches: rule-based methods (relying mainly on linguistic knowledge) and corpus-based methods (relying mainly on examples).
Because Uyghur and Chinese belong to different language families, deep linguistic analysis of word segmentation, morphology, structure, ambiguous words, sentence syntax, and semantics is difficult to achieve. Uyghur-Chinese translation is therefore mainly corpus-based at present. Although this approach has achieved good results, building a Uyghur-Chinese corpus involves many factors, and a corpus can hardly cover every domain, so translation quality is difficult to guarantee. Although machine translation performance is currently not ideal, a translation memory used as a translation aid is still expected to be an effective means of improving working efficiency.
Given the shortcomings of rule-based and corpus-based translation, and considering that in professional domains (scientific literature, product descriptions, user manuals, and so on) vocabulary and sentences are relatively fixed and repeated sentences are common, translation memory technology was proposed. Translation memory can be regarded as the reuse of existing resources: when a new text is translated, the translator's earlier translations are reused, and the translator also takes part in the translation, so the quality of the final translation is guaranteed to a certain extent. Translation memory technology is widely applied abroad, and many computer-aided translation products such as Transit (STAR) and Trados have appeared. The technology has also developed domestically, where computer-aided translation software such as Yaxin CAT has appeared. Therefore, to meet the needs of Uyghur information processing, to assist translators whose mother tongue is Uyghur, and to improve their translation efficiency and quality, developing a translation memory tool is of great significance.
Summary of the invention
The object of the present invention is to provide a method for constructing a Uyghur-Chinese bidirectional translation memory system that improves translation efficiency and translation quality.
The object of the present invention is achieved by a method for constructing a Uyghur-Chinese bidirectional translation memory system, comprising the following four parts.
1. Translation memory structure and management: the organization and storage of information is regarded as a collection of many translation memory units, or equivalently as a parallel corpus. The memory stores example sentences that were translated previously and is aligned at the sentence level between Uyghur and Chinese. Management covers searching the memory for words, adding sentences, deleting sentences, and importing and exporting the memory.
2. Aligned storage of Uyghur-Chinese sentences: all Uyghur and Chinese sentences collected in the memory are encoded in XML. The translation memory is stored in the form of "translation units", in which each Uyghur sentence corresponds exactly to a Chinese sentence. Each pair of corresponding sentences is described by the id under the sentence-pair tag <tu> ... </tu>.
3. Translation memory retrieval: in a translation memory system, the closer the retrieved example is to the sentence to be translated, the better the translation quality. Minimum edit distance is used to compute the similarity between the sentence to be translated and existing sentences. The Levenshtein distance (LD) algorithm yields the number of word-level edits needed to match the two sentences, and a fuzzy-matching formula then gives the similarity between the source sentence and the target sentence.
4. Translation editing environment: before translation, an internal filter imports the source text from documents in supported formats (mainly .txt and .doc). The system then performs sentence segmentation and word segmentation, computes sentence similarity by fuzzy matching, looks up words in the existing vocabulary by binary search, and displays each word and its translation in a word list. After translation, the result is exported in the format of the original document.
For a sentence to be translated that the translator inputs, the system searches the translation memory and returns exactly matching or similar sentences for the translator to choose from. Finding similar sentences in the translation memory is the critical step; here, edit distance, which is commonly used in natural language processing, is used to compute the similarity between the input sentence and the sentences in the memory. During translation, the translation memory system uses similarity computation to search the memory automatically for identical or partially similar sentences and recommends reference translations to the translator, who decides whether to accept, edit, or reject them. Meanwhile, the translation memory keeps learning in the background, automatically storing the source text and translation of each new sentence.
The present invention designs and implements a translation memory system model. In the memory design, Uyghur and Chinese sentences are stored in exact sentence-to-sentence alignment, and the memory supports query, deletion, and update operations. The key technique is computing the similarity between an input sentence and the sentences in the memory, which is realized with edit distance, a measure commonly used in natural language processing. Sentences in the memory whose similarity exceeds a threshold are offered, together with their translations, to the user as translation references. Results show that this bidirectional translation memory system performs well in translation. The present invention improves translation efficiency and translation quality.
Brief description of the drawings
The invention is further described below with reference to the accompanying drawing.
Fig. 1 is a schematic diagram of the Uyghur-Chinese translation memory system model.
Embodiment
A method for constructing a Uyghur-Chinese bidirectional translation memory system comprises the following four parts.
1. Translation memory structure and management. Throughout the memory, the organization and storage of information can be regarded as a collection of many translation memory units, or equivalently as a parallel corpus. The memory stores example sentences that were translated previously. The design adopts a Uyghur-Chinese memory aligned at the sentence level. Once the memory has been designed, it must also be managed well, including searching the memory for words, adding sentences, deleting sentences, and importing and exporting the memory.
2. Aligned storage of Uyghur-Chinese sentences. All Uyghur and Chinese sentences collected in the memory are encoded in XML. The translation memory is stored in the form of "translation units", in which each Uyghur sentence corresponds exactly to a Chinese sentence. Each pair of corresponding sentences is described by the id under the sentence-pair tag <tu> ... </tu>.
3. Translation memory retrieval. In a translation memory system, the closer the retrieved example is to the sentence to be translated, the better the translation quality. Sentence similarity computation is one of the key techniques in a translation memory system, so it directly affects the system's efficiency and quality. Translation memory technology currently relies mainly on string-based similarity measures and on measures based on linguistic knowledge. Considering the differences between Uyghur and Chinese sentences in structure, semantics, morphology, and complexity, this method uses minimum edit distance to compute the similarity between the sentence to be translated and existing sentences. The Levenshtein distance (LD) algorithm yields the number of word-level edits needed to match the two sentences, and a fuzzy-matching formula then gives the similarity between the source sentence and the target sentence.
4. Translation editing environment. The translation editing environment is where the translator carries out the translation, and translation is performed inside the system. Before translation, an internal filter imports the source text from documents in supported formats (mainly .txt and .doc). The system then performs sentence segmentation and word segmentation, computes sentence similarity by fuzzy matching, looks up words in the existing vocabulary by binary search, and displays each word and its translation in a word list. After translation, the result is exported in the format of the original document.
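The vocabulary lookup by binary search mentioned in part 4 could look roughly like the sketch below. It assumes the vocabulary is held in memory as a sorted word list with a parallel list of translations; the class name `Vocabulary`, its method names, and the sample entries are hypothetical and are not taken from the patent.

```python
from bisect import bisect_left
from typing import Dict, List, Optional


class Vocabulary:
    """Hypothetical word list supporting binary-search lookup."""

    def __init__(self, entries: Dict[str, str]) -> None:
        # Binary search requires sorted keys, so sort once at construction.
        self._words: List[str] = sorted(entries)
        self._translations: List[str] = [entries[w] for w in self._words]

    def lookup(self, word: str) -> Optional[str]:
        # bisect_left returns the leftmost position where `word` could be inserted.
        i = bisect_left(self._words, word)
        if i < len(self._words) and self._words[i] == word:
            return self._translations[i]
        return None  # word is not in the vocabulary


# Illustrative placeholder entries (English to Chinese), not real system data.
vocab = Vocabulary({"file": "文件", "translation": "翻译", "memory": "记忆"})
```

Each lookup takes O(log n) comparisons, which is why a sorted vocabulary suits an editor that must display a translation for every word of the current sentence.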
As shown in Fig. 1, the text to be translated is first split into sentences; each sentence is then extracted in turn and its similarity to the sentences in the memory is calculated. The sentence with the highest similarity is edited manually, and the translation result is then output.
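The flow in Fig. 1 can be sketched as follows. This is a minimal illustration under several assumptions that the patent does not state: sentences are split on common end punctuation, words are separated by whitespace (Chinese text would need a real word segmenter), similarity is taken as 1 minus the edit distance divided by the longer sentence length, and the threshold of 0.7 is arbitrary. All function names are hypothetical.

```python
import re
from typing import Dict, List, Optional, Sequence, Tuple

END_MARKS = ".!?。！？؟"  # Latin, Chinese, and Arabic-script sentence endings


def split_sentences(text: str) -> List[str]:
    """Split text after sentence-ending punctuation (requires Python 3.7+)."""
    parts = re.split(r"(?<=[.!?。！？؟])\s*", text)
    return [p.strip() for p in parts if p.strip()]


def word_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Word-level Levenshtein distance, keeping only two matrix rows."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, start=1):
        curr = [i]
        for j, wb in enumerate(b, start=1):
            cost = 0 if wa == wb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def best_match(sentence: str, memory: Dict[str, str],
               threshold: float = 0.7) -> Optional[Tuple[str, str, float]]:
    """Return (stored source, stored translation, score) for the best hit, if any."""
    query = sentence.split()
    best = None
    for source, translation in memory.items():
        cand = source.split()
        longest = max(len(query), len(cand)) or 1
        score = 1.0 - word_edit_distance(query, cand) / longest  # assumed formula
        if score >= threshold and (best is None or score > best[2]):
            best = (source, translation, score)
    return best


def suggest(text: str, memory: Dict[str, str], threshold: float = 0.7):
    """Pair every sentence of the text with its best memory hit (or None)."""
    results = []
    for sentence in split_sentences(text):
        normalized = sentence.rstrip(END_MARKS).lower()
        results.append((sentence, best_match(normalized, memory, threshold)))
    return results
```

A sentence with no hit above the threshold is returned with `None`, leaving it to the translator to translate from scratch, after which the new pair could be added to the memory.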
Table 1 below shows the structure of the memory. Once the memory has been designed, it must also be managed well, including searching the memory for words, adding sentences, deleting sentences, and importing and exporting the memory.
Table 1
The following example shows how an aligned Uyghur-Chinese sentence pair is stored in the memory. It contains the Uyghur language tag "ug-CN", the Chinese language tag "zh-Hans", the Uyghur sentence, and the aligned Chinese sentence.
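As a rough illustration of such a translation unit, the sketch below stores one aligned pair in XML and reads it back. Only the `<tu>` element, its id, and the language tags "ug-CN" and "zh-Hans" come from the description; the root element `memory`, the child elements `tuv` and `seg`, and the attribute name `lang` are assumptions borrowed from the common TMX layout, and the sentence texts are placeholders.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple

# Placeholder content: element names other than <tu> are assumed, not from the patent.
SAMPLE = """<memory>
  <tu id="1">
    <tuv lang="ug-CN"><seg>(Uyghur sentence 1)</seg></tuv>
    <tuv lang="zh-Hans"><seg>(Chinese sentence 1)</seg></tuv>
  </tu>
</memory>"""


def load_units(xml_text: str) -> Dict[str, Tuple[str, str]]:
    """Map each translation unit id to its (Uyghur, Chinese) sentence pair."""
    root = ET.fromstring(xml_text)
    units = {}
    for tu in root.iter("tu"):
        segs = {tuv.get("lang"): (tuv.findtext("seg") or "") for tuv in tu.iter("tuv")}
        units[tu.get("id")] = (segs.get("ug-CN", ""), segs.get("zh-Hans", ""))
    return units


def add_unit(root: ET.Element, unit_id: str, uyghur: str, chinese: str) -> None:
    """Append a new aligned sentence pair to the memory tree."""
    tu = ET.SubElement(root, "tu", {"id": unit_id})
    for lang, text in (("ug-CN", uyghur), ("zh-Hans", chinese)):
        tuv = ET.SubElement(tu, "tuv", {"lang": lang})
        ET.SubElement(tuv, "seg").text = text
```

Keeping both language variants inside one unit is what makes the memory bidirectional: the same record can be searched by its Uyghur side or by its Chinese side.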
Table 2 shows the sentence matrix. To evaluate the similarity formula, the number of edit operations d[n, m] needed to match the two sentences is computed first, as follows.
Step 1: obtain the length n of the source sentence and the length m of the target sentence. If either length is 0, return the other length.
Step 2: initialize a matrix d of size (n+1) × (m+1), filling the first row with the values 0 to m and the first column with the values 0 to n.
Step 3: iterate over every position (i, j), with i and j starting from 1. If s[i] and t[j] are equal (s[i] is the i-th word of the source sentence and t[j] is the j-th word of an existing sentence in the memory), the cost is 0; otherwise it is 1. The value of d[i, j] is the minimum of d[i-1, j] + 1 (the value above plus 1), d[i, j-1] + 1 (the value to the left plus 1), and d[i-1, j-1] + cost (the value diagonally above and to the left plus 0 or 1).
Step 4: after the above steps are completed, the value d[n, m] in the lower right corner is the distance between the two strings.
Table 2
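The four steps above can be sketched in Python as follows. The matrix construction follows the standard Levenshtein recurrence; the fuzzy-matching formula is not given in the text, so `fuzzy_similarity` uses a common normalization (1 minus the distance divided by the longer length) purely as an assumption. Function names are hypothetical.

```python
from typing import List, Sequence


def edit_distance_matrix(source: Sequence[str], target: Sequence[str]) -> List[List[int]]:
    """Build the full (n+1) x (m+1) Levenshtein matrix d."""
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # first column: 0..n
    for j in range(m + 1):
        d[0][j] = j  # first row: 0..m
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Sequences are 0-indexed in Python, so word i is source[i - 1].
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # value above plus 1 (deletion)
                d[i][j - 1] + 1,         # value to the left plus 1 (insertion)
                d[i - 1][j - 1] + cost,  # diagonal plus 0 or 1 (match or substitution)
            )
    return d


def edit_distance(source: Sequence[str], target: Sequence[str]) -> int:
    """The lower-right cell d[n][m] is the distance."""
    return edit_distance_matrix(source, target)[len(source)][len(target)]


def fuzzy_similarity(source: Sequence[str], target: Sequence[str]) -> float:
    """Assumed normalization: 1 - distance / max(n, m); 1.0 for two empty inputs."""
    longest = max(len(source), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(source, target) / longest
```

Passing lists of words gives the word-level distance described in step 3, while passing plain strings compares character by character; an empty sequence on either side falls out of the initialization in step 2 and returns the other length.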
Claims (1)
1. A method for constructing a Uyghur-Chinese bidirectional translation memory system, comprising: (1) translation memory structure and management: the organization and storage of information is regarded as a collection of many translation memory units, or equivalently as a parallel corpus; the memory stores example sentences that were translated previously and is a Uyghur-Chinese memory aligned at the sentence level; the memory supports searching for words, adding sentences, deleting sentences, and importing and exporting the memory; (2) aligned storage of Uyghur-Chinese sentences: all Uyghur and Chinese sentences collected in the memory are encoded in XML; the translation memory is stored in the form of "translation units", in which each Uyghur sentence corresponds exactly to a Chinese sentence, and each pair of corresponding sentences is described by the id under the sentence-pair tag <tu> ... </tu>; (3) translation memory retrieval: in the translation memory system, the closer the retrieved example is to the sentence to be translated, the better the translation quality; minimum edit distance is used to compute the similarity between the sentence to be translated and existing sentences, the Levenshtein distance algorithm yields the number of word-level edits needed to match the two sentences, and a fuzzy-matching formula then gives the similarity between the source sentence and the target sentence; (4) translation editing environment: before translation, an internal filter imports the source text from documents in supported formats; the system performs sentence segmentation and word segmentation, computes sentence similarity by fuzzy matching, looks up words in the existing vocabulary by binary search, and displays each word and its translation in a word list; after translation, the result is exported in the format of the original document.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210553917.5A CN103885939A (en) | 2012-12-19 | 2012-12-19 | Uyghur-Chinese bi-directional translation memory system construction method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210553917.5A CN103885939A (en) | 2012-12-19 | 2012-12-19 | Uyghur-Chinese bi-directional translation memory system construction method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103885939A true CN103885939A (en) | 2014-06-25 |
Family
ID=50954834
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210553917.5A Pending CN103885939A (en) | 2012-12-19 | 2012-12-19 | Uyghur-Chinese bi-directional translation memory system construction method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103885939A (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104239295A (en) * | 2014-09-10 | 2014-12-24 | 华建宇通科技(北京)有限责任公司 | Multilevel Uigur lexical analysis method for Uigur-Chinese translation systems |
CN104391840A (en) * | 2014-11-24 | 2015-03-04 | 上海迈外迪网络科技有限公司 | Translation method and device |
CN104933194A (en) * | 2015-07-13 | 2015-09-23 | 广西达译商务服务有限责任公司 | Chinese and Vietnamese bilingual parallel text automatic acquisition system and realizing method thereof |
CN104933192A (en) * | 2015-07-13 | 2015-09-23 | 广西达译商务服务有限责任公司 | Automatic Chinese and Filipino bilingual parallel text collection system and implementation method |
CN104933193A (en) * | 2015-07-13 | 2015-09-23 | 广西达译商务服务有限责任公司 | Chinese and Bahasa Melayu bilingual parallel text automatic acquisition system and realizing method thereof |
CN104933195A (en) * | 2015-07-13 | 2015-09-23 | 广西达译商务服务有限责任公司 | Chinese and Burmese bilingual parallel text automatic acquisition system and realizing method thereof |
CN104965925A (en) * | 2015-07-13 | 2015-10-07 | 广西达译商务服务有限责任公司 | Automatic Chinese-Khmer bilingual parallel text acquisition system and implementation method |
CN105045861A (en) * | 2015-07-13 | 2015-11-11 | 广西达译商务服务有限责任公司 | System for automatically collecting Hanyu and Bahasa Indonesia bilingualism parallel texts, and implementation method |
CN105045862A (en) * | 2015-07-13 | 2015-11-11 | 广西达译商务服务有限责任公司 | System for automatically acquiring bilingual parallel corpus of Chinese-foreign languages and realization method |
CN105138548A (en) * | 2015-07-13 | 2015-12-09 | 广西达译商务服务有限责任公司 | System for automatically collecting Chinese-Thai bilingual parallel corpus and implementation method |
CN105183723A (en) * | 2015-09-17 | 2015-12-23 | 成都优译信息技术有限公司 | Associating method for translation software and language material searching |
CN105335359A (en) * | 2015-11-18 | 2016-02-17 | 成都优译信息技术有限公司 | Term extracting method used for translation teaching system |
CN106250373A (en) * | 2016-08-04 | 2016-12-21 | 安徽云商信息科技有限公司 | Automatically the rapid translation machine of language is identified |
CN106844354A (en) * | 2017-01-11 | 2017-06-13 | 中国科学院合肥物质科学研究院 | A kind of webpage takes word Chinese interpretation method and its device |
CN107967303A (en) * | 2017-11-10 | 2018-04-27 | 传神语联网网络科技股份有限公司 | The method and device that language material is shown |
CN109344410A (en) * | 2018-09-19 | 2019-02-15 | 中译语通科技股份有限公司 | A kind of machine translation control system and method, information data processing terminal |
CN109710952A (en) * | 2018-12-27 | 2019-05-03 | 北京百度网讯科技有限公司 | Translation history search method, device, equipment and medium based on artificial intelligence |
CN111563387A (en) * | 2019-02-12 | 2020-08-21 | 阿里巴巴集团控股有限公司 | Sentence similarity determining method and device and sentence translation method and device |
CN111898388A (en) * | 2020-07-20 | 2020-11-06 | 北京字节跳动网络技术有限公司 | Video subtitle translation editing method and device, electronic equipment and storage medium |
CN112036191A (en) * | 2020-08-31 | 2020-12-04 | 文思海辉智科科技有限公司 | Data processing method and device and readable storage medium |
CN113033220A (en) * | 2021-04-15 | 2021-06-25 | 沈阳雅译网络技术有限公司 | Lavenstein ratio-based method for constructing literary-modern translation system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101452459A (en) * | 2007-11-30 | 2009-06-10 | 英业达股份有限公司 | System for searching similar translation result by utilizing indexes and method thereof |
CN102193914A (en) * | 2011-05-26 | 2011-09-21 | 中国科学院计算技术研究所 | Computer aided translation method and system |
- 2012-12-19: CN application CN201210553917.5A (publication CN103885939A), status Pending
Non-Patent Citations (1)
Title |
---|
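塔依尔江·苏拉依曼 et al.: "Design and implementation of a bidirectional translation memory subsystem in a Uyghur-Chinese computer-aided translation system", Journal of Xinjiang University (Natural Science Edition) * |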
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104239295B (en) * | 2014-09-10 | 2017-01-18 | 华建宇通科技(北京)有限责任公司 | Multilevel Uigur lexical analysis method for Uigur-Chinese translation systems |
CN104239295A (en) * | 2014-09-10 | 2014-12-24 | 华建宇通科技(北京)有限责任公司 | Multilevel Uigur lexical analysis method for Uigur-Chinese translation systems |
CN104391840A (en) * | 2014-11-24 | 2015-03-04 | 上海迈外迪网络科技有限公司 | Translation method and device |
CN104965925A (en) * | 2015-07-13 | 2015-10-07 | 广西达译商务服务有限责任公司 | Automatic Chinese-Khmer bilingual parallel text acquisition system and implementation method |
CN104933193A (en) * | 2015-07-13 | 2015-09-23 | 广西达译商务服务有限责任公司 | Chinese and Bahasa Melayu bilingual parallel text automatic acquisition system and realizing method thereof |
CN104933195A (en) * | 2015-07-13 | 2015-09-23 | 广西达译商务服务有限责任公司 | Chinese and Burmese bilingual parallel text automatic acquisition system and realizing method thereof |
CN104933192A (en) * | 2015-07-13 | 2015-09-23 | 广西达译商务服务有限责任公司 | Automatic Chinese and Filipino bilingual parallel text collection system and implementation method |
CN105045861A (en) * | 2015-07-13 | 2015-11-11 | 广西达译商务服务有限责任公司 | System for automatically collecting Hanyu and Bahasa Indonesia bilingualism parallel texts, and implementation method |
CN105045862A (en) * | 2015-07-13 | 2015-11-11 | 广西达译商务服务有限责任公司 | System for automatically acquiring bilingual parallel corpus of Chinese-foreign languages and realization method |
CN105138548A (en) * | 2015-07-13 | 2015-12-09 | 广西达译商务服务有限责任公司 | System for automatically collecting Chinese-Thai bilingual parallel corpus and implementation method |
CN104933194A (en) * | 2015-07-13 | 2015-09-23 | 广西达译商务服务有限责任公司 | Chinese and Vietnamese bilingual parallel text automatic acquisition system and realizing method thereof |
CN105183723A (en) * | 2015-09-17 | 2015-12-23 | 成都优译信息技术有限公司 | Associating method for translation software and language material searching |
CN105335359A (en) * | 2015-11-18 | 2016-02-17 | 成都优译信息技术有限公司 | Term extracting method used for translation teaching system |
CN106250373A (en) * | 2016-08-04 | 2016-12-21 | 安徽云商信息科技有限公司 | Automatically the rapid translation machine of language is identified |
CN106844354A (en) * | 2017-01-11 | 2017-06-13 | 中国科学院合肥物质科学研究院 | A kind of webpage takes word Chinese interpretation method and its device |
CN107967303A (en) * | 2017-11-10 | 2018-04-27 | 传神语联网网络科技股份有限公司 | The method and device that language material is shown |
CN107967303B (en) * | 2017-11-10 | 2021-03-26 | 传神语联网网络科技股份有限公司 | Corpus display method and apparatus |
CN109344410A (en) * | 2018-09-19 | 2019-02-15 | 中译语通科技股份有限公司 | A kind of machine translation control system and method, information data processing terminal |
CN109710952A (en) * | 2018-12-27 | 2019-05-03 | 北京百度网讯科技有限公司 | Translation history search method, device, equipment and medium based on artificial intelligence |
CN109710952B (en) * | 2018-12-27 | 2023-06-16 | 北京百度网讯科技有限公司 | Translation history retrieval method, device, equipment and medium based on artificial intelligence |
CN111563387A (en) * | 2019-02-12 | 2020-08-21 | 阿里巴巴集团控股有限公司 | Sentence similarity determining method and device and sentence translation method and device |
CN111563387B (en) * | 2019-02-12 | 2023-05-02 | 阿里巴巴集团控股有限公司 | Sentence similarity determining method and device, sentence translating method and device |
CN111898388A (en) * | 2020-07-20 | 2020-11-06 | 北京字节跳动网络技术有限公司 | Video subtitle translation editing method and device, electronic equipment and storage medium |
CN112036191A (en) * | 2020-08-31 | 2020-12-04 | 文思海辉智科科技有限公司 | Data processing method and device and readable storage medium |
CN112036191B (en) * | 2020-08-31 | 2023-11-28 | 文思海辉智科科技有限公司 | Data processing method and device and readable storage medium |
CN113033220A (en) * | 2021-04-15 | 2021-06-25 | 沈阳雅译网络技术有限公司 | Lavenstein ratio-based method for constructing literary-modern translation system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN103885939A (en) | Uyghur-Chinese bi-directional translation memory system construction method | |
CN104679850B (en) | Address structure method and device | |
CN1661593B (en) | Method for translating computer language and translation system | |
CN104679867B (en) | Address method of knowledge processing and device based on figure | |
CN105975625A (en) | Chinglish inquiring correcting method and system oriented to English search engine | |
CN102253930B (en) | A kind of method of text translation and device | |
CN102799578B (en) | Translation rule extraction method and translation method based on dependency grammar tree | |
CN104298662A (en) | Machine translation method and translation system based on organism named entities | |
CN105335487A (en) | Agricultural specialist information retrieval system and method on basis of agricultural technology information ontology library | |
CN101796508A (en) | Coreference resolution in an ambiguity-sensitive natural language processing system | |
CN108665141B (en) | Method for automatically extracting emergency response process model from emergency plan | |
Kituku et al. | A review on machine translation approaches | |
CN103324700A (en) | Noumenon concept attribute learning method based on Web information | |
Vel | Pre-processing techniques of text mining using computational linguistics and python libraries | |
Zhang et al. | A tree-to-tree alignment-based model for statistical machine translation | |
CN113312922A (en) | Improved chapter-level triple information extraction method | |
CN113343717A (en) | Neural machine translation method based on translation memory library | |
Ture et al. | Exploiting representations from statistical machine translation for cross-language information retrieval | |
Sun | [Retracted] Analysis of Chinese Machine Translation Training Based on Deep Learning Technology | |
CN108255818B (en) | Combined machine translation method using segmentation technology | |
Bick | Dan2eng: wide-coverage Danish-English machine translation | |
Iswarya et al. | Adapting hybrid machine translation techniques for cross-language text retrieval system | |
Gao et al. | BIMTag: semantic annotation of web BIM product resources based on IFC ontology | |
Hong | Construction of corpus in Artificial Intelligence age | |
CN115048948B (en) | Cross-language abstracting method for cross-Chinese low-resource by fusing topic association diagram |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20140625 |
WD01 | Invention patent application deemed withdrawn after publication | |