CN107797995A - Method for generating a Chinese-English fragment corpus - Google Patents

Method for generating a Chinese-English fragment corpus

Publication number
CN107797995A
Authority
CN
China
Prior art keywords
word
english
chinese
sentence
fragment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711160312.9A
Other languages
Chinese (zh)
Inventor
宋安琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Language Network (wuhan) Information Technology Co Ltd
Original Assignee
Language Network (wuhan) Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Language Network (wuhan) Information Technology Co Ltd
Priority to CN201711160312.9A
Publication of CN107797995A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G06F40/45 Example-based machine translation; Alignment
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention relates to the field of machine translation, and in particular to a method for generating a Chinese-English fragment corpus. The method comprises five main steps: sentence segmentation processing, exact matching, fuzzy matching, correction of word correspondences, and generation of the fragment corpus. Compound-word correspondences are found by using part of speech to identify gaps made up of function words (non-notional words), and fragments are extracted based on the word correspondences and those function-word gaps. The method is easy to implement and the generated fragment corpus is highly accurate, which is of great significance for improving the efficiency of machine-aided translation.

Description

Method for generating a Chinese-English fragment corpus
Technical field
The present invention relates to the field of machine translation, and in particular to a method for generating a Chinese-English fragment corpus.
Background
With the development of information technology, international exchange has become increasingly frequent, and accurately understanding different languages has become an important need. Machine translation, which addresses the communication barriers between speakers of different languages, is an important direction in natural language processing and has received growing attention and development; machine translation based on neural networks has replaced the earlier statistics-based machine translation as the industry mainstream. Both the latest neural machine translation and the earlier statistical machine translation are mostly corpus-based. Existing corpora generally consist of single-word corpora and sentence corpora. In actual translation, a single-word corpus is similar to an English-Chinese dictionary, and its efficiency is clearly insufficient for translating articles; and because identical sentences are rare across different articles, a sentence corpus offers only limited help. What is actually reused most readily across different texts is the fragment: a set of several consecutive words that is longer than a single word and shorter than a sentence. A fragment corpus is the accurate mutual-translation text of Chinese and English fragments.
Clearly, a fragment corpus is of great significance for improving translation efficiency. However, existing corpora lack entries at the fragment level.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method for generating a Chinese-English fragment corpus.
To solve the above technical problem, the present invention provides a method for generating a Chinese-English fragment corpus, comprising the following steps:
Step 1: select a pair of translated Chinese and English sentences, and perform word segmentation on the English sentence and the Chinese sentence respectively;
Step 2: according to the English-Chinese dictionary definitions, find all English words whose definition is identical to a Chinese word, and record the correspondence between each matched English word and the Chinese word;
Step 3: for each English word for which step 2 recorded no correspondence, if the similarity between the dictionary definition of the English word and some Chinese word is above a set threshold, the English word is considered to match the meaning of that Chinese word, and the correspondence is recorded;
Step 4: correct the word correspondences, that is,
traverse the correspondences obtained after step 3 in which multiple English words correspond to one Chinese word, and sort the English words in the order in which they appear in the English sentence. If the word indices are consecutive, confirm the alignment and add it to the English-Chinese dictionary definitions. Otherwise, if two neighboring words are not consecutive, check the part of speech of the intervening words: if the intervening words are all function words (non-notional words), record the correspondence including the intervening words; if the intervening words include a word that is not a function word, discard the correspondence between these English words and the Chinese word. Merge the recorded correspondences into the English-Chinese dictionary;
Step 5: generate the fragment corpus, that is,
set the fragment start to the first Chinese word, and traverse the Chinese words and their corresponding English words in the order in which the Chinese words appear in the sentence. If the English word index is consecutive, continue to the next English word; if the current fragment contains two or more Chinese words, record the fragment. If the English word index is not consecutive and the word is a content word (notional word), record the fragment without the non-consecutive word and set a new fragment start. If the English word index is not consecutive and the word is a function word, continue to the next English word: if that next word is consecutive, continue; if it is also not consecutive, record the fragment without the last two non-consecutive words and set a new fragment start. When the traversal of Chinese words reaches clause punctuation, reset the fragment start to the next Chinese word.
When the traversal of the Chinese sentence is complete, the fragment corpus entries have been obtained one by one.
Preferably, the word segmentation of the English and Chinese sentences comprises:
English sentence segmentation: merge the general English-Chinese dictionary and the English-Chinese terminology dictionary relevant to the aligned bilingual sentences into a single English-Chinese dictionary file, lemmatize the English sentence, and segment it against the dictionary entries using forward maximum matching;
Chinese sentence segmentation: segment the Chinese sentence with a Chinese word segmenter.
Further, the Chinese word segmenter has a new-word discovery function.
In step 2, finding all English words whose definition is identical to a Chinese word according to the English-Chinese dictionary definitions, and recording the correspondence between each matched English word and the Chinese word, specifically comprises:
taking the segmented English sentence as the object and starting from the first content word, use the Chinese definitions of the English word in the English-Chinese dictionary to look up the words that appear in the Chinese sentence; if a Chinese word is identical to a definition of the English word, record the correspondence between the English word and that Chinese word; continue with the next English content word until the last word.
Preferably, the similarity threshold is set to 20%.
Further, select more pairs of translated Chinese and English sentences and repeat steps 1 to 5 to obtain a sufficiently large fragment corpus.
The invention provides a method for generating a Chinese-English fragment corpus. The method is easy to implement and the generated fragment corpus is highly accurate, which is of great significance for improving the efficiency of machine-aided translation.
Brief description of the drawings
The technical solution of the present invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is the overall flow chart of the present invention.
Detailed description
As shown in Fig. 1, the present invention comprises the following steps:
Step 1, word segmentation of the sentences
Select a pair of translated Chinese and English sentences, and segment the English sentence and the Chinese sentence respectively, including:
English sentence segmentation: merge the general English-Chinese dictionary and the English-Chinese terminology dictionary relevant to the aligned bilingual sentences into a single English-Chinese dictionary file, lemmatize the English sentence, and segment it against the dictionary entries using forward maximum matching.
Chinese sentence segmentation: segment the Chinese sentence with a Chinese word segmenter that has a new-word discovery function.
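The forward maximum matching used for the English side can be sketched as follows. This is a minimal illustration only: the function name, the `max_len` window, and the dictionary entries are assumptions made for the example, and the input tokens are assumed to be lemmatized already.

```python
def forward_max_match(tokens, dictionary, max_len=4):
    """Greedily group tokens into the longest dictionary entries, left to right."""
    result = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first and shrink until a dictionary hit;
        # a single token is always accepted so the scan cannot stall.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size])
            if size == 1 or candidate in dictionary:
                result.append(candidate)
                i += size
                break
    return result


# Hypothetical merged dictionary; entries are illustrative, not from the patent.
DICTIONARY = {"be good at", "scientific", "engineering", "computing", "supercomputer"}
tokens = "the supercomputer be good at scientific computing".split()
segmented = forward_max_match(tokens, DICTIONARY)
print(segmented)
```

With this dictionary the multi-word entry "be good at" is kept as one unit, which is what lets a later step align it with a single Chinese word.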
Step 2, exact matching
Taking the segmented English sentence as the object and starting from the first content word, use the Chinese definitions of the English word in the English-Chinese dictionary to look up the words that appear in the Chinese sentence; if a Chinese word is identical to a definition of the English word, record the correspondence between the English word and that Chinese word; continue with the next English content word until the last word.
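A minimal sketch of this exact-matching pass is given below. The data layout (a dictionary mapping each English word to a list of Chinese definitions, and a set of function words to skip) is an assumption, and the Chinese strings are illustrative reconstructions rather than text taken from the patent.

```python
def exact_match(english_words, chinese_words, dictionary, function_words):
    """Return {english_index: chinese_index} for exact definition matches."""
    pairs = {}
    for e_idx, word in enumerate(english_words):
        if word in function_words:
            continue  # only content words are looked up
        definitions = dictionary.get(word, [])
        for c_idx, c_word in enumerate(chinese_words):
            if c_word in definitions:
                pairs[e_idx] = c_idx  # record the first identical definition
                break
    return pairs


# Hypothetical inputs for illustration.
english_words = ["the", "traditional", "supercomputer", "only"]
chinese_words = ["传统", "的", "超级计算机", "只"]
dictionary = {
    "traditional": ["传统", "惯例的"],
    "supercomputer": ["超级计算机"],
    "only": ["只", "仅仅"],
}
pairs = exact_match(english_words, chinese_words, dictionary, {"the"})
print(pairs)
```

English words that end up with no entry in the returned mapping are the ones handed on to fuzzy matching.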
Step 3, fuzzy matching
For each English word for which step 2 recorded no correspondence, if the similarity between the dictionary definition of the English word and some Chinese word is above the set threshold, the English word is considered to match the meaning of that Chinese word, and the correspondence is recorded. Multiple English words may correspond to one Chinese word.
For example, take the Chinese sentence whose meaning is "Traditional supercomputers are good only at scientific engineering computing, whereas superservers cover both kinds of application and are the mainstream of high-end computers", and the English sentence "The traditional supercomputers are good at the scientific engineering computing only, while this super-server is good at both, thus being the mainstream of the high-end computers." The Chinese word for "superserver" is a single word. The dictionary definition of "super" contains the meaning "super", so "super" corresponds to the Chinese word for "superserver"; the dictionary definition of "server" contains the meaning "server", so "server" also corresponds to the Chinese word for "superserver". The Chinese word for "high-end" is a single word. The dictionary definition of "high" contains the meaning "advanced", whose similarity to the Chinese word for "high-end" is (1+1)/(2+2) = 50%, above 20%, so "high" corresponds to the Chinese word for "high-end"; the dictionary definition of "end" contains the meaning "end", whose similarity to the Chinese word for "high-end" is (1+1)/(2+2) = 50%, above 20%, so "end" corresponds to the Chinese word for "high-end".
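The text does not define the similarity measure; the worked example only writes the score in the form (1+1)/(2+2). The sketch below assumes a character-overlap ratio of that shape: shared characters counted from each side, divided by the combined length. The function names and the Chinese strings are assumptions for illustration.

```python
def char_similarity(definition, chinese_word):
    """Assumed measure: shared characters from both sides over combined length."""
    total = len(definition) + len(chinese_word)
    if total == 0:
        return 0.0
    shared = sum(ch in chinese_word for ch in definition)
    shared += sum(ch in definition for ch in chinese_word)
    return shared / total


def fuzzy_match(definitions, chinese_words, threshold=0.20):
    """Index of the best-scoring Chinese word above the threshold, else None."""
    best_idx, best_score = None, threshold
    for c_idx, c_word in enumerate(chinese_words):
        score = max((char_similarity(d, c_word) for d in definitions), default=0.0)
        if score > best_score:
            best_idx, best_score = c_idx, score
    return best_idx


# Hypothetical definition "高级" (advanced) against the sentence word "高端" (high-end):
# one shared character on each side, two characters in each string.
score = char_similarity("高级", "高端")
match = fuzzy_match(["高级", "高的"], ["传统", "高端", "计算机"])
print(score, match)
```

Any string similarity with a tunable threshold could be substituted here; the overlap ratio is used only because it mirrors the form of the worked example.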
Step 4, correction of word correspondences, that is,
traverse the correspondences obtained after step 3 in which multiple English words correspond to one Chinese word, and sort the English words in the order in which they appear in the English sentence. If the word indices are consecutive, confirm the alignment and add it to the English-Chinese dictionary definitions. Otherwise, if two neighboring words are not consecutive, check the part of speech of the intervening words: if the intervening words are all function words (non-notional words), record the correspondence including the intervening words; if the intervening words include a word that is not a function word, discard the correspondence between these English words and the Chinese word. Merge the recorded correspondences into the English-Chinese dictionary.
For example, in the example above "super" and "server" both correspond to the Chinese word for "superserver", and the "-" between them is a function word, so the correspondence between "super-server" and the Chinese word for "superserver" is recorded; similarly, "high-end" corresponds to the Chinese word for "high-end".
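The correction rule can be sketched as a grouping pass over the recorded pairs. The representation below (a mapping from English index to Chinese index, and a set of function words that includes the hyphen) is an assumption for illustration; the sketch returns, for each Chinese word, the English index span that survives the check.

```python
from collections import defaultdict


def merge_compounds(english_words, pairs, function_words):
    """pairs: {english_index: chinese_index} -> {chinese_index: [english indices]}."""
    grouped = defaultdict(list)
    for e_idx, c_idx in pairs.items():
        grouped[c_idx].append(e_idx)

    merged = {}
    for c_idx, e_indices in grouped.items():
        e_indices.sort()  # order of appearance in the English sentence
        span = [e_indices[0]]
        keep = True
        for prev, cur in zip(e_indices, e_indices[1:]):
            gap = range(prev + 1, cur)  # empty when the words are consecutive
            if all(english_words[g] in function_words for g in gap):
                span.extend(gap)        # keep the function words inside the span
                span.append(cur)
            else:
                keep = False            # a content word intervenes: discard
                break
        if keep:
            merged[c_idx] = span
    return merged


# Hypothetical segmentation of part of the example sentence.
words = ["this", "super", "-", "server", "is", "high", "-", "end"]
merged = merge_compounds(words, {1: 0, 3: 0, 5: 1, 7: 1}, {"-", "this", "is"})
print(merged)
```

Joining the words of each surviving span gives the compound entry (for example "super - server") that would be merged back into the dictionary.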
Step 5, generation of the fragment corpus
Set the fragment start to the first Chinese word, and traverse the Chinese words and their corresponding English words in the order in which the Chinese words appear in the sentence. If the English word index is consecutive, continue to the next English word. If the current fragment contains two or more Chinese words, record the fragment. If the English word index is not consecutive and the word is a content word (notional word), record the fragment without the non-consecutive word and set a new fragment start. If the English word index is not consecutive and the word is a function word, continue to the next English word: if that next word is consecutive, continue; if it is also not consecutive, record the fragment without the last two non-consecutive words and set a new fragment start. When the traversal of Chinese words reaches clause punctuation such as a comma, full stop, or exclamation mark, reset the fragment start to the next Chinese word.
When the traversal of the Chinese sentence is complete, the fragment corpus entries have been obtained one by one.
For example, in the example above the Chinese word for "traditional" corresponds to "traditional". The following Chinese particle corresponds to "of", which is not consecutive, but "of" is a function word, so the traversal continues. The Chinese word for "supercomputer" corresponds to "supercomputers", which is consecutive, so the Chinese fragment for "traditional supercomputer" is recorded with "traditional supercomputers". The Chinese word for "only" corresponds to "only", which is not consecutive; "only" is a content word, so the fragment start is reset. The Chinese word for "be good at" corresponds to "are good at", which is not consecutive with "supercomputer", the last word of the previous fragment, and "are good at" is a content word, so the fragment start is reset to the Chinese word for "be good at". Matching continues: the Chinese word for "science" corresponds to "scientific", which is consecutive with "are good at", so the fragment "be good at science" is recorded with "are good at scientific". The Chinese word for "engineering" corresponds to "engineering", so the fragment "be good at scientific engineering" is recorded with "are good at scientific engineering", and "scientific engineering" is recorded with "scientific engineering". The Chinese word for "computing" corresponds to "computing", which is consecutive, so the fragment "be good at scientific engineering computing" is recorded with "are good at scientific engineering computing", "scientific engineering computing" is recorded with "scientific engineering computing", and "engineering computing" is recorded with "engineering computing".
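The traversal can be sketched as below. This is a simplified reading, not the definitive procedure: it assumes each Chinese word arrives with at most one aligned English unit and its index, treats English function words between two aligned units as not breaking consecutiveness, tolerates one out-of-order function word before restarting, and records every suffix of the running fragment that has at least two units (as the worked example appears to do). All names and the Chinese strings are assumptions for illustration.

```python
CLAUSE_PUNCT = set("，。！？,.!?")


def is_consecutive(prev_idx, idx, english_words, function_words):
    """True if idx follows prev_idx with only function words in between."""
    if idx <= prev_idx:
        return False
    gap = range(prev_idx + 1, idx)
    return all(english_words[g].lower() in function_words for g in gap)


def extract_fragments(aligned, english_words, function_words):
    """aligned: [(chinese_word, english_unit_or_None, english_index)] in Chinese order."""
    fragments = []
    run = []       # units of the current fragment: (zh, en, en_index, lead)
    pending = ""   # Chinese text of one tolerated out-of-order function word
    for zh, en, idx in aligned:
        if zh in CLAUSE_PUNCT or en is None:
            run, pending = [], ""      # clause boundary or unaligned word
        elif run and is_consecutive(run[-1][2], idx, english_words, function_words):
            run.append((zh, en, idx, pending))
            pending = ""
            # Record every suffix of the run that has at least two units.
            for start in range(len(run) - 1):
                units = run[start:]
                zh_frag = units[0][0] + "".join(u[3] + u[0] for u in units[1:])
                en_frag = " ".join(u[1] for u in units)
                fragments.append((zh_frag, en_frag))
        elif run and en.lower() in function_words and not pending:
            pending = zh               # skip this word once and look ahead
        else:
            run, pending = [(zh, en, idx, "")], ""   # new fragment start
    return fragments


# Hypothetical, shortened version of the example sentence pair.
english_words = ["The", "traditional", "supercomputers", "are good at", "the",
                 "scientific", "engineering", "computing", "only", ",",
                 "the", "mainstream", "of"]
aligned = [("传统", "traditional", 1), ("的", "of", 12),
           ("超级计算机", "supercomputers", 2), ("只", "only", 8),
           ("擅长", "are good at", 3), ("科学", "scientific", 5),
           ("工程", "engineering", 6), ("计算", "computing", 7),
           ("，", None, None)]
fragments = extract_fragments(aligned, english_words, {"the", "of"})
for zh_frag, en_frag in fragments:
    print(zh_frag, "<->", en_frag)
```

On this input the sketch yields seven fragment pairs, in the same order as the worked example: one for "traditional supercomputers" and six from the run starting at "are good at".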
Select more pairs of translated Chinese and English sentences and repeat steps 1 to 5 to obtain a sufficiently large fragment corpus.
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution may be modified or equivalently substituted without departing from its spirit and scope, and all such modifications shall fall within the scope of the claims of the present invention.

Claims (6)

1. A method for generating a Chinese-English fragment corpus, characterized by comprising the following steps:
Step 1, sentence segmentation processing: select a pair of translated Chinese and English sentences, and perform word segmentation on the English sentence and the Chinese sentence respectively;
Step 2: according to the English-Chinese dictionary definitions, find all English words whose definition is identical to a Chinese word, and record the correspondence between each matched English word and the Chinese word;
Step 3: for each English word for which step 2 recorded no correspondence, if the similarity between the dictionary definition of the English word and some Chinese word is above a set threshold, the English word is considered to match the meaning of that Chinese word, and the correspondence is recorded;
Step 4: correct the word correspondences, that is,
traverse the correspondences obtained after step 3 in which multiple English words correspond to one Chinese word, and sort the English words in the order in which they appear in the English sentence; if the word indices are consecutive, confirm the alignment and add it to the English-Chinese dictionary definitions; otherwise, if two neighboring words are not consecutive, check the part of speech of the intervening words: if the intervening words are all function words (non-notional words), record the correspondence including the intervening words; if the intervening words include a word that is not a function word, discard the correspondence between these English words and the Chinese word; merge the recorded correspondences into the English-Chinese dictionary;
Step 5: generate the fragment corpus, that is,
set the fragment start to the first Chinese word, and traverse the Chinese words and their corresponding English words in the order in which the Chinese words appear in the sentence; if the English word index is consecutive, continue to the next English word; if the current fragment contains two or more Chinese words, record the fragment; if the English word index is not consecutive and the word is a content word (notional word), record the fragment without the non-consecutive word and set a new fragment start; if the English word index is not consecutive and the word is a function word, continue to the next English word: if that next word is consecutive, continue; if it is also not consecutive, record the fragment without the last two non-consecutive words and set a new fragment start; when the traversal of Chinese words reaches clause punctuation, reset the fragment start to the next Chinese word.
2. The method for generating a Chinese-English fragment corpus according to claim 1, characterized in that the word segmentation of the English and Chinese sentences comprises:
English sentence segmentation: merge the general English-Chinese dictionary and the English-Chinese terminology dictionary relevant to the aligned bilingual sentences into a single English-Chinese dictionary file, lemmatize the English sentence, and segment it against the dictionary entries using forward maximum matching;
Chinese sentence segmentation: segment the Chinese sentence with a Chinese word segmenter.
3. The method for generating a Chinese-English fragment corpus according to claim 4, characterized in that the Chinese word segmenter has a new-word discovery function.
4. The method for generating a Chinese-English fragment corpus according to claim 1, characterized in that, in step 2, finding all English words whose definition is identical to a Chinese word according to the English-Chinese dictionary definitions, and recording the correspondence between each matched English word and the Chinese word, specifically comprises:
taking the segmented English sentence as the object and starting from the first content word, use the Chinese definitions of the English word in the English-Chinese dictionary to look up the words that appear in the Chinese sentence; if a Chinese word is identical to a definition of the English word, record the correspondence between the English word and that Chinese word; continue with the next English content word until the last word.
5. The method for generating a Chinese-English fragment corpus according to claim 1, characterized in that the similarity threshold is set to 20%.
6. The method for generating a Chinese-English fragment corpus according to claim 1, characterized in that more pairs of translated Chinese and English sentences are selected and steps 1 to 5 are repeated to obtain a sufficiently large fragment corpus.
CN201711160312.9A 2017-11-20 2017-11-20 Method for generating a Chinese-English fragment corpus Pending CN107797995A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711160312.9A 2017-11-20 2017-11-20 Method for generating a Chinese-English fragment corpus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711160312.9A 2017-11-20 2017-11-20 Method for generating a Chinese-English fragment corpus

Publications (1)

Publication Number Publication Date
CN107797995A (en) 2018-03-13

Family

ID=61535901

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711160312.9A Pending CN107797995A (en) Method for generating a Chinese-English fragment corpus

Country Status (1)

Country Link
CN (1) CN107797995A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1801140A (en) * 2004-12-30 2006-07-12 中国科学院自动化研究所 Method and apparatus for automatic acquisition of machine translation template
CN103678287A (en) * 2013-11-30 2014-03-26 武汉传神信息技术有限公司 Method for unifying keyword translation
CN104375988A (en) * 2014-11-04 2015-02-25 北京第二外国语学院 Word and expression alignment method and device
CN105068997A (en) * 2015-07-15 2015-11-18 清华大学 Parallel corpus construction method and device
CN105912522A (en) * 2016-03-31 2016-08-31 长安大学 Automatic extraction method and extractor of English corpora based on constituent analyses

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
李萌涛,孙强华: "《大学英语六级阅读理解集训》", 31 October 2002 *
许鑫: "《基于文本特征计算的信息分析方法》", 30 November 2015 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344389A (en) * 2018-08-15 2019-02-15 中国科学院计算技术研究所 Method and system for constructing Chinese blind comparison bilingual corpus
CN109344389B (en) * 2018-08-15 2020-08-18 中国科学院计算技术研究所 Method and system for constructing Chinese blind comparison bilingual corpus
CN109857746A (en) * 2018-11-09 2019-06-07 语联网(武汉)信息技术有限公司 Automatic updating method and device for bilingual word stock and electronic equipment
CN109857746B (en) * 2018-11-09 2021-05-04 语联网(武汉)信息技术有限公司 Automatic updating method and device for bilingual word stock and electronic equipment
CN109657244A (en) * 2018-12-18 2019-04-19 语联网(武汉)信息技术有限公司 English long sentence automatic segmentation method and system
CN109657244B (en) * 2018-12-18 2023-04-18 语联网(武汉)信息技术有限公司 English long sentence automatic segmentation method and system
CN109918677A (en) * 2019-03-21 2019-06-21 广东小天才科技有限公司 Method and system for semantic parsing of English words
CN110209771A (en) * 2019-06-14 2019-09-06 哈尔滨哈银消费金融有限责任公司 User's geographic information analysis and text mining method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180313