CN107797995A - Method for generating a Chinese-English fragment corpus - Google Patents
Method for generating a Chinese-English fragment corpus
- Publication number
- CN107797995A (application CN201711160312.9A)
- Authority
- CN
- China
- Prior art keywords
- word
- english
- chinese
- sentence
- fragment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
- G06F40/45—Example-based machine translation; Alignment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The present invention relates to the field of machine translation, and in particular to a method for generating a Chinese-English fragment corpus. The method comprises five main steps: sentence processing, exact matching, fuzzy matching, correction of word correspondences, and fragment corpus generation. Compound-word correspondences are identified by using part of speech to check whether the intervening words are function words, and fragments are extracted based on the word correspondences and the function-word intervals. The method provided by the invention is easy to implement and generates fragment corpora with high accuracy, which is of great significance for improving the efficiency of computer-aided translation.
Description
Technical field
The present invention relates to the field of machine translation, and in particular to a method for generating a Chinese-English fragment corpus.
Background
With the development of information technology, international exchange is increasingly frequent, and accurately understanding different languages has become an important need. Machine translation, which addresses communication barriers between speakers of different languages, is an important direction in natural language processing and has received growing attention and development; neural machine translation has replaced statistical machine translation as the industry mainstream. Both the latest neural machine translation and the earlier statistical machine translation are mostly corpus-based. Existing corpora generally comprise single-word corpora and sentence corpora. In actual translation, a single-word corpus resembles an English-Chinese dictionary and is clearly too inefficient for translating articles; and because few sentences are identical across different articles, a sentence corpus is of limited help. What is actually reused most readily across different texts is the fragment: a sequence of consecutive words that is longer than one word and shorter than a sentence. A fragment corpus consists of accurate mutual translations of Chinese and English fragments.
Clearly, a fragment corpus is of great significance for improving translation efficiency. However, existing corpora lack fragment-level material.
Summary of the invention
The technical problem to be solved by the invention is to provide a method for generating a Chinese-English fragment corpus.
To solve this technical problem, the invention provides a method for generating a Chinese-English fragment corpus, comprising the following steps:
Step 1: select a pair of already-translated Chinese and English sentences, and perform word segmentation on the English sentence and the Chinese sentence respectively.
Step 2: according to the definitions in an English-Chinese dictionary, find all English words that have a definition identical to a Chinese word in the sentence, and record the correspondence between each matched English word and the Chinese word.
Step 3: for each English word for which step 2 recorded no correspondence, if the similarity between a dictionary definition of the English word and some Chinese word is above a set threshold, the English word is deemed to match the meaning of that Chinese word, and the correspondence is recorded.
Step 4: correct the word correspondences, namely:
Traverse the correspondences obtained after step 3 in which multiple English words correspond to one Chinese word, and sort the English words by their order of occurrence in the English sentence. If the word indices are consecutive, confirm the alignment and add it to the English-Chinese dictionary definitions. Otherwise, if the two nearest words are not consecutive, check the part of speech of the intervening words: if the intervening words are all function words, record a correspondence that includes the intervening words; if the intervening words include a content word, discard the correspondence between these English words and the Chinese word. Merge the recorded correspondences into the English-Chinese dictionary.
Step 5: generate the fragment corpus, namely:
Set the fragment start to the first Chinese word, and traverse the Chinese words and their corresponding English words in the order in which the Chinese words occur in the sentence. If the English word indices are consecutive, continue to the next English word; if the current fragment contains two or more Chinese words, record the fragment. If the English word index is not consecutive and the word is a content word, record the fragment excluding the non-consecutive word and set a new fragment start. If the English word index is not consecutive and the word is a function word, continue to the next English word: if the next word is consecutive, continue; if the next word is not consecutive, record the fragment excluding the last two non-consecutive words and set a new fragment start. When a clause-delimiting punctuation mark is encountered among the Chinese words, reset the fragment start to the next Chinese word.
When the traversal of the Chinese sentence is complete, the fragment corpus entries have been obtained one by one.
Preferably, the word segmentation of the English and Chinese sentences comprises:
English sentence segmentation: merge a general English-Chinese dictionary and the English-Chinese terminology dictionary relevant to the aligned bilingual sentences into a single English-Chinese dictionary file, lemmatize the English sentence, and segment the English sentence by the forward maximum matching method using the words in the dictionary;
Chinese sentence segmentation: segment the Chinese sentence with a Chinese word segmenter.
Further, the Chinese word segmenter has a new-word discovery function.
In step 2, finding all English words that have a definition identical to a Chinese word according to the English-Chinese dictionary definitions, and recording the correspondence between each matched English word and the Chinese word, specifically comprises:
Taking the segmented English sentence as the object and starting from the first content word, look up the words occurring in the Chinese sentence against the Chinese definitions of the English word in the English-Chinese dictionary; if some Chinese word is identical to a definition of the English word, record the correspondence between the English word and that Chinese word. Continue with the next English content word until the last word.
Preferably, the similarity threshold is set to 20%.
Further, select more already-translated Chinese-English sentence pairs and repeat steps 1 to 5 to obtain a sufficient fragment corpus.
The invention provides a method for generating a Chinese-English fragment corpus. The method is easy to implement and generates fragment corpora with high accuracy, which is of great significance for improving the efficiency of computer-aided translation.
Brief description of the drawings
The technical solution of the invention is described in further detail below with reference to the accompanying drawing and the detailed description.
Fig. 1 is the overall flow figure of the present invention.
Embodiment
With reference to Fig. 1, the invention specifically comprises the following steps:
Step 1: sentence word segmentation
Select a pair of already-translated Chinese and English sentences, and segment the English sentence and the Chinese sentence respectively, including:
English sentence segmentation: merge a general English-Chinese dictionary and the English-Chinese terminology dictionary relevant to the aligned bilingual sentences into a single English-Chinese dictionary file, lemmatize the English sentence, and segment the English sentence by the forward maximum matching method using the words in the dictionary.
Chinese sentence segmentation: segment the Chinese sentence with a Chinese word segmenter that has a new-word discovery function.
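The forward maximum matching used for English segmentation can be illustrated with a minimal sketch. The function name `forward_max_match`, the `max_len` limit, and the toy dictionary are assumptions made for illustration; the input is assumed to be already lemmatized, and the patent does not prescribe any of these details.

```python
def forward_max_match(tokens, dictionary, max_len=4):
    """Greedy left-to-right segmentation: at each position take the longest
    run of tokens found in the dictionary, falling back to a single token."""
    units = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first and shrink until there is a hit.
        for size in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + size])
            if size == 1 or candidate in dictionary:
                units.append(candidate)
                i += size
                break
    return units


# Hypothetical merged dictionary entries and an already-lemmatized sentence.
toy_dictionary = {"be good at", "scientific", "engineering", "computing"}
lemmas = ["be", "good", "at", "the", "scientific", "engineering", "computing"]
segmented = forward_max_match(lemmas, toy_dictionary)
```

With this toy dictionary the multi-word entry "be good at" is kept as one unit, while words without a longer dictionary match fall back to single tokens.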
Step 2: exact matching
Taking the segmented English sentence as the object and starting from the first content word, look up the words occurring in the Chinese sentence against the Chinese definitions of the English word in the English-Chinese dictionary; if some Chinese word is identical to a definition of the English word, record the correspondence between the English word and that Chinese word. Continue with the next English content word until the last word.
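A minimal sketch of the exact-matching step follows. The data layout (a dict from English word to a set of Chinese definitions, and a set of function words) and all names and entries are hypothetical; the patent only states that content words are looked up in order and matched when a definition is identical to a Chinese word.

```python
def exact_match(en_words, zh_words, en_zh_dict, function_words):
    """Return {english_index: chinese_index} for English content words that
    have a dictionary definition identical to a word of the Chinese sentence."""
    links = {}
    for i, en in enumerate(en_words):
        if en in function_words:
            continue  # only content words are looked up
        definitions = en_zh_dict.get(en, ())
        for j, zh in enumerate(zh_words):
            if zh in definitions:
                links[i] = j  # record the first identical definition
                break
    return links


# Toy data, invented for illustration only.
toy_dict = {"traditional": {"传统", "传统的"}, "computer": {"计算机", "电脑"}}
en_sentence = ["the", "traditional", "computer"]
zh_sentence = ["传统", "的", "计算机"]
exact_links = exact_match(en_sentence, zh_sentence, toy_dict, {"the", "of"})
```

English words left without an entry in `exact_links` are the candidates handed to the fuzzy-matching step.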
Step 3: fuzzy matching
For each English word for which step 2 recorded no correspondence, if the similarity between a dictionary definition of the English word and some Chinese word is above the set threshold, the English word is deemed to match the meaning of that Chinese word, and the correspondence is recorded. Multiple English words may correspond to one Chinese word.
For example, the Chinese sentence (glossed here in English) is "The traditional supercomputer is good only at scientific engineering computing, while the superserver handles both kinds of application and is the mainstream of high-end computers", and the English sentence is "The traditional supercomputers are good at the scientific engineering computing only, while this super-server is good at both, thus being the mainstream of the high-end computers." In the Chinese sentence, "superserver" is a single word. The dictionary definition of "super" includes the meaning "super", so "super" corresponds to "superserver"; the dictionary definition of "server" includes the meaning "server", so "server" also corresponds to "superserver". Likewise, "high-end" is a single Chinese word. The dictionary definition of "high" includes the meaning "advanced", whose similarity to "high-end" is (1+1)/(2+2)=25%, above 20%, so "high" corresponds to "high-end"; the dictionary definition of "end" includes the meaning "end", whose similarity to "high-end" is (1+1)/(2+2)=25%, above 20%, so "end" corresponds to "high-end".
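The fuzzy-matching step can be sketched as below. The source does not define its similarity measure, so the sketch assumes a character-overlap (Dice) ratio: shared characters counted in both strings, divided by the total number of characters. This is only one possible reading of the worked expression and is not guaranteed to reproduce the percentages quoted in the example. The function names and the toy dictionary entries are likewise hypothetical.

```python
from collections import Counter


def char_overlap(a, b):
    """Assumed similarity: shared characters counted in both strings,
    divided by the total number of characters in the two strings."""
    if not a or not b:
        return 0.0
    shared = sum((Counter(a) & Counter(b)).values())
    return 2 * shared / (len(a) + len(b))


def fuzzy_match(unmatched_en, zh_words, en_zh_dict, threshold=0.20):
    """unmatched_en: list of (english_index, word) pairs left over from exact
    matching. Returns {english_index: chinese_index} for the best-scoring
    Chinese word whose similarity is above the threshold."""
    links = {}
    for i, en in unmatched_en:
        best_j, best_score = None, threshold
        for j, zh in enumerate(zh_words):
            for definition in en_zh_dict.get(en, ()):
                score = char_overlap(definition, zh)
                if score > best_score:
                    best_j, best_score = j, score
        if best_j is not None:
            links[i] = best_j
    return links


# Toy definitions, invented for illustration only.
fuzzy_dict = {"high": ["高级"], "end": ["末端"]}
fuzzy_links = fuzzy_match([(0, "high"), (1, "end")], ["高端", "计算机"], fuzzy_dict)
```

Both English words end up linked to the same Chinese word, which is the many-to-one situation that step 4 then checks.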
Step 4: correct the word correspondences, namely:
Traverse the correspondences obtained after step 3 in which multiple English words correspond to one Chinese word, and sort the English words by their order of occurrence in the English sentence. If the word indices are consecutive, confirm the alignment and add it to the English-Chinese dictionary definitions. Otherwise, if the two nearest words are not consecutive, check the part of speech of the intervening words: if the intervening words are all function words, record a correspondence that includes the intervening words; if the intervening words include a content word, discard the correspondence between these English words and the Chinese word. Merge the recorded correspondences into the English-Chinese dictionary.
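The correction rule can be sketched as follows. This is a simplification under stated assumptions: all English words linked to the same Chinese word are treated as one span, and the span is kept only if every word in a gap is a function word. The function name, the data layout, and the toy sentences are hypothetical, and the hyphen is listed as a function word only because the description treats it that way.

```python
def merge_many_to_one(links, en_words, function_words):
    """links: {english_index: chinese_index}. Returns
    {chinese_index: (start, end)} with inclusive English spans, dropping any
    group that has a content word in a gap between its linked words."""
    groups = {}
    for i, j in links.items():
        groups.setdefault(j, []).append(i)
    spans = {}
    for j, idxs in groups.items():
        idxs.sort()
        start, end = idxs[0], idxs[-1]
        gap = [k for k in range(start, end + 1) if k not in idxs]
        # Keep the group only if the gap is empty or all function words.
        if all(en_words[k] in function_words for k in gap):
            spans[j] = (start, end)
    return spans


# Toy sentences, invented for illustration only.
fn_words = {"-", "this", "is"}
kept = merge_many_to_one({1: 0, 3: 0}, ["this", "super", "-", "server", "is"], fn_words)
dropped = merge_many_to_one({0: 0, 2: 0}, ["high", "computing", "end"], fn_words)
```

In the first call the gap holds only the hyphen, so the span covering "super - server" is kept; in the second the gap holds a content word, so the group is discarded.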
For example, in the example above, "super" and "server" both correspond to "superserver", and the "-" between them is a function word, so the correspondence between "super-server" and "superserver" is recorded; similarly, "high-end" corresponds to the Chinese word "high-end".
Step 5: generate the fragment corpus
Set the fragment start to the first Chinese word, and traverse the Chinese words and their corresponding English words in the order in which the Chinese words occur in the sentence. If the English word indices are consecutive, continue to the next English word; if the current fragment contains two or more Chinese words, record the fragment. If the English word index is not consecutive and the word is a content word, record the fragment excluding the non-consecutive word and set a new fragment start. If the English word index is not consecutive and the word is a function word, continue to the next English word: if the next word is consecutive, continue; if the next word is not consecutive, record the fragment excluding the last two non-consecutive words and set a new fragment start. When a clause-delimiting punctuation mark such as a comma, full stop, or exclamation mark is encountered among the Chinese words, reset the fragment start to the next Chinese word.
When the traversal of the Chinese sentence is complete, the fragment corpus entries have been obtained one by one.
For example, in the example above (Chinese words are given as English glosses): "tradition" corresponds to "traditional"; the next Chinese word corresponds to "of", which is not consecutive, but "of" is a function word, so the traversal continues; "supercomputer" corresponds to "supercomputers", which is consecutive, so the fragment "traditional supercomputer" is recorded as corresponding to "traditional supercomputers". "Only" corresponds to "only", which is not consecutive, and "only" is a content word, so the fragment start is reset. "Good at" corresponds to "are good at"; it is not consecutive with the last word of the previous fragment, "supercomputer", and "are good at" is a content word, so the fragment start is reset to "good at". Matching continues: "science" corresponds to "scientific", which is consecutive with "good at", so the fragment "good at science" is recorded as corresponding to "are good at scientific". "Engineering" corresponds to "engineering", so the fragments "good at scientific engineering" (corresponding to "are good at scientific engineering") and "scientific engineering" (corresponding to "scientific engineering") are recorded. "Computing" corresponds to "computing", which is consecutive, so the fragments "good at scientific engineering computing" (corresponding to "are good at scientific engineering computing"), "scientific engineering computing" (corresponding to "scientific engineering computing"), and "engineering computing" (corresponding to "engineering computing") are recorded.
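A simplified sketch of fragment generation is given below. It is not a full implementation of step 5: it assumes each aligned Chinese word carries an inclusive English span, it resets the current run at clause punctuation and at any unaligned Chinese word, and it records every suffix of the run that contains at least two Chinese words. The function name, the argument layout, and the toy sentence are hypothetical.

```python
def extract_fragments(zh_words, en_words, spans, function_words, clause_marks):
    """spans: {chinese_index: (en_start, en_end)} with inclusive bounds.
    Returns a list of (chinese_fragment, english_fragment) pairs."""
    fragments = []
    run = []  # consecutive Chinese indices forming the current run
    for j, zh in enumerate(zh_words):
        if zh in clause_marks or j not in spans:
            run = []  # clause boundary or unaligned word: start over
            continue
        if run:
            prev_end = spans[run[-1]][1]
            start = spans[j][0]
            gap = en_words[prev_end + 1:start]
            # The English side must move forward, with only function words
            # (or nothing) between the previous span and this one.
            if not (start > prev_end and all(w in function_words for w in gap)):
                run = []
        run.append(j)
        # Record every suffix of the run with at least two Chinese words.
        for k in range(len(run) - 1):
            zh_frag = "".join(zh_words[run[k]:j + 1])
            en_frag = " ".join(en_words[spans[run[k]][0]:spans[j][1] + 1])
            fragments.append((zh_frag, en_frag))
    return fragments


# Toy aligned sentence, invented for illustration only.
zh_toy = ["擅长", "科学", "工程", "计算", "，", "兼顾"]
en_toy = ["are good at", "scientific", "engineering", "computing", ",", "both"]
toy_spans = {0: (0, 0), 1: (1, 1), 2: (2, 2), 3: (3, 3), 5: (5, 5)}
toy_fragments = extract_fragments(zh_toy, en_toy, toy_spans, {"the", "of"}, {"，", "。", "！"})
```

On this toy input the sketch records six fragment pairs, from the two-word pair starting at the first Chinese word up to the full four-word run and its shorter suffixes, and it starts a new run after the comma.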
Select more already-translated Chinese-English sentence pairs and repeat steps 1 to 5 to obtain a sufficient fragment corpus.
Finally, it should be noted that the above embodiment merely illustrates, and does not limit, the technical solution of the invention. Although the invention has been described in detail with reference to a preferred embodiment, those of ordinary skill in the art will understand that the technical solution may be modified or equivalently substituted without departing from its spirit and scope, and all such modifications fall within the scope of the claims of the invention.
Claims (6)
1. A method for generating a Chinese-English fragment corpus, characterized by comprising the following steps:
Step 1: sentence processing, namely: select a pair of already-translated Chinese and English sentences, and perform word segmentation on the English sentence and the Chinese sentence respectively;
Step 2: according to the definitions in an English-Chinese dictionary, find all English words that have a definition identical to a Chinese word, and record the correspondence between each matched English word and the Chinese word;
Step 3: for each English word for which step 2 recorded no correspondence, if the similarity between a dictionary definition of the English word and some Chinese word is above a set threshold, the English word is deemed to match the meaning of that Chinese word, and the correspondence is recorded;
Step 4: correct the word correspondences, namely:
traverse the correspondences obtained after step 3 in which multiple English words correspond to one Chinese word, and sort the English words by their order of occurrence in the English sentence; if the word indices are consecutive, confirm the alignment and add it to the English-Chinese dictionary definitions; otherwise, if the two nearest words are not consecutive, check the part of speech of the intervening words: if the intervening words are all function words, record a correspondence that includes the intervening words; if the intervening words include a content word, discard the correspondence between these English words and the Chinese word; merge the recorded correspondences into the English-Chinese dictionary;
Step 5: generate the fragment corpus, namely:
set the fragment start to the first Chinese word, and traverse the Chinese words and their corresponding English words in the order in which the Chinese words occur in the sentence; if the English word indices are consecutive, continue to the next English word; if the current fragment contains two or more Chinese words, record the fragment; if the English word index is not consecutive and the word is a content word, record the fragment excluding the non-consecutive word and set a new fragment start; if the English word index is not consecutive and the word is a function word, continue to the next English word: if the next word is consecutive, continue; if the next word is not consecutive, record the fragment excluding the last two non-consecutive words and set a new fragment start; when a clause-delimiting punctuation mark is encountered among the Chinese words, reset the fragment start to the next Chinese word.
2. The method for generating a Chinese-English fragment corpus according to claim 1, characterized in that the word segmentation of the English and Chinese sentences comprises:
English sentence segmentation: merge a general English-Chinese dictionary and the English-Chinese terminology dictionary relevant to the aligned bilingual sentences into a single English-Chinese dictionary file, lemmatize the English sentence, and segment the English sentence by the forward maximum matching method using the words in the dictionary;
Chinese sentence segmentation: segment the Chinese sentence with a Chinese word segmenter.
3. The method for generating a Chinese-English fragment corpus according to claim 4, characterized in that the Chinese word segmenter has a new-word discovery function.
4. The method for generating a Chinese-English fragment corpus according to claim 1, characterized in that, in step 2, finding all English words that have a definition identical to a Chinese word according to the English-Chinese dictionary definitions, and recording the correspondence between each matched English word and the Chinese word, specifically comprises:
taking the segmented English sentence as the object and starting from the first content word, look up the words occurring in the Chinese sentence against the Chinese definitions of the English word in the English-Chinese dictionary; if some Chinese word is identical to a definition of the English word, record the correspondence between the English word and that Chinese word; continue with the next English content word until the last word.
5. The method for generating a Chinese-English fragment corpus according to claim 1, characterized in that the similarity threshold is set to 20%.
6. The method for generating a Chinese-English fragment corpus according to claim 1, characterized in that more already-translated Chinese-English sentence pairs are selected and steps 1 to 5 are repeated to obtain a sufficient fragment corpus.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711160312.9A CN107797995A (en) | 2017-11-20 | 2017-11-20 | A kind of Chinese and English fragment language material generation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711160312.9A CN107797995A (en) | 2017-11-20 | 2017-11-20 | A kind of Chinese and English fragment language material generation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107797995A true CN107797995A (en) | 2018-03-13 |
Family
ID=61535901
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711160312.9A Pending CN107797995A (en) | 2017-11-20 | 2017-11-20 | A kind of Chinese and English fragment language material generation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107797995A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109344389A (en) * | 2018-08-15 | 2019-02-15 | 中国科学院计算技术研究所 | A kind of construction method and system of the blind control bilingualism corpora of the Chinese |
CN109657244A (en) * | 2018-12-18 | 2019-04-19 | 语联网(武汉)信息技术有限公司 | A kind of English long sentence automatic segmentation method and system |
CN109857746A (en) * | 2018-11-09 | 2019-06-07 | 语联网(武汉)信息技术有限公司 | Automatic update method, device and the electronic equipment of bilingual word bank |
CN109918677A (en) * | 2019-03-21 | 2019-06-21 | 广东小天才科技有限公司 | A kind of method and system of English word semanteme parsing |
CN110209771A (en) * | 2019-06-14 | 2019-09-06 | 哈尔滨哈银消费金融有限责任公司 | User's geographic information analysis and text mining method and apparatus |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1801140A (en) * | 2004-12-30 | 2006-07-12 | 中国科学院自动化研究所 | Method and apparatus for automatic acquisition of machine translation template |
CN103678287A (en) * | 2013-11-30 | 2014-03-26 | 武汉传神信息技术有限公司 | Method for unifying keyword translation |
CN104375988A (en) * | 2014-11-04 | 2015-02-25 | 北京第二外国语学院 | Word and expression alignment method and device |
CN105068997A (en) * | 2015-07-15 | 2015-11-18 | 清华大学 | Parallel corpus construction method and device |
CN105912522A (en) * | 2016-03-31 | 2016-08-31 | 长安大学 | Automatic extraction method and extractor of English corpora based on constituent analyses |
Non-Patent Citations (2)
Title |
---|
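Li Mengtao, Sun Qianghua: "College English Test Band 6 Reading Comprehension Intensive Training" (《大学英语六级阅读理解集训》), 31 October 2002 * |
Xu Xin: "Information Analysis Methods Based on Text Feature Computation" (《基于文本特征计算的信息分析方法》), 30 November 2015 * |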
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109344389A (en) * | 2018-08-15 | 2019-02-15 | 中国科学院计算技术研究所 | A kind of construction method and system of the blind control bilingualism corpora of the Chinese |
CN109344389B (en) * | 2018-08-15 | 2020-08-18 | 中国科学院计算技术研究所 | Method and system for constructing Chinese blind comparison bilingual corpus |
CN109857746A (en) * | 2018-11-09 | 2019-06-07 | 语联网(武汉)信息技术有限公司 | Automatic update method, device and the electronic equipment of bilingual word bank |
CN109857746B (en) * | 2018-11-09 | 2021-05-04 | 语联网(武汉)信息技术有限公司 | Automatic updating method and device for bilingual word stock and electronic equipment |
CN109657244A (en) * | 2018-12-18 | 2019-04-19 | 语联网(武汉)信息技术有限公司 | A kind of English long sentence automatic segmentation method and system |
CN109657244B (en) * | 2018-12-18 | 2023-04-18 | 语联网(武汉)信息技术有限公司 | English long sentence automatic segmentation method and system |
CN109918677A (en) * | 2019-03-21 | 2019-06-21 | 广东小天才科技有限公司 | A kind of method and system of English word semanteme parsing |
CN110209771A (en) * | 2019-06-14 | 2019-09-06 | 哈尔滨哈银消费金融有限责任公司 | User's geographic information analysis and text mining method and apparatus |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107797995A (en) | A kind of Chinese and English fragment language material generation method | |
Boudin et al. | Keyphrase extraction for n-best reranking in multi-sentence compression | |
Zhou et al. | Resolving surface forms to wikipedia topics | |
CN102214189B (en) | Data mining-based word usage knowledge acquisition system and method | |
El-Shishtawy et al. | An accurate arabic root-based lemmatizer for information retrieval purposes | |
Dziob et al. | plWordNet 4.1-a linguistically motivated, corpus-based bilingual resource | |
CN101308512B (en) | Mutual translation pair extraction method and device based on web page | |
CN105630770A (en) | Word segmentation phonetic transcription and ligature writing method and device based on SC grammar | |
Simard | Building and using parallel text for translation | |
CN108132917B (en) | Document error correction marking method | |
Ashna et al. | Lexicon based sentiment analysis system for malayalam language | |
Fu et al. | Generating chinese named entity data from a parallel corpus | |
Van Der Goot et al. | Lexical normalization for code-switched data and its effect on POS-tagging | |
Sun et al. | GEDIT: geographic-enhanced and dependency-guided tagging for joint POI and accessibility extraction at *** maps | |
Sagot et al. | Error mining in parsing results | |
Sembok et al. | Arabic word stemming algorithms and retrieval effectiveness | |
Attia et al. | Gwu-hasp: Hybrid arabic spelling and punctuation corrector | |
Hu et al. | CSCD-IME: correcting spelling errors generated by pinyin IME | |
Nguyen et al. | An approach to construct a named entity annotated English-Vietnamese bilingual corpus | |
Arcan | A comparison of statistical and neural machine translation for Slovene, Serbian and Croatian | |
Yamamoto et al. | Learning sequence-to-sequence correspondences from parallel corpora via sequential pattern mining | |
Guo et al. | Character-level dependency model for joint word segmentation, POS tagging, and dependency parsing in Chinese | |
Garcia | Comparing bilingual word embeddings to translation dictionaries for extracting multilingual collocation equivalents | |
Chakrawarti et al. | Phrase-Based Statistical Machine Translation of Hindi Poetries into English | |
Zitouni et al. | Cross-language information propagation for arabic mention detection |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20180313 |