CN107797995A

CN107797995A - Method for generating a Chinese-English fragment corpus

Info

Publication number: CN107797995A
Application number: CN201711160312.9A
Authority: CN
Inventors: 宋安琪
Original assignee: Language Network (wuhan) Information Technology Co Ltd
Current assignee: Language Network (wuhan) Information Technology Co Ltd
Priority date: 2017-11-20
Filing date: 2017-11-20
Publication date: 2018-03-13

Abstract

The present invention relates to the field of machine translation, and in particular to a method for generating a Chinese-English fragment corpus. The method comprises five main steps: sentence processing, exact matching, fuzzy matching, correction of word correspondences, and generation of the fragment corpus. Compound-word correspondences are found by using part of speech to judge whether the intervening words are non-notional words, and fragments are extracted on the basis of the word correspondences and the non-notional-word intervals. The method provided by the invention is easy to implement and generates fragment corpora with high accuracy, which is of great significance for improving the efficiency of computer-aided translation.

Description

Method for generating a Chinese-English fragment corpus

Technical field

The present invention relates to the field of machine translation, and in particular to a method for generating a Chinese-English fragment corpus.

Background art

With the development of information technology, international exchange has become increasingly frequent, and accurately understanding different languages has become an important need. Machine translation, which addresses the communication barriers between speakers of different languages, is an important direction in natural language processing and has received growing attention and development; machine translation based on neural networks has replaced the earlier statistical machine translation as the industry mainstream. Both the latest neural machine translation and the earlier statistical machine translation mostly rely on corpora. Existing corpora generally comprise single-word corpora and sentence corpora. In actual translation, a single-word corpus resembles an English-Chinese dictionary, and its efficiency is clearly insufficient for translating whole articles; and because identical sentences are rare across different translated articles, a sentence corpus is of limited help to translation. What is in fact easily reused across different texts is the fragment: a set of several consecutive words that is longer than one word and shorter than a sentence. A fragment corpus is then the accurate mutual-translation text of Chinese and English fragments.

Clearly, a fragment corpus is of great significance for improving translation efficiency. However, existing corpora lack entries at the fragment level.

Summary of the invention

The technical problem to be solved by the invention is to provide a method for generating a Chinese-English fragment corpus, so that fragment corpora can be generated.

To solve the above technical problem, the invention provides a method for generating a Chinese-English fragment corpus, comprising the following steps:

Step 1, select a pair of translated Chinese and English sentences, and perform word segmentation on the English sentence and the Chinese sentence respectively;

Step 2, according to the English-Chinese dictionary definitions, find all English words whose definition is identical to a Chinese word, and record the correspondence between each matched English word and the Chinese word;

Step 3, for each English word for which step 2 has not recorded a correspondence, if the similarity between the dictionary definition of the English word and some Chinese word is above a set threshold, the English word is considered to match the meaning of that Chinese word, and the correspondence is recorded;

Step 4, correct the word correspondences, namely:

Traverse the correspondences obtained after step 3 in which multiple English words correspond to one Chinese word, and sort the English words by the order in which they occur in the English sentence. If the word sequence numbers are consecutive, confirm the alignment and add it to the English-Chinese dictionary definitions. Otherwise, if the two nearest words are not consecutive, judge the part of speech of the intervening words: if the intervening words are all non-notional words, record the correspondence including the intervening words; if the two nearest words are not consecutive and a notional word is present among the intervening words, discard the correspondence between the multiple English words and the Chinese word. Merge the recorded correspondences into the English-Chinese dictionary;

Step 5, generate the fragment corpus, namely:

Set the fragment start to the first Chinese word, and traverse the Chinese words and their corresponding English words in the order in which the Chinese words occur in the sentence. If the English word sequence number is consecutive, continue to the next English word; if the current fragment contains two or more Chinese words, record the fragment. If the English word sequence number is not consecutive and the word is a notional word, record the fragment excluding the non-consecutive word, and set a new fragment start. If the English word sequence number is not consecutive and the word is a non-notional word, continue to the next English word: if the next word is consecutive, continue; if the next word is not consecutive, record the fragment excluding the last two non-consecutive words, and set a new fragment start. When the traversal of Chinese words meets a sentence-dividing punctuation mark, reset the fragment start to the next Chinese word.

When traversal of the Chinese sentence is complete, the fragment corpus entries have been obtained one by one.

Preferably, the word segmentation of the English and Chinese sentences comprises:

English sentence segmentation: merge a general English-Chinese dictionary and the English-Chinese terminology dictionary relevant to the aligned bilingual sentences into a single English-Chinese dictionary file, lemmatize the English sentence, and segment the English sentence against the words in the dictionary using the forward maximum matching method;

Chinese sentence segmentation: segment the Chinese sentence using a Chinese word segmenter.

Further, the Chinese word segmenter has a new-word discovery function.

The operation in step 2 of finding, according to the English-Chinese dictionary definitions, all English words whose definition is identical to a Chinese word, and recording the correspondence between each matched English word and the Chinese word, specifically comprises:

Taking the segmented English sentence as the object and starting from the first notional word, look up the words occurring in the Chinese sentence according to the Chinese definitions of the English word in the English-Chinese dictionary. If some Chinese word is identical to a definition of the English word, record the correspondence between the English word and that Chinese word. Continue with the next English notional word, until the last word.
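The exact-matching pass described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `exact_match`, the dictionary layout (English word mapped to a set of Chinese definitions) and the `function_words` set used to skip non-notional words are all assumptions made for the example.

```python
from typing import Dict, List, Set, Tuple


def exact_match(
    english_tokens: List[str],
    chinese_tokens: List[str],
    dictionary: Dict[str, Set[str]],
    function_words: Set[str],
) -> List[Tuple[int, int]]:
    """Return (english_index, chinese_index) pairs whose definition matches exactly.

    `dictionary` and `function_words` are hypothetical inputs: the text only
    says that an English-Chinese dictionary is consulted and that the pass
    runs over notional words.
    """
    pairs: List[Tuple[int, int]] = []
    for ei, en in enumerate(english_tokens):
        if en in function_words:  # non-notional words are not looked up
            continue
        definitions = dictionary.get(en, set())
        for ci, zh in enumerate(chinese_tokens):
            if zh in definitions:  # definition identical to the Chinese word
                pairs.append((ei, ci))
    return pairs
```

Called with segmented token lists for one sentence pair, the function returns index pairs, so later steps can reason about word order in both sentences.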

Preferably, the similarity threshold is set to 20%.

Further, more translated Chinese-English sentence pairs are selected and steps 1 to 5 above are repeated, so as to obtain a sufficient number of fragment corpus entries.

The invention provides a method for generating a Chinese-English fragment corpus, so that fragment corpora can be generated. The method is easy to implement, and the generated fragment corpus is highly accurate, which is of great significance for improving the efficiency of computer-aided translation.

Brief description of the drawings

The technical solution of the invention is described in further detail below with reference to the accompanying drawing and the specific embodiment.

Fig. 1 is the overall flow chart of the present invention.

Detailed description of the embodiment

With reference to Fig. 1, the invention specifically comprises the following steps:

Step 1, word segmentation of the sentences

Select a pair of translated Chinese and English sentences, and segment the English sentence and the Chinese sentence respectively, as follows.

English sentence segmentation: merge a general English-Chinese dictionary and the English-Chinese terminology dictionary relevant to the aligned bilingual sentences into a single English-Chinese dictionary file, lemmatize the English sentence, and segment the English sentence against the words in the dictionary using the forward maximum matching method.

Chinese sentence segmentation: segment the Chinese sentence using a Chinese word segmenter that has a new-word discovery function.
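Forward maximum matching against the merged dictionary can be sketched as below. The sketch assumes the sentence has already been lemmatized and split on whitespace; the function name `forward_max_match`, the `lexicon` set of dictionary headwords and the `max_len` limit on entry length are illustrative assumptions, not details given in the text.

```python
from typing import List, Set


def forward_max_match(words: List[str], lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation that prefers the longest lexicon entry.

    `words` is a lemmatized, whitespace-split English sentence. `max_len`
    (hypothetical) caps how many words a dictionary entry may span.
    """
    tokens: List[str] = []
    i = 0
    while i < len(words):
        # Try the longest candidate first and shrink until an entry is found.
        for n in range(min(max_len, len(words) - i), 0, -1):
            candidate = " ".join(words[i:i + n])
            # A single word is always accepted, so the loop cannot stall.
            if n == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += n
                break
    return tokens
```

With a lexicon containing the multi-word entry "be good at", the lemmatized input `they be good at scientific computing` is segmented so that "be good at" is kept as one unit, which is what lets a phrase be aligned to a single Chinese word later.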

Step 2, exact matching

Step 3, fuzzy matching

For each English word for which step 2 has not recorded a correspondence, if the similarity between the dictionary definition of the English word and some Chinese word is above the set threshold, the English word is considered to match the meaning of that Chinese word, and the correspondence is recorded. Multiple English words may correspond to one Chinese word.

For example, the Chinese sentence (given here in English gloss) is "traditional supercomputers are only good at scientific engineering computing, whereas the superserver serves both kinds of application and is the mainstream of high-end computers", and the English sentence is "The traditional supercomputers are good at the scientific engineering computing only, while this super-server is good at both, thus being the mainstream of the high-end computers." "Superserver" is a single Chinese word. The dictionary definition of "super" includes the meaning "super", so "super" corresponds to "superserver"; the dictionary definition of "server" includes the meaning "server", so "server" also corresponds to "superserver". "High-end" is a single Chinese word. The dictionary definition of "high" includes the meaning "advanced", whose similarity to "high-end" is (1+1)/(2+2)=25%, which exceeds 20%, so "high" corresponds to "high-end"; the dictionary definition of "end" includes the meaning "end", whose similarity to "high-end" is (1+1)/(2+2)=25%, which exceeds 20%, so "end" corresponds to "high-end".
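A minimal sketch of the fuzzy-matching pass follows. The text does not define how similarity is computed, so the character-overlap ratio `char_overlap` below is an assumption chosen for illustration and is not claimed to reproduce the percentages in the worked example. The Chinese strings in the usage note are likewise hypothetical stand-ins.

```python
from typing import Dict, List, Set, Tuple


def char_overlap(a: str, b: str) -> float:
    """Assumed similarity: shared characters counted in both strings, over total length."""
    if not a or not b:
        return 0.0
    shared_a = sum(1 for ch in a if ch in b)
    shared_b = sum(1 for ch in b if ch in a)
    return (shared_a + shared_b) / (len(a) + len(b))


def fuzzy_match(
    english_tokens: List[str],
    chinese_tokens: List[str],
    dictionary: Dict[str, List[str]],
    matched_english: Set[int],
    threshold: float = 0.20,
) -> List[Tuple[int, int]]:
    """Pair still-unmatched English words with Chinese words above the threshold."""
    pairs: List[Tuple[int, int]] = []
    for ei, en in enumerate(english_tokens):
        if ei in matched_english:  # already handled by exact matching
            continue
        for ci, zh in enumerate(chinese_tokens):
            best = max((char_overlap(d, zh) for d in dictionary.get(en, [])), default=0.0)
            if best > threshold:
                pairs.append((ei, ci))
    return pairs
```

For instance, with hypothetical definitions `{"high": ["高级"], "end": ["末端"]}` and the Chinese token `高端`, both "high" and "end" are paired with the same Chinese word, which is the many-to-one situation that step 4 then resolves.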

Step 4, correct the word correspondences, namely:

Traverse the correspondences obtained after step 3 in which multiple English words correspond to one Chinese word, and sort the English words by the order in which they occur in the English sentence. If the word sequence numbers are consecutive, confirm the alignment and add it to the English-Chinese dictionary definitions. Otherwise, if the two nearest words are not consecutive, judge the part of speech of the intervening words: if the intervening words are all non-notional words, record the correspondence including the intervening words; if the two nearest words are not consecutive and a notional word is present among the intervening words, discard the correspondence between the multiple English words and the Chinese word. Merge the recorded correspondences into the English-Chinese dictionary.

For example, in the example above "super" and "server" both correspond to "superserver", and the intervening "-" is a non-notional word, so the correspondence between "super-server" and "superserver" is recorded. Similarly, the English "high-end" is recorded as corresponding to the Chinese word "high-end".
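The correction rule can be sketched for a single Chinese word as follows. The function name `merge_compound` and the `function_words` set standing in for the part-of-speech test are assumptions; a real implementation would presumably use a part-of-speech tagger to decide which words are non-notional.

```python
from typing import List, Optional, Set, Tuple


def merge_compound(
    indices: List[int],
    english_tokens: List[str],
    function_words: Set[str],
) -> Optional[Tuple[int, int]]:
    """Merge the English words aligned to one Chinese word into a single span.

    Returns (first_index, last_index) when every gap between neighbouring
    aligned words contains only non-notional words, and None when a notional
    word intervenes, in which case the correspondence is discarded.
    """
    ordered = sorted(indices)
    for left, right in zip(ordered, ordered[1:]):
        gap = english_tokens[left + 1:right]  # empty when the words are consecutive
        if any(tok not in function_words for tok in gap):
            return None
    return (ordered[0], ordered[-1])
```

On the tokens `["this", "super", "-", "server", "is", "good"]` with "-" treated as non-notional, the indices of "super" and "server" merge into the span covering "super - server", whereas "super" and "good" do not merge because the notional word "server" lies between them.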

Step 5, generate the fragment corpus

Set the fragment start to the first Chinese word, and traverse the Chinese words and their corresponding English words in the order in which the Chinese words occur in the sentence. If the English word sequence number is consecutive, continue to the next English word. If the current fragment contains two or more Chinese words, record the fragment. If the English word sequence number is not consecutive and the word is a notional word, record the fragment excluding the non-consecutive word, and set a new fragment start. If the English word sequence number is not consecutive and the word is a non-notional word, continue to the next English word: if the next word is consecutive, continue; if the next word is not consecutive, record the fragment excluding the last two non-consecutive words, and set a new fragment start. When the traversal of Chinese words meets a sentence-dividing punctuation mark such as a comma, full stop or exclamation mark, reset the fragment start to the next Chinese word.

When traversal of the Chinese sentence is complete, the fragment corpus entries have been obtained one by one.

For example, in the example above "tradition" corresponds to "traditional"; " " corresponds to "of", which is not consecutive, but "of" is a non-notional word, so the traversal continues; "supercomputer" corresponds to "supercomputers", which is consecutive, so the fragment "traditional computer" is recorded as corresponding to "traditional supercomputers". "Only" corresponds to "only", which is not consecutive, and "only" is a notional word, so the fragment start is reset. "Being good at" corresponds to "are good at"; it is not consecutive with the last word of the previous fragment, "supercomputer", and "are good at" is a notional word, so the fragment start is reset to "being good at". Matching continues: "science" corresponds to "scientific", which is consecutive with "being good at", so the fragment "being good at science" is recorded as corresponding to "are good at scientific". "Engineering" corresponds to "engineering", so the fragment "being good at Scientific Engineering" is recorded as corresponding to "are good at scientific engineering", and "Scientific Engineering" as corresponding to "scientific engineering". "Calculating" corresponds to "computing", which is consecutive, so the fragment "being good at scientific engineering computing" is recorded as corresponding to "are good at scientific engineering computing", "scientific engineering computing" as corresponding to "scientific engineering computing", and "engineering calculation" as corresponding to "engineering computing".
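A simplified sketch of fragment extraction is given below. It is an illustration under stated assumptions, not the full procedure: it treats a gap made up only of non-notional English words as consecutive, resets the run on any unaligned Chinese token (used here for punctuation), and omits the look-ahead rule for non-consecutive non-notional words. Recording every suffix of the current run that has at least two Chinese words is inferred from the worked example. The names `extract_fragments`, `Aligned` and `function_words` are hypothetical.

```python
from typing import List, Optional, Set, Tuple

# (Chinese word, English span as (start, end) indices, or None if unaligned)
Aligned = Tuple[str, Optional[Tuple[int, int]]]


def extract_fragments(
    aligned: List[Aligned],
    english_tokens: List[str],
    function_words: Set[str],
) -> List[Tuple[str, str]]:
    """Return (Chinese fragment, English fragment) pairs from one aligned sentence."""
    fragments: List[Tuple[str, str]] = []
    run: List[Tuple[str, Tuple[int, int]]] = []  # current run of aligned Chinese words
    for zh, span in aligned:
        if span is None:  # punctuation or unaligned word: start a new fragment
            run = []
            continue
        if run:
            prev_end = run[-1][1][1]
            gap = english_tokens[prev_end + 1:span[0]]
            contiguous = span[0] > prev_end and all(t in function_words for t in gap)
            if not contiguous:  # English order is broken: restart at this word
                run = []
        run.append((zh, span))
        # Record every suffix of the run that contains at least two Chinese words.
        for start in range(len(run) - 1):
            zh_part = "".join(word for word, _ in run[start:])
            en_part = " ".join(english_tokens[run[start][1][0]:span[1] + 1])
            fragments.append((zh_part, en_part))
    return fragments
```

Because the English side is cut from the original token list, intervening non-notional words such as "the" stay inside the recorded English fragment in this sketch.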

More translated Chinese-English sentence pairs are selected and steps 1 to 5 above are repeated, so as to obtain a sufficient number of fragment corpus entries.

Finally, it should be noted that the above embodiment merely illustrates the technical solution of the invention and does not limit it. Although the invention has been described in detail with reference to a preferred embodiment, those of ordinary skill in the art will understand that modifications or equivalent substitutions may be made to the technical solution of the invention without departing from its spirit and scope, and all such modifications and substitutions shall fall within the scope of the claims of the invention.

Claims

1. A method for generating a Chinese-English fragment corpus, characterized by comprising the following steps:

Step 1, sentence processing, namely selecting a pair of translated Chinese and English sentences, and performing word segmentation on the English sentence and the Chinese sentence respectively;

Step 2, according to the English-Chinese dictionary definitions, finding all English words whose definition is identical to a Chinese word, and recording the correspondence between each matched English word and the Chinese word;

Step 3, for each English word for which step 2 has not recorded a correspondence, if the similarity between the dictionary definition of the English word and some Chinese word is above a set threshold, considering the English word to match the meaning of that Chinese word, and recording the correspondence;

Step 4, correcting the word correspondences, namely:

traversing the correspondences obtained after step 3 in which multiple English words correspond to one Chinese word, and sorting the English words by the order in which they occur in the English sentence; if the word sequence numbers are consecutive, confirming the alignment and adding it to the English-Chinese dictionary definitions; otherwise, if the two nearest words are not consecutive, judging the part of speech of the intervening words, and if the intervening words are all non-notional words, recording the correspondence including the intervening words; if the two nearest words are not consecutive and a notional word is present among the intervening words, discarding the correspondence between the multiple English words and the Chinese word; and merging the recorded correspondences into the English-Chinese dictionary;

Step 5, generating the fragment corpus, namely:

setting the fragment start to the first Chinese word, and traversing the Chinese words and their corresponding English words in the order in which the Chinese words occur in the sentence; if the English word sequence number is consecutive, continuing to the next English word; if the current fragment contains two or more Chinese words, recording the fragment; if the English word sequence number is not consecutive and the word is a notional word, recording the fragment excluding the non-consecutive word, and setting a new fragment start; if the English word sequence number is not consecutive and the word is a non-notional word, continuing to the next English word, and if the next word is consecutive, continuing; if the next word is not consecutive, recording the fragment excluding the last two non-consecutive words, and setting a new fragment start; and when the traversal of Chinese words meets a sentence-dividing punctuation mark, resetting the fragment start to the next Chinese word.

2. The method for generating a Chinese-English fragment corpus according to claim 1, characterized in that the word segmentation of the English and Chinese sentences comprises:

English sentence segmentation: merging a general English-Chinese dictionary and the English-Chinese terminology dictionary relevant to the aligned bilingual sentences into a single English-Chinese dictionary file, lemmatizing the English sentence, and segmenting the English sentence against the words in the dictionary using the forward maximum matching method;

3. The method for generating a Chinese-English fragment corpus according to claim 4, characterized in that the Chinese word segmenter has a new-word discovery function.

4. The method for generating a Chinese-English fragment corpus according to claim 1, characterized in that the operation in step 2 of finding, according to the English-Chinese dictionary definitions, all English words whose definition is identical to a Chinese word, and recording the correspondence between each matched English word and the Chinese word, specifically comprises:

taking the segmented English sentence as the object and starting from the first notional word, looking up the words occurring in the Chinese sentence according to the Chinese definitions of the English word in the English-Chinese dictionary; if some Chinese word is identical to a definition of the English word, recording the correspondence between the English word and that Chinese word; and continuing with the next English notional word, until the last word.

5. The method for generating a Chinese-English fragment corpus according to claim 1, characterized in that the similarity threshold is set to 20%.

6. The method for generating a Chinese-English fragment corpus according to claim 1, characterized in that more translated Chinese-English sentence pairs are selected and steps 1 to 5 above are repeated, so as to obtain a sufficient number of fragment corpus entries.