CN109446537A - Translation evaluation method and device for machine translation

Info

Publication number: CN109446537A
Application number: CN201811306229.2A
Authority: CN
Inventors: 詹文法; 邵志伟; 陶鹏程; 张振林; 刘德阳
Original assignee: Anqing Normal University
Current assignee: Anqing Normal University
Priority date: 2018-11-05
Filing date: 2018-11-05
Publication date: 2019-03-08
Anticipated expiration: 2038-11-05
Also published as: CN109446537B

Abstract

The invention discloses a translation evaluation method and device for machine translation. The method comprises: obtaining several corpus entries from a corpus and obtaining, for each corpus entry, the concatenation result of the context word vectors it contains; initializing the word vectors of the words of different parts of speech contained in the corpus entries; using the concatenation result and the word vectors as the input of a CBOW model to obtain a trained CBOW model; obtaining the target word of each corpus entry and translating it using the trained CBOW model; and obtaining the translation of the target word produced by the model to be evaluated, and evaluating the accuracy of that model's translation according to the similarity between its translation and the translation produced by the trained CBOW model. With embodiments of the invention, the accuracy of translation results can be evaluated automatically.

Description

Translation evaluation method and device for machine translation

Technical field

The present invention relates to a translation evaluation method and device, and more particularly to a translation evaluation method and device for machine translation.

Background art

With the development of modern society, the demand for conversion between languages keeps growing. In practice, traditional machine translation is rule-based: it relies on grammatical matching relationships derived from syntactic and semantic theory and obtains the translation result by analyzing the context. However, because rules cannot cover all sentences, traditional machine translation mostly produces literal translations or conversions of sentence patterns.

With the continuous development of artificial intelligence, neural-network-based representation learning has begun to stand out in many fields. In particular, in many tasks based on image recognition and speech recognition, methods based on representation learning have surpassed traditional methods based on statistical learning. Modern machine translation methods are based on a "bilingual corpus": they use a bilingual corpus containing many sentence patterns, retrieve example sentences similar to the input sentence according to the sentence patterns in the corpus, and then convert the source language into the target language with reference to the bilingual sentence patterns.

Natural language is an abstract expression of human intelligence and is difficult to represent with existing data structures. In natural language processing, the basic unit of data is the character or the word. A word such as "apple" can denote a kind of fruit, but it can also denote "Apple Inc.". Conversely, two different words for "microphone" denote the same article, yet no correct connection between them can be established from their literal form. At present, most translation systems can correctly translate the general meaning of a sentence. However, word and sentence usage differs markedly between languages, and translation results often contain problems such as word-order errors, mixed-up words, and misused words. For long sentences in particular, machine translation cannot reach good accuracy, so the prior art has the technical problem that translation results still need manual evaluation.

Summary of the invention

The technical problem to be solved by the present invention is to provide a translation evaluation method and device for machine translation, so as to solve the technical problem in the prior art that translation results still need manual evaluation.

The present invention solves the above technical problem through the following technical solutions:

An embodiment of the invention provides a translation evaluation method for machine translation, the method comprising:

obtaining several corpus entries from a corpus, and obtaining, for each corpus entry, the concatenation result of the context word vectors it contains; and initializing the word vectors of the words of different parts of speech contained in the corpus entries;

using the concatenation result and the word vectors as the input of a CBOW model to obtain a trained CBOW model;

obtaining the target word of each corpus entry, and translating it using the trained CBOW model;

obtaining the translation of the target word produced by the model to be evaluated, and evaluating the accuracy of that model's translation according to the similarity between its translation and the translation produced by the trained CBOW model.

Optionally, initializing the word vectors of the words of different parts of speech contained in the corpus entries comprises:

initializing the word vectors of the words of different parts of speech contained in the corpus entries using mutually non-overlapping value ranges, one range per part of speech.

Optionally, before using the concatenation result and the word vectors as the input of the CBOW model to obtain the trained CBOW model, the method further comprises:

removing the punctuation marks in each corpus entry other than the set punctuation marks, wherein the set punctuation marks comprise one or a combination of: punctuation marks that express the tone of the corpus entry, and punctuation marks that end the corpus entry.

Optionally, obtaining the target word of each corpus entry comprises:

obtaining the target word of each corpus entry using a formula (not reproduced in this text), wherein

P(w | c) is the probability of the target word; w is the target word; c is the context of the target word; exp() is the exponential function with the natural base; x is the input layer of the CBOW model; ∑ is the summation function; V is the corpus; ()^T denotes the matrix transpose.
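Assuming the standard CBOW softmax, a formula consistent with the symbols listed above can be sketched as follows. The output-side word vector e'(w) is an assumed symbol introduced only for illustration; it does not appear in the surviving text, so this is a plausible form and not necessarily the formula of the original filing.

```latex
P(w \mid c) = \frac{\exp\left(e'(w)^{T} x\right)}{\sum_{w' \in V} \exp\left(e'(w')^{T} x\right)}
```

Under this reading, the target word is the word w with the highest probability P(w | c) given the input layer x built from the context c.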

Optionally, each corpus entry is an individual sentence.

An embodiment of the invention provides a translation evaluation device for machine translation, the device comprising:

an acquisition module, configured to obtain several corpus entries from a corpus and to obtain, for each corpus entry, the concatenation result of the context word vectors it contains; and to initialize the word vectors of the words of different parts of speech contained in the corpus entries;

Optionally, the acquisition module is configured to:

Optionally, the device further comprises a removal module, configured to remove the punctuation marks in each corpus entry other than the set punctuation marks, wherein the set punctuation marks comprise one or a combination of: punctuation marks that express the tone of the corpus entry, and punctuation marks that end the corpus entry.

Optionally, the acquisition module is configured to:

obtain the target word of each corpus entry using a formula (not reproduced in this text), wherein

Optionally, each corpus entry is an individual sentence.

Compared with the prior art, the present invention has the following advantages:

With embodiments of the invention, because context word order plays an important role in translation, a more accurate translation model can be obtained from the concatenation result of the context word vectors contained in each corpus entry. The model trained according to embodiments of the invention can then be used to check the translation results of prior-art models. Whereas the prior art requires manual evaluation, embodiments of the invention can evaluate the accuracy of translation results automatically.

Brief description of the drawings

Fig. 1 is a flow diagram of a translation evaluation method for machine translation provided by an embodiment of the invention;

Fig. 2 is a structural diagram of a CBOW model provided by an embodiment of the invention;

Fig. 3 is a structural diagram of a translation evaluation device for machine translation provided by an embodiment of the invention.

Detailed description of the embodiments

The embodiments of the present invention are described in detail below. The embodiments are implemented on the premise of the technical solution of the present invention, and detailed implementations and specific operating procedures are given, but the protection scope of the present invention is not limited to the following embodiments.

Embodiments of the invention provide a translation evaluation method and device for machine translation. The translation evaluation method for machine translation provided by an embodiment of the invention is introduced first.

Fig. 1 is a flow diagram of a translation evaluation method for machine translation provided by an embodiment of the invention. As shown in Fig. 1, the method comprises:

S101: obtaining several corpus entries from a corpus, and obtaining, for each corpus entry, the concatenation result of the context word vectors it contains; and initializing the word vectors of the words of different parts of speech contained in the corpus entries;

Specifically, the word vectors of the words of different parts of speech contained in the corpus entries can be initialized using mutually non-overlapping value ranges, one range per part of speech. Each corpus entry is an individual sentence.
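The part-of-speech-dependent initialization described above can be sketched as follows. This is a minimal illustration under stated assumptions: the part-of-speech labels, the concrete value ranges, the vector dimension, and the function name `init_word_vectors` are all hypothetical, since the text only requires that the ranges do not overlap.

```python
import random

# Hypothetical, mutually disjoint value ranges, one per part of speech.
POS_RANGES = {
    "noun": (0.0, 0.2),
    "verb": (0.3, 0.5),
    "adjective": (0.6, 0.8),
    "other": (0.9, 1.1),
}


def init_word_vectors(tagged_words, dim=8, seed=0):
    """Return {word: vector}; each component is drawn from the range of the word's part of speech."""
    rng = random.Random(seed)
    vectors = {}
    for word, pos in tagged_words:
        # Unknown tags fall back to the "other" range (an assumption).
        low, high = POS_RANGES.get(pos, POS_RANGES["other"])
        vectors[word] = [rng.uniform(low, high) for _ in range(dim)]
    return vectors


vectors = init_word_vectors([("apple", "noun"), ("eat", "verb"), ("red", "adjective")])
```

Because the ranges are disjoint, the part of speech of a word can be read off its initial vector, which is one plausible way to inject part-of-speech information into the model input.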

For example, a language model can be learned from a large-scale corpus. Because the quality of the language model directly affects the judgment of sentence correctness, choosing a suitable corpus is important. For Chinese, Chinese Wikipedia entries can be chosen for modeling.

S102: using the concatenation result and the word vectors as the input of a CBOW model to obtain a trained CBOW model;

Fig. 2 is a structural diagram of a CBOW model provided by an embodiment of the invention. As shown in Fig. 2, the CBOW (Continuous Bag of Words) model comprises an input layer x and an output layer y. The input layer receives different phrases, which are translated and then output by the output layer.
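A minimal sketch of such a model's forward pass is given below, assuming a softmax output layer. The context word vectors are concatenated in order to form the input layer x, which preserves word order (the classic CBOW model averages them instead). The function name `cbow_forward`, the matrix shapes, and the random parameters are illustrative assumptions, not details taken from the text.

```python
import math
import random


def cbow_forward(context_ids, embeddings, output_weights):
    """Concatenate the context word vectors into x and return softmax probabilities over the vocabulary."""
    # Input layer x: order-preserving concatenation of the context word vectors.
    x = [value for i in context_ids for value in embeddings[i]]
    # Output layer y: one score per vocabulary word (dot product with x).
    scores = [sum(w * v for w, v in zip(row, x)) for row in output_weights]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
vocab_size, dim, context_len = 6, 3, 4
embeddings = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(vocab_size)]
output_weights = [[rng.uniform(-1, 1) for _ in range(dim * context_len)] for _ in range(vocab_size)]

probs = cbow_forward([0, 1, 3, 4], embeddings, output_weights)
predicted_id = max(range(vocab_size), key=probs.__getitem__)  # most probable target word
```

Concatenation makes the input length depend on the number of context words, so the context size has to be fixed in advance; this sketch uses four context words.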

S103: obtaining the target word of each corpus entry, and translating it using the trained CBOW model.

Specifically, the target word of each corpus entry can be obtained using a formula (not reproduced in this text), wherein

(w, c) is an n-gram $w_{i-(n-1)/2}, \ldots, w_{i+(n-1)/2}$ selected from the corpus; n is generally chosen to be odd, which ensures that the numbers of context words on both sides of the target word are equal.
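The selection of (w, c) pairs can be illustrated with a short sketch that slides an odd-sized window over a tokenized sentence. The function name `center_context_pairs` and the example sentence are hypothetical; positions without a full window are simply skipped here, which is an assumption.

```python
def center_context_pairs(tokens, n=5):
    """Yield (target word, context words) for every full window of odd size n."""
    if n % 2 == 0:
        raise ValueError("n should be odd so that both sides have the same number of context words")
    half = (n - 1) // 2
    for i in range(half, len(tokens) - half):
        # Context: (n - 1) / 2 words before and after the target word w_i.
        context = tokens[i - half:i] + tokens[i + 1:i + half + 1]
        yield tokens[i], context


pairs = list(center_context_pairs(["the", "cat", "sat", "on", "the", "mat"], n=5))
```

With n = 5 and six tokens, only "sat" and "on" have two context words on each side, so two pairs are produced.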

The optimization objective of the model can be expressed by a formula (not reproduced in this text), wherein

D is the corpus.
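Assuming the usual maximum-likelihood training of a CBOW model, an optimization objective consistent with the probability P(w | c) and the corpus D can be sketched as follows. This is a standard form offered for orientation, not necessarily the expression in the original filing.

```latex
\max \sum_{(w, c) \in D} \log P(w \mid c)
```

That is, training adjusts the model parameters so that each target word w in D becomes as probable as possible given its context c.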

S104: obtaining the translation of the target word produced by the model to be evaluated, and evaluating the accuracy of that model's translation according to the similarity between its translation and the translation produced by the trained CBOW model.

In practical applications, a translation is judged repeatedly using a sliding window. For example, with a window size of 5, the 1st, 2nd, ... words of the translation are taken in turn as the middle word for judgment. Each judgment yields a similarity value; the average of these similarity values is then calculated, and the resulting similarity is the score of the translation. The higher the score, the higher the correctness of the translation.
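The sliding-window scoring described above can be sketched as follows. The callables `predict_center` (standing in for the trained CBOW model) and `similarity` (standing in for the similarity measure, which the text does not specify) are hypothetical, as is the truncation of the context at the sentence edges.

```python
def score_translation(tokens, predict_center, similarity, window=5):
    """Average the per-position similarity, taking each word in turn as the middle word."""
    half = window // 2
    scores = []
    for i, word in enumerate(tokens):
        # Context is truncated at sentence edges (an assumption; edge handling is not specified).
        context = tokens[max(0, i - half):i] + tokens[i + 1:i + 1 + half]
        scores.append(similarity(word, predict_center(context)))
    return sum(scores) / len(scores) if scores else 0.0


# Toy stand-ins for the trained CBOW model and the similarity measure (both hypothetical).
def toy_predict_center(context):
    return "sat"


def exact_match(a, b):
    return 1.0 if a == b else 0.0


score = score_translation(["the", "cat", "sat", "on"], toy_predict_center, exact_match)
```

In a real setting the similarity would more plausibly be computed between word vectors, for example as a cosine similarity, so that near-synonyms receive partial credit.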

With the embodiment of the invention shown in Fig. 1, because context word order plays an important role in translation, a more accurate translation model can be obtained from the concatenation result of the context word vectors contained in each corpus entry. The model trained according to the embodiment can then be used to check the translation results of prior-art models. Whereas the prior art requires manual evaluation, the embodiment can evaluate the accuracy of translation results automatically.

In one specific implementation of the embodiment of the invention, before step S102, the method further comprises:

before training the model, when processing the corpus, removing special characters and retaining the punctuation marks that are useful to the model, for example full stops, exclamation marks, and question marks.
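A minimal sketch of this preprocessing step is shown below. The retained set `KEEP` and the use of Unicode general categories to decide what counts as a punctuation mark or special character are assumptions made for illustration.

```python
import unicodedata

# Assumed set of retained marks: sentence-final and tone-bearing punctuation.
KEEP = set("。！？.!?")


def strip_punctuation(sentence, keep=KEEP):
    """Drop every punctuation or symbol character except those in `keep`."""
    kept = []
    for ch in sentence:
        category = unicodedata.category(ch)  # e.g. "Po" (punctuation), "Sm" (math symbol)
        if category[0] in ("P", "S") and ch not in keep:
            continue
        kept.append(ch)
    return "".join(kept)


cleaned = strip_punctuation("Hello, world! (really?)")
```

Letters, digits, and whitespace pass through unchanged, so tokenization can still be applied afterwards.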

By improving the language model, the present invention adds sentence information such as word order, part of speech, and punctuation marks, which improves the expressive power of the language model so that it can represent more complex sentences. Combined with machine translation, the improved language model can judge the correctness of machine-translated text and thereby improve the accuracy of machine translation.

Corresponding to the embodiment of the invention shown in Fig. 1, an embodiment of the invention further provides a translation evaluation device for machine translation.

Fig. 3 is a structural diagram of a translation evaluation device for machine translation provided by an embodiment of the invention. As shown in Fig. 3, the device comprises:

an acquisition module 301, configured to obtain several corpus entries from a corpus and to obtain, for each corpus entry, the concatenation result of the context word vectors it contains; and to initialize the word vectors of the words of different parts of speech contained in the corpus entries;

In one specific implementation of the embodiment of the invention, the acquisition module 301 is configured to:

In one specific implementation of the embodiment of the invention, the device further comprises a removal module, configured to remove the punctuation marks in each corpus entry other than the set punctuation marks, wherein the set punctuation marks comprise one or a combination of: punctuation marks that express the tone of the corpus entry, and punctuation marks that end the corpus entry.

In one specific implementation of the embodiment of the invention, the acquisition module 301 is configured to obtain the target word of each corpus entry using a formula (not reproduced in this text), wherein

In one specific implementation of the embodiment of the invention, each corpus entry is an individual sentence.

The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

1. A translation evaluation method for machine translation, characterized in that the method comprises:

obtaining several corpus entries from a corpus, and obtaining, for each corpus entry, the concatenation result of the context word vectors it contains; and initializing the word vectors of the words of different parts of speech contained in the corpus entries;

obtaining the translation of the target word produced by the model to be evaluated, and evaluating the accuracy of that model's translation according to the similarity between its translation and the translation produced by the trained CBOW model.

2. The translation evaluation method for machine translation according to claim 1, characterized in that initializing the word vectors of the words of different parts of speech contained in the corpus entries comprises:

initializing the word vectors of the words of different parts of speech contained in the corpus entries using mutually non-overlapping value ranges, one range per part of speech.

3. The translation evaluation method for machine translation according to claim 1, characterized in that, before using the concatenation result and the word vectors as the input of the CBOW model to obtain the trained CBOW model, the method further comprises:

removing the punctuation marks in each corpus entry other than the set punctuation marks, wherein the set punctuation marks comprise one or a combination of: punctuation marks that express the tone of the corpus entry, and punctuation marks that end the corpus entry.

4. The translation evaluation method for machine translation according to claim 1, characterized in that obtaining the target word of each corpus entry comprises:

obtaining the target word of each corpus entry using a formula (not reproduced in this text), wherein

P(w | c) is the probability of the target word; w is the target word; c is the context of the target word; exp() is the exponential function with the natural base; x is the input layer of the CBOW model; ∑ is the summation function; V is the corpus; ()^T denotes the matrix transpose.

5. The translation evaluation method for machine translation according to claim 1, characterized in that each corpus entry is an individual sentence.

6. A translation evaluation device for machine translation, characterized in that the device comprises:

an acquisition module, configured to obtain several corpus entries from a corpus and to obtain, for each corpus entry, the concatenation result of the context word vectors it contains; and to initialize the word vectors of the words of different parts of speech contained in the corpus entries;

7. The translation evaluation device for machine translation according to claim 6, characterized in that the acquisition module is configured to:

8. The translation evaluation device for machine translation according to claim 6, characterized in that the device further comprises a removal module, configured to remove the punctuation marks in each corpus entry other than the set punctuation marks, wherein the set punctuation marks comprise one or a combination of: punctuation marks that express the tone of the corpus entry, and punctuation marks that end the corpus entry.

9. The translation evaluation device for machine translation according to claim 6, characterized in that the acquisition module is configured to:

obtain the target word of each corpus entry using a formula (not reproduced in this text), wherein

10. The translation evaluation device for machine translation according to claim 6, characterized in that each corpus entry is an individual sentence.