CN109446537A - Translation evaluation method and device for machine translation - Google Patents

Translation evaluation method and device for machine translation

Info

Publication number
CN109446537A
Authority
CN
China
Prior art keywords
corpus
translation
model
target word
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811306229.2A
Other languages
Chinese (zh)
Other versions
CN109446537B (en)
Inventor
詹文法
邵志伟
陶鹏程
张振林
刘德阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anqing Normal University
Original Assignee
Anqing Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anqing Normal University
Priority to CN201811306229.2A
Publication of CN109446537A
Application granted
Publication of CN109446537B
Active
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a translation evaluation method and device for machine translation. The method comprises: obtaining several corpus entries from a corpus and obtaining the concatenation of the context word vectors contained in each corpus entry; initializing the word vectors of the words of different parts of speech contained in the corpus entries; using the concatenation result and the word vectors as the input of a CBOW model to obtain a trained CBOW model; obtaining the target word of each corpus entry and translating it using the trained CBOW model; and obtaining the translation produced by the model to be evaluated for the target word, and evaluating the accuracy of that model's translation according to the similarity between its translation and the translation produced by the trained CBOW model. With the embodiments of the invention, the accuracy of a translation result can be evaluated automatically.

Description

Translation evaluation method and device for machine translation
Technical field
The present invention relates to a translation evaluation method and device, and more particularly to a translation evaluation method and device for machine translation.
Background art
With the development of modern society, the demand for conversion between languages keeps growing. In practice, traditional machine translation is rule-based: it relies on grammatical matching relations grounded in syntactic and semantic theory and obtains the translation by analyzing the context. However, because rules cannot cover all sentences, traditional machine translation mostly produces literal translations or conversions of grammatical sentence patterns.
With the continuous development of artificial intelligence, neural-network-based representation learning has begun to stand out in many fields. In particular, on many image recognition and speech recognition tasks, methods based on representation learning have surpassed traditional methods based on statistical learning. Modern machine translation methods are based on a bilingual corpus: using a bilingual corpus that contains many sentence patterns, the method retrieves example sentences similar to the input sentence according to the sentence patterns in the corpus, and then converts the source language into the target language by reference to the bilingual sentence patterns.
Natural language is an abstract expression of human intelligence and is difficult to represent with existing data structures. In natural language processing, the basic unit of data is the character or the word. A word such as "apple" can denote a fruit, and it can also denote Apple Inc. Likewise, two different words that both mean "microphone" denote the same object, but no connection between them can be established from their literal forms. At present, most translation systems can correctly translate the general meaning of a sentence. However, word and sentence usage differs markedly between languages, so translation results mostly contain problems such as word-order errors and misused words. For long sentences in particular, machine translation cannot achieve good accuracy, so the prior art has the technical problem that translation results still need manual evaluation.
Summary of the invention
The technical problem to be solved by the invention is to provide a translation evaluation method and device for machine translation, so as to solve the prior-art technical problem that translation results still need manual evaluation.
The invention solves the above technical problem through the following technical solution:
An embodiment of the invention provides a translation evaluation method for machine translation, the method comprising:
obtaining several corpus entries from a corpus and obtaining the concatenation of the context word vectors contained in each corpus entry; and initializing the word vectors of the words of different parts of speech contained in the corpus entries;
using the concatenation result and the word vectors as the input of a CBOW model to obtain a trained CBOW model;
obtaining the target word of each corpus entry and translating it using the trained CBOW model;
obtaining the translation produced by the model to be evaluated for the target word, and evaluating the accuracy of the translation of the model to be evaluated according to the similarity between the translation produced by the model to be evaluated and the translation produced by the trained CBOW model.
Optionally, initializing the word vectors of the words of different parts of speech contained in the corpus entries comprises:
initializing the word vectors of the words of different parts of speech contained in the corpus entries using respective, mutually non-overlapping value ranges.
Optionally, before using the concatenation result and the word vectors as the input of the CBOW model to obtain the trained CBOW model, the method further comprises:
removing the punctuation marks in each corpus entry other than preset punctuation marks, wherein the preset punctuation marks include one or a combination of: punctuation marks that express the tone of the corpus entry, and punctuation marks that end the corpus entry.
Optionally, obtaining the target word of each corpus entry comprises:
obtaining the target word of each corpus entry using a formula in which:
P(w|c) is the probability of the target word; w is the target word; c is the context of the target word; exp(·) is the exponential function with the natural constant e as its base; x is the input layer of the CBOW model; ∑ is the summation function; V is the corpus; (·)^T denotes the transpose.
Optionally, each corpus entry is an individual sentence.
An embodiment of the invention provides a translation evaluation device for machine translation, the device comprising:
an acquisition module, configured to obtain several corpus entries from a corpus and obtain the concatenation of the context word vectors contained in each corpus entry; and to initialize the word vectors of the words of different parts of speech contained in the corpus entries;
use the concatenation result and the word vectors as the input of a CBOW model to obtain a trained CBOW model;
obtain the target word of each corpus entry and translate it using the trained CBOW model;
obtain the translation produced by the model to be evaluated for the target word, and evaluate the accuracy of the translation of the model to be evaluated according to the similarity between the translation produced by the model to be evaluated and the translation produced by the trained CBOW model.
Optionally, the acquisition module is configured to:
initialize the word vectors of the words of different parts of speech contained in the corpus entries using respective, mutually non-overlapping value ranges.
Optionally, the device further comprises a removal module, configured to remove the punctuation marks in each corpus entry other than preset punctuation marks, wherein the preset punctuation marks include one or a combination of: punctuation marks that express the tone of the corpus entry, and punctuation marks that end the corpus entry.
Optionally, the acquisition module is configured to:
obtain the target word of each corpus entry using a formula in which:
P(w|c) is the probability of the target word; w is the target word; c is the context of the target word; exp(·) is the exponential function with the natural constant e as its base; x is the input layer of the CBOW model; ∑ is the summation function; V is the corpus; (·)^T denotes the transpose.
Optionally, each corpus entry is an individual sentence.
Compared with the prior art, the invention has the following advantages:
With the embodiments of the invention, because context word order plays an important role in translation, a more accurate translation model can be obtained from the concatenation of the context word vectors contained in each corpus entry. The model trained according to the embodiments can then be used to check the translation results of prior-art models. Whereas the prior art requires manual evaluation, the embodiments of the invention can evaluate the accuracy of translation results automatically.
Brief description of the drawings
Fig. 1 is a schematic flowchart of a translation evaluation method for machine translation provided by an embodiment of the invention;
Fig. 2 is a schematic structural diagram of a CBOW model provided by an embodiment of the invention;
Fig. 3 is a schematic structural diagram of a translation evaluation device for machine translation provided by an embodiment of the invention.
Detailed description of the embodiments
The embodiments of the invention are described in detail below. The embodiments are implemented on the premise of the technical solution of the invention, and detailed implementations and specific operating procedures are given, but the scope of protection of the invention is not limited to the following embodiments.
The embodiments of the invention provide a translation evaluation method and device for machine translation. The translation evaluation method for machine translation provided by an embodiment of the invention is introduced first.
Fig. 1 is a schematic flowchart of a translation evaluation method for machine translation provided by an embodiment of the invention. As shown in Fig. 1, the method comprises:
S101: obtaining several corpus entries from a corpus and obtaining the concatenation of the context word vectors contained in each corpus entry; and initializing the word vectors of the words of different parts of speech contained in the corpus entries;
Specifically, the word vectors of the words of different parts of speech contained in the corpus entries may be initialized using respective, mutually non-overlapping value ranges. Each corpus entry is an individual sentence.
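The initialization with non-overlapping value ranges can be illustrated with a minimal sketch. The part-of-speech labels, the concrete ranges, the vector dimension, and the function name `init_word_vectors` are all assumptions made for illustration; the source does not specify them.

```python
import random

# Hypothetical, mutually non-overlapping initialization ranges,
# one per part of speech (the concrete values are assumptions).
POS_RANGES = {
    "noun": (0.0, 0.1),
    "verb": (0.1, 0.2),
    "adj": (0.2, 0.3),
    "other": (0.3, 0.4),
}


def init_word_vectors(tagged_words, dim=8, seed=0):
    """Return {word: vector}, drawing every component from the range of the word's part of speech."""
    rng = random.Random(seed)
    vectors = {}
    for word, pos in tagged_words:
        low, high = POS_RANGES.get(pos, POS_RANGES["other"])
        vectors[word] = [rng.uniform(low, high) for _ in range(dim)]
    return vectors


vectors = init_word_vectors([("apple", "noun"), ("eat", "verb"), ("red", "adj")])
```

With this kind of initialization, the part of speech of a word can initially be read off from the range its vector components fall in; whether that separation is preserved during training is not stated in the source.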
For example, a language model can be learned from a large-scale corpus. Because the quality of the language model directly affects the judgment of sentence correctness, choosing a suitable corpus is important. For Chinese, Chinese Wikipedia entries can be chosen for modeling.
S102: using the concatenation result and the word vectors as the input of a CBOW model to obtain a trained CBOW model;
Fig. 2 is a schematic structural diagram of a CBOW model provided by an embodiment of the invention. As shown in Fig. 2, the CBOW (Continuous Bag of Words) model includes an input layer x and an output layer y. The input layer receives different phrases, which are translated and then output by the output layer.
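As a rough illustration of how the input layer x could be built, the sketch below concatenates the context word vectors in sentence order, which keeps word-order information that is lost when context vectors are summed or averaged, as in the commonly used form of CBOW. The function name `concat_context`, the zero padding at sentence boundaries, and the toy vectors are assumptions, not details taken from the source.

```python
def concat_context(vectors, sentence, center, window=2):
    """Concatenate, in order, the vectors of the words around position `center`."""
    dim = len(next(iter(vectors.values())))
    x = []
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # the target word itself is not part of the context
        pos = center + offset
        if 0 <= pos < len(sentence):
            x.extend(vectors[sentence[pos]])
        else:
            x.extend([0.0] * dim)  # zero padding at sentence boundaries (assumption)
    return x


# Toy two-dimensional word vectors, for illustration only.
toy_vectors = {"a": [1.0, 2.0], "b": [3.0, 4.0], "c": [5.0, 6.0]}
x_middle = concat_context(toy_vectors, ["a", "b", "c"], center=1, window=1)
x_first = concat_context(toy_vectors, ["a", "b", "c"], center=0, window=1)
```

Here `x_middle` holds the vector of "a" followed by the vector of "c", so swapping the two context words would change the input.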
S103: obtaining the target word of each corpus entry and translating it using the trained CBOW model.
Specifically, the target word of each corpus entry may be obtained using a formula in which:
P(w|c) is the probability of the target word; w is the target word; c is the context of the target word; exp(·) is the exponential function with the natural constant e as its base; x is the input layer of the CBOW model; ∑ is the summation function; V is the corpus; (·)^T denotes the transpose.
(w, c) is an n-gram w_{i-(n-1)/2}, ..., w_{i+(n-1)/2} selected from the corpus. In general n is chosen to be odd, which ensures that the number of context words on each side of the target word is the same.
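The formula itself is not reproduced in this text. Given the symbols defined above (an exponential, a transpose, the input x, and a sum over V), it is plausibly the usual CBOW softmax. The following is a reconstruction under that assumption, where e'(w) is a hypothetical symbol for the output vector of word w and the sum runs over the candidate words in V:

```latex
P(w \mid c) = \frac{\exp\left(e'(w)^{T} x\right)}{\sum_{w' \in V} \exp\left(e'(w')^{T} x\right)}
```

Under this reading, the target word for a context c is the word w with the highest P(w|c).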
The optimization objective of the model may be a formula in which:
D is the corpus.
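The objective formula is likewise not reproduced in this text. A common choice for CBOW, assumed here rather than taken from the source, is to maximize the sum of log P(w|c) over all (w, c) pairs in D, with P(w|c) given by a softmax over output vectors. The sketch below computes that quantity; the names `log_prob`, `log_likelihood`, and `output_vectors` are hypothetical.

```python
import math


def log_prob(target, x, output_vectors):
    """log P(target | context) under a softmax over all output vectors."""
    scores = {
        word: sum(a * b for a, b in zip(vec, x))
        for word, vec in output_vectors.items()
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores.values()))
    return scores[target] - log_z


def log_likelihood(pairs, output_vectors):
    """Sum of log P(w | c) over (target word, context input x) pairs."""
    return sum(log_prob(word, x, output_vectors) for word, x in pairs)


# Toy example: two candidate words with identical output vectors,
# so each has probability 0.5 regardless of the input.
toy_output = {"a": [0.0, 0.0], "b": [0.0, 0.0]}
toy_pairs = [("a", [1.0, 2.0]), ("b", [0.5, 0.5])]
total = log_likelihood(toy_pairs, toy_output)
```

Training would adjust the word vectors to increase this total; the optimizer and any approximation of the softmax are not described in the source.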
S104: obtaining the translation produced by the model to be evaluated for the target word, and evaluating the accuracy of the translation of the model to be evaluated according to the similarity between the translation produced by the model to be evaluated and the translation produced by the trained CBOW model.
In practical applications, a translation is judged multiple times using a sliding window. For example, with a window size of 5, judgments are made with the 1st, 2nd, ... word of the translation in turn as the middle word. Each judgment yields a similarity value; the similarity values are then averaged, and the resulting average is the score of the translation. The higher the score, the more correct the translation.
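The sliding-window scoring can be sketched as follows. The source does not define how a single similarity value is computed, so the sketch takes it as a caller-supplied function; `score_translation`, `toy_similarity`, and the truncation of the window at sentence boundaries are assumptions for illustration.

```python
def score_translation(words, similarity, window=5):
    """Average the per-position similarity values, using each word in turn as the middle word."""
    half = window // 2
    scores = []
    for i, word in enumerate(words):
        # Context: up to `half` words on each side of the middle word.
        context = words[max(0, i - half):i] + words[i + 1:i + 1 + half]
        scores.append(similarity(context, word))
    return sum(scores) / len(scores) if scores else 0.0


def toy_similarity(context, word):
    """Placeholder similarity: penalize a word that is repeated within its own window."""
    return 0.0 if word in context else 1.0


score = score_translation(["the", "cat", "cat", "sat"], toy_similarity, window=5)
```

In an actual system the placeholder would be replaced by a comparison between the word in the translation under evaluation and the prediction of the trained CBOW model for the same window.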
With the embodiment shown in Fig. 1, because context word order plays an important role in translation, a more accurate translation model can be obtained from the concatenation of the context word vectors contained in each corpus entry. The model trained according to the embodiment can then be used to check the translation results of prior-art models. Whereas the prior art requires manual evaluation, the embodiments of the invention can evaluate the accuracy of translation results automatically.
In a specific implementation of the embodiment of the invention, before step S102, the method further comprises:
removing the punctuation marks in each corpus entry other than preset punctuation marks, wherein the preset punctuation marks include one or a combination of: punctuation marks that express the tone of the corpus entry, and punctuation marks that end the corpus entry.
Before the model is trained, special characters are removed when the corpus is processed, and the punctuation marks that are useful to the model are retained, for example the full stop, the exclamation mark, and the question mark.
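A minimal sketch of this preprocessing step is given below. The set of retained marks (Chinese and ASCII full stop, exclamation mark, and question mark) and the use of Unicode general categories to detect punctuation and symbols are assumptions; the source only names the kinds of marks to keep.

```python
import unicodedata

# Assumed set of retained marks: sentence-ending and tone-bearing punctuation.
KEEP = set("。！？.!?")


def strip_punctuation(sentence, keep=KEEP):
    """Drop punctuation and symbol characters unless they are in `keep`."""
    return "".join(
        ch
        for ch in sentence
        # Unicode categories starting with "P" are punctuation, "S" are symbols.
        if ch in keep or not unicodedata.category(ch).startswith(("P", "S"))
    )


cleaned_en = strip_punctuation("Hello, world! #1")
cleaned_zh = strip_punctuation("你好，世界！")
```

Whitespace, letters, and digits pass through unchanged, so the word order of each sentence is preserved.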
By improving the language model, the invention adds sentence information such as word order, part of speech, and punctuation, which improves the expressive power of the language model so that it can represent more complex sentences. Combined with machine translation, the improved language model can judge the correctness of machine translation output and improve the accuracy of machine translation.
Corresponding to the embodiment shown in Fig. 1, an embodiment of the invention also provides a translation evaluation device for machine translation.
Fig. 3 is a schematic structural diagram of a translation evaluation device for machine translation provided by an embodiment of the invention. As shown in Fig. 3, the device comprises:
an acquisition module 301, configured to obtain several corpus entries from a corpus and obtain the concatenation of the context word vectors contained in each corpus entry; and to initialize the word vectors of the words of different parts of speech contained in the corpus entries;
use the concatenation result and the word vectors as the input of a CBOW model to obtain a trained CBOW model;
obtain the target word of each corpus entry and translate it using the trained CBOW model;
obtain the translation produced by the model to be evaluated for the target word, and evaluate the accuracy of the translation of the model to be evaluated according to the similarity between the translation produced by the model to be evaluated and the translation produced by the trained CBOW model.
With the embodiment shown in Fig. 1, because context word order plays an important role in translation, a more accurate translation model can be obtained from the concatenation of the context word vectors contained in each corpus entry. The model trained according to the embodiment can then be used to check the translation results of prior-art models. Whereas the prior art requires manual evaluation, the embodiments of the invention can evaluate the accuracy of translation results automatically.
In a specific implementation of the embodiment of the invention, the acquisition module 301 is configured to:
initialize the word vectors of the words of different parts of speech contained in the corpus entries using respective, mutually non-overlapping value ranges.
In a specific implementation of the embodiment of the invention, the device further comprises a removal module, configured to remove the punctuation marks in each corpus entry other than preset punctuation marks, wherein the preset punctuation marks include one or a combination of: punctuation marks that express the tone of the corpus entry, and punctuation marks that end the corpus entry.
In a specific implementation of the embodiment of the invention, the acquisition module 301 is configured to obtain the target word of each corpus entry using a formula in which:
P(w|c) is the probability of the target word; w is the target word; c is the context of the target word; exp(·) is the exponential function with the natural constant e as its base; x is the input layer of the CBOW model; ∑ is the summation function; V is the corpus; (·)^T denotes the transpose.
In a specific implementation of the embodiment of the invention, each corpus entry is an individual sentence.
The above are merely preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims (10)

1. A translation evaluation method for machine translation, characterized in that the method comprises:
obtaining several corpus entries from a corpus and obtaining the concatenation of the context word vectors contained in each corpus entry; and initializing the word vectors of the words of different parts of speech contained in the corpus entries;
using the concatenation result and the word vectors as the input of a CBOW model to obtain a trained CBOW model;
obtaining the target word of each corpus entry and translating it using the trained CBOW model;
obtaining the translation produced by the model to be evaluated for the target word, and evaluating the accuracy of the translation of the model to be evaluated according to the similarity between the translation produced by the model to be evaluated and the translation produced by the trained CBOW model.
2. The translation evaluation method for machine translation according to claim 1, characterized in that initializing the word vectors of the words of different parts of speech contained in the corpus entries comprises:
initializing the word vectors of the words of different parts of speech contained in the corpus entries using respective, mutually non-overlapping value ranges.
3. The translation evaluation method for machine translation according to claim 1, characterized in that, before using the concatenation result and the word vectors as the input of the CBOW model to obtain the trained CBOW model, the method further comprises:
removing the punctuation marks in each corpus entry other than preset punctuation marks, wherein the preset punctuation marks include one or a combination of: punctuation marks that express the tone of the corpus entry, and punctuation marks that end the corpus entry.
4. The translation evaluation method for machine translation according to claim 1, characterized in that obtaining the target word of each corpus entry comprises:
obtaining the target word of each corpus entry using a formula in which:
P(w|c) is the probability of the target word; w is the target word; c is the context of the target word; exp(·) is the exponential function with the natural constant e as its base; x is the input layer of the CBOW model; ∑ is the summation function; V is the corpus; (·)^T denotes the transpose.
5. The translation evaluation method for machine translation according to claim 1, characterized in that each corpus entry is an individual sentence.
6. A translation evaluation device for machine translation, characterized in that the device comprises:
an acquisition module, configured to obtain several corpus entries from a corpus and obtain the concatenation of the context word vectors contained in each corpus entry; and to initialize the word vectors of the words of different parts of speech contained in the corpus entries;
use the concatenation result and the word vectors as the input of a CBOW model to obtain a trained CBOW model;
obtain the target word of each corpus entry and translate it using the trained CBOW model;
obtain the translation produced by the model to be evaluated for the target word, and evaluate the accuracy of the translation of the model to be evaluated according to the similarity between the translation produced by the model to be evaluated and the translation produced by the trained CBOW model.
7. The translation evaluation device for machine translation according to claim 6, characterized in that the acquisition module is configured to:
initialize the word vectors of the words of different parts of speech contained in the corpus entries using respective, mutually non-overlapping value ranges.
8. The translation evaluation device for machine translation according to claim 6, characterized in that the device further comprises a removal module, configured to remove the punctuation marks in each corpus entry other than preset punctuation marks, wherein the preset punctuation marks include one or a combination of: punctuation marks that express the tone of the corpus entry, and punctuation marks that end the corpus entry.
9. The translation evaluation device for machine translation according to claim 6, characterized in that the acquisition module is configured to:
obtain the target word of each corpus entry using a formula in which:
P(w|c) is the probability of the target word; w is the target word; c is the context of the target word; exp(·) is the exponential function with the natural constant e as its base; x is the input layer of the CBOW model; ∑ is the summation function; V is the corpus; (·)^T denotes the transpose.
10. The translation evaluation device for machine translation according to claim 6, characterized in that each corpus entry is an individual sentence.
CN201811306229.2A 2018-11-05 2018-11-05 Translation evaluation method and device for machine translation Active CN109446537B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811306229.2A CN109446537B (en) 2018-11-05 2018-11-05 Translation evaluation method and device for machine translation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811306229.2A CN109446537B (en) 2018-11-05 2018-11-05 Translation evaluation method and device for machine translation

Publications (2)

Publication Number Publication Date
CN109446537A (en) 2019-03-08
CN109446537B (en) 2022-11-25

Family

ID=65550840

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811306229.2A Active CN109446537B (en) 2018-11-05 2018-11-05 Translation evaluation method and device for machine translation

Country Status (1)

Country Link
CN (1) CN109446537B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111274827A (en) * 2020-01-20 2020-06-12 南京新一代人工智能研究院有限公司 Suffix translation method based on multi-target learning of word bag

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110246177A1 (en) * 2010-04-06 2011-10-06 Samsung Electronics Co. Ltd. Syntactic analysis and hierarchical phrase model based machine translation system and method
CN105808530A (en) * 2016-03-23 2016-07-27 苏州大学 Translation method and device in statistical machine translation
US20160350288A1 (en) * 2015-05-29 2016-12-01 Oracle International Corporation Multilingual embeddings for natural language processing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
姚亮等: "基于语义分布相似度的翻译模型领域自适应研究", 《山东大学学报(理学版)》 *
樊文婷等: "融合先验信息的蒙汉神经网络机器翻译模型", 《中文信息学报》 *

Also Published As

Publication number Publication date
CN109446537B (en) 2022-11-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant