CN110874536B - Corpus quality evaluation model generation method and double-sentence pair inter-translation quality evaluation method - Google Patents

Corpus quality evaluation model generation method and double-sentence pair inter-translation quality evaluation method

Info

Publication number
CN110874536B
Authority
CN
China
Prior art keywords
sentence
corpus
bilingual
inter
quality
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810995294.4A
Other languages
Chinese (zh)
Other versions
CN110874536A (en
Inventor
陆军
汪嘉怿
施杨斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Group Holding Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201810995294.4A priority Critical patent/CN110874536B/en
Publication of CN110874536A publication Critical patent/CN110874536A/en
Application granted granted Critical
Publication of CN110874536B publication Critical patent/CN110874536B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Machine Translation (AREA)

Abstract

The embodiments of the invention disclose a corpus quality evaluation model generation method and a bilingual sentence pair inter-translation quality evaluation method, together with corresponding devices, equipment and storage media. The corpus quality evaluation model generation method comprises the following steps: constructing a bilingual corpus, wherein the bilingual corpus comprises a plurality of bilingual sentence pairs and inter-translation quality labels corresponding to the bilingual sentence pairs; and training a preset corpus quality evaluation network by taking the bilingual sentence pairs and their corresponding inter-translation quality labels as training samples to generate a corpus quality evaluation model, wherein the corpus quality evaluation model is suitable for evaluating the inter-translation quality of a given bilingual sentence pair. The embodiments of the invention make it possible to evaluate the inter-translation quality of bilingual sentence pairs.

Description

Corpus quality evaluation model generation method and double-sentence pair inter-translation quality evaluation method
Technical Field
The invention relates to the technical field of machine translation, and in particular to a corpus quality evaluation model generation method and a bilingual sentence pair inter-translation quality evaluation method, together with corresponding devices, equipment and storage media.
Background
Machine translation refers to the technique of translating text from one natural language (the source language) into another natural language (the target language) using a computer program. Currently, corpus-based machine translation techniques, such as statistical machine translation (SMT) and neural machine translation (NMT), represent the major technical trend in this field; they rely on a corpus containing a large amount of training data to train a translation model. For both SMT and NMT, translation quality is closely related to the quality and size of the corpus. It is therefore important to evaluate the quality of the data in the corpus.
A bilingual corpus, sometimes called a bilingual parallel corpus, is one type of data in such a corpus and is key training data for a machine translation model. A bilingual corpus generally refers to texts that are translations of each other, and generally comprises text at the word level, phrase level, sentence level, document level and the like. For example, the pair formed by a Chinese sentence meaning "the weather is nice today" and the English sentence "It's a nice day today" is a Chinese-English inter-translated bilingual corpus entry at the sentence level.
The conventional scheme for evaluating the quality of a bilingual corpus calculates the translation probability of a sentence pair from the translation probabilities of its words, and then evaluates the quality of the bilingual corpus accordingly. The general processing procedure is as follows:
1) Construct a bilingual vocabulary and calculate the translation probabilities of the words to obtain vocabulary entries. For example, an entry that pairs the English word "apple" with its Chinese equivalent and carries the probabilities 0.8 and 0.6 indicates that the probability of translating the English word into the Chinese word is 0.8, and the probability of translating the Chinese word into the English word is 0.6.
2) If the bilingual corpus is at the phrase level or sentence level, segment the original text and the translated text into words and perform word alignment to obtain word pair relations. Word alignment refers to associating original words and translated words that may be translations of each other.
3) Using the word pair relations together with the word translation probabilities obtained in step 1), calculate the overall translation probability of the bilingual corpus with a suitable algorithm (for example, a statistically weighted proportion of inter-translated words).
Here, the quality of the bilingual corpus is reflected in the overall translation probability: the higher the overall translation probability, the better the quality of the bilingual corpus is considered to be.
Although this scheme reflects the quality of a bilingual corpus to a certain extent, it is fundamentally based on word-level processing. First, it depends on a constructed bilingual vocabulary; second, it requires word segmentation and word alignment of the original text and the translated text; third, it requires additional algorithms to calculate the final overall translation probability. The uncertainty of each of these steps affects the calculated overall translation probability, so that the overall translation probability cannot accurately reflect the quality of the bilingual corpus.
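The conventional word-based procedure can be illustrated with a minimal sketch. It assumes a toy lexicon and a simple averaging rule; the function name `lexical_pair_score`, the romanized placeholder tokens and the averaging rule itself are illustrative assumptions, not the specific algorithm of any existing system.

```python
def lexical_pair_score(src_words, tgt_words, lexicon):
    """Average, over source words, of the best translation probability
    to any target word (0.0 when the lexicon has no entry)."""
    if not src_words:
        return 0.0
    total = 0.0
    for s in src_words:
        # Crude stand-in for word alignment: pick the best-matching target word.
        total += max((lexicon.get((s, t), 0.0) for t in tgt_words), default=0.0)
    return total / len(src_words)

# Hypothetical toy lexicon: (source word, target word) -> translation probability.
# "pingguo" and "hong" are romanized placeholders for Chinese words.
lexicon = {("apple", "pingguo"): 0.8, ("red", "hong"): 0.5}
score = lexical_pair_score(["red", "apple"], ["hong", "pingguo"], lexicon)
```

Even in this simplified form, the score depends on the lexicon, the segmentation and the alignment heuristic, which is the source of uncertainty discussed in the surrounding text.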
Disclosure of Invention
In view of the above, the present invention provides a bilingual-corpus-based training method and quality evaluation method, together with corresponding devices and a computer storage medium, in order to solve the problem that quality evaluation of bilingual corpora is difficult to carry out.
In a first aspect, the present invention provides a method for generating a corpus quality assessment model, the method comprising:
constructing a bilingual corpus, wherein the bilingual corpus comprises a plurality of bilingual sentence pairs and inter-translation quality labels corresponding to the bilingual sentence pairs;
training a preset corpus quality assessment network by taking the bilingual sentence pairs and the inter-translation quality labels corresponding to the bilingual sentence pairs as training samples to generate a corpus quality assessment model, wherein the corpus quality assessment model is suitable for assessing the inter-translation quality of a given bilingual sentence pair.
In a second aspect, the present invention further provides a device for generating a corpus quality assessment model, where the device includes:
the corpus construction module is used for constructing a bilingual corpus, and the bilingual corpus comprises a plurality of bilingual sentence pairs and mutual translation quality labels corresponding to the bilingual sentence pairs;
the corpus quality assessment model training module is used for training a preset corpus quality assessment network by taking the bilingual sentence pairs and the inter-translation quality labels corresponding to the bilingual sentence pairs as training samples so as to generate a corpus quality assessment model, and the corpus quality assessment model is suitable for assessing the inter-translation quality of a given bilingual sentence pair.
In a third aspect, the present invention further provides a device for generating a corpus quality assessment model, including:
a memory for storing a program;
and a processor for executing the program stored in the memory to perform the method as described above.
In a fourth aspect, the invention also provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement a method as described above.
In a fifth aspect, the present invention further provides a method for evaluating the inter-translation quality of a bilingual sentence pair, the method comprising:
acquiring a bilingual sentence pair to be evaluated;
inputting the bilingual sentence pair into a trained corpus quality assessment model;
and determining the inter-translation quality of the bilingual sentence pair according to the output of the corpus quality assessment model.
In a sixth aspect, the present invention further provides a device for evaluating the inter-translation quality of a bilingual sentence pair, the device comprising:
the bilingual sentence pair acquisition module is used for acquiring a bilingual sentence pair to be evaluated;
the bilingual sentence pair input module is used for inputting the bilingual sentence pair into the trained corpus quality assessment model;
and a module used for determining the inter-translation quality of the bilingual sentence pair according to the output of the corpus quality assessment model.
In a seventh aspect, the present invention also provides an apparatus for evaluating the inter-translation quality of a bilingual sentence pair, including:
a memory for storing a program;
and a processor for running the program stored in the memory to perform the method for evaluating the inter-translation quality of a bilingual sentence pair as described above.
In an eighth aspect, the present invention also provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method for evaluating the inter-translation quality of a bilingual sentence pair as described above.
According to the embodiments of the invention, by constructing a bilingual training corpus containing both bilingual sentence pairs that are inter-translated and bilingual sentence pairs that are not, the corpus quality evaluation network can be trained as expected, so that a stable mapping from bilingual sentence pairs to inter-translation quality labels is formed. The resulting model can be used to evaluate the inter-translation quality of bilingual sentence pairs, and the evaluation results are highly accurate.
Drawings
Fig. 1 is a flow chart of a corpus quality assessment model generation method according to an embodiment of the present invention.
Fig. 2 is a flow chart of a method for evaluating the inter-translation quality of a bilingual sentence pair according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a training process of a corpus quality assessment network according to an embodiment of the present invention.
Fig. 4 is a schematic structural diagram of a corpus quality assessment model generating device according to an embodiment of the present invention.
Fig. 5 is a schematic structural diagram of a device for evaluating the inter-translation quality of a bilingual sentence pair according to an embodiment of the present invention.
Fig. 6 is a schematic diagram of a hardware structure of an apparatus according to an embodiment of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and examples. It should be understood that the detailed description is intended to illustrate the invention, but not to limit the invention. Terms such as first, second, etc. herein are used solely to distinguish one entity (or action) from another entity (or action) without necessarily implying any relationship or order between such entities (or actions); in addition, terms herein such as up, down, left, right, front, back, etc. denote a direction or orientation, but merely denote a relative direction or orientation, not an absolute direction or orientation. Without additional limitations, elements defined by the term "comprising" do not exclude the presence of other elements in a process, method, article, or apparatus that comprises the element.
The invention aims to train a constructed corpus quality evaluation network by constructing a brand-new bilingual corpus and taking the bilingual corpus as training data to generate a corpus quality evaluation model, and the model can realize evaluation of the inter-translation quality of target bilingual corpus. Various aspects of the invention are described in detail below.
< bilingual corpus >
In order to achieve quality assessment of bilingual corpora, particularly phrase-level or sentence-level bilingual corpora, the bilingual corpus constructed by the embodiments of the present invention includes bilingual sentence pairs. A bilingual sentence pair refers to phrases or sentences in different languages that can be translated into each other, for example, Chinese-English, Chinese-Russian, English-French or French-Japanese inter-translated phrases or sentences, and the like. The number of bilingual sentence pairs in the bilingual corpus can be set according to practical conditions and requirements; for example, it can be on the order of tens of thousands, hundreds of thousands, millions or tens of millions. The larger the number of bilingual sentence pairs, the better the training effect on the corpus quality evaluation network.
In order to evaluate the quality of bilingual corpora, the bilingual corpus constructed by the embodiments of the invention comprises positive examples and negative examples. A "positive example" is a fully inter-translated bilingual sentence pair, which is considered to have high translation quality; it is marked with a high-quality label and serves as a positive sample in the subsequent training that generates the corpus quality evaluation model. A "negative example" is an incompletely inter-translated bilingual sentence pair, which is considered to have low translation quality; it is marked with a low-quality label and serves as a negative sample in the subsequent training that generates the corpus quality evaluation model. Therefore, the bilingual corpus constructed by the embodiments of the invention comprises fully inter-translated bilingual sentence pairs and incompletely inter-translated bilingual sentence pairs, which correspond to a high-quality label and a low-quality label respectively. The two kinds of bilingual sentence pairs are described in detail below.
For convenience of description, hereinafter, two sentences included in the bilingual sentence pair are referred to as an original sentence and a translated sentence, respectively. Those skilled in the art will appreciate that the original sentence and the translated sentence are only used to distinguish between two sentences in a bilingual sentence pair, and are not specific to a sentence in a certain language. The original sentence can be any sentence in the bilingual sentence pair, and correspondingly, the other sentence in the bilingual sentence pair is a translated sentence.
< fully inter-translated bilingual sentence pair >
In one embodiment of the present invention, a fully inter-translated bilingual sentence pair refers to a bilingual parallel sentence pair with a complete word alignment relationship. For example, the pair formed by a Chinese sentence meaning "the weather is nice today" and the English sentence "It's a nice day today" is a fully inter-translated bilingual sentence pair.
< incompletely inter-translated bilingual sentence pair >
In one embodiment of the present invention, an incompletely inter-translated bilingual sentence pair refers to any bilingual sentence pair that cannot be called fully inter-translated, that is, any bilingual sentence pair that does not have a complete word alignment relationship. For example, words may be randomly deleted from the original sentence and/or the translated sentence: if the Chinese word for "today" is deleted from the Chinese sentence of the pair above, the English word "today" has no aligned word in the Chinese sentence, so the pair is incompletely inter-translated. As another example, new words may be randomly inserted into the original sentence and/or the translated sentence; the inserted words lack aligned translations, so the pair is incompletely inter-translated. As a further example, when the word order of the original sentence and/or the translated sentence is randomly shuffled, the word alignment relationship becomes wrong, so the pair is incompletely inter-translated.
For each of the above cases, the selection and necessary combination may be performed in the actual application scenario according to the actual conditions and requirements (such as the source of the training data, the amount of the training data, the accuracy of model training, etc.).
As an example, an incompletely inter-translated bilingual sentence pair may include two phrases or two sentences that have no translation relationship at all. For example, a Chinese phrase meaning "the weather is good" paired with the English phrase "tell me your name" has no inter-translation relationship between its two phrases; likewise, a Chinese sentence meaning "what shall we eat tonight" paired with the English sentence "it's a nice day today" has no inter-translation relationship between its two sentences. Such pairs therefore belong to the incompletely inter-translated bilingual sentence pairs.
Further, in the embodiments of the present invention, the two phrases or sentences of a bilingual sentence pair are separated by a pause mark, meaning that the two phrases or sentences constitute one bilingual sentence pair. Other notations may also be used to represent such relationships in different embodiments or in different operating environments, such as "|", "||", "- - -" and/or "- - -", etc.
< source of training data >
The training data of the embodiments of the invention comprises all the bilingual sentence pairs in the bilingual corpus, and the sources of the bilingual sentence pairs differ according to their inter-translation status.
In one embodiment of the present invention, for the "positive example" bilingual parallel sentence pairs, inter-translated bilingual parallel sentence pairs that have already been accumulated in the field may be used; this part of the data is relatively easy to obtain. Further, to ensure that the translation quality is high, manual labeling may be performed to exclude sentence pairs that do not meet expectations.
For example, if fully inter-translated bilingual sentence pairs are to be used as high-quality sentence pairs, the incompletely inter-translated sentence pairs may be eliminated, and the remaining "positive example" inter-translated bilingual sentence pairs are used as positive samples in the training data.
In addition, high-quality bilingual sentence pairs of manual translation can be directly used as the bilingual sentence pairs of the 'positive examples', but the quantity of the training data is usually not very large due to the high cost of manual processing.
For the "negative example" incompletely inter-translated bilingual sentence pairs, there are various ways of obtaining them: they can be constructed directly from ready-made sentences, or constructed by processing fully inter-translated bilingual sentence pairs. Some ways of constructing incompletely inter-translated bilingual sentence pairs are listed below by way of example:
a) On the basis of fully inter-translated bilingual sentence pairs, the original sentence and/or the translated sentence is replaced with another sentence by manual or computer means. For example, a Chinese sentence meaning "the weather is nice today" is paired with "what is your name?", or a Chinese sentence meaning "what shall we eat today" is paired with "it's a nice day today".
b) On the basis of fully inter-translated bilingual sentence pairs, one or more words are randomly deleted from the original sentence and/or the translated sentence by manual or computer means. For example, a word is deleted from the Chinese sentence meaning "the weather is nice today" while the English sentence remains "it's a nice day today".
c) On the basis of fully inter-translated bilingual sentence pairs, the word order of the original sentence and/or the translated sentence is randomly shuffled by manual or computer means. For example, the English sentence becomes "it's a nice today day".
d) On the basis of fully inter-translated bilingual sentence pairs, other words are randomly inserted into the original sentence and/or the translated sentence by manual or computer means.
e) On the basis of fully inter-translated bilingual sentence pairs, at least part of the original sentence and/or the translated sentence is replaced with machine-translated text by manual or computer means. Here, the quality of machine translation is generally assumed to be poor.
f) Sentences in the two languages are arbitrarily selected and arbitrarily paired.
g) Any other method that can reduce the quality of the translation.
For a) to g) above, an incompletely inter-translated bilingual sentence pair may be constructed using any one of them, or using a combination of several of them (for example, two, three or more), so as to obtain "negative example" incompletely inter-translated bilingual sentence pairs that meet the requirements and serve as negative samples in the training data.
For a) to f) above, in order to avoid labor costs, the processing may be implemented mainly by computer means. For those skilled in the computer field, operations such as adding, deleting, modifying and pairing data in a) to f) are easy to implement, so their implementation and principles are not described in detail here.
It is noted that for a) to e) above, in an embodiment of the present invention, when the number of words involved in an operation exceeds 10% of the total number of words of a sentence, a "negative example" incompletely inter-translated bilingual sentence pair is considered to have been formed. Of course, other proportion thresholds, such as 20% or 30%, may be set as the criterion for forming an incompletely inter-translated bilingual sentence pair.
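As an illustration of how some of these constructions might be automated, the following minimal sketch covers methods b), d) and a simple form of a)/f), and applies the 10% threshold so that every deletion or insertion touches strictly more than 10% of the words. All function names, the rotation-based pairing and the romanized placeholder tokens are assumptions made for this example only.

```python
import random

def min_ops(n_words, percent=10):
    """Smallest word count that strictly exceeds percent% of n_words (capped at n_words)."""
    return min(n_words, n_words * percent // 100 + 1)

def delete_words(words, rng, percent=10):
    """Method b): randomly delete words from a tokenized sentence."""
    drop = set(rng.sample(range(len(words)), min_ops(len(words), percent)))
    return [w for i, w in enumerate(words) if i not in drop]

def insert_words(words, rng, filler_vocab, percent=10):
    """Method d): randomly insert other words into a tokenized sentence."""
    out = list(words)
    for _ in range(min_ops(len(words), percent)):
        out.insert(rng.randrange(len(out) + 1), rng.choice(filler_vocab))
    return out

def mispair(positive_pairs):
    """Methods a)/f): pair each source sentence with the target of the next pair."""
    targets = [t for _, t in positive_pairs]
    rotated = targets[1:] + targets[:1]
    return [(s, t) for (s, _), t in zip(positive_pairs, rotated)]

def build_negatives(positive_pairs, filler_vocab, seed=0):
    """Return (source_words, target_words, 0) samples; label 0 marks low quality."""
    rng = random.Random(seed)
    negatives = []
    for src, tgt in positive_pairs:
        negatives.append((src, delete_words(tgt, rng), 0))
        negatives.append((src, insert_words(tgt, rng, filler_vocab), 0))
    if len(positive_pairs) > 1:
        negatives += [(s, t, 0) for s, t in mispair(positive_pairs)]
    return negatives

# Romanized tokens are placeholders for segmented Chinese words.
positives = [
    (["jintian", "tianqi", "hen", "hao"], ["it", "is", "a", "nice", "day", "today"]),
    (["ni", "jiao", "shenme"], ["what", "is", "your", "name"]),
]
negatives = build_negatives(positives, filler_vocab=["foo"])
```

Integer arithmetic is used for the threshold so that "exceeds 10%" is not affected by floating-point rounding; passing `percent=20` or `percent=30` gives the alternative thresholds mentioned above.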
< training corpus quality evaluation network >
It can be understood that after the corpus quality evaluation network and the training samples are set, the training samples are input into the corpus quality evaluation network, the network outputs the labels corresponding to the samples, the loss function value of the network can be calculated based on the labels output by the network and the real labels of the training samples, and the network parameters are adjusted according to the loss function value. Based on the updated parameters, the training samples are again input into the network, the loss function values are calculated and the network parameters are updated according to the labels output by the network and the actual labels of the training samples, and so on, the network parameters are continuously updated so as to minimize the loss function (in practice, the loss function is considered to be minimized when the loss function converges or is smaller than a predetermined threshold). The set of parameters which minimize the loss function is the optimal parameters of the network, and the model after training is obtained after the optimal parameters are determined.
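The iterative procedure described above can be sketched as follows. For brevity, this example trains a single logistic unit on hypothetical two-dimensional features rather than the full corpus quality evaluation network; the learning rate, tolerance, feature values and function name are all assumptions chosen for illustration.

```python
import math

def train(samples, lr=0.5, max_epochs=1000, tol=1e-6):
    """samples: list of (feature_vector, label) with label in {0, 1}.
    Returns (weights, bias, loss measured in the last epoch before its update)."""
    dim = len(samples[0][0])
    w, b = [0.0] * dim, 0.0
    prev_loss = float("inf")
    n = len(samples)
    for _ in range(max_epochs):
        loss, grad_w, grad_b = 0.0, [0.0] * dim, 0.0
        for x, y in samples:
            # Forward pass: predicted probability of the high-quality label.
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            q = min(max(p, 1e-12), 1.0 - 1e-12)  # clamp for a safe logarithm
            # Cross-entropy between the predicted and the real label.
            loss -= y * math.log(q) + (1 - y) * math.log(1.0 - q)
            for i, xi in enumerate(x):
                grad_w[i] += (p - y) * xi
            grad_b += p - y
        loss /= n
        # Adjust the parameters according to the loss gradient.
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
        # Stop when the loss has converged.
        if abs(prev_loss - loss) < tol:
            break
        prev_loss = loss
    return w, b, loss

# Hypothetical features for two positive (label 1) and two negative (label 0) samples.
toy_samples = [([1.0, 0.0], 1), ([0.9, 0.1], 1), ([0.0, 1.0], 0), ([0.1, 0.9], 0)]
weights, bias, final_loss = train(toy_samples)
```

The same loop structure (forward pass, loss, parameter update, convergence check) applies to a multi-layer network, where the gradients would typically be obtained by backpropagation.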
In one embodiment of the present invention, the corpus quality evaluation network includes a word embedding layer, a sentence embedding layer, a splicing layer and a classification layer, which are sequentially connected: the word embedding layer is used for generating word vector sequences of words included in two sentences in the double sentence pair; the sentence embedding layer is used for respectively generating sentence vectors corresponding to the two sentences according to word vector sequences of words included in the two sentences; the splicing layer is used for splicing sentence vectors corresponding to the two sentences to obtain spliced vectors; and the classification layer is used for outputting the inter-translation quality label according to the splicing vector.
In one embodiment of the present invention, the input of the word embedding layer is the two sentences (or the two word-segmented sentences) of the bilingual sentence pair, and the output is the word vector sequences of the words included in the two sentences. The word embedding layer may be, for example, word2vec, GloVe, or another word vector model. In one embodiment, the word embedding layer further includes an attention module, which is configured to capture information about the inter-translation relationships between the words of the original sentence and the translated sentence of the bilingual sentence pair, so that the trained corpus quality assessment model can predict the inter-translation quality of the bilingual sentence pair more effectively.
In one embodiment of the present invention, the input of the sentence embedding layer is a word vector sequence output by the word embedding layer, and the output is a sentence vector corresponding to two sentences in the bilingual sentence pair. The sentence embedding layer may be implemented by adopting a convolutional neural network (Convolutional Neural Network, CNN), a recurrent neural network (Recurrent Neural Network, RNN), a Long Short-Term Memory network (LSTM), and other network structures. For the processing of the original text sentence and the translated text sentence of a single double sentence pair, the same network structure, for example, a CNN network may be adopted, or different neural network structures may be adopted, for example, the original text sentence adopts the CNN network processing and the translated text sentence adopts the RNN network processing, and so on. In addition, the neural networks themselves in the above process may be extended, for example, RNN networks may be added to CNN networks.
In one embodiment of the present invention, the input of the concatenation layer is the sentence vector of the two sentences output by the sentence embedding layer, and the output is the concatenation vector obtained by concatenating the sentence vectors of the two sentences.
In one embodiment of the invention, the input of the classification layer is the spliced vector output by the splicing layer, the output is the probability that the bilingual sentence pair belongs to each inter-translation quality label, and the label with the highest probability is taken as the inter-translation quality of the bilingual sentence pair.
In one embodiment, the classification layer includes a fully connected layer and a Softmax layer connected in sequence. The input of the fully connected layer is the spliced vector output by the splicing layer, the output of the fully connected layer is the input of the Softmax layer, and the output of the Softmax layer is the probability that the bilingual sentence pair belongs to each inter-translation quality label; the label with the highest probability is the inter-translation quality of the bilingual sentence pair.
The number of fully connected layers can be set by those skilled in the art as needed, and is not limited by the present invention. In one embodiment, to balance the classification effect and the training efficiency of the model, the number of fully connected layers is set to 2.
The dimension of the output of the Softmax layer equals the number of quality label categories. For example, if the quality labels comprise two categories, high quality and low quality, the output of the Softmax layer is a two-dimensional vector whose components represent the probabilities that the bilingual sentence pair belongs to the high-quality label and the low-quality label respectively. As another example, if the quality labels comprise three categories, high quality, medium quality and low quality, the output of the Softmax layer is a three-dimensional vector whose components represent the probabilities that the bilingual sentence pair belongs to the high-quality, medium-quality and low-quality labels respectively. The label with the highest probability is the inter-translation quality of the bilingual sentence pair.
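The layer sequence described above (word embedding, sentence embedding, splicing, two fully connected layers and Softmax) can be sketched as a forward pass. This is a shape-level illustration only: the weights are random and untrained, mean pooling stands in for the CNN/RNN/LSTM sentence embedding, the ReLU activation is an added assumption, the attention module is omitted, and the class name and dimensions are hypothetical.

```python
import math
import random

def softmax(z):
    """Numerically stable Softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def dense(x, W, b):
    """Fully connected layer: one output per weight row."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class PairQualityNet:
    def __init__(self, vocab, dim=8, hidden=16, n_labels=2, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # Word embedding layer: a lookup table of (here random) word vectors.
        self.emb = {w: [rng.uniform(-1, 1) for _ in range(dim)] for w in vocab}
        # Classification layer: two fully connected layers followed by Softmax.
        self.W1, self.b1 = rand_matrix(hidden, 2 * dim, rng), [0.0] * hidden
        self.W2, self.b2 = rand_matrix(n_labels, hidden, rng), [0.0] * n_labels

    def sentence_vector(self, words):
        """Sentence embedding layer (assumption: mean pooling of word vectors)."""
        vecs = [self.emb.get(w, [0.0] * self.dim) for w in words]
        if not vecs:
            return [0.0] * self.dim
        return [sum(col) / len(vecs) for col in zip(*vecs)]

    def forward(self, src_words, tgt_words):
        # Splicing layer: concatenate the two sentence vectors.
        spliced = self.sentence_vector(src_words) + self.sentence_vector(tgt_words)
        hidden = [max(0.0, v) for v in dense(spliced, self.W1, self.b1)]  # ReLU
        return softmax(dense(hidden, self.W2, self.b2))

# Romanized tokens are placeholders for segmented Chinese words.
net = PairQualityNet(vocab=["it", "is", "nice", "jintian", "hao"])
probs = net.forward(["it", "is", "nice"], ["jintian", "hao"])
```

Setting `n_labels=3` yields the three-dimensional output described for the high, medium and low quality labels.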
The training samples are data samples marked with classification labels, where a label is the real category to which the data sample belongs. In the invention, a training sample is a bilingual sentence pair marked with an inter-translation quality label. Specifically, a fully inter-translated bilingual sentence pair is a positive sample with label 1 (representing high quality), and an incompletely inter-translated bilingual sentence pair is a negative sample with label 0 (representing low quality).
Based on the constructed bilingual corpus comprising the bilingual sentence pairs which are completely mutually translated and the bilingual sentence pairs which are not completely mutually translated and the mutual translation quality labels corresponding to the sentence pairs, training the corpus quality evaluation network can generate a corpus quality evaluation model for evaluating the mutual translation quality of a given bilingual sentence pair.
Based on the foregoing, an embodiment of the present invention may provide a method for generating a corpus quality assessment model, with reference to fig. 1, the method includes:
S101, constructing a bilingual corpus, wherein the bilingual corpus comprises a plurality of bilingual sentence pairs and inter-translation quality labels corresponding to the bilingual sentence pairs;
S102, training a preset corpus quality assessment network by taking the bilingual sentence pairs and the inter-translation quality labels corresponding to the bilingual sentence pairs as training samples to generate a corpus quality assessment model, wherein the corpus quality assessment model is suitable for assessing the inter-translation quality of a given bilingual sentence pair.
By utilizing the scheme provided by the invention, the preset corpus quality evaluation network can be trained based on the constructed bilingual corpus, so that a corpus quality evaluation model is generated, the model can be used for evaluating the inter-translation quality of a given bilingual sentence pair, and the evaluation result is stable and reliable.
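The idea that training determines the parameters of a classifier from labeled samples can be illustrated with a deliberately simplified stand-in. The sketch below trains a single softmax classifier by gradient descent on made-up two-dimensional feature vectors; it is not the patent's network, and the features, learning rate and epoch count are all assumptions chosen only so that the example runs.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, b, x):
    """Linear scores, one per class."""
    return [sum(w * xi for w, xi in zip(W[c], x)) + b[c] for c in range(len(b))]

def train(samples, n_classes=2, lr=0.5, epochs=200):
    """Fit weights W and biases b on (feature_vector, label) samples."""
    dim = len(samples[0][0])
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for x, y in samples:
            p = softmax(logits(W, b, x))
            for c in range(n_classes):
                grad = p[c] - (1.0 if c == y else 0.0)  # cross-entropy gradient w.r.t. logit c
                for j in range(dim):
                    W[c][j] -= lr * grad * x[j]
                b[c] -= lr * grad
    return W, b

def predict(model, x):
    W, b = model
    p = softmax(logits(W, b, x))
    return max(range(len(p)), key=lambda c: p[c])

# Hypothetical pair features with labels: 1 = high quality, 0 = low quality.
toy = [([0.9, 0.1], 1), ([0.8, 0.2], 1), ([0.2, 0.7], 0), ([0.1, 0.9], 0)]
model = train(toy)
print([predict(model, x) for x, _ in toy])
```

The parameters returned by `train` play the role of the "optimal parameters" in claim 2: the network evaluated under those parameters is what is then used as the model.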
Referring to fig. 2, the invention further provides a method for evaluating the inter-translation quality of the bilingual corpus to be evaluated by using the corpus quality evaluation model trained by the method shown in fig. 1, and the evaluation method comprises the following steps:
S201, obtaining bilingual sentence pairs to be evaluated;
S202, inputting the bilingual sentence pairs into a trained corpus quality assessment model;
S203, determining the inter-translation quality of the bilingual sentence pair according to the output of the corpus quality assessment model.
With the method for evaluating the inter-translation quality of bilingual sentence pairs provided by the invention, the evaluation result is stable and reliable.
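Steps S201 to S203 can be sketched as a small wrapper around any trained model. In the sketch below the model is assumed to be a callable that returns the probabilities of label 0 and label 1 in that order; `stub_model` is a placeholder that only compares sentence lengths so that the example runs, and it is not the patent's network.

```python
def evaluate_pair(model, src_words, tgt_words):
    """S201-S203: feed one bilingual sentence pair to the model and read off the label."""
    p_low, p_high = model(src_words, tgt_words)  # S202: [P(label 0), P(label 1)]
    return 1 if p_high > p_low else 0            # S203: 1 = high quality, 0 = low quality

def stub_model(src_words, tgt_words):
    """Placeholder for a trained model; assumes both sentences are non-empty."""
    ratio = min(len(src_words), len(tgt_words)) / max(len(src_words), len(tgt_words))
    return [1.0 - ratio, ratio]

src = ["today", "weather", "good"]               # S201: the pair to be evaluated
print(evaluate_pair(stub_model, src, ["it's", "a", "nice", "day", "today"]))
print(evaluate_pair(stub_model, src, ["today"]))
```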
The embodiments of the invention are applicable to most scenarios that require, or allow, quality evaluation of a bilingual corpus. For example, in a user-led project for mining bilingual data resources, the user can apply the embodiments of the invention to evaluate the quality of the mined bilingual data, and thereby gauge the mining effect qualitatively or quantitatively and optimize the mining scheme accordingly. As another example, when selecting bilingual corpora for machine translation, the candidate bilingual corpora can be evaluated with the method and the low-quality corpora removed, which optimizes the bilingual corpus.
Optional specific processes of the embodiments of the present invention are described below by way of specific examples. It should be noted that the scheme of the present invention does not depend on a specific algorithm. In practical application, any known or unknown hardware, software, algorithm, program or combination thereof may be selected to implement the scheme of the present invention, and any such implementation that adopts the essential idea of the scheme falls within the protection scope of the present invention.
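The corpus-selection scenario can be sketched as a simple filter. The helper below keeps the pairs predicted as label 1, removes the rest, and reports the share that was kept as a rough quantitative view of the mining effect. The function name, the `predict_label` callable and the lookup table of precomputed labels are hypothetical stand-ins for a trained model.

```python
def filter_corpus(candidates, predict_label):
    """Split candidate pairs into kept (label 1) and removed (any other label)."""
    kept, removed = [], []
    for src, tgt in candidates:
        (kept if predict_label(src, tgt) == 1 else removed).append((src, tgt))
    keep_rate = len(kept) / len(candidates) if candidates else 0.0
    return kept, removed, keep_rate

# Hypothetical precomputed labels standing in for a trained model's predictions.
FAKE_LABELS = {("src A", "tgt A"): 1, ("src B", "tgt B"): 0, ("src C", "tgt C"): 1}
candidates = list(FAKE_LABELS)
kept, removed, keep_rate = filter_corpus(candidates, lambda s, t: FAKE_LABELS[(s, t)])
print(len(kept), len(removed), round(keep_rate, 2))
```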
Fig. 3 shows a schematic diagram of the training process of a corpus quality assessment network according to an embodiment of the present invention, where the corpus quality assessment network includes a word embedding layer, a CNN layer, a splicing layer, two fully connected layers, and a Softmax layer, which are sequentially connected.
Here, SRC and TGT denote the original text and the translated text, respectively, for example, SRC: "today weather is good"; TGT: "it's a nice day today".
(1) Firstly, word segmentation is carried out on the original text and the translated text to obtain word sequences, for example, SRC: "today", "weather" and "good".
(2) The words of the original text and the translated text are input to the word embedding (word-embedding) module, which converts each word of a sentence into a vector, for example, "today" in SRC: [0.13, 0.21, 0.101, …, 0.28]. The dimension of the vector may be 200 or 300. Each word is represented by its corresponding vector, so the original sentence and the translated sentence are each converted into a corresponding vector sequence.
(3) The vector sequences of the original text and the translated text are input to the CNN network of the sentence embedding layer. The CNN network comprises a convolution layer (convolutional layer) and a pooling layer (pooling layer) and can extract the information of a sentence. The CNN network module outputs a vector representing the semantics of the sentence, such as [0.280, 0.116, …, 0.101].
Because the CNN is a classical neural network structure, it can represent a sentence as a vector fairly precisely, and this vector represents the semantics of the sentence.
(4) After the sentence vectors of the original text and the translated text are obtained, they are input to the splicing layer, which concatenates the two (concatenation) into a higher-dimensional spliced vector that represents the sentence pair.
(5) The spliced vector contains the semantics of both the original text and the translated text. It passes through two fully connected layers (2-layer fully connected) and a Softmax layer, which output the prediction result, namely a quality score or quality label for the sentence pair.
The two fully connected layers mainly model the degree of semantic match between the original text and the translated text, and the Softmax layer outputs the final label.
The prediction result consists of the probabilities of label 0 and label 1. If the probability of 1 is greater than the probability of 0, the label of the sentence pair is determined to be 1 (high-quality label); if the probability of 1 is less than or equal to the probability of 0, the label is determined to be 0 (low-quality label).
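The data flow of steps (2) to (5) and the final decision rule can be sketched as a forward pass in plain Python. Everything specific here is an assumption made for illustration: the tiny layer sizes, the convolution window of two words, the ReLU activations, the random untrained weights, and the sharing of one embedding table and one set of filters between SRC and TGT. With untrained weights the output probabilities carry no meaning; only the shapes and the order of operations are of interest.

```python
import math
import random

rng = random.Random(0)
EMB_DIM, N_FILTERS, HIDDEN = 4, 3, 5  # tiny sizes; the text mentions 200 or 300 for word vectors

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def relu(v):
    return [max(0.0, x) for x in v]

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

class PairScorer:
    def __init__(self, vocab):
        self.emb = {w: [rng.uniform(-1, 1) for _ in range(EMB_DIM)] for w in vocab}
        self.conv = rand_matrix(N_FILTERS, 2 * EMB_DIM)  # each filter spans 2 adjacent words
        self.fc1 = rand_matrix(HIDDEN, 2 * N_FILTERS)
        self.fc2 = rand_matrix(2, HIDDEN)

    def sentence_vector(self, words):
        vecs = [self.emb[w] for w in words]                              # step (2): word vectors
        windows = [vecs[i] + vecs[i + 1] for i in range(len(vecs) - 1)]  # needs >= 2 words
        feats = [relu(matvec(self.conv, win)) for win in windows]        # step (3): convolution
        return [max(col) for col in zip(*feats)]                         # step (3): max pooling

    def forward(self, src_words, tgt_words):
        pair = self.sentence_vector(src_words) + self.sentence_vector(tgt_words)  # step (4)
        hidden = relu(matvec(self.fc1, pair))                            # step (5): first FC layer
        return softmax(matvec(self.fc2, hidden))                         # [P(label 0), P(label 1)]

src = ["today", "weather", "good"]
tgt = ["it's", "a", "nice", "day", "today"]
model = PairScorer(set(src) | set(tgt))
probs = model.forward(src, tgt)
label = 1 if probs[1] > probs[0] else 0  # ties go to label 0, as in the decision rule
print(probs, label)
```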
Further, the embodiment of fig. 3 may be implemented using the TensorFlow tool. In such an implementation, an attention module based on the attention mechanism can be constructed between the word vector sequences of the original text and the translated text to capture the inter-translation relation information between the words of the original text and the translated text in the bilingual sentence pair, so that the quality of the sentence pair can be predicted more effectively.
Based on the above examples, it can be understood that, first, the implementation of the embodiment of the present invention does not need a bilingual vocabulary, so there is no vocabulary dependency problem. In addition, the embodiment of the invention models the original text and the translated text themselves (word embedding module, CNN network module) and can better represent their semantics, so that if the quality of the original text or the translated text is itself poor (or good), this is reflected in the related modules and in the quality label that is finally output. Therefore, the final prediction result is an evaluation result that fuses the quality of the original text itself, the quality of the translated text itself, and the degree of inter-translation between the original text and the translated text.
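One common way to relate the two word vector sequences is dot-product attention, sketched below in plain Python rather than TensorFlow. The patent does not specify the form of the attention module, so the scoring function, the function names and the toy two-dimensional vectors are all assumptions. For each SRC word the sketch computes a weight over the TGT words and a weighted sum of the TGT vectors.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def attend(src_vecs, tgt_vecs):
    """For each source word, return attention weights over target words and a context vector."""
    all_weights, contexts = [], []
    for s in src_vecs:
        scores = [sum(a * b for a, b in zip(s, t)) for t in tgt_vecs]  # dot-product scores
        w = softmax(scores)
        ctx = [sum(w_j * t[d] for w_j, t in zip(w, tgt_vecs)) for d in range(len(s))]
        all_weights.append(w)
        contexts.append(ctx)
    return all_weights, contexts

# Toy word vectors: the first source word is most similar to the second target word.
src_vecs = [[1.0, 0.0], [0.0, 1.0]]
tgt_vecs = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
weights, contexts = attend(src_vecs, tgt_vecs)
for w in weights:
    print([round(x, 3) for x in w])
```

A high weight between a source word and a target word is the kind of word-level inter-translation signal the attention module is meant to capture; how the context vectors are combined with the word vectors is left open here.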
Corresponding to the generation method of the corpus quality assessment model, the invention also provides a generation device, equipment and a computer storage medium of the corpus quality assessment model.
Referring to fig. 4, the generating device of the corpus quality assessment model includes:
the corpus construction module 100 is configured to construct a bilingual corpus, where the bilingual corpus includes a plurality of bilingual sentence pairs and mutually translated quality labels corresponding to the bilingual sentence pairs;
the corpus quality assessment model training module 200 is configured to train a preset corpus quality assessment network by using the bilingual sentence pairs and the inter-translation quality labels corresponding to the bilingual sentence pairs as training samples, so as to generate a corpus quality assessment model, where the corpus quality assessment model is suitable for assessing the inter-translation quality of a given bilingual sentence pair.
The generation device of the corpus quality assessment model comprises:
a memory for storing a program;
and the processor is used for running the program stored in the memory so as to execute each step in the generation method of the corpus quality assessment model.
The present invention also provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps in the method for generating a corpus quality assessment model according to the embodiments of the present invention.
The above device, equipment and storage medium can train the corpus quality evaluation network as expected, and the generated model is used for quality evaluation of a bilingual corpus.
Corresponding to the method for evaluating the inter-translation quality of bilingual sentence pairs in the embodiment of the invention, the invention also provides a device, equipment and a computer storage medium for evaluating the inter-translation quality of bilingual sentence pairs.
Referring to fig. 5, the apparatus for evaluating the inter-translation quality of bilingual sentence pairs includes:
the bilingual sentence pair acquisition module 10 is used for acquiring bilingual sentence pairs to be evaluated;
the bilingual sentence pair input module 20 is configured to input the bilingual sentence pair into a trained corpus quality assessment model;
the corpus quality assessment model 30 is configured to determine the inter-translation quality of the pair of sentences according to the output of the corpus quality assessment model.
The mutual translation quality evaluation device of the double sentence pair comprises:
a memory for storing a program;
and the processor is used for running the program stored in the memory to execute each step in the double-statement pair inter-translation quality evaluation method.
The present invention also provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps in the method for evaluating the inter-translation quality of a pair of statements according to the embodiments of the present invention.
The inter-translation quality evaluation device, the inter-translation quality evaluation equipment and the computer storage medium for the bilingual corpus can be used for realizing quality evaluation of the bilingual corpus, and the evaluation result is high in accuracy.
It should be noted that the above-described embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product comprising one or more computer program instructions. When the computer program instructions are loaded and executed on a computer, the flows or functions according to embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer program instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), etc.
Fig. 6 shows a block diagram of an exemplary hardware architecture capable of implementing the method and apparatus according to an embodiment of the present invention, such as a bilingual corpus-based training apparatus and a bilingual corpus quality assessment apparatus according to an embodiment of the present invention. The computing device 1000 includes, among other things, an input device 1001, an input interface 1002, a processor 1003, a memory 1004, an output interface 1005, and an output device 1006.
The input interface 1002, the processor 1003, the memory 1004, and the output interface 1005 are connected to each other via a bus 1010, and the input device 1001 and the output device 1006 are connected to the bus 1010 via the input interface 1002 and the output interface 1005, respectively, and further connected to other components of the computing device 1000.
Specifically, the input device 1001 receives input information from the outside, and transmits the input information to the processor 1003 through the input interface 1002; the processor 1003 processes the input information based on computer executable instructions stored in the memory 1004 to generate output information, stores the output information temporarily or permanently in the memory 1004, and then transmits the output information to the output device 1006 through the output interface 1005; output device 1006 outputs output information to the outside of computing device 1000 for use by a user.
The computing device 1000 may perform the steps of the methods of the invention described above.
The processor 1003 may be one or more central processing units (Central Processing Unit, CPU). In the case where the processor 1003 is one CPU, the CPU may be a single-core CPU or a multi-core CPU.
The memory 1004 may be, but is not limited to, one or more of Random Access Memory (RAM), Read Only Memory (ROM), Erasable Programmable Read Only Memory (EPROM), compact disc read only memory (CD-ROM), hard disk, and the like. The memory 1004 is used for storing program codes.
It will be appreciated that the functions of any or all of the modules provided by the embodiments of the present invention may be implemented by the processor 1003 shown in fig. 6.
The parts of this specification are described in a progressive manner. For parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the apparatus and system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.

Claims (16)

1. A method of generating a corpus quality assessment model, the method comprising:
constructing a bilingual corpus, wherein the bilingual corpus comprises a plurality of bilingual sentence pairs and inter-translation quality labels corresponding to the bilingual sentence pairs;
training a preset corpus quality assessment network by taking the bilingual sentence pairs and the corresponding inter-translation quality labels of the bilingual sentence pairs as training samples to generate a corpus quality assessment model, wherein the corpus quality assessment model is suitable for assessing the inter-translation quality of a given bilingual sentence pair;
the corpus quality assessment network comprises a word embedding layer, a sentence embedding layer, a splicing layer and a classification layer which are connected in sequence; the word embedding layer also comprises an attention module, wherein the attention module is used for capturing the inter-translation relation information between words of two sentences in the bilingual sentence pair; the sentence embedding layer is a convolutional neural network and/or a recurrent neural network; the classification layer comprises a fully connected layer and a Softmax layer which are sequentially connected.
2. The method of claim 1, wherein training a preset corpus quality assessment network to generate a corpus quality assessment model comprises:
training a preset corpus quality evaluation network to determine optimal parameters of the corpus quality evaluation network;
and taking the corpus quality evaluation network under the optimal parameters as a corpus quality evaluation model.
3. The method of claim 1, wherein the inter-translation quality tags include high quality tags and low quality tags, the constructing a bilingual corpus comprising:
obtaining a plurality of double sentence pairs, wherein the plurality of double sentence pairs comprise double sentence pairs which are completely mutually translated and double sentence pairs which are not completely mutually translated; and
the fully inter-translated double-sentence pairs are marked as high quality labels, and the incompletely inter-translated double-sentence pairs are marked as low quality labels.
4. A method according to claim 3, wherein the incompletely inter-translated bilingual sentence pairs are obtained based on the completely inter-translated bilingual sentence pairs, and the ratio of the number of incompletely inter-translated words to the total number of words of the respective sentence in the incompletely inter-translated bilingual sentence pairs is equal to or greater than a preset threshold.
5. The method of claim 3 or 4, wherein the pair of double sentences includes an original sentence and a translated sentence, the pair of incompletely translated double sentences being obtained by at least one of:
deleting at least one word in the original sentence and/or translated sentence in the fully inter-translated bilingual sentence pair;
adding at least one word in the original sentence and/or translated sentence in the fully mutually translated bilingual sentence pair;
changing the word sequence of the original sentence and/or translated sentence in the fully inter-translated bilingual sentence pair;
replacing at least one part of original sentence and/or translated sentence in the fully inter-translated bilingual sentence pair with a machine translation result;
the original sentence and/or translated sentence in the fully inter-translated bilingual sentence pair is replaced by other sentences besides the original sentence and/or translated sentence.
6. The method of claim 1, wherein the word embedding layer is used to generate a word vector sequence of the words included in the two sentences of a bilingual sentence pair;
the sentence embedding layer is used for respectively generating sentence vectors corresponding to the two sentences according to word vector sequences of words included in the two sentences;
the splicing layer is used for splicing sentence vectors corresponding to the two sentences to obtain spliced vectors;
and the classification layer is used for outputting the inter-translation quality label according to the splicing vector.
7. A method according to claim 1 or 6, wherein the classification layer outputs probabilities that the pair of double sentences belong to respective inter-translation quality tags, respectively, and takes the inter-translation quality tag with the highest probability as the inter-translation quality of the pair of double sentences.
8. A generation apparatus of a corpus quality assessment model, the apparatus comprising:
the corpus construction module is used for constructing a bilingual corpus, and the bilingual corpus comprises a plurality of bilingual sentence pairs and mutual translation quality labels corresponding to the bilingual sentence pairs;
the corpus quality assessment model training module is used for training a preset corpus quality assessment network by taking the bilingual sentence pairs and the inter-translation quality labels corresponding to the bilingual sentence pairs as training samples so as to generate a corpus quality assessment model, wherein the corpus quality assessment model is suitable for assessing the inter-translation quality of a given bilingual sentence pair;
the corpus quality assessment network comprises a word embedding layer, a sentence embedding layer, a splicing layer and a classification layer which are connected in sequence; the word embedding layer also comprises an attention module, wherein the attention module is used for capturing the inter-translation relation information between words of two sentences in the bilingual sentence pair; the sentence embedding layer is a convolutional neural network and/or a recurrent neural network; the classification layer comprises a fully connected layer and a Softmax layer which are sequentially connected.
9. A generation device of a corpus quality assessment model, comprising:
a memory for storing a program;
a processor for executing the program stored in the memory to perform the method of any one of claims 1 to 7.
10. A computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method of any of claims 1 to 7.
11. A method for evaluating the quality of mutual translation of bilingual sentence pairs, the method comprising:
acquiring a bilingual sentence pair to be evaluated;
inputting the bilingual sentence pairs into a trained corpus quality assessment model;
determining the inter-translation quality of the double-sentence pair according to the output of the corpus quality assessment model;
the corpus quality assessment model comprises a word embedding layer, a sentence embedding layer, a splicing layer and a classification layer which are connected in sequence; the word embedding layer also comprises an attention module, wherein the attention module is used for capturing the inter-translation relation information between words of two sentences in the bilingual sentence pair; the sentence embedding layer is a convolutional neural network and/or a recurrent neural network; the classification layer comprises a fully connected layer and a Softmax layer which are sequentially connected.
12. The method of claim 11, wherein the word embedding layer is used to generate a word vector sequence of the words included in the two sentences of a bilingual sentence pair;
the sentence embedding layer is used for respectively generating sentence vectors corresponding to the two sentences according to word vector sequences of words included in the two sentences;
the splicing layer is used for splicing sentence vectors corresponding to the two sentences to obtain spliced vectors;
and the classification layer is used for outputting the inter-translation quality label according to the splicing vector.
13. A method according to claim 11 or 12, wherein the classification layer outputs probabilities that the pair of double sentences belong to respective inter-translation quality tags, respectively, and takes the inter-translation quality tag with the highest probability as the inter-translation quality of the pair of double sentences.
14. A mutual translation quality assessment device for bilingual sentence pairs, the device comprising:
the bilingual sentence pair acquisition module is used for acquiring bilingual sentence pairs to be evaluated;
the double sentence pair input module is used for inputting the double sentence pair into the trained corpus quality evaluation model;
the corpus quality assessment model is used for determining the inter-translation quality of the double-sentence pair according to the output of the corpus quality assessment model;
the corpus quality assessment model comprises a word embedding layer, a sentence embedding layer, a splicing layer and a classification layer which are connected in sequence; the word embedding layer also comprises an attention module, wherein the attention module is used for capturing the inter-translation relation information between words of two sentences in the bilingual sentence pair; the sentence embedding layer is a convolutional neural network and/or a recurrent neural network; the classification layer comprises a fully connected layer and a Softmax layer which are sequentially connected.
15. A mutual translation quality assessment device of bilingual sentence pairs, comprising:
a memory for storing a program;
a processor for executing the program stored in the memory to perform the method of any one of claims 11-13.
16. A computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method of any of claims 11-13.
CN201810995294.4A 2018-08-29 2018-08-29 Corpus quality evaluation model generation method and double-sentence pair inter-translation quality evaluation method Active CN110874536B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810995294.4A CN110874536B (en) 2018-08-29 2018-08-29 Corpus quality evaluation model generation method and double-sentence pair inter-translation quality evaluation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810995294.4A CN110874536B (en) 2018-08-29 2018-08-29 Corpus quality evaluation model generation method and double-sentence pair inter-translation quality evaluation method

Publications (2)

Publication Number Publication Date
CN110874536A CN110874536A (en) 2020-03-10
CN110874536B true CN110874536B (en) 2023-06-27

Family

ID=69714634

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810995294.4A Active CN110874536B (en) 2018-08-29 2018-08-29 Corpus quality evaluation model generation method and double-sentence pair inter-translation quality evaluation method

Country Status (1)

Country Link
CN (1) CN110874536B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113642337B (en) * 2020-05-11 2023-12-19 阿里巴巴集团控股有限公司 Data processing method and device, translation method, electronic device, and computer-readable storage medium
CN112800745A (en) * 2021-02-01 2021-05-14 北京明略昭辉科技有限公司 Method, device and equipment for text generation quality evaluation
CN113761944B (en) * 2021-05-20 2024-03-15 腾讯科技(深圳)有限公司 Corpus processing method, device and equipment for translation model and storage medium
CN113641724B (en) * 2021-07-22 2024-01-19 北京百度网讯科技有限公司 Knowledge tag mining method and device, electronic equipment and storage medium
CN114386437B (en) * 2022-01-13 2022-09-27 延边大学 Mid-orientation translation quality estimation method and system based on cross-language pre-training model

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1203316A1 (en) * 1999-06-30 2002-05-08 Synerges OY System for internationalization of search input information
CN101777044A (en) * 2010-01-29 2010-07-14 中国科学院声学研究所 System for automatically evaluating machine translation by using sentence structure information and implementing method
JP2011118496A (en) * 2009-12-01 2011-06-16 National Institute Of Information & Communication Technology Language-independent word segmentation for statistical machine translation
CN102945232A (en) * 2012-11-16 2013-02-27 沈阳雅译网络技术有限公司 Training-corpus quality evaluation and selection method orienting to statistical-machine translation
CN105512114A (en) * 2015-12-14 2016-04-20 清华大学 Parallel sentence pair screening method and system
CN106066851A (en) * 2016-06-06 2016-11-02 清华大学 A kind of neural network training method considering evaluation index and device
CN106598959A (en) * 2016-12-23 2017-04-26 北京金山办公软件股份有限公司 Method and system for determining intertranslation relationship of bilingual sentence pairs

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101030197A (en) * 2006-02-28 2007-09-05 株式会社东芝 Method and apparatus for bilingual word alignment, method and apparatus for training bilingual word alignment model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
古丽尼尕尔・买合木提 ; 帕力旦・吐尔逊 ; 艾斯卡尔・艾木都拉 ; .基于词形分析的汉-维机器翻译性能分析.电脑知识与技术.2018,(第11期),全文. *

Also Published As

Publication number Publication date
CN110874536A (en) 2020-03-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant