CN112257468B - Multilingual neural machine translation performance improving method

Info

Publication number: CN112257468B (application CN202011212799.2A; earlier publication CN112257468A)
Authority: CN (China)
Original language: Chinese (zh)
Inventor: 杜权
Assignee: Shenyang Yayi Network Technology Co., Ltd.
Filing date: 2020-11-03
Publication dates: 2021-01-22 (CN112257468A), 2023-08-22 (CN112257468B granted)
Legal status: Active

Classifications

    • G06F40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06F18/22: Pattern recognition; matching criteria, e.g. proximity measures
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/30: Semantic analysis
    • G06N3/045: Neural network architectures; combinations of networks
    • G06N3/08: Neural network learning methods
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method for improving multilingual neural machine translation performance. The method comprises: constructing a multilingual parallel corpus and training an attention-based multilingual multi-layer neural machine translation model, saving the model at different training rounds; inputting semantically similar sentences of different languages into the models saved at the different training rounds; computing the pairwise similarity of the corresponding layer outputs with cosine similarity; observing how the similarity of every layer except the lowest and the topmost changes during training; selecting the layer with the lowest similarity; then, using the multilingual parallel corpus and the multi-layer multilingual neural machine translation model, stopping training after the recorded number of rounds, duplicating the parameters of the layer obtained in step 5) for each language, and continuing training with each language independently holding its own copy of that layer until the model converges. The invention reduces the interference among languages during training and finally improves the translation performance of the multilingual neural machine translation model.

Description

Multilingual neural machine translation performance improving method
Technical Field
The invention relates to techniques for improving multilingual neural machine translation, and in particular to a method for improving the translation performance of multilingual neural machine translation models.
Background
Machine translation (MT) is the use of computers to translate automatically between natural languages; specifically, it is the process of converting text in one natural language (the source language) into another natural language (the target language). Machine translation has long been regarded as one of the ultimate means of overcoming the translation problem among languages, and today's globalized world creates strong practical demand for the technology, reflecting both its great value and its application prospects.
Machine translation methods fall into two types: rule-based machine translation and corpus-based machine translation. Corpus-based machine translation can be further categorized into example-based machine translation, statistical machine translation, and neural machine translation. Early systems relied mainly on rules, but as research deepened, the rule-based approach gradually exposed problems such as the limited coverage of hand-written rules, conflicts as the number of rules grew, and difficulty in extending to new languages. The later example-based approach alleviated these problems to some extent but did not solve them fundamentally.
Early machine translation rules were defined manually, but such rules proved to have limited coverage of real corpora, and large rule sets were hard to maintain. A breakthrough came with the idea of statistical machine translation, first proposed in the 1990s by IBM, AT&T, and other institutions, although it still required some manually defined rules. In recent years, with the rise of deep learning, a deep-learning-based method, neural machine translation for short, has been proposed. It models the machine translation problem directly with a neural network and learns the model end to end; the whole process requires no hand-crafted feature design.
Neural machine translation systems based on the self-attention mechanism pass information directly between words at different positions, which is advantageous when information must travel long distances, and they have therefore attracted wide attention among comparable systems. Such models can represent the complex relationships between words at different positions of a sequence more fully. The central idea is to compute the relevance between words at arbitrary positions of the source or target sentence and to use that relevance as the weight with which the information of different words or segments is integrated, finally yielding a semantic representation of the source sentence.
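The weighting scheme just described can be made concrete with a minimal sketch of scaled dot-product self-attention. This is an illustrative simplification, not the patent's exact implementation; the dimensions and projection matrices are assumptions:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Minimal scaled dot-product self-attention over one sentence.

    X: (seq_len, d_model) word representations
    Wq, Wk, Wv: (d_model, d_k) projection matrices
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    # Relevance between every pair of positions, scaled for stability
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    # Each position integrates information from all positions,
    # weighted by its relevance to them
    return weights @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))                  # 6 tokens, d_model = 16
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)    # (6, 16)
```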
Although the attention-based neural machine translation model improves on earlier models, its industrial application exposes a scaling problem: building a separate model for every language pair in the traditional way requires enormous storage. Mutual translation among 100 languages needs 100 x 99 = 9,900 translation systems; at 2 GB of storage per model, that is 19.8 TB of model storage alone, not counting running memory. Meanwhile, low-resource languages may have few users yet still generate translation requests; serving them in real time either lowers the throughput of the whole translation service or requires additional hardware, and dropping the real-time requirement degrades the user experience, which hinders the adoption of machine translation. Multilingual neural machine translation techniques are therefore needed.
The multilingual neural machine translation model builds on the traditional neural machine translation model: a parallel corpus covering several languages is used to construct a multilingual dictionary so that a single model can translate multiple languages. To further improve performance, multiple layers are usually stacked, so the overall model is an attention-based multilingual multi-layer neural machine translation model. Using one multilingual model greatly reduces the storage required for mutual translation among many languages, increases the overall throughput of the server, reduces running memory, and substantially lowers operation and maintenance costs.
Although the attention-based multilingual multi-layer neural machine translation model solves the storage problem, its translation quality is not optimal: when multiple languages are trained together they interfere with one another, so no individual language reaches its best performance.
So far, no method has been reported that improves the translation performance of each language by alleviating the interference among the languages of a multilingual machine translation model.
Disclosure of Invention
Aiming at the defect in the prior art that mutual interference among languages degrades the performance of multilingual machine translation models, the invention provides a multilingual neural machine translation performance improving method that reduces inter-language interference during model training and thereby improves the translation performance of each language in the multilingual machine translation model.
In order to solve the technical problems, the invention adopts the following technical scheme:
the invention provides a multilingual neural machine translation performance improving method, which comprises the following steps:
1) Constructing a multilingual parallel corpus and an attention-based multilingual multi-layer neural machine translation model, training the model on the training set until it converges on the validation set, and saving the model parameters every certain number of training rounds, thereby obtaining the multilingual multi-layer neural machine translation models of the training process;
2) Screening semantically similar sentences of different languages from the multilingual parallel corpus by edit distance, inputting these sentences into the models saved at the different training rounds of step 1), and storing the output of every layer of the multilingual multi-layer neural machine translation model;
3) From the per-layer outputs of the semantically similar sentences of different languages in the models at different training steps obtained in step 2), computing the pairwise similarity between the different-language sentences with cosine similarity, and averaging the similarities between the different-language sentences as the similarity between the corresponding layers of the different-language models;
4) From the similarities obtained in step 3), observing how the similarity of every layer except the lowest and the topmost changes with the number of training rounds, and recording the round at which the similarities of all these layers have peaked;
5) From the similarities obtained in step 3), taking the values of the different layers at the last round and comparing them to find the layer with the lowest similarity;
6) Using the multilingual parallel corpus and the multi-layer multilingual neural machine translation model of step 1), training for the number of rounds obtained in step 4) and then stopping, duplicating the parameters of the layer obtained in step 5) for each language, and continuing training with each language independently holding its own copy of that layer's parameters until the model converges.
In step 2), semantically similar sentences of different languages are screened from the multilingual parallel corpus by edit distance, specifically as follows:
201) Grouping the bilingual parallel data by language pair, extracting sentences with the same semantics from them, and selecting the English side of each bilingual corpus for similar-semantics comparison;
202) First removing punctuation from the English sides of the different corpora, then comparing them by word-level edit distance; the word-level edit distance is the minimum number n of operations, each deleting, adding, or replacing one word, needed to convert one sentence into the other, and n is the word-based edit distance of the two sentences; the smaller the edit distance between two sentences, the closer their meanings; English sentence pairs whose edit distance is below a threshold of 2-5 are taken out as semantically similar pairs, and each selected sentence must contain more than 5 words;
203) For each English sentence of the pairs from step 202), taking out the other-side sentence of the same semantics from its bilingual corpus; the resulting four sentences are regarded as semantically similar.
In step 2), the semantically similar sentences of different languages are input into the models saved at different training rounds, and the output of every layer of the multilingual multi-layer neural machine translation model is stored, specifically as follows:
204) Taking out the models saved at every round in step 1), and inputting the two groups of similar parallel corpora from step 203) into the multilingual multi-layer model respectively;
205) Storing, in one-to-one correspondence, the hidden-layer state vectors output by every layer of the multi-layer model for the parallel corpora.
In step 3), from the per-layer outputs of the semantically similar sentences of different languages in the models at different training steps, the pairwise cosine similarity between them is computed, and the similarities between the different-language sentences are averaged, specifically as follows:
301) Taking the hidden states of the parallel corpora inside the model as the model's latent abstractions of the sentences, and computing the cosine similarity between the two latent vectors as the similarity of the two vectors in the same space; the value range is [-1, 1], where values close to -1 indicate dissimilarity, values close to 0 indicate no correlation, and values close to 1 indicate high similarity;
302) Averaging the computed similarities of all sentence pairs as the final similarity of the two languages;
303) Performing steps 301) to 302) for the models saved for every pair of languages and for every layer within them, until the similarity between any two languages has been computed.
In step 4), according to the obtained similarities, the change of the similarity of every layer except the lowest and the topmost over the training rounds is observed, and the round at which the similarities of all layers have peaked is recorded, specifically as follows:
401) Plotting the similarity between languages of each layer of the model as a function of the training round, and repeating until every layer except the first and the last has been plotted;
402) From the change of the similarity over the rounds in the plot, obtaining the round epoch_i at which the similarity of the i-th layer reaches its maximum, and finally selecting the maximum value epoch_max from the round set {epoch_i}.
In step 5), the values of the different layers at the last round are selected from the similarities obtained in step 3) and compared to find the layer with the lowest similarity; specifically, the converged model is used to compute the similarity layer_i between the corresponding layers, and from the similarity set {layer_i} the layer Min with the smallest similarity is selected as the anti-interference layer.
The invention has the following beneficial effects and advantages:
1. The multilingual neural machine translation performance improving method of the invention first uses a bilingual pre-trained model and then trains the layer with the lowest similarity separately, which reduces the interference among languages during training and finally improves the translation performance of the multilingual neural machine translation model.
2. The invention recognizes that the semantic similarity between languages can serve as an indicator of the inter-language interference phenomenon in attention-based multilingual multi-layer neural machine translation; by computing cosine similarity, the change of the similarity during training is used to judge when interference begins, and the layer with the lowest similarity is then trained separately in subsequent training, which reduces the interference and improves performance.
3. The method is simple and effective, does not conflict with other inference methods, requires no hyper-parameter tuning at any step, and can be applied to mainstream multilingual neural machine translation systems to improve translation performance.
Drawings
FIG. 1 is a diagram of a prior-art neural machine translation model based on an attention mechanism;
FIG. 2 is a diagram of the multilingual multi-layer neural machine translation model used in the present invention;
FIG. 3 is a diagram showing the similarity variation between different languages at different layers during training according to the present invention;
FIG. 4 is a graph showing the similarity between languages at each layer after training convergence according to the present invention;
FIG. 5 is a structural diagram of the multilingual neural machine translation performance improving method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention provides a multilingual neural machine translation performance improving method, which comprises the following steps:
1) Constructing a multilingual parallel corpus and an attention-based multilingual multi-layer neural machine translation model, training the model on the training set until it converges on the validation set, and saving the model parameters every certain number of training rounds, thereby obtaining the multilingual multi-layer neural machine translation models of the training process;
2) Screening semantically similar sentences of different languages from the multilingual parallel corpus by edit distance, inputting these sentences into the models saved at the different training rounds of step 1), and storing the output of every layer of the multilingual multi-layer neural machine translation model;
3) From the per-layer outputs of the semantically similar sentences of different languages in the models at different training steps obtained in step 2), computing the pairwise similarity between the different-language sentences with cosine similarity, and averaging the similarities between the different-language sentences as the similarity between the corresponding layers of the different-language models;
4) From the similarities obtained in step 3), observing how the similarity of every layer except the lowest and the topmost changes with the number of training rounds, and recording the round at which the similarities of all these layers have peaked;
5) From the similarities obtained in step 3), taking the values of the different layers at the last round and comparing them to find the layer with the lowest similarity;
6) Using the multilingual parallel corpus and the multi-layer multilingual neural machine translation model of step 1), training for the number of rounds obtained in step 4) and then stopping, duplicating the parameters of the layer obtained in step 5) for each language, and continuing training with each language independently holding its own copy of that layer's parameters until the model converges.
The invention evaluates the mutual interference of the languages in an attention-based multilingual multi-layer neural machine translation system through the similarity between the languages, and uses a layer separation method based on bilingual pre-training to reduce that mutual interference inside the model during training, thereby improving translation performance.
In step 1), a machine translation vocabulary is generated from the multilingual parallel corpus by word-frequency statistics, and the parallel corpus is divided into a training set and a validation set. The multilingual parallel corpus consists of the parallel corpora of several languages after a user-defined cleaning pipeline such as length filtering and symbol normalization. The attention-based multilingual multi-layer neural machine translation model consists of an encoder and a decoder; as shown in FIG. 1, both are formed by stacking several layers, which constitutes multi-layer neural machine translation.
Each encoder layer consists of a self-attention sub-layer and a feed-forward network sub-layer. To improve the stability of network training, residual connections and layer normalization are applied in each sub-layer after its computation. The residual connection effectively exposes the representations of the lower layers to the upper layers and, from the viewpoint of parameter updating, allows the gradients of the upper layers to be propagated back to the lower layers more easily. However, the accumulated residuals can make the range of the network's inputs too large and destabilize training, so a layer normalization mechanism transforms the result of the residual combination into a distribution with mean 0 and variance 1 to solve this problem.
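The sub-layer pattern just described can be sketched in PyTorch as follows. This is a minimal post-norm sketch for illustration; the framework choice and the dimensions are assumptions, not the patent's implementation:

```python
import torch
import torch.nn as nn

class ResidualNormSublayer(nn.Module):
    """Post-norm sub-layer wrapper: LayerNorm(x + Sublayer(x))."""
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)  # re-centers to mean 0, variance 1

    def forward(self, x):
        # The residual connection exposes lower-layer states to upper
        # layers and shortens the gradient path during back-propagation.
        return self.norm(x + self.sublayer(x))

ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
block = ResidualNormSublayer(512, ffn)
print(block(torch.randn(8, 20, 512)).shape)  # torch.Size([8, 20, 512])
```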
Each decoder layer adds an encoder-decoder attention sub-layer on top of the encoder layer structure. The final output of the decoder is a matrix with the dimensionality of the vocabulary; a normalization operation maps its values into [0, 1], giving a probability distribution over the words of the vocabulary that serves as the reference for the decoding result.
Constructing the multilingual model requires little structural change, but all languages must share a unified vocabulary. As shown in FIG. 2, after a subword segmentation tool splits the multilingual corpus into subwords, word frequencies are counted and the words of the different languages are merged into one vocabulary; the word embedding layer of the entire input is shared, and the encoder and decoder layers are stacked, yielding the attention-based multilingual multi-layer neural machine translation model.
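The vocabulary construction can be sketched as follows: merging frequency-counted, already subword-segmented tokens of all languages into one shared vocabulary. The special tokens, the 32k size cap, and the data layout are illustrative assumptions:

```python
from collections import Counter

def build_joint_vocab(corpora, max_size=32000):
    """Build one shared vocabulary over subword-segmented corpora of
    all languages, ordered by frequency.

    corpora: dict mapping language code -> iterable of token lists.
    """
    counts = Counter()
    for lang, sentences in corpora.items():
        for tokens in sentences:
            counts.update(tokens)
    specials = ["<pad>", "<unk>", "<s>", "</s>"]
    vocab = specials + [tok for tok, _ in
                        counts.most_common(max_size - len(specials))]
    return {tok: idx for idx, tok in enumerate(vocab)}

corpora = {
    "zh": [["我@@", "想", "吃", "苹果"]],
    "en": [["I", "want", "to", "eat", "app@@", "le"]],
}
vocab = build_joint_vocab(corpora)
print(len(vocab), vocab["<unk>"])  # shared ids across all languages
```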
During model training, the training data are shuffled and fed into the model, and the model fits the true distribution through its output distribution to make predictions. One pass over all training data constitutes one round (epoch); the model is trained for several rounds until it reaches a convergence state, and the model parameters of every round are saved.
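A minimal sketch of this round-based training loop with per-round checkpointing might look as follows; the stand-in model, loss, and file naming are illustrative assumptions:

```python
import os
import torch

def train_and_checkpoint(model, optimizer, train_loader, loss_fn,
                         num_epochs, ckpt_dir="checkpoints"):
    """Train for num_epochs rounds, saving the parameters after every
    round so later analysis can probe each training stage. The data
    are assumed to be shuffled by the loader each round."""
    os.makedirs(ckpt_dir, exist_ok=True)
    for epoch in range(1, num_epochs + 1):
        model.train()
        for src, tgt in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(src), tgt)
            loss.backward()
            optimizer.step()
        # one full pass over the data = one round; snapshot its parameters
        torch.save(model.state_dict(), f"{ckpt_dir}/epoch{epoch:03d}.pt")

# Toy demonstration with a stand-in model
model = torch.nn.Linear(4, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
data = [(torch.randn(8, 4), torch.randint(0, 2, (8,))) for _ in range(3)]
train_and_checkpoint(model, opt, data,
                     torch.nn.functional.cross_entropy, num_epochs=2)
```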
In step 2), semantically similar sentences of different languages are screened from the multilingual parallel corpus by edit distance, specifically as follows:
201) Grouping the bilingual parallel data by language pair, extracting sentences with the same semantics from them, and, because English generally serves as the translation bridge between languages, selecting the English side of each bilingual corpus for similar-semantics comparison;
202) First removing punctuation from the English sides of the different corpora, then comparing them by word-level edit distance; the word-level edit distance is the minimum number n of operations, each deleting, adding, or replacing one word, needed to convert one sentence into the other, and n is the word-based edit distance of the two sentences; the smaller the edit distance between two sentences, the closer their meanings; English sentence pairs whose edit distance is below a threshold of 2-5 are taken out as semantically similar pairs, and, to avoid selecting too many short sentences, each selected sentence must contain more than 5 words;
203) For each English sentence of the pairs from step 202), taking out the other-side sentence of the same semantics from its bilingual corpus; the resulting four sentences are regarded as semantically similar.
Step 201) requires grouping the bilingual parallel data by language pair and extracting semantically similar sentences from them. To do this, sentence pairs between, for example, Malay and Indonesian must be found through a bridge between the different parallel corpora. Because English generally serves as the translation bridge between languages, Malay-English and Indonesian-English parallel sentences are comparatively easy to find; these build the respective corpora, and the English sides of the corpora are then compared for similar semantics.
In step 202), take the sentences "I want to eat apples" and "I want to eat peaches" as an example: only one word differs, so the edit distance is 1; changing the single word "apples" in the first sentence to "peaches" yields the second. English sentence pairs with an edit distance below 3 are taken out and their semantics are considered similar, and, to avoid selecting too many short sentences, each selected sentence must contain more than 5 words.
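The word-level edit distance and the screening rule can be sketched as follows. The thresholds follow the text (distance below 3, more than 5 words); the helper names are hypothetical:

```python
def word_edit_distance(a, b):
    """Levenshtein distance over words: minimum number of word
    deletions, insertions, and substitutions turning sentence a into b."""
    a, b = a.split(), b.split()
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete
                           cur[j - 1] + 1,              # insert
                           prev[j - 1] + (wa != wb)))   # substitute
        prev = cur
    return prev[-1]

def is_similar_pair(a, b, max_dist=2, min_len=6):
    """Screen English sides: close edit distance and long enough
    to avoid trivially short sentences."""
    return (min(len(a.split()), len(b.split())) >= min_len
            and word_edit_distance(a, b) <= max_dist)

print(word_edit_distance("i want to eat apples",
                         "i want to eat peaches"))      # 1
print(is_similar_pair("i want to eat apples today",
                      "i want to eat peaches today"))   # True
```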
In step 203), the other-side sentences corresponding to the English sentences extracted in step 202) are taken out of their bilingual corpora, and the four sentences are regarded as semantically similar. For example, the English sentences "I want to eat apples" and "I want to eat peaches" are semantically similar; the Chinese sentences paired with them in the Chinese-English parallel corpus are then retrieved, and all four sentences are regarded as having similar semantics.
In step 2), the semantically similar sentences of different languages are input into the models saved at different training rounds, and the output of every layer of the multilingual multi-layer neural machine translation model is stored, specifically as follows:
204) Taking out the models saved at every round in step 1), and inputting the two groups of similar parallel corpora from step 203) into the multilingual multi-layer model respectively;
205) Storing, in one-to-one correspondence, the hidden-layer state vectors output by every layer of the multi-layer model for the parallel corpora.
Since the model is a stack of several encoding and decoding layers, and different layers abstract the sentence differently, each layer must be handled separately.
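One way to store each layer's output is with PyTorch forward hooks, sketched below. The toy stacked encoder stands in for the real model, whose structure is an assumption:

```python
import torch

def capture_layer_outputs(layers, forward_fn):
    """Run forward_fn once and return every layer's output hidden states.

    layers: iterable of nn.Module encoder/decoder layers.
    """
    captured, handles = {}, []
    for idx, layer in enumerate(layers):
        def hook(module, inputs, output, idx=idx):
            # detach so stored states do not hold the autograd graph
            captured[idx] = output.detach()
        handles.append(layer.register_forward_hook(hook))
    try:
        forward_fn()
    finally:
        for h in handles:
            h.remove()
    return captured

# Toy stand-in for a stacked encoder: 6 layers, hidden size 16
encoder = torch.nn.ModuleList([torch.nn.Linear(16, 16) for _ in range(6)])
x = torch.randn(5, 16)
def run():
    h = x
    for layer in encoder:
        h = layer(h)
states = capture_layer_outputs(encoder, run)
print(len(states), states[0].shape)   # 6 torch.Size([5, 16])
```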
In step 3), from the per-layer outputs of the semantically similar sentences of different languages in the models at different training steps, the pairwise cosine similarity between them is computed, and the similarities between the different-language sentences are averaged, specifically as follows:
301) Taking the hidden states of the parallel corpora inside the model as the model's latent abstractions of the sentences, and computing the cosine similarity between the two latent vectors; the cosine similarity is the cosine of the angle between two high-dimensional vectors and measures their similarity in the same space; its range is [-1, 1], where values close to -1 indicate dissimilarity, values close to 0 indicate no correlation, and values close to 1 indicate high similarity;
302) Since there are many semantically similar sentence pairs between two languages and each pair is scored in step 301), the computed similarities of all sentence pairs are averaged to obtain a single similarity of the two languages, giving a simple measure of how similar the two languages are inside the model;
303) Performing steps 301) to 302) for the models saved for every pair of languages and for every layer within them, until the similarity between any two languages has been computed.
In step 301), the specific formula of the cosine similarity is:

cos(Θ) = Σ_{i=1..n} A_i B_i / ( √(Σ_{i=1..n} A_i²) · √(Σ_{i=1..n} B_i²) )

where A_i and B_i are the scalar components of vectors A and B at position i, n is the dimensionality of A and B, and i is the dimension index. The value cos(Θ) serves as the similarity of these two vectors in the same space.
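A direct transcription of this formula, together with the averaging of step 302), might look as follows; treating each layer's state as one mean-pooled sentence vector is an assumption:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = sum_i A_i * B_i / (||A|| * ||B||), in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def language_pair_similarity(states_x, states_y):
    """Average cosine similarity of one layer's hidden states over all
    similar-semantic sentence pairs of two languages. Each row is the
    layer's sentence representation (pooling choice is an assumption)."""
    return float(np.mean([cosine_similarity(x, y)
                          for x, y in zip(states_x, states_y)]))

rng = np.random.default_rng(1)
pairs_x = rng.normal(size=(10, 512))   # 10 sentence pairs, layer dim 512
pairs_y = pairs_x + 0.1 * rng.normal(size=(10, 512))
print(round(language_pair_similarity(pairs_x, pairs_y), 3))  # close to 1.0
```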
In step 4), according to the obtained similarities, the change of the similarity of every layer except the lowest and the topmost over the training rounds is observed, and the round at which the similarities of all layers have peaked is recorded, specifically as follows:
401) Plotting the similarity between languages of each layer of the model as a function of the training round, and repeating until every layer except the first and the last has been plotted;
402) From the change of the similarity over the rounds in the plot, obtaining the round epoch_i at which the similarity of the i-th layer reaches its maximum, and finally selecting the maximum value epoch_max from the round set {epoch_i}.
A chart is drawn of the change over the different rounds: the horizontal axis is the number of training rounds, the vertical axis is the similarity, and the different polylines represent how the similarity between languages at different layers changes during training.
As can be seen in FIG. 3, the rounds at which the layers' similarities peak are {5, 5, 3, 3}, so the maximum value 5 is selected from the set as epoch_max.
In step 5), the values of the different layers at the last round are selected from the similarities obtained in step 3) and compared to find the layer with the lowest similarity; specifically, the converged model is used to compute the similarity layer_i between the corresponding layers, and from the similarity set {layer_i} the layer Min with the smallest similarity is selected as the anti-interference layer.
As can be seen in FIG. 4, layer 6 has the lowest similarity, so layer 6 is chosen and referred to as the anti-interference layer.
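Steps 4) and 5) can be sketched together as follows, given a per-layer, per-round similarity table; the indexing conventions and toy numbers are assumptions:

```python
import numpy as np

def select_epoch_and_layer(sim):
    """sim[layer][round]: similarity of one intermediate layer at one
    training round (in the patent, the lowest and topmost layers are
    excluded from the epoch search, while the layer search compares
    final-round values; this sketch conflates the two layer sets).

    Returns (epoch_max, min_layer) with 1-based numbering."""
    sim = np.asarray(sim)
    peak_epochs = sim.argmax(axis=1) + 1        # round where each layer peaks
    epoch_max = int(peak_epochs.max())          # latest peak of any layer
    min_layer = int(sim[:, -1].argmin()) + 1    # lowest last-round similarity
    return epoch_max, min_layer

# Toy numbers mirroring the text: per-layer peaks at rounds {5, 5, 3, 3}
sim = [[.1, .2, .3, .4, .5, .45, .44, .43, .42],
       [.1, .2, .3, .4, .5, .45, .44, .43, .42],
       [.1, .2, .5, .4, .3, .3, .3, .3, .3],
       [.1, .2, .5, .4, .3, .3, .3, .3, .3]]
print(select_epoch_and_layer(sim))  # (5, 3) under these toy numbers
```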
In step 6), the multilingual parallel corpus and the multi-layer multilingual neural machine translation model are trained; training stops after the number of rounds obtained in step 4), the layer obtained in step 5) is selected and its parameters are duplicated for each language, and training then continues with each language independently holding its own copy of that layer's parameters until the model converges. The specific steps are as follows:
601) Bilingual pre-training is performed according to the number of rounds selected in step 402): the model of step 1) is first trained on the training corpus for epoch_max rounds and then training stops;
602) The model parameters are modified: the layer Min selected in step 5) is separated, and its parameters are copied once for each language so that the Min layer holds independent, mutually non-interfering parameters for every language; the overall model, with the low-similarity layer separated, is shown in FIG. 5;
603) Training continues; at the anti-interference layer, the parameters of a language are updated only when that language is being trained, until training converges and stops, as sketched below.
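A minimal PyTorch sketch of this layer separation, with one independent copy of the anti-interference layer per language; the language-routing interface is an assumption, not the patent's exact implementation:

```python
import copy
import torch
import torch.nn as nn

class LanguageSeparatedLayer(nn.Module):
    """Wrap the anti-interference layer Min with one independent copy
    per language; every other layer of the model stays shared."""
    def __init__(self, shared_layer, languages):
        super().__init__()
        self.per_lang = nn.ModuleDict(
            {lang: copy.deepcopy(shared_layer) for lang in languages})

    def forward(self, x, lang):
        # Only the current language's copy runs (and thus receives
        # gradients), so languages no longer interfere at this layer.
        return self.per_lang[lang](x)

shared = nn.Linear(512, 512)                    # stand-in for layer Min
sep = LanguageSeparatedLayer(shared, ["zh", "en", "ja"])
out = sep(torch.randn(4, 512), lang="ja")
out.sum().backward()
print(sep.per_lang["ja"].weight.grad is not None,  # True: updated
      sep.per_lang["zh"].weight.grad is None)      # True: untouched
```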
For example, assume a Chinese-English sentence pair whose English side is "I want to eat an apple" and a Japanese-English sentence pair whose English side is "I want to eat a peach". The edit distance between the two English sentences is 1, so the four sentences are regarded as semantically similar. The Chinese-English and Japanese-English sentence pairs are fed into the models of the different rounds, and the similarity set {0.363799329, 0.37508004, 0.401865154, 0.458326494, 0.458718845, 0.465400045, 0.464847689, 0.462193273, 0.467572979} is computed, from which the 5th value, 0.458718845, is taken as the peak; the per-layer similarity set of the final model {0.687403859, 0.713000094, 0.809822599, 0.757797548, 0.786908749, 0.352922586} is then computed, and comparison shows that the 6th layer has the minimum similarity. Therefore, after the model has been trained to the 5th round, the layer-6 parameters are trained separately until convergence.
Testing on the public data sets of the Workshop on Machine Translation (WMT) shows that the invention achieves performance improvements in multiple languages. Further, on the IWSLT17 public data set, the method improves over the baseline system in all language translation directions.
From the perspective of reducing information interference among the different languages in a multilingual neural machine translation model, the invention proposes to use a bilingual pre-trained model and then continue training with the lowest-similarity layer separated, which reduces the interference among the languages during training and finally improves the translation performance of the multilingual neural machine translation model.
The invention recognizes that the semantic similarity between languages can serve as an indicator of the inter-language interference phenomenon in attention-based multilingual multi-layer neural machine translation; by computing cosine similarity, the change of the similarity during training is used to judge when interference begins, and the layer with the lowest similarity is then trained separately in subsequent training, which reduces the interference and improves performance.
The method is simple and effective, does not conflict with other inference methods, requires no hyper-parameter tuning at any step, and can be applied to mainstream multilingual neural machine translation systems to improve translation performance.

Claims (6)

1. A multilingual neural machine translation performance improving method, characterized by comprising the following steps:
1) Constructing a multilingual parallel corpus and an attention-based multilingual multi-layer neural machine translation model, training the model on the training set until it converges on the validation set, and saving the model parameters every certain number of training rounds, thereby obtaining the multilingual multi-layer neural machine translation models of the training process;
2) Screening semantically similar sentences of different languages from the multilingual parallel corpus by edit distance, inputting these sentences into the models saved at the different training rounds of step 1), and storing the output of every layer of the multilingual multi-layer neural machine translation model;
3) From the per-layer outputs of the semantically similar sentences of different languages in the models at different training steps obtained in step 2), computing the pairwise similarity between the different-language sentences with cosine similarity, and averaging the similarities between the different-language sentences as the similarity between the corresponding layers of the different-language models;
4) From the similarities obtained in step 3), observing how the similarity of every layer except the lowest and the topmost changes with the number of training rounds, and recording the round at which the similarities of all these layers have peaked;
5) From the similarities obtained in step 3), taking the values of the different layers at the last round and comparing them to find the layer with the lowest similarity;
6) Using the multilingual parallel corpus and the multi-layer multilingual neural machine translation model of step 1), training for the number of rounds obtained in step 4) and then stopping, duplicating the parameters of the layer obtained in step 5) for each language, and continuing training with each language independently holding its own copy of that layer's parameters until the model converges.
2. The multilingual neural machine translation performance improving method according to claim 1, characterized in that in step 2), semantically similar sentences of different languages are screened from the multilingual parallel corpus by edit distance, specifically as follows:
201) Grouping the bilingual parallel data by language pair, extracting sentences with the same semantics from them, and selecting the English side of each bilingual corpus for similar-semantics comparison;
202) First removing punctuation from the English sides of the different corpora, then comparing them by word-level edit distance; the word-level edit distance is the minimum number n of operations, each deleting, adding, or replacing one word, needed to convert one sentence into the other, and n is the word-based edit distance of the two sentences; the smaller the edit distance between two sentences, the closer their meanings; English sentence pairs whose edit distance is below a threshold of 2-5 are taken out as semantically similar pairs, and each selected sentence must contain more than 5 words;
203) For each English sentence of the pairs from step 202), taking out the other-side sentence of the same semantics from its bilingual corpus; the resulting four sentences are regarded as semantically similar.
3. The multilingual neural machine translation performance improving method according to claim 1, characterized in that in step 2), the semantically similar sentences of different languages are input into the models saved at different training rounds, and the output of every layer of the multilingual multi-layer neural machine translation model is stored, specifically as follows:
204) Taking out the models saved at every round in step 1), and inputting the two groups of similar parallel corpora from step 203) into the multilingual multi-layer model respectively;
205) Storing, in one-to-one correspondence, the hidden-layer state vectors output by every layer of the multi-layer model for the parallel corpora.
4. The multilingual neural machine translation performance improving method according to claim 1, characterized in that in step 3), from the per-layer outputs of the semantically similar sentences of different languages in the models at different training steps, the pairwise cosine similarity between them is computed, and the similarities between the different-language sentences are averaged, specifically as follows:
301) Taking the hidden states of the parallel corpora inside the model as the model's latent abstractions of the sentences, and computing the cosine similarity between the two latent vectors as the similarity of the two vectors in the same space; the value range is [-1, 1], where values close to -1 indicate dissimilarity, values close to 0 indicate no correlation, and values close to 1 indicate high similarity;
302) Averaging the computed similarities of all sentence pairs as the final similarity of the two languages;
303) Performing steps 301) to 302) for the models saved for every pair of languages and for every layer within them, until the similarity between any two languages has been computed.
5. The multilingual neural machine translation performance improving method according to claim 1, characterized in that in step 4), according to the obtained similarities, the change of the similarity of every layer except the lowest and the topmost over the training rounds is observed, and the round at which the similarities of all layers have peaked is recorded, specifically as follows:
401) Plotting the similarity between languages of each layer of the model as a function of the training round, and repeating until every layer except the first and the last has been plotted;
402) From the change of the similarity over the rounds in the plot, obtaining the round epoch_i at which the similarity of the i-th layer reaches its maximum, and finally selecting the maximum value epoch_max from the round set {epoch_i}.
6. The multilingual neural machine translation performance improving method according to claim 1, characterized in that in step 5), the values of the different layers at the last round are selected from the similarities obtained in step 3) and compared to find the layer with the lowest similarity; specifically, the converged model is used to compute the similarity layer_i between the corresponding layers, and from the similarity set {layer_i} the layer Min with the smallest similarity is selected as the anti-interference layer.

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant