CN109933808B - Neural machine translation method based on dynamic configuration decoding - Google Patents

Neural machine translation method based on dynamic configuration decoding

Info

Publication number
CN109933808B
CN109933808B (application CN201910095193.6A)
Authority
CN
China
Prior art keywords
decoding
decision model
model
sentence
configuration
Prior art date
Legal status
Active
Application number
CN201910095193.6A
Other languages
Chinese (zh)
Other versions
CN109933808A (en)
Inventor
Wang Qiang
Li Yanyang
Current Assignee
Shenyang Yayi Network Technology Co ltd
Original Assignee
Shenyang Yayi Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenyang Yayi Network Technology Co ltd
Priority to CN201910095193.6A
Publication of CN109933808A
Application granted
Publication of CN109933808B

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a neural machine translation method based on dynamic configuration decoding. A decision model based on a convolutional neural network is added to a Transformer model; the encoding information produced by the encoder is fed into the decision model as input, and the decision model carries out convolution, pooling and normalization processing on the encoding information and outputs a corresponding decoding configuration. A trained decoder decodes according to this decoding configuration, and the selected decoding configuration is scored; based on the scoring result, the decision model is improved by a reinforcement learning method to obtain a trained decision model. Translation is then performed with the trained improved self-attention mechanism model, producing translations with higher accuracy. The invention uses a small decision model with low training cost, which is trained end to end on top of the already trained machine translation model, without retraining the whole machine translation model.

Description

Neural machine translation method based on dynamic configuration decoding
Technical Field
The invention belongs to the technical field of machine translation, and relates to a neural machine translation method based on dynamic configuration decoding.
Background
Neural machine translation techniques currently employ neural networks based on an encoder-decoder framework for modeling. First, the encoder of the network maps the input source sentence to fixed-dimensional vectors, and then the decoder of the network uses these vectors to generate the corresponding translation word by word. This approach has achieved the best translation performance for translation between many different languages.
When the decoder of a neural network generates translation results, there are typically many parameters that control its behavior. For example, the decoder may generate several possible translation results together with corresponding scores. Generally we pick the translation result with the highest score, but in many cases the network is not good enough, and we need to adjust these scores using a length ratio parameter to prevent translation results that are too short or too long from being picked. An example of score adjustment using the length ratio is as follows:
correct answer: she had many beautiful clothes
Translation result 1: she had many beautiful clothes
Result 1 scores: -0.1, -0.2, -0.15, -0.13, -0.1
Translation result 2: there are many clothes
Result 2 scores: -0.12, -0.15, -0.1
For translation result 1, the total score is (-0.1 + -0.2 + -0.15 + -0.13 + -0.1)/5 = -0.68/5 = -0.136, where 5 is the length of translation result 1, and the total score of translation result 2 is (-0.12 + -0.15 + -0.1)/3 = -0.37/3 ≈ -0.123. Because translation result 2 scores higher than translation result 1, the decoder will pick translation result 2 as the final output. Obviously, translation result 1 is closer to the correct answer, while translation result 2 is too short by comparison. The length ratio parameter takes the length of the translation result into account on top of the total score. With a length ratio equal to 1.5, the score of translation result 1 becomes -0.68/5^1.5 ≈ -0.06, where the base 5 is the length of translation result 1, i.e. the number of words. The score of translation result 2 correspondingly becomes -0.37/3^1.5 ≈ -0.07. Picking on the basis of these scores, the decoder will select translation result 1 as the final output.
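As a minimal sketch of this length-ratio adjustment (assuming the score is the sum of per-word log-probabilities divided by the length raised to the length ratio, as in the worked example above; the function name is illustrative):

```python
def length_normalized_score(word_scores, length_ratio=1.5):
    """Total score divided by length**length_ratio (length_ratio = 1 is a plain average)."""
    return sum(word_scores) / (len(word_scores) ** length_ratio)

result_1 = [-0.1, -0.2, -0.15, -0.13, -0.1]  # "she had many beautiful clothes"
result_2 = [-0.12, -0.15, -0.1]              # "there are many clothes"

# Plain average: result 2 wins (-0.123 > -0.136), although it is too short.
print(length_normalized_score(result_1, 1.0), length_normalized_score(result_2, 1.0))
# With a length ratio of 1.5: result 1 wins (about -0.06 > about -0.07).
print(length_normalized_score(result_1, 1.5), length_normalized_score(result_2, 1.5))
```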
In addition to the length ratio, the decoder has many other parameters that control different aspects of its behavior, such as the beam size, which controls the range of the decoder's search, and the decoding length, which limits the number of words in the final translation result. In practice, the decoder usually uses a globally uniform parameter configuration to generate translation results, i.e. the parameter configuration used by the decoder does not change regardless of the source sentence. In fact, the optimal parameter configurations of different source sentences differ; for example, some sentences tend to require short translations, while others tend to require long translations. An example of using different length ratio settings for different source sentences is as follows:
source language 1: concerning about
Target language 1: take care of
Source language 2: is easier to be
Target language 2: easier
For source language 1, it has only one word and its correct translation has three words, so the decoder should be inclined to generate long translations, i.e. a larger length ratio, when generating translations. Whereas for source language 2, there are two words and the correct translation is only one, so the decoder should tend to generate short translations, i.e. smaller length ratios.
Therefore, a decision method is needed to select the corresponding optimal parameter configuration according to different source sentences.
Disclosure of Invention
The invention aims to provide a neural machine translation method based on dynamic configuration decoding, so as to solve the problem in prior-art neural machine translation decoding that the network produces wrong translation results because different parameter configurations cannot be set for different input source sentences.
The invention provides a neural machine translation method based on dynamic configuration decoding, which comprises the following steps:
step 1: adding a decision model between an encoder and a decoder of a Transformer model of the self-attention mechanism to form an improved self-attention mechanism model, wherein the decision model is established on the basis of a convolutional neural network;
Step 2: inputting bilingual sentence-level parallel data, performing word segmentation processing on a source language and a target language respectively to obtain bilingual parallel sentence pairs after word segmentation, and training an encoder and a decoder of an improved self-attention mechanism model;
Step 3: coding a source language sentence of the bilingual parallel sentence pair after word segmentation by using a trained coder according to a time sequence to obtain the state of each time sequence on a hidden layer, namely the coding information of different layers under each time sequence;
Step 4: the obtained coding information is used as input and sent into a decision model, the decision model carries out convolution, pooling and normalization processing on the coding information, and corresponding decoding configuration is output;
Step 5: decoding by using a trained decoder according to the decoding configuration output by the decision model, and scoring the selected decoding configuration;
step 6: according to the score given by the evaluation standard, improving the decision model by adopting a reinforcement learning method to obtain a trained decision model;
Step 7: inputting a source sentence into the encoder of the improved self-attention mechanism model, sending the obtained encoding information into the decision model, and translating by the decoder according to the decoding configuration output by the decision model.
In the neural machine translation method based on dynamic configuration decoding, the bilingual sentence-level parallel data input in the step 2 is a set of bilingual inter-translated sentence pairs, and each sentence pair consists of a source language sentence and a target language sentence.
In the neural machine translation method based on dynamic configuration decoding, a maximum likelihood method is adopted in step 2 to train an encoder and a decoder of an improved self-attention mechanism model.
In the neural machine translation method based on dynamic configuration decoding of the present invention, the step 3 specifically is:
given a source sentence, the encoder uses N nonlinear transformation layers for encoding, finally obtaining the following encoding information:
H ∈ R^(N×T×C)
where N is the number of nonlinear transformation layers included in the encoder, T is the length of the input source sentence, and each element of H is a word vector of length C.
In the neural machine translation method based on dynamic configuration decoding of the present invention, the step 4 is specifically:
step 4.1: carrying out convolution operation on input coding information H;
step 4.2: performing pooling operation on the output of the convolution;
step 4.3: repeating the convolution and pooling operations multiple times to output a three-dimensional tensor
U ∈ R^(T_1×N_1×C_1)
where T_1 < T and N_1 < N; the max-over-time pooling method is applied to the T_1 dimension of the three-dimensional tensor U for dimension reduction, obtaining a two-dimensional matrix
U_1 ∈ R^(N_1×C_1);
step 4.4: reshaping U_1 into a one-dimensional vector
U_2 ∈ R^L
where L = N_1 × C_1; U_2 is then fed into the fully connected layer, which performs the following calculation:
Z = W_2 · f(W_1 · U_2 + b_1) + b_2
where W_1 is a real matrix of shape (D, L), b_1 is a real vector of length D, W_2 is a real matrix of shape (O, D), b_2 is a real vector of length O, Z is a real vector of length O, O is the number of all selectable configurations, and f is a nonlinear activation function;
step 4.5: substituting Z into the softmax function to obtain a real vector P of length O, where each element of P represents the probability of choosing the corresponding configuration, and selecting the configuration with the highest probability as the decoding configuration output.
In the neural machine translation method based on dynamic configuration decoding of the present invention, the step 5 specifically is:
step 5.1: decoding by adopting a beam search method;
step 5.2: and scoring the translation result by adopting a BLEU evaluation index.
In the neural machine translation method based on dynamic configuration decoding of the present invention, the step 6 specifically adopts a policy gradient method or a Q learning method to improve the decision model.
The neural machine translation method based on dynamic configuration decoding at least has the following beneficial effects:
1. The method introduces a new decision model into the machine translation model, which can automatically generate a suitable decoding configuration according to different source language inputs.
2. The invention uses a small decision model with low training cost, which is trained end to end on top of the already trained machine translation model, without retraining the whole machine translation model.
Drawings
FIG. 1 is a flow chart of a method for neural machine translation based on dynamic configuration decoding of the present invention;
FIG. 2 is a schematic diagram of the improved self-attention mechanism model of the present invention;
FIG. 3 is a block diagram of a decision model of the present invention.
Detailed Description
Fig. 1 shows a neural machine translation method based on dynamic configuration decoding, which includes the following steps:
step 1: adding a decision model between an encoder and a decoder of a Transformer model of the self-attention mechanism to form an improved self-attention mechanism model, wherein the decision model is built on the basis of a convolutional neural network.
Fig. 2 is a schematic structural diagram of an improved self-attention mechanism model of the present invention, in which a decision model is added on the basis of a Transformer model, a suitable decoding configuration is automatically generated by the decision model according to different source language inputs, and a decoder performs a decoding operation according to the decoding configuration, so that the accuracy of translation can be improved.
Step 2: inputting bilingual sentence-level parallel data, performing word segmentation processing on a source language and a target language respectively to obtain bilingual parallel sentence pairs after word segmentation, and training an encoder and a decoder of an improved self-attention mechanism model;
the input bilingual sentence-level parallel data is a set of bilingual inter-translation sentence pairs, and each sentence pair consists of a source language sentence and a target language sentence.
In specific implementation, a maximum likelihood method is adopted to train an encoder and a decoder of the improved self-attention mechanism model.
Step 3: coding a source language sentence of the bilingual parallel sentence pair after word segmentation by using a trained coder according to a time sequence, and acquiring the state of each time sequence on a hidden layer, namely the coding information of different layers under each time sequence;
Given a source sentence, the encoder encodes it using N nonlinear transformation layers, finally obtaining the encoding information
H ∈ R^(N×T×C)
where H is a matrix of shape (N, T) whose elements are word vectors of length C, N is the number of nonlinear transformation layers included in the encoder, and T is the length of the input source sentence.
Step 4: the obtained coding information is used as input and sent into the decision model, which carries out convolution, pooling and normalization processing on the coding information and outputs a corresponding decoding configuration. FIG. 3 is a block diagram of the decision model of the present invention; the decision model is built on a convolutional neural network and includes several convolutional layers and pooling layers. The decision model is modeled as a multi-class discriminator, with each class corresponding to a different decoding configuration. The following is an example of the decoding configurations a decision model may generate:
Beam size: 5, 10
Length ratio: 0.9, 1, 1.1
Possible decoding configurations/categories: (5, 0.9), (5, 1), (5, 1.1), (10, 0.9), (10, 1), (10, 1.1)
In this example, the decoding configuration has two parameters, a beam size and a length ratio, each with several possible values: a beam size of 5 or 10 and a length ratio of 0.9, 1 or 1.1. For the decision model, there are 2 × 3 = 6 selectable decoding configurations, i.e. 6 classes. When the decision model selects one of the classes, it is equivalent to picking one particular decoding configuration.
For the decision model, given the output H of the encoder as its input, the model performs classification and outputs the probabilities P of choosing the different classes, where O is the number of classes. Because the decoder can only accept one configuration for decoding, the decision model picks the configuration corresponding to the class with the highest probability as its final output.
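A small illustrative sketch of how such a configuration space can be enumerated into classes (the parameter values are those of the example above; the helper itself is an assumption, not part of the patent):

```python
from itertools import product

beam_sizes = [5, 10]
length_ratios = [0.9, 1.0, 1.1]

# Each class index corresponds to one concrete decoding configuration.
configs = list(product(beam_sizes, length_ratios))
# [(5, 0.9), (5, 1.0), (5, 1.1), (10, 0.9), (10, 1.0), (10, 1.1)]
O = len(configs)  # number of classes the decision model chooses among (6 here)

def class_to_config(class_index):
    """Map the class picked by the decision model back to (beam_size, length_ratio)."""
    beam_size, length_ratio = configs[class_index]
    return {"beam_size": beam_size, "length_ratio": length_ratio}
```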
The invention designs a dedicated network structure for the decision model. The input H of the decision model has shape (N, T, C), where T is the source language sentence length, N is the number of encoder layers, and C is the length of the continuous vector corresponding to each word at each layer. H can be viewed as an image, where T and N are the length and width of the image and C is the number of color channels. Based on this observation, the task of the decision model is modeled as an image classification problem, so general model structures from image classification tasks can be borrowed; the invention builds on LeNet-5. Step 4 generates the decoding configuration through the following sub-steps:
step 4.1: carrying out convolution operation on input coding information H;
In particular, the decision model includes J convolution kernels, each with its own weight matrix W and bias b, where W has shape (3, C) and b is a real vector of length C. The input to each convolution kernel is the model input H. The convolution operation outputs a matrix A of shape (T-3+1, N-3+1, J), as shown in the following equation:
A[t, n, j] = Σ_{i=1..3} Σ_{k=1..3} W_j[i, k] · H[t+i-1, n+k-1] + b_j
where W_j[i, k] · H[t+i-1, n+k-1] denotes a dot product over the C channels. After the convolution output A is obtained, the model applies a nonlinear transformation to each of its elements, typically the ReLU activation function, as shown in the following formula:
A′ = max(A, 0)
where A is the result of the convolution operation and A′ is a real matrix of the same shape as A.
Step 4.2: after the convolution operation, the decision model performs a pooling operation on the convolution output A′. Pooling takes the maximum of A′ over a window of shape (3, 3) to obtain the pooling output M, as shown in the following equation:
M[t, n, j] = max_{1≤i,k≤3} A′[t+i-1, n+k-1, j]
where M is a real matrix of shape (T-6+2, N-6+2, J). The model then repeats the convolution-pooling process several times.
Step 4.3: the convolution and pooling operations are repeated several times, outputting a three-dimensional tensor
U ∈ R^(T_1×N_1×C_1)
where T_1 < T and N_1 < N. Because T_1 depends on T and the length T differs between source language sentences, the decision model applies the max-over-time pooling method to the T_1 dimension of the three-dimensional tensor U to obtain a fixed-size two-dimensional matrix
U_1 ∈ R^(N_1×C_1)
as shown in the following equation:
U_1[n, c] = max_{1≤t≤T_1} U[t, n, c]
Step 4.4: U_1 is reshaped into a one-dimensional vector
U_2 ∈ R^L
where L = N_1 × C_1, i.e. U_2 is formed by concatenating all row vectors of U_1. U_2 is then fed into the fully connected layer, which performs the following calculation:
Z = W_2 · f(W_1 · U_2 + b_1) + b_2
where W_1 is a real matrix of shape (D, L), b_1 is a real vector of length D, W_2 is a real matrix of shape (O, D), b_2 is a real vector of length O, Z is a real vector of length O, O is the number of all selectable configurations, and f is a nonlinear activation function.
In a specific implementation, the same ReLU activation function as in the convolution operation is used.
Step 4.5: Z is substituted into the softmax function for normalization, obtaining a real vector P of length O. Each element of P represents the probability of choosing the corresponding configuration, and the configuration with the highest probability is selected as the decoding configuration output Config.
The probability distribution over the different decoding configurations is computed as:
P_i = e^(Z_i) / Σ_{j=1..O} e^(Z_j)
where e is the natural base, P_i is the probability value corresponding to the i-th element of P, and Z_i is the i-th element of the real vector Z.
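A minimal PyTorch-style sketch of the decision model described in steps 4.1 to 4.5 (a single convolution-pooling block, the kernel count, the hidden size and the use of LazyLinear are illustrative assumptions, not the patented dimensions):

```python
import torch
import torch.nn as nn

class DecisionModel(nn.Module):
    """CNN mapping encoder output H of shape (N, T, C) to a distribution over O configurations."""
    def __init__(self, num_configs, channels=512, kernels=64, hidden_dim=128):
        super().__init__()
        # Step 4.1 + 4.2: 3x3 convolution, ReLU (A' = max(A, 0)), then 3x3 max pooling.
        # The patent repeats this conv-pool block several times (step 4.3); one block here.
        self.features = nn.Sequential(
            nn.Conv2d(channels, kernels, kernel_size=3),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=1),
        )
        # Step 4.4: fully connected layers, Z = W2 . f(W1 . U2 + b1) + b2.
        # LazyLinear infers the flattened input size L = N1 * C1 from the first batch.
        self.fc = nn.Sequential(
            nn.LazyLinear(hidden_dim),           # W1, b1
            nn.ReLU(),                           # f
            nn.Linear(hidden_dim, num_configs),  # W2, b2 -> Z of length O
        )

    def forward(self, H):
        # H: (batch, N_layers, T, C); treat (T, N_layers) as the image plane, C as channels.
        x = H.permute(0, 3, 2, 1)                # (batch, C, T, N_layers)
        U = self.features(x)                     # (batch, J, T1, N1)
        U1 = U.max(dim=2).values                 # max-over-time pooling over the T1 dimension
        U2 = U1.flatten(start_dim=1)             # reshape to a one-dimensional vector (step 4.4)
        Z = self.fc(U2)
        return torch.softmax(Z, dim=-1)          # step 4.5: probabilities P over the O configs
```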
Step 5: decoding by using a trained decoder according to the decoding configuration output by the decision model, and scoring the selected decoding configuration with a common evaluation criterion; step 5 specifically comprises the following steps:
step 5.1: decoding by adopting a beam searching method;
the decoder uses the coding information H output by the coder, the decoding configuration Config output by the decision model and a special sentence start symbol<sos>(i.e., the 0 th word S of the target language 0 ) As input, a probability distribution vector of a first word of a target language is output
Figure BDA0001964331930000091
Where V is the vocabulary size of the target language, pr 1 Each element of (a) represents a probability of selecting a corresponding target language vocabulary. Decoder according to Pr 1 Scoring all vocabularies of the target language in a scoring mode Score specified by Config, and selecting the first B vocabularies with the highest scores as the first words S of the target language 1 Candidate set Q of 1 ={Q 1,1 ,...,Q 1,B Where B is the number of candidates specified by the decoding configuration Config, i.e. the bundle size. Q 1 Each element in (a) is respectively equal to S 0 Combine to generate B sentences Y 1 ={S 0 Q 1,1 ,...,S 0 Q 1,B In which Y is 1 Each element in the first word is used as a second word S of the calculation target language 2 Corresponding probability distributionVector Pr 2 B different Pr of length V are obtained 2 The vector is reconstructed into a matrix with the shape of (B, V)
Figure BDA0001964331930000092
The decoder is based on Score and
Figure BDA0001964331930000093
scoring and selecting the first B target words with the highest scores as the second words S 2 Candidate set Q of 2 ={Q 2,1 ,...,Q 2,B In which Q 2 Each element of the sentence being associated with the input sentence Y used to calculate it 1 Combining to obtain the probability distribution vector Pr of the third word of the calculated target language 3 Required input sentence set Y 2 ={S 0 Q 1,1 Q 2,1 ,...,S 0 Q 1, B Q 2,B }. By analogy, the decoder generates translation results continuously word by word until the sentence end symbol<eos>The length of the sentence in the selected or input sentence set Y reaches the limit specified by Config, at which time the decoder returns the sentence X with the highest score in Y as the final translation and finishes decoding.
Step 5.2: and scoring the translation result by adopting a BLEU evaluation index.
Given the decoding result X and a reference translation Ref, a common evaluation criterion can be used to score the quality of the translation result X. The most common evaluation criterion is the BLEU value. It calculates the precision of the different n-grams shared by the translation result and the reference answer, where an n-gram is a short sequence of n words. The following is an example of calculating n-gram precision:
Source text: weather today
Translation result: is is is is a
Reference answer: today is a nice day
Now the 1-gram precision is calculated. The 1-grams that occur in both the translation result and the reference answer are "is" and "a"; they occur 4 times and 1 time in the translation result, and the smaller of their counts in the translation result and the reference answer (the clipped counts) are 1 and 1 respectively. The final 1-gram precision is therefore (1+1)/(4+1) = 2/5. BLEU is the average of the 1-gram, 2-gram, 3-gram and 4-gram precisions.
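A small sketch of the clipped 1-gram precision worked out above (a full BLEU computation also combines the 2- to 4-gram precisions and a brevity penalty; this snippet only reproduces the example):

```python
from collections import Counter

def ngram_precision(candidate, reference, n=1):
    """Clipped n-gram precision: matched counts are capped by the reference counts."""
    cand_ngrams = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref_ngrams = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    matched = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
    total = sum(cand_ngrams.values())
    return matched / total if total else 0.0

candidate = "is is is is a".split()
reference = "today is a nice day".split()
print(ngram_precision(candidate, reference, n=1))  # 0.4, i.e. 2/5 as in the example
```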
Step 6: according to the score given by the evaluation standard, improving the decision model by adopting a reinforcement learning method to obtain a trained decision model;
In a specific implementation, a policy gradient method or a Q-learning method is adopted to improve the decision model.
After obtaining the decision model's output P (the probabilities of the different configurations), Config (the finally selected configuration), and the score R given by the evaluation criterion, the decision model uses this information for learning. Here, the objective function of the decision model is max E_P[R], meaning that the expected score R of the decisions made by the decision model according to its own output P is to be maximized.
A general neural network can be trained directly in an end-to-end manner, but end-to-end training requires every operation in the computation to be differentiable, and some operations in the decision model are not, such as obtaining Config from P. The usual solution is the score-function method of the policy gradient approach, which transforms the objective function of the decision model into max R × log P_Config, where P_Config is the probability that the decision model picks Config. The meaning of the transformed objective is that if R is high, the decision model adjusts its parameters so that Config is selected with higher probability the next time the same input is encountered; if R is low, the decision model lowers the corresponding probability of Config at the next prediction so as to avoid selecting it.
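A minimal sketch of this score-function (policy gradient) update, continuing the decision-model sketch above (the single-sentence batch, the optimizer, and the use of a BLEU score as the reward R are assumptions):

```python
import torch

def reinforce_step(decision_model, optimizer, H, reward_fn):
    """One policy-gradient update for the decision model: maximize R * log P_Config."""
    P = decision_model(H)                            # (1, O) probabilities over configurations
    dist = torch.distributions.Categorical(P)
    config_idx = dist.sample()                       # pick Config (the non-differentiable step)
    R = reward_fn(config_idx.item())                 # e.g. BLEU of the sentence decoded with Config
    loss = -(R * dist.log_prob(config_idx)).mean()   # gradient of -R * log P_Config
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return R
```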
Step 7: inputting a source sentence into the encoder of the improved self-attention mechanism model, sending the obtained encoding information into the decision model, and translating by the decoder according to the decoding configuration output by the decision model.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the scope of the present invention, which is defined by the appended claims.

Claims (7)

1. A neural machine translation method based on dynamic configuration decoding is characterized by comprising the following steps:
step 1: adding a decision model between an encoder and a decoder of a Transformer model of the self-attention mechanism to form an improved self-attention mechanism model, wherein the decision model is established based on a convolutional neural network;
Step 2: inputting bilingual sentence-level parallel data, performing word segmentation processing on a source language and a target language respectively to obtain bilingual parallel sentence pairs after word segmentation, and training an encoder and a decoder of an improved self-attention mechanism model;
Step 3: coding a source language sentence of the bilingual parallel sentence pair after word segmentation by using a trained coder according to a time sequence to obtain the state of each time sequence on a hidden layer, namely the coding information of different layers under each time sequence;
Step 4: the obtained coding information is used as input and sent into a decision model, the decision model carries out convolution, pooling and normalization processing on the coding information, and corresponding decoding configuration is output;
Step 5: decoding by using a trained decoder according to the decoding configuration output by the decision model, and scoring the selected decoding configuration;
step 6: according to the score given by the evaluation standard, improving the decision model by adopting a reinforcement learning method to obtain a trained decision model;
Step 7: inputting a source sentence into the encoder of the improved self-attention mechanism model, sending the obtained encoding information into the decision model, and translating by the decoder according to the decoding configuration output by the decision model.
2. The neural-machine translation method based on dynamic configuration decoding of claim 1, wherein the bilingual sentence-level parallel data input in step 2 is a set of bilingual inter-translated sentence pairs, each sentence pair consisting of a source language sentence and a target language sentence.
3. The neural-machine translation method based on dynamic configuration decoding of claim 1, wherein the maximum likelihood method is used in step 2 to train the encoder and decoder of the improved self-attention mechanism model.
4. The neural-machine translation method based on dynamic configuration decoding as claimed in claim 1, wherein said step 3 is specifically:
given a source sentence, the encoder uses N nonlinear transformation layers for encoding, finally obtaining the following encoding information:
H ∈ R^(N×T×C)
where N is the number of nonlinear transformation layers included in the encoder, T is the length of the input source sentence, and each element of H is a word vector of length C.
5. The neural-machine translation method based on dynamic configuration decoding as claimed in claim 1, wherein said step 4 is specifically:
step 4.1: carrying out convolution operation on input coding information H;
step 4.2: performing pooling operation on the output of the convolution;
step 4.3: repeating the convolution and pooling operations multiple times to output a three-dimensional tensor
U ∈ R^(T_1×N_1×C_1)
where T_1 < T and N_1 < N; the max-over-time pooling method is applied to the T_1 dimension of the three-dimensional tensor U for dimension reduction, obtaining a two-dimensional matrix
U_1 ∈ R^(N_1×C_1);
step 4.4: reshaping U_1 into a one-dimensional vector
U_2 ∈ R^L
where L = N_1 × C_1; U_2 is then fed into the fully connected layer, which performs the following calculation:
Z = W_2 · f(W_1 · U_2 + b_1) + b_2
where W_1 is a real matrix of shape (D, L), b_1 is a real vector of length D, W_2 is a real matrix of shape (O, D), b_2 is a real vector of length O, Z is a real vector of length O, O is the number of all selectable configurations, and f is a nonlinear activation function;
step 4.5: substituting Z into the softmax function to obtain a real vector P of length O, where each element of P represents the probability of choosing the corresponding configuration, and selecting the configuration with the highest probability as the decoding configuration output.
6. The neural-machine translation method based on dynamic configuration decoding as claimed in claim 1, wherein said step 5 is specifically:
step 5.1: decoding by adopting a beam search method;
step 5.2: and scoring the translation result by adopting a BLEU evaluation index.
7. The neural-machine translation method based on dynamic configuration decoding of claim 1, wherein said step 6 employs a policy gradient method or a Q-learning method to improve the decision model.
CN201910095193.6A 2019-01-31 2019-01-31 Neural machine translation method based on dynamic configuration decoding Active CN109933808B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910095193.6A CN109933808B (en) 2019-01-31 2019-01-31 Neural machine translation method based on dynamic configuration decoding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910095193.6A CN109933808B (en) 2019-01-31 2019-01-31 Neural machine translation method based on dynamic configuration decoding

Publications (2)

Publication Number Publication Date
CN109933808A CN109933808A (en) 2019-06-25
CN109933808B true CN109933808B (en) 2022-11-22

Family

ID=66985322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910095193.6A Active CN109933808B (en) 2019-01-31 2019-01-31 Neural machine translation method based on dynamic configuration decoding

Country Status (1)

Country Link
CN (1) CN109933808B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762408A (en) * 2019-07-09 2021-12-07 北京金山数字娱乐科技有限公司 Translation model and data processing method
CN110489766B (en) * 2019-07-25 2020-07-10 昆明理工大学 Chinese-lower resource neural machine translation method based on coding induction-decoding deduction
CN110457710B (en) * 2019-08-19 2022-08-02 电子科技大学 Method and method for establishing machine reading understanding network model based on dynamic routing mechanism, storage medium and terminal
CN110765785B (en) * 2019-09-19 2024-03-22 平安科技(深圳)有限公司 Chinese-English translation method based on neural network and related equipment thereof
CN110765966B (en) * 2019-10-30 2022-03-25 哈尔滨工业大学 One-stage automatic recognition and translation method for handwritten characters
CN111191785B (en) * 2019-12-20 2023-06-23 沈阳雅译网络技术有限公司 Structure searching method based on expansion search space for named entity recognition
CN111178091B (en) * 2019-12-20 2023-05-09 沈阳雅译网络技术有限公司 Multi-dimensional Chinese-English bilingual data cleaning method
CN111382582B (en) * 2020-01-21 2023-04-07 沈阳雅译网络技术有限公司 Neural machine translation decoding acceleration method based on non-autoregressive
CN111581988B (en) * 2020-05-09 2022-04-29 浙江大学 Training method and training system of non-autoregressive machine translation model based on task level course learning
CN117392168B (en) * 2023-08-21 2024-06-04 浙江大学 Method for performing nerve decoding by utilizing single photon calcium imaging video data

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038159A (en) * 2017-03-09 2017-08-11 Tsinghua University Neural network machine translation method based on unsupervised domain adaptation
CN108845994A (en) * 2018-06-07 2018-11-20 Nanjing University Neural machine translation system utilizing external information and training method of the translation system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038159A (en) * 2017-03-09 2017-08-11 Tsinghua University Neural network machine translation method based on unsupervised domain adaptation
CN108845994A (en) * 2018-06-07 2018-11-20 Nanjing University Neural machine translation system utilizing external information and training method of the translation system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Large-scale Uyghur-Chinese neural network machine translation model based on multiple encoders and decoders; Zhang Jinchao et al.; Journal of Chinese Information Processing (《中文信息学报》); 2018-09-30; full text *

Also Published As

Publication number Publication date
CN109933808A (en) 2019-06-25

Similar Documents

Publication Publication Date Title
CN109933808B (en) Neural machine translation method based on dynamic configuration decoding
CN111324744B (en) Data enhancement method based on target emotion analysis data set
CN109492202B (en) Chinese error correction method based on pinyin coding and decoding model
CN109887484B (en) Dual learning-based voice recognition and voice synthesis method and device
CN109359294B (en) Ancient Chinese translation method based on neural machine translation
CN110069790B (en) Machine translation system and method for contrasting original text through translated text retranslation
CN108829684A (en) Mongolian-Chinese neural machine translation method based on transfer learning strategy
CN110929030A (en) Text abstract and emotion classification combined training method
CN110750959A (en) Text information processing method, model training method and related device
CN111767731A (en) Training method and device of grammar error correction model and grammar error correction method and device
CN113158665A (en) Method for generating text abstract and generating bidirectional corpus-based improved dialog text
CN111078866B (en) Chinese text abstract generation method based on sequence-to-sequence model
CN111753557A (en) Chinese-Vietnamese unsupervised neural machine translation method fusing an EMD-minimized bilingual dictionary
CN111783423A (en) Training method and device of problem solving model and problem solving method and device
CN115293138B (en) Text error correction method and computer equipment
CN110457459B (en) Dialog generation method, device, equipment and storage medium based on artificial intelligence
CN115293139B (en) Training method of speech transcription text error correction model and computer equipment
CN110298046B (en) Translation model training method, text translation method and related device
CN116663578A (en) Neural machine translation method based on strategy gradient method improvement
CN116168401A (en) Training method of text image translation model based on multi-mode codebook
CN116861929A (en) Machine translation system based on deep learning
CN111428518B (en) Low-frequency word translation method and device
CN112528168B (en) Social network text emotion analysis method based on deformable self-attention mechanism
CN114281954A (en) Multi-round dialog reply generation system and method based on relational graph attention network
US11586833B2 (en) System and method for bi-directional translation using sum-product networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Wang Qiang

Inventor after: Li Yanyang

Inventor before: Wang Qiang

Inventor before: Li Yanyang

Inventor before: Xiao Tong

Inventor before: Zhu Jingbo

CB03 Change of inventor or designer information
GR01 Patent grant
GR01 Patent grant