CN116227506A - Machine translation method with efficient nonlinear attention structure - Google Patents

Machine translation method with efficient nonlinear attention structure

Info

Publication number
CN116227506A
Authority
CN
China
Prior art keywords
vector
attention
translation
target
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310506400.9A
Other languages
Chinese (zh)
Other versions
CN116227506B (en)
Inventor
李芳芳
陈晓红
吴炜
毛星亮
崔玉峰
钟善美
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiangjiang Laboratory
Original Assignee
Xiangjiang Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiangjiang Laboratory filed Critical Xiangjiang Laboratory
Priority to CN202310506400.9A priority Critical patent/CN116227506B/en
Publication of CN116227506A publication Critical patent/CN116227506A/en
Application granted granted Critical
Publication of CN116227506B publication Critical patent/CN116227506B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to a machine translation method with an efficient nonlinear attention structure. The method comprises: acquiring a translation corpus and constructing a word list from it, the word list converting a source language sentence into a digital index; converting the target-end input into a second digital index; constructing a translation model comprising an embedding layer, an encoder, a decoder and an inverse embedding layer; the embedding layer outputs a source-end input vector and a target-end input vector; the encoder outputs a context vector based on the source-end input vector; the decoder is trained; the trained decoder obtains target language word vectors based on the context vector. During decoding, the target language word vector string whose sum of word prediction probabilities with the candidate prefix is largest is screened out to obtain a target language sentence vector; the target language sentence vector is passed through the inverse embedding layer to output the target language sentence corresponding to the source language sentence. The method avoids the computation time overhead caused by multi-head attention.

Description

Machine translation method with efficient nonlinear attention structure
Technical Field
The present application relates to the field of machine translation technologies, and in particular, to a machine translation method with an efficient nonlinear attention structure.
Background
Current mainstream neural machine translation methods adopt the Transformer model architecture, which belongs to the encoder-decoder family: the encoder is responsible for encoding the source-language input sentence and integrating its semantic, grammatical and syntactic information, while the decoder is responsible for encoding the target-language input and likewise integrating target-side linguistic information. The decoder further combines the source-side context vector produced by the encoder with the target-side encoding result for decoding, and generates the target-language translation from the decoding result. Both the encoder and the decoder of the Transformer uniformly employ a Multi-Head Attention (MHA) + Feed-Forward Network (FFN) structure to capture context information and thereby incorporate linguistic information.
However, a neural machine translation model based on the Transformer architecture contains a large number of fully connected mapping layers, its vector representation and mapping-layer dimensions are large, and many layers are stacked in the encoder and decoder, so the overall parameter count of the model is huge and expensive computing resources are required to train a model of practical quality. In addition, multi-head attention brings only limited improvement relative to the extra computation it costs, which wastes model capacity.
Disclosure of Invention
Based on this, there is a need to provide a machine translation method with an efficient nonlinear attention structure, the method comprising:
S1: acquiring a translation corpus; selecting a source language sentence from the translation corpus, and constructing a word list based on the translation corpus, wherein the word list is used for converting the source language sentence into a digital index;
S2: constructing a training corpus pair: selecting a sentence from the translation corpus as a target-end sentence, adding a first special symbol at the head of the target-end sentence to form the target-end input of the training corpus pair, and adding a second special symbol at the tail of the target-end sentence to form the standard answer of the training corpus pair; converting the target-end input into a second digital index;
S3: building a translation model, wherein the translation model comprises an embedding layer, an encoder, a decoder and an inverse embedding layer;
characterizing the digital index through the embedding layer to obtain a source-end input vector, and characterizing the second digital index through the embedding layer to obtain a target-end input vector;
the encoder outputs a context vector based on the source-end input vector;
training the decoder by taking the target-end input vector and the context vector as inputs and taking the standard answer as the output target; the trained decoder obtains a target language word vector based on the context vector, and takes the target language word vector as the input of the next iteration to decode the word vector at the next position, until the digital index has been traversed;
S4: during decoding, screening out the target language word vector string whose sum of word prediction probabilities with the candidate prefix is the largest, to obtain a target language sentence vector, wherein the candidate prefix is the target language word vector string decoded so far; the target language sentence vector is passed through the inverse embedding layer to output the target language sentence corresponding to the source language sentence, and this target language sentence is the translation result.
Preferably, in S1, a translation corpus is obtained from a translation corpus resource library; the translation corpus resource library comprises WMT.
Preferably, in S1, the process of building a vocabulary includes:
step 1: merging all sentences in the translation corpus into a text file;
step 2: learning byte pairs from the text file using the Python third-party library sentencepiece, and generating an encoded shared dictionary and a byte-pair encoding model;
step 3: acquiring a bilingual translation data set from a translation corpus resource library, taking the bilingual translation data set as a training data set, and acquiring a newstest2013 data set as a verification set;
step 4: training the byte pair coding model based on a training data set, and verifying based on a verification set;
step 5: encoding each line of sentences in the text file with the trained byte-pair encoding model to obtain encoded bilingual parallel corpus; and combining the encoded bilingual parallel corpus to construct the word list.
Preferably, in S3,
the encoder includes a first attention layer and a first normalization layer;
the first attention layer carries out nonlinear mapping on the source input vector to obtain a first gating vector, a first value and a first query object; affine transformation is carried out on the first query object to obtain a first key;
performing matrix multiplication operation on the first query object and the first key, and adopting Softmax normalization calculation to obtain a first attention score; weighting the first value by taking the first attention score as a weight to obtain a first attention weighting result; performing element-by-element product operation on the first attention weighted result and the first gating vector to obtain a first characteristic;
stacking a plurality of first attention layers by adopting residual connection, and carrying out residual calculation on a plurality of first features in a post-norm mode to obtain first stacking features;
the first normalization layer normalizes the first stacking feature to obtain the context vector.
Preferably, in S3,
the decoder comprises a second attention layer, a third attention layer, a second normalization layer and a third normalization layer; the second attention layer is consistent with the first attention layer in working mode;
during training, the target end input vector outputs a second characteristic through the second attention layer; residual calculation is carried out on the plurality of second features in a post-norm mode to obtain second stacking features; normalizing the second stacking feature by the second normalization layer to obtain the second attention layer output vector;
the third attention layer carries out high-dimensional nonlinear full-connection mapping on the context vector and the second attention layer output vector to obtain a second gating vector and a second value; and the context vector is used as a second key, and the second attention layer output vector is used as a second query object;
performing matrix multiplication operation on the second query object and the second key, and adopting Softmax normalization calculation to obtain a second attention score; weighting the second value by taking the second attention score as a weight to obtain a second attention weighting result; performing element-by-element product operation on the second attention weighted result and the second gating vector to obtain a third characteristic;
stacking a plurality of third attention layers by adopting residual connection, and carrying out residual calculation on a plurality of third features in a post-norm mode to obtain third stacking features;
and the third normalization layer normalizes the third stacking feature to obtain the target language word vector string.
Preferably, a predictive mask is also provided in the decoder.
Preferably, the method further comprises training the translation model with smoothing (label smoothing) added on the basis of the NLL loss function, and with an R-Dropout (R-Drop) strategy added during training.
Preferably, in S4, a beam search strategy is used in the decoding process to screen out the target language word vector string whose sum of word prediction probabilities with the candidate prefix is the largest.
Preferably, the method further comprises the step of performing model translation performance evaluation by using BLEU scores, wherein the evaluation process is as follows:
acquiring the newstest2013 data set as a verification set; during training, translating the verification set at a fixed interval to evaluate the current model parameters, and retaining the best-performing parameter weights;
and taking the translation model with the optimal parameter weight as a final model.
Preferably, the manner of calculating the BLEU score includes: calling torchtext in PyTorch to calculate the BLEU score, or calling the bleu_score() method in the NLTK third-party library to calculate the BLEU score.
The beneficial effects are that: the method removes redundant fully connected mapping layers, which significantly reduces the overall parameter count of the model and the computing-resource cost of training and prediction, allows the model to run smoothly on servers with less GPU memory, and leaves room to increase the feature dimension and stack more encoder/decoder layers, so that a better-performing model can be trained with the same computing resources.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a machine translation method according to an embodiment of the present application.
Detailed Description
In order to make the above objects, features and advantages of the present application more comprehensible, embodiments accompanied with figures are described in detail below. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. This application is, however, susceptible of embodiment in many other forms than those described herein and similar modifications can be made by those skilled in the art without departing from the spirit of the application, and therefore the application is not to be limited to the specific embodiments disclosed below.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" is at least two, such as two, three, etc., unless explicitly defined otherwise.
As shown in fig. 1, the present embodiment provides a machine translation method with an efficient nonlinear attention structure, the method including:
s1: acquiring translation corpus; selecting a source language sentence from the translation corpus, and constructing a word list based on the translation corpus, wherein the word list is used for converting the source language sentence into a digital index;
in this embodiment, a translation corpus is obtained from a translation corpus resource library; the translation corpus resource library comprises WMT.
Specifically, the process of constructing the vocabulary includes:
step 1: merging all sentences in the translation corpus into a text file;
step 2: learning byte pairs from the text file using the Python third-party library sentencepiece, and generating an encoded shared dictionary and a byte-pair encoding model;
step 3: acquiring a bilingual translation data set from a translation corpus resource library, taking the bilingual translation data set as a training data set, and acquiring a newstest2013 data set as a verification set;
step 4: training the byte pair coding model based on a training data set, and verifying based on a verification set;
step 5: encoding each line of sentences in the text file with the trained byte-pair encoding model to obtain encoded bilingual parallel corpus, and combining the encoded bilingual parallel corpus to construct the word list, as sketched below.
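The following is a minimal sketch of steps 2 and 5, assuming the sentencepiece Python library; the file names, vocabulary size and model type shown here are illustrative assumptions rather than values fixed by this embodiment.

import sentencepiece as spm

# Step 2: learn byte pairs on the merged text file and emit a shared dictionary
# and a byte-pair encoding model (file names and vocab_size are assumptions).
spm.SentencePieceTrainer.train(
    input="merged_corpus.txt",
    model_prefix="bpe_shared",      # writes bpe_shared.model and bpe_shared.vocab
    vocab_size=32000,
    model_type="bpe",
)

# Step 5: encode every line of the text file with the trained model.
sp = spm.SentencePieceProcessor(model_file="bpe_shared.model")
with open("merged_corpus.txt", encoding="utf-8") as fin, \
        open("corpus.bpe.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        pieces = sp.encode(line.strip(), out_type=str)  # sub-word tokens
        fout.write(" ".join(pieces) + "\n")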
S2: constructing a training corpus pair, selecting sentences as target-end sentences from the translation corpus, adding a first special symbol at the head end of the target-end sentences to form target-end input of the training corpus pair, and adding a second special symbol at the tail end of the target-end sentences to form standard answers of the training corpus pair; converting the target end input into a second digital index;
in this embodiment the first special symbol is set to "< bos >", and the second special symbol is set to "< eos >". The sentence of the second special symbol added will be used to calculate the loss function and the BLEU score.
S3: building a translation model, wherein the translation model comprises an embedding layer, an encoder, a decoder and an inverse embedding layer;
characterizing the digital index through the embedding layer to obtain a source-end input vector, and characterizing the second digital index through the embedding layer to obtain a target-end input vector;
the encoder outputs a context vector based on the source-end input vector;
training the decoder by taking the target-end input vector and the context vector as inputs and taking the standard answer as the output target; the trained decoder obtains a target language word vector based on the context vector, and takes the target language word vector as the input of the next iteration to decode the word vector at the next position, until the digital index has been traversed.
In this embodiment, the model is trained using an Adam optimizer, with the initial learning rate set to 1.0 and the betas parameters set to (0.9, 0.98); the eps parameter is set to 1e-9, and a warm-up training strategy is used with the number of warm-up steps set to 4000.
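A minimal PyTorch sketch of these optimizer settings; the learning-rate factor of 1.0 is read here as the scale of a Noam-style warm-up schedule, which is an assumption consistent with the warm-up steps given above.

import torch

def build_optimizer(model, d_model=512, warmup=4000):
    optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)

    def noam_factor(step):
        step = max(step, 1)
        # rises linearly for `warmup` steps, then decays with the inverse square root
        return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_factor)
    return optimizer, scheduler   # call optimizer.step() then scheduler.step() each training step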
Batch generation and training: each step trains on one batch of data, generated by a batch-sampling strategy. Specifically, all the data are first sorted by sentence length; since each data item comprises a source sentence and a target sentence, the longer of the two lengths is used as the sorting key. Sentences are then accumulated one by one until the total accumulated length reaches or exceeds max_token_num (set to 8000), and the accumulated data items are arranged into one batch. All sentences in the batch then undergo padding and teacher-forcing preprocessing: padding fills every sentence to the longest sentence length in the batch with a specific padding mark, while the teacher-forcing preprocessing is applied only to the target sentences.
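A minimal sketch of this batch-sampling strategy, assuming the data are available as (source_ids, target_ids) pairs; padding and the target-side shift are left to a later collate step.

def make_batches(pairs, max_token_num=8000):
    # pairs: list of (source_ids, target_ids); the longer of the two lengths is the sort key
    pairs = sorted(pairs, key=lambda p: max(len(p[0]), len(p[1])))
    batches, current, current_len = [], [], 0
    for src, tgt in pairs:
        current.append((src, tgt))
        current_len += max(len(src), len(tgt))
        if current_len >= max_token_num:   # the accumulated sentences form one batch
            batches.append(current)
            current, current_len = [], 0
    if current:
        batches.append(current)
    return batches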
Specifically, for the encoder: the encoder includes a first attention layer (a first single-headed simplified self-attention layer) and a first normalization layer;
the first attention layer carries out nonlinear mapping on the source input vector to obtain a first gating vector (with the dimension of 1024), a first value (with the dimension of 1024) and a first query object (with the dimension of 512); affine transformation is carried out on the first query object to obtain a first key;
performing a matrix multiplication operation on the first query object and the first key, and applying Softmax normalization to obtain a first attention score; weighting the first value by taking the first attention score as the weight to obtain a first attention-weighted result; performing an element-by-element product operation on the first attention-weighted result and the first gating vector to obtain a first feature, the first gating vector thereby further highlighting or screening the first attention-weighted result; the first feature is then passed to the last fully connected layer in the first attention layer, which maps the final output dimension back to 512 (see the sketch below);
stacking a plurality of first attention layers by adopting residual connection, and carrying out residual calculation on a plurality of first features in a post-norm mode to obtain first stacking features;
the first normalization layer normalizes the first stacking feature to obtain the context vector.
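A minimal PyTorch sketch of the first attention layer as described above; the use of GELU for the nonlinear mapping and the scaling of the attention scores are assumptions not fixed by the text.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSingleHeadSelfAttention(nn.Module):
    def __init__(self, d_model=512, d_inner=1024):
        super().__init__()
        self.to_gate = nn.Sequential(nn.Linear(d_model, d_inner), nn.GELU())   # first gating vector (1024)
        self.to_value = nn.Sequential(nn.Linear(d_model, d_inner), nn.GELU())  # first value (1024)
        self.to_query = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())  # first query object (512)
        self.to_key = nn.Linear(d_model, d_model)    # affine transform of the query -> first key
        self.out = nn.Linear(d_inner, d_model)       # last fully connected layer, back to 512

    def forward(self, x, mask=None):
        gate, value, query = self.to_gate(x), self.to_value(x), self.to_query(x)
        key = self.to_key(query)
        scores = torch.matmul(query, key.transpose(-2, -1)) / query.size(-1) ** 0.5
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)             # first attention score
        weighted = torch.matmul(attn, value)         # first attention-weighted result
        return self.out(weighted * gate)             # element-by-element gating, then map back to 512

Several such layers are stacked with residual connections in the post-norm arrangement, i.e. each layer's output is added to its input and then normalized.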
For a decoder: the decoder includes a second attention layer (second single-headed simplified self-attention layer), a third attention layer (single-headed simplified cross-attention layer), a second normalization layer, and a third normalization layer; the second attention layer is consistent with the first attention layer in working mode;
during training, the target end input vector outputs a second characteristic through the second attention layer; residual calculation is carried out on the plurality of second features in a post-norm mode to obtain second stacking features; normalizing the second stacking feature by the second normalization layer to obtain the second attention layer output vector;
the third attention layer carries out high-dimensional nonlinear full-connection mapping on the context vector and the second attention layer output vector to obtain a second gating vector (with 1024 dimensions) and a second value (with 1024 dimensions); and the context vector is used as a second key, and the second attention layer output vector is used as a second query object;
performing a matrix multiplication operation on the second query object and the second key, and applying Softmax normalization to obtain a second attention score; weighting the second value by taking the second attention score as the weight to obtain a second attention-weighted result; performing an element-by-element product operation on the second attention-weighted result and the second gating vector to obtain a third feature, the second gating vector thereby further highlighting or screening the second attention-weighted result; the third feature is then passed to the last fully connected layer in the third attention layer, which maps the final output dimension back to 512 (see the sketch below);
stacking a plurality of third attention layers by adopting residual connection, and carrying out residual calculation on a plurality of third features in a post-norm mode to obtain third stacking features;
and the third normalization layer normalizes the third stacking feature to obtain the target language word vector string.
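A minimal PyTorch sketch of the third (cross) attention layer; the description does not state which input feeds the gate and which feeds the value, so this sketch assumes the second value comes from the context vector and the second gating vector from the decoder-side vector.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSingleHeadCrossAttention(nn.Module):
    def __init__(self, d_model=512, d_inner=1024):
        super().__init__()
        self.to_gate = nn.Sequential(nn.Linear(d_model, d_inner), nn.GELU())   # second gating vector (1024)
        self.to_value = nn.Sequential(nn.Linear(d_model, d_inner), nn.GELU())  # second value (1024)
        self.out = nn.Linear(d_inner, d_model)       # last fully connected layer, back to 512

    def forward(self, decoder_x, context):
        query, key = decoder_x, context              # second query object / second key
        gate, value = self.to_gate(decoder_x), self.to_value(context)
        scores = torch.matmul(query, key.transpose(-2, -1)) / query.size(-1) ** 0.5
        attn = F.softmax(scores, dim=-1)             # second attention score
        weighted = torch.matmul(attn, value)         # second attention-weighted result
        return self.out(weighted * gate)             # third feature, mapped back to 512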
In this embodiment, so that the prediction is not affected by each word input to the decoder "seeing" the words after itself, a prediction mask is also provided in the decoder.
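A minimal sketch of such a prediction (causal) mask: position i is allowed to attend only to positions up to i.

import torch

def subsequent_mask(size):
    # True where attention is allowed, False where a later position must be hidden
    return torch.tril(torch.ones(size, size, dtype=torch.bool))

# subsequent_mask(4) ->
# [[True, False, False, False],
#  [True, True,  False, False],
#  [True, True,  True,  False],
#  [True, True,  True,  True ]]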
The training process further comprises: training the translation model with smoothing (label smoothing) added on the basis of the NLL loss function, and adding an R-Dropout (R-Drop) strategy during training.
Specifically, after the R-Dropout strategy is enabled, when each batch is generated, a data item is not added to the batch once but is added twice in duplicate; a KL loss term is additionally added to the training loss function, and this term is calculated from the outputs produced by passing the two adjacent duplicated data items through the model under R-Dropout.
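A minimal sketch of the combined loss, assuming the usual R-Drop formulation (label-smoothed cross-entropy averaged over the two passes plus a symmetric KL term); the weighting coefficient alpha is an assumed hyperparameter.

import torch.nn.functional as F

def rdrop_loss(logits1, logits2, target, pad_idx, smoothing=0.1, alpha=5.0):
    # logits1 / logits2: (batch*len, vocab) from two forward passes of the duplicated data
    nll = 0.5 * (
        F.cross_entropy(logits1, target, ignore_index=pad_idx, label_smoothing=smoothing)
        + F.cross_entropy(logits2, target, ignore_index=pad_idx, label_smoothing=smoothing)
    )
    logp, logq = F.log_softmax(logits1, dim=-1), F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (
        F.kl_div(logp, logq, log_target=True, reduction="batchmean")
        + F.kl_div(logq, logp, log_target=True, reduction="batchmean")
    )
    return nll + alpha * kl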
The method further comprises the step of carrying out model translation performance assessment by using BLEU scores, wherein the assessment process is as follows:
acquiring the newstest2013 data set as a verification set; during training, translating the verification set at a fixed interval to evaluate the current model parameters, and retaining the best-performing parameter weights; and taking the translation model with the optimal parameter weights as the final model.
In this embodiment, the fixed interval is set to once every 1000 training steps for model-effect evaluation; at each evaluation, decoding is performed with the beam search strategy using beam_size values of 1/2/3/4 respectively, and the model performance under each beam setting is evaluated.
Further, the manner of calculating the BLEU score includes: calling torchtext in PyTorch to calculate the BLEU score, or calling the bleu_score() method in the NLTK third-party library to calculate the BLEU score.
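A minimal sketch of both options; the token lists are purely illustrative, and in NLTK the corpus-level call used here is corpus_bleu from the bleu_score module.

from torchtext.data.metrics import bleu_score as torchtext_bleu
from nltk.translate.bleu_score import corpus_bleu

candidates = [["the", "cat", "sat", "on", "the", "mat"]]
references = [[["the", "cat", "is", "on", "the", "mat"]]]

print(torchtext_bleu(candidates, references))   # torchtext BLEU
print(corpus_bleu(references, candidates))      # NLTK corpus-level BLEU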
S4: during decoding, a beam search strategy is adopted to screen out the target language word vector string whose sum of word prediction probabilities with the candidate prefix is the largest, obtaining a target language sentence vector, wherein the candidate prefix is the target language word vector string decoded so far; the target language sentence vector is passed through the inverse embedding layer to output the target language sentence corresponding to the source language sentence, and this target language sentence is the translation result.
Unlike the training phase, the input of the decoder is not the whole target sentence (because the target sentence is unknown), but the translated fragment formed by concatenating the words generated by word-by-word prediction; the process of repeatedly feeding the words generated so far back into the decoder to iteratively generate the word at the next position is called decoding. Since decoding spans multiple time steps, a beam search strategy is used: at each translation time step the k predicted words with the highest word prediction probability given the candidate prefix are retained, and finally the result with the highest total probability among the k translation results is selected as the final translation result of the model, as sketched below.
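A minimal sketch of this beam search, with a hypothetical decode_step callable standing in for one decoder forward pass that returns the top continuations of a prefix with their log-probabilities.

def beam_search(decode_step, bos_id, eos_id, beam_size=4, max_len=100):
    # decode_step(prefix) -> list of (token_id, log_prob) candidates for the next position
    beams = [([bos_id], 0.0)]                    # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos_id:             # keep completed hypotheses aside
                finished.append((prefix, score))
                continue
            for tok, logp in decode_step(prefix):
                candidates.append((prefix + [tok], score + logp))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]  # hypothesis with the highest total probability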
Table 1 compares the machine translation method provided in this embodiment with the conventional Transformer-based translation method, at a parameter scale close to the standard configuration of the conventional Transformer.
[Table 1: comparison data, reproduced in the original as an image and not available here.]
As can be seen from Table 1, the translation method provided in this embodiment generally outperforms the conventional Transformer-based translation method while being lighter and having a slight speed advantage.
The method provided by the embodiment has the following beneficial effects:
1. attention is changed from multiple heads to single heads, so that the calculation time cost caused by multi-head operation is avoided;
2. nonlinear high-dimensional characteristics are introduced into the attention calculation process, so that the capturing capacity of the model for complex linguistic characteristics is enhanced, and the translation performance of the model is improved;
3. redundant fully connected mapping layers are removed, which significantly reduces the overall parameter count of the model and the computing-resource cost of training and prediction, allows the model to run smoothly on servers with less GPU memory, and leaves room to increase the feature dimension and stack more encoder/decoder layers, so that a better-performing model can be trained with the same computing resources.
The technical features of the above-described embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above-described embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples only represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the claims. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.

Claims (10)

1. A method of machine translation with efficient nonlinear attention structure, comprising:
S1: acquiring a translation corpus; selecting a source language sentence from the translation corpus, and constructing a word list based on the translation corpus, wherein the word list is used for converting the source language sentence into a digital index;
S2: constructing a training corpus pair: selecting a sentence from the translation corpus as a target-end sentence, adding a first special symbol at the head of the target-end sentence to form the target-end input of the training corpus pair, and adding a second special symbol at the tail of the target-end sentence to form the standard answer of the training corpus pair; converting the target-end input into a second digital index;
S3: building a translation model, wherein the translation model comprises an embedding layer, an encoder, a decoder and an inverse embedding layer;
characterizing the digital index through the embedding layer to obtain a source-end input vector, and characterizing the second digital index through the embedding layer to obtain a target-end input vector;
the encoder outputs a context vector based on the source-end input vector;
training the decoder by taking the target-end input vector and the context vector as inputs and taking the standard answer as the output target; the trained decoder obtains a target language word vector based on the context vector, and takes the target language word vector as the input of the next iteration to decode the word vector at the next position, until the digital index has been traversed;
S4: during decoding, screening out the target language word vector string whose sum of word prediction probabilities with the candidate prefix is the largest, to obtain a target language sentence vector, wherein the candidate prefix is the target language word vector string decoded so far; the target language sentence vector is passed through the inverse embedding layer to output the target language sentence corresponding to the source language sentence, and the target language sentence is the translation result.
2. The machine translation method according to claim 1, wherein in S1, a translation corpus is obtained from a translation corpus resource library; the translation corpus resource library comprises WMT.
3. The machine translation method according to claim 1, wherein in S1, the process of constructing the vocabulary includes:
step 1: merging all sentences in the translation corpus into a text file;
step 2: learning byte pairs from the text file using the Python third-party library sentencepiece, and generating an encoded shared dictionary and a byte-pair encoding model;
step 3: acquiring a bilingual translation data set from a translation corpus resource library, taking the bilingual translation data set as a training data set, and acquiring a newstest2013 data set as a verification set;
step 4: training the byte pair coding model based on a training data set, and verifying based on a verification set;
step 5: encoding each line of sentences in the text file with the trained byte-pair encoding model to obtain encoded bilingual parallel corpus; and combining the encoded bilingual parallel corpus to construct the word list.
4. A machine translation method according to claim 3, wherein in S3,
the encoder includes a first attention layer and a first normalization layer;
the first attention layer carries out nonlinear mapping on the source input vector to obtain a first gating vector, a first value and a first query object; affine transformation is carried out on the first query object to obtain a first key;
performing matrix multiplication operation on the first query object and the first key, and adopting Softmax normalization calculation to obtain a first attention score; weighting the first value by taking the first attention score as a weight to obtain a first attention weighting result; performing element-by-element product operation on the first attention weighted result and the first gating vector to obtain a first characteristic;
stacking a plurality of first attention layers by adopting residual connection, and carrying out residual calculation on a plurality of first features in a post-norm mode to obtain first stacking features;
the first normalization layer normalizes the first stacking feature to obtain the context vector.
5. The machine translation method according to claim 4, wherein in S3,
the decoder comprises a second attention layer, a third attention layer, a second normalization layer and a third normalization layer; the second attention layer is consistent with the first attention layer in working mode;
during training, the target end input vector outputs a second characteristic through the second attention layer; residual calculation is carried out on the plurality of second features in a post-norm mode to obtain second stacking features; normalizing the second stacking feature by the second normalization layer to obtain the second attention layer output vector;
the third attention layer carries out high-dimensional nonlinear full-connection mapping on the context vector and the second attention layer output vector to obtain a second gating vector and a second value; and the context vector is used as a second key, and the second attention layer output vector is used as a second query object;
performing matrix multiplication operation on the second query object and the second key, and adopting Softmax normalization calculation to obtain a second attention score; weighting the second value by taking the second attention score as a weight to obtain a second attention weighting result; performing element-by-element product operation on the second attention weighted result and the second gating vector to obtain a third characteristic;
stacking a plurality of third attention layers by adopting residual connection, and carrying out residual calculation on a plurality of third features in a post-norm mode to obtain third stacking features;
and the third normalization layer normalizes the third stacking feature to obtain the target language word vector string.
6. The machine translation method according to claim 5, wherein a predictive mask is further provided in the decoder.
7. The machine translation method of claim 1, further comprising training the translation model by adding smoothing based on the NLL loss function, and further adding an R-Dropout strategy to the training.
8. The machine translation method according to claim 1, wherein in S4, a beam search strategy is used in the decoding process to screen out the target language word vector string whose sum of word prediction probabilities with the candidate prefix is the largest.
9. The machine translation method of claim 7, further comprising performing a model translation performance evaluation using a BLEU score, the evaluation process being:
acquiring the newstest2013 data set as a verification set; during training, translating the verification set at a fixed interval to evaluate the current model parameters, and retaining the best-performing parameter weights;
and taking the translation model with the optimal parameter weight as a final model.
10. The machine translation method according to claim 9, wherein the manner of calculating the BLEU score comprises: calling torchtext in PyTorch to calculate the BLEU score, or calling the bleu_score() method in the NLTK third-party library to calculate the BLEU score.
CN202310506400.9A 2023-05-08 2023-05-08 Machine translation method with efficient nonlinear attention structure Active CN116227506B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310506400.9A CN116227506B (en) 2023-05-08 2023-05-08 Machine translation method with efficient nonlinear attention structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310506400.9A CN116227506B (en) 2023-05-08 2023-05-08 Machine translation method with efficient nonlinear attention structure

Publications (2)

Publication Number Publication Date
CN116227506A true CN116227506A (en) 2023-06-06
CN116227506B CN116227506B (en) 2023-07-21

Family

ID=86587618

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310506400.9A Active CN116227506B (en) 2023-05-08 2023-05-08 Machine translation method with efficient nonlinear attention structure

Country Status (1)

Country Link
CN (1) CN116227506B (en)


Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647214A (en) * 2018-03-29 2018-10-12 中国科学院自动化研究所 Coding/decoding method based on deep-neural-network translation model
US20200034436A1 (en) * 2018-07-26 2020-01-30 Google Llc Machine translation using neural network models
CN110427605A (en) * 2019-05-09 2019-11-08 苏州大学 The Ellipsis recovering method understood towards short text
US20220343084A1 (en) * 2019-09-02 2022-10-27 Nippon Telegraph And Telephone Corporation Translation apparatus, translation method and program
US20210157991A1 (en) * 2019-11-25 2021-05-27 National Central University Computing device and method for generating machine translation model and machine-translation device
CN111160050A (en) * 2019-12-20 2020-05-15 沈阳雅译网络技术有限公司 Chapter-level neural machine translation method based on context memory network
CN111353315A (en) * 2020-01-21 2020-06-30 沈阳雅译网络技术有限公司 Deep neural machine translation system based on random residual algorithm
WO2021239631A1 (en) * 2020-05-26 2021-12-02 IP.appify GmbH Neural machine translation method, neural machine translation system, learning method, learning system, and programm
US20210390269A1 (en) * 2020-06-12 2021-12-16 Mehdi Rezagholizadeh System and method for bi-directional translation using sum-product networks
CN112613326A (en) * 2020-12-18 2021-04-06 北京理工大学 Tibetan language neural machine translation method fusing syntactic structure
CN114840499A (en) * 2021-02-01 2022-08-02 腾讯科技(深圳)有限公司 Table description information generation method, related device, equipment and storage medium
CN113095091A (en) * 2021-04-09 2021-07-09 天津大学 Chapter machine translation system and method capable of selecting context information
CN113297841A (en) * 2021-05-24 2021-08-24 哈尔滨工业大学 Neural machine translation method based on pre-training double-word vectors
CN113468895A (en) * 2021-05-28 2021-10-01 沈阳雅译网络技术有限公司 Non-autoregressive neural machine translation method based on decoder input enhancement
US20230084333A1 (en) * 2021-08-31 2023-03-16 Naver Corporation Adversarial generation method for training a neural model
CN114757171A (en) * 2022-05-12 2022-07-15 阿里巴巴(中国)有限公司 Training method of pre-training language model, and training method and device of language model

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
ZHANG, XU; LU, WENPENG; LI, FANGFANG; PENG, XUEPING; ZHANG, RUOYU: "Deep Feature Fusion Model for Sentence Semantic Matching", CMC: COMPUTERS, MATERIALS & CONTINUA, vol. 61, no. 2, pages 601 - 616 *
ZHANG, XU ET AL.: "Deep Feature Fusion Model for Sentence Semantic Matching", COMPUTERS, MATERIALS & CONTINUA, pages 601 - 616 *
ZHANG, XIAOCHUAN; SUN, DI; PANG, JIANMIN; ZHOU, XIN: "A cross-platform basic block embedding method based on a neural machine translation model", Journal of Information Engineering University, no. 01, pages 49 - 54 *
ZHANG, ZHEN; SU, YILA; NIU, XIANGHUA; GAO, FEN; ZHAO, YAPING; RENQING, DAOERJI: "Application of domain information sharing methods in Mongolian-Chinese machine translation", Computer Engineering and Applications, vol. 56, no. 10, pages 106 - 114 *
ZHANG, JINCHAO; AISHAN, WUMAIER; MAIHEMUTI, MAIMAITI; LIU, QUN: "A large-scale Uyghur-Chinese neural network machine translation model based on multiple encoders and decoders", Journal of Chinese Information Processing, no. 09, pages 20 - 27 *
XU, FEIFEI; FENG, DONGSHENG: "Research on text word vectors and pre-trained language models", Journal of Shanghai University of Electric Power, no. 04, pages 320 - 328 *
LI, FANGFANG; REN, XINGKAI; MAO, XINGLIANG; LIN, ZHONGYAO; LIU, XIYAO: "A machine reading comprehension model for legal texts based on multi-task joint training", Journal of Chinese Information Processing, vol. 35, no. 7, pages 109 - 117 *

Also Published As

Publication number Publication date
CN116227506B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
CN109840287B (en) Cross-modal information retrieval method and device based on neural network
CN110348016B (en) Text abstract generation method based on sentence correlation attention mechanism
CN109635124B (en) Remote supervision relation extraction method combined with background knowledge
CN110309287B (en) Retrieval type chatting dialogue scoring method for modeling dialogue turn information
CN109661664B (en) Information processing method and related device
CN111310438A (en) Chinese sentence semantic intelligent matching method and device based on multi-granularity fusion model
CN110134946B (en) Machine reading understanding method for complex data
CN111737975A (en) Text connotation quality evaluation method, device, equipment and storage medium
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN109522403A (en) A kind of summary texts generation method based on fusion coding
CN110457713A (en) Interpretation method, device, equipment and storage medium based on Machine Translation Model
CN112163431A (en) Chinese missing pronoun completion method based on generic conditional random field
CN116050401B (en) Method for automatically generating diversity problems based on transform problem keyword prediction
CN111651589A (en) Two-stage text abstract generation method for long document
CN111191468B (en) Term replacement method and device
CN113065358A (en) Text-to-semantic matching method based on multi-granularity alignment for bank consultation service
CN111611346A (en) Text matching method and device based on dynamic semantic coding and double attention
CN115545041B (en) Model construction method and system for enhancing semantic vector representation of medical statement
CN115831102A (en) Speech recognition method and device based on pre-training feature representation and electronic equipment
CN111444730A (en) Data enhancement Weihan machine translation system training method and device based on Transformer model
CN114662476A (en) Character sequence recognition method fusing dictionary and character features
CN114528835A (en) Semi-supervised specialized term extraction method, medium and equipment based on interval discrimination
CN113392656A (en) Neural machine translation method fusing push-and-knock network and character coding
CN113362804A (en) Method, device, terminal and storage medium for synthesizing voice
CN113033153A (en) Neural machine translation model fusing key information based on Transformer model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant