CN116227506A - Machine translation method with efficient nonlinear attention structure - Google Patents

Machine translation method with efficient nonlinear attention structure

Info

Publication number
CN116227506A
Authority
CN
China
Prior art keywords
vector
attention
translation
target
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310506400.9A
Other languages
Chinese (zh)
Other versions
CN116227506B (en)
Inventor
李芳芳
陈晓红
吴炜
毛星亮
崔玉峰
钟善美
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiangjiang Laboratory
Original Assignee
Xiangjiang Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiangjiang Laboratory filed Critical Xiangjiang Laboratory
Priority to CN202310506400.9A priority Critical patent/CN116227506B/en
Publication of CN116227506A publication Critical patent/CN116227506A/en
Application granted granted Critical
Publication of CN116227506B publication Critical patent/CN116227506B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/40 Processing or translation of natural language
    • G06F 40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The present application relates to a machine translation method with an efficient nonlinear attention structure. The method comprises: acquiring a translation corpus and constructing a word list from it, the word list converting a source language sentence into a digital index; converting the target-end input into a second digital index; constructing a translation model comprising an embedding layer, an encoder, a decoder and an inverse embedding layer; the embedding layer outputs a source-end input vector and a target-end input vector; the encoder outputs a context vector based on the source-end input vector; the decoder is trained; the trained decoder obtains target language word vectors based on the context vector. During decoding, the target language word vector string whose sum of word prediction probabilities with the candidate prefix is largest is screened out to obtain a target language sentence vector; the target language sentence vector is passed through the inverse embedding layer to output the target language sentence corresponding to the source language sentence. The method avoids the computation time overhead caused by multi-head attention.

Description

Machine translation method with efficient nonlinear attention structure
Technical Field
The present application relates to the field of machine translation technologies, and in particular, to a machine translation method with an efficient nonlinear attention structure.
Background
Current mainstream neural machine translation methods adopt the Transformer model architecture, which belongs to the encoder-decoder family: the encoder is responsible for encoding the source-language input sentence and integrating its semantic, grammatical and syntactic information, while the decoder is responsible for encoding the target-language input and likewise integrating target-side linguistic information. The decoder further combines the source-side context vector produced by the encoder with the target-side encoding result for decoding, and generates the target-language translation from the decoding result. Both the encoder and the decoder of the Transformer uniformly employ a Multi-Head Attention (MHA) + Feed-Forward Network (FFN) structure to capture context information and thereby incorporate linguistic information.
However, a neural machine translation model based on the Transformer architecture contains a large number of fully connected mapping layers, its vector representation and mapping-layer dimensions are large, and many layers are stacked in the encoder and decoder, so the overall parameter count of the model is huge and expensive computing resources are required to train a model of practical quality. In addition, multi-head attention brings only limited improvement relative to the extra computation it costs, which wastes model capacity.
Disclosure of Invention
Based on this, there is a need to provide a machine translation method with an efficient nonlinear attention structure, the method comprising:
S1: acquiring a translation corpus; selecting a source language sentence from the translation corpus, and constructing a word list based on the translation corpus, wherein the word list is used for converting the source language sentence into a digital index;
S2: constructing a training corpus pair: selecting a sentence from the translation corpus as a target-end sentence, adding a first special symbol at the head of the target-end sentence to form the target-end input of the training corpus pair, and adding a second special symbol at the tail of the target-end sentence to form the standard answer of the training corpus pair; converting the target-end input into a second digital index;
S3: building a translation model, wherein the translation model comprises an embedding layer, an encoder, a decoder and an inverse embedding layer;
characterizing the digital index through the embedding layer to obtain a source-end input vector, and characterizing the second digital index through the embedding layer to obtain a target-end input vector;
the encoder outputs a context vector based on the source-end input vector;
training the decoder by taking the target-end input vector and the context vector as inputs and taking the standard answer as the output target; the trained decoder obtains a target language word vector based on the context vector, and takes the target language word vector as the input of the next iteration to decode the word vector at the next position, until the digital index has been traversed;
S4: during decoding, screening out the target language word vector string whose sum of word prediction probabilities with the candidate prefix is the largest, to obtain a target language sentence vector, wherein the candidate prefix is the target language word vector string decoded so far; the target language sentence vector is passed through the inverse embedding layer to output the target language sentence corresponding to the source language sentence, and this target language sentence is the translation result.
Preferably, in S1, a translation corpus is obtained from a translation corpus resource library; the translation corpus resource library comprises WMT.
Preferably, in S1, the process of building a vocabulary includes:
step 1: merging all sentences in the translation corpus into a text file;
step 2: learning byte pairs from the text file using the Python third-party library sentencepiece, and generating an encoded shared dictionary and a byte-pair encoding model;
step 3: acquiring a bilingual translation data set from a translation corpus resource library, taking the bilingual translation data set as a training data set, and acquiring a newstest2013 data set as a verification set;
step 4: training the byte pair coding model based on a training data set, and verifying based on a verification set;
step 5: encoding each line of sentences in the text file with the trained byte-pair encoding model to obtain encoded bilingual parallel corpus; and combining the encoded bilingual parallel corpus to construct the word list.
Preferably, in S3,
the encoder includes a first attention layer and a first normalization layer;
the first attention layer carries out nonlinear mapping on the source input vector to obtain a first gating vector, a first value and a first query object; affine transformation is carried out on the first query object to obtain a first key;
performing matrix multiplication operation on the first query object and the first key, and adopting Softmax normalization calculation to obtain a first attention score; weighting the first value by taking the first attention score as a weight to obtain a first attention weighting result; performing element-by-element product operation on the first attention weighted result and the first gating vector to obtain a first characteristic;
stacking a plurality of first attention layers by adopting residual connection, and carrying out residual calculation on a plurality of first features in a post-norm mode to obtain first stacking features;
the first normalization layer normalizes the first stacking feature to obtain the context vector.
Preferably, in S3,
the decoder comprises a second attention layer, a third attention layer, a second normalization layer and a third normalization layer; the second attention layer is consistent with the first attention layer in working mode;
during training, the target end input vector outputs a second characteristic through the second attention layer; residual calculation is carried out on the plurality of second features in a post-norm mode to obtain second stacking features; normalizing the second stacking feature by the second normalization layer to obtain the second attention layer output vector;
the third attention layer carries out high-dimensional nonlinear full-connection mapping on the context vector and the second attention layer output vector to obtain a second gating vector and a second value; and the context vector is used as a second key, and the second attention layer output vector is used as a second query object;
performing matrix multiplication operation on the second query object and the second key, and adopting Softmax normalization calculation to obtain a second attention score; weighting the second value by taking the second attention score as a weight to obtain a second attention weighting result; performing element-by-element product operation on the second attention weighted result and the second gating vector to obtain a third characteristic;
stacking a plurality of third attention layers by adopting residual connection, and carrying out residual calculation on a plurality of third features in a post-norm mode to obtain third stacking features;
and the third normalization layer normalizes the third stacking feature to obtain the target language word vector string.
Preferably, a predictive mask is also provided in the decoder.
Preferably, the method further comprises training the translation model with smoothing (label smoothing) added on the basis of the NLL loss function, and with an R-Dropout (R-Drop) strategy added during training.
Preferably, in S4, a beam search strategy is used in the decoding process to screen out the target language word vector string whose sum of word prediction probabilities with the candidate prefix is the largest.
Preferably, the method further comprises the step of performing model translation performance evaluation by using BLEU scores, wherein the evaluation process is as follows:
acquiring the newstest2013 data set as a verification set; during training, translating the verification set at a fixed interval to evaluate the current model parameters, and retaining the best-performing parameter weights;
and taking the translation model with the optimal parameter weight as a final model.
Preferably, the manner of calculating the BLEU score includes: calling torchtext in PyTorch to calculate the BLEU score, or calling the bleu_score() method in the NLTK third-party library to calculate the BLEU score.
The beneficial effects are that: the method removes redundant fully connected mapping layers, which significantly reduces the overall parameter count of the model and the computing-resource cost of training and prediction, allows the model to run smoothly on servers with less GPU memory, and leaves room to increase the feature dimension and stack more encoder/decoder layers, so that a better-performing model can be trained with the same computing resources.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a flowchart of a machine translation method according to an embodiment of the present application.
Detailed Description
In order to make the above objects, features and advantages of the present application more comprehensible, embodiments accompanied with figures are described in detail below. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. This application is, however, susceptible of embodiment in many other forms than those described herein and similar modifications can be made by those skilled in the art without departing from the spirit of the application, and therefore the application is not to be limited to the specific embodiments disclosed below.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "plurality" is at least two, such as two, three, etc., unless explicitly defined otherwise.
As shown in fig. 1, the present embodiment provides a machine translation method with an efficient nonlinear attention structure, the method including:
s1: acquiring translation corpus; selecting a source language sentence from the translation corpus, and constructing a word list based on the translation corpus, wherein the word list is used for converting the source language sentence into a digital index;
in this embodiment, a translation corpus is obtained from a translation corpus resource library; the translation corpus resource library comprises WMT.
Specifically, the process of constructing the vocabulary includes:
step 1: merging all sentences in the translation corpus into a text file;
step 2: learning byte pairs from the text file using the Python third-party library sentencepiece, and generating an encoded shared dictionary and a byte-pair encoding model;
step 3: acquiring a bilingual translation data set from a translation corpus resource library, taking the bilingual translation data set as a training data set, and acquiring a newstest2013 data set as a verification set;
step 4: training the byte pair coding model based on a training data set, and verifying based on a verification set;
step 5: encoding each line of sentences in the text file with the trained byte-pair encoding model to obtain encoded bilingual parallel corpus, and combining the encoded bilingual parallel corpus to construct the word list, as sketched below.
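The following is a minimal sketch of steps 2 and 5, assuming the sentencepiece Python library; the file names, vocabulary size and model type shown here are illustrative assumptions rather than values fixed by this embodiment.

import sentencepiece as spm

# Step 2: learn byte pairs on the merged text file and emit a shared dictionary
# and a byte-pair encoding model (file names and vocab_size are assumptions).
spm.SentencePieceTrainer.train(
    input="merged_corpus.txt",
    model_prefix="bpe_shared",      # writes bpe_shared.model and bpe_shared.vocab
    vocab_size=32000,
    model_type="bpe",
)

# Step 5: encode every line of the text file with the trained model.
sp = spm.SentencePieceProcessor(model_file="bpe_shared.model")
with open("merged_corpus.txt", encoding="utf-8") as fin, \
        open("corpus.bpe.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        pieces = sp.encode(line.strip(), out_type=str)  # sub-word tokens
        fout.write(" ".join(pieces) + "\n")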
S2: constructing a training corpus pair, selecting sentences as target-end sentences from the translation corpus, adding a first special symbol at the head end of the target-end sentences to form target-end input of the training corpus pair, and adding a second special symbol at the tail end of the target-end sentences to form standard answers of the training corpus pair; converting the target end input into a second digital index;
in this embodiment the first special symbol is set to "< bos >", and the second special symbol is set to "< eos >". The sentence of the second special symbol added will be used to calculate the loss function and the BLEU score.
S3: building a translation model, wherein the translation model comprises an embedding layer, an encoder, a decoder and an inverse embedding layer;
characterizing the digital index through the embedding layer to obtain a source-end input vector, and characterizing the second digital index through the embedding layer to obtain a target-end input vector;
the encoder outputs a context vector based on the source-end input vector;
training the decoder by taking the target-end input vector and the context vector as inputs and taking the standard answer as the output target; the trained decoder obtains a target language word vector based on the context vector, and takes the target language word vector as the input of the next iteration to decode the word vector at the next position, until the digital index has been traversed.
In this embodiment, the model is trained using an Adam optimizer, with the initial learning rate set to 1.0 and the betas parameters set to (0.9, 0.98); the eps parameter is set to 1e-9, and a warm-up training strategy is used with the number of warm-up steps set to 4000.
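A minimal PyTorch sketch of these optimizer settings; the learning-rate factor of 1.0 is read here as the scale of a Noam-style warm-up schedule, which is an assumption consistent with the warm-up steps given above.

import torch

def build_optimizer(model, d_model=512, warmup=4000):
    optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98), eps=1e-9)

    def noam_factor(step):
        step = max(step, 1)
        # rises linearly for `warmup` steps, then decays with the inverse square root
        return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=noam_factor)
    return optimizer, scheduler   # call optimizer.step() then scheduler.step() each training step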
Batch generation and training: each step trains on one batch of data, generated by a batch-sampling strategy. Specifically, all the data are first sorted by sentence length; since each data item comprises a source sentence and a target sentence, the longer of the two lengths is used as the sorting key. Sentences are then accumulated one by one until the total accumulated length reaches or exceeds max_token_num (set to 8000), and the accumulated data items are arranged into one batch. All sentences in the batch then undergo padding and teacher-forcing preprocessing: padding fills every sentence to the longest sentence length in the batch with a specific padding mark, while the teacher-forcing preprocessing is applied only to the target sentences.
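A minimal sketch of this batch-sampling strategy, assuming the data are available as (source_ids, target_ids) pairs; padding and the target-side shift are left to a later collate step.

def make_batches(pairs, max_token_num=8000):
    # pairs: list of (source_ids, target_ids); the longer of the two lengths is the sort key
    pairs = sorted(pairs, key=lambda p: max(len(p[0]), len(p[1])))
    batches, current, current_len = [], [], 0
    for src, tgt in pairs:
        current.append((src, tgt))
        current_len += max(len(src), len(tgt))
        if current_len >= max_token_num:   # the accumulated sentences form one batch
            batches.append(current)
            current, current_len = [], 0
    if current:
        batches.append(current)
    return batches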
Specifically, for the encoder: the encoder includes a first attention layer (a first single-headed simplified self-attention layer) and a first normalization layer;
the first attention layer carries out nonlinear mapping on the source input vector to obtain a first gating vector (with the dimension of 1024), a first value (with the dimension of 1024) and a first query object (with the dimension of 512); affine transformation is carried out on the first query object to obtain a first key;
performing a matrix multiplication operation on the first query object and the first key, and applying Softmax normalization to obtain a first attention score; weighting the first value by taking the first attention score as the weight to obtain a first attention-weighted result; performing an element-by-element product operation on the first attention-weighted result and the first gating vector to obtain a first feature, the first gating vector thereby further highlighting or screening the first attention-weighted result; the first feature is then passed to the last fully connected layer in the first attention layer, which maps the final output dimension back to 512 (see the sketch below);
stacking a plurality of first attention layers by adopting residual connection, and carrying out residual calculation on a plurality of first features in a post-norm mode to obtain first stacking features;
the first normalization layer normalizes the first stacking feature to obtain the context vector.
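A minimal PyTorch sketch of the first attention layer as described above; the use of GELU for the nonlinear mapping and the scaling of the attention scores are assumptions not fixed by the text.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSingleHeadSelfAttention(nn.Module):
    def __init__(self, d_model=512, d_inner=1024):
        super().__init__()
        self.to_gate = nn.Sequential(nn.Linear(d_model, d_inner), nn.GELU())   # first gating vector (1024)
        self.to_value = nn.Sequential(nn.Linear(d_model, d_inner), nn.GELU())  # first value (1024)
        self.to_query = nn.Sequential(nn.Linear(d_model, d_model), nn.GELU())  # first query object (512)
        self.to_key = nn.Linear(d_model, d_model)    # affine transform of the query -> first key
        self.out = nn.Linear(d_inner, d_model)       # last fully connected layer, back to 512

    def forward(self, x, mask=None):
        gate, value, query = self.to_gate(x), self.to_value(x), self.to_query(x)
        key = self.to_key(query)
        scores = torch.matmul(query, key.transpose(-2, -1)) / query.size(-1) ** 0.5
        if mask is not None:
            scores = scores.masked_fill(~mask, float("-inf"))
        attn = F.softmax(scores, dim=-1)             # first attention score
        weighted = torch.matmul(attn, value)         # first attention-weighted result
        return self.out(weighted * gate)             # element-by-element gating, then map back to 512

Several such layers are stacked with residual connections in the post-norm arrangement, i.e. each layer's output is added to its input and then normalized.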
For a decoder: the decoder includes a second attention layer (second single-headed simplified self-attention layer), a third attention layer (single-headed simplified cross-attention layer), a second normalization layer, and a third normalization layer; the second attention layer is consistent with the first attention layer in working mode;
during training, the target end input vector outputs a second characteristic through the second attention layer; residual calculation is carried out on the plurality of second features in a post-norm mode to obtain second stacking features; normalizing the second stacking feature by the second normalization layer to obtain the second attention layer output vector;
the third attention layer carries out high-dimensional nonlinear full-connection mapping on the context vector and the second attention layer output vector to obtain a second gating vector (with 1024 dimensions) and a second value (with 1024 dimensions); and the context vector is used as a second key, and the second attention layer output vector is used as a second query object;
performing a matrix multiplication operation on the second query object and the second key, and applying Softmax normalization to obtain a second attention score; weighting the second value by taking the second attention score as the weight to obtain a second attention-weighted result; performing an element-by-element product operation on the second attention-weighted result and the second gating vector to obtain a third feature, the second gating vector thereby further highlighting or screening the second attention-weighted result; the third feature is then passed to the last fully connected layer in the third attention layer, which maps the final output dimension back to 512 (see the sketch below);
stacking a plurality of third attention layers by adopting residual connection, and carrying out residual calculation on a plurality of third features in a post-norm mode to obtain third stacking features;
and the third normalization layer normalizes the third stacking feature to obtain the target language word vector string.
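A minimal PyTorch sketch of the third (cross) attention layer; the description does not state which input feeds the gate and which feeds the value, so this sketch assumes the second value comes from the context vector and the second gating vector from the decoder-side vector.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedSingleHeadCrossAttention(nn.Module):
    def __init__(self, d_model=512, d_inner=1024):
        super().__init__()
        self.to_gate = nn.Sequential(nn.Linear(d_model, d_inner), nn.GELU())   # second gating vector (1024)
        self.to_value = nn.Sequential(nn.Linear(d_model, d_inner), nn.GELU())  # second value (1024)
        self.out = nn.Linear(d_inner, d_model)       # last fully connected layer, back to 512

    def forward(self, decoder_x, context):
        query, key = decoder_x, context              # second query object / second key
        gate, value = self.to_gate(decoder_x), self.to_value(context)
        scores = torch.matmul(query, key.transpose(-2, -1)) / query.size(-1) ** 0.5
        attn = F.softmax(scores, dim=-1)             # second attention score
        weighted = torch.matmul(attn, value)         # second attention-weighted result
        return self.out(weighted * gate)             # third feature, mapped back to 512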
In this embodiment, so that the prediction is not affected by each word input to the decoder "seeing" the words after itself, a prediction mask is also provided in the decoder.
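A minimal sketch of such a prediction (causal) mask: position i is allowed to attend only to positions up to i.

import torch

def subsequent_mask(size):
    # True where attention is allowed, False where a later position must be hidden
    return torch.tril(torch.ones(size, size, dtype=torch.bool))

# subsequent_mask(4) ->
# [[True, False, False, False],
#  [True, True,  False, False],
#  [True, True,  True,  False],
#  [True, True,  True,  True ]]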
The training process further comprises: training the translation model with smoothing (label smoothing) added on the basis of the NLL loss function, and adding an R-Dropout (R-Drop) strategy during training.
Specifically, after the R-Dropout strategy is enabled, when each batch is generated, a data item is not added to the batch once but is added twice in duplicate; a KL loss term is additionally added to the training loss function, and this term is calculated from the outputs produced by passing the two adjacent duplicated data items through the model under R-Dropout.
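A minimal sketch of the combined loss, assuming the usual R-Drop formulation (label-smoothed cross-entropy averaged over the two passes plus a symmetric KL term); the weighting coefficient alpha is an assumed hyperparameter.

import torch.nn.functional as F

def rdrop_loss(logits1, logits2, target, pad_idx, smoothing=0.1, alpha=5.0):
    # logits1 / logits2: (batch*len, vocab) from two forward passes of the duplicated data
    nll = 0.5 * (
        F.cross_entropy(logits1, target, ignore_index=pad_idx, label_smoothing=smoothing)
        + F.cross_entropy(logits2, target, ignore_index=pad_idx, label_smoothing=smoothing)
    )
    logp, logq = F.log_softmax(logits1, dim=-1), F.log_softmax(logits2, dim=-1)
    kl = 0.5 * (
        F.kl_div(logp, logq, log_target=True, reduction="batchmean")
        + F.kl_div(logq, logp, log_target=True, reduction="batchmean")
    )
    return nll + alpha * kl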
The method further comprises the step of carrying out model translation performance assessment by using BLEU scores, wherein the assessment process is as follows:
acquiring the newstest2013 data set as a verification set; during training, translating the verification set at a fixed interval to evaluate the current model parameters, and retaining the best-performing parameter weights; and taking the translation model with the optimal parameter weights as the final model.
In this embodiment, the fixed interval is set to once every 1000 training steps for model-effect evaluation; at each evaluation, decoding is performed with the beam search strategy using beam_size values of 1/2/3/4 respectively, and the model performance under each beam setting is evaluated.
Further, the manner of calculating the BLEU score includes: calling torchtext in PyTorch to calculate the BLEU score, or calling the bleu_score() method in the NLTK third-party library to calculate the BLEU score.
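A minimal sketch of both options; the token lists are purely illustrative, and in NLTK the corpus-level call used here is corpus_bleu from the bleu_score module.

from torchtext.data.metrics import bleu_score as torchtext_bleu
from nltk.translate.bleu_score import corpus_bleu

candidates = [["the", "cat", "sat", "on", "the", "mat"]]
references = [[["the", "cat", "is", "on", "the", "mat"]]]

print(torchtext_bleu(candidates, references))   # torchtext BLEU
print(corpus_bleu(references, candidates))      # NLTK corpus-level BLEU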
S4: during decoding, a beam search strategy is adopted to screen out the target language word vector string whose sum of word prediction probabilities with the candidate prefix is the largest, obtaining a target language sentence vector, wherein the candidate prefix is the target language word vector string decoded so far; the target language sentence vector is passed through the inverse embedding layer to output the target language sentence corresponding to the source language sentence, and this target language sentence is the translation result.
Unlike the training phase, the input of the decoder is not the whole target sentence (because the target sentence is unknown), but the translated fragment formed by concatenating the words generated by word-by-word prediction; the process of repeatedly feeding the words generated so far back into the decoder to iteratively generate the word at the next position is called decoding. Since decoding spans multiple time steps, a beam search strategy is used: at each translation time step the k predicted words with the highest word prediction probability given the candidate prefix are retained, and finally the result with the highest total probability among the k translation results is selected as the final translation result of the model, as sketched below.
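A minimal sketch of this beam search, with a hypothetical decode_step callable standing in for one decoder forward pass that returns the top continuations of a prefix with their log-probabilities.

def beam_search(decode_step, bos_id, eos_id, beam_size=4, max_len=100):
    # decode_step(prefix) -> list of (token_id, log_prob) candidates for the next position
    beams = [([bos_id], 0.0)]                    # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos_id:             # keep completed hypotheses aside
                finished.append((prefix, score))
                continue
            for tok, logp in decode_step(prefix):
                candidates.append((prefix + [tok], score + logp))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]  # hypothesis with the highest total probability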
Table 1 compares the machine translation method provided in this embodiment with the conventional Transformer-based translation method, at a parameter scale close to the standard configuration of the conventional Transformer.
[Table 1: comparison data, reproduced in the original as an image and not available here.]
As can be seen from Table 1, the translation method provided in this embodiment generally outperforms the conventional Transformer-based translation method while being lighter and having a slight speed advantage.
The method provided by the embodiment has the following beneficial effects:
1. attention is changed from multiple heads to single heads, so that the calculation time cost caused by multi-head operation is avoided;
2. nonlinear high-dimensional characteristics are introduced into the attention calculation process, so that the capturing capacity of the model for complex linguistic characteristics is enhanced, and the translation performance of the model is improved;
3. redundant fully connected mapping layers are removed, which significantly reduces the overall parameter count of the model and the computing-resource cost of training and prediction, allows the model to run smoothly on servers with less GPU memory, and leaves room to increase the feature dimension and stack more encoder/decoder layers, so that a better-performing model can be trained with the same computing resources.
The technical features of the above-described embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above-described embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples only represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the claims. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.

Claims (10)

1. A method of machine translation with efficient nonlinear attention structure, comprising:
S1: acquiring a translation corpus; selecting a source language sentence from the translation corpus, and constructing a word list based on the translation corpus, wherein the word list is used for converting the source language sentence into a digital index;
S2: constructing a training corpus pair: selecting a sentence from the translation corpus as a target-end sentence, adding a first special symbol at the head of the target-end sentence to form the target-end input of the training corpus pair, and adding a second special symbol at the tail of the target-end sentence to form the standard answer of the training corpus pair; converting the target-end input into a second digital index;
S3: building a translation model, wherein the translation model comprises an embedding layer, an encoder, a decoder and an inverse embedding layer;
characterizing the digital index through the embedding layer to obtain a source-end input vector, and characterizing the second digital index through the embedding layer to obtain a target-end input vector;
the encoder outputs a context vector based on the source-end input vector;
training the decoder by taking the target-end input vector and the context vector as inputs and taking the standard answer as the output target; the trained decoder obtains a target language word vector based on the context vector, and takes the target language word vector as the input of the next iteration to decode the word vector at the next position, until the digital index has been traversed;
S4: during decoding, screening out the target language word vector string whose sum of word prediction probabilities with the candidate prefix is the largest, to obtain a target language sentence vector, wherein the candidate prefix is the target language word vector string decoded so far; the target language sentence vector is passed through the inverse embedding layer to output the target language sentence corresponding to the source language sentence, and the target language sentence is the translation result.
2. The machine translation method according to claim 1, wherein in S1, a translation corpus is obtained from a translation corpus resource library; the translation corpus resource library comprises WMT.
3. The machine translation method according to claim 1, wherein in S1, the process of constructing the vocabulary includes:
step 1: merging all sentences in the translation corpus into a text file;
step 2: learning byte pairs from the text file using the Python third-party library sentencepiece, and generating an encoded shared dictionary and a byte-pair encoding model;
step 3: acquiring a bilingual translation data set from a translation corpus resource library, taking the bilingual translation data set as a training data set, and acquiring a newstest2013 data set as a verification set;
step 4: training the byte pair coding model based on a training data set, and verifying based on a verification set;
step 5: encoding each line of sentences in the text file with the trained byte-pair encoding model to obtain encoded bilingual parallel corpus; and combining the encoded bilingual parallel corpus to construct the word list.
4. A machine translation method according to claim 3, wherein in S3,
the encoder includes a first attention layer and a first normalization layer;
the first attention layer carries out nonlinear mapping on the source input vector to obtain a first gating vector, a first value and a first query object; affine transformation is carried out on the first query object to obtain a first key;
performing matrix multiplication operation on the first query object and the first key, and adopting Softmax normalization calculation to obtain a first attention score; weighting the first value by taking the first attention score as a weight to obtain a first attention weighting result; performing element-by-element product operation on the first attention weighted result and the first gating vector to obtain a first characteristic;
stacking a plurality of first attention layers by adopting residual connection, and carrying out residual calculation on a plurality of first features in a post-norm mode to obtain first stacking features;
the first normalization layer normalizes the first stacking feature to obtain the context vector.
5. The machine translation method according to claim 4, wherein in S3,
the decoder comprises a second attention layer, a third attention layer, a second normalization layer and a third normalization layer; the second attention layer is consistent with the first attention layer in working mode;
during training, the target end input vector outputs a second characteristic through the second attention layer; residual calculation is carried out on the plurality of second features in a post-norm mode to obtain second stacking features; normalizing the second stacking feature by the second normalization layer to obtain the second attention layer output vector;
the third attention layer carries out high-dimensional nonlinear full-connection mapping on the context vector and the second attention layer output vector to obtain a second gating vector and a second value; and the context vector is used as a second key, and the second attention layer output vector is used as a second query object;
performing matrix multiplication operation on the second query object and the second key, and adopting Softmax normalization calculation to obtain a second attention score; weighting the second value by taking the second attention score as a weight to obtain a second attention weighting result; performing element-by-element product operation on the second attention weighted result and the second gating vector to obtain a third characteristic;
stacking a plurality of third attention layers by adopting residual connection, and carrying out residual calculation on a plurality of third features in a post-norm mode to obtain third stacking features;
and the third normalization layer normalizes the third stacking feature to obtain the target language word vector string.
6. The machine translation method according to claim 5, wherein a predictive mask is further provided in the decoder.
7. The machine translation method of claim 1, further comprising training the translation model by adding smoothing based on the NLL loss function, and further adding an R-Dropout strategy to the training.
8. The machine translation method according to claim 1, wherein in S4, a beam search strategy is used in the decoding process to screen out the target language word vector string whose sum of word prediction probabilities with the candidate prefix is the largest.
9. The machine translation method of claim 7, further comprising performing a model translation performance evaluation using a BLEU score, the evaluation process being:
acquiring the newstest2013 data set as a verification set; during training, translating the verification set at a fixed interval to evaluate the current model parameters, and retaining the best-performing parameter weights;
and taking the translation model with the optimal parameter weight as a final model.
10. The machine translation method according to claim 9, wherein the manner of calculating the BLEU score comprises: calling torchtext in PyTorch to calculate the BLEU score, or calling the bleu_score() method in the NLTK third-party library to calculate the BLEU score.
CN202310506400.9A 2023-05-08 2023-05-08 Machine translation method with efficient nonlinear attention structure Active CN116227506B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310506400.9A CN116227506B (en) 2023-05-08 2023-05-08 Machine translation method with efficient nonlinear attention structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310506400.9A CN116227506B (en) 2023-05-08 2023-05-08 Machine translation method with efficient nonlinear attention structure

Publications (2)

Publication Number Publication Date
CN116227506A true CN116227506A (en) 2023-06-06
CN116227506B CN116227506B (en) 2023-07-21

Family

ID=86587618

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310506400.9A Active CN116227506B (en) 2023-05-08 2023-05-08 Machine translation method with efficient nonlinear attention structure

Country Status (1)

Country Link
CN (1) CN116227506B (en)


Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108647214A (en) * 2018-03-29 2018-10-12 中国科学院自动化研究所 Coding/decoding method based on deep-neural-network translation model
US20200034436A1 (en) * 2018-07-26 2020-01-30 Google Llc Machine translation using neural network models
CN110427605A (en) * 2019-05-09 2019-11-08 苏州大学 The Ellipsis recovering method understood towards short text
US20220343084A1 (en) * 2019-09-02 2022-10-27 Nippon Telegraph And Telephone Corporation Translation apparatus, translation method and program
US20210157991A1 (en) * 2019-11-25 2021-05-27 National Central University Computing device and method for generating machine translation model and machine-translation device
CN111160050A (en) * 2019-12-20 2020-05-15 沈阳雅译网络技术有限公司 Chapter-level neural machine translation method based on context memory network
CN111353315A (en) * 2020-01-21 2020-06-30 沈阳雅译网络技术有限公司 Deep neural machine translation system based on random residual algorithm
WO2021239631A1 (en) * 2020-05-26 2021-12-02 IP.appify GmbH Neural machine translation method, neural machine translation system, learning method, learning system, and programm
US20210390269A1 (en) * 2020-06-12 2021-12-16 Mehdi Rezagholizadeh System and method for bi-directional translation using sum-product networks
CN112613326A (en) * 2020-12-18 2021-04-06 北京理工大学 Tibetan language neural machine translation method fusing syntactic structure
CN114840499A (en) * 2021-02-01 2022-08-02 腾讯科技(深圳)有限公司 Table description information generation method, related device, equipment and storage medium
CN113095091A (en) * 2021-04-09 2021-07-09 天津大学 Chapter machine translation system and method capable of selecting context information
CN113297841A (en) * 2021-05-24 2021-08-24 哈尔滨工业大学 Neural machine translation method based on pre-training double-word vectors
CN113468895A (en) * 2021-05-28 2021-10-01 沈阳雅译网络技术有限公司 Non-autoregressive neural machine translation method based on decoder input enhancement
US20230084333A1 (en) * 2021-08-31 2023-03-16 Naver Corporation Adversarial generation method for training a neural model
CN114757171A (en) * 2022-05-12 2022-07-15 阿里巴巴(中国)有限公司 Training method of pre-training language model, and training method and device of language model

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
ZHANG, XU; LU, WENPENG; LI, FANGFANG; PENG, XUEPING; ZHANG, RUOYU: "Deep Feature Fusion Model for Sentence Semantic Matching", CMC: COMPUTERS, MATERIALS & CONTINUA, vol. 61, no. 2, pages 601 - 616 *
ZHANG, XU ET AL.: "Deep Feature Fusion Model for Sentence Semantic Matching", COMPUTERS, MATERIALS & CONTINUA, pages 601 - 616 *
ZHANG, XIAOCHUAN; SUN, DI; PANG, JIANMIN; ZHOU, XIN: "A cross-platform basic block embedding method based on a neural machine translation model", Journal of Information Engineering University, no. 01, pages 49 - 54 *
ZHANG, ZHEN; SU, YILA; NIU, XIANGHUA; GAO, FEN; ZHAO, YAPING; RENQING, DAOERJI: "Application of domain information sharing methods in Mongolian-Chinese machine translation", Computer Engineering and Applications, vol. 56, no. 10, pages 106 - 114 *
ZHANG, JINCHAO; AISHAN, WUMAIER; MAIHEMUTI, MAIMAITI; LIU, QUN: "A large-scale Uyghur-Chinese neural network machine translation model based on multiple encoders and decoders", Journal of Chinese Information Processing, no. 09, pages 20 - 27 *
XU, FEIFEI; FENG, DONGSHENG: "Research on text word vectors and pre-trained language models", Journal of Shanghai University of Electric Power, no. 04, pages 320 - 328 *
LI, FANGFANG; REN, XINGKAI; MAO, XINGLIANG; LIN, ZHONGYAO; LIU, XIYAO: "A machine reading comprehension model for legal texts based on multi-task joint training", Journal of Chinese Information Processing, vol. 35, no. 7, pages 109 - 117 *

Also Published As

Publication number Publication date
CN116227506B (en) 2023-07-21

Similar Documents

Publication Publication Date Title
CN109840287B (en) Cross-modal information retrieval method and device based on neural network
CN110348016B (en) Text abstract generation method based on sentence correlation attention mechanism
CN109635124B (en) Remote supervision relation extraction method combined with background knowledge
CN110309287B (en) Retrieval type chatting dialogue scoring method for modeling dialogue turn information
CN109661664B (en) Information processing method and related device
CN111310438A (en) Chinese sentence semantic intelligent matching method and device based on multi-granularity fusion model
CN110134946B (en) Machine reading understanding method for complex data
CN111737975A (en) Text connotation quality evaluation method, device, equipment and storage medium
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN109522403A (en) A kind of summary texts generation method based on fusion coding
CN110457713A (en) Interpretation method, device, equipment and storage medium based on Machine Translation Model
CN112163431A (en) Chinese missing pronoun completion method based on generic conditional random field
CN116050401B (en) Method for automatically generating diversity problems based on transform problem keyword prediction
CN111651589A (en) Two-stage text abstract generation method for long document
CN111191468B (en) Term replacement method and device
CN113065358A (en) Text-to-semantic matching method based on multi-granularity alignment for bank consultation service
CN111611346A (en) Text matching method and device based on dynamic semantic coding and double attention
CN115545041B (en) Model construction method and system for enhancing semantic vector representation of medical statement
CN115831102A (en) Speech recognition method and device based on pre-training feature representation and electronic equipment
CN111444730A (en) Data enhancement Weihan machine translation system training method and device based on Transformer model
CN114662476A (en) Character sequence recognition method fusing dictionary and character features
CN114528835A (en) Semi-supervised specialized term extraction method, medium and equipment based on interval discrimination
CN113392656A (en) Neural machine translation method fusing push-and-knock network and character coding
CN113362804A (en) Method, device, terminal and storage medium for synthesizing voice
CN113033153A (en) Neural machine translation model fusing key information based on Transformer model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant