CN115392266A - Translation model training method based on confidence probability, using method, device and storage medium - Google Patents


Info

Publication number
CN115392266A
Authority
CN
China
Prior art keywords
translation
sentence
target
target sentence
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110567123.3A
Other languages
Chinese (zh)
Inventor
刘宜进
孟凡东
徐金安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Beijing Jiaotong University
Original Assignee
Tencent Technology Shenzhen Co Ltd
Beijing Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd, Beijing Jiaotong University filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202110567123.3A priority Critical patent/CN115392266A/en
Publication of CN115392266A publication Critical patent/CN115392266A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a translation model training method, a using method, a device, and a storage medium. The method comprises the following steps: acquiring a text to be trained, wherein the text to be trained comprises at least a source input sentence and a standard target sentence corresponding to the source input sentence; predicting a target sentence based on the source input sentence by using a translation model to obtain a predicted target sentence; calculating a confidence probability for each word position in the target sentence based on the standard target sentence and the predicted target sentence; and determining a combination of the standard target sentence and the predicted target sentence as the target sentence input of the translation model based on the confidence probability. The scheduled sampling strategy provided by the application greatly alleviates the exposure bias problem of the NMT model, substantially improves translation quality, and can be used to improve an online translation system.

Description

Translation model training method based on confidence probability, using method, device and storage medium
Technical Field
The application relates to the technical field of artificial intelligence machine translation, in particular to a training method, a using method, a device, computing equipment and a storage medium of a machine translation model for scheduling and sampling based on confidence probability.
Background
The neural-network-based machine translation model (NMT) has developed rapidly in recent years and has gradually become the mainstream approach to machine translation. NMT models typically face the problem of inconsistent training and testing distributions, i.e., the exposure bias problem. One common solution to the exposure bias problem is scheduled sampling, which simulates the test-time scenario of NMT by randomly mixing the model's predicted translation with the standard translation. However, existing scheduled sampling algorithms determine the sampling probability only from the training step count, neglecting the real-time capability of the model. In addition, the model's predicted translation mostly coincides with the standard translation, which degrades the scheduled sampling algorithm into the ordinary training mode.
Disclosure of Invention
In view of the above, the present application provides a training method, a using method, an apparatus, a computing device and a storage medium for a machine translation model for performing scheduling sampling based on confidence probability.
According to a first aspect of the present application, a method of training a translation model is provided. The method comprises the following steps: acquiring a text to be trained, wherein the text to be trained comprises at least a source input sentence and a standard target sentence corresponding to the source input sentence; predicting a target sentence based on the source input sentence by using the translation model to obtain a first predicted target sentence; calculating a confidence probability for each word position in the target sentence based on the standard target sentence and the predicted target sentence; determining a combination of the standard target sentence and the predicted target sentence as a target sentence input of the translation model based on the confidence probability.
In one embodiment, predicting, by the translation model, a target sentence based on the source input sentence to obtain a predicted target sentence includes: predicting the target sentence based on the source input sentence by using a first decoder in the translation model, to obtain the translation probability of the t-th word given the prediction results of the first t-1 words (t ≥ 1) of the target sentence; and selecting the target word with the highest translation probability as the translation result for the t-th word of the target sentence.
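The greedy selection step described above (take the highest-probability target word at each position) can be sketched as follows. This is an illustrative stand-in, not the patent's actual implementation; the function name and toy vocabulary are assumptions.

```python
# Hedged sketch of the greedy step: given the model's translation
# probabilities for the t-th word (conditioned on the first t-1 predicted
# words and the source representation X), pick the highest-probability word.

def greedy_step(prob_over_vocab):
    """Return (word, probability) for the word with the highest probability."""
    word = max(prob_over_vocab, key=prob_over_vocab.get)
    return word, prob_over_vocab[word]

# Toy distribution p(y_t | y_<t, X) over a three-word vocabulary.
probs = {"cat": 0.1, "dog": 0.7, "bird": 0.2}
word, p = greedy_step(probs)
```

In practice `prob_over_vocab` would be the softmax output of the first decoder over the full target vocabulary.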
In one embodiment, calculating a confidence probability for each word position in the target sentence based on the standard target sentence and the predicted target sentence comprises: given the prediction results of the first t-1 words (t ≥ 1) of the target sentence, calculating the confidence probability at the current word position based on the translation result of the t-th word and the standard translation of the t-th word in the standard target sentence.
In one embodiment, the confidence probability comprises any one of the following:

the predicted translation probability

$\mathrm{conf}(t) = p(y^{*}_{t} \mid \hat{y}_{<t}, X; \theta)$,

where $y^{*}_{t}$ is the t-th word in the standard translation, $\hat{y}_{<t}$ is the partial translation already produced at the target side, $X$ is the semantic representation of the source sentence, and $\theta$ denotes the parameters of the translation model;

the expectation of Monte Carlo samples

$\mathrm{conf}(t) = \mathbb{E}\big[p(y^{*}_{t} \mid \hat{y}_{<t}, X; \theta_{k})\big]$,

where k represents the number of Monte Carlo samples and $\mathbb{E}$ denotes the expectation; and

the variance of Monte Carlo samples

$\mathrm{conf}(t) = \mathrm{Var}\big[p(y^{*}_{t} \mid \hat{y}_{<t}, X; \theta_{k})\big]$,

where k represents the number of Monte Carlo samples and $\mathrm{Var}$ denotes the variance.
In one embodiment, the method further comprises: encoding the source input sentence by using an encoder in the translation model, and outputting a semantic representation corresponding to the source sentence; wherein the encoder includes a self-attention mechanism and a fully-connected neural network.
In one embodiment, determining a combination of the standard target sentence and the predicted target sentence as the target sentence input of the translation model based on the confidence probability comprises: when the confidence probability at the position of the t-th word is less than or equal to a first threshold, taking the (t-1)-th word of the standard target sentence as the target sentence input of a second decoder in the translation model; when the confidence probability at the position of the t-th word is greater than the first threshold and less than or equal to a second threshold, taking the (t-1)-th word of the predicted target sentence as the target sentence input of the second decoder; and when the confidence probability at the position of the t-th word is greater than the second threshold, taking a random word as the target sentence input of the second decoder.
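The three-way rule above can be sketched as follows. The threshold values `t1` and `t2` and all names are illustrative (the patent only says the two thresholds are hyperparameters tuned on a development set).

```python
# Hedged sketch of the confidence-based scheduling rule: choose the
# (t-1)-th target-side input fed to the second decoder.
import random

def choose_input(conf_t, gold_prev, pred_prev, vocab, t1=0.3, t2=0.9):
    """Select the second decoder's input word for the current position."""
    if conf_t <= t1:
        # Position not yet learned: keep the standard (gold) translation.
        return gold_prev
    if conf_t <= t2:
        # Position basically learned: use the model's own prediction.
        return pred_prev
    # Position learned well: inject a harder random word as noise.
    return random.choice(vocab)
```

Usage: `choose_input(0.5, "the", "a", ["cat", "dog"])` returns the model's prediction `"a"`, since 0.3 < 0.5 ≤ 0.9.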
In one embodiment, a second decoder in the translation model has the same structure and parameters as the first decoder, and the parameters of the second decoder and the first decoder are synchronized after each iteration.
In one embodiment, the first and/or second decoder includes a self-attention mechanism for feature extraction and dimension conversion of an input sentence, a cross-attention mechanism, and a fully-connected neural network.
In one embodiment, the method further comprises: predicting the target sentence based on the source input sentence by using the second decoder in the translation model, to obtain the translation probability of the t-th word given the prediction results of the first t-1 words (t ≥ 1) of the target sentence; and calculating a cross-entropy loss function of the translation model based on the translation probability output by the second decoder.
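A minimal sketch of the cross-entropy loss computed from the second decoder's output: the sentence loss is the negative log-probability assigned to each standard target word, averaged over positions. The function name and the averaging convention are assumptions, not the patent's stated formula.

```python
# Hedged sketch of per-sentence cross-entropy from the second decoder.
import math

def cross_entropy_loss(probs_for_gold_words):
    """probs_for_gold_words[t] = probability the second decoder assigns
    to the standard target word y*_t; returns mean negative log-likelihood."""
    n = len(probs_for_gold_words)
    return -sum(math.log(p) for p in probs_for_gold_words) / n

# Toy sentence of two words, with gold-word probabilities 0.5 and 0.25.
loss = cross_entropy_loss([0.5, 0.25])
```

Minimizing this loss pushes the second decoder's probability mass toward the standard translation at every position.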
According to another aspect of the present application, a training apparatus for a translation model is provided. The apparatus includes: an acquisition module configured to acquire a text to be trained, the text to be trained comprising at least a source input sentence and a standard target sentence corresponding to the source input sentence; a prediction module configured to predict a target sentence based on the source input sentence using the translation model, resulting in a predicted target sentence; a confidence probability calculation module configured to calculate a confidence probability for each word position in the target sentence based on the standard target sentence and the predicted target sentence; and a scheduling module configured to determine a combination of the standard target sentence and the predicted target sentence as the target sentence input of the translation model based on the confidence probability.
According to yet another aspect of the present application, a computing device is provided. The computing device includes: a memory configured to store computer-executable instructions; a processor configured to perform the method as in any one of the embodiments of the training method of a translation model described above when the computer-executable instructions are executed by the processor.
According to yet another aspect of the application, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed, perform a method as set forth in any one of the embodiments of the method of training a translation model described above.
The application provides a scheduled sampling strategy based on the model's confidence probability. In contrast to traditional scheduled sampling algorithms, the sampling strategy can be determined according to the real-time capability of the model (i.e., the confidence probability). Three computation modes, targeting different confidence probabilities, determine the sampling strategy of the scheduled sampling. Specifically, the model's own predictions are used at word positions with moderate confidence probability, the standard translation is used at positions with low confidence probability, and random-word noise is injected at positions with high confidence probability to alleviate the algorithm degradation problem. The scheduled sampling strategy provided by the application therefore greatly alleviates the exposure bias problem of the NMT model and substantially improves translation quality. The application can be used to improve an online translation system.
Drawings
Embodiments of the present disclosure will now be described in more detail and with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an application scenario of a translation model according to some embodiments of the present disclosure;
FIG. 2a schematically illustrates a user interface to which a neural network translation model of one embodiment of the present application is applied;
FIG. 2b schematically illustrates a user interface to which the neural network translation model of one embodiment of the present application is applied;
FIG. 3 schematically illustrates an encoder-decoder architecture diagram for a neural network translation model;
FIG. 4 schematically illustrates a schematic block diagram of a machine translation model trained based on a scheduled sampling strategy according to one embodiment of the present application;
FIG. 5 schematically illustrates an example flow diagram of a method of training a machine translation model for schedule sampling based on confidence probabilities in accordance with one embodiment of this disclosure;
FIG. 6 schematically illustrates an example block diagram of a training apparatus for a machine translation model based on confidence probability scheduling sampling in accordance with this disclosure; and
fig. 7 schematically illustrates an example system that includes an example computing device that represents one or more systems and/or devices that may implement the various techniques described herein.
Detailed Description
The technical solution in the present application will be clearly and completely described below with reference to the accompanying drawings. The embodiments described are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
Artificial intelligence is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a technique that simulates human cognitive abilities through a machine. Artificial intelligence is a comprehensive discipline covering a wide range of fields; it encompasses capabilities such as perception, learning, reasoning and decision-making, and spans both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning. The most central capability of artificial intelligence is to make decisions or predictions based on a given input. For example, in a face recognition application, the person in a photograph can be identified from the input photograph. In medical diagnosis, the cause and nature of a disease can be determined from an input medical image.
In the artificial intelligence software technology, machine learning is an important technology for making a computer have an intelligent characteristic. Machine learning is a multi-field cross discipline, and relates to a plurality of disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and the like. Machine learning specializes in how computers simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures to improve their performance. Machine learning generally includes techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and the like.
In order to facilitate an understanding of the embodiments of the present application, a brief description of several concepts follows.
Exposure Bias: refers to the inconsistency between training and inference in text generation. The inconsistency arises because the inputs used at inference and training time differ: during training, each input word comes from the real sample (ground truth), whereas at inference the current input is the model's own output for the previous word.
Neural Machine Translation (NMT): a machine translation approach proposed in recent years. Compared with conventional Statistical Machine Translation (SMT), NMT trains a neural network that maps one sequence to another and can output sequences of variable length.
Machine translation model (Transformer): a Transformer is an encoder-decoder system in which an encoder encodes a sequence in the source language and extracts its information, and a decoder converts that information into another language, i.e., the target language, thereby completing the translation.
Scheduled Sampling: a method for addressing the mismatch between the input data distributions at inference and training time. Early in training, it mainly uses the real elements of the target sequence as the decoder input, which quickly guides the model from a randomly initialized state to a reasonable state. As training progresses, the method increasingly uses the model's own inferred elements as the decoder input, thereby addressing the data inconsistency.
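Conventional scheduled sampling, as defined above, schedules by training step alone. A hedged sketch of one common step-based decay is shown below (an inverse-sigmoid schedule, following the scheduled sampling literature; the decay constant `k` and function name are illustrative, and this is the baseline the patent's confidence-based strategy departs from):

```python
# Hedged sketch of step-based scheduled sampling: the probability of
# feeding the gold (ground-truth) token decays with the training step,
# so the decoder gradually sees more of its own predictions.
import math

def gold_token_probability(step, k=100.0):
    """Inverse sigmoid decay: near 1 early in training, approaching 0 later."""
    return k / (k + math.exp(step / k))

early = gold_token_probability(0)     # almost always feed the gold token
late = gold_token_probability(2000)   # almost always feed model predictions
```

The patent's criticism is precisely that this probability depends only on `step`, not on how well the model actually performs at each position.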
BLEU (Bilingual Evaluation Understudy): a metric for evaluating translation quality. The metric focuses on how similar the machine's translation is to a human's, i.e., the similarity between the machine translation and a reference translation of the same text.
Fig. 1 illustrates an application scenario 100 of a translation model according to some embodiments of the present disclosure. In this application scenario, one or more user interfaces 101 are in bidirectional communication with one or more computing devices 108 via intermediary device 105. The user 104 interacts with one or more user interfaces 101 to complete two-way communication with the computing device 108.
Optionally, one or more databases, such as one or more of first database 110, second database 120, or third database 130, may also be present for implementing functionality in cooperation with computing device 108. It should be appreciated that in some embodiments, one or more of the one or more databases may be integrated into the computing device 108.
In some embodiments, intermediary device 105 may comprise a network connection, such as a combination of a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), and/or a communication network such as the Internet. In this case, the computing device 108 may act as a server, and the user interface 101 may interact with, e.g., send data to or receive data from, one or more computing devices 108, e.g., via the network. Computing device 108 and one or more user interfaces 101 may each include at least one communication interface (not shown) capable of communicating through intermediary device 105. Such a communication interface may be one or more of the following: any type of network interface (e.g., a Network Interface Card (NIC)), a wired or wireless interface (such as an IEEE 802.11 Wireless LAN (WLAN) interface), a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth™ interface, a Near Field Communication (NFC) interface, etc. Further examples of communication interfaces are described elsewhere herein.
In some embodiments, the intermediary device 105 may be a direct electrical connection, and the user interface 101 and the one or more computing devices 108 may be integrated on one or more terminal devices (not shown). The one or more terminal devices may be any type of device having computing capabilities, including mobile computers (e.g., Microsoft Surface devices, Personal Digital Assistants (PDAs), laptop computers, notebook computers, tablets such as the Apple iPad™, netbooks, etc.), mobile phones (e.g., cellular phones, smartphones such as the Microsoft Windows phone or the Apple iPhone, phones running the Google Android™ operating system, Palm devices, BlackBerry devices, etc.), wearable devices (e.g., smart watches, head-mounted devices including smart glasses such as Google Glass™, etc.), or other types of mobile devices. In some embodiments, one or more of the terminal devices may also be stationary devices, such as desktop computers, gaming consoles, smart televisions, and the like. Further, where there are multiple terminal devices, they may be the same or different types of devices.
The terminal device may include a display screen (not shown) and a terminal application (not shown) that can interact with a user via the display screen. The terminal application may be a native application, a Web application, or an applet (LiteApp, e.g., a cell-phone applet or a WeChat applet), which is a lightweight application. Where the terminal application is a native application that needs to be installed, it may be installed in the terminal device. Where the terminal application is a Web application, it can be accessed through a browser. Where the terminal application is an applet, it can be opened directly on the user terminal, without installation, by searching for relevant information about the terminal application (such as its name) or by scanning its graphic code (such as a barcode or QR code).
FIG. 2a illustrates a user interface 200a applying a neural network translation model according to one embodiment of the present application. In the user interface 200a, a background translation service provided by a technology provider can be used to translate an A-language sequence input by the user into a B-language sequence. Correspondingly, a B-language sequence input by the user can be translated into an A-language sequence using the background translation service. As understood by those skilled in the art, a user herein may be one or more users. The user may input the A language or the B language in a variety of ways, such as picture input, voice input, or keypad typing. The background translation model is trained using a training method of the neural network translation model according to one embodiment of the application.
FIG. 2b illustrates a user interface 200b to which the neural network translation model of one embodiment of the present application is applied. In the user interface 200b, the A-language sequence input by the user can be translated into the B-language sequence using a background translation service provided by the technology provider. For example, in FIG. 2b, an A-language sequence is entered in the left input box and, after processing by the background translation service as one of the underlying technologies, the translated B-language sequence is output in the right box. The background translation model is trained using the training method of the neural network translation model of one embodiment of the application. The user may input the A language or the B language in a variety of ways, such as picture input, voice input, or keypad typing.
In the machine translation task, the core architecture is the encoder-decoder scheme. The encoder processes a variable-length input and creates a fixed-length vector representation. The decoder generates a variable-length sequence (the target sequence) based on the encoded representation. FIG. 3 shows a schematic of the encoder-decoder architecture for the neural network translation model. As shown in FIG. 3, the variable-length input to the encoder is X = (x1, x2, …, xn), the encoded representation output by the encoder is Z = (z1, z2, …, zd), and the variable-length sequence output by the decoder is Y = (y1, y2, …, yn).
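The variable-length-in, fixed-length-representation contract can be illustrated with a toy stand-in encoder (a mean-pooled vector in place of the real self-attention encoder; purely illustrative, not the patent's model):

```python
# Toy illustration of the encoder contract: any variable-length input
# is mapped to a fixed-length representation of dimension d.

def encode(xs, d=4):
    """Map a variable-length numeric sequence to a d-dimensional vector
    (here simply the mean repeated d times, as a stand-in)."""
    mean = sum(xs) / len(xs)
    return [mean] * d

z = encode([1.0, 2.0, 3.0])            # input of length n = 3
z2 = encode([1.0, 2.0, 3.0, 4.0, 5.0])  # input of length n = 5
```

Both inputs yield a representation of the same fixed dimension d, which is what lets the decoder consume sources of any length.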
FIG. 4 schematically illustrates an exemplary architecture 400 for training a machine translation model based on a scheduled sampling strategy according to one embodiment of the present application. The architecture 400 includes an encoder 401 and two structurally identical decoders 402, 403. The encoder 401 contains a self-attention mechanism and a feed-forward (i.e., fully connected) neural network. The encoder 401 first encodes the source input sentence and, through the self-attention mechanism and the feed-forward structure, outputs the corresponding source-side semantic representation X. The architecture 400 includes two decoders, a first-round decoder 402 and a second-round decoder 403. The two decoders are identical in structure, each including a self-attention mechanism, a cross-attention mechanism, and a feed-forward structure, and the parameters of the corresponding structures in the first-round decoder 402 and the second-round decoder 403 are kept synchronized. The output of the encoder 401 is fed to the cross-attention mechanisms of the first-round decoder 402 and of the second-round decoder 403, respectively.
The first-round decoder 402 encodes the target-side input sentence with its self-attention mechanism. The self-attention-encoded target-side input and the source-side semantic representation X are then processed by the cross-attention mechanism and the fully connected neural network, and the model outputs the predicted sentence translation probability $p(y_{t} \mid \hat{y}_{<t}, X; \theta)$. From this output, the model's confidence probability $\mathrm{conf}(t)$ at each word position of the sentence at the current moment is calculated. In one embodiment, the confidence probability $\mathrm{conf}(t)$ can be computed in any of three alternative ways.

The confidence probability can be chosen as the predicted translation probability:

$\mathrm{conf}(t) = p(y^{*}_{t} \mid \hat{y}_{<t}, X; \theta)$,

where $y^{*}_{t}$ represents the t-th word of the standard translation, $\hat{y}_{<t}$ is the partial translation already produced at the target side, $X$ represents the semantic representation of the source sentence, and $\theta$ are the parameters of the NMT model.

The confidence probability can be chosen as the expectation of Monte Carlo samples:

$\mathrm{conf}(t) = \mathbb{E}\big[p(y^{*}_{t} \mid \hat{y}_{<t}, X; \theta_{k})\big]$,

where k represents the number of Monte Carlo samples and the other symbols are defined as above.

Furthermore, the confidence probability can be chosen as the variance of the Monte Carlo samples:

$\mathrm{conf}(t) = \mathrm{Var}\big[p(y^{*}_{t} \mid \hat{y}_{<t}, X; \theta_{k})\big]$,

where k represents the number of Monte Carlo samples and $\mathrm{Var}$ denotes the variance.
The second-round decoder 403 first decides its input for the current position based on the model confidence probability conf(t) computed by the first-round decoder 402 at each position t. When the confidence probability conf(t) is less than a threshold $\lambda_1$, the model has not yet learned the current position, so the (t-1)-th word of the standard translation, $y^{*}_{t-1}$, continues to be used as the input. When the confidence probability conf(t) is greater than a threshold $\lambda_2$, the model has already learned the current position well enough, so a more difficult random word $y^{r}_{t-1}$ should be adopted as the input. When the confidence probability conf(t) lies between the two thresholds, the model has essentially learned the current position, so the model's own prediction $\hat{y}_{t-1}$ is used as the input at the current time. The above process can be described by the following formula:

$\tilde{y}_{t-1} = \begin{cases} y^{*}_{t-1}, & \mathrm{conf}(t) \le \lambda_1 \\ \hat{y}_{t-1}, & \lambda_1 < \mathrm{conf}(t) \le \lambda_2 \\ y^{r}_{t-1}, & \mathrm{conf}(t) > \lambda_2 \end{cases}$

Here, $\lambda_1$ and $\lambda_2$ are hyperparameters that can be adjusted according to performance on a development set.
The second-round decoder 403 receives the above inputs and, through the processing of the self-attention mechanism, the cross-attention mechanism, and the fully connected neural network, outputs the model's final prediction probability, which is used to calculate the cross-entropy loss function for training the model.
FIG. 5 schematically illustrates an example flow diagram of a training method 500 for a machine translation model that schedules sampling based on confidence probabilities according to one embodiment of this disclosure. In step 501, the method 500 obtains a text to be trained, where the text to be trained includes at least a source input sentence and a standard target sentence corresponding to the source input sentence. Here, the source-side input sentence is first encoded by the encoder in the translation model; through the self-attention mechanism and fully connected neural network included in the encoder, it yields the semantic representation X corresponding to the source sentence. The first-round decoder includes a self-attention mechanism, a cross-attention mechanism, and a feed-forward network. The standard target sentence corresponding to the source-side input sentence is encoded by the first-round decoder, and the semantic representation of the source sentence is combined with the semantic representation of the corresponding standard target sentence as the input of the cross-attention mechanism in the first-round decoder. In step 502, a target sentence is predicted based on the source input sentence by using the translation model, to obtain a predicted target sentence. In one embodiment, the target sentence is predicted by the first decoder in the translation model based on the source input sentence, obtaining the translation probability of the t-th word given the prediction results of the first t-1 words (t ≥ 1) of the target sentence; the target word with the highest translation probability is then taken as the translation result for the t-th word of the target sentence. In step 503, a confidence probability is calculated for each word position in the target sentence based on the standard target sentence and the predicted target sentence.
In one embodiment, the confidence probability at the current word position is calculated based on the translation result of the t-th word and the standard translation result of the t-th word in the standard target sentence, given the prediction results of the first t-1 (t ≥ 1) words in the target sentence. In one embodiment, the confidence probability conf(t) can be calculated in three alternative ways. First, the confidence probability conf(t) can be chosen as the predicted translation probability:

conf(t) = p(y*_t | y*_{<t}, X; θ),

where y*_t represents the t-th word in the standard translation, y*_{<t} is the partial translation already produced at the target side, X represents the semantic representation of the source sentence, and θ denotes the parameters of the NMT model. Second, the confidence probability conf(t) can be chosen as the expectation of Monte Carlo sampling:

conf(t) = (1/k) · Σ_{i=1..k} p(y*_t | y*_{<t}, X; θ_i),

where k represents the number of Monte Carlo samples, θ_i denotes the model parameters under the i-th sample, and the other symbol definitions are consistent with the above. Third, the confidence probability conf(t) can be chosen as the variance of the Monte Carlo samples:

conf(t) = Var({ p(y*_t | y*_{<t}, X; θ_i) }_{i=1..k}),

where k represents the number of Monte Carlo samples and Var(·) represents the calculation of the variance.
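The three alternatives above can be sketched as follows; a minimal illustration assuming the per-sample probabilities p(y*_t | y*_{<t}, X; θ_i) have already been computed (the sample values below are hypothetical):

```python
import statistics

def confidence(probs_per_sample, mode):
    """Compute conf(t) for one target position from k Monte Carlo samples.

    probs_per_sample: list of probabilities of the gold word, one per sample;
    for mode "predict" a single forward-pass probability is enough.
    """
    if mode == "predict":   # predicted translation probability
        return probs_per_sample[0]
    if mode == "mc_mean":   # expectation of Monte Carlo sampling
        return sum(probs_per_sample) / len(probs_per_sample)
    if mode == "mc_var":    # variance of the Monte Carlo samples
        return statistics.pvariance(probs_per_sample)
    raise ValueError(f"unknown mode: {mode}")

samples = [0.82, 0.78, 0.80, 0.84]          # k = 4 hypothetical sampled probabilities
conf_mean = confidence(samples, "mc_mean")  # ≈ 0.81
conf_var = confidence(samples, "mc_var")    # ≈ 0.0005
```

In a real system each sample would correspond to one stochastic forward pass of the model; here only the resulting probabilities are modeled.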
In step 504, a combination of the standard target sentence and the predicted target sentence is determined as the target sentence input of the translation model based on the confidence probability. Specifically, when the confidence probability at the position of the t-th word is less than or equal to a first threshold, the t-1-th word in the standard target sentence is used as the target sentence input of a second decoder in the translation model; when the confidence probability at the position of the t-th word is greater than the first threshold and less than or equal to a second threshold, the t-1-th word in the predicted target sentence is used as the target sentence input of the second decoder in the translation model; and when the confidence probability at the position of the t-th word is greater than the second threshold, a random word is used as the target sentence input of the second decoder in the translation model. In one embodiment, the model has a confidence probability conf(t) at each position t. When conf(t) is less than or equal to a threshold τ1, it indicates that the model has not yet learned the current position, so the t-1-th word y*_{t-1} in the standard translation continues to be used as the input. When conf(t) is greater than a threshold τ2, it indicates that the model has already learned the current position well enough, so a more difficult random word y_rand should be adopted as the input. When conf(t) lies between the two thresholds, it indicates that the model has basically learned the current position, so the model's own prediction ŷ_{t-1} is used as the input for the current time step. The above process can be described by the following formula:

input(t-1) = y*_{t-1},  if conf(t) ≤ τ1;
input(t-1) = ŷ_{t-1},   if τ1 < conf(t) ≤ τ2;
input(t-1) = y_rand,    if conf(t) > τ2,

where τ1 and τ2 are hyper-parameters that can be adjusted according to the effect on the development set. Here, the second decoder in the translation model has the same structure and parameters as the first decoder, and the parameters of the two decoders are synchronized after each iteration. The first and/or second decoder includes a self-attention mechanism, a cross-attention mechanism, and a fully connected neural network, where the self-attention mechanism is used for feature extraction and dimension conversion of the input sentence. The target sentence is predicted by the second decoder in the translation model based on the source input sentence, obtaining the translation probability of the t-th word given the prediction results of the first t-1 (t ≥ 1) words in the target sentence; a cross-entropy loss function of the translation model is calculated based on the translation probabilities output by the second decoder.
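The threshold-based selection above can be sketched as follows; the threshold values and the toy vocabulary are hypothetical placeholders for the hyper-parameters tuned on a development set:

```python
import random

def schedule_input(conf_t, gold_word, pred_word, vocab, tau1=0.5, tau2=0.9):
    """Choose the second decoder's input at position t-1 from conf(t).

    tau1/tau2 stand in for the two hyper-parameter thresholds;
    gold_word/pred_word are the t-1-th standard and predicted words.
    """
    if conf_t <= tau1:       # position not yet learned: keep the gold word
        return gold_word
    if conf_t <= tau2:       # basically learned: use the model's own prediction
        return pred_word
    return random.choice(vocab)  # learned well: inject a harder random word

vocab = ["the", "cat", "sat", "mat"]
assert schedule_input(0.3, "gold", "pred", vocab) == "gold"
assert schedule_input(0.7, "gold", "pred", vocab) == "pred"
assert schedule_input(0.95, "gold", "pred", vocab) in vocab
```

Applied at every position of every training sentence, this reproduces the per-position mixing of standard, predicted, and random words described above.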
The above training method of the machine translation model that schedules sampling based on the confidence probability can decide the sampling strategy according to the real-time capability (namely, the confidence probability) of the model. The sampling strategy of scheduled sampling is determined by setting three handling modes for different ranges of the confidence probability. Specifically, the present application uses the model's own predictions for words with high confidence probability, uses the standard translation for words with low confidence probability, and at the same time alleviates the problem of algorithm degradation by adding noise for words with the highest confidence probability. Therefore, the scheduled sampling strategy provided by the present application greatly reduces the exposure bias problem of the NMT model and significantly improves the translation quality. The present application can be used for improving an online translation system.
A comparison of the translation model training method of the present application with the conventional training method and with the conventional scheduled sampling training method is schematically shown in Table 1.
Model                               WMT14 EN-DE     WMT19 ZH-EN     WMT14 EN-FR
Transformer                         27.90           24.97           39.90
Transformer (scheduled sampling)    28.60 (+0.70)   25.43 (+0.46)   40.62 (+0.72)
Transformer (present invention)     28.91 (+1.01)   26.00 (+1.03)   41.28 (+1.38)
Table 1. Performance of the translation model training method of the present application compared with the conventional training methods.
Here, for the three data sets WMT14 EN-DE, WMT19 ZH-EN, and WMT14 EN-FR respectively, the change of the BLEU value is considered under the Transformer, the Transformer combined with conventional scheduled sampling, and the Transformer combined with the training method proposed by the present invention. BLEU is a standard criterion for machine translation evaluation; a higher BLEU value indicates a better result. As can be seen from Table 1, for the WMT14 EN-DE data set, the BLEU value using the Transformer combined with scheduled sampling is increased by 0.70 relative to the Transformer baseline, and the BLEU value using the Transformer combined with the training method proposed by the present invention is increased by 1.01 relative to the Transformer baseline. For the WMT19 ZH-EN data set, the BLEU value of the Transformer combined with scheduled sampling is improved by 0.46 over the Transformer baseline, and the BLEU value of the Transformer combined with the training method provided by the present invention is improved by 1.03 over the Transformer baseline. For the WMT14 EN-FR data set, the BLEU value using the Transformer combined with scheduled sampling is increased by 0.72 over the Transformer baseline, and the BLEU value using the Transformer combined with the training method provided by the present invention is increased by 1.38 over the Transformer baseline.
FIG. 6 schematically illustrates an example block diagram of a training apparatus 600 for a machine translation model that schedules sampling based on confidence probabilities in accordance with this disclosure. The apparatus 600 includes an acquisition module 601, a prediction module 602, a confidence probability calculation module 603, and a scheduling module 604. The acquisition module 601 is configured to obtain a text to be trained, where the text to be trained includes at least a source input sentence and a standard target sentence corresponding to the source input sentence. Here, the source input sentence is first encoded by an encoder in the translation model. The encoder outputs a semantic representation X corresponding to the source sentence through its self-attention mechanism and fully connected neural network. The first-round decoder includes a self-attention mechanism, a cross-attention mechanism, and a feed-forward network. The standard target sentence corresponding to the source input sentence is encoded by the first-round decoder, and the semantic representation of the source sentence is combined with the semantic representation of the corresponding standard target sentence as the input of the cross-attention mechanism in the first-round decoder.
The prediction module 602 is configured to predict a target sentence based on the source input sentence using the translation model, resulting in a first predicted target sentence. In one embodiment, the target sentence is predicted by a first decoder in the translation model based on the source input sentence, obtaining the translation probability of the t-th word given the prediction results of the first t-1 (t ≥ 1) words in the target sentence; the target word with the highest translation probability is then selected as the translation result of the t-th word in the target sentence.
The confidence probability calculation module 603 is configured to calculate a confidence probability for each word position in the target sentence based on the standard target sentence and the predicted target sentence. In one embodiment, the confidence probability at the current word position is calculated based on the translation result of the t-th word and the standard translation result of the t-th word in the standard target sentence, given the prediction results of the first t-1 (t ≥ 1) words in the target sentence. In one embodiment, the confidence probability conf(t) can be calculated in three alternative ways. First, the confidence probability conf(t) can be chosen as the predicted translation probability:

conf(t) = p(y*_t | y*_{<t}, X; θ),

where y*_t represents the t-th word in the standard translation, y*_{<t} is the partial translation already produced at the target side, X represents the semantic representation of the source sentence, and θ denotes the parameters of the NMT model. Second, the confidence probability conf(t) can be chosen as the expectation of Monte Carlo sampling:

conf(t) = (1/k) · Σ_{i=1..k} p(y*_t | y*_{<t}, X; θ_i),

where k represents the number of Monte Carlo samples, θ_i denotes the model parameters under the i-th sample, and the other symbol definitions are consistent with the above. Third, the confidence probability conf(t) can be chosen as the variance of the Monte Carlo samples:

conf(t) = Var({ p(y*_t | y*_{<t}, X; θ_i) }_{i=1..k}),

where k represents the number of Monte Carlo samples and Var(·) represents the calculation of the variance.
The scheduling module 604 is configured to determine a combination of the standard target sentence and the predicted target sentence as the target sentence input of the translation model based on the confidence probability. Specifically, when the confidence probability at the position of the t-th word is less than or equal to a first threshold, the t-1-th word in the standard target sentence is used as the target sentence input of a second decoder in the translation model; when the confidence probability at the position of the t-th word is greater than the first threshold and less than or equal to a second threshold, the t-1-th word in the predicted target sentence is used as the target sentence input of the second decoder in the translation model; and when the confidence probability at the position of the t-th word is greater than the second threshold, a random word is used as the target sentence input of the second decoder in the translation model. In one embodiment, the model has a confidence probability conf(t) at each position t. When conf(t) is less than or equal to a threshold τ1, it indicates that the model has not yet learned the current position, so the t-1-th word y*_{t-1} in the standard translation continues to be used as the input. When conf(t) is greater than a threshold τ2, it indicates that the model has already learned the current position well enough, so a more difficult random word y_rand should be adopted as the input. When conf(t) lies between the two thresholds, it indicates that the model has basically learned the current position, so the model's own prediction ŷ_{t-1} is used as the input for the current time step. The above process can be described by the following formula:

input(t-1) = y*_{t-1},  if conf(t) ≤ τ1;
input(t-1) = ŷ_{t-1},   if τ1 < conf(t) ≤ τ2;
input(t-1) = y_rand,    if conf(t) > τ2,

where τ1 and τ2 are hyper-parameters that can be adjusted according to the effect on the development set. Here, the second decoder in the translation model has the same structure and parameters as the first decoder, and the parameters of the two decoders are synchronized after each iteration. The first and/or second decoder includes a self-attention mechanism, a cross-attention mechanism, and a fully connected neural network, where the self-attention mechanism is used for feature extraction and dimension conversion of the input sentence. The target sentence is predicted by the second decoder in the translation model based on the source input sentence, obtaining the translation probability of the t-th word given the prediction results of the first t-1 (t ≥ 1) words in the target sentence; a cross-entropy loss function of the translation model is calculated based on the translation probabilities output by the second decoder.
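The parameter synchronization between the first and second decoders mentioned above can be sketched as follows; plain dictionaries stand in for the decoders' parameter tensors, and all names are hypothetical:

```python
def sync_decoders(first_params, second_params):
    """After each training iteration, copy the first decoder's parameters
    into the second decoder so both keep identical structure and weights."""
    second_params.clear()
    second_params.update(first_params)
    return second_params

first = {"self_attn.w": 0.3, "cross_attn.w": -1.2, "ffn.w": 0.7}
second = {"self_attn.w": 0.0, "cross_attn.w": 0.0, "ffn.w": 0.0}
sync_decoders(first, second)  # second now mirrors first
```

In an actual implementation this would copy weight tensors (or share references) between two decoder instances after the optimizer step.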
The above training device of the machine translation model that schedules sampling based on the confidence probability can decide the sampling strategy according to the real-time capability (namely, the confidence probability) of the model. The sampling strategy of scheduled sampling is determined by setting three handling modes for different ranges of the confidence probability. Specifically, the present application uses the model's own predictions for words with high confidence probability, uses the standard translation for words with low confidence probability, and at the same time alleviates the problem of algorithm degradation by adding noise for words with the highest confidence probability. Therefore, the scheduled sampling strategy provided by the present application greatly reduces the exposure bias problem of the NMT model and significantly improves the translation quality. The present application can be used for improving an online translation system.
Fig. 7 illustrates an example system 700 that includes an example computing device 710 that represents one or more systems and/or devices in which aspects described herein can be implemented. Computing device 710 may be, for example, a server of a service provider, a device associated with a server, a system on a chip, and/or any other suitable computing device or computing system. The training apparatus 600 for the translation model described above with reference to fig. 6 may take the form of a computing device 710. Alternatively, the training means 600 of the translation model may be implemented as a computer program in the form of an application 716.
The example computing device 710 as illustrated in fig. 7 includes a processing system 711, one or more computer-readable media 712, and one or more I/O interfaces 713 communicatively coupled to each other. Although not shown, the computing device 710 may also include a system bus or other data and command transfer system that couples the various components to one another. A system bus can include any one or combination of different bus structures, such as a memory bus or memory controller, a peripheral bus, a universal serial bus, and/or a processor or local bus that utilizes any of a variety of bus architectures. Various other examples are also contemplated, such as control and data lines.
The processing system 711 represents functionality to perform one or more operations using hardware. Thus, the processing system 711 is illustrated as including hardware elements 714 that may be configured as processors, functional blocks, and so forth. This may include implementation in hardware as an application specific integrated circuit or other logic device formed using one or more semiconductors. The hardware element 714 is not limited by the material from which it is formed or the processing mechanism employed therein. For example, a processor may be comprised of semiconductor(s) and/or transistors (e.g., electronic Integrated Circuits (ICs)). In such a context, processor-executable instructions may be electronically-executable instructions.
The computer-readable medium 712 is illustrated as including a memory/storage 715. Memory/storage 715 represents memory/storage capacity associated with one or more computer-readable media. The memory/storage 715 may include volatile media (such as Random Access Memory (RAM)) and/or nonvolatile media (such as Read Only Memory (ROM), flash memory, optical disks, magnetic disks, and so forth). Memory/storage 715 may include fixed media (e.g., RAM, ROM, a fixed hard drive, etc.) as well as removable media (e.g., flash memory, a removable hard drive, an optical disk, and so forth). The computer-readable medium 712 may be configured in various other ways as further described below.
One or more I/O interfaces 713 represent functionality that allows a user to enter commands and information to computing device 710 using various input devices and optionally also allows information to be presented to the user and/or other components or devices using various output devices. Examples of input devices include a keyboard, a cursor control device (e.g., a mouse), a microphone (e.g., for voice input), a scanner, touch functionality (e.g., capacitive or other sensors configured to detect physical touch), a camera (e.g., motion that does not involve touch may be detected as gestures using visible or invisible wavelengths such as infrared frequencies), and so forth. Examples of output devices include a display device (e.g., a monitor or projector), speakers, a printer, a network card, a haptic response device, and so forth. Accordingly, the computing device 710 may be configured in various ways as further described below to support user interaction.
Computing device 710 also includes application 716. The application 716 may be, for example, a software instance of the training apparatus 600 of the translation model described with reference to fig. 6, and implements the techniques described herein in combination with other elements in the computing device 710.
Various techniques may be described herein in the general context of software hardware elements or program modules. Generally, these modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The terms "module," "functionality," and "component" as used herein generally represent software, firmware, hardware, or a combination thereof. The features of the techniques described herein are platform-independent, meaning that the techniques may be implemented on a variety of computing platforms having a variety of processors.
An implementation of the described modules and techniques may be stored on or transmitted across some form of computer readable media. Computer-readable media can include a variety of media that can be accessed by computing device 710. By way of example, and not limitation, computer-readable media may comprise "computer-readable storage media" and "computer-readable signal media".
"Computer-readable storage medium" refers to a medium and/or device, and/or a tangible storage apparatus, capable of persistently storing information, as opposed to mere signal transmission, carrier wave, or signal per se. Accordingly, computer-readable storage media refers to non-signal bearing media. Computer-readable storage media include hardware such as volatile and nonvolatile, removable and non-removable media and/or storage devices implemented in a method or technology suitable for storage of information such as computer-readable instructions, data structures, program modules, logic elements/circuits or other data. Examples of computer-readable storage media may include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVDs) or other optical storage, hard disks, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or other storage devices, tangible media, or an article of manufacture suitable for storing the desired information and accessible by a computer.
"computer-readable signal medium" refers to a signal-bearing medium configured to transmit instructions to hardware of computing device 710, such as via a network. Signal media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave, data signal or other transport mechanism. Signal media also include any information delivery media. The term "modulated data signal" means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media.
As previously mentioned, hardware element 714 and computer-readable medium 712 represent instructions, modules, programmable device logic, and/or fixed device logic implemented in hardware form that may be used in some embodiments to implement at least some aspects of the techniques described herein. The hardware elements may include integrated circuits or systems-on-chip, Application-Specific Integrated Circuits (ASICs), Field-Programmable Gate Arrays (FPGAs), Complex Programmable Logic Devices (CPLDs), and other implementations in silicon or components of other hardware devices. In this context, a hardware element may serve as a processing device to perform program tasks defined by instructions, modules, and/or logic embodied by the hardware element, as well as a hardware device to store instructions for execution, such as the computer-readable storage medium described previously.
Combinations of the foregoing may also be used to implement the various techniques and modules described herein. Thus, software, hardware, or program modules and other program modules may be implemented as one or more instructions and/or logic embodied on some form of computer-readable storage medium and/or by one or more hardware elements 714. Computing device 710 may be configured to implement particular instructions and/or functions corresponding to software and/or hardware modules. Thus, implementing a module as a module executable by computing device 710 as software may be implemented at least partially in hardware, for example, using computer-readable storage media of a processing system and/or hardware elements 714. The instructions and/or functions may be executable/operable by one or more articles of manufacture (e.g., one or more computing devices 710 and/or processing systems 711) to implement the techniques, modules, and examples described herein.
In various embodiments, computing device 710 may take on a variety of different configurations. For example, the computing device 710 may be implemented as a computer-like device including a personal computer, a desktop computer, a multi-screen computer, a laptop computer, a netbook, and so forth. The computing device 710 may also be implemented as a mobile device-like device including mobile devices such as mobile phones, portable music players, portable gaming devices, tablet computers, multi-screen computers, and the like. Computing device 710 may also be implemented as a television-like device that includes devices with or connected to a generally larger screen in a casual viewing environment. These devices include televisions, set-top boxes, game consoles, and the like.
The techniques described herein may be supported by these various configurations of computing device 710 and are not limited to specific examples of the techniques described herein. Functionality may also be implemented in whole or in part on "cloud" 720 through the use of a distributed system, such as through platform 722 as described below.
Cloud 720 includes and/or is representative of platform 722 for resources 724. The platform 722 abstracts underlying functionality of hardware (e.g., servers) and software resources of the cloud 720. Resources 724 may include other applications and/or data that may be used when executing computer processes on servers remote from computing device 710. The resources 724 may also include services provided over the internet and/or over a subscriber network, such as a cellular or Wi-Fi network.
Platform 722 may abstract resources and functions to connect computing device 710 with other computing devices. The platform 722 may also be used to abstract a hierarchy of resources to provide a corresponding level of hierarchy encountered for the demand of the resources 724 implemented via the platform 722. Thus, in interconnected device embodiments, implementation of functions described herein may be distributed throughout the system 700. For example, the functionality may be implemented in part on the computing device 710 and through the platform 722 that abstracts the functionality of the cloud 720.
It will be appreciated that embodiments of the disclosure have been described with reference to different functional units for clarity. However, it will be apparent that the functionality of each functional unit may be implemented in a single unit, in a plurality of units or as part of other functional units without departing from the disclosure. For example, functionality illustrated to be performed by a single unit may be performed by a plurality of different units. Thus, references to specific functional units are only to be seen as references to suitable units for providing the described functionality rather than indicative of a strict logical or physical structure or organization. Thus, the present disclosure may be implemented in a single unit or may be physically and functionally distributed between different units and circuits.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various devices, elements, components or sections, these devices, elements, components or sections should not be limited by these terms. These terms are only used to distinguish one device, element, component or section from another device, element, component or section.
Although the present disclosure has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Rather, the scope of the present disclosure is limited only by the accompanying claims. Additionally, although individual features may be included in different claims, these may possibly advantageously be combined, and the inclusion in different claims does not imply that a combination of features is not feasible and/or advantageous. The order of features in the claims does not imply any specific order in which the features must be worked. Furthermore, in the claims, the word "comprising" does not exclude other elements, and the terms "a" or "an" do not exclude a plurality. Reference signs in the claims are provided merely as a clarifying example and shall not be construed as limiting the scope of the claims in any way.

Claims (12)

1. A method for training a translation model, comprising:
acquiring a text to be trained, wherein the text to be trained comprises at least a source input sentence and a standard target sentence corresponding to the source input sentence;
predicting a target sentence based on the source input sentence by using the translation model, to obtain a first predicted target sentence;
calculating a confidence probability for each word position in the target sentence based on the standard target sentence and the predicted target sentence;
determining a combination of the standard target sentence and the predicted target sentence as a target sentence input of the translation model based on the confidence probability.
2. The method of claim 1, wherein the predicting, using the translation model, a target sentence based on the source input sentence to obtain a predicted target sentence comprises:
predicting a target sentence by using a first decoder in the translation model based on the source input sentence, to obtain the translation probability of the t-th word given the prediction results of the first t-1 (t ≥ 1) words in the target sentence; and
selecting the target word with the highest translation probability as the translation result of the t-th word in the target sentence.
3. The method of claim 2, wherein the calculating a confidence probability for each word position in the target sentence based on the standard target sentence and the predicted target sentence comprises:
calculating the confidence probability at the current word position of the target word based on the translation result of the t-th word and the standard translation result of the t-th word in the standard target sentence, given the prediction results of the first t-1 (t ≥ 1) words in the target sentence.
4. The method of any one of claims 1-3, wherein the confidence probability comprises any one of:
a predicted translation probability conf(t) = p(y*_t | y*_{<t}, X; θ), where y*_t is the t-th word in the standard translation, y*_{<t} is the partial translation already produced at the target side, X is the semantic representation of the source sentence, and θ denotes parameters of the translation model;
an expectation of Monte Carlo sampling, conf(t) = E[p(y*_t | y*_{<t}, X; θ_i)] = (1/k) Σ_{i=1..k} p(y*_t | y*_{<t}, X; θ_i), where k represents the number of Monte Carlo samples and E denotes the expectation; and
a variance of the Monte Carlo samples, conf(t) = Var({p(y*_t | y*_{<t}, X; θ_i)}_{i=1..k}), where k represents the number of Monte Carlo samples and Var denotes the variance.
5. The method of any of claims 1-3, further comprising:
encoding the source input sentence by using an encoder in the translation model, and outputting a semantic representation corresponding to the source sentence;
wherein the encoder includes a self-attention mechanism and a fully-connected neural network.
6. The method of claim 1, wherein the determining a combination of the standard target sentence and the predicted target sentence as a target sentence input to the translation model based on the confidence probability comprises:
when the confidence probability at the position of the t-th word is less than or equal to a first threshold, taking the t-1-th word in the standard target sentence as the target sentence input of a second decoder in the translation model;
when the confidence probability at the position of the t-th word is greater than the first threshold and less than or equal to a second threshold, taking the t-1-th word in the predicted target sentence as the target sentence input of the second decoder in the translation model; and
when the confidence probability at the position of the t-th word is greater than the second threshold, taking a random word as the target sentence input of the second decoder in the translation model.
7. The method of claim 6, wherein a second decoder in the translation model has the same structure and parameters as the first decoder, and the parameters of the second decoder and the first decoder are synchronized after each iteration.
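The parameter synchronization of claim 7 amounts to copying the first decoder's parameters into the second after every iteration. A minimal sketch with a placeholder `Decoder` class (the class and parameter layout are illustrative assumptions):

```python
import copy

class Decoder:
    def __init__(self, params):
        self.params = params

# second decoder starts as a structural copy of the first
first = Decoder({"w": [0.1, 0.2]})
second = Decoder(copy.deepcopy(first.params))

# ... one training iteration updates the first decoder ...
first.params["w"][0] = 0.15

# synchronize after the iteration so both decoders stay identical
second.params = copy.deepcopy(first.params)
print(second.params["w"][0])  # 0.15
```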
8. The method of claim 6, wherein the first and/or second decoder comprises a self-attention mechanism, a cross-attention mechanism, and a fully-connected neural network, wherein the self-attention mechanism is used for feature extraction and dimension conversion of an input sentence.
9. The method of any of claims 1-8, further comprising:
predicting the target sentence with a second decoder in the translation model based on the source input sentence, obtaining the translation probability of the t-th word given the predictions of the first t-1 (t ≥ 1) words in the target sentence; and
calculating a cross-entropy loss function for the translation model based on the translation probability output by the second decoder.
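The cross-entropy loss of claim 9 is the negative log of the probability the second decoder assigns to each golden word, averaged over the sentence. A sketch with hypothetical per-word probabilities:

```python
import math

def cross_entropy_loss(word_probs):
    """Token-level cross entropy averaged over the target sentence:
    -log of the translation probability of each golden word."""
    return -sum(math.log(p) for p in word_probs) / len(word_probs)

# hypothetical translation probabilities output by the second decoder
probs = [0.9, 0.8, 0.95]
loss = cross_entropy_loss(probs)
print(round(loss, 4))
```

Minimizing this loss pushes the decoder's probability for each golden word toward 1.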
10. An apparatus for training a translation model, comprising:
an acquisition module configured to acquire a text to be trained, the text to be trained comprising at least a source input sentence and a standard target sentence corresponding to the source input sentence;
a prediction module configured to predict a target sentence based on the source input sentence using the translation model, resulting in a predicted target sentence;
a confidence probability calculation module configured to calculate a confidence probability for each word position in the target sentence based on the standard target sentence and the predicted target sentence; and
a scheduling module configured to determine a combination of the standard target sentence and the predicted target sentence as the target sentence input to the translation model based on the confidence probabilities.
11. A computing device, comprising:
a memory configured to store computer-executable instructions; and
a processor configured to perform the method of any one of claims 1-9 when executing the computer-executable instructions.
12. A computer-readable storage medium storing computer-executable instructions that, when executed, perform the method of any one of claims 1-9.
CN202110567123.3A 2021-05-24 2021-05-24 Translation model training method based on confidence probability, using method, device and storage medium Pending CN115392266A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110567123.3A CN115392266A (en) 2021-05-24 2021-05-24 Translation model training method based on confidence probability, using method, device and storage medium


Publications (1)

Publication Number Publication Date
CN115392266A true CN115392266A (en) 2022-11-25

Family

ID=84114642

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110567123.3A Pending CN115392266A (en) 2021-05-24 2021-05-24 Translation model training method based on confidence probability, using method, device and storage medium

Country Status (1)

Country Link
CN (1) CN115392266A (en)

Similar Documents

Publication Publication Date Title
CN112487182B (en) Training method of text processing model, text processing method and device
US10936949B2 (en) Training machine learning models using task selection policies to increase learning progress
US20210004677A1 (en) Data compression using jointly trained encoder, decoder, and prior neural networks
JP2023539532A (en) Text classification model training method, text classification method, device, equipment, storage medium and computer program
CN112418292B (en) Image quality evaluation method, device, computer equipment and storage medium
CN111816159B (en) Language identification method and related device
EP3885966B1 (en) Method and device for generating natural language description information
CN112214591B (en) Dialog prediction method and device
CN109919221B (en) Image description method based on bidirectional double-attention machine
CN113761153B (en) Picture-based question-answering processing method and device, readable medium and electronic equipment
CN111639247A (en) Method, apparatus, device and computer-readable storage medium for evaluating quality of review
CN111985243B (en) Emotion model training method, emotion analysis device and storage medium
CN108304376B (en) Text vector determination method and device, storage medium and electronic device
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
CN111767394A (en) Abstract extraction method and device based on artificial intelligence expert system
CN112307168A (en) Artificial intelligence-based inquiry session processing method and device and computer equipment
CN113707299A (en) Auxiliary diagnosis method and device based on inquiry session and computer equipment
CN110955765A (en) Corpus construction method and apparatus of intelligent assistant, computer device and storage medium
CN116913278B (en) Voice processing method, device, equipment and storage medium
CN117094365A (en) Training method and device for image-text generation model, electronic equipment and medium
CN111723186A (en) Knowledge graph generation method based on artificial intelligence for dialog system and electronic equipment
CN116341564A (en) Problem reasoning method and device based on semantic understanding
CN115392266A (en) Translation model training method based on confidence probability, using method, device and storage medium
CN116306612A (en) Word and sentence generation method and related equipment
CN109815323B (en) Human-computer interaction training question-answer generation algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination