CN111339789A - Translation model training method and device, electronic equipment and storage medium


Info

Publication number
CN111339789A
CN111339789A
Authority
CN
China
Prior art keywords
corpus
document
source
target
translation model
Prior art date
Legal status
Granted
Application number
CN202010105061.XA
Other languages
Chinese (zh)
Other versions
CN111339789B (en)
Inventor
李磊 (Li Lei)
王明轩 (Wang Mingxuan)
曹军 (Cao Jun)
孙泽维 (Sun Zewei)
Current Assignee
Beijing ByteDance Network Technology Co Ltd
Original Assignee
Beijing ByteDance Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing ByteDance Network Technology Co Ltd
Priority to CN202010105061.XA
Publication of CN111339789A
Application granted
Publication of CN111339789B
Legal status: Active
Anticipated expiration

Classifications

    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Machine Translation (AREA)

Abstract

Embodiments of the disclosure disclose a translation model training method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: acquiring a first source document corpus of a source language and splitting it into first source monolingual corpora; inputting the first source monolingual corpora into a mature machine translation model and taking the output results as first target monolingual corpora of a target language; splicing the first target monolingual corpora to form a first target document corpus of the target language; forming a parallel bilingual corpus according to the first source document corpus and the first target document corpus; and training a document machine translation model by taking the parallel bilingual corpus as a training sample. The technical solution of the embodiments makes it possible to train a machine translation model with complete documents as its parallel bilingual corpus samples, thereby improving the accuracy of document translation by the machine translation model.

Description

Translation model training method and device, electronic equipment and storage medium
Technical Field
Embodiments of the disclosure relate to the technical field of machine translation, and in particular to a translation model training method and device, an electronic device, and a storage medium.
Background
Machine translation refers to the technique of using a computing device, such as a computer, to translate an original text in one natural language (generally called the source language) into a translated text in another natural language (generally called the target language). Because the work is done by machine, a large volume of translation can be processed in a relatively short time compared with manual translation.
Existing machine translation services generally feed source text into the machine translation model one sentence at a time, a sentence typically consisting of several to a dozen words. That is, existing machine translation models support only sentence-level translation. When such a model is used to translate a document sentence by sentence, the result is inaccurate because the context of each sentence within the document cannot be taken into account as a whole. How to develop a machine translation model that takes a whole document as its translation object is therefore an urgent problem.
Because the document-level parallel bilingual corpus samples required for training are difficult to obtain, a machine translation model that takes a document as its translation object is difficult to train successfully.
Disclosure of Invention
Embodiments of the disclosure provide a translation model training method and apparatus, an electronic device, and a storage medium, so that a machine translation model can be trained with complete documents as its parallel bilingual corpus samples, thereby improving the accuracy of document translation by the machine translation model.
In a first aspect, an embodiment of the present disclosure provides a translation model training method, including:
obtaining a first source document corpus of a source language, wherein the first source document corpus is a real document corpus of the source language;
splitting the first source document corpus into first source monolingual corpora;
inputting the first source monolingual corpus into a mature machine translation model, and taking an output result as a first target monolingual corpus of a target language;
splicing the first target monolingual corpora to form a first target document corpus of the target language;
forming a parallel bilingual corpus according to the first source document corpus and the first target document corpus; and
training a document machine translation model by taking the parallel bilingual corpus as a training sample.
In a second aspect, an embodiment of the present disclosure further provides a translation model training apparatus, including:
a first source document corpus acquisition module, configured to acquire a first source document corpus of a source language, wherein the first source document corpus is a real document corpus of the source language;
the first source monolingual corpus splitting module is used for splitting the first source document corpus into first source monolingual corpora;
the first target monolingual corpus acquisition module is used for inputting the first source monolingual corpus into a mature machine translation model and taking an output result as a first target monolingual corpus of a target language;
a first target document corpus acquiring module, configured to splice the first target monolingual corpus to form a first target document corpus of the target language;
a first training sample acquisition module, configured to form a parallel bilingual corpus according to the first source document corpus and the first target document corpus;
and the first document machine translation model training module is used for training the document machine translation model by taking the parallel bilingual corpus as a training sample.
In a third aspect, an embodiment of the present disclosure further provides an electronic device, where the electronic device includes:
one or more processors;
storage means for storing one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the translation model training method provided by any embodiment of the disclosure.
In a fourth aspect, the embodiments of the present disclosure further provide a computer storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the translation model training method provided in any of the embodiments of the present disclosure.
In the embodiments of the present disclosure, an obtained first source document corpus of a source language is split into first source monolingual corpora; the first source monolingual corpora are input into a mature machine translation model, and the output results are taken as first target monolingual corpora of a target language; the first target monolingual corpora are spliced into a first target document corpus of the target language; and finally a parallel bilingual corpus is composed from the first source document corpus and the first target document corpus and used as a training sample to train a document machine translation model. A complete document thus serves as a parallel bilingual corpus sample for the machine translation model, which improves the accuracy of document translation by the machine translation model.
Drawings
FIG. 1 is a flowchart of a translation model training method provided by an embodiment of the present disclosure;
FIG. 2a is a flowchart of another translation model training method provided by an embodiment of the present disclosure;
FIG. 2b is a schematic structural diagram of a Seq2Seq model provided in an embodiment of the present disclosure;
FIG. 2c is a schematic diagram of an encoder in a Seq2Seq model according to an embodiment of the present disclosure;
FIG. 2d is a schematic diagram of a decoder in a Seq2Seq model according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of a document translation method provided by an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a translation model training apparatus provided in an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an", and "the" in this disclosure are intended to be illustrative rather than limiting; those skilled in the art will understand them to mean "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
Examples
Fig. 1 is a flowchart of a translation model training method provided in an embodiment of the present disclosure. The embodiment is applicable to training a machine translation model that takes a document as its translation object. The method may be executed by a translation model training apparatus, which may be configured in an electronic device; the electronic device may be a terminal device, such as a mobile phone, a vehicle-mounted terminal, or a notebook computer, or may be a server. As shown in fig. 1, the method comprises the following operations:
s110, obtaining a first source document corpus of a source language, wherein the first source document corpus is a real document corpus of the source language.
Wherein, the source language is the language of the document to be translated. The first source document corpus may be a document corpus corresponding to a source language. The first source document corpus may include one or more paragraphs, each of which may be composed of a plurality of sentences.
The key point of training a machine translation model taking a document as a translation object lies in how to obtain a training sample, namely how to obtain a document-level parallel bilingual corpus sample required by training. For this reason, in the embodiment of the present disclosure, a first source document corpus of a source language may be first obtained as a data source of a training sample.
It should be noted that the first source document corpus is a real document corpus of the source language. It may be obtained from related corpora collected and stored in an open-source corpus database, in public internet resources (such as web page data), or in a local database; a real document corpus of any type in the source language can serve as the first source document corpus, and the embodiments of the present disclosure do not limit how it is obtained.
S120, splitting the first source document corpus into first source monolingual corpora.
The first source monolingual corpora may be sentence-level monolingual corpora of the source language.
Correspondingly, after the first source document corpus of the source language is obtained, it can be split into sentence-level first source monolingual corpora. It should be noted that when the first source document corpus is split, it must be split sequentially, following the order of the sentences in the document.
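To make the splitting step concrete, here is a minimal Python sketch (an illustration for this write-up, not code from the disclosure); it assumes sentence boundaries are marked by common ASCII or CJK end punctuation, whereas a production system would use a proper sentence segmenter:

```python
import re

def split_document(document: str) -> list[str]:
    """Split a document corpus into sentence-level monolingual corpora,
    preserving the order of the sentences in the document."""
    # Assumption: sentences end with ., !, ? or their CJK counterparts.
    sentences = re.split(r"(?<=[.!?。！？])\s*", document)
    return [s for s in sentences if s.strip()]
```

For example, split_document("今天天气很好。我们去公园。") returns the two sentences in their original order.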
S130, inputting the first source monolingual corpus into a mature machine translation model, and taking an output result as a first target monolingual corpus of a target language.
The mature machine translation model may be a machine translation model that has already been trained; it translates input text into translated text, and may translate text of M languages into translated text of N languages. Specifically, the machine translation model includes a machine learning model, for example a neural network model, which may be a single neural network model (such as a convolutional neural network model) or a fused neural network model (such as a model fusing a convolutional neural network and a recurrent neural network). Alternatively, a mature Seq2Seq model can be used as the mature machine translation model. The target language is the translation language corresponding to the source language. Illustratively, for the Chinese-to-English translation mode, the source language is Chinese and the target language is English; for the Chinese-to-French translation mode, the source language is Chinese and the target language is French. That is, the source language and the target language are determined by the current translation mode, and the embodiments of the present disclosure do not limit their specific types. The first target monolingual corpus may be a monolingual corpus of the target language obtained by translating the first source monolingual corpus with the mature machine translation model; it is a sentence-level monolingual corpus matching the first source monolingual corpus.
In the embodiments of the present disclosure, after the sentence-level first source monolingual corpora are obtained, they may be input into the mature machine translation model, and its output results may be taken as the first target monolingual corpora of the target language. It will be appreciated that, since the input first source monolingual corpora are sentence-level corpora, the first target monolingual corpora output by the mature machine translation model are also sentence-level corpora. It should be noted that the order of the first source monolingual corpora is preserved when the first source document corpus is split, so the corpora can be translated in that preserved order, ensuring that the order of the output first target monolingual corpora is consistent with the order of the first source monolingual corpora.
S140, splicing the first target monolingual corpora to form a first target document corpus of the target language.
The first target document corpus may be a translation document corpus corresponding to the first source document corpus.
Correspondingly, after obtaining each first target monolingual corpus corresponding to the first source document corpus, the first target monolingual corpora can be spliced to form a first target document corpus of the target language. Optionally, the first target monolingual corpora may be sequentially spliced according to the order of the first target monolingual corpora, so as to ensure that the first target document corpora matches with the first source document corpora.
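Steps S130 and S140 can be sketched together as follows; sentence_model.translate is a hypothetical interface standing in for any mature sentence-level translation model, and sep should be an empty string for target languages written without spaces:

```python
def synthesize_target_document(source_sentences: list[str],
                               sentence_model, sep: str = " ") -> str:
    """Translate the source sentences in their preserved order and splice
    the sentence-level outputs into a target document corpus."""
    target_sentences = [sentence_model.translate(s) for s in source_sentences]
    return sep.join(target_sentences)
```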
S150, forming parallel bilingual corpus according to the first source document corpus and the first target document corpus.
S160, training a document machine translation model by taking the parallel bilingual corpus as a training sample.
The parallel bilingual corpus is a bilingual corpus simultaneously including a source document corpus and a target document corpus. The document machine translation model may be a machine translation model in which a document is a translation object.
In the embodiment of the present disclosure, after the first target document corpus is obtained, parallel bilingual corpus may be composed according to the first source document corpus and the first target document corpus, and the parallel bilingual corpus is used as a training sample to train a document machine translation model.
In this way, by splicing the translation results of the first source monolingual corpora obtained by splitting the first source document corpus, a first target document corpus of the target language is formed. It should be noted that the first target document corpus retains a degree of contextual coherence. Once the first target document corpus corresponding to the first source document corpus is obtained, the parallel bilingual corpus composed of the first source document corpus and the corresponding first target document corpus serves as a training sample for the document machine translation model. A complete document thus serves as a parallel bilingual corpus sample for the machine translation model, enabling effective training of the document machine translation model, so that the model can translate document data effectively, improving the accuracy of document translation.
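Putting steps S110-S160 together, a hedged sketch of assembling the forward training samples might look like this, reusing the split_document and synthesize_target_document sketches above (document_model.fit is a hypothetical trainer, not an interface defined by the disclosure):

```python
def build_forward_samples(real_source_documents, sentence_model):
    """Pair each real source document with its synthetic target document."""
    samples = []
    for src_doc in real_source_documents:
        sentences = split_document(src_doc)                              # S120
        tgt_doc = synthesize_target_document(sentences, sentence_model)  # S130, S140
        samples.append({"source": src_doc, "target": tgt_doc})           # S150
    return samples

# S160: train the document machine translation model on the samples, e.g.
# document_model.fit(build_forward_samples(docs, mature_model))
```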
In an optional embodiment of the present disclosure, composing the parallel bilingual corpus according to the first source document corpus and the first target document corpus may include: forming a forward parallel bilingual corpus according to the first source document corpus and the first target document corpus.
The forward parallel bilingual corpus may be a parallel bilingual corpus consisting of a source document corpus and a target document corpus. The source document corpus may be a real document corpus of a source language, and the target document corpus may be a translation document corpus corresponding to the source document corpus.
In the embodiment of the present disclosure, optionally, a parallel bilingual corpus composed of the first source document corpus and the first target document corpus may be used as the forward parallel bilingual corpus. For example, in a Chinese-to-English translation mode, the Chinese source document corpus and the translated English target document corpus may constitute a forward parallel bilingual corpus.
It should be noted that "sentence level" in the embodiments of the present disclosure means one or more sentences. For example, a sentence-level monolingual corpus of the source language may be one or two monolingual sentences of the source language; the embodiments of the present disclosure do not limit the number of sentences included at the sentence level.
In the embodiments of the present disclosure, an obtained first source document corpus of a source language is split into first source monolingual corpora; the first source monolingual corpora are input into a mature machine translation model, and the output results are taken as first target monolingual corpora of a target language; the first target monolingual corpora are spliced into a first target document corpus of the target language; and finally a parallel bilingual corpus is composed from the first source document corpus and the first target document corpus and used as a training sample to train a document machine translation model. A complete document thus serves as a parallel bilingual corpus sample for the machine translation model, which improves the accuracy of document translation by the machine translation model.
Fig. 2a is a flowchart of a translation model training method provided in an embodiment of the present disclosure, refined on the basis of the foregoing embodiment. This embodiment gives a specific implementation of composing an antiparallel bilingual corpus from a second target document corpus and a second source document corpus. Accordingly, as shown in fig. 2a, the method of the present embodiment may include:
s210, obtaining a first source document corpus of a source language, wherein the first source document corpus is a real document corpus of the source language.
S220, splitting the first source document corpus into first source monolingual corpora.
S230, inputting the first source monolingual corpus into a mature machine translation model, and taking the output result as a first target monolingual corpus of a target language.
S240, splicing the first target monolingual corpora to form a first target document corpus of the target language.
S250, forming a forward parallel bilingual corpus according to the first source document corpus and the first target document corpus, and taking the forward parallel bilingual corpus as a training sample to train a document machine translation model.
S260, obtaining a second target document corpus of the target language, wherein the second target document corpus is a real document corpus of the target language.
The second target document corpus may be a real document corpus corresponding to the target language. That is, the second target document corpus is the real document corpus to be translated in the target language.
In the embodiment of the present disclosure, in order to further improve the accuracy of document translation of the machine translation model, an antiparallel bilingual corpus may be formed according to the target document corpus and the source document corpus and used as a training sample to train the document machine translation model. Therefore, the real document corpus of the target language can be obtained as the second target document corpus. That is, the second target document corpus is used as an input, and the obtained output result is used as a source document corpus of the source language.
It should be noted that the second target document corpus is a real document corpus of the target language. It may be obtained from related corpora collected and stored in an open-source corpus database, in public internet resources (such as web page data), or in a local database; a real document corpus of any type in the target language can serve as the second target document corpus.
S270, splitting the second target document corpus into second target monolingual corpora.
The second target monolingual corpus may be a monolingual corpus of a target language at a sentence level.
Correspondingly, after the second target document corpus of the target language is obtained, it can be split into sentence-level second target monolingual corpora. It should be noted that when the second target document corpus is split, it must be split sequentially, following the order of the sentences in the document.
S280, inputting the second target monolingual corpus into the mature machine translation model, and taking the output result as a second source monolingual corpus of the source language.
The second source monolingual corpora may be monolingual corpora of the source language obtained by translating the second target monolingual corpora with the mature machine translation model; they are sentence-level monolingual corpora matching the second target monolingual corpora.
In the embodiments of the present disclosure, after the sentence-level second target monolingual corpora are obtained, they may be input into the mature machine translation model, and its output results may be taken as the second source monolingual corpora of the source language. It will be appreciated that, since the input second target monolingual corpora are sentence-level corpora, the second source monolingual corpora output by the mature machine translation model are also sentence-level corpora. It should be noted that the order of the second target monolingual corpora is preserved when the second target document corpus is split, so the corpora can be translated in that preserved order, ensuring that the order of the output second source monolingual corpora is consistent with the order of the second target monolingual corpora.
S290, splicing the second source monolingual corpora to form a second source document corpus of the source language.
The second source document corpus may be a translation document corpus corresponding to the second target document corpus.
Correspondingly, after obtaining each second source monolingual corpus corresponding to the second target document corpus, the second source monolingual corpora can be spliced to form a second source document corpus of the source language. Optionally, the second source monolingual corpus may be sequentially spliced according to the order of the second source monolingual corpus to ensure that the second source document corpus is matched with the second target document corpus.
S2100, forming an antiparallel bilingual corpus according to the second target document corpus and the second source document corpus, and taking the antiparallel bilingual corpus as a training sample to train a document machine translation model.
The antiparallel bilingual corpus may be a parallel bilingual corpus consisting of a target document corpus and a source document corpus. The target document corpus may be a real document corpus of a target language, and the source document corpus may be a translated document corpus corresponding to the target document corpus.
In the embodiments of the present disclosure, after the second source document corpus is obtained, the antiparallel bilingual corpus may be composed from the second target document corpus and the second source document corpus, and used as a training sample to train the document machine translation model. For example, in the Chinese-to-English translation mode, the real English target document corpus (the document corpus to be translated) and the translated Chinese source document corpus may constitute an antiparallel bilingual corpus. That is, in the embodiments of the present disclosure, the mature machine translation model can translate in at least two directions, for example supporting both Chinese-to-English and English-to-Chinese translation, or both English-to-French and French-to-English translation.
That is, in the disclosed embodiments, the training samples used to train the document machine translation model include two types: one is the forward parallel bilingual corpus, composed of a first source document corpus of the source language and a first target document corpus of the target language; the other is the antiparallel bilingual corpus, composed of a second target document corpus of the target language and a second source document corpus of the source language.
It should be noted that, when a bidirectional translation mode is supported, the roles of the source language and the target language may in practice be interchanged. Illustratively, in the Chinese-to-English translation mode, the source language is Chinese and the target language is English; in the English-to-Chinese translation mode, the source language is English and the target language is Chinese. To avoid confusion, in the embodiments of the present disclosure the source language and the target language always denote the same languages. For example, in the Chinese-to-English translation mode, the source language is Chinese and the target language is English; in the English-to-Chinese translation mode, the source language is still defined as Chinese and the target language as English. Therefore, when the Chinese-English bidirectional translation mode is supported, the forward parallel bilingual corpus may consist of a Chinese document corpus and the English document corpus translated from it, and the antiparallel bilingual corpus may consist of an English document corpus and the Chinese document corpus translated from it.
In addition, to ensure a high-quality training effect, when these two types of training samples are used to train the document machine translation model, the real document corpus can always serve as the expected translation result of the model. In one specific example, assume the document machine translation model supports the Chinese-English bidirectional translation mode; in both the Chinese-to-English and English-to-Chinese modes, the source language is Chinese and the target language is English. When training the Chinese-to-English translation function, the antiparallel bilingual corpus can be used as the training sample, that is, the real English document corpus together with the Chinese document corpus translated from it. When training the English-to-Chinese translation function, the forward parallel bilingual corpus can be used as the training sample, that is, the real Chinese document corpus together with the English document corpus translated from it.
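As an illustration only (the interfaces are assumptions, not the disclosure's), the two sample types can be built under the fixed convention above, with the real corpus always on the expected-output side; zh2en and en2zh are hypothetical mature sentence-level translators, and the helper functions come from the earlier sketches:

```python
def build_bidirectional_samples(real_zh_docs, real_en_docs, zh2en, en2zh):
    """By convention the source language is always Chinese, the target
    language always English, regardless of the direction being trained."""
    forward, antiparallel = [], []
    for zh in real_zh_docs:
        # Forward pair: real Chinese document -> synthetic English document.
        en = synthesize_target_document(split_document(zh), zh2en, sep=" ")
        forward.append({"source": zh, "target": en, "real_side": "source"})
    for en in real_en_docs:
        # Antiparallel pair: real English document with synthetic Chinese.
        zh = synthesize_target_document(split_document(en), en2zh, sep="")
        antiparallel.append({"source": zh, "target": en, "real_side": "target"})
    # Train the English-to-Chinese function on forward pairs and the
    # Chinese-to-English function on antiparallel pairs, so that the real
    # corpus is always the expected translation result.
    return forward, antiparallel
```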
S2200, evaluating the output result of the mature machine translation model by adopting at least one preset evaluation index, and updating the parallel bilingual corpus according to the evaluation result.
The preset evaluation index can be an index for evaluating an output result of the mature machine translation model. Optionally, the preset evaluation index may include, but is not limited to, translation accuracy, document length, and full-text consistency.
In order to further guarantee the accuracy and reliability of the training sample, at least one preset evaluation index can be adopted to evaluate the output result of the mature machine translation model, and the parallel bilingual corpus is updated according to the evaluation result. The output result of the mature machine translation model may include a first target monolingual corpus of a target language, or may include a second source monolingual corpus of a source language. Accordingly, the parallel bilingual corpus may include forward parallel bilingual corpus and may include anti-parallel bilingual corpus.
An evaluation score is obtained as the evaluation result by evaluating with the different preset evaluation indices. Illustratively, the higher the translation accuracy, the more accurate the output of the mature machine translation model and the higher the score; the longer the document, the lower the reliability of the output of the mature machine translation model and the lower the score; the higher the full-text consistency, the more accurate the output and the higher the score. Correspondingly, the evaluation score may be obtained by combining the scores of the preset evaluation indices, for example by accumulating them, or by multiplying each score by a matched weight and then accumulating; the embodiments of the present disclosure do not limit the specific implementation of the evaluation result. After the evaluation result is obtained, the parallel bilingual corpus is updated according to it.
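A minimal sketch of the weighted-accumulation variant described above; the index names, score scales, and weights are illustrative assumptions:

```python
def evaluation_score(index_scores: dict, weights: dict) -> float:
    """Multiply each preset evaluation index score by its matched weight
    and accumulate the products into one evaluation result."""
    return sum(weights[name] * score for name, score in index_scores.items())

score = evaluation_score(
    {"translation_accuracy": 80, "document_length": 65, "full_text_consistency": 72},
    {"translation_accuracy": 0.5, "document_length": 0.2, "full_text_consistency": 0.3},
)  # 0.5*80 + 0.2*65 + 0.3*72 = 74.6
```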
In an optional embodiment of the present disclosure, updating the parallel bilingual corpus according to the evaluation result may include: if the evaluation result is determined not to meet the evaluation standard, deleting the output result of the mature machine translation model and the input corpus matched with that output result.
The evaluation standard can be determined according to the type and number of the preset evaluation indices. For example, when the preset evaluation index includes only the translation accuracy, the evaluation standard may be that the translation accuracy reaches 60%. When the preset evaluation indices include translation accuracy and full-text consistency, the evaluation standard can be set on the final evaluation score, for example that the score reaches 60 points or more; the embodiments of the present disclosure do not limit the specific content of the evaluation standard.
Specifically, when the parallel bilingual corpus is updated according to the evaluation result, the output result of which the evaluation result does not meet the evaluation standard and the input corpus matched with the output result can be deleted. When the output result is a first target monolingual corpus of the target language, the input corpus is a first source monolingual corpus; and when the output result is a second source monolingual corpus of the source language, the input corpus is a second target monolingual corpus. Therefore, in the process of repeated training, the overall accuracy and reliability of the training sample can be gradually improved until the training is finished.
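The corpus update can then be sketched as a filter over the (input corpus, output result) pairs; score_fn and the 60-point threshold are assumptions that echo the example standard above:

```python
def update_parallel_corpus(pairs, score_fn, standard: float = 60.0):
    """Delete every output result whose evaluation result fails the
    evaluation standard, together with its matched input corpus."""
    return [pair for pair in pairs if score_fn(pair) >= standard]
```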
It should be noted that "sentence level" in the embodiments of the present disclosure means one or more sentences. For example, a sentence-level monolingual corpus of the source language may be one or two monolingual sentences of the source language; the embodiments of the present disclosure do not limit the number of sentences included at the sentence level.
In an alternative embodiment of the present disclosure, the mature machine translation model and the document machine translation model may be the same machine translation model.
Optionally, the mature machine translation model and the document machine translation model may be the same machine translation model; that is, the parallel bilingual corpus training samples may be used directly to continue training the mature machine translation model, so that the mature machine translation model becomes able to translate complete documents, thereby improving the accuracy of its document translation.
In one specific example, a mature Seq2Seq (sequence-to-sequence) model can be used as the mature machine translation model and/or the document machine translation model. The Seq2Seq model is a variant of the recurrent neural network comprising an Encoder and a Decoder. Fig. 2b is a schematic structural diagram of a Seq2Seq model provided by an embodiment of the present disclosure. As shown in fig. 2b, the encoder encodes the information of a sequence, turning sequence information (x) of arbitrary length into a feature vector (c); specifically, it segments the text sequence of the text to be translated and transcodes it into a feature vector. The decoder parses the feature vector (c) according to context information to form a text sequence (y), i.e., the translated text. The feature vector thus characterizes the text to be translated.
When computing the feature vector, the encoder is usually preconfigured with an initial hidden-layer vector; taking the first text element as input, it computes the hidden-layer vector for the current time step. Each subsequent text element is then taken as input in turn, and the hidden-layer vector obtained at the previous step is transformed into the hidden-layer vector for the current step; the hidden-layer vector obtained once all text elements have been input is the feature vector.
Fig. 2c is a schematic diagram of an encoder in a Seq2Seq model according to an embodiment of the present disclosure. Exemplarily, as shown in fig. 2c, h_1, h_2, h_3, ..., h_n are hidden-layer vectors, each related to the state at the previous time and the current input; h_0 is a preset initial hidden-layer vector; x_1, x_2, x_3, ..., x_n are text elements; and c is the feature vector. h_1 is calculated from h_0 and the input x_1 at that time; h_2 is calculated from h_1 and the input x_2; and so on, until c is calculated from h_{n-1} and the input x_n at the final time. The encoder segments the text to be translated into at least one text element, which may include but is not limited to words and sentences, and transforms the initial hidden-layer vector step by step into the feature vector characterizing the text to be translated, thereby completing the encoding process.
Fig. 2d is a schematic diagram of a decoder in a Seq2Seq model according to an embodiment of the present disclosure. Exemplarily, as shown in fig. 2d, h_1', h_2', h_3', ..., h_n' are hidden-layer vectors, each related to the state at the previous time and the current input; h_0' is a preset initial hidden-layer vector; y_1, y_2, y_3, ..., y_n are the output sequence; and c is the feature vector. h_1' is calculated from h_0' and c; h_2' from h_1' and c; and so on, until h_n' is calculated from h_{n-1}' and c. Meanwhile, the probabilities of multiple candidate translation text elements are calculated from h_0', h_1' and c, and the target text element determined in this way is output as y_1; y_2 is output according to h_1', y_1 and c; and so on for the rest, with y_n output according to h_{n-1}', y_{n-1} and c.
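The recurrences above can be made concrete with a minimal NumPy sketch (an assumption of this write-up, not code from the disclosure): the weights are random rather than learned, a plain tanh cell stands in for whatever recurrent unit is used, and the Attention mechanism discussed below is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d, vocab = 8, 20                          # illustrative dimensions

# Randomly initialized parameters; in a real model these are learned.
W_xh = 0.1 * rng.normal(size=(d, d))      # encoder input weights
W_hh = 0.1 * rng.normal(size=(d, d))      # encoder recurrent weights
V_hh = 0.1 * rng.normal(size=(d, d))      # decoder recurrent weights
V_ch = 0.1 * rng.normal(size=(d, d))      # decoder weights on c
E_y  = 0.1 * rng.normal(size=(d, vocab))  # embedding of the previous output
W_hy = 0.1 * rng.normal(size=(vocab, d))  # projection onto candidate elements

def encode(xs):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t); the final state is c."""
    h = np.zeros(d)                       # preset initial hidden-layer vector h_0
    for x in xs:
        h = np.tanh(W_hh @ h + W_xh @ x)
    return h                              # feature vector c

def decode(c, steps, bos=0):
    """h'_t = tanh(V_hh h'_{t-1} + V_ch c + E_y[:, y_{t-1}]); y_t is the
    candidate element with the highest probability under softmax(W_hy h'_t)."""
    h, y_prev, ys = np.zeros(d), bos, []
    for _ in range(steps):
        h = np.tanh(V_hh @ h + V_ch @ c + E_y[:, y_prev])
        logits = W_hy @ h
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        y_prev = int(probs.argmax())
        ys.append(y_prev)
    return ys

xs = [rng.normal(size=d) for _ in range(5)]  # embedded text elements x_1..x_5
print(decode(encode(xs), steps=4))           # token ids from the untrained model
```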
In practice, both the encoder and the decoder may be constructed from neural network models. The neural network model may include at least one of: a convolutional neural network model, a recurrent neural network model, a deep neural network model, a back-propagation neural network model, a long short-term memory network model, and a gated recurrent unit model. Constructing the encoder and decoder from neural network models improves the accuracy of encoding the text to be translated and of decoding the feature vector, thereby improving the translation accuracy for the text to be translated.
In addition, the Seq2Seq model may further employ an Attention mechanism. In that case, when the decoder parses the feature vector, each target text element is related not only to the decoder's previous hidden-layer vector, the feature vector, and the target text element output at the previous time, but also to the hidden-layer vectors in the encoder.
It should be noted that fig. 2a shows only one possible implementation; there is no required order between steps S210-S250 and steps S260-S2100. Steps S210-S250 may be performed first and then steps S260-S2100, or steps S260-S2100 first and then steps S210-S250, or the two groups of steps may be performed in parallel or in alternation.
Fig. 3 is a flowchart of a document translation method provided in an embodiment of the present disclosure. The embodiment is applicable to translating a document. The method may be executed by an electronic device, which may be a terminal device, such as a mobile phone, a vehicle-mounted terminal, or a notebook computer, or may be a server. As shown in fig. 3, the method includes the following operations:
s310, obtaining the document to be translated in the first language.
The first language may be the source language or the target language. The document to be translated may consist of at least one paragraph, and each paragraph may comprise a plurality of sentences.
S320, translating the document to be translated into a target document of a second language by using a document translation model.
Here, the document translation model is a translation model trained according to the translation model training method provided by any of the above embodiments. The second language may be the source language or the target language: when the first language is the source language, the second language is the target language; when the first language is the target language, the second language is the source language. The target document is the document obtained by translating the document to be translated.
In the embodiments of the present disclosure, the electronic device may be configured with a document translation apparatus, which translates the acquired document to be translated in the first language using the document machine translation model. A document to be translated in the source language and a document to be translated in the target language can both be handled, yielding the target document in the second language.
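As a usage sketch (the translate interface is an assumption, not specified by the disclosure), the key difference from a sentence-level service is that the whole document is passed in one piece, so cross-sentence context is available to the model:

```python
def translate_document(document: str, document_model) -> str:
    """Translate a whole first-language document into the second language
    with the trained document machine translation model."""
    return document_model.translate(document)

# target_doc = translate_document(doc_in_first_language, document_model)
```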
In the embodiments of the present disclosure, the document to be translated in the first language is translated with a trained document translation model to obtain a target document in the second language, which solves the problem that existing machine translation services can hardly provide a document translation function and realizes document-level translation.
Fig. 4 is a schematic diagram of a translation model training apparatus provided in an embodiment of the present disclosure, which may be implemented in software and/or hardware, and may be configured in an electronic device. As shown in fig. 4, the apparatus includes: a first source document corpus obtaining module 410, a first source monolingual corpus splitting module 420, a first target monolingual corpus obtaining module 430, a first target document corpus obtaining module 440, a first training sample obtaining module 450, and a first document machine translation model training module 460, wherein:
a first source document corpus acquiring module 410, configured to acquire a first source document corpus of a source language, where the first source document corpus is a real document corpus of the source language;
a first source monolingual corpus splitting module 420, configured to split the first source document corpus into first source monolingual corpora;
a first target monolingual corpus obtaining module 430, configured to input the first source monolingual corpus into a mature machine translation model, and use an output result as a first target monolingual corpus of a target language;
a first target document corpus acquiring module 440, configured to splice the first target monolingual corpus to form a first target document corpus of the target language;
a first training sample obtaining module 450, configured to form a parallel bilingual corpus according to the first source document corpus and the first target document corpus;
and a first document machine translation model training module 460, configured to train the document machine translation model by using the parallel bilingual corpus as a training sample.
In the embodiments of the present disclosure, an obtained first source document corpus of a source language is split into first source monolingual corpora; the first source monolingual corpora are input into a mature machine translation model, and the output results are taken as first target monolingual corpora of a target language; the first target monolingual corpora are spliced into a first target document corpus of the target language; and finally a parallel bilingual corpus is composed from the first source document corpus and the first target document corpus and used as a training sample to train a document machine translation model. A complete document thus serves as a parallel bilingual corpus sample for the machine translation model, which improves the accuracy of document translation by the machine translation model.
Optionally, the first training sample obtaining module 450 is specifically configured to compose a forward parallel bilingual corpus according to the first source document corpus and the first target document corpus.
Optionally, the apparatus further comprises: a second target document corpus acquiring module, configured to acquire a second target document corpus of the target language, where the second target document corpus is a real document corpus of the target language; the second target monolingual corpus splitting module is used for splitting the second target document corpus into second target monolingual corpora; a second source monolingual corpus obtaining module, configured to input the second target monolingual corpus into the mature machine translation model, and take an output result as a second source monolingual corpus of the source language; a second source document corpus acquiring module, configured to splice the second source monolingual corpora to form a second source document corpus of the source language; a second training sample acquisition module, configured to form an antiparallel bilingual corpus according to the second target document corpus and the second source document corpus; and taking the antiparallel bilingual corpus as a training sample to train the document machine translation model.
Optionally, the apparatus further comprises: the output result evaluation module is used for evaluating the output result of the mature machine translation model by adopting at least one preset evaluation index; and the parallel bilingual corpus updating module is used for updating the parallel bilingual corpus according to the evaluation result.
Optionally, the parallel bilingual corpus update module is specifically configured to delete the output result of the mature machine translation model and the input corpus matched with the output result if it is determined that the evaluation result does not meet the evaluation criterion.
Optionally, the preset evaluation index includes translation accuracy, document length, and full-text consistency.
Optionally, the mature machine translation model and the document machine translation model are the same machine translation model.
The translation model training device can execute the translation model training method provided by any embodiment of the disclosure, and has corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to a translation model training method provided in any embodiment of the present disclosure.
Since the above-described translation model training apparatus is an apparatus capable of executing the translation model training method in the embodiments of the present disclosure, a person skilled in the art, based on the translation model training method described herein, can understand the specific implementation of the translation model training apparatus of this embodiment and its variations; a detailed description of how the apparatus implements the method is therefore omitted here. Any apparatus used by a person skilled in the art to implement the translation model training method in the embodiments of the present disclosure falls within the intended scope of protection of the present application.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. Referring now to FIG. 5, a block diagram of an electronic device 500 suitable for use in implementing embodiments of the present disclosure is shown. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a stationary terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 5, electronic device 500 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 501 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage means 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the electronic device 500 are also stored. The processing device 501, the ROM 502, and the RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
Generally, the following devices may be connected to the I/O interface 505: input devices 506 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 507 including, for example, a Liquid Crystal Display (LCD), speakers, vibrators, and the like; storage devices 508 including, for example, magnetic tape, hard disk, etc.; and a communication device 509. The communication means 509 may allow the electronic device 500 to communicate with other devices wirelessly or by wire to exchange data. While fig. 5 illustrates an electronic device 500 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 509, or installed from the storage means 508, or installed from the ROM 502. The computer program performs the above-described functions defined in the methods of the embodiments of the present disclosure when executed by the processing device 501.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with any form or medium of digital data communication (e.g., a communications network). Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: inputting source speech into a pre-trained speech translation model, and specifying a target language; and acquiring the translation speech which is output by the speech translation model and corresponds to the target language, wherein the language to be translated corresponding to the source speech is different from the target language.
Alternatively, the computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: obtain a first source document corpus of a source language, wherein the first source document corpus is a real document corpus of the source language; split the first source document corpus into first source monolingual corpora; input the first source monolingual corpus into a mature machine translation model and take the output result as a first target monolingual corpus of a target language; splice the first target monolingual corpus to form a first target document corpus of the target language; form a parallel bilingual corpus according to the first source document corpus and the first target document corpus; and train a document machine translation model by taking the parallel bilingual corpus as a training sample.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of the module does not, in some cases, constitute a limitation on the module itself, and for example, the first source document corpus acquiring module may also be described as a "module for acquiring a first source document corpus in a source language".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), systems on a chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, there is provided a translation model training method including:
obtaining a first source document corpus of a source language, wherein the first source document corpus is a real document corpus of the source language;
splitting the first source document corpus into first source monolingual corpora;
inputting the first source monolingual corpus into a mature machine translation model, and taking an output result as a first target monolingual corpus of a target language;
splicing the first target monolingual corpus to form a first target document corpus of the target language;
forming a parallel bilingual corpus according to the first source document corpus and the first target document corpus; and
training a document machine translation model by taking the parallel bilingual corpus as a training sample.
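To make the above steps concrete, the following minimal Python sketch walks through the forward corpus construction. It is an illustration only, not the claimed implementation: the translate function is a hypothetical stand-in for the mature sentence-level machine translation model, and the naive sentence splitter is likewise an assumption.

import re
from typing import List, Tuple

def translate(sentence: str, src_lang: str, tgt_lang: str) -> str:
    """Hypothetical stand-in for the mature sentence-level machine
    translation model; a real system would invoke a trained model here."""
    raise NotImplementedError

def split_into_sentences(document: str) -> List[str]:
    # Naive split on sentence-final punctuation; a production system would
    # use a language-aware sentence segmenter.
    return [s for s in re.split(r"(?<=[.!?])\s+", document.strip()) if s]

def build_forward_pair(source_document: str,
                       src_lang: str, tgt_lang: str) -> Tuple[str, str]:
    # Step 1: split the real source-language document into monolingual sentences.
    source_sentences = split_into_sentences(source_document)
    # Step 2: translate each sentence with the mature sentence-level model.
    target_sentences = [translate(s, src_lang, tgt_lang) for s in source_sentences]
    # Step 3: splice the translations into a synthetic target-language document.
    target_document = " ".join(target_sentences)
    # Step 4: the (real source document, synthetic target document) pair is
    # one forward parallel bilingual training sample.
    return source_document, target_document

Each pair produced this way couples a genuine source-language document with a synthetic target-language document, which is what allows a document-level model to be trained despite the scarcity of real document-aligned bilingual data.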
According to one or more embodiments of the present disclosure, in a translation model training method provided by the present disclosure, forming a parallel bilingual corpus according to the first source document corpus and the first target document corpus includes:
forming a forward parallel bilingual corpus according to the first source document corpus and the first target document corpus.
According to one or more embodiments of the present disclosure, in the translation model training method provided by the present disclosure, the method further includes:
acquiring a second target document corpus of the target language, wherein the second target document corpus is a real document corpus of the target language;
splitting the second target document corpus into second target monolingual corpora;
inputting the second target monolingual corpus into the mature machine translation model, and taking the output result as a second source monolingual corpus of the source language;
splicing the second source monolingual corpus to form a second source document corpus of the source language;
forming an antiparallel bilingual corpus according to the second target document corpus and the second source document corpus; and
training the document machine translation model by taking the antiparallel bilingual corpus as a training sample.
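The antiparallel direction admits an equally short sketch, reusing the hypothetical translate and split_into_sentences helpers from the previous example; again this is an assumption-laden illustration rather than the claimed implementation.

def build_antiparallel_pair(target_document: str,
                            src_lang: str, tgt_lang: str) -> Tuple[str, str]:
    # Mirror image of the forward direction: start from a real
    # target-language document and back-translate it sentence by sentence.
    target_sentences = split_into_sentences(target_document)
    source_sentences = [translate(s, tgt_lang, src_lang) for s in target_sentences]
    # Splice the back-translations into a synthetic source-language document.
    synthetic_source = " ".join(source_sentences)
    # The synthetic text now sits on the source side and the real text on the
    # target side, so the document model learns to produce genuine
    # target-language documents.
    return synthetic_source, target_document

Taken together, the forward and antiparallel halves of the training set give the document model real text on the source side in one half and real text on the target side in the other.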
According to one or more embodiments of the present disclosure, in the translation model training method provided by the present disclosure, after forming an antiparallel bilingual corpus according to the second target document corpus and the second source document corpus, the method further includes:
evaluating the output result of the mature machine translation model by adopting at least one preset evaluation index;
updating the parallel bilingual corpus according to the evaluation result.
According to one or more embodiments of the present disclosure, in the translation model training method provided by the present disclosure, updating the parallel bilingual corpus according to an evaluation result includes:
if it is determined that the evaluation result does not meet the evaluation criterion, deleting the output result of the mature machine translation model and the input corpus matched with the output result.
According to one or more embodiments of the present disclosure, in the translation model training method provided by the present disclosure, the preset evaluation indices include translation accuracy, document length, and full-text consistency.
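As one way such filtering might look in practice, the sketch below scores each synthetic pair on the three named indices and deletes failures. All scoring fields and thresholds are invented for the example; the disclosure names the indices but not how they are computed.

from dataclasses import dataclass

@dataclass
class Evaluation:
    translation_accuracy: float  # e.g. a round-trip BLEU-style score in [0, 1]
    length_ratio: float          # output length divided by input length
    consistency: float           # e.g. share of repeated terms translated uniformly

def meets_criteria(ev: Evaluation,
                   min_accuracy: float = 0.3,
                   length_bounds: tuple = (0.5, 2.0),
                   min_consistency: float = 0.8) -> bool:
    # Illustrative thresholds only; real values would have to be tuned.
    lo, hi = length_bounds
    return (ev.translation_accuracy >= min_accuracy
            and lo <= ev.length_ratio <= hi
            and ev.consistency >= min_consistency)

def update_corpus(corpus, evaluations):
    # Delete each (input corpus, output result) pair whose evaluation fails
    # the criteria; the surviving pairs form the updated parallel corpus.
    return [pair for pair, ev in zip(corpus, evaluations) if meets_criteria(ev)]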
According to one or more embodiments of the present disclosure, in the translation model training method provided by the present disclosure, the mature machine translation model and the document machine translation model are the same machine translation model.
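One possible reading of this arrangement is an iterative self-training loop, in which the model trained on the synthetic documents then serves as the mature model for the next round of corpus generation. The sketch below (reusing split_into_sentences from the earlier example) illustrates that reading; the model.translate interface, the train routine, and the loop structure are all assumptions rather than details given by the disclosure.

def train(model, corpus):
    """Hypothetical training routine: fine-tune the document translation
    model on a list of (source document, target document) pairs."""
    raise NotImplementedError

def self_training(model, source_documents, target_documents,
                  src_lang: str, tgt_lang: str, rounds: int = 2):
    # Here the model being trained also plays the role of the mature model
    # that regenerates the synthetic corpus on each round.
    for _ in range(rounds):
        corpus = []
        for doc in source_documents:
            sents = split_into_sentences(doc)
            synthetic = " ".join(model.translate(s, src_lang, tgt_lang) for s in sents)
            corpus.append((doc, synthetic))       # forward pair
        for doc in target_documents:
            sents = split_into_sentences(doc)
            synthetic = " ".join(model.translate(s, tgt_lang, src_lang) for s in sents)
            corpus.append((synthetic, doc))       # antiparallel pair
        model = train(model, corpus)
    return model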
According to one or more embodiments of the present disclosure, there is provided a translation model training apparatus including:
a first source document corpus acquiring module, configured to acquire a first source document corpus of a source language, wherein the first source document corpus is a real document corpus of the source language;
the first source monolingual corpus splitting module is used for splitting the first source document corpus into first source monolingual corpora;
the first target monolingual corpus acquisition module is used for inputting the first source monolingual corpus into a mature machine translation model and taking an output result as a first target monolingual corpus of a target language;
a first target document corpus acquiring module, configured to splice the first target monolingual corpus to form a first target document corpus of the target language;
a first training sample acquisition module, configured to form a parallel bilingual corpus according to the first source document corpus and the first target document corpus;
and a first document machine translation model training module, configured to train the document machine translation model by taking the parallel bilingual corpus as a training sample.
According to one or more embodiments of the present disclosure, in the translation model training apparatus provided by the present disclosure, the first training sample acquisition module is specifically configured to form a forward parallel bilingual corpus according to the first source document corpus and the first target document corpus.
According to one or more embodiments of the present disclosure, in a translation model training apparatus provided by the present disclosure, the apparatus further includes:
a second target document corpus acquiring module, configured to acquire a second target document corpus of the target language, where the second target document corpus is a real document corpus of the target language;
a second target monolingual corpus splitting module, configured to split the second target document corpus into second target monolingual corpora;
a second source monolingual corpus acquiring module, configured to input the second target monolingual corpus into the mature machine translation model and take the output result as a second source monolingual corpus of the source language;
a second source document corpus acquiring module, configured to splice the second source monolingual corpus to form a second source document corpus of the source language;
a second training sample acquisition module, configured to form an antiparallel bilingual corpus according to the second target document corpus and the second source document corpus;
and a second document machine translation model training module, configured to train the document machine translation model by taking the antiparallel bilingual corpus as a training sample.
According to one or more embodiments of the present disclosure, in a translation model training apparatus provided by the present disclosure, the apparatus further includes: the output result evaluation module is used for evaluating the output result of the mature machine translation model by adopting at least one preset evaluation index; and the parallel bilingual corpus updating module is used for updating the parallel bilingual corpus according to the evaluation result.
According to one or more embodiments of the present disclosure, in the translation model training apparatus provided by the present disclosure, the parallel bilingual corpus update module is specifically configured to delete the output result of the mature machine translation model and the input corpus matched with the output result if it is determined that the evaluation result does not satisfy the evaluation criterion.
According to one or more embodiments of the present disclosure, in the translation model training apparatus provided by the present disclosure, the preset evaluation indices include translation accuracy, document length, and full-text consistency.
According to one or more embodiments of the present disclosure, in the translation model training apparatus provided by the present disclosure, the mature machine translation model and the document machine translation model are the same machine translation model.
The foregoing description is merely an illustration of the preferred embodiments of the present disclosure and of the technical principles employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of the features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions in which the above features are replaced with (but not limited to) features with similar functions disclosed in the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (11)

1. A translation model training method is characterized by comprising the following steps:
obtaining a first source document corpus of a source language, wherein the first source document corpus is a real document corpus of the source language;
splitting the first source document corpus into first source monolingual corpora;
inputting the first source monolingual corpus into a mature machine translation model, and taking an output result as a first target monolingual corpus of a target language;
splicing the first target monolingual corpus to form a first target document corpus of the target language;
forming a parallel bilingual corpus according to the first source document corpus and the first target document corpus; and
training a document machine translation model by taking the parallel bilingual corpus as a training sample.
2. The method of claim 1, wherein forming a parallel bilingual corpus according to the first source document corpus and the first target document corpus comprises:
forming a forward parallel bilingual corpus according to the first source document corpus and the first target document corpus.
3. The method according to claim 1 or 2, further comprising:
acquiring a second target document corpus of the target language, wherein the second target document corpus is a real document corpus of the target language;
splitting the second target document corpus into second target monolingual corpora;
inputting the second target monolingual corpus into the mature machine translation model, and taking the output result as a second source monolingual corpus of the source language;
splicing the second source monolingual corpus to form a second source document corpus of the source language;
forming an antiparallel bilingual corpus according to the second target document corpus and the second source document corpus; and
training the document machine translation model by taking the antiparallel bilingual corpus as a training sample.
4. The method of claim 3, further comprising, after forming the antiparallel bilingual corpus according to the second target document corpus and the second source document corpus:
evaluating the output result of the mature machine translation model by adopting at least one preset evaluation index;
updating the parallel bilingual corpus according to the evaluation result.
5. The method according to claim 4, wherein updating the parallel bilingual corpus according to the evaluation results comprises:
if it is determined that the evaluation result does not meet the evaluation criterion, deleting the output result of the mature machine translation model and the input corpus matched with the output result.
6. The method according to claim 5, wherein the preset evaluation indices comprise translation accuracy, document length, and full-text consistency.
7. The method of any of claims 1-6, wherein the mature machine translation model is the same machine translation model as the document machine translation model.
8. A method of document translation, the method comprising:
acquiring a document to be translated in a first language;
translating the document to be translated into a target document in a second language by using a document translation model, wherein the document translation model is a translation model trained according to the method of any one of claims 1-7.
9. A translation model training apparatus, comprising:
a first source document corpus acquiring module, configured to acquire a first source document corpus of a source language, wherein the first source document corpus is a real document corpus of the source language;
the first source monolingual corpus splitting module is used for splitting the first source document corpus into first source monolingual corpora;
the first target monolingual corpus acquisition module is used for inputting the first source monolingual corpus into a mature machine translation model and taking an output result as a first target monolingual corpus of a target language;
a first target document corpus acquiring module, configured to splice the first target monolingual corpus to form a first target document corpus of the target language;
a first training sample acquisition module, configured to form a parallel bilingual corpus according to the first source document corpus and the first target document corpus;
and a first document machine translation model training module, configured to train the document machine translation model by taking the parallel bilingual corpus as a training sample.
10. An electronic device, characterized in that the device comprises:
one or more processors;
storage means for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.
11. A computer storage medium on which a computer program is stored, which program, when being executed by a processor, carries out the method according to any one of claims 1-8.
CN202010105061.XA 2020-02-20 2020-02-20 Translation model training method and device, electronic equipment and storage medium Active CN111339789B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010105061.XA CN111339789B (en) 2020-02-20 2020-02-20 Translation model training method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010105061.XA CN111339789B (en) 2020-02-20 2020-02-20 Translation model training method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111339789A 2020-06-26
CN111339789B CN111339789B (en) 2023-08-01

Family

ID=71183560

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010105061.XA Active CN111339789B (en) 2020-02-20 2020-02-20 Translation model training method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111339789B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140006003A1 (en) * 2005-06-17 2014-01-02 Radu Soricut Trust scoring for language translation systems
US20090326912A1 (en) * 2006-08-18 2009-12-31 Nicola Ueffing Means and a method for training a statistical machine translation system
CN102270242A (en) * 2011-08-16 2011-12-07 上海交通大学出版社有限公司 Computer-aided corpus extraction method
US20160124944A1 (en) * 2014-11-04 2016-05-05 Xerox Corporation Predicting the quality of automatic translation of an entire document
US20160350290A1 (en) * 2015-05-25 2016-12-01 Panasonic Intellectual Property Corporation Of America Machine translation method for performing translation between languages
CN108549643A (en) * 2018-04-08 2018-09-18 北京百度网讯科技有限公司 translation processing method and device
CN109783826A (en) * 2019-01-15 2019-05-21 四川译讯信息科技有限公司 A kind of document automatic translating method
CN110263349A (en) * 2019-03-08 2019-09-20 腾讯科技(深圳)有限公司 Corpus assessment models training method, device, storage medium and computer equipment
CN110598222A (en) * 2019-09-12 2019-12-20 北京金山数字娱乐科技有限公司 Language processing method and device, and training method and device of language processing system

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114692652A (en) * 2020-12-31 2022-07-01 北京金山数字娱乐科技有限公司 Translation model training method and device, and translation method and device
CN112800780A (en) * 2021-01-26 2021-05-14 浙江香侬慧语科技有限责任公司 Multi-language machine translation method, device, storage medium and equipment
WO2022166267A1 (en) * 2021-02-07 2022-08-11 语联网(武汉)信息技术有限公司 Machine translation post-editing method and system
CN114757212A (en) * 2022-03-30 2022-07-15 北京金山数字娱乐科技有限公司 Translation model training method and device, electronic equipment and medium
CN115549742A (en) * 2022-09-01 2022-12-30 浙江大学 CSI compression feedback method based on deep learning
CN115549742B (en) * 2022-09-01 2024-06-07 浙江大学 CSI compression feedback method based on deep learning

Also Published As

Publication number Publication date
CN111339789B (en) 2023-08-01

Similar Documents

Publication Publication Date Title
CN111339789B (en) Translation model training method and device, electronic equipment and storage medium
US11775761B2 (en) Method and apparatus for mining entity focus in text
CN110969012B (en) Text error correction method and device, storage medium and electronic equipment
CN111008533B (en) Method, device, equipment and storage medium for obtaining translation model
CN111368559A (en) Voice translation method and device, electronic equipment and storage medium
CN111382261B (en) Abstract generation method and device, electronic equipment and storage medium
CN111046677B (en) Method, device, equipment and storage medium for obtaining translation model
CN111563390B (en) Text generation method and device and electronic equipment
CN109933217B (en) Method and device for pushing sentences
CN111368560A (en) Text translation method and device, electronic equipment and storage medium
CN112417902A (en) Text translation method, device, equipment and storage medium
CN111597825B (en) Voice translation method and device, readable medium and electronic equipment
CN112380876B (en) Translation method, device, equipment and medium based on multilingual machine translation model
CN113139391A (en) Translation model training method, device, equipment and storage medium
CN112270200A (en) Text information translation method and device, electronic equipment and storage medium
CN111400454A (en) Abstract generation method and device, electronic equipment and storage medium
CN111104796A (en) Method and device for translation
CN112257459B (en) Language translation model training method, translation method, device and electronic equipment
CN115967833A (en) Video generation method, device and equipment meter storage medium
WO2023011260A1 (en) Translation processing method and apparatus, device and medium
CN111259676A (en) Translation model training method and device, electronic equipment and storage medium
WO2022121859A1 (en) Spoken language information processing method and apparatus, and electronic device
CN111737572B (en) Search statement generation method and device and electronic equipment
CN116821327A (en) Text data processing method, apparatus, device, readable storage medium and product
CN111581455B (en) Text generation model generation method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant