CN112287696B - Post-translation editing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112287696B
CN112287696B (application CN202011186869.1A)
Authority
CN
China
Prior art keywords
text, translation, sample, post, editing
Legal status
Active
Application number
CN202011186869.1A
Other languages
Chinese (zh)
Other versions
CN112287696A (en)
Inventor
张睦
Current Assignee
Iol Wuhan Information Technology Co ltd
Original Assignee
Iol Wuhan Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Iol Wuhan Information Technology Co ltd filed Critical Iol Wuhan Information Technology Co ltd
Priority to CN202011186869.1A priority Critical patent/CN112287696B/en
Publication of CN112287696A publication Critical patent/CN112287696A/en
Priority to PCT/CN2021/078814 priority patent/WO2022088570A1/en
Application granted granted Critical
Publication of CN112287696B publication Critical patent/CN112287696B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language
    • G06F40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/166: Editing, e.g. inserting or deleting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the invention provide a post-translation editing method and apparatus. The method includes: determining a machine translation to be edited; and inputting the machine translation and its corresponding original text into a post-editing model to obtain the post-edited translation output by the model. The post-editing model is obtained by fine-tuning a pre-trained post-editing model on sample fine-tuning original texts, their sample post-edited translations, and sample machine translations of the sample fine-tuning original texts. The pre-trained post-editing model is trained on sample pre-training original texts, their sample post-edited translations, and simulated translations of the sample pre-training original texts. Through pre-training followed by fine-tuning, and through error simulation that synthesizes translation data, the method and apparatus improve the training efficiency and training effect of the post-editing model and the accuracy of post-editing.

Description

Post-translation editing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method and apparatus for post-translation editing, an electronic device, and a storage medium.
Background
Post-editing refers to the process of, given an original text to be translated, retrieving the corresponding machine translation result and having a translator modify and polish it, thereby improving translation quality. The machine translation result gives the translator a reference, so that the translator does not have to translate from scratch, reducing the translator's workload.
In practice, when the machine translation result differs greatly from the expected translation, post-editing forces the translator to make many modifications and edits, which actually increases the workload. For example, a machine translation model performs poorly on texts to be translated that come from low-resource, specialized domains, producing results far from the correct translation. It is also prone to mistranslating entity words such as person names, place names, organization names, or number words. And when it cannot reasonably handle long sentences, the accuracy of the machine translation result again suffers and a large amount of post-editing work is required. Automatic post-editing models therefore play an increasingly important role in computer-assisted translation. Given the original text to be translated and its machine translation, a post-editing model automatically post-edits the machine translation, correcting its errors and outputting a post-edited translation; by narrowing the gap between the output and the translation the translator expects, it further reduces the translator's workload.
However, existing methods for training post-editing models require a large number of parallel triples, each consisting of an original text, its machine translation, and its post-edited translation. Such triple training data is hard to obtain and requires substantial manual annotation, so the training effect and training efficiency of post-editing models are poor, and in turn the accuracy of post-editing suffers.
Disclosure of Invention
Embodiments of the invention provide a post-translation editing method and apparatus, an electronic device, and a storage medium, to overcome the poor training effect, low training efficiency, and poor post-editing accuracy of post-editing models in the prior art.
The embodiment of the invention provides a method for editing a translated text, which comprises the following steps:
determining a machine translation text to be edited;
inputting the machine translation text and the corresponding original text into a post-editing model to obtain a post-editing translation text output by the post-editing model;
the post-editing model is obtained by fine-tuning a pre-trained post-editing model based on a sample fine-tuning original text, its sample post-edited translation, and a sample machine translation of the sample fine-tuning original text;
the pre-trained post-editing model is trained based on a sample pre-training original text, its sample post-edited translation, and a simulated translation of the sample pre-training original text.
According to the post-translation editing method of an embodiment of the invention, the sample machine translation corresponds to at least one of the following error types: long-sentence translation errors, entity-name translation errors, and domain translation errors.
According to a post-translation editing method of one embodiment of the present invention, the sample machine-translated translation text is determined based on at least one of the following:
translating the sample fine-tuning original text with a first machine translation model to obtain a sample machine translation of the long-sentence translation error type, where the first machine translation model is trained on first sample original texts and their first sample translations, the sample fine-tuning original texts are long sentences, and the first sample original texts are short sentences;
randomly modifying entity names in the sample post-edited translation used for fine-tuning, to obtain a sample machine translation of the entity-name translation error type;
translating the sample fine-tuning original text with a second machine translation model to obtain a sample machine translation of the domain translation error type, where the second machine translation model is trained on second sample original texts and their second sample translations, which belong to a domain different from that of the sample fine-tuning original texts.
According to a post-translation editing method of one embodiment of the invention, the pre-trained post-editing model includes a pre-trained original-language encoder, a pre-trained translation-language encoder, and a decoder.
According to the post-translation editing method of an embodiment of the invention, the pre-trained original-language encoder and the pre-trained translation-language encoder are trained on sample monolingual texts of the corresponding language and on sample error texts obtained by applying conventional error simulation to those monolingual texts.
According to the post-translation editing method of one embodiment of the invention, the simulated translation text is determined based on the following steps:
performing conventional error simulation on the sample pre-training original text or on the sample pre-training post-edited translation, to obtain the simulated translation.
According to one embodiment of the invention, the method for performing conventional error simulation specifically comprises the following steps:
randomly selecting several text fragments in the corresponding text and deleting, reordering, replacing, or moving them.
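The fragment operations just described can be sketched in a few lines of Python. This is only an illustrative reading of "conventional error simulation": the fragment count, fragment length, and the use of a repeat operation as one form of replacement are assumptions, not specified in the text.

```python
import random

def simulate_errors(tokens, num_fragments=2, max_len=3, rng=None):
    """Corrupt a token sequence by randomly deleting, reordering,
    repeating, or moving short fragments (a sketch of the patent's
    'conventional error simulation'; the parameters are assumptions)."""
    rng = rng or random.Random()
    tokens = list(tokens)
    for _ in range(num_fragments):
        if len(tokens) < 2:
            break
        start = rng.randrange(len(tokens))
        end = min(start + rng.randint(1, max_len), len(tokens))
        frag = tokens[start:end]
        op = rng.choice(["delete", "shuffle", "repeat", "move"])
        if op == "delete":            # simulates omitted words
            tokens[start:end] = []
        elif op == "shuffle":         # simulates reversed word order
            rng.shuffle(frag)
            tokens[start:end] = frag
        elif op == "repeat":          # simulates word repetition
            tokens[start:end] = frag + frag
        elif op == "move":            # transfer the fragment elsewhere
            tokens[start:end] = []
            pos = rng.randrange(len(tokens) + 1)
            tokens[pos:pos] = frag
    return tokens
```

Seeding the random generator makes the corruption reproducible, which is useful when regenerating the same synthetic corpus across training runs.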
An embodiment of the invention further provides a post-translation editing apparatus, which includes:
a translation determining unit for determining a machine translation text to be edited;
the post-editing unit is used for inputting the machine translation text and the corresponding original text into a post-editing model to obtain a post-editing translation text output by the post-editing model;
the post-editing model is obtained by fine-tuning a pre-trained post-editing model based on a sample fine-tuning original text, its sample post-edited translation, and a sample machine translation of the sample fine-tuning original text;
the pre-trained post-editing model is trained based on a sample pre-training original text, its sample post-edited translation, and a simulated translation of the sample pre-training original text.
An embodiment of the invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the steps of any of the post-translation editing methods described above.
The embodiments of the present invention also provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the post-translation editing methods described above.
According to the post-translation editing method and apparatus, the electronic device, and the storage medium provided by the embodiments of the invention, the pre-trained post-editing model is trained on sample pre-training original texts, their sample post-edited translations, and simulated translations of the pre-training original texts; the post-editing model is then obtained by fine-tuning the pre-trained model on sample fine-tuning original texts, their sample post-edited translations, and sample machine translations of those original texts. Through pre-training followed by fine-tuning, and through error simulation that synthesizes translation data, the training efficiency and training effect of the post-editing model are improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a post-translation editing method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a training method for a post-translation editing model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a post-translation editing apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Post-editing refers to the process of, given an original text to be translated, retrieving the corresponding machine translation result and having a translator modify and polish it, thereby improving translation quality. The machine translation result gives the translator a reference, so that the translator does not have to translate from scratch, reducing the translator's workload. However, when the machine translation result is far from the expected translation, post-editing forces the translator to make many modifications and edits, which increases the workload. For example, when a machine translation model processes texts from low-resource, specialized domains, when it mistranslates entity words such as person names, place names, or organization names, or when it cannot reasonably handle long sentences, the translation quality is poor, the result is far from the correct translation, and a large amount of post-editing work is required. Automatic post-editing models therefore play an increasingly important role in computer-assisted translation.
However, existing methods for training post-editing models require a large number of parallel triples, each consisting of an original text, its machine translation, and its post-edited translation. Such triple training data is hard to obtain and requires substantial manual annotation, so the training effect and training efficiency of post-editing models are poor, and in turn the accuracy of post-editing suffers.
In this regard, the embodiment of the invention provides a post-translation editing method. Fig. 1 is a flow chart of a post-translation editing method according to an embodiment of the present invention, as shown in fig. 1, the method includes:
step 110, determining a machine translation text to be edited;
step 120, inputting the machine translation text and the corresponding original text into the post-editing model to obtain the post-editing translation text output by the post-editing model;
the post-editing model is obtained by fine-tuning the pre-trained post-editing model based on the sample fine-tuning original text, its sample post-edited translation, and the sample machine translation of the sample fine-tuning original text;
the pre-trained post-editing model is trained based on a sample pre-training original text, its sample post-edited translation, and a simulated translation of the sample pre-training original text.
Specifically, the machine translation text corresponding to the original text is obtained for automatic post-editing by the post-editing model. The machine translation text may be obtained by inputting the original text into a machine translation model for translation.
The machine translation and its corresponding original text are then input into the post-editing model, which can correct errors in the machine translation based on the semantic information of the original text and of the machine translation, yielding a corrected, post-edited translation. The post-edited translation is in the same language as the machine translation.
The post-editing model is obtained by fine-tuning the pre-trained post-editing model based on the sample fine-tuning original text, its sample post-edited translation, and the sample machine translation of the sample fine-tuning original text; the pre-trained post-editing model is trained based on the sample pre-training original text, its sample post-edited translation, and the simulated translation of the sample pre-training original text.
Here, when editing the model after training, a pre-training and fine tuning method is adopted. Fig. 2 is a flow chart of a training method for a post-translation editing model according to an embodiment of the present invention, where, as shown in fig. 2, the training method for a post-translation editing model includes:
Step 210: pre-train an initial model on the sample pre-training original texts, their sample post-edited translations, and the simulated translations of the sample pre-training original texts, to obtain the pre-trained post-editing model;
Step 220: fine-tune the pre-trained post-editing model on the sample fine-tuning original texts, their sample post-edited translations, and the sample machine translations of the sample fine-tuning original texts, to obtain the post-editing model.
First, the initial model is pre-trained on a large number of sample pre-training original texts, their sample post-edited translations, and simulated translations, yielding the pre-trained post-editing model. The sample pre-training original texts and their post-edited translations can be obtained by downloading public bilingual parallel corpora from the web, such as Chinese-English parallel corpora from official government documents or from the Conference on Machine Translation (WMT). Error simulation is then performed on the bilingual parallel corpus to obtain simulated translations of the sample pre-training original texts, imitating machine-translated output. Because pre-training only requires bilingual parallel corpora, with simulated translations resembling machine output synthesized through error simulation, the difficulty of obtaining training data is greatly reduced, the cost of manually producing post-edited translations is saved, the efficiency of the whole training process is improved, and the training difficulty is reduced.
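As a concrete illustration of this synthesis step, the sketch below turns bilingual (original, reference translation) pairs into original / simulated translation / reference triples by corrupting the reference. The corruption function here just deletes one random word; it is a stand-in assumption, not the patent's actual error simulator.

```python
import random

def drop_one_word(text, rng=None):
    """Minimal stand-in corruption: delete one random word."""
    rng = rng or random.Random(42)
    words = text.split()
    if len(words) > 1:
        del words[rng.randrange(len(words))]
    return " ".join(words)

def build_pretraining_triples(bitext, corrupt=drop_one_word):
    """Build (original, simulated translation, reference) triples from
    bilingual (original, reference) pairs; no machine translations or
    manual post-editing are needed for this stage."""
    return [(src, corrupt(ref), ref) for src, ref in bitext]
```

The model is then trained to map (original, simulated translation) back to the reference, which is exactly the post-editing objective.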
In addition, during pre-training, from the sample pre-training original texts, their post-edited translations, and the simulated translations, the model can learn the text errors that may appear in a translation, such as repeated words, reversed word order, and omitted words, and learn how to correct those errors according to the original text so as to produce a correct post-edited translation.
To further improve post-editing accuracy, the pre-trained post-editing model is fine-tuned on the sample fine-tuning original texts, their sample post-edited translations, and the sample machine translations of the fine-tuning original texts. The sample fine-tuning original texts and their post-edited translations can likewise be obtained from bilingual parallel corpora. Here, to improve fine-tuning accuracy, bilingual parallel corpora produced in a real translation production environment can be used, where each pair consists of an original text to be translated and a high-quality translation produced and verified by human translators; from such corpora, sample fine-tuning original texts and high-quality sample post-edited translations are obtained. The sample machine translations contain the kinds of translation errors that arise in real post-editing scenarios from the limitations of machine translation models. Fine-tuning on these triples therefore lets the post-editing model learn, beyond conventional text errors, the translation errors typical of machine translation, improving its ability to locate and correct errors in post-editing scenarios and further improving post-editing accuracy.
In addition, fine-tuning requires far less data than pre-training, which lowers the difficulty of obtaining <original text, machine translation, post-edited translation> triples, further reducing training difficulty and improving training efficiency.
In the method provided by the embodiment of the invention, the pre-trained post-editing model is trained on sample pre-training original texts, their sample post-edited translations, and simulated translations of the pre-training original texts; the post-editing model is then obtained by fine-tuning this model on sample fine-tuning original texts, their sample post-edited translations, and sample machine translations of those original texts. Through pre-training followed by fine-tuning, and through error simulation that synthesizes translation data, the training efficiency and training effect of the post-editing model are improved.
Based on the above embodiment, the sample machine translation corresponds to at least one of the following error types: long-sentence translation errors, entity-name translation errors, and domain translation errors.
Specifically, so that during fine-tuning the post-editing model can learn the translation errors that the limitations of machine translation models cause in real post-editing scenarios, sample machine translations containing such errors are obtained. Common translation errors include long-sentence translation errors, entity-name translation errors, and domain translation errors. A long-sentence translation error occurs when a machine translation model cannot reasonably handle a long sentence; an entity-name translation error occurs when the model mistranslates entity words such as person names, place names, or organization names, or number words; a domain translation error arises when the model processes an original text from a low-resource, specialized domain that differs from the domain the model is suited to. The obtained sample machine translation may therefore correspond to at least one of these three error types.
Based on any of the above embodiments, the sample machine translated text is determined based on at least one of:
translating the sample fine-tuning original text with a first machine translation model to obtain a sample machine translation of the long-sentence translation error type, where the first machine translation model is trained on first sample original texts and their first sample translations, the sample fine-tuning original texts are long sentences, and the first sample original texts are short sentences;
randomly modifying entity names in the sample post-edited translation used for fine-tuning, to obtain a sample machine translation of the entity-name translation error type;
translating the sample fine-tuning original text with a second machine translation model to obtain a sample machine translation of the domain translation error type, where the second machine translation model is trained on second sample original texts and their second sample translations, which belong to a domain different from that of the sample fine-tuning original texts.
Specifically, for long-sentence translation errors, the first machine translation model is trained on the first sample original texts and their first sample translations, and can be built on a single Transformer model. The first sample original texts and their translations can be bilingual parallel corpora downloaded from the web. Here, the first sample original texts are short, e.g. containing only one sentence. Because the first machine translation model is trained only on short sentences, it is only good at translating short sentences; when long input is fed to it, the resulting translation is prone to long-sentence translation errors. Therefore, long sentences, e.g. containing two or more sentences, are selected as sample fine-tuning original texts and input into the first machine translation model to obtain sample machine translations of the long-sentence translation error type.
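A sketch of the corpus split this paragraph describes: short originals train the deliberately weak first model, while long originals are held out to be mistranslated by it. The sentence-counting heuristic below is an assumption; the patent does not say how sentence boundaries are detected.

```python
import re

def count_sentences(text):
    # Crude heuristic: count runs of terminal punctuation
    # (Latin and CJK); treat unpunctuated text as one sentence.
    return max(1, len(re.findall(r"[.!?\u3002\uff01\uff1f]+", text)))

def split_for_long_sentence_errors(bitext):
    """Partition a parallel corpus into short-original pairs (to train
    the 'first machine translation model') and long originals (to feed
    that model so it produces long-sentence translation errors)."""
    short_pairs = [(s, t) for s, t in bitext if count_sentences(s) == 1]
    long_sources = [s for s, _ in bitext if count_sentences(s) >= 2]
    return short_pairs, long_sources
```

After training a translation model on `short_pairs`, translating each entry of `long_sources` with it yields the error-bearing sample machine translations.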
For entity-name translation errors, an entity recognition tool such as spaCy can be used to run entity recognition over the sample post-edited translations, for example over the English side of bilingual parallel corpora produced in a translation production environment. Fragments containing entities such as person names, place names, organization names, and numbers are extracted from the sample post-edited translations and randomly modified, e.g. deleted or replaced, to obtain sample machine translations of the entity-name translation error type.
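The entity-corruption step might look like the following sketch. The entity character spans are assumed to come from an NER tool such as spaCy, and the placeholder replacement string is an illustrative assumption; in practice one might instead swap in a different entity drawn from the corpus.

```python
import random

def corrupt_entities(translation, entity_spans, rng=None):
    """Randomly delete or replace recognized entity mentions (names,
    places, organizations, numbers) in a post-edited translation to
    synthesize entity-name translation errors. `entity_spans` is a list
    of (start, end) character offsets, e.g. from spaCy's Doc.ents."""
    rng = rng or random.Random()
    out = translation
    # Process spans right-to-left so earlier character offsets stay valid.
    for start, end in sorted(entity_spans, reverse=True):
        if rng.random() < 0.5:
            out = out[:start] + out[end:]            # delete the entity
        else:
            out = out[:start] + "<ENT>" + out[end:]  # replace the entity
    return out
```

Pairing the corrupted translation with its original text and the untouched post-edited translation yields one fine-tuning triple of the entity-name error type.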
For domain translation errors, the second machine translation model is trained on the second sample original texts and their second sample translations, and can likewise be built on a single Transformer model. The second sample original texts and translations belong to a domain different from that of the sample fine-tuning original texts. For example, a high-quality but narrow-domain bilingual parallel corpus of United Nations government documents can be downloaded from the web as the second sample original texts and translations. The trained second machine translation model is only good at translating text in that domain, so translating original texts from other domains with it readily produces domain translation errors; the translations it produces for the sample fine-tuning original texts can thus serve as sample machine translations of the domain translation error type.
With these different data synthesis methods, sample machine translations of the three translation error types can be generated efficiently, eliminating manual data annotation during fine-tuning and further improving the training efficiency of the post-editing model.
Based on any of the above embodiments, the pre-trained post-editing model includes a pre-trained original-language encoder, a pre-trained translation-language encoder, and a decoder.
Specifically, the pre-trained post-editing model may include two encoders, an original-language encoder and a translation-language encoder, which encode the original text and the machine translation respectively, and a decoder that decodes from the two encodings to correct the machine translation and produce the post-edited translation. The original-language encoder, the translation-language encoder, and the decoder can each be built on a single Transformer model. Here, the two encoders may themselves be obtained through pre-training, improving the pre-training efficiency of the pre-trained post-editing model and thus the overall training efficiency of the post-editing model.
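At the interface level, the dual-encoder-plus-decoder structure can be sketched as below. The Transformer internals are deliberately elided: each component is just a callable here, so this is a structural illustration, not the patent's implementation.

```python
class PostEditingModel:
    """Two encoders (one per language) feeding a shared decoder."""

    def __init__(self, original_encoder, translation_encoder, decoder):
        self.original_encoder = original_encoder        # pre-trained on source-language text
        self.translation_encoder = translation_encoder  # pre-trained on target-language text
        self.decoder = decoder                          # decodes from both encodings

    def post_edit(self, original_text, machine_translation):
        h_src = self.original_encoder(original_text)
        h_mt = self.translation_encoder(machine_translation)
        return self.decoder(h_src, h_mt)
```

With toy callables, e.g. `PostEditingModel(str.split, str.split, lambda a, b: " ".join(b))`, `post_edit` simply echoes the translation; in a real system each callable would be a Transformer stack and the decoder would generate tokens autoregressively while attending over both encodings.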
In the method provided by the embodiment of the invention, the pre-trained post-editing model is built from the pre-trained original-language encoder, the pre-trained translation-language encoder, and the decoder, further improving the overall training efficiency of the post-editing model.
Based on any of the above embodiments, the pre-trained original language encoder and the pre-trained translated language encoder are trained based on sample monolingual text of a corresponding language and sample error text obtained by performing conventional error simulation on the sample monolingual text.
Specifically, in order for the original language encoder and the translated language encoder to learn to extract correct semantic information from erroneous text, and thus to produce original-text and translated-text encodings that contain correct semantic information and have improved expressive capability, the two encoders can be trained based on sample monolingual text of the corresponding language, its corresponding sample error text, and a word vector model of the corresponding language. For example, if the original text is Chinese and the translated text is English, the original language encoder can be pre-trained based on Chinese sample monolingual text, its corresponding sample error text and a Chinese word vector model, and the translated language encoder can be pre-trained based on English sample monolingual text, its corresponding sample error text and an English word vector model. The sample monolingual text can be obtained by collecting a large amount of monolingual corpus; for example, public Chinese monolingual corpora such as the Chinese Wikipedia and news corpora, and public English corpora such as the English Wikipedia and news corpora, can be downloaded from the network. To reduce the difficulty of acquiring training data, a part of the monolingual corpus, for example 20%, can be randomly selected, and conventional error simulation can be performed on the selected monolingual corpus, i.e. the sample monolingual text, to obtain sample error text containing conventional text errors.
According to the method provided by the embodiment of the invention, the original language encoder and the translated language encoder are obtained by pre-training on sample monolingual text of the corresponding language and on sample error text obtained by performing conventional error simulation on that monolingual text, so that the encoders can produce original-text and translated-text encodings containing correct semantic information, improving the expressive capability of the encodings.
Based on any of the above embodiments, the simulated translation text is determined based on the steps of:
Performing conventional error simulation on the sample pre-training original text or on the sample pre-training post-editing translated text to obtain the simulated translated text.
Specifically, a part of the bilingual parallel corpus, for example 10%, can be randomly selected, and conventional error simulation can be performed on the sample pre-training original text of each selected corpus entry to obtain a simulated translated text containing conventional text errors; the sample pre-training translated text, the simulated translated text and the sample pre-training original text of the entry are then used as one piece of training data for the pre-trained post-editing model. Similarly, another part of the bilingual parallel corpus, for example 10%, can be randomly selected, and conventional error simulation can be performed on the sample pre-training post-editing translated text to obtain a simulated translated text containing conventional text errors; the sample pre-training original text, the simulated translated text and the sample pre-training post-editing translated text of the entry are then used as one piece of training data for the pre-trained post-editing model.
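The two sampling-and-corruption passes described above can be sketched as follows. This is a minimal illustration under stated assumptions: corpora are whitespace-tokenized token lists, and `corrupt` is a stand-in for the conventional error simulation that simply drops one random token.

```python
import random

def corrupt(tokens, rng):
    # Stand-in for conventional error simulation: drop one random token.
    if len(tokens) < 2:
        return list(tokens)
    out = list(tokens)
    del out[rng.randrange(len(out))]
    return out

def build_pretraining_triples(bitext, fraction=0.1, seed=0):
    """bitext: list of (original_tokens, translated_tokens) pairs.

    Returns (input_text, simulated_translation, target_text) triples of
    the two kinds described above: one pass corrupts the original side,
    the other corrupts the translated side."""
    rng = random.Random(seed)
    k = max(1, int(len(bitext) * fraction))
    triples = []
    # Pass 1: corrupt the original text; the triple is
    # (translated text, corrupted original, original text).
    for src, tgt in rng.sample(bitext, k):
        triples.append((tgt, corrupt(src, rng), src))
    # Pass 2: corrupt the translated text; the triple is
    # (original text, corrupted translation, translated text).
    for src, tgt in rng.sample(bitext, k):
        triples.append((src, corrupt(tgt, rng), tgt))
    return triples
```

Each resulting triple has the shape expected by the pre-trained post-editing model: an input text, a "machine translation" to correct, and the reference target.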
Based on any of the above embodiments, performing conventional error simulation specifically includes:
Randomly selecting a plurality of text fragments in the corresponding text, and performing a deleting, rearranging, replacing or transferring operation on the text fragments.
Specifically, conventional text errors include missing words, disordered words, wrong words, repetition and the like, so when conventional error simulation is performed, a plurality of text fragments in the corresponding text can be randomly selected, and a deleting, rearranging, replacing or transferring operation performed on each fragment. Here, deleting means removing the text fragment as a whole; rearranging means reversing the order of the words within the fragment; replacing means overwriting the fragment with a fragment from another position in the text; and transferring means exchanging the fragment with a fragment at another position. For example, conventional error simulation may be performed in the manner set forth in the following table:
Original       <zh> The weather today is really good.
Deletion       <zh> The weather today DEL good.        (a fragment is deleted, leaving a DEL mark)
Rearrangement  <zh> The is today weather really good.  (word order within a fragment is reversed)
Replacement    <zh> The weather today is the weather.  (a fragment is overwritten by a copy of another fragment)
Transfer       <zh> Really good today is the weather.  (two fragments exchange positions)
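The four operations in the table can be sketched as follows. This is a minimal illustration over token lists; the segment lengths and the choice of operation are randomized, since the embodiment leaves those details open.

```python
import random

def simulate_errors(tokens, n_ops=1, max_len=3, seed=None):
    """Apply the conventional error operations (delete, rearrange,
    replace, transfer) to randomly chosen segments of a token list."""
    rng = random.Random(seed)
    out = list(tokens)
    for _ in range(n_ops):
        if len(out) < 2:
            break
        length = rng.randint(1, min(max_len, len(out) // 2))
        i = rng.randrange(len(out) - length + 1)
        op = rng.choice(["delete", "rearrange", "replace", "transfer"])
        if op == "delete":
            # Remove the segment entirely.
            del out[i:i + length]
        elif op == "rearrange":
            # Reverse the word order inside the segment.
            out[i:i + length] = reversed(out[i:i + length])
        else:
            # Pick a second, non-overlapping segment of the same length.
            choices = [j for j in range(len(out) - length + 1)
                       if j + length <= i or j >= i + length]
            if not choices:
                continue
            j = rng.choice(choices)
            if op == "replace":
                # Overwrite the segment with a copy of the other one.
                out[i:i + length] = out[j:j + length]
            else:
                # Transfer: swap the two segments.
                out[i:i + length], out[j:j + length] = \
                    out[j:j + length], out[i:i + length]
    return out
```

Only deletion changes the sequence length; the other three operations rearrange existing tokens, which matches the error types the table illustrates.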
Based on any one of the above embodiments, a further embodiment of the present invention provides a post-editing model building method. The method comprises the following steps:
First, collecting corpus data required for model training, including:
Accumulate the bilingual parallel corpus generated in the translation production environment, denoted bilingual parallel corpus C. Each corpus entry comprises an original text to be translated and a high-quality translated text produced after human translation and review.
Download common bilingual parallel corpora, such as the United Nations and WMT bilingual parallel corpora, from the network, denoted bilingual parallel corpus T.
Download a common original-language monolingual corpus, such as the Chinese Wikipedia and news corpora, from the network, denoted monolingual corpus Z.
Download a common translated-language monolingual corpus, such as the English Wikipedia and news corpora, from the network, denoted monolingual corpus E.
Perform word segmentation on all the corpora. For English corpora, segmentation can be done on whitespace; for Chinese corpora, segmentation can be performed by grammar rules at the character level, i.e. each individual Chinese character, each continuous run of digits or English letters, and each punctuation mark is treated as a separate token. A language identifier (e.g. "<zh>" for a Chinese corpus) is then added at the beginning of each corpus entry.
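The segmentation rule and language identifier described above can be sketched as follows. This is a minimal illustration; the "<en>" tag for English is an assumption by analogy with the "<zh>" tag shown earlier.

```python
import re

# Token pattern implementing the character-level rule: a run of digits,
# a run of ASCII letters, a single CJK character, or any other
# non-whitespace symbol (e.g. punctuation) each become one token.
TOKEN_RE = re.compile(r"[0-9]+|[A-Za-z]+|[\u4e00-\u9fff]|[^\s]")

def segment(text, lang):
    """Segment one corpus line and prepend its language identifier."""
    if lang == "en":
        tokens = text.split()              # English: split on whitespace
    else:
        tokens = TOKEN_RE.findall(text)    # Chinese: character-level rule
    return ["<%s>" % lang] + tokens
```

For example, a mixed line such as "GDP增长7%" keeps "GDP" and "7" as whole tokens while each Chinese character becomes its own token.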
Based on the segmented corpus data, word vectors are trained for the original language and the translated language using the Skip-Gram algorithm. The dimension of the word vectors may be set to 300 and the context window to 5.
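In practice the Skip-Gram training itself would typically be delegated to a library (for example gensim's `Word2Vec` with `sg=1`, `vector_size=300`, `window=5` — an assumed choice, not named in the text); the stdlib sketch below only illustrates the (center, context) pair extraction that Skip-Gram performs with a context window.

```python
def skipgram_pairs(tokens, window=5):
    """Generate (center, context) training pairs as used by Skip-Gram:
    each token predicts the tokens within `window` positions of it."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```

Each pair then serves as one training example for the model that learns the 300-dimensional vectors.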
Randomly extract 20% of the corpus from Z, perform conventional error simulation on it, synthesize a parallel corpus pairing the possibly-corrupted corpus with the original corpus, and pre-train the original language encoder of a standard Transformer model in combination with the word vector model of the original language.
Randomly extract 20% of the corpus from E, perform conventional error simulation on it, synthesize a parallel corpus pairing the possibly-corrupted corpus with the original corpus, and pre-train the translated language encoder of a standard Transformer model in combination with the word vector model of the translated language.
Randomly extract 10% of the corpus from T and perform conventional error simulation on the original-text corpus in it, generating ternary corpora of the form (possibly-corrupted original corpus, original translation corpus, initial original corpus). Similarly, randomly extract another 10% from T and perform conventional error simulation on the translation corpus in it, generating ternary corpora of the form (initial original corpus, possibly-corrupted translation corpus, original translation corpus). The synthesized ternary parallel corpora are used to pre-train a dual-Transformer-encoder to single-Transformer-decoder model, yielding the pre-trained post-editing model. The dual Transformer encoders are the original language encoder and the translated language encoder.
Subsequently, training data acquisition for the fine tuning task is performed, including:
a) Use a rule-based Chinese sentence-breaking method to split the original-text corpus in C into sentences, and screen out the bilingual parallel corpora whose original text contains two or more sentences, forming a subset C1. Similarly, split the original-text corpus in T into sentences and screen out the bilingual parallel corpora whose original text contains exactly one sentence, forming another subset T1. Using corpus T1, construct a machine translation engine based on a Transformer model. Then input the C1 original-text corpus into the model for decoding to generate machine translations, yielding triples (C1 original text, machine translation, C1 translation).
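Step a) can be sketched as follows. This is a minimal illustration: the sentence-breaking rule is reduced to splitting on Chinese and Latin end punctuation, and the helper is applied separately to C (keeping the multi-sentence side as C1) and to T (keeping the single-sentence side as T1).

```python
import re

# Simplified rule-based sentence breaking: a sentence ends at Chinese or
# Latin end punctuation; trailing text without punctuation also counts.
SENT_RE = re.compile(r"[^。！？!?]*[。！？!?]|[^。！？!?]+$")

def split_sentences(text):
    return [s for s in SENT_RE.findall(text) if s]

def partition_by_sentence_count(corpus):
    """corpus: list of (original, translation) pairs. Returns (multi,
    single): pairs whose original has >= 2 sentences (the C1-style
    subset) and pairs whose original has exactly 1 (the T1-style
    subset)."""
    multi, single = [], []
    for src, tgt in corpus:
        (multi if len(split_sentences(src)) >= 2 else single).append((src, tgt))
    return multi, single
```

The single-sentence subset trains the short-sentence MT engine, whose output on the multi-sentence originals then supplies the long-sentence-error samples.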
b) Use the spaCy tool to perform entity recognition on the translation corpus in C, and screen out the bilingual parallel corpora C2 containing entities such as person names, place names, organization names and numbers. Randomly modify the entity nouns in the C2 translation corpus, e.g. by deletion or replacement, yielding triples (C2 original text, translation with corrupted entity nouns, C2 translation).
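Step b) can be sketched as follows. This is a minimal illustration under an assumption: the entity mentions are supplied as token index ranges (as an NER tool such as spaCy would produce) rather than detected here.

```python
import random

def corrupt_entities(tokens, entity_spans, seed=None):
    """Randomly delete or replace entity mentions in a translated
    sentence. entity_spans are (start, end) token index ranges."""
    rng = random.Random(seed)
    out = list(tokens)
    # Edit right-to-left so earlier span indices stay valid.
    for start, end in sorted(entity_spans, reverse=True):
        others = [s for s in entity_spans if s != (start, end)]
        if rng.random() < 0.5 or not others:
            del out[start:end]                 # delete the entity
        else:
            a, b = rng.choice(others)          # replace it with another
            out[start:end] = tokens[a:b]       # entity from the sentence
    return out
```

The corrupted sentence is paired with the original text and the intact reference translation to form the entity-name-error triple.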
c) Screen the United Nations bilingual parallel corpus out of T and construct a machine translation engine based on a Transformer model. Extract a subset C3 from C, input the C3 original-text corpus into the model for decoding to generate machine translations, and obtain triples (C3 original text, machine translation, C3 translation).
Combine the triples generated in a), b) and c) to form the total fine-tuning training data, and fine-tune the pre-trained post-editing model with it to obtain the final post-editing model.
The post-translation editing device provided by the embodiment of the invention is described below, and the post-translation editing device described below and the post-translation editing method described above can be referred to correspondingly.
Based on any of the above embodiments, fig. 3 is a schematic structural diagram of a post-translation editing device according to an embodiment of the present invention, as shown in fig. 3, where the device includes: a translation determination unit 310 and a post-editing unit 320.
Wherein the translation determining unit 310 is configured to determine a machine translation text to be edited;
the post-editing unit 320 is configured to input the machine translated text and the corresponding original text into the post-editing model, so as to obtain a post-edited translated text output by the post-editing model;
the post-editing model is obtained by fine-tuning the pre-trained post-editing model based on the sample fine-tuning original text and its sample fine-tuned post-editing translated text, and the sample machine translated text of the sample fine-tuning original text;
The pre-trained post-editing model is obtained by training based on the sample pre-training original text and its sample pre-training post-editing translated text, and a simulated translated text of the sample pre-training original text.
According to the device provided by the embodiment of the invention, the pre-trained post-editing model is obtained by training on a sample pre-training original text, its sample pre-training post-editing translated text, and a simulated translated text of the sample pre-training original text; the post-editing model is then obtained by fine-tuning the pre-trained post-editing model on a sample fine-tuning original text, its sample fine-tuned post-editing translated text, and a sample machine translated text of the sample fine-tuning original text. Through this pre-training plus fine-tuning scheme, together with the synthesis of translation data by error simulation, the training efficiency and training effect of the post-editing model are improved, and the accuracy of post-editing is improved.
Based on any of the above embodiments, the sample machine translated text corresponds to at least one error type of long sentence translation error, entity name translation error, and domain translation error.
Based on any of the above embodiments, the sample machine translated text is determined based on at least one of:
Translating the sample fine-tuning original text by using a first machine translation model to obtain a sample machine translation text of the long-sentence translation error type; the first machine translation model is trained based on a first sample translation original text and its first sample translation translated text, where the sample fine-tuning original text is a long sentence and the first sample translation original text is a short sentence;
randomly modifying entity names in the sample fine-tuned post-editing translated text to obtain a sample machine translation text of the entity-name translation error type;
translating the sample fine-tuning original text by applying a second machine translation model to obtain a sample machine translation text of the domain translation error type; the second machine translation model is trained based on a second sample translation original text and its second sample translation translated text, the second sample translation original text being in a different domain from the sample fine-tuning original text.
According to the device provided by the embodiment of the invention, through different data synthesis modes, sample machine translation text corresponding to three different translation error types can be generated efficiently, a data labeling process in a fine tuning process is omitted, and the training efficiency of a post-editing model can be further improved.
Based on any of the above embodiments, the pre-trained post-editing model includes a pre-trained original language encoder and a pre-trained translated language encoder, and a decoder.
The device provided by the embodiment of the invention constructs the pre-trained post-editing model through the pre-trained original language encoder, the pre-trained translated language encoder and the decoder together, so that the overall training efficiency of the post-editing model is further improved.
Based on any of the above embodiments, the pre-trained original language encoder and the pre-trained translated language encoder are trained based on sample monolingual text of a corresponding language and sample error text obtained by performing conventional error simulation on the sample monolingual text.
According to the device provided by the embodiment of the invention, the original language encoder and the translated language encoder are obtained by pre-training on sample monolingual text of the corresponding language and on sample error text obtained by performing conventional error simulation on that monolingual text, so that the encoders can produce original-text and translated-text encodings containing correct semantic information, improving the expressive capability of the encodings.
Based on any of the above embodiments, the simulated translation text is determined based on the steps of:
Performing conventional error simulation on the sample pre-training original text or on the sample pre-training post-editing translated text to obtain the simulated translated text.
Based on any of the above embodiments, the apparatus further comprises a conventional error simulation unit for:
Randomly selecting a plurality of text fragments in the corresponding text, and performing a deleting, rearranging, replacing or transferring operation on the text fragments.
Fig. 4 illustrates a physical schematic diagram of an electronic device, as shown in fig. 4, which may include: processor 410, communication interface (Communications Interface) 420, memory 430 and communication bus 440, wherein processor 410, communication interface 420 and memory 430 communicate with each other via communication bus 440. The processor 410 may invoke logic instructions in the memory 430 to perform a post-translation editing method comprising: determining a machine translation text to be edited; inputting the machine translation text and the corresponding original text into a post-editing model to obtain a post-editing translation text output by the post-editing model; the post-editing model is obtained by fine-tuning a pre-trained editing model based on a sample fine-tuning original text and a sample fine-tuned editing translated text thereof and a sample machine translated text of the sample fine-tuning original text; the pre-training post-editing model is obtained based on a sample pre-training original text and a sample pre-training post-editing translated text thereof, and a simulated translated text training of the sample pre-training original text.
Further, the logic instructions in the memory 430 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, embodiments of the present invention further provide a computer program product, including a computer program stored on a non-transitory computer readable storage medium, the computer program including program instructions which, when executed by a computer, enable the computer to perform the post-translation editing method provided in the above method embodiments, the method including: determining a machine translation text to be edited; inputting the machine translation text and the corresponding original text into a post-editing model to obtain a post-editing translation text output by the post-editing model; the post-editing model is obtained by fine-tuning a pre-trained editing model based on a sample fine-tuning original text and a sample fine-tuned editing translated text thereof and a sample machine translated text of the sample fine-tuning original text; the pre-training post-editing model is obtained based on a sample pre-training original text and a sample pre-training post-editing translated text thereof, and a simulated translated text training of the sample pre-training original text.
In still another aspect, an embodiment of the present invention further provides a non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, is implemented to perform the post-translation editing method provided in the above embodiments, the method including: determining a machine translation text to be edited; inputting the machine translation text and the corresponding original text into a post-editing model to obtain a post-editing translation text output by the post-editing model; the post-editing model is obtained by fine-tuning a pre-trained editing model based on a sample fine-tuning original text and a sample fine-tuned editing translated text thereof and a sample machine translated text of the sample fine-tuning original text; the pre-training post-editing model is obtained based on a sample pre-training original text and a sample pre-training post-editing translated text thereof, and a simulated translated text training of the sample pre-training original text.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A post-translation editing method, comprising:
determining a machine translation text to be edited;
inputting the machine translation text and the corresponding original text into a post-editing model to obtain a post-editing translation text output by the post-editing model;
the post-editing model is obtained by fine-tuning a pre-trained post-editing model based on a sample fine-tuning original text and a sample fine-tuned post-editing translated text thereof, and a sample machine translated text of the sample fine-tuning original text;
the pre-trained post-editing model is obtained by training based on a sample pre-training original text and a sample pre-training post-editing translated text thereof, and a simulated translated text of the sample pre-training original text;
the simulated translation text is determined based on the steps of:
performing conventional error simulation on the sample pre-training original text or on the sample pre-training post-editing translated text to obtain the simulated translated text.
2. The post-translation editing method according to claim 1, wherein the sample machine-translated text corresponds to at least one error type of long sentence translation error, entity name translation error, and domain translation error.
3. The post-translation editing method according to claim 2, wherein the sample machine-translated translation text is determined based on at least one of:
translating the sample fine-tuning original text by applying a first machine translation model to obtain a sample machine translation text of the long-sentence translation error type; the first machine translation model is trained based on a first sample translation original text and its first sample translation translated text, where the sample fine-tuning original text is a long sentence and the first sample translation original text is a short sentence;
randomly modifying the entity names in the sample fine-tuned post-editing translated text to obtain a sample machine translation text of the entity-name translation error type;
translating the sample fine-tuning original text by applying a second machine translation model to obtain a sample machine translation text of the domain translation error type; the second machine translation model is trained based on a second sample translation original text and its second sample translation translated text, the second sample translation original text being in a different domain from the sample fine-tuning original text.
4. The post-translation editing method according to claim 1, wherein the pre-trained post-editing model comprises a pre-trained original language encoder and a pre-trained translated language encoder, and a decoder.
5. The post-translation editing method according to claim 4, wherein the pre-trained original language encoder and the pre-trained translated language encoder are obtained by training based on sample monolingual text of a corresponding language and sample error text obtained by performing conventional error simulation on the sample monolingual text.
6. The post-translation editing method according to claim 1 or 5, wherein said performing conventional error simulation specifically comprises:
randomly selecting a plurality of text fragments in the corresponding text, and performing a deleting, rearranging, replacing or transferring operation on the text fragments.
7. A post-translation editing device, comprising:
a translation determining unit for determining a machine translation text to be edited;
the post-editing unit is used for inputting the machine translation text and the corresponding original text into a post-editing model to obtain a post-editing translation text output by the post-editing model;
the post-editing model is obtained by fine-tuning a pre-trained post-editing model based on a sample fine-tuning original text and a sample fine-tuned post-editing translated text thereof, and a sample machine translated text of the sample fine-tuning original text;
the pre-trained post-editing model is obtained by training based on a sample pre-training original text and a sample pre-training post-editing translated text thereof, and a simulated translated text of the sample pre-training original text;
the simulated translation text is determined based on the steps of:
and performing conventional error simulation on the sample pre-training original text or the sample pre-training post-editing translated text to obtain the simulated translated text.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor performs the steps of the post-translation editing method according to any one of claims 1 to 6 when the program is executed.
9. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the post-translation editing method according to any of claims 1 to 6.
CN202011186869.1A 2020-10-29 2020-10-29 Post-translation editing method and device, electronic equipment and storage medium Active CN112287696B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011186869.1A CN112287696B (en) 2020-10-29 2020-10-29 Post-translation editing method and device, electronic equipment and storage medium
PCT/CN2021/078814 WO2022088570A1 (en) 2020-10-29 2021-03-03 Method and apparatus for post-editing of translation, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011186869.1A CN112287696B (en) 2020-10-29 2020-10-29 Post-translation editing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112287696A CN112287696A (en) 2021-01-29
CN112287696B true CN112287696B (en) 2024-02-23

Family

ID=74352729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011186869.1A Active CN112287696B (en) 2020-10-29 2020-10-29 Post-translation editing method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112287696B (en)
WO (1) WO2022088570A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287696B (en) * 2020-10-29 2024-02-23 语联网(武汉)信息技术有限公司 Post-translation editing method and device, electronic equipment and storage medium
CN112836528B (en) * 2021-02-07 2023-10-03 语联网(武汉)信息技术有限公司 Machine post-translation editing method and system
CN114091483B (en) * 2021-10-27 2023-02-28 北京百度网讯科技有限公司 Translation processing method and device, electronic equipment and storage medium
CN116956946A (en) * 2023-07-14 2023-10-27 上海一者信息科技有限公司 Machine translation text fine granularity error type identification and positioning method
CN117273027B (en) * 2023-11-22 2024-04-30 四川语言桥信息技术有限公司 Automatic machine translation post-verification method based on translation error correction

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670191A (en) * 2019-01-24 2019-04-23 语联网(武汉)信息技术有限公司 Calibration optimization method, device and the electronic equipment of machine translation
CN111144137A (en) * 2019-12-17 2020-05-12 语联网(武汉)信息技术有限公司 Method and device for generating edited model corpus after machine translation
CN111382580A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 Encoder-decoder framework pre-training method for neural machine translation
CN111597778A (en) * 2020-04-15 2020-08-28 哈尔滨工业大学 Method and system for automatically optimizing machine translation based on self-supervision

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6471074B2 (en) * 2015-09-30 2019-02-13 株式会社東芝 Machine translation apparatus, method and program
CN105740218A (en) * 2015-12-31 2016-07-06 成都数联铭品科技有限公司 Post-editing processing method for mechanical translation
US10558762B2 (en) * 2018-02-24 2020-02-11 International Business Machines Corporation System and method for adaptive quality estimation for machine translation post-editing
CN112287696B (en) * 2020-10-29 2024-02-23 语联网(武汉)信息技术有限公司 Post-translation editing method and device, electronic equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670191A (en) * 2019-01-24 2019-04-23 语联网(武汉)信息技术有限公司 Calibration optimization method, device and the electronic equipment of machine translation
CN111144137A (en) * 2019-12-17 2020-05-12 语联网(武汉)信息技术有限公司 Method and device for generating edited model corpus after machine translation
CN111382580A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 Encoder-decoder framework pre-training method for neural machine translation
CN111597778A (en) * 2020-04-15 2020-08-28 哈尔滨工业大学 Method and system for automatically optimizing machine translation based on self-supervision

Also Published As

Publication number Publication date
CN112287696A (en) 2021-01-29
WO2022088570A1 (en) 2022-05-05

Similar Documents

Publication Publication Date Title
CN112287696B (en) Post-translation editing method and device, electronic equipment and storage medium
CN110852117B (en) Effective data enhancement method for improving translation effect of neural machine
WO2018010455A1 (en) Neural network-based translation method and apparatus
CN109840331B (en) Neural machine translation method based on user dictionary
CN112766000B (en) Machine translation method and system based on pre-training model
US8874433B2 (en) Syntax-based augmentation of statistical machine translation phrase tables
CN111144140B (en) Chinese-Thai bilingual corpus generation method and device based on zero-shot learning
CN112329447B (en) Training method of Chinese error correction model, Chinese error correction method and device
CN112818712B (en) Machine translation method and device based on translation memory library
CN112541365B (en) Machine translation method and device based on term replacement
Bertoldi et al. A new decoder for spoken language translation based on confusion networks
CN111144137B (en) Method and device for generating corpus for a machine translation post-editing model
CN115587590A (en) Training corpus construction method, translation model training method and translation method
Afli et al. Integrating optical character recognition and machine translation of historical documents
CN109657244B (en) English long sentence automatic segmentation method and system
Ahmadnia et al. Round-trip training approach for bilingually low-resource statistical machine translation systems
CN111178060A (en) Korean word segmentation reduction method based on language model
CN112836528B (en) Machine translation post-editing method and system
CN114861628A (en) System, method, electronic device and storage medium for training machine translation model
CN114185573A (en) Implementation and online updating system and method for human-computer interaction machine translation system
CN117034968B (en) Neural machine translation method, device, electronic equipment and medium
CN110287496A (en) Neural network-based English-Chinese word sense disambiguation method
CN116029310A (en) Automatic post-editing method and device for machine translation
CN114595703A (en) Interactive machine translation method and device, storage medium and electronic device
CN117709370A (en) Text translation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant