CN112287696A - Post-translation editing method and device, electronic equipment and storage medium - Google Patents


Publication number
CN112287696A
CN112287696A
Authority
CN
China
Prior art keywords
text
sample
translation
editing
post
Prior art date
Legal status
Granted
Application number
CN202011186869.1A
Other languages
Chinese (zh)
Other versions
CN112287696B (en)
Inventor
Zhang Mu (张睦)
Current Assignee
Iol Wuhan Information Technology Co ltd
Original Assignee
Iol Wuhan Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Iol Wuhan Information Technology Co ltd filed Critical Iol Wuhan Information Technology Co ltd
Priority to CN202011186869.1A priority Critical patent/CN112287696B/en
Publication of CN112287696A publication Critical patent/CN112287696A/en
Priority to PCT/CN2021/078814 priority patent/WO2022088570A1/en
Application granted granted Critical
Publication of CN112287696B publication Critical patent/CN112287696B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 — Handling natural language data
    • G06F40/40 — Processing or translation of natural language
    • G06F40/58 — Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06F40/10 — Text processing
    • G06F40/166 — Editing, e.g. inserting or deleting


Abstract

Embodiments of the invention provide a post-translation editing method and device. The method includes: determining a machine-translated text to be post-edited; and inputting the machine-translated text and its corresponding original text into a post-editing model to obtain the post-edited translation output by the model. The post-editing model is obtained by fine-tuning a pre-trained editing model on sample fine-tuning originals, their sample post-edited translations, and their sample machine translations; the pre-trained editing model is obtained by training on sample pre-training originals, their sample post-edited translations, and their simulated translations. By combining pre-training with fine-tuning and synthesizing translation data through error simulation, the method and device improve the training efficiency and training effect of the post-editing model as well as the accuracy of post-editing.

Description

Post-translation editing method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a post-translation editing method and device, an electronic device, and a storage medium.
Background
Post-editing means that, given an original text to be translated, the corresponding machine translation result is retrieved and a translator then revises and polishes it, improving translation quality. The machine translation result serves as a reference, sparing the translator from translating from scratch and reducing the translator's workload.
In practice, when the machine translation result differs greatly from the expected translation, post-editing may force the translator to make many revisions, which instead increases the workload. For example, a machine translation model performs poorly on low-resource original text from specialized professional domains, producing results far from a correct translation. Likewise, when the model mistranslates entity words such as person, place, or organization names, or numeric words, the accuracy of the result suffers. And when the model cannot reasonably handle the translation of long sentences, the result again falls short and requires extensive post-editing. For these reasons, automatic post-editing models play an increasingly important role in computer-assisted translation. Given the original text to be translated and its machine translation, a post-editing model can automatically post-edit the machine translation, correct translation errors, and output a post-edited translation, further narrowing the gap between the output and the translation the translator expects and thereby reducing the translator's workload.
However, existing methods for training post-editing models require a large number of <original text, machine translation, post-edited translation> triples. Such triple training data is hard to obtain and requires substantial manual annotation cost, so the resulting post-editing models train poorly and inefficiently, and the accuracy of post-editing suffers.
Disclosure of Invention
Embodiments of the invention provide a post-translation editing method and device, an electronic device, and a storage medium, to overcome the defects of the prior art: poor training effect, low training efficiency, and poor post-editing accuracy.
The embodiment of the invention provides a method for editing a translated text, which comprises the following steps:
determining a machine translation text to be edited;
inputting the machine-translated text and the corresponding original text into a post-editing model to obtain the post-edited translation output by the post-editing model;
the post-editing model is obtained by fine-tuning a pre-trained editing model based on a sample fine-tuning original text, a sample fine-tuning edited translated text of the sample fine-tuning original text, and a sample machine translated text of the sample fine-tuning original text;
the pre-training editing model is obtained by training a sample pre-training original text, a sample pre-training edited translated text and a simulated translated text of the sample pre-training original text.
According to the post-translation editing method of the invention, the sample machine-translated text corresponds to at least one of the following error types: long-sentence translation errors, entity-name translation errors, and domain translation errors.
According to the post-translation editing method of the invention, the sample machine-translated text is determined in at least one of the following ways:
translating the sample fine-tuning original text by using a first machine translation model to obtain a sample machine translation text with a long sentence translation error type; the first machine translation model is obtained by training a first sample translation original text and a first sample translation text thereof, the sample fine-tuning original text is a long sentence, and the first sample translation original text is a short sentence;
randomly modifying entity names in the sample fine-tuning post-edited translation to obtain a sample machine-translated text with the entity-name translation error type;
translating the sample fine-tuning original text by using a second machine translation model to obtain a sample machine translation text with a field translation error type; the second machine translation model is obtained by training based on a second sample translation original text which is different from the sample fine tuning original text field and a second sample translation text thereof.
According to one embodiment of the invention, the pre-trained editing model comprises a pre-trained source-language encoder, a pre-trained translation-language encoder, and a decoder.
According to the post-translation editing method of the invention, the pre-trained source-language encoder and the pre-trained translation-language encoder are each trained on a sample monolingual text of the corresponding language and a sample error text obtained by applying conventional error simulation to that monolingual text.
According to the post-translation editing method of the invention, the simulated translation is determined by the following step:
performing conventional error simulation on the sample pre-training original text or the sample pre-training post-edited translation to obtain the simulated translation.
According to the post-translation editing method of one embodiment of the invention, performing the conventional error simulation specifically includes:
randomly selecting a plurality of text segments in the corresponding text and deleting, rearranging, replacing, or moving those segments.
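The conventional error simulation described above can be sketched as follows. The segment count, maximum segment length, and the uniform choice among the four operations are illustrative assumptions; the patent does not fix concrete values.

```python
import random

def simulate_errors(tokens, n_segments=2, max_len=3, rng=None):
    """Corrupt a token list by deleting, rearranging (shuffling),
    replacing, or moving a few randomly chosen short segments.

    n_segments and max_len are illustrative placeholders.
    """
    rng = rng or random.Random()
    tokens = list(tokens)
    for _ in range(n_segments):
        if len(tokens) < 2:
            break
        start = rng.randrange(len(tokens))
        end = min(start + rng.randint(1, max_len), len(tokens))
        segment = tokens[start:end]
        op = rng.choice(["delete", "shuffle", "replace", "move"])
        if op == "delete":
            del tokens[start:end]
        elif op == "shuffle":
            rng.shuffle(segment)
            tokens[start:end] = segment
        elif op == "replace":
            tokens[start:end] = ["<unk>"] * len(segment)
        else:  # move the segment to a random position
            del tokens[start:end]
            pos = rng.randrange(len(tokens) + 1)
            tokens[pos:pos] = segment
    return tokens
```

For example, `simulate_errors("the cat sat on the mat".split(), rng=random.Random(0))` yields a corrupted copy of the sentence while leaving it mostly recognizable, which is the point of simulating a flawed machine translation.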
An embodiment of the present invention further provides a post-translation editing device, including:
the translation determining unit is used for determining a machine translation text to be edited;
the post-editing unit is used for inputting the machine translation text and the corresponding original text into a post-editing model to obtain a post-editing translation text output by the post-editing model;
the post-editing model is obtained by fine-tuning a pre-trained editing model based on a sample fine-tuning original text, a sample fine-tuning edited translated text of the sample fine-tuning original text, and a sample machine translated text of the sample fine-tuning original text;
the pre-training editing model is obtained by training a sample pre-training original text, a sample pre-training edited translated text and a simulated translated text of the sample pre-training original text.
The embodiment of the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements any of the steps of the method for editing a translated text when executing the program.
Embodiments of the present invention further provide a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of any of the methods for editing a translated text.
With the post-translation editing method, device, electronic device, and storage medium provided by embodiments of the invention, a pre-trained editing model is obtained by training on sample pre-training originals, their sample post-edited translations, and their simulated translations, and the post-editing model is then obtained by fine-tuning the pre-trained editing model on sample fine-tuning originals, their sample post-edited translations, and their sample machine translations. Through pre-training plus fine-tuning and the synthesis of translation data via error simulation, the training efficiency and training effect of the post-editing model are improved, as is the accuracy of post-editing.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a post-translation editing method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for training a post-translation editing model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a post-translation editing apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Post-editing means that, given an original text to be translated, the corresponding machine translation result is retrieved and a translator then revises and polishes it, improving translation quality. The machine translation result serves as a reference, sparing the translator from translating from scratch and reducing the workload. However, when the machine translation result is far from the expected translation, post-editing may force the translator to make many revisions, which instead increases the workload. For example, when a machine translation model processes low-resource original text from specialized professional domains, mistranslates entity words such as person, place, or organization names or numeric words, or cannot reasonably handle the translation of long sentences, the translation quality is poor and the result is far from a correct translation, requiring extensive post-editing. Automatic post-editing models therefore play an increasingly important role in computer-assisted translation.
However, existing methods for training post-editing models require a large number of <original text, machine translation, post-edited translation> triples. Such triple training data is hard to obtain and requires substantial manual annotation cost, so the resulting post-editing models train poorly and inefficiently, and the accuracy of post-editing suffers.
Accordingly, the embodiment of the invention provides a method for editing a translated text. Fig. 1 is a schematic flowchart of a post-translation editing method according to an embodiment of the present invention, as shown in fig. 1, the method includes:
step 110, determining a machine translation text to be edited;
step 120, inputting the machine translation text and the corresponding original text into a post-editing model to obtain a post-editing translation text output by the post-editing model;
the post-editing model is obtained by fine-tuning a pre-trained editing model based on a sample fine-tuning original text, the sample post-edited translation of the sample fine-tuning original text, and a sample machine translation of the sample fine-tuning original text;
the pre-training editing model is obtained by training a sample pre-training original text, a sample pre-training edited translated text and a simulated translated text of the sample pre-training original text.
Specifically, to perform automatic post-editing, a machine-translated text corresponding to the original text is first obtained for the post-editing model. The machine-translated text may be obtained by inputting the original text into a machine translation model.
The machine-translated text and the corresponding original text are then input into the post-editing model, which corrects errors in the machine-translated text based on the semantic information of the original text and of the machine-translated text, yielding the corrected post-edited translation. The language of the post-edited translation is the same as that of the machine-translated text.
The post-editing model is obtained by fine-tuning the pre-trained editing model based on the sample fine-tuning original text, the sample post-edited translation of the sample fine-tuning original text, and the sample machine translation of the sample fine-tuning original text; the pre-trained editing model is obtained by training on the sample pre-training original text, the sample pre-training post-edited translation, and the simulated translation of the sample pre-training original text.
Here, a pre-training plus fine-tuning approach is used to train the post-editing model. Fig. 2 is a schematic flowchart of a method for training a post-translation editing model according to an embodiment of the present invention; as shown in fig. 2, the training method includes:
step 210, training an initial model on a sample pre-training original text, its sample post-edited translation, and its simulated translation to obtain a pre-trained editing model;
and step 220, fine-tuning the pre-trained editing model on a sample fine-tuning original text, its sample post-edited translation, and its sample machine translation to obtain the post-editing model.
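Steps 210 and 220 form a standard pretrain-then-fine-tune pipeline. The sketch below shows only the control flow; the `model.update` method, epoch counts, and learning rates are hypothetical placeholders, not details given in the patent, where each phase would train a dual-encoder Transformer with a cross-entropy objective.

```python
def train(model, triples, epochs=1, lr=1.0):
    """Run one training phase over (source, translation, target) triples.

    `model` is a stand-in object exposing an `update` method.
    """
    for _ in range(epochs):
        for src, mt, tgt in triples:
            model.update(src, mt, tgt, lr)
    return model

def build_post_editing_model(model, pretrain_triples, finetune_triples):
    # Step 210: pre-train on (original, simulated translation, post-edit)
    model = train(model, pretrain_triples, epochs=3)
    # Step 220: fine-tune on (original, machine translation, post-edit),
    # typically with a smaller learning rate and far less data
    model = train(model, finetune_triples, epochs=1, lr=0.1)
    return model
```

The design choice mirrors the patent's argument: the cheap, synthetic pre-training triples carry most of the training burden, so the scarce real fine-tuning triples only need to adapt the model.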
First, the initial model is pre-trained on a large number of sample pre-training originals, their sample post-edited translations, and their simulated translations to obtain the pre-trained editing model. The sample pre-training originals and their sample post-edited translations can be obtained by downloading public bilingual parallel corpus data from the web, such as the Chinese-English parallel corpora of United Nations government documents and the Conference on Machine Translation (WMT). Error simulation can then be applied to the bilingual parallel corpora to obtain simulated translations of the sample pre-training originals that mimic machine-translated output. Because pre-training only requires bilingual parallel corpora, with machine-translation-like simulated translations synthesized through error simulation, the difficulty of obtaining training data is greatly reduced, the cost of manually annotating post-edited translations is saved, the efficiency of the whole training process is improved, and the training difficulty is lowered.
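The synthesis of pre-training triples from an ordinary bilingual parallel corpus can be sketched as below. The `corrupt` callback stands in for the conventional error simulation described elsewhere in the patent; the whitespace tokenization is an illustrative assumption.

```python
import random

def make_pretrain_triples(parallel_pairs, corrupt, rng=None):
    """Turn (source, reference) bilingual pairs into
    (source, simulated_mt, reference) training triples by applying
    error simulation to the reference translation.
    """
    rng = rng or random.Random()
    triples = []
    for src, ref in parallel_pairs:
        simulated_mt = corrupt(ref.split(), rng)
        triples.append((src, " ".join(simulated_mt), ref))
    return triples
```

Only the parallel corpus is needed as input; no human-labeled machine-translation output is involved, which is the cost saving the text argues for.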
In addition, during pre-training, the model can learn from the sample pre-training originals, their sample post-edited translations, and their simulated translations the text errors that may occur in a translation, such as repeated words, inverted words, and omitted words, and learn how to correct those errors according to the original text to obtain a correct post-edited translation.
To further improve post-editing accuracy and better complete the post-editing task, the pre-trained editing model can be fine-tuned on sample fine-tuning originals, their sample post-edited translations, and their sample machine translations to obtain the post-editing model. The sample fine-tuning originals and their post-edited translations can likewise be obtained from bilingual parallel corpora. To improve fine-tuning accuracy, bilingual parallel corpora generated in a translation production environment may be used; each corpus entry comprises an original text to be translated and a high-quality translation produced by manual translation and review, from which the sample fine-tuning originals and high-quality sample post-edited translations are obtained. The sample machine translations contain the translation errors that arise in post-editing scenarios from the limitations of machine translation models in actual machine translation. By fine-tuning on the sample fine-tuning originals, their post-edited translations, and their machine translations, the post-editing model can learn, beyond conventional text errors, the translation errors that may occur in machine translation, improving its ability to locate and correct errors in post-editing scenarios and thus the accuracy of post-editing.
In addition, the amount of data required for fine-tuning is much smaller than in the pre-training stage, which reduces the difficulty of obtaining <original text, machine translation, post-edited translation> triples, further lowering the difficulty of model training and improving training efficiency.
With the method provided by embodiments of the invention, a pre-trained editing model is obtained by training on sample pre-training originals, their sample post-edited translations, and their simulated translations, and the post-editing model is then obtained by fine-tuning the pre-trained editing model on sample fine-tuning originals, their sample post-edited translations, and their sample machine translations. Through pre-training plus fine-tuning and the synthesis of translation data via error simulation, the training efficiency and training effect of the post-editing model are improved, as is the accuracy of post-editing.
Based on the above embodiment, the sample machine-translated text corresponds to at least one of the following error types: long-sentence translation errors, entity-name translation errors, and domain translation errors.
Specifically, so that the post-editing model can learn, during fine-tuning, the translation errors that arise in post-editing scenarios from the limitations of machine translation models, sample machine translations containing such errors can be obtained. In general, possible translation errors include long-sentence translation errors, entity-name translation errors, and domain translation errors. A long-sentence translation error occurs when a machine translation model cannot reasonably handle a long sentence; an entity-name translation error occurs when the model mistranslates an entity word, such as a person, place, or organization name, or a numeric word; a domain translation error arises when the model processes low-resource original text from a specialized professional domain whose domain differs from the one the model is suited to. Therefore, the obtained sample machine translations can correspond to at least one of these three error types.
Based on any of the above embodiments, the sample machine-translated text is determined in at least one of the following ways:
translating the sample fine-tuning original text by using a first machine translation model to obtain a sample machine translation text with a long sentence translation error type; the first machine translation model is obtained by training a first sample translation original text and a first sample translation text, the sample fine adjustment original text is a long sentence, and the first sample translation original text is a short sentence;
randomly modifying entity names in the sample fine-tuning post-edited translation to obtain a sample machine-translated text with the entity-name translation error type;
translating the sample fine-tuning original text by using a second machine translation model to obtain a sample machine translation text with a field translation error type; the second machine translation model is obtained by training based on a second sample translation original text which is different from the sample fine adjustment original text field and a second sample translation text thereof.
Specifically, for long-sentence translation errors, a first machine translation model can be trained on the first sample translation originals and their translations, and may be built on a single Transformer model. The first sample translation originals and their translations may be bilingual parallel corpora downloaded from the web. Here, each first sample translation original is a short passage, for example containing only one sentence. Because the first machine translation model is trained on short sentences, it is good only at translating short sentences; if long sentences are fed to it, the resulting translations are prone to long-sentence translation errors. Therefore, long passages, for example containing two or more sentences, are selected as sample fine-tuning originals and input into the first machine translation model to obtain sample machine translations with the long-sentence translation error type.
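The short/long data split behind this error synthesis can be sketched as follows. The punctuation-based sentence counter is a naive heuristic of my own (the patent does not specify how sentences are counted).

```python
import re

def sentence_count(text):
    """Naive sentence counter: split on ., !, ? and CJK 。！？."""
    return len([s for s in re.split(r"[.!?。！？]+", text) if s.strip()])

def split_for_long_sentence_errors(pairs):
    """Partition bilingual (source, target) pairs: single-sentence
    originals train the first machine translation model; multi-sentence
    originals become the sample fine-tuning originals whose machine
    translations will exhibit long-sentence errors.
    """
    short = [(s, t) for s, t in pairs if sentence_count(s) <= 1]
    long_ = [s for s, t in pairs if sentence_count(s) >= 2]
    return short, long_
```

Training only on the `short` pairs deliberately handicaps the model on the `long_` inputs, which is exactly the failure mode the fine-tuning data should contain.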
For entity-name translation errors, entity recognition can be performed on the sample fine-tuning post-edited translations with an entity recognition tool such as spaCy, for example on the English side of bilingual parallel corpora generated in a translation production environment. The text segments in the post-edited translations corresponding to entities, including person names, place names, organization names, and numbers, are identified and then randomly modified, for example deleted or replaced, to obtain sample machine translations with the entity-name translation error type.
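The entity corruption step can be sketched as below. Entity spans are assumed to come from an NER tool such as spaCy (as character offsets, e.g. `ent.start_char`/`ent.end_char`); the `<ent>` placeholder and the 50/50 delete-or-replace choice are illustrative assumptions.

```python
import random

def corrupt_entities(text, entity_spans, rng=None):
    """Randomly delete or replace entity mentions in a post-edited
    translation to synthesize entity-name translation errors.

    entity_spans: list of (start, end) character offsets, e.g. as
    produced by an NER tool such as spaCy.
    """
    rng = rng or random.Random()
    # process spans right-to-left so earlier offsets stay valid
    for start, end in sorted(entity_spans, reverse=True):
        if rng.random() < 0.5:
            text = text[:start] + text[end:]            # delete the entity
        else:
            text = text[:start] + "<ent>" + text[end:]  # replace it
    return text
```

In practice the replacement would draw from a list of plausible wrong entities rather than a fixed placeholder, so the corrupted translation still looks like genuine machine output.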
For domain translation errors, a second machine translation model can be trained on second sample translation originals and their translations, and may likewise be built on a single Transformer model. The second sample translation originals and their translations belong to a domain different from that of the sample fine-tuning originals. For example, the high-quality but narrow-domain bilingual parallel corpora of United Nations government documents can be downloaded from the web as the second sample translation originals and their translations. The trained second machine translation model is good only at translating text in the domain of its training data, so originals from other domains fed to it are prone to domain translation errors; the translations it produces for the sample fine-tuning originals can therefore serve as sample machine translations with the domain translation error type.
With these different data synthesis methods, sample machine translations corresponding to three different translation error types can be generated efficiently, eliminating the data annotation step from fine-tuning and further improving the training efficiency of the post-editing model.
Based on any of the above embodiments, the pre-trained editing model includes a pre-trained source-language encoder, a pre-trained translation-language encoder, and a decoder.
Specifically, the pre-trained editing model may include two encoders, a source-language encoder and a translation-language encoder, which encode the original text and the machine-translated text respectively, and a decoder that decodes the two encodings to correct errors in the machine-translated text and produce the post-edited translation. The source-language encoder, translation-language encoder, and decoder can each be built on a single Transformer model. The two encoders may themselves be obtained by pre-training, improving the pre-training efficiency of the editing model and thus the overall training efficiency of the post-editing model.
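The dual-encoder wiring can be shown as a structural sketch. The toy components below are placeholders of my own (whitespace "encoders", a pass-through "decoder"); in the patent each component is Transformer-based and the decoder attends to both encodings.

```python
class PostEditingModel:
    """Structural sketch: a source-language encoder, a
    translation-language encoder, and a decoder over both encodings."""

    def __init__(self, src_encoder, mt_encoder, decoder):
        self.src_encoder = src_encoder
        self.mt_encoder = mt_encoder
        self.decoder = decoder

    def post_edit(self, source, machine_translation):
        # encode the original text and the machine translation separately
        src_enc = self.src_encoder(source)
        mt_enc = self.mt_encoder(machine_translation)
        # the decoder corrects the translation using both encodings
        return self.decoder(src_enc, mt_enc)

# placeholder wiring: real components would be pre-trained Transformers
model = PostEditingModel(str.split, str.split,
                         lambda src_enc, mt_enc: " ".join(mt_enc))
```

The point of the composition is that the two encoders can be pre-trained independently on monolingual data before the whole model ever sees a (source, translation) pair.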
With this approach, the pre-trained source-language encoder, the pre-trained translation-language encoder, and the decoder together form the pre-trained editing model, further improving the overall training efficiency of the post-editing model.
Based on any of the above embodiments, the pre-trained original language encoder and the pre-trained translated language encoder are trained based on the sample monolingual text of the corresponding language and the sample error text obtained by performing conventional error simulation on the sample monolingual text.
Specifically, the original language encoder and the translation language encoder should learn to extract correct semantic information from erroneous text, so that the resulting original-text and translation encodings carry correct semantic information and the expressive capacity of the encodings is improved. To this end, each encoder may be trained on the sample monolingual text of the corresponding language, the corresponding sample error text, and a word vector model of that language. For example, if the original language is Chinese and the translation language is English, the original language encoder may be pre-trained on Chinese sample monolingual text, its corresponding sample error text, and a Chinese word vector model, while the translation language encoder may be pre-trained on English sample monolingual text, its corresponding sample error text, and an English word vector model. The sample monolingual text may be obtained by collecting large monolingual corpora; for example, common Chinese monolingual corpora such as Chinese Wikipedia and news corpora, and common English corpora such as English Wikipedia and news corpora, can be downloaded from the network. To reduce the difficulty of acquiring training data, part of the monolingual corpus, for example 20%, may be randomly selected, and conventional error simulation applied to the selected monolingual corpus, i.e. the sample monolingual text, to obtain sample error text containing conventional text errors.
According to the method provided by the embodiment of the present invention, the original language encoder and the translation language encoder are obtained by pre-training on the sample monolingual text of the corresponding language and on the sample error text obtained by applying conventional error simulation to that monolingual text, so that original-text and translation encodings containing correct semantic information can be produced, improving the expressive capacity of the encodings.
Based on any embodiment, the simulated translated text is determined based on the following steps:
Conventional error simulation is performed on the sample pre-training original text or on the sample pre-trained edited translation text to obtain the simulated translation text.
Specifically, part of the bilingual parallel corpora, for example 10%, may be randomly selected from a bilingual parallel corpus base, and conventional error simulation applied to the sample pre-training original text in each selected corpus to obtain a simulated translation text containing conventional text errors; the sample pre-trained edited translation text, the simulated translation text, and the sample pre-training original text in that bilingual parallel corpus then form one piece of training data for the pre-trained editing model. Alternatively, another portion of the bilingual parallel corpora, for example 10%, may be randomly selected, and conventional error simulation applied to the sample pre-trained edited translation text in each corpus to obtain a simulated translation text containing conventional text errors; the sample pre-training original text, the simulated translation text, and the sample pre-trained edited translation text in that bilingual parallel corpus then form one piece of training data for the pre-trained editing model.
Based on any of the above embodiments, performing a conventional error simulation specifically includes:
randomly selecting a plurality of text segments in the corresponding text, and deleting, rearranging, replacing or transferring the text segments.
Specifically, conventional text errors include missing words, reversed word order, repeated words, and the like. Therefore, when conventional error simulation is performed, several text segments in the corresponding text can be randomly selected, and each segment subjected to a deletion, rearrangement, replacement, or transfer operation. Deletion removes the whole segment; rearrangement reverses the order of the words within the segment; replacement overwrites the segment with a segment from another position in the text; and transfer exchanges the positions of the segment and a segment elsewhere in the text. For example, conventional error simulation can be performed as shown in the following table:
Original text: <zh> Today the weather is really good.
Deletion: <zh> Today is really good. (the segment "the weather" is deleted)
Rearrangement: <zh> Today weather the is really good. (the word order within "the weather" is reversed)
Replacement: <zh> Today really good is really good. (the segment "the weather" is overwritten by "really good")
Transfer: <zh> Today really good is the weather. (the segments "the weather" and "really good" exchange positions)
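The four corruption operations above can be sketched in Python as follows. The segment count and segment length parameters are illustrative assumptions; the patent does not specify them.

```python
import random

def corrupt(tokens, n_segments=1, seg_len=2, rng=None):
    """Apply one of the four corruption operations (delete, rearrange,
    replace, transfer) to randomly chosen segments of a token list."""
    rng = rng or random.Random(0)
    tokens = list(tokens)
    for _ in range(n_segments):
        if len(tokens) <= seg_len:
            break
        i = rng.randrange(len(tokens) - seg_len)          # segment start
        op = rng.choice(["delete", "rearrange", "replace", "transfer"])
        if op == "delete":                                # drop the whole segment
            del tokens[i:i + seg_len]
        elif op == "rearrange":                           # reverse word order inside it
            tokens[i:i + seg_len] = reversed(tokens[i:i + seg_len])
        elif op == "replace":                             # overwrite with another segment
            j = rng.randrange(len(tokens) - seg_len)
            tokens[i:i + seg_len] = tokens[j:j + seg_len]
        else:                                             # transfer: swap two segments
            j = rng.randrange(len(tokens) - seg_len)
            tokens[i:i + seg_len], tokens[j:j + seg_len] = (
                tokens[j:j + seg_len], tokens[i:i + seg_len])
    return tokens

print(corrupt("today the weather is really good".split()))
```

The same function covers both uses of conventional error simulation in this document: corrupting monolingual text for encoder pre-training and corrupting one side of a bilingual pair to synthesize simulated translation texts.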
Based on any one of the above embodiments, another embodiment of the present invention provides a post-editing model construction method. The method comprises the following steps:
firstly, collecting corpus data required by model training, including:
Bilingual parallel corpora generated in the translation production environment are accumulated and denoted bilingual parallel corpus C. Each corpus entry comprises an original text to be translated and a high-quality translation produced after human translation and review.
Publicly available bilingual parallel corpora, such as the United Nations and WMT parallel corpora, are downloaded from the network and denoted bilingual parallel corpus T.
Publicly available monolingual corpora of the original language, such as Chinese Wikipedia and news corpora, are downloaded from the network and denoted monolingual corpus Z.
Publicly available monolingual corpora of the translation language, such as English Wikipedia and news corpora, are downloaded from the network and denoted monolingual corpus E.
Word segmentation is then applied to all corpora. English corpora can be segmented with the spaCy tool; Chinese corpora can be segmented at the character level using grammar rules, i.e. individual Chinese characters, runs of consecutive digits or Latin letters, punctuation marks, and the like each become one token. A language identifier is then added to the beginning of each corpus, as shown in the following table:
[Table image: Figure BDA0002751616760000131 — example corpora with language identifiers prepended]
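A minimal sketch of the character-level Chinese segmentation and language-tag prepending described above. The regex is an illustrative approximation of the grammar rules (single Chinese characters, runs of digits or Latin letters, and punctuation each become one token); English corpora would instead be tokenized with spaCy.

```python
import re

def tokenize_zh(text, lang_tag="<zh>"):
    """Character-level Chinese segmentation with a language identifier
    prepended. Order of alternatives matters: digit and letter runs are
    matched first, then single CJK characters, then any other non-space
    character (punctuation)."""
    pattern = re.compile(r"[0-9]+|[A-Za-z]+|[\u4e00-\u9fff]|[^\s]")
    return [lang_tag] + pattern.findall(text)

print(tokenize_zh("GDP增长8.5%！"))
# ['<zh>', 'GDP', '增', '长', '8', '.', '5', '%', '！']
```

Note that this approximation splits "8.5" at the decimal point; a production rule set would likely keep full numbers together.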
Based on the segmented corpus data, word vectors are trained for the original language and the translation language respectively using the Skip-Gram algorithm. Here, the word vector dimension may be set to 300 and the context window to 5.
20% of the corpora are randomly extracted from Z and subjected to conventional error simulation, synthesizing parallel corpora consisting of the possibly corrupted corpus and the original corpus; combined with the word vector model of the original language, these are used to pre-train the original language encoder of a standard Transformer model.
Likewise, 20% of the corpora are randomly extracted from E and subjected to conventional error simulation, synthesizing parallel corpora consisting of the possibly corrupted corpus and the original corpus; combined with the word vector model of the translation language, these are used to pre-train the translation language encoder of a standard Transformer model.
10% of the corpora are randomly extracted from T, and conventional error simulation is applied to the original-text corpora to generate ternary corpora (possibly corrupted original text, original translation, original original text). Similarly, another 10% of the corpora are randomly extracted from T, and conventional error simulation is applied to the translation corpora to generate ternary corpora (original original text, possibly corrupted translation, original translation). The synthesized ternary parallel corpora are then used to pre-train a network with dual Transformer encoders and a single Transformer decoder, yielding the pre-trained editing model. The dual Transformer encoders are the original language encoder and the translation language encoder.
Subsequently, training data acquisition of the fine tuning task is performed, including:
a) Using a rule-based Chinese sentence-breaking method, the original-text corpora in C are split into sentences, and the bilingual parallel corpora whose original text contains two or more sentences are screened out to form a subset C1. Similarly, sentence breaking is performed on the original-text corpora in T, and the bilingual parallel corpora whose original text contains exactly one sentence are screened out to form another subset T1. Using corpus T1, a machine translation engine based on a Transformer model is constructed. The original-text corpora of C1 are then input into the model and decoded to produce machine translations, yielding triples (C1 original text, machine-translated translation, C1 translation).
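The sentence-count filtering in step a) can be sketched as follows. The punctuation-based splitting rule is a simplified stand-in for the patent's Chinese sentence-breaking rules.

```python
import re

def count_sentences_zh(text):
    """Rule-based Chinese sentence counting: split on runs of
    sentence-final punctuation and count non-empty pieces."""
    parts = [s for s in re.split(r"[。！？!?]+", text) if s.strip()]
    return len(parts)

# Toy stand-in for bilingual parallel corpus C: (original, translation) pairs.
corpus_C = [("今天天气真好。我们出去玩吧。", "It's sunny today. Let's go out."),
            ("你好。", "Hello.")]

C1 = [p for p in corpus_C if count_sentences_zh(p[0]) >= 2]  # multi-sentence originals
T1 = [p for p in corpus_C if count_sentences_zh(p[0]) == 1]  # single-sentence originals
print(len(C1), len(T1))  # 1 1
```

The same filter applied to T (with the `== 1` condition) would yield the short-sentence training set for the first machine translation engine.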
b) Using the spaCy tool, entity recognition is performed on the translation corpora in C, and the bilingual parallel corpora C2 containing entities such as person names, place names, organizations, and numbers are screened out. The entity nouns in the C2 translation corpora are randomly modified, e.g. deleted or replaced, producing triples (C2 original text, translation with corrupted entity nouns, C2 translation).
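The entity-noun corruption in step b) can be sketched as follows. For self-containment the entity spans are supplied directly rather than computed with spaCy NER (in practice they would come from `doc.ents`); the `<ENT>` placeholder token is an illustrative assumption.

```python
import random

def corrupt_entities(tokens, entity_spans, rng=None):
    """Randomly delete or replace one entity mention.
    `entity_spans` are (start, end) token index pairs, e.g. derived
    from spaCy named entity recognition."""
    rng = rng or random.Random(0)
    tokens = list(tokens)
    start, end = rng.choice(entity_spans)
    if rng.random() < 0.5:
        del tokens[start:end]              # delete the entity outright
    else:
        tokens[start:end] = ["<ENT>"]      # replace it with a placeholder
    return tokens

sent = "Alice flew to Paris on Monday".split()
print(corrupt_entities(sent, [(0, 1), (3, 4)]))
```

Each corrupted translation, paired with its original text and clean reference translation, forms one fine-tuning triple for the entity-name error type.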
c) The United Nations bilingual parallel corpora are screened out of T, and a machine translation engine based on a Transformer model is constructed. A subset C3 is extracted from C, and the original-text corpora of C3 are input into the model and decoded to produce machine translations, yielding triples (C3 original text, machine-translated translation, C3 translation).
The triples generated in a), b), and c) are collected to form the full training data for the fine-tuning task, and the pre-trained editing model is fine-tuned on them to obtain the final post-editing model.
The following describes the post-translation editing apparatus provided in the embodiment of the present invention, and the post-translation editing apparatus described below and the post-translation editing method described above may be referred to in correspondence with each other.
Based on any of the above embodiments, fig. 3 is a schematic structural diagram of a post-translation editing apparatus according to an embodiment of the present invention, as shown in fig. 3, the apparatus includes: a translation determining unit 310 and a post-editing unit 320.
The translation determining unit 310 is configured to determine a machine translation text to be edited;
the post-editing unit 320 is configured to input the machine-translated translation text and the original text corresponding to the machine-translated translation text into a post-editing model, and obtain a post-editing translation text output by the post-editing model;
the post-editing model is obtained by fine-tuning the pre-trained editing model based on a sample fine-tuning original text, a sample fine-tuned edited translation text of the sample fine-tuning original text, and a sample machine translation text of the sample fine-tuning original text;
the pre-training editing model is obtained by training a sample pre-training original text, a sample pre-training edited translated text and a simulated translated text of the sample pre-training original text.
According to the apparatus provided by the embodiment of the present invention, the pre-trained editing model is obtained by training on the sample pre-training original text, its sample pre-trained edited translation text, and its simulated translation text, and the post-editing model is obtained by fine-tuning on the sample fine-tuning original text, its sample fine-tuned edited translation text, and its sample machine translation text. Through the pre-training plus fine-tuning scheme and the error simulation used to synthesize translation data, the training efficiency and training effect of the post-editing model are improved, as is the accuracy of post-editing.
Based on any embodiment, the sample machine translation text corresponds to at least one error type of long sentence translation errors, entity name translation errors and domain translation errors.
Based on any embodiment, the sample machine translation text is determined based on at least one of the following modes:
translating the sample fine-tuning original text by using a first machine translation model to obtain a sample machine translation text with a long sentence translation error type; the first machine translation model is obtained by training a first sample translation original text and a first sample translation text, the sample fine adjustment original text is a long sentence, and the first sample translation original text is a short sentence;
randomly modifying the entity name in the edited translated text after the sample is finely adjusted to obtain a sample machine translation translated text with the entity name translation error type;
translating the sample fine-tuning original text by using a second machine translation model to obtain a sample machine translation text with a field translation error type; the second machine translation model is obtained by training based on a second sample translation original text which is different from the sample fine adjustment original text field and a second sample translation text thereof.
The device provided by the embodiment of the invention can efficiently generate the sample machine translation translated text corresponding to three different translation error types through different data synthesis modes, saves the data marking process in the fine tuning process, and can further improve the training efficiency of the post-editing model.
Based on any of the above embodiments, the pre-trained post-editing model includes a pre-trained original language encoder, a pre-trained translated language encoder, and a decoder.
The device provided by the embodiment of the invention constructs the pre-trained post-editing model through the pre-trained original language encoder, the pre-trained translated language encoder and the decoder, thereby further improving the overall training efficiency of the post-editing model.
Based on any of the above embodiments, the pre-trained original language encoder and the pre-trained translated language encoder are trained based on the sample monolingual text of the corresponding language and the sample error text obtained by performing conventional error simulation on the sample monolingual text.
In the apparatus provided by the embodiment of the present invention, the original language encoder and the translation language encoder are obtained by pre-training on the sample monolingual text of the corresponding language and on the sample error text obtained by applying conventional error simulation to that monolingual text, so that original-text and translation encodings containing correct semantic information can be produced, improving the expressive capacity of the encodings.
Based on any embodiment, the simulated translated text is determined based on the following steps:
Conventional error simulation is performed on the sample pre-training original text or on the sample pre-trained edited translation text to obtain the simulated translation text.
Based on any of the above embodiments, the apparatus further comprises a conventional error simulation unit for:
randomly selecting a plurality of text segments in the corresponding text, and deleting, rearranging, replacing or transferring the text segments.
Fig. 4 illustrates a schematic diagram of the physical structure of an electronic device. As shown in fig. 4, the electronic device may include: a processor 410, a communication interface 420, a memory 430 and a communication bus 440, wherein the processor 410, the communication interface 420 and the memory 430 communicate with one another via the communication bus 440. The processor 410 may call logic instructions in the memory 430 to perform a post-translation editing method comprising: determining a machine translation text to be edited; inputting the machine translation text and the corresponding original text into a post-editing model to obtain a post-editing translation text output by the post-editing model; the post-editing model is obtained by fine-tuning a pre-trained editing model based on a sample fine-tuning original text, a sample fine-tuned edited translation text of the sample fine-tuning original text, and a sample machine translation text of the sample fine-tuning original text; the pre-trained editing model is obtained by training on a sample pre-training original text, a sample pre-trained edited translation text, and a simulated translation text of the sample pre-training original text.
In addition, the logic instructions in the memory 430 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, an embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program stored on a non-transitory computer-readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a computer, the computer can execute the method for editing after translation provided by the above-mentioned method embodiments, where the method includes: determining a machine translation text to be edited; inputting the machine translation translated text and the corresponding original text into a post-editing model to obtain a post-editing translated text output by the post-editing model; the post-editing model is obtained by fine-tuning a pre-trained editing model based on a sample fine-tuning original text, a sample fine-tuning edited translated text of the sample fine-tuning original text, and a sample machine translated text of the sample fine-tuning original text; the pre-training editing model is obtained by training a sample pre-training original text, a sample pre-training edited translated text and a simulated translated text of the sample pre-training original text.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program is implemented to perform the method for editing after translation provided by the foregoing embodiments when executed by a processor, where the method includes: determining a machine translation text to be edited; inputting the machine translation translated text and the corresponding original text into a post-editing model to obtain a post-editing translated text output by the post-editing model; the post-editing model is obtained by fine-tuning a pre-trained editing model based on a sample fine-tuning original text, a sample fine-tuning edited translated text of the sample fine-tuning original text, and a sample machine translated text of the sample fine-tuning original text; the pre-training editing model is obtained by training a sample pre-training original text, a sample pre-training edited translated text and a simulated translated text of the sample pre-training original text.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for editing a translated text, comprising:
determining a machine translation text to be edited;
inputting the machine translation translated text and the corresponding original text into a post-editing model to obtain a post-editing translated text output by the post-editing model;
the post-editing model is obtained by fine-tuning a pre-trained editing model based on a sample fine-tuning original text, a sample fine-tuning edited translated text of the sample fine-tuning original text, and a sample machine translated text of the sample fine-tuning original text;
the pre-training editing model is obtained by training a sample pre-training original text, a sample pre-training edited translated text and a simulated translated text of the sample pre-training original text.
2. The post-translation editing method according to claim 1, wherein the sample machine translation text corresponds to at least one type of error among a long sentence translation error, an entity name translation error, and a domain translation error.
3. The method of post-translation editing according to claim 2, wherein the sample machine translation of the translation text is determined based on at least one of:
translating the sample fine-tuning original text by using a first machine translation model to obtain a sample machine translation text with a long sentence translation error type; the first machine translation model is obtained by training a first sample translation original text and a first sample translation text thereof, the sample fine-tuning original text is a long sentence, and the first sample translation original text is a short sentence;
randomly modifying the entity name in the edited translated text after the sample is finely adjusted to obtain a sample machine translation translated text with the entity name translation error type;
translating the sample fine-tuning original text by using a second machine translation model to obtain a sample machine translation text with a field translation error type; the second machine translation model is obtained by training based on a second sample translation original text which is different from the sample fine tuning original text field and a second sample translation text thereof.
4. The method of claim 1, wherein the pre-trained post-editing model comprises a pre-trained original language encoder, a pre-trained translation language encoder, and a decoder.
5. The method of post-translation editing according to claim 4, wherein the pre-trained original language encoder and the pre-trained translation language encoder are trained based on a sample monolingual text of a corresponding language and a sample error text obtained by performing a conventional error simulation on the sample monolingual text.
6. The method of post-translation editing according to claim 1, wherein the simulated translation text is determined based on the steps of:
and performing conventional error simulation on the sample pre-training original text or the sample pre-training edited translated text to obtain the simulated translated text.
7. The method for editing after-translation of claim 5 or 6, wherein the performing of the routine error simulation specifically comprises:
randomly selecting a plurality of text segments in the corresponding text, and deleting, rearranging, replacing or transferring the text segments.
8. A post-translation editing apparatus, comprising:
the translation determining unit is used for determining a machine translation text to be edited;
the post-editing unit is used for inputting the machine translation text and the corresponding original text into a post-editing model to obtain a post-editing translation text output by the post-editing model;
the post-editing model is obtained by fine-tuning a pre-trained editing model based on a sample fine-tuning original text, a sample fine-tuning edited translated text of the sample fine-tuning original text, and a sample machine translated text of the sample fine-tuning original text;
the pre-training editing model is obtained by training a sample pre-training original text, a sample pre-training edited translated text and a simulated translated text of the sample pre-training original text.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method for post-compilation of a translated version according to any of claims 1 to 7 are implemented when the program is executed by the processor.
10. A non-transitory computer readable storage medium, having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the post-translation editing method according to any one of claims 1 to 7.
CN202011186869.1A 2020-10-29 2020-10-29 Post-translation editing method and device, electronic equipment and storage medium Active CN112287696B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011186869.1A CN112287696B (en) 2020-10-29 2020-10-29 Post-translation editing method and device, electronic equipment and storage medium
PCT/CN2021/078814 WO2022088570A1 (en) 2020-10-29 2021-03-03 Method and apparatus for post-editing of translation, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011186869.1A CN112287696B (en) 2020-10-29 2020-10-29 Post-translation editing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112287696A true CN112287696A (en) 2021-01-29
CN112287696B CN112287696B (en) 2024-02-23

Family

ID=74352729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011186869.1A Active CN112287696B (en) 2020-10-29 2020-10-29 Post-translation editing method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112287696B (en)
WO (1) WO2022088570A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836528A (en) * 2021-02-07 2021-05-25 语联网(武汉)信息技术有限公司 Machine translation post-editing method and system
CN114091483A (en) * 2021-10-27 2022-02-25 北京百度网讯科技有限公司 Translation processing method and device, electronic equipment and storage medium
WO2022088570A1 (en) * 2020-10-29 2022-05-05 语联网(武汉)信息技术有限公司 Method and apparatus for post-editing of translation, electronic device, and storage medium
CN117273027A (en) * 2023-11-22 2023-12-22 四川语言桥信息技术有限公司 Automatic machine translation post-verification method based on translation error correction

Citations (5)

Publication number Priority date Publication date Assignee Title
US20170091177A1 (en) * 2015-09-30 2017-03-30 Kabushiki Kaisha Toshiba Machine translation apparatus, machine translation method and computer program product
CN109670191A (en) * 2019-01-24 2019-04-23 语联网(武汉)信息技术有限公司 Calibration optimization method, device and the electronic equipment of machine translation
CN111144137A (en) * 2019-12-17 2020-05-12 语联网(武汉)信息技术有限公司 Method and device for generating edited model corpus after machine translation
CN111382580A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 Encoder-decoder framework pre-training method for neural machine translation
CN111597778A (en) * 2020-04-15 2020-08-28 哈尔滨工业大学 Method and system for automatically optimizing machine translation based on self-supervision

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN105740218A (en) * 2015-12-31 2016-07-06 成都数联铭品科技有限公司 Post-editing processing method for mechanical translation
US10558762B2 (en) * 2018-02-24 2020-02-11 International Business Machines Corporation System and method for adaptive quality estimation for machine translation post-editing
CN112287696B (en) * 2020-10-29 2024-02-23 语联网(武汉)信息技术有限公司 Post-translation editing method and device, electronic equipment and storage medium

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
US20170091177A1 (en) * 2015-09-30 2017-03-30 Kabushiki Kaisha Toshiba Machine translation apparatus, machine translation method and computer program product
CN109670191A (en) * 2019-01-24 2019-04-23 语联网(武汉)信息技术有限公司 Calibration optimization method, device and the electronic equipment of machine translation
CN111144137A (en) * 2019-12-17 2020-05-12 语联网(武汉)信息技术有限公司 Method and device for generating edited model corpus after machine translation
CN111382580A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 Encoder-decoder framework pre-training method for neural machine translation
CN111597778A (en) * 2020-04-15 2020-08-28 哈尔滨工业大学 Method and system for automatically optimizing machine translation based on self-supervision

Cited By (7)

Publication number Priority date Publication date Assignee Title
WO2022088570A1 (en) * 2020-10-29 2022-05-05 语联网(武汉)信息技术有限公司 Method and apparatus for post-editing of translation, electronic device, and storage medium
CN112836528A (en) * 2021-02-07 2021-05-25 语联网(武汉)信息技术有限公司 Machine translation post-editing method and system
WO2022166267A1 (en) * 2021-02-07 2022-08-11 语联网(武汉)信息技术有限公司 Machine translation post-editing method and system
CN112836528B (en) * 2021-02-07 2023-10-03 Iol Wuhan Information Technology Co., Ltd. Machine translation post-editing method and system
CN114091483A (en) * 2021-10-27 2022-02-25 北京百度网讯科技有限公司 Translation processing method and device, electronic equipment and storage medium
CN117273027A (en) * 2023-11-22 2023-12-22 Sichuan Lan-bridge Information Technology Co., Ltd. Automatic machine translation post-verification method based on translation error correction
CN117273027B (en) * 2023-11-22 2024-04-30 Sichuan Lan-bridge Information Technology Co., Ltd. Automatic machine translation post-verification method based on translation error correction

Also Published As

Publication number Publication date
WO2022088570A1 (en) 2022-05-05
CN112287696B (en) 2024-02-23

Similar Documents

Publication Publication Date Title
CN112287696B (en) Post-translation editing method and device, electronic equipment and storage medium
US11113234B2 (en) Semantic extraction method and apparatus for natural language, and computer storage medium
CN109840331B (en) Neural machine translation method based on user dictionary
US7853444B2 (en) Method and apparatus for training transliteration model and parsing statistic model, method and apparatus for transliteration
CN112766000B (en) Machine translation method and system based on pre-training model
CN112329447B (en) Training method of Chinese error correction model, Chinese error correction method and device
US8874433B2 (en) Syntax-based augmentation of statistical machine translation phrase tables
CN111539229A (en) Neural machine translation model training method, neural machine translation method and device
CN111144140B (en) Chinese-Thai bilingual corpus generation method and device based on zero-shot learning
CN112541365B (en) Machine translation method and device based on term replacement
CN110211562B (en) Voice synthesis method, electronic equipment and readable storage medium
CN105808528A (en) Document character processing method
CN112818712B (en) Machine translation method and device based on translation memory library
CN111144137B (en) Method and device for generating corpus for a machine translation post-editing model
CN111539199A (en) Text error correction method, device, terminal and storage medium
CN109657244B (en) English long sentence automatic segmentation method and system
CN112766001A (en) Enterprise name translation method and device
CN115718904A (en) Text processing method and device
CN112836528B (en) Machine translation post-editing method and system
CN114861628A (en) System, method, electronic device and storage medium for training machine translation model
CN114185573A (en) Implementation and online updating system and method for human-computer interaction machine translation system
CN117034968B (en) Neural machine translation method, device, electronic equipment and medium
CN110287496A (en) Neural-network-based English-to-Chinese word sense disambiguation method
CN116522966B (en) Text translation method and system based on multilingual vocabulary entry
CN111652004B (en) Fusion method and device for machine translation system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant