CN112287696A - Post-translation editing method and device, electronic equipment and storage medium - Google Patents


Publication number
CN112287696A
CN112287696A
Authority
CN
China
Prior art keywords
text
sample
translation
editing
post
Prior art date
Legal status
Granted
Application number
CN202011186869.1A
Other languages
Chinese (zh)
Other versions
CN112287696B (en)
Inventor
Zhang Mu (张睦)
Current Assignee
Iol Wuhan Information Technology Co ltd
Original Assignee
Iol Wuhan Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Iol Wuhan Information Technology Co ltd filed Critical Iol Wuhan Information Technology Co ltd
Priority to CN202011186869.1A priority Critical patent/CN112287696B/en
Publication of CN112287696A publication Critical patent/CN112287696A/en
Priority to PCT/CN2021/078814 priority patent/WO2022088570A1/en
Application granted granted Critical
Publication of CN112287696B publication Critical patent/CN112287696B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 — Handling natural language data
    • G06F40/40 — Processing or translation of natural language
    • G06F40/58 — Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G06F40/10 — Text processing
    • G06F40/166 — Editing, e.g. inserting or deleting


Abstract

Embodiments of the invention provide a post-translation editing method and device. The method includes: determining a machine-translated text to be post-edited; and inputting the machine-translated text and its corresponding original text into a post-editing model to obtain the post-edited translation output by the model. The post-editing model is obtained by fine-tuning a pre-trained editing model on sample fine-tuning originals, their sample post-edited translations, and their sample machine translations; the pre-trained editing model is obtained by training on sample pre-training originals, their sample post-edited translations, and their simulated translations. By combining pre-training with fine-tuning and synthesizing translation data through error simulation, the method and device improve the training efficiency and training effect of the post-editing model as well as the accuracy of post-editing.

Description

Post-translation editing method and device, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a post-translation editing method and device, an electronic device, and a storage medium.
Background
Post-editing means that, given an original text to be translated, the corresponding machine translation result is retrieved and a translator then revises and polishes it, improving translation quality. The machine translation result serves as a reference, sparing the translator from translating from scratch and reducing the translator's workload.
In practice, when the machine translation result differs greatly from the expected translation, post-editing may force the translator to make many revisions, which instead increases the workload. For example, a machine translation model performs poorly on low-resource original text from specialized professional domains, producing results far from a correct translation. Likewise, when the model mistranslates entity words such as person, place, or organization names, or numeric words, the accuracy of the result suffers. And when the model cannot reasonably handle the translation of long sentences, the result again falls short and requires extensive post-editing. For these reasons, automatic post-editing models play an increasingly important role in computer-assisted translation. Given the original text to be translated and its machine translation, a post-editing model can automatically post-edit the machine translation, correct translation errors, and output a post-edited translation, further narrowing the gap between the output and the translation the translator expects and thereby reducing the translator's workload.
However, existing methods for training post-editing models require a large number of <original text, machine translation, post-edited translation> triples. Such triple training data is hard to obtain and requires substantial manual annotation cost, so the resulting post-editing models train poorly and inefficiently, and the accuracy of post-editing suffers.
Disclosure of Invention
Embodiments of the invention provide a post-translation editing method and device, an electronic device, and a storage medium, to overcome the defects of the prior art: poor training effect, low training efficiency, and poor post-editing accuracy.
The embodiment of the invention provides a method for editing a translated text, which comprises the following steps:
determining a machine translation text to be edited;
inputting the machine-translated text and the corresponding original text into a post-editing model to obtain the post-edited translation output by the post-editing model;
the post-editing model is obtained by fine-tuning a pre-trained editing model based on a sample fine-tuning original text, a sample fine-tuning edited translated text of the sample fine-tuning original text, and a sample machine translated text of the sample fine-tuning original text;
the pre-training editing model is obtained by training a sample pre-training original text, a sample pre-training edited translated text and a simulated translated text of the sample pre-training original text.
According to the post-translation editing method of the invention, the sample machine-translated text corresponds to at least one of the following error types: long-sentence translation errors, entity-name translation errors, and domain translation errors.
According to the post-translation editing method of the invention, the sample machine-translated text is determined in at least one of the following ways:
translating the sample fine-tuning original text by using a first machine translation model to obtain a sample machine translation text with a long sentence translation error type; the first machine translation model is obtained by training a first sample translation original text and a first sample translation text thereof, the sample fine-tuning original text is a long sentence, and the first sample translation original text is a short sentence;
randomly modifying entity names in the sample fine-tuning post-edited translation to obtain a sample machine-translated text with the entity-name translation error type;
translating the sample fine-tuning original text by using a second machine translation model to obtain a sample machine translation text with a field translation error type; the second machine translation model is obtained by training based on a second sample translation original text which is different from the sample fine tuning original text field and a second sample translation text thereof.
According to one embodiment of the invention, the pre-trained editing model comprises a pre-trained source-language encoder, a pre-trained translation-language encoder, and a decoder.
According to the post-translation editing method of the invention, the pre-trained source-language encoder and the pre-trained translation-language encoder are each trained on a sample monolingual text of the corresponding language and a sample error text obtained by applying conventional error simulation to that monolingual text.
According to the post-translation editing method of the invention, the simulated translation is determined by the following step:
performing conventional error simulation on the sample pre-training original text or the sample pre-training post-edited translation to obtain the simulated translation.
According to the post-translation editing method of one embodiment of the invention, performing the conventional error simulation specifically includes:
randomly selecting a plurality of text segments in the corresponding text and deleting, rearranging, replacing, or moving those segments.
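The conventional error simulation described above can be sketched as follows. The segment count, maximum segment length, and the uniform choice among the four operations are illustrative assumptions; the patent does not fix concrete values.

```python
import random

def simulate_errors(tokens, n_segments=2, max_len=3, rng=None):
    """Corrupt a token list by deleting, rearranging (shuffling),
    replacing, or moving a few randomly chosen short segments.

    n_segments and max_len are illustrative placeholders.
    """
    rng = rng or random.Random()
    tokens = list(tokens)
    for _ in range(n_segments):
        if len(tokens) < 2:
            break
        start = rng.randrange(len(tokens))
        end = min(start + rng.randint(1, max_len), len(tokens))
        segment = tokens[start:end]
        op = rng.choice(["delete", "shuffle", "replace", "move"])
        if op == "delete":
            del tokens[start:end]
        elif op == "shuffle":
            rng.shuffle(segment)
            tokens[start:end] = segment
        elif op == "replace":
            tokens[start:end] = ["<unk>"] * len(segment)
        else:  # move the segment to a random position
            del tokens[start:end]
            pos = rng.randrange(len(tokens) + 1)
            tokens[pos:pos] = segment
    return tokens
```

For example, `simulate_errors("the cat sat on the mat".split(), rng=random.Random(0))` yields a corrupted copy of the sentence while leaving it mostly recognizable, which is the point of simulating a flawed machine translation.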
An embodiment of the present invention further provides a post-translation editing device, including:
the translation determining unit is used for determining a machine translation text to be edited;
the post-editing unit is used for inputting the machine translation text and the corresponding original text into a post-editing model to obtain a post-editing translation text output by the post-editing model;
the post-editing model is obtained by fine-tuning a pre-trained editing model based on a sample fine-tuning original text, a sample fine-tuning edited translated text of the sample fine-tuning original text, and a sample machine translated text of the sample fine-tuning original text;
the pre-training editing model is obtained by training a sample pre-training original text, a sample pre-training edited translated text and a simulated translated text of the sample pre-training original text.
The embodiment of the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements any of the steps of the method for editing a translated text when executing the program.
Embodiments of the present invention further provide a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps of any of the methods for editing a translated text.
With the post-translation editing method, device, electronic device, and storage medium provided by embodiments of the invention, a pre-trained editing model is obtained by training on sample pre-training originals, their sample post-edited translations, and their simulated translations, and the post-editing model is then obtained by fine-tuning the pre-trained editing model on sample fine-tuning originals, their sample post-edited translations, and their sample machine translations. Through pre-training plus fine-tuning and the synthesis of translation data via error simulation, the training efficiency and training effect of the post-editing model are improved, as is the accuracy of post-editing.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a post-translation editing method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for training a post-translation editing model according to an embodiment of the present invention;
fig. 3 is a schematic structural diagram of a post-translation editing apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Post-editing means that, given an original text to be translated, the corresponding machine translation result is retrieved and a translator then revises and polishes it, improving translation quality. The machine translation result serves as a reference, sparing the translator from translating from scratch and reducing the workload. However, when the machine translation result is far from the expected translation, post-editing may force the translator to make many revisions, which instead increases the workload. For example, when a machine translation model processes low-resource original text from specialized professional domains, mistranslates entity words such as person, place, or organization names or numeric words, or cannot reasonably handle the translation of long sentences, the translation quality is poor and the result is far from a correct translation, requiring extensive post-editing. Automatic post-editing models therefore play an increasingly important role in computer-assisted translation.
However, existing methods for training post-editing models require a large number of <original text, machine translation, post-edited translation> triples. Such triple training data is hard to obtain and requires substantial manual annotation cost, so the resulting post-editing models train poorly and inefficiently, and the accuracy of post-editing suffers.
Accordingly, the embodiment of the invention provides a method for editing a translated text. Fig. 1 is a schematic flowchart of a post-translation editing method according to an embodiment of the present invention, as shown in fig. 1, the method includes:
step 110, determining a machine translation text to be edited;
step 120, inputting the machine translation text and the corresponding original text into a post-editing model to obtain a post-editing translation text output by the post-editing model;
the post-editing model is obtained by fine-tuning a pre-trained editing model based on a sample fine-tuning original text, the sample post-edited translation of the sample fine-tuning original text, and a sample machine translation of the sample fine-tuning original text;
the pre-training editing model is obtained by training a sample pre-training original text, a sample pre-training edited translated text and a simulated translated text of the sample pre-training original text.
Specifically, to perform automatic post-editing, a machine-translated text corresponding to the original text is first obtained for the post-editing model. The machine-translated text may be obtained by inputting the original text into a machine translation model.
The machine-translated text and the corresponding original text are then input into the post-editing model, which corrects errors in the machine-translated text based on the semantic information of the original text and of the machine-translated text, yielding the corrected post-edited translation. The language of the post-edited translation is the same as that of the machine-translated text.
The post-editing model is obtained by fine-tuning the pre-trained editing model based on the sample fine-tuning original text, the sample post-edited translation of the sample fine-tuning original text, and the sample machine translation of the sample fine-tuning original text; the pre-trained editing model is obtained by training on the sample pre-training original text, the sample pre-training post-edited translation, and the simulated translation of the sample pre-training original text.
Here, a pre-training plus fine-tuning approach is used to train the post-editing model. Fig. 2 is a schematic flowchart of a method for training a post-translation editing model according to an embodiment of the present invention; as shown in fig. 2, the training method includes:
step 210, training an initial model on a sample pre-training original text, its sample post-edited translation, and its simulated translation to obtain a pre-trained editing model;
and step 220, fine-tuning the pre-trained editing model on a sample fine-tuning original text, its sample post-edited translation, and its sample machine translation to obtain the post-editing model.
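Steps 210 and 220 form a standard pretrain-then-fine-tune pipeline. The sketch below shows only the control flow; the `model.update` method, epoch counts, and learning rates are hypothetical placeholders, not details given in the patent, where each phase would train a dual-encoder Transformer with a cross-entropy objective.

```python
def train(model, triples, epochs=1, lr=1.0):
    """Run one training phase over (source, translation, target) triples.

    `model` is a stand-in object exposing an `update` method.
    """
    for _ in range(epochs):
        for src, mt, tgt in triples:
            model.update(src, mt, tgt, lr)
    return model

def build_post_editing_model(model, pretrain_triples, finetune_triples):
    # Step 210: pre-train on (original, simulated translation, post-edit)
    model = train(model, pretrain_triples, epochs=3)
    # Step 220: fine-tune on (original, machine translation, post-edit),
    # typically with a smaller learning rate and far less data
    model = train(model, finetune_triples, epochs=1, lr=0.1)
    return model
```

The design choice mirrors the patent's argument: the cheap, synthetic pre-training triples carry most of the training burden, so the scarce real fine-tuning triples only need to adapt the model.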
First, the initial model is pre-trained on a large number of sample pre-training originals, their sample post-edited translations, and their simulated translations to obtain the pre-trained editing model. The sample pre-training originals and their sample post-edited translations can be obtained by downloading public bilingual parallel corpus data from the web, such as the Chinese-English parallel corpora of United Nations government documents and the Conference on Machine Translation (WMT). Error simulation can then be applied to the bilingual parallel corpora to obtain simulated translations of the sample pre-training originals that mimic machine-translated output. Because pre-training only requires bilingual parallel corpora, with machine-translation-like simulated translations synthesized through error simulation, the difficulty of obtaining training data is greatly reduced, the cost of manually annotating post-edited translations is saved, the efficiency of the whole training process is improved, and the training difficulty is lowered.
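The synthesis of pre-training triples from an ordinary bilingual parallel corpus can be sketched as below. The `corrupt` callback stands in for the conventional error simulation described elsewhere in the patent; the whitespace tokenization is an illustrative assumption.

```python
import random

def make_pretrain_triples(parallel_pairs, corrupt, rng=None):
    """Turn (source, reference) bilingual pairs into
    (source, simulated_mt, reference) training triples by applying
    error simulation to the reference translation.
    """
    rng = rng or random.Random()
    triples = []
    for src, ref in parallel_pairs:
        simulated_mt = corrupt(ref.split(), rng)
        triples.append((src, " ".join(simulated_mt), ref))
    return triples
```

Only the parallel corpus is needed as input; no human-labeled machine-translation output is involved, which is the cost saving the text argues for.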
In addition, during pre-training, the model can learn from the sample pre-training originals, their sample post-edited translations, and their simulated translations the text errors that may occur in a translation, such as repeated words, inverted words, and omitted words, and learn how to correct those errors according to the original text to obtain a correct post-edited translation.
To further improve post-editing accuracy and better complete the post-editing task, the pre-trained editing model can be fine-tuned on sample fine-tuning originals, their sample post-edited translations, and their sample machine translations to obtain the post-editing model. The sample fine-tuning originals and their post-edited translations can likewise be obtained from bilingual parallel corpora. To improve fine-tuning accuracy, bilingual parallel corpora generated in a translation production environment may be used; each corpus entry comprises an original text to be translated and a high-quality translation produced by manual translation and review, from which the sample fine-tuning originals and high-quality sample post-edited translations are obtained. The sample machine translations contain the translation errors that arise in post-editing scenarios from the limitations of machine translation models in actual machine translation. By fine-tuning on the sample fine-tuning originals, their post-edited translations, and their machine translations, the post-editing model can learn, beyond conventional text errors, the translation errors that may occur in machine translation, improving its ability to locate and correct errors in post-editing scenarios and thus the accuracy of post-editing.
In addition, the amount of data required for fine-tuning is much smaller than in the pre-training stage, which reduces the difficulty of obtaining <original text, machine translation, post-edited translation> triples, further lowering the difficulty of model training and improving training efficiency.
With the method provided by embodiments of the invention, a pre-trained editing model is obtained by training on sample pre-training originals, their sample post-edited translations, and their simulated translations, and the post-editing model is then obtained by fine-tuning the pre-trained editing model on sample fine-tuning originals, their sample post-edited translations, and their sample machine translations. Through pre-training plus fine-tuning and the synthesis of translation data via error simulation, the training efficiency and training effect of the post-editing model are improved, as is the accuracy of post-editing.
Based on the above embodiment, the sample machine-translated text corresponds to at least one of the following error types: long-sentence translation errors, entity-name translation errors, and domain translation errors.
Specifically, so that the post-editing model can learn, during fine-tuning, the translation errors that arise in post-editing scenarios from the limitations of machine translation models, sample machine translations containing such errors can be obtained. In general, possible translation errors include long-sentence translation errors, entity-name translation errors, and domain translation errors. A long-sentence translation error occurs when a machine translation model cannot reasonably handle a long sentence; an entity-name translation error occurs when the model mistranslates an entity word, such as a person, place, or organization name, or a numeric word; a domain translation error arises when the model processes low-resource original text from a specialized professional domain whose domain differs from the one the model is suited to. Therefore, the obtained sample machine translations can correspond to at least one of these three error types.
Based on any of the above embodiments, the sample machine-translated text is determined in at least one of the following ways:
translating the sample fine-tuning original text by using a first machine translation model to obtain a sample machine translation text with a long sentence translation error type; the first machine translation model is obtained by training a first sample translation original text and a first sample translation text, the sample fine adjustment original text is a long sentence, and the first sample translation original text is a short sentence;
randomly modifying entity names in the sample fine-tuning post-edited translation to obtain a sample machine-translated text with the entity-name translation error type;
translating the sample fine-tuning original text by using a second machine translation model to obtain a sample machine translation text with a field translation error type; the second machine translation model is obtained by training based on a second sample translation original text which is different from the sample fine adjustment original text field and a second sample translation text thereof.
Specifically, for long-sentence translation errors, a first machine translation model can be trained on the first sample translation originals and their translations, and may be built on a single Transformer model. The first sample translation originals and their translations may be bilingual parallel corpora downloaded from the web. Here, each first sample translation original is a short passage, for example containing only one sentence. Because the first machine translation model is trained on short sentences, it is good only at translating short sentences; if long sentences are fed to it, the resulting translations are prone to long-sentence translation errors. Therefore, long passages, for example containing two or more sentences, are selected as sample fine-tuning originals and input into the first machine translation model to obtain sample machine translations with the long-sentence translation error type.
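The short/long data split behind this error synthesis can be sketched as follows. The punctuation-based sentence counter is a naive heuristic of my own (the patent does not specify how sentences are counted).

```python
import re

def sentence_count(text):
    """Naive sentence counter: split on ., !, ? and CJK 。！？."""
    return len([s for s in re.split(r"[.!?。！？]+", text) if s.strip()])

def split_for_long_sentence_errors(pairs):
    """Partition bilingual (source, target) pairs: single-sentence
    originals train the first machine translation model; multi-sentence
    originals become the sample fine-tuning originals whose machine
    translations will exhibit long-sentence errors.
    """
    short = [(s, t) for s, t in pairs if sentence_count(s) <= 1]
    long_ = [s for s, t in pairs if sentence_count(s) >= 2]
    return short, long_
```

Training only on the `short` pairs deliberately handicaps the model on the `long_` inputs, which is exactly the failure mode the fine-tuning data should contain.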
For entity-name translation errors, entity recognition can be performed on the sample fine-tuning post-edited translations with an entity recognition tool such as spaCy, for example on the English side of bilingual parallel corpora generated in a translation production environment. The text segments in the post-edited translations corresponding to entities, including person names, place names, organization names, and numbers, are identified and then randomly modified, for example deleted or replaced, to obtain sample machine translations with the entity-name translation error type.
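The entity corruption step can be sketched as below. Entity spans are assumed to come from an NER tool such as spaCy (as character offsets, e.g. `ent.start_char`/`ent.end_char`); the `<ent>` placeholder and the 50/50 delete-or-replace choice are illustrative assumptions.

```python
import random

def corrupt_entities(text, entity_spans, rng=None):
    """Randomly delete or replace entity mentions in a post-edited
    translation to synthesize entity-name translation errors.

    entity_spans: list of (start, end) character offsets, e.g. as
    produced by an NER tool such as spaCy.
    """
    rng = rng or random.Random()
    # process spans right-to-left so earlier offsets stay valid
    for start, end in sorted(entity_spans, reverse=True):
        if rng.random() < 0.5:
            text = text[:start] + text[end:]            # delete the entity
        else:
            text = text[:start] + "<ent>" + text[end:]  # replace it
    return text
```

In practice the replacement would draw from a list of plausible wrong entities rather than a fixed placeholder, so the corrupted translation still looks like genuine machine output.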
For domain translation errors, a second machine translation model can be trained on second sample translation originals and their translations, and may likewise be built on a single Transformer model. The second sample translation originals and their translations belong to a domain different from that of the sample fine-tuning originals. For example, the high-quality but narrow-domain bilingual parallel corpora of United Nations government documents can be downloaded from the web as the second sample translation originals and their translations. The trained second machine translation model is good only at translating text in the domain of its training data, so originals from other domains fed to it are prone to domain translation errors; the translations it produces for the sample fine-tuning originals can therefore serve as sample machine translations with the domain translation error type.
With these different data synthesis methods, sample machine translations corresponding to three different translation error types can be generated efficiently, eliminating the data annotation step from fine-tuning and further improving the training efficiency of the post-editing model.
Based on any of the above embodiments, the pre-trained editing model includes a pre-trained source-language encoder, a pre-trained translation-language encoder, and a decoder.
Specifically, the pre-trained editing model may include two encoders, a source-language encoder and a translation-language encoder, which encode the original text and the machine-translated text respectively, and a decoder that decodes the two encodings to correct errors in the machine-translated text and produce the post-edited translation. The source-language encoder, translation-language encoder, and decoder can each be built on a single Transformer model. The two encoders may themselves be obtained by pre-training, improving the pre-training efficiency of the editing model and thus the overall training efficiency of the post-editing model.
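The dual-encoder wiring can be shown as a structural sketch. The toy components below are placeholders of my own (whitespace "encoders", a pass-through "decoder"); in the patent each component is Transformer-based and the decoder attends to both encodings.

```python
class PostEditingModel:
    """Structural sketch: a source-language encoder, a
    translation-language encoder, and a decoder over both encodings."""

    def __init__(self, src_encoder, mt_encoder, decoder):
        self.src_encoder = src_encoder
        self.mt_encoder = mt_encoder
        self.decoder = decoder

    def post_edit(self, source, machine_translation):
        # encode the original text and the machine translation separately
        src_enc = self.src_encoder(source)
        mt_enc = self.mt_encoder(machine_translation)
        # the decoder corrects the translation using both encodings
        return self.decoder(src_enc, mt_enc)

# placeholder wiring: real components would be pre-trained Transformers
model = PostEditingModel(str.split, str.split,
                         lambda src_enc, mt_enc: " ".join(mt_enc))
```

The point of the composition is that the two encoders can be pre-trained independently on monolingual data before the whole model ever sees a (source, translation) pair.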
With this approach, the pre-trained source-language encoder, the pre-trained translation-language encoder, and the decoder together form the pre-trained editing model, further improving the overall training efficiency of the post-editing model.
Based on any of the above embodiments, the pre-trained original language encoder and the pre-trained translated language encoder are trained based on the sample monolingual text of the corresponding language and the sample error text obtained by performing conventional error simulation on the sample monolingual text.
Specifically, the original language encoder and the translation language encoder should learn to extract correct semantic information from erroneous text, so that the resulting original-text and translation encodings carry correct semantic information and the expressive capacity of the encodings is improved. To this end, each encoder may be trained on the sample monolingual text of the corresponding language, the corresponding sample error text, and a word vector model of that language. For example, if the original language is Chinese and the translation language is English, the original language encoder may be pre-trained on Chinese sample monolingual text, its corresponding sample error text, and a Chinese word vector model, while the translation language encoder may be pre-trained on English sample monolingual text, its corresponding sample error text, and an English word vector model. The sample monolingual text may be obtained by collecting large monolingual corpora; for example, common Chinese monolingual corpora such as Chinese Wikipedia and news corpora, and common English corpora such as English Wikipedia and news corpora, can be downloaded from the network. To reduce the difficulty of acquiring training data, part of the monolingual corpus, for example 20%, may be randomly selected, and conventional error simulation applied to the selected monolingual corpus, i.e. the sample monolingual text, to obtain sample error text containing conventional text errors.
According to the method provided by the embodiment of the present invention, the original language encoder and the translation language encoder are obtained by pre-training on the sample monolingual text of the corresponding language and on the sample error text obtained by applying conventional error simulation to that monolingual text, so that original-text and translation encodings containing correct semantic information can be produced, improving the expressive capacity of the encodings.
Based on any embodiment, the simulated translated text is determined based on the following steps:
Conventional error simulation is performed on the sample pre-training original text or on the sample pre-trained edited translation text to obtain the simulated translation text.
Specifically, part of the bilingual parallel corpora, for example 10%, may be randomly selected from a bilingual parallel corpus base, and conventional error simulation applied to the sample pre-training original text in each selected corpus to obtain a simulated translation text containing conventional text errors; the sample pre-trained edited translation text, the simulated translation text, and the sample pre-training original text in that bilingual parallel corpus then form one piece of training data for the pre-trained editing model. Alternatively, another portion of the bilingual parallel corpora, for example 10%, may be randomly selected, and conventional error simulation applied to the sample pre-trained edited translation text in each corpus to obtain a simulated translation text containing conventional text errors; the sample pre-training original text, the simulated translation text, and the sample pre-trained edited translation text in that bilingual parallel corpus then form one piece of training data for the pre-trained editing model.
Based on any of the above embodiments, performing a conventional error simulation specifically includes:
randomly selecting a plurality of text segments in the corresponding text, and deleting, rearranging, replacing or transferring the text segments.
Specifically, conventional text errors include missing words, reversed word order, repeated words, and the like. Therefore, when conventional error simulation is performed, several text segments in the corresponding text can be randomly selected, and each segment subjected to a deletion, rearrangement, replacement, or transfer operation. Deletion removes the whole segment; rearrangement reverses the order of the words within the segment; replacement overwrites the segment with a segment from another position in the text; and transfer exchanges the positions of the segment and a segment elsewhere in the text. For example, conventional error simulation can be performed as shown in the following table:
Original text: <zh> Today the weather is really good.
Deletion: <zh> Today is really good. (the segment "the weather" is deleted)
Rearrangement: <zh> Today weather the is really good. (the word order within "the weather" is reversed)
Replacement: <zh> Today really good is really good. (the segment "the weather" is overwritten by "really good")
Transfer: <zh> Today really good is the weather. (the segments "the weather" and "really good" exchange positions)
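The four corruption operations above can be sketched in Python as follows. The segment count and segment length parameters are illustrative assumptions; the patent does not specify them.

```python
import random

def corrupt(tokens, n_segments=1, seg_len=2, rng=None):
    """Apply one of the four corruption operations (delete, rearrange,
    replace, transfer) to randomly chosen segments of a token list."""
    rng = rng or random.Random(0)
    tokens = list(tokens)
    for _ in range(n_segments):
        if len(tokens) <= seg_len:
            break
        i = rng.randrange(len(tokens) - seg_len)          # segment start
        op = rng.choice(["delete", "rearrange", "replace", "transfer"])
        if op == "delete":                                # drop the whole segment
            del tokens[i:i + seg_len]
        elif op == "rearrange":                           # reverse word order inside it
            tokens[i:i + seg_len] = reversed(tokens[i:i + seg_len])
        elif op == "replace":                             # overwrite with another segment
            j = rng.randrange(len(tokens) - seg_len)
            tokens[i:i + seg_len] = tokens[j:j + seg_len]
        else:                                             # transfer: swap two segments
            j = rng.randrange(len(tokens) - seg_len)
            tokens[i:i + seg_len], tokens[j:j + seg_len] = (
                tokens[j:j + seg_len], tokens[i:i + seg_len])
    return tokens

print(corrupt("today the weather is really good".split()))
```

The same function covers both uses of conventional error simulation in this document: corrupting monolingual text for encoder pre-training and corrupting one side of a bilingual pair to synthesize simulated translation texts.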
Based on any one of the above embodiments, another embodiment of the present invention provides a post-editing model construction method. The method comprises the following steps:
firstly, collecting corpus data required by model training, including:
Bilingual parallel corpora generated in the translation production environment are accumulated and denoted bilingual parallel corpus C. Each corpus entry comprises an original text to be translated and a high-quality translation produced after human translation and review.
Publicly available bilingual parallel corpora, such as the United Nations and WMT parallel corpora, are downloaded from the network and denoted bilingual parallel corpus T.
Publicly available monolingual corpora of the original language, such as Chinese Wikipedia and news corpora, are downloaded from the network and denoted monolingual corpus Z.
Publicly available monolingual corpora of the translation language, such as English Wikipedia and news corpora, are downloaded from the network and denoted monolingual corpus E.
Word segmentation is then applied to all corpora. English corpora can be segmented with the spaCy tool; Chinese corpora can be segmented at the character level using grammar rules, i.e. individual Chinese characters, runs of consecutive digits or Latin letters, punctuation marks, and the like each become one token. A language identifier is then added to the beginning of each corpus, as shown in the following table:
[Table image: Figure BDA0002751616760000131 — example corpora with language identifiers prepended]
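A minimal sketch of the character-level Chinese segmentation and language-tag prepending described above. The regex is an illustrative approximation of the grammar rules (single Chinese characters, runs of digits or Latin letters, and punctuation each become one token); English corpora would instead be tokenized with spaCy.

```python
import re

def tokenize_zh(text, lang_tag="<zh>"):
    """Character-level Chinese segmentation with a language identifier
    prepended. Order of alternatives matters: digit and letter runs are
    matched first, then single CJK characters, then any other non-space
    character (punctuation)."""
    pattern = re.compile(r"[0-9]+|[A-Za-z]+|[\u4e00-\u9fff]|[^\s]")
    return [lang_tag] + pattern.findall(text)

print(tokenize_zh("GDP增长8.5%！"))
# ['<zh>', 'GDP', '增', '长', '8', '.', '5', '%', '！']
```

Note that this approximation splits "8.5" at the decimal point; a production rule set would likely keep full numbers together.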
Based on the segmented corpus data, word vectors are trained for the original language and the translation language respectively using the Skip-Gram algorithm. Here, the word vector dimension may be set to 300 and the context window to 5.
20% of the corpora are randomly extracted from Z and subjected to conventional error simulation, synthesizing parallel corpora consisting of the possibly corrupted corpus and the original corpus; combined with the word vector model of the original language, these are used to pre-train the original language encoder of a standard Transformer model.
Likewise, 20% of the corpora are randomly extracted from E and subjected to conventional error simulation, synthesizing parallel corpora consisting of the possibly corrupted corpus and the original corpus; combined with the word vector model of the translation language, these are used to pre-train the translation language encoder of a standard Transformer model.
10% of the corpora are randomly extracted from T, and conventional error simulation is applied to the original-text corpora to generate ternary corpora (possibly corrupted original text, original translation, original original text). Similarly, another 10% of the corpora are randomly extracted from T, and conventional error simulation is applied to the translation corpora to generate ternary corpora (original original text, possibly corrupted translation, original translation). The synthesized ternary parallel corpora are then used to pre-train a network with dual Transformer encoders and a single Transformer decoder, yielding the pre-trained editing model. The dual Transformer encoders are the original language encoder and the translation language encoder.
Subsequently, training data acquisition of the fine tuning task is performed, including:
a) Using a rule-based Chinese sentence-breaking method, the original-text corpora in C are split into sentences, and the bilingual parallel corpora whose original text contains two or more sentences are screened out to form a subset C1. Similarly, sentence breaking is performed on the original-text corpora in T, and the bilingual parallel corpora whose original text contains exactly one sentence are screened out to form another subset T1. Using corpus T1, a machine translation engine based on a Transformer model is constructed. The original-text corpora of C1 are then input into the model and decoded to produce machine translations, yielding triples (C1 original text, machine-translated translation, C1 translation).
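The sentence-count filtering in step a) can be sketched as follows. The punctuation-based splitting rule is a simplified stand-in for the patent's Chinese sentence-breaking rules.

```python
import re

def count_sentences_zh(text):
    """Rule-based Chinese sentence counting: split on runs of
    sentence-final punctuation and count non-empty pieces."""
    parts = [s for s in re.split(r"[。！？!?]+", text) if s.strip()]
    return len(parts)

# Toy stand-in for bilingual parallel corpus C: (original, translation) pairs.
corpus_C = [("今天天气真好。我们出去玩吧。", "It's sunny today. Let's go out."),
            ("你好。", "Hello.")]

C1 = [p for p in corpus_C if count_sentences_zh(p[0]) >= 2]  # multi-sentence originals
T1 = [p for p in corpus_C if count_sentences_zh(p[0]) == 1]  # single-sentence originals
print(len(C1), len(T1))  # 1 1
```

The same filter applied to T (with the `== 1` condition) would yield the short-sentence training set for the first machine translation engine.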
b) Using the spaCy tool, entity recognition is performed on the translation corpora in C, and the bilingual parallel corpora C2 containing entities such as person names, place names, organizations, and numbers are screened out. The entity nouns in the C2 translation corpora are randomly modified, e.g. deleted or replaced, producing triples (C2 original text, translation with corrupted entity nouns, C2 translation).
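The entity-noun corruption in step b) can be sketched as follows. For self-containment the entity spans are supplied directly rather than computed with spaCy NER (in practice they would come from `doc.ents`); the `<ENT>` placeholder token is an illustrative assumption.

```python
import random

def corrupt_entities(tokens, entity_spans, rng=None):
    """Randomly delete or replace one entity mention.
    `entity_spans` are (start, end) token index pairs, e.g. derived
    from spaCy named entity recognition."""
    rng = rng or random.Random(0)
    tokens = list(tokens)
    start, end = rng.choice(entity_spans)
    if rng.random() < 0.5:
        del tokens[start:end]              # delete the entity outright
    else:
        tokens[start:end] = ["<ENT>"]      # replace it with a placeholder
    return tokens

sent = "Alice flew to Paris on Monday".split()
print(corrupt_entities(sent, [(0, 1), (3, 4)]))
```

Each corrupted translation, paired with its original text and clean reference translation, forms one fine-tuning triple for the entity-name error type.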
c) The United Nations bilingual parallel corpora are screened out of T, and a machine translation engine based on a Transformer model is constructed. A subset C3 is extracted from C, and the original-text corpora of C3 are input into the model and decoded to produce machine translations, yielding triples (C3 original text, machine-translated translation, C3 translation).
The triples generated in a), b), and c) are collected to form the full training data for the fine-tuning task, and the pre-trained editing model is fine-tuned on them to obtain the final post-editing model.
The following describes the post-translation editing apparatus provided in the embodiment of the present invention, and the post-translation editing apparatus described below and the post-translation editing method described above may be referred to in correspondence with each other.
Based on any of the above embodiments, fig. 3 is a schematic structural diagram of a post-translation editing apparatus according to an embodiment of the present invention, as shown in fig. 3, the apparatus includes: a translation determining unit 310 and a post-editing unit 320.
The translation determining unit 310 is configured to determine a machine translation text to be edited;
the post-editing unit 320 is configured to input the machine-translated translation text and the original text corresponding to the machine-translated translation text into a post-editing model, and obtain a post-editing translation text output by the post-editing model;
the post-editing model is obtained by fine-tuning the pre-trained editing model based on a sample fine-tuning original text, a sample fine-tuned edited translation text of the sample fine-tuning original text, and a sample machine translation text of the sample fine-tuning original text;
the pre-training editing model is obtained by training a sample pre-training original text, a sample pre-training edited translated text and a simulated translated text of the sample pre-training original text.
According to the apparatus provided by the embodiment of the present invention, the pre-trained editing model is obtained by training on the sample pre-training original text, its sample pre-trained edited translation text, and its simulated translation text, and the post-editing model is obtained by fine-tuning on the sample fine-tuning original text, its sample fine-tuned edited translation text, and its sample machine translation text. Through the pre-training plus fine-tuning scheme and the error simulation used to synthesize translation data, the training efficiency and training effect of the post-editing model are improved, as is the accuracy of post-editing.
Based on any embodiment, the sample machine translation text corresponds to at least one error type of long sentence translation errors, entity name translation errors and domain translation errors.
Based on any embodiment, the sample machine translation text is determined based on at least one of the following modes:
translating the sample fine-tuning original text by using a first machine translation model to obtain a sample machine translation text with a long sentence translation error type; the first machine translation model is obtained by training a first sample translation original text and a first sample translation text, the sample fine adjustment original text is a long sentence, and the first sample translation original text is a short sentence;
randomly modifying the entity name in the edited translated text after the sample is finely adjusted to obtain a sample machine translation translated text with the entity name translation error type;
translating the sample fine-tuning original text by using a second machine translation model to obtain a sample machine translation text with a field translation error type; the second machine translation model is obtained by training based on a second sample translation original text which is different from the sample fine adjustment original text field and a second sample translation text thereof.
The device provided by the embodiment of the invention can efficiently generate the sample machine translation translated text corresponding to three different translation error types through different data synthesis modes, saves the data marking process in the fine tuning process, and can further improve the training efficiency of the post-editing model.
Based on any of the above embodiments, the pre-trained post-editing model includes a pre-trained original language encoder, a pre-trained translated language encoder, and a decoder.
The device provided by the embodiment of the invention constructs the pre-trained post-editing model through the pre-trained original language encoder, the pre-trained translated language encoder and the decoder, thereby further improving the overall training efficiency of the post-editing model.
Based on any of the above embodiments, the pre-trained original language encoder and the pre-trained translated language encoder are trained based on the sample monolingual text of the corresponding language and the sample error text obtained by performing conventional error simulation on the sample monolingual text.
In the apparatus provided by the embodiment of the present invention, the original language encoder and the translation language encoder are obtained by pre-training on the sample monolingual text of the corresponding language and on the sample error text obtained by applying conventional error simulation to that monolingual text, so that original-text and translation encodings containing correct semantic information can be produced, improving the expressive capacity of the encodings.
Based on any embodiment, the simulated translated text is determined based on the following steps:
Conventional error simulation is performed on the sample pre-training original text or on the sample pre-trained edited translation text to obtain the simulated translation text.
Based on any of the above embodiments, the apparatus further comprises a conventional error simulation unit for:
randomly selecting a plurality of text segments in the corresponding text, and deleting, rearranging, replacing or transferring the text segments.
Fig. 4 illustrates a schematic diagram of the physical structure of an electronic device. As shown in fig. 4, the electronic device may include: a processor 410, a communication interface 420, a memory 430 and a communication bus 440, wherein the processor 410, the communication interface 420 and the memory 430 communicate with one another via the communication bus 440. The processor 410 may call logic instructions in the memory 430 to perform a post-translation editing method comprising: determining a machine translation text to be edited; inputting the machine translation text and the corresponding original text into a post-editing model to obtain a post-editing translation text output by the post-editing model; the post-editing model is obtained by fine-tuning a pre-trained editing model based on a sample fine-tuning original text, a sample fine-tuned edited translation text of the sample fine-tuning original text, and a sample machine translation text of the sample fine-tuning original text; the pre-trained editing model is obtained by training on a sample pre-training original text, a sample pre-trained edited translation text, and a simulated translation text of the sample pre-training original text.
In addition, the logic instructions in the memory 430 may be implemented in the form of software functional units and stored in a computer readable storage medium when the software functional units are sold or used as independent products. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
In another aspect, an embodiment of the present invention further provides a computer program product, where the computer program product includes a computer program stored on a non-transitory computer-readable storage medium, the computer program includes program instructions, and when the program instructions are executed by a computer, the computer can execute the method for editing after translation provided by the above-mentioned method embodiments, where the method includes: determining a machine translation text to be edited; inputting the machine translation translated text and the corresponding original text into a post-editing model to obtain a post-editing translated text output by the post-editing model; the post-editing model is obtained by fine-tuning a pre-trained editing model based on a sample fine-tuning original text, a sample fine-tuning edited translated text of the sample fine-tuning original text, and a sample machine translated text of the sample fine-tuning original text; the pre-training editing model is obtained by training a sample pre-training original text, a sample pre-training edited translated text and a simulated translated text of the sample pre-training original text.
In another aspect, an embodiment of the present invention further provides a non-transitory computer-readable storage medium, on which a computer program is stored, where the computer program is implemented to perform the method for editing after translation provided by the foregoing embodiments when executed by a processor, where the method includes: determining a machine translation text to be edited; inputting the machine translation translated text and the corresponding original text into a post-editing model to obtain a post-editing translated text output by the post-editing model; the post-editing model is obtained by fine-tuning a pre-trained editing model based on a sample fine-tuning original text, a sample fine-tuning edited translated text of the sample fine-tuning original text, and a sample machine translated text of the sample fine-tuning original text; the pre-training editing model is obtained by training a sample pre-training original text, a sample pre-training edited translated text and a simulated translated text of the sample pre-training original text.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for editing a translated text, comprising:
determining a machine translation text to be edited;
inputting the machine translation translated text and the corresponding original text into a post-editing model to obtain a post-editing translated text output by the post-editing model;
the post-editing model is obtained by fine-tuning a pre-trained editing model based on a sample fine-tuning original text, a sample fine-tuning edited translated text of the sample fine-tuning original text, and a sample machine translated text of the sample fine-tuning original text;
the pre-training editing model is obtained by training a sample pre-training original text, a sample pre-training edited translated text and a simulated translated text of the sample pre-training original text.
2. The post-translation editing method according to claim 1, wherein the sample machine translation text corresponds to at least one type of error among a long sentence translation error, an entity name translation error, and a domain translation error.
3. The method of post-translation editing according to claim 2, wherein the sample machine translation of the translation text is determined based on at least one of:
translating the sample fine-tuning original text by using a first machine translation model to obtain a sample machine translation text with a long sentence translation error type; the first machine translation model is obtained by training a first sample translation original text and a first sample translation text thereof, the sample fine-tuning original text is a long sentence, and the first sample translation original text is a short sentence;
randomly modifying the entity name in the edited translated text after the sample is finely adjusted to obtain a sample machine translation translated text with the entity name translation error type;
translating the sample fine-tuning original text by using a second machine translation model to obtain a sample machine translation text with a field translation error type; the second machine translation model is obtained by training based on a second sample translation original text which is different from the sample fine tuning original text field and a second sample translation text thereof.
4. The method of claim 1, wherein the pre-trained post-editing model comprises a pre-trained original language encoder, a pre-trained translation language encoder, and a decoder.
5. The method of post-translation editing according to claim 4, wherein the pre-trained original language encoder and the pre-trained translation language encoder are trained based on a sample monolingual text of a corresponding language and a sample error text obtained by performing a conventional error simulation on the sample monolingual text.
6. The method of post-translation editing according to claim 1, wherein the simulated translation text is determined based on the steps of:
and performing conventional error simulation on the sample pre-training original text or the sample pre-training edited translated text to obtain the simulated translated text.
7. The method for editing after-translation of claim 5 or 6, wherein the performing of the routine error simulation specifically comprises:
randomly selecting a plurality of text segments in the corresponding text, and deleting, rearranging, replacing or transferring the text segments.
8. A post-translation editing apparatus, comprising:
the translation determining unit is used for determining a machine translation text to be edited;
the post-editing unit is used for inputting the machine translation text and the corresponding original text into a post-editing model to obtain a post-editing translation text output by the post-editing model;
the post-editing model is obtained by fine-tuning a pre-trained editing model based on a sample fine-tuning original text, a sample fine-tuning edited translated text of the sample fine-tuning original text, and a sample machine translated text of the sample fine-tuning original text;
the pre-training editing model is obtained by training a sample pre-training original text, a sample pre-training edited translated text and a simulated translated text of the sample pre-training original text.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method for post-compilation of a translated version according to any of claims 1 to 7 are implemented when the program is executed by the processor.
10. A non-transitory computer readable storage medium, having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the post-translation editing method according to any one of claims 1 to 7.
CN202011186869.1A 2020-10-29 2020-10-29 Post-translation editing method and device, electronic equipment and storage medium Active CN112287696B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011186869.1A CN112287696B (en) 2020-10-29 2020-10-29 Post-translation editing method and device, electronic equipment and storage medium
PCT/CN2021/078814 WO2022088570A1 (en) 2020-10-29 2021-03-03 Method and apparatus for post-editing of translation, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011186869.1A CN112287696B (en) 2020-10-29 2020-10-29 Post-translation editing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112287696A true CN112287696A (en) 2021-01-29
CN112287696B CN112287696B (en) 2024-02-23

Family

ID=74352729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011186869.1A Active CN112287696B (en) 2020-10-29 2020-10-29 Post-translation editing method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112287696B (en)
WO (1) WO2022088570A1 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112836528A (en) * 2021-02-07 2021-05-25 语联网(武汉)信息技术有限公司 Machine translation post-editing method and system
CN114091483A (en) * 2021-10-27 2022-02-25 北京百度网讯科技有限公司 Translation processing method and device, electronic equipment and storage medium
WO2022088570A1 (en) * 2020-10-29 2022-05-05 语联网(武汉)信息技术有限公司 Method and apparatus for post-editing of translation, electronic device, and storage medium
CN117273027A (en) * 2023-11-22 2023-12-22 四川语言桥信息技术有限公司 Automatic machine translation post-verification method based on translation error correction

Citations (5)

Publication number Priority date Publication date Assignee Title
US20170091177A1 (en) * 2015-09-30 2017-03-30 Kabushiki Kaisha Toshiba Machine translation apparatus, machine translation method and computer program product
CN109670191A (en) * 2019-01-24 2019-04-23 语联网(武汉)信息技术有限公司 Calibration optimization method, device and the electronic equipment of machine translation
CN111144137A (en) * 2019-12-17 2020-05-12 语联网(武汉)信息技术有限公司 Method and device for generating edited model corpus after machine translation
CN111382580A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 Encoder-decoder framework pre-training method for neural machine translation
CN111597778A (en) * 2020-04-15 2020-08-28 哈尔滨工业大学 Method and system for automatically optimizing machine translation based on self-supervision

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN105740218A (en) * 2015-12-31 2016-07-06 成都数联铭品科技有限公司 Post-editing processing method for mechanical translation
US10558762B2 (en) * 2018-02-24 2020-02-11 International Business Machines Corporation System and method for adaptive quality estimation for machine translation post-editing
CN112287696B (en) * 2020-10-29 2024-02-23 语联网(武汉)信息技术有限公司 Post-translation editing method and device, electronic equipment and storage medium

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
US20170091177A1 (en) * 2015-09-30 2017-03-30 Kabushiki Kaisha Toshiba Machine translation apparatus, machine translation method and computer program product
CN109670191A (en) * 2019-01-24 2019-04-23 语联网(武汉)信息技术有限公司 Calibration optimization method, device and the electronic equipment of machine translation
CN111144137A (en) * 2019-12-17 2020-05-12 语联网(武汉)信息技术有限公司 Method and device for generating edited model corpus after machine translation
CN111382580A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 Encoder-decoder framework pre-training method for neural machine translation
CN111597778A (en) * 2020-04-15 2020-08-28 哈尔滨工业大学 Method and system for automatically optimizing machine translation based on self-supervision

Cited By (7)

Publication number Priority date Publication date Assignee Title
WO2022088570A1 (en) * 2020-10-29 2022-05-05 语联网(武汉)信息技术有限公司 Method and apparatus for post-editing of translation, electronic device, and storage medium
CN112836528A (en) * 2021-02-07 2021-05-25 语联网(武汉)信息技术有限公司 Machine translation post-editing method and system
WO2022166267A1 (en) * 2021-02-07 2022-08-11 语联网(武汉)信息技术有限公司 Machine translation post-editing method and system
CN112836528B (en) * 2021-02-07 2023-10-03 Iol Wuhan Information Technology Co., Ltd. Machine translation post-editing method and system
CN114091483A (en) * 2021-10-27 2022-02-25 北京百度网讯科技有限公司 Translation processing method and device, electronic equipment and storage medium
CN117273027A (en) * 2023-11-22 2023-12-22 Sichuan Lan-bridge Information Technology Co., Ltd. Automatic machine translation post-verification method based on translation error correction
CN117273027B (en) * 2023-11-22 2024-04-30 Sichuan Lan-bridge Information Technology Co., Ltd. Automatic machine translation post-verification method based on translation error correction

Also Published As

Publication number Publication date
WO2022088570A1 (en) 2022-05-05
CN112287696B (en) 2024-02-23

Similar Documents

Publication Publication Date Title
CN112287696B (en) Post-translation editing method and device, electronic equipment and storage medium
US11113234B2 (en) Semantic extraction method and apparatus for natural language, and computer storage medium
CN109840331B (en) Neural machine translation method based on user dictionary
US7853444B2 (en) Method and apparatus for training transliteration model and parsing statistic model, method and apparatus for transliteration
CN112766000B (en) Machine translation method and system based on pre-training model
CN112329447B (en) Training method of Chinese error correction model, Chinese error correction method and device
US8874433B2 (en) Syntax-based augmentation of statistical machine translation phrase tables
CN111539229A (en) Neural machine translation model training method, neural machine translation method and device
CN111144140B (en) Chinese-Thai bilingual corpus generation method and device based on zero-shot learning
CN112541365B (en) Machine translation method and device based on term replacement
CN110211562B (en) Voice synthesis method, electronic equipment and readable storage medium
CN105808528A (en) Document character processing method
CN112818712B (en) Machine translation method and device based on translation memory library
CN111144137B (en) Method and device for generating corpus for a machine translation post-editing model
CN111539199A (en) Text error correction method, device, terminal and storage medium
CN109657244B (en) English long sentence automatic segmentation method and system
CN112766001A (en) Enterprise name translation method and device
CN115718904A (en) Text processing method and device
CN112836528B (en) Machine translation post-editing method and system
CN114861628A (en) System, method, electronic device and storage medium for training machine translation model
CN114185573A (en) Implementation and online updating system and method for human-computer interaction machine translation system
CN117034968B (en) Neural machine translation method, device, electronic equipment and medium
CN110287496A (en) Neural-network-based English-to-Chinese word sense disambiguation method
CN116522966B (en) Text translation method and system based on multilingual vocabulary entry
CN111652004B (en) Fusion method and device for machine translation system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant