CN112287696B - Post-translation editing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112287696B
CN112287696B (application CN202011186869.1A)
Authority
CN
China
Prior art keywords
text, translation, sample, post, editing
Legal status
Active
Application number
CN202011186869.1A
Other languages
Chinese (zh)
Other versions
CN112287696A (en)
Inventor
张睦
Current Assignee
Iol Wuhan Information Technology Co ltd
Original Assignee
Iol Wuhan Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Iol Wuhan Information Technology Co ltd filed Critical Iol Wuhan Information Technology Co ltd
Priority to CN202011186869.1A priority Critical patent/CN112287696B/en
Publication of CN112287696A publication Critical patent/CN112287696A/en
Priority to PCT/CN2021/078814 priority patent/WO2022088570A1/en
Application granted granted Critical
Publication of CN112287696B publication Critical patent/CN112287696B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language
    • G06F40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/10: Text processing
    • G06F40/166: Editing, e.g. inserting or deleting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the invention provide a post-translation editing method and apparatus. The method includes: determining a machine translation to be edited; and inputting the machine translation and its corresponding original text into a post-editing model to obtain the post-edited translation output by the model. The post-editing model is obtained by fine-tuning a pre-trained post-editing model on sample fine-tuning original texts, their sample post-edited translations, and sample machine translations of the sample fine-tuning original texts. The pre-trained post-editing model is trained on sample pre-training original texts, their sample post-edited translations, and simulated translations of the sample pre-training original texts. Through pre-training followed by fine-tuning, and through error simulation that synthesizes translation data, the method and apparatus improve the training efficiency and training effect of the post-editing model and the accuracy of post-editing.

Description

Post-translation editing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method and apparatus for post-translation editing, an electronic device, and a storage medium.
Background
Post-editing refers to the process of, given an original text to be translated, retrieving the corresponding machine translation result and having a translator modify and polish it, thereby improving translation quality. The machine translation result gives the translator a reference, so that the translator does not have to translate from scratch, reducing the translator's workload.
In practice, when the machine translation result differs greatly from the expected translation, post-editing forces the translator to make many modifications and edits, which actually increases the workload. For example, a machine translation model performs poorly on texts to be translated that come from low-resource, specialized domains, producing results far from the correct translation. It is also prone to mistranslating entity words such as person names, place names, organization names, or number words. And when it cannot reasonably handle long sentences, the accuracy of the machine translation result again suffers and a large amount of post-editing work is required. Automatic post-editing models therefore play an increasingly important role in computer-assisted translation. Given the original text to be translated and its machine translation, a post-editing model automatically post-edits the machine translation, correcting its errors and outputting a post-edited translation; by narrowing the gap between the output and the translation the translator expects, it further reduces the translator's workload.
However, existing methods for training post-editing models require a large number of parallel triples, each consisting of an original text, its machine translation, and its post-edited translation. Such triple training data is hard to obtain and requires substantial manual annotation, so the training effect and training efficiency of post-editing models are poor, and in turn the accuracy of post-editing suffers.
Disclosure of Invention
Embodiments of the invention provide a post-translation editing method and apparatus, an electronic device, and a storage medium, to overcome the poor training effect, low training efficiency, and poor post-editing accuracy of post-editing models in the prior art.
The embodiment of the invention provides a method for editing a translated text, which comprises the following steps:
determining a machine translation text to be edited;
inputting the machine translation text and the corresponding original text into a post-editing model to obtain a post-editing translation text output by the post-editing model;
the post-editing model is obtained by fine-tuning a pre-trained post-editing model based on a sample fine-tuning original text, its sample post-edited translation, and a sample machine translation of the sample fine-tuning original text;
the pre-trained post-editing model is trained based on a sample pre-training original text, its sample post-edited translation, and a simulated translation of the sample pre-training original text.
According to the post-translation editing method of an embodiment of the invention, the sample machine translation corresponds to at least one of the following error types: long-sentence translation errors, entity-name translation errors, and domain translation errors.
According to a post-translation editing method of one embodiment of the present invention, the sample machine-translated translation text is determined based on at least one of the following:
translating the sample fine-tuning original text with a first machine translation model to obtain a sample machine translation of the long-sentence translation error type, where the first machine translation model is trained on first sample original texts and their first sample translations, the sample fine-tuning original texts are long sentences, and the first sample original texts are short sentences;
randomly modifying entity names in the sample post-edited translation used for fine-tuning, to obtain a sample machine translation of the entity-name translation error type;
translating the sample fine-tuning original text with a second machine translation model to obtain a sample machine translation of the domain translation error type, where the second machine translation model is trained on second sample original texts and their second sample translations, which belong to a domain different from that of the sample fine-tuning original texts.
According to a post-translation editing method of one embodiment of the invention, the pre-trained post-editing model includes a pre-trained original-language encoder, a pre-trained translation-language encoder, and a decoder.
According to the post-translation editing method of an embodiment of the invention, the pre-trained original-language encoder and the pre-trained translation-language encoder are trained on sample monolingual texts of the corresponding language and on sample error texts obtained by applying conventional error simulation to those monolingual texts.
According to the post-translation editing method of one embodiment of the invention, the simulated translation text is determined based on the following steps:
performing conventional error simulation on the sample pre-training original text or on the sample pre-training post-edited translation, to obtain the simulated translation.
According to one embodiment of the invention, the method for performing conventional error simulation specifically comprises the following steps:
randomly selecting several text fragments in the corresponding text and deleting, reordering, replacing, or moving them.
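The fragment operations just described can be sketched in a few lines of Python. This is only an illustrative reading of "conventional error simulation": the fragment count, fragment length, and the use of a repeat operation as one form of replacement are assumptions, not specified in the text.

```python
import random

def simulate_errors(tokens, num_fragments=2, max_len=3, rng=None):
    """Corrupt a token sequence by randomly deleting, reordering,
    repeating, or moving short fragments (a sketch of the patent's
    'conventional error simulation'; the parameters are assumptions)."""
    rng = rng or random.Random()
    tokens = list(tokens)
    for _ in range(num_fragments):
        if len(tokens) < 2:
            break
        start = rng.randrange(len(tokens))
        end = min(start + rng.randint(1, max_len), len(tokens))
        frag = tokens[start:end]
        op = rng.choice(["delete", "shuffle", "repeat", "move"])
        if op == "delete":            # simulates omitted words
            tokens[start:end] = []
        elif op == "shuffle":         # simulates reversed word order
            rng.shuffle(frag)
            tokens[start:end] = frag
        elif op == "repeat":          # simulates word repetition
            tokens[start:end] = frag + frag
        elif op == "move":            # transfer the fragment elsewhere
            tokens[start:end] = []
            pos = rng.randrange(len(tokens) + 1)
            tokens[pos:pos] = frag
    return tokens
```

Seeding the random generator makes the corruption reproducible, which is useful when regenerating the same synthetic corpus across training runs.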
An embodiment of the invention further provides a post-translation editing apparatus, which includes:
a translation determining unit for determining a machine translation text to be edited;
the post-editing unit is used for inputting the machine translation text and the corresponding original text into a post-editing model to obtain a post-editing translation text output by the post-editing model;
the post-editing model is obtained by fine-tuning a pre-trained post-editing model based on a sample fine-tuning original text, its sample post-edited translation, and a sample machine translation of the sample fine-tuning original text;
the pre-trained post-editing model is trained based on a sample pre-training original text, its sample post-edited translation, and a simulated translation of the sample pre-training original text.
An embodiment of the invention further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the program, implements the steps of any of the post-translation editing methods described above.
The embodiments of the present invention also provide a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the post-translation editing methods described above.
According to the post-translation editing method and apparatus, the electronic device, and the storage medium provided by the embodiments of the invention, the pre-trained post-editing model is trained on sample pre-training original texts, their sample post-edited translations, and simulated translations of the pre-training original texts; the post-editing model is then obtained by fine-tuning the pre-trained model on sample fine-tuning original texts, their sample post-edited translations, and sample machine translations of those original texts. Through pre-training followed by fine-tuning, and through error simulation that synthesizes translation data, the training efficiency and training effect of the post-editing model are improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of a post-translation editing method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a training method for a post-translation editing model according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a post-translation editing apparatus according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Post-editing refers to the process of, given an original text to be translated, retrieving the corresponding machine translation result and having a translator modify and polish it, thereby improving translation quality. The machine translation result gives the translator a reference, so that the translator does not have to translate from scratch, reducing the translator's workload. However, when the machine translation result is far from the expected translation, post-editing forces the translator to make many modifications and edits, which increases the workload. For example, when a machine translation model processes texts from low-resource, specialized domains, when it mistranslates entity words such as person names, place names, or organization names, or when it cannot reasonably handle long sentences, the translation quality is poor, the result is far from the correct translation, and a large amount of post-editing work is required. Automatic post-editing models therefore play an increasingly important role in computer-assisted translation.
However, existing methods for training post-editing models require a large number of parallel triples, each consisting of an original text, its machine translation, and its post-edited translation. Such triple training data is hard to obtain and requires substantial manual annotation, so the training effect and training efficiency of post-editing models are poor, and in turn the accuracy of post-editing suffers.
In this regard, the embodiment of the invention provides a post-translation editing method. Fig. 1 is a flow chart of a post-translation editing method according to an embodiment of the present invention, as shown in fig. 1, the method includes:
step 110, determining a machine translation text to be edited;
step 120, inputting the machine translation text and the corresponding original text into the post-editing model to obtain the post-editing translation text output by the post-editing model;
the post-editing model is obtained by fine-tuning the pre-trained post-editing model based on the sample fine-tuning original text, its sample post-edited translation, and the sample machine translation of the sample fine-tuning original text;
the pre-trained post-editing model is trained based on a sample pre-training original text, its sample post-edited translation, and a simulated translation of the sample pre-training original text.
Specifically, the machine translation text corresponding to the original text is obtained for automatic post-editing by the post-editing model. The machine translation text may be obtained by inputting the original text into a machine translation model for translation.
The machine translation and its corresponding original text are then input into the post-editing model, which can correct errors in the machine translation based on the semantic information of the original text and of the machine translation, yielding a corrected, post-edited translation. The post-edited translation is in the same language as the machine translation.
The post-editing model is obtained by fine-tuning the pre-trained post-editing model based on the sample fine-tuning original text, its sample post-edited translation, and the sample machine translation of the sample fine-tuning original text; the pre-trained post-editing model is trained based on the sample pre-training original text, its sample post-edited translation, and the simulated translation of the sample pre-training original text.
Here, when editing the model after training, a pre-training and fine tuning method is adopted. Fig. 2 is a flow chart of a training method for a post-translation editing model according to an embodiment of the present invention, where, as shown in fig. 2, the training method for a post-translation editing model includes:
Step 210: pre-train an initial model on the sample pre-training original texts, their sample post-edited translations, and the simulated translations of the sample pre-training original texts, to obtain the pre-trained post-editing model;
Step 220: fine-tune the pre-trained post-editing model on the sample fine-tuning original texts, their sample post-edited translations, and the sample machine translations of the sample fine-tuning original texts, to obtain the post-editing model.
First, the initial model is pre-trained on a large number of sample pre-training original texts, their sample post-edited translations, and simulated translations, yielding the pre-trained post-editing model. The sample pre-training original texts and their post-edited translations can be obtained by downloading public bilingual parallel corpora from the web, such as Chinese-English parallel corpora from official government documents or from the Conference on Machine Translation (WMT). Error simulation is then performed on the bilingual parallel corpus to obtain simulated translations of the sample pre-training original texts, imitating machine-translated output. Because pre-training only requires bilingual parallel corpora, with simulated translations resembling machine output synthesized through error simulation, the difficulty of obtaining training data is greatly reduced, the cost of manually producing post-edited translations is saved, the efficiency of the whole training process is improved, and the training difficulty is reduced.
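As a concrete illustration of this synthesis step, the sketch below turns bilingual (original, reference translation) pairs into original / simulated translation / reference triples by corrupting the reference. The corruption function here just deletes one random word; it is a stand-in assumption, not the patent's actual error simulator.

```python
import random

def drop_one_word(text, rng=None):
    """Minimal stand-in corruption: delete one random word."""
    rng = rng or random.Random(42)
    words = text.split()
    if len(words) > 1:
        del words[rng.randrange(len(words))]
    return " ".join(words)

def build_pretraining_triples(bitext, corrupt=drop_one_word):
    """Build (original, simulated translation, reference) triples from
    bilingual (original, reference) pairs; no machine translations or
    manual post-editing are needed for this stage."""
    return [(src, corrupt(ref), ref) for src, ref in bitext]
```

The model is then trained to map (original, simulated translation) back to the reference, which is exactly the post-editing objective.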
In addition, during pre-training, from the sample pre-training original texts, their post-edited translations, and the simulated translations, the model can learn the text errors that may appear in a translation, such as repeated words, reversed word order, and omitted words, and learn how to correct those errors according to the original text so as to produce a correct post-edited translation.
To further improve post-editing accuracy, the pre-trained post-editing model is fine-tuned on the sample fine-tuning original texts, their sample post-edited translations, and the sample machine translations of the fine-tuning original texts. The sample fine-tuning original texts and their post-edited translations can likewise be obtained from bilingual parallel corpora. Here, to improve fine-tuning accuracy, bilingual parallel corpora produced in a real translation production environment can be used, where each pair consists of an original text to be translated and a high-quality translation produced and verified by human translators; from such corpora, sample fine-tuning original texts and high-quality sample post-edited translations are obtained. The sample machine translations contain the kinds of translation errors that arise in real post-editing scenarios from the limitations of machine translation models. Fine-tuning on these triples therefore lets the post-editing model learn, beyond conventional text errors, the translation errors typical of machine translation, improving its ability to locate and correct errors in post-editing scenarios and further improving post-editing accuracy.
In addition, fine-tuning requires far less data than pre-training, which lowers the difficulty of obtaining <original text, machine translation, post-edited translation> triples, further reducing training difficulty and improving training efficiency.
In the method provided by the embodiment of the invention, the pre-trained post-editing model is trained on sample pre-training original texts, their sample post-edited translations, and simulated translations of the pre-training original texts; the post-editing model is then obtained by fine-tuning this model on sample fine-tuning original texts, their sample post-edited translations, and sample machine translations of those original texts. Through pre-training followed by fine-tuning, and through error simulation that synthesizes translation data, the training efficiency and training effect of the post-editing model are improved.
Based on the above embodiment, the sample machine translation corresponds to at least one of the following error types: long-sentence translation errors, entity-name translation errors, and domain translation errors.
Specifically, so that during fine-tuning the post-editing model can learn the translation errors that the limitations of machine translation models cause in real post-editing scenarios, sample machine translations containing such errors are obtained. Common translation errors include long-sentence translation errors, entity-name translation errors, and domain translation errors. A long-sentence translation error occurs when a machine translation model cannot reasonably handle a long sentence; an entity-name translation error occurs when the model mistranslates entity words such as person names, place names, or organization names, or number words; a domain translation error arises when the model processes an original text from a low-resource, specialized domain that differs from the domain the model is suited to. The obtained sample machine translation may therefore correspond to at least one of these three error types.
Based on any of the above embodiments, the sample machine translated text is determined based on at least one of:
translating the sample fine-tuning original text with a first machine translation model to obtain a sample machine translation of the long-sentence translation error type, where the first machine translation model is trained on first sample original texts and their first sample translations, the sample fine-tuning original texts are long sentences, and the first sample original texts are short sentences;
randomly modifying entity names in the sample post-edited translation used for fine-tuning, to obtain a sample machine translation of the entity-name translation error type;
translating the sample fine-tuning original text with a second machine translation model to obtain a sample machine translation of the domain translation error type, where the second machine translation model is trained on second sample original texts and their second sample translations, which belong to a domain different from that of the sample fine-tuning original texts.
Specifically, for long-sentence translation errors, the first machine translation model is trained on the first sample original texts and their first sample translations, and can be built on a single Transformer model. The first sample original texts and their translations can be bilingual parallel corpora downloaded from the web. Here, the first sample original texts are short, e.g. containing only one sentence. Because the first machine translation model is trained only on short sentences, it is only good at translating short sentences; when long input is fed to it, the resulting translation is prone to long-sentence translation errors. Therefore, long sentences, e.g. containing two or more sentences, are selected as sample fine-tuning original texts and input into the first machine translation model to obtain sample machine translations of the long-sentence translation error type.
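A sketch of the corpus split this paragraph describes: short originals train the deliberately weak first model, while long originals are held out to be mistranslated by it. The sentence-counting heuristic below is an assumption; the patent does not say how sentence boundaries are detected.

```python
import re

def count_sentences(text):
    # Crude heuristic: count runs of terminal punctuation
    # (Latin and CJK); treat unpunctuated text as one sentence.
    return max(1, len(re.findall(r"[.!?\u3002\uff01\uff1f]+", text)))

def split_for_long_sentence_errors(bitext):
    """Partition a parallel corpus into short-original pairs (to train
    the 'first machine translation model') and long originals (to feed
    that model so it produces long-sentence translation errors)."""
    short_pairs = [(s, t) for s, t in bitext if count_sentences(s) == 1]
    long_sources = [s for s, _ in bitext if count_sentences(s) >= 2]
    return short_pairs, long_sources
```

After training a translation model on `short_pairs`, translating each entry of `long_sources` with it yields the error-bearing sample machine translations.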
For entity-name translation errors, an entity recognition tool such as spaCy can be used to run entity recognition over the sample post-edited translations, for example over the English side of bilingual parallel corpora produced in a translation production environment. Fragments containing entities such as person names, place names, organization names, and numbers are extracted from the sample post-edited translations and randomly modified, e.g. deleted or replaced, to obtain sample machine translations of the entity-name translation error type.
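The entity-corruption step might look like the following sketch. The entity character spans are assumed to come from an NER tool such as spaCy, and the placeholder replacement string is an illustrative assumption; in practice one might instead swap in a different entity drawn from the corpus.

```python
import random

def corrupt_entities(translation, entity_spans, rng=None):
    """Randomly delete or replace recognized entity mentions (names,
    places, organizations, numbers) in a post-edited translation to
    synthesize entity-name translation errors. `entity_spans` is a list
    of (start, end) character offsets, e.g. from spaCy's Doc.ents."""
    rng = rng or random.Random()
    out = translation
    # Process spans right-to-left so earlier character offsets stay valid.
    for start, end in sorted(entity_spans, reverse=True):
        if rng.random() < 0.5:
            out = out[:start] + out[end:]            # delete the entity
        else:
            out = out[:start] + "<ENT>" + out[end:]  # replace the entity
    return out
```

Pairing the corrupted translation with its original text and the untouched post-edited translation yields one fine-tuning triple of the entity-name error type.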
For domain translation errors, the second machine translation model is trained on the second sample original texts and their second sample translations, and can likewise be built on a single Transformer model. The second sample original texts and translations belong to a domain different from that of the sample fine-tuning original texts. For example, a high-quality but narrow-domain bilingual parallel corpus of United Nations government documents can be downloaded from the web as the second sample original texts and translations. The trained second machine translation model is only good at translating text in that domain, so translating original texts from other domains with it readily produces domain translation errors; the translations it produces for the sample fine-tuning original texts can thus serve as sample machine translations of the domain translation error type.
With these different data synthesis methods, sample machine translations of the three translation error types can be generated efficiently, eliminating manual data annotation during fine-tuning and further improving the training efficiency of the post-editing model.
Based on any of the above embodiments, the pre-trained post-editing model includes a pre-trained original-language encoder, a pre-trained translation-language encoder, and a decoder.
Specifically, the pre-trained post-editing model may include two encoders, an original-language encoder and a translation-language encoder, which encode the original text and the machine translation respectively, and a decoder that decodes from the two encodings to correct the machine translation and produce the post-edited translation. The original-language encoder, the translation-language encoder, and the decoder can each be built on a single Transformer model. Here, the two encoders may themselves be obtained through pre-training, improving the pre-training efficiency of the pre-trained post-editing model and thus the overall training efficiency of the post-editing model.
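At the interface level, the dual-encoder-plus-decoder structure can be sketched as below. The Transformer internals are deliberately elided: each component is just a callable here, so this is a structural illustration, not the patent's implementation.

```python
class PostEditingModel:
    """Two encoders (one per language) feeding a shared decoder."""

    def __init__(self, original_encoder, translation_encoder, decoder):
        self.original_encoder = original_encoder        # pre-trained on source-language text
        self.translation_encoder = translation_encoder  # pre-trained on target-language text
        self.decoder = decoder                          # decodes from both encodings

    def post_edit(self, original_text, machine_translation):
        h_src = self.original_encoder(original_text)
        h_mt = self.translation_encoder(machine_translation)
        return self.decoder(h_src, h_mt)
```

With toy callables, e.g. `PostEditingModel(str.split, str.split, lambda a, b: " ".join(b))`, `post_edit` simply echoes the translation; in a real system each callable would be a Transformer stack and the decoder would generate tokens autoregressively while attending over both encodings.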
In the method provided by the embodiment of the invention, the pre-trained post-editing model is built from the pre-trained original-language encoder, the pre-trained translation-language encoder, and the decoder, further improving the overall training efficiency of the post-editing model.
Based on any of the above embodiments, the pre-trained original language encoder and the pre-trained translated language encoder are trained based on sample monolingual text of a corresponding language and sample error text obtained by performing conventional error simulation on the sample monolingual text.
Specifically, in order for the original language encoder and the translated language encoder to learn to extract correct semantic information from erroneous text, and thus to produce original-text and translated-text encodings that contain correct semantic information and have improved expressive capability, the two encoders can be trained based on sample monolingual text of the corresponding language, its corresponding sample error text, and a word vector model of the corresponding language. For example, if the original text is Chinese and the translated text is English, the original language encoder can be pre-trained based on Chinese sample monolingual text, its corresponding sample error text and a Chinese word vector model, and the translated language encoder can be pre-trained based on English sample monolingual text, its corresponding sample error text and an English word vector model. The sample monolingual text can be obtained by collecting a large amount of monolingual corpus; for example, public Chinese monolingual corpora such as the Chinese Wikipedia and news corpora, and public English corpora such as the English Wikipedia and news corpora, can be downloaded from the network. To reduce the difficulty of acquiring training data, a part of the monolingual corpus, for example 20%, can be randomly selected, and conventional error simulation can be performed on the selected monolingual corpus, i.e. the sample monolingual text, to obtain sample error text containing conventional text errors.
According to the method provided by the embodiment of the invention, the original language encoder and the translated language encoder are obtained by pre-training on sample monolingual text of the corresponding language and on sample error text obtained by performing conventional error simulation on that monolingual text, so that the encoders can produce original-text and translated-text encodings containing correct semantic information, improving the expressive capability of the encodings.
Based on any of the above embodiments, the simulated translation text is determined based on the steps of:
Performing conventional error simulation on the sample pre-training original text or on the sample pre-training post-editing translated text to obtain the simulated translated text.
Specifically, a part of the bilingual parallel corpus, for example 10%, can be randomly selected, and conventional error simulation can be performed on the sample pre-training original text of each selected corpus entry to obtain a simulated translated text containing conventional text errors; the sample pre-training translated text, the simulated translated text and the sample pre-training original text of the entry are then used as one piece of training data for the pre-trained post-editing model. Similarly, another part of the bilingual parallel corpus, for example 10%, can be randomly selected, and conventional error simulation can be performed on the sample pre-training post-editing translated text to obtain a simulated translated text containing conventional text errors; the sample pre-training original text, the simulated translated text and the sample pre-training post-editing translated text of the entry are then used as one piece of training data for the pre-trained post-editing model.
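The two sampling-and-corruption passes described above can be sketched as follows. This is a minimal illustration under stated assumptions: corpora are whitespace-tokenized token lists, and `corrupt` is a stand-in for the conventional error simulation that simply drops one random token.

```python
import random

def corrupt(tokens, rng):
    # Stand-in for conventional error simulation: drop one random token.
    if len(tokens) < 2:
        return list(tokens)
    out = list(tokens)
    del out[rng.randrange(len(out))]
    return out

def build_pretraining_triples(bitext, fraction=0.1, seed=0):
    """bitext: list of (original_tokens, translated_tokens) pairs.

    Returns (input_text, simulated_translation, target_text) triples of
    the two kinds described above: one pass corrupts the original side,
    the other corrupts the translated side."""
    rng = random.Random(seed)
    k = max(1, int(len(bitext) * fraction))
    triples = []
    # Pass 1: corrupt the original text; the triple is
    # (translated text, corrupted original, original text).
    for src, tgt in rng.sample(bitext, k):
        triples.append((tgt, corrupt(src, rng), src))
    # Pass 2: corrupt the translated text; the triple is
    # (original text, corrupted translation, translated text).
    for src, tgt in rng.sample(bitext, k):
        triples.append((src, corrupt(tgt, rng), tgt))
    return triples
```

Each resulting triple has the shape expected by the pre-trained post-editing model: an input text, a "machine translation" to correct, and the reference target.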
Based on any of the above embodiments, performing conventional error simulation specifically includes:
Randomly selecting a plurality of text fragments in the corresponding text, and performing a deleting, rearranging, replacing or transferring operation on the text fragments.
Specifically, conventional text errors include missing words, disordered words, wrong words, repetition and the like, so when conventional error simulation is performed, a plurality of text fragments in the corresponding text can be randomly selected, and a deleting, rearranging, replacing or transferring operation performed on each fragment. Here, deleting means removing the text fragment as a whole; rearranging means reversing the order of the words within the fragment; replacing means overwriting the fragment with a fragment from another position in the text; and transferring means exchanging the fragment with a fragment at another position. For example, conventional error simulation may be performed in the manner set forth in the following table:
Original       <zh> The weather today is really good.
Deletion       <zh> The weather today DEL good.        (a fragment is deleted, leaving a DEL mark)
Rearrangement  <zh> The is today weather really good.  (word order within a fragment is reversed)
Replacement    <zh> The weather today is the weather.  (a fragment is overwritten by a copy of another fragment)
Transfer       <zh> Really good today is the weather.  (two fragments exchange positions)
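The four operations in the table can be sketched as follows. This is a minimal illustration over token lists; the segment lengths and the choice of operation are randomized, since the embodiment leaves those details open.

```python
import random

def simulate_errors(tokens, n_ops=1, max_len=3, seed=None):
    """Apply the conventional error operations (delete, rearrange,
    replace, transfer) to randomly chosen segments of a token list."""
    rng = random.Random(seed)
    out = list(tokens)
    for _ in range(n_ops):
        if len(out) < 2:
            break
        length = rng.randint(1, min(max_len, len(out) // 2))
        i = rng.randrange(len(out) - length + 1)
        op = rng.choice(["delete", "rearrange", "replace", "transfer"])
        if op == "delete":
            # Remove the segment entirely.
            del out[i:i + length]
        elif op == "rearrange":
            # Reverse the word order inside the segment.
            out[i:i + length] = reversed(out[i:i + length])
        else:
            # Pick a second, non-overlapping segment of the same length.
            choices = [j for j in range(len(out) - length + 1)
                       if j + length <= i or j >= i + length]
            if not choices:
                continue
            j = rng.choice(choices)
            if op == "replace":
                # Overwrite the segment with a copy of the other one.
                out[i:i + length] = out[j:j + length]
            else:
                # Transfer: swap the two segments.
                out[i:i + length], out[j:j + length] = \
                    out[j:j + length], out[i:i + length]
    return out
```

Only deletion changes the sequence length; the other three operations rearrange existing tokens, which matches the error types the table illustrates.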
Based on any one of the above embodiments, a further embodiment of the present invention provides a post-editing model building method. The method comprises the following steps:
First, collecting corpus data required for model training, including:
Accumulate the bilingual parallel corpus generated in the translation production environment, denoted bilingual parallel corpus C. Each corpus entry comprises an original text to be translated and a high-quality translated text produced after human translation and review.
Download common bilingual parallel corpora, such as the United Nations and WMT bilingual parallel corpora, from the network, denoted bilingual parallel corpus T.
Download a common original-language monolingual corpus, such as the Chinese Wikipedia and news corpora, from the network, denoted monolingual corpus Z.
Download a common translated-language monolingual corpus, such as the English Wikipedia and news corpora, from the network, denoted monolingual corpus E.
Perform word segmentation on all the corpora. For English corpora, segmentation can be done on whitespace; for Chinese corpora, segmentation can be performed by grammar rules at the character level, i.e. each individual Chinese character, each continuous run of digits or English letters, and each punctuation mark is treated as a separate token. A language identifier (e.g. "<zh>" for a Chinese corpus) is then added at the beginning of each corpus entry.
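The segmentation rule and language identifier described above can be sketched as follows. This is a minimal illustration; the "<en>" tag for English is an assumption by analogy with the "<zh>" tag shown earlier.

```python
import re

# Token pattern implementing the character-level rule: a run of digits,
# a run of ASCII letters, a single CJK character, or any other
# non-whitespace symbol (e.g. punctuation) each become one token.
TOKEN_RE = re.compile(r"[0-9]+|[A-Za-z]+|[\u4e00-\u9fff]|[^\s]")

def segment(text, lang):
    """Segment one corpus line and prepend its language identifier."""
    if lang == "en":
        tokens = text.split()              # English: split on whitespace
    else:
        tokens = TOKEN_RE.findall(text)    # Chinese: character-level rule
    return ["<%s>" % lang] + tokens
```

For example, a mixed line such as "GDP增长7%" keeps "GDP" and "7" as whole tokens while each Chinese character becomes its own token.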
Based on the segmented corpus data, word vectors are trained for the original language and the translated language using the Skip-Gram algorithm. The dimension of the word vectors may be set to 300 and the context window to 5.
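In practice the Skip-Gram training itself would typically be delegated to a library (for example gensim's `Word2Vec` with `sg=1`, `vector_size=300`, `window=5` — an assumed choice, not named in the text); the stdlib sketch below only illustrates the (center, context) pair extraction that Skip-Gram performs with a context window.

```python
def skipgram_pairs(tokens, window=5):
    """Generate (center, context) training pairs as used by Skip-Gram:
    each token predicts the tokens within `window` positions of it."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```

Each pair then serves as one training example for the model that learns the 300-dimensional vectors.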
Randomly extract 20% of the corpus from Z, perform conventional error simulation on it, synthesize a parallel corpus pairing the possibly-corrupted corpus with the original corpus, and pre-train the original language encoder of a standard Transformer model in combination with the word vector model of the original language.
Randomly extract 20% of the corpus from E, perform conventional error simulation on it, synthesize a parallel corpus pairing the possibly-corrupted corpus with the original corpus, and pre-train the translated language encoder of a standard Transformer model in combination with the word vector model of the translated language.
Randomly extract 10% of the corpus from T and perform conventional error simulation on the original-text corpus in it, generating ternary corpora of the form (possibly-corrupted original corpus, original translation corpus, initial original corpus). Similarly, randomly extract another 10% from T and perform conventional error simulation on the translation corpus in it, generating ternary corpora of the form (initial original corpus, possibly-corrupted translation corpus, original translation corpus). The synthesized ternary parallel corpora are used to pre-train a dual-Transformer-encoder to single-Transformer-decoder model, yielding the pre-trained post-editing model. The dual Transformer encoders are the original language encoder and the translated language encoder.
Subsequently, training data acquisition for the fine tuning task is performed, including:
a) Use a rule-based Chinese sentence-breaking method to split the original-text corpus in C into sentences, and screen out the bilingual parallel corpora whose original text contains two or more sentences, forming a subset C1. Similarly, split the original-text corpus in T into sentences and screen out the bilingual parallel corpora whose original text contains exactly one sentence, forming another subset T1. Using corpus T1, construct a machine translation engine based on a Transformer model. Then input the C1 original-text corpus into the model for decoding to generate machine translations, yielding triples (C1 original text, machine translation, C1 translation).
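Step a) can be sketched as follows. This is a minimal illustration: the sentence-breaking rule is reduced to splitting on Chinese and Latin end punctuation, and the helper is applied separately to C (keeping the multi-sentence side as C1) and to T (keeping the single-sentence side as T1).

```python
import re

# Simplified rule-based sentence breaking: a sentence ends at Chinese or
# Latin end punctuation; trailing text without punctuation also counts.
SENT_RE = re.compile(r"[^。！？!?]*[。！？!?]|[^。！？!?]+$")

def split_sentences(text):
    return [s for s in SENT_RE.findall(text) if s]

def partition_by_sentence_count(corpus):
    """corpus: list of (original, translation) pairs. Returns (multi,
    single): pairs whose original has >= 2 sentences (the C1-style
    subset) and pairs whose original has exactly 1 (the T1-style
    subset)."""
    multi, single = [], []
    for src, tgt in corpus:
        (multi if len(split_sentences(src)) >= 2 else single).append((src, tgt))
    return multi, single
```

The single-sentence subset trains the short-sentence MT engine, whose output on the multi-sentence originals then supplies the long-sentence-error samples.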
b) Use the spaCy tool to perform entity recognition on the translation corpus in C, and screen out the bilingual parallel corpora C2 containing entities such as person names, place names, organization names and numbers. Randomly modify the entity nouns in the C2 translation corpus, e.g. by deletion or replacement, yielding triples (C2 original text, translation with corrupted entity nouns, C2 translation).
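Step b) can be sketched as follows. This is a minimal illustration under an assumption: the entity mentions are supplied as token index ranges (as an NER tool such as spaCy would produce) rather than detected here.

```python
import random

def corrupt_entities(tokens, entity_spans, seed=None):
    """Randomly delete or replace entity mentions in a translated
    sentence. entity_spans are (start, end) token index ranges."""
    rng = random.Random(seed)
    out = list(tokens)
    # Edit right-to-left so earlier span indices stay valid.
    for start, end in sorted(entity_spans, reverse=True):
        others = [s for s in entity_spans if s != (start, end)]
        if rng.random() < 0.5 or not others:
            del out[start:end]                 # delete the entity
        else:
            a, b = rng.choice(others)          # replace it with another
            out[start:end] = tokens[a:b]       # entity from the sentence
    return out
```

The corrupted sentence is paired with the original text and the intact reference translation to form the entity-name-error triple.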
c) Screen the United Nations bilingual parallel corpus out of T and construct a machine translation engine based on a Transformer model. Extract a subset C3 from C, input the C3 original-text corpus into the model for decoding to generate machine translations, and obtain triples (C3 original text, machine translation, C3 translation).
Combine the triples generated in a), b) and c) to form the total fine-tuning training data, and fine-tune the pre-trained post-editing model with it to obtain the final post-editing model.
The post-translation editing device provided by the embodiment of the invention is described below, and the post-translation editing device described below and the post-translation editing method described above can be referred to correspondingly.
Based on any of the above embodiments, fig. 3 is a schematic structural diagram of a post-translation editing device according to an embodiment of the present invention, as shown in fig. 3, where the device includes: a translation determination unit 310 and a post-editing unit 320.
Wherein the translation determining unit 310 is configured to determine a machine translation text to be edited;
the post-editing unit 320 is configured to input the machine translated text and the corresponding original text into the post-editing model, so as to obtain a post-edited translated text output by the post-editing model;
the post-editing model is obtained by fine-tuning the pre-trained post-editing model based on the sample fine-tuning original text and its sample fine-tuned post-editing translated text, and the sample machine translated text of the sample fine-tuning original text;
The pre-trained post-editing model is obtained by training based on the sample pre-training original text and its sample pre-training post-editing translated text, and a simulated translated text of the sample pre-training original text.
According to the device provided by the embodiment of the invention, the pre-trained post-editing model is obtained by training on a sample pre-training original text, its sample pre-training post-editing translated text, and a simulated translated text of the sample pre-training original text; the post-editing model is then obtained by fine-tuning the pre-trained post-editing model on a sample fine-tuning original text, its sample fine-tuned post-editing translated text, and a sample machine translated text of the sample fine-tuning original text. Through this pre-training plus fine-tuning scheme, together with the synthesis of translation data by error simulation, the training efficiency and training effect of the post-editing model are improved, and the accuracy of post-editing is improved.
Based on any of the above embodiments, the sample machine translated text corresponds to at least one error type of long sentence translation error, entity name translation error, and domain translation error.
Based on any of the above embodiments, the sample machine translated text is determined based on at least one of:
Translating the sample fine-tuning original text by using a first machine translation model to obtain a sample machine translation text of the long-sentence translation error type; the first machine translation model is trained based on a first sample translation original text and its first sample translation translated text, where the sample fine-tuning original text is a long sentence and the first sample translation original text is a short sentence;
randomly modifying entity names in the sample fine-tuned post-editing translated text to obtain a sample machine translation text of the entity-name translation error type;
translating the sample fine-tuning original text by applying a second machine translation model to obtain a sample machine translation text of the domain translation error type; the second machine translation model is trained based on a second sample translation original text and its second sample translation translated text, the second sample translation original text being in a different domain from the sample fine-tuning original text.
According to the device provided by the embodiment of the invention, through different data synthesis modes, sample machine translation text corresponding to three different translation error types can be generated efficiently, a data labeling process in a fine tuning process is omitted, and the training efficiency of a post-editing model can be further improved.
Based on any of the above embodiments, the pre-trained post-editing model includes a pre-trained original language encoder and a pre-trained translated language encoder, and a decoder.
The device provided by the embodiment of the invention constructs the pre-trained post-editing model through the pre-trained original language encoder, the pre-trained translated language encoder and the decoder together, so that the overall training efficiency of the post-editing model is further improved.
Based on any of the above embodiments, the pre-trained original language encoder and the pre-trained translated language encoder are trained based on sample monolingual text of a corresponding language and sample error text obtained by performing conventional error simulation on the sample monolingual text.
According to the device provided by the embodiment of the invention, the original language encoder and the translated language encoder are obtained by pre-training on sample monolingual text of the corresponding language and on sample error text obtained by performing conventional error simulation on that monolingual text, so that the encoders can produce original-text and translated-text encodings containing correct semantic information, improving the expressive capability of the encodings.
Based on any of the above embodiments, the simulated translation text is determined based on the steps of:
Performing conventional error simulation on the sample pre-training original text or on the sample pre-training post-editing translated text to obtain the simulated translated text.
Based on any of the above embodiments, the apparatus further comprises a conventional error simulation unit for:
Randomly selecting a plurality of text fragments in the corresponding text, and performing a deleting, rearranging, replacing or transferring operation on the text fragments.
Fig. 4 illustrates a physical schematic diagram of an electronic device, as shown in fig. 4, which may include: processor 410, communication interface (Communications Interface) 420, memory 430 and communication bus 440, wherein processor 410, communication interface 420 and memory 430 communicate with each other via communication bus 440. The processor 410 may invoke logic instructions in the memory 430 to perform a post-translation editing method comprising: determining a machine translation text to be edited; inputting the machine translation text and the corresponding original text into a post-editing model to obtain a post-editing translation text output by the post-editing model; the post-editing model is obtained by fine-tuning a pre-trained editing model based on a sample fine-tuning original text and a sample fine-tuned editing translated text thereof and a sample machine translated text of the sample fine-tuning original text; the pre-training post-editing model is obtained based on a sample pre-training original text and a sample pre-training post-editing translated text thereof, and a simulated translated text training of the sample pre-training original text.
Further, the logic instructions in the memory 430 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
In another aspect, embodiments of the present invention further provide a computer program product, including a computer program stored on a non-transitory computer readable storage medium, the computer program including program instructions which, when executed by a computer, enable the computer to perform the post-translation editing method provided in the above method embodiments, the method including: determining a machine translation text to be edited; inputting the machine translation text and the corresponding original text into a post-editing model to obtain a post-editing translation text output by the post-editing model; the post-editing model is obtained by fine-tuning a pre-trained editing model based on a sample fine-tuning original text and a sample fine-tuned editing translated text thereof and a sample machine translated text of the sample fine-tuning original text; the pre-training post-editing model is obtained based on a sample pre-training original text and a sample pre-training post-editing translated text thereof, and a simulated translated text training of the sample pre-training original text.
In still another aspect, an embodiment of the present invention further provides a non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, is implemented to perform the post-translation editing method provided in the above embodiments, the method including: determining a machine translation text to be edited; inputting the machine translation text and the corresponding original text into a post-editing model to obtain a post-editing translation text output by the post-editing model; the post-editing model is obtained by fine-tuning a pre-trained editing model based on a sample fine-tuning original text and a sample fine-tuned editing translated text thereof and a sample machine translated text of the sample fine-tuning original text; the pre-training post-editing model is obtained based on a sample pre-training original text and a sample pre-training post-editing translated text thereof, and a simulated translated text training of the sample pre-training original text.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art will understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

1. A post-translation editing method, comprising:
determining a machine translation text to be edited;
inputting the machine translation text and the corresponding original text into a post-editing model to obtain a post-editing translation text output by the post-editing model;
the post-editing model is obtained by fine-tuning a pre-trained post-editing model based on a sample fine-tuning original text and a sample fine-tuned post-editing translated text thereof, and a sample machine translated text of the sample fine-tuning original text;
the pre-trained post-editing model is obtained by training based on a sample pre-training original text and a sample pre-training post-editing translated text thereof, and a simulated translated text of the sample pre-training original text;
the simulated translation text is determined based on the steps of:
performing conventional error simulation on the sample pre-training original text or on the sample pre-training post-editing translated text to obtain the simulated translated text.
2. The post-translation editing method according to claim 1, wherein the sample machine-translated text corresponds to at least one error type of long sentence translation error, entity name translation error, and domain translation error.
3. The post-translation editing method according to claim 2, wherein the sample machine-translated translation text is determined based on at least one of:
translating the sample fine-tuning original text by applying a first machine translation model to obtain a sample machine translation text of the long-sentence translation error type; the first machine translation model is trained based on a first sample translation original text and its first sample translation translated text, where the sample fine-tuning original text is a long sentence and the first sample translation original text is a short sentence;
randomly modifying the entity names in the sample fine-tuned post-editing translated text to obtain a sample machine translation text of the entity-name translation error type;
translating the sample fine-tuning original text by applying a second machine translation model to obtain a sample machine translation text of the domain translation error type; the second machine translation model is trained based on a second sample translation original text and its second sample translation translated text, the second sample translation original text being in a different domain from the sample fine-tuning original text.
4. The post-translation editing method according to claim 1, wherein the pre-trained post-editing model comprises a pre-trained original language encoder and a pre-trained translated language encoder, and a decoder.
5. The post-translation editing method according to claim 4, wherein the pre-trained original language encoder and the pre-trained translated language encoder are obtained by training based on sample monolingual text of a corresponding language and sample error text obtained by performing conventional error simulation on the sample monolingual text.
6. The post-translation editing method according to claim 1 or 5, wherein said performing conventional error simulation specifically comprises:
randomly selecting a plurality of text fragments in the corresponding text, and performing a deleting, rearranging, replacing or transferring operation on the text fragments.
7. A post-translation editing device, comprising:
a translation determining unit for determining a machine translation text to be edited;
the post-editing unit is used for inputting the machine translation text and the corresponding original text into a post-editing model to obtain a post-editing translation text output by the post-editing model;
the post-editing model is obtained by fine-tuning a pre-trained post-editing model based on a sample fine-tuning original text and a sample fine-tuned post-editing translated text thereof, and a sample machine translated text of the sample fine-tuning original text;
the pre-trained post-editing model is obtained by training based on a sample pre-training original text and a sample pre-training post-editing translated text thereof, and a simulated translated text of the sample pre-training original text;
the simulated translation text is determined based on the steps of:
and performing conventional error simulation on the sample pre-training original text or the sample pre-training post-editing translated text to obtain the simulated translated text.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor performs the steps of the post-translation editing method according to any one of claims 1 to 6 when the program is executed.
9. A non-transitory computer readable storage medium having stored thereon a computer program, which when executed by a processor, implements the steps of the post-translation editing method according to any of claims 1 to 6.
CN202011186869.1A 2020-10-29 2020-10-29 Post-translation editing method and device, electronic equipment and storage medium Active CN112287696B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011186869.1A CN112287696B (en) 2020-10-29 2020-10-29 Post-translation editing method and device, electronic equipment and storage medium
PCT/CN2021/078814 WO2022088570A1 (en) 2020-10-29 2021-03-03 Method and apparatus for post-editing of translation, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011186869.1A CN112287696B (en) 2020-10-29 2020-10-29 Post-translation editing method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112287696A CN112287696A (en) 2021-01-29
CN112287696B true CN112287696B (en) 2024-02-23

Family

ID=74352729

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011186869.1A Active CN112287696B (en) 2020-10-29 2020-10-29 Post-translation editing method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112287696B (en)
WO (1) WO2022088570A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112287696B (en) * 2020-10-29 2024-02-23 语联网(武汉)信息技术有限公司 Post-translation editing method and device, electronic equipment and storage medium
CN112836528B (en) * 2021-02-07 2023-10-03 语联网(武汉)信息技术有限公司 Machine post-translation editing method and system
CN114091483B (en) * 2021-10-27 2023-02-28 北京百度网讯科技有限公司 Translation processing method and device, electronic equipment and storage medium
CN116956946A (en) * 2023-07-14 2023-10-27 上海一者信息科技有限公司 Machine translation text fine granularity error type identification and positioning method
CN117273027B (en) * 2023-11-22 2024-04-30 四川语言桥信息技术有限公司 Automatic machine translation post-verification method based on translation error correction

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670191A (en) * 2019-01-24 2019-04-23 语联网(武汉)信息技术有限公司 Calibration optimization method, device and the electronic equipment of machine translation
CN111144137A (en) * 2019-12-17 2020-05-12 语联网(武汉)信息技术有限公司 Method and device for generating edited model corpus after machine translation
CN111382580A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 Encoder-decoder framework pre-training method for neural machine translation
CN111597778A (en) * 2020-04-15 2020-08-28 哈尔滨工业大学 Method and system for automatically optimizing machine translation based on self-supervision

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6471074B2 (en) * 2015-09-30 2019-02-13 株式会社東芝 Machine translation apparatus, method and program
CN105740218A (en) * 2015-12-31 2016-07-06 成都数联铭品科技有限公司 Post-editing processing method for mechanical translation
US10558762B2 (en) * 2018-02-24 2020-02-11 International Business Machines Corporation System and method for adaptive quality estimation for machine translation post-editing
CN112287696B (en) * 2020-10-29 2024-02-23 语联网(武汉)信息技术有限公司 Post-translation editing method and device, electronic equipment and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109670191A (en) * 2019-01-24 2019-04-23 语联网(武汉)信息技术有限公司 Calibration optimization method, device and the electronic equipment of machine translation
CN111144137A (en) * 2019-12-17 2020-05-12 语联网(武汉)信息技术有限公司 Method and device for generating edited model corpus after machine translation
CN111382580A (en) * 2020-01-21 2020-07-07 沈阳雅译网络技术有限公司 Encoder-decoder framework pre-training method for neural machine translation
CN111597778A (en) * 2020-04-15 2020-08-28 哈尔滨工业大学 Method and system for automatically optimizing machine translation based on self-supervision

Also Published As

Publication number Publication date
CN112287696A (en) 2021-01-29
WO2022088570A1 (en) 2022-05-05

Similar Documents

Publication Publication Date Title
CN112287696B (en) Post-translation editing method and device, electronic equipment and storage medium
CN110852117B (en) Effective data enhancement method for improving translation effect of neural machine
WO2018010455A1 (en) Neural network-based translation method and apparatus
CN109840331B (en) Neural machine translation method based on user dictionary
CN112766000B (en) Machine translation method and system based on pre-training model
US8874433B2 (en) Syntax-based augmentation of statistical machine translation phrase tables
CN111144140B (en) Chinese-Thai bilingual corpus generation method and device based on zero-shot learning
CN112329447B (en) Training method of Chinese error correction model, Chinese error correction method and device
CN112818712B (en) Machine translation method and device based on translation memory library
CN112541365B (en) Machine translation method and device based on term replacement
Bertoldi et al. A new decoder for spoken language translation based on confusion networks
CN111144137B (en) Method and device for generating corpus for a machine translation post-editing model
CN115587590A (en) Training corpus construction method, translation model training method and translation method
Afli et al. Integrating optical character recognition and machine translation of historical documents
CN109657244B (en) English long sentence automatic segmentation method and system
Ahmadnia et al. Round-trip training approach for bilingually low-resource statistical machine translation systems
CN111178060A (en) Korean word segmentation reduction method based on language model
CN112836528B (en) Machine translation post-editing method and system
CN114861628A (en) System, method, electronic device and storage medium for training machine translation model
CN114185573A (en) Implementation and online updating system and method for human-computer interaction machine translation system
CN117034968B (en) Neural machine translation method, device, electronic equipment and medium
CN110287496A (en) Neural network-based English-Chinese word sense disambiguation method
CN116029310A (en) Automatic post-editing method and device for machine translation
CN114595703A (en) Interactive machine translation method and device, storage medium and electronic device
CN117709370A (en) Text translation method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant