CN111597824A - Training method and device of language translation model - Google Patents

Training method and device of language translation model

Info

Publication number
CN111597824A
CN111597824A
Authority
CN
China
Prior art keywords
corpus
source
target
language
translation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010307663.3A
Other languages
Chinese (zh)
Other versions
CN111597824B (en)
Inventor
陈巍华 (Chen Weihua)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Original Assignee
Unisound Intelligent Technology Co Ltd
Xiamen Yunzhixin Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Unisound Intelligent Technology Co Ltd, Xiamen Yunzhixin Intelligent Technology Co Ltd filed Critical Unisound Intelligent Technology Co Ltd
Priority to CN202010307663.3A priority Critical patent/CN111597824B/en
Publication of CN111597824A publication Critical patent/CN111597824A/en
Application granted granted Critical
Publication of CN111597824B publication Critical patent/CN111597824B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method and a device for training a language translation model. The method comprises the following steps: step A1, obtaining a source corpus S1 and a target corpus T1 that are mutual translations of each other, and constructing a binary translation network M1 according to the source corpus S1 and the target corpus T1, wherein the binary translation network M1 is capable of judging whether an arbitrary source sentence and target sentence are mutual translations; step A2, training an initial source-language training model with a source corpus P1 to obtain a target source-language training model M2; step A3, obtaining a source corpus S3 and a target corpus T2 corresponding to the source corpus S3 according to the binary translation network M1, the target source-language training model M2 and the source corpus S1; and step A4, obtaining a language translation model according to the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2. With this technical scheme, the source corpus can be expanded to yield resource-rich source and target corpora that are mutual translations, and thereby a language translation model with high translation precision, accuracy and quality.

Description

Training method and device of language translation model
Technical Field
The invention relates to the technical field of translation, in particular to a method and a device for training a language translation model.
Background
At present, in translation tasks, most mainstream data augmentation algorithms extend a corpus either by injecting noise (word insertion, deletion, reordering, and the like) or by generating pseudo-parallel bilingual data from large amounts of target-side monolingual text via a Back-Translation method, and then train a language translation model on the resulting bilingual data.
Disclosure of Invention
The embodiment of the invention provides a method and a device for training a language translation model. The technical scheme is as follows:
according to a first aspect of the embodiments of the present invention, there is provided a method for training a language translation model, including:
step A1, obtaining a source corpus S1 and a target corpus T1 that are mutual translations of each other, and constructing a binary translation network M1 according to the source corpus S1 and the target corpus T1, wherein the binary translation network M1 is capable of judging whether an arbitrary source sentence and target sentence are mutual translations;
step A2, training an initial source-language training model with a source corpus P1 to obtain a target source-language training model M2;
step A3, obtaining a source corpus S3 and a target corpus T2 corresponding to the source corpus S3 according to the binary translation network M1, the target source-language training model M2 and the source corpus S1;
and step A4, obtaining the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2.
In one embodiment, step A3 includes:
expanding the source corpus S1 according to the target source-language training model M2 to obtain a candidate source corpus S2;
and screening the candidate source corpus S2 and the target corpus T1 by using the binary translation network M1 to obtain the source corpus S3 and the target corpus T2.
In one embodiment, the screening of the candidate source corpus S2 and the target corpus T1 by using the binary translation network M1 to obtain the source corpus S3 and the target corpus T2 includes:
obtaining a preset corpus probability threshold for mutual-translation corpora;
inputting the candidate source corpus S2 and the target corpus T1 into the binary translation network M1, so as to screen out the source corpus S3 and the target corpus T2 by using the binary translation network M1 and the corpus probability threshold.
In one embodiment, the method further comprises:
taking the source corpus S3 and its corresponding target corpus T2 as the new source corpus S1 and target corpus T1 respectively, and executing step A2 and step A3 again to obtain a target corpus T3 and a source corpus S4 corresponding to the target corpus T3;
step A4 then includes:
obtaining the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3, the target corpus T2, the target corpus T3 and the source corpus S4.
In one embodiment, the method further comprises:
acquiring a preset number of target monolingual sentences;
obtaining, by using the language translation model, a source corpus P2 corresponding to the target monolingual sentences;
and retraining the language translation model according to the source corpus P2 and the target monolingual sentences.
According to a second aspect of the embodiments of the present invention, there is provided a training device for a language translation model, including:
a first processing module, configured to obtain a source corpus S1 and a target corpus T1 that are mutual translations of each other, and to construct a binary translation network M1 according to the source corpus S1 and the target corpus T1, wherein the binary translation network M1 is capable of judging whether an arbitrary source sentence and target sentence are mutual translations;
a training module, configured to train an initial source-language training model with a source corpus P1 to obtain a target source-language training model M2;
a first obtaining module, configured to obtain a source corpus S3 and a target corpus T2 corresponding to the source corpus S3 according to the binary translation network M1, the target source-language training model M2 and the source corpus S1;
and a second obtaining module, configured to obtain the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2.
In one embodiment, the first obtaining module comprises:
an expansion submodule, configured to expand the source corpus S1 according to the target source-language training model M2 to obtain a candidate source corpus S2;
and a screening submodule, configured to screen the candidate source corpus S2 and the target corpus T1 by using the binary translation network M1 to obtain the source corpus S3 and the target corpus T2.
In one embodiment, the screening submodule is specifically configured to:
obtain a preset corpus probability threshold for mutual-translation corpora;
and input the candidate source corpus S2 and the target corpus T1 into the binary translation network M1, so as to screen out the source corpus S3 and the target corpus T2 by using the binary translation network M1 and the corpus probability threshold.
In one embodiment, the device further comprises:
a second processing module, configured to take the source corpus S3 and its corresponding target corpus T2 as the new source corpus S1 and target corpus T1 respectively, and to execute the steps of the training module and the first obtaining module again, so as to obtain a target corpus T3 and a source corpus S4 corresponding to the target corpus T3;
the second obtaining module then includes:
an obtaining submodule, configured to obtain the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3, the target corpus T2, the target corpus T3 and the source corpus S4.
In one embodiment, the device further comprises:
a third obtaining module, configured to acquire a preset number of target monolingual sentences;
a translation module, configured to obtain, by using the language translation model, a source corpus P2 corresponding to the target monolingual sentences;
and the training module, configured to retrain the language translation model according to the source corpus P2 and the target monolingual sentences.
The technical scheme provided by the embodiments of the invention can have the following beneficial effects:
an initial binary translation network M1 can be constructed from a low-resource (i.e., small and/or simple) source corpus S1 and target corpus T1; an initial source-language training model can then be trained with a source corpus P1 to obtain a highly accurate target source-language training model M2; the binary translation network M1 and the target source-language training model M2 can then be used to expand the source corpus S1, yielding a resource-rich source corpus S3 and a corresponding target corpus T2 that are mutual translations; finally, a language translation model with high translation accuracy, precision and quality can be obtained from the enriched corpora (i.e., the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2).
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a flow diagram illustrating a method of training a language translation model in accordance with an exemplary embodiment.
FIG. 2 is a block diagram illustrating a training apparatus for a language translation model in accordance with an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
In order to solve the above technical problem, an embodiment of the present invention provides a method for training a language translation model, where the method is used in a training program, system or device for a language translation model, and the corresponding execution subject may be a terminal or a server. As shown in FIG. 1, the method includes steps A1 to A4:
Step A1, obtaining a source corpus S1 and a target corpus T1 that are mutual translations of each other, and constructing a binary translation network M1 according to the source corpus S1 and the target corpus T1, wherein the binary translation network M1 is capable of judging whether an arbitrary source sentence and target sentence are mutual translations. Here, the source corpus S1 and the target corpus T1 being mutual translations means that they express the same content in different languages. The source corpus S1 is a low-resource corpus, i.e., the number of sentences in the source corpus S1 is below a preset number; in essence, the invention is therefore a training method for a low-resource translation model. The binary translation network M1 may be any network capable of determining whether an arbitrary source sentence and target sentence are mutual translations, such as a binary convolutional neural network.
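The patent does not fix the architecture of M1 beyond requiring a binary classifier (a binary convolutional neural network is given as one option). The following is a minimal sketch in Python, assuming a BERT-style cross-encoder stands in for M1; the checkpoint name and the fine-tuning scheme described in the comments are illustrative assumptions:

    # Minimal sketch of the binary translation network M1 (assumption:
    # a BERT cross-encoder; the patent allows any binary classifier).
    import torch
    from transformers import BertTokenizer, BertForSequenceClassification

    tokenizer = BertTokenizer.from_pretrained("bert-base-multilingual-cased")
    m1 = BertForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=2
    )  # label 1 = "mutual translations", label 0 = "not translations"

    def translation_probability(source_sentence: str, target_sentence: str) -> float:
        """Return M1's probability that the two sentences are mutual translations."""
        inputs = tokenizer(source_sentence, target_sentence,
                           return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = m1(**inputs).logits
        return torch.softmax(logits, dim=-1)[0, 1].item()

    # In training, (S1, T1) sentence pairs would serve as positive examples
    # and randomly mismatched pairs from S1 x T1 as negative examples.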
Step A2, training an initial source-language training model with a source corpus P1 to obtain a target source-language training model M2. The initial source-language training model may be an open-source language training model; the target source-language training model M2 is obtained by training a pre-training model, e.g., BERT (Bidirectional Encoder Representations from Transformers), on the source corpus P1. The trained model M2 has a fill-in-the-blank capability and can be used to expand the corpus. Both the initial source-language training model and the target source-language training model can enrich the corpus, i.e., expand a single source sentence into multiple sentences.
The source corpus P1 may be a monolingual corpus, i.e., a corpus currently available in only one language, or a bilingual corpus, i.e., a corpus currently available in multiple languages.
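A minimal sketch of M2's fill-in-the-blank capability, assuming a BERT-style masked language model; the checkpoint name is a placeholder, and in the method M2 would first be trained or fine-tuned on P1:

    # Minimal sketch of masked-token prediction with M2 (assumption:
    # a stock BERT masked language model stands in for the trained M2).
    import torch
    from transformers import BertTokenizer, BertForMaskedLM

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    m2 = BertForMaskedLM.from_pretrained("bert-base-chinese")

    def fill_mask(masked_sentence: str, top_k: int = 5) -> list:
        """Predict the top-k tokens for the first [MASK] position."""
        inputs = tokenizer(masked_sentence, return_tensors="pt")
        mask_positions = (inputs["input_ids"][0] == tokenizer.mask_token_id)
        mask_index = mask_positions.nonzero(as_tuple=True)[0][0]
        with torch.no_grad():
            logits = m2(**inputs).logits
        top_ids = logits[0, mask_index].topk(top_k).indices
        return [tokenizer.decode([int(i)]) for i in top_ids]

    print(fill_mask("今天天气真[MASK]。"))  # plausible completions of the blank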
Step A3, obtaining a source corpus S3 and a target corpus T2 corresponding to the source corpus S3 according to the binary translation network M1, the target source-language training model M2 and the source corpus S1.
Step A4, obtaining the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2. The language translation model is a model with a (mutual) language translation function, for example one capable of translating Chinese into English, Russian or other languages.
An initial binary translation network M1 can be constructed from a low-resource (i.e., small and/or simple) source corpus S1 and target corpus T1. The initial source-language training model can then be trained with the source corpus P1 to obtain a highly accurate target source-language training model M2. The binary translation network M1 and the target source-language training model M2 can then be used to expand the source corpus S1, yielding a resource-rich source corpus S3 and a corresponding target corpus T2 that are mutual translations. Finally, a language translation model with high translation accuracy, precision and quality can be obtained from the enriched corpora (i.e., the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2).
In one embodiment, step A3 includes:
expanding the source corpus S1 according to the target source-language training model M2 to obtain a candidate source corpus S2;
and screening the candidate source corpus S2 and the target corpus T1 by using the binary translation network M1 to obtain the source corpus S3 and the target corpus T2.
Because the source corpus S1 is small in quantity, a language translation model trained only on S1 has low precision. The target source-language training model M2 can therefore be used to expand the source corpus S1 into a larger candidate source corpus S2, and the binary translation network M1 then further screens the candidate source corpus S2 against the target corpus T1, yielding a source corpus S3 and a target corpus T2 with a higher probability of being mutual translations; combining S3 and T2 improves the translation precision and quality of the language translation model. A sketch of the expansion step follows.
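The following is a minimal sketch of the expansion step, reusing the tokenizer and fill_mask function from the M2 sketch above. Masking one random position per variant is an illustrative choice, since the patent permits masking random or fixed positions and inserting extra masks; s1_sentences and t1_sentences are hypothetical parallel lists of sentences:

    # Minimal sketch of expanding S1 into a candidate corpus S2 via M2.
    # Assumes `tokenizer` and `fill_mask` from the previous sketch;
    # `s1_sentences` / `t1_sentences` are hypothetical parallel lists.
    import random

    def expand_sentence(sentence: str, num_variants: int = 3) -> list:
        """Generate variants of one source sentence by masking and refilling."""
        tokens = tokenizer.tokenize(sentence)
        variants = []
        for _ in range(num_variants):
            masked = list(tokens)
            pos = random.randrange(len(masked))
            masked[pos] = tokenizer.mask_token
            best_word = fill_mask(tokenizer.convert_tokens_to_string(masked), top_k=1)[0]
            masked[pos] = best_word
            variants.append(tokenizer.convert_tokens_to_string(masked))
        return variants

    # Each variant stays paired with the target sentence of its original source:
    s2 = [(variant, t) for s, t in zip(s1_sentences, t1_sentences)
          for variant in expand_sentence(s)]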
In one embodiment, the screening of the candidate source corpus S2 and the target corpus T1 by using the binary translation network M1 to obtain the source corpus S3 and the target corpus T2 includes:
obtaining a preset corpus probability threshold for mutual-translation corpora (i.e., corpora that are bilingual translations of each other);
inputting the candidate source corpus S2 and the target corpus T1 into the binary translation network M1, so as to screen out the source corpus S3 and the target corpus T2 by using the binary translation network M1 and the corpus probability threshold.
During the screening, the preset corpus probability threshold is used to filter out the pairs from the candidate source corpus S2 and the target corpus T1 that have a low probability of being mutual translations, ensuring that the retained source corpus S3 and target corpus T2 have a high probability of being mutual translations, as in the sketch below.
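A minimal sketch of the screening step, reusing translation_probability from the M1 sketch above; the threshold value 0.9 is an illustrative assumption, since the patent only requires some preset probability threshold:

    # Minimal sketch of screening candidate pairs with M1 and a threshold.
    # Assumes `translation_probability` from the M1 sketch; the threshold
    # value is a hypothetical choice.
    CORPUS_PROBABILITY_THRESHOLD = 0.9

    def screen_pairs(candidate_pairs):
        """Keep (source, target) pairs that M1 scores above the threshold."""
        return [(s, t) for s, t in candidate_pairs
                if translation_probability(s, t) >= CORPUS_PROBABILITY_THRESHOLD]

    s3_t2_pairs = screen_pairs(s2)  # the screened pseudo-parallel corpus (S3, T2)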
In one embodiment, the method further comprises:
taking the source corpus S3 and its corresponding target corpus T2 as the new source corpus S1 and target corpus T1 respectively, and executing step A2 and step A3 again to obtain a target corpus T3 and a source corpus S4 corresponding to the target corpus T3;
step A4 then includes:
obtaining the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3, the target corpus T2, the target corpus T3 and the source corpus S4.
By re-using the source corpus S3 and its corresponding target corpus T2 as the source corpus S1 and target corpus T1 and re-executing steps A2 and A3, further mutually translated source and target corpora can be obtained, namely a target corpus T3 and a source corpus S4 that are mutual translations of each other. A language translation model with higher translation accuracy and quality can then be obtained from the larger set of mutually translated corpora, i.e., (S1, T1), (S3, T2) and (S4, T3); a sketch of this iteration follows.
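A minimal sketch of the iterated expansion; train_masked_lm and expand_and_screen are hypothetical wrappers for steps A2 and A3, not APIs defined by the patent, and since the embodiment leaves open which data retrains M2 on the second pass, P1 is simply reused here as an assumption:

    # Minimal sketch of running steps A2-A3 twice with swapped corpus roles.
    # `train_masked_lm` (step A2) and `expand_and_screen` (step A3) are
    # hypothetical helpers; reusing P1 on the second pass is an assumption.
    def iterate_expansion(m1, p1, s1, t1):
        m2 = train_masked_lm(p1)                        # step A2
        s3, t2 = expand_and_screen(m1, m2, s1, t1)      # step A3 -> (S3, T2)
        # Second pass: (S3, T2) stand in for (S1, T1), yielding (S4, T3).
        s4, t3 = expand_and_screen(m1, train_masked_lm(p1), s3, t2)
        return (s3, t2), (s4, t3)

    # Step A4 then trains on (S1, T1) + (S3, T2) + (S4, T3).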
In one embodiment, the method further comprises:
acquiring a preset number of target monolingual sentences, where a target monolingual sentence is text currently available in only one (the target) language;
obtaining, by using the language translation model, a source corpus P2 corresponding to the target monolingual sentences;
and retraining the language translation model according to the source corpus P2 and the target monolingual sentences.
By acquiring a preset number of target monolingual sentences and using the language translation model to produce a source corpus P2 that translates them, a large amount of target monolingual data is passed through the language translation model in a Back-Translation manner to obtain higher-quality bilingual data. The language translation model can then be retrained with the source corpus P2 and the target monolingual sentences, so that its translation precision and quality are further improved by incorporating back-translation, as sketched below.
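A minimal sketch of the back-translation step; reverse_model and its translate method are hypothetical placeholders for a trained target-to-source translation model:

    # Minimal sketch of back-translation: pseudo source sentences are
    # generated for real target-side monolingual sentences.
    # `reverse_model.translate` is a hypothetical placeholder.
    def back_translate(target_monolingual_sentences, reverse_model):
        """Build (pseudo source, real target) pairs for retraining."""
        pairs = []
        for target_sentence in target_monolingual_sentences:
            pseudo_source = reverse_model.translate(target_sentence)
            pairs.append((pseudo_source, target_sentence))
        return pairs  # the corpus (P2, target monolinguals)

    # The language translation model is then retrained on the original
    # bilingual data plus these back-translated pairs.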
The invention therefore constructs the translation model by expanding a low-resource bilingual corpus with additional high-quality bilingual data, which improves the translation model's capability; the pseudo-bilingual data obtained by further applying Back-Translation is of higher quality, so that the translation capability of the low-resource bilingual model is improved yet again.
The technical solution of the present invention is explained in further detail below:
Step 1: model the source corpus S1 and target corpus T1 of the low-resource bilingual corpus and train a binary classification network M1, which has the capability of judging whether an arbitrary source sentence and target sentence are mutual translations.
Step 2: for the source corpus (monolingual or bilingual), obtain a pre-trained model M2 (e.g., BERT, Bidirectional Encoder Representations from Transformers) either by taking an open-source pre-trained language model or by training on a large amount of source-side monolingual text.
Step 3: use M2 to expand the source corpus S1: randomly mask tokens of the source sentences, or insert masks at certain positions, and let M2 predict words for the masked positions to obtain a candidate source corpus S2.
Step 4: pass the generated source corpus S2 and the corresponding target corpus T1 through the binary network M1, and keep the data in S2 whose score (the probability that the candidate source sentence and the target sentence are mutual translations) exceeds a preset threshold, obtaining a pseudo source corpus S3 and its corresponding target corpus T1.
Step 5: swap the roles of the source and target corpora and repeat steps 2, 3 and 4 to obtain a pseudo target corpus T3 and its corresponding source corpus S1.
Step 6: train a translation model on the original bilingual corpus (S1, T1) together with the pseudo bilingual corpora (S3, T1) and (S1, T3) to obtain the bilingual translation model M3.
Step 7: pass a large amount of target monolingual text through M3 in a Back-Translation manner to obtain a high-quality pseudo bilingual corpus (S4, T4), and continue training the translation model on the resulting data, improving the low-resource translation effect. The whole pipeline is sketched below.
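A minimal end-to-end sketch of steps 1 through 7; every helper here (train_binary_classifier, train_masked_lm, expand_and_screen, train_translation_model, back_translate) is a hypothetical wrapper for the corresponding step above, not an API defined by the patent:

    # Minimal sketch of the full low-resource training pipeline (steps 1-7).
    # All helper functions are hypothetical wrappers for the steps above;
    # expand_and_screen returns (pseudo corpus, paired original corpus).
    def build_low_resource_translator(s1, t1, target_monolingual_sentences):
        m1 = train_binary_classifier(s1, t1)                        # step 1
        m2 = train_masked_lm(s1)                                    # step 2
        s3, _ = expand_and_screen(m1, m2, s1, t1)                   # steps 3-4
        t3, _ = expand_and_screen(m1, train_masked_lm(t1), t1, s1)  # step 5
        corpora = [(s1, t1), (s3, t1), (s1, t3)]                    # step 6
        m3 = train_translation_model(corpora)
        s4_t4 = back_translate(target_monolingual_sentences, m3)    # step 7
        return train_translation_model(corpora + [s4_t4])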
Finally, it should be noted that those skilled in the art may freely combine the above embodiments according to actual needs.
Corresponding to the above method for training a language translation model provided in the embodiments of the present invention, an embodiment of the present invention further provides a training device for a language translation model. As shown in FIG. 2, the device includes:
a first processing module 201, configured to obtain a source corpus S1 and a target corpus T1 that are mutual translations of each other, and to construct a binary translation network M1 according to the source corpus S1 and the target corpus T1, wherein the binary translation network M1 is capable of judging whether an arbitrary source sentence and target sentence are mutual translations;
a training module 202, configured to train an initial source-language training model with a source corpus P1 to obtain a target source-language training model M2;
a first obtaining module 203, configured to obtain a source corpus S3 and a target corpus T2 corresponding to the source corpus S3 according to the binary translation network M1, the target source-language training model M2 and the source corpus S1;
and a second obtaining module 204, configured to obtain the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2.
In one embodiment, the first obtaining module comprises:
an expansion submodule, configured to expand the source corpus S1 according to the target source-language training model M2 to obtain a candidate source corpus S2;
and a screening submodule, configured to screen the candidate source corpus S2 and the target corpus T1 by using the binary translation network M1 to obtain the source corpus S3 and the target corpus T2.
In one embodiment, the screening submodule is specifically configured to:
obtain a preset corpus probability threshold for mutual-translation corpora;
and input the candidate source corpus S2 and the target corpus T1 into the binary translation network M1, so as to screen out the source corpus S3 and the target corpus T2 by using the binary translation network M1 and the corpus probability threshold.
In one embodiment, the device further comprises:
a second processing module, configured to take the source corpus S3 and its corresponding target corpus T2 as the new source corpus S1 and target corpus T1 respectively, and to execute the steps of the training module and the first obtaining module again, so as to obtain a target corpus T3 and a source corpus S4 corresponding to the target corpus T3;
the second obtaining module then includes:
an obtaining submodule, configured to obtain the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3, the target corpus T2, the target corpus T3 and the source corpus S4.
In one embodiment, the device further comprises:
a third obtaining module, configured to acquire a preset number of target monolingual sentences;
a translation module, configured to obtain, by using the language translation model, a source corpus P2 corresponding to the target monolingual sentences;
and the training module, configured to retrain the language translation model according to the source corpus P2 and the target monolingual sentences.
Other embodiments of the invention will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the invention and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the invention being indicated by the following claims.
It will be understood that the invention is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A method for training a language translation model, comprising:
step A1, obtaining a source corpus S1 and a target corpus T1 that are mutual translations of each other, and constructing a binary translation network M1 according to the source corpus S1 and the target corpus T1, wherein the binary translation network M1 is capable of judging whether an arbitrary source sentence and target sentence are mutual translations;
step A2, training an initial source-language training model with a source corpus P1 to obtain a target source-language training model M2;
step A3, obtaining a source corpus S3 and a target corpus T2 corresponding to the source corpus S3 according to the binary translation network M1, the target source-language training model M2 and the source corpus S1;
and step A4, obtaining the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2.
2. The method of claim 1, wherein step A3 includes:
expanding the source corpus S1 according to the target source-language training model M2 to obtain a candidate source corpus S2;
and screening the candidate source corpus S2 and the target corpus T1 by using the binary translation network M1 to obtain the source corpus S3 and the target corpus T2.
3. The method of claim 2, wherein the screening of the candidate source corpus S2 and the target corpus T1 by using the binary translation network M1 to obtain the source corpus S3 and the target corpus T2 includes:
obtaining a preset corpus probability threshold for mutual-translation corpora;
inputting the candidate source corpus S2 and the target corpus T1 into the binary translation network M1, so as to screen out the source corpus S3 and the target corpus T2 by using the binary translation network M1 and the corpus probability threshold.
4. The method of claim 2, further comprising:
taking the source corpus S3 and its corresponding target corpus T2 as the new source corpus S1 and target corpus T1 respectively, and executing step A2 and step A3 again to obtain a target corpus T3 and a source corpus S4 corresponding to the target corpus T3;
wherein step A4 includes:
obtaining the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3, the target corpus T2, the target corpus T3 and the source corpus S4.
5. The method according to any one of claims 1 to 4, further comprising:
acquiring a preset number of target monolingual sentences;
obtaining, by using the language translation model, a source corpus P2 corresponding to the target monolingual sentences;
and retraining the language translation model according to the source corpus P2 and the target monolingual sentences.
6. A training device for a language translation model, comprising:
a first processing module, configured to obtain a source corpus S1 and a target corpus T1 that are mutual translations of each other, and to construct a binary translation network M1 according to the source corpus S1 and the target corpus T1, wherein the binary translation network M1 is capable of judging whether an arbitrary source sentence and target sentence are mutual translations;
a training module, configured to train an initial source-language training model with a source corpus P1 to obtain a target source-language training model M2;
a first obtaining module, configured to obtain a source corpus S3 and a target corpus T2 corresponding to the source corpus S3 according to the binary translation network M1, the target source-language training model M2 and the source corpus S1;
and a second obtaining module, configured to obtain the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3 and the target corpus T2.
7. The device of claim 6, wherein the first obtaining module comprises:
an expansion submodule, configured to expand the source corpus S1 according to the target source-language training model M2 to obtain a candidate source corpus S2;
and a screening submodule, configured to screen the candidate source corpus S2 and the target corpus T1 by using the binary translation network M1 to obtain the source corpus S3 and the target corpus T2.
8. The device of claim 7, wherein the screening submodule is specifically configured to:
obtain a preset corpus probability threshold for mutual-translation corpora;
and input the candidate source corpus S2 and the target corpus T1 into the binary translation network M1, so as to screen out the source corpus S3 and the target corpus T2 by using the binary translation network M1 and the corpus probability threshold.
9. The device of claim 7, further comprising:
a second processing module, configured to take the source corpus S3 and its corresponding target corpus T2 as the new source corpus S1 and target corpus T1 respectively, and to execute the steps of the training module and the first obtaining module again, so as to obtain a target corpus T3 and a source corpus S4 corresponding to the target corpus T3;
wherein the second obtaining module includes:
an obtaining submodule, configured to obtain the language translation model according to the source corpus S1, the target corpus T1, the source corpus S3, the target corpus T2, the target corpus T3 and the source corpus S4.
10. The device of any one of claims 6 to 9, further comprising:
a third obtaining module, configured to acquire a preset number of target monolingual sentences;
a translation module, configured to obtain, by using the language translation model, a source corpus P2 corresponding to the target monolingual sentences;
and the training module, configured to retrain the language translation model according to the source corpus P2 and the target monolingual sentences.
CN202010307663.3A 2020-04-17 2020-04-17 Training method and device for language translation model Active CN111597824B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010307663.3A CN111597824B (en) 2020-04-17 2020-04-17 Training method and device for language translation model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010307663.3A CN111597824B (en) 2020-04-17 2020-04-17 Training method and device for language translation model

Publications (2)

Publication Number Publication Date
CN111597824A 2020-08-28
CN111597824B 2023-05-26

Family

ID=72190412

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010307663.3A Active CN111597824B (en) 2020-04-17 2020-04-17 Training method and device for language translation model

Country Status (1)

Country Link
CN (1) CN111597824B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113887252A (en) * 2021-10-18 2022-01-04 浙江香侬慧语科技有限责任公司 Unsupervised rephrase text generation method, unsupervised rephrase text generation device and unsupervised rephrase text generation medium based on machine translation


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090326912A1 (en) * 2006-08-18 2009-12-31 Nicola Ueffing Means and a method for training a statistical machine translation system
CN105389303A (en) * 2015-10-27 2016-03-09 北京信息科技大学 Automatic heterogenous corpus fusion method
CN110334361A (en) * 2019-07-12 2019-10-15 电子科技大学 A kind of neural machine translation method towards rare foreign languages language

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Lü Xueqiang; Wu Yongxu; Zhou Qiang; Liu Yin: "Research on Heterogeneous Corpus Fusion" (异源语料融合研究) *
Yao Liang; Hong Yu; Liu Hao; Liu Le; Yao Jianmin: "Domain Adaptation of Translation Models Based on Semantic Distribution Similarity" (基于语义分布相似度的翻译模型领域自适应研究) *


Also Published As

Publication number Publication date
CN111597824B (en) 2023-05-26


Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant