CN112966530A - Self-adaptive method, system, medium and computer equipment in machine translation field - Google Patents

Self-adaptive method, system, medium and computer equipment in machine translation field Download PDF

Info

Publication number
CN112966530A
CN112966530A (application CN202110375078.1A; granted as CN112966530B)
Authority
CN
China
Prior art keywords
domain
field
machine translation
model
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110375078.1A
Other languages
Chinese (zh)
Other versions
CN112966530B (en)
Inventor
贝超
程国艮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Global Tone Communication Technology Co ltd
Original Assignee
Global Tone Communication Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Global Tone Communication Technology Co ltd filed Critical Global Tone Communication Technology Co ltd
Priority to CN202110375078.1A priority Critical patent/CN112966530B/en
Publication of CN112966530A publication Critical patent/CN112966530A/en
Application granted granted Critical
Publication of CN112966530B publication Critical patent/CN112966530B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/40 - Processing or translation of natural language
    • G06F40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/40 - Processing or translation of natural language
    • G06F40/42 - Data-driven translation
    • G06F40/49 - Data-driven translation using very large corpora, e.g. the web
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention belongs to the technical field of machine translation, and discloses an adaptive method, system, medium and computer device for domain machine translation, comprising the following steps: augmenting the corpus with a semi-supervised method, performing adaptive training of a domain model according to the quantity and quality of corpora in different domains, and performing machine translation with the trained domain model. The invention provides a complete solution to the main problems in the practical application of domain machine translation; it makes effective use of the domain corpora the user provides and delivers a better domain machine translation model. When domain corpora are scarce, the invention constructs them with a semi-supervised method; different training modes are selected according to the user's requirements, and a better domain neural machine translation model is obtained through upsampling. By means of incremental training, the method avoids the phenomenon where the domain model overfits and cannot cover most users' usage scenarios, and it can build a domain model quickly.

Description

Self-adaptive method, system, medium and computer equipment in machine translation field
Technical Field
The invention belongs to the technical field of machine translation, and particularly relates to a self-adaptive method, a self-adaptive system, a self-adaptive medium and computer equipment in the field of machine translation.
Background
Machine translation automatically translates a sentence in a source language into a sentence in a target language using computer algorithms. It is a research direction within artificial intelligence with substantial scientific and practical value. With the deepening of globalization and the rapid development of the internet, machine translation plays an increasingly important role in political, economic, social and cultural exchange at home and abroad.
As the usability of neural machine translation has improved substantially, users' demand for machine translation has grown. General users have no specialized requirements and do not need high accuracy; general-domain machine translation meets their needs. Users in professional domains, however, have both a large demand for machine translation and high requirements for accuracy and domain expertise that a general-domain system cannot satisfy.
Domain neural machine translation has been discussed extensively in academia, but its industrial-scale application still faces many unsolved problems. Academic work can be tuned to a given test set, yet a domain test set contains only a few thousand sentences and cannot represent the sentences users need translated across all scenarios. In practice, therefore, domain machine translation models often leave users with a poor experience.
Training a neural network model from scratch spends a great deal of time on corpus processing and model training. In practice, however, users continually produce new domain corpora, and the model cannot be retrained from scratch every time; this demands rapid domain adaptation.
In addition, the user's corpus is small, hard-pressed to cover all usage scenarios, and of uncertain quality. The difficulty lies in exploiting the corpus the user provides while customizing the model to the user.
Through the above analysis, the problems and defects of the prior art are as follows: existing machine translation systems and methods cannot be applied to professional domains and cannot perform domain adaptation, and their translations are inaccurate with poor user experience. The difficulty in solving these problems is that available domain corpora are few or even absent, while a neural machine translation model is data-driven: with little data, a usable domain model cannot be trained, or training fails outright.
The significance of solving these problems is as follows: the invention can train a usable domain machine translation model under reasonable conditions according to the user's requirements and circumstances, solving the problem that a domain neural machine translation model cannot be trained when domain corpora are lacking.
Disclosure of Invention
In view of the problems in the prior art, the invention provides an adaptive method, system, medium and computer device for the machine translation field.
The invention is realized as follows: a machine translation domain adaptation method comprising: augmenting the corpus with a semi-supervised method, performing adaptive training of the domain model according to the quantity and quality of corpora in different domains, and performing machine translation with the trained domain model.
Further, the machine translation domain adaptation method comprises the following steps:
step one, producing a pseudo-parallel domain corpus with a semi-supervised method to augment the corpus;
step two, constructing a domain model and judging whether time is sufficient; if so, performing full training of the constructed domain model, and if not, performing incremental training;
step three, performing domain-adaptive machine translation with the trained domain model.
Further, producing the pseudo-parallel domain corpus with the semi-supervised method comprises:
collecting in-domain monolingual data, translating it with a reverse-direction machine translation model, and pairing the resulting translations with the original text to form an in-domain pseudo-parallel corpus.
Further, in step two, the full training of the constructed domain model comprises:
(1) preprocessing the training set, then training the constructed domain model with the general test set as the development set;
(2) training the constructed domain model a second time with the same training set, using the domain test set as the development set.
Further, in step (1), preprocessing the training set comprises: upsampling the domain corpus so that the ratio of general corpus sentences to domain corpus sentences is between 5:1 and 10:1.
Further, in step two, the incremental training of the constructed domain model comprises: assessing the state of the domain corpus and, based on the result, training the constructed domain model with in-domain corpora.
Further, training the constructed domain model with in-domain corpora based on the result comprises:
if the domain corpus is plentiful and of good quality: performing incremental training of the domain model with the domain corpus, starting from the original general model;
if the domain corpus is small or of low quality: mixing the domain corpus with the general corpus and upsampling so that the ratio of general to domain corpus is about 5:1, then performing incremental training from the general model.
It is a further object of the invention to provide a computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of: augmenting the corpus with a semi-supervised method, performing adaptive training of the domain model according to the quantity and quality of corpora in different domains, and performing machine translation with the trained domain model.
It is another object of the present invention to provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of: augmenting the corpus with a semi-supervised method, performing adaptive training of the domain model according to the quantity and quality of corpora in different domains, and performing machine translation with the trained domain model.
Another object of the present invention is to provide a machine translation domain adaptation system implementing the above method, the system comprising:
a corpus augmentation module for producing a pseudo-parallel domain corpus with a semi-supervised method and augmenting the corpus;
a model building module for building the domain model;
a training module for incremental or full training of the domain model according to the quantity and quality of corpora in different domains;
a translation module for performing domain-adaptive machine translation with the trained domain model.
Combining all the technical schemes above, the advantages and positive effects of the invention are as follows. The invention selects an appropriate training mode for different quantities and qualities of domain corpora. When the domain corpus is small, it can be rapidly expanded in a semi-supervised manner. With sufficient time, full training gives the domain model better quality. When the domain model must be trained quickly: if the corpus quality and quantity are both good, incremental training can run directly on the domain corpus; if the corpus quality is poor or the quantity small, the domain and general corpora can be mixed, the proportion of the domain corpus raised by upsampling, and incremental training performed afterwards.
The invention provides a complete solution to the main problems in the practical application of domain machine translation; it makes effective use of the domain corpora the user provides and delivers a better domain machine translation model.
When domain corpora are scarce, a semi-supervised method is chosen to construct them. Different training modes are selected according to the user's requirements, and a better domain neural machine translation model is finally obtained through upsampling. By means of incremental training, the method avoids the phenomenon where the domain model overfits and cannot cover most users' usage scenarios, and it can build a domain model quickly.
The invention has been applied to a machine translation engine in the financial domain; its effect, shown in Table 1, is a clear improvement over the general-domain model.
Table 1. Financial-domain BLEU values
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below are only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a method for adaptive domain machine translation according to an embodiment of the present invention.
Fig. 2 is a flowchart of a method for adapting to the field of machine translation according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating a structure of a machine translation domain adaptive system according to an embodiment of the present invention;
in the figure: 1. a corpus augmentation module; 2. a model building module; 3. a training module; 4. a translation module.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In view of the problems in the prior art, the present invention provides a method, system, medium and computer device for domain adaptation in machine translation, described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the machine translation domain adaptation method provided by an embodiment of the present invention comprises: augmenting the corpus with a semi-supervised method, performing adaptive training of the domain model according to the quantity and quality of corpora in different domains, and performing machine translation with the trained domain model.
As shown in fig. 2, the machine translation domain adaptation method provided by the embodiment of the present invention comprises the following steps:
S101, producing a pseudo-parallel domain corpus with a semi-supervised method to augment the corpus;
S102, constructing a domain model and judging whether time is sufficient; if so, performing full training of the constructed domain model, and if not, performing incremental training;
S103, performing domain-adaptive machine translation with the trained domain model.
Producing the pseudo-parallel domain corpus with the semi-supervised method, as provided by the embodiment of the invention, comprises the following steps:
collecting in-domain monolingual data, translating it with a reverse-direction machine translation model, and pairing the resulting translations with the original text to form an in-domain pseudo-parallel corpus.
In step S102, the full training of the constructed domain model provided by the embodiment of the present invention comprises:
(1) preprocessing the training set, then training the constructed domain model with the general test set as the development set;
(2) training the constructed domain model a second time with the same training set, using the domain test set as the development set.
In step (1), the training set preprocessing provided by the embodiment of the present invention comprises: upsampling the domain corpus so that the ratio of general corpus sentences to domain corpus sentences is between 5:1 and 10:1.
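The upsampling can be sketched as follows; the list-of-sentences representation and the function name are illustrative assumptions, while the 5:1 to 10:1 target band comes from the text:

```python
import math

def upsample_domain(general, domain, max_ratio=10):
    """Replicate the domain corpus so that the general-to-domain
    sentence ratio drops to max_ratio or below; with max_ratio=10 the
    resulting ratio lands inside the 5:1-10:1 band whenever the
    general corpus is at least five times larger than the domain one."""
    if not domain:
        return []
    factor = max(1, math.ceil(len(general) / (max_ratio * len(domain))))
    return domain * factor

general = ["general sentence"] * 1000
domain = ["domain sentence"] * 20
upsampled = upsample_domain(general, domain)
ratio = len(general) / len(upsampled)  # 1000 / 100 = 10.0, inside the band
```

A higher-quality domain corpus would warrant a lower `max_ratio` (a larger domain share), matching the text's note that better corpora get a larger proportion.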
In step S102, the incremental training of the constructed domain model provided in the embodiment of the present invention comprises:
assessing the state of the domain corpus and, based on the result, training the constructed domain model with in-domain corpora.
Training the constructed domain model with in-domain corpora based on the result, as provided by the embodiment of the invention, comprises the following cases:
if the domain corpus is plentiful and of good quality: performing incremental training of the domain model with the domain corpus, starting from the original general model;
if the domain corpus is small or of low quality: mixing the domain corpus with the general corpus and upsampling so that the ratio of general to domain corpus is about 5:1, then performing incremental training from the general model.
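The mode selection described above can be summarized as a small decision function; the boolean inputs and mode names are illustrative labels, not terminology from the patent:

```python
def choose_training_mode(time_sufficient, corpus_plentiful, quality_good):
    """Map the corpus/time conditions above to a training strategy."""
    if time_sufficient:
        # Two-stage full training (general dev set, then domain dev set).
        return "full-training"
    if corpus_plentiful and quality_good:
        # Fine-tune the general model on the domain corpus alone.
        return "incremental-domain-only"
    # Mix general and domain corpora at about 5:1, then fine-tune.
    return "incremental-mixed"

mode = choose_training_mode(time_sufficient=False,
                            corpus_plentiful=True,
                            quality_good=False)
```

In this scheme, time pressure overrides corpus conditions: full training is only attempted when time allows, and corpus quantity/quality decide only between the two incremental variants.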
As shown in fig. 3, the machine translation domain adaptation system provided by the embodiment of the present invention comprises:
a corpus augmentation module 1 for producing a pseudo-parallel domain corpus with a semi-supervised method and augmenting the corpus;
a model building module 2 for building the domain model;
a training module 3 for incremental or full training of the domain model according to the quantity and quality of corpora in different domains;
a translation module 4 for performing domain-adaptive machine translation with the trained domain model.
The technical effects of the present invention will be further described with reference to specific embodiments.
Example 1:
the invention provides a neural network-based field machine translation self-adaption method and system. The whole process is shown in FIG. 1.
1. Aiming at the problem of the amount of the linguistic data, the invention uses a semi-supervised method to produce the linguistic data in the pseudo-parallel field:
because the linguistic data in the bilingual field are few, especially in certain small languages, and the quality is difficult to guarantee, the quantity and the quality of the linguistic data can be better guaranteed by collecting the monolingual in the field. And then, translating by using a machine translation model in the opposite direction, wherein the obtained translated text and the original text form a field pseudo-parallel corpus.
2. On how to build a domain model quickly:
a) Full training
According to the quantity and quality of the domain corpus, the domain corpus is upsampled so that the ratio of general to domain corpus is about 5:1 to 10:1; the higher the quality of the domain corpus, the larger its proportion. The basic steps are:
i. With the model structure unchanged, train with the general test set as the development set until stopping.
ii. With the domain test set as the development set, train the model with the same training set until stopping.
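The two stages above, training to convergence against the general development set and then again with the same training set against the domain development set, amount to running an early-stopping loop twice. A hedged sketch, where `step_fn` and `evaluate` are placeholders for one round of NMT training and a development-set BLEU evaluation:

```python
def train_until_converged(step_fn, evaluate, patience=3):
    """Run training rounds while the development-set score improves;
    stop after `patience` consecutive rounds without improvement."""
    best, stale = float("-inf"), 0
    while stale < patience:
        step_fn()            # one training round (placeholder)
        score = evaluate()   # score on the chosen development set
        if score > best:
            best, stale = score, 0
        else:
            stale += 1
    return best

# Toy illustration: the dev score improves for five rounds, then plateaus.
scores = iter([21.0, 23.5, 24.1, 24.8, 25.2] + [25.2] * 10)
best_bleu = train_until_converged(lambda: None, lambda: next(scores))
```

Stage one would call this with the general test set as the development set; stage two would repeat it on the same training data with the domain test set as the development set.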
b) Incremental training
When time is limited and a domain model is wanted quickly, incremental training can be performed. Incremental training starts from the original model (generally the general model) and continues training it with in-domain corpora.
i. If the domain corpus is plentiful and of good quality: perform incremental training with the domain corpus, starting from the original general model.
ii. If the domain corpus is small or of low quality: mix the domain corpus with the general corpus, bring the general-to-domain ratio to about 5:1 by upsampling, and then perform incremental training from the general model.
This domain machine translation training method selects an appropriate training mode for different quantities and qualities of domain corpora. When the domain corpus is small, it can be rapidly expanded in a semi-supervised manner. With sufficient time, full training gives the domain model better quality. When the domain model must be trained quickly: if the corpus quality and quantity are both good, incremental training can run directly on the domain corpus; if the corpus quality is poor or the quantity small, the domain and general corpora can be mixed, the proportion of the domain corpus raised by upsampling, and incremental training performed afterwards.
Example 2:
A domain model in the English-to-Chinese direction is trained.
1. Semi-supervised augmentation of the domain corpus:
Collect in-domain Chinese monolingual data; after cleaning, translate it into English with a general Chinese-to-English machine translation model; clean the resulting translations; the translation-original pairs form the English-to-Chinese domain pseudo-bilingual corpus.
2. Training a field model:
a) If time allows, full training is performed.
According to the quantity and quality of the domain corpus, the domain corpus is upsampled so that the ratio of general to domain corpus is about 5:1 to 10:1; the higher the quality of the domain corpus, the larger its proportion. The basic steps are:
i. With the model structure unchanged, train with the general test set as the development set until stopping.
ii. With the domain test set as the development set, train the model with the same training set until stopping.
b) If time is short, incremental training is performed.
When time is limited and a domain model is wanted quickly, incremental training can be performed. Incremental training starts from the original model (generally the general model) and continues training it with in-domain corpora.
i. If the domain corpus is plentiful and of good quality: perform incremental training with the domain corpus, starting from the original general model.
ii. If the domain corpus is small or of low quality: mix the domain corpus with the general corpus, bring the general-to-domain ratio to about 5:1 by upsampling, and then perform incremental training from the general model.
It should be noted that the embodiments of the present invention can be realized by hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or specially designed hardware. Those skilled in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such code being provided on a carrier medium such as a disk, CD-or DVD-ROM, programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier, for example. The apparatus and its modules of the present invention may be implemented by hardware circuits such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., or by software executed by various types of processors, or by a combination of hardware circuits and software, e.g., firmware.
The above description is only a specific embodiment of the present invention and is not intended to limit its protection scope; any modification, equivalent substitution or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the appended claims.

Claims (8)

1. A machine translation domain adaptation method, characterized in that the method augments the corpus in a semi-supervised manner, performs adaptive training of a domain model according to the quantity and quality of corpora in different domains, and performs machine translation with the trained domain model;
the self-adaptive method of the machine translation field comprises the following steps:
step one, producing a pseudo-parallel field corpus by using a semi-supervised method, and augmenting the corpus;
step two, constructing a domain model and, if time is sufficient, performing full training of the constructed domain model; if time is not sufficient, performing incremental training of the constructed domain model;
and step three, performing the machine translation of the domain self-adaption by using the trained domain model.
2. The machine translation domain adaptive method of claim 1, wherein producing the pseudo-parallel domain corpus with the semi-supervised method comprises: collecting in-domain monolingual data, translating it with a reverse-direction machine translation model, and pairing the resulting translations with the original text to form an in-domain pseudo-parallel corpus.
3. The machine translation domain adaptive method of claim 1, wherein in step two, the training the constructed domain model comprises:
(1) carrying out training set pretreatment; training the constructed domain model by taking the universal test set as a development set;
(2) and (4) taking the field test set as a development set, and performing secondary training on the constructed field model by using the same training set.
4. The machine translation domain adaptive method of claim 3, wherein in step (1), the training set preprocessing comprises: upsampling the domain corpus so that the ratio of general corpus sentences to domain corpus sentences is between 5:1 and 10:1.
5. The machine translation domain adaptive method of claim 1, wherein in step two, the incrementally training the constructed domain model comprises: and judging the condition of the domain linguistic data, and training the constructed domain model by utilizing the linguistic data in the domain based on a judgment result.
6. A computer device, characterized in that the computer device comprises a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to carry out the steps of: performing adaptive training of a domain model based on the quantity and quality of different domain corpora through semi-supervised augmentation corpora, and performing machine translation by using the trained domain model; the method specifically comprises the following steps:
step one, producing a pseudo-parallel field corpus by using a semi-supervised method, and augmenting the corpus;
step two, constructing a domain model and, if time is sufficient, performing full training of the constructed domain model; if time is not sufficient, performing incremental training of the constructed domain model;
and step three, performing the machine translation of the domain self-adaption by using the trained domain model.
7. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of: performing adaptive training of a domain model based on the quantity and quality of different domain corpora through semi-supervised augmentation corpora, and performing machine translation by using the trained domain model; the method specifically comprises the following steps:
step one, producing a pseudo-parallel field corpus by using a semi-supervised method, and augmenting the corpus;
step two, constructing a domain model and, if time is sufficient, performing full training of the constructed domain model; if time is not sufficient, performing incremental training of the constructed domain model;
and step three, performing the machine translation of the domain self-adaption by using the trained domain model.
8. A machine translation domain adaptive system for implementing the machine translation domain adaptive method according to any one of claims 1 to 5, wherein the machine translation domain adaptive system comprises:
the corpus augmentation module is used for producing the corpus in the pseudo-parallel field by using a semi-supervised method and augmenting the corpus;
the model building module is used for building a domain model;
the training module is used for carrying out incremental or full training of the domain model based on the quantity and quality of different domain linguistic data;
and the translation module is used for performing the machine translation of the domain self-adaption by utilizing the trained domain model.
CN202110375078.1A 2021-04-08 2021-04-08 Self-adaptive method, system, medium and computer equipment in machine translation field Active CN112966530B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110375078.1A CN112966530B (en) 2021-04-08 2021-04-08 Self-adaptive method, system, medium and computer equipment in machine translation field


Publications (2)

Publication Number Publication Date
CN112966530A true CN112966530A (en) 2021-06-15
CN112966530B CN112966530B (en) 2022-07-22

Family

ID=76281494

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110375078.1A Active CN112966530B (en) 2021-04-08 2021-04-08 Self-adaptive method, system, medium and computer equipment in machine translation field

Country Status (1)

Country Link
CN (1) CN112966530B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038159A (en) * 2017-03-09 2017-08-11 Tsinghua University A neural machine translation method based on unsupervised domain adaptation
CN109190768A (en) * 2018-08-09 2019-01-11 Beijing Zhongguancun Kejin Technology Co., Ltd. A data-augmentation corpus training method for neural networks
CN110263349A (en) * 2019-03-08 2019-09-20 Tencent Technology (Shenzhen) Co., Ltd. Corpus evaluation model training method, apparatus, storage medium and computer device
CN110728154A (en) * 2019-08-28 2020-01-24 Unisound Intelligent Technology Co., Ltd. Construction method of a semi-supervised general neural machine translation model
CN110889295A (en) * 2019-09-12 2020-03-17 Huawei Technologies Co., Ltd. Machine translation model, and method, system and device for determining pseudo-professional parallel corpora
CN111414770A (en) * 2020-02-24 2020-07-14 Inner Mongolia University of Technology A semi-supervised Mongolian neural machine translation method based on co-training
CN111859995A (en) * 2020-06-16 2020-10-30 Beijing Baidu Netcom Science and Technology Co., Ltd. Training method and apparatus for a machine translation model, electronic device and storage medium
US10878201B1 (en) * 2017-07-27 2020-12-29 Lilt, Inc. Apparatus and method for an adaptive neural machine translation system


Also Published As

Publication number Publication date
CN112966530B (en) 2022-07-22

Similar Documents

Publication Publication Date Title
US10679148B2 (en) Implicit bridging of machine learning tasks
CN111079406B (en) Natural language processing model training method, task execution method, equipment and system
US20130185049A1 (en) Predicting Pronouns for Pro-Drop Style Languages for Natural Language Translation
Kenny Human and machine translation
WO2020242567A1 (en) Cross-lingual task training
CN116595999B (en) Machine translation model training method and device
CN116187282B (en) Training method of text review model, text review method and device
Rigau et al. Meaning: A roadmap to knowledge technologies
KR20220033652A (en) Method of building training data of machine translation
CN110287498B (en) Hierarchical translation method, device and storage medium
Steuer et al. On the linguistic and pedagogical quality of automatic question generation via neural machine translation
Mo Design and Implementation of an Interactive English Translation System Based on the Information‐Assisted Processing Function of the Internet of Things
Jiang et al. Chat with illustration
CN112966530B (en) Self-adaptive method, system, medium and computer equipment in machine translation field
Wang The development of translation technology in the era of big data
CN117111952A (en) Code complement method and device based on generation type artificial intelligence and medium
Zhu et al. Improving low-resource named entity recognition via label-aware data augmentation and curriculum denoising
CN115809658A (en) Parallel corpus generation method and device and unsupervised synonymy transcription method and device
Jooste et al. Philipp Koehn: Neural Machine Translation: Cambridge University Press, 30 Jun 2020, www.cambridge.org/9781108497329, DOI: 10.1017/9781108608480
CN116151347A (en) Training method and device for pre-training language model and electronic equipment
US12019990B2 (en) Representation learning method and device based on natural language and knowledge graph
Miao et al. Improved Quality Estimation of Machine Translation with Pre-trained Language Representation
US20210192364A1 (en) Representation learning method and device based on natural language and knowledge graph
Yang et al. Analysis of AI MT based on fuzzy algorithm
Cao et al. Design and Application of Corpus in Computational Linguistics based on Multimedia Virtual Technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant