CN113111667B - Method for generating pseudo data in low-resource language based on multi-language model - Google Patents

Method for generating pseudo data in low-resource language based on multi-language model

Info

Publication number
CN113111667B
CN113111667B (application CN202110397096.XA)
Authority
CN
China
Prior art keywords: language, data, bilingual, model, training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110397096.XA
Other languages
Chinese (zh)
Other versions
CN113111667A (en)
Inventor
杜权 (Du Quan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Yayi Network Technology Co., Ltd.
Original Assignee
Shenyang Yayi Network Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Yayi Network Technology Co., Ltd.
Priority: CN202110397096.XA
Publication of CN113111667A
Application granted
Publication of CN113111667B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/40 - Processing or translation of natural language
    • G06F40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/335 - Filtering based on additional data, e.g. user or group profiles
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/12 - Use of codes for handling textual entities
    • G06F40/126 - Character encoding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a method for generating pseudo data for a low-resource language based on a multilingual model, comprising the following steps: preprocessing the monolingual and bilingual data of the minor language together with bilingual data of languages of the same or a neighboring language family to obtain bilingual training data, and training multilingual models on it; preprocessing the minor-language source-language monolingual data to obtain monolingual training data; decoding the monolingual training data to obtain minor-language target-language monolingual pseudo data; processing the target-language monolingual pseudo data and the source-language monolingual training data into bilingual pseudo data, merging it with the training data of the target-to-source multilingual model, and processing the result into new target-to-source bilingual training data; and finally iterating until model performance no longer improves. By merging bilingual data of the same or neighboring language families of the minor language into model training, the invention increases the amount of training data and injects the language features of related languages into the model, thereby improving model performance.

Description

Method for generating pseudo data in low-resource language based on multi-language model
Technical Field
The invention relates to a method for generating pseudo data, in particular to a method for generating pseudo data for low-resource languages based on a multilingual model.
Background
Neural machine translation is a research area that emerged around 2013, and it has developed very rapidly, far surpassing traditional statistical machine translation on many language pairs. Unlike traditional statistics-based translation algorithms, neural machine translation treats the translation task as the problem of mapping one sentence sequence to another: the end-to-end task is completed by a deep neural network comprising an encoder and a decoder, which avoids the complicated intermediate steps of traditional statistical machine translation. Owing to this simplicity and its good translation performance, the neural machine translation model has received more and more attention. The translation model itself went through a development process from the original plain encoder-decoder model, to attention-based encoder-decoder models, to fully CNN-based neural machine translation models, and finally to neural machine translation models based entirely on self-attention. Each evolution of the translation model brought a large improvement in performance, but its success depends mainly on large amounts of high-quality bilingual corpus; *** once demonstrated that the automatic evaluation metric improves by 0.5 percent every time the scale of the data set is doubled, which shows the importance of data volume for improving model performance.
With the rapid development of machine translation, translation tasks between major languages such as English and French can already achieve good results, but translating minor languages remains a difficult problem. Compared with the abundant corpora of major languages, the main challenge in minor-language machine translation is the sparsity of corpora: acquiring large amounts of parallel data is very difficult, while acquiring monolingual data is relatively easy. Many attempts have been made in the machine translation field to address the low-resource problem of minor languages, and the methods fall roughly into two categories. The first makes full use of the easily acquired monolingual data; the typical method is back-translation, which uses a reverse translation model to translate target-language monolingual data into source-language data, and constructs bilingual pseudo data by this means to train the forward translation model (a minimal code sketch of this idea follows this paragraph). The second category is the multilingual model method, which defines an encoder and a decoder for each language and translates between different languages through a shared attention mechanism.
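As a concrete illustration of the back-translation idea just described, the following minimal sketch builds bilingual pseudo data from target-language monolingual text. It is a sketch only: reverse_model and its translate method are hypothetical stand-ins for any trained target-to-source translation model, not an API defined by this document.

# Minimal back-translation sketch; reverse_model.translate is a hypothetical
# stand-in for any trained target->source translation model.
def back_translate(target_monolingual, reverse_model):
    """Build (pseudo source, real target) pairs from target-language sentences."""
    pseudo_pairs = []
    for tgt in target_monolingual:
        src_pseudo = reverse_model.translate(tgt)  # decode target -> source
        pseudo_pairs.append((src_pseudo, tgt))     # forward model trains on these
    return pseudo_pairs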
However, the above methods ignore the similarity among minor languages of the same or neighboring language families, exploit low-resource languages poorly, and therefore yield poor translation quality. For example, Hindi and its close relatives in the Indic branch of the Indo-European family share the same alphabet, through which text in the two languages can be converted into each other. For low-resource minor-language machine translation this is a new way of adding bilingual data, and no system that exploits the similarity of the same or neighboring language families of minor languages to realize low-resource translation has been reported so far.
Disclosure of Invention
Aiming at the fact that existing methods ignore the similarity among minor languages of the same or neighboring language families in the low-resource translation of minor languages, the technical problem to be solved by the invention is to provide a method for generating pseudo data for low-resource languages based on a multilingual model, thereby improving the performance of minor-language translation models.
In order to solve the technical problems, the invention adopts the following technical scheme:
the invention provides a method for generating pseudo data for a low-resource language based on a multilingual model, comprising the following steps:
1) Acquiring, through the Internet, monolingual and bilingual data of the minor language and bilingual data of languages of the same or a neighboring language family;
2) Preprocessing the minor-language bilingual data together with the bilingual data of the same or neighboring-family languages to obtain bilingual training data, and training on it to obtain source-to-target and target-to-source multilingual models;
3) Preprocessing the minor-language source-language monolingual data to obtain the monolingual training data used for decoding to generate pseudo data;
4) Decoding the monolingual training data with the source-to-target multilingual model to obtain minor-language target-language monolingual pseudo data;
5) Pairing the obtained target-language monolingual pseudo data with the source-language monolingual training data to obtain bilingual pseudo data;
6) Merging the processed bilingual pseudo data with the training data of the target-to-source multilingual model, processing the result into new target-to-source bilingual training data, then carrying out a new round of target-to-source multilingual model training and pseudo data generation, and iterating until model performance no longer improves (a high-level code sketch of this loop follows).
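The iterative loop of steps 2) through 6) can be sketched at a high level as follows. All callables here (train, translate, evaluate, merge) are hypothetical placeholders injected as parameters, standing for the operations the steps describe rather than any fixed implementation.

# High-level sketch of steps 2)-6) as one loop. train(data, direction) returns
# a model, translate(model, sentence) decodes, evaluate(model, dev) returns a
# BLEU score, and merge(old, new) merges and deduplicates bilingual data; all
# four are assumed placeholders.
def iterate_pseudo_data(bilingual_data, source_mono, dev_set,
                        train, translate, evaluate, merge):
    fwd_data = list(bilingual_data)          # source->target data, step 2)
    bwd_data = list(bilingual_data)          # target->source data, step 2)
    best_bleu = float("-inf")
    best_model = None
    while True:
        fwd_model = train(fwd_data, "src2tgt")
        candidate = train(bwd_data, "tgt2src")
        bleu = evaluate(candidate, dev_set)
        if bleu <= best_bleu:                # step 6): stop when no improvement
            break
        best_bleu, best_model = bleu, candidate
        pseudo_tgt = [translate(fwd_model, s) for s in source_mono]     # step 4)
        bwd_data = merge(bwd_data, list(zip(source_mono, pseudo_tgt)))  # 5)-6)
    return best_model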
In step 2), the minor-language bilingual data and the bilingual data of the same or neighboring-family languages are preprocessed to obtain bilingual training data, and model training yields multilingual models in both directions, as follows:
201) the bilingual data obtained in step 1) is passed sequentially through a mixed-language filtering method, an HTML tag removal method, a garbled-character filtering method, a repeated-translation filtering method, a mismatched-bracket filtering method, an over-long sentence filtering method, and a target-language Unicode encoding filtering method, yielding the preprocessed bilingual data;
202) the preprocessed minor-language bilingual data is merged with the bilingual data of the other languages, and a bidirectional deduplication method is applied to obtain the training data used for multilingual model training;
203) the training data is tokenized with a word segmentation tool and then segmented with BPE to obtain the final training data;
204) models are trained in both directions on the final training data, yielding bidirectional multilingual models.
In step 3), the minor-language source-language monolingual data is preprocessed to obtain the monolingual training data used by the multilingual model for decoding to generate pseudo data, as follows:
301) the monolingual data obtained in step 1) is passed sequentially through the HTML tag removal method, the garbled-character filtering method, the over-long sentence filtering method, and the target-language Unicode encoding filtering method, yielding the preprocessed monolingual data;
302) the monolingual data is scored with the XENC tool, and the top n% of the data by score ranking is kept, yielding the monolingual training data used for decoding; n is 50 to 60.
In step 5), the obtained target-language monolingual pseudo data and the source-language monolingual training data are processed to obtain bilingual pseudo data, as follows:
501) the target-language monolingual pseudo data is paired with the source-language monolingual training data to obtain bilingual pseudo data;
502) the bilingual pseudo data is processed with an n-gram filtering method and then word-segmented, yielding the processed bilingual pseudo data.
In step 6), the processed bilingual pseudo data is merged with the training data of the target-to-source multilingual model, processed into new target-to-source bilingual training data, and used for a new round of target-to-source training and pseudo data generation, iterating until model performance no longer improves; specifically:
601) the bilingual pseudo data is merged with the bilingual training data of the target-to-source multilingual model;
602) the bidirectional deduplication method is applied to the merged bilingual data to obtain the processed bilingual training data;
603) the processed bilingual training data is used to train a new round of the target-to-source multilingual model and to generate pseudo data, iterating until model performance no longer improves.
The invention has the following beneficial effects and advantages:
1. On top of the minor language's scarce resources, the invention merges bilingual data of the same or neighboring language families into model training, which both increases the amount of training data and injects the language features of related languages into the model, thereby improving its performance.
2. At the same time, the multilingual model trained by the invention fuses bilingual data of several languages of the same or neighboring families, providing a new way to generate pseudo data for other minor languages and to improve translation quality.
Drawings
FIG. 1 is a flow chart of bilingual data preprocessing in accordance with the present invention;
FIG. 2 is a flow chart of monolingual data preprocessing in the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention extends the pseudo data generation methods that previously faced only a single minor-language model, and proposes a multilingual-model-based method for generating low-resource machine translation pseudo data. By merging bilingual data of languages of the same or neighboring families into multilingual model training, the method increases the amount of training data and, without adding any minor-language bilingual data, achieves better performance than ordinary pseudo data generation.
In order to solve the technical problems, the invention adopts the following technical scheme:
the invention provides a method for generating pseudo data in a low-resource language based on a multi-language model, which comprises the following steps:
1) Acquiring bilingual data of a small language single language and bilingual data of a neighboring language system or the same language as the small language through the Internet;
2) Preprocessing bilingual data of a small language and bilingual data of the same language or an adjacent language system of the language to obtain bilingual training data, and performing model training by using the bilingual training data to obtain a multi-language model from a source language to a target language and from the target language to the source language;
3) Preprocessing the multilingual data of the small language source language to obtain the multilingual training data for decoding to generate pseudo data;
4) Decoding the single-language training data by using a multi-language model from the source language to the target language to obtain small-language target language single-language pseudo data;
5) Processing the obtained target language single-language pseudo data and the source language single-language training data to obtain bilingual pseudo data;
6) And integrating the processed bilingual pseudo data with training data of the target-language-to-source-language multi-language model, processing to obtain new target-language-to-source-language bilingual training data, then performing new round of training of the target-language-to-source-language multi-language model, generating the pseudo data, and finally iterating until the performance of the model is not improved.
Step 1) mainly acquires from the Internet publicly available monolingual and bilingual data of the minor language and bilingual data of languages of the same or a neighboring language family.
Step 2) preprocesses the minor-language bilingual data and the bilingual data of the same or neighboring-family languages obtained in step 1) to obtain bilingual training data, and then trains on this data to obtain source-to-target and target-to-source multilingual models; FIG. 1 shows the bilingual data preprocessing flow, and the specific process is as follows:
201) the minor-language bilingual data and the bilingual data of the same or neighboring-family languages obtained in step 1) are passed sequentially through a mixed-language filtering method (filtering out sentences mixed with many words of other languages), an HTML tag removal method (removing redundant HTML tags from sentences), a garbled-character filtering method (filtering garbled characters and other unknown symbols), a repeated-translation filtering method (filtering sentences containing multiply repeated translations), a mismatched-bracket filtering method (filtering redundant or non-corresponding bracket content), an over-long sentence filtering method (filtering sentences with too many words or that are too long), and a target-language Unicode encoding filtering method (filtering by the minor language's Unicode code range); the output of each method is the input of the next, finally yielding the preprocessed bilingual data (a code sketch of part of this filter chain follows step 204);
202) the minor-language bilingual data preprocessed in step 201) is merged with the bilingual data of the same or neighboring-family languages, and a bidirectional deduplication method based on the source and target sides is applied to obtain the training data used for multilingual model training;
203) the training data processed in step 202) is tokenized with a word segmentation tool (Moses for English, INDIC for Hindi, and so on) and then segmented with BPE to obtain the final training data;
204) models are trained in both directions on the final training data from step 203), yielding bidirectional multilingual models.
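The following sketch shows several of the filters from step 201) in code form: HTML tag removal, garbled-character filtering, length filtering, the bracket-correspondence check, and the target-language Unicode range filter. The thresholds, the reading of the bracket check, and the Tamil Unicode block are illustrative assumptions, not values fixed by the patent; the mixed-language and repeated-translation filters would follow the same pattern, and tokenization plus BPE (step 203) would then typically be applied with existing tools such as the Moses tokenizer and subword-nmt.

import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")        # crude HTML tag matcher
TAMIL_BLOCK = (0x0B80, 0x0BFF)         # assumed target-language Unicode range

def strip_html(text):
    return TAG_RE.sub(" ", text).strip()

def has_garbled_chars(text):
    # treat private-use and unassigned code points as garbled symbols
    return any(unicodedata.category(ch) in ("Co", "Cn") for ch in text)

def length_ok(src, tgt, max_words=200, max_ratio=3.0):
    ls, lt = len(src.split()), len(tgt.split())
    return (0 < ls <= max_words and 0 < lt <= max_words
            and max(ls, lt) / min(ls, lt) <= max_ratio)

def brackets_correspond(src, tgt):
    # assumed reading of the bracket check: both sides carry equally many brackets
    return src.count("(") + src.count(")") == tgt.count("(") + tgt.count(")")

def mostly_in_block(text, block=TAMIL_BLOCK, min_frac=0.5):
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and sum(
        block[0] <= ord(ch) <= block[1] for ch in letters) / len(letters) >= min_frac

def clean_bilingual(pairs):
    # each filter's output feeds the next, as in step 201)
    for src, tgt in pairs:
        src, tgt = strip_html(src), strip_html(tgt)
        if has_garbled_chars(src) or has_garbled_chars(tgt):
            continue
        if not (length_ok(src, tgt) and brackets_correspond(src, tgt)
                and mostly_in_block(tgt)):
            continue
        yield src, tgt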
Step 3) preprocesses the minor-language source-language monolingual data to obtain the monolingual training data used for decoding to generate pseudo data; FIG. 2 shows the monolingual data preprocessing flow, with the following specific steps:
301) the monolingual data obtained in step 1) is passed sequentially through the HTML tag removal method (removing redundant HTML tags from sentences), the garbled-character filtering method (filtering garbled characters and other unknown symbols), the over-long sentence filtering method (filtering sentences with too many words or that are too long), and the target-language Unicode encoding filtering method (filtering by the minor language's Unicode code range); the output of each method is the input of the next, finally yielding the preprocessed monolingual data;
302) the monolingual data is scored with the XENC tool, and the top n% of the data (n is 50 to 60 in this embodiment) by score ranking is kept, yielding the monolingual training data used for decoding.
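Step 302) uses the XENC tool; the sketch below shows the cross-entropy-difference idea behind this kind of data scoring, implemented here with two KenLM language models instead of XENC itself. The .arpa model paths and the 55% cut-off are illustrative assumptions.

import kenlm  # KenLM Python bindings

in_domain_lm = kenlm.Model("in_domain.arpa")  # assumed LM over task-like text
general_lm = kenlm.Model("general.arpa")      # assumed LM over generic text

def ce_diff(sentence):
    # KenLM returns log10 probabilities; normalize by sentence length
    n = max(len(sentence.split()), 1)
    return (in_domain_lm.score(sentence) - general_lm.score(sentence)) / n

def keep_top_fraction(sentences, frac=0.55):  # n% with n between 50 and 60
    ranked = sorted(sentences, key=ce_diff, reverse=True)
    return ranked[: int(len(ranked) * frac)]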
Step 4) uses the source-to-target multilingual model trained in step 2) to decode the monolingual training data processed in step 3), obtaining the minor-language target-language monolingual pseudo data.
Step 5) processes the minor-language target-language monolingual pseudo data decoded in step 4) and the minor-language source-language monolingual training data processed in step 3) to obtain bilingual pseudo data; the specific process is as follows:
501) the target-language monolingual pseudo data is paired with the source-language monolingual training data to obtain bilingual pseudo data;
502) the bilingual pseudo data is processed with an n-gram filtering method and then word-segmented, yielding the processed bilingual pseudo data.
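The patent does not spell out its n-gram filtering method; one common reading, sketched below as an assumption, drops pseudo pairs whose machine-decoded target side repeats some n-gram too often, a typical artifact of decoded text. The order n=3 and the repetition threshold are illustrative.

from collections import Counter

def repeats_ngrams(sentence, n=3, max_count=3):
    tokens = sentence.split()
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return any(count > max_count for count in grams.values())

def filter_pseudo_pairs(pairs):
    # drop pairs whose pseudo target side shows degenerate n-gram repetition
    return [(src, tgt) for src, tgt in pairs if not repeats_ngrams(tgt)]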
Step 6) merges the bilingual pseudo data obtained in step 5) with the bilingual training data of the target-to-source multilingual model obtained in step 2), processes the result into new target-to-source bilingual training data, and then carries out a new round of target-to-source multilingual model training and pseudo data generation until model performance no longer improves; the specific process is as follows:
601) the bilingual pseudo data obtained in step 5) is merged with the bilingual training data of the target-to-source multilingual model obtained in step 2);
602) the bidirectional deduplication method based on the source and target sides is applied to the bilingual data from step 601), obtaining the processed bilingual training data (a sketch follows step 603);
603) the target-to-source bilingual training data processed in step 602) is used to train a new round of the target-to-source multilingual model and to generate pseudo data, until the performance of the final iterated model no longer improves.
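The bidirectional deduplication of step 602) can be read as keeping a sentence pair only if neither its source side nor its target side has appeared before; this reading is an assumption consistent with the description, sketched below. Placing the genuine bilingual data ahead of the pseudo data in the input then lets real pairs win ties against pseudo pairs.

def dedup_bidirectional(pairs):
    seen_src, seen_tgt = set(), set()
    kept = []
    for src, tgt in pairs:
        if src in seen_src or tgt in seen_tgt:
            continue             # duplicate on either side: drop the pair
        seen_src.add(src)
        seen_tgt.add(tgt)
        kept.append((src, tgt))
    return kept

# e.g. merged = dedup_bidirectional(real_bilingual + pseudo_bilingual)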
To verify the effectiveness of the method, the multilingual-model-based method for generating low-resource machine translation pseudo data was tested on translation tasks. Specifically, the multilingual models were built on the WMT2020 monolingual and bilingual corpora for Tamil (ta) to English (en) and English to Tamil; bilingual data of Marathi, Kannada, and Telugu, languages of the same family as Tamil, and of Hindi, Urdu, and Punjabi, languages of neighboring families, was selected for multilingual model training. A Transformer model was used, and experiments were run on NVIDIA 2080Ti devices.
Experimental results show that the multilingual-model-based method for generating low-resource machine translation pseudo data improves the translation performance of the model (a higher BLEU indicates better model performance) when minor-language bilingual data is scarce, and yields better model performance than training directly on the minor-language bilingual data and then adding pseudo data.
The invention provides a multilingual-model-based method for generating low-resource machine translation pseudo data. On top of the scarce resources of a minor language, it merges bilingual data of the same or neighboring language families into model training, which not only increases the amount of training data but also injects the language features of related languages into the model, further improving its performance. In addition, the multilingual model trained by the invention fuses bilingual data of several related languages, enabling it to generate pseudo data for other minor languages and to improve translation quality.

Claims (4)

1. A method for generating pseudo data for a low-resource language based on a multilingual model, comprising the following steps:
1) acquiring, through the Internet, monolingual and bilingual data of the minor language and bilingual data of languages of the same or a neighboring language family;
2) preprocessing the minor-language bilingual data together with the bilingual data of the same or neighboring-family languages to obtain bilingual training data, and training on it to obtain source-to-target and target-to-source multilingual models;
3) preprocessing the minor-language source-language monolingual data to obtain the monolingual training data used for decoding to generate pseudo data;
4) decoding the monolingual training data with the source-to-target multilingual model to obtain minor-language target-language monolingual pseudo data;
5) pairing the obtained target-language monolingual pseudo data with the source-language monolingual training data to obtain bilingual pseudo data;
6) merging the processed bilingual pseudo data with the training data of the target-to-source multilingual model, processing the result into new target-to-source bilingual training data, then carrying out a new round of target-to-source multilingual model training and pseudo data generation, and iterating until model performance no longer improves;
wherein in step 2) the minor-language bilingual data and the bilingual data of the same or neighboring-family languages are preprocessed to obtain bilingual training data, and models are trained in both directions, as follows:
201) the bilingual data obtained in step 1) is passed sequentially through a mixed-language filtering method, an HTML tag removal method, a garbled-character filtering method, a repeated-translation filtering method, a mismatched-bracket filtering method, an over-long sentence filtering method, and a target-language Unicode encoding filtering method, yielding the preprocessed bilingual data;
202) the preprocessed minor-language bilingual data is merged with the bilingual data of the other languages, and a bidirectional deduplication method is applied to obtain the training data used for multilingual model training;
203) the training data is tokenized with a word segmentation tool and then segmented with BPE to obtain the final training data;
204) models are trained in both directions on the final training data, yielding bidirectional multilingual models;
and wherein in step 3) the minor-language source-language monolingual data is preprocessed to obtain the monolingual training data used by the multilingual model for decoding to generate pseudo data, as follows:
301) the monolingual data obtained in step 1) is passed sequentially through the HTML tag removal method, the garbled-character filtering method, the over-long sentence filtering method, and the target-language Unicode encoding filtering method, yielding the preprocessed monolingual data;
302) the monolingual data is scored with the XENC tool, and the top n% of the data by score ranking is kept, yielding the monolingual training data used for decoding.
2. The method for generating pseudo data for a low-resource language based on a multilingual model according to claim 1, wherein n is 50 to 60.
3. The method for generating pseudo data for a low-resource language based on a multilingual model according to claim 1, wherein in step 5) the obtained target-language monolingual pseudo data and the source-language monolingual training data are processed to obtain bilingual pseudo data, as follows:
501) the target-language monolingual pseudo data is paired with the source-language monolingual training data to obtain bilingual pseudo data;
502) the bilingual pseudo data is processed with an n-gram filtering method and then word-segmented, yielding the processed bilingual pseudo data.
4. The method for generating pseudo data for a low-resource language based on a multilingual model according to claim 1, wherein in step 6) the processed bilingual pseudo data is merged with the training data of the target-to-source multilingual model, processed into new target-to-source bilingual training data, and used for a new round of training and pseudo data generation, iterating until model performance no longer improves; specifically:
601) the bilingual pseudo data is merged with the bilingual training data of the target-to-source multilingual model;
602) the bidirectional deduplication method is applied to the merged bilingual data to obtain the processed bilingual training data;
603) the processed bilingual training data is used to train a new round of the target-to-source multilingual model and to generate pseudo data, iterating until model performance no longer improves.
CN202110397096.XA 2021-04-13 2021-04-13 Method for generating pseudo data in low-resource language based on multi-language model Active CN113111667B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110397096.XA CN113111667B (en) 2021-04-13 2021-04-13 Method for generating pseudo data in low-resource language based on multi-language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110397096.XA CN113111667B (en) 2021-04-13 2021-04-13 Method for generating pseudo data in low-resource language based on multi-language model

Publications (2)

Publication Number Publication Date
CN113111667A CN113111667A (en) 2021-07-13
CN113111667B (en) 2023-08-22

Family

Family ID: 76716824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110397096.XA Active CN113111667B (en) 2021-04-13 2021-04-13 Method for generating pseudo data in low-resource language based on multi-language model

Country Status (1)

Country Link
CN (1) CN113111667B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505571A * 2021-07-30 2021-10-15 Shenyang Yayi Network Technology Co., Ltd. Data selection and training method for neural machine translation


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120330644A1 (en) * 2011-06-22 2012-12-27 Salesforce.Com Inc. Multi-lingual knowledge base

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108363704A * 2018-03-02 2018-08-03 Beijing Institute of Technology Neural machine translation corpus expansion method based on a statistical phrase table
KR20200105056A * 2019-02-28 2020-09-07 Korea Electric Power Corporation Apparatus and method for generating video
CN110334361A * 2019-07-12 2019-10-15 University of Electronic Science and Technology of China Neural machine translation method for rare (low-resource) languages

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on the application of the EM algorithm in neural machine translation models; Yang Yun; Wang Quan; Computer Applications and Software (No. 08); full text *

Also Published As

Publication number Publication date
CN113111667A (en) 2021-07-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant