CN112861516A - Experimental method for verifying influence of common sub-words on XLM translation model effect - Google Patents

Experimental method for verifying influence of common sub-words on XLM translation model effect

Info

Publication number
CN112861516A
CN112861516A
Authority
CN
China
Prior art keywords
sub
words
common
subwords
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110079357.3A
Other languages
Chinese (zh)
Other versions
CN112861516B (en)
Inventor
余正涛
杨晓霞
吴霖
朱俊国
王振晗
文永华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202110079357.3A priority Critical patent/CN112861516B/en
Publication of CN112861516A publication Critical patent/CN112861516A/en
Application granted granted Critical
Publication of CN112861516B publication Critical patent/CN112861516B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/205 - Parsing
    • G06F 40/226 - Validation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/237 - Lexical tools
    • G06F 40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to an experimental method for verifying the influence of common subwords on the effect of an XLM translation model. The method comprises the following steps: preprocessing the corpus used to pre-train the XLM (cross-lingual language model); and verifying whether the performance of the XLM translation model degrades, by pre-training the XLM with the preprocessed corpus, initializing a translation model with the pre-trained model, and observing the BLEU value of the new translation model. The preprocessing comprises: first, obtaining the common subwords and the word frequencies of all subwords from the English and French subword vocabularies; then randomly separating the common subwords according to a separation ratio; reading the complete English-French subword vocabulary and storing it in a dictionary, from which the separated-subword file is subsequently generated; initializing a dictionary with the generated separated-subword file; and finally structuring the model corpus files with the initialized dictionary. The invention verifies the influence of common subwords on the BLEU value and aids research on low-resource neural machine translation between non-homologous languages.

Description

Experimental method for verifying influence of common sub-words on XLM translation model effect
Technical Field
The invention relates to an experimental method for verifying the influence of common subwords on the effect of the XLM translation model, and belongs to the technical field of natural language processing.
Background
Machine translation is one of the tasks in the field of natural language processing; it is widely applied and has great research and commercial value, and the emergence of neural machine translation has greatly promoted its development. Neural machine translation requires a large amount of parallel corpora, which makes the development of low-resource neural machine translation particularly important. Low-resource neural machine translation between homologous language pairs such as English-French and English-German works well, but a non-homologous language pair such as Chinese-English does not. In order to analyze why the Chinese-English pair degrades on the translation model, and to gain a deeper understanding of low-resource neural machine translation for non-homologous language pairs, an experimental method for verifying the influence of common subwords on the performance of the XLM translation model is provided.
Disclosure of Invention
The invention provides an experimental method for verifying the influence of common subwords on the effect of an XLM translation model. The method verifies the influence of common subwords on the BLEU value, analyzes the cause of the degradation of the translation model, deepens the understanding of low-resource neural machine translation for non-homologous language pairs, and helps in proposing solutions to the degradation problem of low-resource neural machine translation for non-homologous language pairs.
The technical scheme of the invention is as follows: an experimental method for verifying the influence of common subwords on the effect of an XLM translation model, the method comprising:
Step 1, preprocessing the corpus used to pre-train the XLM translation model;
Step 2, verifying whether the performance of the XLM translation model degrades: pre-training the XLM with the preprocessed corpus, initializing a translation model with the pre-trained model, and observing the BLEU value of the new translation model.
The preprocessing in Step 1 comprises the following steps:
first, the common subwords and the word frequencies of all subwords are obtained from the English and French subword vocabularies; the common subwords are then randomly separated according to a separation ratio; the complete English-French subword vocabulary is read and stored in a dictionary, from which the separated-subword file is subsequently generated; a dictionary is initialized with the generated separated-subword file; finally, the model corpus files are structured with the initialized dictionary.
As a further aspect of the invention, the method comprises the following specific steps:
Step 1.1, obtaining the common subwords and the word frequencies of all subwords from the English and French subword vocabularies;
Step 1.2, randomly separating the common subwords according to the separation ratio to obtain the separated-subword file;
first, the number of common subwords to be separated is calculated by multiplying the total number of common subwords by the separation ratio; the common subwords are then screened with a random function to obtain the common subwords to be separated and those not to be separated, which are stored separately; the word frequencies of the common subwords to be separated are looked up in the English and French vocabularies respectively and stored;
Step 1.3, reading the vocabulary containing all English and French subwords and storing it in a dictionary; this vocabulary contains the subwords and their word frequencies;
Step 1.4, generating the separated-subword file;
first, the dictionary containing the vocabulary of all English and French subwords is read; for each entry it is judged whether the subword is a common subword, and if so, whether it is to be separated; non-common subwords need no separation judgment; for a common subword that is separated, its word frequencies in English and in French are marked, and for a common subword that is not separated, its total word frequency is marked; finally, the different types of subwords are stored in the same file with different marks;
Step 1.5, initializing a dictionary with the generated separated-subword file;
the file generated in Step 1.4 is read, suffixes are appended to the separated common subwords to distinguish them, and each is represented by a different id number; the corresponding word frequencies are also stored; for a common subword that is not separated, its id and word frequency are recorded directly; the members of the dictionary class are thereby initialized;
Step 1.6, structuring the model corpus files with the initialized dictionary;
the subwords in each line (sentence) of the BPE-processed English and French corpus files are read and replaced with their id numbers according to the initialized dictionary; an end-of-sentence identifier is appended to each line, and the result is stored in an array; the start and end positions of each sentence are also saved in the binary file.
The invention has the following beneficial effects:
1. The invention verifies the influence of common subwords on the BLEU value;
2. The invention helps research on low-resource neural machine translation between non-homologous languages. In the XLM model, the source language and the target language share a vocabulary. During training of the encoder, the common subwords of homologous languages (such as English and French) are aligned in the semantic space, and the non-common English and French subwords can also be aligned well in the semantic space according to their positions relative to the common subwords. In other words, the common subwords of English and French provide alignment information in the semantic space for the English-French translation model, acting as anchor points. Non-homologous languages (such as Chinese and English) contain essentially no common subwords, so aligning the source and target languages in the semantic space becomes problematic. Experiments show that the absence of common-subword information has a great influence on translation between non-homologous languages; based on this, methods such as adding a bilingual dictionary to increase alignment information can be proposed to improve the machine translation performance of non-homologous languages. The invention is therefore helpful for research on low-resource neural machine translation between non-homologous languages.
Drawings
FIG. 1 is a flow chart of generating and applying the separated common-subword file according to the present invention;
FIG. 2 is a graph illustrating the effect of separating common subwords at different ratios on the English-to-French BLEU value according to the present invention;
FIG. 3 is a graph illustrating the effect of separating common subwords at different ratios on the French-to-English BLEU value according to the present invention.
Detailed Description
Example 1: as shown in FIGS. 1-3, an experimental method for verifying the influence of common subwords on the effect of an XLM translation model comprises:
Step 1, preprocessing the corpus used to pre-train the XLM translation model;
Step 2, verifying whether the performance of the XLM translation model degrades: pre-training the XLM with the preprocessed corpus, initializing a translation model with the pre-trained model, and observing the BLEU value of the new translation model.
The preprocessing in Step 1 comprises the following steps:
first, the common subwords and the word frequencies of all subwords are obtained from the English and French subword vocabularies; the common subwords are then randomly separated according to a separation ratio; the complete English-French subword vocabulary is read and stored in a dictionary, from which the separated-subword file is subsequently generated; a dictionary is initialized with the generated separated-subword file; finally, the model corpus files are structured with the initialized dictionary.
As a further aspect of the invention, the method comprises the following specific steps:
Step 1.1, obtaining the common subwords and the word frequencies of all subwords from the English and French subword vocabularies;
The English subwords with their word frequencies and the French subwords with their word frequencies are stored in the two vocabulary files (vocab.en and vocab.fr) generated by the BPE processing. The English vocabulary is traversed and its subwords are put into a set, which is copied into a newly created common-subword set; the intersection of this set with the set of French subwords is then computed, and this intersection is the set of common English-French subwords. The vocabulary files are then searched to store the common subwords and their corresponding word frequencies in a dictionary.
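To make Step 1.1 concrete, a minimal illustrative sketch in Python follows. The vocabulary file names vocab.en and vocab.fr and the one subword-frequency pair per line format follow the description above; the function and variable names themselves are illustrative assumptions, not the actual implementation of the invention.

```python
def read_vocab(path):
    """Read a BPE vocabulary file ("subword frequency" per line) into a dict."""
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                word, freq = parts
                vocab[word] = int(freq)
    return vocab

en_vocab = read_vocab("vocab.en")   # English subwords and frequencies
fr_vocab = read_vocab("vocab.fr")   # French subwords and frequencies

# The common subwords are the intersection of the two subword sets;
# keep the frequency of each common subword in both languages.
common_subwords = set(en_vocab) & set(fr_vocab)
common_freqs = {w: (en_vocab[w], fr_vocab[w]) for w in common_subwords}
```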
Step 1.2, randomly separating the common subwords according to the separation ratio to obtain the separated-subword file;
First, the number of common subwords to be separated is calculated by multiplying the total number of common subwords by the separation ratio; the common subwords are then screened with the random function random.sample to obtain the common subwords to be separated and those not to be separated, which are stored separately; the word frequencies of the common subwords to be separated are looked up in the English and French vocabularies respectively and stored.
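A minimal sketch of the random separation in Step 1.2, continuing from the sketch above. Only the function random.sample is named in the description; the separation-ratio handling, the seed argument and the variable names below are assumptions.

```python
import random

def separate_common_subwords(common_subwords, split_ratio, seed=None):
    """Randomly choose (total number x separation ratio) common subwords to separate."""
    if seed is not None:
        random.seed(seed)                          # optional, for reproducible experiments
    subwords = sorted(common_subwords)             # fix iteration order before sampling
    n_split = int(len(subwords) * split_ratio)     # total number times separation ratio
    split = set(random.sample(subwords, n_split))
    kept = set(subwords) - split
    return split, kept

# e.g. a 0.5 separation ratio, one point on the X-axis of Fig. 2 / Fig. 3
split_words, kept_words = separate_common_subwords(common_subwords, split_ratio=0.5)

# store the English and French word frequencies of the subwords to be separated
split_freqs = {w: (en_vocab[w], fr_vocab[w]) for w in split_words}
```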
Step 1.3, reading the vocabulary containing all English and French subwords and storing it in a dictionary; this vocabulary contains the subwords and their word frequencies;
The BPE step generates vocab.en-fr, which contains all English and French subwords; its subwords and word frequencies are stored in a dictionary for easy lookup.
Step 1.4, generating the separated-subword file;
First, the dictionary containing the vocabulary of all English and French subwords is read; for each entry it is judged whether the subword is a common subword, and if so, whether it is to be separated; non-common subwords need no separation judgment. For a common subword that is separated, its word frequencies in English and in French are marked; for a common subword that is not separated, its total word frequency is marked. Finally, the different types of subwords are stored in the same file with different marks.
In the marks, 1 represents true and 0 represents false. The final form of the file is shown in Table 1.
TABLE 1. Generated separated-subword file (available only as an image in the original filing)
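A combined sketch of Steps 1.3 and 1.4, continuing from the previous sketches. Because Table 1 is only available as an image in the original filing, the column layout written here (subword, common-subword mark, separation mark, frequencies) is an assumption; only the use of 1/0 marks and the storage of all subword types in one file is taken from the description.

```python
# Step 1.3: read the joint vocabulary (all English and French subwords with total frequency).
joint_vocab = read_vocab("vocab.en-fr")

# Step 1.4: write one line per subword with marks describing its type.
with open("split_subwords.txt", "w", encoding="utf-8") as out:
    for word, total_freq in joint_vocab.items():
        if word in common_subwords:
            if word in split_words:
                # separated common subword: keep the English and French frequencies
                en_f, fr_f = split_freqs[word]
                out.write(f"{word} 1 1 {en_f} {fr_f}\n")
            else:
                # common but not separated: mark the total word frequency
                out.write(f"{word} 1 0 {total_freq}\n")
        else:
            # not a common subword: no separation decision is needed
            out.write(f"{word} 0 0 {total_freq}\n")
```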
Step 1.5, initializing a dictionary with the generated separated-subword file;
The file generated in Step 1.4 is read, and suffixes are appended to the separated common subwords to distinguish them (_1 denotes the English subword and _2 denotes the French subword); each is represented by a different id (serial number) and the corresponding word frequencies are stored. For a common subword that is not separated, its id and word frequency are recorded directly. The members of the dictionary class (SplitDictionary) are thereby initialized: id2word (a dictionary mapping each id to its subword), word2id (a dictionary mapping each subword to its id), counts (a dictionary recording the word frequency of each subword), and splitwords (a set recording the separated common subwords).
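A sketch of the dictionary initialization in Step 1.5. The class name SplitDictionary and the members id2word, word2id, counts and splitwords follow the description above; the file parsing assumes the line format of the previous sketch.

```python
class SplitDictionary:
    def __init__(self, split_file):
        self.id2word = {}        # id -> subword
        self.word2id = {}        # subword -> id
        self.counts = {}         # subword -> word frequency
        self.splitwords = set()  # separated common subwords

        def add(word, freq):
            idx = len(self.id2word)
            self.id2word[idx] = word
            self.word2id[word] = idx
            self.counts[word] = freq

        with open(split_file, encoding="utf-8") as f:
            for line in f:
                parts = line.split()
                word, is_common, is_split = parts[0], parts[1] == "1", parts[2] == "1"
                if is_common and is_split:
                    # a separated common subword gets two entries:
                    # "_1" for its English copy and "_2" for its French copy
                    en_freq, fr_freq = int(parts[3]), int(parts[4])
                    add(word + "_1", en_freq)
                    add(word + "_2", fr_freq)
                    self.splitwords.add(word)
                else:
                    add(word, int(parts[3]))
```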
Step 1.6, structuring the model corpus files with the initialized dictionary;
The train.en and train.fr files are structured using the members of the SplitDictionary class (the index_data function in splitdictionary.py).
The subwords in each line (sentence) of the BPE-processed English and French corpus files are read and replaced with their id numbers according to the initialized dictionary; an end-of-sentence identifier is appended to each line, and the result is stored in an array; the start and end positions of each sentence are also saved in the binary file.
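A sketch of the corpus structuring in Step 1.6, loosely following the index_data routine mentioned above. The end-of-sentence id, the use of NumPy and the exact binary output format are assumptions; only the overall procedure (id replacement, language suffixing of separated subwords, end-of-sentence markers, sentence positions) is taken from the description.

```python
import numpy as np

def index_corpus(corpus_path, dico, lang_suffix, out_path=None):
    """Replace each BPE subword with its id, append an end-of-sentence marker
    per line, and record the start/end position of every sentence."""
    eos = len(dico.word2id)        # assumed: a dedicated id outside the subword id range
    ids, positions = [], []
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            start = len(ids)
            for w in line.split():
                if w in dico.splitwords:
                    w = w + lang_suffix        # "_1" for train.en, "_2" for train.fr
                ids.append(dico.word2id[w])
            ids.append(eos)                    # end-of-sentence identifier
            positions.append((start, len(ids)))
    data = {"sentences": np.array(ids, dtype=np.int64),
            "positions": np.array(positions, dtype=np.int64)}
    if out_path is not None:
        np.save(out_path, data, allow_pickle=True)   # stored as a binary file
    return data

dico = SplitDictionary("split_subwords.txt")
en_data = index_corpus("train.en", dico, "_1", out_path="train.en.idx")
fr_data = index_corpus("train.fr", dico, "_2", out_path="train.fr.idx")
```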
Fig. 2 and Fig. 3 show the translation results obtained with the invention; the evaluation uses the internationally standard BLEU metric, where higher values are better. In the figures, the X-axis represents the separation ratio of the common subwords and the Y-axis represents the evaluation metric. Fig. 2 shows the English-to-French translation results and Fig. 3 shows the French-to-English translation results.
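The patent does not name the BLEU implementation behind Fig. 2 and Fig. 3; purely as an illustration, the following sketch scores a file of model translations against a reference file with the sacrebleu package, and the file names are placeholders.

```python
import sacrebleu

# hyp.fr: model outputs, ref.fr: reference translations (placeholder file names)
with open("hyp.fr", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("ref.fr", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")   # higher is better
```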
Tests show that the method effectively verifies the importance of common subwords in the English-French translation model and provides an effective analysis experiment for the degradation of the Chinese-English translation model.
Experiments show that the absence of common-subword information has a great influence on translation between non-homologous languages; based on this, methods such as adding a bilingual dictionary to increase alignment information can be proposed to improve the machine translation performance of non-homologous languages. The invention is therefore helpful for research on low-resource neural machine translation between non-homologous languages.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to those embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims (2)

1. An experimental method for verifying the influence of common subwords on the effect of an XLM translation model, characterized in that the method comprises the following steps:
Step 1, preprocessing the corpus used to pre-train the XLM translation model;
Step 2, verifying whether the performance of the XLM translation model degrades: pre-training the XLM with the preprocessed corpus, initializing a translation model with the pre-trained model, and observing the BLEU value of the new translation model;
wherein the preprocessing in Step 1 comprises the following steps:
first, obtaining the common subwords and the word frequencies of all subwords from the English and French subword vocabularies; then randomly separating the common subwords according to a separation ratio; reading the complete English-French subword vocabulary and storing it in a dictionary, from which the separated-subword file is subsequently generated; initializing a dictionary with the generated separated-subword file; and finally structuring the model corpus files with the initialized dictionary.
2. The experimental method for verifying the influence of common subwords on the effect of an XLM translation model according to claim 1, characterized in that the method comprises the following specific steps:
Step 1.1, obtaining the common subwords and the word frequencies of all subwords from the English and French subword vocabularies;
Step 1.2, randomly separating the common subwords according to the separation ratio to obtain the separated-subword file;
first, calculating the number of common subwords to be separated by multiplying the total number of common subwords by the separation ratio; screening the common subwords with a random function to obtain the common subwords to be separated and those not to be separated, and storing them separately; and looking up and storing the word frequencies of the common subwords to be separated in the English and French vocabularies respectively;
Step 1.3, reading the vocabulary containing all English and French subwords and storing it in a dictionary, the vocabulary containing the subwords and their word frequencies;
Step 1.4, generating the separated-subword file;
first, reading the dictionary containing the vocabulary of all English and French subwords and judging, for each entry, whether the subword is a common subword; if it is a common subword, further judging whether it is to be separated; if it is not a common subword, no separation judgment is needed; when a common subword is separated, marking its word frequencies in English and in French, and when it is not separated, marking its total word frequency; and finally storing the different types of subwords in the same file with different marks;
Step 1.5, initializing a dictionary with the generated separated-subword file;
reading the file generated in Step 1.4, appending suffixes to the separated common subwords to distinguish them, representing each by a different id number, and storing the corresponding word frequencies; for a common subword that is not separated, directly recording its id and word frequency; and thereby initializing the members of the dictionary class;
Step 1.6, structuring the model corpus files with the initialized dictionary;
reading the subwords in each line of the BPE-processed English and French corpus files, replacing each subword with its id number according to the initialized dictionary, appending an end-of-sentence identifier to each line, and storing the result in an array; and also saving the start and end positions of each sentence in the binary file.
CN202110079357.3A 2021-01-21 2021-01-21 Experimental method for verifying influence of common subword on XLM translation model effect Active CN112861516B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110079357.3A CN112861516B (en) 2021-01-21 2021-01-21 Experimental method for verifying influence of common subword on XLM translation model effect

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110079357.3A CN112861516B (en) 2021-01-21 2021-01-21 Experimental method for verifying influence of common subword on XLM translation model effect

Publications (2)

Publication Number Publication Date
CN112861516A true CN112861516A (en) 2021-05-28
CN112861516B CN112861516B (en) 2023-05-16

Family

ID=76008519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110079357.3A Active CN112861516B (en) 2021-01-21 2021-01-21 Experimental method for verifying influence of common subword on XLM translation model effect

Country Status (1)

Country Link
CN (1) CN112861516B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108563640A (en) * 2018-04-24 2018-09-21 中译语通科技股份有限公司 A kind of multilingual pair of neural network machine interpretation method and system
CN109033042A (en) * 2018-06-28 2018-12-18 中译语通科技股份有限公司 BPE coding method and system, machine translation system based on the sub- word cell of Chinese
CN109815456A (en) * 2019-02-13 2019-05-28 北京航空航天大学 A method of it is compressed based on term vector memory space of the character to coding
CN110413736A (en) * 2019-07-25 2019-11-05 百度在线网络技术(北京)有限公司 Across language text representation method and device
CN110688862A (en) * 2019-08-29 2020-01-14 内蒙古工业大学 Mongolian-Chinese inter-translation method based on transfer learning
CN110674646A (en) * 2019-09-06 2020-01-10 内蒙古工业大学 Mongolian Chinese machine translation system based on byte pair encoding technology
CN110991192A (en) * 2019-11-08 2020-04-10 昆明理工大学 Method for constructing semi-supervised neural machine translation model based on word-to-word translation
CN111414771A (en) * 2020-03-03 2020-07-14 云知声智能科技股份有限公司 Phrase-based neural machine translation method and system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
GUILLAUME LAMPLE et al.: "Cross-lingual Language Model Pretraining", arXiv *
RICO SENNRICH et al.: "Neural Machine Translation of Rare Words with Subword Units", arXiv *
RIOS ANNETTE et al.: "Subword segmentation and a single bridge language affect zero-shot neural machine translation", 5th Conference on Machine Translation *
孙凌浩 et al.: "Research on Entity Relation Extraction Algorithms Based on Cross-lingual Transfer Learning", China Master's Theses Full-text Database, Information Science and Technology *
徐毓 et al.: "Chinese-Vietnamese Neural Machine Translation Based on Depthwise Separable Convolution", Journal of Xiamen University *

Also Published As

Publication number Publication date
CN112861516B (en) 2023-05-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant