CN112861516A - Experimental method for verifying influence of common sub-words on XLM translation model effect - Google Patents

Experimental method for verifying influence of common sub-words on XLM translation model effect

Info

Publication number
CN112861516A
CN112861516A
Authority
CN
China
Prior art keywords
sub
words
common
subwords
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110079357.3A
Other languages
Chinese (zh)
Other versions
CN112861516B (en)
Inventor
余正涛
杨晓霞
吴霖
朱俊国
王振晗
文永华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology filed Critical Kunming University of Science and Technology
Priority to CN202110079357.3A priority Critical patent/CN112861516B/en
Publication of CN112861516A publication Critical patent/CN112861516A/en
Application granted granted Critical
Publication of CN112861516B publication Critical patent/CN112861516B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/205 - Parsing
    • G06F 40/226 - Validation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/237 - Lexical tools
    • G06F 40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/20 - Natural language analysis
    • G06F 40/279 - Recognition of textual entities
    • G06F 40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 - Handling natural language data
    • G06F 40/40 - Processing or translation of natural language
    • G06F 40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to an experimental method for verifying the influence of common subwords on the effect of an XLM translation model. The method comprises the following steps: preprocessing the corpus used to pre-train the XLM (cross-lingual language model); and verifying whether the performance of the XLM translation model degrades, by pre-training the XLM with the preprocessed corpus, initializing a translation model with the pre-trained model, and observing the BLEU value of the new translation model. The preprocessing comprises: first, obtaining the common subwords and the word frequencies of all subwords from the English and French subword vocabularies; then randomly separating the common subwords according to a separation ratio; reading the complete English-French subword vocabulary and storing it in a dictionary, from which the separated-subword file is subsequently generated; initializing a dictionary with the generated separated-subword file; and finally structuring the model corpus files with the initialized dictionary. The invention verifies the influence of common subwords on the BLEU value and aids research on low-resource neural machine translation between non-homologous languages.

Description

Experimental method for verifying influence of common sub-words on XLM translation model effect
Technical Field
The invention relates to an experimental method for verifying the influence of common subwords on the effect of the XLM translation model, and belongs to the technical field of natural language processing.
Background
Machine translation is one of the tasks in the field of natural language processing; it is widely applied and has great research and commercial value, and the emergence of neural machine translation has greatly promoted its development. Neural machine translation requires a large amount of parallel corpora, which makes the development of low-resource neural machine translation particularly important. Low-resource neural machine translation between homologous language pairs such as English-French and English-German works well, but a non-homologous language pair such as Chinese-English does not. In order to analyze why the Chinese-English pair degrades on the translation model, and to gain a deeper understanding of low-resource neural machine translation for non-homologous language pairs, an experimental method for verifying the influence of common subwords on the performance of the XLM translation model is provided.
Disclosure of Invention
The invention provides an experimental method for verifying the influence of common subwords on the effect of an XLM translation model. The method verifies the influence of common subwords on the BLEU value, analyzes the cause of the degradation of the translation model, deepens the understanding of low-resource neural machine translation for non-homologous language pairs, and helps in proposing solutions to the degradation problem of low-resource neural machine translation for non-homologous language pairs.
The technical scheme of the invention is as follows: an experimental method for verifying the influence of common subwords on the effect of an XLM translation model, the method comprising:
Step 1, preprocessing the corpus used to pre-train the XLM translation model;
Step 2, verifying whether the performance of the XLM translation model degrades: pre-training the XLM with the preprocessed corpus, initializing a translation model with the pre-trained model, and observing the BLEU value of the new translation model.
The preprocessing in Step 1 comprises the following steps:
first, the common subwords and the word frequencies of all subwords are obtained from the English and French subword vocabularies; the common subwords are then randomly separated according to a separation ratio; the complete English-French subword vocabulary is read and stored in a dictionary, from which the separated-subword file is subsequently generated; a dictionary is initialized with the generated separated-subword file; finally, the model corpus files are structured with the initialized dictionary.
As a further aspect of the invention, the method comprises the following specific steps:
Step 1.1, obtaining the common subwords and the word frequencies of all subwords from the English and French subword vocabularies;
Step 1.2, randomly separating the common subwords according to the separation ratio to obtain the separated-subword file;
first, the number of common subwords to be separated is calculated by multiplying the total number of common subwords by the separation ratio; the common subwords are then screened with a random function to obtain the common subwords to be separated and those not to be separated, which are stored separately; the word frequencies of the common subwords to be separated are looked up in the English and French vocabularies respectively and stored;
Step 1.3, reading the vocabulary containing all English and French subwords and storing it in a dictionary; this vocabulary contains the subwords and their word frequencies;
Step 1.4, generating the separated-subword file;
first, the dictionary containing the vocabulary of all English and French subwords is read; for each entry it is judged whether the subword is a common subword, and if so, whether it is to be separated; non-common subwords need no separation judgment; for a common subword that is separated, its word frequencies in English and in French are marked, and for a common subword that is not separated, its total word frequency is marked; finally, the different types of subwords are stored in the same file with different marks;
Step 1.5, initializing a dictionary with the generated separated-subword file;
the file generated in Step 1.4 is read, suffixes are appended to the separated common subwords to distinguish them, and each is represented by a different id number; the corresponding word frequencies are also stored; for a common subword that is not separated, its id and word frequency are recorded directly; the members of the dictionary class are thereby initialized;
Step 1.6, structuring the model corpus files with the initialized dictionary;
the subwords in each line (sentence) of the BPE-processed English and French corpus files are read and replaced with their id numbers according to the initialized dictionary; an end-of-sentence identifier is appended to each line, and the result is stored in an array; the start and end positions of each sentence are also saved in the binary file.
The invention has the following beneficial effects:
1. The invention verifies the influence of common subwords on the BLEU value;
2. The invention helps research on low-resource neural machine translation between non-homologous languages. In the XLM model, the source language and the target language share a vocabulary. During training of the encoder, the common subwords of homologous languages (such as English and French) are aligned in the semantic space, and the non-common English and French subwords can also be aligned well in the semantic space according to their positions relative to the common subwords. In other words, the common subwords of English and French provide alignment information in the semantic space for the English-French translation model, acting as anchor points. Non-homologous languages (such as Chinese and English) contain essentially no common subwords, so aligning the source and target languages in the semantic space becomes problematic. Experiments show that the absence of common-subword information has a great influence on translation between non-homologous languages; based on this, methods such as adding a bilingual dictionary to increase alignment information can be proposed to improve the machine translation performance of non-homologous languages. The invention is therefore helpful for research on low-resource neural machine translation between non-homologous languages.
Drawings
FIG. 1 is a flow chart of generating and applying the separated common-subword file according to the present invention;
FIG. 2 is a graph illustrating the effect of separating common subwords at different ratios on the English-to-French BLEU value according to the present invention;
FIG. 3 is a graph illustrating the effect of separating common subwords at different ratios on the French-to-English BLEU value according to the present invention.
Detailed Description
Example 1: as shown in FIGS. 1-3, an experimental method for verifying the influence of common subwords on the effect of an XLM translation model comprises:
Step 1, preprocessing the corpus used to pre-train the XLM translation model;
Step 2, verifying whether the performance of the XLM translation model degrades: pre-training the XLM with the preprocessed corpus, initializing a translation model with the pre-trained model, and observing the BLEU value of the new translation model.
The preprocessing in Step 1 comprises the following steps:
first, the common subwords and the word frequencies of all subwords are obtained from the English and French subword vocabularies; the common subwords are then randomly separated according to a separation ratio; the complete English-French subword vocabulary is read and stored in a dictionary, from which the separated-subword file is subsequently generated; a dictionary is initialized with the generated separated-subword file; finally, the model corpus files are structured with the initialized dictionary.
As a further aspect of the invention, the method comprises the following specific steps:
Step 1.1, obtaining the common subwords and the word frequencies of all subwords from the English and French subword vocabularies;
The English subwords with their word frequencies and the French subwords with their word frequencies are stored in the two vocabulary files (vocab.en and vocab.fr) generated by the BPE processing. The English vocabulary is traversed and its subwords are put into a set, which is copied into a newly created common-subword set; the intersection of this set with the set of French subwords is then computed, and this intersection is the set of common English-French subwords. The vocabulary files are then searched to store the common subwords and their corresponding word frequencies in a dictionary.
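To make Step 1.1 concrete, a minimal illustrative sketch in Python follows. The vocabulary file names vocab.en and vocab.fr and the one subword-frequency pair per line format follow the description above; the function and variable names themselves are illustrative assumptions, not the actual implementation of the invention.

```python
def read_vocab(path):
    """Read a BPE vocabulary file ("subword frequency" per line) into a dict."""
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) == 2:
                word, freq = parts
                vocab[word] = int(freq)
    return vocab

en_vocab = read_vocab("vocab.en")   # English subwords and frequencies
fr_vocab = read_vocab("vocab.fr")   # French subwords and frequencies

# The common subwords are the intersection of the two subword sets;
# keep the frequency of each common subword in both languages.
common_subwords = set(en_vocab) & set(fr_vocab)
common_freqs = {w: (en_vocab[w], fr_vocab[w]) for w in common_subwords}
```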
Step 1.2, randomly separating the common subwords according to the separation ratio to obtain the separated-subword file;
First, the number of common subwords to be separated is calculated by multiplying the total number of common subwords by the separation ratio; the common subwords are then screened with the random function random.sample to obtain the common subwords to be separated and those not to be separated, which are stored separately; the word frequencies of the common subwords to be separated are looked up in the English and French vocabularies respectively and stored.
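A minimal sketch of the random separation in Step 1.2, continuing from the sketch above. Only the function random.sample is named in the description; the separation-ratio handling, the seed argument and the variable names below are assumptions.

```python
import random

def separate_common_subwords(common_subwords, split_ratio, seed=None):
    """Randomly choose (total number x separation ratio) common subwords to separate."""
    if seed is not None:
        random.seed(seed)                          # optional, for reproducible experiments
    subwords = sorted(common_subwords)             # fix iteration order before sampling
    n_split = int(len(subwords) * split_ratio)     # total number times separation ratio
    split = set(random.sample(subwords, n_split))
    kept = set(subwords) - split
    return split, kept

# e.g. a 0.5 separation ratio, one point on the X-axis of Fig. 2 / Fig. 3
split_words, kept_words = separate_common_subwords(common_subwords, split_ratio=0.5)

# store the English and French word frequencies of the subwords to be separated
split_freqs = {w: (en_vocab[w], fr_vocab[w]) for w in split_words}
```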
Step 1.3, reading the vocabulary containing all English and French subwords and storing it in a dictionary; this vocabulary contains the subwords and their word frequencies;
The BPE step generates vocab.en-fr, which contains all English and French subwords; its subwords and word frequencies are stored in a dictionary for easy lookup.
Step 1.4, generating the separated-subword file;
First, the dictionary containing the vocabulary of all English and French subwords is read; for each entry it is judged whether the subword is a common subword, and if so, whether it is to be separated; non-common subwords need no separation judgment. For a common subword that is separated, its word frequencies in English and in French are marked; for a common subword that is not separated, its total word frequency is marked. Finally, the different types of subwords are stored in the same file with different marks.
In the marks, 1 represents true and 0 represents false. The final form of the file is shown in Table 1.
TABLE 1. Generated separated-subword file (available only as an image in the original filing)
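A combined sketch of Steps 1.3 and 1.4, continuing from the previous sketches. Because Table 1 is only available as an image in the original filing, the column layout written here (subword, common-subword mark, separation mark, frequencies) is an assumption; only the use of 1/0 marks and the storage of all subword types in one file is taken from the description.

```python
# Step 1.3: read the joint vocabulary (all English and French subwords with total frequency).
joint_vocab = read_vocab("vocab.en-fr")

# Step 1.4: write one line per subword with marks describing its type.
with open("split_subwords.txt", "w", encoding="utf-8") as out:
    for word, total_freq in joint_vocab.items():
        if word in common_subwords:
            if word in split_words:
                # separated common subword: keep the English and French frequencies
                en_f, fr_f = split_freqs[word]
                out.write(f"{word} 1 1 {en_f} {fr_f}\n")
            else:
                # common but not separated: mark the total word frequency
                out.write(f"{word} 1 0 {total_freq}\n")
        else:
            # not a common subword: no separation decision is needed
            out.write(f"{word} 0 0 {total_freq}\n")
```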
Step 1.5, initializing a dictionary with the generated separated-subword file;
The file generated in Step 1.4 is read, and suffixes are appended to the separated common subwords to distinguish them (_1 denotes the English subword and _2 denotes the French subword); each is represented by a different id (serial number) and the corresponding word frequencies are stored. For a common subword that is not separated, its id and word frequency are recorded directly. The members of the dictionary class (SplitDictionary) are thereby initialized: id2word (a dictionary mapping each id to its subword), word2id (a dictionary mapping each subword to its id), counts (a dictionary recording the word frequency of each subword), and splitwords (a set recording the separated common subwords).
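A sketch of the dictionary initialization in Step 1.5. The class name SplitDictionary and the members id2word, word2id, counts and splitwords follow the description above; the file parsing assumes the line format of the previous sketch.

```python
class SplitDictionary:
    def __init__(self, split_file):
        self.id2word = {}        # id -> subword
        self.word2id = {}        # subword -> id
        self.counts = {}         # subword -> word frequency
        self.splitwords = set()  # separated common subwords

        def add(word, freq):
            idx = len(self.id2word)
            self.id2word[idx] = word
            self.word2id[word] = idx
            self.counts[word] = freq

        with open(split_file, encoding="utf-8") as f:
            for line in f:
                parts = line.split()
                word, is_common, is_split = parts[0], parts[1] == "1", parts[2] == "1"
                if is_common and is_split:
                    # a separated common subword gets two entries:
                    # "_1" for its English copy and "_2" for its French copy
                    en_freq, fr_freq = int(parts[3]), int(parts[4])
                    add(word + "_1", en_freq)
                    add(word + "_2", fr_freq)
                    self.splitwords.add(word)
                else:
                    add(word, int(parts[3]))
```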
Step 1.6, structuring the model corpus files with the initialized dictionary;
The train.en and train.fr files are structured using the members of the SplitDictionary class (the index_data function in splitdictionary.py).
The subwords in each line (sentence) of the BPE-processed English and French corpus files are read and replaced with their id numbers according to the initialized dictionary; an end-of-sentence identifier is appended to each line, and the result is stored in an array; the start and end positions of each sentence are also saved in the binary file.
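A sketch of the corpus structuring in Step 1.6, loosely following the index_data routine mentioned above. The end-of-sentence id, the use of NumPy and the exact binary output format are assumptions; only the overall procedure (id replacement, language suffixing of separated subwords, end-of-sentence markers, sentence positions) is taken from the description.

```python
import numpy as np

def index_corpus(corpus_path, dico, lang_suffix, out_path=None):
    """Replace each BPE subword with its id, append an end-of-sentence marker
    per line, and record the start/end position of every sentence."""
    eos = len(dico.word2id)        # assumed: a dedicated id outside the subword id range
    ids, positions = [], []
    with open(corpus_path, encoding="utf-8") as f:
        for line in f:
            start = len(ids)
            for w in line.split():
                if w in dico.splitwords:
                    w = w + lang_suffix        # "_1" for train.en, "_2" for train.fr
                ids.append(dico.word2id[w])
            ids.append(eos)                    # end-of-sentence identifier
            positions.append((start, len(ids)))
    data = {"sentences": np.array(ids, dtype=np.int64),
            "positions": np.array(positions, dtype=np.int64)}
    if out_path is not None:
        np.save(out_path, data, allow_pickle=True)   # stored as a binary file
    return data

dico = SplitDictionary("split_subwords.txt")
en_data = index_corpus("train.en", dico, "_1", out_path="train.en.idx")
fr_data = index_corpus("train.fr", dico, "_2", out_path="train.fr.idx")
```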
Fig. 2 and Fig. 3 show the translation results obtained with the invention; the evaluation uses the internationally standard BLEU metric, where higher values are better. In the figures, the X-axis represents the separation ratio of the common subwords and the Y-axis represents the evaluation metric. Fig. 2 shows the English-to-French translation results and Fig. 3 shows the French-to-English translation results.
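The patent does not name the BLEU implementation behind Fig. 2 and Fig. 3; purely as an illustration, the following sketch scores a file of model translations against a reference file with the sacrebleu package, and the file names are placeholders.

```python
import sacrebleu

# hyp.fr: model outputs, ref.fr: reference translations (placeholder file names)
with open("hyp.fr", encoding="utf-8") as f:
    hypotheses = [line.strip() for line in f]
with open("ref.fr", encoding="utf-8") as f:
    references = [line.strip() for line in f]

bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")   # higher is better
```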
Tests show that the method effectively verifies the importance of common subwords in the English-French translation model and provides an effective analysis experiment for the degradation of the Chinese-English translation model.
Experiments show that the absence of common-subword information has a great influence on translation between non-homologous languages; based on this, methods such as adding a bilingual dictionary to increase alignment information can be proposed to improve the machine translation performance of non-homologous languages. The invention is therefore helpful for research on low-resource neural machine translation between non-homologous languages.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to those embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the present invention.

Claims (2)

1. An experimental method for verifying the influence of common subwords on the effect of an XLM translation model, characterized in that the method comprises the following steps:
Step 1, preprocessing the corpus used to pre-train the XLM translation model;
Step 2, verifying whether the performance of the XLM translation model degrades: pre-training the XLM with the preprocessed corpus, initializing a translation model with the pre-trained model, and observing the BLEU value of the new translation model;
wherein the preprocessing in Step 1 comprises the following steps:
first, obtaining the common subwords and the word frequencies of all subwords from the English and French subword vocabularies; then randomly separating the common subwords according to a separation ratio; reading the complete English-French subword vocabulary and storing it in a dictionary, from which the separated-subword file is subsequently generated; initializing a dictionary with the generated separated-subword file; and finally structuring the model corpus files with the initialized dictionary.
2. The experimental method for verifying the influence of common subwords on the effect of an XLM translation model according to claim 1, characterized in that the method comprises the following specific steps:
Step 1.1, obtaining the common subwords and the word frequencies of all subwords from the English and French subword vocabularies;
Step 1.2, randomly separating the common subwords according to the separation ratio to obtain the separated-subword file;
first, calculating the number of common subwords to be separated by multiplying the total number of common subwords by the separation ratio; screening the common subwords with a random function to obtain the common subwords to be separated and those not to be separated, and storing them separately; and looking up and storing the word frequencies of the common subwords to be separated in the English and French vocabularies respectively;
Step 1.3, reading the vocabulary containing all English and French subwords and storing it in a dictionary, the vocabulary containing the subwords and their word frequencies;
Step 1.4, generating the separated-subword file;
first, reading the dictionary containing the vocabulary of all English and French subwords and judging, for each entry, whether the subword is a common subword; if it is a common subword, further judging whether it is to be separated; if it is not a common subword, no separation judgment is needed; when a common subword is separated, marking its word frequencies in English and in French, and when it is not separated, marking its total word frequency; and finally storing the different types of subwords in the same file with different marks;
Step 1.5, initializing a dictionary with the generated separated-subword file;
reading the file generated in Step 1.4, appending suffixes to the separated common subwords to distinguish them, representing each by a different id number, and storing the corresponding word frequencies; for a common subword that is not separated, directly recording its id and word frequency; and thereby initializing the members of the dictionary class;
Step 1.6, structuring the model corpus files with the initialized dictionary;
reading the subwords in each line of the BPE-processed English and French corpus files, replacing each subword with its id number according to the initialized dictionary, appending an end-of-sentence identifier to each line, and storing the result in an array; and also saving the start and end positions of each sentence in the binary file.
CN202110079357.3A 2021-01-21 2021-01-21 Experimental method for verifying influence of common subword on XLM translation model effect Active CN112861516B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110079357.3A CN112861516B (en) 2021-01-21 2021-01-21 Experimental method for verifying influence of common subword on XLM translation model effect

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110079357.3A CN112861516B (en) 2021-01-21 2021-01-21 Experimental method for verifying influence of common subword on XLM translation model effect

Publications (2)

Publication Number Publication Date
CN112861516A true CN112861516A (en) 2021-05-28
CN112861516B CN112861516B (en) 2023-05-16

Family

ID=76008519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110079357.3A Active CN112861516B (en) 2021-01-21 2021-01-21 Experimental method for verifying influence of common subword on XLM translation model effect

Country Status (1)

Country Link
CN (1) CN112861516B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108563640A (en) * 2018-04-24 2018-09-21 中译语通科技股份有限公司 A kind of multilingual pair of neural network machine interpretation method and system
CN109033042A (en) * 2018-06-28 2018-12-18 中译语通科技股份有限公司 BPE coding method and system, machine translation system based on the sub- word cell of Chinese
CN109815456A (en) * 2019-02-13 2019-05-28 北京航空航天大学 A method of it is compressed based on term vector memory space of the character to coding
CN110413736A (en) * 2019-07-25 2019-11-05 百度在线网络技术(北京)有限公司 Across language text representation method and device
CN110688862A (en) * 2019-08-29 2020-01-14 内蒙古工业大学 Mongolian-Chinese inter-translation method based on transfer learning
CN110674646A (en) * 2019-09-06 2020-01-10 内蒙古工业大学 Mongolian Chinese machine translation system based on byte pair encoding technology
CN110991192A (en) * 2019-11-08 2020-04-10 昆明理工大学 Method for constructing semi-supervised neural machine translation model based on word-to-word translation
CN111414771A (en) * 2020-03-03 2020-07-14 云知声智能科技股份有限公司 Phrase-based neural machine translation method and system

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
GUILLAUME LAMPLE et al.: "Cross-lingual Language Model Pretraining", arXiv *
RICO SENNRICH et al.: "Neural Machine Translation of Rare Words with Subword Units", arXiv *
RIOS ANNETTE et al.: "Subword segmentation and a single bridge language affect zero-shot neural machine translation", 5th Conference on Machine Translation *
孙凌浩 et al.: "Research on Entity Relation Extraction Algorithms Based on Cross-lingual Transfer Learning", China Master's Theses Full-text Database, Information Science and Technology *
徐毓 et al.: "Chinese-Vietnamese Neural Machine Translation Based on Depthwise Separable Convolution", Journal of Xiamen University *

Also Published As

Publication number Publication date
CN112861516B (en) 2023-05-16


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant