CN113111667B - Method for generating pseudo data in low-resource language based on multi-language model - Google Patents

Method for generating pseudo data in low-resource language based on multi-language model

Info

Publication number
CN113111667B
CN113111667B (application CN202110397096.XA)
Authority
CN
China
Prior art keywords: language, data, bilingual, model, training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110397096.XA
Other languages
Chinese (zh)
Other versions
CN113111667A (en)
Inventor
杜权 (Du Quan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenyang Yayi Network Technology Co., Ltd.
Original Assignee
Shenyang Yayi Network Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenyang Yayi Network Technology Co., Ltd.
Priority: CN202110397096.XA
Publication of CN113111667A
Application granted
Publication of CN113111667B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/40 - Processing or translation of natural language
    • G06F40/58 - Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/335 - Filtering based on additional data, e.g. user or group profiles
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/10 - Text processing
    • G06F40/12 - Use of codes for handling textual entities
    • G06F40/126 - Character encoding
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a method for generating pseudo data for a low-resource language based on a multilingual model, comprising the following steps: preprocessing the monolingual and bilingual data of the minor language together with bilingual data of languages of the same or a neighboring language family to obtain bilingual training data, and training multilingual models on it; preprocessing the minor-language source-language monolingual data to obtain monolingual training data; decoding the monolingual training data to obtain minor-language target-language monolingual pseudo data; processing the target-language monolingual pseudo data and the source-language monolingual training data into bilingual pseudo data, merging it with the training data of the target-to-source multilingual model, and processing the result into new target-to-source bilingual training data; and finally iterating until model performance no longer improves. By merging bilingual data of the same or neighboring language families of the minor language into model training, the invention increases the amount of training data and injects the language features of related languages into the model, thereby improving model performance.

Description

Method for generating pseudo data in low-resource language based on multi-language model
Technical Field
The invention relates to a method for generating pseudo data, in particular to a method for generating pseudo data for low-resource languages based on a multilingual model.
Background
Neural machine translation is a research area that emerged around 2013, and it has developed very rapidly, far surpassing traditional statistical machine translation on many language pairs. Unlike traditional statistics-based translation algorithms, neural machine translation treats the translation task as the problem of mapping one sentence sequence to another: the end-to-end task is completed by a deep neural network comprising an encoder and a decoder, which avoids the complicated intermediate steps of traditional statistical machine translation. Owing to this simplicity and its good translation performance, the neural machine translation model has received more and more attention. The translation model itself went through a development process from the original plain encoder-decoder model, to attention-based encoder-decoder models, to fully CNN-based neural machine translation models, and finally to neural machine translation models based entirely on self-attention. Each evolution of the translation model brought a large improvement in performance, but its success depends mainly on large amounts of high-quality bilingual corpus; *** once demonstrated that the automatic evaluation metric improves by 0.5 percent every time the scale of the data set is doubled, which shows the importance of data volume for improving model performance.
With the rapid development of machine translation, translation tasks between major languages such as English and French can already achieve good results, but translating minor languages remains a difficult problem. Compared with the abundant corpora of major languages, the main challenge in minor-language machine translation is the sparsity of corpora: acquiring large amounts of parallel data is very difficult, while acquiring monolingual data is relatively easy. Many attempts have been made in the machine translation field to address the low-resource problem of minor languages, and the methods fall roughly into two categories. The first makes full use of the easily acquired monolingual data; the typical method is back-translation, which uses a reverse translation model to translate target-language monolingual data into source-language data, and constructs bilingual pseudo data by this means to train the forward translation model (a minimal code sketch of this idea follows this paragraph). The second category is the multilingual model method, which defines an encoder and a decoder for each language and translates between different languages through a shared attention mechanism.
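As a concrete illustration of the back-translation idea just described, the following minimal sketch builds bilingual pseudo data from target-language monolingual text. It is a sketch only: reverse_model and its translate method are hypothetical stand-ins for any trained target-to-source translation model, not an API defined by this document.

# Minimal back-translation sketch; reverse_model.translate is a hypothetical
# stand-in for any trained target->source translation model.
def back_translate(target_monolingual, reverse_model):
    """Build (pseudo source, real target) pairs from target-language sentences."""
    pseudo_pairs = []
    for tgt in target_monolingual:
        src_pseudo = reverse_model.translate(tgt)  # decode target -> source
        pseudo_pairs.append((src_pseudo, tgt))     # forward model trains on these
    return pseudo_pairs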
However, the above methods ignore the similarity among minor languages of the same or neighboring language families, exploit low-resource languages poorly, and therefore yield poor translation quality. For example, Hindi and its close relatives in the Indic branch of the Indo-European family share the same alphabet, through which text in the two languages can be converted into each other. For low-resource minor-language machine translation this is a new way of adding bilingual data, and no system that exploits the similarity of the same or neighboring language families of minor languages to realize low-resource translation has been reported so far.
Disclosure of Invention
Aiming at the fact that existing methods ignore the similarity among minor languages of the same or neighboring language families in the low-resource translation of minor languages, the technical problem to be solved by the invention is to provide a method for generating pseudo data for low-resource languages based on a multilingual model, thereby improving the performance of minor-language translation models.
In order to solve the technical problems, the invention adopts the following technical scheme:
the invention provides a method for generating pseudo data for a low-resource language based on a multilingual model, comprising the following steps:
1) Acquiring, through the Internet, monolingual and bilingual data of the minor language and bilingual data of languages of the same or a neighboring language family;
2) Preprocessing the minor-language bilingual data together with the bilingual data of the same or neighboring-family languages to obtain bilingual training data, and training on it to obtain source-to-target and target-to-source multilingual models;
3) Preprocessing the minor-language source-language monolingual data to obtain the monolingual training data used for decoding to generate pseudo data;
4) Decoding the monolingual training data with the source-to-target multilingual model to obtain minor-language target-language monolingual pseudo data;
5) Pairing the obtained target-language monolingual pseudo data with the source-language monolingual training data to obtain bilingual pseudo data;
6) Merging the processed bilingual pseudo data with the training data of the target-to-source multilingual model, processing the result into new target-to-source bilingual training data, then carrying out a new round of target-to-source multilingual model training and pseudo data generation, and iterating until model performance no longer improves (a high-level code sketch of this loop follows).
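The iterative loop of steps 2) through 6) can be sketched at a high level as follows. All callables here (train, translate, evaluate, merge) are hypothetical placeholders injected as parameters, standing for the operations the steps describe rather than any fixed implementation.

# High-level sketch of steps 2)-6) as one loop. train(data, direction) returns
# a model, translate(model, sentence) decodes, evaluate(model, dev) returns a
# BLEU score, and merge(old, new) merges and deduplicates bilingual data; all
# four are assumed placeholders.
def iterate_pseudo_data(bilingual_data, source_mono, dev_set,
                        train, translate, evaluate, merge):
    fwd_data = list(bilingual_data)          # source->target data, step 2)
    bwd_data = list(bilingual_data)          # target->source data, step 2)
    best_bleu = float("-inf")
    best_model = None
    while True:
        fwd_model = train(fwd_data, "src2tgt")
        candidate = train(bwd_data, "tgt2src")
        bleu = evaluate(candidate, dev_set)
        if bleu <= best_bleu:                # step 6): stop when no improvement
            break
        best_bleu, best_model = bleu, candidate
        pseudo_tgt = [translate(fwd_model, s) for s in source_mono]     # step 4)
        bwd_data = merge(bwd_data, list(zip(source_mono, pseudo_tgt)))  # 5)-6)
    return best_model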
In step 2), the minor-language bilingual data and the bilingual data of the same or neighboring-family languages are preprocessed to obtain bilingual training data, and model training yields multilingual models in both directions, as follows:
201) the bilingual data obtained in step 1) is passed sequentially through a mixed-language filtering method, an HTML tag removal method, a garbled-character filtering method, a repeated-translation filtering method, a mismatched-bracket filtering method, an over-long sentence filtering method, and a target-language Unicode encoding filtering method, yielding the preprocessed bilingual data;
202) the preprocessed minor-language bilingual data is merged with the bilingual data of the other languages, and a bidirectional deduplication method is applied to obtain the training data used for multilingual model training;
203) the training data is tokenized with a word segmentation tool and then segmented with BPE to obtain the final training data;
204) models are trained in both directions on the final training data, yielding bidirectional multilingual models.
In step 3), the minor-language source-language monolingual data is preprocessed to obtain the monolingual training data used by the multilingual model for decoding to generate pseudo data, as follows:
301) the monolingual data obtained in step 1) is passed sequentially through the HTML tag removal method, the garbled-character filtering method, the over-long sentence filtering method, and the target-language Unicode encoding filtering method, yielding the preprocessed monolingual data;
302) the monolingual data is scored with the XENC tool, and the top n% of the data by score ranking is kept, yielding the monolingual training data used for decoding; n is 50 to 60.
In step 5), the obtained target-language monolingual pseudo data and the source-language monolingual training data are processed to obtain bilingual pseudo data, as follows:
501) the target-language monolingual pseudo data is paired with the source-language monolingual training data to obtain bilingual pseudo data;
502) the bilingual pseudo data is processed with an n-gram filtering method and then word-segmented, yielding the processed bilingual pseudo data.
In step 6), the processed bilingual pseudo data is merged with the training data of the target-to-source multilingual model, processed into new target-to-source bilingual training data, and used for a new round of target-to-source training and pseudo data generation, iterating until model performance no longer improves; specifically:
601) the bilingual pseudo data is merged with the bilingual training data of the target-to-source multilingual model;
602) the bidirectional deduplication method is applied to the merged bilingual data to obtain the processed bilingual training data;
603) the processed bilingual training data is used to train a new round of the target-to-source multilingual model and to generate pseudo data, iterating until model performance no longer improves.
The invention has the following beneficial effects and advantages:
1. On top of the minor language's scarce resources, the invention merges bilingual data of the same or neighboring language families into model training, which both increases the amount of training data and injects the language features of related languages into the model, thereby improving its performance.
2. At the same time, the multilingual model trained by the invention fuses bilingual data of several languages of the same or neighboring families, providing a new way to generate pseudo data for other minor languages and to improve translation quality.
Drawings
FIG. 1 is a flow chart of bilingual data preprocessing in accordance with the present invention;
FIG. 2 is a flow chart of monolingual data preprocessing in the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention extends the pseudo data generation methods that previously faced only a single minor-language model, and proposes a multilingual-model-based method for generating low-resource machine translation pseudo data. By merging bilingual data of languages of the same or neighboring families into multilingual model training, the method increases the amount of training data and, without adding any minor-language bilingual data, achieves better performance than ordinary pseudo data generation.
In order to solve the technical problems, the invention adopts the following technical scheme:
the invention provides a method for generating pseudo data in a low-resource language based on a multi-language model, which comprises the following steps:
1) Acquiring bilingual data of a small language single language and bilingual data of a neighboring language system or the same language as the small language through the Internet;
2) Preprocessing bilingual data of a small language and bilingual data of the same language or an adjacent language system of the language to obtain bilingual training data, and performing model training by using the bilingual training data to obtain a multi-language model from a source language to a target language and from the target language to the source language;
3) Preprocessing the multilingual data of the small language source language to obtain the multilingual training data for decoding to generate pseudo data;
4) Decoding the single-language training data by using a multi-language model from the source language to the target language to obtain small-language target language single-language pseudo data;
5) Processing the obtained target language single-language pseudo data and the source language single-language training data to obtain bilingual pseudo data;
6) And integrating the processed bilingual pseudo data with training data of the target-language-to-source-language multi-language model, processing to obtain new target-language-to-source-language bilingual training data, then performing new round of training of the target-language-to-source-language multi-language model, generating the pseudo data, and finally iterating until the performance of the model is not improved.
Step 1) mainly acquires from the Internet publicly available monolingual and bilingual data of the minor language and bilingual data of languages of the same or a neighboring language family.
Step 2) preprocesses the minor-language bilingual data and the bilingual data of the same or neighboring-family languages obtained in step 1) to obtain bilingual training data, and then trains on this data to obtain source-to-target and target-to-source multilingual models; FIG. 1 shows the bilingual data preprocessing flow, and the specific process is as follows:
201) the minor-language bilingual data and the bilingual data of the same or neighboring-family languages obtained in step 1) are passed sequentially through a mixed-language filtering method (filtering out sentences mixed with many words of other languages), an HTML tag removal method (removing redundant HTML tags from sentences), a garbled-character filtering method (filtering garbled characters and other unknown symbols), a repeated-translation filtering method (filtering sentences containing multiply repeated translations), a mismatched-bracket filtering method (filtering redundant or non-corresponding bracket content), an over-long sentence filtering method (filtering sentences with too many words or that are too long), and a target-language Unicode encoding filtering method (filtering by the minor language's Unicode code range); the output of each method is the input of the next, finally yielding the preprocessed bilingual data (a code sketch of part of this filter chain follows step 204);
202) the minor-language bilingual data preprocessed in step 201) is merged with the bilingual data of the same or neighboring-family languages, and a bidirectional deduplication method based on the source and target sides is applied to obtain the training data used for multilingual model training;
203) the training data processed in step 202) is tokenized with a word segmentation tool (Moses for English, INDIC for Hindi, and so on) and then segmented with BPE to obtain the final training data;
204) models are trained in both directions on the final training data from step 203), yielding bidirectional multilingual models.
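The following sketch shows several of the filters from step 201) in code form: HTML tag removal, garbled-character filtering, length filtering, the bracket-correspondence check, and the target-language Unicode range filter. The thresholds, the reading of the bracket check, and the Tamil Unicode block are illustrative assumptions, not values fixed by the patent; the mixed-language and repeated-translation filters would follow the same pattern, and tokenization plus BPE (step 203) would then typically be applied with existing tools such as the Moses tokenizer and subword-nmt.

import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")        # crude HTML tag matcher
TAMIL_BLOCK = (0x0B80, 0x0BFF)         # assumed target-language Unicode range

def strip_html(text):
    return TAG_RE.sub(" ", text).strip()

def has_garbled_chars(text):
    # treat private-use and unassigned code points as garbled symbols
    return any(unicodedata.category(ch) in ("Co", "Cn") for ch in text)

def length_ok(src, tgt, max_words=200, max_ratio=3.0):
    ls, lt = len(src.split()), len(tgt.split())
    return (0 < ls <= max_words and 0 < lt <= max_words
            and max(ls, lt) / min(ls, lt) <= max_ratio)

def brackets_correspond(src, tgt):
    # assumed reading of the bracket check: both sides carry equally many brackets
    return src.count("(") + src.count(")") == tgt.count("(") + tgt.count(")")

def mostly_in_block(text, block=TAMIL_BLOCK, min_frac=0.5):
    letters = [ch for ch in text if ch.isalpha()]
    return bool(letters) and sum(
        block[0] <= ord(ch) <= block[1] for ch in letters) / len(letters) >= min_frac

def clean_bilingual(pairs):
    # each filter's output feeds the next, as in step 201)
    for src, tgt in pairs:
        src, tgt = strip_html(src), strip_html(tgt)
        if has_garbled_chars(src) or has_garbled_chars(tgt):
            continue
        if not (length_ok(src, tgt) and brackets_correspond(src, tgt)
                and mostly_in_block(tgt)):
            continue
        yield src, tgt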
Step 3) preprocesses the minor-language source-language monolingual data to obtain the monolingual training data used for decoding to generate pseudo data; FIG. 2 shows the monolingual data preprocessing flow, with the following specific steps:
301) the monolingual data obtained in step 1) is passed sequentially through the HTML tag removal method (removing redundant HTML tags from sentences), the garbled-character filtering method (filtering garbled characters and other unknown symbols), the over-long sentence filtering method (filtering sentences with too many words or that are too long), and the target-language Unicode encoding filtering method (filtering by the minor language's Unicode code range); the output of each method is the input of the next, finally yielding the preprocessed monolingual data;
302) the monolingual data is scored with the XENC tool, and the top n% of the data (n is 50 to 60 in this embodiment) by score ranking is kept, yielding the monolingual training data used for decoding.
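Step 302) uses the XENC tool; the sketch below shows the cross-entropy-difference idea behind this kind of data scoring, implemented here with two KenLM language models instead of XENC itself. The .arpa model paths and the 55% cut-off are illustrative assumptions.

import kenlm  # KenLM Python bindings

in_domain_lm = kenlm.Model("in_domain.arpa")  # assumed LM over task-like text
general_lm = kenlm.Model("general.arpa")      # assumed LM over generic text

def ce_diff(sentence):
    # KenLM returns log10 probabilities; normalize by sentence length
    n = max(len(sentence.split()), 1)
    return (in_domain_lm.score(sentence) - general_lm.score(sentence)) / n

def keep_top_fraction(sentences, frac=0.55):  # n% with n between 50 and 60
    ranked = sorted(sentences, key=ce_diff, reverse=True)
    return ranked[: int(len(ranked) * frac)]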
Step 4) uses the source-to-target multilingual model trained in step 2) to decode the monolingual training data processed in step 3), obtaining the minor-language target-language monolingual pseudo data.
Step 5) processes the minor-language target-language monolingual pseudo data decoded in step 4) and the minor-language source-language monolingual training data processed in step 3) to obtain bilingual pseudo data; the specific process is as follows:
501) the target-language monolingual pseudo data is paired with the source-language monolingual training data to obtain bilingual pseudo data;
502) the bilingual pseudo data is processed with an n-gram filtering method and then word-segmented, yielding the processed bilingual pseudo data.
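The patent does not spell out its n-gram filtering method; one common reading, sketched below as an assumption, drops pseudo pairs whose machine-decoded target side repeats some n-gram too often, a typical artifact of decoded text. The order n=3 and the repetition threshold are illustrative.

from collections import Counter

def repeats_ngrams(sentence, n=3, max_count=3):
    tokens = sentence.split()
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return any(count > max_count for count in grams.values())

def filter_pseudo_pairs(pairs):
    # drop pairs whose pseudo target side shows degenerate n-gram repetition
    return [(src, tgt) for src, tgt in pairs if not repeats_ngrams(tgt)]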
Step 6) merges the bilingual pseudo data obtained in step 5) with the bilingual training data of the target-to-source multilingual model obtained in step 2), processes the result into new target-to-source bilingual training data, and then carries out a new round of target-to-source multilingual model training and pseudo data generation until model performance no longer improves; the specific process is as follows:
601) the bilingual pseudo data obtained in step 5) is merged with the bilingual training data of the target-to-source multilingual model obtained in step 2);
602) the bidirectional deduplication method based on the source and target sides is applied to the bilingual data from step 601), obtaining the processed bilingual training data (a sketch follows step 603);
603) the target-to-source bilingual training data processed in step 602) is used to train a new round of the target-to-source multilingual model and to generate pseudo data, until the performance of the final iterated model no longer improves.
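The bidirectional deduplication of step 602) can be read as keeping a sentence pair only if neither its source side nor its target side has appeared before; this reading is an assumption consistent with the description, sketched below. Placing the genuine bilingual data ahead of the pseudo data in the input then lets real pairs win ties against pseudo pairs.

def dedup_bidirectional(pairs):
    seen_src, seen_tgt = set(), set()
    kept = []
    for src, tgt in pairs:
        if src in seen_src or tgt in seen_tgt:
            continue             # duplicate on either side: drop the pair
        seen_src.add(src)
        seen_tgt.add(tgt)
        kept.append((src, tgt))
    return kept

# e.g. merged = dedup_bidirectional(real_bilingual + pseudo_bilingual)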
To verify the effectiveness of the method, the multilingual-model-based method for generating low-resource machine translation pseudo data was tested on translation tasks. Specifically, the multilingual models were built on the WMT2020 monolingual and bilingual corpora for Tamil (ta) to English (en) and English to Tamil; bilingual data of Marathi, Kannada, and Telugu, languages of the same family as Tamil, and of Hindi, Urdu, and Punjabi, languages of neighboring families, was selected for multilingual model training. A Transformer model was used, and experiments were run on NVIDIA 2080Ti devices.
Experimental results show that the multilingual-model-based method for generating low-resource machine translation pseudo data improves the translation performance of the model (a higher BLEU indicates better model performance) when minor-language bilingual data is scarce, and yields better model performance than training directly on the minor-language bilingual data and then adding pseudo data.
The invention provides a multilingual-model-based method for generating low-resource machine translation pseudo data. On top of the scarce resources of a minor language, it merges bilingual data of the same or neighboring language families into model training, which not only increases the amount of training data but also injects the language features of related languages into the model, further improving its performance. In addition, the multilingual model trained by the invention fuses bilingual data of several related languages, enabling it to generate pseudo data for other minor languages and to improve translation quality.

Claims (4)

1. A method for generating pseudo data for a low-resource language based on a multilingual model, comprising the following steps:
1) acquiring, through the Internet, monolingual and bilingual data of the minor language and bilingual data of languages of the same or a neighboring language family;
2) preprocessing the minor-language bilingual data together with the bilingual data of the same or neighboring-family languages to obtain bilingual training data, and training on it to obtain source-to-target and target-to-source multilingual models;
3) preprocessing the minor-language source-language monolingual data to obtain the monolingual training data used for decoding to generate pseudo data;
4) decoding the monolingual training data with the source-to-target multilingual model to obtain minor-language target-language monolingual pseudo data;
5) pairing the obtained target-language monolingual pseudo data with the source-language monolingual training data to obtain bilingual pseudo data;
6) merging the processed bilingual pseudo data with the training data of the target-to-source multilingual model, processing the result into new target-to-source bilingual training data, then carrying out a new round of target-to-source multilingual model training and pseudo data generation, and iterating until model performance no longer improves;
wherein in step 2) the minor-language bilingual data and the bilingual data of the same or neighboring-family languages are preprocessed to obtain bilingual training data, and models are trained in both directions, as follows:
201) the bilingual data obtained in step 1) is passed sequentially through a mixed-language filtering method, an HTML tag removal method, a garbled-character filtering method, a repeated-translation filtering method, a mismatched-bracket filtering method, an over-long sentence filtering method, and a target-language Unicode encoding filtering method, yielding the preprocessed bilingual data;
202) the preprocessed minor-language bilingual data is merged with the bilingual data of the other languages, and a bidirectional deduplication method is applied to obtain the training data used for multilingual model training;
203) the training data is tokenized with a word segmentation tool and then segmented with BPE to obtain the final training data;
204) models are trained in both directions on the final training data, yielding bidirectional multilingual models;
and wherein in step 3) the minor-language source-language monolingual data is preprocessed to obtain the monolingual training data used by the multilingual model for decoding to generate pseudo data, as follows:
301) the monolingual data obtained in step 1) is passed sequentially through the HTML tag removal method, the garbled-character filtering method, the over-long sentence filtering method, and the target-language Unicode encoding filtering method, yielding the preprocessed monolingual data;
302) the monolingual data is scored with the XENC tool, and the top n% of the data by score ranking is kept, yielding the monolingual training data used for decoding.
2. The method for generating pseudo data for a low-resource language based on a multilingual model according to claim 1, wherein n is 50 to 60.
3. The method for generating pseudo data for a low-resource language based on a multilingual model according to claim 1, wherein in step 5) the obtained target-language monolingual pseudo data and the source-language monolingual training data are processed to obtain bilingual pseudo data, as follows:
501) the target-language monolingual pseudo data is paired with the source-language monolingual training data to obtain bilingual pseudo data;
502) the bilingual pseudo data is processed with an n-gram filtering method and then word-segmented, yielding the processed bilingual pseudo data.
4. The method for generating pseudo data for a low-resource language based on a multilingual model according to claim 1, wherein in step 6) the processed bilingual pseudo data is merged with the training data of the target-to-source multilingual model, processed into new target-to-source bilingual training data, and used for a new round of training and pseudo data generation, iterating until model performance no longer improves; specifically:
601) the bilingual pseudo data is merged with the bilingual training data of the target-to-source multilingual model;
602) the bidirectional deduplication method is applied to the merged bilingual data to obtain the processed bilingual training data;
603) the processed bilingual training data is used to train a new round of the target-to-source multilingual model and to generate pseudo data, iterating until model performance no longer improves.
CN202110397096.XA 2021-04-13 2021-04-13 Method for generating pseudo data in low-resource language based on multi-language model Active CN113111667B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110397096.XA CN113111667B (en) 2021-04-13 2021-04-13 Method for generating pseudo data in low-resource language based on multi-language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110397096.XA CN113111667B (en) 2021-04-13 2021-04-13 Method for generating pseudo data in low-resource language based on multi-language model

Publications (2)

Publication Number Publication Date
CN113111667A CN113111667A (en) 2021-07-13
CN113111667B (en) 2023-08-22

Family

Family ID: 76716824

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110397096.XA Active CN113111667B (en) 2021-04-13 2021-04-13 Method for generating pseudo data in low-resource language based on multi-language model

Country Status (1)

Country Link
CN (1) CN113111667B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113505571A * 2021-07-30 2021-10-15 Shenyang Yayi Network Technology Co., Ltd. Data selection and training method for neural machine translation


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120330644A1 (en) * 2011-06-22 2012-12-27 Salesforce.Com Inc. Multi-lingual knowledge base

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108363704A * 2018-03-02 2018-08-03 Beijing Institute of Technology Neural machine translation corpus expansion method based on a statistical phrase table
KR20200105056A * 2019-02-28 2020-09-07 Korea Electric Power Corporation Apparatus and method for generating video
CN110334361A * 2019-07-12 2019-10-15 University of Electronic Science and Technology of China Neural machine translation method for rare (low-resource) languages

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on the application of the EM algorithm in neural machine translation models; Yang Yun; Wang Quan; Computer Applications and Software (No. 08); full text *

Also Published As

Publication number Publication date
CN113111667A (en) 2021-07-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant