WO2023051148A1 - Method and apparatus for multilingual processing - Google Patents

Method and apparatus for multilingual processing

Info

Publication number
WO2023051148A1
Authority
WO
WIPO (PCT)
Prior art keywords
language
tags
target
text
translation model
Prior art date
Application number
PCT/CN2022/116378
Other languages
French (fr)
Chinese (zh)
Inventor
宋珍巧
周浩
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司 filed Critical 北京有竹居网络技术有限公司
Publication of WO2023051148A1 publication Critical patent/WO2023051148A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58 Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • Various embodiments of the present disclosure relate to the technical field of natural language processing, and more specifically, to methods, apparatuses, devices, media and program products for multilingual processing.
  • Multilingual neural machine translation (MNMT) technology can train a language model to handle translation tasks across multiple languages.
  • MNMT has two significant advantages. First, it can provide multilingual translation services through a single model, which greatly reduces the cost of online services.
  • Second, multilingual training enables the language model to transfer knowledge from high-resource languages to low-resource languages, helping to improve the translation quality of low-resource language pairs.
  • MNMT-based systems add a dedicated decoder for each target language without learning alignment information across language representations.
  • In addition, because such systems use an autoregressive model, decoding must proceed sequentially when translating the source language into the target language. Therefore, there is room for improvement in the decoding speed and cross-language representations of current multilingual conversion techniques.
  • Embodiments of the present disclosure provide a method, apparatus, device, medium and program product for multilingual processing.
  • In a first aspect, a method for multilingual processing includes: generating a text representation of a second language through a translation model based on a text representation of a first language and a second language tag; obtaining a text representation of a mixed language and markup language tags through the translation model based on a set of language tags and the text representation of the second language, wherein the set of language tags includes at least a third language tag of a third language different from the first and second languages, and the markup language tags are used to indicate parallel corpus data across multiple languages associated with the first language, the second language and the third language; and using the text representation of the first language and the text representation of the mixed language as input to the translation model to update the parameters of the translation model, the parameters including the parallel corpus data across multiple languages.
  • In a second aspect, an apparatus for multilingual processing includes: a generation module configured to generate a text representation of a second language based on a text representation of a first language and a second language tag; an acquisition module configured to obtain a text representation of a mixed language and markup language tags based on a set of language tags and the text representation of the second language, wherein the set of language tags includes at least a third language tag of a third language different from the first and second languages, and the markup language tags are used to indicate parallel corpus data across multiple languages associated with the first language, the second language and the third language; and an update module configured to use the text representation of the first language and the text representation of the mixed language as input to the translation model to update parameters of the translation model, the parameters including the parallel corpus data across multiple languages.
  • In a third aspect, a method for multilingual processing includes: obtaining original text data in a source language and a plurality of target language tags; encoding the original text data into a source text representation in the source language; decoding, in parallel, the source text representation into a plurality of target text representations in the plurality of target languages indicated by the plurality of target language tags, based on the plurality of target language tags and pre-configured parallel corpus data across multiple languages; and decoding, in parallel, the plurality of target text representations into a plurality of target text data in the plurality of target languages.
  • In a fourth aspect, an apparatus for multilingual processing includes: an encoder configured to obtain original text data in a source language and a plurality of target language tags, and to encode the original text data into a source text representation in the source language; and a decoder deployed with a translation model having parallel corpus data across multiple languages, the decoder being configured to decode, in parallel, the source text representation into a plurality of target text representations in the plurality of target languages indicated by the plurality of target language tags, based on the plurality of target language tags and the pre-configured parallel corpus data, and to decode, in parallel, the plurality of target text representations into a plurality of target text data in the plurality of target languages.
  • In a fifth aspect of the present disclosure, an electronic device includes a memory and a processor, wherein the memory is used to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method according to the first aspect or the third aspect.
  • a computer readable storage medium is provided.
  • One or more computer instructions are stored on the computer-readable storage medium, wherein the one or more computer instructions are executed by a processor to implement the method according to the first aspect or the third aspect.
  • a computer program product includes one or more computer instructions, wherein the one or more computer instructions are executed by a processor to implement the method according to the first aspect or the third aspect.
  • FIG. 1 shows a block diagram of a multilingual processing system according to some embodiments of the present disclosure
  • FIG. 2 shows a block diagram of an example architecture based on GLAT according to some embodiments of the present disclosure
  • FIG. 3 shows a schematic diagram of self-reinforcement learning of a multilingual processing model according to some embodiments of the present disclosure
  • FIG. 4A shows a schematic diagram of the difference between a multilingual processing system and a traditional multilingual conversion system in terms of cross-language markup representation according to some embodiments of the present disclosure
  • FIG. 4B shows a schematic diagram of translation performance of a multilingual processing system according to some embodiments of the present disclosure
  • FIG. 5 shows a flowchart of a method for training a multilingual processing model according to some embodiments of the present disclosure
  • FIG. 6 shows a flowchart of a method for multilingual processing according to some embodiments of the present disclosure
  • FIG. 7 shows a block diagram of an apparatus for training a multilingual processing model according to some embodiments of the present disclosure.
  • FIG. 8 shows a block diagram of a computing system in which one or more embodiments of the present disclosure may be implemented.
  • the term "language” used in the present disclosure refers to a category of language defined in linguistics, also referred to as a language category, such as English, Chinese, French, German, and the like.
  • the term "corpus” as used in this disclosure refers to a form in which language is presented, such as text presented in words, which has thought content and meaning and can be understood by a user of the language. Corpus can also be information or data of a certain nature. Examples of the type of information or data include, but are not limited to, voice, video, text, picture, or document, among others.
  • the term "corpus" is also used in this disclosure to refer to a collection of corpora, and multiple corpora may also be referred to as a "corpus collection".
  • the term "representation” used in this disclosure refers to mapping a corpus into a corresponding low-dimensional vector (eg, a word embedding vector) for processing by a computing system.
  • the term "token" used in this disclosure refers to a unit with specific meaning obtained by segmenting a corpus, for example, a word or several consecutive words treated as a unit. Tokens can be used to analyze the content and meaning of text information. For example, the English text "The weather is good today" includes the tokens ["The", "weather", "is", "good", "today"], while the same sentence in Chinese is segmented into three tokens, glossed as ["today", "weather", "good"].
  • the term "conversion" means converting between any two types of information or data. Examples of conversion include, but are not limited to, translation between two languages, conversion between speech and text, conversion between text and pictures, and the like.
  • the translation process between different languages is mainly taken as an example of the conversion process.
  • the conversion process can be realized by means of corresponding conversion models or translation models. Therefore, the terms “model” or “layer” will sometimes be used in the description herein to refer to the corresponding transformation process.
  • training or “learning” refers to the process of using experience or training data to update configuration parameters and optimize system performance.
  • a machine translation system can gradually optimize translation performance, such as improving translation accuracy, through a training or learning process.
  • the training or learning process can end based on certain convergence conditions.
  • the terms "training" and "learning" are used interchangeably for convenience of discussion.
  • inference refers to the process of performing a specific task on real-world data using a model or system that is trained or has learned capabilities. It should be understood that training and inference of the system may occur in a particular order or concurrently.
  • a multilingual processing method/model refers to a method/model based on prior knowledge associated with the syntax, grammar, morphology, etc. of specific languages, which can be used to generate conversion results during the conversion process.
  • the conversion result may include a generated target-language corpus, and may also include generated target-language corpus representations, which can be used by other subjects and further applied to other tasks, such as classification tasks and labeling tasks.
  • the term “comprise” and its variations are open-ended, ie “including but not limited to”.
  • the term “based on” is “based at least in part on”.
  • the terms “one embodiment”, “embodiment” mean “at least one embodiment”; as used herein, the term “another embodiment” means “at least one additional embodiment”. Relevant definitions of other terms will be given in the description below.
  • Assume the corpus (e.g., text) of the source language is denoted X = {x_1, x_2, ..., x_M} and the corpus (e.g., text) of the target language is denoted Y = {y_1, y_2, ..., y_N}, where M and N represent the respective statement lengths.
  • MNMT builds a model from X to Y through a Transformer.
  • the Transformer consists of stacked encoder and decoder layers, where an encoder layer is a self-attention block followed by a position-wise feed-forward block. Based on this architecture, a decoder layer has an additional encoder-decoder attention block.
  • the encoder and decoder are jointly trained such that the conditional probability P(Y|X) of Y given X is maximized, as written out below.
  • Translation efficiency and translation quality are important indicators for evaluating the performance of machine translation.
  • Traditional multilingual machine translation systems generally use autoregressive models. This type of model generates the translation step by step, and the translated target-language words at each step depend on the previous translation results, so the translation quality is good but the translation speed is slow. If the text to be translated is large, translation takes a lot of processing time, as the sketch below illustrates. Furthermore, when performing multilingual processing tasks, whether the vector representations of corpora with associated (e.g., the same) semantics in different languages are accurate and aligned significantly affects the conversion results and translation quality. If the cross-lingual representations are not aligned, the converted corpus may even lose semantics, contain repeated words, or omit translations.
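The following Python sketch contrasts step-by-step autoregressive decoding with single-pass parallel decoding; the predict_* functions are hypothetical stand-ins for a real translation model, not part of this disclosure:

```python
# Toy contrast between autoregressive (AR) decoding, which must emit tokens
# sequentially, and non-autoregressive (NAT) decoding, which predicts every
# position in one pass. The predict_* functions are illustrative stand-ins.
VOCAB = ["das", "wetter", "ist", "heute", "gut", "<eos>"]

def predict_next_token(src_tokens, prefix):
    # Stand-in for one AR decoder step: depends on all previously emitted tokens.
    return VOCAB[min(len(prefix), len(VOCAB) - 1)]

def predict_all_tokens(src_tokens, target_len):
    # Stand-in for one NAT decoder pass: positions are predicted independently,
    # so on real hardware they can be computed in parallel.
    return [VOCAB[min(i, len(VOCAB) - 1)] for i in range(target_len)]

def ar_decode(src_tokens, max_len=6):
    prefix = []
    for _ in range(max_len):  # N sequential steps
        token = predict_next_token(src_tokens, prefix)
        prefix.append(token)
        if token == "<eos>":
            break
    return prefix

def nat_decode(src_tokens, target_len=6):
    return predict_all_tokens(src_tokens, target_len)  # one parallel step

src = ["the", "weather", "is", "good", "today"]
print("AR :", ar_decode(src))
print("NAT:", nat_decode(src))
```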
  • an embodiment of the present disclosure provides a non-autoregressive multilingual processing system.
  • the system architecture is capable of parallel processing, and can decode the tokens at the various positions of the source text in parallel, greatly improving translation efficiency.
  • the translation model learns aligned cross-lingual vector representations by incrementally creating context-dependent code-switched sentences. Therefore, the multilingual processing system exhibits high processing performance and translation quality in multilingual processing tasks.
  • FIG. 1 shows a block diagram of a multilingual processing system 100 according to some embodiments of the present disclosure.
  • Multilingual processing system 100 may be a computing system, a translation system, or any other device capable of performing language conversion tasks. It should be understood that the system 100 shown in FIG. 1 is exemplary only and should not constitute any limitation on the functionality and scope of the implementations described in this disclosure.
  • components of the multilingual processing system 100 may include, but are not limited to, an encoder 110 and a decoder 120 .
  • Encoder 110 and decoder 120 may each include one or more processors or processing units, memory, one or more communication units, one or more input devices, and one or more output devices (not shown).
  • the multilingual processing system 100 is equipped with a translation model, and after training the translation model can acquire the capability of parallel multilingual translation based on context-dependent self-conversion (PCSS), so that it can be used to perform multilingual processing tasks.
  • Training of the translation model for system 100 may include two stages.
  • In the first stage, the translation model can be trained using original corpus data across multiple language pairs, where the original corpus data has original parallel data.
  • Here, i represents the i-th language pair, X represents the text data of the input language (the input language can also be called the source language), and Y represents the text data of the output language (the output language can also be called the target language).
  • a translation model can be trained based on the following formula (2), aiming to maximize the sum of the log probabilities of the true output given the input:
  • L_stage1(θ_M) = Σ_{i=1..L} Σ_{k=1..N_i} [ −log P(Y_k^i | X_k^i; θ_M) + λ · L_len^{i,k} ]  (2)
  • where θ_M represents the parameters of the translation model, and λ is used to control the relative importance of the loss function L_stage1 and the length-prediction factor.
  • The i-th parallel corpus D^i is composed of N_i parallel sentences, expressed as D^i = {(X_k^i, Y_k^i)}_{k=1..N_i}; L_len^{i,k}, which can be calculated using specific training criteria, represents the length-prediction loss for the k-th pair in the parallel corpus D^i.
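Under the reading of formula (2) above, the stage-one objective can be sketched as follows; the per-pair loss functions are placeholders, since the disclosure only states that a translation loss and a λ-weighted length-prediction loss are summed over all language pairs:

```python
# Sketch of the stage-one loss: over L parallel corpora (language pairs), sum a
# negative log-likelihood term and a lambda-weighted length-prediction term.
# nll_loss and length_loss are placeholders for model-specific computations.

def nll_loss(x, y, params):
    return float(abs(len(x) - len(y)))        # placeholder, not a real NLL

def length_loss(x, y, params):
    return float((len(x) - len(y)) ** 2)      # placeholder length penalty

def stage1_loss(corpora, params, lam=0.5):
    """corpora: list of L parallel corpora; each corpus is a list of (X, Y) pairs."""
    total = 0.0
    for corpus in corpora:                    # i-th language pair
        for x, y in corpus:                   # k-th parallel sentence pair
            total += nll_loss(x, y, params) + lam * length_loss(x, y, params)
    return total

pairs = [[("das wetter ist gut".split(), "the weather is good today".split())]]
print(stage1_loss(pairs, params=None, lam=0.5))
```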
  • FIG. 2 shows a block diagram of an example GLAT-based architecture 200 in accordance with certain embodiments of the present disclosure.
  • the GLAT architecture 200 includes an encoding module 201, a parallel decoding module 202, a sampling module 203 and a parallel decoding module 204.
  • the GLAT architecture 200 may also include any other components, modules, elements, etc. required to perform processing tasks.
  • the GLAT architecture 200 performs two-step decoding. Specifically, assume the source-language sentence input to the encoding module 201 is X and the target-language sentence is Y. Given the encoded representation of X as input, the decoder F_d (i.e., parallel decoding module 202) produces a first-pass prediction Ŷ of Y according to the following formula (3): Ŷ = F_d(E(X))  (3), where E denotes the encoding module 201.
  • The parallel decoding module 202 calculates the distance d(Y, Ŷ) between Y and Ŷ. The sampling module 203 samples a subset GS(Y, Ŷ) of Y based on the calculated distance using a corresponding glancing sampling strategy, where Y\GS(Y, Ŷ) denotes the subset remaining after removing the sampled tokens from the target-language sentence Y.
  • the GLAT architecture 200 can then predict the target sentence Y based on the subset GS(Y, Ŷ) and the source-language sentence X according to the following formula (4): L_GLAT = Σ_{y_t ∈ Y\GS(Y, Ŷ)} −log P(y_t | GS(Y, Ŷ), X; θ)  (4)
  • The loss can be calculated according to the GLAT-based training criterion shown in formula (4), and the model parameters θ can then be determined accordingly.
  • the GLAT module is trained using the corpora of all L language pairs in the parallel corpus data sets until the convergence condition is met, as sketched below.
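The sketch below shows one GLAT-style glancing-sampling training step, under the reading of formulas (3) and (4) above; first_pass and second_pass are hypothetical stand-ins for the parallel decoding modules 202 and 204:

```python
# Glancing sampling: run a parallel first pass, measure how far it is from the
# reference, reveal a distance-proportional subset GS(Y, Y_hat) of gold tokens,
# and train the second pass to predict the remaining positions Y \ GS(Y, Y_hat).
import random

def hamming_distance(y, y_hat):
    return sum(a != b for a, b in zip(y, y_hat))

def glancing_step(x, y, first_pass, second_pass, ratio=0.5):
    y_hat = first_pass(x, len(y))                  # formula (3): parallel first pass
    n_glance = int(ratio * hamming_distance(y, y_hat))
    glanced = set(random.sample(range(len(y)), n_glance)) if n_glance else set()
    # Decoder input: gold tokens at glanced positions, mask tokens elsewhere.
    dec_input = [y[i] if i in glanced else "<mask>" for i in range(len(y))]
    y_pred = second_pass(x, dec_input)             # formula (4): predict the rest
    # The loss is computed only on the non-glanced positions.
    targets = [(i, y[i]) for i in range(len(y)) if i not in glanced]
    return y_pred, targets

first = lambda x, n: ["?"] * n                     # stand-in first-pass decoder
second = lambda x, d: [t if t != "<mask>" else "?" for t in d]
print(glancing_step(["a", "b"], ["A", "B", "C"], first, second))
```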
  • the translation model has balanced translation performance for L language pairs and can enter the second stage of training.
  • In the second stage, the decoder 120 masks the target-language text data Y_i with a given ratio P_M to obtain a masked sequence. Thereafter, the decoder 120 uses markup language tags to decode the masked positions into randomly sampled languages. The final decoded text sequence is therefore a mixed-language sequence carrying markup language tags, which can be used to indicate parallel corpus data across multiple languages.
  • the decoder 120 can then take the mixed-language text sequence as the source-side input and the source-language text sequence X as the target-side input, and decode to obtain a synthetic parallel corpus.
  • the decoder 120 can obtain markup language tags across the L language pairs. Markup language tags are used to distinguish parallel corpus data in different languages; they are word-level tags at the same positions in the text representation.
  • the multilingual processing system 100 may include K stacked encoder and decoder layers. Language-specific tags are added to the first-layer input and the last-layer output at the various positions, for example as x_t + E_src at the input and y_t + E_tgt at the output, where E_src and E_tgt represent the representation of the source-language tag and the representation of the target-language tag, respectively. The resulting tagged representation can be used to update the text representation y_t in formula (4). A construction along these lines is sketched below.
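In the sketch, decode_into_language is a hypothetical helper standing in for the mask-position decoding performed by the decoder 120:

```python
# Mask a ratio P_M of target tokens, re-decode each masked position into a
# randomly sampled language, record a per-position markup language tag, and pair
# the mixed sentence with the source sentence as a synthetic parallel example.
import random

def make_mixed_sentence(y_tokens, languages, p_mask, decode_into_language):
    n_mask = max(1, int(p_mask * len(y_tokens)))
    masked = set(random.sample(range(len(y_tokens)), n_mask))
    mixed, tags = [], []
    for i, token in enumerate(y_tokens):
        if i in masked:
            lang = random.choice(languages)        # randomly sampled language
            mixed.append(decode_into_language(token, lang))
            tags.append(lang)                      # markup language tag per position
        else:
            mixed.append(token)
            tags.append("tgt")
    return mixed, tags

def synthetic_pair(x_tokens, mixed_tokens):
    # Mixed sentence on the source side, original source X on the target side.
    return (mixed_tokens, x_tokens)

fake_decode = lambda token, lang: f"{token}@{lang}"   # hypothetical helper
mixed, tags = make_mixed_sentence(["das", "wetter", "ist", "gut"], ["fr", "zh"], 0.3, fake_decode)
print(mixed, tags)
print(synthetic_pair(["the", "weather", "is", "good"], mixed))
```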
  • the decoder 120 may include independent decoding units for performing the first-stage decoding 121 and the second-stage decoding 122 respectively, or may include a single decoding unit that supports both the first-stage decoding 121 and the second-stage decoding 122; the present disclosure is not limited in this respect.
  • FIG. 3 shows a schematic diagram of self-reinforcement learning of a multilingual processing model according to some embodiments of the present disclosure. Given a step size of 0.1, the value of the mask ratio P_M iterates from 0.1 to 0.5 every I rounds (epochs), which can be expressed as P_M = min(0.1 + 0.1 · ⌊Epoch / I⌋, 0.5), where Epoch represents the number of the current round.
  • the number of mixed languages is set to 1 in the first iteration of P_M; thereafter, the number of mixed languages increases to one third of the total number of languages. A large number of code-switched statements are generated during the iteration process. This helps the translation model learn context-dependent aligned cross-lingual representations, thus enabling better translation performance. The schedule is sketched below.
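In this sketch, the exact coupling between P_M and the number of mixed languages is an assumption for illustration:

```python
# Mask-ratio schedule: P_M steps from 0.1 to 0.5 in increments of 0.1 every
# I epochs; the number of mixed languages starts at 1 and later grows to one
# third of the total number of languages (the coupling here is an assumption).
def mask_ratio(epoch, interval_i):
    return min(0.1 + 0.1 * (epoch // interval_i), 0.5)

def num_mixed_languages(p_m, total_languages):
    return 1 if p_m <= 0.1 else max(1, total_languages // 3)

for epoch in range(0, 50, 10):
    p_m = mask_ratio(epoch, interval_i=10)
    print(epoch, p_m, num_mixed_languages(p_m, total_languages=12))
```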
  • an annealed dropout strategy can be applied, gradually reducing the number of randomly zeroed neurons during training.
  • For a given mini-batch (say, 64000 tokens), a linear annealing procedure can be expressed as P_d[t] = P_d[0] · max(0, 1 − t/N), where t represents the training update, N represents the amount of total annealing, and P_d[0] represents the initial dropout rate (e.g., 0.3).
  • the annealed dropout strategy stabilizes training and improves translation quality. In particular, for acoustic models, applying an annealed dropout strategy can substantially reduce the model's word error rate.
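A minimal sketch of the linear annealing formula given above:

```python
# Annealed dropout: the rate starts at P_d[0] (e.g., 0.3) and decays linearly
# to zero over N training updates, i.e. P_d[t] = P_d[0] * max(0, 1 - t / N).
def annealed_dropout(t, n_total, p0=0.3):
    return max(0.0, p0 * (1.0 - t / n_total))

for t in (0, 25_000, 50_000, 100_000):
    print(t, round(annealed_dropout(t, n_total=100_000), 3))
```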
  • The training stage described above may involve only the decoder 120 of the translation system 100, while in the inference stage the translation model performs multilingual processing tasks through the encoder 110 and the decoder 120.
  • the trained PCSS model exhibits significantly enhanced translation performance, both in terms of translation speed and translation quality.
  • Table 1 shows the comparison results of the translation performance of traditional translation models and the multilingual processing model obtained with the training method of the present disclosure when performing English-German-French mutual translation tasks, wherein the average bilingual evaluation understudy (BLEU) score is used as the performance parameter to measure translation performance.
  • In Table 1, Transformer and GLAT are bilingual translation models; M-Transformer, GLSR and Adaptor are multilingual translation models; and MNAT and the PCSS proposed by this disclosure are multilingual NAT models.
  • As shown in Table 1, compared with M-Transformer, PCSS translates 6.1 times faster, with an average score gain exceeding +1.7 BLEU.
  • FIG. 4A and FIG. 4B visually illustrate the translation performance of a multilingual processing system according to some embodiments of the present disclosure.
  • FIG. 4A shows a schematic diagram of the token representations obtained, for the bilingual words in an English-German dictionary, by a traditional multilingual conversion system of the prior art and by the PCSS-based translation system proposed in the present disclosure, where blue represents English words and red represents German words.
  • As shown in (A) of FIG. 4A, there is a clear demarcation in the cross-language token representations learned by the traditional multilingual translation system, while the cross-language token representations learned by the PCSS-based translation system are well aligned.
  • FIG. 4B shows a schematic diagram of the performance of the PCSS-based translation system according to some embodiments of the present disclosure, where each German word is shown in green; if the similarity of the English word paired with the corresponding German word is greater than a given threshold (for example, 0.8), the English word is shown in blue, and if the similarity is less than the given threshold, the English word is shown in yellow. As shown in FIG. 4B, the number of English words shown in blue is much higher than the number of English words shown in yellow. That is, the PCSS-based translation system can produce well-aligned cross-lingual vector representations, which greatly improves translation quality.
  • FIG. 5 shows a flowchart of a method 500 for training a multilingual processing model according to some embodiments of the present disclosure.
  • the method 500 can be implemented by the multilingual translation system 100 , for example, can be implemented at the encoder 110 and the decoder 120 of the translation system 100 .
  • the decoder 120 generates a textual representation in a second language through a translation model based on the textual representation in the first language and tags in the second language.
  • the translation model may be based on the GLAT language model.
  • the method 500 further includes: before generating the text representation of the second language through the translation model, using parallel corpus data to train the translation model until the translation model has balanced translation performance on multiple language pairs.
  • the parallel corpus data may include corpus data of multiple language pairs. This enables training translation models with similar translation performance for multiple language pairs.
  • In some embodiments, the method may further include: determining a plurality of sampling factors for the original corpus data of the plurality of language pairs, each sampling factor being associated with the original corpus data of a corresponding language pair among the plurality of language pairs; and sampling the original corpus data of the plurality of language pairs based on the plurality of sampling factors to obtain the parallel corpus data for training the translation model.
  • the training method may determine a sampling ratio parameter based on the amount of corpus data of each language pair in the original corpus data and the total amount of corpus data.
  • the training method also includes applying, to the sampling ratio parameter, an adjustment coefficient associated with the importance of the corresponding language pair to obtain the plurality of sampling factors, as sketched below.
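In the sketch, the proportional form and the temperature exponent are assumptions for illustration; the disclosure only states that a per-pair sampling ratio is derived from corpus sizes and then scaled by an importance-related adjustment coefficient:

```python
# Per-language-pair sampling factors: a size-based sampling ratio (here a
# temperature-smoothed proportion, an assumption) scaled by an importance
# adjustment coefficient and renormalized.
def sampling_factors(pair_sizes, importance, temperature=1.0):
    total = sum(pair_sizes.values())
    ratios = {p: (n / total) ** (1.0 / temperature) for p, n in pair_sizes.items()}
    z = sum(ratios[p] * importance.get(p, 1.0) for p in ratios)
    return {p: ratios[p] * importance.get(p, 1.0) / z for p in ratios}

sizes = {"en-de": 4_500_000, "en-fr": 40_000_000, "en-ro": 600_000}  # illustrative
weights = {"en-ro": 2.0}   # up-weight a low-resource pair
print(sampling_factors(sizes, weights, temperature=5.0))
```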
  • a mixed language text representation and markup language tags are obtained through a translation model.
  • the set of language tags includes at least a third language tag in a third language different from the first language and the second language.
  • the markup language tag is used to indicate parallel corpus data across multiple languages associated with the first language, the second language and the third language.
  • the training method may include: the decoder 120 sampling word representations in the text representation of the second language based on a first proportion.
  • the sampled first proportion of word representations are converted into word representations corresponding to the set of languages based on the set of language tags.
  • the decoder 120 determines markup language tags associated with the converted first proportion of word representations.
  • the decoder 120 then generates the mixed-language text representation based on the converted first proportion of word representations and the remaining word representations in the text representation of the second language.
  • the training method may include that the decoder 120 generates at least one target text representation in the target language through an updated translation model based on the source text representation in the source language and the markup language tags.
  • the decoder 120 determines a distance parameter between the target textual representation and the source textual representation.
  • the decoder 120 updates the first proportion based on the distance parameter.
  • the decoder 120 uses the text representation of the first language and the text representation of the mixed language as input to the translation model to update parameters of the translation model.
  • parameters may include parallel corpus data across multiple languages.
  • In some embodiments, the decoder 120 may perform the following operations at least once: inputting the text representation of the mixed language as source data for training and the text representation of the first language as target data for training into the translation model; and obtaining another text representation of a mixed language and updated markup language tags based on another set of language tags, wherein the other set of language tags includes at least a fourth language tag different from the set of language tags, and the updated markup language tags are used to indicate parallel corpus data across multiple languages associated with the first language, the second language, the third language and the fourth language.
  • the translation model obtained by the training method 500 may be used to perform multilingual processing tasks.
  • FIG. 6 shows a flowchart of a method 600 for multilingual processing according to some embodiments of the present disclosure.
  • the method 600 can be implemented by the multilingual translation system 100 , for example, can be implemented at the encoder 110 and the decoder 120 of the translation system 100 .
  • the encoder 110 obtains raw text data in a source language and a plurality of target language tags.
  • the encoder 110 encodes raw text data into a source text representation in a source language. Encoder 110 may in turn output the source textual representation to decoder 120 .
  • the decoder 120 decodes, in parallel, the source text representation into multiple target text representations in the multiple target languages indicated by the multiple target language tags, based on the multiple target language tags and the preconfigured parallel corpus data across multiple languages.
  • the decoder 120 then decodes, in parallel, the multiple target text representations in the multiple target languages into multiple target text data in the multiple target languages. The overall flow is sketched below.
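In this sketch, the encoder and decoder objects are hypothetical stand-ins, and a real system would run the per-language decodes on parallel hardware rather than a thread pool:

```python
# Encode the source text once, then decode into several target languages in
# parallel, one decode per target language tag.
from concurrent.futures import ThreadPoolExecutor

def translate_multi(encoder, decoder, src_text, target_tags):
    src_repr = encoder(src_text)                   # single encoding pass
    with ThreadPoolExecutor() as pool:             # parallel per-language decodes
        futures = {tag: pool.submit(decoder, src_repr, tag) for tag in target_tags}
        return {tag: future.result() for tag, future in futures.items()}

enc = lambda text: text.lower().split()                  # stand-in encoder
dec = lambda repr_, tag: f"[{tag}] " + " ".join(repr_)   # stand-in decoder
print(translate_multi(enc, dec, "The weather is good today", ["de", "fr", "zh"]))
```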
  • FIG. 7 shows a block diagram of an apparatus 700 for training a multilingual processing model according to some embodiments of the present disclosure.
  • the device includes a generation module 701 , an acquisition module 702 and an update module 703 .
  • the generation module 701 is configured to generate a text representation in the second language based on the text representation in the first language and the tags in the second language.
  • the obtaining module 702 is configured to obtain a text representation of a mixed language and markup language tags based on a set of language tags and the text representation of the second language, wherein the set of language tags includes at least a third language tag of a third language different from the first language and the second language, and the markup language tags can be used to indicate parallel corpus data across multiple languages associated with the first language, the second language, and the third language.
  • the update module 703 is configured to take the text representation of the first language and the text representation of the mixed language as input of the translation model to update the parameters of the translation model, the parameters include parallel corpus data across multiple languages.
  • a multilingual processing device adopts a context-dependent non-autoregressive translation model, and can learn cross-language representations between multiple language pairs.
  • the multilingual processing device can execute multilingual translation tasks in parallel, thereby significantly increasing translation speed. Furthermore, good translation quality can be achieved with aligned cross-lingual token representations.
  • FIG. 8 shows a block diagram of a computing system 800 in which one or more embodiments of the present disclosure may be implemented.
  • the method 500 shown in FIG. 5 and the method 600 shown in FIG. 6 can be implemented by the computing system 800 .
  • the computing system 800 shown in FIG. 8 is an example only, and should not be construed as limiting the functionality and scope of use of the implementations described herein.
  • computing system 800 is in the form of a general-purpose computing device.
  • Components of computing system 800 may include, but are not limited to, one or more processors or processing units 810, a memory 820, one or more input devices 830, one or more output devices 840, storage 850, and one or more communication units 860.
  • the processing unit 810 may be an actual or virtual processor and is capable of performing various processes according to programs stored in the memory 820. In a multi-processing system, multiple processing units execute computer-executable instructions in parallel to increase processing power.
  • Computing system 800 typically includes a plurality of computer media. Such media can be any available media that is accessible to computing system 800, including but not limited to, volatile and nonvolatile media, removable and non-removable media.
  • Memory 820 can be volatile memory (e.g., registers, cache, random access memory (RAM)), non-volatile memory (e.g., read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof.
  • Storage 850 may be removable or non-removable, and may include machine-readable media, such as flash drives, magnetic disks, or any other media that may be capable of storing information and that may be accessed within computing system 800 .
  • Computing system 800 may further include additional removable/non-removable, volatile/nonvolatile computer system storage media.
  • Although not shown in FIG. 8, a magnetic disk drive for reading from or writing to a removable, nonvolatile magnetic disk (such as a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (such as a CD-ROM) may be provided.
  • each drive may be connected to bus 18 by one or more data media interfaces.
  • Memory 820 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of the various embodiments described herein.
  • a program/utility tool 822 having a set of one or more execution modules 824 may be stored in memory 820, for example.
  • Execution module 824 may include, but is not limited to, an operating system, one or more application programs, other program modules, and operational data. Each of these examples, or some combination thereof, may include an implementation of a networked environment.
  • Execution module 824 generally performs the functions and/or methodologies of embodiments of the subject matter described herein, such as the method 500 or the method 600.
  • the input unit 830 may be one or more various input devices.
  • the input unit 830 may include user equipment such as a mouse, keyboard, trackball, and the like.
  • Communications unit 860 enables communications to other computing entities over a communications medium.
  • the functionality of the components of computing system 800 may be implemented in a single computing cluster or as a plurality of computing machines capable of communicating through communication links. Accordingly, computing system 800 may operate in a networked environment using logical connections to one or more other servers, a network personal computer (PC), or another general network node.
  • communication media includes wired or wireless networking technologies.
  • Computing system 800 can also communicate, as needed, with one or more external devices (not shown) such as storage devices and display devices, with one or more devices that allow users to interact with computing system 800, or with any device (e.g., network card, modem, etc.) that enables computing system 800 to communicate with one or more other computing devices. Such communication may be performed via an input/output (I/O) interface (not shown).
  • Exemplary types of hardware logic components that may be used include Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
  • Program code for implementing the methods of the subject matter described herein can be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented.
  • the program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, compact disc read-only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • a method for multilingual processing includes: generating a text representation of a second language through a translation model based on a text representation of a first language and a second language tag; obtaining a text representation of a mixed language and markup language tags through the translation model based on a set of language tags and the text representation of the second language, wherein the set of language tags includes at least a third language tag of a third language different from the first and second languages, and the markup language tags are used to indicate parallel corpus data across multiple languages associated with the first language, the second language and the third language; and using the text representation of the first language and the text representation of the mixed language as input to the translation model to update the parameters of the translation model, the parameters including the parallel corpus data across multiple languages.
  • the method further includes: before generating the text representation of the second language by the translation model, using parallel corpus data to train the translation model until the translation model has balanced translation performance with respect to multiple language pairs, the parallel corpus data including corpus data of multiple language pairs.
  • using parallel corpus data to train the translation model includes: determining a plurality of sampling factors for the original corpus data of a plurality of language pairs, each sampling factor being associated with the original corpus data of a corresponding language pair among the plurality of language pairs; and sampling the original corpus data of the plurality of language pairs based on the plurality of sampling factors to obtain the parallel corpus data for training the translation model.
  • determining a plurality of sampling factors includes: determining a sampling ratio parameter based on the amount of corpus data of each language pair in the original corpus data and the total amount of corpus data; and applying, to the sampling ratio parameter, an adjustment coefficient associated with the importance of the corresponding language pair to obtain the plurality of sampling factors.
  • obtaining the text representation of the mixed language and the markup language tags includes: sampling word representations in the text representation of the second language based on a first proportion; converting, based on the set of language tags, the sampled first proportion of word representations into word representations corresponding to the set of languages; determining markup language tags associated with the converted first proportion of word representations; and generating the text representation of the mixed language based on the converted first proportion of word representations and the remaining word representations in the text representation of the second language.
  • the method further includes: generating at least one target text representation in a target language through the updated translation model based on a source text representation in a source language and the markup language tags; determining a distance parameter between the target text representation and the source text representation; and updating the first proportion based on the distance parameter.
  • updating the first proportion includes: if the distance parameter exceeds a distance threshold, updating the first proportion to a second proportion, the second proportion being smaller than the first proportion; and if the distance parameter does not exceed the distance threshold, updating the first proportion to a third proportion, the third proportion being greater than the first proportion.
  • the method further includes: if the distance parameter exceeds the distance threshold, decreasing the number of tags in the set of language tags; and if the distance parameter does not exceed the distance threshold, increasing the number of tags in the set of language tags. One possible reading of this adaptive update is sketched below.
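Here the concrete step sizes are illustrative assumptions:

```python
# If the decoded representation is still far from the source (distance above
# the threshold), shrink the conversion proportion and the number of mixed-in
# language tags; otherwise grow both.
def update_proportion_and_tags(proportion, num_tags, distance, threshold, step=0.05):
    if distance > threshold:
        return max(0.0, proportion - step), max(1, num_tags - 1)
    return min(1.0, proportion + step), num_tags + 1

print(update_proportion_and_tags(0.3, 4, distance=0.9, threshold=0.8))
```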
  • updating the translation model includes performing the following at least once: inputting the text representation of the mixed language as source data for training and the text representation of the first language as target data for training into the translation model; and obtaining, through the translation model, another text representation of a mixed language and updated markup language tags based on another set of language tags, wherein the other set of language tags includes at least a fourth language tag different from the set of language tags, and the updated markup language tags are used to indicate parallel corpus data across multiple languages associated with the first language, the second language, the third language and the fourth language.
  • the method further includes: determining a performance parameter of the updated translation model; and stopping updating the translation model if the performance parameter exceeds a threshold parameter, wherein the performance parameter includes a bilingual evaluation understudy (BLEU) score.
  • At least a portion of the translation model is based on a Glancing language model.
  • the method further includes causing the updated translation model to be deployed for multilingual parallel translation tasks.
  • a method for multilingual processing includes: obtaining original text data in a source language and a plurality of target language tags; encoding the original text data into a source text representation in the source language; decoding, in parallel, the source text representation into a plurality of target text representations in the plurality of target languages indicated by the plurality of target language tags, based on the plurality of target language tags and pre-configured parallel corpus data across multiple languages; and decoding, in parallel, the plurality of target text representations into a plurality of target text data in the plurality of target languages.
  • the method of the first aspect is performed to train the translation model of the apparatus of the second aspect.
  • an apparatus for multilingual processing includes: a generation module configured to generate a text representation of a second language based on a text representation of a first language and a second language tag; an acquisition module configured to obtain a text representation of a mixed language and markup language tags based on a set of language tags and the text representation of the second language, wherein the set of language tags includes at least a third language tag of a third language different from the first and second languages, and the markup language tags are used to indicate parallel corpus data across multiple languages associated with the first language, the second language and the third language; and an update module configured to use the text representation of the first language and the text representation of the mixed language as input to the translation model to update parameters of the translation model, the parameters including the parallel corpus data across multiple languages.
  • the apparatus further includes: a training module configured to use parallel corpus data to train the translation model, before the text representation of the second language is generated, until the translation model has balanced translation performance with respect to multiple language pairs, the parallel corpus data including corpus data of multiple language pairs.
  • the training module is configured to: determine a plurality of sampling factors for the original corpus data of a plurality of language pairs, each sampling factor being associated with the original corpus data of a corresponding language pair among the plurality of language pairs; and sample the original corpus data of the plurality of language pairs based on the plurality of sampling factors to obtain the parallel corpus data for training the translation model.
  • determining a plurality of sampling factors includes: determining a sampling ratio parameter based on the amount of corpus data of each language pair in the original corpus data and the total amount of corpus data; and applying, to the sampling ratio parameter, an adjustment coefficient associated with the importance of the corresponding language pair to obtain the plurality of sampling factors.
  • the acquisition module is configured to: sample word representations in the text representation of the second language based on a first proportion; convert, based on the set of language tags, the sampled first proportion of word representations into word representations corresponding to the set of languages; determine markup language tags associated with the converted first proportion of word representations; and generate the text representation of the mixed language based on the converted first proportion of word representations and the remaining word representations in the text representation of the second language.
  • the generation module is further configured to generate at least one target text representation in a target language through the updated translation model based on a source text representation in a source language and the markup language tags, and the update module is further configured to: determine a distance parameter between the target text representation and the source text representation; and update the first proportion based on the distance parameter.
  • the update module is configured to: if the distance parameter exceeds a distance threshold, update the first proportion to a second proportion, the second proportion being smaller than the first proportion; and if the distance parameter does not exceed the distance threshold, update the first proportion to a third proportion, the third proportion being greater than the first proportion.
  • the update module is further configured to: decrease the number of tags in the set of language tags if the distance parameter exceeds the distance threshold; and increase the number of tags in the set of language tags if the distance parameter does not exceed the distance threshold.
  • the update module is further configured to perform the following at least once: inputting the text representation of the mixed language as source data for training and the text representation of the first language as target data for training into the translation model; and obtaining, through the translation model, another text representation of a mixed language and updated markup language tags based on another set of language tags, wherein the other set of language tags includes at least a fourth language tag different from the set of language tags, and the updated markup language tags are used to indicate parallel corpus data across multiple languages associated with the first, second, third and fourth languages.
  • the apparatus further includes a determination module configured to: determine a performance parameter of the updated translation model; and stop updating the translation model if the performance parameter exceeds a threshold parameter, wherein the performance parameter includes a bilingual evaluation understudy (BLEU) score.
  • At least a portion of the translation model is based on a Glancing language model.
  • the apparatus further includes an execution module configured to: cause the updated translation model to be deployed for multilingual parallel translation tasks.
  • an apparatus for multilingual processing includes: an encoder configured to obtain original text data in a source language and a plurality of target language tags, and to encode the original text data into a source text representation in the source language; and a decoder deployed with a translation model having parallel corpus data across multiple languages, the decoder being configured to: decode, in parallel, the source text representation into a plurality of target text representations in the plurality of target languages indicated by the plurality of target language tags, based on the plurality of target language tags and the pre-configured parallel corpus data across multiple languages; and decode, in parallel, the plurality of target text representations into a plurality of target text data in the plurality of target languages.
  • the method of the first aspect is performed to train the translation model of the apparatus of the fourth aspect.
  • an electronic device, in an embodiment of the fifth aspect, includes a memory and a processor, wherein the memory is used to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method according to the first aspect or the third aspect.
  • a computer-readable storage medium stores one or more computer instructions, wherein the one or more computer instructions are executed by a processor to implement the method according to the first aspect or the third aspect.
  • a computer program product includes one or more computer instructions, wherein the one or more computer instructions, when executed by a processor, implement the method according to the first aspect or the third aspect.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

A method and apparatus for multilingual processing, a device, a storage medium, and a program product. The method comprises: generating a text representation of a second language by means of a translation model on the basis of a text representation of a first language and a second language tag (501); obtaining a text representation of a mixed language and a markup language tag by means of the translation model on the basis of a set of language tags and the text representation of the second language (502), wherein the set of language tags at least comprises a third language tag of a third language different from the first language and the second language, and the markup language tag is used for indicating cross-multilingual parallel corpus data associated with the first language, the second language, and the third language; and taking the text representation of the first language and the text representation of the mixed language as inputs of the translation model to update parameters of the translation model (503), the parameters comprising the cross-multilingual parallel corpus data. In this way, a context-dependent multilingual parallel processing model can be obtained, thereby greatly improving translation speed and quality.

Description

用于多语言处理的方法和装置Method and device for multilingual processing
本申请要求2021年9月28日递交的,标题为“用于多语言处理的方法和装置”、申请号为CN202111144057.5的中国发明专利申请的优先权。This application claims the priority of the Chinese invention patent application entitled "Method and Apparatus for Multilingual Processing" and application number CN202111144057.5 submitted on September 28, 2021.
技术领域technical field
本公开的各实施例涉及自然语言处理技术领域,更具体地,涉及用于多语言处理的方法、装置、设备、介质和程序产品。Various embodiments of the present disclosure relate to the technical field of natural language processing, and more specifically, to methods, devices, devices, media and program products for multilingual processing.
背景技术Background technique
多语言神经翻译(MNMT)技术可以将语言模型训练为处理跨多种语言的翻译任务。MNMT具有两个显著优势,其一是能够通过单个模型提供多语言翻译服务,大大降低了在线服务成本。其二,多语言训练使语言模型可以将高资源语言的知识转移到低资源语言,有助于改善低资源语言对的翻译质量。Multilingual neural translation (MNMT) technology can train language models to handle translation tasks across multiple languages. MNMT has two significant advantages, one is the ability to provide multilingual translation services through a single model, which greatly reduces the cost of online services. Second, multilingual training enables language models to transfer knowledge from high-resource languages to low-resource languages, helping to improve the translation quality of low-resource language pairs.
基于MNMT的***针对每种目标语言增加专用的解码器,而不对跨语言表示的对齐信息进行学习。另外,由于其采用自回归模型,在将源语言翻译成目标语言的过程中,需按照顺序进行解码。因此,目前多语言转换技术在解码速度和跨语言表示方面存在一定的改进空间。MNMT-based systems augment dedicated decoders for each target language without learning alignment information across language representations. In addition, because it uses an autoregressive model, it needs to be decoded in order during the process of translating the source language into the target language. Therefore, there is room for improvement in the decoding speed and cross-language representation of current multilingual conversion techniques.
Summary

Embodiments of the present disclosure provide a method, apparatus, device, medium, and program product for multilingual processing.

In a first aspect of the present disclosure, a method for multilingual processing is provided. The method includes: generating a text representation of a second language through a translation model based on a text representation of a first language and a second-language tag; obtaining, through the translation model, a text representation of a mixed language and markup language tags based on a set of language tags and the text representation of the second language, where the set of language tags includes at least a third-language tag of a third language different from the first language and the second language, and the markup language tags are used to indicate cross-multilingual parallel corpus data associated with the first language, the second language, and the third language; and using the text representation of the first language and the text representation of the mixed language as inputs of the translation model to update parameters of the translation model, the parameters including the cross-multilingual parallel corpus data.

In a second aspect of the present disclosure, an apparatus for multilingual processing is provided. The apparatus includes: a generation module configured to generate a text representation of a second language based on a text representation of a first language and a second-language tag; an acquisition module configured to obtain a text representation of a mixed language and markup language tags based on a set of language tags and the text representation of the second language, where the set of language tags includes at least a third-language tag of a third language different from the first language and the second language, and the markup language tags are used to indicate cross-multilingual parallel corpus data associated with the first language, the second language, and the third language; and an update module configured to use the text representation of the first language and the text representation of the mixed language as inputs of a translation model to update parameters of the translation model, the parameters including the cross-multilingual parallel corpus data.

In a third aspect of the present disclosure, a method for multilingual processing is provided. The method includes: obtaining raw text data in a source language and a plurality of target-language tags; encoding the raw text data into a source text representation in the source language; decoding, based on the plurality of target-language tags and preconfigured cross-multilingual parallel corpus data, the source text representation in parallel into a plurality of target text representations in the plurality of target languages indicated by the plurality of target-language tags; and decoding the plurality of target text representations in the plurality of target languages in parallel into a plurality of target text data in the plurality of target languages.

In a fourth aspect of the present disclosure, an apparatus for multilingual processing is provided. The apparatus includes: an encoder configured to obtain raw text data in a source language and a plurality of target-language tags, and to encode the raw text data into a source text representation in the source language; and a decoder deployed with a translation model having cross-multilingual parallel corpus data, the decoder being configured to decode, based on the plurality of target-language tags and the preconfigured cross-multilingual parallel corpus data, the source text representation in parallel into a plurality of target text representations in the plurality of target languages indicated by the plurality of target-language tags, and to decode the plurality of target text representations in the plurality of target languages in parallel into a plurality of target text data in the plurality of target languages.

In a fifth aspect of the present disclosure, an electronic device is provided. The electronic device includes a memory and a processor, where the memory is used to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method according to the first aspect or the third aspect.

In a sixth aspect of the present disclosure, a computer-readable storage medium is provided. One or more computer instructions are stored on the computer-readable storage medium, and the one or more computer instructions are executed by a processor to implement the method according to the first aspect or the third aspect.

In a seventh aspect of the present disclosure, a computer program product is provided. The computer program product includes one or more computer instructions, and the one or more computer instructions are executed by a processor to implement the method according to the first aspect or the third aspect.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter.
Brief Description of the Drawings

The above and other features, advantages, and aspects of the embodiments of the present disclosure will become more apparent from the following detailed description taken in conjunction with the accompanying drawings. In the drawings, the same or similar reference numerals denote the same or similar elements, in which:

FIG. 1 shows a block diagram of a multilingual processing system according to some embodiments of the present disclosure;

FIG. 2 shows a block diagram of an example GLAT-based architecture according to some embodiments of the present disclosure;

FIG. 3 shows a schematic diagram of the self-reinforced learning of a multilingual processing model according to some embodiments of the present disclosure;

FIG. 4A shows a schematic diagram of the differences in cross-language token representations between a multilingual processing system according to some embodiments of the present disclosure and a traditional multilingual conversion system;

FIG. 4B shows a schematic diagram of the translation performance of a multilingual processing system according to some embodiments of the present disclosure;

FIG. 5 shows a flowchart of a method for training a multilingual processing model according to some embodiments of the present disclosure;

FIG. 6 shows a flowchart of a method for multilingual processing according to some embodiments of the present disclosure;

FIG. 7 shows a block diagram of an apparatus for training a multilingual processing model according to some embodiments of the present disclosure; and

FIG. 8 shows a block diagram of a computing system in which one or more embodiments of the present disclosure can be implemented.

Throughout the drawings, the same or similar reference numerals denote the same or similar elements.
Detailed Description

Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only and are not intended to limit the protection scope of the present disclosure.
The term "language" used in the present disclosure refers to a category of language as defined in linguistics, also referred to as a language variety, such as English, Chinese, French, German, and so on. The term "corpus data" used in the present disclosure refers to a form in which a language is presented, such as text presented in written words, which carries content and meaning and can be understood by users of that language. Corpus data may also be information or data of a certain nature. Examples of the types of such information or data include, without limitation, speech, video, text, pictures, documents, and so on. The term "corpus" used in the present disclosure refers to a collection of corpus data, and a plurality of corpora may also be referred to as a "corpus collection".

The term "representation" used in the present disclosure refers to mapping corpus data to a corresponding low-dimensional vector (for example, a word embedding vector) so that it can be processed by a computing system. Known techniques such as word2vec or one-hot encoding may be used to map corpus data to representations; other existing or to-be-developed methods may of course also be used, and the present disclosure imposes no limitation in this respect. The term "token" used in the present disclosure refers to a unit with a specific meaning obtained by segmenting corpus data, for example, a unit of one word or of several consecutive words. Tokens can be used to analyze the content and meaning of text information. For example, the text "The weather is good today" includes the tokens ["The", "weather", "is", "good", "today"], and the text "今天天气不错" includes the tokens ["今天", "天气", "不错"].
The term "conversion" used herein refers to converting between any two types of information or data. Examples of conversion include, but are not limited to, translation between two languages, conversion between speech and text, conversion between text and pictures, and so on. In the context of the present disclosure, for convenience of discussion, the translation process between different languages is mainly taken as an example of the conversion process. Typically, the conversion process can be realized by means of a corresponding conversion model or translation model. Therefore, the terms "model" or "layer" are sometimes used in the description herein to refer to the corresponding conversion process.

The term "training" or "learning" used herein refers to the process of using experience or training data to update configuration parameters and to optimize system performance. For example, a machine translation system can gradually optimize its translation performance, for example by improving translation accuracy, through a training or learning process. The training or learning process can end when certain convergence conditions are satisfied. In the context of the present disclosure, the terms "training" and "learning" are used interchangeably for convenience of discussion. The term "inference" used herein refers to the process of performing a specific task on real-world data using a model or system that has been trained or has learned capabilities. It should be understood that the training and the inference of a system may occur in a particular order or concurrently.

The term "multilingual processing method/model" used herein refers to a method/model established on the basis of prior knowledge associated with the syntax, grammar, morphology, and so on of specific languages, which can be used to generate conversion results during the conversion process. The conversion results may include generated corpus data in the target language, and may also include generated representations of the corpus data in the target language; the representations of the corpus data in the target language can then be used by other entities for other tasks, such as classification tasks, labeling tasks, and so on.

As used herein, the term "include" and its variants are open-ended inclusions, that is, "including but not limited to". The term "based on" is to be read as "based at least in part on". The term "one embodiment" or "an embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment". Relevant definitions of other terms will be given in the description below.
In a traditional natural language processing model, corpus data (for example, text) in different languages is mapped to low-dimensional vectors, subjected to a series of processing steps, and then converted from vectors back to text. Taking an MNMT system as an example, let a sentence in the source language be X = {x_1, x_2, ..., x_M} and a sentence in the target language be Y = {y_1, y_2, ..., y_N}, where M and N denote the respective sentence lengths. MNMT reuses the standard bilingual neural translation model and extends the source and target inputs with language tokens, that is, X becomes X′ = {src, x_1, x_2, ..., x_M} and Y becomes Y′ = {tgt, y_1, y_2, ..., y_N}. Typically, MNMT models the mapping from X′ to Y′ with a Transformer. The Transformer consists of stacked encoder and decoder layers; an encoder layer is a self-attention block followed by a position-wise feed-forward block. On top of this architecture, each decoder layer has an additional encoder-decoder attention block. The encoder and the decoder are trained jointly so that the conditional probability of Y′ given X′ is maximized; the conditional probability P(Y′|X′) may be determined, for example, according to the following formula (1):

P(Y′|X′) = ∏_{t=1}^{N+1} P(y′_t | y′_{<t}, X′; θ)    (1)

where θ denotes the parameters of the trainable model.
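By way of illustration only, the language-token extension described above can be sketched in a few lines of Python; the tag strings and the function name below are assumptions made for the example and are not prescribed by the present disclosure:

def add_language_tag(tokens, lang_tag):
    # Prepend a language tag token, turning X = {x_1, ..., x_M}
    # into X' = {src, x_1, ..., x_M} (and likewise Y into Y').
    return [lang_tag] + list(tokens)

X = ["The", "cat", "is", "very", "cute"]
Y = ["Die", "Katze", "ist", "sehr", "süß"]
X_prime = add_language_tag(X, "<en>")  # source input with language token
Y_prime = add_language_tag(Y, "<de>")  # target input with language token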
Translation efficiency and translation quality are important metrics for evaluating machine translation performance. Traditional multilingual machine translation systems generally use autoregressive models. Such models generate a translation step by step, and the target-language word produced at each step depends on the previous translation results, so the translation quality is good but the translation speed is slow. If the text to be translated is large, a great deal of processing time is required. In addition, when multilingual processing tasks are performed, whether the vector representations of corpus data in different languages with associated (for example, identical) semantics are accurate and aligned significantly affects the conversion results and the translation quality. If the representations across languages are not aligned, the converted corpus data may even lose semantics, contain repeated words, omit translations, and so on.

The inventors have recognized that traditional multilingual machine translation systems are limited by their model structure and training method and cannot achieve a good trade-off between translation quality and translation speed. Therefore, embodiments of the present disclosure provide a non-autoregressive multilingual processing system. The system architecture has parallel processing capability and can decode the tokens at the various positions of a source text in parallel, greatly improving translation efficiency. In the training stage, the translation model learns aligned cross-language vector representations by incrementally creating context-dependent code-switched sentences. The multilingual processing system therefore exhibits high processing performance and translation quality in multilingual processing tasks.

In the following description, certain embodiments are discussed with reference to language translation processes, for example, between English, Chinese, and so on. It should be understood, however, that this is merely intended to enable those of ordinary skill in the art to better understand the principles and ideas of the embodiments of the present disclosure, and is not intended to limit the scope of the present disclosure in any way.
FIG. 1 shows a block diagram of a multilingual processing system 100 according to some embodiments of the present disclosure. The multilingual processing system 100 may be a computing system, a translation system, or any other device capable of performing language conversion tasks. It should be understood that the system 100 shown in FIG. 1 is merely exemplary and should not constitute any limitation on the functionality and scope of the implementations described in the present disclosure.

As shown in FIG. 1, the components of the multilingual processing system 100 may include, but are not limited to, an encoder 110 and a decoder 120. The encoder 110 and the decoder 120 may each include one or more processors or processing units, a memory, one or more communication units, one or more input devices, and one or more output devices (not shown). In addition, the multilingual processing system 100 is deployed with a translation model which, once trained, acquires a parallel context-dependent self-switching (PCSS) multilingual translation capability and can thus be used to perform multilingual processing tasks. In the context of the present disclosure, the terms "multilingual processing system 100" and "PCSS system" are used interchangeably.
Training of the translation model for the system 100 may include two stages. In the first training stage, the translation model may be trained with raw corpus data across multiple language pairs, the raw corpus data containing original parallel data. For example, a parallel corpus dataset D = {D_1, D_2, ..., D_L} containing L input-output language pairs is given, where i denotes the i-th language pair, X denotes text data in the input language (the input language may also be referred to as the source language), and Y denotes text data in the output language (the output language may also be referred to as the target language). For example, the translation model may be trained based on the following formula (2), which aims to maximize the sum of the log probabilities of the ground-truth outputs given the inputs:

L_stage1 = ∑_{i=1}^{L} ∑_{k=1}^{N_i} [ log P(Y^{i,k} | X^{i,k}; θ_M) + α · L_length^{i,k} ]    (2)

where θ_M denotes the parameters of the translation model, and α is a factor for controlling the relative importance of the loss function L_stage1 and the length loss L_length^{i,k}. The i-th parallel corpus D_i in D consists of N_i parallel sentences and is denoted as D_i = {(X^{i,k}, Y^{i,k})}_{k=1}^{N_i}. A specific training criterion may be used to compute L_length^{i,k}, which denotes the length-prediction loss for the k-th pair in the parallel corpus D_i.
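By way of illustration only, the accumulation of the stage-one objective over the L language pairs may be sketched in Python as follows; "translation_loss" and "length_loss" are hypothetical callables standing in for the negative log-likelihood term and the length-prediction loss of formula (2), and are not part of the present disclosure:

def stage1_loss(corpora, translation_loss, length_loss, alpha=0.1):
    # corpora: list of L datasets D_i, each a list of (X, Y) sentence pairs.
    # Accumulates -log P(Y|X; theta_M) + alpha * L_length over all pairs,
    # i.e. the (negated) objective of formula (2), to be minimized.
    total = 0.0
    for pairs in corpora:              # i = 1, ..., L
        for X, Y in pairs:             # k = 1, ..., N_i
            total += translation_loss(X, Y) + alpha * length_loss(X, Y)
    return total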
The following discussion takes a translation model adopting the Glancing Transformer (GLAT) architecture as an example. It should be understood, however, that embodiments of the present disclosure are not limited to the GLAT architecture and are applicable to any language conversion model. GLAT is a non-autoregressive translation (NAT) architecture that can achieve an 8- to 15-fold speedup in machine translation. FIG. 2 shows a block diagram of an example GLAT-based architecture 200 according to some embodiments of the present disclosure. As shown in FIG. 2, the GLAT architecture 200 includes an encoding module 201, a parallel decoding module 202, a sampling module 203, and a parallel decoding module 204. Of course, the GLAT architecture 200 may also include any other components, modules, elements, and so on required to perform processing tasks.
During the learning process, the GLAT architecture 200 performs two-step decoding. Specifically, let the source-language sentence input to the encoding module 201 be X and the target-language sentence be Y. Given that the collected input to the decoder F_d (that is, the parallel decoding module 202), as output from the encoder F_e (that is, the encoding module 201), is H = F_e(X; θ), Y can be predicted according to the following formula (3):

P(Ŷ|X; θ) = ∏_{t=1}^{T} P(ŷ_t | H; θ)    (3)

where θ denotes the parameters of the trainable model and Ŷ denotes the first-pass prediction of Y. The parallel decoding module 202 computes the distance between Y and Ŷ. Based on the computed distance, the sampling module 203 samples a subset of Y using a corresponding glancing sampling strategy to obtain GS(Y, Ŷ), where Y∖GS(Y, Ŷ) denotes the subset remaining after the sampled tokens are removed from the target-language sentence Y.
In the second parallel decoding, the GLAT architecture 200 can predict the target sentence Y based on the subset GS(Y, Ŷ) and the source-language sentence X according to the following formula (4):

L_GLAT = ∑_{y_t ∈ Y∖GS(Y, Ŷ)} log P(y_t | GS(Y, Ŷ), X; θ)    (4)

where the decoder input updated with the sampled tokens in GS(Y, Ŷ) is used for the second decoding.
The translation term of formula (2) can be computed according to the GLAT-based training criterion shown in formula (4). Further, the length-prediction loss L_length^{i,k} can be determined as follows:

L_length = − p* · log p̂    (5)

where p* denotes a one-hot vector characterizing the optimal target-length distribution, and p̂ is the prediction vector based on the output H̄ of the encoding module 201 as well as the source-language embedding E_src and the target-language embedding E_tgt, that is, p̂ = softmax([H̄; E_src; E_tgt]), where [] denotes the concatenation operation. Here, a normalization operator (for example, softmax) is applied for the length prediction, and the normalized output corresponds to the set of possible target lengths.
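By way of illustration only, one glancing training step consistent with formulas (3) to (4) may be sketched in Python as follows; "first_decode" and "second_decode" are hypothetical handles to the two parallel decoding passes, and the sampling ratio and the Hamming distance are illustrative choices rather than requirements of the present disclosure:

import random

def glancing_sample(Y_ref, Y_hat, ratio=0.5):
    # Sample GS(Y, Y_hat): the number of reference tokens revealed to the
    # decoder is proportional to the distance between the first-pass
    # prediction and the reference (Hamming distance used for illustration).
    distance = sum(1 for y, y_hat in zip(Y_ref, Y_hat) if y != y_hat)
    n_sample = min(int(ratio * distance), len(Y_ref))
    positions = random.sample(range(len(Y_ref)), k=n_sample)
    return {i: Y_ref[i] for i in positions}  # position -> glanced token

def glat_step(X, Y_ref, first_decode, second_decode):
    # Two-step decoding: first parallel prediction (formula (3)), glancing
    # sampling, then a second parallel prediction of the remaining tokens
    # conditioned on the glanced tokens (formula (4)).
    Y_hat = first_decode(X)
    glanced = glancing_sample(Y_ref, Y_hat)
    return second_decode(X, glanced)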
In the same manner, the GLAT module is trained with the corpora of all L language pairs in the parallel corpus dataset D until the convergence condition is satisfied. At this point, the translation model has balanced translation performance across the L language pairs and can enter the second training stage.
In the second training stage, the decoder 120 may perform two-level decoding, performing context-dependent code-switching on the input text representations. As shown in FIG. 1, assuming that the first language serving as the source language is English (En) and the second language serving as the target language is German (De), the decoder 120 obtains from the input the text representation [The, cat, is, very, cute] of the first language and the second-language tag Lang=De.
In the first-level decoding 121, the decoder 120 generates the text representation [Die, Katze, ist, sehr, süß] of the second language based on the text representation of the first language and the second-language tag. The decoder 120 then obtains the mixed-language text representation [Die, chat, est, muy, lindo] and markup language tags based on a set of language tags Lang=De,Fr,Es and the text representation of the second language. Specifically, after the first training stage ends, the decoder 120 may sample the parallel corpus dataset D to obtain a parallel corpus data subset of length T, denoted D_T = {(X_j, Y_j, src_j, tgt_j)}_{j=1}^{T}, where src_j and tgt_j denote the source-language tag and the target-language tag of the j-th language pair, respectively. Then, in the first-level decoding 121, the decoder 120 masks the target-language text data Y_j at a given ratio P_M to obtain Y_j^mask. Thereafter, the decoder 120 uses markup language tags to decode the masked positions in Y_j^mask into randomly sampled languages. The finally decoded text sequence Y_j^cs therefore includes the markup language tags of the mixed languages, which can be used to indicate the cross-multilingual parallel corpus data.
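By way of illustration only, the masking and re-decoding step of the first-level decoding 121 may be sketched in Python as follows; "translate_token" is a hypothetical stand-in for the model decoding a masked position into the sampled language, and the dummy lambda in the usage line merely tags tokens for demonstration:

import random

def code_switch(Y, base_lang, other_langs, p_mask, translate_token):
    # Mask tokens of Y at ratio p_mask and re-decode each masked position
    # into a randomly sampled language, yielding the mixed sequence Y^cs
    # together with per-position language tags.
    mixed, tags = [], []
    for token in Y:
        if random.random() < p_mask:
            lang = random.choice(other_langs)
            mixed.append(translate_token(token, lang))
            tags.append(lang)
        else:
            mixed.append(token)
            tags.append(base_lang)
    return mixed, tags

Y = ["Die", "Katze", "ist", "sehr", "süß"]
mixed, tags = code_switch(Y, "<de>", ["<fr>", "<es>"], 0.4,
                          translate_token=lambda tok, lang: f"{lang}:{tok}")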
In the second-level decoding 122, the decoder 120 may take the mixed-language text sequence Y^cs as the source-side input and the source-language text sequence X as the target-side input, and decode them to obtain a synthetic parallel corpus D^cs = {(Y_j^cs, X_j)}_{j=1}^{T}.
In a similar manner, the second-stage training of the translation model is performed cross-iteratively with the code-switched corpus data D^cs and the parallel corpus data D as follows:

L_stage2 = L_GLAT(D^cs; θ_M) + L_GLAT(D; θ_M)    (6)

that is, the GLAT training criterion is applied alternately to the synthetic code-switched corpus and to the original parallel corpus.
This cross-iterative training manner enables self-reinforcement of the translation model. Finally, the decoder 120 can obtain markup language tags across the L language pairs. The markup language tags are used to distinguish parallel corpus data in different languages; they are the word tokens at the same positions in the text representations. For example, the multilingual processing system 100 may include K stacked encoder layers and decoder layers. Language-specific tags are added to the first-layer inputs and the last-layer outputs at the respective positions, as follows:

H_i^1 = Emb(x_i) + E_src,    S_j^1 = Emb(y_j) + E_tgt
H̃_i^K = H_i^K + E_src,    S̃_j^K = S_j^K + E_tgt    (7)

where H_i^1 denotes the first encoder-layer input at position i and S_j^1 denotes the first decoder-layer input at position j. Correspondingly, H_i^K denotes the output of the last encoder layer at position i, and S_j^K denotes the output of the last decoder layer at position j. E_src and E_tgt denote the representation of the source-language token and the representation of the target-language token, respectively. The tagged output S̃_j^K can thus be used to update the text representation y_t in formula (4).
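By way of illustration only, the tag-addition of formula (7) may be sketched with NumPy as follows; the array shapes and the function name are assumptions made for the example:

import numpy as np

def add_language_embedding(first_layer_in, last_layer_out, e_lang):
    # Add a language-specific embedding e_lang (shape [d]) to every position
    # of the first-layer input and of the last-layer output (shapes [seq, d]),
    # mirroring formula (7).
    return first_layer_in + e_lang, last_layer_out + e_lang

d = 8
H_first = np.zeros((5, d))      # first encoder-layer inputs, positions i = 1..5
H_last = np.zeros((5, d))       # last encoder-layer outputs
E_src = np.random.randn(d)      # source-language tag embedding
H_first_tagged, H_last_tagged = add_language_embedding(H_first, H_last, E_src)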
It should be understood that the decoder 120 may include separate decoding units for performing the first-level decoding 121 and the second-level decoding 122, respectively, or may include a single decoder that supports both the first-level decoding 121 and the second-level decoding 122; the present disclosure is not limited in this respect.
Additionally or alternatively, as the second-stage training is performed iteratively, the translation model may gradually adjust the mask ratio P_M and the number of mixed languages based on the training results. FIG. 3 shows a schematic diagram of the self-reinforced learning of a multilingual processing model according to some embodiments of the present disclosure. With a step size of 0.1, the value of the mask ratio P_M iterates from 0.1 to 0.5 every I epochs, which can be expressed as follows:

P_M = (((Epoch ÷ I) mod 5) + 1) ÷ 10    (8)

where Epoch denotes the current epoch number.
As shown in FIG. 3, in the first iteration of P_M, the number of mixed languages is set to 1. Thereafter, the number of mixed languages is increased to one third of the total number of languages. A large number of code-switched sentences are generated during the iterative process. This helps the translation model learn context-dependent, aligned cross-language representations and thus provides better translation performance.
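By way of illustration only, the schedule of formula (8) transcribes directly into Python (reading "÷" as integer division over the epoch counter, which is an assumption):

def mask_ratio(epoch, interval):
    # Formula (8): steps through 0.1, 0.2, ..., 0.5 every `interval` epochs,
    # then wraps around.
    return (((epoch // interval) % 5) + 1) / 10

assert mask_ratio(0, 10) == 0.1
assert mask_ratio(45, 10) == 0.5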
Additionally or alternatively, in the case where the translation model is trained with a deep neural network, an annealed dropout strategy can be applied so that the number of randomly zeroed neurons is gradually reduced during training. For example, a linear annealing procedure over given mini-batches (for example, 64,000 tokens) can be used as follows:

P_d[t] = max(0, P_d[0] · (1 − t/N))    (9)

where t denotes the training update, N denotes the total annealing amount, and P_d[0] denotes the initial dropout rate (for example, 0.3). The annealed dropout strategy can stabilize training and improve translation quality. In particular, for acoustic models, applying the annealed dropout strategy can substantially reduce the word error rate of the model.
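By way of illustration only, the annealing schedule of formula (9), as reconstructed above, transcribes directly into Python:

def annealed_dropout(t, n_total, p0=0.3):
    # Formula (9): the dropout rate starts at p0 and decays linearly to zero
    # after n_total training updates.
    return max(0.0, p0 * (1.0 - t / n_total))

rates = [annealed_dropout(t, n_total=100_000) for t in (0, 50_000, 100_000)]
# rates == [0.3, 0.15, 0.0]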
In the training stage of the translation model, only the decoder 120 of the translation system 100 may be involved. In the inference stage, the translation model performs multilingual processing tasks through the encoder 110 and the decoder 120.

In the inference stage, the trained PCSS model exhibits significantly enhanced translation performance, in terms of both translation speed and translation quality. Table 1 below shows a comparison of the translation performance of traditional translation models and of the multilingual processing model obtained with the training method of the present disclosure on English, German, and French mutual translation tasks, with the average bilingual evaluation understudy (BLEU) score used as the performance parameter for measuring translation performance.

Table 1. Performance comparison results of various translation models
[Table 1 is rendered as an image in the original publication; the per-model BLEU scores and speed figures are not recoverable from the extracted text.]
Here, Transformer and GLAT are bilingual translation models; M-Transformer, GLSR, and Adaptor are multilingual translation models; and MNAT and the PCSS proposed in the present disclosure are multilingual NAT models. As shown in Table 1, compared with M-Transformer, PCSS achieves a 6.1-fold translation speed and exceeds its average score by +1.7 BLEU.

FIG. 4A and FIG. 4B visually illustrate the translation performance of a multilingual processing system according to some embodiments of the present disclosure. FIG. 4A shows schematic diagrams of the token representations obtained, for bilingual words from an English-German dictionary, by a traditional multilingual conversion system of the prior art and by the PCSS-based translation system proposed in the present disclosure, respectively, where blue indicates English words and red indicates German words. As shown in diagram (A) of FIG. 4A, there is a clear demarcation in the cross-language token representations learned by the traditional multilingual conversion system, whereas the cross-language token representations learned by the PCSS-based translation system are well aligned.

FIG. 4B shows a schematic diagram of the performance of the PCSS-based translation system according to some embodiments of the present disclosure, where each German word is shown in green; the English word paired with a corresponding German word is shown in blue if their similarity is greater than a given threshold (for example, 0.8) and in yellow if the similarity is less than the given threshold. As shown in FIG. 4B, the number of English words shown in blue is far higher than the number of English words shown in yellow. That is, the PCSS-based translation system can produce well-aligned cross-language vector representations, which greatly improves translation quality.
FIG. 5 shows a flowchart of a method 500 for training a multilingual processing model according to some embodiments of the present disclosure. The method 500 can be implemented by the multilingual translation system 100, for example at the encoder 110 and the decoder 120 of the translation system 100.

At block 501, the decoder 120 generates a text representation of a second language through a translation model based on a text representation of a first language and a second-language tag. As an example, the translation model may be based on the GLAT language model.

Additionally or alternatively, in some embodiments, the method 500 further includes: before the text representation of the second language is generated through the translation model, training the translation model with parallel corpus data until the translation model has balanced translation performance across multiple language pairs. The parallel corpus data may include corpus data of multiple language pairs. In this way, the trained translation model has similar translation performance for the multiple language pairs.
In some embodiments, the method may further include: determining, for the raw corpus data of the multiple language pairs, a plurality of sampling factors, each sampling factor being associated with the raw corpus data of a corresponding language pair among the multiple language pairs; and sampling the raw corpus data of the multiple language pairs based on the plurality of sampling factors to obtain the parallel corpus data for training the translation model.

Additionally or alternatively, in some embodiments, determining the plurality of sampling factors may include: determining a sampling ratio parameter based on the amount of corpus data of each language pair in the raw corpus data and the total amount of corpus data; and applying, to the sampling ratio parameter, an adjustment coefficient associated with the importance of the corresponding language pair to obtain the plurality of sampling factors.
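By way of illustration only, one way to realize such sampling factors is temperature-based smoothing, sketched below in Python; the smoothing exponent is an illustrative assumption, since the disclosure only requires an adjustment coefficient associated with language-pair importance:

def sampling_factors(sizes, temperature=5.0):
    # sizes: amount of corpus data per language pair. The ratio n_i / sum(n)
    # is the sampling ratio parameter; raising it to 1/temperature acts as
    # the adjustment coefficient and upsamples low-resource pairs.
    total = sum(sizes)
    adjusted = [(n / total) ** (1.0 / temperature) for n in sizes]
    norm = sum(adjusted)
    return [a / norm for a in adjusted]

print(sampling_factors([1_000_000, 10_000, 500]))  # low-resource pairs upsampled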
At block 502, a text representation of a mixed language and markup language tags are obtained through the translation model based on a set of language tags and the text representation of the second language. The set of language tags includes at least a third-language tag of a third language different from the first language and the second language. The markup language tags are used to indicate the cross-multilingual parallel corpus data associated with the first language, the second language, and the third language.

Additionally or alternatively, in some embodiments, the training method may include: the decoder 120 samples the word representations in the text representation of the second language based on a first ratio; based on a set of language tags, the sampled first-ratio word representations are converted into word representations corresponding to the set of languages; the decoder 120 determines the markup language tags associated with the converted first-ratio word representations; and the decoder 120 then generates the text representation of the mixed language based on the converted first-ratio word representations and the remaining word representations in the text representation of the second language.

Additionally or alternatively, in some embodiments, the training method may include: the decoder 120 generates, based on a source text representation of a source language and the markup language tags, a target text representation of at least one target language through the updated translation model; the decoder 120 determines a distance parameter between the target text representation and the source text representation; and the decoder 120 updates the first ratio based on the distance parameter.

At block 503, the decoder 120 uses the text representation of the first language and the text representation of the mixed language as inputs of the translation model to update parameters of the translation model. As an example, the parameters may include the cross-multilingual parallel corpus data.

Additionally or alternatively, in some embodiments, the decoder 120 may perform the following operations at least once: inputting the text representation of the mixed language into the translation model as source data for training and the text representation of the first language as target data for training; and obtaining, based on another set of language tags, another text representation of a mixed language and updated markup language tags, where the other set of language tags includes at least a fourth-language tag different from those in the first set of language tags, and the updated markup language tags are used to indicate the cross-multilingual parallel corpus data associated with the first language, the second language, the third language, and the fourth language.

Additionally or alternatively, in some embodiments, the translation model obtained by the training method 500 can be used to perform multilingual processing tasks.
FIG. 6 shows a flowchart of a method 600 for multilingual processing according to some embodiments of the present disclosure. The method 600 can be implemented by the multilingual translation system 100, for example at the encoder 110 and the decoder 120 of the translation system 100.

At block 601, the encoder 110 obtains raw text data in a source language and a plurality of target-language tags.

At block 602, the encoder 110 encodes the raw text data into a source text representation in the source language. The encoder 110 can then output the source text representation to the decoder 120.

At block 603, the decoder 120 decodes, based on the plurality of target-language tags and the preconfigured cross-multilingual parallel corpus data, the source text representation in parallel into a plurality of target text representations in the plurality of target languages indicated by the plurality of target-language tags.

At block 604, the decoder 120 decodes the plurality of target text representations in the plurality of target languages in parallel into a plurality of target text data in the plurality of target languages.
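By way of illustration only, the encode-once, decode-in-parallel flow of blocks 601 to 604 may be sketched in Python as follows; "encode" and "decode" are hypothetical handles to the trained PCSS encoder and decoder, and thread-based parallelism is an illustrative choice:

from concurrent.futures import ThreadPoolExecutor

def translate_multi(encode, decode, text, target_tags):
    # Blocks 601-602: encode the raw text once into a source representation.
    source_repr = encode(text)
    # Blocks 603-604: decode into every requested target language in parallel.
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda tag: decode(source_repr, tag), target_tags)
        return dict(zip(target_tags, results))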
FIG. 7 shows a block diagram of an apparatus 700 for training a multilingual processing model according to some embodiments of the present disclosure. The apparatus includes a generation module 701, an acquisition module 702, and an update module 703. The generation module 701 is configured to generate a text representation of a second language based on a text representation of a first language and a second-language tag. The acquisition module 702 is configured to obtain a text representation of a mixed language and markup language tags based on a set of language tags and the text representation of the second language, where the set of language tags includes at least a third-language tag of a third language different from the first language and the second language, and the markup language tags can be used to indicate the cross-multilingual parallel corpus data associated with the first language, the second language, and the third language. The update module 703 is configured to use the text representation of the first language and the text representation of the mixed language as inputs of the translation model to update parameters of the translation model, the parameters including the cross-multilingual parallel corpus data.

According to the embodiments of the present disclosure, a multilingual processing apparatus is provided. The multilingual processing apparatus adopts a context-dependent non-autoregressive translation model and can learn cross-language representations across multiple language pairs. The multilingual processing apparatus can perform multilingual translation tasks in parallel, which significantly speeds up translation. In addition, good translation quality can be obtained through the aligned cross-language token representations.
FIG. 8 shows a block diagram of a computing system 800 in which one or more embodiments of the present disclosure can be implemented. The method 500 shown in FIG. 5 and the method 600 shown in FIG. 6 can be implemented by the computing system 800. The computing system 800 shown in FIG. 8 is merely an example and should not constitute any limitation on the functionality and scope of use of the implementations described herein.

As shown in FIG. 8, the computing system 800 takes the form of a general-purpose computing device. The components of the computing system 800 may include, but are not limited to, one or more processors or processing units 810, a memory 820, one or more input devices 830, one or more output devices 840, a storage device 850, and one or more communication units 860. The processing unit 810 may be an actual or virtual processor and is capable of performing various processes according to instructions stored in the memory 820. In a multiprocessing system, multiple processing units execute computer-executable instructions to increase processing power.

The computing system 800 typically includes multiple computer media. Such media may be any available media accessible to the computing system 800, including but not limited to volatile and non-volatile media, and removable and non-removable media. The memory 820 may be volatile memory (for example, registers, cache, random access memory (RAM)), non-volatile memory (for example, read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory), or some combination thereof. The storage device 850 may be removable or non-removable and may include machine-readable media, such as a flash drive, a magnetic disk, or any other medium that can be used to store information and that can be accessed within the computing system 800.
The computing system 800 may further include additional removable/non-removable, volatile/non-volatile computer system storage media. Although not shown in FIG. 8, a disk drive for reading from or writing to a removable, non-volatile magnetic disk (for example, a "floppy disk") and an optical disc drive for reading from or writing to a removable, non-volatile optical disc may be provided. In these cases, each drive may be connected to a bus by one or more data media interfaces. The memory 820 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the various embodiments described herein.

A program/utility tool 822 having a set of one or more execution modules 824 may be stored, for example, in the memory 820. The execution modules 824 may include, but are not limited to, an operating system, one or more application programs, other program modules, and operational data. Each of these examples, or particular combinations thereof, may include an implementation of a networking environment. The execution modules 824 generally carry out the functions and/or methods of the embodiments of the subject matter described herein, for example the method 500 and/or the method 600.

The input unit 830 may be one or more of various input devices. For example, the input unit 830 may include user devices such as a mouse, a keyboard, a trackball, and so on. The communication unit 860 enables communication with further computing entities over communication media. Additionally, the functionality of the components of the computing system 800 may be implemented in a single computing cluster or in multiple computing machines capable of communicating over communication connections. Therefore, the computing system 800 may operate in a networked environment using logical connections to one or more other servers, network personal computers (PCs), or another general network node. By way of example and not limitation, communication media include wired or wireless networking technologies.

The computing system 800 may also communicate, as needed, with one or more external devices (not shown), such as storage devices, display devices, and so on, with one or more devices that enable users to interact with the computing system 800, or with any device (for example, a network card, a modem, and so on) that enables the computing system 800 to communicate with one or more other computing devices. Such communication may be performed via an input/output (I/O) interface (not shown).
The functions described herein may be performed, at least in part, by one or more hardware logic components. For example, and without limitation, illustrative types of hardware logic components that may be used include field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), and so on.

Program code for implementing the methods of the subject matter described herein may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

Furthermore, although the operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are contained in the above discussion, these should not be construed as limitations on the scope of the subject matter described herein. Certain features described in the context of separate implementations may also be implemented in combination in a single implementation. Conversely, various features described in the context of a single implementation may also be implemented in multiple implementations separately or in any suitable sub-combination.
以下列出了本公开的一些示例实现。Some example implementations of the present disclosure are listed below.
In some embodiments of a first aspect, a method for multilingual processing is provided. The method includes: generating a text representation of a second language through a translation model based on a text representation of a first language and a second language tag; obtaining, through the translation model, a text representation of a mixed language and markup language tags based on a set of language tags and the text representation of the second language, where the set of language tags includes at least a third language tag of a third language different from the first language and the second language, and the markup language tags are used to indicate parallel corpus data across multiple languages associated with the first language, the second language, and the third language; and updating parameters of the translation model by using the text representation of the first language and the text representation of the mixed language as inputs to the translation model, the parameters including the parallel corpus data across multiple languages.
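To make the flow of the first aspect concrete, the following Python sketch strings the three steps together. It is a minimal illustration only: the model object and every method on it (generate, mix, train_step) are hypothetical placeholders, not an API defined by this disclosure.

    def multilingual_training_step(model, text_l1, tag_l2, lang_tags):
        # Step 1: generate the second-language text representation from
        # the first-language text representation and the second-language tag.
        repr_l2 = model.generate(text_l1, target_tag=tag_l2)
        # Step 2: obtain a mixed-language representation and the markup
        # language tags, using a tag set that includes a third language.
        mixed_repr, markup_tags = model.mix(repr_l2, lang_tags)
        # Step 3: update the model parameters, using the mixed-language
        # representation as input and the first-language text as target.
        loss = model.train_step(source=mixed_repr, target=text_l1,
                                tags=markup_tags)
        return loss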
In some embodiments, the method further includes: before the text representation of the second language is generated through the translation model, training the translation model with parallel corpus data until the translation model has balanced translation performance with respect to multiple language pairs, the parallel corpus data including corpus data of the multiple language pairs.
In some embodiments, training the translation model with the parallel corpus data includes: determining, for original corpus data of the multiple language pairs, multiple sampling factors, each sampling factor being associated with the original corpus data of a corresponding language pair among the multiple language pairs; and sampling the original corpus data of the multiple language pairs based on the multiple sampling factors to obtain the parallel corpus data for training the translation model.
In some embodiments, determining the multiple sampling factors includes: determining a sampling ratio parameter based on the amount of corpus data of each language pair in the original corpus data and the total amount of corpus data; and applying, to the sampling ratio parameter, an adjustment coefficient associated with the importance of the corresponding language pair, to obtain the multiple sampling factors.
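One plausible realization of such sampling factors is temperature-based resampling, a common choice in multilingual NMT training. The sketch below is an illustration under stated assumptions: the temperature value, the default weights, and the function names are chosen for the example and are not fixed by the disclosure.

    import random

    def sampling_factors(pair_sizes, temperature=5.0, weights=None):
        # pair_sizes: language pair -> number of sentence pairs in the
        # original corpus data; weights: optional per-pair adjustment
        # coefficients reflecting each pair's importance (default 1.0).
        total = sum(pair_sizes.values())
        # Sampling ratio parameter: each pair's share of the total corpus.
        ratios = {p: n / total for p, n in pair_sizes.items()}
        # Temperature flattens the distribution so low-resource pairs are
        # sampled more often; then the importance adjustment is applied.
        adjusted = {p: (r ** (1.0 / temperature)) * (weights or {}).get(p, 1.0)
                    for p, r in ratios.items()}
        norm = sum(adjusted.values())
        return {p: a / norm for p, a in adjusted.items()}

    def sample_language_pair(factors):
        # Draw one language pair according to the sampling factors.
        pairs, probs = zip(*factors.items())
        return random.choices(pairs, weights=probs, k=1)[0]

    # Example: low-resource pairs get boosted relative to their raw share.
    factors = sampling_factors({"en-de": 4_500_000, "en-ro": 600_000,
                                "en-ne": 60_000})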
In some embodiments, obtaining the text representation of the mixed language and the markup language tags includes: sampling word representations in the text representation of the second language based on a first proportion; converting, based on the set of language tags, the sampled first proportion of word representations into word representations corresponding to the set of languages; determining the markup language tags associated with the converted first proportion of word representations; and generating the text representation of the mixed language based on the converted first proportion of word representations and the remaining word representations in the text representation of the second language.
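A simple way to picture this mixing step is token-level code-switching. The sketch below assumes a translate(token, lang) lookup (for example, one backed by bilingual dictionaries); that helper, and the default proportion of 0.15, are illustrative assumptions rather than values taken from the disclosure.

    import random

    def mix_tokens(tokens, lang_tags, translate, proportion=0.15):
        # tokens: the second-language sequence; lang_tags: candidate
        # languages, including at least a third language.
        n_mix = max(1, int(len(tokens) * proportion))
        positions = random.sample(range(len(tokens)), n_mix)
        mixed = list(tokens)
        markup = {}
        for pos in positions:
            lang = random.choice(lang_tags)
            # Convert the sampled word to the chosen language and record
            # which language was used (the markup language tag).
            mixed[pos] = translate(tokens[pos], lang)
            markup[pos] = lang
        # Remaining positions keep their second-language words.
        return mixed, markup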
In some embodiments, the method further includes: generating, through the updated translation model, a target text representation of at least one target language based on a source text representation of a source language and the markup language tags; determining a distance parameter between the target text representation and the source text representation; and updating the first proportion based on the distance parameter.
In some embodiments, updating the first proportion includes: if the distance parameter exceeds a distance threshold, updating the first proportion to a second proportion, the second proportion being smaller than the first proportion; and if the distance parameter does not exceed the distance threshold, updating the first proportion to a third proportion, the third proportion being greater than the first proportion.
In some embodiments, the method further includes: if the distance parameter exceeds the distance threshold, decreasing the number of tags in the set of language tags; and if the distance parameter does not exceed the distance threshold, increasing the number of tags in the set of language tags.
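Read together, these three embodiments describe a feedback loop on the mixing schedule. One possible update rule is sketched below; the step size, the fixed candidate tag pool, and the particular distance measure are all assumptions made for the example, not requirements of the disclosure.

    def update_mixing_schedule(distance, threshold, proportion,
                               active_tags, candidate_tags, step=0.05):
        if distance > threshold:
            # Representations are still far apart: mix less aggressively
            # (second proportion < first) and use fewer language tags.
            proportion = max(0.0, proportion - step)
            if len(active_tags) > 1:
                active_tags = active_tags[:-1]
        else:
            # Representations are close: mix more aggressively (third
            # proportion > first) and draw in another language tag.
            proportion = min(1.0, proportion + step)
            unused = [t for t in candidate_tags if t not in active_tags]
            if unused:
                active_tags = active_tags + [unused[0]]
        return proportion, active_tags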
In some embodiments, updating the translation model includes performing the following operations at least once: inputting the text representation of the mixed language into the translation model as source data for training and the text representation of the first language as target data for training; and obtaining, through the translation model, another text representation of a mixed language and updated markup language tags based on another set of language tags, where the other set of language tags includes at least a fourth language tag different from the set of language tags, and the updated markup language tags are used to indicate parallel corpus data across multiple languages associated with the first language, the second language, the third language, and the fourth language.
In some embodiments, the method further includes: determining a performance parameter of the updated translation model; and if the performance parameter exceeds a threshold parameter, stopping the updating of the translation model, where the performance parameter includes a bilingual replacement evaluation (BLEU) score.
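As one way to realize this stopping criterion, the check below scores a held-out validation set with the sacrebleu package; the package choice and the threshold value are assumptions for illustration.

    import sacrebleu  # assumed evaluation dependency

    def should_stop_updating(hypotheses, references, threshold=30.0):
        # hypotheses: model translations of a held-out set;
        # references: the corresponding reference translations.
        bleu = sacrebleu.corpus_bleu(hypotheses, [references])
        return bleu.score > threshold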
In some embodiments, at least a portion of the translation model is based on a Glancing language model.
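For context, glancing training (as in the GLAT model) feeds the decoder a sampled subset of reference tokens, with the subset growing with how far the first-pass prediction is from the reference. The sketch below is a generic illustration of that sampling step, not the disclosure's exact procedure; the ratio glance_lambda is an assumed hyperparameter.

    import random

    def glancing_targets(reference, first_pass, glance_lambda=0.5):
        # Hamming-style distance between the first decoding pass and
        # the reference determines how many tokens to reveal.
        distance = sum(r != p for r, p in zip(reference, first_pass))
        n_glance = int(glance_lambda * distance)
        revealed = set(random.sample(range(len(reference)), n_glance))
        # Revealed positions keep the reference token; the rest stay
        # masked (None) and must be predicted in the second pass.
        return [tok if i in revealed else None
                for i, tok in enumerate(reference)]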
In some embodiments, the method further includes causing the updated translation model to be deployed for multilingual parallel translation tasks.
In some embodiments of a second aspect, a method for multilingual processing is provided. The method includes: obtaining original text data of a source language and multiple target language tags; encoding the original text data into a source text representation of the source language; decoding, in parallel, the source text representation into multiple target text representations of multiple target languages indicated by the multiple target language tags, based on the multiple target language tags and preconfigured parallel corpus data across multiple languages; and decoding, in parallel, the multiple target text representations of the multiple target languages into multiple target text data of the multiple target languages.
In some embodiments, the method of the first aspect is performed to train the translation model of the apparatus of the second aspect.
In an embodiment of a third aspect, an apparatus for multilingual processing is provided. The apparatus includes: a generation module configured to generate a text representation of a second language based on a text representation of a first language and a second language tag; an acquisition module configured to obtain a text representation of a mixed language and markup language tags based on a set of language tags and the text representation of the second language, where the set of language tags includes at least a third language tag of a third language different from the first language and the second language, and the markup language tags are used to indicate parallel corpus data across multiple languages associated with the first language, the second language, and the third language; and an update module configured to update parameters of a translation model by using the text representation of the first language and the text representation of the mixed language as inputs to the translation model, the parameters including the parallel corpus data across multiple languages.
In some embodiments, the apparatus further includes a training module configured to train the translation model with parallel corpus data before the text representation of the second language is generated through the translation model, until the translation model has balanced translation performance with respect to multiple language pairs, the parallel corpus data including corpus data of the multiple language pairs.
In some embodiments, the training module is configured to: determine, for original corpus data of the multiple language pairs, multiple sampling factors, each sampling factor being associated with the original corpus data of a corresponding language pair among the multiple language pairs; and sample the original corpus data of the multiple language pairs based on the multiple sampling factors to obtain the parallel corpus data for training the translation model.
In some embodiments, determining the multiple sampling factors includes: determining a sampling ratio parameter based on the amount of corpus data of each language pair in the original corpus data and the total amount of corpus data; and applying, to the sampling ratio parameter, an adjustment coefficient associated with the importance of the corresponding language pair, to obtain the multiple sampling factors.
In some embodiments, the acquisition module is configured to: sample word representations in the text representation of the second language based on a first proportion; convert, based on the set of language tags, the sampled first proportion of word representations into word representations corresponding to the set of languages; determine the markup language tags associated with the converted first proportion of word representations; and generate the text representation of the mixed language based on the converted first proportion of word representations and the remaining word representations in the text representation of the second language.
In some embodiments, the generation module is further configured to generate, through the updated translation model, a target text representation of at least one target language based on a source text representation of a source language and the markup language tags, and the update module is further configured to: determine a distance parameter between the target text representation and the source text representation; and update the first proportion based on the distance parameter.
In some embodiments, the update module is configured to: if the distance parameter exceeds a distance threshold, update the first proportion to a second proportion, the second proportion being smaller than the first proportion; and if the distance parameter does not exceed the distance threshold, update the first proportion to a third proportion, the third proportion being greater than the first proportion.
In some embodiments, the update module is further configured to: if the distance parameter exceeds the distance threshold, decrease the number of tags in the set of language tags; and if the distance parameter does not exceed the distance threshold, increase the number of tags in the set of language tags.
In some embodiments, the update module is further configured to perform the following operations at least once: input the text representation of the mixed language into the translation model as source data for training and the text representation of the first language as target data for training; and obtain, through the translation model, another text representation of a mixed language and updated markup language tags based on another set of language tags, where the other set of language tags includes at least a fourth language tag different from the set of language tags, and the updated markup language tags are used to indicate parallel corpus data across multiple languages associated with the first language, the second language, the third language, and the fourth language.
In some embodiments, the apparatus further includes a determination module configured to: determine a performance parameter of the updated translation model; and if the performance parameter exceeds a threshold parameter, stop the updating of the translation model, where the performance parameter includes a bilingual replacement evaluation (BLEU) score.
In some embodiments, at least a portion of the translation model is based on a Glancing language model.
In some embodiments, the apparatus further includes an execution module configured to cause the updated translation model to be deployed for multilingual parallel translation tasks.
In an embodiment of a fourth aspect, an apparatus for multilingual processing is provided. The apparatus includes: an encoder configured to obtain original text data of a source language and multiple target language tags, and to encode the original text data into a source text representation of the source language; and a decoder deployed with a translation model having parallel corpus data across multiple languages, the decoder being configured to: decode, in parallel, the source text representation into multiple target text representations of multiple target languages indicated by the multiple target language tags, based on the multiple target language tags and the preconfigured parallel corpus data across multiple languages; and decode, in parallel, the multiple target text representations of the multiple target languages into multiple target text data of the multiple target languages.
In some embodiments, the method of the first aspect is performed to train the translation model of the apparatus of the fourth aspect.
In an embodiment of a fifth aspect, an electronic device is provided. The electronic device includes a memory and a processor, where the memory is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method according to the first aspect or the second aspect.
In an embodiment of a sixth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores one or more computer instructions, and the one or more computer instructions are executed by a processor to implement the method according to the first aspect or the second aspect.
In an embodiment of a seventh aspect, a computer program product is provided. The computer program product includes one or more computer instructions, and the one or more computer instructions, when executed by a processor, implement the method according to the first aspect or the second aspect.
Although the present disclosure has been described in language specific to structural features and/or methodological acts, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are merely example forms of implementing the claims.

Claims (18)

  1. A method for multilingual processing, comprising:
    generating a text representation of a second language through a translation model based on a text representation of a first language and a second language tag;
    obtaining, through the translation model, a text representation of a mixed language and markup language tags based on a set of language tags and the text representation of the second language, wherein the set of language tags comprises at least a third language tag of a third language different from the first language and the second language, and the markup language tags are used to indicate parallel corpus data across multiple languages associated with the first language, the second language, and the third language; and
    updating parameters of the translation model by using the text representation of the first language and the text representation of the mixed language as inputs to the translation model, the parameters comprising the parallel corpus data across multiple languages.
  2. The method of claim 1, further comprising:
    before generating the text representation of the second language through the translation model, training the translation model with parallel corpus data until the translation model has balanced translation performance with respect to a plurality of language pairs, the parallel corpus data comprising corpus data of the plurality of language pairs.
  3. The method of claim 2, wherein training the translation model with the parallel corpus data comprises:
    determining, for original corpus data of the plurality of language pairs, a plurality of sampling factors, each sampling factor being associated with the original corpus data of a corresponding language pair of the plurality of language pairs; and
    sampling the original corpus data of the plurality of language pairs based on the plurality of sampling factors to obtain the parallel corpus data for training the translation model.
  4. The method of claim 3, wherein determining the plurality of sampling factors comprises:
    determining a sampling ratio parameter based on the amount of corpus data of each language pair in the original corpus data and the total amount of corpus data; and
    applying, to the sampling ratio parameter, an adjustment coefficient associated with the importance of the corresponding language pair, to obtain the plurality of sampling factors.
  5. The method of claim 1, wherein obtaining the text representation of the mixed language and the markup language tags comprises:
    sampling word representations in the text representation of the second language based on a first proportion;
    converting, based on the set of language tags, the sampled first proportion of word representations into word representations corresponding to the set of languages;
    determining the markup language tags associated with the converted first proportion of word representations; and
    generating the text representation of the mixed language based on the converted first proportion of word representations and the remaining word representations in the text representation of the second language.
  6. The method of claim 5, further comprising:
    generating, through the updated translation model, a target text representation of at least one target language based on a source text representation of a source language and the markup language tags;
    determining a distance parameter between the target text representation and the source text representation; and
    updating the first proportion based on the distance parameter.
  7. The method of claim 6, further comprising:
    if the distance parameter exceeds a distance threshold, updating the first proportion to a second proportion, the second proportion being smaller than the first proportion; and
    if the distance parameter does not exceed the distance threshold, updating the first proportion to a third proportion, the third proportion being greater than the first proportion.
  8. The method of claim 6, further comprising:
    if the distance parameter exceeds a distance threshold, decreasing the number of tags in the set of language tags; and
    if the distance parameter does not exceed the distance threshold, increasing the number of tags in the set of language tags.
  9. The method of claim 1, wherein updating the translation model comprises performing the following operations at least once:
    inputting the text representation of the mixed language into the translation model as source data for training and the text representation of the first language as target data for training; and
    obtaining, through the translation model, another text representation of a mixed language and updated markup language tags based on another set of language tags, wherein the other set of language tags comprises at least a fourth language tag different from the set of language tags, and the updated markup language tags are used to indicate parallel corpus data across multiple languages associated with the first language, the second language, the third language, and the fourth language.
  10. The method of claim 1, further comprising:
    determining a performance parameter of the updated translation model; and
    if the performance parameter exceeds a threshold parameter, stopping the updating of the translation model, wherein the performance parameter comprises a bilingual replacement evaluation score.
  11. The method of claim 1, wherein at least a portion of the translation model is based on a Glancing language model.
  12. The method of claim 1, further comprising:
    causing the updated translation model to be deployed for multilingual parallel translation tasks.
  13. An apparatus for multilingual processing, comprising:
    a generation module configured to generate a text representation of a second language based on a text representation of a first language and a second language tag;
    an acquisition module configured to obtain a text representation of a mixed language and markup language tags based on a set of language tags and the text representation of the second language, wherein the set of language tags comprises at least a third language tag of a third language different from the first language and the second language, and the markup language tags are used to indicate parallel corpus data across multiple languages associated with the first language, the second language, and the third language; and
    an update module configured to update parameters of a translation model by using the text representation of the first language and the text representation of the mixed language as inputs to the translation model, the parameters comprising the parallel corpus data across multiple languages.
  14. A method for multilingual processing, comprising:
    obtaining original text data of a source language and a plurality of target language tags;
    encoding the original text data into a source text representation of the source language;
    decoding, in parallel, the source text representation into a plurality of target text representations of a plurality of target languages indicated by the plurality of target language tags, based on the plurality of target language tags and preconfigured parallel corpus data across multiple languages; and
    decoding, in parallel, the plurality of target text representations of the plurality of target languages into a plurality of target text data of the plurality of target languages.
  15. An apparatus for multilingual processing, comprising:
    an encoder configured to:
    obtain original text data of a source language and a plurality of target language tags; and
    encode the original text data into a source text representation of the source language; and
    a decoder deployed with a translation model, the translation model having parallel corpus data across multiple languages, the decoder being configured to:
    decode, in parallel, the source text representation into a plurality of target text representations of a plurality of target languages indicated by the plurality of target language tags, based on the plurality of target language tags and the preconfigured parallel corpus data across multiple languages; and
    decode, in parallel, the plurality of target text representations of the plurality of target languages into a plurality of target text data of the plurality of target languages.
  16. An electronic device, comprising:
    a memory and a processor,
    wherein the memory is configured to store one or more computer instructions, and the one or more computer instructions are executed by the processor to implement the method according to any one of claims 1 to 12 and claim 14.
  17. A computer-readable storage medium having one or more computer instructions stored thereon, wherein the one or more computer instructions are executed by a processor to implement the method according to any one of claims 1 to 12 and claim 14.
  18. A computer program product comprising one or more computer instructions, wherein the one or more computer instructions are executed by a processor to implement the method according to any one of claims 1 to 12 and claim 14.
PCT/CN2022/116378 2021-09-28 2022-08-31 Method and apparatus for multilingual processing WO2023051148A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111144057.5 2021-09-28
CN202111144057.5A CN114186569A (en) 2021-09-28 2021-09-28 Method and apparatus for multi-language processing

Publications (1)

Publication Number Publication Date
WO2023051148A1

Family

ID=80601385

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/116378 WO2023051148A1 (en) 2021-09-28 2022-08-31 Method and apparatus for multilingual processing

Country Status (2)

Country Link
CN (1) CN114186569A (en)
WO (1) WO2023051148A1 (en)


Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114186569A (en) * 2021-09-28 2022-03-15 北京有竹居网络技术有限公司 Method and apparatus for multi-language processing
CN115409044A (en) * 2022-08-26 2022-11-29 北京有竹居网络技术有限公司 Translation method, translation device, readable medium and electronic equipment


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3392782A1 (en) * 2017-04-18 2018-10-24 Salesforce.com, Inc. Natural language translation and localization
US20190332677A1 (en) * 2018-04-30 2019-10-31 Samsung Electronics Co., Ltd. Multilingual translation device and method
CN109117483A (en) * 2018-07-27 2019-01-01 清华大学 The training method and device of neural network machine translation model
US20200342182A1 (en) * 2018-08-30 2020-10-29 Google Llc Cross-lingual classification using multilingual neural machine translation
JP2020160917A (en) * 2019-03-27 2020-10-01 国立研究開発法人情報通信研究機構 Method for training neural machine translation model and computer program
CN114186569A (en) * 2021-09-28 2022-03-15 北京有竹居网络技术有限公司 Method and apparatus for multi-language processing

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
JIA HAO, WANG XU, JI BAIJUN, DUAN XIANGYU, ZHANG MIN: "Non-autoregressive neural machine translation based on masking mechanism", JOURNAL OF XIAMEN UNIVERSITY, vol. 60, no. 4, 31 July 2021 (2021-07-31), pages 648 - 654, XP093053359 *
RAJ DABRE ET AL.: "A Survey of Multilingual Neural Machine Translation", ACM COMPUTING SURVEYS, vol. 53, no. 5, 30 September 2020 (2020-09-30), pages 1 - 38, XP058666950, DOI: 10.1145/3406095 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116595999A (en) * 2023-07-17 2023-08-15 深圳须弥云图空间科技有限公司 Machine translation model training method and device
CN116595999B (en) * 2023-07-17 2024-04-16 深圳须弥云图空间科技有限公司 Machine translation model training method and device
CN116738345A (en) * 2023-08-15 2023-09-12 腾讯科技(深圳)有限公司 Classification processing method, related device and medium
CN116738345B (en) * 2023-08-15 2024-03-01 腾讯科技(深圳)有限公司 Classification processing method, related device and medium

Also Published As

Publication number Publication date
CN114186569A (en) 2022-03-15

Similar Documents

Publication Publication Date Title
WO2023051148A1 (en) Method and apparatus for multilingual processing
CN107967262B (en) 2021-06-29 A neural network Mongolian-Chinese machine translation method
Zhang et al. Deep Neural Networks in Machine Translation: An Overview.
CN109117483B (en) Training method and device of neural network machine translation model
Lu et al. Bi-encoder transformer network for mandarin-english code-switching speech recognition using mixture of experts.
WO2019154210A1 (en) Machine translation method and device, and computer-readable storage medium
CN110472255B (en) Neural network machine translation method, model, electronic terminal, and storage medium
CN113836271B (en) Method and product for natural language processing
JP2021033995A (en) Text processing apparatus, method, device, and computer-readable storage medium
CN114676255A (en) Text processing method, device, equipment, storage medium and computer program product
WO2023061106A1 (en) Method and apparatus for language translation, device, and medium
Al-Ibrahim et al. Neural machine translation from Jordanian Dialect to modern standard Arabic
CN107798386B (en) Multi-process collaborative training based on unlabeled data
Vashistha et al. Active learning for neural machine translation
Zhao et al. An efficient character-level neural machine translation
CN112380882B (en) Mongolian Chinese neural machine translation method with error correction function
CN115906854A (en) Multi-level confrontation-based cross-language named entity recognition model training method
Xu Research on neural network machine translation model based on entity tagging improvement
Al Nahas et al. Supervised text style transfer using neural machine translation: converting between old and modern Turkish as an example
CN114896396A (en) Text classification and model training method, system, equipment and storage medium
Chen et al. Fast OOV words incorporation using structured word embeddings for neural network language model
Jiang et al. English-Vietnamese machine translation model based on sequence to sequence algorithm
CN111597827A (en) Method and device for improving machine translation accuracy
CN117034968B (en) Neural machine translation method, device, electronic equipment and medium
WO2024055707A1 (en) Translation method and related device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22874539

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE