CN112633017A - Translation model training method, translation processing method, translation model training device, translation processing equipment and storage medium - Google Patents

Translation model training method, translation processing method, translation model training device, translation processing equipment and storage medium

Info

Publication number
CN112633017A
CN112633017A (application CN202011555680.5A)
Authority
CN
China
Prior art keywords
training
language
translation
target
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011555680.5A
Other languages
Chinese (zh)
Other versions
CN112633017B (en)
Inventor
姜博健
张睿卿
李芝
何中军
吴华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011555680.5A priority Critical patent/CN112633017B/en
Publication of CN112633017A publication Critical patent/CN112633017A/en
Application granted granted Critical
Publication of CN112633017B publication Critical patent/CN112633017B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/355Class or cluster creation or modification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses a translation model training method, a translation processing method, and corresponding devices, equipment and storage media, and relates to the field of artificial intelligence technologies such as deep learning. The specific implementation scheme is as follows: a plurality of language training corpora are acquired and clustered according to language to obtain a plurality of cluster training corpora; the training corpora of the target language resources in each cluster are processed to obtain each cluster's target training corpus; and a translation model is trained on each cluster's target training corpus to generate a plurality of sub-translation models. In this way, languages with similar linguistic features are trained together by clustering, which helps improve the generalization capability of the translation model, increases the amount of training corpus data available for low-resource languages, and thereby improves translation quality.

Description

Translation model training method, translation processing method, translation model training device, translation processing equipment and storage medium
Technical Field
The application relates to the field of artificial intelligence technologies such as deep learning within the field of data processing, and in particular to a translation model training method, a translation processing method, and corresponding devices, equipment and storage media.
Background
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking and planning), and it spans both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing and knowledge graph technologies.
With the continuous development of deep learning technology and increasing globalization, the demand for machine translation keeps growing; international communication is more frequent, and the demand for multilingual machine translation grows with it.
In the related art, a one-to-one translation model is used to model each bilingual sentence pair. However, the number of translation directions among multiple languages is large, the deployment cost is high, and parallel corpora may not exist between every pair of languages, so the translation models for some translation directions cannot be trained, and translation quality and efficiency suffer.
Disclosure of Invention
The present disclosure provides a translation model training method, a translation processing method, and corresponding apparatuses, devices and storage media.
According to an aspect of the present disclosure, there is provided a translation model training method, including:
acquiring a plurality of language training corpora, clustering the plurality of language training corpora according to languages, and acquiring a plurality of cluster training corpora;
performing corpus processing on target language resources in each cluster-like corpus to obtain each cluster-like target corpus;
and training the translation model according to the target training corpus of each cluster to generate a plurality of sub-translation models.
According to another aspect of the present disclosure, there is provided a translation processing method based on the translation model according to any one of claims 1 to 5, including:
acquiring a text to be translated and a target language;
under the condition that the source language of the text to be translated and the target language belong to the same cluster, acquiring a translation sub-model to translate the text to be translated and obtain a translation result;
under the condition that the source language of the text to be translated and the target language do not belong to the same cluster, acquiring a first translation sub-model to translate the text to be translated and obtain a candidate translation result;
and acquiring a second translation sub-model to translate the candidate translation result and obtain a target translation result.
According to still another aspect of the present disclosure, there is provided a translation model training apparatus including:
the first acquisition module is used for acquiring a plurality of language training corpora;
the second obtaining module is used for clustering the multiple language training corpuses according to languages to obtain multiple cluster training corpuses;
the first processing module is used for carrying out corpus processing on target language resources in each cluster-like corpus to obtain each cluster-like target corpus;
and the training module is used for training the translation model according to the target training corpus of each cluster class to generate a plurality of sub-translation models.
According to still another aspect of the present disclosure, there is provided a translation processing apparatus of the translation model, including:
the fourth acquisition module is used for acquiring the text to be translated and the target language;
the fifth obtaining module is used for obtaining a translation sub-model under the condition that the source language and the target language of the text to be translated belong to the same cluster, translating the text to be translated and obtaining a translation result;
the sixth obtaining module is used for obtaining a first translation sub-model to translate the text to be translated under the condition that the source language and the target language of the text to be translated do not belong to the same cluster, and obtaining a candidate translation result;
and the seventh obtaining module is used for obtaining the second translation sub-model, translating the candidate translation result and obtaining a target translation result.
According to a fifth aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the translation model training, translation processing methods described in the above embodiments.
According to a sixth aspect, a non-transitory computer-readable storage medium is proposed, in which computer instructions are stored, the computer instructions being configured to cause the computer to execute the translation model training and translation processing method described in the above embodiments.
According to a seventh aspect, a computer program product is proposed, in which instructions, when executed by a processor, enable a server to perform the translation model training and translation processing method of the first aspect.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a flow chart of a translation model training method according to a first embodiment of the present application;
FIG. 2 is a flow chart of a translation model training method according to a second embodiment of the present application;
FIG. 3 is a flow chart of a translation model training method according to a third embodiment of the present application;
FIG. 4 is a flow chart of a translation processing method according to a fourth embodiment of the present application;
FIG. 5 is an exemplary diagram of a translation process according to an embodiment of the present application;
fig. 6 is a flowchart of a translation processing method according to a fifth embodiment of the present application;
FIG. 7 is a block diagram of a translation model training apparatus according to a sixth embodiment of the present application;
fig. 8 is a configuration diagram of a translation processing apparatus according to a seventh embodiment of the present application;
fig. 9 is a block diagram of an electronic device for implementing a translation model training and translation process according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding, and these details are to be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
Based on the above description, consider a practical application in which, for example, 200 languages support mutual translation: the number of translation directions approaches 40,000 (200 × 199 = 39,800), so modeling each bilingual sentence pair with a one-to-one translation model would require roughly 40,000 translation models. The maintenance cost is extremely high, and parallel corpora may not exist between every pair of languages, so the translation models for some translation directions cannot be trained, and translation quality and efficiency suffer.
To address these problems, the application provides a translation model training method that uses clustering to train languages with similar linguistic features together, which helps improve the generalization capability of the translation model and increases the amount of training corpus data for low-resource minor languages, thereby improving translation quality.
First, fig. 1 is a flowchart of a translation model training method according to a first embodiment of the present application. The method is executed by an electronic device, which may be any device with computing capability, such as a personal computer (PC) or a mobile terminal; the mobile terminal may be a hardware device with an operating system, a touch screen and/or a display screen, such as a mobile phone, tablet computer, personal digital assistant, wearable device or in-vehicle device.
As shown in fig. 1, the method includes:
step 101, obtaining a plurality of language training corpora, clustering the plurality of language training corpora according to languages, and obtaining a plurality of cluster training corpora.
In the embodiment of the present application, the multiple language training corpora are training corpora corresponding to different languages, for example a Turkish training corpus 1, a Russian training corpus 2 and a Chinese training corpus 3; together, training corpus 1, training corpus 2 and training corpus 3 constitute the multiple language training corpora.
In the embodiment of the application, a training corpus may be built from text information of the corresponding language acquired in real time, or acquired from a database of historical records. Taking text as an example, text information input by a user, such as "how is the weather today" or "the weather is good", may be acquired in real time as training corpus, or historical text information may be acquired from the user's historical search records and used as training corpus; the choice depends on the application scenario.
Further, the multiple language training corpora are clustered according to language to obtain multiple cluster training corpora; that is, languages with similar linguistic features are grouped together, and the clustering approach can be chosen according to the needs of the application scenario, as described below.
In a first example, for the multiple language training corpora, a label corresponding to each target language is added at a preset position of the source language, a language translation model from the source language to each target language is trained, and after training is completed the label code of each target language is obtained; clustering is then performed on the label codes of the target languages with a preset clustering algorithm to obtain a plurality of clusters, and the multiple language training corpora are divided according to these clusters to obtain the multiple cluster training corpora.
In a second example, a pre-trained language model is trained with monolingual training corpora, with a corresponding label added in front of each sentence; after training is completed the label codes are obtained, clustering is performed on the label codes to obtain a plurality of clusters, and the multiple language training corpora are divided according to these clusters to obtain the multiple cluster training corpora.
Step 102, performing corpus processing on the target language resources in each cluster's training corpus to obtain each cluster's target training corpus.
In the embodiment of the present application, a target language resource is a resource whose training corpus data volume needs to be increased, i.e. a low-resource language. Corpus processing is therefore performed on the target language resources in each cluster's training corpus to obtain each cluster's target training corpus, which ensures that every language in each cluster has a sufficiently large amount of training corpus data and thereby improves the translation quality of the subsequently trained translation model.
In the embodiment of the present application, there are various ways of performing corpus processing on the target language resources in each cluster's training corpus to obtain each cluster's target training corpus, and the choice can be made according to practical application requirements, as illustrated below.
In a first example, a target phrase fragment of the target language resource is obtained, a related phrase fragment matching the target phrase fragment is obtained, the related language resource corresponding to the related phrase fragment is determined, the training corpus of the related language resource is sampled to obtain candidate training corpus, and the candidate training corpus is added to the training corpus corresponding to the target language resource to obtain each cluster's target training corpus.
In a second example, a candidate language resource corresponding to the target language resource is obtained from each cluster's training corpus, the candidate training corpus of the candidate language resource is obtained and split into words to obtain a plurality of word corpora, and the plurality of word corpora are added to the training corpus corresponding to the target language resource to obtain each cluster's target training corpus.
In a third example, during training of the translation model on each cluster's target training corpus, the monolingual data in each cluster's target training corpus is obtained, the monolingual data is encoded by a pre-trained language model, and the translation model is trained on the encoded training vectors.
Step 103, training the translation model according to the target training corpus of each cluster class to generate a plurality of sub-translation models.
In the embodiment of the present application, after each cluster's target training corpus is obtained, translation model training is performed per cluster; for example, an encoder network for processing input and a decoder network for generating output, composed of two recurrent neural networks, are trained on the target training corpus, thereby generating a plurality of sub-translation models.
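As a concrete illustration of such an encoder-decoder setup, the following is a minimal sketch in PyTorch of two recurrent networks (a GRU encoder and a GRU decoder) trained with teacher forcing on one cluster's parallel sentence pairs. The dimensions, module names and the use of PyTorch are illustrative assumptions and are not specified in this application.

```python
# Minimal sketch of the encoder-decoder setup mentioned above: two recurrent networks,
# one encoding the source sentence and one generating the target sentence, trained with
# teacher forcing. All sizes and names here are illustrative assumptions.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src_ids):                       # src_ids: (batch, src_len)
        _, hidden = self.rnn(self.embed(src_ids))
        return hidden                                  # (1, batch, hid_dim)

class Decoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=256, hid_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, tgt_ids, hidden):                # teacher forcing
        output, _ = self.rnn(self.embed(tgt_ids), hidden)
        return self.out(output)                        # (batch, tgt_len, vocab_size)

def train_step(encoder, decoder, optimizer, src_ids, tgt_ids, pad_id=0):
    """One training step on a batch of parallel sentences from one cluster's corpus."""
    optimizer.zero_grad()
    hidden = encoder(src_ids)
    logits = decoder(tgt_ids[:, :-1], hidden)          # predict the next target token
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        tgt_ids[:, 1:].reshape(-1),
        ignore_index=pad_id,
    )
    loss.backward()
    optimizer.step()
    return loss.item()
```

Training one such pair of networks per cluster's target training corpus would yield the plurality of sub-translation models; an optimizer over both networks' parameters (for example torch.optim.Adam over the combined parameter list) would be passed to train_step for each batch.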
In the translation model training method of the embodiment of the application, multiple language training corpora are acquired and clustered according to language to obtain multiple cluster training corpora; corpus processing is performed on the target language resources in each cluster's training corpus to obtain each cluster's target training corpus; and the translation model is trained on each cluster's target training corpus to generate a plurality of sub-translation models. In this way, languages with similar linguistic features are trained together by clustering, which helps improve the generalization capability of the translation model, increases the amount of training corpus data for low-resource languages used to train the translation model, and improves translation quality.
Based on the above embodiments, there are various ways to increase the amount of training corpus data for the target language; processing in a single manner and in a combination of manners is described below with reference to fig. 2 and fig. 3, respectively.
Specifically, the present application proposes another translation model training method that performs the data processing in a single manner. Fig. 2 is a flowchart of a translation model training method according to a second embodiment of the present application, as shown in fig. 2:
step 201, adding a label corresponding to each target language at a preset position of a source language aiming at a plurality of language training corpora, training a language translation model from the source language to each target language, and acquiring a label code of each target language after training.
In this embodiment of the application, the source language is usually English; that is, the language translation model of the application translates from English to the minor languages. A label for the target language is added before the English data, and after training is completed the label code can be extracted for clustering. For example, if the target languages are Turkish, Russian and Ukrainian respectively, label A is added to the English data and the code of label A is obtained after training, label B is added to the English data and the code of label B is obtained after training, and so on, where each label uniquely identifies one language.
Step 202, clustering is carried out according to the label codes of each target language through a preset clustering algorithm to obtain a plurality of clusters, and the multi-language training corpus is divided according to the plurality of clusters to obtain a plurality of cluster training corpora.
Further, a preset clustering algorithm, such as the K-means unsupervised clustering algorithm, performs unsupervised clustering on the label codes of all languages, so that label codes of languages with similar attributes are grouped together into a plurality of clusters, and the multiple language training corpora are divided according to these clusters to obtain multiple cluster training corpora. For example, if Russian and Ukrainian form one cluster, the Russian training corpus and the Ukrainian training corpus are taken from the multiple language training corpora as that cluster's training corpus. In this way, languages with similar linguistic features are trained together by clustering, which helps improve the generalization capability of the translation model.
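A minimal sketch of this clustering step follows, assuming the label codes are fixed-length vectors extracted from the trained model (random stand-ins below) and that scikit-learn's K-means implementation is used; neither the vector size nor the library is prescribed by the application.

```python
# Cluster languages by the learned codes of their target-language labels; languages
# whose label codes are close end up in the same cluster. The random vectors below are
# placeholders for codes extracted from the trained English-to-X translation model.
import numpy as np
from sklearn.cluster import KMeans

languages = ["tr", "az", "ru", "uk", "zh", "ja"]
label_codes = {lang: np.random.rand(64) for lang in languages}   # placeholder codes

X = np.stack([label_codes[lang] for lang in languages])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

clusters = {}
for lang, cluster_id in zip(languages, kmeans.labels_):
    clusters.setdefault(int(cluster_id), []).append(lang)
print(clusters)   # e.g. {0: ['tr', 'az'], 1: ['ru', 'uk'], 2: ['zh', 'ja']}
```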
Step 203, obtaining a target phrase fragment of the target language resource, obtaining a related phrase fragment matched with the target phrase fragment, and determining a related language resource corresponding to the related phrase fragment.
Step 204, sampling the training corpora of the related language resources to obtain candidate training corpora, adding the candidate training corpora to the training corpora corresponding to the target language resources, and obtaining the target training corpora of each cluster.
In the embodiment of the application, the training corpus of the target language resource, i.e. the low-resource language, is scarce and bilingual data is difficult to collect. Usually, a low-resource language has a high-resource language that is closely related to it; the two languages are strongly correlated in grammar and script and share identical continuous phrase fragments. The training corpus pairs of the high-resource language that contain the same phrase fragments are therefore oversampled, which increases the amount of training corpus data for the low-resource language and alleviates its data scarcity problem.
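One possible form of this oversampling is sketched below, assuming phrase fragments are word n-grams and that high-resource sentence pairs sharing a fragment with the low-resource corpus are duplicated a fixed number of times; the matching rule and the sampling rate are illustrative assumptions.

```python
# Oversample high-resource sentence pairs that contain phrase fragments (word n-grams)
# also found in the low-resource corpus, and append them to the low-resource data.
import random

def word_ngrams(sentence, n=3):
    tokens = sentence.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def oversample_related(low_resource_pairs, high_resource_pairs, rate=3, n=3):
    # Continuous phrase fragments seen on the low-resource source side.
    fragments = set()
    for src, _ in low_resource_pairs:
        fragments |= word_ngrams(src, n)

    augmented = list(low_resource_pairs)
    for src, tgt in high_resource_pairs:
        if word_ngrams(src, n) & fragments:            # shares a fragment
            augmented.extend([(src, tgt)] * rate)      # oversample this pair
    random.shuffle(augmented)
    return augmented
```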
Step 205, training the translation model according to the target training corpus of each cluster class to generate a plurality of sub-translation models.
In the embodiment of the present application, after each cluster's target training corpus is obtained, translation model training is performed per cluster; for example, an encoder network for processing input and a decoder network for generating output, composed of two recurrent neural networks, are trained on the target training corpus, thereby generating a plurality of sub-translation models.
In the translation model training method of the embodiment of the application, for multiple language training corpora, a label corresponding to each target language is added at a preset position of the source language, the language translation model from the source language to each target language is trained, and the label code of each target language is obtained after training; clustering is performed on the label codes of the target languages with a preset clustering algorithm to obtain a plurality of clusters, and the multiple language training corpora are divided according to the clusters to obtain multiple cluster training corpora; a target phrase fragment of the target language resource is obtained, a related phrase fragment matching the target phrase fragment is obtained, and the related language resource corresponding to the related phrase fragment is determined; the training corpus of the related language resource is sampled to obtain candidate training corpus, which is added to the training corpus corresponding to the target language resource to obtain each cluster's target training corpus; and the translation model is trained on each cluster's target training corpus to generate a plurality of sub-translation models. In this way, languages with similar linguistic features are trained together by clustering, which helps improve the generalization capability of the translation model, increases the amount of training corpus data for low-resource languages used to train the translation model, and improves translation quality.
Fig. 3 is a flowchart of a translation model training method according to a third embodiment of the present application, as shown in fig. 3:
step 301, obtaining a plurality of language training corpora, clustering the plurality of language training corpora according to languages, and obtaining a plurality of cluster training corpora.
In the embodiment of the present application, the multiple language training corpora are training corpora corresponding to different languages, for example a Turkish training corpus 1, a Russian training corpus 2 and a Chinese training corpus 3; together, training corpus 1, training corpus 2 and training corpus 3 constitute the multiple language training corpora.
In the embodiment of the application, a training corpus may be built from text information of the corresponding language acquired in real time, or acquired from a database of historical records. Taking text as an example, text information input by a user, such as "how is the weather today" or "the weather is good", may be acquired in real time as training corpus, or historical text information may be acquired from the user's historical search records and used as training corpus; the choice depends on the application scenario.
Further, the multiple language training corpora are clustered according to language to obtain multiple cluster training corpora; that is, languages with similar linguistic features are grouped together, and the clustering approach can be chosen according to the needs of the application scenario, as described below.
In a first example, for the multiple language training corpora, a label corresponding to each target language is added at a preset position of the source language, a language translation model from the source language to each target language is trained, and after training is completed the label code of each target language is obtained; clustering is then performed on the label codes of the target languages with a preset clustering algorithm to obtain a plurality of clusters, and the multiple language training corpora are divided according to these clusters to obtain the multiple cluster training corpora.
In a second example, a pre-trained language model is trained with monolingual training corpora, with a corresponding label added in front of each sentence; after training is completed the label codes are obtained, clustering is performed on the label codes to obtain a plurality of clusters, and the multiple language training corpora are divided according to these clusters to obtain the multiple cluster training corpora.
Step 302, obtaining a target phrase fragment of the target language resource, obtaining a related phrase fragment matched with the target phrase fragment, and determining a related language resource corresponding to the related phrase fragment.
Step 303, sampling the corpus of the related language resources to obtain candidate corpus, and adding the candidate corpus to the corpus corresponding to the target language resource.
In the embodiment of the application, the training corpus of the target language resource, i.e. the low-resource language, is scarce and bilingual data is difficult to collect. Usually, a low-resource language has a high-resource language that is closely related to it; the two languages are strongly correlated in grammar and script and share identical continuous phrase fragments. The training corpus pairs of the high-resource language that contain the same phrase fragments are therefore oversampled, which increases the amount of training corpus data for the low-resource language and alleviates its data scarcity problem.
Step 304, obtaining candidate language resources corresponding to the target language resources in each class cluster training language material, and obtaining candidate training language materials of the candidate language resources.
Step 305, performing word splitting on the candidate corpus to obtain a plurality of word corpora, adding the plurality of word corpora to the corpus corresponding to the target language resource, and generating each cluster-like target corpus.
In the embodiment of the present application, the target language resource is a low-resource language. A high-resource language (for example Turkish) and a low-resource language (for example Azerbaijani) are mixed for training; because of the great similarity between the two languages, the low-resource language can learn useful knowledge representations from the high-resource language. In general, the root expression of the high-resource language is unique, so the high-resource corpus is expressed at several root granularities, which lets the low-resource language learn more knowledge from the high-resource language at mixed granularities. That is, the candidate training corpus is split into words and subword pieces to obtain multiple word corpora, and the multiple word corpora are added to the training corpus corresponding to the target language resource to generate each cluster's target training corpus, which increases the amount of training data for the low-resource language and alleviates its data scarcity problem.
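The following sketch illustrates the mixed-granularity idea, using a naive character-piece split as a stand-in for a real subword model such as BPE; the piece lengths and function names are illustrative assumptions.

```python
# Re-express candidate (high-resource) sentences at several subword granularities and
# add them to the low-resource training corpus, so shared roots become visible to the
# model. A real system would use a trained subword model such as BPE; the character
# split below is only a stand-in.
def split_granularity(sentence, piece_len):
    """Split every word into consecutive pieces of at most piece_len characters."""
    pieces = []
    for word in sentence.split():
        pieces.extend(word[i:i + piece_len] for i in range(0, len(word), piece_len))
    return " ".join(pieces)

def add_mixed_granularity(target_corpus, candidate_corpus, piece_lens=(2, 3, 4)):
    """Append several granularity versions of the candidate corpus to the target corpus."""
    augmented = list(target_corpus)
    for sentence in candidate_corpus:
        for piece_len in piece_lens:
            augmented.append(split_granularity(sentence, piece_len))
    return augmented
```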
And step 306, acquiring the monolingual data in the target corpus of each cluster in the process of training the translation model according to the target corpus of each cluster.
And 307, coding the single-language data through the pre-training language model, and training the translation model of the coded training vector to generate a plurality of sub-translation models.
In the embodiment of the application, a pre-trained language model is used to encode the monolingual data, and the encoded training vectors are used to train the translation model. Although parallel corpora for low-resource languages are scarce, monolingual data can be collected in large quantities from the internet. When the translation model is trained only on parallel corpora, the translation directions of low-resource languages are insufficiently trained, so the monolingual data is used to pre-train a language model and transfer learning is then performed, which improves the training of the translation model.
In this way, the monolingual data is fully exploited to pre-train the language model before transfer learning, which enriches the semantic representation of low-resource minor languages and alleviates the problem of insufficient training data for them.
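As an illustration of this step, the sketch below encodes monolingual sentences with a pre-trained multilingual language model and returns the resulting training vectors; the choice of XLM-RoBERTa via the Hugging Face transformers library is an assumption made here for illustration and is not named in the application.

```python
# Encode monolingual sentences with a language model pre-trained on large monolingual
# data; the resulting vectors are the encoded training vectors used for transfer
# learning into the translation model.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
pretrained_lm = AutoModel.from_pretrained("xlm-roberta-base")

def encode_monolingual(sentences):
    """Return training vectors for a batch of monolingual sentences."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = pretrained_lm(**batch)
    return outputs.last_hidden_state                   # (batch, seq_len, hidden_dim)

vectors = encode_monolingual(["Bu bir örnek cümledir.", "Это пример предложения."])
```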
In the embodiment of the present application, after each cluster's target training corpus is obtained, translation model training is performed per cluster; for example, an encoder network for processing input and a decoder network for generating output, composed of two recurrent neural networks, are trained on the target training corpus, thereby generating a plurality of sub-translation models.
In the translation model training method of the embodiment of the application, multiple language training corpora are acquired and clustered according to language to obtain multiple cluster training corpora; a target phrase fragment of the target language resource is obtained, a related phrase fragment matching the target phrase fragment is obtained, and the related language resource corresponding to the related phrase fragment is determined; the training corpus of the related language resource is sampled to obtain candidate training corpus, which is added to the training corpus corresponding to the target language resource; a candidate language resource corresponding to the target language resource is obtained from each cluster's training corpus, the candidate training corpus of the candidate language resource is obtained and split into words to obtain a plurality of word corpora, and the plurality of word corpora are added to the training corpus corresponding to the target language resource to generate each cluster's target training corpus; and during training of the translation model on each cluster's target training corpus, the monolingual data in each cluster's target training corpus is obtained, the monolingual data is encoded by the pre-trained language model, and the translation model is trained on the encoded training vectors to generate a plurality of sub-translation models. In this way, languages with similar linguistic features are trained together by clustering, which helps improve the generalization capability of the translation model, increases the amount of training corpus data for low-resource languages used to train the translation model, and improves translation quality.
Fig. 4 is a flowchart of a translation processing method according to a fourth embodiment of the present application, as shown in fig. 4:
step 401, obtaining a text to be translated and a target language.
Step 402, under the condition that it is detected that the source language and the target language of the text to be translated belong to the same cluster, acquiring a translation sub-model, translating the text to be translated, and acquiring a translation result.
Step 403, under the condition that it is detected that the source language and the target language of the text to be translated do not belong to the same cluster, obtaining a first translation sub-model to translate the text to be translated, and obtaining a candidate translation result.
And step 404, acquiring a second translation sub-model, translating the candidate translation result, and acquiring a target translation result.
In the embodiment of the application, the text to be translated and the target language input by a client are received, and the translation sub-model is determined according to the source language of the text to be translated and the target language: when the source language and the target language belong to the same cluster, a translation sub-model is acquired to translate the text to be translated and obtain the translation result; when the source language and the target language do not belong to the same cluster, a first translation sub-model is acquired to translate the text to be translated and obtain a candidate translation result, and a second translation sub-model is then acquired to translate the candidate translation result and obtain the target translation result.
For example, continuing the above embodiments, each translation sub-model is trained with parallel corpora between English and the non-English languages of its cluster, and translation between any two languages belonging to different sub-translation models pivots through English, as shown in fig. 5. For example, if the source language of the text to be translated is Russian and the target language is Ukrainian, they belong to the same cluster, and the translation result is obtained by translating directly with that cluster's translation sub-model. If the source language of the text to be translated is Turkish and the target language is Ukrainian, they do not belong to the same cluster; the first translation sub-model, i.e. that of the cluster to which Turkish belongs, translates the text to be translated to obtain a candidate translation result, and the second translation sub-model, i.e. that of the cluster to which Ukrainian belongs, then translates the candidate translation result to obtain the target translation result.
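The routing logic described above can be sketched as follows, with a stand-in SubModel class in place of the trained sub-translation models; the cluster names and the interface are illustrative assumptions.

```python
# Route a translation request: same cluster -> translate directly with that cluster's
# sub-model; different clusters -> translate to English with the source cluster's
# sub-model, then from English with the target cluster's sub-model.
class SubModel:
    def __init__(self, name):
        self.name = name
    def translate(self, text, src, tgt):
        return f"[{self.name}: {src}->{tgt}] {text}"   # placeholder for real decoding

def translate(text, src_lang, tgt_lang, clusters, sub_models, pivot="en"):
    src_cluster, tgt_cluster = clusters[src_lang], clusters[tgt_lang]
    if src_cluster == tgt_cluster:
        return sub_models[src_cluster].translate(text, src_lang, tgt_lang)
    candidate = sub_models[src_cluster].translate(text, src_lang, pivot)    # first sub-model
    return sub_models[tgt_cluster].translate(candidate, pivot, tgt_lang)    # second sub-model

clusters = {"ru": "slavic", "uk": "slavic", "tr": "turkic", "az": "turkic"}
sub_models = {"slavic": SubModel("slavic"), "turkic": SubModel("turkic")}
print(translate("пример", "ru", "uk", clusters, sub_models))   # same cluster: direct
print(translate("örnek", "tr", "uk", clusters, sub_models))    # different clusters: via English
```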
According to the translation processing method of the embodiment of the application, the text to be translated and the target language are obtained; when the source language of the text to be translated and the target language belong to the same cluster, a translation sub-model is obtained to translate the text to be translated and obtain a translation result; when they do not belong to the same cluster, a first translation sub-model is obtained to translate the text to be translated and obtain a candidate translation result, and a second translation sub-model is then obtained to translate the candidate translation result and obtain a target translation result. In this way, a high-quality translated text can be obtained quickly.
Based on the description of the foregoing embodiments, fig. 6 is a flowchart of a translation processing method according to a fifth embodiment of the present application. As shown in fig. 6, further processing may be performed while translating the text to be translated: by restricting the word candidate set, translation quality is improved and decoding is accelerated. Specifically, the method includes the following steps:
step 501, in the process of translating the text to be translated, acquiring each word to be translated in the text to be translated.
Step 502, a word candidate set corresponding to each word to be translated is obtained, and the error probability of each candidate word in the word candidate set corresponding to each word to be translated is obtained.
Step 503, in case the error probability is larger than the preset threshold, deleting the candidate word from the word candidate set.
In the embodiment of the present application, because the target side is a mixture of multiple languages, the autoregressive method of generating a translation may produce characters of other languages. For example, in a task of translating English to Turkish, Turkish is the target language and contains no non-Latin characters (such as the characters used in Arabic), but Arabic candidate characters may nevertheless appear during translation. Moreover, in natural language generation tasks, all characters of the target-language vocabulary are used as the character candidate set, which makes it difficult to control the faithfulness of the translation.
Therefore, for example for translation directions in which the target language is not a Latin-script language, the word candidate set of the target-language vocabulary is restricted. In this way, the translation faithfulness for scarce-resource minor languages is significantly improved, and human evaluation scores improve greatly.
The preset threshold value can be selectively set according to an application scene.
For example, in translating "how are you" into Chinese, when translating "are", the error probability of the candidate word "是" ("yes") in the acquired word candidate set is ninety percent, which is greater than the preset threshold of sixty percent, so that candidate word is deleted from the word candidate set.
In this way, not only is the faithfulness of the generated translation improved, but the translation process is also accelerated, since using a smaller vocabulary greatly reduces the amount of computation.
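A minimal sketch of this candidate-set restriction follows, assuming the error probability of each candidate word is already available (how it is estimated is not specified here) and reusing the ninety-percent / sixty-percent figures from the example above.

```python
# Remove candidate words whose error probability exceeds a threshold before decoding,
# which improves faithfulness and shrinks the vocabulary the decoder has to score.
def prune_candidates(word_candidates, error_prob, threshold=0.6):
    """Drop candidate words whose error probability exceeds the threshold."""
    pruned = {}
    for word, candidates in word_candidates.items():
        pruned[word] = [c for c in candidates
                        if error_prob.get((word, c), 0.0) <= threshold]
    return pruned

word_candidates = {"are": ["是", "在", "吗"]}
error_prob = {("are", "是"): 0.9, ("are", "在"): 0.3, ("are", "吗"): 0.2}
print(prune_candidates(word_candidates, error_prob))   # {'are': ['在', '吗']}
```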
In order to implement the above embodiments, the present application further provides a translation model training device. Fig. 7 is a schematic structural diagram of a translation model training apparatus according to a sixth embodiment of the present application, and as shown in fig. 7, the translation model training apparatus includes: a first obtaining module 701, a second obtaining module 702, a first processing module 703 and a training module 704.
The first obtaining module 701 is configured to obtain a plurality of language training corpora.
A second obtaining module 702, configured to cluster the multiple language training corpora according to languages, and obtain multiple cluster-like training corpora.
The first processing module 703 is configured to perform corpus processing on the target language resource in the corpus of each cluster class, and obtain a target corpus of each cluster class.
And the training module 704 is configured to train the translation model according to the target training corpus of each class cluster to generate a plurality of sub-translation models.
In this embodiment of the application, the second obtaining module 702 is specifically configured to: adding a label corresponding to each target language at a preset position of a source language aiming at a plurality of language training corpora;
training a language translation model from a source language to each target language, and acquiring a label code of each target language after training is completed; clustering is carried out according to the label code of each target language through a preset clustering algorithm to obtain a plurality of clusters, and the multi-language training corpus is divided according to the plurality of clusters to obtain a plurality of cluster training corpora.
In this embodiment of the application, the first processing module 703 is specifically configured to: acquiring a target phrase fragment of a target language resource; acquiring related phrase fragments matched with the target phrase fragments, and determining related language resources corresponding to the related phrase fragments; sampling the training corpora of the related language resources to obtain candidate training corpora; and adding the candidate training corpora into the training corpora corresponding to the target language resources to obtain the target training corpora of each cluster.
In this embodiment of the application, the first processing module 703 is specifically configured to: acquire a candidate language resource corresponding to the target language resource from each cluster's training corpus; acquire the candidate training corpus of the candidate language resource, and perform word splitting on the candidate training corpus to obtain a plurality of word corpora; and add the plurality of word corpora to the training corpus corresponding to the target language resource to obtain each cluster's target training corpus.
In this embodiment of the present application, the translation model training apparatus further includes: a third obtaining module, configured to obtain the monolingual data in each cluster's target training corpus during training of the translation model on each cluster's target training corpus; and a second processing module, configured to encode the monolingual data by a pre-trained language model and train the translation model on the encoded training vectors.
It should be noted that the foregoing explanation of the translation model training method is also applicable to the translation model training apparatus according to the embodiment of the present invention, and the implementation principle is similar, and is not repeated herein.
The translation model training device of the embodiment of the application acquires multiple language training corpora and clusters them according to language to obtain multiple cluster training corpora, performs corpus processing on the target language resources in each cluster's training corpus to obtain each cluster's target training corpus, and trains the translation model on each cluster's target training corpus to generate a plurality of sub-translation models. In this way, languages with similar linguistic features are trained together by clustering, which helps improve the generalization capability of the translation model, increases the amount of training corpus data for low-resource languages used to train the translation model, and improves translation quality.
In order to implement the above embodiments, the present application further provides a translation processing apparatus. Fig. 8 is a schematic structural diagram of a translation processing apparatus according to a seventh embodiment of the present application, and as shown in fig. 8, the translation processing apparatus includes: a fourth obtaining module 801, a fifth obtaining module 802, a sixth obtaining module 803, and a seventh obtaining module 804.
The fourth obtaining module 801 is configured to obtain a text to be translated and a target language.
A fifth obtaining module 802, configured to obtain a translation sub-model, translate the text to be translated, and obtain a translation result, when it is detected that the source language and the target language of the text to be translated belong to the same cluster.
A sixth obtaining module 803, configured to obtain the first translation sub-model to translate the text to be translated, and obtain a candidate translation result, when it is detected that the source language and the target language of the text to be translated do not belong to the same cluster.
A seventh obtaining module 804, configured to obtain the second translation sub-model, translate the candidate translation result, and obtain a target translation result.
In this embodiment, the translation processing apparatus further includes: the eighth obtaining module is used for obtaining each word to be translated in the text to be translated in the process of translating the text to be translated; the ninth obtaining module is used for obtaining a word candidate set corresponding to each word to be translated and obtaining the error probability of each candidate word in the word candidate set corresponding to each word to be translated; and the deleting module is used for deleting the candidate words from the word candidate set under the condition that the error probability is greater than a preset threshold value.
It should be noted that the foregoing explanation of the translation processing method is also applicable to the translation processing apparatus according to the embodiment of the present invention, and the implementation principle thereof is similar and will not be described herein again.
The translation processing device of the embodiment of the application obtains the text to be translated and the target language; when the source language of the text to be translated and the target language belong to the same cluster, it obtains a translation sub-model to translate the text to be translated and obtain a translation result; when they do not belong to the same cluster, it obtains a first translation sub-model to translate the text to be translated and obtain a candidate translation result, and then obtains a second translation sub-model to translate the candidate translation result and obtain a target translation result. In this way, a high-quality translated text can be obtained quickly.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 9 is a block diagram of an electronic device according to a method for translation model training and translation processing according to an embodiment of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 9, the electronic apparatus includes: one or more processors 901, a memory 902, and interfaces for connecting the various components, including a high-speed interface and a low-speed interface. The various components are interconnected by different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). Fig. 9 illustrates an example with one processor 901.
Memory 902 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the translation model training and translation processing methods provided herein. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method of translation model training, translation processing provided herein.
The memory 902, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the translation model training and translation processing methods in the embodiments of the present application (for example, the first obtaining module 701, the second obtaining module 702, the first processing module 703, and the training module 704 shown in fig. 7). The processor 901 executes various functional applications of the server and data processing, namely, a method of implementing translation model training and translation processing in the above method embodiments, by executing non-transitory software programs, instructions, and modules stored in the memory 902.
The memory 902 may include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created from use of the electronic device for translation model training, translation processing, and the like. Further, the memory 902 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 902 may optionally include memory located remotely from the processor 901, which may be connected over a network to translation model training, translation processing electronics. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the translation model training and translation processing method may further include: an input device 903 and an output device 904. The processor 901, the memory 902, the input device 903 and the output device 904 may be connected by a bus or other means, and fig. 9 illustrates the connection by a bus as an example.
The input device 903 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for translation model training, translation processing, such as a touch screen, a keypad, a mouse, a track pad, a touch pad, a pointing stick, one or more mouse buttons, a track ball, a joystick, and the like. The output devices 904 may include a display device, auxiliary lighting devices (e.g., LEDs), tactile feedback devices (e.g., vibrating motors), and the like. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application specific ASICs (application specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other. The server may be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of high management difficulty and weak service extensibility in traditional physical hosts and VPS ("Virtual Private Server") services; the server may also be a server of a distributed system or a server combined with a blockchain.
According to the technical solution of the embodiments of the present application, a multi-language training corpus is obtained and clustered according to languages to obtain a plurality of cluster training corpora; corpus processing is performed on the target language resources in each cluster training corpus to obtain a target training corpus of each cluster; and a translation model is trained according to the target training corpus of each cluster to generate a plurality of sub-translation models. In this way, languages with similar linguistic features are trained together through clustering, which helps improve the generalization capability of the translation model, increases the amount of training corpus data available for low-resource languages, and thereby improves translation quality.
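The overall flow described above can be sketched in Python as follows. This is a minimal illustrative sketch only: cluster_languages, corpus_processing and train_nmt are hypothetical stand-ins (a hard-coded language-family lookup, a pass-through, and a dummy trainer), not the actual clustering, augmentation and training procedures of the embodiments.

from collections import defaultdict

def cluster_languages(target_langs):
    # Toy grouping of target languages into clusters (here, by an assumed language family).
    families = {"es": "romance", "fr": "romance", "pt": "romance",
                "de": "germanic", "nl": "germanic"}
    clusters = defaultdict(list)
    for lang in target_langs:
        clusters[families.get(lang, "other")].append(lang)
    return clusters

def corpus_processing(cluster_corpus):
    # Placeholder for the per-cluster corpus processing (low-resource augmentation).
    return cluster_corpus

def train_nmt(corpus):
    # Placeholder "training" that only records which target languages the sub-model covers.
    langs = sorted({tgt_lang for _, tgt_lang, _ in corpus})
    return f"sub-translation model over {langs}"

# Corpus entries: (source sentence, target language, target sentence).
corpus = [("hello", "es", "hola"), ("hello", "fr", "bonjour"),
          ("hello", "de", "hallo"), ("hello", "nl", "hallo")]

sub_models = {}
for name, langs in cluster_languages({t for _, t, _ in corpus}).items():
    cluster_corpus = [ex for ex in corpus if ex[1] in langs]
    sub_models[name] = train_nmt(corpus_processing(cluster_corpus))
print(sub_models)  # one sub-translation model per language cluster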
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in a different order, and the present application is not limited in this respect, as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (17)

1. A translation model training method, comprising:
acquiring a multi-language training corpus, clustering the multi-language training corpus according to languages, and obtaining a plurality of cluster training corpora;
performing corpus processing on target language resources in each cluster training corpus to obtain a target training corpus of each cluster;
and training a translation model according to the target training corpus of each cluster to generate a plurality of sub-translation models.
2. The translation model training method according to claim 1, wherein the clustering the multi-language training corpus according to languages to obtain a plurality of cluster training corpora comprises:
for the multi-language training corpus, adding a label corresponding to each target language at a preset position of a source language;
training a language translation model from the source language to each target language, and acquiring a label code of each target language after training is completed;
clustering according to the label code of each target language through a preset clustering algorithm to obtain a plurality of clusters, and dividing the multi-language training corpus according to the plurality of clusters to obtain the plurality of cluster training corpora.
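A minimal sketch of the clustering step of claim 2, assuming the label codes of the target languages have already been extracted from a trained source-to-many-targets translation model; random vectors stand in for those learned codes here, and k-means is used as one possible preset clustering algorithm.

import numpy as np
from sklearn.cluster import KMeans

# Stand-in label codes: in practice these would be the learned embeddings of the
# target-language labels (e.g. "<2es>") added at a preset position of the source text.
rng = np.random.default_rng(0)
label_codes = {lang: rng.normal(size=8) for lang in ["es", "fr", "pt", "de", "nl", "sv"]}

langs = list(label_codes)
X = np.stack([label_codes[lang] for lang in langs])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
clusters = {}
for lang, label in zip(langs, kmeans.labels_):
    clusters.setdefault(int(label), []).append(lang)
print(clusters)  # the multi-language training corpus would then be divided along these clusters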
3. The translation model training method according to claim 1, wherein the performing corpus processing on target language resources in each cluster training corpus to obtain a target training corpus of each cluster comprises:
acquiring a target phrase fragment of the target language resource;
acquiring related phrase fragments matched with the target phrase fragments, and determining related language resources corresponding to the related phrase fragments;
sampling the training corpora of the related language resources to obtain candidate training corpora;
and adding the candidate training corpus into the training corpus corresponding to the target language resource to obtain the target training corpus of each cluster.
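A minimal sketch of the augmentation of claim 3, assuming character n-gram overlap as a simple proxy for matching phrase fragments; the toy corpora, the overlap threshold and the sampling size are illustrative assumptions only.

import random

# Toy corpora per language resource within one cluster: {language: [(source, target), ...]}.
cluster_corpora = {
    "gl": [("bos dias", "good morning")],                         # low-resource target resource
    "pt": [("bom dia", "good morning"), ("boa noite", "good night")],
    "es": [("buenos dias", "good morning"), ("buenas noches", "good night")],
}

def char_ngrams(text, n=3):
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def related_resources(target_lang, corpora, overlap=0.2):
    # Languages whose phrase fragments overlap with the target language's fragments.
    target_frags = set().union(*(char_ngrams(src) for src, _ in corpora[target_lang]))
    related = []
    for lang, pairs in corpora.items():
        if lang == target_lang:
            continue
        frags = set().union(*(char_ngrams(src) for src, _ in pairs))
        if len(target_frags & frags) / max(len(target_frags), 1) >= overlap:
            related.append(lang)
    return related

random.seed(0)
target = "gl"
augmented = list(cluster_corpora[target])
for lang in related_resources(target, cluster_corpora):
    augmented += random.sample(cluster_corpora[lang], k=1)  # sample candidate training corpora
print(augmented)  # target corpus enlarged with pairs sampled from related language resources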
4. The translation model training method according to claim 1, wherein the performing corpus processing on target language resources in each cluster training corpus to obtain a target training corpus of each cluster comprises:
acquiring candidate language resources corresponding to the target language resources from the training corpus of each class cluster;
acquiring candidate training corpora of the candidate language resources, and performing word splitting on the candidate training corpora to acquire a plurality of word corpora;
and adding the word corpora into the training corpus corresponding to the target language resource to obtain the target training corpus of each cluster.
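A minimal sketch of the word-splitting augmentation of claim 4; a naive positional word alignment is assumed purely for illustration, standing in for whatever alignment the embodiments actually use.

# Toy data inside one cluster: a low-resource target resource and a richer candidate resource.
target_corpus = [("bos dias", "good morning")]
candidate_corpus = [("bom dia amigo", "good morning friend")]

def split_into_word_corpora(pairs):
    # Split each candidate sentence pair into word-level corpora (naively aligned by position).
    word_pairs = []
    for src, tgt in pairs:
        for src_word, tgt_word in zip(src.split(), tgt.split()):
            word_pairs.append((src_word, tgt_word))
    return word_pairs

augmented_target = target_corpus + split_into_word_corpora(candidate_corpus)
print(augmented_target)  # word corpora added to the target language resource's training corpus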
5. The translation model training method of claim 1, further comprising:
acquiring monolingual data in each cluster target training corpus in the process of training the translation model according to each cluster target training corpus;
and encoding the monolingual data through a pre-training language model, and training the translation model with the encoded training vectors.
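A minimal sketch of the monolingual-encoding step of claim 5, in which a toy hashing encoder stands in for a real pre-training language model; only the flow (monolingual sentence, encoded training vector, translation-model training input) is illustrated.

import hashlib

def pretrained_lm_encode(sentence, dim=16):
    # Stand-in for a pre-trained language model encoder: each token is hashed
    # into a fixed-size vector so the example stays self-contained.
    vec = [0.0] * dim
    for token in sentence.split():
        digest = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16)
        vec[digest % dim] += 1.0
    return vec

monolingual_data = ["bos dias", "bos dias a todos"]  # monolingual sentences in the cluster corpus
encoded_vectors = [pretrained_lm_encode(s) for s in monolingual_data]

# The encoded training vectors would then be fed into translation-model training;
# here only their dimensions are shown.
print(len(encoded_vectors), len(encoded_vectors[0]))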
6. A translation processing method applying a translation model trained by the translation model training method according to any one of claims 1 to 5, comprising:
acquiring a text to be translated and a target language;
under the condition that the source language and the target language of the text to be translated belong to the same cluster, acquiring a translation sub-model, translating the text to be translated and acquiring a translation result;
under the condition that the source language and the target language of the text to be translated do not belong to the same cluster, acquiring a first translation sub-model to translate the text to be translated, and acquiring a candidate translation result;
and acquiring a second translation sub-model, translating the candidate translation result and acquiring a target translation result.
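A minimal sketch of the routing in claim 6, assuming a fixed language-to-cluster mapping and a hypothetical bridge language shared by the two sub-models; translate() is a placeholder for real sub-model inference.

# Assumed cluster assignment and bridge language; real systems derive these from training.
lang_to_cluster = {"gl": 0, "es": 0, "pt": 0, "zh": 1, "ja": 1}
BRIDGE = "es"

def translate(sub_model_id, text, src, tgt):
    # Placeholder inference call for the given sub-translation model.
    return f"[{src}->{tgt} via sub-model {sub_model_id}] {text}"

def translate_text(text, src, tgt):
    if lang_to_cluster[src] == lang_to_cluster[tgt]:
        # Source and target in the same cluster: one sub-translation model suffices.
        return translate(lang_to_cluster[src], text, src, tgt)
    # Different clusters: the first sub-model produces a candidate result, the second finishes it.
    candidate = translate(lang_to_cluster[src], text, src, BRIDGE)
    return translate(lang_to_cluster[tgt], candidate, BRIDGE, tgt)

print(translate_text("bos dias", "gl", "es"))  # same-cluster branch
print(translate_text("bos dias", "gl", "ja"))  # cross-cluster, two-step branch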
7. The translation processing method according to claim 6, further comprising:
in the process of translating the text to be translated, acquiring each word to be translated in the text to be translated;
acquiring a word candidate set corresponding to each word to be translated, and acquiring the error probability of each candidate word in the word candidate set corresponding to each word to be translated;
and deleting the candidate word from the word candidate set under the condition that the error probability is greater than a preset threshold.
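A minimal sketch of the candidate pruning in claim 7; the error probabilities and the preset threshold are assumed values, whereas in practice the probabilities would come from the model.

# Toy word candidate sets: {word to be translated: [(candidate word, error probability), ...]}.
candidate_sets = {
    "dias": [("days", 0.05), ("dice", 0.90), ("day", 0.30)],
    "bos":  [("good", 0.10), ("boss", 0.85)],
}

ERROR_THRESHOLD = 0.5  # assumed preset threshold

pruned = {
    word: [(cand, p) for cand, p in cands if p <= ERROR_THRESHOLD]
    for word, cands in candidate_sets.items()
}
print(pruned)  # candidate words whose error probability exceeds the threshold are deleted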
8. A translation model training apparatus comprising:
the first obtaining module is used for obtaining a multi-language training corpus;
the second obtaining module is used for clustering the multi-language training corpus according to languages to obtain a plurality of cluster training corpora;
the first processing module is used for performing corpus processing on target language resources in each cluster training corpus to obtain a target training corpus of each cluster;
and the training module is used for training a translation model according to the target training corpus of each cluster to generate a plurality of sub-translation models.
9. The translation model training device of claim 8, wherein the second obtaining module is specifically configured to:
for the multi-language training corpus, adding a label corresponding to each target language at a preset position of a source language;
training a language translation model from the source language to each target language, and acquiring a label code of each target language after training is completed;
clustering according to the label code of each target language through a preset clustering algorithm to obtain a plurality of clusters, and dividing the multi-language training corpus according to the plurality of clusters to obtain the plurality of cluster training corpora.
10. The translation model training device of claim 8, wherein the first processing module is specifically configured to:
acquiring a target phrase fragment of the target language resource;
acquiring related phrase fragments matched with the target phrase fragments, and determining related language resources corresponding to the related phrase fragments;
sampling the training corpora of the related language resources to obtain candidate training corpora;
and adding the candidate training corpus into the training corpus corresponding to the target language resource to obtain the target training corpus of each cluster.
11. The translation model training device of claim 8, wherein the first processing module is specifically configured to:
acquiring candidate language resources corresponding to the target language resources from the training corpus of each class cluster;
acquiring candidate training corpora of the candidate language resources, and performing word splitting on the candidate training corpora to acquire a plurality of word corpora;
and adding the word corpora into the training corpus corresponding to the target language resource to obtain the target training corpus of each cluster.
12. The translation model training apparatus according to claim 8, further comprising:
a third obtaining module, configured to obtain monolingual data in the target corpus of each cluster during the process of training the translation model according to the target corpus of each cluster;
and the second processing module is used for encoding the monolingual data through a pre-training language model and training the translation model with the encoded training vectors.
13. A translation processing apparatus applying a translation model trained by the translation model training apparatus according to any one of claims 8 to 12, comprising:
the fourth acquisition module is used for acquiring the text to be translated and the target language;
the fifth obtaining module is used for obtaining a translation sub-model under the condition that the source language and the target language of the text to be translated belong to the same cluster, translating the text to be translated and obtaining a translation result;
the sixth obtaining module is used for obtaining a first translation sub-model to translate the text to be translated under the condition that the source language and the target language of the text to be translated do not belong to the same cluster, and obtaining a candidate translation result;
and the seventh obtaining module is used for obtaining the second translation sub-model, translating the candidate translation result and obtaining a target translation result.
14. The translation processing apparatus according to claim 13, further comprising:
the eighth obtaining module is used for obtaining each word to be translated in the text to be translated in the process of translating the text to be translated;
a ninth obtaining module, configured to obtain a word candidate set corresponding to each word to be translated, and obtain an error probability of each candidate word in the word candidate set corresponding to each word to be translated;
and the deleting module is used for deleting the candidate words from the word candidate set under the condition that the error probability is greater than a preset threshold value.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
CN202011555680.5A 2020-12-24 2020-12-24 Translation model training method, translation processing method, translation model training device, translation processing equipment and storage medium Active CN112633017B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011555680.5A CN112633017B (en) 2020-12-24 2020-12-24 Translation model training method, translation processing method, translation model training device, translation processing equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112633017A true CN112633017A (en) 2021-04-09
CN112633017B CN112633017B (en) 2023-07-25

Family

ID=75324702

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011555680.5A Active CN112633017B (en) 2020-12-24 2020-12-24 Translation model training method, translation processing method, translation model training device, translation processing equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112633017B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049436A (en) * 2011-10-12 2013-04-17 北京百度网讯科技有限公司 Method and device for obtaining corpus, method and system for generating translation model and method and system for mechanical translation
WO2015029241A1 (en) * 2013-08-27 2015-03-05 Nec Corporation Word translation acquisition method
US10437933B1 (en) * 2016-08-16 2019-10-08 Amazon Technologies, Inc. Multi-domain machine translation system with training data clustering and dynamic domain adaptation
US20180165278A1 (en) * 2016-12-12 2018-06-14 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for translating based on artificial intelligence
CN109299481A (en) * 2018-11-15 2019-02-01 语联网(武汉)信息技术有限公司 MT engine recommended method, device and electronic equipment
CN109785824A (en) * 2019-03-15 2019-05-21 科大讯飞股份有限公司 A kind of training method and device of voiced translation model
CN111046677A (en) * 2019-12-09 2020-04-21 北京字节跳动网络技术有限公司 Method, device, equipment and storage medium for obtaining translation model
CN111460838A (en) * 2020-04-23 2020-07-28 腾讯科技(深圳)有限公司 Pre-training method and device of intelligent translation model and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
HAO LIU et al.: "A novel method to optimize training data for translation model adaptation", IEEE XPLORE *
姚亮; 洪宇; 刘昊; 刘乐; 姚建民: "Research on domain adaptation of translation models based on semantic distribution similarity", Journal of Shandong University (Natural Science Edition), no. 07 *
闫小强; 卢耀恩; 娄铮铮; 叶阳东: "A multilingual text clustering algorithm based on parallel information bottleneck", Pattern Recognition and Artificial Intelligence, no. 06 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113204977A (en) * 2021-04-29 2021-08-03 北京有竹居网络技术有限公司 Information translation method, device, equipment and storage medium
WO2022228221A1 (en) * 2021-04-29 2022-11-03 北京有竹居网络技术有限公司 Information translation method, apparatus and device, and storage medium
CN113204977B (en) * 2021-04-29 2023-09-26 北京有竹居网络技术有限公司 Information translation method, device, equipment and storage medium
CN113515959A (en) * 2021-06-23 2021-10-19 网易有道信息技术(北京)有限公司 Training method of machine translation model, machine translation method and related equipment
CN113642333A (en) * 2021-08-18 2021-11-12 北京百度网讯科技有限公司 Display method and device, and training method and device of semantic unit detection model
CN113836949A (en) * 2021-09-10 2021-12-24 北京捷通华声科技股份有限公司 Language model training method, translation method and device
CN114676707A (en) * 2022-03-22 2022-06-28 腾讯科技(深圳)有限公司 Method for determining multi-language translation model and related device
CN114818748A (en) * 2022-05-10 2022-07-29 北京百度网讯科技有限公司 Method for generating translation model, translation method and device
CN114757214A (en) * 2022-05-12 2022-07-15 北京百度网讯科技有限公司 Selection method and related device for sample corpora for optimizing translation model
CN114757214B (en) * 2022-05-12 2023-01-31 北京百度网讯科技有限公司 Selection method and related device for sample corpora for optimizing translation model
CN115756576A (en) * 2022-09-26 2023-03-07 赵莹莹 Translation method of software development kit and software development system
CN115756576B (en) * 2022-09-26 2023-12-12 山西数字政府建设运营有限公司 Translation method of software development kit and software development system

Also Published As

Publication number Publication date
CN112633017B (en) 2023-07-25

Similar Documents

Publication Publication Date Title
CN112633017B (en) Translation model training method, translation processing method, translation model training device, translation processing equipment and storage medium
KR102497945B1 (en) Text recognition method, electronic device, and storage medium
JP7366984B2 (en) Text error correction processing method, device, electronic device and storage medium
CN111274764B (en) Language generation method and device, computer equipment and storage medium
CN111709248B (en) Training method and device for text generation model and electronic equipment
KR20210075825A (en) Semantic representation model processing method, device, electronic equipment and storage medium
KR102565673B1 (en) Method and apparatus for generating semantic representation model,and storage medium
CN110797005B (en) Prosody prediction method, apparatus, device, and medium
CN112347769B (en) Entity recognition model generation method and device, electronic equipment and storage medium
CN111078865B (en) Text title generation method and device
CN112489637A (en) Speech recognition method and device
CN111177355B (en) Man-machine conversation interaction method and device based on search data and electronic equipment
CN111709249B (en) Multi-language model training method and device, electronic equipment and storage medium
CN111144108A (en) Emotion tendency analysis model modeling method and device and electronic equipment
CN112506949B (en) Method, device and storage medium for generating structured query language query statement
CN111079945B (en) End-to-end model training method and device
CN112528001B (en) Information query method and device and electronic equipment
CN111950292A (en) Training method of text error correction model, and text error correction processing method and device
CN112365879A (en) Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN113360751A (en) Intention recognition method, apparatus, device and medium
CN113360001A (en) Input text processing method and device, electronic equipment and storage medium
CN112528605A (en) Text style processing method and device, electronic equipment and storage medium
CN112232089B (en) Pre-training method, device and storage medium of semantic representation model
CN112466277B (en) Prosody model training method and device, electronic equipment and storage medium
CN111325000B (en) Language generation method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant