CN114757212A - Translation model training method and device, electronic equipment and medium - Google Patents


Info

Publication number
CN114757212A
CN114757212A (application CN202210328982.1A)
Authority
CN
China
Prior art keywords
corpus
pair
parallel
monolingual
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210328982.1A
Other languages
Chinese (zh)
Inventor
黄继豪
李长亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Digital Entertainment Co Ltd
Original Assignee
Beijing Kingsoft Digital Entertainment Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Digital Entertainment Co Ltd filed Critical Beijing Kingsoft Digital Entertainment Co Ltd
Priority to CN202210328982.1A
Publication of CN114757212A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language
    • G06F40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/242: Dictionaries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The present disclosure relates to a translation model training method, apparatus, electronic device, and medium. The method includes: obtaining a first parallel corpus pair, where the first parallel corpus pair includes a corpus pair in which a first monolingual corpus is matched with a second monolingual corpus, or a corpus pair in which the second monolingual corpus is matched with the first monolingual corpus; determining a target parallel corpus pair according to the first parallel corpus pair, where the target parallel corpus pair includes the first parallel corpus pair and/or a second parallel corpus pair; and training an initial translation model according to the target parallel corpus pair to obtain a target translation model. These embodiments can effectively improve the training efficiency of the translation model.

Description

Translation model training method and device, electronic equipment and medium
Technical Field
The present disclosure relates to the field of model training, and in particular, to a translation model training method, apparatus, electronic device, and medium.
Background
A translation model translates content between two languages, for example a Chinese-English translation model, an English-Chinese translation model, a Chinese-Mongolian translation model, or a Mongolian-Chinese translation model.
In the related art, a machine translation model (e.g., a Chinese-English translation model trained on Chinese-English parallel corpora) is first obtained by training on a large number of parallel corpus pairs. A monolingual corpus is then back-translated with this model to obtain paired corpora (e.g., monolingual Chinese is back-translated into corresponding English, forming English-Chinese parallel corpus pairs), which are used to train the corresponding opposite-direction model (e.g., an English-Chinese translation model).
However, the corpus pairs obtained through back-translation differ substantially from genuine parallel corpora in fluency, which lowers the training precision of the translation model, so the training efficiency of the translation model is not high.
Disclosure of Invention
In order to solve the technical problem, the present disclosure provides a translation model training method, apparatus, electronic device, and medium.
In a first aspect, this embodiment provides a translation model training method, including:
obtaining a first parallel corpus pair, the first parallel corpus pair including: a corpus pair in which the first monolingual corpus matches the second monolingual corpus, or a corpus pair in which the second monolingual corpus matches the first monolingual corpus;
determining a target parallel corpus pair according to the first parallel corpus pair, where the target parallel corpus pair includes: the first parallel corpus pair and/or a second parallel corpus pair, the correspondence between the monolingual corpora in the second parallel corpus pair being the same as that between the monolingual corpora in the first parallel corpus pair;
and training the initial translation model according to the target parallel corpus pair to obtain a target translation model.
Optionally, the target parallel corpus pair includes: a first parallel corpus pair and a second parallel corpus pair;
The training the initial translation model according to the target parallel corpus pair to obtain a target translation model includes:
determining a training cycle frequency;
determining, within the training cycle frequency, a training batch of the first parallel corpus pair and a training batch of the second parallel corpus pair, where the training batches include: training times and training sequence numbers;
training an initial translation model according to the first parallel corpus pair and the second parallel corpus pair based on the training batch of the first parallel corpus pair and the training batch of the second parallel corpus pair to obtain a target translation model.
Optionally, the determining the training batch of the first parallel corpus pair and the training batch of the second parallel corpus pair includes:
determining a training batch of the first parallel corpus pair and a training batch of the second parallel corpus pair based on the weight of the first parallel corpus pair and the weight of the second parallel corpus pair.
Optionally, the determining a target parallel corpus pair according to the first parallel corpus pair includes:
determining a second parallel corpus pair according to the first parallel corpus pair;
and determining a target parallel corpus pair according to the second parallel corpus pair and the first parallel corpus pair.
Optionally, the determining a second parallel corpus pair according to the first parallel corpus pair includes:
training an initial translation model according to the first parallel corpus pair to obtain a first translation model;
and determining a second parallel corpus pair according to the low-frequency dictionary library of the first translation model.
Optionally, before the determining a second parallel corpus pair according to the low-frequency dictionary library of the first translation model, the method further includes:
determining a low-frequency dictionary library of the first translation model;
and the determining a second parallel corpus pair according to the low-frequency dictionary library of the first translation model includes:
determining target monolingual corpus from a low-frequency dictionary library of the first translation model;
determining a matching monolingual corpus of the target monolingual corpus according to the target monolingual corpus and the first translation model;
and determining a second parallel corpus pair according to the target monolingual corpus and the paired monolingual corpus.
Optionally, the determining a low frequency dictionary library of the first translation model includes:
and constructing a low-frequency dictionary library from candidate monolingual corpora whose frequency of occurrence in the first parallel corpus pair is greater than a first preset threshold and less than a second preset threshold.
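The low-frequency dictionary construction above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace-tokenized corpora; the function name and thresholds are illustrative, not part of the disclosure:

```python
from collections import Counter

def build_low_freq_dictionary(corpus_pairs, first_threshold, second_threshold):
    """Collect candidate monolingual tokens whose frequency of occurrence
    in the first parallel corpus pair lies strictly between the two
    preset thresholds."""
    counts = Counter()
    for source_sentence, _target_sentence in corpus_pairs:
        counts.update(source_sentence.split())
    return {token: freq for token, freq in counts.items()
            if first_threshold < freq < second_threshold}

pairs = [
    ("the cat sat", "le chat est assis"),
    ("the dog ran", "le chien a couru"),
    ("a cat ran", "un chat a couru"),
]
# keep tokens occurring more than once but fewer than three times
low_freq = build_low_freq_dictionary(pairs, first_threshold=1, second_threshold=3)
```

Tokens that are too rare tend to be noise and tokens that are too common are already well covered by the first translation model, which is presumably why both thresholds are applied.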
Optionally, before the determining a matching monolingual corpus according to the target monolingual corpus and the first translation model, the method further includes:
acquiring a random monolingual corpus;
and updating the target monolingual corpus according to the weight of the candidate monolingual corpus and the weight of the random monolingual corpus.
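The updating step above, which mixes candidate low-frequency corpora with randomly acquired monolingual corpora according to their weights, might look like the following sketch; the proportional-sampling scheme and all names are assumptions, as the disclosure does not specify the mixing rule:

```python
import random

def update_target_corpus(candidate_corpus, random_corpus,
                         candidate_weight, random_weight, total):
    """Draw sentences from the candidate (low-frequency) corpus and the
    random monolingual corpus in proportion to their weights."""
    n_candidate = round(total * candidate_weight / (candidate_weight + random_weight))
    n_random = total - n_candidate
    sample = random.sample(candidate_corpus, min(n_candidate, len(candidate_corpus)))
    sample += random.sample(random_corpus, min(n_random, len(random_corpus)))
    return sample

candidates = [f"low-freq sentence {i}" for i in range(20)]
randoms = [f"random sentence {i}" for i in range(20)]
# with weights 0.7/0.3 and 10 slots: 7 candidate sentences, 3 random ones
target = update_target_corpus(candidates, randoms, 0.7, 0.3, total=10)
```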
In a second aspect, this embodiment provides a translation model training apparatus, including:
an obtaining module, configured to obtain a first parallel corpus pair, where the first parallel corpus pair includes: a corpus pair in which the first monolingual corpus matches the second monolingual corpus, or a corpus pair in which the second monolingual corpus matches the first monolingual corpus;
a determining module, configured to determine a target parallel corpus pair according to the first parallel corpus pair, where the target parallel corpus pair includes: the first parallel corpus pair and/or a second parallel corpus pair, the correspondence between the monolingual corpora in the second parallel corpus pair being the same as that between the monolingual corpora in the first parallel corpus pair;
and the training module is used for training the initial translation model according to the target parallel corpus pair to obtain a target translation model.
Optionally, the target parallel corpus pair includes: a first parallel corpus pair and a second parallel corpus pair;
the training module includes: a first determining unit, a second determining unit, and a training unit;
the first determining unit is used for determining the training cycle frequency;
a second determining unit, configured to determine, within the training cycle frequency, a training batch of the first parallel corpus pair and a training batch of the second parallel corpus pair, where the training batches include: training times and training sequence numbers;
and the training unit is used for training an initial translation model according to the first parallel corpus pair and the second parallel corpus pair based on the training batch of the first parallel corpus pair and the training batch of the second parallel corpus pair to obtain a target translation model.
Optionally, the second determining unit is specifically configured to:
determining a training batch of the first parallel corpus pair and a training batch of the second parallel corpus pair based on the weight of the first parallel corpus pair and the weight of the second parallel corpus pair.
Optionally, the determining module includes: a third determination unit and a fourth determination unit;
a third determining unit, configured to determine a second parallel corpus pair according to the first parallel corpus pair;
and a fourth determining unit, configured to determine a target parallel corpus pair according to the second parallel corpus pair and the first parallel corpus pair.
Optionally, the third determining unit is specifically configured to:
training an initial translation model according to the first parallel corpus pair to obtain a first translation model;
and determining a second parallel corpus pair according to the low-frequency dictionary library of the first translation model.
Optionally, the determining module is further configured to determine a low-frequency dictionary library of the first translation model;
a third determining unit, specifically configured to:
determining target monolingual corpus from a low-frequency dictionary library of the first translation model;
determining a matching monolingual corpus of the target monolingual corpus according to the target monolingual corpus and the first translation model;
and determining a second parallel corpus pair according to the target monolingual corpus and the matching monolingual corpus.
Optionally, the determining module is specifically configured to:
and constructing a low-frequency dictionary library from candidate monolingual corpora whose frequency of occurrence in the first parallel corpus pair is greater than a first preset threshold and less than a second preset threshold.
Optionally, the apparatus further includes: an updating module;
the obtaining module is further configured to obtain a random monolingual corpus;
and the updating module is configured to update the target monolingual corpus according to the weight of the candidate monolingual corpus and the weight of the random monolingual corpus.
In a third aspect, this embodiment further provides an electronic device, including:
one or more processors;
a storage device to store one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors implement the translation model training method according to any of the embodiments of the present invention.
In a fourth aspect, this embodiment further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the translation model training method according to any of the embodiments of the present invention.
Compared with the prior art, the technical solution provided by the embodiments of the present disclosure has the following advantages: a first parallel corpus pair is obtained, the first parallel corpus pair including a corpus pair in which the first monolingual corpus matches the second monolingual corpus, or a corpus pair in which the second monolingual corpus matches the first monolingual corpus; a target parallel corpus pair is determined according to the first parallel corpus pair, where the target parallel corpus pair may include the first parallel corpus pair and/or a second parallel corpus pair, which expands the training samples of the translation model; and the initial translation model is trained according to the target parallel corpus pair to obtain the target translation model, thereby effectively improving the training efficiency of the translation model.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below; obviously, those skilled in the art can obtain other drawings from these drawings without inventive effort.
Fig. 1 is a schematic flowchart of a translation model training method provided in this embodiment;
FIG. 2 is a schematic flowchart of another translation model training method provided in this embodiment;
FIG. 3 is a flowchart illustrating a further translation model training method provided in this embodiment;
FIG. 4 is a flowchart illustrating a further translation model training method provided in this embodiment;
fig. 5 is a schematic structural diagram of a translation model training apparatus provided in this embodiment;
fig. 6 is a schematic structural diagram of an electronic device provided in this embodiment.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth to provide a thorough understanding of the present disclosure; however, the present disclosure may be practiced in ways other than those described herein. It is to be understood that the embodiments described in the specification are only some, not all, of the embodiments of the present disclosure.
At present, when a translation model is trained, an initial translation model is mainly trained through a large number of training samples to obtain a target translation model.
For example, when training a Chinese-English translation model, a first set of Chinese-English parallel corpus pairs may be used to train an initial translation model to obtain an English-Chinese translation model. Some Chinese monolingual corpora are then collected, the English monolingual corpora corresponding to them are produced by the English-Chinese translation model, and the Chinese and English monolingual corpora are combined into a second set of English-Chinese parallel corpus pairs, on which the initial translation model is further trained and updated.
Existing translation model training thus alleviates the shortage of parallel corpus pairs by constructing pseudo-parallel corpus pairs (a pseudo-parallel corpus is a set of parallel corpora, similar to real parallel corpora, derived from an existing set of parallel corpora, such as the second set of English-Chinese parallel corpus pairs above). However, the pseudo-parallel corpus pairs obtained through back-translation differ from genuine parallel corpora in fluency, so the training precision of the model is difficult to guarantee and the training efficiency of the model suffers.
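The back-translation scheme described above can be illustrated with a small Python sketch. Here `translate` stands in for an already-trained translation model and is a placeholder, not a real API:

```python
def build_pseudo_parallel_pairs(monolingual_corpus, translate):
    """Back-translate monolingual sentences with an existing translation
    model, pairing each with its generated counterpart to form a
    pseudo-parallel corpus pair."""
    return [(translate(sentence), sentence) for sentence in monolingual_corpus]

# toy stand-in for a trained model translating Chinese into English
fake_translate = lambda zh: f"<en of {zh}>"
pseudo_pairs = build_pseudo_parallel_pairs(["你好", "谢谢"], fake_translate)
```

Note that only the target side of each pseudo-pair is genuine human text; the source side is model output, which is exactly why the fluency gap discussed above arises.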
To this end, the present disclosure provides a translation model training method, apparatus, electronic device, and medium: a first parallel corpus pair is obtained, the first parallel corpus pair including a corpus pair in which the first monolingual corpus matches the second monolingual corpus, or a corpus pair in which the second monolingual corpus matches the first monolingual corpus; a target parallel corpus pair is determined according to the first parallel corpus pair, where the target parallel corpus pair may include the first parallel corpus pair and/or a second parallel corpus pair, which expands the training samples of the translation model; and the initial translation model is trained according to the target parallel corpus pair to obtain the target translation model, thereby effectively improving the training efficiency of the translation model.
Fig. 1 is a schematic flowchart of a translation model training method provided in this embodiment. The method of the embodiment may be performed by a translation model training apparatus, which may be implemented in hardware and/or software and may be configured in an electronic device. The translation model training method can be realized according to any embodiment of the application. As shown in fig. 1, the method specifically includes the following steps:
and S110, acquiring a first parallel language material pair.
Wherein, the first parallel corpus pair comprises: and the corpus pair is matched between the first monolingual corpus and the second monolingual corpus, or the corpus pair is matched between the second monolingual corpus and the first monolingual corpus.
The first parallel-language-material pair includes: and the corpus pair is matched between the first monolingual corpus and the second monolingual corpus, or the corpus pair is matched between the second monolingual corpus and the first monolingual corpus.
For example, when the first parallel corpus pair includes a corpus pair in which the first monolingual corpus is matched with the second monolingual corpus, and the first monolingual corpus is English while the second monolingual corpus is Chinese, the first parallel corpus pair may be an English-Chinese corpus pair.
Continuing the example, when the first parallel corpus pair includes a corpus pair in which the second monolingual corpus is matched with the first monolingual corpus, the first parallel corpus pair may be a Chinese-English corpus pair.
It should be noted that a monolingual corpus refers to words/sentences composed in a single language, and the first parallel corpus pair / second parallel corpus pair is a pair of corpora in two different languages with the same or similar meanings.
A large number of initial parallel corpus pairs can be obtained through network resource searching, manual construction, and the like.
After a large number of initial parallel corpus pairs are obtained, data cleaning can be performed on them to unify the initial parallel corpora and obtain the first parallel corpus pair.
Wherein the data cleansing operation may include: corpus supplement, corpus deletion, semantic correction, format unification, size alignment and the like.
For example, when the first parallel corpus pair is a Chinese-English corpus pair and some of its Chinese monolingual corpora have no paired English monolingual corpus, corpus pairing may be performed for all such Chinese monolingual corpora through manual collection or machine translation, thereby supplementing the missing corpora.
Continuing the example, after corpus supplementation, some Chinese monolingual corpora in the resulting Chinese-English corpus pair may each have two or more paired English monolingual corpora. The English monolingual corpus paired with a Chinese monolingual corpus is then determined by the similarity between the Chinese monolingual corpus and each candidate English monolingual corpus; for instance, the English monolingual corpus with the maximum similarity is selected as the pairing, and the duplicate corpora are deleted.
Continuing the example, after corpus deletion, some sentences in the resulting Chinese-English corpus pair may have inconsistent semantics; the semantics of such a sentence can be corrected according to its keywords in combination with the sentences before and after it, thereby correcting sentences with semantic errors.
Continuing the example, after semantic correction, some sentences in the resulting Chinese-English corpus pair may differ in format, such as line spacing, punctuation, and paragraph indentation; a prepared unified format can be applied to these sentences to standardize irregular formats.
Continuing the example, after format unification, some fonts in the resulting Chinese-English corpus pair may differ in size; a prepared unified font can be applied to align the sizes of all fonts in the corpus pair.
It should be noted that the order of the above cleaning operations in the actual cleaning process is not specifically limited by the embodiments of the present disclosure.
Thus, data cleaning of the obtained initial parallel corpus pairs improves the corpus quality, yielding a high-quality first parallel corpus pair.
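The cleaning operations above can be combined into a single pass. The following toy sketch covers only corpus deletion of incomplete or duplicate pairs and format unification; the other operations (corpus supplementation, semantic correction) need external tooling and are omitted:

```python
def clean_corpus_pairs(initial_pairs):
    """Toy data-cleaning pass over (source, target) corpus pairs:
    drop pairs missing either side, normalise whitespace, and keep only
    the first pairing for each source sentence."""
    cleaned, seen_sources = [], set()
    for source, target in initial_pairs:
        if not source or not target:          # delete incomplete pairs
            continue
        source = " ".join(source.split())     # unify spacing/format
        target = " ".join(target.split())
        if source in seen_sources:            # delete duplicate pairings
            continue
        seen_sources.add(source)
        cleaned.append((source, target))
    return cleaned

raw = [("hello  world", "你好 世界"), ("hello world", "你好,世界"), ("bye", "")]
first_pair_corpus = clean_corpus_pairs(raw)
```

A production pipeline would pick the duplicate to keep by similarity score, as described above, rather than by order of appearance.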
S120, determining a target parallel corpus pair according to the first parallel corpus pair.
Wherein, the target parallel corpus pair comprises: the first parallel corpus pair, and/or the second parallel corpus pair.
The target parallel corpus pair may include at least one parallel corpus pair, such as a first parallel corpus pair, or a second parallel corpus pair, or a first parallel corpus pair and a second parallel corpus pair.
It should be noted that the first parallel corpus pair may be a corpus pair combining two monolingual languages, such as a corpus pair obtained by matching the first monolingual corpus with the second monolingual corpus, or a corpus pair matching the second monolingual corpus with the first monolingual corpus.
The first monolingual corpus may be in English, Chinese, Mongolian, Korean, etc., and the second monolingual corpus may be in Chinese, Mongolian, Korean, English, etc.
It should be noted that the first monolingual corpus is in a different language from the second monolingual corpus; accordingly, when the first monolingual corpus is English, the second monolingual corpus may be Chinese, Mongolian, Korean, etc., and when the first monolingual corpus is Chinese, the second may be English, Mongolian, Korean, etc.
The first parallel corpus pair may be a Chinese-English parallel corpus pair, an English-Chinese parallel corpus pair, a Chinese-Mongolian parallel corpus pair, a Mongolian-Chinese parallel corpus pair, a Chinese-Korean parallel corpus pair, or a Korean-Chinese parallel corpus pair.
The second parallel corpus pair may be a pseudo parallel corpus pair of the first parallel corpus pair, that is, the correspondence between the monolingual corpora of the second parallel corpus pair is the same as the correspondence between the monolingual corpora of the first parallel corpus pair.
For example, when the first parallel corpus pair is a Chinese-English parallel corpus pair, the second parallel corpus pair is also a Chinese-English parallel corpus pair; similarly, when the first parallel corpus pair is an English-Chinese, Mongolian-Chinese, Chinese-Mongolian, Chinese-Korean, or Korean-Chinese parallel corpus pair, the corresponding second parallel corpus pair is of the same type.
S130, training the initial translation model according to the target parallel corpus pair to obtain a target translation model.
The initial translation model may be a machine learning model, such as a neural network model.
Specifically, the target parallel corpus pair may be used as a training sample pair to train the initial translation model to obtain the target translation model.
With the translation model training method provided by the present disclosure, a first parallel corpus pair is obtained, the first parallel corpus pair including a corpus pair in which the first monolingual corpus matches the second monolingual corpus, or a corpus pair in which the second monolingual corpus matches the first monolingual corpus; a target parallel corpus pair is determined according to the first parallel corpus pair, where the target parallel corpus pair may include the first parallel corpus pair and/or a second parallel corpus pair, which expands the training samples of the translation model; and the initial translation model is trained according to the target parallel corpus pair to obtain the target translation model, thereby effectively improving the training efficiency of the translation model.
Fig. 2 is a schematic flowchart of another translation model training method provided in this embodiment. Based on the foregoing embodiment, in this embodiment, when the target parallel corpus pair includes the first parallel corpus pair and the second parallel corpus pair, a possible implementation manner of S130 is as follows:
and S1301, determining the training cycle frequency.
When the target parallel corpus pair includes two parallel corpus pairs, that is, the target parallel corpus pair includes a first parallel corpus pair and a second parallel corpus pair, the training period may be the sum of the training times of the first parallel corpus pair and the training times of the second parallel corpus pair.
For example, when the frequency of the training cycle is 8, it means that 8 times of model training can be performed in one training cycle, the number of times of training of the first parallel corpus pair can be 5, and the number of times of training of the second parallel corpus pair can be 3, that is, the initial translation model is trained 5 times by using the first parallel corpus pair, and then the initial translation model is trained 3 times by using the second parallel corpus pair, so that one training cycle can be trained.
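The cycle just described can be laid out as an explicit schedule; a minimal sketch with the illustrative counts from the example (names are assumptions):

```python
def build_training_schedule(cycle_frequency, first_pair_steps):
    """Within one training cycle, schedule all steps on the first
    parallel corpus pair before those on the second."""
    second_pair_steps = cycle_frequency - first_pair_steps
    return ["first"] * first_pair_steps + ["second"] * second_pair_steps

# a cycle frequency of 8: 5 steps on the first pair, then 3 on the second
schedule = build_training_schedule(cycle_frequency=8, first_pair_steps=5)
```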
S1302, determining a training batch of the first parallel corpus pair and a training batch of the second parallel corpus pair within the training cycle frequency.
Wherein the training batch comprises: training times and training sequence numbers.
The training sequence number indicates the order in which a parallel corpus pair is used for training; when training the initial translation model, parallel corpus pairs within the same sequence-number range can be used to train the model continuously in one pass.
Continuing the example, if the number of training steps of the first parallel corpus pair is 5 with sequence number 1, and that of the second parallel corpus pair is 3 with sequence number 2, then within a complete training cycle the first parallel corpus pair is trained before the second parallel corpus pair.
Optionally, determining a training batch of the first parallel corpus pair and a training batch of the second parallel corpus pair includes:
and determining a training batch of the first parallel corpus pair and a training batch of the second parallel corpus pair based on the weight of the first parallel corpus pair and the weight of the second parallel corpus pair.
The weights of the first and second parallel corpus pairs can be measured according to the source, credibility, precision, or accuracy of the corpora they contain.
For example, suppose the source of the first parallel corpus pair is database A and the source of the second parallel corpus pair is database B, and the data in database A is updated more frequently than that in database B. This indicates that the data in database A is more up to date and more accurate, so the weight of the first parallel corpus pair may be set greater than that of the second, e.g., 0.7 versus 0.3; the specific weights may be set based on training requirements and are not specifically limited by the present disclosure.
When the training cycle frequency is 10, it can then be determined that the number of training steps of the first parallel corpus pair is 7 with sequence number 1, and that of the second parallel corpus pair is 3 with sequence number 2.
Therefore, the training batch of the first parallel corpus pair and the training batch of the second parallel corpus pair are constant by taking the weight of the first parallel corpus pair and the weight of the second parallel corpus pair as reference, and the training batch of the first parallel corpus pair and the training batch of the second parallel corpus pair can be accurately and effectively determined.
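The weight-to-batch computation above can be sketched as follows; `assign_training_batches` is a hypothetical helper written for illustration, not part of the disclosed apparatus.

```python
# Hypothetical sketch: derive each corpus pair's training count and
# training sequence number from its weight and the training cycle
# frequency, as in the 0.7/0.3 example above.
def assign_training_batches(weights, cycle_frequency):
    """weights: corpus-pair weights, listed in training order.
    Returns a (training_count, sequence_number) tuple per corpus pair."""
    batches = []
    for seq_no, w in enumerate(weights, start=1):
        count = round(cycle_frequency * w)  # e.g. 10 * 0.7 -> 7 trainings
        batches.append((count, seq_no))
    return batches

print(assign_training_batches([0.7, 0.3], 10))  # [(7, 1), (3, 2)]
```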
S1303, training the initial translation model according to the first parallel corpus pair and the second parallel corpus pair based on the training batch of the first parallel corpus pair and the training batch of the second parallel corpus pair to obtain the target translation model.
For example, taking the initial translation model as an English-Chinese neural network model, and assuming that the first parallel corpus pair and the second parallel corpus pair are both English-Chinese corpus pairs, the English corpora are input into the English-Chinese neural network model to output target Chinese corpora. The difference between each target Chinese corpus and the Chinese corpus in the corresponding corpus pair is then compared; when the difference is smaller than a preset difference threshold, the model training is considered successful and the target translation model is obtained.
The first parallel corpus pair may include a first Chinese monolingual corpus and a first English monolingual corpus, and the second parallel corpus pair may include a second Chinese monolingual corpus and a second English monolingual corpus. When the initial translation model is trained, the first Chinese monolingual corpus and the second Chinese monolingual corpus may be input into the initial translation model separately, or at the same time, and the initial translation model is trained to obtain the target translation model.
Taking as an example inputting the first Chinese monolingual corpus and the second Chinese monolingual corpus into the initial translation model separately: the first Chinese monolingual corpus is input into the initial translation model, the output of the model is compared for error against the first English monolingual corpus corresponding to the first Chinese monolingual corpus, and the training precision is optimized; the second Chinese monolingual corpus is then input into the initial translation model, the output is compared for error against the second English monolingual corpus corresponding to the second Chinese monolingual corpus, and the training precision is optimized, until model convergence is satisfied and the target translation model (e.g., a Chinese-English translation model) is obtained.
The model parameters in the initial translation model can be adjusted according to the result of the error comparison between the output of the initial translation model and the second English monolingual corpus corresponding to the second Chinese monolingual corpus, and the adjusted model parameters are then used to translate the second Chinese monolingual corpus, until the error between the output English corpus and the second English monolingual corpus falls within a preset error range, at which point model convergence is determined to be satisfied.
Alternatively, the model parameters in the initial translation model are adjusted according to the result of the error comparison between the output of the initial translation model and the first English monolingual corpus corresponding to the first Chinese monolingual corpus, and the new model parameters are used to translate the first Chinese monolingual corpus, until the error between the output English corpus and the first English monolingual corpus falls within the preset error range, at which point model convergence is determined to be satisfied.
The number of training times in the training batch of the first/second parallel corpus pair reflects the number of interactions between that parallel corpus pair and the initial translation model, and the training sequence number in the training batch reflects the order in which that parallel corpus pair interacts with the initial translation model.
Combining the above example, the training period frequency is 10, the number of training times of the first parallel corpus pair is 7 with training sequence number 1, and the number of training times of the second parallel corpus pair is 3 with training sequence number 2; the smaller the training sequence number, the earlier the corresponding parallel corpus pair is trained within a complete training period.
When the first parallel corpus pair is a first Chinese-English parallel corpus pair and the second parallel corpus pair is a second Chinese-English parallel corpus pair, training the initial translation model based on the training batches of the two pairs proceeds, for example, as follows: at the start of a training period, the initial translation model is trained 7 times with the first Chinese-English parallel corpus pair; after that training is finished, the initial translation model is trained 3 times with the second Chinese-English parallel corpus pair; and when it is determined that the error between the output monolingual corpus and the matching corpus corresponding to the input monolingual corpus satisfies a preset threshold, model convergence is determined to be satisfied, and the trained Chinese-English translation model is obtained.
Therefore, the initial translation model can be alternately trained by utilizing the first parallel corpus pair and the second parallel corpus pair based on the training batch of the first parallel corpus pair and the training batch of the second parallel corpus pair, and the target translation model with high precision is obtained.
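The alternating 7/3 schedule described above can be sketched as follows; `train_step` and `converged` are placeholder callables standing in for the model update and convergence check (assumed names, not from the disclosure).

```python
# Hedged sketch of alternating training: within each training period,
# train `first_count` times on the first parallel corpus pair, then
# `second_count` times on the second, until convergence is reported.
def alternating_training(first_pair, second_pair, train_step, converged,
                         first_count=7, second_count=3, max_periods=100):
    for _ in range(max_periods):
        for _ in range(first_count):
            train_step(first_pair)   # sequence number 1: trained first
        for _ in range(second_count):
            train_step(second_pair)  # sequence number 2: trained second
        if converged():
            break
```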
Fig. 3 is a schematic flowchart of another translation model training method provided in this embodiment. Based on the foregoing embodiment, in this embodiment, when the target parallel corpus pair includes the first parallel corpus pair and the second parallel corpus pair, a possible implementation manner of S120 is as follows:
S1201, determining a second parallel corpus pair according to the first parallel corpus pair.
The second parallel corpus pair may be a pseudo-parallel corpus pair of the first parallel corpus pair. Determining the second parallel corpus pair on the basis of the first parallel corpus pair effectively guarantees its consistency with the first parallel corpus pair, which facilitates improving the stability of the trained model.
Optionally, determining the second parallel corpus pair according to the first parallel corpus pair includes:
training the initial translation model according to the first parallel corpus pair to obtain a first translation model;
and determining a second parallel corpus pair according to the low-frequency dictionary base of the first translation model.
When the first parallel corpus pair is a Chinese-English corpus pair, the initial translation model may be a neural network model; after the initial translation model is trained, the obtained first translation model may be a Chinese-English translation model.
The first translation model may include a plurality of dictionaries, such as a high-frequency dictionary base and a low-frequency dictionary base, where the low-frequency dictionary base may store monolingual corpora that appear with low frequency in the training samples of the first translation model.
Determining the second parallel corpus pair according to the low-frequency dictionary base of the first translation model may include: determining a plurality of monolingual corpora according to the low-frequency dictionary base of the first translation model, and then inputting these monolingual corpora into the first translation model respectively to obtain the matching monolingual corpora corresponding to them, thereby obtaining the second parallel corpus pair.
Therefore, by determining the second parallel corpus pair through the low-frequency dictionary base, the determined second parallel corpus pair can supplement the corpora in the first parallel corpus pair, avoiding the loss of low-frequency vocabulary in the model training samples that would affect the training precision, and improving the translation accuracy of the trained translation model.
Before determining the second parallel corpus pair according to the low-frequency dictionary base of the first translation model, the method further includes:
determining a low-frequency dictionary base of the first translation model;
determining a second parallel corpus pair from the low-frequency dictionary base of the first translation model, including:
determining target monolingual corpus from a low-frequency dictionary base of the first translation model;
determining a matching monolingual corpus of the target monolingual corpus according to the target monolingual corpus and the first translation model;
and determining a second parallel corpus pair according to the target monolingual corpus and the matching monolingual corpus.
According to training requirements, a preset number of target monolingual corpora can be determined from the low-frequency dictionary base of the first translation model, and the target monolingual corpora are input into the first translation model respectively to obtain the matching monolingual corpora that match them.
For example, when the first translation model is a Chinese-English translation model and the number of required second parallel corpus pairs is 500, 500 Chinese monolingual corpora may be determined from the low-frequency dictionary base of the Chinese-English translation model and input into the model respectively, yielding the 500 corresponding English monolingual corpora and thus 500 Chinese-English parallel corpus pairs (i.e., the second parallel corpus pairs).
It should be noted that the first translation model corresponds to the target monolingual corpus and the matching monolingual corpus; that is, the corpus type of the input samples of the first translation model is the same as the corpus type of the target monolingual corpus, and the corpus type of the output samples is the same as the corpus type of the matching monolingual corpus.
For example, when the first translation model is a Chinese-English translation model, the target monolingual corpus may be a Chinese monolingual corpus, and the matching monolingual corpus may be an English monolingual corpus.
Therefore, some low-frequency words are determined from the low-frequency dictionary base as the target monolingual corpus, and the matching monolingual corpus corresponding to the target monolingual corpus is obtained through the first translation model, so that a second parallel corpus pair composed of low-frequency words is obtained, expanding the comprehensiveness of the training samples of the translation model.
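The construction of pseudo-parallel pairs from low-frequency monolingual corpora can be sketched as follows; the dictionary-backed `toy_model` stands in for the first translation model and is purely illustrative.

```python
# Sketch: pair each target monolingual corpus with the output the
# (first) translation model produces for it. `translate` is any
# callable mapping a source sentence to a target sentence.
def build_pseudo_parallel_pairs(monolingual_corpora, translate):
    return [(src, translate(src)) for src in monolingual_corpora]

# Toy stand-in model for illustration only (not a real translation model).
toy_model = {"你好": "hello", "谢谢": "thank you"}
pairs = build_pseudo_parallel_pairs(["你好", "谢谢"], toy_model.get)
print(pairs)  # [('你好', 'hello'), ('谢谢', 'thank you')]
```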
It should be noted that after the target monolingual corpus is determined from the low-frequency dictionary base of the first translation model, the target monolingual corpus may be updated before the matching monolingual corpus is determined according to the target monolingual corpus and the first translation model.
Optionally, determining the low-frequency dictionary library of the first translation model includes:
And constructing a low-frequency dictionary base according to the candidate monolingual corpus of which the occurrence frequency is greater than a first preset threshold and less than a second preset threshold in the first parallel corpus pair.
Wherein the low frequency dictionary library is determined from the input sample of the first translation model.
For example, when the initial translation model is trained with the first parallel corpus pair, a plurality of Chinese monolingual corpora can be input, and the first translation model is obtained after training is completed; that is, the input samples of the first translation model contain a plurality of Chinese monolingual corpora, and the low-frequency dictionary base can be composed of the low-frequency corpora among them.
For example, the first preset threshold may be set to 100 and the second preset threshold to 300.
For example, the input samples of the first translation model include 50000 monolingual corpora, and the preset low-frequency range may be 100 (the first preset threshold) to 300 (the second preset threshold) occurrences. This range can effectively select the low-frequency monolingual corpora from among the numerous monolingual corpora: if the occurrence frequency of some of the 50000 monolingual corpora falls within the range of 100 to 300 occurrences, those monolingual corpora and their matching corpora can be added to the low-frequency dictionary base.
It should be noted that when the occurrence frequency of a monolingual corpus is less than 100, the monolingual corpus can be discarded and not included in the dictionary base of the first translation model.
Therefore, based on the set first preset threshold and second preset threshold, appropriate monolingual corpora are selected, and a low-frequency dictionary base with complete and stable corpus data is constructed.
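The threshold-based construction of the low-frequency dictionary base can be sketched as follows; the function name is an assumption, the thresholds default to the 100/300 example, and the usage example uses small values purely for illustration.

```python
from collections import Counter

# Sketch (not the disclosed implementation): keep only corpora whose
# occurrence count lies strictly between the first and second preset
# thresholds; counts at or below the first threshold are discarded.
def build_low_freq_dict(tokens, first_threshold=100, second_threshold=300):
    counts = Counter(tokens)
    return {t for t, c in counts.items()
            if first_threshold < c < second_threshold}

# Toy illustration with small thresholds:
sample = ["a"] * 3 + ["b"] * 2 + ["c"]
print(build_low_freq_dict(sample, first_threshold=1, second_threshold=4))
```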
Optionally, the updating of the target monolingual corpus may include the following implementation manners.
Acquiring a random monolingual corpus;
and updating the target monolingual corpus according to the weight of the candidate monolingual corpus and the weight of the random monolingual corpus.
The random monolingual corpus may be a new monolingual corpus different from the target monolingual corpus. Specifically, the random monolingual corpus may be determined by screening out, via the low-frequency dictionary base, corpora that are not among the common words, and may include high-frequency and/or low-frequency monolingual corpora; or it may be obtained from a database other than the low-frequency dictionary base of the first translation model, in which case it may include high-frequency and/or low-frequency monolingual corpora from that other database.
The same or different weights may be set for the candidate monolingual corpus and the random monolingual corpus in order to select an appropriate number of monolingual corpora to update the target monolingual corpus.
Since the candidate monolingual corpora included in the low-frequency dictionary base are low-frequency corpora, the weight of the candidate monolingual corpus can be set larger than that of the random monolingual corpus, ensuring that the target monolingual corpus is weighted toward low-frequency corpora, so that the determined second parallel corpus pair can balance the monolingual corpora in the first parallel corpus pair.
For example, the weight of the candidate monolingual corpus may be set to 0.7, and the weight of the random monolingual corpus may be set to 0.3.
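The 0.7/0.3 mixing step can be sketched as follows; `update_target_corpus` and its random-sampling strategy are assumptions for illustration, not the disclosed procedure.

```python
import random

# Sketch: rebuild the target monolingual corpus as a weighted mix of
# low-frequency candidates (weight 0.7) and random corpora (weight 0.3).
def update_target_corpus(candidates, randoms, total,
                         candidate_weight=0.7, seed=0):
    rng = random.Random(seed)
    n_cand = round(total * candidate_weight)  # e.g. 10 * 0.7 -> 7
    return (rng.sample(candidates, n_cand)
            + rng.sample(randoms, total - n_cand))

mixed = update_target_corpus(list(range(100)), list(range(100, 200)), 10)
print(sum(x < 100 for x in mixed))  # 7 of the 10 come from the candidates
```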
Before the target monolingual corpus is updated according to the weight of the candidate monolingual corpus and the weight of the random monolingual corpus, data cleaning can be carried out on the candidate monolingual corpus and the random monolingual corpus, and completeness of the corpus is guaranteed.
The data cleaning operations may include: corpus deletion, semantic correction, format unification, case alignment, and the like.
For example, when the candidate monolingual corpus and the random monolingual corpus are Chinese monolingual corpora and some of the included sentences are duplicated or redundant, the duplicated or redundant parts may be deleted to implement corpus deletion.
Continuing the above example, after corpus deletion, some sentences in the remaining Chinese monolingual corpora may not read smoothly; such a sentence can be corrected according to its keywords and in combination with the sentences before and after it, thereby implementing semantic correction.
Continuing the above example, after semantic correction, some of the corrected Chinese monolingual corpora may have inconsistent sentence formats, such as line spacing, punctuation, and paragraph spacing; the sentence formats may be unified using a pre-prepared unified format, thereby implementing format unification.
Continuing the above example, after format unification, the letter case in parts of the corpora may be inconsistent; a case alignment operation may be performed using a pre-prepared unified convention, thereby implementing case alignment.
It should be noted that, the sequence of the above-mentioned various cleaning operations in the actual cleaning process is not specifically limited in the embodiments of the present disclosure.
And the corpus quality of the candidate monolingual corpus and the random monolingual corpus can be improved by performing data cleaning on the obtained candidate monolingual corpus and the obtained random monolingual corpus.
Therefore, data cleaning is carried out on the candidate monolingual corpus and the random monolingual corpus, the corpus quality of the determined target monolingual corpus can be effectively guaranteed, and then the corpus quality of the second parallel corpus pair determined based on the target monolingual corpus is improved.
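A minimal sketch of the cleaning operations named above (corpus deletion, format unification, case alignment) follows; semantic correction is context-dependent and is omitted, and the function name and concrete rules are assumptions for illustration.

```python
# Sketch: unify whitespace and punctuation (format unification),
# lower-case text (case alignment), and drop empty or duplicate
# sentences (corpus deletion). Semantic correction is out of scope here.
def clean_corpus(sentences):
    seen, cleaned = set(), []
    for s in sentences:
        s = " ".join(s.split())                     # unify whitespace
        s = s.replace("，", ",").replace("。", ".")  # unify punctuation
        s = s.lower()                               # case alignment
        if s and s not in seen:                     # drop empties/duplicates
            seen.add(s)
            cleaned.append(s)
    return cleaned

print(clean_corpus(["Hello  World。", "hello world.", ""]))  # ['hello world.']
```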
And S1202, determining a target parallel corpus pair according to the second parallel corpus pair and the first parallel corpus pair.
In combination with the above example, when the first parallel corpus pair and the second parallel corpus pair are both Chinese-English corpus pairs, assume that the first parallel corpus pair includes a first Chinese corpus and a second English corpus matching it, and the second parallel corpus pair includes a third Chinese corpus and a fourth English corpus matching it. The first Chinese corpus and the third Chinese corpus may then be combined, based on the matching relationship between the first Chinese corpus and the second English corpus and the matching relationship between the third Chinese corpus and the fourth English corpus, to obtain a combined Chinese corpus; the second English corpus and the fourth English corpus are combined to obtain a combined English corpus; and the corpora may then be paired one by one according to the corpus order in the combined Chinese corpus and the corpus order in the combined English corpus, obtaining the target parallel corpus pair.
For example, the first chinese corpus includes: a1, a2 and A3, the second english corpus including: b1 corresponding to a1, B2 corresponding to a2, and B3 corresponding to A3, and the third chinese corpus includes: c1, C2, and C3, the fourth english corpus comprising: d1 corresponding to C1, D2 corresponding to C2, and D3 corresponding to C3, the obtained target parallel corpus pair includes: A1-B1, A2-B2, A3-B3, C1-D1, C2-D2 and C3-D3.
Since the second parallel corpus pair emphasizes low-frequency corpora, it can effectively balance the corpus frequencies in the first parallel corpus pair, so that the corpora in the determined target parallel corpus pair include both high-frequency and low-frequency corpora, effectively improving the coverage of the target parallel corpus pair.
In addition, the second parallel corpus pair alone may be determined as the target parallel corpus pair, or the first parallel corpus pair alone may be determined as the target parallel corpus pair, thereby saving processing time for the training samples.
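The merging in the A1-B1 through C3-D3 example above can be sketched as follows; representing each parallel corpus pair as a list of (source, target) tuples is an assumption for illustration.

```python
# Sketch: concatenate the two parallel corpus pair lists while keeping
# each source-target matching relation intact, yielding the target
# parallel corpus pair.
def merge_parallel_pairs(first_pairs, second_pairs):
    return list(first_pairs) + list(second_pairs)

first = [("A1", "B1"), ("A2", "B2"), ("A3", "B3")]
second = [("C1", "D1"), ("C2", "D2"), ("C3", "D3")]
print(merge_parallel_pairs(first, second))
```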
The embodiment further provides a complete flow schematic process of translation model training, which can be specifically seen in fig. 4 for example.
This example is illustrated with the training of an English-to-Chinese translation model.
S410, obtaining Chinese-English parallel corpora and cleaning them to obtain Chinese-English parallel corpus pairs.
The obtaining method may include network resource collection, manual construction, and the like; specifically, the number of monolingual corpora in each part of the parallel corpus may be 20 million.
The cleaning operation can normalize the data to address problems such as incomplete corpora, inconsistent punctuation, formats that need to be unified, and letter case that needs to be normalized.
And S420, training a Chinese-English translation model by using the obtained Chinese-English parallel corpus pairs.
Wherein, the existing network model structure can be adopted to train and obtain the Chinese-English translation model.
In the model dictionary settings, the vocabulary for the Chinese-English parallel corpus pairs may contain at most 50008 entries (where the minimum word frequency may be 100, and words appearing fewer than 100 times may be discarded). The words are sorted by frequency from high to low, and words ranked beyond the 50008 entries may be treated as an alternative dictionary, called the low-frequency word dictionary.
S430, decoding the collected new monolingual Chinese corpora with the Chinese-English translation model to obtain the English corresponding to each monolingual Chinese corpus, yielding Chinese-English pseudo-parallel corpus pairs.
The new monolingual Chinese corpora may be the monolingual Chinese corpora in the low-frequency dictionary.
For the large number of new monolingual Chinese corpora, the low-frequency dictionary can be used to screen out the corpora that are not among the common words, and random monolingual Chinese corpora can be obtained from these non-common-word corpora. The random monolingual Chinese corpora include both high-frequency and low-frequency corpora, and their weight can be set to 0.3; that is, the number of random monolingual Chinese corpora is 30% of the number of corpora in the Chinese-English pseudo-parallel corpus pairs.
The low-frequency monolingual Chinese corpora in the low-frequency dictionary can also be obtained, and their weight can be set to 0.7; that is, the number of low-frequency monolingual Chinese corpora is 70% of the number of corpora in the Chinese-English pseudo-parallel corpus pairs.
The data of the obtained Chinese-English pseudo parallel corpus pair can be cleaned, so that the data quality is improved.
S440, training according to the Chinese-English parallel corpus pairs and the Chinese-English pseudo-parallel corpus pairs to obtain an English-Chinese translation model.
The number of training batches in one period can be set to 8; in each period, the Chinese-English parallel corpus pairs are used for model training in batches 1-5, and the Chinese-English pseudo-parallel corpus pairs are used in batches 6-8, thereby realizing alternating training on parallel and pseudo-parallel corpora.
It should be noted that, during training, the ratio of the number of Chinese-English parallel corpora to the number of Chinese-English pseudo-parallel corpora may be 1:3.
Fig. 5 is a schematic structural diagram of a translation model training apparatus provided in this embodiment; the device is configured in the electronic equipment, and can realize the translation model training method in any embodiment of the application. The device specifically comprises the following steps:
an obtaining module 510, configured to obtain a first parallel corpus pair, where the first parallel corpus pair includes: a corpus pair in which a first monolingual corpus matches a second monolingual corpus, or a corpus pair in which the second monolingual corpus matches the first monolingual corpus;
a determining module 520, configured to determine a target parallel corpus pair according to the first parallel corpus pair, where the target parallel corpus pair includes: a second parallel corpus pair, in which the correspondence between the monolingual corpora is the same as the correspondence between the monolingual corpora in the first parallel corpus pair;
And the training module 530 is configured to train the initial translation model according to the target parallel corpus pair to obtain a target translation model.
In this embodiment, optionally, the target parallel corpus pair includes: a first parallel corpus pair and a second parallel corpus pair;
a training module 530 comprising: the device comprises a first determining unit, a second determining unit and a training unit;
the first determining unit is used for determining the training period frequency;
a second determining unit, configured to determine, within the training cycle frequency, a training batch of the first parallel corpus pair and a training batch of the second parallel corpus pair, where the training batches include: training times and training sequence numbers;
and the training unit is used for training an initial translation model according to the first parallel corpus pair and the second parallel corpus pair based on the training batch of the first parallel corpus pair and the training batch of the second parallel corpus pair to obtain a target translation model.
In this embodiment, optionally, the training unit is specifically configured to:
determining a training batch of the first parallel corpus pair and a training batch of the second parallel corpus pair based on the weight of the first parallel corpus pair and the weight of the second parallel corpus pair.
In this embodiment, optionally, the determining module 520 includes: a third determination unit and a fourth determination unit;
a third determining unit, configured to determine a second parallel corpus pair according to the first parallel corpus pair;
and a fourth determining unit, configured to determine a target parallel corpus pair according to the second parallel corpus pair and the first parallel corpus pair.
In this embodiment, optionally, the third determining unit is specifically configured to:
training an initial translation model according to the first parallel corpus pair to obtain a first translation model;
and determining a second parallel corpus pair according to the low-frequency dictionary library of the first translation model.
In this embodiment, optionally, the determining module 520 is further configured to determine a low-frequency dictionary library of the first translation model;
a third determining unit, specifically configured to:
determining target monolingual corpus from a low-frequency dictionary library of the first translation model;
determining a matched monolingual corpus of the target monolingual corpus according to the target monolingual corpus and the first translation model;
and determining a second parallel corpus pair according to the target monolingual corpus and the matching monolingual corpus.
In this embodiment, optionally, the determining module 520 is specifically configured to:
And constructing a low-frequency dictionary base according to the candidate monolingual corpus of which the occurrence frequency is greater than a first preset threshold value and less than a second preset threshold value in the first parallel corpus pair.
In this embodiment, optionally, the apparatus of this embodiment further includes: updating the module;
the obtaining module 510 is further configured to obtain a random monolingual corpus;
and the updating module updates the target monolingual corpus according to the weight of the candidate monolingual corpus and the weight of the random monolingual corpus.
With the translation model training apparatus of the embodiment of the present invention, a first parallel corpus pair is obtained, where the first parallel corpus pair includes a corpus pair in which a first monolingual corpus matches a second monolingual corpus, or a corpus pair in which the second monolingual corpus matches the first monolingual corpus; a target parallel corpus pair is determined according to the first parallel corpus pair, where the target parallel corpus pair may include the first parallel corpus pair and/or a second parallel corpus pair, extending the training samples of the translation model; and the initial translation model is trained according to the target parallel corpus pair to obtain the target translation model, thereby effectively improving the training efficiency of the translation model.
The translation model training device provided by the embodiment of the invention can execute the translation model training method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method.
Fig. 6 is a schematic structural diagram of an electronic device provided in this embodiment, and as shown in fig. 6, the electronic device includes a processor 610, a memory 620, an input device 630, and an output device 640; the number of the processors 610 in the electronic device may be one or more, and one processor 610 is taken as an example in fig. 6; the processor 610, the memory 620, the input device 630 and the output device 640 in the electronic apparatus may be connected by a bus or other means, and the connection by the bus is exemplified in fig. 6.
The memory 620 is used as a computer readable storage medium for storing software programs, computer executable programs, and modules, such as program instructions/modules corresponding to the translation model training method in the embodiment of the present invention. The processor 610 executes various functional applications and data processing of the electronic device by executing software programs, instructions and modules stored in the memory 620, so as to implement the translation model training method provided by the embodiment of the present invention.
The memory 620 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 620 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, the memory 620 can further include memory located remotely from the processor 610, which can be connected to an electronic device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 630 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device, and may include a keyboard, a mouse, and the like. The output device 640 may include a display device such as a display screen.
Embodiments of the present invention also provide a storage medium containing computer-executable instructions which, when executed by a computer processor, implement the translation model training method provided by the embodiments of the present invention.
Of course, the computer-executable instructions contained in the storage medium provided by the embodiments of the present invention are not limited to the method operations described above, and may also perform related operations in the translation model training method provided by any embodiment of the present invention.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present invention may be implemented by software plus necessary general-purpose hardware, and certainly may also be implemented by hardware alone, although the former is the preferred implementation in many cases. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk of a computer, and which includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the above apparatus embodiment, the included units and modules are merely divided according to functional logic; the division is not limited thereto, as long as the corresponding functions can be implemented. In addition, the specific names of the functional units are only for convenience of distinguishing them from each other and are not intended to limit the protection scope of the present invention.
It is noted that, in this document, relational terms such as "first" and "second" are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above description covers only specific implementations of the present embodiments, enabling those skilled in the art to understand or implement them. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the embodiments. Thus, the present embodiments are not intended to be limited to the embodiments shown herein but are to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (11)

1. A translation model training method, characterized by comprising:
obtaining a first parallel corpus pair, the first parallel corpus pair comprising: a corpus pair in which a first monolingual corpus matches a second monolingual corpus, or a corpus pair in which the second monolingual corpus matches the first monolingual corpus;
determining a target parallel corpus pair according to the first parallel corpus pair, wherein the target parallel corpus pair comprises: the first parallel corpus pair and/or a second parallel corpus pair, and the correspondence between the monolingual corpora in the second parallel corpus pair is the same as the correspondence between the monolingual corpora in the first parallel corpus pair; and
training an initial translation model according to the target parallel corpus pair to obtain a target translation model.
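For illustration only (not part of the claim language), the flow of claim 1 can be sketched as follows. The pair format, the pseudo-pair generator, and the placeholder "model" are all assumptions, not the patent's actual implementation:

```python
# Hedged sketch of claim 1: build a second (pseudo-parallel) corpus pair
# from the first, combine them into the target pair, and "train" on it.
def make_pseudo_pairs(first_pairs):
    # Toy stand-in for second-pair generation: each pseudo pair keeps
    # the same source-to-target correspondence as the first pairs.
    return [(src + " (aug)", tgt) for src, tgt in first_pairs]

def train_target_model(first_pairs):
    second_pairs = make_pseudo_pairs(first_pairs)
    # Target pair = first pair and/or second pair (here: both).
    target_pairs = first_pairs + second_pairs
    # Placeholder training step: record what the "model" was trained on.
    return {"num_pairs": len(target_pairs)}
```

Any real implementation would replace the placeholder with actual model updates; the sketch only shows how the two corpus pairs are combined.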
2. The method of claim 1, wherein the target parallel corpus pair comprises the first parallel corpus pair and the second parallel corpus pair; and
the training the initial translation model according to the target parallel corpus pair to obtain a target translation model comprises:
determining a training cycle frequency;
determining a training batch of the first parallel corpus pair and a training batch of the second parallel corpus pair within the training cycle frequency, wherein the training batches comprise: training times and training sequence numbers; and
training the initial translation model according to the first parallel corpus pair and the second parallel corpus pair based on the training batch of the first parallel corpus pair and the training batch of the second parallel corpus pair to obtain the target translation model.
3. The method of claim 2, wherein determining the training batch of the first parallel corpus pair and the training batch of the second parallel corpus pair comprises:
determining the training batch of the first parallel corpus pair and the training batch of the second parallel corpus pair based on a weight of the first parallel corpus pair and a weight of the second parallel corpus pair.
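For illustration only (not part of the claim language), one way to realize the weighted batch scheduling of claims 2-3 is sketched below; the function name, signature, and label scheme are hypothetical:

```python
import random

# Hedged sketch of claims 2-3: within one training cycle, the numbers of
# batches drawn from each corpus pair (training times) are set by the
# pairs' weights, and shuffling fixes the training sequence numbers.
def schedule_batches(n_batches, w_first, w_second, seed=0):
    """Label each batch slot in one cycle as 'first' or 'second',
    in proportion to the two pairs' weights, then fix the order."""
    n_first = round(n_batches * w_first / (w_first + w_second))
    labels = ["first"] * n_first + ["second"] * (n_batches - n_first)
    random.Random(seed).shuffle(labels)  # interleave the two pairs
    return labels
```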
4. The method according to claim 2, wherein determining the target parallel corpus pair according to the first parallel corpus pair comprises:
determining the second parallel corpus pair according to the first parallel corpus pair; and
determining the target parallel corpus pair according to the second parallel corpus pair and the first parallel corpus pair.
5. The method of claim 4, wherein determining the second parallel corpus pair according to the first parallel corpus pair comprises:
training the initial translation model according to the first parallel corpus pair to obtain a first translation model; and
determining the second parallel corpus pair according to a low-frequency dictionary library of the first translation model.
6. The method of claim 5, wherein before determining the second parallel corpus pair according to the low-frequency dictionary library of the first translation model, the method further comprises:
determining the low-frequency dictionary library of the first translation model;
and determining the second parallel corpus pair according to the low-frequency dictionary library of the first translation model comprises:
determining a target monolingual corpus from the low-frequency dictionary library of the first translation model;
determining a matched monolingual corpus of the target monolingual corpus according to the target monolingual corpus and the first translation model; and
determining the second parallel corpus pair according to the target monolingual corpus and the matched monolingual corpus.
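For illustration only (not part of the claim language), claims 5-6 can be sketched as selecting monolingual sentences that contain low-frequency words and pairing them with the model's translations. Whitespace tokenization, the function name, and the caller-supplied translator are all assumptions:

```python
# Hedged sketch of claims 5-6: form the second parallel corpus pair from
# a target monolingual corpus (sentences hitting the low-frequency
# dictionary) and its matched monolingual corpus (model translations).
def build_second_pair(low_freq_words, monolingual_sentences, translate):
    # Target monolingual corpus: sentences containing a low-frequency word.
    targets = [s for s in monolingual_sentences
               if any(w in s.split() for w in low_freq_words)]
    # Matched monolingual corpus: the first translation model's output;
    # the (matched, target) tuples form the second parallel corpus pair.
    return [(translate(t), t) for t in targets]
```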
7. The method of claim 6, wherein determining the low-frequency dictionary library of the first translation model comprises:
constructing the low-frequency dictionary library from candidate monolingual corpora whose occurrence frequency in the first parallel corpus pair is greater than a first preset threshold and less than a second preset threshold.
8. The method of claim 7, wherein before determining the matched monolingual corpus of the target monolingual corpus according to the target monolingual corpus and the first translation model, the method further comprises:
acquiring a random monolingual corpus; and
updating the target monolingual corpus according to a weight of the candidate monolingual corpus and a weight of the random monolingual corpus.
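For illustration only (not part of the claim language), claim 8's weighted update can be read as mixing candidate (low-frequency) sentences with randomly acquired ones in proportion to their weights; the interpretation, names, and signature are all assumptions:

```python
import random

# Hedged sketch of claim 8: rebuild the target monolingual corpus as a
# weighted mixture of candidate and random monolingual sentences.
def update_target_corpus(candidates, randoms, w_cand, w_rand, total, seed=0):
    rng = random.Random(seed)
    n_cand = round(total * w_cand / (w_cand + w_rand))
    picked = rng.sample(candidates, min(n_cand, len(candidates)))
    picked += rng.sample(randoms, min(total - len(picked), len(randoms)))
    return picked
```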
9. A translation model training apparatus, characterized by comprising:
an obtaining module, configured to obtain a first parallel corpus pair, where the first parallel corpus pair comprises: a corpus pair in which a first monolingual corpus matches a second monolingual corpus, or a corpus pair in which the second monolingual corpus matches the first monolingual corpus;
a determining module, configured to determine a target parallel corpus pair according to the first parallel corpus pair, where the target parallel corpus pair comprises: the first parallel corpus pair and/or a second parallel corpus pair, and the correspondence between the monolingual corpora in the second parallel corpus pair is the same as that between the monolingual corpora in the first parallel corpus pair; and
a training module, configured to train an initial translation model according to the target parallel corpus pair to obtain a target translation model.
10. An electronic device, comprising:
one or more processors; and
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the translation model training method according to any one of claims 1 to 8.
11. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the translation model training method according to any one of claims 1 to 8.
CN202210328982.1A 2022-03-30 2022-03-30 Translation model training method and device, electronic equipment and medium Pending CN114757212A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210328982.1A CN114757212A (en) 2022-03-30 2022-03-30 Translation model training method and device, electronic equipment and medium


Publications (1)

Publication Number Publication Date
CN114757212A true CN114757212A (en) 2022-07-15

Family

ID=82329067

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210328982.1A Pending CN114757212A (en) 2022-03-30 2022-03-30 Translation model training method and device, electronic equipment and medium

Country Status (1)

Country Link
CN (1) CN114757212A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111339789A (en) * 2020-02-20 2020-06-26 北京字节跳动网络技术有限公司 Translation model training method and device, electronic equipment and storage medium
CN112215017A (en) * 2020-10-22 2021-01-12 内蒙古工业大学 Mongolian Chinese machine translation method based on pseudo parallel corpus construction
CN112966529A (en) * 2021-04-08 2021-06-15 中译语通科技股份有限公司 Neural network machine translation training method, system, medium, equipment and application
CN113657122A (en) * 2021-09-07 2021-11-16 内蒙古工业大学 Mongolian Chinese machine translation method of pseudo-parallel corpus fused with transfer learning
CN113919372A (en) * 2020-07-10 2022-01-11 南京大学 Machine translation quality evaluation method, device and storage medium


Similar Documents

Publication Publication Date Title
US10303761B2 (en) Method, non-transitory computer-readable recording medium storing a program, apparatus, and system for creating similar sentence from original sentences to be translated
CN108287858B (en) Semantic extraction method and device for natural language
Karimi et al. Machine transliteration survey
KR102417045B1 (en) Method and system for robust tagging of named entities
US9916304B2 (en) Method of creating translation corpus
US8521516B2 (en) Linguistic key normalization
Abdul Rauf et al. Parallel sentence generation from comparable corpora for improved SMT
US20080040095A1 (en) System for Multiligual Machine Translation from English to Hindi and Other Indian Languages Using Pseudo-Interlingua and Hybridized Approach
JP6817556B2 (en) Similar sentence generation method, similar sentence generation program, similar sentence generator and similar sentence generation system
CN103324621B (en) A kind of Thai text spelling correcting method and device
JP2006012168A (en) Method for improving coverage and quality in translation memory system
CN110427618A (en) It fights sample generating method, medium, device and calculates equipment
CN103631772A (en) Machine translation method and device
CN111539229A (en) Neural machine translation model training method, neural machine translation method and device
Lefevre et al. Cross-lingual spoken language understanding from unaligned data using discriminative classification models and machine translation.
WO2022148467A1 (en) Cross-language data enhancement-based word segmentation method and apparatus
CN111339753A (en) Self-adaptive Chinese new word recognition method and system
CN115033753A (en) Training corpus construction method, text processing method and device
US10650195B2 (en) Translated-clause generating method, translated-clause generating apparatus, and recording medium
JP4431759B2 (en) Unregistered word automatic extraction device and program, and unregistered word automatic registration device and program
CN114757212A (en) Translation model training method and device, electronic equipment and medium
CN107168950B (en) Event phrase learning method and device based on bilingual semantic mapping
CN116306594A (en) Medical OCR recognition error correction method
CN115906878A (en) Machine translation method based on prompt
JP2015225662A (en) Personal name unit dictionary extension method, personal name language recognition method, and personal name language recognition device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination