CN114333830A - Simultaneous interpretation model training method, simultaneous interpretation method, device and storage medium - Google Patents


Info

Publication number: CN114333830A
Application number: CN202011062800.8A
Authority: CN (China)
Legal status: Pending
Prior art keywords: model, simultaneous interpretation, training, trained, tuning
Other languages: Chinese (zh)
Inventors: 董修岗, 周祥生, 屠要峰, 黄震江, 徐进
Current Assignee: ZTE Corp
Original Assignee: ZTE Corp
Application filed by ZTE Corp

Abstract

The application discloses a method for training a simultaneous interpretation model, a simultaneous interpretation method, a device, and a storage medium, belonging to the technical fields of simultaneous interpretation and artificial intelligence. The method comprises the following steps: loading a simultaneous interpretation model to be trained, acquiring initial training data, and training the model on the initial data to obtain a basic model; and receiving a model fine-tuning corpus and fine-tuning the basic model on that corpus to obtain the trained simultaneous interpretation model. This technical scheme improves both the practical effect and the robustness of the simultaneous interpretation model.

Description

Simultaneous interpretation model training method, simultaneous interpretation method, device and storage medium
Technical Field
The application relates to the technical fields of simultaneous interpretation and artificial intelligence, and in particular to a method for training a simultaneous interpretation model, a simultaneous interpretation method, a device, and a storage medium.
Background
Simultaneous interpretation, also known as simultaneous translation or synchronous interpretation, refers to an interpretation mode in which an interpreter continuously interprets the content for the audience without interrupting the speaker. Because simultaneous interpretation provides instant translation through dedicated equipment, it is suitable for large seminars and international conferences; under normal circumstances, two to three interpreters work in rotation to sustain it.
As demand for simultaneous interpretation keeps expanding, its cost keeps rising. In more and more scenarios, replacing human interpreters with an artificial-intelligence simultaneous interpretation system both saves labor cost and makes simultaneous interpretation intelligent. Consequently, simultaneous interpretation systems based on various artificial intelligence technologies are gradually appearing on the market.
Existing systems on the market usually decompose the whole simultaneous interpretation task into sub-tasks such as speech recognition, machine translation, and text-to-speech. However, practice has shown that, because real application scenes contain various kinds of noise and speakers often pronounce irregularly, such systems make a certain number of interpretation errors, and the final effect is not ideal.
Disclosure of Invention
The embodiments of the present application mainly aim to provide a method for training a simultaneous interpretation model, a simultaneous interpretation method, a device, and a storage medium, with the goal of improving the practical effect and the robustness of the simultaneous interpretation model.
To achieve the above object, an embodiment of the present application provides a method for training a simultaneous interpretation model, comprising the following steps: loading a simultaneous interpretation model to be trained, acquiring initial training data, and training the model on the initial data to obtain a basic model; and receiving a model fine-tuning corpus and fine-tuning the basic model on that corpus to obtain a trained simultaneous interpretation model.
To achieve the above object, an embodiment of the present application provides a simultaneous interpretation method, comprising the following steps: receiving input voice information and loading a trained simultaneous interpretation model, where the simultaneous interpretation model is obtained by the training method described above; and inputting the voice information into the trained simultaneous interpretation model so that the output text information is displayed in a corresponding display frame.
To achieve the above object, an embodiment of the present application further provides a computer device comprising a memory and a processor; the memory stores a computer program, and the processor executes the computer program to implement the steps of the training method and/or the simultaneous interpretation method described above.
To achieve the above object, the present application provides a computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the steps of the training method and/or the simultaneous interpretation method described above.
In the training method, simultaneous interpretation method, device, and storage medium described above, training proceeds in two stages. First, initial data for pre-training is acquired, and the simultaneous interpretation model to be trained is trained on that data to obtain a corresponding basic model. The basic model can already perform simultaneous interpretation, but it is a generic model that has not been customized to the scene in which it will be used. Therefore, to improve the accuracy and efficiency of simultaneous interpretation, a fine-tuning corpus is received after the basic model is obtained, and the basic model is directionally fine-tuned on that corpus to produce the final trained simultaneous interpretation model. In this way, directional fine-tuning gives the resulting model a better effect in the specific scene, while the secondary, fine-tuning stage of training also improves the robustness of the model.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart illustrating a training method of a simultaneous interpretation model according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating the steps for training a base model according to an embodiment of the present disclosure;
fig. 3 is a graph illustrating a learning rate variation trend according to an embodiment of the present disclosure;
FIG. 4 is a flowchart illustrating the steps for obtaining a base model from training samples according to an embodiment of the present application;
FIG. 5 is a flowchart illustrating the steps provided in one embodiment of the present application for validating a base model;
FIG. 6 is a flowchart illustrating the steps of fine-tuning a base model according to an embodiment of the present application;
FIG. 7 is a block diagram illustrating a flow chart of a model training process according to an embodiment of the present application;
fig. 8 is a schematic block diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments that a person skilled in the art can derive from these embodiments without creative effort fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative: they need not include every element or operation/step, nor must the operations be performed in the order depicted. For example, some operations/steps may be split or combined, so the actual execution order may change according to the actual situation.
As shown in fig. 1, fig. 1 is a schematic flowchart of a training method of a simultaneous interpretation model according to an embodiment of the present application, the method including the following steps:
s101, loading a simultaneous interpretation model to be trained, acquiring initial data for training, and training the simultaneous interpretation model according to the initial data to obtain a basic model.
A system or device that realizes the simultaneous interpretation function must be able, while a speaker is giving a speech or talking, to interpret the speech quickly and accurately so that the result can be shown on a corresponding display interface.
In practice, simultaneous interpretation is typically used during live broadcasting or playback, where not every viewer can understand what the speaker says. For example, when a speaker talks in Chinese, foreigners who do not know Chinese cannot follow; the speech therefore needs to be interpreted, and the translated English content is displayed on the interface. A device that can realize simultaneous interpretation accurately is thus needed.
In an embodiment, when a model training instruction is detected, the device or server responsible for model training loads the simultaneous interpretation model to be trained, acquires initial training data, and trains the model on that data. The model training instruction triggers the corresponding model training function; it may be triggered by a user operation or automatically by the server, which is not limited here.
It should be noted that the loaded simultaneous interpretation model to be trained is built on the Transformer network architecture.
In an actual simultaneous interpretation scene, voice information and text information are usually presented at the same time: the speaker's voice is output directly, accompanied by text output on the display interface, and that text is produced by the trained simultaneous interpretation model.
Training the simultaneous interpretation model on the received initial data yields a corresponding basic model. This basic model could be applied to simultaneous interpretation directly, but because it has not been further optimized it may perform poorly, for example interpreting inaccurately or inefficiently. Therefore the basic model is generally not used as-is; further model optimization is performed so that the final model has a better effect.
Referring to fig. 2, fig. 2 is a flowchart illustrating a procedure of training to obtain a base model according to an embodiment of the present application.
In one embodiment, after loading the simultaneous interpretation model to be trained and acquiring the initial data for training, the loaded simultaneous interpretation model to be trained is trained. Therefore, as shown in fig. 2, the step of training the simultaneous interpretation model to be trained includes steps S201 to S204.
Step S201, obtaining video data carrying text information, and performing audio extraction on the video data to obtain audio information.
Before the initial data is used to train the simultaneous interpretation model, it is preprocessed, and the preprocessed data is used for training. In practice, the input initial data is related to simultaneous interpretation, for example video data from various simultaneous interpretation scenes. The video data includes corresponding text information: the audio is the data spoken by the speaker, while the text is the data for presentation, such as the text shown on the display interface.
Therefore, when input video data carrying text information is received, audio extraction is performed on it to obtain the corresponding audio information. The carried text information corresponds to the audio information, and the exact correspondence can be determined from time stamps.
In practice, the initial training data is usually video data containing target-language subtitles and source-language speech. The source language is the language used by the speaker, and the source-language speech is the audio information; the target-language subtitles are the text displayed on the interface, such as Chinese or English sentences. In most cases the speaker speaks language 1 while the display interface must show text in language 2, for example Chinese speech with the corresponding English text displayed. The initial data thus contains all the information needed for model training.
It should be noted that the initial data is on the order of gigabytes; in general, to ensure sufficient training samples and accurate training, the initial data size may be set to 10 GB or 20 GB. The audio and text contained in the video may be in the same language (e.g., Chinese speech with Chinese display) or in two different languages (e.g., Chinese speech with English display).
After the video data is received, audio extraction yields the audio information it contains, which is the voice input of the speaker during simultaneous interpretation. The video also contains the text information, i.e., the interpreted text corresponding to that voice, and there is a correspondence between the two. The extracted audio and the carried text can therefore be used to train the model directly, without operations such as speech conversion.
Step S202: perform time calibration and association on the audio information according to the text information, so as to obtain training samples in which the text is re-associated with the audio.
After the audio information is extracted from the video data, it still corresponds to the text information carried in the video, but that correspondence is lost during extraction; the association between audio and text is therefore re-established once extraction is complete.
When the text information is obtained, it must be processed to remove and clean useless content, including operations such as punctuation normalization, full-width/half-width normalization, and impurity removal. At the same time, the text is segmented with a subword technique, and a corresponding subword vocabulary can be built during segmentation. The idea of subwords is to represent rare, structurally complex words as combinations of a few subword units; this shortens the vocabulary and mitigates the OOV (out-of-vocabulary) problem.
The subword table form is as follows:
TABLE 1
Subword Word frequency
c@@ 352824
account 351725
groups 351454
framework 351102
capac@@ 20984
rever@@ 20983
... ...
As can be seen from Table 1, the vocabulary contains the "@@" marker, which indicates that the entry is a subword. For example, "c@@" stands for an initial piece "c" followed by further letters: three-letter words such as "caa" and "cdd" both begin with "c@@". The vocabulary is ordered from top to bottom by word frequency.
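The subword segmentation described above can be sketched as follows. This is an illustrative greedy longest-match implementation with the "@@" continuation marker shown in Table 1; the `segment` function and the small vocabulary are assumptions for demonstration, not the patent's actual word list or algorithm.

```python
# Hypothetical sketch: apply a subword vocabulary with "@@" markers.
def segment(word, vocab):
    """Greedily split `word` into known subwords; non-final pieces
    carry an "@@" marker, mirroring the entries in Table 1."""
    pieces, rest = [], word
    while rest:
        # find the longest prefix of `rest` present in the vocabulary
        for end in range(len(rest), 0, -1):
            prefix = rest[:end]
            unit = prefix if end == len(rest) else prefix + "@@"
            if unit in vocab or end == 1:  # fall back to single chars
                pieces.append(unit)
                rest = rest[end:]
                break
    return pieces

vocab = {"capac@@", "ity", "framework", "c@@", "at"}
print(segment("framework", vocab))  # ['framework']
print(segment("capacity", vocab))   # ['capac@@', 'ity']
print(segment("cat", vocab))        # ['c@@', 'at']
```

A rare word like "capacity" is thus covered by two frequent units instead of occupying its own vocabulary entry, which is how the subword table shortens the vocabulary and reduces OOV.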
For example, in the video data, each time a piece of audio is played, a corresponding piece of text is shown on the display interface. Therefore, after the audio information is extracted from the video data, it is time-calibrated against the text information to obtain the training samples for training the simultaneous interpretation model.
Because aligning the extracted audio with the text requires reference information, and in an embodiment the alignment is performed by time stamp, the time stamp information contained in the audio is extracted along with the audio itself, yielding audio information that carries time stamps.
During time calibration, the time stamps carried by the text information are used to establish the correspondence with the extracted audio: text and audio that share the same time stamp are associated and aligned. Each associated text/audio pair then serves as one training sample, and aligning all the audio and text yields the training samples for the final model training.
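The timestamp-based association of step S202 can be sketched as below. The segment dictionaries (with `start`-time keys) and the tolerance parameter are illustrative assumptions; the patent only specifies that audio and text sharing a time stamp are paired.

```python
# Illustrative sketch: re-associate extracted audio segments with
# subtitle text by time stamp to form (audio, text) training samples.
def align_by_timestamp(audio_segments, text_segments, tolerance=0.5):
    """Pair each text segment with the audio segment whose start
    time is closest, accepting matches within `tolerance` seconds."""
    samples = []
    for text in text_segments:
        best = min(audio_segments,
                   key=lambda a: abs(a["start"] - text["start"]))
        if abs(best["start"] - text["start"]) <= tolerance:
            samples.append((best, text))
    return samples

audio = [{"start": 0.0, "wav": "seg0"}, {"start": 5.2, "wav": "seg1"}]
text = [{"start": 0.1, "text": "hello"}, {"start": 5.0, "text": "world"}]
pairs = align_by_timestamp(audio, text)
print([(a["wav"], t["text"]) for a, t in pairs])
# [('seg0', 'hello'), ('seg1', 'world')]
```

Each returned pair corresponds to one training sample of associated audio and text.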
Step S203, receiving basic parameters, and setting parameters of the simultaneous interpretation model to be trained based on the basic parameters.
Before training, the simultaneous interpretation model to be trained needs its basic parameters set so that training can be completed well. Therefore, once the training samples have been generated, the input basic parameters are received, the model is configured according to them, and the configured model is then trained on the obtained training samples.
As described above, the loaded simultaneous interpretation model to be trained is based on the Transformer network architecture, so the basic parameters to set include: the number of Encoder layers, the number of Decoder layers, the number of hidden-layer neurons, the learning rate, the batch size, the decoder length, and so on. Setting each basic parameter before training allows the training to be completed faster and more accurately.
In practice the basic parameters can be set according to actual requirements; a typical configuration is shown in Table 2 below:
TABLE 2
Parameter name                          Parameter value
Number of Encoder layers                6
Number of Decoder layers                6
Number of hidden-layer neurons          512
Number of multi-head attention heads    8
Learning rate                           Decaying learning rate with warm-up
Batch size                              Fixed at 4096 words
Decoder length                          256
In practice, the numbers of Encoder and Decoder layers can generally be set between 6 and 10, and the number of hidden-layer neurons between 512 and 1024; considering actual training cost, efficiency, and accuracy, the values recorded in Table 2 may be used.
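The parameter setting of Table 2 can be expressed as a simple configuration sketch. The dictionary keys below are illustrative names, not an actual framework API; the values follow the table and the ranges stated in the text.

```python
# Hedged sketch of the basic-parameter configuration in Table 2.
base_params = {
    "encoder_layers": 6,            # 6-10 layers per the text
    "decoder_layers": 6,            # 6-10 layers per the text
    "hidden_size": 512,             # 512-1024 per the text
    "attention_heads": 8,           # multi-head attention
    "learning_rate": "warm_up",     # decaying schedule with warm-up
    "batch_size_words": 4096,       # fixed word count per batch
    "decoder_length": 256,
}

# Sanity checks against the ranges given in the text.
assert 6 <= base_params["encoder_layers"] <= 10
assert 512 <= base_params["hidden_size"] <= 1024
```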
Among the set parameters, the learning rate uses a decaying schedule with warm-up. Warm-up is a method of pre-heating the learning rate; the update rule for warm_up_lr is:
warm_up_lr = hidden_size^(-0.5) · min(step^(-0.5), step · warm_up_steps^(-1.5))
where hidden_size is the number of hidden-layer neuron nodes, a constant; warm_up_steps is a preset constant, typically set to 10000; and step is the current training step.
Since step is the only variable in the formula, warm_up_lr, used as the learning rate during training, changes after every step, and the new value is used for the next step. When step reaches the set warm_up_steps value, the learning rate begins to decay smoothly. The trend is shown in Fig. 3.
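The schedule above can be sketched directly from the formula. The defaults (hidden_size = 512, warm_up_steps = 10000) follow the text; this is the standard Transformer-style warm-up schedule, not code from the patent itself.

```python
# Sketch of the warm_up_lr schedule described above.
def warm_up_lr(step, hidden_size=512, warm_up_steps=10000):
    """Learning rate rises while step < warm_up_steps, then decays
    smoothly as step ** -0.5 (see Fig. 3)."""
    return hidden_size ** -0.5 * min(step ** -0.5,
                                     step * warm_up_steps ** -1.5)

# The rate peaks exactly at step == warm_up_steps:
rates = [warm_up_lr(s) for s in (1, 5000, 10000, 20000, 100000)]
assert rates[0] < rates[1] < rates[2] > rates[3] > rates[4]
```

The two terms inside `min` cross at step = warm_up_steps, which is why the curve rises during warm-up and decays afterwards.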
And step S204, training the simultaneous interpretation model to be trained after parameter setting is completed according to the training samples to obtain a basic model.
After the parameters of the simultaneous interpretation model to be trained have been set, the configured model is trained on the previously obtained training samples, and the basic model is obtained from this training.
The basic model is obtained based on initial data training, and in the actual use process, the basic model can also be used as a trained simultaneous interpretation model, but the problem of poor effect may exist. Therefore, in an embodiment, the obtained base model is not directly used as the final trained simultaneous interpretation model, and after the base model is obtained, further adjustment and training of the base model are required to make the final obtained model meet the actual application requirements.
In an embodiment, the obtained training samples are input into the simultaneous interpretation model whose parameters have been set, and the model is trained on them; when training has progressed far enough, the model obtained from the current training is taken as the basic model.
During training the model is adjusted continuously; for example, some of its training parameters are optimized step by step as training proceeds, so that the model eventually settles into a reasonable state. Training is generally considered complete when the model converges, and in practice there are many ways to decide convergence: for example, the model may be considered converged when the number of training iterations reaches a preset count, when one or more specific parameters satisfy preset conditions, or when the loss computed on the training and validation sets satisfies a preset condition. Since many convergence criteria exist, it suffices to choose a suitable one, and the criterion is not limited here.
Illustratively, when the training samples are used to train the configured model to obtain the basic model, the idea of minimum risk training can be used, and whether the current training is complete is determined by computing the current loss value from the loss function.
Referring to fig. 4, fig. 4 is a flowchart illustrating a step of obtaining a base model according to a training sample according to an embodiment of the present application.
After the training samples are obtained by processing the initial data, the configured simultaneous interpretation model is trained on them; since actual model training completes the final model through both training and testing, step S204 includes steps S401 to S403.
Step S401, obtaining a training set sample from the training samples, and inputting the training set sample into the simultaneous interpretation model to be trained after parameter setting is completed.
The training samples are first divided accordingly, for example into training-set, validation-set, and test-set samples. Different sample types play different roles during model training: the training-set samples are used to train the model, while the validation-set and test-set samples are used to validate and test the trained model. Accordingly the sample counts differ, and in general the training set is far larger than the validation and test sets; for example, the ratio of training, validation, and test samples may be 9 : 0.5 : 0.5, and when there is no test set, the ratio of training samples to validation samples may be 9 : 1.
In an embodiment, when the configured simultaneous interpretation model is trained, training-set samples are taken from the training samples and input into the model that needs to be trained, i.e., the simultaneous interpretation model whose parameters have been set. Through continuous training the model keeps adjusting its own model parameters, so that a model meeting the requirements can be obtained.
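The 9 : 0.5 : 0.5 division described above can be sketched as a simple sequential split. The `split_samples` helper is an illustrative assumption; in practice the samples would typically be shuffled first.

```python
# Illustrative split of the training samples into train / validation
# / test sets in the 9 : 0.5 : 0.5 ratio described in the text.
def split_samples(samples, ratios=(0.9, 0.05, 0.05)):
    n = len(samples)
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    train = samples[:n_train]
    val = samples[n_train:n_train + n_val]
    test = samples[n_train + n_val:]
    return train, val, test

train, val, test = split_samples(list(range(1000)))
print(len(train), len(val), len(test))  # 900 50 50
```

With no test set, `ratios=(0.9, 0.1, 0.0)` gives the 9 : 1 division mentioned in the text.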
Illustratively, training a model is a continuous self-adjusting process: by repeatedly adjusting the model parameters, the model is made to fit all the sample data used for training. Each piece of training sample data carries an identifier or label, for example data A corresponds to label a and data B corresponds to label b. By feeding in the training data and adjusting the parameters again and again, the final model outputs label a when data A is input and label b when data B is input; likewise, with more training data, the labels of more data can be output correctly.
Step S402: obtain validation-set samples from the training samples, and determine whether the trained simultaneous interpretation model has converged.
As described above, the training samples may be divided in a certain ratio into different classes of samples serving different functions. Therefore, while the simultaneous interpretation model is trained on the training-set samples, the validation-set samples are also obtained. Moreover, training does not run endlessly: during training it must be determined whether the trained model has converged; when it has, the current training is finished, and when it has not, training continues.
In one embodiment, whether the trained simultaneous interpretation model has converged is decided using the idea of minimum risk training: a loss function describes the degree of difference between the model output and the standard, and training seeks a set of parameters that minimizes the expected value of the loss (i.e., the risk) of the model. Convergence is thus judged by the model's expected loss.
If the model input is x^(n), the standard is y^(n), and the predicted output of the model is y, the corresponding expected loss (risk) is:
R(θ) = Σ_n Σ_{y ∈ Y(x^(n))} P(y | x^(n); θ) · Δ(y, y^(n))
where Y(x^(n)) denotes the set of all possible outputs for x^(n), P(y | x^(n); θ) is the probability the model assigns to candidate y, and Δ(y, y^(n)) is the loss of candidate y against the standard answer.
A minimum-risk example is shown in Table 3 below. Assume that for input x^(n) the output set Y(x^(n)) contains y1, y2, and y3. For each candidate output, the loss against the standard answer can be calculated; in this example the losses of the three candidates are -1.0, -0.3, and -0.5 respectively. That is, the standard answer regards y1 as best, y3 as next, and y2 as worst. The goal of minimum risk training is to find a set of model parameters that minimizes the expected value of the loss. Four sets of probability distributions are given in Table 3:
TABLE 3

  Distribution   P(y1)   P(y2)   P(y3)   Risk
  1              0.2     0.5     0.3     -0.50
  2              0.3     0.2     0.5     -0.61
  3              0.5     0.2     0.3     -0.71
  4              0.7     0.1     0.2     -0.83

(The probability values here are reconstructed so that each row reproduces the risk value discussed below.)
The first set of probability distributions ranks the candidates $y_2 > y_3 > y_1$, which contradicts the standard answer, and therefore yields a high (least negative) risk value of $-0.50$.

The second set ranks them $y_3 > y_1 > y_2$; its agreement with the standard answer improves relative to the first set, so it obtains a lower risk value of $-0.61$.

The third set of probability distributions ranks the candidates $y_1 > y_3 > y_2$, consistent with the standard answer, further reducing the risk value to $-0.71$.

The fourth set keeps the ordering consistent while increasing the probability of the optimal output $y_1$, thereby reducing the risk value to $-0.83$.
It can be seen that minimum risk training considers a good set of parameters to be one whose ranking of all candidates is as consistent as possible with the standard answer, and the loss function defines how that ranking is computed.
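The expected-risk computation above can be sketched in a few lines; the losses are those of the example, and the four probability distributions are the reconstructed values that reproduce the quoted risk values.

```python
def expected_risk(probs, losses):
    """Expected loss (risk) of a candidate distribution: sum over y of P(y) * Delta(y)."""
    return sum(p * l for p, l in zip(probs, losses))

# Losses of candidates y1, y2, y3 against the standard answer (from the example).
losses = [-1.0, -0.3, -0.5]

# Four candidate probability distributions over (y1, y2, y3); these values are
# an assumed reconstruction that reproduces the risk values quoted in the text.
distributions = [
    [0.2, 0.5, 0.3],
    [0.3, 0.2, 0.5],
    [0.5, 0.2, 0.3],
    [0.7, 0.1, 0.2],
]

for i, dist in enumerate(distributions, 1):
    print(f"distribution {i}: risk = {expected_risk(dist, losses):+.2f}")
# The fourth distribution attains the lowest risk, -0.83.
```

Minimum risk training adjusts the model parameters so that the distribution it assigns over candidates moves toward the fourth kind.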
Whether the model converges may be determined from the expected loss of the model obtained by the current round of training. For example, an expected-value threshold A is set; by comparing the expected loss X of the current model with A, the model is judged to have converged when X is less than or equal to A, and not to have converged otherwise.
And S403, when the trained simultaneous interpretation model is determined to be converged, verifying the trained simultaneous interpretation model according to the verification set sample, and obtaining a basic model when the verification is passed.
When the trained simultaneous interpretation model is determined to have converged, model training at the current stage is considered finished. The model obtained by this round of training is then further verified against the verification set samples obtained in advance, and when the verification passes, the corresponding basic model is obtained.

In general, verifying a model means determining whether the model to be verified can produce accurate output, so the model is verified using corresponding verification data. Illustratively, after the simultaneous interpretation model trained on the training set samples is obtained, it is evaluated on the verification set samples to verify the trained model.
Referring to FIG. 5, FIG. 5 is a schematic flowchart of the steps of performing verification to obtain a basic model, provided in an embodiment of the present application; step S403 includes steps S501 to S503.
Step S501, when the fact that the trained simultaneous interpretation model is converged is determined, obtaining a plurality of groups of model parameters corresponding to a plurality of simultaneous interpretation models in a convergence state when training is carried out based on the training set sample;
step S502, inputting the verification set sample into the trained simultaneous interpretation model, and recording a BLEU value corresponding to the verification set sample so as to determine whether the trained simultaneous interpretation model is stable according to the BLEU value;
and S503, when the trained simultaneous interpretation model is determined to be stable, carrying out weight fusion on the plurality of groups of model parameters to obtain a basic model according to the model parameters after weight fusion.
When the trained simultaneous interpretation model is determined to have converged, model training at the current stage is complete; however, the model obtained at this point does not necessarily meet the actual use requirements, so the obtained simultaneous interpretation model is verified once convergence is determined.
In an embodiment, when the trained simultaneous interpretation model is determined to have converged, a plurality of sets of model parameters are obtained. These are the parameters of the models produced over a period of time after convergence: for example, if the model is trained 10 more times after convergence is determined, the model parameters corresponding to those 10 trained models are collected. Meanwhile, during verification, the verification set samples are input into the trained simultaneous interpretation model and the corresponding BLEU value is recorded, and whether the model is stable is determined from the obtained BLEU values. Finally, when the model is determined to be stable, weight fusion is performed on the sets of model parameters obtained in advance, and the basic model is obtained from the model parameters after weight fusion.
BLEU stands for Bilingual Evaluation Understudy, an auxiliary tool for assessing the quality of machine translation. Simultaneous interpretation involves a translation step, and the BLEU value is a good measure of translation quality. Over the whole training process, the model must not only express language accurately but also translate sentences accurately: convergence gives the model good language-expression ability, and once that ability is established, the model's translation ability is optimized, so that the final simultaneous interpretation model has both good language expression and good translation ability.
When the basic model is finally obtained, it is derived from the model parameters of the plurality of models in the stable period, rather than by directly taking the first stable model as the basic model. The parameters of a model are what training adjusts so that the model meets the actual application requirements; to obtain the basic model, the parameters of the plurality of converged models in the stable state are weight-fused to produce the basic model's parameters. The specific way of performing the weight fusion is not limited: for example, an average may be computed over the model parameters, and the average of each parameter used as the corresponding parameter of the basic model.
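As a minimal sketch of the averaging variant of weight fusion (the function name and the dict-of-arrays checkpoint format are assumptions for illustration, not the patent's actual implementation):

```python
import numpy as np

def fuse_checkpoints(checkpoints):
    """Average several sets of model parameters element-wise.

    `checkpoints` is a list of dicts mapping parameter name -> np.ndarray,
    e.g. the parameters saved at several converged training steps.
    """
    fused = {}
    for name in checkpoints[0]:
        fused[name] = np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
    return fused

# Example: three converged checkpoints of a single 2x2 weight matrix.
ckpts = [{"w": np.full((2, 2), v)} for v in (1.0, 2.0, 3.0)]
print(fuse_checkpoints(ckpts)["w"])  # every entry is 2.0
```

Other fusion schemes (e.g. weighting later checkpoints more heavily) fit the same interface.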
And S102, receiving a model fine-tuning corpus, and fine-tuning the basic model based on the model fine-tuning corpus to obtain a trained simultaneous interpretation model.
After pre-training of the simultaneous interpretation model to be trained is completed and the basic model is obtained, the basic model may still have certain problems, such as inaccurate or inefficient simultaneous interpretation. The basic model therefore needs further processing, such as further adjustment, so that the final model realizes simultaneous interpretation better.

In practical applications, the scenes in which simultaneous interpretation is used are rather limited, as are the people who use the related devices: the speaker is more likely to be, for example, a news presenter or a national leader. At a press conference, the speech of the presenter needs to be translated and displayed on a television interface in real time; likewise, when a national leader meets a foreign leader, the speech also needs to be translated and displayed in the live broadcast. Therefore, the simultaneous interpretation model can be fine-tuned for different scenes and different people, so that the final model realizes translation and expression better.
In an embodiment, in order to make the trained simultaneous interpretation model have a better simultaneous interpretation effect, after the basic model is obtained, the input model fine tuning corpus is received, and then the pre-trained basic model is trained and fine-tuned according to the received model fine tuning corpus, so as to finally obtain the trained simultaneous interpretation model.
When the basic model is fine-tuned, the fine tuning can be performed for a plurality of times according to different requirements, wherein the fine tuning frequency is not limited and is determined according to actual requirements.
Illustratively, the basic model can be fine-tuned for different scenes, for different people, or for both scenes and individuals. That is, the basic model can be fine-tuned according to a single condition or a combination of several different conditions. Fine-tuning the basic model under different conditions allows the final model to accomplish simultaneous interpretation better.
Referring to fig. 6, fig. 6 is a flowchart illustrating a step of fine-tuning a base model according to an embodiment of the present application.
When the basic model is fine-tuned, customized fine-tuning is performed according to different requirements so that the final model has a better usage effect. Fine-tuning is therefore not limited to the two rounds based on the first fine-tuning corpus and the second fine-tuning corpus mentioned below; in practical applications, neither the number of rounds nor the direction of fine-tuning is limited.
In one embodiment, when tuning the base model, step S102 includes step S601 to step S602.
Step S601, receiving a model fine-tuning corpus, and preprocessing the model fine-tuning corpus to obtain a model fine-tuning sample;
step S602, inputting the model fine tuning sample into the basic model for training, and obtaining a trained simultaneous interpretation model when the trained basic model is determined to be converged.
After the basic model is obtained, customized fine-tuning is performed on it according to actual application requirements so that the fine-tuned basic model better meets those requirements. When the model fine-tuning corpus is received, it is preprocessed to obtain model fine-tuning samples; the samples are input into the pre-obtained basic model to train it, and the trained simultaneous interpretation model is obtained when the model is determined to have converged during training.
In an embodiment, when a model fine-tuning corpus for fine-tuning a base model is received, corresponding preprocessing is performed on the model fine-tuning corpus first, so that data obtained after preprocessing can be used for model training. It should be noted that the model fine-tuning corpus may include a plurality of different corpora to perform fine-tuning in different fine-tuning directions.
When the model fine-tuning corpus is preprocessed, useless information in it is removed so that it does not interfere with model training. The preprocessing therefore includes: extracting the audio information and text information corresponding to the model fine-tuning corpus, so that model fine-tuning samples are obtained based on the audio information and the text information. Illustratively, during preprocessing, speech features are extracted to obtain the audio information, the most commonly used speech features being Mel-scale Frequency Cepstral Coefficients (MFCC); the text data in the corpus is cleaned to obtain text information free of useless content; and the audio information and text information are then aligned through their corresponding timestamps, yielding the model fine-tuning samples used to train the basic model.
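A minimal sketch of the timestamp-based alignment step (the segment format, field names, and tolerance are assumptions for illustration; real MFCC extraction would use a speech-processing library):

```python
def align_by_timestamp(audio_segments, text_segments, tolerance=0.2):
    """Pair audio feature segments with text segments whose start times
    match within `tolerance` seconds, producing (features, text) samples."""
    samples = []
    for a in audio_segments:
        for t in text_segments:
            if abs(a["start"] - t["start"]) <= tolerance:
                samples.append((a["features"], t["text"]))
                break
    return samples

# Toy inputs: pretend MFCC vectors with start timestamps, and cleaned text
# segments with their own timestamps.
audio = [{"start": 0.0, "features": [0.1, 0.2]},
         {"start": 1.5, "features": [0.3, 0.4]}]
text = [{"start": 0.05, "text": "hello"},
        {"start": 1.48, "text": "world"}]
print(align_by_timestamp(audio, text))
```

Each returned pair is one model fine-tuning sample: the speech features as input, the aligned text as its label.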
After the model fine-tuning samples for fine-tuning are obtained, they are input into the basic model to train and fine-tune it; when the trained, fine-tuned basic model is determined to have converged, the converged model is output as the trained simultaneous interpretation model.

Whether the trained basic model has converged may be determined from the BLEU value it outputs: for example, convergence is determined when the obtained BLEU value is greater than a preset threshold, and otherwise the model is judged not to have converged, in which case training on the model fine-tuning samples continues until the resulting model converges.
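A sketch of this BLEU-threshold convergence check (the threshold value and the patience window are assumed example settings; in practice the BLEU values would come from a standard evaluation toolkit):

```python
def has_converged(bleu_history, threshold=30.0, patience=3):
    """Judge convergence from recorded validation BLEU values:
    converged once the last `patience` evaluations all exceed `threshold`."""
    recent = bleu_history[-patience:]
    return len(recent) == patience and all(b > threshold for b in recent)

history = [12.4, 21.7, 28.9, 30.5, 31.2, 31.1]
print(has_converged(history))      # last three all above 30.0 -> True
print(has_converged(history[:4]))  # not yet stable -> False
```

Requiring several consecutive evaluations above the threshold, rather than a single one, also reflects the stability check described for the verification stage.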
Illustratively, training the basic model using the model fine-tuning samples includes: inputting a first model fine-tuning sample of the model fine-tuning samples into the basic model, and obtaining an intermediate model when the basic model trained on the first model fine-tuning sample is determined to have converged; and inputting a second model fine-tuning sample of the model fine-tuning samples into the intermediate model, and obtaining the trained simultaneous interpretation model when the intermediate model trained on the second model fine-tuning sample is determined to have converged.
In practical applications, the model fine-tuning corpus may include a plurality of corpora of different categories for different scenes or individuals. Therefore, when the corpus is preprocessed, the resulting model fine-tuning samples may include a first model fine-tuning sample and a second model fine-tuning sample that differ only in the corpora they contain: for example, the first fine-tuning corpus is a scene corpus and the second is a speaker corpus. That is, different scenes and different people are used to fine-tune the basic model so that the final model better fits a specific scene and a specific person.

For speaker adaptation, i.e. distinguishing speakers based on their voice information, the speaker's timbre can be used as the second model fine-tuning sample to further fine-tune the model and establish the uniqueness of the speaker, so that the final model is better suited to the person currently using it.
During actual training, each round must converge before the next round starts: when the basic model is trained on the first model fine-tuning sample, the second round of training and fine-tuning on the second model fine-tuning sample begins only after the first round is determined to have converged, and the trained simultaneous interpretation model is finally obtained when the second round converges.

For example, when the basic model is trained on the model fine-tuning samples, the basic parameters of the basic model may be adjusted accordingly; for instance, the set learning rate may be modified, such as adjusting the learning rate of the basic model to 0.5 times its current value. Other basic parameters may likewise be adjusted according to actual requirements, or left unchanged, as those requirements dictate.
In practical applications, customized fine-tuning is usually performed for different actual scenes and different people, so the model fine-tuning samples include scene fine-tuning samples and character fine-tuning samples. For scenes, combined with the specific application scenarios of simultaneous interpretation, the actual scenes include a meeting-room scene, an outdoor scene, a scene with strong echo, and so on, and the corresponding corpus information can be obtained according to the actual application scenario.
For example, if the first round of fine-tuning targets the scene and the second targets the person, then for the first round a first model fine-tuning sample is obtained, which specifically includes: corpus containing noise of the specific domain (for example, if the simultaneous interpretation system is to be used outdoors in a very windy area with heavy foot traffic, wind-noise data is added), incomplete corpus, and a portion of high-quality samples.
Incomplete corpus can be understood as man-made negative samples that improve the system's ability to restore incomplete speech: for example, sampling is deliberately dropped from a segment of normal speech, or part of the speech is repeated, and the correct label is then attached, so that the trained model can handle similar incomplete input in the future; this also improves the overall stability and anti-interference ability of the system. The incomplete corpus is obtained by incompletely sampling corpus randomly selected from the general high-quality corpus.
The three parts of the corpus can be mixed in a certain proportion, for example 1:0.5:1, where the proportion of incomplete corpus is smaller because, according to experience with machine translation models, an appropriate amount of incomplete data increases the generalization ability of the model while too much harms the effect. The mixed first model fine-tuning sample is put into the basic model for training, and the learning rate is adjusted to 0.5 times the learning rate at which the basic model fully converged. Finally, an intermediate model is obtained when training converges, for use in the second round of fine-tuning.
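A minimal sketch of mixing the three corpus parts in a ratio such as 1:0.5:1 (the function name, seeding, and the scaling rule — the largest ratio keeps its corpus whole — are assumptions for illustration):

```python
import random

def mix_corpora(parts, ratios, seed=0):
    """Mix several corpora according to `ratios`, scaled so that the corpus
    with the largest ratio is kept whole, then shuffle the result."""
    rng = random.Random(seed)
    scale = max(ratios)
    mixed = []
    for corpus, ratio in zip(parts, ratios):
        k = int(len(corpus) * ratio / scale)
        mixed.extend(rng.sample(corpus, k))
    rng.shuffle(mixed)
    return mixed

# Toy corpora standing in for the noisy-domain, incomplete, and
# high-quality parts described in the text.
noisy = [f"noisy-{i}" for i in range(100)]
incomplete = [f"inc-{i}" for i in range(100)]
high_quality = [f"hq-{i}" for i in range(100)]
sample = mix_corpora([noisy, incomplete, high_quality], [1, 0.5, 1])
print(len(sample))  # 100 + 50 + 100 = 250
```

The 1:1:1 mixture used in the second round of fine-tuning is the same call with `ratios=[1, 1, 1]`.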
In the second round of fine-tuning, the second model fine-tuning sample is processed in the same manner as the first. The second model fine-tuning sample includes: historical speech of the speaker who will be presenting, fixed-domain corpus containing noise, and a portion of high-quality samples. These three parts can be mixed in a 1:1:1 ratio, and the mixed second model fine-tuning sample is put into the intermediate model as a whole for continued training, with the learning rate again adjusted to 0.5 times the learning rate used when the basic model fully converged.
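The two-round fine-tuning schedule can be sketched as follows; `train_until_converged` and the base learning rate are stand-ins for a real training loop, used only to show the order of the rounds and the halved learning rate:

```python
def train_until_converged(model, samples, lr):
    """Stand-in for a real training loop: records what it was asked to do."""
    model = dict(model)
    model["history"] = model.get("history", []) + [(len(samples), lr)]
    return model

BASE_LR = 1e-4                      # assumed learning rate at full convergence
base_model = {"name": "base"}

first_samples = ["scene"] * 250     # scene corpus mixed 1:0.5:1, as above
second_samples = ["speaker"] * 300  # speaker corpus mixed 1:1:1

# Both rounds use 0.5x the converged learning rate.
intermediate = train_until_converged(base_model, first_samples, 0.5 * BASE_LR)
final_model = train_until_converged(intermediate, second_samples, 0.5 * BASE_LR)
print(final_model["history"])  # [(250, 5e-05), (300, 5e-05)]
```

Additional rounds of fine-tuning, which the text explicitly allows, are further calls in the same chain.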
After the fine-tuning of the basic model is completed and the fine-tuned model meets the set conditions, such as convergence, the trained simultaneous interpretation model is output, to be called in the subsequent use process.
In one embodiment, after completing the training of the simultaneous interpretation model, when the trained simultaneous interpretation model is used, the method includes: receiving input voice information and loading a trained simultaneous interpretation model; and inputting the voice information into the trained simultaneous interpretation model to obtain text information corresponding to the voice information.
When speech information is received, the trained simultaneous interpretation model is loaded; the received speech information is then input into the loaded model to output the corresponding text information, which is finally displayed in the corresponding text display box.

In practical applications, a trained model is embedded in or imported into the corresponding device or apparatus in advance so that the user can call it when using the device. The pre-trained simultaneous interpretation model is therefore integrated into a relevant device, such as a simultaneous interpretation device, so that simultaneous interpretation is realized when the device is used. Since the trained model is directionally custom-trained for particular scenes and individuals, it is intended for specific people in specific scenes: for example, user 1 in an outdoor scene uses a device loaded with the simultaneous interpretation model trained on data of user 1 and the outdoor scene, while user 2 in an indoor scene uses a device loaded with the model trained on data of user 2 and the indoor scene.
In one embodiment, when the model is called, voice information input by a user is received, and because the called simultaneous interpretation model already determines the currently used scene and person, when the voice information is input, customized and accurate simultaneous interpretation can be directly realized according to the scene and the person.
In an embodiment, when training a simultaneous interpretation model to be trained, the overall training process may be as shown in fig. 7, and fig. 7 is a schematic flow diagram of a model training process provided in an embodiment of the present application.
When the simultaneous interpretation model to be trained is continuously trained, the method comprises the following steps:
step 701, pre-training a model;
step 702, fine tuning the model.
When model pre-training is carried out, initial data for model pre-training is firstly obtained, the obtained initial data for model training is preprocessed, training data for model training is obtained, and then pre-training is carried out on a simultaneous interpretation model to be trained according to the training data, so that a basic model is obtained. After the basic model is obtained, the basic model is finely adjusted according to actual requirements, so that the basic model is directionally finely adjusted, and a high-quality simultaneous interpretation model is obtained.
The obtained initial data includes source-language data and target-language text data. After the initial data is obtained, the speech data and text data are processed — for example, the speech data is embedded and the text undergoes BPE (byte pair encoding) subword segmentation — and the preprocessed initial data is then used as input for model training, yielding the basic model when training completes.

The simultaneous interpretation model is constructed by optimized design based on the Transformer architecture, so the model's basic parameters need to be set accordingly during training.
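BPE segmentation builds subword units by repeatedly merging the most frequent adjacent symbol pair. A compact sketch of learning merge rules from a toy word-frequency corpus (this is an illustration of the algorithm, not the specific toolkit used, which the text does not name):

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge rules from a word -> frequency dict.
    Words start as character sequences; each merge joins the most frequent pair."""
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges

corpus = {"lower": 5, "lowest": 3, "newer": 6}
print(learn_bpe(corpus, 3))
```

The learned merges are then applied to segment training and inference text into the same subword vocabulary, which keeps rare words representable without an enormous word-level vocabulary.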
The number of rounds of model fine-tuning is not limited and is set according to actual usage requirements. When the basic model is fine-tuned, fine-tuning data is first obtained and processed — for example, mixed in proportion — to produce the fine-tuning samples, and the basic model is then trained on those samples. When the number of rounds is 1, fine-tuning is realized in the manner described; when it is greater than 1, subsequent rounds, such as the second and third, follow after the first is completed.

If the number of rounds is 2, the intermediate model is obtained from the basic model after the first round; in the second round, the fine-tuning samples for that round are input into the intermediate model as its fine-tuning input, realizing the training of the intermediate model, and the final simultaneous interpretation model is obtained when the last round of fine-tuning is completed.
In the training method of the simultaneous interpretation model, the device, and the storage medium described above, when the simultaneous interpretation model is trained, initial data for pre-training is first obtained and the model to be trained is trained on it to obtain the corresponding basic model. The basic model can already realize simultaneous interpretation, but it is a generic model applicable to the usage scene rather than a customized one. Through directional fine-tuning during training, the obtained simultaneous interpretation model achieves a better effect in a specific scene, and the secondary training involved in fine-tuning also improves the robustness of the model.
Referring to fig. 8, fig. 8 is a schematic block diagram of a structure of a computer device according to an embodiment of the present application.
Illustratively, the device may be a tablet, notebook, desktop, or the like.
The device also includes a processor and a memory for storing a computer program.
The processor is configured to execute the computer program and, when executing the computer program, implement a training method and/or a simultaneous interpretation method of any simultaneous interpretation model provided in an embodiment of the present application.
It should be understood that the processor may be a Central Processing Unit (CPU), or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or any conventional processor.
Embodiments of the present application further provide a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program causes the processor to implement any one of the training methods and/or the simultaneous interpretation methods of the simultaneous interpretation models provided in the embodiments of the present application.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable storage media, which may include computer readable storage media (or non-transitory media) and communication media (or transitory media).
The term computer-readable storage medium includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer-readable storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.
For example, the computer readable storage medium may be an internal storage unit of the electronic device according to the foregoing embodiment, for example, a hard disk or a memory of the electronic device. The computer readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, provided on the electronic device.
While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for training a simultaneous interpretation model, the method comprising the steps of:
loading a simultaneous interpretation model to be trained, acquiring initial data for training, and training the simultaneous interpretation model according to the initial data to obtain a basic model;
and receiving a model fine-tuning corpus, and fine-tuning the basic model based on the model fine-tuning corpus to obtain a trained simultaneous interpretation model.
2. The method of claim 1, wherein the acquiring initial data for training and training the simultaneous interpretation model according to the initial data to obtain a basic model comprises:
acquiring video data carrying text information, and performing audio extraction on the video data to obtain audio information;
time calibration and correlation are carried out on the audio information according to the text information so as to obtain a training sample of the text information and the audio information;
receiving basic parameters, and setting parameters of the simultaneous interpretation model to be trained based on the basic parameters;
and training the simultaneous interpretation model to be trained after parameter setting is completed according to the training sample to obtain a basic model.
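The sample construction of claim 2 can be sketched as follows. This is an illustrative outline only, not the patent's implementation: the audio track is assumed to have already been extracted from the video, and the hypothetical `align_samples` helper simply pairs each timed caption with the audio span it covers.

```python
def align_samples(subtitles, audio_duration):
    """Pair each timed caption with the audio window it covers.

    subtitles: list of (start_sec, end_sec, text) taken from the video's
    embedded text information; audio_duration: length in seconds of the
    extracted audio track. Spans falling outside the audio are dropped.
    """
    samples = []
    for start, end, text in subtitles:
        if 0 <= start < end <= audio_duration:
            samples.append(((start, end), text))
    return samples

# A caption whose span extends beyond the audio track is discarded.
pairs = align_samples([(0.0, 2.5, "hello"), (2.5, 9.0, "world")], 5.0)
```

The resulting `(audio span, text)` pairs stand in for the text-audio training samples described by the claim.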
3. The method according to claim 2, wherein the training the simultaneous interpretation model to be trained after completing parameter setting according to the training samples to obtain a basic model comprises:
acquiring a training set sample from the training sample so as to input the training set sample into the simultaneous interpretation model to be trained after parameter setting is completed;
obtaining a verification set sample from the training sample, and determining whether the trained simultaneous interpretation model is converged;
and when the trained simultaneous interpretation model is determined to be converged, verifying the trained simultaneous interpretation model according to the verification set sample, and obtaining a basic model when the verification is passed.
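A minimal, hypothetical reading of the convergence test in claim 3: the patent does not fix a criterion, so this sketch treats the model as converged once the validation loss has stopped improving beyond a small margin for a fixed number of recent checks. All names and thresholds are illustrative assumptions.

```python
def has_converged(loss_history, patience=2, min_delta=1e-3):
    """True once no loss in the last `patience` checks beats the
    earlier best by at least `min_delta` (illustrative criterion)."""
    if len(loss_history) <= patience:
        return False
    best_before = min(loss_history[:-patience])
    recent_best = min(loss_history[-patience:])
    return recent_best > best_before - min_delta

still_improving = has_converged([1.00, 0.80, 0.790, 0.795])
plateaued = has_converged([1.00, 0.80, 0.790, 0.791, 0.792])
```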
4. The method of claim 3, wherein the verifying the trained simultaneous interpretation model according to the verification set samples when the trained simultaneous interpretation model is determined to be converged, and obtaining a basic model when the verification is passed, comprises:
when the trained simultaneous interpretation model is determined to be converged, acquiring a plurality of groups of model parameters corresponding to a plurality of simultaneous interpretation models that reached a converged state during training on the training set samples;
inputting the verification set sample into a trained simultaneous interpretation model, and recording a BLEU value corresponding to the verification set sample so as to determine whether the trained simultaneous interpretation model is stable according to the BLEU value;
when the trained simultaneous interpretation model is determined to be stable, carrying out weight fusion on the plurality of groups of model parameters to obtain a basic model according to the model parameters after weight fusion;
when the trained simultaneous interpretation model is determined to be unstable, executing the following steps: and acquiring a training set sample from the training sample so as to input the training set sample into the simultaneous interpretation model to be trained after parameter setting is completed.
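Claims 3 and 4 together describe keeping several converged checkpoints, checking BLEU stability on the verification set, and fusing the parameter groups by weight. The sketch below is a hedged reading under stated assumptions: equal-weight averaging and a fixed BLEU tolerance, neither of which the patent specifies; plain dicts of floats stand in for framework tensors.

```python
def fuse_weights(param_sets):
    """Element-wise mean over several {name: value} parameter dicts
    (equal weights assumed; the patent does not fix the weighting)."""
    n = len(param_sets)
    return {name: sum(p[name] for p in param_sets) / n
            for name in param_sets[0]}

def is_stable(bleu_history, window=3, tolerance=0.5):
    """Treat the model as stable when the last `window` verification-set
    BLEU values vary within `tolerance` points (illustrative threshold)."""
    if len(bleu_history) < window:
        return False
    recent = bleu_history[-window:]
    return max(recent) - min(recent) <= tolerance

fused = fuse_weights([{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 2.0}])
stable = is_stable([20.1, 24.3, 25.0, 25.2, 24.9])
```

When `is_stable` fails, the claim loops back to training on the training set samples; when it passes, the fused parameters define the basic model.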
5. The method according to any one of claims 1 to 4, wherein the receiving a model fine-tuning corpus, and fine-tuning the basic model based on the model fine-tuning corpus to obtain a trained simultaneous interpretation model, comprises:
receiving a model fine-tuning corpus, and preprocessing the model fine-tuning corpus to obtain a model fine-tuning sample;
and inputting the model fine tuning sample into the basic model for training, and obtaining a trained simultaneous interpretation model when the trained basic model is determined to be converged.
6. The method according to claim 5, wherein the preprocessing the model fine-tuning corpus to obtain model fine-tuning samples comprises:
and extracting audio information and text information corresponding to the model fine tuning corpus so as to obtain a model fine tuning sample based on the audio information and the text information.
7. The method of claim 5, wherein inputting the model fine-tuning samples into the base model for training and obtaining a trained simultaneous interpretation model when it is determined that the trained base model converges comprises:
inputting a first model fine tuning sample in the model fine tuning samples into the basic model, and obtaining an intermediate model when the basic model trained based on the first model fine tuning sample is determined to be converged;
and inputting a second model fine-tuning sample in the model fine-tuning samples into the intermediate model, and obtaining a trained simultaneous interpretation model when determining that the intermediate model trained based on the second model fine-tuning sample is converged.
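The two-stage fine-tuning of claim 7 can be outlined as below. `train_until_converged` is a placeholder standing in for a real per-stage training loop; here it only records which corpus each stage saw, so the staging order (base model, then intermediate model, then final model) is visible.

```python
def train_until_converged(model, samples):
    # Placeholder for the per-stage training in claim 7: a real
    # implementation would run gradient updates until convergence;
    # this stand-in appends the corpus seen at the current stage.
    return model + [tuple(samples)]

def staged_finetune(base_model, first_corpus, second_corpus):
    intermediate = train_until_converged(base_model, first_corpus)   # stage 1
    return train_until_converged(intermediate, second_corpus)        # stage 2

model = staged_finetune([], ["general corpus"], ["domain corpus"])
```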
8. A simultaneous interpretation method, comprising the steps of:
receiving input voice information and loading a trained simultaneous interpretation model, wherein the simultaneous interpretation model is obtained based on the method of any one of claims 1-7;
and inputting the voice information into the trained simultaneous interpretation model to obtain text information corresponding to the voice information.
9. A computer device comprising a memory and a processor;
the memory is used for storing a computer program;
the processor is used for executing the computer program and, when executing the computer program, implementing the steps of the method for training the simultaneous interpretation model according to any one of claims 1 to 7 and/or the simultaneous interpretation method according to claim 8.
10. A storage medium for computer readable storage, characterized in that the storage medium stores one or more programs executable by one or more processors to implement the method of training the simultaneous interpretation model of any one of claims 1 to 7 and/or the steps of the simultaneous interpretation method of claim 8.
CN202011062800.8A 2020-09-30 2020-09-30 Simultaneous interpretation model training method, simultaneous interpretation method, device and storage medium Pending CN114333830A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011062800.8A CN114333830A (en) 2020-09-30 2020-09-30 Simultaneous interpretation model training method, simultaneous interpretation method, device and storage medium

Publications (1)

Publication Number Publication Date
CN114333830A 2022-04-12

Family

ID=81032540

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011062800.8A Pending CN114333830A (en) 2020-09-30 2020-09-30 Simultaneous interpretation model training method, simultaneous interpretation method, device and storage medium

Country Status (1)

Country Link
CN (1) CN114333830A (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1677388A (en) * 2004-03-30 2005-10-05 微软公司 Statistical language model for logical forms
CN102789451A (en) * 2011-05-16 2012-11-21 北京百度网讯科技有限公司 Individualized machine translation system, method and translation model training method
CN103049436A (en) * 2011-10-12 2013-04-17 北京百度网讯科技有限公司 Method and device for obtaining corpus, method and system for generating translation model and method and system for mechanical translation
CN108874785A (en) * 2018-06-01 2018-11-23 清华大学 A kind of translation processing method and system
CN109271643A (en) * 2018-08-08 2019-01-25 北京捷通华声科技股份有限公司 A kind of training method of translation model, interpretation method and device
CN109508462A (en) * 2018-10-25 2019-03-22 内蒙古工业大学 A kind of neural network illiteracy Chinese machine translation method based on coder-decoder
CN110110337A (en) * 2019-05-08 2019-08-09 网易有道信息技术(北京)有限公司 Translation model training method, medium, device and calculating equipment
CN110543643A (en) * 2019-08-21 2019-12-06 语联网(武汉)信息技术有限公司 Training method and device of text translation model
CN110598224A (en) * 2019-09-23 2019-12-20 腾讯科技(深圳)有限公司 Translation model training method, text processing device and storage medium
CN110889295A (en) * 2019-09-12 2020-03-17 华为技术有限公司 Machine translation model, and method, system and equipment for determining pseudo-professional parallel corpora
CN111104807A (en) * 2019-12-06 2020-05-05 北京搜狗科技发展有限公司 Data processing method and device and electronic equipment
CN111444731A (en) * 2020-06-15 2020-07-24 深圳市友杰智新科技有限公司 Model training method and device and computer equipment


Similar Documents

Publication Publication Date Title
Tatman Gender and dialect bias in YouTube’s automatic captions
CN107886949B (en) Content recommendation method and device
US10930263B1 (en) Automatic voice dubbing for media content localization
US20230103340A1 (en) Information generating method and apparatus, device, storage medium, and program product
US8447604B1 (en) Method and apparatus for processing scripts and related data
Hori et al. A new approach to automatic speech summarization
CN107239547B (en) Voice error correction method, terminal and storage medium for ordering song by voice
CN113035199B (en) Audio processing method, device, equipment and readable storage medium
US20170154030A1 (en) Providing electronic text recommendations to a user based on what is discussed during a meeting
CN109979450B (en) Information processing method and device and electronic equipment
CN114143479B (en) Video abstract generation method, device, equipment and storage medium
CN111681678B (en) Method, system, device and storage medium for automatically generating sound effects and matching videos
CN113392273A (en) Video playing method and device, computer equipment and storage medium
CN109326284A (en) The method, apparatus and storage medium of phonetic search
US20110161084A1 (en) Apparatus, method and system for generating threshold for utterance verification
US11893813B2 (en) Electronic device and control method therefor
CN113591491B (en) Speech translation text correction system, method, device and equipment
US20190228765A1 (en) Speech analysis apparatus, speech analysis system, and non-transitory computer readable medium
CN114333830A (en) Simultaneous interpretation model training method, simultaneous interpretation method, device and storage medium
Martone et al. Automated closed-captioning using text alignment
CN115171645A (en) Dubbing method and device, electronic equipment and storage medium
CN114842858A (en) Audio processing method and device, electronic equipment and storage medium
CN114155841A (en) Voice recognition method, device, equipment and storage medium
Mizera et al. Impact of irregular pronunciation on phonetic segmentation of nijmegen corpus of casual czech
CN111681680A (en) Method, system and device for acquiring audio by video recognition object and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination