CN114416955A - Heterogeneous language model training method, device, equipment and storage medium


Info

Publication number: CN114416955A
Application number: CN202210074846.4A
Authority: CN (China)
Prior art keywords: model, network, models, network model, determining
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 姜迪
Current Assignee: WeBank Co Ltd
Original Assignee: WeBank Co Ltd
Application filed by WeBank Co Ltd

Classifications

    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G06F 16/3343 Query execution using phonetics
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N 3/045 Combinations of networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the application provide a heterogeneous language model training method, device, equipment, and storage medium. The method includes: acquiring a voice training sample set; training a first initial network model and a second initial network model with the voice training sample set to obtain at least two first network models and at least two second network models, where the first network model and the second network model have different structures, the first network model processes an input pinyin sequence to obtain at least one character sequence corresponding to the pinyin sequence, and the second network model determines a target character sequence corresponding to the pinyin sequence from the at least one character sequence; and determining a heterogeneous language model according to the at least two first network models and the at least two second network models. The method, device, equipment, and storage medium improve the accuracy of the language model.

Description

Heterogeneous language model training method, device, equipment and storage medium
Technical Field
The present application relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, a device, and a storage medium for training a heterogeneous language model.
Background
An Automatic Speech Recognition (ASR) system recognizes speech to obtain the text corresponding to that speech. An ASR system includes an Acoustic Model (AM) and a Language Model (LM). The AM obtains the corresponding pinyin from the speech. The LM is a homogeneous language model that obtains the text from that pinyin.
In the related art, the initial LM is typically trained with a speech data set to obtain the LM of the ASR system described above. However, because of information privacy restrictions, the speech data set contains few speech samples, and the homogeneous language model is a single n-gram model or Deep Neural Network (DNN), so the accuracy of the trained LM is low (i.e., the accuracy of the text obtained through the LM is low).
Disclosure of Invention
The embodiment of the application provides a training method, a training device, equipment and a storage medium of a heterogeneous language model, and aims to solve the problem of low accuracy of the language model.
In a first aspect, an embodiment of the present application provides a method for training a heterogeneous language model, including:
acquiring a voice training sample set;
training the first initial network model and the second initial network model by adopting a voice training sample set to obtain at least two first network models and at least two second network models; the first network model and the second network model have different structures, the first network model is used for processing the input pinyin sequence to obtain at least one character sequence corresponding to the pinyin sequence, and the second network model is used for determining a target character sequence corresponding to the pinyin sequence from the at least one character sequence;
determining a heterogeneous language model according to the at least two first network models and the at least two second network models.
Optionally, determining a heterogeneous language model according to the at least two first network models and the at least two second network models includes:
determining at least four first language models according to at least two first network models and at least two second network models;
acquiring a voice verification sample set;
determining an error rate for each first language model based on the set of speech verification samples;
and determining the heterogeneous language model according to the at least four first language models and the error rate.
Optionally, determining at least four first language models from the at least two first network models and the at least two second network models comprises:
and carrying out random combination processing on the at least two first network models and the at least two second network models to obtain at least four first language models.
Optionally, determining at least four first language models from the at least two first network models and the at least two second network models comprises:
performing binary conversion processing on model parameters of the first network models aiming at each first network model to obtain a first initial parameter sequence; performing cross processing and mutation processing on the first initial parameter sequence to obtain at least two first intermediate parameter sequences; replacing the model parameters of the first network model with model parameters corresponding to at least two first intermediate parameter sequences to obtain at least two third network models corresponding to the first network model;
performing binary conversion processing on the model parameters of the second network models aiming at each second network model to obtain a second initial parameter sequence; performing cross processing and mutation processing on the second initial parameter sequence to obtain at least two second intermediate parameter sequences; replacing the model parameters of the second network model with model parameters corresponding to at least two second intermediate parameter sequences to obtain at least two fifth network models corresponding to the second network model;
and carrying out random combination processing on at least two third network models corresponding to the at least two first network models and at least two fifth network models corresponding to the at least two second network models to obtain at least four first language models.
Optionally, the voice verification sample set includes a plurality of pinyin verification samples and text verification results corresponding to the plurality of pinyin verification samples; aiming at each first language model in at least four first language models, wherein the first language models comprise a first network model and a second network model;
determining an error rate for the first language model based on the set of speech verification samples, comprising:
processing the multiple pinyin verification samples sequentially through the first network model and the second network model to obtain text output results corresponding to the multiple pinyin verification samples;
and determining the ratio of the number of pinyin verification samples whose text output results differ from the text verification results to the total number of pinyin verification samples as the error rate of the first language model.
Optionally, determining a heterogeneous language model according to the at least four first language models and the error rate includes:
judging whether a first language model with an error rate smaller than a preset value exists in the at least four first language models;
if so, determining the first language model with the error rate smaller than the preset value as a heterogeneous language model;
if not, obtaining first model parameter sequences corresponding to the model parameters of a preset number of the at least four first language models to obtain at least one second language model corresponding to the preset number of first language models, and determining the heterogeneous language model according to the plurality of second language models and the error rate of each second language model.
Optionally, determining a heterogeneous language model according to the at least two first network models and the at least two second network models includes:
determining a first target network model according to at least two first network models, and determining a second target network model according to at least two second network models;
and determining the first target network model and the second target network model as heterogeneous language models.
Optionally, determining a first target network model from at least two first network models comprises:
performing binary conversion processing on model parameters of the first network models aiming at each first network model to obtain a first initial parameter sequence; performing cross processing and mutation processing on the first initial parameter sequence to obtain at least two first intermediate parameter sequences; replacing the model parameters of the first network model with model parameters corresponding to at least two first intermediate parameter sequences to obtain at least two third network models corresponding to the first network model;
and determining the first target network model according to the plurality of third network models.
Optionally, determining the first target network model according to a plurality of third network models includes:
acquiring a voice verification sample set;
determining an error rate of each third network model according to the voice verification sample set;
and determining the first target network model according to the plurality of third network models and the error rate.
Optionally, the voice verification sample set includes a plurality of pinyin verification samples and text verification results corresponding to the plurality of pinyin verification samples;
determining an error rate of the third network model from the set of speech verification samples, comprising:
sequentially processing the multiple pinyin verification samples through the third network model and any second network model to obtain text output results corresponding to the multiple pinyin verification samples;
and determining the ratio of the number of pinyin verification samples whose text output results differ from the text verification results to the total number of pinyin verification samples as the error rate of the third network model.
Optionally, determining the first target network model according to the plurality of third network models and the error rate includes:
judging whether a third network model with an error rate smaller than a preset value exists in the plurality of third network models;
if so, determining the third network model with the error rate smaller than the preset value as the first target network model;
if not, obtaining first initial parameter sequences corresponding to the model parameters of a preset number of the plurality of third network models to obtain at least one fourth network model corresponding to the preset number of third network models, and determining the first target network model according to the plurality of fourth network models and the error rate of each fourth network model.
Optionally, determining a first target network model according to the at least two first network models, and determining a second target network model according to the at least two second network models, includes:
generating a first weight of each first network model and a second weight of each second network model through the weight generation model;
according to the first weight, model parameters of the first network models are fused to obtain first target model parameters; according to the second weight, model parameters of each second network model are fused to obtain second target model parameters;
replacing the model parameters of the first network model with first target model parameters to obtain a first target network model; and replacing the model parameters of the second network model with the second target model parameters to obtain a second target network model.
Optionally, the method further comprises:
acquiring a voice verification sample set;
determining the error rate of the heterogeneous language model according to the voice verification sample set, and determining a reward value according to the error rate;
and updating the model parameters of the weight generation model according to the reward value, and regenerating the first weight of each first network model and the second weight of each second network model to obtain a new heterogeneous language model.
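As an illustration of this optional weight-fusion scheme, the following is a minimal sketch, assuming each network model's parameters can be flattened into a vector; `fuse_parameters`, the toy shapes, and the reward mapping are illustrative assumptions rather than the patent's implementation:

```python
import numpy as np

def fuse_parameters(param_vectors, weights):
    """Fuse per-model parameter vectors into one target parameter vector."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                        # normalize the generated weights
    stacked = np.stack(param_vectors)      # shape: (num_models, num_params)
    return (w[:, None] * stacked).sum(axis=0)

# Toy example: three first network models, each reduced to a flat vector;
# the weights would come from the weight generation model.
first_params = [np.random.randn(8) for _ in range(3)]
first_target_params = fuse_parameters(first_params, [0.5, 0.3, 0.2])

# The text only says a reward is determined according to the error rate, so
# this particular mapping (lower error -> higher reward) is an assumption.
error_rate = 0.12
reward = 1.0 - error_rate
```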
Optionally, the speech training sample set comprises at least two sample subsets; training the first initial network model and the second initial network model by adopting a voice training sample set to obtain at least two first network models and at least two second network models, and the method comprises the following steps:
respectively adopting at least two sample subsets to train the first initial network model to obtain first network models corresponding to the at least two sample subsets;
respectively adopting at least two sample subsets to train the second initial network model to obtain second network models corresponding to the at least two sample subsets;
determining a first network model corresponding to each of the at least two sample subsets as at least two first network models; and determining the second network model corresponding to each of the at least two sample subsets as at least two second network models.
In a second aspect, an embodiment of the present application provides a training apparatus for a heterogeneous language model, including: a processing module; the processing module is used for:
acquiring a voice training sample set;
training the first initial network model and the second initial network model by adopting a voice training sample set to obtain at least two first network models and at least two second network models; the first network model and the second network model have different structures, the first network model is used for processing the input pinyin sequence to obtain at least one character sequence corresponding to the pinyin sequence, and the second network model is used for determining a target character sequence corresponding to the pinyin sequence from the at least one character sequence;
determining a heterogeneous language model according to the at least two first network models and the at least two second network models.
Optionally, the processing module is specifically configured to:
determining at least four first language models according to at least two first network models and at least two second network models;
acquiring a voice verification sample set;
determining an error rate for each first language model based on the set of speech verification samples;
and determining the heterogeneous language model according to the at least four first language models and the error rate.
Optionally, the processing module is specifically configured to:
and carrying out random combination processing on the at least two first network models and the at least two second network models to obtain at least four first language models.
Optionally, the processing module is specifically configured to:
performing binary conversion processing on model parameters of the first network models aiming at each first network model to obtain a first initial parameter sequence; performing cross processing and mutation processing on the first initial parameter sequence to obtain at least two first intermediate parameter sequences; replacing the model parameters of the first network model with model parameters corresponding to at least two first intermediate parameter sequences to obtain at least two third network models corresponding to the first network model;
performing binary conversion processing on the model parameters of the second network models aiming at each second network model to obtain a second initial parameter sequence; performing cross processing and mutation processing on the second initial parameter sequence to obtain at least two second intermediate parameter sequences; replacing the model parameters of the second network model with model parameters corresponding to at least two second intermediate parameter sequences to obtain at least two fifth network models corresponding to the second network model;
and carrying out random combination processing on at least two third network models corresponding to the at least two first network models and at least two fifth network models corresponding to the at least two second network models to obtain at least four first language models.
Optionally, the voice verification sample set includes a plurality of pinyin verification samples and text verification results corresponding to the plurality of pinyin verification samples; aiming at each first language model in at least four first language models, wherein the first language models comprise a first network model and a second network model;
the processing module is specifically configured to:
processing the multiple pinyin verification samples sequentially through the first network model and the second network model to obtain text output results corresponding to the multiple pinyin verification samples;
and determining the ratio of the number of pinyin verification samples whose text output results differ from the text verification results to the total number of pinyin verification samples as the error rate of the first language model.
Optionally, the processing module is specifically configured to:
judging whether a first language model with an error rate smaller than a preset value exists in the at least four first language models;
if so, determining the first language model with the error rate smaller than the preset value as a heterogeneous language model;
if not, obtaining first model parameter sequences corresponding to the model parameters of a preset number of the at least four first language models to obtain at least one second language model corresponding to the preset number of first language models, and determining the heterogeneous language model according to the plurality of second language models and the error rate of each second language model.
Optionally, the processing module is specifically configured to:
determining a first target network model according to at least two first network models, and determining a second target network model according to at least two second network models;
and determining the first target network model and the second target network model as heterogeneous language models.
Optionally, the processing module is specifically configured to:
performing binary conversion processing on model parameters of the first network models aiming at each first network model to obtain a first initial parameter sequence; performing cross processing and mutation processing on the first initial parameter sequence to obtain at least two first intermediate parameter sequences; replacing the model parameters of the first network model with model parameters corresponding to at least two first intermediate parameter sequences to obtain at least two third network models corresponding to the first network model;
and determining the first target network model according to the plurality of third network models.
Optionally, the processing module is specifically configured to:
acquiring a voice verification sample set;
determining an error rate of each third network model according to the voice verification sample set;
and determining the first target network model according to the plurality of third network models and the error rate.
Optionally, the voice verification sample set includes a plurality of pinyin verification samples and text verification results corresponding to the plurality of pinyin verification samples;
the processing module is specifically configured to:
sequentially processing the multiple pinyin verification samples through the third network model and any second network model to obtain text output results corresponding to the multiple pinyin verification samples;
and determining the ratio of the number of pinyin verification samples whose text output results differ from the text verification results to the total number of pinyin verification samples as the error rate of the third network model.
Optionally, the processing module is specifically configured to:
judging whether a third network model with an error rate smaller than a preset value exists in the plurality of third network models;
if so, determining the third network model with the error rate smaller than the preset value as the first target network model;
if not, obtaining first initial parameter sequences corresponding to the model parameters of a preset number of the plurality of third network models to obtain at least one fourth network model corresponding to the preset number of third network models, and determining the first target network model according to the plurality of fourth network models and the error rate of each fourth network model.
Optionally, the processing module is specifically configured to:
generating a first weight of each first network model and a second weight of each second network model through the weight generation model;
according to the first weight, model parameters of the first network models are fused to obtain first target model parameters; according to the second weight, model parameters of each second network model are fused to obtain second target model parameters;
replacing the model parameters of the first network model with first target model parameters to obtain a first target network model; and replacing the model parameters of the second network model with the second target model parameters to obtain a second target network model.
Optionally, the processing module is further configured to:
acquiring a voice verification sample set;
determining the error rate of the heterogeneous language model according to the voice verification sample set, and determining a reward value according to the error rate;
and updating the model parameters of the weight generation model according to the reward value, and regenerating the first weight of each first network model and the second weight of each second network model to obtain a new heterogeneous language model.
Optionally, the speech training sample set comprises at least two sample subsets; the processing module is specifically configured to:
respectively adopting at least two sample subsets to train the first initial network model to obtain first network models corresponding to the at least two sample subsets;
respectively adopting at least two sample subsets to train the second initial network model to obtain second network models corresponding to the at least two sample subsets;
determining a first network model corresponding to each of the at least two sample subsets as at least two first network models; and determining the second network model corresponding to each of the at least two sample subsets as at least two second network models.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored by the memory to implement the method of any one of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, in which computer-executable instructions are stored, and when the computer-executable instructions are executed by a processor, the method of any one of the first aspect is implemented.
In a fifth aspect, the present application provides a computer program product, which includes a computer program that, when executed by a processor, implements the method of any one of the first aspect.
The embodiments of the application provide a heterogeneous language model training method, device, equipment, and storage medium. The method includes: acquiring a voice training sample set; training a first initial network model and a second initial network model with the voice training sample set to obtain at least two first network models and at least two second network models, where the first network model and the second network model have different structures, the first network model processes an input pinyin sequence to obtain at least one character sequence corresponding to the pinyin sequence, and the second network model determines a target character sequence corresponding to the pinyin sequence from the at least one character sequence; and determining a heterogeneous language model according to the at least two first network models and the at least two second network models.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic application scenario diagram of a training method for a heterogeneous language model according to an embodiment of the present application;
fig. 2 is a flowchart illustrating a method for training a heterogeneous language model according to an embodiment of the present application;
FIG. 3 is a flowchart of a method for determining a heterogeneous language model according to an embodiment of the present application;
FIG. 4 is a flowchart of another method for determining a heterogeneous language model according to an embodiment of the present application;
FIG. 5 is a flowchart of another method for determining a heterogeneous language model according to an embodiment of the present application;
FIG. 6 is a flowchart of a method for updating a heterogeneous language model according to an embodiment of the present application;
fig. 7 is an architecture diagram for obtaining a heterogeneous language model based on GMMA according to an embodiment of the present application;
FIG. 8 is an architecture diagram for obtaining a heterogeneous language model based on RLGMA according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of a training apparatus for a heterogeneous language model according to an embodiment of the present application;
fig. 10 is a hardware schematic diagram of an electronic device according to an embodiment of the present application.
With the above figures, there are shown specific embodiments of the present application, which will be described in more detail below. These drawings and written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present application; rather, they are merely examples of apparatuses and methods consistent with certain aspects of the present application, as detailed in the appended claims.
In the related art, because of information privacy restrictions, the speech data set contains few speech samples, and the homogeneous language model is a single n-gram model or Deep Neural Network (DNN); as a result, after the initial LM is trained with such a speech data set, the accuracy of the obtained language model is generally low.
In the present application, in order to train a language model with higher accuracy from a speech data set with a smaller number of speech samples, the inventor proposes first training with the speech sample set to obtain a plurality of first network models and a plurality of second network models, where the first network models and the second network models have different structures and functions, and then determining a heterogeneous language model according to the plurality of first network models and the plurality of second network models.
An application scenario of the training method for the heterogeneous language model provided in the embodiment of the present application is described below with reference to fig. 1.
Fig. 1 is a schematic view of an application scenario to which the training method for a heterogeneous language model provided in the embodiment of the present application is applied. As shown in fig. 1, the application scenario includes: a first initial network model, a second initial network model, a plurality of first network models, a plurality of second network models, and a heterogeneous language model.
The plurality of first network models are obtained by training the first initial network model by adopting a voice training sample set.
The plurality of second network models are obtained by training the second initial network model by adopting a voice training sample set.
The first network model and the second network model are different in structure and function.
The structure of the first network model is the same as the structure of the n-gram model. The structure of the second network model is the same as that of Deep Neural Networks (DNNs).
The function of the first network model is: and processing the input pinyin sequence to obtain at least one character sequence corresponding to the pinyin sequence. The function of the second network model is: and determining a target character sequence corresponding to the pinyin sequence from at least one character sequence.
The heterogeneous language model is determined based on the plurality of first network models and the plurality of second network models.
In the application scenario shown in fig. 1, the heterogeneous language model is determined according to the plurality of first network models and the plurality of second network models, so a speech data set with a small number of speech samples can be used to train a heterogeneous language model with high accuracy, thereby improving the accuracy of the language model.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 2 is a flowchart illustrating a training method of a heterogeneous language model according to an embodiment of the present application. As shown in fig. 2, the method includes:
s201, a voice training sample set is obtained.
Optionally, an execution subject of the training method for the heterogeneous language model provided in the embodiment of the present application is an electronic device, and may also be a training device for the heterogeneous language model, where the training device for the heterogeneous language model may be implemented by a combination of software and/or hardware.
Optionally, the voice training sample set includes at least one voice data set obtained from a preset open website, for example OpenSLR. The at least one voice data set may include, for example, at least one of the SLR18, SLR33, SLR38, SLR47, SLR62, SLR68, and SLR93 data sets.
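A minimal sketch of assembling such a sample set, assuming the listed OpenSLR corpora have already been downloaded locally; the directory layout, the pairing of `.wav` audio with `.txt` transcripts, and `load_corpus` itself are hypothetical:

```python
from pathlib import Path

OPENSLR_IDS = ["SLR18", "SLR33", "SLR38", "SLR47", "SLR62", "SLR68", "SLR93"]

def load_corpus(root: Path):
    """Yield (audio_path, transcript) pairs from one corpus directory."""
    for txt in root.glob("**/*.txt"):
        wav = txt.with_suffix(".wav")
        if wav.exists():
            yield wav, txt.read_text(encoding="utf-8").strip()

def build_training_set(data_root: str):
    """Collect samples from every corpus directory under `data_root`."""
    samples = []
    for slr_id in OPENSLR_IDS:
        samples.extend(load_corpus(Path(data_root) / slr_id))
    return samples
```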
S202, training the first initial network model and the second initial network model by adopting a voice training sample set to obtain at least two first network models and at least two second network models.
The first initial network model may be, for example, an initial n-gram model.
The second initial network model may be, for example, an initial Deep Neural Network (DNN).
The first network model is used for processing the input pinyin sequence to obtain at least one character sequence corresponding to the pinyin sequence. The pinyin sequence is the pinyin sequence corresponding to a voice training sample in the voice training sample set. The second network model is used for determining a target character sequence corresponding to the pinyin sequence from the at least one character sequence.
It should be noted that the first network model and the first initial network model have the same structure, and the second network model and the second initial network model have the same structure.
Alternatively, the first network model and the second network model may be obtained by the following manner 11 and manner 12.
In the method 11, at least two first network models are obtained in the process of training the first initial network model by using the voice training sample set; and obtaining at least two second network models in the process of training the second initial network model by adopting the voice training sample set.
Optionally, in the process of training the first initial network model with the voice training sample set, the model parameters of the first initial network model are updated according to the voice training sample set, and the network models after X1, X2, X3, ... updates are determined as the at least two first network models.
Alternatively, X1, X2, X3, ... may be positive integers that increase in sequence; for example, X1 is 50, X2 is 100, X3 is 300, and so on. The specific values of X1, X2, X3, ... are not limited in the present application.
In the method 11, the method of obtaining at least two second network models is similar to the method of obtaining at least two first network models, and is not described herein again.
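A minimal sketch of mode 11, assuming an in-place `update_step` training callback; the snapshot step values are the example X1, X2, X3 from the text, and everything else is illustrative:

```python
import copy

SNAPSHOT_STEPS = {50, 100, 300}   # example values of X1, X2, X3

def train_with_snapshots(model, batches, update_step):
    """`update_step(model, batch)` applies one parameter update in place."""
    snapshots = []
    for step, batch in enumerate(batches, start=1):
        update_step(model, batch)
        if step in SNAPSHOT_STEPS:
            snapshots.append(copy.deepcopy(model))  # one first network model
    return snapshots
```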
Mode 12, the speech training sample set includes at least two sample subsets; respectively adopting at least two sample subsets to train the first initial network model to obtain a first network model corresponding to each sample subset; respectively adopting at least two sample subsets to train the second initial network model to obtain a second network model corresponding to each sample subset; determining a first network model corresponding to each of the at least two sample subsets as at least two first network models; and determining the second network model corresponding to each of the at least two sample subsets as at least two second network models.
For example, when the at least two sample subsets include SLR18 and SLR33, the first initial network model may be trained using SLR18 to obtain one first network model, and trained using SLR33 to obtain another first network model.
In the method 12, the method of obtaining at least two second network models is similar to the method of obtaining at least two first network models, and is not described herein again.
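Mode 12 can be sketched the same way, under the assumption of a generic in-place `train_fn`; each subset trains its own copy of the initial model:

```python
import copy

def train_per_subset(initial_model, subsets, train_fn):
    """`train_fn(model, subset)` trains `model` in place on one subset."""
    trained = []
    for subset in subsets:
        model = copy.deepcopy(initial_model)
        train_fn(model, subset)
        trained.append(model)
    return trained

# e.g. first_models = train_per_subset(first_initial_model, [slr18, slr33], train_fn)
```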
S203, determining a heterogeneous language model according to the at least two first network models and the at least two second network models.
Alternatively, the heterogeneous language model may be determined by the following manner 21 and manner 22.
Mode 21, determining at least four first language models according to at least two first network models and at least two second network models; acquiring a voice verification sample set; determining an error rate for each first language model based on the set of speech verification samples; and determining the heterogeneous language model according to the at least four first language models and the error rate.
In some embodiments, the at least two first network models and the at least two second network models are randomly combined to obtain at least four first language models.
For a detailed description of the method 21, please refer to the following detailed contents of the embodiment shown in fig. 3.
In the method 22, a first target network model is determined according to at least two first network models, and a second target network model is determined according to at least two second network models; and determining the first target network model and the second target network model as heterogeneous language models. The first target network model and the first network model have the same structure, and the second target network model and the second network model have the same structure.
For a detailed description of the method 22, please refer to the following embodiment shown in fig. 4.
In the training method for the heterogeneous language model provided in the embodiment of fig. 2, the first initial network model and the second initial network model are trained with the voice training sample set to obtain at least two first network models and at least two second network models, and the heterogeneous language model is determined according to the at least two first network models and the at least two second network models; in this way, even a voice training sample set with few voice samples can yield a heterogeneous language model with high accuracy, improving the accuracy of the heterogeneous language model.
In the present application, the first network model and the second network model differ in structure and function, so the heterogeneous language model determined according to the at least two first network models and the at least two second network models likewise includes two network models with different structures and functions. In the heterogeneous language model provided in the embodiments of the application, the two network models work in cooperation: a rough range of results is obtained first (that is, the input pinyin sequence is processed to obtain at least one character sequence corresponding to the pinyin sequence), and that range is then finely judged (for example, a target character sequence corresponding to the pinyin sequence is determined from the at least one character sequence), which improves the accuracy of the heterogeneous language model.
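This two-stage cooperation can be sketched as follows; `candidate_generator` stands in for the first (n-gram-structured) network model and `rescorer` for the second (DNN-structured) network model, both hypothetical callables:

```python
def heterogeneous_decode(pinyin_seq, candidate_generator, rescorer, top_k=10):
    # Stage 1 (first network model): rough range -- candidate character sequences.
    candidates = candidate_generator(pinyin_seq)[:top_k]
    # Stage 2 (second network model): fine judgment -- pick the target sequence.
    return max(candidates, key=lambda seq: rescorer(pinyin_seq, seq))
```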
In the related art, the homogeneous language model includes one network model (e.g., an n-gram model or a DNN): the target character sequence corresponding to an input pinyin sequence is obtained directly through that network model, and the step of selecting the target character sequence from at least one candidate character sequence is absent, so the accuracy of the homogeneous language model is generally poor.
On the basis of the above embodiment, a specific implementation process of the above mode 21 will be described below with reference to fig. 3.
Fig. 3 is a flowchart of a method for determining a heterogeneous language model according to an embodiment of the present application. As shown in fig. 3, the method includes:
s301, aiming at each of at least two first network models, carrying out binary conversion processing on model parameters of the first network models to obtain a first initial parameter sequence; performing cross processing and mutation processing on the first initial parameter sequence to obtain at least two first intermediate parameter sequences; and replacing the model parameters of the first network model with the model parameters corresponding to the at least two first intermediate parameter sequences to obtain at least two third network models corresponding to the first network model.
Alternatively, at least two first intermediate parameter sequences may be obtained in the following manner 30.
In the method 30, cross processing and mutation processing are performed on the first initial parameter sequence to obtain one first intermediate parameter sequence; cross processing and mutation processing are then performed on the obtained first intermediate parameter sequence to obtain another first intermediate parameter sequence; and the two first intermediate parameter sequences are determined as the at least two first intermediate parameter sequences.
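A minimal sketch of the binary conversion, cross processing, and mutation processing, assuming float model parameters; the single-point crossover against a rotated copy of the bit string is one illustrative choice, since the text does not fix the pairing scheme:

```python
import random
import struct

def params_to_bits(params):
    """Binary conversion: flatten float parameters into a bit list."""
    raw = b"".join(struct.pack(">f", p) for p in params)
    return [b >> i & 1 for b in raw for i in range(7, -1, -1)]

def bits_to_params(bits):
    """Inverse binary conversion back to float parameters."""
    raw = bytes(
        sum(bit << (7 - i) for i, bit in enumerate(bits[k:k + 8]))
        for k in range(0, len(bits), 8)
    )
    return [struct.unpack(">f", raw[k:k + 4])[0] for k in range(0, len(raw), 4)]

def cross_and_mutate(bits, mutation_rate=0.01):
    point = random.randrange(1, len(bits))       # cross processing
    partner = bits[point:] + bits[:point]        # rotated copy as the partner
    child = bits[:point] + partner[point:]
    return [b ^ 1 if random.random() < mutation_rate else b
            for b in child]                      # mutation processing

# Two intermediate parameter sequences from one parent, decoded back to floats.
# (Mutated bit strings can decode to extreme floats; a real run would guard this.)
parent = params_to_bits([0.5, -1.25, 3.0])
intermediate_sequences = [cross_and_mutate(parent) for _ in range(2)]
new_params = [bits_to_params(seq) for seq in intermediate_sequences]
```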
S302, aiming at each of at least two second network models, carrying out binary conversion processing on model parameters of the second network models to obtain a second initial parameter sequence; performing cross processing and mutation processing on the second initial parameter sequence to obtain at least two second intermediate parameter sequences; and replacing the model parameters of the second network model with the model parameters corresponding to the at least two second intermediate parameter sequences to obtain at least two fifth network models corresponding to the second network model.
Alternatively, the second initial parameter sequence may be subjected to cross processing and mutation processing in a similar manner to the method 30 described above, so as to obtain at least two second intermediate parameter sequences.
S303, carrying out random combination processing on at least two third network models corresponding to the at least two first network models and at least two fifth network models corresponding to the at least two second network models to obtain at least four first language models.
It should be noted that the above S301 to S303 are an explanatory description of determining at least four first language models according to at least two first network models and at least two second network models.
And S304, acquiring a voice verification sample set.
The voice verification sample set is different from the voice training sample set. The voice verification sample set includes, for example, at least one of the SLR18 and SLR68 data sets from a preset open website.
The voice verification sample set comprises a plurality of voice verification samples, a pinyin verification sample corresponding to each voice verification sample, and a text verification result corresponding to each pinyin verification sample.
S305, determining the error rate of each first language model according to the voice verification sample set.
Optionally, for each of the at least four first language models, the first language model includes a first network model and a second network model; determining an error rate for the first language model based on the set of speech verification samples, comprising:
processing the multiple pinyin verification samples sequentially through the first network model and the second network model to obtain text output results corresponding to the multiple pinyin verification samples;
and determining the ratio of the number of pinyin verification samples whose text output results differ from the text verification results to the total number of pinyin verification samples as the error rate of the first language model.
In this application, "sequentially" means that after one network model processes a sample, an intermediate result is provided to the other network model, so that the other network model processes the intermediate result to obtain an output result.
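A minimal sketch of this error-rate computation; the two callables stand in for the first and second network models of one first language model and are assumptions:

```python
def error_rate(pinyin_samples, text_truths, first_model, second_model):
    """Fraction of samples whose decoded text differs from the verification text."""
    wrong = 0
    for pinyin, truth in zip(pinyin_samples, text_truths):
        candidates = first_model(pinyin)           # at least one character sequence
        output = second_model(pinyin, candidates)  # the target character sequence
        if output != truth:
            wrong += 1
    return wrong / len(pinyin_samples)
```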
S306, determining the heterogeneous language model according to the at least four first language models and the error rate.
Alternatively, the heterogeneous language model may be determined by the following means 31 and 32.
In the method 31, the first language model with the smallest error rate in the at least four first language models is determined as the heterogeneous language model.
In the mode 32, whether a first language model with an error rate smaller than a preset value exists in the at least four first language models is judged;
if so, determining the first language model with the error rate smaller than the preset value as a heterogeneous language model;
if not, obtaining first model parameter sequences corresponding to the model parameters of a preset number of the at least four first language models, performing cross processing and mutation processing on those sequences to obtain at least one second language model corresponding to the preset number of first language models, and determining the heterogeneous language model according to the plurality of second language models and the error rate of each second language model.
Optionally, the preset number of first language models may be a preset number of first language models with the smallest error rate in the at least four first language models in sequence, or may be any preset number of first language models in the at least four first language models.
Optionally, when the first language model includes a first network model and a second network model, the first model parameter sequence includes a first initial parameter sequence corresponding to model parameters of the first network model in the first language model and a second initial parameter sequence corresponding to model parameters of the second network model.
Optionally, when the first language model includes a third network model and a fifth network model, the first model parameter sequence includes a third initial parameter sequence corresponding to the model parameter of the third network model in the first language model and a fifth initial parameter sequence corresponding to the model parameter of the fifth network model.
Optionally, determining the heterogeneous language model according to the plurality of second language models and the error rate of each second language model includes: repeatedly performing a method similar to the mode 32 on the plurality of second language models until an Nth language model with an error rate smaller than the preset value is obtained, and determining that Nth language model as the heterogeneous language model, where N may be a positive integer greater than or equal to 2. Among the first language models, the second language models, ..., and the Nth language models, each later generation of language models is obtained based on the previous generation; for example, the second language model is derived based on the first language model.
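This iterative selection can be sketched as a generation loop (illustrative throughout: `eval_error` and `spawn_offspring` are hypothetical callbacks, and the fallback when the threshold is never met is an assumption):

```python
def evolve_until_good(models, eval_error, spawn_offspring,
                      threshold=0.05, keep=4, max_generations=20):
    """Evolve candidates until one has an error rate below `threshold`."""
    for _ in range(max_generations):
        scored = sorted(models, key=eval_error)
        if eval_error(scored[0]) < threshold:
            return scored[0]                  # the heterogeneous language model
        # Next generation: cross and mutate the best `keep` candidates.
        models = [child for m in scored[:keep] for child in spawn_offspring(m)]
    return min(models, key=eval_error)        # fallback if never below threshold
```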
In the following, taking the first language model including the third network model and the fifth network model as an example, a method for obtaining at least one second language model corresponding to the first language model is described:
performing binary conversion processing on the model parameters of the third network model to obtain a third initial parameter sequence;
performing cross processing and mutation processing on the third initial parameter sequence to obtain at least one third intermediate parameter sequence, and replacing the model parameters of the third network model with the model parameters corresponding to the at least one third intermediate parameter sequence to obtain at least one sixth network model corresponding to the third network model;
performing binary conversion processing on the model parameters of the fifth network model to obtain a fifth initial parameter sequence;
performing cross processing and mutation processing on the fifth initial parameter sequence to obtain at least one fifth intermediate parameter sequence, and replacing the model parameters of the fifth network model with the model parameters corresponding to the at least one fifth intermediate parameter sequence to obtain at least one seventh network model corresponding to the fifth network model;
and randomly combining the plurality of sixth network models and the plurality of seventh network models to obtain at least one second language model corresponding to the first language model.
Specifically, S304 to S306 are explanatory descriptions for determining the heterogeneous language model based on at least four first language models.
In the method for determining a heterogeneous language model provided in the embodiment of fig. 3, at least one third network model corresponding to each first network model and at least one fifth network model corresponding to each second network model are randomly combined to obtain at least four first language models, and a plurality of first language models can be obtained under the condition that a speech training sample set has fewer speech samples. Further, determining an error rate of each first language model according to the voice verification sample set; and determining the heterogeneous language model according to the at least four first language models and the error rate, so that the accuracy of the heterogeneous language model can be improved.
On the basis of the above embodiment, a specific implementation process of the above mode 22 will be described below with reference to fig. 4.
Fig. 4 is a flowchart of another method for determining a heterogeneous language model according to an embodiment of the present application. As shown in fig. 4, the method includes:
s401, aiming at each of at least two first network models, obtaining a first initial parameter sequence corresponding to model parameters of the first network model by performing binary conversion processing on the model parameters of the first network model; performing cross processing and mutation processing on the first initial parameter sequence to obtain at least two first intermediate parameter sequences; and replacing the model parameters of the first network model with the model parameters corresponding to the at least two first intermediate parameter sequences to obtain at least two third network models corresponding to the first network model.
Alternatively, the first initial parameter sequence may be subjected to a cross-processing and a mutation processing in a similar manner to the method 30 described above, so as to obtain at least two first intermediate parameter sequences.
Specifically, inverse binary conversion processing is performed on a first intermediate parameter sequence to obtain the model parameters corresponding to that first intermediate parameter sequence, and the model parameters of the first network model are then replaced with those model parameters to obtain a third network model corresponding to the first network model.
S402, determining a first target network model according to the plurality of third network models.
In S402, the plurality of third network models includes at least two third network models corresponding to the at least two first network models in S401.
In some embodiments, S402 specifically includes: acquiring a voice verification sample set; determining an error rate of each third network model according to the voice verification sample set; the first target network model is determined based on the plurality of third network models and the error rate of each third network model.
Here the set of voice verification samples may be the same as the set of voice verification samples in S304.
In some embodiments, for each of a plurality of third network models; determining an error rate of the third network model from the set of speech verification samples, comprising: sequentially processing the multiple pinyin verification samples through the third network model and any second network model to obtain character output results corresponding to the multiple pinyin verification samples; and determining the ratio of the number of the pinyin verification samples with different text output results and text verification results to the total number of the pinyin verification samples as the error rate of the third network model.
Optionally, any one second network model is any one of the at least two second network models, or may be the second network model obtained after the model parameters of the second initial network model have been updated for the Yth time. Y may be the maximum value among Y1, Y2, Y3, ..., or may be a preset fixed value, where Y1, Y2, Y3, ... are the numbers of times the model parameters of the second initial network model are updated.
It should be noted that, for a third network model, in the process of determining the error rate of the third network model, any one of the foregoing second network models is a second network model that is fixed after being selected.
Optionally, a method similar to S306 may be adopted to determine the first target network model according to the plurality of third network models and the error rate of each third network model, which is not described herein again.
S403, aiming at each of the at least two second network models, obtaining a second initial parameter sequence corresponding to the model parameters of the second network model by performing binary conversion processing on the model parameters of the second network model; performing cross processing and mutation processing on the second initial parameter sequence to obtain at least two second intermediate parameter sequences; and replacing the model parameters of the second network model with the model parameters corresponding to the at least two second intermediate parameter sequences to obtain at least two fifth network models corresponding to the second network model.
Specifically, the method of S403 is similar to the method of S401, and is not described herein again.
S404, determining a second target network model according to the plurality of fifth network models.
In S404, the plurality of fifth network models includes at least two fifth network models corresponding to the at least two second network models in S403.
In some embodiments, S404 specifically includes: acquiring a voice verification sample set; determining the error rate of each fifth network model according to the voice verification sample set; and determining a second target network model according to the plurality of fifth network models and the error rate of each fifth network model.
In some embodiments, for each fifth network model of the plurality of fifth network models, determining the error rate of the fifth network model according to the voice verification sample set includes: processing the plurality of voice verification samples sequentially through the first target network model and the fifth network model to obtain text output results corresponding to the plurality of voice verification samples;
and determining the ratio of the number of voice verification samples whose text output results differ from the text verification results to the total number of voice verification samples as the error rate of the fifth network model.
Optionally, a method similar to S306 may be adopted to determine the second target network model according to the plurality of fifth network models and the error rate of each fifth network model, which is not described herein again.
S405, determining the first target network model and the second target network model as heterogeneous language models.
In the method provided in the embodiment of fig. 4, the first initial parameter sequence is subjected to cross processing and mutation processing to obtain the third network models corresponding to each first network model, and the second initial parameter sequence is subjected to cross processing and mutation processing to obtain the fifth network models corresponding to each second network model. In this way, a plurality of third network models and a plurality of fifth network models can be obtained even when the voice training sample set contains few voice samples. Further, the error rate of each third network model and of each fifth network model is determined according to the voice verification sample set, the first target network model is determined according to the plurality of third network models and their error rates, and the second target network model is determined according to the plurality of fifth network models and their error rates. This improves the accuracy of the obtained first target network model and second target network model, and thus the accuracy of the obtained heterogeneous language model.
It should be noted that the above-mentioned embodiments in fig. 3 and fig. 4 are specific methods for obtaining a heterogeneous language model based on a Genetic Matching Merge Algorithm (GMMA).
On the basis of the above embodiments, the embodiments of the present application further provide a method for determining a heterogeneous language model, which is described below with reference to the embodiment of fig. 5.
Fig. 5 is a flowchart of another method for determining a heterogeneous language model according to an embodiment of the present application. As shown in fig. 5, the method includes:
S501, generating a first weight of each first network model and a second weight of each second network model through the weight generation model.
And S502, according to the first weight, fusing the model parameters of the first network models to obtain first target model parameters.
Optionally, according to the first weight, the model parameters of each first network model are subjected to weighted summation processing, so as to implement fusion of the model parameters of each first network model.
For example, suppose the number of first network models is 2 and the weights of the 2 first network models are a1 and a2, respectively. If the model parameters of one first network model include X11, X12 and X13, and the model parameters of the other first network model include X21, X22 and X23, then the target model parameters include a1·X11 + a2·X21, a1·X12 + a2·X22 and a1·X13 + a2·X23.
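A minimal sketch of this weighted fusion, assuming each model's parameters are available as a flat list of floats (the flat representation and the function name are illustrative only):

def fuse_parameters(models_params, weights):
    # Weighted summation of the model parameters of several models of the
    # same structure; models_params is a list of equal-length parameter
    # lists, weights holds one weight per model.
    num_params = len(models_params[0])
    return [
        sum(w * params[k] for w, params in zip(weights, models_params))
        for k in range(num_params)
    ]

# With the two-model example above:
# fuse_parameters([[X11, X12, X13], [X21, X22, X23]], [a1, a2])
# yields [a1*X11 + a2*X21, a1*X12 + a2*X22, a1*X13 + a2*X23].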
And S503, according to the second weight, fusing the model parameters of the second network models to obtain second target model parameters.
Specifically, the method of S503 is similar to the method of S502, and is not described here again.
S504, replacing the model parameters of the first network model with the first target model parameters to obtain a first target network model.
Here, the first network model is any one of the first network models in S501 described above.
And S505, replacing the model parameters of the second network model with the second target model parameters to obtain a second target network model.
Here, the second network model is any one of the second network models in S501 described above.
S501 to S505 are detailed descriptions of determining a first target network model according to at least two first network models and determining a second target network model according to at least two second network models.
S506, determining the first target network model and the second target network model as heterogeneous language models.
In the present application, a first weight corresponding to each first network model and a second weight corresponding to each second network model are generated through the weight generation model to obtain the first target model parameters and the second target model parameters; the first network model with the first target model parameters is determined as the first target network model, and the second network model with the second target model parameters is determined as the second target network model. In this way, a heterogeneous language model with high accuracy can be obtained from a voice training sample set containing few voice samples, and the efficiency of obtaining the heterogeneous language model is improved.
In some embodiments, the method provided in the embodiments of the present application further includes: acquiring a voice verification sample set; determining the error rate of the heterogeneous language model according to the voice verification sample set; determining a reward value according to the error rate; and updating the model parameters of the weight generation model according to the reward value, and regenerating the first weight of each first network model and the second weight of each second network model so as to update the heterogeneous language model.
Optionally, the reward value is equal to 1 minus the product of the error rate and 100%.
On the basis of fig. 5, a method for updating the heterogeneous language model will be described below with reference to fig. 6.
Fig. 6 is a flowchart of a method for updating a heterogeneous language model according to an embodiment of the present application. As shown in fig. 6, the method includes:
S601, generating, through the weight generation model with the i-th group of model parameters, a first weight corresponding to each first network model and a second weight corresponding to each second network model.
S602, according to the first weight, model parameters of the first network models are fused to obtain first target model parameters; and fusing the model parameters of the second network models according to the second weight to obtain second target model parameters.
S603, replacing the model parameters of the first network model with the first target model parameters to obtain a first target network model; replacing the model parameters of the second network model with second target model parameters to obtain a second target network model; and determining the first target network model and the second target network model as the ith heterogeneous language model.
And S604, acquiring a voice verification sample set.
And S605, determining the error rate of the ith heterogeneous language model according to the voice verification sample set.
S606, determining the ith reward value according to the error rate of the ith heterogeneous language model.
S607, judging whether the ith reward value is larger than or equal to the preset reward value.
If so, go to S608, otherwise go to S609.
And S608, determining the ith heterogeneous language model as a final heterogeneous language model.
And S609, determining the (i + 1) th group of model parameters of the weight generation model according to the reward value, and repeatedly executing S601-S609 according to the obtained (i + 1) th group of model parameters. Initially, i equals 1.
Alternatively, the reward value may be substituted into a preset formula, and the preset formula after the reward value is substituted is processed to obtain the (i + 1) th group of model parameters.
For example, the preset formula may have the form of formula 1 as follows:
\nabla_{\omega_a} J = \sum_{t=1}^{T} E_{P(a_{1:T};\,\omega_a)} \left[ \nabla_{\omega_a} \log P\left(a_t \mid a_{(t-1):1};\, \omega_a\right) \cdot R \right]    (formula 1)

wherein J is the objective function; ω_a denotes the model parameters of the weight generation model; T is the total number of weight parts that need to be generated for all models (including each first network model and each second network model in S601); a_t is the weight of the t-th part among the T parts, and a_{(t-1):1} are the weights of the preceding parts; R is the reward value; P is the probability distribution determined by ω_a, and E is the expectation under that distribution; ∇_{ω_a} denotes the gradient with respect to ω_a; and log is the logarithm operation.
It should be noted that formula 1 may be processed by a gradient descent convergence method to obtain ω_a, and the obtained ω_a is determined as the (i+1)-th group of model parameters.
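As a sketch of what one such gradient step could look like, assuming the weight generation model is parameterized as a per-part softmax policy over K discrete weight choices and a single-sample Monte-Carlo estimate of formula 1 is used; the parameterization, the shapes, and the function name are assumptions, not details fixed by the present embodiment:

import numpy as np

def reinforce_step(omega_a, sampled_choices, reward, learning_rate=0.01):
    # One policy-gradient update of the weight generation model's
    # parameters omega_a, shaped (T, K): one row of logits per weight part.
    # sampled_choices holds the choice sampled for each of the T parts;
    # reward is the scalar R for the resulting heterogeneous language model.
    grads = np.zeros_like(omega_a)
    for t, choice in enumerate(sampled_choices):
        logits = omega_a[t]
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        # gradient of log P(a_t) under a softmax policy: one_hot(choice) - probs
        grad_log_p = -probs
        grad_log_p[choice] += 1.0
        grads[t] = reward * grad_log_p
    # ascend the objective J = E[R]; a descent method on -J is equivalent
    return omega_a + learning_rate * grads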
In the process of repeatedly executing S601 to S609 according to the obtained (i+1)-th group of model parameters, the model parameters of the weight generation model are updated according to the reward value, and the first weight of each first network model and the second weight of each second network model are regenerated to obtain a new heterogeneous language model.
In the method provided in the embodiment of fig. 6, each time S601 to S609 are executed, the weight generation model generates a first weight corresponding to each first network model and a second weight corresponding to each second network model, and the heterogeneous language model is obtained according to the first weights and the second weights, which improves the efficiency of obtaining a heterogeneous language model and hence of obtaining the final heterogeneous language model. Further, the i-th reward value is determined according to the error rate of the i-th heterogeneous language model, the (i+1)-th group of model parameters of the weight generation model is determined according to the i-th reward value, S601 to S609 are executed repeatedly, and the heterogeneous language model whose reward value is greater than or equal to the preset reward value is determined as the final heterogeneous language model.
In the prior art, the LM is an isomorphic language model that includes a single network model, which obtains the target character sequence directly from the input pinyin sequence; the process of determining the target character sequence among at least one candidate character sequence is missing, so the accuracy of the isomorphic language model is poor. In the present application, the LM is a heterogeneous language model including two network models with different structures and functions that are used in cooperation: an approximate range of results is obtained first (that is, the input pinyin sequence is processed to obtain at least one character sequence corresponding to the pinyin sequence), and the approximate range is then judged finely (for example, a target character sequence corresponding to the pinyin sequence is determined from the at least one character sequence), thereby improving the accuracy of the heterogeneous language model.
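The cooperation of the two network models can be pictured with a short sketch. Reading the "fine judgment" as the second model scoring each candidate, with the best-scoring candidate winning, is one plausible interpretation, not a detail specified here; both callables are assumptions.

def heterogeneous_decode(first_model, second_model_score, pinyin_sequence):
    # Stage 1: the first network model proposes a coarse candidate set
    # (the approximate range of results).
    candidates = first_model(pinyin_sequence)
    # Stage 2: the second network model judges the candidates finely; the
    # target character sequence is the best-scoring candidate.
    _, target_sequence = max(
        (second_model_score(pinyin_sequence, c), c) for c in candidates
    )
    return target_sequence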
Specifically, the embodiments of fig. 5 and 6 provide a specific method for obtaining the heterogeneous language model based on a Reinforcement Learning Guided Merge Algorithm (RLGMA).
Fig. 7 is an architecture diagram for obtaining a heterogeneous language model based on GMMA according to an embodiment of the present application. As shown in fig. 7, the architecture includes: at least two first network models and at least two second network models.
Optionally, the at least two first network models include first network models 1-1 to 1-M, and the at least two second network models include second network models 2-1 to 2-N, where M is the total number of first network models, N is the total number of second network models, and M and N are each integers greater than or equal to 2.
Specifically, on the basis of fig. 3: the first fusion process includes, for example, the processing involved in obtaining, in S301, the at least two third network models corresponding to each of the at least two first network models; the second fusion process includes, for example, the processing involved in obtaining, in S302, the at least two fifth network models corresponding to each of the at least two second network models; and the third process includes, for example, the processing involved in S303 to S306, namely obtaining at least four first language models, acquiring the voice verification sample set, determining the error rate of each first language model according to the voice verification sample set, and determining the heterogeneous language model according to the at least four first language models and the error rates.
On the basis of fig. 4: the first fusion process includes, for example, the processing involved in obtaining, in S401, the at least two third network models corresponding to each first network model, and in determining, in S402, the first target network model according to the plurality of third network models; the second fusion process includes, for example, the processing involved in obtaining, in S403, the at least two fifth network models corresponding to each second network model, and in determining, in S404, the second target network model according to the plurality of fifth network models; and the third process includes, for example, the processing involved in determining, in S405, the first target network model and the second target network model as the heterogeneous language model.
Fig. 8 is an architecture diagram for obtaining a heterogeneous language model based on RLGMA according to an embodiment of the present application. The architecture includes: the weight generation model, each first network model with its corresponding first weight, and each second network model with its corresponding second weight.
Among the first network models, the first weight corresponding to the i-th first network model is A_{1i}, i = 1, …, M.
Among the second network models, the second weight corresponding to the j-th second network model is B_{2j}, j = 1, …, N.
The following process may be repeatedly performed in fig. 8: the weight generation model generates, based on the reward value, a first weight corresponding to each first network model and a second weight corresponding to each second network model; the first network models and the second network models yield a language model based on the first weights and the second weights; and the first weights and the second weights are continuously updated until the reward value is greater than or equal to the preset reward value, at which point the final heterogeneous language model is obtained.
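Putting the pieces together, a compact sketch of the repeated process of fig. 8 under stated assumptions: generate_weights and update are assumed methods of the weight generation model, fuse_parameters and error_rate are the sketches given earlier, make_model and the .parameters attribute are hypothetical helpers for building a runnable model from fused parameters, and reward = 1 − error rate follows the optional definition given earlier.

def rlgma_train(weight_model, first_models, second_models,
                verification_set, preset_reward, max_rounds=1000):
    # Repeat: generate weights, fuse model parameters, evaluate the merged
    # heterogeneous language model, and update the weight generation model
    # until the reward value reaches the preset reward value.
    for _ in range(max_rounds):
        first_w, second_w = weight_model.generate_weights()    # assumed API
        first_target = fuse_parameters([m.parameters for m in first_models], first_w)
        second_target = fuse_parameters([m.parameters for m in second_models], second_w)
        err = error_rate(make_model(first_target),             # make_model: hypothetical loader
                         make_model(second_target),
                         verification_set)
        reward = 1.0 - err
        if reward >= preset_reward:
            break                                              # final heterogeneous language model
        weight_model.update(reward)                            # e.g. the reinforce_step above
    return first_target, second_target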
Fig. 9 is a schematic structural diagram of a training apparatus for a heterogeneous language model according to an embodiment of the present application. As shown in fig. 9, the training apparatus 10 for a heterogeneous language model includes a processing module 101, and the processing module 101 is configured to:
acquiring a voice training sample set;
training the first initial network model and the second initial network model by adopting a voice training sample set to obtain at least two first network models and at least two second network models; the first network model and the second network model have different structures, the first network model is used for processing the input pinyin sequence to obtain at least one character sequence corresponding to the pinyin sequence, and the second network model is used for determining a target character sequence corresponding to the pinyin sequence from the at least one character sequence;
determining a heterogeneous language model according to the at least two first network models and the at least two second network models.
The training device for the heterogeneous language model provided by the embodiment of the application can execute the training method for the heterogeneous language model, the implementation principle and the beneficial effect are similar, and the repeated description is omitted here.
Optionally, the processing module 101 is specifically configured to:
determining at least four first language models according to at least two first network models and at least two second network models;
acquiring a voice verification sample set;
determining an error rate for each first language model based on the set of speech verification samples;
and determining the heterogeneous language model according to the at least four first language models and the error rate.
Optionally, the processing module 101 is specifically configured to:
and carrying out random combination processing on the at least two first network models and the at least two second network models to obtain at least four first language models.
Optionally, the processing module 101 is specifically configured to:
for each first network model, performing binary conversion processing on the model parameters of the first network model to obtain a first initial parameter sequence; performing cross processing and mutation processing on the first initial parameter sequence to obtain at least two first intermediate parameter sequences; and replacing the model parameters of the first network model with the model parameters corresponding to the at least two first intermediate parameter sequences to obtain at least two third network models corresponding to the first network model;
for each second network model, performing binary conversion processing on the model parameters of the second network model to obtain a second initial parameter sequence; performing cross processing and mutation processing on the second initial parameter sequence to obtain at least two second intermediate parameter sequences; and replacing the model parameters of the second network model with the model parameters corresponding to the at least two second intermediate parameter sequences to obtain at least two fifth network models corresponding to the second network model;
and carrying out random combination processing on at least two third network models corresponding to the at least two first network models and at least two fifth network models corresponding to the at least two second network models to obtain at least four first language models.
Optionally, the voice verification sample set includes a plurality of pinyin verification samples and text verification results corresponding to the plurality of pinyin verification samples; and each first language model of the at least four first language models includes a first network model and a second network model;
the processing module 101 is specifically configured to:
processing the multiple pinyin verification samples sequentially through the first network model and the second network model to obtain character output results corresponding to the multiple pinyin verification samples;
and determining the ratio of the number of the pinyin verification samples with different text output results and text verification results to the total number of the pinyin verification samples as the error rate of the first language model.
Optionally, the processing module 101 is specifically configured to:
judging whether a first language model with an error rate smaller than a preset value exists among the at least four first language models;
if so, determining the first language model with the error rate smaller than the preset value as the heterogeneous language model;
if not, obtaining a first model parameter sequence corresponding to the model parameters of a preset number of first language models among the at least four first language models to obtain at least one second language model corresponding to the preset number of first language models, and determining the heterogeneous language model according to the plurality of second language models and the error rate of each second language model.
Optionally, the processing module 101 is specifically configured to:
determining a first target network model according to at least two first network models, and determining a second target network model according to at least two second network models;
and determining the first target network model and the second target network model as heterogeneous language models.
Optionally, the processing module 101 is specifically configured to:
for each first network model, performing binary conversion processing on the model parameters of the first network model to obtain a first initial parameter sequence; performing cross processing and mutation processing on the first initial parameter sequence to obtain at least two first intermediate parameter sequences; and replacing the model parameters of the first network model with the model parameters corresponding to the at least two first intermediate parameter sequences to obtain at least two third network models corresponding to the first network model;
and determining the first target network model according to the plurality of third network models.
Optionally, the processing module 101 is specifically configured to:
acquiring a voice verification sample set;
determining an error rate of each third network model according to the voice verification sample set;
and determining the first target network model according to the plurality of third network models and the error rate.
Optionally, the voice verification sample set includes a plurality of pinyin verification samples and text verification results corresponding to the plurality of pinyin verification samples;
the processing module 101 is specifically configured to:
sequentially processing the multiple pinyin verification samples through the third network model and any second network model to obtain character output results corresponding to the multiple pinyin verification samples;
and determining the ratio of the number of the pinyin verification samples with different text output results and text verification results to the total number of the pinyin verification samples as the error rate of the third network model.
Optionally, the processing module 101 is specifically configured to:
judging whether a third network model with an error rate smaller than a preset value exists among the plurality of third network models;
if so, determining the third network model with the error rate smaller than the preset value as the first target network model;
if not, obtaining a first initial parameter sequence corresponding to the model parameters of a preset number of third network models among the plurality of third network models to obtain at least one fourth network model corresponding to the preset number of third network models, and determining the first target network model according to the plurality of fourth network models and the error rate of each fourth network model.
Optionally, the processing module 101 is specifically configured to:
generating a first weight of each first network model and a second weight of each second network model through the weight generation model;
according to the first weight, model parameters of the first network models are fused to obtain first target model parameters; according to the second weight, model parameters of each second network model are fused to obtain second target model parameters;
replacing the model parameters of the first network model with first target model parameters to obtain a first target network model; and replacing the model parameters of the second network model with the second target model parameters to obtain a second target network model.
Optionally, the processing module 101 is further configured to:
acquiring a voice verification sample set;
determining the error rate of the heterogeneous language model according to the voice verification sample set, and determining a reward value according to the error rate;
and updating the model parameters of the weight generation model according to the reward value, and regenerating the first weight of each first network model and the second weight of each second network model to obtain a new heterogeneous language model.
Optionally, the speech training sample set comprises at least two sample subsets; the processing module 101 is specifically configured to:
respectively adopting at least two sample subsets to train the first initial network model to obtain first network models corresponding to the at least two sample subsets;
respectively adopting at least two sample subsets to train the second initial network model to obtain second network models corresponding to the at least two sample subsets;
determining a first network model corresponding to each of the at least two sample subsets as at least two first network models; and determining the second network model corresponding to each of the at least two sample subsets as at least two second network models.
The training device 10 for the heterogeneous language model provided in the embodiment of the present application can execute the training method for the heterogeneous language model, and the implementation principle and the beneficial effect thereof are similar, and are not repeated here.
Fig. 10 is a hardware schematic diagram of an electronic device according to an embodiment of the present application. As shown in fig. 10, the electronic device 20 may include: a transceiver 201, a memory 202, a processor 203. The transceiver 201 may include: a transmitter and/or a receiver. The transmitter may also be referred to as a sender, a transmitter, a sending port or a sending interface, and the like, and the receiver may also be referred to as a receiver, a receiving port or a receiving interface, and the like. The transceiver 201, the memory 202 and the processor 203 are illustratively interconnected via a bus 204.
The memory 202 is used for storing computer execution instructions;
the processor 203 is configured to execute computer-executable instructions stored in the memory 202 such that the processor 203 performs a method of training a heterogeneous language model.
The embodiment of the application provides a computer-readable storage medium, in which computer-executable instructions are stored, and when the computer-executable instructions are executed by a processor, the method for training a heterogeneous language model is implemented.
The embodiment of the present application further provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the method for training a heterogeneous language model can be implemented.
All or a portion of the steps implementing the above method embodiments may be performed by hardware associated with program instructions. The aforementioned program may be stored in a readable memory. When executed, the program performs the steps of the method embodiments described above; and the aforementioned memory (storage medium) includes: read-only memory (ROM), RAM, flash memory, hard disk, solid state disk, magnetic tape, floppy disk, optical disk, and any combination thereof.
Embodiments of the present application are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the embodiments of the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to encompass such modifications and variations.
In the present application, the terms "include" and variations thereof may refer to non-limiting inclusions; the term "or" and variations thereof may mean "and/or". The terms "first", "second", and the like in this application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. In the present application, "a plurality of" means two or more. "And/or" describes the association relationship of the associated objects, meaning that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (18)

1. A method for training a heterogeneous language model, comprising:
acquiring a voice training sample set;
training the first initial network model and the second initial network model by adopting the voice training sample set to obtain at least two first network models and at least two second network models; the first network model and the second network model have different structures, the first network model is used for processing an input pinyin sequence to obtain at least one character sequence corresponding to the pinyin sequence, and the second network model is used for determining a target character sequence corresponding to the pinyin sequence from the at least one character sequence;
and determining a heterogeneous language model according to the at least two first network models and the at least two second network models.
2. The method for training the heterogeneous language model according to claim 1, wherein the determining the heterogeneous language model according to the at least two first network models and the at least two second network models comprises:
determining at least four first language models according to the at least two first network models and the at least two second network models;
acquiring a voice verification sample set;
determining an error rate for each first language model from the set of speech verification samples;
and determining the heterogeneous language model according to the at least four first language models and the error rate.
3. The method for training heterogeneous language models according to claim 2, wherein the determining at least four first language models according to the at least two first network models and the at least two second network models comprises:
and carrying out random combination processing on the at least two first network models and the at least two second network models to obtain at least four first language models.
4. The method for training heterogeneous language models according to claim 2, wherein the determining at least four first language models according to the at least two first network models and the at least two second network models comprises:
for each first network model, performing binary conversion processing on the model parameters of the first network model to obtain a first initial parameter sequence; performing cross processing and mutation processing on the first initial parameter sequence to obtain at least two first intermediate parameter sequences; replacing the model parameters of the first network model with model parameters corresponding to at least two first intermediate parameter sequences to obtain at least two third network models corresponding to the first network model;
for each second network model, carrying out binary conversion processing on the model parameters of the second network model to obtain a second initial parameter sequence; performing cross processing and mutation processing on the second initial parameter sequence to obtain at least two second intermediate parameter sequences; replacing the model parameters of the second network model with model parameters corresponding to at least two second intermediate parameter sequences to obtain at least two fifth network models corresponding to the second network model;
and carrying out random combination processing on at least two third network models corresponding to the at least two first network models and at least two fifth network models corresponding to the at least two second network models to obtain at least four first language models.
5. The method for training a heterogeneous language model according to claim 2, wherein the voice verification sample set includes a plurality of pinyin verification samples and text verification results corresponding to the plurality of pinyin verification samples; and each first language model of the at least four first language models includes the first network model and the second network model;
determining an error rate for a first language model from the set of speech verification samples, comprising:
sequentially processing the multiple pinyin verification samples through the first network model and the second network model to obtain character output results corresponding to the multiple pinyin verification samples;
and determining the ratio of the number of the pinyin verification samples with different text output results and text verification results to the total number of the pinyin verification samples as the error rate of the first language model.
6. The method for training the heterogeneous language model according to claim 5, wherein the determining the heterogeneous language model according to the at least four first language models and the error rate comprises:
judging whether a first language model with an error rate smaller than a preset value exists in the at least four first language models;
if so, determining the first language model with the error rate smaller than the preset value as the heterogeneous language model;
if not, obtaining a first model parameter sequence corresponding to the model parameters of a preset number of first language models in the at least four first language models to obtain at least one second language model corresponding to the preset number of first language models, and determining the heterogeneous language model according to the plurality of second language models and the error rate of each second language model.
7. The method for training the heterogeneous language model according to claim 1, wherein determining the heterogeneous language model according to the at least two first network models and the at least two second network models comprises:
determining a first target network model according to the at least two first network models, and determining a second target network model according to the at least two second network models;
and determining the first target network model and the second target network model as the heterogeneous language model.
8. The method for training the heterogeneous language model according to claim 7, wherein the determining the first target network model according to the at least two first network models comprises:
for each first network model, performing binary conversion processing on the model parameters of the first network model to obtain a first initial parameter sequence; performing cross processing and mutation processing on the first initial parameter sequence to obtain at least two first intermediate parameter sequences; replacing the model parameters of the first network model with model parameters corresponding to at least two first intermediate parameter sequences to obtain at least two third network models corresponding to the first network model;
determining the first target network model according to a plurality of third network models.
9. The method for training a heterogeneous language model according to claim 8, wherein the determining the first target network model according to the plurality of third network models comprises:
acquiring a voice verification sample set;
determining an error rate of each third network model according to the voice verification sample set;
determining a first target network model based on the plurality of third network models and the error rate.
10. The method for training a heterogeneous language model according to claim 9, wherein the set of speech verification samples includes a plurality of pinyin verification samples and text verification results corresponding to the plurality of pinyin verification samples;
determining an error rate of a third network model from the set of speech verification samples, comprising:
sequentially processing the multiple pinyin verification samples through the third network model and any one of the second network models to obtain character output results corresponding to the multiple pinyin verification samples;
and determining the ratio of the number of the pinyin verification samples with different text output results and text verification results to the total number of the pinyin verification samples as the error rate of the third network model.
11. The method for training a heterogeneous language model according to claim 9, wherein the determining a first target network model according to the plurality of third network models and the error rate comprises:
judging whether a third network model with an error rate smaller than a preset value exists in the plurality of third network models;
if so, determining the third network model with the error rate smaller than the preset value as the first target network model;
if not, obtaining a first initial parameter sequence corresponding to the model parameters of a preset number of third network models in the plurality of third network models to obtain at least one fourth network model corresponding to each of the preset number of third network models, and determining the first target network model according to the plurality of fourth network models and the error rate of each fourth network model.
12. The method for training the heterogeneous language model according to claim 7, wherein the determining a first target network model according to the at least two first network models and determining a second target network model according to the at least two second network models comprises:
generating a first weight of each first network model and a second weight of each second network model through the weight generation model;
according to the first weight, model parameters of the first network models are fused to obtain first target model parameters; according to the second weight, model parameters of each second network model are fused to obtain second target model parameters;
replacing the model parameters of the first network model with the first target model parameters to obtain the first target network model; and replacing the model parameters of the second network model with the second target model parameters to obtain the second target network model.
13. The method for training the heterogeneous language model according to claim 12, wherein the method further comprises:
acquiring a voice verification sample set;
determining the error rate of the heterogeneous language model according to the voice verification sample set, and determining a reward value according to the error rate;
and updating the model parameters of the weight generation model according to the reward value, and regenerating the first weight of each first network model and the second weight of each second network model to obtain a new heterogeneous language model.
14. A method for training a heterogeneous language model according to any one of claims 1 to 13, wherein the speech training sample set comprises at least two sample subsets; the training of the first initial network model and the second initial network model by using the voice training sample set to obtain at least two first network models and at least two second network models includes:
respectively adopting the at least two sample subsets to train the first initial network model to obtain first network models corresponding to the at least two sample subsets;
training the second initial network model by respectively adopting the at least two sample subsets to obtain second network models corresponding to the at least two sample subsets;
determining a first network model corresponding to each of the at least two sample subsets as the at least two first network models; and determining a second network model corresponding to each of at least two sample subsets as the at least two second network models.
15. An apparatus for training a heterogeneous language model, comprising: a processing module; the processing module is used for:
acquiring a voice training sample set;
training the first initial network model and the second initial network model by adopting the voice training sample set to obtain at least two first network models and at least two second network models; the first network model and the second network model have different structures, the first network model is used for processing an input pinyin sequence to obtain at least one character sequence corresponding to the pinyin sequence, and the second network model is used for determining a target character sequence corresponding to the pinyin sequence from the at least one character sequence;
and determining a heterogeneous language model according to the at least two first network models and the at least two second network models.
16. An electronic device, comprising: a processor, and a memory communicatively coupled to the processor;
the memory stores computer-executable instructions;
the processor executes computer-executable instructions stored by the memory to implement the method of any of claims 1 to 14.
17. A computer-readable storage medium having computer-executable instructions stored thereon, which when executed by a processor, are configured to implement the method of any one of claims 1 to 14.
18. A computer program product, characterized in that it comprises a computer program which, when executed by a processor, implements the method of any one of claims 1 to 14.
CN202210074846.4A 2022-01-21 2022-01-21 Heterogeneous language model training method, device, equipment and storage medium Pending CN114416955A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210074846.4A CN114416955A (en) 2022-01-21 2022-01-21 Heterogeneous language model training method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210074846.4A CN114416955A (en) 2022-01-21 2022-01-21 Heterogeneous language model training method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114416955A (en) 2022-04-29

Family

ID=81274835

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210074846.4A Pending CN114416955A (en) 2022-01-21 2022-01-21 Heterogeneous language model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114416955A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024011902A1 (en) * 2022-07-14 2024-01-18 京东科技信息技术有限公司 Speech recognition model training method and apparatus, storage medium, and electronic device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination