CN113420121B - Text processing model training method, voice text processing method and device - Google Patents

Text processing model training method, voice text processing method and device

Info

Publication number
CN113420121B
Authority
CN
China
Prior art keywords
processing model
text processing
feature vector
text
vector
Prior art date
Legal status
Active
Application number
CN202110704938.1A
Other languages
Chinese (zh)
Other versions
CN113420121A (en)
Inventor
周军
张震
李成章
李鹏
刘建
石瑾
刘睿霖
颜永红
Current Assignee
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Original Assignee
Institute of Acoustics CAS
National Computer Network and Information Security Management Center
Priority date
Filing date
Publication date
Application filed by Institute of Acoustics CAS and National Computer Network and Information Security Management Center
Priority to CN202110704938.1A
Publication of CN113420121A
Application granted
Publication of CN113420121B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/3331: Query processing
    • G06F 16/334: Query execution
    • G06F 16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/332: Query formulation
    • G06F 16/3329: Natural language query formulation or dialogue systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/95: Retrieval from the web
    • G06F 16/951: Indexing; Web crawling techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a text processing model training method, a voice text processing method and a device, and relates to the technical field of natural language processing. The method comprises the following steps: crawling dialogue texts from the Internet to obtain positive samples; performing transformation operation on sentences in the dialogue text to obtain a negative sample and first label information of the negative sample; correspondingly inputting the positive sample and the negative sample into a first text processing model to be trained and a second text processing model to be trained, and generating a first feature vector of a target layer of the first text processing model and a second feature vector of a target layer of the second text processing model; and carrying out knowledge distillation on the second text processing model according to the first feature vector and the second feature vector to obtain a trained second text processing model. According to the embodiment of the application, the problems that the efficiency of correcting the voice text is low, the time consumption is long and the occupied computing resource is large in the related technology can be solved.

Description

Text processing model training method, voice text processing method and device
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a text processing model training method, a voice text processing method, and a device.
Background
With the development of natural language processing technology and the demand of people for high efficiency, speech recognition technology has been widely applied to various fields of life, such as converting text after recording conference content as a conference summary; the recorded content of the lecture for the teacher is converted into text as a classroom note, and so on.
Currently, in order to convert speech into text accurately and make the converted text easy for users to understand, the recognized speech must first be converted into text and the text must then be corrected, so as to obtain text that users can readily understand. However, when the model used for text collation in the related art is trained, the lack of sufficient training samples makes the model difficult to train, and even when training succeeds, the resulting text collation model needs multiple iterations to complete the collation, so the process is time-consuming, inefficient, and occupies a large amount of computing resources.
Disclosure of Invention
The embodiment of the application provides a text processing model training method, a voice text processing method and a voice text processing device, which can solve the problems of low efficiency, long time consumption and large occupation of computing resources in the related technology for correcting voice texts.
In a first aspect, an embodiment of the present application provides a text processing model training method, including:
Crawling dialogue texts from the Internet to obtain positive samples; the sentences in the dialogue text are sentences with correct grammar, and the positive samples are sentences in the dialogue text;
performing transformation operation on sentences in the dialogue text to obtain negative samples and first label information of the negative samples, wherein the sentences in the negative samples are sentences with grammar errors, and the first label information represents a transformation sequence for transforming the positive samples into the negative samples;
correspondingly inputting the positive sample and the negative sample into a pre-trained first text processing model and a second text processing model to be trained, and generating a first feature vector of a target layer of the first text processing model and a second feature vector of a target layer of the second text processing model; the dimension of the second text processing model is smaller than that of the first text processing model, and the first text processing model is obtained in advance through training according to the positive sample, the negative sample and the transformation sequence of the negative sample;
and carrying out knowledge distillation on the second text processing model according to the first feature vector and the second feature vector to obtain a trained second text processing model.
In one possible implementation, in a case where the number of layers of the first text processing model is the same as the number of layers of the second text processing model, the first feature vector includes a first input layer feature vector, a first hidden layer feature vector, a first attention vector, and a first predictive collation vector for collating negative samples, and the second feature vector includes a second input layer feature vector, a second hidden layer feature vector, a second attention vector, and a second predictive collation vector for collating negative samples.
In one possible implementation manner, knowledge distillation is performed on the second text processing model according to the first feature vector and the second feature vector to obtain a trained second text processing model, including:
determining a projection matrix according to the dimension of the first text processing model and the dimension of the second text processing model;
calculating a first mean square error loss between an input layer of the first text processing model and an input layer of the second text processing model according to the projection matrix, the first input layer feature vector and the second input layer feature vector;
calculating a second mean square error loss between the hidden layer of the first text processing model and the hidden layer of the second text processing model according to the projection matrix, the first hidden layer feature vector and the second hidden layer feature vector;
calculating a third mean square error loss between the first and second attention vectors;
calculating cross entropy loss of the first prediction correction vector and the second prediction correction vector according to preset temperature parameters;
and updating the second text processing model according to the first mean square error loss, the second mean square error loss, the third mean square error loss and the cross entropy loss.
In one possible implementation, in a case where the number of layers of the first text processing model is M, the number of layers of the second text processing model is N, and M is not equal to N, the first feature vector includes a first attention vector of each of the M layers of the first text processing model, a first hidden layer feature vector of each hidden layer, a first input layer feature vector, and a first predictive collation vector for collating negative samples, and the second feature vector includes a second attention vector of each of the N layers of the second text processing model, a second hidden layer feature vector of each hidden layer, a second input layer feature vector, and a second predictive collation vector for collating negative samples.
In one possible implementation manner, knowledge distillation is performed on the second text processing model according to the first feature vector and the second feature vector to obtain a trained second text processing model, including:
determining a projection matrix according to the dimension of the first text processing model and the dimension of the second text processing model;
calculating a first mean square error loss between an input layer of the first text processing model and an input layer of the second text processing model according to the projection matrix, the first input layer feature vector and the second input layer feature vector;
calculating cross entropy loss of the first prediction correction vector and the second prediction correction vector according to preset temperature parameters;
comparing the first attention vector of each layer of the M layers with the second attention vector of each layer of the N layers in pairs to obtain an attention loss matrix between the first text processing model and the second text processing model;
comparing the first hidden layer feature vector of each layer in the M layers with the second hidden layer feature vector of each layer in the N layers in pairs to obtain a hidden layer loss matrix between the first text processing model and the second text processing model;
calculating a first Earth Mover's Distance (EMD) matrix according to the weight of each layer in the first text processing model, the weight of each layer in the second text processing model and the attention loss matrix;
calculating a second EMD matrix according to the weight of each layer in the first text processing model, the weight of each layer in the second text processing model and the hidden layer loss matrix;
calculating a fourth mean square error loss between the first attention vectors of the M layers in the first text processing model and the second attention vectors of the N layers in the second text processing model according to the first EMD matrix and the attention loss matrix;
according to the second EMD matrix and the hidden layer loss matrix, calculating a fifth mean square error loss between the first hidden layer feature vectors of the M layers in the first text processing model and the second hidden layer feature vectors of the N layers in the second text processing model;
and updating the weight of each layer in the first text processing model and the weight of each layer in the second text processing model according to the first mean square error loss, the cross entropy loss, the fourth mean square error loss and the fifth mean square error loss until the first mean square error loss, the cross entropy loss, the fourth mean square error loss and the fifth mean square error loss converge.
In one possible implementation, the method further includes:
inputting the positive sample and the negative sample into a second text processing model to be trained, and generating a predicted collation sequence of the negative sample;
Training a text processing model based on the negative sample predicted collation sequence and the first tag information.
In one possible implementation, the training samples further include a positive sample pair and second tag information of the positive sample pair, the second tag information representing a conversion sequence that converts the positive sample into a positive sample, the two positive samples in the positive sample pair being identical, the method further comprising:
inputting the positive sample pair into a trained text processing model to generate a predicted collation sequence of the positive sample;
and training a text processing model according to the predicted collation sequence of the positive sample and the second label information.
In one possible implementation, inputting the positive and negative samples into a second text processing model to be trained, generating a predicted collation sequence of negative samples includes:
under the condition that the number of characters in the positive sample is larger than the preset number, inputting the preset number of characters in the positive sample and the characters corresponding to the preset number of characters in the positive sample in the negative sample into a second text processing model to be trained according to the sequence from front to back to obtain a prediction correction sequence of the preset number of characters in the negative sample;
and taking the characters left in the positive sample and the characters left in the negative sample as training samples of the next model training process.
In a second aspect, an embodiment of the present application provides a method for processing a voice text, where the method includes:
recognizing a voice text corresponding to the target voice;
inputting the phonetic text into a second text processing model as in the first aspect or any of the possible implementation manners of the first aspect, determining a collation sequence of the phonetic text, the collation sequence representing a collation rule for each character in the phonetic text;
and correcting the voice text according to the correction sequence to obtain the correction text corresponding to the target voice.
In a third aspect, an embodiment of the present application provides a text processing model training apparatus, including:
the acquisition module is used for crawling dialogue texts from the Internet to obtain positive samples; the sentences in the dialogue text are sentences with correct grammar, and the positive samples are sentences in the dialogue text;
the transformation module is used for performing transformation operation on sentences in the dialogue text to obtain negative samples and first label information of the negative samples, wherein sentences in the negative samples are sentences with grammar errors, and the first label information represents a transformation sequence for transforming the positive sample into the negative sample;
the generation module is used for correspondingly inputting the positive sample and the negative sample into a first text processing model which is trained in advance and a second text processing model which is to be trained, and generating a first feature vector of a target layer of the first text processing model and a second feature vector of a target layer of the second text processing model; the dimension of the second text processing model is smaller than that of the first text processing model, and the first text processing model is obtained through training according to a transformation sequence of a positive sample, a negative sample and a negative sample;
And the training module is used for carrying out knowledge distillation on the second text processing model according to the first feature vector and the second feature vector to obtain a trained second text processing model.
In one possible implementation, in a case where the number of layers of the first text processing model is the same as the number of layers of the second text processing model, the first feature vector includes a first input layer feature vector, a first hidden layer feature vector, a first attention vector, and a first predictive collation vector for collating negative samples, and the second feature vector includes a second input layer feature vector, a second hidden layer feature vector, a second attention vector, and a second predictive collation vector for collating negative samples.
In one possible implementation, the training module is configured to:
determining a projection matrix according to the dimension of the first text processing model and the dimension of the second text processing model;
calculating a first mean square error loss between an input layer of the first text processing model and an input layer of the second text processing model according to the projection matrix, the first input layer feature vector and the second input layer feature vector;
calculating a second mean square error loss between the hidden layer of the first text processing model and the hidden layer of the second text processing model according to the projection matrix, the first hidden layer feature vector and the second hidden layer feature vector;
Calculating a third mean square error loss between the first and second attention vectors;
calculating cross entropy loss of the first prediction correction vector and the second prediction correction vector according to preset temperature parameters;
and updating the second text processing model according to the first mean square error loss, the second mean square error loss, the third mean square error loss and the cross entropy loss.
In one possible implementation, in a case where the number of layers of the first text processing model is M, the number of layers of the second text processing model is N, and M is not equal to N, the first feature vector includes a first attention vector of each of the M layers of the first text processing model, a first hidden layer feature vector of each hidden layer, a first input layer feature vector, and a first predictive collation vector for collating negative samples, and the second feature vector includes a second attention vector of each of the N layers of the second text processing model, a second hidden layer feature vector of each hidden layer, a second input layer feature vector, and a second predictive collation vector for collating negative samples.
In one possible implementation, the training module is configured to:
determining a projection matrix according to the dimension of the first text processing model and the dimension of the second text processing model;
Calculating a first mean square error loss between an input layer of the first text processing model and an input layer of the second text processing model according to the projection matrix, the first input layer feature vector and the second input layer feature vector;
calculating cross entropy loss of the first prediction correction vector and the second prediction correction vector according to preset temperature parameters;
comparing the first attention vector of each layer of the M layers with the second attention vector of each layer of the N layers in pairs to obtain an attention loss matrix between the first text processing model and the second text processing model;
comparing the first hidden layer feature vector of each layer in the M layers with the second hidden layer feature vector of each layer in the N layers in pairs to obtain a hidden layer loss matrix between the first text processing model and the second text processing model;
calculating a first Earth Mover's Distance (EMD) matrix according to the weight of each layer in the first text processing model, the weight of each layer in the second text processing model and the attention loss matrix;
calculating a second EMD matrix according to the weight of each layer in the first text processing model, the weight of each layer in the second text processing model and the hidden layer loss matrix;
calculating a fourth mean square error loss between the first attention vectors of the M layers in the first text processing model and the second attention vectors of the N layers in the second text processing model according to the first EMD matrix and the attention loss matrix;
according to the second EMD matrix and the hidden layer loss matrix, calculating a fifth mean square error loss between the first hidden layer feature vectors of the M layers in the first text processing model and the second hidden layer feature vectors of the N layers in the second text processing model;
and updating the weight of each layer in the first text processing model and the weight of each layer in the second text processing model according to the first mean square error loss, the cross entropy loss, the fourth mean square error loss and the fifth mean square error loss until the first mean square error loss, the cross entropy loss, the fourth mean square error loss and the fifth mean square error loss converge.
In one possible implementation, the apparatus further includes:
the determining module is used for determining a transformation sequence corresponding to the negative sample according to transformation operation to obtain first label information of the negative sample; wherein the first tag information represents a transformation sequence that transforms positive samples into negative samples;
the generation module is also used for inputting the positive sample and the negative sample into a second text processing model to be trained, and generating a predicted collation sequence of the negative sample;
The training module is also used for training a text processing model according to the predicted collation sequence of the negative sample and the first label information.
In one possible implementation manner, the training sample further includes a positive sample pair and second tag information of the positive sample pair, the second tag information represents a conversion sequence for converting the positive sample into the positive sample, two positive samples in the positive sample pair are identical, and the generating module is further configured to input the positive sample pair into the trained text processing model, and generate a predicted calibration sequence of the positive sample;
the training module is also used for training a text processing model according to the predicted collation sequence of the positive sample and the second label information.
In one possible implementation, the generating module is configured to:
under the condition that the number of characters in the positive sample is larger than the preset number, inputting the preset number of characters in the positive sample and the characters corresponding to the preset number of characters in the positive sample in the negative sample into a second text processing model to be trained according to the sequence from front to back to obtain a prediction correction sequence of the preset number of characters in the negative sample;
and taking the characters left in the positive sample and the characters left in the negative sample as training samples of the next model training process.
In a fourth aspect, an embodiment of the present application provides a voice text processing apparatus, where the apparatus includes:
the recognition module is used for recognizing the voice text corresponding to the target voice;
a determining module, configured to input a phonetic text into the second text processing model as in the first aspect or any of the possible implementation manners of the first aspect, and determine a collation sequence of the phonetic text, where the collation sequence represents a collation rule of each character in the phonetic text;
and the proofreading module is used for proofreading the voice text according to the proofreading sequence to obtain the proofreading text corresponding to the target voice.
In a fifth aspect, embodiments of the present application provide an electronic device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program, when executed by the processor, implements the method provided in the first aspect or any one of the possible implementations of the first aspect, or implements the method provided in the second aspect.
In a sixth aspect, embodiments of the present application provide a computer storage medium having instructions stored therein, which when run on a computer, cause the computer to perform the method provided in the first aspect or any one of the possible implementations of the first aspect, or to implement the method provided in the second aspect, as described above.
According to the text processing model training method, the voice text processing method and the voice text processing device, the text with correct grammar such as text related to scene dialogue, conference summary text and the like is crawled from the Internet, so that positive samples are obtained, and a large number of positive samples can be obtained by crawling the text with correct grammar from the Internet; then, the conversion operation is performed on the sentences in the dialogue text, such as deleting characters, replacing homophones, merging natural segments and the like, so that the converted sentences are sentences with grammar errors, a negative sample is obtained, and label information of the negative sample and the negative sample is obtained through the conversion operation, so that a large number of negative samples with labeling information can be obtained. And then, respectively inputting the positive sample and the negative sample into a trained first text processing model and a trained second text processing model to obtain a first feature vector of a target layer of the first text processing model and a second feature vector of a target layer of the second text processing model, wherein the dimension of the second text processing model is smaller than that of the first text processing model. And carrying out knowledge distillation on the second text processing model based on the first feature vector and the second feature vector, so that the second text processing model can learn the features of the first text processing model, and further, the text is checked. Therefore, a lightweight second text processing model can be obtained to correct the text, and the occupation of resources is reduced. And iteration is not needed in the process of text proofreading by using the second text processing model, so that the text proofreading efficiency is improved and the occupation of computing resources is reduced.
Drawings
Fig. 1 shows a schematic flow chart of a text processing model training method according to an embodiment of the present application;
fig. 2 shows a schematic flow chart of a method for processing a voice text according to an embodiment of the present application;
fig. 3 shows a schematic structural diagram of a text processing model training device according to an embodiment of the present application;
fig. 4 shows a schematic structural diagram of a voice text processing device according to an embodiment of the present application;
fig. 5 shows a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be described below with reference to the accompanying drawings.
In the description of embodiments of the present application, words such as "exemplary," "such as" or "for example," are used to indicate by way of example, illustration, or description. Any embodiment or design described herein as "exemplary," "such as" or "for example" is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary," "such as" or "for example," etc., is intended to present related concepts in a concrete fashion.
In the description of the embodiments of the present application, the term "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A alone, B alone, or both A and B. In addition, unless otherwise indicated, the term "plurality" means two or more. For example, a plurality of systems means two or more systems, and a plurality of screen terminals means two or more screen terminals.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating an indicated technical feature. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
With the development of natural language processing technology and the demand of people for high efficiency, speech recognition technology has been widely applied to various fields of life, such as converting text after recording conference content as a conference summary; the recorded content of the lecture for the teacher is converted into text as a classroom note, and so on.
Dialogue content is usually lengthy, meandering, informal, and repetitive; ill-formed sentences, inverted word order, repetition, re-confirmation, hesitation, speaker topic interruptions, and the like also occur, and important information is scattered across multiple speakers and multiple points in time. Moreover, recognition errors in the speech recognition process often make the generated speech recognition text poorly readable, which is not conducive to subsequent review, summarization, and organization of the content.
Currently, in order to convert speech into text accurately and make the converted text easy for users to understand, the recognized speech must first be converted into text and the text must then be corrected, so as to obtain text that users can readily understand. However, when the model used for text collation in the related art is trained, the lack of sufficient training samples makes the model difficult to train, and even when training succeeds, the resulting text collation model needs multiple iterations to complete the collation, so the process is time-consuming, inefficient, and occupies a large amount of computing resources.
Based on the above, the embodiment of the application provides a text processing model training method, a voice text processing method and a device, which can acquire enough training texts, and the trained model is a lightweight model, so that the occupation of storage resources is reduced, the text correction efficiency is improved, and meanwhile, the occupation of computing resources is reduced. The text processing model training method provided in the embodiment of the application is described in detail below.
Fig. 1 is a flow chart of a text processing model training method according to an embodiment of the present application. As shown in fig. 1, the text processing model training method provided in the embodiment of the present application may include S101-S104.
S101: crawling dialogue texts from the Internet to obtain positive samples; the sentences in the dialogue text are sentences with correct grammar, and the positive samples are sentences in the dialogue text.
To obtain a large number of training samples, dialog text, such as forum dialog text, contextual dialog text, video captions, scripts, etc., may be crawled from the internet. Wherein, the sentences in the dialogue text in the Internet are sentences with correct grammar. Sentences in the dialogue text are taken as positive samples in the training samples.
In some embodiments, to ensure accuracy of positive samples, data cleansing may also be performed on the crawled dialog text, e.g., removing special characters and nonsensical spaces, links, pictures, etc. in the dialog text.
S102: and carrying out transformation operation on the sentences in the dialogue text to obtain negative samples and first label information of the negative samples, wherein the sentences in the negative samples are sentences with grammar errors, and the first label information represents a transformation sequence for transforming the positive samples into the negative samples.
The negative sample corresponding to a positive sample is a sentence with grammar errors, so the positive sample and the negative sample form an error-correction parallel corpus pair. In the embodiment of the application, a sentence in the dialogue text is transformed into a sentence with grammar errors by performing a transformation operation on it, for example replacing punctuation or homophones, merging natural paragraphs, randomly deleting characters, inserting characters, and so on. In this way, a negative sample corresponding to the positive sample can be generated. For example, the positive sample "i am in a coast street, you come in a bar" can be transformed into "i am in a coast street, you go out of a bar".
The transformation sequence for transforming the negative sample into the positive sample can be determined from the transformation operation, so as to obtain the first label information of the negative sample. The transformation sequence records the transformation operation corresponding to each character in the sentence; for example, a sequence such as (hold, replace, hold, hold, delete, hold) indicates, character by character, whether each character is kept, replaced, or deleted.
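As an illustration of how such synthetic negative samples and their character-level label sequences might be generated, the following Python sketch performs filler insertion and homophone substitution. The homophone table, filler characters, probabilities, tag names, and example sentence are assumptions rather than details given in this disclosure, and the tags here describe how the positive sample is restored from the negative one, as in the collation example above.

```python
import random

# Minimal sketch only: the homophone table, filler characters, operation
# probabilities and tag names are illustrative assumptions.
HOMOPHONES = {"在": ["再"], "的": ["地", "得"]}
FILLERS = "呃嗯啊"

def corrupt_sentence(sentence):
    """Turn a grammatically correct sentence into a negative sample and
    record, per character of the negative sample, the edit tag that would
    restore the positive sample (the first label information)."""
    negative, tags = [], []
    for ch in sentence:
        op = random.random()
        if op < 0.05:                              # insert a spurious filler character
            negative.append(random.choice(FILLERS))
            tags.append("delete")                  # restoring the positive deletes it
        if ch in HOMOPHONES and op < 0.15:         # homophone substitution
            negative.append(random.choice(HOMOPHONES[ch]))
            tags.append("replace:" + ch)           # restoring swaps the original back in
        else:
            negative.append(ch)
            tags.append("hold")
    return "".join(negative), tags

positive = "我在海岸街，你快来吧"                  # illustrative sentence, not from the disclosure
negative, first_label = corrupt_sentence(positive)
```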
S103: correspondingly inputting the positive sample and the negative sample into a first text processing model to be trained and a second text processing model to be trained, and generating a first feature vector of a target layer of the first text processing model and a second feature vector of a target layer of the second text processing model; the second text processing model has a smaller dimension than the first text processing model, the first text processing model being trained from a transformation sequence of positive, negative and negative samples.
The first text processing model is trained beforehand from the positive samples, the negative samples, and the transformation sequences of the negative samples. The second text processing model may be a pre-built untrained model, or a model generated from the first text processing model, for example by extracting several intermediate layers from the first text processing model. Here, the dimension of the second text processing model is smaller than the dimension of the first text processing model.
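A rough sketch of the layer-extraction option mentioned above, assuming the first text processing model exposes its Transformer-style blocks as an nn.ModuleList (an assumption about the model structure, not a detail given in the text):

```python
import copy
import torch.nn as nn

def build_student_from_teacher(teacher_blocks: nn.ModuleList, keep_every: int = 2) -> nn.ModuleList:
    """Copy every `keep_every`-th block of the (assumed) teacher encoder to
    form a shallower student encoder; hidden sizes are kept unchanged here."""
    kept = [copy.deepcopy(block)
            for i, block in enumerate(teacher_blocks)
            if i % keep_every == 0]
    return nn.ModuleList(kept)
```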
And correspondingly inputting the positive sample and the negative sample into the pre-trained first text processing model and the second text processing model to be trained, and generating a first feature vector of a target layer of the first text processing model and a second feature vector of a target layer of the second text processing model.
In some embodiments, the target layers of the first text processing model include an input layer, a hidden layer, and an output layer, and the target layers of the second text processing model include an input layer, a hidden layer, and an output layer.
The number of layers of the first text processing model and the number of layers of the second text processing model may be the same or different.
In the case where the number of layers of the first text processing model is the same as the number of layers of the second text processing model, the first feature vector includes a first input layer feature vector, a first hidden layer feature vector, a first attention vector, and a first predictive collation vector collating negative samples. The first hidden layer vector refers to a feature vector commonly determined by all hidden layers in the first text processing model, for example, there are 3 hidden layers, and the first hidden layer feature vector refers to a feature vector commonly determined by 3 hidden layers. The first attention vector is an attention vector commonly determined by all hidden layers in the first text processing model. The second feature vector includes a second input layer feature vector, a second hidden layer feature vector, a second attention vector, and a second predictive collation vector that collates negative samples. The second hidden layer feature vector refers to a feature vector commonly determined by all hidden layers in the second text processing model. The second attention vector is an attention vector commonly determined by all hidden layers in the second text processing model.
If the number of layers of the first text processing model differs from the number of layers of the second text processing model, let the number of layers of the first text processing model be M and the number of layers of the second text processing model be N. The first feature vector then includes the first attention vector of each of the M layers of the first text processing model, the first hidden layer feature vector of each hidden layer, the first input layer feature vector, and the first predicted collation vector for collating the negative samples; the second feature vector includes the second attention vector of each of the N layers of the second text processing model, the second hidden layer feature vector of each hidden layer, the second input layer feature vector, and the second predicted collation vector for collating the negative samples.
S104: and carrying out knowledge distillation on the second text processing model according to the first feature vector and the second feature vector to obtain a trained second text processing model.
In some embodiments, in the case that the number of layers of the first text processing model is the same as the number of layers of the second text processing model, knowledge distillation is performed on the second text processing model according to the first feature vector and the second feature vector, so that the second text processing model learns the parameter features of the first text processing model.
Wherein in S104, a projection matrix is determined from the dimensions of the first text processing model and the dimensions of the second text processing model.
And calculating a first mean square error loss between the input layer of the first text processing model and the input layer of the second text processing model according to the projection matrix, the first input layer feature vector and the second input layer feature vector. The first mean square error loss L_embd satisfies the following formula (1):

L_embd = MSE(E_S W, E_T)    (1)

where E_S represents the second input layer feature vector of the second text processing model, W represents the projection matrix, and E_T represents the first input layer feature vector of the first text processing model.
And calculating a second mean square error loss between the hidden layers of the first text processing model and the hidden layers of the second text processing model according to the projection matrix, the first hidden layer feature vector and the second hidden layer feature vector. The second mean square error loss L_hidden satisfies the following formula (2):

L_hidden = Σ_i MSE(H_i^S W, H_i^T)    (2)

where W represents the projection matrix, H_i^S represents the second hidden layer feature vector of the i-th layer of the second text processing model, and H_i^T represents the first hidden layer feature vector of the i-th layer of the first text processing model.
A third mean square error loss between the first attention vector and the second attention vector is then calculated. The third mean square error loss L_attn satisfies the following formula (3):

L_attn = (1/h) Σ_{i=1}^{h} MSE(A_i^S, A_i^T)    (3)

where h is the number of attention heads, A_i^S represents the i-th attention vector of the second text processing model, and A_i^T represents the i-th attention vector of the first text processing model.
And calculating the cross entropy loss of the first predicted collation vector and the second predicted collation vector according to the preset temperature parameter. The cross entropy loss L_pred satisfies the following formula (4):

L_pred = -softmax(z_T) · log softmax(z_S / t)    (4)

where z_T represents the first predicted collation vector, z_S represents the second predicted collation vector, and t is the temperature parameter.
And updating the second text processing model according to the first mean square error loss, the second mean square error loss, the third mean square error loss and the cross entropy loss.
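The following PyTorch sketch assembles the four distillation terms of formulas (1)-(4) for the equal-depth case. The tensor layout, the dictionary keys, and the equal weighting of the four terms are assumptions made for illustration, not the reference implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, W, t=2.0):
    """student_out / teacher_out are assumed dicts with keys:
       'embed'  : input-layer feature vectors  [batch, seq, d]
       'hidden' : list of per-layer hidden states
       'attn'   : list of attention matrices (same count/head layout assumed)
       'logits' : predicted collation vectors  [batch, seq, num_tags]"""
    l_embd = F.mse_loss(student_out["embed"] @ W, teacher_out["embed"])            # formula (1)

    l_hidden = sum(F.mse_loss(h_s @ W, h_t)                                        # formula (2)
                   for h_s, h_t in zip(student_out["hidden"], teacher_out["hidden"]))

    l_attn = sum(F.mse_loss(a_s, a_t)                                              # formula (3);
                 for a_s, a_t in zip(student_out["attn"], teacher_out["attn"]))    # averaged here,
    l_attn = l_attn / len(student_out["attn"])                                     # the text normalizes by head count

    soft_targets = F.softmax(teacher_out["logits"], dim=-1)                        # formula (4)
    l_pred = -(soft_targets * F.log_softmax(student_out["logits"] / t, dim=-1)).sum(-1).mean()

    return l_embd + l_hidden + l_attn + l_pred   # assumed equal weighting of the terms
```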
In some embodiments, in the case that the number of layers of the first text processing model is different from the number of layers of the second text processing model, knowledge distillation is performed on the second text processing model according to the first feature vector and the second feature vector, so that the second text processing model learns the parameter features of the first text processing model. Here, each layer of the first text processing model and each layer of the second text processing model has a weight. The larger a layer's contribution, the larger its weight, and the more attention the second text processing model pays to that layer when learning from the first text processing model. At initialization all layers are given equal weights: for example, if the first text processing model has M layers and the second text processing model has N layers, the weight of each layer in the first text processing model is 1/M and the weight of each layer in the second text processing model is 1/N.
In S104, a projection matrix is determined according to the dimension of the first text processing model and the dimension of the second text processing model; a first mean square error loss between the input layer of the first text processing model and the input layer of the second text processing model is calculated according to the projection matrix, the first input layer feature vector and the second input layer feature vector, and a cross entropy loss of the first predicted collation vector and the second predicted collation vector is calculated according to a preset temperature parameter. The first mean square error loss satisfies the above formula (1), and the cross entropy loss satisfies the above formula (4), which will not be described again here.
And then, comparing the first attention vector of each layer of the M layers with the second attention vector of each layer of the N layers in pairs to obtain an attention loss matrix between the first text processing model and the second text processing model. And comparing the first hidden layer feature vector of each layer in the M layers with the second hidden layer feature vector of each layer in the N layers in pairs to obtain a hidden layer loss matrix between the first text processing model and the second text processing model.
Here, the attention loss matrix refers to a loss matrix between all layers of the first text processing model and all layers of the second text processing model; the hidden layer loss matrix refers to a loss matrix between all hidden layers of the first text processing model and all hidden layers of the second text processing model.
After determining the attention loss matrix and the hidden layer loss matrix, a first Earth Mover's Distance (EMD) matrix is calculated according to the weight of each layer in the first text processing model, the weight of each layer in the second text processing model and the attention loss matrix. A second EMD matrix is calculated according to the weight of each layer in the first text processing model, the weight of each layer in the second text processing model and the hidden layer loss matrix.
And calculating a fourth mean square error loss between the first attention vectors of the M layers in the first text processing model and the second attention vectors of the N layers in the second text processing model according to the first EMD matrix and the attention loss matrix. The fourth mean square error loss L_attn satisfies the following formula (5):

L_attn = Σ_{i=1}^{M} Σ_{j=1}^{N} F_{ij}^{attn} D_{ij}^{attn}    (5)

where M represents the number of layers of the first text processing model, N represents the number of layers of the second text processing model, F_{ij}^{attn} represents the element of the first EMD matrix between the i-th layer of the first text processing model and the j-th layer of the second text processing model, and D_{ij}^{attn} represents the corresponding element of the attention loss matrix.
And calculating a fifth mean square error loss between the first hidden layer feature vectors of the M layers in the first text processing model and the second hidden layer feature vectors of the N layers in the second text processing model according to the second EMD matrix and the hidden layer loss matrix. The fifth mean square error loss satisfies the following formula (6):

L_hidn = Σ_{i=1}^{M} Σ_{j=1}^{N} F_{ij}^{hidn} D_{ij}^{hidn}    (6)

where M represents the number of layers of the first text processing model, N represents the number of layers of the second text processing model, F_{ij}^{hidn} represents the element of the second EMD matrix between the i-th layer of the first text processing model and the j-th layer of the second text processing model, and D_{ij}^{hidn} represents the corresponding element of the hidden layer loss matrix.
And updating the weight of each layer in the first text processing model and the weight of each layer in the second text processing model according to the fourth mean square error loss and the fifth mean square error loss until the fourth mean square error loss and the fifth mean square error loss converge.
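For the unequal-depth case, the sketch below shows how the pairwise loss matrices and the flow-weighted losses of formulas (5) and (6) could be computed. Solving for the EMD flow matrices themselves requires an optimal-transport solver and is omitted here; the flow matrices are taken as given, and the feature vectors are assumed to have already been projected to a common dimension.

```python
import torch
import torch.nn.functional as F

def pairwise_loss_matrix(teacher_feats, student_feats):
    """D[i, j] = MSE between teacher layer i and student layer j.
    Assumes the features are already projected to a shared dimension."""
    M, N = len(teacher_feats), len(student_feats)
    D = torch.zeros(M, N)
    for i, t_feat in enumerate(teacher_feats):
        for j, s_feat in enumerate(student_feats):
            D[i, j] = F.mse_loss(s_feat, t_feat)
    return D

def emd_loss(D, flow):
    """Formulas (5)/(6): sum of flow-weighted pairwise losses. `flow` is the
    EMD matrix whose row and column sums match the layer weights
    (1/M and 1/N at initialization)."""
    return (flow * D).sum()
```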
In some embodiments, the text processing model training method provided in the embodiments of the present application further includes a training process of the first text processing model. Specifically, the positive samples and the negative samples are input into the first text processing model to be trained, and a predicted collation sequence of the negative samples is generated; the first text processing model is then trained according to the predicted collation sequence of the negative samples and the first tag information. Here, in order to ensure the accuracy of the first text processing model, manually labeled positive and negative samples may also be input into the first text processing model for training during the training process.
In some embodiments, to ensure accuracy of the first text processing model, the training samples further include a positive sample pair and second tag information for the positive sample pair, the second tag information representing a conversion sequence that converts the positive sample to a positive sample, the two positive samples in the positive sample pair being identical. In the training process, the positive sample pair can be input into a trained text processing model to generate a prediction correction sequence of the positive sample; and training a text processing model according to the predicted collation sequence of the positive sample and the second label information.
As one example, a sufficient number of positive and negative sample pairs are combined with real artificial annotation data for a three-segment training of the transition from synthetic data to real data. Wherein first, only sentence pairs containing grammar errors-grammar correctness are trained. Then, a small amount of manually marked positive samples and negative samples are used for fine adjustment of parameters of the trained first text processing model; and finally, performing fine adjustment on parameters of the trained first text processing model by using a small amount of manually marked positive samples and negative samples and positive samples, so as to improve the performance of the model.
In some embodiments, since the second text processing model may have a limit on the number of input characters, long texts are processed in segments so that the second text processing model can handle them. In S103, first, when the number of characters in the positive sample is greater than a preset number, the first preset number of characters in the positive sample and the corresponding characters in the negative sample are input into the second text processing model to be trained in order from front to back, so as to obtain a predicted collation sequence for that preset number of characters of the negative sample; then, the remaining characters in the positive sample and the remaining characters in the negative sample are used as training samples for the next model training pass.
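A minimal sketch of this segmentation, assuming a 512-character window and character-aligned positive and negative samples (in general the corresponding span in the negative sample would have to be located through the transformation sequence):

```python
MAX_CHARS = 512  # assumed input limit; not specified in the text

def split_training_pair(positive, negative):
    """Feed the first MAX_CHARS characters of the pair to the current
    training step and keep the remainder for the next step."""
    head = (positive[:MAX_CHARS], negative[:MAX_CHARS])
    tail = (positive[MAX_CHARS:], negative[MAX_CHARS:])
    return head, tail
```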
According to the text processing model training method, the positive samples are obtained through crawling dialogue texts with correct grammar from the Internet, such as texts related to scene dialogue, conference summary texts and the like, and therefore a large number of positive samples can be obtained through crawling dialogue texts from the Internet; then, the conversion operation is performed on the sentences in the dialogue text, such as deleting characters, replacing homophones, merging natural segments and the like, so that the converted sentences are sentences with grammar errors, a negative sample is obtained, and label information of the negative sample and the negative sample is obtained through the conversion operation, so that a large number of negative samples with labeling information can be obtained. And then, respectively inputting the positive sample and the negative sample into a trained first text processing model and a trained second text processing model to obtain a first feature vector of a target layer of the first text processing model and a second feature vector of a target layer of the second text processing model, wherein the dimension of the second text processing model is smaller than that of the first text processing model. And carrying out knowledge distillation on the second text processing model based on the first feature vector and the second feature vector, so that the second text processing model can learn the features of the first text processing model, and further, the text is checked. Therefore, a lightweight second text processing model can be obtained to correct the text, and the occupation of resources is reduced. And iteration is not needed in the process of text proofreading by using the second text processing model, so that the text proofreading efficiency is improved and the occupation of computing resources is reduced.
The embodiment of the application also provides an application scheme of the second text processing model, and the application scheme is described in detail below.
Fig. 2 is a flowchart of a voice text processing method provided in an embodiment of the present application, and as shown in fig. 2, the voice text processing method provided in the embodiment of the present application may include S201-S203.
S201: and recognizing the voice text corresponding to the target voice.
The target speech may be speech obtained in any way, for example, telephone recordings, conference recordings, speech generated during a voice chat. After the target voice is acquired, the target voice is recognized, so that a voice text corresponding to the target voice is determined.
S202: and inputting the voice text into a second text processing model, and determining a proofreading sequence of the voice text, wherein the proofreading sequence represents a proofreading rule of each character in the voice text.
The voice text is input into the second text processing model, and a collation sequence of the voice text can be determined. For example, if the voice text is "just like a firework lost in the wind", the collation sequence may be (hold, replace with "image", hold, hold, ...).
S203: and proofreading the voice text according to the proofreading sequence to obtain the proofreading text corresponding to the target voice.
For example, if the voice text is "just like a firework lost in the wind" and the collation sequence is (hold, replace with "image", hold, hold, ...), the corrected text is "just like a firework that is lost in the wind".
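A small sketch of how such a collation sequence could be applied to the recognized voice text; the tag format mirrors the training-label sketch earlier in this description and is an assumption rather than the exact encoding used by the model:

```python
def apply_collation(voice_text, collation_sequence):
    """Apply per-character tags ('hold', 'delete', 'replace:<char>') to the
    recognized voice text to produce the corrected text."""
    corrected = []
    for ch, tag in zip(voice_text, collation_sequence):
        if tag == "hold":
            corrected.append(ch)
        elif tag == "delete":
            continue
        elif tag.startswith("replace:"):
            corrected.append(tag.split(":", 1)[1])
    return "".join(corrected)
```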
According to the voice text processing method, the positive samples are obtained by crawling dialogue texts with correct grammar from the Internet, such as texts related to scene dialogues, conference summary texts and the like, and therefore a large number of positive samples can be obtained by crawling dialogue texts from the Internet; then, the conversion operation is performed on the sentences in the dialogue text, such as deleting characters, replacing homophones, merging natural segments and the like, so that the converted sentences are sentences with grammar errors, a negative sample is obtained, and label information of the negative sample and the negative sample is obtained through the conversion operation, so that a large number of negative samples with labeling information can be obtained. And then, respectively inputting the positive sample and the negative sample into a trained first text processing model and a trained second text processing model to obtain a first feature vector of a target layer of the first text processing model and a second feature vector of a target layer of the second text processing model, wherein the dimension of the second text processing model is smaller than that of the first text processing model. And carrying out knowledge distillation on the second text processing model based on the first feature vector and the second feature vector, so that the second text processing model can learn the features of the first text processing model, and further, the text is checked. Therefore, a lightweight second text processing model can be obtained to correct the text, and the occupation of resources is reduced. And iteration is not needed in the process of text proofreading by using the second text processing model, so that the text proofreading efficiency is improved and the occupation of computing resources is reduced.
Based on the text processing model training method in the above embodiment, the embodiment of the present application further provides a text processing model training device. Fig. 3 is a schematic structural diagram of a text processing model training apparatus 300 according to an embodiment of the present application, and as shown in fig. 3, the apparatus 300 may include an obtaining module 301, a transforming module 302, a generating module 303, and a training module 304.
The obtaining module 301 is configured to crawl a dialogue text from the internet to obtain a positive sample; the sentences in the dialogue text are sentences with correct grammar, and the positive samples are sentences in the dialogue text;
the transformation module 302 is configured to perform a transformation operation on the sentences in the dialogue text to obtain a negative sample and first tag information of the negative sample, where the sentences in the negative sample are sentences with grammar errors, and the first tag information represents a transformation sequence for transforming the positive sample into the negative sample;
a generating module 303, configured to respectively input the positive sample and the negative sample into a first text processing model that is trained in advance and a second text processing model that is to be trained, and generate a first feature vector of a target layer of the first text processing model and a second feature vector of a target layer of the second text processing model; the dimension of the second text processing model is smaller than that of the first text processing model, and the first text processing model is obtained through training according to the positive sample, the negative sample and the transformation sequence of the negative sample;
And the training module 304 is configured to perform knowledge distillation on the second text processing model according to the first feature vector and the second feature vector, so as to obtain a trained second text processing model.
In one possible implementation, in a case where the number of layers of the first text processing model is the same as the number of layers of the second text processing model, the first feature vector includes a first input layer feature vector, a first hidden layer feature vector, a first attention vector, and a first predictive collation vector for collating negative samples, and the second feature vector includes a second input layer feature vector, a second hidden layer feature vector, a second attention vector, and a second predictive collation vector for collating negative samples.
In one possible implementation, training module 304 is to:
determining a projection matrix according to the dimension of the first text processing model and the dimension of the second text processing model;
calculating a first mean square error loss between an input layer of the first text processing model and an input layer of the second text processing model according to the projection matrix, the first input layer feature vector and the second input layer feature vector;
calculating a second mean square error loss between the hidden layer of the first text processing model and the hidden layer of the second text processing model according to the projection matrix, the first hidden layer feature vector and the second hidden layer feature vector;
Calculating a third mean square error loss between the first and second attention vectors;
calculating cross entropy loss of the first prediction correction vector and the second prediction correction vector according to preset temperature parameters;
and updating the second text processing model according to the first mean square error loss, the second mean square error loss, the third mean square error loss and the cross entropy loss.
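A minimal sketch of this loss combination for the equal-depth case, written with PyTorch; the function name, the tensor shapes, the single shared projection matrix and the temperature value are illustrative assumptions rather than the patent's exact formulation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(t_embed, s_embed, t_hidden, s_hidden,
                      t_attn, s_attn, t_logits, s_logits,
                      proj, temperature=2.0):
    """Combine the four distillation terms for the equal-depth case.
    Teacher tensors have width d_t, student tensors width d_s; `proj` is a
    learnable (d_s, d_t) matrix determined by the two model dimensions."""
    # (1) first MSE loss: input-layer (embedding) features, student projected up
    loss_in = F.mse_loss(s_embed @ proj, t_embed)
    # (2) second MSE loss: hidden-layer features, same projection
    loss_hid = F.mse_loss(s_hidden @ proj, t_hidden)
    # (3) third MSE loss: attention maps (assumed to share head count and length)
    loss_attn = F.mse_loss(s_attn, t_attn)
    # (4) soft cross entropy on the predicted collation vectors, with temperature
    t_soft = F.softmax(t_logits / temperature, dim=-1)
    s_logp = F.log_softmax(s_logits / temperature, dim=-1)
    loss_ce = -(t_soft * s_logp).sum(-1).mean() * temperature ** 2
    return loss_in + loss_hid + loss_attn + loss_ce

# Toy shapes: batch 2, sequence length 8, teacher width 768, student width 312.
B, L, d_t, d_s, n_ops = 2, 8, 768, 312, 50
proj = torch.nn.Parameter(torch.empty(d_s, d_t).normal_(std=0.02))
loss = distillation_loss(torch.randn(B, L, d_t), torch.randn(B, L, d_s),
                         torch.randn(B, L, d_t), torch.randn(B, L, d_s),
                         torch.randn(B, 12, L, L), torch.randn(B, 12, L, L),
                         torch.randn(B, L, n_ops), torch.randn(B, L, n_ops),
                         proj)
loss.backward()   # gradients flow to the projection (and, in training, the student)
print(float(loss))
```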
In one possible implementation, in a case where the number of layers of the first text processing model is M, the number of layers of the second text processing model is N, and M is not equal to N, the first feature vector includes a first attention vector of each of the M layers of the first text processing model, a first hidden layer feature vector of each hidden layer, a first input layer feature vector, and a first predictive collation vector for collating negative samples, and the second feature vector includes a second attention vector of each of the N layers of the second text processing model, a second hidden layer feature vector of each hidden layer, a second input layer feature vector, and a second predictive collation vector for collating negative samples.
In one possible implementation, training module 304 is to:
comparing the first attention vector of each layer of the M layers with the second attention vector of each layer of the N layers in pairs to obtain an attention loss matrix between the first text processing model and the second text processing model;
Comparing the first hidden layer feature vector of each layer in the M layers with the second hidden layer feature vector of each layer in the N layers in pairs to obtain a hidden layer loss matrix between the first text processing model and the second text processing model;
calculating a first earth mover's distance (EMD) matrix according to the weight of each layer in the first text processing model, the weight of each layer in the second text processing model and the attention loss matrix;
calculating a second EMD matrix according to the weight of each layer in the first text processing model, the weight of each layer in the second text processing model and the hidden layer loss matrix;
calculating a fourth mean square error loss between the first attention vectors of the M layers in the first text processing model and the second attention vectors of the N layers in the second text processing model according to the first EMD matrix and the attention loss matrix;
calculating a fifth mean square error loss between the first hidden layer feature vectors of the M layers in the first text processing model and the second hidden layer feature vectors of the N layers in the second text processing model according to the second EMD matrix and the hidden layer loss matrix;
and updating the weight of each layer in the first text processing model and the weight of each layer in the second text processing model according to the fourth mean square error loss and the fifth mean square error loss until the fourth mean square error loss and the fifth mean square error loss converge.
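For the different-depth case, the sketch below builds the pairwise attention and hidden-layer loss matrices, solves the earth mover's distance (EMD) transport problem between the two layer-weight distributions as a small linear program, and uses the flow-weighted costs as the fourth and fifth losses. The uniform initial layer weights, the shared feature width and the function names are assumptions for illustration; in practice the layer weights would be updated during training and a projection would align the feature widths first.

```python
import numpy as np
from scipy.optimize import linprog

def pairwise_mse(teacher_feats, student_feats):
    """Cost matrix: MSE between every teacher layer and every student layer.
    teacher_feats: (M, d), student_feats: (N, d), widths assumed already aligned."""
    diff = teacher_feats[:, None, :] - student_feats[None, :, :]
    return (diff ** 2).mean(axis=-1)            # shape (M, N)

def emd_flow(w_teacher, w_student, cost):
    """EMD flow between the two layer-weight distributions, as a linear program."""
    M, N = cost.shape
    A_eq, b_eq = [], []
    for i in range(M):                          # mass leaving teacher layer i
        row = np.zeros(M * N); row[i * N:(i + 1) * N] = 1.0
        A_eq.append(row); b_eq.append(w_teacher[i])
    for j in range(N):                          # mass arriving at student layer j
        col = np.zeros(M * N); col[j::N] = 1.0
        A_eq.append(col); b_eq.append(w_student[j])
    res = linprog(cost.ravel(), A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.x.reshape(M, N)

# Illustrative shapes only: a 12-layer teacher, a 4-layer student, width 64.
M, N, d = 12, 4, 64
rng = np.random.default_rng(0)
attn_t, attn_s = rng.normal(size=(M, d)), rng.normal(size=(N, d))
hid_t,  hid_s  = rng.normal(size=(M, d)), rng.normal(size=(N, d))
w_t = np.full(M, 1.0 / M)                       # uniform layer weights (updated in training)
w_s = np.full(N, 1.0 / N)

attn_cost = pairwise_mse(attn_t, attn_s)        # attention loss matrix
hid_cost  = pairwise_mse(hid_t, hid_s)          # hidden-layer loss matrix
loss_attn = (emd_flow(w_t, w_s, attn_cost) * attn_cost).sum()   # "fourth" loss
loss_hid  = (emd_flow(w_t, w_s, hid_cost)  * hid_cost).sum()    # "fifth" loss
print(loss_attn, loss_hid)
```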
In one possible implementation, the apparatus further includes:
the determining module is used for determining a transformation sequence corresponding to the negative sample according to the transformation operation, so as to obtain first label information of the negative sample; wherein the first tag information represents a transformation sequence that transforms the positive sample into the negative sample;
the generating module 303 is further configured to input the positive sample and the negative sample into a second text processing model to be trained, and generate a predicted collation sequence of the negative sample;
the training module 304 is further configured to train the text processing model according to the predicted collation sequence of the negative sample and the first tag information, as sketched below.
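A minimal sketch of this supervised training signal, assuming the collation sequence is encoded as one operation id per character; the operation inventory and tensor sizes shown are hypothetical.

```python
import torch
import torch.nn.functional as F

# Assumed encoding: each character is labelled with an operation id,
# e.g. 0 = keep, 1 = delete, ids >= 2 index replacement characters.
n_ops = 50
logits = torch.randn(2, 8, n_ops, requires_grad=True)   # predicted collation sequence (batch, chars, ops)
labels = torch.randint(0, n_ops, (2, 8))                 # first tag information (gold sequence)
loss = F.cross_entropy(logits.view(-1, n_ops), labels.view(-1))
loss.backward()
print(float(loss))
```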
In a possible implementation manner, the training samples further include a positive sample pair and second tag information of the positive sample pair, where the second tag information represents a transformation sequence for transforming the positive sample into itself, and the two positive samples in the positive sample pair are identical. The generating module 303 is further configured to input the positive sample pair into the trained text processing model and generate a predicted collation sequence of the positive sample;
the training module 304 is further configured to train a text processing model based on the predicted collation sequence of positive samples and the second tag information.
In one possible implementation, the generating module 303 is configured to:
when the number of characters in the positive sample is larger than a preset number, inputting the first preset number of characters in the positive sample, together with the corresponding characters in the negative sample, into the second text processing model to be trained in front-to-back order, to obtain a predicted collation sequence for that preset number of characters of the negative sample;
and taking the remaining characters in the positive sample and the remaining characters in the negative sample as the training samples for the next round of model training, as sketched below.
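A sketch of this chunking scheme, assuming for simplicity that the positive and negative sentences are character-aligned (deletion operations would require alignment bookkeeping in practice); the function name and chunk size are illustrative.

```python
def chunk_pair(positive, negative, max_chars=512):
    """Split an over-long positive/negative sentence pair into front-to-back
    chunks of at most max_chars characters; each later chunk becomes the
    training sample of a subsequent training pass."""
    return [(positive[i:i + max_chars], negative[i:i + max_chars])
            for i in range(0, len(positive), max_chars)]

for pos_chunk, neg_chunk in chunk_pair("一" * 1000, "二" * 1000):
    print(len(pos_chunk), len(neg_chunk))
```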
The text processing model training device provided in the embodiment of the present application can execute the method steps in the embodiment shown in fig. 1 and achieve the same technical effects, and for avoiding repetition, detailed descriptions thereof are omitted.
According to the text processing model training device, positive samples are obtained by crawling grammatically correct dialogue texts from the Internet, such as scene-dialogue texts and conference summaries, so a large number of positive samples can be collected. Transformation operations such as deleting characters, replacing characters with homophones, and merging natural paragraphs are then applied to the sentences in the dialogue texts so that the transformed sentences contain grammatical errors; this yields the negative samples, and because the transformation operations are known, the label information of each negative sample is obtained at the same time, so a large number of labeled negative samples can be produced. The positive and negative samples are then input into a pre-trained first text processing model and a second text processing model to be trained, respectively, to obtain a first feature vector of the target layer of the first text processing model and a second feature vector of the target layer of the second text processing model, where the dimension of the second text processing model is smaller than that of the first. Knowledge distillation is performed on the second text processing model based on the first feature vector and the second feature vector, so that the second text processing model learns the features of the first text processing model and can then be used to proofread text. A lightweight second text processing model is thus obtained for text proofreading, which reduces resource consumption; and because no iteration is needed when proofreading text with the second text processing model, proofreading efficiency is improved and the use of computing resources is further reduced.
Based on the voice text processing method in the above embodiment, the embodiment of the present application further provides a voice text processing device. Fig. 4 is a schematic structural diagram of a voice text processing apparatus 400 according to an embodiment of the present application, and as shown in fig. 4, the apparatus 400 may include a recognition module 401, a determination module 402, and a collation module 403.
The recognition module 401 is configured to recognize a voice text corresponding to the target voice.
A determining module 402, configured to input the phonetic text into the second text processing model as in the first aspect or any of the possible implementation manners of the first aspect, and determine a collation sequence of the phonetic text, where the collation sequence represents a collation rule of each character in the phonetic text.
And the proofreading module 403 is configured to proofread the voice text according to the proofreading sequence, and obtain a proofreading text corresponding to the target voice.
The voice text processing device provided in the embodiment of the present application can execute the method steps in the embodiment shown in fig. 2 and achieve the same technical effects, and for avoiding repetition, detailed descriptions thereof are omitted.
According to the voice text processing device, positive samples are obtained by crawling grammatically correct dialogue texts from the Internet, such as scene-dialogue texts and conference summaries, so a large number of positive samples can be collected. Transformation operations such as deleting characters, replacing characters with homophones, and merging natural paragraphs are then applied to the sentences in the dialogue texts so that the transformed sentences contain grammatical errors; this yields the negative samples, and because the transformation operations are known, the label information of each negative sample is obtained at the same time, so a large number of labeled negative samples can be produced. The positive and negative samples are then input into a pre-trained first text processing model and a second text processing model to be trained, respectively, to obtain a first feature vector of the target layer of the first text processing model and a second feature vector of the target layer of the second text processing model, where the dimension of the second text processing model is smaller than that of the first. Knowledge distillation is performed on the second text processing model based on the first feature vector and the second feature vector, so that the second text processing model learns the features of the first text processing model and can then be used to proofread text. A lightweight second text processing model is thus obtained for text proofreading, which reduces resource consumption; and because no iteration is needed when proofreading text with the second text processing model, proofreading efficiency is improved and the use of computing resources is further reduced.
An electronic device provided in an embodiment of the present application is described below.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 5, the electronic device provided in the embodiment of the present application may be used to implement the text processing model training method or the phonetic text processing method described in the above method embodiment.
The electronic device may include a processor 501 and a memory 502 storing computer program instructions.
In particular, the processor 501 may include a Central Processing Unit (CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or may be configured to implement one or more integrated circuits of embodiments of the present application.
Memory 502 may include mass storage for data or instructions. By way of example, and not limitation, memory 502 may comprise a hard disk drive (HDD), a floppy disk drive, flash memory, an optical disk, a magneto-optical disk, magnetic tape, a Universal Serial Bus (USB) drive, or a combination of two or more of the foregoing. Memory 502 may include removable or non-removable (or fixed) media, where appropriate. Memory 502 may be internal or external to the electronic device, where appropriate. In a particular embodiment, the memory 502 is a non-volatile solid state memory.
The memory may include read-only memory (ROM), random access memory (RAM), magnetic disk storage media devices, optical storage media devices, flash memory devices, or electrical, optical, or other physical/tangible memory storage devices. Thus, in general, the memory includes one or more tangible (non-transitory) computer-readable storage media (e.g., memory devices) encoded with software comprising computer-executable instructions, and when the software is executed (e.g., by one or more processors) it is operable to perform the operations described with reference to the methods of the present application.
The processor 501 implements any one of the text processing model training methods or the phonetic text processing method of the above embodiments by reading and executing the computer program instructions stored in the memory 502.
In one example, the electronic device may also include a communication interface 503 and a bus 510. As shown in fig. 5, the processor 501, the memory 502, and the communication interface 503 are connected to each other by a bus 510 and perform communication with each other.
The communication interface 503 is mainly used to implement communication between each module, apparatus, unit and/or device in the embodiments of the present application.
Bus 510 includes hardware, software, or both that couple components of the electronic device to one another. By way of example, and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association Local Bus (VLB), another suitable bus, or a combination of two or more of the above. Bus 510 may include one or more buses, where appropriate. Although embodiments of the present application describe and illustrate a particular bus, the present application contemplates any suitable bus or interconnect.
In addition, in connection with the above embodiments, the embodiments of the present application may be implemented by providing a computer storage medium. The computer storage medium has stored thereon computer program instructions; the computer program instructions, when executed by a processor, implement any of the text processing model training methods or phonetic text processing methods of the above embodiments.
The functional blocks shown in the above-described structural block diagrams may be implemented in hardware, software, firmware, or a combination thereof. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the present application are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine readable medium or transmitted over transmission media or communication links by a data signal carried in a carrier wave. A "machine-readable medium" may include any medium that can store or transfer information. Examples of machine-readable media include electronic circuitry, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, radio Frequency (RF) links, and the like. The code segments may be downloaded via computer networks such as the internet, intranets, etc.
It should also be noted that the exemplary embodiments mentioned in this application describe some methods or systems based on a series of steps or devices. However, the present application is not limited to the order of the steps described above; that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from that in the embodiments, or several steps may be performed simultaneously.
Aspects of the present application are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, enable the implementation of the functions/acts specified in the flowchart and/or block diagram block or blocks. Such a processor may be, but is not limited to being, a general purpose processor, a special purpose processor, an application specific processor, or a field programmable logic circuit. It will also be understood that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware which performs the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In the foregoing, only the specific embodiments of the present application are described, and it will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, modules and units described above may refer to the corresponding processes in the foregoing method embodiments, which are not repeated herein. It should be understood that the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the present application, which are intended to be included in the scope of the present application.

Claims (8)

1. A method of training a text processing model, the method comprising:
crawling dialogue texts from the Internet to obtain positive samples; the sentences in the dialogue text are sentences with correct grammar, and the positive samples are sentences in the dialogue text;
performing transformation operation on sentences in the dialogue text to obtain a negative sample and first label information of the negative sample, wherein the sentences in the negative sample are sentences with grammar errors, the first label information represents a transformation sequence for transforming the positive sample into the negative sample, and the transformation sequence represents transformation operation corresponding to each character in one sentence;
Correspondingly inputting the positive sample and the negative sample into a pre-trained first text processing model and a second text processing model to be trained, and generating a first feature vector of a target layer of the first text processing model and a second feature vector of a target layer of the second text processing model; the dimension of the second text processing model is smaller than that of the first text processing model, and the first text processing model is obtained through training according to the positive sample, the negative sample and the transformation sequence of the negative sample;
performing knowledge distillation on the second text processing model according to the first feature vector and the second feature vector to obtain a trained second text processing model;
in the case that the number of layers of the first text processing model is the same as the number of layers of the second text processing model, the first feature vector includes a first input layer feature vector, a first hidden layer feature vector, a first attention vector, and a first predictive collation vector for collating the negative samples, and the second feature vector includes a second input layer feature vector, a second hidden layer feature vector, a second attention vector, and a second predictive collation vector for collating the negative samples;
And performing knowledge distillation on the second text processing model according to the first feature vector and the second feature vector to obtain a trained second text processing model, wherein the method comprises the following steps:
determining a projection matrix according to the dimension of the first text processing model and the dimension of the second text processing model;
calculating a first mean square error loss between an input layer of the first text processing model and an input layer of the second text processing model according to the projection matrix, the first input layer feature vector and the second input layer feature vector;
calculating a second mean square error loss between the hidden layer of the first text processing model and the hidden layer of the second text processing model according to the projection matrix, the first hidden layer feature vector and the second hidden layer feature vector;
calculating a third mean square error loss between the first and second attention vectors;
calculating the cross entropy loss of the first prediction correction vector and the second prediction correction vector according to preset temperature parameters;
and updating the second text processing model according to the first mean square error loss, the second mean square error loss, the third mean square error loss and the cross entropy loss.
2. The method of claim 1, wherein, in a case where the number of layers of the first text processing model is M, the number of layers of the second text processing model is N, and M is not equal to N, the first feature vector includes a first attention vector for each of the M layers of the first text processing model, a first hidden layer feature vector for each hidden layer, a first input layer feature vector, and a first predictive collation vector for collating the negative sample, and the second feature vector includes a second attention vector for each of the N layers of the second text processing model, a second hidden layer feature vector for each hidden layer, a second input layer feature vector, and a second predictive collation vector for collating the negative sample.
3. The method according to claim 2, wherein performing knowledge distillation on the second text processing model based on the first feature vector and the second feature vector to obtain a trained second text processing model comprises:
determining a projection matrix according to the dimension of the first text processing model and the dimension of the second text processing model;
calculating a first mean square error loss between an input layer of the first text processing model and an input layer of the second text processing model according to the projection matrix, the first input layer feature vector and the second input layer feature vector;
Calculating the cross entropy loss of the first prediction correction vector and the second prediction correction vector according to preset temperature parameters;
comparing the first attention vector of each layer of the M layers with the second attention vector of each layer of the N layers in pairs to obtain an attention loss matrix between the first text processing model and the second text processing model;
comparing the first hidden layer feature vector of each layer in the M layers with the second hidden layer feature vector of each layer in the N layers in pairs to obtain a hidden layer loss matrix between the first text processing model and the second text processing model;
calculating a first earth mover's distance (EMD) matrix according to the weight of each layer in the first text processing model, the weight of each layer in the second text processing model and the attention loss matrix;
calculating a second EMD matrix according to the weight of each layer in the first text processing model, the weight of each layer in the second text processing model and the hidden layer loss matrix;
calculating a fourth mean square error loss between first attention vectors of the M layers in the first text processing model and second attention vectors of the N layers in the second text processing model according to the first EMD matrix and the attention loss matrix;
calculating a fifth mean square error loss between first hidden layer feature vectors of the M layers in the first text processing model and second hidden layer feature vectors of the N layers in the second text processing model according to the second EMD matrix and the hidden layer loss matrix;
and updating the weight of each layer in the first text processing model and the weight of each layer in the second text processing model according to the first mean square error loss, the cross entropy loss, the fourth mean square error loss and the fifth mean square error loss until the first mean square error loss, the cross entropy loss, the fourth mean square error loss and the fifth mean square error loss converge.
4. A method according to any one of claims 1-3, wherein the method further comprises:
inputting the positive sample and the negative sample into a second text processing model to be trained, and generating a prediction correction sequence of the negative sample;
training the text processing model according to the predicted collation sequence of the negative sample and the first tag information.
5. The method of claim 4, wherein the training samples further comprise a positive sample pair and second tag information of the positive sample pair, the second tag information representing a conversion sequence that converts the positive sample into itself, the two positive samples in the positive sample pair being identical, the method further comprising:
Inputting the positive sample pair into a trained text processing model to generate a predicted collation sequence of the positive sample;
training the text processing model according to the predicted collation sequence of the positive sample and the second label information.
6. A method according to any one of claims 1-3, wherein inputting the positive and negative samples into a second text processing model to be trained, generating a predicted collation sequence for the negative samples, comprises:
under the condition that the number of characters in the positive sample is larger than the preset number, inputting the preset number of characters in the positive sample and the characters corresponding to the preset number of characters in the negative sample into the second text processing model to be trained according to the sequence from front to back to obtain a prediction correction sequence of the preset number of characters in the negative sample;
and taking the characters remaining in the positive sample and the characters remaining in the negative sample as training samples of the next model training process.
7. A method of speech text processing, the method comprising:
recognizing a voice text corresponding to the target voice;
inputting the phonetic text into a second text processing model according to any of claims 1-6, determining a collation sequence for the phonetic text, the collation sequence representing a collation rule for each character in the phonetic text;
And proofreading the voice text according to the proofreading sequence to obtain the proofreading text corresponding to the target voice.
8. A text processing model training apparatus, the apparatus comprising:
the acquisition module is used for crawling dialogue texts from the Internet to obtain positive samples; the sentences in the dialogue text are sentences with correct grammar, and the positive samples are sentences in the dialogue text;
the conversion module is used for carrying out conversion operation on sentences in the dialogue text to obtain negative samples and first label information of the negative samples, wherein the sentences in the negative samples are sentences with grammar errors, the first label information represents a conversion sequence for converting the positive samples into the negative samples, and the conversion sequence represents conversion operation corresponding to each character in one sentence;
the generation module is used for correspondingly inputting the positive sample and the negative sample into a first text processing model which is trained in advance and a second text processing model which is to be trained, and generating a first feature vector of a target layer of the first text processing model and a second feature vector of a target layer of the second text processing model; the dimension of the second text processing model is smaller than that of the first text processing model, and the first text processing model is obtained through training according to the positive sample, the negative sample and the transformation sequence of the negative sample;
The training module is used for carrying out knowledge distillation on the second text processing model according to the first characteristic vector and the second characteristic vector to obtain a trained second text processing model;
in the case that the number of layers of the first text processing model is the same as the number of layers of the second text processing model, the first feature vector includes a first input layer feature vector, a first hidden layer feature vector, a first attention vector, and a first predictive collation vector for collating the negative samples, and the second feature vector includes a second input layer feature vector, a second hidden layer feature vector, a second attention vector, and a second predictive collation vector for collating the negative samples;
the training module is used for:
determining a projection matrix according to the dimension of the first text processing model and the dimension of the second text processing model;
calculating a first mean square error loss between an input layer of the first text processing model and an input layer of the second text processing model according to the projection matrix, the first input layer feature vector and the second input layer feature vector;
calculating a second mean square error loss between the hidden layer of the first text processing model and the hidden layer of the second text processing model according to the projection matrix, the first hidden layer feature vector and the second hidden layer feature vector;
Calculating a third mean square error loss between the first and second attention vectors;
calculating the cross entropy loss of the first prediction correction vector and the second prediction correction vector according to preset temperature parameters;
and updating the second text processing model according to the first mean square error loss, the second mean square error loss, the third mean square error loss and the cross entropy loss.
CN202110704938.1A 2021-06-24 2021-06-24 Text processing model training method, voice text processing method and device Active CN113420121B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110704938.1A CN113420121B (en) 2021-06-24 2021-06-24 Text processing model training method, voice text processing method and device

Publications (2)

Publication Number Publication Date
CN113420121A CN113420121A (en) 2021-09-21
CN113420121B true CN113420121B (en) 2023-07-28

Family

ID=77717625

Country Status (1)

Country Link
CN (1) CN113420121B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035890B (en) * 2022-06-23 2023-12-05 北京百度网讯科技有限公司 Training method and device of voice recognition model, electronic equipment and storage medium

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110209817A (en) * 2019-05-31 2019-09-06 安徽省泰岳祥升软件有限公司 Training method and device of text processing model and text processing method
CN111126555A (en) * 2018-10-31 2020-05-08 浙江宇视科技有限公司 Neural network model training method, device, equipment and storage medium
CN111178039A (en) * 2019-12-18 2020-05-19 北京明略软件***有限公司 Model training method and device, and method and device for realizing text processing
CN111325223A (en) * 2018-12-13 2020-06-23 中国电信股份有限公司 Deep learning model training method and device and computer readable storage medium
CN111554268A (en) * 2020-07-13 2020-08-18 腾讯科技(深圳)有限公司 Language identification method based on language model, text classification method and device
CN111611377A (en) * 2020-04-22 2020-09-01 淮阴工学院 Knowledge distillation-based multi-layer neural network language model training method and device
CN111625635A (en) * 2020-05-27 2020-09-04 北京百度网讯科技有限公司 Question-answer processing method, language model training method, device, equipment and storage medium
CN111967224A (en) * 2020-08-18 2020-11-20 深圳市欢太科技有限公司 Method and device for processing dialog text, electronic equipment and storage medium
CN112084789A (en) * 2020-09-14 2020-12-15 腾讯科技(深圳)有限公司 Text processing method, device, equipment and storage medium
CN112487182A (en) * 2019-09-12 2021-03-12 华为技术有限公司 Training method of text processing model, and text processing method and device
WO2021051503A1 (en) * 2019-09-19 2021-03-25 平安科技(深圳)有限公司 Semantic representation model-based text classification method and apparatus, and computer device
CN112883180A (en) * 2021-02-24 2021-06-01 挂号网(杭州)科技有限公司 Model training method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on knowledge distillation in convolutional neural network models; 冯于树 (Feng Yushu); China Masters' Theses Full-text Database, Social Sciences II, No. 2; full text *

Also Published As

Publication number Publication date
CN113420121A (en) 2021-09-21

Similar Documents

Publication Publication Date Title
CN109992782B (en) Legal document named entity identification method and device and computer equipment
CN111177393B (en) Knowledge graph construction method and device, electronic equipment and storage medium
CN111985239B (en) Entity identification method, entity identification device, electronic equipment and storage medium
CN110781663B (en) Training method and device of text analysis model, text analysis method and device
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN110569505B (en) Text input method and device
CN113158656B (en) Ironic content recognition method, ironic content recognition device, electronic device, and storage medium
CN112883968B (en) Image character recognition method, device, medium and electronic equipment
CN116127953B (en) Chinese spelling error correction method, device and medium based on contrast learning
CN112507695A (en) Text error correction model establishing method, device, medium and electronic equipment
CN112349294B (en) Voice processing method and device, computer readable medium and electronic equipment
CN111428470A (en) Text continuity judgment method, text continuity judgment model training method, electronic device and readable medium
CN115759254A (en) Question-answering method, system and medium based on knowledge-enhanced generative language model
CN113420121B (en) Text processing model training method, voice text processing method and device
CN111079433A (en) Event extraction method and device and electronic equipment
CN113743101A (en) Text error correction method and device, electronic equipment and computer storage medium
CN114707518B (en) Semantic fragment-oriented target emotion analysis method, device, equipment and medium
CN114372441B (en) Automatic error correction method and device for Chinese text
CN115240712A (en) Multi-mode-based emotion classification method, device, equipment and storage medium
CN115565186A (en) Method and device for training character recognition model, electronic equipment and storage medium
CN113177406B (en) Text processing method, text processing device, electronic equipment and computer readable medium
CN114625860A (en) Contract clause identification method, device, equipment and medium
CN114417891A (en) Reply sentence determination method and device based on rough semantics and electronic equipment
CN111428005A (en) Standard question and answer pair determining method and device and electronic equipment
CN115879446B (en) Text processing method, deep learning model training method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant